HERA: A Hierarchical-Compensatory Ranking Framework for Paired Benchmarking with Data-Driven Effect-Size Thresholds

Lukas von Erdmannsdorff
Institute of Neuroradiology, Goethe University Frankfurt

Summary

In scientific disciplines ranging from clinical research to machine learning, researchers face the challenge of objectively comparing multiple algorithms, experimental conditions, or datasets across up to three performance or quality metrics. This process, often framed as Multi-Criteria Decision Making (MCDM), is critical for identifying state-of-the-art methods. However, traditional ranking approaches frequently suffer from limitations: they may rely on central tendencies that ignore data variability (Benavoli et al., 2016; Demšar, 2006), depend solely on p-values which can be misleading in large samples (Wasserstein & Lazar, 2016), or require subjective weighting of conflicting metrics (Taherdoost & Madanchian, 2023).

HERA (Hierarchical-Compensatory, Effect-Size-Driven Ranking Algorithm) is a MATLAB-based ranking framework designed to automate the Paired Benchmarking process, bridging the gap between elementary statistical tests and complex decision-making methods. Unlike weighted-sum approaches that collapse multi-dimensional performance into a single scalar, HERA implements a hierarchical-compensatory logic. This logic integrates non-parametric significance testing (Wilcoxon signed-rank test), robust effect size estimation (Cliff’s Delta, Relative Difference), and bootstrapping (e.g. Percentile and Cluster) to produce rankings that are both statistically robust and practically relevant. HERA is designed for researchers in biomedical imaging, machine learning, and applied statistics who need to compare method performance across multiple quality metrics in a statistically rigorous manner without requiring subjective parameter tuning.

Statement of Need

The scientific community increasingly recognizes the pitfalls of relying on simple summary statistics or p-values alone (Wasserstein & Lazar, 2016). In benchmarking studies, specifically, several issues persist:

Ignoring Variance: Ranking based on mean scores fails to account for the stability of performance across different subjects or folds. A method might achieve a high average score due to exceptional performance on a few easy cases while failing catastrophically on others, yet still outrank a more consistent competitor (Demšar, 2006).
Statistical vs. Practical Significance: A result can be statistically significant but practically irrelevant, especially in large datasets where even trivial differences yield p < 0.05. Standard tests do not inherently distinguish between these cases, potentially leading to the adoption of methods that offer no tangible benefit (Sullivan & Feinn, 2012).
Subjectivity in Aggregation: Many MCDM methods require users to assign subjective weights to metrics (e.g., “Accuracy is 0.7, Speed is 0.3”). These weights are often chosen post-hoc or lack empirical justification, introducing researcher bias that can be manipulated to favor a specific outcome (Taherdoost & Madanchian, 2023).
Distributional Assumptions: Parametric tests (e.g., t-test) assume normality, which is often violated in real-world benchmarks where performance metrics may be skewed, bounded, or ordinal (Romano et al., 2006).

HERA addresses these challenges by providing a standardized, data-driven framework. It ensures that a method is only ranked higher if it demonstrates a statistically significant and sufficiently large advantage, preventing “wins” based on negligible differences or noise. Existing MCDM software packages, such as the Python libraries pyDecision (Pereira et al., 2024) and pymcdm (Kizielewicz et al., 2023) or R’s RMCDA (Najafi & Mirzaei, 2025), often implement classical methods like TOPSIS (Hwang & Yoon, 1981), PROMETHEE (Brans & Vincke, 1985), and ELECTRE (Roy, 1968) that require user-defined weights or preference functions. HERA reduces subjective parameterization by using data-driven thresholds derived from bootstrap resampling. Furthermore, the framework integrates statistical hypothesis testing directly into the ranking process, a feature absent in standard MCDM toolboxes. By unifying strictly paired statistical inference with adaptive hierarchical-compensatory logic, the algorithm ties every rank decision to explicit statistical evidence rather than to user-chosen weights. This framework fills a gap across scientific software ecosystems, including MATLAB, by providing a dedicated, open-source environment for statistically rigorous, multi-criteria benchmarking that minimizes both subjective weighting and the need for ad-hoc scripting.

Methodological Framework

HERA operates on paired data matrices where rows represent subjects (or tests) and columns represent the candidates to be compared. The core innovation is its sequential logic, which allows for “compensation” between metrics based on strict statistical evidence.

Statistical Rigor and Effect Sizes

HERA quantifies differences using statistical significance and effect sizes to ensure practical relevance independent of sample size (Cohen, 1988; Sullivan & Feinn, 2012). A “win” requires that all three of the following criteria are satisfied jointly; otherwise, the comparison is considered “neutral”:

Significance: p < α_Holm (Holm-Bonferroni corrected). Pairwise comparisons use the Wilcoxon signed-rank test (Wilcoxon, 1945), with p-values corrected using the step-down Holm-Bonferroni method (Holm, 1979) to control the Family-Wise Error Rate (FWER).
Stochastic Dominance (Cliff’s Delta): Cliff’s Delta (d = P(X>Y) − P(Y>X)) quantifies distribution overlap, is robust to outliers, and relates to common-language effect sizes (Cliff, 1993; Vargha & Delaney, 2000). The effect size d must exceed a bootstrapped threshold θ_d.
Magnitude (Relative Difference): The Relative Mean Difference (RelDiff) quantifies effect magnitude on the original metric scale, defined for group means x̄ and ȳ as: $\text{RelDiff} = \frac{\vert\bar{x} - \bar{y}\vert}{\vert\frac{1}{2}(\bar{x} + \bar{y})\vert}$. This normalization is formally identical to the Symmetric Mean Absolute Percentage Error (SMAPE) used in forecasting (Makridakis, 1993). However, by being applied to group means rather than individual observations, it becomes a distinct between-group measure of practical magnitude, conceptually related to the Response Ratio in meta-analysis (Hedges et al., 1999). The metric enables scale-independent comparisons and facilitates the interpretation of percentage changes (Kampenes et al., 2007). RelDiff must exceed a threshold θ_RelDiff.
Complementary Criteria & SEM Lower Bound: HERA’s complementary logic requires statistical significance, stochastic dominance and practical magnitude, preventing “wins” based on trivial consistent differences or noisy outliers. This approach is conceptually aligned with the recommendation to evaluate both significance and effect sizes for a comprehensive assessment of experimental results (Lakens, 2013). Thresholds are determined via Percentile Bootstrapping (lower α/2-quantile) (Rousselet et al., 2021). To filter trivial effects (e.g. in low-variance datasets), the RelDiff threshold enforces a lower bound based on the Standard Error of the Mean (SEM), ensuring θ_RelDiff ≥ θ_SEM. This approach is inspired by the concept of the “Smallest Worthwhile Effect” (Hopkins, 2004), but adapted for HERA to quantify the uncertainty of the group mean rather than individual measurement error.
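The three criteria above can be illustrated with a minimal Python sketch. It is not HERA's MATLAB implementation: the function names (cliffs_delta, rel_diff, holm_reject, is_win) are hypothetical, the thresholds θ_d and θ_RelDiff are passed in as given numbers rather than bootstrapped, the p-values are assumed to come from a paired test such as the Wilcoxon signed-rank test, and higher metric values are assumed to be better.

```python
from statistics import mean

def cliffs_delta(x, y):
    """Cliff's delta d = P(X > Y) - P(Y > X), estimated over all cross pairs."""
    greater = sum(1 for a in x for b in y if a > b)
    less = sum(1 for a in x for b in y if a < b)
    return (greater - less) / (len(x) * len(y))

def rel_diff(x, y):
    """Relative mean difference |mean(x) - mean(y)| / |(mean(x) + mean(y)) / 2|."""
    mx, my = mean(x), mean(y)
    return abs(mx - my) / abs(0.5 * (mx + my))

def holm_reject(p_values, alpha=0.05):
    """Step-down Holm-Bonferroni: the k-th smallest p-value (k = 0, 1, ...)
    is compared with alpha / (m - k); testing stops at the first failure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for step, idx in enumerate(order):
        if p_values[idx] < alpha / (m - step):
            reject[idx] = True
        else:
            break
    return reject

def is_win(significant, d, rd, theta_d, theta_rel):
    """Assumed decision rule: x 'wins' over y only if all three criteria hold."""
    return significant and d > theta_d and rd > theta_rel

# Illustrative numbers only (not taken from the paper).
x, y = [3, 4, 5], [1, 2, 3]
d = cliffs_delta(x, y)                      # 8 of 9 pairs favour x, none favour y
rd = rel_diff(x, y)                         # |4 - 2| / 3
rejected = holm_reject([0.01, 0.04, 0.03])  # only the smallest p-value survives
print(d, rd, rejected, is_win(rejected[0], d, rd, theta_d=0.3, theta_rel=0.1))
```

Because the rule is conjunctive, a comparison with a large effect size but a non-significant corrected p-value stays “neutral”, as does a significant comparison whose effect sizes fall below their thresholds.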

Hierarchical-Compensatory Logic

The ranking process is structured as a multi-stage tournament. It does not use a global score but refines the rank order iteratively (see Fig. 1):

Stage 1 (Initial Sort): Methods are initially ranked based on the win count of the primary metric M₁. In case of a tie in adjacent ranks, Cliff’s Delta is used to break the tie. If Cliff’s Delta is zero, the raw mean values are used to break the tie.
Stage 2 (Compensatory Correction): This stage addresses the trade-off between metrics. A lower-ranked candidate can “swap” places with a higher-ranked one if it demonstrates a statistically significant and relevant superiority in the secondary metric M₂. In this hierarchy, M₂ acts as a strict veto mechanism: a significant disadvantage in this critical metric (e.g. critical safety concerns) cannot be offset by any magnitude of advantage in M₁. This effectively implements a hierarchical-compensatory ordering (Keeney & Raiffa, 1976), allowing candidates that are worse in the primary metric but superior in a secondary metric to correct the rank order.
Stage 3 (Tie-Breaking): This stage resolves “neutral” results using a tertiary metric M₃. It applies two distinct sub-logics to ensure a total ordering while maintaining hierarchical stability:
- Sublogic 3a (Local Correction): A one-time correction for adjacent pairs if the previous metric (M₂) is neutral. This handles cases where two methods are indistinguishable in the higher-priority criteria, allowing M₃ to locally correct the initial order without triggering cascading chain reactions that could destabilize the global hierarchy.
- Sublogic 3b (Indifference Resolution): To resolve clusters of remaining undecided methods, an iterative correction loop is applied to subsets where both M₁ and M₂ are “neutral,” utilizing metric M₃ until a final stable order is found.
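Stages 1 and 2 can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the HERA implementation: the function names, the data layout (win counts, pairwise Cliff's Delta, and means supplied as precomputed inputs), and the single bottom-up pass over adjacent pairs in Stage 2 are all hypothetical choices, since the text does not specify the traversal order. Stage 3 is omitted.

```python
from functools import cmp_to_key

def stage1_order(names, wins, delta, means):
    """Stage 1 sketch: sort by M1 win count; break ties with the pairwise
    Cliff's delta, then with the raw means (higher is assumed better)."""
    def compare(a, b):
        if wins[a] != wins[b]:
            return wins[b] - wins[a]          # more wins ranks first
        d = delta(a, b)                        # pairwise Cliff's delta of a vs. b
        if d != 0:
            return -1 if d > 0 else 1
        return (means[b] > means[a]) - (means[b] < means[a])
    return sorted(names, key=cmp_to_key(compare))

def stage2_correct(order, beats_m2):
    """Stage 2 sketch: one bottom-up pass over adjacent pairs. The lower-ranked
    candidate moves up if it 'wins' against its upper neighbour in M2."""
    order = list(order)
    for i in range(len(order) - 1, 0, -1):
        upper, lower = order[i - 1], order[i]
        if beats_m2(lower, upper):
            order[i - 1], order[i] = lower, upper
    return order

# Illustrative inputs only.
wins = {"A": 2, "B": 1, "C": 1}
means = {"A": 0.9, "B": 0.8, "C": 0.8}
pairwise_delta = {("B", "C"): 0.2, ("C", "B"): -0.2}
delta = lambda a, b: pairwise_delta.get((a, b), 0.0)

initial = stage1_order(["C", "B", "A"], wins, delta, means)
m2_wins = {("C", "B")}                         # C significantly beats B in M2
final = stage2_correct(initial, lambda a, b: (a, b) in m2_wins)
print(initial, final)
```

In this toy example B and C tie on win count, so Cliff's Delta places B ahead of C in Stage 1; the M₂ win of C over B then swaps the two in Stage 2, while A keeps the first rank.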

Validation and Uncertainty

HERA integrates advanced resampling methods to quantify uncertainty:

BCa Confidence Intervals: Bias-Corrected and Accelerated (BCa) intervals are calculated for all effect sizes in the pairwise comparisons for each metric (Efron, 1987).
Cluster Bootstrap: To assess the stability of the final ranking, HERA performs a cluster bootstrap, resampling subjects with replacement (Field & Welsh, 2007). This yields a 95% percentile confidence interval for the rank achieved by each method.
Power Analysis: A post-hoc cluster-bootstrap simulation estimates the relative frequency of detecting a “win”, “loss” or “neutral” result in all pairwise comparisons per metric given the data characteristics.
Sensitivity Analysis: The algorithm permutes the metric hierarchy and aggregates the resulting rankings using a Borda Count (Young, 1974) to evaluate the robustness of the decision against hierarchy changes.
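Two of these ideas, the rank confidence intervals from subject-level resampling and the Borda aggregation over permuted hierarchies, can be sketched in Python. The sketch rests on assumptions: rank_by_mean is a hypothetical stand-in for the full HERA ranking pipeline, the function names and percentile index convention are illustrative, and the example rankings are invented.

```python
import math
import random
from itertools import permutations

def rank_by_mean(sample):
    """Hypothetical stand-in ranking: rank candidates by column mean (1 = best)."""
    n_cand = len(sample[0])
    means = [sum(row[j] for row in sample) / len(sample) for j in range(n_cand)]
    order = sorted(range(n_cand), key=lambda j: -means[j])
    ranks = [0] * n_cand
    for pos, j in enumerate(order):
        ranks[j] = pos + 1
    return ranks

def bootstrap_rank_ci(data, rank_fn, n_boot=1000, level=0.95, seed=0):
    """Resample subjects (rows) with replacement, re-rank each resample,
    and return a percentile interval of the ranks per candidate."""
    rng = random.Random(seed)
    n, n_cand = len(data), len(data[0])
    ranks = [[] for _ in range(n_cand)]
    for _ in range(n_boot):
        sample = [data[rng.randrange(n)] for _ in range(n)]
        for j, r in enumerate(rank_fn(sample)):
            ranks[j].append(r)
    alpha = 1 - level
    lo = int(math.floor(alpha / 2 * n_boot))
    hi = min(n_boot - 1, int(math.ceil((1 - alpha / 2) * n_boot)) - 1)
    return [(sorted(r)[lo], sorted(r)[hi]) for r in ranks]

def borda_scores(rankings):
    """Borda count: in each best-first ranking of n candidates,
    position 0 earns n - 1 points and the last position earns 0."""
    scores = {}
    for ranking in rankings:
        n = len(ranking)
        for pos, name in enumerate(ranking):
            scores[name] = scores.get(name, 0) + (n - 1 - pos)
    return scores

# Three metrics give 3! = 6 possible hierarchies to re-rank under.
hierarchies = list(permutations(["M1", "M2", "M3"]))
# Invented rankings standing in for the results of three such hierarchies.
scores = borda_scores([["A", "B", "C"], ["A", "C", "B"], ["B", "A", "C"]])
print(len(hierarchies), scores)
```

Rows are resampled as whole units so that the pairing of candidates within a subject is preserved in every bootstrap replicate.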

Software Features

HERA offers a flexible configuration of up to three metrics (see Fig. 2). This allows users to adapt the ranking logic to different study designs and needs. It also combines this flexibility with a range of reporting options, data integration, and reproducibility features. By providing robust default parameters for all statistical and convergence settings, researchers can focus exclusively on core scientific decisions—such as the selection of metrics, the choice of ranking logic, and the assignment of metrics to the specific stages—while all technical tuning and validation steps are fully automated. This allows the ranking process to directly reflect the user’s scientific expertise without requiring ad-hoc parameterization.

Automated Reporting: Generates PDF reports, Win-Loss Matrices, Sankey Diagrams, and machine-readable JSON/CSV exports.
Reproducibility: Supports fixed-seed execution and configuration file-based workflows. The full analysis state, including random seeds and parameter settings, is saved in a JSON file, allowing other researchers to exactly replicate the ranking results.
Convergence Analysis: To avoid the common pitfall of using an arbitrary number of bootstrap iterations, HERA implements an adaptive algorithm. It automatically monitors the stability of the estimated confidence intervals and effect size thresholds, continuing the resampling process until the estimates converge within a specified tolerance, thus determining the optimal number of iterations B dynamically (Pattengale et al., 2010). If the characteristics of the data for bootstrapping are known, the number of bootstrap iterations can be set manually.
Data Integration: HERA supports seamless data import from standard formats (CSV, Excel), MATLAB tables, and NumPy arrays or Pandas DataFrames when using the Python interface, facilitating integration into existing research pipelines. Example datasets and workflows demonstrating practical applications are included in the repository.
Accessibility: HERA can be installed by cloning the GitHub repository, via the hera-matlab Python interface on PyPI, or deployed as a standalone application that requires no MATLAB license. The Python interface enables license-free integration into standard data science pipelines, requiring only a MATLAB Runtime. The MATLAB toolbox and standalone application feature an interactive CLI that guides users through the analysis without programming expertise, while an API and JSON configuration allow for automated batch processing.
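The idea behind the convergence analysis can be illustrated with a generic Python sketch. It is not HERA's algorithm: the function name adaptive_bootstrap, the batch size, the tolerance, the monitored quantity (a lower percentile of bootstrapped means), and the stopping rule (relative change between consecutive batches) are all assumptions chosen for illustration.

```python
import random
from statistics import mean

def adaptive_bootstrap(values, stat=mean, q=0.025, batch=200,
                       tol=0.01, max_b=10000, seed=0):
    """Add bootstrap replicates in batches and stop once the monitored
    percentile estimate changes by less than `tol` (relative) between
    two consecutive batches, or once `max_b` replicates are reached."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    n = len(values)
    replicates = []
    previous = None
    estimate = None
    while len(replicates) < max_b:
        for _ in range(batch):
            resample = [values[rng.randrange(n)] for _ in range(n)]
            replicates.append(stat(resample))
        ordered = sorted(replicates)
        estimate = ordered[int(q * (len(ordered) - 1))]
        if previous is not None:
            change = abs(estimate - previous) / max(abs(previous), 1e-12)
            if change < tol:
                break                  # estimate is stable; stop resampling
        previous = estimate
    return estimate, len(replicates)

# Illustrative data only.
values = [float(v) for v in range(1, 21)]
estimate, n_used = adaptive_bootstrap(values)
print(estimate, n_used)
```

At least two batches are always drawn, because a change can only be measured against a previous estimate; the returned replicate count plays the role of the dynamically determined B.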

Fig. 2: Flexible configuration options for the ranking logic.

For detailed information on the algorithmic background, we refer to the Ranking Logic and Methodology, the Bootstrap Logic and Convergence, and the Methodological Guidelines and Limitations pages. An Example Analysis using HERA on synthetic data is also available.

References

Benavoli, A., Corani, G., & Mangili, F. (2016). Should we really use post-hoc tests based on mean-ranks? Journal of Machine Learning Research, 17, 1–10. https://jmlr.org/papers/v17/benavoli16a.html

Brans, J. P., & Vincke, P. (1985). A preference ranking organization method (the PROMETHEE method for multiple criteria decision-making). Management Science, 31(6), 647–656. https://doi.org/10.1287/mnsc.31.6.647

Cliff, N. (1993). Dominance statistics: Ordinal analyses to answer ordinal questions. Psychological Bulletin, 114(3), 494–509. https://doi.org/10.1037/0033-2909.114.3.494

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Lawrence Erlbaum Associates. https://doi.org/10.4324/9780203771587

Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7, 1–30. https://jmlr.org/papers/v7/demsar06a.html

Efron, B. (1987). Better bootstrap confidence intervals. Journal of the American Statistical Association, 82(397), 171–185. https://doi.org/10.1080/01621459.1987.10478410

Field, C. A., & Welsh, A. H. (2007). Bootstrapping clustered data. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(3), 369–390. https://doi.org/10.1111/j.1467-9868.2007.00593.x

Hedges, L. V., Gurevitch, J., & Curtis, P. S. (1999). The meta-analysis of response ratios in experimental ecology. Ecology, 80(4), 1150–1156. https://doi.org/10.1890/0012-9658(1999)080[1150:TMAORR]2.0.CO;2

Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6(2), 65–70. https://www.jstor.org/stable/4615733

Hopkins, W. G. (2004). How to interpret changes in an athletic performance test. Sportscience, 8, 1–7. https://www.sportsci.org/jour/04/wghtests.htm

Hwang, C. L., & Yoon, K. (1981). Multiple attribute decision making: Methods and applications. Springer. https://doi.org/10.1007/978-3-642-48318-9

Kampenes, V. B., Dybå, T., Hannay, J. E., & Sjøberg, D. I. K. (2007). A systematic review of effect size in software engineering experiments. Information and Software Technology, 49(11–12), 1073–1086. https://doi.org/10.1016/j.infsof.2007.02.015

Keeney, R. L., & Raiffa, H. (1976). Decisions with multiple objectives: Preferences and value trade-offs. Wiley. https://doi.org/10.1017/CBO9781139174084

Kizielewicz, B., Shekhovtsov, A., & Salabun, W. (2023). Pymcdm—the universal library for solving multi-criteria decision-making problems. SoftwareX, 22, 101368. https://doi.org/10.1016/j.softx.2023.101368

Lakens, D. (2013). Calculating and reporting effect sizes to facilitate cumulative science: A practical primer for t-tests and ANOVAs. Frontiers in Psychology, 4, 863. https://doi.org/10.3389/fpsyg.2013.00863

Makridakis, S. (1993). Accuracy measures: Theoretical and practical concerns. International Journal of Forecasting, 9(4), 527–529. https://doi.org/10.1016/0169-2070(93)90079-3

Najafi, A., & Mirzaei, S. (2025). RMCDA: The comprehensive R library for applying multi-criteria decision analysis methods. Software Impacts, 24, 100762. https://doi.org/10.1016/j.simpa.2025.100762

Pattengale, N. D., Alipour, M., Bininda-Emonds, O. R. P., Moret, B. M. E., & Stamatakis, A. (2010). How many bootstrap replicates are necessary? Journal of Computational Biology, 17(3), 337–354. https://doi.org/10.1089/cmb.2009.0179

Pereira, V., Basilio, M. P., & Santos, C. H. T. (2024). Enhancing decision analysis with a large language model: pyDecision a comprehensive library of MCDA methods in Python. Journal of Modelling in Management. https://doi.org/10.1108/JM2-04-2024-0118

Romano, J., Kromrey, J. D., Coraggio, J., & Skowronek, J. (2006). Appropriate statistics for ordinal level data: Should we really be using t-test and Cohen’s d for evaluating group differences on the NSSE and other surveys? Proceedings of the Annual Meeting of the Florida Association of Institutional Research. https://www.researchgate.net/publication/237544991

Rousselet, G. A., Pernet, C. R., & Wilcox, R. R. (2021). The percentile bootstrap: A primer with step-by-step instructions in R. Advances in Methods and Practices in Psychological Science, 4(1), 1–10. https://doi.org/10.1177/2515245920911881

Roy, B. (1968). Classement et choix en présence de points de vue multiples (la méthode ELECTRE). Revue Française d’informatique Et de Recherche Opérationnelle, 2(V1), 57–75. https://doi.org/10.1051/ro/196802V100571

Sullivan, G. M., & Feinn, R. (2012). Using effect size—or why the p value is not enough. Journal of Graduate Medical Education, 4(3), 279–282. https://doi.org/10.4300/JGME-D-12-00156.1

Taherdoost, H., & Madanchian, M. (2023). Multi-criteria decision making (MCDM) methods and concepts. Encyclopedia, 3(1), 235–250. https://doi.org/10.3390/encyclopedia3010006

Vargha, A., & Delaney, H. D. (2000). A critique and improvement of the CL common language effect size statistics of McGraw and Wong. Journal of Educational and Behavioral Statistics, 25(2), 101–132. https://doi.org/10.3102/10769986025002101

Wasserstein, R. L., & Lazar, N. A. (2016). The ASA statement on p-values: Context, process, and purpose. The American Statistician, 70(2), 129–133. https://doi.org/10.1080/00031305.2016.1154108

Wilcoxon, F. (1945). Individual comparisons by ranking methods. Biometrics Bulletin, 1(6), 80–83. https://doi.org/10.2307/3001968

Young, H. P. (1974). An axiomatization of Borda’s rule. Journal of Economic Theory, 9(1), 43–52. https://doi.org/10.1016/0022-0531(74)90073-8