Date: Thursday, December 4, 2025
I’m Anthony Clairmont, an evaluator with a background in social sciences who primarily evaluates human services programs (mental health, housing access, and jobs), as well as public spaces (libraries, museums, and city halls). In this post I want to challenge one of the most entrenched practices in quantitative evaluation: our obsession with statistical significance at the expense of practical significance.
Statistical significance tells us how surprising an observed effect would be if there were truly no effect. Practical significance asks the more crucial question: does this effect matter in the real world? A program might produce a statistically significant improvement in test scores (p < 0.05), but if that improvement is only half a point on a 100-point scale, who cares? Statistical significance often says more about our sample size than about whether our findings matter.
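To make that concrete, here is a minimal sketch in Python with entirely hypothetical numbers: with a large enough sample, even a half-point difference on a 100-point scale sails under p < 0.05 while the standardized effect size stays negligible.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 20_000  # hypothetical (large) sample per group
control = rng.normal(loc=70.0, scale=10.0, size=n)
treated = rng.normal(loc=70.5, scale=10.0, size=n)  # +0.5 points on a 100-point scale

t_stat, p_value = stats.ttest_ind(treated, control)
cohens_d = (treated.mean() - control.mean()) / np.sqrt(
    (treated.var(ddof=1) + control.var(ddof=1)) / 2
)
print(f"p = {p_value:.4g}, Cohen's d = {cohens_d:.3f}")
# With 20,000 per arm, p is typically far below 0.05
# even though d is about 0.05, a negligible effect in practice.
```

Run it a few times with different seeds: the p-value stays "significant" almost every time, while the effect size never budges from trivial.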
The problem is particularly acute in evaluation research. I am not brought onto projects to advance theory in academic journals; I am brought on to inform decisions about real programs affecting real people. Stakeholders need to know whether a program creates meaningful change, not whether it clears an arbitrary statistical threshold. Many researchers now believe that statistical significance testing should be abandoned altogether in favor of more informative approaches.
In evaluation, practical significance is intimately connected to standard-setting: the process of determining what level of performance constitutes "success." This is fundamentally an evaluative judgment, not just a statistical one. I have argued that standard-setting is the hardest problem in evaluation because it requires blending normative, descriptive, and predictive reasoning.
Several methods can help us move beyond p-values to more meaningful metrics:
The Bayesian Region of Practical Equivalence (ROPE) offers a principled approach. Instead of testing against a null hypothesis of exactly zero effect, ROPE defines a range of values considered practically equivalent to zero. You can then set standards around it: if your 95% credible interval falls entirely within the ROPE, you conclude the effect is negligible; if it falls entirely outside, the effect is practically meaningful; if it straddles the boundary, you withhold judgment.
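Here is a minimal sketch of that decision rule, assuming you already have posterior draws of the program effect from your model; the draws, the one-point ROPE, and the effect scale below are all hypothetical.

```python
import numpy as np

# Minimal ROPE check, assuming posterior draws of the program effect
# are already in hand (simulated here; in practice they come from your model).
rng = np.random.default_rng(0)
posterior_effect = rng.normal(loc=0.4, scale=0.3, size=10_000)  # hypothetical draws

rope = (-1.0, 1.0)  # effects within +/- 1 point deemed practically equivalent to zero

# 95% credible interval from the posterior draws
ci_low, ci_high = np.percentile(posterior_effect, [2.5, 97.5])

if rope[0] <= ci_low and ci_high <= rope[1]:
    decision = "negligible (interval entirely inside ROPE)"
elif ci_high < rope[0] or ci_low > rope[1]:
    decision = "practically meaningful (interval entirely outside ROPE)"
else:
    decision = "inconclusive (interval overlaps the ROPE boundary)"

print(f"95% credible interval: [{ci_low:.2f}, {ci_high:.2f}] -> {decision}")
```

Note that the ROPE bounds themselves are a standard-setting judgment: stakeholders, not the statistics, decide how small is "too small to matter."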
Clinical significance, developed in psychology and medicine, focuses on whether changes are large enough to be noticeable in everyday functioning. It uses metrics such as the Reliable Change Index to determine whether an individual's change exceeds what measurement error alone could produce and thus represents genuine improvement.
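As an illustration, here is a sketch of the Jacobson-Truax Reliable Change Index; the scores, baseline standard deviation, and reliability value are hypothetical.

```python
import math

def reliable_change_index(pre: float, post: float,
                          sd_baseline: float, reliability: float) -> float:
    """Jacobson-Truax Reliable Change Index.

    pre, post: an individual's scores before and after the program
    sd_baseline: standard deviation of the measure at baseline
    reliability: test-retest reliability of the instrument (e.g., 0.80)
    """
    se_measurement = sd_baseline * math.sqrt(1 - reliability)
    s_diff = math.sqrt(2 * se_measurement ** 2)
    return (post - pre) / s_diff

# Hypothetical client: an 8-point gain on a scale with SD = 10, reliability = 0.80
rci = reliable_change_index(pre=42, post=50, sd_baseline=10, reliability=0.80)
print(f"RCI = {rci:.2f}")  # |RCI| > 1.96 suggests change beyond measurement error
```

In this toy case the RCI comes out around 1.26, below the conventional 1.96 cutoff: even an 8-point gain may not exceed what measurement noise could produce on an imperfectly reliable instrument.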
Cost-benefit thresholds bring economic data into the evaluation. Rather than asking “is there an effect?” we ask “is the effect worth the investment?” This approach is particularly relevant for resource-constrained evaluation contexts where opportunity costs matter deeply.
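A toy example of the logic, with entirely hypothetical costs, outcomes, and a stakeholder-set threshold:

```python
# Hypothetical cost-effectiveness check: is the effect worth the investment?
program_cost = 250_000.0      # total program cost (hypothetical)
participants = 500
effect_per_participant = 0.5  # e.g., additional months of stable housing

cost_per_unit_outcome = program_cost / (participants * effect_per_participant)
threshold = 1_200.0  # maximum stakeholders will pay per unit of outcome (hypothetical)

verdict = ("worth the investment" if cost_per_unit_outcome <= threshold
           else "not cost-effective at this threshold")
print(f"${cost_per_unit_outcome:,.0f} per unit of outcome -> {verdict}")
```

As with the ROPE, the threshold is the evaluative heart of the exercise: setting it is a standard-setting conversation with stakeholders, not a statistical calculation.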
It has been nearly a decade since the American Statistical Association issued a statement saying that "Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold." Statistical significance, taken in isolation, should no longer be treated as a serious criterion for whether evaluation results are useful. We need to embrace alternatives that actually inform decision-making. The Bayesian ROPE, clinical significance measures, and cost-benefit thresholds all offer more helpful frameworks for stakeholders trying to understand whether a program truly makes a difference. The charge is clear: stop hiding behind p-values and start grappling with what constitutes meaningful change in your specific context.
The American Evaluation Association is hosting Quantitative Methods: Theory & Design TIG Week. The contributions all this week to AEA365 come from evaluators who use quantitative methods in evaluation. Do you have questions, concerns, kudos, or content to extend this AEA365 contribution? Please add them in the comments section for this post on the AEA365 webpage so that we may enrich our community of practice. Would you like to submit an AEA365 Tip? Please send a note of interest to AEA365@eval.org. AEA365 is sponsored by the American Evaluation Association and provides a Tip-a-Day by and for evaluators. The views and opinions expressed on the AEA365 blog are solely those of the original authors and other contributors. These views and opinions do not necessarily represent those of the American Evaluation Association, and/or any/all contributors to this site.