Sifting the evidence—what’s wrong with significance tests?
- Core Critique: This classic commentary criticizes the widespread misuse of, and over-reliance on, statistical significance tests (p-values) in the medical literature, arguing that the practice fundamentally distorts the evidence.
- Central Problem: The authors identify publication bias (favouring positive results over null results) as the key factor causing chance findings to be published and mistaken for real effects, thereby eroding confidence in science.
- Recommendation: Researchers should shift their focus from the binary question of whether P < 0.05 to assessing the direction and the magnitude of the effect, specifically asking if the effect is of public health or clinical importance.
PubMed: 11159626 | DOI: 10.1136/bmj.322.7280.226 | Overview generated by: Gemini 2.5 Flash, 26/11/2025
Key Findings: The Misuse and Limitations of Statistical Significance
This highly influential article critiques the over-reliance on statistical significance testing (p-values) in medical and epidemiological research, arguing that this practice fundamentally distorts the scientific literature and leads to public scepticism.
The Central Problem: Publication Bias
The primary issue is the medical literature’s strong tendency to accentuate the positive, leading to publication bias (also known as the “file drawer effect”):
- Skewed Reporting: Studies with positive outcomes (those achieving statistical significance, typically P<0.05) are far more likely to be published than those with null results (non-significant findings).
- Increased Chance Findings: This creates a system in which a host of purely chance findings are published and subsequently mistaken for real biological or clinical effects. The authors note that, by conventional reasoning, examining 20 true null associations will, on average, produce one result “significant at P=0.05” by chance alone (illustrated by the simulation sketch after this list).
- Erosion of Trust: The proliferation of inconsistent, purely chance findings contributes substantially to scepticism about medical research and epidemiological studies among the public and practitioners.
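By that arithmetic, 20 independent tests of truly null associations at the 5% level yield an expected 20 × 0.05 = 1 chance-significant result, and a 1 − 0.95^20 ≈ 0.64 probability of at least one. Below is a minimal simulation sketch of this point; it is not taken from the paper, and the two-sample t-test and sample sizes are illustrative choices.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

n_batches = 2_000  # simulated batches of studies
n_tests = 20       # associations examined per batch
alpha = 0.05

sig_counts = []
for _ in range(n_batches):
    # Every association is truly null: both groups are drawn from the same
    # distribution, so any "significant" P value is a pure chance finding.
    p_values = [
        ttest_ind(rng.normal(size=30), rng.normal(size=30)).pvalue
        for _ in range(n_tests)
    ]
    sig_counts.append(sum(p < alpha for p in p_values))

print(f"mean significant results per {n_tests} null tests: {np.mean(sig_counts):.2f}")
print(f"share of batches with at least one: {np.mean([c > 0 for c in sig_counts]):.2f}")
# Expected: roughly 1.0 per batch, and roughly 0.64 of batches with at least one.
```

If only the chance-significant results from each batch reach print, the published record consists largely of noise, which is the mechanism the authors describe.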
Shifting Focus: From P-value to Effect Magnitude
The authors argue that focusing solely on whether a P-value crosses the 0.05 threshold misses the point of much medical research, particularly in observational and interventional studies:
- Null Hypothesis Relevance: In many epidemiological studies and randomized controlled trials there is little reason to expect the true effect to be exactly null, so the key question is not whether the data are consistent with a strict null hypothesis.
- The Real Questions: Instead of asking if an effect exists (P-value), researchers should focus on:
  - Whether the direction of the effect has been reasonably and firmly established.
  - Whether the magnitude of the effect is such that it is of public health or clinical importance. A small, statistically significant effect may be clinically irrelevant, while a large, non-significant effect may warrant further study (see the numerical sketch after this list).
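To make that distinction concrete, here is a hedged numerical sketch, not from the paper: two hypothetical trials analysed with a normal-approximation risk difference, with all counts invented for illustration.

```python
import math

def risk_difference(events_a, n_a, events_b, n_b):
    """Risk difference with a 95% CI and two-sided P value (normal approximation)."""
    p_a, p_b = events_a / n_a, events_b / n_b
    rd = p_a - p_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    ci = (rd - 1.96 * se, rd + 1.96 * se)
    p_value = math.erfc(abs(rd / se) / math.sqrt(2))  # two-sided, standard normal
    return rd, ci, p_value

# Huge hypothetical trial, trivially small effect: "significant" (P ~ 0.002)
# despite a risk difference of only 0.3 percentage points.
print(risk_difference(5_200, 100_000, 4_900, 100_000))

# Small hypothetical trial, large effect: "non-significant" (P ~ 0.10) despite a
# 15 percentage point risk difference; the wide CI says "study further", not "no effect".
print(risk_difference(12, 40, 6, 40))
```

Judged only by the P<0.05 threshold, the first trial "wins" and the second "loses"; judged by direction and magnitude, the conclusions are close to the reverse.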
Historical Context
The critique is contextualized by referencing the founder of statistical significance testing, Ronald Fisher, and the early criticism of its use by his colleague Frank Yates, who observed in 1951 that scientific workers paid undue attention to the results of significance tests and too little to estimating the magnitude of the effects under study. The paper advocates the use of Confidence Intervals (CIs), which convey both the direction and the magnitude of the effect along with the precision of the estimate.
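As an illustration of what a CI adds, here is a minimal sketch assuming a hypothetical cohort study: a 95% CI for a risk ratio, computed on the log scale with the usual standard error (1/a − 1/n₁ + 1/c − 1/n₀ under the square root). The counts are invented for illustration.

```python
import math

def risk_ratio_ci(events_exposed, n_exposed, events_unexposed, n_unexposed, z=1.96):
    """Risk ratio with a 95% CI, computed on the log scale."""
    rr = (events_exposed / n_exposed) / (events_unexposed / n_unexposed)
    se_log = math.sqrt(
        1 / events_exposed - 1 / n_exposed
        + 1 / events_unexposed - 1 / n_unexposed
    )
    lo = math.exp(math.log(rr) - z * se_log)
    hi = math.exp(math.log(rr) + z * se_log)
    return rr, lo, hi

# Hypothetical cohort: 30/200 events among the exposed vs 15/200 among the unexposed
rr, lo, hi = risk_ratio_ci(30, 200, 15, 200)
print(f"RR = {rr:.2f}, 95% CI {lo:.2f} to {hi:.2f}")  # RR = 2.00, 95% CI 1.11 to 3.60
```

The interval communicates direction (risk is raised), magnitude (roughly a doubling), and precision (the data are compatible with anything from a slight to a nearly fourfold increase), none of which a bare P value conveys.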