Step away from stepwise
- Core Argument: This report critiques stepwise regression, arguing it is a flawed statistical method, especially in the context of Big Data, because it uses statistical significance to select variables.
- Key Flaw: Stepwise procedures may include nuisance variables that are coincidentally significant and exclude true explanatory variables that happen to be non-significant, leading to models that fit data well in-sample but perform poorly out-of-sample.
- Big Data Effect: The belief that stepwise is more useful with more variables is false; Big Data exacerbates its failings by increasing the chance of selecting spurious correlations, causing a dramatic deterioration in out-of-sample predictive accuracy.
PubMed: Not Indexed (Journal of Big Data) DOI: 10.1186/s40537-018-0143-6 Overview generated by: Gemini 2.5 Flash, 26/11/2025
Key Findings: The Dangers of Stepwise Regression in the Era of Big Data
This short report by Gary Smith critiques the continued use of stepwise regression, a popular but flawed variable selection method, especially in the context of Big Data. The central argument is that stepwise procedures are fundamentally unable to distinguish between genuine explanatory variables and nuisance variables (spurious correlations), a problem that is dramatically exacerbated when the pool of potential predictors is large.
The Fundamental Flaws of Stepwise Regression
Stepwise regression (which includes forward selection, backward elimination, and bidirectional methods) uses an arbitrary threshold of statistical significance (p-value) to automate the inclusion or exclusion of variables in a multiple-regression model. The author identifies three major, interconnected problems with this approach:
- Selection of Nuisance Variables: By relying solely on statistical significance, stepwise procedures frequently select nuisance variables that happen to be coincidentally significant in the in-sample data. Since these variables have no true causal effect, they are useless for prediction with fresh data (out-of-sample).
- Exclusion of True Explanatory Variables: Conversely, genuine explanatory variables with causal effects may be incorrectly excluded because they happen not to be statistically significant in the particular sample analyzed.
- Severe Out-of-Sample Failure: The resulting model, while often providing an excellent fit to the estimation data (a high \(R^2\), inflated by the significant noise variables it includes), performs poorly out-of-sample. The selected irrelevant variables give false confidence in the estimated model because of their high t-values and the boost to the in-sample \(R^2\).
Big Data Exacerbates the Problem
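The forward-selection variant described above can be sketched in a few lines. This is a minimal illustration, not the paper's own code; for simplicity it uses a fixed critical value of |t| > 1.96 (roughly p < 0.05 for large samples) instead of an exact p-value:

```python
import numpy as np

def forward_select(X, y, t_crit=1.96):
    """Greedy forward selection: at each step, add the candidate column whose
    coefficient has the largest |t| statistic, stopping when no remaining
    candidate clears t_crit (|t| > 1.96 roughly matches p < 0.05)."""
    n, k = X.shape
    selected, remaining = [], list(range(k))
    while remaining:
        best_j, best_t = None, t_crit
        for j in remaining:
            A = np.column_stack([np.ones(n), X[:, selected + [j]]])
            dof = n - A.shape[1]
            if dof <= 0:                      # too few rows to test another term
                continue
            beta, *_ = np.linalg.lstsq(A, y, rcond=None)
            resid = y - A @ beta
            sigma2 = resid @ resid / dof      # residual variance estimate
            cov = sigma2 * np.linalg.inv(A.T @ A)
            t_stat = abs(beta[-1]) / np.sqrt(cov[-1, -1])  # t of the new term
            if t_stat > best_t:
                best_j, best_t = j, t_stat
        if best_j is None:                    # nothing left looks "significant"
            break
        selected.append(best_j)
        remaining.remove(best_j)
    return selected
```

Note that the stopping rule is purely a significance threshold: a noise column that is coincidentally correlated with `y` in this sample will be admitted just as readily as a true driver.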
The article specifically addresses the belief held by some “Big-Data researchers” that the larger the number of possible explanatory variables, the more useful stepwise regression becomes.
- Increased Chance of Spurious Correlation: In reality, stepwise regression becomes less effective as the number of potential explanatory variables grows. The sheer number of variables in Big Data increases the probability of finding highly significant but spurious correlations purely by chance.
- Worsening Out-of-Sample Fit: As the number of candidate variables increases, the in-sample fit improves, but the out-of-sample fit deteriorates, causing the ratio of the out-of-sample errors to the in-sample errors to “balloon”.
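The chance-significance effect is easy to reproduce with a small simulation (a sketch on synthetic data, not the paper's experiment): regress a pure-noise outcome on each of k pure-noise candidates and record the largest |t| statistic. As k grows, some candidate almost always clears the conventional cutoff:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50  # observations per simulated dataset

def max_abs_t(k):
    """Largest |t| statistic among k pure-noise candidate regressors,
    each tested one at a time against a pure-noise outcome y."""
    X = rng.normal(size=(n, k))
    y = rng.normal(size=n)
    ts = []
    for j in range(k):
        A = np.column_stack([np.ones(n), X[:, j]])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        se = np.sqrt((resid @ resid / (n - 2)) * np.linalg.inv(A.T @ A)[1, 1])
        ts.append(abs(beta[1]) / se)
    return max(ts)

for k in (10, 100, 1000):
    hits = sum(max_abs_t(k) > 2.01 for _ in range(20))  # 2.01 ~ 5% cutoff, df=48
    print(f"k={k:4d}: {hits}/20 runs found a 'significant' noise variable")
```

With independent 5%-level tests, the chance that at least one of k noise variables appears significant is about \(1 - 0.95^k\): roughly 40% at k = 10 and effectively 100% at k = 1000, which is the mechanism behind the "balloon" in out-of-sample error.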
Conclusion
The paper concludes that stepwise regression does not offer a solution to the challenge of too many explanatory variables in the Big Data era; rather, Big Data exacerbates the failings of stepwise regression. The focus should instead be on methods that prioritize predictive accuracy and robust out-of-sample validation.
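The out-of-sample validation the conclusion calls for can be demonstrated with a held-out split (again a synthetic sketch, not an experiment from the paper). Here a regression with more candidate predictors than training rows fits the estimation data almost perfectly yet degrades badly on fresh data:

```python
import numpy as np

rng = np.random.default_rng(1)
n_train, n_test, k = 50, 50, 60   # more candidate predictors than training rows
X = rng.normal(size=(n_train + n_test, k))
y = 2.0 * X[:, 0] + rng.normal(size=n_train + n_test)  # only column 0 matters

def r_squared(A, y_true, beta):
    resid = y_true - A @ beta
    tss = (y_true - y_true.mean()) @ (y_true - y_true.mean())
    return 1.0 - (resid @ resid) / tss

A_train = np.column_stack([np.ones(n_train), X[:n_train]])
A_test = np.column_stack([np.ones(n_test), X[n_train:]])
beta, *_ = np.linalg.lstsq(A_train, y[:n_train], rcond=None)

r2_in = r_squared(A_train, y[:n_train], beta)
r2_out = r_squared(A_test, y[n_train:], beta)
print(f"in-sample R^2:     {r2_in:.3f}")   # near-perfect: the model interpolates
print(f"out-of-sample R^2: {r2_out:.3f}")  # far worse on fresh data
```

Comparing the two \(R^2\) values on held-out data, rather than trusting the in-sample fit, is exactly the kind of check the report recommends.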