Linear Statistical Models: Regression
Problems with Stepwise Regression
This passage from Singer & Willett (2003) is one of the best statements concerning
the use of stepwise approaches:
Never let a computer select predictors mechanically. The computer does not know your
research questions nor the literature upon which they rest. It cannot distinguish predictors
of direct substantive interest from those whose effects you want to control.
These comments are from the Stata FAQ pages (www.stata.com)
Frank Harrell's comments:
Here are some of the problems with stepwise variable selection.
Note that "all possible subsets" regression does not solve any of these problems.
- It yields R-squared values that are badly biased high.
- The F and chi-squared tests quoted next to each variable on the printout do not have the claimed distribution.
- The method yields confidence intervals for effects and predicted values that are falsely narrow (see Altman and Andersen, Statistics in Medicine).
- It yields P-values that do not have the proper meaning and the proper correction for them is a very difficult problem.
- It gives biased regression coefficients that need shrinkage (the coefficients for remaining variables are too large; see Tibshirani, 1996).
- It has severe problems in the presence of collinearity.
- It is based on methods (e.g., F tests for nested models) that were intended to be used to test prespecified hypotheses.
- Increasing the sample size doesn't help very much (see Derksen and Keselman).
- It allows us to not think about the problem.
- It uses a lot of paper.
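Harrell's first two points are easy to demonstrate by simulation. The sketch below is my own illustration in Python/numpy, not part of the FAQ (the sample size, number of candidates, and entry threshold of 0.15 are arbitrary choices): forward selection applied to an outcome that is pure noise still retains "significant" predictors, so the R-squared and p-values of the final model look far better than the truth, which is zero association.

```python
# Illustration only: forward selection on pure noise still "finds"
# predictors, inflating R^2 and producing spuriously small p-values.
import numpy as np
from math import erf, sqrt

def fit_ols(X, y):
    """OLS with intercept; returns R^2 and p-values for the slopes
    (normal approximation to the t distribution, fine for these df)."""
    n = len(y)
    Xd = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    s2 = resid @ resid / (n - Xd.shape[1])
    se = np.sqrt(np.diag(s2 * np.linalg.inv(Xd.T @ Xd)))
    p = np.array([2 * (1 - 0.5 * (1 + erf(abs(t) / sqrt(2))))
                  for t in beta / se])
    r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
    return r2, p[1:]  # drop the intercept's p-value

def forward_select(X, y, enter=0.15):
    """Naive forward selection: repeatedly add the candidate with the
    smallest p-value until none clears the entry threshold."""
    selected = []
    while len(selected) < X.shape[1]:
        pj = [(fit_ols(X[:, selected + [j]], y)[1][-1], j)
              for j in range(X.shape[1]) if j not in selected]
        p, j = min(pj)
        if p >= enter:
            break
        selected.append(j)
    return selected

rng = np.random.default_rng(0)
n, k = 50, 20
X = rng.standard_normal((n, k))   # 20 candidate predictors...
y = rng.standard_normal(n)        # ...all unrelated to the outcome
kept = forward_select(X, y)
if kept:
    r2, p = fit_ols(X[:, kept], y)
    print(f"kept {len(kept)} noise variables, "
          f"R^2 = {r2:.2f}, min p = {p.min():.3f}")
else:
    print("nothing selected")
```

Because every predictor is noise, any nonzero R-squared and any "significant" p-value in the final model is pure capitalization on chance.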
Altman, D. G. and P. K. Andersen. 1989. Bootstrap investigation of the stability of a Cox regression model. Statistics in Medicine 8:
Shows that stepwise methods yield confidence limits that are far too narrow.
Derksen, S. and H. J. Keselman. 1992. Backward, forward and stepwise automated subset selection algorithms: Frequency of obtaining
authentic and noise variables. British Journal of Mathematical and Statistical Psychology 45: 265-282.
"The degree of correlation between the predictor variables affected the frequency with which authentic predictor variables
found their way into the final model."
"The number of candidate predictor variables affected the number of noise variables that gained entry to the model."
"The size of the sample was of little practical importance in determining the number of authentic variables contained in the final model."
"The population multiple coefficient of determination could be faithfully estimated by adopting a statistic that is adjusted
by the total number of candidate predictor variables rather than the number of variables in the final model."
Roecker, Ellen B. 1991. Prediction error and its estimation for subset-selected models. Technometrics 33: 459-468.
Shows that all-possible regression can yield models that are "too small".
Mantel, Nathan. 1970. Why stepdown procedures in variable selection. Technometrics 12: 621-625.
Hurvich, C. M. and C. L. Tsai. 1990. The impact of model selection on inference in linear regression.
American Statistician 44: 214-217.
Copas, J. B. 1983. Regression, prediction and shrinkage (with discussion). Journal of the Royal Statistical Society B 45: 311-354.
Shows why the number of CANDIDATE variables and not the number in the final model is the number of d.f. to consider.
Tibshirani, Robert. 1996. Regression shrinkage and selection via the lasso.
Journal of the Royal Statistical Society B 58: 267-288.
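Tibshirani's lasso, cited above as the kind of shrinkage that stepwise coefficients would need, can be sketched with a bare-bones coordinate-descent implementation. This is my own illustration in Python/numpy, not code from the paper, and the data and penalty value are invented for the demo:

```python
# Minimal lasso via coordinate descent, minimizing
#   (1/2n) * ||y - X b||^2 + lam * ||b||_1
# The L1 penalty shrinks coefficients toward zero and sets many of them
# exactly to zero -- the shrinkage that stepwise coefficients lack.
import numpy as np

def soft_threshold(z, g):
    return np.sign(z) * max(abs(z) - g, 0.0)

def lasso(X, y, lam, sweeps=200):
    n, p = X.shape
    b = np.zeros(p)
    for _ in range(sweeps):
        for j in range(p):
            # partial residual: leave variable j out of the current fit
            r = y - X @ b + X[:, j] * b[j]
            rho = X[:, j] @ r / n
            b[j] = soft_threshold(rho, lam) / (X[:, j] @ X[:, j] / n)
    return b

rng = np.random.default_rng(0)
n = 100
X = rng.standard_normal((n, 10))   # 2 authentic + 8 noise predictors
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 0.5 * rng.standard_normal(n)

b_lasso = lasso(X, y, lam=0.1)
b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
print("lasso :", np.round(b_lasso, 2))
print("ols   :", np.round(b_ols, 2))
print("noise coefficients set exactly to zero:",
      int(np.sum(b_lasso[2:] == 0.0)))
```

Unlike stepwise selection, the lasso does selection and estimation in one penalized fit: the authentic coefficients are deliberately shrunk a little, and most noise coefficients are zeroed out exactly rather than entered with inflated estimates.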
Ronan Conroy's comments:
I am struck by the fact that Judd and McClelland in their excellent book Data Analysis:
A Model Comparison Approach (Harcourt
Brace Jovanovich, ISBN 0-15-516765-0) devote less than 2 pages to stepwise methods.
What they do say, however, is worth repeating:
- Stepwise methods will not necessarily produce the best model if there are redundant predictors (a common problem).
- All-possible-subset methods produce the best model for each possible number of terms, but larger models need not necessarily be
subsets of smaller ones, causing serious conceptual problems about the underlying logic of the investigation.
- Models identified by stepwise methods have an inflated risk of capitalising on chance features of the data. They frequently fail
when applied to new datasets, and they are rarely tested in this way.
- Since the interpretation of coefficients in a model depends on the other terms included, "it seems unwise," to quote Judd and McClelland,
"to let an automatic algorithm determine the questions we do and do not ask about our data".
They end with a quote from Henderson and Velleman's paper "Building multiple regression models interactively" (1981, Biometrics 37),
which I quote directly as it is sane and succinct:
"It is our experience and strong belief that better models and a better understanding of one's data result from focussed data analysis,
guided by substantive theory." (p. 204)
"The data analyst knows more than the computer,"
and they add
"failure to use that knowledge produces inadequate data analysis".
Personally, I would no more let an automatic routine select my model
than I would let some best-fit procedure pack my suitcase.
Summary by Steve Blinkhorn:
So here is a brief abstract of the BJMSP paper,
plus odd extracts from elsewhere:
The use of automated subset search algorithms is reviewed and issues
concerning model selection and selection criteria are discussed. In
addition, a Monte Carlo study is reported which presents data
regarding the frequency with which authentic and noise variables are
selected by automated subset algorithms. In particular, the effects
of the correlation between predictor variables, the number of
candidate predictor variables, the size of the sample, and the level
of significance for entry and deletion of variables were studied for
three automated subset selection algorithms: BACKWARD ELIMINATION,
FORWARD SELECTION and STEPWISE. Results indicated that: (1) the
degree of correlation between the predictor variables affected the
frequency with which authentic predictor variables found their way
into the final model; (2) the number of candidate predictor variables
affected the number of noise variables that gained entry to the model;
(3) the size of the sample was of little practical importance in
determining the number of authentic variables contained in the final
model; and (4) the population multiple coefficient of determination
could be faithfully estimated by adopting a statistic that is adjusted
by the total number of candidate predictor variables rather than the
number of variables in the final model.
..... the degree of collinearity between predictor variables was the
most important factor influencing the selection of authentic variables ...
... the number of candidate predictor variables affected the number of
noise variables that gained entry to the model ...
...Even in the most favourable case investigated ..... 20 per cent of
the variables finding their way into the model were noise. In the
worst case .... 74 per cent of the selected variables were noise.
... the average number of authentic variables found in the final
subset models was always less than half the number of available
authentic predictor variables.
.... the 'data mining' approach to model building is likely to result
in final models containing a large percentage of noise variables which
will be interpreted incorrectly as authentic.
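The headline numbers above are easy to reproduce in spirit. The sketch below is my own Python/numpy illustration, not Derksen and Keselman's code; the design (4 authentic and 16 noise candidates, true coefficients of 0.3, entry level 0.15) is invented for the demo. It repeatedly runs forward selection and records what share of the variables in each final model are noise:

```python
# Small Monte Carlo in the spirit of Derksen & Keselman: how many of the
# variables entering a forward-selected model are pure noise?
import numpy as np
from math import erf, sqrt

def pvals(X, y):
    """Slope p-values from OLS with intercept (normal approx to t)."""
    n = len(y)
    Xd = np.column_stack([np.ones(n), X])
    b, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    r = y - Xd @ b
    s2 = r @ r / (n - Xd.shape[1])
    se = np.sqrt(np.diag(s2 * np.linalg.inv(Xd.T @ Xd)))
    return np.array([2 * (1 - 0.5 * (1 + erf(abs(t) / sqrt(2))))
                     for t in b / se])[1:]

def forward(X, y, enter=0.15):
    """Forward selection by smallest p-value with entry level `enter`."""
    sel = []
    while len(sel) < X.shape[1]:
        p, j = min((pvals(X[:, sel + [j]], y)[-1], j)
                   for j in range(X.shape[1]) if j not in sel)
        if p >= enter:
            break
        sel.append(j)
    return sel

rng = np.random.default_rng(1)
n, auth, noise = 100, 4, 16
noise_frac = []
for _ in range(50):
    X = rng.standard_normal((n, auth + noise))
    y = X[:, :auth] @ np.full(auth, 0.3) + rng.standard_normal(n)
    sel = forward(X, y)
    if sel:  # indices >= auth are noise candidates
        noise_frac.append(sum(j >= auth for j in sel) / len(sel))
print(f"mean share of noise variables in the final model: "
      f"{np.mean(noise_frac):.0%}")
```

Even in this mild setup, a nontrivial share of each "final model" is noise, and the exact share depends on the design, which is the paper's point: the analyst sees only the final printout and cannot tell the authentic variables from the impostors.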
Linear Statistical Models Course
Phil Ender, 14jan00