Linear Statistical Models: Regression

Regression Diagnostics


Descriptive Statistics and Exploratory Data Analysis

Any data analysis should begin with descriptive statistics and exploratory data analysis. These analyses include distribution plots (histograms, stem-and-leaf plots, or kernel density plots such as those produced by Stata's kdensity).

Plot Dependent Variable vs Predicted Values

One visual check of the goodness of fit of the model is to plot the observed values of the dependent variable against the predicted values. When prediction is perfect, every point falls on a straight diagonal line.
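As a rough illustration of this check, the sketch below fits a one-predictor regression by least squares and lists observed values next to predicted values. The data and the function names fit_line and predict are invented for illustration; they are not part of the course materials or of Stata.

```python
# Hypothetical data, invented for illustration (not the course's GPA data).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]

def fit_line(x, y):
    """Ordinary least squares for one predictor; returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def predict(x, intercept, slope):
    """Predicted value for each x."""
    return [intercept + slope * xi for xi in x]

b0, b1 = fit_line(x, y)
yhat = predict(x, b0, b1)

# Pairs to plot: observed on one axis, predicted on the other.
for obs, pred in zip(y, yhat):
    print(f"observed {obs:6.2f}   predicted {pred:6.2f}")
```

The closer the printed pairs are to each other, the closer the plotted points lie to the diagonal.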

Residual Analysis

Residuals are the difference between the observed score and the predicted score: e = Y - Y'.

Residuals come in three varieties:

  1. Raw Residuals: The difference between the raw observed score and the predicted score, e = Y - Y'. Often denoted e or resid.
  2. Standardized Residuals: The raw residuals divided by the standard error of estimate. Can be denoted rstan or zresid.
  3. Studentized Residuals: The raw residuals divided by the standard error of the residual computed with that case deleted. These are sometimes called studentized deleted residuals or studentized jackknifed residuals. Can be denoted rstu.

Outliers

Outliers are cases with large residuals.
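The three kinds of residuals can be sketched for a one-predictor model as follows. This is a minimal illustration, not Stata's implementation: the data are invented, the function name residual_diagnostics is hypothetical, and the cutoff of 2 for flagging outliers is only a common rule of thumb assumed here. Some packages also divide the standardized residual by sqrt(1 - h), where h is the leverage of the case; the code follows the simpler definition given in the list above.

```python
import math

def residual_diagnostics(x, y):
    """Raw, standardized, and studentized (deleted) residuals for y = b0 + b1*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    b0 = my - b1 * mx
    raw = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    sse = sum(e * e for e in raw)
    s = math.sqrt(sse / (n - 2))              # standard error of estimate
    out = []
    for xi, e in zip(x, raw):
        h = 1.0 / n + (xi - mx) ** 2 / sxx    # leverage of this case
        # Error variance with this case deleted (n - 3 df remain).
        # Assumes the remaining cases do not fit perfectly.
        s_del = math.sqrt((sse - e * e / (1.0 - h)) / (n - 3))
        out.append({
            "raw": e,
            "rstan": e / s,
            "rstu": e / (s_del * math.sqrt(1.0 - h)),
        })
    return out

# Hypothetical data: the fifth case is deliberately far off the line.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
y = [2.1, 3.9, 6.0, 8.2, 18.0, 11.9, 14.1, 16.0, 17.9, 20.1]

for case, r in enumerate(residual_diagnostics(x, y), start=1):
    flag = "  <-- possible outlier" if abs(r["rstu"]) > 2 else ""
    print(f"case {case:2d}  e={r['raw']:7.3f}  rstan={r['rstan']:7.3f}  rstu={r['rstu']:8.3f}{flag}")
```

Because the studentized residual is computed with the case deleted, an outlier cannot inflate its own standard error, which is why it stands out more clearly on rstu than on rstan.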

Plotting Residuals

In General: Residual Plots

The picture should look something like this: [figure not reproduced]

DV vs Predictors

Overall Plot of Residuals

Index Plot -- Plot of Residuals by Case

Time Sequence Plot

1. Watch out for situations in which variance increases with time; try Weighted Least Squares (W.L.S.).

2. This pattern could indicate that a linear term in time is missing from the model.

3. This pattern could indicate that both a linear and a quadratic term in time are missing from the model.

Plot Residuals versus Predicted (Fitted) Scores

1. Watch out for situations in which variance is not constant as assumed (may need W.L.S. or a transformation of Y).

2. This pattern could indicate that a variable is missing from the model (it can also be caused by wrongly omitting the intercept term).

3. An additional term is needed in the model, such as the square of a variable or an interaction (or, again, a transformation of Y).
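To make the W.L.S. remedy concrete, here is a minimal sketch of weighted least squares for one predictor. The function name wls_line and the data are hypothetical, and the weights 1/x^2 assume that the error standard deviation grows in proportion to x, which has to be justified for any real data set.

```python
def wls_line(x, y, w):
    """Weighted least squares for y = b0 + b1*x; returns (intercept, slope).

    Each case contributes w[i] times its squared residual, so cases with
    small weights (large assumed variance) pull less on the fitted line.
    """
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw    # weighted mean of x
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw    # weighted mean of y
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

# Hypothetical data whose scatter widens as x grows.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.0, 4.1, 5.8, 8.5, 9.2, 13.0, 12.5, 17.5]

# Assumed variance model: error SD proportional to x, so weight = 1 / x^2.
w = [1.0 / (xi * xi) for xi in x]

b0, b1 = wls_line(x, y, w)
print(f"W.L.S. fit: intercept = {b0:.3f}, slope = {b1:.3f}")
```

With all weights equal, the function reduces to ordinary least squares.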

Residual Plot versus Predictors

1. May need W.L.S. or a transformation of Y.

2. Perhaps errors in calculation?

3. Need an additional term in X (X^2) or a transformation of Y.
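One informal way to see the third situation numerically is to correlate the residuals from a straight-line fit with the centered square of X. This is only an illustrative check assumed here, not a procedure from the course or from Stata, and the function names pearson and curvature_check are hypothetical. A correlation far from zero suggests that an X^2 term would pick up structure left in the residuals.

```python
import math

def pearson(a, b):
    """Pearson correlation between two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    va = sum((ai - ma) ** 2 for ai in a)
    vb = sum((bi - mb) ** 2 for bi in b)
    return cov / math.sqrt(va * vb)

def curvature_check(x, y):
    """Correlation of straight-line residuals with the centered square of x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    resid = [yi - my - b1 * (xi - mx) for xi, yi in zip(x, y)]
    centered_sq = [(xi - mx) ** 2 for xi in x]
    return pearson(resid, centered_sq)

# Hypothetical data with visible curvature.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
y = [1.2, 3.8, 9.1, 16.3, 24.8, 36.1, 49.2]

print(f"correlation of residuals with centered X^2: {curvature_check(x, y):.3f}")
```

Centering X before squaring keeps the squared term from being dominated by its linear component, to which the residuals are already orthogonal.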

Leverage

Influence

Some Measures of Leverage and Influence Values to Watch Out For

Other Diagnostic Tests

  • Test for model specification error, sometimes called linktest. Performed by regressing the dependent variable on the predicted values (_hat) and the squared predicted values (_hatsq). The model being tested is fit first:
    regress gpa grev greq
    
    
    
      Source |       SS       df       MS                  Number of obs =      30
    ---------+------------------------------               F(  2,    27) =   12.73
       Model |  5.06332584     2  2.53166292               Prob > F      =  0.0001
    Residual |  5.37134081    27  .198938548               R-squared     =  0.4852
    ---------+------------------------------               Adj R-squared =  0.4471
       Total |  10.4346666    29  .359816091               Root MSE      =  .44603
    
    ------------------------------------------------------------------------------
         gpa |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
        grev |   .0027326   .0011288      2.421   0.022       .0004166    .0050487
        greq |   .0053559   .0019278      2.778   0.010       .0014003    .0093114
       _cons |  -1.286698   .9765207     -1.318   0.199      -3.290353    .7169566
    ------------------------------------------------------------------------------
    
    
  • Link test looks for one type of specification error.
    linktest
    
      Source |       SS       df       MS                  Number of obs =      30
    ---------+------------------------------               F(  2,    27) =   13.50
       Model |  5.21793375     2  2.60896687               Prob > F      =  0.0001
    Residual |   5.2167329    27  .193212329               R-squared     =  0.5001
    ---------+------------------------------               Adj R-squared =  0.4630
       Total |  10.4346666    29  .359816091               Root MSE      =  .43956
    
    ------------------------------------------------------------------------------
         gpa |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
        _hat |  -2.744288   4.190284     -0.655   0.518      -11.34204    5.853464
      _hatsq |   .5622472   .6285344      0.895   0.379      -.7273988    1.851893
       _cons |    6.13873    6.89339      0.891   0.381      -8.005337     20.2828
    ------------------------------------------------------------------------------
    
  • Omitted variable test. Looks for omitted variables by including powers of the predicted values, Y'^2, Y'^3, and Y'^4, in the original model.
    ovtest
    
    Ramsey RESET test using powers of the fitted values of gpa
           Ho:  model has no omitted variables
                      F(3, 24) =      1.20
                      Prob > F =      0.3307
    
  • Heterogeneity of variance test. Looks for heterogeneity by modeling the variance as a function of the predicted values.
    hettest
    
    Cook-Weisberg test for heteroscedasticity using fitted values of gpa
         Ho: Constant variance
             chi2(1)      =      0.03
             Prob > chi2  =      0.8686
    

  • Another heterogeneity of variance test. Preferred by some over the Cook-Weisberg test above.
    whitetst   /* Downloaded from Stata (STB 55, sg137) via the Internet */
    
    White's general test statistic :  10.10781  Chi-sq( 5)  P-value =  .0722
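The logic behind two of these tests can be sketched in a few lines. The code below is an illustration under stated assumptions, not a reimplementation of Stata's commands: the function names (ols, fitted, link_test_f, score_test_hetero) and the data are hypothetical. The link test is shown as the F statistic for adding _hatsq to a regression on _hat, which for a single added term equals the square of the t statistic; its p-value needs an F distribution and is omitted. The heteroscedasticity test is the score test based on fitted values, with 1 degree of freedom.

```python
import math

def ols(columns, y):
    """Least squares with an intercept; returns [b0, b1, ...].

    Solves the normal equations by Gauss-Jordan elimination, which is
    adequate for a small demonstration but not numerically robust.
    """
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Augmented normal equations [X'X | X'y].
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)]
         + [sum(X[r][i] * y[r] for r in range(n))] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))   # partial pivoting
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

def fitted(columns, y):
    """Predicted values from regressing y on the given columns."""
    b = ols(columns, y)
    return [b[0] + sum(bj * col[i] for bj, col in zip(b[1:], columns))
            for i in range(len(y))]

def sse(y, yhat):
    return sum((a - b) ** 2 for a, b in zip(y, yhat))

def link_test_f(columns, y):
    """F statistic (1 and n-3 df) for adding _hatsq to a regression of y on _hat."""
    hat = fitted(columns, y)
    hatsq = [h * h for h in hat]
    sse_reduced = sse(y, hat)      # y on _hat alone reproduces the original fit
    sse_full = sse(y, fitted([hat, hatsq], y))
    return (sse_reduced - sse_full) / (sse_full / (len(y) - 3))

def score_test_hetero(columns, y):
    """Score test for nonconstant variance using fitted values; returns (chi2, p)."""
    n = len(y)
    hat = fitted(columns, y)
    e2 = [(a - b) ** 2 for a, b in zip(y, hat)]
    sigma2 = sum(e2) / n                        # maximum-likelihood variance estimate
    u = [v / sigma2 for v in e2]                # scaled squared residuals, mean 1
    uhat = fitted([hat], u)
    chi2 = sum((v - 1.0) ** 2 for v in uhat) / 2.0   # half the explained sum of squares
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))    # chi-square(1) upper tail

# Hypothetical data: a curved relationship fit with a straight line.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.1, 3.9, 9.1, 15.9, 25.1, 35.9, 49.1, 63.9]

print("link test F for _hatsq:", link_test_f([x], y))
print("score test (chi2, p):  ", score_test_hetero([x], y))
```

A large F for _hatsq signals a specification problem, which is expected here because the invented data are quadratic in x. In the gpa example above, _hatsq is not significant (P>|t| = 0.379).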
    


    Phil Ender, 15Jun98