Cross Validation

Linear Statistical Models: Regression

Shrinkage

The tendency for regression to bias R² upward by capitalizing on chance associations within a sample.

When the regression weights are applied to another sample, the R² is smaller.

The degree of overestimation depends on the ratio of independent variables to sample size: the more independent variables per subject, the greater the shrinkage.
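
One formula-based way to see this dependence is the adjusted R² that Stata prints as Adj R-squared. The formula is standard, but it is not spelled out in these notes, so the notation here is an assumption: n is the sample size and k is the number of independent variables.

```latex
R^2_{adj} = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1}
\qquad\text{e.g.}\qquad
1 - (1 - .8855)\,\frac{67 - 1}{67 - 6 - 1} = 1 - (.1145)(1.1) \approx .874
```

As k grows relative to n, the correction factor (n - 1)/(n - k - 1) grows and the adjusted value drops further below R². The worked numbers are taken from the Orange County regression in the Stata example in these notes (R² = .8855, n = 67, k = 6), and the result is in line with the reported Adj R-squared of 0.8741.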

Rules of thumb:

30 Ss (subjects) per IV (independent variable)
At least 400 Ss
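
As a rough check of the first rule against the two samples analyzed in the Stata example (67 and 226 schools, 6 IVs in each regression), the arithmetic can be done with display. This is only an illustration of the rule and is not part of the original analysis.

```stata
* Subjects per independent variable in the two samples used in these notes
display "Orange County:      " %4.1f 67/6  " Ss per IV"
display "Los Angeles County: " %4.1f 226/6 " Ss per IV"

* Sample size the 30-per-IV rule would call for with 6 IVs
display "Ss needed for 6 IVs: " 30*6
```

Orange County comes to about 11 Ss per IV and Los Angeles County to about 38, against the 180 Ss the rule would call for, so the Orange County sample falls well short of the rule.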

Estimating Shrinkage

Cross Validation

Involves two samples: (1) the screening sample and (2) the calibration sample.

Compute regression coefficients in screening sample.

Use weights from the screening sample to compute predicted scores in calibration sample.

Compute r²_YY', the squared correlation between observed and predicted scores, in calibration sample.

If shrinkage is small and coefficients change little, combine samples and recompute regression.

Double Cross Validation

Do cross validation using "Sample A" as screening sample and "Sample B" as calibration sample.

Repeat cross validation using "Sample B" as screening sample and "Sample A" as calibration sample.
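
The two passes can be sketched in Stata as follows. This is a minimal sketch, not a procedure given in these notes: it assumes the two county files from the Stata example are still available at the URLs shown there, treats Orange County as "Sample A" and Los Angeles County as "Sample B", and introduces the names sample, preA, preB and the r2_ scalars purely for illustration.

```stata
* Sketch only: double cross validation with the two county files from these notes.
* Sample A = Orange County (sample==1), Sample B = Los Angeles County (sample==2).
use http://www.philender.com/courses/data/ochi, clear
generate sample = 1
append using http://www.philender.com/courses/data/lahi
* appended rows have sample missing, so mark them as Sample B
replace sample = 2 if missing(sample)

* Pass 1: screen on A, calibrate on B
quietly regress api99 pctmeal pctel yrrnd core avged pctemer if sample==1
scalar r2_A = e(r2)
* predict fills in fitted values for all rows, including Sample B
predict double preA
quietly correlate api99 preA if sample==2
scalar r2_AB = r(rho)^2

* Pass 2: screen on B, calibrate on A
quietly regress api99 pctmeal pctel yrrnd core avged pctemer if sample==2
scalar r2_B = e(r2)
predict double preB
quietly correlate api99 preB if sample==1
scalar r2_BA = r(rho)^2

display "A screens, B calibrates: R2 = " %6.4f r2_A "  cross-validated r2 = " %6.4f r2_AB
display "B screens, A calibrates: R2 = " %6.4f r2_B "  cross-validated r2 = " %6.4f r2_BA
```

If the data match those used in these notes, the first pass should reproduce the .8855 and .7262 reported in the Stata example. The second pass is not shown anywhere in these notes, so no value is quoted for it here.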

Stata Cross Validation Example

We will begin by looking at the 1999 API data for 67 Orange County high schools and for 226 Los Angeles County high schools.

use http://www.philender.com/courses/data/ochi, clear

regress api99 pctmeal pctel yrrnd core avged pctemer

      Source |       SS       df       MS              Number of obs =      67
-------------+------------------------------           F(  6,    60) =   77.36
       Model |  846646.429     6  141107.738           Prob > F      =  0.0000
    Residual |  109446.288    60   1824.1048           R-squared     =  0.8855
-------------+------------------------------           Adj R-squared =  0.8741
       Total |  956092.716    66  14486.2533           Root MSE      =   42.71

------------------------------------------------------------------------------
       api99 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     pctmeal |   .6088963   .3348236     1.82   0.074    -.0608507    1.278643
       pctel |  -3.372816   .7189063    -4.69   0.000    -4.810842   -1.934789
       yrrnd |  -30.10476   24.75397    -1.22   0.229    -79.62008    19.41056
        core |   -4.97981   3.339713    -1.49   0.141    -11.66023    1.700611
       avged |   69.37089   15.76158     4.40   0.000     37.84305    100.8987
     pctemer |  -.1026734   .8709507    -0.12   0.907    -1.844834    1.639487
       _cons |   693.2872   105.3493     6.58   0.000     482.5573    904.0171
------------------------------------------------------------------------------

use http://www.philender.com/courses/data/lahi, clear

regress api99 pctmeal pctel yrrnd core avged pctemer

      Source |       SS       df       MS              Number of obs =     226
-------------+------------------------------           F(  6,   219) =  229.23
       Model |  3408806.48     6  568134.413           Prob > F      =  0.0000
    Residual |   542769.29   219  2478.39858           R-squared     =  0.8626
-------------+------------------------------           Adj R-squared =  0.8589
       Total |  3951575.77   225   17562.559           Root MSE      =  49.784

------------------------------------------------------------------------------
       api99 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     pctmeal |  -.4042542   .1405528    -2.88   0.004    -.6812634    -.127245
       pctel |  -.6719812   .3342255    -2.01   0.046    -1.330691    -.013271
       yrrnd |   -31.0319   11.23642    -2.76   0.006    -53.17725   -8.886547
        core |  -.9548313   1.404674    -0.68   0.497    -3.723241    1.813579
       avged |   132.6249   9.269121    14.31   0.000     114.3568     150.893
     pctemer |  -1.967988   .3124097    -6.30   0.000    -2.583702   -1.352274
       _cons |   329.3452   58.39626     5.64   0.000     214.2546    444.4358
------------------------------------------------------------------------------
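
Since the decision to combine samples depends partly on whether the coefficients change little, it can help to view the two sets of estimates side by side. The following is a sketch of one way to do that with estimates store and estimates table. The stored names oc and la are hypothetical labels chosen here, and the display formats are arbitrary.

```stata
* Sketch only: put the Orange County and Los Angeles County estimates in one table
use http://www.philender.com/courses/data/ochi, clear
quietly regress api99 pctmeal pctel yrrnd core avged pctemer
estimates store oc

use http://www.philender.com/courses/data/lahi, clear
quietly regress api99 pctmeal pctel yrrnd core avged pctemer
estimates store la

* coefficients with standard errors, plus N, R-squared and adjusted R-squared
estimates table oc la, b(%9.3f) se(%9.3f) stats(N r2 r2_a)
```

In the output above, several coefficients differ noticeably between the counties: pctmeal changes sign (.609 versus -.404) and avged nearly doubles (69.4 versus 132.6).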

We will demonstrate cross validation by starting with the Orange County data.

use http://www.philender.com/courses/data/ochi, clear

regress api99 pctmeal pctel yrrnd core avged pctemer

  Source |       SS       df       MS                  Number of obs =      67
---------+------------------------------               F(  6,    60) =   77.36
   Model |  846646.429     6  141107.738               Prob > F      =  0.0000
Residual |  109446.288    60   1824.1048               R-squared     =  0.8855
---------+------------------------------               Adj R-squared =  0.8741
   Total |  956092.716    66  14486.2533               Root MSE      =   42.71

------------------------------------------------------------------------------
   api99 |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
 pctmeal |   .6088963   .3348236      1.819   0.074      -.0608506    1.278643
   pctel |  -3.372816   .7189063     -4.692   0.000      -4.810842   -1.934789
   yrrnd |  -30.10476   24.75397     -1.216   0.229      -79.62008    19.41056
    core |   -4.97981   3.339713     -1.491   0.141      -11.66023     1.70061
   avged |   69.37089   15.76158      4.401   0.000       37.84305    100.8987
 pctemer |  -.1026734   .8709507     -0.118   0.907      -1.844834    1.639487
   _cons |   693.2872   105.3493      6.581   0.000       482.5573    904.0171
------------------------------------------------------------------------------

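Next we will load the Los Angeles County data. This dataset, lahi, has the same variables as the first dataset, ochi. Because the Orange County estimation results remain in memory after the new data are loaded, predict applies the Orange County regression weights to the Los Angeles County schools.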

use http://www.philender.com/courses/data/lahi, clear

predict p1
(option xb assumed; fitted values)
(10 missing values generated)

corr api99 p1
(obs=226)

         |    api99       p1
---------+------------------
   api99 |   1.0000
      p1 |   0.8522   1.0000


regress api99 p1

      Source |       SS       df       MS              Number of obs =     226
-------------+------------------------------           F(  1,   224) =  594.14
       Model |  2869665.82     1  2869665.82           Prob > F      =  0.0000
    Residual |  1081909.95   224  4829.95512           R-squared     =  0.7262
-------------+------------------------------           Adj R-squared =  0.7250
       Total |  3951575.77   225   17562.559           Root MSE      =  69.498

------------------------------------------------------------------------------
       api99 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          p1 |   1.236506   .0507285    24.37   0.000      1.13654    1.336472
       _cons |   -247.279   34.42901    -7.18   0.000    -315.1252   -179.4328
------------------------------------------------------------------------------

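Note that the cross-validated R² of .7262 (the square of the correlation of .8522 between api99 and p1) is much lower than the R² of .8855 from our original regression analysis, a shrinkage of about .16.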

Stata Cross Validation Method 2

use http://www.philender.com/courses/data/ochi, clear

count
   71

append using http://www.philender.com/courses/data/lahi
(label yn already defined)

count
  307

generate sample=1 in 1/71
(236 missing values generated)

replace sample=2 in 72/l
(236 real changes made)

regress api99 pctmeal pctel yrrnd core avged pctemer if sample==1

      Source |       SS       df       MS              Number of obs =      67
-------------+------------------------------           F(  6,    60) =   77.36
       Model |  846646.429     6  141107.738           Prob > F      =  0.0000
    Residual |  109446.288    60   1824.1048           R-squared     =  0.8855
-------------+------------------------------           Adj R-squared =  0.8741
       Total |  956092.716    66  14486.2533           Root MSE      =   42.71

------------------------------------------------------------------------------
       api99 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     pctmeal |   .6088963   .3348236     1.82   0.074    -.0608507    1.278643
       pctel |  -3.372816   .7189063    -4.69   0.000    -4.810842   -1.934789
       yrrnd |  -30.10476   24.75397    -1.22   0.229    -79.62008    19.41056
        core |   -4.97981   3.339713    -1.49   0.141    -11.66023    1.700611
       avged |   69.37089   15.76158     4.40   0.000     37.84305    100.8987
     pctemer |  -.1026734   .8709507    -0.12   0.907    -1.844834    1.639487
       _cons |   693.2872   105.3493     6.58   0.000     482.5573    904.0171
------------------------------------------------------------------------------

predict pre
(option xb assumed; fitted values)
(14 missing values generated)

regress api99 pre if sample==2

      Source |       SS       df       MS              Number of obs =     226
-------------+------------------------------           F(  1,   224) =  594.14
       Model |  2869665.82     1  2869665.82           Prob > F      =  0.0000
    Residual |  1081909.95   224  4829.95512           R-squared     =  0.7262
-------------+------------------------------           Adj R-squared =  0.7250
       Total |  3951575.77   225   17562.559           Root MSE      =  69.498

------------------------------------------------------------------------------
       api99 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         pre |   1.236506   .0507285    24.37   0.000      1.13654    1.336472
       _cons |   -247.279   34.42901    -7.18   0.000    -315.1252   -179.4328
------------------------------------------------------------------------------

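In this cross validation the R² has decreased from .8855 to .7262, the same result as with the first method. Although the regression was estimated only where sample==1, predict computes fitted values for every observation with nonmissing predictors, so pre holds the Orange County-based predictions for the Los Angeles County schools in sample 2.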

Using regvalidate on the combined sample

The program regvalidate (findit regvalidate) uses resampling methods within a single sample to assess validation. We will demonstrate its use on the combined Los Angeles and Orange County samples.

regress api99 pctmeal pctel yrrnd core avged pctemer

      Source |       SS       df       MS              Number of obs =     293
-------------+------------------------------           F(  6,   286) =  277.36
       Model |  4538777.22     6   756462.87           Prob > F      =  0.0000
    Residual |  780037.386   286  2727.40345           R-squared     =  0.8533
-------------+------------------------------           Adj R-squared =  0.8503
       Total |  5318814.61   292  18215.1185           Root MSE      =  52.225

------------------------------------------------------------------------------
       api99 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     pctmeal |  -.5078381   .1305415    -3.89   0.000    -.7647822   -.2508941
       pctel |  -.5148149   .3020057    -1.70   0.089    -1.109251    .0796209
       yrrnd |  -35.90818   10.89156    -3.30   0.001    -57.34596   -14.47041
        core |  -1.813285   1.368788    -1.32   0.186    -4.507461    .8808907
       avged |   123.0551   8.487517    14.50   0.000     106.3492     139.761
     pctemer |   -2.53015   .2807462    -9.01   0.000     -3.08274   -1.977559
       _cons |   399.1887   54.36759     7.34   0.000     292.1773    506.2001
------------------------------------------------------------------------------

regvalidate, reps(200)

original sample size = 293   reps = 200
regression model: regress api99 pctmeal pctel yrrnd core avged pctemer

                 orig          train         test          diff          orig adj
R-squared        0.8533        0.8585        0.8416        0.0169        0.8365
rss/n         2662.2436     2558.9878     2849.9416     -290.9538     2953.1974
fit slope        1.0000        1.0000        0.9876        0.0124        0.9876
fit _cons        0.0000       -0.0000        7.8432       -7.8432        7.8432
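
Reading across the rows, the columns appear to be related by simple arithmetic. This is inferred from the numbers in the output above rather than taken from the regvalidate documentation, so treat it as a working interpretation: diff is train minus test, and orig adj is orig minus diff.

```latex
\begin{aligned}
\text{R-squared:}\quad & 0.8585 - 0.8416 = 0.0169, & 0.8533 - 0.0169 &= 0.8364 \approx 0.8365 \\
\text{rss/n:}\quad & 2558.9878 - 2849.9416 = -290.9538, & 2662.2436 - (-290.9538) &= 2953.1974 \\
\text{fit slope:}\quad & 1.0000 - 0.9876 = 0.0124, & 1.0000 - 0.0124 &= 0.9876
\end{aligned}
```

The last-digit difference in the R-squared row is presumably due to rounding of the displayed values. On this reading, the resampling estimate of shrinkage for the combined sample is about .017, far smaller than the drop from .8855 to .7262 seen when the Orange County weights were applied to the Los Angeles County schools.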

Phil Ender, 16oct10, 29Jan98