Linear Statistical Models: Regression

Cross Validation


Shrinkage

  • The tendency for regression to bias the R2 upward by capitalizing on chance associations within a sample.
  • When the regression weights are applied to another sample, the R2 is smaller.
  • Degree of overestimation is affected by ratio of independent variables to sample size.
  • Rules of thumb:

    Estimating Shrinkage
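
    One widely used estimate, offered here as an assumption about what this heading covers, is the adjusted R2 that Stata prints as Adj R-squared. With n observations and k independent variables, and using the Orange County figures from the regression output further down (R2 = 0.8855, n = 67, k = 6):

```latex
\[
R^2_{\mathrm{adj}} = 1 - \left(1 - R^2\right)\frac{n - 1}{n - k - 1}
\]
% Worked check against the Orange County output: n - 1 = 66, n - k - 1 = 60
\[
R^2_{\mathrm{adj}} = 1 - (1 - 0.8855)\,\frac{66}{60} = 1 - 0.1260 \approx 0.874
\]
```

    This agrees with the reported Adj R-squared of 0.8741 up to rounding, and it shows why the ratio of independent variables to sample size matters: as k approaches n, the correction factor grows.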

    Cross Validation

  • Involves two samples: Sample 1) the screening sample and Sample 2) the calibration sample.
  • Compute regression coefficients in screening sample.
  • Use weights from the screening sample to compute predicted scores in calibration sample.
  • Compute r2YY', the squared correlation between observed and predicted scores, in calibration sample.
  • If shrinkage is small and coefficients change little, combine samples and recompute regression.
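
    The steps above can be written compactly. The notation is assumed here for illustration, not taken from the course notes: a and b_j are the intercept and weights estimated in the screening sample, and Y and X_j are measured in the calibration sample.

```latex
\[
Y' = a + \sum_{j=1}^{k} b_j X_j ,
\qquad
r^2_{YY'} = \left[\mathrm{corr}(Y, Y')\right]^2
\]
% a and b_j are carried over unchanged; nothing is re-estimated in the calibration sample.
\[
\text{shrinkage} = R^2_{\text{screening}} - r^2_{YY'}
\]
```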

    Double Cross Validation

  • Do cross validation using "Sample A" as screening sample and "Sample B" as calibration sample.
  • Repeat cross validation using "Sample B" as screening sample and "Sample A" as calibration sample.

    Stata Cross Validation Example

    We will begin by looking at the 1999 API data for 67 Orange County high schools and for 226 Los Angeles County high schools.

    use http://www.philender.com/courses/data/ochi, clear
    
    regress api99 pctmeal pctel yrrnd core avged pctemer
    
          Source |       SS       df       MS              Number of obs =      67
    -------------+------------------------------           F(  6,    60) =   77.36
           Model |  846646.429     6  141107.738           Prob > F      =  0.0000
        Residual |  109446.288    60   1824.1048           R-squared     =  0.8855
    -------------+------------------------------           Adj R-squared =  0.8741
           Total |  956092.716    66  14486.2533           Root MSE      =   42.71
    
    ------------------------------------------------------------------------------
           api99 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         pctmeal |   .6088963   .3348236     1.82   0.074    -.0608507    1.278643
           pctel |  -3.372816   .7189063    -4.69   0.000    -4.810842   -1.934789
           yrrnd |  -30.10476   24.75397    -1.22   0.229    -79.62008    19.41056
            core |   -4.97981   3.339713    -1.49   0.141    -11.66023    1.700611
           avged |   69.37089   15.76158     4.40   0.000     37.84305    100.8987
         pctemer |  -.1026734   .8709507    -0.12   0.907    -1.844834    1.639487
           _cons |   693.2872   105.3493     6.58   0.000     482.5573    904.0171
    ------------------------------------------------------------------------------
    
    use http://www.philender.com/courses/data/lahi, clear
    
    regress api99 pctmeal pctel yrrnd core avged pctemer
    
          Source |       SS       df       MS              Number of obs =     226
    -------------+------------------------------           F(  6,   219) =  229.23
           Model |  3408806.48     6  568134.413           Prob > F      =  0.0000
        Residual |   542769.29   219  2478.39858           R-squared     =  0.8626
    -------------+------------------------------           Adj R-squared =  0.8589
           Total |  3951575.77   225   17562.559           Root MSE      =  49.784
    
    ------------------------------------------------------------------------------
           api99 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         pctmeal |  -.4042542   .1405528    -2.88   0.004    -.6812634    -.127245
           pctel |  -.6719812   .3342255    -2.01   0.046    -1.330691    -.013271
           yrrnd |   -31.0319   11.23642    -2.76   0.006    -53.17725   -8.886547
            core |  -.9548313   1.404674    -0.68   0.497    -3.723241    1.813579
           avged |   132.6249   9.269121    14.31   0.000     114.3568     150.893
         pctemer |  -1.967988   .3124097    -6.30   0.000    -2.583702   -1.352274
           _cons |   329.3452   58.39626     5.64   0.000     214.2546    444.4358
    ------------------------------------------------------------------------------
    We will demonstrate cross validation using the Orange County data as the screening sample.

    use http://www.philender.com/courses/data/ochi, clear
    
    regress api99 pctmeal pctel yrrnd core avged pctemer
    
      Source |       SS       df       MS                  Number of obs =      67
    ---------+------------------------------               F(  6,    60) =   77.36
       Model |  846646.429     6  141107.738               Prob > F      =  0.0000
    Residual |  109446.288    60   1824.1048               R-squared     =  0.8855
    ---------+------------------------------               Adj R-squared =  0.8741
       Total |  956092.716    66  14486.2533               Root MSE      =   42.71
    
    ------------------------------------------------------------------------------
       api99 |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
     pctmeal |   .6088963   .3348236      1.819   0.074      -.0608506    1.278643
       pctel |  -3.372816   .7189063     -4.692   0.000      -4.810842   -1.934789
       yrrnd |  -30.10476   24.75397     -1.216   0.229      -79.62008    19.41056
        core |   -4.97981   3.339713     -1.491   0.141      -11.66023     1.70061
       avged |   69.37089   15.76158      4.401   0.000       37.84305    100.8987
     pctemer |  -.1026734   .8709507     -0.118   0.907      -1.844834    1.639487
       _cons |   693.2872   105.3493      6.581   0.000       482.5573    904.0171
    ------------------------------------------------------------------------------
    

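    Next we will load the Los Angeles County data. This dataset, lahi, has the same variables as the first dataset, ochi. Loading new data does not discard the stored regression results, so predict applies the Orange County coefficients to the Los Angeles schools.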

    use http://www.philender.com/courses/data/lahi, clear
    
    predict p1
    (option xb assumed; fitted values)
    (10 missing values generated)
    
    corr api99 p1
    (obs=226)
    
             |    api99       p1
    ---------+------------------
       api99 |   1.0000
          p1 |   0.8522   1.0000
    
    
    regress api99 p1
    
          Source |       SS       df       MS              Number of obs =     226
    -------------+------------------------------           F(  1,   224) =  594.14
           Model |  2869665.82     1  2869665.82           Prob > F      =  0.0000
        Residual |  1081909.95   224  4829.95512           R-squared     =  0.7262
    -------------+------------------------------           Adj R-squared =  0.7250
           Total |  3951575.77   225   17562.559           Root MSE      =  69.498
    
    ------------------------------------------------------------------------------
           api99 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
              p1 |   1.236506   .0507285    24.37   0.000      1.13654    1.336472
           _cons |   -247.279   34.42901    -7.18   0.000    -315.1252   -179.4328
    ------------------------------------------------------------------------------

    Note that the R2 of .7262 is much lower than the R2 of .8855 from our original regression analysis.
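
    The R2 from regressing api99 on p1 is simply the square of the correlation reported by corr, so either command gives the cross-validated value. A worked check using the figures above:

```latex
\[
r^2_{YY'} = (0.8522)^2 \approx 0.7262
\]
% Shrinkage: screening-sample R^2 minus the cross-validated value
\[
R^2_{\text{screening}} - r^2_{YY'} = 0.8855 - 0.7262 = 0.1593
\]
```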

    Stata Cross Validation Method 2

    use http://www.philender.com/courses/data/ochi, clear
    
    count
       71
    
    append using http://www.philender.com/courses/data/lahi
    (label yn already defined)
    
    count
      307
    
    generate sample=1 in 1/71
    (236 missing values generated)
    
    replace sample=2 in 72/l
    (236 real changes made)
    
    regress api99 pctmeal pctel yrrnd core avged pctemer if sample==1
    
          Source |       SS       df       MS              Number of obs =      67
    -------------+------------------------------           F(  6,    60) =   77.36
           Model |  846646.429     6  141107.738           Prob > F      =  0.0000
        Residual |  109446.288    60   1824.1048           R-squared     =  0.8855
    -------------+------------------------------           Adj R-squared =  0.8741
           Total |  956092.716    66  14486.2533           Root MSE      =   42.71
    
    ------------------------------------------------------------------------------
           api99 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         pctmeal |   .6088963   .3348236     1.82   0.074    -.0608507    1.278643
           pctel |  -3.372816   .7189063    -4.69   0.000    -4.810842   -1.934789
           yrrnd |  -30.10476   24.75397    -1.22   0.229    -79.62008    19.41056
            core |   -4.97981   3.339713    -1.49   0.141    -11.66023    1.700611
           avged |   69.37089   15.76158     4.40   0.000     37.84305    100.8987
         pctemer |  -.1026734   .8709507    -0.12   0.907    -1.844834    1.639487
           _cons |   693.2872   105.3493     6.58   0.000     482.5573    904.0171
    ------------------------------------------------------------------------------
    
    predict pre
    (option xb assumed; fitted values)
    (14 missing values generated)
    
    regress api99 pre if sample==2
    
          Source |       SS       df       MS              Number of obs =     226
    -------------+------------------------------           F(  1,   224) =  594.14
           Model |  2869665.82     1  2869665.82           Prob > F      =  0.0000
        Residual |  1081909.95   224  4829.95512           R-squared     =  0.7262
    -------------+------------------------------           Adj R-squared =  0.7250
           Total |  3951575.77   225   17562.559           Root MSE      =  69.498
    
    ------------------------------------------------------------------------------
           api99 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             pre |   1.236506   .0507285    24.37   0.000      1.13654    1.336472
           _cons |   -247.279   34.42901    -7.18   0.000    -315.1252   -179.4328
    ------------------------------------------------------------------------------
    In this cross validation the R2 has decreased from 0.8855 to 0.7262, the same result as with the first method.
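
    Double cross validation, described earlier, repeats the analysis with the roles of the two samples reversed. The following is a minimal sketch rather than part of the original session. It assumes Stata 11 or later, so that append accepts the generate() option, and the names la, preA and preB are hypothetical, chosen only for this illustration. No output is shown because the commands were not run for these notes.

```stata
* Sketch only: double cross validation on the two county files.
* Assumes Stata 11 or later for the generate() option of append.
use http://www.philender.com/courses/data/ochi, clear
append using http://www.philender.com/courses/data/lahi, generate(la)
* la (hypothetical name) is 0 for Orange County (Sample A)
* and 1 for Los Angeles County (Sample B)

* Direction 1: Sample A is the screening sample, Sample B the calibration sample
regress api99 pctmeal pctel yrrnd core avged pctemer if la==0
predict preA
correlate api99 preA if la==1
display "r2 in Sample B using Sample A weights = " r(rho)^2

* Direction 2: Sample B is the screening sample, Sample A the calibration sample
regress api99 pctmeal pctel yrrnd core avged pctemer if la==1
predict preB
correlate api99 preB if la==0
display "r2 in Sample A using Sample B weights = " r(rho)^2
```

    Marking the samples with generate() avoids hard-coding observation numbers such as 1/71, so the sketch keeps working if the files change in length.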

    Using regvalidate on the combined sample

    The program regvalidate (findit regvalidate) uses resampling methods within a single sample to assess validation. We will demonstrate its use on the combined Los Angeles and Orange County samples.

    regress api99 pctmeal pctel yrrnd core avged pctemer
    
          Source |       SS       df       MS              Number of obs =     293
    -------------+------------------------------           F(  6,   286) =  277.36
           Model |  4538777.22     6   756462.87           Prob > F      =  0.0000
        Residual |  780037.386   286  2727.40345           R-squared     =  0.8533
    -------------+------------------------------           Adj R-squared =  0.8503
           Total |  5318814.61   292  18215.1185           Root MSE      =  52.225
    
    ------------------------------------------------------------------------------
           api99 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         pctmeal |  -.5078381   .1305415    -3.89   0.000    -.7647822   -.2508941
           pctel |  -.5148149   .3020057    -1.70   0.089    -1.109251    .0796209
           yrrnd |  -35.90818   10.89156    -3.30   0.001    -57.34596   -14.47041
            core |  -1.813285   1.368788    -1.32   0.186    -4.507461    .8808907
           avged |   123.0551   8.487517    14.50   0.000     106.3492     139.761
         pctemer |   -2.53015   .2807462    -9.01   0.000     -3.08274   -1.977559
           _cons |   399.1887   54.36759     7.34   0.000     292.1773    506.2001
    ------------------------------------------------------------------------------
    
    regvalidate, reps(200)
    
    original sample size = 293   reps = 200
    regression model: regress api99 pctmeal pctel yrrnd core avged pctemer
    
                     orig          train         test          diff          orig adj
    R-squared        0.8533        0.8585        0.8416        0.0169        0.8365
    rss/n         2662.2436     2558.9878     2849.9416     -290.9538     2953.1974
    fit slope        1.0000        1.0000        0.9876        0.0124        0.9876
    fit _cons        0.0000       -0.0000        7.8432       -7.8432        7.8432
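
    The output does not say how its columns are related, but the printed figures are consistent with diff being train minus test and orig adj being orig minus diff. This is a reading of the numbers above, not documented behavior of regvalidate:

```latex
\[
\text{diff} = \text{train} - \text{test}: \quad 0.8585 - 0.8416 = 0.0169
\]
\[
\text{orig adj} = \text{orig} - \text{diff}: \quad 0.8533 - 0.0169 = 0.8364 \quad (\text{printed as } 0.8365)
\]
% The same pattern holds in the rss/n row
\[
\text{rss/n}: \quad 2662.2436 - (-290.9538) = 2953.1974
\]
```

    The one-digit difference in the R-squared row presumably reflects rounding of the displayed values. On this reading, the adjusted value of about 0.8365 estimates how well the combined-sample model would do on new data.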


    Linear Statistical Models Course

    Phil Ender, 16oct10, 29Jan98