Shrinkage
Estimating Shrinkage
Cross Validation
Double Cross Validation
Stata Cross Validation Example
We will begin by looking at the 1999 API data for 67 Orange County high schools and for 226 Los Angeles County high schools.
use http://www.philender.com/courses/data/ochi, clear

regress api99 pctmeal pctel yrrnd core avged pctemer

      Source |       SS       df       MS              Number of obs =      67
-------------+------------------------------           F(  6,    60) =   77.36
       Model |  846646.429     6  141107.738           Prob > F      =  0.0000
    Residual |  109446.288    60   1824.1048           R-squared     =  0.8855
-------------+------------------------------           Adj R-squared =  0.8741
       Total |  956092.716    66  14486.2533           Root MSE      =   42.71

------------------------------------------------------------------------------
       api99 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     pctmeal |   .6088963   .3348236     1.82   0.074    -.0608507    1.278643
       pctel |  -3.372816   .7189063    -4.69   0.000    -4.810842   -1.934789
       yrrnd |  -30.10476   24.75397    -1.22   0.229    -79.62008    19.41056
        core |   -4.97981   3.339713    -1.49   0.141    -11.66023    1.700611
       avged |   69.37089   15.76158     4.40   0.000     37.84305    100.8987
     pctemer |  -.1026734   .8709507    -0.12   0.907    -1.844834    1.639487
       _cons |   693.2872   105.3493     6.58   0.000     482.5573    904.0171
------------------------------------------------------------------------------

use http://www.philender.com/courses/data/lahi, clear

regress api99 pctmeal pctel yrrnd core avged pctemer

      Source |       SS       df       MS              Number of obs =     226
-------------+------------------------------           F(  6,   219) =  229.23
       Model |  3408806.48     6  568134.413           Prob > F      =  0.0000
    Residual |   542769.29   219  2478.39858           R-squared     =  0.8626
-------------+------------------------------           Adj R-squared =  0.8589
       Total |  3951575.77   225   17562.559           Root MSE      =  49.784

------------------------------------------------------------------------------
       api99 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     pctmeal |  -.4042542   .1405528    -2.88   0.004    -.6812634    -.127245
       pctel |  -.6719812   .3342255    -2.01   0.046    -1.330691    -.013271
       yrrnd |   -31.0319   11.23642    -2.76   0.006    -53.17725   -8.886547
        core |  -.9548313   1.404674    -0.68   0.497    -3.723241    1.813579
       avged |   132.6249   9.269121    14.31   0.000     114.3568     150.893
     pctemer |  -1.967988   .3124097    -6.30   0.000    -2.583702   -1.352274
       _cons |   329.3452   58.39626     5.64   0.000     214.2546    444.4358
------------------------------------------------------------------------------

We will demonstrate cross validation by starting with the Orange County data.
use http://www.philender.com/courses/data/ochi, clear

regress api99 pctmeal pctel yrrnd core avged pctemer

  Source |       SS       df       MS                  Number of obs =      67
---------+------------------------------               F(  6,    60) =   77.36
   Model |  846646.429     6  141107.738               Prob > F      =  0.0000
Residual |  109446.288    60   1824.1048               R-squared     =  0.8855
---------+------------------------------               Adj R-squared =  0.8741
   Total |  956092.716    66  14486.2533               Root MSE      =   42.71

------------------------------------------------------------------------------
   api99 |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
 pctmeal |   .6088963   .3348236      1.819   0.074      -.0608506    1.278643
   pctel |  -3.372816   .7189063     -4.692   0.000      -4.810842   -1.934789
   yrrnd |  -30.10476   24.75397     -1.216   0.229      -79.62008    19.41056
    core |   -4.97981   3.339713     -1.491   0.141      -11.66023     1.70061
   avged |   69.37089   15.76158      4.401   0.000       37.84305    100.8987
 pctemer |  -.1026734   .8709507     -0.118   0.907      -1.844834    1.639487
   _cons |   693.2872   105.3493      6.581   0.000       482.5573    904.0171
------------------------------------------------------------------------------
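The Adj R-squared in this output is itself a formula-based estimate of shrinkage. As a check, here is a worked version using the usual adjusted R-squared formula for a model with a constant. The symbols n (number of schools), k (number of predictors), and R^2_adj are notation introduced here for illustration; they do not appear in the original notes.

```latex
R^2_{adj} = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1}
          = 1 - (1 - 0.8855)\,\frac{67 - 1}{67 - 6 - 1}
          = 1 - 0.1145 \times 1.1
          \approx 0.874
```

This agrees with the reported 0.8741 up to rounding of the R-squared, and it implies a formula-based shrinkage of only about 0.011.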
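Next we will load the Los Angeles County data. This dataset, lahi, has the same variables as the first dataset, ochi. Because Stata keeps the estimation results in memory after the data are replaced, predict applies the Orange County coefficients to the Los Angeles schools.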
use http://www.philender.com/courses/data/lahi, clear

predict p1
(option xb assumed; fitted values)
(10 missing values generated)

corr api99 p1
(obs=67)

         |    api99       p1
---------+------------------
   api99 |   1.0000
      p1 |   0.8522   1.0000

regress api99 p1

      Source |       SS       df       MS              Number of obs =     226
-------------+------------------------------           F(  1,   224) =  594.14
       Model |  2869665.82     1  2869665.82           Prob > F      =  0.0000
    Residual |  1081909.95   224  4829.95512           R-squared     =  0.7262
-------------+------------------------------           Adj R-squared =  0.7250
       Total |  3951575.77   225   17562.559           Root MSE      =  69.498

------------------------------------------------------------------------------
       api99 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          p1 |   1.236506   .0507285    24.37   0.000      1.13654    1.336472
       _cons |   -247.279   34.42901    -7.18   0.000    -315.1252   -179.4328
------------------------------------------------------------------------------
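Note that the cross-validated R2 of 0.7262 is much lower than the R2 of 0.8855 from our original regression analysis on the Orange County data, a shrinkage of about 0.16. The R2 of 0.7262 is also the square of the correlation of 0.8522 between api99 and p1.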
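The comparison can also be computed directly from Stata's stored results rather than read off the output. The following is a minimal sketch, not part of the original notes; the scalar names r2_orig, r2_cv, and shrink are hypothetical names introduced here.

```stata
* Sketch: cross-validated R-squared and shrinkage from stored results.
* r2_orig, r2_cv, and shrink are hypothetical scalar names.

* Step 1: fit the model in the Orange County sample and save its R-squared
use http://www.philender.com/courses/data/ochi, clear
regress api99 pctmeal pctel yrrnd core avged pctemer
scalar r2_orig = e(r2)

* Step 2: apply the Orange County coefficients to the Los Angeles sample
use http://www.philender.com/courses/data/lahi, clear
predict p1

* Step 3: the squared correlation of observed and predicted scores
* is the cross-validated R-squared
correlate api99 p1
scalar r2_cv = r(rho)^2

* Step 4: shrinkage is the difference between the two
scalar shrink = r2_orig - r2_cv
display "original R2        = " %6.4f r2_orig
display "cross-validated R2 = " %6.4f r2_cv
display "shrinkage          = " %6.4f shrink
```

If the data are unchanged, the first two values should match the 0.8855 and 0.7262 reported above.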
Stata Cross Validation Method 2
use http://www.philender.com/courses/data/ochi, clear

count
   71

append using http://www.philender.com/courses/data/lahi
(label yn already defined)

count
  307

generate sample=1 in 1/71
(236 missing values generated)

replace sample=2 in 72/l
(236 real changes made)

regress api99 pctmeal pctel yrrnd core avged pctemer if sample==1

      Source |       SS       df       MS              Number of obs =      67
-------------+------------------------------           F(  6,    60) =   77.36
       Model |  846646.429     6  141107.738           Prob > F      =  0.0000
    Residual |  109446.288    60   1824.1048           R-squared     =  0.8855
-------------+------------------------------           Adj R-squared =  0.8741
       Total |  956092.716    66  14486.2533           Root MSE      =   42.71

------------------------------------------------------------------------------
       api99 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     pctmeal |   .6088963   .3348236     1.82   0.074    -.0608507    1.278643
       pctel |  -3.372816   .7189063    -4.69   0.000    -4.810842   -1.934789
       yrrnd |  -30.10476   24.75397    -1.22   0.229    -79.62008    19.41056
        core |   -4.97981   3.339713    -1.49   0.141    -11.66023    1.700611
       avged |   69.37089   15.76158     4.40   0.000     37.84305    100.8987
     pctemer |  -.1026734   .8709507    -0.12   0.907    -1.844834    1.639487
       _cons |   693.2872   105.3493     6.58   0.000     482.5573    904.0171
------------------------------------------------------------------------------

predict pre
(option xb assumed; fitted values)
(14 missing values generated)

regress api99 pre if sample==2

      Source |       SS       df       MS              Number of obs =     226
-------------+------------------------------           F(  1,   224) =  594.14
       Model |  2869665.82     1  2869665.82           Prob > F      =  0.0000
    Residual |  1081909.95   224  4829.95512           R-squared     =  0.7262
-------------+------------------------------           Adj R-squared =  0.7250
       Total |  3951575.77   225   17562.559           Root MSE      =  69.498

------------------------------------------------------------------------------
       api99 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         pre |   1.236506   .0507285    24.37   0.000      1.13654    1.336472
       _cons |   -247.279   34.42901    -7.18   0.000    -315.1252   -179.4328
------------------------------------------------------------------------------

In this cross validation the R2 has decreased from 0.8855 to 0.7262.
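The combined dataset also lends itself to double cross validation, in which each sample takes a turn as the estimation sample and the other serves as the validation sample. The following is a rough sketch, not taken from the original notes. It assumes the appended data and the variable sample (1 = Orange County, 2 = Los Angeles) are still in memory; pre_oc and pre_la are hypothetical variable names introduced here.

```stata
* Sketch of double cross validation on the combined data.
* Assumes sample==1 marks Orange County and sample==2 marks Los Angeles.
* pre_oc and pre_la are hypothetical names for the two sets of predictions.

* Direction 1: estimate in Orange County, validate in Los Angeles
regress api99 pctmeal pctel yrrnd core avged pctemer if sample==1
predict pre_oc
correlate api99 pre_oc if sample==2
display "OC model applied to LA: R2 = " %6.4f r(rho)^2

* Direction 2: estimate in Los Angeles, validate in Orange County
regress api99 pctmeal pctel yrrnd core avged pctemer if sample==2
predict pre_la
correlate api99 pre_la if sample==1
display "LA model applied to OC: R2 = " %6.4f r(rho)^2
```

The first direction repeats the analysis above, so it should again give 0.7262. Comparing each squared correlation with the R2 from its own estimation sample gives one shrinkage estimate per direction.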
Using regvalidate on the combined sample
The program regvalidate (findit regvalidate) uses resampling methods within a single sample to assess how well a regression model validates. We will demonstrate its use on the combined Los Angeles and Orange County samples.
regress api99 pctmeal pctel yrrnd core avged pctemer

      Source |       SS       df       MS              Number of obs =     293
-------------+------------------------------           F(  6,   286) =  277.36
       Model |  4538777.22     6   756462.87           Prob > F      =  0.0000
    Residual |  780037.386   286  2727.40345           R-squared     =  0.8533
-------------+------------------------------           Adj R-squared =  0.8503
       Total |  5318814.61   292  18215.1185           Root MSE      =  52.225

------------------------------------------------------------------------------
       api99 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     pctmeal |  -.5078381   .1305415    -3.89   0.000    -.7647822   -.2508941
       pctel |  -.5148149   .3020057    -1.70   0.089    -1.109251    .0796209
       yrrnd |  -35.90818   10.89156    -3.30   0.001    -57.34596   -14.47041
        core |  -1.813285   1.368788    -1.32   0.186    -4.507461    .8808907
       avged |   123.0551   8.487517    14.50   0.000     106.3492     139.761
     pctemer |   -2.53015   .2807462    -9.01   0.000     -3.08274   -1.977559
       _cons |   399.1887   54.36759     7.34   0.000     292.1773    506.2001
------------------------------------------------------------------------------

regvalidate, reps(200)

original sample size = 293
reps = 200
regression model: regress api99 pctmeal pctel yrrnd core avged pctemer

                 orig      train       test       diff   orig adj
R-squared      0.8533     0.8585     0.8416     0.0169     0.8365
rss/n       2662.2436  2558.9878  2849.9416  -290.9538  2953.1974
fit slope      1.0000     1.0000     0.9876     0.0124     0.9876
fit _cons      0.0000    -0.0000     7.8432    -7.8432     7.8432
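In the regvalidate output, diff equals train minus test (0.8585 - 0.8416 = 0.0169 for R-squared), and orig adj equals orig minus diff up to rounding (0.8533 - 0.0169 for R-squared; 2662.2436 + 290.9538 = 2953.1974 for rss/n). To give a feel for what a single train/test replication might involve, here is a rough sketch of one random split done by hand. It is an illustration only, not regvalidate's actual algorithm; the seed, the 50/50 split, and the variable names train and yhat are all assumptions introduced here.

```stata
* One random train/test split done by hand; an illustration only,
* not the code regvalidate runs. The seed, the 50/50 split, and the
* names train and yhat are assumptions made for this sketch.
set seed 12345
generate byte train = runiform() < 0.5

* Fit the model in the training half
regress api99 pctmeal pctel yrrnd core avged pctemer if train

* Predicted scores for every school, from the training-half coefficients
predict yhat

* Calibration regression in the held-out half: its R-squared, slope,
* and constant show how well the training equation carries over
regress api99 yhat if !train
```

A slope near 1 and a constant near 0 in the last regression would indicate that the training equation needs little recalibration in the held-out half. Averaging such results over many random splits is the general idea behind a resampling estimate, though the details of regvalidate may differ.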
Linear Statistical Models Course
Phil Ender, 16oct10, 29Jan98