The effect of scaling predictor variables can be easily demonstrated using the variable read in the hsbdemo dataset. We will begin with a model regressing write on female and read.
Example 1
use http://www.philender.com/courses/data/hsbdemo, clear

regress write female read

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  2,   197) =   77.21
       Model |  7856.32118     2  3928.16059           Prob > F      =  0.0000
    Residual |  10022.5538   197  50.8759077           R-squared     =  0.4394
-------------+------------------------------           Adj R-squared =  0.4337
       Total |   17878.875   199   89.843593           Root MSE      =  7.1327

------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |   5.486894   1.014261     5.41   0.000      3.48669    7.487098
        read |   .5658869   .0493849    11.46   0.000      .468496    .6632778
       _cons |   20.22837   2.713756     7.45   0.000     14.87663    25.58011
------------------------------------------------------------------------------

The coefficient for read (.57) indicates how much change in write is expected for a one unit increase in read with female held constant. The concern here is that a one unit change in read might not be terribly meaningful. Suppose that research has indicated that a 12 point change in read is meaningful. Here is what you could do.
generate read12 = read/12

regress write female read12

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  2,   197) =   77.21
       Model |  7856.32128     2  3928.16064           Prob > F      =  0.0000
    Residual |  10022.5537   197  50.8759072           R-squared     =  0.4394
-------------+------------------------------           Adj R-squared =  0.4337
       Total |   17878.875   199   89.843593           Root MSE      =  7.1327

------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |   5.486894   1.014261     5.41   0.000      3.48669    7.487098
      read12 |   6.790643   .5926186    11.46   0.000     5.621953    7.959334
       _cons |   20.22837   2.713756     7.45   0.000     14.87663    25.58011
------------------------------------------------------------------------------

Now a one unit change in read12 predicts a 6.8 point change in write with female held constant. A one unit change in read12 is equivalent to a 12 point change in read.
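The change in the coefficient is purely a matter of rescaling: dividing a predictor by 12 multiplies its coefficient and standard error by 12, leaving the t statistic and p-value untouched. A minimal check of the arithmetic, assuming the original model (regress write female read) has just been run so that _b[read] is still available:

* the read12 coefficient is just 12 times the read coefficient
display _b[read]*12

This reproduces the 6.790643 reported for read12 above.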
Note that the standardized coefficients are identical for read and read12.
regress write female read, beta noheader

------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|                     Beta
-------------+----------------------------------------------------------------
      female |   5.486894   1.014261     5.41   0.000                 .2889851
        read |   .5658869   .0493849    11.46   0.000                 .6121169
       _cons |   20.22837   2.713756     7.45   0.000                        .
------------------------------------------------------------------------------

regress write female read12, beta noheader

------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|                     Beta
-------------+----------------------------------------------------------------
      female |   5.486894   1.014261     5.41   0.000                 .2889851
      read12 |   6.790643   .5926186    11.46   0.000                 .6121169
       _cons |   20.22837   2.713756     7.45   0.000                        .
------------------------------------------------------------------------------
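This is expected: a beta weight is the raw coefficient multiplied by the ratio of the predictor's standard deviation to the outcome's standard deviation, and rescaling read changes the coefficient and the standard deviation by exactly offsetting amounts. A minimal sketch of the hand computation (the scalar names here are just illustrative, not part of the output above):

regress write female read
quietly summarize write
scalar sd_write = r(sd)
quietly summarize read
scalar sd_read = r(sd)
* raw coefficient rescaled to standard deviation units
display _b[read]*sd_read/sd_write

The displayed value should match the Beta of .6121169, and the same computation using read12 gives the identical result.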
Example 2

Now, what if reading were a categorical variable? We will divide read into five categories. To be clear, I am not suggesting that you take a continuous variable and break it up into categories; the purpose is only to show the effect of scaling read as a categorical variable.
egen readcat = cut(read), group(5) icodes

tabulate readcat

    readcat |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |         39       19.50       19.50
          1 |         16        8.00       27.50
          2 |         62       31.00       58.50
          3 |         37       18.50       77.00
          4 |         46       23.00      100.00
------------+-----------------------------------
      Total |        200      100.00

tabstat read, by(readcat)

Summary for variables: read
     by categories of: readcat

 readcat |      mean
---------+----------
       0 |  38.61538
       1 |     44.25
       2 |  49.22581
       3 |  57.13514
       4 |  66.65217
---------+----------
   Total |     52.23
--------------------

Let's run a regression with dummy coded readcat.
regress write female i.readcat

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  5,   194) =   28.02
       Model |  7497.74329     5  1499.54866           Prob > F      =  0.0000
    Residual |  10381.1317   194  53.5109882           R-squared     =  0.4194
-------------+------------------------------           Adj R-squared =  0.4044
       Total |   17878.875   199   89.843593           Root MSE      =  7.3151

------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |   5.714997     1.0442     5.47   0.000     3.655556    7.774438
             |
     readcat |
          1  |   2.237243   2.173774     1.03   0.305    -2.050021    6.524508
          2  |   6.692244   1.495062     4.48   0.000     3.743581    9.640906
          3  |   11.49109   1.680671     6.84   0.000     8.176361    14.80583
          4  |   15.76366   1.596531     9.87   0.000     12.61487    18.91244
             |
       _cons |   41.65526   1.323366    31.48   0.000     39.04523    44.26529
------------------------------------------------------------------------------

testparm i.readcat

 ( 1)  1.readcat = 0
 ( 2)  2.readcat = 0
 ( 3)  3.readcat = 0
 ( 4)  4.readcat = 0

       F(  4,   194) =   29.53
            Prob > F =    0.0000

We see that overall readcat is a significant predictor of write. The R2 for this model is .4194, compared with .4394 when read is entered as a continuous predictor. Next, let's use readcat in a model but treat it as a one degree of freedom linear predictor.
regress write female readcat

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  2,   197) =   69.99
       Model |  7426.82985     2  3713.41492           Prob > F      =  0.0000
    Residual |  10452.0452   197  53.0560668           R-squared     =  0.4154
-------------+------------------------------           Adj R-squared =  0.4095
       Total |   17878.875   199   89.843593           Root MSE      =   7.284

------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |   5.688666   1.037052     5.49   0.000     3.643518    7.733814
     readcat |   4.030212   .3713078    10.85   0.000     3.297964     4.76246
       _cons |   40.90897   1.141635    35.83   0.000     38.65757    43.16036
------------------------------------------------------------------------------

The linear form of readcat is still significant, but the R2 for the model has gone down to .4154, a trivial difference for a gain of three degrees of freedom in the residual.
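Because the linear form is nested within the dummy-coded form, the drop in R2 can be converted to an F statistic with 3 and 194 degrees of freedom. A rough hand computation from the rounded R2 values above (a sketch, so the value is only approximate):

* F test for the R-squared change: dummy-coded versus linear readcat
display ((.4194 - .4154)/3) / ((1 - .4194)/194)

The result is roughly 0.45, nowhere near significance, which is consistent with calling the difference trivial.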
We can test whether the difference between using read and readcat is significant by including both in the same model. The significant coefficient for read (below) suggests that the continuous form of read accounts for variability in write that is not captured by the categorical form.
regress write female i.readcat read

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  6,   193) =   25.62
       Model |  7926.37551     6  1321.06258           Prob > F      =  0.0000
    Residual |  9952.49949   193  51.5673549           R-squared     =  0.4433
-------------+------------------------------           Adj R-squared =  0.4260
       Total |   17878.875   199   89.843593           Root MSE      =    7.181

------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |   5.469592   1.028589     5.32   0.000     3.440875     7.49831
             |
     readcat |
          1  |  -.6631964   2.359184    -0.28   0.779     -5.31629    3.989897
          2  |   1.273686   2.384601     0.53   0.594    -3.429538     5.97691
          3  |   2.011662   3.678693     0.55   0.585    -5.243941    9.267264
          4  |   1.413838   5.218196     0.27   0.787    -8.878175    11.70585
             |
        read |   .5108452    .177188     2.88   0.004     .1613717    .8603187
       _cons |    22.0735    6.91511     3.19   0.002     8.434609    35.71239
------------------------------------------------------------------------------
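The same fitted model can be used to ask the reverse question, whether the readcat dummies add anything beyond the continuous read. A sketch, assuming the model above has just been run (output not shown; the uniformly nonsignificant dummy coefficients above already suggest the answer):

* joint test of the readcat dummies over and above continuous read
testparm i.readcat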
Linear Statistical Models Course
Phil Ender, 20sep10, 22dec00