Regression Analysis So Far...
Coding Methods for Categorical Variables
Consider the Following 4 Group Design:

        Level:    a1     a2     a3     a4     Total
                   1      2      5     10
                   3      3      6     10
                   2      4      4      9
                   2      3      5     11
         Mean:   2.0    3.0    5.0   10.0     5.0
Dummy Coding

With dummy coding, a categorical variable with k levels is represented by k - 1 indicator (0/1) variables. The level coded 0 on every indicator is the reference group.
Example Using Dummy Coding
   y  grp  x1  x2  x3
   1   1    1   0   0
   3   1    1   0   0
   2   1    1   0   0
   2   1    1   0   0
   2   2    0   1   0
   3   2    0   1   0
   4   2    0   1   0
   3   2    0   1   0
   5   3    0   0   1
   6   3    0   0   1
   4   3    0   0   1
   5   3    0   0   1
  10   4    0   0   0
  10   4    0   0   0
   9   4    0   0   0
  11   4    0   0   0
Regression Analysis Using Dummy Coding
regress y x1 x2 x3

  Source |       SS       df       MS              Number of obs =      16
---------+------------------------------           F(  3,    12) =   76.00
   Model |     152.00      3  50.6666667           Prob > F      =  0.0000
Residual |       8.00     12  .666666667           R-squared     =  0.9500
---------+------------------------------           Adj R-squared =  0.9375
   Total |     160.00     15  10.6666667           Root MSE      =   .8165

------------------------------------------------------------------------------
       y |      Coef.   Std. Err.       t     P>|t|     [95% Conf. Interval]
---------+--------------------------------------------------------------------
      x1 |         -8   .5773503    -13.856   0.000      -9.257938   -6.742062
      x2 |         -7   .5773503    -12.124   0.000      -8.257938   -5.742062
      x3 |         -5   .5773503     -8.660   0.000      -6.257938   -3.742062
   _cons |         10   .4082483     24.495   0.000       9.110503     10.8895
------------------------------------------------------------------------------
Interpretation of Coefficients

With dummy coding, _cons (10) is the mean of the reference group, a4. Each coefficient is the difference between its group's mean and the reference-group mean: x1 = 2 - 10 = -8, x2 = 3 - 10 = -7, x3 = 5 - 10 = -5. The t-test on each coefficient is therefore a test of that group against the reference group.
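As a quick cross-check of the dummy-coding interpretation (sketched in Python rather than Stata, since only the group means are needed), every coefficient in the regression above can be recovered from the cell means of the 4-group example:

```python
# Recover the dummy-coding coefficients from the group means alone.
from statistics import mean

groups = {
    "a1": [1, 3, 2, 2],
    "a2": [2, 3, 4, 3],
    "a3": [5, 6, 4, 5],
    "a4": [10, 10, 9, 11],  # reference group: coded 0 on x1, x2, x3
}
m = {g: mean(ys) for g, ys in groups.items()}

cons = m["a4"]          # intercept = mean of the reference group
b1 = m["a1"] - m["a4"]  # each slope = group mean minus reference mean
b2 = m["a2"] - m["a4"]
b3 = m["a3"] - m["a4"]

print(cons, b1, b2, b3)  # 10.0 -8.0 -7.0 -5.0
```

These match _cons, x1, x2, and x3 in the Stata output.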
Effect Coding

With effect coding, the reference group is coded -1 on every indicator instead of 0, so the coefficients estimate deviations from the unweighted grand mean rather than differences from a reference group.
Example Using Effect Coding
   y  grp  x1  x2  x3
   1   1    1   0   0
   3   1    1   0   0
   2   1    1   0   0
   2   1    1   0   0
   2   2    0   1   0
   3   2    0   1   0
   4   2    0   1   0
   3   2    0   1   0
   5   3    0   0   1
   6   3    0   0   1
   4   3    0   0   1
   5   3    0   0   1
  10   4   -1  -1  -1
  10   4   -1  -1  -1
   9   4   -1  -1  -1
  11   4   -1  -1  -1
The Linear Model

Yij = mu + aj + ei(j)

where aj represents the treatment effect of the jth group, subject to the usual constraint that the aj sum to zero.
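Under the sum-to-zero constraint, the effect-coded regression estimates map onto the model terms directly. A sketch of the standard correspondence (assuming the balanced 4-group design shown earlier, with cell means 2, 3, 5, 10):

```latex
b_0 = \hat{\mu} = \frac{1}{k}\sum_{j=1}^{k}\bar{Y}_j = \frac{2 + 3 + 5 + 10}{4} = 5,
\qquad
b_j = \hat{\alpha}_j = \bar{Y}_j - \hat{\mu}, \quad j = 1,\dots,k-1,
\qquad
\hat{\alpha}_k = -\sum_{j=1}^{k-1}\hat{\alpha}_j .
```

For the group coded -1 (a4) this gives a-hat4 = -(-3 - 2 + 0) = 5 = 10 - 5, even though Stata never prints a coefficient for that group.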
Regression Analysis Using Effect Coding
regress y x1 x2 x3

  Source |       SS       df       MS              Number of obs =      16
---------+------------------------------           F(  3,    12) =   76.00
   Model |     152.00      3  50.6666667           Prob > F      =  0.0000
Residual |       8.00     12  .666666667           R-squared     =  0.9500
---------+------------------------------           Adj R-squared =  0.9375
   Total |     160.00     15  10.6666667           Root MSE      =   .8165

------------------------------------------------------------------------------
       y |      Coef.   Std. Err.       t     P>|t|     [95% Conf. Interval]
---------+--------------------------------------------------------------------
      x1 |         -3   .3535534     -8.485   0.000      -3.770327   -2.229673
      x2 |         -2   .3535534     -5.657   0.000      -2.770327   -1.229673
      x3 |          0   .3535534      0.000   1.000      -.7703266    .7703266
   _cons |          5   .2041241     24.495   0.000       4.555252    5.444748
------------------------------------------------------------------------------
Interpretation of Coefficients

With effect coding, _cons (5) is the unweighted grand mean of the group means. Each coefficient is a treatment effect, i.e., that group's mean minus the grand mean: x1 = 2 - 5 = -3, x2 = 3 - 5 = -2, x3 = 5 - 5 = 0. The effect for the group coded -1 (a4) is not printed, but equals minus the sum of the others: -(-3 - 2 + 0) = 5.
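As with dummy coding, the effect-coded coefficients can be recovered from the group means alone (a quick sketch in Python):

```python
# Recover the effect-coding coefficients from the group means.
group_means = {"a1": 2.0, "a2": 3.0, "a3": 5.0, "a4": 10.0}

# With a balanced design, the intercept is the unweighted grand mean.
grand = sum(group_means.values()) / len(group_means)

b1 = group_means["a1"] - grand  # treatment effect of a1
b2 = group_means["a2"] - grand
b3 = group_means["a3"] - grand
a4 = -(b1 + b2 + b3)            # effect of the -1 group, not printed by Stata

print(grand, b1, b2, b3, a4)  # 5.0 -3.0 -2.0 0.0 5.0
```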
F-ratio Using R2

The overall F-ratio can be computed directly from R2: F = (R2/k) / ((1 - R2)/(N - k - 1)), where k is the number of coded variables. For the example above, F = (.95/3) / (.05/12) = 76.00, matching the F in the regression output.
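The F-ratio can be obtained from R-squared alone as F = (R2/k) / ((1 - R2)/(N - k - 1)). A quick numerical check in Python, using the values from the 4-group regression above:

```python
# F = (R2 / df_model) / ((1 - R2) / df_residual)
r2 = 0.95        # R-squared from the 4-group regression
df_model = 3     # number of coded variables (k)
df_resid = 12    # N - k - 1 = 16 - 3 - 1

F = (r2 / df_model) / ((1 - r2) / df_resid)
print(round(F, 2))  # 76.0
```

This reproduces the F( 3, 12) = 76.00 reported by Stata.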
An example using hsbdemo
Let's analyze the hsbdemo data for the variable program type (prog) using write as the dependent variable. We will dummy code prog using the tabulate command with the generate option to create the dummy variables for us automatically.
use http://www.gseis.ucla.edu/courses/data/hsbdemo, clear
tab prog, gen(prog)

     type of |
     program |      Freq.     Percent        Cum.
-------------+-----------------------------------
     general |         45       22.50       22.50
    academic |        105       52.50       75.00
    vocation |         50       25.00      100.00
-------------+-----------------------------------
       Total |        200      100.00

regress write prog2 prog3

  Source |       SS       df       MS              Number of obs =     200
---------+------------------------------           F(  2,   197) =   21.27
   Model |  3175.69786     2  1587.84893           Prob > F      =  0.0000
Residual |  14703.1771   197   74.635417           R-squared     =  0.1776
---------+------------------------------           Adj R-squared =  0.1693
   Total |   17878.875   199   89.843593           Root MSE      =  8.6392

------------------------------------------------------------------------------
   write |      Coef.   Std. Err.       t     P>|t|     [95% Conf. Interval]
---------+--------------------------------------------------------------------
   prog2 |    4.92381   1.539279      3.199   0.002       1.888231    7.959388
   prog3 |  -4.573333   1.775183     -2.576   0.011      -8.074134   -1.072533
   _cons |   51.33333   1.287853     39.860   0.000       48.79359    53.87308
------------------------------------------------------------------------------

test prog2 prog3

 ( 1)  prog2 = 0.0
 ( 2)  prog3 = 0.0

       F(  2,   197) =   21.27
            Prob > F =    0.0000
It is also possible to have Stata perform dummy coding on-the-fly using factor variables.
regress write i.prog

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  2,   197) =   21.27
       Model |  3175.69786     2  1587.84893           Prob > F      =  0.0000
    Residual |  14703.1771   197   74.635417           R-squared     =  0.1776
-------------+------------------------------           Adj R-squared =  0.1693
       Total |   17878.875   199   89.843593           Root MSE      =  8.6392

------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        prog |
          2  |    4.92381   1.539279     3.20   0.002     1.888231    7.959388
          3  |  -4.573333   1.775183    -2.58   0.011    -8.074134   -1.072533
             |
       _cons |   51.33333   1.287853    39.86   0.000     48.79359    53.87308
------------------------------------------------------------------------------

testparm i.prog

 ( 1)  2.prog = 0
 ( 2)  3.prog = 0

       F(  2,   197) =   21.27
            Prob > F =    0.0000
Effect Coding Using Manual Coding
In this example group one is the reference group, i.e., the group coded -1 on each of the effect variables.
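Before running the effect-coded model, we can predict its coefficients. The dummy-coded run above implies group means on write of 51.33333 (general), 51.33333 + 4.92381 = 56.25714 (academic), and 46.76 (vocation); a quick Python sketch turns those into the effect-coded intercept and slopes:

```python
# Group means on write, taken from the dummy-coded regression above.
m_general  = 51.33333
m_academic = 56.25714   # 51.33333 + 4.92381
m_vocation = 46.76      # 51.33333 - 4.573333, to rounding

# Effect coding: intercept = unweighted mean of the group means,
# each slope = group mean minus that grand mean.
grand = (m_general + m_academic + m_vocation) / 3
b_prog2 = m_academic - grand
b_prog3 = m_vocation - grand

print(round(grand, 5), round(b_prog2, 5), round(b_prog3, 5))
# 51.45016 4.80698 -4.69016
```

These agree (to rounding) with _cons, prog2, and prog3 in the effect-coded regression.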
replace prog2 = -1 if prog==1
replace prog3 = -1 if prog==1
regress write prog2 prog3

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  2,   197) =   21.27
       Model |  3175.69786     2  1587.84893           Prob > F      =  0.0000
    Residual |  14703.1771   197   74.635417           R-squared     =  0.1776
-------------+------------------------------           Adj R-squared =  0.1693
       Total |   17878.875   199   89.843593           Root MSE      =  8.6392

------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       prog2 |   4.806984   .8161241     5.89   0.000     3.197523    6.416445
       prog3 |  -4.690159   .9626475    -4.87   0.000    -6.588576   -2.791742
       _cons |   51.45016   .6550731    78.54   0.000      50.1583    52.74201
------------------------------------------------------------------------------

The ANOVA Alternative
Many people picture anova software as being good only for classical experimental designs with categorical variables. However, the Stata anova command is actually regression in disguise. Consider the following regression that has both categorical and continuous variables and their interactions.
use http://www.philender.com/courses/data/hsbdemo, clear
tabulate prog, gen(prog)

     type of |
     program |      Freq.     Percent        Cum.
-------------+-----------------------------------
     general |         45       22.50       22.50
    academic |        105       52.50       75.00
    vocation |         50       25.00      100.00
-------------+-----------------------------------
       Total |        200      100.00

regress write i.female i.prog##c.read i.prog##c.math

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  9,   190) =   25.80
       Model |  9833.77329     9  1092.64148           Prob > F      =  0.0000
    Residual |  8045.10171   190  42.3426406           R-squared     =  0.5500
-------------+------------------------------           Adj R-squared =  0.5287
       Total |   17878.875   199   89.843593           Root MSE      =  6.5071

------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    1.female |   5.706612   .9390611     6.08   0.000     3.854288    7.558937
             |
        prog |
          2  |   5.872569   8.496026     0.69   0.490    -10.88608    22.63122
          3  |  -3.126916   9.509236    -0.33   0.743    -21.88415    15.63032
             |
        read |   .5184569   .1172236     4.42   0.000     .2872301    .7496837
             |
 prog#c.read |
          2  |  -.3111253   .1493291    -2.08   0.039    -.6056813   -.0165694
          3  |  -.2499231   .1666047    -1.50   0.135    -.5785557    .0787094
             |
        math |   .2072995   .1436855     1.44   0.151    -.0761243    .4907233
             |
 prog#c.math |
          2  |   .2062863   .1759469     1.17   0.242     -.140774    .5533465
          3  |   .2725577   .1950709     1.40   0.164    -.1122251    .6573405
             |
       _cons |   12.12411    7.35309     1.65   0.101    -2.380063    26.62829
------------------------------------------------------------------------------

testparm prog#c.math

 ( 1)  2.prog#c.math = 0
 ( 2)  3.prog#c.math = 0

       F(  2,   190) =    1.06
            Prob > F =    0.3482

test prog#c.read

 ( 1)  2.prog#c.read = 0
 ( 2)  3.prog#c.read = 0

       F(  2,   190) =    2.25
            Prob > F =    0.1079

test 1.female

 ( 1)  1.female = 0

       F(  1,   190) =   36.93
            Prob > F =    0.0000
Admittedly, that wasn't a very interesting model, but it did illustrate one way to put together all the pieces involved in a model with categorical variables and interaction terms. Now, let's look at exactly the same model using the anova command.
anova write i.female i.prog##c.read i.prog##c.math

                     Number of obs =     200     R-squared     =  0.5500
                     Root MSE      = 6.50712     Adj R-squared =  0.5287

          Source | Partial SS    df       MS           F     Prob > F
      -----------+----------------------------------------------------
           Model |  9833.77329     9   1092.64148     25.80     0.0000
                 |
          female |  1563.67667     1   1563.67667     36.93     0.0000
            prog |  66.1182428     2   33.0591214      0.78     0.4595
            read |  1170.32031     1   1170.32031     27.64     0.0000
       prog#read |  190.783714     2   95.3918572      2.25     0.1079
            math |  1066.81222     1   1066.81222     25.19     0.0000
       prog#math |  89.8348393     2   44.9174197      1.06     0.3482
                 |
        Residual |  8045.10171   190   42.3426406
      -----------+----------------------------------------------------
           Total |   17878.875   199    89.843593

regress

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  9,   190) =   25.80
       Model |  9833.77329     9  1092.64148           Prob > F      =  0.0000
    Residual |  8045.10171   190  42.3426406           R-squared     =  0.5500
-------------+------------------------------           Adj R-squared =  0.5287
       Total |   17878.875   199   89.843593           Root MSE      =  6.5071

------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    1.female |   5.706612   .9390611     6.08   0.000     3.854288    7.558937
             |
        prog |
          2  |   5.872569   8.496026     0.69   0.490    -10.88608    22.63122
          3  |  -3.126916   9.509236    -0.33   0.743    -21.88415    15.63032
             |
        read |   .5184569   .1172236     4.42   0.000     .2872301    .7496837
             |
 prog#c.read |
          2  |  -.3111253   .1493291    -2.08   0.039    -.6056813   -.0165694
          3  |  -.2499231   .1666047    -1.50   0.135    -.5785557    .0787094
             |
        math |   .2072995   .1436855     1.44   0.151    -.0761243    .4907233
             |
 prog#c.math |
          2  |   .2062863   .1759469     1.17   0.242     -.140774    .5533465
          3  |   .2725577   .1950709     1.40   0.164    -.1122251    .6573405
             |
       _cons |   12.12411    7.35309     1.65   0.101    -2.380063    26.62829
------------------------------------------------------------------------------

The results are the same as those from the regression analysis; however, the set-up of the model to be tested was a little more straightforward. Let's try one more.
ANOVA Example 2
use http://www.philender.com/courses/data/htwt, clear
anova weight i.female##c.height

                     Number of obs =    1000     R-squared     =  0.2795
                     Root MSE      = 8.20887     Adj R-squared =  0.2773

           Source | Partial SS    df       MS           F     Prob > F
    --------------+----------------------------------------------------
            Model |  26034.4351     3   8678.14505    128.78     0.0000
                  |
           female |  587.074483     1   587.074483      8.71     0.0032
           height |  19197.3548     1   19197.3548    284.89     0.0000
    female#height |   547.82512     1    547.82512      8.13     0.0044
                  |
         Residual |  67115.9985   996   67.3855406
    --------------+----------------------------------------------------
            Total |  93150.4336   999   93.2436773

regress

      Source |       SS       df       MS              Number of obs =    1000
-------------+------------------------------           F(  3,   996) =  128.78
       Model |  26034.4351     3  8678.14505           Prob > F      =  0.0000
    Residual |  67115.9985   996  67.3855406           R-squared     =  0.2795
-------------+------------------------------           Adj R-squared =  0.2773
       Total |  93150.4336   999  93.2436773           Root MSE      =  8.2089

------------------------------------------------------------------------------
      weight |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    1.female |   38.26321   12.96338     2.95   0.003     12.82455    63.70188
      height |   .7706638    .052059    14.80   0.000     .6685058    .8728217
             |
     female#|
    c.height |
          1  |  -.2227448   .0781214    -2.85   0.004    -.3760463   -.0694434
             |
       _cons |  -72.01376   8.892743    -8.10   0.000    -89.46442   -54.56309
------------------------------------------------------------------------------
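One useful way to read the female#c.height interaction is as a difference in height slopes between the groups. A quick sketch in Python of the implied simple slopes and intercepts, using the coefficients from the regression output above:

```python
# Coefficients from the weight-on-height regression above.
b_cons     = -72.01376   # intercept for males (female == 0)
b_female   =  38.26321   # shift in intercept for females
b_height   =   0.7706638 # height slope for males
b_interact =  -0.2227448 # 1.female#c.height: shift in slope for females

male_slope   = b_height
female_slope = b_height + b_interact
female_cons  = b_cons + b_female

print(round(female_slope, 7), round(female_cons, 5))
# 0.547919 -33.75055
```

So the fitted line is weight = -72.01 + .771*height for males and weight = -33.75 + .548*height for females; the significant interaction (p = 0.004) says those two slopes differ.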