This unit will cover a number of Stata commands that you have not seen before. Do not panic, this unit is primarily conceptual in nature. You do not have to learn all of the different procedures.
We begin with a fairly typical OLS regression analysis regressing api04 on meals, el, avg_ed and emer. This dataset has complete data on 4,702 schools from 834 school districts.
use http://www.philender.com/courses/data/api04, clear keep if stype==5 /* elementary schools only */ regress api04 meals el avg_ed emer Source | SS df MS Number of obs = 4702 -------------+------------------------------ F( 4, 4697) = 3086.43 Model | 28801025.3 4 7200256.32 Prob > F = 0.0000 Residual | 10957513.8 4697 2332.87499 R-squared = 0.7244 -------------+------------------------------ Adj R-squared = 0.7242 Total | 39758539.1 4701 8457.46418 Root MSE = 48.3 ------------------------------------------------------------------------------ api04 | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- meals | -1.31768 .0469087 -28.09 0.000 -1.409643 -1.225717 el | .1810992 .0485153 3.73 0.000 .0859866 .2762119 avg_ed | 56.02456 1.825136 30.70 0.000 52.44643 59.60268 emer | -1.403778 .1232983 -11.39 0.000 -1.6455 -1.162055 _cons | 652.1849 7.106252 91.78 0.000 638.2533 666.1165 ------------------------------------------------------------------------------ keep if e(sample) (1017 observations deleted)The problem with this analysis is that we are treating these data as if the api scores were completely independent from one school to the next. This is unlikely to be true since scores within school districts are likely to be similar to one another.
We can see how much of the variability is within district versus how much is between district by computing an intraclass correlation using the loneway command in Stata.
loneway api04 dnum One-way Analysis of Variance for api04: Number of obs = 4702 R-squared = 0.5563 Source SS df MS F Prob > F ------------------------------------------------------------------------- Between dnum 22116036 833 26549.863 5.82 0.0000 Within dnum 17642503 3868 4561.1434 ------------------------------------------------------------------------- Total 39758539 4701 8457.4642 Intraclass Asy. correlation S.E. [95% Conf. Interval] ------------------------------------------------ 0.46321 0.03589 0.39287 0.53355 Estimated SD of dnum effect 62.73711 Estimated SD within dnum 67.53624 Est. reliability of a dnum mean 0.82820 (evaluated at n=5.59)The intraclass correlation of .46 is fairly substantial and puts into question whether the coefficients and standard errors from our original regression model are correct. The first thing that we can try is to rerun the analysis using the cluster option. The cluster option yields the same regression coefficients but allows for differences in the variance/standard errors due to arbitrary intra-group correlation.
regress api04 meals el avg_ed emer, vce(cluster dnum) Linear regression Number of obs = 4702 F( 4, 833) = 454.13 Prob > F = 0.0000 R-squared = 0.7244 Root MSE = 48.3 (Std. Err. adjusted for 834 clusters in dnum) ------------------------------------------------------------------------------ | Robust api04 | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- meals | -1.31768 .1287663 -10.23 0.000 -1.570424 -1.064935 el | .1810992 .1077171 1.68 0.093 -.0303297 .3925281 avg_ed | 56.02456 5.358536 10.46 0.000 45.50674 66.54238 emer | -1.403778 .2983924 -4.70 0.000 -1.989467 -.8180886 _cons | 652.1849 21.312 30.60 0.000 610.3533 694.0164 ------------------------------------------------------------------------------Note that the variable el is nolonger significant at the .05 level.
The analysis using the cluster option works but it is kind a quick-and-dirty solution that would benefit from a more precise solution.
An alternative to using the cluster option is to include dummy coded variables for school district. The advantage of dummy coding district is that it allows for differences in the average level of across across districts in addition to adjusting the standard errors taking into account the specific intra-group correlation. However, regression with 833 dummy variables for school districts is both slow and memory intensive (it requires Stata SE). The alternative is to use the areg command which is logicaly equivalent to the dummy variable approach.
areg api04 meals el avg_ed emer, absorb(dnum) Linear regression, absorbing indicators Number of obs = 4702 F( 4, 3864) = 1793.95 Prob > F = 0.0000 R-squared = 0.8447 Adj R-squared = 0.8110 Root MSE = 39.976 ------------------------------------------------------------------------------ api04 | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- meals | -1.622885 .059068 -27.47 0.000 -1.738693 -1.507078 el | -.1496362 .0648568 -2.31 0.021 -.2767931 -.0224794 avg_ed | 35.48807 2.231773 15.90 0.000 31.1125 39.86363 emer | -1.563359 .139788 -11.18 0.000 -1.837424 -1.289293 _cons | 734.2111 8.672639 84.66 0.000 717.2077 751.2145 -------------+---------------------------------------------------------------- dnum | F(833, 3864) = 3.593 0.000 (834 categories)In this analysis both the coefficients and the standard errors differ from the original regression model.
It is also possible to run the areg coomand with the robust option.
areg api04 meals el avg_ed emer, absorb(dnum) vce(robust) Linear regression, absorbing indicators Number of obs = 4702 F( 4, 3864) = 1382.06 Prob > F = 0.0000 R-squared = 0.8447 Adj R-squared = 0.8110 Root MSE = 39.976 ------------------------------------------------------------------------------ | Robust api04 | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- meals | -1.622885 .1022788 -15.87 0.000 -1.823411 -1.42236 el | -.1496362 .1168033 -1.28 0.200 -.3786383 .0793658 avg_ed | 35.48807 3.802746 9.33 0.000 28.03249 42.94365 emer | -1.563359 .2104234 -7.43 0.000 -1.97591 -1.150807 _cons | 734.2111 14.99503 48.96 0.000 704.8121 763.61 -------------+---------------------------------------------------------------- dnum | absorbed (834 categories)As you can see the coefficients are the same as for the first areg model but the standard errors have been adjusted.
We will follow this analysis with fixed-effects (within) cross-sectional time-series model using xtreg.
xtreg api04 meals el avg_ed emer, i(dnum) fe Fixed-effects (within) regression Number of obs = 4702 Group variable: dnum Number of groups = 834 R-sq: within = 0.6500 Obs per group: min = 1 between = 0.7103 avg = 5.6 overall = 0.7149 max = 373 F(4,3864) = 1793.95 corr(u_i, Xb) = -0.0378 Prob > F = 0.0000 ------------------------------------------------------------------------------ api04 | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- meals | -1.622885 .059068 -27.47 0.000 -1.738693 -1.507078 el | -.1496362 .0648568 -2.31 0.021 -.2767931 -.0224794 avg_ed | 35.48807 2.231773 15.90 0.000 31.1125 39.86363 emer | -1.563359 .139788 -11.18 0.000 -1.837424 -1.289293 _cons | 734.2111 8.672639 84.66 0.000 717.2077 751.2145 -------------+---------------------------------------------------------------- sigma_u | 44.098947 sigma_e | 39.975999 rho | .54892132 (fraction of variance due to u_i) ------------------------------------------------------------------------------ F test that all u_i=0: F(833, 3864) = 3.59 Prob > F = 0.0000This analysis is exactly the same as the areg without the robust option.
We will follow this up with a between-effects xtreg model.
xtreg api04 meals el avg_ed emer, i(dnum) be Between regression (regression on group means) Number of obs = 4702 Group variable: dnum Number of groups = 834 R-sq: within = 0.6412 Obs per group: min = 1 between = 0.7188 avg = 5.6 overall = 0.7200 max = 373 F(4,829) = 529.73 sd(u_i + avg(e_i.))= 43.54434 Prob > F = 0.0000 ------------------------------------------------------------------------------ api04 | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- meals | -1.330267 .1039559 -12.80 0.000 -1.534315 -1.126219 el | .1249702 .1039732 1.20 0.230 -.0791115 .3290518 avg_ed | 50.28665 4.137794 12.15 0.000 42.16487 58.40844 emer | -2.345164 .2617322 -8.96 0.000 -2.8589 -1.831429 _cons | 663.9947 15.9463 41.64 0.000 632.6948 695.2946 ------------------------------------------------------------------------------The regression coefficients, standard errors and the R-squared between can also be obtained by generating a mean score for each variable for each district and then running an OLS regression with one observation per district.
sort dnum id by dnum: generate i = _n egen a = mean(api04), by(dnum) egen b = mean(meals), by(dnum) egen c = mean(el), by(dnum) egen d = mean(avg_ed), by(dnum) egen e = mean(emer), by(dnum) regress a b c d e if i==1 Source | SS df MS Number of obs = 834 -------------+------------------------------ F( 4, 829) = 529.73 Model | 4017708.46 4 1004427.11 Prob > F = 0.0000 Residual | 1571874.5 829 1896.10916 R-squared = 0.7188 -------------+------------------------------ Adj R-squared = 0.7174 Total | 5589582.96 833 6710.18362 Root MSE = 43.544 ------------------------------------------------------------------------------ a | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- b | -1.330267 .1039559 -12.80 0.000 -1.534315 -1.126219 c | .1249702 .1039732 1.20 0.230 -.0791115 .3290519 d | 50.28666 4.137794 12.15 0.000 42.16487 58.40844 e | -2.345164 .2617322 -8.96 0.000 -2.8589 -1.831429 _cons | 663.9947 15.9463 41.64 0.000 632.6948 695.2946 ------------------------------------------------------------------------------Since we have the mean values for each district let's subtract the district mean and add back in the grand mean and then do OLS regression with these adjusted scores.
sum api04, meanonly generate a2 = api04 - a + r(mean) sum meals, meanonly generate b2 = meals - b + r(mean) sum el, meanonly generate c2 = el - c + r(mean) sum avg_ed, meanonly generate d2 = avg_ed - d + r(mean) sum emer, meanonly generate e2 = emer - e + r(mean) regress a2 b2 c2 d2 e2 Source | SS df MS Number of obs = 4702 -------------+------------------------------ F( 4, 4697) = 2180.69 Model | 11467519.9 4 2866879.98 Prob > F = 0.0000 Residual | 6174982.85 4697 1314.66529 R-squared = 0.6500 -------------+------------------------------ Adj R-squared = 0.6497 Total | 17642502.8 4701 3752.9255 Root MSE = 36.258 ------------------------------------------------------------------------------ a2 | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- b2 | -1.622885 .0535748 -30.29 0.000 -1.727917 -1.517853 c2 | -.1496362 .0588253 -2.54 0.011 -.2649613 -.0343111 d2 | 35.48807 2.024223 17.53 0.000 31.51964 39.45649 e2 | -1.563359 .126788 -12.33 0.000 -1.811923 -1.314795 _cons | 734.2111 7.866102 93.34 0.000 718.7898 749.6323 ------------------------------------------------------------------------------This analysis gives us the same coefficients as the areg and fixed-effects xtreg but not the same standard errors. We also get the same R-squared as within for the fixed-effect model. The reason that we don't get the same standard errors is that we haven't adjusted the degrees of freedom of the variances and covariances to account for the 834 district means we have estimated.
Next, we will run a random-effects xtreg model. The random-effects model provides a gls solution giving a matrix weighted average of the between-effects and within-effects models.
xtreg api04 meals el avg_ed emer, i(dnum) re Random-effects GLS regression Number of obs = 4702 Group variable: dnum Number of groups = 834 R-sq: within = 0.6495 Obs per group: min = 1 between = 0.7139 avg = 5.6 overall = 0.7181 max = 373 Random effects u_i ~ Gaussian Wald chi2(4) = 9561.58 corr(u_i, X) = 0 (assumed) Prob > chi2 = 0.0000 ------------------------------------------------------------------------------ api04 | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- meals | -1.556664 .0510487 -30.49 0.000 -1.656718 -1.456611 el | -.0764894 .0542108 -1.41 0.158 -.1827405 .0297618 avg_ed | 39.97183 1.956283 20.43 0.000 36.13759 43.80608 emer | -1.715903 .1242845 -13.81 0.000 -1.959496 -1.47231 _cons | 708.6933 7.603373 93.21 0.000 693.791 723.5957 -------------+---------------------------------------------------------------- sigma_u | 31.435229 sigma_e | 39.975999 rho | .38208683 (fraction of variance due to u_i) ------------------------------------------------------------------------------Continuing with the random-effects model, let's try a maximum likelihood solution.
xtreg api04 meals el avg_ed emer, i(dnum) mle nolog Random-effects ML regression Number of obs = 4702 Group variable: dnum Number of groups = 834 Random effects u_i ~ Gaussian Obs per group: min = 1 avg = 5.6 max = 373 LR chi2(4) = 5189.39 Log likelihood = -24451.401 Prob > chi2 = 0.0000 ------------------------------------------------------------------------------ api04 | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- meals | -1.548644 .0509281 -30.41 0.000 -1.648461 -1.448827 el | -.0686962 .0538775 -1.28 0.202 -.1742942 .0369018 avg_ed | 40.52672 1.966052 20.61 0.000 36.67333 44.38011 emer | -1.729005 .1242775 -13.91 0.000 -1.972584 -1.485425 _cons | 706.9804 7.610685 92.89 0.000 692.0637 721.8971 -------------+---------------------------------------------------------------- /sigma_u | 28.50507 1.300925 26.06602 31.17235 /sigma_e | 40.24634 .4572658 39.36002 41.15261 rho | .3340611 .0219029 .2922954 .3779909 ------------------------------------------------------------------------------ Likelihood-ratio test of sigma_u=0: chibar2(01)= 899.23 Prob>=chibar2 = 0.000Using the xtmixed command wit restricted maximum likelihood estimator.
xtmixed api04 meals el avg_ed emer || dnum: Performing EM optimization: Performing gradient-based optimization: Iteration 0: log restricted-likelihood = -24454.551 Iteration 1: log restricted-likelihood = -24454.551 Computing standard errors: Mixed-effects REML regression Number of obs = 4702 Group variable: dnum Number of groups = 834 Obs per group: min = 1 avg = 5.6 max = 373 Wald chi2(4) = 9692.94 Log restricted-likelihood = -24454.551 Prob > chi2 = 0.0000 ------------------------------------------------------------------------------ api04 | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- meals | -1.548778 .0508131 -30.48 0.000 -1.64837 -1.449186 el | -.0688246 .0537867 -1.28 0.201 -.1742446 .0365954 avg_ed | 40.51744 1.949019 20.79 0.000 36.69743 44.33744 emer | -1.728796 .1241998 -13.92 0.000 -1.972223 -1.485369 _cons | 707.0091 7.570401 93.39 0.000 692.1714 721.8468 ------------------------------------------------------------------------------ ------------------------------------------------------------------------------ Random-effects Parameters | Estimate Std. Err. [95% Conf. Interval] -----------------------------+------------------------------------------------ dnum: Identity | sd(_cons) | 28.5644 1.30312 26.12118 31.23613 -----------------------------+------------------------------------------------ sd(Residual) | 40.26107 .4575715 39.37416 41.16796 ------------------------------------------------------------------------------ LR test vs. linear regression: chibar2(01) = 901.11 Prob >= chibar2 = 0.0000And next, a population averaged GEE random-effects model.
xtreg api04 meals el avg_ed emer, i(dnum) pa nolog /* same as: xtgee api04 meals el avg_ed emer, i(dnum) nolog */ GEE population-averaged model Number of obs = 4702 Group variable: dnum Number of groups = 834 Link: identity Obs per group: min = 1 Family: Gaussian avg = 5.6 Correlation: exchangeable max = 373 Wald chi2(4) = 17223.17 Scale parameter: 2506.259 Prob > chi2 = 0.0000 ------------------------------------------------------------------------------ api04 | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- meals | -1.596325 .0378911 -42.13 0.000 -1.67059 -1.52206 el | -.1187029 .0409814 -2.90 0.004 -.199025 -.0383808 avg_ed | 37.24855 1.442306 25.83 0.000 34.42169 40.07542 emer | -1.63356 .0905914 -18.03 0.000 -1.811116 -1.456004 _cons | 717.0666 5.704247 125.71 0.000 705.8865 728.2467 ------------------------------------------------------------------------------We can also run the population averaged model with the robust option.
xtreg api04 meals el avg_ed emer, i(dnum) pa robust nolog /* same as: xtgee api04 meals el avg_ed emer, i(dnum) robust nolog */ GEE population-averaged model Number of obs = 4702 Group variable: dnum Number of groups = 834 Link: identity Obs per group: min = 1 Family: Gaussian avg = 5.6 Correlation: exchangeable max = 373 Wald chi2(4) = 4058.27 Scale parameter: 2506.259 Prob > chi2 = 0.0000 (Std. Err. adjusted for clustering on dnum) ------------------------------------------------------------------------------ | Semirobust api04 | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- meals | -1.596325 .1943932 -8.21 0.000 -1.977329 -1.215322 el | -.1187029 .1347415 -0.88 0.378 -.3827913 .1453856 avg_ed | 37.24855 10.14065 3.67 0.000 17.37326 57.12385 emer | -1.63356 .2055431 -7.95 0.000 -2.036417 -1.230703 _cons | 717.0666 38.79005 18.49 0.000 641.0395 793.0937 ------------------------------------------------------------------------------Finally, just for the fun of it we will use svyreg with clusting even though the data are not from a complex survey sampling design.
svyset, psu(dnum) pweight:This analysis is the same as the OLS regression with the cluster option.VCE: linearized Single unit: missing Strata 1: SU 1: dnum FPC 1: svy: regress api04 meals el avg_ed emer (running regress on estimation sample) Survey: Linear regression Number of strata = 1 Number of obs = 4702 Number of PSUs = 834 Population size = 4702 Design df = 833 F( 4, 830) = 452.88 Prob > F = 0.0000 R-squared = 0.7244 ------------------------------------------------------------------------------ | Linearized api04 | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- meals | -1.31768 .1287115 -10.24 0.000 -1.570317 -1.065043 el | .1810992 .1076713 1.68 0.093 -.0302397 .3924382 avg_ed | 56.02456 5.356256 10.46 0.000 45.51121 66.5379 emer | -1.403778 .2982654 -4.71 0.000 -1.989218 -.8183379 _cons | 652.1849 21.30293 30.61 0.000 610.3711 693.9986 ------------------------------------------------------------------------------
Collectively, these analyses provide a range of options for analyzing clustered data in Stata. There is no need to use a multilevel data analysis program for these data since all of the data are collected at the school level and no cross level hypotheses are being tested. Results identical to xtreg with the mle option were obtained using SAS proc mixed.
Linear Statistical Models Course
Phil Ender, 17sep10, 11nov04