When the data do not fully meet the assumptions underlying the analysis of variance, or when there are outliers or influential data points, robust anova procedures can be used.
The most basic robust procedures are to analyze the data using regression with robust standard errors or to use the robust regression command rreg. Regress with the robust option is more appropriate for designs with heterogeneity of variance. The rreg command is mainly useful in situations with extreme outliers, although there may be other ways of dealing with these kinds of outliers. Of course, it is necessary to correctly code each of the categorical variables to use the regression approach.
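To see what the robust option is doing, here is a minimal sketch in Python (not Stata) of the HC1 "sandwich" variance estimator, which is the form of robust standard error that regress with vce(robust) reports. The data, and the function name hc1_se, are made up for illustration.

```python
import numpy as np

def hc1_se(X, y):
    """OLS coefficients with heteroskedasticity-robust (HC1 sandwich)
    standard errors: V = n/(n-k) (X'X)^-1 X'diag(e^2)X (X'X)^-1."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y                        # OLS coefficients
    e = y - X @ b                                # residuals
    meat = X.T @ (X * e[:, None] ** 2)           # X' diag(e^2) X
    V = n / (n - k) * XtX_inv @ meat @ XtX_inv   # HC1 sandwich
    return b, np.sqrt(np.diag(V))

# hypothetical data: intercept plus dummy codes for a 4-level factor,
# with error variance that grows across groups (heteroskedasticity)
rng = np.random.default_rng(0)
g = np.repeat([0, 1, 2, 3], 8)
X = np.column_stack([np.ones(32)] + [(g == j).astype(float) for j in (1, 2, 3)])
y = 3 + 0.5 * (g == 1) + 1.25 * (g == 2) + 3.25 * (g == 3) + rng.normal(0, 1 + g, 32)
b, se = hc1_se(X, y)
```

The point estimates are ordinary OLS; only the standard errors change, which is why the coefficients below match between regress with and without the robust option.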
One-Factor Designs
In addition to robust regression techniques, the fstar and wtest commands can be used with one-factor designs. Both fstar and wtest are more robust to heteroscedasticity than regular one-way anova.
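As an illustration of the kind of adjustment these commands make, here is a Python sketch of Welch's heteroscedasticity-adjusted one-way test, which is the statistic wtest is generally understood to compute (an assumption on our part; the function name and data are ours):

```python
import numpy as np
from scipy import stats

def welch_anova(groups):
    """Welch's heteroskedasticity-robust one-way anova.
    Weights each group by n_j / s_j^2 instead of assuming a
    common error variance. Returns (W, df1, df2, p-value)."""
    k = len(groups)
    n = np.array([len(g) for g in groups])
    m = np.array([np.mean(g) for g in groups])
    v = np.array([np.var(g, ddof=1) for g in groups])
    w = n / v                                   # precision weights
    mw = np.sum(w * m) / np.sum(w)              # weighted grand mean
    num = np.sum(w * (m - mw) ** 2) / (k - 1)
    tmp = np.sum((1 - w / np.sum(w)) ** 2 / (n - 1))
    den = 1 + 2 * (k - 2) / (k ** 2 - 1) * tmp
    df2 = (k ** 2 - 1) / (3 * tmp)              # approximate denominator df
    W = num / den
    return W, k - 1, df2, stats.f.sf(W, k - 1, df2)
```

With two groups this reduces to the square of Welch's unequal-variance t statistic, which is a handy sanity check.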
Consider an example using y as the response variable and a as the categorical variable.
use http://www.philender.com/courses/data/cr4new, clear

tabstat y, by(a) stat(n mean sd)

Summary for variables: y
     by categories of: a

       a |         N      mean        sd
---------+------------------------------
       1 |         8         3  1.511858
       2 |         8       3.5  .9258201
       3 |         8      4.25  1.035098
       4 |         8      6.25   2.12132
---------+------------------------------
   Total |        32      4.25  1.883716
----------------------------------------

histogram y, by(a) normal

anova y a

                  Number of obs =      32     R-squared     =  0.4455
                  Root MSE      =   1.476     Adj R-squared =  0.3860

           Source |  Partial SS    df       MS           F     Prob > F
       -----------+----------------------------------------------------
            Model |       49.00     3   16.3333333     7.50     0.0008
                  |
                a |       49.00     3   16.3333333     7.50     0.0008
                  |
         Residual |       61.00    28   2.17857143
       -----------+----------------------------------------------------
            Total |      110.00    31    3.5483871

regress y i.a

      Source |       SS       df       MS              Number of obs =      32
-------------+------------------------------           F(  3,    28) =    7.50
       Model |          49     3   16.3333333          Prob > F      =  0.0008
    Residual |          61    28   2.17857143          R-squared     =  0.4455
-------------+------------------------------           Adj R-squared =  0.3860
       Total |         110    31    3.5483871          Root MSE      =   1.476

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           a |
          2  |         .5   .7379992     0.68   0.504    -1.011723    2.011723
          3  |       1.25   .7379992     1.69   0.101    -.2617229    2.761723
          4  |       3.25   .7379992     4.40   0.000     1.738277    4.761723
             |
       _cons |          3   .5218443     5.75   0.000      1.93105     4.06895
------------------------------------------------------------------------------

testparm i.a

 ( 1)  2.a = 0
 ( 2)  3.a = 0
 ( 3)  4.a = 0

       F(  3,    28) =    7.50
            Prob > F =    0.0008

regress y i.a, vce(robust)  /* useful with heterogeneity */

Linear regression                                      Number of obs =      32
                                                       F(  3,    28) =    5.02
                                                       Prob > F      =  0.0066
                                                       R-squared     =  0.4455
                                                       Root MSE      =   1.476

------------------------------------------------------------------------------
             |               Robust
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           a |
          2  |         .5   .6267832     0.80   0.432    -.7839071    1.783907
          3  |       1.25   .6477985     1.93   0.064     -.076955    2.576955
          4  |       3.25   .9209855     3.53   0.001     1.363447    5.136553
             |
       _cons |          3   .5345225     5.61   0.000      1.90508     4.09492
------------------------------------------------------------------------------

testparm i.a

 ( 1)  2.a = 0
 ( 2)  3.a = 0
 ( 3)  4.a = 0

       F(  3,    28) =    5.02
            Prob > F =    0.0066

rreg y i.a  /* useful with outliers */

   Huber iteration 1:  maximum difference in weights = .53333333
   Huber iteration 2:  maximum difference in weights = .11280178
   Huber iteration 3:  maximum difference in weights = .02450832
Biweight iteration 4:  maximum difference in weights = .14902125
Biweight iteration 5:  maximum difference in weights = .02171319
Biweight iteration 6:  maximum difference in weights = .00926737

Robust regression                                      Number of obs =      32
                                                       F(  3,    28) =    7.51
                                                       Prob > F      =  0.0008

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           a |
          2  |   .6744035   .7076641     0.95   0.349    -.7751807    2.123988
          3  |   1.404021   .7076641     1.98   0.057    -.0455634    2.853605
          4  |   3.183657   .7076641     4.50   0.000     1.734072    4.633241
             |
       _cons |   2.825597   .5003941     5.65   0.000     1.800586    3.850607
------------------------------------------------------------------------------

testparm i.a

 ( 1)  2.a = 0
 ( 2)  3.a = 0
 ( 3)  4.a = 0

       F(  3,    28) =    7.51
            Prob > F =    0.0008

fstar y a  /* available from ATS via the Internet */

----------------------------------------------------------------------
 Dependent Variable is y and Independent Variable is a
 Fstar( 3, 19.43) = 7.497, p= 0.0016
----------------------------------------------------------------------

wtest y a  /* available from ATS via the Internet */

----------------------------------------------------------------------
 Dependent Variable is y and Independent Variable is a
 WStat( 3, 15.06) = 4.612, p= 0.0176
----------------------------------------------------------------------

Nonparametric Test
We can also try a nonparametric test, such as the Kruskal-Wallis test. The Kruskal-Wallis test is the nonparametric analog of the one-way anova.
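Because the test works on ranks rather than raw scores, only the ordering of the observations matters, not normality of the residuals. A short Python sketch of the same idea, on hypothetical data (the group means loosely mimic the cr4new groups):

```python
import numpy as np
from scipy import stats

# hypothetical data: four groups of 8, as in the cr4new example
rng = np.random.default_rng(1)
groups = [rng.normal(loc, 1.5, size=8) for loc in (3, 3.5, 4.25, 6.25)]

# Kruskal-Wallis replaces the raw scores by their ranks across all
# groups, then compares mean ranks between groups
h, p = stats.kruskal(*groups)
```

The statistic is H = 12/(N(N+1)) * sum(R_j^2 / n_j) - 3(N+1), where R_j is the rank sum in group j, matching the rank sums Stata prints.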
kwallis y, by(a)

Test: Equality of populations (Kruskal-Wallis test)

    a   _Obs   _RankSum
    1      8      78.50
    2      8     104.00
    3      8     142.50
    4      8     203.00

chi-squared =    12.496 with 3 d.f.
probability =     0.0059

chi-squared with ties =    12.997 with 3 d.f.
probability =     0.0046

Bootstrap Standard Errors
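Bootstrap standard errors make no distributional assumption about the coefficients: observations are resampled with replacement, the model is refit each time, and the standard deviation of the replicate coefficients serves as the standard error. A minimal Python sketch of that mechanism, on hypothetical data shaped like the cr4new example:

```python
import numpy as np

# hypothetical data: 4 groups of 8, dummy-coded design matrix
rng = np.random.default_rng(2)
g = np.repeat([0, 1, 2, 3], 8)
X = np.column_stack([np.ones(g.size)] + [(g == j).astype(float) for j in (1, 2, 3)])
y = 3.0 + np.array([0, .5, 1.25, 3.25])[g] + rng.normal(0, 1.5, g.size)

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

# resample whole observations (rows) with replacement, refit,
# and take the SD of the replicate coefficients as the SE
reps = np.empty((200, X.shape[1]))
for r in range(200):
    idx = rng.integers(0, y.size, y.size)
    reps[r] = ols(X[idx], y[idx])
boot_se = reps.std(axis=0, ddof=1)
```

This is the same logic as vce(bootstrap, reps(200)) below, which is why the reported coefficients are the observed (full-sample) estimates while only the standard errors come from the replications.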
regress y i.a, vce(bootstrap, reps(200))
(running regress on estimation sample)

Bootstrap replications (200)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..................................................    50
..................................................   100
..................................................   150
..................................................   200

Linear regression                               Number of obs  =        32
                                                Replications   =       200
                                                Wald chi2(3)   =     16.01
                                                Prob > chi2    =    0.0011
                                                R-squared      =    0.4455
                                                Adj R-squared  =    0.3860
                                                Root MSE       =    1.4760

------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           a |
          2  |         .5   .6219155     0.80   0.421    -.7189321    1.718932
          3  |       1.25   .6431964     1.94   0.052    -.0106417    2.510642
          4  |       3.25    .935627     3.47   0.001     1.416205    5.083795
             |
       _cons |          3   .5212292     5.76   0.000      1.97841     4.02159
------------------------------------------------------------------------------

testparm i.a

 ( 1)  2.a = 0
 ( 2)  3.a = 0
 ( 3)  4.a = 0

           chi2(  3) =   16.01
         Prob > chi2 =    0.0011

Least Absolute Deviation
Least absolute deviation regression minimizes the sum of the absolute deviations of the residuals, which amounts to fitting conditional medians; this is why it is often known as median regression. We will run the example using bootstrap standard errors with 200 replications. bsqreg does not work with factor variables, so we will use the xi: prefix.
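The connection between absolute deviations and the median can be shown directly: minimizing the sum of absolute deviations over a single constant recovers the sample median. A small Python illustration with made-up data:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# hypothetical sample; the LAD objective over a constant c is
# sum |y_i - c|, which is minimized at the sample median
y = np.array([3., 1., 4., 1., 5., 9., 2., 6., 5.])

res = minimize_scalar(lambda c: np.abs(y - c).sum(),
                      bounds=(y.min(), y.max()), method="bounded")
```

With predictors added, the same objective gives the median-regression coefficients that bsqreg estimates.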
xi: bsqreg y i.a, reps(200)
i.a               _Ia_1-4             (naturally coded; _Ia_1 omitted)
(fitting base model)
(bootstrapping)

Median regression, bootstrap(200) SEs               Number of obs =        32
  Raw sum of deviations    44 (about 4)
  Min sum of deviations    32                       Pseudo R2     =    0.2727

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _Ia_2 |          1   .8026182     1.25   0.223    -.6440889    2.644089
       _Ia_3 |          1   .7716067     1.30   0.206    -.5805647    2.580565
       _Ia_4 |          3   1.075819     2.79   0.009     .7962843    5.203716
       _cons |          3     .62749     4.78   0.000     1.714645    4.285355
------------------------------------------------------------------------------

test _Ia_2 _Ia_3 _Ia_4

 ( 1)  _Ia_2 = 0
 ( 2)  _Ia_3 = 0
 ( 3)  _Ia_4 = 0

       F(  3,    28) =    2.59
            Prob > F =    0.0725

Permutation Tests
Permutation tests are based on Monte Carlo simulation and can be used to test the differences among the levels of the independent variable. For each repetition, the values of the group variable are randomly permuted, the test statistic is computed, and a count is kept of whether this value of the test statistic is more extreme than the observed test statistic.
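The steps above can be sketched in Python (hypothetical helper names f_stat and permutation_p; the reported p is the c/n ratio that permute prints):

```python
import numpy as np

def f_stat(y, g):
    """One-way anova F statistic for response y and group labels g."""
    groups = [y[g == lev] for lev in np.unique(g)]
    k, n = len(groups), y.size
    gm = y.mean()
    ssb = sum(len(x) * (x.mean() - gm) ** 2 for x in groups)   # between SS
    ssw = sum(((x - x.mean()) ** 2).sum() for x in groups)     # within SS
    return (ssb / (k - 1)) / (ssw / (n - k))

def permutation_p(y, g, reps=1000, seed=0):
    """Shuffle the group labels, recompute F each time, and report
    p = c/reps, the share of permuted statistics >= the observed one."""
    rng = np.random.default_rng(seed)
    f_obs = f_stat(y, g)
    c = sum(f_stat(y, rng.permutation(g)) >= f_obs for _ in range(reps))
    return f_obs, c / reps
```

Shuffling the labels enforces the null hypothesis of no group differences, so no distributional assumptions about the errors are needed.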
Stata uses the permute command along with an ado-program to perform permutation tests. Here is an example of a program that can be used with one-factor designs. It can be placed in the current working directory (or elsewhere on the ado-path) and must be named panova1.ado.
program define panova1
    version 8
    args dv grp
    anova `dv' `grp'
end

Here is how it is used with cr4new.
permute a "panova1 y a" F=e(F), reps(1000)

      command:  panova1 y a
    statistic:  F = e(F)
  permute var:  a

Monte Carlo permutation statistics              Number of obs    =        32
                                                Replications     =      1000

------------------------------------------------------------------------------
T            |     T(obs)       c       n   p=c/n   SE(p) [95% Conf. Interval]
-------------+----------------------------------------------------------------
           F |   7.497268       1    1000  0.0010  0.0010  .0000253   .0055589
------------------------------------------------------------------------------
Note:  confidence interval is with respect to p=c/n
Note:  c = #{|T| >= |T(obs)|}

Next we will modify the dataset so that the results are not significant and repeat the permutation tests.

replace y=4 in 29/32

anova y a

                  Number of obs =      32     R-squared     =  0.2220
                  Root MSE      = 1.12401     Adj R-squared =  0.1386

           Source |  Partial SS    df       MS           F     Prob > F
       -----------+----------------------------------------------------
            Model |    10.09375     3   3.36458333     2.66     0.0673
                  |
                a |    10.09375     3   3.36458333     2.66     0.0673
                  |
         Residual |      35.375    28   1.26339286
       -----------+----------------------------------------------------
            Total |    45.46875    31   1.46673387

permute a "panova1 y a" F=e(F), reps(1000)

      command:  panova1 y a
    statistic:  F = e(F)
  permute var:  a

Monte Carlo permutation statistics              Number of obs    =        32
                                                Replications     =      1000

------------------------------------------------------------------------------
T            |     T(obs)       c       n   p=c/n   SE(p) [95% Conf. Interval]
-------------+----------------------------------------------------------------
           F |   2.663133      78    1000  0.0780  0.0085  .0621412   .0963936
------------------------------------------------------------------------------
Note:  confidence interval is with respect to p=c/n
Note:  c = #{|T| >= |T(obs)|}

Two-Factor Designs
Consider a two-factor design using write as the response variable and female and prog as the categorical independent variables.
use http://www.philender.com/courses/data/hsb2, clear

table female prog, contents(freq mean write sd write) row col

--------------------------------------------------
          |        type of program
   female |  general  academic  vocation     Total
----------+---------------------------------------
     male |       21        47        23        91
          | 49.14286  54.61702  41.82609  50.12088
          | 10.36478  8.656622  8.003705  10.30516
          |
   female |       24        58        27       109
          |    53.25  57.58621  50.96296  54.99083
          | 8.205248  7.115672  8.341193  8.133716
          |
    Total |       45       105        50       200
          | 51.33333  56.25714     46.76    52.775
          | 9.397776  7.943343  9.318754  9.478586
--------------------------------------------------

histogram write, by(female prog) start(30) width(5) normal

anova write female prog female#prog

                  Number of obs =     200     R-squared     =  0.2590
                  Root MSE      = 8.26386     Adj R-squared =  0.2399

       Source |  Partial SS    df       MS           F     Prob > F
  ------------+----------------------------------------------------
        Model |  4630.36091     5   926.072182    13.56     0.0000
              |
       female |  1261.85329     1   1261.85329    18.48     0.0000
         prog |  3274.35082     2   1637.17541    23.97     0.0000
  female#prog |  325.958189     2   162.979094     2.39     0.0946
              |
     Residual |  13248.5141   194   68.2913097
  ------------+----------------------------------------------------
        Total |   17878.875   199    89.843593

regress write i.female##i.prog

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  5,   194) =   13.56
       Model |  4630.36091     5  926.072182           Prob > F      =  0.0000
    Residual |  13248.5141   194  68.2913097           R-squared     =  0.2590
-------------+------------------------------           Adj R-squared =  0.2399
       Total |   17878.875   199   89.843593           Root MSE      =  8.2639

------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    1.female |   4.107143   2.469299     1.66   0.098    -.7629757    8.977261
             |
        prog |
          2  |   5.474164   2.169095     2.52   0.012     1.196128      9.7522
          3  |   -7.31677   2.494224    -2.93   0.004    -12.23605   -2.397493
             |
 female#prog |
        1 2  |  -1.137957   2.954299    -0.39   0.701    -6.964625     4.68871
        1 3  |   5.029733    3.40528     1.48   0.141    -1.686391    11.74586
             |
       _cons |   49.14286   1.803321    27.25   0.000     45.58623    52.69949
------------------------------------------------------------------------------

/* user written program -- findit anovalator */
/* test main effects and two-way interaction */

anovalator female prog, main 2way fratio

anovalator main-effect for female
chi2(1) = 18.477509
p-value = .00001719
scaled as F-ratio = 18.477509

anovalator main-effect for prog
chi2(2) = 47.946815
p-value = 3.877e-11
scaled as F-ratio = 23.973408

anovalator two-way interaction for female#prog
chi2(2) = 4.7730552
p-value = .09194841
scaled as F-ratio = 2.3865276

regress write i.female##i.prog, vce(robust)

Linear regression                                      Number of obs =     200
                                                       F(  5,   194) =   15.07
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.2590
                                                       Root MSE      =  8.2639

------------------------------------------------------------------------------
             |               Robust
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    1.female |   4.107143   2.791816     1.47   0.143    -1.399066    9.613352
             |
        prog |
          2  |   5.474164   2.575164     2.13   0.035     .3952513    10.55308
          3  |   -7.31677   2.787331    -2.63   0.009    -12.81413   -1.819409
             |
 female#prog |
        1 2  |  -1.137957   3.207405    -0.35   0.723    -7.463818    5.187903
        1 3  |   5.029733    3.61924     1.39   0.166    -2.108377    12.16784
             |
       _cons |   49.14286   2.241144    21.93   0.000     44.72272    53.56299
------------------------------------------------------------------------------

/* robust tests of main effects and two-way interaction */

anovalator female prog, main 2way fratio

anovalator main-effect for female
chi2(1) = 16.859042
p-value = .00004026
scaled as F-ratio = 16.859042

anovalator main-effect for prog
chi2(2) = 49.702994
p-value = 1.611e-11
scaled as F-ratio = 24.851497

anovalator two-way interaction for female#prog
chi2(2) = 4.9516052
p-value = .08409547
scaled as F-ratio = 2.4758026

rreg write i.female##i.prog

   Huber iteration 1:  maximum difference in weights = .43221816
   Huber iteration 2:  maximum difference in weights = .0522718
   Huber iteration 3:  maximum difference in weights = .02118098
Biweight iteration 4:  maximum difference in weights = .15802072
Biweight iteration 5:  maximum difference in weights = .00897314

Robust regression                                      Number of obs =     200
                                                       F(  5,   194) =   14.17
                                                       Prob > F      =  0.0000

------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    1.female |   4.127819   2.561848     1.61   0.109    -.9248306    9.180468
             |
        prog |
          2  |   5.772446   2.250392     2.57   0.011      1.33407    10.21082
          3  |  -8.110425   2.587707    -3.13   0.002    -13.21408   -3.006774
             |
 female#prog |
        1 2  |  -1.307885   3.065025    -0.43   0.670    -7.352935    4.737164
        1 3  |   5.590185   3.532909     1.58   0.115    -1.377657    12.55803
             |
       _cons |   49.42263   1.870909    26.42   0.000      45.7327    53.11257
------------------------------------------------------------------------------

/* tests of main effects and two-way interaction following rreg */

anovalator female prog, main 2way fratio

anovalator main-effect for female
chi2(1) = 18.138295
p-value = .00002054
scaled as F-ratio = 18.138295

anovalator main-effect for prog
chi2(2) = 51.200604
p-value = 7.620e-12
scaled as F-ratio = 25.600302

anovalator two-way interaction for female#prog
chi2(2) = 5.5390472
p-value = .06269186
scaled as F-ratio = 2.7695236
Linear Statistical Models Course
Phil Ender, 17sep10, 15may06, 4apr06, 2apr02