Checking Assumptions

Linear Statistical Models

Three Universal Assumptions of Analysis of Variance

Independence
Normality
Homogeneity of Variance

Independence

Independence is verified by examining how subjects were selected and how they were assigned to groups. If subjects are randomly sampled and randomly assigned to groups, the independence assumption is easily met. If some other method is used to sample or assign subjects, examine the procedure closely to determine whether the assumption of independence still holds.

Normality

Although tests exist for checking normality, it is probably as effective to visually inspect distribution graphs for each group.

histograms
histogram y, by(b) normal
normal probability plots
pnorm y if b==1
kernel density plots
twoway (kdensity y if b==1)(kdensity y if b==2)(kdensity y if b==3)(kdensity y if b==4), legend(off)
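
For readers who also want a formal test next to the plots, the following is a minimal sketch using Stata's Shapiro-Wilk command. It assumes the same outcome y and grouping variable b as the plot commands above. With groups this small such tests have little power, so the result is best read alongside the graphs rather than in place of them.

```stata
* Shapiro-Wilk W test of normality, run separately within each group
* (assumes outcome y and grouping variable b, as in the plot commands)
bysort b: swilk y

* Quantile-normal plot for one group, as a visual companion to the test
qnorm y if b==1
```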

The tricky part is when group sample sizes are small. In these cases it is difficult to tell whether the sample was drawn from a normally distributed population or not.

Example Histograms of Random Samples

[Figure: histograms of 12 random samples with n = 8]

[Figure: histograms of 24 random samples with n = 4]

Homogeneity of Variance

This assumption is also called homoscedasticity. To check it, we need to determine whether the populations from which the group samples were drawn have equal variances.

Although a number of tests exist for checking homogeneity of variance, most if not all of them are affected to some degree by non-normality in the samples. Running tests of homogeneity of variance as a matter of course is not recommended. It is usually safer to inspect the variances or standard deviations for each group. If the ratio of the largest variance to the smallest exceeds nine, or the ratio of the largest standard deviation to the smallest exceeds three, the researcher should be concerned about heterogeneity of variance.

Here are four methods for checking the homogeneity of variance assumption. Of the four, Levene's test is least affected by non-normality.

Fmax test
Bartlett's test
Cochran's test
Levene's test

Table of Group Means, Variances and Standard Deviations

tabstat y, by(b) stat(n mean sd var)

       b |         N      mean        sd  variance
---------+----------------------------------------
       1 |         8      2.75  1.488048  2.214286
       2 |         8       3.5  .9258201  .8571429
       3 |         8      6.25  1.035098  1.071429
       4 |         8         9  1.309307  1.714286
---------+----------------------------------------
   Total |        32     5.375  2.756225  7.596774
--------------------------------------------------
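
As a quick check of the rule of thumb stated earlier against this table, the two ratios can be worked out from the rounded tabled values (a sketch, not a formal test):

```latex
\frac{s^2_{\max}}{s^2_{\min}} = \frac{2.214}{0.857} \approx 2.58 < 9,
\qquad
\frac{s_{\max}}{s_{\min}} = \frac{1.488}{0.926} \approx 1.61 < 3
```

Both ratios fall well inside the thresholds, so the table alone gives little reason for concern about heterogeneity of variance.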

Fmax Test for Homogeneity of Variance

Quick and dirty
Affected by non-normality
Use Fmax table

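Fmax is the ratio of the largest group variance to the smallest. It is referred to the Fmax table with df = k and n - 1, where k is the number of groups and n is the number of observations per group.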

From the Example:

Fmax = 2.214/0.857 = 2.58

with 4 & 7 degrees of freedom.

Critical Value of Fmax = 8.44 for α = .05

Decision: Fail to reject H₀

There is no evidence for heterogeneity of variance.

Bartlett's Test for Homogeneity of Variance

Also affected by non-normality
Uses the χ² distribution

From the Example:

B = 1.8281 (as computed by Stata)

with k - 1 = 3 degrees of freedom.

Critical Value of χ² = 7.815 for α = .05

Decision: Fail to reject H₀ of equal variances
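
The notes report only the value of B. Assuming Stata uses the usual textbook form of Bartlett's statistic (an assumption; the source does not show the computation), the value can be reproduced from the table of variances, with k = 4 groups, n_i = 8 observations per group, N = 32, and s_p^2 the pooled variance:

```latex
\begin{aligned}
s_p^2 &= \frac{\sum_i (n_i - 1)\, s_i^2}{N - k}
       = \frac{7\,(2.2143 + 0.8571 + 1.0714 + 1.7143)}{28} \approx 1.4643 \\[4pt]
B &= \frac{(N-k)\ln s_p^2 \;-\; \sum_i (n_i-1)\ln s_i^2}
          {1 + \dfrac{1}{3(k-1)}\left(\sum_i \dfrac{1}{n_i-1} - \dfrac{1}{N-k}\right)} \\[4pt]
  &\approx \frac{10.6783 - 8.7414}{1 + \frac{1}{9}\left(\frac{4}{7} - \frac{1}{28}\right)}
   \approx \frac{1.9369}{1.0595} \approx 1.828
\end{aligned}
```

This agrees with the reported B = 1.8281 to three decimals.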

Cochran's Test for Homogeneity of Variance

Computationally simpler than Bartlett's test
Also affected by non-normality
Uses Cochran's C table

df = k and n - 1 (the same as Fmax)

From the Example:

C = 2.214/5.857 = .3780

with 4 & 7 degrees of freedom.

Critical Value of C = .5365 for α = .05

Decision: Fail to reject H₀ of equal variances
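
For reference, Cochran's C is the largest group variance divided by the sum of all the group variances. Using the tabled variances for this example:

```latex
C = \frac{s^2_{\max}}{\sum_{i=1}^{k} s_i^2}
  = \frac{2.2143}{2.2143 + 0.8571 + 1.0714 + 1.7143}
  = \frac{2.2143}{5.8571} \approx 0.378
```

The value is well below the critical value of .5365, which is why the null hypothesis of equal variances is retained.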

Levene's Test for Homogeneity of Variance

Relatively simple
Less affected by non-normality
Uses the F distribution
Has a tendency to falsely reject H₀ in some situations

Compute d, the absolute deviation of each score from its group mean, and perform a one-way ANOVA using d as the dependent variable.

From the Example:

F = 1.29, p = 0.2963

with 3 & 28 degrees of freedom.

Critical Value of F = 2.95 for α = .05

Decision: Fail to reject H₀ of equal variances
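
The test can also be carried out by hand, which makes clear what d is. The following is a minimal sketch, assuming the outcome y and grouping variable a from the example; gmean is an illustrative variable name that does not appear in the original notes. The F ratio from the final command should match the W0 statistic that robvar reports.

```stata
* Levene's test by hand: one-way ANOVA on absolute deviations
* (assumes outcome y and grouping variable a; gmean is an illustrative name)
egen gmean = mean(y), by(a)      // mean of y within each group
generate d = abs(y - gmean)      // absolute deviation from the group mean
oneway d a                       // F for d is Levene's statistic

* Critical value of F with 3 and 28 degrees of freedom at alpha = .05
display invFtail(3, 28, .05)
```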

robvar y, by(a)

            |            Summary of y
          a |        Mean   Std. Dev.       Freq.
------------+------------------------------------
          1 |           3   1.5118579           8
          2 |         3.5    .9258201           8
          3 |        4.25   1.0350983           8
          4 |        6.25   2.1213203           8
------------+------------------------------------
      Total |        4.25   1.8837163          32

W0  = 1.292876   df(3, 28)     Pr > F = .29625408  /*  <-- this is the Levene's test */

W50 = 1.037037   df(3, 28)     Pr > F = .39138742

W10 = 1.292876   df(3, 28)     Pr > F = .29625408

Levene's F is a relatively robust measure of homogeneity of variance. It was not really needed in this example because the standard deviations are so close together. The other tests of homoscedasticity, Fmax, Bartlett's test, and Cochran's test, are more strongly influenced by non-normality than is Levene's test. The first step in checking the assumption of homogeneity of variance should be to inspect the standard deviations or variances within each level.

Exploring Robustness

A statistical test is said to be robust if it yields correct conclusions even when some of its assumptions are not met. Generally, ANOVA is considered to be relatively robust to violations of normality and homogeneity of variance, especially when the sample sizes are equal or nearly equal.

We can explore the robustness of some one-way designs to heteroscedasticity and unequal sample sizes using simanova. simanova performs a Monte Carlo simulation of completely randomized designs under the assumption that the group means are equal. We can then compare the simulated proportion of tests rejected at each of several nominal alpha levels with the nominal level itself.
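
To make the idea concrete, here is a rough sketch of this kind of simulation using only built-in Stata commands. It is not simanova's implementation, whose internals are not shown in these notes. The program name onesim, the variable names, the seed, and the choice of four groups of 8 with standard deviations 1, 1, 1, and 3 are all illustrative assumptions.

```stata
* Hypothetical helper: one simulated one-way ANOVA with equal group means
capture program drop onesim
program define onesim, rclass
    drop _all
    set obs 32
    generate a = ceil(_n/8)                // four groups of 8
    generate sd = cond(a==4, 3, 1)         // group 4 has SD 3, the rest SD 1
    generate y = rnormal(0, sd)            // all population means equal 0
    quietly oneway y a
    return scalar p = Ftail(r(df_m), r(df_r), r(F))
end

* Repeat 1000 times and see how often p falls below the nominal .05
set seed 12345
simulate p = r(p), reps(1000) nodots: onesim
count if p < .05
display "Simulated rejection rate at nominal .05 = " r(N)/_N
```

If the F test were unaffected by the unequal variances, the simulated rejection rate would sit close to .05; a noticeably larger rate indicates an inflated Type I error rate.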

net from http://www.ats.ucla.edu/stat/stata/ado/analysis
net install simanova

use http://www.philender.com/courses/data/cr4new, clear

anova y a

                           Number of obs =      32     R-squared     =  0.4455
                           Root MSE      =   1.476     Adj R-squared =  0.3860

                  Source |  Partial SS    df       MS           F     Prob > F
              -----------+----------------------------------------------------
                   Model |          49     3  16.3333333       7.50     0.0008
                         |
                       a |          49     3  16.3333333       7.50     0.0008
                         |
                Residual |          61    28  2.17857143   
              -----------+----------------------------------------------------
                   Total |         110    31   3.5483871    

simanova y a

Information about Sample Sizes and Standard Deviations
------------------------------------------------------
N1 = 8 and S1 = 1.5118579
N2 = 8 and S2 = .92582011
N3 = 8 and S3 = 1.0350983
N4 = 8 and S4 = 2.1213202

Results of Standard ANOVA

----------------------------------------------------------------------
Dependent Variable is y and Independent Variable is a
F(  3,  28.00) =   7.497, p= 0.0008
----------------------------------------------------------------------

         1000 simulated ANOVA F tests
         --------------------------------
Nominal  Simulated   Simulated P value   
P Value  P Value     [95% Conf. Interval]
-----------------------------------------
 0.0008   0.0010       0.0000 - 0.0056
 0.2000   0.2160       0.1909 - 0.2428
 0.1000   0.1200       0.1005 - 0.1418
 0.0500   0.0600       0.0461 - 0.0766
 0.0100   0.0110       0.0055 - 0.0196

simanova, gr(4) n(8 8 8 8) s(1 1 1 3)

Information about Sample Sizes and Standard Deviations
------------------------------------------------------
N1 = 8 and S1 = 1
N2 = 8 and S2 = 1
N3 = 8 and S3 = 1
N4 = 8 and S4 = 3

         1000 simulated ANOVA F tests
         --------------------------------
Nominal  Simulated   Simulated P value   
P Value  P Value     [95% Conf. Interval]
-----------------------------------------
 0.2000   0.2110       0.1861 - 0.2376
 0.1000   0.1300       0.1098 - 0.1524
 0.0500   0.0790       0.0630 - 0.0975
 0.0100   0.0440       0.0321 - 0.0586

simanova, gr(4) n(8 8 8 8) s(1 1 1 9)

Information about Sample Sizes and Standard Deviations
------------------------------------------------------
N1 = 8 and S1 = 1
N2 = 8 and S2 = 1
N3 = 8 and S3 = 1
N4 = 8 and S4 = 9

         1000 simulated ANOVA F tests
         --------------------------------
Nominal  Simulated   Simulated P value   
P Value  P Value     [95% Conf. Interval]
-----------------------------------------
 0.2000   0.2340       0.2081 - 0.2615
 0.1000   0.1610       0.1387 - 0.1853
 0.0500   0.1190       0.0996 - 0.1407
 0.0100   0.0570       0.0435 - 0.0732

simanova, gr(4) n(16 16 16 8) s(1 1 1 9)

Information about Sample Sizes and Standard Deviations
------------------------------------------------------
N1 = 16 and S1 = 1
N2 = 16 and S2 = 1
N3 = 16 and S3 = 1
N4 = 8 and S4 = 9

         1000 simulated ANOVA F tests
         --------------------------------
Nominal  Simulated   Simulated P value   
P Value  P Value     [95% Conf. Interval]
-----------------------------------------
 0.2000   0.3950       0.3646 - 0.4261
 0.1000   0.3370       0.3077 - 0.3672
 0.0500   0.2850       0.2572 - 0.3141
 0.0100   0.1930       0.1690 - 0.2188

Linear Statistical Models Course

Phil Ender, 4apr06, 4jun99; 15mar02