Looking at Anova Data

Linear Statistical Models

Updated for Stata 11

Displaying Raw Anova Data

Although it isn't done for every analysis, there will be times when you want to display the raw data for an anova. The tabdisp command displays the data in tabular form, with one column per group or cell.

use http://www.philender.com/courses/data/cr4a, clear

sort grp

by grp: generate order = _n  /* number observations within each group */

tabdisp order grp, cellvar(y)

----------------------------------
          |          grp          
    order |    1     2     3     4
----------+-----------------------
        1 |    3     4     7     7
        2 |    6     5     8     8
        3 |    3     4     7     9
        4 |    3     3     6     8
        5 |    1     2     5    10
        6 |    2     3     6    10
        7 |    2     4     5     9
        8 |    2     3     6    11
----------------------------------
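
The sort and by steps above can also be combined. As a minimal sketch, bysort does both in one command; the variable name obsnum is hypothetical, chosen so that it does not collide with the order variable created earlier.

bysort grp: generate obsnum = _n  /* sort by grp, then number observations within each group */

tabdisp obsnum grp, cellvar(y)

The resulting display has the same layout as the table above.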


use http://www.philender.com/courses/data/crf33, clear

sort a b

by a b: generate order = _n  /* number observations within each cell */

tabdisp order b a, cellvar(y)

--------------------------------------------------------------------
          |                         a and b                         
          | ------- 1 ------    ------- 2 ------    ------- 3 ------
    order |    1     2     3       1     2     3       1     2     3
----------+---------------------------------------------------------
        1 |   37    44    38      34    35    36      21    39    52
        2 |   42    36    28      30    27    45      31    50    53
        3 |   29    27    48      26    40    26      10    34    64
        4 |   33    43    29      39    31    46      20    41    42
        5 |   24    25    47      21    22    27      18    36    49
--------------------------------------------------------------------

General Descriptive and Exploratory Data Analysis

It is important to look at your data and to understand it as well as you can before fitting a model. All of the descriptive statistics and exploratory data analysis tools that were covered in the regression part of the course apply here as well.

use http://www.philender.com/courses/data/hsb2, clear

describe

Contains data from http://www.gseis.ucla.edu/courses/data/hsb2.dta
  obs:           200                          highschool and beyond (200
                                                cases)
 vars:            11                          21 Jun 2000 08:54
 size:         9,600 (98.9% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
id              float  %9.0g                  
female          float  %9.0g       fl         
race            float  %12.0g      rl         
ses             float  %9.0g       sl         
schtyp          float  %9.0g       scl        type of school
prog            float  %9.0g       sel        type of program
read            float  %9.0g                  reading score
write           float  %9.0g                  writing score
math            float  %9.0g                  math score
science         float  %9.0g                  science score
socst           float  %9.0g                  social studies score
-------------------------------------------------------------------------------

summarize read write math science

    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------
        read |     200       52.23   10.25294         28         76
       write |     200      52.775   9.478586         31         67
        math |     200      52.645   9.368448         33         75
     science |     200       51.85   9.900891         26         74

tab1 female prog

-> tabulation of female  

     female |      Freq.     Percent        Cum.
------------+-----------------------------------
       male |         91       45.50       45.50
     female |        109       54.50      100.00
------------+-----------------------------------
      Total |        200      100.00

-> tabulation of prog  

    type of |
    program |      Freq.     Percent        Cum.
------------+-----------------------------------
    general |         45       22.50       22.50
   academic |        105       52.50       75.00
   vocation |         50       25.00      100.00
------------+-----------------------------------
      Total |        200      100.00

stem write

Stem-and-leaf plot for write (writing score)

  3* | 1111
  3t | 3333
  3f | 55
  3s | 66777
  3. | 899999
  4* | 0001111111111
  4t | 223
  4f | 4444444444445
  4s | 66666666677
  4. | 99999999999
  5* | 00
  5t | 2222222222222223
  5f | 44444444444444444555
  5s | 777777777777
  5. | 9999999999999999999999999
  6* | 00001111
  6t | 2222222222222222223333
  6f | 5555555555555555
  6s | 7777777

pnorm write



qnorm write



histogram write, start(30) width(5) normal freq



kdensity write, normal
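
The plots above judge normality by eye. As a possible numeric complement (an addition, not part of the original notes), summarize with the detail option reports skewness and kurtosis, and swilk runs the Shapiro-Wilk test of normality.

summarize write, detail  /* skewness near 0 and kurtosis near 3 are consistent with normality */

swilk write  /* Shapiro-Wilk W test; a small p-value suggests non-normality */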

Looking at Data by Group

In addition to exploratory data analysis and general descriptive statistics, you will want to look at the data separately for each group to check how well they meet the assumptions of analysis of variance, in particular the assumptions of normality and homogeneity of variance (homoscedasticity).

In analysis of variance, we often want to look at the variability and shape of the distribution within each cell of the design. Suppose we want to fit an anova model for write with female and prog as the categorical variables. There are two levels of female and three levels of prog, for a total of six cells. Here are some commands that we can use to look at the data at the marginal and/or cell level.

graph box write, over(female) 



graph box write, over(prog)



graph box write, over(female) over(prog)



tabstat write, stat(n mean sd var) by(prog) 

Summary for variables: write
     by categories of: prog (type of program)

    prog |         N      mean        sd  variance
---------+----------------------------------------
 general |        45  51.33333  9.397775  88.31818
academic |       105  56.25714  7.943343   63.0967
vocation |        50     46.76  9.318754  86.83918
---------+----------------------------------------
   Total |       200    52.775  9.478586  89.84359
--------------------------------------------------
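
The group variances above can be compared informally: the largest (88.32 for general) is about 1.4 times the smallest (63.10 for academic). For a formal check of homogeneity of variance, one option (an addition, not from the original notes) is robvar, which reports Levene's test along with two Brown-Forsythe variants.

robvar write, by(prog)  /* W0 is Levene's statistic; W50 and W10 use the median and the 10% trimmed mean */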

tabulate female prog, summ(write)

        Means, Standard Deviations and Frequencies of writing score

           |        type of program
    female |   general   academic   vocation |     Total
-----------+---------------------------------+----------
      male | 49.142857  54.617021  41.826087 | 50.120879
           | 10.364776  8.6566215  8.0037047 | 10.305161
           |        21         47         23 |        91
-----------+---------------------------------+----------
    female |     53.25  57.586207  50.962963 | 54.990826
           | 8.2052475  7.1156721  8.3411929 | 8.1337152
           |        24         58         27 |       109
-----------+---------------------------------+----------
     Total | 51.333333  56.257143      46.76 |    52.775
           | 9.3977754  7.9433433  9.3187544 |  9.478586
           |        45        105         50 |       200

/* or you could use */

table female prog, contents(freq mean write sd write) row col

--------------------------------------------------
          |            type of program            
   female |  general  academic  vocation     Total
----------+---------------------------------------
     male |       21        47        23        91
          | 49.14286  54.61702  41.82609  50.12088
          | 10.36478  8.656622  8.003705  10.30516
          | 
   female |       24        58        27       109
          |    53.25  57.58621  50.96296  54.99083
          | 8.205248  7.115672  8.341193  8.133716
          | 
    Total |       45       105        50       200
          | 51.33333  56.25714     46.76    52.775
          | 9.397776  7.943343  9.318754  9.478586
--------------------------------------------------

/* or you could use margins */

quietly anova write female##prog

margins female#prog, asbalanced

Adjusted predictions                              Number of obs   =        200

Expression   : Linear prediction, predict()

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
 female#prog |
        0 1  |   49.14286   1.803321    27.25   0.000     45.60841     52.6773
        0 2  |   54.61702   1.205407    45.31   0.000     52.25447    56.97958
        0 3  |   41.82609   1.723133    24.27   0.000     38.44881    45.20337
        1 1  |      53.25   1.686852    31.57   0.000     49.94383    56.55617
        1 2  |   57.58621   1.085097    53.07   0.000     55.45946    59.71296
        1 3  |   50.96296    1.59038    32.04   0.000     47.84588    54.08005
------------------------------------------------------------------------------

/* or you could use */

sort female prog

by female prog: summarize write

_______________________________________________________________________________
-> female = male, prog = general

    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------
       write |      21    49.14286   10.36478         31         65
_______________________________________________________________________________
-> female = male, prog = academic

    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------
       write |      47    54.61702   8.656621         33         67
_______________________________________________________________________________
-> female = male, prog = vocation

    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------
       write |      23    41.82609   8.003705         31         63
_______________________________________________________________________________
-> female = female, prog = general

    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------
       write |      24       53.25   8.205248         36         67
_______________________________________________________________________________
-> female = female, prog = academic

    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------
       write |      58    57.58621   7.115672         37         67
_______________________________________________________________________________
-> female = female, prog = vocation

    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------
       write |      27    50.96296   8.341193         35         67 
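
Commands that accept only a single by() or over() variable can still work at the cell level if the six cells are first coded as one grouping variable. A minimal sketch, where the variable name cell is hypothetical:

egen cell = group(female prog), label  /* one value per female-by-prog combination, labeled from the value labels */

tabstat write, stat(n mean sd var) by(cell)

graph box write, over(cell)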


hist write, by(female prog) normal start(30) width(5)  /*  stata 8 */
(bin=7, start=30, width=5)
 



twoway kdensity write, by(prog)



twoway (kdensity write if prog==1)(kdensity write if prog==2)(kdensity write if prog==3), legend(off)

Graphing Cell Means

In addition to looking at the data to check on assumptions, it is often useful to graph the cell means as a way to help understand interactions. In this example, we will use the user-written anovaplot command; type findit anovaplot to locate and install it.


anova write prog female prog#female 


                           Number of obs =     200     R-squared     =  0.2590
                           Root MSE      = 8.26386     Adj R-squared =  0.2399

                  Source |  Partial SS    df       MS           F     Prob > F
             ------------+----------------------------------------------------
                   Model |  4630.36091     5  926.072182      13.56     0.0000
                         |
                    prog |  3274.35082     2  1637.17541      23.97     0.0000
                  female |  1261.85329     1  1261.85329      18.48     0.0000
             prog#female |  325.958189     2  162.979094       2.39     0.0946
                         |
                Residual |  13248.5141   194  68.2913097   
             ------------+----------------------------------------------------
                   Total |   17878.875   199   89.843593  
 
anovaplot prog female, scatter(msym(none))   /* findit anovaplot */
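
Because anovaplot is user-written, it may not be installed. As an alternative sketch, assuming Stata 12 or later (these notes target Stata 11, where marginsplot is not available), the built-in margins and marginsplot commands can plot the same cell means.

quietly anova write prog##female

margins prog#female  /* predicted cell means */

marginsplot, xdimension(prog)  /* prog on the x axis, one line per level of female */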

Phil Ender, 17sep10, 18mar03; 27mar02; 23feb01