Linear Statistical Models

Coding Categorical Variables

Updated for Stata 11


Consider the Following 4 Group Design:

Level      a1     a2     a3     a4   Total
            1      2      5     10
            3      3      6     10
            2      4      4      9
            2      3      5     11
Mean      2.0    3.0    5.0   10.0     5.0

Dummy Coding

  • For k groups, use k-1 coded vectors.
  • Uses only zeros and ones.
  • Reference group is coded with all zeros.
  • Each coded column is one degree of freedom.
  • Constant is the mean of the reference group.
  • Regression coefficients are the differences between each group mean and the reference group mean.

    Dummy coded variables are also known as indicator variables.

    input y  grp d1  d2  d3
     1   1   1   0   0 
     3   1   1   0   0
     2   1   1   0   0
     2   1   1   0   0
     2   2   0   1   0
     3   2   0   1   0
     4   2   0   1   0
     3   2   0   1   0
     5   3   0   0   1
     6   3   0   0   1
     4   3   0   0   1
     5   3   0   0   1
    10   4   0   0   0
    10   4   0   0   0
     9   4   0   0   0
    11   4   0   0   0
    end
    
    tabstat y, by(grp)
    
    Summary for variables: y
         by categories of: grp 
    
         grp |      mean
    ---------+----------
           1 |         2
           2 |         3
           3 |         5
           4 |        10
    ---------+----------
       Total |         5
    --------------------
    
    regress y d1 d2 d3
    
      Source |       SS       df       MS                  Number of obs =      16
    ---------+------------------------------               F(  3,    12) =   76.00
       Model |      152.00     3  50.6666667               Prob > F      =  0.0000
    Residual |        8.00    12  .666666667               R-squared     =  0.9500
    ---------+------------------------------               Adj R-squared =  0.9375
       Total |      160.00    15  10.6666667               Root MSE      =   .8165
    
    ------------------------------------------------------------------------------
           y |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
          d1 |         -8   .5773503    -13.856   0.000      -9.257938   -6.742062
          d2 |         -7   .5773503    -12.124   0.000      -8.257938   -5.742062
          d3 |         -5   .5773503     -8.660   0.000      -6.257938   -3.742062
       _cons |         10   .4082483     24.495   0.000       9.110503     10.8895
    ------------------------------------------------------------------------------
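The indicator variables do not have to be typed in by hand. The following is a minimal sketch, assuming only y and grp are available, that builds d1-d3 from grp and checks each coefficient against the difference in group means. The word()/real() line is just a compact, illustrative way to rebuild the example data.

```stata
* Rebuild the example data: 4 groups of 4 observations (illustrative shortcut).
clear
set obs 16
generate grp = ceil(_n/4)
generate y = real(word("1 3 2 2 2 3 4 3 5 6 4 5 10 10 9 11", _n))

* k-1 = 3 indicator variables; group 4 is the all-zero reference group.
generate d1 = (grp == 1)
generate d2 = (grp == 2)
generate d3 = (grp == 3)

regress y d1 d2 d3

* Each coefficient should equal (group mean - reference group mean).
display "d1 = " _b[d1] "   mean(1) - mean(4) = " (2 - 10)
display "d2 = " _b[d2] "   mean(2) - mean(4) = " (3 - 10)
display "d3 = " _b[d3] "   mean(3) - mean(4) = " (5 - 10)
display "_cons = " _b[_cons] "   mean(4) = " (10)
```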
    
    Factor variables, introduced in Stata 11, generate the dummy coding automatically in most estimation commands.
    
    regress y i.grp
    
          Source |       SS       df       MS              Number of obs =      16
    -------------+------------------------------           F(  3,    12) =   76.00
           Model |         152     3  50.6666667           Prob > F      =  0.0000
        Residual |           8    12  .666666667           R-squared     =  0.9500
    -------------+------------------------------           Adj R-squared =  0.9375
           Total |         160    15  10.6666667           Root MSE      =   .8165
    
    ------------------------------------------------------------------------------
               y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             grp |
              2  |          1   .5773503     1.73   0.109    -.2579382    2.257938
              3  |          3   .5773503     5.20   0.000     1.742062    4.257938
              4  |          8   .5773503    13.86   0.000     6.742062    9.257938
                 |
           _cons |          2   .4082483     4.90   0.000     1.110503    2.889497
    ------------------------------------------------------------------------------
    
    /* change reference group to grp 4 */
    
    regress y ib4.grp
    
          Source |       SS       df       MS              Number of obs =      16
    -------------+------------------------------           F(  3,    12) =   76.00
           Model |         152     3  50.6666667           Prob > F      =  0.0000
        Residual |           8    12  .666666667           R-squared     =  0.9500
    -------------+------------------------------           Adj R-squared =  0.9375
           Total |         160    15  10.6666667           Root MSE      =   .8165
    
    ------------------------------------------------------------------------------
               y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             grp |
              1  |         -8   .5773503   -13.86   0.000    -9.257938   -6.742062
              2  |         -7   .5773503   -12.12   0.000    -8.257938   -5.742062
              3  |         -5   .5773503    -8.66   0.000    -6.257938   -3.742062
                 |
           _cons |         10   .4082483    24.49   0.000     9.110503     10.8895
    ------------------------------------------------------------------------------
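With dummy coding, every coefficient is a comparison with the reference group. Comparisons between two non-reference groups can still be recovered as differences of coefficients. A hedged sketch using margins and lincom with factor-variable notation (the data-building lines are an illustrative shortcut, not part of the original example):

```stata
* Rebuild the example data: 4 groups of 4 observations (illustrative shortcut).
clear
set obs 16
generate grp = ceil(_n/4)
generate y = real(word("1 3 2 2 2 3 4 3 5 6 4 5 10 10 9 11", _n))

regress y i.grp

* Predicted mean for each group; these should match the tabstat means.
margins grp

* Group 3 versus group 2: difference of two dummy coefficients (3 - 1 = 2).
lincom 3.grp - 2.grp
```

The choice of reference group changes the individual coefficients but not such differences, so the lincom result should be the same after regress y ib4.grp.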
    
    
    /* anova treats all predictors as categorical unless otherwise indicated */
    
    anova y grp
    
                               Number of obs =      16     R-squared     =  0.9500
                               Root MSE      = .816497     Adj R-squared =  0.9375
    
                      Source |  Partial SS    df       MS           F     Prob > F
                  -----------+----------------------------------------------------
                       Model |      152.00     3  50.6666667      76.00     0.0000
                             |
                         grp |      152.00     3  50.6666667      76.00     0.0000
                             |
                    Residual |        8.00    12  .666666667   
                  -----------+----------------------------------------------------
                       Total |      160.00    15  10.6666667   
    
    /* regress with no arguments redisplays the anova results as a regression table */
    
    regress
    
          Source |       SS       df       MS              Number of obs =      16
    -------------+------------------------------           F(  3,    12) =   76.00
           Model |         152     3  50.6666667           Prob > F      =  0.0000
        Residual |           8    12  .666666667           R-squared     =  0.9500
    -------------+------------------------------           Adj R-squared =  0.9375
           Total |         160    15  10.6666667           Root MSE      =   .8165
    
    ------------------------------------------------------------------------------
               y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             grp |
              2  |          1   .5773503     1.73   0.109    -.2579382    2.257938
              3  |          3   .5773503     5.20   0.000     1.742062    4.257938
              4  |          8   .5773503    13.86   0.000     6.742062    9.257938
                 |
           _cons |          2   .4082483     4.90   0.000     1.110503    2.889497
    ------------------------------------------------------------------------------
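The overall anova F for grp can also be obtained from the regression itself by testing all the dummy coefficients jointly. A small sketch, assuming the same data and using testparm with factor-variable notation:

```stata
* Rebuild the example data: 4 groups of 4 observations (illustrative shortcut).
clear
set obs 16
generate grp = ceil(_n/4)
generate y = real(word("1 3 2 2 2 3 4 3 5 6 4 5 10 10 9 11", _n))

regress y i.grp

* Joint test that all grp coefficients are zero; 3 and 12 degrees of freedom.
testparm i.grp
display "F = " r(F) "   df = " r(df) ", " r(df_r)
```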

Effect Coding

  • For k groups, use k-1 coded vectors.
  • Uses ones, zeros, and minus ones.
  • Reference group is coded -1.
  • Each coded column is one degree of freedom.
  • Constant is the unweighted grand mean.
  • Regression coefficients are differences between each group mean and the grand mean.

    Effect coding is sometimes known as deviation coding.

     input y  grp e1  e2  e3
     1   1   1   0   0 
     3   1   1   0   0
     2   1   1   0   0
     2   1   1   0   0
     2   2   0   1   0
     3   2   0   1   0
     4   2   0   1   0
     3   2   0   1   0
     5   3   0   0   1
     6   3   0   0   1
     4   3   0   0   1
     5   3   0   0   1
    10   4  -1  -1  -1
    10   4  -1  -1  -1
     9   4  -1  -1  -1
    11   4  -1  -1  -1
    end
    
    regress y e1 e2 e3
    
      Source |       SS       df       MS                  Number of obs =      16
    ---------+------------------------------               F(  3,    12) =   76.00
       Model |      152.00     3  50.6666667               Prob > F      =  0.0000
    Residual |        8.00    12  .666666667               R-squared     =  0.9500
    ---------+------------------------------               Adj R-squared =  0.9375
       Total |      160.00    15  10.6666667               Root MSE      =   .8165
    
    ------------------------------------------------------------------------------
           y |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
          e1 |         -3   .3535534     -8.485   0.000      -3.770327   -2.229673
          e2 |         -2   .3535534     -5.657   0.000      -2.770327   -1.229673
          e3 |          0   .3535534      0.000   1.000      -.7703266    .7703266
       _cons |          5   .2041241     24.495   0.000       4.555252    5.444748
    ------------------------------------------------------------------------------
    
    test e1 e2 e3
    
     ( 1)  e1 = 0
     ( 2)  e2 = 0
     ( 3)  e3 = 0
    
           F(  3,    12) =   76.00
                Prob > F =    0.0000
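The effect codes can be built from grp, and the effect for the group coded -1 (which has no coefficient of its own) is the negative of the sum of the other effects. A minimal sketch under those assumptions; the data-building lines are an illustrative shortcut:

```stata
* Rebuild the example data: 4 groups of 4 observations (illustrative shortcut).
clear
set obs 16
generate grp = ceil(_n/4)
generate y = real(word("1 3 2 2 2 3 4 3 5 6 4 5 10 10 9 11", _n))

* Effect codes: 1 for the named group, -1 for group 4, 0 otherwise.
generate e1 = (grp == 1) - (grp == 4)
generate e2 = (grp == 2) - (grp == 4)
generate e3 = (grp == 3) - (grp == 4)

regress y e1 e2 e3

* Effect of group 4 = -(e1 + e2 + e3) = mean(4) - grand mean.
display "group 4 effect = " -(_b[e1] + _b[e2] + _b[e3])
display "group 4 mean   = " _b[_cons] - (_b[e1] + _b[e2] + _b[e3])
```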

Orthogonal Coding

  • For k groups, use k-1 coded vectors.
  • All vectors are pairwise orthogonal.
  • Constant is unweighted grand mean.
  • Each coded column is one degree of freedom.

    Example Using Orthogonal Coding

    input y grp x1  x2  x3
     1   1   1   1   1 
     3   1   1   1   1
     2   1   1   1   1
     2   1   1   1   1
     2   2  -1   1   1
     3   2  -1   1   1
     4   2  -1   1   1
     3   2  -1   1   1
     5   3   0  -2   1
     6   3   0  -2   1
     4   3   0  -2   1
     5   3   0  -2   1
    10   4   0   0  -3
    10   4   0   0  -3
     9   4   0   0  -3
    11   4   0   0  -3
    end
    
    table grp, contents(freq mean y sd y)
    
    ----------------------------------------------
          grp |      Freq.     mean(y)       sd(y)
    ----------+-----------------------------------
            1 |          4           2    .8164966
            2 |          4           3    .8164966
            3 |          4           5    .8164966
            4 |          4          10    .8164966
    ----------------------------------------------
    
    corr x1 x2 x3
    (obs=16)
    
                 |       x1       x2       x3
    -------------+---------------------------
              x1 |   1.0000
              x2 |   0.0000   1.0000
              x3 |   0.0000   0.0000   1.0000
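The zero correlations follow from two properties of the coded vectors: each column sums to zero, and each pair of columns has a zero sum of cross-products. A hedged sketch that checks both directly, assuming equal group sizes as in this example (the product variable names are made up for illustration):

```stata
* Only grp is needed to check the coding; 4 groups of 4 observations.
clear
set obs 16
generate grp = ceil(_n/4)

* Orthogonal codes matching the x1-x3 columns of the example.
generate x1 = (grp == 1) - (grp == 2)
generate x2 = (grp <= 2) - 2*(grp == 3)
generate x3 = (grp <= 3) - 3*(grp == 4)

* Cross-products of each pair of columns (hypothetical names).
generate x1x2 = x1*x2
generate x1x3 = x1*x3
generate x2x3 = x2*x3

* Every sum should be zero: columns sum to zero and pairs are orthogonal.
tabstat x1 x2 x3 x1x2 x1x3 x2x3, statistics(sum)
```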
    
    Anova
    
    anova y grp
    
                               Number of obs =      16     R-squared     =  0.9500
                               Root MSE      = .816497     Adj R-squared =  0.9375
    
                      Source |  Partial SS    df       MS           F     Prob > F
                  -----------+----------------------------------------------------
                       Model |      152.00     3  50.6666667      76.00     0.0000
                             |
                         grp |      152.00     3  50.6666667      76.00     0.0000
                             |
                    Residual |        8.00    12  .666666667   
                  -----------+----------------------------------------------------
                       Total |      160.00    15  10.6666667 
    				   
    Regression Analysis Using Orthogonal Coding
    
    regress y x1 x2 x3
    
      Source |       SS       df       MS                  Number of obs =      16
    ---------+------------------------------               F(  3,    12) =   76.00
       Model |      152.00     3  50.6666667               Prob > F      =  0.0000
    Residual |        8.00    12  .666666667               R-squared     =  0.9500
    ---------+------------------------------               Adj R-squared =  0.9375
       Total |      160.00    15  10.6666667               Root MSE      =   .8165
    
    ------------------------------------------------------------------------------
           y |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
          x1 |        -.5   .2886751     -1.732   0.109      -1.128969    .1289691
          x2 |  -.8333333   .1666667     -5.000   0.000      -1.196469   -.4701979
          x3 |  -1.666667   .1178511    -14.142   0.000      -1.923442   -1.409891
       _cons |          5   .2041241     24.495   0.000       4.555252    5.444748
    ------------------------------------------------------------------------------
    
    test x1 x2 x3
    
     ( 1)  x1 = 0
     ( 2)  x2 = 0
     ( 3)  x3 = 0
    
           F(  3,    12) =   76.00
                Prob > F =    0.0000
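With equal group sizes and orthogonal columns that each sum to zero, each coefficient can be computed by hand as the sum of (code × group mean) divided by the sum of squared codes. A small arithmetic sketch using the group means 2, 3, 5, and 10 (the scalar names are illustrative):

```stata
* b = sum(code * group mean) / sum(code^2), one contrast at a time.
scalar b1 = (1*2 - 1*3) / (1^2 + (-1)^2)
scalar b2 = (1*2 + 1*3 - 2*5) / (1^2 + 1^2 + (-2)^2)
scalar b3 = (1*2 + 1*3 + 1*5 - 3*10) / (1^2 + 1^2 + 1^2 + (-3)^2)

* These should reproduce the x1, x2, and x3 coefficients in the table above.
display "x1: " b1
display "x2: " b2
display "x3: " b3
```

So x1 compares group 1 with group 2, x2 compares groups 1 and 2 with group 3, and x3 compares groups 1 through 3 with group 4, each scaled by the sum of squared codes.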

    Orthogonal Coding Schema

    
    Grp X1 X2 X3 X4 X5 X6 X7 X8 X9
     1   1  1  1  1  1  1  1  1  1
     2  -1  1  1  1  1  1  1  1  1
     3   0 -2  1  1  1  1  1  1  1
     4   0  0 -3  1  1  1  1  1  1
     5   0  0  0 -4  1  1  1  1  1
     6   0  0  0  0 -5  1  1  1  1
     7   0  0  0  0  0 -6  1  1  1
     8   0  0  0  0  0  0 -7  1  1
     9   0  0  0  0  0  0  0 -8  1
    10   0  0  0  0  0  0  0  0 -9
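The schema follows a simple rule: column j is 1 for groups 1 through j, -j for group j+1, and 0 for later groups. A minimal sketch that generates the columns for 10 groups from that rule, assuming one row per group:

```stata
* One row per group, 10 groups.
clear
set obs 10
generate grp = _n

* Column j: 1 for groups 1..j, -j for group j+1, 0 for all later groups.
forvalues j = 1/9 {
    generate x`j' = cond(grp <= `j', 1, cond(grp == `j' + 1, -`j', 0))
}

list, noobs clean
```

For k groups, only the first k-1 columns and the first k rows are used.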
    


    Linear Statistical Models Course

    Phil Ender, 17sep10, 21Feb02, 17Mar98