Applied Categorical & Nonnormal Data Analysis

Survey Logistic Regression

When analyzing data from a survey design, you must take into account features of the design such as stratification and cluster sampling. If you ignore these features, you may end up with biased coefficients and certainly with incorrect standard errors. The next example demonstrates a logistic analysis using a stratified random sampling design.

Survey Logit with Stratified Random Sampling

Using API (Academic Performance Index) data provided by the California State Department of Education, we will take a stratified random sample of 100 elementary schools, 50 middle schools and 50 high schools. These are drawn from a total of 4,421 elementary schools, 1,018 middle schools and 755 high schools.

The file apistrat.dta contains the data for the stratified random sample.

use http://www.ats.ucla.edu/stat/stata/stat130/apistrat, clear

svyset [pw=pw], strata(stype) fpc(fpc)
pweight is pw
strata is stype
fpc is fpc

tabulate pw

         pw |      Freq.     Percent        Cum.
------------+-----------------------------------
       15.1 |         50       25.00       25.00
      20.36 |         50       25.00       50.00
      44.21 |        100       50.00      100.00
------------+-----------------------------------
      Total |        200      100.00

tabulate stype

      stype |      Freq.     Percent        Cum.
------------+-----------------------------------
          E |        100       50.00       50.00
          H |         50       25.00       75.00
          M |         50       25.00      100.00
------------+-----------------------------------
      Total |        200      100.00

tabulate fpc

        fpc |      Freq.     Percent        Cum.
------------+-----------------------------------
        755 |         50       25.00       25.00
       1018 |         50       25.00       50.00
       4421 |        100       50.00      100.00
------------+-----------------------------------
      Total |        200      100.00
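The values of pw tabulated above can be reproduced from the stratum counts given in the text. The following is a minimal sketch in Python (not part of the original Stata session; the variable names are illustrative), assuming each school's weight is its stratum population count divided by its stratum sample size:

```python
# Stratum population counts (N) and sample sizes (n), taken from the text.
strata = {
    "E": {"N": 4421, "n": 100},  # elementary schools
    "M": {"N": 1018, "n": 50},   # middle schools
    "H": {"N": 755, "n": 50},    # high schools
}

# Sampling weight for every sampled school in a stratum: N / n.
weights = {h: s["N"] / s["n"] for h, s in strata.items()}

# Summing the weights over all sampled schools estimates the population size.
population_size = sum(s["n"] * weights[h] for h, s in strata.items())

for h, w in weights.items():
    print(f"stratum {h}: pw = {w:.2f}")
print(f"sum of weights = {population_size:.0f}")
```

The weights come out as 44.21, 20.36 and 15.1, matching the tabulation of pw, and they sum to 6,194, the total number of schools in the three strata.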

codebook awards

---------------------------------------------------------------------------------------------------------------
awards                                                                                      eligible for awards
---------------------------------------------------------------------------------------------------------------

                  type:  numeric (byte)
                 label:  awards

                 range:  [1,2]                        units:  1
         unique values:  2                        missing .:  0/200

            tabulation:  Freq.   Numeric  Label
                            87         1  No
                           113         2  Yes

generate award=awards==2

tabulate award

      award |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |         87       43.50       43.50
          1 |        113       56.50      100.00
------------+-----------------------------------
      Total |        200      100.00

logit award meals ell yr_rnd avg_ed full enroll, nolog

Logit estimates                                   Number of obs   =        200
                                                  LR chi2(6)      =      25.56
                                                  Prob > chi2     =     0.0003
Log likelihood = -124.15328                       Pseudo R2       =     0.0933

------------------------------------------------------------------------------
       award |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       meals |   .0144493   .0111761     1.29   0.196    -.0074555    .0363541
         ell |  -.0076087   .0120865    -0.63   0.529    -.0312978    .0160803
      yr_rnd |   1.396157   .6172752     2.26   0.024     .1863201    2.605995
      avg_ed |   .4774699   .4116758     1.16   0.246    -.3293999     1.28434
        full |   .0233389   .0131167     1.78   0.075    -.0023694    .0490471
      enroll |  -.0010137   .0003046    -3.33   0.001    -.0016107   -.0004167
       _cons |  -4.358013   2.137156    -2.04   0.041    -8.546761    -.169265
------------------------------------------------------------------------------

svylogit award meals ell yr_rnd avg_ed full enroll, nolog

Survey logistic regression

pweight:  pw                                      Number of obs    =       200
Strata:   stype                                   Number of strata =         3
PSU:      <observations>                          Number of PSUs   =       200
FPC:      fpc                                     Population size  =      6194
                                                  F(   6,    192)  =      2.97
                                                  Prob > F         =    0.0086

------------------------------------------------------------------------------
       award |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       meals |   .0028622   .0116124     0.25   0.806    -.0200384    .0257628
         ell |  -.0043035   .0117412    -0.37   0.714    -.0274581    .0188512
      yr_rnd |   1.333261   .6838513     1.95   0.053    -.0153479     2.68187
      avg_ed |   .0238367   .4246062     0.06   0.955    -.8135203    .8611936
        full |   .0206382   .0137685     1.50   0.135    -.0065144    .0477907
      enroll |  -.0011205   .0003004    -3.73   0.000    -.0017129   -.0005281
       _cons |  -2.133523   2.386846    -0.89   0.372    -6.840573    2.573526
------------------------------------------------------------------------------
Finite population correction (FPC) assumes simple random sampling without 
replacement of PSUs within each stratum with no subsampling within PSUs.
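To give a sense of how much the finite population correction matters here, the sketch below computes the usual textbook correction factor, 1 - n/N, for each stratum. This is an illustration in Python under that assumption; the source does not show the exact formula Stata applies, and the function name is hypothetical.

```python
def fpc_factor(n_sampled, n_population):
    """Return the variance multiplier 1 - n/N for sampling without replacement."""
    return 1.0 - n_sampled / n_population

# (schools sampled, schools in stratum) as given in the text above.
strata = {"E": (100, 4421), "M": (50, 1018), "H": (50, 755)}

factors = {h: fpc_factor(n, N) for h, (n, N) in strata.items()}
for h, f in factors.items():
    print(f"stratum {h}: sampling fraction = {1 - f:.4f}, fpc factor = {f:.4f}")
```

Because at most about 7% of any stratum is sampled, every factor is close to 1, so the correction shrinks the variance estimates only slightly in this example.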

svylogit, or

Survey logistic regression

pweight:  pw                                      Number of obs    =       200
Strata:   stype                                   Number of strata =         3
PSU:      <observations>                          Number of PSUs   =       200
FPC:      fpc                                     Population size  =      6194
                                                  F(   6,    192)  =      2.97
                                                  Prob > F         =    0.0086

------------------------------------------------------------------------------
       award | Odds Ratio   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       meals |   1.002866   .0116457     0.25   0.806      .980161    1.026098
         ell |   .9957058   .0116908    -0.37   0.714     .9729154     1.01903
      yr_rnd |   3.793393   2.594117     1.95   0.053     .9847692    14.61239
      avg_ed |   1.024123    .434849     0.06   0.955     .4432948    2.365983
        full |   1.020853   .0140556     1.50   0.135     .9935068    1.048951
      enroll |   .9988801   .0003001    -3.73   0.000     .9982885     .999472
------------------------------------------------------------------------------
Finite population correction (FPC) assumes simple random sampling without 
replacement of PSUs within each stratum with no subsampling within PSUs.
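The odds-ratio table is a transformation of the coefficient table, not a new fit. The Python sketch below reproduces three rows from the coefficients shown earlier, assuming the usual conventions: the odds ratio is exp(coefficient), its standard error follows from the delta method (odds ratio times the coefficient's standard error), and the confidence limits are the exponentiated coefficient limits. The function name is illustrative.

```python
import math

# (coefficient, std. err., lower limit, upper limit) copied from the
# svylogit coefficient table for the stratified sample.
coefs = {
    "meals":  (0.0028622, 0.0116124, -0.0200384, 0.0257628),
    "yr_rnd": (1.333261, 0.6838513, -0.0153479, 2.68187),
    "enroll": (-0.0011205, 0.0003004, -0.0017129, -0.0005281),
}

def to_odds_ratio(b, se, lo, hi):
    """Convert a logit coefficient row to the odds-ratio scale."""
    odds_ratio = math.exp(b)
    # Delta-method standard error: |d exp(b)/db| * se = exp(b) * se.
    return odds_ratio, odds_ratio * se, math.exp(lo), math.exp(hi)

odds = {name: to_odds_ratio(*row) for name, row in coefs.items()}
for name, (o, se, lo, hi) in odds.items():
    print(f"{name:>7}: OR = {o:.6f}  SE = {se:.7f}  CI = [{lo:.6f}, {hi:.5f}]")
```

The t statistics and p-values are the same in both tables because the test is carried out on the coefficient scale.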

Survey Logit with One-Stage Cluster Sampling

Another type of sampling design is cluster sampling. In this example we will use school districts as the clusters, or primary sampling units (PSUs). We will take a random sample of 15 of the state's 757 school districts and include all of the schools in each sampled district.

The file apiclus1.dta contains the data for the one-stage cluster sampling design.

use http://www.ats.ucla.edu/stat/stata/stat130/apiclus1, clear

svyset [pw=pw], psu(dnum) fpc(fpc)
pweight is pw
psu is dnum
fpc is fpc

tabulate pw

       pw = |
   6194/183 |      Freq.     Percent        Cum.
------------+-----------------------------------
     33.847 |        183      100.00      100.00
------------+-----------------------------------
      Total |        183      100.00

tabulate dnum

   district |
     number |      Freq.     Percent        Cum.
------------+-----------------------------------
         61 |         13        7.10        7.10
        135 |         34       18.58       25.68
        178 |          4        2.19       27.87
        197 |         13        7.10       34.97
        255 |         16        8.74       43.72
        406 |          2        1.09       44.81
        413 |          1        0.55       45.36
        437 |          4        2.19       47.54
        448 |         12        6.56       54.10
        510 |         21       11.48       65.57
        568 |          9        4.92       70.49
        637 |         11        6.01       76.50
        716 |         37       20.22       96.72
        778 |          2        1.09       97.81
        815 |          4        2.19      100.00
------------+-----------------------------------
      Total |        183      100.00

tabulate fpc

        fpc |      Freq.     Percent        Cum.
------------+-----------------------------------
        757 |        183      100.00      100.00
------------+-----------------------------------
      Total |        183      100.00

codebook awards

---------------------------------------------------------------------------------------------------------------
awards                                                                                      eligible for awards
---------------------------------------------------------------------------------------------------------------

                  type:  numeric (byte)
                 label:  awards

                 range:  [1,2]                        units:  1
         unique values:  2                        missing .:  0/183

            tabulation:  Freq.   Numeric  Label
                            53         1  No
                           130         2  Yes

generate award=awards==2

tabulate award

      award |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |         53       28.96       28.96
          1 |        130       71.04      100.00
------------+-----------------------------------
      Total |        183      100.00

logit award meals ell yr_rnd avg_ed full enroll, nolog

Logit estimates                                   Number of obs   =        157
                                                  LR chi2(6)      =      13.44
                                                  Prob > chi2     =     0.0366
Log likelihood = -88.235274                       Pseudo R2       =     0.0708

------------------------------------------------------------------------------
       award |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       meals |  -.0202277   .0111922    -1.81   0.071     -.042164    .0017087
         ell |   .0424034   .0162426     2.61   0.009     .0105685    .0742383
      yr_rnd |   1.522462   1.123789     1.35   0.175    -.6801239    3.725049
      avg_ed |   .0805997   .4109408     0.20   0.845    -.7248294    .8860289
        full |  -.0041249   .0183279    -0.23   0.822    -.0400468    .0317971
      enroll |  -.0007729   .0004401    -1.76   0.079    -.0016356    .0000898
       _cons |  -.1568744   2.670404    -0.06   0.953    -5.390769     5.07702
------------------------------------------------------------------------------

svylogit award meals ell yr_rnd avg_ed full enroll, nolog

Survey logistic regression

pweight:  pw                                      Number of obs    =       157
Strata:   <one>                                   Number of strata =         1
PSU:      dnum                                    Number of PSUs   =        15
FPC:      fpc                                     Population size  = 5313.9784
                                                  F(   6,      9)  =     11.60
                                                  Prob > F         =    0.0009

------------------------------------------------------------------------------
       award |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       meals |  -.0202277   .0076712    -2.64   0.020    -.0366808   -.0037745
         ell |   .0424034   .0164367     2.58   0.022     .0071501    .0776567
      yr_rnd |   1.522462   .1944188     7.83   0.000     1.105475    1.939449
      avg_ed |   .0805997   .3235535     0.25   0.807    -.6133535    .7745529
        full |  -.0041249   .0130164    -0.32   0.756    -.0320422    .0237924
      enroll |  -.0007729   .0004856    -1.59   0.134    -.0018144    .0002687
       _cons |  -.1568744   1.696472    -0.09   0.928    -3.795446    3.481697
------------------------------------------------------------------------------
Finite population correction (FPC) assumes simple random sampling without 
replacement of PSUs within each stratum with no subsampling within PSUs.
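Several header values in the survey output can be checked by hand. The Python sketch below does so under two assumptions that the source does not state explicitly: that only 157 of the 183 sampled schools enter the model because the others have missing predictor values, and that the F test uses the adjusted Wald form with design degrees of freedom d = PSUs - strata and denominator d - k + 1 for k predictors. The function name is illustrative.

```python
# Figures copied from the text and output above.
schools_in_population = 6194
schools_in_sample = 183
schools_in_model = 157      # "Number of obs" in both regressions
n_predictors = 6

# Every sampled school carries the same weight in this design.
pw = schools_in_population / schools_in_sample

# "Population size" is the sum of the weights of the schools actually used.
population_size = schools_in_model * pw

def wald_df(n_psus, n_strata, k):
    """Return (design df, denominator df) for the adjusted Wald F test."""
    design_df = n_psus - n_strata
    return design_df, design_df - k + 1

cluster_df = wald_df(n_psus=15, n_strata=1, k=n_predictors)
stratified_df = wald_df(n_psus=200, n_strata=3, k=n_predictors)

print(f"pw = {pw:.3f}, population size = {population_size:.2f}")
print(f"cluster design:    df = {cluster_df[0]}, F denominator = {cluster_df[1]}")
print(f"stratified design: df = {stratified_df[0]}, F denominator = {stratified_df[1]}")
```

With only 15 districts the cluster design leaves 14 design degrees of freedom, which is why the F statistic is reported as F(6, 9), compared with F(6, 192) for the stratified sample.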

svylogit, or

Survey logistic regression

pweight:  pw                                      Number of obs    =       157
Strata:   <one>                                   Number of strata =         1
PSU:      dnum                                    Number of PSUs   =        15
FPC:      fpc                                     Population size  = 5313.9784
                                                  F(   6,      9)  =     11.60
                                                  Prob > F         =    0.0009

------------------------------------------------------------------------------
       award | Odds Ratio   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       meals |   .9799755   .0075176    -2.64   0.020     .9639838    .9962326
         ell |   1.043315   .0171487     2.58   0.022     1.007176    1.080752
      yr_rnd |   4.583498   .8911183     7.83   0.000      3.02066     6.95492
      avg_ed |   1.083937   .3507116     0.25   0.807     .5415318    2.169622
        full |   .9958836   .0129628    -0.32   0.756     .9684657    1.024078
      enroll |   .9992274   .0004852    -1.59   0.134     .9981872    1.000269
------------------------------------------------------------------------------
Finite population correction (FPC) assumes simple random sampling without 
replacement of PSUs within each stratum with no subsampling within PSUs.
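In the cluster sample every school has the same weight, so logit and svylogit give identical coefficients and differ only in their standard errors. The Python sketch below compares the two sets of standard errors, using values copied from the cluster-sample output above; the ratio is an informal summary computed here for illustration, not a statistic reported by Stata.

```python
# (logit std. err., svylogit std. err.) copied from the cluster-sample output.
cluster_se = {
    "meals":  (0.0111922, 0.0076712),
    "ell":    (0.0162426, 0.0164367),
    "yr_rnd": (1.123789, 0.1944188),
    "avg_ed": (0.4109408, 0.3235535),
    "full":   (0.0183279, 0.0130164),
    "enroll": (0.0004401, 0.0004856),
}

# Ratio of the design-based standard error to the naive one.
ratios = {name: design / naive for name, (naive, design) in cluster_se.items()}

for name, r in ratios.items():
    direction = "larger" if r > 1 else "smaller"
    print(f"{name:>7}: design-based SE is {r:.3f} times the logit SE ({direction})")
```

The design-based standard errors are not uniformly larger or smaller: the one for yr_rnd shrinks to less than a fifth of its naive value, while those for ell and enroll grow slightly. Ignoring the design can therefore mislead in either direction.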

Categorical Data Analysis Course

Phil Ender