When analyzing data from a complex survey design, you must take into account design features such as stratification and cluster sampling. If you ignore these aspects of the sampling design, you may end up with biased coefficients and will certainly end up with incorrect standard errors. The next example demonstrates a logistic regression analysis using a stratified random sampling design.
Survey Logit with Stratified Random Sampling
Using API (Academic Performance Index) data provided by the California State Department of Education, we will take a stratified random sample of 100 elementary schools, 50 middle schools and 50 high schools. These are drawn from a total of 4,421 elementary schools, 1,018 middle schools and 755 high schools.
The file apistrat.dta contains the data for the stratified random sample.
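The sampling weights stored in pw can be checked by hand: in a stratified random sample, each sampled school stands for N_h / n_h schools in its stratum. The short Python sketch below is only a calculator for that arithmetic and is not part of the course's Stata code. The names population, sample and sampling_weights are illustrative, and the stratum codes E, M and H are assumed to mean elementary, middle and high schools.

```python
# Population and sample counts per stratum, taken from the text above.
# The stratum codes (E, M, H) are assumed to match the stype variable.
population = {"E": 4421, "M": 1018, "H": 755}
sample = {"E": 100, "M": 50, "H": 50}


def sampling_weights(population, sample):
    """Return the weight N_h / n_h for each stratum h."""
    return {h: population[h] / sample[h] for h in population}


weights = sampling_weights(population, sample)
for stratum, weight in weights.items():
    print(stratum, round(weight, 2))

# Summing weight * sample size over the strata recovers the population total.
estimated_total = sum(weights[h] * sample[h] for h in sample)
print(round(estimated_total))
```

The three weights should agree with the values listed by tabulate pw, and the total should agree with the Population size reported by svylogit.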
use http://www.ats.ucla.edu/stat/stata/stat130/apistrat, clear

svyset [pw=pw], strata(stype) fpc(fpc)

pweight is pw
strata is stype
fpc is fpc

tabulate pw

         pw |      Freq.     Percent        Cum.
------------+-----------------------------------
       15.1 |         50       25.00       25.00
      20.36 |         50       25.00       50.00
      44.21 |        100       50.00      100.00
------------+-----------------------------------
      Total |        200      100.00

tabulate stype

      stype |      Freq.     Percent        Cum.
------------+-----------------------------------
          E |        100       50.00       50.00
          H |         50       25.00       75.00
          M |         50       25.00      100.00
------------+-----------------------------------
      Total |        200      100.00

tabulate fpc

        fpc |      Freq.     Percent        Cum.
------------+-----------------------------------
        755 |         50       25.00       25.00
       1018 |         50       25.00       50.00
       4421 |        100       50.00      100.00
------------+-----------------------------------
      Total |        200      100.00

codebook awards

------------------------------------------------------------------------------
awards                                                     eligible for awards
------------------------------------------------------------------------------

                  type:  numeric (byte)
                 label:  awards

                 range:  [1,2]                        units:  1
         unique values:  2                        missing .:  0/200

            tabulation:  Freq.   Numeric  Label
                            87         1  No
                           113         2  Yes

generate award=awards==2

tabulate award

      award |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |         87       43.50       43.50
          1 |        113       56.50      100.00
------------+-----------------------------------
      Total |        200      100.00

logit award meals ell yr_rnd avg_ed full enroll, nolog

Logit estimates                                   Number of obs   =        200
                                                  LR chi2(6)      =      25.56
                                                  Prob > chi2     =     0.0003
Log likelihood = -124.15328                       Pseudo R2       =     0.0933

------------------------------------------------------------------------------
       award |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       meals |   .0144493   .0111761     1.29   0.196    -.0074555    .0363541
         ell |  -.0076087   .0120865    -0.63   0.529    -.0312978    .0160803
      yr_rnd |   1.396157   .6172752     2.26   0.024     .1863201    2.605995
      avg_ed |   .4774699   .4116758     1.16   0.246    -.3293999     1.28434
        full |   .0233389   .0131167     1.78   0.075    -.0023694    .0490471
      enroll |  -.0010137   .0003046    -3.33   0.001    -.0016107   -.0004167
       _cons |  -4.358013   2.137156    -2.04   0.041    -8.546761    -.169265
------------------------------------------------------------------------------

svylogit award meals ell yr_rnd avg_ed full enroll, nolog

Survey logistic regression

pweight:  pw                                      Number of obs    =       200
Strata:   stype                                   Number of strata =         3
PSU:      <observations>                          Number of PSUs   =       200
FPC:      fpc                                     Population size  =      6194
                                                  F(   6,    192)  =      2.97
                                                  Prob > F         =    0.0086

------------------------------------------------------------------------------
       award |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       meals |   .0028622   .0116124     0.25   0.806    -.0200384    .0257628
         ell |  -.0043035   .0117412    -0.37   0.714    -.0274581    .0188512
      yr_rnd |   1.333261   .6838513     1.95   0.053    -.0153479     2.68187
      avg_ed |   .0238367   .4246062     0.06   0.955    -.8135203    .8611936
        full |   .0206382   .0137685     1.50   0.135    -.0065144    .0477907
      enroll |  -.0011205   .0003004    -3.73   0.000    -.0017129   -.0005281
       _cons |  -2.133523   2.386846    -0.89   0.372    -6.840573    2.573526
------------------------------------------------------------------------------
Finite population correction (FPC) assumes simple random sampling without
replacement of PSUs within each stratum with no subsampling within PSUs.

svylogit, or

Survey logistic regression

pweight:  pw                                      Number of obs    =       200
Strata:   stype                                   Number of strata =         3
PSU:      <observations>                          Number of PSUs   =       200
FPC:      fpc                                     Population size  =      6194
                                                  F(   6,    192)  =      2.97
                                                  Prob > F         =    0.0086

------------------------------------------------------------------------------
       award | Odds Ratio   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       meals |   1.002866   .0116457     0.25   0.806      .980161    1.026098
         ell |   .9957058   .0116908    -0.37   0.714     .9729154     1.01903
      yr_rnd |   3.793393   2.594117     1.95   0.053     .9847692    14.61239
      avg_ed |   1.024123    .434849     0.06   0.955     .4432948    2.365983
        full |   1.020853   .0140556     1.50   0.135     .9935068    1.048951
      enroll |   .9988801   .0003001    -3.73   0.000     .9982885     .999472
------------------------------------------------------------------------------
Finite population correction (FPC) assumes simple random sampling without
replacement of PSUs within each stratum with no subsampling within PSUs.

Survey Logit with One-Stage Cluster Sampling
Another type of sampling design is cluster sampling. In this example we will use school districts as the clusters, or primary sampling units (PSUs). We will take a random sample of 15 school districts and look at all of the schools in each one. There are 757 school districts in the state.
The file apiclus1.dta contains the data for the one-stage cluster sampling design.
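The weight and finite population correction used in this design can be checked with a few lines of arithmetic. The Python sketch below is illustrative only and is not part of the course's Stata code. It assumes the weight is the ratio 6194/183 given in the label of pw, and that the fpc value of 757 is the number of districts in the population, so the sampling fraction of PSUs is 15/757. The constant and function names are made up for the example. The count of 157 is the number of observations reported by the models for this file; the remaining schools are presumably dropped because of missing predictor values.

```python
N_SCHOOLS = 6194     # schools in the state (4,421 + 1,018 + 755)
N_SAMPLED = 183      # schools found in the 15 sampled districts
N_DISTRICTS = 757    # districts in the state (value of the fpc variable)
N_PSU = 15           # districts actually sampled

# Constant weight applied to every sampled school (label of pw: 6194/183).
pw = N_SCHOOLS / N_SAMPLED

# Fraction of PSUs sampled, and the factor (1 - f) that the finite
# population correction applies to the between-PSU variance.
sampling_fraction = N_PSU / N_DISTRICTS
fpc_factor = 1 - sampling_fraction


def estimated_population(weight, n_obs):
    """Sum of the weights over the observations used in a model."""
    return weight * n_obs


print(round(pw, 3))
print(round(estimated_population(pw, 157), 2))
print(round(fpc_factor, 4))
```

Because the weight is the same for every school, weighting does not change the point estimates in this design; the survey adjustment shows up in the standard errors.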
use http://www.ats.ucla.edu/stat/stata/stat130/apiclus1, clear

svyset [pw=pw], psu(dnum) fpc(fpc)

pweight is pw
psu is dnum
fpc is fpc

tabulate pw

       pw = |
   6194/183 |      Freq.     Percent        Cum.
------------+-----------------------------------
     33.847 |        183      100.00      100.00
------------+-----------------------------------
      Total |        183      100.00

tabulate dnum

   district |
     number |      Freq.     Percent        Cum.
------------+-----------------------------------
         61 |         13        7.10        7.10
        135 |         34       18.58       25.68
        178 |          4        2.19       27.87
        197 |         13        7.10       34.97
        255 |         16        8.74       43.72
        406 |          2        1.09       44.81
        413 |          1        0.55       45.36
        437 |          4        2.19       47.54
        448 |         12        6.56       54.10
        510 |         21       11.48       65.57
        568 |          9        4.92       70.49
        637 |         11        6.01       76.50
        716 |         37       20.22       96.72
        778 |          2        1.09       97.81
        815 |          4        2.19      100.00
------------+-----------------------------------
      Total |        183      100.00

tabulate fpc

        fpc |      Freq.     Percent        Cum.
------------+-----------------------------------
        757 |        183      100.00      100.00
------------+-----------------------------------
      Total |        183      100.00

codebook awards

------------------------------------------------------------------------------
awards                                                     eligible for awards
------------------------------------------------------------------------------

                  type:  numeric (byte)
                 label:  awards

                 range:  [1,2]                        units:  1
         unique values:  2                        missing .:  0/183

            tabulation:  Freq.   Numeric  Label
                            53         1  No
                           130         2  Yes

generate award=awards==2

tabulate award

      award |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |         53       28.96       28.96
          1 |        130       71.04      100.00
------------+-----------------------------------
      Total |        183      100.00

logit award meals ell yr_rnd avg_ed full enroll, nolog

Logit estimates                                   Number of obs   =        157
                                                  LR chi2(6)      =      13.44
                                                  Prob > chi2     =     0.0366
Log likelihood = -88.235274                       Pseudo R2       =     0.0708

------------------------------------------------------------------------------
       award |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       meals |  -.0202277   .0111922    -1.81   0.071     -.042164    .0017087
         ell |   .0424034   .0162426     2.61   0.009     .0105685    .0742383
      yr_rnd |   1.522462   1.123789     1.35   0.175    -.6801239    3.725049
      avg_ed |   .0805997   .4109408     0.20   0.845    -.7248294    .8860289
        full |  -.0041249   .0183279    -0.23   0.822    -.0400468    .0317971
      enroll |  -.0007729   .0004401    -1.76   0.079    -.0016356    .0000898
       _cons |  -.1568744   2.670404    -0.06   0.953    -5.390769     5.07702
------------------------------------------------------------------------------

svylogit award meals ell yr_rnd avg_ed full enroll, nolog

Survey logistic regression

pweight:  pw                                      Number of obs    =       157
Strata:   <one>                                   Number of strata =         1
PSU:      dnum                                    Number of PSUs   =        15
FPC:      fpc                                     Population size  = 5313.9784
                                                  F(   6,      9)  =     11.60
                                                  Prob > F         =    0.0009

------------------------------------------------------------------------------
       award |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       meals |  -.0202277   .0076712    -2.64   0.020    -.0366808   -.0037745
         ell |   .0424034   .0164367     2.58   0.022     .0071501    .0776567
      yr_rnd |   1.522462   .1944188     7.83   0.000     1.105475    1.939449
      avg_ed |   .0805997   .3235535     0.25   0.807    -.6133535    .7745529
        full |  -.0041249   .0130164    -0.32   0.756    -.0320422    .0237924
      enroll |  -.0007729   .0004856    -1.59   0.134    -.0018144    .0002687
       _cons |  -.1568744   1.696472    -0.09   0.928    -3.795446    3.481697
------------------------------------------------------------------------------
Finite population correction (FPC) assumes simple random sampling without
replacement of PSUs within each stratum with no subsampling within PSUs.

svylogit, or

Survey logistic regression

pweight:  pw                                      Number of obs    =       157
Strata:   <one>                                   Number of strata =         1
PSU:      dnum                                    Number of PSUs   =        15
FPC:      fpc                                     Population size  = 5313.9784
                                                  F(   6,      9)  =     11.60
                                                  Prob > F         =    0.0009

------------------------------------------------------------------------------
       award | Odds Ratio   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       meals |   .9799755   .0075176    -2.64   0.020     .9639838    .9962326
         ell |   1.043315   .0171487     2.58   0.022     1.007176    1.080752
      yr_rnd |   4.583498   .8911183     7.83   0.000      3.02066     6.95492
      avg_ed |   1.083937   .3507116     0.25   0.807     .5415318    2.169622
        full |   .9958836   .0129628    -0.32   0.756     .9684657    1.024078
      enroll |   .9992274   .0004852    -1.59   0.134     .9981872    1.000269
------------------------------------------------------------------------------
Finite population correction (FPC) assumes simple random sampling without
replacement of PSUs within each stratum with no subsampling within PSUs.
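Two pieces of the survey output can be reproduced by hand. The odds ratios are the exponentiated coefficients, with confidence limits obtained by exponentiating the coefficient limits and standard errors by the delta method (odds ratio times coefficient standard error). The denominator degrees of freedom of the F test come from the design degrees of freedom, the number of PSUs minus the number of strata. The Python sketch below illustrates this arithmetic with the yr_rnd row of the cluster example. The function names are hypothetical, and the rule d - k + 1 for the denominator is stated as an assumption about the adjusted Wald test that appears consistent with both tables above.

```python
import math


def odds_ratio(coef):
    """Exponentiate a logit coefficient to get an odds ratio."""
    return math.exp(coef)


def odds_ratio_interval(lower, upper):
    """Exponentiate the coefficient limits to get odds-ratio limits."""
    return math.exp(lower), math.exp(upper)


def odds_ratio_se(coef, se):
    """Delta-method standard error of the odds ratio."""
    return math.exp(coef) * se


def design_df(n_psu, n_strata):
    """Design degrees of freedom: number of PSUs minus number of strata."""
    return n_psu - n_strata


def adjusted_wald_denominator_df(n_psu, n_strata, n_terms):
    """Denominator df of the F test, assuming the rule d - k + 1."""
    return design_df(n_psu, n_strata) - n_terms + 1


# yr_rnd in the one-stage cluster example (values copied from the output).
print(round(odds_ratio(1.522462), 4))
print(odds_ratio_interval(1.105475, 1.939449))
print(round(odds_ratio_se(1.522462, 0.1944188), 4))

# Six predictors are tested in each model.
print(adjusted_wald_denominator_df(15, 1, 6))    # cluster design: 9
print(adjusted_wald_denominator_df(200, 3, 6))   # stratified design: 192
```

With only 15 PSUs and one stratum, the cluster design has 14 design degrees of freedom, which is why its t tests and confidence intervals rest on far fewer degrees of freedom than the stratified example with 197.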
Categorical Data Analysis Course
Phil Ender