The way we use Stata in class is not the way one should analyze "real" research data.
Consider the following hypothetical scenario. You published a research paper two years ago based on a data analysis you did four years ago. A question comes up about one of the analyses, along with a suggested modification to improve it. You go back into your data archives and find 10 different versions of the data file, each slightly different from the others. You cannot tell which data file you used for which analysis. Further, you recall that the analysis was very tricky, requiring several obscure options used together, and you cannot recall the exact combination. You cannot answer the question about your prior analysis because you cannot reproduce the published results, let alone run the suggested modification.
The way to avoid this situation is to perform the data analysis using a series of do-files. These do-files preserve all the steps taken to clean and modify the data along with all of the commands used to analyze the data. The goal is to be able to perfectly reproduce your entire data analysis process from the beginning all the way to the final analysis.
Basically, there are four steps in the data analysis process:
Step 1: Reading the Raw Data
In this example we are given a comma-separated file called hsberr.raw.
type hsberr.raw

70,1,4,1,1,1,57,52,41,47,57
121,2,4,2,1,3,68,59,53,63,61
86,1,4,3,1,1,44,33,54,58,31
141,1,4,3,1,3,63,44,47,53,56
172,1,4,2,1,2,47,52,57,53,61
113,1,4,2,1,2,44,52,51,63,61
50,1,3,2,1,1,50,59,42,53,61
11,1,1,2,1,2,34,46,45,39,36
84,1,4,2,1,1,63,57,54,.,51
48,1,3,2,1,2,57,55,52,50,51
[output omitted]

/* begin hsbinsheet.do */
clear
insheet id gender race ses schtyp prog read write math science socst using hsberr.raw
label def gl 1 "male" 2 "female"
label def rl 1 "hispanic" 2 "asian" 3 "african-amer" 4 "white"
label def sl 1 "low" 2 "middle" 3 "high"
label def scl 1 "public" 2 "private"
label def sel 1 "general" 2 "academic" 3 "vocation"
label val gender gl
label val race rl
label val ses sl
label val schtyp scl
label val prog sel
label var schtyp "type of school"
label var prog "type of program"
label var read "reading score"
label var write "writing score"
label var math "math score"
label var science "science score"
label var socst "social studies score"
save hsberr, replace
/* end hsbinsheet.do */

Inspecting the Data
use http://www.gseis.ucla.edu/courses/data/hsberr

describe

Contains data from hsberr.dta
  obs:           200                          highschool and beyond (200 cases)
 vars:            11                          25 Mar 2003 13:51
 size:         9,600 (99.7% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
id              float  %9.0g
gender          float  %9.0g       gl
race            float  %12.0g      rl
ses             float  %9.0g       sl
schtyp          float  %9.0g       scl        type of school
prog            float  %9.0g       sel        type of program
read            float  %9.0g                  reading score
write           float  %9.0g                  writing score
math            float  %9.0g                  math score
science         float  %9.0g                  science score
socst           float  %9.0g                  social studies score
-------------------------------------------------------------------------------

summarize

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
          id |       200       105.5    96.33093          1       1193
      gender |       198    1.560606    .5554197          1          5
        race |       200        3.44    1.068696          0          5
         ses |       199    2.055276    .7261075          1          3
      schtyp |       199    1.170854    .3904863          1          3
-------------+--------------------------------------------------------
        prog |       200       2.015    .7051676          0          3
        read |       200       52.73    12.24199         28        147
       write |       200      52.775    9.478586         31         67
        math |       200      52.645    9.368448         33         75
     science |       195     51.0359    12.72483        -61         74
-------------+--------------------------------------------------------
       socst |       200      52.405    10.73579         26         71

codebook

-------------------------------------------------------------------------------
id                                                                  (unlabeled)
-------------------------------------------------------------------------------
                  type:  numeric (float)

                 range:  [1,1193]                     units:  1
         unique values:  200                      missing .:  0/200

                  mean:     105.5
              std. dev:   96.3309

           percentiles:        10%       25%       50%       75%       90%
                              20.5      50.5     100.5     150.5     180.5

-------------------------------------------------------------------------------
gender                                                              (unlabeled)
-------------------------------------------------------------------------------
                  type:  numeric (float)
                 label:  gl, but 1 nonmissing value is not labeled

                 range:  [1,5]                        units:  1
         unique values:  3                        missing .:  2/200

            tabulation:  Freq.   Numeric  Label
                            90         1  male
                           107         2  female
                             1         5
                             2         .

-------------------------------------------------------------------------------
race                                                                (unlabeled)
-------------------------------------------------------------------------------
                  type:  numeric (float)
                 label:  rl, but 2 nonmissing values are not labeled

                 range:  [0,5]                        units:  1
         unique values:  6                        missing .:  0/200

            tabulation:  Freq.   Numeric  Label
                             1         0
                            23         1  hispanic
                            11         2  asian
                            20         3  african-amer
                           142         4  white
                             3         5

-------------------------------------------------------------------------------
ses                                                                 (unlabeled)
-------------------------------------------------------------------------------
                  type:  numeric (float)
                 label:  sl

                 range:  [1,3]                        units:  1
         unique values:  3                        missing .:  1/200

            tabulation:  Freq.   Numeric  Label
                            47         1  low
                            94         2  middle
                            58         3  high
                             1         .

-------------------------------------------------------------------------------
schtyp                                                           type of school
-------------------------------------------------------------------------------
                  type:  numeric (float)
                 label:  scl, but 1 nonmissing value is not labeled

                 range:  [1,3]                        units:  1
         unique values:  3                        missing .:  1/200

            tabulation:  Freq.   Numeric  Label
                           166         1  public
                            32         2  private
                             1         3
                             1         .

-------------------------------------------------------------------------------
prog                                                            type of program
-------------------------------------------------------------------------------
                  type:  numeric (float)
                 label:  sel, but 1 nonmissing value is not labeled

                 range:  [0,3]                        units:  1
         unique values:  4                        missing .:  0/200

            tabulation:  Freq.   Numeric  Label
                             1         0
                            45         1  general
                           104         2  academic
                            50         3  vocation

-------------------------------------------------------------------------------
read                                                              reading score
-------------------------------------------------------------------------------
                  type:  numeric (float)

                 range:  [28,147]                     units:  1
         unique values:  31                       missing .:  0/200

                  mean:     52.73
              std. dev:    12.242

           percentiles:        10%       25%       50%       75%       90%
                                39        44        51        60        68

-------------------------------------------------------------------------------
write                                                             writing score
-------------------------------------------------------------------------------
                  type:  numeric (float)

                 range:  [31,67]                      units:  1
         unique values:  29                       missing .:  0/200

                  mean:    52.775
              std. dev:   9.47859

           percentiles:        10%       25%       50%       75%       90%
                                39      45.5        54        60        65

-------------------------------------------------------------------------------
math                                                                 math score
-------------------------------------------------------------------------------
                  type:  numeric (float)

                 range:  [33,75]                      units:  1
         unique values:  40                       missing .:  0/200

                  mean:    52.645
              std. dev:   9.36845

           percentiles:        10%       25%       50%       75%       90%
                                40        45        52        59      65.5

-------------------------------------------------------------------------------
science                                                           science score
-------------------------------------------------------------------------------
                  type:  numeric (float)

                 range:  [-61,74]                     units:  1
         unique values:  34                       missing .:  5/200

                  mean:   51.0359
              std. dev:   12.7248

           percentiles:        10%       25%       50%       75%       90%
                                39        44        53        58        63

-------------------------------------------------------------------------------
socst                                                      social studies score
-------------------------------------------------------------------------------
                  type:  numeric (float)

                 range:  [26,71]                      units:  1
         unique values:  22                       missing .:  0/200

                  mean:    52.405
              std. dev:   10.7358

           percentiles:        10%       25%       50%       75%       90%
                                36        46        52        61        66

Step 2: Data Cleaning
In this step we will use two do-files, one to check the data (hsbtest.do) and one to do the actual data cleaning (hsbclean.do).
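One property of assert is worth keeping in mind when reading the checking do-file: when a condition fails, assert raises an error (return code 9), and an error stops a do-file, so each run reports only the first problem it meets. The following is a minimal sketch of one way to see which observations break a rule. It assumes a tiny made-up data set typed in with input; the three rows are illustrative and are not taken from hsberr.

```stata
* hypothetical three-row data set, for illustration only
clear
input id gender read
1 1 57
2 2 147
3 5 44
end

* capture traps the error so execution continues; _rc holds the return code
capture assert (gender>=1 & gender<=2) | gender==.
display "return code for gender check: " _rc

* list the offending observations rather than only learning that some exist
list id gender if !((gender>=1 & gender<=2) | gender==.)

* same pattern for a test score that should lie between 1 and 100
capture assert (read>=1 & read<=100) | read==.
display "return code for read check: " _rc
list id read if !((read>=1 & read<=100) | read==.)
```

In a checking do-file you may prefer the plain assert used in these notes, because a hard stop makes it impossible to overlook a failed check; the capture form is mainly useful while hunting down the bad values.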
/* begin hsbtest.do */
* `1' is the first argument given to the do-file, e.g. "do hsbtest hsberr"
use `1', clear
* nmissing is a user-written command and must be installed separately
nmissing
summarize
* each assert stops the do-file with an error if any observation fails the check
assert id>=1 & id<=200
assert (gender>=1 & gender<=2) | gender==.
assert (race>=1 & race<=4) | race==.
assert (ses>=1 & ses<=3) | ses==.
assert (schtyp>=1 & schtyp<=2) | schtyp==.
assert (prog>=1 & prog<=3) | prog==.
assert (read>=1 & read<=100) | read==.
assert (write>=1 & write<=100) | write==.
assert (math>=1 & math<=100) | math==.
assert (science>=1 & science<=100) | science==.
assert (socst>=1 & socst<=100) | socst==.
/* end hsbtest.do */

/* begin hsbclean.do */
use hsberr, clear
* correct values that are obvious data-entry errors
replace id=193 if id==1193
replace read=47 if read==147
replace science=61 if science==-61
* set out-of-range codes to missing
replace gender=. if gender==5
replace race=. if race<1 | race>4
replace ses=. if ses<1 | ses>3
replace schtyp=. if schtyp<1 | schtyp>2
replace prog=. if prog<1 | prog>3
label data "hsb clean data using hsbclean.do"
save hsbclean, replace
/* end hsbclean.do */

do hsbtest hsberr
do hsbclean
do hsbtest hsbclean

Step 3: Modifying the Data
In this step we will modify the clean data, creating new variables or modifying existing ones.
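Whenever a variable is derived from another, it is worth checking the result against the original before saving. The following is a minimal sketch of such a check for a 1/2 code turned into a 0/1 indicator. It assumes a small made-up data set typed in with input; the values are illustrative and do not come from hsbclean.

```stata
* hypothetical four-row data set, for illustration only
clear
input gender
1
2
2
.
end

* turn the 1/2 coding into a 0/1 indicator
generate female = gender
recode female 1=0 2=1

* cross-tabulate old against new, showing missing values as well
tabulate gender female, missing

* every nonmissing case must satisfy female == gender - 1
assert female == gender - 1 if !missing(gender)

* missing values must stay missing
assert missing(female) if missing(gender)
```

Placing asserts like these right after the recode means the do-file stops before saving if the new variable ever disagrees with the old one.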
/* begin hsbmod.do */
use hsbclean, clear
* female: 0 = male, 1 = female
generate female = gender
recode female 1=0 2=1
* public: 1 = public, 0 = private
generate public=schtyp
recode public 2=0
label define fem 0 "male" 1 "female"
label value female fem
* rescale the scores so that one unit equals 10 points
generate read10=read/10
generate soc10=socst/10
label var read10 "(read score)/10"
label var soc10 "(social studies score)/10"
note: data modified using hsbmod.do
note: read and soc rescaled -- divided by 10
save hsbmod, replace
/* end hsbmod.do */

do hsbmod

Step 4: Analyzing the Data
/* begin analysis1.do */
* univar, mlogtest, listcoef, fitstat and prchange are user-written commands
* and must be installed before this do-file will run
log using analysis1.log, text replace
use hsbmod, clear
set more off
display "hsb analysis #1 using hsbmod.dta"
summarize read write math science
univar read write math science
tabstat read10, stat(n mean sd p25 p50 p75) by(female)
tabstat read10, stat(n mean sd p25 p50 p75) by(prog)
kdensity read10, normal
more
kdensity soc10, normal
more
* main-effects model, then a likelihood-ratio test against the interaction model
mlogit prog read10 soc10 female
estimates store M1
xi: mlogit prog i.female*read10 i.female*soc10
lrtest M1
mlogit prog read10 soc10 female
mlogtest, lr wald lrcomb combine
listcoef
fitstat
prchange, x(female=0)
prchange, x(female=1)
set more on
log close
/* end analysis1.do */

/* begin analysis2.do */
log using analysis2.log, text replace
use hsbmod, clear
set more off
display "hsb analysis #2 using hsbmod.dta"
tabstat read10, stat(n mean sd p25 p50 p75) by(ses)
* main-effects model, then a likelihood-ratio test against the interaction model
ologit ses read10 soc10 female
estimates store M1
xi: ologit ses i.female*read10 i.female*soc10
lrtest M1
ologit ses read10 soc10 female
listcoef
fitstat
linktest
prchange, x(female=0)
prchange, x(female=1)
set more on
log close
/* end analysis2.do */

do analysis1
type analysis1.log
do analysis2
type analysis2.log
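Because every step lives in its own do-file, the whole process can be rerun from the raw data with one short master do-file. The sketch below is one possible arrangement rather than part of the original notes: the name master.do is hypothetical, and it assumes that all of the do-files above and hsberr.raw sit in the current working directory. The first call to hsbtest is wrapped in capture noisily because the raw data are known to fail the checks, and an uncaptured error would stop the master file at that point.

```stata
/* begin master.do (hypothetical name, not part of the original notes) */
clear

* read the raw data and save hsberr.dta
do hsbinsheet

* check the raw data; failure is expected here, so trap the error and go on
capture noisily do hsbtest hsberr

* clean the data and save hsbclean.dta
do hsbclean

* check the cleaned data; not captured, so any failed assert stops everything
do hsbtest hsbclean

* create the derived variables and save hsbmod.dta
do hsbmod

* run the analyses; each one writes its own log file
do analysis1
do analysis2
/* end master.do */
```

Running do master after any change to an earlier step regenerates every data file and log in order, which is what makes the published results reproducible later.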
Categorical Data Analysis Course
Phil Ender