Education 231C

Applied Categorical & Nonnormal Data Analysis

The Process of Analyzing Data


The way we use Stata in class is not the way one should analyze "real" research data.

Consider the following hypothetical scenario. You published a research paper two years ago based on a data analysis you did four years ago. A question comes up concerning one of the analyses, suggesting a modification that would improve it. You go back into your data archives and find 10 different versions of the data file, all slightly different. You cannot tell which data file you used for which analysis. Further, you recall that the analysis was very tricky, requiring several obscure options used simultaneously, and you cannot recall the exact combination of options needed. You cannot answer the question concerning your prior analysis because you cannot reproduce the published results, let alone run the modified analysis being suggested.

The way to avoid this situation is to perform the data analysis using a series of do-files. These do-files preserve all the steps taken to clean and modify the data along with all of the commands used to analyze the data. The goal is to be able to perfectly reproduce your entire data analysis process from the beginning all the way to the final analysis.

Basically, there are four steps in the data analysis process:

  1. Reading the raw data
  2. Cleaning the data
  3. Modifying the data
  4. Analyzing the data

One important note: never change your original data file. In this case, never change hsberr.raw. If you change your original data file and make a mistake, you may never be able to completely reproduce your data analysis.
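
As a rough sketch of how a series of do-files can reproduce the whole process, a single master do-file might run one do-file per step, in order. The names hsbtest.do and hsbclean.do are the ones used in these notes; master.do, hsbread.do, hsbmodify.do, hsbanalyze.do and the log file name are hypothetical, and version 13 is only an assumed Stata release.

```stata
* master.do (hypothetical name): rerun the entire analysis from the raw data
version 13                            // assumed release; pins command behavior
clear all
capture log close                     // close any unnamed log left open earlier
log using master.log, replace text    // hypothetical log file name

do hsbread.do       // Step 1: read hsberr.raw (hypothetical do-file name)
do hsbtest.do       // Step 2: check the data
do hsbclean.do      // Step 2: clean the data
do hsbmodify.do     // Step 3: modify the data (hypothetical do-file name)
do hsbanalyze.do    // Step 4: analyze the data (hypothetical do-file name)

log close
```

Because every step starts from hsberr.raw or from a file created by an earlier do-file, running the master do-file rebuilds everything from scratch.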

Step 1: Reading the Raw Data

In this example we are given a comma-separated file called hsberr.raw.
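
One possible way to read the file is sketched below. It assumes Stata 13 or later, where import delimited is available; in older releases, insheet using hsberr.raw, comma clear plays the same role. The output file name hsbraw is hypothetical.

```stata
* Step 1 sketch: read the comma-separated raw file into memory
import delimited using "hsberr.raw", delimiter(",") clear

* Save a Stata-format working copy; hsberr.raw itself is never overwritten
save hsbraw, replace        // hsbraw.dta is a hypothetical file name
```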

Inspecting the Data
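
A minimal sketch of a first look at the data, using only standard Stata commands. It assumes the file reads cleanly with import delimited and contains at least ten observations.

```stata
* Load the raw data, then look at it before changing anything
import delimited using "hsberr.raw", delimiter(",") clear

describe             // variable names, storage types, number of observations
summarize            // minimums and maximums reveal out-of-range values
codebook, compact    // one line per variable, including unique-value counts
list in 1/10         // eyeball the first ten observations
```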

Step 2: Data Cleaning

In this step we will use two do-files, one to check the data (hsbtest.do) and one to do the actual data cleaning (hsbclean.do).
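
The two do-files might look roughly like the sketch below. The variable names (id, gender, read, write, math), the valid range of 0 to 100, the correction rule, and the output file name hsbclean are all assumptions made for illustration; the actual contents of hsberr.raw are not described here.

```stata
* ---- hsbtest.do: check the data without changing anything ----
import delimited using "hsberr.raw", delimiter(",") clear
duplicates report id                  // id is an assumed identifier variable
tabulate gender, missing              // gender is an assumed categorical variable
summarize read write math             // assumed numeric test-score variables
* List nonmissing scores outside the assumed valid range
list id read if !inrange(read, 0, 100) & !missing(read)

* ---- hsbclean.do: apply the corrections and save a cleaned copy ----
import delimited using "hsberr.raw", delimiter(",") clear
* Hypothetical rule: set impossible scores to missing
replace read = . if !inrange(read, 0, 100)
save hsbclean, replace                // hypothetical name; hsberr.raw is untouched
```

Keeping the checks separate from the corrections means the checking do-file can be rerun at any time without altering anything.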

Step 3: Modifying the Data

In this step we will modify the clean data, creating new variables or modifying existing ones.
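
A sketch of what such a do-file could contain, assuming the cleaned data were saved as hsbclean.dta and include numeric variables read and write. The file names, variable names, and the cutoff of 50 are hypothetical.

```stata
* hsbmodify.do (hypothetical name): create and modify variables
use hsbclean, clear                   // assumed cleaned file from Step 2

* Create new variables
generate total = read + write
label variable total "Sum of reading and writing scores"
generate byte hiread = read >= 50 if !missing(read)   // hypothetical cutoff

* Modify an existing variable (here, only its label)
label variable read "Reading score"

save hsbmod, replace                  // hypothetical output file name
```

The if !missing(read) qualifier keeps hiread missing wherever read is missing, since Stata otherwise treats a missing value as larger than any number.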

Step 4: Analyzing the Data
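
An analysis do-file might be sketched as follows. It assumes a modified data file hsbmod.dta with a binary variable hiread, a categorical variable gender, and a numeric variable write. All of these names are hypothetical, and the two commands merely stand in for whatever analyses the project calls for.

```stata
* hsbanalyze.do (hypothetical name): run the analyses and log the results
capture log close analysis
log using hsbanalyze.log, replace text name(analysis)   // named log

use hsbmod, clear                     // assumed file from the modification step

tabulate hiread gender, chi2          // two-way table with a chi-squared test
logit hiread write                    // logistic regression of hiread on write

log close analysis
```

Using a named log keeps this do-file from closing any other log that happens to be open when it runs.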


Categorical Data Analysis Course

Phil Ender