Education 231C

Applied Categorical & Nonnormal Data Analysis

Instrumental Variables Regression


Using the hsb2 dataset, consider the correlation between science and math.
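The command listing for this step is not shown here. A minimal sketch of how the correlation could be obtained, assuming a local copy of the data saved as hsb2.dta in the working directory (the file location is an assumption, not something these notes specify):

```stata
* Assumption: hsb2.dta sits in the current working directory
use hsb2, clear

* Pearson correlation between the science and math test scores
correlate science math
```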

The correlation of 0.63 may be satisfyingly large, but it is also somewhat misleading. It would be tempting to interpret it as reflecting the relationship between ability in science and ability in math. The problem is that both the science and math tests are standardized written tests, so general academic skills and intelligence are likely to influence the results of both, leading to an inflated correlation. Beyond the inflated correlation, a more subtle problem can arise when you try to use these test scores in a regression analysis. Consider the following model:
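The model itself is not reproduced in this copy of the notes. A plausible reconstruction, assuming science is regressed on math and female (female is inferred from its later description as an exogenous variable in the model; the symbols are illustrative):

```latex
\[
\text{science}_i = \beta_0 + \beta_1\,\text{math}_i + \beta_2\,\text{female}_i + \varepsilon_i
\]
```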

Because portions of the variability of both science and math are jointly determined by general academic skills and intelligence, there is a strong likelihood that math will be correlated with the error (residuals) in the model. Such a correlation violates a basic assumption of OLS regression: that the predictors are uncorrelated with the error term. Using reading and writing scores as indicators of general academic skills and intelligence, we can check this possibility with the following commands.
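The original commands are missing from this copy. One way to run such a check is an augmented regression, sketched below under two assumptions: hsb2 is already in memory, and female is included in the auxiliary regression alongside read and write. The residual variable is named resmath to match the discussion that follows.

```stata
* Step 1: regress math on the exogenous variables and save the residuals
regress math read write female
predict resmath, residuals

* Step 2: add those residuals to the model of interest
regress science math female resmath

* A significant coefficient on resmath suggests that math is
* correlated with the error term in the science equation
test resmath
```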

The significant resmath coefficient indicates that there is a problem with using math as a predictor of science. In a traditional linear regression model the response variable is considered to be endogenous and the predictors to be exogenous.

An endogenous variable is a variable whose variation is explained by either exogenous variables or other endogenous variables in the model. Exogenous variables are variables whose variability is determined by variables outside of the model.

When one or more of the predictor variables is endogenous, we encounter the problem of that variable being correlated with the error (residual). The test of resmath (above) can be considered a test of the endogeneity of math, but it is more specifically a test of whether the OLS estimates in the model are consistent.

The ivreg command (or two-stage least squares; 2SLS) is designed to be used in situations in which predictors are endogenous. In essence, ivreg simultaneously estimates two equations,
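The two equations are not reproduced in this copy. A reconstruction consistent with the surrounding description (read, write and female predicting math, and the instrumented math* predicting science) might look like the following. The coefficient symbols are illustrative, and including female in the second equation is an assumption.

```latex
\begin{align*}
\text{math}_i    &= \gamma_0 + \gamma_1\,\text{read}_i + \gamma_2\,\text{write}_i + \gamma_3\,\text{female}_i + u_i \\
\text{science}_i &= \beta_0 + \beta_1\,\text{math}^{*}_i + \beta_2\,\text{female}_i + \varepsilon_i
\end{align*}
```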

Now we have the situation in which read, write and female are exogenous and are instruments used to predict math, which is treated as an endogenous variable. In the second equation above math* is used to indicate that it is the instrumented form of the variable math that is being used.

The ivreg command for our example looks like this:
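The command itself is missing from this copy. Based on the description above (math endogenous, read and write as instruments, female as an exogenous regressor), it would probably resemble the sketch below, which assumes hsb2 is in memory. Newer Stata releases document the equivalent estimator as ivregress.

```stata
* Exogenous regressor: female
* Endogenous regressor: math, instrumented by read and write
ivreg science female (math = read write)

* Equivalent call in newer Stata releases; the small option requests
* small-sample statistics comparable to those ivreg reports
ivregress 2sls science female (math = read write), small
```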

Next, we can use Stata's hausman command to test whether the differences between the ivreg and OLS estimates are large enough to suggest that the OLS estimates are not consistent. Sure enough, there is a significant difference (chi-square = 39.2, df = 1, p < 0.0001) between the ivreg and OLS coefficients, indicating clearly that OLS is an inconsistent estimator in this equation. The conclusion is that the inconsistency is due to the endogeneity of math.
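The hausman commands are not shown in this copy. A sketch using the stored-estimates syntax, assuming hsb2 is in memory; the names iv and ols are arbitrary labels, and the options chosen here are assumptions, so the reported statistic may not match the figure quoted above exactly:

```stata
* Consistent estimator under endogeneity: instrumental variables
ivreg science female (math = read write)
estimates store iv

* Efficient estimator if math is in fact exogenous: OLS
regress science math female
estimates store ols

* Compare the two sets of coefficients; sigmamore bases both covariance
* matrices on the error variance from the efficient (OLS) estimator
hausman iv ols, constant sigmamore
```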

The R-squared for the OLS model is much higher than the R-squared for the ivreg model, but this is because both science and math are correlated with the exogenous variables read and write.

If we wanted to represent this model graphically, it would look something like this

with squares for the exogenous variables and circles for the endogenous variables.

Let's look at the variable science and the two predicted values, p1 from the ivreg model and p2 from the OLS model.
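The listing for this comparison is missing. A sketch of how p1 and p2 could be generated and compared, assuming hsb2 is in memory and the same model specifications as above:

```stata
* Predicted science from the instrumental variables model
ivreg science female (math = read write)
predict p1

* Predicted science from the OLS model
regress science math female
predict p2

* Compare the observed scores with the two sets of predictions
summarize science p1 p2
correlate science p1 p2
```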

Finally, let's see how close we can come to the ivreg results by doing our own two-stage regression. In the first regression, we regress the endogenous predictor on the three exogenous variables. In the second regression, we use the predicted math (pmath) as the instrumented variable in our model. Note that the coefficients in the second regression and in ivreg are the same, but the standard errors are different. The manual second stage computes its residual variance using pmath in place of the observed math, so its standard errors are not the correct 2SLS standard errors.
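The two regressions are not reproduced in this copy. They could be sketched as follows, assuming hsb2 is in memory; pmath is the name used in the text for the first-stage prediction:

```stata
* First stage: endogenous predictor on the three exogenous variables
regress math read write female
predict pmath

* Second stage: replace math with its first-stage prediction
* (coefficients match 2SLS; the standard errors do not)
regress science pmath female
```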

One final note: it is also possible to estimate this system of equations using three-stage least squares (3SLS). Stata's reg3 command can perform either 2SLS (equivalent to ivreg) or 3SLS, and it clearly illustrates the two-equation nature of the problem.
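A sketch of what the reg3 calls might look like, assuming hsb2 is in memory and the same two equations described earlier. In reg3, each parenthesized list gives a dependent variable followed by its predictors, and the dependent variables are treated as endogenous by default.

```stata
* Two-stage least squares, equation by equation
reg3 (science math female) (math read write female), 2sls

* Three-stage least squares (the reg3 default)
reg3 (science math female) (math read write female)
```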


Categorical Data Analysis Course

Phil Ender