Using the hsb2 dataset, consider the correlation between science and math.
use http://www.gseis.ucla.edu/courses/data/hsb2

corr science math
(obs=200)

             |  science     math
-------------+------------------
     science |   1.0000
        math |   0.6307   1.0000

The correlation of 0.63 may be satisfyingly large, but it is also somewhat misleading. It would be tempting to interpret the correlation as reflecting the relationship between a measure of ability in science and a measure of ability in math. The problem is that both the science and math tests are standardized written tests, so general academic skills and intelligence are likely to influence the results of both, leading to an inflated correlation.
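As a quick check on this story, we could look at the partial correlation between science and math after removing the linear effects of the reading and writing scores; Stata's pcorr command will do this (output not shown):

* partial correlations of science with math, read and write,
* each controlling for the other two variables
pcorr science math read write

If general academic skill is inflating the raw correlation, the partial correlation between science and math should come out noticeably smaller than 0.6307.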
In addition to the inflated correlation, there is a more subtle problem that can arise when you try to use these test scores in a regression analysis. Consider the following model:

science = b0 + b1*math + b2*female + e

Because portions of the variability of both science and math are jointly determined by general academic skills and intelligence, there is a strong likelihood that math will be correlated with the error (residuals) in the model. This correlation violates one of the basic independence assumptions of OLS regression. Using the reading and writing scores as indicators of general academic skills and intelligence, we can check out this possibility with the following commands.
regress math female read write

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  3,   196) =   72.33
       Model |  9176.66954     3  3058.88985           Prob > F      =  0.0000
    Residual |  8289.12546   196  42.2914564           R-squared     =  0.5254
-------------+------------------------------           Adj R-squared =  0.5181
       Total |   17465.795   199  87.7678141           Root MSE      =  6.5032

------------------------------------------------------------------------------
        math |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |  -2.023984    .991051    -2.04   0.042    -3.978476   -.0694912
        read |    .385395   .0581257     6.63   0.000      .270763     .500027
       write |   .3888326   .0649587     5.99   0.000     .2607249    .5169402
       _cons |   13.09825    2.80151     4.68   0.000     7.573277    18.62322
------------------------------------------------------------------------------

predict resmath, resid

regress science math female resmath

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  3,   196) =   72.86
       Model |  10284.6648     3  3428.22159           Prob > F      =  0.0000
    Residual |  9222.83523   196  47.0552818           R-squared     =  0.5272
-------------+------------------------------           Adj R-squared =  0.5200
       Total |    19507.50   199  98.0276382           Root MSE      =  6.8597

------------------------------------------------------------------------------
     science |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        math |   1.007845   .0716667    14.06   0.000     .8665077    1.149182
      female |  -1.978643   .9748578    -2.03   0.044      -3.9012   -.0560859
     resmath |  -.7255874    .103985    -6.98   0.000    -.9306604   -.5205144
       _cons |  -.1296216   3.861937    -0.03   0.973    -7.745907    7.486664
------------------------------------------------------------------------------

The significant resmath coefficient indicates that there is a problem with using math as a predictor of science. In a traditional linear regression model, the response variable is considered to be endogenous and the predictors to be exogenous.
An endogenous variable is a variable whose variation is explained by either exogenous variables or other endogenous variables in the model. Exogenous variables are variables whose variability is determined by variables outside of the model.
When one or more of the predictor variables is endogenous, we encounter the problem of the variable being correlated with the error (residual). The test of resmath above can be considered a test of the endogeneity of math, but it is more specifically a test of whether the OLS estimates in the model are consistent.
The ivreg command (or two-stage least squares, 2SLS) is designed to be used in situations in which predictors are endogenous. In essence, ivreg simultaneously estimates two equations,

science = b0 + b1*math + b2*female + e1
math    = a0 + a1*read + a2*write + a3*female + e2
The ivreg command for our example looks like this,
ivreg science female (math = read write)

Instrumental variables (2SLS) regression

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  2,   197) =   69.77
       Model |  5920.63012     2  2960.31506           Prob > F      =  0.0000
    Residual |  13586.8699   197  68.9688827           R-squared     =  0.3035
-------------+------------------------------           Adj R-squared =  0.2964
       Total |    19507.50   199  98.0276382           Root MSE      =  8.3048

------------------------------------------------------------------------------
     science |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        math |   1.007845   .0867641    11.62   0.000      .836739     1.17895
      female |  -1.978643   1.180222    -1.68   0.095    -4.306134    .3488478
       _cons |  -.1296216   4.675495    -0.03   0.978    -9.350068    9.090824
------------------------------------------------------------------------------
Instrumented:  math
Instruments:   female read write
------------------------------------------------------------------------------

predict p1

estimates store ivreg

Next, we can use Stata's hausman command to test whether the differences between the ivreg and OLS estimates are large enough to suggest that the OLS estimates are not consistent.
regress science math female

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  2,   197) =   68.38
       Model |  7993.54995     2  3996.77498           Prob > F      =  0.0000
    Residual |    11513.95   197  58.4464469           R-squared     =  0.4098
-------------+------------------------------           Adj R-squared =  0.4038
       Total |    19507.50   199  98.0276382           Root MSE      =   7.645

------------------------------------------------------------------------------
     science |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        math |   .6631901   .0578724    11.46   0.000      .549061    .7773191
      female |  -2.168396   1.086043    -2.00   0.047    -4.310159    -.026633
       _cons |   18.11813   3.167133     5.72   0.000      11.8723    24.36397
------------------------------------------------------------------------------

predict p2

hausman ivreg . , constant sigmamore

                 ---- Coefficients ----
             |      (b)          (B)          (b-B)     sqrt(diag(V_b-V_B))
             |     ivreg          .        Difference         S.E.
-------------+----------------------------------------------------------------
        math |    1.007845     .6631901     .3446546        .0550478
      female |   -1.978643    -2.168396     .1897529        .0303071
       _cons |   -.1296216     18.11813    -18.24776        2.914507
------------------------------------------------------------------------------
           b = consistent under Ho and Ha; obtained from ivreg
           B = inconsistent under Ha, efficient under Ho; obtained from regress

    Test:  Ho:  difference in coefficients not systematic

                  chi2(1) = (b-B)'[(V_b-V_B)^(-1)](b-B)
                          =       39.20
                Prob>chi2 =      0.0000

Sure enough, there is a significant (chi-square = 39.2, df = 1, p = 0.0000) difference between the ivreg and OLS coefficients, indicating clearly that OLS is an inconsistent estimator for this equation. The conclusion is that the inconsistent estimates are due to the endogeneity of math.
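As an aside, in newer versions of Stata the ivreg command has been superseded by ivregress, and the endogeneity test can then be obtained directly as a postestimation command, without storing estimates and calling hausman by hand. A minimal sketch, assuming Stata 10 or later (output not shown):

* same 2SLS model in the newer syntax
ivregress 2sls science female (math = read write)

* Durbin and Wu-Hausman tests of the endogeneity of math
estat endogenous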
The R2 for the OLS model is much higher than the R2 for the ivreg model, but this is due to the fact that both science and math are correlated with the exogenous variables read and write.
If we wanted to represent this model graphically, it would look something like this,

[path diagram: female, read, and write (squares) point to math (circle); math and female point to science (circle)]

with squares for the exogenous variables and circles for the endogenous variables.
Let's look at the variable science and the two predicted values, p1 from the ivreg model and p2 from the OLS model.
summarize science p1 p2

    Variable |     Obs        Mean   Std. Dev.        Min        Max
-------------+-----------------------------------------------------
     science |     200       51.85   9.900891          26         74
          p1 |     200       51.85   9.522247    31.15061   75.45872
          p2 |     200       51.85    6.33787    37.83501   67.85739

corr science p1 p2
(obs=200)

             |  science       p1       p2
-------------+---------------------------
     science |   1.0000
          p1 |   0.6387   1.0000
          p2 |   0.6401   0.9977   1.0000

Finally, let's see how close we can come to the ivreg results by doing our own two-stage regression.
regress math read write female

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  3,   196) =   72.33
       Model |  9176.66954     3  3058.88985           Prob > F      =  0.0000
    Residual |  8289.12546   196  42.2914564           R-squared     =  0.5254
-------------+------------------------------           Adj R-squared =  0.5181
       Total |   17465.795   199  87.7678141           Root MSE      =  6.5032

------------------------------------------------------------------------------
        math |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        read |    .385395   .0581257     6.63   0.000      .270763     .500027
       write |   .3888326   .0649587     5.99   0.000     .2607249    .5169402
      female |  -2.023984    .991051    -2.04   0.042    -3.978476   -.0694912
       _cons |   13.09825    2.80151     4.68   0.000     7.573277    18.62322
------------------------------------------------------------------------------

predict pmath

regress science pmath female

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  2,   197) =   95.92
       Model |  9624.27766     2  4812.13883           Prob > F      =  0.0000
    Residual |  9883.22234   197  50.1686413           R-squared     =  0.4934
-------------+------------------------------           Adj R-squared =  0.4882
       Total |    19507.50   199  98.0276382           Root MSE      =   7.083

------------------------------------------------------------------------------
     science |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       pmath |   1.007845   .0739996    13.62   0.000     .8619116    1.153778
      female |  -1.978643   1.006591    -1.97   0.051    -3.963721    .0064348
       _cons |  -.1296241   3.987652    -0.03   0.974    -7.993588     7.73434
------------------------------------------------------------------------------

/* this is the ivreg result from above */

Instrumental variables (2SLS) regression

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  2,   197) =   69.77
       Model |  5920.63012     2  2960.31506           Prob > F      =  0.0000
    Residual |  13586.8699   197  68.9688827           R-squared     =  0.3035
-------------+------------------------------           Adj R-squared =  0.2964
       Total |    19507.50   199  98.0276382           Root MSE      =  8.3048

------------------------------------------------------------------------------
     science |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        math |   1.007845   .0867641    11.62   0.000      .836739     1.17895
      female |  -1.978643   1.180222    -1.68   0.095    -4.306134    .3488478
       _cons |  -.1296216   4.675495    -0.03   0.978    -9.350068    9.090824
------------------------------------------------------------------------------
Instrumented:  math
Instruments:   female read write
------------------------------------------------------------------------------

In the first regression, we regressed the endogenous predictor on the three exogenous variables. In the second regression, we used the predicted math (pmath) in place of the instrumented variable in our model. Note that the coefficients in the second regression and in the ivreg are the same, but that the standard errors are different.
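The second-stage standard errors are wrong because they are based on residuals computed with pmath rather than with the observed math. A sketch of the usual hand correction, run immediately after the regress science pmath female step (u2 is a scratch variable of our own):

* squared residuals formed with the actual math score and the
* second-stage coefficients
gen double u2 = (science - (_b[pmath]*math + _b[female]*female + _b[_cons]))^2
quietly summarize u2

* root MSE based on these residuals; should reproduce ivreg's 8.3048
display "corrected Root MSE = " sqrt(r(sum)/(e(N) - e(df_m) - 1))

Multiplying each second-stage standard error by the ratio of the corrected root MSE to the OLS root MSE (8.3048/7.083) recovers the ivreg standard errors; for math, .0739996 * 8.3048/7.083 = .0868.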
One final note: it is also possible to estimate this system of equations using three-stage least squares (3SLS). Stata's reg3 command can perform either 2SLS (equivalent to ivreg) or 3SLS, and it clearly illustrates the two-equation nature of the problem.
reg3 (science = math female)(math = read write female), 2sls

Two-stage least-squares regression
----------------------------------------------------------------------
Equation          Obs  Parms        RMSE    "R-sq"      F-Stat        P
----------------------------------------------------------------------
science           200      2    8.304751    0.3035     69.7726   0.0000
math              200      3    6.503188    0.5254    72.32879   0.0000
----------------------------------------------------------------------

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
science      |
        math |   1.007845   .0867641    11.62   0.000     .8372648    1.178424
      female |  -1.978643   1.180222    -1.68   0.094    -4.298981    .3416951
       _cons |  -.1296216   4.675495    -0.03   0.978    -9.321732    9.062489
-------------+----------------------------------------------------------------
math         |
        read |    .385395   .0581257     6.63   0.000     .2711189    .4996712
       write |   .3888326   .0649587     5.99   0.000     .2611226    .5165425
      female |  -2.023984    .991051    -2.04   0.042    -3.972408     -.075559
       _cons |   13.09825    2.80151     4.68   0.000     7.590429    18.60607
------------------------------------------------------------------------------
Endogenous variables:  science math
Exogenous variables:   female read write
------------------------------------------------------------------------------

reg3 (science = math female)(math = read write female)

Three-stage least squares regression
----------------------------------------------------------------------
Equation          Obs  Parms        RMSE    "R-sq"        chi2        P
----------------------------------------------------------------------
science           200      2     8.24223    0.3035    141.6703   0.0000
math              200      3    6.438234    0.5253    221.4518   0.0000
----------------------------------------------------------------------

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
science      |
        math |   1.007845   .0861109    11.70   0.000     .8390704    1.176619
      female |  -1.978643   1.171337    -1.69   0.091    -4.274421    .3171349
       _cons |  -.1296216   4.640297    -0.03   0.978    -9.224436    8.965192
-------------+----------------------------------------------------------------
math         |
        read |   .3772331   .0496322     7.60   0.000     .2799557    .4745104
       write |   .3981663   .0550155     7.24   0.000      .290338    .5059947
      female |  -2.078337   .9617418    -2.16   0.031    -3.963316   -.1933579
       _cons |   13.06158   2.770268     4.71   0.000     7.631958    18.49121
------------------------------------------------------------------------------
Endogenous variables:  science math
Exogenous variables:   female read write
------------------------------------------------------------------------------
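One last check worth mentioning: whichever estimator is used, the instruments need to be strongly related to the endogenous predictor. With the newer ivregress syntax mentioned earlier, the first-stage diagnostics are available as a postestimation command; a minimal sketch, again assuming Stata 10 or later (output not shown):

quietly ivregress 2sls science female (math = read write)

* first-stage R-squared, partial R-squared and F statistic for the
* excluded instruments read and write
estat firststage

Given the large t-ratios on read and write in the first-stage regression above, weak instruments are clearly not a problem in this example.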
Categorical Data Analysis Course
Phil Ender