Education 231C

Applied Categorical & Nonnormal Data Analysis

Selection Models

Consider a model in which we try to predict women's wages from their education and age. We have an artificially constructed sample of 2,000 women, but we have wage data for only 1,343 of them. The remaining 657 women were not working and so did not receive wages. We will start off with a simple-minded model in which we estimate the regression model using only the observations that have wage data.

First Try
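A minimal Stata sketch of this first model follows. The variable names wage, education, and age and the data-loading step are assumptions: Stata's womenwk example file appears to have the same 2,000 / 1,343 split and is used here only as a stand-in for the course dataset.

```stata
* Stand-in data (assumption): Stata's womenwk example file
webuse womenwk, clear

* How many women have wage data, and how many do not
count if !missing(wage)
count if missing(wage)

* regress drops observations with a missing outcome, so this fit
* uses only the women who were working
regress wage education age
```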

This analysis would be fine if, in fact, the missing wage data were missing completely at random. However, the decision to work or not work was made by the individual woman. Thus, those who were not working constitute a self-selected sample and not a random sample. It is likely that some of the women who would earn low wages choose not to work, and this would account for much of the missing wage data. Thus, it is likely that we will overestimate the wages of the women in the population. So, somehow, we need to account for the information that we have on the non-working women. Maybe we could replace the missing values with zeros. The variable wage0 does the trick.

Second Try
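The zero-substitution model can be sketched as follows, assuming the data are in memory with wage, education, and age. The name wage0_demo is hypothetical and is used so that an existing wage0 variable is not overwritten; it shows how such a variable could be built.

```stata
* Copy wage, replacing missing values with zero (hypothetical variable name)
generate double wage0_demo = cond(missing(wage), 0, wage)

* All 2,000 women now enter the regression, non-workers with a wage of 0
regress wage0_demo education age
```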

This analysis is also troubling. It's true that we are using data from all 2,000 women, but zero is not a fair estimate of what the women would have earned if they had chosen to work. It is likely that this model will underestimate the wages of women in the population. The solution to our quandary is to use the Heckman selection model (Gronau 1974, Lewis 1974, Heckman 1976).

The Heckman selection model is a two-equation model. First, there is the regression model,

$$ y_j = x_j \beta + u_{1j} $$

And second, there is the selection model, in which $y_j$ is observed only when

$$ z_j \gamma + u_{2j} > 0 $$

where the following holds:

$$ u_1 \sim N(0, \sigma^2), \qquad u_2 \sim N(0, 1), \qquad \operatorname{corr}(u_1, u_2) = \rho $$

When ρ = 0, OLS regression provides unbiased estimates; when ρ ≠ 0, the OLS estimates are biased. The Heckman selection model allows us to use information from non-working women to improve the estimates of the parameters in the regression model. The Heckman selection model provides consistent, asymptotically efficient estimates for all parameters in the model.

In our example, we have one model predicting wages and one model predicting whether a woman will be working. We will use married, children, education and age to predict selection. Check out this probit example.
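A sketch of the selection probit, assuming the data are in memory. The indicator works is a hypothetical variable built here from the missingness of wage; the predictor names are taken from the text.

```stata
* works is a hypothetical indicator: 1 if a wage is observed, 0 otherwise
generate byte works = !missing(wage)

* Probit model for selection into the labor force
probit works married children education age
```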

Now we are ready to try the full Heckman selection model.

Third Time's a Charm
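Assuming the same variable names as before, the full model can be requested in a single command. When select() is given without a dependent variable, heckman treats observations with a missing wage as the non-selected cases.

```stata
* Outcome equation:   wage on education and age
* Selection equation: wage observed or not, on married, children, education, age
heckman wage education age, select(married children education age)
```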

In addition to the two equations, heckman estimates rho (actually the inverse hyperbolic tangent of rho), the correlation of the residuals in the two equations, and sigma (actually the log of sigma), the standard error of the residuals of the wage equation. Lambda is rho*sigma. The output also includes a likelihood ratio test of rho = 0.
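The back-transformed quantities can be inspected directly after estimation. This is a small sketch that relies on the results heckman stores in e(); the model specification repeats the assumed variable names used above.

```stata
* Refit quietly so the stored results are available
quietly heckman wage education age, select(married children education age)

* heckman estimates athrho = atanh(rho) and lnsigma = ln(sigma);
* the back-transformed values are saved as stored results
display "rho       = " e(rho)
display "sigma     = " e(sigma)
display "lambda    = " e(lambda)

* lambda should match the product of rho and sigma
display "rho*sigma = " e(rho)*e(sigma)
```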

Recall that it was stated at the beginning that this dataset was constructed. As it turns out, we do have full wage information on all 2,000 women. The variable wagefull has the complete wage data. We can therefore run a regression using the full wage information to use as a comparison.
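The benchmark regression, together with predictions from each of the earlier models, might be set up as follows. The variables wagefull and wage0 come from the course dataset (a stand-in file such as womenwk may not include them), and the yhat_* names are hypothetical.

```stata
* Benchmark: regression on the complete wage data
regress wagefull education age
predict double yhat_full

* Refit the three earlier models, saving predictions for all 2,000 women
quietly regress wage education age
predict double yhat_omit

quietly regress wage0 education age
predict double yhat_zero

quietly heckman wage education age, select(married children education age)
predict double yhat_heck, xb

* Compare the average predictions with the complete wages
summarize wagefull yhat_full yhat_omit yhat_zero yhat_heck
```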

If we compare (see below) the predicted wages from the first model (omit missing), the second model (substitute zero for missing) and the heckman model to the complete wage and predicted full wage values, we note the following:
1) the first model tends to overpredict wages;
2) the second model tends to greatly underestimate wages;
3) the heckman model does the best job of predicting wages.

Two-Stage Heckman Selection

It is possible to compute the Heckman selection model manually using a two-stage process. Recall the selection model from above, which we will first run with Stata's twostep option.
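A sketch of both routes, assuming the same variable names as above. The names selected, zg, and imr are hypothetical. The manual second stage should reproduce the two-step coefficients, but its standard errors are not adjusted for the fact that the inverse Mills ratio is itself estimated.

```stata
* Built-in two-step estimator
heckman wage education age, select(married children education age) twostep

* Manual stage 1: probit for whether a wage is observed
generate byte selected = !missing(wage)
probit selected married children education age
predict double zg, xb

* Inverse Mills ratio: normal density over normal CDF at the linear prediction
generate double imr = normalden(zg) / normal(zg)

* Manual stage 2: OLS on the working women with the inverse Mills ratio added;
* the coefficient on imr plays the role of lambda
regress wage education age imr
```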

The manual version begins with a probit model, applies some transformations to obtain the inverse Mills ratio, and then includes that ratio as a predictor in a standard OLS regression.

Probit with Selection

Stata also includes another selection model, heckprob, which works in a manner very similar to heckman except that the response variable is binary. heckprob stands for Heckman probit estimation. We can illustrate heckprob using the same dataset by creating a binary response variable, hw, for high wage.
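One way hw might be created is sketched below. The source does not state the cutoff that defines a high wage, so the value 20 is purely a hypothetical choice for illustration.

```stata
* Cutoff of 20 is a hypothetical value, not taken from the course notes.
* hw is left missing wherever wage is missing.
generate byte hw = wage > 20 if !missing(wage)

* Check the split, including the missing cases
tabulate hw, missing
```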

We will begin just as we did in the heckman analysis, by analyzing hw for the 1,343 cases with complete data. As before, this solution is less than satisfying because the information from 657 individuals is left out: they self-selected out of the labor force.
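A sketch of the complete-case probit, assuming hw has been created and that education and age are again the predictors (the predictor list is an assumption carried over from the wage model).

```stata
* probit drops the 657 observations with a missing hw
probit hw education age
```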

Next, we will recode all of the missing values of hw with zero and try again.
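The recoding step might look like this. The name hw0 is hypothetical and keeps the original hw intact for the later selection model.

```stata
* Copy hw, replacing missing values with zero (hypothetical variable name)
generate byte hw0 = cond(missing(hw), 0, hw)

* Probit on all 2,000 observations
probit hw0 education age
```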

Now we are using all of the observations, but by setting all of the missing values to zero we are implying that none of these observations would have been high wage had the individual chosen to work.

The solution, of course, is a Heckman selection model using heckprob.
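A sketch of the selection probit, assuming the same predictors and selection equation as in the heckman example. Newer Stata releases appear to document this command under the name heckprobit.

```stata
* Outcome equation:   hw on education and age
* Selection equation: hw observed or not, on married, children, education, age
heckprob hw education age, select(married children education age)
```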

These results are not that different from the first probit model, but we can feel more confident about the analysis since it uses all of the information that is available.


Phil Ender -- revised 3/23/05