Consider a model in which we try to predict women's wages from their education and age. We have an artificially constructed sample of 2,000 women, but we only have wage data for 1,343 of them. The remaining 657 women were not working and so did not receive wages. We will start off with a simple-minded model in which we estimate the regression using only the observations that have wage data.
First Try
use http://www.gseis.ucla.edu/courses/data/wages

univar wage education age
                                        -------------- Quantiles --------------
 Variable       n     Mean     S.D.      Min      .25      Mdn      .75      Max
-------------------------------------------------------------------------------
     wage    1343    23.69     6.31     5.88    19.31    23.51    28.05    45.81
education    2000    13.08     3.05    10.00    10.00    12.00    16.00    20.00
      age    2000    36.21     8.29    20.00    30.00    36.00    42.00    59.00
-------------------------------------------------------------------------------

regress wage education age

      Source |       SS       df       MS              Number of obs =    1343
-------------+------------------------------           F(  2,  1340) =  227.49
       Model |  13524.0337     2  6762.01687           Prob > F      =  0.0000
    Residual |  39830.8609  1340  29.7245231           R-squared     =  0.2535
-------------+------------------------------           Adj R-squared =  0.2524
       Total |  53354.8946  1342  39.7577456           Root MSE      =   5.452

------------------------------------------------------------------------------
        wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   education |   .8965829   .0498061    18.00   0.000     .7988765    .9942893
         age |   .1465739   .0187135     7.83   0.000      .109863    .1832848
       _cons |   6.084875   .8896182     6.84   0.000     4.339679    7.830071
------------------------------------------------------------------------------

predict pwage

This analysis would be fine if, in fact, the missing wage data were missing completely at random. However, the decision to work or not to work was made by each woman herself. Thus, the women who were not working constitute a self-selected sample, not a random sample. It is likely that some of the women who would earn low wages chose not to work, and this would account for much of the missing wage data. Thus, it is likely that we will overestimate the wages of the women in the population. So, somehow, we need to account for the information that we have on the non-working women. Maybe we could replace the missing values with zeros. The variable wage0 does the trick.
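The dataset already contains wage0, but if it did not, a variable like it could be created along these lines (a minimal sketch, assuming we simply record a wage of zero for the non-working women):

generate wage0 = wage
replace wage0 = 0 if missing(wage)   /* non-working women get a recorded wage of zero */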
Second Try
univar wage0
 Variable       n     Mean     S.D.      Min      .25      Mdn      .75      Max
-------------------------------------------------------------------------------
    wage0    2000    15.91    12.27     0.00     0.00    19.39    25.77    45.81
-------------------------------------------------------------------------------

regress wage0 education age

      Source |       SS       df       MS              Number of obs =    2000
-------------+------------------------------           F(  2,  1997) =  208.32
       Model |  51956.6949     2  25978.3475           Prob > F      =  0.0000
    Residual |  249038.262  1997   124.70619           R-squared     =  0.1726
-------------+------------------------------           Adj R-squared =  0.1718
       Total |  300994.957  1999  150.572765           Root MSE      =  11.167

------------------------------------------------------------------------------
       wage0 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   education |   1.064572   .0844208    12.61   0.000     .8990101    1.230134
         age |   .3907662   .0310308    12.59   0.000     .3299101    .4516223
       _cons |  -12.16843   1.398146    -8.70   0.000    -14.91041   -9.426456
------------------------------------------------------------------------------

predict pwage0

This analysis is also troubling. It is true that we are now using data from all 2,000 women, but zero is not a fair estimate of what the non-working women would have earned if they had chosen to work. It is likely that this model will underestimate the wages of women in the population. The solution to our quandary is to use the Heckman selection model (Gronau 1974; Lewis 1974; Heckman 1976).
The Heckman selection model is a two-equation model. First, there is a regression equation for the outcome (here, wages); second, there is a selection equation, a probit model that determines whether the outcome is observed at all.
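In the usual notation (a sketch of the standard two-equation formulation; the symbols below are assumed, not taken from the original page), the model is

    wage equation:       y_j = x_j*b + u_1j
    selection equation:  y_j is observed only when z_j*g + u_2j > 0

where u_1 ~ N(0, sigma), u_2 ~ N(0, 1), and corr(u_1, u_2) = rho. When rho is nonzero, ordinary regression on only the observed cases yields biased estimates, which is why the selection equation is needed.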
In our example, we have one model predicting wages and one model predicting whether a woman will be working. We will use married, children, education and age to predict selection. Check out this probit example.
generate s=wage~=.

tab s

          s |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        657       32.85       32.85
          1 |       1343       67.15      100.00
------------+-----------------------------------
      Total |       2000      100.00

probit s married children education age

Probit estimates                                  Number of obs   =       2000
                                                  LR chi2(4)      =     478.32
                                                  Prob > chi2     =     0.0000
Log likelihood = -1027.0616                       Pseudo R2       =     0.1889

------------------------------------------------------------------------------
           s |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     married |   .4308575    .074208     5.81   0.000     .2854125    .5763025
    children |   .4473249   .0287417    15.56   0.000     .3909922    .5036576
   education |   .0583645   .0109742     5.32   0.000     .0368555    .0798735
         age |   .0347211   .0042293     8.21   0.000     .0264318    .0430105
       _cons |  -2.467365   .1925635   -12.81   0.000    -2.844782   -2.089948
------------------------------------------------------------------------------

Now we are ready to try the full Heckman selection model.
Third Time's a Charm
heckman wage education age, select(married children education age)

/* can also be written as
   heckman wage education age, select(s=married children education age) */

Heckman selection model                         Number of obs      =      2000
(regression model with sample selection)        Censored obs       =       657
                                                Uncensored obs     =      1343

                                                Wald chi2(2)       =    508.44
Log likelihood = -5178.304                      Prob > chi2        =    0.0000

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
wage         |
   education |   .9899537   .0532565    18.59   0.000     .8855729    1.094334
         age |   .2131294   .0206031    10.34   0.000     .1727481    .2535108
       _cons |   .4857752   1.077037     0.45   0.652    -1.625179     2.59673
-------------+----------------------------------------------------------------
select       |
     married |   .4451721   .0673954     6.61   0.000     .3130794    .5772647
    children |   .4387068   .0277828    15.79   0.000     .3842534    .4931601
   education |   .0557318   .0107349     5.19   0.000     .0346917    .0767718
         age |   .0365098   .0041533     8.79   0.000     .0283694    .0446502
       _cons |  -2.491015   .1893402   -13.16   0.000    -2.862115   -2.119915
-------------+----------------------------------------------------------------
     /athrho |   .8742086   .1014225     8.62   0.000     .6754241    1.072993
    /lnsigma |   1.792559    .027598    64.95   0.000     1.738468     1.84665
-------------+----------------------------------------------------------------
         rho |   .7035061   .0512264                      .5885365    .7905862
       sigma |   6.004797   .1657202                       5.68862    6.338548
      lambda |   4.224412   .3992265                      3.441942    5.006881
------------------------------------------------------------------------------
LR test of indep. eqns. (rho = 0):   chi2(1) =    61.20   Prob > chi2 = 0.0000
------------------------------------------------------------------------------

predict pheckman

In addition to the two equations, heckman estimates rho (actually the inverse hyperbolic tangent of rho), the correlation of the residuals in the two equations, and sigma (actually the log of sigma), the standard error of the residuals of the wage equation. Lambda is rho*sigma. The output also includes a likelihood ratio test of rho = 0.
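We can check these back-transformations directly from the output above, using display as a calculator (the constants are simply copied from the results table):

display tanh(.8742086)      /* rho = tanh(athrho), about .7035 */
display exp(1.792559)       /* sigma = exp(lnsigma), about 6.0048 */
display .7035061*6.004797   /* lambda = rho*sigma, about 4.2244 */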
Recall that, as stated at the beginning, this dataset was artificially constructed. As it turns out, we do have full wage information on all 2,000 women. The variable wagefull contains the complete wage data. We can therefore run a regression using the full wage information to use as a comparison.
regress wagefull education age

      Source |       SS       df       MS              Number of obs =    2000
-------------+------------------------------           F(  2,  1997) =  398.82
       Model |   28053.371     2  14026.6855           Prob > F      =  0.0000
    Residual |  70234.8124  1997  35.1701614           R-squared     =  0.2854
-------------+------------------------------           Adj R-squared =  0.2847
       Total |  98288.1834  1999   49.168676           Root MSE      =  5.9304

------------------------------------------------------------------------------
    wagefull |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   education |   1.004456   .0448325    22.40   0.000     .9165328    1.092379
         age |   .1874822   .0164792    11.38   0.000      .155164    .2198004
       _cons |   1.381099   .7424989     1.86   0.063    -.0750544    2.837253
------------------------------------------------------------------------------

predict pfull

If we compare (see below) the predicted wages from the first model (omit missing), the second model (substitute zero for missing) and the Heckman model with the complete wages and the full-data predictions, we note the following:
univar pwage pwage0 pheckman wagefull pfull
                                        -------------- Quantiles --------------
 Variable       n     Mean     S.D.      Min      .25      Mdn      .75      Max
-------------------------------------------------------------------------------
    pwage    2000    23.12     3.24    17.98    20.36    22.56    25.71    32.66
   pwage0    2000    15.91     5.10     6.29    11.76    15.95    19.36    32.18
 pheckman    2000    21.16     3.84    14.65    18.06    20.83    24.00    32.86
 wagefull    2000    21.31     7.01    -1.68    16.46    21.18    26.14    45.81
    pfull    2000    21.31     3.75    15.18    18.18    20.77    24.20    32.53
-------------------------------------------------------------------------------

The first model overestimates wages relative to the full data (mean predicted wage 23.12 versus 21.31), the zero-substitution model underestimates them (mean 15.91), and the Heckman predictions (mean 21.16) come very close to both the complete wages and the full-data predictions.

Two-Stage Heckman Selection
It is possible to compute the Heckman selection model manually using a two-stage process. Recall the selection model from above, which we will now run with Stata's twostep option.
heckman wage education age, select(s = married children education age) twostep

Heckman selection model -- two-step estimates   Number of obs      =      2000
(regression model with sample selection)        Censored obs       =       657
                                                Uncensored obs     =      1343

                                                Wald chi2(4)       =    551.37
                                                Prob > chi2        =    0.0000

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
wage         |
   education |   .9825259   .0538821    18.23   0.000     .8769189    1.088133
         age |   .2118695   .0220511     9.61   0.000     .1686502    .2550888
       _cons |   .7340391   1.248331     0.59   0.557    -1.712645    3.180723
-------------+----------------------------------------------------------------
s            |
     married |   .4308575    .074208     5.81   0.000     .2854125    .5763025
    children |   .4473249   .0287417    15.56   0.000     .3909922    .5036576
   education |   .0583645   .0109742     5.32   0.000     .0368555    .0798735
         age |   .0347211   .0042293     8.21   0.000     .0264318    .0430105
       _cons |  -2.467365   .1925635   -12.81   0.000    -2.844782   -2.089948
-------------+----------------------------------------------------------------
mills        |
      lambda |   4.001615   .6065388     6.60   0.000     2.812821     5.19041
-------------+----------------------------------------------------------------
         rho |    0.67284
       sigma |  5.9473529
      lambda |  4.0016155   .6065388
------------------------------------------------------------------------------

To reproduce this by hand, we will begin with a probit model, do some transformations to obtain the inverse Mills ratio, and then include it as a predictor in a standard OLS regression.
probit s married children education age

Probit estimates                                  Number of obs   =       2000
                                                  LR chi2(4)      =     478.32
                                                  Prob > chi2     =     0.0000
Log likelihood = -1027.0616                       Pseudo R2       =     0.1889

------------------------------------------------------------------------------
           s |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     married |   .4308575    .074208     5.81   0.000     .2854125    .5763025
    children |   .4473249   .0287417    15.56   0.000     .3909922    .5036576
   education |   .0583645   .0109742     5.32   0.000     .0368555    .0798735
         age |   .0347211   .0042293     8.21   0.000     .0264318    .0430105
       _cons |  -2.467365   .1925635   -12.81   0.000    -2.844782   -2.089948
------------------------------------------------------------------------------

predict p1, xb

generate phi = (1/sqrt(2*_pi))*exp(-(p1^2/2))  /* standard normal density at p1 */
generate capphi = norm(p1)                     /* standard normal CDF at p1 */
generate invmills = phi/capphi                 /* inverse Mills ratio */

regress wage education age invmills

      Source |       SS       df       MS              Number of obs =    1343
-------------+------------------------------           F(  3,  1339) =  173.01
       Model |  14904.6806     3  4968.22688           Prob > F      =  0.0000
    Residual |   38450.214  1339  28.7156191           R-squared     =  0.2793
-------------+------------------------------           Adj R-squared =  0.2777
       Total |  53354.8946  1342  39.7577456           Root MSE      =  5.3587

------------------------------------------------------------------------------
        wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   education |   .9825259   .0504982    19.46   0.000     .8834616     1.08159
         age |   .2118695   .0206636    10.25   0.000      .171333     .252406
    invmills |   4.001616   .5771027     6.93   0.000     2.869492    5.133739
       _cons |   .7340391   1.166214     0.63   0.529    -1.553766    3.021844
------------------------------------------------------------------------------
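Note that the coefficient on invmills matches the lambda reported by heckman, twostep. As an aside, more recent versions of Stata provide the standard normal density and CDF directly as normalden() and normal() (norm() above is the older name for the CDF), so the inverse Mills ratio could be computed more compactly. A minimal sketch, assuming those functions are available (the variable names here are illustrative):

predict p1a, xb                                   /* probit linear predictor */
generate invmills2 = normalden(p1a)/normal(p1a)   /* inverse Mills ratio in one step */
regress wage education age invmills2              /* second-stage OLS, as above */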
Probit with Selection

Stata also includes another selection model, heckprob, which works in a manner very similar to heckman except that the response variable is binary. heckprob stands for Heckman probit estimation. We can illustrate heckprob using the same dataset by creating a binary response variable hw, for high wage.
summarize wage

    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        wage |    1343    23.69217   6.305374    5.88497   45.80979

generate hw = wage>r(mean) if wage ~= .
(657 missing values generated)

tabulate hw

         hw |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        686       51.08       51.08
          1 |        657       48.92      100.00
------------+-----------------------------------
      Total |      1,343      100.00

tabulate hw, miss

         hw |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        686       34.30       34.30
          1 |        657       32.85       67.15
          . |        657       32.85      100.00
------------+-----------------------------------
      Total |      2,000      100.00

We will begin just as we did in the heckman analysis, by analyzing hw for the 1,343 cases with complete data.
probit hw education age, nolog

Probit estimates                                  Number of obs   =       1343
                                                  LR chi2(2)      =     246.68
                                                  Prob > chi2     =     0.0000
Log likelihood = -807.24513                       Pseudo R2       =     0.1325

------------------------------------------------------------------------------
          hw |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   education |   .1629735   .0126813    12.85   0.000     .1381186    .1878283
         age |   .0284745   .0046322     6.15   0.000     .0193956    .0375534
       _cons |   -3.29333   .2348306   -14.02   0.000    -3.753589    -2.83307
------------------------------------------------------------------------------

As before, this solution is less than satisfying because information from 657 individuals was left out; they self-selected out of the labor force.
Next, we will recode all of the missing values of hw to zero and try again.
gen hw0 = hw
(657 missing values generated)

replace hw0=0 if hw0 == .
(657 real changes made)

probit hw0 education age, nolog

Probit estimates                                  Number of obs   =       2000
                                                  LR chi2(2)      =     366.87
                                                  Prob > chi2     =     0.0000
Log likelihood = -1082.7874                       Pseudo R2       =     0.1449

------------------------------------------------------------------------------
         hw0 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   education |   .1436513   .0103329    13.90   0.000     .1233991    .1639035
         age |   .0392975   .0039473     9.96   0.000     .0315609    .0470341
       _cons |  -3.819543   .1952091   -19.57   0.000    -4.202146    -3.43694
------------------------------------------------------------------------------

Now we are using all of the observations, but by setting all of the missing values to zero we are implying that none of these women would have earned a high wage had they chosen to work.
The solution, of course, is a Heckman selection model using heckprob.
heckprob hw education age, select(married children education age) nolog

Probit model with sample selection              Number of obs      =      2000
                                                Censored obs       =       657
                                                Uncensored obs     =      1343

                                                Wald chi2(2)       =    288.91
Log likelihood = -1817.402                      Prob > chi2        =    0.0000

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
hw           |
   education |   .1615956    .012297    13.14   0.000     .1374939    .1856973
         age |   .0374595   .0043218     8.67   0.000     .0289889    .0459301
       _cons |  -3.913347   .2142349   -18.27   0.000     -4.33324   -3.493454
-------------+----------------------------------------------------------------
select       |
     married |   .4455411   .0703079     6.34   0.000     .3077401    .5833421
    children |   .4443875   .0286673    15.50   0.000     .3882006    .5005744
   education |   .0569751   .0108367     5.26   0.000     .0357356    .0782145
         age |   .0347465   .0041812     8.31   0.000     .0265515    .0429414
       _cons |  -2.455993   .1908705   -12.87   0.000    -2.830092   -2.081894
-------------+----------------------------------------------------------------
     /athrho |   .9695628   .2283646     4.25   0.000     .5219765    1.417149
-------------+----------------------------------------------------------------
         rho |   .7485121   .1004187                       .479224    .8890027
------------------------------------------------------------------------------
LR test of indep. eqns. (rho = 0):   chi2(1) =    33.81   Prob > chi2 = 0.0000
------------------------------------------------------------------------------

These results are not that different from those of the first probit model, but we can feel more confident about this analysis because it uses all of the information that is available.
Categorical Data Analysis Course
Phil Ender -- revised 3/23/05