Applied Categorical & Nonnormal Data Analysis

Probit Regression Models

An alternative to logistic regression analysis is probit analysis. The term "probit" was coined in the 1930s by Chester Bliss and stands for probability unit. These two analyses, logit and probit, are very similar to one another. As discussed in the previous unit, logit analysis is based on log odds, while probit uses the cumulative normal probability distribution. Here is what a cumulative normal distribution looks like.

Notice the S-shaped curve that runs from zero to one. It is very similar to the graph of the logistic function used in logit analysis. The two procedures are so similar that they can easily be confused with one another. The bottom line is that logistic regression and probit analysis produce very similar predicted probabilities. An example of predicted probabilities for logit and probit is given below.

The probit model is defined as

Pr(y=1|x) = Φ(xb)

where Φ is the standard normal cumulative distribution function and xb is called the probit score or index.
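As a rough illustration of this definition, Φ can be evaluated with the error function in Python's standard library. This sketch is not part of the original Stata session. The function names `probit_prob` and `logit_prob` are made up for the example, and the factor of 1.6 used to rescale the logit index is an approximate rule of thumb, not an exact equivalence.

```python
import math

def probit_prob(xb):
    """Pr(y=1|x) = Phi(xb): the standard normal CDF evaluated at the index."""
    return 0.5 * (1.0 + math.erf(xb / math.sqrt(2.0)))

def logit_prob(xb):
    """Logistic CDF, shown only for comparison with the probit curve."""
    return 1.0 / (1.0 + math.exp(-xb))

# Compare the two S-shaped curves. The logit index is multiplied by 1.6
# (an assumed rule-of-thumb factor) to put it on a comparable scale.
for z in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"z={z:+.1f}  probit={probit_prob(z):.4f}  logit={logit_prob(1.6 * z):.4f}")
```

Both columns rise from near zero to near one and agree closely in the middle of the range, which is why the two models usually give similar predicted probabilities.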

Because the probit index xb is on the scale of a standard normal quantile, interpreting probit coefficients requires thinking in the Z (normal quantile) metric. The interpretation of a probit coefficient, b, is that a one-unit increase in the predictor increases the probit score by b standard deviations. Learning to think and communicate in the Z metric takes practice and can be confusing to others. We will make use of a number of tools developed by Long and Freese to aid in the interpretation of the results.

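The log-likelihood function for probit is

ln L = Σ_{j∈S} w_j ln Φ(x_j b) + Σ_{j∉S} w_j ln[1 − Φ(x_j b)]

where S is the set of observations with y_j = 1 and w_j denotes optional weights.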
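The weighted probit log-likelihood can be sketched in a few lines of Python. This is an illustration only, not the routine Stata uses. The function name `probit_loglik` and the toy data are hypothetical, and each row of `X` is assumed to carry a leading 1.0 for the constant.

```python
import math

def norm_cdf(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def probit_loglik(b, X, y, w=None):
    """Sum of w_j * ln Phi(x_j b) over positive outcomes plus
    w_j * ln[1 - Phi(x_j b)] over the remaining observations."""
    if w is None:
        w = [1.0] * len(y)  # unweighted case
    total = 0.0
    for x_j, y_j, w_j in zip(X, y, w):
        xb = sum(b_k * x_k for b_k, x_k in zip(b, x_j))
        # 1 - Phi(xb) equals Phi(-xb) by symmetry of the normal distribution
        p = norm_cdf(xb) if y_j != 0 else norm_cdf(-xb)
        total += w_j * math.log(p)
    return total

# Hypothetical toy data: a constant and one predictor
X = [[1.0, -1.0], [1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
y = [0, 0, 1, 1]
print(probit_loglik([0.0, 0.0], X, y))    # all coefficients zero
print(probit_loglik([-0.5, 1.0], X, y))   # coefficients that fit the toy data better
```

With all coefficients at zero every predicted probability is 0.5, so the log-likelihood is 4 × ln(0.5). Coefficients that fit the data better give a larger (less negative) value, which is what maximum likelihood estimation searches for.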

Currently, logit models are more popular than probit models for two reasons: 1) the exponentiated logistic coefficients can be interpreted as odds ratios, and 2) there are more diagnostic tools available in logistic regression. This last reason may be a chicken-and-egg issue, however; that is, there might be more diagnostic tools because logistic regression is used more often.

We will demonstrate probit analysis using the same datasets that were used in the logistic regression analysis unit.

Example 1

set matsize 100
use http://www.gseis.ucla.edu/courses/data/honors

describe

Contains data from http://www.gseis.ucla.edu/courses/data/honors.dta
  obs:           200                          
 vars:             7                          10 Feb 2001 16:27
 size:         6,400 (99.8% of memory free)
-------------------------------------------------------------------------------
   1. id        float  %9.0g                  
   2. female    float  %9.0g       fl         
   3. ses       float  %9.0g       sl         
   4. lang      float  %9.0g                  language test score
   5. math      float  %9.0g                  math score
   6. science   float  %9.0g                  science score
   7. honors    float  %9.0g                  
-------------------------------------------------------------------------------

summarize

Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+-----------------------------------------------------
      id |     200       100.5   57.87918          1        200  
  female |     200        .545   .4992205          0          1  
     ses |     200       2.055   .7242914          1          3  
    lang |     200       52.23   10.25294         28         76  
    math |     200      52.645   9.368448         33         75  
 science |     200       51.85   9.900891         26         74  
  honors |     200        .265   .4424407          0          1     

tab1 honors female

-> tabulation of honors  

     honors |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        147       73.50       73.50
          1 |         53       26.50      100.00
------------+-----------------------------------
      Total |        200      100.00

-> tabulation of female  

     female |      Freq.     Percent        Cum.
------------+-----------------------------------
       male |         91       45.50       45.50
     female |        109       54.50      100.00
------------+-----------------------------------
      Total |        200      100.00

tabulate ses, gen(ses)

        ses |      Freq.     Percent        Cum.
------------+-----------------------------------
        low |         47       23.50       23.50
     middle |         95       47.50       71.00
       high |         58       29.00      100.00
------------+-----------------------------------
      Total |        200      100.00

probit honors lang math science female ses1 ses2

Probit estimates                                  Number of obs   =        200
                                                  LR chi2(6)      =      90.64
                                                  Prob > chi2     =     0.0000
Log likelihood = -70.325874                       Pseudo R2       =     0.3919

------------------------------------------------------------------------------
  honors |      Coef.   Std. Err.       z     P>|z|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
    lang |   .0374474   .0167054      2.242   0.025       .0047055    .0701893
    math |   .0660721   .0190501      3.468   0.001       .0287347    .1034096
 science |    .027691   .0182851      1.514   0.130      -.0081471    .0635291
  female |   .7738415   .2655413      2.914   0.004       .2533901    1.294293
    ses1 |   .0239919   .3458658      0.069   0.945      -.6538925    .7018763
    ses2 |  -.5750086   .2756539     -2.086   0.037       -1.11528   -.0347369
   _cons |  -8.021886   1.198495     -6.693   0.000      -10.37089   -5.672879
------------------------------------------------------------------------------

Just a note on the interpretation of the probit coefficients. The coefficient for math is .07 to two decimal places. This indicates that a one-unit increase in the math score results in a .07 standard deviation increase in the predicted probit index. And the coefficient for female is interpreted to mean that the change from 0 to 1 increases the predicted probit index by .77 standard deviations.
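Because Φ is nonlinear, the same .77 shift in the probit index corresponds to different changes in predicted probability depending on where the index starts. The Python sketch below illustrates this with the coefficient for female from the output above. The baseline index values are arbitrary choices for illustration, and the function name `prob_change` is made up for this example.

```python
import math

def norm_cdf(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

B_FEMALE = 0.7738415  # coefficient for female from the probit output

def prob_change(baseline_index, shift=B_FEMALE):
    """Change in Pr(y=1) when the probit index rises by `shift`."""
    return norm_cdf(baseline_index + shift) - norm_cdf(baseline_index)

# Arbitrary illustrative starting values for the index
for z0 in (-3.0, -2.0, -1.0, 0.0):
    print(f"baseline index {z0:+.1f}: Pr(y=1) changes by {prob_change(z0):.4f}")
```

The change is small when the baseline index is far out in the tail and largest near zero, which is why the tools used later in this unit report changes in predicted probabilities at specific values of the predictors.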

probit honors lang math female ses1 ses2

Probit estimates                                  Number of obs   =        200
                                                  LR chi2(5)      =      88.28
                                                  Prob > chi2     =     0.0000
Log likelihood = -71.503442                       Pseudo R2       =     0.3817

------------------------------------------------------------------------------
  honors |      Coef.   Std. Err.       z     P>|z|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
    lang |   .0439894   .0162434      2.708   0.007       .0121528    .0758259
    math |   .0760789    .018053      4.214   0.000       .0406958     .111462
  female |   .6752606   .2523046      2.676   0.007       .1807526    1.169769
    ses1 |  -.0275906   .3397904     -0.081   0.935      -.6935676    .6383864
    ses2 |  -.6179796   .2723557     -2.269   0.023      -1.151787   -.0841724
   _cons |  -7.334563   1.056422     -6.943   0.000      -9.405111   -5.264015
------------------------------------------------------------------------------

test ses1 ses2

 ( 1)  ses1 = 0.0
 ( 2)  ses2 = 0.0

           chi2(  2) =    6.32
         Prob > chi2 =    0.0425

for var lang math: generate fxX = female*X

probit honors lang math female ses1 ses2 fxlang fxmath

Probit estimates                                  Number of obs   =        200
                                                  LR chi2(7)      =      89.08
                                                  Prob > chi2     =     0.0000
Log likelihood = -71.104283                       Pseudo R2       =     0.3851

------------------------------------------------------------------------------
  honors |      Coef.   Std. Err.       z     P>|z|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
    lang |   .0325027   .0233381      1.393   0.164      -.0132391    .0782445
    math |   .0717692   .0254528      2.820   0.005       .0218825    .1216559
  female |  -.9346668    1.92794     -0.485   0.628      -4.713361    2.844027
    ses1 |   -.003803   .3424154     -0.011   0.991      -.6749249     .667319
    ses2 |  -.5965207   .2774592     -2.150   0.032      -1.140331   -.0527107
  fxlang |   .0203053   .0323945      0.627   0.531      -.0431868    .0837974
  fxmath |   .0081221   .0363954      0.223   0.823      -.0632115    .0794558
   _cons |  -6.427969   1.443015     -4.455   0.000      -9.256227   -3.599711
------------------------------------------------------------------------------

test fxlang fxmath

 ( 1)  fxlang = 0.0
 ( 2)  fxmath = 0.0

           chi2(  2) =    0.81
         Prob > chi2 =    0.6682
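The test command reports a Wald test. As a cross-check, a likelihood-ratio test of the same two interaction terms can be formed from the log likelihoods that Stata prints for the models with and without fxlang and fxmath. The Python sketch below is a different test from the one Stata ran, so the numbers are expected to be close but not identical. The function names are made up for this example. It relies on the fact that a chi-square variable with 2 degrees of freedom has upper-tail probability exp(-x/2).

```python
import math

LL_WITH_INTERACTIONS = -71.104283     # model including fxlang and fxmath
LL_WITHOUT_INTERACTIONS = -71.503442  # model without the interaction terms

def lr_statistic(ll_full, ll_reduced):
    """Likelihood-ratio chi-square: twice the gain in log likelihood."""
    return 2.0 * (ll_full - ll_reduced)

def chi2_upper_tail_2df(x):
    """Upper-tail probability of a chi-square with exactly 2 df."""
    return math.exp(-x / 2.0)

lr = lr_statistic(LL_WITH_INTERACTIONS, LL_WITHOUT_INTERACTIONS)
p_value = chi2_upper_tail_2df(lr)
print(f"LR chi2(2) = {lr:.2f}, Prob > chi2 = {p_value:.4f}")
```

The likelihood-ratio statistic is about 0.80 with a p-value near 0.67, in line with the Wald result of 0.81 and 0.6682. Either way, the interactions add little to the model.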

probit honors lang math female ses1 ses2

Probit estimates                                  Number of obs   =        200
                                                  LR chi2(5)      =      88.28
                                                  Prob > chi2     =     0.0000
Log likelihood = -71.503442                       Pseudo R2       =     0.3817

------------------------------------------------------------------------------
  honors |      Coef.   Std. Err.       z     P>|z|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
    lang |   .0439894   .0162434      2.708   0.007       .0121528    .0758259
    math |   .0760789    .018053      4.214   0.000       .0406958     .111462
  female |   .6752606   .2523046      2.676   0.007       .1807526    1.169769
    ses1 |  -.0275906   .3397904     -0.081   0.935      -.6935676    .6383864
    ses2 |  -.6179796   .2723557     -2.269   0.023      -1.151787   -.0841724
   _cons |  -7.334563   1.056422     -6.943   0.000      -9.405111   -5.264015
------------------------------------------------------------------------------

listcoef

probit (N=200): Unstandardized and Standardized Estimates 

 Observed SD: .4424407
   Latent SD: 1.5392821

---------------------------------------------------------------------------
  honors |      b         z     P>|z|    bStdX    bStdY   bStdXY      SDofX
---------+-----------------------------------------------------------------
    lang |   0.04399    2.708   0.007   0.4510   0.0286   0.2930    10.2529
    math |   0.07608    4.214   0.000   0.7127   0.0494   0.4630     9.3684
  female |   0.67526    2.676   0.007   0.3371   0.4387   0.2190     0.4992
    ses1 |  -0.02759   -0.081   0.935  -0.0117  -0.0179  -0.0076     0.4251
    ses2 |  -0.61798   -2.269   0.023  -0.3094  -0.4015  -0.2010     0.5006
---------------------------------------------------------------------------
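The standardized columns in the listcoef table can be recovered from the unstandardized coefficient b, the standard deviation of the predictor (SDofX), and the latent SD reported in the header: bStdX = b × SDofX, bStdY = b / latent SD, and bStdXY = b × SDofX / latent SD. The Python sketch below checks this against the table. The function name `standardize` is made up for the example, and the inputs are copied from the output above, so small rounding differences are possible.

```python
LATENT_SD = 1.5392821  # "Latent SD" from the listcoef header

# (b, SDofX) pairs: b from the probit output, SDofX from the listcoef table
coefs = {
    "lang":   (0.0439894, 10.2529),
    "math":   (0.0760789, 9.3684),
    "female": (0.6752606, 0.4992),
    "ses1":   (-0.0275906, 0.4251),
    "ses2":   (-0.6179796, 0.5006),
}

def standardize(b, sd_x, sd_latent=LATENT_SD):
    """Return (bStdX, bStdY, bStdXY) for one coefficient."""
    b_std_x = b * sd_x             # change in index per SD change in x
    b_std_y = b / sd_latent        # change in latent-SD units per unit of x
    b_std_xy = b_std_x / sd_latent # fully standardized coefficient
    return b_std_x, b_std_y, b_std_xy

for name, (b, sd_x) in coefs.items():
    sx, sy, sxy = standardize(b, sd_x)
    print(f"{name:>6}  bStdX={sx:8.4f}  bStdY={sy:8.4f}  bStdXY={sxy:8.4f}")
```

The fully standardized column makes it easier to compare predictors measured on different scales, such as test scores and the 0/1 indicator for female.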

prchange

probit: Changes in Predicted Probabilities for honors

        min->max      0->1     -+1/2    -+sd/2  MargEfct
  lang    0.5114    0.0001    0.0110    0.1130    0.0110
  math    0.7624    0.0000    0.0191    0.1784    0.0191
female    0.1643    0.1643    0.1690    0.0845    0.1693
  ses1   -0.0069   -0.0069   -0.0069   -0.0029   -0.0069
  ses2   -0.1525   -0.1525   -0.1547   -0.0775   -0.1549

              0       1
Pr(y|x)  0.8324  0.1676

           lang     math   female     ses1     ses2
    x=    52.23   52.645     .545     .235     .475
sd(x)=  10.2529  9.36845   .49922  .425063  .500628
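Two pieces of the prchange output can be reproduced by hand from the fitted coefficients: the predicted probability with every predictor held at its mean, and the MargEfct column, which for probit is the normal density at the index multiplied by the coefficient. The Python sketch below is an illustration under that assumption, using the coefficients and means printed above. It does not reproduce the other columns.

```python
import math

def norm_cdf(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

# Coefficients from: probit honors lang math female ses1 ses2
coef = {"lang": 0.0439894, "math": 0.0760789, "female": 0.6752606,
        "ses1": -0.0275906, "ses2": -0.6179796}
cons = -7.334563

# Predictor means from the x= row of the prchange output
means = {"lang": 52.23, "math": 52.645, "female": 0.545,
         "ses1": 0.235, "ses2": 0.475}

xb = cons + sum(coef[k] * means[k] for k in coef)   # probit index at the means
p_at_means = norm_cdf(xb)                           # Pr(y=1|x)
marginal = {k: norm_pdf(xb) * coef[k] for k in coef}

print(f"Pr(y=1|x) at the means: {p_at_means:.4f}")
for name, effect in marginal.items():
    print(f"{name:>6}  marginal effect = {effect:8.4f}")
```

The marginal effect is an instantaneous rate of change, so for a 0/1 predictor such as female it only approximates the 0->1 column, which is the actual difference between two predicted probabilities.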

prtab math

probit: Predicted probabilities of positive outcome for honors

----------------------
math      |
score     | Prediction
----------+-----------
       33 |     0.0070
       35 |     0.0105
       37 |     0.0156
       38 |     0.0189
       39 |     0.0226
       40 |     0.0271
       41 |     0.0322
       42 |     0.0381
       43 |     0.0448
       44 |     0.0525
       45 |     0.0611
       46 |     0.0709
       47 |     0.0818
       48 |     0.0939
       49 |     0.1073
       50 |     0.1220
       51 |     0.1381
       52 |     0.1556
       53 |     0.1744
       54 |     0.1947
       55 |     0.2163
       56 |     0.2393
       57 |     0.2635
       58 |     0.2890
       59 |     0.3155
       60 |     0.3430
       61 |     0.3714
       62 |     0.4005
       63 |     0.4301
       64 |     0.4602
       65 |     0.4905
       66 |     0.5208
       67 |     0.5510
       68 |     0.5810
       69 |     0.6104
       70 |     0.6393
       71 |     0.6673
       72 |     0.6945
       73 |     0.7206
       75 |     0.7694
----------------------

      lang    math  female    ses1    ses2
x=   52.23  52.645    .545    .235    .475

prtab female

probit: Predicted probabilities of positive outcome for honors

----------------------
   female | Prediction
----------+-----------
     male |     0.0915
   female |     0.2557
----------------------

      lang    math  female    ses1    ses2
x=   52.23  52.645    .545    .235    .475
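The two predictions in this table come from setting female to 0 and then to 1 while holding the other predictors at their means. The Python sketch below reproduces them from the fitted coefficients. The function name `predict_honors` is made up for this example.

```python
import math

def norm_cdf(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Coefficients from: probit honors lang math female ses1 ses2
coef = {"lang": 0.0439894, "math": 0.0760789, "female": 0.6752606,
        "ses1": -0.0275906, "ses2": -0.6179796}
cons = -7.334563

# Means of the predictors that are held constant
others = {"lang": 52.23, "math": 52.645, "ses1": 0.235, "ses2": 0.475}

def predict_honors(female):
    """Predicted Pr(honors=1) for a given value of female, others at means."""
    xb = cons + coef["female"] * female
    xb += sum(coef[k] * others[k] for k in others)
    return norm_cdf(xb)

p_male = predict_honors(0)
p_female = predict_honors(1)
print(f"male   {p_male:.4f}")
print(f"female {p_female:.4f}")
print(f"difference {p_female - p_male:.4f}")
```

The difference between the two predictions is the discrete change reported in the 0->1 column for female in the prchange output.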

prtab math female

probit: Predicted probabilities of positive outcome for honors

--------------------------
math      |     female    
score     |   male  female
----------+---------------
       33 | 0.0024  0.0157
       35 | 0.0037  0.0228
       37 | 0.0058  0.0324
       38 | 0.0072  0.0383
       39 | 0.0089  0.0451
       40 | 0.0109  0.0528
       41 | 0.0133  0.0615
       42 | 0.0161  0.0713
       43 | 0.0194  0.0822
       44 | 0.0233  0.0944
       45 | 0.0278  0.1078
       46 | 0.0331  0.1226
       47 | 0.0391  0.1387
       48 | 0.0460  0.1563
       49 | 0.0538  0.1752
       50 | 0.0626  0.1955
       51 | 0.0726  0.2172
       52 | 0.0837  0.2402
       53 | 0.0960  0.2645
       54 | 0.1096  0.2900
       55 | 0.1245  0.3165
       56 | 0.1408  0.3441
       57 | 0.1585  0.3725
       58 | 0.1776  0.4016
       59 | 0.1981  0.4313
       60 | 0.2200  0.4614
       61 | 0.2431  0.4916
       62 | 0.2676  0.5220
       63 | 0.2932  0.5522
       64 | 0.3199  0.5821
       65 | 0.3476  0.6116
       66 | 0.3761  0.6404
       67 | 0.4053  0.6684
       68 | 0.4350  0.6955
       69 | 0.4651  0.7216
       70 | 0.4954  0.7466
       71 | 0.5257  0.7703
       72 | 0.5559  0.7927
       73 | 0.5858  0.8138
       75 | 0.6439  0.8518
--------------------------

      lang    math  female    ses1    ses2
x=   52.23  52.645    .545    .235    .475

Example 2

use http://www.gseis.ucla.edu/courses/data/api2000

describe

Contains data from api2000.dta
  obs:           250                          
 vars:             8                          10 Feb 2001 14:58
 size:         5,500 (99.9% of memory free)
-------------------------------------------------------------------------------
   1. snum      float  %9.0g                  school number
   2. api2000   int    %6.0g                  
   3. apigoal   float  %9.0g                  api>=800
   4. meals     byte   %4.0f                  pct free meals
   5. ell       byte   %4.0f                  english language learners
   6. aved      float  %9.0g                  avg parent ed
   7. full      byte   %4.0f                  pct full credential
   8. emer      byte   %4.0f                  pct emer credential
-------------------------------------------------------------------------------

summarize

Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+-----------------------------------------------------
    snum |     250    3165.612    1757.88         25       6186  
 api2000 |     250      669.92   137.6566        366        953  
 apigoal |     250          .2   .4008024          0          1  
   meals |     250      51.456   31.96321          0        100  
     ell |     250      26.352   25.60583          0         91  
    aved |     250      2.7422   .7750297          1       4.62  
    full |     250      87.684   13.57147         34        100  
    emer |     250      10.928   11.55512          0         63 

tab apigoal

api>=800 |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        200       80.00       80.00
          1 |         50       20.00      100.00
------------+-----------------------------------
      Total |        250      100.00

probit apigoal meals ell aved full

Probit estimates                                  Number of obs   =        250
                                                  LR chi2(4)      =     151.08
                                                  Prob > chi2     =     0.0000
Log likelihood = -49.560174                       Pseudo R2       =     0.6038

------------------------------------------------------------------------------
 apigoal |      Coef.   Std. Err.       z     P>|z|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
   meals |  -.0426557   .0138114     -3.088   0.002      -.0697255   -.0155858
     ell |   .0025918   .0191673      0.135   0.892      -.0349755    .0401591
    aved |   1.298466   .4034422      3.218   0.001       .5077338    2.089198
    full |   .0167719   .0216277      0.775   0.438      -.0256177    .0591614
   _cons |  -5.280958   2.613711     -2.020   0.043      -10.40374   -.1581779
------------------------------------------------------------------------------

Note: 11 failures and 0 successes completely determined.

probit apigoal meals aved

Probit estimates                                  Number of obs   =        250
                                                  LR chi2(2)      =     150.47
                                                  Prob > chi2     =     0.0000
Log likelihood = -49.865959                       Pseudo R2       =     0.6014

------------------------------------------------------------------------------
 apigoal |      Coef.   Std. Err.       z     P>|z|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
   meals |  -.0431622   .0123059     -3.507   0.000      -.0672814   -.0190431
    aved |   1.295674   .4003215      3.237   0.001       .5110583     2.08029
   _cons |  -3.656406   1.527895     -2.393   0.017      -6.651026    -.661787
------------------------------------------------------------------------------

Note: 1 failure and 0 successes completely determined
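The Pseudo R2 in the header is McFadden's measure, 1 minus the ratio of the model log likelihood to the log likelihood of a constant-only model. The constant-only value is not printed, but it can be backed out from the LR chi2, which is twice the difference between the two log likelihoods. The Python sketch below does this for the two-predictor model. The function names are made up for the example, and the LR chi2 is taken as printed (two decimals), so the result is approximate.

```python
LL_MODEL = -49.865959  # log likelihood of: probit apigoal meals aved
LR_CHI2 = 150.47       # LR chi2(2) from the same output

def null_loglik(ll_model, lr_chi2):
    """Constant-only log likelihood implied by LR = 2 * (ll_model - ll_null)."""
    return ll_model - lr_chi2 / 2.0

def mcfadden_r2(ll_model, ll_null):
    """McFadden's pseudo R2."""
    return 1.0 - ll_model / ll_null

ll_null = null_loglik(LL_MODEL, LR_CHI2)
pseudo_r2 = mcfadden_r2(LL_MODEL, ll_null)
print(f"constant-only log likelihood = {ll_null:.4f}")
print(f"pseudo R2 = {pseudo_r2:.4f}")
```

The result agrees with the printed Pseudo R2 of 0.6014. Unlike R2 in linear regression, this is not a proportion of variance explained.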

listcoef

probit (N=250): Unstandardized and Standardized Estimates 

 Observed SD: .40080241
   Latent SD: 2.5147071

---------------------------------------------------------------------------
 apigoal |      b         z     P>|z|    bStdX    bStdY   bStdXY      SDofX
---------+-----------------------------------------------------------------
   meals |  -0.04316   -3.507   0.000  -1.3796  -0.0172  -0.5486    31.9632
    aved |   1.29567    3.237   0.001   1.0042   0.5152   0.3993     0.7750
---------------------------------------------------------------------------

Example 3

Example 3 involves the use of blocked data, i.e., each observation consists of the number of positive responses (pos_var) and the number of observations in the population (pop_var). The syntax for bprobit looks like this:

bprobit  pos_var pop_var [predictors] [if exp] [in range] [, probit_options]

use http://www.gseis.ucla.edu/courses/data/ashford

describe

Contains data from http://www.gseis.ucla.edu/courses/data/ashford.dta
  obs:             9                          from Ashford & Snowden - 1970
 vars:             4                          15 Feb 2001 22:58
 size:           117 (100.0% of memory free)
-------------------------------------------------------------------------------
   1. age       byte   %8.0g                  
   2. pop       int    %8.0g                  population
   3. cases     int    %8.0g                  cases of breathlessness
   4. opro      float  %9.0g                  observed proportion
-------------------------------------------------------------------------------

bprobit cases pop age

Probit estimates                                  Number of obs   =      18282
                                                  LR chi2(1)      =    2346.44
                                                  Prob > chi2     =     0.0000
Log likelihood = -5980.1529                       Pseudo R2       =     0.1640

------------------------------------------------------------------------------
_outcome |      Coef.   Std. Err.       z     P>|z|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
     age |   .0550296   .0012746     43.173   0.000       .0525313    .0575278
   _cons |   -3.59412   .0619437    -58.022   0.000      -3.715528   -3.472713
------------------------------------------------------------------------------

predict pp, p

list age opro pp

          age       opro         pp 
  1.       22   .0076844   .0085751  
  2.       27   .0178671   .0175016  
  3.       32    .034548   .0333883  
  4.       37   .0600072   .0596135  
  5.       42   .0980651   .0997673  
  6.       47   .1491851   .1567919  
  7.       52   .2492823   .2319065  
  8.       57   .3188571   .3236793  
  9.       62   .4207746   .4276788
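The pp column is Φ evaluated at the fitted index for each age group. The Python sketch below recomputes it from the two coefficients in the bprobit output. The function name `predicted_proportion` is made up for this example, and the coefficients are used as printed, so the last digits may differ slightly from Stata's.

```python
import math

def norm_cdf(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

B_AGE = 0.0550296   # coefficient for age from the bprobit output
B_CONS = -3.59412   # constant from the bprobit output

def predicted_proportion(age):
    """Predicted probability of breathlessness at a given age."""
    return norm_cdf(B_CONS + B_AGE * age)

# The nine age groups in the data: 22, 27, ..., 62
for age in range(22, 63, 5):
    print(f"{age:3d}  {predicted_proportion(age):.7f}")
```

Comparing these values with the opro column shows how closely the single-predictor model tracks the observed proportions.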

Categorical Data Analysis Course

Phil Ender