Applied Categorical & Nonnormal Data Analysis

Poisson Models

Analysis of count data, while not new, has seen a tremendous increase in interest in the last 20 years. Along with this interest have come numerous improvements in the methods and software for analyzing these types of data. In this section we will cover Poisson models and negative binomial models for analyzing count data.

Poisson probabilities are used to model the number of occurrences (counts) of an event. One of the early recorded uses of the Poisson distribution was Bortkiewicz's 1898 study of the number of Prussian soldiers who were kicked to death by horses.

Here is the Poisson probability mass function,

\[ P(Y = y) = \frac{e^{-\lambda} \lambda^{y}}{y!}, \qquad y = 0, 1, 2, \ldots \]

with the single parameter λ. A Poisson distribution has a mean equal to λ and a variance equal to λ. Table 1 shows Poisson probabilities for λ = 1, 3 and 5. Table 1 is followed by the graph of the probabilities for the three lambdas.

                   Table 1

 y  lambda = 1   lambda = 3   lambda = 5
 0  0.36787945   0.04978707   0.00673795
 1  0.36787945   0.14936121   0.03368973
 2  0.18393973   0.22404180   0.08422434 
 3  0.06131324   0.22404180   0.14037390
 4  0.01532831   0.16803135   0.17546737  
 5  0.00306566   0.10081881   0.17546737
 6  0.00051094   0.05040941   0.14622281
 7  0.00007299   0.02160403   0.10444486
 8  0.00000912   0.00810151   0.06527804
 9  0.00000101   0.00270050   0.03626558
10  0.00000010   0.00081015   0.01813279
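
The entries in Table 1 follow directly from the Poisson probability P(Y = y) = e^(-λ) λ^y / y!. The short Python sketch below is not part of the original Stata session; it is only a cross-check, and the function name poisson_pmf is an illustrative choice.

```python
import math

def poisson_pmf(y, lam):
    """Probability of exactly y events when the Poisson mean is lam."""
    return math.exp(-lam) * lam ** y / math.factorial(y)

# Reprint Table 1: rows are y = 0..10, columns are lambda = 1, 3, 5.
for y in range(11):
    row = "   ".join(f"{poisson_pmf(y, lam):.8f}" for lam in (1, 3, 5))
    print(f"{y:2d}  {row}")
```

Each printed row should agree with the corresponding row of Table 1 up to rounding in the last digit.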

As lambda increases, the distribution shifts to the right. For large values of lambda the distribution is approximately normal.

Distributions in which the mean equals the variance exhibit equidispersion. When the variance is greater than the mean there is overdispersion. In practice, it is rare to find count data with equidispersion.

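The Poisson regression model can be estimated using maximum likelihood. If each count y_j (j = 1, ..., n) is Poisson with mean \mu_j, the likelihood function and log-likelihood function are

\[ L(\beta) = \prod_{j=1}^{n} \frac{e^{-\mu_j} \mu_j^{y_j}}{y_j!} \]

\[ \ln L(\beta) = \sum_{j=1}^{n} \left[ y_j \ln \mu_j - \mu_j - \ln(y_j!) \right] \]

In the Poisson regression model, the incidence rate or predicted count is given by

\[ \mu_j = E(y_j \mid x_j) = \exp(\beta_0 + \beta_1 x_{j1} + \cdots + \beta_k x_{jk}) \]

The incidence rate ratio is used to compare incidence rates. The incidence rate ratio for a one-unit change in x_i with all of the other variables in the model held constant is

\[ IRR_i = \frac{\mu(x_i + 1)}{\mu(x_i)} = e^{\beta_i} \]

The incidence rate ratio is the expected count for X+1 divided by the expected count for X.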
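
As a small illustration of these two ideas, the Python sketch below evaluates a Poisson log-likelihood with a log link and a single predictor, and then checks that the ratio of expected counts at X+1 and X equals exp(b). The data values, the coefficients and the function names are hypothetical, chosen only for illustration; they do not come from the lahigh data.

```python
import math

def expected_count(beta0, beta1, x):
    """Expected count under a log link: mu = exp(beta0 + beta1 * x)."""
    return math.exp(beta0 + beta1 * x)

def poisson_loglik(beta0, beta1, xs, ys):
    """Sum over observations of y*ln(mu) - mu - ln(y!)."""
    total = 0.0
    for x, y in zip(xs, ys):
        mu = expected_count(beta0, beta1, x)
        total += y * math.log(mu) - mu - math.lgamma(y + 1)  # lgamma(y+1) = ln(y!)
    return total

# Hypothetical data and coefficients, for illustration only.
xs = [0, 1, 2, 3]
ys = [5, 3, 2, 1]
beta0, beta1 = 1.5, -0.5

loglik = poisson_loglik(beta0, beta1, xs, ys)
irr = expected_count(beta0, beta1, 3) / expected_count(beta0, beta1, 2)

print(f"log-likelihood = {loglik:.4f}")
print(f"ratio of expected counts = {irr:.6f}, exp(beta1) = {math.exp(beta1):.6f}")
```

The ratio does not depend on the starting value of X, which is why a single incidence rate ratio can be reported for each predictor.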

Poisson Regression Example

We will illustrate Poisson regression using the lahigh data set. In particular, we would like to know whether there is a gender difference in days absent and how language NCE test scores relate to days absent. Note that for gender, 0 is female and 1 is male. Here is a histogram of days absent.

use http://www.gseis.ucla.edu/courses/data/lahigh

summarize gender langnce daysabs

    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------
      gender |     316    .4873418   .5006325          0          1
     langnce |     316    50.06379   17.93921   1.007114   98.99289
     daysabs |     316    5.810127   7.449003          0         45

summarize daysabs, detail

                         days absent
-------------------------------------------------------------
      Percentiles      Smallest
 1%            0              0
 5%            0              0
10%            0              0       Obs                 316
25%            1              0       Sum of Wgt.         316

50%            3                      Mean           5.810127
                        Largest       Std. Dev.      7.449003
75%            8             35
90%           14             35       Variance       55.48764
95%           23             41       Skewness       2.250587
99%           35             45       Kurtosis       8.949302
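
A quick way to read this summary is to compare the variance with the mean, since a Poisson distribution would have them equal. The Python lines below are only a side calculation on the two numbers reported above, not part of the Stata session.

```python
# Summary statistics for daysabs, copied from the Stata output above.
mean_daysabs = 5.810127
variance_daysabs = 55.48764

# Under a Poisson distribution this ratio would be close to 1.
dispersion_ratio = variance_daysabs / mean_daysabs
print(f"variance / mean = {dispersion_ratio:.2f}")
```

A ratio far above 1 indicates that the variance of daysabs is much greater than its mean.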


poisson daysabs gender langnce

Poisson regression                                Number of obs   =        316
                                                  LR chi2(2)      =     171.50
                                                  Prob > chi2     =     0.0000
Log likelihood = -1549.8567                       Pseudo R2       =     0.0524

------------------------------------------------------------------------------
     daysabs |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      gender |  -.4093528   .0482192    -8.49   0.000    -.5038606   -.3148449
     langnce |    -.01467   .0012934   -11.34   0.000    -.0172051   -.0121349
       _cons |   2.646977   .0697764    37.94   0.000     2.510217    2.783736
------------------------------------------------------------------------------

poisson, irr

Poisson regression                                Number of obs   =        316
                                                  LR chi2(2)      =     171.50
                                                  Prob > chi2     =     0.0000
Log likelihood = -1549.8567                       Pseudo R2       =     0.0524

------------------------------------------------------------------------------
     daysabs |        IRR   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      gender |   .6640799   .0320214    -8.49   0.000     .6041936    .7299021
     langnce |   .9854371   .0012746   -11.34   0.000      .982942    .9879384
------------------------------------------------------------------------------
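
The IRR table is the coefficient table on the exponential scale: each incidence rate ratio and each confidence limit is exp() of the corresponding value in the coefficient output. The Python sketch below checks this using the numbers copied from the two tables; the variable names are illustrative.

```python
import math

# (coefficient, lower 95% limit, upper 95% limit) from the poisson output.
coefs = {
    "gender":  (-0.4093528, -0.5038606, -0.3148449),
    "langnce": (-0.01467,   -0.0172051, -0.0121349),
}

# Exponentiate each value to move from the log scale to the ratio scale.
irr = {name: tuple(math.exp(v) for v in values) for name, values in coefs.items()}

for name, (ratio, low, high) in irr.items():
    print(f"{name:8s} IRR = {ratio:.7f}   95% CI = [{low:.7f}, {high:.7f}]")
```

The z statistics and p-values are unchanged between the two tables because they test the same hypothesis, b = 0, which is equivalent to IRR = 1.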

listcoef  /* downloaded from Stata over the Internet */

poisson (N=316): Factor Change in Expected Count 

 Observed SD: 7.4490028

------------------------------------------------------------------
 daysabs |      b         z     P>|z|    e^b    e^bStdX      SDofX
---------+--------------------------------------------------------
  gender |  -0.40935   -8.489   0.000   0.6641   0.8147     0.5006
 langnce |  -0.01467  -11.342   0.000   0.9854   0.7686    17.9392
------------------------------------------------------------------

listcoef, percent

poisson (N=316): Percentage Change in Expected Count 

 Observed SD: 7.4490028

----------------------------------------------------------------------
     daysabs |      b         z     P>|z|      %      %StdX      SDofX
-------------+--------------------------------------------------------
      gender |  -0.40935   -8.489   0.000    -33.6    -18.5     0.5006
     langnce |  -0.01467  -11.342   0.000     -1.5    -23.1    17.9392
----------------------------------------------------------------------

Interpretation

From the incidence rate ratios, being male changes the expected number of days absent by a factor of .66, or equivalently, by 100*(.664-1)% = -33.6%. For each one-point increase in the language normal curve equivalent (NCE) score, the expected number of days absent decreases by a factor of .985 (or 100*(.985-1)% = -1.5%) when the other variables are held constant.

The listcoef command also provides standardized factor changes. For a one standard deviation increase (approximately 18 points) in language NCE, the expected number of days absent decreases by a factor of .77 (100*(.77-1)% = -23%), with the other variables in the model held constant.
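
The factor and percentage changes reported by listcoef can be recomputed from b and SDofX alone: the factor change for a delta-unit increase is exp(b * delta), and the percentage change is 100 * (exp(b * delta) - 1). The Python sketch below uses the values printed in the listcoef tables; the helper name percent_change is hypothetical.

```python
import math

def percent_change(b, delta=1.0):
    """Percent change in the expected count for a delta-unit increase in x."""
    return 100 * (math.exp(b * delta) - 1)

# b and SDofX copied from the listcoef output.
b_gender, sd_gender = -0.40935, 0.5006
b_langnce, sd_langnce = -0.01467, 17.9392

for name, b, sd in (("gender", b_gender, sd_gender),
                    ("langnce", b_langnce, sd_langnce)):
    print(f"{name:8s} e^b = {math.exp(b):.4f}  e^bStdX = {math.exp(b * sd):.4f}  "
          f"% = {percent_change(b):.1f}  %StdX = {percent_change(b, sd):.1f}")
```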

Another way of interpreting the model is to look at the marginal effects, also known as the partial change in the expected value.

mfx compute, at(mean)

Marginal effects after poisson
      y  = predicted number of events (predict)
         =  5.5458276
------------------------------------------------------------------------------
variable |      dy/dx    Std. Err.     z    P>|z|  [    95% C.I.   ]      X
---------+--------------------------------------------------------------------
  gender*|  -2.274269      .26514   -8.58   0.000  -2.79393  -1.7546   .487342
 langnce |  -.0813573      .00695  -11.70   0.000  -.094982 -.067733   50.0638
------------------------------------------------------------------------------
(*) dy/dx is for discrete change of dummy variable from 0 to 1

mfx compute, at(mean langnce=60)

Marginal effects after poisson
      y  = predicted number of events (predict)
         =     4.7936
------------------------------------------------------------------------------
variable |      dy/dx    Std. Err.     z    P>|z|  [    95% C.I.   ]      X
---------+--------------------------------------------------------------------
  gender*|   -1.96579      .22735   -8.65   0.000  -2.41139  -1.5202   .487342
 langnce |  -.0703221      .00515  -13.67   0.000  -.080408 -.060236   60.0000
------------------------------------------------------------------------------
(*) dy/dx is for discrete change of dummy variable from 0 to 1
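
These marginal effects can be recovered from the coefficients. For a continuous predictor the partial change is b times the predicted count, and for the dummy variable it is the difference between the predicted counts at 1 and at 0. The Python sketch below assumes the coefficients from the poisson output above and holds gender at its mean when computing the predicted count; the function names are illustrative.

```python
import math

# Coefficients from the poisson output; gender mean from summarize.
B_CONS, B_GENDER, B_LANGNCE = 2.646977, -0.4093528, -0.01467
GENDER_MEAN = 0.4873418

def expected_count(gender, langnce):
    """Predicted number of days absent under the fitted model."""
    return math.exp(B_CONS + B_GENDER * gender + B_LANGNCE * langnce)

def marginal_effects(langnce):
    mu = expected_count(GENDER_MEAN, langnce)
    dydx_langnce = B_LANGNCE * mu  # partial change for a continuous predictor
    # Discrete change for the dummy variable, from 0 to 1.
    dydx_gender = expected_count(1, langnce) - expected_count(0, langnce)
    return mu, dydx_gender, dydx_langnce

for langnce in (50.06379, 60.0):
    mu, d_gender, d_langnce = marginal_effects(langnce)
    print(f"langnce = {langnce:8.5f}  y = {mu:.4f}  "
          f"gender = {d_gender:.4f}  langnce = {d_langnce:.5f}")
```

Because the predicted count appears in both effects, the marginal effects shrink in absolute value as langnce rises and the predicted count falls.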

Finally, we will look at the Poisson goodness of fit. Ordinarily one would check the fit before interpreting the model, but we needed to take some time first to discuss how one goes about interpreting a Poisson model.

poisgof

         Goodness-of-fit chi2  =  2238.317
         Prob > chi2(313)      =    0.0000
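
To get a feel for how large this statistic is, recall that a chi-square variable with k degrees of freedom has mean k and variance 2k. The Python lines below are a rough side calculation, not a replacement for the reported p-value: they compare the statistic with its degrees of freedom and express the gap in standard deviations.

```python
import math

n_obs, n_params = 316, 3     # observations; two predictors plus the constant
chi2 = 2238.317              # goodness-of-fit chi2 reported by poisgof

df = n_obs - n_params
ratio = chi2 / df                     # near 1 when the model fits well
z = (chi2 - df) / math.sqrt(2 * df)   # distance from the mean in SD units

print(f"df = {df}, chi2/df = {ratio:.2f}, z = {z:.1f}")
```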

The large chi-square suggests that the Poisson regression model does not fit well. This could be because the explanatory variables are not very good or because the Poisson model is not appropriate. We saw earlier that the variance of daysabs (55.49) was much greater than the mean (5.81), which suggests overdispersion. We will use nbvargr to compare the fit of the Poisson and negative binomial distributions.

nbvargr daysabs  /* downloaded over the Internet */

Obtaining Parameter Estimates

(23 observations deleted)

  Negative Binomial Probabilities
  with mean = 5.810127 & overdispersion = 1.397268

        k     nbprob      nbcum 
  1.    0  0.20559212  0.20559211  
  2.    1  0.13100202  0.33659413  
  3.    2  0.10005438  0.43664852  
  4.    3  0.08063899  0.51728749  
  5.    4  0.06669218  0.58397967  
  6.    5  0.05600163  0.63998133  
  7.    6  0.04749728  0.68747860  
  8.    7  0.04057066  0.72804928  
  9.    8  0.03483756  0.76288682  
 10.    9  0.03003709  0.79292393  
 11.   10  0.02598259  0.81890649  

 Poisson Probabilities for lambda = 5.810127

        k      pprob       pcum 
  1.    0  0.00299705  0.00299705  
  2.    1  0.01741324  0.02041029  
  3.    2  0.05058656  0.07099685  
  4.    3  0.09797145  0.16896829  
  5.    4  0.14230664  0.31127495  
  6.    5  0.16536394  0.47663888  
  7.    6  0.16013090  0.63676977  
  8.    7  0.13291156  0.76968133  
  9.    8  0.09652913  0.86621046  
 10.    9  0.06231628  0.92852676  
 11.   10  0.03620655  0.96473330
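
The two probability columns can be recomputed from the reported mean and overdispersion. The Python sketch below assumes the common negative binomial parameterization in which the variance is mu + alpha * mu^2; under that assumption the values agree with the nbprob column to rounding. The function names are illustrative and the code is not part of nbvargr.

```python
import math

def poisson_pmf(k, lam):
    """Poisson probability of k events with mean lam (computed on the log scale)."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def negbin_pmf(k, mu, alpha):
    """Negative binomial probability with mean mu and overdispersion alpha.

    Assumes variance = mu + alpha * mu**2.
    """
    r = 1.0 / alpha
    p = r / (r + mu)
    log_prob = (math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
                + r * math.log(p) + k * math.log(1 - p))
    return math.exp(log_prob)

mu, alpha = 5.810127, 1.397268   # values reported by nbvargr

print(" k      nbprob       pprob")
for k in range(11):
    print(f"{k:2d}  {negbin_pmf(k, mu, alpha):.8f}  {poisson_pmf(k, mu):.8f}")
```

The negative binomial puts about 21% of its mass at zero, while the Poisson with the same mean puts about 0.3% there, which is the main visible difference between the two columns.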

As we suspected, the Poisson distribution did not do a good job of approximating daysabs. Overdispersion is very common in "real" data; the Poisson distribution, which works well in theory, often does not perform all that well in practice. The negative binomial model looks to be a much better fit.

Categorical Data Analysis Course

Phil Ender