Analysis of count data, while not new, has seen a tremendous increase in interest in the last 20 years. Along with this increase in interest there have been numerous improvements in the methods and software for analyzing these types of data. In this section we will cover Poisson models and negative binomial models for analyzing count data.
Poisson Models
Poisson probabilities are used to model the number of occurrences (counts) of an event. One of the early recorded uses of the Poisson distribution was the 1898 study by Ladislaus Bortkiewicz investigating the number of Prussian soldiers who were kicked to death by horses.
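Here is the Poisson probability function, where lambda is the expected count and y = 0, 1, 2, ... is the observed count,

$$P(Y = y \mid \lambda) = \frac{e^{-\lambda}\,\lambda^{y}}{y!}$$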
Table 1

     y    lambda = 1    lambda = 3    lambda = 5
     0    0.36787945    0.04978707    0.00673795
     1    0.36787945    0.14936121    0.03368973
     2    0.18393973    0.22404180    0.08422434
     3    0.06131324    0.22404180    0.14037390
     4    0.01532831    0.16803135    0.17546737
     5    0.00306566    0.10081881    0.17546737
     6    0.00051094    0.05040941    0.14622281
     7    0.00007299    0.02160403    0.10444486
     8    0.00000912    0.00810151    0.06527804
     9    0.00000101    0.00270050    0.03626558
    10    0.00000010    0.00081015    0.01813279

As lambda increases, the distribution shifts to the right. For large values of lambda the distribution is approximately normal.
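The entries of Table 1 can be recomputed directly from the Poisson probability function, P(Y = y) = exp(-lambda) * lambda^y / y!. The following is a minimal Python sketch using only the standard library; the function name `poisson_pmf` and the `table` dictionary are our own illustrative choices. The results should agree with Table 1 to roughly seven decimal places (the last digit of the table may differ because of floating-point precision in the original output).

```python
import math

def poisson_pmf(y, lam):
    """Poisson probability of observing count y when the mean is lam."""
    return math.exp(-lam) * lam ** y / math.factorial(y)

# Same values of lambda as the columns of Table 1
lambdas = (1, 3, 5)
table = {lam: [poisson_pmf(y, lam) for y in range(11)] for lam in lambdas}

# Print one row per count y, one column per lambda
for y in range(11):
    print(f"{y:2d}  " + "  ".join(f"{table[lam][y]:.8f}" for lam in lambdas))
```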
Distributions in which the mean equals the variance exhibit equidispersion. The Poisson distribution has this property: both its mean and its variance equal lambda. When the variance is greater than the mean there is overdispersion. In practice, it is rare to find count data with equidispersion.
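The Poisson regression model can be estimated using maximum likelihood, with the following likelihood function and log-likelihood function, where the expected count for observation i is lambda_i = exp(x_i beta).

$$L(\beta) = \prod_{i=1}^{N} \frac{e^{-\lambda_i}\,\lambda_i^{y_i}}{y_i!}, \qquad \lambda_i = \exp(x_i\beta)$$

$$\ln L(\beta) = \sum_{i=1}^{N} \left[\, y_i\,x_i\beta - \exp(x_i\beta) - \ln(y_i!) \,\right]$$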
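To make the log-likelihood concrete, here is a small Python sketch that evaluates it for a Poisson model with a log link, that is, with expected count exp(x_i beta). The function name `poisson_loglik` and the toy data are hypothetical and are not part of the lahigh example. For an intercept-only model the maximum-likelihood estimate is known to be the log of the sample mean, so the sketch checks that the log-likelihood is lower on either side of that value.

```python
import math

def poisson_loglik(beta, X, y):
    """Poisson log-likelihood with log link: sum of y*xb - exp(xb) - ln(y!)."""
    total = 0.0
    for row, count in zip(X, y):
        xb = sum(b * x for b, x in zip(beta, row))   # linear predictor x_i * beta
        total += count * xb - math.exp(xb) - math.lgamma(count + 1)
    return total

# Hypothetical toy data: a constant term only, four observed counts
X = [[1.0], [1.0], [1.0], [1.0]]
y = [0, 1, 2, 5]

# For an intercept-only model the MLE is log(mean of y)
beta_hat = [math.log(sum(y) / len(y))]
ll_hat = poisson_loglik(beta_hat, X, y)
ll_lower = poisson_loglik([beta_hat[0] - 0.1], X, y)
ll_upper = poisson_loglik([beta_hat[0] + 0.1], X, y)
print(ll_hat, ll_lower, ll_upper)
```

In practice the maximization is done numerically by the software; the sketch only evaluates the function being maximized.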
Poisson Regression Example
We will illustrate Poisson regression using the lahigh data set. In particular, we would like to know whether there is a gender difference in days absent and how language NCE (normal curve equivalent) test scores relate to days absent. Note that for gender, 0 is female and 1 is male. Here is a histogram of days absent.

[Figure: histogram of days absent]
    use http://www.gseis.ucla.edu/courses/data/lahigh

    summarize gender langnce daysabs

        Variable |     Obs        Mean   Std. Dev.       Min        Max
    -------------+-----------------------------------------------------
          gender |     316    .4873418   .5006325          0          1
         langnce |     316    50.06379   17.93921   1.007114   98.99289
         daysabs |     316    5.810127   7.449003          0         45

    summarize daysabs, detail

                             days absent
    -------------------------------------------------------------
          Percentiles      Smallest
     1%            0              0
     5%            0              0
    10%            0              0       Obs                 316
    25%            1              0       Sum of Wgt.         316

    50%            3                      Mean           5.810127
                            Largest       Std. Dev.      7.449003
    75%            8             35
    90%           14             35       Variance       55.48764
    95%           23             41       Skewness       2.250587
    99%           35             45       Kurtosis       8.949302

    poisson daysabs gender langnce

    Poisson regression                                Number of obs   =        316
                                                      LR chi2(2)      =     171.50
                                                      Prob > chi2     =     0.0000
    Log likelihood = -1549.8567                       Pseudo R2       =     0.0524

    ------------------------------------------------------------------------------
         daysabs |      Coef.   Std. Err.       z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
          gender |  -.4093528   .0482192    -8.49   0.000    -.5038606   -.3148449
         langnce |    -.01467   .0012934   -11.34   0.000    -.0172051   -.0121349
           _cons |   2.646977   .0697764    37.94   0.000     2.510217    2.783736
    ------------------------------------------------------------------------------

    poisson, irr

    Poisson regression                                Number of obs   =        316
                                                      LR chi2(2)      =     171.50
                                                      Prob > chi2     =     0.0000
    Log likelihood = -1549.8567                       Pseudo R2       =     0.0524

    ------------------------------------------------------------------------------
         daysabs |        IRR   Std. Err.       z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
          gender |   .6640799   .0320214    -8.49   0.000     .6041936    .7299021
         langnce |   .9854371   .0012746   -11.34   0.000      .982942    .9879384
    ------------------------------------------------------------------------------

    listcoef /* downloaded from Stata over the Internet */

    poisson (N=316): Factor Change in Expected Count

      Observed SD: 7.4490028

    ------------------------------------------------------------------
     daysabs |        b         z     P>|z|     e^b   e^bStdX    SDofX
    ---------+--------------------------------------------------------
      gender | -0.40935    -8.489     0.000  0.6641    0.8147   0.5006
     langnce | -0.01467   -11.342     0.000  0.9854    0.7686  17.9392
    ------------------------------------------------------------------

    listcoef, percent

    poisson (N=316): Percentage Change in Expected Count

      Observed SD: 7.4490028

    ----------------------------------------------------------------------
         daysabs |        b         z     P>|z|       %     %StdX    SDofX
    -------------+--------------------------------------------------------
          gender | -0.40935    -8.489     0.000   -33.6     -18.5   0.5006
         langnce | -0.01467   -11.342     0.000    -1.5     -23.1  17.9392
    ----------------------------------------------------------------------

Interpretation
From the incidence rate ratios, being male multiplies the expected number of days absent by a factor of .664; equivalently, it changes the expected number by 100*(.664-1)% = -33.6%, holding language NCE constant. And, for each one-point increase in the language normal curve equivalent score, the expected number of days absent decreases by a factor of .985 (or 100*(.985-1)% = -1.5%) when the other variables are held constant.
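The incidence rate ratios and percentage changes are simple transformations of the coefficients: IRR = exp(b) and percent change = 100*(exp(b) - 1). As a quick check, the Python sketch below applies these transformations to the coefficients copied from the poisson output; the dictionary names are our own.

```python
import math

# Coefficients copied from the poisson output above
coefs = {"gender": -0.4093528, "langnce": -0.01467}

# Incidence rate ratio: factor change in the expected count per unit change in x
irr = {name: math.exp(b) for name, b in coefs.items()}

# Percentage change in the expected count per unit change in x
pct = {name: 100 * (ratio - 1) for name, ratio in irr.items()}

for name in coefs:
    print(f"{name:8s} IRR = {irr[name]:.4f}   percent change = {pct[name]:+.1f}%")
```

The values should match the IRR column of poisson, irr and the % column of listcoef, percent.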
The listcoef command also provides the standardized factor change. For a one standard deviation increase (approximately 18 points) in language NCE, the expected number of days absent would decrease by a factor of .77 (100*(.77-1)% = -23%) with the other variables in the model held constant.
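The standardized factor change is obtained by scaling the coefficient by the standard deviation of the predictor before exponentiating, exp(b * SD). The short Python sketch below reproduces the e^bStdX column from the b and SDofX columns of the listcoef output; the variable names are illustrative only.

```python
import math

# b and SDofX copied from the listcoef output above
b_langnce, sd_langnce = -0.01467, 17.9392
b_gender, sd_gender = -0.40935, 0.5006

# Factor change in the expected count for a one standard deviation increase in x
std_factor_langnce = math.exp(b_langnce * sd_langnce)
std_factor_gender = math.exp(b_gender * sd_gender)

print(f"langnce: {std_factor_langnce:.4f}  ({100 * (std_factor_langnce - 1):+.1f}%)")
print(f"gender:  {std_factor_gender:.4f}  ({100 * (std_factor_gender - 1):+.1f}%)")
```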
Another way of interpreting the model is to look at the marginal effects, also known as the partial change in the expected value.
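For a Poisson model with a log link, the expected count is exp(xb), so the marginal effect of a continuous predictor is its coefficient times the expected count, and the effect of a dummy variable is the difference in expected counts between its two values. The Python sketch below works this out by hand from the coefficients and means reported earlier; the function `expected_count` is a hypothetical helper, and the results can be compared with the mfx output.

```python
import math

# Coefficients from the poisson output and means from summarize
b0, b_gender, b_langnce = 2.646977, -0.4093528, -0.01467
mean_gender, mean_langnce = 0.4873418, 50.06379

def expected_count(gender, langnce):
    """Predicted number of days absent, exp(xb)."""
    return math.exp(b0 + b_gender * gender + b_langnce * langnce)

# Predicted count with both predictors held at their means
mu_at_means = expected_count(mean_gender, mean_langnce)

# Continuous predictor: dy/dx = coefficient * expected count
dydx_langnce = b_langnce * mu_at_means

# Dummy predictor: discrete change from 0 to 1, langnce held at its mean
discrete_gender = expected_count(1, mean_langnce) - expected_count(0, mean_langnce)

# Same prediction with langnce set to 60 instead of its mean
mu_at_60 = expected_count(mean_gender, 60)

print(mu_at_means, dydx_langnce, discrete_gender, mu_at_60)
```

Because the expected count enters the marginal effect, the effect of language NCE is smaller in absolute value at langnce = 60, where fewer absences are predicted, than at the mean.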
    mfx compute, at(mean)

    Marginal effects after poisson
          y  = predicted number of events (predict)
             =  5.5458276
    ------------------------------------------------------------------------------
    variable |      dy/dx    Std. Err.     z    P>|z|  [    95% C.I.   ]      X
    ---------+--------------------------------------------------------------------
      gender*|  -2.274269      .26514   -8.58   0.000  -2.79393  -1.7546   .487342
     langnce |  -.0813573      .00695  -11.70   0.000  -.094982 -.067733   50.0638
    ------------------------------------------------------------------------------
    (*) dy/dx is for discrete change of dummy variable from 0 to 1

    mfx compute, at(mean langnce=60)

    Marginal effects after poisson
          y  = predicted number of events (predict)
             =     4.7936
    ------------------------------------------------------------------------------
    variable |      dy/dx    Std. Err.     z    P>|z|  [    95% C.I.   ]      X
    ---------+--------------------------------------------------------------------
      gender*|   -1.96579      .22735   -8.65   0.000  -2.41139  -1.5202   .487342
     langnce |  -.0703221      .00515  -13.67   0.000  -.080408 -.060236   60.0000
    ------------------------------------------------------------------------------
    (*) dy/dx is for discrete change of dummy variable from 0 to 1

Finally, we will look at the Poisson goodness of fit. We should have looked at it earlier before trying to interpret the model but we needed to take some time to discuss how one goes about interpreting a Poisson model.
    poisgof

    Goodness-of-fit chi2  =  2238.317
    Prob > chi2(313)      =    0.0000

The large chi-square suggests that the Poisson regression model does not fit very well. This could either be because the explanatory variables are not very good or because the Poisson model is not appropriate. We saw earlier that the variance of daysabs (55.49) was much greater than the mean (5.81). This suggests that there is overdispersion. We will use nbvargr to compare the fit for Poisson versus negative binomial models.
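Before making that comparison, the amount of overdispersion can be put into rough numbers using figures already reported. The Python sketch below computes the variance-to-mean ratio of daysabs and the goodness-of-fit chi-square per degree of freedom. Under equidispersion the first ratio would be near 1, and for a well-fitting Poisson model the second would also be close to 1. Note that the variance-to-mean ratio ignores the predictors, so it is only an informal indication.

```python
# Figures copied from the summarize and poisgof output above
mean_daysabs = 5.810127
var_daysabs = 55.48764
gof_chi2, gof_df = 2238.317, 313

# Unconditional variance-to-mean ratio; about 1 under equidispersion
dispersion_ratio = var_daysabs / mean_daysabs

# Goodness-of-fit chi-square per degree of freedom; about 1 for a good fit
chi2_per_df = gof_chi2 / gof_df

print(f"variance / mean = {dispersion_ratio:.2f}")
print(f"chi2 / df       = {chi2_per_df:.2f}")
```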
    nbvargr daysabs /* downloaded over the Internet */

    Obtaining Parameter Estimates

    (23 observations deleted)

    Negative Binomial Probabilities with mean = 5.810127 & overdispersion = 1.397268

             k      nbprob       nbcum
      1.     0  0.20559212  0.20559211
      2.     1  0.13100202  0.33659413
      3.     2  0.10005438  0.43664852
      4.     3  0.08063899  0.51728749
      5.     4  0.06669218  0.58397967
      6.     5  0.05600163  0.63998133
      7.     6  0.04749728  0.68747860
      8.     7  0.04057066  0.72804928
      9.     8  0.03483756  0.76288682
     10.     9  0.03003709  0.79292393
     11.    10  0.02598259  0.81890649

    Poisson Probabilities for lambda = 5.810127

             k       pprob        pcum
      1.     0  0.00299705  0.00299705
      2.     1  0.01741324  0.02041029
      3.     2  0.05058656  0.07099685
      4.     3  0.09797145  0.16896829
      5.     4  0.14230664  0.31127495
      6.     5  0.16536394  0.47663888
      7.     6  0.16013090  0.63676977
      8.     7  0.13291156  0.76968133
      9.     8  0.09652913  0.86621046
     10.     9  0.06231628  0.92852676
     11.    10  0.03620655  0.96473330

As we suspected, the Poisson model did not do a good job of approximating daysabs. For example, the Poisson distribution assigns a probability of only 0.003 to zero days absent and the negative binomial assigns 0.206, while in the data at least 10% of the students had no absences (the 10th percentile of daysabs is 0). Overdispersion is very common in "real" data, so the Poisson distribution, which works well in theory, often does not perform all that well in practice. The negative binomial model looks to be a much better fit.
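The two sets of probabilities can be recomputed from the reported mean and overdispersion. The Python sketch below assumes that the overdispersion value reported by nbvargr is the usual negative binomial parameter alpha, with variance = mean + alpha*mean^2; this is our assumption about the command, although it does reproduce the nbprob column. The function names are our own.

```python
import math

# Mean and overdispersion copied from the nbvargr output above
mu = 5.810127
alpha = 1.397268   # assumed parameterization: variance = mu + alpha * mu**2

def poisson_pmf(k, lam):
    """Poisson probability of count k with mean lam (computed on the log scale)."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def negbin_pmf(k, mean, alpha):
    """Negative binomial probability of count k with the given mean and alpha."""
    r = 1.0 / alpha
    log_p = (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
             + r * math.log(r / (r + mean))
             + k * math.log(mean / (r + mean)))
    return math.exp(log_p)

# One row per count: k, negative binomial probability, Poisson probability
rows = [(k, negbin_pmf(k, mu, alpha), poisson_pmf(k, mu)) for k in range(11)]
for k, nb, po in rows:
    print(f"{k:2d}  {nb:.8f}  {po:.8f}")
```

With the same mean, the negative binomial places far more probability on zero and on small counts, and correspondingly more in the long right tail, which is the pattern seen in daysabs.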
Categorical Data Analysis Course
Phil Ender