Generalized Linear Models
Most students are introduced to linear models through either multiple regression or analysis of variance. In these methods the expected value of the response variable is modeled directly, that is, it is expressed as a linear combination of the explanatory variables. With categorical and count response variables, the relationship between the expected value and the explanatory variables generally cannot be linear. This nonlinearity is handled by applying a nonlinear function to the expected value of the categorical or count variable so that the transformed value can be expressed as a linear function of the explanatory variables. Such transformations are referred to as link functions.
For example, in the analysis of count data, the expected frequencies must be nonnegative. To ensure that the predicted values from the linear model satisfy this constraint, the log link is used to transform the expected value of the response variable. This loglinear transformation serves two purposes: it ensures that the fitted values are appropriate for count data, and it permits the unknown regression parameters to range over the entire real line.
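To make the log link concrete, consider a single explanatory variable x (a purely hypothetical illustration, not one of the variables used in the examples below). On the link scale and on the count scale the model is:

ln(y) = b0 + b1x,   or equivalently   y = exp(b0 + b1x)

Because exp() of any real number is positive, the fitted counts can never be negative, no matter what real values b0 and b1 take.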
Different types of response variables call for different link functions: the logit and probit link functions are used with binomial response variables, while the log link function is used with both poisson and negative binomial response variables. Growing out of the work of Nelder & Wedderburn (1972) and McCullagh & Nelder (1989), generalized linear models provide a unified framework that encompasses these various 'linear' models.
Generalized linear models take the form:

g(y) = b0 + b1x1 + b2x2 + ... + bkxk

where the response y follows a distribution from the exponential family and g() is the link function applied to the expected value of y. For example, OLS regression is a glm in which the distribution family is gaussian, i.e., y -> {gaussian}, and the link function is the identity, i.e., g(y) = y.
You might recognize this example more easily if it were rewritten as follows:

y = b0 + b1x1 + b2x2 + ... + bkxk + e,  where e -> {gaussian}
Another example is poisson regression, in which the distribution family is poisson, i.e., y -> {poisson}, and the link function is the natural log, i.e., g(y) = ln(y). The glm model would then be written as,

ln(y) = b0 + b1x1 + b2x2 + ... + bkxk
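As a minimal Stata sketch of the same idea (y, x1 and x2 are placeholder names, not variables from the datasets used below), the poisson glm can be fit and its fitted means recovered on the original count scale:

glm y x1 x2, fam(poisson) link(log)
predict muhat, mu        /* fitted means exp(xb), always nonnegative */

The mu option of predict requests the expected value of y rather than the linear predictor xb.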
Here are examples of distributions and link functions for some common estimation procedures:
type of estimation         distribution family    link function
------------------         -------------------    -------------
OLS regression             gaussian               identity
logistic regression        binomial               logit
probit                     binomial               probit
cloglog                    binomial               cloglog
poisson regression         poisson                log
neg binomial regression    neg binomial           log
An OLS regression would look like this using regress and glm:
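For instance, with a hypothetical continuous response y and predictors x1, x2 and x3 (placeholder names), the two equivalent commands would be:

regress y x1 x2 x3
glm y x1 x2 x3, link(iden) fam(gauss)

Both fit the same gaussian/identity model; the hsb2 example below shows the matching output from each command.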
                    iden    log     logit   probit  cloglog nbinom  power   opower  loglog  logc
gaussian            X       X                                       X
inverse gaussian    X       X                                       X
binomial            X       X       X       X       X               X       X       X       X
poisson             X       X                                       X
negative binomial   X       X                               X       X
gamma               X       X                                       X
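Reading across a row of this matrix shows which glm calls are legal for each family. For example, the binomial row also allows the complementary log-log link, so a cloglog model for a hypothetical binary outcome y could be requested with:

glm y x1 x2, fam(bin) link(cloglog)

(Here y, x1 and x2 are placeholder names; the examples below stick to the identity, logit, probit and log links.)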
Examples
use http://www.gseis.ucla.edu/courses/data/hsb2
generate hon = write>=60

regress write read math female

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  3,   196) =   72.52
       Model |  9405.34864     3  3135.11621           Prob > F      =  0.0000
    Residual |  8473.52636   196  43.2322773           R-squared     =  0.5261
-------------+------------------------------           Adj R-squared =  0.5188
       Total |   17878.875   199   89.843593           Root MSE      =  6.5751

------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        read |   .3252389   .0607348     5.36   0.000     .2054613    .4450166
        math |   .3974826   .0664037     5.99   0.000      .266525    .5284401
      female |    5.44337   .9349987     5.82   0.000      3.59942    7.287319
       _cons |   11.89566   2.862845     4.16   0.000     6.249728     17.5416
------------------------------------------------------------------------------

glm write read math female, link(iden) fam(gauss) nolog

Generalized linear models                          No. of obs      =       200
Optimization     : ML: Newton-Raphson              Residual df     =       196
                                                   Scale parameter =  43.23228
Deviance         =  8473.526357                    (1/df) Deviance =  43.23228
Pearson          =  8473.526357                    (1/df) Pearson  =  43.23228

Variance function: V(u) = 1                        [Gaussian]
Link function    : g(u) = u                        [Identity]
Standard errors  : OIM

Log likelihood   = -658.4261736                    AIC             =  6.624262
BIC              =  7435.056153

------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        read |   .3252389   .0607348     5.36   0.000     .2062009     .444277
        math |   .3974826   .0664037     5.99   0.000     .2673336    .5276315
      female |    5.44337   .9349987     5.82   0.000     3.610806    7.275934
       _cons |   11.89566   2.862845     4.16   0.000      6.28459    17.50674
------------------------------------------------------------------------------

logit hon read math female, nolog

Logit estimates                                    Number of obs   =       200
                                                   LR chi2(3)      =     80.87
                                                   Prob > chi2     =    0.0000
Log likelihood = -75.209827                        Pseudo R2       =    0.3496

------------------------------------------------------------------------------
         hon |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        read |   .0752424    .027577     2.73   0.006     .0211924    .1292924
        math |   .1317117   .0324607     4.06   0.000       .06809    .1953335
      female |   1.154801   .4340856     2.66   0.008      .304009    2.005593
       _cons |  -13.12749   1.850769    -7.09   0.000    -16.75493    -9.50005
------------------------------------------------------------------------------

logit, or

Logit estimates                                    Number of obs   =       200
                                                   LR chi2(3)      =     80.87
                                                   Prob > chi2     =    0.0000
Log likelihood = -75.209827                        Pseudo R2       =    0.3496

------------------------------------------------------------------------------
         hon | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        read |   1.078145   .0297321     2.73   0.006     1.021419    1.138023
        math |   1.140779   .0370305     4.06   0.000     1.070462    1.215716
      female |   3.173393   1.377524     2.66   0.008     1.355281    7.430502
------------------------------------------------------------------------------

glm hon read math female, link(logit) fam(bin) nolog

Generalized linear models                          No. of obs      =       200
Optimization     : ML: Newton-Raphson              Residual df     =       196
                                                   Scale parameter =         1
Deviance         =  150.4196543                    (1/df) Deviance =  .7674472
Pearson          =  164.2509104                    (1/df) Pearson  =  .8380148

Variance function: V(u) = u*(1-u)                  [Bernoulli]
Link function    : g(u) = ln(u/(1-u))              [Logit]
Standard errors  : OIM

Log likelihood   = -75.20982717                    AIC             =  .7920983
BIC              = -888.0505495

------------------------------------------------------------------------------
         hon |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        read |   .0752424   .0275779     2.73   0.006     .0211906    .1292941
        math |   .1317117   .0324623     4.06   0.000     .0680869    .1953366
      female |   1.154801   .4341012     2.66   0.008     .3039785    2.005624
       _cons |  -13.12749   1.850893    -7.09   0.000    -16.75517   -9.499808
------------------------------------------------------------------------------

glm, eform

Generalized linear models                          No. of obs      =       200
Optimization     : ML: Newton-Raphson              Residual df     =       196
                                                   Scale parameter =         1
Deviance         =  150.4196543                    (1/df) Deviance =  .7674472
Pearson          =  164.2509104                    (1/df) Pearson  =  .8380148

Variance function: V(u) = u*(1-u)                  [Bernoulli]
Link function    : g(u) = ln(u/(1-u))              [Logit]
Standard errors  : OIM

Log likelihood   = -75.20982717                    AIC             =  .7920983
BIC              = -888.0505495

------------------------------------------------------------------------------
         hon | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        read |   1.078145    .029733     2.73   0.006     1.021417    1.138025
        math |   1.140779   .0370323     4.06   0.000     1.070458     1.21572
      female |   3.173393   1.377573     2.66   0.008      1.35524    7.430728
------------------------------------------------------------------------------

probit hon read math female, nolog

Probit estimates                                   Number of obs   =       200
                                                   LR chi2(3)      =     81.80
                                                   Prob > chi2     =    0.0000
Log likelihood = -74.745943                        Pseudo R2       =    0.3537

------------------------------------------------------------------------------
         hon |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        read |   .0473262   .0157561     3.00   0.003     .0164449    .0782076
        math |   .0735256   .0173216     4.24   0.000     .0395759    .1074754
      female |   .6824682   .2447275     2.79   0.005     .2028112    1.162125
       _cons |  -7.663304   .9921289    -7.72   0.000    -9.607841   -5.718767
------------------------------------------------------------------------------

glm hon read math female, link(probit) fam(bin) nolog

Generalized linear models                          No. of obs      =       200
Optimization     : ML: Newton-Raphson              Residual df     =       196
                                                   Scale parameter =         1
Deviance         =  149.4918859                    (1/df) Deviance =  .7627137
Pearson          =  160.9679286                    (1/df) Pearson  =  .8212649

Variance function: V(u) = u*(1-u)                  [Bernoulli]
Link function    : g(u) = invnorm(u)               [Probit]
Standard errors  : OIM

Log likelihood   = -74.74594294                    AIC             =  .7874594
BIC              = -888.978318

------------------------------------------------------------------------------
         hon |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        read |   .0473262   .0157561     3.00   0.003     .0164448    .0782077
        math |   .0735256   .0173217     4.24   0.000     .0395758    .1074755
      female |   .6824681   .2447281     2.79   0.005     .2028098    1.162126
       _cons |  -7.663303   .9921345    -7.72   0.000    -9.607851   -5.718755
------------------------------------------------------------------------------

use http://www.gseis.ucla.edu/courses/data/lahigh, clear

poisson daysabs langnce gender, nolog

Poisson regression                                 Number of obs   =       316
                                                   LR chi2(2)      =    171.50
                                                   Prob > chi2     =    0.0000
Log likelihood = -1549.8567                        Pseudo R2       =    0.0524

------------------------------------------------------------------------------
     daysabs |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     langnce |    -.01467   .0012934   -11.34   0.000    -.0172051   -.0121349
      gender |  -.4093528   .0482192    -8.49   0.000    -.5038606   -.3148449
       _cons |   2.646977   .0697764    37.94   0.000     2.510217    2.783736
------------------------------------------------------------------------------

poisson, irr

Poisson regression                                 Number of obs   =       316
                                                   LR chi2(2)      =    171.50
                                                   Prob > chi2     =    0.0000
Log likelihood = -1549.8567                        Pseudo R2       =    0.0524

------------------------------------------------------------------------------
     daysabs |        IRR   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     langnce |   .9854371   .0012746   -11.34   0.000      .982942    .9879384
      gender |   .6640799   .0320214    -8.49   0.000     .6041936    .7299021
------------------------------------------------------------------------------

glm daysabs langnce gender, link(log) fam(poisson) nolog

Generalized linear models                          No. of obs      =       316
Optimization     : ML: Newton-Raphson              Residual df     =       313
                                                   Scale parameter =         1
Deviance         =  2238.317597                    (1/df) Deviance =  7.151174
Pearson          =  2752.913231                    (1/df) Pearson  =   8.79525

Variance function: V(u) = u                        [Poisson]
Link function    : g(u) = ln(u)                    [Log]
Standard errors  : OIM

Log likelihood   = -1549.85665                     AIC             =  9.828207
BIC              =  436.7702841

------------------------------------------------------------------------------
     daysabs |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     langnce |    -.01467   .0012934   -11.34   0.000    -.0172051   -.0121349
      gender |  -.4093528   .0482192    -8.49   0.000    -.5038606   -.3148449
       _cons |   2.646977   .0697764    37.94   0.000     2.510217    2.783736
------------------------------------------------------------------------------

glm, eform

Generalized linear models                          No. of obs      =       316
Optimization     : ML: Newton-Raphson              Residual df     =       313
                                                   Scale parameter =         1
Deviance         =  2238.317597                    (1/df) Deviance =  7.151174
Pearson          =  2752.913231                    (1/df) Pearson  =   8.79525

Variance function: V(u) = u                        [Poisson]
Link function    : g(u) = ln(u)                    [Log]
Standard errors  : OIM

Log likelihood   = -1549.85665                     AIC             =  9.828207
BIC              =  436.7702841

------------------------------------------------------------------------------
     daysabs |        IRR   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     langnce |   .9854371   .0012746   -11.34   0.000      .982942    .9879384
      gender |   .6640799   .0320214    -8.49   0.000     .6041936    .7299021
------------------------------------------------------------------------------

nbreg daysabs langnce gender, nolog

Negative binomial regression                       Number of obs   =       316
                                                   LR chi2(2)      =     20.63
                                                   Prob > chi2     =    0.0000
Log likelihood = -880.9274                         Pseudo R2       =    0.0116

------------------------------------------------------------------------------
     daysabs |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     langnce |  -.0156493   .0039485    -3.96   0.000    -.0233882   -.0079104
      gender |  -.4312069   .1396913    -3.09   0.002    -.7049968   -.1574169
       _cons |    2.70344   .2292762    11.79   0.000     2.254067    3.152813
-------------+----------------------------------------------------------------
    /lnalpha |     .25394    .095509                      .0667457    .4411342
-------------+----------------------------------------------------------------
       alpha |   1.289094   .1231201                      1.069024    1.554469
------------------------------------------------------------------------------
Likelihood ratio test of alpha=0:  chibar2(01) = 1337.86  Prob>=chibar2 = 0.000

glm daysabs langnce gender, link(log) fam(nbin) nolog

Generalized linear models                          No. of obs      =       316
Optimization     : ML: Newton-Raphson              Residual df     =       313
                                                   Scale parameter =         1
Deviance         =  425.603464                     (1/df) Deviance =  1.359755
Pearson          =  415.6288036                    (1/df) Pearson  =  1.327888

Variance function: V(u) = u+(1)u^2                 [Neg. Binomial]
Link function    : g(u) = ln(u)                    [Log]
Standard errors  : OIM

Log likelihood   = -884.4953535                    AIC             =  5.617059
BIC              = -1375.943849

------------------------------------------------------------------------------
     daysabs |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     langnce |  -.0156357   .0035438    -4.41   0.000    -.0225814   -.0086899
      gender |  -.4307736   .1253082    -3.44   0.001    -.6763732    -.185174
       _cons |   2.702606   .2052709    13.17   0.000     2.300282    3.104929
------------------------------------------------------------------------------

glm, eform

Generalized linear models                          No. of obs      =       316
Optimization     : ML: Newton-Raphson              Residual df     =       313
                                                   Scale parameter =         1
Deviance         =  425.603464                     (1/df) Deviance =  1.359755
Pearson          =  415.6288036                    (1/df) Pearson  =  1.327888

Variance function: V(u) = u+(1)u^2                 [Neg. Binomial]
Link function    : g(u) = ln(u)                    [Log]
Standard errors  : OIM

Log likelihood   = -884.4953535                    AIC             =  5.617059
BIC              = -1375.943849

------------------------------------------------------------------------------
     daysabs |        IRR   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     langnce |   .9844859   .0034888    -4.41   0.000     .9776716    .9913477
      gender |    .650006   .0814511    -3.44   0.001     .5084577    .8309596
------------------------------------------------------------------------------

glm daysabs langnce gender, fam(gamma) link(log) nolog

Generalized linear models                          No. of obs      =       316
Optimization     : ML: Newton-Raphson              Residual df     =       313
                                                   Scale parameter =  1.583724
Deviance         =  251.8270233                    (1/df) Deviance =  .8045592
Pearson          =  495.7055497                    (1/df) Pearson  =  1.583724

Variance function: V(u) = u^2                      [Gamma]
Link function    : g(u) = ln(u)                    [Log]
Standard errors  : OIM

Log likelihood   = -856.2487643                    AIC             =  5.438283
BIC              = -1549.72029

------------------------------------------------------------------------------
     daysabs |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     langnce |  -.0156852   .0040626    -3.86   0.000    -.0236478   -.0077226
      gender |  -.4326492   .1443719    -3.00   0.003    -.7156129   -.1496854
       _cons |   2.705757   .2383799    11.35   0.000     2.238541    3.172973
------------------------------------------------------------------------------

glm, eform

Generalized linear models                          No. of obs      =       316
Optimization     : ML: Newton-Raphson              Residual df     =       313
                                                   Scale parameter =  1.583724
Deviance         =  251.8270233                    (1/df) Deviance =  .8045592
Pearson          =  495.7055497                    (1/df) Pearson  =  1.583724

Variance function: V(u) = u^2                      [Gamma]
Link function    : g(u) = ln(u)                    [Log]
Standard errors  : OIM

Log likelihood   = -856.2487643                    AIC             =  5.438283
BIC              = -1549.72029

------------------------------------------------------------------------------
     daysabs |       ExpB   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     langnce |   .9844372   .0039994    -3.86   0.000     .9766296    .9923071
      gender |   .6487881   .0936668    -3.00   0.003     .4888924    .8609788
------------------------------------------------------------------------------
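The poisson fit above shows substantial overdispersion ((1/df) Pearson = 8.8). One further option within the glm framework, sketched here as one possibility rather than shown in the examples above, is to keep the poisson structure but rescale the standard errors by the Pearson dispersion:

glm daysabs langnce gender, link(log) fam(poisson) scale(x2) nolog

The coefficients are unchanged from the poisson fit; only the standard errors, and hence the z statistics and confidence intervals, are inflated to reflect the overdispersion.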
Categorical Data Analysis Course
Phil Ender