Education 231C

Applied Categorical & Nonnormal Data Analysis

A Matter of Proportion


Regression models involving proportions can present many of the same difficulties found with binary response variables. Proportions, like binary variables, have a minimum of zero and a maximum of one, but unlike binary variables they can also take values in between.

We can illustrate this with an OLS regression example. In the dataset proportion, the variable meals is the proportion of free or reduced-price meals for each school.

OLS Proportion Example

One problem with this analysis is that some of the predicted proportions are less than zero or greater than one.
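The out-of-range problem is easy to reproduce outside Stata. The following is a minimal Python sketch, not the course's own commands; the data are made up for illustration and the helper name ols_fit is hypothetical.

```python
# Illustrative sketch: simple OLS on a proportion outcome.
# The data below are hypothetical, not the course's proportion dataset.

def ols_fit(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# A predictor and a proportion that follows an S-shaped pattern.
x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [0.00, 0.01, 0.02, 0.05, 0.15, 0.50, 0.85, 0.95, 0.99, 1.00]

intercept, slope = ols_fit(x, y)
predicted = [intercept + slope * xi for xi in x]

# Predictions that fall outside the legal range of a proportion.
out_of_range = [p for p in predicted if p < 0 or p > 1]
print("smallest prediction:", min(predicted))
print("largest prediction:", max(predicted))
print("number out of range:", len(out_of_range))
```

Because a straight line is unbounded, it undershoots zero at the low end of the predictor and overshoots one at the high end.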

One solution to this situation is to use the logit transformation, ln(y/(1-y)). This is the same transformation that is used in logistic regression. There is one additional step: because the logit is undefined at zero and one, we will replace the zero and one values with .0001 and .9999, respectively, before computing the log. Here is how the analysis with the logit transformation looks:
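As a rough illustration of the transformation step, here is a short Python sketch. The function name logit_adjusted and the sample values are assumptions made for this example; only the formula and the .0001 and .9999 replacement values come from the text.

```python
import math

def logit_adjusted(p):
    """Logit of a proportion, replacing exact 0 and 1 with .0001 and .9999."""
    if p == 0:
        p = 0.0001
    elif p == 1:
        p = 0.9999
    return math.log(p / (1 - p))

# Hypothetical proportions, including both boundary values.
sample = [0, 0.25, 0.5, 0.75, 1]
transformed = [logit_adjusted(p) for p in sample]
print(transformed)
```

The transformed values are symmetric around zero, and the replaced boundary values land at roughly -9.21 and 9.21, far from the rest of the data, which is worth keeping in mind when reading the summary statistics of the transformed variable.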

Logit Transformation

As you can see, meals and plgt have very different means, standard deviations, and ranges. To put the predicted values on the same scale as the original variable, meals, it is necessary to apply a back (inverse) transformation, in this case 1/(1 + exp(-x)). The graph of the predicted values looks better than the OLS graph, and the correlation between meals and premeals is slightly larger than the correlation with preols. The problem with all transformations is that the coefficients are expressed in terms of the transformed variable, here the log odds of meals rather than the proportion itself. We need some additional tools to make the interpretation easier.
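The back transformation can be sketched in a few lines of Python. The function names and the example predictions are hypothetical; the formula 1/(1 + exp(-x)) is the one given above, written in two algebraically equivalent branches so that large negative inputs do not overflow.

```python
import math

def logit(p):
    """Log odds of a proportion strictly between 0 and 1."""
    return math.log(p / (1 - p))

def inv_logit(x):
    """Inverse logit, 1/(1 + exp(-x)), computed in an overflow-safe way."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

# Hypothetical predictions on the logit scale, mapped back to proportions.
predicted_logits = [-20.0, -2.0, 0.0, 2.0, 20.0]
predicted_props = [inv_logit(v) for v in predicted_logits]
print(predicted_props)
```

Whatever value the linear model predicts on the logit scale, the back-transformed value lies strictly between zero and one, which is why these predictions behave better than the raw OLS predictions.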

GLM Approach

The above analysis worked out fairly well, but we were left with a transformed dependent variable that is more difficult to interpret than the original variable. GLM can be a useful tool in this situation. We will begin with a standard OLS-type analysis, followed by an alternative analysis using a logit link.

Note that this analysis is identical to the one using regress at the beginning of the unit. Also note that the deviance is 63.67 and the BIC is -35475.75. Now, we will change the link from identity to logit and run the analysis again.

In this last analysis the deviance is reduced to 57.57 and the BIC has gone down to -35481.86, a reduction of 6.11 (which is non-trivial).
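The size of the BIC change can be checked against the deviances. The sketch below assumes the reported BIC is the deviance-based version (deviance minus the residual degrees of freedom times ln N); under that assumption the penalty term is the same for both models, since only the link changed, so the BIC difference should match the deviance difference up to rounding. The variable names are made up for this example; the four numbers are the ones reported above.

```python
# Values reported in the text for the identity-link and logit-link models.
deviance_identity = 63.67
deviance_logit = 57.57
bic_identity = -35475.75
bic_logit = -35481.86

# Differences between the two models, rounded to two decimals.
deviance_drop = round(deviance_identity - deviance_logit, 2)
bic_drop = round(bic_identity - bic_logit, 2)

print("drop in deviance:", deviance_drop)
print("drop in BIC:", bic_drop)
```

The two differences agree to within the rounding of the reported figures, which is consistent with the assumption that the improvement in BIC comes entirely from the better fit of the logit link.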

Recent changes to Stata allow users to run the model using the binomial family. Be sure to include the robust option to obtain the correct standard errors.
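To show what a binomial-family, logit-link fit with robust standard errors involves, here is a minimal Python sketch using Newton-Raphson and a sandwich variance estimator for a single predictor. It is an illustration under stated assumptions, not Stata's implementation: the data and all function names are hypothetical, and no finite-sample adjustment is applied to the robust variance, so the numbers would not necessarily match what any particular package reports.

```python
import math

def inv_logit(z):
    """Overflow-safe inverse logit."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def inv2(m):
    """Inverse of a 2x2 matrix given as nested tuples."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return ((d / det, -b / det), (-c / det, a / det))

def matmul2(p, q):
    """Product of two 2x2 matrices."""
    return tuple(
        tuple(sum(p[i][k] * q[k][j] for k in range(2)) for j in range(2))
        for i in range(2)
    )

def weighted_cross(weights, x):
    """Sum of weights[i] * [1, x_i]' [1, x_i] as a 2x2 matrix."""
    s0 = sum(weights)
    s1 = sum(w * xi for w, xi in zip(weights, x))
    s2 = sum(w * xi * xi for w, xi in zip(weights, x))
    return ((s0, s1), (s1, s2))

def fractional_logit(x, y, max_iter=50, tol=1e-10):
    """Fit logit(E[y]) = b0 + b1*x by Newton-Raphson on the binomial
    quasi-likelihood; return coefficients, model-based and robust SEs."""
    b0 = b1 = 0.0
    for _ in range(max_iter):
        mu = [inv_logit(b0 + b1 * xi) for xi in x]
        resid = [yi - mi for yi, mi in zip(y, mu)]
        score = (sum(resid), sum(r * xi for r, xi in zip(resid, x)))
        bread = inv2(weighted_cross([m * (1 - m) for m in mu], x))
        step0 = bread[0][0] * score[0] + bread[0][1] * score[1]
        step1 = bread[1][0] * score[0] + bread[1][1] * score[1]
        b0, b1 = b0 + step0, b1 + step1
        if abs(step0) + abs(step1) < tol:
            break
    # Recompute the variance pieces at the final estimates.
    mu = [inv_logit(b0 + b1 * xi) for xi in x]
    resid = [yi - mi for yi, mi in zip(y, mu)]
    bread = inv2(weighted_cross([m * (1 - m) for m in mu], x))
    meat = weighted_cross([r * r for r in resid], x)
    robust = matmul2(matmul2(bread, meat), bread)
    model_se = (math.sqrt(bread[0][0]), math.sqrt(bread[1][1]))
    robust_se = (math.sqrt(robust[0][0]), math.sqrt(robust[1][1]))
    return (b0, b1), model_se, robust_se

# Hypothetical predictor and proportion outcome.
x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [0.00, 0.05, 0.12, 0.20, 0.35, 0.50, 0.68, 0.80, 0.93, 1.00]

coefs, model_se, robust_se = fractional_logit(x, y)
fitted = [inv_logit(coefs[0] + coefs[1] * xi) for xi in x]
print("coefficients:", coefs)
print("model-based SEs:", model_se)
print("robust SEs:", robust_se)
```

The model-based standard errors treat each observation as a single 0/1 trial, which does not describe a proportion. The sandwich estimator instead uses the observed squared residuals, which is the reason the robust option matters when the binomial family is applied to proportions.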

The deviance is very different from that of the previous models, and the BIC actually appears to be larger. However, the plots of the predicted values are virtually identical, leading me to believe that there is not much difference between the model using binomial and the model using gauss for this set of data.


Categorical Data Analysis Course

Phil Ender -- revised 8/4/04