Applied Categorical & Nonnormal Data Analysis

Collinearity Issues


First Thoughts

Many students and researchers are familiar with collinearity issues from the study of OLS regression, but concerns about collinearity apply to many other types of statistical models, including categorical and count models. Here are some first thoughts on the matter:

  • Certainly, modern statistical software packages are capable of analyzing data with correlated independent variables.
  • However, problems can arise when two or more predictor variables are highly intercorrelated.
  • There is no consensus about the exact meaning of collinearity.

    Simple Collinearity

  • Occurs when two variables are highly correlated.
  • Can be detected by examining the zero-order correlations (see the sketch after this list).
  • Typically involves correlations in the .90s.
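
    A minimal sketch of this check in Stata, using the hsbdemo data from the example further below; pairs of predictors correlated in the .90s signal simple collinearity:

    use http://www.ats.ucla.edu/stat/data/hsbdemo, clear

    * inspect the zero-order correlations among the predictors
    correlate read write math science socst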

    Multicollinearity

  • Involves combinations of more than two variables.
  • Variables that are uncorrelated are said to be orthogonal.
  • Computation of regression coefficients involves inverting a matrix. If one variable is a perfect linear combination of two or more other variables, then the inverse cannot be computed and the matrix is said to be singular.
  • Example: sat total = sat verbal + sat math (see the sketch after this list).
  • In matrix terms, a linear dependency exists when a row (or column) of a matrix can be obtained as a linear combination of other rows (or columns).
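
    Here is a minimal sketch of a perfect linear dependency, using simulated data with hypothetical variable names; regress detects the singularity and omits one of the offending variables:

    clear
    set obs 50
    set seed 12345
    generate verbal = rnormal(50, 10)
    generate math   = rnormal(50, 10)
    generate total  = verbal + math     // perfect linear combination
    generate y      = verbal + math + rnormal()
    regress y verbal math total         // total is omitted because of collinearity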

    Common Indicators of Collinearity

  • VIF -- variance inflation factor.
  • Tolerance -- the reciprocal of the VIF; that is, 1 - R-squared from regressing that predictor on all of the others.

    Other Indicators of Collinearity

  • Condition index -- large values.
  • Condition number -- large values.
  • Eigenvalues -- small values, close to zero.
  • Determinant of the correlation matrix -- very small values, close to zero.
  • Diagonal of R^-1 (the inverse of the correlation matrix) -- large values are bad; these diagonal elements are the VIFs, so values close to one are good (see the sketch after this list).
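
    Several of these indicators can be computed by hand from the correlation matrix of the predictors. Here is a sketch using the hsbdemo data from the example below; the diagonal of R^-1 reproduces the VIFs, and the product of the eigenvalues equals det(R):

    use http://www.ats.ucla.edu/stat/data/hsbdemo, clear
    quietly correlate female schtyp read write math science socst
    matrix R = r(C)                  // correlation matrix of the predictors
    matrix symeigen X L = R          // eigenvalues in L, largest first
    matrix list L
    display "condition number = " sqrt(L[1,1] / L[1, colsof(L)])
    display "det(R)           = " det(R)
    matrix V = vecdiag(invsym(R))    // diagonal of R^-1, i.e. the VIFs
    matrix list V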

    Effects of Collinearity

  • Imprecise estimates of the regression coefficients.
  • Slight fluctuations in the correlations may lead to large differences in the regression coefficients.
  • Adding or dropping cases may lead to large differences in the regression coefficients.
  • Inflated standard errors of the coefficients, which weaken tests of significance (made precise below).
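
    The last point can be made precise. For OLS, one standard way to write the sampling variance of a coefficient separates the VIF from everything else:

    Var(b_j) = [ sigma^2 / ((n - 1) s_j^2) ] x [ 1 / (1 - R_j^2) ]

    where R_j^2 is the R-squared from regressing predictor j on the remaining predictors and s_j^2 is the variance of predictor j. The second factor is the VIF: as R_j^2 approaches 1, the variance (and hence the standard error) of b_j blows up.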

    Checking for Collinearity in Stata

  • Use the vif command after the regress command (see the sketch after this list).
  • Alternatively, use the collin program, which can be downloaded from UCLA ATS over the Internet.
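
    A minimal sketch of the built-in approach, using the hsbdemo data from the example below (the choice of read as the response is arbitrary, just to give regress an outcome):

    use http://www.ats.ucla.edu/stat/data/hsbdemo, clear
    regress read female schtyp write math science socst
    vif                              // variance inflation factors for the predictors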

    Stata Example Using -collin-

    Most statistical software packages have options associated with their regression programs that are designed to check for collinearity problems. But since collinearity is a property of the set of predictor variables, it is not necessary to run a regression in order to check for it. The -collin- command (findit collin) computes a number of collinearity diagnostics.

    use http://www.ats.ucla.edu/stat/data/hsbdemo, clear
     
    collin female schtyp read write math science socst
    
      Collinearity Diagnostics
    
                            SQRT                   R-
      Variable      VIF     VIF    Tolerance    Squared
    ----------------------------------------------------
        female      1.25    1.12    0.8027      0.1973
        schtyp      1.02    1.01    0.9819      0.0181
          read      2.45    1.57    0.4080      0.5920
         write      2.52    1.59    0.3962      0.6038
          math      2.28    1.51    0.4378      0.5622
       science      2.12    1.46    0.4717      0.5283
         socst      1.91    1.38    0.5224      0.4776
    ----------------------------------------------------
      Mean VIF      1.94
    
                               Cond
            Eigenval          Index
    ---------------------------------
        1     3.4004          1.0000
        2     1.1347          1.7311
        3     0.9782          1.8644
        4     0.5229          2.5502
        5     0.3577          3.0831
        6     0.3299          3.2104
        7     0.2762          3.5087
    ---------------------------------
     Condition Number         3.5087 
     Eigenvalues & Cond Index computed from deviation sscp (no intercept)
     Det(correlation matrix)    0.0643
    
    The next example shows much more severe collinearity. Each test appears on two scales -- normal curve equivalents (mathnce, langnce) and percentile ranks (mathpr, langpr) -- so the four predictors are highly redundant.

    use http://www.philender.com/courses/data/lahigh, clear
    
    collin mathnce langnce mathpr langpr
    
      Collinearity Diagnostics
    
                            SQRT                   R-
      Variable      VIF     VIF    Tolerance    Squared
    ----------------------------------------------------
       mathnce     24.20    4.92    0.0413      0.9587
       langnce     28.31    5.32    0.0353      0.9647
        mathpr     25.02    5.00    0.0400      0.9600
        langpr     29.09    5.39    0.0344      0.9656
    ----------------------------------------------------
      Mean VIF     26.65
    
                               Cond
            Eigenval          Index
    ---------------------------------
        1     3.3643          1.0000
        2     0.5926          2.3827
        3     0.0287         10.8179
        4     0.0143         15.3294
    ---------------------------------
     Condition Number        15.3294 
     Eigenvalues & Cond Index computed from deviation sscp (no intercept)
     Det(correlation matrix)    0.0008
    
    Dropping the percentile-rank versions eliminates the problem.

    collin mathnce langnce
    
      Collinearity Diagnostics
    
                            SQRT                   R-
      Variable      VIF     VIF    Tolerance    Squared
    ----------------------------------------------------
       mathnce      1.90    1.38    0.5256      0.4744
       langnce      1.90    1.38    0.5256      0.4744
    ----------------------------------------------------
      Mean VIF      1.90
    
                               Cond
            Eigenval          Index
    ---------------------------------
        1     1.6888          1.0000
        2     0.3112          2.3295
    ---------------------------------
     Condition Number         2.3295 
     Eigenvalues & Cond Index computed from deviation sscp (no intercept)
     Det(correlation matrix)    0.5256

    Computational Examples

    The following computational examples show some of the effects of high collinearity on standardized regression coefficients.

    Example A

         1   2    3    Y
    1   -  .20  .20  .50
    2       -   .10  .50
    3            -   .50
    Y                 -
    
     R2 = .56373   Det = .918
          Beta   Std Err     F
    1   .34314  .07001  24.025
    2   .39216  .06894  32.360
    3   .39216  .06894  32.360
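
    Example A can be reproduced with Stata's matrix commands: the standardized coefficients solve b = Rxx^-1 * rxy, R2 = rxy' * b, and Det is the determinant of the predictor correlation matrix. (The standard errors and F values in these tables are consistent with n = 100.)

    matrix Rxx = (1, .2, .2 \ .2, 1, .1 \ .2, .1, 1)
    matrix rxy = (.5 \ .5 \ .5)
    matrix b   = invsym(Rxx) * rxy   // standardized betas: .34314, .39216, .39216
    matrix list b
    matrix R2  = rxy' * b            // R-squared = .56373
    matrix list R2
    display "Det = " det(Rxx)        // .918

    Editing the entries of Rxx and rxy reproduces Examples B through D.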
    

    Example B

         1   2    3    Y
    1   -  .20  .20  .50
    2       -   .85  .50
    3            -   .50
    Y                 -
    
    R2 = .43079   Det = .2655
           Beta   Std Err     F
    1   .40960  .07872   27.073
    2   .22599  .14642    2.382
    3   .22599  .14642    2.382
    

    Example C

         1   2    3    Y
    1   -  .20  .20  .50
    2       -   .10  .50
    3            -   .52
    Y                 -
    
     R2 = .57983
          Beta   Std Err     F
    1   .33922  .06870  24.378
    2   .39085  .06765  33.376
    3   .41307  .06765  37.279
    

    Example D

         1   2    3    Y
    1   -  .20  .20  .50
    2       -   .85  .50
    3            -   .52
    Y                 -
    
    R2 = .44128
           Beta   Std Err     F
    1   .40734  .07799   27.277
    2   .16497  .14507    1.293
    3   .29831  .14507    4.229
    

    Remedies

  • Delete variables -- although this may cause specification errors.
  • Collect additional data.
  • Group variables into blocks.
  • Use principal components analysis or principal factor analysis.
  • Ridge regression (see the sketch after this list).
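
    A minimal sketch of the ridge idea in correlation form, using the Example B matrices: adding a small constant k to the diagonal of Rxx before inverting shrinks the coefficients toward zero and stabilizes them (k = .1 is an arbitrary illustrative choice):

    matrix Rxx    = (1, .2, .2 \ .2, 1, .85 \ .2, .85, 1)
    matrix rxy    = (.5 \ .5 \ .5)
    matrix bols   = invsym(Rxx) * rxy              // ordinary least squares betas
    matrix bridge = invsym(Rxx + .1*I(3)) * rxy    // ridge betas with k = .1
    matrix list bols
    matrix list bridge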


    Categorical Data Analysis Course

    Phil Ender