Applied Categorical & Nonnormal Data Analysis

Collinearity Issues


First Thoughts

Many students and researchers are familiar with collinearity issues from the study of OLS regression, but concerns about collinearity apply to many other types of statistical models, including categorical and count models. Here are some first thoughts on the matter:

  • Certainly, modern statistical software packages are capable of analyzing data with correlated independent variables.
  • However, problems can arise when two or more predictor variables are highly intercorrelated.
  • There is no consensus about the exact meaning of collinearity.

    Simple Collinearity

  • Occurs when two variables are highly correlated.
  • Can be detected by examining the zero-order correlations (see the sketch after this list).
  • Typically involves correlations in the .90s.
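
    A minimal sketch of this check in Stata, using the hsbdemo data from the example further below; pairs of predictors correlated in the .90s signal simple collinearity:

    use http://www.ats.ucla.edu/stat/data/hsbdemo, clear

    * inspect the zero-order correlations among the predictors
    correlate read write math science socst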

    Multicollinearity

  • Involves combinations of more than two variables.
  • Variables that are uncorrelated are said to be orthogonal.
  • Computation of regression coefficients involves inverting a matrix. If one variable is a perfect linear combination of two or more other variables, then the inverse cannot be computed and the matrix is said to be singular.
  • Example: sat total = sat verbal + sat math (see the sketch after this list).
  • In matrix terms, a linear dependency exists when a row (or column) of a matrix can be obtained as a linear combination of other rows (or columns).
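
    Here is a minimal sketch of a perfect linear dependency, using simulated data with hypothetical variable names; regress detects the singularity and omits one of the offending variables:

    clear
    set obs 50
    set seed 12345
    generate verbal = rnormal(50, 10)
    generate math   = rnormal(50, 10)
    generate total  = verbal + math     // perfect linear combination
    generate y      = verbal + math + rnormal()
    regress y verbal math total         // total is omitted because of collinearity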

    Common Indicators of Collinearity

  • VIF -- variance inflation factor.
  • Tolerance -- the reciprocal of the VIF; that is, 1 - R-squared from regressing that predictor on all of the others.

    Other Indicators of Collinearity

  • Condition index -- large values.
  • Condition number -- large values.
  • Eigenvalues -- small values, close to zero.
  • Determinant of the correlation matrix -- very small values, close to zero.
  • Diagonal of R^-1 (the inverse of the correlation matrix) -- large values are bad; these diagonal elements are the VIFs, so values close to one are good (see the sketch after this list).
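
    Several of these indicators can be computed by hand from the correlation matrix of the predictors. Here is a sketch using the hsbdemo data from the example below; the diagonal of R^-1 reproduces the VIFs, and the product of the eigenvalues equals det(R):

    use http://www.ats.ucla.edu/stat/data/hsbdemo, clear
    quietly correlate female schtyp read write math science socst
    matrix R = r(C)                  // correlation matrix of the predictors
    matrix symeigen X L = R          // eigenvalues in L, largest first
    matrix list L
    display "condition number = " sqrt(L[1,1] / L[1, colsof(L)])
    display "det(R)           = " det(R)
    matrix V = vecdiag(invsym(R))    // diagonal of R^-1, i.e. the VIFs
    matrix list V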

    Effects of Collinearity

  • Imprecise estimates of the regression coefficients.
  • Slight fluctuations in the correlations may lead to large differences in the regression coefficients.
  • Adding or dropping cases may lead to large differences in the regression coefficients.
  • Inflated standard errors of the coefficients, which weaken tests of significance (made precise below).
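
    The last point can be made precise. For OLS, one standard way to write the sampling variance of a coefficient separates the VIF from everything else:

    Var(b_j) = [ sigma^2 / ((n - 1) s_j^2) ] x [ 1 / (1 - R_j^2) ]

    where R_j^2 is the R-squared from regressing predictor j on the remaining predictors and s_j^2 is the variance of predictor j. The second factor is the VIF: as R_j^2 approaches 1, the variance (and hence the standard error) of b_j blows up.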

    Checking for Collinearity in Stata

  • Use the vif command after the regress command (see the sketch after this list).
  • Alternatively, use the collin program, which can be downloaded from UCLA ATS over the Internet.
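
    A minimal sketch of the built-in approach, using the hsbdemo data from the example below (the choice of read as the response is arbitrary, just to give regress an outcome):

    use http://www.ats.ucla.edu/stat/data/hsbdemo, clear
    regress read female schtyp write math science socst
    vif                              // variance inflation factors for the predictors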

    Stata Example Using -collin-

    Most statistical software packages have options associated with their regression programs that are designed to check for collinearity problems. But since collinearity is a property of the set of predictor variables, it is not necessary to run a regression in order to check for it. The -collin- command (findit collin) computes a number of collinearity diagnostics.

    use http://www.ats.ucla.edu/stat/data/hsbdemo, clear
     
    collin female schtyp read write math science socst
    
      Collinearity Diagnostics
    
                            SQRT                   R-
      Variable      VIF     VIF    Tolerance    Squared
    ----------------------------------------------------
        female      1.25    1.12    0.8027      0.1973
        schtyp      1.02    1.01    0.9819      0.0181
          read      2.45    1.57    0.4080      0.5920
         write      2.52    1.59    0.3962      0.6038
          math      2.28    1.51    0.4378      0.5622
       science      2.12    1.46    0.4717      0.5283
         socst      1.91    1.38    0.5224      0.4776
    ----------------------------------------------------
      Mean VIF      1.94
    
                               Cond
            Eigenval          Index
    ---------------------------------
        1     3.4004          1.0000
        2     1.1347          1.7311
        3     0.9782          1.8644
        4     0.5229          2.5502
        5     0.3577          3.0831
        6     0.3299          3.2104
        7     0.2762          3.5087
    ---------------------------------
     Condition Number         3.5087 
     Eigenvalues & Cond Index computed from deviation sscp (no intercept)
     Det(correlation matrix)    0.0643
    
    The next example shows much more severe collinearity. Each test appears on two scales -- normal curve equivalents (mathnce, langnce) and percentile ranks (mathpr, langpr) -- so the four predictors are highly redundant.

    use http://www.philender.com/courses/data/lahigh, clear
    
    collin mathnce langnce mathpr langpr
    
      Collinearity Diagnostics
    
                            SQRT                   R-
      Variable      VIF     VIF    Tolerance    Squared
    ----------------------------------------------------
       mathnce     24.20    4.92    0.0413      0.9587
       langnce     28.31    5.32    0.0353      0.9647
        mathpr     25.02    5.00    0.0400      0.9600
        langpr     29.09    5.39    0.0344      0.9656
    ----------------------------------------------------
      Mean VIF     26.65
    
                               Cond
            Eigenval          Index
    ---------------------------------
        1     3.3643          1.0000
        2     0.5926          2.3827
        3     0.0287         10.8179
        4     0.0143         15.3294
    ---------------------------------
     Condition Number        15.3294 
     Eigenvalues & Cond Index computed from deviation sscp (no intercept)
     Det(correlation matrix)    0.0008
    
    Dropping the percentile-rank versions eliminates the problem.

    collin mathnce langnce
    
      Collinearity Diagnostics
    
                            SQRT                   R-
      Variable      VIF     VIF    Tolerance    Squared
    ----------------------------------------------------
       mathnce      1.90    1.38    0.5256      0.4744
       langnce      1.90    1.38    0.5256      0.4744
    ----------------------------------------------------
      Mean VIF      1.90
    
                               Cond
            Eigenval          Index
    ---------------------------------
        1     1.6888          1.0000
        2     0.3112          2.3295
    ---------------------------------
     Condition Number         2.3295 
     Eigenvalues & Cond Index computed from deviation sscp (no intercept)
     Det(correlation matrix)    0.5256

    Computational Examples

    The following computational examples show some of the effects of high collinearity on standardized regression coefficients.

    Example A

         1   2    3    Y
    1   -  .20  .20  .50
    2       -   .10  .50
    3            -   .50
    Y                 -
    
     R2 = .56373   Det = .918
          Beta   Std Err     F
    1   .34314  .07001  24.025
    2   .39216  .06894  32.360
    3   .39216  .06894  32.360
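
    Example A can be reproduced with Stata's matrix commands: the standardized coefficients solve b = Rxx^-1 * rxy, R2 = rxy' * b, and Det is the determinant of the predictor correlation matrix. (The standard errors and F values in these tables are consistent with n = 100.)

    matrix Rxx = (1, .2, .2 \ .2, 1, .1 \ .2, .1, 1)
    matrix rxy = (.5 \ .5 \ .5)
    matrix b   = invsym(Rxx) * rxy   // standardized betas: .34314, .39216, .39216
    matrix list b
    matrix R2  = rxy' * b            // R-squared = .56373
    matrix list R2
    display "Det = " det(Rxx)        // .918

    Editing the entries of Rxx and rxy reproduces Examples B through D.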
    

    Example B

         1   2    3    Y
    1   -  .20  .20  .50
    2       -   .85  .50
    3            -   .50
    Y                 -
    
    R2 = .43079   Det = .2655
           Beta   Std Err     F
    1   .40960  .07872   27.073
    2   .22599  .14642    2.382
    3   .22599  .14642    2.382
    

    Example C

         1   2    3    Y
    1   -  .20  .20  .50
    2       -   .10  .50
    3            -   .52
    Y                 -
    
     R2 = .57983
          Beta   Std Err     F
    1   .33922  .06870  24.378
    2   .39085  .06765  33.376
    3   .41307  .06765  37.279
    

    Example D

         1   2    3    Y
    1   -  .20  .20  .50
    2       -   .85  .50
    3            -   .52
    Y                 -
    
    R2 = .44128
           Beta   Std Err     F
    1   .40734  .07799   27.277
    2   .16497  .14507    1.293
    3   .29831  .14507    4.229
    

    Remedies

  • Delete variables -- although this may cause specification errors.
  • Collect additional data.
  • Group variables into blocks.
  • Use principal components analysis or principal factor analysis.
  • Ridge regression (see the sketch after this list).
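
    A minimal sketch of the ridge idea in correlation form, using the Example B matrices: adding a small constant k to the diagonal of Rxx before inverting shrinks the coefficients toward zero and stabilizes them (k = .1 is an arbitrary illustrative choice):

    matrix Rxx    = (1, .2, .2 \ .2, 1, .85 \ .2, .85, 1)
    matrix rxy    = (.5 \ .5 \ .5)
    matrix bols   = invsym(Rxx) * rxy              // ordinary least squares betas
    matrix bridge = invsym(Rxx + .1*I(3)) * rxy    // ridge betas with k = .1
    matrix list bols
    matrix list bridge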


    Categorical Data Analysis Course

    Phil Ender