Introduction
Cluster analysis techniques are not the only way to find unobserved groupings in your data. In fact, from several perspectives cluster analysis may not be the best way to determine these groupings. Several latent variable approaches are also available. In this unit we will explore two of them: latent profile analysis and latent class analysis.
The advantage of these approaches over cluster analysis is that they are model based and generate probabilities of group membership. The models can be tested and their goodness of fit assessed. The downside is that they require specialized software that is more complex to run than typical statistical packages. We will demonstrate these techniques using the Mplus software from Muthén & Muthén. We will also use Stata for descriptive and subsidiary analyses.
Latent profile analysis uses continuous predictors, and latent class analysis uses binary predictors. For the continuous predictors we will use the reading, writing, math, science, and social studies test scores from the hsb6a dataset. For the binary predictors we will do a median split on each test to create hiread, hiwrite, himath, hisci, and hiss.
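The median split described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the scores below are hypothetical (not taken from hsb6a), the function name `median_split` is made up for this example, and the source does not say whether scores tied with the median count as high, so the sketch exposes that choice as a parameter.

```python
from statistics import median

def median_split(scores, ties_high=True):
    """Return 0/1 flags for a median split of a list of scores.

    ties_high=True codes scores equal to the median as 1 (assumption;
    the source does not state how ties were handled).
    """
    cut = median(scores)
    if ties_high:
        return [1 if s >= cut else 0 for s in scores]
    return [1 if s > cut else 0 for s in scores]

# Hypothetical reading scores, not taken from hsb6a.
read = [28.3, 44.2, 52.1, 52.1, 60.1, 76.0]
hiread = median_split(read)
print(hiread)
print(sum(hiread) / len(hiread))  # proportion coded high
```

When several scores tie at the median, the proportion coded high need not be exactly one half, which is one possible reason a split variable can have a mean above .5.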
Looking at the data
use hsb6a

describe

Contains data from hsb6a.dta
  obs:           600                          highschool and beyond (600 cases)
 vars:            23                          24 Oct 2003 14:18
 size:        31,200 (99.0% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
id              int    %9.0g
gender          byte   %9.0g       gl
race            byte   %12.0g      rl
ses             byte   %9.0g       sl
sch             byte   %9.0g       scl
prog            byte   %9.0g       pl
locus           float  %9.0g                  locus of control
concept         float  %9.0g                  self-concept
mot             float  %9.0g                  motivation
career          byte   %14.0g      cl         career choice
read            float  %9.0g                  reading score
write           float  %9.0g                  writing score
math            float  %9.0g                  math score
sci             float  %9.0g                  science score
ss              float  %9.0g                  social studies score
hiread          byte   %9.0g
hiwrite         byte   %9.0g
himath          byte   %9.0g
hisci           byte   %9.0g
hiss            byte   %9.0g

sum read write math sci ss hiread hiwrite himath hisci hiss

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        read |       600    51.90183    10.10298       28.3         76
       write |       600    52.38483    9.726455       25.5       67.1
        math |       600      51.849    9.414736       31.8       75.5
         sci |       600    51.76333    9.706179         26       74.2
          ss |       600    52.04567    9.879228       25.7       70.5
-------------+--------------------------------------------------------
      hiread |       600        .525    .4997913          0          1
     hiwrite |       600         .54    .4988133          0          1
      himath |       600    .4966667    .5004061          0          1
       hisci |       600    .5266667     .499705          0          1
        hiss |       600    .6483333     .477889          0          1
A 2 Class Latent Profile Model
Data:
  File is I:\mplus\hsb6.dat ;
Variable:
  Names are
    id gender race ses sch prog locus concept mot career read write math
    sci ss hiread hiwrite himath hisci hiss academic;
  Usevariables are
    read write math sci ss ;
  classes = c(2);
Analysis:
  Type=mixture;
MODEL:
  %C#1%
  [read math sci ss write * 30 ];
  %C#2%
  [read math sci ss write * 60];
OUTPUT:
  TECH8;
SAVEDATA:
  file is lca_ex1.txt ;
  save is cprob;
  format is free;

THE MODEL ESTIMATION TERMINATED NORMALLY

TESTS OF MODEL FIT

Loglikelihood

          H0 Value                       -5213.102

Information Criteria

          Number of Free Parameters             16
          Akaike (AIC)                   10458.203
          Bayesian (BIC)                 10517.464
          Sample-Size Adjusted BIC       10466.721
            (n* = (n + 2) / 24)
          Entropy                            0.865

FINAL CLASS COUNTS AND PROPORTIONS OF TOTAL SAMPLE SIZE
BASED ON ESTIMATED POSTERIOR PROBABILITIES

  Class 1        123.03223          0.41011
  Class 2        176.96777          0.58989

CLASSIFICATION OF INDIVIDUALS BASED ON THEIR MOST LIKELY CLASS MEMBERSHIP

Class Counts and Proportions

  Class 1              120          0.40000
  Class 2              180          0.60000

Average Class Probabilities by Class

                 1        2

  Class 1    0.961    0.039
  Class 2    0.043    0.957

MODEL RESULTS

                   Estimates     S.E.  Est./S.E.

CLASS 1

 Means
    READ              43.151    0.820     52.641
    WRITE             44.524    1.024     43.485
    MATH              43.860    0.757     57.947
    SCI               43.322    1.051     41.239
    SS                45.119    0.946     47.707

 Variances
    READ              49.035    4.175     11.745
    WRITE             44.303    3.927     11.283
    MATH              45.062    3.768     11.958
    SCI               48.986    5.184      9.450
    SS                55.410    4.445     12.465

CLASS 2

 Means
    READ              57.915    0.847     68.403
    WRITE             58.115    0.625     93.039
    MATH              57.136    0.800     71.386
    SCI               56.729    0.668     84.953
    SS                57.220    0.723     79.137

 Variances
    READ              49.035    4.175     11.745
    WRITE             44.303    3.927     11.283
    MATH              45.062    3.768     11.958
    SCI               48.986    5.184      9.450
    SS                55.410    4.445     12.465

LATENT CLASS REGRESSION MODEL PART

 Means
    C#1               -0.364    0.179     -2.032

QUALITY OF NUMERICAL RESULTS

     Condition Number for the Information Matrix              0.462E-03
       (ratio of smallest to largest eigenvalue)
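The information criteria in the output above can be checked by hand from the loglikelihood and the number of free parameters. The Python sketch below uses the standard AIC and BIC formulas together with the n* = (n + 2) / 24 adjustment printed in the output. The sample size n = 300 is an assumption inferred from the class counts, which sum to 300; the Stata listing reports 600 observations, so the Mplus run appears to use a different file or subset. The function name `information_criteria` is made up for this example.

```python
import math

def information_criteria(loglik, n_params, n_obs):
    """Return (AIC, BIC, sample-size adjusted BIC) for a fitted model."""
    deviance = -2.0 * loglik
    aic = deviance + 2 * n_params
    bic = deviance + n_params * math.log(n_obs)
    # Adjusted BIC replaces n with n* = (n + 2) / 24, as printed by Mplus.
    n_star = (n_obs + 2) / 24
    sabic = deviance + n_params * math.log(n_star)
    return aic, bic, sabic

# Values from the 2 class latent profile output; n = 300 is inferred
# from the class counts (120 + 180), not stated in the source.
aic, bic, sabic = information_criteria(-5213.102, 16, 300)
print(round(aic, 3), round(bic, 3), round(sabic, 3))
```

The 16 free parameters are consistent with 5 means in each of the 2 classes, 5 variances held equal across classes, and 1 latent class mean.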
A 3 Class Latent Profile Model
Data:
  File is I:\mplus\hsb6.dat ;
Variable:
  Names are
    id gender race ses sch prog locus concept mot career read write math
    sci ss hiread hiwrite himath hisci hiss academic;
  Usevariables are
    read write math sci ss ;
  classes = c(3);
Analysis:
  Type=mixture;
MODEL:
  %C#1%
  [read math sci ss write *30 ];
  %C#2%
  [read math sci ss write *45];
  %C#3%
  [read math sci ss write *60];
OUTPUT:
  TECH8;
SAVEDATA:
  file is lca_ex2.txt ;
  save is cprob;
  format is free;

THE MODEL ESTIMATION TERMINATED NORMALLY

TESTS OF MODEL FIT

Loglikelihood

          H0 Value                       -5100.544

Information Criteria

          Number of Free Parameters             22
          Akaike (AIC)                   10245.087
          Bayesian (BIC)                 10326.571
          Sample-Size Adjusted BIC       10256.800
            (n* = (n + 2) / 24)
          Entropy                            0.877

FINAL CLASS COUNTS AND PROPORTIONS OF TOTAL SAMPLE SIZE
BASED ON ESTIMATED POSTERIOR PROBABILITIES

  Class 1         98.08460          0.32695
  Class 2        137.86474          0.45955
  Class 3         64.05066          0.21350

CLASSIFICATION OF INDIVIDUALS BASED ON THEIR MOST LIKELY CLASS MEMBERSHIP

Class Counts and Proportions

  Class 1               99          0.33000
  Class 2              138          0.46000
  Class 3               63          0.21000

Average Class Probabilities by Class

                 1        2        3

  Class 1    0.961    0.039    0.000
  Class 2    0.021    0.940    0.039
  Class 3    0.000    0.068    0.932

MODEL RESULTS

                   Estimates     S.E.  Est./S.E.

CLASS 1

 Means
    READ              41.866    0.614     68.208
    WRITE             43.080    0.870     49.514
    MATH              42.447    0.549     77.337
    SCI               41.409    0.748     55.358
    SS                44.232    0.819     54.010

 Variances
    READ              33.867    3.334     10.159
    WRITE             40.042    4.168      9.607
    MATH              28.667    2.980      9.619
    SCI               34.199    3.411     10.027
    SS                48.355    4.323     11.185

CLASS 2

 Means
    READ              53.058    0.726     73.044
    WRITE             55.195    0.677     81.493
    MATH              52.704    0.683     77.191
    SCI               53.195    0.600     88.727
    SS                53.377    0.745     71.657

 Variances
    READ              33.867    3.334     10.159
    WRITE             40.042    4.168      9.607
    MATH              28.667    2.980      9.619
    SCI               34.199    3.411     10.027
    SS                48.355    4.323     11.185

CLASS 3

 Means
    READ              64.588    0.949     68.070
    WRITE             61.318    0.624     98.232
    MATH              63.667    0.907     70.167
    SCI               62.043    0.873     71.064
    SS                62.139    0.827     75.163

 Variances
    READ              33.867    3.334     10.159
    WRITE             40.042    4.168      9.607
    MATH              28.667    2.980      9.619
    SCI               34.199    3.411     10.027
    SS                48.355    4.323     11.185

LATENT CLASS REGRESSION MODEL PART

 Means
    C#1                0.426    0.201      2.120
    C#2                0.767    0.196      3.901

QUALITY OF NUMERICAL RESULTS

     Condition Number for the Information Matrix              0.461E-03
       (ratio of smallest to largest eigenvalue)
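The C#1 and C#2 means in the latent class regression part are on a multinomial logit scale, with the last class as the reference. Under that assumption (the reference class has a logit of 0), the estimated class proportions can be recovered as sketched below in Python. The function name `class_proportions` is made up for this example.

```python
import math

def class_proportions(logits):
    """Convert latent class logits to proportions.

    `logits` holds the means for classes 1..K-1; the last class is
    treated as the reference with a logit of 0 (assumption).
    """
    exps = [math.exp(v) for v in logits] + [1.0]
    total = sum(exps)
    return [e / total for e in exps]

# C#1 and C#2 means from the 3 class latent profile output.
props = class_proportions([0.426, 0.767])
print([round(p, 3) for p in props])
```

The results agree, to about three decimal places, with the proportions based on estimated posterior probabilities (0.32695, 0.45955, 0.21350); small differences come from the rounding of the printed logits.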
A 2 Class Latent Class Model
Data:
  File is h:\mplus\hsb6.dat ;
Variable:
  Names are
    id gender race ses sch prog locus concept mot career read write math
    sci ss hiread hiwrite himath hisci hiss academic;
  Usevariables are
    hiread hiwrite himath hisci hiss ;
  categorical = hiread hiwrite himath hisci hiss;
  classes = c(2);
Analysis:
  Type=mixture;
MODEL:
  %C#1%
  [hiread$1 *2 himath$1 *2 hisci$1 *2 hiss$1 *2 hiwrite$1 *2 ];
  %C#2%
  [hiread$1 *-2 himath$1 *-2 hisci$1 *-2 hiss$1 *-2 hiwrite$1 *-2 ];
OUTPUT:
  TECH8;
SAVEDATA:
  file is lca_ex7.txt ;
  save is cprob;
  format is free;

THE MODEL ESTIMATION TERMINATED NORMALLY

TESTS OF MODEL FIT

Loglikelihood

          H0 Value                        -849.157

Information Criteria

          Number of Free Parameters             11
          Akaike (AIC)                    1720.315
          Bayesian (BIC)                  1761.057
          Sample-Size Adjusted BIC        1726.171
            (n* = (n + 2) / 24)
          Entropy                            0.815

Chi-Square Test of Model Fit for the Latent Class Indicator Model Part

          Pearson Chi-Square

          Value                             44.642
          Degrees of Freedom                    20
          P-Value                           0.0012

          Likelihood Ratio Chi-Square

          Value                             45.747
          Degrees of Freedom                    20
          P-Value                           0.0009

FINAL CLASS COUNTS AND PROPORTIONS OF TOTAL SAMPLE SIZE
BASED ON ESTIMATED POSTERIOR PROBABILITIES

  Class 1        123.33019          0.41110
  Class 2        176.66981          0.58890

CLASSIFICATION OF INDIVIDUALS BASED ON THEIR MOST LIKELY CLASS MEMBERSHIP

Class Counts and Proportions

  Class 1              127          0.42333
  Class 2              173          0.57667

Average Class Probabilities by Class

                 1        2

  Class 1    0.930    0.070
  Class 2    0.030    0.970

MODEL RESULTS

                   Estimates     S.E.  Est./S.E.

CLASS 1

CLASS 2

LATENT CLASS INDICATOR MODEL PART

Class 1

 Thresholds
    HIREAD$1           2.273    0.424      5.354
    HIWRITE$1          1.376    0.276      4.990
    HIMATH$1           2.081    0.399      5.209
    HISCI$1            2.035    0.411      4.947
    HISS$1             0.642    0.231      2.780

Class 2

 Thresholds
    HIREAD$1          -1.540    0.264     -5.823
    HIWRITE$1         -1.488    0.244     -6.109
    HIMATH$1          -1.217    0.217     -5.616
    HISCI$1           -1.264    0.213     -5.927
    HISS$1            -2.047    0.279     -7.328

LATENT CLASS REGRESSION MODEL PART

 Means
    C#1               -0.359    0.161     -2.231

LATENT CLASS INDICATOR MODEL PART IN PROBABILITY SCALE

Class 1

 HIREAD
    Category 1         0.907    0.036     25.221
    Category 2         0.093    0.036      2.599
 HIWRITE
    Category 1         0.798    0.044     17.985
    Category 2         0.202    0.044      4.542
 HIMATH
    Category 1         0.889    0.039     22.555
    Category 2         0.111    0.039      2.816
 HISCI
    Category 1         0.884    0.042     21.036
    Category 2         0.116    0.042      2.748
 HISS
    Category 1         0.655    0.052     12.564
    Category 2         0.345    0.052      6.615

Class 2

 HIREAD
    Category 1         0.177    0.038      4.592
    Category 2         0.823    0.038     21.417
 HIWRITE
    Category 1         0.184    0.037      5.031
    Category 2         0.816    0.037     22.288
 HIMATH
    Category 1         0.228    0.038      5.980
    Category 2         0.772    0.038     20.197
 HISCI
    Category 1         0.220    0.037      6.015
    Category 2         0.780    0.037     21.288
 HISS
    Category 1         0.114    0.028      4.043
    Category 2         0.886    0.028     31.304

QUALITY OF NUMERICAL RESULTS

     Condition Number for the Information Matrix              0.654E-01
       (ratio of smallest to largest eigenvalue)
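The thresholds and the probability-scale results above are two views of the same estimates. For a binary indicator, the probability of Category 1 is the inverse logit of the threshold, and Category 2 is its complement. The Python sketch below reproduces a few of the printed probabilities from their thresholds; the function name `threshold_to_probs` is made up for this example.

```python
import math

def threshold_to_probs(threshold):
    """Return (P(Category 1), P(Category 2)) for a binary indicator."""
    p_cat1 = 1.0 / (1.0 + math.exp(-threshold))
    return p_cat1, 1.0 - p_cat1

# Thresholds from the 2 class latent class output.
hiread_c1 = threshold_to_probs(2.273)   # Class 1, HIREAD$1
hiss_c1 = threshold_to_probs(0.642)     # Class 1, HISS$1
hiread_c2 = threshold_to_probs(-1.540)  # Class 2, HIREAD$1
for name, probs in [("hiread c1", hiread_c1), ("hiss c1", hiss_c1),
                    ("hiread c2", hiread_c2)]:
    print(name, round(probs[0], 3), round(probs[1], 3))
```

A large positive threshold therefore means a class that mostly falls in Category 1, and a large negative threshold means a class that mostly falls in Category 2.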
A 3 Class Latent Class Model
Data:
  File is h:\mplus\hsb6.dat ;
Variable:
  Names are
    id gender race ses sch prog locus concept mot career read write math
    sci ss hiread hiwrite himath hisci hiss academic;
  Usevariables are
    hiread hiwrite himath hisci hiss ;
  categorical = hiread hiwrite himath hisci hiss;
  classes = c(3);
Analysis:
  Type=mixture;
MODEL:
  %C#1%
  [hiread$1 *2 himath$1 *2 hisci$1 *2 hiss$1 *2 hiwrite$1 *2 ];
  %C#2%
  [hiread$1 *0 himath$1 *0 hisci$1 *0 hiss$1 *0 hiwrite$1 *0 ];
  %C#3%
  [hiread$1 *-2 himath$1 *-2 hisci$1 *-2 hiss$1 *-2 hiwrite$1 *-2 ];
OUTPUT:
  TECH8;
SAVEDATA:
  file is lca_ex8.txt ;
  save is cprob;
  format is free;

THE MODEL ESTIMATION TERMINATED NORMALLY

TESTS OF MODEL FIT

Loglikelihood

          H0 Value                        -839.066

Information Criteria

          Number of Free Parameters             17
          Akaike (AIC)                    1712.132
          Bayesian (BIC)                  1775.096
          Sample-Size Adjusted BIC        1721.182
            (n* = (n + 2) / 24)
          Entropy                            0.682

Chi-Square Test of Model Fit for the Latent Class Indicator Model Part

          Pearson Chi-Square

          Value                             21.369
          Degrees of Freedom                    14
          P-Value                           0.0925

          Likelihood Ratio Chi-Square

          Value                             25.564
          Degrees of Freedom                    14
          P-Value                           0.0294

FINAL CLASS COUNTS AND PROPORTIONS OF TOTAL SAMPLE SIZE
BASED ON ESTIMATED POSTERIOR PROBABILITIES

  Class 1         95.51732          0.31839
  Class 2        127.98211          0.42661
  Class 3         76.50058          0.25500

CLASSIFICATION OF INDIVIDUALS BASED ON THEIR MOST LIKELY CLASS MEMBERSHIP

Class Counts and Proportions

  Class 1               94          0.31333
  Class 2              130          0.43333
  Class 3               76          0.25333

Average Class Probabilities by Class

                 1        2        3

  Class 1    0.913    0.087    0.000
  Class 2    0.074    0.826    0.099
  Class 3    0.000    0.163    0.837

MODEL RESULTS

                   Estimates     S.E.  Est./S.E.

CLASS 1

CLASS 2

CLASS 3

LATENT CLASS INDICATOR MODEL PART

Class 1

 Thresholds
    HIREAD$1           2.883    0.671      4.296
    HIWRITE$1          1.735    0.418      4.150
    HIMATH$1           2.863    0.739      3.877
    HISCI$1            3.007    0.861      3.492
    HISS$1             0.991    0.319      3.106

Class 2

 Thresholds
    HIREAD$1          -0.392    0.348     -1.128
    HIWRITE$1         -0.451    0.445     -1.013
    HIMATH$1          -0.258    0.342     -0.754
    HISCI$1           -0.453    0.269     -1.688
    HISS$1            -1.201    0.400     -2.999

Class 3

 Thresholds
    HIREAD$1          -4.377    6.575     -0.666
    HIWRITE$1        -15.000    0.000      0.000
    HIMATH$1          -2.932    1.699     -1.726
    HISCI$1           -2.257    0.986     -2.289
    HISS$1            -3.761    2.143     -1.755

LATENT CLASS REGRESSION MODEL PART

 Means
    C#1                0.222    0.398      0.558
    C#2                0.515    0.499      1.032

LATENT CLASS INDICATOR MODEL PART IN PROBABILITY SCALE

Class 1

 HIREAD
    Category 1         0.947    0.034     28.108
    Category 2         0.053    0.034      1.574
 HIWRITE
    Category 1         0.850    0.053     15.951
    Category 2         0.150    0.053      2.815
 HIMATH
    Category 1         0.946    0.038     25.073
    Category 2         0.054    0.038      1.431
 HISCI
    Category 1         0.953    0.039     24.648
    Category 2         0.047    0.039      1.219
 HISS
    Category 1         0.729    0.063     11.577
    Category 2         0.271    0.063      4.298

Class 2

 HIREAD
    Category 1         0.403    0.084      4.819
    Category 2         0.597    0.084      7.134
 HIWRITE
    Category 1         0.389    0.106      3.680
    Category 2         0.611    0.106      5.775
 HIMATH
    Category 1         0.436    0.084      5.177
    Category 2         0.564    0.084      6.702
 HISCI
    Category 1         0.389    0.064      6.090
    Category 2         0.611    0.064      9.582
 HISS
    Category 1         0.231    0.071      3.249
    Category 2         0.769    0.071     10.797

Class 3

 HIREAD
    Category 1         0.012    0.081      0.154
    Category 2         0.988    0.081     12.253
 HIWRITE
    Category 1         0.000    0.000      0.000
    Category 2         1.000    0.000      0.000
 HIMATH
    Category 1         0.051    0.082      0.620
    Category 2         0.949    0.082     11.641
 HISCI
    Category 1         0.095    0.085      1.120
    Category 2         0.905    0.085     10.700
 HISS
    Category 1         0.023    0.048      0.477
    Category 2         0.977    0.048     20.530

QUALITY OF NUMERICAL RESULTS

     Condition Number for the Information Matrix              0.323E-03
       (ratio of smallest to largest eigenvalue)
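Because the model is probability based, the estimates above are enough to work out the posterior probability of class membership for any response pattern. The Python sketch below applies Bayes' rule to the 3 class results, assuming the indicators are independent within a class (the usual latent class assumption) and assuming Category 2 corresponds to an indicator value of 1. The function name `posterior` and the two response patterns are made up for this example; this is not the Mplus computation itself, only the same idea applied to the rounded printed values.

```python
def posterior(pattern, class_props, p_high):
    """Posterior class probabilities for one 0/1 response pattern.

    class_props: estimated class proportions.
    p_high: per class, P(indicator = 1) for each indicator.
    Assumes indicators are independent within a class.
    """
    joint = []
    for prop, probs in zip(class_props, p_high):
        like = prop
        for y, p in zip(pattern, probs):
            like *= p if y == 1 else 1.0 - p
        joint.append(like)
    total = sum(joint)
    return [j / total for j in joint]

# Printed estimates from the 3 class latent class output.
class_props = [0.31839, 0.42661, 0.25500]
# Category 2 probabilities, in the order hiread hiwrite himath hisci hiss
# (assumes Category 2 is the value 1, i.e. the high group).
p_high = [
    [0.053, 0.150, 0.054, 0.047, 0.271],  # Class 1
    [0.597, 0.611, 0.564, 0.611, 0.769],  # Class 2
    [0.988, 1.000, 0.949, 0.905, 0.977],  # Class 3
]

post_all_high = posterior([1, 1, 1, 1, 1], class_props, p_high)
post_all_low = posterior([0, 0, 0, 0, 0], class_props, p_high)
print([round(p, 3) for p in post_all_high])
print([round(p, 3) for p in post_all_low])
```

A student who is high on all five tests is most likely in Class 3, and a student who is low on all five is almost certainly in Class 1. The second pattern has zero posterior probability for Class 3 because the HIWRITE threshold for that class sits at the boundary value of -15.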
Categorical Data Analysis Course
Phil Ender -- 24apr03