Multivariate Analysis
Partition Cluster Analysis


Introduction

Partition methods break the observations into distinct, nonoverlapping groups. There are many different partition methods; this unit illustrates two of them, kmeans and kmedians. Kmeans clustering is a popular nonhierarchical clustering technique that is particularly appropriate when the number of clusters, or at least the approximate number, is known a priori. Unlike hierarchical cluster analysis, kmeans clustering does not produce solutions for every possible number of clusters of the n observations. Rather, the researcher provides the kmeans cluster analysis program with the number of clusters, and the program searches for the best solution with that number of clusters.

Kmeans cluster analysis programs begin by creating k clusters according to some arbitrary procedure. The program then calculates the mean, or centroid, of each cluster. If an observation is closer to the centroid of another cluster than to the centroid of its own, the observation is made a member of that other cluster. The centroids are then recalculated, and the process is repeated until no observation is reassigned to a different cluster.
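
The reassignment loop described above can be sketched by hand in Stata. This is only an illustration under stated assumptions, not a description of how cluster kmeans is implemented internally: the seven (x, y) points, the variable names x, y and g, and the arbitrary starting split are all hypothetical, and squared Euclidean distance is assumed as the measure.

```stata
* Hypothetical toy data (note: clear replaces the data in memory)
clear
input x y
1   1
1.5 2
3   4
5   7
3.5 5
4.5 5
3.5 4.5
end

* Arbitrary starting partition: first three observations in group 1, the rest in group 2
generate byte g = cond(_n <= 3, 1, 2)

local changed = 1
while `changed' > 0 {
    * Centroid (mean of x and of y) for each current group
    forvalues j = 1/2 {
        summarize x if g == `j', meanonly
        local mx`j' = r(mean)
        summarize y if g == `j', meanonly
        local my`j' = r(mean)
    }
    * Squared Euclidean distance from every observation to each centroid
    quietly generate double d1 = (x - `mx1')^2 + (y - `my1')^2
    quietly generate double d2 = (x - `mx2')^2 + (y - `my2')^2
    * Move each observation to the group with the nearer centroid
    quietly generate byte gnew = cond(d1 <= d2, 1, 2)
    quietly count if gnew != g
    local changed = r(N)
    quietly replace g = gnew
    drop d1 d2 gnew
}

list x y g
```

In this toy example the loop stops after two passes: the third point moves to group 2 on the first pass, and nothing changes on the second, leaving the first two points in group 1 and the remaining five in group 2.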

The process of partitioning is sensitive to the starting point. Stata selects a different random starting point each time the cluster command is executed. You can make Stata use a specified random starting point with the start(prandom(#)) option, which makes it possible to replicate analyses exactly.
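
Replication with a fixed seed can be checked directly. The sketch below assumes Stata 8 or later and uses the auto dataset that ships with Stata; the variables mpg, weight and length, the names runa and runb, and the seed 12345 are chosen only for illustration.

```stata
* Illustrative data shipped with Stata (note: clear replaces the data in memory)
sysuse auto, clear

* Run the same partition twice with an identical prandom() seed
cluster kmeans mpg weight length, k(3) name(runa) start(prandom(12345))
cluster kmeans mpg weight length, k(3) name(runb) start(prandom(12345))

* Cross-tabulate the two solutions and confirm they are identical
tabulate runa runb
assert runa == runb
```

If the two partitions agree, every car falls on the diagonal of the table and the assert passes silently.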

Kmeans Cluster Analysis in Stata

input lep read math lang str3 district
.38 626.5 601.3 605.3 lau
.18 654.0 647.1 641.8 ccu
.07 677.2 676.5 670.5 bhu
.09 639.9 640.3 636.0 ing
.19 614.7 617.3 606.2 com
.12 670.2 666.0 659.3 smm
.20 651.1 645.2 643.4 bur
.41 645.4 645.8 644.8 gln
.07 683.5 682.9 674.3 pvu
.39 648.6 647.8 643.1 sgu
.21 650.4 650.8 643.9 abc
.24 637.0 636.9 626.5 pas
.09 641.1 628.8 629.4 lan
.12 638.0 627.7 628.6 plm
.11 661.4 659.0 651.8 tor
.22 646.4 646.2 647.0 dow
.33 634.1 632.0 627.8 lbu
end

cluster kmeans lep read math lang, k(3) name(cl3) start(prandom(1122334455))

tabstat lep read math lang, by(cl3)

Summary statistics: mean
  by categories of: cl3 

     cl3 |       lep      read      math      lang
---------+----------------------------------------
       1 |      .225     631.9       624  620.6333
       2 |    .22625    649.65   647.775   643.975
       3 |  .0866667  676.9667  675.1333  668.0333
---------+----------------------------------------
   Total |  .2011765  648.2059  644.2118  639.9824
--------------------------------------------------

tabulate district cl3
 
           |               cl3
  district |         1          2          3 |     Total
-----------+---------------------------------+----------
       abc |         0          1          0 |         1 
       bhu |         0          0          1 |         1 
       bur |         0          1          0 |         1 
       ccu |         0          1          0 |         1 
       com |         1          0          0 |         1 
       dow |         0          1          0 |         1 
       gln |         0          1          0 |         1 
       ing |         0          1          0 |         1 
       lan |         1          0          0 |         1 
       lau |         1          0          0 |         1 
       lbu |         1          0          0 |         1 
       pas |         1          0          0 |         1 
       plm |         1          0          0 |         1 
       pvu |         0          0          1 |         1 
       sgu |         0          1          0 |         1 
       smm |         0          0          1 |         1 
       tor |         0          1          0 |         1 
-----------+---------------------------------+----------
     Total |         6          8          3 |        17

xi: mvreg lep read math lang = i.cl3
i.cl3             _Icl3_1-3           (naturally coded; _Icl3_1 omitted)

Equation          Obs  Parms        RMSE    "R-sq"          F        P
----------------------------------------------------------------------
lep                17      3    .1079696    0.2264   2.049005   0.1658
read               17      3    7.794986    0.8279   33.68502   0.0000
math               17      3    9.170697    0.8216   32.22942   0.0000
lang               17      3     8.15762    0.8356   35.57204   0.0000

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
lep          |
     _Icl3_2 |     .00125   .0583103     0.02   0.983    -.1238131    .1263131
     _Icl3_3 |  -.1383333   .0763461    -1.81   0.091    -.3020793    .0254127
       _cons |       .225   .0440784     5.10   0.000     .1304612    .3195388
-------------+----------------------------------------------------------------
read         |
     _Icl3_2 |   17.75002   4.209774     4.22   0.001     8.720949    26.77908
     _Icl3_3 |   45.06668   5.511887     8.18   0.000     33.24486     56.8885
       _cons |      631.9    3.18229   198.57   0.000     625.0747    638.7253
-------------+----------------------------------------------------------------
math         |
     _Icl3_2 |   23.77499   4.952742     4.80   0.000     13.15242    34.39757
     _Icl3_3 |   51.13334   6.484662     7.89   0.000     37.22513    65.04156
       _cons |        624   3.743921   166.67   0.000     615.9701    632.0299
-------------+----------------------------------------------------------------
lang         |
     _Icl3_2 |   23.34167   4.405619     5.30   0.000     13.89256    32.79078
     _Icl3_3 |   47.39999   5.768309     8.22   0.000      35.0282    59.77179
       _cons |   620.6333   3.330335   186.36   0.000     613.4905    627.7762
------------------------------------------------------------------------------
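
The mvreg step regresses each clustering variable on cluster membership, so the R-sq column indicates how much of each variable's variation is accounted for by the partition. The same check can be sketched with factor-variable notation in place of the xi: prefix. This assumes Stata 11 or later; the auto dataset, the variables mpg, weight and length, the name grp, and the seed are illustrative choices and not part of the analysis above.

```stata
* Illustrative data shipped with Stata (note: clear replaces the data in memory)
sysuse auto, clear
cluster kmeans mpg weight length, k(3) name(grp) start(prandom(12345))

* i.grp builds the cluster indicators, with the first cluster as the base category
mvreg mpg weight length = i.grp

* Single-equation regression on the same groups; its R-squared should match
* the R-sq reported for weight in the mvreg header
regress weight i.grp
display "R-squared for weight: " %6.4f e(r2)
```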


cluster kmeans lep read math lang, k(4) name(cl4) start(prandom(9988776655))

tabstat lep read math lang, by(cl4)

Summary statistics: mean
  by categories of: cl4 

     cl4 |       lep      read      math      lang
---------+----------------------------------------
       1 |  .2457143  651.0429  648.8429  645.1143
       2 |  .0866667  676.9667  675.1333  668.0333
       3 |      .174    638.02    633.14    629.66
       4 |      .285     620.6     609.3    605.75
---------+----------------------------------------
   Total |  .2011765  648.2059  644.2118  639.9824
--------------------------------------------------

tabulate district cl4

           |                     cl4
  district |         1          2          3          4 |     Total
-----------+--------------------------------------------+----------
       abc |         1          0          0          0 |         1 
       bhu |         0          1          0          0 |         1 
       bur |         1          0          0          0 |         1 
       ccu |         1          0          0          0 |         1 
       com |         0          0          0          1 |         1 
       dow |         1          0          0          0 |         1 
       gln |         1          0          0          0 |         1 
       ing |         0          0          1          0 |         1 
       lan |         0          0          1          0 |         1 
       lau |         0          0          0          1 |         1 
       lbu |         0          0          1          0 |         1 
       pas |         0          0          1          0 |         1 
       plm |         0          0          1          0 |         1 
       pvu |         0          1          0          0 |         1 
       sgu |         1          0          0          0 |         1 
       smm |         0          1          0          0 |         1 
       tor |         1          0          0          0 |         1 
-----------+--------------------------------------------+----------
     Total |         7          3          5          2 |        17 

xi: mvreg lep read math lang = i.cl4
i.cl4             _Icl4_1-4           (naturally coded; _Icl4_1 omitted)

Equation          Obs  Parms        RMSE    "R-sq"          F        P
----------------------------------------------------------------------
lep                17      4    .1037779    0.3364   2.196513   0.1373
read               17      4    5.286934    0.9265   54.62785   0.0000
math               17      4     6.38132    0.9198   49.68041   0.0000
lang               17      4    4.338311    0.9568   96.01699   0.0000

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
lep          |
     _Icl4_2 |  -.1590476   .0716136    -2.22   0.045    -.3137593   -.0043359
     _Icl4_3 |  -.0717143   .0607661    -1.18   0.259    -.2029915    .0595629
     _Icl4_4 |   .0392857   .0832074     0.47   0.645     -.140473    .2190444
       _cons |   .2457143   .0392244     6.26   0.000     .1609752    .3304534
-------------+----------------------------------------------------------------
read         |
     _Icl4_2 |   25.92381   3.648331     7.11   0.000     18.04207    33.80555
     _Icl4_3 |  -13.02287   3.095712    -4.21   0.001    -19.71075   -6.334991
     _Icl4_4 |  -30.44286   4.238978    -7.18   0.000    -39.60061    -21.2851
       _cons |   651.0429   1.998273   325.80   0.000     646.7259    655.3599
-------------+----------------------------------------------------------------
math         |
     _Icl4_2 |   26.29049   4.403529     5.97   0.000     16.77724    35.80374
     _Icl4_3 |  -15.70285   3.736518    -4.20   0.001    -23.77511   -7.630592
     _Icl4_4 |  -39.54286   5.116438    -7.73   0.000    -50.59626   -28.48947
       _cons |   648.8429   2.411912   269.02   0.000     643.6322    654.0535
-------------+----------------------------------------------------------------
lang         |
     _Icl4_2 |   22.91904   2.993719     7.66   0.000      16.4515    29.38658
     _Icl4_3 |  -15.45429   2.540255    -6.08   0.000    -20.94217   -9.966399
     _Icl4_4 |  -39.36428   3.478387   -11.32   0.000    -46.87888   -31.84968
       _cons |   645.1143   1.639728   393.43   0.000     641.5719    648.6567
------------------------------------------------------------------------------

cluster kmeans lep read math lang, k(5) name(cl5) start(prandom(7654321123))

tabstat lep read math lang, by(cl5)

Summary statistics: mean
  by categories of: cl5 

     cl5 |       lep      read      math      lang
---------+----------------------------------------
       1 |  .2457143  651.0429  648.8429  645.1143
       2 |       .09     639.9     640.3       636
       3 |      .195    637.55    631.35   628.075
       4 |      .285     620.6     609.3    605.75
       5 |  .0866667  676.9667  675.1333  668.0333
---------+----------------------------------------
   Total |  .2011765  648.2059  644.2118  639.9824
--------------------------------------------------

tabulate district cl5

           |                          cl5
  district |         1          2          3          4          5 |     Total
-----------+-------------------------------------------------------+----------
       abc |         1          0          0          0          0 |         1 
       bhu |         0          0          0          0          1 |         1 
       bur |         1          0          0          0          0 |         1 
       ccu |         1          0          0          0          0 |         1 
       com |         0          0          0          1          0 |         1 
       dow |         1          0          0          0          0 |         1 
       gln |         1          0          0          0          0 |         1 
       ing |         0          1          0          0          0 |         1 
       lan |         0          0          1          0          0 |         1 
       lau |         0          0          0          1          0 |         1 
       lbu |         0          0          1          0          0 |         1 
       pas |         0          0          1          0          0 |         1 
       plm |         0          0          1          0          0 |         1 
       pvu |         0          0          0          0          1 |         1 
       sgu |         1          0          0          0          0 |         1 
       smm |         0          0          0          0          1 |         1 
       tor |         1          0          0          0          0 |         1 
-----------+-------------------------------------------------------+----------
     Total |         7          1          4          2          3 |        17 

xi: mvreg lep read math lang = i.cl5
i.cl5             _Icl5_1-5           (naturally coded; _Icl5_1 omitted)

Equation          Obs  Parms        RMSE    "R-sq"          F        P
----------------------------------------------------------------------
lep                17      5    .1045578    0.3782   1.824595   0.1890
read               17      5    5.469259    0.9274    38.3217   0.0000
math               17      5     6.22692    0.9295   39.54416   0.0000
lang               17      5     4.02521    0.9657   84.42677   0.0000

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
lep          |
     _Icl5_2 |  -.1557143    .111777    -1.39   0.189    -.3992555    .0878269
     _Icl5_3 |  -.0507143   .0655351    -0.77   0.454     -.193503    .0920744
     _Icl5_4 |   .0392857   .0838328     0.47   0.648    -.1433702    .2219416
     _Icl5_5 |  -.1590476   .0721518    -2.20   0.048    -.3162528   -.0018424
       _cons |   .2457143   .0395191     6.22   0.000     .1596095    .3318191
-------------+----------------------------------------------------------------
read         |
     _Icl5_2 |  -11.14284   5.846884    -1.91   0.081    -23.88211    1.596427
     _Icl5_3 |  -13.49288    3.42804    -3.94   0.002    -20.96193   -6.023819
     _Icl5_4 |  -30.44286   4.385163    -6.94   0.000    -39.99731   -20.88841
     _Icl5_5 |   25.92381   3.774148     6.87   0.000     17.70065    34.14697
       _cons |   651.0429   2.067186   314.94   0.000     646.5389    655.5469
-------------+----------------------------------------------------------------
math         |
     _Icl5_2 |  -8.542864   6.656858    -1.28   0.224    -23.04691    5.961183
     _Icl5_3 |  -17.49285   3.902929    -4.48   0.001     -25.9966   -8.989095
     _Icl5_4 |  -39.54286   4.992643    -7.92   0.000     -50.4209   -28.66483
     _Icl5_5 |   26.29049   4.296983     6.12   0.000     16.92817    35.65281
       _cons |   648.8429   2.353555   275.69   0.000     643.7149    653.9708
-------------+----------------------------------------------------------------
lang         |
     _Icl5_2 |  -9.114284    4.30313    -2.12   0.056       -18.49     .261431
     _Icl5_3 |  -17.03929   2.522934    -6.75   0.000    -22.53629   -11.54229
     _Icl5_4 |  -39.36428   3.227348   -12.20   0.000    -46.39607    -32.3325
     _Icl5_5 |   22.91904   2.777659     8.25   0.000     16.86704    28.97104
       _cons |   645.1143   1.521386   424.03   0.000     641.7995    648.4291
------------------------------------------------------------------------------

Example Using Fisher's Iris Data

use http://www.gseis.ucla.edu/courses/data/iris, clear

cluster kmeans sl sw pl pw, k(3) name(c2) euc start(prandom(4343434343))

tab c2 type

           |           type of iris
        c2 |    setosa  versicolo  virginica |     Total
-----------+---------------------------------+----------
         1 |         0          3         36 |        39 
         2 |         0         47         14 |        61 
         3 |        50          0          0 |        50 
-----------+---------------------------------+----------
     Total |        50         50         50 |       150

Kmedians Cluster Analysis in Stata

Kmedians clustering is a variation on the kmeans method. The same process is followed, except that medians are used instead of means to locate the cluster centers. Kmedians is appropriate when you need a measure of the group centers that is more stable, that is, less affected by extreme observations.
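
The stability of the median can be seen in a toy example. The five scores below are hypothetical, with one extreme value included on purpose, and the variable name score is an assumption for illustration. The example clears the data in memory, so run it separately from the analysis.

```stata
* Hypothetical scores with one extreme value (note: clear replaces the data in memory)
clear
input score
10
11
12
13
100
end

* summarize, detail stores the mean in r(mean) and the median in r(p50)
quietly summarize score, detail
display "mean = " r(mean) "   median = " r(p50)
```

The single extreme score pulls the mean up to 29.2, while the median stays at 12.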

cluster kmedians lep read math lang, k(5) name(med5) start(prandom(777444))

tabulate district med5

           |                          med5
  district |         1          2          3          4          5 |     Total
-----------+-------------------------------------------------------+----------
       abc |         0          0          1          0          0 |         1 
       bhu |         1          0          0          0          0 |         1 
       bur |         0          0          1          0          0 |         1 
       ccu |         0          0          1          0          0 |         1 
       com |         0          0          0          1          0 |         1 
       dow |         0          0          1          0          0 |         1 
       gln |         0          0          1          0          0 |         1 
       ing |         0          1          0          0          0 |         1 
       lan |         0          0          0          0          1 |         1 
       lau |         0          0          0          0          1 |         1 
       lbu |         0          0          0          1          0 |         1 
       pas |         0          0          0          1          0 |         1 
       plm |         0          0          0          0          1 |         1 
       pvu |         1          0          0          0          0 |         1 
       sgu |         0          0          1          0          0 |         1 
       smm |         1          0          0          0          0 |         1 
       tor |         0          0          1          0          0 |         1 
-----------+-------------------------------------------------------+----------
     Total |         3          1          7          3          3 |        17

Example Using Fisher's Iris Data

use http://www.gseis.ucla.edu/courses/data/iris, clear

cluster kmedians sl sw pl pw, k(3) name(c3) euc start(prandom(666565656))

tab c3 type

           |           type of iris
        c3 |    setosa  versicolo  virginica |     Total
-----------+---------------------------------+----------
         1 |         0         10         47 |        57 
         2 |        50          0          0 |        50 
         3 |         0         40          3 |        43 
-----------+---------------------------------+----------
     Total |        50         50         50 |       150
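
One rough way to compare the two iris solutions is the share of flowers that fall in the cluster dominated by their own species. The sketch below simply redoes that arithmetic from the counts in the kmeans and kmedians tables. Matching each cluster to its dominant species is an assumption made here for illustration, and the scalar names are hypothetical.

```stata
* Counts for the dominant species in each cluster, read from the two tables
scalar agree_kmeans   = (36 + 47 + 50) / 150
scalar agree_kmedians = (47 + 50 + 40) / 150

display "kmeans agreement:   " %5.3f agree_kmeans
display "kmedians agreement: " %5.3f agree_kmedians
```

By this measure, and for these particular starting points, the kmedians partition agrees with species for 137 of the 150 flowers, against 133 for kmeans.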



Phil Ender, 5jan05, 24apr00