Introduction
Partition methods break the observations into distinct, nonoverlapping groups. There are many different partition methods; this unit illustrates two of them, kmeans and kmedians. Kmeans clustering is a popular nonhierarchical clustering technique. It is particularly appropriate when the number of clusters, or at least the approximate number, is known a priori. Unlike hierarchical cluster analysis, kmeans clustering does not produce a solution for every possible number of clusters among the n observations. Rather, the researcher provides the kmeans program with the number of clusters, and the program searches for the best solution with that number of clusters.
Kmeans cluster analysis programs begin by creating the k clusters according to some arbitrary procedure. The program then calculates the mean, or centroid, of each cluster. If an observation is closer to the centroid of another cluster, it is made a member of that cluster, and the centroids are recalculated. This process is repeated until no observation is reassigned to a different cluster.
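The two alternating steps described above can be written compactly. This is only a sketch: the symbols x_i (observation i), m_j (centroid of cluster j), C_j (the set of observations currently in cluster j), and c(i) (the cluster assigned to observation i) are notation introduced here, and Euclidean distance is assumed.

```latex
% Assignment step: each observation x_i joins the cluster with the nearest centroid
c(i) = \arg\min_{j \in \{1,\dots,k\}} \lVert x_i - m_j \rVert

% Update step: each centroid is recomputed as the mean of its current members
m_j = \frac{1}{|C_j|} \sum_{i \in C_j} x_i, \qquad C_j = \{\, i : c(i) = j \,\}
```

The two steps alternate until the assignment step leaves every c(i) unchanged.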
The process of partitioning is sensitive to the starting point. Stata selects a different random starting point each time the cluster command is executed. You can make Stata use a specified random starting point by giving a seed in the prandom() suboption of start(), which makes it possible to replicate analyses exactly.
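A quick way to see the effect of the seed is to run the same analysis more than once and cross-tabulate the results. The sketch below assumes the district data used in the next section are already in memory; the names rep_a, rep_b, and rep_c and the seeds 12345 and 54321 are hypothetical choices made only for this illustration.

```stata
* Assumes the district data (lep read math lang) are already in memory.
* rep_a, rep_b, rep_c and both seeds are hypothetical, chosen for this sketch.

* Same seed twice: the two solutions should match exactly.
cluster kmeans lep read math lang, k(3) name(rep_a) start(prandom(12345))
cluster kmeans lep read math lang, k(3) name(rep_b) start(prandom(12345))
tabulate rep_a rep_b

* A different seed may or may not lead to a different partition.
cluster kmeans lep read math lang, k(3) name(rep_c) start(prandom(54321))
tabulate rep_a rep_c

* Remove the illustrative solutions and their grouping variables.
cluster drop rep_a rep_b rep_c
```

If the first cross-tabulation has a single nonzero cell in every row, the two runs with the same seed produced the same grouping.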
Kmeans Cluster Analysis in Stata
input lep read math lang str3 district
.38 626.5 601.3 605.3 lau
.18 654.0 647.1 641.8 ccu
.07 677.2 676.5 670.5 bhu
.09 639.9 640.3 636.0 ing
.19 614.7 617.3 606.2 com
.12 670.2 666.0 659.3 smm
.20 651.1 645.2 643.4 bur
.41 645.4 645.8 644.8 gln
.07 683.5 682.9 674.3 pvu
.39 648.6 647.8 643.1 sgu
.21 650.4 650.8 643.9 abc
.24 637.0 636.9 626.5 pas
.09 641.1 628.8 629.4 lan
.12 638.0 627.7 628.6 plm
.11 661.4 659.0 651.8 tor
.22 646.4 646.2 647.0 dow
.33 634.1 632.0 627.8 lbu
end

cluster kmeans lep read math lang, k(3) name(cl3) start(prandom(1122334455))

tabstat lep read math lang, by(cl3)

Summary statistics: mean
  by categories of: cl3

     cl3 |       lep      read      math      lang
---------+----------------------------------------
       1 |      .225     631.9       624  620.6333
       2 |    .22625    649.65   647.775   643.975
       3 |  .0866667  676.9667  675.1333  668.0333
---------+----------------------------------------
   Total |  .2011765  648.2059  644.2118  639.9824
--------------------------------------------------

tabulate district cl3

           |               cl3
  district |         1          2          3 |     Total
-----------+---------------------------------+----------
       abc |         0          1          0 |         1
       bhu |         0          0          1 |         1
       bur |         0          1          0 |         1
       ccu |         0          1          0 |         1
       com |         1          0          0 |         1
       dow |         0          1          0 |         1
       gln |         0          1          0 |         1
       ing |         0          1          0 |         1
       lan |         1          0          0 |         1
       lau |         1          0          0 |         1
       lbu |         1          0          0 |         1
       pas |         1          0          0 |         1
       plm |         1          0          0 |         1
       pvu |         0          0          1 |         1
       sgu |         0          1          0 |         1
       smm |         0          0          1 |         1
       tor |         0          1          0 |         1
-----------+---------------------------------+----------
     Total |         6          8          3 |        17

xi: mvreg lep read math lang = i.cl3

i.cl3             _Icl3_1-3           (naturally coded; _Icl3_1 omitted)

Equation          Obs  Parms        RMSE    "R-sq"          F        P
----------------------------------------------------------------------
lep                17      3    .1079696    0.2264   2.049005   0.1658
read               17      3    7.794986    0.8279   33.68502   0.0000
math               17      3    9.170697    0.8216   32.22942   0.0000
lang               17      3     8.15762    0.8356   35.57204   0.0000

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
lep          |
     _Icl3_2 |     .00125   .0583103     0.02   0.983    -.1238131    .1263131
     _Icl3_3 |  -.1383333   .0763461    -1.81   0.091    -.3020793    .0254127
       _cons |       .225   .0440784     5.10   0.000     .1304612    .3195388
-------------+----------------------------------------------------------------
read         |
     _Icl3_2 |   17.75002   4.209774     4.22   0.001     8.720949    26.77908
     _Icl3_3 |   45.06668   5.511887     8.18   0.000     33.24486     56.8885
       _cons |      631.9    3.18229   198.57   0.000     625.0747    638.7253
-------------+----------------------------------------------------------------
math         |
     _Icl3_2 |   23.77499   4.952742     4.80   0.000     13.15242    34.39757
     _Icl3_3 |   51.13334   6.484662     7.89   0.000     37.22513    65.04156
       _cons |        624   3.743921   166.67   0.000     615.9701    632.0299
-------------+----------------------------------------------------------------
lang         |
     _Icl3_2 |   23.34167   4.405619     5.30   0.000     13.89256    32.79078
     _Icl3_3 |   47.39999   5.768309     8.22   0.000      35.0282    59.77179
       _cons |   620.6333   3.330335   186.36   0.000     613.4905    627.7762
------------------------------------------------------------------------------

cluster kmeans lep read math lang, k(4) name(cl4) start(prandom(9988776655))

tabstat lep read math lang, by(cl4)

Summary statistics: mean
  by categories of: cl4

     cl4 |       lep      read      math      lang
---------+----------------------------------------
       1 |  .2457143  651.0429  648.8429  645.1143
       2 |  .0866667  676.9667  675.1333  668.0333
       3 |      .174    638.02    633.14    629.66
       4 |      .285     620.6     609.3    605.75
---------+----------------------------------------
   Total |  .2011765  648.2059  644.2118  639.9824
--------------------------------------------------

tabulate district cl4

           |                    cl4
  district |         1          2          3          4 |     Total
-----------+--------------------------------------------+----------
       abc |         1          0          0          0 |         1
       bhu |         0          1          0          0 |         1
       bur |         1          0          0          0 |         1
       ccu |         1          0          0          0 |         1
       com |         0          0          0          1 |         1
       dow |         1          0          0          0 |         1
       gln |         1          0          0          0 |         1
       ing |         0          0          1          0 |         1
       lan |         0          0          1          0 |         1
       lau |         0          0          0          1 |         1
       lbu |         0          0          1          0 |         1
       pas |         0          0          1          0 |         1
       plm |         0          0          1          0 |         1
       pvu |         0          1          0          0 |         1
       sgu |         1          0          0          0 |         1
       smm |         0          1          0          0 |         1
       tor |         1          0          0          0 |         1
-----------+--------------------------------------------+----------
     Total |         7          3          5          2 |        17

xi: mvreg lep read math lang = i.cl4

i.cl4             _Icl4_1-4           (naturally coded; _Icl4_1 omitted)

Equation          Obs  Parms        RMSE    "R-sq"          F        P
----------------------------------------------------------------------
lep                17      4    .1037779    0.3364   2.196513   0.1373
read               17      4    5.286934    0.9265   54.62785   0.0000
math               17      4     6.38132    0.9198   49.68041   0.0000
lang               17      4    4.338311    0.9568   96.01699   0.0000

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
lep          |
     _Icl4_2 |  -.1590476   .0716136    -2.22   0.045    -.3137593   -.0043359
     _Icl4_3 |  -.0717143   .0607661    -1.18   0.259    -.2029915    .0595629
     _Icl4_4 |   .0392857   .0832074     0.47   0.645     -.140473    .2190444
       _cons |   .2457143   .0392244     6.26   0.000     .1609752    .3304534
-------------+----------------------------------------------------------------
read         |
     _Icl4_2 |   25.92381   3.648331     7.11   0.000     18.04207    33.80555
     _Icl4_3 |  -13.02287   3.095712    -4.21   0.001    -19.71075   -6.334991
     _Icl4_4 |  -30.44286   4.238978    -7.18   0.000    -39.60061    -21.2851
       _cons |   651.0429   1.998273   325.80   0.000     646.7259    655.3599
-------------+----------------------------------------------------------------
math         |
     _Icl4_2 |   26.29049   4.403529     5.97   0.000     16.77724    35.80374
     _Icl4_3 |  -15.70285   3.736518    -4.20   0.001    -23.77511   -7.630592
     _Icl4_4 |  -39.54286   5.116438    -7.73   0.000    -50.59626   -28.48947
       _cons |   648.8429   2.411912   269.02   0.000     643.6322    654.0535
-------------+----------------------------------------------------------------
lang         |
     _Icl4_2 |   22.91904   2.993719     7.66   0.000      16.4515    29.38658
     _Icl4_3 |  -15.45429   2.540255    -6.08   0.000    -20.94217   -9.966399
     _Icl4_4 |  -39.36428   3.478387   -11.32   0.000    -46.87888   -31.84968
       _cons |   645.1143   1.639728   393.43   0.000     641.5719    648.6567
------------------------------------------------------------------------------

cluster kmeans lep read math lang, k(5) name(cl5) start(prandom(7654321123))

tabstat lep read math lang, by(cl5)

Summary statistics: mean
  by categories of: cl5

     cl5 |       lep      read      math      lang
---------+----------------------------------------
       1 |  .2457143  651.0429  648.8429  645.1143
       2 |       .09     639.9     640.3       636
       3 |      .195    637.55    631.35   628.075
       4 |      .285     620.6     609.3    605.75
       5 |  .0866667  676.9667  675.1333  668.0333
---------+----------------------------------------
   Total |  .2011765  648.2059  644.2118  639.9824
--------------------------------------------------

tabulate district cl5

           |                          cl5
  district |         1          2          3          4          5 |     Total
-----------+-------------------------------------------------------+----------
       abc |         1          0          0          0          0 |         1
       bhu |         0          0          0          0          1 |         1
       bur |         1          0          0          0          0 |         1
       ccu |         1          0          0          0          0 |         1
       com |         0          0          0          1          0 |         1
       dow |         1          0          0          0          0 |         1
       gln |         1          0          0          0          0 |         1
       ing |         0          1          0          0          0 |         1
       lan |         0          0          1          0          0 |         1
       lau |         0          0          0          1          0 |         1
       lbu |         0          0          1          0          0 |         1
       pas |         0          0          1          0          0 |         1
       plm |         0          0          1          0          0 |         1
       pvu |         0          0          0          0          1 |         1
       sgu |         1          0          0          0          0 |         1
       smm |         0          0          0          0          1 |         1
       tor |         1          0          0          0          0 |         1
-----------+-------------------------------------------------------+----------
     Total |         7          1          4          2          3 |        17

xi: mvreg lep read math lang = i.cl5

i.cl5             _Icl5_1-5           (naturally coded; _Icl5_1 omitted)

Equation          Obs  Parms        RMSE    "R-sq"          F        P
----------------------------------------------------------------------
lep                17      5    .1045578    0.3782   1.824595   0.1890
read               17      5    5.469259    0.9274    38.3217   0.0000
math               17      5     6.22692    0.9295   39.54416   0.0000
lang               17      5     4.02521    0.9657   84.42677   0.0000

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
lep          |
     _Icl5_2 |  -.1557143    .111777    -1.39   0.189    -.3992555    .0878269
     _Icl5_3 |  -.0507143   .0655351    -0.77   0.454     -.193503    .0920744
     _Icl5_4 |   .0392857   .0838328     0.47   0.648    -.1433702    .2219416
     _Icl5_5 |  -.1590476   .0721518    -2.20   0.048    -.3162528   -.0018424
       _cons |   .2457143   .0395191     6.22   0.000     .1596095    .3318191
-------------+----------------------------------------------------------------
read         |
     _Icl5_2 |  -11.14284   5.846884    -1.91   0.081    -23.88211    1.596427
     _Icl5_3 |  -13.49288    3.42804    -3.94   0.002    -20.96193   -6.023819
     _Icl5_4 |  -30.44286   4.385163    -6.94   0.000    -39.99731   -20.88841
     _Icl5_5 |   25.92381   3.774148     6.87   0.000     17.70065    34.14697
       _cons |   651.0429   2.067186   314.94   0.000     646.5389    655.5469
-------------+----------------------------------------------------------------
math         |
     _Icl5_2 |  -8.542864   6.656858    -1.28   0.224    -23.04691    5.961183
     _Icl5_3 |  -17.49285   3.902929    -4.48   0.001     -25.9966   -8.989095
     _Icl5_4 |  -39.54286   4.992643    -7.92   0.000     -50.4209   -28.66483
     _Icl5_5 |   26.29049   4.296983     6.12   0.000     16.92817    35.65281
       _cons |   648.8429   2.353555   275.69   0.000     643.7149    653.9708
-------------+----------------------------------------------------------------
lang         |
     _Icl5_2 |  -9.114284    4.30313    -2.12   0.056       -18.49     .261431
     _Icl5_3 |  -17.03929   2.522934    -6.75   0.000    -22.53629   -11.54229
     _Icl5_4 |  -39.36428   3.227348   -12.20   0.000    -46.39607    -32.3325
     _Icl5_5 |   22.91904   2.777659     8.25   0.000     16.86704    28.97104
       _cons |   645.1143   1.521386   424.03   0.000     641.7995    648.4291
------------------------------------------------------------------------------
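The three analyses above follow the same pattern for k = 3, 4, and 5, so they could also be run in a loop. The sketch below is one possible way to do that, not the procedure used to produce the output above: the names km3, km4, and km5 and the single seed 12345 are hypothetical, so the resulting clusters need not match those shown.

```stata
* Assumes the district data entered above are in memory.
* km3, km4, km5 and the seed 12345 are hypothetical choices for illustration.
forvalues k = 3/5 {
    * Fit a k-cluster solution and store the grouping in km3, km4, or km5
    cluster kmeans lep read math lang, k(`k') name(km`k') start(prandom(12345))

    * Cluster means, cluster membership, and between-cluster differences
    tabstat lep read math lang, by(km`k')
    tabulate district km`k'
    xi: mvreg lep read math lang = i.km`k'
}
```

Using a separate name() for each value of k keeps all three solutions available for later comparison.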
Example Using Fisher's Iris Data
use http://www.gseis.ucla.edu/courses/data/iris, clear

cluster kmeans sl sw pl pw, k(3) name(c2) euc start(prandom(4343434343))

tab c2 type

           |           type of iris
        c2 |    setosa  versicolo  virginica |     Total
-----------+---------------------------------+----------
         1 |         0          3         36 |        39
         2 |         0         47         14 |        61
         3 |        50          0          0 |        50
-----------+---------------------------------+----------
     Total |        50         50         50 |       150
Kmedians Cluster Analysis in Stata
Kmedians clustering is a variation on the kmeans method. The same process is followed, except that cluster medians are used in place of means. Because medians are less affected by extreme values than means, kmedians is appropriate when you need a more stable measure of the group centers.
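The difference from kmeans lies only in how each cluster center is recomputed. As a sketch, write x_{iv} for the value of variable v on observation i, C_j for the set of observations currently in cluster j, and m_{jv} for coordinate v of that cluster's center; this notation is introduced here for illustration. The small numeric example uses made-up values, not the district data, to show why the median is the more stable center.

```latex
% Kmedians update step: each coordinate v of the center of cluster C_j
% is the median of that variable over the cluster's current members
m_{jv} = \operatorname{median}\{\, x_{iv} : i \in C_j \,\}

% Hypothetical values 1, 2, 3, 4, 100 (not from the district data):
\text{mean} = \frac{1 + 2 + 3 + 4 + 100}{5} = 22, \qquad \text{median} = 3
```

A single extreme value pulls the mean far from the bulk of the data, while the median stays put.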
cluster kmedians lep read math lang, k(5) name(med5) start(prandom(777444))

tabulate district med5

           |                         med5
  district |         1          2          3          4          5 |     Total
-----------+-------------------------------------------------------+----------
       abc |         0          0          1          0          0 |         1
       bhu |         1          0          0          0          0 |         1
       bur |         0          0          1          0          0 |         1
       ccu |         0          0          1          0          0 |         1
       com |         0          0          0          1          0 |         1
       dow |         0          0          1          0          0 |         1
       gln |         0          0          1          0          0 |         1
       ing |         0          1          0          0          0 |         1
       lan |         0          0          0          0          1 |         1
       lau |         0          0          0          0          1 |         1
       lbu |         0          0          0          1          0 |         1
       pas |         0          0          0          1          0 |         1
       plm |         0          0          0          0          1 |         1
       pvu |         1          0          0          0          0 |         1
       sgu |         0          0          1          0          0 |         1
       smm |         1          0          0          0          0 |         1
       tor |         0          0          1          0          0 |         1
-----------+-------------------------------------------------------+----------
     Total |         3          1          7          3          3 |        17

Example Using Fisher's Iris Data
use http://www.gseis.ucla.edu/courses/data/iris, clear

cluster kmedians sl sw pl pw, k(3) name(c3) euc start(prandom(666565656))

tab c3 type

           |           type of iris
        c3 |    setosa  versicolo  virginica |     Total
-----------+---------------------------------+----------
         1 |         0         10         47 |        57
         2 |        50          0          0 |        50
         3 |         0         40          3 |        43
-----------+---------------------------------+----------
     Total |        50         50         50 |       150
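One rough way to compare the two iris solutions is to take, for each cluster, the count of its most common iris type and divide the sum of those counts by the 150 observations. This summary is an illustration added here, computed from the kmeans (c2) and kmedians (c3) tables in this unit. Both figures depend on the particular starting seeds used.

```latex
% kmeans (c2): the largest cell in each cluster row is 36, 47, 50
\frac{36 + 47 + 50}{150} = \frac{133}{150} \approx 0.887

% kmedians (c3): the largest cell in each cluster row is 47, 50, 40
\frac{47 + 50 + 40}{150} = \frac{137}{150} \approx 0.913
```

In both solutions setosa is recovered as its own cluster, and the disagreements fall between versicolor and virginica.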
Phil Ender, 5jan05, 24apr00