Multidimensional scaling is a collection of statistical techniques for exploring similarities and dissimilarities in data. Multidimensional scaling takes item-item similarities and assigns each to a location in a low-dimensional space. In this respect it is similar to other data reduction techniques, such as, factor analysis.There are several variaites of multidimensional scaling; classical MDS, modern metric MDS, and modern nonmetric MDS.
As of Stata 10 there is support for classical MDS, modern metric MDS and nonmetric MDS. Whereas, Stata 9 only supported classical metric multidimensional scaling for dissimilarity between observations. For the purposes of this introductory unit, we will demonstrate multidimensional scaling using classical MDS.
This simple example uses airline distances between 10 US cities as the disimilarities.
/* enter distance matrix */ #delimit ; matrix d = ( 0,587,1212,701,1936,604,748,2139,2182,543\ 587,0,920,940,1745,1188,713,1858,1737,597\ 1212,920,0,879,831,1726,1631,949,1021,1494\ 701,940,879,0,1374,968,1420,1645,1891,1220\ 1936,1745,831,1374,0,2339,2451,347,959,2300\ 604,1188,1726,968,2339,0,1092,2594,2734,923\ 748,713,1631,1420,2451,1092,0,2571,2408,205\ 2139,1858,949,1645,347,2594,2571,0,678,2442\ 2182,1737,1021,1891,959,2734,2408,678,0,2329\ 543,597,1494,1220,2300,923,205,2442,2329,0); #delimit cr global names atl chi den hou la mi ny sf sea dc matrix rownames d = $names matrix colnames d = $names matrix list d symmetric d[10,10] atl chi den hou la mi ny sf sea dc atl 0 chi 587 0 den 1212 920 0 hou 701 940 879 0 la 1936 1745 831 1374 0 mi 604 1188 1726 968 2339 0 ny 748 713 1631 1420 2451 1092 0 sf 2139 1858 949 1645 347 2594 2571 0 sea 2182 1737 1021 1891 959 2734 2408 678 0 dc 543 597 1494 1220 2300 923 205 2442 2329 0 mdsmat d, names($names) Classical metric multidimensional scaling dissimilarity matrix: d Number of obs = 10 Eigenvalues > 0 = 6 Mardia fit measure 1 = 0.9954 Retained dimensions = 2 Mardia fit measure 2 = 1.0000 -------------------------------------------------------------------------- | abs(eigenvalue) (eigenvalue)^2 Dimension | Eigenvalue Percent Cumul. Percent Cumul. -------------+------------------------------------------------------------ 1 | 9582144.3 84.64 84.64 96.99 96.99 2 | 1686820.2 14.90 99.54 3.01 100.00 -------------+------------------------------------------------------------ 3 | 8157.2984 0.07 99.61 0.00 100.00 4 | 1432.8699 0.01 99.63 0.00 100.00 5 | 508.66869 0.00 99.63 0.00 100.00 6 | 25.143486 0.00 99.63 0.00 100.00 -------------------------------------------------------------------------- mdsconfig, autoaspect ynegate estat config Approximating configuration in 2-dimensional Euclidean space Category | dim1 dim2 -------------+---------------------------- atl | 718.7594 142.9943 chi | 382.0558 -340.8396 den | -481.6023 -25.2850 hou | 161.4663 572.7699 la | -1203.7380 390.1003 mi | 1133.5271 581.9073 ny | 1072.2357 -519.0242 sf | -1420.6033 112.5892 sea | -1341.7225 -579.7393 dc | 979.6220 -335.4728 ------------------------------------------
The next two examples are taken from the Stata manual.
This dataset consists of eight variables with nutrition data on 25 breakfast cereals.
use http://www.stata-press.com/data/r9/cerealnut, clear describe Contains data from http://www.stata-press.com/data/r9/cerealnut.dta obs: 25 Cereal Nutrition vars: 9 24 Feb 2005 17:19 size: 1,150 (99.9% of memory free) (_dta has notes) ------------------------------------------------------------------------------- storage display value variable name type format label variable label ------------------------------------------------------------------------------- brand str25 %25s Cereal Brand calories int %9.0g Calories (Cal/oz) protein byte %9.0g Protein (g) fat byte %9.0g Fat (g) Na int %9.0g Na (mg) fiber float %9.0g Fiber (g) carbs float %9.0g Carbs (g) sugar byte %9.0g Sugar (g) K int %9.0g K (mg) ------------------------------------------------------------------------------- summarize calories-K Variable | Obs Mean Std. Dev. Min Max -------------+-------------------------------------------------------- calories | 25 109.6 21.30728 50 160 protein | 25 2.68 1.314027 1 6 fat | 25 .92 .7593857 0 2 Na | 25 195.8 71.32204 0 320 fiber | 25 1.7 2.056494 0 9 -------------+-------------------------------------------------------- carbs | 25 15.3 4.028544 7 22 sugar | 25 7.4 4.609772 0 14 K | 25 90.6 77.5043 15 320 replace brand = subinstr(brand, " ","_",.)
note the three variables, calories, Na, and K, which have standard deviations that are much higher than the other variables.The replace ckommand was used to replace spaces with underscores to make the graphs cleaner and easier to interpret.
list brand, clean brand 1. Cheerios 2. Cocoa_Puffs 3. Honey_Nut_Cheerios 4. Kix 5. Lucky_Charms 6. Oatmeal_Raisin_Crisp 7. Raisin_Nut_Bran 8. Total_Corn_Flakes 9. Total_Raisin_Bran 10. Trix 11. Wheaties_Honey_Gold 12. All-Bran 13. Apple_Jacks 14. Corn_Flakes 15. Corn_Pops 16. Mueslix_Crispy_Blend 17. Nut_&_Honey_Crunch 18. Nutri_Grain_Almond_Raisin 19. Nutri_Grain_Wheat 20. Product_19 21. Raisin_Bran 22. Rice_Krispies 23. Special_K 24. Life 25. Puffed_Rice mds calories-K, id(brand) config Classical metric multidimensional scaling dissimilarity: L2, computed on 8 variables Number of obs = 25 Eigenvalues > 0 = 8 Mardia fit measure 1 = 0.9603 Retained dimensions = 2 Mardia fit measure 2 = 0.9970 -------------------------------------------------------------------------- | abs(eigenvalue) (eigenvalue)^2 Dimension | Eigenvalue Percent Cumul. Percent Cumul. -------------+------------------------------------------------------------ 1 | 158437.92 56.95 56.95 67.78 67.78 2 | 108728.77 39.08 96.03 31.92 99.70 -------------+------------------------------------------------------------ 3 | 10562.645 3.80 99.83 0.30 100.00 4 | 382.67849 0.14 99.97 0.00 100.00 5 | 69.761715 0.03 99.99 0.00 100.00 6 | 12.520822 0.00 100.00 0.00 100.00 7 | 5.7559984 0.00 100.00 0.00 100.00 8 | 2.2243244 0.00 100.00 0.00 100.00 -------------------------------------------------------------------------- Approximating configuration in 2-dimensional Euclidean space brand | dim1 dim2 -------------+---------------------------- Cheerios | -61.8271 72.5534 Cocoa_Puffs | 38.5094 5.1037 Honey_Nut_~s | -28.0515 46.0667 Kix | 9.1693 81.4942 Lucky_Charms | 38.5024 5.1356 Oatmeal_Ra~p | -12.5635 -37.0897 Raisin_Nut~n | -12.0040 -73.7800 Total_Corn~s | 44.9827 33.2502 Total_Rais~n | -117.0067 -77.9962 Trix | 85.0033 -12.9330 Wheaties_H~d | 23.7367 19.7182 All-Bran | -226.1791 -67.6752 Apple_Jacks | 88.6199 -28.4323 Corn_Flakes | -1.8069 109.3770 Corn_Pops | 115.5366 -52.7072 Mueslix_Cr~d | -37.7449 -74.4727 Nut_&_Hone~h | 45.3886 21.9393 Nutri_Grai~n | -47.9441 0.6082 Nutri_Grai~t | 15.2261 -21.7290 Product_19 | -26.0875 129.4798 Raisin_Bran | -134.8587 -66.7255 Rice_Krisp~s | -2.3710 109.6115 Special_K | 12.1670 47.9540 Life | 20.9036 -41.4515 Puffed_Rice | 170.6994 -127.2995
With a little bit of work we can make the graph clearer by making use of the mlabvopos option to move the lable to different clock positions indicated by the new variable place.
generate place = 3 replace place = 9 if inlist(brand, "All_Bran","Corn_Flakes","Honey_Nut_Cheerios", /// "Wheaties_Honey_Gold","Nutri_Grain_Wheat","Nutri_Grain_Almond_Raisin", /// "Oatmeal_Raisin_Crisp","Cocoa_Puffs","Total_Raisin_Bran") replace place = 6 if inlist(brand,"Mueslix_Crispy_Blend") replace place = 12 if inlist(brand, "Raisin_Bran","Special_K") mdsconfig, autoaspect mlabvpos(place)
This configuration is due in large to the three variables with large standard deviations. An alternative is to analyze the data with standardized variables using the std option. This analysis with standardized Euclidean distances is equivalent to a principal components analysis of the correlation among the variables.
mds calories-K, id(brand) config std noplot Classical metric multidimensional scaling dissimilarity: L2, computed on 8 variables Number of obs = 25 Eigenvalues > 0 = 8 Mardia fit measure 1 = 0.5987 Retained dimensions = 2 Mardia fit measure 2 = 0.7697 -------------------------------------------------------------------------- | abs(eigenvalue) (eigenvalue)^2 Dimension | Eigenvalue Percent Cumul. Percent Cumul. -------------+------------------------------------------------------------ 1 | 65.645395 34.19 34.19 49.21 49.21 2 | 49.311416 25.68 59.87 27.77 76.97 -------------+------------------------------------------------------------ 3 | 38.826608 20.22 80.10 17.21 94.19 4 | 17.727805 9.23 89.33 3.59 97.78 5 | 11.230087 5.85 95.18 1.44 99.22 6 | 8.2386231 4.29 99.47 0.78 99.99 7 | .77953426 0.41 99.87 0.01 100.00 8 | .24053137 0.13 100.00 0.00 100.00 -------------------------------------------------------------------------- Approximating configuration in 2-dimensional Euclidean space brand | dim1 dim2 -------------+---------------------------- Cheerios | -1.3080 2.6638 Cocoa_Puffs | 0.6296 -1.7910 Honey_Nut_~s | -0.5050 -0.2227 Kix | 1.4003 1.3242 Lucky_Charms | 0.4178 -1.3534 Oatmeal_Ra~p | -1.1762 -0.7533 Raisin_Nut~n | -1.3523 -0.9414 Total_Corn~s | 1.5175 0.8541 Total_Rais~n | -2.3049 -0.6710 Trix | 1.0107 -1.8899 Wheaties_H~d | 0.5404 -0.2336 All-Bran | -4.0119 0.8411 Apple_Jacks | 0.7712 -2.0103 Corn_Flakes | 1.7864 1.8346 Corn_Pops | 1.3661 -2.1499 Mueslix_Cr~d | -2.0077 -0.8722 Nut_&_Hone~h | 0.7470 -0.6259 Nutri_Grai~n | -1.1706 0.8679 Nutri_Grai~t | 0.6929 1.0345 Product_19 | 1.3073 2.1645 Raisin_Bran | -2.4414 -0.2820 Rice_Krisp~s | 1.9619 1.7543 Special_K | 0.2362 1.9531 Life | -0.9843 -0.1881 Puffed_Rice | 2.8769 -1.3072 drop place generate place = 3 replace place = 9 if inlist(brand, "All_Bran","Corn_Flakes", /// "Nutri_Grain_Wheat","Apple_Jacks","Life", "Raisin_Bran" /// "Oatmeal_Raisin_Crisp","Cocoa_Puffs","Total_Raisin_Bran") replace place = 6 if inlist(brand,"Mueslix_Crispy_Blend","Nutri_Grain_Almond_Raisin", /// "Wheaties_Honey_Gold") replace place = 12 if inlist(brand,"Special_K","Honey_Nut_Cheerios") mdsconfig, autoaspect mlabvpos(place)
This dataset consists of seven variables measuring the number of topic pages in 25 multivariate statistics books.
use http://www.stata-press.com/data/r9/mvstatsbooks, clear describe Contains data from http://www.stata-press.com/data/r9/mvstatsbooks.dta obs: 25 vars: 8 15 Mar 2005 16:27 size: 825 (99.9% of memory free) (_dta has notes) ------------------------------------------------------------------------------- storage display value variable name type format label variable label ------------------------------------------------------------------------------- author str17 %17s math int %9.0g math other than statistics (e.g., linear algebra) corr int %9.0g correlation and regression, including linear structural and functional equations fact byte %9.0g factor analysis and principal component analysis cano byte %9.0g canonical correlation analysis disc int %9.0g discriminant analysis, classification, and cluster analysis stat int %9.0g statistics, incl. dist. theory, hypothesis testing & est.; categorical data mano int %9.0g manova and the general linear model ------------------------------------------------------------------------------- list, clean noobs author math corr fact cano disc stat mano Roy57 31 0 0 0 0 164 11 Kendall57 0 16 54 18 27 13 14 Kendall75 0 40 32 10 42 60 0 Anderson58 19 0 35 19 28 163 52 CooleyLohnes62 14 7 35 22 17 0 56 CooleyLohnes71 20 69 72 33 55 0 32 Morrison67 74 0 86 14 0 84 48 Morrison76 78 0 80 5 17 105 60 VandeGeer67 74 19 33 12 26 0 0 VandeGeer71 80 68 67 15 29 0 0 Dempster69 108 48 4 10 46 108 0 Tasuoka71 109 13 5 17 39 32 46 Harris75 16 35 69 24 0 26 41 Dagnelie75 26 86 60 6 48 48 28 GreenCaroll76 290 10 6 0 8 0 2 CailliezPages76 184 48 82 42 134 0 0 Giri77 29 0 0 0 41 211 32 Gnanadesikan77 0 19 56 0 39 75 0 Kshirsagar78 0 22 45 42 60 230 59 Thorndike78 30 128 90 28 48 0 0 MardiaKentBibby79 34 28 68 19 67 131 55 Seber84 16 0 59 13 116 129 101 Stevens96 23 87 67 21 30 43 249 EverittDunn01 0 54 65 0 56 20 30 Rencher02 38 0 71 19 105 135 131 mds math-mano, id(author) measure(corr) config noplot Classical metric multidimensional scaling similarity: correlation, computed on 7 variables dissimilarity: sqrt(2(1-similarity)) Number of obs = 25 Eigenvalues > 0 = 6 Mardia fit measure 1 = 0.6680 Retained dimensions = 2 Mardia fit measure 2 = 0.8496 -------------------------------------------------------------------------- | abs(eigenvalue) (eigenvalue)^2 Dimension | Eigenvalue Percent Cumul. Percent Cumul. -------------+------------------------------------------------------------ 1 | 8.469821 38.92 38.92 56.15 56.15 2 | 6.0665813 27.88 66.80 28.81 84.96 -------------+------------------------------------------------------------ 3 | 3.8157101 17.53 84.33 11.40 96.35 4 | 1.6926956 7.78 92.11 2.24 98.60 5 | 1.2576053 5.78 97.89 1.24 99.83 6 | .45929376 2.11 100.00 0.17 100.00 -------------------------------------------------------------------------- Approximating configuration in 2-dimensional Euclidean space author | dim1 dim2 -------------+---------------------------- Roy57 | 0.7420 -0.2268 Kendall57 | -0.3794 0.6839 Kendall75 | 0.2763 0.4147 Anderson58 | 0.8144 -0.0001 CooleyLoh~62 | -0.3466 0.2573 CooleyLoh~71 | -0.9160 0.5594 Morrison67 | 0.2397 -0.2910 Morrison76 | 0.4088 -0.2976 VandeGeer67 | -0.7041 -0.7320 VandeGeer71 | -0.9173 -0.3062 Dempster69 | 0.2363 -0.7154 Tasuoka71 | -0.1549 -0.9563 Harris75 | -0.3220 0.4854 Dagnelie75 | -0.3509 0.4532 GreenCaro~76 | -0.4211 -0.9574 CailliezP~76 | -0.6827 -0.6365 Giri77 | 0.7898 -0.1441 Gnanadesi~77 | 0.3820 0.3612 Kshirsagar78 | 0.8014 0.1435 Thorndike78 | -0.8695 0.3641 MardiaKen~79 | 0.6923 0.1491 Seber84 | 0.6004 0.2254 Stevens96 | -0.0850 0.3124 EverittDu~01 | -0.4346 0.7139 Rencher02 | 0.6007 0.1399 ------------------------------------------ generate spot = 3 replace spot = 2 if inlist(author,"Seber84","Kshirsagar78","Kendall75") replace spot = 5 if author == "MardiaKentBibby79" replace spot = 9 if inlist(author,"Dagnelie75","Rencher02", /// "GreenCaroll76","EverittDunn01","CooleyLohnes62","Morrison67") mdsconfig, mlabvpos(spot)
Multivariate Course Page
Phil Ender, 10may05