Multivariate Analysis

Hypothesis Testing: 1 & 2 Groups


Tests of Significance

Hotelling's T²

Single Sample Problems

Known Covariance Matrix Σ

Univariate
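
For reference, with a known population variance the univariate test is the familiar z statistic. The notation here is an assumption, chosen to match the examples that follow:

```latex
z = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}},
\qquad
z^2 = n\,(\bar{X} - \mu_0)\,(\sigma^2)^{-1}\,(\bar{X} - \mu_0) \sim \chi^2_{1}
```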


Multivariate
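
The multivariate analogue replaces the scalars with vectors and the variance with Σ. This form is inferred from the Stata matrix program for this section rather than quoted from it:

```latex
Q = n\,(\bar{\mathbf{X}} - \boldsymbol{\mu}_0)'\,\boldsymbol{\Sigma}^{-1}\,(\bar{\mathbf{X}} - \boldsymbol{\mu}_0) \sim \chi^2_{p}
```

where p is the number of variables.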

Single-Sample with known Population Covariance Matrix

  • Suppose that a sample of 25 observations is drawn from a bivariate normal population with unknown centroid μ and covariance matrix Σ = [16 8, 8 9].

  • If the sample centroid is found to be Xbar' = [15.4 9.9], test the hypothesis that μ' = [17 10] at the 5% significance level.

    Stata Matrix Program

    scalar n = 25  
    matrix mu = (17 \ 10)  
    matrix xbar = (15.4 \ 9.9)  
    matrix sigma = (16, 8 \ 8, 9)
    matrix x = xbar - mu
    
    matrix list mu
    matrix list  xbar
    matrix list  x
    matrix list  sigma
    
    matrix Q = n * x'*syminv(sigma)*x
    display "Q = "  el(Q,1,1)
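
To finish the test, Q can be referred to a chi-square distribution with p = 2 degrees of freedom. A minimal sketch, assuming Stata's built-in chi2tail() and invchi2tail() functions; the scalar names qstat and p are illustrative and not part of the original program:

```stata
* Recompute Q from the example, then get the 5% critical value and p-value.
scalar n = 25
matrix mu = (17 \ 10)
matrix xbar = (15.4 \ 9.9)
matrix sigma = (16, 8 \ 8, 9)
matrix x = xbar - mu
matrix Q = n * x'*syminv(sigma)*x

* qstat and p are illustrative names; p is the number of variables
scalar qstat = el(Q,1,1)
scalar p = rowsof(sigma)
display "Q = " qstat
display "critical value (.05) = " invchi2tail(p, 0.05)
display "p-value = " chi2tail(p, qstat)
```

Q works out to 6.45, which exceeds the 5% critical value of about 5.99, so the hypothesis is rejected at the 5% level.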

    Discuss

  • The true meaning of statistical significance.
  • The difference between multivariate tests and multiple univariate tests.

    Covariance Matrix Unknown

    Univariate
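
When the population variance is unknown, the usual univariate statistic is the one-sample t (notation assumed, with s the sample standard deviation):

```latex
t = \frac{\bar{X} - \mu_0}{s / \sqrt{n}}, \qquad df = n - 1
```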

    Multivariate
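
The multivariate counterpart is Hotelling's T-squared. The form below is inferred from the Stata matrix program in this section, where S is the deviation SSCP matrix (hence the n(n - 1) multiplier) and p is the number of variables:

```latex
T^2 = n(n-1)\,(\bar{\mathbf{X}} - \boldsymbol{\mu}_0)'\,\mathbf{S}^{-1}\,(\bar{\mathbf{X}} - \boldsymbol{\mu}_0),
\qquad
F = \frac{n - p}{(n-1)\,p}\,T^2 \sim F_{p,\;n-p}
```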

    Single-Sample with Unknown Population Covariance Matrix

  • The centroid and SSCP matrix for a sample of 22 observations from a bivariate normal population were Xbar' = [32.6 33.5] and S = [47.25 42.02, 42.02 111.09].

  • Test the hypothesis that μ' = [31 32] at the 1% level of significance.

    Stata Matrix Program

    scalar n = 22
    matrix mu = (31 \ 32)  
    matrix xbar = (32.6 \ 33.5)  
    matrix s = (47.25, 42.02 \ 42.02, 111.09)
    scalar rows = rowsof(s)
    matrix x = xbar - mu
    matrix list mu  
    matrix list xbar  
    matrix list x  
    matrix list s
    
    scalar c = n * (n - 1)
    matrix T2 = c * x'*syminv(s)*x
    display "T-squared = "  el(T2,1,1)
    
    scalar df2 = n - rows
    scalar c = df2/((n - 1)*rows)
    matrix F = c * T2
    display "F = " el(F,1,1) 
    display "p = " rows "   df2 = " df2
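
The F statistic can then be compared with the F distribution on p and n - p degrees of freedom. A minimal sketch, assuming Stata's built-in Ftail() and invFtail() functions; the scalar names p and fstat are illustrative:

```stata
* Recompute T-squared and F, then get the 1% critical value and p-value.
scalar n = 22
matrix mu = (31 \ 32)
matrix xbar = (32.6 \ 33.5)
matrix s = (47.25, 42.02 \ 42.02, 111.09)
scalar p = rowsof(s)
matrix x = xbar - mu
matrix T2 = n*(n - 1) * x'*syminv(s)*x

scalar df2 = n - p
* fstat is an illustrative name for the F statistic
scalar fstat = df2/((n - 1)*p) * el(T2,1,1)
display "F = " fstat
display "critical F (.01) = " invFtail(p, df2, 0.01)
display "p-value = " Ftail(p, df2, fstat)
```

F comes to about 11.94 on 2 and 20 degrees of freedom, above the 1% critical value of about 5.85, so the hypothesis is rejected at the 1% level.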

    Stata Example

    To do the single-sample Hotelling's T² in Stata, we first need to create variables that contain the hypothesized population means and then create difference variables. In this example, the hypothesized population mean is the same for both read and write.

    use http://www.philender.com/courses/data/hsb2, clear
    
    summarize read write
    
        Variable |       Obs        Mean    Std. Dev.       Min        Max
    -------------+--------------------------------------------------------
            read |       200       52.23    10.25294         28         76
           write |       200      52.775    9.478586         31         67
    
    /*  test against a population mean value vector [50, 50]  */
    
    generate mean=50
    generate dif1 = read-mean
    generate dif2 = write-mean
    
    hotel dif1 dif2
    
        Variable |       Obs        Mean    Std. Dev.       Min        Max
    -------------+--------------------------------------------------------
            dif1 |       200        2.23    10.25294        -22         26
            dif2 |       200       2.775    9.478586        -19         17
    
    1-group Hotelling's T-squared = 17.710866
    F test statistic: ((200-2)/(200-1)(2)) x 17.710866 = 8.8109335
    
    H0: Vector of means is equal to a vector of zeros
                  F(2,198) =    8.8109
           Prob > F(2,198) =    0.0002

    Dependent t

    Univariate
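
With d denoting the paired difference scores, the dependent-samples t is usually written as follows (notation assumed, with s_d the standard deviation of the differences):

```latex
t = \frac{\bar{d}}{s_d / \sqrt{n}}
```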

    with n-1 degrees of freedom.

    Multivariate
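
For p paired measures the same idea gives a one-sample T-squared on the vector of mean differences. This form is inferred from the Stata matrix program for the dependent example, where S_d is the deviation SSCP matrix of the difference scores:

```latex
T^2 = n(n-1)\,\bar{\mathbf{d}}'\,\mathbf{S}_d^{-1}\,\bar{\mathbf{d}},
\qquad
F = \frac{n - p}{(n-1)\,p}\,T^2
```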

    with df = p and n - p

    Dependent Example

  • A researcher at a school for the deaf gave several motor skills tests to resident students, and also tested a group of hearing children, paired child for child on the basis of sex, age, and height with the deaf children. Scores for 10 deaf girls and their hearing counterparts on a test of grip (X1) and a test of balance (X2) are given below. Test the significance of the difference between the centroids of the deaf and the hearing groups at α = .01.

    X1 D 25  22  28  35   37  48  49  54  65  57
       H 26  22  29  39   34  51  42  54  77  68
       
    X2 D 2.0 2.0 2.7 2.7  3.0 1.7 2.0 2.0 2.7 1.0
       H 2.3 1.0 3.7 3.3 10.0 4.3 4.7 7.0 3.3 1.7 
       
    The difference scores (H - D) for each pair on the two variables are:
    d1   1    0    1    4   -3    3   -7    0   12   11
    d2  0.3 -1.0  1.0  0.6  7.0  2.6  2.7  5.0  0.6  0.7
      
      
    Yielding dbar' = [2.2 1.95] and S = [301.6 -56.4, -56.4  53.325]
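
These summary values can be reproduced from the difference scores. A sketch, assuming the matrix accum command with its deviations and noconstant options; the variable names d1 and d2 and the scalar names m1 and m2 are illustrative:

```stata
clear
input d1 d2
 1  0.3
 0 -1.0
 1  1.0
 4  0.6
-3  7.0
 3  2.6
-7  2.7
 0  5.0
12  0.6
11  0.7
end

* deviation SSCP matrix of the difference scores
matrix accum S = d1 d2, deviations noconstant
matrix list S

* mean difference vector
quietly summarize d1
scalar m1 = r(mean)
quietly summarize d2
scalar m2 = r(mean)
matrix dbar = (m1 \ m2)
matrix list dbar
```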
    

    Stata Matrix Program

    scalar n = 10 
    scalar p = 2
    matrix dbar = (2.2 \ 1.95) 
    matrix list dbar
    
    matrix s = (301.6, -56.4 \ -56.4, 53.325) 
    matrix list s
    
    matrix t2 = n*(n-1)*dbar'*(syminv(s))*dbar 
    display "T-squared = " el(t2,1,1)
    
    scalar df2 = n-p
    matrix f = df2/((n-1)*p)*t2  
    display "F = " el(f,1,1) 
    
    display "p = " p "  df2 = " df2

    Stata Example

    To do the dependent-sample Hotelling's T² in Stata, we once again need to create difference variables. In this example, x1 is the difference between the deaf and the hearing scores for grip and x2 is the difference for balance. These are D - H differences, the reverse of the H - D scores used above; the sign change does not affect T².

    input x1d   x1h   x2d   x2h
      25    26     2   2.3 
      22    22     2     1 
      28    29   2.7   3.7 
      35    39   2.7   3.3 
      37    34     3    10 
      48    51   1.7   4.3 
      49    42     2   4.7 
      54    54     2     7 
      65    77   2.7   3.3 
      57    68     1   1.7
    end
    
    gen x1 = x1d-x1h
    gen x2 = x2d-x2h
    
    hotel x1 x2
    
        Variable |       Obs        Mean    Std. Dev.       Min        Max
    -------------+--------------------------------------------------------
              x1 |        10        -2.2    5.788878        -12          7
              x2 |        10       -1.95    2.434132         -7          1
    
    1-group Hotelling's T-squared = 13.176046
    F test statistic: ((10-2)/(10-1)(2)) x 13.176046 = 5.8560205
    
    H0: Vector of means is equal to a vector of zeros
                  F(2,8) =    5.8560
           Prob > F(2,8) =    0.0271

    Two-Sample Problems

    Univariate
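
For two independent groups the univariate statistic is the pooled-variance t. The notation is assumed, with SS1 and SS2 the within-group sums of squares:

```latex
t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{s_p^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}},
\qquad
s_p^2 = \frac{SS_1 + SS_2}{n_1 + n_2 - 2}
```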

    with df = n1 + n2 - 2

    Rewriting t yields
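
One common rewriting squares t so that it takes the same quadratic form as the multivariate statistic. This is offered as a plausible reconstruction under the notation assumed here, not as a quotation:

```latex
t^2 = \frac{n_1 n_2}{n_1 + n_2}\,(\bar{X}_1 - \bar{X}_2)\,(s_p^2)^{-1}\,(\bar{X}_1 - \bar{X}_2)
```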

    Multivariate

  • S1 and S2 are the Deviation SSCPs for each of the two groups.

  • Then let W = S1 + S2, the pooled within-group SSCP.
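
With W defined this way, the two-sample statistic can be written as below. The form is inferred from the Stata matrix program for the two-group example, where W is an SSCP matrix (hence the n1 + n2 - 2 factor) and p is the number of variables:

```latex
T^2 = \frac{n_1 n_2\,(n_1 + n_2 - 2)}{n_1 + n_2}\,
      (\bar{\mathbf{X}}_1 - \bar{\mathbf{X}}_2)'\,\mathbf{W}^{-1}\,(\bar{\mathbf{X}}_1 - \bar{\mathbf{X}}_2),
\qquad
F = \frac{n_1 + n_2 - p - 1}{(n_1 + n_2 - 2)\,p}\,T^2 \sim F_{p,\;n_1+n_2-p-1}
```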

    Two-Group Example

  • Suppose that two treatment groups, in an experiment using the randomized-group design, were measured on two criterion variables X1 and X2, and that the group centroids were Xbar1' = [14.2 9.0] and Xbar2' = [12.8 16.2] with pooled within-group SSCP matrix W = [567.6 215.2, 215.2 96.8]. The program below takes n1 = n2 = 5.

  • Would you conclude that the two groups were significantly different at α = .01?

    Stata Matrix Program

    
    scalar n1 = 5 
    scalar n2 = 5
    matrix xb1 = (14.2 \ 9.0)   
    matrix xb2 = (12.8 \ 16.2)
    matrix x = xb1 - xb2
    matrix w = (567.6, 215.2 \ 215.2, 96.8)
    scalar p = rowsof(w)
    scalar c = (n1 * n2 * (n1 + n2 -2))/(n1 + n2)
    
    matrix T2 = c * x'*syminv(w)*x
    display "T-squared = " el(T2,1,1)
    
    scalar df2 = n1 + n2 - p - 1
    scalar c = df2/((n1 + n2 - 2)*p)
    matrix F = c * T2   
    display "F = " el(F,1,1)
    
    display "degrees of freedom = " p " and " df2
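
To answer the question posed in the example, the F statistic needs a critical value or p-value. A minimal sketch along the same lines as the program above, assuming Stata's built-in Ftail() and invFtail() functions; the scalar names c1 and fstat are illustrative:

```stata
* Recompute the two-group T-squared and F, then test at alpha = .01.
scalar n1 = 5
scalar n2 = 5
matrix xb1 = (14.2 \ 9.0)
matrix xb2 = (12.8 \ 16.2)
matrix x = xb1 - xb2
matrix w = (567.6, 215.2 \ 215.2, 96.8)
scalar p = rowsof(w)

* c1 and fstat are illustrative names
scalar c1 = (n1 * n2 * (n1 + n2 - 2))/(n1 + n2)
matrix T2 = c1 * x'*syminv(w)*x
scalar df2 = n1 + n2 - p - 1
scalar fstat = df2/((n1 + n2 - 2)*p) * el(T2,1,1)

display "F = " fstat
display "critical F (.01) = " invFtail(p, df2, 0.01)
display "p-value = " Ftail(p, df2, fstat)
```

F comes to about 34.41 on 2 and 7 degrees of freedom, above the 1% critical value of about 9.55, so the two centroids differ significantly at α = .01.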

    Stata Example

    input y1 y2 y3 group
    1.21 .61 .70 1
     .92 .43 .71 1
     .80 .35 .71 1
     .85 .48 .68 1
     .98 .42 .71 1
    1.15 .52 .72 1
    1.10 .50 .75 1
    1.02 .53 .70 1
    1.18 .45 .70 1
    1.09 .40 .69 1
    1.40 .50 .71 2
    1.17 .39 .69 2
    1.23 .44 .70 2
    1.19 .37 .72 2
    1.38 .42 .71 2
    1.17 .45 .70 2
    1.31 .41 .70 2
    1.30 .47 .67 2
    1.22 .29 .68 2
    1.00 .30 .70 2
    1.12 .27 .72 2
    1.09 .35 .73 2
    end
    
    tabstat y1 y2 y3, by(group) stat(mean sd)
    
    Summary statistics: mean, sd
      by categories of: group 
    
       group |        y1        y2        y3
    ---------+------------------------------
           1 |      1.03      .469      .707
             |  .1405544  .0748999  .0188856
    ---------+------------------------------
           2 |     1.215  .3883333     .7025
             |  .1181293  .0740802  .0171226
    ---------+------------------------------
       Total |  1.130909      .425  .7045455
             |  .1570535  .0834808  .0176547
    ----------------------------------------
    
    hotel y1 y2 y3, by(group)
    
    -> group=        1  
    Variable |     Obs        Mean   Std. Dev.       Min        Max
    ---------+-----------------------------------------------------
          y1 |      10        1.03   .1405544         .8       1.21  
          y2 |      10        .469   .0748999        .35        .61  
          y3 |      10        .707   .0188856        .68        .75  
    
    -> group=        2  
    Variable |     Obs        Mean   Std. Dev.       Min        Max
    ---------+-----------------------------------------------------
          y1 |      12       1.215   .1181293          1        1.4  
          y2 |      12    .3883333   .0740802        .27         .5  
          y3 |      12       .7025   .0171226        .67        .73  
    
    
    2-group Hotelling's T-squared = 52.342102
    F test statistic: ((22-3-1)/(22-2)(3)) x 52.342102 = 15.702631
    
    H0: Vectors of means are equal for the two groups
                  F(3,18) =   15.7026
             Pr > F(3,18) =    0.0000
    
    /* using manova */
    manova y1 y2 y3 = group
    
                               Number of obs =      22
    
                               W = Wilks' lambda      L = Lawley-Hotelling trace
                               P = Pillai's trace     R = Roy's largest root
    
                      Source |  Statistic     df   F(df1,    df2) =   F   Prob>F
                  -----------+--------------------------------------------------
                       group | W   0.2765      1     3.0    18.0    15.70 0.0000 e
                             | P   0.7235            3.0    18.0    15.70 0.0000 e
                             | L   2.6171            3.0    18.0    15.70 0.0000 e
                             | R   2.6171            3.0    18.0    15.70 0.0000 e
                             |--------------------------------------------------
                    Residual |                20
                  -----------+--------------------------------------------------
                       Total |                21
                  --------------------------------------------------------------
                               e = exact, a = approximate, u = upper bound on F
    
    /* using mvreg */
    xi: mvreg y1 y2 y3 = i.group
    i.group           _Igroup_1-2         (naturally coded; _Igroup_1 omitted)
    
    Equation          Obs  Parms        RMSE    "R-sq"          F        P
    ----------------------------------------------------------------------
    y1                 22      2    .1287051    0.3604   11.26965   0.0031
    y2                 22      2    .0744502    0.2425   6.403464   0.0199
    y3                 22      2    .0179374    0.0169   .3432919   0.5645
    
    ------------------------------------------------------------------------------
                 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
    y1           |
       _Igroup_2 |       .185   .0551082     3.36   0.003     .0700462    .2999537
           _cons |       1.03   .0407001    25.31   0.000      .945101    1.114899
    -------------+----------------------------------------------------------------
    y2           |
       _Igroup_2 |  -.0806667   .0318777    -2.53   0.020    -.1471623    -.014171
           _cons |       .469   .0235432    19.92   0.000     .4198897    .5181103
    -------------+----------------------------------------------------------------
    y3           |
       _Igroup_2 |     -.0045   .0076803    -0.59   0.564    -.0205209    .0115209
           _cons |       .707   .0056723   124.64   0.000     .6951678    .7188322
    ------------------------------------------------------------------------------
    
    mvtest _Igroup_2  /* findit mvtest */
    
                                         MULTIVARIATE TESTS OF SIGNIFICANCE
    
    
    Multivariate Test Criteria and Exact F Statistics for
    the Hypothesis of no Overall "_Igroup_2" Effect(s)
    
                                                 S=1    M=.5    N=8
    
    Test                          Value          F       Num DF     Den DF   Pr > F
    Wilks' Lambda              0.27646418    15.7026          3    18.0000   0.0000
    Pillai's Trace             0.72353582    15.7026          3    18.0000   0.0000
    Hotelling-Lawley Trace     2.61710509    15.7026          3    18.0000   0.0000

    /* using regress with group as the response variable */

    regress group y1 y2 y3
    
          Source |       SS       df       MS              Number of obs =      22
    -------------+------------------------------           F(  3,    18) =   15.70
           Model |  3.94655901     3  1.31551967           Prob > F      =  0.0000
        Residual |  1.50798645    18  .083777025           R-squared     =  0.7235
    -------------+------------------------------           Adj R-squared =  0.6775
           Total |  5.45454545    21   .25974026           Root MSE      =  .28944
    
    ------------------------------------------------------------------------------
           group |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
              y1 |   2.246938   .4087437     5.50   0.000     1.388199    3.105677
              y2 |  -3.691243    .766785    -4.81   0.000    -5.302199   -2.080288
              y3 |  -2.242679   3.588098    -0.63   0.540    -9.780994    5.295636
           _cons |   2.153219   2.612195     0.82   0.421    -3.334799    7.641237
    ------------------------------------------------------------------------------
    
    /* using canonical correlation analysis */
    
    canon (y1 y2 y3)(group)
    
    Linear combinations for canonical correlations         Number of obs =      22
    ------------------------------------------------------------------------------
                 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
    u1           |
              y1 |   5.183122   .9428692     5.50   0.000     3.222318    7.143926
              y2 |  -8.514772   1.768781    -4.81   0.000    -12.19315   -4.836391
              y3 |  -5.173297   8.276842    -0.63   0.539    -22.38593    12.03934
    -------------+----------------------------------------------------------------
    v1           |
           group |   1.962142   .2712094     7.23   0.000     1.398131    2.526153
    ------------------------------------------------------------------------------
                                         (Standard errors estimated conditionally)
    Canonical correlations:
      0.8506
    
    ----------------------------------------------------------------------------
    Tests of significance of all canonical correlations
    
                             Statistic      df1      df2            F     Prob>F
             Wilks' lambda     .276464        3       18      15.7026     0.0000 e
            Pillai's trace     .723536        3       18      15.7026     0.0000 e
    Lawley-Hotelling trace     2.61711        3       18      15.7026     0.0000 e
        Roy's largest root     2.61711        3       18      15.7026     0.0000 e
    ----------------------------------------------------------------------------
                                e = exact, a = approximate, u = upper bound on F
                             
    /* now the other way around */
    
    canon (group)(y1 y2 y3)
    
    Linear combinations for canonical correlations         Number of obs =      22
    ------------------------------------------------------------------------------
                 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
    u1           |
           group |   1.962142   .2712094     7.23   0.000     1.398131    2.526153
    -------------+----------------------------------------------------------------
    v1           |
              y1 |   5.183122   .9428692     5.50   0.000     3.222318    7.143926
              y2 |  -8.514772   1.768781    -4.81   0.000    -12.19315   -4.836391
              y3 |  -5.173297   8.276842    -0.63   0.539    -22.38593    12.03934
    ------------------------------------------------------------------------------
                                         (Standard errors estimated conditionally)
    Canonical correlations:
      0.8506
    
    ----------------------------------------------------------------------------
    Tests of significance of all canonical correlations
    
                             Statistic      df1      df2            F     Prob>F
             Wilks' lambda     .276464        3       18      15.7026     0.0000 e
            Pillai's trace     .723536        3       18      15.7026     0.0000 e
    Lawley-Hotelling trace     2.61711        3       18      15.7026     0.0000 e
        Roy's largest root     2.61711        3       18      15.7026     0.0000 e
    ----------------------------------------------------------------------------
                                e = exact, a = approximate, u = upper bound on F
    
    /* using linear discriminant analysis in Stata 10 */
    
    candisc y1 y2 y3, group(group)
    
    Canonical linear discriminant analysis
    
          |                                 | Like- 
          | Canon.   Eigen-     Variance    | lihood
      Fcn | Corr.    value   Prop.   Cumul. | Ratio     F      df1    df2  Prob>F
      ----+---------------------------------+------------------------------------
        1 | 0.8506  2.61711  1.0000  1.0000 | 0.2765  15.703     3     18  0.0000 e
      ---------------------------------------------------------------------------
      Ho: this and smaller canon. corr. are zero;                     e = exact F
    
       
    [ ...output omitted... ]
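
With only two groups, the four MANOVA criteria, the squared canonical correlation, and the R-squared from regressing group on the outcomes are all functions of Hotelling's T-squared, which is why every analysis above reports the same F. A short sketch of the arithmetic, using the T-squared value from the hotel output; the scalar names are illustrative:

```stata
* Two-group identities; nu = n1 + n2 - 2 is the error degrees of freedom.
scalar t2 = 52.342102
scalar nu = 22 - 2

* Lawley-Hotelling trace (equal to Roy's largest root when there is one root)
scalar lh = t2/nu
* Wilks' lambda
scalar wilks = 1/(1 + lh)
* Pillai's trace = squared canonical correlation = R-squared from regress
scalar pillai = 1 - wilks

display "Lawley-Hotelling trace = " lh
display "Wilks' lambda          = " wilks
display "Pillai's trace         = " pillai
display "canonical correlation  = " sqrt(pillai)
```

The displayed values agree with the manova, mvtest, regress, canon, and candisc output to the precision shown there.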



    Phil Ender, 17jul07, 18oct05, 28feb05, 6feb05, 29Jan98