Introduction to Research Design and Statistics

Measurement and Testing

Test Reliability

Test reliability refers to the consistency of measurement: the extent to which results are similar across different forms of the same instrument or different occasions of test administration.

Test Validity

Test validity is the extent to which a test measures what it purports to measure. More broadly, it is the extent to which inferences made on the basis of test scores are appropriate, meaningful, and useful.

The Measurement Model

x = t + e

Where
x is the obtained score,
t is the true score, and
e is the error score (measurement error).

Measurement Model Variances

σ_x² = σ_t² + σ_e²

Where
σ_x² is the variance of the obtained score,
σ_t² is the variance of the true score, and
σ_e² is the variance of the errors.
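
The relationship among these three variances (obtained-score variance equals true-score variance plus error variance) can be checked numerically. The following is a minimal Python sketch, not part of the original course materials. The true scores and errors are made-up values, chosen so that the errors are exactly uncorrelated with the true scores, which is the assumption under which the decomposition holds.

```python
from statistics import pvariance

# Hypothetical true scores and errors for four examinees (illustrative only).
# The errors are chosen to be exactly uncorrelated with the true scores.
true_scores = [10, 10, 20, 20]
errors = [-1, 1, -1, 1]

# Obtained score = true score + error (x = t + e).
obtained = [t + e for t, e in zip(true_scores, errors)]

var_t = pvariance(true_scores)  # variance of the true scores
var_e = pvariance(errors)       # variance of the errors
var_x = pvariance(obtained)     # variance of the obtained scores

# The two printed values agree: var_x equals var_t + var_e.
print(var_x, var_t + var_e)
```

With real data the true scores are never observed, so the decomposition is a model assumption rather than something that can be computed directly.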

Standard Error of Measurement

The standard error of measurement is the square root of the variance of the errors:

σ_e = √(σ_e²)

When the standard error of measurement is obtained from sample data it is often written as s_m or s_e.

Definition of Reliability

Reliability is defined as the ratio of the true score variance to the observed score variance:

reliability = σ_t² / σ_x²
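
As a rough illustration of this definition, the Python sketch below computes reliability and the standard error of measurement from assumed variances. The function names and the numeric values are hypothetical and serve only to show the arithmetic.

```python
import math

def reliability(var_true, var_obtained):
    """Ratio of true score variance to obtained score variance."""
    return var_true / var_obtained

def sem_from_error_variance(var_error):
    """Standard error of measurement: square root of the error variance."""
    return math.sqrt(var_error)

# Hypothetical variances (illustrative only): 75 + 25 = 100.
var_t, var_e = 75.0, 25.0
var_x = var_t + var_e

r = reliability(var_t, var_x)          # 75 / 100
sem = sem_from_error_variance(var_e)   # sqrt(25)
print(r, sem)
```

A reliability of .75 here means that three quarters of the obtained-score variance is attributed to true-score differences among examinees.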

Methods to Assess Reliability

Test-retest Reliability (.90 and above is acceptable)
Alternate Form Reliability (.85 and above is acceptable)
Internal Consistency (.95 and above is acceptable)
- Split-half Reliability
- Kuder-Richardson Reliability
- Cronbach Alpha
Note: Scales used for research purposes can have much lower reliability; values as low as .70 are acceptable.
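
To make the internal consistency idea concrete, here is a minimal Python sketch of Cronbach's alpha using the usual formula, alpha = k/(k-1) * (1 - sum of item variances / variance of total scores). The function name and both small data sets are made up for illustration; this is not the code behind Stata's alpha command.

```python
from statistics import variance

def cronbach_alpha(rows):
    """Cronbach's alpha for a list of examinees, each a list of item scores."""
    k = len(rows[0])                               # number of items
    items = list(zip(*rows))                       # transpose: one tuple per item
    item_vars = [variance(col) for col in items]   # sample variance of each item
    total_var = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical data (illustrative only).
consistent = [[1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4]]  # items agree perfectly
unrelated = [[1, 0], [0, 1], [1, 1], [0, 0]]               # items are uncorrelated

print(cronbach_alpha(consistent))  # close to 1
print(cronbach_alpha(unrelated))   # close to 0
```

The sketch assumes complete data and a total-score variance greater than zero; it does not handle missing responses or reversed items.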

Stata Example

use http://www.philender.com/courses/data/alpha, clear

format id - i10 %4.0f

list, nodisplay noobs

  id    i1    i2    i3    i4    i5    i6    i7    i8    i9   i10
   1     2     2     2     2     1     2     2     2     2     1
   2     2     2     2     2     2     2     2     2     2     1
   3     2     2     2     2     2     2     2     2     2     1
[some output omitted]
  33     2     2     1     1     2     1     2     2     2     1
  34     2     2     2     2     1     0     0     2     0     0
  35     2     2     2     2     2     2     2     0     2     1

alpha i1 - i10, item

Test scale = mean(unstandardized items)

                         item-test     item-rest      interitem
Item     |  Obs  Sign   correlation   correlation    covariance       alpha
---------+--------------------------------------------------------------------
i1       |   35    +       0.3527        0.2223        .0495565      0.6073
i2       |   35    +       0.3248        0.2055        .0504902      0.6102
i3       |   35    +       0.2175        0.0683        .0534781      0.6306
i4       |   35    +       0.7369        0.6015        .0323529      0.5101
i5       |   35    +       0.7160        0.5812        .0338702      0.5195
i6       |   35    +       0.7512        0.5735        .0288515      0.5016
i7       |   35    +       0.5782        0.3820        .0392857      0.5689
i8       |   35    +       0.3226        0.0187        .0544118      0.6852
i9       |   35    +       0.4662        0.3119        .0453081      0.5899
i10      |   35    -       0.1148       -0.0281        .0562792      0.6432
---------+--------------------------------------------------------------------
Test                                                   .0443884      0.6185
---------+--------------------------------------------------------------------
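
The item-rest correlation column relates each item to the scale formed from the remaining items, so that an item does not inflate its own correlation. The Python sketch below illustrates the idea using the sum of the remaining items and made-up responses. It is an approximation for teaching purposes; Stata's alpha command may differ in details such as the handling of reversed items and missing values.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def item_rest_correlations(rows):
    """Correlate each item with the sum of the other items."""
    k = len(rows[0])
    result = []
    for j in range(k):
        item = [row[j] for row in rows]
        rest = [sum(row) - row[j] for row in rows]  # total without item j
        result.append(pearson(item, rest))
    return result

# Hypothetical responses (illustrative only): 5 examinees, 3 items scored 0-2.
rows = [[2, 2, 1], [2, 1, 2], [1, 1, 1], [0, 1, 0], [2, 2, 2]]
for j, r in enumerate(item_rest_correlations(rows), start=1):
    print(f"item {j}: {r:.4f}")
```

Items with item-rest correlations near zero, such as i8 and i10 in the output above, contribute little to the scale.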

More Item Analysis

Here is an example of an item analysis for a multiple-choice test using mctest (available from ATS). The first row of the data gives the scoring key, while the second row gives the number of choices for each item, so the 16 rows listed below contain the responses of 14 examinees.

use http://www.philender.com/courses/data/items, clear

list

           i1        i2        i3        i4        i5        i6
  1.        4         1         1         4         1         3
  2.        4         4         4         4         4         4
  3.        1         2         1         2         1         3
  4.        4         2         4         1         1         2
  5.        4         2         1         4         1         3
  6.        1         4         4         4         3         3
  7.        2         4         4         2         1         1
  8.        4         2         1         1         4         3
  9.        4         2         1         4         1         3
 10.        1         3         1         1         4         1
 11.        4         3         4         4         1         2
 12.        4         2         1         4         1         3
 13.        4         4         4         3         1         3
 14.        4         1         4         4         3         3
 15.        4         2         1         4         1         3
 16.        4         2         1         4         1         3
 
mctest i1-i6, gen(score) delete

Multiple Choice Item Statistics
Number of items: 6  Number of observations: 14
(Note: point biserials computed with item deleted)
----------------------------------------------------------------------
          Prop    Disc    Point        Prop       Proportion     Point
Item     Correct  Index   Biser   Alt  Total   Low   Mid   High  Biser
----------------------------------------------------------------------
i1       0.71     0.75    0.48     1   0.21    0.50  0.14  0.00 -0.29
                                   2   0.07    0.25  0.00  0.00 -0.39
                                   3   0.00    0.00  0.00  0.00  0.00
                                   4   0.71*   0.25  0.57  1.00  0.48
                                   .   0.00    0.00  0.00  0.00
                                 Other 0.00    0.00  0.00  0.00

i2       0.07     0.00   -0.06     1   0.07*   0.00  0.14  0.00 -0.06
                                   2   0.57    0.25  0.29  1.00  0.68
                                   3   0.14    0.25  0.14  0.00 -0.37
                                   4   0.21    0.50  0.14  0.00 -0.47
                                   .   0.00    0.00  0.00  0.00
                                 Other 0.00    0.00  0.00  0.00

i3       0.57     0.75    0.20     1   0.57*   0.25  0.29  1.00  0.20
                                   2   0.00    0.00  0.00  0.00  0.00
                                   3   0.00    0.00  0.00  0.00  0.00
                                   4   0.43    0.75  0.43  0.00 -0.20
                                   .   0.00    0.00  0.00  0.00
                                 Other 0.00    0.00  0.00  0.00

i4       0.57     0.75    0.47     1   0.21    0.50  0.14  0.00 -0.36
                                   2   0.14    0.25  0.14  0.00 -0.28
                                   3   0.07    0.00  0.14  0.00  0.05
                                   4   0.57*   0.25  0.29  1.00  0.47
                                   .   0.00    0.00  0.00  0.00
                                 Other 0.00    0.00  0.00  0.00

i5       0.71     0.50    0.07     1   0.71*   0.50  0.43  1.00  0.07
                                   2   0.00    0.00  0.00  0.00  0.00
                                   3   0.14    0.25  0.14  0.00  0.11
                                   4   0.14    0.25  0.14  0.00 -0.20
                                   .   0.00    0.00  0.00  0.00
                                 Other 0.00    0.00  0.00  0.00

i6       0.71     0.75    0.48     1   0.14    0.50  0.00  0.00 -0.57
                                   2   0.14    0.25  0.14  0.00 -0.05
                                   3   0.71*   0.25  0.57  1.00  0.48
                                   4   0.00    0.00  0.00  0.00  0.00
                                   .   0.00    0.00  0.00  0.00
                                 Other 0.00    0.00  0.00  0.00
                                 
univar score

                                   -------------- Quantiles ---------------
Variable     n    Mean    S.D.     Min      .25      Mdn      .75      Max
---------------------------------------------------------------------------
   score    14    3.36    1.50     1.00     2.00     3.00     5.00     5.00
---------------------------------------------------------------------------
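
Two of the statistics reported above, the proportion correct for each item and the summary of the total scores, can be reproduced by hand from the listed data. The Python sketch below does this as a cross-check; it is not how mctest is implemented, and it makes no attempt at the discrimination index or point biserials.

```python
from statistics import mean, stdev

# First row of the listing: the scoring key. (The second row, the number of
# choices per item, is not needed for these two statistics.)
key = [4, 1, 1, 4, 1, 3]

# Rows 3-16 of the listing: responses of the 14 examinees to items i1-i6.
responses = [
    [1, 2, 1, 2, 1, 3],
    [4, 2, 4, 1, 1, 2],
    [4, 2, 1, 4, 1, 3],
    [1, 4, 4, 4, 3, 3],
    [2, 4, 4, 2, 1, 1],
    [4, 2, 1, 1, 4, 3],
    [4, 2, 1, 4, 1, 3],
    [1, 3, 1, 1, 4, 1],
    [4, 3, 4, 4, 1, 2],
    [4, 2, 1, 4, 1, 3],
    [4, 4, 4, 3, 1, 3],
    [4, 1, 4, 4, 3, 3],
    [4, 2, 1, 4, 1, 3],
    [4, 2, 1, 4, 1, 3],
]

# Score each response 1 if it matches the key, else 0.
scored = [[int(r == k) for r, k in zip(row, key)] for row in responses]

# Proportion correct per item and total score per examinee.
prop_correct = [round(mean(col), 2) for col in zip(*scored)]
scores = [sum(row) for row in scored]

print(prop_correct)  # matches the Prop Correct column
print(round(mean(scores), 2), round(stdev(scores), 2))  # matches univar score
```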

Standard Error of Measurement Revisited

In practice, the standard error of measurement is estimated from the standard deviation of the obtained scores and the estimated reliability:

s_m = s_x * √(1 - r)

Where
s_m = s_e is the standard error of measurement
s_x is the standard deviation of the obtained scores
r is the estimated reliability.
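
A minimal Python sketch of this computation follows. It assumes the usual relationship, standard error of measurement = standard deviation of obtained scores times the square root of one minus the reliability; the function name and the input values are hypothetical.

```python
import math

def standard_error_of_measurement(s_x, r):
    """Estimate s_m from the obtained-score SD (s_x) and the reliability (r)."""
    if not 0.0 <= r <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    return s_x * math.sqrt(1.0 - r)

# Hypothetical values (illustrative only): SD of 10 and reliability of .84.
s_m = standard_error_of_measurement(10.0, 0.84)
print(round(s_m, 2))
```

As the reliability approaches 1 the standard error of measurement shrinks toward 0, and as it approaches 0 the standard error of measurement approaches the standard deviation of the obtained scores.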

Types of Validity

Face Validity
Content Validity
Criterion Validity
- Concurrent Validity
- Predictive Validity
Construct Validity

Types of Tests

Norm Referenced Tests
Criterion Referenced Tests

Describing Standardized Test Performance

Percentile Rank, Percentiles (PR, %, %tile)
Standard Scores (SS)
- mean = 0, standard deviation = 1, this is what statisticians mean by standard score
- mean = 50, standard deviation = 10, sometimes called a T-score
- mean = 100, standard deviation = 15
Normal Curve Equivalents (NCE)
Stanines (Sta9)
Grade Equivalents (GE)

Sta9	1	2	3	4	5	6	7	8	9
% of cases	4% (lowest)	7%	12%	17%	20%	17%	12%	7%	4% (highest)

Table of Some Standardized Test Scores

SS	-2.326	-1.645	-1.28	-0.84	-0.67	-0.52	-0.25	0.0	+0.25	+0.52	+0.67	+0.84	+1.28	+1.645	+2.326
T	26.74	33.55	37.18	41.59	43.26	44.8	47.5	50	52.5	55.2	56.74	58.41	62.82	66.45	73.26
PR	1	5	10	20	25	30	40	50	60	70	75	80	90	95	99
NCE	1	15.4	23	32.3	35.8	38.9	44.7	50	55.3	61	64.2	67.7	77	84.6	99
Sta9	1	2	2	3	4	4	5	5	5	6	6	7	8	8	9
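
The rows of this table are linked by simple conversions from the standard score, assuming normally distributed scores. The Python sketch below shows one way to compute them. The T-score scaling and the stanine percentages come from the text above; the NCE scaling of 21.06 per standard deviation is the conventional constant and is not stated in these notes, and assigning a percentile rank that falls exactly on a stanine boundary to the higher stanine is an arbitrary choice made here.

```python
from statistics import NormalDist

# Upper percentile-rank boundaries of stanines 1-8, obtained by accumulating
# the percentages 4, 7, 12, 17, 20, 17, 12, 7 (stanine 9 takes the rest).
STANINE_CUTS = [4, 11, 23, 40, 60, 77, 89, 96]

def convert(z):
    """Convert a standard score (z) to T, percentile rank, NCE, and stanine."""
    pr = NormalDist().cdf(z) * 100       # percent of cases below z
    t = 50 + 10 * z                      # mean 50, SD 10
    nce = 50 + 21.06 * z                 # conventional NCE scaling (assumption)
    stanine = 1 + sum(pr >= cut for cut in STANINE_CUTS)
    return {"T": round(t, 2), "PR": round(pr), "NCE": round(nce, 1), "Sta9": stanine}

for z in (-1.645, 0.0, 1.28):
    print(z, convert(z))
```

Small differences from the table, such as a T-score of 62.8 rather than 62.82 for SS = +1.28, arise because the table appears to use standard scores carried to more decimal places than it displays.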

Band Interpretation

Recall the formula for a confidence interval around a sample mean: Xbar ± CV_z * s_Xbar.

We can do the same thing in the area of measurement with the formula

obtained score ± CV_z*s_m

where
CV_z is the critical value obtained from the standard normal distribution
s_m is the standard error of measurement.

If you use CV_z = 1.96 then you would create a 95% confidence band. The correct interpretation of this confidence band would be that the true score will be found in 95% of all such confidence bands, which is very close to saying that there is a probability of .95 that the true score for this student falls within this confidence band.

Example:

Test 1: obtained score = 67, s_m = 2.0
(63.08, 70.92) = 67 ± 1.96 * 2.0

Test 2: obtained score = 63, s_m = 3.0
(57.12, 68.88) = 63 ± 1.96 * 3.0

Test 3: obtained score = 50, s_m = 2.5
(45.1, 54.9) = 50 ± 1.96 * 2.5

The confidence bands for tests 1 and 2 overlap, so there is probably no real difference between their scores. The band for test 3, however, does not overlap with those of the other two tests and most likely represents a true difference in test scores.
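
The band computation and the overlap check in this example can be sketched in a few lines of Python. The function names are hypothetical; the scores and standard errors of measurement are the ones given for tests 1 through 3.

```python
def confidence_band(score, s_m, cv_z=1.96):
    """Return (lower, upper) = obtained score ± cv_z * s_m."""
    half_width = cv_z * s_m
    return (score - half_width, score + half_width)

def bands_overlap(a, b):
    """True if two (lower, upper) bands share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

# The three tests from the example.
test1 = confidence_band(67, 2.0)
test2 = confidence_band(63, 3.0)
test3 = confidence_band(50, 2.5)

print(bands_overlap(test1, test2))  # True
print(bands_overlap(test1, test3))  # False
print(bands_overlap(test2, test3))  # False
```

Changing cv_z changes the confidence level; for instance, 1.645 (a value that appears in the table of standard scores) would give a 90% band.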

Some Standardized Tests

Achievement Tests or Achievement Test Batteries
- California Achievement Test (CAT)
- Comprehensive Tests of Basic Skills (CTBS)
- Iowa Tests of Basic Skills (ITBS)
- Metropolitan Achievement Tests (MAT)
- Sequential Tests of Educational Progress (STEP)
- Science Research Associates Achievement Series (SRA)
- Stanford Achievement Test Series
- Tests of Achievement and Proficiency (TAP)
Aptitude Tests
Intelligence Tests
- Stanford-Binet Intelligence Scale
- Wechsler Intelligence Scale for Children
- Peabody Picture Vocabulary Test (PPVT)
- Cognitive Abilities Test (CogAT)
Personality Tests
- Adjective Checklist
- Edwards Personal Preference Schedule (EPPS)
- Minnesota Multiphasic Personality Inventory (MMPI)
- Rorschach Inkblot Technique
- Thematic Apperception Test (TAT)
Sensory-Motor Tests
Vocational Tests and Vocational Interests Tests

ED230A Measurement and Testing