Applied Categorical & Nonnormal Data Analysis

Introduction to Survival Analysis

In this unit we present an introduction to survival analysis, also known as event history analysis or time-to-event analysis in the social sciences. Analyses of this type involve the amount of time that a subject is at risk while under observation. Time itself can be measured in different ways: it can be continuous, or it can be treated as discrete. In this unit we look at continuous-time survival analysis; in the next unit we give an introduction to discrete-time survival analysis.

Two aspects of survival analysis make it interesting from a data analysis perspective:

The response variable, time to failure, is usually not normally distributed.
Survival analysis often involves censored data.

In some studies, every case is observed until failure, that is, until the specific event occurs. In other studies, the study may end before failure occurs for some subjects, or some subjects may withdraw or drop out of the study before the failure event. In either case, the subject is not under observation when the failure occurs. When a subject is not observed until failure, the observation is considered to be censored. Censoring of this kind, in which failure is known only to occur after the last observed time, is called right censoring and is common in survival analysis. At times, analyses might also include left censored data.
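To make the bookkeeping concrete, here is a minimal Python sketch of how an observed time and a failure indicator arise from right censoring. Python is used only for illustration; the three subjects, the study end at time 10, and the function name observe are all hypothetical and not part of the course data.

```python
STUDY_END = 10  # hypothetical end of the observation period

def observe(event_time, withdrawal_time, study_end=STUDY_END):
    """Return (t, failed): observed follow-up time and failure indicator."""
    # Observation stops at the earliest of failure, withdrawal, and study end.
    candidates = [study_end]
    if event_time is not None:
        candidates.append(event_time)
    if withdrawal_time is not None:
        candidates.append(withdrawal_time)
    t = min(candidates)
    # failed = 1 only when the failure itself is what ended observation.
    failed = 1 if event_time is not None and event_time == t else 0
    return t, failed

# (true failure time, withdrawal time); None means "did not happen"
subjects = {
    "A": (4, None),   # fails at 4 while under observation
    "B": (15, None),  # would fail at 15, after the study has ended
    "C": (9, 6),      # withdraws at 6, before failing at 9
}
observed = {name: observe(*times) for name, times in subjects.items()}
for name, (t, failed) in observed.items():
    print(name, t, failed)
```

Subjects B and C are right censored: all we know is that their failure times exceed 10 and 6, respectively. The pair (t, failed) is the same layout as the t and failed variables used in the examples in this unit.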

Survivorship and Hazard Functions

Let the probability density function and the cumulative distribution function be denoted as follows:

pdf -- f(t)
cdf -- F(t)

Now, we can define the survivorship function as:

S(t) = 1 - F(t)

The hazard function can next be defined as:

h(t) = f(t)/S(t)

However, it is probably easier to interpret the cumulative hazard function, H(t), which is just the integral of h(u) from 0 to t. Because h(t) = f(t)/S(t), it follows that H(t) = -ln S(t).
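The definitions above can be checked numerically. The following Python sketch assumes a Weibull distribution with arbitrarily chosen shape and scale (an illustration only, not anything estimated in this unit), builds S(t) and h(t) from f(t) and F(t), integrates h(t) with the trapezoid rule, and compares the result with -ln S(t).

```python
import math

SHAPE, SCALE = 1.5, 10.0  # assumed Weibull parameters, for illustration only

def F(t):
    """Cumulative distribution function."""
    return 1.0 - math.exp(-((t / SCALE) ** SHAPE))

def f(t):
    """Probability density function (derivative of F)."""
    return (SHAPE / SCALE) * (t / SCALE) ** (SHAPE - 1) * math.exp(-((t / SCALE) ** SHAPE))

def S(t):
    """Survivorship function."""
    return 1.0 - F(t)

def h(t):
    """Hazard function."""
    return f(t) / S(t)

def H(t, steps=10_000):
    """Cumulative hazard: integral of h from 0 to t by the trapezoid rule."""
    width = t / steps
    total = 0.5 * (h(0.0) + h(t))
    for i in range(1, steps):
        total += h(i * width)
    return total * width

for t in (1.0, 5.0, 20.0):
    print(t, round(H(t), 4), round(-math.log(S(t)), 4))
```

The two printed columns should agree up to the error of the numerical integration.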

Stata survival time commands, such as sts list and sts graph, can display results in either the survivorship or hazard metric.

Nonparametric Methods

There are two nonparametric approaches commonly used to estimate the survivor function and cumulative hazard function. The Kaplan-Meier (1958) method estimates the survivor function and the Nelson-Aalen (1972 & 1978) method estimates the cumulative hazard function. These nonparametric estimators require only an ordering of the time to failure (or censoring) and do not include the effects of covariates. We will begin our explanation of the Kaplan-Meier estimator with the following simple example.

id t failed
1  2   1
2  4   1
3  4   1
4  5   0
5  7   1
6  8   0

We can summarize the data as follows:

              n_j            d_j
at time    # at risk    # failed    # censored
  2           6             1            0
  4           5             2            0
  5           3             0            1
  7           2             1            0
  8           1             0            1

Using the formula

S(t) = product over all t_j <= t of (n_j - d_j)/n_j

we can compute a column of conditional survival probabilities, p = (n_j - d_j)/n_j, and also the running product of those probabilities.

              n_j            d_j
at time    # at risk    # failed    # censored    p     S(t)
  2           6             1            0       5/6     5/6
  4           5             2            0       3/5     1/2
  5           3             0            1        1      1/2
  7           2             1            0       1/2     1/4
  8           1             0            1        1      1/4

The last column contains the Kaplan-Meier survivor function.
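The hand computation in the table can be reproduced with a short Python sketch. It uses exact fractions so that the p and S(t) columns can be compared directly with the table; the function name kaplan_meier is just a label chosen for this illustration.

```python
from fractions import Fraction

# (t, failed) for the six subjects in the example above
data = [(2, 1), (4, 1), (4, 1), (5, 0), (7, 1), (8, 0)]

def kaplan_meier(records):
    """Return rows of (time, n_j, d_j, p_j, S), one per distinct observed time."""
    rows = []
    surv = Fraction(1)
    for t in sorted({time for time, _ in records}):
        # Subjects still under observation just before time t
        n_j = sum(1 for time, _ in records if time >= t)
        # Failures occurring exactly at time t
        d_j = sum(1 for time, failed in records if time == t and failed)
        p_j = Fraction(n_j - d_j, n_j)
        surv *= p_j  # running product of the p_j
        rows.append((t, n_j, d_j, p_j, surv))
    return rows

km_rows = kaplan_meier(data)
for t, n_j, d_j, p_j, surv in km_rows:
    print(t, n_j, d_j, p_j, surv, round(float(surv), 4))
```

Note that the estimate changes only at times where a failure occurs; a censored observation lowers the number at risk for later times but leaves S(t) unchanged at the time of censoring.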

In Stata, we would use the sts list command to obtain the Kaplan-Meier survivor function. Before we can use the sts list command we need to get the data into the proper format for survival analysis by using the stset command. Here is how it is done in Stata.

input id t failed
1  2   1
2  4   1
3  4   1
4  5   0
5  7   1
6  8   0
end
  
stset t, failure(failed)

     failure event:  failed ~= 0 & failed ~= .
obs. time interval:  (0, t]
 exit on or before:  failure

------------------------------------------------------------------------------
        6  total obs.
        0  exclusions
------------------------------------------------------------------------------
        6  obs. remaining, representing
        4  failures in single record/single failure data
       30  total analysis time at risk, at risk from t =         0
                             earliest observed entry t =         0
                                  last observed exit t =         8
  
sts list

         failure _d:  failed
   analysis time _t:  t

           Beg.          Net            Survivor      Std.
  Time    Total   Fail   Lost           Function     Error     [95% Conf. Int.]
-------------------------------------------------------------------------------
     2        6      1      0             0.8333    0.1521     0.2731    0.9747
     4        5      2      0             0.5000    0.2041     0.1109    0.8037
     5        3      0      1             0.5000    0.2041     0.1109    0.8037
     7        2      1      0             0.2500    0.2041     0.0123    0.6459
     8        1      0      1             0.2500    0.2041     0.0123    0.6459
-------------------------------------------------------------------------------
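The Std. Error and [95% Conf. Int.] columns are not derived in this unit. As a hedged sketch, the Python code below applies Greenwood's formula for the standard error and a log(-log) transformed interval to the summary table. The results agree with the sts list output above to rounding, which suggests that this is the computation behind those columns, but treat that as an inference from the numbers rather than a statement about Stata's internals.

```python
import math
from statistics import NormalDist

# (time, n_j, d_j) rows from the summary table above
table = [(2, 6, 1), (4, 5, 2), (5, 3, 0), (7, 2, 1), (8, 1, 0)]
z = NormalDist().inv_cdf(0.975)  # about 1.96 for a 95% interval

def greenwood(rows):
    """Return (time, S, se, lower, upper) for each row."""
    out = []
    surv, acc = 1.0, 0.0
    for t, n, d in rows:
        surv *= (n - d) / n
        if d > 0:
            acc += d / (n * (n - d))  # Greenwood sum; assumes d < n
        se = surv * math.sqrt(acc)
        if 0.0 < surv < 1.0:
            # Standard error of ln(-ln S), then back-transform the limits
            sigma = math.sqrt(acc) / abs(math.log(surv))
            lower = surv ** math.exp(z * sigma)
            upper = surv ** math.exp(-z * sigma)
        else:
            lower = upper = None  # interval undefined when S is 0 or 1
        out.append((t, surv, se, lower, upper))
    return out

gw_rows = greenwood(table)
for t, surv, se, lower, upper in gw_rows:
    print(t, round(surv, 4), round(se, 4), round(lower, 4), round(upper, 4))
```

Working on the log(-log) scale keeps both limits inside the interval from 0 to 1, which a symmetric interval of S(t) plus or minus 1.96 standard errors would not guarantee in a sample this small.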

You might think that it would be easy to obtain the cumulative hazard function from the Kaplan-Meier estimate using the relationship between the survivor and hazard functions (see above), but this approach has problems in small samples. It is better to use the following formula for the Nelson-Aalen estimator.

H(t) = sum over all t_j <= t of d_j/n_j

We will compute the column e_j = d_j/n_j and the column H(t), which is the running sum of the e_j.

              n_j            d_j 
at time    # at risk    # failed    # censored    e_j      H(t)
  2           6             1            0       1/6    0.1667
  4           5             2            0       2/5    0.5667
  5           3             0            1        0     0.5667
  7           2             1            0       1/2    1.0667
  8           1             0            1        0     1.0667
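The same table can be reproduced in a few lines of Python. As an illustration of the small-sample issue, the sketch also computes -ln S(t) from the Kaplan-Meier estimate so the two versions of the cumulative hazard can be compared; the function names are labels chosen for this example.

```python
import math
from fractions import Fraction

# (time, n_j, d_j) rows from the summary table above
table = [(2, 6, 1), (4, 5, 2), (5, 3, 0), (7, 2, 1), (8, 1, 0)]

def nelson_aalen(rows):
    """Return (time, H) pairs, where H is the running sum of d_j/n_j."""
    out = []
    cum = Fraction(0)
    for t, n, d in rows:
        cum += Fraction(d, n)
        out.append((t, cum))
    return out

def kaplan_meier(rows):
    """Return (time, S) pairs, where S is the running product of (n_j - d_j)/n_j."""
    out = []
    surv = Fraction(1)
    for t, n, d in rows:
        surv *= Fraction(n - d, n)
        out.append((t, surv))
    return out

na_rows = nelson_aalen(table)
km_rows = kaplan_meier(table)
for (t, cum), (_, surv) in zip(na_rows, km_rows):
    print(t, round(float(cum), 4), round(-math.log(surv), 4))
```

With only six subjects the two columns differ noticeably: at time 7 the Nelson-Aalen estimate is 16/15, about 1.0667, while -ln(1/4) is about 1.3863.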

In Stata, we use the na option with sts list.

sts list, na

         failure _d:  failed
   analysis time _t:  t

           Beg.          Net          Nelson-Aalen    Std.
  Time    Total   Fail   Lost           Cum. Haz.    Error     [95% Conf. Int.]
-------------------------------------------------------------------------------
     2        6      1      0             0.1667    0.1667     0.0235    1.1832
     4        5      2      0             0.5667    0.3283     0.1820    1.7639
     5        3      0      1             0.5667    0.3283     0.1820    1.7639
     7        2      1      0             1.0667    0.5981     0.3554    3.2015
     8        1      0      1             1.0667    0.5981     0.3554    3.2015
-------------------------------------------------------------------------------

Note: Due to the introductory nature of this unit we do not go into issues such as delayed entry, interval truncation, interval censoring, etc.

HIV Example

Here is an example using HIV data from Hosmer & Lemeshow (1999).


use http://www.gseis.ucla.edu/courses/data/hivdata
  
describe

Contains data from http://www.gseis.ucla.edu/courses/data/hivdata.dta
  obs:           100                          
 vars:             7                          7 Feb 2001 02:07
 size:         3,800 (99.5% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
id              float  %9.0g                  
entdate         str7   %9s                    
enddate         str7   %9s                    
time            float  %9.0g                  
age             float  %9.0g                  
drug            float  %9.0g                  
died            float  %9.0g                  
-------------------------------------------------------------------------------
  
list entdate enddate time died in 1/15

       entdate    enddate       time       died 
  1.   15may90    14oct90          5          1  
  2.   19sep89    20mar90          6          0  
  3.   21apr91    20dec91          8          1  
  4.   03jan91    04apr91          3          1  
  5.   18sep89    19jul91         22          1  
  6.   18mar91    17apr91          1          0  
  7.   11nov89    11jun90          7          1  
  8.   25nov89    25aug90          9          1  
  9.   11feb91    13may91          3          1  
 10.   11aug89    11aug90         12          1  
 11.   11apr90    10jun90          2          0  
 12.   11may91    10may92         12          1  
 13.   17jan89    16feb89          1          1  
 14.   16feb91    17may92         15          1  
 15.   09apr91    06feb94         34          1 
  

list time died drug age in 1/15

          time       died       drug        age 
  1.         5          1          0         46  
  2.         6          0          1         35  
  3.         8          1          1         30  
  4.         3          1          1         30  
  5.        22          1          0         36  
  6.         1          0          1         32  
  7.         7          1          1         36  
  8.         9          1          1         31  
  9.         3          1          0         48  
 10.        12          1          0         47  
 11.         2          0          1         28  
 12.        12          1          0         34  
 13.         1          1          1         44  
 14.        15          1          1         32  
 15.        34          1          0         36  
  
summarize

Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+-----------------------------------------------------
      id |     100        50.5   29.01149          1        100  
 entdate |       0
 enddate |       0
    time |     100       11.36   15.28353          1         60  
     age |     100       36.07   6.700302         20         54  
    drug |     100         .49   .5024184          0          1  
    died |     100          .8   .4020151          0          1

  
tab1 drug age

-> tabulation of drug  

       drug |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |         51       51.00       51.00
          1 |         49       49.00      100.00
------------+-----------------------------------
      Total |        100      100.00
  
-> tabulation of age  

        age |      Freq.     Percent        Cum.
------------+-----------------------------------
         20 |          1        1.00        1.00
         21 |          1        1.00        2.00
         22 |          1        1.00        3.00
         25 |          2        2.00        5.00
         26 |          2        2.00        7.00
         28 |          3        3.00       10.00
         29 |          2        2.00       12.00
         30 |          6        6.00       18.00
         31 |          5        5.00       23.00
         32 |          9        9.00       32.00
         33 |          6        6.00       38.00
         34 |          8        8.00       46.00
         35 |          5        5.00       51.00
         36 |          9        9.00       60.00
         37 |          3        3.00       63.00
         38 |          3        3.00       66.00
         39 |          5        5.00       71.00
         40 |          3        3.00       74.00
         41 |          4        4.00       78.00
         42 |          4        4.00       82.00
         43 |          2        2.00       84.00
         44 |          4        4.00       88.00
         45 |          1        1.00       89.00
         46 |          3        3.00       92.00
         47 |          4        4.00       96.00
         48 |          1        1.00       97.00
         50 |          1        1.00       98.00
         51 |          1        1.00       99.00
         54 |          1        1.00      100.00
------------+-----------------------------------
      Total |        100      100.00
  
hist age


  
stset time, failure(died) 

     failure event:  died ~= 0 & died ~= .
obs. time interval:  (0, time]
 exit on or before:  failure

------------------------------------------------------------------------------
      100  total obs.
        0  exclusions
------------------------------------------------------------------------------
      100  obs. remaining, representing
       80  failures in single record/single failure data
     1136  total analysis time at risk, at risk from t =         0
                             earliest observed entry t =         0
                                  last observed exit t =        60
                            
list id _st-_t0 in 1/20

            id       _st        _d          _t         _t0 
  1.         1         1         1           5           0  
  2.         2         1         0           6           0  
  3.         3         1         1           8           0  
  4.         4         1         1           3           0  
  5.         5         1         1          22           0  
  6.         6         1         0           1           0  
  7.         7         1         1           7           0  
  8.         8         1         1           9           0  
  9.         9         1         1           3           0  
 10.        10         1         1          12           0  
 11.        11         1         0           2           0  
 12.        12         1         1          12           0  
 13.        13         1         1           1           0  
 14.        14         1         1          15           0  
 15.        15         1         1          34           0  
 16.        16         1         1           1           0  
 17.        17         1         1           4           0  
 18.        18         1         0          19           0  
 19.        19         1         0           3           0  
 20.        20         1         1           2           0  
  
stdes

         failure _d:  died
   analysis time _t:  time

                                   |-------------- per subject --------------|
Category                   total        mean         min     median        max
------------------------------------------------------------------------------
no. of subjects              100   
no. of records               100           1           1          1          1

(first) entry time                         0           0          0          0
(final) exit time                      11.36           1          5         60

subjects with gap              0   
time on gap if gap             0   
time at risk                1136       11.36           1          5         60

failures                      80          .8           0          1          1
------------------------------------------------------------------------------
  
stvary

         failure _d:  died
   analysis time _t:  time

           subjects for whom the variable is
                                             never    always   sometimes
variable |  constant    varying             missing   missing   missing
---------+--------------------------------------------------------------
      id |       100          0                 100         0         0
 entdate |       100          0                 100         0         0
 enddate |       100          0                 100         0         0
     age |       100          0                 100         0         0
    drug |       100          0                 100         0         0
 
stsum

         failure _d:  died
   analysis time _t:  time

         |               incidence       no. of    |------ Survival time -----|
         | time at risk     rate        subjects        25%       50%       75%
---------+---------------------------------------------------------------------
   total |         1136   .0704225           100          3         7        15
                                                          
stsum, by(drug)

         failure _d:  died
   analysis time _t:  time

         |               incidence       no. of    |------ Survival time -----|
drug     | time at risk     rate        subjects        25%       50%       75%
---------+---------------------------------------------------------------------
       0 |          864   .0486111            51          5        11        34
       1 |          272   .1397059            49          3         5         8
---------+---------------------------------------------------------------------
   total |         1136   .0704225           100          3         7        15
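The incidence rate column is the number of failures divided by the total time at risk. The Python sketch below checks this against the totals reported by stset (80 failures, 1136 units of time at risk). The per-group failure counts are not printed by stsum, so the sketch backs them out from rate times time at risk; those counts are an inference from the output, not values taken from the data set.

```python
# Totals reported by stset and stsum above
failures_total = 80
time_at_risk_total = 1136
rate_total = failures_total / time_at_risk_total
print(round(rate_total, 7))

# stsum, by(drug): (time at risk, incidence rate) for each group
by_drug = {0: (864, 0.0486111), 1: (272, 0.1397059)}

# Failures implied by each group's rate and time at risk (inferred, see lead-in)
implied_failures = {
    group: round(time_at_risk * rate)
    for group, (time_at_risk, rate) in by_drug.items()
}
print(implied_failures, sum(implied_failures.values()))

# Ratio of the two incidence rates, drug = 1 relative to drug = 0
rate_ratio = by_drug[1][1] / by_drug[0][1]
print(round(rate_ratio, 2))
```

The implied counts add up to the 80 failures reported for the whole sample, which is a useful consistency check on the by(drug) table.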
  
sts list

         failure _d:  died
   analysis time _t:  time

           Beg.          Net            Survivor      Std.
  Time    Total   Fail   Lost           Function     Error     [95% Conf. Int.]
-------------------------------------------------------------------------------
     1      100     15      2             0.8500    0.0357     0.7636    0.9067
     2       83      5      5             0.7988    0.0402     0.7057    0.8652
     3       73     10      2             0.6894    0.0473     0.5862    0.7718
     4       61      4      1             0.6442    0.0493     0.5387    0.7315
     5       56      7      0             0.5636    0.0517     0.4564    0.6577
     6       49      2      1             0.5406    0.0521     0.4334    0.6361
     7       46      6      1             0.4701    0.0526     0.3644    0.5688
     8       39      4      0             0.4219    0.0525     0.3183    0.5217
     9       35      3      0             0.3857    0.0520     0.2845    0.4858
    10       32      3      1             0.3496    0.0511     0.2514    0.4493
    11       28      3      0             0.3121    0.0500     0.2177    0.4110
    12       25      2      2             0.2872    0.0490     0.1956    0.3851
    13       21      1      0             0.2735    0.0486     0.1835    0.3711
    14       20      1      0             0.2598    0.0480     0.1715    0.3569
    15       19      2      0             0.2325    0.0467     0.1479    0.3282
    19       17      0      1             0.2325    0.0467     0.1479    0.3282
    22       16      1      0             0.2179    0.0460     0.1355    0.3130
    24       15      0      1             0.2179    0.0460     0.1355    0.3130
    30       14      1      0             0.2024    0.0453     0.1222    0.2969
    31       13      1      0             0.1868    0.0444     0.1092    0.2805
    32       12      1      0             0.1712    0.0433     0.0966    0.2638
    34       11      1      0             0.1557    0.0421     0.0843    0.2469
    35       10      1      0             0.1401    0.0407     0.0724    0.2296
    36        9      1      0             0.1245    0.0390     0.0610    0.2119
    43        8      1      0             0.1090    0.0371     0.0500    0.1939
    53        7      1      0             0.0934    0.0349     0.0396    0.1754
    54        6      1      0             0.0778    0.0324     0.0298    0.1564
    56        5      0      1             0.0778    0.0324     0.0298    0.1564
    57        4      1      0             0.0584    0.0296     0.0178    0.1349
    58        3      1      0             0.0389    0.0253     0.0082    0.1117
    60        2      0      2             0.0389    0.0253     0.0082    0.1117
-------------------------------------------------------------------------------
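The survival time percentiles reported earlier by stsum can be read off this table. The Python sketch below assumes the usual definition, the smallest time at which the estimated survivor function is at or below 1 - p/100, and copies the first fifteen rows of the Survivor Function column; the function name percentile is a label chosen for this example.

```python
# (time, Kaplan-Meier survivor function) copied from the sts list output above
km = [
    (1, 0.8500), (2, 0.7988), (3, 0.6894), (4, 0.6442), (5, 0.5636),
    (6, 0.5406), (7, 0.4701), (8, 0.4219), (9, 0.3857), (10, 0.3496),
    (11, 0.3121), (12, 0.2872), (13, 0.2735), (14, 0.2598), (15, 0.2325),
]

def percentile(table, p):
    """Smallest time with S(t) <= 1 - p/100, or None if never reached."""
    target = 1 - p / 100
    for t, surv in table:
        if surv <= target:
            return t
    return None

quartiles = {p: percentile(km, p) for p in (25, 50, 75)}
print(quartiles)
```

These values match the 25%, 50%, and 75% columns in the total row of the stsum output: the median survival time of 7 is the first time at which the survivor function falls to 0.5 or below.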
  
sts list, by(drug) compare

         failure _d:  died
   analysis time _t:  time

                 Survivor Function
drug                  0          1
----------------------------------
time       1     0.9020     0.7959
           8     0.6078     0.2037
          15     0.3624     0.0582
          22     0.3383     0.0582
          29     0.3383     0.0582
          36     0.1821     0.0582
          43     0.1561     0.0582
          50     0.1561     0.0582
          57     0.0781          .
          64          .          .
----------------------------------
  
sts graph


  
sts graph, na


  
sts graph, by(drug)


  
sts graph, by(drug) na

Categorical Data Analysis Course

Phil Ender