In this unit we present an introduction to survival analysis, also known as event history analysis or time-to-event analysis in the social sciences. Analyses of this type involve the amount of time that a subject is at risk while under observation. Time itself can be measured in different ways: it can be continuous or it can be treated as discrete. In this unit we will look into continuous time survival analysis, and in the next unit we will give an introduction to discrete time survival analysis.
Two aspects of survival analysis make it interesting from a data analysis perspective:
In some studies, all of the cases might be observed until failure or the specific event occurs. In other studies, the study may end before failure occurs for some subjects, or some subjects may withdraw or drop out of the study before the failure event. In either case, the subject is not under observation when the failure occurs. When a subject is not observed until failure, the observation is considered to be censored. These types of right censored data are common in survival analysis. At times, analyses might also include left censored data.
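As a minimal sketch of what right censoring does to the recorded data, the following Python snippet turns a failure time and a censoring time into the (time, failed) pair used later in this unit. The function name `observe`, the made-up subjects, and the convention that a tie counts as an observed failure are all assumptions made for illustration.

```python
# Hypothetical illustration of right censoring: we record the earlier of the
# failure time and the censoring time, plus a flag saying which one it was.

def observe(failure_time, censor_time):
    """Return (t, failed) for one subject under right censoring.

    failure_time: when the event would occur
    censor_time:  when observation stops (study end, withdrawal, dropout)
    Assumption: a failure exactly at the censoring time counts as observed.
    """
    if failure_time <= censor_time:
        return failure_time, 1   # failure seen while under observation
    return censor_time, 0        # censored: we only know T > censor_time

# Made-up subjects: (failure time, censoring time)
subjects = [(5, 12), (9, 6), (3, 3), (20, 8)]
records = [observe(f, c) for f, c in subjects]
print(records)
```

For a censored subject the recorded time is only a lower bound on the true failure time, which is why such records cannot simply be dropped or treated as failures.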
Survivorship and Hazard Functions
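Let the probability density function and cumulative distribution function of the failure time T be denoted as follows:

$$f(t), \qquad F(t) = P(T \le t) = \int_0^t f(u)\,du$$

Now, we can define the survivorship function as:

$$S(t) = P(T > t) = 1 - F(t)$$

The hazard function can next be defined as:

$$h(t) = \frac{f(t)}{S(t)}$$

However, it is probably easier to interpret the cumulative hazard function, H(t), which is just the integral over 0 to t of h(t):

$$H(t) = \int_0^t h(u)\,du$$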
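The cumulative hazard and the survivorship function are two views of the same information. A short derivation, assuming T is continuous, S(t) > 0, and the usual definitions S(t) = 1 - F(t) and h(t) = f(t)/S(t), makes the link explicit:

```latex
\begin{aligned}
h(t) &= \frac{f(t)}{S(t)} = \frac{-S'(t)}{S(t)} = -\frac{d}{dt}\,\ln S(t), \\
H(t) &= \int_0^t h(u)\,du = -\ln S(t) + \ln S(0) = -\ln S(t), \\
S(t) &= \exp\{-H(t)\}.
\end{aligned}
```

The second line uses S(0) = 1, that is, every subject is still at risk at time 0.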
Stata survival time commands, such as sts list and sts graph, can display results in either the survivorship or hazard metric.
Nonparametric Methods
There are two nonparametric approaches commonly used to estimate the survivor function and cumulative hazard function. The Kaplan-Meier (1958) method estimates the survivor function and the Nelson-Aalen (1972 & 1978) method estimates the cumulative hazard function. These nonparametric estimators require only an ordering of the time to failure (or censoring) and do not include the effects of covariates. We will begin our explanation of the Kaplan-Meier estimator with the following simple example.
id    t    failed
 1    2      1
 2    4      1
 3    4      1
 4    5      0
 5    7      1
 6    8      0
We can summarize the data as follows:
              nj          dj
at time   # at risk   # failed   # censored
   2          6           1           0
   4          5           2           0
   5          3           0           1
   7          2           1           0
   8          1           0           1
Using the formula,

$$p_j = \frac{n_j - d_j}{n_j}, \qquad S(t) = \prod_{j:\, t_j \le t} \frac{n_j - d_j}{n_j}$$

we can compute a probability column and also the running product of the probabilities.
              nj          dj
at time   # at risk   # failed   # censored     p     S(t)
   2          6           1           0        5/6    5/6
   4          5           2           0        3/5    1/2
   5          3           0           1         1     1/2
   7          2           1           0        1/2    1/4
   8          1           0           1         1     1/4
The last column contains the Kaplan-Meier survivor function.
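The hand computation can be reproduced outside Stata. The following is a minimal Python sketch, not Stata's implementation; the function name `kaplan_meier` and the tuple layout are choices made here for illustration. It uses exact fractions so the p and S(t) columns can be compared with the table directly.

```python
from fractions import Fraction

# Toy data from the example: (t, failed); failed = 0 means censored.
data = [(2, 1), (4, 1), (4, 1), (5, 0), (7, 1), (8, 0)]

def kaplan_meier(data):
    """Return rows of (time, n_at_risk, n_failed, n_censored, p, S)."""
    rows = []
    surv = Fraction(1)
    n_at_risk = len(data)
    for t in sorted({t for t, _ in data}):
        d = sum(1 for u, f in data if u == t and f == 1)  # failures at t
        c = sum(1 for u, f in data if u == t and f == 0)  # censored at t
        p = Fraction(n_at_risk - d, n_at_risk)            # (nj - dj) / nj
        surv *= p                                         # running product
        rows.append((t, n_at_risk, d, c, p, surv))
        n_at_risk -= d + c   # failures and censored both leave the risk set
    return rows

km = kaplan_meier(data)
for t, n, d, c, p, s in km:
    print(t, n, d, c, p, s, round(float(s), 4))
```

Note that a censored subject lowers the number at risk at the next time point but leaves S(t) unchanged at the time it is censored, which is why the rows for times 5 and 8 repeat the previous value.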
In Stata, we would use the sts list command to obtain the Kaplan-Meier survivor function. Before we can use the sts list command we need to get the data into the proper format for survival analysis by using the stset command. Here is how it is done in Stata.
input id t failed
1 2 1
2 4 1
3 4 1
4 5 0
5 7 1
6 8 0
end

stset t, failure(failed)

     failure event:  failed ~= 0 & failed ~= .
obs. time interval:  (0, t]
 exit on or before:  failure

------------------------------------------------------------------------------
        6  total obs.
        0  exclusions
------------------------------------------------------------------------------
        6  obs. remaining, representing
        4  failures in single record/single failure data
       30  total analysis time at risk, at risk from t =         0
                             earliest observed entry t =         0
                                  last observed exit t =         8

sts list

         failure _d:  failed
   analysis time _t:  t

           Beg.          Net            Survivor      Std.
  Time    Total   Fail   Lost           Function     Error     [95% Conf. Int.]
-------------------------------------------------------------------------------
     2        6      1      0             0.8333    0.1521     0.2731    0.9747
     4        5      2      0             0.5000    0.2041     0.1109    0.8037
     5        3      0      1             0.5000    0.2041     0.1109    0.8037
     7        2      1      0             0.2500    0.2041     0.0123    0.6459
     8        1      0      1             0.2500    0.2041     0.0123    0.6459
-------------------------------------------------------------------------------

You might think that it would be easy to obtain the cumulative hazard function from the Kaplan-Meier estimate using the relationship between the survivor and hazard functions (see above), but there are problems in small samples with this approach. It is better to use the following formula for the Nelson-Aalen estimator.

$$H(t) = \sum_{j:\, t_j \le t} \frac{d_j}{n_j}$$
We will compute the column ej = dj/nj and the column H(t) which is the sum of the ej.
              nj          dj
at time   # at risk   # failed   # censored    ej      H(t)
   2          6           1           0        1/6    0.1667
   4          5           2           0        2/5    0.5667
   5          3           0           1         0     0.5667
   7          2           1           0        1/2    1.0667
   8          1           0           1         0     1.0667
In Stata, we use the na option with sts list.
sts list, na

         failure _d:  failed
   analysis time _t:  t

           Beg.          Net          Nelson-Aalen    Std.
  Time    Total   Fail   Lost           Cum. Haz.    Error     [95% Conf. Int.]
-------------------------------------------------------------------------------
     2        6      1      0             0.1667    0.1667     0.0235    1.1832
     4        5      2      0             0.5667    0.3283     0.1820    1.7639
     5        3      0      1             0.5667    0.3283     0.1820    1.7639
     7        2      1      0             1.0667    0.5981     0.3554    3.2015
     8        1      0      1             1.0667    0.5981     0.3554    3.2015
-------------------------------------------------------------------------------
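To see how the Nelson-Aalen sum differs from a cumulative hazard obtained by transforming the Kaplan-Meier estimate, here is a small Python sketch on the same six observations. It is an illustration only; the function name `nelson_aalen_and_km` is made up here, and -ln S(t) is computed from the standard relationship S(t) = exp(-H(t)).

```python
import math

# Toy data from the example: (t, failed); failed = 0 means censored.
data = [(2, 1), (4, 1), (4, 1), (5, 0), (7, 1), (8, 0)]

def nelson_aalen_and_km(data):
    """Return rows of (time, Nelson-Aalen H(t), -ln of Kaplan-Meier S(t))."""
    rows = []
    n_at_risk = len(data)
    cum_haz = 0.0
    surv = 1.0
    for t in sorted({t for t, _ in data}):
        d = sum(1 for u, f in data if u == t and f == 1)  # failures at t
        c = sum(1 for u, f in data if u == t and f == 0)  # censored at t
        cum_haz += d / n_at_risk                  # add ej = dj / nj
        surv *= (n_at_risk - d) / n_at_risk       # Kaplan-Meier product
        rows.append((t, cum_haz, -math.log(surv)))
        n_at_risk -= d + c
    return rows

rows = nelson_aalen_and_km(data)
for t, h_na, h_km in rows:
    print(f"{t:>2}  {h_na:.4f}  {h_km:.4f}")
```

In this sample the Nelson-Aalen values 0.1667, 0.5667 and 1.0667 sit below the transformed Kaplan-Meier values 0.1823, 0.6931 and 1.3863, so the two routes to H(t) clearly disagree when only a handful of subjects are at risk.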
Note: Due to the introductory nature of this unit we do not go into issues such as delayed entry, interval truncation, interval censoring, etc.
HIV Example
Here is an example using HIV data from Hosmer & Lemeshow (1999).
use http://www.gseis.ucla.edu/courses/data/hivdata

describe

Contains data from http://www.gseis.ucla.edu/courses/data/hivdata.dta
  obs:           100
 vars:             7                          7 Feb 2001 02:07
 size:         3,800 (99.5% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
id              float  %9.0g
entdate         str7   %9s
enddate         str7   %9s
time            float  %9.0g
age             float  %9.0g
drug            float  %9.0g
died            float  %9.0g
-------------------------------------------------------------------------------

list entdate enddate time died in 1/15

       entdate    enddate       time       died
  1.   15may90    14oct90          5          1
  2.   19sep89    20mar90          6          0
  3.   21apr91    20dec91          8          1
  4.   03jan91    04apr91          3          1
  5.   18sep89    19jul91         22          1
  6.   18mar91    17apr91          1          0
  7.   11nov89    11jun90          7          1
  8.   25nov89    25aug90          9          1
  9.   11feb91    13may91          3          1
 10.   11aug89    11aug90         12          1
 11.   11apr90    10jun90          2          0
 12.   11may91    10may92         12          1
 13.   17jan89    16feb89          1          1
 14.   16feb91    17may92         15          1
 15.   09apr91    06feb94         34          1

list time died drug age in 1/15

          time       died       drug        age
  1.         5          1          0         46
  2.         6          0          1         35
  3.         8          1          1         30
  4.         3          1          1         30
  5.        22          1          0         36
  6.         1          0          1         32
  7.         7          1          1         36
  8.         9          1          1         31
  9.         3          1          0         48
 10.        12          1          0         47
 11.         2          0          1         28
 12.        12          1          0         34
 13.         1          1          1         44
 14.        15          1          1         32
 15.        34          1          0         36

summarize

Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+-----------------------------------------------------
      id |     100        50.5   29.01149          1        100
 entdate |       0
 enddate |       0
    time |     100       11.36   15.28353          1         60
     age |     100       36.07   6.700302         20         54
    drug |     100         .49   .5024184          0          1
    died |     100          .8   .4020151          0          1

tab1 drug age

-> tabulation of drug

       drug |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |         51       51.00       51.00
          1 |         49       49.00      100.00
------------+-----------------------------------
      Total |        100      100.00

-> tabulation of age

        age |      Freq.     Percent        Cum.
------------+-----------------------------------
         20 |          1        1.00        1.00
         21 |          1        1.00        2.00
         22 |          1        1.00        3.00
         25 |          2        2.00        5.00
         26 |          2        2.00        7.00
         28 |          3        3.00       10.00
         29 |          2        2.00       12.00
         30 |          6        6.00       18.00
         31 |          5        5.00       23.00
         32 |          9        9.00       32.00
         33 |          6        6.00       38.00
         34 |          8        8.00       46.00
         35 |          5        5.00       51.00
         36 |          9        9.00       60.00
         37 |          3        3.00       63.00
         38 |          3        3.00       66.00
         39 |          5        5.00       71.00
         40 |          3        3.00       74.00
         41 |          4        4.00       78.00
         42 |          4        4.00       82.00
         43 |          2        2.00       84.00
         44 |          4        4.00       88.00
         45 |          1        1.00       89.00
         46 |          3        3.00       92.00
         47 |          4        4.00       96.00
         48 |          1        1.00       97.00
         50 |          1        1.00       98.00
         51 |          1        1.00       99.00
         54 |          1        1.00      100.00
------------+-----------------------------------
      Total |        100      100.00

hist age

stset time, failure(died)

     failure event:  died ~= 0 & died ~= .
obs. time interval:  (0, time]
 exit on or before:  failure

------------------------------------------------------------------------------
      100  total obs.
        0  exclusions
------------------------------------------------------------------------------
      100  obs. remaining, representing
       80  failures in single record/single failure data
     1136  total analysis time at risk, at risk from t =         0
                             earliest observed entry t =         0
                                  last observed exit t =        60

list id _st- _t0 in 1/20

            id        _st         _d         _t        _t0
  1.         1          1          1          5          0
  2.         2          1          0          6          0
  3.         3          1          1          8          0
  4.         4          1          1          3          0
  5.         5          1          1         22          0
  6.         6          1          0          1          0
  7.         7          1          1          7          0
  8.         8          1          1          9          0
  9.         9          1          1          3          0
 10.        10          1          1         12          0
 11.        11          1          0          2          0
 12.        12          1          1         12          0
 13.        13          1          1          1          0
 14.        14          1          1         15          0
 15.        15          1          1         34          0
 16.        16          1          1          1          0
 17.        17          1          1          4          0
 18.        18          1          0         19          0
 19.        19          1          0          3          0
 20.        20          1          1          2          0

stdes

         failure _d:  died
   analysis time _t:  time

                                   |-------------- per subject --------------|
Category                   total        mean         min     median        max
------------------------------------------------------------------------------
no. of subjects              100
no. of records               100           1           1          1          1

(first) entry time                         0           0          0          0
(final) exit time                      11.36           1          5         60

subjects with gap              0
time on gap if gap             0
time at risk                1136       11.36           1          5         60

failures                      80          .8           0          1          1
------------------------------------------------------------------------------

stvary

         failure _d:  died
   analysis time _t:  time

               subjects for whom the variable is
                                             never      always   sometimes
variable |  constant    varying            missing     missing     missing
---------+--------------------------------------------------------------
      id |       100          0                100           0           0
 entdate |       100          0                100           0           0
 enddate |       100          0                100           0           0
     age |       100          0                100           0           0
    drug |       100          0                100           0           0

stsum

         failure _d:  died
   analysis time _t:  time

         |               incidence       no. of    |------ Survival time -----|
         | time at risk     rate        subjects        25%       50%       75%
---------+---------------------------------------------------------------------
   total |         1136   .0704225           100          3         7        15

stsum, by(drug)

         failure _d:  died
   analysis time _t:  time

         |               incidence       no. of    |------ Survival time -----|
drug     | time at risk     rate        subjects        25%       50%       75%
---------+---------------------------------------------------------------------
       0 |          864   .0486111            51          5        11        34
       1 |          272   .1397059            49          3         5         8
---------+---------------------------------------------------------------------
   total |         1136   .0704225           100          3         7        15

sts list

         failure _d:  died
   analysis time _t:  time

           Beg.          Net            Survivor      Std.
  Time    Total   Fail   Lost           Function     Error     [95% Conf. Int.]
-------------------------------------------------------------------------------
     1      100     15      2             0.8500    0.0357     0.7636    0.9067
     2       83      5      5             0.7988    0.0402     0.7057    0.8652
     3       73     10      2             0.6894    0.0473     0.5862    0.7718
     4       61      4      1             0.6442    0.0493     0.5387    0.7315
     5       56      7      0             0.5636    0.0517     0.4564    0.6577
     6       49      2      1             0.5406    0.0521     0.4334    0.6361
     7       46      6      1             0.4701    0.0526     0.3644    0.5688
     8       39      4      0             0.4219    0.0525     0.3183    0.5217
     9       35      3      0             0.3857    0.0520     0.2845    0.4858
    10       32      3      1             0.3496    0.0511     0.2514    0.4493
    11       28      3      0             0.3121    0.0500     0.2177    0.4110
    12       25      2      2             0.2872    0.0490     0.1956    0.3851
    13       21      1      0             0.2735    0.0486     0.1835    0.3711
    14       20      1      0             0.2598    0.0480     0.1715    0.3569
    15       19      2      0             0.2325    0.0467     0.1479    0.3282
    19       17      0      1             0.2325    0.0467     0.1479    0.3282
    22       16      1      0             0.2179    0.0460     0.1355    0.3130
    24       15      0      1             0.2179    0.0460     0.1355    0.3130
    30       14      1      0             0.2024    0.0453     0.1222    0.2969
    31       13      1      0             0.1868    0.0444     0.1092    0.2805
    32       12      1      0             0.1712    0.0433     0.0966    0.2638
    34       11      1      0             0.1557    0.0421     0.0843    0.2469
    35       10      1      0             0.1401    0.0407     0.0724    0.2296
    36        9      1      0             0.1245    0.0390     0.0610    0.2119
    43        8      1      0             0.1090    0.0371     0.0500    0.1939
    53        7      1      0             0.0934    0.0349     0.0396    0.1754
    54        6      1      0             0.0778    0.0324     0.0298    0.1564
    56        5      0      1             0.0778    0.0324     0.0298    0.1564
    57        4      1      0             0.0584    0.0296     0.0178    0.1349
    58        3      1      0             0.0389    0.0253     0.0082    0.1117
    60        2      0      2             0.0389    0.0253     0.0082    0.1117
-------------------------------------------------------------------------------

sts list, by(drug) compare

         failure _d:  died
   analysis time _t:  time

                Survivor Function
drug                  0         1
----------------------------------
time        1    0.9020    0.7959
            8    0.6078    0.2037
           15    0.3624    0.0582
           22    0.3383    0.0582
           29    0.3383    0.0582
           36    0.1821    0.0582
           43    0.1561    0.0582
           50    0.1561    0.0582
           57    0.0781         .
           64         .         .
----------------------------------

sts graph

sts graph, na

sts graph, by(drug)

sts graph, by(drug) na
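Several figures in the HIV output can be checked by hand. The Python sketch below is only an arithmetic check, using numbers copied from the stset, stsum and sts list output; the variable names are made up here. It assumes the incidence rate reported by stsum is failures divided by total time at risk, which the totals in the output agree with.

```python
# Figures copied from the stset and stsum output for the HIV data.
failures, time_at_risk = 80, 1136

# Incidence rate as failures per unit of time at risk.
rate = failures / time_at_risk
print(round(rate, 7))

# Failures per drug group implied by stsum, by(drug): rate * time at risk.
by_drug = {0: (864, 0.0486111), 1: (272, 0.1397059)}
implied = {g: round(t * r) for g, (t, r) in by_drug.items()}
print(implied, sum(implied.values()))

# First three rows of sts list: (time, number at risk, failures).
first_rows = [(1, 100, 15), (2, 83, 5), (3, 73, 10)]
surv = 1.0
km = []
for t, n, d in first_rows:
    surv *= (n - d) / n          # Kaplan-Meier running product
    km.append(round(surv, 4))
print(km)
```

The rate reproduces the .0704225 shown by stsum, the implied group counts of 42 and 38 add up to the 80 failures reported by stset, and the running product gives 0.85, 0.7988 and 0.6894, matching the first three rows of the survivor function.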
Categorical Data Analysis Course
Phil Ender