Survival analysis 1: Introduction to analysing time to event outcomes

Researchers are often interested in analysing the time until an event occurs. Examples of such outcomes include time to failure of a material, time to recovery after fatigue, or time to death after surgery. Time is the most common unit in which such outcomes are measured, but the outcome need not be time. It could instead be a count, e.g. the number of repeated loadings until material failure.
At first glance, linear regression seems an obvious choice for analysing time outcomes. But this is problematic because time outcomes are often not Normally distributed. They can be very skewed or bimodal, and linear regression is not robust to these violations.
Survival analysis is a technique used to analyse time to occurrence of an event. Survival analysis does not assume time outcomes are Normally distributed. The key principle is: since events occur at given times, simply use time to order the events, then only analyse the ordering of the time values.
How survival analysis works: semi- and non-parametric methods
Suppose we have data from 6 people on the number of years (i.e. time) after surgery at which they died. We could also indicate whether each person had died (1) or was still alive (0) by a given time point:
id | year | age | died by year 1 | died by year 6 |
---|---|---|---|---|
1 | 1 | 48 | 1 | 1 |
2 | 6 | 63 | 0 | 1 |
3 | 9 | 54 | 0 | 0 |
4 | 9 | 49 | 0 | 0 |
5 | 15 | 44 | 0 | 0 |
6 | 27 | 36 | 0 | 0 |
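The two indicator columns can be derived directly from the year of death. A minimal sketch, assuming the six values in the table above; the helper name `died_by` is made up for illustration:

```python
# Years after surgery at which each participant died (ids 1..6, from the table).
years = [1, 6, 9, 9, 15, 27]

def died_by(year_of_death, cutoff):
    """Return 1 if the participant had died by the cutoff year, else 0."""
    return 1 if year_of_death <= cutoff else 0

died_by_1 = [died_by(y, 1) for y in years]
died_by_6 = [died_by(y, 6) for y in years]

print(died_by_1)  # [1, 0, 0, 0, 0, 0]
print(died_by_6)  # [1, 1, 0, 0, 0, 0]
```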
We could ask:
After being exposed to the risk of dying, what is the probability of dying at the 1st unit of time?
At year = 1, only participant 1 died. To get the probability of dying at year = 1, we could perform logistic regression of death on time, potentially adjusting for age.
We could also ask:
After being exposed to the risk of dying, what is the probability of dying at the 6th unit of time?
By year = 6, participants 1 and 2 had died. Since participant 1 died before participant 2, participant 1 was no longer at risk at year = 6, so only participant 2 contributes a new death among those still at risk. There is therefore less information at year = 6 than at year = 1, because there is one fewer person at risk of dying. Nevertheless, to get the probability of dying at year = 6, we could perform logistic regression of death on time, adjusting for age.
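The shrinking pool of people at risk can be tabulated for every year in which a death occurred. This is a rough sketch using only the six death times from the table; the function name `risk_table` is hypothetical, and it assumes everyone is followed until death (no one drops out early):

```python
# Years after surgery at which each participant died (from the table).
years = [1, 6, 9, 9, 15, 27]

def risk_table(times):
    """For each distinct death time, count who is still at risk and who dies."""
    rows = []
    for t in sorted(set(times)):
        at_risk = sum(1 for y in times if y >= t)  # still alive just before t
        deaths = sum(1 for y in times if y == t)   # die exactly at t
        rows.append((t, at_risk, deaths, deaths / at_risk))
    return rows

table = risk_table(years)
for t, n, d, p in table:
    print(f"year {t:>2}: at risk = {n}, deaths = {d}, proportion dying = {p:.3f}")
```

Six people are at risk at year 1 and five at year 6, which matches the point that less information is available at the later time.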
In fact, we could run a separate regression model for each distinct year after surgery in which someone died (five in this dataset, since participants 3 and 4 both died at year 9), to get the probability of dying at each time point adjusted for age. But this is inefficient. It would be better to combine these separate analyses, constraining the regression coefficient (for age) to be the same across them. This is what semiparametric survival analysis does when a conditional logistic regression is fit at each event time; in particular, Cox regression (Cox, 1972) does this.
Regression models don’t make assumptions about the distributions of independent (x-axis) variables. Since none of the separate logistic regressions assumes how survival times are distributed, the analysis that combines them doesn’t either. So time is only used to order the events (deaths), and a non-parametric analysis is performed on the ordered time values. However, the effect of covariates such as age is still described by a parameter (the regression coefficient), so there is also a parametric component to the analysis.
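To make the "only the ordering matters" idea concrete, here is a minimal sketch of the Cox partial log-likelihood for a single covariate (age), using the six participants from the table. It is not a fitted model: the function name, the use of the Breslow approach for the tied deaths at year 9, and the trial value of the coefficient are all assumptions made for illustration.

```python
import math

# Data from the table: year of death and age for participants 1..6.
years = [1, 6, 9, 9, 15, 27]
ages = [48, 63, 54, 49, 44, 36]

def cox_partial_loglik(beta, times, x):
    """Cox partial log-likelihood for one covariate.

    Assumes every participant has an observed event (no censoring) and
    handles tied event times with the Breslow approach (an assumption).
    """
    loglik = 0.0
    for t_i, x_i in zip(times, x):
        # Risk set: everyone still alive just before time t_i.
        denom = sum(math.exp(beta * x_j)
                    for t_j, x_j in zip(times, x) if t_j >= t_i)
        loglik += beta * x_i - math.log(denom)
    return loglik

beta = 0.05  # arbitrary trial value, not an estimate

# Replace the times by their ranks, or by their logs: the ordering is
# unchanged, so the partial log-likelihood is unchanged too.
ranks = [1, 2, 3, 3, 4, 5]
log_years = [math.log(y) for y in years]

ll_years = cox_partial_loglik(beta, years, ages)
ll_ranks = cox_partial_loglik(beta, ranks, ages)
ll_logs = cox_partial_loglik(beta, log_years, ages)
print(ll_years, ll_ranks, ll_logs)
```

The three printed values agree because the time values enter the calculation only through the comparison that defines who is still at risk.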
When there are no covariates, or when covariates are qualitative, non-parametric methods such as the Kaplan and Meier (1958) method, or the methods by Nelson (1972) and Aalen (1978), are used to estimate the probability of surviving past a certain time, or to compare survival across covariate categories.
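Both estimators are simple enough to compute by hand for the six participants in the table. The sketch below assumes no censoring (everyone is observed until death); the function names are made up for illustration and are not from any particular library.

```python
# Years after surgery at which each participant died (from the table).
years = [1, 6, 9, 9, 15, 27]

def kaplan_meier(times):
    """Kaplan-Meier estimate of surviving past each distinct death time."""
    surv = 1.0
    out = []
    for t in sorted(set(times)):
        at_risk = sum(1 for y in times if y >= t)
        deaths = sum(1 for y in times if y == t)
        surv *= 1 - deaths / at_risk  # multiply conditional survival probabilities
        out.append((t, surv))
    return out

def nelson_aalen(times):
    """Nelson-Aalen estimate of the cumulative hazard at each distinct death time."""
    cum_hazard = 0.0
    out = []
    for t in sorted(set(times)):
        at_risk = sum(1 for y in times if y >= t)
        deaths = sum(1 for y in times if y == t)
        cum_hazard += deaths / at_risk  # add up deaths per person at risk
        out.append((t, cum_hazard))
    return out

km = kaplan_meier(years)
na = nelson_aalen(years)
for (t, s), (_, h) in zip(km, na):
    print(f"year {t:>2}: survival = {s:.3f}, cumulative hazard = {h:.3f}")
```

With no censoring, the Kaplan-Meier estimate at each time is just the fraction of the six participants still alive, e.g. 4/6 after year 6.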
Parametric methods
It is possible to assume that time to event data (specifically, the hazard function; we’ll cover this in the next post) follow a distribution. In this case, the analysis of time outcomes and covariates requires distributional assumptions, expressed through the parameters that describe those distributions. Common choices include the Weibull (favoured by engineers), lognormal and loglogistic distributions.
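As a small illustration of what "parameters that describe the distribution" means, here is a sketch of the Weibull hazard and survival functions in their standard shape/scale form. The parameter values below are arbitrary and chosen only for illustration; they are not estimated from the example data.

```python
import math

def weibull_hazard(t, shape, scale):
    """Weibull hazard h(t) = (shape / scale) * (t / scale) ** (shape - 1), for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)

def weibull_survival(t, shape, scale):
    """Weibull survival S(t) = exp(-(t / scale) ** shape), for t >= 0."""
    return math.exp(-((t / scale) ** shape))

# Hypothetical parameter values.
scale = 10.0
for shape in (0.5, 1.0, 2.0):
    hazards = [weibull_hazard(t, shape, scale) for t in (1, 5, 10)]
    print(f"shape = {shape}: hazard at t = 1, 5, 10 ->", hazards)
```

A shape below 1 gives a hazard that falls over time, a shape of 1 gives a constant hazard, and a shape above 1 gives a hazard that rises over time, so two parameters are enough to describe quite different failure patterns.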
Summary
Survival analysis involves analysing time to event outcomes. Common types of survival analysis include semiparametric Cox regression and the non-parametric Kaplan-Meier and Nelson-Aalen methods. These methods use time to order the events, then analyse the ordering of the time values.
Next time, we’ll expand more on the hazard function and other key concepts in survival analysis.
Reference
Cleves M, Gutierrez R, Gould W, Marchenko Y (2008) An Introduction to Survival Analysis Using Stata (2nd ed.), Chapter 1. Stata Press: Texas, USA.