Why we need survival analysis
When the dependent variable is continuous, we would ordinarily think first of linear regression. It’s a very good method when you want to look at the relationship between a continuous dependent variable and one or more independent variables.
But, like nearly all statistical techniques, it makes assumptions. One of them is so obvious that it usually goes unstated: that we know the value of the dependent variable. Usually, this is not a problem. If we want to model, say, what people weigh, we can weigh them. But in one common type of analysis, we don’t always know the dependent variable: when the dependent variable is time to an event. In that case, we need survival analysis.
Survival analysis gets its name from the fact that it is often used to look at how long people will live, and to see what influences that. Do women live longer than men? Do people who take aspirin live longer than those who do not? But the event need not be death; it can be time to any event. We could look at how long prisoners stay in jail, how long patients stay in the hospital, how long couples stay married, or any other variable that measures the time until something happens.
The key reason that we need survival analysis is that these data are often censored. If, for example, we were looking at how long couples stay married, we could select some couples, and follow them over time. But some couples won’t get divorced before we finish our study. Similarly, some patients won’t die during our study, and so on. For those cases we know only that the time to the event is at least as long as the time we observed; that is what it means for an observation to be censored.
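A minimal sketch of how censored observations are commonly recorded, assuming made-up marriage durations: each couple gets an observed time plus a flag saying whether the divorce was actually seen. All names and numbers here are hypothetical and only for illustration.

```python
# Hypothetical data: (years observed, divorce_observed) for five couples.
# divorce_observed=False means the couple was still married when the
# study ended, so the true duration is only known to exceed the time shown.
couples = [
    (3.0, True),
    (7.5, True),
    (10.0, False),  # censored at the end of a 10-year study
    (4.2, True),
    (10.0, False),  # censored
]

observed_times = [t for t, event in couples if event]
censored_times = [t for t, event in couples if not event]

# Averaging only the observed divorces ignores the long-lasting marriages,
# and averaging everything treats 10.0 as if it were the true duration.
# Both shortcuts tend to understate the typical time to divorce.
mean_events_only = sum(observed_times) / len(observed_times)
mean_all = sum(t for t, _ in couples) / len(couples)

print(f"censored couples: {len(censored_times)} of {len(couples)}")
print(f"mean of observed divorces only: {mean_events_only:.2f} years")
print(f"mean treating censored times as exact: {mean_all:.2f} years")
```

Neither average is a sound estimate, which is why the pair of values (time, event flag) is the basic input to survival methods rather than the time alone.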
Types of survival analysis
Although there is a wide variety of techniques for doing survival analysis, they fall into three families: parametric, semiparametric, and nonparametric. The difference is in what we wish to assume about the distribution of survival times. In parametric survival analysis, we assume that survival times come from some specific statistical distribution; in semiparametric survival analysis, we do not need to make this assumption, but we do make another assumption, usually the proportional hazards assumption. In nonparametric analysis, we avoid even that assumption. Since the exact nature of the survival function is hard to know, and is critical to the results, semiparametric survival analysis is much more commonly used than parametric. And semiparametric analysis offers more useful output than nonparametric analysis. By far the most common semiparametric method is Cox proportional hazards regression.
Semiparametric methods, unlike parametric methods, do not allow you to predict a survival time; rather, they let you see differences between groups, or differences associated with some other measure. For instance, a Cox model would not predict how long couples would stay together, but it could estimate how much more quickly (say) couples with a large age difference got divorced than couples with similar ages. This comparison is often of primary interest.
Some output from survival analysis
Many statistical packages can do survival analysis, and the output will vary somewhat. But one key piece of information is the hazard ratio. This is, as the name suggests, a ratio of hazards, and a hazard is defined as “the instantaneous rate of change in the …. [event] … probability at time t, on the condition that … [the event has not happened]… before time t” (Cantor, p. 9; see sources at end). That’s rather technical, and fully understanding it requires calculus, but the basic idea (and this is NOT precisely right) is “how likely is the event to happen at time t, if it hasn’t happened yet?”
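For readers who want the formal version, the hazard is usually written as a limit. The notation below is an assumption of this sketch rather than something taken from the quoted source: T is the random time to the event, h(t) is the hazard, f(t) is the density of T, and S(t) is the survival function.

```latex
h(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t} = \frac{f(t)}{S(t)}, \qquad S(t) = P(T > t)
```

The conditioning on T ≥ t is the "if it hasn’t happened yet" part, and dividing by Δt is what makes the hazard a rate rather than a probability.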
The hazard ratio, then, shows how the hazard changes as an independent variable changes. We might, for example, look at the risk of dying among people who do and do not take aspirin. One group will be more likely to die, by a given ratio, at any given time. Or, if the independent variable is continuous, then it’s the hazard ratio per unit of the independent variable. For example, we might look at how the hazard of divorce changes as the husband gets older, and the hazard ratio would then be per year. If the hazard ratio were 1.03 per year, then for 2 years the ratio would be 1.03*1.03 = 1.0609; for 3 years it would be 1.03^3 ≈ 1.093, and so on.
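The compounding arithmetic above can be sketched in a few lines. The function name is hypothetical, and the sketch assumes the per-unit ratio multiplies in the same way for every additional unit, which is how the per-year ratio is used in the example.

```python
def hazard_ratio_over(per_unit_hr: float, units: float) -> float:
    """Hazard ratio for a difference of `units`, given the ratio per single unit."""
    return per_unit_hr ** units


# Reproduce the 1.03-per-year example for a few age differences.
for years in (1, 2, 3, 10):
    print(f"{years} year(s): {hazard_ratio_over(1.03, years):.4f}")
```

Note that the ratios multiply rather than add: a 3% increase per year does not become a 30% increase over ten years, but somewhat more.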
Other output from survival analysis includes graphs, such as plots of estimated survival over time for different groups.
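Such graphs are typically drawn from a nonparametric estimate of the survival curve. As a rough illustration, here is a small pure-Python sketch of the Kaplan-Meier estimator, a standard choice that this article does not otherwise name. The function and the data are hypothetical; a real analysis would use a statistical package.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time where an event occurred.

    times  -- observed durations
    events -- True if the event was seen, False if the observation is censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        event_count = 0
        leaving = 0
        # Group all subjects that share the same observed time.
        while i < len(data) and data[i][0] == t:
            event_count += int(data[i][1])
            leaving += 1
            i += 1
        if event_count:
            # Multiply by the fraction of those at risk who got past time t.
            survival *= 1 - event_count / at_risk
            curve.append((t, survival))
        # Both events and censored subjects leave the risk set after t.
        at_risk -= leaving
    return curve


# Hypothetical follow-up times (years) for one small group.
times = [2, 3, 3, 5, 8]
events = [True, True, False, True, False]
for t, s in kaplan_meier(times, events):
    print(f"t={t}: estimated survival {s:.2f}")
```

Censored subjects never cause a drop in the curve, but they do count as "at risk" until the time they were last seen, which is how the method uses the partial information they carry.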
The proportional hazards assumption
The proportional hazards assumption is that the ratio of the hazards is constant over time. That is, if, say, smokers who are 30 years old have a hazard that is 1.1 times that of nonsmokers who are 30, then smokers who are 70 have a hazard that is 1.1 times that of nonsmokers who are 70. This is often reasonable, but sometimes it is not.
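To make the assumption concrete, the sketch below compares two made-up smoker hazards against a made-up nonsmoker hazard. The functional forms and numbers are purely hypothetical; they are chosen only so that one case satisfies proportional hazards and the other does not.

```python
def baseline_hazard(age):
    # Hypothetical nonsmoker hazard that rises with age.
    return 0.01 * age


def proportional_hazard(age):
    # Always 1.1 times the baseline: satisfies proportional hazards.
    return 1.1 * baseline_hazard(age)


def nonproportional_hazard(age):
    # Adds a fixed amount instead, so the ratio to baseline shrinks with age.
    return baseline_hazard(age) + 0.02


for age in (30, 50, 70):
    ratio_ph = proportional_hazard(age) / baseline_hazard(age)
    ratio_non = nonproportional_hazard(age) / baseline_hazard(age)
    print(f"age {age}: proportional {ratio_ph:.3f}, non-proportional {ratio_non:.3f}")
```

In the first case the ratio stays at 1.1 at every age even though both hazards are rising; in the second the ratio drifts toward 1, so a single hazard ratio would not describe the difference well.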
Further reading about survival analysis (increasing in complexity and rigor from top to bottom)
Survival Analysis: A Self-Learning Text – Kleinbaum et al. – a very good introduction
Survival Analysis Using SAS – Allison – quite dated but very good
SAS Survival Analysis Techniques for Medical Research – Cantor – the book I use most often
Modeling Survival Data: Extending the Cox Model – Therneau et al. – for complex cases