If you need to predict a time-based event, most common models, whether regression, classification, or survival, can get you there, but the quality, the type of answer, and the path taken will vary. A regression model returns a time estimate, a classification model returns the probability of an event at a given time, and a survival model returns the probability of an event not happening over various time frames. We’ll skip the regression model here, as we’re interested not only in the time estimate but also in the probability of an outcome; with a regression model you would have to first model the outcome and then figure out the time estimate.
A survival model needs two outcome variables: a time variable and an outcome/event variable. Every observation in the data set needs a time period. The event outcome, on the other hand, doesn’t need to be fully known, in contrast with a logistic regression or classification model, which requires training on a known outcome. Instead of needing a true/false, sick/healthy, or dead/alive label, a survival model uses the concept of the event: either it has happened, or we don’t know yet (the observation is censored).
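To make the time/event pairing concrete, here is a minimal sketch. The `toy` data frame and its values are made up for illustration and are not part of the data used in this post; it only assumes the survival package, which ships with R.

```r
library(survival)

# Hypothetical toy data: follow-up time in days and an event flag
toy <- data.frame(
  time  = c(30, 45, 60, 90),
  event = c(1, 0, 1, 0)  # 1 = event observed, 0 = not observed (censored)
)

# Surv() pairs each time with its event status;
# censored rows print with a trailing "+"
s <- Surv(toy$time, toy$event)
print(s)
```

Rows 2 and 4 still contribute their follow-up time to the model even though we never saw the event happen for them.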
Here we will use a random forest survival model, as it offers advantages such as capturing non-linear effects that a traditional model cannot, and it can easily be distributed over multiple cores. The two packages that I have used are ranger and randomForestSRC. We will focus on ranger, as it doesn’t require additional steps to get it to work on multiple cores. I have been able to work on fairly wide data sets using it.
Let’s set up our formula. Survival models require two values in the Surv function: the time period followed by the outcome. After the tilde we add our predictors, as is typically done with most modeling formulas.
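As a rough sketch of what this can look like, the block below builds a survival formula and fits a ranger forest. It uses the `veteran` data set from the survival package purely as a stand-in, so the column names (`time`, `status`, `trt`, `karno`, `age`) and the settings (`num.trees`, `num.threads`, `seed`) are assumptions for illustration, not the data or parameters used in this post.

```r
library(survival)
library(ranger)

# Stand-in data: 'veteran' from the survival package (NOT the data used in this post)
# time = follow-up time, status = 1 if the event was observed, 0 if censored
surv_formula <- Surv(time, status) ~ trt + karno + age

# Fit a survival forest; num.threads controls how many cores are used
fit <- ranger(
  surv_formula,
  data        = veteran,
  num.trees   = 50,
  num.threads = 2,
  seed        = 1
)

# Predicted survival curves: one row per observation,
# one column per entry of fit$unique.death.times
pred <- predict(fit, data = veteran[1:2, ])
dim(pred$survival)
```

Each row of `pred$survival` is a curve of survival probabilities that can be plotted against `fit$unique.death.times`.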
The plots represent the probability of survival, that is, of not reaching the event, over time. In these cases, the orange line has a much higher probability of not being diagnosed with AIDS or dying than the blue line. This can be confusing, but a survival model yields the probability of NOT reaching the event.
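If you would rather read the curve as the probability that the event has happened by a given time, take one minus the survival probability. A tiny sketch with made-up numbers (these values are illustrative only and are not model output from this post):

```r
# Hypothetical survival probabilities at a few time points (illustrative values)
times     <- c(30, 60, 90)
surv_prob <- c(0.95, 0.80, 0.60)

# Probability that the event has occurred by each time point
event_prob <- 1 - surv_prob

data.frame(times, surv_prob, event_prob)
```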
Let’s look at why the model may give the orange line (row 1) a higher probability of not reaching the event than the blue line (row 56):