Practical Application to Rehospitalization
Survival models are great for predicting the time until an event occurs. These models can be used in a wide variety of use cases, including predictive maintenance (forecasting when a machine is likely to break down), marketing analytics (anticipating customer churn), patient monitoring (predicting when a patient is likely to be re-hospitalized), and much more.
By combining machine learning with survival models, the resulting models can benefit from the high predictive power of the former while retaining the framework and typical outputs of the latter (such as the survival probability or hazard curve over time). For more information, check out the first article of this series here.
However, in practice, ML-based survival models still require extensive feature engineering, and thus prior business knowledge and intuition, to lead to satisfactory results. So, why not use deep learning models instead to bridge the gap?
Objective
This article focuses on how deep learning can be combined with the survival analysis framework to solve use cases such as predicting the likelihood of a patient being (re)hospitalized.
After reading this article, you will understand:
- How can deep learning be leveraged for survival analysis?
- What are the common deep learning models in survival analysis and how do they work?
- How can these models be applied concretely to hospitalization forecasting?
This article is the second part of the series on survival analysis. If you are not familiar with survival analysis, it is best to start by reading the first one here. The experiments described in this article were carried out using the scikit-survival, pycox, and plotly libraries. You can find the code here on GitHub.
1.1. Problem statement
Let’s start by describing the problem at hand.
We are interested in predicting the likelihood that a given patient will be rehospitalized, given the available information about their health status. More specifically, we would like to estimate this probability at different time points after the last visit. Such an estimate is essential to monitor patient health and mitigate the risk of relapse.
This is a typical survival analysis problem. The data consists of 3 elements:
Patient’s baseline data including:
- Demographics: age, gender, locality (rural or urban)
- Patient history: smoking, alcohol, diabetes mellitus, hypertension, etc.
- Laboratory results: hemoglobin, total lymphocyte count, platelets, glucose, urea, creatinine, etc.
- More information about the source dataset here.
A time t and an event indicator δ ∈ {0, 1} (a small construction sketch follows this list):
- If the event occurs during the observation period, t is equal to the time between the moment the data were collected and the moment the event (i.e., rehospitalization) is observed. In that case, δ = 1.
- If not, t is equal to the time between the moment the data were collected and the last contact with the patient (e.g., end of study). In that case, δ = 0.
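To make this concrete, here is a minimal sketch of how the pair (t, δ) could be built from raw follow-up dates. The column names are hypothetical and only serve as an illustration.
# Hypothetical columns: building the duration t and the event indicator δ per patient
import pandas as pd
df = pd.DataFrame({
    "collection_date": pd.to_datetime(["2020-01-10", "2020-02-01"]),
    "rehospitalization_date": pd.to_datetime(["2020-03-15", pd.NaT]),  # NaT = not rehospitalized
    "last_contact_date": pd.to_datetime(["2020-04-01", "2020-06-30"]),
})
df["event"] = df["rehospitalization_date"].notna().astype(int)            # δ
end_date = df["rehospitalization_date"].fillna(df["last_contact_date"])
df["duration"] = (end_date - df["collection_date"]).dt.days               # t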
⚠️ With this description, why use survival analysis methods when the problem is so similar to a regression task? The initial paper gives a pretty good explanation of the main reason:
“If one chooses to use standard regression methods, the right-censored data becomes a type of missing data. It is usually removed or imputed, which may introduce bias into the model. Therefore, modeling right-censored data requires special attention, hence the use of a survival model.” Source [2]
1.2. DeepSurv
Approach
Let’s move on to the theoretical part with a little refresher on the hazard function.
“The hazard function is the probability an individual will not survive an extra infinitesimal amount of time δ, given they have already survived up to time t. Thus, a greater hazard signifies a greater risk of death.”
Source [2]
Similar to the Cox proportional hazards (CPH) model, DeepSurv is based on the assumption that the hazard function is the product of two functions:
- the baseline hazard function: λ_0(t)
- the risk score, r(x) = exp(h(x)), which models how the hazard of a given individual deviates from the baseline, based on the observed covariates.
More on CPH models in the first article of this series.
The function h(x) is commonly referred to as the log-risk function, and this is precisely the function that the DeepSurv model aims to model.
In fact, CPH models assume that h(x) is a linear function: h(x) = β · x. Fitting the model thus consists of estimating the weights β that optimize the objective function. However, the linear proportional hazards assumption does not hold in many applications. This justifies the need for a more complex, non-linear model that is ideally capable of handling large volumes of data.
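To make the distinction concrete, here is a small illustrative sketch (not taken from the paper's code) that contrasts the linear log-risk of a CPH model with a non-linear, neural-network-based log-risk; the layer sizes are arbitrary.
# Illustration: linear log-risk (CPH) vs. neural log-risk (DeepSurv-style)
import torch
import torch.nn as nn

n_features = 10                                        # assumed number of covariates
cph_log_risk = nn.Linear(n_features, 1, bias=False)    # h(x) = β · x
deepsurv_log_risk = nn.Sequential(                     # h(x) = MLP(x)
    nn.Linear(n_features, 32), nn.ReLU(), nn.Dropout(0.1),
    nn.Linear(32, 32), nn.ReLU(), nn.Dropout(0.1),
    nn.Linear(32, 1),
)
x = torch.randn(4, n_features)                         # dummy batch of 4 patients
print(cph_log_risk(x).shape, deepsurv_log_risk(x).shape)   # both torch.Size([4, 1])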
Architecture
In this context, how can the DeepSurv model provide a better alternative? Let’s start by describing it. According to the original paper, it’s a “deep feed-forward neural network which predicts the effects of a patient’s covariates on their hazard rate parameterized by the weights of the network θ.” [2]
How does it work?
‣ The input to the network is the baseline data x.
‣ The network propagates the inputs through a number of hidden layers with weights θ. The hidden layers consist of fully-connected layers with nonlinear activation functions, followed by dropout.
‣ The final layer is a single node that performs a linear combination of the hidden features. The output of the network is taken as the predicted log-risk function.
Source [2]
As a result of this architecture, the model is very flexible. Hyperparameter search techniques are typically used to determine the number of hidden layers, the number of nodes in each layer, the dropout probability, and other settings.
What about the objective function to optimize?
- CPH models are trained to optimize the Cox partial likelihood. It consists of calculating for each patient i at time Ti the probability that the event has happened, considering all the individuals still at risk at time Ti, and then multiplying all these probabilities together. You can find the exact mathematical formula here [2].
- Similarly, the objective function of DeepSurv is the average negative log of the same partial likelihood, with an additional term that regularizes the network weights [2] (a simplified sketch follows this list).
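For illustration, here is a simplified sketch of this loss (ignoring ties and the regularization term); pycox provides its own optimized implementation.
# Simplified negative log partial likelihood (no ties handling, no regularization)
import torch

def neg_log_partial_likelihood(log_risk, durations, events):
    # Sort by duration in descending order: the risk set of patient i is then
    # simply all patients appearing at or before position i.
    order = torch.argsort(durations, descending=True)
    log_risk, events = log_risk[order].squeeze(-1), events[order]
    log_risk_set = torch.logcumsumexp(log_risk, dim=0)  # log Σ exp(h(x_j)) over the risk set
    # Only uncensored patients (events == 1) contribute to the likelihood
    return -((log_risk - log_risk_set) * events).sum() / events.sum()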
Code sample
Here is a small code snippet to get an idea of how this type of model is implemented using the pycox library. The complete code can be found in the notebook examples of the library here [6].
import torchtuples as tt
from pycox.models import CoxPH
from pycox.evaluation import EvalSurv

# Step 1: Neural net
# Simple MLP with two hidden layers, ReLU activations, batch norm and dropout
in_features = x_train.shape[1]
num_nodes = [32, 32]
out_features = 1
batch_norm = True
dropout = 0.1
output_bias = False
net = tt.practical.MLPVanilla(in_features, num_nodes, out_features, batch_norm,
                              dropout, output_bias=output_bias)
model = CoxPH(net, tt.optim.Adam)

# Step 2: Model training
batch_size = 256
epochs = 512
callbacks = [tt.callbacks.EarlyStopping()]
verbose = True
model.optimizer.set_lr(0.01)
log = model.fit(x_train, y_train, batch_size, epochs, callbacks, verbose,
                val_data=val, val_batch_size=batch_size)

# Step 3: Prediction
_ = model.compute_baseline_hazards()
surv = model.predict_surv_df(x_test)

# Step 4: Evaluation
ev = EvalSurv(surv, durations_test, events_test, censor_surv='km')
ev.concordance_td()
1.3. DeepHit
Approach
Instead of making strong assumptions about the distribution of survival times, what if we could train a deep neural network that would learn them directly?
This is the case with the DeepHit model. In particular, it brings two significant improvements over previous approaches:
- It does not rely on any assumptions about the underlying stochastic process. Thus, the network learns to model the evolution over time of the relationship between the covariates and the risk.
- It can handle competing risks (e.g., simultaneously modeling the risks of being rehospitalized and dying) through a multi-task learning architecture.
Architecture
As described here [3], DeepHit follows the common architecture of multi-task learning models. It consists of two main parts:
- A shared subnetwork, where the model learns from the data a general representation useful for all the tasks.
- Task-specific subnetworks, where the model learns more task-specific representations.
However, the architecture of the DeepHit model differs from typical multi-task learning models in two aspects:
- It includes a residual connection between the initial covariates and the input of the task-specific subnetworks.
- It uses only one softmax output layer. Thanks to this, the model learns the joint distribution of the competing events rather than their marginal distributions.
The figures below show the case where the model is trained simultaneously on two tasks.
The output of the DeepHit model is a vector y for every subject. It gives the probability that the subject will experience the event k ∈ {1, 2} at every timestamp t within the observation time.
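Below is a small sketch of how the single-risk variant can be set up with pycox (DeepHitSingle); the hyperparameter values are illustrative, and the variables x_train, x_val, x_test, durations_* and events_* are assumed to be prepared as in the DeepSurv snippet above. The full competing-risks variant additionally requires cause-specific subnetworks.
# Sketch: single-risk DeepHit with pycox (illustrative hyperparameters)
import torchtuples as tt
from pycox.models import DeepHitSingle

# DeepHit works on a discretized time axis: labels are mapped to a grid of intervals
num_durations = 10
labtrans = DeepHitSingle.label_transform(num_durations)
y_train = labtrans.fit_transform(durations_train, events_train)
y_val = labtrans.transform(durations_val, events_val)

# One output node per time interval; the softmax turns them into probabilities
net = tt.practical.MLPVanilla(x_train.shape[1], [32, 32], labtrans.out_features,
                              batch_norm=True, dropout=0.1)
model = DeepHitSingle(net, tt.optim.Adam, alpha=0.2, sigma=0.1,
                      duration_index=labtrans.cuts)
model.optimizer.set_lr(0.01)
model.fit(x_train, y_train, batch_size=256, epochs=512,
          callbacks=[tt.callbacks.EarlyStopping()],
          val_data=(x_val, y_val))
surv = model.predict_surv_df(x_test)   # survival probabilities over the time grid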
2.1. Methodology
Data
The data set was divided into three parts: a training set (60% of the data), a validation set (20%), and a test set (20%). The training and validation sets were used to optimize the neural networks during training and the test set for final evaluation.
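As an illustration, such a 60/20/20 split can be obtained with scikit-learn (the random seed and variable names below are assumptions):
# 60% train / 20% validation / 20% test split (illustrative)
from sklearn.model_selection import train_test_split

x_train, x_tmp, y_train, y_tmp = train_test_split(x, y, test_size=0.4, random_state=0)
x_val, x_test, y_val, y_test = train_test_split(x_tmp, y_tmp, test_size=0.5, random_state=0)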
Benchmark
The performance of the deep learning models was compared to a benchmark of models including CoxPH and ML-based survival models (Gradient Boosting and SVM). More information on these models is available in the first article of the series.
Metrics
Two metrics were used to evaluate the models:
- Concordance index (C-index): it measures the capability of the model to provide a reliable ranking of survival times based on individual risk scores. It is computed as the proportion of concordant pairs in a dataset.
- Brier score: it is a time-dependent extension of the mean squared error to right-censored data. In other words, it represents the average squared distance between the observed survival status and the predicted survival probability (a small computation sketch follows this list).
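Both metrics can be obtained from pycox's EvalSurv object created in the DeepSurv snippet above; the time grid below is an assumption.
# C-index and Brier score with pycox's EvalSurv (time grid is illustrative)
import numpy as np

c_index = ev.concordance_td()                                   # time-dependent C-index
time_grid = np.linspace(durations_test.min(), durations_test.max(), 100)
brier_t = ev.brier_score(time_grid)                             # Brier score as a function of time
ibs = ev.integrated_brier_score(time_grid)                      # integrated Brier score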
2.2. Results
In terms of C-index, the performance of the deep learning models is considerably better than that of the ML-based survival analysis models. Moreover, there is almost no difference between the performance of the DeepSurv and DeepHit models.
In terms of Brier score, the DeepSurv model stands out from the others.
- When examining the Brier score as a function of time, the curve of the DeepSurv model lies below the others, which reflects better accuracy.
- This observation is confirmed when integrating the score over the same time interval (integrated Brier score).
Note that the Brier score wasn't computed for the SVM, as this score only applies to models that can estimate a survival function.
Finally, deep learning models can be used for survival analysis just as statistical models can. Here, for instance, we can look at the survival curves of randomly selected patients, as illustrated below. Such outputs can bring many benefits, in particular allowing a more effective follow-up of the patients who are most at risk.
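For example, here is one way such curves could be plotted from the predicted survival DataFrame (pandas plotting is used here for brevity; the article's figures were made with plotly):
# Plot the predicted survival curves of a few randomly selected test patients
import numpy as np

sample_ids = np.random.choice(surv.columns, size=5, replace=False)
ax = surv[sample_ids].plot(drawstyle='steps-post')
ax.set_xlabel('Time since last visit')
ax.set_ylabel('Probability of no rehospitalization')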
✔️ Survival models are very useful for predicting the time it takes for an event to occur.
✔️ They can help address many use cases by providing a learning framework and techniques as well as useful outputs such as the probability of survival or the hazard curve over time.
✔️ They are even indispensable in this type of use case to exploit all the data, including censored observations (e.g., when the event does not occur during the observation period).
✔️ ML-based survival models tend to perform better than statistical models (more information here). However, they require high-quality feature engineering based on solid business intuition to achieve satisfactory results.
✔️ This is where Deep Learning can bridge the gap. Deep learning-based survival models like DeepSurv or DeepHit have the potential to perform better with less effort!
✔️ Nevertheless, these models are not without drawbacks: they require a large amount of training data and the fine-tuning of multiple hyperparameters.
[1] Bollepalli, S.C.; Sahani, A.K.; Aslam, N.; Mohan, B.; Kulkarni, K.; Goyal, A.; Singh, B.; Singh, G.; Mittal, A.; Tandon, R.; Chhabra, S.T.; Wander, G.S.; Armoundas, A.A. An Optimized Machine Learning Model Accurately Predicts In-Hospital Outcomes at Admission to a Cardiac Unit. Diagnostics 2022, 12, 241.
[2] Katzman, J., Shaham, U., Bates, J., Cloninger, A., Jiang, T., & Kluger, Y. (2016). DeepSurv: Personalized Treatment Recommender System Using A Cox Proportional Hazards Deep Neural Network, ArXiv
[3] Laura Löschmann, Daria Smorodina, Deep Learning for Survival Analysis, Seminar information systems (WS19/20), February 6, 2020
[4] Lee, Changhee et al. DeepHit: A Deep Learning Approach to Survival Analysis With Competing Risks. AAAI Conference on Artificial Intelligence (2018).
[5] Wikipedia, Proportional hazards model
[6] Pycox library