Survival Analysis of Distant Metastasis in Breast Cancer Patients

Disclaimer

This work is for learning purposes only. It cannot be used for publications, commercial products, or similar purposes without the mentor’s consent.

Contents

  1. Objective
  2. Dataset
  3. Survival Function
  4. Implementation of Predictive Models
  5. COBRA Implementation and Results
  6. Directory Structure
  7. References

Objective

We use the TRANSBIG dataset to predict the survival function for distant metastasis in breast cancer patients, using survival analysis and combined regression strategies.

Dataset

  • The analysis uses data from the TRANSBIG validation study of 198 patients.
  • The TRANSBIG data is obtained by joining two datasets: Patient Characteristic and Diagnostic Details, and Gene Features.
  • Patient Characteristic and Diagnostic Details contains the clinical information of the patients.
  • The observations in this dataset are censored: for some patients, the event of interest had not occurred by the time the data was collected or analyzed.
  • Gene Features contains 22,283 gene features for each patient.
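The join and the censoring flag described above can be illustrated with a minimal sketch. The column names (`sample_id`, `time_days`, `event`, `gene_a`, `gene_b`) and the values are hypothetical and are not taken from the TRANSBIG files; the actual notebooks may structure the data differently.

```python
# Hypothetical miniature tables; real TRANSBIG column names and values differ.
clinical = [
    {"sample_id": "S1", "time_days": 1200, "event": 1},  # metastasis observed
    {"sample_id": "S2", "time_days": 3000, "event": 0},  # censored: no event yet
    {"sample_id": "S3", "time_days": 800, "event": 1},   # no gene row below
]
genes = [
    {"sample_id": "S1", "gene_a": 7.1, "gene_b": 5.4},
    {"sample_id": "S2", "gene_a": 6.3, "gene_b": 5.9},
]


def inner_join(left, right, key):
    """Keep only rows whose key appears in both tables and merge their columns."""
    right_by_key = {row[key]: row for row in right}
    joined = []
    for row in left:
        match = right_by_key.get(row[key])
        if match is not None:
            joined.append({**row, **match})
    return joined


joined = inner_join(clinical, genes, "sample_id")
# Patients whose follow-up ended before the event was observed.
censored_ids = [row["sample_id"] for row in joined if row["event"] == 0]
```

Each joined row carries both the follow-up time with its event flag and the gene features, which is the shape a survival model typically expects.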

What is a Survival Function?

By definition, the survival function gives the probability that a patient survives beyond a specified time. Mathematically, let T be a continuous random variable with pdf f(t) and cdf F(t), denoting the time to distant metastasis. Then F(t) = P(T ≤ t) is the probability that the patient has suffered distant metastasis by time t, and the survival function S(t) = P(T > t) = 1 - F(t) is the probability that the patient has not.

image
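Because the observations are censored, S(t) cannot be estimated simply as the fraction of patients with T > t. A standard nonparametric estimate is the Kaplan-Meier product-limit estimator. The sketch below is only an illustration on toy data; this README does not state that Kaplan-Meier is used in the project, and the function name `kaplan_meier` is introduced here for the example.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct observed event time.

    events[i] is 1 if the event was observed at times[i], 0 if censored.
    """
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    survival = 1.0
    curve = []
    for t in event_times:
        # Patients still under observation just before time t.
        at_risk = sum(1 for u in times if u >= t)
        # Patients whose event was observed exactly at time t.
        observed = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        survival *= 1.0 - observed / at_risk
        curve.append((t, survival))
    return curve


# Toy follow-up times in days (not TRANSBIG data).
times = [5, 8, 8, 12, 15]
events = [1, 1, 0, 1, 0]
curve = kaplan_meier(times, events)
```

Censored patients count toward the at-risk set until their follow-up ends, but never as events, so the estimate stays non-increasing in t.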

Implementation of Predictive Models

Multiple ML models are applied to predict the survival function:

  • Support Vector Machine (svm.SVC)
  • KNeighborsClassifier
  • DecisionTreeClassifier
  • Gaussian Naive Bayes
  • LinearDiscriminantAnalysis

In addition, we implemented Random Survival Forest to predict the survival function in two ways:

  • Using the Python library Random Survival Forest.
  • A native implementation that defines indicator variables for various values of t, uses regression models to predict them, and computes the survival function from these predictions as in the formula above.
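The indicator-variable idea can be sketched as follows: for each time t on a grid, define the indicator 1 if T > t and 0 otherwise, regress it on the features, and read the prediction as an estimate of S(t). This is a minimal sketch under strong assumptions that the README does not confirm: a single hypothetical feature, ordinary least squares as the regression model, and no special handling of censored patients.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return slope, mean_y - slope * mean_x


def survival_by_indicators(xs, times, grid, x_new):
    """Estimate S(t | x_new) on a grid by regressing the indicator 1[T > t]."""
    curve = []
    for t in grid:
        indicators = [1.0 if T > t else 0.0 for T in times]
        slope, intercept = fit_line(xs, indicators)
        prediction = slope * x_new + intercept
        # Clip so the estimate is a valid probability.
        curve.append((t, min(1.0, max(0.0, prediction))))
    return curve


# Toy data: a hypothetical risk score and event times in days.
xs = [0.1, 0.4, 0.5, 0.9]
times = [900, 600, 500, 200]
grid = [100, 400, 700, 1000]
curve = survival_by_indicators(xs, times, grid, x_new=0.3)
```

Since each grid point is fitted independently, the resulting curve is not guaranteed to be non-increasing in general; a practical implementation might enforce monotonicity afterwards.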

COBRA Implementation and Results

COBRA is implemented from scratch as a Python class that combines the predictions of several models through voting.
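The aggregation rule from Biau et al. can be sketched as follows: a held-out point contributes to the prediction for a query x only if enough base models predict, for that point, a value within epsilon of their prediction for x; the responses of the selected points are then averaged. The function below is a simplified illustration and not the class in `cobra_estimator.py`. The names `cobra_predict`, `epsilon`, and `alpha`, the two toy models, and the fallback to the overall mean when no point is selected are all assumptions of this sketch.

```python
def cobra_predict(machines, agg_X, agg_y, x, epsilon, alpha=1.0):
    """COBRA-style prediction for a single query point x.

    machines: already-fitted base models, each a callable x -> prediction.
    agg_X, agg_y: held-out points used only for aggregation.
    alpha: fraction of machines that must agree (1.0 means unanimity).
    """
    query_preds = [m(x) for m in machines]
    needed = alpha * len(machines)
    selected = []
    for xi, yi in zip(agg_X, agg_y):
        # A machine "votes" for xi if its prediction there is close to its
        # prediction at the query point.
        votes = sum(
            1 for m, q in zip(machines, query_preds) if abs(m(xi) - q) <= epsilon
        )
        if votes >= needed:
            selected.append(yi)
    if not selected:
        # Fallback chosen for this sketch: the mean of all held-out responses.
        return sum(agg_y) / len(agg_y)
    return sum(selected) / len(selected)


# Two hypothetical pre-trained base models and toy held-out data.
machines = [lambda v: 2.0 * v, lambda v: 1.8 * v + 0.5]
agg_X = [1.0, 2.0, 3.0, 4.0]
agg_y = [2.1, 3.9, 6.2, 8.0]
prediction = cobra_predict(machines, agg_X, agg_y, x=2.1, epsilon=0.5)
```

The base models are used only to decide which held-out points count as neighbours of the query; the prediction itself is an average of observed responses rather than of model outputs.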

Applying Random Survival Forest yielded an R^2 value of 0.65. Applying COBRA increased the R^2 value to 0.77.
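For reference, R^2 compares the squared error of a model with that of always predicting the mean. The sketch below shows the usual definition on toy numbers; the README does not specify which targets and predictions the reported values were computed on, so the data here is purely illustrative.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values, not project results.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 3.0, 3.5]
score = r_squared(y_true, y_pred)
```

A value of 1 means perfect predictions, and 0 means the model does no better than predicting the mean of the observed values.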

In addition to the R^2 analysis, we plot the survival function against time in days. The resulting curve agrees with the expected mathematical behavior of a survival function and with the Python implementation of Random Survival Forest.

COBRA:

image

Random Survival Forest:

image

Directory Tree:

MA691-COBRA-3
│
├───README.md
│
├───Documentation  // TRANSBIG Dataset README and other documentation
│
├───Data
│   │   GSE7390_family.soft.gz    // TRANSBIG Dataset
│   │   cleaned_data.csv          // Cleaned data with all patient characteristics and gene features
│   │   gene.csv                  // List of 76 most important genes to analyze time to distant metastasis
│   │   selected.csv              // Subset of cleaned data containing only 76 important gene features and patient characteristics
│
├───Literature   // Various research papers that we used and implemented in our work
│
├───Notebooks
│   │   cobra.ipynb            // COBRA Implementation
│   │   data_cleaning.ipynb    // Cleaning and extracting relevant data from TRANSBIG zip
│   │   indicators.ipynb       // Native Implementation to predict Survival Function
│   │   regression.ipynb       // Application of various regression models to predict time to distant metastasis
│   │   survival.ipynb         // Application of Random Survival Forest of Scikit-learn
│
└───Scripts
    │   cobra_estimator.py     // Definition of Cobra Class
    │   data_clean.py          // Script to extract, clean and save relevant data in a new csv file
    │   main.py                // Main script which calls the COBRA model
    │   random_survival.py     // Script that runs in-built Random Survival Forest

References

  1. Biau, Gérard, et al. "COBRA: A combined regression strategy." Journal of Multivariate Analysis 146 (2016): 18-28.

  2. Hikichi, Shiori, Masahiro Sugimoto, and Masaru Tomita. "Correlation-centred variable selection of a gene expression signature to predict breast cancer metastasis." Scientific Reports 10.1 (2020): 1-8.

  3. Wang, Yixin, et al. "Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer." The Lancet 365.9460 (2005): 671-679.

  4. Ishwaran, Hemant, et al. "Random survival forests." The Annals of Applied Statistics 2.3 (2008): 841-860.

  5. Desmedt, Christine, et al. "Strong time dependence of the 76-gene prognostic signature for node-negative breast cancer patients in the TRANSBIG multicenter independent validation series." Clinical Cancer Research 13.11 (2007): 3207-3214.

  6. Hazard Function

  7. Survival Analysis

  8. Hazard and cumulative hazard plotting

  9. Intuition for cumulative hazard function