This work is for learning purposes only. It cannot be used for publications, commercial products, etc. without the mentor's consent.
- Objective
- Dataset
- Survival Function
- Implementation of Predictive Models
- COBRA Implementation and Results
- Directory Structure
- References
We use the TRANSBIG dataset to predict the survival function for distant metastasis in breast cancer patients using survival analysis and combined regression strategies.
- The data from the TRANSBIG validation study of 198 patients is used to perform the analysis.
- In TRANSBIG, the datasets Patient Characteristic and Diagnostic Details and Gene Features are joined.
- Patient Characteristic and Diagnostic Details contains the clinical information of the patients.
- The observations in this dataset are censored, in the sense that for some units the event of interest had not occurred at the time the data was collected or analyzed.
- The Gene Features dataset contains the information on 22283 gene features of the patients.
By definition, the survival function gives the probability that a patient will survive (here, remain free of distant metastasis) beyond any specified time. Mathematically, let T be a continuous random variable for the time to distant metastasis, with pdf f(t) and cdf F(t). Then F(t) = P(T ≤ t) is the probability that the patient has suffered distant metastasis by time t, and the survival function is its complement: S(t) = P(T > t) = 1 - F(t).
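To make the definition concrete for censored data, here is a minimal sketch of the Kaplan-Meier estimator of S(t). This is a standard textbook estimator and is not necessarily what the notebooks in this repository use; the function name `kaplan_meier` and the toy data are illustrative assumptions.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time where an event was observed.

    times  : observed time per patient (event time or censoring time)
    events : 1 if the event was observed at that time, 0 if censored
    """
    observed = Counter(t for t, e in zip(times, events) if e)
    leaving = Counter(times)  # patients leaving the risk set at each time
    at_risk = len(times)
    surv = 1.0
    curve = []
    for t in sorted(leaving):
        d = observed.get(t, 0)
        if d:
            # Multiply by the conditional probability of surviving past t
            surv *= 1.0 - d / at_risk
            curve.append((t, surv))
        # Both events and censored patients leave the risk set after t
        at_risk -= leaving[t]
    return curve


# Toy data (hypothetical, not taken from TRANSBIG): times in days
times = [5, 8, 8, 12, 15, 20]
events = [1, 1, 0, 1, 0, 1]  # 0 marks a censored observation
curve = kaplan_meier(times, events)
for t, s in curve:
    print(f"S({t}) = {s:.3f}")
```

Censored patients contribute to the risk set up to their censoring time but never count as events, which is why they cannot simply be dropped or treated as event times.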
Multiple ML models are applied to predict the survival function:
- Support Vector Machine (svm.SVC)
- KNeighborsClassifier
- DecisionTreeClassifier
- Gaussian Naive Bayes
- LinearDiscriminantAnalysis
In addition, we implemented Random Survival Forest to predict the survival function in two ways:
- Using the Python library Random Survival Forest
- A native implementation that defines indicator variables for various values of t, uses regression models to predict them, and calculates the survival function from these predictions as defined above.
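The indicator-variable idea can be sketched as follows: for each t on a grid, the indicator 1[T > t] is regressed on the patient features, and the predicted value is read as an estimate of S(t | x). This is only an illustration under simplifying assumptions: a single numeric feature, a hand-written k-nearest-neighbour regressor standing in for the regression models, and no censoring. All names and data here are hypothetical.

```python
def knn_regress(train_x, train_y, x, k=3):
    """Mean target of the k training points nearest to x (1-D feature)."""
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)


def predict_survival(train_x, train_times, x, grid, k=3):
    """Estimate S(t | x) for each t in grid by regressing 1[T > t] on x."""
    curve = []
    prev = 1.0
    for t in grid:
        # Indicator variable: 1 if the patient was still event-free after t
        indicators = [1.0 if T > t else 0.0 for T in train_times]
        s = knn_regress(train_x, indicators, x, k)
        # Keep the curve non-increasing (a safeguard for other regressors)
        prev = min(prev, s)
        curve.append(prev)
    return curve


# Toy data (hypothetical): one risk score per patient, event times in days
train_x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
train_times = [900, 800, 700, 300, 200, 100]
grid = [0, 150, 250, 350]
curve = predict_survival(train_x, train_times, x=5.0, grid=grid)
print(list(zip(grid, curve)))
```

With censored data the indicators are not known for every patient at every t, so a real implementation needs a rule for those cases; this sketch does not address that.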
COBRA is implemented from scratch as a Python class that combines the predictions of various models through voting.
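For orientation, here is a minimal sketch of the consensus rule described by Biau et al.: a training point contributes to the prediction for x when enough machines place it within `epsilon` of their prediction for x. It is not the class defined in `cobra_estimator.py`. The class layout, the pre-fitted toy machines, and the fallback used when no point is selected are all assumptions of this sketch.

```python
class Cobra:
    """Minimal COBRA-style aggregator over already-fitted machines."""

    def __init__(self, machines, epsilon, alpha=1.0):
        self.machines = machines  # list of callables: x -> prediction
        self.epsilon = epsilon    # closeness threshold on predictions
        self.alpha = alpha        # fraction of machines that must agree

    def fit(self, xs, ys):
        # Store each machine's prediction on the aggregation set
        self.ys = list(ys)
        self.preds = [[m(x) for m in self.machines] for x in xs]
        return self

    def predict(self, x):
        query = [m(x) for m in self.machines]
        needed = self.alpha * len(self.machines)
        selected = []
        for preds_i, y in zip(self.preds, self.ys):
            # A machine "votes" for point i if it predicts similarly for i and x
            votes = sum(abs(p - q) <= self.epsilon for p, q in zip(preds_i, query))
            if votes >= needed:
                selected.append(y)
        if not selected:
            # Fallback chosen for this sketch: average the machines directly
            return sum(query) / len(query)
        return sum(selected) / len(selected)


# Toy machines and data (hypothetical); real machines would be fitted models
machines = [lambda x: 2 * x, lambda x: 2 * x + 1, lambda x: x * x]
xs = [0, 1, 2, 3, 4]
ys = [0.1, 2.2, 3.9, 6.1, 8.0]

strict = Cobra(machines, epsilon=2.0, alpha=1.0).fit(xs, ys)
relaxed = Cobra(machines, epsilon=2.0, alpha=0.5).fit(xs, ys)
print(strict.predict(2), relaxed.predict(2))
```

Lowering `alpha` lets a point through when only some machines agree, which widens the neighbourhood being averaged; `epsilon` and `alpha` are usually tuned on held-out data.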
Applying Random Survival Forest yielded an R^2 value of 0.65. Later, applying COBRA increased the R^2 value to 0.77.
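For reference, R^2 compares the squared error of the predictions with that of always predicting the mean. A minimal sketch of the usual definition follows; the function name and the toy numbers are illustrative and are not the project's results.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values (hypothetical)
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 3.0, 3.5]
score = r2_score(y_true, y_pred)
print(score)
```

A value of 1 means perfect predictions, 0 means no better than predicting the mean, and negative values are possible for models that do worse than the mean.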
In addition to the R^2 analysis, the survival function is plotted against time in days. Its shape agrees with the expected mathematical behavior and with the Python implementation of Random Survival Forest.
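Plot of the survival function against time in days, COBRA.
Plot of the survival function against time in days, Random Survival Forest.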
MA691-COBRA-3
│
├───README.md
│
├───Documentation    // TRANSBIG dataset README and other documentation
│
├───Data
│   │   GSE7390_family.soft.gz    // TRANSBIG dataset
│   │   cleaned_data.csv          // Cleaned data with all patient characteristics and gene features
│   │   gene.csv                  // List of the 76 most important genes for analyzing time to distant metastasis
│   │   selected.csv              // Subset of the cleaned data containing only the 76 important gene features and patient characteristics
│
├───Literature       // Research papers that we used and implemented in our work
│
├───Notebooks
│   │   cobra.ipynb           // COBRA implementation
│   │   data_cleaning.ipynb   // Cleaning and extracting relevant data from the TRANSBIG zip
│   │   indicators.ipynb      // Native implementation to predict the survival function
│   │   regression.ipynb      // Application of various regression models to predict time to distant metastasis
│   │   survival.ipynb        // Application of Random Survival Forest of Scikit-learn
│
└───Scripts
    │   cobra_estimator.py    // Definition of the Cobra class
    │   data_clean.py         // Script to extract, clean and save relevant data in a new csv file
    │   main.py               // Main script which calls the COBRA model
    │   random_survival.py    // Script that runs the built-in Random Survival Forest
- Biau, Gérard, et al. "COBRA: A combined regression strategy." Journal of Multivariate Analysis 146 (2016): 18-28.
- Hikichi, Shiori, Masahiro Sugimoto, and Masaru Tomita. "Correlation-centred variable selection of a gene expression signature to predict breast cancer metastasis." Scientific Reports 10.1 (2020): 1-8.
- Wang, Yixin, et al. "Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer." The Lancet 365.9460 (2005): 671-679.
- Ishwaran, Hemant, et al. "Random survival forests." The Annals of Applied Statistics 2.3 (2008): 841-860.
- Desmedt, Christine, et al. "Strong time dependence of the 76-gene prognostic signature for node-negative breast cancer patients in the TRANSBIG multicenter independent validation series." Clinical Cancer Research 13.11 (2007): 3207-3214.