https://www.kaggle.com/competitions/open-problems-single-cell-perturbations
Download the data from the Kaggle competition and populate the data folder as per the file structure below.
pip install -r requirements.txt
Project file structure (the data folder must match the layout below):
├── data
│   ├── adata_excluded_ids.csv
│   ├── adata_obs_meta.csv
│   ├── adata_train.parquet
│   ├── de_df.csv
│   ├── de_train_clustered.parquet
│   ├── de_train.parquet
│   ├── de_train_updated.parquet
│   ├── id_map.csv
│   ├── lincs_id_compound_mapping.parquet
│   ├── model_predictions_vs_actual.csv
│   ├── multiome_obs_meta.csv
│   ├── multiome_train.parquet
│   ├── multiome_var_meta.csv
│   └── sample_submission.csv
├── encoders
├── models
├── nn_auto_rev2
│   ├── data_preprocessing.py
│   ├── main.py
│   ├── model.py
│   ├── train.py
│   └── utils.py
├── nn_only_src
│   ├── data_processing.py
│   ├── evaluation.py
│   ├── main.py
│   ├── model.py
│   └── training.py
├── output
├── output.txt
├── README.md
└── requirements.txt
Our predictive model, implemented in `model.py`, is a transformer-based neural network, `TransformerNN`, developed using PyTorch.

- `TransformerNN`: A subclass of PyTorch's `nn.Module`, `TransformerNN` features multi-head attention, a configurable number of layers, and a tunable dropout rate. It is designed to capture cell responses to different chemical compounds (see the sketch after this list).
- Sparse Features & Target Encoding: Distinct representations are used for target encoding and sparse features, encoding cell-type and chemical interactions.
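For orientation, a minimal sketch of what such a class can look like. The layer widths, head count, and the length-one sequence trick are illustrative assumptions, not the exact hyperparameters in `model.py`:

```python
import torch
import torch.nn as nn

class TransformerNN(nn.Module):
    """Transformer-based regressor; dimensions here are placeholders."""

    def __init__(self, n_features, n_targets, d_model=256,
                 n_heads=8, n_layers=4, dropout=0.3):
        super().__init__()
        # Project the encoded (cell type, compound) features into d_model.
        self.input_proj = nn.Linear(n_features, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dropout=dropout, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        self.dropout = nn.Dropout(dropout)
        # Regress one value per gene in the differential-expression target.
        self.head = nn.Linear(d_model, n_targets)

    def forward(self, x):
        # x: (batch, n_features) -> add a length-1 sequence dimension.
        h = self.input_proj(x).unsqueeze(1)
        h = self.encoder(h).squeeze(1)
        return self.head(self.dropout(h))
```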
Outlined in `training.py`, our training methodology includes:

- Data Split: Using sklearn's `train_test_split` to create training and validation sets.
- Training Mechanics: The `train_model` function manages the training epochs, learning rate, and device setup.
- Optimization & Learning Rate Adjustment: The Adam optimizer and PyTorch's `ReduceLROnPlateau` scheduler are used, along with the Huber loss function for stability and reduced sensitivity to outliers (a sketch of the loop follows this list).
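A hedged sketch of such a loop, assuming NumPy feature/target arrays and full-batch updates for brevity (the real `train_model` presumably iterates over minibatches and differs in hyperparameters):

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.model_selection import train_test_split

def train_model(model, X, y, epochs=100, lr=1e-3, device="cpu"):
    # Hold out a validation split for the LR scheduler.
    X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=42)
    to_t = lambda a: torch.as_tensor(np.asarray(a), dtype=torch.float32, device=device)
    X_tr, X_va, y_tr, y_va = map(to_t, (X_tr, X_va, y_tr, y_va))

    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", factor=0.5, patience=5)
    criterion = nn.HuberLoss()  # quadratic near zero, linear for large residuals

    for epoch in range(epochs):
        model.train()
        optimizer.zero_grad()
        loss = criterion(model(X_tr), y_tr)
        loss.backward()
        optimizer.step()

        model.eval()
        with torch.no_grad():
            val_loss = criterion(model(X_va), y_va)
        scheduler.step(val_loss)  # reduce LR when validation loss plateaus
    return model
```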
Implemented in `nn_only_src`.

- `ComplexAutoencoder`: Performs dimensionality reduction, comprising an encoder, a latent space, and a decoder. It targets essential data features while limiting overfitting.
- `ComplexNet`: Utilizes latent-space representations for predictions, integrating linear layers, ReLU activations, dropout, and a transformer encoder layer. (Both modules are sketched below.)
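A minimal sketch of the two modules, with placeholder layer widths and latent size (the actual definitions live in `nn_only_src`):

```python
import torch
import torch.nn as nn

class ComplexAutoencoder(nn.Module):
    """Encoder -> latent space -> decoder; widths here are placeholders."""

    def __init__(self, n_genes, latent_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_genes, 512), nn.ReLU(),
            nn.Linear(512, latent_dim))
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 512), nn.ReLU(),
            nn.Linear(512, n_genes))

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

class ComplexNet(nn.Module):
    """Predicts latent vectors from encoded (cell type, compound) features."""

    def __init__(self, n_features, latent_dim=128, d_model=256, dropout=0.2):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Linear(n_features, d_model), nn.ReLU(), nn.Dropout(dropout))
        self.transformer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=8, dropout=dropout, batch_first=True)
        self.head = nn.Linear(d_model, latent_dim)

    def forward(self, x):
        h = self.stem(x).unsqueeze(1)   # (batch, 1, d_model)
        h = self.transformer(h).squeeze(1)
        return self.head(h)
```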
Training Process:

- Autoencoder Training: Focuses on optimizing the latent-space representation.
- `ComplexNet` Training: Concentrates on learning from the reduced feature space after autoencoder training, as in the two-stage sketch below.
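A compressed sketch of the two-stage schedule, reusing the classes from the sketch above. The import path and the MSE reconstruction objective are assumptions:

```python
import torch
import torch.nn as nn

# Assumed import path: the classes sketched above, e.g. in nn_only_src/model.py.
from model import ComplexAutoencoder, ComplexNet

def train_two_stage(X_feat, Y_genes, epochs=50, lr=1e-3):
    """X_feat: (rows, n_features) float tensor of encoded inputs;
    Y_genes: (rows, n_genes) float tensor of expression targets."""
    # Stage 1: train the autoencoder to reconstruct gene expression.
    auto = ComplexAutoencoder(n_genes=Y_genes.shape[1])
    opt = torch.optim.Adam(auto.parameters(), lr=lr)
    mse = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        recon, _ = auto(Y_genes)
        mse(recon, Y_genes).backward()
        opt.step()

    # Freeze the latent codes so stage 2 has fixed regression targets.
    auto.eval()
    with torch.no_grad():
        _, Z = auto(Y_genes)

    # Stage 2: learn features -> latent codes in the reduced space.
    net = ComplexNet(n_features=X_feat.shape[1], latent_dim=Z.shape[1])
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        mse(net(X_feat), Z).backward()
        opt.step()
    return auto, net
```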
Integration Steps:

- Data Preparation: Loading and preprocessing from `id_map.csv`.
- Model Setup: Loading `ComplexAutoencoder` and `ComplexNet` and setting them to evaluation mode.
- Feature Encoding & Prediction: Encoding features and predicting the latent-space representation.
- Decoding and Gene Expression Prediction: Using the decoder to predict gene expressions.
- Post-Processing: Structuring predictions for submission. (An end-to-end inference sketch follows.)
Implemented in `nn_auto_rev2`.
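The steps above, as one hedged end-to-end sketch. The checkpoint paths and the `encode_features` helper are hypothetical stand-ins for the project's actual encoders:

```python
import pandas as pd
import torch

# Hypothetical module and checkpoint names; substitute the project's actual ones.
from model import ComplexAutoencoder, ComplexNet
from utils import encode_features  # assumed: id_map rows -> feature matrix

def predict_submission(n_genes, n_features, gene_names):
    id_map = pd.read_csv("data/id_map.csv")          # rows to predict

    auto = ComplexAutoencoder(n_genes=n_genes)
    net = ComplexNet(n_features=n_features)
    auto.load_state_dict(torch.load("models/autoencoder.pt"))
    net.load_state_dict(torch.load("models/complexnet.pt"))
    auto.eval()
    net.eval()                                       # evaluation mode for inference

    X = torch.as_tensor(encode_features(id_map), dtype=torch.float32)
    with torch.no_grad():
        Z = net(X)                                   # predicted latent representation
        genes = auto.decoder(Z)                      # decode to gene expression

    # Structure predictions in the sample_submission.csv layout.
    out = pd.DataFrame(genes.numpy(), columns=gene_names)
    out.insert(0, "id", id_map["id"].to_numpy())
    out.to_csv("output/submission.csv", index=False)
```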
Our submission in the Kaggle competition:

- Performance Metric: Achieved an MRRMSE (mean rowwise root mean squared error) of 0.822, ranking 749th (a minimal implementation of the metric follows this list).
- Benchmark: The top score was 0.729, by N. Jean Kouagou.
- Analysis: Our performance was influenced by our first-time use of `ComplexAutoencoder` and `ComplexNet`, and by project time constraints.
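For reference, the competition metric computes an RMSE per predicted row (one cell type and compound pair) and averages it over all rows. A minimal NumPy version:

```python
import numpy as np

def mrrmse(y_true, y_pred):
    """Mean rowwise RMSE: one RMSE per (cell type, compound) row,
    averaged across rows."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2, axis=1)).mean())

# Toy check: two rows, three genes.
y_true = np.array([[0.0, 1.0, 2.0], [1.0, 1.0, 1.0]])
y_pred = np.array([[0.0, 1.0, 2.0], [0.0, 0.0, 0.0]])
print(mrrmse(y_true, y_pred))  # (0 + 1) / 2 = 0.5
```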
Future directions:

- Generative Adversarial Networks: Exploring GANs for modeling cellular reactions.
- Chemical Analysis Libraries: Augmenting data processing with tools like RDKit.
- Training and Tuning Improvements: Advancing optimization techniques, loss functions, and network architectures.
This README documents our methods, results, and future plans for developing predictive models that analyze cellular responses to chemical compounds.