scMMAE (single-cell multimodal masked autoencoder): a cross-attention network based on the masked autoencoder (MAE).
- python 3.11.9
- timm 1.0.7
- pytorch 2.3.1
- cudnn 8.9.2.26
- scanpy 1.10.2
- anndata 0.10.8
- scikit-learn 1.5.1
The packages above are the main dependencies used for the experiments. Most PyTorch 2.0+ environments can run the experiments directly; just in case, we have provided a ./requirements.txt file listing all packages.
If you want to use your own datasets with scMMAE, you should change six parameters:
- config.RNA_tokens = config.RNA_component * config.emb_RNA, where RNA_tokens is the number of genes you use (I used 4000 highly variable genes), and config.emb_RNA must be divisible by the number of attention heads;
- config.ADT_tokens = config.ADT_component * config.emb_ADT, where ADT_tokens is the number of proteins you use (I used all proteins), and config.emb_ADT must be divisible by the number of attention heads.
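As a concrete illustration, the two constraints above can be sketched as follows. This is a hypothetical config object (the `Config` class, the factor values, the 14-protein panel, and the 8-head assumption are all examples, not the repository's actual defaults):

```python
# Hypothetical config sketch: the products must equal the feature counts,
# and the embedding sizes must be divisible by the number of attention heads.
class Config:
    pass

num_heads = 8                 # example head count; use your model's value

config = Config()
config.RNA_component = 50
config.emb_RNA = 80           # 80 % 8 == 0, so divisible by num_heads
config.RNA_tokens = config.RNA_component * config.emb_RNA   # 4000 genes

config.ADT_component = 14     # example protein panel size
config.emb_ADT = 16           # 16 % 8 == 0
config.ADT_tokens = config.ADT_component * config.emb_ADT   # 224

assert config.emb_RNA % num_heads == 0
assert config.emb_ADT % num_heads == 0
print(config.RNA_tokens, config.ADT_tokens)  # 4000 224
```

Pick `*_component` and `emb_*` so that their product matches your gene/protein count exactly.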
The input data is two matrices (RNA: cell_numbers * 1 * gene_numbers; PROTEIN: cell_numbers * 1 * protein_numbers). In addition, the input data should be normalized before running the model.
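A minimal sketch of shaping and normalizing the two inputs. The `prepare_input` helper, the library-size/log1p normalization choice, and the feature counts are assumptions for illustration; substitute your own preprocessing (e.g. via scanpy) and dimensions:

```python
import numpy as np

def prepare_input(X):
    """Normalize a (cells x features) matrix and add the singleton middle axis."""
    X = np.asarray(X, dtype=np.float32)
    # Library-size normalization + log1p (one common choice; adapt to your pipeline).
    X = X / (X.sum(axis=1, keepdims=True) + 1e-8) * 1e4
    X = np.log1p(X)
    return X[:, None, :]          # shape: (cell_numbers, 1, feature_numbers)

# Toy counts standing in for real RNA / protein measurements.
rna = prepare_input(np.random.poisson(1.0, size=(100, 4000)))
adt = prepare_input(np.random.poisson(5.0, size=(100, 14)))
print(rna.shape, adt.shape)       # (100, 1, 4000) (100, 1, 14)
```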
Use Anaconda to create a Python virtual environment. Here, we create a Python 3.11 environment named scMMAE:
conda create -n scMMAE python=3.11.9
Install packages
pip install -r requirements.txt
You can run ./scMMAE/code/stage1.py and ./scMMAE/code/stage2.py directly, as long as you first extract the archives ./scMMAE/dataset/CITE-seq/*.rar and ./scMMAE/dataset/RNA-seq/*.rar.
Then you can run ./scMMAE/code/tutorial.ipynb to reproduce the results for the IFNB scRNA-seq dataset; ideally, comment out the training code for these stages. Of note, due to the large size of the dataset, we have uploaded a rar archive inside the dataset folder, which you will need to extract into the current directory.
If you need the pretrained and fine-tuned models for the datasets used in the experiments, please contact [email protected]