Single-cell sequencing is paving the way for precision medicine. It is the next srep towards making precision medicine more accurate. However, the analysis of single-cell data is incredibly complex with numerous distinct approaches resulting in more than 500 Python and R libraries existing today.
The goal of this project is to tackle the complexity of data analysis by identifying the best approaches. The single-cell transcriptomics analysis has multiple steps, but we have focused on data integration — a crucial step when working with clinical data coming from patients.
- Single-cell sequencing data search and preprocessing
- Selection of metric to compare performance of different algorithms
- Simulation of artificial scRNA-seq datasets to compare the performances of the algorithms on real data and similar simulated
- Comparative analysis of algorithms aimed at batch correction in single-cell sequencing data
- Create a tool based on existing algorithms for data integration
-
We found 3 public datasets approximately 500, 15000, 80000 cells and 2, 4, 2 batches respectively. It may be found here. Also you need docker to be installed on your computer.
-
Silhouette-score with cosine distance was chosen as a metric. If we denote through a — the average distance from this object to objects from the same cluster, through b — the average distance from this object to objects from the nearest cluster (different from the one in which the object itself lies). In our case labels of clusters are cell type or batch. Then the silhouette of this object is called the value:
-
Generated 3 artificial datasets 500, 15000, 80000 cells and 2, 5, 2 batches respectively. We used SymSim R library, the code can be found here.
-
For the downstream work we have chosen 5 different approaches: Combat and Regress out which use linear models for batch correction and MNN, BBKNN and Scanorama which in turn are looking for mutual nearest neighbors in others batches for every cell.
-
Based on these 5 libraries, we made a program that for each specific dataset can choose the best approach and apply it to correct the batch effect.
We analyzed 5 algorithms on 6 datasets (3 real and 3 simulated). We provided batch correction of expression matrix, then calculated silhouettes for each cell in data and finally run MU-test to compare silhouettes of baseline (data before correction) and corrected data. We did it in two ways:
- Calculate silhouettes based on cell type labels and then run MU-test with alternative argument equal "less", because in this situation correction should increase our metric (cell types must form separate clusters)
- Calculate silhouettes based on batch labels and then run MU-test with alternative argument equal "greater", because in this situation correction should decrease our metric (different batches should mix)
Figure 1 shows the results obtained. Blue color means that the silhouette according to the batch is significantly reduced after correction, and yellow, that the silhouette according to the cell type is significantly increased.
Fig. 1. Significance of differences between silhouettes.
Also in Figure 2 you can see the time it took for each algorithm to work on the corresponding data. It is expected that MNN based on the search for mutual nearest neighbors worked most often longer, but at the same time its modifications (BBKNN and Scanorama), which preliminarily reduce the dimension of the data using canonical correlation analysis worked much faster.
Fig. 2. Time performance of different algorithms on different datasets.
Figure 3 depicts how clusters change after correction: cell types form separate clusters, and batches are stirred.
Fig. 3. UMAP before and after correction with Scanorama, labeled by cell type and batch.
As a result, based on the existing algorithms for correcting the batch effect, we have created a program that can apply the algorithm chosen by the user or run all algorithms and select a list of the most successful ones (See Usage for more information about downloading and running).
You can download and run program on our small example using following commands in your terminal:
git clone https://github.com/immunomind/bi2021spring.git
git checkout dev
cd bi2021spring/
pip install -r requirements.txt
cd source/
python data_integrator.py --adata ../example_data/example.h5ad \
--celltype celltype batch \
--batch batch \
--do_filter True \
--algtorun all \
--out example_result
Where adata is expression matrix with genes as columns and cells as rows and also with annotated cell and batch types (you can see more information in --help); celltype and batch are names of corresponding columns; do_filter tells if program apply casual filters (the gene is expressed in more than 100 cells and more than 20 genes have nonzero expression in a cell); algtorun means which algorithms will be used for correction or all and at the end there will be a list of the best ones; out is the directory with results.
We ran this program on Ubuntu 18.04 with 8 threads and 64 Gb RAM. With algtorun option equals all it has taken approximately 1 m on the small dataset, 10 m on the medium one and 75 m on the big. If your data is more than 100000 cells then we do not recommend using MNN due to too long working time.
- Haghverdi, L., Lun, A., Morgan, M. et al. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat Biotechnol 36, 421–427 (2018).
- Hie, B., Bryson, B. & Berger, B. Efficient integration of heterogeneous single-cell transcriptomes using Scanorama. Nat Biotechnol 37, 685–691 (2019).
- Krzysztof Polański, Matthew D Young, Zhichao Miao, Kerstin B Meyer, Sarah A Teichmann, Jong-Eun Park, BBKNN: fast batch alignment of single cell transcriptomes, Bioinformatics 36, 3, (2020).
- Tran, H.T.N., Ang, K.S., Chevrier, M. et al. A benchmark of batch-effect correction methods for single-cell RNA sequencing data. Genome Biol 21, 12 (2020).
- Zhang, X., Xu, C. & Yosef, N. Simulating multiple faceted variability in single cell RNA sequencing. Nat Commun 10, 2611 (2019).
- Yuqing Zhang, Giovanni Parmigiani, W Evan Johnson, ComBat-seq: batch effect adjustment for RNA-seq count data, NAR Genomics and Bioinformatics 2, 3, (2020).