SAFE (Single-cell Aggregated clustering From Ensemble): Cluster ensemble for single-cell RNA-seq data

yuxuanChen777/SAFEclustering

 
 

SAFEclustering

SAFE (Single-cell Aggregated clustering From Ensemble): Cluster ensemble for single-cell RNA-seq data

Although several methods have recently been developed for clustering cell types using single-cell RNA-seq (scRNA-Seq) data, they exploit different characteristics of the data and yield varying results in terms of both the number of clusters and the actual cluster assignments. Here, we present SAFE-clustering, Single-cell Aggregated (From Ensemble) clustering, a flexible, accurate and robust method for clustering scRNA-Seq data. SAFE-clustering takes as input the results from multiple clustering methods and builds one consensus solution. SAFE-clustering currently embeds four state-of-the-art methods, SC3, CIDR, Seurat and t-SNE + k-means, and ensembles the solutions from these four methods using three hypergraph-based partitioning algorithms.

SAFEclustering is maintained by Yuchen Yang [[email protected]] and Yun Li [[email protected]].

News and Updates

Sep 7, 2020

  • Version 2.00 released
    • The Seurat version used in SAFEclustering is updated to version 3. Seurat v.2 is no longer compatible
    • SAFEclustering now accepts only count data. Other formats, such as FPKM, CPM and RPKM, are no longer compatible

Dec 5, 2018

  • Version 1.00.1 released
    • Fixing an error in Seurat clustering to allow more than 20 PCs computed
    • Fixing an error in tSNE + k-means clustering when specifying the maximum value of the pool of cluster numbers

July 24, 2018

  • Version 0.99.0 released
    • First official release
    • Currently works only on Mac and Linux platforms

Installation

You can install SAFEclustering from GitHub with:

install.packages("devtools")

devtools::install_github("yycunc/SAFEclustering")

Note that the hypergraph partitioning algorithm (HGPA) is performed using the shmetis program (from the hMETIS package v. 1.5 (Karypis et al., IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 1999)), and the meta-clustering algorithm (MCLA) and the cluster-based similarity partitioning algorithm (CSPA) are performed using the gpmetis program (from METIS v. 5.1.0 (Karypis and Kumar, SIAM Journal on Scientific Computing, 1998)). Please download the versions of these two programs that match your operating system, and either put them in the working directory or provide the directory that contains them.
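Before running the ensemble step, it can help to confirm that both programs are where you expect them. The snippet below is a minimal sketch in base R; `check_partition_programs` is a hypothetical helper written for this illustration, not a function of SAFEclustering, and it assumes the executables are named exactly `shmetis` and `gpmetis`.

```r
# Hypothetical helper (not part of SAFEclustering): report which of the
# external partitioning programs are present in a directory.
check_partition_programs <- function(program.dir = ".",
                                     programs = c("shmetis", "gpmetis")) {
  paths <- file.path(path.expand(program.dir), programs)
  found <- file.exists(paths)
  names(found) <- programs
  found
}

# Example directory taken from the ensemble example in this README
found <- check_partition_programs("~/Documents/single_cell_clustering")
if (!all(found)) {
  message("Missing program(s): ", paste(names(found)[!found], collapse = ", "))
}
```

The check only tests that the files exist; it does not verify that they are executable or built for your platform.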

SAFEclustering Examples

Here we provide one example using the dataset from Zheng et al. (Nature Communications, 2016). The Zheng dataset contains 500 human peripheral blood mononuclear cells (PBMCs) sequenced using the GemCode platform and consists of three cell types: CD56+ natural killer cells, CD19+ B cells and CD4+/CD25+ regulatory T cells. The original data can be downloaded from the 10x Genomics website.

Load the data

library("SAFEclustering")
data("data_SAFE")

Zheng dataset

Setup the input expression matrix

dim(data_SAFE$Zheng.expr)

data_SAFE$Zheng.expr[1:5, 1:5]

Perform individual clustering

Here we perform single-cell clustering using four popular methods, SC3, CIDR, Seurat and t-SNE + k-means, without filtering any genes or cells.

cluster.results <- individual_clustering(inputTags = data_SAFE$Zheng.expr, mt_filter = TRUE, 
SC3 = TRUE, gene_filter = FALSE, CIDR = TRUE, nPC.cidr = NULL, 
Seurat = TRUE, nGene_filter = FALSE, nPC.seurat = NULL, resolution = 0.7, tSNE = TRUE, dimensions = 3, 
perplexity = 30, SEED = 123)

The function individual_clustering outputs a matrix in which each row holds the cluster assignments from one method and each column represents a cell. Users can also extend SAFE-clustering to other scRNA-seq clustering methods by putting all clustering results into an M x N matrix, with M clustering methods and N cells.

cluster.results[1:4, 1:10]
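As an illustration of the M x N layout described above, the sketch below builds such a matrix from two stand-in methods available in base R (k-means and hierarchical clustering) on simulated counts. The toy data, the object names and the choice of methods are assumptions made for this example; it also assumes that integer cluster labels, like those returned by individual_clustering, are what the ensemble step expects.

```r
set.seed(123)

# Toy count matrix: 20 genes x 30 cells, with cells in columns
expr <- matrix(rpois(20 * 30, lambda = 5), nrow = 20,
               dimnames = list(paste0("gene", 1:20), paste0("cell", 1:30)))
cells <- t(log1p(expr))  # kmeans() and dist() expect observations in rows

# Two stand-in clustering methods from the stats package
labels.kmeans <- kmeans(cells, centers = 3)$cluster
labels.hclust <- cutree(hclust(dist(cells)), k = 3)

# M x N matrix: one row per method, one column per cell
custom.results <- rbind(kmeans = labels.kmeans, hclust = labels.hclust)
dim(custom.results)  # 2 30
```

A matrix built this way has the same shape as cluster.results, with one row per method.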

Cluster ensemble

Using the clustering results generated in the last step, we perform cluster ensemble using three partitioning algorithms: the meta-clustering algorithm (MCLA), the hypergraph partitioning algorithm (HGPA) and the cluster-based similarity partitioning algorithm (CSPA) (Strehl and Ghosh, Proceedings of AAAI 2002, Edmonton, Canada, 2002).
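To give a feel for the hypergraph view these algorithms share, the sketch below follows the formulation in Strehl and Ghosh (2002): every cluster from every method becomes one hyperedge (a binary indicator over cells), and the cell-by-cell similarity used by CSPA is the fraction of methods that place two cells in the same cluster. This is an illustration in base R on a toy label matrix, not the code used inside SAFEclustering, and all object names are made up for the example.

```r
# Toy label matrix: M = 3 methods (rows) x N = 6 cells (columns)
labels <- rbind(c(1, 1, 1, 2, 2, 2),
                c(1, 1, 2, 2, 3, 3),
                c(2, 2, 2, 1, 1, 1))

# One hyperedge (indicator column) per cluster of each method
hyperedges <- do.call(cbind, lapply(seq_len(nrow(labels)), function(m) {
  ks <- sort(unique(labels[m, ]))
  sapply(ks, function(k) as.integer(labels[m, ] == k))
}))
dim(hyperedges)  # 6 cells x 7 hyperedges (2 + 3 + 2 clusters)

# Co-association matrix: fraction of methods that put two cells together
similarity <- tcrossprod(hyperedges) / nrow(labels)
similarity[1, 2]  # 1: cells 1 and 2 are always co-clustered
similarity[1, 4]  # 0: cells 1 and 4 are never co-clustered
```

In the original formulation, HGPA and MCLA partition structures derived from the hyperedges, and CSPA partitions the similarity graph; the external programs do that partitioning step.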

Here, the programs required, shmetis and gpmetis, are in the local working directory "~/Documents/single_cell_clustering".

cluster.ensemble <- SAFE(cluster_results = cluster.results, program.dir = "~/Documents/single_cell_clustering", 
MCLA = TRUE, CSPA = TRUE, HGPA = TRUE, SEED = 123)

Here is the list of ANMI results for the ensemble solution at each K and for each partitioning algorithm.

## [1] "HGPA partitioning at K = 2: 2 clusters at ANMI = 0.00329903476904425"
## [1] "HGPA partitioning at K = 3: 3 clusters at ANMI = 0.278691668779803"
## [1] "HGPA partitioning at K = 4: 4 clusters at ANMI = 0.00392992505505839"
## [1] "HGPA partitioning at K = 5: 5 clusters at ANMI = 0.552234460801785"
## [1] "MCLA partitioning at K = 2: 2 clusters at ANMI = 0.568294023177534"
## [1] "MCLA partitioning at K = 3: 3 clusters at ANMI = 0.929094923585274"
## [1] "MCLA partitioning at K = 4: 4 clusters at ANMI = 0.872601957447147"
## [1] "MCLA partitioning at K = 5: 4 clusters at ANMI = 0.923346490477427"
## [1] "CSPA partitioning at K = 2: 2 clusters at ANMI = 0.53144399728197"
## [1] "CSPA partitioning at K = 3: 3 clusters at ANMI = 0.850151780486274"
## [1] "CSPA partitioning at K = 4: 4 clusters at ANMI = 0.665510270422344"
## [1] "CSPA partitioning at K = 5: 5 clusters at ANMI = 0.666022118059772"
## [1] "Optimal number of clusters is 3 with ANMI = 0.929094923585274"

The function SAFE outputs a list that includes the Average Normalized Mutual Information (ANMI) metric (Strehl and Ghosh, Proceedings of AAAI 2002, Edmonton, Canada, 2002) between each ensemble solution and the individual solutions. The ensemble solution with the highest ANMI value is selected as the optimal clustering.

cluster.ensemble$Summary

cluster.ensemble$MCLA[1:10]

cluster.ensemble$MCLA_optimal_k
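The ANMI criterion can be sketched in a few lines of base R. The version below uses the normalization from Strehl and Ghosh (2002), NMI(a, b) = I(a; b) / sqrt(H(a) H(b)), and averages it over the individual solutions. The functions `nmi` and `anmi` and the toy labels are written for this illustration and are not the package's internal implementation.

```r
# Normalized mutual information between two label vectors
nmi <- function(a, b) {
  n <- length(a)
  counts <- table(a, b)
  joint <- counts / n
  pa <- rowSums(counts) / n
  pb <- colSums(counts) / n
  nz <- counts > 0
  mi <- sum(joint[nz] * log(joint[nz] / outer(pa, pb)[nz]))
  ha <- -sum(pa * log(pa))
  hb <- -sum(pb * log(pb))
  if (ha == 0 || hb == 0) return(0)  # a single-cluster solution carries no information
  mi / sqrt(ha * hb)
}

# Average NMI between one ensemble solution and each individual solution (rows)
anmi <- function(ensemble, individual) {
  mean(apply(individual, 1, nmi, b = ensemble))
}

# Toy example: 3 individual solutions for 6 cells, 2 candidate ensembles
individual <- rbind(c(1, 1, 1, 2, 2, 2),
                    c(1, 1, 2, 2, 3, 3),
                    c(2, 2, 2, 1, 1, 1))
candidates <- list(K2 = c(1, 1, 1, 2, 2, 2), K3 = c(1, 1, 2, 2, 3, 3))
scores <- sapply(candidates, anmi, individual = individual)
names(which.max(scores))  # "K2"
```

Because NMI is invariant to relabeling, the third individual solution agrees perfectly with the K2 candidate even though its labels are swapped.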

We can compare the clustering results to the true labels using the Adjusted Rand Index (ARI):

library(cidr)

# Cell labels of ground truth
head(data_SAFE$Zheng.celltype)

# Calculating ARI for cluster ensemble
adjustedRandIndex(cluster.ensemble$optimal_clustering, data_SAFE$Zheng.celltype)
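For readers who want to see what the ARI measures, here is a minimal base R sketch of the standard pair-counting formula (Hubert and Arabie). The function `ari` is a hypothetical stand-in written for this example; in practice the adjustedRandIndex call above is the one to use.

```r
# Adjusted Rand Index from the contingency table of two label vectors
ari <- function(x, y) {
  counts <- table(x, y)
  pairs <- function(n) sum(choose(n, 2))   # number of pairs within each group
  index <- pairs(counts)                    # pairs that agree in both labelings
  row.pairs <- pairs(rowSums(counts))
  col.pairs <- pairs(colSums(counts))
  expected <- row.pairs * col.pairs / choose(length(x), 2)
  maximum <- (row.pairs + col.pairs) / 2
  # Note: returns NaN in the degenerate case where maximum equals expected
  (index - expected) / (maximum - expected)
}

ari(c(1, 1, 2, 2), c("b", "b", "a", "a"))  # 1: same partition, different names
ari(c(1, 1, 2, 2), c(1, 2, 1, 2))          # -0.5: worse than chance
```

An ARI of 1 means the two partitions are identical up to relabeling, and values near 0 indicate agreement no better than chance.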

Citation

Yang, Y., Huh, R., Culpepper, H., Lin, Y., Love, M., Li, Y. (2019) SAFE-clustering: Single-cell Aggregated (From Ensemble) Clustering for Single-cell RNA-seq Data. Bioinformatics, 35: 1269-1277. [PMID: 30202935].

Credits

The MCLA, HGPA and CSPA algorithms are from Strehl and Ghosh (2002). Some code is adapted from the Matlab package ClusterEnsemble-V2.0 (http://strehl.com/soft.html).
