Function to calculate batch ARI metric #86

sjspielman · 2022-08-09T19:07:57Z

We should write a function that takes an integrated SCE object and calculates batch ARI. The final output from this function should be an R object that can be saved to file, likely either TSV or RDS depending on the specific content of the batch ARI calculations.

For reference, I'm including this benchmarking manuscript and their code to calculate batch ARI, but I know we also have some code in scpca-downstream-analyses that can likely be adapted! -
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1850-9

https://github.com/JinmiaoChenLab/Batch-effect-removal-benchmarking/blob/master/Script/evaluation/ARI/ARI_utils/ari_calcul_sampled.R

I'll also note that batch ARI is not one of the metrics calculated in this Python package: https://github.com/theislab/scib#metrics

sjspielman · 2022-08-16T18:37:18Z

@allyhawkins, do we want to do bootstrapping of the clustering here similar to scpca-downstream-analyses? Or just go for it with one k-means run?

allyhawkins · 2022-08-16T20:56:29Z

The ARI bootstrapping we used in scpca-downstream-analyses looks at the stability of the clustering, so it would help us determine how reliable the clustering is and pick the number of centers to use for k-means. Looking at how they calculate the batch ARI in scIB, they do something similar to our ARI bootstrapping: they downsample the cells, re-cluster, compute the batch ARI (comparing the batch labels to the cluster labels), and then take the median across all iterations. I think the approach of downsampling over a set number of iterations makes sense to me and we should follow what they've done.
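
The downsample, re-cluster, score, and take-the-median loop described above could look roughly like the following. This is a minimal, language-agnostic sketch in plain Python rather than R, not the benchmark's code or the function that was eventually merged. The names `downsampled_batch_ari`, `cluster_fn`, `frac`, and `n_iter` are hypothetical, and `cluster_fn` stands in for whatever k-means call would actually be used.

```python
import random
from collections import Counter
from math import comb
from statistics import median


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings (assumes at least two cells)."""
    n = len(labels_a)
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:  # degenerate case, e.g. one batch and one cluster
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def downsampled_batch_ari(embedding, batches, cluster_fn, frac=0.8, n_iter=20, seed=0):
    """Median batch ARI over repeated downsampling (hypothetical helper).

    cluster_fn is an assumed callable: it takes a list of points and
    returns one cluster label per point (k-means in a real pipeline).
    """
    rng = random.Random(seed)
    n_keep = max(2, int(frac * len(batches)))
    scores = []
    for _ in range(n_iter):
        keep = rng.sample(range(len(batches)), n_keep)  # downsample cells
        clusters = cluster_fn([embedding[i] for i in keep])  # re-cluster
        scores.append(adjusted_rand_index([batches[i] for i in keep], clusters))
    return median(scores)


# Toy data: two batches that are fully separated along one axis.
embedding = [(0.0,), (0.2,), (0.1,), (0.3,), (10.0,), (10.2,), (10.1,), (10.3,)]
batches = ["a"] * 4 + ["b"] * 4


def threshold_clusters(points):
    """Stand-in for k-means with two centers on this toy data."""
    return [int(p[0] > 5.0) for p in points]


score = downsampled_batch_ari(embedding, batches, threshold_clusters)
print(score)  # 1.0: clusters coincide with batches, i.e. no mixing
```

On real data a lower median batch ARI would indicate better mixing of batches within clusters.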

In terms of figuring out what the "best clustering" is, I also noticed in their function that the batch ARI is still reliant on cell type information. The number of centers used for k-means each time they subsample is based on the number of cell types. Additionally, each time they downsample they only look at cells with shared cell types. This means that the number of clusters equals the number of cell types, so they expect the clusters to define the cell types. I wonder if this is something we might want to consider...? I would think that batch ARI is independent of cell type, but there is probably some bias based on which cell types are present in each dataset, and your ARI would be altered if you were not looking at shared cell types.
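
To make the "shared cell types" restriction concrete, here is a small sketch of how the cells and the number of centers could be chosen under that scheme. It is an illustration in plain Python, not a copy of the linked R function, and `restrict_to_shared_cell_types` is a hypothetical name.

```python
from collections import defaultdict


def restrict_to_shared_cell_types(batches, cell_types):
    """Return indices of cells whose type occurs in every batch, plus k.

    k is the number of shared cell types, which the scheme described
    above would use as the number of k-means centers.
    """
    types_per_batch = defaultdict(set)
    for batch, cell_type in zip(batches, cell_types):
        types_per_batch[batch].add(cell_type)
    shared = set.intersection(*types_per_batch.values())
    keep = [i for i, t in enumerate(cell_types) if t in shared]
    return keep, len(shared)


# Toy example: "NK" cells only appear in batch "a", so they are dropped.
batches = ["a", "a", "a", "b", "b"]
cell_types = ["T", "B", "NK", "T", "B"]
keep, k = restrict_to_shared_cell_types(batches, cell_types)
print(keep, k)  # [0, 1, 3, 4] 2
```

This also shows where the cell type dependence comes from: both the set of cells scored and the number of clusters change with the annotations.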

It looks like by using the optimal option, you can find the optimal number of centers to use for k-means rather than relying on cell type? Perhaps using the NbClust package to do that is something we might want to try every time we downsample?

sjspielman · 2022-08-16T21:05:35Z

> I think the approach of downsampling over a set number of iterations makes sense to me and we should follow what they've done.

Makes sense to me!

> In terms of figuring out what the "best clustering" is, I also noticed in their function that the batch ARI is still reliant on cell type information

I had thought they were only doing ARI at the cell level, not the batch level (?), which is what it says in the README. But now I also see this in the paper itself -

> Similarly, we computed and plotted the ARI scores in the same fashion, 1-ARI_batch and ARI_cell type. To compute the ARI scores, k-means clustering was first performed to obtain cluster labels for comparison against batch labels and cell type labels to obtain the ARI_batch and ARI_cell type scores respectively.
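
For reference, the comparison of cluster labels against batch or cell type labels uses the standard adjusted Rand index. The notation below is an assumption introduced here, not taken from either benchmark: $n_{ij}$ is the number of cells with cluster label $i$ and batch (or cell type) label $j$, $a_i = \sum_j n_{ij}$, $b_j = \sum_i n_{ij}$, and $n$ is the total number of cells.

```latex
\mathrm{ARI} =
\frac{\sum_{i,j} \binom{n_{ij}}{2}
      - \left[ \sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2} \right] \Big/ \binom{n}{2}}
     {\frac{1}{2} \left[ \sum_i \binom{a_i}{2} + \sum_j \binom{b_j}{2} \right]
      - \left[ \sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2} \right] \Big/ \binom{n}{2}}
```

The index is 1 when the two labelings agree exactly and is 0 in expectation for independent labelings, which is why 1-ARI_batch is the quantity where higher means better mixing.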

allyhawkins · 2022-08-16T21:44:08Z

> I had thought they were only doing ARI at the cell level, not the batch level (?), which is what it says in the README. But now I also see this in the paper itself -
>
> Similarly, we computed and plotted the ARI scores in the same fashion, 1-ARI_batch and ARI_cell type. To compute the ARI scores, k-means clustering was first performed to obtain cluster labels for comparison against batch labels and cell type labels to obtain the ARI_batch and ARI_cell type scores respectively.

Yes... it looks like the function you originally linked calculates both the ARI batch and ARI cell type, but the clusters are first defined by the number of cell types. They then calculate ARI by comparing the vector of batch assignments to the cluster assignments. Another thought I had was looking at the ARI by comparing cluster assignments of unintegrated to integrated, which would be similar but solely looking at clustering irrespective of cell type or batch. In a perfect world you would expect complete separation of cells based on their original batch before integration, and then complete mixing after integration, so the ARI would be close to 0 (showing good integration and mixing of batches). But there are so many confounders (e.g., cell types that don't mix) that I don't think it will be that straightforward. But it would provide a relative measure of how similar the clustering is from pre to post integration?
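
As a toy illustration of the two extremes mentioned above (clusters that coincide with batch versus clusters that mix batches evenly), here is a small stdlib-only Python sketch comparing batch labels to cluster labels. The labels are made up for illustration, and on such a tiny example the "mixed" score lands slightly below zero rather than exactly at zero.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings (assumes at least two cells)."""
    n = len(labels_a)
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:  # degenerate case, e.g. one batch and one cluster
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


batch = ["a", "a", "a", "a", "b", "b", "b", "b"]

# Hypothetical clusters before integration: each cluster is one batch.
separated_clusters = [0, 0, 0, 0, 1, 1, 1, 1]
# Hypothetical clusters after integration: each cluster is half "a", half "b".
mixed_clusters = [0, 0, 1, 1, 0, 0, 1, 1]

separated_score = adjusted_rand_index(batch, separated_clusters)
mixed_score = adjusted_rand_index(batch, mixed_clusters)
print(separated_score)  # 1.0
print(mixed_score)      # about -0.167, i.e. no agreement beyond chance
```

Real data will fall between these extremes, and cell types that are present in only one batch would push the score up even after a good integration.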

sjspielman · 2022-08-17T13:35:53Z

FYI, I was going back and forth and getting confused between the two benchmarks, scib and the other one I linked :) The quote above is from the scib paper, where they only compute ARI at the cell level.

> Another thought I had was looking at the ARI by comparing cluster assignments of unintegrated to integrated, which would be similar but solely looking at clustering irrespective of cell type or batch.

This could be interesting too, but I'm also not sure how I would interpret it with all the confounders, like you say.

sjspielman changed the title ~~Function to calculate batch ARI~~ Function to calculate batch ARI metric Aug 10, 2022

This was referenced Aug 17, 2022

Function for batch ARI calculation #97

Closed

Function for batch ARI calculation #98

Closed

Function for batch ARI #101

Merged

sjspielman closed this as completed in #101 Aug 18, 2022
