Function to calculate batch ARI metric #86
Comments
@allyhawkins, do we want to do bootstrapping of the clustering here similar to |
The bootstrapping of the ARI we used in In terms of figuring out what the "best clustering" is, I also noticed in their function that the batch ARI is still reliant on cell type information. The number of centers used for kmeans each time they subsample is based on the number of cell types. Additionally, each time they downsample they only look at cells with shared cell types. This means that the number of clusters is equivalent to cell types so they expect the clusters to define the cell types. I wonder if this is something we might want to consider...? I would think that batch ARI is independent of cell type, but it seems like there probably is some bias based on what cell types are present in each dataset and your ARI would be altered if you were not looking at shared cell types. It looks like by using the |
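To make the subsampling described above concrete, here is a minimal sketch, not the benchmark's actual code. It assumes "shared" means a cell type that is present in every batch, and the function name, `frac`, `n_iter`, and `seed` are all hypothetical choices for illustration. It only produces the subsampled cell indices and the number of clusters `k` that would be passed to k-means; the clustering itself is left out.

```python
import random
from collections import defaultdict


def shared_celltype_subsamples(batches, celltypes, frac=0.8, n_iter=20, seed=0):
    """Yield (indices, k) for repeated subsamples of cells with shared cell types.

    `frac`, `n_iter`, and `seed` are illustrative defaults, not values taken
    from the thread. `k` is the number of shared cell types, i.e. the number
    of k-means centers that would be used for each subsample.
    """
    if len(batches) != len(celltypes):
        raise ValueError("batches and celltypes must have the same length")

    # Record which batches each cell type appears in.
    all_batches = set(batches)
    batches_per_type = defaultdict(set)
    for batch, celltype in zip(batches, celltypes):
        batches_per_type[celltype].add(batch)

    # Assumption: a cell type is "shared" only if every batch contains it.
    shared = {ct for ct, seen in batches_per_type.items() if seen == all_batches}
    keep = [i for i, ct in enumerate(celltypes) if ct in shared]
    if not keep:
        return  # nothing to subsample when no cell type is shared

    rng = random.Random(seed)
    size = max(1, round(frac * len(keep)))
    for _ in range(n_iter):
        # Sample without replacement from the cells with shared cell types.
        yield sorted(rng.sample(keep, size)), len(shared)
```

Because `k` is tied to the number of shared cell types, any batch ARI computed on these subsamples inherits a dependence on the cell type annotations, which is the concern raised above.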
Makes sense to me!
I had thought they were only doing ARI at the cell level, not the batch level (?), which is what it says in the README. But now I also see this in the paper itself -
Yes... it looks like the function you originally linked calculates both the batch ARI and the cell type ARI, but the clusters are first defined using the number of cell types. They then calculate the batch ARI by comparing the vector of batch assignments to the cluster assignments.

Another thought I had was to calculate the ARI between the cluster assignments from the unintegrated and integrated data, which would be similar but would look solely at clustering, irrespective of cell type or batch. In a perfect world, you would expect complete separation of cells by their original batch before integration and complete mixing after integration, so the ARI would be close to 0 (showing good integration and mixing of batches). But there are so many confounders (e.g., cell types that don't mix) that I don't think it will be that straightforward. It would still provide a relative measure of how similar the clustering is from pre- to post-integration, though?
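As a rough illustration of the comparison discussed here, the sketch below computes the adjusted Rand index between two label vectors using only the Python standard library. It is a hedged, from-scratch version of the standard ARI formula, not the function from the linked benchmark; the function name and the toy labels are made up for the example.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same cells."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label vectors must have the same length")
    n = len(labels_a)
    if n < 2:
        raise ValueError("need at least two cells to form a pair")

    # Pairs of cells that share a label in both labelings (contingency table).
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs of cells that share a label within each labeling on its own.
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = sum_a * sum_b / comb(n, 2)  # agreement expected by chance
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        # Degenerate case, e.g. both labelings put every cell in one group.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Toy example: hypothetical batch labels and two hypothetical clusterings.
batch = ["a", "a", "a", "b", "b", "b"]
clusters_by_batch = [0, 0, 0, 1, 1, 1]  # clusters follow batch exactly
clusters_mixed = [0, 1, 0, 1, 0, 1]     # batches are spread across clusters

ari_separated = adjusted_rand_index(batch, clusters_by_batch)
ari_mixed = adjusted_rand_index(batch, clusters_mixed)
```

In this toy case the clustering that follows batch gives an ARI of 1, and the mixed clustering gives a value slightly below 0. The same function could be called with pre- and post-integration cluster assignments instead of batch labels to get the relative measure mentioned above.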
FYI I was getting confused back and forth with the two benchmarks,
This could be interesting too, but I'm also not sure I would understand how to interpret it with all the confounders, like you say.
We should write a function that takes an integrated SCE object and calculates batch ARI. The final output from this function should be an R object that can be saved to file, likely either TSV or RDS depending on the specific content of the batch ARI calculations.
For reference, I'm including this benchmarking manuscript and their code to calculate batch ARI, but I know we also have some code in scpca-downstream-analyses that can likely be adapted!
- https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1850-9
- https://github.com/JinmiaoChenLab/Batch-effect-removal-benchmarking/blob/master/Script/evaluation/ARI/ARI_utils/ari_calcul_sampled.R
I'll also note that batch ARI is not one of the metrics calculated in this Python package - https://github.com/theislab/scib#metrics