CentroidSpace AnnData Annotations #455

Lilly-May · 2023-12-06T20:20:10Z

PR Checklist

Documentation in docs is updated

Description of changes

CentroidSpace calculates the centroid for all cells of a given perturbation. The resulting AnnData object will have one row for each perturbation. The objective of this pull request is to enhance the annotation of the returned AnnData object, thereby simplifying its downstream usage. Specifically, the following modifications in the CentroidSpace.compute method have been made:

I included the .obs column specified in the parametertarget_col (the column containing perturbation annotations) as a .obs column in the output. Although perturbation information is also stored as an index in the returned adata, having it in .obs is useful, for instance, for the creation of a colored UMAP plot (see screenshot below).
The X of the returned adata contains the coordinates for each centroid in the respective embedding (e.g., UMAP). I also store the values in X as an embedding in obsm, once again, to simplify processes such as the creation of a UMAP plot.
I introduced the parameter keep_obs to the compute function, allowing users to specify .obs columns from the original adata that should be kept in the returned adata. This is useful when additional annotations are required for downstream processing, such as a pathway annotation for each perturbation.

Example usage

for more information, see https://pre-commit.ci

Zethson · 2023-12-06T21:09:44Z

I included the .obs column specified in the parametertarget_col (the column containing perturbation annotations) as a .obs column in the output. Although perturbation information is also stored as an index in the returned adata, having it in .obs is useful, for instance, for the creation of a colored UMAP plot (see screenshot below).

Excellent! We might actually try to copy all obs columns as far as we can?

I also wonder whether any of the other PerturbationSpace methods need this? I'd hope that we could keep the dimensions of the objects somewhat consistent.

I introduced the parameter keep_obs to the compute function, allowing users to specify .obs columns from the original adata that should be kept in the returned adata. This is useful when additional annotations are required for downstream processing, such as a pathway annotation for each perturbation.

I again wonder whether we should keep all of them?

Lilly-May · 2023-12-06T21:48:38Z

Thanks for the review @Zethson!

Excellent! We might actually try to copy all obs columns as far as we can?

I thought about doing that, but the reason I decided against it is that we can only keep the .obs columns with the same value for all cells of one perturbation. We could of course iterate through all .obs and test if the respective .obs variable fulfills this condition, but from what I've seen, it's quite common for perturbation datasets to have one .obs column for each perturbation (such as for the Norman dataset, >100 obs variables). Maybe I'm overestimating, but I'm concerned that it could scale up pretty quickly in terms of computational complexity?

I also wonder whether any of the other PerturbationSpace methods need this? I'd hope that we could keep the dimensions of the objects somewhat consistent.

Yes, that's true! Based on what I've tested so far, PseudobulkSpace keeps the unambiguous .obs values and changes it's shape to nr_perturbations x nr_obs, just as CentroidSpace does. DBSCANSpace and KMeansSpace keep the original shape (nr_cells x nr_obs), since they only calculate clusters for all datapoints. DiscriminatorClassifier works a bit differently anyway, since we train the model and then obtain the embeddings from the NN layer, which then of course also have the shape of the last layer. But I'll definitely keep an eye on making everything as consistent as possible. Or do you already have ideas on how to improve consistency?

Zethson · 2023-12-06T21:52:01Z

I thought about doing that, but the reason I decided against it is that we can only keep the .obs columns with the same value for all cells of one perturbation. We could of course iterate through all .obs and test if the respective .obs variable fulfills this condition, but from what I've seen, it's quite common for perturbation datasets to have one .obs column for each perturbation (such as for the Norman dataset, >100 obs variables). Maybe I'm overestimating, but I'm concerned that it could scale up pretty quickly in terms of computational complexity?

I think that there's probably pretty efficient pandas code for that. I'd encourage you to give it a try.

Yes, that's true! Based on what I've tested so far, PseudobulkSpace keeps the unambiguous .obs values and changes it's shape to nr_perturbations x nr_obs, just as CentroidSpace does. DBSCANSpace and KMeansSpace keep the original shape (nr_cells x nr_obs), since they only calculate clusters for all datapoints. DiscriminatorClassifier works a bit differently anyway, since we train the model and then obtain the embeddings from the NN layer, which then of course also have the shape of the last layer. But I'll definitely keep an eye on making everything as consistent as possible. Or do you already have ideas on how to improve consistency?

Ahh, that's fair. So let's try and stick to the PerturbationSpace and ClusterSpace differentiations. These should probably be two distinct concepts with two distinct interfaces and matrix dimensions.

for more information, see https://pre-commit.ci

pertpy/tools/_perturbation_space/_simple.py

CentroidSpace AnnData Annotations

af0d761

Lilly-May requested a review from AlejandroTL December 6, 2023 20:20

github-actions bot added the bug Something isn't working label Dec 6, 2023

Lilly-May requested a review from Zethson December 6, 2023 20:20

[pre-commit.ci] auto fixes from pre-commit.com hooks

df06ab5

for more information, see https://pre-commit.ci

Lilly-May and others added 2 commits December 7, 2023 19:19

Annotating all .obs columns by default

819e989

[pre-commit.ci] auto fixes from pre-commit.com hooks

b01ac0d

for more information, see https://pre-commit.ci

Zethson approved these changes Dec 7, 2023

View reviewed changes

pertpy/tools/_perturbation_space/_simple.py Outdated Show resolved Hide resolved

pertpy/tools/_perturbation_space/_simple.py Outdated Show resolved Hide resolved

Changed docs explanation

a56dde8

Lilly-May merged commit e0ce61d into main Dec 8, 2023
4 of 7 checks passed

Lilly-May deleted the fix/centroid_space_adata_annotation branch December 13, 2023 16:42

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CentroidSpace AnnData Annotations #455

CentroidSpace AnnData Annotations #455

Lilly-May commented Dec 6, 2023

Zethson commented Dec 6, 2023

Lilly-May commented Dec 6, 2023

Zethson commented Dec 6, 2023

CentroidSpace AnnData Annotations #455

CentroidSpace AnnData Annotations #455

Conversation

Lilly-May commented Dec 6, 2023

Zethson commented Dec 6, 2023

Lilly-May commented Dec 6, 2023

Zethson commented Dec 6, 2023