Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CentroidSpace AnnData Annotations #455

Merged
merged 5 commits into from
Dec 8, 2023

Conversation

Lilly-May
Copy link
Collaborator

PR Checklist

  • Documentation in docs is updated

Description of changes

CentroidSpace calculates the centroid for all cells of a given perturbation. The resulting AnnData object will have one row for each perturbation. The objective of this pull request is to enhance the annotation of the returned AnnData object, thereby simplifying its downstream usage. Specifically, the following modifications in the CentroidSpace.compute method have been made:

  • I included the .obs column specified in the parametertarget_col (the column containing perturbation annotations) as a .obs column in the output. Although perturbation information is also stored as an index in the returned adata, having it in .obs is useful, for instance, for the creation of a colored UMAP plot (see screenshot below).
  • The X of the returned adata contains the coordinates for each centroid in the respective embedding (e.g., UMAP). I also store the values in X as an embedding in obsm, once again, to simplify processes such as the creation of a UMAP plot.
  • I introduced the parameter keep_obs to the compute function, allowing users to specify .obs columns from the original adata that should be kept in the returned adata. This is useful when additional annotations are required for downstream processing, such as a pathway annotation for each perturbation.

Example usage

Bildschirmfoto 2023-12-06 um 21 18 34

@github-actions github-actions bot added the bug Something isn't working label Dec 6, 2023
@Zethson
Copy link
Member

Zethson commented Dec 6, 2023

I included the .obs column specified in the parametertarget_col (the column containing perturbation annotations) as a .obs column in the output. Although perturbation information is also stored as an index in the returned adata, having it in .obs is useful, for instance, for the creation of a colored UMAP plot (see screenshot below).

Excellent! We might actually try to copy all obs columns as far as we can?

I also wonder whether any of the other PerturbationSpace methods need this? I'd hope that we could keep the dimensions of the objects somewhat consistent.

I introduced the parameter keep_obs to the compute function, allowing users to specify .obs columns from the original adata that should be kept in the returned adata. This is useful when additional annotations are required for downstream processing, such as a pathway annotation for each perturbation.

I again wonder whether we should keep all of them?

@Lilly-May
Copy link
Collaborator Author

Thanks for the review @Zethson!

Excellent! We might actually try to copy all obs columns as far as we can?

I thought about doing that, but the reason I decided against it is that we can only keep the .obs columns with the same value for all cells of one perturbation. We could of course iterate through all .obs and test if the respective .obs variable fulfills this condition, but from what I've seen, it's quite common for perturbation datasets to have one .obs column for each perturbation (such as for the Norman dataset, >100 obs variables). Maybe I'm overestimating, but I'm concerned that it could scale up pretty quickly in terms of computational complexity?

I also wonder whether any of the other PerturbationSpace methods need this? I'd hope that we could keep the dimensions of the objects somewhat consistent.

Yes, that's true! Based on what I've tested so far, PseudobulkSpace keeps the unambiguous .obs values and changes it's shape to nr_perturbations x nr_obs, just as CentroidSpace does. DBSCANSpace and KMeansSpace keep the original shape (nr_cells x nr_obs), since they only calculate clusters for all datapoints. DiscriminatorClassifier works a bit differently anyway, since we train the model and then obtain the embeddings from the NN layer, which then of course also have the shape of the last layer. But I'll definitely keep an eye on making everything as consistent as possible. Or do you already have ideas on how to improve consistency?

@Zethson
Copy link
Member

Zethson commented Dec 6, 2023

I thought about doing that, but the reason I decided against it is that we can only keep the .obs columns with the same value for all cells of one perturbation. We could of course iterate through all .obs and test if the respective .obs variable fulfills this condition, but from what I've seen, it's quite common for perturbation datasets to have one .obs column for each perturbation (such as for the Norman dataset, >100 obs variables). Maybe I'm overestimating, but I'm concerned that it could scale up pretty quickly in terms of computational complexity?

I think that there's probably pretty efficient pandas code for that. I'd encourage you to give it a try.

Yes, that's true! Based on what I've tested so far, PseudobulkSpace keeps the unambiguous .obs values and changes it's shape to nr_perturbations x nr_obs, just as CentroidSpace does. DBSCANSpace and KMeansSpace keep the original shape (nr_cells x nr_obs), since they only calculate clusters for all datapoints. DiscriminatorClassifier works a bit differently anyway, since we train the model and then obtain the embeddings from the NN layer, which then of course also have the shape of the last layer. But I'll definitely keep an eye on making everything as consistent as possible. Or do you already have ideas on how to improve consistency?

Ahh, that's fair. So let's try and stick to the PerturbationSpace and ClusterSpace differentiations. These should probably be two distinct concepts with two distinct interfaces and matrix dimensions.

pertpy/tools/_perturbation_space/_simple.py Outdated Show resolved Hide resolved
pertpy/tools/_perturbation_space/_simple.py Outdated Show resolved Hide resolved
@Lilly-May Lilly-May merged commit e0ce61d into main Dec 8, 2023
4 of 7 checks passed
@Lilly-May Lilly-May deleted the fix/centroid_space_adata_annotation branch December 13, 2023 16:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants