Integration of dorothea and progeny #1724

Closed
PauBadiaM opened this issue Mar 9, 2021 · 7 comments

@PauBadiaM
Contributor

Hi everyone,

Seeing how many new single-cell and spatial tools are being developed in Python, and how much we are using Python in general and scanpy in particular, we at saezlab decided to re-implement our tools for estimating pathway and transcription factor (TF) activity (Dorothea and Progeny) in Python. Here is a first draft of both tools:
https://github.com/saezlab/dorothea-py
https://github.com/saezlab/progeny-py

Our tools take gene expression as input and generate matrices of TF and pathway activities. They can be understood as:

  1. Prior-knowledge dimensionality reduction methods (obsm). Examples of usage:
    • Used as input for NN
    • Used as input for integration methods
  2. New data assays (X). Examples of usage:
    • Plot feature activities in projections such as PCA or UMAP
    • Plot feature activities in heat-maps, clustermaps, violin plots, etc
    • Differences between groups can be modeled to find significant differences

Because of this duality, integrating our tools into scanpy is not straightforward. If we store the activities in obsm, they can be used as a dimensionality reduction embedding, but then we lose access to all the fantastic plotting functions based on X. If we instead add our activities to X, they have a very different distribution than gene expression, and there would be an overlap of names between genes and TFs. One solution would be a separate .layer to store these matrices, but layers must have the same dimensions as X. Another workaround would be to store them in .raw, but that forces the user to remove its previous contents, and .raw is used by default in some methods, which could cause problems.

What would be a smart solution to integrate our tools in your universe?

@LuckyMD
Contributor

LuckyMD commented Mar 9, 2021

Hi @PauBadiaM,

I have always viewed Dorothea and Progeny as methods to aid in the interpretation of my data. Hence, I would assume they are most useful as a targeted approach to plot the activity of a particular TF or pathway. I would probably find this most useful as a function where I can either ask for the activity of a single TF/pathway, or get the activity score that explains the most variation or correlates with a particular PC. So I would err on the side of storing the activities in .obsm and then adding some functionality for analysing which activity scores are most useful to a user. In the end it will be hard for users to go through all of the data for further analysis. You can always write a wrapper around things like sc.tl.rank_genes_groups, where the .obsm data is copied into a new adata_tmp.X for the rank genes groups output.

@ivirshup
Member

ivirshup commented Mar 10, 2021

I think making entries in obsm accessible to plotting functions is a good idea. This is definitely on our roadmap, and has started to be implemented (scverse/anndata#342), but is a bit stalled at the moment.

Am I correct in understanding that being able to do things like:

adata.obsm["pathways"] = pathway_dataframe_func(adata)
sc.pl.heatmap(adata, groupby="leiden", obsm="pathways")
sc.pl.umap(adata, color=["pathways/pathway-1", "leiden"])

would solve most of the barriers you're facing?
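
The `"pathways/pathway-1"` addressing proposed above could be resolved along these lines. This is a hypothetical sketch of the lookup only: `resolve_color_key` is an invented name, it is not part of scanpy or anndata, and plain pandas objects stand in for `adata.obs` and `adata.obsm`.

```python
import pandas as pd


def resolve_color_key(obs, obsm, key):
    """Look up `key` as an obs column, or as '<obsm_key>/<column>' in obsm."""
    if key in obs.columns:
        return obs[key]
    obsm_key, sep, column = key.partition("/")
    if sep and obsm_key in obsm and column in obsm[obsm_key].columns:
        return obsm[obsm_key][column]
    raise KeyError(key)


# Stand-ins for adata.obs and adata.obsm (toy values).
obs = pd.DataFrame({"leiden": ["0", "1", "0"]}, index=["c0", "c1", "c2"])
obsm = {
    "pathways": pd.DataFrame({"pathway-1": [0.5, -1.0, 2.0]}, index=obs.index)
}

pathway_values = resolve_color_key(obs, obsm, "pathways/pathway-1")
cluster_values = resolve_color_key(obs, obsm, "leiden")
```

Checking obs first keeps existing keys such as `"leiden"` working unchanged, and the prefix makes it explicit which obsm entry a column comes from.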

@PauBadiaM
Contributor Author

PauBadiaM commented Mar 10, 2021

Thanks for the quick responses @LuckyMD and @ivirshup.
If obsm entries were accessible for plotting functions that would be fantastic. It would really solve all our problems. Once this is implemented I would only need to write a wrapper to model differences of activities between groups and that's it.
Looking forward to this update, thanks!

@giovp
Member

giovp commented Mar 10, 2021

Along the same lines, we wrote a very simple extract function in squidpy that we ended up using quite a lot: https://squidpy.readthedocs.io/en/latest/api/squidpy.pl.extract.html

see for instance a usage example here: https://squidpy.readthedocs.io/en/latest/auto_examples/image/compute_texture_features.html#sphx-glr-auto-examples-image-compute-texture-features-py

I think what you are working on in scverse/anndata#342 has a much broader scope and is in general more useful for multi-modal data etc., but if you think sq.pl.extract() could be a quick and dirty way to get the results you want, we could think about moving it here?

@ivirshup
Member

I was thinking a quick thing to do would be to add an obsm argument to the plots (like layer). In this case we probably just wouldn't allow using values from X and obsm[whatever] together for the moment.

@PauBadiaM
Contributor Author

Thank you all for the feedback!

In the end the best solution has been to store activities in .obsm and then use the plotting functions via an extract function like in squidpy.

Now that both tools are AnnData compatible, should I open a pull request to add them into the Ecosystem?

@ivirshup
Member

> Now that both tools are AnnData compatible, should I open a pull request to add them into the Ecosystem?

Please do!
