Integration of dorothea and progeny #1724

PauBadiaM · 2021-03-09T14:02:43Z

Hi everyone,

Seeing how many new single-cell and spatial tools are being developed in Python, and how much we increasingly rely on it in general and on scanpy in particular, we at saezlab decided to re-implement in Python our tools for estimating pathway and transcription factor (TF) activities (Dorothea and Progeny). Here is a first draft of both:
https://github.com/saezlab/dorothea-py
https://github.com/saezlab/progeny-py

Our tools take gene expression as input and generate matrices of TF and pathway activities. They can be understood as:

Prior-knowledge dimensionality reduction methods (obsm). Examples of usage:
- Used as input for NN
- Used as input for integration methods
New data assays (X). Examples of usage:
- Plot feature activities in projections such as PCA or UMAP
- Plot feature activities in heat-maps, clustermaps, violin plots, etc
- Differences between groups can be modeled to find significant differences
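
To make the input and output shapes concrete, here is a toy sketch of turning a cells x genes expression matrix into a cells x pathways activity matrix with a plain weighted sum. The weights and values are invented, and this is only an illustration of the shapes involved; it is not meant to reproduce the actual scoring in dorothea-py or progeny-py.

```python
import numpy as np
import pandas as pd

# Toy expression matrix: 3 cells x 4 genes (values are made up).
expr = pd.DataFrame(
    np.array([[1.0, 0.0, 2.0, 1.0],
              [0.0, 3.0, 1.0, 0.0],
              [2.0, 1.0, 0.0, 1.0]]),
    index=["cell1", "cell2", "cell3"],
    columns=["geneA", "geneB", "geneC", "geneD"],
)

# Hypothetical prior-knowledge weights: 4 genes x 2 pathways.
weights = pd.DataFrame(
    np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.5, -1.0],
              [0.0, 0.5]]),
    index=expr.columns,
    columns=["pathwayX", "pathwayY"],
)

# Weighted sum over genes: one score per cell and pathway (3 x 2).
activities = expr @ weights
```

The result has one row per cell but far fewer columns than the input, which is why it can be read either as a low-dimensional embedding or as a new set of features.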

Because of this duality, integrating our tools into scanpy is not straightforward. If we store the activities in obsm, they can be used as a dimensionality reduction embedding, but we lose access to all the fantastic plotting functions that are based on X. If we instead add our activities to X, they have a very different distribution from gene expression, and gene names would overlap with TF names. One solution would be a separate .layer to store these matrices, but layers must have the same dimensions as X. Another workaround would be to store them in .raw, but that forces the user to remove its previous contents, and .raw is used by default in some methods, which could cause problems.

What would be a smart solution to integrate our tools in your universe?


LuckyMD · 2021-03-09T18:29:11Z

Hi @PauBadiaM,

I have always viewed Dorothea and Progeny as methods to aid in the interpretation of my data. Hence, I would assume they are most useful as a targeted approach to plot the activity of a particular TF or pathway. I would probably find this most useful as a function where I can either ask for the activity of a single TF/pathway, or get the activity score that explains the most variation or correlates with a particular PC. Hence I would err on the side of storing the activities in .obsm and then adding some functionality for analysing which activity scores are most useful to a user, since it will be hard for users to go through all of the data for further analysis. You can always write a wrapper around things like sc.tl.rank_genes_groups in which the .obsm data is copied into the X of a new adata_tmp to produce the rank genes groups output.

ivirshup · 2021-03-10T03:20:54Z

I think giving plotting functions access to entries in obsm is a good idea. This is definitely on our roadmap, and implementation has started (scverse/anndata#342), but it is a bit stalled at the moment.

Am I correct in understanding that being able to do things like:

adata.obsm["pathways"] = pathway_dataframe_func(adata)
sc.pl.heatmap(adata, groupby="leiden", obsm="pathways")
sc.pl.umap(adata, color=["pathways/pathway-1", "leiden"])

would solve most of the barriers you're facing?

PauBadiaM · 2021-03-10T08:01:26Z

Thanks for the quick responses @LuckyMD and @ivirshup.
If obsm entries were accessible for plotting functions that would be fantastic. It would really solve all our problems. Once this is implemented I would only need to write a wrapper to model differences of activities between groups and that's it.
Looking forward to this update, thanks!

giovp · 2021-03-10T09:37:27Z

Along the same lines, we wrote a very simple extract function in squidpy that we ended up using quite a lot: https://squidpy.readthedocs.io/en/latest/api/squidpy.pl.extract.html

see for instance a usage example here: https://squidpy.readthedocs.io/en/latest/auto_examples/image/compute_texture_features.html#sphx-glr-auto-examples-image-compute-texture-features-py

I think what you are working on in scverse/anndata#342 has a much broader scope and is more useful in general, for multi-modal data etc. But if you think sq.pl.extract() could be a quick and dirty way to get the results you want, we could think about moving it here?

ivirshup · 2021-03-10T09:49:38Z

I was thinking a quick thing to do would be to add an obsm argument to the plots (like layer). In this case we probably just wouldn't allow using values from X and obsm[whatever] together for the moment.

PauBadiaM · 2021-03-25T16:35:17Z

Thank you all for the feedback!

In the end the best solution has been to store activities in .obsm and then use the plotting functions via an extract function like in squidpy.

Now that both tools are AnnData compatible, should I open a pull request to add them into the Ecosystem?

ivirshup · 2021-03-26T03:45:30Z

> Now that both tools are AnnData compatible, should I open a pull request to add them into the Ecosystem?

Please do!

PauBadiaM added the Question label Mar 9, 2021

ivirshup closed this as completed Mar 26, 2021
