Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH Generic design support using formulaic #328

Merged
merged 34 commits into from
Nov 18, 2024
Merged

Conversation

BorisMuzellec
Copy link
Collaborator

@BorisMuzellec BorisMuzellec commented Oct 29, 2024

Reference Issue or PRs

Closes #181
Closes #213
Closes #309
Closes #272
Closes #202
Closes #184
Closes #125
Will unblock scverse/pertpy#610

What does your PR implement? Be specific.

This PR implements support for general design matrices thanks to formulaic, and using utils from pertpy.

DeseqDataSet

  • Designs should now be provided to DeseqDataSet using the design argument, either in the form of a string representing a formulaic formula (e.g. "~condition + treatment", "~condition + condition:treatment", "~condition + exp(cofactor)"...), or an ndarray directly corresponding to a design matrix.
  • design_factors is still supported but throws a DeprecationWarning
  • continuous_factors is deprecated, as continuous type inference is handled by formulaic
  • ref_level is deprecated
  • Due to new decorated methods, DeseqDataSet is no longer picklable. A to_picklable_anndata() method was added to allow users to pickle results for later use.

DeseqStats

  • Default contrasts are no longer supported, as they lead to too many errors
  • Contrasts may be provided as before for categorical variables (e.g. ["treatment", "test", "control"]), or directly in the form of a contrast vector (a numpy array).
  • For now, contrasts for continuous variables are directly specified with a contrast vector.
  • lfc_shrink no longer supports a default coef argument

BREAKING CHANGE: python 3.9 is no longer supported.

TODO:

  • Update the example notebook gallery on RTD.
    • Failure seems to be due to the fact that it's not possible to pickle classes with decorated functions. Solved using to_picklable_anndata.
  • Add more tests
    • Port the _formulaic.py tests from pertpy
    • Test inputing a design matrix directly (and not as a formula).

Copy link

@grst grst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really prefer this over the old appraoch, many thanks for moving this forward @BorisMuzellec!



@dataclass
class FactorMetadata:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's also quite a bunch of test cases for the _formulaic.py file and the LinearModelBase in pertpy. Would be great if you could also port them here!

pydeseq2/dds.py Show resolved Hide resolved
pydeseq2/dds.py Show resolved Hide resolved
pydeseq2/dds.py Show resolved Hide resolved
Comment on lines +318 to +327
@property
def variables(self):
"""Get the names of the variables used in the model definition."""
try:
return self.obsm["design_matrix"].model_spec.variables_by_source["data"]
except AttributeError:
raise ValueError(
"""Retrieving variables is only possible if the model was initialized
using a formula."""
) from None
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this stuff could really become a Mixin as you suggested, then we can more easily reuse it across pyDESeq2 and pertpy.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's simpler for perpty than yes I can put this in a Mixin.

There's a difference though, because here the design is stored in .obsm["design_matrix"] as opposed to .design in pertpy.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not entirely sure yet what's the best solution. Maybe we just leave it as is now, and when I'll look into refactoring pertpy I can propose a PR with changes to pyDESeq2 if required.

pydeseq2/ds.py Outdated Show resolved Hide resolved
pydeseq2/ds.py Outdated
Comment on lines 618 to 645
def cond(self, **kwargs):
"""
Get a contrast vector representing a specific condition.

Parameters
----------
**kwargs
Column/value pairs.

Returns
-------
ndarray
A contrast vector that aligns to the columns of the design matrix.
"""
cond_dict = kwargs
if not set(cond_dict.keys()).issubset(self.dds.variables):
raise ValueError(
"""You specified a variable that is not part of the model. Available
variables: """
+ ",".join(self.dds.variables)
)
new_ref_idx = self.LFC.columns.get_loc(f"{factor}_{ref}_vs_{old_ref}")
self.contrast_vector[new_alternative_idx] = 1
self.contrast_vector[new_ref_idx] = -1
for var in self.dds.variables:
if var in cond_dict:
self.dds._check_category(var, cond_dict[var])
else:
cond_dict[var] = self.dds._get_default_value(var)
df = pd.DataFrame([kwargs])
return self.dds.obsm["design_matrix"].model_spec.get_model_matrix(df).iloc[0]
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Depends on you if you want to adopt the .cond() syntax for building contrasts (which was originally devised by @const-ae in glmGamPoi).

A lot of the code in the _formulaic.py module is just around finding the baseline level for each condition such that this works nicely in the case of interaction terms.

In case you were just to support [column, baseline, treatment] and numpy array contrasts, you could probably come up with a way simpler solution.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a strong opinion on this, whatever offers the most flexibility is best.

At first I tried simplifying the code in _formulaic.py to keep only what I need (mainly retrieving levels for a given factor + whether it has numerical or categorical type), but I ended up keeping everything because I didn't find a straightforward simplification.

A lot of the code in the _formulaic.py module is just around finding the baseline level for each condition such that this works nicely in the case of interaction terms.

How would you define a contrast to test interaction terms using .cond()? Right now I don't see how to do it without using a numerical factor.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's consider a design ~ disease * timepoint

disease timepoint
healthy T0
healthy T1
diseased T0
diseased T1

which gives us the following coefficients:
Intercept, diseased, T1, T1:diseased

Then you could test:

diseased vs. healthy

contrast = dds.cond(disease="diseased") - dds.cond(disease="healthy")

T1 vs. T0

contrast = dds.cond(timepoint="T1") - dds.cond(timepoint="T0")

Interaction T1:diseased

contrast = (
    (dds.cond(timepoint="T1", disease="diseased") - dds.cond(timepoint="T0", disease="diseased")) - 
    (dds.cond(timepoint="T1", disease="healthy") - dds.cond(timepoint="T0", disease="healthy"))
)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I admit there's no suitable documentation for this in pertpy. But in principle, using this "DSL", it should be possible to specify arbitrary contrasts.

@BorisMuzellec BorisMuzellec marked this pull request as ready for review November 14, 2024 16:44
Copy link

@grst grst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

pydeseq2/dds.py Outdated
# Also check continuous factors
if self.continuous_factors is not None:
self.continuous_factors = replace_underscores(self.continuous_factors)
assert isinstance(self.design, (str, pd.DataFrame)) or isinstance(
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: if this is meant as a user-facing error message, it should probably be a ValueError instead of an assertion

Copy link
Collaborator

@umarteauowkin umarteauowkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this great PR, I m convinced :) Just one thing that is not clear for me is what should be put in the LFC shrinkage: is it really a column of the design ? For me it should be LFC @ contrast, but maybe this is not relavant for this PR, I just spotted it since you made the modification.
Finally, could you add a comment on what the Materializer is supposed to do ? (i.e., just one line of comment on what a materializer is :))
Thanks again !

@@ -278,11 +282,11 @@
# LFC shrinkage. This is implemented by the :meth:`lfc_shrink() <DeseqStats.lfc_shrink>`
# method.

stat_res.lfc_shrink(coeff="condition_B_vs_A")
ds.lfc_shrink(coeff="condition[T.B]")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add a comment to explain what this means ?

@BorisMuzellec
Copy link
Collaborator Author

Just one thing that is not clear for me is what should be put in the LFC shrinkage: is it really a column of the design ?

Yes: LFC shrinkage performs MAP estimation with a prior on a given LFC coefficient (i.e. column) that the user must specify.

In principle, I guess it would be possible to do the same thing with contrast @ LFC, I'm just not sure what it would mean. (Also, would the prior be amenable to linear combination?)

@BorisMuzellec
Copy link
Collaborator Author

Thanks for your reviews @grst @umarteauowkin ! I'm merging this :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants