ENH Generic design support using `formulaic` #328

BorisMuzellec · 2024-10-29T16:14:44Z

Reference Issue or PRs

Closes #181
Closes #213
Closes #309
Closes #272
Closes #202
Closes #184
Closes #125
Will unblock scverse/pertpy#610

What does your PR implement? Be specific.

This PR adds support for general design matrices, built with formulaic and utilities ported from pertpy.

`DeseqDataSet`

- Designs should now be provided to `DeseqDataSet` using the `design` argument, either as a string representing a formulaic formula (e.g. `"~condition + treatment"`, `"~condition + condition:treatment"`, `"~condition + exp(cofactor)"`...), or as an ndarray directly corresponding to a design matrix.
- `design_factors` is still supported but throws a `DeprecationWarning`.
- `continuous_factors` is deprecated, as continuous type inference is handled by formulaic.
- `ref_level` is deprecated.
- Due to new decorated methods, `DeseqDataSet` is no longer picklable. A `to_picklable_anndata()` method was added to allow users to pickle results for later use.
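
To make the backward-compatibility path concrete, here is a minimal sketch of how a deprecated `design_factors` argument could be mapped onto a formula string while emitting a `DeprecationWarning`. The helper name `resolve_design` and the "one additive term per factor" mapping are assumptions made for illustration; this is not PyDESeq2's actual code.

```python
import warnings


def resolve_design(design=None, design_factors=None):
    """Return a formula string, accepting the deprecated ``design_factors``.

    Hypothetical helper: the name and the exact mapping are assumptions made
    for illustration, not PyDESeq2's implementation.
    """
    if design_factors is not None:
        warnings.warn(
            "design_factors is deprecated, use design instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        if isinstance(design_factors, str):
            design_factors = [design_factors]
        # Assumed mapping: one additive term per legacy factor.
        return "~" + " + ".join(design_factors)
    if design is None:
        raise ValueError("Either design or design_factors must be provided.")
    return design


# Legacy call: still works, but a DeprecationWarning is recorded.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    legacy_formula = resolve_design(design_factors=["condition", "treatment"])

# New-style call: the formula is passed through unchanged.
new_formula = resolve_design(design="~condition + condition:treatment")
```

The warning is raised with `stacklevel=2` so that it points at the caller's line rather than at the helper itself.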

`DeseqStats`

- Default contrasts are no longer supported, as they lead to too many errors.
- Contrasts may be provided as before for categorical variables (e.g. `["treatment", "test", "control"]`), or directly in the form of a contrast vector (a numpy array).
- For now, contrasts for continuous variables must be specified directly with a contrast vector.
- `lfc_shrink` no longer supports a default `coeff` argument.
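
The relationship between the two contrast forms can be sketched as follows: a `[factor, tested, reference]` triple is translated into a vector with one entry per design-matrix column. This is a dependency-free illustration under treatment coding; the function name, the column names, and the `reference_levels` mapping are all hypothetical and do not mirror PyDESeq2 internals.

```python
def categorical_contrast_to_vector(contrast, columns, reference_levels):
    """Turn ``[factor, tested, reference]`` into a contrast vector.

    ``columns`` are design-matrix column names under treatment coding
    (e.g. ``"treatment[T.test]"``) and ``reference_levels`` maps each factor
    to its baseline level. All names here are hypothetical.
    """
    factor, tested, ref = contrast
    vector = [0.0] * len(columns)
    # The baseline level has no column of its own: the intercept absorbs it.
    if tested != reference_levels[factor]:
        vector[columns.index(f"{factor}[T.{tested}]")] = 1.0
    if ref != reference_levels[factor]:
        vector[columns.index(f"{factor}[T.{ref}]")] = -1.0
    return vector


columns = ["Intercept", "condition[T.B]", "treatment[T.high]", "treatment[T.test]"]
reference_levels = {"condition": "A", "treatment": "control"}

# "test" vs. the baseline level "control": a single +1 entry.
test_vs_control = categorical_contrast_to_vector(
    ["treatment", "test", "control"], columns, reference_levels
)
# "test" vs. the non-baseline level "high": +1 and -1 entries.
test_vs_high = categorical_contrast_to_vector(
    ["treatment", "test", "high"], columns, reference_levels
)
```

Passing the resulting vector directly is what the second contrast form amounts to; it is also the only option when the variable is continuous and has no levels to name.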

BREAKING CHANGE: Python 3.9 is no longer supported.

TODO:

- Update the example notebook gallery on RTD.
  - ~~Failure seems to be due to the fact that it's not possible to pickle classes with decorated functions.~~ Solved using `to_picklable_anndata`.
- Add more tests
  - Port the `_formulaic.py` tests from pertpy
  - Test inputting a design matrix directly (and not as a formula).

grst

I really prefer this over the old approach, many thanks for moving this forward @BorisMuzellec!

grst · 2024-10-30T15:31:33Z

pydeseq2/_formulaic.py

+
+
+@dataclass
+class FactorMetadata:


There's also quite a bunch of test cases for the _formulaic.py file and the LinearModelBase in pertpy. Would be great if you could also port them here!

grst · 2024-10-30T15:46:50Z

pydeseq2/dds.py

+    @property
+    def variables(self):
+        """Get the names of the variables used in the model definition."""
+        try:
+            return self.obsm["design_matrix"].model_spec.variables_by_source["data"]
+        except AttributeError:
+            raise ValueError(
+                """Retrieving variables is only possible if the model was initialized
+                using a formula."""
+            ) from None


Maybe this stuff could really become a Mixin as you suggested, then we can more easily reuse it across pyDESeq2 and pertpy.

If it's simpler for pertpy, then yes, I can put this in a Mixin.

There's a difference though, because here the design is stored in .obsm["design_matrix"] as opposed to .design in pertpy.

I'm not entirely sure yet what the best solution is. Maybe we just leave it as is for now, and when I look into refactoring pertpy I can propose a PR with changes to pyDESeq2 if required.

grst · 2024-10-30T15:53:48Z

pydeseq2/ds.py

+    def cond(self, **kwargs):
+        """
+        Get a contrast vector representing a specific condition.
+
+        Parameters
+        ----------
+        **kwargs
+            Column/value pairs.
+
+        Returns
+        -------
+        ndarray
+            A contrast vector that aligns to the columns of the design matrix.
+        """
+        cond_dict = kwargs
+        if not set(cond_dict.keys()).issubset(self.dds.variables):
+            raise ValueError(
+                """You specified a variable that is not part of the model. Available
+                variables: """
+                + ",".join(self.dds.variables)
            )
-            new_ref_idx = self.LFC.columns.get_loc(f"{factor}_{ref}_vs_{old_ref}")
-            self.contrast_vector[new_alternative_idx] = 1
-            self.contrast_vector[new_ref_idx] = -1
+        for var in self.dds.variables:
+            if var in cond_dict:
+                self.dds._check_category(var, cond_dict[var])
+            else:
+                cond_dict[var] = self.dds._get_default_value(var)
+        df = pd.DataFrame([kwargs])
+        return self.dds.obsm["design_matrix"].model_spec.get_model_matrix(df).iloc[0]


Depends on you if you want to adopt the .cond() syntax for building contrasts (which was originally devised by @const-ae in glmGamPoi).

A lot of the code in the _formulaic.py module is just around finding the baseline level for each condition such that this works nicely in the case of interaction terms.

In case you were just to support [column, baseline, treatment] and numpy array contrasts, you could probably come up with a way simpler solution.

I don't have a strong opinion on this, whatever offers the most flexibility is best.

At first I tried simplifying the code in _formulaic.py to keep only what I need (mainly retrieving levels for a given factor + whether it has numerical or categorical type), but I ended up keeping everything because I didn't find a straightforward simplification.

> A lot of the code in the _formulaic.py module is just around finding the baseline level for each condition such that this works nicely in the case of interaction terms.

How would you define a contrast to test interaction terms using .cond()? Right now I don't see how to do it without using a numerical factor.

Let's consider a design ~ disease * timepoint

| disease | timepoint |
| --- | --- |
| healthy | T0 |
| healthy | T1 |
| diseased | T0 |
| diseased | T1 |

which gives us the following coefficients:
Intercept, diseased, T1, T1:diseased

Then you could test:

- diseased vs. healthy: `contrast = dds.cond(disease="diseased") - dds.cond(disease="healthy")`
- T1 vs. T0: `contrast = dds.cond(timepoint="T1") - dds.cond(timepoint="T0")`
- Interaction T1:diseased: `contrast = ((dds.cond(timepoint="T1", disease="diseased") - dds.cond(timepoint="T0", disease="diseased")) - (dds.cond(timepoint="T1", disease="healthy") - dds.cond(timepoint="T0", disease="healthy")))`

I admit there's no suitable documentation for this in pertpy. But in principle, using this "DSL", it should be possible to specify arbitrary contrasts.
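
To check that the three contrasts above select the intended coefficients, here is a small dependency-free sketch. The `cond` function below is a hand-written stand-in that hard-codes the `~ disease * timepoint` design with the column order Intercept, diseased, T1, T1:diseased; it is not the `.cond()` method from this PR, and the baseline levels (`healthy`, `T0`) are assumed.

```python
# Baseline (reference) level of each factor, assumed for this sketch.
BASELINE = {"disease": "healthy", "timepoint": "T0"}


def cond(**levels):
    """Design-matrix row for one condition under ``~ disease * timepoint``.

    Columns: Intercept, diseased, T1, T1:diseased. Factors that are not
    given fall back to their baseline level.
    """
    unknown = set(levels) - set(BASELINE)
    if unknown:
        raise ValueError(f"Not part of the model: {sorted(unknown)}")
    full = {**BASELINE, **levels}
    diseased = 1 if full["disease"] == "diseased" else 0
    t1 = 1 if full["timepoint"] == "T1" else 0
    return [1, diseased, t1, diseased * t1]


def minus(u, v):
    """Element-wise difference of two contrast vectors."""
    return [a - b for a, b in zip(u, v)]


disease_effect = minus(cond(disease="diseased"), cond(disease="healthy"))
time_effect = minus(cond(timepoint="T1"), cond(timepoint="T0"))
interaction = minus(
    minus(
        cond(timepoint="T1", disease="diseased"),
        cond(timepoint="T0", disease="diseased"),
    ),
    minus(
        cond(timepoint="T1", disease="healthy"),
        cond(timepoint="T0", disease="healthy"),
    ),
)
```

Each difference cancels the intercept, and the difference of differences also cancels both main effects, leaving a single entry on the interaction column. This is why the baseline lookup matters: unspecified factors must resolve to the same level on both sides of a subtraction.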

grst

LGTM

grst · 2024-11-16T19:01:34Z

pydeseq2/dds.py

-            # Also check continuous factors
-            if self.continuous_factors is not None:
-                self.continuous_factors = replace_underscores(self.continuous_factors)
+        assert isinstance(self.design, (str, pd.DataFrame)) or isinstance(


nitpick: if this is meant as a user-facing error message, it should probably be a ValueError instead of an assertion
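
One reason behind this nitpick is that `assert` statements are skipped entirely when Python runs with the `-O` flag, so a user-facing check written as an assertion can silently disappear. A minimal sketch of the explicit alternative follows; the function name is hypothetical, and a plain list stands in for the DataFrame / ndarray case so that the example has no third-party dependencies.

```python
def check_design_type(design):
    """Reject designs that are neither a formula string nor matrix-like.

    Hypothetical helper. A list of rows stands in for the design-matrix case
    (an assumption of this example, made to avoid third-party imports).
    """
    if not isinstance(design, (str, list)):
        raise ValueError(
            "design must be a formula string or a design matrix, "
            f"got {type(design).__name__}."
        )
    return design


# An invalid type yields an explicit error message for the user.
try:
    check_design_type(42)
except ValueError as err:
    message = str(err)

# A formula string passes through unchanged.
accepted = check_design_type("~condition")
```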

umarteauowkin

Thanks for this great PR, I'm convinced :) Just one thing that is not clear to me is what should be passed to the LFC shrinkage: is it really a column of the design? To me it should be LFC @ contrast, but maybe this is not relevant for this PR; I just spotted it since you made the modification.
Finally, could you add a comment on what the Materializer is supposed to do? (i.e., just one line of comment on what a materializer is :))
Thanks again!

umarteauowkin · 2024-11-17T08:11:20Z

examples/plot_minimal_pydeseq2_pipeline.py

@@ -278,11 +282,11 @@
 # LFC shrinkage. This is implemented by the :meth:`lfc_shrink() <DeseqStats.lfc_shrink>`
 # method.

-stat_res.lfc_shrink(coeff="condition_B_vs_A")
+ds.lfc_shrink(coeff="condition[T.B]")


Maybe add a comment to explain what this means?
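
For context, `condition[T.B]` is the name of a treatment-coded design-matrix column: it is 1 for samples whose `condition` is `B` and 0 otherwise, so its coefficient measures `B` relative to the reference level. The sketch below reproduces that naming by hand; it is not formulaic's implementation, and picking the first level in sorted order as the reference is an assumption of this example.

```python
def treatment_code(name, values, reference=None):
    """Dummy-code one categorical column with treatment coding.

    Returns ``(column_names, rows)``. The reference level defaults to the
    first level in sorted order, which is an assumption of this sketch.
    """
    levels = sorted(set(values))
    reference = levels[0] if reference is None else reference
    others = [lvl for lvl in levels if lvl != reference]
    # One indicator column per non-reference level, plus the intercept.
    names = ["Intercept"] + [f"{name}[T.{lvl}]" for lvl in others]
    rows = [[1] + [int(v == lvl) for lvl in others] for v in values]
    return names, rows


names, rows = treatment_code("condition", ["A", "B", "A", "B"])
```

With levels `A` and `B`, the only non-intercept column is `condition[T.B]`, which plays the role that `condition_B_vs_A` had before this change.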

BorisMuzellec · 2024-11-18T15:30:09Z

> Just one thing that is not clear for me is what should be put in the LFC shrinkage: is it really a column of the design?

Yes: LFC shrinkage performs MAP estimation with a prior on a given LFC coefficient (i.e. column) that the user must specify.

In principle, I guess it would be possible to do the same thing with contrast @ LFC, I'm just not sure what it would mean. (Also, would the prior be amenable to linear combination?)
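
For readers unfamiliar with the term, MAP (maximum a posteriori) estimation can be written generically as below. This is a textbook sketch, not PyDESeq2's exact objective: $\ell$ stands for the log-likelihood of the counts given the coefficient vector $\beta$, $j$ for the index of the coefficient passed to `lfc_shrink`, and $p$ for the prior placed on that coefficient. All three symbols are notation introduced here.

$$
\hat{\beta} = \arg\max_{\beta} \left[ \ell(\beta) + \log p(\beta_j) \right]
$$

Shrinking a contrast $c$ instead would amount to placing the prior on $c^\top \beta$ rather than on a single $\beta_j$, which is the question left open in this thread.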

BorisMuzellec · 2024-11-18T15:50:57Z

Thanks for your reviews @grst @umarteauowkin! I'm merging this :)

BorisMuzellec added 18 commits October 28, 2024 15:18

feat: set design matrices using formulaic (WIP)

2bfb926

feat: set design matrices using formulaic (WIP)

c5f4068

feat: check that there are no NaNs in the design matrix

90e7305

feat: implement support for pairwise categorical contrasts

6f2bdd6

feat: allow directly inputing a contrast vector

c91a477

feat: allow setting design factors as in versions < 0.5.* for backward compatibiliy, but throw deprecation warning

ccedb5a

feat: improve summary message when a contrast vector was directly provided

949c483

docs: update docstrings to reflect deprecation of design_factors

e3964d4

chore: deprecate ref_level

a17acbc

chore: deprecate continuous_factors

f66ded8

chore: remove build_design_matrix and remove_underscores utils, as they are no longer used

86db819

refactor: update lfc_shrink to reflect new design format

7502a2c

tests: update edge case tests

4d7d638

tests: update main tests

fef3068

docs: remove build_design_matrix from docs

f06390a

docs: add missing docstrings in new DeseqStats methods

46ac4c4

feat: add formulaic utils copied from pertpy

2c11548

build!: discontinue python 3.9

43d75ba

jeandut mentioned this pull request Oct 29, 2024

Adding interaction terms to the design matrix #181

Closed

6 tasks

BorisMuzellec added 7 commits October 29, 2024 17:24

docs: update minimal example

4672329

docs: update data loading example

6861940

docs: update step by step example

146e4d9

fix: typo

9486eba

feat: implement a to_picklable_anndata() method to allow saving a dds without its unpicklable attributes

4a298a2

docs: add API documentation for to_picklable_anndata

172d7c5

docs: fix rst formatting

3e90db5

grst reviewed Oct 30, 2024

refactor: throw a ValueError instead of an AssertionError when numerical constrasts have incorrect shapes

4cd7376

This was referenced Oct 31, 2024

Use sphinx-autodoc-typehints to fill in type in docstrings #329

Open

Save / load DeseqDataSet states and write anndata to disk as h5ad #330

Open

BorisMuzellec added 2 commits November 4, 2024 14:55

test: ignore anndata ImplicitModificationWarnings in test suite

370564d

test: add _formulaic test suite from pertpy

a404a3b

BorisMuzellec mentioned this pull request Nov 4, 2024

MAINT Deprecated pandas.DataFrameGroupBy.grouper #332

Merged

BorisMuzellec mentioned this pull request Nov 12, 2024

[Feature request] one vs. all others #168

Open

BorisMuzellec added 3 commits November 13, 2024 11:24

refactor: move .cond() to DeseqDataSet so that it can be used at DeseqStats initialization

7da5320

docs: add more details on providing a design matrix directly

9b5a8f9

tests: add a few tests when a design matrix is directly provided

8a70263

BorisMuzellec marked this pull request as ready for review November 14, 2024 16:44

BorisMuzellec requested review from maikia and umarteauowkin as code owners November 14, 2024 16:44

BorisMuzellec requested a review from grst November 15, 2024 09:12

grst approved these changes Nov 16, 2024

umarteauowkin approved these changes Nov 18, 2024

refactor: throw ValueError instead of AssertionError in case of design with invalid type

97e72f5

BorisMuzellec added 2 commits November 18, 2024 16:33

docs: improve lfc_shrink example

81b5db2

docs: document the new materializer attributes

28f3a6f

BorisMuzellec merged commit ed62d06 into main Nov 18, 2024
9 checks passed

BorisMuzellec deleted the wilkinson_formulae branch November 18, 2024 15:51

This was referenced Nov 19, 2024

Extend PyDESeq2 class to support arbitrary contrasts and designs scverse/pertpy#610

Open

[BUG] continuous_factor errors out in _build_contrast #313

Closed

Improvement for DeseqDataSet documentation #316

Closed
