Issue19: add support for Linearmodels #144

jpweytjens · 2024-12-15T17:40:56Z

This PR adds initial support for the linearmodels package, focusing on panel data models.

Implementation Details

The implementation differs slightly from other models due to how linearmodels handles data. Since fitted linearmodels models don't retain the original DataFrame (discussed in #19), users need to manually wrap the fitted model in a ModelLinearmodels class:

from linearmodels.panel import PanelOLS
import marginaleffects as me

model = PanelOLS.from_formula(formula, data).fit()
mod = me.model_linearmodels.ModelLinearmodels(model, df)

This approach preserves access to all newdata options (e.g., mean).

Testing

I've added basic consistency tests for the current implementation. Additional R-based comparison tests are still needed, but I don't have the expertise in R panel data to make these myself.

Current limitations

Fixed Effects Support
Patsy doesn't support linearmodels' custom EntityEffects and TimeEffects keywords
As a result, marginaleffects can't handle these effects either.
Workaround: Create the y and exog dataframes yourself with Patsy and construct the model as follows:

y, exog = patsy.dmatrices(formula, return_type='dataframe')
model = PanelOLS(y, exog, entity_effects=True)
model.formula = formula

The formula needs to be manually added here.

Intercept Handling
linearmodels adds an intercept by default, even when it's not specified explicitly in a formula. This also conflicts with Patsy's expectations
Workaround: Users must explicitly include an intercept +1 in formula when desired

Next steps

Add R comparison tests
Add documentation

vincentarelbundock · 2024-12-15T19:01:55Z

Thanks, this looks fantastic!

I'll try to give this a read as soon as possible, but it's the end of semester, so I'm not sure exactly when I'll have time for a deep dive.

One thing I'd like us to think about is how we can design a consistent interface for all model-fitting functions that accept matrices rather than data frames.

For example, @artiom-matvei and I have discussed adding support for Scikit-Learn, where the typical algorithm accepts y,X, and has a .predict() method that accepts a X matrix.

There also, we could use the marginaleffects machinery to build a hi and lo counterfactual data frames, then use Patsy or Formulaic to convert to design matrices, and then use those matrices to fit and make predictions on the model.

That will be a use case that applies to several packages, not just Scikit-Learn and LinearModels, so we want to think of a consistent approach. Maybe we can learn something from Scikit-Learn pipelines, where one of the possible transformers is a formula: https://scikit-learn.org/1.5/modules/generated/sklearn.pipeline.Pipeline.html

vincentarelbundock · 2024-12-15T21:56:46Z

FYI, I'm doing some experiments here: #145

Still far from a finished product, but I'll let you know when I think I've found a good high-level interface.

vincentarelbundock · 2024-12-18T17:29:51Z

@jpweytjens

I merged the sklearn PR. Usage examples in the first post of #145

We can still think about the interface some, but that gives us something to build on.

I also took the liberty to update your branch, since some of the methods names have changed. The key difference is that the list-wise deleted data frame used to fit the model should now be hosted in self.data and the formula in self.formula.

Currently, the linearmodels functionality does not appear to work, because of an indexing problem. I'm terrible at Pandas, so if you want to take a look, that would be awesome.

Thanks again for your contribution here. This is very cool!

jpweytjens · 2024-12-18T23:23:54Z

I'm working on the index issue and adding a fit_linearmodels method similar to the fit_sklearn while we think about the (final) interface.

This will take a bit more work, as the new approach with formulaic approach adds a few more locations where dataframes are converted to polars with narwhals. I'm testing if I can make my PR more data agnostic by e.g. removing the isinstance checks in comparisons and using narwhals in ModelLinearmodels instead of my custom _pl_to_pd function.

jpweytjens · 2024-12-19T14:46:02Z

I've added a few things:

a fit_linearmodels function
support for the different covariance estimators supported by linearmodels
used narwhals instead of the custom _pl_to_pd functions to be more dataframe agnostic
made the ingest function compatible with multiindexed Pandas DataFrames
a parser for formulas with support for the EntityEffects keywords as used by linearmodels
docstrings to many of the (new) functions in utils,formulaic and model_linearmodels
renamed formulaic.variables to formulaic.extract_variables. vars is often used as a variable name for the output of variables, but vars is a Python keyword. I've changed all occurences of vars = variables(...) to variables = extract_variables(...)

A simple test of the new interface is as follows

from statsmodels.datasets import grunfeld
from linearmodels.panel import PanelOLS
import marginaleffects as me


data = grunfeld.load_pandas().data.set_index(["firm", "year"])

formula = "invest ~ value + capital"
formula_with_effects = f"{formula} + EntityEffects + TimeEffects"

for f in [formula, formula_with_effects]:
    mod = me.fit_linearmodels(f, data, engine=PanelOLS)

    print(mod.model)
    print(me.avg_slopes(mod))

I did disable the validate_types decorator in some places. I couldn't solve the unknown dtype of NativeFrame that I got by introducing narwhals. It's not clear to me whether something like from narwhals.typing import IntoFrame is sufficient, or we need a new Protocol, ...

The test cases for linearmodels also need some updating.

vincentarelbundock · 2024-12-25T15:17:05Z

Thanks for this work, and sorry for the delay! (End of semester was a bit crazy this year.)

I pushed a few commits:

Fixed ainstance() vs. isinstance() bug.
Updated a few tests
The fit_*() functions no longer wrap the fitted model in a class immediately (now done in sanitize_model()) This allows users to call the methods of the fitted object directly, like fit_statsmodels(...).summary()

TODO:

Update the test_utils.py tests, which now seem to fail because ingest() returns a new index column.
test_linearmodels.py all appear to be failing. The discrepancy in estimates is massive, and I don't know what the expected values should be there, as I haven't touched that.

jpweytjens · 2024-12-25T16:54:33Z

Does the index column only occur when working with pandas DataFrames with a single index? We could reset the index in ingest only when the passed pandas DataFrame has a MultiIndex as required by linearmodels. I all other cases, we don't need to preserve the index.

The expected test results are the output of the first version of this PR. I'm not sure what caused the big discrepancy. I'll look into it.

I'd like to to have some PanelOLS results and marginal effects estimated with R and the R version of marginaleffects to our python implentation with it. Could you point me in the right direction to generate these results? I'm not that familiar with the popular R packages.

Could you have a look at the@validate_types decorator in formulaic and fit_linearmodels? I couldn't get it to work with the narwhals IntoFrame/NativeFrame dtype.

vincentarelbundock · 2024-12-25T18:12:23Z

I just fixed the ingest tests.

Don't worry about type checking. I plan to overhaul that later. You can just remove the decorator.

The two main packages for these models in R are probably fixest and plm. Here are some equivalences:

import numpy as np
from statsmodels.datasets import grunfeld
data = grunfeld.load_pandas().data
data.year = data.year.astype(np.int64)
from linearmodels import PanelOLS
etdata = data.set_index(['firm','year'])
PanelOLS(etdata.invest,
         etdata[['value','capital']],
         entity_effects=True,
         time_effects=True).fit(debiased=True)

etdata.to_csv("~/Downloads/grunfeld.csv")

library(fixest)
dat = read.csv("~/Downloads/grunfeld.csv")
mod = feols(invest ~ value + capital | firm + year, data = dat)
coef(mod)


library(plm)
mod = plm(invest ~ value + capital, data = dat, index = c("firm", "year"), model = "within", effect="twoways")
coef(mod)

jpweytjens and others added 6 commits November 22, 2024 16:08

feat: initial support for linearmodels

7c291e8

wip: sanitize linearmodels

d90984f

feat: add support for ModeLinearmodels in predictions

82fcb3b

chore: ignore .DS_store

a7a7653

test: add tests for Linearmodels PanelOLS

0738570

Merge branch 'main' into issue19

b421215

vincentarelbundock added 2 commits December 18, 2024 11:59

Merge branch 'main' into issue19

9b1667f

linearmodels: update methods names

db5a678

jpweytjens added 4 commits December 19, 2024 15:36

feat: add fit interface and narwhals support to linearmodels

ed43284

feat: add linearmodels formula parser and rename variables

f66aa1a

feat: make ingest pandas multiindex compataible

4da7a8d

test: first attempt at fixing test with new fit interface

2a6b133

jpweytjens and others added 4 commits December 19, 2024 17:50

feat: don't require pandas as dependency

74927fc

isinstance() bug + tests

7b2eed1

Merge branch 'main' into issue19

0a66dd8

fit_linearmodels() does not wrap the object in class immediately

a010ba6

fix ingest tests

4486e34

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Issue19: add support for Linearmodels #144

Issue19: add support for Linearmodels #144

jpweytjens commented Dec 15, 2024 •

edited

Loading

vincentarelbundock commented Dec 15, 2024

vincentarelbundock commented Dec 15, 2024

vincentarelbundock commented Dec 18, 2024

jpweytjens commented Dec 18, 2024

jpweytjens commented Dec 19, 2024

vincentarelbundock commented Dec 25, 2024

jpweytjens commented Dec 25, 2024

vincentarelbundock commented Dec 25, 2024

Issue19: add support for Linearmodels #144

Are you sure you want to change the base?

Issue19: add support for Linearmodels #144

Conversation

jpweytjens commented Dec 15, 2024 • edited Loading

Implementation Details

Testing

Current limitations

Next steps

vincentarelbundock commented Dec 15, 2024

vincentarelbundock commented Dec 15, 2024

vincentarelbundock commented Dec 18, 2024

jpweytjens commented Dec 18, 2024

jpweytjens commented Dec 19, 2024

vincentarelbundock commented Dec 25, 2024

jpweytjens commented Dec 25, 2024

vincentarelbundock commented Dec 25, 2024

jpweytjens commented Dec 15, 2024 •

edited

Loading