Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue19: add support for Linearmodels #144

Open
wants to merge 17 commits into
base: main
Choose a base branch
from

Conversation

jpweytjens
Copy link

@jpweytjens jpweytjens commented Dec 15, 2024

This PR adds initial support for the linearmodels package, focusing on panel data models.

Implementation Details

The implementation differs slightly from other models due to how linearmodels handles data. Since fitted linearmodels models don't retain the original DataFrame (discussed in #19), users need to manually wrap the fitted model in a ModelLinearmodels class:

from linearmodels.panel import PanelOLS
import marginaleffects as me

model = PanelOLS.from_formula(formula, data).fit()
mod = me.model_linearmodels.ModelLinearmodels(model, df)

This approach preserves access to all newdata options (e.g., mean).

Testing

I've added basic consistency tests for the current implementation. Additional R-based comparison tests are still needed, but I don't have the expertise in R panel data to make these myself.

Current limitations

  1. Fixed Effects Support
    Patsy doesn't support linearmodels' custom EntityEffects and TimeEffects keywords
    As a result, marginaleffects can't handle these effects either.
    Workaround: Create the y and exog dataframes yourself with Patsy and construct the model as follows:
y, exog = patsy.dmatrices(formula, return_type='dataframe')
model = PanelOLS(y, exog, entity_effects=True)
model.formula = formula

The formula needs to be manually added here.

  1. Intercept Handling
    linearmodels adds an intercept by default, even when it's not specified explicitly in a formula. This also conflicts with Patsy's expectations
    Workaround: Users must explicitly include an intercept +1 in formula when desired

Next steps

  • Add R comparison tests
  • Add documentation

@vincentarelbundock
Copy link
Owner

Thanks, this looks fantastic!

I'll try to give this a read as soon as possible, but it's the end of semester, so I'm not sure exactly when I'll have time for a deep dive.

One thing I'd like us to think about is how we can design a consistent interface for all model-fitting functions that accept matrices rather than data frames.

For example, @artiom-matvei and I have discussed adding support for Scikit-Learn, where the typical algorithm accepts y,X, and has a .predict() method that accepts a X matrix.

There also, we could use the marginaleffects machinery to build a hi and lo counterfactual data frames, then use Patsy or Formulaic to convert to design matrices, and then use those matrices to fit and make predictions on the model.

That will be a use case that applies to several packages, not just Scikit-Learn and LinearModels, so we want to think of a consistent approach. Maybe we can learn something from Scikit-Learn pipelines, where one of the possible transformers is a formula: https://scikit-learn.org/1.5/modules/generated/sklearn.pipeline.Pipeline.html

@vincentarelbundock
Copy link
Owner

FYI, I'm doing some experiments here: #145

Still far from a finished product, but I'll let you know when I think I've found a good high-level interface.

@vincentarelbundock
Copy link
Owner

@jpweytjens

I merged the sklearn PR. Usage examples in the first post of #145

We can still think about the interface some, but that gives us something to build on.

I also took the liberty to update your branch, since some of the methods names have changed. The key difference is that the list-wise deleted data frame used to fit the model should now be hosted in self.data and the formula in self.formula.

Currently, the linearmodels functionality does not appear to work, because of an indexing problem. I'm terrible at Pandas, so if you want to take a look, that would be awesome.

Thanks again for your contribution here. This is very cool!

@jpweytjens
Copy link
Author

I'm working on the index issue and adding a fit_linearmodels method similar to the fit_sklearn while we think about the (final) interface.

This will take a bit more work, as the new approach with formulaic approach adds a few more locations where dataframes are converted to polars with narwhals. I'm testing if I can make my PR more data agnostic by e.g. removing the isinstance checks in comparisons and using narwhals in ModelLinearmodels instead of my custom _pl_to_pd function.

@jpweytjens
Copy link
Author

I've added a few things:

  • a fit_linearmodels function
  • support for the different covariance estimators supported by linearmodels
  • used narwhals instead of the custom _pl_to_pd functions to be more dataframe agnostic
  • made the ingest function compatible with multiindexed Pandas DataFrames
  • a parser for formulas with support for the EntityEffects keywords as used by linearmodels
  • docstrings to many of the (new) functions in utils,formulaic and model_linearmodels
  • renamed formulaic.variables to formulaic.extract_variables. vars is often used as a variable name for the output of variables, but vars is a Python keyword. I've changed all occurences of vars = variables(...) to variables = extract_variables(...)

A simple test of the new interface is as follows

from statsmodels.datasets import grunfeld
from linearmodels.panel import PanelOLS
import marginaleffects as me


data = grunfeld.load_pandas().data.set_index(["firm", "year"])

formula = "invest ~ value + capital"
formula_with_effects = f"{formula} + EntityEffects + TimeEffects"

for f in [formula, formula_with_effects]:
    mod = me.fit_linearmodels(f, data, engine=PanelOLS)

    print(mod.model)
    print(me.avg_slopes(mod))

I did disable the validate_types decorator in some places. I couldn't solve the unknown dtype of NativeFrame that I got by introducing narwhals. It's not clear to me whether something like from narwhals.typing import IntoFrame is sufficient, or we need a new Protocol, ...

The test cases for linearmodels also need some updating.

@vincentarelbundock
Copy link
Owner

Thanks for this work, and sorry for the delay! (End of semester was a bit crazy this year.)

I pushed a few commits:

  1. Fixed ainstance() vs. isinstance() bug.
  2. Updated a few tests
  3. The fit_*() functions no longer wrap the fitted model in a class immediately (now done in sanitize_model()) This allows users to call the methods of the fitted object directly, like fit_statsmodels(...).summary()

TODO:

  1. Update the test_utils.py tests, which now seem to fail because ingest() returns a new index column.
  2. test_linearmodels.py all appear to be failing. The discrepancy in estimates is massive, and I don't know what the expected values should be there, as I haven't touched that.

@jpweytjens
Copy link
Author

Does the index column only occur when working with pandas DataFrames with a single index? We could reset the index in ingest only when the passed pandas DataFrame has a MultiIndex as required by linearmodels. I all other cases, we don't need to preserve the index.

The expected test results are the output of the first version of this PR. I'm not sure what caused the big discrepancy. I'll look into it.

I'd like to to have some PanelOLS results and marginal effects estimated with R and the R version of marginaleffects to our python implentation with it. Could you point me in the right direction to generate these results? I'm not that familiar with the popular R packages.

Could you have a look at the@validate_types decorator in formulaic and fit_linearmodels? I couldn't get it to work with the narwhals IntoFrame/NativeFrame dtype.

@vincentarelbundock
Copy link
Owner

I just fixed the ingest tests.

Don't worry about type checking. I plan to overhaul that later. You can just remove the decorator.

The two main packages for these models in R are probably fixest and plm. Here are some equivalences:

import numpy as np
from statsmodels.datasets import grunfeld
data = grunfeld.load_pandas().data
data.year = data.year.astype(np.int64)
from linearmodels import PanelOLS
etdata = data.set_index(['firm','year'])
PanelOLS(etdata.invest,
         etdata[['value','capital']],
         entity_effects=True,
         time_effects=True).fit(debiased=True)

etdata.to_csv("~/Downloads/grunfeld.csv")
library(fixest)
dat = read.csv("~/Downloads/grunfeld.csv")
mod = feols(invest ~ value + capital | firm + year, data = dat)
coef(mod)


library(plm)
mod = plm(invest ~ value + capital, data = dat, index = c("firm", "year"), model = "within", effect="twoways")
coef(mod)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants