Support `linearmodels` #19

vincentarelbundock · 2023-07-02T11:45:00Z

https://bashtage.github.io/linearmodels/

jpweytjens · 2024-11-21T15:34:24Z

I'm having a go at this, but there is a fundamental discrepancy between marginaleffects and linearmodels.
marginaleffects uses polars throughout which doesn't support (multi)indices. linearmodels currently does not support polars and instead expects a multiindexed pandas DataFrame.

My first attempt at a ModelLinearmodels implements the required methods except for get_predict().

class ModelLinearmodels(ModelAbstract):

    def __init__(self, model, dataframe):
        self.model = model
        self.dataframe = dataframe
        self.multiindex = self.dataframe.index
        self.validate_coef()
        self.validate_modeldata()
        self.validate_response_name()
        self.validate_formula()
        self.variables_type = get_type_dictionary(self.modeldata)

    def get_coef(self):
        return np.array(self.model.params)

    def get_coef_names(self):
        return np.array(self.model.params.index.to_numpy())

    def get_modeldata(self):

        return pl.from_pandas(self.dataframe)

    def get_response_name(self):
        return self.model.model.dependent.vars[0]

    def get_vcov(self, vcov=True):
        if isinstance(vcov, bool):
            if vcov is True:
                V = self.model.cov
            else:
                V = None
        elif isinstance(vcov, str):
            raise ValueError(f"Linearmodels currently does not support {vcov} vcov.")
            lab = f"cov_{vcov}"
            if hasattr(self.model, lab):
                V = getattr(self.model, lab)
            else:
                raise ValueError(f"The model object has no {lab} attribute.")
        else:
            raise ValueError(
                '`vcov` must be a boolean or a string like "HC3", which corresponds to an attribute of the model object such as "vcov_HC3".'
            )

        if V is not None:
            V = np.array(V)
            if V.shape != (len(self.coef), len(self.coef)):
                raise ValueError(
                    "vcov must be a square numpy array with dimensions equal to the length of self.coef"
                )

        return V
    
    def _add_multiindex(self, df):
        if isinstance(df, pd.DataFrame):
            df = df.set_index(self.multiindex)
        return df

    def get_variables_names(self, variables=None, newdata=None):
        if variables is None:
            formula = self.formula
            columns = self.modeldata.columns
            order = {}
            for var in columns:
                match = re.search(rf"\b{re.escape(var)}\b", formula.split("~")[1])
                if match:
                    order[var] = match.start()
            variables = sorted(order, key=lambda i: order[i])

        if isinstance(variables, (str, dict)):
            variables = [variables] if isinstance(variables, str) else variables
        elif isinstance(variables, list) and all(
            isinstance(var, str) for var in variables
        ):
            pass
        else:
            raise ValueError(
                "`variables` must be None, a dict, string, or list of strings"
            )

        if newdata is not None:
            good = [x for x in variables if x in newdata.columns]
            bad = [x for x in variables if x not in newdata.columns]
            if len(bad) > 0:
                bad = ", ".join(bad)
                warnings.warn(f"Variable(s) not in newdata: {bad}")
            if len(good) == 0:
                raise ValueError("There is no valid column name in `variables`.")
        return variables

    def get_predict(self, params, newdata: pl.DataFrame):
        if isinstance(newdata, np.ndarray):
            exog = newdata
        else:
            y, exog = patsy.dmatrices(self.formula, self._add_multiindex(newdata.to_pandas()))
        p = self.model.model.predict(params, exog=exog)
        if p.ndim == 1:
            p = pl.DataFrame({"rowid": range(newdata.shape[0]), "estimate": p})
        elif p.ndim == 2:
            colnames = {f"column_{i}": str(i) for i in range(p.shape[1])}
            p = (
                pl.DataFrame(p)
                .rename(colnames)
                .with_columns(
                    pl.Series(range(p.shape[0]), dtype=pl.Int32).alias("rowid")
                )
                .melt(id_vars="rowid", variable_name="group", value_name="estimate")
            )
        else:
            raise ValueError(
                "The `predict()` method must return an array with 1 or 2 dimensions."
            )
        p = p.with_columns(pl.col("rowid").cast(pl.Int32))
        return p

    def get_formula(self):
        return self.model.model.formula

    def get_df(self):
        return self.model.df_resid

When newdata is passed to get_predict, e.g. the hi_X, lo_X and nd_X frames in comparisons, the method gets a polars DataFrame. I'm not sure what a general solution is to generate the correct multiindex pandas Dataframe, while keeping as much of the existing polars code.

Any advice on how to handle this would be appreciated.

vincentarelbundock · 2024-11-21T15:45:06Z

Thanks so much for taking a look at this, I really appreciate it!

Frankly, I'm not super familiar with either linearmodels or Pandas multi-indexing...

Does the fitted model hold attributes with the "original" names of the variables used in the index? Can we use this to convert from polars to pandas on the fly on ingestion?

jpweytjens · 2024-11-21T16:19:25Z

The fitted model stores the dependent and exog dataframes, but uses the column names generated by patsy. Both of these contain the original MultiIndex. Currently, I'm passing the original dataframe as well just to get the original column names, before patsy creates the matrices with new column names.

We could reset the index, converting them to columns, before converting a pandas dataframe to a polar dataframe. Where required, we could restore the index with .set_index.

Can this solution be done solely in the Model class, or do other functions need to change for this as well?

vincentarelbundock · 2024-11-21T16:22:08Z

In principle, I think it should be OK to do only in the model class, as long as all the methods assume Polars as both Input and Output.

jpweytjens · 2024-11-22T15:18:15Z

I've added initial support for linearmodels in my fork. The solution is indeed to reset the index before converting to polar, and setting it again when a pandas dataframe is required.

Unlike statsmodels, the fitted linearmodels model does not store the original dataframe. It only stores the output of patsy.dmatrices. Currently, I pass the original dataframe as an addtional required argument to 'ModelLinearmodels'. As a consequence, to test this new class, you need to create the model manually as 'sanitize_model' doesn't support passing the additional kwarg (yet).

from linearmodels.panel.model import PanelOLS
import marginaleffects as me

model = PanelOLS.from_formula(formula, data=df).fit()
me.slopes(
    me.model_linearmodels.ModelLinearmodels(model, df),
    variables=["x"],
    by=["dummy"],
)

This stil needs work before I can push a PR.

Is the original dataframe required, or can we solely rely on the patsy output?
Make sanitize_model compatible with the new model
Verify the other functions besides slopes
Implement test(s)

vincentarelbundock · 2024-11-22T15:28:27Z

Wow, this is super cool!

No, unfortunately, we cannot work solely with patsy output. The problem is that the contrast-building code works on data frames. For example, if we want a contrast between "red" and "blue", we duplicate the entire data frame and set the variable to "red". Then, we do the same with "blue", make predictions on both, and compare. There is no analogous logic to manipulate raw matrices, and writing this code is not realistic.

Perhaps we can force users to supply a data frame to the newdata argument explicitly for these models, and then assign modeldata to newdata. I think that might work, though we probably need to enforce some checks on the input.

jpweytjens · 2025-01-15T08:11:26Z

Closed by #144 😊

jpweytjens mentioned this issue Dec 15, 2024

Issue19: add support for Linearmodels #144

Merged

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support `linearmodels` #19

Support `linearmodels` #19

vincentarelbundock commented Jul 2, 2023

jpweytjens commented Nov 21, 2024

vincentarelbundock commented Nov 21, 2024

jpweytjens commented Nov 21, 2024

vincentarelbundock commented Nov 21, 2024

jpweytjens commented Nov 22, 2024 •

edited

Loading

vincentarelbundock commented Nov 22, 2024

jpweytjens commented Jan 15, 2025

Support linearmodels #19

Support linearmodels #19

Comments

vincentarelbundock commented Jul 2, 2023

jpweytjens commented Nov 21, 2024

vincentarelbundock commented Nov 21, 2024

jpweytjens commented Nov 21, 2024

vincentarelbundock commented Nov 21, 2024

jpweytjens commented Nov 22, 2024 • edited Loading

vincentarelbundock commented Nov 22, 2024

jpweytjens commented Jan 15, 2025

Support `linearmodels` #19

Support `linearmodels` #19

jpweytjens commented Nov 22, 2024 •

edited

Loading