Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support linearmodels #19

Open
vincentarelbundock opened this issue Jul 2, 2023 · 7 comments
Open

Support linearmodels #19

vincentarelbundock opened this issue Jul 2, 2023 · 7 comments

Comments

@vincentarelbundock
Copy link
Owner

https://bashtage.github.io/linearmodels/

@jpweytjens
Copy link
Contributor

I'm having a go at this, but there is a fundamental discrepancy between marginaleffects and linearmodels.
marginaleffects uses polars throughout which doesn't support (multi)indices. linearmodels currently does not support polars and instead expects a multiindexed pandas DataFrame.

My first attempt at a ModelLinearmodels implements the required methods except for get_predict().

class ModelLinearmodels(ModelAbstract):

    def __init__(self, model, dataframe):
        self.model = model
        self.dataframe = dataframe
        self.multiindex = self.dataframe.index
        self.validate_coef()
        self.validate_modeldata()
        self.validate_response_name()
        self.validate_formula()
        self.variables_type = get_type_dictionary(self.modeldata)

    def get_coef(self):
        return np.array(self.model.params)

    def get_coef_names(self):
        return np.array(self.model.params.index.to_numpy())

    def get_modeldata(self):

        return pl.from_pandas(self.dataframe)

    def get_response_name(self):
        return self.model.model.dependent.vars[0]

    def get_vcov(self, vcov=True):
        if isinstance(vcov, bool):
            if vcov is True:
                V = self.model.cov
            else:
                V = None
        elif isinstance(vcov, str):
            raise ValueError(f"Linearmodels currently does not support {vcov} vcov.")
            lab = f"cov_{vcov}"
            if hasattr(self.model, lab):
                V = getattr(self.model, lab)
            else:
                raise ValueError(f"The model object has no {lab} attribute.")
        else:
            raise ValueError(
                '`vcov` must be a boolean or a string like "HC3", which corresponds to an attribute of the model object such as "vcov_HC3".'
            )

        if V is not None:
            V = np.array(V)
            if V.shape != (len(self.coef), len(self.coef)):
                raise ValueError(
                    "vcov must be a square numpy array with dimensions equal to the length of self.coef"
                )

        return V
    
    def _add_multiindex(self, df):
        if isinstance(df, pd.DataFrame):
            df = df.set_index(self.multiindex)
        return df

    def get_variables_names(self, variables=None, newdata=None):
        if variables is None:
            formula = self.formula
            columns = self.modeldata.columns
            order = {}
            for var in columns:
                match = re.search(rf"\b{re.escape(var)}\b", formula.split("~")[1])
                if match:
                    order[var] = match.start()
            variables = sorted(order, key=lambda i: order[i])

        if isinstance(variables, (str, dict)):
            variables = [variables] if isinstance(variables, str) else variables
        elif isinstance(variables, list) and all(
            isinstance(var, str) for var in variables
        ):
            pass
        else:
            raise ValueError(
                "`variables` must be None, a dict, string, or list of strings"
            )

        if newdata is not None:
            good = [x for x in variables if x in newdata.columns]
            bad = [x for x in variables if x not in newdata.columns]
            if len(bad) > 0:
                bad = ", ".join(bad)
                warnings.warn(f"Variable(s) not in newdata: {bad}")
            if len(good) == 0:
                raise ValueError("There is no valid column name in `variables`.")
        return variables

    def get_predict(self, params, newdata: pl.DataFrame):
        if isinstance(newdata, np.ndarray):
            exog = newdata
        else:
            y, exog = patsy.dmatrices(self.formula, self._add_multiindex(newdata.to_pandas()))
        p = self.model.model.predict(params, exog=exog)
        if p.ndim == 1:
            p = pl.DataFrame({"rowid": range(newdata.shape[0]), "estimate": p})
        elif p.ndim == 2:
            colnames = {f"column_{i}": str(i) for i in range(p.shape[1])}
            p = (
                pl.DataFrame(p)
                .rename(colnames)
                .with_columns(
                    pl.Series(range(p.shape[0]), dtype=pl.Int32).alias("rowid")
                )
                .melt(id_vars="rowid", variable_name="group", value_name="estimate")
            )
        else:
            raise ValueError(
                "The `predict()` method must return an array with 1 or 2 dimensions."
            )
        p = p.with_columns(pl.col("rowid").cast(pl.Int32))
        return p

    def get_formula(self):
        return self.model.model.formula

    def get_df(self):
        return self.model.df_resid

When newdata is passed to get_predict, e.g. the hi_X, lo_X and nd_X frames in comparisons, the method gets a polars DataFrame. I'm not sure what a general solution is to generate the correct multiindex pandas Dataframe, while keeping as much of the existing polars code.

Any advice on how to handle this would be appreciated.

@vincentarelbundock
Copy link
Owner Author

Thanks so much for taking a look at this, I really appreciate it!

Frankly, I'm not super familiar with either linearmodels or Pandas multi-indexing...

Does the fitted model hold attributes with the "original" names of the variables used in the index? Can we use this to convert from polars to pandas on the fly on ingestion?

@jpweytjens
Copy link
Contributor

The fitted model stores the dependent and exog dataframes, but uses the column names generated by patsy. Both of these contain the original MultiIndex. Currently, I'm passing the original dataframe as well just to get the original column names, before patsy creates the matrices with new column names.

We could reset the index, converting them to columns, before converting a pandas dataframe to a polar dataframe. Where required, we could restore the index with .set_index.

Can this solution be done solely in the Model class, or do other functions need to change for this as well?

@vincentarelbundock
Copy link
Owner Author

In principle, I think it should be OK to do only in the model class, as long as all the methods assume Polars as both Input and Output.

@jpweytjens
Copy link
Contributor

jpweytjens commented Nov 22, 2024

I've added initial support for linearmodels in my fork. The solution is indeed to reset the index before converting to polar, and setting it again when a pandas dataframe is required.

Unlike statsmodels, the fitted linearmodels model does not store the original dataframe. It only stores the output of patsy.dmatrices. Currently, I pass the original dataframe as an addtional required argument to 'ModelLinearmodels'. As a consequence, to test this new class, you need to create the model manually as 'sanitize_model' doesn't support passing the additional kwarg (yet).

from linearmodels.panel.model import PanelOLS
import marginaleffects as me

model = PanelOLS.from_formula(formula, data=df).fit()
me.slopes(
    me.model_linearmodels.ModelLinearmodels(model, df),
    variables=["x"],
    by=["dummy"],
)

This stil needs work before I can push a PR.

  • Is the original dataframe required, or can we solely rely on the patsy output?
  • Make sanitize_model compatible with the new model
  • Verify the other functions besides slopes
  • Implement test(s)

@vincentarelbundock
Copy link
Owner Author

Wow, this is super cool!

No, unfortunately, we cannot work solely with patsy output. The problem is that the contrast-building code works on data frames. For example, if we want a contrast between "red" and "blue", we duplicate the entire data frame and set the variable to "red". Then, we do the same with "blue", make predictions on both, and compare. There is no analogous logic to manipulate raw matrices, and writing this code is not realistic.

Perhaps we can force users to supply a data frame to the newdata argument explicitly for these models, and then assign modeldata to newdata. I think that might work, though we probably need to enforce some checks on the input.

@jpweytjens
Copy link
Contributor

Closed by #144 😊

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants