Scikit-Learn support #145

Merged
merged 37 commits into from
Dec 18, 2024

vincentarelbundock
Owner

vincentarelbundock commented Dec 15, 2024

TODO:

  • NULL handling
  • .predict() vs. .predict_proba()
  • is_pyfixest() in sanitize_model.py
  • multi-class predictions
  • comparisons
  • slopes
  • formulaic should be optional
  • Print entire data frame when newdata is not provided

from marginaleffects import *
from sklearn.linear_model import LogisticRegression, LinearRegression
import polars as pl
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Read as Polars
url = "https://vincentarelbundock.github.io/Rdatasets/csv/Stat2Data/Titanic.csv"
dat = pl.read_csv(url).drop_nulls(["PClass", "Age"])

# statsmodels formula interface: Patsy requires Pandas
mod = smf.ols('Survived ~ PClass + Age', data=dat.to_pandas()).fit(disp=0)
p = avg_predictions(mod)
print(p)


# Fit with Formulaic
m = fit_statsmodels(
    "Survived ~ PClass + Age", 
    data = dat, 
    engine = sm.Logit,
    kwargs_fit={'disp': 0},
    )
s = avg_slopes(m)
print(s)


m = fit_sklearn(
    'Survived ~ PClass * Age',
    data = dat,
    engine = LinearRegression,
    )
s = avg_slopes(m, by = "Sex")
print(s)


m = fit_sklearn(
    'Survived ~ PClass * Sex * Age',
    data = dat,
    engine = LogisticRegression,
    kwargs_engine={'max_iter': 1000},
    )
s = avg_slopes(m, by = "Sex")
print(s)

@jpweytjens

Interesting approach. We can be pretty agnostic about the exact type of (statistical) model, as long as it provides a .fit method? If we expand our use of narwhals, we could also be agnostic about the dataframe that is provided. This might need only a few lines of the existing Models to determine the type of dataframe that is required by the provided engine.
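
One way to read the ".fit method" criterion is as a structural (duck-typed) check. The following is only a minimal sketch under that assumption: `SupportsFit`, `ToyMeanModel`, and `is_fittable` are hypothetical names for illustration, not part of marginaleffects.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class SupportsFit(Protocol):
    """Structural type: any object exposing a ``fit`` attribute qualifies."""

    def fit(self, *args: Any, **kwargs: Any) -> Any: ...


class ToyMeanModel:
    """Hypothetical stand-in for an estimator; it only stores the mean of y."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self


def is_fittable(engine: Any) -> bool:
    # A runtime-checkable protocol only verifies that the attribute exists;
    # it does not validate the signature of ``fit``.
    return isinstance(engine, SupportsFit)
```

A check like this says nothing about which dataframe type the engine expects, so that question would still need to be handled separately.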

Some questions that pop into my head

  • Is formulaic a drop-in replacement for Patsy?
  • Is this generic approach to (nearly) every statistical model in Python worthy of its own package?
  • Can we still accept fitted models, by using the existing logic of the currently supported Models?

@vincentarelbundock
Owner Author

vincentarelbundock commented Dec 16, 2024

Interesting approach. We can be pretty agnostic about the exact type of (statistical) model, as long as it provides a .fit method?

Yeah, I think that something like this might be quite convenient. I've not yet settled on the "final" interface, but making progress.

Is formulaic a drop-in replacement for Patsy?

No it is not. https://matthewwardrop.github.io/formulaic/latest/migration/

Is this generic approach to (nearly) every statistical model in Python worthy of its own package?

No, I don't think so. Every modelling package offers its own formula interface, which will necessarily be richer than what we can offer here. Replicating the formula interface of every package is a fool's errand (tried and failed in R's Zelig). And in the end, we're just talking about <20 lines of code. This is not a big enough contribution for a package, IMHO.

Can we still accept fitted models, by using the existing logic of the currently supported Models?

Yes, the true purpose of fml.fit() is only to add two attributes to the model object: formula and data. This is important for Scikit-Learn and LinearModels objects, since they don't store the original data frame or the formula. But if we can retrieve the model data and the formula from a fitted object, as in Statsmodels, then we don't need to call fml.fit().
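
The "add two attributes" idea could be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `ToyEngine`, `toy_design`, and `fit_and_annotate` are hypothetical names, the formula parser is a toy stand-in for Formulaic that only handles `y ~ a + b`, and the engine is assumed to follow the Scikit-Learn convention of `engine(**kwargs).fit(X, y)`.

```python
from typing import Any, Dict, List, Optional


class ToyEngine:
    """Hypothetical estimator that, like Scikit-Learn, keeps no formula or data."""

    def __init__(self, **kwargs: Any) -> None:
        self.params = kwargs

    def fit(self, X: List[List[float]], y: List[float]) -> "ToyEngine":
        self.n_obs_ = len(y)
        return self


def toy_design(formula: str, data: Dict[str, list]):
    # Toy stand-in for Formulaic: splits "y ~ a + b" into names and
    # pulls the matching columns out of a dict of lists.
    lhs, rhs = (side.strip() for side in formula.split("~"))
    terms = [term.strip() for term in rhs.split("+")]
    y = list(data[lhs])
    X = [list(row) for row in zip(*(data[term] for term in terms))]
    return y, X


def fit_and_annotate(
    formula: str,
    data: Dict[str, list],
    engine: Any,
    kwargs_engine: Optional[dict] = None,
    kwargs_fit: Optional[dict] = None,
):
    y, X = toy_design(formula, data)
    model = engine(**(kwargs_engine or {})).fit(X, y, **(kwargs_fit or {}))
    # The only real contribution: remember what the model was fit on.
    model.formula = formula
    model.data = data
    return model
```

With these two attributes attached, downstream functions can rebuild design matrices for counterfactual data without asking the user for the formula again.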

2 participants