Scikit-Learn support #145

vincentarelbundock · 2024-12-15T21:56:06Z

TODO:

from marginaleffects import *
from sklearn.linear_model import LogisticRegression, LinearRegression
import polars as pl
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Read as Polars
url = "https://vincentarelbundock.github.io/Rdatasets/csv/Stat2Data/Titanic.csv"
dat = pl.read_csv(url).drop_nulls(["PClass", "Age"])

# statsmodels formula interface: Patsy requires Pandas
mod = smf.ols('Survived ~ PClass + Age', data=dat.to_pandas()).fit(disp=0)
p = avg_predictions(mod)
print(p)


# Fit with Formulaic
m = fit_statsmodels(
    "Survived ~ PClass + Age", 
    data = dat, 
    engine = sm.Logit,
    kwargs_fit={'disp': 0},
    )
s = avg_slopes(m)
print(s)


# Fit a scikit-learn linear regression through the formula interface
m = fit_sklearn(
    'Survived ~ PClass * Age',
    data = dat,
    engine = LinearRegression,
    )
s = avg_slopes(m, by = "Sex")
print(s)


# Fit a scikit-learn logistic regression; kwargs_engine is passed to the estimator
m = fit_sklearn(
    'Survived ~ PClass * Sex * Age',
    data = dat,
    engine = LogisticRegression,
    kwargs_engine={'max_iter': 1000},
    )
s = avg_slopes(m, by = "Sex")
print(s)

jpweytjens · 2024-12-16T09:06:28Z

Interesting approach. We can be pretty agnostic about the exact type of (statistical) model, as long as it provides a .fit method? If we expand our use of narwhals, we could also be agnostic about the dataframe that is provided. This might need only a few lines of the existing Models to determine the type of dataframe that is required by the provided engine.

Some questions that pop into my head

Is formulaic a drop-in replacement for Patsy?
Is this generic approach to (nearly) every statistical model in Python worthy of its own package?
Can we still accept fitted models, by using the existing logic of the currently supported Models?
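
A minimal sketch of the "anything with a .fit method" idea, using only the standard library. The names SupportsFit, ToyEngine, NotAnEngine, and is_engine are hypothetical and are not part of marginaleffects; this only illustrates the duck-typing check, not how the package validates engines.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class SupportsFit(Protocol):
    """Hypothetical protocol: any object exposing a `fit` method."""

    def fit(self, *args: Any, **kwargs: Any) -> Any: ...


class ToyEngine:
    """Toy estimator with a scikit-learn-style fit(X, y)."""

    def fit(self, X, y):
        self.n_obs_ = len(y)
        return self


class NotAnEngine:
    """No fit method, so it should be rejected."""


def is_engine(obj: Any) -> bool:
    # runtime_checkable protocols only check that the method exists,
    # not its signature.
    return isinstance(obj, SupportsFit)


print(is_engine(ToyEngine()))
print(is_engine(NotAnEngine()))
```

Because the check only looks for the presence of fit, it says nothing about how the engine expects its data. A real wrapper would still need per-library handling for that.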

vincentarelbundock · 2024-12-16T12:14:49Z

Interesting approach. We can be pretty agnostic about the exact type of (statistical) model, as long as it provides a .fit method?

Yeah, I think that something like this might be quite convenient. I've not yet settled on the "final" interface, but I'm making progress.

Is formulaic a drop-in replacement for Patsy?

No, it is not. See the migration guide: https://matthewwardrop.github.io/formulaic/latest/migration/

Is this generic approach to (nearly) every statistical model in Python worthy of its own package?

No, I don't think so. Every modelling package offers its own formula interface, which will necessarily be richer than what we can offer here. Replicating the formula interface of every package is a fool's errand (tried and failed in R's Zelig). And in the end, we're just talking about <20 lines of code. This is not a big enough contribution for a package, IMHO.

Can we still accept fitted models, by using the existing logic of the currently supported Models?

Yes, the true purpose of fml.fit() is only to add two attributes to the model object: formula and data. This is important for Scikit-Learn and LinearModels objects, since they don't store the original data frame or the formula. But if we can retrieve the model data and the formula from a fitted object, as in Statsmodels, then we don't need to call fml.fit().
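
To make the "two attributes" point concrete, here is a minimal, standard-library-only sketch. Everything in it is an assumption for illustration: fit_with_metadata and MeanOnly are hypothetical names, the data values are made up, and the formula handling is a naive string split standing in for Formulaic. It is not the implementation merged in this PR.

```python
def fit_with_metadata(formula, data, engine, kwargs_engine=None):
    """Hypothetical helper: fit `engine`, then remember `formula` and `data`."""
    # Naive stand-in for a real formula parser: handles only "y ~ a + b".
    lhs, rhs = [side.strip() for side in formula.split("~")]
    predictors = [term.strip() for term in rhs.split("+")]

    # Build the response and a row-wise predictor matrix from column lists.
    y = data[lhs]
    X = [list(row) for row in zip(*(data[p] for p in predictors))]

    model = engine(**(kwargs_engine or {}))
    model.fit(X, y)

    # The two attributes the wrapper exists to add.
    model.formula = formula
    model.data = data
    return model


class MeanOnly:
    """Toy engine with a scikit-learn-style fit(X, y); stores the mean of y."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self


# Made-up toy rows, not the real Titanic data.
toy = {
    "Survived": [1, 0, 1, 1],
    "Age": [29.0, 2.0, 30.0, 25.0],
    "SexCode": [1, 1, 0, 1],
}

m = fit_with_metadata("Survived ~ Age + SexCode", toy, MeanOnly)
print(m.formula)
print(m.data is toy)
```

With formula and data attached to the fitted object, downstream functions can recover the predictors and the original data frame without asking the estimator for them.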

vincentarelbundock added 2 commits December 15, 2024 16:48

scikit-learn support

4b89f81

abstract default methods

fcd5e70

vincentarelbundock mentioned this pull request Dec 15, 2024

Issue19: add support for Linearmodels #144

Open

2 tasks

vincentarelbundock added 12 commits December 15, 2024 18:07

null V in comparisons

2d0eae7

cleanup scikit class

419d791

scikit multiclass

3516ef0

minor

33692c2

fixup

b6d365d

gitignore

db271c0

lint

123eafa

comments

c58255e

formulaic module

3cb9de8

type validation

d52339b

init

536d43c

Scikit -> Sklearn

7619496

vincentarelbundock force-pushed the scikit branch from 06de812 to 7619496 Compare December 17, 2024 02:19

vincentarelbundock added 12 commits December 16, 2024 22:40

rename methods

96918fb

sanitize_model cleanup

d6886eb

simplification

76735a6

simplification

f44749a

lint

8d0753a

minor

4c4d0b3

ingest pandas everywhere

92d79b5

pydantic dependency

3e7f456

clean ingest()

44f7c8c

tests pass

72e1449

lint

51c7958

minor

ed41578

vincentarelbundock added 11 commits December 17, 2024 21:35

simplify

c376e92

model.modeldata -> model.data

c0a927a

remove get_modeldata()

ff3e7cb

deprecate get_formula()

710d816

comment

cc67765

fit_sklearn fit_statsmodels

bfc1955

fit_statsmodels in the statsmodels file

4835fc2

dependencies

bdc6c9f

minor

d91b810

find_variables -> find_predictors

8374b00

bump

e5c19c3

vincentarelbundock merged commit e5c19c3 into main Dec 18, 2024
5 checks passed

This was referenced Dec 18, 2024

Support: formulaic, scikit-learn, and matrix input #35

Closed

Specifying model without statsmodels.formulas.api seems to not work in both pandas and polars #140

Closed
