Support: formulaic, scikit-learn, and matrix input #35

Closed
vincentarelbundock opened this issue Sep 18, 2023 · 6 comments

Comments

@vincentarelbundock
Owner

https://github.com/matthewwardrop/formulaic

Probably need another argument for the formula used to create y and X in scikit-learn

@vincentarelbundock
Owner Author

import pandas
import polars as pl
from formulaic import model_matrix
from sklearn.linear_model import LinearRegression

df = pl.read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/causaldata/thornton_hiv.csv")

# Build the outcome (y) and design matrix (X) from a formula
y, X = model_matrix("got ~ distvct + tinc * age", df.to_pandas())

lr = LinearRegression()
lr.fit(X, y)

# The model spec attached to X stores the variables and formula used to build it
X.model_spec.variables

X.model_spec.formula

@vincentarelbundock
Owner Author

vincentarelbundock commented Sep 18, 2023

Do we care about this, given that scikit-learn does not provide standard errors?

@vincentarelbundock changed the title from "Support: formulaic and scikit-learn" to "Support: formulaic, scikit-learn, and matrix input" on Oct 27, 2024
@artiom-matvei
Contributor

Is this to add support for models from scikit-learn?
For something like:

############## Important line is the last one
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
import numpy as np
import seaborn as sns
from marginaleffects import predictions

# Set seed for reproducibility
np.random.seed(123)

# Load and preprocess the data
penguins = sns.load_dataset('penguins')
data = penguins[['species', 'bill_length_mm', 'bill_depth_mm']].dropna()
data['species'] = data['species'].astype('category')

# Scale the features
scaler = StandardScaler()
data[['bill_length_mm', 'bill_depth_mm']] = scaler.fit_transform(data[['bill_length_mm', 'bill_depth_mm']])

# Prepare features and target
X = data[['bill_length_mm', 'bill_depth_mm']].values
y = data['species'].cat.codes  # Convert categories to numeric codes

# Map species to codes
species_mapping = dict(zip(data['species'].cat.categories, range(len(data['species'].cat.categories))))
print("Species mapping:", species_mapping)

# Fit the multinomial logistic regression model
model_py = LogisticRegression(multi_class='multinomial', solver='lbfgs', C=1e10, fit_intercept=True, random_state=123, max_iter=1000)
model_py.fit(X, y)

############## Important line is the last one
predictions(model_py)

@vincentarelbundock
Owner Author

Yes, that's right. The idea would be to write a new model class similar to this: https://github.com/vincentarelbundock/pymarginaleffects/blob/main/marginaleffects/model_pyfixest.py

With these differences:

  1. To instantiate the model class, the user supplies a scikit-learn pipeline object that accepts a data frame and returns two matrices: y and X.
  2. On instantiation, the class fits the given model.
  3. The get_predict() method takes newdata, runs it through the same data preparation pipeline, and makes predictions for the resulting X.
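
A rough sketch of those three steps, using only the standard library so it runs anywhere. Everything named here is hypothetical and not part of marginaleffects: `ModelSklearnSketch`, the `prepare` function, and `TinyOLS`, a stand-in that mimics the scikit-learn `fit`/`predict` protocol. A real version would take an actual scikit-learn estimator and pipeline.

```python
class TinyOLS:
    """Stand-in for a scikit-learn estimator: one-feature least squares."""

    def fit(self, X, y):
        xs = [row[0] for row in X]
        n = len(xs)
        x_bar, y_bar = sum(xs) / n, sum(y) / n
        sxx = sum((x - x_bar) ** 2 for x in xs)
        sxy = sum((x - x_bar) * (yi - y_bar) for x, yi in zip(xs, y))
        self.slope_ = sxy / sxx
        self.intercept_ = y_bar - self.slope_ * x_bar
        return self  # scikit-learn estimators also return self from fit()

    def predict(self, X):
        return [self.intercept_ + self.slope_ * row[0] for row in X]


def prepare(rows):
    """Hypothetical preparation step: list of dict rows -> (y, X)."""
    y = [row.get("y") for row in rows]  # outcome may be absent in newdata
    X = [[float(row["x"])] for row in rows]
    return y, X


class ModelSklearnSketch:
    """Hypothetical wrapper following the three steps listed above."""

    def __init__(self, estimator, prepare, data):
        self.prepare = prepare
        y, X = prepare(data)                  # 1. data -> (y, X)
        self.estimator = estimator.fit(X, y)  # 2. fit on instantiation

    def get_predict(self, newdata):
        _, X_new = self.prepare(newdata)      # 3. same preparation for newdata
        return self.estimator.predict(X_new)


# Toy data generated from y = 1 + 2x (made up for illustration)
data = [{"x": 0, "y": 1.0}, {"x": 1, "y": 3.0}, {"x": 2, "y": 5.0}, {"x": 3, "y": 7.0}]
model = ModelSklearnSketch(TinyOLS(), prepare, data)
preds = model.get_predict([{"x": 4}, {"x": 10}])
```

Keeping the preparation step on the model object means `get_predict()` always applies the same transformation that was used at fit time.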

You could give this a shot if you want. I think this is a really fun one.

@vincentarelbundock
Owner Author

Done here: #145
