
Support pyfixest #56

Merged: vincentarelbundock merged 29 commits into main on Dec 27, 2023

Conversation

@vincentarelbundock (Owner) commented Dec 23, 2023

Define an abstract class ModelPyfixest to support pyfixest models.

WIP. Does not work yet.

Known problems:

  • get_predict() expects a data frame, but the statsmodels-based infrastructure supplies a patsy design matrix. We probably want to build patsy matrices inside the statsmodels-specific get_predict() method, and otherwise always pass a plain data frame to every get_predict() implementation (a rough sketch of this follows the snippet below).
import polars as pl
import numpy as np
from pyfixest.estimation import feols
from pyfixest.utils import get_data
from marginaleffects import predictions
data = get_data()
fit = feols("Y ~ X1", data = data)
predictions(fit)
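
To make the data-flow issue concrete, here is a minimal sketch of the direction described in the bullet above. Everything in it (class names, attributes, signatures) is an illustrative assumption, not the package's actual internals:

import numpy as np
import patsy

class ModelAbstract:
    # Hypothetical base class: every adapter receives a plain data frame.
    def __init__(self, model):
        self.model = model

    def get_predict(self, params, newdata):
        raise NotImplementedError

class ModelStatsmodels(ModelAbstract):
    def get_predict(self, params, newdata):
        # Build the patsy design matrix here, inside the statsmodels-specific
        # adapter, instead of in the shared infrastructure.
        design_info = self.model.model.data.design_info  # formula-fitted models
        exog = patsy.build_design_matrices([design_info], newdata)[0]
        return self.model.model.predict(params, np.asarray(exog))

class ModelPyfixest(ModelAbstract):
    def get_predict(self, params, newdata):
        # pyfixest's fitted models predict from a data frame directly, so no
        # patsy matrix is needed; `params` is ignored here because pyfixest
        # predicts from its stored coefficients (argument names may differ
        # across versions).
        return self.model.predict(newdata=newdata)

The point is only that patsy enters the statsmodels path; every other adapter, including the pyfixest one, keeps working with plain data frames.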

@vincentarelbundock marked this pull request as ready for review on December 23, 2023, 22:13
@vincentarelbundock (Owner, Author)

This appears to work now. I would be curious to know what @s3alfisc thinks, especially about the pyfixest class here:

https://github.com/vincentarelbundock/pymarginaleffects/pull/56/files#diff-9b8c333d26e1955c292c955f0da401086afad8a91415c2107187b12a53bb6552

Below I paste a Python script and an R script, along with their output. Run the Python script first, because it saves the CSV that the R script reads. FWIW, I have not tested this extensively, but the results look similar.

R

Code

library(fixest)
library(marginaleffects)
options(width = 10000)

# Prep and fit
data = read.csv("issue56.csv")
fit = feols(Y ~ X1 * X2 * Z1, data = data)

# Prediction: Average over full empirical distribution
p = avg_predictions(fit)
print(p)

# Average marginal effects (slopes)
s = avg_slopes(fit)
print(s)

# Comparisons at two different evaluation points
c = comparisons(fit, newdata = datagrid(X1 = c(2, 4)))
print(c)

Log

> library(fixest)
> library(marginaleffects)
> options(width = 10000)
> 
> # Prep and fit
> data = read.csv("issue56.csv")
> fit = feols(Y ~ X1 * X2 * Z1, data = data)
NOTE: 2 observations removed because of NA values (LHS: 1, RHS: 1).
> 
> # Prediction: Average over full empirical distribution
> p = avg_predictions(fit)
> print(p)

 Estimate Std. Error    z Pr(>|z|)   S 2.5 % 97.5 %
     2.57     0.0555 46.3   <0.001 Inf  2.46   2.68

Columns: estimate, std.error, statistic, p.value, s.value, conf.low, conf.high 
Type:  response 

> 
> # Average marginal effects (slopes)
> s = avg_slopes(fit)
> print(s)

 Term Estimate Std. Error       z Pr(>|z|)    S   2.5 % 97.5 %
   X1  0.42870     0.0865  4.9544   <0.001 20.4  0.2591  0.598
   X2  0.07336     0.0183  4.0067   <0.001 14.0  0.0375  0.109
   Z1 -0.00406     0.0544 -0.0747     0.94  0.1 -0.1107  0.103

Columns: term, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high 
Type:  response 

> 
> # Comparisons at two different evaluation points
> c = comparisons(fit, newdata = datagrid(X1 = c(2, 4)))
> print(c)

 Term Contrast X1 Estimate Std. Error     z Pr(>|z|)    S   2.5 % 97.5 %     X2    Z1
   X1       +1  2   0.4287     0.0865 4.953   <0.001 20.4  0.2591  0.598 -0.126 0.995
   X1       +1  4   0.4287     0.0865 4.953   <0.001 20.4  0.2591  0.598 -0.126 0.995
   X2       +1  2   0.0822     0.0357 2.301   0.0214  5.5  0.0122  0.152 -0.126 0.995
   X2       +1  4   0.1002     0.0875 1.146   0.2520  2.0 -0.0712  0.272 -0.126 0.995
   Z1       +1  2   0.0688     0.0795 0.866   0.3867  1.4 -0.0870  0.225 -0.126 0.995
   Z1       +1  4   0.2169     0.1821 1.191   0.2336  2.1 -0.1400  0.574 -0.126 0.995

Columns: rowid, term, contrast, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, X1, predicted_lo, predicted_hi, predicted, Y, X2, Z1 
Type:  response 

Python

Code

from pyfixest.estimation import feols
from pyfixest.utils import get_data
from marginaleffects import *

# Prep and fit
data = get_data()
data.to_csv("issue56.csv")
fit = feols("Y ~ X1 * X2 * Z1", data = data)

# Prediction: Average over full empirical distribution
p = avg_predictions(fit)
print(p)

# Average marginal effects (slopes)
s = avg_slopes(fit)
print(s)

# Comparisons at two different evaluation points
c = comparisons(fit, newdata = datagrid(X1 = [2, 4], model = fit))
print(c)

Log

shape: (1, 7)
┌──────────┬───────────┬──────┬─────────┬─────┬──────┬───────┐
│ Estimate ┆ Std.Error ┆ z    ┆ P(>|z|) ┆ S   ┆ 2.5% ┆ 97.5% │
│ ---      ┆ ---       ┆ ---  ┆ ---     ┆ --- ┆ ---  ┆ ---   │
│ str      ┆ str       ┆ str  ┆ str     ┆ str ┆ str  ┆ str   │
╞══════════╪═══════════╪══════╪═════════╪═════╪══════╪═══════╡
│ 2.57     ┆ 0.0555    ┆ 46.3 ┆ 0       ┆ inf ┆ 2.46 ┆ 2.68  │
└──────────┴───────────┴──────┴─────────┴─────┴──────┴───────┘

Columns: estimate, std_error, statistic, p_value, s_value, conf_low, conf_high

shape: (3, 9)
┌──────┬─────────────┬──────────┬───────────┬───┬──────────┬────────┬────────┬───────┐
│ Term ┆ Contrast    ┆ Estimate ┆ Std.Error ┆ … ┆ P(>|z|)  ┆ S      ┆ 2.5%   ┆ 97.5% │
│ ---  ┆ ---         ┆ ---      ┆ ---       ┆   ┆ ---      ┆ ---    ┆ ---    ┆ ---   │
│ str  ┆ str         ┆ str      ┆ str       ┆   ┆ str      ┆ str    ┆ str    ┆ str   │
╞══════╪═════════════╪══════════╪═══════════╪═══╪══════════╪════════╪════════╪═══════╡
│ X1   ┆ mean(dY/dX) ┆ 0.429    ┆ 0.0865    ┆ … ┆ 7.23e-07 ┆ 20.4   ┆ 0.259  ┆ 0.598 │
│ X2   ┆ mean(dY/dX) ┆ 0.0734   ┆ 0.0183    ┆ … ┆ 6.15e-05 ┆ 14     ┆ 0.0375 ┆ 0.109 │
│ Z1   ┆ mean(dY/dX) ┆ -0.00406 ┆ 0.0544    ┆ … ┆ 0.94     ┆ 0.0885 ┆ -0.111 ┆ 0.103 │
└──────┴─────────────┴──────────┴───────────┴───┴──────────┴────────┴────────┴───────┘

Columns: term, contrast, estimate, std_error, statistic, p_value, s_value, conf_low, conf_high

shape: (6, 10)
┌─────┬──────┬──────────┬──────────┬───┬──────────┬──────┬─────────┬───────┐
│ X1  ┆ Term ┆ Contrast ┆ Estimate ┆ … ┆ P(>|z|)  ┆ S    ┆ 2.5%    ┆ 97.5% │
│ --- ┆ ---  ┆ ---      ┆ ---      ┆   ┆ ---      ┆ ---  ┆ ---     ┆ ---   │
│ str ┆ str  ┆ str      ┆ str      ┆   ┆ str      ┆ str  ┆ str     ┆ str   │
╞═════╪══════╪══════════╪══════════╪═══╪══════════╪══════╪═════════╪═══════╡
│ 2   ┆ X1   ┆ +1       ┆ 0.429    ┆ … ┆ 7.29e-07 ┆ 20.4 ┆ 0.259   ┆ 0.598 │
│ 4   ┆ X1   ┆ +1       ┆ 0.429    ┆ … ┆ 7.29e-07 ┆ 20.4 ┆ 0.259   ┆ 0.598 │
│ 2   ┆ X2   ┆ +1       ┆ 0.0822   ┆ … ┆ 0.0214   ┆ 5.55 ┆ 0.0122  ┆ 0.152 │
│ 4   ┆ X2   ┆ +1       ┆ 0.1      ┆ … ┆ 0.252    ┆ 1.99 ┆ -0.0712 ┆ 0.272 │
│ 2   ┆ Z1   ┆ +1       ┆ 0.0688   ┆ … ┆ 0.387    ┆ 1.37 ┆ -0.087  ┆ 0.225 │
│ 4   ┆ Z1   ┆ +1       ┆ 0.217    ┆ … ┆ 0.234    ┆ 2.1  ┆ -0.14   ┆ 0.574 │
└─────┴──────┴──────────┴──────────┴───┴──────────┴──────┴─────────┴───────┘

Columns: X1, rowid, term, contrast, estimate, std_error, statistic, p_value, s_value, conf_low, conf_high, predicted, predicted_lo, predicted_hi, Y, Y2, X2, f1, f2, f3, group_id, Z1, Z2

@vincentarelbundock changed the title from "[WIP]: Support pyfixest" to "Support pyfixest" on Dec 23, 2023
@s3alfisc (Contributor)

Hi @vincentarelbundock, super cool and very exciting - this feels like an early Christmas gift! I can't promise to be able to take a look today but I'll definitely find time in the next few days =)

@vincentarelbundock (Owner, Author)

Haha, nice!

But just to calibrate expectations, the model above is literally the only one I tried. It might not work elsewhere, but at least we have a base to work with.

:)

vincentarelbundock and others added 18 commits December 24, 2023 10:42
* plotnine

* trash.py for tests

* plotnine

* p9 plot_predictions works

* p9: plot_comparisons and plot_slopes

* lint

* toml

* workflows

* toml

* toml

* toml

* remove pandas dependency

* remove pandas dependency

* toml

* toml

* lint

* snapshot

* skip test
@vincentarelbundock merged commit 1dc7b70 into main on Dec 27, 2023
@s3alfisc (Contributor) commented Dec 27, 2023

Hi @vincentarelbundock - I started taking a look yesterday and implemented some tests for regressions with fixed effects. All major functions (predictions(), avg_predictions(), and avg_slopes()) work correctly and produce identical results in Python and R. Only comparisons() fails: it returns nothing but NaN values. I have started debugging this, but it might take me a little more time to figure it out.

If you are interested, I'd be happy to open a PR with more extensive unit tests. Following your template in test_pyfixest.py, I would simply add tests for OLS with fixed effects, for IV, and, once I add the required functionality to its predict method, for Poisson regression. A rough sketch of what I have in mind is below.
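
For concreteness, here is a rough sketch of how such parametrized tests could look; the reference file r_reference_feols.csv, the tolerance, and the overall layout are hypothetical placeholders, not the actual contents of test_pyfixest.py:

import polars as pl
import pytest
from pyfixest.estimation import feols
from pyfixest.utils import get_data
from marginaleffects import avg_slopes

data = get_data()

@pytest.mark.parametrize(
    "formula",
    [
        "Y ~ X1 * X2 * Z1",       # plain OLS
        "Y ~ X1 * X2 * Z1 | f1",  # OLS with one fixed effect
    ],
)
def test_avg_slopes_against_r(formula):
    fit = feols(formula, data = data)
    est = avg_slopes(fit).sort("term")
    # Reference estimates exported beforehand from R's {marginaleffects}
    # for the same formulas and the same data set (placeholder file name).
    ref = (
        pl.read_csv("r_reference_feols.csv")
        .filter(pl.col("formula") == formula)
        .sort("term")
    )
    diff = est["estimate"].cast(pl.Float64) - ref["estimate"].cast(pl.Float64)
    assert diff.abs().max() < 1e-4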

@vincentarelbundock (Owner, Author) commented Dec 27, 2023

Ah, thanks, this sounds amazing! Yep, a PR seems like a good way to go. I can hopefully debug the comparisons() issue pretty easily once I see a reproducible example.

@s3alfisc (Contributor) commented Dec 27, 2023

Here's a reproducible example for what I think is a bug in comparisons():

import polars as pl
import numpy as np
from pyfixest.estimation import feols
from marginaleffects import *

def create_test_data():

    np.random.seed(1024)
    data = pl.DataFrame({
        "X1": np.random.normal(size = 1000),
        "X2": np.random.normal(size = 1000),
        "Z1": np.random.normal(size = 1000),
        "e": np.random.normal(size = 1000),
        "f1": np.random.choice([0, 1, 2, 3, 4, 5], size = 1000, replace = True),
        "f2": np.random.choice([0, 1, 2, 3, 4, 5], size = 1000, replace = True)
    }).with_columns((pl.col("X1") * pl.col("X2") * pl.col("Z1") + pl.col("e")).alias("Y"))

    return data

data = create_test_data()

# test: OLS with one fixed effect (f1)
fit = feols("Y ~ X1 * X2 * Z1 | f1", data = data.to_pandas())

comparisons(fit, newdata = datagrid(X1 = [2, 4], model = fit))
X1   rowid  term  contrast  estimate  std_error  statistic  p_value  s_value  conf_low  conf_high  predicted  predicted_lo  predicted_hi  X2         Z1        e         f1     f2     Y
i64  i32    str   str       f64       f64        f64        f64      f64      f64       f64        f64        f64           f64           f64        f64       f64       f64    f64    f64
2    0      "X1"  "+1"      NaN       NaN        NaN        NaN      NaN      NaN       NaN        NaN        NaN           NaN           -0.024305  0.023899  0.004883  2.573  2.431  0.010447
4    1      "X1"  "+1"      NaN       NaN        NaN        NaN      NaN      NaN       NaN        NaN        NaN           NaN           -0.024305  0.023899  0.004883  2.573  2.431  0.010447
2    0      "X2"  "+1"      NaN       NaN        NaN        NaN      NaN      NaN       NaN        NaN        NaN           NaN           -0.024305  0.023899  0.004883  2.573  2.431  0.010447
4    1      "X2"  "+1"      NaN       NaN        NaN        NaN      NaN      NaN       NaN        NaN        NaN           NaN           -0.024305  0.023899  0.004883  2.573  2.431  0.010447
2    0      "Z1"  "+1"      NaN       NaN        NaN        NaN      NaN      NaN       NaN        NaN        NaN           NaN           -0.024305  0.023899  0.004883  2.573  2.431  0.010447
4    1      "Z1"  "+1"      NaN       NaN        NaN        NaN      NaN      NaN       NaN        NaN        NaN           NaN           -0.024305  0.023899  0.004883  2.573  2.431  0.010447

In R, I instead get

 Term Contrast Estimate Std. Error     z Pr(>|z|)    S   2.5 %  97.5 %      X2     Z1 f1 X1
   X1       +1   0.0259     0.0209  1.24  0.21578  2.2 -0.0151  0.0669 -0.0243 0.0239  2  2
   X1       +1   0.0259     0.0209  1.24  0.21578  2.2 -0.0151  0.0669 -0.0243 0.0239  2  4
   X2       +1   0.0712     0.0489  1.46  0.14524  2.8 -0.0246  0.1670 -0.0243 0.0239  2  2
   X2       +1   0.1612     0.1094  1.47  0.14068  2.8 -0.0533  0.3757 -0.0243 0.0239  2  4
   Z1       +1  -0.1305     0.0392 -3.33  < 0.001 10.2 -0.2073 -0.0538 -0.0243 0.0239  2  2
   Z1       +1  -0.2903     0.0907 -3.20  0.00138  9.5 -0.4681 -0.1125 -0.0243 0.0239  2  4

by running

library(fixest)
library(marginaleffects)
library(reticulate)

data = py$data
fit = feols(Y ~ X1 * X2 * Z1 | f1, data = data)
comparisons(fit, newdata = datagrid(X1 = c(2,4), model = fit))

I'll set up the PR with tests before the new year =)
