Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PyFixest returns nan in models with fixed effects #75

Closed
vincentarelbundock opened this issue Dec 27, 2023 · 10 comments
Closed

PyFixest returns nan in models with fixed effects #75

vincentarelbundock opened this issue Dec 27, 2023 · 10 comments

Comments

@vincentarelbundock
Copy link
Owner

vincentarelbundock commented Dec 27, 2023

Here's a reproducible example for what I think is a bug in comparisons():

import polars as pl
import numpy as np
from pyfixest.estimation import feols
from marginaleffects import *

def create_test_data():

    np.random.seed(1024)
    data = pl.DataFrame({
        "X1": np.random.normal(size = 1000),
        "X2": np.random.normal(size = 1000),
        "Z1": np.random.normal(size = 1000),
        "e": np.random.normal(size = 1000),
        "f1": np.random.choice([0, 1, 2, 3, 4, 5], size = 1000, replace = True),
        "f2": np.random.choice([0, 1, 2, 3, 4, 5], size = 1000, replace = True)
    }).with_columns((pl.col("X1") * pl.col("X2") * pl.col("Z1") + pl.col("e")).alias("Y"))

    return data

data = create_test_data()

# test 1: no fixed effects
fit = feols("Y ~ X1 * X2 * Z1 | f1", data = data.to_pandas())

comparisons(fit, newdata = datagrid(X1 = [2, 4], model = fit))
X1	rowid	term	contrast	estimate	std_error	statistic	p_value	s_value	conf_low	conf_high	predicted	predicted_lo	predicted_hi	X2	Z1	e	f1	f2	Y
i64	i32	str	str	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64
2	0	"X1"	"+1"	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	-0.024305	0.023899	0.004883	2.573	2.431	0.010447
4	1	"X1"	"+1"	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	-0.024305	0.023899	0.004883	2.573	2.431	0.010447
2	0	"X2"	"+1"	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	-0.024305	0.023899	0.004883	2.573	2.431	0.010447
4	1	"X2"	"+1"	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	-0.024305	0.023899	0.004883	2.573	2.431	0.010447
2	0	"Z1"	"+1"	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	-0.024305	0.023899	0.004883	2.573	2.431	0.010447
4	1	"Z1"	"+1"	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	-0.024305	0.023899	0.004883	2.573	2.431	0.010447

In R, I instead get

 Term Contrast Estimate Std. Error     z Pr(>|z|)    S   2.5 %  97.5 %      X2     Z1 f1 X1
   X1       +1   0.0259     0.0209  1.24  0.21578  2.2 -0.0151  0.0669 -0.0243 0.0239  2  2
   X1       +1   0.0259     0.0209  1.24  0.21578  2.2 -0.0151  0.0669 -0.0243 0.0239  2  4
   X2       +1   0.0712     0.0489  1.46  0.14524  2.8 -0.0246  0.1670 -0.0243 0.0239  2  2
   X2       +1   0.1612     0.1094  1.47  0.14068  2.8 -0.0533  0.3757 -0.0243 0.0239  2  4
   Z1       +1  -0.1305     0.0392 -3.33  < 0.001 10.2 -0.2073 -0.0538 -0.0243 0.0239  2  2
   Z1       +1  -0.2903     0.0907 -3.20  0.00138  9.5 -0.4681 -0.1125 -0.0243 0.0239  2  4

by running

library(fixest)
library(marginaleffects)
library(reticulate)

data = py$data
fit = feols(Y ~ X1 * X2 * Z1 | f1, data = data)
comparisons(fit, newdata = datagrid(X1 = c(2,4), model = fit))

I'll set up the PR with tests before the new year =)

Originally posted by @s3alfisc in #56 (comment)

@vincentarelbundock
Copy link
Owner Author

vincentarelbundock commented Dec 27, 2023

Edit to tag @s3alfisc

The problem is that datagrid() sets all variables that were not specified explicitly to their means or modes. In this case, f1 is integer, and taking the mean gives you a float with some numbers after the decimal.

This is obviously an invalid fixed effect value for pyfixest, so fit.predict() returns nan.

IMHO, the best solution available would be for PyFixest to be stricter about user input for fixed effects variables. In particular, it would seem safer if you could require those variables to be something like a categorical type (pl.Categorical in polars, and I think there's a pandas analogue). That would require users to be more explicit (a good thing!), and would prevent any package interacting with pyfixest from applying functions like np.mean() to columns which should not be treated as numeric.

Alternatively (and immediately), we can specify the values of f1 that we want to consider explicitly in the datagrid call. For example:

# comparison for a specific unit
comparisons(fit, newdata = datagrid(
    X1 = [2, 4],
    f1 = 1,
    model = fit)
)

# comparisons for all units
comparisons(fit, newdata = datagrid(
    X1 = [2, 4],
    f1 = data["f1"].unique(),
    model = fit)
)

@s3alfisc
Copy link
Contributor

s3alfisc commented Dec 30, 2023

Hi @vincentarelbundock, sorry for not responding right away, it took me some time to check pyfixest's code and think through options.

feols.predict() should work independently of the type of the fixed effects variable f1 as it is always wrapped into a C(f1) operator prior to calling formulaic. So predict() should treat all fixed effects as proper factors.

import polars as pl
import numpy as np
import pandas as pd
from pyfixest.estimation import feols


def create_test_data():

    np.random.seed(1024)
    data = pl.DataFrame({
        "X1": np.random.normal(size = 1000),
        "X2": np.random.normal(size = 1000),
        "Z1": np.random.normal(size = 1000),
        "e": np.random.normal(size = 1000),
        "f1": np.random.choice([0, 1, 2, 3, 4, 5], size = 1000, replace = True),
        "f2": np.random.choice([0, 1, 2, 3, 4, 5], size = 1000, replace = True)
    }).with_columns((pl.col("X1") * pl.col("X2") * pl.col("Z1") + pl.col("e")).alias("Y"))

    return data

data = create_test_data().to_pandas()

fit = feols("Y ~ X1 * X2 * Z1 | f1", data = data)
fit.predict()[:5] # array([-0.20069114, -0.18479564,  1.30258071,  0.17423046,  0.39628393])
data["f1"] = pd.Categorical(data["f1"])
fit = feols("Y ~ X1 * X2 * Z1 | f1", data = data)
fit.predict()[:5] # array([-0.20069114, -0.18479564,  1.30258071,  0.17423046,  0.39628393])

Nevertheless, f1 is not changed to Categorical in the _data attribute of feols. This is something that I could easily implement and that should solve the problem above if I understand correctly? I am a little bit reluctant to require users to provide fixed effects as Categoricals as I think it's a nice convenience feature, and fixest also doesn't do it. What do you think of this?

@vincentarelbundock
Copy link
Owner Author

Yeah, I get it. It is a nice in fact a nice convenience feature.

I wonder about transforming the data. Maybe some users will be surprised to find that the dataset hosted in the fit object is not the same as the original.

Could you maybe store an attribute in the fit object to say which variables should be treated as categorical? marginaleffects could then look at that and act accordingly.

@s3alfisc
Copy link
Contributor

I wonder about transforming the data. Maybe some users will be surprised to find that the dataset hosted in the fit object is not the same as the original.

I also don't love it, so I was thinking to add a function argument convert_fixef_to_categoricalts = True to feols() / fepois() that by default converts all fixed effects to factors in feols._data, and to print a warning in case it actually happens. This way, users should be informed, have the option to turn off the behavior, and marginaleffects can work with categoricals without any extra effort.

@vincentarelbundock
Copy link
Owner Author

Look, I won't argue if you want to take this on, but if I were you I'd be reticent to polluting the user interface with an additional argument for something that is basically only useful for internal mechanics... Your call, obviously.

On my end, the only thing that would need to change would be this existing function:

https://github.com/vincentarelbundock/pymarginaleffects/blob/main/marginaleffects/utils.py#L99

We could add an additional model argument and extract the info from there. So it's quite trivial.

But again, your call.

@s3alfisc
Copy link
Contributor

s3alfisc commented Dec 30, 2023

I think you're right. 👍 You can actually already access all fixed effects as a formula string via

fit = feols("Y ~ X1 * X2 * Z1 | f1 + f2", data = data)
fit._fixef # 'f1+f2'
fit._fixef.split("+") #['f1', 'f2']

@vincentarelbundock
Copy link
Owner Author

And there's never a "^" as in fixest for R? "+" is always the right split string?

@s3alfisc
Copy link
Contributor

There might be, but then _data actually contains a new column f1^f2:

fit = feols("Y ~ X1 * X2 * Z1 | f1^f2", data = data)
fit._fixef.split("+") #['f1^f2']
fit._data.head()
X1	X2	Z1	e	f1	f2	Y	f1^f2
0	2.124449	0.139458	-1.133521	0.620320	2	2	0.284490	2^2
1	0.252646	0.394952	-0.108843	-0.066855	1	3	-0.077715	1^3
2	1.454179	0.778479	1.350050	2.289958	1	2	3.818280	1^2
3	0.569240	-0.422725	-0.573504	-0.417212	0	2	-0.279209	0^2
4	0.458224	-0.535603	-1.376782	0.725313	3	4	1.063211	3^4

vincentarelbundock added a commit that referenced this issue Dec 31, 2023
* broken

* bugfix

* Model classes hold dictionary of variable classes

* pyfixest recognizes fixed effects as categorical in datagrid()
@vincentarelbundock
Copy link
Owner Author

Should be fixed on Github. Thanks for following up on this and for the extra info.

import polars as pl
import numpy as np
from pyfixest.estimation import feols
from marginaleffects import *

def create_test_data():
    np.random.seed(1024)
    data = pl.DataFrame({
        "X1": np.random.normal(size = 1000),
        "X2": np.random.normal(size = 1000),
        "Z1": np.random.normal(size = 1000),
        "e": np.random.normal(size = 1000),
        "f1": np.random.choice([0, 1, 2, 3, 4, 5], size = 1000, replace = True),
        "f2": np.random.choice([0, 1, 2, 3, 4, 5], size = 1000, replace = True)
    }).with_columns((pl.col("X1") * pl.col("X2") * pl.col("Z1") + pl.col("e")).alias("Y"))
    return data

data = create_test_data()

fit = feols("Y ~ X1 * X2 * Z1 | f1", data = data.to_pandas())

comparisons(fit, newdata = datagrid(X1 = [2, 4], model = fit))

shape: (6, 10)
┌─────┬──────┬──────────┬──────────┬───┬──────────┬──────┬─────────┬─────────┐
│ X1TermContrastEstimate ┆ … ┆ P(>|z|)  ┆ S2.5%97.5%   │
│ ------------      ┆   ┆ ------------     │
│ strstrstrstr      ┆   ┆ strstrstrstr     │
╞═════╪══════╪══════════╪══════════╪═══╪══════════╪══════╪═════════╪═════════╡
│ 2X1+10.0259   ┆ … ┆ 0.2162.21-0.01510.0669  │
│ 4X1+10.0259   ┆ … ┆ 0.2162.21-0.01510.0669  │
│ 2X2+10.0712   ┆ … ┆ 0.1452.79-0.02460.167   │
│ 4X2+10.161    ┆ … ┆ 0.142.83-0.05320.376   │
│ 2Z1+1-0.131   ┆ … ┆ 0.00085210.2-0.207-0.0538 │
│ 4Z1+1-0.29    ┆ … ┆ 0.001379.51-0.468-0.113  │
└─────┴──────┴──────────┴──────────┴───┴──────────┴──────┴─────────┴─────────┘

@s3alfisc
Copy link
Contributor

s3alfisc commented Jan 1, 2024

Awesome! I'll add some very basic tests to pyfixest to make sure I never introduce any breaking changes (like renaming self._fixef) + will try to provide the promised PR with additional tests by the end of the week.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants