`PyFixest` returns `nan` in models with fixed effects #75

vincentarelbundock · 2023-12-27T19:08:37Z

Here's a reproducible example for what I think is a bug in comparisons():

import polars as pl
import numpy as np
from pyfixest.estimation import feols
from marginaleffects import *

def create_test_data():

    np.random.seed(1024)
    data = pl.DataFrame({
        "X1": np.random.normal(size = 1000),
        "X2": np.random.normal(size = 1000),
        "Z1": np.random.normal(size = 1000),
        "e": np.random.normal(size = 1000),
        "f1": np.random.choice([0, 1, 2, 3, 4, 5], size = 1000, replace = True),
        "f2": np.random.choice([0, 1, 2, 3, 4, 5], size = 1000, replace = True)
    }).with_columns((pl.col("X1") * pl.col("X2") * pl.col("Z1") + pl.col("e")).alias("Y"))

    return data

data = create_test_data()

# test 1: no fixed effects
fit = feols("Y ~ X1 * X2 * Z1 | f1", data = data.to_pandas())

comparisons(fit, newdata = datagrid(X1 = [2, 4], model = fit))
X1	rowid	term	contrast	estimate	std_error	statistic	p_value	s_value	conf_low	conf_high	predicted	predicted_lo	predicted_hi	X2	Z1	e	f1	f2	Y
i64	i32	str	str	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64
2	0	"X1"	"+1"	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	-0.024305	0.023899	0.004883	2.573	2.431	0.010447
4	1	"X1"	"+1"	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	-0.024305	0.023899	0.004883	2.573	2.431	0.010447
2	0	"X2"	"+1"	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	-0.024305	0.023899	0.004883	2.573	2.431	0.010447
4	1	"X2"	"+1"	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	-0.024305	0.023899	0.004883	2.573	2.431	0.010447
2	0	"Z1"	"+1"	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	-0.024305	0.023899	0.004883	2.573	2.431	0.010447
4	1	"Z1"	"+1"	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	-0.024305	0.023899	0.004883	2.573	2.431	0.010447

In R, I instead get

 Term Contrast Estimate Std. Error     z Pr(>|z|)    S   2.5 %  97.5 %      X2     Z1 f1 X1
   X1       +1   0.0259     0.0209  1.24  0.21578  2.2 -0.0151  0.0669 -0.0243 0.0239  2  2
   X1       +1   0.0259     0.0209  1.24  0.21578  2.2 -0.0151  0.0669 -0.0243 0.0239  2  4
   X2       +1   0.0712     0.0489  1.46  0.14524  2.8 -0.0246  0.1670 -0.0243 0.0239  2  2
   X2       +1   0.1612     0.1094  1.47  0.14068  2.8 -0.0533  0.3757 -0.0243 0.0239  2  4
   Z1       +1  -0.1305     0.0392 -3.33  < 0.001 10.2 -0.2073 -0.0538 -0.0243 0.0239  2  2
   Z1       +1  -0.2903     0.0907 -3.20  0.00138  9.5 -0.4681 -0.1125 -0.0243 0.0239  2  4

by running

library(fixest)
library(marginaleffects)
library(reticulate)

data = py$data
fit = feols(Y ~ X1 * X2 * Z1 | f1, data = data)
comparisons(fit, newdata = datagrid(X1 = c(2,4), model = fit))

I'll set up the PR with tests before the new year =)

Originally posted by @s3alfisc in #56 (comment)

The text was updated successfully, but these errors were encountered:

vincentarelbundock · 2023-12-27T19:17:30Z

Edit to tag @s3alfisc

The problem is that datagrid() sets all variables that were not specified explicitly to their means or modes. In this case, f1 is integer, and taking the mean gives you a float with some numbers after the decimal.

This is obviously an invalid fixed effect value for pyfixest, so fit.predict() returns nan.

IMHO, the best solution available would be for PyFixest to be stricter about user input for fixed effects variables. In particular, it would seem safer if you could require those variables to be something like a categorical type (pl.Categorical in polars, and I think there's a pandas analogue). That would require users to be more explicit (a good thing!), and would prevent any package interacting with pyfixest from applying functions like np.mean() to columns which should not be treated as numeric.

Alternatively (and immediately), we can specify the values of f1 that we want to consider explicitly in the datagrid call. For example:

# comparison for a specific unit
comparisons(fit, newdata = datagrid(
    X1 = [2, 4],
    f1 = 1,
    model = fit)
)

# comparisons for all units
comparisons(fit, newdata = datagrid(
    X1 = [2, 4],
    f1 = data["f1"].unique(),
    model = fit)
)

s3alfisc · 2023-12-30T10:04:54Z

Hi @vincentarelbundock, sorry for not responding right away, it took me some time to check pyfixest's code and think through options.

feols.predict() should work independently of the type of the fixed effects variable f1 as it is always wrapped into a C(f1) operator prior to calling formulaic. So predict() should treat all fixed effects as proper factors.

import polars as pl
import numpy as np
import pandas as pd
from pyfixest.estimation import feols


def create_test_data():

    np.random.seed(1024)
    data = pl.DataFrame({
        "X1": np.random.normal(size = 1000),
        "X2": np.random.normal(size = 1000),
        "Z1": np.random.normal(size = 1000),
        "e": np.random.normal(size = 1000),
        "f1": np.random.choice([0, 1, 2, 3, 4, 5], size = 1000, replace = True),
        "f2": np.random.choice([0, 1, 2, 3, 4, 5], size = 1000, replace = True)
    }).with_columns((pl.col("X1") * pl.col("X2") * pl.col("Z1") + pl.col("e")).alias("Y"))

    return data

data = create_test_data().to_pandas()

fit = feols("Y ~ X1 * X2 * Z1 | f1", data = data)
fit.predict()[:5] # array([-0.20069114, -0.18479564,  1.30258071,  0.17423046,  0.39628393])
data["f1"] = pd.Categorical(data["f1"])
fit = feols("Y ~ X1 * X2 * Z1 | f1", data = data)
fit.predict()[:5] # array([-0.20069114, -0.18479564,  1.30258071,  0.17423046,  0.39628393])

Nevertheless, f1 is not changed to Categorical in the _data attribute of feols. This is something that I could easily implement and that should solve the problem above if I understand correctly? I am a little bit reluctant to require users to provide fixed effects as Categoricals as I think it's a nice convenience feature, and fixest also doesn't do it. What do you think of this?

vincentarelbundock · 2023-12-30T12:54:20Z

Yeah, I get it. It is a nice in fact a nice convenience feature.

I wonder about transforming the data. Maybe some users will be surprised to find that the dataset hosted in the fit object is not the same as the original.

Could you maybe store an attribute in the fit object to say which variables should be treated as categorical? marginaleffects could then look at that and act accordingly.

s3alfisc · 2023-12-30T13:58:48Z

I wonder about transforming the data. Maybe some users will be surprised to find that the dataset hosted in the fit object is not the same as the original.

I also don't love it, so I was thinking to add a function argument convert_fixef_to_categoricalts = True to feols() / fepois() that by default converts all fixed effects to factors in feols._data, and to print a warning in case it actually happens. This way, users should be informed, have the option to turn off the behavior, and marginaleffects can work with categoricals without any extra effort.

vincentarelbundock · 2023-12-30T14:04:36Z

Look, I won't argue if you want to take this on, but if I were you I'd be reticent to polluting the user interface with an additional argument for something that is basically only useful for internal mechanics... Your call, obviously.

On my end, the only thing that would need to change would be this existing function:

https://github.com/vincentarelbundock/pymarginaleffects/blob/main/marginaleffects/utils.py#L99

We could add an additional model argument and extract the info from there. So it's quite trivial.

But again, your call.

s3alfisc · 2023-12-30T14:24:45Z

I think you're right. 👍 You can actually already access all fixed effects as a formula string via

fit = feols("Y ~ X1 * X2 * Z1 | f1 + f2", data = data)
fit._fixef # 'f1+f2'
fit._fixef.split("+") #['f1', 'f2']

vincentarelbundock · 2023-12-30T14:28:33Z

And there's never a "^" as in fixest for R? "+" is always the right split string?

s3alfisc · 2023-12-30T15:06:21Z

There might be, but then _data actually contains a new column f1^f2:

fit = feols("Y ~ X1 * X2 * Z1 | f1^f2", data = data)
fit._fixef.split("+") #['f1^f2']
fit._data.head()
X1	X2	Z1	e	f1	f2	Y	f1^f2
0	2.124449	0.139458	-1.133521	0.620320	2	2	0.284490	2^2
1	0.252646	0.394952	-0.108843	-0.066855	1	3	-0.077715	1^3
2	1.454179	0.778479	1.350050	2.289958	1	2	3.818280	1^2
3	0.569240	-0.422725	-0.573504	-0.417212	0	2	-0.279209	0^2
4	0.458224	-0.535603	-1.376782	0.725313	3	4	1.063211	3^4

* broken * bugfix * Model classes hold dictionary of variable classes * pyfixest recognizes fixed effects as categorical in datagrid()

vincentarelbundock · 2023-12-31T21:28:41Z

Should be fixed on Github. Thanks for following up on this and for the extra info.

import polars as pl
import numpy as np
from pyfixest.estimation import feols
from marginaleffects import *

def create_test_data():
    np.random.seed(1024)
    data = pl.DataFrame({
        "X1": np.random.normal(size = 1000),
        "X2": np.random.normal(size = 1000),
        "Z1": np.random.normal(size = 1000),
        "e": np.random.normal(size = 1000),
        "f1": np.random.choice([0, 1, 2, 3, 4, 5], size = 1000, replace = True),
        "f2": np.random.choice([0, 1, 2, 3, 4, 5], size = 1000, replace = True)
    }).with_columns((pl.col("X1") * pl.col("X2") * pl.col("Z1") + pl.col("e")).alias("Y"))
    return data

data = create_test_data()

fit = feols("Y ~ X1 * X2 * Z1 | f1", data = data.to_pandas())

comparisons(fit, newdata = datagrid(X1 = [2, 4], model = fit))

shape: (6, 10)
┌─────┬──────┬──────────┬──────────┬───┬──────────┬──────┬─────────┬─────────┐
│ X1  ┆ Term ┆ Contrast ┆ Estimate ┆ … ┆ P(>|z|)  ┆ S    ┆ 2.5%    ┆ 97.5%   │
│ --- ┆ ---  ┆ ---      ┆ ---      ┆   ┆ ---      ┆ ---  ┆ ---     ┆ ---     │
│ str ┆ str  ┆ str      ┆ str      ┆   ┆ str      ┆ str  ┆ str     ┆ str     │
╞═════╪══════╪══════════╪══════════╪═══╪══════════╪══════╪═════════╪═════════╡
│ 2   ┆ X1   ┆ +1       ┆ 0.0259   ┆ … ┆ 0.216    ┆ 2.21 ┆ -0.0151 ┆ 0.0669  │
│ 4   ┆ X1   ┆ +1       ┆ 0.0259   ┆ … ┆ 0.216    ┆ 2.21 ┆ -0.0151 ┆ 0.0669  │
│ 2   ┆ X2   ┆ +1       ┆ 0.0712   ┆ … ┆ 0.145    ┆ 2.79 ┆ -0.0246 ┆ 0.167   │
│ 4   ┆ X2   ┆ +1       ┆ 0.161    ┆ … ┆ 0.14     ┆ 2.83 ┆ -0.0532 ┆ 0.376   │
│ 2   ┆ Z1   ┆ +1       ┆ -0.131   ┆ … ┆ 0.000852 ┆ 10.2 ┆ -0.207  ┆ -0.0538 │
│ 4   ┆ Z1   ┆ +1       ┆ -0.29    ┆ … ┆ 0.00137  ┆ 9.51 ┆ -0.468  ┆ -0.113  │
└─────┴──────┴──────────┴──────────┴───┴──────────┴──────┴─────────┴─────────┘

s3alfisc · 2024-01-01T18:14:44Z

Awesome! I'll add some very basic tests to pyfixest to make sure I never introduce any breaking changes (like renaming self._fixef) + will try to provide the promised PR with additional tests by the end of the week.

vincentarelbundock added a commit that referenced this issue Dec 31, 2023

Issue #75 (#76)

d0ed69c

* broken * bugfix * Model classes hold dictionary of variable classes * pyfixest recognizes fixed effects as categorical in datagrid()

vincentarelbundock closed this as completed Dec 31, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`PyFixest` returns `nan` in models with fixed effects #75

`PyFixest` returns `nan` in models with fixed effects #75

vincentarelbundock commented Dec 27, 2023 •

edited

Loading

vincentarelbundock commented Dec 27, 2023 •

edited

Loading

s3alfisc commented Dec 30, 2023 •

edited

Loading

vincentarelbundock commented Dec 30, 2023

s3alfisc commented Dec 30, 2023

vincentarelbundock commented Dec 30, 2023

s3alfisc commented Dec 30, 2023 •

edited

Loading

vincentarelbundock commented Dec 30, 2023

s3alfisc commented Dec 30, 2023

vincentarelbundock commented Dec 31, 2023

s3alfisc commented Jan 1, 2024

PyFixest returns nan in models with fixed effects #75

PyFixest returns nan in models with fixed effects #75

Comments

vincentarelbundock commented Dec 27, 2023 • edited Loading

vincentarelbundock commented Dec 27, 2023 • edited Loading

s3alfisc commented Dec 30, 2023 • edited Loading

vincentarelbundock commented Dec 30, 2023

s3alfisc commented Dec 30, 2023

vincentarelbundock commented Dec 30, 2023

s3alfisc commented Dec 30, 2023 • edited Loading

vincentarelbundock commented Dec 30, 2023

s3alfisc commented Dec 30, 2023

vincentarelbundock commented Dec 31, 2023

s3alfisc commented Jan 1, 2024

`PyFixest` returns `nan` in models with fixed effects #75

`PyFixest` returns `nan` in models with fixed effects #75

vincentarelbundock commented Dec 27, 2023 •

edited

Loading

vincentarelbundock commented Dec 27, 2023 •

edited

Loading

s3alfisc commented Dec 30, 2023 •

edited

Loading

s3alfisc commented Dec 30, 2023 •

edited

Loading