`predict` bug with mismatched data types #731

ivanhigueram · 2024-12-04T20:55:15Z

Hello there,

I am trying to generate predictions for a dataset other than the one I used to fit my model. I found that if there is any type mismatch between newdata and the data used to estimate the model, the predict() method returns an array of nan.

Here's a reproducible example:

import pyfixest as pf

data = pf.get_data()

# Run model
model = pf.feols("Y ~ X1 | f1", data=data)

# Allow me to fill the nans just to make my points :)
data["f1"] = data["f1"].fillna(0).astype(int)
model.predict(newdata=data.iloc[0:100])

>>> array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, ...])

This comes from the definition of df_fe in predict() and from _apply_fixef_numpy: the fixed-effect values are cast to strings, so the dictionary keys are stored as "10.0" rather than "10".

pyfixest/pyfixest/estimation/feols_.py

Line 1832 in 10a8fb8

df_fe = newdata[fvals].astype(str)
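To illustrate the mechanism described above, here is a minimal pure-Python sketch. It assumes the fixed effects are looked up in a dictionary keyed by the string form of each level; `fixef_dict` and `lookup` are hypothetical stand-ins, not pyfixest's actual internals.

```python
# Hypothetical stand-in for the fixed-effect lookup; not pyfixest's real code.
# The training column is float (it held NaNs), so the keys look like "10.0".
train_f1 = [10.0, 11.0, 12.0]
effects = [0.5, -0.2, 0.1]
fixef_dict = {str(level): effect for level, effect in zip(train_f1, effects)}


def lookup(levels):
    """Return the stored effect per level, or NaN when the string key is absent."""
    return [fixef_dict.get(str(level), float("nan")) for level in levels]


float_result = lookup([10.0, 12.0])  # keys "10.0" and "12.0" are found
int_result = lookup([10, 12])        # keys "10" and "12" are missing
```

With float levels the lookup succeeds, while the same values stored as int produce keys that never match, which is consistent with the all-nan output shown above.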

Not sure if this constitutes a bug in itself, or if this is a skill issue, but it would be nice to get a warning, maybe? This is no problem for unit FEs if the ID is a str, but with dates we sometimes get 2010.0 rather than 2010 after time operations, because numpy preserves the data type.

I am running the '0.26.2' version in Python 3.10.


s3alfisc · 2024-12-04T21:17:23Z

Thanks for reporting this Ivan @ivanhigueram! I have to think about this one for a bit. My intuition would be that keys of 10.0 and 10 should be treated equally? For sure I think adding a warning would be a good start!

s3alfisc · 2024-12-04T21:18:31Z

If you had to choose, what would be your preferred behavior?

ivanhigueram · 2024-12-04T21:19:50Z

I'd say a warning would be enough. It is definitely easier to call astype() on the newdata than to try to fix this in your code base.
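For reference, the astype() workaround could look roughly like this. It is only a sketch on toy data: `train` and `newdata` are made-up frames standing in for the estimation data and the prediction data, and only pandas calls are used.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the estimation data and newdata (not pf.get_data()).
train = pd.DataFrame({"f1": [10.0, 11.0, np.nan]})  # float64 because of the NaN
newdata = pd.DataFrame({"f1": [10, 11]})            # int64

# Cast the fixed-effect column back to the dtype used at estimation time.
aligned = newdata.assign(f1=newdata["f1"].astype(train["f1"].dtype))

# String keys before and after the cast, mirroring the astype(str) step.
keys_before = newdata["f1"].astype(str).tolist()
keys_after = aligned["f1"].astype(str).tolist()
```

After the cast, the string form of the levels matches what a float training column would produce, so the keys line up again.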

s3alfisc · 2024-12-04T21:34:42Z

Yes I was also afraid that handling things in pyfixest would be a lot of work 😅 I'll add a warning then (or would you be up to open a PR? but of course no pressure 😄 ).

I think it could be as simple as adding

        if self._has_fixef:
            fixef = self._fixef.split("+")

            # Fixed effects whose dtype in newdata differs from the model data
            mismatched_fixef_types = [x for x in fixef if newdata[x].dtype != self._data[x].dtype]
            if mismatched_fixef_types:
                # assumes `import warnings` at module level
                warnings.warn(
                    f"Data types of fixed effects {mismatched_fixef_types} do not match the model data. "
                    "This leads to mismatched keys in the fixed effect dictionary, and as a result, "
                    "to NaN predictions for columns with mismatched keys."
                )

around here

pyfixest/pyfixest/estimation/feols_.py

Line 1811 in 611e5cc

and to add a test that triggers the warning in test_errors.py.
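A standalone version of that check, together with the kind of assertion a test might make, could be sketched as follows. The function name `warn_on_fixef_dtype_mismatch` and the toy frames are hypothetical; the real check would live inside predict() and the real test in test_errors.py.

```python
import warnings

import pandas as pd


def warn_on_fixef_dtype_mismatch(fixef, model_data, newdata):
    """Hypothetical helper: warn about fixed effects whose dtype differs.

    `fixef` is a "+"-separated string of column names, as in the snippet above.
    Returns the list of mismatched column names.
    """
    mismatched = [
        x for x in fixef.split("+") if newdata[x].dtype != model_data[x].dtype
    ]
    if mismatched:
        warnings.warn(
            f"Data types of fixed effects {mismatched} do not match the model data.",
            UserWarning,
        )
    return mismatched


# Toy data: f1 is float in the model data but int in newdata; f2 matches.
model_data = pd.DataFrame({"f1": [1.0, 2.0], "f2": ["a", "b"]})
newdata = pd.DataFrame({"f1": [1, 2], "f2": ["a", "b"]})

# Record warnings so they can be inspected, as a test would do.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    mismatched = warn_on_fixef_dtype_mismatch("f1+f2", model_data, newdata)
```

In a pytest-based test the recording block could be replaced by `pytest.warns(UserWarning)` wrapped around the predict() call.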

ivanhigueram · 2024-12-04T21:57:13Z

I'd love to open the PR. I will ping you if I run into any problems with testing and all that.

s3alfisc assigned ivanhigueram Dec 15, 2024
