Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

predict bug with mismatched data types #731

Open
ivanhigueram opened this issue Dec 4, 2024 · 5 comments
Open

predict bug with mismatched data types #731

ivanhigueram opened this issue Dec 4, 2024 · 5 comments
Assignees

Comments

@ivanhigueram
Copy link

ivanhigueram commented Dec 4, 2024

Hello there,

I am trying to create predictions with a dataset outside my model data. I found that if there's any type mismatch in the newdata compared with the data we used to estimate the model, the predict() method will return an array of nan:

Here's a reproducible example:

import pyfixest as pf

data = pf.get_data()

# Run model
model = pf.feols("Y ~ X1 | f1", data=data)

# Allow me to fill the nans just to make my points :)
data["f1"] = data["f1"].fillna(0).astype(int)
model.predict(newdata=data.iloc[0:100])

>>> array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, ...])

This error is coming from the definition of df_fe in the predict() and the _apply_fixef_numpy as the dictionary keys will be saved with a 10.0 rather than 10.

df_fe = newdata[fvals].astype(str)

Not sure if this constitutes a bug on itself, or if this is a skill issue, but it would be nice to get a warning maybe? This is no problem for unit FEs if the ID is a str, but in dates, sometimes we get 2010.0 rather than 2010 when we do time operations and numpy preserves the data type.

I am running the '0.26.2' version in Python 3.10.

@s3alfisc
Copy link
Member

s3alfisc commented Dec 4, 2024

Thanks for reporting this Ivan @ivanhigueram! I have to think about this one for a bit. My intuition would be that keys of 10.0 and 10 should be treated equally? For sure I think adding a warning would be a good start!

@s3alfisc
Copy link
Member

s3alfisc commented Dec 4, 2024

If you had to choose, what would be your preferred behavior?

@ivanhigueram
Copy link
Author

I'd say a warning would be enough. Is definitely easier to do a astype() in the newdata than trying to fix this in your code base.

@s3alfisc
Copy link
Member

s3alfisc commented Dec 4, 2024

Yes I was also afraid that handling things in pyfixest would be a lot of work 😅 I'll add a warning then (or would you be up to open a PR? but of course no pressure 😄 ).

I think it could be as simple as adding

        if self._has_fixef: 
            fixef = self._fixef.split("+")

            mismatched_fixef_types = [x for x in fixef if newdata[x].dtypes != self._data[x].dtypes]
            if mismatched_fixef_types:
                warnings.warn(f"Data types of fixed effects {mismatched_fixef_types} do not match the model data. This leads to     mismatched keys in the fixed effect dictionary, and as a result, to NaN predictions for columns with mismatched keys.")        

around here

and to add a test that triggers the warning in test_errors.py.

@ivanhigueram
Copy link
Author

I'd love to open the PR. I will ping if I go into any problems with testing and all that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants