
add pure numpy predict function that speeds up ~40x #553

Merged: 5 commits, Jul 17, 2024

Conversation

@apoorvalal (Member) commented Jul 15, 2024

Addressing an example from @b-knight on Discord: the current predict function is quite inefficient in the presence of many FEs.

I did a quick and dirty numpy implementation (currently implemented as predict2: I didn't want to override the default predict method in case it upsets some unit tests that I don't know about - @s3alfisc please verify) that yields a ~50x speedup and (afaict) no numerical difference.
Notebook
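The core of the speedup is replacing per-row pandas lookups with vectorized numpy indexing. A minimal sketch of the idea (hypothetical function and variable names, not the actual pyfixest implementation):

```python
import numpy as np

def predict_np(X, beta, fe_codes, fe_effects):
    """Vectorized prediction: X @ beta plus the sum of fixed-effect values.

    X          : (n, k) covariate matrix
    beta       : (k,) coefficient vector
    fe_codes   : (n, q) integer level codes, one column per fixed effect
    fe_effects : list of q arrays, fe_effects[j][l] = effect of level l
    """
    yhat = X @ beta
    for j, effects in enumerate(fe_effects):
        # fancy indexing replaces a per-row Python/pandas lookup
        yhat += effects[fe_codes[:, j]]
    return yhat

# toy example: one covariate, one fixed effect with two levels
X = np.array([[1.0], [2.0], [3.0]])
beta = np.array([2.0])
fe_codes = np.array([[0], [1], [0]])     # each row's level code
fe_effects = [np.array([10.0, 20.0])]    # estimated effect per level
print(predict_np(X, beta, fe_codes, fe_effects))  # [12. 24. 16.]
```

The whole prediction is two BLAS/indexing operations per fixed effect, which is where the order-of-magnitude speedup over row-wise lookups comes from.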


codecov bot commented Jul 15, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Files Coverage Δ
pyfixest/estimation/feols_.py 90.53% <100.00%> (ø)

... and 28 files with indirect coverage changes

@apoorvalal force-pushed the speed_up_predict_fe branch from adee239 to e30b75e on July 15, 2024 at 04:59
@apoorvalal (Author)

codecov hates me and the feeling is mutual

@s3alfisc (Member)

Sometimes, when no one is watching closely, I dare to merge without codecov's permission 🙈😀

- rename `predict2` to `predict`
- delete original `predict`
@s3alfisc (Member)

Hi @apoorvalal, I have renamed predict2 to predict and deleted the original predict. This should help with codecov and trigger all unit tests =)

@s3alfisc mentioned this pull request on Jul 15, 2024
@s3alfisc (Member)

Looks like something is going wrong with NaN values 🤔

FAILED tests/test_predict_resid_fixef.py::test_predict_nas - AssertionError:
Not equal to tolerance rtol=1e-07, atol=0

x and y nan location mismatch:
x: array([ 1.807314e+00, nan, -2.170578e+00, 1.844755e+00,
-1.710621e-01, 4.697018e-01, -7.419144e-01, -1.526513e+00,
1.356631e+00, -2.075729e+00, -1.000085e+00, 3.599896e+00,...
y: array([ 1.807314e+00, nan, nan, 1.844755e+00,

@apoorvalal (Author)

Fixed that, but noticed something else odd in the process (this is a remaining unit-test failure). The formula produces the following fixed-effect dict. I'm unsure why all distinct values of f1 have a fixed effect (as evidenced by there being no difference between the list of unique values and the keys of the FE dict for f1); I would have expected something like f2, where the first value (0.0) is omitted.

[screenshot of the formula and the resulting fixed-effect dict]

Am I missing something? Do you pool all strata of FEs and omit one (as opposed to omitting one level of each fixed effect), which would explain the difference between f1 and f2 here?

@@ -1487,7 +1487,7 @@ def predict(self, newdata: Optional[DataFrameType] = None) -> np.ndarray:
         fixef_dicts = {}
         for f in fvals:
             fdict = self._fixef_dict[f"C({f})"]
-            omitted_cat = set(self._data[f].unique().astype(str)) - set(
+            omitted_cat = set(self._data[f].unique().astype(str).tolist()) - set(
@apoorvalal (Author)

detects the omitted category correctly now, so unit tests all pass.

-        fixef_mat[:, i] = vectorized_map(fe_levels, mapping).astype(float)
+        unique_levels, inverse = np.unique(df_fe_values[:, i], return_inverse=True)
+        mapping = np.array([subdict.get(level, np.nan) for level in unique_levels])
+        fixef_mat[:, i] = mapping[inverse]
@apoorvalal (Author)

repeats FE values based on inverse vector
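As a standalone illustration of the `np.unique(..., return_inverse=True)` pattern used here (toy data, assuming a plain `dict` mapping level to effect, with missing levels filled as NaN):

```python
import numpy as np

levels = np.array(["a", "b", "a", "c", "b"])   # observed fixed-effect levels
subdict = {"a": 1.5, "b": -0.5}                # "c" absent, e.g. a dropped level

unique_levels, inverse = np.unique(levels, return_inverse=True)
# one dict lookup per *unique* level, NaN where the level has no estimate
mapping = np.array([subdict.get(lvl, np.nan) for lvl in unique_levels])
effects = mapping[inverse]   # inverse repeats each unique value per observation
print(effects)  # [ 1.5 -0.5  1.5  nan -0.5]
```

The Python-level work scales with the number of distinct levels rather than the number of rows, which is the point of the change.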

@s3alfisc (Member) commented Jul 16, 2024

Re your comment on the fixed-effects levels - I was also assuming that one reference level would be dropped per fixed effect, but this is not what r-fixest implements: only a single reference level is dropped across all the fixed effects:

library(fixest)
library(reticulate)
data = py$data

fit = feols(
  Y ~ X1 | f1 + f2,
  data = data
)

fit

fixef(fit)
# $f1
#           0           1           2           3 
#  1.85823546  4.15857700  0.01392762  1.26009884 
#           4           5           6           7 
# -0.01674004  3.31246464  0.91274113  2.30340233 
#           8           9          10          11 
#  1.23155227  1.70117551  2.27371739  2.96642848 
#          12          13          14          15 
#  2.28065324  3.53616576  1.52956294  2.92011824 
#          16          17          18          19 
#  3.86997352  1.51750366  3.19879406  2.35433006 
#          20          21          22          23 
#  4.46921954  0.09201573  0.73233885  0.96252599 
#          24          25          26          27 
#  0.60007651  0.23480344  2.35894173  1.49875080 
#          28          29 
#  3.28294866  2.13958053 
# 
# $f2
#           0           1           2           3 
# -0.71670285 -0.32429502 -1.30150753 -2.99409568 
#           4           5           6           7 
# -1.02914628 -0.60182497 -0.30081185 -1.90983043 
#           8           9          10          11 
# -1.95719698 -2.46953986  0.00000000 -1.58053084 
#          12          13          14          15 
# -0.35922732 -3.00743412 -1.36364968 -0.99794680 
#          16          17          18          19 
# -0.95586933 -0.98298529 -1.56837428  0.08168792 
#          20          21          22          23 
# -1.95953248 -1.50922225 -0.77759901 -1.35969409 
#          24          25          26          27 
# -1.31521249  0.34309863 -0.83094012 -2.33659224 
#          28          29 
# -0.60359771 -3.25957044 

For f2, level 10 is dropped / set to zero, but we indeed have estimated values for all levels of f1.

Do you have a good explanation for this? 🤔

I don't think it is a bug though, as r-fixest matches lm:

sort(predict(fit))[1:3]
# [1] -5.114820 -5.084153 -4.863277
sort(predict(lm(Y ~ X1 + as.factor(f1) + as.factor(f2), data = data)))[1:3]
# -5.114820 -5.084153 -4.863277 

@apoorvalal (Author)

Huh, TIL. So I guess FEs are implicitly pooled before one-hot encoding.
cc-ing @lrberge: any downsides to this behaviour?

@s3alfisc (Member)

Ah, I found the relevant section in the fixest docs:

If there is more than 1 fixed-effect, then the attribute “references” is created. This is a vector of length the number of fixed-effects, each element contains the number of coefficients set as references. By construction, the elements of the first fixed-effect dimension are never set as references. In the presence of regular fixed-effects, there should be Q-1 references (with Q the number of fixed-effects).

If the fixed-effect coefficients are not regular, then several reference points need to be set: this means that the fixed-effects coefficients cannot be directly interpreted. If this is the case, then a warning is raised.

So really one reference level per fixed effect, and none for the first one. The default seems to be to assume "irregular" fixed effects? This is also what pyfixest implements, I think / hope 😄
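The Q-1 rule has a linear-algebra reading: every fixed effect's full dummy block has columns that sum to the all-ones vector, so with Q blocks there are Q-1 exact collinearities, and Q-1 reference levels must be dropped. A toy numpy check (my own illustration, not code from the PR):

```python
import numpy as np

# toy panel: 20 observations, f1 has 4 levels, f2 has 3 levels
f1 = np.tile(np.arange(4), 5)
f2 = np.tile(np.array([0, 1, 2, 0, 1]), 4)

D1 = np.eye(4)[f1]            # full one-hot block for f1 (no level dropped)
D2 = np.eye(3)[f2]            # full one-hot block for f2
D = np.hstack([D1, D2])       # 4 + 3 = 7 columns

# each block's columns sum to the all-ones vector, so the two blocks
# share exactly one linear dependency: rank is 7 - 1 = 6
print(np.linalg.matrix_rank(D))  # 6 -> Q - 1 = 1 reference level needed
```

This matches the r-fixest output above: all levels of f1 are reported, and exactly one level of f2 is pinned to zero. (This one-dependency count assumes the fixed effects are "regular", i.e. the level graph is connected; otherwise more references are needed, as the docs warn.)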

@apoorvalal (Author)

Makes sense. Anyway, I think this is ready to merge; you're welcome to do some more testing - the lookups ended up being surprisingly simple.

@s3alfisc (Member)

Yep, I'll take one closer look tomorrow & will merge then =)

@s3alfisc merged commit 506cd51 into py-econometrics:master on Jul 17, 2024
7 checks passed
@apoorvalal (Author)

Just noticed that you implemented a lean argument that (correctly, IMO) doesn't attach the data to the model object; this (and the previous) implementation of predict would fail in that case, because it looks up the reference category via the ._data attribute before filling in the fixed-effects values matrix.

A leaner implementation would store the reference category in the .fixef() method, so that the old data doesn't need to be looked up.
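A sketch of that idea (hypothetical class and method names, assuming effects come back as plain dicts): cache the omitted reference levels once when the fixed effects are recovered, so prediction never touches the raw data again:

```python
class LeanFixef:
    """Hypothetical sketch: cache the omitted reference levels when the
    fixed effects are recovered, so prediction never needs the raw data
    (e.g. when a lean option drops it from the model object)."""

    def fit_fixef(self, observed_levels, estimated_effects):
        # observed_levels:   {fe_name: set of all levels seen in the data}
        # estimated_effects: {fe_name: {level: estimated effect}}
        self.estimated_effects = estimated_effects
        self.omitted = {
            f: observed_levels[f] - set(estimated_effects[f])
            for f in observed_levels
        }

    def effect(self, f, level):
        # omitted reference levels contribute zero; no lookup into raw data
        if level in self.omitted[f]:
            return 0.0
        return self.estimated_effects[f][level]

m = LeanFixef()
m.fit_fixef({"f2": {"0", "1", "2"}}, {"f2": {"1": 0.5, "2": -0.3}})
print(m.effect("f2", "0"))  # 0.0: the cached omitted reference level
```

The set difference is computed once at fit time, which is exactly the piece that currently requires ._data at predict time.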

@apoorvalal deleted the speed_up_predict_fe branch on July 28, 2024 at 21:19