add pure numpy predict function that speeds up ~40x #553
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅
codecov hates me and the feeling is mutual
Sometimes, when no one is watching closely, I dare to merge without codecov's permission 🙈😀
- rename `predict2` to `predict`
- delete original `predict`
Hi @apoorvalal, I have renamed `predict2` to `predict`.
Looks like something is going wrong with the handling of missing values.
…issings; fixes fixest mismatch unit test
Fixed that, but noticed something else odd in the process (a remaining unit test failure). Am I missing something? Do you pool all strata of FEs and omit one (as opposed to omitting one of each kind of FE)? That would explain the difference.
@@ -1487,7 +1487,7 @@ def predict(self, newdata: Optional[DataFrameType] = None) -> np.ndarray:
         fixef_dicts = {}
         for f in fvals:
             fdict = self._fixef_dict[f"C({f})"]
-            omitted_cat = set(self._data[f].unique().astype(str)) - set(
+            omitted_cat = set(self._data[f].unique().astype(str).tolist()) - set(
This detects the omitted category correctly now, so the unit tests all pass.
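For readers following along, a minimal self-contained sketch of what this line is doing (toy data and names, not the actual pyfixest objects): the omitted reference category of a fixed effect is whichever level appears in the data but has no entry in the estimated fixef dict.

```python
import numpy as np
import pandas as pd

# Hypothetical illustration (not the pyfixest implementation): the reference
# category is the level present in the data but absent from the estimates.
data = pd.DataFrame({"f1": [0, 1, 2, 3, 1, 2]})
fdict = {"1": 4.16, "2": 0.01, "3": 1.26}  # level "0" was absorbed as reference

omitted_cat = set(data["f1"].unique().astype(str).tolist()) - set(fdict.keys())
print(omitted_cat)  # {'0'}
```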
-            fixef_mat[:, i] = vectorized_map(fe_levels, mapping).astype(float)
+            unique_levels, inverse = np.unique(df_fe_values[:, i], return_inverse=True)
+            mapping = np.array([subdict.get(level, np.nan) for level in unique_levels])
+            fixef_mat[:, i] = mapping[inverse]
repeats FE values based on inverse vector
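A small self-contained sketch of the lookup pattern in these new lines (variable names mirror the diff, the data is made up): `np.unique(..., return_inverse=True)` returns the unique levels plus an inverse index that reconstructs the original column, so each level is looked up in the dict only once and then broadcast back via fancy indexing.

```python
import numpy as np

# Toy fixed-effect column and a dict of estimated effects; level "c" has no
# estimate and therefore maps to NaN.
fe_column = np.array(["a", "b", "a", "c", "b", "a"])
subdict = {"a": 1.5, "b": -0.3}

unique_levels, inverse = np.unique(fe_column, return_inverse=True)
mapping = np.array([subdict.get(level, np.nan) for level in unique_levels])
fe_effects = mapping[inverse]  # repeats each level's effect along the rows
print(fe_effects)  # [ 1.5 -0.3  1.5  nan -0.3  1.5]
```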
Re your comment on the fixed effects levels - I was also assuming that one reference level would be dropped per fixed effect, but this is not what fixest does:

library(fixest)
library(reticulate)
data = py$data
fit = feols(
Y ~ X1 | f1 + f2,
data = data
)
fit
fixef(fit)
# $f1
# 0 1 2 3
# 1.85823546 4.15857700 0.01392762 1.26009884
# 4 5 6 7
# -0.01674004 3.31246464 0.91274113 2.30340233
# 8 9 10 11
# 1.23155227 1.70117551 2.27371739 2.96642848
# 12 13 14 15
# 2.28065324 3.53616576 1.52956294 2.92011824
# 16 17 18 19
# 3.86997352 1.51750366 3.19879406 2.35433006
# 20 21 22 23
# 4.46921954 0.09201573 0.73233885 0.96252599
# 24 25 26 27
# 0.60007651 0.23480344 2.35894173 1.49875080
# 28 29
# 3.28294866 2.13958053
#
# $f2
# 0 1 2 3
# -0.71670285 -0.32429502 -1.30150753 -2.99409568
# 4 5 6 7
# -1.02914628 -0.60182497 -0.30081185 -1.90983043
# 8 9 10 11
# -1.95719698 -2.46953986 0.00000000 -1.58053084
# 12 13 14 15
# -0.35922732 -3.00743412 -1.36364968 -0.99794680
# 16 17 18 19
# -0.95586933 -0.98298529 -1.56837428 0.08168792
# 20 21 22 23
# -1.95953248 -1.50922225 -0.77759901 -1.35969409
# 24 25 26 27
# -1.31521249 0.34309863 -0.83094012 -2.33659224
# 28 29
# -0.60359771 -3.25957044

For `f2`, one level is set exactly to zero (the reference), while no level of `f1` is. Do you have a good explanation for this? 🤔 I don't think it is a bug though, as the predictions from feols and from a plain lm fit with both factors agree:

sort(predict(fit1))[1:3]
# [1] -5.114820 -5.084153 -4.863277
sort(predict(lm(Y ~ X1 + as.factor(f1) + as.factor(f2), data = data)))[1:3]
# -5.114820 -5.084153 -4.863277
huh, TIL. So I guess FEs are implicitly pooled before one-hot-encoding.
Ah, I found the relevant section in the fixest docs:
So really one reference level per fixed effect, and none for the first one. The default seems to be to assume "irregular" fixed effects? This is also what pyfixest implements, I think / hope 😄
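To make the identification point concrete, here is a hedged, synthetic sketch (plain numpy, not pyfixest internals) of why the normalisation does not matter for predictions: with two fixed effects, only the sum of the level effects is identified, so shifting a constant from one FE dimension to the other leaves every fitted value unchanged.

```python
import numpy as np

# Synthetic illustration: level effects are identified only up to a constant
# shift between the two FE dimensions, so predictions are invariant to the
# choice of reference-level normalisation.
rng = np.random.default_rng(0)
n, n_f1, n_f2 = 1_000, 30, 30
f1 = rng.integers(0, n_f1, n)
f2 = rng.integers(0, n_f2, n)
X1 = rng.normal(size=n)
alpha = rng.normal(size=n_f1)  # f1 level effects
gamma = rng.normal(size=n_f2)  # f2 level effects
beta = 0.5

yhat = beta * X1 + alpha[f1] + gamma[f2]

# Move a constant from the f2 effects to the f1 effects: fitted values unchanged.
c = 3.14
yhat_shifted = beta * X1 + (alpha + c)[f1] + (gamma - c)[f2]
assert np.allclose(yhat, yhat_shifted)
```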
makes sense. Anyway, I think this is ready to merge; you're welcome to do some more testing; the lookups turned out to be surprisingly simple.
Yep, I'll take one closer look tomorrow & will merge then =)
Just noticed that you implemented a […]. A leaner implementation would store the reference category in the […].
addressing an example from @b-knight on discord: the current `predict` function is quite inefficient in the presence of many FEs. I did a quick and dirty numpy implementation (currently implemented as `predict2`: didn't want to override the default predict method in case it upsets some unit tests that I don't know about - @s3alfisc please verify) that yields ~50x speedup and (afaict) no numerical difference. Notebook
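For context, a hedged usage sketch of the code path under discussion (toy data; assumes the standard pyfixest formula API and that, after this PR, `predict` uses the vectorized numpy lookup for the fixed-effect terms):

```python
import numpy as np
import pandas as pd
import pyfixest as pf

# Toy data with two fixed effects; column names are illustrative only.
rng = np.random.default_rng(123)
n = 10_000
data = pd.DataFrame({
    "X1": rng.normal(size=n),
    "f1": rng.integers(0, 30, n),
    "f2": rng.integers(0, 30, n),
})
data["Y"] = 1.0 * data["X1"] + rng.normal(size=n)

fit = pf.feols("Y ~ X1 | f1 + f2", data=data)
yhat = fit.predict(newdata=data)  # fast numpy-based prediction path
print(yhat[:3])
```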