
add pure numpy predict function that speeds up ~40x #553

Merged: 5 commits, Jul 17, 2024

Conversation

@apoorvalal (Member) commented Jul 15, 2024

Addressing an example from @b-knight on Discord: the current predict function is quite inefficient in the presence of many FEs.

I did a quick and dirty numpy implementation (currently implemented as predict2: I didn't want to override the default predict method in case it upsets some unit tests that I don't know about - @s3alfisc please verify) that yields a ~50x speedup and (afaict) no numerical difference.
Notebook
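The core of the speedup is replacing per-row pandas lookups with vectorized numpy indexing. A minimal sketch of the idea (hypothetical function and variable names, not the actual pyfixest implementation):

```python
import numpy as np

def predict_np(X, beta, fe_codes, fe_effects):
    """Vectorized prediction: X @ beta plus the sum of fixed-effect values.

    X          : (n, k) covariate matrix
    beta       : (k,) coefficient vector
    fe_codes   : (n, q) integer level codes, one column per fixed effect
    fe_effects : list of q arrays, fe_effects[j][l] = effect of level l
    """
    yhat = X @ beta
    for j, effects in enumerate(fe_effects):
        # fancy indexing replaces a per-row Python/pandas lookup
        yhat += effects[fe_codes[:, j]]
    return yhat

# toy example: one covariate, one fixed effect with two levels
X = np.array([[1.0], [2.0], [3.0]])
beta = np.array([2.0])
fe_codes = np.array([[0], [1], [0]])     # each row's level code
fe_effects = [np.array([10.0, 20.0])]    # estimated effect per level
print(predict_np(X, beta, fe_codes, fe_effects))  # [12. 24. 16.]
```

The whole prediction is two BLAS/indexing operations per fixed effect, which is where the order-of-magnitude speedup over row-wise lookups comes from.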


codecov bot commented Jul 15, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Files Coverage Δ
pyfixest/estimation/feols_.py 90.53% <100.00%> (ø)

... and 28 files with indirect coverage changes

@apoorvalal force-pushed the speed_up_predict_fe branch from adee239 to e30b75e on July 15, 2024 at 04:59
@apoorvalal (Author)

codecov hates me and the feeling is mutual

@s3alfisc (Member)

Sometimes, when no one is watching closely, I dare to merge without codecov's permission 🙈😀

- rename `predict2` to `predict`
- delete original `predict`
@s3alfisc (Member)

Hi @apoorvalal, I have renamed predict2 to predict and deleted the original predict. This should help with codecov and trigger all unit tests =)

@s3alfisc mentioned this pull request on Jul 15, 2024
@s3alfisc (Member)

Looks like something is going wrong with NaN values 🤔

FAILED tests/test_predict_resid_fixef.py::test_predict_nas - AssertionError:
Not equal to tolerance rtol=1e-07, atol=0

x and y nan location mismatch:
x: array([ 1.807314e+00, nan, -2.170578e+00, 1.844755e+00,
-1.710621e-01, 4.697018e-01, -7.419144e-01, -1.526513e+00,
1.356631e+00, -2.075729e+00, -1.000085e+00, 3.599896e+00,...
y: array([ 1.807314e+00, nan, nan, 1.844755e+00,

@apoorvalal (Author)

Fixed that, but noticed something else odd in the process (this is a remaining unit-test failure). The formula produces the following fixed-effect dict. I'm unsure why all distinct values of f1 have a fixed effect (as evidenced by there being no difference between the list of unique values and the keys of the FE dict for f1); I would have expected something like f2, where the first value (0.0) is omitted.

[screenshot of the formula and the resulting fixed-effect dict]

Am I missing something? Do you pool all strata of FEs and omit one (as opposed to omitting one level of each fixed effect), which would explain the difference between f1 and f2 here?

@@ -1487,7 +1487,7 @@ def predict(self, newdata: Optional[DataFrameType] = None) -> np.ndarray:
         fixef_dicts = {}
         for f in fvals:
             fdict = self._fixef_dict[f"C({f})"]
-            omitted_cat = set(self._data[f].unique().astype(str)) - set(
+            omitted_cat = set(self._data[f].unique().astype(str).tolist()) - set(
@apoorvalal (Author)

detects the omitted category correctly now, so unit tests all pass.

-        fixef_mat[:, i] = vectorized_map(fe_levels, mapping).astype(float)
+        unique_levels, inverse = np.unique(df_fe_values[:, i], return_inverse=True)
+        mapping = np.array([subdict.get(level, np.nan) for level in unique_levels])
+        fixef_mat[:, i] = mapping[inverse]
@apoorvalal (Author)

repeats FE values based on inverse vector
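As a standalone illustration of the `np.unique(..., return_inverse=True)` pattern used here (toy data, assuming a plain `dict` mapping level to effect, with missing levels filled as NaN):

```python
import numpy as np

levels = np.array(["a", "b", "a", "c", "b"])   # observed fixed-effect levels
subdict = {"a": 1.5, "b": -0.5}                # "c" absent, e.g. a dropped level

unique_levels, inverse = np.unique(levels, return_inverse=True)
# one dict lookup per *unique* level, NaN where the level has no estimate
mapping = np.array([subdict.get(lvl, np.nan) for lvl in unique_levels])
effects = mapping[inverse]   # inverse repeats each unique value per observation
print(effects)  # [ 1.5 -0.5  1.5  nan -0.5]
```

The Python-level work scales with the number of distinct levels rather than the number of rows, which is the point of the change.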

@s3alfisc (Member) commented Jul 16, 2024

Re your comment on the fixed-effects levels - I was also assuming that one reference level would be dropped per fixed effect, but this is not what r-fixest implements: only a single reference level is dropped across all the fixed effects:

library(fixest)
library(reticulate)
data = py$data

fit = feols(
  Y ~ X1 | f1 + f2,
  data = data
)

fit

fixef(fit)
# $f1
#           0           1           2           3 
#  1.85823546  4.15857700  0.01392762  1.26009884 
#           4           5           6           7 
# -0.01674004  3.31246464  0.91274113  2.30340233 
#           8           9          10          11 
#  1.23155227  1.70117551  2.27371739  2.96642848 
#          12          13          14          15 
#  2.28065324  3.53616576  1.52956294  2.92011824 
#          16          17          18          19 
#  3.86997352  1.51750366  3.19879406  2.35433006 
#          20          21          22          23 
#  4.46921954  0.09201573  0.73233885  0.96252599 
#          24          25          26          27 
#  0.60007651  0.23480344  2.35894173  1.49875080 
#          28          29 
#  3.28294866  2.13958053 
# 
# $f2
#           0           1           2           3 
# -0.71670285 -0.32429502 -1.30150753 -2.99409568 
#           4           5           6           7 
# -1.02914628 -0.60182497 -0.30081185 -1.90983043 
#           8           9          10          11 
# -1.95719698 -2.46953986  0.00000000 -1.58053084 
#          12          13          14          15 
# -0.35922732 -3.00743412 -1.36364968 -0.99794680 
#          16          17          18          19 
# -0.95586933 -0.98298529 -1.56837428  0.08168792 
#          20          21          22          23 
# -1.95953248 -1.50922225 -0.77759901 -1.35969409 
#          24          25          26          27 
# -1.31521249  0.34309863 -0.83094012 -2.33659224 
#          28          29 
# -0.60359771 -3.25957044 

For f2, level 10 is dropped / set to zero, but we indeed have estimated values for all levels of f1.

Do you have a good explanation for this? 🤔

I don't think it is a bug though, as r-fixest matches lm:

sort(predict(fit))[1:3]
# [1] -5.114820 -5.084153 -4.863277
sort(predict(lm(Y ~ X1 + as.factor(f1) + as.factor(f2), data = data)))[1:3]
# -5.114820 -5.084153 -4.863277 

@apoorvalal (Author)

Huh, TIL. So I guess FEs are implicitly pooled before one-hot encoding.
cc-ing @lrberge: any downsides to this behaviour?

@s3alfisc (Member)

Ah, I found the relevant section in the fixest docs:

If there is more than 1 fixed-effect, then the attribute “references” is created. This is a vector of length the number of fixed-effects, each element contains the number of coefficients set as references. By construction, the elements of the first fixed-effect dimension are never set as references. In the presence of regular fixed-effects, there should be Q-1 references (with Q the number of fixed-effects).

If the fixed-effect coefficients are not regular, then several reference points need to be set: this means that the fixed-effects coefficients cannot be directly interpreted. If this is the case, then a warning is raised.

So really one reference level per fixed effect, and none for the first one. The default seems to be to assume "irregular" fixed effects? This is also what pyfixest implements, I think / hope 😄
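The Q-1 rule has a linear-algebra reading: every fixed effect's full dummy block has columns that sum to the all-ones vector, so with Q blocks there are Q-1 exact collinearities, and Q-1 reference levels must be dropped. A toy numpy check (my own illustration, not code from the PR):

```python
import numpy as np

# toy panel: 20 observations, f1 has 4 levels, f2 has 3 levels
f1 = np.tile(np.arange(4), 5)
f2 = np.tile(np.array([0, 1, 2, 0, 1]), 4)

D1 = np.eye(4)[f1]            # full one-hot block for f1 (no level dropped)
D2 = np.eye(3)[f2]            # full one-hot block for f2
D = np.hstack([D1, D2])       # 4 + 3 = 7 columns

# each block's columns sum to the all-ones vector, so the two blocks
# share exactly one linear dependency: rank is 7 - 1 = 6
print(np.linalg.matrix_rank(D))  # 6 -> Q - 1 = 1 reference level needed
```

This matches the r-fixest output above: all levels of f1 are reported, and exactly one level of f2 is pinned to zero. (This one-dependency count assumes the fixed effects are "regular", i.e. the level graph is connected; otherwise more references are needed, as the docs warn.)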

@apoorvalal (Author)

Makes sense. Anyway, I think this is ready to merge; you're welcome to do some more testing - the lookups ended up being surprisingly simple.

@s3alfisc (Member)

Yep, I'll take one closer look tomorrow & will merge then =)

@s3alfisc merged commit 506cd51 into py-econometrics:master on Jul 17, 2024
7 checks passed
@apoorvalal (Author)

Just noticed that you implemented a lean argument that (correctly, IMO) doesn't attach the data to the model object; this (and the previous) implementation of predict would fail in that case, because it looks up the reference category via the ._data attribute before filling in the fixed-effects values matrix.

A leaner implementation would store the reference category in the .fixef() method, so that the old data doesn't need to be looked up.
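A sketch of that idea (hypothetical class and method names, assuming effects come back as plain dicts): cache the omitted reference levels once when the fixed effects are recovered, so prediction never touches the raw data again:

```python
class LeanFixef:
    """Hypothetical sketch: cache the omitted reference levels when the
    fixed effects are recovered, so prediction never needs the raw data
    (e.g. when a lean option drops it from the model object)."""

    def fit_fixef(self, observed_levels, estimated_effects):
        # observed_levels:   {fe_name: set of all levels seen in the data}
        # estimated_effects: {fe_name: {level: estimated effect}}
        self.estimated_effects = estimated_effects
        self.omitted = {
            f: observed_levels[f] - set(estimated_effects[f])
            for f in observed_levels
        }

    def effect(self, f, level):
        # omitted reference levels contribute zero; no lookup into raw data
        if level in self.omitted[f]:
            return 0.0
        return self.estimated_effects[f][level]

m = LeanFixef()
m.fit_fixef({"f2": {"0", "1", "2"}}, {"f2": {"1": 0.5, "2": -0.3}})
print(m.effect("f2", "0"))  # 0.0: the cached omitted reference level
```

The set difference is computed once at fit time, which is exactly the piece that currently requires ._data at predict time.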

@apoorvalal deleted the speed_up_predict_fe branch on July 28, 2024 at 21:19