Performance issue #105

Make42 · 2021-04-17T13:22:27Z

I have a pandas DataFrame that contains experiment results. The experiment setups are described by the groupcols columns (string, float and integer columns) and the evaluation by the eval_val column (a float column). I want to find the best result for each experiment type, that is, across all experiments with the same setup. For that I wrote three pipelines that have the same final DataFrame as a result:

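from time import time

import dfply as dp
from dfply import X

# res_long is the DataFrame of experiment results;
# groupcols holds the names of the setup columns.
groupcols: list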

t0 = time()
res_best1 = (res_long >>
             dp.group_by(*groupcols) >>
             dp.filter_by(X.eval_val == dp.colmax(X.eval_val)) >>
             dp.ungroup() >>
             dp.distinct()). \
    reset_index(drop=True)
print(time() - t0)

t0 = time()
res_best2 = (res_long >>
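             # Sort descending so head(1) picks the largest eval_val per group,
             # matching the other two pipelines.
             dp.arrange(X.eval_val, ascending=False) >>
             dp.group_by(*groupcols) >>
             dp.head(1) >>
             dp.ungroup()). \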
    reset_index(drop=True)
print(time() - t0)

t0 = time()
res_best3 = res_long.sort_values('eval_val', ascending=False).groupby(groupcols).first().reset_index()
print(time() - t0)

The first pipeline takes about 54.5 seconds to run and the second about 35.1 seconds, but the third pipeline, which uses plain pandas, takes only 0.073 seconds, and that is what I want to report. So pandas is a lot faster than dfply here. Maybe this is a bug?
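For comparison, here is a minimal pandas-only sketch of the same "best row per setup" selection. It runs on synthetic data, since res_long itself is not shown here: the column names model, lr and depth and the data size are assumptions made up for illustration, and only eval_val comes from the description above. The first variant keeps ties, similar in spirit to the filter_by/colmax pipeline. The second returns exactly one row per setup.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for res_long. The setup columns (model, lr, depth)
# are hypothetical; only eval_val is named in the original report.
rng = np.random.default_rng(0)
n = 10_000
res_long = pd.DataFrame({
    "model": rng.choice(["a", "b", "c"], size=n),
    "lr": rng.choice([0.01, 0.1, 1.0], size=n),
    "depth": rng.integers(1, 5, size=n),
    "eval_val": rng.random(n),
})
groupcols = ["model", "lr", "depth"]

# Variant A: broadcast each group's maximum back onto its rows, then keep
# the rows that reach it. Ties within a group are all kept.
group_max = res_long.groupby(groupcols)["eval_val"].transform("max")
best_with_ties = (
    res_long[res_long["eval_val"] == group_max]
    .drop_duplicates()
    .reset_index(drop=True)
)

# Variant B: look up the index label of the maximum in each group,
# which yields exactly one row per setup.
best_idx = res_long.groupby(groupcols)["eval_val"].idxmax()
best_one = res_long.loc[best_idx].reset_index(drop=True)

print(len(best_with_ties), len(best_one))
```

Both variants stay inside vectorized groupby operations, so I would expect them to run in a time range similar to the third pipeline above, but I have not timed them on the real data.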
