TPCH, Query 18 and 17 very slow #5646

I was running TPCH_SF5 just for fun and noticed that queries 17 and 18 are very slow.

Full reproducible example: …

Another issue: when I increase the scale factor to 10, I start getting OOM errors.
https://colab.research.google.com/drive/1WJ2ICxJyAYClkDx8guGX-TcOMnf8SBxr#scrollTo=z494Cl6XKUVX

DataFusion 26 gives the following result against DuckDB: …
I will take a look.
Working on it now. I am not sure whether it is a regression or whether those two queries have always been slow.
The bottleneck of q17 should be aggregation.
A profile run using flamegraph on my machine shows: …
FYI @viirya
One reason for so many …
The most expensive part is the line …; I guess we should move that evaluation (or other parts of …).
Another observation I have is that the plan does some unnecessary casting: …
@Dandandan, yes, I also noticed this problem; it's related to …. It isn't an easy problem. BTW, #5831 is also related to q17: it moves the cast from expression evaluation into the subplan.
@jackwener Nice, thank you
@jackwener this is all resolved now, right?
I am still working on it.
Query 18 is also considerably faster in the next DataFusion version, because of join-related improvements (data-structure improvements and vectorized collision checks).
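For illustration, here is a minimal sketch of the vectorized collision-check idea (hypothetical code, not DataFusion's actual implementation): instead of running the expensive join-key comparison for every hash-bucket candidate, first compare the cheap 64-bit hashes in a tight loop, and only verify the real keys for rows whose hashes matched.

```rust
/// Hypothetical sketch of a vectorized hash-collision check.
/// `probe_hashes[i]` is the hash of probe row `i`; `candidates[i]` is the
/// build-side row index the hash table paired it with.
fn vectorized_collision_check(
    probe_hashes: &[u64],
    candidates: &[usize],
    build_hashes: &[u64],
    keys_equal: impl Fn(usize, usize) -> bool, // real (probe, build) key equality
) -> Vec<(usize, usize)> {
    // Pass 1: branch-light loop over primitive hashes; far cheaper than
    // comparing possibly multi-column, variable-length join keys per row.
    let mut survivors = Vec::new();
    for (p, (&h, &b)) in probe_hashes.iter().zip(candidates.iter()).enumerate() {
        if h == build_hashes[b] {
            survivors.push((p, b));
        }
    }
    // Pass 2: full key comparison only where the hashes already agreed.
    survivors.retain(|&(p, b)| keys_equal(p, b));
    survivors
}
```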
One suggestion that will yield some (smaller) performance improvement for query 18 (and most other queries): #6768 |
There is quite a lot of recent work and several proposals to make grouping significantly faster for these queries. See #4973
I expect Q17 to go about 2x faster and use much less memory when we merge our most recent work -- see #6800 (comment) for details |
Thanks @djouallah - the new …
BTW I think the reason DF's memory usage increases with the number of cores is that the first partial aggregate phase uses RoundRobin repartitioning (and thus each hash table has an entry for all the groups). To avoid this, we would need to hash-repartition the input based on the group keys, so that different partitions see disjoint subsets of the group keys.
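A minimal sketch of the difference (hypothetical code, not DataFusion's planner): with round-robin, rows carrying the same group key land in every partition, so every per-core hash table eventually holds every group; hashing on the group key routes all rows of a group to exactly one partition.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Round-robin: the i-th row goes to partition i % n regardless of its
/// group key, so every partition can end up holding every group.
fn round_robin_partition(row_index: usize, num_partitions: usize) -> usize {
    row_index % num_partitions
}

/// Hash repartitioning: all rows with the same group key map to the same
/// partition, so each per-partition hash table only holds its own subset
/// of the groups (total memory is roughly one copy of the group table).
fn hash_partition<K: Hash>(group_key: &K, num_partitions: usize) -> usize {
    let mut hasher = DefaultHasher::new();
    group_key.hash(&mut hasher);
    (hasher.finish() as usize) % num_partitions
}
```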
@alamb the graph shows the overall duration to finish tpch_sf100 based on the number of cores; DataFusion is faster than Spark even when using only 1 VM ;)
If you get a chance to test with the latest DataFusion (this will be in 28.0.0, ETA probably next week), I expect performance for high-cardinality grouping to be much better due to #6904
I wrote up an issue describing this here: #6937 |
Using version 28, query 8 starts getting errors: https://colab.research.google.com/drive/1KzofqAWJxVTboNcywGxSbIgLNatkIsM2
Nice, getting close to …
@Dandandan especially when using their native format, Hyper literally doesn't seem to care about RAM; I use it with the free Colab and it finished tpch_sf110 just fine! What do you think they are doing differently?
Hard to say in general, but they do some optimizations we don't do or do better planning (e.g. for join selection). |
It is also probably worth pointing out that Hyper is largely a research system, and TPCH is one of the standard benchmark sets used for research systems. Thus I suspect a lot of effort has gone into making the patterns that appear in TPCH very fast (e.g. left-deep join trees with very selective predicates). That is not to say the optimizations are entirely TPCH-specific, but it wouldn't surprise me if, in general-purpose use, DataFusion performs much closer (or better).
Hyper-db is developed by Tableau now, so it has probably gained some improvements over the years compared to the original research system.
It seems there is a regression with query 18: it used to work fine with tpch100 using 124 GB of RAM, but now the notebook crashes when using DF31. Edit: never mind, it was a temporary glitch.
Given the long history of this issue, I think it is hard to understand what, if anything, it is tracking. I suggest we close it and file another issue to continue discussing additional performance improvements |
BTW I think the core problem here is that DataFusion's parallel hash grouping builds the entire hash table for each input partition -- thus it requires memory proportional to the number of cores. This is tracked more in #6937 |
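To make the scaling concrete, here is a back-of-the-envelope estimate (illustrative numbers only, not measurements):

```rust
/// Illustrative estimate: if every core's partial aggregate builds a hash
/// table over all groups, memory grows linearly with the number of cores.
fn main() {
    let num_groups: u64 = 100_000_000; // e.g. a high-cardinality TPCH grouping
    let bytes_per_entry: u64 = 64;     // assumed key + accumulator + table overhead
    for cores in [1u64, 8, 32] {
        let all_groups_per_core = cores * num_groups * bytes_per_entry;
        let hash_partitioned = num_groups * bytes_per_entry; // one copy total
        println!(
            "{cores:>2} cores: all-groups-per-core ~ {} GiB, hash-partitioned ~ {} GiB",
            all_groups_per_core >> 30,
            hash_partitioned >> 30
        );
    }
}
```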
I spent some time analyzing why …
The issue seems to be (at least partially) related to …
Perhaps the filter can simply switch to …. Another thing I have seen in the past is to heuristically pick a constant selectivity (assume it filters out half the rows). However, I think this leads to non-robust plans (sometimes the plans are good, sometimes they are bad, and it is hard to predict which case you will hit).
I filed #8078 with a proposal for a more precise way to represent inexact statistics.
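For flavor, here is a minimal sketch of what such a representation could look like (hypothetical types; see #8078 for the actual proposal): every statistic carries an explicit exactness marker, so a filter can report an estimated row count without pretending it is exact.

```rust
/// Hypothetical sketch of an "inexact statistics" representation; see #8078
/// for the actual proposal.
#[derive(Debug, Clone, Copy)]
enum Precision<T> {
    Exact(T),   // known precisely (e.g. row count from file metadata)
    Inexact(T), // an estimate; downstream optimizers must not rely on it
    Absent,     // unknown
}

impl Precision<usize> {
    /// Applying a filter with an assumed selectivity turns even an exact
    /// input row count into an inexact estimate.
    fn after_filter(self, selectivity: f64) -> Self {
        match self {
            Precision::Exact(n) | Precision::Inexact(n) => {
                Precision::Inexact((n as f64 * selectivity) as usize)
            }
            Precision::Absent => Precision::Absent,
        }
    }
}

fn main() {
    let input_rows = Precision::Exact(1_000_000usize);
    // The "filters half the rows" heuristic is now visibly an estimate.
    println!("{:?}", input_rows.after_filter(0.5)); // Inexact(500000)
}
```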
Thank you very much for your work. I am happy with DataFusion 33 performance; it now finishes TPCH_SF100 using 64 GB of RAM in Fabric.