TPCH, Query 18 and 17 very slow #5646

I was running TPCH_SF5 just for fun and noticed that queries 17 and 18 are very slow.

Full reproducible example: …

Another issue: when I increase the scale factor to 10, I start getting OOM errors.
https://colab.research.google.com/drive/1WJ2ICxJyAYClkDx8guGX-TcOMnf8SBxr#scrollTo=z494Cl6XKUVX

DataFusion 26 gives the following result against DuckDB: …
I will take a look.
Working on it now. I am not sure whether it is a regression or whether those two queries have always been slow.
The bottleneck of q17 should be aggregation.
A profile run using flamegraph on my machine shows: …
FYI @viirya
One reason for so many …
The most expensive part is the line …; I guess we should move that evaluation (or other parts of …).
Another observation I have is that the plan does some unnecessary casting: …
@Dandandan, yes, I also noticed this problem; it's related to …. It isn't an easy problem. BTW, #5831 is also related to q17: it moves the cast from expression evaluation into the subplan.
@jackwener Nice, thank you
@jackwener this is all resolved now, right?
I am still working on it.
Query 18 is also considerably faster in the next DataFusion version, because of join-related improvements (data-structure improvements and vectorized collision checks).
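For illustration, here is a minimal sketch of the vectorized collision-check idea (hypothetical code, not DataFusion's actual implementation): instead of running the expensive join-key comparison for every hash-bucket candidate, first compare the cheap 64-bit hashes in a tight loop, and only verify the real keys for rows whose hashes matched.

```rust
/// Hypothetical sketch of a vectorized hash-collision check.
/// `probe_hashes[i]` is the hash of probe row `i`; `candidates[i]` is the
/// build-side row index the hash table paired it with.
fn vectorized_collision_check(
    probe_hashes: &[u64],
    candidates: &[usize],
    build_hashes: &[u64],
    keys_equal: impl Fn(usize, usize) -> bool, // real (probe, build) key equality
) -> Vec<(usize, usize)> {
    // Pass 1: branch-light loop over primitive hashes; far cheaper than
    // comparing possibly multi-column, variable-length join keys per row.
    let mut survivors = Vec::new();
    for (p, (&h, &b)) in probe_hashes.iter().zip(candidates.iter()).enumerate() {
        if h == build_hashes[b] {
            survivors.push((p, b));
        }
    }
    // Pass 2: full key comparison only where the hashes already agreed.
    survivors.retain(|&(p, b)| keys_equal(p, b));
    survivors
}
```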
One suggestion that will yield some (smaller) performance improvement for query 18 (and most other queries): #6768 |
There is quite a lot of recent work and several proposals to make grouping significantly faster for these queries. See #4973
I expect Q17 to go about 2x faster and use much less memory when we merge our most recent work -- see #6800 (comment) for details |
Thanks @djouallah - the new …
BTW I think the reason DF's memory usage increases with the number of cores is that the first partial aggregate phase uses RoundRobin repartitioning (and thus each hash table has an entry for all the groups). To avoid this, we would need to hash-repartition the input based on the group keys, so that different partitions see disjoint subsets of the group keys.
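A minimal sketch of the difference (hypothetical code, not DataFusion's planner): with round-robin, rows carrying the same group key land in every partition, so every per-core hash table eventually holds every group; hashing on the group key routes all rows of a group to exactly one partition.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Round-robin: the i-th row goes to partition i % n regardless of its
/// group key, so every partition can end up holding every group.
fn round_robin_partition(row_index: usize, num_partitions: usize) -> usize {
    row_index % num_partitions
}

/// Hash repartitioning: all rows with the same group key map to the same
/// partition, so each per-partition hash table only holds its own subset
/// of the groups (total memory is roughly one copy of the group table).
fn hash_partition<K: Hash>(group_key: &K, num_partitions: usize) -> usize {
    let mut hasher = DefaultHasher::new();
    group_key.hash(&mut hasher);
    (hasher.finish() as usize) % num_partitions
}
```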
@alamb the graph shows the overall duration to finish tpch_sf100 based on the number of cores; DataFusion is faster than Spark even when using only 1 VM ;)
If you get a chance to test with the latest DataFusion (this will be in 28.0.0, ETA probably next week), I expect performance for high-cardinality grouping to be much better due to #6904
I wrote up an issue describing this here: #6937 |
Using version 28, query 8 starts getting errors: https://colab.research.google.com/drive/1KzofqAWJxVTboNcywGxSbIgLNatkIsM2
Nice, getting close to …
@Dandandan especially when using their native format, Hyper literally doesn't seem to care about RAM; I use it with the free Colab and it finished tpch_sf110 just fine! What do you think they are doing differently?
Hard to say in general, but they do some optimizations we don't do or do better planning (e.g. for join selection). |
It is also probably worth pointing out that Hyper is largely a research system, and TPCH is one of the standard benchmark sets used for research systems. Thus I suspect a lot of effort has gone into making the patterns that appear in TPCH very fast (e.g. left-deep join trees with very selective predicates). That is not to say the optimizations are entirely TPCH-specific, but it wouldn't surprise me if, in general-purpose use, DataFusion performs much closer (or better).
Hyper-db is developed by Tableau now, so it has probably gained some improvements over the years compared to the original research system.
It seems there is a regression with query 18: it used to work fine with tpch100 using 124 GB of RAM, but now the notebook crashes when using DF31. Edit: never mind, it was a temporary glitch.
Given the long history of this issue, I think it is hard to understand what, if anything, it is tracking. I suggest we close it and file another issue to continue discussing additional performance improvements |
BTW I think the core problem here is that DataFusion's parallel hash grouping builds the entire hash table for each input partition -- thus it requires memory proportional to the number of cores. This is tracked more in #6937 |
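To make the scaling concrete, here is a back-of-the-envelope estimate (illustrative numbers only, not measurements):

```rust
/// Illustrative estimate: if every core's partial aggregate builds a hash
/// table over all groups, memory grows linearly with the number of cores.
fn main() {
    let num_groups: u64 = 100_000_000; // e.g. a high-cardinality TPCH grouping
    let bytes_per_entry: u64 = 64;     // assumed key + accumulator + table overhead
    for cores in [1u64, 8, 32] {
        let all_groups_per_core = cores * num_groups * bytes_per_entry;
        let hash_partitioned = num_groups * bytes_per_entry; // one copy total
        println!(
            "{cores:>2} cores: all-groups-per-core ~ {} GiB, hash-partitioned ~ {} GiB",
            all_groups_per_core >> 30,
            hash_partitioned >> 30
        );
    }
}
```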
I spent some time analyzing why …
The issue seems to be (at least partially) related to …
Perhaps the filter can simply switch to …. Another thing I have seen in the past is to heuristically pick a constant selectivity (assume it filters out half the rows). However, I think this leads to non-robust plans (sometimes the plans are good, sometimes they are bad, and it is hard to predict which case you will hit).
I filed #8078 with a proposal for a more precise way to represent inexact statistics.
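For flavor, here is a minimal sketch of what such a representation could look like (hypothetical types; see #8078 for the actual proposal): every statistic carries an explicit exactness marker, so a filter can report an estimated row count without pretending it is exact.

```rust
/// Hypothetical sketch of an "inexact statistics" representation; see #8078
/// for the actual proposal.
#[derive(Debug, Clone, Copy)]
enum Precision<T> {
    Exact(T),   // known precisely (e.g. row count from file metadata)
    Inexact(T), // an estimate; downstream optimizers must not rely on it
    Absent,     // unknown
}

impl Precision<usize> {
    /// Applying a filter with an assumed selectivity turns even an exact
    /// input row count into an inexact estimate.
    fn after_filter(self, selectivity: f64) -> Self {
        match self {
            Precision::Exact(n) | Precision::Inexact(n) => {
                Precision::Inexact((n as f64 * selectivity) as usize)
            }
            Precision::Absent => Precision::Absent,
        }
    }
}

fn main() {
    let input_rows = Precision::Exact(1_000_000usize);
    // The "filters half the rows" heuristic is now visibly an estimate.
    println!("{:?}", input_rows.after_filter(0.5)); // Inexact(500000)
}
```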
Thank you very much for your work. I am happy with DataFusion 33 performance; it now finishes TPCH_SF100 using 64 GB of RAM in Fabric.