Fix join order for TPCH Q17 & Q18 by improving FilterExec statistics #8126
Conversation
@berkaysynnada Could you take a look and make sure this looks sensible?
```diff
-    return Ok(Statistics::new_unknown(&schema));
+    // assume worst case, that the filter is highly selective and
+    // returns all the rows from its input
+    return Ok(input_stats.clone().into_inexact());
```
I wonder if we can make a slightly different assumption that is a better metric, e.g. each filter returning 50% or 20% of input rows?
We could add a configuration option to control the default selectivity. I'll take a look.
This is an interesting idea. Since statistics will be `Inexact`, it should never result in an incorrect output, but may improve average-case complexity.
It looks like making this configurable will be a larger change. I filed #8133 and linked to it from the comment here.
The talk "Join Order Optimization with (almost) no Statistics" is focused on full join reordering rather than just choosing the build side of a join, but it discusses selectivity estimates and is very relevant to this discussion. They found that a selectivity of 0.2 worked well with TPC-H.
I pushed a change to use 0.2 as the default, and now Q18 has an improved join order as well. I updated the results in the PR description.
Default selectivities / cost estimates work reasonably well for TPC-H queries, where the data is relatively uniformly distributed.
However, in my experience they tend to cause problems in general when the data is skewed or has correlations between columns.
Hopefully we'll be able to keep the number of hard-coded constants / assumptions low in DataFusion (so there are fewer things for the optimizer to get wrong :) )
Co-authored-by: Daniël Heres <[email protected]>
```rust
let selectivity = 0.2_f32;
let mut stats = input_stats.clone().into_inexact();
if let Precision::Inexact(n) = stats.num_rows {
    stats.num_rows = Precision::Inexact((selectivity * n as f32) as usize);
}
```
I think we can/should do the same for the `total_byte_size` value.
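To illustrate that suggestion, here is a minimal sketch of applying the same selectivity factor to both `num_rows` and `total_byte_size`. The `Precision` and `Statistics` types below are simplified stand-ins that only mirror the shape of DataFusion's types; the helper names (`scale`, `filter_statistics`) are hypothetical and not part of the crate's API.

```rust
// Simplified stand-ins for DataFusion's `Precision` / `Statistics` types;
// illustrative only, not the crate's actual definitions.
#[derive(Clone, Debug, PartialEq)]
enum Precision {
    Exact(usize),
    Inexact(usize),
    Absent,
}

impl Precision {
    // Demote an exact value to inexact, like `into_inexact` in the PR.
    fn into_inexact(self) -> Precision {
        match self {
            Precision::Exact(n) => Precision::Inexact(n),
            other => other,
        }
    }

    // Scale an inexact estimate by a selectivity factor (hypothetical helper).
    fn scale(self, selectivity: f32) -> Precision {
        match self {
            Precision::Inexact(n) => {
                Precision::Inexact((selectivity * n as f32) as usize)
            }
            other => other,
        }
    }
}

struct Statistics {
    num_rows: Precision,
    total_byte_size: Precision,
}

// Apply the default 20% selectivity to both the row count and the byte
// size, following the review suggestion above.
fn filter_statistics(input: Statistics, selectivity: f32) -> Statistics {
    Statistics {
        num_rows: input.num_rows.into_inexact().scale(selectivity),
        total_byte_size: input.total_byte_size.into_inexact().scale(selectivity),
    }
}
```

Scaling `total_byte_size` by the same factor assumes average row width is unchanged by the filter, which is exact for fixed-width rows and a reasonable heuristic otherwise.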
Thanks @andygrove, impressive results with a small change!
Once the type
cc @NGA-TRAN
I am surprised no test needed changing after this PR. How will we ensure future statistics changes don't mess up the TPCH plans 🤔
BTW, when statistics support range estimates, we will not need to hardcode any assumption like 20% in how the operator reports its statistics. In this specific example, the filter would say it could be anything between 0 and the incoming number of rows, which is accurate. The logic that consumes these stats is then free to make any heuristic assumptions it wants on top of this information.
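A minimal sketch of that range-based idea follows. The `RowRange` type and both function names are hypothetical, illustrating the division of labor described above (the operator reports an honest interval; the consumer applies its own point heuristic); this is not DataFusion's actual API.

```rust
// Hypothetical interval estimate for an operator's output row count.
#[derive(Clone, Copy, Debug, PartialEq)]
struct RowRange {
    min: usize,
    max: usize,
}

// A filter with no column statistics can only report "somewhere between
// zero rows and all of the input rows" -- accurate by construction.
fn filter_row_range(input_rows: usize) -> RowRange {
    RowRange { min: 0, max: input_rows }
}

// A consumer (e.g. build-side selection for a join) collapses the
// interval to a point estimate with its own heuristic, here 20% of the
// upper bound, clamped to stay inside the reported range.
fn point_estimate(range: RowRange, selectivity: f64) -> usize {
    let est = (range.max as f64 * selectivity) as usize;
    est.clamp(range.min, range.max)
}
```

The key property is that the operator's report stays correct regardless of the data, while the 20% guess lives entirely in the consumer and can be tuned or replaced without touching the operator.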
I was surprised as well and checked why: it is because the tests use CSV rather than Parquet.
Maybe we can create some sort of "statistics only" test that uses the statistics from the Parquet files, but doesn't need the actual data, and verify the plan that way.
Which issue does this PR close?
Closes #7949
Closes #7950
Rationale for this change
Improve benchmark results.
What changes are included in this PR?
Q17 performs a join where the left input is a `ParquetExec` reading `lineitem` and the right input is a `FilterExec` wrapping a `ParquetExec` that reads `part`.

Both `ParquetExec`s provide `num_rows`, but the `FilterExec` around the `part` input discards all statistics from the underlying `ParquetExec`. This means that the existing optimizations for choosing the build side of the join cannot determine which input is smaller, due to the missing statistics.

This PR changes the behavior of `FilterExec::statistics` in the case where we cannot determine accurate statistics. Instead of returning `num_rows` as `Precision::Absent`, we now assume that the filter selects 20% of rows from its input. There is a follow-up issue #8133 to make this configurable.

Benchmark Results: TPCH @ SF10
I see an overall improvement from 128 seconds to 104 seconds locally, so around 18% faster.
Are these changes tested?
Existing tests.
Are there any user-facing changes?
No