
Refactor AnalysisContext and statistics() of FilterExec #6982

Merged · 51 commits merged into apache:main · Jul 20, 2023
Conversation

@berkaysynnada (Contributor) commented on Jul 16, 2023:

Which issue does this PR close?

Partially closes #5535.

Rationale for this change

The implementation of statistics() in FilterExec has been improved in two ways:

  1. It can now handle complex filter predicates. The previous design could only handle single-column expressions and simple binary expressions (such as a < 5 or 3 >= a).
  2. The previous implementation could not shrink the input column boundaries; it could only compute the selectivity and derive new row count and byte size statistics. Now, the newly computed interval boundaries are written into the column statistics.

This implementation is a step toward unifying the range/interval analysis implementations throughout the project. The Interval and cp_solver libraries introduced by this PR are well-documented, easy to use, and well-structured, which makes them a good fit for the calculations mentioned above.

What changes are included in this PR?

  1. statistics() now updates the newly calculated column intervals (a minimal illustrative sketch follows this list).
  2. There is no need to implement a separate analyze() function for every PhysicalExpr, since the cp_solver library can handle different kinds of PhysicalExprs while assigning intervals.
  3. Cardinality calculations are added to the interval library.
  4. The structure of AnalysisContext has changed. The selectivity value should not live inside ExprBoundaries, since this statistic is not tied to a single column; it is a measure over all rows of a table. Holding a field for the target column's results would also be outdated, since statistics may now be calculated for multiple columns, so keeping that field could be misleading. In the analyze() function, the results overwrite the input parameters because the old values are not used after the calculations.
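
To make the overall flow concrete, here is a minimal, self-contained sketch of the idea. The IntInterval, SimpleStats, and apply_filter_stats names are hypothetical and this is not DataFusion's actual AnalysisContext/ExprBoundaries API; it only illustrates taking selectivity as the ratio of interval cardinalities, scaling row count and byte size by it, and writing the shrunk bounds back into the column statistics.

```rust
/// A closed integer interval [lower, upper] standing in for a column's known range.
#[derive(Debug, Clone, Copy)]
struct IntInterval {
    lower: i64,
    upper: i64,
}

impl IntInterval {
    /// Number of integer points contained in the closed interval.
    fn cardinality(&self) -> u64 {
        (self.upper - self.lower + 1) as u64
    }
}

/// Table-level statistics kept alongside a single column's interval.
#[derive(Debug)]
struct SimpleStats {
    num_rows: usize,
    total_byte_size: usize,
    column_bounds: IntInterval,
}

/// Derive selectivity as the ratio of interval cardinalities, scale the
/// table-level numbers, and record the shrunk bounds in the column statistics.
fn apply_filter_stats(input: &SimpleStats, shrunk: IntInterval) -> SimpleStats {
    let selectivity =
        shrunk.cardinality() as f64 / input.column_bounds.cardinality() as f64;
    SimpleStats {
        num_rows: (input.num_rows as f64 * selectivity).ceil() as usize,
        total_byte_size: (input.total_byte_size as f64 * selectivity).ceil() as usize,
        column_bounds: shrunk,
    }
}

fn main() {
    let input = SimpleStats {
        num_rows: 1_000,
        total_byte_size: 8_000,
        column_bounds: IntInterval { lower: 0, upper: 99 },
    };
    // A predicate such as `a < 50` shrinks the bounds to [0, 49]:
    // selectivity = 50 / 100 = 0.5, so ~500 rows and ~4000 bytes remain.
    let output = apply_filter_stats(&input, IntInterval { lower: 0, upper: 49 });
    println!("{output:?}");
}
```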

Are these changes tested?

Yes, new tests are added to filter.rs.

Are there any user-facing changes?

berkaysynnada and others added 30 commits May 31, 2023 10:25
1) Analyze is removed from the methods of PhysicalExpr.
2) Interval arithmetic is applied to AnalysisContext values.
3) Intervals of input columns are updated now.
minor changes
@github-actions bot added the physical-expr (Physical Expressions) and core (Core DataFusion crate) labels on Jul 16, 2023

@ozankabak (Contributor):

I am excited about this. This PR utilizes the rigor and generality of the interval library and lays a solid foundation for building an ever more powerful analysis/statistics module for DataFusion. Looking forward to hearing community feedback from all those interested.

@alamb (Contributor) commented on Jul 17, 2023:

This looks awesome @berkaysynnada - thank you so much. I ran out of time today to review this but I have it on my list for tomorrow

@alamb (Contributor) left a review:

I went through this PR carefully and it looks (really) nice to me -- thank you @berkaysynnada

I left some minor comments on style and on making the code easier to work with, but those could definitely be done as a follow-on PR (or never).

cc @isidentical, who I believe worked on an earlier version of the analysis code, and @mingmwang / @Dandandan, who have recently expressed interest in working on Joins / Join Orders using better cardinality estimates.

@@ -451,6 +453,75 @@ impl Interval {
lower: IntervalBound::new(ScalarValue::Boolean(Some(true)), false),
upper: IntervalBound::new(ScalarValue::Boolean(Some(true)), false),
};

// Cardinality is the number of all points included by the interval, considering its bounds.

Contributor:
Something that might be worth considering in the long term, which @tustvold mentioned the other day, is to vectorize these calculations -- at the moment they are done in the context of a single expression, but eventually, if we want to use this logic to prune large numbers of files / etc. based on statistics, it may take too long.

No change is needed here, I am just planting a seed of an idea in case this was on your list too

Contributor Author (@berkaysynnada):

Yes, you are right. Speaking of this idea, I also want to mention that we have tried to replace the pruning logic in PruningPredicate with this interval library but realized that it disrupts the vectorized calculations. So I'm thinking about how we can use it without breaking the vectorized process.
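
As a rough, hypothetical sketch of the vectorization idea discussed above (the FileRange and prune_files names are made up, and this is not PruningPredicate's actual API): the same interval-overlap test can be applied in a single pass over the min/max statistics of many files, which keeps the per-file check cheap and easy to vectorize.

```rust
// Hypothetical sketch: test one predicate interval against many per-file ranges.
#[derive(Debug, Clone, Copy)]
struct FileRange {
    min: i64,
    max: i64,
}

/// True if the file's [min, max] range can intersect the closed predicate interval [lo, hi].
fn may_match(file: FileRange, lo: i64, hi: i64) -> bool {
    file.max >= lo && file.min <= hi
}

/// One pass over all file-level statistics; the per-file check is branch-light
/// and amenable to vectorization / parallelization.
fn prune_files(files: &[FileRange], lo: i64, hi: i64) -> Vec<usize> {
    files
        .iter()
        .enumerate()
        .filter(|(_, f)| may_match(**f, lo, hi))
        .map(|(i, _)| i)
        .collect()
}

fn main() {
    let files = vec![
        FileRange { min: 0, max: 9 },
        FileRange { min: 10, max: 19 },
        FileRange { min: 20, max: 29 },
    ];
    // A predicate like `a BETWEEN 5 AND 12` keeps only the first two files.
    assert_eq!(prune_files(&files, 5, 12), vec![0, 1]);
    println!("kept files: {:?}", prune_files(&files, 5, 12));
}
```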

IntervalBound::new(ScalarValue::from(-0.0625), false),
IntervalBound::new(ScalarValue::from(0.0625), true),
);
assert_eq!(interval.cardinality()?, distinct_f64 * 2_048);

Contributor:

While I understand the rationale behind this choice, I think in practice this is not likely to provide much meaningful information. To estimate cardinality in such cases, one approach is to use distinct values / estimates from the input -- for example, if the input's cardinality is 100 but the range is -0.0625 to 0.0625, then the output cardinality of stable expressions is likely to be bounded by 100.

Contributor Author (@berkaysynnada):

You are right that it wouldn't have much use in practice, but using the distinct parameter might be more accurate at an outer scope for calculating practical cardinalities, since we have not yet introduced statistical metrics like distributions to the library.

Contributor Author (@berkaysynnada):

|               | Interval | Distinct Count | Number of Floating Values |
| ------------- | -------- | -------------- | ------------------------- |
| before Filter | [-1, 1]  | 100            | 1 B                       |
| after Filter  | [0, 1]   | 50 or 100?     | 0.5 B                     |

The selectivity is actually decreased by 50% in this example. However, the distinct count parameter (stored in ExprBoundaries along with the column's interval parameter) is not updated with approximate information (it is not set to 50). I think selectivity can be calculated and used approximately, but unless we are sure, we should not update the interval and distinct count parameters.

@ozankabak (Contributor) commented on Jul 19, 2023:

FYI, in the above example, the values in the "Number of Floating Values" column are just illustrative, not the actual number of floating-point values in those ranges.

What it is trying to convey is that we halve selectivity because it is an approximate measure, but we don't halve distinct count because it is a conservative/bounding measure and we don't know how these distinct values are distributed.
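
To spell out the arithmetic under a uniform-distribution assumption (using the illustrative numbers from the table above): the filter keeps the sub-interval [0, 1] of the original [-1, 1], so selectivity ≈ width([0, 1]) / width([-1, 1]) = 1 / 2 = 0.5, and the estimated row count drops from 100 to about 50. The post-filter distinct count, by contrast, is only known to be at most min(100, cardinality([0, 1])); the true value could still be anywhere up to 100, so it is kept at 100 rather than guessed at 50.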

Contributor:

Yes I agree -- all selectivity estimates are going to have errors introduced by assumptions made about the distribution (which is not fully known). Assuming a uniform distribution is an easy to understand choice, but of course causes substantial estimation error with skewed distributions

It was common for sophisticated cost models in other systems to have histogram information, but I think the current thinking is that it is better to be more adaptive / tolerant of bad cardinality estimations than to try and improve the cost models more.

// Since the floating-point numbers are ordered in the same order as their binary representation,
// we can consider their binary representations as "indices" and subtract them.
// https://stackoverflow.com/questions/8875064/how-many-distinct-floating-point-numbers-in-a-specific-range
Ok(data_type) if data_type.is_floating() => {

Contributor:

Not sure it matters, but is_floating also includes Float16, while the code below only handles Float32 / Float64.

Contributor Author (@berkaysynnada):

ScalarValue doesn't have a Float16 variant 😅
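
For readers unfamiliar with the Stack Overflow trick referenced in the code comment above, here is a small standalone sketch of the idea (not the PR's actual implementation, and working directly on f64 rather than ScalarValue; the monotone_bits helper is hypothetical): once the bit patterns of finite floats are mapped to a monotone unsigned key, they act as consecutive "indices" that can simply be subtracted.

```rust
/// Map an f64 to an unsigned key that preserves the ordering of finite floats.
fn monotone_bits(x: f64) -> u64 {
    let bits = x.to_bits();
    if bits >> 63 == 1 {
        // Negative values: flip all bits so more-negative maps to a smaller key.
        !bits
    } else {
        // Non-negative values: set the sign bit so they sort after the negatives.
        bits | (1u64 << 63)
    }
}

/// Distinct f64 values in the closed interval [lo, hi] (finite, lo <= hi).
fn f64_cardinality(lo: f64, hi: f64) -> u64 {
    monotone_bits(hi) - monotone_bits(lo) + 1
}

fn main() {
    // A degenerate interval contains exactly one value.
    assert_eq!(f64_cardinality(1.0, 1.0), 1);
    // Adjacent representable values differ by exactly one "index".
    assert_eq!(f64_cardinality(1.0, f64::from_bits(1.0f64.to_bits() + 1)), 2);
    // The monotone mapping also handles ranges spanning zero, like the test above.
    println!(
        "[-0.0625, 0.0625] contains {} distinct f64 values",
        f64_cardinality(-0.0625, 0.0625)
    );
}
```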

Additional resolved review threads: datafusion/physical-expr/src/physical_expr.rs (two, outdated) and datafusion/core/src/physical_plan/filter.rs.

@berkaysynnada (Contributor Author):

> This looks awesome @berkaysynnada - thank you so much. I ran out of time today to review this but I have it on my list for tomorrow

Thank you for your valuable feedback @alamb. I've reviewed all of them and implemented the necessary improvements.


/// This function returns the next/previous value depending on the `DIR` value.
/// If `true`, it returns the next value; otherwise it returns the previous value.
pub fn next_value<const DIR: bool>(self) -> ScalarValue {

Contributor:

Should we add this to the public API of ScalarValue? It seems quite specific to this usage, so it might be better to add it to the interval module.

Contributor:

I think I gave the opposite feedback in my initial review. I am sorry :(

pub fn interval_with_closed_bounds(mut self) -> Interval {
if self.lower.open {
// Get next value
self.lower.value = self.lower.value.next_value::<true>();

Contributor:

Maybe in this case next_value_add and next_value_sub would be clearer than using generics?
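
As a rough illustration of the two API shapes under discussion (plain f64 instead of ScalarValue, with a deliberately simplified stepping rule that assumes positive finite inputs; none of this is the PR's actual code):

```rust
/// Const-generic form: `DIR = true` steps to the next representable value,
/// `DIR = false` steps to the previous one (positive finite inputs assumed).
fn next_value<const DIR: bool>(x: f64) -> f64 {
    debug_assert!(x.is_finite() && x > 0.0);
    if DIR {
        f64::from_bits(x.to_bits() + 1)
    } else {
        f64::from_bits(x.to_bits() - 1)
    }
}

/// Named-function form suggested in the review: arguably clearer at call sites.
fn next_value_add(x: f64) -> f64 {
    next_value::<true>(x)
}

fn next_value_sub(x: f64) -> f64 {
    next_value::<false>(x)
}

fn main() {
    let x = 1.0_f64;
    // Both spellings compute the same thing; the difference is readability.
    assert_eq!(next_value::<true>(x), next_value_add(x));
    assert_eq!(next_value::<false>(x), next_value_sub(x));
    println!("next after 1.0:  {:e}", next_value_add(x));
    println!("prev before 1.0: {:e}", next_value_sub(x));
}
```

Either spelling computes the same result; the trade-off raised in the review is purely call-site readability versus a single generic entry point.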

@Dandandan (Contributor):

Seems really cool! I added one minor comment.


@ozankabak (Contributor) left a review:

I went through this code in detail one last time and everything looks good. There was one bug related to the floating-point next_value, which is now fixed. This is good to go after CI passes.

@alamb alamb merged commit b7ed06d into apache:main Jul 20, 2023

@alamb (Contributor) commented on Jul 20, 2023:

Thanks everyone -- this is a great step forward!

Labels: core (Core DataFusion crate), physical-expr (Physical Expressions)

Successfully merging this pull request may close these issues:
Consolidate 3 Range Analysis / Interval implementations (cost model, pruning predicates, interval analysis)

6 participants