
Refactor AnalysisContext and statistics() of FilterExec #6982

Merged · 51 commits merged into apache:main · Jul 20, 2023
Conversation

@berkaysynnada (Contributor) commented on Jul 16, 2023:

Which issue does this PR close?

Partially closes #5535.

Rationale for this change

The implementation of statistics() in FilterExec has been improved in two ways:

  1. It can now handle complex filter predicates. The previous design could only handle single-column expressions and simple binary expressions (such as a < 5 or 3 >= a).
  2. The previous implementation could not shrink the input column boundaries; it could only compute the selectivity and derive new row count and byte size statistics. Now, the newly computed interval boundaries are written into the column statistics.

This implementation is a step toward unifying the range/interval analysis implementations throughout the project. The Interval and cp_solver libraries introduced by this PR are well-documented, easy to use, and well-structured, which makes them a good fit for the calculations mentioned above.

What changes are included in this PR?

  1. statistics() now updates the newly calculated column intervals (a minimal illustrative sketch follows this list).
  2. There is no need to implement a separate analyze() function for every PhysicalExpr, since the cp_solver library can handle different kinds of PhysicalExprs while assigning intervals.
  3. Cardinality calculations are added to the interval library.
  4. The structure of AnalysisContext has changed. The selectivity value should not live inside ExprBoundaries, since this statistic is not tied to a single column; it is a measure over all rows of a table. Holding a field for the target column's results would also be outdated, since statistics may now be calculated for multiple columns, so keeping that field could be misleading. In the analyze() function, the results overwrite the input parameters because the old values are not used after the calculations.
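
To make the overall flow concrete, here is a minimal, self-contained sketch of the idea. The IntInterval, SimpleStats, and apply_filter_stats names are hypothetical and this is not DataFusion's actual AnalysisContext/ExprBoundaries API; it only illustrates taking selectivity as the ratio of interval cardinalities, scaling row count and byte size by it, and writing the shrunk bounds back into the column statistics.

```rust
/// A closed integer interval [lower, upper] standing in for a column's known range.
#[derive(Debug, Clone, Copy)]
struct IntInterval {
    lower: i64,
    upper: i64,
}

impl IntInterval {
    /// Number of integer points contained in the closed interval.
    fn cardinality(&self) -> u64 {
        (self.upper - self.lower + 1) as u64
    }
}

/// Table-level statistics kept alongside a single column's interval.
#[derive(Debug)]
struct SimpleStats {
    num_rows: usize,
    total_byte_size: usize,
    column_bounds: IntInterval,
}

/// Derive selectivity as the ratio of interval cardinalities, scale the
/// table-level numbers, and record the shrunk bounds in the column statistics.
fn apply_filter_stats(input: &SimpleStats, shrunk: IntInterval) -> SimpleStats {
    let selectivity =
        shrunk.cardinality() as f64 / input.column_bounds.cardinality() as f64;
    SimpleStats {
        num_rows: (input.num_rows as f64 * selectivity).ceil() as usize,
        total_byte_size: (input.total_byte_size as f64 * selectivity).ceil() as usize,
        column_bounds: shrunk,
    }
}

fn main() {
    let input = SimpleStats {
        num_rows: 1_000,
        total_byte_size: 8_000,
        column_bounds: IntInterval { lower: 0, upper: 99 },
    };
    // A predicate such as `a < 50` shrinks the bounds to [0, 49]:
    // selectivity = 50 / 100 = 0.5, so ~500 rows and ~4000 bytes remain.
    let output = apply_filter_stats(&input, IntInterval { lower: 0, upper: 49 });
    println!("{output:?}");
}
```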

Are these changes tested?

Yes, new tests are added to filter.rs.

Are there any user-facing changes?

berkaysynnada and others added 30 commits May 31, 2023 10:25
1) Analyze is removed from the methods of PhysicalExpr.
2) Interval arithmetic is applied to AnalysisContext values.
3) Intervals of input columns are updated now.
minor changes
@github-actions bot added the physical-expr (Physical Expressions) and core (Core DataFusion crate) labels on Jul 16, 2023

@ozankabak (Contributor):

I am excited about this. This PR utilizes the rigor and generality of the interval library and lays a solid foundation for building an ever more powerful analysis/statistics module for DataFusion. Looking forward to hearing community feedback from all those interested.

@alamb (Contributor) commented on Jul 17, 2023:

This looks awesome @berkaysynnada - thank you so much. I ran out of time today to review this but I have it on my list for tomorrow

@alamb (Contributor) left a review:

I went through this PR carefully and it looks (really) nice to me -- thank you @berkaysynnada

I left some minor comments on style and on making the code easier to work with, but those could definitely be done as a follow-on PR (or never).

cc @isidentical, who I believe worked on an earlier version of the analysis code, and @mingmwang / @Dandandan, who have recently expressed interest in working on Joins / Join Orders using better cardinality estimates.

@@ -451,6 +453,75 @@ impl Interval {
lower: IntervalBound::new(ScalarValue::Boolean(Some(true)), false),
upper: IntervalBound::new(ScalarValue::Boolean(Some(true)), false),
};

// Cardinality is the number of all points included by the interval, considering its bounds.

Contributor:
Something that might be worth considering in the long term, which @tustvold mentioned the other day, is to vectorize these calculations -- at the moment they are done in the context of a single expression, but eventually, if we want to use this logic to prune large numbers of files / etc. based on statistics, it may take too long.

No change is needed here, I am just planting a seed of an idea in case this was on your list too

Contributor Author (@berkaysynnada):

Yes, you are right. Speaking of this idea, I also want to mention that we have tried to replace the pruning logic in PruningPredicate with this interval library but realized that it disrupts the vectorized calculations. So I'm thinking about how we can use it without breaking the vectorized process.
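
As a rough, hypothetical sketch of the vectorization idea discussed above (the FileRange and prune_files names are made up, and this is not PruningPredicate's actual API): the same interval-overlap test can be applied in a single pass over the min/max statistics of many files, which keeps the per-file check cheap and easy to vectorize.

```rust
// Hypothetical sketch: test one predicate interval against many per-file ranges.
#[derive(Debug, Clone, Copy)]
struct FileRange {
    min: i64,
    max: i64,
}

/// True if the file's [min, max] range can intersect the closed predicate interval [lo, hi].
fn may_match(file: FileRange, lo: i64, hi: i64) -> bool {
    file.max >= lo && file.min <= hi
}

/// One pass over all file-level statistics; the per-file check is branch-light
/// and amenable to vectorization / parallelization.
fn prune_files(files: &[FileRange], lo: i64, hi: i64) -> Vec<usize> {
    files
        .iter()
        .enumerate()
        .filter(|(_, f)| may_match(**f, lo, hi))
        .map(|(i, _)| i)
        .collect()
}

fn main() {
    let files = vec![
        FileRange { min: 0, max: 9 },
        FileRange { min: 10, max: 19 },
        FileRange { min: 20, max: 29 },
    ];
    // A predicate like `a BETWEEN 5 AND 12` keeps only the first two files.
    assert_eq!(prune_files(&files, 5, 12), vec![0, 1]);
    println!("kept files: {:?}", prune_files(&files, 5, 12));
}
```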

IntervalBound::new(ScalarValue::from(-0.0625), false),
IntervalBound::new(ScalarValue::from(0.0625), true),
);
assert_eq!(interval.cardinality()?, distinct_f64 * 2_048);

Contributor:

While I understand the rationale behind this choice, I think in practice this is not likely to provide much meaningful information. To estimate cardinality in such cases, one approach is to use distinct values / estimates from the input -- for example, if the input's cardinality is 100 but the range is -0.0625 to 0.0625, then the output cardinality of stable expressions is likely to be bounded by 100.

Contributor Author (@berkaysynnada):

You are right that it wouldn't have much use in practice, but using the distinct parameter might be more accurate at an outer scope for calculating practical cardinalities, since we have not yet introduced statistical metrics like distributions to the library.

Contributor Author (@berkaysynnada):

|               | Interval | Distinct Count | Number of Floating Values |
| ------------- | -------- | -------------- | ------------------------- |
| before Filter | [-1, 1]  | 100            | 1 B                       |
| after Filter  | [0, 1]   | 50 or 100?     | 0.5 B                     |

The selectivity is actually decreased by 50% in this example. However, the distinct count parameter (stored in ExprBoundaries along with the column's interval parameter) is not updated with approximate information (it is not set to 50). I think selectivity can be calculated and used approximately, but unless we are sure, we should not update the interval and distinct count parameters.

@ozankabak (Contributor) commented on Jul 19, 2023:

FYI, in the above example, the values in the "Number of Floating Values" column are just illustrative, not the actual number of floating-point values in those ranges.

What it is trying to convey is that we halve selectivity because it is an approximate measure, but we don't halve distinct count because it is a conservative/bounding measure and we don't know how these distinct values are distributed.
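
To spell out the arithmetic under a uniform-distribution assumption (using the illustrative numbers from the table above): the filter keeps the sub-interval [0, 1] of the original [-1, 1], so selectivity ≈ width([0, 1]) / width([-1, 1]) = 1 / 2 = 0.5, and the estimated row count drops from 100 to about 50. The post-filter distinct count, by contrast, is only known to be at most min(100, cardinality([0, 1])); the true value could still be anywhere up to 100, so it is kept at 100 rather than guessed at 50.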

Contributor:

Yes I agree -- all selectivity estimates are going to have errors introduced by assumptions made about the distribution (which is not fully known). Assuming a uniform distribution is an easy to understand choice, but of course causes substantial estimation error with skewed distributions

It was common for sophisticated cost models in other systems to have histogram information, but I think the current thinking is that it is better to be more adaptive / tolerant of bad cardinality estimations than to try and improve the cost models more.

// Since the floating-point numbers are ordered in the same order as their binary representation,
// we can consider their binary representations as "indices" and subtract them.
// https://stackoverflow.com/questions/8875064/how-many-distinct-floating-point-numbers-in-a-specific-range
Ok(data_type) if data_type.is_floating() => {

Contributor:

Not sure it matters, but is_floating also includes Float16, while the code below only handles Float32 / Float64.

Contributor Author (@berkaysynnada):

ScalarValue doesn't have a Float16 variant 😅
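
For readers unfamiliar with the Stack Overflow trick referenced in the code comment above, here is a small standalone sketch of the idea (not the PR's actual implementation, and working directly on f64 rather than ScalarValue; the monotone_bits helper is hypothetical): once the bit patterns of finite floats are mapped to a monotone unsigned key, they act as consecutive "indices" that can simply be subtracted.

```rust
/// Map an f64 to an unsigned key that preserves the ordering of finite floats.
fn monotone_bits(x: f64) -> u64 {
    let bits = x.to_bits();
    if bits >> 63 == 1 {
        // Negative values: flip all bits so more-negative maps to a smaller key.
        !bits
    } else {
        // Non-negative values: set the sign bit so they sort after the negatives.
        bits | (1u64 << 63)
    }
}

/// Distinct f64 values in the closed interval [lo, hi] (finite, lo <= hi).
fn f64_cardinality(lo: f64, hi: f64) -> u64 {
    monotone_bits(hi) - monotone_bits(lo) + 1
}

fn main() {
    // A degenerate interval contains exactly one value.
    assert_eq!(f64_cardinality(1.0, 1.0), 1);
    // Adjacent representable values differ by exactly one "index".
    assert_eq!(f64_cardinality(1.0, f64::from_bits(1.0f64.to_bits() + 1)), 2);
    // The monotone mapping also handles ranges spanning zero, like the test above.
    println!(
        "[-0.0625, 0.0625] contains {} distinct f64 values",
        f64_cardinality(-0.0625, 0.0625)
    );
}
```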

Additional resolved review threads: datafusion/physical-expr/src/physical_expr.rs (two, outdated) and datafusion/core/src/physical_plan/filter.rs.

@berkaysynnada (Contributor Author):

> This looks awesome @berkaysynnada - thank you so much. I ran out of time today to review this but I have it on my list for tomorrow

Thank you for your valuable feedback @alamb. I've reviewed all of them and implemented the necessary improvements.


/// This function returns the next/previous value depending on the `DIR` value.
/// If `true`, it returns the next value; otherwise it returns the previous value.
pub fn next_value<const DIR: bool>(self) -> ScalarValue {

Contributor:

Should we add this to the public API of ScalarValue? It seems quite specific to this usage, so it might be better to add it to the interval module.

Contributor:

I think I gave the opposite feedback in my initial review. I am sorry :(

pub fn interval_with_closed_bounds(mut self) -> Interval {
if self.lower.open {
// Get next value
self.lower.value = self.lower.value.next_value::<true>();

Contributor:

Maybe in this case next_value_add and next_value_sub would be clearer than using generics?
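
As a rough illustration of the two API shapes under discussion (plain f64 instead of ScalarValue, with a deliberately simplified stepping rule that assumes positive finite inputs; none of this is the PR's actual code):

```rust
/// Const-generic form: `DIR = true` steps to the next representable value,
/// `DIR = false` steps to the previous one (positive finite inputs assumed).
fn next_value<const DIR: bool>(x: f64) -> f64 {
    debug_assert!(x.is_finite() && x > 0.0);
    if DIR {
        f64::from_bits(x.to_bits() + 1)
    } else {
        f64::from_bits(x.to_bits() - 1)
    }
}

/// Named-function form suggested in the review: arguably clearer at call sites.
fn next_value_add(x: f64) -> f64 {
    next_value::<true>(x)
}

fn next_value_sub(x: f64) -> f64 {
    next_value::<false>(x)
}

fn main() {
    let x = 1.0_f64;
    // Both spellings compute the same thing; the difference is readability.
    assert_eq!(next_value::<true>(x), next_value_add(x));
    assert_eq!(next_value::<false>(x), next_value_sub(x));
    println!("next after 1.0:  {:e}", next_value_add(x));
    println!("prev before 1.0: {:e}", next_value_sub(x));
}
```

Either spelling computes the same result; the trade-off raised in the review is purely call-site readability versus a single generic entry point.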

@Dandandan (Contributor):

Seems really cool! I added one minor comment.


@ozankabak (Contributor) left a review:

I went through this code in detail one last time and everything looks good. There was one bug related to the floating-point next_value, which is now fixed. This is good to go after CI passes.

@alamb alamb merged commit b7ed06d into apache:main Jul 20, 2023

@alamb (Contributor) commented on Jul 20, 2023:

Thanks everyone -- this is a great step forward!

Labels: core (Core DataFusion crate), physical-expr (Physical Expressions)

Successfully merging this pull request may close these issues:
Consolidate 3 Range Analysis / Interval implementations (cost model, pruning predicates, interval analysis)

6 participants