Refactor AnalysisContext and statistics() of FilterExec #6982
Conversation
…ws test" This reverts commit d10182a.
1) Analyze is removed from the methods of PhysicalExpr. 2) Interval arithmetic is applied to AnalysisContext values. 3) Intervals of input columns are updated now.
minor changes
Co-authored-by: Mehmet Ozan Kabak <[email protected]>
…m/synnada-ai/arrow-datafusion into feature/refactor-analysis-context
I am excited about this. This PR utilizes the rigor and generality of the interval library and lays a solid foundation to create an ever more powerful analysis/statistics module for DataFusion. Looking forward to hearing community feedback from all those interested.
This looks awesome @berkaysynnada - thank you so much. I ran out of time today to review this but I have it on my list for tomorrow.
I went through this PR carefully and it looks (really) nice to me -- thank you @berkaysynnada
I left some minor comments on style / making the code easier to work with, but those could definitely be done as a follow-on PR (or never).
cc @isidentical who I believe worked on an earlier version of the analysis code and @mingmwang / @Dandandan who have expressed interest recently in working on Joins / Join Orders using better cardinality estimates
@@ -451,6 +453,75 @@ impl Interval {
    lower: IntervalBound::new(ScalarValue::Boolean(Some(true)), false),
    upper: IntervalBound::new(ScalarValue::Boolean(Some(true)), false),
};

// Cardinality is the number of all points included by the interval, considering its bounds.
Something that might be worth considering in the long term that @tustvold mentioned the other day is to vectorize these calculations -- at the moment they are done in the context of a single expression, but eventually, if we want to use this logic to prune large numbers of files / etc. based on statistics, it may take too long.
No change is needed here, I am just planting a seed of an idea in case this was on your list too
Yes, you are right. Speaking of this idea, I also want to mention that we have tried to replace the pruning logic in PruningPredicate with this interval library but realized that it disrupts the vectorized calculations. So I'm thinking about how we can use it without breaking the vectorized process.
    IntervalBound::new(ScalarValue::from(-0.0625), false),
    IntervalBound::new(ScalarValue::from(0.0625), true),
);
assert_eq!(interval.cardinality()?, distinct_f64 * 2_048);
While I understand the rationale behind this choice, I think in practice this is not likely to provide much meaningful information -- to estimate cardinality in such cases, one approach is to use distinct values / estimates from the input -- like if the input's cardinality is 100, but the range is `-0.0625` to `0.0625`, then the output cardinality of stable expressions is likely to be bounded by 100.
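A minimal sketch of the bounding idea suggested above (the function name and the numbers are illustrative, not DataFusion's API): however many representable values fall inside the output interval, a stable expression cannot produce more distinct values than its input contains.

```rust
/// Illustrative helper: cap the interval-based cardinality estimate with the
/// distinct-value count known for the input.
fn bounded_output_cardinality(interval_cardinality: u64, input_distinct_count: u64) -> u64 {
    interval_cardinality.min(input_distinct_count)
}

fn main() {
    // Billions of f64 values fit between -0.0625 and 0.0625, but with only
    // 100 distinct input values the output cardinality is still at most 100.
    assert_eq!(bounded_output_cardinality(4_000_000_000, 100), 100);
}
```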
You are right that it wouldn't have much use in practice, but using the distinct parameter might be more accurate at an outer scope to calculate practical cardinalities, since we have not yet introduced statistical metrics like distributions to the library.
|               | Interval | Distinct Count | Number of Floating Values |
|---------------|----------|----------------|---------------------------|
| before Filter | [-1, 1]  | 100            | 1 B                       |
| after Filter  | [0, 1]   | 50 or 100 ?    | 0.5 B                     |
The selectivity is actually decreased by 50% in this example. However, the distinct count parameter (stored in `ExprBoundaries` along with the column's interval) is not updated with approximate information (it is not set to 50). I think selectivity can be calculated and used approximately, but unless we are sure, we should not update the interval and distinct count parameters.
FYI, in the above examples, values in the column "number of floating values" are just illustrative, not the actual number of floating values in those ranges.
What it is trying to convey is that we halve selectivity because it is an approximate measure, but we don't halve distinct count because it is a conservative/bounding measure and we don't know how these distinct values are distributed.
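A worked illustration of the table above (the numbers are hypothetical): selectivity is scaled by the surviving fraction of the interval because it is an approximate measure, while the distinct count is only bounded from above, not scaled, because the distribution of the distinct values is unknown.

```rust
fn main() {
    // Column interval before and after the filter, as in the table above.
    let input_interval = (-1.0_f64, 1.0_f64);
    let output_interval = (0.0_f64, 1.0_f64);
    let input_distinct_count: u64 = 100;

    // Approximate measure: scale by the fraction of the interval that survives.
    let selectivity =
        (output_interval.1 - output_interval.0) / (input_interval.1 - input_interval.0);
    assert_eq!(selectivity, 0.5);

    // Conservative measure: keep the pre-filter distinct count as an upper
    // bound; without knowing how the 100 distinct values are spread over
    // [-1, 1], halving it could undercount.
    let output_distinct_upper_bound = input_distinct_count;
    assert_eq!(output_distinct_upper_bound, 100);
}
```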
Yes I agree -- all selectivity estimates are going to have errors introduced by assumptions made about the distribution (which is not fully known). Assuming a uniform distribution is an easy to understand choice, but of course causes substantial estimation error with skewed distributions
It was common for sophisticated cost models in other systems to have histogram information, but I think the current thinking is that it is better to be more adaptive / tolerant of bad cardinality estimations than to try and improve the cost models more.
// Since the floating-point numbers are ordered in the same order as their binary representation,
// we can consider their binary representations as "indices" and subtract them.
// https://stackoverflow.com/questions/8875064/how-many-distinct-floating-point-numbers-in-a-specific-range
Ok(data_type) if data_type.is_floating() => {
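A self-contained sketch of the bit-representation trick referenced in the comment above (this is not the actual DataFusion code): finite floats of the same sign compare in the same order as their bit patterns, so mapping every value to a monotonically ordered integer key lets us count the representable values in a range by subtraction.

```rust
/// Maps a finite f64 to an integer key that preserves ordering across the
/// whole range, including negatives (whose raw bit patterns sort in reverse).
/// Note that -0.0 and +0.0 receive adjacent but distinct keys here.
fn ordered_bits(x: f64) -> u64 {
    let bits = x.to_bits();
    if bits >> 63 == 1 {
        !bits // negative: flip all bits to reverse the ordering
    } else {
        bits | (1 << 63) // non-negative: shift past the negative keys
    }
}

/// Number of representable f64 values in the closed range [low, high],
/// assuming both bounds are finite and low <= high.
fn f64_range_cardinality(low: f64, high: f64) -> u64 {
    ordered_bits(high) - ordered_bits(low) + 1
}

fn main() {
    // Adjacent representable values are exactly one "index" apart.
    let next_after_one = f64::from_bits(1.0_f64.to_bits() + 1);
    assert_eq!(f64_range_cardinality(1.0, next_after_one), 2);

    println!("{}", f64_range_cardinality(-0.0625, 0.0625));
}
```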
Not sure it matters, but `is_floating` also includes `Float16`, while the code below only handles `Float32` / `Float64`.
`ScalarValue`s don't have it 😅
Thank you for your valuable feedback @alamb. I've reviewed all of your comments and implemented the necessary improvements.
datafusion/common/src/scalar.rs
/// This function returns the next/previous value depending on the `DIR` value.
/// If `true`, it returns the next value; otherwise it returns the previous value.
pub fn next_value<const DIR: bool>(self) -> ScalarValue {
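A hypothetical sketch of how a direction-parameterized helper like this could work for a plain f64 (the name `next_f64_value` and its f64-only scope are illustrative, unlike the `ScalarValue`-level method quoted above): the adjacent representable value is reached by nudging the bit pattern by one.

```rust
/// Returns the adjacent representable f64: the next value when `DIR` is true,
/// the previous one otherwise. NaN and infinities are returned unchanged.
fn next_f64_value<const DIR: bool>(value: f64) -> f64 {
    if value.is_nan() || value.is_infinite() {
        return value;
    }
    if value == 0.0 {
        // Step off (either) zero to the smallest subnormal in the requested direction.
        return if DIR { f64::from_bits(1) } else { -f64::from_bits(1) };
    }
    let bits = value.to_bits();
    // Positive floats grow with their bit pattern while negative floats shrink,
    // so the sign decides whether "next" means incrementing or decrementing.
    let stepped = if (value > 0.0) == DIR { bits + 1 } else { bits - 1 };
    f64::from_bits(stepped)
}

fn main() {
    let up = next_f64_value::<true>(1.0);
    let down = next_f64_value::<false>(up);
    assert!(up > 1.0);
    assert_eq!(down, 1.0);
}
```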
Should we add this to the public API of `ScalarValue`? It seems quite specific for this usage, so it might be better to add it to the interval module.
I think I gave the opposite feedback in my initial review. I am sorry :(
pub fn interval_with_closed_bounds(mut self) -> Interval {
    if self.lower.open {
        // Get next value
        self.lower.value = self.lower.value.next_value::<true>();
Maybe in this case `next_value_add` and `next_value_sub` would be clearer than adding use of generics?
Seems really cool! I added one minor comment.
I went through this code in detail one last time and everything looks good. There was one bug about floating point `next_value` which is now fixed. This is good to go after CI passes.
Thanks everyone -- this is a great step forward!
Which issue does this PR close?
Partially closes #5535.
Rationale for this change
The implementation of `statistics()` in `FilterExec` has been improved in two ways: interval arithmetic is now applied to the `AnalysisContext` values, and comparison predicates are handled regardless of which side the column appears on (e.g. `a<5` or `3>=a`). This implementation is a step toward unifying the range/interval analysis implementations throughout the project. The Interval and cp_solver library introduced by this PR is well-documented, easy to use, and well-structured, and it is considered appropriate for the calculations mentioned.
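As a purely illustrative example of the interval-based analysis described above (the `IntervalF64` struct and the numbers are hypothetical, not DataFusion's actual `AnalysisContext` API), the sketch below estimates the selectivity of a predicate like `a<5` by intersecting the predicate's range with the column's known bounds, assuming a uniform value distribution.

```rust
/// Hypothetical closed interval over f64 values, used only for this example.
#[derive(Debug, Clone, Copy)]
struct IntervalF64 {
    lower: f64,
    upper: f64,
}

impl IntervalF64 {
    fn width(&self) -> f64 {
        self.upper - self.lower
    }

    fn intersect(&self, other: &IntervalF64) -> IntervalF64 {
        IntervalF64 {
            lower: self.lower.max(other.lower),
            upper: self.upper.min(other.upper),
        }
    }
}

fn main() {
    // Column `a` is known to lie in [0, 100]; the predicate is `a < 5`
    // (writing it as `5 > a` describes the same range).
    let a_bounds = IntervalF64 { lower: 0.0, upper: 100.0 };
    let predicate_range = IntervalF64 { lower: 0.0, upper: 5.0 };

    // Rows that can satisfy the predicate fall in the intersection; under a
    // uniform-distribution assumption its relative width is the selectivity.
    let new_bounds = a_bounds.intersect(&predicate_range);
    let selectivity = new_bounds.width() / a_bounds.width();

    println!("updated bounds for `a`: {new_bounds:?}");
    println!("estimated selectivity: {selectivity}"); // 0.05
}
```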
What changes are included in this PR?
- `statistics()` now updates the newly calculated column intervals.
- The `analyze()` functions are removed for all `PhysicalExpr`s, since the cp_solver library can handle different kinds of `PhysicalExpr`s while assigning intervals.
- `AnalysisContext`'s structure is changed. The selectivity value shouldn't be inside `ExprBoundaries`, since this statistic is not related to a single column; it is a measure over all rows of a table. Also, holding a field for the target column's results would become outdated now that we may calculate statistics for multiple columns, so keeping this field can be misleading. In the `analyze()` function, the results are overwritten onto the input parameters because the old values are not used after the calculations.

Are these changes tested?
Yes, new tests are added to filter.rs.
Are there any user-facing changes?