-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make filter selectivity for statistics configurable #8243
Conversation
@andygrove I do have a super early draft broken implementation, but it's already enough to ask your view on what's the right thing to do with SQL and when the filter is introduced as a part of an optimization but not as an optimization of an existing filter. Could you please review my comments? |
5b01f5e
to
cbf9d36
Compare
1e9c8d3
to
3609e1b
Compare
/// The default filter selectivity used by Filter Statistics | ||
/// when an exact selectivity cannot be determined. Valid values are | ||
/// between 0 (no selectivity) and 100 (all rows are selected). | ||
pub default_filter_selectivity: u8, default = 20 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it makes sense to make this a float (0.2
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The two main reasons for choosing a uint are the lack of Eq trait implementation for f32, as well as the problem that could arise when serializing numbers that cannot be perfectly represented as f32. If you had already made this consideration and you think f32 is still a better option, let me know and I will proceed
Overall looks very good! I think it might make more sense to use a |
FYI @alamb @andygrove |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @edmondop and @Dandandan -- this looks pretty neat.
@@ -994,4 +1014,22 @@ mod tests { | |||
|
|||
Ok(()) | |||
} | |||
|
|||
#[tokio::test] | |||
async fn test_validation_filter_selectivity() -> Result<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we also add a test showing that changing the default selectivity actually affects the output statistics?
I think if the selectivity got hard coded to 0.2 again, no tests would fail 🤔 Maybe we could add another unit tests here setting selectivity to 0.5
or something and demonstrating the statistics are different
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree we should, however I didn't know how to observe the effect of such a parameter. Is there some form of observable state or result exposed that I can use to perform an assertion about what selectivity has been used ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you should be able to look at the output of FilterExec::statistics()
and the row number estimates will change with different value of selectivity
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will try to finish it between today and tomorrow, thanks for the suggestion
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure however how this should be handled
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure however how this should be handled
That code will be invoked for 'complicated' predicates -- maybe we could fake it with something like sin(x) = 4.0
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @edmondop -- this looks good to me.
@@ -994,4 +1014,22 @@ mod tests { | |||
|
|||
Ok(()) | |||
} | |||
|
|||
#[tokio::test] | |||
async fn test_validation_filter_selectivity() -> Result<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure however how this should be handled
That code will be invoked for 'complicated' predicates -- maybe we could fake it with something like sin(x) = 4.0
.
)); | ||
let filter = FilterExec::try_new(predicate, input)?; | ||
let statistics = filter.statistics()?; | ||
assert_eq!(statistics.num_rows, Precision::Inexact(200)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👌 very nice
* Turning filter selectivity as a configurable parameter * Renaming API to be more consistent with struct value * Adding a filter with custom selectivity
Close #8133
Expose an API to set a default value for filter selectivity when the exact value cannot be computed. This value is also exposed as a part of the configuration as
datafusion.optimizer.default_filter_selectivity