[MINOR]: Unknown input statistics in FilterExec #7544
Thank you @berkaysynnada
In the FilterExec::statistics method, if the input statistics are None, the analysis is not performed.
I think we can also modify the statistics derive/propagate step in FilterExec.
I'm not sure which is better.
As a whole, LGTM.
I think we should add a test for this code somehow, to prevent breaking it during future refactorings.
I'm sorry, but I didn't understand exactly what you mean by derive/propagate.
@@ -196,7 +196,8 @@ fn shrink_boundaries(
         &final_result.upper.value,
         &target_boundaries,
         &initial_boundaries,
-    )?;
+    )
+    .unwrap_or(1.0);
In this implementation, columns with infinite bounds are not handled, so it was returning an error, and the newly calculated bounds could not be set. This is a kind of workaround, but I will refactor these parts in the context of this issue.
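To illustrate the workaround described above, here is a minimal sketch of a selectivity estimate that fails on unbounded ranges and then falls back to 1.0 (meaning "assume no rows are filtered out"). The `Bounds` type and the `range_ratio` and `selectivity_or_one` functions are hypothetical stand-ins written for this example; they are not the helpers that `shrink_boundaries` actually calls.

```rust
/// Hypothetical stand-in for a column's value range.
/// `None` on a side means that side is unbounded (infinite).
#[derive(Debug, Clone, Copy, PartialEq)]
struct Bounds {
    lower: Option<f64>,
    upper: Option<f64>,
}

/// Ratio of the narrowed range width to the initial range width.
/// Returns an error when either range has an unbounded side,
/// because the width is undefined in that case.
fn range_ratio(initial: Bounds, narrowed: Bounds) -> Result<f64, String> {
    let width = |b: Bounds| match (b.lower, b.upper) {
        (Some(lo), Some(hi)) => Ok(hi - lo),
        _ => Err("cannot compute the width of an unbounded range".to_string()),
    };
    let initial_width = width(initial)?;
    let narrowed_width = width(narrowed)?;
    if initial_width <= 0.0 {
        return Err("initial range has no width".to_string());
    }
    Ok((narrowed_width / initial_width).clamp(0.0, 1.0))
}

/// Same estimate, but with the conservative fallback from the diff:
/// if the ratio cannot be computed, assume a selectivity of 1.0.
fn selectivity_or_one(initial: Bounds, narrowed: Bounds) -> f64 {
    range_ratio(initial, narrowed).unwrap_or(1.0)
}
```

With this shape, the error no longer aborts the caller, so the newly calculated bounds can still be set even when the row-count estimate cannot be refined.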
I didn't explain it in enough detail, and I'm sorry for that.
This PR handles a condition: My point is that we can also correct the "Statistic Derive" to make Is it clear now what I mean? @berkaysynnada
Thanks for the explanation :) There is an issue, I don't know if you have found the chance to review it, but in summary, this Actually what you said "if the input statistics are None, the analysis is not performed." is not what I intended. The analysis is performed with columns having infinite bounds. To do this, filling Statistics with TypeInfo is inevitable. To reflect what you suggest in practice, I plan to add a
Which issue does this PR close?
Closes #.
Rationale for this change
In the FilterExec::statistics method, if the input statistics are None, the analysis is not performed. However, with the filter predicate, we can estimate the output column boundaries and propagate them to the next exec safely by assuming the column boundaries are infinite.
What changes are included in this PR?
There is a function new_with_unbounded_columns that constructs a ColumnStatistics having columns with unbounded min and max values (they are ScalarValue::SomeType(None), and these null instances are interpreted as infinite in the interval library).
After this change, a test needed to be modified due to the change of the build side of the join operation.
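As a rough illustration of the idea (not the actual DataFusion code), the sketch below models a column range whose missing bounds are treated as infinite, and shows how a filter predicate can still narrow it. The `Interval` type and its `unbounded`, `at_least`, and `at_most` methods are hypothetical names chosen for this example.

```rust
/// Hypothetical closed interval over i64 values.
/// `None` on a side is interpreted as unbounded (infinite) on that side.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Interval {
    min: Option<i64>,
    max: Option<i64>,
}

impl Interval {
    /// Starting point when the input provides no statistics.
    fn unbounded() -> Self {
        Interval { min: None, max: None }
    }

    /// Narrow by a predicate `col >= k`: keep the tighter lower bound.
    fn at_least(self, k: i64) -> Self {
        let min = Some(self.min.map_or(k, |m| m.max(k)));
        Interval { min, ..self }
    }

    /// Narrow by a predicate `col <= k`: keep the tighter upper bound.
    fn at_most(self, k: i64) -> Self {
        let max = Some(self.max.map_or(k, |m| m.min(k)));
        Interval { max, ..self }
    }
}
```

For example, starting from `Interval::unbounded()` and applying `at_least(10)` then `at_most(50)` gives a range with both sides known, even though the input had no statistics at all.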
Are these changes tested?
Are there any user-facing changes?