Refactor Statistics, introduce precision estimates (Exact, Inexact, Absent) #7793
Conversation
I like the idea of "precision"; however, I have a few concerns regarding the current implementation: …

With that, I propose to change the … For min/max values this probably leads to … A more drastic (but maybe better?) alternative is to keep …
Hi @crepererum, thanks for taking a look. We actually had similar thoughts and did consider whether we can use … So basically …
So in which case is an "arbitrarily wide Interval" better than "absent"?
Maybe an example involving selectivity will help: we often end up with an approximate point-wise estimate of selectivity derived from input/output row counts, but we have no rigorous bounds on it. The "inexact" mechanism comes into play when we use such a selectivity in any downstream calculations.
1) get_statistics_with_limit() is simplified.
2) AnalysisContext { boundaries } does not need to be an Option; the Option is removed.
3) Aggregation statistics() has special handling for the 0-row and 1-row cases.
Which selectivity estimator provides an inexact point instead of a lower/upper bound?
I can give an example of how this arises. Let's assume that the number of rows from the input statistics() is exactly 1000, and the range of the column values is [1, 100]. After analyzing the predicate, the new range of the column becomes [91, 100], which estimates the selectivity as 0.1 (intervals are assumed to be uniform; they cannot be treated any differently yet). https://github.com/apache/arrow-datafusion/blob/fa2bb6c4e80d80e3d1d26ce85f7c21232f036dd5/datafusion/physical-expr/src/analysis.rs#L218-L223

Now, when we update the number of rows, we use that selectivity value and find the new row count as 1000 * 0.1 = 100. Though the input row count is exact, we cannot carry that exactness information further. At that point, how can we define a range for 100 rows? That is why we needed such an Inexact implementation.
@berkaysynnada thank you for the nice example. Without introducing even more complex statistics into DF, I don't see how to really express that estimate, and I think the "inexact" precision proposed in this PR is a reasonable solution until we find something better ✔️
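To make the example above concrete, here is a minimal, self-contained sketch of the idea under discussion. It is an illustration only, not DataFusion's actual Precision API, and the scale_by_selectivity helper is hypothetical:

#[derive(Debug, Clone, Copy, PartialEq)]
enum Precision<T> {
    Exact(T),
    Inexact(T),
    Absent,
}

impl Precision<usize> {
    // Scaling a row count by an estimated selectivity degrades Exact to
    // Inexact, because the selectivity itself is only a point estimate
    // with no rigorous bounds.
    fn scale_by_selectivity(&self, selectivity: f64) -> Precision<usize> {
        match self {
            Precision::Exact(n) | Precision::Inexact(n) => {
                Precision::Inexact((*n as f64 * selectivity).round() as usize)
            }
            Precision::Absent => Precision::Absent,
        }
    }
}

fn main() {
    // Exactly 1000 input rows combined with the estimated selectivity
    // of 0.1 yields Inexact(100), matching the example above.
    let rows = Precision::Exact(1000_usize);
    println!("{:?}", rows.scale_by_selectivity(0.1));
}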
Thank you again @berkaysynnada -- this is really nice work, both code-wise and implementation-wise.
Thanks also to everyone who helped review the code.
})
}
_ => {
    // When the input row count is 0 or 1, we can adopt that statistic, keeping its reliability.
This change seems related to, and may conflict with, #7832 from @mustafasrepo.
We have discussed this with @mustafasrepo, and his solution is temporary. This change should serve as the final version after the PRs are merged.
Great, thanks everyone for the reviews! We will wait a little bit longer and merge this tomorrow if all still looks good.
Epic!
Which issue does this PR close?
Closes #7553.
Rationale for this change
This PR is a comprehensive refactor of the statistics() method of execs. The main goal is to enhance DataFusion by introducing an enumeration called "Precision" and demonstrating its benefit by threading it through the Statistics parameters. This enumeration categorizes the level of information accuracy throughout the code; we use it to classify Statistics parameters more precisely and consistently across the codebase. With this more detailed information available, a few algorithms have also been enhanced.
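As a rough sketch of the shape this takes (simplified here; the real Precision and Statistics definitions live in datafusion-common), every statistic carries its own precision, and combining statistics preserves exactness only when all inputs are exact:

#[derive(Debug, Clone)]
enum Precision<T> {
    Exact(T),
    Inexact(T),
    Absent,
}

// Simplified Statistics: each field is wrapped in Precision.
struct Statistics {
    num_rows: Precision<usize>,
    total_byte_size: Precision<usize>,
    // Per-column statistics (min, max, null counts) are wrapped the same way.
}

impl Precision<usize> {
    // Summing two counts, e.g. when merging partition statistics:
    // the result is Exact only if both inputs are Exact.
    fn add(&self, other: &Self) -> Self {
        use Precision::*;
        match (self, other) {
            (Exact(a), Exact(b)) => Exact(a + b),
            (Exact(a), Inexact(b))
            | (Inexact(a), Exact(b))
            | (Inexact(a), Inexact(b)) => Inexact(a + b),
            _ => Absent,
        }
    }
}

fn main() {
    // Exact(500) + Inexact(300) => Inexact(800)
    let merged = Precision::Exact(500_usize).add(&Precision::Inexact(300));
    println!("{:?}", merged);
}

With this in place, an operator that cannot compute a statistic reports Absent instead of a misleading default.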
The starting point of this update is the problem in the issue this PR closes. As mentioned in the issue, the statistics function causes an inefficiency in FilterExec, and error propagation is broken. That is fixed, and the cases in which errors can arise are restricted and clearly defined.

What changes are included in this PR?
The statistics() method of execs now returns a Result.
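In trait terms, the change looks roughly like the following sketch (the trait here is a simplified, hypothetical stand-in for DataFusion's ExecutionPlan):

use std::error::Error;

type Result<T> = std::result::Result<T, Box<dyn Error>>;

struct Statistics; // stand-in for the real Statistics struct

trait ExecutionPlanSketch {
    // Before: `fn statistics(&self) -> Statistics;` forced implementations
    // to swallow errors or return misleading defaults.
    // After: failures while computing statistics propagate to the caller.
    fn statistics(&self) -> Result<Statistics>;
}

fn main() {} // sketch only; nothing to run

Call sites that previously read statistics directly now handle or propagate the error, which is what fixes the broken error propagation mentioned above.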
Are these changes tested?
Yes. Existing tests are adapted to Precision. New tests are added for the new functions.
Are there any user-facing changes?