Harmonized predicate eval #420
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #420      +/-   ##
==========================================
+ Coverage   78.41%   78.52%   +0.10%
==========================================
  Files          55       58       +3
  Lines       11806    12151     +345
  Branches    11806    12151     +345
==========================================
+ Hits         9258     9541     +283
- Misses       2041     2096      +55
- Partials      507      514       +7
/// invert an operator. Returns Some<InvertedOp> if the operator supports inversion, None if it
/// cannot be inverted
pub(crate) fn invert(&self) -> Option<BinaryOperator> {
nit: Should this just consume `self`?
Originally it took `&self` because the caller never had an owned copy lying around. But now the code is completely deleted.
/// testing/debugging but also serves as a reference implementation that documents the expression
/// semantics that kernel relies on for data skipping.
pub(crate) trait PredicateEvaluator {
    type Output;
I like the idea, but is there ever a chance we won't want every evaluation to return the same output?
It already exists even before this PR: Delta data skipping maps `Expr` to `Option<Expr>` (and then applies the resulting skipping expression to engine data batches during log replay). Parquet footer skipping directly evaluates `Expr` to produce `Option<bool>`.
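To make the point about differing outputs concrete, here is a minimal sketch of one trait with an associated `Output` type and two implementations. Everything except the trait name and the `Output` associated type is hypothetical (`LessThan`, `DirectEvaluator`, `RewritingEvaluator`, `eval_lt`, and the `minValues.` text rendering are illustrative only, not the PR's code): one implementation evaluates a predicate directly to a boolean, the other rewrites it into a stats expression.

```rust
use std::collections::HashMap;

/// Hypothetical, heavily simplified predicate: `column < literal`.
pub struct LessThan {
    pub column: &'static str,
    pub literal: i64,
}

/// Sketch of a trait whose implementations can share traversal logic
/// while producing different result types.
pub trait PredicateEvaluator {
    type Output;

    /// Returns `None` when the predicate cannot be evaluated,
    /// in which case no skipping decision can be made.
    fn eval_lt(&self, pred: &LessThan) -> Option<Self::Output>;
}

/// Evaluates directly against known per-column minimum values,
/// in the spirit of parquet footer skipping.
pub struct DirectEvaluator {
    pub min_values: HashMap<&'static str, i64>,
}

impl PredicateEvaluator for DirectEvaluator {
    type Output = bool;

    fn eval_lt(&self, pred: &LessThan) -> Option<bool> {
        // A row can satisfy `col < v` only if the column minimum is below `v`.
        let min = self.min_values.get(pred.column)?;
        Some(*min < pred.literal)
    }
}

/// Rewrites the predicate into a stats expression, rendered here as text
/// (an assumption made to keep the sketch short).
pub struct RewritingEvaluator;

impl PredicateEvaluator for RewritingEvaluator {
    type Output = String;

    fn eval_lt(&self, pred: &LessThan) -> Option<String> {
        Some(format!("minValues.{} < {}", pred.column, pred.literal))
    }
}
```

The associated type lets each implementation choose its own result type without making every caller generic over it.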
(NotEqual, 1, vec![&batch2, &batch1]),
(NotEqual, 3, vec![&batch2, &batch1]),
(NotEqual, 4, vec![&batch2, &batch1]),
(NotEqual, 5, vec![&batch1]),
(NotEqual, 7, vec![&batch1]),
(NotEqual, 5, vec![&batch2, &batch1]),
(NotEqual, 7, vec![&batch2, &batch1]),
bug fix...
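For readers following the `NotEqual` fix, the sketch below shows why the direction of the comparisons matters when skipping on min/max stats. It is an illustration under assumptions, not the kernel's code: `MinMax` and both function names are hypothetical, and the stats are assumed to be exact bounds with `min <= max`.

```rust
/// Hypothetical min/max statistics for one column of one file.
#[derive(Clone, Copy)]
pub struct MinMax {
    pub min: i64,
    pub max: i64,
}

/// Returns true if the file may contain a row with `col != value`,
/// meaning the file must be kept. Only a file whose min and max both
/// equal `value` is guaranteed to have no matching row.
pub fn may_match_not_equal(stats: MinMax, value: i64) -> bool {
    stats.min < value || stats.max > value
}

/// The same check with the comparisons swapped. This is unsound: it
/// reports "no match" for any file whose range strictly contains `value`.
pub fn may_match_not_equal_swapped(stats: MinMax, value: i64) -> bool {
    stats.min > value || stats.max < value
}
```

With stats `[1, 10]` and the predicate `col != 5`, the first function keeps the file while the swapped version would skip it, even though the file can clearly hold values other than 5.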
(
    Expression::literal(3i64),
    table_for_numbers(vec![1, 2, 3, 4, 5, 6]),
),
(
    column_expr!("number").distinct(3i64),
    table_for_numbers(vec![1, 2, 3, 4, 5, 6]),
    table_for_numbers(vec![1, 2, 4, 5, 6]),
bug fix...
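As a rough illustration of the expected result in this test case, the sketch below filters a column with a null-safe "distinct" comparison. It assumes that `distinct` follows SQL `IS DISTINCT FROM` semantics; `is_distinct` and `filter_distinct` are hypothetical helpers, not part of the kernel API.

```rust
/// Null-safe inequality, assuming SQL `IS DISTINCT FROM` semantics:
/// two nulls are not distinct, and a null is distinct from any non-null value.
pub fn is_distinct(a: Option<i64>, b: Option<i64>) -> bool {
    a != b
}

/// Keeps only the values that are distinct from `literal`.
pub fn filter_distinct(values: &[Option<i64>], literal: i64) -> Vec<Option<i64>> {
    values
        .iter()
        .copied()
        .filter(|v| is_distinct(*v, Some(literal)))
        .collect()
}
```

Under that assumption, filtering the numbers 1 through 6 against the literal 3 keeps 1, 2, 4, 5, and 6, and a null input value would also be kept.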
The only changes in this file are a couple new uses of `column_name!` where needed, and a LOT of noise due to renaming `get_XXX_stat_value` as `get_XXX_stat` (since the method may return an expression for some trait impl).
Today, we have two completely independent data skipping predicate mechanisms:
- Delta stats skipping, which rewrites a predicate into a stats-based skipping expression.
- Parquet footer skipping, which evaluates a predicate directly over parquet footer stats.

Besides the duplication, there is also the problem of under-tested Delta stats code, and several lurking bugs (**).

The solution is to define a common predicate evaluation framework that can express not just Delta stats expression rewriting and direct evaluation over parquet footer stats, but can also evaluate any predicate over scalar data, given a way to resolve column names into `Scalar` values (the `DefaultPredicateEvaluator` trait). The default predicate evaluator allows for much easier testing of Delta data skipping predicates, all while reusing significant code to further reduce the chances of divergence and lurking bugs.

(**) Bugs found (and fixed) so far:
- The `NotEqual` implementation was unsound, due to swapping `<` with `>` (could wrongly skip files).
- `IS [NOT] NULL` was flat out broken, trying to do some black magic involving tightBounds. The correct solution is vastly simpler.
- `NULL` handling in `AND` and `OR` clauses was too conservative, preventing files from being skipped in several cases.

TODO:
- `eval_sql_where` from the parquet skipping code up to the main predicate evaluator -- but only if we think it's generally useful to have, and worth the trouble to implement for the other two expression evaluators.
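The `NULL` handling point in the bug list can be illustrated with SQL-style three-valued logic over `Option<bool>`, where `None` stands for an unknown result. This is a sketch under that assumption, with hypothetical function names (`kleene_and`, `kleene_or`, `can_skip`), not the PR's implementation. The key property is that a definite `false` in an `AND`, or a definite `true` in an `OR`, decides the outcome even when the other input is unknown.

```rust
/// Three-valued AND: a single definite `false` decides the result,
/// even if the other input is unknown (`None`).
pub fn kleene_and(a: Option<bool>, b: Option<bool>) -> Option<bool> {
    match (a, b) {
        (Some(false), _) | (_, Some(false)) => Some(false),
        (Some(true), Some(true)) => Some(true),
        _ => None,
    }
}

/// Three-valued OR: a single definite `true` decides the result,
/// even if the other input is unknown (`None`).
pub fn kleene_or(a: Option<bool>, b: Option<bool>) -> Option<bool> {
    match (a, b) {
        (Some(true), _) | (_, Some(true)) => Some(true),
        (Some(false), Some(false)) => Some(false),
        _ => None,
    }
}

/// Assumed skipping rule for this sketch: a file is skipped only when
/// the predicate is definitely false for every row in it.
pub fn can_skip(result: Option<bool>) -> bool {
    result == Some(false)
}
```

An evaluator that returned `None` whenever either input was `None` would still be sound, but it would miss cases such as `false AND unknown`, where the file can safely be skipped.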