Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Harmonized predicate eval #420

Open
wants to merge 12 commits into
base: main
Choose a base branch
from

Conversation

scovich
Copy link
Collaborator

@scovich scovich commented Oct 23, 2024

Today, we have two completely independent data skipping predicate mechanisms:

  1. Delta stats -- takes an expression as input and produces a rewritten expression as output. Difficult to test because you have to create and query a Delta table in order to see what data skipping resulted.
  2. Parquet footer stats -- takes an expression as input and produces an optional boolean as output. Tests can easily hook into it, and we have very thorough test coverage.

Besides the duplication, there is also the problem of under-tested Delta stats code, and at least one several lurking bugs (**). The solution is to define a common predicate evaluation framework that can express not just Delta stats expression rewriting and direct evaluation over parquet footer stats, but also can evaluate any predicate over scalar data, given a way to resolve column names into Scalar values (the DefaultPredicateEvaluator trait). The default predicate evaluator allows for much easier testing of Delta data skipping predicates. All while reusing significant code to further reduce the chances of divergence and lurking bugs.

(**) Bugs found (and fixed) so far:

  • NotEqual implementation was unsound, due to swapping < with > (could wrongly skip files).
  • IS [NOT] NULL was flat out broken, trying to do some black magic involving tightBounds. The correct solution is vastly simpler.
  • NULL handling in AND and OR clauses was too conservative, preventing files from being skipped in several cases.

TODO:

  • Actually leverage default predicate evaluator in tests, to improve test coverage of data_skipping.rs
  • Factor out the default predicate evaluator as a trait, so tests only have to provide the column resolution.
  • Disambiguate naming better -- several places have name collisions that require qualified paths. Those same name collisions make the code a lot harder to understand because it's hard to tell who calls what and where the actual implementation hides.
  • Consider hoisting eval_sql_where from the parquet skipping code up to the main predicate evaluator -- but only if we think it's generally useful to have, and worth the trouble to implement for the other two expression evaluators.
  • Doc comment all the things!

Copy link

codecov bot commented Oct 23, 2024

Codecov Report

Attention: Patch coverage is 89.45736% with 136 lines in your changes missing coverage. Please review.

Project coverage is 78.52%. Comparing base (4466509) to head (8f5b726).

Files with missing lines Patch % Lines
kernel/src/predicates/tests.rs 84.81% 63 Missing ⚠️
kernel/src/predicates/mod.rs 90.50% 19 Missing and 11 partials ⚠️
kernel/src/engine/parquet_stats_skipping/tests.rs 87.26% 18 Missing and 2 partials ⚠️
kernel/src/scan/data_skipping.rs 81.42% 11 Missing and 2 partials ⚠️
kernel/src/scan/data_skipping/tests.rs 96.51% 6 Missing and 1 partial ⚠️
kernel/src/engine/arrow_expression.rs 0.00% 1 Missing ⚠️
kernel/src/engine/parquet_stats_skipping.rs 98.30% 0 Missing and 1 partial ⚠️
kernel/src/expressions/mod.rs 92.85% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #420      +/-   ##
==========================================
+ Coverage   78.41%   78.52%   +0.10%     
==========================================
  Files          55       58       +3     
  Lines       11806    12151     +345     
  Branches    11806    12151     +345     
==========================================
+ Hits         9258     9541     +283     
- Misses       2041     2096      +55     
- Partials      507      514       +7     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@github-actions github-actions bot added the breaking-change Change that will require a version bump label Oct 23, 2024

/// invert an operator. Returns Some<InvertedOp> if the operator supports inversion, None if it
/// cannot be inverted
pub(crate) fn invert(&self) -> Option<BinaryOperator> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Should this just consume self?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Originally it took &self because the caller never had an owned copy laying around. But now the code is completely deleted.

/// testing/debugging but also serves as a reference implementation that documents the expression
/// semantics that kernel relies on for data skipping.
pub(crate) trait PredicateEvaluator {
type Output;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea, but is there ever a chance we won't want every evaluation to return the same output?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It already exists even before this PR: Delta data skipping maps Expr to Option<Expr> (and then applies the resulting skipping expression to engine data batches during log replay). Parquet footer skipping directly evaluates Expr to produce Option<bool>.

Comment on lines +334 to +338
(NotEqual, 1, vec![&batch2, &batch1]),
(NotEqual, 3, vec![&batch2, &batch1]),
(NotEqual, 4, vec![&batch2, &batch1]),
(NotEqual, 5, vec![&batch1]),
(NotEqual, 7, vec![&batch1]),
(NotEqual, 5, vec![&batch2, &batch1]),
(NotEqual, 7, vec![&batch2, &batch1]),
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bug fix...

(
Expression::literal(3i64),
table_for_numbers(vec![1, 2, 3, 4, 5, 6]),
),
(
column_expr!("number").distinct(3i64),
table_for_numbers(vec![1, 2, 3, 4, 5, 6]),
table_for_numbers(vec![1, 2, 4, 5, 6]),
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bug fix...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only changes in this file are a couple new uses of column_name! where needed, and a LOT of noise due to renaming get_XXX_stat_value as get_XXX_stat (since the method may return an expression for some trait impl).

@scovich scovich marked this pull request as ready for review November 7, 2024 04:02
@scovich scovich changed the title [WIP] Harmonized predicate eval Harmonized predicate eval Nov 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking-change Change that will require a version bump
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants