Harmonized predicate eval #420

scovich · 2024-10-23T04:07:34Z

Today, we have two completely independent data skipping predicate mechanisms:

Delta stats -- takes an expression as input and produces a rewritten expression as output. Difficult to test because you have to create and query a Delta table in order to see what data skipping resulted.
Parquet footer stats -- takes an expression as input and produces an optional boolean as output. Tests can easily hook into it, and we have very thorough test coverage.

Besides the duplication, the Delta stats code is under-tested and has several lurking bugs (**). The solution is to define a common predicate evaluation framework that can express not just Delta stats expression rewriting and direct evaluation over parquet footer stats, but can also evaluate any predicate over scalar data, given a way to resolve column names into Scalar values (the DefaultPredicateEvaluator trait). The default predicate evaluator makes Delta data skipping predicates much easier to test, and the shared code further reduces the chances of divergence and lurking bugs.
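To illustrate the idea of evaluating a predicate over scalar data once column names can be resolved, here is a minimal sketch. It is not the kernel's implementation: `Expr`, `ResolveColumn`, `eval_scalar`, and `eval_predicate` are hypothetical, simplified names, and only integer columns are modeled.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified expression type (not the kernel's `Expression`).
#[derive(Debug, Clone)]
pub enum Expr {
    Column(String),
    Literal(i64),
    LessThan(Box<Expr>, Box<Expr>),
    Not(Box<Expr>),
}

/// Hypothetical resolver: the only thing a test has to provide.
pub trait ResolveColumn {
    /// Returns the scalar value for a column, or `None` if it is unknown.
    fn resolve(&self, name: &str) -> Option<i64>;
}

impl ResolveColumn for HashMap<String, i64> {
    fn resolve(&self, name: &str) -> Option<i64> {
        self.get(name).copied()
    }
}

/// Evaluates a scalar-valued expression; predicates are not scalars here.
fn eval_scalar(expr: &Expr, resolver: &impl ResolveColumn) -> Option<i64> {
    match expr {
        Expr::Column(name) => resolver.resolve(name),
        Expr::Literal(value) => Some(*value),
        Expr::LessThan(..) | Expr::Not(_) => None,
    }
}

/// Evaluates a predicate; `None` means the result is unknown.
pub fn eval_predicate(expr: &Expr, resolver: &impl ResolveColumn) -> Option<bool> {
    match expr {
        Expr::LessThan(a, b) => {
            let lhs = eval_scalar(a, resolver)?;
            let rhs = eval_scalar(b, resolver)?;
            Some(lhs < rhs)
        }
        Expr::Not(inner) => eval_predicate(inner, resolver).map(|v| !v),
        Expr::Column(_) | Expr::Literal(_) => None,
    }
}
```

With a `HashMap` holding `x = 5`, the predicate `x < 10` evaluates to `Some(true)`, while a predicate over an unresolved column evaluates to `None`. A test only has to build the map.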

(**) Bugs found (and fixed) so far:

NotEqual implementation was unsound, due to swapping < with > (could wrongly skip files).
IS [NOT] NULL was flat out broken, trying to do some black magic involving tightBounds. The correct solution is vastly simpler.
NULL handling in AND and OR clauses was too conservative, preventing files from being skipped in several cases.
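
The NULL-related fixes above can be sketched with ordinary three-valued logic, where `None` stands for "unknown". This is an illustration under stated assumptions, not the code from this PR: the function names are hypothetical, and the null checks assume per-file null-count and row-count stats are available.

```rust
/// Three-valued AND: a single `false` decides the result even if other inputs are unknown.
pub fn and3(inputs: &[Option<bool>]) -> Option<bool> {
    if inputs.contains(&Some(false)) {
        Some(false)
    } else if inputs.contains(&None) {
        None
    } else {
        Some(true)
    }
}

/// Three-valued OR: a single `true` decides the result even if other inputs are unknown.
pub fn or3(inputs: &[Option<bool>]) -> Option<bool> {
    if inputs.contains(&Some(true)) {
        Some(true)
    } else if inputs.contains(&None) {
        None
    } else {
        Some(false)
    }
}

/// Could `col IS NULL` match any row? Only if the file contains at least one null.
pub fn is_null_may_match(null_count: Option<u64>) -> Option<bool> {
    null_count.map(|n| n > 0)
}

/// Could `col IS NOT NULL` match any row? Only if some row is non-null.
pub fn is_not_null_may_match(null_count: Option<u64>, row_count: Option<u64>) -> Option<bool> {
    let nulls = null_count?;
    let rows = row_count?;
    Some(nulls < rows)
}

/// A file may be skipped only when the predicate is provably false for every row.
pub fn can_skip(result: Option<bool>) -> bool {
    result == Some(false)
}
```

Under these rules, `unknown AND false` is `false`, so the file can still be skipped. Treating that case as unknown would be the overly conservative behavior described above.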

TODO:

Actually leverage default predicate evaluator in tests, to improve test coverage of data_skipping.rs
Factor out the default predicate evaluator as a trait, so tests only have to provide the column resolution.
Disambiguate naming better -- several places have name collisions that require qualified paths. Those same name collisions make the code a lot harder to understand because it's hard to tell who calls what and where the actual implementation hides.
Consider hoisting eval_sql_where from the parquet skipping code up to the main predicate evaluator -- but only if we think it's generally useful to have, and worth the trouble to implement for the other two expression evaluators.
Doc comment all the things!

codecov · 2024-10-23T04:13:08Z

Codecov Report

Attention: Patch coverage is 89.45736% with 136 lines in your changes missing coverage. Please review.

Project coverage is 78.52%. Comparing base (4466509) to head (8f5b726).

Files with missing lines	Patch %	Lines
kernel/src/predicates/tests.rs	84.81%	63 Missing ⚠️
kernel/src/predicates/mod.rs	90.50%	19 Missing and 11 partials ⚠️
kernel/src/engine/parquet_stats_skipping/tests.rs	87.26%	18 Missing and 2 partials ⚠️
kernel/src/scan/data_skipping.rs	81.42%	11 Missing and 2 partials ⚠️
kernel/src/scan/data_skipping/tests.rs	96.51%	6 Missing and 1 partial ⚠️
kernel/src/engine/arrow_expression.rs	0.00%	1 Missing ⚠️
kernel/src/engine/parquet_stats_skipping.rs	98.30%	0 Missing and 1 partial ⚠️
kernel/src/expressions/mod.rs	92.85%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #420      +/-   ##
==========================================
+ Coverage   78.41%   78.52%   +0.10%     
==========================================
  Files          55       58       +3     
  Lines       11806    12151     +345     
  Branches    11806    12151     +345     
==========================================
+ Hits         9258     9541     +283     
- Misses       2041     2096      +55     
- Partials      507      514       +7

hntd187 · 2024-10-23T20:35:23Z

kernel/src/expressions/mod.rs

-
-    /// invert an operator. Returns Some<InvertedOp> if the operator supports inversion, None if it
-    /// cannot be inverted
-    pub(crate) fn invert(&self) -> Option<BinaryOperator> {


nit: Should this just consume self?

Originally it took &self because the caller never had an owned copy lying around. But now the code is completely deleted.

hntd187 · 2024-10-23T20:37:27Z

kernel/src/predicates/mod.rs

+/// testing/debugging but also serves as a reference implementation that documents the expression
+/// semantics that kernel relies on for data skipping.
+pub(crate) trait PredicateEvaluator {
+    type Output;


I like the idea, but is there ever a chance we won't want every evaluation to return the same output?

That difference already exists, even before this PR: Delta data skipping maps Expr to Option<Expr> (and then applies the resulting skipping expression to engine data batches during log replay), while parquet footer skipping directly evaluates Expr to produce Option<bool>.
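
As a rough sketch of why an associated `Output` type fits here, the following shows one trait with two implementations: one evaluates directly to a boolean, the other rewrites the predicate into a stats expression. All names (`PredicateEval`, `DirectEval`, `RewriteEval`, `eval_lt`) are hypothetical and much simpler than the trait in the diff above. The rewritten expression is modeled as a plain string over a `minValues` stats column purely for illustration.

```rust
/// Hypothetical, simplified stand-in for a predicate evaluator trait.
pub trait PredicateEval {
    /// Each implementation chooses what evaluation produces.
    type Output;

    /// Evaluates `column < value`; `None` means no useful result.
    fn eval_lt(&self, column: &str, value: i64) -> Option<Self::Output>;
}

/// Direct evaluation against an already-known minimum: produces a boolean.
pub struct DirectEval {
    pub min: i64,
}

impl PredicateEval for DirectEval {
    type Output = bool;

    fn eval_lt(&self, _column: &str, value: i64) -> Option<bool> {
        // Some row can satisfy `col < value` only if the minimum is below `value`.
        Some(self.min < value)
    }
}

/// Rewriting: produces an expression (a string here) to be applied to stats later.
pub struct RewriteEval;

impl PredicateEval for RewriteEval {
    type Output = String;

    fn eval_lt(&self, column: &str, value: i64) -> Option<String> {
        Some(format!("minValues.{column} < {value}"))
    }
}

/// Shared logic can be written once, generically over the output type.
pub fn x_less_than_ten<E: PredicateEval>(evaluator: &E) -> Option<E::Output> {
    evaluator.eval_lt("x", 10)
}
```

The generic helper is the point of the design: the comparison logic is written once, and each evaluator decides whether the result is a value or an expression.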

scovich · 2024-11-07T03:34:36Z

kernel/tests/read.rs

+        (NotEqual, 1, vec![&batch2, &batch1]),
+        (NotEqual, 3, vec![&batch2, &batch1]),
        (NotEqual, 4, vec![&batch2, &batch1]),
-        (NotEqual, 5, vec![&batch1]),
-        (NotEqual, 7, vec![&batch1]),
+        (NotEqual, 5, vec![&batch2, &batch1]),
+        (NotEqual, 7, vec![&batch2, &batch1]),
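
For context on why NotEqual rarely allows skipping, here is a minimal sketch of a sound check over min/max stats. The function name and the integer-only stats are assumptions for illustration. The contents of batch1 and batch2 are not shown in this diff, so the sketch makes no claim about them.

```rust
/// Could `col != value` match any row, given the file's min/max stats?
/// Returns `None` when a stat is missing, in which case the file must be kept.
pub fn not_equal_may_match(min: Option<i64>, max: Option<i64>, value: i64) -> Option<bool> {
    let (min, max) = (min?, max?);
    // Every non-null row equals `value` only when min == max == value.
    // Equivalent form, given min <= max: `min < value || max > value`.
    Some(min != value || max != value)
}
```

A file whose values span 1 to 6 must be kept for `col != 5`, because rows other than 5 exist. Swapping the comparisons in the equivalent form (testing `min > value || max < value`) would wrongly report that such a file cannot match.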


scovich · 2024-11-07T03:35:09Z

kernel/tests/read.rs

        (
            Expression::literal(3i64),
            table_for_numbers(vec![1, 2, 3, 4, 5, 6]),
        ),
        (
            column_expr!("number").distinct(3i64),
-            table_for_numbers(vec![1, 2, 3, 4, 5, 6]),
+            table_for_numbers(vec![1, 2, 4, 5, 6]),
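
The corrected expectation above drops the row equal to 3. A small sketch of that behavior follows, assuming `distinct` follows SQL's null-safe IS DISTINCT FROM semantics. The helper names are hypothetical, and nullable values are modeled with `Option`.

```rust
/// Null-safe inequality: two NULLs are not distinct, NULL vs. a value is distinct.
pub fn is_distinct_from(a: Option<i64>, b: Option<i64>) -> bool {
    // `Option`'s equality already treats `None == None` as true.
    a != b
}

/// Keeps only the values that are distinct from `literal`.
pub fn filter_distinct(values: &[Option<i64>], literal: i64) -> Vec<Option<i64>> {
    values
        .iter()
        .copied()
        .filter(|v| is_distinct_from(*v, Some(literal)))
        .collect()
}
```

Unlike plain `!=`, this comparison never yields an unknown result, so a NULL row would be kept when compared against the literal 3.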


scovich · 2024-11-07T04:01:54Z

kernel/src/engine/parquet_row_group_skipping/tests.rs

The only changes in this file are a couple of new uses of column_name! where needed, and a LOT of noise from renaming get_XXX_stat_value to get_XXX_stat (since the method may return an expression for some trait impls).

scovich added 5 commits October 21, 2024 21:27

simplify and clean up data skipping logic a bit

cfb9cb3

checkpoint - one trait captures data skipping, parquet stats skipping, and generic expression eval

264ad5f

Delete redundant code, Delta data skipping passes tests now

7b24f90

it works now, all tests passing

3ed4526

code comment

da16ba7

github-actions bot added the breaking-change label (Change that will require a version bump) Oct 23, 2024

scovich mentioned this pull request Oct 23, 2024

Simplify and clean up data skipping logic a bit #415

Open

add doc comments

5e6dc11

hntd187 reviewed Oct 23, 2024

scovich added 4 commits October 23, 2024 20:15

add default eval tests, fix distinct

1c26d98

more cleanups and doc comments

dadd719

add more tests, fix broken data skipping null checks, AND/OR weirdness

9ec825d

Cleanup and remove redundant parquet stats skipping tests

1ccaa6d

scovich mentioned this pull request Oct 29, 2024

ColumnName tracks a path of field names instead of a simple string #445

Merged

Merge remote-tracking branch 'oss/main' into hamonized-predicate-eval

107bc5f

scovich commented Nov 7, 2024

scovich marked this pull request as ready for review November 7, 2024 04:02

scovich changed the title ~~[WIP] Harmonized predicate eval~~ Harmonized predicate eval Nov 7, 2024

scovich requested review from nicklan and zachschuermann November 7, 2024 04:03

cleanup

8f5b726
