Terms, Types and Expressions #22

Fokko · 2023-08-04T08:17:04Z

In (Py)Iceberg we have an expression hierarchy that works very well. In this issue, I'll try to explain it, and also convince y'all to use it in iceberg-rust as well. Disclaimer: I'm an OOP guy, so there are probably some things that don't make sense in Rust. I don't think we can even port the whole hierarchy, since Rust is not OOP.

The important traits are (directly translated from Python):

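trait Bound {
    // Returns the negation of this expression.
    fn invert(&self) -> Box<dyn Bound>;
}

trait Unbound {
    // Binds this expression to a schema, producing its bound counterpart.
    fn bind(&self, schema: &impl Schema, case_sensitive: bool) -> Box<dyn Bound>;
}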

(This excludes Term, Reference, and BooleanExpression. Maybe we should call Unbound UnboundBooleanExpression and Bound BoundBooleanExpression; it is up to you. In the end, naming things is the hardest thing in computer science.)

These are implemented by operations such as:

Unary predicates: IsNull, NotNull, IsNaN, NotNaN
Set predicates: In, NotIn
Literal predicates: EqualTo, NotEqualTo, LessThan, LessThanOrEqual, GreaterThan, GreaterThanOrEqual
Negation: Not
Compositions: And, Or
Literal: AlwaysTrue, AlwaysFalse

The invert method is important later on for rewriting Not(...) operations. Not(EqualTo("UserID", 123)) can be rewritten to NotEqualTo("UserID", 123). Similarly, a Not around a composition can be rewritten: !(A and B) == !A or !B.
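To make the Not rewrite concrete, here is a minimal sketch in plain Rust. The `Expr` enum, its integer-only literals, and the `rewrite_not` helper are assumptions for illustration only; they are neither the PyIceberg nor the iceberg-rust API.

```rust
// Hypothetical, simplified expression tree. The variant names mirror the
// predicates listed above, but this is not the iceberg-rust API.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    AlwaysTrue,
    AlwaysFalse,
    IsNull(String),
    NotNull(String),
    EqualTo(String, i64),
    NotEqualTo(String, i64),
    LessThan(String, i64),
    GreaterThanOrEqual(String, i64),
    Not(Box<Expr>),
    And(Box<Expr>, Box<Expr>),
    Or(Box<Expr>, Box<Expr>),
}

impl Expr {
    /// Returns the negation of `self` as a tree that contains no `Not` node.
    fn invert(&self) -> Expr {
        use Expr::*;
        match self {
            AlwaysTrue => AlwaysFalse,
            AlwaysFalse => AlwaysTrue,
            IsNull(c) => NotNull(c.clone()),
            NotNull(c) => IsNull(c.clone()),
            EqualTo(c, v) => NotEqualTo(c.clone(), *v),
            NotEqualTo(c, v) => EqualTo(c.clone(), *v),
            LessThan(c, v) => GreaterThanOrEqual(c.clone(), *v),
            GreaterThanOrEqual(c, v) => LessThan(c.clone(), *v),
            // Negating a negation cancels out; still clean up the inner tree.
            Not(inner) => inner.rewrite_not(),
            // De Morgan: !(A and B) == !A or !B, and the dual for Or.
            And(a, b) => Or(Box::new(a.invert()), Box::new(b.invert())),
            Or(a, b) => And(Box::new(a.invert()), Box::new(b.invert())),
        }
    }

    /// Removes every `Not` node by pushing negations down to the leaves.
    fn rewrite_not(&self) -> Expr {
        use Expr::*;
        match self {
            Not(inner) => inner.invert(),
            And(a, b) => And(Box::new(a.rewrite_not()), Box::new(b.rewrite_not())),
            Or(a, b) => Or(Box::new(a.rewrite_not()), Box::new(b.rewrite_not())),
            leaf => leaf.clone(),
        }
    }
}
```

With this sketch, calling `rewrite_not` on `Not(EqualTo("UserID", 123))` yields `NotEqualTo("UserID", 123)`, so later stages never have to handle a `Not` node.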

All the operations come in a bound and an unbound variant. EqualTo is in the public API, and once it is bound to a schema, it becomes a BoundEqualTo. Binding is important: say we have the expression UserID = '123'; at bind time we want to convert this to UserID = 123, because UserID is an integer field in this case. Since Iceberg is lazy, every file written can have a different schema, and binding to each of the schemas has some advantages:

If you promote a column along the way from i32 to i64, the UserID literal will also be promoted to an i64 when binding to the schema of a newer file.
If you add a new column, but this column isn't written in an older file, then a predicate on it will be converted to an AlwaysFalse().
When an IsNull("UserID") is bound to a UserID INTEGER NOT NULL column, then this will also convert into an AlwaysFalse().
Optional: In PyIceberg we have optimizations, such as rewriting In("UserID", {123}) to EqualTo("UserID", 123) since there is only one literal.
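The binding rules above could be sketched roughly as follows. Everything here (`Schema`, `Field`, `FieldType`, `Literal`, and the `Unbound`/`Bound` enums) is a hypothetical stand-in rather than the PyIceberg or iceberg-rust types, literals start out as strings, and case sensitivity is left out to keep it short.

```rust
use std::collections::HashMap;

// Hypothetical, minimal stand-ins; not the iceberg-rust or PyIceberg types.
#[derive(Debug, Clone, Copy, PartialEq)]
enum FieldType {
    Int,  // i32
    Long, // i64
}

#[derive(Debug, Clone, PartialEq)]
struct Field {
    id: i32,
    field_type: FieldType,
    required: bool,
}

/// Schema of a single data file: column name -> field.
struct Schema {
    fields: HashMap<String, Field>,
}

#[derive(Debug, Clone, PartialEq)]
enum Literal {
    Int(i32),
    Long(i64),
}

#[derive(Debug, Clone, PartialEq)]
enum Unbound {
    IsNull(String),
    EqualTo(String, String), // the literal as the user typed it, e.g. "123"
    In(String, Vec<String>),
}

#[derive(Debug, Clone, PartialEq)]
enum Bound {
    AlwaysFalse,
    IsNull(i32), // bound predicates refer to the field id
    EqualTo(i32, Literal),
    In(i32, Vec<Literal>),
}

/// Converts the raw literal to the type the file's schema expects.
fn to_literal(raw: &str, ty: FieldType) -> Option<Literal> {
    match ty {
        FieldType::Int => raw.parse::<i32>().ok().map(Literal::Int),
        FieldType::Long => raw.parse::<i64>().ok().map(Literal::Long),
    }
}

impl Unbound {
    fn bind(&self, schema: &Schema) -> Result<Bound, String> {
        let name = match self {
            Unbound::IsNull(n) | Unbound::EqualTo(n, _) | Unbound::In(n, _) => n,
        };
        // Column not written in this file: AlwaysFalse, following the text above.
        let field = match schema.fields.get(name) {
            Some(f) => f,
            None => return Ok(Bound::AlwaysFalse),
        };
        let lit = |raw: &String| {
            to_literal(raw, field.field_type)
                .ok_or_else(|| format!("cannot convert {:?} for column {}", raw, name))
        };
        match self {
            // A required (NOT NULL) column can never be null.
            Unbound::IsNull(_) if field.required => Ok(Bound::AlwaysFalse),
            Unbound::IsNull(_) => Ok(Bound::IsNull(field.id)),
            Unbound::EqualTo(_, raw) => Ok(Bound::EqualTo(field.id, lit(raw)?)),
            Unbound::In(_, raws) => {
                let mut lits = raws.iter().map(|r| lit(r)).collect::<Result<Vec<_>, _>>()?;
                if lits.len() == 1 {
                    // A single literal: rewrite In to EqualTo.
                    Ok(Bound::EqualTo(field.id, lits.remove(0)))
                } else {
                    Ok(Bound::In(field.id, lits))
                }
            }
        }
    }
}
```

Binding the same `EqualTo("UserID", "123")` against an older file with an `Int` column and a newer file with a `Long` column gives `Literal::Int(123)` and `Literal::Long(123)` respectively, which is the promotion case described above.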

liurenjie1024 · 2023-08-08T07:52:17Z

Hi, @Fokko Thanks for this great write-up. If I understand correctly, this is quite similar to expression evaluation/optimization in a database query engine. Usually, a database engine has its own expression optimization system, and the table format needs to provide statistics to the query optimizer. Since we haven't decided yet how to integrate with other query engines, I would suggest implementing this later, after we have finished core specs such as catalog, table, transaction, metadata, etc.

Fokko · 2023-08-09T07:16:30Z

@liurenjie1024 Yes, this provides the input for the evaluation/optimization. I agree that we should not implement an optimizer in iceberg-rust, but I think we should get the primitives in place. Many databases don't support id-based field resolution, and Iceberg has a lot of metadata at the file level. What I would suggest is that iceberg-rust does the pruning of the unrelated files, rather than doing this in the database, but let's defer this discussion.

liurenjie1024 · 2023-08-09T08:12:13Z

@Fokko Thanks for the explanation. I agree that providing pruning in iceberg-rust would benefit the community and other query engines.

JanKaul · 2023-08-09T14:52:40Z

Implementing expressions feels like a huge task which also has to be maintained later on. I'm not sure about Databend or RisingWave but for Datafusion it would be sufficient to have functionality like:

// returns the min/max value of every manifest file for the given column
fn manifest_min_values(table: &Table, column_name: &str) -> Vec<Value>
fn manifest_max_values(table: &Table, column_name: &str) -> Vec<Value>

// returns the min/max value of every data file for the given column
fn datafile_min_values(table: &Table, column_name: &str) -> Vec<Value>
fn datafile_max_values(table: &Table, column_name: &str) -> Vec<Value>

or even better with arrow:

fn manifest_min_values(table: &Table, column_name: &str) -> ArrayRef
fn manifest_max_values(table: &Table, column_name: &str) -> ArrayRef

fn datafile_min_values(table: &Table, column_name: &str) -> ArrayRef
fn datafile_max_values(table: &Table, column_name: &str) -> ArrayRef

liurenjie1024 · 2023-08-10T02:38:19Z

Yes, an expression system is no small effort, and we can postpone the discussion about it.

ZENOTME · 2023-08-17T14:11:22Z

Implementing expressions feels like a huge task which also has to be maintained later on. I'm not sure about Databend or RisingWave but for Datafusion it would be sufficient to have functionality like:

fn manifest_min_values(table: &Table, column_name: &str) -> Vec<Value>

I'm curious about which use case needs this? 🤔

liurenjie1024 · 2023-08-18T01:24:28Z

I think it's used in planning

JanKaul · 2023-08-18T04:53:17Z

Datafusion has the trait PruningStatistics that you can implement for a file/container. It can be used with a pruning predicate to check whether an expression could evaluate to true for at least one row in the file/container. This way you don't have to perform the expression evaluation yourself.
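The idea of pruning on per-file min/max statistics can be sketched in plain Rust as below. Note that this is not DataFusion's `PruningStatistics` trait or its pruning predicate; `FileStats`, `Predicate`, and `prune` are hypothetical names, and the statistics are simplified to a single i64 column.

```rust
// Hypothetical per-file statistics for one i64 column. This is neither
// DataFusion's `PruningStatistics` trait nor the iceberg-rust API.
#[derive(Debug, Clone)]
struct FileStats {
    path: String,
    min: i64,
    max: i64,
}

#[derive(Debug, Clone, Copy)]
enum Predicate {
    EqualTo(i64),
    LessThan(i64),
    GreaterThan(i64),
}

impl Predicate {
    /// Returns false only when no row in the file can satisfy the predicate.
    /// A true result means "maybe": the file still has to be read.
    fn might_match(&self, stats: &FileStats) -> bool {
        match *self {
            Predicate::EqualTo(v) => stats.min <= v && v <= stats.max,
            Predicate::LessThan(v) => stats.min < v,
            Predicate::GreaterThan(v) => stats.max > v,
        }
    }
}

/// Keeps only the files that could contain at least one matching row.
fn prune<'a>(files: &'a [FileStats], predicate: Predicate) -> Vec<&'a FileStats> {
    files.iter().filter(|f| predicate.might_match(f)).collect()
}
```

For two files covering the ranges 0..=99 and 100..=199, `prune(&files, Predicate::EqualTo(123))` keeps only the second file, without evaluating the predicate against any row.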

viirya · 2024-02-21T21:15:55Z

As DataFusion has already implemented a mature expression system and evaluation framework, I'm wondering if it is possible, or even a better option, to reuse it in iceberg-rust instead of re-implementing another expression + evaluation system?

liurenjie1024 · 2024-02-22T01:02:16Z

Hi, @viirya I thought about this option, but I didn't choose it for several reasons:

iceberg-rust's position is similar to Iceberg Java's: it's a library that will be used by many engines, such as datafusion, polars, daft, risingwave, databend, etc.
An expression system is usually bound to a type system.
Iceberg doesn't need a general-purpose expression system like datafusion's, which may make things complicated. For example, we don't need subqueries.
There are some Iceberg-specific things like transforms and sorting, and I'm not sure how much effort it would take to make them compatible with datafusion expressions.

This was referenced Dec 20, 2023

Pruning partitions using filter when reading file. #126

Closed

feat: Expression system. #132

Merged

Fokko closed this as completed in #132 Dec 26, 2023
