Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Terms, Types and Expressions #22

Closed
Fokko opened this issue Aug 4, 2023 · 10 comments · Fixed by #132
Closed

Terms, Types and Expressions #22

Fokko opened this issue Aug 4, 2023 · 10 comments · Fixed by #132

Comments

@Fokko
Copy link
Contributor

Fokko commented Aug 4, 2023

In (Py)Iceberg we have a hierarchy that works very well. In this issue, I'll try to explain it, and also convince y'all to use it in iceberg-rust as well. Disclaimer, I'm an OOP guy, so probably there are some things that don't make sense in Rust. I don't think we even can port the whole hierarchy, since Rust is not OOP.

The important traits are (directly translated from Python):

trait Bound {
    fn invert(&self) -> Bound;
}
trait Unbound {
    fn bind(&self, schema &impl Schema, case_sensitive &bool) -> Unbound;
}

(This excludes Term, Reference, BooleanExpression, maybe we should call Unbound as UnboundBooleanExpression, and Bound as BoundBooleanExpression, it is up to you. In the end, naming things is the hardest thing in computer science).

This is implemented by operations such as,

  • Unary predicates: IsNull, NotNull, IsNaN, NotNaN
  • Set predicates: In, NotIn
  • Literal predicates: EqualTo, NotEqualTo, LessThan, LessThanOrEqual, GreaterThan, GreaterThanOrEqual
  • Negation: Not
  • Compositions: And, Or
  • Literal: AlwaysTrue, AlwaysFalse

The inverse method is important later on to rewrite Not(...) operations. Not(EqualTo("UserId", "123")), can be rewritten to NotEqualTo("UserID", 123). Similar for Not can be rewritten: !(A and B) == !A or !B.

All the operations come in a bound and unbound one. The EqualTo is in the public API, and once it is bound to a schema, it will a BoundEqualTo. Binding is important since let's say that we have an expression: UserID = '123', then we want to convert this at bind time to UserID = 123 because the UserID is a date field in this case. Since Iceberg is lazy, every file written can have a different schema, and binding to each of the schemas has some advantages:

  • If you promote a column along the way from i32 to i64, the will UserID will also be promoted to an i64 when binding to a schema with a newer file.
  • If you add a new column, but this column isn't written in an older file, then it will be converted to an AlwaysFalse().
  • When an IsNull("UserID") is bound to a UserID INTEGER NOT NULL column, then this will also convert into an AlwaysFalse().
  • Optional: In PyIceberg we have optimizations, that In("UserID", {123}) is rewritten to EqualTo("UserID" == 123) since there is only one literal.
@liurenjie1024
Copy link
Collaborator

Hi, @Fokko Thanks for this great writing. If I understand correctly, this is quite similar to expression evaluation/optimization in a database query engine. Usually, a database engine has its own expression optimization system, and table format needs to provide statistics to query optimizer. Since currently we haven't decided yet how to integrate with other query engines, I would suggest implementing this later after we have finished core specs such as catalog, table, transaction, metadata, etc.

@Fokko
Copy link
Contributor Author

Fokko commented Aug 9, 2023

@liurenjie1024 Yes, provides the input for the evaluation/optimization. I agree that we should not implement an optimizer in iceberg-rust, and I think we should also get the primitives in place. Many databases don't support id-based field resolution, and Iceberg has a lot of metadata at a file level. What I would suggest is that iceberg-rust does do the pruning of the unrelated files, rather than doing this in the database, but let's defer this discussion.

@liurenjie1024
Copy link
Collaborator

@Fokko Thanks for the explanation. I agree that providing pruning in iceberg-rust would benefit the community and other query engines.

@JanKaul
Copy link
Collaborator

JanKaul commented Aug 9, 2023

Implementing expressions feels like a huge task which also has to be maintained later on. I'm not sure about Databend or RisingWave but for Datafusion it would be sufficient to have functionality like:

// returns min value of every manifest file for the given column
fn manifest_min_values(table: &Table, column_name: &str) -> Vec<Value>
fn manifest_max_values(table: &Table, column_name: &str) -> Vec<Value>

// returns min value of every data file for the given column
fn datafile_min_values(table: &Table, column_name: &str) -> Vec<Value>
fn datafile_max_values(table: &Table, column_name: &str) -> Vec<Value>

or even better with arrow:

fn manifest_min_values(table: &Table, column_name: &str) -> ArrayRef
fn manifest_max_values(table: &Table, column_name: &str) -> ArrayRef

fn datafile_min_values(table: &Table, column_name: &str) -> ArrayRef
fn datafile_max_values(table: &Table, column_name: &str) -> ArrayRef

@liurenjie1024
Copy link
Collaborator

Yes, expression system is not small effort, and we can postpone discussion about it later.

@ZENOTME
Copy link
Contributor

ZENOTME commented Aug 17, 2023

Implementing expressions feels like a huge task which also has to be maintained later on. I'm not sure about Databend or RisingWave but for Datafusion it would be sufficient to have functionality like:

fn manifest_min_values(table: &Table, column_name: &str) -> Vec<Value>

I'm curious abort what use case need to use this? 🤔

@liurenjie1024
Copy link
Collaborator

Implementing expressions feels like a huge task which also has to be maintained later on. I'm not sure about Databend or RisingWave but for Datafusion it would be sufficient to have functionality like:

fn manifest_min_values(table: &Table, column_name: &str) -> Vec<Value>

I'm curious abort what use case need to use this? 🤔

I think it's used in planning

@JanKaul
Copy link
Collaborator

JanKaul commented Aug 18, 2023

Datafusion has the trait PruningStatistics that you can implement for a file/container. It can be used with a pruning predicate to check whether an expression could evaluate to true for at least one row in the file/container. This way you don't have to perform the expression evaluation yourself.

@viirya
Copy link
Member

viirya commented Feb 21, 2024

As DataFusion already implemented mature expression system and evaluation framework, I'm wondering if it is possibly or it is better option to reuse it in iceberg-rust instead re-implementing another expression + evaluation etc.?

@liurenjie1024
Copy link
Collaborator

As DataFusion already implemented mature expression system and evaluation framework, I'm wondering if it is possibly or it is better option to reuse it in iceberg-rust instead re-implementing another expression + evaluation etc.?

Hi, @viirya I thought about this option, but I didn't choose it for several reasons:

  1. iceberg-rust's position is similar to iceberg java, e.g. it's a library which will be used by many engines, such as datafusion, polars, daft, risingwave, databend, etc.
  2. Expression system usually binds with type system.
  3. Iceberg don't need a general purpose expression system like datafusion, which may make things complicated. For example, we don't need subquery.
  4. There are some iceberg specific things like transform, sorting, and I'm not sure how much effort it will take to make it compatible with datafusion expression.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants