-
Notifications
You must be signed in to change notification settings - Fork 150
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Terms, Types and Expressions #22
Comments
Hi, @Fokko Thanks for this great writing. If I understand correctly, this is quite similar to expression evaluation/optimization in a database query engine. Usually, a database engine has its own expression optimization system, and table format needs to provide statistics to query optimizer. Since currently we haven't decided yet how to integrate with other query engines, I would suggest implementing this later after we have finished core specs such as catalog, table, transaction, metadata, etc. |
@liurenjie1024 Yes, provides the input for the evaluation/optimization. I agree that we should not implement an optimizer in iceberg-rust, and I think we should also get the primitives in place. Many databases don't support id-based field resolution, and Iceberg has a lot of metadata at a file level. What I would suggest is that iceberg-rust does do the pruning of the unrelated files, rather than doing this in the database, but let's defer this discussion. |
@Fokko Thanks for the explanation. I agree that providing pruning in |
Implementing expressions feels like a huge task which also has to be maintained later on. I'm not sure about Databend or RisingWave but for Datafusion it would be sufficient to have functionality like: // returns min value of every manifest file for the given column
fn manifest_min_values(table: &Table, column_name: &str) -> Vec<Value>
fn manifest_max_values(table: &Table, column_name: &str) -> Vec<Value>
// returns min value of every data file for the given column
fn datafile_min_values(table: &Table, column_name: &str) -> Vec<Value>
fn datafile_max_values(table: &Table, column_name: &str) -> Vec<Value> or even better with arrow: fn manifest_min_values(table: &Table, column_name: &str) -> ArrayRef
fn manifest_max_values(table: &Table, column_name: &str) -> ArrayRef
fn datafile_min_values(table: &Table, column_name: &str) -> ArrayRef
fn datafile_max_values(table: &Table, column_name: &str) -> ArrayRef |
Yes, expression system is not small effort, and we can postpone discussion about it later. |
I'm curious abort what use case need to use this? 🤔 |
I think it's used in planning |
Datafusion has the trait PruningStatistics that you can implement for a file/container. It can be used with a pruning predicate to check whether an expression could evaluate to true for at least one row in the file/container. This way you don't have to perform the expression evaluation yourself. |
As DataFusion already implemented mature expression system and evaluation framework, I'm wondering if it is possibly or it is better option to reuse it in iceberg-rust instead re-implementing another expression + evaluation etc.? |
Hi, @viirya I thought about this option, but I didn't choose it for several reasons:
|
In (Py)Iceberg we have a hierarchy that works very well. In this issue, I'll try to explain it, and also convince y'all to use it in
iceberg-rust
as well. Disclaimer, I'm an OOP guy, so probably there are some things that don't make sense in Rust. I don't think we even can port the whole hierarchy, since Rust is not OOP.The important traits are (directly translated from Python):
(This excludes
Term
,Reference
,BooleanExpression
, maybe we should callUnbound
asUnboundBooleanExpression
, andBound
asBoundBooleanExpression
, it is up to you. In the end, naming things is the hardest thing in computer science).This is implemented by operations such as,
IsNull
,NotNull
,IsNaN
,NotNaN
In
,NotIn
EqualTo
,NotEqualTo
,LessThan
,LessThanOrEqual
,GreaterThan
,GreaterThanOrEqual
Not
And
,Or
AlwaysTrue
,AlwaysFalse
The inverse method is important later on to rewrite
Not(...)
operations.Not(EqualTo("UserId", "123"))
, can be rewritten toNotEqualTo("UserID", 123)
. Similar for Not can be rewritten:!(A and B) == !A or !B
.All the operations come in a bound and unbound one. The
EqualTo
is in the public API, and once it is bound to a schema, it will aBoundEqualTo
. Binding is important since let's say that we have an expression:UserID = '123'
, then we want to convert this at bind time toUserID = 123
because the UserID is a date field in this case. Since Iceberg is lazy, every file written can have a different schema, and binding to each of the schemas has some advantages:i32
toi64
, the will UserID will also be promoted to ani64
when binding to a schema with a newer file.AlwaysFalse()
.IsNull("UserID")
is bound to aUserID INTEGER NOT NULL
column, then this will also convert into anAlwaysFalse()
.In("UserID", {123})
is rewritten toEqualTo("UserID" == 123)
since there is only one literal.The text was updated successfully, but these errors were encountered: