Don't error on unknown column when pruning if predicate can still be proven false #7869
Comments
I agree this would be a nice improvement to the pruning logic. |
I am not sure how easy this will be to implement with the current implementation of |
Yeah I took a brief look and came to a similar conclusion - it seems to be a significant rework to enable this behaviour as-is. #7887 sounds like a great idea! |
I think this also affects scanning parquet files with "evolved" schemas -- namely where missing columns are replaced by For example, in a parquet table that has two files:
If there is a query with a predicate on
|
Here is an idea on how to extend PruningPredicate to handle this case Problem:
I think we could teach
For the example in this ticket's description with predicate
|
I plan to work on this this week |
Next steps: @appletreeisyellow and I will write up a proposal and share it around |
We will post a proposal sometime this week |
Update is that @appletreeisyellow and I have been working on a design for this feature, and we hope to have a proposal later this week |
After quite a bit of study, @appletreeisyellow and I have realized that this ticket describes something different than knowing a column is ALL null. This ticket describes handling predicates with only known information and
In the use case we have in InfluxDB, this means getting a predicate that references columns about which nothing is known at all (neither the schema nor the values). I am not sure how common this use case is across implementations, and I cannot come up with any good solution. Thus, I filed a separate ticket to track handling columns whose types are known, and are known to be all |
Let's focus on #9171 and then come back to this idea |
Update here is that @appletreeisyellow has much improved the ability of DataFusion to prune when a column is known to be entirely null (aka #9171). What remains is the ability to prune when nothing (including the schema) is known about the column, a situation that exists in certain places in our InfluxDB 3.0 codebase. From my perspective, given the potential complexity of implementing this feature in DataFusion, we may decide never to work on it and instead pursue other ways of achieving the same end. I don't think we have decided |
I think I've gotten this working. I pass in a schema that only has the columns I have stats for. Other columns seem to be ignored. I also wrap the predicate with |
Is your feature request related to a problem or challenge?
At query time, our use case requires that we evaluate predicates against in-memory data that may have a schema that is a subset of the table schema. The predicate can reference columns that are not currently in memory or known at query time.
For example, given the following in-memory data:
We may have to evaluate a predicate such as col_a != A AND col_b=bananas, where col_b is not present in the in-memory schema (and so is unknown at pruning time) but is a valid column for the table in the system as a whole.
Because at query time we have a limited subset of the schema, the schema and statistics provided when constructing the PruningPredicate cover only col_a, value.
However, the col_a != A portion of the predicate can be proven FALSE irrespective of col_b. Unfortunately, constructing the PruningPredicate eagerly validates the presence of statistics for all columns in the predicate, and errors stating that there are no fields named col_b before attempting to evaluate any portion of the predicate.
Describe the solution you'd like
Attempt to evaluate the predicate based on the available statistics, and return FALSE if possible. If the predicate cannot be proven FALSE, return a "missing column" error as it does today.
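To make the requested behaviour concrete, here is a minimal, self-contained sketch of the evaluation order. It is not DataFusion's PruningPredicate API: Expr, ColumnStats, Verdict, evaluate and may_match are hypothetical names invented for illustration, literals are plain strings compared lexicographically, and nulls are ignored. The statistics for col_a (min = max = A) are an assumption chosen to match the example predicate.

```rust
use std::collections::HashMap;

/// Min/max statistics for one column (hypothetical type, not DataFusion's).
struct ColumnStats {
    min: String,
    max: String,
}

/// A tiny predicate language, just enough for the example in this issue.
enum Expr {
    Eq(String, String),    // column = literal
    NotEq(String, String), // column != literal
    And(Box<Expr>, Box<Expr>),
}

/// Result of evaluating a predicate against statistics only.
#[derive(Debug, PartialEq)]
enum Verdict {
    ProvenFalse,
    MaybeTrue,
    MissingColumn(String),
}

fn evaluate(expr: &Expr, stats: &HashMap<String, ColumnStats>) -> Verdict {
    match expr {
        Expr::Eq(col, lit) => match stats.get(col) {
            None => Verdict::MissingColumn(col.clone()),
            // The literal lies outside [min, max], so no row can equal it.
            Some(s) if *lit < s.min || *lit > s.max => Verdict::ProvenFalse,
            Some(_) => Verdict::MaybeTrue,
        },
        Expr::NotEq(col, lit) => match stats.get(col) {
            None => Verdict::MissingColumn(col.clone()),
            // Every value equals the literal, so `!=` holds for no row.
            Some(s) if s.min == *lit && s.max == *lit => Verdict::ProvenFalse,
            Some(_) => Verdict::MaybeTrue,
        },
        Expr::And(left, right) => match (evaluate(left, stats), evaluate(right, stats)) {
            // One provably false side settles the conjunction, even if the
            // other side references a column with no statistics.
            (Verdict::ProvenFalse, _) | (_, Verdict::ProvenFalse) => Verdict::ProvenFalse,
            (Verdict::MissingColumn(c), _) | (_, Verdict::MissingColumn(c)) => {
                Verdict::MissingColumn(c)
            }
            _ => Verdict::MaybeTrue,
        },
    }
}

/// Ok(false) when the predicate is proven FALSE; otherwise the
/// "missing column" error is surfaced as it is today.
fn may_match(expr: &Expr, stats: &HashMap<String, ColumnStats>) -> Result<bool, String> {
    match evaluate(expr, stats) {
        Verdict::ProvenFalse => Ok(false),
        Verdict::MaybeTrue => Ok(true),
        Verdict::MissingColumn(c) => Err(format!("no field named {}", c)),
    }
}

/// Statistics covering only col_a (assumed: every value is "A").
fn example_stats() -> HashMap<String, ColumnStats> {
    let mut stats = HashMap::new();
    stats.insert(
        "col_a".to_string(),
        ColumnStats { min: "A".to_string(), max: "A".to_string() },
    );
    stats
}

/// col_a != A AND col_b = bananas
fn example_predicate() -> Expr {
    Expr::And(
        Box::new(Expr::NotEq("col_a".to_string(), "A".to_string())),
        Box::new(Expr::Eq("col_b".to_string(), "bananas".to_string())),
    )
}
```

The key design choice is that the missing-column condition is carried as a value and only turned into an error at the top level, so a provably false sibling under AND can override it. A predicate that references only col_b would still produce the error.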
For the example above, ideally pruning should return FALSE as it can be proven that col_a != A is FALSE even though col_b is unknown at pruning time.
Describe alternatives you've considered
Inserting NULL statistics into the pruning schema to satisfy the presence check - this works around the issue, but unfortunately requires extra processing to prevent the missing field error.
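As a rough illustration of that extra processing, the sketch below pads a statistics map with an explicit "unknown" entry for every column the predicate references. MinMax and pad_with_unknown are hypothetical names; in a real integration the padding would have to happen in whatever schema and statistics provider is handed to the PruningPredicate, which is not shown here.

```rust
use std::collections::BTreeMap;

/// Min/max statistics for one column (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
struct MinMax {
    min: String,
    max: String,
}

/// Copy `known`, then add a `None` ("statistics unknown") entry for every
/// referenced column that has no statistics, so a presence check passes.
fn pad_with_unknown(
    known: &BTreeMap<String, MinMax>,
    referenced: &[&str],
) -> BTreeMap<String, Option<MinMax>> {
    let mut padded: BTreeMap<String, Option<MinMax>> = known
        .iter()
        .map(|(name, stats)| (name.clone(), Some(stats.clone())))
        .collect();
    for col in referenced {
        // Existing entries are left untouched; only gaps are filled.
        padded.entry((*col).to_string()).or_insert(None);
    }
    padded
}
```

Calling pad_with_unknown with statistics for col_a only and the referenced columns col_a and col_b yields a map in which col_a keeps its statistics and col_b maps to None.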
Additional context
This change in behaviour might need sticking behind a flag/option to opt into, rather than being the default.