Don't error on unknown column when pruning if predicate can still be proven false #7869

domodwyer · 2023-10-19T12:45:50Z

Is your feature request related to a problem or challenge?

At query time, our use case requires that we evaluate predicates against in-memory data that may have a schema that is a subset of the table schema. The predicate can reference columns that are not currently in memory or known at query time.

For example, given the following in-memory data:

col_a	value
A	42

We may have to evaluate a predicate such as col_a != A AND col_b=bananas, where col_b is not present in the in-memory schema (it is unknown at pruning time) but is a valid column for the table in the system as a whole.

Because at query time we have only a limited subset of the schema, the schema and statistics provided when constructing the PruningPredicate cover only col_a and value.

However, the col_a != A portion of the predicate can be proven FALSE irrespective of col_b. Unfortunately, constructing the PruningPredicate eagerly validates the presence of statistics for all columns in the predicate, and it fails with an error stating that there is no field named col_b before attempting to evaluate any portion of the predicate.

Describe the solution you'd like

Attempt to evaluate the predicate based on the available statistics, and return FALSE if possible. If the predicate cannot be proven FALSE, return a "missing column" error as it does today.

For the example above, ideally pruning should return FALSE as it can be proven that col_a != A is FALSE even though col_b is unknown at pruning time.
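To make the requested behaviour concrete, here is a minimal sketch in Python. It is not DataFusion code: the tuple-based predicate encoding, the `stats` dictionary and the `could_be_true` / `prune` helpers are all hypothetical, and NULL handling is ignored for brevity. The idea is that an unknown column yields "unknown" rather than an immediate error, and an AND is still provably false if any one conjunct is provably false.

```python
from typing import Optional

# Hypothetical min/max statistics for one container; only col_a is known.
stats = {"col_a": {"min": "A", "max": "A"}}


def could_be_true(pred, stats) -> Optional[bool]:
    """Return False if the predicate is provably false for every row,
    True if some row may match, or None if a needed column is unknown."""
    op = pred[0]
    if op == "and":
        results = [could_be_true(p, stats) for p in pred[1:]]
        if any(r is False for r in results):
            return False  # one false conjunct is enough, unknowns do not matter
        if any(r is None for r in results):
            return None
        return True
    column, value = pred[1], pred[2]
    if column not in stats:
        return None  # no statistics: cannot decide on this leaf alone
    lo, hi = stats[column]["min"], stats[column]["max"]
    if op == "eq":
        return lo <= value <= hi
    if op == "ne":
        # Provably false only when every row holds exactly `value`.
        return not (lo == value == hi)
    raise ValueError(f"unsupported operator: {op}")


def prune(pred, stats) -> bool:
    """Keep today's behaviour (an error) only when FALSE cannot be proven."""
    result = could_be_true(pred, stats)
    if result is None:
        raise KeyError("missing column")
    return result


# col_a != A AND col_b = bananas: provably false despite col_b being unknown.
example = ("and", ("ne", "col_a", "A"), ("eq", "col_b", "bananas"))
print(prune(example, stats))
```

With col_a = A AND col_b = bananas instead, the first conjunct may be true, so the result depends on col_b and the sketch raises the "missing column" error, matching the fallback described above.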

Describe alternatives you've considered

Inserting NULL statistics into the pruning schema to satisfy the presence check. This works around the issue, but it requires extra processing solely to prevent the missing field error.

Additional context

This change in behaviour might need sticking behind a flag/option to opt into, rather than being the default.


alamb · 2023-10-19T14:31:17Z

I agree this would be a nice improvement to the pruning logic.

alamb · 2023-10-20T18:22:45Z

I am not sure how easy this will be to implement with the current implementation of PruningPredicate -- It may need something more substantial like #7887

domodwyer · 2023-10-20T18:39:49Z

Yeah I took a brief look and came to a similar conclusion - it seems to be a significant rework to enable this behaviour as-is. #7887 sounds like a great idea!

alamb · 2023-12-30T13:29:47Z

I think this also affects scanning parquet files with "evolved" schemas -- namely where missing columns are replaced by NULL

For example, in a parquet table that has two files:

file1: Columns a and b
file2: Column b (does not have column a)

If there is a query with a predicate on a, like WHERE a = 5, the current logic in the parquet reader will rewrite this to NULL = 5 when scanning file2, but it is not then smart enough to understand that the expression can never be true and that the file should be skipped. The same reasoning can be applied to many more complicated expressions that can never evaluate to true for file2, such as

WHERE a = 5 AND b = 'foo'
WHERE CASE WHEN a > 5 THEN b = 'foo' WHEN a < 5 THEN b = 'bar' ELSE false END
WHERE a IN (1,2,3)
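The claim that none of these expressions can be true for file2 follows from SQL's three-valued logic. The sketch below models NULL as Python's `None` and hand-writes the three predicates; the helper names are hypothetical and this is not how the parquet reader evaluates expressions. A WHERE clause keeps a row only when the predicate is exactly true, so a result of NULL or false means the row is filtered out.

```python
def eq(x, y):
    # A comparison with NULL is NULL.
    return None if x is None or y is None else x == y


def gt(x, y):
    return None if x is None or y is None else x > y


def lt(x, y):
    return None if x is None or y is None else x < y


def and_(x, y):
    # Kleene AND: false dominates, then NULL, otherwise true.
    if x is False or y is False:
        return False
    if x is None or y is None:
        return None
    return True


def in_list(x, values):
    # NULL IN (...) is NULL (the list itself holds no NULLs in this sketch).
    return None if x is None else x in values


def case_expr(a, b):
    # CASE WHEN a > 5 THEN b = 'foo' WHEN a < 5 THEN b = 'bar' ELSE false END
    if gt(a, 5) is True:
        return eq(b, "foo")
    if lt(a, 5) is True:
        return eq(b, "bar")
    return False


a = None  # column a is missing from file2, so it reads as NULL
for b in ("foo", "bar", None):
    results = [and_(eq(a, 5), eq(b, "foo")), case_expr(a, b), in_list(a, (1, 2, 3))]
    # No expression is ever true, whatever b holds.
    print(b, results, any(r is True for r in results))
```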

alamb · 2023-12-30T13:39:23Z

Here is an idea on how to extend PruningPredicate to handle this case

Problem:

PruningPredicate can't be told about columns that are known to contain only NULL. It can be told which columns have no nulls (via the PruningStatistics::null_counts()).

I think we could teach PruningPredicate about all-null columns like this:

Add a new method PruningStatistics::row_counts() to get the total row counts in each container.
Use the information from PruningStatistics::row_counts() and PruningStatistics::null_counts() to determine containers where columns are entirely NULL
Rewrite the predicate, replacing references to columns known to be NULL with a NULL literal and try to simplify the expressions (e.g. a = 5 --> NULL = 5 --> NULL)

For the example in this ticket's description with predicate col_a != A AND col_b='bananas' where col_b is not known and the relevant container had 100 rows,

the relevant PruningStatistics would return col_b: {null_count = 100, row_count = 100}
PruningPredicate::prune would determine col_b was entirely null, and would rewrite the predicate to be col_a != A AND NULL = 'bananas'.
The pruning rewrite would happen again, and this time would not try to fetch min/max statistics for col_b and thus could be proven to be not true.
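The steps above can be sketched as follows. This is an illustration under stated assumptions, not the DataFusion implementation: the nested-tuple expression encoding and the `all_null_columns`, `rewrite` and `never_true` helpers are hypothetical, and `null_counts` / `row_count` are plain Python values standing in for what PruningStatistics::null_counts() and the proposed PruningStatistics::row_counts() would report for one container.

```python
NULL = ("lit", None)


def all_null_columns(null_counts, row_count):
    """Columns whose null count equals the container's row count."""
    return {col for col, nulls in null_counts.items() if nulls == row_count}


def rewrite(expr, null_cols):
    """Replace all-null column references with a NULL literal and simplify."""
    op = expr[0]
    if op == "col":
        return NULL if expr[1] in null_cols else expr
    if op == "lit":
        return expr
    args = [rewrite(arg, null_cols) for arg in expr[1:]]
    if op in ("eq", "ne") and NULL in args:
        return NULL  # e.g. NULL = 'bananas' simplifies to NULL
    return (op, *args)


def never_true(expr):
    """True when the rewritten predicate provably cannot evaluate to true."""
    if expr == NULL or expr == ("lit", False):
        return True
    if expr[0] == "and":
        return any(never_true(arg) for arg in expr[1:])
    return False


# col_a != A AND col_b = 'bananas', for a container with 100 rows
predicate = (
    "and",
    ("ne", ("col", "col_a"), ("lit", "A")),
    ("eq", ("col", "col_b"), ("lit", "bananas")),
)
null_cols = all_null_columns({"col_a": 0, "col_b": 100}, row_count=100)
rewritten = rewrite(predicate, null_cols)
print(rewritten)
print(never_true(rewritten))
```

After the rewrite, the col_b comparison has become a NULL literal, so no min/max statistics for col_b are needed to conclude that the conjunction can never be true.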

alamb · 2024-01-02T16:03:09Z

I plan to work on this this week

alamb · 2024-01-12T15:59:30Z

Next steps: @appletreeisyellow and I will write up a proposal and share it around

appletreeisyellow · 2024-01-16T16:46:39Z

We will post a proposal sometime this week

alamb · 2024-01-31T18:50:15Z

Update is that @appletreeisyellow and I have been working on a design for this feature, and we hope to have a proposal later this week

alamb · 2024-02-09T02:15:15Z

After quite a bit of study, @appletreeisyellow and I have realized that this ticket describes something different from knowing a column is ALL null. This ticket describes handling predicates with only known information and, quoting the description:

"The predicate can reference columns that are not currently in memory or known at query time."

In the use case we have in InfluxDB, this means getting a predicate that references columns about which nothing is known at all (neither the schema nor the values are known).

I am not sure how common this use case is across implementations, and I cannot come up with any good solution.

Thus, I filed a separate ticket to track handling columns whose types are known and which are known to be all NULL (also a use case we have in InfluxDB), which I do think is a common use case of general interest: #9171

alamb · 2024-02-12T11:38:21Z

Let's focus on #9171 and then come back to this idea

alamb · 2024-05-10T12:24:24Z

Update here is that @appletreeisyellow has much improved the ability of DataFusion to prune when a column is known to be entirely null (aka #9171)

What remains is the ability to prune when nothing is known about the column (not even its schema), a situation that arises in certain places in our InfluxDB 3.0 codebase.

From my perspective, given the potential complexity of implementing this feature in DataFusion, we may decide never to work on it and instead pursue other ways of achieving the same end. I don't think we have decided.

adriangb · 2024-09-28T19:35:45Z

I think I've gotten this working. I pass in a schema that only has the columns I have stats for. Other columns seem to be ignored. I also wrap the predicate with (<predicate>) IS NOT FALSE which, unless I'm getting my tri-state logic wrong, means that any nulls in any column mean we can't exclude that row/page/row group, allowing for schema evolution in collected statistics.
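The tri-state reasoning behind the IS NOT FALSE wrapper can be checked with a small sketch. It models NULL as Python's `None`; `is_not_false` and `max_at_least` are hypothetical helpers for illustration, not DataFusion functions.

```python
def is_not_false(value):
    # SQL: (expr) IS NOT FALSE is true for TRUE and for NULL, false only for FALSE.
    return value is not False


def max_at_least(col_max, threshold):
    # A pruning-style check such as col_max >= 5; a missing statistic is NULL.
    return None if col_max is None else col_max >= threshold


# Keep a container unless the wrapped predicate is definitely false.
for col_max in (10, 3, None):
    print(col_max, is_not_false(max_at_least(col_max, 5)))
```

Only the container whose maximum is known to be below the threshold is excluded; a container with a missing (NULL) statistic is kept.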

domodwyer added the enhancement New feature or request label Oct 19, 2023

alamb mentioned this issue Nov 30, 2023

Epic: Statistics improvements #8227

Open

19 tasks

alamb changed the title ~~Don't error on unknown column when pruning if predicate can be proven false~~ Don't error on unknown column when pruning if predicate can still be proven false Dec 5, 2023

alamb mentioned this issue Dec 28, 2023

Implement the contained method of RowGroupPruningStatistics #8669

Closed

alamb mentioned this issue Jan 1, 2024

DataFusion weekly project plan (Andrew Lamb) - Jan 1, 2024 #8704

Closed

9 tasks

alamb self-assigned this Jan 2, 2024

This was referenced Jan 3, 2024

fix guarantees in allways_true of PruningPredicate #8732

Merged

Minor: Improve PruningPredicate docstrings #8748

Merged

DataFusion weekly project plan (Andrew Lamb) - Jan 8, 2024 #8786

Closed

alamb mentioned this issue Jan 10, 2024

[Minor] extract const and add doc and more tests for in_list pruning #8815

Merged

alamb removed their assignment Jan 12, 2024

alamb mentioned this issue Jan 14, 2024

DataFusion weekly project plan (Andrew Lamb) - Jan 15, 2024 #8864

Closed

9 tasks

alamb assigned appletreeisyellow Jan 16, 2024

alamb mentioned this issue Jan 21, 2024

DataFusion weekly project plan (Andrew Lamb) - Jan 22, 2024 #8933

Closed

9 tasks

alamb mentioned this issue Jan 28, 2024

DataFusion weekly project plan (Andrew Lamb) - Jan 29, 2024 #9030

Closed

6 tasks

This was referenced Feb 4, 2024

DataFusion weekly project plan (Andrew Lamb) - Feb 5, 2024 #9121

Closed

Support "A column is known to be entirely NULL" in PruningPredicate #9171

Closed

alamb mentioned this issue Feb 9, 2024

Add example of using PruningPredicate to datafusion-examples #9183

Merged

alamb unassigned appletreeisyellow Feb 12, 2024

appletreeisyellow mentioned this issue Feb 13, 2024

chore(pruning): Support IS NOT NULL predicates in PruningPredicate #9208

Merged

alamb mentioned this issue Sep 28, 2024

Add unhandled hook to PruningPredicate #12606

Closed
