Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't error on unknown column when pruning if predicate can still be proven false #7869

Open
domodwyer opened this issue Oct 19, 2023 · 13 comments
Labels
enhancement New feature or request

Comments

@domodwyer
Copy link
Contributor

Is your feature request related to a problem or challenge?

At query time, our use case requires that we evaluate predicates against in-memory data that may have a schema that is a subset of the table schema. The predicate can reference columns that are not currently in memory or known at query time.

For example, given the following in-memory data:

col_a value
A 42

We may have to evaluate a predicate such as col_a != A AND col_b=bananas. Where col_b is not present in the in-memory schema / unknown at pruning time, but is a valid column for the table in the system as a whole.

Because at query time we have a limited subset of the schema, the schema and statistics provided when constructing the PruningPredicate covers only col_a, value.

However the col_a != A portion of the predicate can be proven FALSE irrespective of col_b. Unfortunately constructing the PruningPredicate eagerly validates the presence of statistics for all columns in the predicate, and errors stating that there are no fields named col_b before attempting to evaluate any portion of the predicate.

Describe the solution you'd like

Attempt to evaluate the predicate based on the available statistics, and return FALSE if possible. If the predicate cannot be proven FALSE, return a "missing column" error as it does today.

For the example above, ideally pruning should return FALSE as it can be proven that col_a != A is FALSE even though col_b is unknown at pruning time.

Describe alternatives you've considered

Inserting NULL statistics into the pruning schema to satisfy the presence check - this works around the issue, but unfortunately requires extra processing to prevent the missing field error.

Additional context

This change in behaviour might need sticking behind a flag/option to opt into, rather than being the default.

@domodwyer domodwyer added the enhancement New feature or request label Oct 19, 2023
@alamb
Copy link
Contributor

alamb commented Oct 19, 2023

I agree this would be a nice improvement to the pruning logic.

@alamb
Copy link
Contributor

alamb commented Oct 20, 2023

I am not sure how easy this will be to implement with the current implementation of PruningPredicate -- It may need something more substantial like #7887

@domodwyer
Copy link
Contributor Author

Yeah I took a brief look and came to a similar conclusion - it seems to be a significant rework to enable this behaviour as-is. #7887 sounds like a great idea!

@alamb alamb changed the title Don't error on unknown column when pruning if predicate can be proven false Don't error on unknown column when pruning if predicate can still be proven false Dec 5, 2023
@alamb
Copy link
Contributor

alamb commented Dec 30, 2023

I think this also affects scanning parquet files with "evolved" schemas -- namely where missing columns are replaced by NULL

For example, in a parquet table that has two files:

  • file1: Columns a and b
  • file2: Columns b (does not have column a

If there is a query with a predicate on a, like WHERE a = 5 the current logic in the parquet reader will rewrite this to NULL = a when scanning file2 but is not then smart enough to understand the expression can thus never be true and should skip the file. The same reasoning can be applied to many more complicated expressions that can never evaluate to true for file2 such as

  • WHERE a = 5 AND b = 'foo;
  • WHERE CASE WHEN a > 5 THEN b = 'foo' WHEN a < 5 THEN b = 'bar' ELSE false END
  • WHERE a IN (1,2,3)

@alamb
Copy link
Contributor

alamb commented Dec 30, 2023

Here is an idea on how to extend PruningPredicate to handle this case

Problem:

PruningPredicate can't be told about columns that are known to contain only NULL. It can be told which columns have no nulls (via the PruningStatistics::null_counts()).

I think we could teach PruningPredicate about all null colums like this:

  1. Add a new method PruningStatistics::row_counts() to get the total row counts in each container.
  2. Use the information from PruningStatistics::row_counts() and PruningStatistics::null_counts() to determine containers where columns are entirely NULL
  3. Rewrite the predicate, replacing references to columns known to be NULL with a NULL literal and try to simplify the expressions (e.g. a = 5 --> NULL = 5 --> NULL)

For the example in this ticket's description with predicate col_a != A AND col_b='bananas' where col_b is not known and the relevant container had 100 rows,

  1. the relevant PruningStatistics would return col_b: {null_count = 100, row_count = 100}
  2. PruningPredicate::prune would determine col_b was entirely null, and would rewrite the predicate to be col_a != A AND NULL = 'bananas'.
  3. The pruning rewrite would happen again, and this time would not try to fetch min/max statistics for col_b and thus could be proven to be not true.

@alamb
Copy link
Contributor

alamb commented Jan 2, 2024

I plan to work on this this week

@alamb
Copy link
Contributor

alamb commented Jan 12, 2024

Next steps: @appletreeisyellow and I will write up a proposal and share it around

@appletreeisyellow
Copy link
Contributor

We will post a proposal sometime this week

@alamb
Copy link
Contributor

alamb commented Jan 31, 2024

Update is that @appletreeisyellow and I have been working on a design for this feature, and we hope to have a proposal later this week

@alamb
Copy link
Contributor

alamb commented Feb 9, 2024

After quite a bit of study, @appletreeisyellow and I have realized that this ticket describes something different than knowing a column is ALL null. This ticket describes handling predicates with only known information and

The predicate can reference columns that are not currently in memory or known at query time.

In the usecase we have in InfluxDB this means getting a predicate that references columns that nothing is known about at all (neither the schema nor the values are known).

I am not sure how common this usecase is across implementations, and I can not come up with any good solution.

Thus, I filed a separate ticket to track handling columns whose types are known, and are known to be all NULL (which is also a usecase we have in InfluxDB), which I do think is a common usecase that is of general use: #9171

@alamb
Copy link
Contributor

alamb commented Feb 12, 2024

Let's focus on #9171 and then come back to this idea

@alamb
Copy link
Contributor

alamb commented May 10, 2024

Update here is that @appletreeisyellow has much improved the ability of DataFusion to prune when a column is known to be entirely null (aka #9171)

What remains is the ability to prune when nothing (including not knowing schema) is known about the column which exists in certain places in our InfluxDB 3.0 codebase.

From my perspective, given the potential complexity of implementing this feature in DataFusion we may decide not to work on it ever and pursue other ways of achieving the same end. I don't think we have decided

@adriangb
Copy link
Contributor

I think I've gotten this working. I pass in a schema that only has the columns I have stats for. Other columns seem to be ignored. I also wrap the predicate with (<predicate>) IS NOT FALSE which unless I'm getting my tri state logic wrong means that any nulls in any column means we can't exclude that row/page/row group, allowing for schema evolution in collected statistics.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants