Add example of using PruningPredicate to datafusion-examples #9183

Merged: 4 commits, Feb 10, 2024

Conversation

@alamb alamb commented Feb 9, 2024

Which issue does this PR close?

Part of #7013

Related to #7869 and #9171

Rationale for this change

  1. We rely heavily on PruningPredicate in InfluxDB to prune data based on catalog information, so I want an easier way to point our engineers at it and help them understand how it works.
  2. We also had some non-trivial confusion internally about how (and whether) pruning predicates handle unknown column values, which I wanted to document.

What changes are included in this PR?

  1. Add pruning.rs example to datafusion-examples with an annotated guide to using `PruningPredicate`
  2. Add link to the example in the PruningPredicate API docs

Are these changes tested?

Yes, as part of CI

Are there any user-facing changes?

A new example, no code changes

@github-actions github-actions bot added the core Core DataFusion crate label Feb 9, 2024
// File 2: `x = 5 AND y = 10` can never evaluate to true because y
// has only the value of 7. Thus this file can be skipped.
false,
// File 3: `x = 5 AND y = 10` can never evaluate to true because x
Contributor Author

FYI @appletreeisyellow here is an actual example showing that the pruning predicate does the right thing with unknown column values

Contributor

File 3 example makes sense to me 👍

Contributor

I'm curious what the result will be for a file 4 like:

File 4: x has values between 4 and 6
nothing is known about the value of y

With the same predicate x = 5 AND y = 10, my understanding is that it will evaluate to true.

x = 5 AND y = 10
--> true AND null
--> null

Since y is unknown, there is a possibility that y is 10 in this file / partition / row group of data. Thus this file cannot be skipped and the result is true.

Contributor Author

With the same predicate x = 5 AND y = 10, my understanding is that it will evaluate to true.

Yes, this is my understanding too (that the PruningPredicate will return true for this container)

Since y is unknown, there is a possibility that y is 10 in this file / partition / row group of data. Thus this file cannot be skipped and the result is true.

Yes, that is my understanding as well
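
For readers following the File 4 discussion, here is a minimal, self-contained sketch of the reasoning in plain Rust. It is a toy model, not DataFusion's implementation and not the `PruningPredicate` API: the names `Range`, `may_equal`, `and3`, and `keep_file` are hypothetical, and the x range used for the "y is only 7" case is an assumption made for illustration. It models `x = 5 AND y = 10` with three-valued logic, where `None` stands for "unknown", and keeps a file unless the predicate is provably false.

```rust
/// Known (min, max) range of a column in one file.
/// `None` means no statistics are available (unknown, NOT "all null").
type Range = Option<(i64, i64)>;

/// Could `col = value` hold for some row? `None` means "unknown".
fn may_equal(range: Range, value: i64) -> Option<bool> {
    range.map(|(min, max)| min <= value && value <= max)
}

/// SQL-style three-valued AND: false wins over unknown,
/// and unknown wins over true.
fn and3(a: Option<bool>, b: Option<bool>) -> Option<bool> {
    match (a, b) {
        (Some(false), _) | (_, Some(false)) => Some(false),
        (Some(true), Some(true)) => Some(true),
        _ => None,
    }
}

/// Evaluate `x = 5 AND y = 10` against file-level ranges.
/// The file is skipped only when the result is provably false.
fn keep_file(x: Range, y: Range) -> bool {
    and3(may_equal(x, 5), may_equal(y, 10)) != Some(false)
}

fn main() {
    // File 4 from the thread: x is between 4 and 6, y is unknown.
    let file4 = keep_file(Some((4, 6)), None);
    // y is known to be exactly 7 (x range is an assumed value here).
    let y_only_7 = keep_file(Some((4, 6)), Some((7, 7)));
    println!("file 4 kept: {}", file4);
    println!("y-only-7 file kept: {}", y_only_7);
}
```

In this model File 4 is kept because `true AND null` is `null`, and only a definite `false` allows a skip. A file where y is known to be 7 is skipped even if x is unknown, since `null AND false` is `false`.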

@comphead comphead left a comment

Nice @alamb, I love reviewing docs like this, as they give more understanding.
There is likely a typo.

alamb commented Feb 9, 2024

Nice @alamb, I love reviewing docs like this, as they give more understanding. There is likely a typo.

Nice eyes -- thanks @comphead

BTW if you like reading background material, just wait for #9184 :)

@appletreeisyellow appletreeisyellow left a comment

Thank you for adding examples @alamb. Super helpful! I left a question for a new example and a suggestion

Comment on lines +123 to +125
// Note, returning null means the value isn't known, NOT
// that we know the entire column is null.
(None, None),
Contributor

Nice!

Contributor Author

That probably looks familiar :)
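
To illustrate the distinction made in the code comment above (a null statistic means "not known", not "the column is entirely null"), here is a hedged toy sketch in plain Rust. The `ColumnStats` type and `may_contain` function are hypothetical and are not DataFusion APIs; how DataFusion itself treats null counts is not covered in this thread. The sketch assumes a simple `col = value` predicate.

```rust
/// Toy per-file statistics for one column (hypothetical type).
#[derive(Clone, Copy)]
struct ColumnStats {
    /// Known (min, max) of the non-null values, or `None` if unknown.
    min_max: Option<(i64, i64)>,
    /// Known number of null rows, or `None` if unknown.
    null_count: Option<u64>,
    /// Total number of rows in the file.
    row_count: u64,
}

/// Returns false only when `col = value` provably matches no row.
fn may_contain(stats: ColumnStats, value: i64) -> bool {
    // Known all-null column: `NULL = value` is never true, so skip.
    if stats.null_count == Some(stats.row_count) {
        return false;
    }
    match stats.min_max {
        Some((min, max)) => min <= value && value <= max,
        // Unknown range: nothing can be proven, so keep the file.
        None => true,
    }
}

fn main() {
    // Like `(None, None)` in the example: the statistics are unknown.
    let unknown = ColumnStats { min_max: None, null_count: None, row_count: 100 };
    // A different situation: every row is known to be null.
    let all_null = ColumnStats { min_max: None, null_count: Some(100), row_count: 100 };
    println!("unknown stats kept: {}", may_contain(unknown, 10));
    println!("all-null column kept: {}", may_contain(all_null, 10));
}
```

The two cases look similar (no min/max in either) but lead to opposite decisions in this model, which is why the example's comment spells out what a null statistic means.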

datafusion-examples/examples/pruning.rs (outdated, resolved)
@comphead comphead left a comment

lgtm thanks @alamb

@comphead comphead merged commit a48e271 into apache:main Feb 10, 2024
22 checks passed
@alamb alamb deleted the alamb/pruning_example branch February 10, 2024 21:29