Add unhandled hook to PruningPredicate #12606
cc @alamb would appreciate a review!
pub trait UnhandledPredicateHook {
    /// Called when a predicate can not be handled by DataFusion's transformation rules
    /// or is referencing a column that is not in the schema.
    fn handle(&self, expr: &Arc<dyn PhysicalExpr>) -> Arc<dyn PhysicalExpr>;
}
This could be a closure but I had issues with lifetimes, etc. Having the trait also gives it a useful name 😄
The other API questions are:
- Should this be mutable? I think implementers can just use interior mutability if needed.
- Should this make it easier to say "use the existing expression"? I don't think that's a common case, and the current APIs use &Arc<dyn PhysicalExpr> -> Arc<dyn PhysicalExpr> as well. Plus it's as easy as a Clone on an Arc.
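To make the two API questions above concrete, here is a minimal sketch of an implementer that mutates state through `&self` and returns the existing expression. It is self-contained, so `PhysicalExpr` below is a tiny stand-in for DataFusion's trait (the real one has many more methods), and `Column`, `describe` and `CountingPassthrough` are hypothetical names used only for illustration.

```rust
use std::fmt::Debug;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

// Stand-in for DataFusion's `PhysicalExpr` (assumption: heavily simplified).
pub trait PhysicalExpr: Debug + Send + Sync {
    fn describe(&self) -> String;
}

// Hypothetical expression type, only here so there is something to pass around.
#[derive(Debug)]
pub struct Column(pub String);

impl PhysicalExpr for Column {
    fn describe(&self) -> String {
        format!("column {}", self.0)
    }
}

// The trait with the signature proposed in this PR.
pub trait UnhandledPredicateHook {
    fn handle(&self, expr: &Arc<dyn PhysicalExpr>) -> Arc<dyn PhysicalExpr>;
}

// Hypothetical implementer: counts calls via interior mutability, so `handle`
// can keep taking `&self`.
#[derive(Default)]
pub struct CountingPassthrough {
    calls: AtomicUsize,
}

impl CountingPassthrough {
    pub fn calls(&self) -> usize {
        self.calls.load(Ordering::Relaxed)
    }
}

impl UnhandledPredicateHook for CountingPassthrough {
    fn handle(&self, expr: &Arc<dyn PhysicalExpr>) -> Arc<dyn PhysicalExpr> {
        self.calls.fetch_add(1, Ordering::Relaxed);
        // "Use the existing expression" is just a reference-count bump.
        Arc::clone(expr)
    }
}

fn main() {
    let hook = CountingPassthrough::default();
    let expr: Arc<dyn PhysicalExpr> = Arc::new(Column("other".to_string()));
    let out = hook.handle(&expr);
    println!("{} (calls: {})", out.describe(), hook.calls());
}
```

An atomic counter is only one option; a `Mutex` or `RefCell` would work the same way if the implementer needs richer state.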
Perhaps you can rewrite the predicate before passing it to the parquet exec or the
I don't understand the benefit that is obtained by doing the rewrite during the pruning predicate rewrite 🤔
It might also be good to look at https://docs.rs/datafusion/latest/datafusion/physical_optimizer/pruning/struct.PruningPredicate.html#method.literal_guarantees which you might be able to use to apply your index
The issue is that PruningPredicate discards (by returning
Basically I want to take the predicate
I admit I'm still a bit confused about
I'll add that I've been using this (as in this change + an actual implementation that uses it) in production for a couple days now and it works amazingly. It's taken some queries from >3s to <1s (from downloading all of a column for all of time to a <100ms lookup in a Postgres index).
I see -- what I am not understanding is why you need to do this rewrite as part of the PruningPredicate logic (which is already complicated). Why can't you do the rewrite/transformation before passing the predicate to
Here's an example:
use std::sync::Arc;

use arrow_schema::{DataType, Field, Schema};
use datafusion::{common::DFSchema, physical_optimizer::pruning::PruningPredicate, prelude::*};

fn main() {
    let ctx = SessionContext::new();
    let schema = Arc::new(Schema::new(vec![Field::new("col", DataType::Int32, true)]));
    let df_schema = DFSchema::try_from(schema.clone()).unwrap();

    // An expression that PruningPredicate doesn't understand becomes `true`
    let expr = ctx.parse_sql_expr("col = ANY([1, 2])", &df_schema).unwrap();
    println!("expr: {:?}", expr);
    let phys_expr = ctx.create_physical_expr(expr, &df_schema).unwrap();
    println!("phys_expr: {:?}", phys_expr);
    let pruning = PruningPredicate::try_new(phys_expr, schema.clone()).unwrap();
    let pruning_expr = pruning.predicate_expr().clone();
    println!("pruning_expr: {:?}", pruning_expr);
    // pruning_expr: Literal { value: Boolean(true) }

    // An expression referencing columns that don't have statistics collected
    // (i.e. aren't in the schema) causes an Err
    let expr = ctx.parse_sql_expr("other = 1", &df_schema).unwrap();
    println!("expr: {:?}", expr);
    let phys_expr = ctx.create_physical_expr(expr, &df_schema).unwrap();
    println!("phys_expr: {:?}", phys_expr);
    PruningPredicate::try_new(phys_expr, schema.clone()).unwrap();
    // SchemaError(FieldNotFound { field: Column { relation: None, name: "other" }, valid_fields: [Column { relation: None, name: "col" }] }, Some(""))
}
If I do the rewrite before PruningPredicate then I end up with just
I don't see a rewrite in your example. I would have expected that you wrote something that substituted
What rewrite are you doing in your actual production system?
This sounds quite similar to #7869, maybe
The point is the rewrite I want to do is |
I am sorry, I don't quite follow your example.
I found the PruningPredicate logic very tricky (deciding when you can be absolutely sure that no rows will match needs to be 100% precise), especially in the context of tristate logic. The pruning predicate logic has to treat "I don't know if the predicate could pass" (typically NULL in evaluation) differently from the predicate actually evaluating to NULL on the actual row.
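As a rough illustration of the tristate point, here is a toy model where `None` stands for "I don't know if the predicate could pass". This is a simplified sketch, not DataFusion's implementation; `and3`, `could_match_eq` and `keep_container` are hypothetical helpers, and the statistics are plain `Option<i32>` values.

```rust
/// Three-valued AND, where `None` means "unknown".
fn and3(a: Option<bool>, b: Option<bool>) -> Option<bool> {
    match (a, b) {
        // A definite `false` on either side wins, even if the other is unknown.
        (Some(false), _) | (_, Some(false)) => Some(false),
        (Some(true), Some(true)) => Some(true),
        _ => None,
    }
}

/// Could `col = value` match any row in a container, given optional
/// min/max statistics for `col`? Missing statistics give "unknown".
fn could_match_eq(min: Option<i32>, max: Option<i32>, value: i32) -> Option<bool> {
    let lower = min.map(|m| m <= value);
    let upper = max.map(|m| m >= value);
    and3(lower, upper)
}

/// A container may be skipped only when the answer is definitely `false`;
/// "unknown" must keep it.
fn keep_container(could_match: Option<bool>) -> bool {
    could_match != Some(false)
}

fn main() {
    let cases = [
        (Some(1), Some(10), 5),
        (Some(1), Some(10), 50),
        (None, Some(10), 5),
        (None, Some(10), 50),
    ];
    for (min, max, value) in cases {
        let answer = could_match_eq(min, max, value);
        println!(
            "min={:?} max={:?} value={} -> {:?}, keep={}",
            min, max, value, answer, keep_container(answer)
        );
    }
}
```

The third case is the one that matters: with a missing minimum the result stays unknown, so the container has to be kept rather than pruned.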
To follow up here, my conclusion is that the current PR, as it is implemented now, likely has only one user but makes the overall rewrite logic harder to follow (and it is already complicated enough).
As I understand it, the real use case here is not to use the PruningPredicate directly to prune values, but instead to use the same rewrite logic to turn the predicate into a query to run elsewhere (Postgres, as I understand).
I suggest either:
Hopefully that makes sense
Makes sense, thank you Andrew. If I wanted to go with option (1) I would still need some way to control what the rewrite does when faced with expressions it does not recognize. Would you in that case accept something along the lines of the hook proposed in this PR? If not I think I'll have to go with (2), not a big deal but I would rather not have to vendor the code.
Yeah, I would expect there to be some sort of hook in the rewrite logic (with a default implementation to replace with
I think if done right, it could be quite elegant and better separate out the expression rewrite from the rest of the pruning predicate code
Marking as draft as I think this PR is no longer waiting on review and I am trying to clear the review backlog
I have a secondary index with min/max stats columns that is compatible with PruningPredicate's rewrites.
I now want to add an index for point lookups (I plan on implementing it as a column with distinct array values, but that's a bit of an implementation detail).
The point is that when PruningPredicate encounters this column (for which there are no stats, and which it doesn't recognize because I only pass in Fields for which there are stats) it currently returns true, such that
a_column_with_stats = 123 and a_point_lookup_column = 'abc'
becomes
a_column_with_stats_min <= 123 and a_column_with_stats_max >= 123 and true
(ignoring nulls, maybe simplifying other bits), but I want it to become
a_column_with_stats_min <= 123 and a_column_with_stats_max >= 123 and a_point_lookup_column @> '{abc}'::text[]
or something like that.
I don't think it's reasonable to add APIs to DataFusion for this specific case since it depends on implementation details outside of DataFusion's control, but I also can't easily work around it on my end (I'd have to re-implement all of PruningPredicate). So I'm hoping that adding this hook is acceptable 😄
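The rewrite described above can be sketched with a toy expression type. This is not DataFusion's `PruningPredicate`; `Expr`, `rewrite`, `render` and `point_lookup_hook` are hypothetical names, literals are kept as plain strings, and nulls are ignored. The hook is a plain function reference here only to keep the example short.

```rust
use std::collections::HashSet;

// Toy expression tree (assumption: far simpler than DataFusion's PhysicalExpr).
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Eq(String, String),   // column = literal
    LtEq(String, String), // column <= literal
    GtEq(String, String), // column >= literal
    And(Box<Expr>, Box<Expr>),
    True,
    Raw(String), // opaque text produced by a hook
}

// Rewrite `col = lit` into min/max checks when stats exist for `col`;
// everything else is handed to the hook.
fn rewrite(expr: &Expr, stats_cols: &HashSet<&str>, hook: &dyn Fn(&Expr) -> Expr) -> Expr {
    match expr {
        Expr::And(l, r) => Expr::And(
            Box::new(rewrite(l, stats_cols, hook)),
            Box::new(rewrite(r, stats_cols, hook)),
        ),
        Expr::Eq(col, lit) if stats_cols.contains(col.as_str()) => Expr::And(
            Box::new(Expr::LtEq(format!("{}_min", col), lit.clone())),
            Box::new(Expr::GtEq(format!("{}_max", col), lit.clone())),
        ),
        other => hook(other),
    }
}

// Hypothetical hook: turn an equality on the point lookup column into an
// array-containment check, and fall back to `true` otherwise.
fn point_lookup_hook(expr: &Expr) -> Expr {
    match expr {
        Expr::Eq(col, lit) if col.as_str() == "a_point_lookup_column" => {
            let bare = lit.trim_matches('\'');
            Expr::Raw(format!("{} @> '{{{}}}'::text[]", col, bare))
        }
        _ => Expr::True,
    }
}

fn render(expr: &Expr) -> String {
    match expr {
        Expr::Eq(c, l) => format!("{} = {}", c, l),
        Expr::LtEq(c, l) => format!("{} <= {}", c, l),
        Expr::GtEq(c, l) => format!("{} >= {}", c, l),
        Expr::And(l, r) => format!("{} and {}", render(l), render(r)),
        Expr::True => "true".to_string(),
        Expr::Raw(s) => s.clone(),
    }
}

fn main() {
    let stats = HashSet::from(["a_column_with_stats"]);
    let pred = Expr::And(
        Box::new(Expr::Eq("a_column_with_stats".into(), "123".into())),
        Box::new(Expr::Eq("a_point_lookup_column".into(), "'abc'".into())),
    );
    // Current behaviour: the unrecognized column becomes `true`.
    println!("{}", render(&rewrite(&pred, &stats, &|_: &Expr| Expr::True)));
    // With the hook: the unrecognized column becomes a point lookup.
    println!("{}", render(&rewrite(&pred, &stats, &point_lookup_hook)));
}
```

The only difference between the two calls is the fallback for expressions the rewrite does not recognize, which is the extension point this PR asks for.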