Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow place holders like $1 in more types of queries. #13632

Merged
merged 6 commits into from
Dec 5, 2024

Conversation

davisp
Copy link
Member

@davisp davisp commented Dec 3, 2024

Previously, a query like SELECT $1; would fail to generate a LogicalPlan. With this change these queries are now workable. This required creating a new LogicalPlan::validate_parametere_types that is called from Optimizer::optimize to assert that all parameter types have been inferred.

Which issue does this PR close?

Closes #5617

This might also close an untracked step from #4539 which is listed as "Support PREPARE statements without explicit parameters". The tests I updated mentioned a TODO for supporting placeholders without types which is now shown to be fixed in this PR. But I'm uncertain enough on the listed step in that epic to pre-emptively create an issue to attach to it. If this does cover it, I would be more than happy to open the issue and add a comment to the epic as well as updating this PR description.

Rationale for this change

Initially I just wanted to support SELECT $1; in a FlightSQL server. It appears fixing that has also fixed a number of other issues.

What changes are included in this PR?

This removes the error from ExprSchemable::get_type when a Placeholder has no specified type. I believe this error to be a bug based on the fact that its suggesting that the error can be fixed by providing parameter values and the fact that the error is triggered when creating a LogicalPlan.

I've instead added a new LogicalPlan::validate_placeholder_types method which is invoked at the start of the Optimizer::optimize method.

I should note, that this PR may also be related to #8819. While developing this, I noticed that the Optimizer seems perfectly happy to optimize plans where some datatypes are DataType::Null. I have absolutely nowhere near enough context to know if that's something that should be allowed or not so I added the validate_placeholder_types so that the old behavior is maintained where optimization fails on untyped placeholders.

Are these changes tested?

Yep. I've updated/removed old tests that asserted the broken behavior and added tests that cover both my case of SELECT $1; as well as SELECT * FROM foo WHERE col LIKE $1 which covers #5617.

Are there any user-facing changes?

Some queries that would have failed planning may now fail at optimization time instead? Not sure if that's important. Also, because I was moving the error message anyway, I included the Placeholder.id in the error text which seemed like it'd help improve users' lives.

Previously, a query like `SELECT $1;` would fail to generate a
LogicalPlan. With this change these queries are now workable. This
required creating a new LogicalPlan::validate_parametere_types that is
called from Optimizer::optimize to assert that all parameter types have
been inferred correctly.
@github-actions github-actions bot added sql SQL Planner logical-expr Logical plan and expressions optimizer Optimizer rules core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels Dec 3, 2024
davisp added a commit to davisp/datafusion-expr that referenced this pull request Dec 3, 2024
This is a straight copy of the `datafusion/expr` directory in the
`apache/datafusion` git repository. This repo exist while my PR to
Datafusion isn't on crates.io.

apache/datafusion#13632
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @davisp -- very much appreciated.

I have some suggestions on how to make this PR better (specifically by not adding a new check, but using an existing check). However, I think we could improve it as a follow on PR as well and this could be merged as is.

Thank you again

@@ -2014,7 +2014,7 @@ async fn test_dataframe_placeholder_missing_param_values() -> Result<()> {
assert!(optimized_plan
.unwrap_err()
.to_string()
.contains("Placeholder type could not be resolved. Make sure that the placeholder is bound to a concrete type, e.g. by providing parameter values."));
.contains("Placeholder type for '$0' could not be resolved. Make sure that the placeholder is bound to a concrete type, e.g. by providing parameter values."));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

);

// The placeholder is not replaced with a value,
// so the filter data type is not know, i.e. a = $0.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// so the filter data type is not know, i.e. a = $0.
// so the filter data type is not known, i.e. a = $0.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed here: e1d66ac

// so the filter data type is not know, i.e. a = $0.
// Therefore, the optimization fails.
let optimized_plan = ctx.state().optimize(logical_plan);
assert!(optimized_plan.is_err());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this line is redundant -- as unwrap_err() will panic if optimize_plan is not an err

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed here: e1d66ac

);

// The placeholder is not replaced with a value,
// so the filter data type is not know, i.e. a = $0.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// so the filter data type is not know, i.e. a = $0.
// so the filter data type is not known, i.e. a = $0.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed here: 3d645e0

@@ -357,6 +357,7 @@ impl Optimizer {
{
let start_time = Instant::now();
let options = config.options();
plan.validate_parameter_types()?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checking parameter types doesn't seem to logically belong to the optimizer which is supposed to preserve the semantics of each plan but potentially make them faster.

https://docs.rs/datafusion/latest/datafusion/optimizer/trait.OptimizerRule.html

OptimizerRules transforms one LogicalPlan into another which computes the same results, but in a potentially more efficient way. If there are no suitable transformations for the input plan, the optimizer should simply return it unmodified.

I think the Analyzer might be a more appropriate place to put this check:

https://docs.rs/datafusion/latest/datafusion/optimizer/struct.Analyzer.html#method.execute_and_check

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although thinking about it some more, if we are going to defer the error anyways, I think the cleanest solution would be to move the error to when it is needed

When I removed the specific check in the Optimizer and tried to run the queries, it failed with this error:

This feature is not implemented: Physical plan does not support logical expression Placeholder(Placeholder { id: "$1", data_type: None })

Maybe rather than a special check we could simply make this error better

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(we can do this in a follow on PR)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought your suggestion was a lot cleaner and have pushed e5325c3 to address it. I've updated the dataframe tests, added a few to the core/tests/sql/select.rs module, and added a PREPARE test to the sqllogic test showing how things fail there as well.

Let me know if you can think of anywhere else I should add tests.

@@ -571,10 +571,6 @@ Dml: op=[Insert Into] table=[test_decimal]
"INSERT INTO person (id, first_name, last_name) VALUES ($1, $2, $3, $4)",
"Error during planning: Placeholder $4 refers to a non existent column"
)]
#[case::placeholder_type_unresolved(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens now in this case? Does it pass successfully?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, this PR is intentionally about allowing the creation of LogicalPlan's with placeholders. Previously, LogicalPlan creation would fail when any placeholder is part of the schema. Given that all this test is doing is creating LogicalPlan's, this example now passes. I thought about moving it to a new test that explicitly shows it failing at the optimization step but then felt that that case was already covered by the other test updates.

datafusion/expr/src/expr_schema.rs Show resolved Hide resolved
@alamb
Copy link
Contributor

alamb commented Dec 4, 2024

because I was moving the error message anyway, I included the Placeholder.id in the error text which seemed like it'd help improve users' lives.

This is great ❤️

@github-actions github-actions bot added physical-expr Physical Expressions and removed optimizer Optimizer rules labels Dec 4, 2024
Previously, these errors would occurr during optimization. Now that we
allow unbound placeholders through the optimizer they fail when creating
the physical plan instead.
@davisp davisp force-pushed the allow-untyped-prepare-parameters branch from 53f8e91 to e5325c3 Compare December 4, 2024 17:13
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is beautiful -- thank you so much @davisp

@alamb alamb changed the title Allow place holders in the column list Allow place holders like $1 in more types of queries. Dec 5, 2024
@alamb alamb merged commit b30c200 into apache:main Dec 5, 2024
26 checks passed
@jonahgao jonahgao mentioned this pull request Dec 5, 2024
5 tasks
zhuliquan pushed a commit to zhuliquan/datafusion that referenced this pull request Dec 6, 2024
* Allow place holders in the column list

Previously, a query like `SELECT $1;` would fail to generate a
LogicalPlan. With this change these queries are now workable. This
required creating a new LogicalPlan::validate_parametere_types that is
called from Optimizer::optimize to assert that all parameter types have
been inferred correctly.

* Remove redundant asserts

* Fix typo in comments

* Add comment explaining DataType::Null

* Move unbound placeholder error to physical-expr

Previously, these errors would occurr during optimization. Now that we
allow unbound placeholders through the optimizer they fail when creating
the physical plan instead.

* Fix expected error message
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate logical-expr Logical plan and expressions physical-expr Physical Expressions sql SQL Planner sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support parameter ($1) type inference for LIKE predicates
2 participants