-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement prettier SQL unparsing (more human readable) #11186
Conversation
Hi @alamb this can use more testing, match cases and moving stuff around as I'm not too familiar with the unparser. Appreciate if you can provide other cases/expressions that should be tested. |
I plan to review this tomorrow |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking very cool @MohamedAbdeen21 -- thank you
I left a few suggestions for other tests.
FYI @phillipleblanc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great @MohamedAbdeen21! Very neat idea and execution. I have a few thoughts on simplifying/unifying the code.
Thanks @alamb for tagging me. 🙏
datafusion/sql/src/unparser/expr.rs
Outdated
pub fn pretty_expr_to_sql(&self, expr: &Expr) -> Result<ast::Expr> { | ||
let root_expr = self.expr_to_sql(expr)?; | ||
Ok(self.pretty(root_expr, LOWEST, LOWEST)) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't have an extra method here and would combine it with expr_to_sql
. The ast::Expr
it produces is logically the same as the input one, just with unnecessary nesting removed. In fact, you could even think about this as serving the same purpose as an optimizer rewrite pass for LogicalPlan - it should produce logically the same thing as the input, just more efficient.
datafusion/sql/src/unparser/expr.rs
Outdated
/// | ||
/// Also note that when fetching the precedence of a nested expression, we ignore other nested | ||
/// expressions, so precedence of expr `(a * (b + c))` equals `*` and not `+`. | ||
fn pretty( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we just have a single expr_to_sql
method, then it might make sense to rename this method as well to something like remove_unnecessary_nesting
to more accurately describe what it does. Then if we added more "rewrite passes" to make it prettier or achieve some other functional goal, then they would just be added as separate functions after this one.
@@ -218,37 +218,33 @@ impl Operator { | |||
} | |||
|
|||
/// Get the operator precedence | |||
/// use <https://www.postgresql.org/docs/7.0/operators.htm#AEN2026> as a reference | |||
/// use <https://www.postgresql.org/docs/7.2/sql-precedence.html> as a reference | |||
pub fn precedence(&self) -> u8 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like this is only used in a few places for formatting the parenthesis in a Display
trait on BinaryExpr
. I don't think that will have any major impact on library users.
Thanks @alamb and @phillipleblanc. Updated PR accordingly I'm aware of a failing test, will look into it later today. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great to me, thanks @MohamedAbdeen21!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
THanks @MohamedAbdeen21 and @phillipleblanc -- I think this is looking close
@@ -135,7 +135,9 @@ async fn query_parquet_demo() -> Result<()> { | |||
|
|||
/// DataFusion can parse a SQL text and convert it back to SQL using [`Unparser`]. | |||
async fn round_trip_parse_sql_expr_demo() -> Result<()> { | |||
let sql = "((int_col < 5) OR (double_col = 8))"; | |||
// unparser can also remove extra parentheses, | |||
// so `((int_col < 5) OR (double_col = 8))` will also produce the same SQL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
&mut PlannerContext::new(), | ||
)?; | ||
|
||
assert_eq!(expr.to_string(), pretty_expr.to_string()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this me veriying that the expr are the same (not just that the string is the same)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But they can't be the same, the whole idea behind removing parenthesis is to maintain the final result regardless of which operations get executed first/has the parenthesis aka associativity.
Here's an example from the tests:
("((id + 5) * (age * 8))", "(id + 5) * age * 8"),
Here's the tree of the first/original expr:
*
/ \
+ *
/ \ / \
id 5 age 8
And here's the cleaned one
*
/ \
* 8
/ \
+ age
/ \
id 5
These obviously don't match, but it doesn't matter because multiplication is associative. Almost all operators are associative except for division and subtraction, which are handled.
Comparing the string representation seemed like the best solution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you want to maintain such equivalence then the only case we remove parenthesis would be that the parenthesis give priority to the left side, which is the natural associativity of all binary operators (except =
I think)*. Or to the operation with higher (but not equal) precedence. Something like (a + b) + c
or true AND (a > b)
for example
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the response @MohamedAbdeen21 -- this makes sense
🤔 I am not sure if maintaining the structural equivalence of the input/output is important or if just the semantics are important to preserve
Perhaps @goldmedal @seddonm1 or @phillipleblanc or @devinjdangelo has thoughts on what the preferred behavior is
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's related to #11186 (comment). If our main purpose is providing SQL for full pushdown to other data sources, this behavior makes sense to me. For my current project (Wren), it's acceptable.
However, if our purpose is human readability, I think preserving the semantics is important. Maybe we can just remove the outermost brackets (they're exactly redundant) for a calculation expression like:
((3 + (5 + 6)) + 3) -> (3 + (5 + 6)) + 3
I'm not sure, but I think it might be more similar to what a human would write.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think maintaining structural equivalence is important as long as the correct answer would still be generated if the SQL were run.
But maybe this is exposing two opposing(?) goals of the unparser
- the goal for using it to generate SQL that will be run in another SQL query engine. And another goal for displaying it for human readability/debuggability. I'm in the first camp - it doesn't matter to me what the resulting SQL looks like as long as it runs correctly (although having nicer SQL does make debugging easier).
However, I would expect that we could serve both goals with a single unparser
implementation. Any work that is done to improve the SQL for human readability shouldn't come at the expense of making it invalid SQL for running in a separate query engine. That is why I thought it would be better to have just a single expr_to_sql
function instead of splitting the functionality.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the primary goal is pushdown for other systems, we have two options.
-
Completely ignore this issue. Systems can have very different precedence rules, and therefore, we can not guarantee correctness once some parentheses are removed. Also, it can be hard to decide which parentheses are "safe" to remove.
-
Informing DF of your target system operator precedence through implementing a dialect. This can be a lot of work for the dev, but I can see this being a desired option for some people; as Phillip pointed out, It can make debugging easier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But maybe this is exposing two opposing(?) goals of the unparser - the goal for using it to generate SQL that will be run in another SQL query engine. And another goal for displaying it for human readability/debuggability.
I think this is an excellent summary of the issue and given these two different goals there would be two different ideal outcomes.
Perhaps as @MohamedAbdeen21 suggests we can add some sort of API to allow the user to decide. We could potentially use Unparser
for this (though I don't think Unparser
is part of the public API yet).
datafusion/datafusion/sql/src/unparser/mod.rs
Lines 30 to 32 in 45599ce
pub struct Unparser<'a> { | |
dialect: &'a dyn Dialect, | |
} |
So something like
// default unparser generates fully parenthesized sql:
let unparser = Unparser::default();
// can also create an unparser that generates pretty sql
let unparser = Unparser::default()
// enable pretty-printing of the sql easier suited to human consumption
.with_pretty(true);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense to me. I didn't think about how different query engines might have different precedence rules, so having this functionality split now makes a lot more sense.
Making it explicit: Removing parenthesis according to the precedence rules of DataFusion might make it invalid SQL for other query engines with different precedence rules, even if its valid for DataFusion. That would break one of the use-cases for the unparser, so allowing people to opt-in to that behavior is desirable.
The proposed API by @alamb looks good to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pushed the proposed API, updated docs, examples and tests
("((3 + (5 + 6)) * 3)", "(3 + 5 + 6) * 3"), | ||
("((3 + (5 + 6)) + 3)", "3 + 5 + 6 + 3"), | ||
("3 + 5 + 6 + 3", "3 + 5 + 6 + 3"), | ||
("3 + (5 + (6 + 3))", "3 + 5 + 6 + 3"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this doesn't seem right to me -- this implies that (3 + 5) + (6 + 3)
and 3 + ((5 + 6) + 3)
are logically equivalent to each other as well as with 3 + 5 + 6 + 3
I think (3 + 5) + (6 + 3)
and 3 + ((5 + 6) + 3)
are different, right? Aka the parenthesis are needed here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
resolved as now pretty unparsing is done conditionally
thanks @MohamedAbdeen21 -- I am very behind on reviews. I put this one on my list again |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took a look at this @MohamedAbdeen21 and TLDR is I think it looks great now,
While reviewing the code, I noticed some documentation that is missing / non ideal. I'll make a follow on PR to improve things
@@ -153,5 +153,14 @@ async fn round_trip_parse_sql_expr_demo() -> Result<()> { | |||
|
|||
assert_eq!(sql, round_trip_sql); | |||
|
|||
// enable pretty-unparsing. This make the output more human-readable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
("((3 + (5 + 6)) * 3)", "(3 + 5 + 6) * 3"), | ||
("((3 + (5 + 6)) + 3)", "3 + 5 + 6 + 3"), | ||
("3 + 5 + 6 + 3", "3 + 5 + 6 + 3"), | ||
("3 + (5 + (6 + 3))", "3 + 5 + 6 + 3"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
resolved as now pretty unparsing is done conditionally
cc @backkem and @devinjdangelo |
Here is a PR to improve the documentation #11395 (based on this one) |
Thanks again @MohamedAbdeen21 |
* initial prettier unparse * bug fix * handling minus and divide * cleaning references and comments * moved tests * Update precedence of BETWEEN * rerun CI * Change precedence to match PGSQLs * more pretty unparser tests * Update operator precedence to match latest PGSQL * directly prettify expr_to_sql * handle IS operator * correct IS precedence * update unparser tests * update unparser example * update more unparser examples * add with_pretty builder to unparser
* initial prettier unparse * bug fix * handling minus and divide * cleaning references and comments * moved tests * Update precedence of BETWEEN * rerun CI * Change precedence to match PGSQLs * more pretty unparser tests * Update operator precedence to match latest PGSQL * directly prettify expr_to_sql * handle IS operator * correct IS precedence * update unparser tests * update unparser example * update more unparser examples * add with_pretty builder to unparser
* initial prettier unparse * bug fix * handling minus and divide * cleaning references and comments * moved tests * Update precedence of BETWEEN * rerun CI * Change precedence to match PGSQLs * more pretty unparser tests * Update operator precedence to match latest PGSQL * directly prettify expr_to_sql * handle IS operator * correct IS precedence * update unparser tests * update unparser example * update more unparser examples * add with_pretty builder to unparser
* initial prettier unparse * bug fix * handling minus and divide * cleaning references and comments * moved tests * Update precedence of BETWEEN * rerun CI * Change precedence to match PGSQLs * more pretty unparser tests * Update operator precedence to match latest PGSQL * directly prettify expr_to_sql * handle IS operator * correct IS precedence * update unparser tests * update unparser example * update more unparser examples * add with_pretty builder to unparser
Which issue does this PR close?
Takes a shot at #10633 .
Rationale for this change
Algorithm adapted from this SO answer.
What changes are included in this PR?
Make unparsed binary expressions easier to read.
Are these changes tested?
Yes, but can use more tests/statements/plans to unparse.
Are there any user-facing changes?
New
.with_pretty()
method for the unparser to remove extra parenthesis from an expr.