Implement prettier SQL unparsing (more human readable) #11186

MohamedAbdeen21 · 2024-06-30T14:29:37Z

Which issue does this PR close?

Takes a shot at #10633.

Rationale for this change

Algorithm adapted from this SO answer.

What changes are included in this PR?

Make unparsed binary expressions easier to read.

Are these changes tested?

Yes, but can use more tests/statements/plans to unparse.

Are there any user-facing changes?

New .with_pretty() method for the unparser to remove extra parentheses from an expr.

MohamedAbdeen21 · 2024-06-30T14:51:18Z

Hi @alamb this can use more testing, match cases and moving stuff around as I'm not too familiar with the unparser. ~~And I'm still working on subtraction and division~~

Appreciate if you can provide other cases/expressions that should be tested.

alamb · 2024-07-01T21:19:56Z

I plan to review this tomorrow

alamb

This is looking very cool @MohamedAbdeen21 -- thank you

I left a few suggestions for other tests.

datafusion/sql/src/unparser/expr.rs

datafusion/sql/tests/cases/plan_to_sql.rs

alamb · 2024-07-02T16:45:14Z

FYI @phillipleblanc

phillipleblanc

Looks great @MohamedAbdeen21! Very neat idea and execution. I have a few thoughts on simplifying/unifying the code.

Thanks @alamb for tagging me. 🙏

phillipleblanc · 2024-07-03T01:48:54Z

datafusion/sql/src/unparser/expr.rs

+    pub fn pretty_expr_to_sql(&self, expr: &Expr) -> Result<ast::Expr> {
+        let root_expr = self.expr_to_sql(expr)?;
+        Ok(self.pretty(root_expr, LOWEST, LOWEST))
+    }


I wouldn't have an extra method here and would combine it with expr_to_sql. The ast::Expr it produces is logically the same as the input one, just with unnecessary nesting removed. In fact, you could even think about this as serving the same purpose as an optimizer rewrite pass for LogicalPlan - it should produce logically the same thing as the input, just more efficient.

datafusion/sql/src/unparser/expr.rs

phillipleblanc · 2024-07-03T02:00:18Z

datafusion/sql/src/unparser/expr.rs

+    ///
+    /// Also note that when fetching the precedence of a nested expression, we ignore other nested
+    /// expressions, so precedence of expr `(a * (b + c))` equals `*` and not `+`.
+    fn pretty(


If we just have a single expr_to_sql method, then it might make sense to rename this method as well to something like remove_unnecessary_nesting to more accurately describe what it does. Then if we added more "rewrite passes" to make it prettier or achieve some other functional goal, then they would just be added as separate functions after this one.
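
To make the idea of a "remove unnecessary nesting" pass concrete, here is a minimal sketch on a toy AST. The `Expr`, `Op`, and helper names below are assumptions made up for illustration; they are not the sqlparser or DataFusion types, and the code is not the PR's implementation. It keeps a pair of parentheses only when the inner operator binds more loosely than a neighbouring operator, or when it ties with a `-` or `/` on its left.

```rust
// Toy AST for illustration only; not the sqlparser or DataFusion types.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Ident(String),
    Binary { left: Box<Expr>, op: Op, right: Box<Expr> },
    Nested(Box<Expr>), // an explicit pair of parentheses
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum Op { Plus, Minus, Mul, Div }

impl Op {
    // Larger numbers bind tighter.
    fn precedence(self) -> u8 {
        match self {
            Op::Plus | Op::Minus => 1,
            Op::Mul | Op::Div => 2,
        }
    }
    fn symbol(self) -> &'static str {
        match self { Op::Plus => "+", Op::Minus => "-", Op::Mul => "*", Op::Div => "/" }
    }
}

// Precedence of a neighbouring operator; no neighbour counts as the lowest.
fn prec(op: Option<Op>) -> u8 {
    op.map_or(0, Op::precedence)
}

// Precedence of the expression sitting directly inside the parentheses.
fn inner_precedence(expr: &Expr) -> u8 {
    match expr {
        Expr::Binary { op, .. } => op.precedence(),
        _ => u8::MAX, // identifiers and nested groups never need protecting
    }
}

// `left` and `right` are the operators immediately surrounding `expr`.
fn remove_unnecessary_nesting(expr: Expr, left: Option<Op>, right: Option<Op>) -> Expr {
    match expr {
        Expr::Nested(inner) => {
            let surrounding = prec(left).max(prec(right));
            let inner_prec = inner_precedence(&inner);
            // `a - (b + c)` and `a / (b * c)` must keep their parentheses.
            let left_not_associative = matches!(left, Some(Op::Minus | Op::Div));
            if inner_prec < surrounding || (inner_prec == surrounding && left_not_associative) {
                Expr::Nested(Box::new(remove_unnecessary_nesting(*inner, None, None)))
            } else {
                remove_unnecessary_nesting(*inner, left, right)
            }
        }
        Expr::Binary { left: l, op, right: r } => Expr::Binary {
            left: Box::new(remove_unnecessary_nesting(*l, left, Some(op))),
            op,
            right: Box::new(remove_unnecessary_nesting(*r, Some(op), right)),
        },
        other => other,
    }
}

// Render the tree; parentheses appear only where a `Nested` node remains.
fn to_sql(expr: &Expr) -> String {
    match expr {
        Expr::Ident(name) => name.clone(),
        Expr::Binary { left, op, right } => {
            format!("{} {} {}", to_sql(left), op.symbol(), to_sql(right))
        }
        Expr::Nested(inner) => format!("({})", to_sql(inner)),
    }
}

// Small constructors to keep examples readable.
fn ident(name: &str) -> Expr { Expr::Ident(name.to_string()) }
fn bin(l: Expr, op: Op, r: Expr) -> Expr {
    Expr::Binary { left: Box::new(l), op, right: Box::new(r) }
}
fn nested(e: Expr) -> Expr { Expr::Nested(Box::new(e)) }
```

Calling `remove_unnecessary_nesting(expr, None, None)` on the root treats the whole expression as having no neighbours. Note that the sketch regroups same-precedence operators such as `a + (b + c)`, which changes the tree shape and is only safe when the operator is genuinely associative.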

phillipleblanc · 2024-07-03T02:02:07Z

datafusion/expr/src/operator.rs

@@ -218,37 +218,33 @@ impl Operator {
    }

    /// Get the operator precedence
-    /// use <https://www.postgresql.org/docs/7.0/operators.htm#AEN2026> as a reference
+    /// use <https://www.postgresql.org/docs/7.2/sql-precedence.html> as a reference
    pub fn precedence(&self) -> u8 {


It looks like this is only used in a few places for formatting the parentheses in a Display trait on BinaryExpr. I don't think that will have any major impact on library users.
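
As a rough illustration of how a precedence function can drive parenthesization in a `Display` impl, here is a self-contained sketch. The `Operator` and `Expr` types and the numeric values are assumptions for this example, not DataFusion's definitions; only the relative ordering (OR below AND below comparison below `+` below `*`) follows the PostgreSQL table referenced in the diff.

```rust
use std::fmt;

// Toy types for illustration; not DataFusion's `Operator` or `BinaryExpr`.
#[derive(Clone, Copy)]
enum Operator { Or, And, Lt, Plus, Multiply }

impl Operator {
    // Only the relative order matters; the numbers themselves are arbitrary.
    fn precedence(self) -> u8 {
        match self {
            Operator::Or => 5,
            Operator::And => 10,
            Operator::Lt => 20,
            Operator::Plus => 30,
            Operator::Multiply => 40,
        }
    }
    fn symbol(self) -> &'static str {
        match self {
            Operator::Or => "OR",
            Operator::And => "AND",
            Operator::Lt => "<",
            Operator::Plus => "+",
            Operator::Multiply => "*",
        }
    }
}

enum Expr {
    Column(&'static str),
    Binary(Box<Expr>, Operator, Box<Expr>),
}

// Wrap `child` in parentheses when its operator's precedence is below `min_prec`.
fn write_child(f: &mut fmt::Formatter<'_>, child: &Expr, min_prec: u8) -> fmt::Result {
    match child {
        Expr::Binary(_, op, _) if op.precedence() < min_prec => write!(f, "({})", child),
        _ => write!(f, "{}", child),
    }
}

impl fmt::Display for Expr {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Expr::Column(name) => write!(f, "{}", name),
            Expr::Binary(left, op, right) => {
                let p = op.precedence();
                write_child(f, left, p)?;
                write!(f, " {} ", op.symbol())?;
                // The right operand keeps its parentheses on a tie as well,
                // so the printed text still reflects the tree shape.
                write_child(f, right, p + 1)
            }
        }
    }
}

// Small constructors to keep examples readable.
fn col(name: &'static str) -> Expr { Expr::Column(name) }
fn binary(l: Expr, op: Operator, r: Expr) -> Expr {
    Expr::Binary(Box::new(l), op, Box::new(r))
}
```

With this approach, changing the precedence table changes where parentheses are printed, but not the meaning of the expression tree itself.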

MohamedAbdeen21 · 2024-07-03T05:20:54Z

Thanks @alamb and @phillipleblanc. Updated PR accordingly

I'm aware of a failing test, will look into it later today.

phillipleblanc

Looks great to me, thanks @MohamedAbdeen21!

alamb

Thanks @MohamedAbdeen21 and @phillipleblanc -- I think this is looking close

alamb · 2024-07-03T20:57:02Z

datafusion-examples/examples/parse_sql_expr.rs

@@ -135,7 +135,9 @@ async fn query_parquet_demo() -> Result<()> {

 /// DataFusion can parse a SQL text and convert it back to SQL using [`Unparser`].
 async fn round_trip_parse_sql_expr_demo() -> Result<()> {
-    let sql = "((int_col < 5) OR (double_col = 8))";
+    // unparser can also remove extra parentheses,
+    // so `((int_col < 5) OR (double_col = 8))` will also produce the same SQL


alamb · 2024-07-03T21:00:17Z

datafusion/sql/tests/cases/plan_to_sql.rs

+            &mut PlannerContext::new(),
+        )?;
+
+        assert_eq!(expr.to_string(), pretty_expr.to_string());


Shouldn't this be verifying that the exprs are the same (not just that the strings are the same)?

MohamedAbdeen21 · 2024-07-03

But they can't be the same: the whole idea behind removing parentheses is to preserve the final result regardless of which operation is parenthesized and therefore executed first, i.e. to rely on associativity.

Here's an example from the tests:
("((id + 5) * (age * 8))", "(id + 5) * age * 8"),

Here's the tree of the first/original expr:

         *
       /   \
      +     *
     / \   / \
   id   5 age  8

And here's the cleaned one

          *
         / \
        *   8
       / \
      +   age
     / \
   id   5

These obviously don't match, but it doesn't matter because multiplication is associative. Almost all operators are associative except for division and subtraction, which are handled.

Comparing the string representation seemed like the best solution.
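
The point about the two trees can be checked with a small sketch. The `Expr` type and helper functions below are hypothetical and exist only for this illustration; the sketch builds both groupings from the test case, shows that they are structurally different, and shows that they evaluate to the same integer.

```rust
// Toy expression tree for illustration; not a DataFusion type.
#[derive(Debug, PartialEq)]
enum Expr {
    Lit(i64),
    Add(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
}

fn lit(v: i64) -> Expr { Expr::Lit(v) }
fn add(l: Expr, r: Expr) -> Expr { Expr::Add(Box::new(l), Box::new(r)) }
fn mul(l: Expr, r: Expr) -> Expr { Expr::Mul(Box::new(l), Box::new(r)) }

// Evaluate the tree bottom-up.
fn eval(e: &Expr) -> i64 {
    match e {
        Expr::Lit(v) => *v,
        Expr::Add(l, r) => eval(l) + eval(r),
        Expr::Mul(l, r) => eval(l) * eval(r),
    }
}

// Tree for the original text: (id + 5) * (age * 8)
fn original(id: i64, age: i64) -> Expr {
    mul(add(lit(id), lit(5)), mul(lit(age), lit(8)))
}

// Tree obtained by re-parsing the cleaned text: (id + 5) * age * 8
fn cleaned(id: i64, age: i64) -> Expr {
    mul(mul(add(lit(id), lit(5)), lit(age)), lit(8))
}
```

For small integers the two trees differ structurally but agree in value, which is why the test compares rendered strings rather than trees.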

If you want to maintain that structural equivalence, then the only parentheses we could remove are those that give priority to the left side, which is the natural associativity of all binary operators (except = I think)*, or those that wrap the operation with higher (but not equal) precedence. For example, (a + b) + c or true AND (a > b).

* https://www.postgresql.org/docs/7.2/sql-precedence.html

alamb · 2024-07-05

Thank you for the response @MohamedAbdeen21 -- this makes sense

🤔 I am not sure if maintaining the structural equivalence of the input/output is important or if just the semantics are important to preserve

Perhaps @goldmedal @seddonm1 or @phillipleblanc or @devinjdangelo has thoughts on what the preferred behavior is

goldmedal · 2024-07-05

I think it's related to #11186 (comment). If our main purpose is providing SQL for full pushdown to other data sources, this behavior makes sense to me. For my current project (Wren), it's acceptable.

However, if our purpose is human readability, I think preserving the semantics is important. Maybe we can just remove the outermost brackets (they're exactly redundant) for a calculation expression like:

((3 + (5 + 6)) + 3) -> (3 + (5 + 6)) + 3

I'm not sure, but I think it might be more similar to what a human would write.

phillipleblanc · 2024-07-07

I don't think maintaining structural equivalence is important as long as the correct answer would still be generated if the SQL were run.

But maybe this is exposing two opposing(?) goals of the unparser - the goal for using it to generate SQL that will be run in another SQL query engine. And another goal for displaying it for human readability/debuggability. I'm in the first camp - it doesn't matter to me what the resulting SQL looks like as long as it runs correctly (although having nicer SQL does make debugging easier).

However, I would expect that we could serve both goals with a single unparser implementation. Any work that is done to improve the SQL for human readability shouldn't come at the expense of making it invalid SQL for running in a separate query engine. That is why I thought it would be better to have just a single expr_to_sql function instead of splitting the functionality.

MohamedAbdeen21 · 2024-07-07

Since the primary goal is pushdown for other systems, we have two options.

1. Completely ignore this issue. Systems can have very different precedence rules, and therefore we cannot guarantee correctness once some parentheses are removed. Also, it can be hard to decide which parentheses are "safe" to remove.

2. Inform DF of the target system's operator precedence by implementing a dialect. This can be a lot of work for the dev, but I can see this being a desired option for some people; as Phillip pointed out, it can make debugging easier.
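
A purely hypothetical sketch of what a dialect-supplied precedence could look like follows. The `PrecedenceDialect` trait, both dialect structs, and their precedence values are invented for this illustration and are not part of DataFusion or of any real engine's documented rules. The point is only that the same pair of parentheses can be redundant under one precedence table and required under another.

```rust
// Hypothetical operators and trait, invented for this sketch.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Op { BitOr, BitAnd }

trait PrecedenceDialect {
    // Larger numbers bind tighter.
    fn precedence(&self, op: Op) -> u8;
}

// Hypothetical engine in which `&` binds tighter than `|`.
struct TightAnd;
// Hypothetical engine in which `&` and `|` share one precedence level.
struct FlatBitwise;

impl PrecedenceDialect for TightAnd {
    fn precedence(&self, op: Op) -> u8 {
        match op {
            Op::BitOr => 1,
            Op::BitAnd => 2,
        }
    }
}

impl PrecedenceDialect for FlatBitwise {
    fn precedence(&self, _op: Op) -> u8 {
        1
    }
}

// For `a <outer> (b <inner> c)`: the parentheses around the right operand
// are redundant only if the inner operator binds strictly tighter.
fn can_drop_right_parens(dialect: &dyn PrecedenceDialect, outer: Op, inner: Op) -> bool {
    dialect.precedence(inner) > dialect.precedence(outer)
}
```

Under `TightAnd`, `a | (b & c)` could be emitted as `a | b & c`; under `FlatBitwise` the parentheses would have to stay.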

alamb · 2024-07-07

> But maybe this is exposing two opposing(?) goals of the unparser - the goal for using it to generate SQL that will be run in another SQL query engine. And another goal for displaying it for human readability/debuggability.

I think this is an excellent summary of the issue and given these two different goals there would be two different ideal outcomes.

Perhaps as @MohamedAbdeen21 suggests we can add some sort of API to allow the user to decide. We could potentially use Unparser for this (though I don't think Unparser is part of the public API yet).

datafusion/datafusion/sql/src/unparser/mod.rs

Lines 30 to 32 in 45599ce

pub struct Unparser<'a> {
    dialect: &'a dyn Dialect,
}

So something like

// default unparser generates fully parenthesized sql:
let unparser = Unparser::default();
// can also create an unparser that generates pretty sql
let unparser = Unparser::default()
    // enable pretty-printing of the sql easier suited to human consumption
    .with_pretty(true);
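
To show how such an opt-in flag might behave end to end, here is a self-contained sketch with a toy `Unparser`. Everything in it (the `Expr` and `Op` types, the `render` helper, the precedence values) is an assumption for illustration; the real unparser operates on DataFusion expressions and sqlparser AST nodes. Only the builder shape, `Unparser::default().with_pretty(true)`, mirrors the proposal above.

```rust
// Toy stand-ins for illustration; not the DataFusion or sqlparser types.
#[derive(Clone, Copy)]
enum Op { Or, Lt, Eq }

impl Op {
    // Larger numbers bind tighter.
    fn precedence(self) -> u8 {
        match self {
            Op::Or => 1,
            Op::Lt | Op::Eq => 2,
        }
    }
    fn symbol(self) -> &'static str {
        match self { Op::Or => "OR", Op::Lt => "<", Op::Eq => "=" }
    }
}

enum Expr {
    Leaf(&'static str),
    Binary(Box<Expr>, Op, Box<Expr>),
}

#[derive(Default)]
struct Unparser {
    pretty: bool,
}

impl Unparser {
    // Builder-style toggle: off by default, so output stays fully parenthesized.
    fn with_pretty(mut self, pretty: bool) -> Self {
        self.pretty = pretty;
        self
    }

    fn expr_to_sql(&self, expr: &Expr) -> String {
        self.render(expr, 0)
    }

    fn render(&self, expr: &Expr, parent_prec: u8) -> String {
        match expr {
            Expr::Leaf(text) => text.to_string(),
            Expr::Binary(left, op, right) => {
                let p = op.precedence();
                let body = format!(
                    "{} {} {}",
                    self.render(left, p),
                    op.symbol(),
                    self.render(right, p)
                );
                // Default mode parenthesizes every binary expression; pretty
                // mode only does so when the parent binds at least as tightly.
                if !self.pretty || p <= parent_prec {
                    format!("({})", body)
                } else {
                    body
                }
            }
        }
    }
}

// (int_col < 5) OR (double_col = 8), as in the parse_sql_expr example.
fn example() -> Expr {
    Expr::Binary(
        Box::new(Expr::Binary(Box::new(Expr::Leaf("int_col")), Op::Lt, Box::new(Expr::Leaf("5")))),
        Op::Or,
        Box::new(Expr::Binary(Box::new(Expr::Leaf("double_col")), Op::Eq, Box::new(Expr::Leaf("8")))),
    )
}
```

Keeping the flag off by default means existing callers that push SQL down to other engines see no change unless they opt in.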

phillipleblanc · 2024-07-07

That makes sense to me. I didn't think about how different query engines might have different precedence rules, so having this functionality split now makes a lot more sense.

Making it explicit: removing parentheses according to the precedence rules of DataFusion might make the SQL invalid for other query engines with different precedence rules, even if it's valid for DataFusion. That would break one of the use-cases for the unparser, so allowing people to opt in to that behavior is desirable.

The proposed API by @alamb looks good to me.

MohamedAbdeen21 · 2024-07-09

Pushed the proposed API and updated the docs, examples, and tests.

alamb · 2024-07-03T21:01:48Z

datafusion/sql/tests/cases/plan_to_sql.rs

+        ("((3 + (5 + 6)) * 3)", "(3 + 5 + 6) * 3"),
+        ("((3 + (5 + 6)) + 3)", "3 + 5 + 6 + 3"),
+        ("3 + 5 + 6 + 3", "3 + 5 + 6 + 3"),
+        ("3 + (5 + (6 + 3))", "3 + 5 + 6 + 3"),


this doesn't seem right to me -- this implies that (3 + 5) + (6 + 3) and 3 + ((5 + 6) + 3) are logically equivalent to each other as well as with 3 + 5 + 6 + 3

I think (3 + 5) + (6 + 3) and 3 + ((5 + 6) + 3) are different, right? Aka the parentheses are needed here
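
The comment does not say why the grouping might matter. One well-known case, offered here as an assumption rather than as the reviewer's stated reasoning, is floating-point addition, which is not associative. The two helper functions below are hypothetical and only serve to show the effect.

```rust
// Same three operands, two different groupings.
fn left_grouped(a: f64, b: f64, c: f64) -> f64 {
    (a + b) + c
}

fn right_grouped(a: f64, b: f64, c: f64) -> f64 {
    a + (b + c)
}

// True when regrouping changes the computed value.
fn grouping_matters(a: f64, b: f64, c: f64) -> bool {
    left_grouped(a, b, c) != right_grouped(a, b, c)
}
```

For example, with `1e100`, `-1e100`, and `1.0`, the left grouping cancels the large terms first and yields `1.0`, while the right grouping absorbs the `1.0` into `-1e100` and yields `0.0`.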

resolved as now pretty unparsing is done conditionally

alamb · 2024-07-09T21:38:39Z

thanks @MohamedAbdeen21 -- I am very behind on reviews. I put this one on my list again

alamb

I took a look at this @MohamedAbdeen21 and TLDR is I think it looks great now.

While reviewing the code, I noticed some documentation that is missing / non ideal. I'll make a follow on PR to improve things

alamb · 2024-07-10T16:16:24Z

datafusion-examples/examples/parse_sql_expr.rs

@@ -153,5 +153,14 @@ async fn round_trip_parse_sql_expr_demo() -> Result<()> {

    assert_eq!(sql, round_trip_sql);

+    // enable pretty-unparsing. This make the output more human-readable


alamb · 2024-07-10T16:18:21Z

datafusion/sql/tests/cases/plan_to_sql.rs

+        ("((3 + (5 + 6)) * 3)", "(3 + 5 + 6) * 3"),
+        ("((3 + (5 + 6)) + 3)", "3 + 5 + 6 + 3"),
+        ("3 + 5 + 6 + 3", "3 + 5 + 6 + 3"),
+        ("3 + (5 + (6 + 3))", "3 + 5 + 6 + 3"),


resolved as now pretty unparsing is done conditionally

alamb · 2024-07-10T16:22:27Z

cc @backkem and @devinjdangelo

alamb · 2024-07-10T17:09:54Z

Here is a PR to improve the documentation #11395 (based on this one)

alamb · 2024-07-11T16:20:14Z

Thanks again @MohamedAbdeen21

* initial prettier unparse
* bug fix
* handling minus and divide
* cleaning references and comments
* moved tests
* Update precedence of BETWEEN
* rerun CI
* Change precedence to match PGSQLs
* more pretty unparser tests
* Update operator precedence to match latest PGSQL
* directly prettify expr_to_sql
* handle IS operator
* correct IS precedence
* update unparser tests
* update unparser example
* update more unparser examples
* add with_pretty builder to unparser

initial prettier unparse

c65d72a

github-actions bot added the sql SQL Planner label Jun 30, 2024

bug fix

1e25567

MohamedAbdeen21 added 3 commits June 30, 2024 18:36

handling minus and divide

79532b9

cleaning references and comments

5c6aeca

moved tests

dcc6664

MohamedAbdeen21 marked this pull request as ready for review June 30, 2024 19:38

alamb mentioned this pull request Jul 1, 2024

DataFusion weekly project plan (Andrew Lamb) - July 1, 2024 #11190

Closed

10 tasks

MohamedAbdeen21 added 2 commits July 2, 2024 19:29

Update precedence of BETWEEN

29b5aa5

rerun CI

384dde1

alamb reviewed Jul 2, 2024

MohamedAbdeen21 added 2 commits July 2, 2024 20:34

Change precedence to match PGSQLs

4d6967c

more pretty unparser tests

2c8f5c4

github-actions bot added the logical-expr Logical plan and expressions label Jul 2, 2024

Update operator precedence to match latest PGSQL

f753f05

phillipleblanc reviewed Jul 3, 2024

MohamedAbdeen21 added 3 commits July 3, 2024 07:41

directly prettify expr_to_sql

912ce3a

handle IS operator

eadc077

correct IS precedence

91d8b43

phillipleblanc approved these changes Jul 3, 2024

update unparser tests

c3dcb02

github-actions bot added the core Core DataFusion crate label Jul 3, 2024

MohamedAbdeen21 added 2 commits July 3, 2024 20:05

update unparser example

61fddb9

update more unparser examples

38a04de

alamb reviewed Jul 3, 2024

alamb changed the title ~~initial prettier unparse~~ Implement prettier SQL unparsing (more human readable) Jul 7, 2024

add with_pretty builder to unparser

98893f0

github-actions bot removed the core Core DataFusion crate label Jul 9, 2024

alamb mentioned this pull request Jul 9, 2024

DataFusion weekly project plan (Andrew Lamb) - July 8, 2024 #11334

Closed

9 tasks

alamb approved these changes Jul 10, 2024

alamb mentioned this pull request Jul 10, 2024

Improved unparser documentation #11395

Merged

alamb merged commit 1e9f0e1 into apache:main Jul 11, 2024
24 checks passed

alamb mentioned this pull request Jul 15, 2024

DataFusion weekly project plan (Andrew Lamb) - July 15, 2024 #11474

Closed

7 tasks
