Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement prettier SQL unparsing (more human readable) #11186

Merged
merged 17 commits into from
Jul 11, 2024

Conversation

MohamedAbdeen21
Copy link
Contributor

@MohamedAbdeen21 MohamedAbdeen21 commented Jun 30, 2024

Which issue does this PR close?

Takes a shot at #10633 .

Rationale for this change

Algorithm adapted from this SO answer.

What changes are included in this PR?

Make unparsed binary expressions easier to read.

Are these changes tested?

Yes, but can use more tests/statements/plans to unparse.

Are there any user-facing changes?

New .with_pretty() method for the unparser to remove extra parenthesis from an expr.

@github-actions github-actions bot added the sql SQL Planner label Jun 30, 2024
@MohamedAbdeen21
Copy link
Contributor Author

MohamedAbdeen21 commented Jun 30, 2024

Hi @alamb this can use more testing, match cases and moving stuff around as I'm not too familiar with the unparser. And I'm still working on subtraction and division

Appreciate if you can provide other cases/expressions that should be tested.

@MohamedAbdeen21 MohamedAbdeen21 marked this pull request as ready for review June 30, 2024 19:38
@alamb
Copy link
Contributor

alamb commented Jul 1, 2024

I plan to review this tomorrow

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking very cool @MohamedAbdeen21 -- thank you

I left a few suggestions for other tests.

datafusion/sql/src/unparser/expr.rs Outdated Show resolved Hide resolved
datafusion/sql/tests/cases/plan_to_sql.rs Show resolved Hide resolved
datafusion/sql/tests/cases/plan_to_sql.rs Show resolved Hide resolved
@alamb
Copy link
Contributor

alamb commented Jul 2, 2024

FYI @phillipleblanc

@github-actions github-actions bot added the logical-expr Logical plan and expressions label Jul 2, 2024
Copy link
Contributor

@phillipleblanc phillipleblanc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great @MohamedAbdeen21! Very neat idea and execution. I have a few thoughts on simplifying/unifying the code.

Thanks @alamb for tagging me. 🙏

Comment on lines 109 to 112
pub fn pretty_expr_to_sql(&self, expr: &Expr) -> Result<ast::Expr> {
let root_expr = self.expr_to_sql(expr)?;
Ok(self.pretty(root_expr, LOWEST, LOWEST))
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't have an extra method here and would combine it with expr_to_sql. The ast::Expr it produces is logically the same as the input one, just with unnecessary nesting removed. In fact, you could even think about this as serving the same purpose as an optimizer rewrite pass for LogicalPlan - it should produce logically the same thing as the input, just more efficient.

datafusion/sql/src/unparser/expr.rs Show resolved Hide resolved
///
/// Also note that when fetching the precedence of a nested expression, we ignore other nested
/// expressions, so precedence of expr `(a * (b + c))` equals `*` and not `+`.
fn pretty(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we just have a single expr_to_sql method, then it might make sense to rename this method as well to something like remove_unnecessary_nesting to more accurately describe what it does. Then if we added more "rewrite passes" to make it prettier or achieve some other functional goal, then they would just be added as separate functions after this one.

@@ -218,37 +218,33 @@ impl Operator {
}

/// Get the operator precedence
/// use <https://www.postgresql.org/docs/7.0/operators.htm#AEN2026> as a reference
/// use <https://www.postgresql.org/docs/7.2/sql-precedence.html> as a reference
pub fn precedence(&self) -> u8 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this is only used in a few places for formatting the parenthesis in a Display trait on BinaryExpr. I don't think that will have any major impact on library users.

@MohamedAbdeen21
Copy link
Contributor Author

Thanks @alamb and @phillipleblanc. Updated PR accordingly

I'm aware of a failing test, will look into it later today.

Copy link
Contributor

@phillipleblanc phillipleblanc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great to me, thanks @MohamedAbdeen21!

@github-actions github-actions bot added the core Core DataFusion crate label Jul 3, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

THanks @MohamedAbdeen21 and @phillipleblanc -- I think this is looking close

@@ -135,7 +135,9 @@ async fn query_parquet_demo() -> Result<()> {

/// DataFusion can parse a SQL text and convert it back to SQL using [`Unparser`].
async fn round_trip_parse_sql_expr_demo() -> Result<()> {
let sql = "((int_col < 5) OR (double_col = 8))";
// unparser can also remove extra parentheses,
// so `((int_col < 5) OR (double_col = 8))` will also produce the same SQL
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

&mut PlannerContext::new(),
)?;

assert_eq!(expr.to_string(), pretty_expr.to_string());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this me veriying that the expr are the same (not just that the string is the same)?

Copy link
Contributor Author

@MohamedAbdeen21 MohamedAbdeen21 Jul 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But they can't be the same, the whole idea behind removing parenthesis is to maintain the final result regardless of which operations get executed first/has the parenthesis aka associativity.

Here's an example from the tests:
("((id + 5) * (age * 8))", "(id + 5) * age * 8"),

Here's the tree of the first/original expr:

	    *
	   / \
	  +   * 
	/  \  / \
      id  5 age 8

And here's the cleaned one

	       *
	      / \
	     *   8
	   / \
	  +   age
	 / \
	id  5

These obviously don't match, but it doesn't matter because multiplication is associative. Almost all operators are associative except for division and subtraction, which are handled.

Comparing the string representation seemed like the best solution.

Copy link
Contributor Author

@MohamedAbdeen21 MohamedAbdeen21 Jul 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want to maintain such equivalence then the only case we remove parenthesis would be that the parenthesis give priority to the left side, which is the natural associativity of all binary operators (except = I think)*. Or to the operation with higher (but not equal) precedence. Something like (a + b) + c or true AND (a > b) for example

* https://www.postgresql.org/docs/7.2/sql-precedence.html

Copy link
Contributor

@alamb alamb Jul 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the response @MohamedAbdeen21 -- this makes sense

🤔 I am not sure if maintaining the structural equivalence of the input/output is important or if just the semantics are important to preserve

Perhaps @goldmedal @seddonm1 or @phillipleblanc or @devinjdangelo has thoughts on what the preferred behavior is

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's related to #11186 (comment). If our main purpose is providing SQL for full pushdown to other data sources, this behavior makes sense to me. For my current project (Wren), it's acceptable.

However, if our purpose is human readability, I think preserving the semantics is important. Maybe we can just remove the outermost brackets (they're exactly redundant) for a calculation expression like:

((3 + (5 + 6)) + 3) -> (3 + (5 + 6)) + 3

I'm not sure, but I think it might be more similar to what a human would write.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think maintaining structural equivalence is important as long as the correct answer would still be generated if the SQL were run.

But maybe this is exposing two opposing(?) goals of the unparser - the goal for using it to generate SQL that will be run in another SQL query engine. And another goal for displaying it for human readability/debuggability. I'm in the first camp - it doesn't matter to me what the resulting SQL looks like as long as it runs correctly (although having nicer SQL does make debugging easier).

However, I would expect that we could serve both goals with a single unparser implementation. Any work that is done to improve the SQL for human readability shouldn't come at the expense of making it invalid SQL for running in a separate query engine. That is why I thought it would be better to have just a single expr_to_sql function instead of splitting the functionality.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the primary goal is pushdown for other systems, we have two options.

  1. Completely ignore this issue. Systems can have very different precedence rules, and therefore, we can not guarantee correctness once some parentheses are removed. Also, it can be hard to decide which parentheses are "safe" to remove.

  2. Informing DF of your target system operator precedence through implementing a dialect. This can be a lot of work for the dev, but I can see this being a desired option for some people; as Phillip pointed out, It can make debugging easier.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But maybe this is exposing two opposing(?) goals of the unparser - the goal for using it to generate SQL that will be run in another SQL query engine. And another goal for displaying it for human readability/debuggability.

I think this is an excellent summary of the issue and given these two different goals there would be two different ideal outcomes.

Perhaps as @MohamedAbdeen21 suggests we can add some sort of API to allow the user to decide. We could potentially use Unparser for this (though I don't think Unparser is part of the public API yet).

pub struct Unparser<'a> {
dialect: &'a dyn Dialect,
}

So something like

// default unparser generates fully parenthesized sql:
let unparser = Unparser::default();


// can also create an unparser that generates pretty sql
let unparser = Unparser::default()
 // enable pretty-printing of the sql easier suited to human consumption
 .with_pretty(true);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense to me. I didn't think about how different query engines might have different precedence rules, so having this functionality split now makes a lot more sense.

Making it explicit: Removing parenthesis according to the precedence rules of DataFusion might make it invalid SQL for other query engines with different precedence rules, even if its valid for DataFusion. That would break one of the use-cases for the unparser, so allowing people to opt-in to that behavior is desirable.

The proposed API by @alamb looks good to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pushed the proposed API, updated docs, examples and tests

("((3 + (5 + 6)) * 3)", "(3 + 5 + 6) * 3"),
("((3 + (5 + 6)) + 3)", "3 + 5 + 6 + 3"),
("3 + 5 + 6 + 3", "3 + 5 + 6 + 3"),
("3 + (5 + (6 + 3))", "3 + 5 + 6 + 3"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this doesn't seem right to me -- this implies that (3 + 5) + (6 + 3) and 3 + ((5 + 6) + 3) are logically equivalent to each other as well as with 3 + 5 + 6 + 3

I think (3 + 5) + (6 + 3) and 3 + ((5 + 6) + 3) are different, right? Aka the parenthesis are needed here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resolved as now pretty unparsing is done conditionally

@alamb alamb changed the title initial prettier unparse Implement prettier SQL unparsing (more human readable) Jul 7, 2024
@github-actions github-actions bot removed the core Core DataFusion crate label Jul 9, 2024
@alamb
Copy link
Contributor

alamb commented Jul 9, 2024

thanks @MohamedAbdeen21 -- I am very behind on reviews. I put this one on my list again

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a look at this @MohamedAbdeen21 and TLDR is I think it looks great now,

While reviewing the code, I noticed some documentation that is missing / non ideal. I'll make a follow on PR to improve things

@@ -153,5 +153,14 @@ async fn round_trip_parse_sql_expr_demo() -> Result<()> {

assert_eq!(sql, round_trip_sql);

// enable pretty-unparsing. This make the output more human-readable
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

("((3 + (5 + 6)) * 3)", "(3 + 5 + 6) * 3"),
("((3 + (5 + 6)) + 3)", "3 + 5 + 6 + 3"),
("3 + 5 + 6 + 3", "3 + 5 + 6 + 3"),
("3 + (5 + (6 + 3))", "3 + 5 + 6 + 3"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resolved as now pretty unparsing is done conditionally

@alamb
Copy link
Contributor

alamb commented Jul 10, 2024

cc @backkem and @devinjdangelo

@alamb
Copy link
Contributor

alamb commented Jul 10, 2024

Here is a PR to improve the documentation #11395 (based on this one)

@alamb alamb merged commit 1e9f0e1 into apache:main Jul 11, 2024
24 checks passed
@alamb
Copy link
Contributor

alamb commented Jul 11, 2024

Thanks again @MohamedAbdeen21

Lordworms pushed a commit to Lordworms/arrow-datafusion that referenced this pull request Jul 12, 2024
* initial prettier unparse

* bug fix

* handling minus and divide

* cleaning references and comments

* moved tests

* Update precedence of BETWEEN

* rerun CI

* Change precedence to match PGSQLs

* more pretty unparser tests

* Update operator precedence to match latest PGSQL

* directly prettify expr_to_sql

* handle IS operator

* correct IS precedence

* update unparser tests

* update unparser example

* update more unparser examples

* add with_pretty builder to unparser
findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024
* initial prettier unparse

* bug fix

* handling minus and divide

* cleaning references and comments

* moved tests

* Update precedence of BETWEEN

* rerun CI

* Change precedence to match PGSQLs

* more pretty unparser tests

* Update operator precedence to match latest PGSQL

* directly prettify expr_to_sql

* handle IS operator

* correct IS precedence

* update unparser tests

* update unparser example

* update more unparser examples

* add with_pretty builder to unparser
xinlifoobar pushed a commit to xinlifoobar/datafusion that referenced this pull request Jul 17, 2024
* initial prettier unparse

* bug fix

* handling minus and divide

* cleaning references and comments

* moved tests

* Update precedence of BETWEEN

* rerun CI

* Change precedence to match PGSQLs

* more pretty unparser tests

* Update operator precedence to match latest PGSQL

* directly prettify expr_to_sql

* handle IS operator

* correct IS precedence

* update unparser tests

* update unparser example

* update more unparser examples

* add with_pretty builder to unparser
xinlifoobar pushed a commit to xinlifoobar/datafusion that referenced this pull request Jul 18, 2024
* initial prettier unparse

* bug fix

* handling minus and divide

* cleaning references and comments

* moved tests

* Update precedence of BETWEEN

* rerun CI

* Change precedence to match PGSQLs

* more pretty unparser tests

* Update operator precedence to match latest PGSQL

* directly prettify expr_to_sql

* handle IS operator

* correct IS precedence

* update unparser tests

* update unparser example

* update more unparser examples

* add with_pretty builder to unparser
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
logical-expr Logical plan and expressions sql SQL Planner
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants