Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change SQL dialect to PostgreSQL #214

Closed
wants to merge 2 commits into from

Conversation

returnString
Copy link
Contributor

@returnString returnString commented Apr 27, 2021

Some background first! @Dandandan did the initial work here and identified the VarProvider stuff as a potential followup issue in apache/arrow#9541. This PR just implements the original dialect change with the proposed extra change of removing scalar vars - thanks Daniël! 😄

Which issue does this PR close?

Closes #183.

Rationale for this change

This formalises our decision to use the Postgres SQL dialect.

What changes are included in this PR?

  • The DataFusion SQL parser now uses the Postgres dialect from sqlparser-rs
  • The notion of ScalarVariable exprs and the related VarProvider concept have been removed entirely, as these aren't supported in the Postgres dialect and the use case can be handled with nullary functions as is done in Postgres.

Are there any user-facing changes?

DataFusion (unsure if this worked before in Ballista) no longer supports user or system variables exposed with the non-standard @ prefix.

@Dandandan
Copy link
Contributor

Cool, thanks @returnString . To me it sounds like a good decision moto remove support for the ScalarVariable. I am not sure there is a user for the functionality?
What do you think @alamb @andygrove

Dandandan
Dandandan previously approved these changes Apr 28, 2021
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the removal of ScalarVariable may be a problem.

@@ -916,35 +898,6 @@ mod tests {
Ok(())
}

#[tokio::test]
async fn create_variable_expr() -> Result<()> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these were added in apache/arrow#8135 . Perhaps @wqc200 has some comment about how / if this feature is used?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should remove support for scalar variables in the logical plan. This seems unrelated to changing the default SQL dialect. Users can use ScalarVariable without using SQL.

@@ -21,7 +21,7 @@

use sqlparser::{
ast::{ColumnDef, ColumnOptionDef, Statement as SQLStatement, TableConstraint},
dialect::{keywords::Keyword, Dialect, GenericDialect},
dialect::{keywords::Keyword, Dialect, PostgreSqlDialect},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another approach might be to make the SQL dialect be configurable on ExecutionConfig so that users could choose what dialect they wanted to mimic: https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/execution/context.rs#L606

I think @andygrove has said in the past using DataFusion to mimic engines such as MySQL is a good usecase too

Copy link
Contributor Author

@returnString returnString Apr 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's an interesting approach - we'd have to be careful about our sqlparser => expr conversions, but that could be quite useful as a feature 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about it more over lunch: if we enable a bring-your-own-dialect setup, we'd need a decent testing strategy to support this. In this example, @someval will be parsed as UnaryOp { op: PGAbs, expr: Ident("someval") } for Postgres, so we'd need to decide how we implement DF-specific parsing overrides, e.g. as used to support this var provider system, on a per-dialect basis.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That does sound fairly complicated.... I personally / project wise don't have a need for a MySQL specific mimic (postgres is good enough for us in IOx) but I think perhaps we should let others weigh in here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is tricky for sure. I would be fine with having Postgres as the default and officially supported (well tested) dialect, while also allowing users to provide a dialect at their own risk (and have this be well documented) but even that might create an undue burden on maintainers. It would definitely be good to try and find out more about our user's requirements here.

Copy link
Contributor Author

@returnString returnString Apr 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to be totally clear, I'm definitely not married to the contents of this PR as they stand; if the consensus is that we do want to support multiple dialects, I'm happy to rework it towards that goal 🙂

If that's a route we want to explore in depth, we could perhaps build some sort of dialect config setup that allows for controlling parser overrides in DF, but that feels a bit too heavyweight if we don't have tonnes of use cases right now (and hence no-one to drive or own that work).

Here's my immediate idea:

  • revert the deletion of all the logical plan stuff here, and just retain the parser override deletion
  • make the dialect default to Postgres, but be configurable, with a doc warning mentioning the potential risks and edge cases
  • mention nullary functions as a replacement for scalar variables in SQL in the next set of release notes
  • random bonus thought: maybe expose a unary function to retrieve scalar variables by name from SQL?

Does that sound vaguely sensible?

Full disclosure: my goal here is to get a DataFusion-powered service queryable via BI tooling, and these tools often use the more esoteric features of the Postgres dialect to bootstrap their UIs (think table listings etc). I still have a fair bit of work to do on this, but I can at least sort of see a path to having it working now. I suspect this will be beneficial for lots of other use cases, but admittedly I don't have any evidence for that claim beyond intuition 😅

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make the dialect default to Postgres, but be configurable, with a doc warning mentioning the potential risks and edge cases

I think this makes sense

mention nullary functions as a replacement for scalar variables in SQL in the next set of release notes

👍

random bonus thought: maybe expose a unary function to retrieve scalar variables by name from SQL?

This also sounds good -- if you wrote it up as a ticket, I bet it is a good "first issue" type ticket for someone else in the community to do if they were looking for something to learn DataFusion code a bit

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the continued discussion and progress on this. I think the outcome is good. Just for context, I have built database gateways in the past that mimic MySQL and Hive protocols and dialects (different projects) and although I don't work on any projects these days that need either, I am just trying to keep options open for future users of DataFusion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I definitely wouldn't want to impose unnecessary constraints on people who did want to use other dialects :) If people are happy with the ideas in my last comment, I'll close this, get some new issues logged, and start tackling those.

@alamb alamb dismissed Dandandan’s stale review May 6, 2021 18:13

Removing approval as I think that @returnString is reworking the approach

@alamb alamb marked this pull request as draft May 6, 2021 18:13
@alamb
Copy link
Contributor

alamb commented Jun 2, 2021

Ping @returnString -- is do you mind if I close this PR (to clean up the PR review queue)?

@returnString
Copy link
Contributor Author

Yeah no worries, I need to set aside some time to write up the takeaways from the review thread as a ticket and followup PR, will close for now :) Thanks for chasing this up!

Dandandan pushed a commit to Dandandan/arrow-datafusion that referenced this pull request Sep 29, 2023
Dandandan added a commit that referenced this pull request Sep 29, 2023
Ted-Jiang pushed a commit to Ted-Jiang/arrow-datafusion that referenced this pull request Oct 7, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Change SQL dialect to PostgreSQL
4 participants