Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restrain JSON_TABLE table function parsing to MySqlDialect and AnsiDialect #1123

Closed
wants to merge 19 commits into from

Conversation

viirya
Copy link
Member

@viirya viirya commented Feb 5, 2024

JSON_TABLE is introduced in SQL:2016 and added into this parser by #1062. But it is not actually implemented by most SQL engines. Having it as general parser support may cause issue like apache/datafusion#9122. Maybe we should restrain JSON_TABLE to MySqlDialect for now.

Btw, the PR #1062 added the support also only wrote the test for mysql (tests/sqlparser_mysql.rs).

Comment on lines 7509 to 7511
} else if dialect_of!(self is MySqlDialect)
&& self.parse_keyword(Keyword::JSON_TABLE)
{
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only change is here. Other changes are from rustfmt.

Copy link
Contributor

@Jefffrey Jefffrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add a simple test to generic SQL dialect to ensure this behaviour?

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me -- thank you @viirya

Is there any chance you can revert the changes to derive/src/lib.rs in this PR?

cc @lovasoa

@@ -7506,7 +7506,7 @@ impl<'a> Parser<'a> {
with_offset,
with_offset_alias,
})
} else if self.parse_keyword(Keyword::JSON_TABLE) {
} else if dialect_of!(self is MySqlDialect) && self.parse_keyword(Keyword::JSON_TABLE) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the actual change, right? It seems like the changes to derive/src/lib.rs are unrelated

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sounds a little bit adhoc. Shouldn't this be a method on dialect ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb Yea, derive/src/lib.rs was changed by rustfmt. I restored it manually.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't have dialects for Oracle and DB2.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sounds a little bit adhoc. Shouldn't this be a method on dialect ?

This can be discussed separately, but this syntax follows existing ones.

@coveralls
Copy link

coveralls commented Feb 8, 2024

Pull Request Test Coverage Report for Build 7838315622

  • 0 of 8 (100.0%) changed or added relevant lines in 2 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.003%) to 87.668%

Totals Coverage Status
Change from base Build 7838299826: 0.003%
Covered Lines: 19579
Relevant Lines: 22333

💛 - Coveralls

@@ -7506,7 +7506,7 @@ impl<'a> Parser<'a> {
with_offset,
with_offset_alias,
})
} else if self.parse_keyword(Keyword::JSON_TABLE) {
} else if dialect_of!(self is MySqlDialect) && self.parse_keyword(Keyword::JSON_TABLE) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -7506,7 +7506,7 @@ impl<'a> Parser<'a> {
with_offset,
with_offset_alias,
})
} else if self.parse_keyword(Keyword::JSON_TABLE) {
} else if dialect_of!(self is MySqlDialect) && self.parse_keyword(Keyword::JSON_TABLE) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lovasoa
Copy link
Contributor

lovasoa commented Feb 8, 2024

JSON_TABLE is part of the SQL:2016 standard: https://modern-sql.com/blog/2017-06/whats-new-in-sql-2016
It is implemented in Db2, MariaDB, MySQL and Oracle.

I think it's ok to accept json_table as a table name for specific dialects where it has been tested to be accepted unquoted, but the default behavior should probably be to parse it as the json_table function, not as a table name.

In simpler cases, we may even be able to distinguish between the unquoted table name and the function call directly during parsing, and make everyone happy.

tests/sqlparser_common.rs Outdated Show resolved Hide resolved
@@ -7506,7 +7506,7 @@ impl<'a> Parser<'a> {
with_offset,
with_offset_alias,
})
} else if self.parse_keyword(Keyword::JSON_TABLE) {
} else if dialect_of!(self is MySqlDialect | AnsiDialect) && self.parse_keyword(Keyword::JSON_TABLE) {
Copy link
Member Author

@viirya viirya Feb 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although AnsiDialect is for SQL:2011 standard based on its comment, seems it makes sense to add it in supported dialect.

Copy link
Member Author

@viirya viirya Feb 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's ok to accept json_table as a table name for specific dialects where it has been tested to be accepted unquoted, but the default behavior should probably be to parse it as the json_table function, not as a table name.

It could be the default behavior for dialects which are likely to follow the standard to support the table function.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As there are only very few dialects supporting this, I think it makes more sense to not add it into GenericDialect currently.

@@ -7506,7 +7506,7 @@ impl<'a> Parser<'a> {
with_offset,
with_offset_alias,
})
} else if self.parse_keyword(Keyword::JSON_TABLE) {
} else if dialect_of!(self is MySqlDialect | AnsiDialect) && self.parse_keyword(Keyword::JSON_TABLE) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still don't think the dialect identity should be hardcoded. This should probably be a dialect.supports_json_table. And it should be true in the generic dialect.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree it would be nicer to do if self.dialect.supports_json_table() instead. However, I think we could also do this as a follow on PR as well

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see why it should be true in generic dialect. GenericDialect is not strictly SQL standard dialect (AnsiDialect is for the purpose). It is for syntax that most dialects use. Currently the table function is only supported by very few dialects.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree it would be nicer to do if self.dialect.supports_json_table() instead. However, I think we could also do this as a follow on PR as well

Agreed. As I mentioned early, this can be discussed separately. Although I wonder if it will result in many similar functions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although I wonder if it will result in many similar functions.

Yes I think it would. The codebase has a mix of styles currently -- both dialect_of! and trait methods

@alamb
Copy link
Contributor

alamb commented Feb 9, 2024

I believe CI is failing due to a new rust version being released (and thus new clippy lints). I will fix that separately

@alamb alamb requested a review from lovasoa February 9, 2024 01:06
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this PR looks good and follows the existing conventions.

I think @viirya has addressed all comments. @lovasoa please let us know if you disagree.

@alamb
Copy link
Contributor

alamb commented Feb 9, 2024

I merged up from main to pick up #1130

@alamb
Copy link
Contributor

alamb commented Feb 9, 2024

I think @lovasoa 's comment on apache/datafusion#9122 (comment) is worth considering (should we perhaps instead make the error message more helpful?)

@viirya
Copy link
Member Author

viirya commented Feb 9, 2024

Sounds good. Let me revise this to improve the error message for unsupported dialects.

Comment on lines 8424 to 8430
let parsed = all_dialects().parse_sql_statements("SELECT * FROM json_table");
assert_eq!(
ParserError::ParserError(
"Cannot specify a reserved keyword as identifier for table factor".to_string()
),
parsed.unwrap_err()
);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb I think this can have a more meaningful message to end users.

@viirya viirya changed the title Restrain JSON_TABLE to MySqlDialect Restrain JSON_TABLE to MySqlDialect and AnsiDialect Feb 9, 2024
Comment on lines 8424 to 8430
let parsed = all_dialects().parse_sql_statements("SELECT * FROM json_table");
assert_eq!(
ParserError::ParserError(
"Cannot specify a keyword as identifier for table factor".to_string()
),
parsed.unwrap_err()
);
Copy link
Member Author

@viirya viirya Feb 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And, even for the supported dialects, json_table cannot be used as table identifier too. Previously it cannot be too but the paring error looks confusing.

tests/sqlparser_bigquery.rs Outdated Show resolved Hide resolved
tests/sqlparser_bigquery.rs Outdated Show resolved Hide resolved
@viirya
Copy link
Member Author

viirya commented Feb 13, 2024

Hmm, there are more tests using keywords as identifiers...

@lovasoa
Copy link
Contributor

lovasoa commented Feb 13, 2024

I don't think banning using keywords as identifiers is a good idea. It will probably break a lot people's applications. Currently, keywords are defined globally in sqlparser, even though the list of supported keywords varies from database from database. Banning everything that is a keyword in one database from use as a keyword globally is probably not a good idea.

I recently had the case with MySQL, which supports returning as an identifier: select 1 as returning is valid in MySQL. Since RETURNING is also a keyword, that would break.
MS SQL Server also has a lot of keywords that only it uses. select 1 as open is valid everywhere, but not in mssql.

@viirya
Copy link
Member Author

viirya commented Feb 13, 2024

I don't think banning using keywords as identifiers is a good idea. It will probably break a lot people's applications.

I have concerns about this too. But I think this is what you first suggested (e.g., apache/datafusion#9122 (comment)), no?

@lovasoa
Copy link
Contributor

lovasoa commented Feb 13, 2024

What I was suggesting was just to close apache/datafusion#9122 as "working as intended" :)

Copy link
Member Author

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay. Keyword as identifier is only disallowed with ANSI mode now. Other dialects still can have keywords as identifiers for their custom usage.

I think there are disallowed cases of using keyword as identifier in individual dialects, ie., Each engine has its reserved keyword list.

We probably can consider to enhance dialect behavior regarding this, i.e., to create keyword list per dialect, in the future.

@lovasoa
Copy link
Contributor

lovasoa commented Feb 13, 2024

Keyword as identifier is only disallowed with ANSI

Does the ansi sql dialect still parse select 1 as open correctly, for instance?

@viirya
Copy link
Member Author

viirya commented Feb 13, 2024

Does the ansi sql dialect still parse select 1 as open correctly, for instance?

It still parses the sql, although open is in keyword list.
Currently it is only disallowed when using it as table identifier.

For column name etc., it is out of the range of the purpose of this PR, we can work on this later, I think.

@lovasoa
Copy link
Contributor

lovasoa commented Feb 14, 2024

Yes, the problem is, I think select * from open is valid ansi sql too

@lovasoa
Copy link
Contributor

lovasoa commented Feb 14, 2024

Another example is select * from view, which is valid in ansi sql, postgres, sqlite, mysql and mariadb, but would be refused if this pr were merged. The principle of banning keywords as table names before we have dialect-specific keyword lists will make sqlparser start to refuse a lot of valid sql queries it used to accept, even if done just for "ansi sql".

@alamb
Copy link
Contributor

alamb commented Feb 14, 2024

So I would like to suggest we take a step back here and review what we are trying to accomplish

What our DataFusion user hit, and I actually hit as well is quite confusing behavior.

Specifically, while it is fine to create a table named json_table, like this

DataFusion CLI v35.0.0
❯ create table json_table(x int);
0 rows in set. Query took 0.005 seconds.

If you try to query it you get a very confusing syntax error:

select * from json_table;  🤔 Invalid statement: sql parser error: Expected (, found: EOF

For better or worse I think using json_table as a table name is somewhat common.

I think the experience would have been far better if the error had instead looked like:

select * from json_table;  🤔 Invalid statement: sql parser error: Can not parse json_table clause. Expected (, found: EOF. Hint: to select from a table named json_table, use double quotes "json_table"

So my suggestion is:

  1. Leave the existing error behavior (as it which is SQL standard, as @lovasoa points out, however confusing that may be).
  2. Improve the error message (which I think is the root cause of the problem)

@lovasoa
Copy link
Contributor

lovasoa commented Feb 14, 2024

There is a third option, that I had suggested above, and that I think would be the best:

  1. whatever the dialect, parse json_table as a function in queries if and only if it is followed by (.

This way, select * from json_table would be accepted, and parsed as an unquoted table name, just as it was before adding support for the json_table function.

And select * from json_table(t, '$[*]' COLUMNS (id INT PATH '$.id')) would also be accepted, and parsed as a reference to the json_table function.

If I'm not mistaken, implementing that is just a matter of adding && self.peek_token() == LParen after the existing if condition that matches json_table.

@lovasoa
Copy link
Contributor

lovasoa commented Feb 14, 2024

I opened an alternative pr to this one here: #1134

@alamb
Copy link
Contributor

alamb commented Feb 14, 2024

I opened an alternative pr to this one here: #1134

Thank you @lovasoa -- I like #1134 a lot. It would be great to get some other opinions as well

@viirya viirya closed this Feb 14, 2024
@viirya
Copy link
Member Author

viirya commented Feb 14, 2024

Yes, the problem is, I think select * from open is valid ansi sql too

open is reserved keyword...

@viirya
Copy link
Member Author

viirya commented Feb 14, 2024

Another example is select * from view, which is valid in ansi sql, postgres, sqlite, mysql and mariadb, but would be refused if this pr were merged. The principle of banning keywords as table names before we have dialect-specific keyword lists will make sqlparser start to refuse a lot of valid sql queries it used to accept, even if done just for "ansi sql".

view is reserved keywords in SQL-92. You simplify the question but the actual situation is more complicated. The fact is reserved keyword list is changed over time and different on different query engines. sqlparser doesn't currently take care about this difference. That is the problem. We don't even state what SQL standard the ansi dialect follows. I agreed that we shouldn't block all keywords as it will break applications. But this is what you suggested at the beginning, right? Let me attach your comment in case you forgot it (apache/datafusion#9122 (comment)). In the comment, you stated that:

I think the JSON_TABLE function is part of the SQL standard. So SELECT * FROM json_table is not valid SQL, and datafusion is correct in refusing it.

Doesn't it mean you want to block the SQL syntax which not "valid" SQL?

I know you changed your mind later.

I just don't like to go forth and back on this PR. So I don't want to follow up this PR. Please move forward with your PR.

@lovasoa
Copy link
Contributor

lovasoa commented Feb 14, 2024

My initial comment was not suggesting to block more keywords. I was saying that the current behavior is reasonable.

Accepting even more valid syntax is better.

Blocking syntax that is valid in most databases and currently works in sqlparser is worse.

@lovasoa
Copy link
Contributor

lovasoa commented Feb 14, 2024

Anyway, I think we are all ok with #1134 in the end.

And your effort was not in vain, it ended up raising important points.

@viirya viirya deleted the restrain_json_table branch February 14, 2024 20:23
@alamb
Copy link
Contributor

alamb commented Feb 14, 2024

Thanks again @viirya and @lovasoa -- hopefully we can figure out how to improve the communication efficiency going forward -- I agree the solution we arrived at is good, but it took a of back and forth.

@viirya
Copy link
Member Author

viirya commented Feb 14, 2024

Thank you @alamb

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants