-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support quoted identifiers in Iceberg partitioning #12227
Support quoted identifiers in Iceberg partitioning #12227
Conversation
@@ -29,10 +29,13 @@ | |||
|
|||
public final class PartitionFields | |||
{ | |||
private static final String NAME = "[a-z_][a-z0-9_]*"; | |||
private static final String IDENTIFIER = "[[a-z]_][[a-z0-9]_]*"; | |||
private static final String QUOTED_IDENTIFIER = "(?:\"[^\"]*\")+"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a column name can contain a quotation mark itself ("
).
in SQL, it is denoted by repeating the character "a column with "" quotation mark"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was indeed only partially implemented. I have adjusted the regex and added additional test cases. Build failure doesn't seem related java.util.concurrent.TimeoutException: Idle timeout 5000 ms
. Build was green locally.
b0e3b42
to
c9f8fc8
Compare
Another place where the parsing of the partition field fails is the following:
|
c9f8fc8
to
0537951
Compare
|
The currently supported syntax would be:
Without the quotes we currently fail as it has to comply with the standard identifier regex: The reasoning behind would be that the array contains valid SQL strings that obey to the standard SQL spec. |
private static final String FUNCTION_ARGUMENT_NAME = "\\((" + NAME + ")\\)"; | ||
private static final String FUNCTION_ARGUMENT_NAME_AND_INT = "\\((" + NAME + "), *(\\d+)\\)"; | ||
private static final String IDENTIFIER = "[a-z_][a-z0-9_]*"; | ||
private static final String QUOTED_IDENTIFIER = "\"[^\"]*(?:(?:\"\")+[^\"]*)*\""; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a comment for each of the non-trivial regex patterns?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think with @findepi's great suggestion, that regex is now a lot simpler. It was indeed a bit convoluted.
I am not yet convinced we need quotes at all, for Do we envision the partition transforms to represent a language, e.g. have nested expression-like structure? |
From a user perspective I think the second one is a lot more obvious, clearly separating arguments in standard SQL syntax. Omitting the quotes will also not let us distinguish between quoted identifiers vs normal identifiers, which might get us in trouble once we truely support quoted identifiers. In the future I would say this is new feature, so it should conform to the SQL identifier specs as mentioned by @kasiafi in #11163 (comment) |
We don't need to. Partitioning specification is a mini-language and doesn't need to follow SQL identifier semantics (which are not as simple as they could be)
I agree. OTOH it's a fair price for putting commas in a column name. It's a bad idea and nothing will change that. Anyway, sans apostrophes it hurts my eyes, so yeah, let's go with quotes |
private static final String FUNCTION_ARGUMENT_NAME_AND_INT = "\\((" + NAME + "), *(\\d+)\\)"; | ||
private static final String IDENTIFIER = "[a-z_][a-z0-9_]*"; | ||
private static final String QUOTED_IDENTIFIER = "\"[^\"]*(?:(?:\"\")+[^\"]*)*\""; | ||
private static final String NAME = "\\s*(" + IDENTIFIER + "|" + QUOTED_IDENTIFIER + ")\\s*"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IDENTIFIER -> UNQUOTED_IDENTIFIER
NAME -> IDENTIFIER
private static final String FUNCTION_ARGUMENT_NAME = "\\((" + NAME + ")\\)"; | ||
private static final String FUNCTION_ARGUMENT_NAME_AND_INT = "\\((" + NAME + "), *(\\d+)\\)"; | ||
private static final String IDENTIFIER = "[a-z_][a-z0-9_]*"; | ||
private static final String QUOTED_IDENTIFIER = "\"[^\"]*(?:(?:\"\")+[^\"]*)*\""; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
private static final String QUOTED_IDENTIFIER = "\"[^\"]*(?:(?:\"\")+[^\"]*)*\""; | |
private static final String QUOTED_IDENTIFIER = "\"(?:\"\"|[^\"])*\""; |
0537951
to
5c085f2
Compare
@mdesmet So:
should fail, but:
should pass. I am only a bit concerned about the case. Before this change, only identifiers which were fully in lowercase would pass |
Good point.
I don't want SQL semantics here, let's be simpler. Let's require user to provide the exact same case as the column actually has (regardless of how it was created). -- should work
CREATE TABLE t(x bigint) WITH (partitioning = ARRAY['x']);
CREATE TABLE t(x bigint) WITH (partitioning = ARRAY['"x"']);
-- should work, the column is actually `x`, not `X`
CREATE TABLE t(X bigint) WITH (partitioning = ARRAY['x']);
-- should work, the column is actually `x`, not `X` (until #17)
CREATE TABLE t("X" bigint) WITH (partitioning = ARRAY['x']);
-- should fail, there is no column `X`. Until #17, the column is actually `x`.
CREATE TABLE t("X" bigint) WITH (partitioning = ARRAY['"X"']); |
I suggest that instead we require that users pass only lowercase names (through a check in Comparing case-sensitive seems like something we might not want to do right now. Currently, we resolve column names case-insensitive, so let's be consistent. |
This is effectively what i want. |
Noted conclusion under #12226 (comment) |
@mdesmet please add test cases as indicated in #12227 (comment) |
ba315b8
to
384baa2
Compare
private static final Pattern VOID_PATTERN = Pattern.compile("void" + FUNCTION_ARGUMENT_NAME); | ||
private static final String UNQUOTED_IDENTIFIER = "[a-zA-Z_][a-zA-Z0-9_]*"; | ||
// We only support lowercase quoted identifiers for now. | ||
// See https://github.com/trinodb/trino/issues/12226#issuecomment-1128839259 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Link to #17
please add -- or better: remove the comment here, leaving only the one at fromIdentifier
private static String fromIdentifier(String identifier) | ||
// Currently, all Iceberg columns are stored in lowercase in the Iceberg metadata files. | ||
// Unquoted identifiers are canonicalized to lowercase here which is not according ANSI SQL spec. | ||
// Quoted identifiers are restricted to lowercase only through the regex pattern. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the [^A-Z]
isn't sufficient for that.
What about Ą
?
simplify regex, and use .toLowerCase(ENGLISH)
in Java to verify parsed value is all-lower for now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regex has been simplified and verification is done using .toLowercase(ENGLISH)
.
@@ -625,7 +626,7 @@ public void setTableComment(ConnectorSession session, ConnectorTableHandle table | |||
public Optional<ConnectorTableLayout> getNewTableLayout(ConnectorSession session, ConnectorTableMetadata tableMetadata) | |||
{ | |||
Schema schema = toIcebergSchema(tableMetadata.getColumns()); | |||
PartitionSpec partitionSpec = parsePartitionFields(schema, getPartitioning(tableMetadata.getProperties())); | |||
PartitionSpec partitionSpec = createPartitionSpec(schema, getPartitioning(tableMetadata.getProperties())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When should i call createPartitionSpec
and when parsePartitionFields
?
they have similar names, and even more similar semantics.
Also, how does introduction of the wrapper (that throws TrinoException) related to adding quoted identifiers?
the parsing could fail even before the change, right? (eg unsupported transform, missing closing brace, etc)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed the wrapper and just catch the exception in parsePartitionFields
now and rethrow as TrinoException
. This makes the testing easier in BaseConnectorTest
as we are expecting TrinoException
there,
5577dab
to
80b5e05
Compare
Following test failed in last build. It's actually related to #12626. I have setup the product tests to run with
|
b95953f
to
2dd644f
Compare
@@ -151,7 +152,11 @@ | |||
|
|||
public final class IcebergUtil | |||
{ | |||
private static final Pattern SIMPLE_NAME = Pattern.compile("[a-z][a-z0-9]*"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like SIMPLE_NAME tries to achieve the same as UNQUOTED_IDENTIFIER but not exactly following SQL semantics. Changing it had some impacts in exception checking in the tests.
@@ -12,5 +12,8 @@ connector.name=iceberg | |||
hive.metastore.uri=thrift://localhost:9083 | |||
hive.hdfs.socks-proxy=localhost:1180 | |||
|
|||
# Ensure test retries don't write to non-empty locations |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the reasoning behind adding this setting for the DevelopmentServer Iceberg connector?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my comment #12227 (comment)
We can never really ensure that all cleanup (finally stuff) is run eg. For example in case of Hive timeouts some table location might not be emptied, so I think this is the best way to handle this as this ensures every table will have its unique location. This can be moved to a separate PR if necessary.
onSpark().executeQuery(format( | ||
"CREATE TABLE %s (id INTEGER, `mIxEd_COL` STRING) USING ICEBERG", | ||
sparkTableName)); | ||
assertQueryFailure(() -> onTrino().executeQuery("ALTER TABLE " + trinoTableName + " SET PROPERTIES partitioning = ARRAY['mIxEd_COL']")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bummer.
We should provide in Trino a way for the users to cope with such situations because otherwise the users would face a Spark lock-in for such situations.
I've created a PR in iceberg to address this problem. apache/iceberg#5110
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To give you some context. The original PR I did handle this correctly. According SQL quoted identifier semantics, a quoted identifier should be matched case sensitively. The backticks in Spark behave the same as the quoted identifiers in SQL. IMHO this is not an issue with Iceberg.
There was some discussion about this feature as indeed in Trino the column names are converted into lowercase. Take for example this query
CREATE TABLE t("X" bigint) WITH (partitioning = ARRAY['"X"']);
Because the column name is converted to lowercase in Trino, this query would fail, as at that time the "X" has become x and the partitioning parsing logic fails to find this column. This is definitely confusing for the user. In a ALTER TABLE scenario however this is not true. The column will be known as "X". So we agreed on blocking this scenario for now, as mentioned on #12226 (comment)
Here is the code that explicitly blocks this, removing the lowercase verification would fix the Trino query above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO this is not an issue with Iceberg.
I think that the PR apache/iceberg#5110 is more of a usability "improvement" and not a bug.
Please correct me if I'm wrong, I don't think it should be mandatory to specify the source column name in the same case in the table definition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we better stick to SQL semantics.
Imagine following query perfectly valid syntax (not currently working in Trino, but actually working on snowflake):
CREATE TABLE t("X" bigint, "x" bigint) WITH (partitioning = ARRAY['"X"']);
Because of the quoted identifers we would know which x to match. I would think Iceberg partitioning spec should be exactly matched against the Iceberg schema and not impose a certain way of working.
This apparently also rebased on current master. |
The build failed because of the refactor of To summarise the changes:
Let me know what you think. |
What kind of problem is this fixing? (btw this will become default in #12941, so we don't want to set this explicitly in product tests) |
With #12626, we throw an exception when trying to create a table and files exist on that location. Sometimes tests fail randomly and are retried without paths being cleaned. In this case
Anyway if this setting becomes default (which I definitely support), this is not an issue anymore. Will remove that commit. |
2dd644f
to
89b4d46
Compare
@@ -312,12 +317,38 @@ public static String quotedTableName(SchemaTableName name) | |||
|
|||
private static String quotedName(String name) | |||
{ | |||
if (SIMPLE_NAME.matcher(name).matches()) { | |||
if (UNQUOTED_IDENTIFIER_PATTERN.matcher(name).matches()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That changes semantics of the method.
previously, for My_Table
we would output "My_Table"
.
now we output My_Table
without quotes.
if the table name is actually My_Table, it needs to be referenced as "My_Table"
in SQL,
so the output of this command no longer can be pasted into SQL.
Please revert the change here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You indeed point out a bug, that also applies to the fromColumnToIdentifier
method. It is however the same semantics: we are taking something from metadata (a column or table name), and need to ensure that it can be pasted in an SQL editor, respecting SQL identifier semantics.
return name; | ||
} | ||
return '"' + name.replace("\"", "\"\"") + '"'; | ||
} | ||
|
||
public static String fromColumnToIdentifier(String column) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The method introduced here is unused (in the commit which introduced it), and it's unclear what's the context in which is should be used.
Squash the changes with the next commit.
return quotedName(column); | ||
} | ||
|
||
public static String fromIdentifierToColumn(String identifier) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The method introduced here is unused (in the commit which introduced it), and it's unclear what's the context in which is should be used.
Squash the changes with the next commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Later it becomes used in PartitionFields
which is the context in which it's comprehensible.
(otherwise the "identifier" will be misleading, as normally String
identifier should not be expected to have any quotes inside (or the quotes be treated literal).
Move the method to PartitionFields
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The placement in IcebergUtils
had been discussed in #12872, the need to parse quoted identifiers also exist in other table properties (sort_order, orc_bloom_filter, ...).
89b4d46
to
5675f13
Compare
(cannot comment at #12227 (comment))
I am aware of that PR.
It's important for a shared method to have an easy to understand semantics (name, input and output types need to intuitively hint and what it does). |
5675f13
to
1ea29fc
Compare
@mdesmet failures seem related. |
e69642d
to
95d16a9
Compare
95d16a9
to
18cd9b3
Compare
I have rebased with latest master and resolved the issues. |
Description
Adds support for quoted identifiers in Iceberg partitioning.
Trino Iceberg allows tables to be created using quoted identifiers.
CREATE TABLE test AS SELECT 1 as "a quoted identifier";
However when a partitioning property is added these columns can't be declared.
CREATE TABLE test WITH(partitioning=ARRAY['a quoted identifier']) ...
fails with errorInvalid partition field declaration: a quoted identifier
Fix
Change to Iceberg connector
Related issues, pull requests, and links
resources. For example:
Documentation
( ) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.
Release notes
( ) No release notes entries required.
( ) Release notes entries required with the following suggested text: