Support quoted identifiers in Iceberg partitioning #12227

Contributor

@mdesmet mdesmet commented May 3, 2022

Description

Adds support for quoted identifiers in Iceberg partitioning.

Trino Iceberg allows tables to be created using quoted identifiers.

CREATE TABLE test AS SELECT 1 as "a quoted identifier";

However, when a partitioning property is added, these columns can't be declared.

CREATE TABLE test WITH(partitioning=ARRAY['a quoted identifier']) ... fails with error Invalid partition field declaration: a quoted identifier

Is this change a fix, improvement, new feature, refactoring, or other?

Fix

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)

Change to Iceberg connector

How would you describe this change to a non-technical end user or system administrator?

Related issues, pull requests, and links

Documentation

( ) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

( ) No release notes entries required.
( ) Release notes entries required with the following suggested text:

# Section
* Fix some things. ({issue}`issuenumber`)

@cla-bot cla-bot bot added the cla-signed label May 3, 2022
@@ -29,10 +29,13 @@

public final class PartitionFields
{
private static final String NAME = "[a-z_][a-z0-9_]*";
private static final String IDENTIFIER = "[[a-z]_][[a-z0-9]_]*";
private static final String QUOTED_IDENTIFIER = "(?:\"[^\"]*\")+";
Member

A column name can itself contain a quotation mark (").
In SQL, this is denoted by repeating the character: "a column with "" quotation mark".

Contributor Author

It was indeed only partially implemented. I have adjusted the regex and added additional test cases. The build failure doesn't seem related (java.util.concurrent.TimeoutException: Idle timeout 5000 ms); the build was green locally.

@mdesmet mdesmet force-pushed the feature/iceberg-partitioning-quoted-identifiers branch from b0e3b42 to c9f8fc8 Compare May 4, 2022 15:43
@findinpath findinpath added the bug Something isn't working label May 7, 2022
@findinpath
Contributor

Another place where the parsing of the partition field fails is the following:

CREATE TABLE iceberg.default.testp3  WITH (partitioning = ARRAY['truncate(name   , 1)']) AS SELECT * FROM tpch.sf1.nation WHERE nationkey < 10;
Query 20220506_135829_00008_wqaig failed: Invalid partition field declaration: truncate(name   , 1)
java.lang.IllegalArgumentException: Invalid partition field declaration: truncate(name   , 1)
	at io.trino.plugin.iceberg.PartitionFields.parsePartitionField(PartitionFields.java:73)
	at io.trino.plugin.iceberg.PartitionFields.parsePartitionFields(PartitionFields.java:54)
	at io.trino.plugin.iceberg.IcebergMetadata.getNewTableLayout(IcebergMetadata.java:542)
	at io.trino.plugin.base.classloader.ClassLoaderSafeConnectorMetadata.getNewTableLayout(ClassLoaderSafeConnectorMetadata.java:118)

@findinpath findinpath self-requested a review May 8, 2022 05:35
@mdesmet mdesmet force-pushed the feature/iceberg-partitioning-quoted-identifiers branch from c9f8fc8 to 0537951 Compare May 8, 2022 11:57
@findepi
Member

findepi commented May 9, 2022

cc @electrum @phd3 @alexjo2144

@findinpath
Contributor

trino> CREATE TABLE iceberg.default.test AS SELECT 1 as "a quoted identifier";
CREATE TABLE: 1 row

Query 20220509_104608_00015_idcj4, FINISHED, 1 node
Splits: 19 total, 19 done (100.00%)
1.45 [0 rows, 0B] [0 rows/s, 0B/s]

trino> CREATE TABLE iceberg.default.test2 WITH(partitioning=ARRAY['a quoted identifier']) as select * from iceberg.default.test;
Query 20220509_104756_00019_idcj4 failed: Invalid partition field declaration: a quoted identifier
java.lang.IllegalArgumentException: Invalid partition field declaration: a quoted identifier
	at io.trino.plugin.iceberg.PartitionFields.parsePartitionField(PartitionFields.java:92)
	at io.trino.plugin.iceberg.PartitionFields.parsePartitionFields(PartitionFields.java:57)
	at io.trino.plugin.iceberg.IcebergMetadata.getNewTableLayout(IcebergMetadata.java:538)
	at io.trino.plugin.base.classloader.ClassLoaderSafeConnectorMetadata.getNewTableLayout(ClassLoaderSafeConnectorMetadata.java:118)
	at io.trino.metadata.MetadataManager.getNewTableLayout(MetadataManager.java:830)
	at io.trino.sql.analyzer.StatementAnalyzer$Visitor.visitCreateTableAsSelect(StatementAnalyzer.java:853)
	at io.trino.sql.analyzer.StatementAnalyzer$Visitor.visitCreateTableAsSelect(StatementAnalyzer.java:404)
	at io.trino.sql.tree.CreateTableAsSelect.accept(CreateTableAsSelect.java:96)
	at io.trino.sql.tree.AstVisitor.process(AstVisitor.java:27)
	at io.trino.sql.analyzer.StatementAnalyzer$Visitor.process(StatementAnalyzer.java:421)
	at io.trino.sql.analyzer.StatementAnalyzer.analyze(StatementAnalyzer.java:384)
	at io.trino.sql.analyzer.Analyzer.analyze(Analyzer.java:79)
	at io.trino.sql.analyzer.Analyzer.analyze(Analyzer.java:71)
	at io.trino.execution.SqlQueryExecution.analyze(SqlQueryExecution.java:269)
	at io.trino.execution.SqlQueryExecution.<init>(SqlQueryExecution.java:193)
	at io.trino.execution.SqlQueryExecution$SqlQueryExecutionFactory.createQueryExecution(SqlQueryExecution.java:808)
	at io.trino.dispatcher.LocalDispatchQueryFactory.lambda$createDispatchQuery$0(LocalDispatchQueryFactory.java:135)
	at io.trino.$gen.Trino_380_3_g0537951____20220509_104303_2.call(Unknown Source)
	at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:131)

@mdesmet
Contributor Author

mdesmet commented May 9, 2022

trino> CREATE TABLE iceberg.default.test AS SELECT 1 as "a quoted identifier";
CREATE TABLE: 1 row

Query 20220509_104608_00015_idcj4, FINISHED, 1 node
Splits: 19 total, 19 done (100.00%)
1.45 [0 rows, 0B] [0 rows/s, 0B/s]

trino> CREATE TABLE iceberg.default.test2 WITH(partitioning=ARRAY['a quoted identifier']) as select * from iceberg.default.test;
Query 20220509_104756_00019_idcj4 failed: Invalid partition field declaration: a quoted identifier

The currently supported syntax would be:

CREATE TABLE iceberg.default.test2 WITH(partitioning=ARRAY['"a quoted identifier"']) as select * from iceberg.default.test;

Without the quotes we currently fail, as the name has to comply with the standard identifier regex: [a-z_][a-z0-9_]*. According to the SQL spec, it should be matched case-insensitively. I think that's not yet implemented.

The reasoning is that the array contains valid SQL strings that obey the standard SQL spec.

private static final String FUNCTION_ARGUMENT_NAME = "\\((" + NAME + ")\\)";
private static final String FUNCTION_ARGUMENT_NAME_AND_INT = "\\((" + NAME + "), *(\\d+)\\)";
private static final String IDENTIFIER = "[a-z_][a-z0-9_]*";
private static final String QUOTED_IDENTIFIER = "\"[^\"]*(?:(?:\"\")+[^\"]*)*\"";
Member

Can you add a comment for each of the non-trivial regex patterns?

Contributor Author

I think with @findepi's great suggestion, that regex is now a lot simpler. It was indeed a bit convoluted.

@findepi
Member

findepi commented May 10, 2022

I am not yet convinced we need quotes at all, for partitioning values.
For example, month(some date) can be interpreted unambiguously (and so can be month(some column with unusual characters: )(,(()).

Do we envision the partition transforms to represent a language, e.g. have nested expression-like structure?

@mdesmet
Contributor Author

mdesmet commented May 10, 2022

I am not yet convinced we need quotes at all, for partitioning values. For example, month(some date) can be interpreted unambiguously (and so can be month(some column with unusual characters: )(,(()).

Do we envision the partition transforms to represent a language, e.g. have nested expression-like structure?

truncate(test, 12, 12) vs truncate("test, 12", 12)

From a user perspective I think the second one is a lot more obvious, clearly separating arguments in standard SQL syntax.

Omitting the quotes will also not let us distinguish between quoted identifiers and normal identifiers, which might get us into trouble once we truly support quoted identifiers. In the future, SELECT 1 as "TEST" may not match SELECT 1 as "test" if we comply with the SQL spec (currently Trino converts all columns to lowercase).

I would say this is a new feature, so it should conform to the SQL identifier specs as mentioned by @kasiafi in #11163 (comment)

@findepi
Member

findepi commented May 11, 2022

Omitting the quotes will also not let us distinguish between quoted identifiers and normal identifiers, which might get us into trouble once we truly support quoted identifiers.

We don't need to. Partitioning specification is a mini-language and doesn't need to follow SQL identifier semantics (which are not as simple as they could be)

truncate(test, 12, 12) vs truncate("test, 12", 12)

From a user perspective I think the second one is a lot more obvious, clearly separating arguments in standard SQL syntax.

I agree. OTOH it's a fair price for putting commas in a column name. It's a bad idea and nothing will change that.

Anyway, sans apostrophes it hurts my eyes, so yeah, let's go with quotes
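
To make the trade-off above concrete, here is a minimal sketch of how quoting lets a parser separate a column name that contains a comma from the transform's integer argument. The class name, the pattern, and the helper methods are assumptions made for illustration; this is not the actual PartitionFields code from the PR.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TruncateFieldSketch
{
    private static final String UNQUOTED = "[a-z_][a-z0-9_]*";
    private static final String QUOTED = "\"(?:\"\"|[^\"])*\"";
    // Hypothetical pattern for this sketch only: truncate(<column>, <width>)
    private static final Pattern TRUNCATE = Pattern.compile(
            "truncate\\(\\s*(" + UNQUOTED + "|" + QUOTED + ")\\s*,\\s*(\\d+)\\s*\\)");

    private TruncateFieldSketch() {}

    // Simple holder for the parsed column name and truncation width
    static final class TruncateField
    {
        final String column;
        final int width;

        TruncateField(String column, int width)
        {
            this.column = column;
            this.width = width;
        }
    }

    // Returns the parsed field, or empty when the declaration does not match
    static Optional<TruncateField> parse(String declaration)
    {
        Matcher matcher = TRUNCATE.matcher(declaration);
        if (!matcher.matches()) {
            return Optional.empty();
        }
        return Optional.of(new TruncateField(unquote(matcher.group(1)), Integer.parseInt(matcher.group(2))));
    }

    // Strips the surrounding quotes and collapses doubled quotes; unquoted names pass through
    static String unquote(String name)
    {
        if (name.startsWith("\"")) {
            return name.substring(1, name.length() - 1).replace("\"\"", "\"");
        }
        return name;
    }
}
```

With this sketch, truncate("test, 12", 12) parses to the column `test, 12` with width 12, while truncate(test, 12, 12) is rejected because the unquoted name cannot contain a comma.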

@findepi findepi added enhancement New feature or request and removed syntax-needs-review bug Something isn't working labels May 11, 2022
private static final String FUNCTION_ARGUMENT_NAME_AND_INT = "\\((" + NAME + "), *(\\d+)\\)";
private static final String IDENTIFIER = "[a-z_][a-z0-9_]*";
private static final String QUOTED_IDENTIFIER = "\"[^\"]*(?:(?:\"\")+[^\"]*)*\"";
private static final String NAME = "\\s*(" + IDENTIFIER + "|" + QUOTED_IDENTIFIER + ")\\s*";
Member

IDENTIFIER -> UNQUOTED_IDENTIFIER
NAME -> IDENTIFIER

private static final String FUNCTION_ARGUMENT_NAME = "\\((" + NAME + ")\\)";
private static final String FUNCTION_ARGUMENT_NAME_AND_INT = "\\((" + NAME + "), *(\\d+)\\)";
private static final String IDENTIFIER = "[a-z_][a-z0-9_]*";
private static final String QUOTED_IDENTIFIER = "\"[^\"]*(?:(?:\"\")+[^\"]*)*\"";
Member

Suggested change
private static final String QUOTED_IDENTIFIER = "\"[^\"]*(?:(?:\"\")+[^\"]*)*\"";
private static final String QUOTED_IDENTIFIER = "\"(?:\"\"|[^\"])*\"";
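
As a quick check of the simplified pattern in the suggestion above, the following sketch compiles it and tests a few inputs. The class and method names are hypothetical; only the regex itself comes from the suggested change.

```java
import java.util.regex.Pattern;

final class QuotedIdentifierRegexDemo
{
    // Pattern from the suggested change: an opening quote, then any number of
    // either an escaped quote ("") or a non-quote character, then a closing quote.
    static final Pattern QUOTED_IDENTIFIER = Pattern.compile("\"(?:\"\"|[^\"])*\"");

    private QuotedIdentifierRegexDemo() {}

    // True when the whole input is a single well-formed quoted identifier
    static boolean isQuotedIdentifier(String input)
    {
        return QUOTED_IDENTIFIER.matcher(input).matches();
    }

    public static void main(String[] args)
    {
        String[] samples = {
                "\"a quoted identifier\"",
                "\"a column with \"\" quotation mark\"",
                "\"unbalanced \" quote\"",
                "no quotes",
        };
        for (String sample : samples) {
            System.out.println(sample + " -> " + isQuotedIdentifier(sample));
        }
    }
}
```

The first two samples match; the third fails because a lone quotation mark inside the identifier is not doubled, and the fourth fails because it is not quoted at all.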

@mdesmet mdesmet force-pushed the feature/iceberg-partitioning-quoted-identifiers branch from 0537951 to 5c085f2 Compare May 11, 2022 16:38
@kasiafi
Member

kasiafi commented May 15, 2022

@mdesmet
I agree that it is reasonable to require quotes around any identifier which does not follow the pattern "[a-z_][a-z0-9_]*". I think that this will seem familiar to a Trino user, as this is how we handle identifiers in queries.

So:

CREATE TABLE iceberg.default.test2 WITH(partitioning=ARRAY['a quoted identifier']) as select * from iceberg.default.test;

should fail, but:

CREATE TABLE iceberg.default.test2 WITH(partitioning=ARRAY['"a quoted identifier"']) as select * from iceberg.default.test;

should pass.

I am only a bit concerned about the case. Before this change, only identifiers which were fully in lowercase would pass parsePartitionField(). After this change, also uppercase and mixed-case strings will pass, and then possibly fail later(?). Could you please add a test with partitioning on "X" while there is column x?
If we want to enforce that users pass lowercased names, we could check the case in parsePartitionField(). If we want to canonicalize for them, we could add lowercasing in toIdentifier().

@findepi
Member

findepi commented May 16, 2022

I am only a bit concerned about the case. Before this change, only identifiers which were fully in lowercase would pass parsePartitionField(). After this change, also uppercase and mixed-case strings will pass,

Good point.
While we want to support non-lowercase identifiers in the future (#17), we don't want to be breaking backwards compatibility when doing so. We should allow only lowercase identifiers for now.

If we want to enforce that users pass lowercased names, we could check the case in parsePartitionField(). If we want to canonicalize for them, we could add lowercasing in toIdentifier().

I don't want SQL semantics here, let's be simpler. Let's require the user to provide the exact same case as the column actually has (regardless of how it was created).

-- should work
CREATE TABLE t(x bigint) WITH (partitioning = ARRAY['x']);
CREATE TABLE t(x bigint) WITH (partitioning = ARRAY['"x"']);

-- should work, the column is actually `x`, not `X`
CREATE TABLE t(X bigint) WITH (partitioning = ARRAY['x']);

-- should work, the column is actually `x`, not `X` (until #17)
CREATE TABLE t("X" bigint) WITH (partitioning = ARRAY['x']); 

-- should fail, there is no column `X`. Until #17, the column is actually `x`.
CREATE TABLE t("X" bigint) WITH (partitioning = ARRAY['"X"']); 

@kasiafi
Member

kasiafi commented May 16, 2022

I don't want SQL semantics here, let's be simpler. Let's require the user to provide the exact same case as the column actually has (regardless of how it was created).

I suggest that instead we require that users pass only lowercase names (through a check in parsePartitionField()). This solution will be the closest to the current semantics.

Comparing case-sensitively seems like something we might not want to do right now. Currently, we resolve column names case-insensitively, so let's be consistent.

@findepi
Member

findepi commented May 17, 2022

I suggest that instead we require that users pass only lowercase names (through a check in parsePartitionField()). This solution will be the closest to the current semantics.

This is effectively what i want.

@findepi
Member

findepi commented May 17, 2022

Noted conclusion under #12226 (comment)

@findepi
Member

findepi commented May 17, 2022

@mdesmet please add test cases as indicated in #12227 (comment)

@mdesmet mdesmet force-pushed the feature/iceberg-partitioning-quoted-identifiers branch 2 times, most recently from ba315b8 to 384baa2 Compare May 18, 2022 20:20
private static final Pattern VOID_PATTERN = Pattern.compile("void" + FUNCTION_ARGUMENT_NAME);
private static final String UNQUOTED_IDENTIFIER = "[a-zA-Z_][a-zA-Z0-9_]*";
// We only support lowercase quoted identifiers for now.
// See https://github.com/trinodb/trino/issues/12226#issuecomment-1128839259
Member

Link to #17

please add -- or better: remove the comment here, leaving only the one at fromIdentifier

private static String fromIdentifier(String identifier)
// Currently, all Iceberg columns are stored in lowercase in the Iceberg metadata files.
// Unquoted identifiers are canonicalized to lowercase here which is not according ANSI SQL spec.
// Quoted identifiers are restricted to lowercase only through the regex pattern.
Member

the [^A-Z] isn't sufficient for that.
What about Ą?

simplify regex, and use .toLowerCase(ENGLISH) in Java to verify parsed value is all-lower for now
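
The point about Ą can be illustrated with a small sketch: a character class that only excludes A-Z still accepts non-ASCII uppercase letters, whereas comparing the value with its lowercased form rejects them. The class and method names here are hypothetical and are not taken from the PR.

```java
import java.util.Locale;
import java.util.regex.Pattern;

final class LowercaseCheckSketch
{
    // Regex-only restriction: rejects ASCII uppercase letters and nothing else
    private static final Pattern NO_ASCII_UPPERCASE = Pattern.compile("[^A-Z]*");

    private LowercaseCheckSketch() {}

    static boolean passesRegexOnly(String name)
    {
        return NO_ASCII_UPPERCASE.matcher(name).matches();
    }

    // A name counts as all-lowercase when lowercasing it changes nothing
    static boolean isAllLowercase(String name)
    {
        return name.equals(name.toLowerCase(Locale.ENGLISH));
    }

    public static void main(String[] args)
    {
        // U+0104 is LATIN CAPITAL LETTER A WITH OGONEK
        String name = "\u0104";
        System.out.println("regex only: " + passesRegexOnly(name));
        System.out.println("toLowerCase check: " + isAllLowercase(name));
    }
}
```

Here the regex-only check accepts the uppercase Ą, while the toLowerCase(ENGLISH) comparison rejects it.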

Contributor Author

The regex has been simplified and verification is done using .toLowerCase(ENGLISH).

@@ -625,7 +626,7 @@ public void setTableComment(ConnectorSession session, ConnectorTableHandle table
public Optional<ConnectorTableLayout> getNewTableLayout(ConnectorSession session, ConnectorTableMetadata tableMetadata)
{
Schema schema = toIcebergSchema(tableMetadata.getColumns());
- PartitionSpec partitionSpec = parsePartitionFields(schema, getPartitioning(tableMetadata.getProperties()));
+ PartitionSpec partitionSpec = createPartitionSpec(schema, getPartitioning(tableMetadata.getProperties()));
Member

When should I call createPartitionSpec and when parsePartitionFields?
They have similar names, and even more similar semantics.

Also, how does the introduction of the wrapper (that throws TrinoException) relate to adding quoted identifiers?
The parsing could fail even before the change, right? (e.g. unsupported transform, missing closing brace, etc.)

Contributor Author

I removed the wrapper and now just catch the exception in parsePartitionFields and rethrow it as a TrinoException. This makes testing easier in BaseConnectorTest, as we expect a TrinoException there.

@mdesmet mdesmet force-pushed the feature/iceberg-partitioning-quoted-identifiers branch 2 times, most recently from 5577dab to 80b5e05 Compare June 19, 2022 20:29
@mdesmet
Contributor Author

mdesmet commented Jun 20, 2022

The following test failed in the last build. It's actually related to #12626. I have set up the product tests to run with iceberg.unique-table-location=true to avoid having retries write to the same location.

TestIcebergHiveViewsCompatibility > testIcebergHiveViewsCompatibility [groups: storage_formats, hms_only, iceberg]
java.sql.SQLException: Query failed (#20220619_213321_01205_fz3sq): Cannot create a table on a non-empty location: hdfs://hadoop-master:9000/user/hive/warehouse/iceberg_table, set 'iceberg.unique-table-location=true' in your Iceberg catalog properties to use unique table locations for every table.

@mdesmet mdesmet force-pushed the feature/iceberg-partitioning-quoted-identifiers branch from b95953f to 2dd644f Compare June 20, 2022 17:37
@@ -151,7 +152,11 @@

public final class IcebergUtil
{
private static final Pattern SIMPLE_NAME = Pattern.compile("[a-z][a-z0-9]*");
Contributor Author

It seems like SIMPLE_NAME tries to achieve the same as UNQUOTED_IDENTIFIER but not exactly following SQL semantics. Changing it had some impacts in exception checking in the tests.

@mdesmet mdesmet requested a review from findepi June 21, 2022 06:38
@@ -12,5 +12,8 @@ connector.name=iceberg
hive.metastore.uri=thrift://localhost:9083
hive.hdfs.socks-proxy=localhost:1180

# Ensure test retries don't write to non-empty locations
Contributor

What is the reasoning behind adding this setting for the DevelopmentServer Iceberg connector?

Contributor Author

@mdesmet mdesmet Jun 22, 2022

See my comment #12227 (comment)

We can never really ensure that all cleanup (the finally blocks) runs. For example, in case of Hive timeouts some table location might not be emptied, so I think this is the best way to handle it, as it ensures every table will have its unique location. This can be moved to a separate PR if necessary.

onSpark().executeQuery(format(
"CREATE TABLE %s (id INTEGER, `mIxEd_COL` STRING) USING ICEBERG",
sparkTableName));
assertQueryFailure(() -> onTrino().executeQuery("ALTER TABLE " + trinoTableName + " SET PROPERTIES partitioning = ARRAY['mIxEd_COL']"))
Contributor

This is a bummer.
We should provide in Trino a way for the users to cope with such situations because otherwise the users would face a Spark lock-in for such situations.

I've created a PR in iceberg to address this problem. apache/iceberg#5110

Contributor Author

@mdesmet mdesmet Jun 22, 2022

To give you some context: the original PR I made handled this correctly. According to SQL quoted identifier semantics, a quoted identifier should be matched case-sensitively. The backticks in Spark behave the same as quoted identifiers in SQL. IMHO this is not an issue with Iceberg.

There was some discussion about this feature as indeed in Trino the column names are converted into lowercase. Take for example this query

CREATE TABLE t("X" bigint) WITH (partitioning = ARRAY['"X"']); 

Because the column name is converted to lowercase in Trino, this query would fail, as by that time "X" has become x and the partitioning parsing logic fails to find this column. This is definitely confusing for the user. In an ALTER TABLE scenario, however, this is not true: the column will be known as "X". So we agreed on blocking this scenario for now, as mentioned in #12226 (comment)

Here is the code that explicitly blocks this, removing the lowercase verification would fix the Trino query above.

https://github.com/trinodb/trino/blob/2dd644fc00c636ecae168c81644761c44101327d/plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergUtil.java#L336-L343

Contributor

IMHO this is not an issue with Iceberg.

I think that the PR apache/iceberg#5110 is more of a usability "improvement" and not a bug.

Please correct me if I'm wrong, I don't think it should be mandatory to specify the source column name in the same case in the table definition.

Contributor Author

I think we better stick to SQL semantics.

Imagine the following query, which is perfectly valid syntax (it does not currently work in Trino, but it does work on Snowflake):

CREATE TABLE t("X" bigint, "x" bigint) WITH (partitioning = ARRAY['"X"']); 

Because of the quoted identifiers we would know which x to match. I think the Iceberg partitioning spec should be matched exactly against the Iceberg schema and should not impose a certain way of working.

@findepi
Member

findepi commented Jun 22, 2022

@mdesmet mdesmet force-pushed the feature/iceberg-partitioning-quoted-identifiers branch from b95953f to 2dd644f 2 days ago

This apparently also rebased on current master.
What has changed within this PR?

@mdesmet
Contributor Author

mdesmet commented Jun 22, 2022

This apparently also rebased on current master.
What has changed within this PR?

The build failed because of the refactoring of SIMPLE_NAME in IcebergUtil and its impact on the exception in a few tests.

To summarise the changes:

  • Split commits further
  • Moved the parsing logic for quoted and unquoted identifiers to IcebergUtil so it can easily be reused in other code.
  • Catch exceptions in parsePartitionFields instead of in the extra method.
  • Fixed flakiness of tests since "Prevent table creation on non-empty location for Iceberg tables" (#12626), which also failed the build once, by setting iceberg.unique-table-location in product tests.

Let me know what you think.

@findepi
Member

findepi commented Jun 24, 2022

What kind of problem is this fixing?

(btw this will become default in #12941, so we don't want to set this explicitly in product tests)

@mdesmet
Contributor Author

mdesmet commented Jun 25, 2022

What kind of problem is this fixing?

With #12626, we throw an exception when trying to create a table and files already exist at that location. Sometimes tests fail randomly and are retried without the paths being cleaned. In this case testIcebergHiveViewsCompatibility failed and was retried. A random table suffix would also have fixed that issue.

TestIcebergHiveViewsCompatibility > testIcebergHiveViewsCompatibility [groups: storage_formats, hms_only, iceberg]
java.sql.SQLException: Query failed (#20220619_213321_01205_fz3sq): Cannot create a table on a non-empty location: hdfs://hadoop-master:9000/user/hive/warehouse/iceberg_table, set 'iceberg.unique-table-location=true' in your Iceberg catalog properties to use unique table locations for every table.

(btw this will become default in #12941, so we don't want to set this explicitly in product tests)

Anyway if this setting becomes default (which I definitely support), this is not an issue anymore. Will remove that commit.

@mdesmet mdesmet force-pushed the feature/iceberg-partitioning-quoted-identifiers branch from 2dd644f to 89b4d46 Compare June 26, 2022 19:09
@@ -312,12 +317,38 @@ public static String quotedTableName(SchemaTableName name)

private static String quotedName(String name)
{
- if (SIMPLE_NAME.matcher(name).matches()) {
+ if (UNQUOTED_IDENTIFIER_PATTERN.matcher(name).matches()) {
Member

That changes semantics of the method.

previously, for My_Table we would output "My_Table".
now we output My_Table without quotes.

if the table name is actually My_Table, it needs to be referenced as "My_Table" in SQL,
so the output of this command no longer can be pasted into SQL.

Please revert the change here
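
The semantic change described above can be reproduced with a small sketch. It assumes that UNQUOTED_IDENTIFIER_PATTERN is compiled from the case-insensitive expression [a-zA-Z_][a-zA-Z0-9_]* shown earlier in the diff, and it passes the pattern as a parameter purely for illustration; the real quotedName takes only the name.

```java
import java.util.regex.Pattern;

final class QuotedNameSketch
{
    // Lowercase-only pattern that quotedName used before the change
    static final Pattern SIMPLE_NAME = Pattern.compile("[a-z][a-z0-9]*");
    // Assumed definition of the pattern used after the change
    static final Pattern UNQUOTED_IDENTIFIER_PATTERN = Pattern.compile("[a-zA-Z_][a-zA-Z0-9_]*");

    private QuotedNameSketch() {}

    // Returns the name unchanged when it matches the given pattern; otherwise
    // wraps it in double quotes and doubles any embedded double quote.
    static String quotedName(Pattern unquotedPattern, String name)
    {
        if (unquotedPattern.matcher(name).matches()) {
            return name;
        }
        return '"' + name.replace("\"", "\"\"") + '"';
    }

    public static void main(String[] args)
    {
        System.out.println(quotedName(SIMPLE_NAME, "My_Table"));
        System.out.println(quotedName(UNQUOTED_IDENTIFIER_PATTERN, "My_Table"));
    }
}
```

With SIMPLE_NAME the mixed-case name is emitted as "My_Table", which can be pasted back into SQL; with the wider pattern it is emitted bare as My_Table, which SQL would resolve as a different (lowercased) name.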

Contributor Author

You indeed point out a bug, that also applies to the fromColumnToIdentifier method. It is however the same semantics: we are taking something from metadata (a column or table name), and need to ensure that it can be pasted in an SQL editor, respecting SQL identifier semantics.

return name;
}
return '"' + name.replace("\"", "\"\"") + '"';
}

public static String fromColumnToIdentifier(String column)
Member

The method introduced here is unused (in the commit which introduced it), and it's unclear in what context it should be used.
Squash the changes with the next commit.

return quotedName(column);
}

public static String fromIdentifierToColumn(String identifier)
Member

The method introduced here is unused (in the commit which introduced it), and it's unclear in what context it should be used.
Squash the changes with the next commit.

Member

Later it becomes used in PartitionFields, which is the context in which it's comprehensible
(otherwise the "identifier" will be misleading, as normally a String identifier should not be expected to have any quotes inside, or the quotes should be treated literally).

Move the method to PartitionFields

Contributor Author

The placement in IcebergUtil had been discussed in #12872; the need to parse quoted identifiers also exists in other table properties (sort_order, orc_bloom_filter, ...).

@mdesmet mdesmet force-pushed the feature/iceberg-partitioning-quoted-identifiers branch from 89b4d46 to 5675f13 Compare June 28, 2022 19:07
@findepi
Member

findepi commented Jul 25, 2022

(cannot comment at #12227 (comment))

The placement in IcebergUtil had been discussed in #12872,

I am aware of that PR.

Later it becomes used in PartitionFields which is the context in which it's comprehensible.

It's important for a shared method to have easy-to-understand semantics (the name, input and output types need to intuitively hint at what it does).
That's why I suggested scoping it down for now to a private method in PartitionFields.

@mdesmet mdesmet force-pushed the feature/iceberg-partitioning-quoted-identifiers branch from 5675f13 to 1ea29fc Compare July 27, 2022 15:28
@findepi
Member

findepi commented Aug 3, 2022

@mdesmet failures seem related.

@mdesmet mdesmet force-pushed the feature/iceberg-partitioning-quoted-identifiers branch from e69642d to 95d16a9 Compare August 3, 2022 19:30
@mdesmet mdesmet force-pushed the feature/iceberg-partitioning-quoted-identifiers branch from 95d16a9 to 18cd9b3 Compare August 3, 2022 23:22
@mdesmet
Contributor Author

mdesmet commented Aug 4, 2022

@mdesmet failures seem related.

I have rebased with latest master and resolved the issues.

@findepi findepi merged commit 90a714b into trinodb:master Aug 8, 2022
@findepi findepi mentioned this pull request Aug 8, 2022
@github-actions github-actions bot added this to the 393 milestone Aug 8, 2022
Labels
cla-signed enhancement New feature or request
Development

Successfully merging this pull request may close these issues.

Allow defining Iceberg partitioning over a column with whitespace in its name
6 participants