Spark: Added ability to add uuid suffix to the table location in Hive catalog #2850
Conversation
In testing etc., I very often use a similar pattern (possibly using a timestamp as the table suffix). However, I'm not sure the best place to be doing this is in the Iceberg code. What other tools are you using to create these tables that have UUID suffixes? Usually, when I encounter this need, I'm doing it in one of two places:

(1) Directly in Spark code:

```scala
val currentTime = new Date().getTime
val tableName = "table_" + currentTime
spark.sql(s"CREATE TABLE IF NOT EXISTS my_catalog.default.${tableName} (name string, age int) USING iceberg")
```

(2) From some sort of scheduling tool, such as Airflow or Azkaban. In this case, it's very easy to create a UUID when passing in the "new table name" to the Spark job.

Effectively, I'm not sure this is something that makes sense to place in Iceberg. Can you elaborate further on why this isn't something you can pass as an argument to your jobs? It feels very use-case specific, with possible ways for you to deal with it using existing tools, but maybe I'm not fully understanding the scope of your problem. 🙂
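The second approach above can be sketched without any particular scheduler; the helper name and the `spark-submit` invocation in the comment are illustrative, not part of this PR:

```python
import uuid

def unique_table_name(base: str) -> str:
    # Hypothetical helper (not part of the PR): append a UUID so each
    # scheduled run gets its own table name.
    return f"{base}_{uuid.uuid4().hex}"

# A scheduler such as Airflow would pass the result to the Spark job, e.g.:
#   spark-submit my_job.py --table-name events_<uuid>
print(unique_table_name("events"))
```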
I believe this is possible using
@kbendick Thanks for the quick reply
Also, we have scheduled compaction and orphan-files cleanup processes. If we have data and metadata files for both tables in the same folder, the orphan-files cleanup process will delete data and metadata for the table which was deleted in step 2.
@sshkvar: I think you could use the following Hive query to achieve the desired results:
Would this work for you?
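The query itself did not survive this export, so it stays elided above. As a rough sketch of the idea only (the table name, bucket, and storage-handler line are my assumptions, not the original query), a unique LOCATION can be set explicitly at create time; rendered here as Python string-building for consistency with the other sketches:

```python
import uuid

def create_table_sql(db: str, table: str, bucket: str) -> str:
    # Sketch only: the actual query from the comment was not captured in
    # this export; the schema and storage-handler line are assumptions.
    location = f"s3://{bucket}/{table}-{uuid.uuid4()}"
    return (
        f"CREATE TABLE {db}.{table} (name string, age int)\n"
        f"STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'\n"
        f"LOCATION '{location}'"
    )

print(create_table_sql("default", "customers", "my-bucket"))
```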
@pvary Yes, this solution can be used, but in that case we would need to change a lot of existing ETLs and ask people to set a custom location for each new table. We would like to have some general option.
@sshkvar: Do you think this is a general enough request to change the HiveCatalog code, or is it OK for you to create your own CustomCatalog and use it in your deployment?
In my view, it can be useful in the case when we are not deleting table data.
@sshkvar: I would like to hear others' opinions on this too before we come to any decision.
I guess I'm also a little apprehensive of adding more parameters when this could be done in user space pretty easily. But if other folks have this need more frequently, I don't think it's unreasonable. There is actually a similar thing in C* (Cassandra), which always prefixes table names with UUIDs when creating their actual locations to avoid this kind of conflict.
FWIW, here is the related change in Trino: trinodb/trino#6063
Another important reason why we need a unique table location is the following:
In this case, data and metadata files will be placed in the same 'folder' on S3. When we perform the remove orphan files action, for example for
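The hazard being described can be illustrated with a toy model (the file layout and names are invented): an orphan-file cleanup that treats every unreferenced file under a shared location as garbage will delete the other table's files.

```python
def remove_orphans(files_under_location, referenced_by_table):
    # Toy model of a remove-orphan-files pass: anything under the table's
    # location that the table's metadata does not reference gets deleted.
    return {f for f in files_under_location if f not in referenced_by_table}

# Two tables created with the same name share s3://bucket/tbl/ over time.
shared = {"tbl/data-old-1.parquet", "tbl/data-new-1.parquet"}
new_table_refs = {"tbl/data-new-1.parquet"}

# Cleaning orphans for the new table deletes the old table's data too.
print(remove_orphans(shared, new_table_refs))
```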
Finally, I connected this in my head with the same issue in Hive: wouldn't this solve both issues? CC: @deniskuzZ, @zchovan
This PR adds the ability to have a unique table location.
Since each table then has a unique location, we can safely perform remove orphan files (compaction) for both tables.
Force-pushed from 52ab5f4 to aed7c8d.
@RussellSpitzer @kbendick The same functionality has already been merged to the Trino Iceberg connector: trinodb/trino#6063. This functionality will allow us to avoid the issue with table rename, as I described above.
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.
I think so; the above was done for ACID tables in Hive. It won't work out of the box for Iceberg, but with minimal changes, yes.
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time. |
Hi, I would like to add the ability to have a unique table location for each new table in the Hive catalog.
Let me provide some details on why I need it.
We have two main engines which work with Iceberg tables: Trino and Spark. For storing metadata we are using Hive Metastore; data and metadata are stored on S3. We also have a requirement not to drop table data and metadata when dropping a table from Hive Metastore. This allows us to restore any dropped table.
In this PR I added a new catalog property, `APPEND_UUID_SUFFIX_TO_TABLE_LOCATION`. If this property is set to `true`, then on table creation we append a UUID suffix (like s3://bucket/tableName-UUID) when building the table location.

@RussellSpitzer could you please take a look?
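A minimal sketch of the proposed behavior (the property name comes from this PR; the helper itself is illustrative, not the actual HiveCatalog code, which is Java):

```python
import uuid

APPEND_UUID_SUFFIX_TO_TABLE_LOCATION = True  # the new catalog property

def default_table_location(warehouse: str, table_name: str) -> str:
    # Build the table location, optionally appending a UUID suffix,
    # e.g. s3://bucket/tableName-<uuid>. Illustrative sketch only.
    location = f"{warehouse}/{table_name}"
    if APPEND_UUID_SUFFIX_TO_TABLE_LOCATION:
        location = f"{location}-{uuid.uuid4()}"
    return location

print(default_table_location("s3://bucket", "tableName"))
```

Because the suffix is generated per create call, a drop-and-recreate of the same table name lands in a fresh folder, which is what makes the orphan-files cleanup safe.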