Spark: Added ability to add uuid suffix to the table location in Hive catalog #2850
Conversation
In testing etc., I very often use a similar pattern (possibly using a timestamp as the table suffix). However, I'm not sure the best place to be doing this is in the Iceberg code. What other tools are you using to create these tables that have UUID suffixes? Usually, when I encounter this need, I'm doing it in one of two places:

(1) Directly in Spark code:

```scala
val currentTime = new Date().getTime
val tableName = "table_" + currentTime
spark.sql(s"CREATE TABLE IF NOT EXISTS my_catalog.default.${tableName} (name string, age int) USING iceberg")
```

(2) From some sort of scheduling tool, such as Airflow or Azkaban. In this case, it's very easy to create a UUID when passing in the "new table name" to the Spark job.

Effectively, I'm not sure this is something that makes sense to place in Iceberg. Can you elaborate further on why this isn't something you can pass as an argument to your jobs? It feels very use-case specific, with possible ways for you to deal with it using existing tools, but maybe I'm not fully understanding the scope of your problem. 🙂
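The second approach above can be sketched without any particular scheduler; the helper name and the `spark-submit` invocation in the comment are illustrative, not part of this PR:

```python
import uuid

def unique_table_name(base: str) -> str:
    # Hypothetical helper (not part of the PR): append a UUID so each
    # scheduled run gets its own table name.
    return f"{base}_{uuid.uuid4().hex}"

# A scheduler such as Airflow would pass the result to the Spark job, e.g.:
#   spark-submit my_job.py --table-name events_<uuid>
print(unique_table_name("events"))
```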
I believe this is possible using
@kbendick Thanks for the quick reply
Also, we have scheduled compaction and orphan-files cleanup processes. If we have data and metadata files for both tables in the same folder, the orphan-files cleanup process will delete data and metadata for the table which was deleted in step 2.
@sshkvar: I think you could use the following Hive query to achieve the desired results:
Would this work for you?
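The query itself did not survive this export, so it stays elided above. As a rough sketch of the idea only (the table name, bucket, and storage-handler line are my assumptions, not the original query), a unique LOCATION can be set explicitly at create time; rendered here as Python string-building for consistency with the other sketches:

```python
import uuid

def create_table_sql(db: str, table: str, bucket: str) -> str:
    # Sketch only: the actual query from the comment was not captured in
    # this export; the schema and storage-handler line are assumptions.
    location = f"s3://{bucket}/{table}-{uuid.uuid4()}"
    return (
        f"CREATE TABLE {db}.{table} (name string, age int)\n"
        f"STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'\n"
        f"LOCATION '{location}'"
    )

print(create_table_sql("default", "customers", "my-bucket"))
```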
@pvary Yes, this solution can be used, but in that case we would need to change a lot of existing ETLs and ask people to set a custom location for each new table. We would like to have some general option.
@sshkvar: Do you think this is a general enough request to change the HiveCatalog code, or is it OK for you to create your own CustomCatalog and use it in your deployment?
In my view, it can be useful in the case when we are not deleting table data.
@sshkvar: I would like to hear others' opinions on this too before we come to any decision.
I guess I'm also a little apprehensive of adding more parameters when this could be done in user space pretty easily. But if other folks have this need more frequently, I don't think it's unreasonable. There is actually a similar thing in C* (Cassandra), which always prefixes table names with UUIDs when creating their actual locations to avoid this kind of conflict.
FWIW, here is the related change in Trino: trinodb/trino#6063
Another important reason why we need a unique table location is the following:
In this case, data and metadata files will be placed in the same 'folder' on S3. When we perform the remove orphan files action, for example for
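The hazard being described can be illustrated with a toy model (the file layout and names are invented): an orphan-file cleanup that treats every unreferenced file under a shared location as garbage will delete the other table's files.

```python
def remove_orphans(files_under_location, referenced_by_table):
    # Toy model of a remove-orphan-files pass: anything under the table's
    # location that the table's metadata does not reference gets deleted.
    return {f for f in files_under_location if f not in referenced_by_table}

# Two tables created with the same name share s3://bucket/tbl/ over time.
shared = {"tbl/data-old-1.parquet", "tbl/data-new-1.parquet"}
new_table_refs = {"tbl/data-new-1.parquet"}

# Cleaning orphans for the new table deletes the old table's data too.
print(remove_orphans(shared, new_table_refs))
```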
Finally, I connected this in my head with the same issue in Hive: wouldn't this solve both issues? CC: @deniskuzZ, @zchovan
This PR adds the ability to have a unique table location.
Since each table then has a unique location, we can safely perform remove orphan files (compaction) for both tables.
Force-pushed from 52ab5f4 to aed7c8d.
@RussellSpitzer @kbendick The same functionality has already been merged to the Trino Iceberg connector: trinodb/trino#6063. This functionality will allow us to avoid the issue with table rename, as I described above.
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.
I think so; the above was done for ACID tables in Hive. It won't work out of the box for Iceberg, but with minimal changes, yes.
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time. |
Hi, I would like to add the ability to have a unique table location for each new table in the Hive catalog.
Let me provide some details on why I need it.
We have two main engines which work with Iceberg tables: Trino and Spark. For storing metadata we are using Hive Metastore; data and metadata are stored on S3. We also have a requirement not to drop table data and metadata when dropping a table from Hive Metastore. This allows us to restore any dropped table.
In this PR I added a new catalog property, `APPEND_UUID_SUFFIX_TO_TABLE_LOCATION`. If this property is set to `true`, then on table creation we append a UUID suffix (like s3://bucket/tableName-UUID) when building the table location.

@RussellSpitzer could you please take a look?
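A minimal sketch of the proposed behavior (the property name comes from this PR; the helper itself is illustrative, not the actual HiveCatalog code, which is Java):

```python
import uuid

APPEND_UUID_SUFFIX_TO_TABLE_LOCATION = True  # the new catalog property

def default_table_location(warehouse: str, table_name: str) -> str:
    # Build the table location, optionally appending a UUID suffix,
    # e.g. s3://bucket/tableName-<uuid>. Illustrative sketch only.
    location = f"{warehouse}/{table_name}"
    if APPEND_UUID_SUFFIX_TO_TABLE_LOCATION:
        location = f"{location}-{uuid.uuid4()}"
    return location

print(default_table_location("s3://bucket", "tableName"))
```

Because the suffix is generated per create call, a drop-and-recreate of the same table name lands in a fresh folder, which is what makes the orphan-files cleanup safe.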