Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Iceberg: Add support for creating and dropping tables using Iceberg object store #20555

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

jakelong95
Copy link

@jakelong95 jakelong95 commented Feb 2, 2024

Description

Currently, Trino is only able to write to and read from tables that use Iceberg's ObjectStorageLocationProvider, but is unable to create or drop tables using the location provider.

This PR enables Trino to create tables using Iceberg's object storage by adding the following properties:

  • iceberg.object-store.enabled - Corresponds with Spark's write.object-storage.enabled, which enables use of Iceberg's ObjectStorageLocationProvider
  • iceberg.data-location - Corresponds to Spark's write.data.path, which sets where Iceberg data files will be written

Enabling the object store property and setting the data path will cause Iceberg to provide data file locations prefixed by a deterministic hash generated from the file name in the location specified by write.data.path, which will help reduce throttling from cloud storage systems like S3 by evenly distributing files across multiple prefixes. Iceberg's own documentation on this feature has more information.

For example, without the object store enabled you would get the following locations for data files, all under the same prefix iceberg-tables/myschema/mytable/data:

  • s3://mybucket/iceberg-tables/myschema/mytable/data/file1.parquet
  • s3://mybucket/iceberg-tables/myschema/mytable/data/file2.parquet
  • s3://mybucket/iceberg-tables/myschema/mytable/data/file3.parquet

But, if you enable the object store and set the data path to is set to s3://mybucket/datafiles, you would get the following locations, each in their own prefix:

  • s3://mybucket/datafiles/<file1 hash>/myschema/mytable/file1.parquet
  • s3://mybucket/datafiles/<file2 hash>/myschema/mytable/file2.parquet
  • s3://mybucket/datafiles/<file3 hash>/myschema/mytable/file3.parquet

Additional context and related issues

This PR maintains compatibility with Spark by using the table properties write.object-storage.enabled and write.data.path, which had previously been set up to allow Trino to write to Iceberg tables using the ObjectStorageLocationProvider in #8573

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Iceberg
* Add support for creating and dropping tables using Iceberg object store  ({issue}`8861`)

Copy link

cla-bot bot commented Feb 2, 2024

Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to [email protected]. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla

@jakelong95
Copy link
Author

Re-created from #20516

@github-actions github-actions bot added tests:hive iceberg Iceberg connector labels Feb 2, 2024
@cla-bot cla-bot bot added the cla-signed label Apr 17, 2024
@jakelong95 jakelong95 marked this pull request as ready for review April 17, 2024 20:38
@mosabua mosabua requested a review from electrum April 17, 2024 20:55
@findinpath findinpath requested a review from homar April 26, 2024 10:07
Copy link

This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua

@github-actions github-actions bot added the stale label May 17, 2024
@mosabua
Copy link
Member

mosabua commented May 29, 2024

Any thoughts @amogh-jahagirdar

@github-actions github-actions bot removed the stale label May 30, 2024
Copy link

This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua

Copy link

Closing this pull request, as it has been stale for six weeks. Feel free to re-open at any time.

@github-actions github-actions bot closed this Jul 15, 2024
@jakelong95 jakelong95 reopened this Jul 15, 2024
@mosabua mosabua added stale-ignore Use this label on PRs that should be ignored by the stale bot so they are not flagged or closed. and removed stale labels Jul 15, 2024
@mosabua
Copy link
Member

mosabua commented Jul 15, 2024

I added the stale-ignore label so the PR stays open. Please rebase as a next step.

@jakelong95 jakelong95 force-pushed the jl-iceberg-object-storage branch 3 times, most recently from 6ed669a to a7454d1 Compare July 15, 2024 19:19
@jakelong95
Copy link
Author

Hi @mosabua, pushed the rebase in. Looks like the failed check timed out but I'm not able to restart it.
I'm also not sure what the process is for bumping Trino PRs, should I re-post in the Trino slack?

@mosabua
Copy link
Member

mosabua commented Jul 29, 2024

Please ask for help on the #iceberg or the #core-dev channel

@jakelong95 jakelong95 force-pushed the jl-iceberg-object-storage branch 2 times, most recently from ed46a94 to e452e6d Compare August 20, 2024 23:34
@jakelong95
Copy link
Author

Hi @findinpath, did you have a chance to take a look at my updates?

@ebyhr ebyhr self-requested a review October 31, 2024 21:52
@ebyhr ebyhr force-pushed the jl-iceberg-object-storage branch 4 times, most recently from 7473dad to a99dde1 Compare November 1, 2024 08:47
Comment on lines +207 to +209
* - `iceberg.data-location`
- Sets the path that data files will be written to.
- table location + /data
Copy link
Member

@ebyhr ebyhr Nov 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why this property is needed for catalog level. When do we set this globally?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, my mistake. I did mean to add this globally so it doesn't necessarily need to be configured per table. I'll add this in

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ebyhr actually, looking at it I have this property set up in the same way iceberg.file-format is set up so it should be configurable at the catalog level already.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

iceberg.file-format is unrelated. I'm asking the scenario when we want to set it globally.
Please remove this property if you are not sure. We can add it in follow-up.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, gotcha. I thought you were asking if it's possible to set it globally.

The use case for being able to set it globally would be if you want every table to be created with the object store file layout to preemptively avoid running into rate limiting. Without being able to set it globally, you would have to manually update tables that run into rate limiting which can be time consuming and prone to missing some tables that might run into the issue in the future.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Setting it globally would also be helpful if you want to have all of your tables follow the same setup, instead of some using the object store file layout and some using the default

// TODO: support path override in Iceberg table creation: https://github.com/trinodb/trino/issues/8861
if (table.properties().containsKey(OBJECT_STORE_PATH) ||
table.properties().containsKey("write.folder-storage.path") || // Removed from Iceberg as of 0.14.0, but preserved for backward compatibility
table.properties().containsKey(WRITE_METADATA_LOCATION) ||
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why was WRITE_METADATA_LOCATION removed?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I totally follow. I'm not removing the property, just its usage here since Trino will be able to drop tables created with Iceberg's object storage file layout with this change

@grantatspothero
Copy link
Contributor

grantatspothero commented Nov 4, 2024

Adding support for updating this table property would be useful, see IcebergMetadata#setTableProperties

It is a common to create a table and only later encounter object storage rate limiting. Not allowing updates would make this workflow to fix rate limiting a pain.

Not sure if updates should go in this PR or another one @ebyhr but heads up.

@ebyhr ebyhr force-pushed the jl-iceberg-object-storage branch 2 times, most recently from 892a76d to 38fe6eb Compare November 7, 2024 04:40
@jakelong95
Copy link
Author

Adding support for updating this table property would be useful, see IcebergMetadata#setTableProperties

It is a common to create a table and only later encounter object storage rate limiting. Not allowing updates would make this workflow to fix rate limiting a pain.

Not sure if updates should go in this PR or another one @ebyhr but heads up.

That's a good callout. I'll add it in this PR

@ebyhr
Copy link
Member

ebyhr commented Nov 7, 2024

That's a good callout. I'll add it in this PR

@jakelong95 I already push the code to support SET PROPERTIES.

@jakelong95
Copy link
Author

That's a good callout. I'll add it in this PR

@jakelong95 I already push the code to support SET PROPERTIES.

Yeah, saw that you added adding in object_store_enabled. I'm adding in support for modifying data_location since the two properties are related

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla-signed docs iceberg Iceberg connector stale-ignore Use this label on PRs that should be ignored by the stale bot so they are not flagged or closed.
Development

Successfully merging this pull request may close these issues.

7 participants