[Iceberg] Support procedure remove_orphan_files #23267

Merged: 1 commit, Jul 31, 2024

@hantangwangd hantangwangd commented Jul 21, 2024

Description

This PR adds support for the procedure remove_orphan_files for Iceberg. The procedure removes files that are not referenced in any metadata file of an Iceberg table and can therefore be considered "orphaned".

Examples:

  • Remove any files that are not known to the table db.sample and are older than the specified timestamp:
CALL iceberg.system.remove_orphan_files('db', 'sample', TIMESTAMP '2023-08-31 00:00:00.000');
  • Remove any files that are not known to the table db.sample and were created more than 3 days ago (the default threshold):
CALL iceberg.system.remove_orphan_files(schema => 'db', table_name => 'sample');
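
The selection rule described above (unreferenced by any metadata file, and older than a cutoff) can be illustrated with a minimal Python sketch. This is not the code merged in this PR: the function name `find_orphan_files`, the dictionary-based file listing, and the strict "older than" comparison are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def find_orphan_files(listed_files, referenced_paths, older_than):
    """Return paths that no metadata references and that predate older_than.

    listed_files: dict of path -> last-modified datetime (hypothetical
        listing of the table location).
    referenced_paths: set of paths reachable from the table's metadata.
    older_than: cutoff datetime; the strict comparison is an assumption.
    """
    return sorted(
        path
        for path, modified_at in listed_files.items()
        if path not in referenced_paths and modified_at < older_than
    )


# Hypothetical listing of a table location.
now = datetime(2023, 9, 3, 0, 0, 0)
listed = {
    "data/a.parquet": datetime(2023, 8, 1),      # referenced: kept
    "data/b.parquet": datetime(2023, 8, 15),     # unreferenced and old: orphan
    "data/c.parquet": datetime(2023, 9, 2, 12),  # unreferenced but recent: kept
}
referenced = {"data/a.parquet"}

# Explicit cutoff, as in the first CALL example.
orphans_explicit = find_orphan_files(listed, referenced, datetime(2023, 8, 31))

# Default cutoff of 3 days before "now", as in the second CALL example.
orphans_default = find_orphan_files(listed, referenced, now - timedelta(days=3))
```

The age cutoff matters because a file that was written recently may belong to a write that has not been committed yet, so it is not yet referenced by any metadata even though it is not an orphan.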

Motivation and Context

Support removing orphan files that are not referenced in any metadata files for Iceberg

Test Plan

  • Newly added test cases in TestRemoveOrphanFilesProcedure

Contributor checklist

  • Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

== RELEASE NOTES ==


Iceberg Connector Changes
* Add procedure `remove_orphan_files` to remove orphan files that are not referenced in any metadata files for Iceberg. :pr:`23267`

@kiersten-stokes kiersten-stokes left a comment

This will be nice to have! Just NITs from me

@steveburnett steveburnett left a comment

Thanks for the doc! Local doc build with the new table looks good. A minor suggested rephrase for active voice, and shortening for readability.

Review thread on presto-docs/src/main/sphinx/connector/iceberg.rst (outdated, resolved)
steveburnett previously approved these changes Jul 29, 2024

@steveburnett steveburnett left a comment

LGTM! (docs)

Pulled the updated branch; the new local docs build looks good. Thanks!

@ZacBlanco ZacBlanco left a comment

A few initial comments. Great to see us getting more parity with Iceberg's Spark procedures

@ZacBlanco ZacBlanco left a comment

Two final things. Also, I would like to see a test which exercises the default expiry limit. I don't have a good idea on how to do that in a reasonable time frame within the test, so I am happy to leave as is if you think it would be difficult

@hantangwangd commented

> I would like to see a test which exercises the default expiry limit. I don't have a good idea on how to do that in a reasonable time frame within the test, so I am happy to leave as is if you think it would be difficult

It does seem difficult to test this scenario unless we change the implementation of remove_orphan_files to support configuring the default value of older_than. I'm not sure it's worth doing that. Maybe we can add such test cases once we figure out a better approach to time-based testing?
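
For illustration only, one way a configurable default could make the default cutoff testable is sketched below in Python. It is not part of this PR and does not reflect the merged implementation: `resolve_older_than`, its `now` and `retention` parameters, and `DEFAULT_RETENTION` are hypothetical names; only the 3-day default comes from the PR description.

```python
from datetime import datetime, timedelta

# The 3-day default is taken from the PR description; everything else
# here is a hypothetical illustration.
DEFAULT_RETENTION = timedelta(days=3)


def resolve_older_than(older_than=None, now=None, retention=DEFAULT_RETENTION):
    """Pick the cutoff: the explicit value if given, else now - retention."""
    if older_than is not None:
        return older_than
    current = now if now is not None else datetime.now()
    return current - retention


# A test can pin "now" and shrink the retention instead of waiting 3 days.
fixed_now = datetime(2024, 7, 31, 15, 33)
default_cutoff = resolve_older_than(now=fixed_now)
short_cutoff = resolve_older_than(now=fixed_now, retention=timedelta(seconds=1))
explicit_cutoff = resolve_older_than(older_than=datetime(2023, 8, 31))
```

With the clock and the retention passed in, a test could exercise the default path deterministically, at the cost of widening the procedure's internal interface, which is the trade-off raised in the comment above.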

@hantangwangd hantangwangd merged commit 86fc085 into prestodb:master Jul 31, 2024
57 checks passed
@hantangwangd hantangwangd deleted the remove_orphan_files branch July 31, 2024 15:33
@tdcmeehan tdcmeehan mentioned this pull request Aug 23, 2024
34 tasks