
Iceberg split generation rate limit #18214

Open
marton-bod wants to merge 3 commits into master

Conversation

@marton-bod (Contributor) commented Jul 10, 2023

Description

Hive, Hudi, and Delta Lake connectors already use the approach implemented in this PR: they first asynchronously load (stream) splits into an AsyncQueue (which can be configured with throttling settings), and then borrow splits from that queue whenever getNextBatch() is periodically called from the SplitSource.
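
A minimal sketch of that pattern, for illustration only (Guava's RateLimiter stands in for the connectors' throttled queue, and all class and method names below are invented, not the PR's actual code):

import com.google.common.util.concurrent.RateLimiter;

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Predicate;

// Illustrative sketch (invented names): a producer streams splits into a bounded
// queue at a capped rate, while the split source periodically borrows batches.
class ThrottledSplitQueue<T>
{
    private final LinkedBlockingQueue<T> queue;
    private final RateLimiter rateLimiter;
    private final Predicate<T> staticPruning; // checks that don't depend on the dynamic filter

    ThrottledSplitQueue(int maxQueuedSplits, double maxSplitsPerSecond, Predicate<T> staticPruning)
    {
        this.queue = new LinkedBlockingQueue<>(maxQueuedSplits);
        this.rateLimiter = RateLimiter.create(maxSplitsPerSecond);
        this.staticPruning = staticPruning;
    }

    // Producer side: prune eagerly (the acceptScanTask() idea), then block for a
    // rate permit and for queue capacity (natural backpressure).
    void enqueue(T split)
            throws InterruptedException
    {
        if (!staticPruning.test(split)) {
            return; // pruned before it ever reaches the queue
        }
        rateLimiter.acquire();
        queue.put(split);
    }

    // Consumer side: called from getNextBatch(); dynamic-filter pruning is applied
    // to the returned batch, since the dynamic filter changes over time.
    List<T> nextBatch(int maxSize)
    {
        List<T> batch = new ArrayList<>(maxSize);
        queue.drainTo(batch, maxSize);
        return batch;
    }
}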

The primary aim is to control the rate at which splits are fetched for scheduling, thereby protecting the storage layer (e.g. namenodes in HDFS) from being overwhelmed by requests. This is achieved by a newly introduced config flag (already present in Hive, Hudi, and Delta Lake): iceberg.max-splits-per-second.
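
For example, in the catalog properties file (the property name is the one introduced by this PR; the value is purely illustrative):

# etc/catalog/iceberg.properties
iceberg.max-splits-per-second=100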

Additional context and related issues

Solves issue #17974.

The logic for filtering out splits should be unchanged: the pruning logic related to the dynamic filter stays in getNextBatch(), since the DF needs to be recomputed periodically. The rest of the pruning logic can be applied to a split even before it is enqueued, so those checks are performed within acceptScanTask() when we create the split stream.

Note: FileAwareScanTask is a new class introduced to keep metadata about each FileScanTask's parent file. This is needed to replicate the current logic, which stops the iteration once we have already hit enough records for a LIMIT clause.
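
A rough sketch of the idea (the shape is invented for illustration and is not the PR's exact code):

import org.apache.iceberg.FileScanTask;

// Illustrative sketch: pair each FileScanTask with its parent data file's record
// count, so the enumeration can stop as soon as a LIMIT has been satisfied.
class FileAwareScanTask
{
    private final FileScanTask task;

    FileAwareScanTask(FileScanTask task)
    {
        this.task = task;
    }

    FileScanTask task()
    {
        return task;
    }

    long parentFileRecordCount()
    {
        // the record count comes from the parent data file's metadata
        return task.file().recordCount();
    }
}

The enumeration can then keep a running total of parentFileRecordCount() values and stop streaming new tasks once the total reaches the LIMIT.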

Release notes

( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(X) Release notes are required, with the following suggested text:

# Iceberg connector
* Introduce a new config property to rate limit split generation in Iceberg table scans: iceberg.max-splits-per-second.

cla-bot bot added the cla-signed label Jul 10, 2023
github-actions bot added the iceberg (Iceberg connector) label Jul 10, 2023
cla-bot bot commented Jul 10, 2023:

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Marton Bod.
This is most likely caused by a git client misconfiguration; please make sure to:

  1. Check whether your git client is configured with an email to sign commits: git config --list | grep email
  2. If not, set it up using: git config --global user.email [email protected]
  3. Make sure that the git commit email is configured in your GitHub account settings; see https://github.com/settings/emails

cla-bot bot removed the cla-signed label Jul 10, 2023
cla-bot bot added the cla-signed label Jul 10, 2023
@findepi changed the title from "Iceberg: add config flags to rate limit split generation" to "Iceberg: add config flags to rate limit Iceberg split generation" Jul 12, 2023
@findepi changed the title from "Iceberg: add config flags to rate limit Iceberg split generation" to "Iceberg split generation rate limit" Jul 12, 2023
@marton-bod (Contributor, Author):

@alexjo2144 @findinpath Just a friendly bump, I would really appreciate it if you could review this PR when you get the chance. Thank you!

@mosabua (Member) commented Jul 18, 2023:

Given that this is modelled on the implementation in other connectors, this seems like a good approach. Could you potentially also take a look, @electrum or @findepi?

@marton-bod (Contributor, Author):

Thanks a lot @mosabua for taking a first look. @electrum / @findepi please let me know if you have any thoughts, I do appreciate any input. Thanks a lot!


.addCopies(new FileOperation(SNAPSHOT, INPUT_FILE_NEW_STREAM), 1)
.addCopies(new FileOperation(MANIFEST, INPUT_FILE_NEW_STREAM), numberOfFiles + min(icebergManifestPrefetching, numberOfFiles))
.build());
assertFileSystemDataFileAccesses("SELECT * FROM test_select_with_limit LIMIT " + limit, min(limit, numberOfFiles));
@marton-bod (Contributor, Author):

@findepi the number of manifest accesses is no longer deterministic, because the splits are enqueued asynchronously. As soon as a getNextBatch call identifies that we have enough splits, the enqueueing stops, but we can't make any guarantees about the number of manifest accesses.

On the other hand, the number of data file accesses is deterministic, and no more data files will be read than necessary. Changed this test to reflect this.

Member:

Thinking about this:

  • we still want to test manifest accesses; this is an important metric
  • there should probably be a kill-switch for this functionality

So maybe the test should just use that kill-switch?

@marton-bod (Contributor, Author):

I've been thinking about this too, but I think adding a kill switch would complicate the code greatly. The iterative split generation approach is fundamentally different from the background async split loading approach, and a kill switch would require maintaining the code for both approaches in parallel and duplicating the logic in some places (and more painfully, porting fixes to both places). Because of this, I think it's impractical to introduce this flag.

I've made some enhancements to the test though, since even with this new approach we can make some guarantees about the number of manifest accesses. I've changed the test to compare the manifest access count against a lower bound and an upper bound, and added code comments to explain the reasoning behind the bounds. Regardless of the number of manifests in the scan, if there's a limit, the number of actually-accessed manifests should stay within a constant, narrow range.
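
Illustratively, the assertion moves from an exact count to a window (AssertJ style; the names and bounds below are invented, not the actual test code):

import static org.assertj.core.api.Assertions.assertThat;

// Invented names: with async enqueueing the exact manifest access count depends on
// timing, but it must lie between the minimum needed to produce enough splits for
// the LIMIT and that minimum plus the prefetch window.
static void assertManifestAccessesWithinBounds(long manifestAccesses, long minManifestReads, long prefetchWindow)
{
    assertThat(manifestAccesses).isBetween(minManifestReads, minManifestReads + prefetchWindow);
}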

Member:

> The iterative split generation approach is fundamentally different from the background async split loading approach

Kill switches are useful in case a bug is uncovered. If the new code is fundamentally different from the old one, it sounds like it's not safe to proceed, is it?

@marton-bod (Contributor, Author) commented Aug 1, 2023:

I phrased it wrong, I think. They're not fundamentally different, since most of the split generation and filtering logic has remained unchanged. It's different only in the sense that hiding it behind a feature flag is not as trivial as putting some code in an if-else statement. But it's doable. For example, I could refactor it so that IcebergAsyncSplitSource extends IcebergSplitSource and IcebergSplitManager can then choose which implementation to use based on the flag - what do you think?
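
Purely as a shape, the toggle could look like this (the flag and the async class are hypothetical; only IcebergSplitSource exists today):

// Hypothetical sketch of the proposed toggle in IcebergSplitManager: choose the
// split source implementation based on a kill-switch flag.
ConnectorSplitSource createSplitSource(IcebergConfig config)
{
    if (config.isAsyncSplitLoadingEnabled()) {
        return new IcebergAsyncSplitSource(); // new: background loading + throttling
    }
    return new IcebergSplitSource(); // old: iterative split generation
}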

Member:

Can we make the new implementation produce the same behaviour as before if max-outstanding-splits were configurable and set to 1?

@marton-bod (Contributor, Author):

@findepi Thanks so much for taking the time to review this. I've tried to answer your questions as best I could. Please let me know if you have any other questions or improvement ideas. Thanks!


@vgankidi:

@raunaqmorarka @findepi @findinpath @electrum Would really appreciate reviews here; it's been pending for a long time. Hope to close this out soon.

@vgankidi:

@raunaqmorarka @findepi @findinpath @electrum @mosabua Gentle ping again on this.

@bitsondatadev (Member) commented Mar 14, 2024:

@vgankidi I'm just gonna ping and ping until someone can get to this lol. Sorry this has taken so long.

Also, please attend the contributor meeting next week: https://github.com/trinodb/trino/wiki/Contributor-meetings

Bring this up there.

@dain (Member) commented Mar 14, 2024:

This feature is pretty anti-Trino. Trino is designed to complete queries as fast as possible. Storage systems should be protecting themselves by pushing back and throttling (like all of the cloud ones do). Also, split generation is not representative of actually performing any IO, so even if we wanted to add a throttle, this is the wrong place.

@dain (Member) previously requested changes Mar 14, 2024:

I don't think we should be doing this, and it should be discussed (slack, contributors meeting, etc.) before merging.

@dain dismissed their stale review March 15, 2024 04:45

Turns out this is in every other Hive-style connector for some reason, so I guess this ship has sailed

@dain (Member) commented Mar 15, 2024:

@electrum pointed out this is in the other Hive-style connectors already. I have no idea why this was added, because it is a poor stand-in for real concurrency control... anyway, I suggest we do not document or recommend this feature.

@marton-bod (Contributor, Author):

@dain

  • The same approach is already followed by Hive, Delta Lake, and Hudi. As for the "anti-Trino-ness" of this approach, I would just remind you of Delta commit 9956543e, which introduced the Delta connector; it was co-authored by you as well as a couple dozen Starburst engineers, signing off on this very same approach.
  • The point of this change is to protect storage systems which do not have throttling capabilities, like HDFS. I clearly pointed this out in the issue description as well. We have plenty of Iceberg datasets in HDFS. For storage systems with throttling, like S3, I agree this shouldn't be used, but no one has suggested otherwise.

@ebyhr removed their request for review March 25, 2024 00:55
@dain (Member) commented Apr 10, 2024:

I suggest we fix this up and merge it, so there is parity across the implementations. Then I suggest we add a real throttling wrapper to the new native file systems to actually deal with real errors (we can start with a simple per-machine, uncoordinated limit). Once that is in, we should drop this from all of the connectors.
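
The shape of that idea, sketched generically (the interface and names below are invented; this is not Trino's actual filesystem SPI):

import com.google.common.util.concurrent.RateLimiter;

import java.io.IOException;
import java.io.InputStream;

// Hypothetical per-machine, uncoordinated throttle around a file system:
// every open() acquires a local rate permit before touching storage.
interface SimpleFileSystem
{
    InputStream open(String path) throws IOException;
}

class ThrottledFileSystem implements SimpleFileSystem
{
    private final SimpleFileSystem delegate;
    private final RateLimiter rateLimiter; // shared per machine, not coordinated across nodes

    ThrottledFileSystem(SimpleFileSystem delegate, double maxOpensPerSecond)
    {
        this.delegate = delegate;
        this.rateLimiter = RateLimiter.create(maxOpensPerSecond);
    }

    @Override
    public InputStream open(String path)
            throws IOException
    {
        rateLimiter.acquire(); // blocks until under the configured rate
        return delegate.open(path);
    }
}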

@mosabua (Member) commented Apr 10, 2024:

Awesome, @dain .. do you want me to file an issue so we can track this and move it along?

@marton-bod can you rebase and ping us when it's ready for review again? Then we can figure out how @findepi @raunaqmorarka @ebyhr and @dain can collaborate efficiently to move this to merge.

@marton-bod (Contributor, Author):

@mosabua @dain Thanks all! I've rebased this onto the latest master, so it should be ready for reviews again.

github-actions bot commented May 3, 2024:

This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua

github-actions bot added the stale label May 3, 2024
@mosabua added the stale-ignore label (use this label on PRs that should be ignored by the stale bot so they are not flagged or closed) and removed the stale label May 7, 2024
@marton-bod (Contributor, Author):

Hi @dain, I was wondering if you could please take a look at this to close it out? Thanks so much!

@marton-bod requested a review from dain May 20, 2024 16:56
@mosabua (Member) commented May 24, 2024:

Please rebase @marton-bod

Either @findepi or @dain can then proceed with final review and merge

@marton-bod (Contributor, Author):

@mosabua @dain Rebase done. Thanks!

@marton-bod (Contributor, Author):

@mosabua Shall we just close this PR instead?

@mosabua (Member) commented Jul 17, 2024:

@marton-bod @dain .. we kinda agreed to get this merged to get parity across all the object storage connectors on this feature. I still think we should do that .. and also work on a proper solution, maybe at the file system level @electrum

Thoughts?

And also .. sorry for the delays and back and forth.

Labels: cla-signed, docs, iceberg (Iceberg connector), stale-ignore