You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
In order to limit the impact on storage systems, the Iceberg connector should add configuration option to rate limit the split generation. This can help avoid situations where a non-selective query overwhelms the storage (especially HDFS) with block requests. This config flag is already implemented by the Hive, Hudi and Delta connectors.
The text was updated successfully, but these errors were encountered:
This will require a refactor of the way Iceberg splits are returned in batches (probably using the AsyncQueue), therefore implementing iceberg.max-outstanding-splits in tandem would make sense.
This idea has come up a couple times before, but it's always been put off because there hasn't been an obvious need for it. It seems that Iceberg users are much more likely to be using cloud storage systems like S3, GCS, or ADLS which are much less likely to hit rate limits than HDFS.
Have you been hitting rate limits in a production environment, or were you thinking of doing this just for parity with the other connectors?
Hey @alexjo2144 , thanks for reaching out. We do have teams using iceberg tables on HDFS in production. We've had production problems before where a badly written query (a full table scan reading millions of files) would overwhelm the NN with block location requests and bring the whole cluster down. Hence this PR to install some guardrails against that scenario.
In order to limit the impact on storage systems, the Iceberg connector should add configuration option to rate limit the split generation. This can help avoid situations where a non-selective query overwhelms the storage (especially HDFS) with block requests. This config flag is already implemented by the Hive, Hudi and Delta connectors.
The text was updated successfully, but these errors were encountered: