AWS: Set better defaults for S3 retry behaviour #11052
Conversation
@@ -38,6 +40,14 @@
import software.amazon.awssdk.auth.credentials.AnonymousCredentialsProvider;
import software.amazon.awssdk.core.client.config.ClientOverrideConfiguration;
import software.amazon.awssdk.core.client.config.SdkAdvancedClientOption;
import software.amazon.awssdk.core.exception.SdkServiceException;
import software.amazon.awssdk.core.retry.RetryMode;
import software.amazon.awssdk.core.retry.RetryPolicy;
Any reason not to use RetryStrategy from the retries API introduced in version 2.26? The current SDK version used in Iceberg supports it, and RetryPolicy has technically already been flagged as deprecated.
This is mainly because of the retryCapacityCondition that we use in the retry policy to work around the HEAD 503 problem, for which I haven't found a good replacement in RetryStrategy. We have an issue open with the SDK to fix that; once it's resolved we should be able to move.
Just double-checking the support level: it looks like there is no immediate date to stop supporting RetryPolicy either: https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/retry-strategy.html#migrate-from-retry-policies
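To make the workaround concrete, here is a minimal sketch (illustrative only, not the exact PR code) of a retryCapacityCondition that lets 503s retry without drawing down the shared token bucket, so throttling is bounded only by the max retry count:

```java
import software.amazon.awssdk.core.exception.SdkException;
import software.amazon.awssdk.core.exception.SdkServiceException;
import software.amazon.awssdk.core.retry.RetryMode;
import software.amazon.awssdk.core.retry.RetryPolicy;
import software.amazon.awssdk.core.retry.conditions.RetryCondition;
import software.amazon.awssdk.core.retry.conditions.TokenBucketRetryCondition;

// Sketch: 503s (including ones the SDK mis-classifies, e.g. on HEAD responses) are
// retried without consuming token-bucket capacity; other failures draw from the bucket.
TokenBucketRetryCondition tokenBucket =
    TokenBucketRetryCondition.forRetryMode(RetryMode.STANDARD);

RetryCondition capacityCondition =
    context -> {
      SdkException failure = context.exception();
      if (failure instanceof SdkServiceException
          && ((SdkServiceException) failure).statusCode() == 503) {
        return true; // allow the retry with no token cost
      }
      return tokenBucket.shouldRetry(context); // non-throttling failures consume tokens
    };

RetryPolicy retryPolicy =
    RetryPolicy.builder()
        .numRetries(32)
        .retryCapacityCondition(capacityCondition)
        .build();
```

(A full implementation would also forward requestSucceeded(...) to the token bucket so capacity is released on success; the lambda above omits that for brevity.)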
.retryCondition(
    OrRetryCondition.create(
        RetryCondition.defaultRetryCondition(),
        RetryOnExceptionsCondition.create(XMLStreamException.class)))
I don't think this is comprehensive. We've seen issues with exceptions thrown as UncheckedIOException when using the urlconnection client, which throws the UncheckedIOException directly; it is not caught by the RetryableStage, so these failures are not retried even if we list them as retryable exception conditions (they are never recognized/caught at all).
Example:
java.io.UncheckedIOException: java.net.SocketException: Unexpected end of file from server
at software.amazon.awssdk.utils.FunctionalUtils.asRuntimeException(FunctionalUtils.java:180)
at software.amazon.awssdk.utils.FunctionalUtils.lambda$safeSupplier$4(FunctionalUtils.java:110)
at software.amazon.awssdk.utils.FunctionalUtils.invokeSafely(FunctionalUtils.java:136)
at software.amazon.awssdk.http.urlconnection.UrlConnectionHttpClient$RequestCallable.lambda$tryGetOutputStream$0(UrlConnectionHttpClient.java:324)
at software.amazon.awssdk.http.urlconnection.UrlConnectionHttpClient$RequestCallable.getAndHandle100Bug(UrlConnectionHttpClient.java:360)
at software.amazon.awssdk.http.urlconnection.UrlConnectionHttpClient$RequestCallable.tryGetOutputStream(UrlConnectionHttpClient.java:324)
at software.amazon.awssdk.http.urlconnection.UrlConnectionHttpClient$RequestCallable.call(UrlConnectionHttpClient.java:301)
at software.amazon.awssdk.http.urlconnection.UrlConnectionHttpClient$RequestCallable.call(UrlConnectionHttpClient.java:274)
at software.amazon.awssdk.core.internal.util.MetricUtils.measureDurationUnsafe(MetricUtils.java:67)
at software.amazon.awssdk.core.internal.http.pipeline.stages.MakeHttpRequestStage.executeHttpRequest(MakeHttpRequestStage.java:77)
at software.amazon.awssdk.core.internal.http.pipeline.stages.MakeHttpRequestStage.execute(MakeHttpRequestStage.java:56)
at software.amazon.awssdk.core.internal.http.pipeline.stages.MakeHttpRequestStage.execute(MakeHttpRequestStage.java:39)
at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptTimeoutTrackingStage.execute(ApiCallAttemptTimeoutTrackingStage.java:72)
at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptTimeoutTrackingStage.execute(ApiCallAttemptTimeoutTrackingStage.java:42)
at software.amazon.awssdk.core.internal.http.pipeline.stages.TimeoutExceptionHandlingStage.execute(TimeoutExceptionHandlingStage.java:78)
at software.amazon.awssdk.core.internal.http.pipeline.stages.TimeoutExceptionHandlingStage.execute(TimeoutExceptionHandlingStage.java:40)
at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptMetricCollectionStage.execute(ApiCallAttemptMetricCollectionStage.java:52)
at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptMetricCollectionStage.execute(ApiCallAttemptMetricCollectionStage.java:37)
at software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage.execute(RetryableStage.java:81)
at software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage.execute(RetryableStage.java:36)
at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
at software.amazon.awssdk.core.internal.http.StreamManagingStage.execute(StreamManagingStage.java:56)
at software.amazon.awssdk.core.internal.http.StreamManagingStage.execute(StreamManagingStage.java:36)
at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.executeWithTimer(ApiCallTimeoutTrackingStage.java:80)
at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.execute(ApiCallTimeoutTrackingStage.java:60)
at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.execute(ApiCallTimeoutTrackingStage.java:42)
at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallMetricCollectionStage.execute(ApiCallMetricCollectionStage.java:50)
at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallMetricCollectionStage.execute(ApiCallMetricCollectionStage.java:32)
at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
at software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:37)
at software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:26)
at software.amazon.awssdk.core.internal.http.AmazonSyncHttpClient$RequestExecutionBuilderImpl.execute(AmazonSyncHttpClient.java:196)
at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.invoke(BaseSyncClientHandler.java:103)
at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.doExecute(BaseSyncClientHandler.java:171)
at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.lambda$execute$1(BaseSyncClientHandler.java:82)
at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.measureApiCallSuccess(BaseSyncClientHandler.java:179)
at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.execute(BaseSyncClientHandler.java:76)
at software.amazon.awssdk.core.client.handler.SdkSyncClientHandler.execute(SdkSyncClientHandler.java:45)
at software.amazon.awssdk.awscore.client.handler.AwsSyncClientHandler.execute(AwsSyncClientHandler.java:56)
at software.amazon.awssdk.services.s3.DefaultS3Client.putObject(DefaultS3Client.java:8783)
You're right, it seems RetryableStage doesn't even catch all the default exceptions mentioned here, particularly the UncheckedIOException one. It looks like we need to solve this on the SDK side regardless, but I'm leaning toward leaving these conditions here so we don't need to come back and revisit this later. What do you think?
I can update aws/aws-sdk-java-v2#5442 with this info as well; I'm not sure whether you know of a separate issue open for this.
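For illustration only (and note this only helps once the SDK's retry stage actually surfaces the exception), widening the retryable-exception conditions could look roughly like this; the exception classes listed are examples, not necessarily the final set in this PR:

```java
import java.io.UncheckedIOException;
import javax.xml.stream.XMLStreamException;
import software.amazon.awssdk.core.retry.conditions.OrRetryCondition;
import software.amazon.awssdk.core.retry.conditions.RetryCondition;
import software.amazon.awssdk.core.retry.conditions.RetryOnExceptionsCondition;

// Sketch: keep the SDK's default retry condition and additionally treat these
// exception types as retryable.
RetryCondition retryCondition =
    OrRetryCondition.create(
        RetryCondition.defaultRetryCondition(),
        RetryOnExceptionsCondition.create(XMLStreamException.class),
        RetryOnExceptionsCondition.create(UncheckedIOException.class));
```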
Removed all InputStream related changes in favor of #10433.
/** Minimum wait time to retry a S3 operation */
public static final String S3_RETRY_MIN_WAIT_MS = "s3.retry.min-wait-ms";

public static final long S3_RETRY_MIN_WAIT_MS_DEFAULT = 2_000; // 2 seconds
would 500 ms make more sense maybe?
That was the original value we ran our load tests with, but we did not see any benefit from the lower value, and it caused significantly higher throttle rates.
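For reference, the minimum wait maps onto the SDK's backoff configuration roughly like this (a sketch assuming equal-jitter backoff; only the 2 s minimum comes from this PR, the 20 s cap is an illustrative assumption):

```java
import java.time.Duration;
import software.amazon.awssdk.core.retry.RetryPolicy;
import software.amazon.awssdk.core.retry.backoff.EqualJitterBackoffStrategy;

// Sketch: waits start at the configured minimum (2 s here) and grow with jitter up to
// an assumed cap of 20 s between attempts.
RetryPolicy.Builder retryPolicyBuilder =
    RetryPolicy.builder()
        .backoffStrategy(
            EqualJitterBackoffStrategy.builder()
                .baseDelay(Duration.ofMillis(2_000))
                .maxBackoffTime(Duration.ofSeconds(20))
                .build());
```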
@amogh-jahagirdar could you also take a look please?
Thanks @ookumuso, I think adding the configuration options is reasonable, but I'm a bit concerned that the defaults, when compounded with retries that integrations may already perform (e.g. Spark task-level retries), may be too much and not improve reliability in the general case. As I understand it you've done some benchmarks; do we have data points for when smaller values for the retry count and duration were configured? That would help make it clearer what the impact is when the values are lower.
/** Number of times to retry S3 operations. */
public static final String S3_RETRY_NUM_RETRIES = "s3.retry.num-retries";

public static final int S3_RETRY_NUM_RETRIES_DEFAULT = 32;
@ookumuso This value seems really high to me. It can compound, considering engine integrations may already have their own layer of retries on top of this (e.g. Spark task retries). I would also hypothesize that the chance of success after some small number of retries (I think the SDK default is 5?) is slim, and that it's generally not worth retrying beyond that.
In the load tests you performed, did this value of 32 markedly improve reliability compared to lower values?
I think setting this default comes down to the typical distribution of outcomes, and I'd speculate that in the average case of 5xx errors this large number of retries won't help. In some extreme cases it seems possible, but for those cases users can override the config defined here.
@amogh-jahagirdar The main consideration here is to give S3 enough time to auto-scale, particularly for workloads that see 503s because they go beyond S3's advertised limits. What we observed in our load tests is that highly concurrent workloads exhaust the default retries (3) and fail almost immediately; how fast the retries are exhausted depends on the concurrency level. The retry count makes the main difference here, since the overall application becomes more resistant to 503s, which leads to success and better TPS scaling for subsequent workloads as auto-scaling kicks in.
I would say that in the average case, particularly for workloads running below the advertised limits, customers will not see a change in behavior, since the default retries wouldn't be exhausted for those anyway. But as we see more adoption of Iceberg with bigger workloads, exhausting retries on throttles becomes an issue. One downside is that if there is a more systemic error in communicating with S3, workloads will take longer to fail, up to 10 minutes, but I would argue those cases are on the rare side compared to the value this brings for workloads that eventually succeed.
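As a rough back-of-the-envelope check (assuming the per-attempt backoff grows from the 2 s minimum up to a cap around 20 s, which is an assumption not stated in this thread):

32 retries × ~20 s max backoff ≈ 640 s ≈ 10.5 minutes

which is consistent with the "up to 10 minutes" worst case mentioned above.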
@amogh-jahagirdar let me know whether you have any other questions
One downside would be that if there is a more systemic error in communications with S3, workloads will take longer time to fail, up to 10 minutes but I would argue that those will be more on the rare side compared to the value it brings for eventually succeeding workloads.
Thanks for your patience, yeah this is exactly the part I'm mulling over. I can see the rationale for retrying throttling specifically this many times for larger workloads, but if I'm reading this PR right, this number of retries applies to all retryable exceptions, not just throttling, so I'm double-checking whether this level of retries is too broad for all the possible retryable exceptions.
I would agree with the concern here. If we can narrow this retry behavior to just the 503 error code, I would be more amenable (though 32 is still very high), but as a default this causes really bad behavior for other errors. The other issue is that there are retries in the surrounding execution path (task retries and stage retries in Spark, for example); high retry values compound and have a multiplicative effect on the overall job.
I would also say that hiding this is problematic for anyone diagnosing slowness in their workloads, so we should log messages when slowdowns are occurring.
@amogh-jahagirdar @danielcweeks See the response from Drew here.
If we can narrow this retry behavior to just the 503 error code
The token bucket behavior should achieve exactly that, since it charges all non-throttling exceptions against the bucket and effectively stops retrying them once it is drained.
@amogh-jahagirdar @nastra Regarding the concerns about the new config values, to give some additional data points: we have had similar configs internally in Iceberg for quite some time now (based on the original PR #8221), and we tested these new config values with EMR workloads. We have not observed any regressions so far, and the load test data from the S3 side shows that throttling has improved a lot with these settings.
Looks good to me!
@ookumuso it looks like CI failed for some unrelated reason; can you rebase the PR to retrigger the CI?
LGTM, I'll wait a bit in case @amogh-jahagirdar has any comments
RetryCondition.defaultRetryCondition(),
RetryOnExceptionsCondition.create(XMLStreamException.class)))

// Workaround: exclude all 503s from consuming retry tokens.
Based on this, it sounds like you're proposing unlimited retries on 503s and 32 retries on other failures? That feels far too aggressive, but I'm not sure I'm reading this correctly. Why would we treat bucket HEAD requests differently anyway? (Also, I'm not sure we actually make that request currently, so I don't think this is necessary.)
The retry policy is configured with both a max retry count and a token bucket. Throttling exceptions (503) won't consume tokens from the token bucket, so the retry count is limited by the max retry count of 32. Retryable but non-throttling exceptions will consume tokens from the token bucket. We're reusing the SDK default values for the token bucket, which is 500 tokens, and a non-throttling exception will consume 5 tokens. So, if there's a spike in non-throttling exceptions, we'd expect the token bucket to kick in and effectively disable retries.
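A rough sketch of the token-bucket setup being described (values mirror the SDK defaults mentioned above; illustrative, not the exact PR code):

```java
import software.amazon.awssdk.core.retry.RetryPolicy;
import software.amazon.awssdk.core.retry.conditions.TokenBucketExceptionCostFunction;
import software.amazon.awssdk.core.retry.conditions.TokenBucketRetryCondition;

// Sketch: a 500-token bucket where non-throttling failures cost 5 tokens each and
// throttling failures cost nothing, so 503s are limited only by the max retry count.
TokenBucketRetryCondition capacityCondition =
    TokenBucketRetryCondition.builder()
        .tokenBucketSize(500)
        .exceptionCostFunction(
            TokenBucketExceptionCostFunction.builder()
                .throttlingExceptionCost(0)
                .defaultExceptionCost(5)
                .build())
        .build();

RetryPolicy retryPolicy =
    RetryPolicy.builder()
        .numRetries(32)
        .retryCapacityCondition(capacityCondition)
        .build();
```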
Why would we treat bucket HEAD requests differently anyway?
It's a workaround for an SDK bug where 503s from HEAD requests aren't properly categorized as throttling exceptions.
HEAD is used here: https://github.com/apache/iceberg/blob/main/aws/src/main/java/org/apache/iceberg/aws/s3/BaseS3File.java#L80
Sorry, I confused the HEAD request here with a Bucket HEAD request (not an object HEAD request).
So, if there's a spike in non-throttling exceptions, we'd expect the token bucket to kick in and effectively disable retries.
I see, this is the part I missed. I tried digging into the AWS SDK to see if there's a simpler way to express all this, but I don't think there is. This implementation does seem correct for achieving it, though.
Overall, since this is specific to throttling exceptions, I'm more sold on the idea! One last thing from my end: I think @danielcweeks brought up a good point about adding logging when there are many throttles, since someone debugging performance issues won't really know about them unless they configure S3 SDK metrics.
Is that something we could add?
@ookumuso After thinking about this a little more, I'm increasingly concerned about the high default values. If we look at this in isolation, it seems like the right thing to do, under the assumption that we're getting throttled and S3 will repartition and everything will be fine going forward. However, in aggregate there's no guarantee that S3 is going to repartition, and in the degenerate cases (lots of small files, but also a rather small overall size) I'm not convinced that S3 actually will. Higher retries at that point just mask the underlying issue, and the net result is slower overall performance. My recommendation is,
Co-authored-by: Drew Schleit <[email protected]>
@danielcweeks I understand your concern about making this the default and potentially causing unexpected behavior. I set the default to 5 and updated the recommendation in the docs to use 32 if failures are observed due to throttling. I did not add additional logging since we are keeping the default low; I think it is better for customers to intentionally enable SDK logging or metrics to observe these, which most customers dealing with throttles do anyway. Otherwise it might be too noisy.
File size doesn't matter here with respect to auto-scaling; the main factor is the duration of the traffic. The current behavior for high-TPS workloads is that they exhaust the retries in under a minute, or even in seconds, with the default retry count, and then get into a loop of retrying manually and failing the same way. I agree, though, that with the option to set the retry counts, customers can still work around this.
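For anyone landing here later, overriding these defaults is just a matter of setting the new S3FileIO properties; the values below are only an example of raising the ceiling back to 32 for workloads known to hit throttling:

```java
import java.util.Map;

// Hypothetical override using the property keys introduced in this PR.
Map<String, String> s3RetryOverrides =
    Map.of(
        "s3.retry.num-retries", "32",
        "s3.retry.min-wait-ms", "2000");
```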
Thanks @ookumuso, this looks good to me. I think we can add logging later if it's a significant enough concern.
Iceberg workloads which exceed S3's prefix limits will see HTTP 503 (SlowDown) error responses. The ideal customer experience is for Iceberg to retry these 503 errors persistently, such that the workload is able to make progress, and eventually trigger S3 autoscaling.
Customers using Iceberg OSS with S3FileIO rely on the SDK's default retry mechanism, which is limited to 3 retries. We observed in our load tests that 3 retries are insufficient and these workloads fail almost immediately. This change aims to improve the customer experience by providing configuration parameters for S3 retries, and it also introduces better defaults to prevent fast failures.
This change also addresses some of the concerns raised in a previous pull request (#8221). The feedback particularly called out that the number of configurations introduced might be hard for customers to reason about, and that retries of IOExceptions should be independent of SDK retries. In this commit we propose a new implementation that addresses this feedback. There are three notable changes here:
There were additional retry configurations applied based on load testing discoveries:
503s returned from S3Client.headObject() are not correctly identified as throttled exceptions (aws/aws-sdk-java-v2#5414).
Change is originally authored by @drewschleit; I am following up on his behalf.