
AWS: Set better defaults for S3 retry behaviour #11052

Merged (1 commit into apache:main) on Oct 1, 2024

Conversation

ookumuso (Contributor):

Iceberg workloads which exceed S3's prefix limits will see HTTP 503 (SlowDown) error responses. The ideal customer experience is for Iceberg to retry these 503 errors persistently, such that the workload is able to make progress, and eventually trigger S3 autoscaling.

Customers using Iceberg OSS with S3FileIO leverage the SDK's default retry mechanism, which is limited to 3 retries. We observe in our load tests that 3 retries is insufficient, and these workloads fail almost immediately. This change aims to improve the customer experience by providing configuration parameters for S3 retries and also introduces better defaults to prevent fast failures.

This change also addresses some of the concerns that were raised in a previous pull request (#8221). The feedback particularly called out that the number of configurations introduced might be hard for customers to reason about, and that retries for IOExceptions should be independent of SDK retries. In this commit we propose a new implementation that addresses this feedback. There are three notable changes:

  • Introduced three configuration properties to control the retry behavior: the retry count, minimum backoff time, and maximum backoff time.
  • Better default values for S3 retries, which allow most workloads to succeed.
  • The retry code introduced by #8221 (AWS: Add retry logic for S3InputStream and S3OutputStream) was simplified to retry only IOExceptions thrown outside of the SDK call (e.g. SSLException, SocketTimeoutException).

Additional retry configurations were applied based on load-testing findings.

This change was originally authored by @drewschleit; I am following up on his behalf.
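For reference, a minimal sketch of how these properties could be supplied to S3FileIO through its property map (the keys for retry count and minimum wait appear later in this thread; the maximum-wait key name and the chosen values are illustrative assumptions):

import java.util.HashMap;
import java.util.Map;
import org.apache.iceberg.aws.s3.S3FileIO;

public class S3RetryConfigExample {
  public static void main(String[] args) {
    Map<String, String> properties = new HashMap<>();
    // Keys "s3.retry.num-retries" and "s3.retry.min-wait-ms" are introduced by this change;
    // "s3.retry.max-wait-ms" is an assumed name for the maximum-backoff property.
    properties.put("s3.retry.num-retries", "32");    // e.g. for throttling-heavy workloads
    properties.put("s3.retry.min-wait-ms", "2000");  // minimum backoff between retries
    properties.put("s3.retry.max-wait-ms", "20000"); // assumed key: maximum backoff between retries

    // S3FileIO reads these settings from the property map it is initialized with.
    S3FileIO fileIO = new S3FileIO();
    fileIO.initialize(properties);
  }
}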

@@ -38,6 +40,14 @@
import software.amazon.awssdk.auth.credentials.AnonymousCredentialsProvider;
import software.amazon.awssdk.core.client.config.ClientOverrideConfiguration;
import software.amazon.awssdk.core.client.config.SdkAdvancedClientOption;
import software.amazon.awssdk.core.exception.SdkServiceException;
import software.amazon.awssdk.core.retry.RetryMode;
import software.amazon.awssdk.core.retry.RetryPolicy;
Contributor:

Any reason not to use RetryStrategy from the retries API introduced in version 2.26? The SDK version currently used in Iceberg supports it, and technically RetryPolicy has already been flagged as deprecated.

Contributor Author (ookumuso):

This is mainly because of the retryCapacityCondition we use in the RetryPolicy to work around the HEAD 503 problem, for which I couldn't find a good replacement in RetryStrategy. We have an issue open with the SDK to fix that; once it's resolved, we should be able to move.

Just double-checking the support level: it seems there is no immediate date to stop supporting RetryPolicy either: https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/retry-strategy.html#migrate-from-retry-policies

.retryCondition(
OrRetryCondition.create(
RetryCondition.defaultRetryCondition(),
RetryOnExceptionsCondition.create(XMLStreamException.class)))
Contributor:

I don't think this is comprehensive. We've seen issues with exceptions thrown as UncheckedIOException when using the urlconnection client, which throws the UncheckedIOException directly; it is not caught by the RetryableStage, so these failures are not retried even if we list them as retryable exception conditions (since they are never recognized/caught at all).

Example:

java.io.UncheckedIOException: java.net.SocketException: Unexpected end of file from server
 at software.amazon.awssdk.utils.FunctionalUtils.asRuntimeException(FunctionalUtils.java:180)
 at software.amazon.awssdk.utils.FunctionalUtils.lambda$safeSupplier$4(FunctionalUtils.java:110)
 at software.amazon.awssdk.utils.FunctionalUtils.invokeSafely(FunctionalUtils.java:136)
 at software.amazon.awssdk.http.urlconnection.UrlConnectionHttpClient$RequestCallable.lambda$tryGetOutputStream$0(UrlConnectionHttpClient.java:324)
 at software.amazon.awssdk.http.urlconnection.UrlConnectionHttpClient$RequestCallable.getAndHandle100Bug(UrlConnectionHttpClient.java:360)
 at software.amazon.awssdk.http.urlconnection.UrlConnectionHttpClient$RequestCallable.tryGetOutputStream(UrlConnectionHttpClient.java:324)
 at software.amazon.awssdk.http.urlconnection.UrlConnectionHttpClient$RequestCallable.call(UrlConnectionHttpClient.java:301)
 at software.amazon.awssdk.http.urlconnection.UrlConnectionHttpClient$RequestCallable.call(UrlConnectionHttpClient.java:274)
 at software.amazon.awssdk.core.internal.util.MetricUtils.measureDurationUnsafe(MetricUtils.java:67)
 at software.amazon.awssdk.core.internal.http.pipeline.stages.MakeHttpRequestStage.executeHttpRequest(MakeHttpRequestStage.java:77)
 at software.amazon.awssdk.core.internal.http.pipeline.stages.MakeHttpRequestStage.execute(MakeHttpRequestStage.java:56)
 at software.amazon.awssdk.core.internal.http.pipeline.stages.MakeHttpRequestStage.execute(MakeHttpRequestStage.java:39)
 at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
 at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
 at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
 at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
 at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptTimeoutTrackingStage.execute(ApiCallAttemptTimeoutTrackingStage.java:72)
 at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptTimeoutTrackingStage.execute(ApiCallAttemptTimeoutTrackingStage.java:42)
 at software.amazon.awssdk.core.internal.http.pipeline.stages.TimeoutExceptionHandlingStage.execute(TimeoutExceptionHandlingStage.java:78)
 at software.amazon.awssdk.core.internal.http.pipeline.stages.TimeoutExceptionHandlingStage.execute(TimeoutExceptionHandlingStage.java:40)
 at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptMetricCollectionStage.execute(ApiCallAttemptMetricCollectionStage.java:52)
 at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptMetricCollectionStage.execute(ApiCallAttemptMetricCollectionStage.java:37)
 at software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage.execute(RetryableStage.java:81)
 at software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage.execute(RetryableStage.java:36)
 at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
 at software.amazon.awssdk.core.internal.http.StreamManagingStage.execute(StreamManagingStage.java:56)
 at software.amazon.awssdk.core.internal.http.StreamManagingStage.execute(StreamManagingStage.java:36)
 at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.executeWithTimer(ApiCallTimeoutTrackingStage.java:80)
 at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.execute(ApiCallTimeoutTrackingStage.java:60)
 at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.execute(ApiCallTimeoutTrackingStage.java:42)
 at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallMetricCollectionStage.execute(ApiCallMetricCollectionStage.java:50)
 at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallMetricCollectionStage.execute(ApiCallMetricCollectionStage.java:32)
 at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
 at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
 at software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:37)
 at software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:26)
 at software.amazon.awssdk.core.internal.http.AmazonSyncHttpClient$RequestExecutionBuilderImpl.execute(AmazonSyncHttpClient.java:196)
 at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.invoke(BaseSyncClientHandler.java:103)
 at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.doExecute(BaseSyncClientHandler.java:171)
 at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.lambda$execute$1(BaseSyncClientHandler.java:82)
 at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.measureApiCallSuccess(BaseSyncClientHandler.java:179)
 at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.execute(BaseSyncClientHandler.java:76)
 at software.amazon.awssdk.core.client.handler.SdkSyncClientHandler.execute(SdkSyncClientHandler.java:45)
 at software.amazon.awssdk.awscore.client.handler.AwsSyncClientHandler.execute(AwsSyncClientHandler.java:56)
 at software.amazon.awssdk.services.s3.DefaultS3Client.putObject(DefaultS3Client.java:8783)

Contributor Author (ookumuso):

You are right, it seems RetryableStage doesn't even catch all the default exceptions mentioned here, particularly the UncheckedIOException one. It seems we need to solve this on the SDK side regardless, but I'm leaning towards leaving these here so that we don't need to come back and revisit this later on. What do you think?

I can update aws/aws-sdk-java-v2#5442 with this info as well, unless you already know of a separate issue open regarding this.


ookumuso commented Sep 9, 2024

Removed all InputStream-related changes in favor of #10433.

ookumuso changed the title from "AWS: Add configuration and set better defaults for S3 retry behaviour" to "AWS: Set better defaults for S3 retry behaviour" on Sep 9, 2024.
docs/docs/aws.md (review thread outdated, resolved)
/** Minimum wait time to retry a S3 operation */
public static final String S3_RETRY_MIN_WAIT_MS = "s3.retry.min-wait-ms";

public static final long S3_RETRY_MIN_WAIT_MS_DEFAULT = 2_000; // 2 seconds
Contributor:

Would 500 ms maybe make more sense?

Contributor Author (ookumuso):

500 ms was the original value we ran our load tests with, but we did not see any benefit from the lower value, and it caused significantly higher throttle rates.

@nastra (Contributor) left a comment:

@amogh-jahagirdar could you also take a look please?

@amogh-jahagirdar (Contributor) left a comment:

Thanks @ookumuso, I think adding the configuration options is reasonable, but I'm a bit concerned that the defaults, when compounded with retries that integrations may already perform (e.g. Spark task-level retries), may be too much and not improve reliability in the general case. As I understand it you've done some benchmarks; do we have data points from runs where smaller values for retries and duration were configured? That would help make it clearer what the impact is when the values are lower.

/** Number of times to retry S3 operations. */
public static final String S3_RETRY_NUM_RETRIES = "s3.retry.num-retries";

public static final int S3_RETRY_NUM_RETRIES_DEFAULT = 32;
Contributor:

@ookumuso This value seems really high to me. It can compound, considering engine integrations may already have their own level of retries on top of this (e.g. Spark task retries). I would also hypothesize that the chance of success after some small number of retries (I think the SDK default is 5?) is slim, and that it's generally not worth retrying beyond that.

In the load tests that were performed, did this value of 32 markedly improve reliability compared to lower values?

I think setting this default value comes down to the typical distribution of outcomes, and I'd speculate that in the average case of 5xx errors this large number of retries won't help. In some extreme cases it seems possible, but for those cases users can override the config defined here.

@ookumuso (Contributor Author) commented Sep 18, 2024:

@amogh-jahagirdar The main consideration here is to give S3 enough time to auto-scale, particularly for workloads that observe 503s when they go beyond S3's advertised limits. What we observed in our load tests is that highly concurrent workloads exhaust the default retries (3) and fail almost immediately; how fast the retries are exhausted depends on the concurrency level. The retry count makes the main difference here, since the overall application becomes more resistant to 503s, which leads to success and better TPS scaling for subsequent workloads as auto-scaling kicks in.

I would say that in the average case, particularly for workloads running below the advertised limits, customers will not see a change in behavior, since the default retries wouldn't be exhausted for those anyway. But as we see more adoption of Iceberg with bigger workloads, exhausting retries on throttles becomes an issue. One downside is that if there is a more systemic error in communications with S3, workloads will take longer to fail, up to 10 minutes; I would argue those cases will be rare compared to the value this brings for workloads that eventually succeed.

Contributor Author (ookumuso):

@amogh-jahagirdar let me know whether you have any other questions

@amogh-jahagirdar (Contributor) commented Sep 26, 2024:

One downside is that if there is a more systemic error in communications with S3, workloads will take longer to fail, up to 10 minutes; I would argue those cases will be rare compared to the value this brings for workloads that eventually succeed.

Thanks for your patience; yeah, this is exactly the part I'm mulling over. I can see a rationale for retrying throttling specifically this many times for larger workloads, but if I'm reading this PR right, this number of retries applies to all retryable exceptions, not just throttling, so I'm double-checking whether this level of retries is too broad for all the possible retryable exceptions.

Contributor:

I would agree with the concern here. If we can narrow this retry behavior to just the 503 error code, I would be more amenable (though 32 is still very high), but as a default retry count this causes really bad behavior for other errors. The other issue is that there are retries in the surrounding execution path (task retries and stage retries in Spark, for example); the compounding effect of high retry values is multiplicative for the overall job.

I would also say that trying to hide this is problematic for those diagnosing slowness with their workloads, so we should log messages when slowdowns are occurring.

Contributor Author (ookumuso):

@amogh-jahagirdar @danielcweeks See the response from Drew here.

If we can narrow this retry behavior to just the 503 error code

The token bucket behavior should achieve exactly that, since it captures all non-throttling exceptions and disables retries for them.

@jackye1995 (Contributor):

@amogh-jahagirdar @nastra for the concerns regarding the new config values, to give some additional data points: we have had similar configs internally in Iceberg for quite some time now (based on the original PR #8221), and we tested these new config values with EMR workloads. At least so far we have not observed any regressions, and from the load test data on the S3 side it looks like throttling has improved a lot with these settings.

@jackye1995 (Contributor) left a comment:

looks good to me!

@jackye1995 (Contributor):

@ookumuso looks like CI failed for some unrelated reason; can you rebase the PR to retrigger the CI?

@nastra (Contributor) left a comment:

LGTM, I'll wait a bit in case @amogh-jahagirdar has any comments

RetryCondition.defaultRetryCondition(),
RetryOnExceptionsCondition.create(XMLStreamException.class)))

// Workaround: exclude all 503s from consuming retry tokens.
@danielcweeks (Contributor) commented Sep 26, 2024:

Based on this, it sounds like you're proposing unlimited retries on 503s and 32 retries on other failures? That feels far too aggressive, but I'm not sure I'm reading this correctly. Why would we treat bucket HEAD requests differently anyway? (Also, I'm not sure we actually make that request currently, so I don't think this is necessary.)

@drewschleit (Contributor) commented Sep 26, 2024:

The retry policy is configured with both a max retry count and a token bucket. Throttling exceptions (503) won't consume tokens from the token bucket, so their retries are limited only by the max retry count of 32. Retryable but non-throttling exceptions will consume tokens from the token bucket. We're reusing the SDK default values for the token bucket: 500 tokens, with a non-throttling exception consuming 5 tokens. So if there's a spike in non-throttling exceptions, we'd expect the token bucket to kick in and effectively disable retries.
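For illustration, a minimal sketch of this arrangement with the AWS SDK v2 retry API (the backoff strategy, delay values, and the plain status-code check are assumptions for the sketch, not the exact code in this change):

import java.time.Duration;
import software.amazon.awssdk.core.exception.SdkServiceException;
import software.amazon.awssdk.core.retry.RetryMode;
import software.amazon.awssdk.core.retry.RetryPolicy;
import software.amazon.awssdk.core.retry.backoff.EqualJitterBackoffStrategy;
import software.amazon.awssdk.core.retry.conditions.RetryCondition;
import software.amazon.awssdk.core.retry.conditions.TokenBucketRetryCondition;

public class S3RetryPolicySketch {

  public static RetryPolicy throttleTolerantRetryPolicy() {
    // The SDK's standard token bucket (500 tokens; a non-throttling retry costs 5 of them).
    RetryCondition tokenBucket = TokenBucketRetryCondition.forRetryMode(RetryMode.STANDARD);

    return RetryPolicy.builder(RetryMode.STANDARD)
        .numRetries(32) // hard cap on attempts; in practice the only limit on 503 retries
        .backoffStrategy( // strategy and delays are illustrative, not necessarily the PR's choice
            EqualJitterBackoffStrategy.builder()
                .baseDelay(Duration.ofMillis(2_000))
                .maxBackoffTime(Duration.ofMillis(20_000))
                .build())
        .retryCapacityCondition(
            context -> {
              // 503s bypass the token bucket entirely so they never consume retry tokens
              // (this also covers HEAD responses the SDK does not classify as throttling).
              if (context.exception() instanceof SdkServiceException
                  && ((SdkServiceException) context.exception()).statusCode() == 503) {
                return true;
              }
              // Any other retryable error draws down the shared token bucket, so a spike of
              // non-throttling failures quickly exhausts capacity and disables retries.
              // (This sketch does not wire up capacity release on success.)
              return tokenBucket.shouldRetry(context);
            })
        .build();
  }
}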

Contributor:

Why would we treat bucket HEAD requests differently anyway?

It's a workaround for an SDK bug where 503s from HEAD requests aren't properly categorized as throttling exceptions.

HEAD is used here: https://github.com/apache/iceberg/blob/main/aws/src/main/java/org/apache/iceberg/aws/s3/BaseS3File.java#L80

Contributor:

Sorry, I confused the HEAD request here (an object HEAD request) with a bucket HEAD request.

Contributor:

So, if there's a spike in non-throttling exceptions, we'd expect the token bucket to kick in and effectively disable retries.

I see, this is the part I missed. I tried digging into the AWS SDK to see if there's a simpler way to express all this, but I don't think there is. This implementation does seem correct for achieving it.

@amogh-jahagirdar (Contributor) left a comment:

Overall, since this is specific to throttling exceptions, I'm more sold on the idea! One last part from my end: I think @danielcweeks brought up a good point about adding logging when there are many throttles, since someone debugging performance issues won't really know about them unless they configure S3 SDK metrics.

Is that something that we could add?


@danielcweeks (Contributor):

@ookumuso After thinking about this a little more, I'm increasingly concerned about the high value for defaults. If we look at this in isolation, it seems like the right thing to do with the assumption that we're getting throttled and S3 will repartition and everything will be fine going forward.

However, in aggregate, there's no guarantee that S3 is going to repartition and in the degenerate cases (lots of small files, but also a rather small overall size), I'm not convinced that S3 will actually repartition. Higher retries at this point will mask the underlying issue and the net result is that you just get slower overall performance.

My recommendation is:

  1. set the default lower (maybe 5 retries), but allow it to be configured if there are problems for specific workloads/queries.
  2. log in the retry handler when throttling is happening (I know the SDK logs this, but I don't think that's typically enabled by default, so we need to expose what's happening); a sketch of what such logging could look like follows below.
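For illustration, a minimal sketch of such logging, assuming slf4j and a wrapper around the SDK's RetryCondition (this is not part of the change that was merged):

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import software.amazon.awssdk.core.exception.SdkServiceException;
import software.amazon.awssdk.core.retry.RetryPolicyContext;
import software.amazon.awssdk.core.retry.conditions.RetryCondition;

/** Hypothetical wrapper: logs throttling responses, then delegates the retry decision. */
public class ThrottleLoggingRetryCondition implements RetryCondition {
  private static final Logger LOG = LoggerFactory.getLogger(ThrottleLoggingRetryCondition.class);

  private final RetryCondition delegate;

  public ThrottleLoggingRetryCondition(RetryCondition delegate) {
    this.delegate = delegate;
  }

  @Override
  public boolean shouldRetry(RetryPolicyContext context) {
    if (context.exception() instanceof SdkServiceException
        && ((SdkServiceException) context.exception()).isThrottlingException()) {
      LOG.warn(
          "S3 throttled the request (retries attempted so far: {}); backing off and retrying",
          context.retriesAttempted());
    }
    return delegate.shouldRetry(context);
  }
}

A wrapper like this could be passed to RetryPolicy.Builder#retryCondition in place of the bare condition.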


ookumuso commented Oct 1, 2024


@danielcweeks, I understand your concern about making this the default and potentially causing unexpected behaviors. I set the default to 5 and updated the recommendation in the docs to set it to 32 if failures are observed due to throttling. I did not add additional logging since we are keeping the default low; I think it would be better for customers to intentionally enable SDK logging or metrics to observe these, which most customers dealing with throttles do anyway. Otherwise it might be too noisy.

However, in aggregate, there's no guarantee that S3 is going to repartition and in the degenerate cases (lots of small files, but also a rather small overall size), I'm not convinced that S3 will actually repartition. Higher retries at this point will mask the underlying issue and the net result is that you just get slower overall performance.

File size doesn't matter here with respect to auto-scaling; the main issue is the duration of the traffic. The current behavior for high-TPS workloads is that they exhaust the default retry count quickly, in under a minute or even seconds, and get into a loop of manually retrying and failing the same way. I agree, though, that with the option to set the retry count, customers can still work around this with this change.
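For illustration, a workload that still fails due to throttling could raise the retry count per that docs recommendation, for example through Spark catalog properties (a sketch; the catalog name my_catalog is a placeholder):

import org.apache.spark.SparkConf;

public class S3RetryOverrideExample {
  public static SparkConf withHigherS3Retries(SparkConf conf) {
    // Catalog properties are passed through to S3FileIO; "my_catalog" is a placeholder name.
    return conf
        .set("spark.sql.catalog.my_catalog.s3.retry.num-retries", "32")
        .set("spark.sql.catalog.my_catalog.s3.retry.min-wait-ms", "2000");
  }
}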

@danielcweeks (Contributor):

Thanks @ookumuso, this looks good to me. I think we can add logging later if it's a significant enough concern.

danielcweeks merged commit 09c7376 into apache:main on Oct 1, 2024 (49 checks passed).
ookumuso deleted the s3-retry branch on October 2, 2024.