Improve Objstore Client Level Failures and Retries #3907
Comments
I had a discussion with @prmsrswt and we came to a conclusion along the lines of your solution 2.
I even attempted to do this once upon a time: #2785 😄
Thanks @Biswajitghosh98 and yea, we could use @GiedriusS's implementation directly. I am not yet fully convinced we want to go this path, though. It is very inefficient to retry on this level, especially if the client already retries. Do we really want this? (:
I agree with this. I'm not fully convinced about doing retries in a wrapper. We would end up retrying 4xx errors, and retrying on top of already-retried requests for clients that already support it. What if we start with a better analysis, like:
Hello 👋 Looks like there was no activity on this issue for the last two months.
Closing for now as promised, let us know if you need this to be reopened! 🤗
A good retry mechanism is critical. For example, the compactor does all the compaction work, and when an upload is aborted and the object client library decides not to retry, we crash the container and start from scratch. Since we never know at what moment we crashed, we have to download blocks (potentially the same ones) again and compact them again, which is inefficient if only the upload failed. The hashing added by @GiedriusS mitigates that partially by ensuring downloaded-block consistency, but it is only a small mitigation.
We have had many attempts and PRs to add bucket-level retries, e.g. #3894, #3756.
Should we add retries at the object storage level? The problem has been discussed many times already. The main issue is that we already have (and should have) retries in the individual client implementations: there we know more about the client side, so we can perform retries efficiently. Adding retry logic to all parent layers is unnecessary and should therefore be avoided.
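For illustration, this is roughly what client-level retry configuration looks like with the AWS SDK for Go (v1). This is just one example of a client exposing such knobs, not the specific client wiring Thanos uses:

```go
package s3client

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

// newS3Client configures retries on the client itself: the SDK knows
// which errors are transient (throttling, timeouts) and retries only
// those, with its own backoff. A parent-layer wrapper lacks this knowledge.
func newS3Client() (*s3.S3, error) {
	sess, err := session.NewSession(&aws.Config{
		Region:     aws.String("us-east-1"),
		MaxRetries: aws.Int(5),
	})
	if err != nil {
		return nil, err
	}
	return s3.New(sess), nil
}
```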
In the end, we trust each client to have good retry logic, ideally even per HTTP multi-part request, AND with good visibility/metrics. Each client should also understand backpressure statuses like `Rate-limit`, `TryLater`, etc... but all of this is not the case in practice, unfortunately.
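As a sketch of what "understanding backpressure statuses" could mean in practice, the hypothetical helper below inspects HTTP responses for rate-limit-style statuses and honors `Retry-After`; the helper name and the exact status set are assumptions, not an existing client API:

```go
package backpressure

import (
	"net/http"
	"strconv"
	"time"
)

// isBackpressure reports whether an HTTP response signals that the client
// should back off and retry later, and for how long. Hypothetical helper,
// not part of Thanos or any client library.
func isBackpressure(resp *http.Response) (bool, time.Duration) {
	switch resp.StatusCode {
	case http.StatusTooManyRequests, // 429: rate-limited
		http.StatusServiceUnavailable: // 503: try later
		// Honor the Retry-After header when the server provides one.
		if s := resp.Header.Get("Retry-After"); s != "" {
			if secs, err := strconv.Atoi(s); err == nil {
				return true, time.Duration(secs) * time.Second
			}
		}
		return true, 2 * time.Second // fallback delay
	}
	return false, 0
}
```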
Solution?
I can see two options:

1. Should we try to contribute this to at least the main clients, e.g. GCS/S3, or double-check what they already do?
2. Create a "just in case" SINGLE bucket wrapper that tries to apply some retries within a timeout, to handle inconsistent retry logic across clients? 🤔 (See the sketch below.)
Thoughts welcome 🤗