Improve Objstore Client Level Failures and Retries #3907
Comments
I had a discussion with @prmsrswt and we came to a conclusion along the lines of your solution 2.
I even attempted to do this once upon a time: #2785 😄
Thanks @Biswajitghosh98 and yea, we could use @GiedriusS's implementation directly. I am not yet fully convinced we want to go this path, though. It is very inefficient to retry on this level, especially if the client already retries. Do we really want this? (:
I agree with this. I'm not fully convinced about doing retries in a wrapper. We would end up retrying 4xx errors, and retrying on top of already-retried requests for clients that already support it. What if we start with a better analysis, like:
Hello 👋 Looks like there was no activity on this issue for the last two months.
Closing for now as promised, let us know if you need this to be reopened! 🤗
A good retry mechanism is critical. For example, the compactor does all the compaction work, and when an upload is aborted and the object client library decides not to retry, we crash the container and start from scratch. Since we never know at what moment we crashed, we have to download blocks (potentially the same ones) again and compact them again, which is inefficient if only the upload failed. The hashing added by @GiedriusS mitigates that partially by ensuring downloaded-block consistency, but it is only a small mitigation.
We have had many attempts and PRs to add bucket-level retries, e.g. #3894, #3756.
Should we add retries at the object storage level? The problem has been discussed many times already. The main issue is that we already have (and should have) retries in the individual client implementations: there we know more about the client side, so we can perform retries efficiently. Adding retry logic to all parent layers is unnecessary and should therefore be avoided.
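For illustration, this is roughly what client-level retry configuration looks like with the AWS SDK for Go (v1). This is just one example of a client exposing such knobs, not the specific client wiring Thanos uses:

```go
package s3client

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

// newS3Client configures retries on the client itself: the SDK knows
// which errors are transient (throttling, timeouts) and retries only
// those, with its own backoff. A parent-layer wrapper lacks this knowledge.
func newS3Client() (*s3.S3, error) {
	sess, err := session.NewSession(&aws.Config{
		Region:     aws.String("us-east-1"),
		MaxRetries: aws.Int(5),
	})
	if err != nil {
		return nil, err
	}
	return s3.New(sess), nil
}
```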
In the end, we trust each client to have good retry logic, ideally even per HTTP multi-part request, AND with good visibility/metrics. Each client should also understand backpressure statuses like `Rate-limit`, `TryLater`, etc... but all of this is not the case in practice, unfortunately.
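As a sketch of what "understanding backpressure statuses" could mean in practice, the hypothetical helper below inspects HTTP responses for rate-limit-style statuses and honors `Retry-After`; the helper name and the exact status set are assumptions, not an existing client API:

```go
package backpressure

import (
	"net/http"
	"strconv"
	"time"
)

// isBackpressure reports whether an HTTP response signals that the client
// should back off and retry later, and for how long. Hypothetical helper,
// not part of Thanos or any client library.
func isBackpressure(resp *http.Response) (bool, time.Duration) {
	switch resp.StatusCode {
	case http.StatusTooManyRequests, // 429: rate-limited
		http.StatusServiceUnavailable: // 503: try later
		// Honor the Retry-After header when the server provides one.
		if s := resp.Header.Get("Retry-After"); s != "" {
			if secs, err := strconv.Atoi(s); err == nil {
				return true, time.Duration(secs) * time.Second
			}
		}
		return true, 2 * time.Second // fallback delay
	}
	return false, 0
}
```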
Solution?
I can see two options:

1. Should we try to contribute this to at least the main clients, e.g. GCS/S3, or double-check what they already do?
2. Create a "just in case" SINGLE bucket wrapper that tries to apply some retries within a timeout, to handle inconsistent retry logic across clients? 🤔 (See the sketch below.)
Thoughts welcome 🤗