Improving {Halt,Retry}Error semantics #2784
Comments
In certain environments, especially private, on-premise clouds, transient errors can occur because of misconfiguration or for other reasons. In my case, the S3 endpoint sometimes spuriously reset the TCP connection. Without functionality like this, it was almost impossible for Thanos Compact to finish its work, considering that we sometimes have to download hundreds of gigabytes of files. Curiously enough, only the download path was affected, so this only adds retries for that direction. Opening this up without any tests to see whether it is something we want to add to Thanos, or whether we should implement thanos-io#2784 instead. Or maybe both? Signed-off-by: Giedrius Statkevičius <[email protected]>
My gut feeling is that your proposed logic adds extra complexity to cover a case which shouldn't be the normal one. I wouldn't expect failures to happen so frequently and, if they do, fixing the root cause is probably better, but I'm open to discussing it further.
Some object storages already export the checksum of objects when reading attributes (HEAD operation). Have you checked if it's available for any object store we support? If so, you wouldn't need to store any checksum in |
I actually had to implement a rough version of #2785 because it was impossible for Thanos Compactor to finish at least one iteration due to transient errors. If the object storage code doesn't support (and it shouldn't) retries when some more serious error happens and if our retrying logic is kind of useless IMHO then where do you think such things should be handled?
That's true but I don't think we should add another method for this to our object storage interface if we can avoid it because that raises the bar for any other object storage that might not support such functionality. |
We have the attributes function which was designed for this. Assuming it's feasible to add the checksum (I haven't checked if it's implemented by all storages we do support nowadays) I agree that adding extra attributes there could be a blocker to support other storages in the future. |
Hello 👋 Looks like there was no activity on this issue for last 30 days. |
WIP. Have basic hashes support in my own branch, now need to add the needed functionality in Compactor. |
Attempting to implement this in #3031. |
Hello 👋 Looks like there was no activity on this issue for the last two months. |
Trying to push #3031 over the line. |
#3031 merged so closing 🎉 |
On Tue, Mar 2, 2021 at 04:15, Giedrius Statkevičius <[email protected]> wrote:
Closed #2784.
|
It's a continuation of a small discussion that we had on Slack a few weeks ago. Currently, we have `HaltError` and `RetryError`. `HaltError`s happen when some critical issue occurs and we have to halt the whole process so that the operator can investigate what went wrong. A `RetryError` happens when some operation fails that we can retry. Some relevant snippets of code:
However, the way it works right now, if a `RetryError` occurs then we retry the whole iteration, i.e. `compactMainFn()`. That's awesome, but all of the previously performed work disappears because once we start that function again, we wipe the workspace:
https://github.com/thanos-io/thanos/blob/master/pkg/compact/compact.go#L938-L941
https://github.com/thanos-io/thanos/blob/master/pkg/compact/compact.go#L464-L469
This means that all of the files that were previously successfully downloaded are lost.
I propose extending this logic to check what is on the disk before trying to overwrite the files. My idea is:
- `meta.json` to include file integrity checksums of all files in the directory besides `meta.json` itself and `deletion-mark.json`. These should be calculated by writers only if some flag has been specified so that this wouldn't incur a performance cost on all writers;

Affected Thanos versions: <=0.13.0.
Thoughts?