
Improving {Halt,Retry}Error semantics #2784

Closed
GiedriusS opened this issue Jun 19, 2020 · 13 comments
@GiedriusS (Member)

This is a continuation of a small discussion that we had on Slack a few weeks ago. Currently, we have HaltError and RetryError. A HaltError happens when some critical issue occurs and we have to halt the whole process so that the operator can investigate what went wrong. A RetryError happens when an operation fails in a way that can be retried. Some relevant snippets of code:

  // The RetryError signals that we hit a retriable error (transient error, no connection).
  // You should alert on this being triggered too frequently.

  // The HaltError type signals that we hit a critical bug and should block
  // for investigation. You should alert on this being halted.
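
For illustration, here is a minimal, self-contained sketch of that halt/retry split. The haltError/retryError types and the runIteration helper below are hypothetical stand-ins, not the actual types in pkg/compact; the point is only to show the intended semantics: retriable errors re-run the iteration, halt errors stop it for investigation.

```go
// Minimal sketch of the halt/retry semantics described above.
// All names here are illustrative, not Thanos's actual API.
package main

import (
	"errors"
	"fmt"
	"time"
)

// haltError marks a critical bug: stop and wait for an operator.
type haltError struct{ err error }

func (e haltError) Error() string { return "halt: " + e.err.Error() }

// retryError marks a transient failure: the whole iteration may be re-run.
type retryError struct{ err error }

func (e retryError) Error() string { return "retry: " + e.err.Error() }

// runIteration re-runs the compaction function on retryError and gives up
// immediately on haltError, mirroring the comments quoted above.
func runIteration(compactMainFn func() error) error {
	for {
		err := compactMainFn()
		if err == nil {
			return nil
		}
		var he haltError
		if errors.As(err, &he) {
			return err // critical bug: block for investigation, alert on this.
		}
		var re retryError
		if errors.As(err, &re) {
			fmt.Println("transient error, retrying iteration:", err)
			time.Sleep(2 * time.Second)
			continue
		}
		return err // unclassified errors are not retried here.
	}
}

func main() {
	attempts := 0
	err := runIteration(func() error {
		attempts++
		if attempts < 3 {
			return retryError{err: errors.New("connection reset by peer")}
		}
		return nil
	})
	fmt.Println("finished after", attempts, "attempts, err:", err)
}
```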

However, the way it works right now is that if a RetryError occurs, we retry the whole iteration, i.e. compactMainFn(). That's great, but all of the previously performed work disappears because once we start that function again, we wipe the workspace:

https://github.com/thanos-io/thanos/blob/master/pkg/compact/compact.go#L938-L941
https://github.com/thanos-io/thanos/blob/master/pkg/compact/compact.go#L464-L469

This means that all of the files that were previously downloaded successfully are lost.

I propose extending this logic to check what is on the disk before trying to overwrite the files. My idea is:

  • Extend meta.json to include file integrity checksums of all files in the directory besides meta.json itself and deletion-mark.json. These should be calculated by writers only if a flag has been specified, so that this doesn't incur a performance cost on all writers;
  • Make Thanos Compact consult those checksums, if they are present, before trying to download files again. If they match, we don't download the file. If they don't, we download the file again and overwrite the local copy (see the sketch after this list).
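
To make the second point concrete, below is a rough sketch of the check the compactor could perform locally, assuming hypothetical per-file SHA-256 checksums recorded in meta.json; the actual schema and helper names would be up to the implementation (see #3031).

```go
// Sketch of the proposed "skip download if the local file already matches"
// check. The meta.json schema that supplies wantChecksum is hypothetical.
package verify

import (
	"crypto/sha256"
	"encoding/hex"
	"io"
	"os"
)

// sha256OfFile computes the hex-encoded SHA-256 digest of a local file.
func sha256OfFile(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()
	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

// needsDownload reports whether a file has to be (re)downloaded: it is
// skipped only when it already exists locally and its checksum matches the
// one recorded by the writer.
func needsDownload(localPath, wantChecksum string) (bool, error) {
	if wantChecksum == "" {
		return true, nil // writer did not record a checksum; always download.
	}
	if _, err := os.Stat(localPath); os.IsNotExist(err) {
		return true, nil
	} else if err != nil {
		return false, err
	}
	got, err := sha256OfFile(localPath)
	if err != nil {
		return true, nil // unreadable local file: overwrite it.
	}
	return got != wantChecksum, nil
}
```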

Affected Thanos versions: <=0.13.0.

Thoughts?

GiedriusS added a commit to GiedriusS/thanos that referenced this issue Jun 19, 2020
In certain environments, especially private, on-premise clouds, transient errors can sometimes happen due to misconfiguration or other reasons. In my case, the S3 endpoint sometimes spuriously reset the TCP connection. Without functionality like this, it was almost impossible for Thanos Compact to finish its work, given that we sometimes have to download hundreds of gigabytes of files.

Curiously enough, only the downloading path was affected, so this only adds retries for that direction.

Opening this up without any tests to see whether this is something we want to add to Thanos, or whether we should implement
thanos-io#2784 instead. Or maybe both?

Signed-off-by: Giedrius Statkevičius <[email protected]>
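
For context, the idea described in that commit message boils down to something like the following sketch; withRetries and downloadFn are hypothetical names, not the code from the referenced change.

```go
// Illustrative sketch of download-path retries: re-run only the download
// call on transient failures instead of restarting the whole iteration.
package download

import (
	"context"
	"time"
)

// withRetries re-runs downloadFn up to maxAttempts times, backing off between
// attempts, so that spurious TCP resets from the object store endpoint do not
// abort a whole compaction iteration.
func withRetries(ctx context.Context, maxAttempts int, backoff time.Duration, downloadFn func(context.Context) error) error {
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err = downloadFn(ctx); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff):
			backoff *= 2 // simple exponential backoff between attempts.
		}
	}
	return err // last error after exhausting all attempts.
}
```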
@pracucci (Contributor)

My gut feeling is that your proposed logic adds extra complexity to cover a case that shouldn't be the normal one. I wouldn't expect failures to happen so frequently and, if they do, fixing the root cause is probably better, but I'm open to discussing it further.

Extend meta.json to include file integrity checksums of all files in the directory besides meta.json itself and deletion-mark.json. These should be calculated by writers only if a flag has been specified, so that this doesn't incur a performance cost on all writers;

Some object storages already export the checksum of objects when reading attributes (HEAD operation). Have you checked if it's available for any object store we support? If so, you wouldn't need to store any checksum in meta.json.
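
For reference, on S3-compatible stores this kind of metadata can be read with a HEAD-style call. The sketch below uses the minio-go client directly (not the Thanos objstore interface), with placeholder endpoint, bucket, and credentials; note the caveat in the comment that the ETag is only a content MD5 for single-part uploads.

```go
// Sketch: read checksum-like attributes of an object with a HEAD request.
// Endpoint, bucket, object name and credentials are placeholders.
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/minio/minio-go/v7"
	"github.com/minio/minio-go/v7/pkg/credentials"
)

func main() {
	client, err := minio.New("s3.example.com", &minio.Options{
		Creds:  credentials.NewStaticV4("ACCESS_KEY", "SECRET_KEY", ""),
		Secure: true,
	})
	if err != nil {
		log.Fatal(err)
	}

	// StatObject issues a HEAD request and returns the object's attributes.
	info, err := client.StatObject(context.Background(), "thanos-bucket",
		"01ABCDEF/chunks/000001", minio.StatObjectOptions{})
	if err != nil {
		log.Fatal(err)
	}

	// Caveat: for S3 the ETag equals the content MD5 only for single-part
	// uploads; multipart uploads produce a different, non-MD5 ETag, so it
	// cannot be relied on as a general content checksum.
	fmt.Println("etag:", info.ETag, "size:", info.Size)
}
```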

@GiedriusS (Member, Author)

My gut feeling is that your proposed logic adds extra complexity to cover a case that shouldn't be the normal one. I wouldn't expect failures to happen so frequently and, if they do, fixing the root cause is probably better, but I'm open to discussing it further.

I actually had to implement a rough version of #2785 because it was impossible for Thanos Compactor to finish even a single iteration due to transient errors. RetryError should help in this case, but it wipes everything right now, so it's pretty useless. It only "helps" by not restarting the whole process when --wait is specified.

If the object storage code doesn't support retries when some more serious error happens (and it shouldn't), and if our retrying logic is, IMHO, kind of useless, then where do you think such things should be handled?

RetryError seems like an elegant solution to me but it needs improvements as outlined in the OP.

Extend meta.json to include file integrity checksums of all files in the directory besides meta.json itself and deletion-mark.json. These should be calculated by writers only if a flag has been specified, so that this doesn't incur a performance cost on all writers;

Some object storages already export the checksum of objects when reading attributes (HEAD operation). Have you checked if it's available for any object store we support? If so, you wouldn't need to store any checksum in meta.json.

That's true, but I don't think we should add another method for this to our object storage interface if we can avoid it, because that raises the bar for any other object storage that might not support such functionality.

@pracucci (Contributor)

That's true, but I don't think we should add another method for this to our object storage interface if we can avoid it, because that raises the bar for any other object storage that might not support such functionality.

We have the attributes function, which was designed for this. Assuming it's feasible to add the checksum (I haven't checked whether it's implemented by all the storages we support nowadays), I agree that adding extra attributes there could be a blocker to supporting other storages in the future.

stale bot commented Jul 19, 2020

Hello 👋 Looks like there was no activity on this issue for the last 30 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this issue or push a commit. Thanks! 🤗
If there is no activity in the next week, this issue will be closed (we can always reopen an issue if needed!). Alternatively, use the remind command if you wish to be reminded at some point in the future.

@stale stale bot added the stale label Jul 19, 2020
@GiedriusS (Member, Author)

WIP. I have basic hash support in my own branch; now I need to add the required functionality to the Compactor.


@stale stale bot added the stale label Aug 19, 2020
@GiedriusS (Member, Author)

Attempting to implement this in #3031.

@stale stale bot removed the stale label Aug 19, 2020

@stale stale bot added the stale label Oct 18, 2020
@kakkoyun kakkoyun removed the stale label Oct 19, 2020

@stale stale bot added the stale label Dec 19, 2020
@GiedriusS (Member, Author)

Trying to push #3031 over the line.

@stale stale bot removed the stale label Dec 22, 2020

@stale stale bot added the stale label Feb 21, 2021
@GiedriusS GiedriusS removed the stale label Feb 22, 2021
@GiedriusS (Member, Author)

#3031 merged so closing 🎉

locvo986 commented Mar 1, 2021 via email
