Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Propagate retriable/haltable errors from compactor #1183

Merged
merged 3 commits into from
Jun 6, 2019

Conversation

mattrco
Copy link
Contributor

@mattrco mattrco commented May 28, 2019

Changes

We observed thanos compactor was exiting on upload errors. Previously, a new
error was created from all worker errors so the custom type was not propagated.
This introduces go-multierror to preserve the error types.

If all errors returned are retriable, it is safe to retry the compaction.

Since the compactor may still return a single error this logic is preserved.

(Happy to refactor out the duplicated logic with individual error checks.)

Verification

Tests are passing and I'll run this change with real workloads shortly.

@mattrco
Copy link
Contributor Author

mattrco commented May 28, 2019

Also, although this will fix compact exiting, the work the compactor has written to disk is thrown away when the compaction is retried. I'm thinking about a good way to avoid this - the simplest thing would be to just retry the bucket requests more times before returning an error.

@GiedriusS
Copy link
Member

prometheus/tsdb already has a struct for this tsdb.MultiError that we use in some places already. Would it be possible to reuse it here?

@mattrco
Copy link
Contributor Author

mattrco commented May 29, 2019

@GiedriusS sure, I can use that instead 👍

@mattrco
Copy link
Contributor Author

mattrco commented May 29, 2019

Updated, I'll squash the commits if we're happy to proceed.

Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One suggestion, otherwise LGTM.

Thanks for this

cmd/thanos/compact.go Outdated Show resolved Hide resolved
cmd/thanos/compact.go Outdated Show resolved Hide resolved
cmd/thanos/compact.go Outdated Show resolved Hide resolved
@mattrco
Copy link
Contributor Author

mattrco commented Jun 4, 2019

@bwplotka @povilasv if you have a moment to take another look at this, that'd be much appreciated 🙇

Previously, a new error was created from all worker errors so the custom
type was not propagated.

If all errors returned are retriable, we assume it is safe to retry the compaction.

Since the compactor may still return a single error this logic is preserved.
@mattrco mattrco force-pushed the mattrco/propagate-retries branch from a1e8191 to 3dfb893 Compare June 5, 2019 14:24
@mattrco
Copy link
Contributor Author

mattrco commented Jun 5, 2019

Rebased this against master and updated the import path against tsdb 0.8.0 👍

@mattrco mattrco force-pushed the mattrco/propagate-retries branch from f89f92f to a658dbf Compare June 5, 2019 15:02
Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfect to me, LGTM thanks!

Dissmissing @povilasv review as all his suggestions were addressed and he is chilling on vacations (:

@bwplotka bwplotka merged commit ce1b22a into thanos-io:master Jun 6, 2019
@mattrco mattrco deleted the mattrco/propagate-retries branch June 6, 2019 15:18
FUSAKLA pushed a commit to FUSAKLA/thanos that referenced this pull request Jun 8, 2019
* Propagate retriable/haltable errors from compactor

Previously, a new error was created from all worker errors so the custom
type was not propagated.

If all errors returned are retriable, we assume it is safe to retry the compaction.

Since the compactor may still return a single error this logic is preserved.

* Update MultiError import for tsdb 0.8.0

* Update import path in tests
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants