[compact] Increase default consistency delay to 24h #1901
Firstly, thanks for adding sharding to compact/store using relabeling config - I took part in a previous discussion on this and it's great to have the ability to do it now 🎉

We've run into an issue which caused some data loss. We have very large blocks that can take hours to upload to S3. When running multiple shards, we found that the malformed block check caused erroneous deletions:

```
• shard A compacts blocks X and Y, creating block Z on disk
• it starts to upload Z, starting with chunk 1 and going sequentially
• shard B scans the bucket, finds partially uploaded block Z
• it thinks Z is malformed because meta.json isn't there (it gets uploaded last, after all chunks)
• shard B deletes block Z
• meanwhile, shard A is still uploading, so you end up with only some of the later chunks uploaded, plus meta.json
```

As a quick fix, raising the consistency delay makes things safer. But I wonder if we should also disable the malformed block deletion by default (and log a warning) if a relabeling config exists, so that the user can manually inspect and delete those blocks.

(Another solution would be to use different buckets, but we'd need to add relabeling config support to sidecar.)
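To make the quick fix concrete, here is a minimal sketch of the guard the consistency delay provides. It is not the Thanos implementation: `is_safe_to_delete_partial`, its parameters, and the 30-minute value are hypothetical names and numbers chosen for illustration. The idea is only that a block without meta.json is treated as deletable once it is older than the delay, so the delay has to exceed the longest expected upload.

```python
from datetime import datetime, timedelta, timezone


def is_safe_to_delete_partial(
    has_meta: bool,
    upload_started: datetime,
    now: datetime,
    consistency_delay: timedelta,
) -> bool:
    """Hypothetical check: may a block be deleted as a failed partial upload?

    A block that already has meta.json is complete and is never deleted here.
    A block without meta.json is deletable only once it is at least as old
    as the consistency delay.
    """
    if has_meta:
        return False
    return now - upload_started >= consistency_delay


# Shard A has been uploading block Z for 3 hours; meta.json is not there yet.
now = datetime(2020, 1, 1, 12, 0, tzinfo=timezone.utc)
upload_started = now - timedelta(hours=3)

# With a short delay (30 minutes, an illustrative value) shard B would delete Z mid-upload.
short_delay_deletes = is_safe_to_delete_partial(
    False, upload_started, now, timedelta(minutes=30)
)

# With a 24h delay the in-flight upload is left alone.
long_delay_deletes = is_safe_to_delete_partial(
    False, upload_started, now, timedelta(hours=24)
)

print(short_delay_deletes, long_delay_deletes)
```

A longer delay only narrows the window. An upload that takes longer than the delay would still hit the same race, which is why disabling the deletion when a relabeling config exists is suggested as well.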
Thanks for this! I think ConsistencyDelay should stay shorter; otherwise the small 2h blocks don't get compacted quickly enough. Unfortunately, we shard on meta.json, which is missing while an upload is pending, so putting blocks in different buckets would work, but I think it's quite inconvenient. We need to make it work with what we have now. Ideas:
We are discussing this issue, and many related ones, and would like to come up with something that prevents these racing deletions in general. Both of the changes mentioned by Bartek are pretty low cost and likely high impact changes that will fix a good proportion of the races.

Thanks, I realise this doesn't behave precisely how I intended. I'll add a flag for

@mattrco are you on CNCF slack maybe? See this working document we are working on before you jump into implementation: https://docs.google.com/document/d/1QvHt9NXRvmdzy51s4_G21UW00QwfL9BfqBDegcqNvg0/edit

Great, let me jump on slack 👀

#1937 fixes the issue.
If we're happy to go ahead I'll add a changelog entry.