[compact] Increase default consistency delay to 24h #1901

mattrco · 2019-12-17T13:04:31Z

Firstly, thanks for adding sharding to compact/store using relabeling config - I took part in a previous discussion on this and it's great to have the ability to do it now 🎉

We've run into an issue which caused some data loss. We have very large blocks that can take hours to upload to S3. When running multiple shards, we found that the malformed block check caused erroneous deletions:

• shard A compacts blocks X and Y, creating block Z on disk
• it starts to upload Z, starting with chunk 1 and going sequentially
• shard B scans the bucket, finds partially uploaded block Z
• it thinks Z is malformed because meta.json isn’t there (it gets uploaded last, after all chunks)
• shard B deletes block Z
• meanwhile, shard A is still uploading, so you end up with some of the latter blocks uploaded, and meta.json

As a quick fix, upping the consistency deadline makes things safer. But I wonder if we should also disable the malformed block deletion by default (and log a warning) if a relabeling config exists, so that the user can manually inspect and delete those blocks.

If we're happy to go ahead I'll add a changelog entry.

Firstly, thanks for adding sharding to compact/store using relabeling config - I took part in a previous discussion on this and it's great to have the ability to do it now 🎉 We've run into an issue which caused some data loss. We have very large blocks that can take hours to upload to S3. When running multiple shards, we found that the malformed block check caused erroneous deletions: ``` • shard A compacts blocks X and Y, creating block Z on disk • it starts to upload Z, starting with chunk 1 and going sequentially • shard B scans the bucket, finds partially uploaded block Z • it thinks Z is malformed because meta.json isn’t there (it gets uploaded last, after all chunks) • shard B deletes block Z • meanwhile, shard A is still uploading, so you end up with some of the latter blocks uploaded, and meta.json ``` As a quick fix, upping the consistency deadline makes things safer. But I wonder if we should also disable the malformed block deletion by default (and log a warning) if a relabeling config exists, so that the user can manually inspect and delete those blocks.

mattrco · 2019-12-17T13:09:38Z

(Another solution would be to use different buckets, but we'd need to add relabeling config support to sidecar.)

bwplotka · 2019-12-17T13:39:05Z

Thanks for this!

ConsistencyDelay I think should stay shorter. Otherwise, you don't compact quick enough the small 2h blocks.
We are touching the idea of upload timeout, but it looks like it cannot be hardcoded/configured with single value, especially on compactor with bigger blocks: It can vary a lot.

Unfortunately, we shard on meta.json which is missing due to pending upload, so putting blocks in different buckets would work but I think it's quite inconvenient. We need to make it work with what we have now.

Ideas:

For this particular problem, I think increasing just minimumAge of removal on partial upload will do the work.
Make upload more resistant and check if all files are there afterward.

squat · 2019-12-17T13:47:57Z

We are discussing this issue, and many related ones, and would like to come up with something that prevents these racing deletions in general. Both of the changes mentioned by Bartek are pretty low cost and likely high impact changes that will fix a good proportion of the races

mattrco · 2019-12-17T14:05:53Z

For this particular problem, I think increasing just minimumAge of removal on partial upload will do the work.

Thanks, I realise this doesn't behave precisely how I intended. I'll add a flag for minimumAge so it can be configured 👍

bwplotka · 2019-12-17T14:09:46Z

@mattrco are you on CNCF slack maybe? See this working document we are working on before you jump into implementation: https://docs.google.com/document/d/1QvHt9NXRvmdzy51s4_G21UW00QwfL9BfqBDegcqNvg0/edit

mattrco · 2019-12-17T14:14:08Z

Great, let me jump on slack 👀

bwplotka · 2020-01-06T18:11:28Z

#1937 fixes the issue.

bwplotka mentioned this pull request Dec 20, 2019

compactor: Rare race of sharded compactors / slow upload causing data loss. #1919

Closed

bwplotka closed this Jan 6, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[compact] Increase default consistency delay to 24h #1901

[compact] Increase default consistency delay to 24h #1901

mattrco commented Dec 17, 2019

mattrco commented Dec 17, 2019 •

edited

Loading

bwplotka commented Dec 17, 2019 •

edited

Loading

squat commented Dec 17, 2019

mattrco commented Dec 17, 2019 •

edited

Loading

bwplotka commented Dec 17, 2019

mattrco commented Dec 17, 2019

bwplotka commented Jan 6, 2020

[compact] Increase default consistency delay to 24h #1901

[compact] Increase default consistency delay to 24h #1901

Conversation

mattrco commented Dec 17, 2019

mattrco commented Dec 17, 2019 • edited Loading

bwplotka commented Dec 17, 2019 • edited Loading

squat commented Dec 17, 2019

mattrco commented Dec 17, 2019 • edited Loading

bwplotka commented Dec 17, 2019

mattrco commented Dec 17, 2019

bwplotka commented Jan 6, 2020

mattrco commented Dec 17, 2019 •

edited

Loading

bwplotka commented Dec 17, 2019 •

edited

Loading

mattrco commented Dec 17, 2019 •

edited

Loading