[Compactor] Detect the incomplete uploaded blocks and exclude them from compaction #6328
Comments
I've personally seen this problem come up time and time again. But this sounds like a bug in |
Hi @GiedriusS, sorry for the long-overdue reply 😅 It has kept happening since I raised the issue here. However, I increasingly suspect that it has something to do with our in-house Ceph cluster.
Do you mean that you've personally encountered this issue in your own work, or that you've seen it reported repeatedly? If it's the former, do you also use some sort of in-house storage solution? Unfortunately, I don't have a way to reproduce this consistently because I don't fully understand the conditions that trigger it. We recently had an outage of the Ceph cluster, and it led to corrupted chunks too. My point is that the current safeguards don't seem to be sufficient, especially when using an in-house storage solution that might not be as reliable as AWS S3. The compactor should have a data integrity check and skip corrupted blocks. |
Can confirm that we see similar problems with an in-house MinIO storage that has trouble syncing between replicas due to DNS lookup timeouts. |
Also experiencing this on AWS S3 (which presumably doesn't count as an in-house storage solution). |
This would help a lot! If they are excluded from compaction, would they still be deleted when retention comes, or would they still have to be manually cleaned up? |
Is your proposal related to a problem?
As issue #5978 mentioned, the compactor currently halts with an error when it tries to compact an incomplete block (e.g. only `meta.json` or `index` is uploaded, but not the `chunks` folder). It seems the safeguard here https://github.com/thanos-io/thanos/blob/main/pkg/block/block.go#L156 doesn't really guarantee that `partial upload == missing or corrupted meta.json`. #5859 suggests the same.
Although issue #5978 is closed, the solution is deleting bad blocks manually, which is arguably toil and also error-prone.
Describe the solution you'd like
We would like to propose that the compactor extend its detection of partially uploaded blocks. The assumption that "the presence of meta.json means a complete upload" has proven not to always hold. Simply checking `meta.json` doesn't guarantee the block is intact; a more comprehensive approach is needed to cover the cases where the `chunks` are missing or partially uploaded.
It can be done when collecting the block health stats:
thanos/pkg/compact/compact.go
Lines 1035 to 1059 in a1ec4d5
Or in `BestEffortCleanAbortedPartialUploads`:
thanos/cmd/thanos/compact.go
Lines 404 to 419 in a1ec4d5