thanos compactor crashes with "write compaction: chunk 8 not found: reference sequence 0 out of range" #1300
Sounds like Thanos Compact crashed midway through the upload and you ended up with inconsistent data. Nowadays Thanos Compact removes old, malformed data that is older than the maximum of the consistency delay and the minimum age for removal (30 minutes). Does that not work for you for some reason, @bjakubski?
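(For reference, a minimal sketch of where that delay is configured on the compactor; flag names are taken from `thanos compact --help`, the paths are placeholders, and on older releases the delay flag was called `--sync-delay`.)

```sh
# A hedged sketch, not taken from this cluster: the data dir and bucket config path are placeholders.
# Blocks younger than the delay are skipped from compaction, which is what is supposed to keep
# half-uploaded blocks out of the way; malformed blocks older than it get removed.
thanos compact \
  --data-dir=/var/thanos/compact \
  --objstore.config-file=/etc/thanos/objstore.yml \
  --consistency-delay=30m \
  --wait
```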
Also, Thanos Compact now has metrics about failed compactions (in 0.6.0, which is soon to be released), so you can use that :P
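(A rough way to use those, assuming the compactor's default HTTP port 10902; the metric names below are assumptions based on recent versions, so double-check them against your build's /metrics page.)

```sh
# Watch for a halted compactor or repeatedly failing compaction groups.
# thanos_compact_halted and thanos_compact_group_compactions_failures_total are assumed
# names; verify them against your compactor's /metrics output before alerting on them.
curl -s http://localhost:10902/metrics \
  | grep -E 'thanos_compact_halted|thanos_compact_group_compactions_failures_total'
```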
I've just encountered the same problem on another object in the bucket, but this time it was created by the sidecar. I do not see any message in the logs about the sidecar ever uploading it. The object is older (more than a week old), but according to GCP Storage it was created just yesterday; I see related messages in the Prometheus logs. The object in the bucket seems to have been created at 2019-07-10 21:25, which is just after one of the thanos sidecars successfully connected to Prometheus (after a pod restart). In its logs I can see it uploading various objects, but nothing about this particular ULID.

Note: I'm running an unusual setup with thanos-sidecar: Prometheus compaction is disabled, but retention is quite long (15d IIRC). It generally seems to work fine, although I wonder if it might cause problems like the one I have here. What is the consistency delay? Should I have expected the compactor to ignore/remove this object from the bucket?
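(For reference, a sidecar setup with local compaction disabled usually looks roughly like the sketch below; the flag values are illustrative, not copied from this cluster.)

```sh
# Pinning min/max block duration to the same value effectively disables local compaction,
# so the sidecar uploads the raw 2h blocks; the long retention only controls local disk usage.
prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.min-block-duration=2h \
  --storage.tsdb.max-block-duration=2h \
  --storage.tsdb.retention.time=15d
```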
I also hit the same issue; the error log is:
I'm seeing the same error for blocks uploaded with thanos-compactor 0.6.0 (and then processed by 0.6.0) myself. Backend storage is a Ceph cluster via the Swift API.
thanos-compactor has uploaded a compacted block yesterday:
and now it's choking on that block:
First, I'd expect it to survive broken blocks, but what's more concerning is that the block had been uploaded successfully before (unless those warning messages are not just cosmetic and there is indeed something wrong). What's uploaded:
and meta.json:

```json
{
"ulid": "01DGZ0PDS2MFX25P9CKG1C1TA3",
"minTime": 1561622400000,
"maxTime": 1561651200000,
"stats": {
"numSamples": 271799367,
"numSeries": 141915,
"numChunks": 2265057
},
"compaction": {
"level": 2,
"sources": [
"01DEC9F6B0BVQHFG7BZDS75N0V",
"01DECGAXK0CM9HC2QPZ7MMH8M0",
"01DECQ6MV0K0FR00HHY9906178",
"01DECY2C2ZX4F6N9GKHKM2PC61"
],
"parents": [
{
"ulid": "01DEC9F6B0BVQHFG7BZDS75N0V",
"minTime": 1561622400000,
"maxTime": 1561629600000
},
{
"ulid": "01DECGAXK0CM9HC2QPZ7MMH8M0",
"minTime": 1561629600000,
"maxTime": 1561636800000
},
{
"ulid": "01DECQ6MV0K0FR00HHY9906178",
"minTime": 1561636800000,
"maxTime": 1561644000000
},
{
"ulid": "01DECY2C2ZX4F6N9GKHKM2PC61",
"minTime": 1561644000000,
"maxTime": 1561651200000
}
]
},
"version": 1,
"thanos": {
"labels": {
"datacenter": "p24",
"environment": "internal",
"replica": "02"
},
"downsample": {
"resolution": 0
},
"source": "compactor"
}
}
```
This also breaks when running with … I was expecting it to fail, but not quit; yet it did crashloop with that error. After some time I deleted the pod and it seems to be running fine now... maybe because of that 30m mentioned above?
OK, that was quick: it just crashlooped again.
Are you maybe hitting this? #1331 Essentially a partial upload and empty or missing chunk files in the block.
You have … Thanks all for reporting; let's investigate this upload path with the minio client. Essentially this might come down to some assumption: #1331 (comment)
Let's continue on one of the issues; this looks like a duplicate of #1331. EDIT: actually this one was first, but I think it does not matter much; let's continue on one of them (: Hope this is ok with you guys!
not sure, I'm using GCS... that issue seems to be specific to S3...
hm, maybe we are wrong and GCS is affected as well. What about the other question about …?
sorry, didn't see that question... So, the log says:
It seems the problem is …
Is there anything I can do as a workaround so compact works again?
You should delete the blocks which have duplicated data and leave only one copy. It's up to you to decide which one it is (: It sounds like you need to delete the one you've mentioned, but please double-check.
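(Since this bucket is on GCS, a sketch of that workaround with gsutil; the bucket name and ULID below are placeholders, and it is worth copying the block somewhere first.)

```sh
# Keep a local copy of the suspect block, then remove it from the bucket so the
# compactor stops tripping over it. Replace the bucket name and ULID with yours.
gsutil -m cp -r gs://my-thanos-bucket/01ARZ3NDEKTSV4RRFFQ69G5FAV ./block-backup/
gsutil -m rm -r gs://my-thanos-bucket/01ARZ3NDEKTSV4RRFFQ69G5FAV
```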
yeah, did that and it seems to be running now
Hi, I am having this issue using thanos 0.8.1. I have tried moving the directories for the blocks it complains about out of the bucket, but then every time I run the compactor it just finds some more to be sad about :-( This is crashing the compactor on 0.8.1 even with … Any help to further debug this would be appreciated! (@bwplotka maybe?)
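(One way to find all of the offenders up front instead of one crash at a time: a sketch assuming a GCS bucket and gsutil, as earlier in the thread; the bucket name is a placeholder and the check is intentionally crude.)

```sh
# List block ULID prefixes and flag any that have no first chunk file, i.e. a
# meta.json/index without chunks. Adapt the listing command to your object store.
for block in $(gsutil ls gs://my-thanos-bucket/ | grep -E '/[0-9A-Z]{26}/$'); do
  if ! gsutil -q stat "${block}chunks/000001"; then
    echo "no chunk files under ${block}"
  fi
done
```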
Hi, we started a doc where we discussed another potential cause of this, which is a false assumption around partial block uploads: https://docs.google.com/document/d/1QvHt9NXRvmdzy51s4_G21UW00QwfL9BfqBDegcqNvg0/edit#
Reopening as it still occurs. The doc mentioned above describes a potential (though quite unlikely) issue of the compactor removing a block while it is still being uploaded for some reason (long upload / old block upload). We are working on a fix for that particular case, but more info on why the chunk file is missing from the block would be useful.
To complement the last comment:
Compactor logs:
Meanwhile, in the bucket (meta.json is indeed the only file there for this block):
So it seems the compactor itself failed to upload this block, rather than merely crashing against a block uploaded by something else.
This replaces the 4 inconsistent meta.json sync places in other components. Fixes: #1335 Fixes: #1919 Fixes: #1300
* One place for the sync logic for both compactor and store.
* Corrupted disk cache for meta.json is handled gracefully.
* Blocks without meta.json are handled properly in all compactor phases.
* Synchronize was not taking into account deletion by removing meta.json.
* Prepare for future implementation of https://thanos.io/proposals/201901-read-write-operations-bucket.md/
* Better observability for the synchronize process.
* More logs for the store startup process.
* Remove Compactor Syncer.
* Added metric for partialUploadAttempt deletions.
* More tests.
TODO in a separate PR:
* More observability for index-cache loading / adding time.
Signed-off-by: Bartlomiej Plotka <[email protected]>
Fixes: #1335 Fixes: #1919 Fixes: #1300
* Clean-up of meta files now starts only if the block being uploaded is older than 2 days (only a mitigation).
* Blocks without meta.json are handled properly in all compactor phases.
* Prepare for future implementation of https://thanos.io/proposals/201901-read-write-operations-bucket.md/
* Added metric for partialUploadAttempt deletions and delayed it.
* More tests.
Signed-off-by: Bartlomiej Plotka <[email protected]>
We merged a fix for the case of #1919, which could potentially cause such a chunk file to be missing.
Thanos, Prometheus and Golang version used
thanos, version 0.5.0 (branch: HEAD, revision: 72820b3)
build user: circleci@eeac5eb36061
build date: 20190606-10:53:12
go version: go1.12.5
What happened
thanos compactor crashes with "write compaction: chunk 8 not found: reference sequence 0 out of range"
What you expected to happen
Should work fine :-)
How to reproduce it (as minimally and precisely as possible):
Not sure :-/
Full logs to relevant components
Out of the list of objects dumped along with the error message, I've found one without chunks.
meta.json contents:
It is apparently an object created by the compactor.
We'd been running the compactor for some time, but after a while (due to lack of local disk storage) it was crashing constantly. After an extended period of crashing, storage was added and the compactor was able to go further, until it encountered the problem described here.
I guess the important part is how such an object ended up in the bucket, although I wonder if it is possible for thanos to ignore such objects and keep processing the rest of the data (exposing data about bad objects in metrics)?
I'd guess that it was somehow created during the constant crashes we had earlier, but I have nothing to support that.
Anything else we need to know
#688 describes a similar issue, although it is about a much older thanos version than we use here.
We've been running 0.5 and 0.4 before that. I'm not sure, but it is possible that 0.3.2 (compactor) was used at the beginning.