S3/GCS: Upload/Delete inconsistency (missing chunk file) #1331
Comments
While investigating this we added the following detection mechanisms to a fork used for debugging: 0robustus1/thanos@orig-v0.4.0...0robustus1:partially-written-blocks
That's awesome! I wanted to do something like that myself a few months ago; my initial idea was to only check that files in … Either way, yes, as you can see we can always improve the error checking around this, since it can definitely happen: most remote object storage providers use multi-part uploads, and sometimes uploading a file fails. However, the newest versions of Thanos Compact are supposed to delete corrupt blocks which are older than some time (#1053). Have you tried that mechanism, and does it work?
I'll prepare a pull request. Regarding thanos-compact: I think this wasn't applicable in our scenario, as "our" malformed blocks had the meta.json file present, and the mentioned PR only removes blocks if they do not have a meta.json file, correct?
How is that possible? We write meta.json at the end 🤔 What kind of unavailability did you see? On the read path or the write path for S3?
Also, doing fixes for 0.4.0 is risky, as the codebase has improved since that version.
During that incident we saw it on both writes and reads ("exists" and "upload" operations, for example). I agree it's probably not best to add this as a fix for 0.4.0; I would adjust the PR so that it fits current master rather than 0.4.0.
see thanos-io#1331
This enables us to identify partially written blocks that exhibit these issues:
* missing index
* chunks referenced in the index that aren't present in the chunks/ directory of the block
* chunks in the chunks/ subdirectory of the block that aren't referenced from the index (this should not be possible)
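For illustration only, a much-simplified Go version of such a check might look like the sketch below. Unlike the fork linked above, it does not open the index, so it only catches a missing index file or gaps in the chunk segment sequence:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// verifyBlockDir is a shallow sanity check of a TSDB block directory:
// it requires an index file and a gapless sequence of chunk segment files
// (000001, 000002, ...). It does not open the index, so it cannot detect
// chunks that are referenced but missing from a truncated segment.
func verifyBlockDir(dir string) error {
	if _, err := os.Stat(filepath.Join(dir, "index")); err != nil {
		return fmt.Errorf("missing index: %w", err)
	}
	entries, err := os.ReadDir(filepath.Join(dir, "chunks")) // sorted by name
	if err != nil {
		return fmt.Errorf("missing chunks directory: %w", err)
	}
	if len(entries) == 0 {
		return fmt.Errorf("chunks directory is empty")
	}
	for i, e := range entries {
		if want := fmt.Sprintf("%06d", i+1); e.Name() != want {
			return fmt.Errorf("unexpected chunk file %q, want %q", e.Name(), want)
		}
	}
	return nil
}

func main() {
	for _, dir := range os.Args[1:] {
		if err := verifyBlockDir(dir); err != nil {
			fmt.Printf("%s: BAD: %v\n", dir, err)
			continue
		}
		fmt.Printf("%s: ok\n", dir)
	}
}
```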
Potentially related to #1335
This is rather unlikely, as we fail the upload and clean everything up on any upload failure. Since we upload meta.json at the end, if we spot a block without a meta.json that is older than X, we remove it, assuming it is a malformed one.
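As a rough sketch of that rule (not the actual compactor code; the 30-minute grace period below is just a placeholder for the "X" above):

```go
package main

import (
	"fmt"
	"time"
)

// partialUploadGrace stands in for the "X" above; the real value is part of
// the compactor configuration, not this constant.
const partialUploadGrace = 30 * time.Minute

// isAbandonedPartialUpload reports whether a block should be garbage-collected:
// no meta.json means the upload never finished, and the grace period covers
// uploads that are merely slow or not yet visible in an eventually consistent bucket.
func isAbandonedPartialUpload(hasMeta bool, uploadedAt, now time.Time) bool {
	return !hasMeta && now.Sub(uploadedAt) > partialUploadGrace
}

func main() {
	now := time.Now()
	fmt.Println(isAbandonedPartialUpload(false, now.Add(-time.Hour), now)) // true: no meta.json, past grace period
	fmt.Println(isAbandonedPartialUpload(true, now.Add(-time.Hour), now))  // false: meta.json present, block usable
}
```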
Is it strongly consistent? E.g. if you see a missing item (e.g. index only), is it only temporary? Or maybe are writes asynchronous?
Let me answer the questions posted in #1370 here:
The Prometheus filesystem is a regular ext4 filesystem on a mounted PersistentVolume (in Kubernetes) backed by vSphere.
The S3 object storage is a Dell ECS cluster addressed via the S3 protocol. According to the documentation it has strong consistency. In any case, such a delay would have had to be longer than 48 hours (the retention of the Prometheus at the time), because the block was still only partially written even days later.
Regarding your questions posted above:
It does sound rather unlikely, but it is what the following query reported:
Dell ECS claims the following here: "Multi-site read/writes with strong consistency simplifies application development." And the missing item wasn't temporary; it never appeared. The retention for the Prometheus was set to 48 hours.
We recently upgraded our Thanos components, and a block's chunks directory was absent. I can't reproduce this issue, but I'm mentioning it here since my issue could be similar.
It's an AWS S3 bucket; there were 5-6 folders that I had to delete to recover the compactor pod.
Cool. So technically it should be impossible for the chunk folder upload to fail while meta.json gets uploaded. This means that what you observe is a severe bug. We can check our … For sure it does not happen on GCE. I think I like your #1370 here, @0robustus1 - we can learn more in those cases. The sidecar could possibly scan the supposedly properly uploaded block to double-check that all files are there. With this unknown on the interface, even that would not be enough: a file might be there but only partially uploaded. But it sounds really bad that we have to do that given the success response. AC:
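Below is a rough sketch of what such a post-upload double-check could look like (an illustration only, against a hypothetical one-method interface, not the real Thanos objstore API); it compares object sizes rather than mere existence, to also catch partially uploaded files:

```go
package blockcheck

import (
	"context"
	"fmt"
	"os"
	"path/filepath"
)

// sizer is a hypothetical, minimal view of the object store used only for this
// sketch; it is not the Thanos objstore.Bucket interface.
type sizer interface {
	// ObjectSize returns the stored size of an object, or an error if it is missing.
	ObjectSize(ctx context.Context, name string) (int64, error)
}

// verifyUpload walks the local block directory and checks that every file exists
// remotely with the expected size. A plain existence check is not enough, because
// a multi-part upload can leave behind an object shorter than the source file.
func verifyUpload(ctx context.Context, bkt sizer, localDir, remotePrefix string) error {
	return filepath.Walk(localDir, func(path string, info os.FileInfo, err error) error {
		if err != nil || info.IsDir() {
			return err
		}
		rel, err := filepath.Rel(localDir, path)
		if err != nil {
			return err
		}
		name := remotePrefix + "/" + filepath.ToSlash(rel)
		size, err := bkt.ObjectSize(ctx, name)
		if err != nil {
			return fmt.Errorf("object %s missing after upload: %w", name, err)
		}
		if size != info.Size() {
			return fmt.Errorf("object %s truncated: got %d bytes, want %d", name, size, info.Size())
		}
		return nil
	})
}
```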
@infa-ddeore you are right
Here are the compactor error log and the AWS S3 folder contents:
Well, we are more interested in the sidecar logs.
The sidecar logs are gone :-( I will capture them next time if the issue occurs again.
In fact, it could be a failed compactor upload as well -> but those use the same code path, so a failed upload from compact would look the same. Since you don't have any, this confirms our theory mentioned here: #1331 (comment)
@infa-ddeore did you find any workaround to make compaction work again?
The compactor worked after deleting the blocks from S3; the same issue didn't appear again, so I couldn't gather more data. Here is a script you can use to check for blocks with missing files:
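The script itself isn't quoted above, so here is a rough Go sketch of an equivalent check (an illustration, assuming bucket keys of the form block-ULID/file): it reads a recursive object listing, one key per line on stdin, and flags blocks that have a meta.json but are missing the index or all chunk files.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// blockFiles records which of the interesting parts of a block were seen in the listing.
type blockFiles struct {
	meta, index, chunks bool
}

func main() {
	blocks := map[string]*blockFiles{}
	sc := bufio.NewScanner(os.Stdin)
	for sc.Scan() {
		// Expect keys of the form <block-ULID>/<file>, one per line.
		key := strings.TrimSpace(sc.Text())
		parts := strings.SplitN(key, "/", 2)
		if len(parts) != 2 {
			continue
		}
		b := blocks[parts[0]]
		if b == nil {
			b = &blockFiles{}
			blocks[parts[0]] = b
		}
		switch {
		case parts[1] == "meta.json":
			b.meta = true
		case parts[1] == "index":
			b.index = true
		case strings.HasPrefix(parts[1], "chunks/"):
			b.chunks = true
		}
	}
	if err := sc.Err(); err != nil {
		fmt.Fprintln(os.Stderr, "read error:", err)
		os.Exit(1)
	}
	for id, b := range blocks {
		// Blocks without meta.json are the compactor's problem; blocks that claim
		// to be complete (meta.json present) but lack data are the suspicious ones.
		if b.meta && (!b.index || !b.chunks) {
			fmt.Printf("%s: index present=%v, chunks present=%v\n", id, b.index, b.chunks)
		}
	}
}
```

You could feed it the key column of any recursive bucket listing produced by your S3 tooling.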
Got it - did you delete only the blocks it complained about, or everything?
The compactor gave the error mentioned in my comment above, and I deleted that particular block, which had the missing chunks. So to answer your question: I deleted only the blocks with missing files, not everything. You can run the script above to check for them.
Cool, yeah, I just deleted the affected block and it seems to be running now. Thanks :)
If you can provide more details about what was missing in the bad block folder, along with the sidecar logs that show information about that block, that would help us gather more information about this issue.
I don't have the logs anymore, but it was missing the entire chunks folder :( |
This won't be much help, but for the record I experienced this exact issue on one of our setups with 0.6.0 and recovered in the same way.
I got some more S3 folders with missing chunks (one is missing the index as well); the sidecar logs don't show any errors:
S3 has missing chunks, and one block is also missing its index:
@bwplotka do you want me to check anything more? One more observation: the pod restart time matches the S3 folder's timestamp, so it looks like something gets messed up during the restart.
Additional observations (maybe an unrelated issue) suggest temporary inconsistency (cc @SuperQ): https://cloud-native.slack.com/archives/CK5RSSC10/p1568459542001900
I think I found the bug, guys. The problem is most likely here: thanos/pkg/objstore/objstore.go, line 95, at 2c5f2cd
Thanos is resilient to partial uploads in most cases. This is based on the small meta.json file: if it is present and the block is more than X minutes old, we consider the block ready to be used. The delay is there for eventually consistent buckets. If there is no meta.json after X minutes, we assume it is a partial upload and the compactor removes the block. This works well because we always upload meta.json at the end. However, we don't do deletions in the proper order. In the linked code we delete in lexicographical order, which means the chunks go first, then the index, then meta.json. If we restart the compactor or sidecar in the middle of this, we end up with a choked compactor. Fixing this now. This also means that none of the blocks that were blocking the compactor were important, i.e. removing them should drop no metrics overall. Let me know if that makes sense (: And thanks for all the reports that helped us identify the problem - especially #1331 (comment)
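To make the ordering problem concrete, here is a rough Go sketch of the idea behind the fix (not the actual patch in #1525; the bucket client is reduced to a hypothetical one-method interface):

```go
package blockdelete

import (
	"context"
	"fmt"
	"path"
)

// deleter is a hypothetical one-method stand-in for the object storage client;
// it is not the real Thanos objstore API.
type deleter interface {
	Delete(ctx context.Context, name string) error
}

// deleteBlock removes every object of a block, but deletes meta.json first.
// If the process dies halfway through, the leftover objects form a block with
// no meta.json, which the compactor already treats as a partial upload and
// garbage-collects, instead of a "complete" block with a missing index or chunks.
func deleteBlock(ctx context.Context, d deleter, blockDir string, objects []string) error {
	meta := path.Join(blockDir, "meta.json")
	rest := make([]string, 0, len(objects))
	hasMeta := false
	for _, o := range objects {
		if o == meta {
			hasMeta = true
			continue
		}
		rest = append(rest, o)
	}
	if hasMeta {
		if err := d.Delete(ctx, meta); err != nil {
			return fmt.Errorf("delete %s: %w", meta, err)
		}
	}
	for _, o := range rest {
		if err := d.Delete(ctx, o); err != nil {
			return fmt.Errorf("delete %s: %w", o, err)
		}
	}
	return nil
}
```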
Fixes #1331. The problem we are fixing is explained in the linked issue. Signed-off-by: Bartek Plotka <[email protected]>
Fix: #1525
Thanos, Prometheus and Golang version used
What happened
We use an on-premise S3 store (configured in Thanos via type: S3 in the object store config) that experienced availability issues for a 3-15 hour period (3 hours of frequent connection issues, 15 hours of less frequent connection issues).
Multiple Thanos components (shipper/sidecar, compactor) experienced timeouts while awaiting responses from the S3 service.
This resulted in the shipper/sidecar not writing complete blocks.
We observed the following types of partially written blocks:
Those partially written (one might call them corrupted) blocks caused subsequent issues:
What you expected to happen
How to reproduce it (as minimally and precisely as possible):
We have been running into issues trying to build a minimal reproducible scenario.
It would seem that it should be enough to have blocks with the mentioned criteria:
When trying this, however, we ran into the situation that these blocks did not end up in the compaction plan (see here).
It seems that once a block is considered in the compaction plan and GatherIndexIssueStats is executed, the compactor will fail if the index file is not present. If chunks are missing, it will fail later in the Prometheus TSDB compaction code.
Partial logs to relevant components
Currently this is the only log we have. Should we run into the issue again, I'll make sure to attach more logs from the other components and cases.