Compactor is unable to properly compact blocks #4677
Comments
Also tried using
Here's the config I used:
So, while I haven't been able to fix my problem (hopefully just not yet), I've been investigating and have come to some conclusions:
While I do not know the reason out-of-order data appeared in Cortex blocks (whatever happens, Cortex is designed not to ingest anything that is out of order), I am leaning towards assuming that this might be related to a Prometheus race condition issue that Cortex inherited: #4573.

Some time ago Thanos introduced a feature to allow skipping out-of-order blocks upon compaction (thanos-io/thanos#4469), however this change hasn't made it to Cortex yet (#4453). The PR to introduce Thanos' out-of-order block skipping feature, #4453, was raised in late August 2021 (and has stayed open ever since). There was another PR in October 2021, #4505, that got merged and introduced that feature, however it hardcoded it to be disabled: https://github.com/cortexproject/cortex/blob/release-1.11/pkg/compactor/compactor.go#L660. I am assuming this might be because the code is only partially ready for this feature to be enabled, but since I was unlucky in finding this out on my own or in getting responses in the CNCF Slack (https://cloud-native.slack.com/archives/CCYDASBLP/p1648214149043299), I've created a separate issue to hopefully get answers: #4692.

I've also been comparing the log sequences for a proper compaction and one that fails, so here we go. The proper one:
The "out of order error" one:
I don't know why it seems to loop three times before failing. Anyway, in my case the bad block seems to store about 50 minutes of metrics. If it comes to that, I think removing it would be OK; however, as of now I'm trying to meddle with the Cortex code to force it to go through all blocks in GCS and verify them for any errors that would prevent compaction. No point in deleting one block if there are 50 more that are bad / out of order. Fingers crossed.
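For illustration only, here is a rough sketch of what such a verification pass could look like, assuming the blocks have first been downloaded locally from GCS (e.g. with gsutil) and using the Prometheus tsdb packages from a 2022-era release directly. The checkBlock helper and its structure are hypothetical, not anything Cortex itself ships, and querier/iterator signatures have changed in later Prometheus versions, so treat this as a starting point:

```go
package main

import (
	"fmt"
	"math"
	"os"

	"github.com/go-kit/log"
	"github.com/prometheus/prometheus/model/labels"
	"github.com/prometheus/prometheus/tsdb"
	"github.com/prometheus/prometheus/tsdb/chunkenc"
)

// checkBlock scans every series in a locally downloaded TSDB block directory
// and reports samples whose timestamps are not strictly increasing.
func checkBlock(dir string) error {
	logger := log.NewLogfmtLogger(os.Stderr)

	// Open the block read-only (the directory containing meta.json, index, chunks/).
	b, err := tsdb.OpenBlock(logger, dir, chunkenc.NewPool())
	if err != nil {
		return err
	}
	defer b.Close()

	// Build a querier over the block's full time range.
	q, err := tsdb.NewBlockQuerier(b, math.MinInt64, math.MaxInt64)
	if err != nil {
		return err
	}
	defer q.Close()

	// Select all series (every series has a non-empty __name__ label).
	ss := q.Select(false, nil, labels.MustNewMatcher(labels.MatchRegexp, "__name__", ".+"))
	for ss.Next() {
		series := ss.At()
		it := series.Iterator()
		prev := int64(math.MinInt64)
		for it.Next() {
			t, _ := it.At()
			if t <= prev {
				fmt.Printf("block %s: out-of-order sample in %s: %d <= %d\n", dir, series.Labels(), t, prev)
			}
			prev = t
		}
		if err := it.Err(); err != nil {
			return err
		}
	}
	return ss.Err()
}

func main() {
	// Pass one or more block directories on the command line.
	for _, dir := range os.Args[1:] {
		if err := checkBlock(dir); err != nil {
			fmt.Fprintf(os.Stderr, "checking %s: %v\n", dir, err)
		}
	}
}
```

Pointing something like this at every block directory pulled from the bucket should surface any series whose samples are out of order, which is the condition the compactor is tripping over.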
It's because of that:
Closed by #4707. Kudos to @alanprot and @alvinlin123 <3
Describe the bug
I'm running two config-identical Cortex clusters, let's say: prod & nonprod.
Nonprod looks fine.
In prod it seems that the compactor is unable to properly compact blocks, and this leads to prod having ~11,000 blocks while nonprod has ~600 (the volume of data alone is not 20x bigger on prod, so this is unexpected).
Having 11k blocks causes problems with store-gateway pods, which tend to take a lot of time to load blocks, and until all blocks are loaded the cluster does not work great.
Why am I assuming that compacting does not work for prod? It seems that upon successful compaction there should be log entries such as "compacted blocks" and "marking compacted block for deletion", etc. There are none, only endless entries like the ones above. Also, I'm running various dashboards, e.g. https://github.com/monitoring-mixins/website/blob/master/assets/cortex/dashboards/cortex-compactor-resources.json, which shows literally no compacted blocks for prod (and some for nonprod). Sharing my logs below.
I am aware that there are at least several issues that could be causing the compactor not to work, e.g. #4453 or #3569, but I'd very much welcome any hints that could allow me to unblock compacting, as the current volume of blocks makes the cluster prone to not working properly (which is not great for production usage, obviously).
Expected behavior
Compacting works, and the number of blocks in prod is not ~20x the number of blocks in nonprod (more like ~3-4x at most).
Environment:
K8s in GKE v1.21, deployed via the official Cortex Helm chart v1.4.0
Storage Engine
Blocks
Additional Context
Two sets of logs are here: