receive: periodic huge spikes in latency when blocks are cut #7913
Comments
Logs from receiver during block cutting: Logs
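For anyone trying to line the logs up with the spikes, a quick way to pull just the block-cut window out of a receiver pod's logs (namespace and pod names below are placeholders for your own):

```sh
# Prometheus TSDB logs the block cut and head truncation; grep the receiver's
# logs for those messages around the spike window:
kubectl -n monitoring logs thanos-receive-0 --since=6h \
  | grep -iE 'write block|head GC completed|WAL checkpoint'
```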
I found a similar discussion regarding autoscaling best practices with Thanos in a router-ingestor configuration, which might be relevant to your issue. The discussion is still open and can be found here [1]. Regarding your issue with latency spikes during block cutting, here are a few suggestions that might help:
These adjustments might help mitigate the latency spikes and improve the overall performance of your Thanos setup.
prometheus/prometheus#15374 should help, I think. I have been a bit busy recently so I didn't have time to dig into it 100%.
This is the one that should help: prometheus/prometheus#15242. Can you make sure it's in your codebase?
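One rough way to check, assuming you build or deploy an unmodified upstream Thanos tag (the tag below is just the version from the report):

```sh
# In a checkout of the Thanos version you deploy:
git checkout v0.36.1
# See which prometheus/prometheus commit or pseudo-version is vendored:
grep 'github.com/prometheus/prometheus' go.mod
# Compare that pinned commit/version against the merge commit of
# prometheus/prometheus#15242 to tell whether the fix is included.
```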
Thanos, Prometheus and Golang version used: 0.36.1
Object Storage Provider: Azure
What happened: We recently migrated from hard tenancy to soft tenancy, going from a single tenant per receive StatefulSet to around 14 tenants in one large StatefulSet. Since migrating, we observe huge spikes in latency on requests from the routing receivers to the receivers:
This manifests as a large increase in 503 errors, forcing clients to retry requests, and a noticeable drop in throughput for about 3-4 minutes:
The times in the graphs are UTC. The spikes at 07:00, 09:00 and 11:00 are when blocks are cut to persistent disk, and each one degrades ingestion for around 5 minutes. It looks like we hit some 5s gRPC timeouts (presumably between the routing receivers and the receivers), and I haven't found a way to configure or otherwise increase that timeout (see the sketch below).
I am not sure what happens at 08:00 and 12:00 (though the effects seem to be much smaller; note the graphs are logarithmic).
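On the 5s timeout: in the Thanos versions I have looked at, the routing receiver's per-forward timeout is a hidden receive flag rather than a documented one, so treat the flag name below as an assumption and confirm it against the source of the exact version you run:

```sh
# In a checkout of the Thanos tag you deploy (repo path and tag are placeholders):
git -C thanos checkout v0.36.1
grep -Rn 'forward-timeout' cmd/thanos/receive.go
# If the flag is registered there, raising it on the routing receivers would
# look something like this (the 30s value is arbitrary, for illustration):
#   thanos receive ... --receive-forward-timeout=30s
```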
What you expected to happen: as before the migration, no disruption to ingestion when blocks are cut.
How to reproduce it (as minimally and precisely as possible): I have only seen this issue at production load. I will detail our setup below as best I can.
Environment: kubernetes, AKS
Routing receivers
10 replicas
4 vCPU (no limits)
16/32 GiB Memory req/lim
Receivers
10 replicas
10 vCPU (no limits, no throttling)
128/256 GiB Memory req/lim
Hashring controller
We are not using Istio or any other proxy in-between routing receivers and receivers.
Full logs of relevant components: too big; will post in a separate comment.
Anything else we need to know: I can provide profiles or anything else as needed.
Each receiver has about 26 million active head series. I have tried scaling out to more, smaller receivers (up to 60 replicas), but I still saw the same high-latency issue, and the scale-out also had a negative impact on query performance.
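A sketch of how to check the per-replica and per-tenant head-series spread, assuming the receivers are scraped by a Prometheus reachable at $PROM and carry job/pod labels like the ones below (adjust the label names to your scrape config):

```sh
# Active head series per receive replica (job/pod label names are assumptions
# from a typical Kubernetes scrape config):
promtool query instant "$PROM" \
  'sum by (pod) (prometheus_tsdb_head_series{job="thanos-receive"})'

# Per-tenant spread on a single replica; receive wraps its TSDB metrics with a
# tenant label in the versions I have seen, but confirm against /metrics:
promtool query instant "$PROM" \
  'sum by (tenant) (prometheus_tsdb_head_series{pod="thanos-receive-0"})'
```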