thanos-query crashes with "concurrent map iteration and map write" #1272
Comments
This seems related to the UI and the execution of templates. Do you use any particular functionality of the UI? I wonder how to reproduce it.
Hi, I am interested in helping. Can someone assign this to me, please?
As far as I know traffic received by those crashing instances consists of:
Other than that, the Thanos UI is practically unused.
I believe this issue is due to map concurrent writes in |
The fix was merged to master, please try the new version.
@lx223 thanks for the fix!
I quickly deployed a version with the fix and have seen no crashes for one hour (usually there were a couple of them in a similar time range in this cluster).
Thanos, Prometheus and Golang version used
thanos, version 0.5.0 (branch: HEAD, revision: 72820b3)
build user: circleci@eeac5eb36061
build date: 20190606-10:53:12
go version: go1.12.5
What happened
In one of the k8s clusters where we run thanos-query, it crashes every couple of minutes with "fatal error: concurrent map iteration and map write" or "fatal error: concurrent map writes".
What you expected to happen
No crash :-)
How to reproduce it (as minimally and precisely as possible):
I have no idea. I didn't manage to find anything that triggers it. The same problem was observed in 0.4.0. I'm not sure about 0.3.0.
Thanos runs in a GCP GKE cluster; query is deployed via our own Helm chart. Crashing containers run:
The same deployment (differing only in selector-label values) crashes less often in another GKE cluster and almost never in yet another GKE cluster, while receiving similar (very low) traffic via gRPC.
Those query instances serve as gRPC endpoints for a global thanos-query (which runs in another, "observability" cluster and does not crash) to return recent data (older data is served from a bucket). They are behind a GCP load balancer (using HTTP/2 for communication between the LB and Thanos in GKE).
Full logs to relevant components
Example after-crash dump is here: https://gist.github.com/bjakubski/18a98f6f1fc2922e5056df3106fe1477