thanos-query crashes with "concurrent map iteration and map write" #1272

bjakubski · 2019-06-24T13:55:50Z

Thanos, Prometheus and Golang version used

thanos, version 0.5.0 (branch: HEAD, revision: 72820b3)
build user: circleci@eeac5eb36061
build date: 20190606-10:53:12
go version: go1.12.5

What happened

In one of the k8s clusters where we run thanos-query, it crashes every couple of minutes with "fatal error: concurrent map iteration and map write" or "fatal error: concurrent map writes".

What you expected to happen

No crash :-)

How to reproduce it (as minimally and precisely as possible):

I have no idea. I didn't manage to find anything that triggers it. The same problem was observed in 0.4.0. I'm not sure about 0.3.0.
Thanos runs in a GCP GKE cluster; query is deployed via our own Helm chart. The crashing containers run:

  thanos query
      --log.level=debug
      --query.replica-label=prometheus_replica
      --grpc-server-tls-cert=/etc/certs/tls.crt
      --grpc-server-tls-key=/etc/certs/tls.key
      --store=dnssrv+_grpc._tcp.thanos-sidecars-prometheus.monitoring.svc
      --selector-label=location="REDACTED"
      --selector-label=stack="REDACTED"
      --selector-label=REDACTED

The same deployment (differing only in selector-label values) crashes less often in another GKE cluster and almost never in yet another GKE cluster, while receiving similar (very low) traffic via gRPC.

Those query instances serve as gRPC endpoints for the global thanos-query (which runs in another, "observability" cluster and does not crash) to return recent data (older data is served from the bucket). They are behind a GCP load balancer (using HTTP/2 for LB <-> Thanos communication in GKE).

Full logs to relevant components

Example after-crash dump is here: https://gist.github.com/bjakubski/18a98f6f1fc2922e5056df3106fe1477

GiedriusS · 2019-06-26T08:53:36Z

Seems like something related to the UI and the execution of templates. Do you use some kind of particular functionality of it? I wonder how to reproduce it.

lx223 · 2019-06-26T10:34:08Z

Hi, I am interested in helping. Can someone assign this to me pls?

bjakubski · 2019-06-26T15:13:58Z

As far as I know, the traffic received by those crashing instances consists of:

- queries passed from other (global) thanos-query instances, which go through the GCP load balancer
- health checks performed by GKE (k8s) and the GCP load balancer: the HTTP port is checked by accessing the /graph URL, and gRPC is checked via the GCP "http2" health check (whose exact behaviour I do not know)

Other than that, the Thanos UI is practically unused.

lx223 · 2019-06-26T21:36:34Z

I believe this issue is due to concurrent map writes in ui.go, where the web prefix is rendered into the HTML templates. I'm a bit surprised that it was actually triggered with such low traffic. Here is the PR: #1280. PTAL
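For readers unfamiliar with this class of bug, here is a minimal sketch of the pattern described above. It is not the code from ui.go or from the PR. It assumes a template function map that is shared across requests, and the names `sharedFuncs`, `render`, and `pathPrefix` are hypothetical. Writing a per-request value into such a shared map from several goroutines is what makes the Go runtime abort with "concurrent map writes". The sketch avoids that by copying the map for each call:

```go
package main

import (
	"fmt"
	"strings"
	"sync"
	"text/template"
)

// sharedFuncs stands in for a function map that is built once and then
// reused by every request handler (hypothetical name).
var sharedFuncs = template.FuncMap{
	"upper": strings.ToUpper,
}

// render executes a small template for one request. Instead of writing the
// per-request "pathPrefix" function into sharedFuncs (a data race when
// several requests run at once), it copies the shared entries into a fresh
// map owned by this call. Concurrent reads of sharedFuncs are safe as long
// as nothing writes to it.
func render(prefix string) (string, error) {
	funcs := template.FuncMap{}
	for name, fn := range sharedFuncs {
		funcs[name] = fn
	}
	funcs["pathPrefix"] = func() string { return prefix }

	t, err := template.New("page").Funcs(funcs).
		Parse(`<a href="{{ pathPrefix }}/graph">{{ upper "graph" }}</a>`)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := t.Execute(&sb, nil); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	prefixes := []string{"/a", "/b", "/c", "/d"}
	results := make([]string, len(prefixes))

	var wg sync.WaitGroup
	for i, p := range prefixes {
		wg.Add(1)
		// Each goroutine writes only to its own slice index, so no lock is needed.
		go func(i int, p string) {
			defer wg.Done()
			out, err := render(p)
			if err != nil {
				out = "error: " + err.Error()
			}
			results[i] = out
		}(i, p)
	}
	wg.Wait()

	for _, r := range results {
		fmt.Println(r)
	}
}
```

Guarding the shared map with a `sync.Mutex` would be another way to remove the race. The per-call copy is used here only because it keeps the shared map read-only.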

povilasv · 2019-06-27T10:38:27Z

Fix was merged to master, please try the new version.

povilasv · 2019-06-27T10:42:30Z

@lx223 thanks for the fix!

bjakubski · 2019-06-27T12:28:53Z

I quickly deployed a version with the fix and have seen no crashes for one hour (usually there were a couple of them in a similar time range in this cluster).
Thanks @lx223, much appreciated!

bwplotka added the bug, component: query, and help wanted labels Jun 24, 2019

bwplotka added the good first issue label Jun 26, 2019

bwplotka assigned lx223 Jun 26, 2019

povilasv closed this as completed Jun 27, 2019

lx223 added a commit to lx223/thanos that referenced this issue Jun 27, 2019

thanos-io#1272: add bugfix to changelog

cec835c
