Sidecar stops responding when querying over multiple days #773

Closed
mreichardt95 opened this issue Jan 29, 2019 · 14 comments

Comments

@mreichardt95
Contributor

mreichardt95 commented Jan 29, 2019

Thanos, Prometheus and Golang version used
Thanos 0.2.0 & 0.2.1, Prometheus 2.6.1 & 2.6.0, Golang go1.11.4

What happened
I tried querying data for multiple days. Example query: sum(irate(nginx_http_requests_total{env="staging"}[1m])) by (host). Time period requested: 1 week. After the request the Thanos sidecar becomes unresponsive and the container needs to be restarted.

What you expected to happen
Get data for multiple days

How to reproduce it (as minimally and precisely as possible):

  • Run Thanos 0.2.0 or 0.2.1 as a sidecar
  • Query over a longer time period (for example with the range query sketched below).
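
For example, a week-long range query can be sent straight to the querier's HTTP API. This is a minimal sketch; the hostname thanos_query and the use of GNU date are assumptions:

  # range query over the last 7 days at 60s resolution
  curl -G "http://thanos_query:19192/api/v1/query_range" \
    --data-urlencode 'query=sum(irate(nginx_http_requests_total{env="staging"}[1m])) by (host)' \
    --data-urlencode "start=$(date -d '7 days ago' +%s)" \
    --data-urlencode "end=$(date +%s)" \
    --data-urlencode "step=60"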

Full logs to relevant components

Sidecar / Prometheus logs

level=info ts=2019-01-29T09:52:59.368424664Z caller=flags.go:90 msg="StoreAPI address that will be propagated through gossip" address=172.18.0.11:19090
level=info ts=2019-01-29T09:52:59.370024121Z caller=main.go:243 msg="Starting Prometheus" version="(version=2.6.1, branch=HEAD, revision=b639fe140c1f71b2cbad3fc322b17efe60839e7e)"
level=info ts=2019-01-29T09:52:59.370112792Z caller=main.go:244 build_context="(go=go1.11.4, user=root@4c0e286fe2b3, date=20190115-19:12:04)"
level=info ts=2019-01-29T09:52:59.370150167Z caller=main.go:245 host_details="(Linux 4.13.0-0.bpo.1-amd64 #1 SMP Debian 4.13.13-1~bpo9+1 (2017-11-22) x86_64 6c203665d2d1 (none))"
level=info ts=2019-01-29T09:52:59.37018473Z caller=main.go:246 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2019-01-29T09:52:59.370214022Z caller=main.go:247 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2019-01-29T09:52:59.371100476Z caller=main.go:561 msg="Starting TSDB ..."
level=info ts=2019-01-29T09:52:59.37115068Z caller=web.go:429 component=web msg="Start listening for connections" address=:9090
level=info ts=2019-01-29T09:52:59.371408727Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1546063200000 maxt=1546257600000 ulid=01D02CF3Q5HYEBMA6HJWMHH4CA
level=info ts=2019-01-29T09:52:59.371474878Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1546257600000 maxt=1546452000000 ulid=01D085VQJM3G6NPB95CWNPGHC8
level=info ts=2019-01-29T09:52:59.37152652Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1546452000000 maxt=1546646400000 ulid=01D0DZ8A5DM52P1G8AGF908H8F
level=info ts=2019-01-29T09:52:59.371577968Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1546646400000 maxt=1546840800000 ulid=01D0KRMZFK30CENEHVWC7YJE1K
level=info ts=2019-01-29T09:52:59.371627919Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1546840800000 maxt=1547035200000 ulid=01D0SJ1K9YE1JGE81ZJBZJDHH2
level=info ts=2019-01-29T09:52:59.371676145Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1547035200000 maxt=1547229600000 ulid=01D0ZBE6HKHTB4S2EZYE8SW8Q8
level=info ts=2019-01-29T09:52:59.371725335Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1547229600000 maxt=1547424000000 ulid=01D154TV03YR6MBB44KKBXX40B
level=info ts=2019-01-29T09:52:59.371779254Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1547424000000 maxt=1547618400000 ulid=01D1AY7FAE61B76Z3QV1N158D8
level=info ts=2019-01-29T09:52:59.371830667Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1547618400000 maxt=1547812800000 ulid=01D1GQM21KKMSBZY2K1SA143W0
level=info ts=2019-01-29T09:52:59.371883086Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1547812800000 maxt=1548007200000 ulid=01D1PH0NC7NQMR268W6ZF7AD8G
level=info ts=2019-01-29T09:52:59.371933564Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1548007200000 maxt=1548201600000 ulid=01D1WAD8H6N4AN78SQVC8Q890Z
level=info ts=2019-01-29T09:52:59.37198501Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1548201600000 maxt=1548396000000 ulid=01D223SX0ZT1QCJ9W4FH2DHBFR
level=info ts=2019-01-29T09:52:59.372035671Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1548396000000 maxt=1548590400000 ulid=01D27X6JNPP31B0RCTX47MA48M
level=info ts=2019-01-29T09:52:59.372074654Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1548590400000 maxt=1548655200000 ulid=01D29TZVP98C4981TR2FFAE13S
level=info ts=2019-01-29T09:52:59.372113734Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1548655200000 maxt=1548720000000 ulid=01D2BRSEE1FBV99AJ8YGR424XG
level=info ts=2019-01-29T09:52:59.372141048Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1548741600000 maxt=1548748800000 ulid=01D2CDCBNF1W0HEJT4BVM1TZWZ
level=info ts=2019-01-29T09:52:59.372178384Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1548720000000 maxt=1548741600000 ulid=01D2CDCHTZBFCC6T00B78Z8VNE
level=info ts=2019-01-29T09:52:59.384660762Z caller=main.go:256 component=sidecar msg="disabled TLS, key and cert must be set to enable"
level=info ts=2019-01-29T09:52:59.384708241Z caller=factory.go:39 msg="loading bucket configuration"
level=info ts=2019-01-29T09:52:59.384725816Z caller=sidecar.go:243 msg="No supported bucket was configured, uploads will be disabled"
level=info ts=2019-01-29T09:52:59.384746103Z caller=sidecar.go:280 msg="starting sidecar" peer=
level=info ts=2019-01-29T09:52:59.384850046Z caller=main.go:308 msg="Listening for metrics" address=0.0.0.0:19191
level=info ts=2019-01-29T09:52:59.385064779Z caller=sidecar.go:220 component=sidecar msg="Listening for StoreAPI gRPC" address=0.0.0.0:19090
level=warn ts=2019-01-29T09:52:59.386197513Z caller=sidecar.go:132 msg="failed to fetch initial external labels. Is Prometheus running? Retrying" err="got non-200 response code: 503, response: Service Unavailable"
level=warn ts=2019-01-29T09:53:01.385369439Z caller=sidecar.go:132 msg="failed to fetch initial external labels. Is Prometheus running? Retrying" err="got non-200 response code: 503, response: Service Unavailable"
level=warn ts=2019-01-29T09:53:03.430785467Z caller=sidecar.go:132 msg="failed to fetch initial external labels. Is Prometheus running? Retrying" err="got non-200 response code: 503, response: Service Unavailable"
level=warn ts=2019-01-29T09:53:05.427291397Z caller=sidecar.go:132 msg="failed to fetch initial external labels. Is Prometheus running? Retrying" err="got non-200 response code: 503, response: Service Unavailable"
level=info ts=2019-01-29T09:53:06.729265508Z caller=main.go:571 msg="TSDB started"
level=info ts=2019-01-29T09:53:06.729329276Z caller=main.go:631 msg="Loading configuration file" filename=/etc/prometheus/prometheus.yml
level=info ts=2019-01-29T09:53:06.759515126Z caller=main.go:657 msg="Completed loading of configuration file" filename=/etc/prometheus/prometheus.yml
level=info ts=2019-01-29T09:53:06.759545038Z caller=main.go:530 msg="Server is ready to receive web requests."

Sidecar start command

/bin/thanos sidecar --http-address 0.0.0.0:19191 --tsdb.path /prometheus --prometheus.url http://127.0.0.1:9090 --grpc-address 0.0.0.0:19090 &

Thanos Query Logs

level=info ts=2019-01-29T09:49:56.193530896Z caller=flags.go:90 msg="StoreAPI address that will be propagated through gossip" address=172.18.0.7:19090
level=info ts=2019-01-29T09:49:56.202274799Z caller=flags.go:105 msg="QueryAPI address that will be propagated through gossip" address=172.18.0.7:19192
level=info ts=2019-01-29T09:49:56.211651277Z caller=main.go:256 component=query msg="disabled TLS, key and cert must be set to enable"
level=info ts=2019-01-29T09:49:56.211699597Z caller=query.go:427 msg="starting query node"
level=info ts=2019-01-29T09:49:56.211755527Z caller=query.go:396 msg="Listening for query and metrics" address=0.0.0.0:19192
level=info ts=2019-01-29T09:49:56.212422636Z caller=query.go:419 component=query msg="Listening for StoreAPI gRPC" address=0.0.0.0:19090
level=info ts=2019-01-29T09:49:56.265621295Z caller=storeset.go:247 component=storeset msg="adding new store to query storeset" address=172.16.2.21:19090
level=info ts=2019-01-29T09:49:56.265695353Z caller=storeset.go:247 component=storeset msg="adding new store to query storeset" address=172.16.2.7:19090
level=info ts=2019-01-29T09:52:56.213459492Z caller=storeset.go:224 component=storeset msg="removing store because it's unhealthy or does not exist" address=172.16.2.21:19090
level=info ts=2019-01-29T09:53:56.214270653Z caller=storeset.go:247 component=storeset msg="adding new store to query storeset" address=172.16.2.20:19090
level=warn ts=2019-01-29T10:55:46.212838748Z caller=storeset.go:286 component=storeset msg="update of store node failed" err="fetching store info from 172.16.2.20:19090: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=172.16.2.20:19090
level=info ts=2019-01-29T10:55:46.212961816Z caller=storeset.go:224 component=storeset msg="removing store because it's unhealthy or does not exist" address=172.16.2.20:19090
level=warn ts=2019-01-29T10:55:56.213246468Z caller=storeset.go:305 component=storeset msg="update of store node failed" err="initial store client info fetch: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=172.16.2.20:19090
level=warn ts=2019-01-29T10:55:56.213240501Z caller=storeset.go:286 component=storeset msg="update of store node failed" err="fetching store info from 172.16.2.7:19090: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=172.16.2.7:19090
level=info ts=2019-01-29T10:55:56.213304983Z caller=storeset.go:224 component=storeset msg="removing store because it's unhealthy or does not exist" address=172.16.2.7:19090
level=warn ts=2019-01-29T10:56:06.213635598Z caller=storeset.go:305 component=storeset msg="update of store node failed" err="initial store client info fetch: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=172.16.2.20:19090
level=warn ts=2019-01-29T10:56:06.213670041Z caller=storeset.go:305 component=storeset msg="update of store node failed" err="initial store client info fetch: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=172.16.2.7:19090
level=warn ts=2019-01-29T10:56:16.214011008Z caller=storeset.go:305 component=storeset msg="update of store node failed" err="initial store client info fetch: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=172.16.2.7:19090
level=warn ts=2019-01-29T10:56:16.214107053Z caller=storeset.go:305 component=storeset msg="update of store node failed" err="initial store client info fetch: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=172.16.2.20:19090
level=warn ts=2019-01-29T10:56:26.214366985Z caller=storeset.go:305 component=storeset msg="update of store node failed" err="initial store client info fetch: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=172.16.2.20:19090
level=warn ts=2019-01-29T10:56:26.21455581Z caller=storeset.go:305 component=storeset msg="update of store node failed" err="initial store client info fetch: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=172.16.2.7:19090
level=warn ts=2019-01-29T10:56:36.214788863Z caller=storeset.go:305 component=storeset msg="update of store node failed" err="initial store client info fetch: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=172.16.2.20:19090
level=warn ts=2019-01-29T10:56:36.214788881Z caller=storeset.go:305 component=storeset msg="update of store node failed" err="initial store client info fetch: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=172.16.2.7:19090
level=warn ts=2019-01-29T10:56:46.215057632Z caller=storeset.go:305 component=storeset msg="update of store node failed" err="initial store client info fetch: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=172.16.2.7:19090
level=warn ts=2019-01-29T10:56:46.215072102Z caller=storeset.go:305 component=storeset msg="update of store node failed" err="initial store client info fetch: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=172.16.2.20:19090
level=warn ts=2019-01-29T10:56:56.215324013Z caller=storeset.go:305 component=storeset msg="update of store node failed" err="initial store client info fetch: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=172.16.2.7:19090
level=warn ts=2019-01-29T10:56:56.215383136Z caller=storeset.go:305 component=storeset msg="update of store node failed" err="initial store client info fetch: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=172.16.2.20:19090
level=warn ts=2019-01-29T10:57:06.21559515Z caller=storeset.go:305 component=storeset msg="update of store node failed" err="initial store client info fetch: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=172.16.2.20:19090
level=warn ts=2019-01-29T10:57:06.215595383Z caller=storeset.go:305 component=storeset msg="update of store node failed" err="initial store client info fetch: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=172.16.2.7:19090
level=warn ts=2019-01-29T10:57:16.21587468Z caller=storeset.go:305 component=storeset msg="update of store node failed" err="initial store client info fetch: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=172.16.2.7:19090
level=warn ts=2019-01-29T10:57:16.21587464Z caller=storeset.go:305 component=storeset msg="update of store node failed" err="initial store client info fetch: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=172.16.2.20:19090
level=warn ts=2019-01-29T10:57:26.216184761Z caller=storeset.go:305 component=storeset msg="update of store node failed" err="initial store client info fetch: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=172.16.2.7:19090
level=warn ts=2019-01-29T10:57:26.216222667Z caller=storeset.go:305 component=storeset msg="update of store node failed" err="initial store client info fetch: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=172.16.2.20:19090
level=warn ts=2019-01-29T10:57:36.216534532Z caller=storeset.go:305 component=storeset msg="update of store node failed" err="initial store client info fetch: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=172.16.2.20:19090
level=warn ts=2019-01-29T10:57:36.216586779Z caller=storeset.go:305 component=storeset msg="update of store node failed" err="initial store client info fetch: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=172.16.2.7:19090
level=warn ts=2019-01-29T10:57:46.216834598Z caller=storeset.go:305 component=storeset msg="update of store node failed" err="initial store client info fetch: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=172.16.2.20:19090
level=warn ts=2019-01-29T10:57:46.21691304Z caller=storeset.go:305 component=storeset msg="update of store node failed" err="initial store client info fetch: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=172.16.2.7:19090
level=warn ts=2019-01-29T10:57:56.217205963Z caller=storeset.go:305 component=storeset msg="update of store node failed" err="initial store client info fetch: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=172.16.2.20:19090
level=warn ts=2019-01-29T10:57:56.217339452Z caller=storeset.go:305 component=storeset msg="update of store node failed" err="initial store client info fetch: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=172.16.2.7:19090
level=warn ts=2019-01-29T10:58:06.217588001Z caller=storeset.go:305 component=storeset msg="update of store node failed" err="initial store client info fetch: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=172.16.2.7:19090
level=warn ts=2019-01-29T10:58:16.217896041Z caller=storeset.go:305 component=storeset msg="update of store node failed" err="initial store client info fetch: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=172.16.2.7:19090
level=warn ts=2019-01-29T10:58:26.21816559Z caller=storeset.go:305 component=storeset msg="update of store node failed" err="initial store client info fetch: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=172.16.2.7:19090
level=warn ts=2019-01-29T10:58:36.218448897Z caller=storeset.go:305 component=storeset msg="update of store node failed" err="initial store client info fetch: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=172.16.2.7:19090
level=info ts=2019-01-29T10:58:36.218525167Z caller=storeset.go:247 component=storeset msg="adding new store to query storeset" address=172.16.2.21:19090
level=warn ts=2019-01-29T10:58:46.218722208Z caller=storeset.go:305 component=storeset msg="update of store node failed" err="initial store client info fetch: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=172.16.2.7:19090
level=warn ts=2019-01-29T10:58:56.219054704Z caller=storeset.go:305 component=storeset msg="update of store node failed" err="initial store client info fetch: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=172.16.2.7:19090
level=warn ts=2019-01-29T10:59:06.219343378Z caller=storeset.go:305 component=storeset msg="update of store node failed" err="initial store client info fetch: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=172.16.2.7:19090
level=warn ts=2019-01-29T10:59:16.219678423Z caller=storeset.go:305 component=storeset msg="update of store node failed" err="initial store client info fetch: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=172.16.2.7:19090
level=warn ts=2019-01-29T10:59:26.219991344Z caller=storeset.go:305 component=storeset msg="update of store node failed" err="initial store client info fetch: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=172.16.2.7:19090
level=warn ts=2019-01-29T10:59:36.220328665Z caller=storeset.go:305 component=storeset msg="update of store node failed" err="initial store client info fetch: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=172.16.2.7:19090
level=warn ts=2019-01-29T10:59:46.220626486Z caller=storeset.go:305 component=storeset msg="update of store node failed" err="initial store client info fetch: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=172.16.2.7:19090
level=warn ts=2019-01-29T10:59:56.22087541Z caller=storeset.go:305 component=storeset msg="update of store node failed" err="initial store client info fetch: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=172.16.2.7:19090
level=warn ts=2019-01-29T11:00:06.221132879Z caller=storeset.go:305 component=storeset msg="update of store node failed" err="initial store client info fetch: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=172.16.2.7:19090
level=warn ts=2019-01-29T11:00:16.221430086Z caller=storeset.go:305 component=storeset msg="update of store node failed" err="initial store client info fetch: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=172.16.2.7:19090
level=warn ts=2019-01-29T11:00:26.221777923Z caller=storeset.go:305 component=storeset msg="update of store node failed" err="initial store client info fetch: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=172.16.2.7:19090
level=warn ts=2019-01-29T11:00:36.222139913Z caller=storeset.go:305 component=storeset msg="update of store node failed" err="initial store client info fetch: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=172.16.2.7:19090
level=warn ts=2019-01-29T11:00:46.222461099Z caller=storeset.go:305 component=storeset msg="update of store node failed" err="initial store client info fetch: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=172.16.2.7:19090
level=warn ts=2019-01-29T11:00:56.226582106Z caller=storeset.go:305 component=storeset msg="update of store node failed" err="initial store client info fetch: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=172.16.2.7:19090
level=warn ts=2019-01-29T11:01:06.226987071Z caller=storeset.go:305 component=storeset msg="update of store node failed" err="initial store client info fetch: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=172.16.2.7:19090
level=warn ts=2019-01-29T11:01:16.227334351Z caller=storeset.go:305 component=storeset msg="update of store node failed" err="initial store client info fetch: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=172.16.2.7:19090
level=warn ts=2019-01-29T11:01:26.227666854Z caller=storeset.go:305 component=storeset msg="update of store node failed" err="initial store client info fetch: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=172.16.2.7:19090
level=info ts=2019-01-29T11:02:26.213977359Z caller=storeset.go:247 component=storeset msg="adding new store to query storeset" address=172.16.2.20:19090

Thanos Query Command

  thanos_query:
    image: improbable/thanos:v0.2.1
    ports:
      - 19090:19090
    command: query --store dns+tasks.prometheus:19090 --grpc-address 0.0.0.0:19090 --query.replica-label replica --http-address 0.0.0.0:19192
    labels:
      container_group: monitoring
    networks:
      - monitoring
    deploy:
      mode: global
      update_config:
        parallelism: 1
        delay: 30s
        order: stop-first
        monitor: 10m
        failure_action: rollback
      restart_policy:
        delay: 15s
      resources:
        limits:
          cpus: '1'
          memory: 512M
      placement:
        constraints: [node.labels.monitoring_prometheus == true ]
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "10"
        labels: container_group
        env: ${APP_ENV}

Anything else we need to know
We're currently running on Docker swarm. The sidecar process runs directly inside the Prometheus container. We don't use any bucket storage yet; everything is kept on disk for now (30d retention).

@mreichardt95
Contributor Author

Digging deeper into this:
It only happens when Thanos Query has more than one store set up (a Prometheus HA pair). On our preview system with only one Prometheus instance this problem does not occur. I guess this may have something to do with the replica label? We currently generate a random number on startup and place it in Prometheus' external labels. I rewrote our image to persist the replica number so it does not change on startup; let's see if that helps with the issue.
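
A rough sketch of what such a change looks like in a bash entrypoint (the path and naming scheme are purely illustrative, not our actual image):

  # generate the replica id once and reuse it across restarts
  REPLICA_FILE=/prometheus/replica_id
  if [ ! -f "$REPLICA_FILE" ]; then
    echo "cloud-$((RANDOM % 10000))" > "$REPLICA_FILE"
  fi
  REPLICA="$(cat "$REPLICA_FILE")"
  # $REPLICA is then rendered into prometheus.yml under global.external_labels.replica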

@bwplotka
Member

Thanks for the report!

A randomized replica label is quite bad, yes: compaction/downsampling would not work with it, but you don't use those. For your setup, though, I think the issue is not caused by the replica label.

Questions:

  • Are the IPs okay? I can see the sidecar advertising on 172.18.0.11:19090, whereas the querier tries to connect to 172.16.2.20:19090 and 172.16.2.7:19090.
  • During the problematic period, have you tried to reach the sidecar's gRPC endpoint on those IPs? Is the sidecar unresponsive? Does the metrics endpoint work? Does gRPC work? (A quick check is sketched below.)
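
A minimal way to probe both endpoints from another container might be the following sketch; it assumes curl and grpcurl are available and that either gRPC server reflection or the storepb proto files are at hand:

  # HTTP metrics endpoint of the sidecar
  curl -sf http://172.16.2.20:19191/metrics > /dev/null && echo "metrics ok"
  # StoreAPI Info call over plaintext gRPC
  grpcurl -plaintext 172.16.2.20:19090 thanos.Store/Info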

@mreichardt95
Contributor Author

Thanks for the info. We want to use compaction/downsampling in the near future. You are correct, the labels were not the issue: with a fixed replica label I'm hitting the same error.

The IPs look off because Prometheus and all monitoring components are in a separate network inside our Docker swarm cluster. Thanos Query is only in the "monitoring" network, as it only needs a connection to Prometheus. Prometheus itself is in both the common and the monitoring network, because it has to reach the services it's supposed to monitor. AFAIK the sidecar process sets its advertise IP wrong, but that only matters for gossip, right? We rely on DNS discovery and static entries, so this should not be a problem.

When the error is triggered the process becomes completely unresponsive. All ports are closed (also from inside the container).

I also started the sidecar with --log.level=debug to see if that helps; unfortunately it didn't.
The query logs are a bit interesting, but I guess this is due to not having consistent replica labels until now:

level=warn ts=2019-01-29T15:50:01.741236774Z caller=proxy.go:136 err="No store matched for this query" stores="store Addr: 172.16.2.20:19090 Labels: [{env staging {} [] 0} {replica cloud-6110 {} [] 0}] Mint: 0 Maxt: 9223372036854775807 filtered out;store Addr: 172.16.2.21:19090 Labels: [{env staging {} [] 0} {replica cloud-3893 {} [] 0}] Mint: 0 Maxt: 9223372036854775807 filtered out"
level=warn ts=2019-01-29T15:50:01.741629855Z caller=proxy.go:136 err="No store matched for this query" stores="store Addr: 172.16.2.20:19090 Labels: [{env staging {} [] 0} {replica cloud-6110 {} [] 0}] Mint: 0 Maxt: 9223372036854775807 filtered out;store Addr: 172.16.2.21:19090 Labels: [{env staging {} [] 0} {replica cloud-3893 {} [] 0}] Mint: 0 Maxt: 9223372036854775807 filtered out"
level=warn ts=2019-01-29T15:51:01.741394597Z caller=proxy.go:136 err="No store matched for this query" stores="store Addr: 172.16.2.20:19090 Labels: [{env staging {} [] 0} {replica cloud-6110 {} [] 0}] Mint: 0 Maxt: 9223372036854775807 filtered out;store Addr: 172.16.2.21:19090 Labels: [{env staging {} [] 0} {replica cloud-3893 {} [] 0}] Mint: 0 Maxt: 9223372036854775807 filtered out"
level=warn ts=2019-01-29T15:51:01.741403366Z caller=proxy.go:136 err="No store matched for this query" stores="store Addr: 172.16.2.21:19090 Labels: [{env staging {} [] 0} {replica cloud-3893 {} [] 0}] Mint: 0 Maxt: 9223372036854775807 filtered out;store Addr: 172.16.2.20:19090 Labels: [{env staging {} [] 0} {replica cloud-6110 {} [] 0}] Mint: 0 Maxt: 9223372036854775807 filtered out"
level=warn ts=2019-01-29T15:51:11.574428582Z caller=storeset.go:286 component=storeset msg="update of store node failed" err="fetching store info from 172.16.2.20:19090: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=172.16.2.20:19090
level=warn ts=2019-01-29T15:51:11.574415184Z caller=storeset.go:286 component=storeset msg="update of store node failed" err="fetching store info from 172.16.2.21:19090: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=172.16.2.21:19090
level=info ts=2019-01-29T15:51:11.574501515Z caller=storeset.go:224 component=storeset msg="removing store because it's unhealthy or does not exist" address=172.16.2.20:19090
level=info ts=2019-01-29T15:51:11.574523046Z caller=storeset.go:224 component=storeset msg="removing store because it's unhealthy or does not exist" address=172.16.2.21:19090

Sidecar logs with --log.level=debug:

level=info ts=2019-01-29T15:48:42.703010821Z caller=flags.go:90 msg="StoreAPI address that will be propagated through gossip" address=172.18.0.11:19090
level=info ts=2019-01-29T15:48:42.703999568Z caller=main.go:243 msg="Starting Prometheus" version="(version=2.6.1, branch=HEAD, revision=b639fe140c1f71b2cbad3fc322b17efe60839e7e)"
level=info ts=2019-01-29T15:48:42.704050679Z caller=main.go:244 build_context="(go=go1.11.4, user=root@4c0e286fe2b3, date=20190115-19:12:04)"
level=info ts=2019-01-29T15:48:42.704069482Z caller=main.go:245 host_details="(Linux 4.13.0-0.bpo.1-amd64 #1 SMP Debian 4.13.13-1~bpo9+1 (2017-11-22) x86_64 cda782b2f750 (none))"
level=info ts=2019-01-29T15:48:42.704086189Z caller=main.go:246 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2019-01-29T15:48:42.704099912Z caller=main.go:247 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2019-01-29T15:48:42.704669053Z caller=main.go:561 msg="Starting TSDB ..."
level=info ts=2019-01-29T15:48:42.704725323Z caller=web.go:429 component=web msg="Start listening for connections" address=:9090
level=info ts=2019-01-29T15:48:42.705131406Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1546063200000 maxt=1546257600000 ulid=01D02CF3Q5HYEBMA6HJWMHH4CA
level=info ts=2019-01-29T15:48:42.705211993Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1546257600000 maxt=1546452000000 ulid=01D085VQJM3G6NPB95CWNPGHC8
level=info ts=2019-01-29T15:48:42.705285963Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1546452000000 maxt=1546646400000 ulid=01D0DZ8A5DM52P1G8AGF908H8F
level=info ts=2019-01-29T15:48:42.705341599Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1546646400000 maxt=1546840800000 ulid=01D0KRMZFK30CENEHVWC7YJE1K
level=info ts=2019-01-29T15:48:42.70539585Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1546840800000 maxt=1547035200000 ulid=01D0SJ1K9YE1JGE81ZJBZJDHH2
level=info ts=2019-01-29T15:48:42.705450253Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1547035200000 maxt=1547229600000 ulid=01D0ZBE6HKHTB4S2EZYE8SW8Q8
level=info ts=2019-01-29T15:48:42.705517733Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1547229600000 maxt=1547424000000 ulid=01D154TV03YR6MBB44KKBXX40B
level=info ts=2019-01-29T15:48:42.705589676Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1547424000000 maxt=1547618400000 ulid=01D1AY7FAE61B76Z3QV1N158D8
level=info ts=2019-01-29T15:48:42.705641047Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1547618400000 maxt=1547812800000 ulid=01D1GQM21KKMSBZY2K1SA143W0
level=info ts=2019-01-29T15:48:42.70569065Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1547812800000 maxt=1548007200000 ulid=01D1PH0NC7NQMR268W6ZF7AD8G
level=info ts=2019-01-29T15:48:42.705739906Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1548007200000 maxt=1548201600000 ulid=01D1WAD8H6N4AN78SQVC8Q890Z
level=info ts=2019-01-29T15:48:42.705790501Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1548201600000 maxt=1548396000000 ulid=01D223SX0ZT1QCJ9W4FH2DHBFR
level=info ts=2019-01-29T15:48:42.705839997Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1548396000000 maxt=1548590400000 ulid=01D27X6JNPP31B0RCTX47MA48M
level=info ts=2019-01-29T15:48:42.705879564Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1548590400000 maxt=1548655200000 ulid=01D29TZVP98C4981TR2FFAE13S
level=info ts=2019-01-29T15:48:42.705917779Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1548655200000 maxt=1548720000000 ulid=01D2BRSEE1FBV99AJ8YGR424XG
level=info ts=2019-01-29T15:48:42.705968009Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1548720000000 maxt=1548741600000 ulid=01D2CDCHTZBFCC6T00B78Z8VNE
level=info ts=2019-01-29T15:48:42.705994088Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1548763200000 maxt=1548770400000 ulid=01D2D1ZHDH862PZZ9WTY9KTRQ7
level=info ts=2019-01-29T15:48:42.706027543Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1548741600000 maxt=1548763200000 ulid=01D2D1ZNM9R023VPE1DQGTZ634
level=debug ts=2019-01-29T15:48:42.714796281Z caller=cluster.go:158 component=cluster msg="resolved peers to following addresses" peers=
level=info ts=2019-01-29T15:48:42.715042315Z caller=main.go:256 component=sidecar msg="disabled TLS, key and cert must be set to enable"
level=info ts=2019-01-29T15:48:42.715067634Z caller=factory.go:39 msg="loading bucket configuration"
level=info ts=2019-01-29T15:48:42.715077617Z caller=sidecar.go:243 msg="No supported bucket was configured, uploads will be disabled"
level=info ts=2019-01-29T15:48:42.715090869Z caller=sidecar.go:280 msg="starting sidecar" peer=
level=info ts=2019-01-29T15:48:42.715263351Z caller=main.go:308 msg="Listening for metrics" address=0.0.0.0:19191
level=info ts=2019-01-29T15:48:42.715347936Z caller=sidecar.go:220 component=sidecar msg="Listening for StoreAPI gRPC" address=0.0.0.0:19090
level=warn ts=2019-01-29T15:48:42.71584379Z caller=sidecar.go:132 msg="failed to fetch initial external labels. Is Prometheus running? Retrying" err="got non-200 response code: 503, response: Service Unavailable"
level=warn ts=2019-01-29T15:48:44.718900314Z caller=sidecar.go:132 msg="failed to fetch initial external labels. Is Prometheus running? Retrying" err="got non-200 response code: 503, response: Service Unavailable"
level=warn ts=2019-01-29T15:48:46.731317236Z caller=sidecar.go:132 msg="failed to fetch initial external labels. Is Prometheus running? Retrying" err="got non-200 response code: 503, response: Service Unavailable"
level=warn ts=2019-01-29T15:48:48.726783867Z caller=sidecar.go:132 msg="failed to fetch initial external labels. Is Prometheus running? Retrying" err="got non-200 response code: 503, response: Service Unavailable"
level=warn ts=2019-01-29T15:48:49.956630334Z caller=head.go:434 component=tsdb msg="unknown series references" count=92
level=info ts=2019-01-29T15:48:50.030767774Z caller=main.go:571 msg="TSDB started"
level=info ts=2019-01-29T15:48:50.030834855Z caller=main.go:631 msg="Loading configuration file" filename=/etc/prometheus/prometheus.yml
level=info ts=2019-01-29T15:48:50.041428498Z caller=main.go:657 msg="Completed loading of configuration file" filename=/etc/prometheus/prometheus.yml
level=info ts=2019-01-29T15:48:50.041453469Z caller=main.go:530 msg="Server is ready to receive web requests."
level=debug ts=2019-01-29T15:48:50.738544543Z caller=delegate.go:82 component=cluster received=NotifyJoin node=01D2D4RQJTNSWHF42P2B5PZX8K addr=172.18.0.11:10900
level=debug ts=2019-01-29T15:48:50.738614162Z caller=cluster.go:233 component=cluster msg="joined cluster" peerType=source knownPeers=
level=debug ts=2019-01-29T15:49:50.73945594Z caller=cluster.go:307 component=cluster msg="refresh cluster done" peers= resolvedPeers=
level=debug ts=2019-01-29T15:50:50.739643314Z caller=cluster.go:307 component=cluster msg="refresh cluster done" peers= resolvedPeers=

@jojohappy
Member

Interesting!

Could you compare the DNS results of tasks.prometheus with the real IPs of the sidecar and Prometheus containers? I'm not sure the sidecar advertises the wrong address, but it seems the querier discovered the wrong IPs from the store DNS entry. Also, could you configure a static store IP for the query component and redo your query request?
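
Something along these lines could be used for the comparison (container names and the available tooling are assumptions):

  # what DNS-based discovery resolves for the --store flag
  docker exec <thanos_query-container> nslookup tasks.prometheus
  # the addresses the Prometheus/sidecar container actually holds
  docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}} {{end}}' <prometheus-container>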

@mreichardt95
Contributor Author

mreichardt95 commented Jan 30, 2019

tasks.prometheus resolves to the correct IPs. The sidecar and Query both advertise an IP on the docker_gwbridge network, because it is the only one that has a default route configured. That network is built into Docker swarm and is used for container <-> host communication. The whole setup works perfectly as long as we don't query a longer time period.

In one of our environments Prometheus/the sidecar outputs the following logs:

level=info ts=2019-01-30T05:00:01.11905342Z caller=compact.go:398 component=tsdb msg="write block" mint=1548813600000 maxt=1548820800000 ulid=01D2EJ1MMJ9GMJZNWDRHMF3YRQ
level=info ts=2019-01-30T05:00:01.397979455Z caller=head.go:488 component=tsdb msg="head GC completed" duration=40.108247ms
level=info ts=2019-01-30T05:00:02.480904788Z caller=head.go:535 component=tsdb msg="WAL checkpoint complete" low=1489 high=1490 duration=1.082871822s
level=info ts=2019-01-30T07:00:01.195972704Z caller=compact.go:398 component=tsdb msg="write block" mint=1548820800000 maxt=1548828000000 ulid=01D2ERXBZ49BR9V47BF7BJSSDH
level=info ts=2019-01-30T07:00:01.468978465Z caller=head.go:488 component=tsdb msg="head GC completed" duration=38.235833ms
level=info ts=2019-01-30T07:00:02.515920216Z caller=head.go:535 component=tsdb msg="WAL checkpoint complete" low=1491 high=1492 duration=1.046897123s
level=error ts=2019-01-30T07:21:18.650061417Z caller=stdlib.go:89 component=web msg="http: multiple response.WriteHeader calls"
level=error ts=2019-01-30T07:21:18.782541522Z caller=stdlib.go:89 component=web msg="http: multiple response.WriteHeader calls"
level=error ts=2019-01-30T07:21:22.804851087Z caller=stdlib.go:89 component=web msg="http: multiple response.WriteHeader calls"
level=error ts=2019-01-30T07:21:22.85381848Z caller=stdlib.go:89 component=web msg="http: multiple response.WriteHeader calls"
level=error ts=2019-01-30T07:21:26.870707093Z caller=stdlib.go:89 component=web msg="http: multiple response.WriteHeader calls"
level=error ts=2019-01-30T07:21:27.084767467Z caller=stdlib.go:89 component=web msg="http: multiple response.WriteHeader calls"
level=error ts=2019-01-30T07:21:34.639169485Z caller=stdlib.go:89 component=web msg="http: multiple response.WriteHeader calls"
level=error ts=2019-01-30T07:21:34.72698079Z caller=stdlib.go:89 component=web msg="http: multiple response.WriteHeader calls"
level=error ts=2019-01-30T07:21:36.393864942Z caller=stdlib.go:89 component=web msg="http: multiple response.WriteHeader calls"
level=error ts=2019-01-30T07:21:36.461508956Z caller=stdlib.go:89 component=web msg="http: multiple response.WriteHeader calls"
level=error ts=2019-01-30T07:21:36.492019374Z caller=stdlib.go:89 component=web msg="http: multiple response.WriteHeader calls"
level=error ts=2019-01-30T07:21:36.55558539Z caller=stdlib.go:89 component=web msg="http: multiple response.WriteHeader calls"
level=error ts=2019-01-30T07:21:36.714436105Z caller=stdlib.go:89 component=web msg="http: multiple response.WriteHeader calls"
level=error ts=2019-01-30T07:21:37.839131275Z caller=stdlib.go:89 component=web msg="http: multiple response.WriteHeader calls"
level=error ts=2019-01-30T07:21:39.11955773Z caller=stdlib.go:89 component=web msg="http: multiple response.WriteHeader calls"
level=error ts=2019-01-30T07:21:39.216256191Z caller=stdlib.go:89 component=web msg="http: multiple response.WriteHeader calls"

Starting Prometheus with --log.level=debug didn't help either. Thanos just dies without any output explaining why.

level=info ts=2019-01-30T09:52:18.74139847Z caller=main.go:657 msg="Completed loading of configuration file" filename=/etc/prometheus/prometheus.yml
level=info ts=2019-01-30T09:52:18.741424946Z caller=main.go:530 msg="Server is ready to receive web requests."
level=debug ts=2019-01-30T09:56:41.073001949Z caller=scrape.go:825 component="scrape manager" scrape_pool=thanos-sidecar target=http://172.16.0.198:19191/metrics msg="Scrape failed" err="Get http://172.16.0.198:19191/metrics: dial tcp 172.16.0.198:19191: connect: connection refused"
level=debug ts=2019-01-30T09:56:44.627509553Z caller=scrape.go:825 component="scrape manager" scrape_pool=thanos-sidecar target=http://172.16.0.51:19191/metrics msg="Scrape failed" err="Get http://172.16.0.51:19191/metrics: dial tcp 172.16.0.51:19191: connect: connection refused"
level=debug ts=2019-01-30T09:56:51.072968383Z caller=scrape.go:825 component="scrape manager" scrape_pool=thanos-sidecar target=http://172.16.0.198:19191/metrics msg="Scrape failed" err="Get http://172.16.0.198:19191/metrics: dial tcp 172.16.0.198:19191: connect: connection refused"

EDIT: I also tried configuring the store endpoints statically on Query; same result as before. I was also able to reproduce this with only one Prometheus instance running. I have now completely reset all volumes and data for Prometheus on our staging system. Maybe this problem only occurs if the sidecar is reading data not stored in the WAL. I will have a look at it as soon as Prometheus writes its first full block to disk.

@mreichardt95
Contributor Author

Prometheus has now written multiple blocks to disk and all of them are accessible through Thanos. Maybe I'm experiencing something similar to #396?

@jojohappy
Member

@mreichardt95 Thanks for your report.

Maybe you can try upgrading Thanos to the current master and redoing your query. In the meantime, please check the health of the Prometheus instance while your query gets no response.

@mreichardt95
Contributor Author

The sidecar stopped again... Prometheus is fully working and healthy when the sidecar goes down.

I reconfigured our Prometheus base image to build thanos from master:

/bin # ./thanos --version
thanos, version 0.2.1-master (branch: master, revision: 827e968499f72360570834bb62334c796bf9fffe)
  build user:       root@85ef39fdd407
  build date:       20190201-10:57:56
  go version:       go1.11.2
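
For reference, a build from master roughly follows these steps (a sketch; the exact Makefile targets may differ between revisions):

  git clone https://github.com/improbable-eng/thanos.git
  cd thanos
  make build   # produces the ./thanos binary shown above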

I tested around a bit and was not able to break the sidecar again. I only experienced some delays while querying a lot of data, but I guess that is expected :). Will keep an eye on it. Thanks for your help!

@jojohappy
Member

jojohappy commented Feb 1, 2019

👍

@mreichardt95
Contributor Author

The sidecar just crashed again. I think this might not only be related to the queries executed, but also to how long the sidecar has been running. After a restart I can query everything as expected without any problems.
The sidecar process also doesn't really crash; it just appears as a "zombie" process in the container.

@mreichardt95 mreichardt95 reopened this Feb 4, 2019
@mreichardt95
Contributor Author

Ran into it again in the morning. This time Prometheus itself crashed too. I guess this happens when the sidecar does not receive queries for some time and then crashes on the next query it receives.

I have spun up a Rule component that just evaluates up == 0 against all our environments. Maybe this will help.
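
Such a Rule setup might look roughly like this (a sketch only; flag names may differ between Thanos versions, and the rule file path and query address are assumptions):

  # evaluate rules against the existing querier
  /bin/thanos rule \
    --query "thanos_query:19192" \
    --rule-file /etc/thanos/rules.yml \
    --data-dir /var/thanos/rule &
  # rules.yml holds a simple alerting rule on the expression: up == 0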

@bwplotka
Member

bwplotka commented Feb 6, 2019

Have you checked whether the sidecar is not just OOMing? (: If the query is large, it can consume a lot of memory.

#488 will help with this.

But in the meantime you could check whether the sidecar has enough memory to operate.
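
A couple of quick checks for OOM kills, assuming access to the Docker host (the container name is a placeholder):

  # was the container killed by the kernel OOM killer?
  docker inspect -f '{{.State.OOMKilled}}' <prometheus-container>
  # kernel log on the swarm node
  dmesg | grep -i 'killed process'
  # live memory usage against the configured limit
  docker stats --no-stream <prometheus-container>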

@mreichardt95
Contributor Author

Thanks for your input! Yes, the container was OOMing when Prometheus crashed yesterday, but that does not explain the crash of the sidecar itself. Since I fired up the Rule component we haven't had a single crash or hiccup in any of our environments. I will keep watching this, but it is probably just being worked around by constantly querying the sidecar, which keeps it from crashing.
This is probably the same behaviour #705 describes (only on a different Thanos component).

@bwplotka
Member

bwplotka commented Feb 6, 2019

Not sure, but might be.

Let's close this issue and track and discuss this on #488, which is essentially about improving memory consumption for both the sidecar and Prometheus.

@bwplotka bwplotka closed this as completed Feb 6, 2019