Thanos Query stops responding if not queried for some time #705

Closed
kedare opened this issue Jan 3, 2019 · 31 comments

kedare commented Jan 3, 2019

Thanos, Prometheus and Golang version used

Docker tags:

thanos:
  version: v0.2.1
prometheus:
  version: v2.5.0

What happened

When thanos-query has not been used for some time, it stops responding when we try to query it again, and it only works again after a restart.
We noticed this after we shut down our persistent Grafana dashboards (shown on screens in the office); now, every time we go to Grafana or to the thanos-query interface, the queries hang forever until we restart thanos-query.
CPU usage on all the Thanos-related processes is low, so it doesn't look like it's doing anything.
All the stores are up in the thanos-query dashboard, but all queries are failing (even up).
After restarting thanos-query, everything works fine for a few hours (I don't know the exact time).

What you expected to happen
Query working fine

How to reproduce it (as minimally and precisely as possible):

Don't do queries for a few hours, then try to query

Full logs to relevant components

Nothing in the logs except context cancelled errors, because the HTTP queries are timing out on Grafana.

Anything else we need to know

Environment:

  • OS: Docker 18.09.0 on CentOS 7
  • Kernel (e.g. uname -a): Linux gc-euw1-prometheus-central-1 3.10.0-862.14.4.el7.x86_64 #1 SMP Wed Sep 26 15:12:11 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
  • Others:

bwplotka commented Jan 3, 2019

Thanks for the report.
What is your configuration for connecting the Querier to the StoreAPIs? What do you see on the Store UI in Thanos Querier when you hit the case of the Querier not responding?


kedare commented Jan 3, 2019

Here is the related Salt state used to deploy query and store:

Make sure Thanos Query is running in Docker:
  docker_container.running:
    - name: thanos-query
    - image: improbable/thanos:{{ pillar["thanos"]["version"] }}
    - user: 4002
    - restart_policy: always
    - network_mode: host
    - command: >
        query
        --cluster.peers store.thanos.internal.xxx.net:10903
        --query.replica-label replica
        --cluster.address 0.0.0.0:10906
        --cluster.advertise-address {{ grains["ip4_interfaces"]["eth0"][0] }}:10906
        --grpc-address 0.0.0.0:10907
        --grpc-advertise-address {{ grains["ip4_interfaces"]["eth0"][0] }}:10907
        --http-address 0.0.0.0:10908
        --http-advertise-address {{ grains["ip4_interfaces"]["eth0"][0] }}:10908
    - binds:
      - "/var/data/thanos:/var/data/thanos:Z"
      - "/etc/monitoring/prometheus:/etc/prometheus:Z"
    - environment:
      - GOOGLE_APPLICATION_CREDENTIALS: /etc/prometheus/gcloud.json

Make sure Thanos Store is running in Docker:
  docker_container.running:
    - name: thanos-store
    - image: improbable/thanos:{{ pillar["thanos"]["version"] }}
    - user: 4002
    - restart_policy: always
    - network_mode: host
    - command: >
        store
        --data-dir /var/data/thanos/store
        --cluster.peers query.thanos.internal.xxx.net:10900
        --objstore.config-file /etc/thanos/bucket.yml
        --cluster.address 0.0.0.0:10903
        --cluster.advertise-address {{ grains["ip4_interfaces"]["eth0"][0] }}:10903
        --grpc-address 0.0.0.0:10904
        --grpc-advertise-address {{ grains["ip4_interfaces"]["eth0"][0] }}:10904
        --http-address 0.0.0.0:10905
    - binds:
      - "/var/data/thanos:/var/data/thanos:Z"
      - "/etc/monitoring/prometheus:/etc/prometheus:Z"
      - "/etc/monitoring/thanos:/etc/thanos:Z"
    - environment:
      - GOOGLE_APPLICATION_CREDENTIALS: /etc/prometheus/gcloud.json

## Then on remote nodes containing Prometheus instances
Make sure Thanos Sidecar is running in Docker:
  docker_container.running:
    - name: thanos-sidecar
    - image: improbable/thanos:{{ pillar["thanos"]["version"] }}
    - user: 4000
    - restart_policy: always
    - network_mode: host
    - command: >
        sidecar
        --prometheus.url http://localhost:9090
        --tsdb.path /prometheus
        --objstore.config-file /etc/thanos/bucket.yml
        --cluster.peers query.thanos.internal.xxx.net:10906
        --cluster.peers store.thanos.internal.xxx.net:10903
        --cluster.address 0.0.0.0:10900
        --cluster.advertise-address {{ grains["ip4_interfaces"]["eth0"][0] }}:10900
        --grpc-address 0.0.0.0:10901
        --grpc-advertise-address {{ grains["ip4_interfaces"]["eth0"][0] }}:10901
        --http-address 0.0.0.0:10902
    - environment:
      - GOOGLE_APPLICATION_CREDENTIALS: /etc/prometheus/gcloud.json
    - binds:
      - "/etc/monitoring/prometheus:/etc/prometheus:Z"
      - "/etc/monitoring/thanos:/etc/thanos:Z"
      - "/var/data/prometheus:/prometheus:Z"

It is happening right now: on the stores panel everything is green and the health checks are still responding correctly, but I can't run even an up query; it hangs forever.

[Screenshot: 2019-01-03 at 17:33:41]

[Screenshot: 2019-01-03 at 17:34:29]

Still no errors in the logs


bwplotka commented Jan 3, 2019

Interesting. No immediate clue - we cannot repro it.

What is really weird in your setup is that you have 5-replica HA for Prometheus. Does that mean all 5 instances scrape the same data? Because that is what the query.replica-label flag in the Thanos Querier is for. It might work despite that just accidentally (because all internal metric labels are unique), but I think you are doing something you did not intend to do.

Nevertheless, what might be happening now is that one of those StoreAPIs hangs for some reason - potentially the store gateway? One hanging leaf can cause Thanos Query to time out after some time.

I would suggest:

  • checking the logs on the leafs (definitely on 10.1.0.2, which is your store gateway)
  • turning off the leafs one by one (it's easier with DNS than with gossip) and trying the Thanos Querier against only one of those leafs at a time to detect which one causes the problem. Again, I think you might be hitting some problem with the store gateway accessing the object store or something? (See the sketch after this list for probing a single leaf directly.)
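
For reference, a minimal Go sketch (an illustration only, not an official tool) that probes a single StoreAPI leaf directly over its gRPC port, bypassing the Querier. It assumes the storepb package from this Thanos version (github.com/improbable-eng/thanos/pkg/store/storepb) and a plaintext gRPC endpoint as in the Salt states above; the address argument is just an example:

package main

import (
	"context"
	"fmt"
	"log"
	"os"
	"time"

	"github.com/improbable-eng/thanos/pkg/store/storepb"
	"google.golang.org/grpc"
)

// Probe a single StoreAPI leaf (sidecar or store gateway) and check
// whether it still answers Info requests, independently of the Querier.
func main() {
	addr := os.Args[1] // e.g. "10.1.0.2:10904" (store gateway gRPC port)

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	conn, err := grpc.DialContext(ctx, addr, grpc.WithInsecure(), grpc.WithBlock())
	if err != nil {
		log.Fatalf("dial %s: %v", addr, err)
	}
	defer conn.Close()

	info, err := storepb.NewStoreClient(conn).Info(ctx, &storepb.InfoRequest{})
	if err != nil {
		log.Fatalf("Info %s: %v", addr, err)
	}
	fmt.Printf("%s labels=%v minTime=%d maxTime=%d\n", addr, info.Labels, info.MinTime, info.MaxTime)
}

A healthy leaf returns well within the timeout; a leaf that hangs here is a good candidate for the one blocking the fan-out.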


kedare commented Jan 4, 2019

Each instance monitors its own system (site or project); they don't share metric data, except that they monitor each other's monitoring stack (each other's Prometheus, Alertmanager, sidecars, etc.).

Should I remove the replica global label from my configuration?

I've checked the store logs but there is nothing.

What I see is that queries like up that need all the leafs hang forever, but targeting a single leaf, like up{replica="central"}, works fine, and it looks like one leaf (bcn) is not working.

When taking a look at this leaf's log, there's nothing failing recently (just the classic "upload new block" messages).

When restarting this leaf (and not thanos-query like before) it works, but I suppose this will happen again in a few hours (it had been restarted before this ticket, as I updated the version from master-2018-11-12-9cec837 to v0.2.1 to try to solve the issue).


bwplotka commented Jan 4, 2019

Yes, they are definitely not replicas then - they are shards (: I would not remove the replica label, as then your external labels won't be unique (they have to be). Just add something that suits your functional sharding, like cluster or shard (:

Not sure if that is the reason for your weird query timeouts, but it's worth a try.

Have you tried turning off some leaf then? When the Querier hangs, it may be worth turning them off one by one.

Also, in case of hanging, it's worth checking the Thanos Querier metrics.
Maybe the concurrent query limit is saturated for some reason?

# HELP prometheus_engine_queries The current number of queries being executed or waiting.
# TYPE prometheus_engine_queries gauge
prometheus_engine_queries 4
# HELP prometheus_engine_queries_concurrent_max The max number of concurrent queries.
# TYPE prometheus_engine_queries_concurrent_max gauge
prometheus_engine_queries_concurrent_max 100

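If it helps, here is a minimal sketch (just an illustration, not part of Thanos) that pulls those two gauges from the Querier's /metrics page, assuming it is exposed on the Querier's HTTP address as usual; the address is an example:

package main

import (
	"bufio"
	"fmt"
	"log"
	"net/http"
	"os"
	"strings"
	"time"
)

// Print the query-concurrency gauges from a Thanos Querier /metrics page,
// e.g. to see whether prometheus_engine_queries is pinned at the
// prometheus_engine_queries_concurrent_max limit.
func main() {
	base := os.Args[1] // e.g. "http://localhost:10908"

	client := &http.Client{Timeout: 10 * time.Second}
	resp, err := client.Get(base + "/metrics")
	if err != nil {
		log.Fatalf("fetch metrics: %v", err)
	}
	defer resp.Body.Close()

	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		// Matches both prometheus_engine_queries and
		// prometheus_engine_queries_concurrent_max sample lines.
		if strings.HasPrefix(line, "prometheus_engine_queries") {
			fmt.Println(line)
		}
	}
	if err := scanner.Err(); err != nil {
		log.Fatalf("read metrics: %v", err)
	}
}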


kedare commented Jan 4, 2019

When I shut down the only one that had the issue (bcn) and wait for it to appear red in the stores panel, queries start to work again; then when I start it again it keeps working (for a few hours, I suppose).

The query limit is 20 everywhere, but I can see that sometimes thanos-query reaches more than 60 concurrent queries; I think this is when I try to open Grafana and it hangs forever because of the issue.

If I change the label (I suppose it needs to be changed everywhere at the same time), won't that affect the ability to read the previous metrics? Any consequence?


bwplotka commented Jan 4, 2019

The query limit is 20 everywhere, but I can see that sometimes thanos-query reaches more than 60 concurrent queries; I think this is when I try to open Grafana and it hangs forever because of the issue.

Doesn't this sound like your issue right here? Can you try changing max-concurrent to 100, for example? By the way, what is running on 10.1.0.2:10904?

Well, the results will have different labels; otherwise, for your case it should be fine (:


kedare commented Jan 4, 2019

10.1.0.2:10904 is the Thanos Store. I will try changing the limit to 100 next week, but I don't think that's it, as this would only affect Grafana and not single queries from the query UI?


kedare commented Jan 7, 2019

I just pushed the change, and I will let you know if it happens again. By the way, maybe it would be interesting to sync the flag to have the same name as on Prometheus? query.max-concurrent vs query.max-concurrency


bwplotka commented Jan 7, 2019

I think it's up to your deployment / config management to sync those. From the Querier's standpoint, we don't know what the StoreAPI is. (:


kedare commented Jan 7, 2019

So far, setting the limit to 100 everywhere (verified via the metric) didn't fix the issue; I just tried right now and I'm getting the same behaviour.

By syncing I meant using the same flag name as Prometheus, not syncing the value :)


bwplotka commented Jan 7, 2019

Ah yes - fully agree. Have you checked resource consumption for the components during the issue?

Also, were you able to repro this issue with just the store?


kedare commented Jan 7, 2019

So far it doesn't seem related to the store: I tried restarting the store and it didn't fix the issue (and Query doesn't reconnect to it until Query is restarted). It looks like it's bound to my bcn sidecar; when restarting that one (or Query) it starts working again. (Query and Store are on the same host, so I don't think it's a network issue either.)

CPU usage is almost 0 when doing the query. No errors in any log, and the network looks OK as the heartbeat goes fine everywhere.


bwplotka commented Jan 7, 2019

Wonder if it's similar to this: https://improbable-eng.slack.com/archives/CA4UWKEEN/p1546867840210100


kedare commented Jan 7, 2019

I could not get the link to work, it always gives me the end of the #thanos channel 😓


bwplotka commented Jan 7, 2019

There is some discussion in #thanos about sidecar slowness. Might be related, might not.

Hm.. in your case, I would try debugging the sidecar in some way: checking its metrics for how many connections/requests it handles and the resources it uses, and in the end debugging the code flow more. (:

I will try to repro something similar on our stack soon as well.


kedare commented Jan 9, 2019

Thanks :)

Let me know if you need any more testing from me


bwplotka commented Feb 6, 2019

Cannot repro, sorry. I literally performed a demo on the latest version and it's super reliable.

Now that I look at it again, I can see you are still using gossip. We want to remove support for it (see https://github.com/improbable-eng/thanos/blob/master/docs/proposals/approved/201809_gossip-removal.md) and it looks like you might be hitting some gossip flakiness. I would really suggest using DNS, as it will make things easier to debug. I suspect some flakiness in gossip is causing the peers to disconnect temporarily.


wbh1 commented May 23, 2019

I think I might've run into the same issue. My Thanos query instance stopped responding seemingly randomly. There are no error logs or any other indicators that I can find that something went wrong. The logs for Thanos Query are simply empty between the time it stopped responding and the time I restarted the container:

level=warn ts=2019-05-22T21:48:00.894943163Z caller=proxy.go:346 err="failed to receive any data from Addr: thanos_sidecar:10901 Labels: [{replica 1 {} [] 0} {team occ {} [] 0}] Mint: 1558461600000 Maxt: 9223372036854775807: context canceled" msg="returning partial response"
level=info ts=2019-05-23T02:25:36.52785599Z caller=main.go:193 msg="caught signal. Exiting." signal=terminated

On the Thanos Query HTTP page, the stores all showed as healthy, and restarting only Thanos Query resolved the issue, so I don't think it's the sidecars or the Thanos Store node.

Thanos Query just stopped responding to requests but still reported itself as healthy. It was still exposing its metrics, so it wasn't frozen. Queries just timed out when run in the web UI (or Grafana).
[Screenshot: 2019-05-22 at 11:12:52 PM]

Version:

thanos, version 0.3.2-master (branch: master, revision: c0bfe35a9c4ae006e0da40d43b5cf34065afc2d6)
  build user:       circleci@65e389e74c7e
  build date:       20190408-14:15:16
  go version:       go1.12.2


wbh1 commented May 23, 2019

After further research, it looks like I also hit the 20 concurrent query limit. It spiked to over 20, then came back down to exactly 20 and stayed there. Is there a reason why queries would be stuck in the queue and not get cleared out?

[Screenshot: 2019-05-23 at 11:10:33 AM]

@povilasv

I've hit this case too with v0.4.0 and have collected all pprof profiles from Thanos Querier.
I will assign this issue to myself and plan to work on this soon.
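
For anyone who wants to collect the same kind of data, here is a minimal sketch that fetches a full goroutine dump, assuming the Querier exposes the standard net/http/pprof endpoints on its HTTP port (the address below is an example, not the exact setup used here):

package main

import (
	"io"
	"log"
	"net/http"
	"os"
	"time"
)

// Fetch a full goroutine dump (debug=2 includes the "N minutes" blocked
// durations visible in the traces below) from a Thanos Querier HTTP port
// and write it to stdout.
func main() {
	base := os.Args[1] // e.g. "http://localhost:10908"
	url := base + "/debug/pprof/goroutine?debug=2"

	client := &http.Client{Timeout: 30 * time.Second}
	resp, err := client.Get(url)
	if err != nil {
		log.Fatalf("fetch %s: %v", url, err)
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		log.Fatalf("unexpected status %s from %s", resp.Status, url)
	}
	if _, err := io.Copy(os.Stdout, resp.Body); err != nil {
		log.Fatalf("read dump: %v", err)
	}
}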

@povilasv

OK, it turns out I was running image improbable/thanos:master-2019-03-26-7465db9, but the same issue should be happening in all newer versions.

@povilasv

Digging through the goroutine dump, I found that many goroutines wait forever on the wg.Wait() call:

goroutine 28662879 [semacquire, 2958 minutes]:
sync.runtime_Semacquire(0xc0140a4344)
	/usr/local/go/src/runtime/sema.go:56 +0x39
sync.(*WaitGroup).Wait(0xc0140a4344)
	/usr/local/go/src/sync/waitgroup.go:130 +0x65
github.com/improbable-eng/thanos/pkg/store.(*ProxyStore).Series.func1.1(0xc0140a4344, 0xc000736ae0)
	/go/src/github.com/improbable-eng/thanos/pkg/store/proxy.go:144 +0x2b
github.com/improbable-eng/thanos/pkg/store.(*ProxyStore).Series.func1(0x13cdd60, 0xc0003338e0)
	/go/src/github.com/improbable-eng/thanos/pkg/store/proxy.go:198 +0x74a
golang.org/x/sync/errgroup.(*Group).Go.func1(0xc000f0ddd0, 0xc004a030e0)
	/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:57 +0x57
created by golang.org/x/sync/errgroup.(*Group).Go
	/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:54 +0x66

goroutine 159624 [semacquire, 3880 minutes]:
sync.runtime_Semacquire(0xc001b952a8)
	/usr/local/go/src/runtime/sema.go:56 +0x39
sync.(*WaitGroup).Wait(0xc001b952a0)
	/usr/local/go/src/sync/waitgroup.go:130 +0x65
github.com/improbable-eng/thanos/pkg/store.(*ProxyStore).Series.func1.1(0xc001b952a0, 0xc000e66790)
	/go/src/github.com/improbable-eng/thanos/pkg/store/proxy.go:144 +0x2b
github.com/improbable-eng/thanos/pkg/store.(*ProxyStore).Series.func1(0x13cdd60, 0xc000dd9c20)
	/go/src/github.com/improbable-eng/thanos/pkg/store/proxy.go:171 +0x13fb
golang.org/x/sync/errgroup.(*Group).Go.func1(0xc0029c5980, 0xc000c854f0)
	/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:57 +0x57
created by golang.org/x/sync/errgroup.(*Group).Go
	/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:54 +0x66

goroutine 28663415 [semacquire, 2958 minutes]:
sync.runtime_Semacquire(0xc002c4f3f8)
	/usr/local/go/src/runtime/sema.go:56 +0x39
sync.(*WaitGroup).Wait(0xc002c4f3f0)
	/usr/local/go/src/sync/waitgroup.go:130 +0x65
github.com/improbable-eng/thanos/pkg/store.(*ProxyStore).Series.func1.1(0xc002c4f3f0, 0xc00239ac00)
	/go/src/github.com/improbable-eng/thanos/pkg/store/proxy.go:144 +0x2b
github.com/improbable-eng/thanos/pkg/store.(*ProxyStore).Series.func1(0x13cdd60, 0xc0055b9320)
	/go/src/github.com/improbable-eng/thanos/pkg/store/proxy.go:198 +0x74a
golang.org/x/sync/errgroup.(*Group).Go.func1(0xc000969860, 0xc001a46410)
	/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:57 +0x57
created by golang.org/x/sync/errgroup.(*Group).Go
	/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:54 +0x66

goroutine 159718 [semacquire, 3880 minutes]:
sync.runtime_Semacquire(0xc001e2ddf8)
	/usr/local/go/src/runtime/sema.go:56 +0x39
sync.(*WaitGroup).Wait(0xc001e2ddf0)
	/usr/local/go/src/sync/waitgroup.go:130 +0x65
github.com/improbable-eng/thanos/pkg/store.(*ProxyStore).Series.func1.1(0xc001e2ddf0, 0xc000f18040)
	/go/src/github.com/improbable-eng/thanos/pkg/store/proxy.go:144 +0x2b
github.com/improbable-eng/thanos/pkg/store.(*ProxyStore).Series.func1(0x13cdd60, 0xc000b4e800)
	/go/src/github.com/improbable-eng/thanos/pkg/store/proxy.go:171 +0x13fb
golang.org/x/sync/errgroup.(*Group).Go.func1(0xc000b741b0, 0xc002655590)
	/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:57 +0x57
created by golang.org/x/sync/errgroup.(*Group).Go
	/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:54 +0x66



povilasv commented May 31, 2019

Another good question is what the Thanos Store API call goroutines were doing, because they haven't returned data in 3880 min (~64.6 h).


povilasv commented May 31, 2019

Other super long-running goroutines:

goroutine 28662020 [IO wait, 2958 minutes]:
internal/poll.runtime_pollWait(0x7fa6712c03a8, 0x72, 0xffffffffffffffff)
	/usr/local/go/src/runtime/netpoll.go:182 +0x56
internal/poll.(*pollDesc).wait(0xc000738898, 0x72, 0x0, 0x1, 0xffffffffffffffff)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:87 +0x9b
internal/poll.(*pollDesc).waitRead(...)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:92
internal/poll.(*FD).Read(0xc000738880, 0xc000be8131, 0x1, 0x1, 0x0, 0x0, 0x0)
	/usr/local/go/src/internal/poll/fd_unix.go:169 +0x19b
net.(*netFD).Read(0xc000738880, 0xc000be8131, 0x1, 0x1, 0xc00051b7d8, 0xc00462b768, 0x4715cd)
	/usr/local/go/src/net/fd_unix.go:202 +0x4f
net.(*conn).Read(0xc00bdaa8d8, 0xc000be8131, 0x1, 0x1, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/net.go:177 +0x69
net/http.(*connReader).backgroundRead(0xc000be8120)
	/usr/local/go/src/net/http/server.go:677 +0x58
created by net/http.(*connReader).startBackgroundRead
	/usr/local/go/src/net/http/server.go:673 +0xca

goroutine 28560291 [chan receive, 2958 minutes]:
github.com/improbable-eng/thanos/pkg/store.(*ProxyStore).Series(0xc000340960, 0xc0002e38f0, 0x13fcea0, 0xc00093f7c0, 0x1, 0x5)
	/go/src/github.com/improbable-eng/thanos/pkg/store/proxy.go:201 +0x28c
github.com/improbable-eng/thanos/pkg/query.(*querier).Select(0xc000846600, 0xc002b15c50, 0xc00bcb8180, 0x3, 0x4, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/improbable-eng/thanos/pkg/query/querier.go:184 +0x29e
github.com/prometheus/prometheus/promql.(*Engine).populateSeries.func2(0x13cdec0, 0xc015b06540, 0xc0012f6b00, 0x3, 0x4, 0xc000846600, 0xc0003b98c0)
	/go/pkg/mod/github.com/prometheus/[email protected]/promql/engine.go:585 +0x884
github.com/prometheus/prometheus/promql.inspector.Visit(0xc0012f6ac0, 0x13cdec0, 0xc015b06540, 0xc0012f6b00, 0x3, 0x4, 0x10a23e0, 0x13cdec0, 0xc00bcb8340, 0x20)
	/go/pkg/mod/github.com/prometheus/[email protected]/promql/ast.go:309 +0x5c
github.com/prometheus/prometheus/promql.Walk(0x13cfee0, 0xc0012f6ac0, 0x13cdec0, 0xc015b06540, 0xc0012f6b00, 0x3, 0x4, 0x4, 0x0)
	/go/pkg/mod/github.com/prometheus/[email protected]/promql/ast.go:245 +0x8a
github.com/prometheus/prometheus/promql.Walk(0x13cfee0, 0xc0012f6ac0, 0x13cfec0, 0xc00bcb8360, 0xc0012f6b00, 0x3, 0x4, 0x2, 0x0)
	/go/pkg/mod/github.com/prometheus/[email protected]/promql/ast.go:258 +0x2bf
github.com/prometheus/prometheus/promql.Walk(0x13cfee0, 0xc0012f6ac0, 0x13cde80, 0xc00bcb81a0, 0xc00bcb8340, 0x2, 0x2, 0x1, 0x0)
	/go/pkg/mod/github.com/prometheus/[email protected]/promql/ast.go:276 +0xa9c
github.com/prometheus/prometheus/promql.Walk(0x13cfee0, 0xc0012f6ac0, 0x13cde60, 0xc0012f6980, 0xc000f18700, 0x1, 0x1, 0x0, 0x0)
	/go/pkg/mod/github.com/prometheus/[email protected]/promql/ast.go:268 +0x7e4
github.com/prometheus/prometheus/promql.Inspect(...)
	/go/pkg/mod/github.com/prometheus/[email protected]/promql/ast.go:320
github.com/prometheus/prometheus/promql.(*Engine).populateSeries(0xc0003b98c0, 0x13ebfc0, 0xc002b15b30, 0x13cdb20, 0xc00093f540, 0xc00093f5e0, 0x1, 0xc002b15b60, 0x13ebfc0, 0xc002b15b30, ...)
	/go/pkg/mod/github.com/prometheus/[email protected]/promql/engine.go:547 +0x3dc
github.com/prometheus/prometheus/promql.(*Engine).execEvalStmt(0xc0003b98c0, 0x13ebfc0, 0xc002b15ad0, 0xc015b065a0, 0xc00093f5e0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/pkg/mod/github.com/prometheus/[email protected]/promql/engine.go:409 +0x171
github.com/prometheus/prometheus/promql.(*Engine).exec(0xc0003b98c0, 0x13ebfc0, 0xc002b15ad0, 0xc015b065a0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/pkg/mod/github.com/prometheus/[email protected]/promql/engine.go:390 +0x567
github.com/prometheus/prometheus/promql.(*query).Exec(0xc015b065a0, 0x13ebfc0, 0xc002b15830, 0xc001749b20)
	/go/pkg/mod/github.com/prometheus/[email protected]/promql/engine.go:178 +0x94
github.com/improbable-eng/thanos/pkg/query/api.(*API).query(0xc0003402d0, 0xc00086fa00, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
	/go/src/github.com/improbable-eng/thanos/pkg/query/api/v1.go:298 +0x4d9
github.com/improbable-eng/thanos/pkg/query/api.(*API).Register.func1.1(0x13e7b00, 0xc015b063c0, 0xc00086fa00)
	/go/src/github.com/improbable-eng/thanos/pkg/query/api/v1.go:157 +0x55
net/http.HandlerFunc.ServeHTTP(0xc0004d4320, 0x13e7b00, 0xc015b063c0, 0xc00086fa00)
	/usr/local/go/src/net/http/server.go:1995 +0x44
github.com/NYTimes/gziphandler.GzipHandlerWithOpts.func1.1(0x13e63c0, 0xc0005222f8, 0xc00086fa00)
	/go/pkg/mod/github.com/!n!y!times/[email protected]/gzip.go:287 +0x210
net/http.HandlerFunc.ServeHTTP(0xc0000a5590, 0x13e63c0, 0xc0005222f8, 0xc00086fa00)
	/usr/local/go/src/net/http/server.go:1995 +0x44
github.com/improbable-eng/thanos/pkg/tracing.HTTPMiddleware.func1(0x13e63c0, 0xc0005222f8, 0xc00086f900)
	/go/src/github.com/improbable-eng/thanos/pkg/tracing/http.go:37 +0x511
net/http.HandlerFunc.ServeHTTP(0xc000340780, 0x13e63c0, 0xc0005222f8, 0xc00086f900)
	/usr/local/go/src/net/http/server.go:1995 +0x44
github.com/prometheus/client_golang/prometheus.InstrumentHandlerFuncWithOpts.func1(0x13e7980, 0xc00130c0e0, 0xc00086f900)
	/go/pkg/mod/github.com/prometheus/[email protected]/prometheus/http.go:260 +0x232
github.com/prometheus/common/route.(*Router).handle.func1(0x13e7980, 0xc00130c0e0, 0xc00086f800, 0x0, 0x0, 0x0)
	/go/pkg/mod/github.com/prometheus/[email protected]/route/route.go:60 +0x26e
github.com/julienschmidt/httprouter.(*Router).ServeHTTP(0xc0003b9950, 0x13e7980, 0xc00130c0e0, 0xc00086f800)
	/go/pkg/mod/github.com/julienschmidt/[email protected]/router.go:299 +0x6d1
github.com/prometheus/common/route.(*Router).ServeHTTP(0xc0003d9100, 0x13e7980, 0xc00130c0e0, 0xc00086f800)
	/go/pkg/mod/github.com/prometheus/[email protected]/route/route.go:98 +0x4c
net/http.(*ServeMux).ServeHTTP(0xc000152940, 0x13e7980, 0xc00130c0e0, 0xc00086f800)
	/usr/local/go/src/net/http/server.go:2375 +0x1d6
net/http.serverHandler.ServeHTTP(0xc00013e000, 0x13e7980, 0xc00130c0e0, 0xc00086f800)
	/usr/local/go/src/net/http/server.go:2774 +0xa8
net/http.(*conn).serve(0xc00b854d20, 0x13ebf00, 0xc001a62a80)
	/usr/local/go/src/net/http/server.go:1878 +0x851
created by net/http.(*Server).Serve
	/usr/local/go/src/net/http/server.go:2884 +0x2f4

goroutine 68762076 [chan receive, 1164 minutes]:
github.com/improbable-eng/thanos/pkg/store.(*ProxyStore).Series(0xc000340960, 0xc002208620, 0x13fcea0, 0xc011ad22d0, 0x2, 0x0)
	/go/src/github.com/improbable-eng/thanos/pkg/store/proxy.go:201 +0x28c
github.com/improbable-eng/thanos/pkg/query.(*querier).Select(0xc00b16e900, 0xc000da24e0, 0xc00051c6e0, 0x2, 0x2, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/improbable-eng/thanos/pkg/query/querier.go:184 +0x29e
github.com/prometheus/prometheus/promql.(*Engine).populateSeries.func2(0x13cdec0, 0xc000ac5aa0, 0xc000160fc0, 0x3, 0x4, 0xc00b16e900, 0xc0003b98c0)
	/go/pkg/mod/github.com/prometheus/[email protected]/promql/engine.go:585 +0x884
github.com/prometheus/prometheus/promql.inspector.Visit(0xc000160f80, 0x13cdec0, 0xc000ac5aa0, 0xc000160fc0, 0x3, 0x4, 0x10a23e0, 0x13cdec0, 0xc0016f0d60, 0x20)
	/go/pkg/mod/github.com/prometheus/[email protected]/promql/ast.go:309 +0x5c
github.com/prometheus/prometheus/promql.Walk(0x13cfee0, 0xc000160f80, 0x13cdec0, 0xc000ac5aa0, 0xc000160fc0, 0x3, 0x4, 0x4, 0x0)
	/go/pkg/mod/github.com/prometheus/[email protected]/promql/ast.go:245 +0x8a
github.com/prometheus/prometheus/promql.Walk(0x13cfee0, 0xc000160f80, 0x13cfec0, 0xc0016f0d80, 0xc000160fc0, 0x3, 0x4, 0x2, 0x0)
	/go/pkg/mod/github.com/prometheus/[email protected]/promql/ast.go:258 +0x2bf
github.com/prometheus/prometheus/promql.Walk(0x13cfee0, 0xc000160f80, 0x13cde80, 0xc0016f0bc0, 0xc0016f0d60, 0x2, 0x2, 0x1, 0x0)
	/go/pkg/mod/github.com/prometheus/[email protected]/promql/ast.go:276 +0xa9c
github.com/prometheus/prometheus/promql.Walk(0x13cfee0, 0xc000160f80, 0x13cde40, 0xc011ad20a0, 0xc00051c950, 0x1, 0x1, 0x0, 0x0)
	/go/pkg/mod/github.com/prometheus/[email protected]/promql/ast.go:263 +0x5ac
github.com/prometheus/prometheus/promql.Inspect(...)
	/go/pkg/mod/github.com/prometheus/[email protected]/promql/ast.go:320
github.com/prometheus/prometheus/promql.(*Engine).populateSeries(0xc0003b98c0, 0x13ebfc0, 0xc000da23c0, 0x13cdb20, 0xc011ad2000, 0xc011ad20f0, 0x1, 0xc000da23f0, 0x13ebfc0, 0xc000da23c0, ...)
	/go/pkg/mod/github.com/prometheus/[email protected]/promql/engine.go:547 +0x3dc
github.com/prometheus/prometheus/promql.(*Engine).execEvalStmt(0xc0003b98c0, 0x13ebfc0, 0xc000da2360, 0xc000ac5b00, 0xc011ad20f0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/pkg/mod/github.com/prometheus/[email protected]/promql/engine.go:409 +0x171
github.com/prometheus/prometheus/promql.(*Engine).exec(0xc0003b98c0, 0x13ebfc0, 0xc000da2360, 0xc000ac5b00, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/pkg/mod/github.com/prometheus/[email protected]/promql/engine.go:390 +0x567
github.com/prometheus/prometheus/promql.(*query).Exec(0xc000ac5b00, 0x13ebfc0, 0xc0035dfe90, 0xc0008b9da0)
	/go/pkg/mod/github.com/prometheus/[email protected]/promql/engine.go:178 +0x94
github.com/improbable-eng/thanos/pkg/query/api.(*API).queryRange(0xc0003402d0, 0xc011ad0000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
	/go/src/github.com/improbable-eng/thanos/pkg/query/api/v1.go:402 +0xb1e
github.com/improbable-eng/thanos/pkg/query/api.(*API).Register.func1.1(0x13e7b00, 0xc000ac58c0, 0xc011ad0000)
	/go/src/github.com/improbable-eng/thanos/pkg/query/api/v1.go:157 +0x55
net/http.HandlerFunc.ServeHTTP(0xc0004d4790, 0x13e7b00, 0xc000ac58c0, 0xc011ad0000)
	/usr/local/go/src/net/http/server.go:1995 +0x44
github.com/NYTimes/gziphandler.GzipHandlerWithOpts.func1.1(0x13e63c0, 0xc00ef3b650, 0xc011ad0000)
	/go/pkg/mod/github.com/!n!y!times/[email protected]/gzip.go:287 +0x210
net/http.HandlerFunc.ServeHTTP(0xc00013b4a0, 0x13e63c0, 0xc00ef3b650, 0xc011ad0000)
	/usr/local/go/src/net/http/server.go:1995 +0x44
github.com/improbable-eng/thanos/pkg/tracing.HTTPMiddleware.func1(0x13e63c0, 0xc00ef3b650, 0xc00455df00)
	/go/src/github.com/improbable-eng/thanos/pkg/tracing/http.go:37 +0x511
net/http.HandlerFunc.ServeHTTP(0xc000340d70, 0x13e63c0, 0xc00ef3b650, 0xc00455df00)
	/usr/local/go/src/net/http/server.go:1995 +0x44
github.com/prometheus/client_golang/prometheus.InstrumentHandlerFuncWithOpts.func1(0x13e7980, 0xc000a36620, 0xc00455df00)
	/go/pkg/mod/github.com/prometheus/[email protected]/prometheus/http.go:260 +0x232
github.com/prometheus/common/route.(*Router).handle.func1(0x13e7980, 0xc000a36620, 0xc00455de00, 0x0, 0x0, 0x0)
	/go/pkg/mod/github.com/prometheus/[email protected]/route/route.go:60 +0x26e
github.com/julienschmidt/httprouter.(*Router).ServeHTTP(0xc0003b9950, 0x13e7980, 0xc000a36620, 0xc00455de00)
	/go/pkg/mod/github.com/julienschmidt/[email protected]/router.go:299 +0x6d1
github.com/prometheus/common/route.(*Router).ServeHTTP(0xc0003d9100, 0x13e7980, 0xc000a36620, 0xc00455de00)
	/go/pkg/mod/github.com/prometheus/[email protected]/route/route.go:98 +0x4c
net/http.(*ServeMux).ServeHTTP(0xc000152940, 0x13e7980, 0xc000a36620, 0xc00455de00)
	/usr/local/go/src/net/http/server.go:2375 +0x1d6
net/http.serverHandler.ServeHTTP(0xc00013e000, 0x13e7980, 0xc000a36620, 0xc00455de00)
	/usr/local/go/src/net/http/server.go:2774 +0xa8
net/http.(*conn).serve(0xc002552aa0, 0x13ebf00, 0xc000160d00)
	/usr/local/go/src/net/http/server.go:1878 +0x851
created by net/http.(*Server).Serve
	/usr/local/go/src/net/http/server.go:2884 +0x2f4

goroutine 28664185 [IO wait, 2958 minutes]:
internal/poll.runtime_pollWait(0x7fa6712c1028, 0x72, 0xffffffffffffffff)
	/usr/local/go/src/runtime/netpoll.go:182 +0x56
internal/poll.(*pollDesc).wait(0xc000de1718, 0x72, 0x0, 0x1, 0xffffffffffffffff)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:87 +0x9b
internal/poll.(*pollDesc).waitRead(...)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:92
internal/poll.(*FD).Read(0xc000de1700, 0xc001b1fe11, 0x1, 0x1, 0x0, 0x0, 0x0)
	/usr/local/go/src/internal/poll/fd_unix.go:169 +0x19b
net.(*netFD).Read(0xc000de1700, 0xc001b1fe11, 0x1, 0x1, 0xc0003318d8, 0xc000685f68, 0x4715cd)
	/usr/local/go/src/net/fd_unix.go:202 +0x4f
net.(*conn).Read(0xc00014cb90, 0xc001b1fe11, 0x1, 0x1, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/net.go:177 +0x69
net/http.(*connReader).backgroundRead(0xc001b1fe00)
	/usr/local/go/src/net/http/server.go:677 +0x58
created by net/http.(*connReader).startBackgroundRead
	/usr/local/go/src/net/http/server.go:673 +0xca

goroutine 68762150 [chan receive, 1164 minutes]:
github.com/improbable-eng/thanos/pkg/store.(*ProxyStore).Series(0xc000340960, 0xc000358cb0, 0x13fcea0, 0xc0028eb270, 0x1, 0x2)
	/go/src/github.com/improbable-eng/thanos/pkg/store/proxy.go:201 +0x28c
github.com/improbable-eng/thanos/pkg/query.(*querier).Select(0xc000641780, 0xc000f0c990, 0xc0028f0390, 0x2, 0x2, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/improbable-eng/thanos/pkg/query/querier.go:184 +0x29e
github.com/prometheus/prometheus/promql.(*Engine).populateSeries.func2(0x13cdf80, 0xc0028f60c0, 0xc0028e9700, 0x2, 0x2, 0xc000641780, 0xc0003b98c0)
	/go/pkg/mod/github.com/prometheus/[email protected]/promql/engine.go:566 +0x329
github.com/prometheus/prometheus/promql.inspector.Visit(0xc0028f8280, 0x13cdf80, 0xc0028f60c0, 0xc0028e9700, 0x2, 0x2, 0x10a27e0, 0x13cdf80, 0xc0028f07a0, 0x10)
	/go/pkg/mod/github.com/prometheus/[email protected]/promql/ast.go:309 +0x5c
github.com/prometheus/prometheus/promql.Walk(0x13cfee0, 0xc0028f8280, 0x13cdf80, 0xc0028f60c0, 0xc0028e9700, 0x2, 0x2, 0x2, 0x0)
	/go/pkg/mod/github.com/prometheus/[email protected]/promql/ast.go:245 +0x8a
github.com/prometheus/prometheus/promql.Walk(0x13cfee0, 0xc0028f8280, 0x13cde40, 0xc0028eaeb0, 0xc0028e9700, 0x2, 0x2, 0x1, 0x0)
	/go/pkg/mod/github.com/prometheus/[email protected]/promql/ast.go:263 +0x5ac
github.com/prometheus/prometheus/promql.Walk(0x13cfee0, 0xc0028f8280, 0x13cde60, 0xc0028f8180, 0xc0028f07a0, 0x1, 0x1, 0x0, 0x0)
	/go/pkg/mod/github.com/prometheus/[email protected]/promql/ast.go:268 +0x7e4
github.com/prometheus/prometheus/promql.Inspect(...)
	/go/pkg/mod/github.com/prometheus/[email protected]/promql/ast.go:320
github.com/prometheus/prometheus/promql.(*Engine).populateSeries(0xc0003b98c0, 0x13ebfc0, 0xc000f0c8a0, 0x13cdb20, 0xc0028eae10, 0xc0028eb090, 0x1, 0xc000f0c8d0, 0x13ebfc0, 0xc000f0c8a0, ...)
	/go/pkg/mod/github.com/prometheus/[email protected]/promql/engine.go:547 +0x3dc
github.com/prometheus/prometheus/promql.(*Engine).execEvalStmt(0xc0003b98c0, 0x13ebfc0, 0xc000f0c840, 0xc0028f6360, 0xc0028eb090, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/pkg/mod/github.com/prometheus/[email protected]/promql/engine.go:409 +0x171
github.com/prometheus/prometheus/promql.(*Engine).exec(0xc0003b98c0, 0x13ebfc0, 0xc000f0c840, 0xc0028f6360, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/pkg/mod/github.com/prometheus/[email protected]/promql/engine.go:390 +0x567
github.com/prometheus/prometheus/promql.(*query).Exec(0xc0028f6360, 0x13ebfc0, 0xc000f0c2d0, 0xc0028f4180)
	/go/pkg/mod/github.com/prometheus/[email protected]/promql/engine.go:178 +0x94
github.com/improbable-eng/thanos/pkg/query/api.(*API).queryRange(0xc0003402d0, 0xc000728f00, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
	/go/src/github.com/improbable-eng/thanos/pkg/query/api/v1.go:402 +0xb1e
github.com/improbable-eng/thanos/pkg/query/api.(*API).Register.func1.1(0x13e7b00, 0xc0028dbf80, 0xc000728f00)
	/go/src/github.com/improbable-eng/thanos/pkg/query/api/v1.go:157 +0x55
net/http.HandlerFunc.ServeHTTP(0xc0004d4790, 0x13e7b00, 0xc0028dbf80, 0xc000728f00)
	/usr/local/go/src/net/http/server.go:1995 +0x44
github.com/NYTimes/gziphandler.GzipHandlerWithOpts.func1.1(0x13e63c0, 0xc00752ad28, 0xc000728f00)
	/go/pkg/mod/github.com/!n!y!times/[email protected]/gzip.go:287 +0x210
net/http.HandlerFunc.ServeHTTP(0xc00013b4a0, 0x13e63c0, 0xc00752ad28, 0xc000728f00)
	/usr/local/go/src/net/http/server.go:1995 +0x44
github.com/improbable-eng/thanos/pkg/tracing.HTTPMiddleware.func1(0x13e63c0, 0xc00752ad28, 0xc000728c00)
	/go/src/github.com/improbable-eng/thanos/pkg/tracing/http.go:37 +0x511
net/http.HandlerFunc.ServeHTTP(0xc000340d70, 0x13e63c0, 0xc00752ad28, 0xc000728c00)
	/usr/local/go/src/net/http/server.go:1995 +0x44
github.com/prometheus/client_golang/prometheus.InstrumentHandlerFuncWithOpts.func1(0x13e7980, 0xc00130cb60, 0xc000728c00)
	/go/pkg/mod/github.com/prometheus/[email protected]/prometheus/http.go:260 +0x232
github.com/prometheus/common/route.(*Router).handle.func1(0x13e7980, 0xc00130cb60, 0xc011ad0200, 0x0, 0x0, 0x0)
	/go/pkg/mod/github.com/prometheus/[email protected]/route/route.go:60 +0x26e
github.com/julienschmidt/httprouter.(*Router).ServeHTTP(0xc0003b9950, 0x13e7980, 0xc00130cb60, 0xc011ad0200)
	/go/pkg/mod/github.com/julienschmidt/[email protected]/router.go:299 +0x6d1
github.com/prometheus/common/route.(*Router).ServeHTTP(0xc0003d9100, 0x13e7980, 0xc00130cb60, 0xc011ad0200)
	/go/pkg/mod/github.com/prometheus/[email protected]/route/route.go:98 +0x4c
net/http.(*ServeMux).ServeHTTP(0xc000152940, 0x13e7980, 0xc00130cb60, 0xc011ad0200)
	/usr/local/go/src/net/http/server.go:2375 +0x1d6
net/http.serverHandler.ServeHTTP(0xc00013e000, 0x13e7980, 0xc00130cb60, 0xc011ad0200)
	/usr/local/go/src/net/http/server.go:2774 +0xa8
net/http.(*conn).serve(0xc00b854000, 0x13ebf00, 0xc017a66740)
	/usr/local/go/src/net/http/server.go:1878 +0x851
created by net/http.(*Server).Serve
	/usr/local/go/src/net/http/server.go:2884 +0x2f4


@povilasv

OK, this data is a bit too confusing, as I am running an older version which didn't have the fix for Thanos Query.


povilasv commented May 31, 2019

We decided we need to rewrite that part to make it simpler, so this will be a bit of a bigger effort.


wbh1 commented May 31, 2019

Looks like your findings are consistent with what I'm seeing, too. I currently have 3 queries stuck in the queue (based on the prometheus_engine_queries metric), with similar goroutine traces:

PS - thanks for your work on this! Let me know if there's anything I can help with :)

goroutine 7346097 [semacquire, 1282 minutes]:
sync.runtime_Semacquire(0xc000e4e0b8)
	/usr/local/go/src/runtime/sema.go:56 +0x39
sync.(*WaitGroup).Wait(0xc000e4e0b0)
	/usr/local/go/src/sync/waitgroup.go:130 +0x65
github.com/improbable-eng/thanos/pkg/store.(*ProxyStore).Series.func1.1(0xc000e4e0b0, 0xc000cf42a0)
	/go/src/github.com/improbable-eng/thanos/pkg/store/proxy.go:166 +0x2b
github.com/improbable-eng/thanos/pkg/store.(*ProxyStore).Series.func1(0x0, 0x0)
	/go/src/github.com/improbable-eng/thanos/pkg/store/proxy.go:220 +0x74a
golang.org/x/sync/errgroup.(*Group).Go.func1(0xc00072e1e0, 0xc00042a550)
	/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:57 +0x57
created by golang.org/x/sync/errgroup.(*Group).Go
	/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:54 +0x66

goroutine 8193383 [semacquire, 886 minutes]:
sync.runtime_Semacquire(0xc00093e708)
	/usr/local/go/src/runtime/sema.go:56 +0x39
sync.(*WaitGroup).Wait(0xc00093e700)
	/usr/local/go/src/sync/waitgroup.go:130 +0x65
github.com/improbable-eng/thanos/pkg/store.(*ProxyStore).Series.func1.1(0xc00093e700, 0xc0009ad980)
	/go/src/github.com/improbable-eng/thanos/pkg/store/proxy.go:166 +0x2b
github.com/improbable-eng/thanos/pkg/store.(*ProxyStore).Series.func1(0x0, 0x0)
	/go/src/github.com/improbable-eng/thanos/pkg/store/proxy.go:220 +0x74a
golang.org/x/sync/errgroup.(*Group).Go.func1(0xc000805c50, 0xc000842460)
	/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:57 +0x57
created by golang.org/x/sync/errgroup.(*Group).Go
	/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:54 +0x66

goroutine 8193428 [semacquire, 886 minutes]:
sync.runtime_Semacquire(0xc00093ea98)
	/usr/local/go/src/runtime/sema.go:56 +0x39
sync.(*WaitGroup).Wait(0xc00093ea90)
	/usr/local/go/src/sync/waitgroup.go:130 +0x65
github.com/improbable-eng/thanos/pkg/store.(*ProxyStore).Series.func1.1(0xc00093ea90, 0xc000f70310)
	/go/src/github.com/improbable-eng/thanos/pkg/store/proxy.go:166 +0x2b
github.com/improbable-eng/thanos/pkg/store.(*ProxyStore).Series.func1(0x0, 0x0)
	/go/src/github.com/improbable-eng/thanos/pkg/store/proxy.go:220 +0x74a
golang.org/x/sync/errgroup.(*Group).Go.func1(0xc000e8d500, 0xc0008431d0)
	/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:57 +0x57
created by golang.org/x/sync/errgroup.(*Group).Go
	/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:54 +0x66


wbh1 commented May 31, 2019

I may be out of my depth, but is the issue possibly that the buffered channel is full, which blocks in that function and prevents the wg.Done() call from being reached?

I'll test it out on my local installation and report back if it makes any difference.

(Referring to this part of the startStreamSeriesSet function, which is called by this part of the Series function in pkg/store/proxy.go)
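
To make that hypothesis concrete, here is a minimal, self-contained sketch of the suspected pattern - purely illustrative, not the actual Thanos code; the names and buffer size are made up:

package main

import (
	"fmt"
	"sync"
)

// Suspected hang in miniature: the reader gives up early, the writer then
// blocks forever on a full buffered channel before it can reach wg.Done(),
// and wg.Wait() never returns -- the same shape as the goroutines parked
// in ProxyStore.Series in the dumps above.
func main() {
	recv := make(chan int, 2) // small buffer, like the streamed series frames
	var wg sync.WaitGroup

	wg.Add(1)
	go func() {
		defer wg.Done() // never runs once the send below blocks for good
		for i := 0; i < 10; i++ {
			recv <- i // blocks when the buffer is full and nobody reads
		}
	}()

	// The reader stops early (e.g. the client context was cancelled)...
	for i := 0; i < 2; i++ {
		fmt.Println("got", <-recv)
	}

	// ...so wg.Done() is never reached and this blocks forever. In this tiny
	// program the Go runtime reports "all goroutines are asleep - deadlock!";
	// in a real server with other live goroutines it simply hangs.
	wg.Wait()
	fmt.Println("never printed")
}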

Never mind. Looks like this was fixed in #1082. I'll run in prod over the weekend and see if things are resolved.


wbh1 commented Jun 10, 2019

Following up on my previous comment, this is definitely resolved by #1082. No stale goroutines or Prometheus queries have been observed in the 10 days since I started running a newer release of Thanos.

This issue can be closed.

@bwplotka

Thanks to everyone involved! It was quite painful to dig into (:
