receive: fanoutForward leads to holes on graphs #2841
Comments
@sepich Thanks for reporting this. It looks interesting. Normally, when receivers reach quorum with a replication factor of 3, that means at least 2 of the writes succeeded (which seems to be the case in the table you shared). Queriers should fan out the query to retrieve data from all receivers simultaneously, so even if only a single receiver has the result you should see something on the graphs. Could you please share your configuration for receivers and queriers?
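For reference, a quick sketch of the quorum arithmetic mentioned above, assuming the usual majority-quorum formula (an illustration, not a quote of the Thanos source):

```go
package main

import "fmt"

func main() {
	// Majority quorum: the minimum number of successful replica writes
	// needed before a remote-write request can be acknowledged.
	for _, rf := range []int{1, 3, 5} {
		quorum := rf/2 + 1
		fmt.Printf("replication-factor=%d -> quorum=%d\n", rf, quorum)
	}
}
```

With `--receive.replication-factor=3` a write is acknowledged once 2 replicas confirm it, which is why a single failed forward per request does not by itself violate quorum.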
Thanks for this. This is overall expected for a single receive instance. The key part is to have the Querier connected, with deduplication enabled, to all available receive pods :) Can you double-check that this is the case? @sepich
receivers args:

```yaml
- receive
- |
  --objstore.config=type: GCS
  config:
    bucket: bucket
- --tsdb.path=/data
- --tsdb.retention=1d
- --tsdb.min-block-duration=30m # has to be >2x the Prometheus side (15m), https://github.com/thanos-io/thanos/issues/2114
- --tsdb.max-block-duration=30m
- --label=replica="$(NAME)"
- --receive.local-endpoint=$(NAME).thanos-receive.monitoring.svc.cluster.local:10901
- --receive.hashrings-file=/cfg/hashrings.json
- --receive.replication-factor=3
```

hashrings.json:

```json
[
  {
    "hashring": "soft-tenants",
    "endpoints": [
      "thanos-receive-0.thanos-receive.monitoring.svc.cluster.local:10901",
      "thanos-receive-1.thanos-receive.monitoring.svc.cluster.local:10901",
      "thanos-receive-2.thanos-receive.monitoring.svc.cluster.local:10901"
    ]
  }
]
```

query args:

```yaml
- query
- --query.auto-downsampling
- --query.partial-response
- --query.replica-label=replica
- --query.replica-label=tenant_id
- --store=dns+prometheus:10901
- --store=dns+thanos-store:10901
- --store=dns+thanos-receive:10901
- --store=dns+thanos-rule:10901
```

The name `thanos-receive` is a ClusterIP=None type Kubernetes Service, so it expands to the 3 IPs of the receive pods.
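As an illustration of why the headless Service matters here: resolving its name returns one A record per ready pod, which is what the `dns+` store prefix relies on to discover all three receivers. A minimal sketch using the service name from the config above (plain stdlib DNS resolution, not the Thanos SD code; it only works from inside the cluster):

```go
package main

import (
	"fmt"
	"net"
)

func main() {
	// A ClusterIP=None (headless) Service has no virtual IP; its DNS name
	// resolves to the IPs of all backing pods, so --store=dns+thanos-receive:10901
	// expands to every receive replica.
	ips, err := net.LookupIP("thanos-receive.monitoring.svc.cluster.local")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	for _, ip := range ips {
		fmt.Println(ip.String())
	}
}
```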
I've created a small docker-compose setup so that you hopefully could reproduce the issue: 3x … If your PC is too powerful, please add more … Then log in to … Now you can change … Yes, the issue looks pretty minor here, but here is a graph from Production, which has more load. At 22:45 I've changed …
@sepich I'm glad that it helped. Also, awesome explanation with the detailed graphs; I really appreciated it.
Thanos, Prometheus and Golang version used:
thanos, version 0.14.0-rc.0 (branch: HEAD, revision: 00e3c32)
build user: circleci@7a93ec389107
build date: 20200702-19:09:41
go version: go1.14.2
Object Storage Provider:
GCP
What happened:
We're using thanos `receive` in 2-tier mode (the 1st tier receives requests and forwards them to the 2nd tier, which actually stores the data). The 2nd tier is 3 replicas with `--receive.replication-factor=3`.
When switching `query`, `store` and `receive` from v0.12.2 to v0.14.0-rc.0 we started to get holes on graphs. The scrape interval is 15s, and holes appear on any `rate(metric[30s])` data. This only affects recent (1h) data. As `compactor` merges the blocks from the 3 `receive` replicas, the holes disappear (so the -2h..-1h data is fine at the same zoom). I believe #2621 is to blame.
The idea is great! But the implementation seems to have an issue. I'm talking about normal operation, when all 3 replicas are available. I do see that `thanos_receive_forward_requests_total{result="error"}` is about 1/3 of all requests, and if I enable debug logs I see `request failed, but not needed to achieve quorum` events. Setting `--receive-forward-timeout=60s` changes nothing.
The current `fanoutForward` implementation evidently leads to random holes in the data on each pod, and then holes appear on the graphs. Compactor is able to handle this, but it seems the querier does not.
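To make the suspected mechanism concrete, here is a minimal sketch in the spirit of the report, not the actual `fanoutForward` code; `replicate` and the endpoint names are hypothetical stand-ins. Because the fan-out cancels the remaining forwards as soon as quorum is reached, the slowest replica's write is aborted, producing exactly the kind of per-pod gap described above:

```go
package main

import (
	"context"
	"fmt"
	"math/rand"
	"time"
)

// replicate is a stand-in for forwarding a remote-write request to one
// receive endpoint; it simply sleeps for a random amount of time.
func replicate(ctx context.Context, endpoint string) error {
	select {
	case <-time.After(time.Duration(rand.Intn(100)) * time.Millisecond):
		return nil
	case <-ctx.Done():
		// Cancelled before this replica stored the samples.
		return ctx.Err()
	}
}

func main() {
	endpoints := []string{"receive-0", "receive-1", "receive-2"}
	quorum := len(endpoints)/2 + 1 // 2 of 3

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	errs := make(chan error, len(endpoints))
	for _, ep := range endpoints {
		go func(ep string) { errs <- replicate(ctx, ep) }(ep)
	}

	success := 0
	for i := 0; i < len(endpoints); i++ {
		if <-errs == nil {
			success++
		}
		if success == quorum {
			// Acknowledge the client and cancel the remaining forward: the
			// slowest replica never finishes its write, leaving a local gap
			// that only later compaction of the other replicas hides.
			cancel()
			fmt.Println("quorum reached, remaining forwards cancelled")
			return
		}
	}
}
```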
What you expected to happen:
I really like the idea of "try to write data to all replicas till the timeout happens", as it could help in such an issue: … `context canceled`:

```
level=error ts=2020-07-03T20:53:53.881199029Z caller=handler.go:414 component=receive component=receive-handler msg="forwarding request" err="rpc error: code = Canceled desc = context canceled" endpoint=thanos-receive-0.thanos-receive.monitoring.svc.cluster.local:10901
```

… whereas a `no such host` error arrives much faster and the receiver starts to process data.
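Purely as an illustration of that expected behaviour (my own sketch, not Thanos code): acknowledge the client once quorum succeeds, but let the remaining forward keep running against its own deadline (in the spirit of `--receive-forward-timeout`) instead of cancelling it, so every replica eventually holds the data:

```go
package main

import (
	"context"
	"fmt"
	"math/rand"
	"time"
)

// forward is a hypothetical stand-in for sending a remote-write request
// to one receive endpoint.
func forward(ctx context.Context, endpoint string) error {
	select {
	case <-time.After(time.Duration(rand.Intn(100)) * time.Millisecond):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	endpoints := []string{"receive-0", "receive-1", "receive-2"}
	quorum := len(endpoints)/2 + 1

	// One deadline for the whole fan-out, deliberately NOT cancelled the
	// moment quorum is reached.
	ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
	defer cancel()

	results := make(chan error, len(endpoints))
	for _, ep := range endpoints {
		go func(ep string) { results <- forward(ctx, ep) }(ep)
	}

	success, acked := 0, false
	for done := 0; done < len(endpoints); done++ {
		if <-results == nil {
			success++
		}
		if !acked && success >= quorum {
			acked = true
			// The client write could already be acknowledged here...
			fmt.Println("quorum reached: client can be acked")
			// ...but we keep waiting so the slowest replica still gets its copy.
		}
	}
	fmt.Printf("%d/%d replicas written\n", success, len(endpoints))
}
```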