Receive: high in-flight requests, context deadline exceeded errors, and high ingestion latency on main branch #7248

Closed
jnyi opened this issue Apr 1, 2024 · 9 comments · Fixed by #7267

jnyi (Contributor) commented Apr 1, 2024

Thanos, Prometheus and Golang version used:
Thanos: 0.35.0-dev
Golang: go1.21.7

[Screenshot attached: 2024-03-31 10:04 PM]

Object Storage Provider: s3

What happened:
After switching from v0.34.1 to v0.35.0-dev we experienced high in-flight requests. We found #7045 and tried a few things:

  1. increased receive.forward.async-workers to a large number; the issue remained
  2. reverted the above PR and retested; the issue still persisted

What you expected to happen:
With async writes, the write latency should improve.

How to reproduce it (as minimally and precisely as possible):

Our receive command (we split receive into router and ingestor modes); below are the args for the router:

      receive
      --debug.name=pantheon-writer
      --log.format=logfmt
      --log.level=debug
      --http-address=0.0.0.0:10902
      --http-grace-period=5m
      --grpc-address=0.0.0.0:10901
      --grpc-grace-period=5m
      --hash-func=SHA256
      --label __replica__="$(NAME)"
      --receive.grpc-compression=none
      --receive.default-tenant-id=unknown
      --receive.tenant-label-name=__tenant__
      --remote-write.address=0.0.0.0:19291
      --receive-forward-max-backoff=5s
      --receive-forward-timeout=5s
      --receive.hashrings-algorithm=ketama
      --receive.hashrings-file=/var/thanos/config/hashring.json
      --receive.hashrings-file-refresh-interval=3m
      --receive.replication-factor=3

Full logs to relevant components:

Full goroutine dump: full_goroutine.txt

Anything else we need to know:

GiedriusS (Member) commented Apr 4, 2024

Not sure I get the report. Are you saying that even after reverting, the latency didn't go back? 🤔 What does thanos_receive_forward_delay_seconds look like? The number of goroutines? Is everything OK with the ingestors (grpc_server_handling_seconds_bucket{grpc_method="RemoteWrite"})?

jnyi (Contributor, Author) commented Apr 8, 2024

Actually I suspect it is specific to our setup because we use a multi-AZ hashring; I will debug more with tracing:

[
    {
        "endpoints": [
            {
                "address": "thanos-receive-rep0-0.thanos-receive-svc:10901",
                "az": "zone-0"
            },
            {
                "address": "thanos-receive-rep0-1.thanos-receive-svc:10901",
                "az": "zone-0"
            },
            {
                "address": "thanos-receive-rep1-0.thanos-receive-svc:10901",
                "az": "zone-1"
            },
            {
                "address": "thanos-receive-rep1-1.thanos-receive-svc:10901",
                "az": "zone-1"
            },
            {
                "address": "thanos-receive-rep2-0.thanos-receive-svc:10901",
                "az": "zone-2"
            },
            {
                "address": "thanos-receive-rep2-1.thanos-receive-svc:10901",
                "az": "zone-2"
            }
        ],
        "hashring": "thanos-receive"
    }
]

jnyi (Contributor, Author) commented Apr 8, 2024

We see large forward delays, but gRPC latency is low:

[Screenshot attached: 2024-04-08 2:49 PM]

yeya24 (Contributor) commented Apr 8, 2024

Is it because there are not enough workers due to #7045 and requests keep queueing?

jnyi (Contributor, Author) commented Apr 8, 2024

I've used 10 × the number of CPU cores as the worker count; maybe that's not enough? I also got some tracing results:

Before, in v0.34:
[Screenshot attached: 2024-04-08 4:27:37 PM]

After, in v0.35:
[Screenshot attached: 2024-04-08 4:27:28 PM]

jnyi (Contributor, Author) commented Apr 8, 2024

I increased the worker count to 3000, and the requests are still sent sequentially in the newer version:

[Screenshot attached: 2024-04-08 4:57:43 PM]
[Screenshot attached: 2024-04-08 4:57:22 PM]

jnyi (Contributor, Author) commented Apr 9, 2024

I think I found the bug: this RemoteWriteAsync operation isn't parallel but sequential, because each call blocks on res := <-w.workResult:

	// Do the writes to remote nodes. Run them all in parallel.
	for writeDestination := range remoteWrites {
		h.sendRemoteWrite(ctx, params.tenant, writeDestination, remoteWrites[writeDestination], params.alreadyReplicated, responses, wg)
	}

func (p *peerWorker) RemoteWriteAsync(ctx context.Context, req *storepb.WriteRequest, er endpointReplica, seriesIDs []int, responseWriter chan writeResponse, cb func(error)) {
	p.initWorkers()

	w := peerWorkItem{
		cc:          p.cc,
		req:         req,
		workResult:  make(chan peerWorkResponse, 1),
		workItemCtx: ctx,
		er:          er,

		sendTime: time.Now(),
	}

	p.work <- w
	res := <-w.workResult

	responseWriter <- newWriteResponse(seriesIDs, res.err, er)
	cb(res.err)
}
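
For illustration, a minimal sketch of one way to avoid the blocking receive, reusing the types from the snippet above (this is not necessarily the change made in #7267): enqueue the work item, then wait for workResult in a goroutine, so the dispatch loop above can keep handing out the remaining work items instead of serializing on each peer.

func (p *peerWorker) RemoteWriteAsync(ctx context.Context, req *storepb.WriteRequest, er endpointReplica, seriesIDs []int, responseWriter chan writeResponse, cb func(error)) {
	p.initWorkers()

	w := peerWorkItem{
		cc:          p.cc,
		req:         req,
		workResult:  make(chan peerWorkResponse, 1),
		workItemCtx: ctx,
		er:          er,

		sendTime: time.Now(),
	}

	// Hand the work item to the worker pool.
	p.work <- w

	// Sketch: collect the result asynchronously so this call returns right away
	// and the caller's loop over remoteWrites is no longer serialized per peer.
	go func() {
		res := <-w.workResult
		responseWriter <- newWriteResponse(seriesIDs, res.err, er)
		cb(res.err)
	}()
}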

jnyi (Contributor, Author) commented Apr 9, 2024

Hi @yeya24 and @GiedriusS, I've submitted a fix; I'd appreciate your review: #7267

douglascamata (Contributor) commented:

@jnyi there's one more conflict to solve in the PR, FYI. You were also pinged there.
