Receive: high in-flight requests, context deadline exceeded errors, and high ingestion latency on main branch #7248

Closed
jnyi opened this issue Apr 1, 2024 · 9 comments · Fixed by #7267

jnyi (Contributor) commented Apr 1, 2024

Thanos, Prometheus and Golang version used:
Thanos: 0.35.0-dev
Golang: go1.21.7

[Screenshot attached: 2024-03-31 10:04 PM]

Object Storage Provider: s3

What happened:
After switching from v0.34.1 to v0.35.0-dev we experienced high in-flight requests. We found #7045 and tried a few things:

  1. increased receive.forward.async-workers to a large number; the issue remained
  2. reverted the above PR and retested; the issue still persisted

What you expected to happen:
With async writes, the write latency should improve.

How to reproduce it (as minimally and precisely as possible):

Our receive command (we split receive into router and ingestor modes); below are the args for the router:

      receive
      --debug.name=pantheon-writer
      --log.format=logfmt
      --log.level=debug
      --http-address=0.0.0.0:10902
      --http-grace-period=5m
      --grpc-address=0.0.0.0:10901
      --grpc-grace-period=5m
      --hash-func=SHA256
      --label __replica__="$(NAME)"
      --receive.grpc-compression=none
      --receive.default-tenant-id=unknown
      --receive.tenant-label-name=__tenant__
      --remote-write.address=0.0.0.0:19291
      --receive-forward-max-backoff=5s
      --receive-forward-timeout=5s
      --receive.hashrings-algorithm=ketama
      --receive.hashrings-file=/var/thanos/config/hashring.json
      --receive.hashrings-file-refresh-interval=3m
      --receive.replication-factor=3

Full logs to relevant components:

Full goroutine dump: full_goroutine.txt

Anything else we need to know:

GiedriusS (Member) commented Apr 4, 2024

Not sure I get the report. Are you saying that even after reverting, the latency didn't go back? 🤔 What does thanos_receive_forward_delay_seconds look like? The number of goroutines? Is everything OK with the ingestors (grpc_server_handling_seconds_bucket{grpc_method="RemoteWrite"})?

jnyi (Contributor, Author) commented Apr 8, 2024

Actually I suspect it is specific to our setup because we use a multi-AZ hashring; I will debug more with tracing:

[
    {
        "endpoints": [
            {
                "address": "thanos-receive-rep0-0.thanos-receive-svc:10901",
                "az": "zone-0"
            },
            {
                "address": "thanos-receive-rep0-1.thanos-receive-svc:10901",
                "az": "zone-0"
            },
            {
                "address": "thanos-receive-rep1-0.thanos-receive-svc:10901",
                "az": "zone-1"
            },
            {
                "address": "thanos-receive-rep1-1.thanos-receive-svc:10901",
                "az": "zone-1"
            },
            {
                "address": "thanos-receive-rep2-0.thanos-receive-svc:10901",
                "az": "zone-2"
            },
            {
                "address": "thanos-receive-rep2-1.thanos-receive-svc:10901",
                "az": "zone-2"
            }
        ],
        "hashring": "thanos-receive"
    }
]

jnyi (Contributor, Author) commented Apr 8, 2024

We see large forward delays, but gRPC latency is low:

[Screenshot attached: 2024-04-08 2:49 PM]

yeya24 (Contributor) commented Apr 8, 2024

Is it because there are not enough workers due to #7045 and requests keep queueing?

jnyi (Contributor, Author) commented Apr 8, 2024

I've used 10 × the number of CPU cores as the worker count; maybe that's not enough? I also got some tracing results:

Before, in v0.34:
[Screenshot attached: 2024-04-08 4:27:37 PM]

After, in v0.35:
[Screenshot attached: 2024-04-08 4:27:28 PM]

jnyi (Contributor, Author) commented Apr 8, 2024

I increased the worker count to 3000, and the requests are still sent sequentially in the newer version:

[Screenshot attached: 2024-04-08 4:57:43 PM]
[Screenshot attached: 2024-04-08 4:57:22 PM]

jnyi (Contributor, Author) commented Apr 9, 2024

I think I found the bug: this RemoteWriteAsync operation isn't parallel but sequential, because each call blocks on res := <-w.workResult:

	// Do the writes to remote nodes. Run them all in parallel.
	for writeDestination := range remoteWrites {
		h.sendRemoteWrite(ctx, params.tenant, writeDestination, remoteWrites[writeDestination], params.alreadyReplicated, responses, wg)
	}

func (p *peerWorker) RemoteWriteAsync(ctx context.Context, req *storepb.WriteRequest, er endpointReplica, seriesIDs []int, responseWriter chan writeResponse, cb func(error)) {
	p.initWorkers()

	w := peerWorkItem{
		cc:          p.cc,
		req:         req,
		workResult:  make(chan peerWorkResponse, 1),
		workItemCtx: ctx,
		er:          er,

		sendTime: time.Now(),
	}

	p.work <- w
	res := <-w.workResult

	responseWriter <- newWriteResponse(seriesIDs, res.err, er)
	cb(res.err)
}
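
For illustration, a minimal sketch of one way to avoid the blocking receive, reusing the types from the snippet above (this is not necessarily the change made in #7267): enqueue the work item, then wait for workResult in a goroutine, so the dispatch loop above can keep handing out the remaining work items instead of serializing on each peer.

func (p *peerWorker) RemoteWriteAsync(ctx context.Context, req *storepb.WriteRequest, er endpointReplica, seriesIDs []int, responseWriter chan writeResponse, cb func(error)) {
	p.initWorkers()

	w := peerWorkItem{
		cc:          p.cc,
		req:         req,
		workResult:  make(chan peerWorkResponse, 1),
		workItemCtx: ctx,
		er:          er,

		sendTime: time.Now(),
	}

	// Hand the work item to the worker pool.
	p.work <- w

	// Sketch: collect the result asynchronously so this call returns right away
	// and the caller's loop over remoteWrites is no longer serialized per peer.
	go func() {
		res := <-w.workResult
		responseWriter <- newWriteResponse(seriesIDs, res.err, er)
		cb(res.err)
	}()
}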

jnyi (Contributor, Author) commented Apr 9, 2024

Hi @yeya24 and @GiedriusS, I've submitted a fix; I'd appreciate your review: #7267

douglascamata (Contributor) commented:

@jnyi there's one more conflict to solve in the PR, FYI. You were also pinged there.
