CI Failure (client times out after error sending) in ManyClientsTest.test_many_clients
#10092
Comments
There is a similar backtrace here: #8979 (comment), but the logs are gone. It's quite possible that this is an error case that has existed for a long time but is quite rare.
The "Producer {} waiting" message is printed on every message, so we can track the progress of a producer and see where it pauses for a long time. Producer 2089 wasn't the only one hitting big delays. Here's another producer hitting a 5 minute delay:
This could be symptomatic of the combination of server-side rate limiting being unfair and a large ceiling on the client-side backoff delay: if a client gets unlucky a few times, it can end up sitting in backoff for minutes.

The total number of messages sent in the 10 minute timeout period was 3968454, and the message size is a linear distribution between 0 and 16384 bytes, which works out to ~55MiB/s. The test configures a 30MiB/s target_quota_byte_rate (per shard) and a 100MiB/s kafka_throughput_limit_node_in_bps (per node). The effective limit per node is therefore 60MiB/s, as each node has only 2 shards.

There are tens of producers that do not complete, but oddly, in the final few seconds of the test it is only 2089 that we see progressing. In the last 2 seconds of the test 254 messages are emitted for producer 2089, which is only ~1MiB/s. But it's possible that in this part of the test we just have a bunch of other producers stuck in their client backoff sleep.
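For reference, here is the back-of-envelope behind those rates, as plain arithmetic on the figures quoted above; the exact aggregate number depends on the effective send window and on MiB vs MB units.

```python
# Plain arithmetic using the figures quoted above; the window length and the
# MiB-vs-MB distinction account for small differences from the ~55MiB/s figure.
MiB = 2**20

messages_sent = 3968454        # total messages in the 10 minute timeout period
mean_msg_size = 16384 / 2      # linear distribution between 0 and 16384 bytes
window_s = 10 * 60

aggregate = messages_sent * mean_msg_size / window_s
print(f"aggregate rate: {aggregate / MiB:.1f} MiB/s")   # ~51.7 MiB/s (~54 MB/s)

# Per-node ceiling implied by the quotas: 2 shards * 30MiB/s target_quota_byte_rate,
# below the 100MiB/s kafka_throughput_limit_node_in_bps.
print(f"per-node quota ceiling: {2 * 30} MiB/s")

# Producer 2089 over the last 2 seconds of the test:
print(f"producer 2089: {254 * mean_msg_size / 2 / MiB:.2f} MiB/s")   # ~1 MiB/s
```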
Running all clients at max speed against a rate-limited cluster results in a very unpredictable runtime, because some clients end up backing off for a very long time.

- Use rate limiting in the client, so that it is not vulnerable to very long delays from server-side rate limiting, which caused rare timeouts due to the statistical unfairness of which clients get limited.
- Add a 'realistic' compaction case to accompany the pathological case. Realistic is incompressible data, so we're just paying the CPU tax; pathological is zeros, where we hit the memory inflation risk.
- Make the test adaptively choose message counts for a target runtime (see the sketch below).
- Configure a node rate limit that is aligned with the IOPS throughput of i3en.xlarge nodes when we are sending lots of tiny messages.
- Set a heuristic "effective message size" for the pathological compaction/compression case, which reflects the equivalent uncompressed message size for throughput calculation purposes.

Fixes redpanda-data#10092
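A minimal sketch of the adaptive sizing idea from the list above. The function name, parameters, and example values are illustrative assumptions, not the actual ManyClientsTest code.

```python
# Sketch: choose a message count that fits a target runtime under the configured
# node throughput limit. The "effective message size" stands in for the mean
# message size in the pathological compaction/compression case, where highly
# compressible messages cost far more than their on-the-wire size suggests.
from typing import Optional


def target_message_count(target_runtime_s: float,
                         node_limit_mib_s: float,
                         node_count: int,
                         mean_msg_size: float,
                         effective_msg_size: Optional[float] = None) -> int:
    """Estimate how many messages fit in target_runtime_s at the node limit."""
    size = effective_msg_size if effective_msg_size is not None else mean_msg_size
    cluster_bytes_per_s = node_limit_mib_s * 2**20 * node_count
    return int(cluster_bytes_per_s * target_runtime_s / size)


# Example (illustrative values): aim for a ~5 minute run on a 3-node cluster
# limited to 60MiB/s per node, with message sizes between 0 and 16384 bytes.
print(target_message_count(target_runtime_s=300,
                           node_limit_mib_s=60,
                           node_count=3,
                           mean_msg_size=16384 / 2))
```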
Two unexpected things here:
https://buildkite.com/redpanda/vtools/builds/7097#018779a6-467c-448b-a183-43a17d8ff708