
CI Failure (client times out after error sending) in ManyClientsTest.test_many_clients #10092

Closed
jcsp opened this issue Apr 14, 2023 · 4 comments · Fixed by #9594
Assignees: jcsp
Labels: ci-failure, kind/bug (Something isn't working), performance, sev/low (Bugs which are non-functional paper cuts, e.g. typos, issues in log messages)

Comments


jcsp commented Apr 14, 2023

Two unexpected things here:

  • the client is hitting a timeout while producing
  • the client then fails to complete after that error, so the test itself times out

https://buildkite.com/redpanda/vtools/builds/7097#018779a6-467c-448b-a183-43a17d8ff708

Module: rptest.scale_tests.many_clients_test
Class:  ManyClientsTest
Method: test_many_clients
[2023-04-13T11:08:33Z WARN  client_swarm] Error on producer 2089 400/1000: Message production error: MessageTimedOut (Local: Message timed out)
[2023-04-13T11:08:33Z DEBUG client_swarm] Producer 2089 waiting
[2023-04-13T11:08:33Z DEBUG client_swarm] Producer 2089 waiting
[2023-04-13T11:08:33Z DEBUG client_swarm] Producer 2089 waiting
[2023-04-13T11:08:33Z DEBUG client_swarm] Producer 2089 waiting
[2023-04-13T11:08:33Z DEBUG client_swarm] Producer 2089 waiting
... until timeout
====================================================================================================
test_id:    rptest.scale_tests.many_clients_test.ManyClientsTest.test_many_clients
status:     FAIL
run time:   10 minutes 17.714 seconds


    TimeoutError(None)
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/home/ubuntu/redpanda/tests/rptest/services/cluster.py", line 49, in wrapped
    r = f(self, *args, **kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/scale_tests/many_clients_test.py", line 113, in test_many_clients
    producer.wait()
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/services/service.py", line 261, in wait
    if not self.wait_node(node, end - now):
  File "/home/ubuntu/redpanda/tests/rptest/services/producer_swarm.py", line 69, in wait_node
    self._redpanda.wait_until(lambda: not self.is_alive(node),
  File "/home/ubuntu/redpanda/tests/rptest/services/redpanda.py", line 1323, in wait_until
    wait_until(wrapped,
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/utils/util.py", line 57, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError: None
jcsp added the kind/bug, ci-failure, performance, and sev/low labels on Apr 14, 2023

jcsp commented Apr 14, 2023

There is a similar backtrace here: #8979 (comment), but the logs are gone. It's quite possible that this is an error case that has existed for a long time but is quite rare.


jcsp commented Apr 19, 2023

The "Producer {} waiting" message is printed on every message, so we can track the progress of a producer and see where it pauses for a long time.

Producer 2089 wasn't the only one hitting big delays. Here's another producer hitting a 5 minute delay:

[2023-04-13T11:00:19Z DEBUG client_swarm] Producer 1977 waiting
[2023-04-13T11:05:14Z WARN  client_swarm] Error on producer 1045 7/1000: Message production error: MessageTimedOut (Local: Message timed out)
[2023-04-13T11:05:19Z WARN  client_swarm] Error on producer 1977 13/1000: Message production error: MessageTimedOut (Local: Message timed out)
[2023-04-13T11:05:19Z DEBUG client_swarm] Producer 1977 waiting
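(Not part of the test suite: a quick sketch for digging through these swarm logs, which reports the longest gap between consecutive client_swarm lines for each producer — that is how stalls like the five-minute one above stand out. The log filename is just a placeholder.)

import re
from datetime import datetime

# Match lines like "[2023-04-13T11:05:19Z DEBUG client_swarm] Producer 1977 waiting"
# and "[...] Error on producer 1977 13/1000: ...".
LINE = re.compile(r"\[(\S+Z)\s+\w+\s+client_swarm\]\s+.*?[Pp]roducer (\d+)")

def longest_gaps(path):
    last, worst = {}, {}
    with open(path) as f:
        for line in f:
            m = LINE.search(line)
            if not m:
                continue
            ts = datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%SZ")
            pid = int(m.group(2))
            if pid in last:
                worst[pid] = max(worst.get(pid, 0.0), (ts - last[pid]).total_seconds())
            last[pid] = ts
    return sorted(worst.items(), key=lambda kv: -kv[1])

for pid, gap in longest_gaps("client_swarm.log")[:10]:   # placeholder filename
    print(f"producer {pid}: longest gap {gap:.0f}s")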

This could be symptomatic of a combination of unfair server-side rate limiting and a large ceiling on the client-side delay. If a client gets unlucky a few times, it can end up sitting in backoff for minutes.

The total number of messages sent in the 10-minute timeout period was 3,968,454. The message size follows a linear distribution between 0 and 16384 bytes, which works out to ~55MiB/s.
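A rough check on that number (the only assumption is taking the mean of the 0..16384 size distribution as ~8 KiB):

msgs = 3_968_454
avg_msg_bytes = 16384 / 2          # mean of the linear 0..16384 distribution
seconds = 10 * 60                  # the ~10 minute timeout window
MiB = 1024 * 1024
print(msgs * avg_msg_bytes / seconds / MiB)   # ~52 MiB/s, i.e. in the ~55MiB/s ballpark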

The test configures a 30MiB/s target_quota_byte_rate (per shard) and a 100MiB/s kafka_throughput_limit_node_in_bps (per node). The effective limit per node is therefore 60MiB/s, as each node has only 2 shards.
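In other words, the effective ceiling is whichever bound bites first (a trivial sketch of the reasoning above):

target_quota_byte_rate = 30           # MiB/s, enforced per shard
kafka_throughput_limit_node_in = 100  # MiB/s, enforced per node
shards_per_node = 2
effective = min(kafka_throughput_limit_node_in,
                shards_per_node * target_quota_byte_rate)
print(effective)                      # 60 MiB/s per node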

There are tens of producers that do not complete, but oddly, in the final few seconds of the test only producer 2089 is seen making progress. In the last 2 seconds of the test, 254 messages are emitted for producer 2089, which is only ~1MiB/s. It's possible, though, that at this point in the test the other producers are simply stuck in their client backoff sleep.
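Two more back-of-the-envelope checks (hedged: the 4000-producer count is an assumption inferred from ~3.97M messages sent at 1000 messages per producer, not something stated above):

MiB = 1024 * 1024
avg_msg_bytes = 16384 / 2

# Producer 2089 in the final two seconds of the test.
print(254 * avg_msg_bytes / 2 / MiB)           # ~1.0 MiB/s

# Whole workload at the observed aggregate rate.
producers = 4000                               # assumed
total_bytes = producers * 1000 * avg_msg_bytes
print(total_bytes / (55 * MiB) / 60)           # ~9.5 minutes -- barely inside the 10 minute budget,
                                               # so a producer that loses minutes to backoff has no slack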

@jcsp jcsp self-assigned this Apr 20, 2023
jcsp added a commit to jcsp/redpanda that referenced this issue Apr 20, 2023
Running all clients at max speed against a rate limited
cluster results in a very unpredictable runtime, because
some clients end up backing off for a very long time.

Fixes redpanda-data#10092
jcsp added a commit to jcsp/redpanda that referenced this issue Apr 20, 2023
- Use rate limiting in the client, so that it is
  not vulnerable to very long delays from rate limiting,
  causing rare timeouts due to statistical unfairness of
  which clients get limited.
- Add a 'realistic' compaction case to accompany the
  pathological case.  Realistic is incompressible data,
  so we're just paying the CPU tax; pathological is zeros,
  where we hit the memory inflation risk.
- Make the test adaptively choose message counts for a
  target runtime.
- Configure a node rate limit that is aligned with the
  IOPS throughput of i3en.xlarge nodes when we are sending
  lots of tiny messages.
- Set a heuristic "effective message size" for the pathological
  compaction/compression case, which reflects the equivalent
  uncompressed message size for throughput calculation
  purposes.

Fixes redpanda-data#10092
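(The client-side rate limiting change lives in the Rust client_swarm tool; purely as an illustration of the idea in the commit message above — names and numbers are made up — a token-bucket limiter looks roughly like this.)

import time

class TokenBucket:
    """Pace a producer at rate_bytes_per_s, allowing bursts up to burst_bytes."""
    def __init__(self, rate_bytes_per_s, burst_bytes):
        self.rate = rate_bytes_per_s
        self.capacity = burst_bytes
        self.tokens = burst_bytes
        self.last = time.monotonic()

    def acquire(self, nbytes):
        # Block until nbytes of budget is available, refilling at self.rate.
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= nbytes:
                self.tokens -= nbytes
                return
            time.sleep((nbytes - self.tokens) / self.rate)

# Keep each producer below its fair share of the node limit, so the server-side
# throttle never becomes the thing pacing (and occasionally starving) the client.
bucket = TokenBucket(rate_bytes_per_s=16 * 1024, burst_bytes=64 * 1024)
for payload in (b"x" * 8192 for _ in range(4)):
    bucket.acquire(len(payload))
    # produce(payload) would go here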
jcsp added a commit to jcsp/redpanda that referenced this issue Apr 21, 2023
jcsp added a commit to jcsp/redpanda that referenced this issue Apr 21, 2023
jcsp added a commit to jcsp/redpanda that referenced this issue May 2, 2023
jcsp added a commit to jcsp/redpanda that referenced this issue May 2, 2023


ztlpn commented May 2, 2023

jcsp closed this as completed in #9594 on May 3, 2023
travisdowns pushed a commit to travisdowns/redpanda that referenced this issue May 5, 2023
travisdowns pushed a commit to travisdowns/redpanda that referenced this issue May 5, 2023