
CI Failure (client times out after error sending) in ManyClientsTest.test_many_clients #10092

Closed
jcsp opened this issue Apr 14, 2023 · 4 comments · Fixed by #9594
Assignees: jcsp
Labels: ci-failure, kind/bug (Something isn't working), performance, sev/low (Bugs which are non-functional paper cuts, e.g. typos, issues in log messages)

Comments


jcsp commented Apr 14, 2023

Two unexpected things here:

  • the client is hitting a timeout while producing
  • the client then fails to complete after that error, so the test itself times out

https://buildkite.com/redpanda/vtools/builds/7097#018779a6-467c-448b-a183-43a17d8ff708

Module: rptest.scale_tests.many_clients_test
Class:  ManyClientsTest
Method: test_many_clients
[2023-04-13T11:08:33Z WARN  client_swarm] Error on producer 2089 400/1000: Message production error: MessageTimedOut (Local: Message timed out)
[2023-04-13T11:08:33Z DEBUG client_swarm] Producer 2089 waiting
[2023-04-13T11:08:33Z DEBUG client_swarm] Producer 2089 waiting
[2023-04-13T11:08:33Z DEBUG client_swarm] Producer 2089 waiting
[2023-04-13T11:08:33Z DEBUG client_swarm] Producer 2089 waiting
[2023-04-13T11:08:33Z DEBUG client_swarm] Producer 2089 waiting
... until timeout
====================================================================================================
test_id:    rptest.scale_tests.many_clients_test.ManyClientsTest.test_many_clients
status:     FAIL
run time:   10 minutes 17.714 seconds


    TimeoutError(None)
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/home/ubuntu/redpanda/tests/rptest/services/cluster.py", line 49, in wrapped
    r = f(self, *args, **kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/scale_tests/many_clients_test.py", line 113, in test_many_clients
    producer.wait()
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/services/service.py", line 261, in wait
    if not self.wait_node(node, end - now):
  File "/home/ubuntu/redpanda/tests/rptest/services/producer_swarm.py", line 69, in wait_node
    self._redpanda.wait_until(lambda: not self.is_alive(node),
  File "/home/ubuntu/redpanda/tests/rptest/services/redpanda.py", line 1323, in wait_until
    wait_until(wrapped,
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/utils/util.py", line 57, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError: None
jcsp added the kind/bug, ci-failure, performance, and sev/low labels on Apr 14, 2023

jcsp commented Apr 14, 2023

There is a similar backtrace here: #8979 (comment), but the logs are gone. It's quite possible that this is an error case that has existed for a long time but is quite rare.


jcsp commented Apr 19, 2023

The "Producer {} waiting" message is printed on every message, so we can track the progress of a producer and see where it pauses for a long time.

Producer 2089 wasn't the only one hitting big delays. Here's another producer hitting a 5 minute delay:

[2023-04-13T11:00:19Z DEBUG client_swarm] Producer 1977 waiting
[2023-04-13T11:05:14Z WARN  client_swarm] Error on producer 1045 7/1000: Message production error: MessageTimedOut (Local: Message timed out)
[2023-04-13T11:05:19Z WARN  client_swarm] Error on producer 1977 13/1000: Message production error: MessageTimedOut (Local: Message timed out)
[2023-04-13T11:05:19Z DEBUG client_swarm] Producer 1977 waiting
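(Not part of the test suite: a quick sketch for digging through these swarm logs, which reports the longest gap between consecutive client_swarm lines for each producer — that is how stalls like the five-minute one above stand out. The log filename is just a placeholder.)

import re
from datetime import datetime

# Match lines like "[2023-04-13T11:05:19Z DEBUG client_swarm] Producer 1977 waiting"
# and "[...] Error on producer 1977 13/1000: ...".
LINE = re.compile(r"\[(\S+Z)\s+\w+\s+client_swarm\]\s+.*?[Pp]roducer (\d+)")

def longest_gaps(path):
    last, worst = {}, {}
    with open(path) as f:
        for line in f:
            m = LINE.search(line)
            if not m:
                continue
            ts = datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%SZ")
            pid = int(m.group(2))
            if pid in last:
                worst[pid] = max(worst.get(pid, 0.0), (ts - last[pid]).total_seconds())
            last[pid] = ts
    return sorted(worst.items(), key=lambda kv: -kv[1])

for pid, gap in longest_gaps("client_swarm.log")[:10]:   # placeholder filename
    print(f"producer {pid}: longest gap {gap:.0f}s")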

This could be symptomatic of a combination of unfair server-side rate limiting and a large ceiling on the client-side delay. If a client gets unlucky a few times, it can end up sitting in backoff for minutes.

The total number of messages sent in the 10-minute timeout period was 3,968,454. The message size follows a linear distribution between 0 and 16384 bytes, which works out to ~55MiB/s.
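A rough check on that number (the only assumption is taking the mean of the 0..16384 size distribution as ~8 KiB):

msgs = 3_968_454
avg_msg_bytes = 16384 / 2          # mean of the linear 0..16384 distribution
seconds = 10 * 60                  # the ~10 minute timeout window
MiB = 1024 * 1024
print(msgs * avg_msg_bytes / seconds / MiB)   # ~52 MiB/s, i.e. in the ~55MiB/s ballpark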

The test configures a 30MiB/s target_quota_byte_rate (per shard) and a 100MiB/s kafka_throughput_limit_node_in_bps (per node). The effective limit per node is therefore 60MiB/s, as each node has only 2 shards.
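In other words, the effective ceiling is whichever bound bites first (a trivial sketch of the reasoning above):

target_quota_byte_rate = 30           # MiB/s, enforced per shard
kafka_throughput_limit_node_in = 100  # MiB/s, enforced per node
shards_per_node = 2
effective = min(kafka_throughput_limit_node_in,
                shards_per_node * target_quota_byte_rate)
print(effective)                      # 60 MiB/s per node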

There are tens of producers that do not complete, but oddly, in the final few seconds of the test only producer 2089 is seen making progress. In the last 2 seconds of the test, 254 messages are emitted for producer 2089, which is only ~1MiB/s. It's possible, though, that at this point in the test the other producers are simply stuck in their client backoff sleep.
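Two more back-of-the-envelope checks (hedged: the 4000-producer count is an assumption inferred from ~3.97M messages sent at 1000 messages per producer, not something stated above):

MiB = 1024 * 1024
avg_msg_bytes = 16384 / 2

# Producer 2089 in the final two seconds of the test.
print(254 * avg_msg_bytes / 2 / MiB)           # ~1.0 MiB/s

# Whole workload at the observed aggregate rate.
producers = 4000                               # assumed
total_bytes = producers * 1000 * avg_msg_bytes
print(total_bytes / (55 * MiB) / 60)           # ~9.5 minutes -- barely inside the 10 minute budget,
                                               # so a producer that loses minutes to backoff has no slack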

@jcsp jcsp self-assigned this Apr 20, 2023
jcsp added a commit to jcsp/redpanda that referenced this issue Apr 20, 2023
Running all clients at max speed against a rate limited
cluster results in a very unpredictable runtime, because
some clients end up backing off for a very long time.

Fixes redpanda-data#10092
jcsp added a commit to jcsp/redpanda that referenced this issue Apr 20, 2023
- Use rate limiting in the client, so that it is
  not vulnerable to very long delays from rate limiting,
  causing rare timeouts due to statistical unfairness of
  which clients get limited.
- Add a 'realistic' compaction case to accompany the
  pathological case.  Realistic is incompressible data,
  so we're just paying the CPU tax; pathological is zeros,
  where we hit the memory inflation risk.
- Make the test adaptively choose message counts for a
  target runtime.
- Configure a node rate limit that is aligned with the
  IOPS throughput of i3en.xlarge nodes when we are sending
  lots of tiny messages.
- Set a heuristic "effective message size" for the pathological
  compaction/compression case, which reflects the equivalent
  uncompressed message size for throughput calculation
  purposes.

Fixes redpanda-data#10092
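(The client-side rate limiting change lives in the Rust client_swarm tool; purely as an illustration of the idea in the commit message above — names and numbers are made up — a token-bucket limiter looks roughly like this.)

import time

class TokenBucket:
    """Pace a producer at rate_bytes_per_s, allowing bursts up to burst_bytes."""
    def __init__(self, rate_bytes_per_s, burst_bytes):
        self.rate = rate_bytes_per_s
        self.capacity = burst_bytes
        self.tokens = burst_bytes
        self.last = time.monotonic()

    def acquire(self, nbytes):
        # Block until nbytes of budget is available, refilling at self.rate.
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= nbytes:
                self.tokens -= nbytes
                return
            time.sleep((nbytes - self.tokens) / self.rate)

# Keep each producer below its fair share of the node limit, so the server-side
# throttle never becomes the thing pacing (and occasionally starving) the client.
bucket = TokenBucket(rate_bytes_per_s=16 * 1024, burst_bytes=64 * 1024)
for payload in (b"x" * 8192 for _ in range(4)):
    bucket.acquire(len(payload))
    # produce(payload) would go here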
jcsp added a commit to jcsp/redpanda that referenced this issue Apr 21, 2023
jcsp added a commit to jcsp/redpanda that referenced this issue Apr 21, 2023
jcsp added a commit to jcsp/redpanda that referenced this issue May 2, 2023
jcsp added a commit to jcsp/redpanda that referenced this issue May 2, 2023


ztlpn commented May 2, 2023

jcsp closed this as completed in #9594 on May 3, 2023
travisdowns pushed a commit to travisdowns/redpanda that referenced this issue May 5, 2023
travisdowns pushed a commit to travisdowns/redpanda that referenced this issue May 5, 2023