Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CI Failure (TimeoutError) in KgoRepeaterSelfTest.test_kgo_repeater #10865

Closed
VladLazar opened this issue May 18, 2023 · 10 comments · Fixed by #11092
Closed

CI Failure (TimeoutError) in KgoRepeaterSelfTest.test_kgo_repeater #10865

VladLazar opened this issue May 18, 2023 · 10 comments · Fixed by #11092
Labels
area/tests ci-failure kind/bug Something isn't working sev/low Bugs which are non-functional paper cuts, e.g. typos, issues in log messages

Comments

@VladLazar
Copy link
Contributor

https://buildkite.com/redpanda/redpanda/builds/29370#01882e53-41c2-436f-81b6-45a359769d05

Module: rptest.tests.services_self_test
Class:  KgoRepeaterSelfTest
Method: test_kgo_repeater
test_id:    rptest.tests.services_self_test.KgoRepeaterSelfTest.test_kgo_repeater
status:     FAIL
run time:   1 minute 45.411 seconds

    TimeoutError(None)
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/services/cluster.py", line 49, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/services_self_test.py", line 70, in test_kgo_repeater
    repeater.await_progress(1024, timeout_sec=75)
  File "/root/tests/rptest/services/kgo_repeater_service.py", line 301, in await_progress
    self.redpanda.wait_until(check, timeout_sec=timeout_sec, backoff_sec=1)
  File "/root/tests/rptest/services/redpanda.py", line 1629, in wait_until
    wait_until(wrapped,
  File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 57, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError: None
@VladLazar VladLazar added kind/bug Something isn't working ci-failure sev/low Bugs which are non-functional paper cuts, e.g. typos, issues in log messages labels May 18, 2023
@jcsp
Copy link
Contributor

jcsp commented May 18, 2023

The test expects to write 4MiB in 75 seconds.

This is similar to #10500 in that a very modest throughput expectation is not being met. I suspect we have merged some test that is eating too many resources on the shared test node, and consequentially causing these existing tests to start failing. #10500 first appeared when we moved to shared lib builds for debug, but became much more frequent recently, this issue has only just popped up.

@dlex
Copy link
Contributor

dlex commented May 22, 2023

@VladLazar
Copy link
Contributor Author

FAIL test: KgoRepeaterSelfTest.test_kgo_repeater (2/69 runs)
  failure at 2023-05-22T14:44:28.482Z: TimeoutError(None)
      on (amd64, container) in job https://buildkite.com/redpanda/redpanda/builds/29583#0188439d-c75d-4048-94a9-94b65ca6b252
  failure at 2023-05-19T15:51:47.969Z: TimeoutError(None)
      on (amd64, container) in job https://buildkite.com/redpanda/redpanda/builds/29477#01883461-64c4-42e5-b8bf-79844bb04d52

@michael-redpanda
Copy link
Contributor

@NyaliaLui
Copy link
Contributor

@NyaliaLui
Copy link
Contributor

FAIL test: KgoRepeaterSelfTest.test_kgo_repeater (6/31 runs)
  failure at 2023-05-25T19:44:44.258Z: TimeoutError(None)
      on (amd64, container) in job https://buildkite.com/redpanda/redpanda/builds/29913#01885434-6e20-4945-af74-28cd962343d4

@dlex
Copy link
Contributor

dlex commented May 29, 2023

This test succeeds locally for me but fails pretty consistently in CI, like here: https://buildkite.com/redpanda/redpanda/builds/30044#01885b12-fd93-4abd-9941-9dca0a461938

When the test succceds, it completes in, seconds:

  • run alone: 33, 36,
  • run 3 instances simultaneously: 54, 55, 49, 58, 58, 55

When in fails in CI, the percentage complete in 75s is: 82, 77, 76, 78, 76

Apparently the runtime is really dependent on the environment. To reliably pass in the current CI environment, it needs the timeout to be twice as high.

@dlex dlex self-assigned this May 29, 2023
@dlex
Copy link
Contributor

dlex commented May 29, 2023

The test expects to write 4MiB in 75 seconds.

This is similar to #10500 in that a very modest throughput expectation is not being met.

56 kiB/s is indeed a modest expectation, however the problem with the speed is that the producer sends only 1 message (4kiB) per request. This way

[DEBUG - 2023-05-27 03:00:22,991 - kgo_repeater_service - check - lineno:291]: await_progress: 72.6% p=743 c=743, initial_p=0, initial_c=0, await count 1024)
....
[DEBUG - 2023-05-27 03:00:24,101 - kgo_repeater_service - check - lineno:291]: await_progress: 73.7% p=755 c=755, initial_p=0, initial_c=0, await count 1024)

producer only pushes ~11 messages per second. Average processing time of a produce request in the test is 106 ms (ok-ish), given that there are 3 nodes, even with 1 msg/s the producer is utilizing cluster capacity at 39%.

@jcsp: is 1 msg per produce request intentional in this test?

If yes, the timeout needs to be increased.

@abhijat
Copy link
Contributor

abhijat commented May 30, 2023

https://buildkite.com/redpanda/redpanda/builds/30108#018868e9-8cf5-4149-a046-793962299e58

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/services/cluster.py", line 49, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/services_self_test.py", line 71, in test_kgo_repeater
    repeater.await_progress(1024, timeout_sec=75)
  File "/root/tests/rptest/services/kgo_repeater_service.py", line 301, in await_progress
    self.redpanda.wait_until(check, timeout_sec=timeout_sec, backoff_sec=1)
  File "/root/tests/rptest/services/redpanda.py", line 1984, in wait_until
    wait_until(wrapped,
  File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 57, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError: None

@jcsp
Copy link
Contributor

jcsp commented May 30, 2023

@jcsp: is 1 msg per produce request intentional in this test?

Not explicitly.

I'm going to disable this test in debug mode. It was stable for a long time, but there is some systemic issue in the debug tests that is making things slower and slower -- if I bump the timeout, I may have to come back and do the same thing again later.

jcsp added a commit to jcsp/redpanda that referenced this issue May 30, 2023
These mostly involve sending some amount of traffic, which tends
to be flaky in debug mode.  These tests don't exist to test
Redpanda, they're just smoke tests for the services themselves,
so we don't gain much by running against debug binaries.

Fixes redpanda-data#10865
jcsp added a commit to jcsp/redpanda that referenced this issue May 30, 2023
These mostly involve sending some amount of traffic, which tends
to be flaky in debug mode.  These tests don't exist to test
Redpanda, they're just smoke tests for the services themselves,
so we don't gain much by running against debug binaries.

Fixes redpanda-data#10865
vbotbuildovich pushed a commit to vbotbuildovich/redpanda that referenced this issue May 30, 2023
These mostly involve sending some amount of traffic, which tends
to be flaky in debug mode.  These tests don't exist to test
Redpanda, they're just smoke tests for the services themselves,
so we don't gain much by running against debug binaries.

Fixes redpanda-data#10865

(cherry picked from commit e3af6c2)
@dlex dlex removed their assignment May 30, 2023
VladLazar pushed a commit to VladLazar/redpanda that referenced this issue May 31, 2023
These mostly involve sending some amount of traffic, which tends
to be flaky in debug mode.  These tests don't exist to test
Redpanda, they're just smoke tests for the services themselves,
so we don't gain much by running against debug binaries.

Fixes redpanda-data#10865
michael-redpanda pushed a commit to vbotbuildovich/redpanda that referenced this issue Jul 18, 2023
These mostly involve sending some amount of traffic, which tends
to be flaky in debug mode.  These tests don't exist to test
Redpanda, they're just smoke tests for the services themselves,
so we don't gain much by running against debug binaries.

Fixes redpanda-data#10865

(cherry picked from commit e3af6c2)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/tests ci-failure kind/bug Something isn't working sev/low Bugs which are non-functional paper cuts, e.g. typos, issues in log messages
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants