
CI Failure (RPK failed FindCoordinator req) in RpkGroupCommandsTest.test_group_describe #12291

Closed · NyaliaLui opened this issue Jul 18, 2023 · 12 comments · Fixed by #12564
Labels: area/kafka, ci-failure, kind/bug, sev/low

Comments

@NyaliaLui (Contributor):

https://buildkite.com/redpanda/vtools/builds/8505#018966d7-6953-4067-afab-fa6498b0f9bb

Module: rptest.tests.rpk_group_test
Class:  RpkGroupCommandsTest
Method: test_group_describe
test_id:    rptest.tests.rpk_group_test.RpkGroupCommandsTest.test_group_describe
status:     FAIL
run time:   9.182 seconds


    RpkException('command /var/lib/buildkite-agent/builds/buildkite-amd64-xfs-builders-i-082c7dfe8830a81b7-1/redpanda/vtools/vbuild/redpanda_installs/ci/bin/rpk group -X brokers=docker-rp-20:9092,docker-rp-22:9092,docker-rp-21:9092 describe test-g1 -v returned 1, output: ', '02:55:28.186  DEBUG  sharded request  {"req": "FindCoordinator", "destinations": ["any"]}\n02:55:28.186  DEBUG  opening connection to broker  {"addr": "docker-rp-20:9092", "broker": "seed_0"}\n02:55:28.186  DEBUG  connection opened to broker  {"addr": "docker-rp-20:9092", "broker": "seed_0"}\n02:55:28.187  DEBUG  issuing api versions request  {"broker": "seed_0", "version": 3}\n02:55:28.187  DEBUG  wrote ApiVersions v3  {"broker": "seed_0", "bytes_written": 31, "write_wait": "16.409µs", "time_to_write": "16.118µs", "err": null}\n02:55:28.187  DEBUG  read ApiVersions v3  {"broker": "seed_0", "bytes_read": 296, "read_wait": "36.776µs", "time_to_read": "170.578µs", "err": null}\n02:55:28.187  DEBUG  connection initialized successfully  {"addr": "docker-rp-20:9092", "broker": "seed_0"}\n02:55:28.187  DEBUG  sharded request failed, resharding and reissuing  {"req": "FindCoordinator", "time_since_start": "1.090625ms", "tries": 0, "err": "broker is too old; the broker has already indicated it will not know how to handle the request"}\n02:55:28.187  DEBUG  sharded request  {"req": "FindCoordinator", "destinations": ["any"]}\n02:55:28.187  DEBUG  opening connection to broker  {"addr": "docker-rp-22:9092", "broker": "seed_1"}\n02:55:28.187  DEBUG  connection opened to broker  {"addr": "docker-rp-22:9092", "broker": "seed_1"}\n02:55:28.187  DEBUG  issuing api versions request  {"broker": "seed_1", "version": 3}\n02:55:28.187  DEBUG  wrote ApiVersions v3  {"broker": "seed_1", "bytes_written": 31, "write_wait": "8.351µs", "time_to_write": "15.301µs", "err": null}\n02:55:28.188  DEBUG  read ApiVersions v3  {"broker": "seed_1", "bytes_read": 296, "read_wait": "28.104µs", "time_to_read": "423.174µs", "err": null}\n02:55:28.188  DEBUG  connection initialized successfully  {"addr": "docker-rp-22:9092", "broker": "seed_1"}\n02:55:28.188  DEBUG  wrote FindCoordinator v3  {"broker": "seed_1", "bytes_written": 28, "write_wait": "857.859µs", "time_to_write": "17.294µs", "err": null}\n02:55:28.193  DEBUG  read FindCoordinator v3  {"broker": "seed_1", "bytes_read": 26, "read_wait": "44.775µs", "time_to_read": "5.287227ms", "err": null}\n02:55:28.193  DEBUG  sharded request  {"req": "DescribeGroups", "destinations": ["err"]}\nunable to describe groups: request DescribeGroups has 1 separate shard errors, first: COORDINATOR_NOT_AVAILABLE: The coordinator is not available.\n', 1, '')
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/services/cluster.py", line 82, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/rpk_group_test.py", line 91, in test_group_describe
    wait_until(lambda: rpk.group_describe(group_1).state == "Stable",
  File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 53, in wait_until
    raise e
  File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 44, in wait_until
    if condition():
  File "/root/tests/rptest/tests/rpk_group_test.py", line 91, in <lambda>
    wait_until(lambda: rpk.group_describe(group_1).state == "Stable",
  File "/root/tests/rptest/clients/rpk.py", line 689, in group_describe
    rpk_group = try_describe_group(group)
  File "/root/tests/rptest/clients/rpk.py", line 637, in try_describe_group
    out = self._run_group(cmd)
  File "/root/tests/rptest/clients/rpk.py", line 792, in _run_group
    return self._execute(cmd, stdin=stdin, timeout=timeout)
  File "/root/tests/rptest/clients/rpk.py", line 910, in _execute
    raise RpkException(
rptest.clients.rpk.RpkException: RpkException<command /var/lib/buildkite-agent/builds/buildkite-amd64-xfs-builders-i-082c7dfe8830a81b7-1/redpanda/vtools/vbuild/redpanda_installs/ci/bin/rpk group -X brokers=docker-rp-20:9092,docker-rp-22:9092,docker-rp-21:9092 describe test-g1 -v returned 1, output:  error: 02:55:28.186  DEBUG  sharded request  {"req": "FindCoordinator", "destinations": ["any"]}
02:55:28.186  DEBUG  opening connection to broker  {"addr": "docker-rp-20:9092", "broker": "seed_0"}
02:55:28.186  DEBUG  connection opened to broker  {"addr": "docker-rp-20:9092", "broker": "seed_0"}
02:55:28.187  DEBUG  issuing api versions request  {"broker": "seed_0", "version": 3}
02:55:28.187  DEBUG  wrote ApiVersions v3  {"broker": "seed_0", "bytes_written": 31, "write_wait": "16.409µs", "time_to_write": "16.118µs", "err": null}
02:55:28.187  DEBUG  read ApiVersions v3  {"broker": "seed_0", "bytes_read": 296, "read_wait": "36.776µs", "time_to_read": "170.578µs", "err": null}
02:55:28.187  DEBUG  connection initialized successfully  {"addr": "docker-rp-20:9092", "broker": "seed_0"}
02:55:28.187  DEBUG  sharded request failed, resharding and reissuing  {"req": "FindCoordinator", "time_since_start": "1.090625ms", "tries": 0, "err": "broker is too old; the broker has already indicated it will not know how to handle the request"}
02:55:28.187  DEBUG  sharded request  {"req": "FindCoordinator", "destinations": ["any"]}
02:55:28.187  DEBUG  opening connection to broker  {"addr": "docker-rp-22:9092", "broker": "seed_1"}
02:55:28.187  DEBUG  connection opened to broker  {"addr": "docker-rp-22:9092", "broker": "seed_1"}
02:55:28.187  DEBUG  issuing api versions request  {"broker": "seed_1", "version": 3}
02:55:28.187  DEBUG  wrote ApiVersions v3  {"broker": "seed_1", "bytes_written": 31, "write_wait": "8.351µs", "time_to_write": "15.301µs", "err": null}
02:55:28.188  DEBUG  read ApiVersions v3  {"broker": "seed_1", "bytes_read": 296, "read_wait": "28.104µs", "time_to_read": "423.174µs", "err": null}
02:55:28.188  DEBUG  connection initialized successfully  {"addr": "docker-rp-22:9092", "broker": "seed_1"}
02:55:28.188  DEBUG  wrote FindCoordinator v3  {"broker": "seed_1", "bytes_written": 28, "write_wait": "857.859µs", "time_to_write": "17.294µs", "err": null}
02:55:28.193  DEBUG  read FindCoordinator v3  {"broker": "seed_1", "bytes_read": 26, "read_wait": "44.775µs", "time_to_read": "5.287227ms", "err": null}
02:55:28.193  DEBUG  sharded request  {"req": "DescribeGroups", "destinations": ["err"]}
unable to describe groups: request DescribeGroups has 1 separate shard errors, first: COORDINATOR_NOT_AVAILABLE: The coordinator is not available.
 returncode: 1>
@NyaliaLui added the kind/bug and ci-failure labels on Jul 18, 2023
@NyaliaLui (Contributor, Author):

02:55:28.187  DEBUG  sharded request failed, resharding and reissuing  {"req": "FindCoordinator", "time_since_start": "1.090625ms", "tries": 0, "err": "broker is too old; the broker has already indicated it will not know how to handle the request"}

This makes me think that either RPK is using an unsupported API version or we need to update the FindCoordinator API in our Kafka protocol.

@rystsov (Contributor) commented Jul 19, 2023:

@michael-redpanda (Contributor):

Able to reproduce in ducktape. Failed 2/10 times.

@michael-redpanda (Contributor):

My suspicion is that this is an unintended side effect of #12121. Marking as sev/low since this is probably a test issue and should be relatively straightforward to fix.

@michael-redpanda added the sev/low label on Jul 20, 2023
@graphcareful (Contributor):

Tracked this down to commit 61e2512, which causes rpk to exit 1 and print the output on stderr for certain failures.

Ducktape previously assumed this command would print COORDINATOR_NOT_AVAILABLE on stdout and not exit 1. Now that this is no longer the case, we should modify ducktape to react correctly to the new behavior of rpk group describe.
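A minimal sketch of what that ducktape-side change could look like, reusing the `RpkTool.group_describe`, `RpkException`, and `wait_until` names visible in the traceback above; the assumption that the stderr text surfaces via `str(e)` is mine, not confirmed by the source:

```python
# Sketch only: treat rpk's new exit-1 behavior as "coordinator not ready yet"
# instead of failing the test immediately.
from rptest.clients.rpk import RpkException

def group_is_stable(rpk, group):
    try:
        return rpk.group_describe(group).state == "Stable"
    except RpkException as e:
        # rpk now exits 1 and prints COORDINATOR_NOT_AVAILABLE on stderr;
        # assumed here to be included in the exception text.
        if "COORDINATOR_NOT_AVAILABLE" in str(e):
            return False
        raise

# Usage inside the test, mirroring the existing pattern:
# wait_until(lambda: group_is_stable(rpk, group_1), timeout_sec=30, backoff_sec=1)
```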

@dotnwat (Member) commented Jul 20, 2023:

Is the fix for this easy? It seems to be triggering quite a bit in CI (at least in one of my PRs).

@twmb (Contributor) commented Jul 20, 2023:

The sharding error happens because the client internally tries to send this as a batched FindCoordinator request, immediately sees that RP does not support that, and then splits the request and actually issues it. So, expect to see that message until RP supports batched FindCoordinator (v4+).
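For illustration, the retry shape described here looks roughly like the following Python sketch; it is not the actual client code, and every name in it is hypothetical:

```python
class UnsupportedVersionError(Exception):
    """Stand-in for the broker rejecting a request version (hypothetical)."""

def find_coordinators(broker, groups):
    # First attempt: one batched FindCoordinator (v4+) carrying all group names.
    try:
        return broker.find_coordinator_batched(groups)  # hypothetical batched call
    except UnsupportedVersionError:
        # "broker is too old": reshard into one single-key (v3) request per group
        # and reissue, which is the reshard-and-reissue DEBUG message seen above.
        return {g: broker.find_coordinator(g) for g in groups}
```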

@twmb (Contributor) commented Jul 20, 2023:

@graphcareful I should change rpk to not fail if there is a partial error -- this matches the old behavior. However, the old code would just print a failure here then exit 0, so perhaps the new behavior is correct.

@NyaliaLui (Contributor, Author):

The sharding error happens because the client internally tries to send this as a batched FindCoordinator request, immediately sees that RP does not support that, and then splits the request and actually issues it. So, expect to see that message until RP supports batched FindCoordinator (v4+).

Sounds like we need to update the FindCoordinator API in our Kafka protocol as well.

@graphcareful (Contributor) commented Jul 20, 2023:

IMO the fix would be as easy as including stdout in RpkException.msg; I can self-assign. I think we should keep this particular rpk command behaving the way it does: exiting 1 when there is an error is consistent with the other commands, right?
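A rough sketch of that suggestion; the constructor shape of `RpkException` here is an assumption inferred from the exception text above, not the actual definition in `rptest/clients/rpk.py`:

```python
class RpkException(Exception):
    """Sketch: carry stdout alongside stderr so callers can match on errors such
    as COORDINATOR_NOT_AVAILABLE regardless of which stream rpk wrote them to."""

    def __init__(self, msg, stderr="", returncode=None, stdout=""):
        self.stderr = stderr
        self.returncode = returncode
        self.stdout = stdout
        # Include both streams in the message so string matching keeps working.
        super().__init__(
            f"{msg} error: {stderr} stdout: {stdout} returncode: {returncode}")
```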

@graphcareful self-assigned this on Jul 20, 2023
@michael-redpanda removed their assignment on Jul 21, 2023
dotnwat added a commit to dotnwat/redpanda that referenced this issue Aug 2, 2023
Changes to rpk caused the COORDINATOR_NOT_AVAILABLE messages to be
printed to stderr (makes sense) instead of stdout. Updated to check for
this condition in stderr.

Fixes: redpanda-data#12291

Signed-off-by: Noah Watkins <[email protected]>
vbotbuildovich pushed a commit to vbotbuildovich/redpanda that referenced this issue Aug 11, 2023
Changes to rpk caused the COORDINATOR_NOT_AVAILABLE messages to be
printed to stderr (makes sense) instead of stdout. Updated to check for
this condition in stderr.

Fixes: redpanda-data#12291

Signed-off-by: Noah Watkins <[email protected]>
(cherry picked from commit 231d9e4)
@rystsov (Contributor) commented Aug 15, 2023:

@rystsov reopened this on Aug 15, 2023
@piyushredpanda (Contributor):

Issue hasn't occurred for 2 months per Pandatriage; closing.
