CI Failure (Timeout) in ControllerUpgradeTest.test_updating_cluster_when_executing_operations #8083

Closed
rystsov opened this issue Jan 6, 2023 · 5 comments · Fixed by #8099

rystsov commented Jan 6, 2023

https://buildkite.com/redpanda/redpanda/builds/20717#0185849e-72bf-4773-9bc1-9c1395324e2c

Module: rptest.tests.controller_upgrade_test
Class:  ControllerUpgradeTest
Method: test_updating_cluster_when_executing_operations
test_id:    rptest.tests.controller_upgrade_test.ControllerUpgradeTest.test_updating_cluster_when_executing_operations
status:     FAIL
run time:   1 minute 39.531 seconds

    TimeoutError('')
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/controller_upgrade_test.py", line 98, in test_updating_cluster_when_executing_operations
    admin_fuzz.wait(num_executed_before_restart + 2, 240)
  File "/root/tests/rptest/services/admin_ops_fuzzer.py", line 588, in wait
    wait_until(check, timeout_sec=timeout, backoff_sec=2)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 53, in wait_until
    raise e
  File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 44, in wait_until
    if condition():
  File "/root/tests/rptest/services/admin_ops_fuzzer.py", line 576, in check
    raise self.error
  File "/root/tests/rptest/services/admin_ops_fuzzer.py", line 469, in thread_loop
    wait_until(validate_result,
  File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 57, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError
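The traceback shows the `TimeoutError` being raised inside the fuzzer's worker thread (`thread_loop`, while waiting on `validate_result`) and then re-raised from `check` when the test calls `admin_fuzz.wait`. Below is a minimal sketch of that error-propagation pattern. It assumes the worker stores its first exception and the polling side re-raises it. `FuzzerSketch` and `stuck_validation` are hypothetical names. `wait_until` here is a simplified stand-in for ducktape's helper, not its real implementation.

```python
import threading
import time


def wait_until(condition, timeout_sec, backoff_sec=0.1, err_msg=""):
    """Simplified stand-in for ducktape's wait_until (exceptions from condition propagate)."""
    deadline = time.monotonic() + timeout_sec
    while time.monotonic() < deadline:
        if condition():
            return
        time.sleep(backoff_sec)
    raise TimeoutError(err_msg)


class FuzzerSketch:
    """Hypothetical fuzzer: a worker thread runs operations; the test polls progress."""

    def __init__(self, operations):
        self.operations = operations
        self.executed = 0
        self.error = None
        self._thread = threading.Thread(target=self._thread_loop, daemon=True)

    def start(self):
        self._thread.start()

    def _thread_loop(self):
        try:
            for op in self.operations:
                # Per the traceback, the real loop executes an operation and then
                # waits for its validation; here each op is just a callable.
                op()
                self.executed += 1
        except Exception as e:
            self.error = e  # remember the failure for the polling side

    def wait(self, count, timeout):
        def check():
            if self.error is not None:
                raise self.error  # surface the worker's exception in the test thread
            return self.executed >= count

        wait_until(check, timeout_sec=timeout, backoff_sec=0.05)


def stuck_validation():
    # Stands in for a validation loop that never converges and times out.
    raise TimeoutError("validation never converged")


fuzzer = FuzzerSketch([lambda: None, stuck_validation])
fuzzer.start()
try:
    fuzzer.wait(2, timeout=5)
except TimeoutError as e:
    print("surfaced from worker:", e)
```

This is why the failure is reported at `admin_fuzz.wait` in the test even though the root cause is a validation timeout in the background thread.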
rystsov added the kind/bug and ci-failure labels Jan 6, 2023
rystsov self-assigned this Jan 6, 2023
rystsov commented Jan 6, 2023

admin_ops_fuzzer expects the rpk wrapper to throw an exception when alter_topic_config fails, but here the method returned UNKNOWN_SERVER_ERROR without raising. The fuzzer treated that as a success and moved on to validation, which never finishes because the alteration never happened:

[INFO  - 2023-01-06 01:19:43,212 - admin_ops_fuzzer - execute - lineno:174]: Updating topic: fuzzy-operator-5706-kfcknp with: retention.ms=952635
[DEBUG - 2023-01-06 01:19:43,212 - rpk - _execute - lineno:639]: Executing command: ['/var/lib/buildkite-agent/builds/buildkite-amd64-xfs-builders-i-015c40dd418d79786-1/redpanda/redpanda/vbuild/redpanda_installs/ci/bin/rpk', 'topic', '--brokers', 'docker-rp-15:9092,docker-rp-14:9092,docker-rp-13:9092,docker-rp-11:9092,docker-rp-12:9092', 'alter-config', 'fuzzy-operator-5706-kfcknp', '--set', 'retention.ms=952635']
[DEBUG - 2023-01-06 01:19:43,237 - rpk - _execute - lineno:652]: 
TOPIC                       STATUS
fuzzy-operator-5706-kfcknp  UNKNOWN_SERVER_ERROR

[INFO  - 2023-01-06 01:19:43,238 - admin_ops_fuzzer - validate - lineno:183]: Validating topic fuzzy-operator-5706-kfcknp update, expected: retention.ms=952635
[DEBUG - 2023-01-06 01:19:43,238 - rpk - _execute - lineno:639]: Executing command: ['/var/lib/buildkite-agent/builds/buildkite-amd64-xfs-builders-i-015c40dd418d79786-1/redpanda/redpanda/vbuild/redpanda_installs/ci/bin/rpk', 'topic', '--brokers', 'docker-rp-15:9092,docker-rp-14:9092,docker-rp-11:9092,docker-rp-12:9092,docker-rp-13:9092', 'describe', 'fuzzy-operator-5706-kfcknp', '-c']
[DEBUG - 2023-01-06 01:19:43,265 - rpk - _execute - lineno:652]: 
KEY                           VALUE       SOURCE
cleanup.policy                delete      DYNAMIC_TOPIC_CONFIG
compression.type              producer    DEFAULT_CONFIG
max.message.bytes             1048576     DEFAULT_CONFIG
message.timestamp.type        CreateTime  DEFAULT_CONFIG
redpanda.remote.delete        false       DYNAMIC_TOPIC_CONFIG
redpanda.remote.read          false       DEFAULT_CONFIG
redpanda.remote.write         false       DEFAULT_CONFIG
retention.bytes               -1          DEFAULT_CONFIG
retention.local.target.bytes  -1          DEFAULT_CONFIG
retention.local.target.ms     86400000    DEFAULT_CONFIG
retention.ms                  604800000   DEFAULT_CONFIG
segment.bytes                 1073741824  DEFAULT_CONFIG
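The describe output shows why validation could never pass: `retention.ms` still has its default value of 604800000 rather than the requested 952635. Below is a minimal sketch of such a validation check. It assumes the fuzzer compares the VALUE column of `rpk topic describe -c` with the requested setting. `parse_describe_configs` and `update_applied` are hypothetical names, not the fuzzer's actual code.

```python
# Sketch: decide whether a topic config update is visible in `rpk topic describe -c`.
# Function names are hypothetical; the table layout matches the log above.


def parse_describe_configs(output: str) -> dict[str, str]:
    """Map KEY -> VALUE from the KEY/VALUE/SOURCE table printed by rpk."""
    lines = [line for line in output.strip().splitlines() if line.strip()]
    configs = {}
    for line in lines[1:]:  # skip the KEY/VALUE/SOURCE header row
        parts = line.split()
        if len(parts) >= 2:
            configs[parts[0]] = parts[1]
    return configs


def update_applied(describe_output: str, key: str, expected: str) -> bool:
    """True once the topic reports the requested value for `key`."""
    return parse_describe_configs(describe_output).get(key) == expected


# Excerpt of the describe output from the failing run.
DESCRIBE_OUTPUT = """
KEY                           VALUE       SOURCE
cleanup.policy                delete      DYNAMIC_TOPIC_CONFIG
retention.ms                  604800000   DEFAULT_CONFIG
segment.bytes                 1073741824  DEFAULT_CONFIG
"""

# The alteration failed with UNKNOWN_SERVER_ERROR, so retention.ms is still the
# default and this check stays False on every poll until validation times out.
print(update_applied(DESCRIBE_OUTPUT, "retention.ms", "952635"))  # False
```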

jcsp commented Jan 9, 2023

rpk in general does not reliably return non-zero status on failure.

I recently added some logic for throwing on errors, but it wasn't very generic (just raises on "INVALID" in output). This probably needs updating to actually parse rpk's output and check for "OK".
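Here is a minimal sketch of the approach suggested above. It parses the TOPIC/STATUS table that `rpk topic alter-config` prints (as in the log earlier in this thread) and raises unless every row reports `OK`. It assumes the two-column layout shown in that log. `RpkStatusError` and `check_alter_config_output` are hypothetical names, and this is not the code from the linked fix.

```python
# Sketch: treat `rpk topic alter-config` as failed unless every row reports OK.
# RpkStatusError and check_alter_config_output are hypothetical names.


class RpkStatusError(Exception):
    """Raised when rpk prints a per-topic status other than OK."""


def check_alter_config_output(output: str) -> None:
    lines = [line for line in output.strip().splitlines() if line.strip()]
    # Expect a TOPIC/STATUS header followed by at least one topic row.
    if len(lines) < 2 or lines[0].split()[:2] != ["TOPIC", "STATUS"]:
        raise RpkStatusError(f"unexpected rpk output: {output!r}")
    failures = []
    for line in lines[1:]:
        parts = line.split(maxsplit=1)
        status = parts[1].strip() if len(parts) == 2 else "<missing>"
        # Allow-list OK instead of deny-listing known error strings such as INVALID.
        if status != "OK":
            failures.append((parts[0], status))
    if failures:
        raise RpkStatusError(f"alter-config failed: {failures}")


# The output from the failing run in this issue.
failed_run = """
TOPIC                       STATUS
fuzzy-operator-5706-kfcknp  UNKNOWN_SERVER_ERROR
"""

try:
    check_alter_config_output(failed_run)
except RpkStatusError as e:
    # prints: alter-config failed: [('fuzzy-operator-5706-kfcknp', 'UNKNOWN_SERVER_ERROR')]
    print(e)
```

Matching on `OK` rather than on known error strings means an unexpected code such as UNKNOWN_SERVER_ERROR fails loudly instead of being mistaken for success. That mistaken success is the gap that caused this timeout.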

jcsp added the area/tests label Jan 9, 2023
rystsov commented Jan 9, 2023

@jcsp I've already done it; there is an open PR.

jcsp commented Jan 9, 2023

Oops, sorry, I wasn't reading carefully enough.

mmaslankaprv added a commit to mmaslankaprv/redpanda that referenced this issue Jan 10, 2023
Many tests became unstable after validation was fixed in the admin
operations fuzzer, so we disable validating the result of topic
configuration alterations to avoid disrupting the normal development
process with constantly failing tests.

Related: redpanda-data#8083, redpanda-data#8102

Signed-off-by: Michal Maslanka <[email protected]>
mmaslankaprv added a commit to mmaslankaprv/redpanda that referenced this issue Jan 10, 2023
Many tests became unstable after validation was fixed in the admin
operations fuzzer, so we disable validating the results of admin
operations to avoid disrupting the normal development process with
constantly failing tests.

Related: redpanda-data#8083, redpanda-data#8102

Signed-off-by: Michal Maslanka <[email protected]>
rystsov changed the title to CI Failure (Timeout) in ControllerUpgradeTest.test_updating_cluster_when_executing_operations Jan 10, 2023
rystsov added a commit that referenced this issue Jan 10, 2023
Fix #8083: Timeout in ControllerUpgradeTest.test_updating_cluster_when_executing_operations