Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CI Failure (timeout on wait_until recommissioned) in RandomNodeOperationsTest.test_node_operations #8179

Closed
rystsov opened this issue Jan 12, 2023 · 12 comments · Fixed by #8437 or #8493
Assignees
Labels

Comments

@rystsov
Copy link
Contributor

rystsov commented Jan 12, 2023

https://buildkite.com/redpanda/redpanda/builds/21032#0185a377-7e27-497a-8ff3-ce678f93793a

Module: rptest.tests.random_node_operations_test
Class:  RandomNodeOperationsTest
Method: test_node_operations
Arguments:
{
  "enable_failures": false
}
====================================================================================================
test_id:    rptest.tests.random_node_operations_test.RandomNodeOperationsTest.test_node_operations.enable_failures=False
status:     FAIL
run time:   9 minutes 41.370 seconds

    TimeoutError('')
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 476, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/root/tests/rptest/utils/mode_checks.py", line 63, in f
    return func(*args, **kwargs)
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/random_node_operations_test.py", line 103, in test_node_operations
    executor.execute_operation(op)
  File "/root/tests/rptest/utils/node_operations.py", line 275, in execute_operation
    self.recommission(operation.node)
  File "/root/tests/rptest/utils/node_operations.py", line 217, in recommission
    wait_until(recommissioned, timeout_sec=self.timeout, backoff_sec=1)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 57, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError

see 58th retry

@abhijat
Copy link
Contributor

abhijat commented Jan 18, 2023

https://buildkite.com/redpanda/vtools/builds/5252#0185b99b-8c5e-4e0a-b084-b0ac2f2cebe0

rptest.tests.random_node_operations_test.RandomNodeOperationsTest.test_node_operations.enable_failures=False: Summary: TimeoutError('')
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/mark/_mark.py", line 476, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/utils/mode_checks.py", line 63, in f
    return func(*args, **kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/tests/random_node_operations_test.py", line 103, in test_node_operations
    executor.execute_operation(op)
  File "/home/ubuntu/redpanda/tests/rptest/utils/node_operations.py", line 359, in execute_operation
    self.recommission(operation.node)
  File "/home/ubuntu/redpanda/tests/rptest/utils/node_operations.py", line 301, in recommission
    wait_until(recommissioned, timeout_sec=self.timeout, backoff_sec=1)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/utils/util.py", line 58, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError

@andijcr
Copy link
Contributor

andijcr commented Jan 24, 2023

https://buildkite.com/redpanda/redpanda/builds/21730#0185e271-3c89-4a4d-b423-66adeeb98842

FAIL test: RandomNodeOperationsTest.test_node_operations.enable_failures=True (4/20 runs)
  failure at 2023-01-24T07:42:04.626Z: TimeoutError('')
      on (arm64, container) in job https://buildkite.com/redpanda/redpanda/builds/21730#0185e271-3c89-4a4d-b423-66adeeb98842

@BenPope
Copy link
Member

BenPope commented Jan 26, 2023

@rystsov
Copy link
Contributor Author

rystsov commented Jan 27, 2023

@rystsov
Copy link
Contributor Author

rystsov commented Jan 27, 2023

The test is trying to update a topic which was deleted. It looks like it's the same issue as #8437 fixes

[INFO  - 2023-01-27 00:04:15,967 - admin_ops_fuzzer - execute - lineno:90]: Creating topic with name fuzzy-operator-9200-smnhdk, replication: 3 partitions: 1
[INFO  - 2023-01-27 00:08:14,350 - admin_ops_fuzzer - execute - lineno:124]: Deleting topic: fuzzy-operator-9200-smnhdk
[INFO  - 2023-01-27 00:08:14,384 - admin_ops_fuzzer - validate - lineno:131]: Validating topic fuzzy-operator-9200-smnhdk deletion
[INFO  - 2023-01-27 00:08:15,419 - admin_ops_fuzzer - validate - lineno:131]: Validating topic fuzzy-operator-9200-smnhdk deletion
[INFO  - 2023-01-27 00:08:16,453 - admin_ops_fuzzer - validate - lineno:131]: Validating topic fuzzy-operator-9200-smnhdk deletion
[INFO  - 2023-01-27 00:08:17,495 - admin_ops_fuzzer - validate - lineno:131]: Validating topic fuzzy-operator-9200-smnhdk deletion
[INFO  - 2023-01-27 00:08:52,550 - admin_ops_fuzzer - execute - lineno:174]: Updating topic: fuzzy-operator-9200-smnhdk with: cleanup.policy=delete
[INFO  - 2023-01-27 00:08:57,594 - admin_ops_fuzzer - validate - lineno:183]: Validating topic fuzzy-operator-9200-smnhdk update, expected: cleanup.policy=delete
[INFO  - 2023-01-27 00:09:02,634 - admin_ops_fuzzer - validate - lineno:183]: Validating topic fuzzy-operator-9200-smnhdk update, expected: cleanup.policy=delete
[INFO  - 2023-01-27 00:09:07,674 - admin_ops_fuzzer - validate - lineno:183]: Validating topic fuzzy-operator-9200-smnhdk update, expected: cleanup.policy=delete
[INFO  - 2023-01-27 00:09:12,713 - admin_ops_fuzzer - validate - lineno:183]: Validating topic fuzzy-operator-9200-smnhdk update, expected: cleanup.policy=delete
[DEBUG - 2023-01-27 00:09:12,714 - rpk - _execute - lineno:752]: Executing command: ['/var/lib/buildkite-agent/builds/buildkite-amd64-xfs-builders-i-0e7195169efbf4837-1/redpanda/redpanda/vbuild/redpanda_installs/ci/bin/rpk', 'topic', '--brokers', 'docker-rp-22:9092,docker-rp-10:9092,docker-rp-21:9092,docker-rp-19:9092', 'describe', 'fuzzy-operator-9200-smnhdk', '-c']
rptest.clients.rpk.RpkException: RpkException<command /var/lib/buildkite-agent/builds/buildkite-amd64-xfs-builders-i-0e7195169efbf4837-1/redpanda/redpanda/vbuild/redpanda_installs/ci/bin/rpk topic --brokers docker-rp-22:9092,docker-rp-10:9092,docker-rp-21:9092,docker-rp-19:9092 describe fuzzy-operator-9200-smnhdk -c returned 1, output:  error: config response contained error: UNKNOWN_TOPIC_OR_PARTITION: This server does not host this topic-partition.

@rystsov
Copy link
Contributor Author

rystsov commented Jan 29, 2023

It looks like we have a comeback - https://buildkite.com/redpanda/redpanda/builds/22029

test_id:    rptest.tests.random_node_operations_test.RandomNodeOperationsTest.test_node_operations.enable_failures=True
status:     FAIL
run time:   10 minutes 57.416 seconds

    TimeoutError('')
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 476, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/root/tests/rptest/utils/mode_checks.py", line 63, in f
    return func(*args, **kwargs)
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/random_node_operations_test.py", line 103, in test_node_operations
    executor.execute_operation(op)
  File "/root/tests/rptest/utils/node_operations.py", line 359, in execute_operation
    self.recommission(operation.node)
  File "/root/tests/rptest/utils/node_operations.py", line 301, in recommission
    wait_until(recommissioned, timeout_sec=self.timeout, backoff_sec=1)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 57, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError

@dotnwat
Copy link
Member

dotnwat commented Jan 29, 2023

@rystsov

I scrolled up further in the log before the timeout error and I see

[DEBUG - 2023-01-29 02:16:18,793 - admin_ops_fuzzer - thread_loop - lineno:500]: Operation: RedpandaAdminOperation.DELETE_USER
Traceback (most recent call last):
  File "/root/tests/rptest/services/admin_ops_fuzzer.py", line 489, in thread_loop
    if self.execute_with_retries(op_type, op):
  File "/root/tests/rptest/services/admin_ops_fuzzer.py", line 528, in execute_with_retries
    raise error
  File "/root/tests/rptest/services/admin_ops_fuzzer.py", line 519, in execute_with_retries
    if op.validate(self.operation_ctx):
  File "/root/tests/rptest/services/admin_ops_fuzzer.py", line 301, in validate
    users = ctx.admin().list_users()
  File "/root/tests/rptest/services/admin.py", line 649, in list_users
    return self._request("get", "security/users", node=node).json()
  File "/root/tests/rptest/services/admin.py", line 288, in _request
    node = random.choice(self.redpanda.started_nodes())
  File "/usr/lib/python3.10/random.py", line 378, in choice
    return seq[self._randbelow(len(seq))]
IndexError: list index out of range

which sort of makes me think that the timeout error is a symptom of this other in the test that is failing? looks a bit like another parsing issue?

@rystsov
Copy link
Contributor Author

rystsov commented Jan 29, 2023

@dotnwat keep scrolling :) this error happens after the original, then the test shuts down redpanda, empties started_nodes and the still active background thread fails with IndexError

@rystsov rystsov assigned mmaslankaprv and unassigned bharathv Jan 29, 2023
@dotnwat
Copy link
Member

dotnwat commented Jan 30, 2023

@dotnwat keep scrolling :) this error happens after the original, then the test shuts down redpanda, empties started_nodes and the still active background thread fails with IndexError

oh my! i didn't scroll far enough!

@mmaslankaprv
Copy link
Member

This is a race condition, i need to fix this, it is already fixed with the new simplified raft configuration handling

@mmaslankaprv
Copy link
Member

this issue is related with the recent change in Raft where we do not allow canceling raft reconfiguration if a replica is a laerner in old raft configuration

mmaslankaprv added a commit to mmaslankaprv/redpanda that referenced this issue Jan 30, 2023
…d to learner

A recent change in raft configuration cancellation logic prevents raft
re-configuration canceling when nodes to be removed are demoted to
learners. Since 'moving back' is not longer possible we should prevent
recommissioning a node when it is already a learner in previous part of
raft joint configuration.

The fix doesn't fix the race condition entirely but it reduces the
possibility of it occurring. The race condition will be fixed with an
introduction of simplified raft configuration. Also the race condition
isn't really breaking anything. A user may receive success from
recommission API however the node will simply be removed. It is a
behavior that we accepted for partition movement cancel API.

Fixes: redpanda-data#8179

Signed-off-by: Michal Maslanka <[email protected]>
mmaslankaprv added a commit to mmaslankaprv/redpanda that referenced this issue Jan 30, 2023
…arner

A recent change in raft configuration cancellation logic prevents raft
re-configuration canceling when nodes to be removed are demoted to
learners. Since 'moving back' is not longer possible we should prevent
recommissioning a node when it is already a learner in previous part of
raft joint configuration.

The fix doesn't fix the race condition entirely but it reduces the
possibility of it occurring. The race condition will be fixed with an
introduction of simplified raft configuration. Also the race condition
isn't really breaking anything. A user may receive success from
recommission API however the node will simply be removed. It is a
behavior that we accepted for partition movement cancel API.

Fixes: redpanda-data#8179

Signed-off-by: Michal Maslanka <[email protected]>
@dlex
Copy link
Contributor

dlex commented Mar 7, 2023

vbotbuildovich pushed a commit to vbotbuildovich/redpanda that referenced this issue May 8, 2023
…arner

A recent change in raft configuration cancellation logic prevents raft
re-configuration canceling when nodes to be removed are demoted to
learners. Since 'moving back' is not longer possible we should prevent
recommissioning a node when it is already a learner in previous part of
raft joint configuration.

The fix doesn't fix the race condition entirely but it reduces the
possibility of it occurring. The race condition will be fixed with an
introduction of simplified raft configuration. Also the race condition
isn't really breaking anything. A user may receive success from
recommission API however the node will simply be removed. It is a
behavior that we accepted for partition movement cancel API.

Fixes: redpanda-data#8179

Signed-off-by: Michal Maslanka <[email protected]>
(cherry picked from commit 2a3bf5e)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
9 participants