CI Failure - Timeout (Missing columns: EPOCH) in NodeOperationFuzzyTest.test_node_operations #8381

Closed
rystsov opened this issue Jan 24, 2023 · 6 comments · Fixed by #8444
Labels: ci-failure, kind/bug (Something isn't working)

rystsov commented Jan 24, 2023

https://buildkite.com/redpanda/vtools/builds/5399#0185dda8-ff9a-47c1-8b69-82b48d8ab625

Module: rptest.scale_tests.node_operations_fuzzy_test
Class:  NodeOperationFuzzyTest
Method: test_node_operations
Arguments:
{
  "compacted_topics": true,
  "enable_failures": true,
  "num_to_upgrade": 0
}

For generic errors like Timeout, looking only at report.txt isn't enough; we should download the archive and check test_log.debug.

report.txt

test_id:    rptest.scale_tests.node_operations_fuzzy_test.NodeOperationFuzzyTest.test_node_operations.enable_failures=True.num_to_upgrade=0.compacted_topics=True
status:     FAIL
run time:   11 minutes 9.047 seconds

    TimeoutError('')
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/mark/_mark.py", line 476, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/scale_tests/node_operations_fuzzy_test.py", line 152, in test_node_operations
    executor.execute_operation(op)
  File "/home/ubuntu/redpanda/tests/rptest/utils/node_operations.py", line 359, in execute_operation
    self.recommission(operation.node)
  File "/home/ubuntu/redpanda/tests/rptest/utils/node_operations.py", line 301, in recommission
    wait_until(recommissioned, timeout_sec=self.timeout, backoff_sec=1)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/utils/util.py", line 58, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError
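
The empty TimeoutError('') above is consistent with the recommission call passing no err_msg to wait_until, which is why report.txt alone says so little. A minimal sketch of that polling pattern follows. It is a simplified stand-in written for illustration, not ducktape's actual implementation, and WaitTimeoutError is a hypothetical local name.

```python
import time


class WaitTimeoutError(Exception):
    """Hypothetical stand-in for ducktape.errors.TimeoutError."""


def wait_until(condition, timeout_sec, backoff_sec=0.1, err_msg=""):
    """Poll condition() until it is truthy or timeout_sec has elapsed."""
    deadline = time.monotonic() + timeout_sec
    while True:
        if condition():
            return
        if time.monotonic() >= deadline:
            # err_msg may be a plain string or a zero-argument callable.
            raise WaitTimeoutError(err_msg() if callable(err_msg) else err_msg)
        time.sleep(backoff_sec)
```

Passing an err_msg (for example, one naming the node being recommissioned) would make the failure self-describing in the report.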

test_log.debug

[DEBUG - 2023-01-23 10:40:52,374 - rpk - _execute - lineno:749]: Executing command: ['/opt/redpanda/bin/rpk', 'topic', '--brokers', 'ip-172-31-7-19:9092,ip-172-31-12-253:9092,ip-172-31-11-215:9092,ip-172-31-9-194:9092', 'describe', 'fuzzy-operator-4269-ubxeca', '-p']
[DEBUG - 2023-01-23 10:40:52,402 - rpk - _execute - lineno:762]: 
PARTITION  LEADER  REPLICAS  LOG-START-OFFSET  HIGH-WATERMARK

[ERROR - 2023-01-23 10:40:52,402 - rpk - describe_topic - lineno:360]: Missing columns: EPOCH
[INFO  - 2023-01-23 10:40:52,402 - admin_ops_fuzzer - execute_with_retries - lineno:513]: Operation: RedpandaAdminOperation.ADD_PARTITIONS, retries left: 1/5
Traceback (most recent call last):
  File "/home/ubuntu/redpanda/tests/rptest/services/admin_ops_fuzzer.py", line 510, in execute_with_retries
    return op.execute(self.operation_ctx)
  File "/home/ubuntu/redpanda/tests/rptest/services/admin_ops_fuzzer.py", line 217, in execute
    list(rpk.describe_topic(self.topic, tolerant=True)))
  File "/home/ubuntu/redpanda/tests/rptest/clients/rpk.py", line 361, in describe_topic
    raise RpkException(f"Missing columns: {missing_columns}")
rptest.clients.rpk.RpkException: RpkException<Missing columns: EPOCH>

rystsov added the kind/bug and ci-failure labels on Jan 24, 2023

rystsov commented Jan 24, 2023

@twmb we parse the output of rpk topic describe -p, but in some cases it includes an EPOCH column and in others it doesn't. Do you know why this happens and which output we should expect?

rystsov commented Jan 24, 2023

https://buildkite.com/redpanda/vtools/builds/5399#0185dda8-ff9a-47c1-8b69-82b48d8ab625

Module: rptest.scale_tests.node_operations_fuzzy_test
Class:  NodeOperationFuzzyTest
Method: test_node_operations
Arguments:
{
  "compacted_topics": false,
  "enable_failures": false,
  "num_to_upgrade": 0
}

twmb commented Jan 25, 2023

We print an EPOCH column if any partition has an epoch other than -1. Ever since Redpanda's KIP-320 implementation, I expected that LeaderEpoch would never be -1: the first epoch for any partition should be 0. Is that not the case?
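
The rule described here can be modeled in a few lines. This is an illustrative Python sketch, not rpk's code (rpk is written in Go); the function name describe_headers and the position of EPOCH among the columns are assumptions.

```python
def describe_headers(epochs):
    """Return the table header for a list of per-partition leader epochs.

    EPOCH is included only if at least one partition reports an epoch
    other than -1, mirroring the rule described in the comment above.
    """
    headers = ["PARTITION", "LEADER"]
    if any(epoch != -1 for epoch in epochs):
        headers.append("EPOCH")  # column position is an assumption
    headers += ["REPLICAS", "LOG-START-OFFSET", "HIGH-WATERMARK"]
    return headers
```

Note that under this rule an empty partition list also produces a header without EPOCH, because any() over an empty sequence is False.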

rystsov self-assigned this on Jan 26, 2023

rystsov commented Jan 26, 2023

Well, the failing test indicates that we may still return -1. Are you sure it is an invalid value even in cases when there is a non-coordinator error or some other kind of error? If you're certain, I'll add an assert and then we'll investigate when it's triggered.

rystsov commented Jan 26, 2023

But in this case the output is empty, which I think explains the missing EPOCH column:

[DEBUG - 2023-01-23 10:40:52,374 - rpk - _execute - lineno:749]: Executing command: ['/opt/redpanda/bin/rpk', 'topic', '--brokers', 'ip-172-31-7-19:9092,ip-172-31-12-253:9092,ip-172-31-11-215:9092,ip-172-31-9-194:9092', 'describe', 'fuzzy-operator-4269-ubxeca', '-p']
[DEBUG - 2023-01-23 10:40:52,402 - rpk - _execute - lineno:762]: 
PARTITION  LEADER  REPLICAS  LOG-START-OFFSET  HIGH-WATERMARK

[ERROR - 2023-01-23 10:40:52,402 - rpk - describe_topic - lineno:360]: Missing columns: EPOCH
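
One way a test-side parser could tolerate this case is to treat a header-only table as "no partitions" before checking for required columns. The sketch below is hypothetical and is not the code in rptest/clients/rpk.py; parse_partitions and its required parameter are illustrative names, and it assumes columns are separated by at least two spaces, as in the log above.

```python
import re


def parse_partitions(output, required=("PARTITION", "LEADER", "EPOCH")):
    """Parse a whitespace-aligned table into a list of row dicts."""
    lines = [ln for ln in output.splitlines() if ln.strip()]
    if not lines:
        return []
    # Split on runs of 2+ spaces so values such as "[1 2 3]" stay intact.
    headers = re.split(r"\s{2,}", lines[0].strip())
    rows = [dict(zip(headers, re.split(r"\s{2,}", ln.strip())))
            for ln in lines[1:]]
    if not rows:
        # Header only: nothing was reported, so a conditionally printed
        # column such as EPOCH cannot be expected to appear.
        return []
    missing = [col for col in required if col not in headers]
    if missing:
        raise ValueError(f"Missing columns: {', '.join(missing)}")
    return rows
```

With this shape, the header-only output shown above yields an empty list, while a populated table that lacks EPOCH still raises.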

twmb commented Jan 27, 2023

@rystsov the describe output was written a while before Redpanda supported leader epochs -- it was basically written preemptively so that nothing needed to be done once KIP-320 was implemented. Now that Redpanda does support KIP-320, we can change the code to always print the epoch column. I expect we'd receive a -1 if there are response errors, but in no other case.
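
The proposed behavior (always print EPOCH, with -1 only when the response carried an error) might look like the following. This is a Python sketch for illustration only; rpk itself is Go, and PartitionInfo, its fields, and render are hypothetical names. Only three columns are rendered to keep the example short.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PartitionInfo:
    partition: int
    leader: int
    leader_epoch: Optional[int] = None
    error: Optional[str] = None  # set when the response carried an error


def render(partitions):
    """Render a table whose header always includes EPOCH."""
    headers = ["PARTITION", "LEADER", "EPOCH"]
    rows = [headers]
    for p in partitions:
        # Fall back to -1 only when the epoch is unknown or errored.
        failed = p.error is not None or p.leader_epoch is None
        epoch = -1 if failed else p.leader_epoch
        rows.append([str(p.partition), str(p.leader), str(epoch)])
    widths = [max(len(r[i]) for r in rows) for i in range(len(headers))]
    return "\n".join(
        "  ".join(cell.ljust(w) for cell, w in zip(r, widths)).rstrip()
        for r in rows
    )
```

Because the header no longer depends on the data, an empty result still contains the EPOCH column and a parser that requires it would not fail.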
