Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CI Failure (NodeCrash "Attempted to upgrade from incompatible logical version 10 to logical version 12!") in FeaturesUpgradeAssertionTest.test_upgrade_assertion #11275

Closed
andijcr opened this issue Jun 7, 2023 · 3 comments · Fixed by #11387
Assignees
Labels
ci-failure kind/bug Something isn't working

Comments

@andijcr
Copy link
Contributor

andijcr commented Jun 7, 2023

arm
https://buildkite.com/redpanda/vtools/builds/7957#01889253-b38f-4abd-9b03-68e462adb10f

Module: rptest.tests.cluster_features_test
Class:  FeaturesUpgradeAssertionTest
Method: test_upgrade_assertion
====================================================================================================
test_id:    rptest.tests.cluster_features_test.FeaturesUpgradeAssertionTest.test_upgrade_assertion
status:     FAIL
run time:   59.383 seconds


    <NodeCrash ip-172-31-14-62: ERROR 2023-06-07 02:52:07,838 [shard 0] assert - Assert failure: (/var/lib/buildkite-agent/builds/buildkite-arm64-builders-i-0e3b044e0c7c4107c-1/redpanda/redpanda/src/v/features/feature_table.cc:655) 'false' Attempted to upgrade from incompatible logical version 10 to logical version 12!
>
Traceback (most recent call last):
  File "/home/ubuntu/redpanda/tests/rptest/services/cluster.py", line 79, in wrapped
    r = f(self, *args, **kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/tests/cluster_features_test.py", line 536, in test_upgrade_assertion
    self.redpanda.stop_node(upgrade_node, forced=True)
  File "/home/ubuntu/redpanda/tests/rptest/services/redpanda.py", line 2791, in stop_node
    wait_until(
  File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 57, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError: Redpanda node ip-172-31-14-62 failed to stop in 30 seconds

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/home/ubuntu/redpanda/tests/rptest/services/cluster.py", line 100, in wrapped
    redpanda.raise_on_crash()
  File "/home/ubuntu/redpanda/tests/rptest/services/redpanda.py", line 2402, in raise_on_crash
    raise NodeCrash(crashes)
rptest.services.utils.NodeCrash: <NodeCrash ip-172-31-14-62: ERROR 2023-06-07 02:52:07,838 [shard 0] assert - Assert failure: (/var/lib/buildkite-agent/builds/buildkite-arm64-builders-i-0e3b044e0c7c4107c-1/redpanda/redpanda/src/v/features/feature_table.cc:655) 'false' Attempted to upgrade from incompatible logical version 10 to logical version 12!
>

@andijcr andijcr added kind/bug Something isn't working ci-failure sev/high loss of availability, pathological performance degradation, recoverable corruption labels Jun 7, 2023
@rockwotj rockwotj self-assigned this Jun 8, 2023
@andijcr
Copy link
Contributor Author

andijcr commented Jun 9, 2023

@rockwotj
Copy link
Contributor

This is a class of "failed to stop in 30 seconds" bugs, the assertion that is hit is explicitly tested for and the raise_on_crash function isn't explicitly checking for allowed log lines. I'll put in a fix for the misleading error message, but this isn't sev/high as it's an expected crash in the test.

@rockwotj rockwotj removed the sev/high loss of availability, pathological performance degradation, recoverable corruption label Jun 12, 2023
rockwotj added a commit to rockwotj/redpanda that referenced this issue Jun 12, 2023
In redpanda-data#11275 we get a false positive detecting a crash because of an
expected log line. Let's ignore those when looking for crashes.

Note the test is still failing, but due to a different reason than what
this function is calling.

Signed-off-by: Tyler Rockwood <[email protected]>
rockwotj added a commit to rockwotj/redpanda that referenced this issue Jun 12, 2023
In redpanda-data#11275 we get a false positive detecting a crash because of an
expected log line. Let's ignore those when looking for crashes.

Note the test is still failing, but due to a different reason than what
this function is calling.

Signed-off-by: Tyler Rockwood <[email protected]>
rockwotj added a commit to rockwotj/redpanda that referenced this issue Jun 12, 2023
In redpanda-data#11275 we get a false positive detecting a crash because of an
expected log line. Let's ignore those when looking for crashes.

Note the test is still failing, but due to a different reason than what
this function is calling.

Signed-off-by: Tyler Rockwood <[email protected]>
rockwotj added a commit to rockwotj/redpanda that referenced this issue Jun 13, 2023
It seems like redpanda_pid can pick up false pids.

When debugging redpanda-data#11275, it seems like there is an apport process who's
pid is being picked up. Here's a process dump that includes the line.
There are no running redpanda processes, but checking for redpanda
processes times out, leading me to believe that this process is being
picked up.

In that case `stop_node` can fail if the node is already failed in CDT,
as this process is picked up.

```
[DEBUG - 2023-06-12 02:54:07,216 - redpanda - _log_node_process_state - lineno:2215]: root      136962 99.6  6.0 1587948 1452804 ?     R    02:53   0:49 /usr/bin/python3 /usr/shar
e/apport/apport -p136954 -s5 -c18446744073709551615 -d1 -P136954 -u0 -g0 -- !opt!redpanda_installs!head!libexec!redpanda
```

Fixes: redpanda-data#11275

Signed-off-by: Tyler Rockwood <[email protected]>
rockwotj added a commit to rockwotj/redpanda that referenced this issue Jun 13, 2023
It seems like redpanda_pid can pick up false pids in the current grep
filtering approach. Use `pgrep` to not incorrectly pickup a command with
a command line arg of `redpanda`, which seems likely in our environments.

When debugging redpanda-data#11275, it seems like there is an apport process who's
pid is being picked up. Here's a process dump that includes the line.
There are no running redpanda processes, but checking for redpanda
processes times out, leading me to believe that this process is being
picked up.

In that case `stop_node` can fail if the node is already failed in CDT,
as this process is picked up.

```
[DEBUG - 2023-06-12 02:54:07,216 - redpanda - _log_node_process_state - lineno:2215]: root      136962 99.6  6.0 1587948 1452804 ?     R    02:53   0:49 /usr/bin/python3 /usr/shar
e/apport/apport -p136954 -s5 -c18446744073709551615 -d1 -P136954 -u0 -g0 -- !opt!redpanda_installs!head!libexec!redpanda
```

Fixes: redpanda-data#11275

Signed-off-by: Tyler Rockwood <[email protected]>
vbotbuildovich pushed a commit to vbotbuildovich/redpanda that referenced this issue Jun 14, 2023
It seems like redpanda_pid can pick up false pids in the current grep
filtering approach. Use `pgrep` to not incorrectly pickup a command with
a command line arg of `redpanda`, which seems likely in our environments.

When debugging redpanda-data#11275, it seems like there is an apport process who's
pid is being picked up. Here's a process dump that includes the line.
There are no running redpanda processes, but checking for redpanda
processes times out, leading me to believe that this process is being
picked up.

In that case `stop_node` can fail if the node is already failed in CDT,
as this process is picked up.

```
[DEBUG - 2023-06-12 02:54:07,216 - redpanda - _log_node_process_state - lineno:2215]: root      136962 99.6  6.0 1587948 1452804 ?     R    02:53   0:49 /usr/bin/python3 /usr/shar
e/apport/apport -p136954 -s5 -c18446744073709551615 -d1 -P136954 -u0 -g0 -- !opt!redpanda_installs!head!libexec!redpanda
```

Fixes: redpanda-data#11275

Signed-off-by: Tyler Rockwood <[email protected]>
(cherry picked from commit 4a2985f)
vbotbuildovich pushed a commit to vbotbuildovich/redpanda that referenced this issue Jun 14, 2023
It seems like redpanda_pid can pick up false pids in the current grep
filtering approach. Use `pgrep` to not incorrectly pickup a command with
a command line arg of `redpanda`, which seems likely in our environments.

When debugging redpanda-data#11275, it seems like there is an apport process who's
pid is being picked up. Here's a process dump that includes the line.
There are no running redpanda processes, but checking for redpanda
processes times out, leading me to believe that this process is being
picked up.

In that case `stop_node` can fail if the node is already failed in CDT,
as this process is picked up.

```
[DEBUG - 2023-06-12 02:54:07,216 - redpanda - _log_node_process_state - lineno:2215]: root      136962 99.6  6.0 1587948 1452804 ?     R    02:53   0:49 /usr/bin/python3 /usr/shar
e/apport/apport -p136954 -s5 -c18446744073709551615 -d1 -P136954 -u0 -g0 -- !opt!redpanda_installs!head!libexec!redpanda
```

Fixes: redpanda-data#11275

Signed-off-by: Tyler Rockwood <[email protected]>
(cherry picked from commit 4a2985f)
vbotbuildovich pushed a commit to vbotbuildovich/redpanda that referenced this issue Jun 14, 2023
It seems like redpanda_pid can pick up false pids in the current grep
filtering approach. Use `pgrep` to not incorrectly pickup a command with
a command line arg of `redpanda`, which seems likely in our environments.

When debugging redpanda-data#11275, it seems like there is an apport process who's
pid is being picked up. Here's a process dump that includes the line.
There are no running redpanda processes, but checking for redpanda
processes times out, leading me to believe that this process is being
picked up.

In that case `stop_node` can fail if the node is already failed in CDT,
as this process is picked up.

```
[DEBUG - 2023-06-12 02:54:07,216 - redpanda - _log_node_process_state - lineno:2215]: root      136962 99.6  6.0 1587948 1452804 ?     R    02:53   0:49 /usr/bin/python3 /usr/shar
e/apport/apport -p136954 -s5 -c18446744073709551615 -d1 -P136954 -u0 -g0 -- !opt!redpanda_installs!head!libexec!redpanda
```

Fixes: redpanda-data#11275

Signed-off-by: Tyler Rockwood <[email protected]>
(cherry picked from commit 4a2985f)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ci-failure kind/bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants