CI Failure (search victim assert) in ControllerEraseTest.test_erase_controller_log #8217

Closed
rystsov opened this issue Jan 13, 2023 · 11 comments · Fixed by #8400, #11350, #11970 or #16495
Labels: area/controller, ci-failure, kind/bug, sev/low

Comments

@rystsov
Contributor

rystsov commented Jan 13, 2023

https://buildkite.com/redpanda/redpanda/builds/21131#0185aa14-d78f-4589-83fe-33c79c1b9029

Module: rptest.tests.controller_erase_test
Class:  ControllerEraseTest
Method: test_erase_controller_log
Arguments:
{
  "partial": true
}
test_id:    rptest.tests.controller_erase_test.ControllerEraseTest.test_erase_controller_log.partial=True
status:     FAIL
run time:   32.670 seconds

    AssertionError()
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 476, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/controller_erase_test.py", line 99, in test_erase_controller_log
    assert self.redpanda.search_log_node(victim_node,
AssertionError
@piyushredpanda
Contributor

@VadimPlh : can you please help with this?

@VadimPlh
Contributor

Problem

The test deletes the last controller log segment from disk; after startup, the node should report that last_applied_offset is not present on disk.
In this test run, however, the node had only replicated the last controller log segment and had not yet applied it. So last_applied_offset pointed into a different segment on disk, one the test did not delete.

The problem is that the test doesn't wait for the node's controller last_applied to advance far enough before restarting it.

Fix

Chatted with @jcsp and @mmaslankaprv.
The idea is to add a new admin API endpoint, something like /v1/debug/controller_status, that reports the controller log HWM and its last_applied. In the test we should query the status of all nodes and wait until all of them have applied the new offset.
In the future we can move the sync point inside the admin API call, so that the node itself is responsible for the barrier.
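A rough sketch of the test-side barrier described above. The status shape (a dict with a `last_applied_offset` key), the `get_status` callable, and both function names are assumptions for illustration, not the actual admin API client; in a real test `get_status` would wrap whatever call returns the controller status for a node.

```python
import time
from typing import Callable, Dict, Iterable, List

# Hypothetical status payload: {"last_applied_offset": int, ...}
Status = Dict[str, int]


def lagging_nodes(nodes: Iterable[str],
                  get_status: Callable[[str], Status],
                  target_offset: int) -> List[str]:
    """Return the nodes whose controller last_applied is below the target."""
    return [
        n for n in nodes
        if get_status(n)["last_applied_offset"] < target_offset
    ]


def wait_all_applied(nodes: Iterable[str],
                     get_status: Callable[[str], Status],
                     target_offset: int,
                     timeout_sec: float = 30.0,
                     backoff_sec: float = 0.5,
                     sleep: Callable[[float], None] = time.sleep,
                     clock: Callable[[], float] = time.monotonic) -> None:
    """Poll every node until all have applied target_offset, or time out."""
    nodes = list(nodes)
    deadline = clock() + timeout_sec
    while True:
        behind = lagging_nodes(nodes, get_status, target_offset)
        if not behind:
            return
        if clock() >= deadline:
            raise TimeoutError(
                f"nodes still behind offset {target_offset}: {behind}")
        sleep(backoff_sec)
```

The test would call `wait_all_applied(...)` right before stopping the victim node and deleting segments, so the restart can no longer race with the apply loop.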

@andijcr
Contributor

andijcr commented Jun 5, 2023

@andijcr
Contributor

andijcr commented Jun 7, 2023

@andijcr
Contributor

andijcr commented Jun 9, 2023

@michael-redpanda
Contributor

@twmb
Contributor

twmb commented Jun 11, 2023

@twmb twmb mentioned this issue Jun 11, 2023
7 tasks
@mmaslankaprv mmaslankaprv added the sev/low Bugs which are non-functional paper cuts, e.g. typos, issues in log messages label Jul 10, 2023
@mmaslankaprv
Member

The test selected a segment that contained only unapplied (dirty) data, so after the segment was deleted there was no inconsistency in the data.

mmaslankaprv added a commit to mmaslankaprv/redpanda that referenced this issue Jul 10, 2023
Wait for all victim node records to be applied. If a victim node
contains records that were not applied or are about to be truncated,
the test should wait before selecting segments to trim, because
removing a segment that contains only dirty records will not cause
an inconsistency.

Fixes: redpanda-data#8217

Signed-off-by: Michal Maslanka <[email protected]>
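To make the point about dirty-only segments concrete, here is a small sketch of the selection rule. The `Segment` type and both function names are hypothetical and are not taken from the test; the only idea borrowed from the commit message is that removing a segment matters only if it holds at least one applied record.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Segment:
    base_offset: int  # first offset stored in the segment
    last_offset: int  # last offset stored in the segment


def removal_is_detectable(segment: Segment, last_applied_offset: int) -> bool:
    """A segment whose records are all above last_applied holds only dirty
    data, so deleting it leaves nothing for the node to notice on restart."""
    return segment.base_offset <= last_applied_offset


def pick_victim_segment(segments: List[Segment],
                        last_applied_offset: int) -> Optional[Segment]:
    """Pick the last segment, but only if deleting it would be noticed."""
    last = max(segments, key=lambda s: s.base_offset, default=None)
    if last is not None and removal_is_detectable(last, last_applied_offset):
        return last
    return None  # caller should keep waiting for records to be applied
```

With segments covering offsets 0-9 and 10-10, `pick_victim_segment` returns nothing while last_applied is 9 and returns the 10-10 segment once last_applied reaches 10.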
mmaslankaprv added a commit to mmaslankaprv/redpanda that referenced this issue Jul 13, 2023
Controller erasure test is supposed to validate if there is a mismatch
between the last appended entry in kvstore and controller max offset. In
order for the test to work correctly we must wait for all the messages
to be committed as we only delete the last segment that contains a
single message (new replicated configuration). In order to make the test
reliable change the condition to wait for the applied offset on the node
where controller log is going to be removed to be equal to the leader
dirty offset.

Fixes: redpanda-data#8217

Signed-off-by: Michal Maslanka <[email protected]>
(cherry picked from commit 57fb4c0)
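The condition in this commit message can be modelled in a few lines. This is only a toy model under assumed names (`log_max_offset`, `startup_detects_mismatch`, `safe_to_delete_last_segment`); segments are plain `(base_offset, last_offset)` tuples and the kvstore is reduced to a single offset.

```python
from typing import List, Tuple

SegmentRange = Tuple[int, int]  # (base_offset, last_offset), assumed shape


def log_max_offset(segments: List[SegmentRange]) -> int:
    """Highest offset present in the on-disk log, or -1 if the log is empty."""
    return max((last for _, last in segments), default=-1)


def startup_detects_mismatch(kvstore_offset: int,
                             segments: List[SegmentRange]) -> bool:
    """The mismatch the test expects: kvstore points past the end of the log."""
    return kvstore_offset > log_max_offset(segments)


def safe_to_delete_last_segment(victim_last_applied: int,
                                leader_dirty_offset: int) -> bool:
    """Wait condition from the commit message. The message says 'equal';
    >= is used here only to tolerate a slightly stale leader reading."""
    return victim_last_applied >= leader_dirty_offset
```

In the flaky run the victim had applied up to offset 10 while the leader's dirty offset was 11: deleting the segment holding offset 11 leaves a log ending at 10, so `startup_detects_mismatch(10, [(0, 10)])` is false and the assert fails. After waiting until the victim has applied 11, the same deletion gives `startup_detects_mismatch(11, [(0, 10)])`, which is true.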
mmaslankaprv added a commit to mmaslankaprv/redpanda that referenced this issue Jul 20, 2023
Controller erasure test is supposed to validate if there is a mismatch
between the last appended entry in kvstore and controller max offset. In
order for the test to work correctly we must wait for all the messages
to be committed as we only delete the last segment that contains a
single message (new replicated configuration). In order to make the test
reliable change the condition to wait for the applied offset on the node
where controller log is going to be removed to be equal to the leader
dirty offset.

Fixes: redpanda-data#8217

Signed-off-by: Michal Maslanka <[email protected]>
(cherry picked from commit 57fb4c0)
mmaslankaprv added a commit to mmaslankaprv/redpanda that referenced this issue Jul 24, 2023
Wait for all victim node records to be applied. If a victim node
contains records that were not applied or are about to be truncated,
the test should wait before selecting segments to trim, because
removing a segment that contains only dirty records will not cause
an inconsistency.

Fixes: redpanda-data#8217

Signed-off-by: Michal Maslanka <[email protected]>
(cherry picked from commit 47a2c05)
@mmaslankaprv mmaslankaprv reopened this Feb 6, 2024
mmaslankaprv added a commit to mmaslankaprv/redpanda that referenced this issue Feb 6, 2024
When testing partial deletion the test selects a segment to remove from
controller log. Before deleting the segment but after it was selected it
may be accounted in the controller snapshot.

Disabled controller snapshot to prevent the test racing with
snapshot creation.

Fixes: redpanda-data#8217

Signed-off-by: Michal Maslanka <[email protected]>
vbotbuildovich pushed a commit to vbotbuildovich/redpanda that referenced this issue Feb 9, 2024
When testing partial deletion the test selects a segment to remove from
controller log. Before deleting the segment but after it was selected it
may be accounted in the controller snapshot.

Disabled controller snapshot to prevent the test racing with
snapshot creation.

Fixes: redpanda-data#8217

Signed-off-by: Michal Maslanka <[email protected]>
(cherry picked from commit f978777)
vbotbuildovich pushed a commit to vbotbuildovich/redpanda that referenced this issue May 16, 2024
When testing partial deletion the test selects a segment to remove from
controller log. Before deleting the segment but after it was selected it
may be accounted in the controller snapshot.

Disabled controller snapshot to prevent the test racing with
snapshot creation.

Fixes: redpanda-data#8217

Signed-off-by: Michal Maslanka <[email protected]>
(cherry picked from commit f978777)