CI Failure (_topic_remote_deleted timeout) in TopicDeleteCloudStorageTest.topic_delete_installed_snapshots_test #8496

Closed
jcsp opened this issue Jan 30, 2023 · 35 comments · Fixed by #8560, #9187, #10925, #11630 or #11684
Assignees
Labels
area/cloud-storage Shadow indexing subsystem ci-failure kind/bug Something isn't working sev/low Bugs which are non-functional paper cuts, e.g. typos, issues in log messages

Comments

@jcsp
Contributor

jcsp commented Jan 30, 2023

This is a relatively recently added test, first time we've seen it fail.

https://buildkite.com/redpanda/redpanda/builds/22044#01860016-6553-4a49-b948-ebdd36604013

Module: rptest.tests.topic_delete_test
Class:  TopicDeleteCloudStorageTest
Method: topic_delete_installed_snapshots_test
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/topic_delete_test.py", line 269, in topic_delete_installed_snapshots_test
    wait_until(lambda: self._topic_remote_deleted(self.topic),
  File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 57, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError
@jcsp jcsp added kind/bug Something isn't working ci-failure area/cloud-storage Shadow indexing subsystem labels Jan 30, 2023
@jcsp jcsp self-assigned this Jan 30, 2023
@jcsp jcsp added the sev/low Bugs which are non-functional paper cuts, e.g. typos, issues in log messages label Jan 30, 2023
@jcsp
Contributor Author

jcsp commented Jan 30, 2023

A leadership transfer is leading to a double upload of a segment, and only the final upload, the one that ends up in the manifest, is deleted. This is not a bug per se, since the tiered storage design allows some garbage to be left behind after leadership movement, but it is an unnecessary rough edge in the context of an orchestrated leadership transfer.

While we could add some logic to detect and accept this case in the test, maybe it's better to go ahead and amend leadership transfer code to do a gentle shutdown of ntp_archiver during the transfer.
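
To make the failure mode concrete, here is a minimal Python sketch of how a double upload could leave an orphan that a "wait until everything under the topic is gone" check never sees disappear. The object keys and the `delete_topic` and `topic_remote_deleted` functions are hypothetical stand-ins for illustration, not the actual Redpanda or rptest code.

```python
# Hypothetical in-memory model of the bucket; all key names are made up.
PREFIX = "kafka/my-topic/"

# The same segment was uploaded twice, once by each leader (terms 1 and 2).
bucket = {
    PREFIX + "0-1-v1.log.1",  # upload by the old leader, never recorded
    PREFIX + "0-1-v1.log.2",  # upload by the new leader, in the manifest
    PREFIX + "manifest.json",
}

# Only the final upload is referenced by the manifest.
manifest_segments = {PREFIX + "0-1-v1.log.2"}


def delete_topic(bucket, manifest_segments, prefix):
    """Delete what the manifest knows about, plus the manifest itself."""
    known = manifest_segments | {prefix + "manifest.json"}
    return bucket - known


def topic_remote_deleted(bucket, prefix):
    """True once no object under the topic prefix remains."""
    return not any(key.startswith(prefix) for key in bucket)


remaining = delete_topic(bucket, manifest_segments, PREFIX)
orphans = sorted(remaining)
```

In this model `topic_remote_deleted(remaining, PREFIX)` stays false because of the unrecorded first upload, which is the condition a polling check would eventually time out on.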

@jcsp
Contributor Author

jcsp commented Jan 31, 2023

jcsp added a commit to jcsp/redpanda that referenced this issue Feb 1, 2023
This substantially reduces the probability of leaving orphaned
objects in the object store when partitions change leadership
under load, e.g. during upgrades or leader balancing.

This fixes a test failure that indirectly detects orphan
objects by checking that topic deletion clears all objects.

Fixes redpanda-data#8496
@rystsov
Contributor

rystsov commented Feb 3, 2023

@graphcareful
Contributor

jcsp added a commit to jcsp/redpanda that referenced this issue Feb 15, 2023
This substantially reduces the probability of leaving orphaned
objects in the object store when partitions change leadership
under load, e.g. during upgrades or leader balancing.

This fixes a test failure that indirectly detects orphan
objects by checking that topic deletion clears all objects.

Fixes redpanda-data#8496
@mmaslankaprv
Member

Similar error in 3 tests:

@mmaslankaprv mmaslankaprv reopened this Feb 20, 2023
@mmaslankaprv
Member

FAIL test: TopicDeleteCloudStorageTest.topic_delete_unavailable_test (2/34 runs)
failure at 2023-02-17T18:33:51.439Z: TimeoutError('')
on (amd64, container) in job https://buildkite.com/redpanda/redpanda/builds/23485#01866065-c283-4087-9858-85c95a82d98a
FAIL test: TopicDeleteCloudStorageTest.topic_delete_unavailable_test.cloud_storage_type=CloudStorageType.ABS (1/35 runs)
failure at 2023-02-18T07:33:23.813Z: TimeoutError('')
on (amd64, container) in job https://buildkite.com/redpanda/redpanda/builds/23528#01866339-eb65-494f-bdec-1bfba8586722
FAIL test: TopicDeleteCloudStorageTest.topic_delete_unavailable_test.cloud_storage_type=CloudStorageType.S3 (2/36 runs)
failure at 2023-02-19T07:17:24.906Z: TimeoutError('')
on (amd64, container) in job https://buildkite.com/redpanda/redpanda/builds/23538#01866841-6238-478f-8d2d-65f449dfb8ef

@Scandiravian
Contributor

Also just happened to me https://buildkite.com/redpanda/redpanda/builds/23591#01866fa0-8cbf-47c7-b8e2-735317cc8777

Module: rptest.tests.topic_delete_test
Class:  TopicDeleteCloudStorageTest
Method: topic_delete_unavailable_test
Arguments:
{
  "cloud_storage_type": 1
}
====================================================================================================
test_id:    rptest.tests.topic_delete_test.TopicDeleteCloudStorageTest.topic_delete_unavailable_test.cloud_storage_type=CloudStorageType.S3
status:     FAIL
run time:   3 minutes 52.267 seconds


    TimeoutError('')
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 476, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/topic_delete_test.py", line 332, in topic_delete_unavailable_test
    wait_until(lambda: self._topic_remote_deleted(next_topic),
  File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 57, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError

jcsp added a commit to jcsp/redpanda that referenced this issue Mar 1, 2023
All these tests rely on leadership transfers being
done gracefully to avoid orphan objects, which is not
reliable in debug mode.

This is a more complete followup to
4f9bc2d

Fixes redpanda-data#8496
@jcsp jcsp closed this as completed in #9187 Mar 1, 2023
@twmb
Contributor

twmb commented Mar 10, 2023

Just happened in tip,
https://buildkite.com/redpanda/redpanda/builds/24747#0186c8a3-7da3-4d43-a900-adfecb5e35ef

====================================================================================================
test_id:    rptest.tests.topic_delete_test.TopicDeleteCloudStorageTest.topic_delete_installed_snapshots_test
status:     FAIL
run time:   1 minute 57.472 seconds


    TimeoutError('')
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/utils/mode_checks.py", line 63, in f
    return func(*args, **kwargs)
  File "/root/tests/rptest/services/cluster.py", line 49, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/topic_delete_test.py", line 272, in topic_delete_installed_snapshots_test
    wait_until(lambda: self._topic_remote_deleted(self.topic),
  File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 57, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError

@twmb twmb reopened this Mar 10, 2023
@dlex
Contributor

dlex commented Mar 10, 2023

@BenPope
Member

BenPope commented Mar 15, 2023

@BenPope
Member

BenPope commented Mar 15, 2023

@BenPope
Member

BenPope commented May 8, 2023

@BenPope
Member

BenPope commented May 8, 2023

ballard26 pushed a commit to ballard26/redpanda that referenced this issue May 9, 2023
This code is going to get revised ahead of 23.2 with
tombstones, which will generally make it more reliable
but also involve test changes, so let's defer stabilizing
these until then.

Related: redpanda-data#9629
Related: redpanda-data#8496
jcsp added a commit to jcsp/redpanda that referenced this issue May 22, 2023
Underlying issues have been addressed:
- Adjacent segment merging is disabled.
- We now have a scrubber that finishes deletion eventually
  if anything interrupts the initial best-effort deletion
- A bug was fixed where the `replaced` list of the manifest
  was ignored during deletion.
- A test bug is fixed where it was incorrectly asserting
  that objects should still exist at the end of the 'unavailable'
  variant of the test.

Fixes: redpanda-data#8496
Fixes: redpanda-data#9629
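
As an illustration of the `replaced` item in the list above, the following Python sketch shows how ignoring that list during deletion could leave objects behind. The manifest shape, the `segments` field, and both helper functions are assumptions made for this example; only the `replaced` name comes from the commit message.

```python
# Hypothetical manifest shape; only the `replaced` name is from the commit message.
manifest = {
    "segments": ["seg-merged"],
    "replaced": ["seg-a", "seg-b"],  # superseded objects still in the bucket
}


def keys_to_delete_buggy(manifest):
    """Looks only at the current segments and ignores `replaced`."""
    return set(manifest["segments"])


def keys_to_delete_fixed(manifest):
    """Also covers objects that were superseded but not yet removed."""
    return set(manifest["segments"]) | set(manifest.get("replaced", []))


bucket = {"seg-merged", "seg-a", "seg-b"}
left_by_buggy = bucket - keys_to_delete_buggy(manifest)
left_by_fixed = bucket - keys_to_delete_fixed(manifest)
```

With the first helper, `seg-a` and `seg-b` survive deletion; with the second, the bucket ends up empty.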
@dotnwat
Member

dotnwat commented Jun 3, 2023

This has been closed and re-opened several times, and I think it was triggered again here. The latest fix was merged on May 24 and yesterday a ci-repeat 5 triggered it.

The PR in question was the client fetch throttling. However, the code that is timing out in this test appears to be the object storage API (`self.cloud_storage_client.list_objects`). Of course that doesn't prove that the throttling code isn't to blame, but presumably we'd have seen a different failure earlier in the test if it were the culprit.

@andijcr
Contributor

andijcr commented Jun 5, 2023

@andijcr
Contributor

andijcr commented Jun 7, 2023

@andijcr
Contributor

andijcr commented Jun 9, 2023

@michael-redpanda
Contributor

@michael-redpanda
Contributor

@ztlpn
Contributor

ztlpn commented Jun 16, 2023

@travisdowns
Member

FAIL test: TopicDeleteCloudStorageTest.topic_delete_unavailable_test.cloud_storage_type=CloudStorageType.ABS (1/29 runs)
failure at 2023-06-19T07:31:50.725Z: TimeoutError('')
on (arm64, container) in job https://buildkite.com/redpanda/redpanda/builds/31582#0188d25c-cfaf-4822-a594-91fb3371ebc3

@travisdowns
Member

FAIL test: TopicDeleteCloudStorageTest.topic_delete_installed_snapshots_test (2/13 runs)
failure at 2023-06-20T17:16:32.585Z: TimeoutError('')
on (amd64, container) in job https://buildkite.com/redpanda/redpanda/builds/31659#0188d99c-fed9-40a6-b43c-00b508233319
failure at 2023-06-20T10:26:17.109Z: TimeoutError('')
on (amd64, container) in job https://buildkite.com/redpanda/redpanda/builds/31637#0188d81e-727e-453f-a9a9-0fba8d86cfa8

jcsp added a commit to jcsp/redpanda that referenced this issue Jun 23, 2023
As soon as a topic is deleted, raft is shut down.  This prevents
ntp_archiver from updating archival_metadata_stm after uploading
a segment, resulting in an orphan segment after deletion.

Fixes redpanda-data#8496
Fixes redpanda-data#10655
@StephanDollberg
Member

jcsp added a commit to jcsp/redpanda that referenced this issue Jun 26, 2023
This fixes commit 8355e4e

quiesce_uploads would fail silently if you called it for
a topic that didn't exist: that's fixed in the previous
commit.

This commit fixes the underlying issue, that we were
passing a string instead of a list of strings, so
quiesce_upload was trying to wait for each character
in the string as if it was a topic name.

Fixes redpanda-data#8496
Fixes redpanda-data#10655
Fixes redpanda-data#9629
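
The string-versus-list mistake described in this commit message is easy to reproduce in plain Python. The sketch below uses hypothetical helpers (`topics_to_wait_for` and `topics_to_wait_for_checked`), not the real `quiesce_uploads`, to show how iterating over a bare string yields one "topic" per character, and one possible guard against it.

```python
def topics_to_wait_for(topic_names):
    """Hypothetical stand-in for the loop inside a quiesce helper."""
    return [name for name in topic_names]


def topics_to_wait_for_checked(topic_names):
    """Same loop, but rejects a bare string instead of iterating over it."""
    if isinstance(topic_names, str):
        raise TypeError("expected a list of topic names, got a single string")
    return list(topic_names)


buggy = topics_to_wait_for("tp")      # iterates over characters: "t", "p"
correct = topics_to_wait_for(["tp"])  # iterates over topic names: "tp"
```

Because a Python `str` is itself an iterable of one-character strings, the buggy call raises no error on its own, which fits the silent failure the message describes.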
@piyushredpanda
Contributor

@jcsp should we remove the ci-disabled-test label now?

@NyaliaLui
Contributor

NyaliaLui commented Oct 11, 2023

Pandatriage suggests re-opening this because of:

      {
          "title": "TimeoutError('')",
          "id": 8939,
          "ts": 1689901360.004222,
          "type": "pr-merged",
          "build": "release",
          "arch": "amd64",
          "link": "https://buildkite.com/redpanda/redpanda/builds/33679"
      }

which occurred in late July, and we have not seen new failures since, so it seems OK for this to remain closed.
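
For reference, the date can be checked from the `ts` field of the report, assuming it is a Unix timestamp in seconds (the report itself does not state the unit):

```python
from datetime import datetime, timezone

# "ts" value copied from the report; assumed to be Unix seconds.
ts = 1689901360.004222
when = datetime.fromtimestamp(ts, tz=timezone.utc)
print(when.date().isoformat())  # 2023-07-21
```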

@piyushredpanda
Contributor

Hmm, why is Pandatriage reporting a July build for reopening?
