CI Failure (_topic_remote_deleted timeout) in TopicDeleteCloudStorageTest.topic_delete_installed_snapshots_test #8496
A leadership transfer is leading to a double upload of a segment, and only the final one that ends up in the manifest is deleted. This is not a bug per se, as the tiered storage design allows some garbage to be left behind after leadership movement, but it is an unnecessary rough edge in the context of an orchestrated leadership transfer. While we could add some logic to detect and accept this case in the test, it may be better to amend the leadership transfer code to do a gentle shutdown of ntp_archiver during the transfer.
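The ordering described above can be modeled with a small sketch. This is purely illustrative (the `Partition` class and its methods are hypothetical, not the real Redpanda C++ internals): quiescing the archiver before moving leadership ensures no upload completes after manifest authority has moved, which is what otherwise duplicates a segment.

```python
# Hypothetical model of a "gentle" leadership transfer: stop the
# archiver (letting in-flight uploads drain) before transferring,
# so the old leader cannot race an upload against the new leader.
class Partition:
    def __init__(self):
        self.uploads_in_flight = 0
        self.archiver_running = True

    def stop_archiver(self):
        # The real system would await in-flight uploads here; this
        # sketch only models the resulting state transition.
        self.uploads_in_flight = 0
        self.archiver_running = False

    def transfer_leadership(self):
        if self.archiver_running:
            # Without the gentle shutdown, an upload can race the
            # transfer and the losing copy becomes an orphan object.
            raise RuntimeError("archiver still running during transfer")
        return "transferred"

p = Partition()
p.stop_archiver()
assert p.transfer_leadership() == "transferred"
```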
Reproduced with the same failure mode:
This substantially reduces the probability of leaving orphaned objects in the object store when partitions change leadership under load, e.g. during upgrades or leader balancing. This fixes a test failure that indirectly detects orphan objects by checking that topic deletion clears all objects. Fixes redpanda-data#8496
Similar error in 3 tests:
FAIL test: TopicDeleteCloudStorageTest.topic_delete_unavailable_test (2/34 runs)
Also just happened to me https://buildkite.com/redpanda/redpanda/builds/23591#01866fa0-8cbf-47c7-b8e2-735317cc8777
All these tests rely on leadership transfers being done gracefully to avoid orphan objects, which is not reliable in debug mode. This is a more complete follow-up to 4f9bc2d. Fixes redpanda-data#8496
Just happened at tip, on (amd64, container), in job https://buildkite.com/redpanda/redpanda/builds/24726#0186c7b0-34fe-4357-9603-fb2b470f67ec
This code is going to get revised ahead of 23.2 with tombstones, which will generally make it more reliable but also involve test changes, so let's defer stabilizing these until then. Related: redpanda-data#9629 Related: redpanda-data#8496
Underlying issues have been addressed:
- Adjacent segment merging is disabled.
- We now have a scrubber that eventually finishes deletion if anything interrupts the initial best-effort deletion.
- A bug was fixed where the `replaced` list of the manifest was ignored during deletion.
- A test bug was fixed where it incorrectly asserted that objects should still exist at the end of the 'unavailable' variant of the test.

Fixes: redpanda-data#8496 Fixes: redpanda-data#9629
This has been closed and re-opened several times, and I think it was triggered again here. The latest fix was merged on May 24, and yesterday a `ci-repeat 5` run triggered it.
The PR in question was the client fetch throttling. However, the code that is timing out in this test appears to be the object storage API.
https://buildkite.com/redpanda/redpanda/builds/30568#01888a22-5476-43b6-8d0f-1f46708efc40
FAIL test: TopicDeleteCloudStorageTest.topic_delete_unavailable_test.cloud_storage_type=CloudStorageType.ABS (1/29 runs)
FAIL test: TopicDeleteCloudStorageTest.topic_delete_installed_snapshots_test (2/13 runs)
As soon as a topic is deleted, raft is shut down. This prevents ntp_archiver from updating archival_metadata_stm after uploading a segment, resulting in an orphan segment after deletion. Fixes redpanda-data#8496 Fixes redpanda-data#10655
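The race described above can be sketched as a tiny model. Everything here (`DeletionRace`, `upload_segment`, `delete_topic`) is hypothetical, standing in for the real raft / ntp_archiver / archival_metadata_stm interaction: once raft is down, an in-flight upload still lands in the bucket but never reaches the manifest, so manifest-driven deletion misses it.

```python
# Hypothetical model: topic deletion shuts raft down first, so a
# segment uploaded in flight never gets recorded in the manifest
# (archival_metadata_stm) and survives deletion as an orphan.
class DeletionRace:
    def __init__(self):
        self.manifest = set()  # segments recorded via the metadata STM
        self.bucket = set()    # objects actually present in cloud storage
        self.raft_up = True

    def upload_segment(self, name):
        self.bucket.add(name)
        # The metadata update goes through raft; with raft shut
        # down, the segment never reaches the manifest.
        if self.raft_up:
            self.manifest.add(name)

    def delete_topic(self):
        self.raft_up = False          # raft shuts down immediately
        self.bucket -= self.manifest  # deletion only sees the manifest
        self.manifest.clear()

r = DeletionRace()
r.upload_segment("seg-1")
r.delete_topic()           # raft goes down...
r.upload_segment("seg-2")  # ...but an in-flight upload still lands
assert r.bucket == {"seg-2"}  # orphan object left behind
```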
I think this happened again: https://buildkite.com/redpanda/redpanda/builds/31920#0188f66a-2a94-46a5-a587-37c2c09f0e81 ?
This fixes commit 8355e4e. quiesce_uploads would fail silently if you called it for a topic that didn't exist; that's fixed in the previous commit. This commit fixes the underlying issue: we were passing a string instead of a list of strings, so quiesce_uploads was trying to wait for each character in the string as if it were a topic name. Fixes redpanda-data#8496 Fixes redpanda-data#10655 Fixes redpanda-data#9629
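The string-vs-list bug is a classic Python pitfall worth illustrating. The helper below is a hypothetical stand-in for the test utility (not the actual ducktape code): because a `str` is itself an iterable of one-character strings, passing a bare topic name makes the loop "wait" on each character.

```python
def expand_topics(topics):
    """Return the names a quiesce-style helper would wait on.

    Hypothetical stand-in for the test utility described above: it
    simply iterates its argument, so a bare string is silently
    treated as a sequence of one-character "topic names".
    """
    return [t for t in topics]

# The bug: a bare string iterates per character.
assert expand_topics("panda") == ["p", "a", "n", "d", "a"]
# The fix: pass a list of topic names.
assert expand_topics(["panda"]) == ["panda"]
```

A common defensive fix is to reject bare strings up front, e.g. `if isinstance(topics, str): raise TypeError(...)`, so the mistake fails loudly instead of silently waiting on nonexistent topics.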
@jcsp should we remove the
Pandatriage suggests re-opening this because of a build which occurred in late July, but we have not seen new failures since, so it seems OK for this to remain closed.
Hmm, why is pandatriage reporting a July build for reopening?
This is a relatively recently added test, first time we've seen it fail.
https://buildkite.com/redpanda/redpanda/builds/22044#01860016-6553-4a49-b948-ebdd36604013