
cloud_storage: add cloud storage scrubbing capabilities #13253

Merged
14 commits merged into redpanda-data:dev from the scrubbing branch on Sep 18, 2023

Conversation

@VladLazar (Contributor) commented Sep 4, 2023

This is the first PR for cloud storage scrubbing. The idea is that we want Redpanda to detect
inconsistencies in the data and metadata uploaded to the cloud storage bucket on its own.

There are three parts to the approach in this PR.

  1. Detection: a new housekeeping job is introduced. If it decides that scrubbing should be done,
    it downloads the metadata from the bucket and looks for anomalies. Currently only missing partition manifests,
    missing spillover manifests, and missing segments are detected (see the sketch after this list). More will follow.
  2. Persistence & Processing: anomalies are persisted in the partition's log or in the snapshot. After the scrubber runs, it replicates a new archival STM command. When that command is applied, the in-memory manifest checks
    the anomalies to ensure they are not false positives. The valid anomalies are kept in memory.
  3. Reporting: anomalies can be retrieved from the admin API. Currently, this can only be done on a per-partition basis, but it will be extended to be cluster wide.
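
For illustration, the anomaly categories above could be collected into an aggregate along the following lines. This is a simplified sketch using standard-library types; only the missing_segments name appears in this PR's diff, so the other field names and the overall shape are assumptions rather than the actual cloud_storage definitions.

#include <string>
#include <vector>

// Simplified stand-in for the anomaly aggregate described above.
struct anomalies {
    // Set when the partition manifest itself cannot be found in the bucket.
    bool missing_partition_manifest{false};
    // Spillover manifests referenced by the partition manifest but absent.
    std::vector<std::string> missing_spillover_manifests;
    // Segments referenced by a manifest but absent from the bucket.
    std::vector<std::string> missing_segments;

    bool empty() const {
        return !missing_partition_manifest
               && missing_spillover_manifests.empty()
               && missing_segments.empty();
    }
};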

In the interest of keeping this reviewable, let's defer the items below to future PRs:

  • Bulletproofing for scale:
    • support low-throughput requests in the client pool
    • ensure the scrubber doesn't impact the cloud storage read/write paths
  • Detect more types of anomalies
  • Make the scrubber smarter:
    • if the last scrub was partial, pick up from where that finished instead of restarting
    • Add a new error code for malformed manifests at the remote level
  • Cluster wide anomaly reporting
  • Add metric to alert on
  • Endpoint for on-demand scrub

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.2.x
  • v23.1.x
  • v22.3.x

Release Notes

Features

  • Add cloud storage scrubbing capabilities to Redpanda. In brief, the scrubber runs in the background and verifies the integrity of the cloud storage metadata and the existence of the data referenced by it. When an issue is discovered, the redpanda_cloud_storage_anomalies metric will increment its counters based on the anomaly type. Per-partition anomalies can be queried via the /v1/cloud_storage/anomalies/{namespace}/{topic}/{partition} admin API endpoint.

@VladLazar changed the title from "[WIP] Cloud storage scrubbing" to "[WIP] cloud_storage: add cloud storage scrubbing capabilities" on Sep 8, 2023
@VladLazar added the area/cloud-storage (Shadow indexing subsystem) label on Sep 8, 2023
@VladLazar force-pushed the scrubbing branch 7 times, most recently from d43a82c to 724bb40 on September 12, 2023 11:20
@VladLazar (Contributor Author) commented:

/ci-repeat

@VladLazar force-pushed the scrubbing branch 3 times, most recently from c615025 to dd5b079 on September 13, 2023 12:47
@VladLazar changed the title from "[WIP] cloud_storage: add cloud storage scrubbing capabilities" to "cloud_storage: add cloud storage scrubbing capabilities" on Sep 13, 2023
@VladLazar marked this pull request as ready for review on September 13, 2023 12:54
@VladLazar force-pushed the scrubbing branch 2 times, most recently from d3f6f1a to c443869 on September 13, 2023 17:17

@andrwng (Contributor) left a comment:

Haven't made my way through all the tests yet, but so far looks pretty good!

Review threads (resolved): src/v/cloud_storage/partition_manifest.cc, src/v/archival/scrubber.cc, src/v/cloud_storage/anomalies_detector.cc
@@ -533,6 +537,8 @@ class partition_manifest : public base_manifest {

iobuf to_iobuf() const;

void process_anomalies(scrub_status status, anomalies detected);

Contributor:

nit: maybe call it maybe_add_anomalies or something? I was surprised at first that a serde class was doing some processing/work, but looks like this is just accepting and trimming some anomalies

Contributor Author:

Hmm. maybe_add_anomalies doesn't reflect the fact that we perform trimming here. I'll have a think for a better name.

Further review threads (resolved): src/v/cloud_storage/partition_manifest.cc, src/v/config/configuration.cc, src/v/cloud_storage/anomalies_detector.cc, src/v/archival/ntp_archiver_service.cc
@VladLazar (Contributor Author) commented:

Changes in force-push:

  • moved last partition scrub timestamp into STM command
  • tweaked log allow list for ducktape test
  • last_partition_scrub JSON ser/de
  • added code comments where requested

Vlad Lazar added 2 commits September 15, 2023 10:03
This commit renames the cloud storage scrubber to purger, since it
didn't really do any scrubbing and only played a role during remote
topic deletion. I went for purger since the code already used that term
in various places.

This commit adds three tunable cloud storage cluster configs. They are
all fairly self-explanatory. In theory, end users should never have to
touch these unless there is an issue with scrubbing.
@VladLazar (Contributor Author) commented:

Changes in force-push:

  • rebased on dev to solve conflicts

Vlad Lazar added 2 commits September 15, 2023 13:34
We need to gate scrubbing behind a feature flag since it will end up
replicating a new archival STM command.

This commit adds a new field to the partition manifest to track the last
scrub that occurred.
andrwng previously approved these changes Sep 16, 2023

@andrwng (Contributor) left a comment:

Just nits remaining, otherwise LGTM

Review threads (resolved): tests/rptest/tests/cloud_storage_scrubber_test.py

@dotnwat (Member) left a comment:

amazing stuff.

are anomalies reported until they are resolved, and how are they expired from storage (or not expired)?

Comment on lines 23 to 25
* The scrubber is a global sharded service: it runs on all shards, and
* decides internally which shard will scrub which ranges of objects
* The purger is a global sharded service: it runs on all shards, and
* decides internally which shard will purge which ranges of objects
* in object storage.

Member:

it sounds like there were bigger plans for the now-named purger that included scrubbing? but maybe it's easier to integrate whatever this purger is doing into the new framework later (if that's a plan at all)?

Contributor Author:

Right. The now-named purger will deal with the deletion of orphaned objects too. John pushed a branch in the main redpanda repo with some POC orphan deletion stuff. I had a look through it and it looks good. The plan is to revamp that.

Member:

cool

"cloud_storage_scrubbing_interval_ms",
"Time interval between scrubs of the same partition",
{.needs_restart = needs_restart::no, .visibility = visibility::tunable},
1h)

Member:

curious about this 1 hour interval. does this mean that if we had 3600 partitions we'd be scrubbing a partition every second?

Contributor Author:

Good question. This interval is applied between two scrubs of the same partition. There's also a generous jitter of 10min which tries to avoid scrubbing too many things at the same time.

To answer the question, yes (roughly). I say we go with 1h for now, but it will likely have to change to something like 6 or 12. There's some more upcoming work on scrubbing scalability and picking a good interval will be part of that.
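
As a rough illustration of the scheduling described here (one interval between scrubs of the same partition, plus a jitter so partitions don't all line up), a minimal sketch follows. It assumes the 1h cloud_storage_scrubbing_interval_ms default and the roughly 10 minute jitter mentioned above; the function name and shape are illustrative, not the actual scheduler interface.

#include <chrono>
#include <random>

using namespace std::chrono;

// The next scrub of a partition is due one interval after its last scrub,
// plus a random jitter so that many partitions don't become due on the same tick.
system_clock::time_point next_scrub_at(
  system_clock::time_point last_scrub,
  milliseconds interval = hours{1},     // cloud_storage_scrubbing_interval_ms
  milliseconds max_jitter = minutes{10}) {
    static thread_local std::mt19937 gen{std::random_device{}()};
    std::uniform_int_distribution<int64_t> jitter_ms(0, max_jitter.count());
    return last_scrub + interval + milliseconds{jitter_ms(gen)};
}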

@@ -184,6 +184,8 @@ struct archival_metadata_stm::snapshot
kafka::offset start_kafka_offset;
// List of spillover manifests
fragmented_vector<segment> spillover_manifests;
// Timestamp of last completed scrub
model::timestamp last_partition_scrub;

Member:

I take it that model::timestamp has a reasonable default value for the case where upgraded code is reading old messages and wants to behave in an adaptive way (probably ignoring it?).

Contributor Author:

Yes. The default is -1, which is equal to model::timestamp::missing(). I could make it explicit
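
To make the upgrade behaviour concrete, here is a tiny sketch using a bare -1 sentinel in place of model::timestamp. A snapshot written before this field existed decodes it to the default, which a reader can treat as "never scrubbed"; the helper name and the exact reaction are assumptions for illustration, not the actual STM code.

#include <cstdint>

// -1 mirrors model::timestamp::missing(): the value an old snapshot decodes to.
constexpr int64_t missing_timestamp_ms = -1;

// One plausible adaptive reaction: treat a missing timestamp as "never
// scrubbed", making the partition eligible for a scrub right away.
inline bool never_scrubbed(int64_t last_partition_scrub_ms) {
    return last_partition_scrub_ms == missing_timestamp_ms;
}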

Comment on lines 47 to 53
} catch (...) {
vlog(
_logger.error,
"Unexpected exception while awaiting feature activation: {}",
std::current_exception());
co_return;

Member:

since feature await is kicked off in the constructor (as opposed to, for example, in ::start()), is there a risk that it runs too soon in the Redpanda start-up sequence, such that some exceptions here are more likely and the scrubber never gets activated? I'm not very familiar with all the exceptions that might pop out of the await_feature interface, but maybe it's very unlikely and we shouldn't worry?

Contributor Author:

Good point. I did look through the features code and I think the only exception we can reasonably expect is from the abort source. This catch is me being defensive. I could add a start function, although the exercise seems a bit finicky.

Member:

yeh seems ok as-is

Review thread (resolved): src/v/archival/scrubber.cc
// Binary manifest encoding and spillover manifests were both added
// in the same release. Hence, it's an anomaly to have a JSON
// encoded manifest and spillover manifests.
if (format == manifest_format::json && spill_manifest_paths.size() > 0) {

Member:

Binary manifest encoding and spillover manifests were both added in the same release.

makes sense. it seems like the existence of spill manifest here is telling us something about the release from which these manifest are coming. but it seems like there is another, stronger property to complete the implication here. something like a spill manifest would imply that a json manifest were always replaced by a binary manifest?

Contributor Author:

something like a spill manifest would imply that a json manifest were always replaced by a binary manifest?

Exactly!

Review threads (resolved): src/v/cloud_storage/anomalies_detector.cc, src/v/archival/scrubber.cc
auto first_kafka_offset = full_log_start_kafka_offset();
auto& missing_segs = detected.missing_segments;
erase_if(missing_segs, [&first_kafka_offset](const auto& meta) {
return meta.next_kafka_offset() <= first_kafka_offset;

Member:

do I understand correctly that this is saying that an apparent missing segment is not actually missing if it is sequenced before the starting offset (e.g. because of delete prefix / retention)?

Contributor Author:

Precisely. If all the offsets in a missing segment are below the starting Kafka offset, we should ignore it.

@VladLazar (Contributor Author) commented:

are anomalies reported until they are resolved, and how are they expired from storage (or not expired)?

In partition_manifest::process_anomalies we overwrite the current anomalies with the new ones (which can be empty) if the scrubbing was full (i.e. went through all manifests and segments). Thinking about it now, perhaps we should do the filtering after we update the list of anomalies. This would avoid reporting stale anomalies when all scrubs come out as partial.
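
A minimal sketch of the replace-on-full-scrub behaviour described above, with a plain set of missing object paths standing in for the real anomalies type. The merge branch for partial scrubs and the omission of the false-positive filtering step are simplifications; the ordering of that filtering is exactly what is being reconsidered in the comment above.

#include <set>
#include <string>
#include <utility>

enum class scrub_status { full, partial };
using anomaly_set = std::set<std::string>; // e.g. paths of missing objects

void process_anomalies(anomaly_set& current, scrub_status status, anomaly_set detected) {
    if (status == scrub_status::full) {
        // A full scrub covered everything, so its result (possibly empty)
        // replaces whatever was recorded before, clearing stale anomalies.
        current = std::move(detected);
    } else {
        // A partial scrub only adds to what is already known; previously
        // reported anomalies are kept until a full scrub clears them.
        current.merge(detected);
    }
}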

Vlad Lazar added 10 commits September 18, 2023 10:50
This commit introduces a stub housekeeping scrubber job. At this point
it includes the scaffolding required by the housekeeping service and the
scheduling logic. A future commit will plug in the actual scrubbing.

A scrubbing scheduling utility class is also added. Separating this logic
from the scrubber itself allows for writing unit tests.

Move spillover_manifest_path_components to cloud_storage/types.h in
order to avoid circular dependencies in a future commit.

This commit introduces a new utility class that detects anomalies
within cloud storage data and metadata.

It performs the following steps:
1. Download partition manifest
2. Check for existence of spillover manifests
3. Check for existence of segments referenced by partition manifest
4. For each spillover manifest, check for existence of the referenced
   segments

This class will be extended with detection of other anomaly types in a
future patch set.
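
As a rough, self-contained sketch, the four steps above could be arranged as below. The download/exists callbacks and the manifest shape are placeholders for illustration, not the real cloud_storage::remote or anomalies_detector interfaces, and the actual detector is coroutine-based rather than synchronous.

#include <functional>
#include <optional>
#include <string>
#include <vector>

struct manifest_stub {
    std::vector<std::string> segment_paths;
    std::vector<std::string> spillover_manifest_paths;
};

struct detected_anomalies {
    bool missing_partition_manifest{false};
    std::vector<std::string> missing_spillover_manifests;
    std::vector<std::string> missing_segments;
};

detected_anomalies detect(
  const std::function<std::optional<manifest_stub>(const std::string&)>& download,
  const std::function<bool(const std::string&)>& exists,
  const std::string& partition_manifest_path) {
    detected_anomalies found;
    // 1. Download the partition manifest; without it nothing else can be checked.
    auto pm = download(partition_manifest_path);
    if (!pm) {
        found.missing_partition_manifest = true;
        return found;
    }
    // 2. Check for the existence of the spillover manifests it references.
    for (const auto& spill : pm->spillover_manifest_paths) {
        if (!exists(spill)) {
            found.missing_spillover_manifests.push_back(spill);
        }
    }
    // 3. Check for the existence of segments referenced by the partition manifest.
    for (const auto& seg : pm->segment_paths) {
        if (!exists(seg)) {
            found.missing_segments.push_back(seg);
        }
    }
    // 4. For each spillover manifest that exists, check its referenced segments too.
    for (const auto& spill : pm->spillover_manifest_paths) {
        if (auto sm = download(spill)) {
            for (const auto& seg : sm->segment_paths) {
                if (!exists(seg)) {
                    found.missing_segments.push_back(seg);
                }
            }
        }
    }
    return found;
}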

This commit extends the partition manifest to include the anomalies
detected by the scrubber. The next commit will add the "write path" for
this. Note that the anomalies are not included in the serde format, so
they will not be uploaded to the cloud. They are, however, included in
the snapshot.

The anomaly validation logic is also included in this commit (see
partition_manifest::process_anomalies). This is where false positives
detected by the scrubber are removed.

This commit implements the "anomaly write path". A new archival STM
command is introduced: process_anomalies_cmd. After replication, it
calls into the partition manifest which grabs the anomalies and
processes them.

This commit plugs the scrubber housekeeping job into ntp_archiver. Like
the adjacent segment merging job, it will only be enabled while the
ntp_archiver is aware that the local Raft replica is the leader.

This commit introduces a new endpoint to the admin API
/v1/cloud_storage/anomalies/{namespace}/{topic}/{partition} which allows
for the retrieval of anomalies detected by the cloud storage scrubber.

@dotnwat merged commit 5acbf6c into redpanda-data:dev on Sep 18, 2023
9 checks passed
Labels: area/cloud-storage (Shadow indexing subsystem), area/redpanda
3 participants