
cloud_storage: add cloud storage scrubbing capabilities #13253

Merged
14 commits merged into redpanda-data:dev from the scrubbing branch on Sep 18, 2023

Conversation

@VladLazar (Contributor) commented Sep 4, 2023

This is the first PR for cloud storage scrubbing. The idea is that we want Redpanda to detect
inconsistencies in the data and metadata uploaded to the cloud storage bucket on its own.

There are three parts to the approach in this PR.

  1. Detection: a new housekeeping job is introduced. If it decides that scrubbing should be done,
    it downloads the metadata from the bucket and looks for anomalies. Currently only missing partition manifests,
    missing spillover manifests, and missing segments are detected (see the sketch after this list). More will follow.
  2. Persistence & Processing: anomalies are persisted in the partition's log or in the snapshot. After the scrubber runs, it replicates a new archival STM command. When that command is applied, the in-memory manifest checks
    the anomalies to ensure they are not false positives. The valid anomalies are kept in memory.
  3. Reporting: anomalies can be retrieved from the admin API. Currently, this can only be done on a per-partition basis, but it will be extended to be cluster wide.
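
For illustration, the anomaly categories above could be collected into an aggregate along the following lines. This is a simplified sketch using standard-library types; only the missing_segments name appears in this PR's diff, so the other field names and the overall shape are assumptions rather than the actual cloud_storage definitions.

#include <string>
#include <vector>

// Simplified stand-in for the anomaly aggregate described above.
struct anomalies {
    // Set when the partition manifest itself cannot be found in the bucket.
    bool missing_partition_manifest{false};
    // Spillover manifests referenced by the partition manifest but absent.
    std::vector<std::string> missing_spillover_manifests;
    // Segments referenced by a manifest but absent from the bucket.
    std::vector<std::string> missing_segments;

    bool empty() const {
        return !missing_partition_manifest
               && missing_spillover_manifests.empty()
               && missing_segments.empty();
    }
};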

In the interest of keeping this reviewable, let's defer the items below to future PRs:

  • Bulletproofing for scale:
    • support low-throughput requests in the client pool
    • ensure the scrubber doesn't impact the cloud storage read/write paths
  • Detect more types of anomalies
  • Make the scrubber smarter:
    • if the last scrub was partial, pick up from where that finished instead of restarting
    • Add a new error code for malformed manifests at the remote level
  • Cluster wide anomaly reporting
  • Add metric to alert on
  • Endpoint for on-demand scrub

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.2.x
  • v23.1.x
  • v22.3.x

Release Notes

Features

  • Add cloud storage scrubbing capabilities to Redpanda. In brief, the scrubber runs in the background and verifies the integrity of the cloud storage metadata and the existence of the data referenced by it. When an issue is discovered, the redpanda_cloud_storage_anomalies metric will increment its counters based on the anomaly type. Per-partition anomalies can be queried via the /v1/cloud_storage/anomalies/{namespace}/{topic}/{partition} admin API endpoint.

@VladLazar changed the title from "[WIP] Cloud storage scrubbing" to "[WIP] cloud_storage: add cloud storage scrubbing capabilities" on Sep 8, 2023
@VladLazar added the area/cloud-storage (Shadow indexing subsystem) label on Sep 8, 2023
@VladLazar force-pushed the scrubbing branch 7 times, most recently from d43a82c to 724bb40 on September 12, 2023 11:20
@VladLazar (Contributor Author) commented:

/ci-repeat

@VladLazar force-pushed the scrubbing branch 3 times, most recently from c615025 to dd5b079 on September 13, 2023 12:47
@VladLazar changed the title from "[WIP] cloud_storage: add cloud storage scrubbing capabilities" to "cloud_storage: add cloud storage scrubbing capabilities" on Sep 13, 2023
@VladLazar marked this pull request as ready for review on September 13, 2023 12:54
@VladLazar force-pushed the scrubbing branch 2 times, most recently from d3f6f1a to c443869 on September 13, 2023 17:17

@andrwng (Contributor) left a comment:

Haven't made my way through all the tests yet, but so far looks pretty good!

Review threads (resolved): src/v/cloud_storage/partition_manifest.cc, src/v/archival/scrubber.cc, src/v/cloud_storage/anomalies_detector.cc
@@ -533,6 +537,8 @@ class partition_manifest : public base_manifest {

iobuf to_iobuf() const;

void process_anomalies(scrub_status status, anomalies detected);

Contributor:

nit: maybe call it maybe_add_anomalies or something? I was surprised at first that a serde class was doing some processing/work, but looks like this is just accepting and trimming some anomalies

Contributor Author:

Hmm. maybe_add_anomalies doesn't reflect the fact that we perform trimming here. I'll have a think for a better name.

Further review threads (resolved): src/v/cloud_storage/partition_manifest.cc, src/v/config/configuration.cc, src/v/cloud_storage/anomalies_detector.cc, src/v/archival/ntp_archiver_service.cc
@VladLazar (Contributor Author) commented:

Changes in force-push:

  • moved last partition scrub timestamp into STM command
  • tweaked log allow list for ducktape test
  • last_partition_scrub JSON ser/de
  • added code comments where requested

Vlad Lazar added 2 commits September 15, 2023 10:03
This commit renames the cloud storage scrubber to purger, since it
didn't really do any scrubbing and only played a role during remote
topic deletion. I went for purger since the code already used that term
in various places.

This commit adds three tunable cloud storage cluster configs. They are
all fairly self-explanatory. In theory, end users should never have to
touch these unless there is an issue with scrubbing.
@VladLazar (Contributor Author) commented:

Changes in force-push:

  • rebased on dev to solve conflicts

Vlad Lazar added 2 commits September 15, 2023 13:34
We need to gate scrubbing behind a feature flag since it will end up
replicating a new archival STM command.

This commit adds a new field to the partition manifest to track the last
scrub that occurred.
andrwng previously approved these changes Sep 16, 2023

@andrwng (Contributor) left a comment:

Just nits remaining, otherwise LGTM

Review threads (resolved): tests/rptest/tests/cloud_storage_scrubber_test.py

@dotnwat (Member) left a comment:

amazing stuff.

are anomalies reported until they are resolved, and how are they expired from storage (or not expired)?

Comment on lines 23 to 25
* The scrubber is a global sharded service: it runs on all shards, and
* decides internally which shard will scrub which ranges of objects
* The purger is a global sharded service: it runs on all shards, and
* decides internally which shard will purge which ranges of objects
* in object storage.

Member:

it sounds like there were bigger plans for the now-named purger that included scrubbing? but maybe it's easier to integrate whatever this purger is doing into the new framework later (if that's a plan at all)?

Contributor Author:

Right. The now-named purger will deal with the deletion of orphaned objects too. John pushed a branch in the main redpanda repo with some POC orphan deletion stuff. I had a look through it and it looks good. The plan is to revamp that.

Member:

cool

"cloud_storage_scrubbing_interval_ms",
"Time interval between scrubs of the same partition",
{.needs_restart = needs_restart::no, .visibility = visibility::tunable},
1h)

Member:

curious about this 1 hour interval. does this mean that if we had 3600 partitions we'd be scrubbing a partition every second?

Contributor Author:

Good question. This interval is applied between two scrubs of the same partition. There's also a generous jitter of 10min which tries to avoid scrubbing too many things at the same time.

To answer the question, yes (roughly). I say we go with 1h for now, but it will likely have to change to something like 6 or 12. There's some more upcoming work on scrubbing scalability and picking a good interval will be part of that.
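
As a rough illustration of the scheduling described here (one interval between scrubs of the same partition, plus a jitter so partitions don't all line up), a minimal sketch follows. It assumes the 1h cloud_storage_scrubbing_interval_ms default and the roughly 10 minute jitter mentioned above; the function name and shape are illustrative, not the actual scheduler interface.

#include <chrono>
#include <random>

using namespace std::chrono;

// The next scrub of a partition is due one interval after its last scrub,
// plus a random jitter so that many partitions don't become due on the same tick.
system_clock::time_point next_scrub_at(
  system_clock::time_point last_scrub,
  milliseconds interval = hours{1},     // cloud_storage_scrubbing_interval_ms
  milliseconds max_jitter = minutes{10}) {
    static thread_local std::mt19937 gen{std::random_device{}()};
    std::uniform_int_distribution<int64_t> jitter_ms(0, max_jitter.count());
    return last_scrub + interval + milliseconds{jitter_ms(gen)};
}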

@@ -184,6 +184,8 @@ struct archival_metadata_stm::snapshot
kafka::offset start_kafka_offset;
// List of spillover manifests
fragmented_vector<segment> spillover_manifests;
// Timestamp of last completed scrub
model::timestamp last_partition_scrub;

Member:

I take it that model::timestamp has a reasonable default value for the case where upgraded code is reading old messages and wants to behave in an adaptive way (probably ignoring it?).

Contributor Author:

Yes. The default is -1, which is equal to model::timestamp::missing(). I could make it explicit
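
To make the upgrade behaviour concrete, here is a tiny sketch using a bare -1 sentinel in place of model::timestamp. A snapshot written before this field existed decodes it to the default, which a reader can treat as "never scrubbed"; the helper name and the exact reaction are assumptions for illustration, not the actual STM code.

#include <cstdint>

// -1 mirrors model::timestamp::missing(): the value an old snapshot decodes to.
constexpr int64_t missing_timestamp_ms = -1;

// One plausible adaptive reaction: treat a missing timestamp as "never
// scrubbed", making the partition eligible for a scrub right away.
inline bool never_scrubbed(int64_t last_partition_scrub_ms) {
    return last_partition_scrub_ms == missing_timestamp_ms;
}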

Comment on lines 47 to 53
} catch (...) {
vlog(
_logger.error,
"Unexpected exception while awaiting feature activation: {}",
std::current_exception());
co_return;

Member:

since feature await is kicked off in the constructor (as opposed to, for example, in ::start()), is there a risk that it runs too soon in the Redpanda start-up sequence, such that some exceptions here are more likely and the scrubber never gets activated? I'm not very familiar with all the exceptions that might pop out of the await_feature interface, but maybe it's very unlikely and we shouldn't worry?

Contributor Author:

Good point. I did look through the features code and I think the only exception we can reasonably expect is from the abort source. This catch is me being defensive. I could add a start function, although the exercise seems a bit finicky.

Member:

yeh seems ok as-is

Review thread (resolved): src/v/archival/scrubber.cc
// Binary manifest encoding and spillover manifests were both added
// in the same release. Hence, it's an anomaly to have a JSON
// encoded manifest and spillover manifests.
if (format == manifest_format::json && spill_manifest_paths.size() > 0) {

Member:

Binary manifest encoding and spillover manifests were both added in the same release.

makes sense. it seems like the existence of spill manifest here is telling us something about the release from which these manifest are coming. but it seems like there is another, stronger property to complete the implication here. something like a spill manifest would imply that a json manifest were always replaced by a binary manifest?

Contributor Author:

something like a spill manifest would imply that a json manifest were always replaced by a binary manifest?

Exactly!

Review threads (resolved): src/v/cloud_storage/anomalies_detector.cc, src/v/archival/scrubber.cc
auto first_kafka_offset = full_log_start_kafka_offset();
auto& missing_segs = detected.missing_segments;
erase_if(missing_segs, [&first_kafka_offset](const auto& meta) {
return meta.next_kafka_offset() <= first_kafka_offset;

Member:

do I understand correctly that this is saying that an apparent missing segment is not actually missing if it is sequenced before the starting offset (e.g. because of delete prefix / retention)?

Contributor Author:

Precisely. If all the offsets in a missing segment are below the starting Kafka offset, we should ignore it.

@VladLazar (Contributor Author) commented:

are anomalies reported until they are resolved, and how are they expired from storage (or not expired)?

In partition_manifest::process_anomalies we overwrite the current anomalies with the new ones (which can be empty) if the scrubbing was full (i.e. went through all manifests and segments). Thinking about it now, perhaps we should do the filtering after we update the list of anomalies. This would avoid reporting stale anomalies when all scrubs come out as partial.
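
A minimal sketch of the replace-on-full-scrub behaviour described above, with a plain set of missing object paths standing in for the real anomalies type. The merge branch for partial scrubs and the omission of the false-positive filtering step are simplifications; the ordering of that filtering is exactly what is being reconsidered in the comment above.

#include <set>
#include <string>
#include <utility>

enum class scrub_status { full, partial };
using anomaly_set = std::set<std::string>; // e.g. paths of missing objects

void process_anomalies(anomaly_set& current, scrub_status status, anomaly_set detected) {
    if (status == scrub_status::full) {
        // A full scrub covered everything, so its result (possibly empty)
        // replaces whatever was recorded before, clearing stale anomalies.
        current = std::move(detected);
    } else {
        // A partial scrub only adds to what is already known; previously
        // reported anomalies are kept until a full scrub clears them.
        current.merge(detected);
    }
}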

Vlad Lazar added 10 commits September 18, 2023 10:50
This commit introduces a stub housekeeping scrubber job. At this point
it includes the scaffolding required by the housekeeping service and the
scheduling logic. A future commit will plug in the actual scrubbing.

A scrubbing scheduling utility class is also added. Separating this logic
from the scrubber itself allows for writing unit tests.

Move spillover_manifest_path_components to cloud_storage/types.h in
order to avoid circular dependencies in a future commit.

This commit introduces a new utility class that detects anomalies
within cloud storage data and metadata.

It performs the following steps:
1. Download partition manifest
2. Check for existence of spillover manifests
3. Check for existence of segments referenced by partition manifest
4. For each spillover manifest, check for existence of the referenced
   segments

This class will be extended with detection of other anomaly types in a
future patch set.
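
As a rough, self-contained sketch, the four steps above could be arranged as below. The download/exists callbacks and the manifest shape are placeholders for illustration, not the real cloud_storage::remote or anomalies_detector interfaces, and the actual detector is coroutine-based rather than synchronous.

#include <functional>
#include <optional>
#include <string>
#include <vector>

struct manifest_stub {
    std::vector<std::string> segment_paths;
    std::vector<std::string> spillover_manifest_paths;
};

struct detected_anomalies {
    bool missing_partition_manifest{false};
    std::vector<std::string> missing_spillover_manifests;
    std::vector<std::string> missing_segments;
};

detected_anomalies detect(
  const std::function<std::optional<manifest_stub>(const std::string&)>& download,
  const std::function<bool(const std::string&)>& exists,
  const std::string& partition_manifest_path) {
    detected_anomalies found;
    // 1. Download the partition manifest; without it nothing else can be checked.
    auto pm = download(partition_manifest_path);
    if (!pm) {
        found.missing_partition_manifest = true;
        return found;
    }
    // 2. Check for the existence of the spillover manifests it references.
    for (const auto& spill : pm->spillover_manifest_paths) {
        if (!exists(spill)) {
            found.missing_spillover_manifests.push_back(spill);
        }
    }
    // 3. Check for the existence of segments referenced by the partition manifest.
    for (const auto& seg : pm->segment_paths) {
        if (!exists(seg)) {
            found.missing_segments.push_back(seg);
        }
    }
    // 4. For each spillover manifest that exists, check its referenced segments too.
    for (const auto& spill : pm->spillover_manifest_paths) {
        if (auto sm = download(spill)) {
            for (const auto& seg : sm->segment_paths) {
                if (!exists(seg)) {
                    found.missing_segments.push_back(seg);
                }
            }
        }
    }
    return found;
}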

This commit extends the partition manifest to include the anomalies
detected by the scrubber. The next commit will add the "write path" for
this. Note that the anomalies are not included in the serde format, so
they will not be uploaded to the cloud. They are, however, included in
the snapshot.

The anomaly validation logic is also included in this commit (see
partition_manifest::process_anomalies). This is where false positives
detected by the scrubber are removed.

This commit implements the "anomaly write path". A new archival STM
command is introduced: process_anomalies_cmd. After replication, it
calls into the partition manifest which grabs the anomalies and
processes them.

This commit plugs the scrubber housekeeping job into ntp_archiver. Like
the adjacent segment merging job, it will only be enabled while the
ntp_archiver is aware that the local Raft replica is the leader.

This commit introduces a new endpoint to the admin API
/v1/cloud_storage/anomalies/{namespace}/{topic}/{partition} which allows
for the retrieval of anomalies detected by the cloud storage scrubber.

@dotnwat merged commit 5acbf6c into redpanda-data:dev on Sep 18, 2023
9 checks passed
Labels: area/cloud-storage (Shadow indexing subsystem), area/redpanda
3 participants