cloud_storage: add cloud storage scrubbing capabilities #13253
Conversation
/ci-repeat
Haven't made my way through all the tests yet, but so far looks pretty good!
@@ -533,6 +537,8 @@ class partition_manifest : public base_manifest {

    iobuf to_iobuf() const;

    void process_anomalies(scrub_status status, anomalies detected);
nit: maybe call it maybe_add_anomalies or something? I was surprised at first that a serde class was doing some processing/work, but it looks like this is just accepting and trimming some anomalies.
Hmm. maybe_add_anomalies doesn't reflect the fact that we perform trimming here. I'll have a think for a better name.
Changes in force-push:
- This commit renames the cloud storage scrubber to purger, since it didn't really do any scrubbing and only played a role during remote topic deletion. I went for purger since the code already used that term in various places.
- This commit adds three tunable cloud storage cluster configs. They are all fairly self-explanatory. In theory, end users should never have to touch these unless there is an issue with scrubbing.
Changes in force-push:
- We need to gate scrubbing behind a feature flag since it will end up replicating a new archival STM command.
- This commit adds a new field to the partition manifest to track the last scrub that occurred.
Just nits remaining, otherwise LGTM
Amazing stuff.
Are anomalies reported until they are resolved, and how are they expired from storage (or not expired)?
src/v/archival/purger.h (outdated)

- * The scrubber is a global sharded service: it runs on all shards, and
- * decides internally which shard will scrub which ranges of objects
+ * The purger is a global sharded service: it runs on all shards, and
+ * decides internally which shard will purge which ranges of objects
  * in object storage.
it sounds like there were bigger plans for the now-named purger that included scrubbing? but maybe it's easier to integrate whatever this purger is doing into the new framework later (if that's a plan at all)?
Right. The now named purger will deal with the deletion of orphaned objects too. John pushed a branch in the main redpanda repo with some POC orphan deletion stuff. I had a look through it and it looks good. The plan is to revamp that.
cool
  "cloud_storage_scrubbing_interval_ms",
  "Time interval between scrubs of the same partition",
  {.needs_restart = needs_restart::no, .visibility = visibility::tunable},
  1h)
curious about this 1 hour interval. does this mean that if we had 3600 partitions we'd be scrubbing a partition every second?
Good question. This interval is applied between two scrubs of the same partition. There's also a generous jitter of 10 minutes which tries to avoid scrubbing too many things at the same time.
To answer the question: yes, roughly. I say we go with 1h for now, but it will likely have to change to something like 6 or 12 hours. There's some more upcoming work on scrubbing scalability, and picking a good interval will be part of that.
@@ -184,6 +184,8 @@ struct archival_metadata_stm::snapshot
    kafka::offset start_kafka_offset;
    // List of spillover manifests
    fragmented_vector<segment> spillover_manifests;
    // Timestamp of last completed scrub
    model::timestamp last_partition_scrub;
I take it that model::timestamp has a reasonable default value for the case where upgraded code is reading old messages and wants to behave in an adaptive way (probably ignoring it?).
Yes. The default is -1, which is equal to model::timestamp::missing(). I could make it explicit.
src/v/archival/scrubber.cc (outdated)

    } catch (...) {
        vlog(
          _logger.error,
          "Unexpected exception while awaiting feature activation: {}",
          std::current_exception());
        co_return;
Since the feature await is kicked off in the constructor (as opposed to, for example, in ::start()), is there a risk that it runs too soon in the redpanda start-up sequence, such that some exceptions here are more likely and the scrubber never gets activated? I'm not very familiar with all the exceptions that might pop out of the await_feature interface, but maybe it's very unlikely and we shouldn't worry?
Good point. I did look through the features code and I think the only exception we can reasonably expect is from the abort source. This catch is me being defensive. I could add a start function, although the exercise seems a bit finicky.
Yeah, seems OK as-is.
// Binary manifest encoding and spillover manifests were both added
// in the same release. Hence, it's an anomaly to have a JSON
// encoded manifest and spillover manifests.
if (format == manifest_format::json && spill_manifest_paths.size() > 0) {
"Binary manifest encoding and spillover manifests were both added in the same release."
Makes sense. It seems like the existence of a spill manifest here tells us something about the release these manifests are coming from, but it seems like there is another, stronger property needed to complete the implication: something like, a spill manifest would imply that a JSON manifest was always replaced by a binary manifest?
"something like, a spill manifest would imply that a JSON manifest was always replaced by a binary manifest?"
Exactly!
auto first_kafka_offset = full_log_start_kafka_offset();
auto& missing_segs = detected.missing_segments;
erase_if(missing_segs, [&first_kafka_offset](const auto& meta) {
    return meta.next_kafka_offset() <= first_kafka_offset;
Do I understand correctly that this is saying that an apparently missing segment is not actually missing if it is sequenced before the starting offset (e.g. because of delete prefix / retention)?
Precisely. If all the offsets in a missing segment are below the starting Kafka offset, we should ignore it.
This commit introduces a stub housekeeping scrubber job. At this point it includes the scaffolding required by the housekeeping service and the scheduling logic. A future commit will plug in the actual scrubbing. A scrubbing scheduling utility class is also added. Separating this logic from the scrubber itself allows for writing unit tests.
Move spillover_manifest_path_components to cloud_storage/types.h in order to avoid circular dependencies in a future commit.
This commit introduces a new utility class that detects anomalies within cloud storage data and metadata. It performs the following steps:
1. Download partition manifest
2. Check for existence of spillover manifests
3. Check for existence of segments referenced by partition manifest
4. For each spillover manifest, check for existence of the referenced segments
This class will be extended with detection of other anomaly types in a future patch set.
This commit extends the partition manifest to include the anomalies detected by the scrubber. The next commit will add the "write path" for this. Note that the anomalies are not included in the serde format, so they will not be uploaded to the cloud. They are, however, included in the snapshot. The anomaly validation logic is also included in this commit (see partition_manifest::process_anomalies). This is where false positives detected by the scrubber are removed.
This commit implements the "anomaly write path". A new archival STM command is introduced: process_anomalies_cmd. After replication, it calls into the partition manifest which grabs the anomalies and processes them.
This commit plugs the scrubber housekeeping job into ntp_archiver. Like with the adjacent segment merging job, it will only be enabled while the ntp archiver is aware of the local Raft being the leader.
This commit introduces a new endpoint to the admin API /v1/cloud_storage/anomalies/{namespace}/{topic}/{partition} which allows for the retrieval of anomalies detected by the cloud storage scrubber.
This is the first PR for cloud storage scrubbing. The idea is that we want Redpanda to detect
inconsistencies in the data and metadata uploaded to the cloud storage bucket on its own.
There are three parts to the approach in this PR. First, a housekeeping scrubber job is scheduled per partition;
then it downloads the metadata from the bucket and looks for anomalies. Currently only missing partition manifests,
missing spillover manifests and missing segments are supported. More will follow.
Finally, the partition manifest validates the anomalies to ensure they are not false positives. The valid anomalies are kept in memory.
In the interest of keeping this reviewable, let's defer the items below for the next PRs:
Backports Required
Release Notes
Features
A new metric, redpanda_cloud_storage_anomalies, will increment its counters based on the anomaly type. Per-partition anomalies can be queried via the /v1/cloud_storage/anomalies/ admin API endpoint.