docs: sstable / replica corruption RFC #68378
Conversation
Would really like to get some feedback on this from KV folks too - any ideas on who I should tag?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @itsbilal, @petermattis, and @sumeerbhola)
docs/RFCS/20210730_replica_corruption_recovery.md, line 31 at r1 (raw file):
While this is okay from a correctness perspective, it is still something that requires manual operator intervention and can unnecessarily excessive in
s/can/can cause/?
docs/RFCS/20210730_replica_corruption_recovery.md, line 60 at r1 (raw file):
`CorruptSpans`. This is necessary as Cockroach will be expected to garbage collect replicas corresponding to reported instances of corruption, and those GC requests will come into pebble in the form of range deletion tombstones.
What happens if there's an open snapshot at a sequence number lower than the range tombstone? Do we need to wait until the range tombstone is in the last snapshot stripe? What if the snapshot exists because we're trying to replicate the corrupt range to another store?
docs/RFCS/20210730_replica_corruption_recovery.md, line 61 at r1 (raw file):
collect replicas corresponding to reported instances of corruption, and those GC requests will come into pebble in the form of range deletion tombstones. Compactions of any keys other than point tombstones into a `CorruptSpan` will
should this say "other than range tombstones"?
docs/RFCS/20210730_replica_corruption_recovery.md, line 62 at r1 (raw file):
GC requests will come into pebble in the form of range deletion tombstones. Compactions of any keys other than point tombstones into a `CorruptSpan` will continue erroring out in the background.
Since the compaction picker is deterministic, we may repeatedly try to compact a corrupt sstable before it's been GC'd. This has the potential to starve compactions either into or out of the corrupt sstable's level. Should the compaction picker try to avoid compactions that involve the corrupt sstable but don't remove the corrupt span? I think it would be straightforward to skip `IsCorrupt` files in `pickFile`, but that only handles the case where the corrupt sstable is the seed file in the input level. Eventually, if we refactored the compaction picker to iterate over pre-expanded atomic compaction units (cockroachdb/pebble#1152), we could skip any compaction units that contain input-level files with `IsCorrupt` marked.
Compactions into the level containing the corrupt sstable seem trickier, because we don't know ahead of time which compactions might drop the corrupt span. One idea is to try the compaction once. When it fails with a corruption error, record a sequence number on the file's metadata indicating that no compaction should be attempted into that file unless the compaction contains an input file with a higher sequence number than the one recorded. That would allow us to skip compactions into the file until a new sstable that might potentially drop the corrupted range exists in the above level.
Maybe it's moot if the entire turnaround on GC-ing a corrupt replica is quick.
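A minimal sketch of that seqnum-gating idea, assuming hypothetical `IsCorrupt` and `lastFailedSeqNum` fields on the file metadata (the names and the picker hook are illustrative only, not existing Pebble API):

```go
package sketch

// corruptFileState is a hypothetical annotation on a file's metadata.
type corruptFileState struct {
	IsCorrupt bool
	// lastFailedSeqNum records the largest seqnum among the inputs of the
	// last compaction into this file that failed with a corruption error.
	lastFailedSeqNum uint64
}

// shouldSkipCompactionInto reports whether a candidate compaction into a
// corrupt file is guaranteed to fail again: unless the input level now holds
// keys newer than what the failed attempt saw (e.g. a range tombstone from
// replica GC), retrying cannot drop the corrupt span.
func shouldSkipCompactionInto(target corruptFileState, inputLargestSeqNum uint64) bool {
	return target.IsCorrupt && inputLargestSeqNum <= target.lastFailedSeqNum
}
```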
docs/RFCS/20210730_replica_corruption_recovery.md, line 68 at r1 (raw file):
it to a `corrupt` subdirectory in the store directory. This would reduce the chances of a corrupt block/sector of disk getting re-used by Pebble further down the line, and would effectively quarantine that file for Pebble's purposes.
Interesting! If we wanted to limit the wasted disk space, we could use `fallocate` on supporting systems to punch holes in the sstables to deallocate the non-corrupt regions.
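For reference, a hedged sketch of hole punching on Linux via `golang.org/x/sys/unix` (the quarantine workflow and offsets are assumptions; whether this meaningfully helps on modern SSDs is questioned later in this thread):

```go
package sketch

import (
	"os"

	"golang.org/x/sys/unix"
)

// punchHole deallocates [off, off+length) in a quarantined sstable while
// keeping the file size intact. The caller would invoke it over the
// non-corrupt byte ranges so the corrupt block stays pinned on disk but the
// rest of the file stops consuming space.
func punchHole(path string, off, length int64) error {
	f, err := os.OpenFile(path, os.O_RDWR, 0)
	if err != nil {
		return err
	}
	defer f.Close()
	return unix.Fallocate(int(f.Fd()),
		unix.FALLOC_FL_PUNCH_HOLE|unix.FALLOC_FL_KEEP_SIZE, off, length)
}
```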
docs/RFCS/20210730_replica_corruption_recovery.md, line 114 at r1 (raw file):
5) The replicateQueue will remove these replicas, and the GC queue will issue range deletion tombstones for those replicas, which will then get compacted into the corrupt sstables, deleting them.
Do the replica IDs get cleared from `corruptReplicas`?
docs/RFCS/20210730_replica_corruption_recovery.md, line 139 at r1 (raw file):
corrupt sstable(s). This should be unlikely as sstables with range tombstones are already prioritized for compactions, but it's something that would need to be closely observed.
If this turns out to be a problem, we might be able to work something into the code path that calculates the `RangeDeletionBytesEstimate` used to prioritize range deletions. When we see a range deletion in a new sstable, we already find all overlapping files in lower levels in order to estimate the amount of data dropped by the range tombstone. If any of the overlapping files have the `IsCorrupt` flag flipped, we could record something in the tombstone file's table stats to prioritize its compaction.
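A rough sketch of what that stats hook could look like; every name below is a hypothetical stand-in, not Pebble's actual table-stats plumbing:

```go
package sketch

// fileMeta stands in for the overlapping lower-level file metadata seen while
// estimating bytes dropped by a new table's range tombstones.
type fileMeta struct {
	IsCorrupt             bool
	estimatedDroppedBytes uint64
}

// tableStats stands in for per-table stats; the extra boolean would let the
// compaction picker bump the priority of tombstone-carrying tables whose
// deletions would clear a corrupt file.
type tableStats struct {
	RangeDeletionBytesEstimate uint64
	OverlapsCorruptFile        bool
}

func accumulateRangeDelStats(stats *tableStats, overlapping []*fileMeta) {
	for _, f := range overlapping {
		stats.RangeDeletionBytesEstimate += f.estimatedDroppedBytes
		if f.IsCorrupt {
			stats.OverlapsCorruptFile = true
		}
	}
}
```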
docs/RFCS/20210730_replica_corruption_recovery.md, line 155 at r1 (raw file):
benefit of this extra handlifting is less clear. Another alternative implementation is to support arbitraty compactions into
s/arbitraty/arbitrary/
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @itsbilal, @petermattis, and @sumeerbhola)
docs/RFCS/20210730_replica_corruption_recovery.md, line 34 at r1 (raw file):
disruption for the cluster (eg. entire node being taken down for a bit flip in an sstable somewhere). We can do better; we could just delete the replicas corresponding to those sstables (or better yet, specific blocks of data within
How many replicas are typically covered by one sstable? When I looked at this a few years ago our L0 sstables were quite wide because of the way nearly every SQL write touches both regular and range-local keys. Corruption in one of these could easily knock out half the store. (This is also relatively easy to fix if we hack in something that breaks L0 sstables in two at the local/global split point. Maybe we've already done this).
docs/RFCS/20210730_replica_corruption_recovery.md, line 67 at r1 (raw file):
`deleteObsoleteFiles` after the compaction is complete. Instead, it could move it to a `corrupt` subdirectory in the store directory. This would reduce the chances of a corrupt block/sector of disk getting re-used by Pebble further
Would that work these days? SSDs tend to have enough abstractions and translation layers (and wear-leveling movement) that you don't have much control over when you're reusing a specific block.
docs/RFCS/20210730_replica_corruption_recovery.md, line 112 at r1 (raw file):
handing in `Allocator.computeAction` such as what we have for `AllocateRemove{Dead,Decommissioning}Voter`?) 5) The replicateQueue will remove these replicas, and the GC queue will issue
The replicateQueue only processes replicas for which the current node is the leaseholder. And if the store with the corruption is not already the leaseholder we don't want it to try and grab the lease (we probably want to make the opposite move if it happens to already be the leaseholder). That means we need some way for the node with the corrupted sst to let the leaseholder know to remove it. I believe decommissioning uses gossip for this and we should be able to do something similar here.
docs/RFCS/20210730_replica_corruption_recovery.md, line 160 at r1 (raw file):
properties so future restarts of Pebble do not lose those. # Unresolved questions
Once we've seen one corruption event, how likely are we to see more corruption on the same store? Is there a point at which we'd be better off having the node decommission itself rather than recover and keep going on the same device?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell, @itsbilal, @jbowens, and @petermattis)
docs/RFCS/20210730_replica_corruption_recovery.md, line 34 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
How many replicas are typically covered by one sstable? When I looked at this a few years ago our L0 sstables were quite wide because of the way nearly every SQL write touches both regular and range-local keys. Corruption in one of these could easily knock out half the store. (This is also relatively easy to fix if we hack in something that breaks L0 sstables in two at the local/global split point. Maybe we've already done this).
- We now split into multiple sstables when flushing, based on Lbase boundaries, so sstables should be narrower than before. We don't have any explicit code for local/global keys.
- We narrow the corruption down to the sstable block that is corrupted, which is much more likely to be one of the many data blocks than the one index block, and we know the key bounds of that data block, so the corrupt span could be much narrower than the sstable bounds.
docs/RFCS/20210730_replica_corruption_recovery.md, line 60 at r1 (raw file):
Previously, jbowens (Jackson Owens) wrote…
What happens if there's an open snapshot at a sequence number lower than the range tombstone? Do we need to wait until the range tombstone is in the last snapshot stripe? What if the snapshot exists because we're trying to replicate the corrupt range to another store?
Good question. My thinking was that we would not try to fully reuse range tombstones for this purpose. One issue there is that we don't know which sstable has what range tombstones without reading it, and these corrupt files will not participate in compactions until we know the corrupt data in them is no longer relevant.
- Pebble would maintain a set of corrupt spans each with a max seqnum of what is corrupt
- CockroachDB would say via a call into Pebble that it was safe to discard data for a span [k1, k2) (corresponding to a CockroachDB range). Pebble would then keep track of such ignorable spans (along with the max seqnum), and if the corrupt spans for a file were such that all were now ignorable, it would again become eligible to be picked for compaction. We would then skip over the corrupt parts when compacting. We'd not actually be deleting data, just ignoring the corrupt parts, so there is no interaction with snapshots.
The main difference is that these ignorable spans are being maintained at the Pebble level and not inside sstables, and that is fine since there should be few of them. I am hoping we don't need to persist them (hmm, this needs some more thought).
There is an issue that if there is someone with an open snapshot that is planning to read this range's data, it will silently miss the corruption. Which now makes me glad that we are not using snapshot trickery to do things like lock table migrations, and instead following the normal rules of reading data for the range only when the node has a replica. Since the "safe to discard" call will not happen until the replica is removed, we shouldn't have anyone holding onto a snapshot that reads this range's state.
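A minimal sketch of what such a DB-level registry might look like (all names and the single-span containment check are assumptions, not a worked-out Pebble design):

```go
package sketch

import (
	"bytes"
	"sync"
)

// ignorableSpan is a [start, end) user-key range that CockroachDB has declared
// safe to discard, along with the max seqnum of the corrupt data it covers.
type ignorableSpan struct {
	start, end []byte
	maxSeqNum  uint64
}

// ignorableSpans is the hypothetical DB-level registry. A corrupt file becomes
// eligible for compaction again once all of its corrupt spans are covered.
type ignorableSpans struct {
	mu    sync.Mutex
	spans []ignorableSpan
}

func (s *ignorableSpans) Add(start, end []byte, maxSeqNum uint64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.spans = append(s.spans, ignorableSpan{start, end, maxSeqNum})
}

// Covers reports whether corrupt data in [start, end) at or below seqNum is
// contained in a single registered span. (A real implementation would merge
// spans and handle coverage by multiple spans.)
func (s *ignorableSpans) Covers(start, end []byte, seqNum uint64) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	for _, is := range s.spans {
		if bytes.Compare(is.start, start) <= 0 &&
			bytes.Compare(end, is.end) <= 0 &&
			seqNum <= is.maxSeqNum {
			return true
		}
	}
	return false
}
```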
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell, @itsbilal, @jbowens, and @petermattis)
docs/RFCS/20210730_replica_corruption_recovery.md, line 60 at r1 (raw file):
Previously, sumeerbhola wrote…
Good question. My thinking was that we would not try to fully reuse range tombstones for this purpose. One issue there is that we don't know which sstable has what range tombstones without reading it, and these corrupt files will not participate in compactions until we know the corrupt data in them is no longer relevant.
- Pebble would maintain a set of corrupt spans each with a max seqnum of what is corrupt
- CockroachDB would say via a call into Pebble that it was safe to discard data for a span [k1, k2) (corresponding to a CockroachDB range). Pebble would then keep track of such ignorable spans (along with the max seqnum), and if the corrupt spans for a file were such that all were now ignorable, it would again become eligible to be picked for compaction. We would then skip over the corrupt parts when compacting. We'd not actually be deleting data, just ignoring the corrupt parts, so there is no interaction with snapshots.
The main difference is that these ignorable spans are being maintained at the Pebble level and not inside sstables, and that is fine since there should be few of them. I am hoping we don't need to persist them (hmm, this needs some more thought). There is an issue that if there is someone with an open snapshot that is planning to read this range's data, it will silently miss the corruption. Which now makes me glad that we are not using snapshot trickery to do things like lock table migrations, and instead following the normal rules of reading data for the range only when the node has a replica. Since the "safe to discard" call will not happen until the replica is removed, we shouldn't have anyone holding onto a snapshot that reads this range's state.
Thought a bit more about this:
- Range tombstones can be split by compactions. So if we have a corrupt file at L6 with a corrupt span [a, z) and a range tombstone gets added for [a, z) at L0 there is nothing preventing it from being split such that only part of it is in an L5 file. Now this L5 file cannot be successfully compacted with this L6 file -- we have to wait for the rest of the range tombstone to make it all the way down to L5. This becomes complicated and delays the compaction. Externalizing this state to be at the DB level, via "ignorable-spans" (there is probably a better name for this), sidesteps this problem.
- I think we can avoid persisting these ignorable spans. We will forget them if the node crashes, and when we come back and there are still some corrupt files whose spans were previously ignorable, we have two possibilities:
- the span corresponds to ranges that are not on this node. CockroachDB can immediately mark these as ignorable when it gets informed about the corrupt spans.
- the span corresponds to a range that is on this node. It needs to again repeat the process of discarding this replica. This should be rare, so I think it is fine to be wasteful here.
- Once the ignorable span and range tombstone have been added, the node should be free to again add a replica of the range (without waiting for the corruption to be compacted away). This means reads of the newly re-added range should not encounter corruption. The range tombstone partially prevents the corruption from being visible since `mergingIter` uses tombstones to seek past the whole tombstone range. But again, if the range tombstone has been split, we could land in the middle of the range and see the corruption. We could use the DB level knowledge of ignorable-spans, which is a (span, seqnum) pair, to skip the corrupt data. I am worried this is getting too complicated. A simpler alternative is to prioritize rewriting of these corrupt files to remove the corruption (single file compactions) and not allow adding back a replica until that is done -- since corruption is not common we can afford the prioritization, and the small delay until the replica can be added back again.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell, @itsbilal, @petermattis, @sumeerbhola, and @tbg)
docs/RFCS/20210730_replica_corruption_recovery.md, line 60 at r1 (raw file):
A simpler alternative is to prioritize rewriting of these corrupt files to remove the corruption (single file compactions) and not allow adding back a replica until that is done -- since corruption is not common we can afford the prioritization, and the small delay until the replica can be added back again.
That makes sense to me.
Sequence number zeroing would be another complication for externalized range tombstones/ignorable spans. Keys at sequence numbers higher than the ignorable span may have their sequence numbers zeroed, removing the knowledge that it's more recent than the ignorable span. Deletion-only compactions deal with this by removing any pending deletion hints that overlap any seqnum-zeroing compaction's key range. Ignorable spans wouldn't be optional optimizations like deletion-only compaction hints though.
docs/RFCS/20210730_replica_corruption_recovery.md, line 98 at r1 (raw file):
(will this have the right consistency guarantees?) corresponding to the corrupt span(s), and use its own store's `GetReplicaIfExists` to get the `*Replica`.
What would the behavior be if the corrupt span corresponds to node-local data? Just crash with an informative corruption message?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell, @itsbilal, @sumeerbhola, and @tbg)
docs/RFCS/20210730_replica_corruption_recovery.md, line 46 at r1 (raw file):
The FileMetadata in Pebble will be updated to include an `IsCorrupt bool` and a `CorruptSpans` set of spans. The former will be atomically flipped to true
Sounds like this bit and the `CorruptSpans` won't be persisted to disk. That's fine, just noting that could be made more explicit in the text here.
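For concreteness, the shape being discussed, with the not-persisted caveat spelled out (an abbreviated sketch; the two field names follow the RFC text, everything else is elided or assumed):

```go
package sketch

// keySpan is a [Start, End) user-key range.
type keySpan struct {
	Start, End []byte
}

// fileMetadata sketches the proposed additions. Per this thread they are
// in-memory only (not written to the MANIFEST) and are rediscovered if
// corruption is hit again after a restart.
type fileMetadata struct {
	FileNum uint64
	// ... existing fields elided ...

	// IsCorrupt is flipped to true when a read or compaction in this file
	// fails with a corruption error.
	IsCorrupt bool
	// CorruptSpans holds the key bounds of the corrupt block(s), narrowed as
	// far as the block index allows, so CockroachDB can map them to replicas.
	CorruptSpans []keySpan
}
```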
A few comments with an eye towards operations.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell, @itsbilal, @sumeerbhola, and @tbg)
docs/RFCS/20210730_replica_corruption_recovery.md, line 71 at r1 (raw file):
Non-compaction read paths will be audited to ensure that all outbound corruption errors are tagged with `base.ErrCorruption`, and none of them throw a panic
Post implementation of this design doc, will we ever panic CRDB due to corruption?
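In code terms, the audit amounts to making every read path return an error callers can branch on with `errors.Is` rather than panicking; a hedged sketch (the sentinel below is a local stand-in for Pebble's `base.ErrCorruption`, and the handler is illustrative):

```go
package sketch

import (
	"errors"
	"log"
)

// errCorruption is a stand-in for Pebble's corruption sentinel
// (base.ErrCorruption in the RFC text).
var errCorruption = errors.New("pebble: corruption")

// handleStorageError is what audited read paths would funnel into:
// recoverable corruption is reported for replica-level recovery, while
// corruption in node-local keys (and any other storage error) still crashes.
func handleStorageError(err error, isNodeLocal bool, report func(error)) {
	if err == nil {
		return
	}
	if errors.Is(err, errCorruption) && !isNodeLocal {
		report(err)
		return
	}
	log.Fatalf("unrecoverable storage error: %v", err)
}
```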
docs/RFCS/20210730_replica_corruption_recovery.md, line 80 at r1 (raw file):
type CorruptionManager struct {
Can we have both a structured log message & a time-series counter tracking how often KV "reacts to instances of corruption"? Then we can alert on this.
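A sketch of how that could hang off the `CorruptionManager` (the counter and log call below are placeholders for CRDB's real metric and structured-logging machinery):

```go
package sketch

import (
	"log"
	"sync"
	"sync/atomic"
)

// CorruptionManager sketches the KV-side component that receives corrupt-span
// reports from Pebble. eventCount stands in for a time-series counter that
// alerting can watch, and log.Printf stands in for a structured log event.
type CorruptionManager struct {
	mu              sync.Mutex
	corruptReplicas map[int64]struct{} // rangeID -> marked corrupt

	eventCount int64 // exported as a time-series counter
}

// ReportCorruptSpan would be invoked from Pebble's corruption callback.
func (cm *CorruptionManager) ReportCorruptSpan(start, end []byte, rangeIDs []int64) {
	atomic.AddInt64(&cm.eventCount, 1)
	log.Printf("corruption event: span=[%q, %q), replicas=%v", start, end, rangeIDs)

	cm.mu.Lock()
	defer cm.mu.Unlock()
	if cm.corruptReplicas == nil {
		cm.corruptReplicas = make(map[int64]struct{})
	}
	for _, id := range rangeIDs {
		cm.corruptReplicas[id] = struct{}{}
	}
}
```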
docs/RFCS/20210730_replica_corruption_recovery.md, line 116 at r1 (raw file):
into the corrupt sstables, deleting them. ## Drawbacks
Can we provide a way to turn off this functionality, going back to panicking instead, to be used by devs & operators in case of unexpected outage where the functionality is making things worse rather than better?
docs/RFCS/20210730_replica_corruption_recovery.md, line 121 at r1 (raw file):
across the cluster at the same time could put the cluster in a state with many underreplicated or unavailable ranges. However the likelihood of this happening would not be any higher than under the pre-21.2 status quo, where
Isn't the key difference tho that now we are deleting data? In the pre-21.2 case, we don't delete data; we just crash the node. In the worst case, could this delete data process, if it happens across multiple nodes at roughly the same time, due to a bug or similar, lead to data loss? One can imagine a rate limit on how much deletion we do due to corruption. This seems wise to have to me; perhaps we already have considered this?
docs/RFCS/20210730_replica_corruption_recovery.md, line 160 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Once we've seen one corruption event, how likely are we to see more corruption on the same store? Is there a point at which we'd be better off having the node decommission itself rather than recover and keep going on the same device?
As per my request for structured log lines & time-series metrics, in CC land I would think we would alert in case of a lot of this happening. We may want to take manual action, e.g. spinning up a new node with a different disk attached, if a single node is experiencing all the corruption.
Force-pushed from c6e6e4b to 92d437b.
Quick update to address some minor comments and to reply to a larger thread, while I continue to code-read the KV parts.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell, @itsbilal, @jbowens, @joshimhoff, @petermattis, @sumeerbhola, and @tbg)
docs/RFCS/20210730_replica_corruption_recovery.md, line 31 at r1 (raw file):
Previously, jbowens (Jackson Owens) wrote…
s/can/can cause/?
Done.
docs/RFCS/20210730_replica_corruption_recovery.md, line 46 at r1 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Sounds like this bit and the `CorruptSpans` won't be persisted to disk. That's fine, just noting that could be made more explicit in the text here.
Done.
docs/RFCS/20210730_replica_corruption_recovery.md, line 60 at r1 (raw file):
Previously, jbowens (Jackson Owens) wrote…
A simpler alternative is to prioritize rewriting of these corrupt files to remove the corruption (single file compactions) and not allow adding back a replica until that is done -- since corruption is not common we can afford the prioritization, and the small delay until the replica can be added back again.
That makes sense to me.
Sequence number zeroing would be another complication for externalized range tombstones/ignorable spans. Keys at sequence numbers higher than the ignorable span may have their sequence numbers zeroed, removing the knowledge that it's more recent than the ignorable span. Deletion-only compactions deal with this by removing any pending deletion hints that overlap the compaction's key range. Ignorable spans wouldn't be optional optimizations like deletion-only compaction hints though.
Not sure if I fully see why we need the ignorable spans in the first place. Since the range tombstone will only be laid at the very end when the replica is being discarded (and the current RFC calls for nothing special to indicate that to pebble beyond just a normal replica deletion), shouldn't we be okay with just waiting for the range tombstone to fall into the last snapshot stripe? There will inevitably be more compaction-time churn of corrupt files being read and the compaction failing, but that can be seen as a separate problem that can be solved with the max seqnum thing that Jackson is mentioning below to prevent starvation.
docs/RFCS/20210730_replica_corruption_recovery.md, line 61 at r1 (raw file):
Previously, jbowens (Jackson Owens) wrote…
should this say "other than range tombstones"?
Oops. Done.
docs/RFCS/20210730_replica_corruption_recovery.md, line 62 at r1 (raw file):
Previously, jbowens (Jackson Owens) wrote…
Since the compaction picker is deterministic, we may repeatedly try to compact a corrupt sstable before it's been GC'd. This has the potential to starve compactions either into or out of the corrupt sstable's level. Should the compaction picker try to avoid compactions that involve the corrupt sstable but don't remove the corrupt span? I think it would be straightforward to skip `IsCorrupt` files in `pickFile`, but that only handles the case where the corrupt sstable is the seed file in the input level. Eventually, if we refactored the compaction picker to iterate over pre-expanded atomic compaction units (cockroachdb/pebble#1152), we could skip any compaction units that contain input-level files with `IsCorrupt` marked.
Compactions into the level containing the corrupt sstable seem trickier, because we don't know ahead of time which compactions might drop the corrupt span. One idea is to try the compaction once. When it fails with a corruption error, record a sequence number on the file's metadata indicating that no compaction should be attempted into that file unless the compaction contains an input file with a higher sequence number than the one recorded. That would allow us to skip compactions into the file until a new sstable that might potentially drop the corrupted range exists in the above level.
Maybe it's moot if the entire turnaround on GC-ing a corrupt replica is quick.
I like this idea. I too was worried about the determinism and possible starvation, so this definitely helps us get out of it. Implemented correctly, I hope this would also reduce the need to have ignorable spans, though happy to be convinced otherwise.
docs/RFCS/20210730_replica_corruption_recovery.md, line 67 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Would that work these days? SSDs tend to have enough abstractions and translation layers (and wear-leveling movement) that you don't have much control over when you're reusing a specific block.
Good point. I got this idea from @petermattis, would probably need to look deeper into this.
docs/RFCS/20210730_replica_corruption_recovery.md, line 71 at r1 (raw file):
Previously, joshimhoff (Josh Imhoff) wrote…
Post implementation of this design doc, will we ever panic CRDB due to corruption?
In some instances, yes. I've now clarified this in the RFC, but not all instances of corruption will be recoverable (eg. corruption in node-local keys). It's just that the point of distinguishing between them will get moved to Cockroach.
docs/RFCS/20210730_replica_corruption_recovery.md, line 98 at r1 (raw file):
Previously, jbowens (Jackson Owens) wrote…
What would the behavior be if the corrupt span corresponds to node-local data? Just crash with an informative corruption message?
Good point - and something @joshimhoff also indirectly brought up. In that case, Cockroach will crash / panic. Clarifying.
docs/RFCS/20210730_replica_corruption_recovery.md, line 112 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
The replicateQueue only processes replicas for which the current node is the leaseholder. And if the store with the corruption is not already the leaseholder we don't want it to try and grab the lease (we probably want to make the opposite move if it happens to already be the leaseholder). That means we need some way for the node with the corrupted sst to let the leaseholder know to remove it. I believe decommissioning uses gossip for this and we should be able to do something similar here.
Thanks for bringing this up! Working on updating this part after some code reading..
Force-pushed from 92d437b to 5ae50ba.
TFTRs! Addressed many of the comments, with the main ones being:
- Misc ops-related changes such as log lines and metric gauges (thanks @joshimhoff !)
- Spelled out how the compaction picker would try to get out of repeated scheduling of compactions that are guaranteed to fail (thanks @jbowens !)
- Figured out a plan to signal to leaseholder over gossip that a follower is corrupt (thanks @bdarnell !), though it might not be that refined.
Biggest outstanding thread is the one about range tombstones vs. ignored spans.
Dismissed @joshimhoff from a discussion.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell, @itsbilal, @jbowens, @joshimhoff, @petermattis, @sumeerbhola, and @tbg)
docs/RFCS/20210730_replica_corruption_recovery.md, line 80 at r1 (raw file):
Previously, joshimhoff (Josh Imhoff) wrote…
Can we have both a structured log message & a time-series counter tracking how often KV "reacts to instances of corruption"? Then we can alert on this.
Done.
docs/RFCS/20210730_replica_corruption_recovery.md, line 114 at r1 (raw file):
Previously, jbowens (Jackson Owens) wrote…
Do the replica IDs get cleared from `corruptReplicas`?
Done. Clarified where this would happen.
docs/RFCS/20210730_replica_corruption_recovery.md, line 116 at r1 (raw file):
Previously, joshimhoff (Josh Imhoff) wrote…
Can we provide a way to turn off this functionality, going back to panicking instead, to be used by devs & operators in case of unexpected outage where the functionality is making things worse rather than better?
Done. I added a `CrashOnCorruption` option in Pebble. Unfortunately that would mean it'd be a per-node setting that'd have to be specified in the command line. I could alternatively add a cluster setting, but cluster settings for low-level knobs like these can be precarious as ultimately the setting will be stored in Pebble as a key.
docs/RFCS/20210730_replica_corruption_recovery.md, line 121 at r1 (raw file):
Previously, joshimhoff (Josh Imhoff) wrote…
Isn't the key difference tho that now we are deleting data? In the pre-21.2 case, we don't delete data; we just crash the node. In the worst case, could this delete data process, if it happens across multiple nodes at roughly the same time, due to a bug or similar, lead to data loss? One can imagine a rate limit on how much deletion we do due to corruption. This seems wise to have to me; perhaps we already have considered this?
We haven't really considered this, as the incidence of corruption events is rare enough that a rate limit would be unlikely to ever be hit. That said, we could add this as a safeguard regardless. It could be at the allocator step; not more than 1 replica deleted per minute per allocator/leaseholder or something.
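If we did add the safeguard, a minimal sketch using `golang.org/x/time/rate` (the one-per-minute budget and the hook point are assumptions taken from this thread, not a settled design):

```go
package sketch

import (
	"time"

	"golang.org/x/time/rate"
)

// removalLimiter budgets corruption-driven replica removals to roughly one
// per minute per allocator/leaseholder, as floated above.
var removalLimiter = rate.NewLimiter(rate.Every(time.Minute), 1)

// maybeRemoveCorruptReplica only proceeds when a token is available;
// otherwise the corrupt replica is left in place and retried on a later pass.
func maybeRemoveCorruptReplica(remove func() error) error {
	if !removalLimiter.Allow() {
		return nil
	}
	return remove()
}
```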
docs/RFCS/20210730_replica_corruption_recovery.md, line 155 at r1 (raw file):
Previously, jbowens (Jackson Owens) wrote…
s/arbitraty/arbitrary/
Done.
docs/RFCS/20210730_replica_corruption_recovery.md, line 160 at r1 (raw file):
Previously, joshimhoff (Josh Imhoff) wrote…
As per my request for structured log lines & time-series metrics, in CC land I would think we would alert in case of a lot of this happening. We may want to take manual action, e.g. spinning up a new node with a different disk attached, if a single node is experiencing all the corruption.
Added a metric for this. Yet to add a rate-limiter on replica deletions due to corruption though.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell, @itsbilal, @jbowens, @joshimhoff, @petermattis, @sumeerbhola, and @tbg)
docs/RFCS/20210730_replica_corruption_recovery.md, line 80 at r1 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
Done.
Thanks for helping us write alerts!
docs/RFCS/20210730_replica_corruption_recovery.md, line 116 at r1 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
Done. I added a `CrashOnCorruption` option in Pebble. Unfortunately that would mean it'd be a per-node setting that'd have to be specified in the command line. I could alternatively add a cluster setting, but cluster settings for low-level knobs like these can be precarious as ultimately the setting will be stored in Pebble as a key.
For SRE, CLI sounds good. Thanks for the kill switch!
docs/RFCS/20210730_replica_corruption_recovery.md, line 121 at r1 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
We haven't really considered this, as the incidence of corruption events is rare enough that a rate limit would be unlikely to ever be hit. That said, we could add this as a safeguard regardless. It could be at the allocator step; not more than 1 replica deleted per minute per allocator/leaseholder or something.
I think the argument for this is almost exactly what you write. If lots of corruption is hit at once, it may indicate a bug in the corruption detection logic or similar, as it is NOT expected for corruption to be hit often. We can avoid that bug absolutely wrecking a cluster by rate limiting the rate of deletes, possibly without adding much extra complexity to this feature.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell, @itsbilal, @jbowens, @joshimhoff, and @petermattis)
docs/RFCS/20210730_replica_corruption_recovery.md, line 60 at r1 (raw file):
Not sure if I fully see why we need the ignorable spans in the first place.
The first bullet in my previous comment is an annoyance when working with range tombstones. But I suspect it is not a blocker -- I tried briefly to see if we could construct a pathological scenario where no compaction progress could be made because of tombstones being split, and could not, because I don't think one can create a graph cycle.
The third bullet is I think still a problem. If tombstones are split and the range has been added again, I think `mergingIter` could land in the middle of the corruption when a client is iterating using an `Iterator`. Worth confirming that I am not mistaken.
Regarding sequence number zeroing being a complication wrt externalized ignorable spans, I think we can adjust the `compaction.allowedZeroSeqNum` logic to account for this.
It is of course better if we don't need to introduce another abstraction. But I am somewhat worried that we will complicate other things by trying to do this solely via range tombstones, like the (a) iterator code, (b) code to try to avoid retrying failed compactions, even though the file ought to be able to participate in compactions due to range tombstones having been added, (c) code to successfully do multi-file compactions where one or more files have some corrupt data (this may overlap with (a)). With ignorable spans the main benefit would be that as soon as a corrupt file's corrupted spans were marked as ignored we could put it first in the list of compactions as a single file rewrite (which I realize we don't currently have). The tradeoffs here need some more thought, and I am not actually sure which is better.
btw, how are we going to skip corrupt blocks when using `compactionIter`? `compactionIter` does not utilize the tombstone based seeking logic that is in `mergingIter` because it needs to write out the range tombstones too. So it is iterating over all points and calling `i.rangeDelFrag.Deleted`, which means it will encounter the corrupt blocks.
I wonder whether we are over-engineering this. If sstables are equally likely to be corrupt, then the probability of it being in L5 or L6 is 99%, which means the key span is not going to be super wide (for any large-ish LSM). We could declare the whole file's width as corrupt, and then later simply discard it by installing a new version.
Apologies for popping in here so late!
If I understand correctly, the high-level strategy is for replication changes to remove the corrupt replicas, and the RFC acknowledges that this is more difficult if the corruption is on the leaseholder. I'm not sure the complexity in that case is fully understood yet though. The leaseholder is the only replica that can propose Raft commands. The leaseholder is corrupt in that example, and cannot be expected to propose anything. In particular, it can't transfer the lease away, meaning the remaining option is a non-cooperative lease transfer, i.e. just stopping liveness heartbeats. This is too disruptive, it's already not too far away from terminating the node. We need some way for the leaseholder to give up control. This isn't impossible - we need some out-of-band way to signal to a follower that as of a given HLC timestamp the leaseholder will not use their lease any more, and followers can factor that into their calculus on when to request a lease (and the evaluation logic needs updates to allow what technically looks like violations of the lease invariants, etc...) - but it's not something we have or had on our radar (though I think it came up in some other context at some point).
The other comment I have is that I think at the end of the work proposed in this RFC we'll still be in mostly the same place unless we can also teach all (or most relevant) code paths in CockroachDB to react gracefully to storage errors. Realistically speaking I think this means a Replica-level concept of corruption where the Replica stops all write processing and triggers the out-of-band lease relinquishment.
The callback-based mechanism is useful but in light of the above it's not paying attention to half the battle, which is when the corruption pops up during Replica work (as I expect it frequently does). I would give the Replica a more proactive role in handling its own corruption.
Filling in some of these considerations into the RFC's design (which in principle I find appealing):
- Replica gets a notion of being corrupted (i.e. can be marked or mark itself as such) - btw, we had this a while ago but never productionized it enough, it's a bunch of work. There are metrics on how many replicas mark themselves as corrupt (both a counter and a gauge) and this is reflected in the RangeStatus report (which does not error out, like other diagnostic endpoints).
- Replicas mark themselves as corrupt when they encounter an `errors.Is(err, pebble.ErrCorruption)` (see the sketch after this list).
- Corrupted Replicas immediately stop serving reads and writes on the request path. (They need to stop serving reads to make it safe for someone else to request the lease, as happens later.) But we'll still allow the replicaGCQueue to work.
- when pebble finds a corruption, the callback fires (with a metric too). The callback iterates over all local Replicas and marks those overlapping the corrupt span as corrupt. If the corrupt span reaches into the range-local keyspace, the node fatals (marking all replicas as corrupt is not a useful alternative, it is more effective to crash in this case which incurs only a ~7s outage and the node can be wiped & brought back up).
- The Store surrounding the corrupted Replica gossips a signal marking that Replica as corrupt to the rest of the cluster. This suppresses incoming lease transfers (i.e. lease preferences, etc) and provides a signal to the distribution subsystem to rebalance off this Replica. The signal is gossiped until the local Replica disappears. The signal contains a timestamp at which the lease was relinquished (i.e. it is safe for someone else to force-request a lease with that start time); this timestamp also doubles as time the corruption was detected.
- non-corrupted Replicas observing the signal will request a lease starting at the safe timestamp. (We may need a "force" flag here to allow such leases to go through safely as they may ordinarily be refused).
- The replicate queue on the new leaseholder will react to the gossiped signal by rebalancing off the corrupt replica (overriding any concerns around constraints and such). This can probably be achieved by treating this replica like one sitting on a decommissioning and dead node (but need to double check - the allocator is very averse to pure downreplication but this is what may be required here at least in some scenarios; might be easier to make this a separate code path to keep things simple).
- replicaGC on the corrupt replica will remove the replica. The gossip signal thus stops.
- Since the gossip signal has stopped, upreplication will occur, perhaps to the same store, or perhaps to a different one.
- Range is healthy again.
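A minimal sketch of the Replica-side marking referenced in the list above (the field and methods are hypothetical, and the sentinel stands in for `pebble.ErrCorruption`):

```go
package sketch

import (
	"errors"
	"sync/atomic"
)

// errCorruption stands in for pebble.ErrCorruption referenced above.
var errCorruption = errors.New("pebble: corruption")

// replica sketches the proposed Replica-level corruption state: once marked,
// the replica rejects reads and writes (so another replica can safely request
// the lease), while replicaGC is still allowed to remove it.
type replica struct {
	corrupt atomic.Bool
}

// maybeMarkCorrupt is called wherever Replica work surfaces a storage error.
func (r *replica) maybeMarkCorrupt(err error) {
	if errors.Is(err, errCorruption) {
		r.corrupt.Store(true)
	}
}

// checkCanServe gates the request path.
func (r *replica) checkCanServe() error {
	if r.corrupt.Load() {
		return errors.New("replica marked corrupt; not serving requests")
	}
	return nil
}
```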
So we need
- changes in the lease code (allow overrides)
- concept of replica corruption & making all code paths safe
- replicate queue changes
- pebble changes (corruption callback + allowing corrupted SSTs to be compacted away)
Also, as an aside, replicaGC currently does not always place a tombstone. It only does that when there seems to be a lot to delete; note the `false` passed for `mustUseClearRange`:
cockroach/pkg/kv/kvserver/replica_destroy.go, lines 134 to 153 in 3357d2e:

```go
// destroyRaftMuLocked deletes data associated with a replica, leaving a
// tombstone. The Replica may not be initialized in which case only the
// range ID local data is removed.
func (r *Replica) destroyRaftMuLocked(ctx context.Context, nextReplicaID roachpb.ReplicaID) error {
	startTime := timeutil.Now()

	ms := r.GetMVCCStats()
	batch := r.Engine().NewUnindexedBatch(true /* writeOnly */)
	defer batch.Close()
	clearRangeIDLocalOnly := !r.IsInitialized()
	if err := r.preDestroyRaftMuLocked(
		ctx,
		r.Engine(),
		batch,
		nextReplicaID,
		clearRangeIDLocalOnly,
		false, /* mustUseClearRange */
	); err != nil {
		return err
	}
```
So it seems that we must change that too (a tiny change at the KV layer), at least for corrupted replicas, but better to change it in generality then. I think we want that anyway.
I think this is all workable. Eyeballing it I think most of the work is really at the KV layer, so KV should be more involved with the design. On the execution side as well, I don't think the KV-side changes are particularly easy (& they touch all of the dangerous bits), so it seems unlikely that they can be carried out without significant KV involvement.
Finally, with respect to motivation, is there a more compelling argument than #57934? Sure, this does seem to come up a lot, but ... does it come up with our customers as well? I always wonder if most of the sentry reports of this kind that we get are kooky set-ups that wouldn't be much better off even with this new functionality. I am somewhat excited about what's proposed here, but the opportunity & ongoing maintenance/complexity cost for the KV team is very real and so the motivation should fully reflect what we know about how much of an issue SST corruption is in practice.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell, @itsbilal, @jbowens, @joshimhoff, and @petermattis)
Force-pushed from 5ae50ba to 2c56d6c.
@tbg Thanks so much for adding colour about the missed bits about lease transfers and the importance of getting it right. It does look like the RFC in its current state undersells the KV-side complexities of this project. I've incorporated your list of "what needs to happen" into the RFC to at least make it a source of truth for how far along the conversation has gotten.
About the question on motivation, it's not too late to bring that into question - the motivation is mostly #57934, as well as theoretical concerns from some customers who would like to see us work better to address availability under situations like these. In storage planning for 21.2 we decided to deprioritize work on this project beyond just getting this RFC in good shape. I've made edits to bring in the points I had missed earlier that you brought up. Thanks again!
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell, @itsbilal, @jbowens, @joshimhoff, and @petermattis)
docs/RFCS/20210730_replica_corruption_recovery.md, line 60 at r1 (raw file):
Previously, sumeerbhola wrote…
Not sure if I fully see why we need the ignorable spans in the first place.
The first bullet in my previous comment is an annoyance when working with range tombstones. But I suspect it is not a blocker -- I tried briefly to see if we could construct a pathological scenario where no compaction progress could be made because of tombstones being split, and could not, because I don't think one can create a graph cycle.
The third bullet is I think still a problem. If tombstones are split and the range has been added again, I think `mergingIter` could land in the middle of the corruption when a client is iterating using an `Iterator`. Worth confirming that I am not mistaken.
Regarding sequence number zeroing being a complication wrt externalized ignorable spans, I think we can adjust the `compaction.allowedZeroSeqNum` logic to account for this.
It is of course better if we don't need to introduce another abstraction. But I am somewhat worried that we will complicate other things by trying to do this solely via range tombstones, like the (a) iterator code, (b) code to try to avoid retrying failed compactions, even though the file ought to be able to participate in compactions due to range tombstones having been added, (c) code to successfully do multi-file compactions where one or more files have some corrupt data (this may overlap with (a)). With ignorable spans the main benefit would be that as soon as a corrupt file's corrupted spans were marked as ignored we could put it first in the list of compactions as a single file rewrite (which I realize we don't currently have). The tradeoffs here need some more thought, and I am not actually sure which is better.
btw, how are we going to skip corrupt blocks when using `compactionIter`? `compactionIter` does not utilize the tombstone based seeking logic that is in `mergingIter` because it needs to write out the range tombstones too. So it is iterating over all points and calling `i.rangeDelFrag.Deleted`, which means it will encounter the corrupt blocks.
I wonder whether we are over-engineering this. If sstables are equally likely to be corrupt, then the probability of it being in L5 or L6 is 99%, which means the key span is not going to be super wide (for any large-ish LSM). We could declare the whole file's width as corrupt, and then later simply discard it by installing a new version.
I am somewhat convinced that the seqnum-based ratcheting approach that Jackson highlighted would keep us away from cyclical compactions and all, and fragmentation of range tombstones alone wouldn't be enough to cause it because the corrupt spans would at most be spanning one file.
But yes, I agree that there'll be a significant lift required in iterators (especially `compactionIter` / `mergingIter`) to gracefully skip over corrupt blocks covered by range tombstones. It could add a lot of complexity; my code reading convinced me it was still worthwhile to reuse the abstraction as the extra complexity wasn't significant enough, but if we're not narrowing corrupt bounds beyond a file's bounds anyway, and we're just deleting an entire file in a `Version`, then a new operation should be straightforward enough to be the preferred approach. I've spelled this out in the alternatives below, but I'm open to making it the main solution and moving it up over here.
Force-pushed from 2c56d6c to b20cdb0.
This PR proposes a design for responding to sstable
corruption by marking matching replicas as corrupt
and letting Cockroach replication delete / replace
them with data from other nodes.
Informs #67568.
Release note: None.