kvserver: use narrow checkpoints in consistency checker #95963
Conversation
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
Would be worth having an integration test for this.
PPS: We should test that we can open the narrow checkpoint just like any other Pebble database as well, e.g. by running various […]
Previously, erikgrinaker (Erik Grinaker) wrote…
Another vote for a single span over the ranges. Note that the narrow checkpoints only exclude SST files that are fully outside of the spans; they do not "cut up" SST files that intersect the spans (and they don't trim WALs). So if we have ranges […] The exception of course would be if we want to also include something from the very beginning of the keyspace (the meta ranges?).
pkg/kv/kvserver/store.go
line 2892 at r1 (raw file):
The exception of course would be if we want to also include something from the very beginning of the keyspace (the meta ranges?).
The meta ranges likely won't be on this node in larger clusters, and it isn't clear that they'd be helpful anyway. But I do think we want to get the range-local keyspans, both replicated and unreplicated, and those will have to be separate spans since they're prefixed by range ID.
This might be confusing, as you say, since we'll see various parts of the surrounding ranges at various LSM levels, but I think in almost all cases we'll only be interested in the user keyspan anyway.
pkg/kv/kvserver/store.go
line 2892 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
The exception of course would be if we want to also include something from the very beginning of the keyspace (the meta ranges?).
The meta ranges likely won't be on this node in larger clusters, and it isn't clear that they'd be helpful anyway. But I do think we want to get the range-local keyspans, both replicated and unreplicated, and those will have to be separate spans since they're prefixed by range ID.
This might be confusing, as you say, since we'll see various parts of the surrounding ranges at various LSM levels, but I think in almost all cases we'll only be interested in the user keyspan anyway.
I'm also inclined to make it a single span. One complication here is that one range already consists of multiple spans. See rditer.makeAllKeySpans; I counted 6 spans. Now, if we have 3 ranges with 6 spans each, we can either use the 18 spans as they are, or merge them into 6 wide spans. I'm in favour of the latter approach, but it looks like we would need cross-rangeID span helpers for that.
pkg/kv/kvserver/store.go
line 2892 at r1 (raw file):
Previously, pavelkalinnikov (Pavel Kalinnikov) wrote…
I'm also inclined to make it a single span. One complication here is that one range already consists of multiple spans. See rditer.makeAllKeySpans; I counted 6 spans. Now, if we have 3 ranges with 6 spans each, we can either use the 18 spans as they are, or merge them into 6 wide spans. I'm in favour of the latter approach, but it looks like we would need cross-rangeID span helpers for that.
@tbg Do you think we could adapt rditer/select.go to work across range IDs?
Right now we have:
func Select(rangeID roachpb.RangeID, opts SelectOpts) []roachpb.Span
But it seems easy to extend it to:
func SelectMulti(begin, end roachpb.RangeID, opts SelectOpts) []roachpb.Span
It would use the starting points of the begin range, and the end points of the end range. A little complication is that the caller must correctly compute the user keys span in opts. For a single range it's easy: take the range descriptor's span; for the multi-range case the caller should take the first key of the left range and the end key of the last range.
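For illustration, here is a hypothetical sketch of the kind of cross-rangeID helper this would need. The helper itself does not exist; it leans on the existing keys.MakeRangeIDPrefix for the rangeID-local prefix and assumes the usual pkg/keys and pkg/roachpb imports.

```go
// Hypothetical helper (not part of rditer or keys): one wide span covering the
// rangeID-local keyspace, replicated and unreplicated, of every range whose ID
// falls in [begin, end].
func rangeIDLocalSpanAcross(begin, end roachpb.RangeID) roachpb.Span {
	return roachpb.Span{
		Key:    keys.MakeRangeIDPrefix(begin),
		EndKey: keys.MakeRangeIDPrefix(end + 1),
	}
}
```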
pkg/kv/kvserver/store.go
line 2892 at r1 (raw file):
Previously, pavelkalinnikov (Pavel Kalinnikov) wrote…
@tbg Do you think we could adapt rditer/select.go to work across range IDs? Right now we have:
func Select(rangeID roachpb.RangeID, opts SelectOpts) []roachpb.Span
But it seems easy to extend it to:
func SelectMulti(begin, end roachpb.RangeID, opts SelectOpts) []roachpb.Span
It would use the starting points of the begin range, and the end points of the end range. A little complication is that the caller must correctly compute the user keys span in opts. For a single range it's easy: take the range descriptor's span; for the multi-range case the caller should take the first key of the left range and the end key of the last range.
My mistake, I think we can't assume that consecutive ranges according to the key are consecutive w.r.t. range IDs. So, for all rangeID-based keys, we need to take all the spans as is. It's only the user key spans that we can (and should) merge.
pkg/kv/kvserver/store.go
line 2892 at r1 (raw file):
Previously, pavelkalinnikov (Pavel Kalinnikov) wrote…
My mistake, I think we can't assume that consecutive ranges according to the key are consecutive w.r.t. range IDs. So, for all rangeID-based keys, we need to take all the spans as is. It's only the user key spans that we can (and should) merge.
Hm, but then we're limited to only finding bugs that affect the user keyspace. What if the bug is in the rangeID-based space? Wouldn't we want the LSMs touching neighbouring range IDs included as well?
@erikgrinaker Do you remember which part of the keyspace contained that Pebble bug? Could it manifest itself in other parts?
pkg/kv/kvserver/store.go
line 2892 at r1 (raw file):
My mistake, I think we can't assume that consecutive ranges according to the key are consecutive w.r.t. range IDs. So, for all rangeID-based keys, we need to take all the spans as is. It's only the user key spans that we can (and should) merge.
Correct, they'll often not have consecutive range IDs.
What if the bug is in the rangeID-based space? Wouldn't we want the LSMs touching neighbouring range IDs included as well?
Yeah, may as well grab it just in case. For the range-local keyspace let's extend the spans to the neighbouring ranges by range ID, while for the user keyspace we'll extend it to the neighbouring ranges by key.
One could argue that we should also grab the range-local spans for the neighbouring ranges by user key, but I don't think we need to overcomplicate this. I think what we mostly care about is the inconsistent range itself and the surrounding LSM contents, and so let's just grab that.
Do you remember which part of the keyspace contained that Pebble bug? Could it manifest itself in other parts?
The user keyspace.
It could manifest in the range-local keyspace too, since we do use Pebble range tombstones there, but I think they're much less likely to span ranges given the structure of the keyspace there (it typically has many different prefixes for different types of data).
There's a roachtest/inconsistency roachtest that we might want to spruce up a bit.
pkg/kv/kvserver/replica_consistency.go
line 778 at r2 (raw file):
compacted, which leads to checkpoints becoming a full copy of a past state. This is not time critical because the checkpoints are localized to a single / few range(s), and over time may grow to consume only O(GB) disk space. There is no […]
The range size is configurable, and we're also starting to talk about larger range sizes. You could make this a bit more general and say that it's O(RangeSize).
pkg/kv/kvserver/store.go
line 2892 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
My mistake, I think we can't assume that consecutive ranges according to the key are consecutive w.r.t. range IDs. So, for all rangeID-based keys, we need to take all the spans as is. It's only the user key spans that we can (and should) merge.
Correct, they'll often not have consecutive range IDs.
What if the bug is in the rangeID-based space? Wouldn't we want the LSMs touching neighbouring range IDs included as well?
Yeah, may as well grab it just in case. For the range-local keyspace let's extend the spans to the neighbouring ranges by range ID, while for the user keyspace we'll extend it to the neighbouring ranges by key.
One could argue that we should also grab the range-local spans for the neighbouring ranges by user key, but I don't think we need to overcomplicate this. I think what we mostly care about is the inconsistent range itself and the surrounding LSM contents, and so let's just grab that.
Do you remember which part of the keyspace contained that Pebble bug? Could it manifest itself in other parts?
The user keyspace.
It could manifest in the range-local keyspace too, since we do use Pebble range tombstones there, but I think they're much less likely to span ranges given the structure of the keyspace there (it typically has many different prefixes for different types of data).
Determining a single RSpan over which to checkpoint the key-prefixed data makes sense to me. Space that has a placeholder on it ought to be empty anyway. So I think we should:
- find the left and right neighboring ranges, note the left start key, the right end key, and both range IDs
- rspan = [left start key, right end key)
- spans = append(rditer.Select(rangeID, ReplicatedBySpan: rspan, ReplicatedByRangeID: true), rditer.Select(leftRangeID, ReplicatedByRangeID: true), rditer.Select(rightRangeID, ReplicatedByRangeID: true))
In words, we pretend that our range has the wider key span and checkpoint that (this is a couple of spans under the hood due to local-key and local-rangeID addressed keys), but in addition include the two replicated rangeID-based key spans for the left and right neighbor.
For debugging purposes we should also throw in the rangeID-unreplicated key spans, i.e. set the corresponding bool to true in the spans = line above for all Select invocations.
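Spelled out a bit more as a Go sketch (not the PR's actual code; the SelectOpts field names follow the pseudocode above, with UnreplicatedByRangeID assumed as the name of "the corresponding bool", and the helper name is made up):

```go
// Sketch of the selection described above: widen the inconsistent range's user
// keyspan to [left start key, right end key), and additionally grab the
// rangeID-local spans of the two by-key neighbours.
func widenedCheckpointSpans(
	rangeID, leftID, rightID roachpb.RangeID, leftStart, rightEnd roachpb.RKey,
) []roachpb.Span {
	// Pretend our range covers the widened span; this yields the key-addressed
	// spans for that RSpan (range-local by key, lock table, user keys) plus our
	// own rangeID-local spans, replicated and unreplicated.
	spans := rditer.Select(rangeID, rditer.SelectOpts{
		ReplicatedBySpan:      roachpb.RSpan{Key: leftStart, EndKey: rightEnd},
		ReplicatedByRangeID:   true,
		UnreplicatedByRangeID: true, // assumed field name for the unreplicated bool
	})
	// Also include the rangeID-local spans of the left and right neighbours.
	for _, id := range []roachpb.RangeID{leftID, rightID} {
		spans = append(spans, rditer.Select(id, rditer.SelectOpts{
			ReplicatedByRangeID:   true,
			UnreplicatedByRangeID: true,
		})...)
	}
	return spans
}
```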
pkg/kv/kvserver/store.go
line 2911 at r2 (raw file):
spans = append(spans, span)
}
if err := s.engine.CreateCheckpoint(checkpointDir, spans); err != nil {
It would be good to mark the checkpoint as partial and perhaps encode (in a text file?) what the start and end key and rangeIDs were that are guaranteed to be covered.
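A minimal sketch of what such a marker could look like (the helper name and format are made up; the checkpoint.txt filename matches what the roachtest below ends up checking; assumes the usual bytes/fmt/os/filepath and roachpb imports):

```go
// Sketch only: write a human-readable marker into the checkpoint directory
// recording that the checkpoint is partial and which spans are guaranteed to
// be covered.
func writePartialCheckpointMarker(dir string, spans []roachpb.Span) error {
	var buf bytes.Buffer
	buf.WriteString("partial checkpoint; guaranteed-covered spans:\n")
	for _, sp := range spans {
		fmt.Fprintf(&buf, "%s\n", sp)
	}
	return os.WriteFile(filepath.Join(dir, "checkpoint.txt"), buf.Bytes(), 0644)
}
```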
pkg/kv/kvserver/store.go
line 2892 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Determining a single RSpan over which to checkpoint the key-prefixed data makes sense to me. Space that has a placeholder on it ought to be empty anyway. So I think we should:
- find the left and right neighboring ranges, note the left start key, the right end key, and both range IDs
- rspan = [left start key, right end key)
- spans = append(rditer.Select(rangeID, ReplicatedBySpan: rspan, ReplicatedByRangeID: true), rditer.Select(leftRangeID, ReplicatedByRangeID: true), rditer.Select(rightRangeID, ReplicatedByRangeID: true))
In words, we pretend that our range has the wider key span and checkpoint that (this is a couple of spans under the hood due to local-key and local-rangeID addressed keys), but in addition include the two replicated rangeID-based key spans for the left and right neighbor.
For debugging purposes we should also throw in the rangeID-unreplicated key spans, i.e. set the corresponding bool to true in the spans = line above for all Select invocations.
Yep. Note that the rangeID neighbours will often be different ranges than the user key neighbours. We don't want to conflate them, since we could otherwise checkpoint the entire range-local span, which can be pretty large. In other words, if we have an inconsistent range with the following surrounding ranges on that node:
- [a-b): range 8
- [d-f): range 5 (inconsistent)
- [k-m): range 2
The node has replicas for ranges 2, 3, 5, 6, 8.
Then we want to checkpoint the user keyspan [a-m), and the range-local keyspans for ranges [3-6].
pkg/kv/kvserver/store.go
line 2892 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Yep. Note that the rangeID neighbours will often be different ranges than the user key neighbours. We don't want to conflate them, since we could otherwise checkpoint the entire range-local span, which can be pretty large. In other words, if we have an inconsistent range with the following surrounding ranges on that node:
- [a-b): range 8
- [d-f): range 5 (inconsistent)
- [k-m): range 2
The node has replicas for ranges 2, 3, 5, 6, 8.
Then we want to checkpoint the user keyspan [a-m), and the range-local keyspans for ranges [3-6].
Taking into account whether the local keyspans are prefixed by key or range ID, of course.
pkg/kv/kvserver/store.go
line 2892 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Taking into account whether the local keyspans are prefixed by key or range ID, of course.
SGTM, thanks for the suggestions. WIP.
pkg/kv/kvserver/store.go
line 2911 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
It would be good to mark the checkpoint as partial and perhaps encode (in a text file?) what the start and end key and rangeIDs were that are guaranteed to be covered.
I agree. I was thinking of at least including this information in the log/prevent-startup-file message. Do you think it should be a separate file though?
pkg/kv/kvserver/store.go
line 2911 at r2 (raw file):
Previously, pavelkalinnikov (Pavel Kalinnikov) wrote…
I agree. I was thinking of at least including this information in the log/prevent-startup-file message. Do you think it should be a separate file though?
+1 to encoding the spans somewhere: the Pebble checkpoint won't encode any information about which spans were checkpointed, and it won't be possible to tell which regions of the keyspace should be consistent. I could imagine looking at one of these partial checkpoints and getting confused because there exist keys outside the checkpointed spans that are inconsistent.
Previously, jbowens (Jackson Owens) wrote…
I think it would be a good idea to include this information in the checkpoint's manifest. And have any tools you use warn you about it if these are set. I'd need to update the manifest format though.
Previously, RaduBerinde wrote…
Filed cockroachdb/pebble#2285
pkg/kv/kvserver/store.go
line 2892 at r1 (raw file):
Previously, pavelkalinnikov (Pavel Kalinnikov) wrote…
Right, this isn't the final version, I just did the easy by-key part for now, and testing things. I will add by-rangeID neighbours shortly.
Added the rangeID neighbours too.
@erikgrinaker @tbg This is ready for another round. I recommend looking commit by commit.
pkg/kv/kvserver/store.go
Outdated
if left != nil {
	userKeys.Key = left.Desc().StartKey
	if id := left.RangeID; id < prevID || id > nextID {
		spans = append(spans, rditer.Select(left.RangeID, byRangeID)...)
s/left.RangeID/id/
pkg/kv/kvserver/store.go
Outdated
if right != nil {
	userKeys.EndKey = right.Desc().EndKey
	if id := right.RangeID; id < prevID || id > nextID {
		spans = append(spans, rditer.Select(right.RangeID, byRangeID)...)
s/right.RangeID/id/
With the exception of the case in which we have neighboring rangeIDs but no neighboring inited ranges. It sounds like we're not collecting the neighboring rangeID spans in the checkpoint then. Either explain it in a comment or fix it, depending on whether it's broken. After that, good to merge.
Do we have coverage that the checkpoint contains ... the right thing? If we passed the wrong spans to Pebble, or Pebble just created an empty checkpoint, would we find out anywhere? If not, let's add... some testing at least. Unit testing of checkpointSpans would seem appropriate, for example, or we could do some additional verification in TestCheckConsistencyInconsistent?
pkg/cmd/roachtest/tests/inconsistency.go
line 145 at r8 (raw file):
// multiple if multiple consistency checks fail in close succession), and
// puts spans information into the checkpoint.txt file in it.
c.Run(ctx, c.Node(n), "find {store-dir}/auxiliary/checkpoints/r1_at_*/checkpoint.txt")
Heh, I thought this was a bogus invocation because the * looks like it would get shell expanded, but I think I added proper escaping here at some point. Still, this would look less weird as
find {store-dir}/auxiliary/checkpoints -name 'checkpoint.txt'
I cede the loss of coverage of the r1_ match.
Also, this would be the other approach:
det, err := c.RunWithDetailsSingleNode()
det.Stdout <-- has the path and can be validated/used for the subsequent cmds
I'm fine with the current code as-is just as long as it works.
pkg/kv/kvserver/store.go
line 2923 at r6 (raw file):
if left != nil {
	userKeys.Key = left.Desc().StartKey
	if id := left.RangeID; id < prevID || id > nextID {
So if we don't hit this condition, we don't collect the neighboring rangeID spans? Ditto for the right branch.
pkg/kv/kvserver/store_replica_btree_test.go
line 126 at r9 (raw file):
{"a", nil, repl2}, {"aa", nil, repl3}, {"b", repl2, repl3},
Nice, thanks for adding the test.
pkg/kv/kvserver/store.go
line 2923 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
So if we don't hit this condition, we don't collect the neighboring rangeID spans? Ditto for the right branch.
We do collect them above where we take by-rangeID spans covering all of [prevID, nextID]. This check here skips adding spans for an ID that is in the already covered [prevID, nextID].
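In other words (an illustration only, reusing left, prevID, nextID, and byRangeID from the quoted diff; the wide-span construction via keys.MakeRangeIDPrefix is assumed, not the PR's exact code):

```go
// The by-rangeID keyspace of every range with an ID in [prevID, nextID] is
// covered up front by one wide span...
spans := []roachpb.Span{{
	Key:    keys.MakeRangeIDPrefix(prevID),
	EndKey: keys.MakeRangeIDPrefix(nextID + 1),
}}
// ...so a by-key neighbour only contributes its own by-rangeID spans when its
// ID falls outside that already-covered interval.
if left != nil {
	if id := left.RangeID; id < prevID || id > nextID {
		spans = append(spans, rditer.Select(id, byRangeID)...)
	}
}
```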
Yep, a unit test for checkpointSpans would be nicest, I think. Will do.
Added TestStoreCheckpointSpans. PTAL.
@erikgrinaker wanna take a look at which spans we checkpoint before I merge? See […]
Yeah, I'll have a look.
LGTM, thanks!
This commit modifies the consistency check failure roachtest to verify that all the nodes of an inconsistent range create a checkpoint that can be opened with tooling. Release note: none Epic: none
bors r=tbg,erikgrinaker
Build succeeded.
This commit modifies the consistency checker to save partial checkpoints
instead of full ones. The partial checkpoints contain the data for the
inconsistent range and its immediate left and right neighbour in the Store.
Only the inconsistent range's Replica is locked at the time of snapshotting the
store before creating the checkpoint; other ranges can be out of sync. They are
checkpointed too, to give more context when e.g. debugging non-trivial
split/merge scenarios.
Release note (ops change): In the rare event of Range inconsistency, the
consistency checker saves a storage checkpoint on each storage the Range
belongs to. Previously, this was a full checkpoint, so its cost could quickly
escalate on the nodes that went on running. This change makes the checkpoints
partial, i.e. they now only contain the relevant range and its neighbours. This
eliminates the time pressure on the cluster operator to remove the checkpoints.
Resolves #90543
Epic: none