kvserver: loosely couple raft log truncation #76215
Conversation
I haven't run or added any tests -- looking for feedback on the approach here.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @nvanbenschoten, and @tbg)
A few separate thoughts for discussion. I've loosely familiarized myself with the diff without going into details. I am still at the stage where I'm trying to understand what our "minimal achievement" is that we need to have in order to continue on the separate raft log next cycle, and am looking for things we can simplify as the branch cut is approaching fast and I am fairly complexity averse at this point.
re: this comment
We do want to continue to have the raft leader guide the truncation
since we do not want either leader or followers to over-truncate,
given our desire to serve snapshots from any replica. In the loosely
coupled approach implemented here, the truncation request that flows
through raft serves as an upper bound on what can be truncated.
I'm not sure this is true. While for #42491 we will indeed have followers send snapshots, this doesn't mean that the raft leader has to guide the truncation. Instead, it means that the "pending snapshot inflight" protection is important. While a snapshot is in flight (from any replica), the leader should not truncate its local log in a way that will prevent the recipient of the snapshot from catching up. In other words, if a snapshot at index N is in flight, the leader should protect log index N. This doesn't mean that the leader has to coordinate truncation! Instead, it means that the leader has to coordinate snapshots, which (to the best of my knowledge) is the plan anyway. The way I think this could work (but I've missed the last meeting about this with @amygao9 and @aayushshah15 and I'll possibly miss the next one too, unfortunately, so maybe either of you could chime in here) is that the leaseholder reaches out to the follower, the follower instantiates a snapshot, sends the index back to the leaseholder, starts streaming the snapshot and returns the error (if any) to the leaseholder (all over the same, streaming, RPC). The leaseholder can thus keep this map up to date much like it does today.
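To make the in-flight-snapshot protection concrete, here is a minimal sketch (hypothetical names, not the actual kvserver code) of the bookkeeping described above: while any snapshot is streaming, its index is held in a map, and a proposed truncation index is clamped so the recipient can still catch up from the log afterwards.
type inFlightSnapshots struct {
  // Hypothetical: snapshot ID -> log index the snapshot was taken at. Entries
  // are added when a snapshot starts streaming and removed when it finishes.
  indexes map[string]uint64
}

// protectedTruncationIndex lowers a proposed truncation index below every
// in-flight snapshot's index, so no recipient is cut off from the log.
func (s *inFlightSnapshots) protectedTruncationIndex(proposed uint64) uint64 {
  for _, idx := range s.indexes {
    if idx < proposed {
      proposed = idx
    }
  }
  return proposed
}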
This isn't to say that I disagree with keeping the leader-mediated approach in this PR. I am all for making the smallest possible change at this point.
I would hope that we can stop tracking the exact size of the log at some point. This was introduced when we still had to include the raft log in snapshots, so there was a significant risk to letting the log grow large. Yet, tracking only on a best-effort basis (add up the sizes of log entries going into the log but never recompute) and resetting that on each truncation somehow wasn't good enough, I'm not quite sure why but I think we ended up in overload regimes where the raft leader/lease would flap around all of the time, which somehow reset this counting (these are faint, possibly incorrect, memories). So, for this PR, it's probably ok if we regress a little bit. A very large raft log poses only two risks that I can think of:
- can lead to out-of-disk scenarios (or bad allocator decisions should the allocator decide to care about disk usage)
- can lead to followers catching up across a very long raft log rather than more efficiently via a snapshot.
For the second bullet, it's not really the size of the raft log that matters, but the length, and this is something we do know precisely at all times.
You try to handle the case in which a follower restarts and loses its pending truncations. It's good to clear out the raft log for a quiesced follower, but having followers take action in the raft log queue seems to be moving us in the wrong direction (we do need to communicate some constraints to the raft leader due to pending snapshots and follower progress, but other than that replicas should feel free to truncate their logs as they deem useful in a purely local way). Instead, could we, eventually, "simply" have followers clear out all entries <= appliedState on server start-up as well as when they quiesce? This doesn't even need to serialize with the raft handling loop as long as the truncations are atomic on disk and respect durability of the state machine. Of course this isn't work that should happen in this PR, but I wonder if we can avoid pulling the raft log queue into things here, and instead do the truncation on startup. This would also inspire a first step towards encapsulating the truncation logic a bit better (pulling it out of the bowels of below-raft Replica but instead calling into it there).
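For illustration, a rough sketch of that startup-truncation idea, with all names hypothetical (the real code would plumb through the store's engine and state loader): on start-up, each replica's durably applied index is read from the state machine, and log entries at or below it are deleted, since the state machine already reflects them.
// truncateLogsOnStartup is a hypothetical helper, not part of this PR.
func truncateLogsOnStartup(
  rangeIDs []int64,
  durableAppliedIndex func(rangeID int64) (uint64, error), // reads the durable RangeAppliedState
  clearLogEntries func(rangeID int64, upThrough uint64) error, // deletes entries <= upThrough
) error {
  for _, id := range rangeIDs {
    idx, err := durableAppliedIndex(id)
    if err != nil {
      return err
    }
    // Entries at or below idx are reproducible from the durable state
    // machine, so deleting them cannot lose information.
    if err := clearLogEntries(id, idx); err != nil {
      return err
    }
  }
  return nil
}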
It would help if we clearly spelled out what the "must-have" for this work is. I think you're trying to unblock the separate raft log by making sure we can introduce the replicas storage "soon", but if there is a cluster setting, are we really saving any time? I assume we are if we are blocking the cluster upgrade on having the setting disabled and flushing all raft logs once (ignoring possibly annoying theoretical problems with a "slow" raft log queue enacting the old behavior after the fact). But I still wonder if this wouldn't all work the same if we went to town on the raft truncations in early 22.2 instead of now.
Might be worth talking through some of this stuff in person next week.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @nvanbenschoten)
Thanks for looking at this! I've added some technical responses below, which we can further discuss in that meeting. Regarding the minimal work needed for v22.1 for ReplicasStorage to be real for v22.2:
I don't think we are disagreeing. Or you are making a subtle point that I am unable to yet understand.
I am not quite convinced that we shouldn't track the log size -- it is useful for observability, metrics, etc., and it seems wrong to carefully maintain MVCCStats (I realize those also help with detecting replica divergence, which is not relevant here) but not the raft log size.
I was simply trying to be extra careful here about not over-truncating, which is why it is conditioned on quiescence and the periodic scan that adds to the raft log queue. Regarding "clear out all entries <= appliedState": the code here is explicitly trying to clear all the raft log entries, not only up to RangeAppliedState.RaftAppliedIndex. This relies for correctness on the fact that such clearing always goes through the pending truncation queue, which ensures the RaftAppliedIndex is durable up to the last raft log entry.
For anyone following this, Sumeer and I met yesterday to discuss. We broadly agreed (but see my conclusion at the end of this comment...) to continue with the PR, perhaps cutting out the attempts to truncate the raft log on quiesced followers because it adds a fair bit of complexity and we don't think we really needed it.
@sumeerbhola I gave this another pass. I wasn't 100% sure what you were looking for at this point (should've asked yesterday...) since the code is still pretty unkempt (and so also hard to review). But I think I've developed some opinions, which are in comments below but also summarized here:
- it's time to move all code to its proper home, give extensive comments, and optimize for readability/reviewability. (Moving it later means a wasted detailed review pass, let's get it out of the way now while I haven't really taken too close a look line-by-line).
- we should not do "apply-style" work outside of the replica but delegate that to the main replica raft handling loop (by invoking it with appropriate parameters). This is a new concept; we haven't done that before. But it should be relatively straightforward; snapshot application comes very close to it conceptually.
But, taking a step back - overall, I am starting to get worried that we won't succeed in landing a clean PR that KV/Repl can be completely happy with in the time we have left until March 15th, when stability starts (and we don't want feature work to leak into the stability period). In addition to this, I will be on vacation Mar 8-15, and out on 2/25 as well as 2/28; plus there is an R&R day on 3/4. Charitably including today, we're left with 11 working days, which includes Fridays. That is too close for comfort, especially with both of us having other duties, some racing the same Mar 15 cutoff. Apologies for not having done this math yesterday; it would've preserved more of my initial concern around the size of the change at hand.
In light of this, I think we should revisit alternatives again. Yesterday we talked about performing the truncation traditionally but in a pre-apply hook, where it can target the raft engine in a separate WriteBatch (for compatibility with ReplicaStorage
). Perhaps we can additionally get away with lowering the truncation index to the currently durable AppliedIndex
, if necessary, which should make this "wholly correct". This seems feasible if we figure out what to do with stats and whether we're okay not truncating by quite as much as the leader wanted (risking there not being another truncation). I would much rather take my chances with that than have no other option than to land the present PR, as we can run experiments, and experience supports the claim that a bit of raft log is not something that customers are typically concerned with (they are concerned with giant raft logs that run them out of space, but that is not the issue we'd be creating here). As for the stats, it might be a little tricky. If we're under-truncating every time, we don't want to mark the stats as wrong every time. We also don't want to recompute in the apply loop every time. We could sidestep the problem by not keeping the stats on followers and by making sure the leader only ever chooses locally-durable indexes.
I think there's something there we should explore and, given the urgency, probably in-"person".
Reviewed 2 of 16 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @nvanbenschoten, and @sumeerbhola)
Touches
Also update the issue with a short synopsis of changes made here (in particular the bit that we're conservative here and really not trying to fundamentally change how it works).
pkg/kv/kvserver/raft_log_queue.go, line 39 at r1 (raw file):
// The safety requirement for truncation is that the entries being truncated
// are already durably applied to the state machine. Initialized replicas may
Might want to spell out in detail why: it is because after a truncation, the only remaining source of the data in the entries is the state machine (which reflects a prefix of the log), so if we ever truncate a log entry that is not durably on the state machine, we risk losing the information in said log entries, which could lead to a scenario in which we violate Raft's leader completeness property: the replica may become leader and be asked for log entries it can no longer provide (neither directly nor through a snapshot).
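To illustrate the invariant with a tiny sketch (a hypothetical helper, not the PR's code): whatever index the leader suggests, a replica only enacts truncation up to its own durably applied index.
// safeTruncationIndex clamps a suggested truncation index to the index that
// is durably applied to the state machine, so the log never develops a gap
// ahead of what the durable state machine can reproduce.
func safeTruncationIndex(suggested, durablyApplied uint64) uint64 {
  if suggested > durablyApplied {
    return durablyApplied
  }
  return suggested
}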
pkg/kv/kvserver/raft_log_queue.go, line 47 at r1 (raw file):
// becomes possible.
//
// An attempt is made to add a replica to the queue under two situations:
Not going to read this one too closely yet because - I think - we've agreed to rip out the second path into the raft log queue (taken on followers) for now.
pkg/kv/kvserver/raft_log_queue.go, line 89 at r1 (raw file):
// (2) are enabled with cluster version LooselyCoupledRaftLogTruncation.
// This is a temporary cluster setting that we will remove after one release
Where did we leave things here in terms of timing?
- 22.1: has the setting
- 22.2: setting is baked in as "true", i.e. if you ran with
false
in the old binary you'll still get the new behavior now (and it is always on, even in mixed cluster). Upon rolling back to 21.1, setting takes hold again, so you might miss some truncations until the setting is updated.
Just to confirm, someone who "permanently" runs a 22.1/22.2 mixed cluster will be fine, right? Old nodes will truncate directly, new nodes will stash the suggestions and execute them based on durability, right?
Write out some of these things on a comment here so that it's clear to everyone what the plan is.
pkg/kv/kvserver/raft_log_queue.go, line 99 at r1 (raw file):
func init() {
  enableLooselyCoupledTruncation.SetVisibility(settings.Reserved)
nit: can also do this
var enableLooselyCoupledTruncation = func() *settings.BoolSetting {
  s := settings.RegisterBoolSetting(
    settings.SystemOnly,
    "kv.raft_log.enable_loosely_coupled_truncation",
    "set to true to loosely couple the raft log truncation.",
    true)
  s.SetVisibility(settings.Reserved)
  return s
}()
I would probably prefer this because it keeps everything in one place but your call.
pkg/kv/kvserver/raft_log_queue.go, line 879 at r1 (raw file):
func (t *truncatorForReplicasInStore) addPendingTruncation(
  ctx context.Context, r *Replica,
As you're working on this stuff, can you make sure that we don't depend on anything directly but only through a narrow abstraction? Also move everything into a new file, say raft_log_truncator.go
.
pkg/kv/kvserver/raft_log_queue.go, line 884 at r1 (raw file):
  raftLogDelta int64, deltaIncludesSideloaded bool,
) {
There sure is a lot of brainmelt spaghetti code in this method. I know this PR is basically in prototype status, but since I am reviewing it now, try to clean it up and significantly expand the commentary.
pkg/kv/kvserver/raft_log_queue.go, line 972 at r1 (raw file):
t.mu.Lock()
defer t.mu.Unlock()
n := len(t.mu.ranges)
It's an anti-pattern to hold on to *Replica
objects (which can be replicagc'ed and instantiated under a new ReplicaID, etc). It is likely better to remember RangeIDs and then retrieve them from the store with the relevant mutexes held so that we serialize cleanly with replicaGC. I see that you're doing basically the right things below so I don't think it is far off, but let's try not to hold on to *Replica
objects while queued either way.
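As an illustration of that pattern (simplified, hypothetical types, not the PR's actual fields): the queue below remembers only range IDs; the *Replica is looked up from the store, under the relevant mutexes, only when the truncator actually acts.
import "sync"

type rangeID int64

type truncatorQueue struct {
  mu      sync.Mutex
  pending map[rangeID]struct{}
}

// enqueue records that a range has pending truncations; no *Replica is retained.
func (q *truncatorQueue) enqueue(id rangeID) {
  q.mu.Lock()
  defer q.mu.Unlock()
  if q.pending == nil {
    q.pending = map[rangeID]struct{}{}
  }
  q.pending[id] = struct{}{}
}

// drain returns and clears the queued range IDs; the caller then re-fetches
// each replica from the store (serializing with replicaGC) before acting.
func (q *truncatorQueue) drain() []rangeID {
  q.mu.Lock()
  defer q.mu.Unlock()
  ids := make([]rangeID, 0, len(q.pending))
  for id := range q.pending {
    ids = append(ids, id)
  }
  q.pending = nil
  return ids
}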
pkg/kv/kvserver/raft_log_queue.go, line 976 at r1 (raw file):
  return
}
ranges = make([]rangeAndReplicaID, n)
Why not "just" make([]rangeAndReplicaID, 0, n)
and then use append
in the loop? The simpler the better. I'd rather review less index arithmetic than more, and this isn't a very hot code path.
pkg/kv/kvserver/raft_log_queue.go, line 1005 at r1 (raw file):
r, err := t.store.GetReplica(repl.rangeID)
if err != nil || r == nil || r.mu.replicaID != repl.replicaID {
  // Not found or not the same replica.
It should be safe if it is not the same replica. If that is not safe, we should understand why. Not holding on to a replica while queued is a good way to avoid questions like this. When the replica switches out (i.e. replicaGC & replaced by new replica or the like) the truncation may no longer apply, but the code should do a "right thing" regardless.
It is sort of unfortunate that we're really hacking into the internals here - holding raftMu, etc, but I see how this is necessary. Try to keep it to a minimum though, for example, instead of reaching into r.mu.state
to update the truncated index, see if you can call the method that applies side effects during raft application with an updated state.
pkg/kv/kvserver/raft_log_queue.go, line 1020 at r1 (raw file):
truncState := *r.mu.state.TruncatedState
for !r.mu.pendingLogTruncations.empty() {
Isn't this going to be an infinite loop sometimes? I see what you're trying to do (discard all of the truncations that don't affect any log entries at this point) but I can't convince myself that this code works as is.
pkg/kv/kvserver/raft_log_queue.go, line 1021 at r1 (raw file):
truncState := *r.mu.state.TruncatedState
for !r.mu.pendingLogTruncations.empty() {
  pendingTrunc := r.mu.pendingLogTruncations.front()
maybe front()
and pop()
could be named better. It's not so clear which ones come first and how they're sorted. I know there are only two, but still.
pkg/kv/kvserver/raft_log_queue.go, line 1026 at r1 (raw file):
  }
}
if r.mu.pendingLogTruncations.empty() {
Isn't this only reached once empty()
returns true?
pkg/kv/kvserver/raft_log_queue.go, line 1040 at r1 (raw file):
}
r.mu.Unlock()
r.raftMu.Unlock()
Randomly unlocking raftMu here seems wrong.
pkg/kv/kvserver/raft_log_queue.go, line 1051 at r1 (raw file):
}
enactIndex := -1
r.mu.pendingLogTruncations.iterate(func(trunc pendingTruncation) {
Same comment as above, this brainmelt code should be cleanly contained somewhere, with nice comments and unit tests, and not littering raft_log_queue.go
. (I know you didn't intend to leave as is, just mentioning it anyway).
In particular, there is something about merging truncations and reasoning about their interplay (also in terms of stats) that could likely be encapsulated better.
pkg/kv/kvserver/raft_log_queue.go, line 1055 at r1 (raw file):
    return
  }
  enactIndex++
That you're tracking the index here seems to point at a need for a better version of iterate
.
pkg/kv/kvserver/raft_log_queue.go, line 1057 at r1 (raw file):
  enactIndex++
})
if enactIndex > 0 {
I'd prefer an early-bail instead of a giant conditional, if possible.
pkg/kv/kvserver/raft_log_queue.go, line 1069 at r1 (raw file):
}
if !apply {
  panic("unexpected !apply returned from handleTruncatedStateBelowRaftPreApply")
We need at least an attempt at proper error handling here, I don't want us to regress in terms of #75944. It's also a good question of how this will interact with the code here. Will "something" have to inspect all of the errors coming out of a replica while raftMu is held? It would be a shame.
I think the entire code here should be delegated to the raft handling goroutine, where in addition to an inSnap
we pass "something" related to truncation.
So something like
cockroach/pkg/kv/kvserver/replica_raft.go
Lines 539 to 541 in c5d146b
func (r *Replica) handleRaftReadyRaftMuLocked(
  ctx context.Context, inSnap IncomingSnapshot,
) (handleRaftReadyStats, string, error) {
which gives us a chance to centralize error handling and piggy-back on the existing mechanisms to apply side effects to the in-memory state (i.e. updating the in-mem truncatedstate).
Drive-by comment since I unfortunately don't have bandwidth to follow this work closely.
I think it's an antipattern to do state machine transitions in the main Raft loop -- there's conceptually no need for it, and it can block other Raft processing causing spurious elections and other issues. See also #17500. We've discussed this briefly before, and maybe I'm misremembering, but I thought we wanted to ultimately separate Raft processing from state machine application? This comment seems to imply the opposite?
There's the short game and the long game. Right now, if you do raft things you have to hold raftMu, and when you change the state you need to synchronize with the in-mem state. The way we do this today is in the raft apply loop. Yes, we ultimately want to decouple those, but it's not going to happen in this PR. My opinion is that we should not be tacking on a rather ugly second moving part that completely serializes with the raft apply loop (holds raftMu throughout) but has to reinvent some of the bits that already exist (and will remain a weird one-off anyway) because it's error-prone and also doesn't get us closer to the goal. We should keep all raft-related processing on the raft goroutine until the day that we are ready to piece them apart, in which case we will separate all raft processing from the state machine application.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the detailed comments!
In light of this, I think we should revisit alternatives again.
I'm going to make an iteration over this code first. It wasn't my intention to produce anything unkempt here (hence the extensive code comment too), even though the PR is marked draft. But hopefully another pass by me will reduce the irritants. Also, the intention was to localize most changes to the new truncator abstraction so it could be tested separately.
we should not do "apply-style" work outside of the replica but delegate that to the main replica raft handling loop (by invoking it with appropriate parameters).
I am confused by this. There is also a more detailed comment on this below. We are doing raft log truncation, and not doing application, so why is this "apply-style" work? If this is about reusing handleRaftReadyRaftMuLocked
or the raft goroutine see my more detailed response below.
We also have very little to synchronize with wrt in-memory state -- only the Replica stats and queued truncations need to be updated.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @nvanbenschoten, @sumeerbhola, and @tbg)
pkg/kv/kvserver/raft_log_queue.go, line 89 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Where did we leave things here in terms of timing?
- 22.1: has the setting
- 22.2: setting is baked in as "true", i.e. if you ran with
false
in the old binary you'll still get the new behavior now (and it is always on, even in mixed cluster). Upon rolling back to 21.1, setting takes hold again, so you might miss some truncations until the setting is updated.
Just to confirm, someone who "permanently" runs a 22.1/22.2 mixed cluster will be fine, right? Old nodes will truncate directly, new nodes will stash the suggestions and execute them based on durability, right?
Write out some of these things on a comment here so that it's clear to everyone what the plan is.
Yes to all.
Will do regarding comment.
pkg/kv/kvserver/raft_log_queue.go, line 244 at r1 (raw file):
firstIndex, err := r.raftFirstIndexLocked()
firstIndex = r.mu.pendingLogTruncations.computePostTruncFirstIndex(firstIndex)
Note that almost all the changes to raftLogQueue
other than these calls to computePostTrunc* and the later setting of ExpectedFirstIndex, will go away.
pkg/kv/kvserver/raft_log_queue.go, line 884 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
There sure is a lot of brainmelt spaghetti code in this method. I know this PR is basically in prototype status but since I am reviewing it now try to clean it up and significantly expand commentary.
I'll add comments.
btw, even though the PR is marked "draft" it was my intention to produce a prototype here, which is why there are comments in various places (though not in this method).
pkg/kv/kvserver/raft_log_queue.go, line 972 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
It's an anti-pattern to hold on to
*Replica
objects (which can be replicagc'ed and instantiated under a new ReplicaID, etc). It is likely better to remember RangeIDs and then retrieve them from the store with the relevant mutexes held so that we serialize cleanly with replicaGC. I see that you're doing basically the right things below so I don't think it is far off, but let's try not to hold on to*Replica
objects while queued either way.
Ack. The truncatorForReplicasInStore
holds onto the (RangeID, ReplicaID) pair, so I think it is "correct" in that way.
pkg/kv/kvserver/raft_log_queue.go, line 1005 at r1 (raw file):
Not holding on to a replica while queued is a good way to avoid questions like this
The code is not holding onto a *Replica
while queued. It is holding onto a (RangeID, ReplicaID) pair. If you are saying we shouldn't bother holding onto a ReplicaID
either, sure, it is not needed for correctness.
pkg/kv/kvserver/raft_log_queue.go, line 1069 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
We need at least an attempt at proper error handling here, I don't want us to regress in terms of #75944. It's also a good question of how this will interact with the code here. Will "something" have to inspect all of the errors coming out of a replica while raftMu is held? It would be a shame.
I think the entire code here should be delegated to the raft handling goroutine, where in addition to an
inSnap
we pass "something" related to truncation.So something like
cockroach/pkg/kv/kvserver/replica_raft.go
Lines 539 to 541 in c5d146b
func (r *Replica) handleRaftReadyRaftMuLocked( ctx context.Context, inSnap IncomingSnapshot, ) (handleRaftReadyStats, string, error) { which gives us a chance to centralize error handling and piggy-back on the existing mechanisms to apply side effects to the in-memory state (i.e. updating the in-mem truncatedstate).
- It was an oversight on my part to panic here. I didn't carefully read the implementation and thought that the only reason to return false was a simple invariant violation.
- I don't understand how we should delegate this to the raft handling goroutine. We already call handleRaftReadyRaftMuLocked outside of the raft goroutine when applying a snapshot, so we haven't centralized that either. To centralize it we would need to somehow hook up into the raftScheduler and processReady. That seems a much wider change.
- Are you suggesting that we do this via another parameter to handleRaftReadyRaftMuLocked to preserve an illusion of centralization? At least there are some shared code paths in the implementation of handleRaftReadyRaftMuLocked for both of its current callers. However, the code below is nothing like what is in handleRaftReadyRaftMuLocked. It borrows more heavily from code that sits in state machine application (which will be removed for v22.2), since we have been doing raft log truncation as part of state machine application.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @nvanbenschoten, and @tbg)
Previously, tbg (Tobias Grieger) wrote…
Touches
Also update the issue with a short synopsis of changes made here (in particular the bit that we're conservative here and really not trying to fundamentally change how it works).
I'd already changed to "Informs", so I didn't change to "Touches". I'll update the issue with a synopsis once this PR merges.
pkg/kv/kvserver/raft_log_queue.go, line 39 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Might want to spell out in detail why, it is because after a truncation, the only remaining source of the data in the entries is state machine (which a prefix of the log), so if we ever truncate a log entry that is not durably on the state machine, we risk losing the information in said log entries which could lead to a scenario in which we violate Raft's leader completeness property: the replica may become leader and be asked for log entries it can no longer provide (neither directly nor through a snapshot).
Done. Instead of phrasing it as the replica not having entries to provide to others, I phrased it as a gap between the last entry applied to the state machine and the first entry in the untruncated log -- which means even this replica can't do any more application.
pkg/kv/kvserver/raft_log_queue.go, line 47 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Not going to read this one too closely yet because - I think - we've agreed to rip out the second path into the raft log queue (taken on followers) for now.
These two situations are unchanged. They were true even before this PR.
I've only removed the phrase "latest applied state was not durable" in the parenthetical comment.
pkg/kv/kvserver/raft_log_queue.go, line 76 at r1 (raw file):
// truncation can be done independently at each replica when the corresponding
// RaftAppliedIndex is durable. Note that since raft state (including truncated
// state) is not part of the state machine, this loose coordination is fine.
I've updated this text to describe what happens when we are not loosely coupled and why that is considered correct (and added a reference to #38566 in that we are not fully certain about its correctness).
pkg/kv/kvserver/raft_log_queue.go, line 78 at r1 (raw file):
// state) is not part of the state machine, this loose coordination is fine.
//
// 2. Proposes "local" truncation
I've removed this item in this form, and added some commentary about the behavior when a node restarts, in that the pending truncations are lost.
pkg/kv/kvserver/raft_log_queue.go, line 89 at r1 (raw file):
Previously, sumeerbhola wrote…
Yes to all.
Will do regarding comment.
Added a comment.
pkg/kv/kvserver/raft_log_queue.go, line 99 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
nit: can also do this
var enableLooselyCoupledTruncation = func() *settings.BoolSetting {
  s := settings.RegisterBoolSetting(
    settings.SystemOnly,
    "kv.raft_log.enable_loosely_coupled_truncation",
    "set to true to loosely couple the raft log truncation.",
    true)
  s.SetVisibility(settings.Reserved)
  return s
}()
I would probably prefer this because it keeps everything in one place but your call.
Good idea. Done.
pkg/kv/kvserver/raft_log_queue.go, line 848 at r1 (raw file):
}

type rangeAndReplicaID struct {
this is gone, since we now only key by RangeID.
pkg/kv/kvserver/raft_log_queue.go, line 879 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
As you're working on this stuff, can you make sure that we don't depend on anything directly but only through a narrow abstraction? Also move everything into a new file, say
raft_log_truncator.go
.
Done.
The narrow abstractions are storeForTruncator
and replicaForTruncator
. They should allow for testing all the code in raft_log_truncator.go without instantiating a Replica or Store.
pkg/kv/kvserver/raft_log_queue.go, line 884 at r1 (raw file):
Previously, sumeerbhola wrote…
I'll add comments.
btw, even though the PR is marked "draft" it was my intention to produce a prototype here, which is why there are comments in various places (though not in this method).
(meant to say "not my intention")
- Added more comments.
- Fixed the iteration to utilize
pendingLogTruncations.iterate
.
pkg/kv/kvserver/raft_log_queue.go, line 976 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Why not "just"
make([]rangeAndReplicaID, 0, n)
and then useappend
in the loop? The simpler the better. I'd rather review less index arithmetic than more, and this isn't a very hot code path.
Done
pkg/kv/kvserver/raft_log_queue.go, line 1005 at r1 (raw file):
The key in the map no longer includes the ReplicaID.
instead of reaching into r.mu.state to update the truncated index, see if you can call the method that applies side effects during raft application with an updated state.
This reaching-into is now done in a narrower and cleaner way with the replicaForTruncator
interface.
pkg/kv/kvserver/raft_log_queue.go, line 1020 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Isn't this going to be an infinite loop sometimes? I see what you're trying to do (discard all of the truncations that don't affect any log entries at this point) but I can't convince myself that this code works as is.
Good catch. This was missing a break.
A unit test should uncover such bugs.
pkg/kv/kvserver/raft_log_queue.go, line 1021 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
maybe
front()
andpop()
could be named better. It's not so clear which ones come first and how they're sorted. I know there are only two, but still.
I've added code comments where they are declared. I am also open to naming suggestions.
pkg/kv/kvserver/raft_log_queue.go, line 1026 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Isn't this only reached once
empty()
returns true?
Added a comment. The aforementioned bug was causing confusion. It is reached once we've removed all the useless/noop truncations.
pkg/kv/kvserver/raft_log_queue.go, line 1040 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Randomly unlocking raftMu here seems wrong.
I've added a comment for this func and renamed it to popAllOnError
.
pkg/kv/kvserver/raft_log_queue.go, line 1051 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Same comment as above, this brainmelt code should be cleanly contained somewhere, with nice comments and unit tests, and not littering
raft_log_queue.go
. (I know you didn't intend to leave as is, just mentioning it anyway).In particular, there is something about merging truncations and reasoning about their interplay (also in terms of stats) that could likely be encapsulated better.
I am hopeful that this revised version will not be a brainmelt. I am happy for further simplification suggestions.
Also, now that this is abstracted from Replica
, it will be easy to unit test thoroughly.
Regarding merging truncations, that happens at queueing time. Here there is (a) the persistent truncation, for which we just need to apply the last enactable truncation (there is now a code comment that explains this), and (b) updating the replica state, which involves applying each.
I mistakenly did not use handleTruncatedStateResult
to update the RaftTruncatedState
and its side-effects. This has been rectified.
Also, the legacy code that calls handleRaftLogDeltaResult
now forwards to setTruncationDeltaAndTrusted
.
pkg/kv/kvserver/raft_log_queue.go, line 1055 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
That you're tracking the index here seems to point at a need for a better version of
iterate
.
iterate
now passes the index.
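For readers following along, a simplified, hypothetical sketch of such a pending-truncations container (the PR's real type also carries size deltas and sideloaded-bytes information): truncations are kept ordered from least to most aggressive, merged at queueing time, and iterated together with their position.
type pendingTruncation struct {
  index uint64 // highest log index removed by this truncation
}

type pendingTruncationsSketch struct {
  truncs []pendingTruncation // ordered by index, ascending
}

func (p *pendingTruncationsSketch) empty() bool { return len(p.truncs) == 0 }

// front returns the least aggressive pending truncation; callers check empty() first.
func (p *pendingTruncationsSketch) front() pendingTruncation { return p.truncs[0] }

// pop removes the least aggressive pending truncation.
func (p *pendingTruncationsSketch) pop() { p.truncs = p.truncs[1:] }

// iterate visits pending truncations from least to most aggressive, passing
// the position so callers need not track an index themselves.
func (p *pendingTruncationsSketch) iterate(f func(i int, t pendingTruncation)) {
  for i, t := range p.truncs {
    f(i, t)
  }
}

// add merges at queueing time: a truncation that is not more aggressive than
// the current back of the queue adds nothing and is dropped.
func (p *pendingTruncationsSketch) add(t pendingTruncation) {
  if !p.empty() && t.index <= p.truncs[len(p.truncs)-1].index {
    return
  }
  p.truncs = append(p.truncs, t)
}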
pkg/kv/kvserver/raft_log_queue.go, line 1057 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
I'd prefer an early-bail instead of a giant conditional, if possible.
Done
pkg/kv/kvserver/raft_log_queue.go, line 1069 at r1 (raw file):
It was an oversight on my part to panic here.
I spoke too soon. The code was panicking if there was no error and !apply was returned. This is an invariant violation since we've already eliminated the noop truncations. But I've changed it to a log statement to be more graceful in case someone changes the implementation of handleTruncatedStateBelowRaftPreApply
.
This is cleaned up and ready for a look.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @nvanbenschoten, and @tbg)
pkg/kv/kvserver/replica_application_state_machine.go, line 763 at r2 (raw file):
if res.State != nil && res.State.TruncatedState != nil {
  var err error
The changes in this file are not significant, but I could use some guidance on how to unit test them.
That was fast, thanks for the quick turnaround! Thanks for pulling out the interface. I am not happy with it yet but now it's clearer what we're up against and I am able to make specific suggestions. I have some meetings coming up so flushing what I have.
I am confused by this. There is also a more detailed comment on this below. We are doing raft log truncation, and not doing application, so why is this "apply-style" work? If this is about reusing handleRaftReadyRaftMuLocked or the raft goroutine see my more detailed response below.
Sorry, it's not about this being "apply" work, but "raft" work. "Raft" work happens in handleRaftReadyRaftMuLocked
. This is where entries get written to the log. It should also be where they get removed from the log. I shouldn't have uttered the word "apply" which has nothing to do with this. We also happen to apply entries in that method, but we all agree that this eventually should be split off, where handleRaftReadyRaftMuLocked
does little more than notify an apply subsystem that there is work to do.
I'm still making up my mind on how hard to push for this item, and will get a second opinion (Erik) as well.
In either case, looking at this PR now (thanks for the cleanups!) my worry from yesterday has abated. I still have to review the nitty-gritty but if we keep up the rate of turnaround and nothing unexpected happens, we should comfortably get that in.
Reviewed 4 of 16 files at r1, 2 of 8 files at r2.
Dismissed @sumeerbhola from 3 discussions.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @nvanbenschoten, @sumeerbhola, and @tbg)
pkg/kv/kvserver/raft_log_queue.go, line 244 at r1 (raw file):
Previously, sumeerbhola wrote…
Note that almost all the changes to
raftLogQueue
other than these calls to computePostTrunc*, and the later setting of ExpectedFirstIndex will go away.
There aren't any other changes, right? The cluster version gate will go away but that's it.
pkg/kv/kvserver/raft_log_queue.go, line 252 at r2 (raw file):
// NB: we need an exclusive lock due to grabbing the first index.
r.mu.Lock()
raftLogSize := r.mu.pendingLogTruncations.computePostTruncLogSize(r.mu.raftLogSize)
Could you update the comment on newTruncateDecision
to give an idea of how pending truncations interact with new truncate decisions? I think what we're doing is to pretend that all pending truncations have happened, and then computing what the next truncation would be.
computePostTrunc*
also need comments.
pkg/kv/kvserver/raft_log_queue.go, line 280 at r2 (raw file):
// those pending truncations would cause a transition from trusted =>
// !trusted. This is done since we don't want to trigger a recomputation of
// the raft log size while we still have pending truncations.
Not triggering recomputations prematurely makes sense, can you elaborate on the consequences of this? There may always be pending truncations, at least in a contrived world, right?
pkg/kv/kvserver/raft_log_truncator.go, line 67 at r2 (raw file):
Maybe there's a way to clarify what "front" means in this, it's the least aggressive truncation, right? So maybe
// Returns the front (i.e. least aggressive truncation) of the pending truncations queue, without removing the element
and similar for pop.
pkg/kv/kvserver/raft_log_truncator.go, line 167 at r2 (raw file):
//
// We acknowledge that this interface may seem peculiar -- this is due to the
// constraint that it is abstracting Replica.
There isn't really such a constraint, and I think it would be instructive to make the interface "what we really want it to be" (i.e. fewer random methods, ideally no need to have methods that expose locks), and for the implementation do this:
type replicaInTruncator Replica
func (ri *replicaInTruncator) SomeGoodMethod() bool {
r := (*Replica)(ri)
// Do things
return true
}
The balance to strike is that there shouldn't be any nontrivial code in methods on replicaInTruncator
, as then it's back to not being able to properly unit test them.
But I hope there is such a balance. I imagine the Store interface would get a method that loads the replica (by RangeID), acquires the raftMu and checks the destroy status before returning. So now the remainder of the truncator can forget about the destroy status. Both callers in this PR that load the replica immediately proceed and check destroy status under the raftMu, so this makes sense. And I don't think they can unlock raftMu and then keep using the replica; after all the replica might be destroyed in the instant after. So we can use an acquire-release pattern here:
repl, ok := storeInterface.AcquireReplica(rangeID) // repl.raftMu
if !ok { continue }
defer repl.Release() // basically unlocks raftMu
The replicaForTruncator methods that are subject to a lock can sit behind this pattern:
type lockedReplicaInTruncator replicaInTruncator // with corresponding interface lockedReplicaInTruncatorI
func (lri *lockedReplicaInTruncator) TouchSomethingThatNeedsLock() {
  lri.mu.foo = "bar"
}
func (ri *replicaInTruncator) WithLock(f func(lockedReplicaInTruncatorI)) {
  ri.mu.Lock()
  defer ri.mu.Unlock()
  f((*lockedReplicaInTruncator)(ri))
}
I'm sure there are a few complications here that I'm not seeing, but I think we can and should strive to make these interfaces pared down and pleasant to use.
pkg/kv/kvserver/replica.go, line 457 at r2 (raw file):
// Writes also require that raftMu be held, which means one of mu or
// raftMu is sufficient for reads.
pendingLogTruncations pendingLogTruncations
Just to play devil's advocate here, is there any reason for this to sit under replica.mu
? We always hold raftMu
when reading or writing this, so why bring replica.mu
into the equation? This could sit under raftMu
it seems and would have a better home there.
pkg/kv/kvserver/replica.go, line 2050 at r2 (raw file):
}

func (r *Replica) setTruncationDeltaAndTrusted(deltaBytes int64, isDeltaTrusted bool) {
We definitely want to avoid breaking our naming conventions for replica methods that hold locks. My suggestion above to clean up the interfaces will achieve that too.
pkg/kv/kvserver/replica.go, line 2097 at r2 (raw file):
_, expectedFirtIndexAccurate := r.handleTruncatedStateResult(
  ctx, trunc, expectedFirstIndexPreTruncation)
return expectedFirtIndexAccurate
First
pkg/kv/kvserver/replica_application_state_machine.go, line 763 at r2 (raw file):
Previously, sumeerbhola wrote…
The changes in this file are not significant, but I could use some guidance on how to unit test them.
We could pull out a free-standing function that takes res
, the cluster settings, b.state.TruncatedState
, a state loader, a WriteBatch, the truncator, the replica through an interface. It will be work, work that's useful but I am hesitant to make you do it in the context of this PR. It would be worth-while, but not if it comes at the cost of polishing other, more important, bits of the PR. So perhaps let's hold off and add it in in a follow-up, or not at all, depending on how things pan out.
pkg/kv/kvserver/replica_application_state_machine.go, line 764 at r2 (raw file):
if res.State != nil && res.State.TruncatedState != nil {
  var err error
  looselyCoupledTruncation := isLooselyCoupledRaftLogTruncationEnabled(ctx, b.r.ClusterSettings())
Checking the cluster version below raft is somewhat atypical. It's ok here because the effects of this command are local, but I wonder if we have to. Isn't it up to the creator of the truncation proposal to decide whether to use the loosely coupled truncations?
pkg/kv/kvserver/replica_application_state_machine.go, line 794 at r2 (raw file):
// we step up as leader at some point in the future, we recompute
// our numbers.
// TODO(sumeer): this code will be deleted when there is no
❤️
pkg/kv/kvserver/replica_sideload_disk.go, line 219 at r2 (raw file):
}

// Helper for truncation or byte calculation for [from, to)
nit: .
pkg/kv/kvserver/replica_sideload_disk.go, line 254 at r2 (raw file):
}
if deletedAll && doTruncate {
Make sure there's a simple unit test for this. Modifying one of the existing ones should be enough.
pkg/kv/kvserver/store.go, line 714 at r2 (raw file):
replicaGCQueue *replicaGCQueue   // Replica GC queue
raftLogQueue   *raftLogQueue     // Raft log truncation queue
raftTruncator  raftLogTruncator  // Enacts truncation
Maybe
// Carries out truncations from the raft log queue when they are safe
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @nvanbenschoten, and @sumeerbhola)
pkg/kv/kvserver/raft_log_queue.go, line 1069 at r1 (raw file):
Previously, sumeerbhola wrote…
It was an oversight on my part to panic here.
I spoke too soon. The code was panicking if there was no error and !apply was returned. This is an invariant violation since we've already eliminated the noop truncations. But I've changed it to a log statement to be more graceful in case someone changes the implementation of
handleTruncatedStateBelowRaftPreApply
.
The concrete proposal is that the callback will become very simple: enqueue all replicas that have pending truncations (not validating for whether they can actually occur now) for a Ready check:
r.store.scheduler.EnqueueRaftReady(rangeID)
Then, in handleRaftReadyRaftMuLocked
:
diff --git a/pkg/kv/kvserver/replica_raft.go b/pkg/kv/kvserver/replica_raft.go
index aabcf3af74..09af6313d6 100644
--- a/pkg/kv/kvserver/replica_raft.go
+++ b/pkg/kv/kvserver/replica_raft.go
@@ -559,6 +559,7 @@ func (r *Replica) handleRaftReadyRaftMuLocked(
raftLogSize := r.mu.raftLogSize
leaderID := r.mu.leaderID
lastLeaderID := leaderID
+ var pendingTruncation interface{}
err := r.withRaftGroupLocked(true, func(raftGroup *raft.RawNode) (bool, error) {
numFlushed, err := r.mu.proposalBuf.FlushLockedWithRaftGroup(ctx, raftGroup)
if err != nil {
@@ -567,6 +568,9 @@ func (r *Replica) handleRaftReadyRaftMuLocked(
if hasReady = raftGroup.HasReady(); hasReady {
rd = raftGroup.Ready()
}
+ if someCheapCondition {
+ pendingTruncation = &somethingSomething{}
+ }
// We unquiesce if we have a Ready (= there's work to do). We also have
// to unquiesce if we just flushed some proposals but there isn't a
// Ready, which can happen if the proposals got dropped (raft does this
@@ -591,6 +595,9 @@ func (r *Replica) handleRaftReadyRaftMuLocked(
const expl = "while checking raft group for Ready"
return stats, expl, errors.Wrap(err, expl)
}
+ if pendingTruncation != nil {
+ maybeExecutePendingTruncation(...)
+ }
if !hasReady {
// We must update the proposal quota even if we don't have a ready.
// Consider the case when our quota is of size 1 and two out of three
The main reason I'm advocating for this is because I think that way the new code will not cause any issues down the road and will most naturally lend itself to agreeing with any future refactoring we might perform. We're avoiding a one-off code path for log truncations and keeping all mutations to the raft state in one place; we are tracking the overhead of truncations and handling errors coming from them in exactly the same way as we would for any raft command (see here). We would also be able to, if we wanted to, append and truncate the log in a single batch (the append batch is created a bit further down), which might be advantageous (or not, no idea), but even if we don't do that it just seems right to have these pieces of code close to each other. It is true that in theory we could try to do some of the log truncation work completely out of band (bump the truncated state first, then delete the old entries later) but this is similar to trying to separate entry application from entry commit, i.e. something to consider later (besides, we're holding raftMu all of the time, so there is no concurrency between raft handling and truncations as is anyway). In short, I think some variant of what I'm proposing (callback triggers a raft handle loop which does the rest) makes sense with our current architecture and won't get in the way of future changes.
There is a little bit of risk to going through the scheduler (maybe in adding extra scheduling steps, we are somehow starving other raft groups of processing) so we could entertain bypassing the scheduler by instead calling store.processReady(rangeID)
directly. But we should avoid blocking in the pebble durability callback, so it would really be "spawn a task that calls processReady but make sure there's only ever one task per rangeID so they don't pile up", really it starts looking a lot like rebuilding the scheduler already, and so I think we should go through the scheduler and just make sure that that's ok (as I fully expect it to be).
pkg/kv/kvserver/raft_log_queue.go, line 37 at r2 (raw file):
// Overview of Raft log truncation:
// Might want to spell out in detail why, it is because after a truncation,
I think you meant to remove this entire paragraph.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @nvanbenschoten, @sumeerbhola, and @tbg)
pkg/kv/kvserver/raft_log_queue.go, line 244 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
There aren't any other changes, right? The cluster version gate will go away but that's it.
Correct -- what is now in this file is what is remaining.
pkg/kv/kvserver/raft_log_queue.go, line 1069 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
The concrete proposal is that the callback will become very simple: enqueue all replicas that have pending truncations (not validating for whether they can actually occur now) for a Ready check:
r.store.scheduler.EnqueueRaftReady(rangeID)Then, in
handlehandleRaftReadyRaftMuLocked
:diff --git a/pkg/kv/kvserver/replica_raft.go b/pkg/kv/kvserver/replica_raft.go index aabcf3af74..09af6313d6 100644 --- a/pkg/kv/kvserver/replica_raft.go +++ b/pkg/kv/kvserver/replica_raft.go @@ -559,6 +559,7 @@ func (r *Replica) handleRaftReadyRaftMuLocked( raftLogSize := r.mu.raftLogSize leaderID := r.mu.leaderID lastLeaderID := leaderID + var pendingTruncation interface{} err := r.withRaftGroupLocked(true, func(raftGroup *raft.RawNode) (bool, error) { numFlushed, err := r.mu.proposalBuf.FlushLockedWithRaftGroup(ctx, raftGroup) if err != nil { @@ -567,6 +568,9 @@ func (r *Replica) handleRaftReadyRaftMuLocked( if hasReady = raftGroup.HasReady(); hasReady { rd = raftGroup.Ready() } + if someCheapCondition { + pendingTruncation = &somethingSomething{} + } // We unquiesce if we have a Ready (= there's work to do). We also have // to unquiesce if we just flushed some proposals but there isn't a // Ready, which can happen if the proposals got dropped (raft does this @@ -591,6 +595,9 @@ func (r *Replica) handleRaftReadyRaftMuLocked( const expl = "while checking raft group for Ready" return stats, expl, errors.Wrap(err, expl) } + if pendingTruncation != nil { + maybeExecutePendingTruncation(...) + } if !hasReady { // We must update the proposal quota even if we don't have a ready. // Consider the case when our quota is of size 1 and two out of threeThe main reason I'm advocating for this is because I think that way the new code will not cause any issues down the road and will most naturally lend itself to agreeing with any future refactoring we might perform. We're avoiding a one-off code path for log truncations and keeping all mutations to the raft state in one place; we are tracking the overhead of truncations and handling errors coming from them in exactly the same way as we would for any raft command (see here). We would also be able to, if we wanted to, append and truncate the log in a single batch (the append batch is created a bit further down), which might be advantageous (or not, no idea), but even if we don't do that it just seems right to have these pieces of code close to each other. It is true that in theory we could try to do some of the log truncation work completely out of band (bump the truncated state first, then delete the old entries later) but this is similar to trying to separate entry application from entry commit, i.e. something to consider later (besides, we're holding raftMu all of the time, so there is no concurrency between raft handling and truncations as is anyway). In short, I think some variant of what I'm proposing (callback triggers a raft handle loop which does the rest) makes sense with our current architecture and won't get in the way of future changes.
There is a little bit of risk to going through the scheduler (maybe in adding extra scheduling steps, we are somehow starving other raft groups of processing) so we could entertain bypassing the scheduler by instead calling
store.processReady(rangeID)
directly. But we should avoid blocking in the pebble durability callback, so it would really be "spawn a task that calls processReady but make sure there's only ever one task per rangeID so they don't pile up", really it starts looking a lot like rebuilding the scheduler already, and so I think we should go through the scheduler and just make sure that that's ok (as I fully expect it to be).
Regarding not blocking in the Pebble durability callback, I was planning to add something in the truncator which starts a goroutine if none is running.
Unlike the raft scheduler, there would be one goroutine doing the truncation across all replicas, which has pros and cons.
Pros: we share the same durability reader across the replicas; it is clear that now is the time to check the durable RaftAppliedIndex; there is at most one goroutine doing this, so the resource consumption is constrained.
Con: there is at most one goroutine doing this, and what if it falls behind.
If we move this to the scheduler, we need some way for the ready loop to know that a durability callback has not happened, so it shouldn't bother doing a read to try to dequeue pending truncations. This is definitely doable.
I don't have a good sense of the surrounding code, so I'm happy to go with whatever you decide.
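For what it's worth, a sketch of the "at most one goroutine, started from the durability callback" idea (hypothetical names; the callback itself never blocks on truncation work): the callback only flips flags under a mutex, and a single worker loops until no further durability advances arrived while it was running.
import "sync"

type truncatorRunner struct {
  mu      sync.Mutex
  running bool // a worker goroutine is currently active
  pending bool // a durability advance arrived while the worker was running
}

func (t *truncatorRunner) onDurabilityAdvanced(doTruncations func()) {
  t.mu.Lock()
  if t.running {
    t.pending = true // the active worker will loop around and pick this up
    t.mu.Unlock()
    return
  }
  t.running = true
  t.mu.Unlock()
  go func() {
    for {
      doTruncations() // walk replicas with pending truncations, enact the durable ones
      t.mu.Lock()
      if !t.pending {
        t.running = false
        t.mu.Unlock()
        return
      }
      t.pending = false
      t.mu.Unlock()
    }
  }()
}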
pkg/kv/kvserver/raft_log_queue.go, line 37 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
I think you meant to remove this entire paragraph.
Done
pkg/kv/kvserver/raft_log_queue.go, line 252 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Could you update the comment on
newTruncateDecision
to give an idea of how pending truncations interact with new truncate decisions? I think what we're doing is to pretend that all pending truncations have happened, and then computing what the next truncation would be.
computePostTrunc*
also need comments.
Done
pkg/kv/kvserver/raft_log_queue.go, line 280 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Not triggering recomputations prematurely makes sense, can you elaborate on the consequences of this? There may always be pending truncations, at least in a contrived world, right?
I've added the following to the comment.
// Note that as
// soon as those pending truncations are enacted r.mu.raftLogSizeTrusted
// will become false and we will recompute the size -- so this cannot cause
// an indefinite delay in recomputation.
I think we can live with a transient under or over estimate.
pkg/kv/kvserver/raft_log_truncator.go, line 67 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Maybe there's a way to clarify what "front" means in this, it's the least aggressive truncation, right? So maybe
// Returns the front (i.e. least aggressive truncation) of the pending truncations queue, without removing the element
, and similar for pop.
Done
pkg/kv/kvserver/raft_log_truncator.go, line 167 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
There isn't really such a constraint, and I think it would be instructive to make the interface "what we really want it to be" (i.e. fewer random methods, ideally no need to have methods that expose locks), and for the implementation do this:
type replicaInTruncator Replica func (ri *replicaInTruncator) SomeGoodMethod() bool { r := (*Replica)(ri) // Do things return true }The balance to strike is that there shouldn't be any nontrivial code in methods on
replicaInTruncator
, as then it's back to not being able to properly unit test them.But I hope there is such a balance. I imagine the Store interface would get a method that loads the replica (by RangeID), acquires the raftMu and checks the destroy status before returning. So now the remainder of the truncator can forget about the destroy status. Both callers in this PR that load the replica immediately proceed and check destroy status under the raftMu, so this makes sense. And I don't think they can unlock raftMu and then keep using the replica; after all the replica might be destroyed in the instant after. So we can use an acquire-release pattern here:
repl, ok := storeInterface.AcquireReplica(rangeID) // repl.raftMu if !ok { continue } defer repl.Release() // basically unlocks raftMuThe replicaForTruncator methods that are subject to a lock can sit behind this pattern:
type lockedReplicaInTruncator replicaInTruncator // with corresponding interface lockedReplicaInTruncatorI func (lri *lockedReplicaInTruncator) TouchSomethingThatNeedsLock() { r.mu.foo = "bar" } func (ri *replicaInTruncator) WithLock(f func(lockedReplicaInTruncatorI)) { ri.mu.Lock() defer ri.mu.Unlock() f(lri(ri)) }I'm sure there are a few complications here that I'm not seeing, but I think we can and should strive to make these interfaces pared down and pleasant to use.
The storeForTruncator interface change idea is nice. I've made that change. It removed the explicit locking of raftMu which simplified the code.
There is still explicit locking of Replica.mu -- I don't see how to avoid this given that we are reading many things while holding that lock.
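To make that acquire/release shape concrete, here is a minimal, self-contained sketch of the pattern discussed above; the names (`storeForTruncator`, `acquireReplicaForTruncator`, `release`) and the types are illustrative stand-ins, not the PR's actual interfaces:

```go
package main

import (
	"fmt"
	"sync"
)

// rangeID and replica are simplified stand-ins for the real kvserver types.
type rangeID int

type replica struct {
	raftMu    sync.Mutex
	destroyed bool
	rangeID   rangeID
}

// storeForTruncator is a pared-down version of the idea: it hands back a
// replica with raftMu already held, or reports that the replica is
// gone/destroyed, so the truncator never has to check destroy status itself.
type storeForTruncator interface {
	acquireReplicaForTruncator(id rangeID) (*replica, bool)
}

type store struct {
	mu       sync.Mutex
	replicas map[rangeID]*replica
}

var _ storeForTruncator = (*store)(nil)

func (s *store) acquireReplicaForTruncator(id rangeID) (*replica, bool) {
	s.mu.Lock()
	r, ok := s.replicas[id]
	s.mu.Unlock()
	if !ok {
		return nil, false
	}
	r.raftMu.Lock() // held for the lifetime of the acquisition
	if r.destroyed {
		r.raftMu.Unlock()
		return nil, false
	}
	return r, true
}

// release undoes acquireReplicaForTruncator.
func (r *replica) release() { r.raftMu.Unlock() }

func main() {
	s := &store{replicas: map[rangeID]*replica{1: {rangeID: 1}}}
	if r, ok := s.acquireReplicaForTruncator(1); ok {
		defer r.release()
		fmt.Println("holding raftMu for range", r.rangeID)
	}
}
```

The point of the shape is that the truncator only ever sees a replica with raftMu held, and the destroy-status check lives in a single place.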
pkg/kv/kvserver/replica.go, line 457 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Just to play devil's advocate here, is there any reason for this to sit under `replica.mu`? We always hold `raftMu` when reading or writing this, so why bring `replica.mu` into the equation? This could sit under `raftMu` it seems and would have a better home there.

We also read it in `newTruncateDecision`, where currently we only hold `Replica.mu`. It seemed awkward to disassociate this from the `raftLogSize` and the truncated state, which it adjusts, which sit under `Replica.mu`. I could change that code to also acquire `raftMu` if you think that is better.
pkg/kv/kvserver/replica.go, line 2050 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
We definitely want to avoid breaking our naming conventions for replica methods that hold locks. My suggestion above to clean up the interfaces will achieve that too.
I added the `Locked` suffix to two of the methods.

The `getPendingTruncs` situation is a bit odd since it doesn't need the lock for reads. And if we are going to write the `*pendingTruncs` later, we will be calling methods on `pendingLogTruncations`, which has no notion of concurrency control, so those methods can't say "Locked".
pkg/kv/kvserver/replica.go, line 2097 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
First
Done
pkg/kv/kvserver/replica_application_state_machine.go, line 764 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Checking the cluster version below raft is somewhat atypical. It's ok here because the effects of this command are local, but I wonder if we have to. Isn't it up to the creator of the truncation proposal to decide whether to use the loosely coupled truncations?
Currently there isn't a bool in `ReplicaEvalResult` that specifies whether it should be loosely coupled or not. The upstream code looks only at the cluster version, and if it is LooselyCoupledRaftLogTruncation it populates `ExpectedFirstIndex`. The `ExpectedFirstIndex` is useful even for strongly coupled truncation. I could add a bool to `ReplicaEvalResult`, which we would remove after the strongly coupled truncation goes away, or continue to do this. Since we can't fully dictate what the below-raft code does (it could be an old binary), it does seem a little odd to add additional plumbing.
But I don't have a strong opinion.
I've also added a code comment here which includes what you mentioned.
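For illustration only, here is one way such gating can be expressed when keyed off whether the field was populated rather than off a separate bool; the types and function names below are hypothetical stand-ins for `ReplicaEvalResult` and the proposer/apply paths, not the PR's actual plumbing:

```go
package main

import "fmt"

// evalResult is an illustrative stand-in for the relevant slice of ReplicaEvalResult.
type evalResult struct {
	RaftExpectedFirstIndex uint64 // 0 => proposer predates the new behavior
}

// propose populates the field only when the cluster version allows it.
func propose(clusterSupportsLooselyCoupled bool, firstIndex uint64) evalResult {
	var res evalResult
	if clusterSupportsLooselyCoupled {
		res.RaftExpectedFirstIndex = firstIndex
	}
	return res
}

// applyTruncation is the below-raft side: rather than re-checking the cluster
// version, it keys off whether the field was populated by the proposer.
func applyTruncation(res evalResult) {
	if res.RaftExpectedFirstIndex == 0 {
		fmt.Println("old-style proposal: fall back to strongly coupled truncation")
		return
	}
	fmt.Println("expected first index:", res.RaftExpectedFirstIndex)
}

func main() {
	applyTruncation(propose(true, 17))
	applyTruncation(propose(false, 17))
}
```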
pkg/kv/kvserver/replica_sideload_disk.go, line 219 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
nit: .
Done
pkg/kv/kvserver/store.go, line 714 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Maybe `// Carries out truncations from the raft log queue when they are safe`
Done
Reviewed 2 of 16 files at r1, 1 of 8 files at r2, 10 of 14 files at r3, 6 of 6 files at r4, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @sumeerbhola, and @tbg)
pkg/kv/kvserver/raft_log_queue.go, line 111 at r4 (raw file):
```go
// will do strongly coupled truncation and the v22.2 will do loosely
// coupled. This co-existence is correct.
// TODO: flip back to true before merging.
```
Reminder about this TODO.
pkg/kv/kvserver/raft_log_truncator.go, line 61 at r4 (raw file):
```go
func (p *pendingLogTruncations) computePostTruncFirstIndex(firstIndex uint64) uint64 {
	p.iterate(func(_ int, trunc pendingTruncation) {
		if firstIndex < trunc.Index+1 {
```

Consider pulling `trunc.Index + 1` into a method on `pendingTruncation` called `firstIndexAfterApplied` or something like that, with a short comment that reminds readers that `RaftTruncatedState.Index` is inclusive.
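A small sketch of the suggested helper (the PR appears to have settled on the name `firstIndexAfterTrunc`, per later comments); `pendingTruncation` here is a simplified stand-in:

```go
package main

import "fmt"

// pendingTruncation is a simplified stand-in for the PR's type.
type pendingTruncation struct {
	// Index is the highest log index removed by this truncation (inclusive),
	// mirroring RaftTruncatedState.Index.
	Index uint64
}

// firstIndexAfterTrunc returns the first log index that remains once the
// truncation has been applied.
func (pt pendingTruncation) firstIndexAfterTrunc() uint64 {
	return pt.Index + 1
}

func main() {
	pt := pendingTruncation{Index: 20}
	// A caller computing the post-truncation first index takes the max of the
	// current first index and pt.firstIndexAfterTrunc().
	fmt.Println(pt.firstIndexAfterTrunc()) // 21
}
```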
pkg/kv/kvserver/raft_log_truncator.go, line 85 at r4 (raw file):
```go
p.truncs[0] = pendingTruncation{}
if !(p.truncs[1] == (pendingTruncation{})) {
	p.truncs[0] = p.truncs[1]
```
nit: can this method just be these two lines, without the conditional?
pkg/kv/kvserver/raft_log_truncator.go, line 125 at r4 (raw file):
```go
	expectedFirstIndex uint64
	// logDeltaBytes includes the bytes from sideloaded files.
	logDeltaBytes int64
```

Could you mention that this is always negative (or zero)? It caught me by surprise to see us add `logDeltaBytes` above.
pkg/kv/kvserver/raft_log_truncator.go, line 183 at r4 (raw file):
The comment above says "Any mutex protecting raft-state is acquired before returning". Was that in reference to `Replica.raftMu`, and is this in reference to `Replica.mu`? If so, consider mentioning this in that comment:

Any mutex protecting raft-state (Replica.raftMu) is acquired before returning, but finer-grained mutexes protecting replica state (Replica.mu) must be acquired using lockReplicaState.
pkg/kv/kvserver/raft_log_truncator.go, line 318 at r4 (raw file):
```go
		return
	}
	ranges = make([]roachpb.RangeID, 0, n)
```
nit: if we want to minimize the amount of time that we spend holding the `raftLogTruncator`'s lock, we could maintain two maps and swap them here. Something like:
```go
type raftLogTruncator struct {
	...
	add, drain map[roachpb.RangeID]struct{}
}

func (t *raftLogTruncator) addPendingTruncation() {
	...
	t.mu.Lock()
	t.mu.add[r.GetRangeID()] = struct{}{}
	t.mu.Unlock()
}

func (t *raftLogTruncator) durabilityAdvanced() {
	...
	t.mu.Lock()
	t.mu.add, t.mu.drain = t.mu.drain, t.mu.add
	drain := t.mu.drain
	t.mu.Unlock()
	for k := range drain {
		...
		delete(drain, k)
	}
}
```
pkg/kv/kvserver/raft_log_truncator.go, line 423 at r4 (raw file):
```go
		return
	}
	if err := batch.Commit(false); err != nil {
```
Consider adding a note about why we don't need durability with this operation.
pkg/kv/kvserver/replica.go, line 2038 at r4 (raw file):
```go
// Implementation of replicaForTruncator interface.

var _ replicaForTruncator = &Replica{}
```

Should these be methods on Replica or on a separate concrete type? I believe Tobi was suggesting that we do something like we do with `replicaProposer` to avoid bloating the `Replica` method set.
pkg/kv/kvserver/replica_application_state_machine.go, line 763 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
We could pull out a free-standing function that takes `res`, the cluster settings, `b.state.TruncatedState`, a state loader, a WriteBatch, the truncator, and the replica through an interface. It will be work, work that's useful, but I am hesitant to make you do it in the context of this PR. It would be worthwhile, but not if it comes at the cost of polishing other, more important, bits of the PR. So perhaps let's hold off and add it in a follow-up, or not at all, depending on how things pan out.

`TestReplicaStateMachineChangeReplicas` is an example of a test that directly exercises a Replica state machine. You may be able to write the kind of test that you want at a similar level.
pkg/kv/kvserver/replica_application_state_machine.go, line 1264 at r4 (raw file):
The RaftExpectedFirstIndex != 0 will not be necessary in the release following LooselyCoupledRaftLogTruncation.
Is this true? Do we plan to run a long-running migration to flush out all raft log truncation entries without `RaftExpectedFirstIndex` from all raft logs?
2ba1d64 to eab0236 Compare
TFTR!
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @nvanbenschoten, @sumeerbhola, and @tbg)
pkg/kv/kvserver/raft_log_truncator.go, line 61 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Consider pulling `trunc.Index + 1` into a method on `pendingTruncation` called `firstIndexAfterApplied` or something like that, with a short comment that reminds readers that `RaftTruncatedState.Index` is inclusive.
Done
pkg/kv/kvserver/raft_log_truncator.go, line 85 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: can this method just be these two lines, without the conditional?
Yes, it can! Done.
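For readers, the simplification amounts to an unconditional shift. This is a sketch with stand-in types, relying on the invariant that slot 1 is the zero value whenever only one truncation is queued:

```go
package main

import "fmt"

type pendingTruncation struct{ Index uint64 }

// pendingLogTruncations holds at most two pending truncations: slot 0 is the
// front (least aggressive); slot 1, if non-zero, merges everything behind it.
type pendingLogTruncations struct {
	truncs [2]pendingTruncation
}

// popLocked removes the front entry. When only one truncation is queued,
// truncs[1] is already the zero value, so no conditional is needed.
func (p *pendingLogTruncations) popLocked() {
	p.truncs[0] = p.truncs[1]
	p.truncs[1] = pendingTruncation{}
}

func main() {
	p := pendingLogTruncations{truncs: [2]pendingTruncation{{Index: 20}, {Index: 30}}}
	p.popLocked()
	fmt.Println(p.truncs) // [{30} {0}]
}
```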
pkg/kv/kvserver/raft_log_truncator.go, line 125 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Could you mention that this is always negative (or zero)? It caught me by surprise to see us add `logDeltaBytes` above.
Done.
It shouldn't be 0 since the truncation is truncating at least one additional entry. But I don't want to make any claims on what we consider the size of a log entry, so the comment I've added says <= 0.
pkg/kv/kvserver/raft_log_truncator.go, line 183 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
The comment above says "Any mutex protecting raft-state is acquired before returning". Was that in reference to `Replica.raftMu`, and is this in reference to `Replica.mu`? If so, consider mentioning this in that comment:

Any mutex protecting raft-state (Replica.raftMu) is acquired before returning, but finer-grained mutexes protecting replica state (Replica.mu) must be acquired using lockReplicaState.
Correct. Done.
pkg/kv/kvserver/raft_log_truncator.go, line 318 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: if we want to minimize the amount of time that we spend holding the `raftLogTruncator`'s lock, we could maintain two maps and swap them here. Something like:

```go
type raftLogTruncator struct {
	...
	add, drain map[roachpb.RangeID]struct{}
}

func (t *raftLogTruncator) addPendingTruncation() {
	...
	t.mu.Lock()
	t.mu.add[r.GetRangeID()] = struct{}{}
	t.mu.Unlock()
}

func (t *raftLogTruncator) durabilityAdvanced() {
	...
	t.mu.Lock()
	t.mu.add, t.mu.drain = t.mu.drain, t.mu.add
	drain := t.mu.drain
	t.mu.Unlock()
	for k := range drain {
		...
		delete(drain, k)
	}
}
```
This is the only case where we do work proportional to the number of ranges while holding the raftLogTruncator's lock, so it probably makes sense to change this -- if we hold the lock too long, we'll block anyone trying to queue new truncations. I'll wait to make this change until it is clear that `raftLogTruncator` will remain in its current form.
pkg/kv/kvserver/raft_log_truncator.go, line 423 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Consider adding a note about why we don't need durability with this operation.
Done
pkg/kv/kvserver/replica.go, line 2038 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Should these be methods on Replica or on a separate concrete type? I believe Tobi was suggesting that we do something like we do with `replicaProposer` to avoid bloating the `Replica` method set.
@tbg is that what you were suggesting too?
If yes, I'll make the change.
pkg/kv/kvserver/replica_application_state_machine.go, line 1264 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
The RaftExpectedFirstIndex != 0 will not be necessary in the release following LooselyCoupledRaftLogTruncation.
Is this true? Do we plan to run a long-running migration to flush out all raft log truncation entries without `RaftExpectedFirstIndex` from all raft logs?
Good point.
This comment could have been more decisive. I've replaced it with
// This strongly coupled truncation code will be removed in the release
// following LooselyCoupledRaftLogTruncation.
And in the other place in this file where we gate on RaftExpectedFirstIndex == 0
I've added
// In the release following LooselyCoupledRaftLogTruncation, we will
// retire the strongly coupled path. It is possible that some replica
// still has a truncation sitting in a raft log that never populated
// RaftExpectedFirstIndex, which will be interpreted as 0. When applying
// it, the loosely coupled code will mark the log size as untrusted and
// will recompute the size. This has no correctness impact, so we are not
// going to bother with a long-running migration.
eab0236 to 18566d1 Compare
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @nvanbenschoten, @sumeerbhola, and @tbg)
pkg/kv/kvserver/replica_sideload_disk.go, line 254 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Make sure there's a simple unit test for this. Modifying one of the existing ones should be enough.
Done
pkg/kv/kvserver/testdata/raft_log_truncator, line 83 at r6 (raw file):
# TODO: add more test cases and add comments.
This test is WIP
This is the CockroachDB plumbing for Pebble's IterOptions.OnlyReadGuaranteedDurable. It is for use in the raftLogTruncator cockroachdb#76215. Since most of the exported interfaces in the storage package use a Reader, we support this via a DurabilityRequirement parameter on Engine.NewReadOnly, and not via an iterator option. There is also a RegisterFlushCompletedCallback method on Engine which will be used to poll certain durable state in the raftLogTruncator. Other than the trivial plumbing, this required some refactoring of the Reader.MVCCGet* code for Pebble and pebbleReadOnly. Even though it is deprecated and primarily/only used in tests, we don't want to have the durability semantics diverge. Release note: None
6e1d2a4
to
fac39ff
Compare
Just fine tuning at this point.
Reviewed 3 of 8 files at r8, 8 of 15 files at r9, 2 of 2 files at r10, 6 of 6 files at r11, all commit messages.
Dismissed @nvanbenschoten and @sumeerbhola from a discussion.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @nvanbenschoten, and @sumeerbhola)
pkg/kv/kvserver/raft_log_queue.go, line 114 at r11 (raw file):
```go
s := settings.RegisterBoolSetting(
	settings.SystemOnly,
	"kv.raft_log.enable_loosely_coupled_truncation.enabled",
```

nit: `enable_x.enabled` is not quite right. How about `kv.raft_log.loosely_coupled_truncations.enabled`.
pkg/kv/kvserver/raft_log_queue.go, line 265 at r11 (raw file):
```go
firstIndex, err := r.raftFirstIndexLocked()
firstIndex = r.pendingLogTruncations.computePostTruncFirstIndex(firstIndex)
```

Could you rearrange this such that `firstIndex, err` is the last thing before `r.mu.Unlock` and the error check? I don't think we need to hold the lock for `computePostTruncFirstIndex`, so this can be done after the err check.
pkg/kv/kvserver/raft_log_truncator.go, line 410 at r10 (raw file):
Previously, tbg (Tobias Grieger) wrote…
nit: you could change this to just return `(*Replica, bool)`, as you're not ever really interested in handling an error here. You just want to have the replica.
Having reviewed more, yes this seems right and a good thing to do.
pkg/kv/kvserver/raft_log_truncator.go, line 46 at r11 (raw file):
```go
type pendingLogTruncations struct {
	mu struct {
		// From a lock ordering perspective, this mutex is the lowest.
```
Clarify "lowest".
pkg/kv/kvserver/raft_log_truncator.go, line 48 at r11 (raw file):
```go
		// From a lock ordering perspective, this mutex is the lowest.
		syncutil.Mutex
		// We only track the oldest and latest pending truncation. We cannot track
```
We track the old and merge everything else into the "new" truncation, right? If we discard all intermediate truncations then we're also losing the bytes deltas.
pkg/kv/kvserver/raft_log_truncator.go, line 94 at r11 (raw file):
```go
func (p *pendingLogTruncations) isEmptyLocked() bool {
	return p.mu.truncs[0] == (pendingTruncation{})
```
Looks like there's an invariant here. Consider spelling out that invariant explicitly on the field comment.
pkg/kv/kvserver/raft_log_truncator.go, line 168 at r11 (raw file):
```go
// raftLogTruncator is responsible for actually enacting truncations.
// Mutex ordering: Replica mutexes > raftLogTruncator.mu
```

Could you spell out what this means? `<` means "acquired before", right?
pkg/kv/kvserver/raft_log_truncator.go, line 207 at r11 (raw file):
```go
//
// A replica has raft state, including the queue of pending truncations, that
// the truncator is modifying. There could be a potential "raft-state" mutex
```
Heh, let's just drop the conditionals here. There is one canonical impl and the interface is for testing, that is ok.
You might also add that access to the interface is single-threaded.
pkg/kv/kvserver/raft_log_truncator.go, line 237 at r11 (raw file):
```go
	_ context.Context, from, to uint64) (freed int64, _ error)
	getStateLoader() stateloader.StateLoader
	// Setting the persistent raft state is via the Engine exposed by
```

nit: `// NB:`
pkg/kv/kvserver/raft_log_truncator.go, line 260 at r11 (raw file):
```go
	// Need to figure out whether to add this new pendingTrunc to the
	// truncations that are already queued, and if yes, where to add.
	// i is the index of the last already queued truncation.
```

Make sure that if it doesn't already, `iterateLocked` documents that it visits truncations starting with the less aggressive (old) ones first.
pkg/kv/kvserver/raft_log_truncator.go, line 264 at r11 (raw file):
```go
	// alreadyTruncIndex represents what has been already truncated.
	alreadyTruncIndex := r.getTruncatedState().Index
	// No need to acquire pendingTruncs.mu for read in this case.
```
// (see replicaForTruncator)
pkg/kv/kvserver/raft_log_truncator.go, line 289 at r11 (raw file):
```go
	// which can be large, since we do the computation for the sideloaded
	// entries size here. That will reduce the undercounting of the bytes in the
	// raft log by reducing the value of sideloadedFreed.
```
// In the common case of alreadyTruncIndex+1 <= raftExpectedFirstIndex, the computation returns the same result regardless of which is plugged in as the lower bound.
pkg/kv/kvserver/raft_log_truncator.go, line 292 at r11 (raw file):
```go
	sideloadedFreed, err := r.sideloadedBytesIfTruncatedFromTo(
		ctx, alreadyTruncIndex+1, pendingTrunc.firstIndexAfterTrunc())
	// Log a loud error since we need to continue enqueuing the truncation.
```

nit: this should be in the `if err != nil` block.
pkg/kv/kvserver/raft_log_truncator.go, line 301 at r11 (raw file):
```go
	// Merge the existing entry into the new one.
	// No need to acquire pendingTruncs.mu for read in this case.
	pendingTrunc.isDeltaTrusted = pendingTrunc.isDeltaTrusted ||
```

Is that right? If we're merging wrong stats into correct stats, we'll get wrong stats (not right stats). So this should be `&&`?
pkg/kv/kvserver/raft_log_truncator.go, line 430 at r11 (raw file):
```go
		}
	}
	// NB: Unlocking but can keep reading pendingTruncs in this case.
```
"in this case" doesn't add much, either remove it or say "due to replicaForTruncator contract".
pkg/kv/kvserver/raft_log_truncator.go, line 458 at r11 (raw file):
```go
	})
	if enactIndex < 0 {
		// Add it back as range we should examine and release all locks.
```
"and release all locks" had me confused. You're just saying that we release the replica back to the store, right? Doesn't seem to add much.
pkg/kv/kvserver/raft_log_truncator.go, line 464 at r11 (raw file):
```go
	// Do the truncation of persistent raft entries, specified by enactIndex
	// (this subsumes all the preceding queued truncations).
	batch := t.store.getEngine().NewUnindexedBatch(false)
```
(false /* writeOnly */)
pkg/kv/kvserver/raft_log_truncator.go, line 472 at r11 (raw file):
```go
		log.Errorf(ctx, "while attempting to truncate raft log: %+v", err)
	} else {
		log.Errorf(ctx, "unexpected !apply returned from handleTruncatedStateBelowRaftPreApply")
```
```go
err := errors.AssertionFailedf("unexpected !apply from handleTruncatedStateBelowRaftPreApply")
if buildutil.CrdbTestBuild || util.RaceEnabled {
	log.Fatalf(ctx, "%s", err)
} else {
	log.Errorf(ctx, "%s", err)
}
```
pkg/kv/kvserver/replica.go, line 2040 at r11 (raw file):
```go
}

// Implementation of the replicaForTruncator interface.
```
For `replicaProposer`, we pulled this out of `replica.go`. I would suggest doing the same here, perhaps moving this to a file `raft_truncator_replica.go`.
pkg/kv/kvserver/replica.go, line 2052 at r11 (raw file):
```go
	r.mu.Lock()
	defer r.mu.Unlock()
	// TruncatedState is guaranteed to be non-nil
```
.
pkg/kv/kvserver/replica.go, line 2096 at r11 (raw file):
```go
func (r *replicaForTruncatorImpl) getStateLoader() stateloader.StateLoader {
	// NB: the replicaForTruncator contract says that Replica.raftMu is held.
```
Need same comment on sideloadedBytesIfTruncatedFromTo.
pkg/kv/kvserver/replica_application_state_machine.go, line 1261 at r11 (raw file):
```go
	}
	isDeltaTrusted := true
```
nit: perhaps name this more verbosely, `isRaftLogTruncationDeltaTrusted`, since there is also the `MVCCStats` "delta".
pkg/kv/kvserver/store.go, line 3493 at r11 (raw file):
```go
}

// Implementation of the storeForTruncator interface.
```
raft_truncator_store.go?
pkg/kv/kvserver/store.go, line 3508 at r11 (raw file):
```go
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.mu.destroyStatus.Removed()
```
What about `destroyReasonMergePending`? As is, we'll continue processing the replica. Is that necessary? The raft log is about to be deleted anyway. You could change this to `!r.mu.destroyStatus.IsAlive()` to avoid it.
pkg/kv/kvserver/raft_log_truncator_test.go, line 45 at r11 (raw file):
```go
require.Equal(t, 2, truncs.capacity())
require.True(t, truncs.isEmptyLocked())
require.Equal(t, int64(55), truncs.computePostTruncLogSize(55))
```
You could use EqualValues, don't have to, just wanted to point out this exists in case you weren't aware. I assume you got errors here and then added the cast, if you use constant ints using EqualValues from the start can be a bit faster.
pkg/kv/kvserver/raft_log_truncator_test.go, line 62 at r11 (raw file):
```go
require.Equal(t, int64(0), truncs.computePostTruncLogSize(45))
// Advances to Index+1.
require.Equal(t, uint64(21), truncs.computePostTruncFirstIndex(5))
```
Test 20 and 21 too since they're the boundary values.
pkg/kv/kvserver/raft_log_truncator_test.go, line 67 at r11 (raw file):
```go
// Two pending truncations.
truncs.mu.truncs[1].logDeltaBytes = -70
```
Now that I see this, I don't think I saw a comment that said something like `// INVARIANT: pendingTruncs[0].Index < pendingTruncs[1].Index (if the latter is not zero)`; consider adding that in the right place.
pkg/kv/kvserver/raft_log_truncator_test.go, line 82 at r11 (raw file):
```go
// Added -120 and bumped up to 0.
require.Equal(t, int64(0), truncs.computePostTruncLogSize(115))
// Advances to Index+1 of second entry
```
.
6da9157 to 91e90d7 Compare
addressed all comments
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @nvanbenschoten, and @tbg)
pkg/kv/kvserver/raft_log_queue.go, line 114 at r11 (raw file):
Previously, tbg (Tobias Grieger) wrote…
nit: `enable_x.enabled` is not quite right. How about `kv.raft_log.loosely_coupled_truncations.enabled`.
Done
pkg/kv/kvserver/raft_log_queue.go, line 265 at r11 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Could you rearrange this such that `firstIndex, err` is the last thing before `r.mu.Unlock` and the error check? I don't think we need to hold the lock for `computePostTruncFirstIndex`, so this can be done after the err check.
Done
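The rearrangement looks roughly like this; the types below are stand-ins, and only the locking shape (the fallible read last under the mutex, the pending-truncation adjustment after unlock) is the point:

```go
package main

import (
	"fmt"
	"sync"
)

// Stand-ins for the real kvserver types; only the locking shape matters here.
type pendingLogTruncations struct{ pendingFirst uint64 }

func (p *pendingLogTruncations) computePostTruncFirstIndex(firstIndex uint64) uint64 {
	if p.pendingFirst > firstIndex {
		return p.pendingFirst
	}
	return firstIndex
}

type replicaStandIn struct {
	mu                    sync.Mutex
	firstIndex            uint64
	pendingLogTruncations pendingLogTruncations
}

func (r *replicaStandIn) raftFirstIndexLocked() (uint64, error) { return r.firstIndex, nil }

// pendingFirstIndex reads raft state under r.mu, ending with the fallible
// call, and only adjusts for pending truncations after releasing the lock.
func (r *replicaStandIn) pendingFirstIndex() (uint64, error) {
	r.mu.Lock()
	firstIndex, err := r.raftFirstIndexLocked() // last read under the lock
	r.mu.Unlock()
	if err != nil {
		return 0, err
	}
	return r.pendingLogTruncations.computePostTruncFirstIndex(firstIndex), nil
}

func main() {
	r := &replicaStandIn{firstIndex: 10, pendingLogTruncations: pendingLogTruncations{pendingFirst: 21}}
	fi, _ := r.pendingFirstIndex()
	fmt.Println(fi) // 21
}
```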
pkg/kv/kvserver/raft_log_truncator.go, line 416 at r9 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Oh, you are right, I was somehow thinking it wrapped the engine but it does not (I realized half-way through, but without realizing it threw the premise out of the window). We can leave this as is, thanks for pushing back.
Ack
pkg/kv/kvserver/raft_log_truncator.go, line 410 at r10 (raw file):
Having reviewed more, yes this seems right and a good thing to do.
By "this seems right", I assume you meant to make this change.
Done
pkg/kv/kvserver/raft_log_truncator.go, line 46 at r11 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Clarify "lowest".
Done
pkg/kv/kvserver/raft_log_truncator.go, line 48 at r11 (raw file):
Previously, tbg (Tobias Grieger) wrote…
We track the old and merge everything else into the "new" truncation, right? If we discard all intermediate truncations then we're also losing the bytes deltas.
The comment is updated to say
// We track up to two truncations: the oldest pending truncation, and a
// merge of all the subsequent pending truncations. We cannot track only
// one merged truncation since its index may always be ahead of the
// durable RaftAppliedIndex, and so we may never be able to truncate.
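A toy, self-contained illustration of that two-slot scheme (the real queue lives under a mutex and carries more fields): keeping the oldest truncation separate means it can become enactable even while later truncations keep racing ahead of the durable applied index.

```go
package main

import "fmt"

// Simplified stand-in for the PR's pendingTruncation.
type pendingTruncation struct {
	Index uint64 // highest index removed by this truncation (inclusive)
}

// Two-slot queue: slot 0 is the oldest pending truncation, slot 1 is the merge
// of everything queued after it.
type pendingLogTruncations struct {
	truncs [2]pendingTruncation
}

func (p *pendingLogTruncations) add(t pendingTruncation) {
	switch {
	case p.truncs[0] == (pendingTruncation{}):
		p.truncs[0] = t
	case p.truncs[1] == (pendingTruncation{}):
		p.truncs[1] = t
	default:
		// Merge into the second slot: a later truncation subsumes an earlier one.
		if t.Index > p.truncs[1].Index {
			p.truncs[1] = t
		}
	}
}

// enactable reports which queued truncations could be enacted given the
// durable applied index.
func (p *pendingLogTruncations) enactable(durableAppliedIndex uint64) []pendingTruncation {
	var out []pendingTruncation
	for _, t := range p.truncs {
		if t != (pendingTruncation{}) && t.Index <= durableAppliedIndex {
			out = append(out, t)
		}
	}
	return out
}

func main() {
	var p pendingLogTruncations
	p.add(pendingTruncation{Index: 20})
	p.add(pendingTruncation{Index: 30})
	p.add(pendingTruncation{Index: 40}) // merged into slot 1
	// With durability only up to index 25, the oldest truncation can still be
	// enacted because it was not merged away.
	fmt.Println(p.enactable(25)) // [{20}]
}
```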
pkg/kv/kvserver/raft_log_truncator.go, line 94 at r11 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Looks like there's an invariant here. Consider spelling out that invariant explicitly on the field comment.
Done
// Invariants:
// - Queue slot i is empty iff truncs[i] == pendingTruncation{}
// - Slot 0 represents the first position in the queue. Therefore, it is
// not possible for slot 0 to be empty and slot 1 to be non-empty.
// An implication of the above is that the queue is empty iff slot 0 is
// empty.
pkg/kv/kvserver/raft_log_truncator.go, line 168 at r11 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Could you spell out what this means? `<` means "acquired before", right?
I am now unsure about my symbol usage, and I wasn't able to quickly find a reference online. So changed to
// Mutex ordering: The Replica mutexes can be held when acquiring
// raftLogTruncator.mu, but the reverse is not permitted.
pkg/kv/kvserver/raft_log_truncator.go, line 207 at r11 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Heh, let's just drop the conditionals here. There is one canonical impl and the interface is for testing, that is ok.
You might also add that access to the interface is single-threaded.
Dropped the conditionals.
I didn't add that access to replicaForTruncator is single-threaded (even though it is), because it may confuse the reader into thinking that we are saying more than that (that other accesses via `*Replica` are not occurring concurrently).
pkg/kv/kvserver/raft_log_truncator.go, line 237 at r11 (raw file):
Previously, tbg (Tobias Grieger) wrote…
nit: `// NB:`
Done
pkg/kv/kvserver/raft_log_truncator.go, line 260 at r11 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Make sure that if it doesn't already, `iterateLocked` documents that it visits truncations starting with the less aggressive (old) ones first.
Done
pkg/kv/kvserver/raft_log_truncator.go, line 264 at r11 (raw file):
Previously, tbg (Tobias Grieger) wrote…
// (see replicaForTruncator)
Done
pkg/kv/kvserver/raft_log_truncator.go, line 289 at r11 (raw file):
Previously, tbg (Tobias Grieger) wrote…
// In the common case of alreadyTruncIndex+1 <= raftExpectedFirstIndex, the computation returns the same result regardless of which is plugged in as the lower bound.
Done. Rewrote to
// It is possible that alreadyTruncIndex+1 != raftExpectedFirstIndex. When
// we merge or enact we will see this problem and set the trusted bit to
// false. But we can at least correctly count sideloaded entries, which can
// be large, since we do the computation for the sideloaded entries size
// here. When alreadyTruncIndex+1 > raftExpectedFirstIndex, this will avoid
// double counting sideloaded entries that will be freed, and when
// alreadyTruncIndex+1 < raftExpectedFirstIndex, this will ensure that we
// don't miss sideloaded entries that will be freed.
//
// In the common case of alreadyTruncIndex+1 == raftExpectedFirstIndex, the
// computation returns the same result regardless of which is plugged in as
// the lower bound.
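As a concrete illustration with hypothetical indexes: if raftExpectedFirstIndex is 10 but alreadyTruncIndex is 12 (overlap), using alreadyTruncIndex+1 = 13 as the lower bound avoids re-counting sideloaded entries 10 through 12 that an earlier truncation already freed; if instead alreadyTruncIndex is 7 (gap), the lower bound 8 ensures that sideloaded entries 8 and 9, which this truncation will also free, are still counted.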
pkg/kv/kvserver/raft_log_truncator.go, line 292 at r11 (raw file):
Previously, tbg (Tobias Grieger) wrote…
nit: this should be in the `if err != nil` block.
Done
pkg/kv/kvserver/raft_log_truncator.go, line 301 at r11 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Is that right? If we're merging wrong stats into correct stats, we'll get wrong stats (not right stats). So this should be `&&`?
Good catch. Done. And added test.
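A tiny sketch of why `&&` is the right merge (assuming the trusted bit means the byte delta was computed without gaps or overlaps): an untrusted delta folded into a trusted one must leave the merged result untrusted.

```go
package main

import "fmt"

// Simplified stand-in carrying only the fields relevant to the merge.
type pendingTruncation struct {
	logDeltaBytes  int64 // always <= 0
	isDeltaTrusted bool
}

// merge folds prev into next, as when collapsing queued truncations.
func merge(prev, next pendingTruncation) pendingTruncation {
	next.logDeltaBytes += prev.logDeltaBytes
	// The combined delta is only trustworthy if both inputs were.
	next.isDeltaTrusted = next.isDeltaTrusted && prev.isDeltaTrusted
	return next
}

func main() {
	prev := pendingTruncation{logDeltaBytes: -100, isDeltaTrusted: false}
	next := pendingTruncation{logDeltaBytes: -50, isDeltaTrusted: true}
	fmt.Printf("%+v\n", merge(prev, next)) // {logDeltaBytes:-150 isDeltaTrusted:false}
}
```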
pkg/kv/kvserver/raft_log_truncator.go, line 430 at r11 (raw file):
Previously, tbg (Tobias Grieger) wrote…
"in this case" doesn't add much, either remove it or say "due to replicaForTruncator contract".
Done
pkg/kv/kvserver/raft_log_truncator.go, line 458 at r11 (raw file):
Previously, tbg (Tobias Grieger) wrote…
"and release all locks" had me confused. You're just saying that we release the replica back to the store, right? Doesn't seem to add much.
Done (was a stale comment). Changed to // Enqueue the rangeID for the future.
pkg/kv/kvserver/raft_log_truncator.go, line 464 at r11 (raw file):
Previously, tbg (Tobias Grieger) wrote…
(false /* writeOnly */)
Done
pkg/kv/kvserver/raft_log_truncator.go, line 472 at r11 (raw file):
Previously, tbg (Tobias Grieger) wrote…
```go
err := errors.AssertionFailedf("unexpected !apply from handleTruncatedStateBelowRaftPreApply")
if buildutil.CrdbTestBuild || util.RaceEnabled {
	log.Fatalf(ctx, "%s", err)
} else {
	log.Errorf(ctx, "%s", err)
}
```
Done
pkg/kv/kvserver/replica.go, line 2040 at r11 (raw file):
Previously, tbg (Tobias Grieger) wrote…
For `replicaProposer`, we pulled this out of `replica.go`. I would suggest doing the same here, perhaps moving this to a file `raft_truncator_replica.go`.
Done. And renamed the type to `raftTruncatorReplica`.
pkg/kv/kvserver/replica.go, line 2052 at r11 (raw file):
Previously, tbg (Tobias Grieger) wrote…
.
Done
pkg/kv/kvserver/replica.go, line 2096 at r11 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Need same comment on sideloadedBytesIfTruncatedFromTo.
I changed this to
// NB: the replicaForTruncator contract says that Replica.raftMu is held for
// the duration of the existence of replicaForTruncator, so we return the
// r.raftMu.stateloader (and not r.mu.stateLoader).
since that was my limited intention here.
It wasn't about whether someone can be concurrently modifying the engine state or not.
pkg/kv/kvserver/replica_application_state_machine.go, line 1261 at r11 (raw file):
Previously, tbg (Tobias Grieger) wrote…
nit: perhaps name this more verbosely, `isRaftLogTruncationDeltaTrusted`, since there is also the `MVCCStats` "delta".
Done
pkg/kv/kvserver/store.go, line 3493 at r11 (raw file):
Previously, tbg (Tobias Grieger) wrote…
raft_truncator_store.go?
Seems very tiny, so I didn't bother. I can do so if you have a strong opinion.
pkg/kv/kvserver/store.go, line 3508 at r11 (raw file):
Previously, tbg (Tobias Grieger) wrote…
What about `destroyReasonMergePending`? As is, we'll continue processing the replica. Is that necessary? The raft log is about to be deleted anyway. You could change this to `!r.mu.destroyStatus.IsAlive()` to avoid it.
Done
pkg/kv/kvserver/raft_log_truncator_test.go, line 45 at r11 (raw file):
Previously, tbg (Tobias Grieger) wrote…
You could use EqualValues, don't have to, just wanted to point out this exists in case you weren't aware. I assume you got errors here and then added the cast, if you use constant ints using EqualValues from the start can be a bit faster.
I wasn't aware. Thanks. Done.
pkg/kv/kvserver/raft_log_truncator_test.go, line 62 at r11 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Test 20 and 21 too since they're the boundary values.
Done
pkg/kv/kvserver/raft_log_truncator_test.go, line 67 at r11 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Now that I see this, I don't think I saw a comment that said something like
// INVARIANT: pendingTruncs[0].Index < pendingTruncs[1].Index (if the latter is not zero), consider adding that in the right place.
Added to the invariants listed for the `truncs` field.
pkg/kv/kvserver/raft_log_truncator_test.go, line 82 at r11 (raw file):
Previously, tbg (Tobias Grieger) wrote…
.
Done
TestReplicaCircuitBreaker_Follower_QuorumLoss failure is a flake #76781
All done, thanks for the swift turnaround!
Reviewed 1 of 14 files at r3, 17 of 17 files at r12, all commit messages.
Dismissed @nvanbenschoten and @sumeerbhola from 10 discussions.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @nvanbenschoten)
thanks for the speedy reviews!
bors r=tbg
In the ReplicasStorage design we stop making any assumptions regarding what is durable in the state machine when syncing a batch that commits changes to the raft log. This implies the need to make raft log truncation more loosely coupled than it is now, since we can truncate only when certain that the state machine is durable up to the truncation index.

Current raft log truncation flows through raft and even though the RaftTruncatedStateKey is not a replicated key, it is coupled in the sense that the truncation is done below raft when processing the corresponding log entry (that asked for truncation to be done).

The current setup also has correctness issues wrt maintaining the raft log size, when passing the delta bytes for a truncation. We compute the delta at proposal time (to avoid repeating iteration over the entries in all replicas), but we do not pass the first index corresponding to the truncation, so gaps or overlaps cannot be noticed at truncation time.

We do want to continue to have the raft leader guide the truncation since we do not want either leader or followers to over-truncate, given our desire to serve snapshots from any replica. In the loosely coupled approach implemented here, the truncation request that flows through raft serves as an upper bound on what can be truncated.

The truncation request includes an ExpectedFirstIndex. This is further propagated using ReplicatedEvalResult.RaftExpectedFirstIndex. This ExpectedFirstIndex allows one to notice gaps or overlaps when enacting a sequence of truncations, which results in setting the Replica.raftLogSizeTrusted to false. The correctness issue with Replica.raftLogSize is not fully addressed since there are existing consistency issues when evaluating a TruncateLogRequest (these are now noted in a code comment).

Below raft, the truncation requests are queued onto a Replica in pendingLogTruncations. The queueing and dequeuing is managed by a raftLogTruncator that takes care of merging pending truncation requests and enacting the truncations when the durability of the state machine advances.

The pending truncation requests are taken into account in the raftLogQueue when deciding whether to do another truncation. Most of the behavior of the raftLogQueue is unchanged.

The new behavior is gated on a LooselyCoupledRaftLogTruncation cluster version. Additionally, the new behavior can be turned off using the kv.raft_log.enable_loosely_coupled_truncation.enabled cluster setting, which is true by default. The latter is expected to be a safety switch for 1 release, after which we expect to remove it. That removal will also clean up some duplicated code (that was non-trivial to refactor and share) between the previous coupled and new loosely coupled truncation.

Note, this PR is the first of two -- loosely coupled truncation is turned off via a constant in this PR. The next one will eliminate the constant and put it under the control of the cluster setting.

Informs cockroachdb#36262
Informs cockroachdb#16624, cockroachdb#38322

Release note (ops change): The cluster setting kv.raft_log.loosely_coupled_truncation.enabled can be used to disable loosely coupled truncation.
91e90d7 to f9dee66 Compare
TestClusterVersionUpgrade failure looks like a flake #74599
bors r=tbg
Build succeeded:
This is the CockroachDB plumbing for Pebble's IterOptions.OnlyReadGuaranteedDurable. It is for use in the raftLogTruncator cockroachdb#76215. Since most of the exported interfaces in the storage package use a Reader, we support this via a DurabilityRequirement parameter on Engine.NewReadOnly, and not via an iterator option. There is also a RegisterFlushCompletedCallback method on Engine which will be used to poll certain durable state in the raftLogTruncator. Other than the trivial plumbing, this required some refactoring of the Reader.MVCCGet* code for Pebble and pebbleReadOnly. Even though it is deprecated and primarily/only used in tests, we don't want to have the durability semantics diverge. Release note: None
76776: scpb,scdecomp,scbuild: remodel elements, use decomposition r=postamar a=postamar This change adresses some shortcomings of the declarative schema changer that would have compromised its ability to evolve gracefully in its current form. Basically these are: 1. the builder generating targets based off a hybrid descriptor-and- element scheme, and 2. an awkward handling of back-references which precludes doing much more than DROP statements correctly. This change therefore consists of, respectively: 1. rewriting the builder to hide descriptors from the `scbuildstmt` package, instead exposing sets of elements, and 2. reworking the element model definition by getting rid of back-reference elements entirely. In support of (1) this commit introduces a new `scdecomp` package which, given a descriptor, will walk a visitor function over all its constituent elements. This itself is made quite simple by (2) which largely removes the need to look up referenced descriptors outside of some mild edge-cases. This `scdecomp` package is used by the backing structs of the `scbuildstmt` dependency interfaces to collect a "state-of-the-world" snapshot of elements upon which the schema change target elements are layered. This approach lends itself well to caching as the descriptors remain immutable in the builder. The rewrite of most of `elements.proto` in support of (2) implies a slew of cascading changes: - the attributes in `screl` need to be adjusted, - the lifecyles of the elements in `opgen` have to be adjusted, - the dependency rules and no-op rules need to be adjusted, - in turn, new operations need to be defined in `scop`, and, - in turn, these operations need to be implemented in `scmutationexec`. Elements are now responsible for updating any back-references that correspond to their forward references, which effectively pushed that complexity into these reference update ops in `scmutationexec`. These have to operate on a best-effort basis, because there are no longer back-reference elements with dependency rules to enforce convenient assumptions about the context of their adding or removal. This is arguably not a bad thing per-se: this new paradigm is "fail hard, fail fast" which surfaces bugs a lot more quickly than a badly-written dependency rule would. The rules themselves fall into cleaner patterns. This commit provides some tools to express the most common of these. This commit also unifies the `deprules` and `scopt` packages into a commit `rules` package with full data-driven test coverage. Other changes in this commit are peripheral in nature: - Working on this change surfaced some deficiencies in the cross-referenced descriptor validation logic: we checked that the referenced descriptor exists but not that it's not dropped. This commit fixes this. - The expression validation logic in `schemaexpr` quite reasonably used to assume that columns can be found in descriptors; unfortunately the builder needs to be able to use this for columns which only exist as elements. This commit refactors the entry points into this validation logic as a result. - Quality-of-life improvements like adding a testing knob used to panic TestRollback with an error with a full stack trace when the rollback fails. - Because back-references don't exist as elements anymore, they also don't exist as targets, so we now have schema changer jobs linked to descriptors for which there are zero targets. This commit addresses this by simply allowing it. 
This is necessary to lock descriptors to prevent any concurrent schema changes which would affect them. Release note: None 76793: storage: introduce guaranteed durability functionality r=jbowens a=sumeerbhola This is the CockroachDB plumbing for Pebble's IterOptions.OnlyReadGuaranteedDurable. It is for use in the raftLogTruncator #76215. Since most of the exported interfaces in the storage package use a Reader, we support this via a DurabilityRequirement parameter on Engine.NewReadOnly, and not via an iterator option. There is also a RegisterFlushCompletedCallback method on Engine which will be used to poll certain durable state in the raftLogTruncator. Other than the trivial plumbing, this required some refactoring of the Reader.MVCCGet* code for Pebble and pebbleReadOnly. Even though it is deprecated and primarily/only used in tests, we don't want to have the durability semantics diverge. Release note: None 76835: ttljob: add controls to pause TTL jobs r=rafiss a=otan See individual commits for details. 76901: colexecspan: de-templatize span assembler and use span.Splitter r=RaduBerinde a=RaduBerinde #### colexecspan: de-templatize span assembler The span assembler code is generated only to inline a piece of code that has two variants. This change converts it to non-generated code and simply forks the code paths above the batch loop. Release note: None #### colexecspan: use span.Splitter The span assembler duplicates some of the logic in `span.Splitter`. Now that the latter is a separate type, we can use it instead. Release note: None Co-authored-by: Marius Posta <[email protected]> Co-authored-by: sumeerbhola <[email protected]> Co-authored-by: Oliver Tan <[email protected]> Co-authored-by: Radu Berinde <[email protected]>
In the ReplicasStorage design we stop making any assumptions
regarding what is durable in the state machine when syncing a batch
that commits changes to the raft log. This implies the need to
make raft log truncation more loosely coupled than it is now, since
we can truncate only when certain that the state machine is durable
up to the truncation index.
Current raft log truncation flows through raft and even though the
RaftTruncatedStateKey is not a replicated key, it is coupled in
the sense that the truncation is done below raft when processing
the corresponding log entry (that asked for truncation to be done).
The current setup also has correctness issues wrt maintaining the
raft log size, when passing the delta bytes for a truncation. We
compute the delta at proposal time (to avoid repeating iteration over
the entries in all replicas), but we do not pass the first index
corresponding to the truncation, so gaps or overlaps cannot be
noticed at truncation time.
We do want to continue to have the raft leader guide the truncation
since we do not want either leader or followers to over-truncate,
given our desire to serve snapshots from any replica. In the loosely
coupled approach implemented here, the truncation request that flows
through raft serves as an upper bound on what can be truncated.
The truncation request includes an ExpectedFirstIndex. This is
further propagated using ReplicatedEvalResult.RaftExpectedFirstIndex.
This ExpectedFirstIndex allows one to notice gaps or overlaps when
enacting a sequence of truncations, which results in setting the
Replica.raftLogSizeTrusted to false. The correctness issue with
Replica.raftLogSize is not fully addressed since there are existing
consistency issues when evaluating a TruncateLogRequest (these
are now noted in a code comment).
Below raft, the truncation requests are queued onto a Replica
in pendingLogTruncations. The queueing and dequeuing is managed
by a raftLogTruncator that takes care of merging pending truncation
requests and enacting the truncations when the durability of the
state machine advances.
The pending truncation requests are taken into account in the
raftLogQueue when deciding whether to do another truncation.
Most of the behavior of the raftLogQueue is unchanged.
The new behavior is gated on a LooselyCoupledRaftLogTruncation
cluster version. Additionally, the new behavior can be turned
off using the kv.raft_log.enable_loosely_coupled_truncation.enabled
cluster setting, which is true by default. The latter is expected
to be a safety switch for 1 release after which we expect to
remove it. That removal will also cleanup some duplicated code
(that was non-trivial to refactor and share) between the previous
coupled and new loosely coupled truncation.
Note, this PR is the first of two -- loosely coupled truncation
is turned off via a constant in this PR. The next one will
eliminate the constant and put it under the control of the cluster
setting.
Informs #36262
Informs #16624
Release note (ops change): The cluster setting
kv.raft_log.loosely_coupled_truncation.enabled can be used
to disable loosely coupled truncation.