
kvserver: loosely couple raft log truncation #76215

Merged
merged 1 commit into from
Feb 22, 2022

Conversation

sumeerbhola
Collaborator

@sumeerbhola sumeerbhola commented Feb 8, 2022

In the ReplicasStorage design we stop making any assumptions
regarding what is durable in the state machine when syncing a batch
that commits changes to the raft log. This implies the need to
make raft log truncation more loosely coupled than it is now, since
we can truncate only when certain that the state machine is durable
up to the truncation index.

Currently, raft log truncation flows through raft: even though the
RaftTruncatedStateKey is not a replicated key, truncation is coupled
in the sense that it is done below raft when processing the log
entry that asked for the truncation.

The current setup also has correctness issues wrt maintaining the
raft log size, when passing the delta bytes for a truncation. We
compute the delta at proposal time (to avoid repeating iteration over
the entries in all replicas), but we do not pass the first index
corresponding to the truncation, so gaps or overlaps cannot be
noticed at truncation time.

We do want to continue to have the raft leader guide the truncation
since we do not want either leader or followers to over-truncate,
given our desire to serve snapshots from any replica. In the loosely
coupled approach implemented here, the truncation request that flows
through raft serves as an upper bound on what can be truncated.

The truncation request includes an ExpectedFirstIndex. This is
further propagated using ReplicatedEvalResult.RaftExpectedFirstIndex.
This ExpectedFirstIndex allows one to notice gaps or overlaps when
enacting a sequence of truncations, which results in setting the
Replica.raftLogSizeTrusted to false. The correctness issue with
Replica.raftLogSize is not fully addressed since there are existing
consistency issues when evaluating a TruncateLogRequest (these
are now noted in a code comment).
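
To make the idea concrete, here is a rough sketch of the check
(illustrative names only; the actual code in this PR is structured
differently):

// applyTruncationDelta sketches how ExpectedFirstIndex exposes gaps or
// overlaps: if the delta was computed against a different log prefix than
// the one actually being truncated, the accumulated size can no longer be
// trusted.
func applyTruncationDelta(
	curFirstIndex, expectedFirstIndex uint64, delta int64, raftLogSize int64, trusted bool,
) (newSize int64, newTrusted bool) {
	newTrusted = trusted
	if expectedFirstIndex != curFirstIndex {
		// Entries were truncated twice (overlap) or some were never
		// accounted for (gap), so stop trusting the size.
		newTrusted = false
	}
	newSize = raftLogSize + delta // delta is negative for a truncation
	if newSize < 0 {
		newSize = 0
	}
	return newSize, newTrusted
}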

Below raft, the truncation requests are queued onto a Replica
in pendingLogTruncations. The queueing and dequeuing is managed
by a raftLogTruncator that takes care of merging pending truncation
requests and enacting the truncations when the durability of the
state machine advances.
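
A minimal sketch of the queueing and merging idea (hypothetical type
and field names, not the definitions used in this PR):

// pendingTruncation is a truncation that flowed through raft but cannot be
// enacted yet, because the state machine may not be durable up to its index.
type pendingTruncation struct {
	index              uint64 // truncate entries up to and including this index
	expectedFirstIndex uint64 // first index the byte delta was computed against
	logDeltaBytes      int64  // negative size delta for the truncated entries
	isDeltaTrusted     bool
}

// merge folds a newer pending truncation into an older one, so that only a
// small number of pending truncations needs to be tracked per replica.
func merge(older, newer pendingTruncation) pendingTruncation {
	merged := newer
	merged.expectedFirstIndex = older.expectedFirstIndex
	merged.logDeltaBytes += older.logDeltaBytes
	// The combined delta is only trustworthy if both inputs were trusted and
	// the newer truncation starts exactly where the older one ends.
	if !older.isDeltaTrusted || newer.expectedFirstIndex != older.index+1 {
		merged.isDeltaTrusted = false
	}
	return merged
}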

The pending truncation requests are taken into account in the
raftLogQueue when deciding whether to do another truncation.
Most of the behavior of the raftLogQueue is unchanged.

The new behavior is gated on a LooselyCoupledRaftLogTruncation
cluster version. Additionally, the new behavior can be turned
off using the kv.raft_log.enable_loosely_coupled_truncation.enabled
cluster setting, which is true by default. The latter is expected
to be a safety switch for one release, after which we expect to
remove it. That removal will also clean up some duplicated code
(that was non-trivial to refactor and share) between the previous
coupled and the new loosely coupled truncation.

Note, this PR is the first of two -- loosely coupled truncation
is turned off via a constant in this PR. The next one will
eliminate the constant and put it under the control of the cluster
setting.

Informs #36262
Informs #16624

Release note (ops change): The cluster setting
kv.raft_log.loosely_coupled_truncation.enabled can be used
to disable loosely coupled truncation.

@cockroach-teamcity
Member

This change is Reviewable

Collaborator Author

@sumeerbhola sumeerbhola left a comment


I haven't run or added any tests -- looking for feedback on the approach here.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @nvanbenschoten, and @tbg)

Member

@tbg tbg left a comment


A few separate thoughts for discussion. I've loosely familiarized myself with the diff without going into details. I am still at the stage where I'm trying to understand what our "minimal achievement" is that we need to have in order to continue on the separate raft log next cycle, and am looking for things we can simplify as the branch cut is approaching fast and I am fairly complexity averse at this point.


re: this comment

We do want to continue to have the raft leader guide the truncation
since we do not want either leader or followers to over-truncate,
given our desire to serve snapshots from any replica. In the loosely
coupled approach implemented here, the truncation request that flows
through raft serves as an upper bound on what can be truncated.

I'm not sure this is true. While for #42491 we will indeed have followers send snapshots, this doesn't mean that the raft leader has to guide the truncation. Instead, it means that the "pending snapshot inflight" protection is important. While a snapshot is in flight (from any replica), the leader should not truncate its local log in a way that will prevent the recipient of the snapshot from catching up. In other words, if a snapshot at index N is in flight, the leader should protect log index N. This doesn't mean that the leader has to coordinate truncation! Instead, it means that the leader has to coordinate snapshots, which (to the best of my knowledge) is the plan anyway. The way I think this could work (but I've missed the last meeting about this with @amygao9 and @aayushshah15 and I'll possibly miss the next one too, unfortunately, so maybe either of you could chime in here) is that the leaseholder reaches out to the follower, the follower instantiates a snapshot, sends the index back to the leaseholder, starts streaming the snapshot and returns the error (if any) to the leaseholder (all over the same, streaming, RPC). The leaseholder can thus keep this map up to date much like it does today.

This isn't to say that I disagree with keeping the leader-mediated approach in this PR. I am all for making the smallest possible change at this point.


I would hope that we can stop tracking the exact size of the log at some point. This was introduced when we still had to include the raft log in snapshots, so there was a significant risk to letting the log grow large. Yet, tracking only on a best-effort basis (add up the sizes of log entries going into the log but never recompute) and resetting that on each truncation somehow wasn't good enough; I'm not quite sure why, but I think we ended up in overload regimes where the raft leader/lease would flap around all of the time, which somehow reset this counting (these are faint, possibly incorrect, memories). So, for this PR, it's probably ok if we regress a little bit. A very large raft log poses only two risks that I can think of:

  • can lead to out-of-disk scenarios (or bad allocator decisions should the allocator decide to care about disk usage)
  • can lead to followers catching up across a very long raft log rather than more efficiently via a snapshot.

For the second bullet, it's not really the size of the raft log that matters, but the length, and this is something we do know precisely at all times.


You try to handle the case in which a follower restarts and loses its pending truncations. It's good to clear out the raft log for a quiesced follower, but having followers take action in the raft log queue seems to be moving us in the wrong direction (we do need to communicate some constraints to the raft leader due to pending snapshots and follower progress, but other than that replicas should feel free to truncate their logs as they deem useful in a purely local way). Instead, could we, eventually, "simply" have followers clear out all entries <= appliedState on server start-up as well as when they quiesce? This doesn't even need to serialize with the raft handling loop as long as the truncations are atomic on disk and respect durability of the state machine. Of course this isn't work that should happen in this PR, but I wonder if we can avoid pulling the raft log queue into things here, and instead do the truncation on startup. This would also inspire a first step towards encapsulating the truncation logic a bit better (pulling it out of the bowels of below-raft Replica and instead calling into it from there).


It would help if we clearly spelled out what the "must-have" for this work is. I think you're trying to unblock the separate raft log by making sure we can introduce the replicas storage "soon", but if there is a cluster setting, are we really saving any time? I assume we are if we are blocking the cluster upgrade on having the setting disabled and flushing all raft logs once (ignoring possibly annoying theoretical problems with a "slow" raft log queue enacting the old behavior after the fact). But I still wonder if this wouldn't all work the same if we went to town on the raft truncations in early 22.2 instead of now.


Might be worth talking through some of this stuff in person next week.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @nvanbenschoten)

@sumeerbhola
Collaborator Author

Thanks for looking at this!
Let's discuss risk in the in-person meeting (I have a very different opinion of course :) -- in addition to the cluster setting to preserve the coupled behavior, I also tried to encapsulate most of the new code so that it can be well tested.

I've added some technical responses below, which we can further discuss in that meeting.

Regarding the minimal work needed for v22.1 for ReplicasStorage to be real for v22.2:

  • ReplicasStorage cannot allow for truncation to happen in the current coupled manner. So even the enableLooselyCoupledTruncation cluster setting will need to be ripped out.
  • A v22.2 is not harming others by making its own choice to loosely couple:
    • As a follower: However, if someone rolls that node back to v22.1 then there could have been queued pending truncations that didn't get applied in the v22.2 node. What will make the v22.1 node do the truncation? Of course if the leader proposes another truncation we have no problem. But the quiescent case is not handled in v22.1 without the raftLogQueue change in this PR.
    • As a leader: It is proposing to followers as before. It doesn't care if the follower is v22.1 with coupled truncation.
  • The lack of ExpectedFirstIndex was already a problem. And now that it is possible to have pending truncations below raft in a leader, and have the leader propose another truncation while something is pending, it smells even more. There are two cases to consider with an unchanged v22.1:
    • v22.2 follower is receiving these truncations and queueing them: The ExpectedFirstIndex is not set. Actually, sometimes it is set and sometimes it is not (as the leader moves around between different version nodes). When it has an ExpectedFirstIndex the v22.2 follower can appropriately validate and set trusted to false. When it doesn't, it could assume these are back-to-back and non-overlapping. This is what we are doing in this PR too. I was hoping to rip out the ability to handle both cases in v22.2, but it isn't a big deal if that code is in v22.2.
    • v22.2 leader: In addition to the previous bullet, it will be setting ExpectedFirstIndex. A v22.1 follower will be oblivious to it, which is fine.

... In other words, if a snapshot at index N is in flight, the leader should protect log index N. This doesn't mean that the leader has to coordinate truncation! ...

I don't think we are disagreeing. Or you are making a subtle point that I am unable to yet understand.
Since the leader is the central source of knowledge about what needs to be protected (as it is coordinating both leader and follower snapshots) it is the natural place to coordinate the truncation upper bound.

I would hope that we can stop tracking the exact size of the log at some point. This was introduced when we still had to include the raft log in snapshots, so there was a significant risk to letting the log grow large. Yet, tracking only on a best-effort basis (add up the sizes of log entries going into the log but never recompute) and resetting that on each truncation somehow wasn't good enough

I am not quite convinced that we shouldn't track the log size -- it is useful for observability metrics etc., and it seems wrong to carefully maintain MVCCStats (I realize those also help for replica divergence which is not relevant here), and not raft log size.
And regarding how to maintain the log size: if there is a write heavy raft group, and where we are not truncating everything (due to the need to protect a number of entries), recomputing at each replica (on truncation) by reading all the remaining entries seems expensive. I retained the code for the delta computation at the leaseholder because it seems a simple enough optimization (and given that we are adding ExpectedFirstIndex, which mostly fixes its inaccuracy).

Instead, could we, eventually, "simply" have followers clear out all entries <= appliedState on server start-up as well when they quiesce?

I was simply trying to be extra careful here in not over-truncating. Which is why it is conditioned on quiescence and the periodic scan that adds to the raftLogQueue -- the assumption here was that by then the quiescence state of the replica would be accurate. My worry here was that maybe quiescence could be a false positive when the node has just started. I understand little about how the !quiescent => quiescent transition happens, so this could be wrong-headed thinking on my part.

Regarding "clear out all entries <= appliedState", the code here is explicitly trying to clear all the raft log entries and not only up to RangeAppliedState.RaftAppliedIndex. This relies for correctness on the fact that such clearing always goes through the pending truncation queue, which ensures the RaftAppliedIndex is durable up to the last raft log entry.

Member

@tbg tbg left a comment


For anyone following this, Sumeer and I met yesterday to discuss. We broadly agreed (but see my conclusion at the end of this comment...) to continue with the PR, perhaps cutting out the attempts to truncate the raft log on quiesced followers because it adds a fair bit of complexity and we don't think we really needed it.

@sumeerbhola gave this another pass. I wasn't 100% sure what you were looking for at this point (should've asked yesterday...) since the code is still pretty unkempt (and so also hard to review). But I think I've developed some opinions, which are in comments below but also summarized here:

  1. it's time to move all code to its proper home, give extensive comments, and optimize for readability/reviewability. (Moving it later means a wasted detailed review pass, let's get it out of the way now while I haven't really taken too close a look line-by-line).
  2. we should not do "apply-style" work outside of the replica but delegate that to the main replica raft handling loop (by invoking it with appropriate parameters). This is a new concept; we haven't done that before. But it should be relatively straightforward; snapshot application comes very close to it conceptually.

But, taking a step back - overall, I am starting to get worried that we won't succeed in landing a clean PR that KV/Repl can be completely happy with with how much time we have left until March 15th, when stability starts (and we don't want feature work to leak into the stability period). In addition to this, I will be on vacation Mar 8-15, and out on 2/25 as well as 2/28; plus there is an R&R day on 3/4. Charitably including today, we're left with 11 working days, which includes Fridays. That is too close for comfort, especially with both of us having other duties, some racing the same Mar 15 cutoff. Apologies for not having done this math yesterday, it would've preserved more of my initial concern around the size of the change at hand.

In light of this, I think we should revisit alternatives again. Yesterday we talked about performing the truncation traditionally but in a pre-apply hook, where it can target the raft engine in a separate WriteBatch (for compatibility with ReplicaStorage). Perhaps we can additionally get away with lowering the truncation index to the currently durable AppliedIndex, if necessary, which should make this "wholly correct". This seems feasible if we figure out what to do with stats and whether we're okay not truncating by quite as much as the leader wanted (risking there not being another truncation). I would much rather take my chances with that than have no other option than to land the present PR, as we can run experiments, and experience supports the claim that a bit of raft log is not something that customers are typically concerned with (they are concerned with giant raft logs that run them out of space, but that is not the issue we'd be creating here). As for the stats, it might be a little tricky. If we're under-truncating every time, we don't want to mark the stats as wrong every time. We also don't want to recompute in the apply loop every time. We could sidestep the problem by not keeping the stats on followers and by making sure the leader only ever chooses locally-durable indexes.

I think there's something there we should explore and, given the urgency, probably in-"person".

Reviewed 2 of 16 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @nvanbenschoten, and @sumeerbhola)


-- commits, line 65 at r1:

Touches

Also update the issue with a short synopsis of changes made here (in particular the bit that we're conservative here and really not trying to fundamentally change how it works).


pkg/kv/kvserver/raft_log_queue.go, line 39 at r1 (raw file):

// The safety requirement for truncation is that the entries being truncated
// are already durably applied to the state machine. Initialized replicas may

Might want to spell out in detail why: after a truncation, the only remaining source of the data in the entries is the state machine (which reflects a prefix of the log), so if we ever truncate a log entry that is not durably applied to the state machine, we risk losing the information in said log entries, which could lead to a scenario in which we violate Raft's leader completeness property: the replica may become leader and be asked for log entries it can no longer provide (neither directly nor through a snapshot).
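
Roughly (sketch with assumed names, not this PR's actual code), the invariant is just:

// truncationIsSafe: entries at or below truncIndex may be discarded only once
// the state machine has durably applied them; otherwise a crash could leave a
// gap between the durably applied state and the first surviving log entry.
func truncationIsSafe(truncIndex, durablyAppliedIndex uint64) bool {
	return truncIndex <= durablyAppliedIndex
}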


pkg/kv/kvserver/raft_log_queue.go, line 47 at r1 (raw file):

// becomes possible.
//
// An attempt is made to add a replica to the queue under two situations:

Not going to read this one too closely yet because - I think - we've agreed to rip out the second path into the raft log queue (taken on followers) for now.


pkg/kv/kvserver/raft_log_queue.go, line 89 at r1 (raw file):

// (2) are enabled with cluster version LooselyCoupledRaftLogTruncation.

// This is a temporary cluster setting that we will remove after one release

Where did we leave things here in terms of timing?

  • 22.1: has the setting
  • 22.2: setting is baked in as "true", i.e. if you ran with false in the old binary you'll still get the new behavior now (and it is always on, even in mixed cluster). Upon rolling back to 22.1, the setting takes hold again, so you might miss some truncations until the setting is updated.

Just to confirm, someone who "permanently" runs a 22.1/22.2 mixed cluster will be fine, right? Old nodes will truncate directly, new nodes will stash the suggestions and execute them based on durability, right?

Write out some of these things on a comment here so that it's clear to everyone what the plan is.


pkg/kv/kvserver/raft_log_queue.go, line 99 at r1 (raw file):

func init() {
	enableLooselyCoupledTruncation.SetVisibility(settings.Reserved)

nit: can also do this

var enableLooselyCoupledTruncation = func() *settings.BoolSetting {
	s := settings.RegisterBoolSetting(
		settings.SystemOnly,
		"kv.raft_log.enable_loosely_coupled_truncation",
		"set to true to loosely couple the raft log truncation.",
		true)
	s.SetVisibility(settings.Reserved)
	return s
}()

I would probably prefer this because it keeps everything in one place but your call.


pkg/kv/kvserver/raft_log_queue.go, line 879 at r1 (raw file):

func (t *truncatorForReplicasInStore) addPendingTruncation(
	ctx context.Context,
	r *Replica,

As you're working on this stuff, can you make sure that we don't depend on anything directly but only through a narrow abstraction? Also move everything into a new file, say raft_log_truncator.go.


pkg/kv/kvserver/raft_log_queue.go, line 884 at r1 (raw file):

	raftLogDelta int64,
	deltaIncludesSideloaded bool,
) {

There sure is a lot of brainmelt spaghetti code in this method. I know this PR is basically in prototype status but since I am reviewing it now try to clean it up and significantly expand commentary.


pkg/kv/kvserver/raft_log_queue.go, line 972 at r1 (raw file):

		t.mu.Lock()
		defer t.mu.Unlock()
		n := len(t.mu.ranges)

It's an anti-pattern to hold on to *Replica objects (which can be replicagc'ed and instantiated under a new ReplicaID, etc). It is likely better to remember RangeIDs and then retrieve them from the store with the relevant mutexes held so that we serialize cleanly with replicaGC. I see that you're doing basically the right things below so I don't think it is far off, but let's try not to hold on to *Replica objects while queued either way.
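
For illustration, something along these lines (names are made up, and this assumes the usual sync and roachpb imports):

// truncatorQueue remembers only RangeIDs while queued; the replica is
// re-resolved from the store at enactment time, so a queued entry never pins
// a *Replica that may have been GC'ed or replaced in the meantime.
type truncatorQueue struct {
	mu     sync.Mutex
	ranges map[roachpb.RangeID]struct{}
}

func (q *truncatorQueue) add(id roachpb.RangeID) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if q.ranges == nil {
		q.ranges = map[roachpb.RangeID]struct{}{}
	}
	q.ranges[id] = struct{}{}
}

func (q *truncatorQueue) drain(visit func(roachpb.RangeID)) {
	q.mu.Lock()
	ids := make([]roachpb.RangeID, 0, len(q.ranges))
	for id := range q.ranges {
		ids = append(ids, id)
	}
	q.ranges = nil
	q.mu.Unlock()
	for _, id := range ids {
		// visit is expected to look the replica up in the store under the
		// relevant mutexes and check the destroy status before acting.
		visit(id)
	}
}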


pkg/kv/kvserver/raft_log_queue.go, line 976 at r1 (raw file):

			return
		}
		ranges = make([]rangeAndReplicaID, n)

Why not "just" make([]rangeAndReplicaID, 0, n) and then use append in the loop? The simpler the better. I'd rather review less index arithmetic than more, and this isn't a very hot code path.


pkg/kv/kvserver/raft_log_queue.go, line 1005 at r1 (raw file):

		r, err := t.store.GetReplica(repl.rangeID)
		if err != nil || r == nil || r.mu.replicaID != repl.replicaID {
			// Not found or not the same replica.

It should be safe if it is not the same replica. If that is not safe, we should understand why. Not holding on to a replica while queued is a good way to avoid questions like this. When the replica switches out (i.e. replicaGC & replaced by new replica or the like) the truncation may no longer apply, but the code should do a "right thing" regardless.

It is sort of unfortunate that we're really hacking into the internals here - holding raftMu, etc, but I see how this is necessary. Try to keep it to a minimum though, for example, instead of reaching into r.mu.state to update the truncated index, see if you can call the method that applies side effects during raft application with an updated state.


pkg/kv/kvserver/raft_log_queue.go, line 1020 at r1 (raw file):

		truncState := *r.mu.state.TruncatedState
		for !r.mu.pendingLogTruncations.empty() {

Isn't this going to be an infinite loop sometimes? I see what you're trying to do (discard all of the truncations that don't affect any log entries at this point) but I can't convince myself that this code works as is.


pkg/kv/kvserver/raft_log_queue.go, line 1021 at r1 (raw file):

		truncState := *r.mu.state.TruncatedState
		for !r.mu.pendingLogTruncations.empty() {
			pendingTrunc := r.mu.pendingLogTruncations.front()

maybe front() and pop() could be named better. It's not so clear which ones come first and how they're sorted. I know there are only two, but still.


pkg/kv/kvserver/raft_log_queue.go, line 1026 at r1 (raw file):

			}
		}
		if r.mu.pendingLogTruncations.empty() {

Isn't this only reached once empty() returns true?


pkg/kv/kvserver/raft_log_queue.go, line 1040 at r1 (raw file):

			}
			r.mu.Unlock()
			r.raftMu.Unlock()

Randomly unlocking raftMu here seems wrong.


pkg/kv/kvserver/raft_log_queue.go, line 1051 at r1 (raw file):

		}
		enactIndex := -1
		r.mu.pendingLogTruncations.iterate(func(trunc pendingTruncation) {

Same comment as above, this brainmelt code should be cleanly contained somewhere, with nice comments and unit tests, and not littering raft_log_queue.go. (I know you didn't intend to leave as is, just mentioning it anyway).

In particular, there is something about merging truncations and reasoning about their interplay (also in terms of stats) that could likely be encapsulated better.


pkg/kv/kvserver/raft_log_queue.go, line 1055 at r1 (raw file):

				return
			}
			enactIndex++

That you're tracking the index here seems to point at a need for a better version of iterate.


pkg/kv/kvserver/raft_log_queue.go, line 1057 at r1 (raw file):

			enactIndex++
		})
		if enactIndex > 0 {

I'd prefer an early-bail instead of a giant conditional, if possible.


pkg/kv/kvserver/raft_log_queue.go, line 1069 at r1 (raw file):

			}
			if !apply {
				panic("unexpected !apply returned from handleTruncatedStateBelowRaftPreApply")

We need at least an attempt at proper error handling here, I don't want us to regress in terms of #75944. It's also a good question of how this will interact with the code here. Will "something" have to inspect all of the errors coming out of a replica while raftMu is held? It would be a shame.

I think the entire code here should be delegated to the raft handling goroutine, where in addition to an inSnap we pass "something" related to truncation.

So something like

func (r *Replica) handleRaftReadyRaftMuLocked(
	ctx context.Context, inSnap IncomingSnapshot,
) (handleRaftReadyStats, string, error) {

which gives us a chance to centralize error handling and piggy-back on the existing mechanisms to apply side effects to the in-memory state (i.e. updating the in-mem truncatedstate).

@erikgrinaker
Contributor

Drive-by comment since I unfortunately don't have bandwidth to follow this work closely.

we should not do "apply-style" work outside of the replica but delegate that to the main replica raft handling loop

I think it's an antipattern to do state machine transitions in the main Raft loop -- there's conceptually no need for it, and it can block other Raft processing causing spurious elections and other issues. See also #17500.

We've discussed this briefly before, and maybe I'm misremembering, but I thought we wanted to ultimately separate Raft processing from state machine application? This comment seems to imply the opposite?

@tbg
Member

tbg commented Feb 15, 2022

There's the short game and the long game. Right now, if you do raft things you have to hold raftMu, and when you change the state you need to synchronize with the in-mem state. The way we do this today is in the raft apply loop. Yes, we ultimately want to decouple those, but it's not going to happen in this PR. My opinion is that we should not be tacking on a rather ugly second moving part that completely serializes with the raft apply loop (holds raftMu throughout) but has to reinvent some of the bits that already exist (and will remain a weird one-off anyway) because it's error-prone and also doesn't get us closer to the goal. We should keep all raft-related processing on the raft goroutine until the day that we are ready to piece them apart, in which case we will separate all raft processing from the state machine application.

Collaborator Author

@sumeerbhola sumeerbhola left a comment


Thanks for the detailed comments!

In light of this, I think we should revisit alternatives again.

I'm going to make an iteration over this code first. It wasn't my intention to produce anything unkempt here (hence the extensive code comment too), even though the PR is marked draft. But hopefully another pass by me will reduce the irritants. Also, the intention was to localize most changes to the new truncator abstraction so it could be tested separately.

we should not do "apply-style" work outside of the replica but delegate that to the main replica raft handling loop (by invoking it with appropriate parameters).

I am confused by this. There is also a more detailed comment on this below. We are doing raft log truncation, and not doing application, so why is this "apply-style" work? If this is about reusing handleRaftReadyRaftMuLocked or the raft goroutine see my more detailed response below.
We also have very little to synchronize with wrt in-memory state -- only the Replica stats and queued truncations need to be updated.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @nvanbenschoten, @sumeerbhola, and @tbg)


pkg/kv/kvserver/raft_log_queue.go, line 89 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Where did we leave things here in terms of timing?

  • 22.1: has the setting
  • 22.2: setting is baked in as "true", i.e. if you ran with false in the old binary you'll still get the new behavior now (and it is always on, even in mixed cluster). Upon rolling back to 22.1, the setting takes hold again, so you might miss some truncations until the setting is updated.

Just to confirm, someone who "permanently" runs a 22.1/22.2 mixed cluster will be fine, right? Old nodes will truncate directly, new nodes will stash the suggestions and execute them based on durability, right?

Write out some of these things on a comment here so that it's clear to everyone what the plan is.

Yes to all.
Will do regarding comment.


pkg/kv/kvserver/raft_log_queue.go, line 244 at r1 (raw file):

	firstIndex, err := r.raftFirstIndexLocked()
	firstIndex = r.mu.pendingLogTruncations.computePostTruncFirstIndex(firstIndex)

Note that almost all the changes to raftLogQueue other than these calls to computePostTrunc*, and the later setting of ExpectedFirstIndex will go away.


pkg/kv/kvserver/raft_log_queue.go, line 884 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

There sure is a lot of brainmelt spaghetti code in this method. I know this PR is basically in prototype status but since I am reviewing it now try to clean it up and significantly expand commentary.

I'll add comments.
btw, even though the PR is marked "draft" it was my intention to produce a prototype here, which is why there are comments in various places (though not in this method).


pkg/kv/kvserver/raft_log_queue.go, line 972 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

It's an anti-pattern to hold on to *Replica objects (which can be replicagc'ed and instantiated under a new ReplicaID, etc). It is likely better to remember RangeIDs and then retrieve them from the store with the relevant mutexes held so that we serialize cleanly with replicaGC. I see that you're doing basically the right things below so I don't think it is far off, but let's try not to hold on to *Replica objects while queued either way.

Ack. The truncatorForReplicasInStore holds onto the (RangeID, ReplicaID) pair, so I think it is "correct" in that way.


pkg/kv/kvserver/raft_log_queue.go, line 1005 at r1 (raw file):

Not holding on to a replica while queued is a good way to avoid questions like this

The code is not holding onto a *Replica while queued. It is holding onto a (RangeID, ReplicaID) pair. If you are saying we shouldn't bother holding onto a ReplicaID either, sure there is nothing necessary for correctness.


pkg/kv/kvserver/raft_log_queue.go, line 1069 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

We need at least an attempt at proper error handling here, I don't want us to regress in terms of #75944. It's also a good question of how this will interact with the code here. Will "something" have to inspect all of the errors coming out of a replica while raftMu is held? It would be a shame.

I think the entire code here should be delegated to the raft handling goroutine, where in addition to an inSnap we pass "something" related to truncation.

So something like

func (r *Replica) handleRaftReadyRaftMuLocked(
	ctx context.Context, inSnap IncomingSnapshot,
) (handleRaftReadyStats, string, error) {

which gives us a chance to centralize error handling and piggy-back on the existing mechanisms to apply side effects to the in-memory state (i.e. updating the in-mem truncatedstate).

  • It was an oversight on my part to panic here. I didn't carefully read the implementation and thought that the only reason to return false was a simple invariant violation.
  • I don't understand how we should delegate this to the raft handling goroutine. We already call handleRaftReadyRaftMuLocked outside of the raft goroutine when applying a snapshot, so we haven't centralized that either. To centralize it we would need to somehow hook up into the raftScheduler and processReady. That seems a much wider change.
  • Are you suggesting that we do this via another parameter to handleRaftReadyRaftMuLocked to preserve an illusion of centralization? At least there are some shared code paths in the implementation of handleRaftReadyRaftMuLocked for both of its current callers. However, the code below is nothing like what is in handleRaftReadyRaftMuLocked. It borrows more heavily from code that sits in state machine application (which will be removed in v22.2), since we have been doing raft log truncation as part of state machine application.

@sumeerbhola sumeerbhola force-pushed the raft_trunc2 branch 2 times, most recently from fa0fa5b to 3b533bc Compare February 15, 2022 18:53
@sumeerbhola sumeerbhola marked this pull request as ready for review February 15, 2022 18:56
@sumeerbhola sumeerbhola requested a review from a team as a code owner February 15, 2022 18:56
Collaborator Author

@sumeerbhola sumeerbhola left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @nvanbenschoten, and @tbg)


-- commits, line 65 at r1:

Previously, tbg (Tobias Grieger) wrote…

Touches

Also update the issue with a short synopsis of changes made here (in particular the bit that we're conservative here and really not trying to fundamentally change how it works).

I'd already changed to "Informs", so I didn't change to "Touches". I'll update the issue with a synopsis once this PR merges.


pkg/kv/kvserver/raft_log_queue.go, line 39 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Might want to spell out in detail why: after a truncation, the only remaining source of the data in the entries is the state machine (which reflects a prefix of the log), so if we ever truncate a log entry that is not durably applied to the state machine, we risk losing the information in said log entries, which could lead to a scenario in which we violate Raft's leader completeness property: the replica may become leader and be asked for log entries it can no longer provide (neither directly nor through a snapshot).

Done. Instead of phrasing it as the replica not having entries to provide to others, I phrased it as a gap between the last entry applied to the state machine and the first entry in the untruncated log -- which means even this replica can't do any more application.


pkg/kv/kvserver/raft_log_queue.go, line 47 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Not going to read this one too closely yet because - I think - we've agreed to rip out the second path into the raft log queue (taken on followers) for now.

These two situations are unchanged. They were true even before this PR.
I've only removed the phrase "latest applied state was not durable" in the parenthetical comment.


pkg/kv/kvserver/raft_log_queue.go, line 76 at r1 (raw file):

// truncation can be done independently at each replica when the corresponding
// RaftAppliedIndex is durable. Note that since raft state (including truncated
// state) is not part of the state machine, this loose coordination is fine.

I've updated this text to describe what happens when we are not loosely coupled and why that is considered correct (and added a reference to #38566, noting that we are not fully certain about its correctness).


pkg/kv/kvserver/raft_log_queue.go, line 78 at r1 (raw file):

// state) is not part of the state machine, this loose coordination is fine.
//
// 2. Proposes "local" truncation

I've removed this item in this form, and added some commentary about the behavior when a node restarts, in that the pending truncations are lost.


pkg/kv/kvserver/raft_log_queue.go, line 89 at r1 (raw file):

Previously, sumeerbhola wrote…

Yes to all.
Will do regarding comment.

Added a comment.


pkg/kv/kvserver/raft_log_queue.go, line 99 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

nit: can also do this

var enableLooselyCoupledTruncation = func() *settings.BoolSetting {
	s := settings.RegisterBoolSetting(
		settings.SystemOnly,
		"kv.raft_log.enable_loosely_coupled_truncation",
		"set to true to loosely couple the raft log truncation.",
		true)
	s.SetVisibility(settings.Reserved)
	return s
}()

I would probably prefer this because it keeps everything in one place but your call.

Good idea. Done.


pkg/kv/kvserver/raft_log_queue.go, line 848 at r1 (raw file):

}

type rangeAndReplicaID struct {

this is gone, since we now only key by RangeID.


pkg/kv/kvserver/raft_log_queue.go, line 879 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

As you're working on this stuff, can you make sure that we don't depend on anything directly but only through a narrow abstraction? Also move everything into a new file, say raft_log_truncator.go.

Done.
The narrow abstractions are storeForTruncator and replicaForTruncator. They should allow for testing all the code in raft_log_truncator.go without instantiating a Replica or Store.


pkg/kv/kvserver/raft_log_queue.go, line 884 at r1 (raw file):

Previously, sumeerbhola wrote…

I'll add comments.
btw, even though the PR is marked "draft" it was my intention to produce a prototype here, which is why there are comments in various places (though not in this method).

(meant to say "not my intention")

  • Added more comments.
  • Fixed the iteration to utilize pendingLogTruncations.iterate.

pkg/kv/kvserver/raft_log_queue.go, line 976 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Why not "just" make([]rangeAndReplicaID, 0, n) and then use append in the loop? The simpler the better. I'd rather review less index arithmetic than more, and this isn't a very hot code path.

Done


pkg/kv/kvserver/raft_log_queue.go, line 1005 at r1 (raw file):
The key in the map no longer includes the ReplicaID.

instead of reaching into r.mu.state to update the truncated index, see if you can call the method that applies side effects during raft application with an updated state.

This reaching-into is now done in a narrower and cleaner way with the replicaForTruncator interface.


pkg/kv/kvserver/raft_log_queue.go, line 1020 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Isn't this going to be an infinite loop sometimes? I see what you're trying to do (discard all of the truncations that don't affect any log entries at this point) but I can't convince myself that this code works as is.

Good catch. This was missing a break.
A unit test should uncover such bugs.


pkg/kv/kvserver/raft_log_queue.go, line 1021 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

maybe front() and pop() could be named better. It's not so clear which ones come first and how they're sorted. I know there are only two, but still.

I've added code comments where they are declared. I am also open to naming suggestions.


pkg/kv/kvserver/raft_log_queue.go, line 1026 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Isn't this only reached once empty() returns true?

Added a comment. The aforementioned bug was causing confusion. It is reached once we've removed all the useless/noop truncations.


pkg/kv/kvserver/raft_log_queue.go, line 1040 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Randomly unlocking raftMu here seems wrong.

I've added a comment for this func and renamed it to popAllOnError.


pkg/kv/kvserver/raft_log_queue.go, line 1051 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Same comment as above, this brainmelt code should be cleanly contained somewhere, with nice comments and unit tests, and not littering raft_log_queue.go. (I know you didn't intend to leave as is, just mentioning it anyway).

In particular, there is something about merging truncations and reasoning about their interplay (also in terms of stats) that could likely be encapsulated better.

I am hopeful that this revised version will not be a brainmelt. I am happy for further simplification suggestions.
Also, now that this is abstracted from Replica, it will be easy to unit test thoroughly.

Regarding merging truncations, that happens at queueing time. Here there is (a) the persistent truncation, for which we just need to apply the last enactable truncation (there is now a code comment that explains this), and (b) updating the replica state, which involves applying each.

I mistakenly did not use handleTruncatedStateResult to update the RaftTruncatedState and its side-effects. This has been rectified.
Also, the legacy code that calls handleRaftLogDeltaResult now forwards to setTruncationDeltaAndTrusted.


pkg/kv/kvserver/raft_log_queue.go, line 1055 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

That you're tracking the index here seems to point at a need for a better version of iterate.

iterate now passes the index.


pkg/kv/kvserver/raft_log_queue.go, line 1057 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

I'd prefer an early-bail instead of a giant conditional, if possible.

Done


pkg/kv/kvserver/raft_log_queue.go, line 1069 at r1 (raw file):

It was an oversight on my part to panic here.

I spoke too soon. The code was panicking if there was no error and !apply was returned. This is an invariant violation since we've already eliminated the noop truncations. But I've changed it to a log statement to be more graceful in case someone changes the implementation of handleTruncatedStateBelowRaftPreApply.

Collaborator Author

@sumeerbhola sumeerbhola left a comment


This is cleaned up and ready for a look.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @nvanbenschoten, and @tbg)


pkg/kv/kvserver/replica_application_state_machine.go, line 763 at r2 (raw file):

	if res.State != nil && res.State.TruncatedState != nil {
		var err error

The changes in this file are not significant, but I could use some guidance on how to unit test them.

Member

@tbg tbg left a comment


That was fast, thanks for the quick turnaround! Thanks for pulling out the interface. I am not happy with it yet but now it's clearer what we're up against and I am able to make specific suggestions. I have some meetings coming up so flushing what I have.

I am confused by this. There is also a more detailed comment on this below. We are doing raft log truncation, and not doing application, so why is this "apply-style" work? If this is about reusing handleRaftReadyRaftMuLocked or the raft goroutine see my more detailed response below.

Sorry, it's not about this being "apply" work, but "raft" work. "Raft" work happens in handleRaftReadyRaftMuLocked. This is where entries get written to the log. It should also be where they get removed from the log. I shouldn't have uttered the word "apply" which has nothing to do with this. We also happen to apply entries in that method, but we all agree that this eventually should be split off, where handleRaftReadyRaftMuLocked does little more than notify an apply subsystem that there is work to do.
I'm still making up my mind on how hard to push for this item, and will get a second opinion (Erik) as well.

In either case, looking at this PR now (thanks for the cleanups!) my worry from yesterday has abated. I still have to review the nitty-gritty but if we keep up the rate of turnaround and nothing unexpected happens, we should comfortably get that in.

Reviewed 4 of 16 files at r1, 2 of 8 files at r2.
Dismissed @sumeerbhola from 3 discussions.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @nvanbenschoten, @sumeerbhola, and @tbg)


pkg/kv/kvserver/raft_log_queue.go, line 244 at r1 (raw file):

Previously, sumeerbhola wrote…

Note that almost all the changes to raftLogQueue other than these calls to computePostTrunc*, and the later setting of ExpectedFirstIndex will go away.

There aren't any other changes, right? The cluster version gate will go away but that's it.


pkg/kv/kvserver/raft_log_queue.go, line 252 at r2 (raw file):

	// NB: we need an exclusive lock due to grabbing the first index.
	r.mu.Lock()
	raftLogSize := r.mu.pendingLogTruncations.computePostTruncLogSize(r.mu.raftLogSize)

Could you update the comment on newTruncateDecision to give an idea of how pending truncations interact with new truncate decisions? I think what we're doing is to pretend that all pending truncations have happened, and then computing what the next truncation would be.

computePostTrunc* also need comments.
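
Something like the following is what I have in mind (rough sketch, assuming a pendingTruncation with index and logDeltaBytes fields; not necessarily how the PR implements it):

// The truncate decision pretends that all pending truncations have already
// been enacted, by folding their effects into the size and first index it
// reasons about.
func computePostTruncLogSize(raftLogSize int64, pending []pendingTruncation) int64 {
	for _, t := range pending {
		raftLogSize += t.logDeltaBytes // deltas are negative
	}
	if raftLogSize < 0 {
		raftLogSize = 0
	}
	return raftLogSize
}

func computePostTruncFirstIndex(firstIndex uint64, pending []pendingTruncation) uint64 {
	for _, t := range pending {
		if t.index+1 > firstIndex {
			firstIndex = t.index + 1
		}
	}
	return firstIndex
}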


pkg/kv/kvserver/raft_log_queue.go, line 280 at r2 (raw file):

	// those pending truncations would cause a transition from trusted =>
	// !trusted. This is done since we don't want to trigger a recomputation of
	// the raft log size while we still have pending truncations.

Not triggering recomputations prematurely makes sense; can you elaborate on the consequences of this? There may always be pending truncations, at least in a contrived world, right?


pkg/kv/kvserver/raft_log_truncator.go, line 67 at r2 (raw file):
Maybe there's a way to clarify what "front" means in this, it's the least aggressive truncation, right? So maybe

// Returns the front (i.e. least aggressive truncation) of the pending truncations queue, without removing the element

, and similar for pop.


pkg/kv/kvserver/raft_log_truncator.go, line 167 at r2 (raw file):

//
// We acknowledge that this interface may seem peculiar -- this is due to the
// constraint that it is abstracting Replica.

There isn't really such a constraint, and I think it would be instructive to make the interface "what we really want it to be" (i.e. fewer random methods, ideally no need to have methods that expose locks), and for the implementation do this:

type replicaInTruncator Replica

func (ri *replicaInTruncator) SomeGoodMethod() bool {
  r := (*Replica)(ri)
  _ = r // do things with the underlying Replica here
  return true
}

The balance to strike is that there shouldn't be any nontrivial code in methods on replicaInTruncator, as then it's back to not being able to properly unit test them.

But I hope there is such a balance. I imagine the Store interface would get a method that loads the replica (by RangeID), acquires the raftMu and checks the destroy status before returning. So now the remainder of the truncator can forget about the destroy status. Both callers in this PR that load the replica immediately proceed and check destroy status under the raftMu, so this makes sense. And I don't think they can unlock raftMu and then keep using the replica; after all the replica might be destroyed in the instant after. So we can use an acquire-release pattern here:

repl, ok := storeInterface.AcquireReplica(rangeID) // repl.raftMu
if !ok { continue }
defer repl.Release() // basically unlocks raftMu

The replicaForTruncator methods that are subject to a lock can sit behind this pattern:

type lockedReplicaInTruncator replicaInTruncator // with corresponding interface lockedReplicaInTruncatorI

func (lri *lockedReplicaInTruncator) TouchSomethingThatNeedsLock() {
  lri.mu.foo = "bar"
}

func (ri *replicaInTruncator) WithLock(f func(lockedReplicaInTruncatorI)) {
  ri.mu.Lock()
  defer ri.mu.Unlock()
  f((*lockedReplicaInTruncator)(ri))
}

I'm sure there are a few complications here that I'm not seeing, but I think we can and should strive to make these interfaces pared down and pleasant to use.


pkg/kv/kvserver/replica.go, line 457 at r2 (raw file):

		// Writes also require that raftMu be held, which means one of mu or
		// raftMu is sufficient for reads.
		pendingLogTruncations pendingLogTruncations

Just to play devil's advocate here, is there any reason for this to sit under replica.mu? We always hold raftMu when reading or writing this, so why bring replica.mu into the equation? This could sit under raftMu it seems and would have a better home there.


pkg/kv/kvserver/replica.go, line 2050 at r2 (raw file):

}

func (r *Replica) setTruncationDeltaAndTrusted(deltaBytes int64, isDeltaTrusted bool) {

We definitely want to avoid breaking our naming conventions for replica methods that hold locks. My suggestion above to clean up the interfaces will achieve that too.


pkg/kv/kvserver/replica.go, line 2097 at r2 (raw file):

	_, expectedFirtIndexAccurate := r.handleTruncatedStateResult(
		ctx, trunc, expectedFirstIndexPreTruncation)
	return expectedFirtIndexAccurate

First


pkg/kv/kvserver/replica_application_state_machine.go, line 763 at r2 (raw file):

Previously, sumeerbhola wrote…

The changes in this file are not significant, but I could use some guidance on how to unit test them.

We could pull out a free-standing function that takes res, the cluster settings, b.state.TruncatedState, a state loader, a WriteBatch, the truncator, the replica through an interface. It will be work, work that's useful but I am hesitant to make you do it in the context of this PR. It would be worth-while, but not if it comes at the cost of polishing other, more important, bits of the PR. So perhaps let's hold off and add it in in a follow-up, or not at all, depending on how things pan out.


pkg/kv/kvserver/replica_application_state_machine.go, line 764 at r2 (raw file):

	if res.State != nil && res.State.TruncatedState != nil {
		var err error
		looselyCoupledTruncation := isLooselyCoupledRaftLogTruncationEnabled(ctx, b.r.ClusterSettings())

Checking the cluster version below raft is somewhat atypical. It's ok here because the effects of this command are local, but I wonder if we have to. Isn't it up to the creator of the truncation proposal to decide whether to use the loosely coupled truncations?


pkg/kv/kvserver/replica_application_state_machine.go, line 794 at r2 (raw file):

				// we step up as leader at some point in the future, we recompute
				// our numbers.
				// TODO(sumeer): this code will be deleted when there is no

❤️


pkg/kv/kvserver/replica_sideload_disk.go, line 219 at r2 (raw file):

}

// Helper for truncation or byte calculation for [from, to)

nit: .


pkg/kv/kvserver/replica_sideload_disk.go, line 254 at r2 (raw file):

	}

	if deletedAll && doTruncate {

Make sure there's a simple unit test for this. Modifying one of the existing ones should be enough.


pkg/kv/kvserver/store.go, line 714 at r2 (raw file):

	replicaGCQueue     *replicaGCQueue             // Replica GC queue
	raftLogQueue       *raftLogQueue               // Raft log truncation queue
	raftTruncator      raftLogTruncator            // Enacts truncation

Maybe

// Carries out truncations from the raft log queue when they are safe

@tbg tbg self-requested a review February 16, 2022 08:55
Member

@tbg tbg left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @nvanbenschoten, and @sumeerbhola)


pkg/kv/kvserver/raft_log_queue.go, line 1069 at r1 (raw file):

Previously, sumeerbhola wrote…

It was an oversight on my part to panic here.

I spoke too soon. The code was panicking if there was no error and !apply was returned. This is an invariant violation since we've already eliminated the noop truncations. But I've changed it to a log statement to be more graceful in case someone changes the implementation of handleTruncatedStateBelowRaftPreApply.

The concrete proposal is that the callback will become very simple: enqueue all replicas that have pending truncations (not validating for whether they can actually occur now) for a Ready check:

r.store.scheduler.EnqueueRaftReady(rangeID)

Then, in handleRaftReadyRaftMuLocked:

diff --git a/pkg/kv/kvserver/replica_raft.go b/pkg/kv/kvserver/replica_raft.go
index aabcf3af74..09af6313d6 100644
--- a/pkg/kv/kvserver/replica_raft.go
+++ b/pkg/kv/kvserver/replica_raft.go
@@ -559,6 +559,7 @@ func (r *Replica) handleRaftReadyRaftMuLocked(
 	raftLogSize := r.mu.raftLogSize
 	leaderID := r.mu.leaderID
 	lastLeaderID := leaderID
+	var pendingTruncation interface{}
 	err := r.withRaftGroupLocked(true, func(raftGroup *raft.RawNode) (bool, error) {
 		numFlushed, err := r.mu.proposalBuf.FlushLockedWithRaftGroup(ctx, raftGroup)
 		if err != nil {
@@ -567,6 +568,9 @@ func (r *Replica) handleRaftReadyRaftMuLocked(
 		if hasReady = raftGroup.HasReady(); hasReady {
 			rd = raftGroup.Ready()
 		}
+		if someCheapCondition {
+			pendingTruncation = &somethingSomething{}
+		}
 		// We unquiesce if we have a Ready (= there's work to do). We also have
 		// to unquiesce if we just flushed some proposals but there isn't a
 		// Ready, which can happen if the proposals got dropped (raft does this
@@ -591,6 +595,9 @@ func (r *Replica) handleRaftReadyRaftMuLocked(
 		const expl = "while checking raft group for Ready"
 		return stats, expl, errors.Wrap(err, expl)
 	}
+	if pendingTruncation != nil {
+		maybeExecutePendingTruncation(...)
+	}
 	if !hasReady {
 		// We must update the proposal quota even if we don't have a ready.
 		// Consider the case when our quota is of size 1 and two out of three

The main reason I'm advocating for this is because I think that way the new code will not cause any issues down the road and will most naturally lend itself to agreeing with any future refactoring we might perform. We're avoiding a one-off code path for log truncations and keeping all mutations to the raft state in one place; we are tracking the overhead of truncations and handling errors coming from them in exactly the same way as we would for any raft command (see here). We would also be able to, if we wanted to, append and truncate the log in a single batch (the append batch is created a bit further down), which might be advantageous (or not, no idea), but even if we don't do that it just seems right to have these pieces of code close to each other. It is true that in theory we could try to do some of the log truncation work completely out of band (bump the truncated state first, then delete the old entries later) but this is similar to trying to separate entry application from entry commit, i.e. something to consider later (besides, we're holding raftMu all of the time, so there is no concurrency between raft handling and truncations as is anyway). In short, I think some variant of what I'm proposing (callback triggers a raft handle loop which does the rest) makes sense with our current architecture and won't get in the way of future changes.

There is a little bit of risk to going through the scheduler (maybe in adding extra scheduling steps, we are somehow starving other raft groups of processing) so we could entertain bypassing the scheduler by instead calling store.processReady(rangeID) directly. But we should avoid blocking in the pebble durability callback, so it would really be "spawn a task that calls processReady but make sure there's only ever one task per rangeID so they don't pile up", really it starts looking a lot like rebuilding the scheduler already, and so I think we should go through the scheduler and just make sure that that's ok (as I fully expect it to be).


pkg/kv/kvserver/raft_log_queue.go, line 37 at r2 (raw file):

// Overview of Raft log truncation:

// Might want to spell out in detail why, it is because after a truncation,

I think you meant to remove this entire paragraph.

@sumeerbhola sumeerbhola force-pushed the raft_trunc2 branch 2 times, most recently from c0f06ed to 2ba1d64 Compare February 16, 2022 17:00
Copy link
Collaborator Author

@sumeerbhola sumeerbhola left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @nvanbenschoten, @sumeerbhola, and @tbg)


pkg/kv/kvserver/raft_log_queue.go, line 244 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

There aren't any other changes, right? The cluster version gate will go away but that's it.

Correct -- what is now in this file is what is remaining.


pkg/kv/kvserver/raft_log_queue.go, line 1069 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

The concrete proposal is that the callback will become very simple: enqueue all replicas that have pending truncations (not validating for whether they can actually occur now) for a Ready check:

r.store.scheduler.EnqueueRaftReady(rangeID)

Then, in handleRaftReadyRaftMuLocked:

diff --git a/pkg/kv/kvserver/replica_raft.go b/pkg/kv/kvserver/replica_raft.go
index aabcf3af74..09af6313d6 100644
--- a/pkg/kv/kvserver/replica_raft.go
+++ b/pkg/kv/kvserver/replica_raft.go
@@ -559,6 +559,7 @@ func (r *Replica) handleRaftReadyRaftMuLocked(
 	raftLogSize := r.mu.raftLogSize
 	leaderID := r.mu.leaderID
 	lastLeaderID := leaderID
+	var pendingTruncation interface{}
 	err := r.withRaftGroupLocked(true, func(raftGroup *raft.RawNode) (bool, error) {
 		numFlushed, err := r.mu.proposalBuf.FlushLockedWithRaftGroup(ctx, raftGroup)
 		if err != nil {
@@ -567,6 +568,9 @@ func (r *Replica) handleRaftReadyRaftMuLocked(
 		if hasReady = raftGroup.HasReady(); hasReady {
 			rd = raftGroup.Ready()
 		}
+		if someCheapCondition {
+			pendingTruncation = &somethingSomething{}
+		}
 		// We unquiesce if we have a Ready (= there's work to do). We also have
 		// to unquiesce if we just flushed some proposals but there isn't a
 		// Ready, which can happen if the proposals got dropped (raft does this
@@ -591,6 +595,9 @@ func (r *Replica) handleRaftReadyRaftMuLocked(
 		const expl = "while checking raft group for Ready"
 		return stats, expl, errors.Wrap(err, expl)
 	}
+	if pendingTruncation != nil {
+		maybeExecutePendingTruncation(...)
+	}
 	if !hasReady {
 		// We must update the proposal quota even if we don't have a ready.
 		// Consider the case when our quota is of size 1 and two out of three

The main reason I'm advocating for this is because I think that way the new code will not cause any issues down the road and will most naturally lend itself to agreeing with any future refactoring we might perform. We're avoiding a one-off code path for log truncations and keeping all mutations to the raft state in one place; we are tracking the overhead of truncations and handling errors coming from them in exactly the same way as we would for any raft command (see here). We would also be able to, if we wanted to, append and truncate the log in a single batch (the append batch is created a bit further down), which might be advantageous (or not, no idea), but even if we don't do that it just seems right to have these pieces of code close to each other. It is true that in theory we could try to do some of the log truncation work completely out of band (bump the truncated state first, then delete the old entries later) but this is similar to trying to separate entry application from entry commit, i.e. something to consider later (besides, we're holding raftMu all of the time, so there is no concurrency between raft handling and truncations as is anyway). In short, I think some variant of what I'm proposing (callback triggers a raft handle loop which does the rest) makes sense with our current architecture and won't get in the way of future changes.

There is a little bit of risk to going through the scheduler (maybe by adding extra scheduling steps we somehow starve other raft groups of processing), so we could entertain bypassing the scheduler by instead calling store.processReady(rangeID) directly. But we should avoid blocking in the pebble durability callback, so it would really have to be "spawn a task that calls processReady, but make sure there's only ever one task per rangeID so they don't pile up", which starts to look a lot like rebuilding the scheduler, so I think we should go through the scheduler and just make sure that that's ok (as I fully expect it to be).

Regarding not blocking in the Pebble durability callback, I was planning to add something in the truncator which starts a goroutine if none is running (rough sketch below).
Unlike the raft scheduler, there would be one goroutine doing the truncation across all replicas, which has pros and cons.
Pros: we share the same durability reader across the replicas; it is clear that now is the time to check the durable RaftAppliedIndex; there is at most one goroutine doing this, so the resource consumption is constrained.
Con: there is at most one goroutine doing this, and what if it falls behind.

If we move this to the scheduler, we need some way for the ready loop to know that a durability callback has not happened, so it shouldn't bother doing a read to try to dequeue pending truncations. This is definitely doable.

I don't have a good sense of the surrounding code, so I'm happy to go with whatever you decide.
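
For concreteness, the rough sketch I have in mind for the callback (field names like runningTruncation and queuedDurabilityCB are just placeholders, not necessarily what the code will use):

func (t *raftLogTruncator) durabilityAdvancedCallback() {
	t.mu.Lock()
	defer t.mu.Unlock()
	// Remember that durability advanced even if a pass is already running, so
	// that the running goroutine does one more pass before exiting.
	t.mu.queuedDurabilityCB = true
	if t.mu.runningTruncation {
		return
	}
	t.mu.runningTruncation = true
	go func() {
		ctx := context.Background() // a real implementation would thread a proper context
		for {
			t.mu.Lock()
			if !t.mu.queuedDurabilityCB {
				t.mu.runningTruncation = false
				t.mu.Unlock()
				return
			}
			t.mu.queuedDurabilityCB = false
			t.mu.Unlock()
			// One pass over all replicas with pending truncations.
			t.durabilityAdvanced(ctx)
		}
	}()
}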


pkg/kv/kvserver/raft_log_queue.go, line 37 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

I think you meant to remove this entire paragraph.

Done


pkg/kv/kvserver/raft_log_queue.go, line 252 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Could you update the comment on newTruncateDecision to give an idea of how pending truncations interact with new truncate decisions? I think what we're doing is to pretend that all pending truncations have happened, and then computing what the next truncation would be.

computePostTrunc* also need comments.

Done


pkg/kv/kvserver/raft_log_queue.go, line 280 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Not triggering recomputations prematurely makes sense, can you elaborate on the consequences of this? There may always be pending truncations, at least in a contrived world, right?

I've added the following to the comment.

// Note that as
// soon as those pending truncations are enacted r.mu.raftLogSizeTrusted
// will become false and we will recompute the size -- so this cannot cause
// an indefinite delay in recomputation.

I think we can live with a transient under or over estimate.


pkg/kv/kvserver/raft_log_truncator.go, line 67 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Maybe there's a way to clarify what "front" means in this, it's the least aggressive truncation, right? So maybe

// Returns the front (i.e. least aggressive truncation) of the pending truncations queue, without removing the element

, and similar for pop.

Done


pkg/kv/kvserver/raft_log_truncator.go, line 167 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

There isn't really such a constraint, and I think it would be instructive to make the interface "what we really want it to be" (i.e. fewer random methods, ideally no need to have methods that expose locks), and for the implementation do this:

type replicaInTruncator Replica

func (ri *replicaInTruncator) SomeGoodMethod() bool {
  r := (*Replica)(ri)
  // Do things
  return true
}

The balance to strike is that there shouldn't be any nontrivial code in methods on replicaInTruncator, as then it's back to not being able to properly unit test them.

But I hope there is such a balance. I imagine the Store interface would get a method that loads the replica (by RangeID), acquires the raftMu and checks the destroy status before returning. So now the remainder of the truncator can forget about the destroy status. Both callers in this PR that load the replica immediately proceed and check destroy status under the raftMu, so this makes sense. And I don't think they can unlock raftMu and then keep using the replica; after all the replica might be destroyed in the instant after. So we can use an acquire-release pattern here:

repl, ok := storeInterface.AcquireReplica(rangeID) // repl.raftMu
if !ok { continue }
defer repl.Release() // basically unlocks raftMu

The replicaForTruncator methods that are subject to a lock can sit behind this pattern:

type lockedReplicaInTruncator replicaInTruncator // with corresponding interface lockedReplicaInTruncatorI

func (lri *lockedReplicaInTruncator) TouchSomethingThatNeedsLock() {
  lri.mu.foo = "bar"
}

func (ri *replicaInTruncator) WithLock(f func(lockedReplicaInTruncatorI)) {
  ri.mu.Lock()
  defer ri.mu.Unlock()
  f((*lockedReplicaInTruncator)(ri))
}

I'm sure there are a few complications here that I'm not seeing, but I think we can and should strive to make these interfaces pared down and pleasant to use.

The storeForTruncator interface change idea is nice. I've made that change. It removed the explicit locking of raftMu which simplified the code.
There is still explicit locking of Replica.mu -- I don't see how to avoid this given that we are reading many things while holding that lock.
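
For illustration, the resulting shape is roughly the following (approximate, not the exact signatures):

// storeForTruncator hands out a replica with raftMu already held and the
// destroy status already checked, and takes it back when the truncator is
// done with it.
type storeForTruncator interface {
	// acquireReplicaForTruncator returns nil if the replica no longer exists
	// or is being destroyed.
	acquireReplicaForTruncator(rangeID roachpb.RangeID) replicaForTruncator
	// releaseReplicaForTruncator releases a successfully acquired replica
	// (i.e. unlocks raftMu).
	releaseReplicaForTruncator(r replicaForTruncator)
}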


pkg/kv/kvserver/replica.go, line 457 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Just to play devil's advocate here, is there any reason for this to sit under replica.mu? We always hold raftMu when reading or writing this, so why bring replica.mu into the equation? This could sit under raftMu it seems and would have a better home there.

We also read it in newTruncateDecision where currently we only hold Replica.mu. It seemed awkward to disassociate this from the raftLogSize and the truncated state, which it adjusts, which sit under Replica.mu. I could change that code to also acquire raftMu if you think that is better.


pkg/kv/kvserver/replica.go, line 2050 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

We definitely want to avoid breaking our naming conventions for replica methods that hold locks. My suggestion above to clean up the interfaces will achieve that too.

I added the Locked suffix to two of the methods.
The getPendingTruncs situation is a bit odd since it doesn't need the lock for reads. And if we are going to write the *pendingTruncs later, we will be calling methods on pendingLogTruncations which has no notion of concurrency control, so those methods can't say "Locked".


pkg/kv/kvserver/replica.go, line 2097 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

First

Done


pkg/kv/kvserver/replica_application_state_machine.go, line 764 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Checking the cluster version below raft is somewhat atypical. It's ok here because the effects of this command are local, but I wonder if we have to. Isn't it up to the creator of the truncation proposal to decide whether to use the loosely coupled truncations?

Currently there isn't a bool in ReplicaEvalResult that specifies whether it should be loosely coupled or not. The upstream code looks only at the cluster version and if it is LooselyCoupledRaftLogTruncation it populates ExpectedFirstIndex. The ExpectedFirstIndex is useful even for strongly coupled truncation. I could add a bool to ReplicaEvalResult, which we would remove after the strongly coupled truncation goes away, or continue to do this. Since we can't fully dictate what the below raft code does (it could be an old binary), it does seem a little odd to add additional plumbing.
But I don't have a strong opinion.

I've also added a code comment here which includes what you mentioned.


pkg/kv/kvserver/replica_sideload_disk.go, line 219 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

nit: .

Done


pkg/kv/kvserver/store.go, line 714 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Maybe

// Carries out truncations from the raft log queue when they are safe

Done

Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Reviewed 2 of 16 files at r1, 1 of 8 files at r2, 10 of 14 files at r3, 6 of 6 files at r4, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @sumeerbhola, and @tbg)


pkg/kv/kvserver/raft_log_queue.go, line 111 at r4 (raw file):

//   will do strongly coupled truncation and the v22.2 will do loosely
//   coupled. This co-existence is correct.
// TODO: flip back to true before merging.

Reminder about this TODO.


pkg/kv/kvserver/raft_log_truncator.go, line 61 at r4 (raw file):

func (p *pendingLogTruncations) computePostTruncFirstIndex(firstIndex uint64) uint64 {
	p.iterate(func(_ int, trunc pendingTruncation) {
		if firstIndex < trunc.Index+1 {

Consider pulling trunc.Index + 1 into a method on pendingTruncation called firstIndexAfterApplied or something like that with a short comment that reminds readers that RaftTruncatedState.Index is inclusive.
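
For example, something like:

// firstIndexAfterApplied returns the first raft log index that remains after
// this pending truncation is applied. Note that RaftTruncatedState.Index is
// inclusive, i.e. it is the highest index removed by the truncation.
func (pt *pendingTruncation) firstIndexAfterApplied() uint64 {
	return pt.Index + 1
}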


pkg/kv/kvserver/raft_log_truncator.go, line 85 at r4 (raw file):

	p.truncs[0] = pendingTruncation{}
	if !(p.truncs[1] == (pendingTruncation{})) {
		p.truncs[0] = p.truncs[1]

nit: can this method just be these two lines, without the conditional?


pkg/kv/kvserver/raft_log_truncator.go, line 125 at r4 (raw file):

	expectedFirstIndex uint64
	// logDeltaBytes includes the bytes from sideloaded files.
	logDeltaBytes  int64

Could you mention that this is always negative (or zero?). It caught me by surprise to see us add logDeltaBytes above.


pkg/kv/kvserver/raft_log_truncator.go, line 183 at r4 (raw file):
The comment above says "Any mutex protecting raft-state is acquired before returning". Was that in reference to Replica.raftMu and is this in reference to the Replica.mu? If so, consider mentioning this in that comment.

Any mutex protecting raft-state (Replica.raftMu) is acquired before returning, but finer-grained mutexes protecting replica state (Replica.mu) must be acquired using lockReplicaState.


pkg/kv/kvserver/raft_log_truncator.go, line 318 at r4 (raw file):

			return
		}
		ranges = make([]roachpb.RangeID, 0, n)

nit: if we want to minimize the amount of time that we spend holding the raftLogTruncator's lock, we could maintain two maps and swap them here.

Something like:

type raftLogTruncator struct {
    ...
    mu struct {
        ...
        add, drain map[roachpb.RangeID]struct{}
    }
}

func (t *raftLogTruncator) addPendingTruncation() {
    ...
    t.mu.Lock()
    t.mu.add[r.GetRangeID()] = struct{}{}
    t.mu.Unlock()
}

func (t *raftLogTruncator) durabilityAdvanced() {
    ...
    t.mu.Lock()
    t.mu.add, t.mu.drain = t.mu.drain, t.mu.add
    drain := t.mu.drain
    t.mu.Unlock()

    for k := range drain {
        ...
        delete(drain, k)
    }
}

pkg/kv/kvserver/raft_log_truncator.go, line 423 at r4 (raw file):

			return
		}
		if err := batch.Commit(false); err != nil {

Consider adding a note about why we don't need durability with this operation.


pkg/kv/kvserver/replica.go, line 2038 at r4 (raw file):

// Implementation of replicaForTruncator interface.

var _ replicaForTruncator = &Replica{}

Should these be methods on Replica or on a separate concrete type? I believe Tobi was suggesting that we do something like we do with replicaProposer to avoid bloating the Replica method set.


pkg/kv/kvserver/replica_application_state_machine.go, line 763 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

We could pull out a free-standing function that takes res, the cluster settings, b.state.TruncatedState, a state loader, a WriteBatch, the truncator, the replica through an interface. It will be work, work that's useful but I am hesitant to make you do it in the context of this PR. It would be worth-while, but not if it comes at the cost of polishing other, more important, bits of the PR. So perhaps let's hold off and add it in in a follow-up, or not at all, depending on how things pan out.

TestReplicaStateMachineChangeReplicas is an example of a test that directly exercises a Replica state machine. You may be able to write the kind of test that you want at a similar level.


pkg/kv/kvserver/replica_application_state_machine.go, line 1264 at r4 (raw file):

The RaftExpectedFirstIndex != 0 will not be necessary in the release following LooselyCoupledRaftLogTruncation.

Is this true? Do we plan to run a long-running migration to flush out all raft log truncation entries without RaftExpectedFirstIndex from all raft logs?

Copy link
Collaborator Author

@sumeerbhola sumeerbhola left a comment

TFTR!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @nvanbenschoten, @sumeerbhola, and @tbg)


pkg/kv/kvserver/raft_log_truncator.go, line 61 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Consider pulling trunc.Index + 1 into a method on pendingTruncation called firstIndexAfterApplied or something like that with a short comment that reminds readers that RaftTruncatedState.Index is inclusive.

Done


pkg/kv/kvserver/raft_log_truncator.go, line 85 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: can this method just be these two lines, without the conditional?

Yes, it can! Done.


pkg/kv/kvserver/raft_log_truncator.go, line 125 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Could you mention that this is always negative (or zero?). It caught me by surprise to see us add logDeltaBytes above.

Done.
It shouldn't be 0 since the truncation is truncating at least one additional entry. But I don't want to make any claims on what we consider the size of a log entry, so the comment I've added says <= 0.


pkg/kv/kvserver/raft_log_truncator.go, line 183 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

The comment above says "Any mutex protecting raft-state is acquired before returning". Was that in reference to Replica.raftMu and is this in reference to the Replica.mu? If so, consider mentioning this in that comment.

Any mutex protecting raft-state (Replica.raftMu) is acquired before returning, but finer-grained mutexes protecting replica state (Replica.mu) must be acquired using lockReplicaState.

Correct. Done.


pkg/kv/kvserver/raft_log_truncator.go, line 318 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: if we want to minimize the amount of time that we spend holding the raftLogTruncator's lock, we could maintain two maps and swap them here.

Something like:

type raftLogTruncator struct {
    ...
    mu struct {
        ...
        add, drain map[roachpb.RangeID]struct{}
    }
}

func (t *raftLogTruncator) addPendingTruncation() {
    ...
    t.mu.Lock()
    t.mu.add[r.GetRangeID()] = struct{}{}
    t.mu.Unlock()
}

func (t *raftLogTruncator) durabilityAdvanced() {
    ...
    t.mu.Lock()
    t.mu.add, t.mu.drain = t.mu.drain, t.mu.add
    drain := t.mu.drain
    t.mu.Unlock()

    for k := range drain {
        ...
        delete(drain, k)
    }
}

This is the only case where we do work proportional to the number of ranges while holding the raftLogTruncator's lock, so it probably makes sense to change this -- if we hold the lock for too long, we'll block anyone trying to queue new truncations. I'll wait to make this change until it is clear that raftLogTruncator will remain in its current form.


pkg/kv/kvserver/raft_log_truncator.go, line 423 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Consider adding a note about why we don't need durability with this operation.

Done


pkg/kv/kvserver/replica.go, line 2038 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Should these be methods on Replica or on a separate concrete type? I believe Tobi was suggesting that we do something like we do with replicaProposer to avoid bloating the Replica method set.

@tbg is that what you were suggesting too?
If yes, I'll make the change.


pkg/kv/kvserver/replica_application_state_machine.go, line 1264 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

The RaftExpectedFirstIndex != 0 will not be necessary in the release following LooselyCoupledRaftLogTruncation.

Is this true? Do we plan to run a long-running migration to flush out all raft log truncation entries without RaftExpectedFirstIndex from all raft logs?

Good point.
This comment could have been more decisive. I've replaced it with

		// This strongly coupled truncation code will be removed in the release
		// following LooselyCoupledRaftLogTruncation.

And in the other place in this file where we gate on RaftExpectedFirstIndex == 0 I've added

		// In the release following LooselyCoupledRaftLogTruncation, we will
		// retire the strongly coupled path. It is possible that some replica
		// still has a truncation sitting in a raft log that never populated
		// RaftExpectedFirstIndex, which will be interpreted as 0. When applying
		// it, the loosely coupled code will mark the log size as untrusted and
		// will recompute the size. This has no correctness impact, so we are not
		// going to bother with a long-running migration.

Copy link
Collaborator Author

@sumeerbhola sumeerbhola left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @nvanbenschoten, @sumeerbhola, and @tbg)


pkg/kv/kvserver/replica_sideload_disk.go, line 254 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Make sure there's a simple unit test for this. Modifying one of the existing ones should be enough.

Done


pkg/kv/kvserver/testdata/raft_log_truncator, line 83 at r6 (raw file):



# TODO: add more test cases and add comments.

This test is WIP

@tbg tbg self-requested a review February 18, 2022 14:29
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this pull request Feb 18, 2022
This is the CockroachDB plumbing for Pebble's
IterOptions.OnlyReadGuaranteedDurable. It is for use in
the raftLogTruncator
cockroachdb#76215.

Since most of the exported interfaces in the storage
package use a Reader, we support this via a
DurabilityRequirement parameter on Engine.NewReadOnly,
and not via an iterator option.

There is also a RegisterFlushCompletedCallback method
on Engine which will be used to poll certain durable
state in the raftLogTruncator.

Other than the trivial plumbing, this required some
refactoring of the Reader.MVCCGet* code for Pebble
and pebbleReadOnly. Even though it is deprecated and
primarily/only used in tests, we don't want to have
the durability semantics diverge.

Release note: None
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this pull request Feb 18, 2022
Copy link
Member

@tbg tbg left a comment

Just fine-tuning at this point.

Reviewed 3 of 8 files at r8, 8 of 15 files at r9, 2 of 2 files at r10, 6 of 6 files at r11, all commit messages.
Dismissed @nvanbenschoten and @sumeerbhola from a discussion.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @nvanbenschoten, and @sumeerbhola)


pkg/kv/kvserver/raft_log_queue.go, line 114 at r11 (raw file):

	s := settings.RegisterBoolSetting(
		settings.SystemOnly,
		"kv.raft_log.enable_loosely_coupled_truncation.enabled",

nit: enable_x.enabled is not quite right. How about kv.raft_log.loosely_coupled_truncations.enabled.


pkg/kv/kvserver/raft_log_queue.go, line 265 at r11 (raw file):

	firstIndex, err := r.raftFirstIndexLocked()
	firstIndex = r.pendingLogTruncations.computePostTruncFirstIndex(firstIndex)

Could you rearrange this such that firstIndex, err is the last thing before r.mu.Unlock and the error check? I don't think we need to hold the lock for computePostTruncFirstIndex, so this can be done after the err check.
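
Roughly (hand-wavy; the other reads under the lock are elided):

	// ... other fields read while holding Replica.mu ...
	firstIndex, err := r.raftFirstIndexLocked()
	r.mu.RUnlock()
	if err != nil {
		return truncateDecision{}, err
	}
	// computePostTruncFirstIndex does not need Replica.mu.
	firstIndex = r.pendingLogTruncations.computePostTruncFirstIndex(firstIndex)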


pkg/kv/kvserver/raft_log_truncator.go, line 410 at r10 (raw file):

Previously, tbg (Tobias Grieger) wrote…

nit: you could change this to just return (*Replica, bool) as you're not ever really interested in handling an error here. You just want to have the replica.

Having reviewed more, yes this seems right and a good thing to do.


pkg/kv/kvserver/raft_log_truncator.go, line 46 at r11 (raw file):

type pendingLogTruncations struct {
	mu struct {
		// From a lock ordering perspective, this mutex is the lowest.

Clarify "lowest".


pkg/kv/kvserver/raft_log_truncator.go, line 48 at r11 (raw file):

		// From a lock ordering perspective, this mutex is the lowest.
		syncutil.Mutex
		// We only track the oldest and latest pending truncation. We cannot track

We track the old and merge everything else into the "new" truncation, right? If we discard all intermediate truncations then we're also losing the bytes deltas.


pkg/kv/kvserver/raft_log_truncator.go, line 94 at r11 (raw file):

func (p *pendingLogTruncations) isEmptyLocked() bool {
	return p.mu.truncs[0] == (pendingTruncation{})

Looks like there's an invariant here. Consider spelling out that invariant explicitly on the field comment.


pkg/kv/kvserver/raft_log_truncator.go, line 168 at r11 (raw file):

// raftLogTruncator is responsible for actually enacting truncations.
// Mutex ordering: Replica mutexes > raftLogTruncator.mu

Could you spell out what this means? < means "acquired before", right?


pkg/kv/kvserver/raft_log_truncator.go, line 207 at r11 (raw file):

//
// A replica has raft state, including the queue of pending truncations, that
// the truncator is modifying. There could be a potential "raft-state" mutex

Heh, let's just drop the conditionals here. There is one canonical impl and the interface is for testing, that is ok.
You might also add that access to the interface is single-threaded.


pkg/kv/kvserver/raft_log_truncator.go, line 237 at r11 (raw file):

		_ context.Context, from, to uint64) (freed int64, _ error)
	getStateLoader() stateloader.StateLoader
	// Setting the persistent raft state is via the Engine exposed by

nit: // NB:


pkg/kv/kvserver/raft_log_truncator.go, line 260 at r11 (raw file):

	// Need to figure out whether to add this new pendingTrunc to the
	// truncations that are already queued, and if yes, where to add.
	// i is the index of the last already queued truncation.

Make sure that if it doesn't already, iterateLocked documents that it visits truncations starting with the less aggressive (old) ones first.


pkg/kv/kvserver/raft_log_truncator.go, line 264 at r11 (raw file):

	// alreadyTruncIndex represents what has been already truncated.
	alreadyTruncIndex := r.getTruncatedState().Index
	// No need to acquire pendingTruncs.mu for read in this case.

// (see replicaForTruncator)


pkg/kv/kvserver/raft_log_truncator.go, line 289 at r11 (raw file):

	// which can be large, since we do the computation for the sideloaded
	// entries size here. That will reduce the undercounting of the bytes in the
	// raft log by reducing the value of sideloadedFreed.

// In the common case of alreadyTruncIndex+1 <= raftExpectedFirstIndex, the computation returns the same result regardless of which is plugged in as the lower bound.


pkg/kv/kvserver/raft_log_truncator.go, line 292 at r11 (raw file):

	sideloadedFreed, err := r.sideloadedBytesIfTruncatedFromTo(
		ctx, alreadyTruncIndex+1, pendingTrunc.firstIndexAfterTrunc())
	// Log a loud error since we need to continue enqueuing the truncation.

nit: this should be in the if err != nil block.


pkg/kv/kvserver/raft_log_truncator.go, line 301 at r11 (raw file):

		// Merge the existing entry into the new one.
		// No need to acquire pendingTruncs.mu for read in this case.
		pendingTrunc.isDeltaTrusted = pendingTrunc.isDeltaTrusted ||

Is that right? If we're merging wrong stats into correct stats, we'll get wrong stats (not right stats). So this should be &&?


pkg/kv/kvserver/raft_log_truncator.go, line 430 at r11 (raw file):

		}
	}
	// NB: Unlocking but can keep reading pendingTruncs in this case.

"in this case" doesn't add much, either remove it or say "due to replicaForTruncator contract".


pkg/kv/kvserver/raft_log_truncator.go, line 458 at r11 (raw file):

	})
	if enactIndex < 0 {
		// Add it back as range we should examine and release all locks.

"and release all locks" had me confused. You're just saying that we release the replica back to the store, right? Doesn't seem to add much.


pkg/kv/kvserver/raft_log_truncator.go, line 464 at r11 (raw file):

	// Do the truncation of persistent raft entries, specified by enactIndex
	// (this subsumes all the preceding queued truncations).
	batch := t.store.getEngine().NewUnindexedBatch(false)

(false /* writeOnly */)


pkg/kv/kvserver/raft_log_truncator.go, line 472 at r11 (raw file):

			log.Errorf(ctx, "while attempting to truncate raft log: %+v", err)
		} else {
			log.Errorf(ctx, "unexpected !apply returned from handleTruncatedStateBelowRaftPreApply")
err := errors.AssertionFailedf("unexpected !apply from handleTruncatedStateBelowRaftPreApply")
if buildutil.CrdbTestBuild || util.RaceEnabled {
  log.Fatalf(ctx, "%s", err)
} else {
  log.Errorf(ctx, "%s", err)
}
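
(I.e., fail hard in test and race builds so the bug gets noticed, but only log the assertion failure in production.)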

pkg/kv/kvserver/replica.go, line 2040 at r11 (raw file):

}

// Implementation of the replicaForTruncator interface.

For replicaProposer, we pulled this out of replica.go. I would suggest doing the same here, perhaps moving this to a file raft_truncator_replica.go.


pkg/kv/kvserver/replica.go, line 2052 at r11 (raw file):

	r.mu.Lock()
	defer r.mu.Unlock()
	// TruncatedState is guaranteed to be non-nil

.


pkg/kv/kvserver/replica.go, line 2096 at r11 (raw file):

func (r *replicaForTruncatorImpl) getStateLoader() stateloader.StateLoader {
	// NB: the replicaForTruncator contract says that Replica.raftMu is held.

Need same comment on sideloadedBytesIfTruncatedFromTo.


pkg/kv/kvserver/replica_application_state_machine.go, line 1261 at r11 (raw file):

	}

	isDeltaTrusted := true

nit: perhaps name this more verbosely, isRaftLogTruncationDeltaTrusted, since there is also the MVCCStats "delta".


pkg/kv/kvserver/store.go, line 3493 at r11 (raw file):

}

// Implementation of the storeForTruncator interface.

raft_truncator_store.go?


pkg/kv/kvserver/store.go, line 3508 at r11 (raw file):

		r.mu.Lock()
		defer r.mu.Unlock()
		return r.mu.destroyStatus.Removed()

What about destroyReasonMergePending? As is we'll continue processing the replica. Is that necessary? The raft log is about to be deleted anyway. You could change this to !r.mu.destroyStatus.IsAlive() to avoid it.


pkg/kv/kvserver/raft_log_truncator_test.go, line 45 at r11 (raw file):

	require.Equal(t, 2, truncs.capacity())
	require.True(t, truncs.isEmptyLocked())
	require.Equal(t, int64(55), truncs.computePostTruncLogSize(55))

You could use EqualValues; you don't have to, I just wanted to point out that this exists in case you weren't aware. I assume you got errors here and then added the cast; if you use constant ints, using EqualValues from the start can be a bit faster.


pkg/kv/kvserver/raft_log_truncator_test.go, line 62 at r11 (raw file):

	require.Equal(t, int64(0), truncs.computePostTruncLogSize(45))
	// Advances to Index+1.
	require.Equal(t, uint64(21), truncs.computePostTruncFirstIndex(5))

Test 20 and 21 too since they're the boundary values.


pkg/kv/kvserver/raft_log_truncator_test.go, line 67 at r11 (raw file):

	// Two pending truncations.
	truncs.mu.truncs[1].logDeltaBytes = -70

Now that I see this, I don't think I saw a comment that said something like

// INVARIANT: pendingTruncs[0].Index < pendingTruncs[1].Index (if the latter is not zero), consider adding that in the right place.


pkg/kv/kvserver/raft_log_truncator_test.go, line 82 at r11 (raw file):

	// Added -120 and bumped up to 0.
	require.Equal(t, int64(0), truncs.computePostTruncLogSize(115))
	// Advances to Index+1 of second entry

.

@sumeerbhola sumeerbhola force-pushed the raft_trunc2 branch 2 times, most recently from 6da9157 to 91e90d7 Compare February 22, 2022 16:12
Copy link
Collaborator Author

@sumeerbhola sumeerbhola left a comment

addressed all comments

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @nvanbenschoten, and @tbg)


pkg/kv/kvserver/raft_log_queue.go, line 114 at r11 (raw file):

Previously, tbg (Tobias Grieger) wrote…

nit: enable_x.enabled is not quite right. How about kv.raft_log.loosely_coupled_truncations.enabled.

Done


pkg/kv/kvserver/raft_log_queue.go, line 265 at r11 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Could you rearrange this such that firstIndex, err is the last thing before r.mu.Unlock and the error check? I don't think we need to hold the lock for computePostTruncFirstIndex, so this can be done after the err check.

Done


pkg/kv/kvserver/raft_log_truncator.go, line 416 at r9 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Oh, you are right, I was somehow thinking it wrapped the engine, but it does not (I realized this half-way through, but without realizing that it threw the premise out of the window). We can leave this as is, thanks for pushing back.

Ack


pkg/kv/kvserver/raft_log_truncator.go, line 410 at r10 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Having reviewed more, yes this seems right and a good thing to do.

By "this seems right", I assume you meant to make this change.
Done


pkg/kv/kvserver/raft_log_truncator.go, line 46 at r11 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Clarify "lowest".

Done


pkg/kv/kvserver/raft_log_truncator.go, line 48 at r11 (raw file):

Previously, tbg (Tobias Grieger) wrote…

We track the old and merge everything else into the "new" truncation, right? If we discard all intermediate truncations then we're also losing the bytes deltas.

The comment is updated to say

		// We track up to two truncations: the oldest pending truncation, and a
		// merge of all the subsequent pending truncations. We cannot track only
		// one merged truncation since its index may always be ahead of the
		// durable RaftAppliedIndex, and so we may never be able to truncate.
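
For intuition, a toy sketch of how the two slots get used when queueing (a hypothetical helper; the real code does the merging in the truncator and also combines the byte deltas and the trusted bits):

func (p *pendingLogTruncations) enqueueLocked(trunc pendingTruncation) {
	if p.mu.truncs[0] == (pendingTruncation{}) {
		// Queue is empty, so trunc becomes the oldest pending truncation,
		// which can be enacted as soon as the durable RaftAppliedIndex
		// reaches its index.
		p.mu.truncs[0] = trunc
		return
	}
	// Otherwise, everything newer than the oldest pending truncation is
	// collapsed into the second slot.
	p.mu.truncs[1] = trunc
}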

pkg/kv/kvserver/raft_log_truncator.go, line 94 at r11 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Looks like there's an invariant here. Consider spelling out that invariant explicitly on the field comment.

Done

		// Invariants:
		// - Queue slot i is empty iff truncs[i] == pendingTruncation{}
		// - Slot 0 represents the first position in the queue. Therefore, it is
		//   not possible for slot 0 to be empty and slot 1 to be non-empty.
		// An implication of the above is that the queue is empty iff slot 0 is
		// empty.

pkg/kv/kvserver/raft_log_truncator.go, line 168 at r11 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Could you spell out what this means? < means "acquired before", right?

I am now unsure about my symbol usage, and I wasn't able to quickly find a reference online. So I changed it to

// Mutex ordering: The Replica mutexes can be held when acquiring
// raftLogTruncator.mu, but the reverse is not permitted.

pkg/kv/kvserver/raft_log_truncator.go, line 207 at r11 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Heh, let's just drop the conditionals here. There is one canonical impl and the interface is for testing, that is ok.
You might also add that access to the interface is single-threaded.

Dropped the conditionals.
I didn't add that access to replicaForTruncator is single threaded (even though it is), because it may confuse the reader into thinking that we are saying more than that (that other accesses via *Replica are not occurring concurrently).


pkg/kv/kvserver/raft_log_truncator.go, line 237 at r11 (raw file):

Previously, tbg (Tobias Grieger) wrote…

nit: // NB:

Done


pkg/kv/kvserver/raft_log_truncator.go, line 260 at r11 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Make sure that if it doesn't already, iterateLocked documents that it visits truncations starting with the less aggressive (old) ones first.

Done


pkg/kv/kvserver/raft_log_truncator.go, line 264 at r11 (raw file):

Previously, tbg (Tobias Grieger) wrote…

// (see replicaForTruncator)

Done


pkg/kv/kvserver/raft_log_truncator.go, line 289 at r11 (raw file):

Previously, tbg (Tobias Grieger) wrote…

// In the common case of alreadyTruncIndex+1 <= raftExpectedFirstIndex, the computation returns the same result regardless of which is plugged in as the lower bound.

Done. Rewrote to

	// It is possible that alreadyTruncIndex+1 != raftExpectedFirstIndex. When
	// we merge or enact we will see this problem and set the trusted bit to
	// false. But we can at least correctly count sideloaded entries, which can
	// be large, since we do the computation for the sideloaded entries size
	// here. When alreadyTruncIndex+1 > raftExpectedFirstIndex, this will avoid
	// double counting sideloaded entries that will be freed, and when
	// alreadyTruncIndex+1 < raftExpectedFirstIndex, this will ensure that we
	// don't miss sideloaded entries that will be freed.
	//
	// In the common case of alreadyTruncIndex+1 == raftExpectedFirstIndex, the
	// computation returns the same result regardless of which is plugged in as
	// the lower bound.

pkg/kv/kvserver/raft_log_truncator.go, line 292 at r11 (raw file):

Previously, tbg (Tobias Grieger) wrote…

nit: this should be in the if err != nil block.

Done


pkg/kv/kvserver/raft_log_truncator.go, line 301 at r11 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Is that right? If we're merging wrong stats into correct stats, we'll get wrong stats (not right stats). So this should be &&?

Good catch. Done. And added test.


pkg/kv/kvserver/raft_log_truncator.go, line 430 at r11 (raw file):

Previously, tbg (Tobias Grieger) wrote…

"in this case" doesn't add much, either remove it or say "due to replicaForTruncator contract".

Done


pkg/kv/kvserver/raft_log_truncator.go, line 458 at r11 (raw file):

Previously, tbg (Tobias Grieger) wrote…

"and release all locks" had me confused. You're just saying that we release the replica back to the store, right? Doesn't seem to add much.

Done (was a stale comment). Changed to // Enqueue the rangeID for the future.


pkg/kv/kvserver/raft_log_truncator.go, line 464 at r11 (raw file):

Previously, tbg (Tobias Grieger) wrote…

(false /* writeOnly */)

Done


pkg/kv/kvserver/raft_log_truncator.go, line 472 at r11 (raw file):

Previously, tbg (Tobias Grieger) wrote…

err := errors.AssertionFailedf("unexpected !apply from handleTruncatedStateBelowRaftPreApply")
if buildutil.CrdbTestBuild || util.RaceEnabled {
  log.Fatalf(ctx, "%s", err)
} else {
  log.Errorf(ctx, "%s", err)
}

Done


pkg/kv/kvserver/replica.go, line 2040 at r11 (raw file):

Previously, tbg (Tobias Grieger) wrote…

For replicaProposer, we pulled this out of replica.go. I would suggest doing the same here, perhaps moving this to a file raft_truncator_replica.go.

Done. And renamed type to raftTruncatorReplica.


pkg/kv/kvserver/replica.go, line 2052 at r11 (raw file):

Previously, tbg (Tobias Grieger) wrote…

.

Done


pkg/kv/kvserver/replica.go, line 2096 at r11 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Need same comment on sideloadedBytesIfTruncatedFromTo.

I changed this to

	// NB: the replicaForTruncator contract says that Replica.raftMu is held for
	// the duration of the existence of replicaForTruncator, so we return the
	// r.raftMu.stateloader (and not r.mu.stateLoader).

since that was my limited intention here.
It wasn't about whether someone can be concurrently modifying the engine state or not.


pkg/kv/kvserver/replica_application_state_machine.go, line 1261 at r11 (raw file):

Previously, tbg (Tobias Grieger) wrote…

nit: perhaps name this more verbosely, isRaftLogTruncationDeltaTrusted, since there is also the MVCCStats "delta".

Done


pkg/kv/kvserver/store.go, line 3493 at r11 (raw file):

Previously, tbg (Tobias Grieger) wrote…

raft_truncator_store.go?

Seems very tiny, so I didn't bother. I can do so if you have a strong opinion.


pkg/kv/kvserver/store.go, line 3508 at r11 (raw file):

Previously, tbg (Tobias Grieger) wrote…

What about destroyReasonMergePending? As is we'll continue processing the replica. Is that necessary? The raft log is about to be deleted anyway. You could change this to !r.mu.destroyStatus.IsAlive() to avoid it.

Done


pkg/kv/kvserver/raft_log_truncator_test.go, line 45 at r11 (raw file):

Previously, tbg (Tobias Grieger) wrote…

You could use EqualValues, don't have to, just wanted to point out this exists in case you weren't aware. I assume you got errors here and then added the cast, if you use constant ints using EqualValues from the start can be a bit faster.

I wasn't aware. Thanks. Done.


pkg/kv/kvserver/raft_log_truncator_test.go, line 62 at r11 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Test 20 and 21 too since they're the boundary values.

Done


pkg/kv/kvserver/raft_log_truncator_test.go, line 67 at r11 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Now that I see this, I don't think I saw a comment that said something like

// INVARIANT: pendingTruncs[0].Index < pendingTruncs[1].Index (if the latter is not zero), consider adding that in the right place.

Added to the invariants listed for the truncs field.


pkg/kv/kvserver/raft_log_truncator_test.go, line 82 at r11 (raw file):

Previously, tbg (Tobias Grieger) wrote…

.

Done

@sumeerbhola
Copy link
Collaborator Author

TestReplicaCircuitBreaker_Follower_QuorumLoss failure is a flake #76781

Copy link
Member

@tbg tbg left a comment

All done, thanks for the swift turnaround!

Reviewed 1 of 14 files at r3, 17 of 17 files at r12, all commit messages.
Dismissed @nvanbenschoten and @sumeerbhola from 10 discussions.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @nvanbenschoten)

@sumeerbhola
Copy link
Collaborator Author

thanks for the speedy reviews!

@sumeerbhola
Copy link
Collaborator Author

bors r=tbg

In the ReplicasStorage design we stop making any assumptions
regarding what is durable in the state machine when syncing a batch
that commits changes to the raft log. This implies the need to
make raft log truncation more loosely coupled than it is now, since
we can truncate only when certain that the state machine is durable
up to the truncation index.

Current raft log truncation flows through raft and even though the
RaftTruncatedStateKey is not a replicated key, it is coupled in
the sense that the truncation is done below raft when processing
the corresponding log entry (that asked for truncation to be done).

The current setup also has correctness issues wrt maintaining the
raft log size, when passing the delta bytes for a truncation. We
compute the delta at proposal time (to avoid repeating iteration over
the entries in all replicas), but we do not pass the first index
corresponding to the truncation, so gaps or overlaps cannot be
noticed at truncation time.

We do want to continue to have the raft leader guide the truncation
since we do not want either leader or followers to over-truncate,
given our desire to serve snapshots from any replica. In the loosely
coupled approach implemented here, the truncation request that flows
through raft serves as an upper bound on what can be truncated.

The truncation request includes an ExpectedFirstIndex. This is
further propagated using ReplicatedEvalResult.RaftExpectedFirstIndex.
This ExpectedFirstIndex allows one to notice gaps or overlaps when
enacting a sequence of truncations, which results in setting the
Replica.raftLogSizeTrusted to false. The correctness issue with
Replica.raftLogSize is not fully addressed since there are existing
consistency issues when evaluating a TruncateLogRequest (these
are now noted in a code comment).

Below raft, the truncation requests are queued onto a Replica
in pendingLogTruncations. The queueing and dequeuing is managed
by a raftLogTruncator that takes care of merging pending truncation
requests and enacting the truncations when the durability of the
state machine advances.

The pending truncation requests are taken into account in the
raftLogQueue when deciding whether to do another truncation.
Most of the behavior of the raftLogQueue is unchanged.

The new behavior is gated on a LooselyCoupledRaftLogTruncation
cluster version. Additionally, the new behavior can be turned
off using the kv.raft_log.enable_loosely_coupled_truncation.enabled
cluster setting, which is true by default. The latter is expected
to be a safety switch for 1 release after which we expect to
remove it. That removal will also cleanup some duplicated code
(that was non-trivial to refactor and share) between the previous
coupled and new loosely coupled truncation.

Note, this PR is the first of two -- loosely coupled truncation
is turned off via a constant in this PR. The next one will
eliminate the constant and put it under the control of the cluster
setting.

Informs cockroachdb#36262
Informs cockroachdb#16624,cockroachdb#38322

Release note (ops change): The cluster setting
kv.raft_log.loosely_coupled_truncation.enabled can be used to disable
loosely coupled truncation.
@sumeerbhola
Copy link
Collaborator Author

TestClusterVersionUpgrade failure looks like a flake #74599

@sumeerbhola
Copy link
Collaborator Author

bors r=tbg

@craig
Copy link
Contributor

craig bot commented Feb 22, 2022

Build succeeded:

@craig craig bot merged commit 67c8277 into cockroachdb:master Feb 22, 2022
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this pull request Feb 23, 2022
craig bot pushed a commit that referenced this pull request Feb 23, 2022
76776: scpb,scdecomp,scbuild: remodel elements, use decomposition r=postamar a=postamar

This change adresses some shortcomings of the declarative schema
changer that would have compromised its ability to evolve gracefully in
its current form. Basically these are:
  1. the builder generating targets based off a hybrid descriptor-and-
     element scheme, and
  2. an awkward handling of back-references which precludes doing much
     more than DROP statements correctly.

This change therefore consists of, respectively:
  1. rewriting the builder to hide descriptors from the `scbuildstmt`
     package, instead exposing sets of elements, and
  2. reworking the element model definition by getting rid of
     back-reference elements entirely.

In support of (1) this commit introduces a new `scdecomp` package which,
given a descriptor, will walk a visitor function over all its
constituent elements. This itself is made quite simple by (2) which
largely removes the need to look up referenced descriptors outside of
some mild edge-cases.

This `scdecomp` package is used by the backing structs of the
`scbuildstmt` dependency interfaces to collect a "state-of-the-world"
snapshot of elements upon which the schema change target elements are
layered. This approach lends itself well to caching as the descriptors
remain immutable in the builder.

The rewrite of most of `elements.proto` in support of (2) implies a slew
of cascading changes:
 - the attributes in `screl` need to be adjusted,
 - the lifecycles of the elements in `opgen` have to be adjusted,
 - the dependency rules and no-op rules need to be adjusted,
 - in turn, new operations need to be defined in `scop`, and,
 - in turn, these operations need to be implemented in `scmutationexec`.

Elements are now responsible for updating any back-references that
correspond to their forward references, which effectively pushed that
complexity into these reference update ops in `scmutationexec`. These
have to operate on a best-effort basis, because there are no longer
back-reference elements with dependency rules to enforce convenient
assumptions about the context of their adding or removal. This is
arguably not a bad thing per-se: this new paradigm is "fail hard, fail
fast" which surfaces bugs a lot more quickly than a badly-written
dependency rule would.

The rules themselves fall into cleaner patterns. This commit provides some
tools to express the most common of these. This commit also unifies the
`deprules` and `scopt` packages into a common `rules` package with full
data-driven test coverage.

Other changes in this commit are peripheral in nature:
  - Working on this change surfaced some deficiencies in the
    cross-referenced descriptor validation logic: we checked that the
    referenced descriptor exists but not that it's not dropped. This
    commit fixes this.
  - The expression validation logic in `schemaexpr` quite reasonably
    used to assume that columns can be found in descriptors;
    unfortunately the builder needs to be able to use this for columns
    which only exist as elements. This commit refactors the entry points
    into this validation logic as a result.
  - Quality-of-life improvements like adding a testing knob used to
    panic TestRollback with an error with a full stack trace when the
    rollback fails.
  - Because back-references don't exist as elements anymore, they also
    don't exist as targets, so we now have schema changer jobs linked to
    descriptors for which there are zero targets. This commit addresses this
    by simply allowing it. This is necessary to lock descriptors to
    prevent any concurrent schema changes which would affect them.

Release note: None


76793: storage: introduce guaranteed durability functionality r=jbowens a=sumeerbhola

This is the CockroachDB plumbing for Pebble's
IterOptions.OnlyReadGuaranteedDurable. It is for use in
the raftLogTruncator #76215.

Since most of the exported interfaces in the storage
package use a Reader, we support this via a
DurabilityRequirement parameter on Engine.NewReadOnly,
and not via an iterator option.

There is also a RegisterFlushCompletedCallback method
on Engine which will be used to poll certain durable
state in the raftLogTruncator.

Other than the trivial plumbing, this required some
refactoring of the Reader.MVCCGet* code for Pebble
and pebbleReadOnly. Even though it is deprecated and
primarily/only used in tests, we don't want to have
the durability semantics diverge.

Release note: None

76835: ttljob: add controls to pause TTL jobs r=rafiss a=otan

See individual commits for details.

76901: colexecspan: de-templatize span assembler and use span.Splitter r=RaduBerinde a=RaduBerinde

#### colexecspan: de-templatize span assembler

The span assembler code is generated only to inline a piece of code
that has two variants. This change converts it to non-generated code
and simply forks the code paths above the batch loop.

Release note: None

#### colexecspan: use span.Splitter

The span assembler duplicates some of the logic in `span.Splitter`.
Now that the latter is a separate type, we can use it instead.

Release note: None


Co-authored-by: Marius Posta <[email protected]>
Co-authored-by: sumeerbhola <[email protected]>
Co-authored-by: Oliver Tan <[email protected]>
Co-authored-by: Radu Berinde <[email protected]>
maryliag pushed a commit to maryliag/cockroach that referenced this pull request Feb 28, 2022
RajivTS pushed a commit to RajivTS/cockroach that referenced this pull request Mar 6, 2022