kvserver: cancel consistency checks more reliably #86883

Merged

Conversation

@pav-kv pav-kv commented Aug 25, 2022

This PR increases the chance of propagating the cancelation signal to replicas,
to prevent them from running abandoned consistency check tasks. Specifically:

  • The computation is aborted if the collection request is canceled.
  • The computation is not started if the collection request gave up recently.
  • The initiator runs all requests in parallel to reduce asynchrony, and to be
    able to cancel all the requests explicitly, instead of skipping some of them.

Background

Consistency checks are initiated by ComputeChecksum command in the Raft log,
and run until completion under a background context. The result is collected by
the initiator via the CollectChecksum long poll. The task is synchronized with
the collection handler via the map of replicaChecksum structs.
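
For orientation, here is a rough sketch of that tracker; the field names are inferred from the review discussion below and the stubbed response type is illustrative only, so treat it as an approximation of the pre-PR shape rather than the actual definition.

package sketch

import (
	"context"
	"time"
)

// CollectChecksumResponse is stubbed for illustration; the real message is a
// protobuf in the kvserver package.
type CollectChecksumResponse struct{ Checksum []byte }

// replicaChecksum (approximate): one entry per in-flight consistency check.
// The Replica keeps these in a map keyed by checksum ID, guarded by Replica.mu.
type replicaChecksum struct {
	started bool               // whether the computation task has started
	ctx     context.Context    // background context the computation runs under
	cancel  context.CancelFunc // aborts the computation task
	notify  chan struct{}      // closed once the result below is populated
	result  CollectChecksumResponse

	// Entries are retained until this time so that a late-arriving
	// CollectChecksum request can still pick up the result.
	gcTimestamp time.Time
}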

Currently, the replica initiating the consistency check sends a collection
request to itself first, and only then to other replicas in parallel. This
results in substantial asynchrony on the receiving replica, between the request
handler and the computation task. The current solution to that is keeping the
checksum computation results in memory for replicaChecksumGCInterval to return
them to late-arriving requests. However, there is no symmetry here: if the
computation starts late instead, it doesn't learn about a previously failed request.

The reason why the initiator blocks on its local checksum first is that it
computes the "master checksum", which is then added to all other requests.
However, this field is only used by the receiving end to log an inconsistency
error. The actual killing of this replica happens in the second phase of the
protocol, after the initiating replica commits another Raft message with the
Terminate field populated. So, there is no strong reason to keep this blocking
behaviour.

When the CollectChecksum handler exits due to a canceled context (for example,
the request timed out, or the remote caller crashed), the background task
continues to run. If it was not running, it may start in the future. In both
cases, the consistency checks pool (which has a limited size and processing
rate) spends resources on running dangling checks, and rejects useful ones.

If the initiating replica fails to compute its local checksum, it does not send
requests (or any indication to cancel) to other replicas. This is problematic
because the checksum tasks will be run on all replicas, which opens the
possibility for accumulating many such dangling checks.


Part of #77432

Release justification: performance and stability improvement

Release note (bug fix): A consistency check is now skipped/stopped when its
remote initiator gives up on it. Previously, such checks would still attempt to
run and, due to the limited size of the worker pool, prevent useful checks from
running. In addition, consistency check requests are now sent in parallel, and
the cancelation signal propagates more reliably.

@pav-kv pav-kv requested review from erikgrinaker and tbg August 25, 2022 16:33
@pav-kv pav-kv requested a review from a team as a code owner August 25, 2022 16:33
@cockroach-teamcity

This change is Reviewable

pav-kv commented Aug 25, 2022

This forked off of #86591. Ready for review, PTAL.

pav-kv commented Aug 25, 2022

Review commit by commit might be best.

@pav-kv pav-kv force-pushed the cancel_consistency_checks_on_badness branch from d9a6a25 to 1a2dd2c on August 26, 2022 10:27
pav-kv commented Aug 26, 2022

  • TODO in this PR: increase the "wait for computation start" timeout from 5s to 10s or so, to account for the fact that followers previously had more time to catch up on the Raft log before the long poll arrived; now the polls are sent immediately and in parallel.

@tbg tbg left a comment


Reviewed 3 of 3 files at r1, 2 of 2 files at r2, 1 of 1 files at r3, 2 of 2 files at r4, 3 of 3 files at r5, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @pavelkalinnikov)


-- commits line 34 at r4:
doesn't really reject useful ones, but makes them really slow since they have to share a lot of resources.


-- commits line 49 at r4:
This isn't exactly true, right? When the remote call never arrives, the checksum computation will run to completion. So this commit without the next one isn't going to make things better. That's fine as we discussed (once the computation is started in parallel, the case in which a call never arrives before the caller times out is hopefully rare), but let's be precise. (We didn't want the receiver to also cancel itself after 5s because then we get the mixed-version concern when the caller still processes itself synchronously).


-- commits line 76 at r5:
Good explanation.


pkg/kv/kvserver/replica_consistency.go line 453 at r4 (raw file):

// getReplicaChecksum creates a replicaChecksum tracker for the given ID, or
// gets the existing one. The start flag indicates whether the computation is
// starting, so the record is updated accordingly.

Mention the contract to call at most once with start=true for a given ID.


pkg/kv/kvserver/replica_consistency.go line 472 at r4 (raw file):

	c, ok := r.mu.checksums[id]
	if !ok {
		taskCtx, taskCancel := context.WithCancel(context.Background())

r.store.Stopper().WithCancelOnQuiesce(r.AnnotateCtx(context.Background())) 🥪
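
To illustrate the suggestion, here is a minimal, self-contained sketch of a context that is canceled either explicitly or when a quiesce channel closes; quiesceCtx below is a toy stand-in for the Stopper's WithCancelOnQuiesce, not the real API.

package main

import (
	"context"
	"fmt"
)

// quiesceCtx mimics the shape of Stopper.WithCancelOnQuiesce: the returned
// context is canceled when cancel is called or when quiesce is closed.
func quiesceCtx(parent context.Context, quiesce <-chan struct{}) (context.Context, context.CancelFunc) {
	ctx, cancel := context.WithCancel(parent)
	go func() {
		select {
		case <-quiesce:
			cancel()
		case <-ctx.Done():
		}
	}()
	return ctx, cancel
}

func main() {
	quiesce := make(chan struct{})
	taskCtx, taskCancel := quiesceCtx(context.Background(), quiesce)
	defer taskCancel()

	close(quiesce) // simulate the store stopper quiescing
	<-taskCtx.Done()
	fmt.Println("task context canceled:", taskCtx.Err())
}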


pkg/kv/kvserver/replica_consistency.go line 481 at r4 (raw file):

	} else if start {
		if c.started {
			log.Fatalf(ctx, "attempted to start checksum computation twice for ID %s", id)

Not your code, I know, but could we return an error from this method instead? We ought to avoid unnecessary fatals as a matter of principle; if there were ever an ill-timed replay or something the like, we shouldn't have it have a blast radius larger than necessary. There is no need to shut down the node here. (We can't just ignore this problem since otherwise we probably double-close c.notify)
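
A minimal sketch of the error-return alternative, assuming the errors.AssertionFailedf helper from cockroachdb/errors; the tracker type and method here are hypothetical, not the PR's code.

package main

import (
	"fmt"

	"github.com/cockroachdb/errors"
)

// tracker models the "start a checksum computation at most once" contract.
type tracker struct{ started bool }

// start returns an assertion-failure error instead of fatally crashing the
// process when the contract is violated.
func (t *tracker) start(id string) error {
	if t.started {
		return errors.AssertionFailedf("checksum computation started twice for ID %s", id)
	}
	t.started = true
	return nil
}

func main() {
	t := &tracker{}
	fmt.Println(t.start("some-id")) // <nil>
	fmt.Println(t.start("some-id")) // assertion error, no node shutdown
}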


pkg/kv/kvserver/replica_consistency.go line 489 at r4 (raw file):

}

func (r *Replica) releaseReplicaChecksum(ctx context.Context, id uuid.UUID, gcAt time.Time) {

I stumbled over this method name a bit. release somehow suggests that we "hold a reference" which we don't. This simply updates the gc timestamp. updateReplicaChecksumExpiration?


pkg/kv/kvserver/replica_consistency.go line 493 at r4 (raw file):

	defer r.mu.Unlock()
	if c, ok := r.mu.checksums[id]; ok {
		c.gcTimestamp = gcAt

Should we pick the smaller of the existing timestamps? At least add a comment about what constellations are expected here.


pkg/kv/kvserver/replica_consistency.go line 494 at r4 (raw file):

	if c, ok := r.mu.checksums[id]; ok {
		c.gcTimestamp = gcAt
		r.mu.checksums[id] = c

This write-back would be unnecessary if the map were map[uuid.UUID]*replicaChecksum. I'd be in favor of that change; having to write back is an invitation for bugs.

Also, having the accessors sit on *Replica is a bit annoying; we just didn't know better when we first wrote this stuff.

I'll defer to your judgment on whether this is worth your while right now, but we could add a commit that replaces the (Replica).mu.checksums field with

type consistencyCheckMap map[uuid.UUID]*replicaChecksum

func (ccm consistencyCheckMap) setExpiration(id uuid.UUID, gcAt time.Time) { /* ... */ }

func (ccm consistencyCheckMap) getOrCreate(/* ... */)

then you can also have true unit tests for the logic, etc, and it doesn't clutter *Replica by being better encapsulated within.

This is a reasonable clean-up in my eyes (I've done quite a few like this over the last release) but if you're not so sure I won't press it.
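
A rough expansion of that sketch, assuming pointer entries; the struct here is pared down to the one field needed for illustration and is not the actual replicaChecksum definition.

package sketch

import (
	"time"

	"github.com/cockroachdb/cockroach/pkg/util/uuid"
)

// replicaChecksum is reduced to the GC field for this illustration.
type replicaChecksum struct {
	gcTimestamp time.Time
}

type consistencyCheckMap map[uuid.UUID]*replicaChecksum

// setExpiration updates the GC timestamp of an existing entry, if any.
// Because entries are pointers, no write-back into the map is needed.
func (ccm consistencyCheckMap) setExpiration(id uuid.UUID, gcAt time.Time) {
	if c, ok := ccm[id]; ok {
		c.gcTimestamp = gcAt
	}
}

// getOrCreate returns the entry for id, creating it if it does not exist.
func (ccm consistencyCheckMap) getOrCreate(id uuid.UUID) *replicaChecksum {
	c, ok := ccm[id]
	if !ok {
		c = &replicaChecksum{}
		ccm[id] = c
	}
	return c
}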


pkg/kv/kvserver/replica_consistency.go line 505 at r4 (raw file):

func (r *Replica) getChecksum(ctx context.Context, id uuid.UUID) (CollectChecksumResponse, error) {
	now := timeutil.Now()
	c := r.getReplicaChecksum(ctx, id, now, false)

false /* start */ is somewhere in our guidelines though of course Goland already shows the var name inline. (So it's mostly for reviews I guess).


pkg/kv/kvserver/replica_consistency.go line 514 at r4 (raw file):

	// All the code paths below must c.cancel(), except when we received the
	// computation result (successful or not). In the latter case we allow the
	// task to stop gracefully, so we can't simply defer c.cancel() here.

A robust way to do this would be

maybeCancel := c.cancel
defer func() {
  maybeCancel() // wrapped func allows overriding maybeCancel below
}()

if err := foo(); err != nil {
  return err // will cancel
}

if err := bar(); err != nil {
  return err // will cancel
}

maybeCancel = func() {} // got a result!
return nil

pkg/kv/kvserver/replica_consistency.go line 523 at r4 (raw file):

	}
	// If the checksum has not completed, commit to waiting the full deadline.
	if !computed {

It's a bit odd that there is a computed path here. If computed comes back as true, shouldn't we get the same result if we enter the checksumWait branch anyway, just without blocking? I think this would simplify matters by avoiding some conditionals here. There is no medal for avoiding an extra non-blocking select. (I know, not your code originally)


pkg/kv/kvserver/replica_consistency.go line 801 at r4 (raw file):

	stopper := r.store.Stopper()

	c := r.getReplicaChecksum(ctx, cc.ChecksumID, timeutil.Now(), true)

/* start */


pkg/kv/kvserver/replica_consistency.go line 802 at r4 (raw file):

	c := r.getReplicaChecksum(ctx, cc.ChecksumID, timeutil.Now(), true)
	// All the code paths below must call computeChecksumDone and c.cancel(). We

Ditto for the pattern


pkg/kv/kvserver/replica_consistency.go line 924 at r4 (raw file):

	}); err != nil {
		defer snap.Close()

This defer shouldn't be there.


pkg/kv/kvserver/replica_consistency.go line 927 at r4 (raw file):

		c.cancel()
		r.computeChecksumDone(ctx, cc.ChecksumID, nil, nil)
		log.Errorf(ctx, "could not run async checksum computation (ID = %s): %v", cc.ChecksumID, err)

I know not your code, but I doubt this is worth logging. This is just the error branch of RunAsyncTask, and an error is returned only if the stopper is quiescing, i.e. the surrounding node is shutting down. So this is just going to produce spurious logging during some unit tests and not much else.


pkg/kv/kvserver/replica_consistency.go line 56 at r5 (raw file):

// check could take a long time due to rate limiting and range size. Now, all
// requests are sent immediately in parallel, so we can remove the GC altogether
// and synchronize the request with the task.

You could mention the release timing here. 22.1 waits for the first check to complete, so when interacting with 22.1 we want to be lenient with the timeouts.
If the parallelism commit makes it into 22.2.0 (note the 0) then the code that will be 23.1 (master today) can be aggressive. If parallelism is only in 22.2.N (N>0) then technically we can't make that assumption, though I would be ready to throw a few consistency checks against 22.2.0 under the bus since upgrades from that version into 23.1 are rare anyway and the worst case is bounded.


pkg/kv/kvserver/replica_consistency.go line 352 at r5 (raw file):

		RangeID:            r.RangeID,
		ChecksumID:         id,
		Checksum:           nil, // TODO(pavelkalinnikov): Deprecate this field.

You could throw out this field in an extra commit, vs writing the TODO. I see below that the receiver of this proto already checks whether the field is set. So you can just remove the field from the proto and be done with it.


pkg/kv/kvserver/replica_consistency.go line 381 at r5 (raw file):

	}

	replicas := r.Desc().Replicas().Descriptors()

nit: try to get localReplica from replicas, simply because there isn't any synchronization between this line and the earlier call r.GetReplicaDescriptor; the replica descriptor might've changed. The only consequential change could be that "self" was removed in which case the replica will now be destroyed, so this change is mostly cosmetic, but it is better to tighten it anyway.


pkg/kv/kvserver/replica_consistency.go line 406 at r5 (raw file):

			},
		); err != nil {
			// If we can't start tasks, the node is likely draining. Return the error

I think we'd want to cancel everyone in this case, since you don't want the node shutdown to block on some remote RPC. Consider adding WithCancelOnQuiesce to the context used in this method if it isn't there already.


pkg/kv/kvserver/replica_consistency_test.go line 108 at r3 (raw file):

	tc.repl.mu.Unlock()
	rc, err = tc.repl.getChecksum(ctx, id)
	require.ErrorContains(t, err, "context deadline exceeded")

nit: here we should ideally check for require.True(t, errors.Is(err, context.DeadlineExceeded))


pkg/kv/kvserver/stores_server.go line 75 at r5 (raw file):

				//
				// TODO(pavelkalinnikov): The Checksum field is no longer populated.
				// Drop this check, or move it to where the ComputeChecksum.Terminate

Let's drop it, not worth it.

Also can't you do this now? The code below shows that 22.1 nodes can already handle the case where it's not set.

@tbg tbg left a comment


:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @erikgrinaker and @pavelkalinnikov)

@erikgrinaker erikgrinaker left a comment


Very nice work! I basically agree with Tobi's comments here.

Reviewed 3 of 3 files at r1, 2 of 2 files at r2, 1 of 1 files at r3, 2 of 2 files at r4, 3 of 3 files at r5, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @pavelkalinnikov and @tbg)


-- commits line 34 at r4:

Previously, tbg (Tobias Grieger) wrote…

doesn't really reject useful ones, but makes them really slow since they have to share a lot of resources.

Nah, we do reject them beyond 7 concurrent checks (consistencyCheckAsyncConcurrency). I'm hoping we can remove that limit once we land this work, but let's do that later.


-- commits line 76 at r5:

Previously, tbg (Tobias Grieger) wrote…

Good explanation.

+1, nice writeup.


pkg/kv/kvserver/replica_consistency.go line 475 at r4 (raw file):

		c = replicaChecksum{
			ctx:     taskCtx,
			cancel:  taskCancel,

I'm slightly uneasy about storing a context here, since it's considered a bit of an anti-pattern. Then again, we do this in a few other places already, and it gets the job done. We could perhaps consider either using a context given by computeChecksumPostApply, hiding the inner context behind a more general API, or use some other primitive -- but I won't press for it, this is fine too.


pkg/kv/kvserver/replica_consistency.go line 481 at r4 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Not your code, I know, but could we return an error from this method instead? We ought to avoid unnecessary fatals as a matter of principle; if there were ever an ill-timed replay or something the like, we shouldn't have it have a blast radius larger than necessary. There is no need to shut down the node here. (We can't just ignore this problem since otherwise we probably double-close c.notify)

+1, would really prefer an errors.AssertionFailedf error return path here.


pkg/kv/kvserver/replica_consistency.go line 388 at r5 (raw file):

	ctx, cancel := context.WithCancel(ctx)

	defer close(resultCh) // close the channel only after

nit: we don't really need to close the channel here in this case, since there is no reader anymore (the for-loop can't still be running)


pkg/kv/kvserver/replica_consistency.go line 406 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

I think we'd want to cancel everyone in this case, since you don't want the node shutdown to block on some remote RPC. Consider adding WithCancelOnQuiesce to the context used in this method if it isn't there already.

Won't the defer cancel() do just that?

@pav-kv pav-kv left a comment


Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @erikgrinaker, @pavelkalinnikov, and @tbg)


pkg/kv/kvserver/replica_consistency.go line 406 at r5 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Won't the defer cancel() do just that?

Yep, the idea was that all exit paths will have cancel()+wg.Wait(), so we are covered for the case when err != nil here. An additional WithCancelOnQuiesce would be nice too, for the case when all tasks have started and quiescing happened mid-RPC.

@pav-kv pav-kv force-pushed the cancel_consistency_checks_on_badness branch 6 times, most recently from 570daa4 to 07024c9 on September 1, 2022 16:39
@pav-kv pav-kv left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @erikgrinaker and @tbg)


-- commits line 34 at r4:

Previously, erikgrinaker (Erik Grinaker) wrote…

Nah, we do reject them beyond 7 concurrent checks (consistencyCheckAsyncConcurrency). I'm hoping we can remove that limit once we land this work, but let's do that later.

Yeah, there is a WaitSem: false, which drops tasks on the floor if the semaphore is full. We may be able to remove it when leaks are minimal, e.g. when we make the checksum computation require an active incoming collection request (in a follow-up PR).

Erik and I discussed previously that it would be nice to add some metrics here. This could further help support decisions on whether to block or drop requests, etc. Something like: the number of checksum computations initiated / started / dropped / returned.


-- commits line 49 at r4:

Previously, tbg (Tobias Grieger) wrote…

This isn't exactly true, right? When the remote call never arrives, the checksum computation will run to completion. So this commit without the next one isn't going to make things better. That's fine as we discussed (once the computation is started in parallel, the case in which a call never arrives before the caller times out is hopefully rare), but let's be precise. (We didn't want the receiver to also cancel itself after 5s because then we get the mixed-version concern when the caller still processes itself synchronously).

This was perhaps a wording issue; reworded.

What I meant is that the cancelation of RPCs that have arrived is now respected with this commit, which is strictly better than the status quo, regardless of the next commit (that's why I ordered this one first). The next commit increases the chance of RPCs making it through in the first place, and improves timing.


pkg/kv/kvserver/replica_consistency.go line 453 at r4 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Mention the contract to call at most once with start=true for a given ID.

Obsolete now.


pkg/kv/kvserver/replica_consistency.go line 472 at r4 (raw file):

Previously, tbg (Tobias Grieger) wrote…

r.store.Stopper().WithCancelOnQuiesce(r.AnnotateCtx(context.Background))) 🥪

Done.


pkg/kv/kvserver/replica_consistency.go line 475 at r4 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

I'm slightly uneasy about storing a context here, since it's considered a bit of an anti-pattern. Then again, we do this in a few other places already, and it gets the job done. We could perhaps consider either using a context given by computeChecksumPostApply, hiding the inner context behind a more general API, or use some other primitive -- but I won't press for it, this is fine too.

I made a refactoring after which storing the context is unnecessary.


pkg/kv/kvserver/replica_consistency.go line 481 at r4 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

+1, would really prefer an errors.AssertionFailedf error return path here.

Obsolete.


pkg/kv/kvserver/replica_consistency.go line 489 at r4 (raw file):

Previously, tbg (Tobias Grieger) wrote…

I stumbled over this method name a bit. release somehow suggests that we "hold a reference" which we don't. This simply updates the gc timestamp. updateReplicaChecksumExpiration?

Went with a slightly shorter setReplicaChecksumGC.


pkg/kv/kvserver/replica_consistency.go line 493 at r4 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Should we pick the smaller of the existing timestamps? At least add a comment about what constellations are expected here.

I can see it both ways. I think it's okay if the last one wins, we don't have to be super precise.
See the refactoring, we now always update to now.Add(gcInterval), so it's always a "touch" where the latest wins.


pkg/kv/kvserver/replica_consistency.go line 494 at r4 (raw file):

Previously, tbg (Tobias Grieger) wrote…

This write-back would be unnecessary if the map were map[uuid.UUID]*replicaChecksum. I'd be in favor of that change; having to write back is an invitation for bugs.

Also, having the accessors sit on *Replica is a bit annoying; we just didn't know better when we first wrote this stuff.

I'll defer to your judgment on whether this is worth your while right now, but we could add a commit that replaces the (Replica).mu.checksums field with

type consistencyCheckMap map[uuid.UUID]*replicaChecksum

func (ccm consistencyCheckMap) setExpiration(id uuid.UUID, gcAt time.Time) { /* ... */ }

func (ccm consistencyCheckMap) getOrCreate(/* ... */)

then you can also have true unit tests for the logic, etc, and it doesn't clutter *Replica by being better encapsulated with in.

This is a reasonable clean-up in my eyes (I've done quite a few like this over the last release) but if you're not so sure I won't press it.

SGTM, I added another prereq commit to this PR with a refactoring of a similar kind.

I did this in #86591 first, and wanted it to be a follow-up PR. I tried to postpone the refactoring so that we could land simple but important fixes first. It was hard to reason about though, so I decided to refactor first in this PR, and then build on top of it.


pkg/kv/kvserver/replica_consistency.go line 505 at r4 (raw file):

Previously, tbg (Tobias Grieger) wrote…

false /* start */ is somewhere in our guidelines though of course Goland already shows the var name inline. (So it's mostly for reviews I guess).

Not relevant now.


pkg/kv/kvserver/replica_consistency.go line 514 at r4 (raw file):

Previously, tbg (Tobias Grieger) wrote…

A robust way to do this would be

maybeCancel := c.cancel
defer func() {
  maybeCancel() // wrapped func allows overriding maybeCancel below
}()

if err := foo(); err != nil {
  return err // will cancel
}

if err := bar(); err != nil {
  return err // will cancel
}

maybeCancel = func() {} // got a result!
return nil

Nice suggestion. Done first, but then realised it can be simpler.


pkg/kv/kvserver/replica_consistency.go line 523 at r4 (raw file):

Previously, tbg (Tobias Grieger) wrote…

It's a bit odd that there is a computed path here. If computed comes back as true, shouldn't we get the same result if we enter the checksumWait branch anyway, just without blocking? I think this would simplify matters by avoiding some conditionals here. There is no medal for avoiding an extra non-blocking select. (I know, not your code originally)

Agreed. Originally I addressed this in a follow-up PR, but now I pushed it to this PR.


pkg/kv/kvserver/replica_consistency.go line 801 at r4 (raw file):

Previously, tbg (Tobias Grieger) wrote…

/* start */

Not relevant now.


pkg/kv/kvserver/replica_consistency.go line 802 at r4 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Ditto for the pattern

Done.


pkg/kv/kvserver/replica_consistency.go line 924 at r4 (raw file):

Previously, tbg (Tobias Grieger) wrote…

This defer shouldn't be there.

Do you suggest removing defer (seems fine), or removing the whole defer snap.Close() (not clear to me why we can safely do it)?

We grabbed the snapshot in this func above, so we attempt to close it in the async task, or, if the task is not started, here. If we don't close it here, wouldn't it leak?
Since this statement pre-existed, and is not clear-cut, I would refrain from removing it in this PR, but we can remove it in another one if needed?


pkg/kv/kvserver/replica_consistency.go line 927 at r4 (raw file):

Previously, tbg (Tobias Grieger) wrote…

I know not your code, but I doubt this is worth logging. This is just the error branch of RunAsyncTask, and an error is returned only if the stopper is quiescing, i.e. the surrounding node is shutting down. So this is just going to produce spurious logging during some unit tests and not much else.

Removed.


pkg/kv/kvserver/replica_consistency.go line 56 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

You could mention the release timing here. 22.1 waits for the first check to complete, so when interacting with 22.1 we want to be lenient with the timeouts.
If the parallelism commit makes it into 22.2.0 (note the 0) then the code that will be 23.1 (master today) can be aggressive. If parallelism is only in 22.2.N (N>0) then technically we can't make that assumption, though I would be ready to throw a few consistency checks against 22.2.0 under the bus since upgrades from that version into 23.1 are rare anyway and the worst case is bounded.

Done. Better?


pkg/kv/kvserver/replica_consistency.go line 352 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

You could throw out this field in an extra commit, vs writing the TODO. I see below that the receiver of this proto already checks whether the field is set. So you can just remove the field from the proto and be done with it.

Done.


pkg/kv/kvserver/replica_consistency.go line 381 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

nit: try to get localReplica from replicas, simply because there isn't any synchronization between this line and the earlier call r.GetReplicaDescriptor; the replica descriptor might've changed. The only consequential change could be that "self" was removed in which case the replica will now be destroyed, so this change is mostly cosmetic, but it is better to tighten it anyway.

Done.


pkg/kv/kvserver/replica_consistency.go line 388 at r5 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

nit: we don't really need to close the channel here in this case, since there is no reader anymore (the for-loop can't still be running)

Agree. This would remove the Haiku though :(


pkg/kv/kvserver/replica_consistency_test.go line 108 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

nit: here we should ideally check for require.True(t, errors.Is(err, context.DeadlineExceeded))

Made it a require.ErrorIs


pkg/kv/kvserver/stores_server.go line 75 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Let's drop it, not worth it.

Also can't you do this now? The code below shows that 22.1 nodes can already handle the case where it's not set.

Done.

@pav-kv pav-kv left a comment


Thanks for the reviews! PTAL, this is ready for another round. I think all commits are in good shape now, but the last one needs more consideration. I still don't have a good feel for when the long-poll must or must not return a Snapshot.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @erikgrinaker and @tbg)

@pav-kv pav-kv requested review from tbg and erikgrinaker September 1, 2022 17:26
@tbg tbg left a comment


Reviewed 4 of 4 files at r6, 1 of 1 files at r7, 2 of 2 files at r8, 1 of 1 files at r9, 3 of 3 files at r10, 4 of 4 files at r11, 2 of 2 files at r12, 2 of 2 files at r13, 4 of 4 files at r14, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @erikgrinaker and @pavelkalinnikov)


pkg/kv/kvserver/replica_consistency.go line 924 at r4 (raw file):

Previously, pavelkalinnikov (Pavel Kalinnikov) wrote…

Do you suggest removing defer (seems fine), or removing the whole defer snap.Close() (not clear to me why we can safely do it)?

We grabbed the snapshot in this func above, so attempting to close it in the async task, or, if the task is not started, here. If we don't close it here, wouldn't it leak?
Since this statement pre-existed, and is not a clear cut, I would refrain from removing it in this PR, but we can remove it in another one if needed?

I meant that it's odd that we're deferring it, rather than just calling snap.Close().


pkg/kv/kvserver/replica_consistency.go line 927 at r4 (raw file):

Previously, pavelkalinnikov (Pavel Kalinnikov) wrote…

Removed.

Now you're returning the error but the caller calls log.Error, doesn't it? The caller could avoid this logging if errors.Is(err, stop.ErrUnavailable).


pkg/kv/kvserver/replica_consistency.go line 550 at r11 (raw file):

) {
	// TODO(pavelkalinnikov): Communicate through the replicaChecksum directly,
	// without using r.mu. E.g. send the CollectChecksumResponse through c.notify.

The typical pattern is

type foo struct {
  notify <-chan struct{}
  result something
}

and result is written by whoever closes notify, before they do. So consumers can

select {
  case <-c.notify:
    // continue
  case <-ctx.Done():
    return ctx.Err()
}

fmt.Println("hey look here: ", c.result)

so you could probably reinterpret what is happening here under this pattern.

But since the result is only consumed once here we can also send it through the channel, that definitely avoids a class of data races.

Seeing that you're sending to the channel, looks good.
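
For reference, a generic, self-contained sketch of the "send the result through the channel" variant; the result type and function names are made up for illustration and do not mirror the PR's code.

package main

import (
	"context"
	"fmt"
	"time"
)

// result stands in for CollectChecksumResponse.
type result struct{ checksum string }

// compute pretends to run the checksum task and delivers its result through
// a buffered channel, so it never blocks even if nobody is waiting anymore.
func compute(resultCh chan<- result) {
	time.Sleep(10 * time.Millisecond)
	resultCh <- result{checksum: "deadbeef"}
}

// collect waits for the result or gives up when the context is done.
func collect(ctx context.Context, resultCh <-chan result) (result, error) {
	select {
	case r := <-resultCh:
		return r, nil
	case <-ctx.Done():
		return result{}, ctx.Err()
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	resultCh := make(chan result, 1) // capacity 1: the sender never blocks
	go compute(resultCh)

	r, err := collect(ctx, resultCh)
	fmt.Println(r, err)
}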


pkg/kv/kvserver/replica_consistency.go line 461 at r12 (raw file):

		// is the second request for this ID, then both got the hold of it, and we
		// no longer need to keep it in the map for GC.
		delete(r.mu.checksums, id)

This sure seems convenient but also brittle. Now a call to getReplicaChecksum has the opaque side effect of deleting the checksum from the map if it already exists. It's a very, very subtle refcounting scheme. The alternative is actual refcounting, which with the channels you have there right now can't be implemented, and I don't think there's a large benefit in adding more. But let's rename this method to createOrDetachReplicaChecksum so that the semantics are more obvious.


pkg/kv/kvserver/replica_consistency_test.go line 108 at r3 (raw file):

Previously, pavelkalinnikov (Pavel Kalinnikov) wrote…

Made it a require.ErrorIs

That's fine here, just a heads up that it doesn't always work, since require.ErrorIs doesn't use cockroachdb/errors and thus can't type match all errors. (I don't know the specifics but I know it is an issue).

tbg commented Sep 5, 2022

but the last one needs more consideration. I still don't have a good feel for when the long-poll must or must not return a Snapshot.

The interesting bits of code are here. We have already run a consistency check, we are the consistency checker queue, and now we're trying to do a second round of consistency checks, that will

  • get us a diff to print (for a minority SHA node)
  • terminate the minority SHA node(s)

if args.WithDiff {
	// A diff was already printed. Return because all the code below will do
	// is request another consistency check, with a diff and with
	// instructions to terminate the minority nodes.
	log.Errorf(ctx, "consistency check failed")
	return resp, nil
}
// No diff was printed, so we want to re-run with diff.
// Note that this recursive call will be terminated in the `args.WithDiff`
// branch above.
args.WithDiff = true
args.Checkpoint = true
for _, idxs := range shaToIdxs[minoritySHA] {
	args.Terminate = append(args.Terminate, results[idxs].Replica)
}

The last commit makes it such that all nodes return the contents of their replica. Before, the nodes would be given the SHA the leaseholder had already computed for itself, and if it matched the result of their own computation, they'd deduce that their diff wasn't needed, and elide the snapshot.

Instead what we should be doing is to set the WithDiff option only on the minoritySHA node in the first place, i.e. just mirror what we already do for Terminate.

As a follow-up, when we re-work the diff mechanism to be offline, instead we will only tell the minority nodes to checkpoint and terminate. But, we will also need a checkpoint off one of the good nodes, so we will probably also ask the local node (I think that's guaranteed to be idx zero) to also take a checkpoint (but not terminate). That way, we end up with a checkpoint of the "good" SHA, and (at least one) checkpoint of the "bad" SHA, and can write a tool that computes the diff.

pav-kv commented Sep 5, 2022

Instead what we should be doing is to set the WithDiff option only on the minoritySHA node in the first place, i.e. just mirror what we already do for Terminate.

Sounds doable. One issue with this approach is that the second check is almost completely independent of the first one: it will be run on a different snapshot. It's possible that it fails on a different set of replicas (or even succeeds), so snapshots from only the originally inconsistent replicas won't be enough. How about collecting snapshots from all replicas during the second round?

tbg commented Sep 5, 2022

It's possible that it fails on a different set of replicas (or even succeeds), so the snapshots only from the original inconsistent replicas won't be enough. How about collecting snapshots from all replicas during the second round?

I have yet to see that in practice but yes, it's possible. Getting snapshots on all replicas sounds good, this might also be helpful when investigating, since then there are a few more LSMs to poke at.

Control request cancelation in the event of the store stopper quiescing higher
up the stack, for convenience.

Release justification: part of a performance improvement PR
Release note: None
@pav-kv pav-kv force-pushed the cancel_consistency_checks_on_badness branch from 07024c9 to bacd591 on September 7, 2022 12:10
The checksum computation can take a long time, so if the store quiesces, it's
better to cancel it.

Release justification: part of a performance improvement PR
Release note: None
Release justification: part of a performance improvement PR
Release note: None
Release justification: part of a performance improvement PR
Release note: None
Return an error from Replica.computeChecksumPostApply for better introspection
in tests, and to avoid lengthy log messages in favor of composing them.

Release justification: part of a performance improvement PR
Release note: None
This refactoring makes the replica.checksums map store *replicaChecksum pointers
instead of values. This way we can modify the entries directly without doing
another map round trip, which was error-prone and required handling situations
where an entry was in the map and then disappeared.

We still need to lock Replica.mu for reading/writing the entries, but we can
avoid this too by putting sync primitives in the entry itself (see the
follow-up commit).

Release justification: part of a performance improvement PR
Release note: None
@pav-kv pav-kv force-pushed the cancel_consistency_checks_on_badness branch from bacd591 to 4daec23 on September 7, 2022 12:16
@pav-kv pav-kv left a comment


Done (I think).

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @erikgrinaker and @tbg)


pkg/kv/kvserver/replica_consistency.go line 924 at r4 (raw file):

Previously, tbg (Tobias Grieger) wrote…

I meant that it's odd that we're deferring it, rather than just calling snap.Close().

Ah, right. De-deferred it. One difference is that a deferred statement is guaranteed to run even if there is a panic before the func exits. I don't know if that was the intention, WDYT? But since it was the first statement in this clause, it kinda did not benefit from this property anyway.


pkg/kv/kvserver/replica_consistency.go line 927 at r4 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Now you're returning the error but the caller calls log.Error, doesn't it? The caller could avoid this logging if errors.Is(err, stop.ErrUnavailable).

Aye. Done.


pkg/kv/kvserver/replica_consistency.go line 550 at r11 (raw file):

Previously, tbg (Tobias Grieger) wrote…

The typical pattern is

type foo struct {
  notify <-chan struct{}
  result something
}

and result is written by whoever closes notify, before they do. So consumers can

select {
  case <-c.notify:
    // continue
  case <-ctx.Done():
    return ctx.Err()
}

fmt.Println("hey look here: ", c.result)

so you could probably reinterpret what is happening here under this pattern.

But since the result is only consumed once here we can also send it through the channel, that definitely avoids a class of data races.

Seeing that you're sending to the channel, looks good.

Ack.


pkg/kv/kvserver/replica_consistency.go line 461 at r12 (raw file):

Previously, tbg (Tobias Grieger) wrote…

This sure seems convenient but also brittle. Now a call getReplicaChecksum has the opaque side effect of deleting the checksum from the map if it already exists. It's a very very subtle refcounting scheme. The alternative is actual refcounting, which with the channels you have there right now can't be implemented, and I don't think there's large benefit in adding more. But let's rename this method to createOrDetachReplicaChecksum so that the semantics are more obvious.

Moved all the deletion into a single gcReplicaChecksum method to avoid this side effect here, and explained the guarantees in detail in its comment.

@pav-kv pav-kv force-pushed the cancel_consistency_checks_on_badness branch 2 times, most recently from ca84f00 to f72d785 on September 7, 2022 13:02
@pav-kv pav-kv left a comment


@tbg ready for another review

pkg/kv/kvserver/stores_server.go (resolved discussion)
@tbg tbg left a comment


:lgtm:

Reviewed 8 of 8 files at r15, 1 of 1 files at r16, 2 of 2 files at r17, 1 of 1 files at r18, 3 of 3 files at r19, 4 of 4 files at r20, 2 of 2 files at r21, 2 of 2 files at r22, 4 of 4 files at r23, all commit messages.
Dismissed @erikgrinaker and @pavelkalinnikov from 2 discussions.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @erikgrinaker and @pavelkalinnikov)


pkg/kv/kvserver/stores_server.go line 54 at r23 (raw file):

Previously, pavelkalinnikov (Pavel Kalinnikov) wrote…

Previously this was always non-nil. Does this have to be the case?

No, and it's more idiomatic to return nil on error.

pav-kv commented Sep 7, 2022

@erikgrinaker LMK if you would like to take a final look. I may add a couple of tests and merge this PR by EOW.

tbg commented Sep 8, 2022

I don't think Erik needs to take another look, esp since he's out sick. Merge away!

tbg commented Sep 8, 2022

PS test failure isn't yours, see here

The replicaChecksum type helps bridge the checksum computation task and the
checksum collection request for a given computation ID. The code for the
lifecycle of replicaChecksum is scattered across replica_consistency.go, and is
difficult to understand.

This commit simplifies the semantics of replicaChecksum by using Go channels,
and stating the invariants on their state. It also minimizes the extensive
Replica mutex locking (which is used by nearly every part of the system) by
using channels for the communication between the task and the handler.

It also resolves the TODO/bug in which some replicaChecksum entries could stay
in the map forever if the request does not have a deadline.

Release justification: part of a performance improvement PR
Release note: None
Consistency checks are initiated by ComputeChecksum command in the Raft log,
and run until completion under a background context. The result is collected by
the initiator via the CollectChecksum long poll. The task is synchronized with
the collection handler via the map of replicaChecksum structs.

When the CollectChecksum handler exits due to a canceled context (for example,
the request timed out, or the remote caller crashed), the background task
continues to run. If it was not running, it may start in the future. In both
cases, the consistency checks pool (which has a limited size and processing
rate) spends resources on running dangling checks, and rejects useful ones.

This commit makes sure that abandoned checksum computation tasks are:
- stopped if the waiting collection request is canceled
- never started if there was a recent collection request that gave up

When starting, the checksum computation task first checks whether the
corresponding collection request has previously been abandoned. If so, the task
terminates early. Otherwise it starts and sends a cancel func through the
channel that it used to notify the collection handler, so that it can abort the
task when it abandons the request.

Release justification: performance and stability improvement

Release note (bug fix): A consistency check is now skipped/stopped when the
collection request is canceled before/while running the check computation.
Previously such checks would start and run until completion, and, due to the
limited size of the worker pool, prevent the useful checks from running.
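
A toy schematic of the handshake described above; the channel names (abandoned, started, result) are hypothetical and only illustrate the idea, not the actual replicaChecksum fields.

package main

import (
	"context"
	"fmt"
)

type checksumEntry struct {
	abandoned chan struct{}           // closed by the collection handler if it gives up early
	started   chan context.CancelFunc // the task hands its cancel func to the handler
	result    chan string             // the task delivers the checksum when done
}

func runTask(ctx context.Context, c *checksumEntry) {
	select {
	case <-c.abandoned:
		return // the collection request already gave up: never start the computation
	default:
	}
	ctx, cancel := context.WithCancel(ctx)
	defer cancel()
	c.started <- cancel // buffered: never blocks the task
	select {
	case c.result <- "deadbeef": // pretend we computed a checksum
	case <-ctx.Done(): // the handler canceled us mid-computation
	}
}

func main() {
	c := &checksumEntry{
		abandoned: make(chan struct{}),
		started:   make(chan context.CancelFunc, 1),
		result:    make(chan string, 1),
	}
	go runTask(context.Background(), c)

	cancel := <-c.started // the handler obtains the cancel func once the task starts
	defer cancel()        // would abort the task if the handler abandoned the request
	fmt.Println("checksum:", <-c.result)
}
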
Currently, the replica initiating the consistency check sends a collection
request to itself first, and only then to other replicas in parallel. This
results in substantial asynchrony on the receiving replica, between the
incoming CollectChecksum request and the checksum computation task started by
the ComputeChecksum message. The current solution to that is keeping the
checksum computation results in memory for replicaChecksumGCInterval to return
them to late-arriving requests.

The reason why the first checksum collection blocks the others is that it
computes the "master checksum", which is then added to all other requests.
However, this field is only used by the receiving end to log an inconsistency
error. The actual killing of this replica happens on the second phase of the
protocol, after the initiating replica commits another Raft message with the
Terminate field populated. So, there is no strong reason to keep this blocking
behaviour.

If the initiating replica fails to compute its local checksum, it does not send
requests to other replicas. This is problematic because the checksum tasks will
be run on all replicas, which opens the possibility for accumulating many such
dangling checks.

This commit makes all the checksum collection requests parallel. Benefits:

- There is less asynchrony between the sender and receiver, so we can drop the
  GC (in follow-up commits), and require an incoming request before starting
  the checksum computation task.

- All the outgoing collection requests are now explicitly canceled if the local
  computation fails. This way, the cancelation signal has more chance to
  propagate to all replicas and cancel the tasks that were started anyway.

Release justification: performance and stability improvement

Release note (bug fix): Consistency checks are now sent to all replicas in
parallel; previously the check was blocked on processing the local replica
first. This (a) reduces the latency of a single check by 2x, and (b) allows
better propagation of the cancelation signal, which results in fewer abandoned
tasks on remote replicas and more resources spent on useful checks.
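
As a generic illustration of the fan-out described in this commit message, here is a minimal sketch using errgroup; the real code uses the stopper's async tasks and real RPCs, so the names below (collectChecksum, the replica count) are placeholders.

package main

import (
	"context"
	"fmt"

	"golang.org/x/sync/errgroup"
)

// collectChecksum stands in for the CollectChecksum long poll.
func collectChecksum(ctx context.Context, replica int) error {
	select {
	case <-ctx.Done():
		return ctx.Err() // cancelation propagated from a sibling's failure
	default:
	}
	if replica == 0 {
		return fmt.Errorf("local checksum failed") // pretend the local computation fails
	}
	return nil
}

func main() {
	g, ctx := errgroup.WithContext(context.Background())
	for r := 0; r < 3; r++ {
		r := r
		// All collection requests are issued in parallel; the first error
		// cancels ctx, which cancels the remaining requests.
		g.Go(func() error { return collectChecksum(ctx, r) })
	}
	fmt.Println("result:", g.Wait())
}
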
@pav-kv pav-kv force-pushed the cancel_consistency_checks_on_badness branch from f72d785 to 6d8ace3 on September 9, 2022 00:24
pav-kv commented Sep 9, 2022

Added a couple of checks in tests for extra niceness. No review needed.

pav-kv commented Sep 9, 2022

bors r=tbg

craig bot commented Sep 9, 2022

Build succeeded:

@craig craig bot merged commit 203c078 into cockroachdb:master Sep 9, 2022
@pav-kv pav-kv deleted the cancel_consistency_checks_on_badness branch September 9, 2022 12:54
pav-kv commented Sep 12, 2022

blathers backport 22.2
