
kvserver: document reproposals #94633

Merged
1 commit merged into cockroachdb:master from the reproposals-doc branch on Feb 8, 2023

Conversation

tbg
Member

@tbg tbg commented Jan 3, 2023

Reproposals are a deep rabbit hole and an area in which past changes
were all related to subtle bugs. Write it all up and in particular make
some simplifications that ought to be possible if my understanding is
correct:

  • have proposals always enter (*Replica).propose without a
    MaxLeaseIndex or prior encoded command set, i.e. propose
    behaves the same for reproposals as for first proposals.
  • assert that after a failed call to tryReproposeWithNewLeaseIndex,
    the command is not in the proposals map, i.e. check absence of
    a leak.
  • replace code that should be impossible to reach (and had me confused
    for a good amount of time) with an assertion.
  • add long comment on r.mu.proposals.

This commit also moves tryReproposeWithNewLeaseIndex off (*Replica),
which is possible due to recent changes[^1]. In doing so, I realized
there was a (small) data race (now fixed): when returning a
NotLeaseholderError from that method, we weren't acquiring r.mu. It
may have looked as though we were holding it already since we're
accessing r.mu.propBuf, however that field has special semantics - it
wraps r.mu and acquires it when needed.

Epic: CRDB-220
Release note: None

Footnotes

  1. The "below raft" test mentioned in the previous comment was
    changed in https://github.com/cockroachdb/cockroach/pull/93785 and
    no longer causes a false positive.

@cockroach-teamcity
Member

This change is Reviewable

@tbg tbg force-pushed the reproposals-doc branch 3 times, most recently from 22fa6e3 to 1d4acc7 on January 5, 2023 10:18
@tbg tbg marked this pull request as ready for review January 5, 2023 10:18
@tbg tbg requested a review from a team January 5, 2023 10:18
@tbg tbg requested a review from a team as a code owner January 5, 2023 10:18
@tbg tbg requested review from pav-kv and nvanbenschoten January 5, 2023 10:18
Contributor

@andreimatei andreimatei left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten, @pavelkalinnikov, and @tbg)


pkg/kv/kvserver/replica.go line 506 at r1 (raw file):

		//    is not superseded, see below) and attach the value to the entry.
		// 2. for each entry:
		//				- stage written and in-memory effects of the entry (some may apply as no-ops

indentation


pkg/kv/kvserver/replica.go line 596 at r1 (raw file):

		// [^1]: https://github.com/cockroachdb/cockroach/blob/59ce13b6052a99a0318e3dfe017908ff5630db30/pkg/kv/kvserver/replica_raft.go#L1224
		// [^2]: https://github.com/cockroachdb/cockroach/blob/59ce13b6052a99a0318e3dfe017908ff5630db30/pkg/kv/kvserver/replica_application_result.go#L148
		// [^3]: it's debatable how useful this is. It was introduced in

what is "this"?

Member

@nvanbenschoten nvanbenschoten left a comment


Reviewed 6 of 6 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @pavelkalinnikov and @tbg)


pkg/kv/kvserver/replica.go line 523 at r1 (raw file):

		// We can always safely create an identical copy (i.e. (1)) because of the
		// replay protection conferred by the MaxLeaseIndex - all but the first
		// proposal will be rejected (i.e. apply as a no-op).

"all but the first proposal that successfully achieves consensus"


pkg/kv/kvserver/replica.go line 531 at r1 (raw file):

with various different MaxLeaseIndex values

Is it possible for there to be more than two different MaxLeaseIndex values in the unapplied log at a time? And can they be interleaved?


pkg/kv/kvserver/replica.go line 552 at r1 (raw file):

		// where we assume that the `MaxLeaseIndex` 100 is invalid, i.e. when we see the
		// first copy of the command being applied, we've already applied some command
		// with a higher `MaxLeaseIndex`. In a world without mechanism (2), `N` would

"an equal or higher"


pkg/kv/kvserver/replica.go line 553 at r1 (raw file):

		// first copy of the command being applied, we've already applied some command
		// with a higher `MaxLeaseIndex`. In a world without mechanism (2), `N` would
		// be rejected, and would finalize the proposal (i.e. signal the client with

Should we define "finalize" up above in a broader discussion of when entries are added to the proposals map, when they're updated, and when they're removed?


pkg/kv/kvserver/replica.go line 584 at r1 (raw file):

		// then can (2) engage. In addition, an entry that doesn't pass this equality
		// check must not signal the proposer and/or unlink from the proposals map (as a
		// newer reproposal is likely in the log)[^4].

"as a newer reproposal is likely in the log which may succeed"


pkg/kv/kvserver/replica_application_result.go line 259 at r1 (raw file):

	p.command.MaxLeaseIndex = 0
	p.encodedCommand = nil
	pErr := r.propose(ctx, p, tok.Move(ctx))

Replica.propose doesn't assign the MaxLeaseIndex synchronously, it only does when the proposal buffer is flushed. Are we ok with the command having a MaxLeaseIndex of 0 in the meantime? The need to reset the fields on error raises questions.


pkg/kv/kvserver/replica_application_state_machine.go line 236 at r1 (raw file):

			log.Fatalf(ctx, "command already applied: %+v; unexpected successful result", cmd)
		}
		// If any reproposals at a higher MaxLeaseIndex exist we know that they will

I don't understand this branch. It dates back to 0e81674, but I can't make sense of it — it sounds like it was just dead code. Glad you got rid of it and replaced it with an assertion.


pkg/kv/kvserver/replica_proposal.go line 135 at r1 (raw file):

// rejected and should not be reproposed.
//
// Note that some commands (such as leases) don't use MaxLeaseIndex.

"lease requests"

Lease transfers do use MaxLeaseIndex.

@tbg tbg force-pushed the reproposals-doc branch from 1d4acc7 to fcbc736 on January 13, 2023 10:54
Member Author

@tbg tbg left a comment


Thanks for the reviews! I rebased and added a fix-up commit, so if reviewable gets confused you might just want to use GH review for the last commit instead.

Dismissed @andreimatei from 2 discussions.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @pavelkalinnikov)


pkg/kv/kvserver/replica.go line 531 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

with various different MaxLeaseIndex values

Is it possible for there to be more than two different MaxLeaseIndex values in the unapplied log at a time? And can they be interleaved?

I think in practice we won't see too much of that but in theory anything goes, because you always have to assume that all previous incarnations are still floating around out there as raft reproposals that could reach the leader at any time & in any order. Updated the comment with some detail and simplified it a bit too.
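For illustration only (hypothetical log indexes and MaxLeaseIndex values), the unapplied portion of the log could at some moment contain copies of the same command under several distinct MaxLeaseIndex values, interleaved in arbitrary order:

    log idx 120: cmd X at MaxLeaseIndex 10000   (stale copy of the original, re-delivered by raft)
    log idx 121: cmd X at MaxLeaseIndex 10005   (stale copy of an earlier reproposal with a new lease index)
    log idx 122: cmd X at MaxLeaseIndex 10009   (latest reproposal with a new lease index)
    log idx 123: cmd X at MaxLeaseIndex 10000   (yet another stale copy arriving late)

At most one of these copies can pass the lease-applied-index check and apply; the rest are rejected as no-ops.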


pkg/kv/kvserver/replica.go line 596 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

what is "this"?

The below-raft reproposal mechanism, updated.


pkg/kv/kvserver/replica_application_result.go line 259 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Replica.propose doesn't assign the MaxLeaseIndex synchronously, it only does when the proposal buffer is flushed. Are we ok with the command having a MaxLeaseIndex of 0 in the meantime? The need to reset the fields on error raises questions.

The command is not in the map at this point, so I am not only ok with this, but intentionally want this. I want r.propose to not have to care about whether it sees a proposal for the first time or fifth time.
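A self-contained sketch of that idea (simplified, with stand-in types rather than the real `*Replica`/`ProposalData` signatures): the caller clears the proposal's identity before handing it to `propose`, and restores it only if `propose` fails.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Stand-ins for the real kvserver types; illustrative only.
type command struct{ MaxLeaseIndex uint64 }

type proposal struct {
	command        command
	encodedCommand []byte
}

type replica struct{ rejectNext bool }

// propose treats every proposal the same, whether it is the first or the
// fifth attempt: the caller has already cleared MaxLeaseIndex and the
// encoded command, and the proposal buffer would assign fresh values on
// flush.
func (r *replica) propose(ctx context.Context, p *proposal) error {
	if r.rejectNext {
		return errors.New("cannot repropose (e.g. closed timestamp)")
	}
	return nil
}

// tryRepropose resets the proposal's identity, re-proposes it, and on
// failure restores the previous state so the caller sees the proposal
// exactly as it was before the attempt.
func tryRepropose(ctx context.Context, r *replica, p *proposal) error {
	prevMLAI, prevEnc := p.command.MaxLeaseIndex, p.encodedCommand
	p.command.MaxLeaseIndex = 0
	p.encodedCommand = nil
	if err := r.propose(ctx, p); err != nil {
		p.command.MaxLeaseIndex = prevMLAI
		p.encodedCommand = prevEnc
		return err // the request fails up to the client
	}
	return nil // back in the proposal pipeline
}

func main() {
	p := &proposal{command: command{MaxLeaseIndex: 10000}, encodedCommand: []byte("x")}
	err := tryRepropose(context.Background(), &replica{rejectNext: true}, p)
	fmt.Println(err, "MaxLeaseIndex restored to", p.command.MaxLeaseIndex)
}
```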


pkg/kv/kvserver/replica_application_state_machine.go line 236 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I don't understand this branch. It dates back to 0e81674, but I can't make sense of it — it sounds like it was just dead code. Glad you got rid of it and replaced it with an assertion.

Ack, thanks for the check. I had a similar conclusion. The logic is still quite complicated and I wish it were a little less subtle to know when a proposal is in which "state" but that's for another day.

@tbg tbg requested a review from nvanbenschoten January 18, 2023 10:53
Member

@nvanbenschoten nvanbenschoten left a comment


Reviewed 2 of 2 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @pavelkalinnikov)


pkg/kv/kvserver/replica_application_result.go line 259 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

The command is not in the map at this point, so I am not only ok with this, but intentionally want this. I want r.propose to not have to care about whether it sees a proposal for the first time or fifth time.

The part that feels fragile to me then is the need to reset the fields on error. I probably just don't understand that well enough yet, but why is an error here treated differently than an error when flushing the proposal buffer back into the proposals map?

Member Author

@tbg tbg left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @pavelkalinnikov)


pkg/kv/kvserver/replica_application_result.go line 259 at r1 (raw file):
In this method, the command is not in the proposals map. If r.propose returns an error, the command never made it into the proposals map. So here we just pretend that we never even called r.propose. The proposal will then fail up to the caller, and this is probably not a particularly well-tested code path anyway, but it will not be reproposed and so it shouldn't matter what the MLAI and prev encoded command were. But it felt right to restore the previous state, to avoid having to reason about yet one more code path at the caller.

I'm not defending this code, btw - this is the clearest I was able to make it without really stirring the pot. I wish we didn't have these below-raft reproposals and would "just" delegate to an above-raft goroutine pool to repropose on behalf of clients who used pipelined writes. I bet that would be tricky in its own right, but conceptually clearer.

but why is an error here treated differently than an error when flushing the proposal buffer back into the proposals map?

I might be wrong, but: if we catch an error flushing the buffer, we return it up:

and up:

err := r.withRaftGroupLocked(true, func(raftGroup *raft.RawNode) (bool, error) {
numFlushed, err := r.mu.proposalBuf.FlushLockedWithRaftGroup(ctx, raftGroup)

and up:

} else if err != nil {
return stats, errors.Wrap(err, "checking raft group for Ready")
}

and up:

default:
log.FatalfDepth(ctx, 1, "%+v", err)
panic("unreachable")
}

basically flushing the buffer "has to work". But in this method, we can run into the closed timestamp, so it has to be fallible and we need to return an error to the caller.
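To make the contrast concrete, a self-contained sketch (illustrative function names only, not the real kvserver call chain): an error from flushing the proposal buffer escalates to a fatal error, while an error from the below-raft reproposal path is returned so the waiting client can be signaled.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// errBelowClosedTimestamp stands in for the kind of error the below-raft
// reproposal path can legitimately hit.
var errBelowClosedTimestamp = errors.New("write timestamp below closed timestamp")

// flushProposalBuffer models the path that "has to work": in this sketch
// it never fails, and if it ever did, the caller treats that as fatal.
func flushProposalBuffer() error { return nil }

// tryRepropose models the fallible path: it may return an error that the
// caller must hand back to the waiting client.
func tryRepropose() error { return errBelowClosedTimestamp }

func handleRaftReady() error {
	if err := flushProposalBuffer(); err != nil {
		// Mirrors the real chain: the error is wrapped and eventually
		// reaches a fatal log call.
		log.Fatalf("checking raft group for Ready: %v", err)
	}
	// The reproposal path, in contrast, is allowed to fail; the error
	// finishes the proposal and is reported to the client.
	return tryRepropose()
}

func main() {
	if err := handleRaftReady(); err != nil {
		fmt.Println("proposal finished with error:", err)
	}
}
```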

@tbg tbg requested a review from nvanbenschoten February 6, 2023 07:30
@tbg tbg force-pushed the reproposals-doc branch from fcbc736 to d8892af on February 6, 2023 07:41
@blathers-crl

blathers-crl bot commented Feb 6, 2023

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

Member

@nvanbenschoten nvanbenschoten left a comment


:lgtm:

Reviewed 1 of 3 files at r4, 2 of 2 files at r5, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @pavelkalinnikov)


pkg/kv/kvserver/replica_application_result.go line 259 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

In this method, the command is not in the proposals map. If r.propose returns an error, the command never made it into the proposals map. So here we just pretend that we never even called r.propose. The proposal will then fail up to the caller, and this is probably not a particularly well-tested code path anyway, but it will not be reproposed and so it shouldn't matter what the MLAI and prev encoded command were. But it felt right to restore the previous state, to avoid having to reason about yet one more code path at the caller.

I'm not defending this code, btw - this is the clearest I was able to make it without really stirring the pot. I wish we didn't have these below-raft reproposals and would "just" delegate to an above-raft goroutine pool to repropose on behalf of clients who used pipelined writes. I bet that would be tricky in its own right, but conceptually clearer.

but why is an error here treated differently than an error when flushing the proposal buffer back into the proposals map?

I might be wrong, but: if we catch an error flushing the buffer, we return it up:

and up:

err := r.withRaftGroupLocked(true, func(raftGroup *raft.RawNode) (bool, error) {
numFlushed, err := r.mu.proposalBuf.FlushLockedWithRaftGroup(ctx, raftGroup)

and up:

} else if err != nil {
return stats, errors.Wrap(err, "checking raft group for Ready")
}

and up:

default:
log.FatalfDepth(ctx, 1, "%+v", err)
panic("unreachable")
}

basically flushing the buffer "has to work". But in this method, we can run into the closed timestamp, so it has to be fallible and we need to return an error to the caller.

Thanks for explaining. This makes sense. If you wanted to add another sentence to the comment stating that an error to propose causes the request to fail while a success means that it's back in the proposal pipeline, I'd be supportive. It's also fine if not.

Reproposals are a deep rabbit hole and an area in which past changes
were all related to subtle bugs. Write it all up and in particular make
some simplifications that ought to be possible if my understanding is
correct:

- have proposals always enter `(*Replica).propose` without a
  MaxLeaseIndex or prior encoded command set, i.e. `propose`
  behaves the same for reproposals as for first proposals.
- assert that after a failed call to tryReproposeWithNewLeaseIndex,
  the command is not in the proposals map, i.e. check absence of
  a leak.
- replace code that should be impossible to reach (and had me confused
  for a good amount of time) with an assertion.
- add long comment on `r.mu.proposals`.

This commit also moves `tryReproposeWithNewLeaseIndex` off `(*Replica)`,
which is possible due to recent changes[^1]. In doing so, I realized
there was a (small) data race (now fixed): when returning a
`NotLeaseholderError` from that method, we weren't acquiring `r.mu`. It
may have looked as though we were holding it already since we're
accessing `r.mu.propBuf`, however that field has special semantics - it
wraps `r.mu` and acquires it when needed.

[^1]: The "below raft" test mentioned in the previous comment was
changed in cockroachdb#93785 and
no longer causes a false positive.

Epic: CRDB-220
Release note: None
@tbg tbg force-pushed the reproposals-doc branch from d8892af to 866d58a on February 8, 2023 17:03
Member Author

@tbg tbg left a comment


bors r=nvanbenschoten

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @nvanbenschoten and @pavelkalinnikov)


pkg/kv/kvserver/replica_application_result.go line 259 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Thanks for explaining. This makes sense. If you wanted to add another sentence to the comment stating that an error to propose causes the request to fail while a success means that it's back in the proposal pipeline, I'd be supportive. It's also fine if not.

Good idea, done.

@craig
Contributor

craig bot commented Feb 8, 2023

Build succeeded.

@craig craig bot merged commit 844a370 into cockroachdb:master Feb 8, 2023
@tbg tbg deleted the reproposals-doc branch February 9, 2023 07:49
tbg added a commit to tbg/cockroach that referenced this pull request Feb 16, 2023
In cockroachdb#94633, I introduced[^1] an assertion that attempted to catch cases
in which we might otherwise accidentally end up applying a proposal
twice.

This assertion had a false positive, see the updated comment within.

I was able to reproduce the failure within ~minutes via
`./experiment.sh` in cockroachdb#97173 as of 33dcdef.

Closes cockroachdb#94633.

[^1]: https://github.com/cockroachdb/cockroach/pull/94633/files#diff-50e458584d176deae52b20a7c04461b3e4110795c8c9a307cf7ee6696ba6bc60R238

Epic: none
Release note: None
tbg added a commit to tbg/cockroach that referenced this pull request Feb 20, 2023
In cockroachdb#94633, I introduced[^1] an assertion that attempted to catch cases
in which we might otherwise accidentally end up applying a proposal
twice.

This assertion had a false positive, see the updated comment within.

I was able to reproduce the failure within ~minutes via
`./experiment.sh` in cockroachdb#97173 as of 33dcdef.

Better testing of these cases would be desirable. Unfortunately, while
there is an abstraction over command application (`apply.Task`), most
of the logic worth testing lives in `(*replicaAppBatch)` which is
essentially a `*Replica` with more moving parts attached. This does
not lend itself well to unit testing.

I had a run[^2][^3][^4] earlier this year to make log application
standalone, but then didn't have enough time to follow through.
It would be desirable to do so at a later date, perhaps with
the explicit goal of making interactions like the one discussed
in this PR unit-testable.

No release note because unreleased (except perhaps in an alpha).

[^4]: cockroachdb#93309
[^3]: cockroachdb#93266
[^2]: cockroachdb#93239

Closes cockroachdb#94633.

[^1]: https://github.com/cockroachdb/cockroach/pull/94633/files#diff-50e458584d176deae52b20a7c04461b3e4110795c8c9a307cf7ee6696ba6bc60R238

Epic: none
Release note: None
tbg added a commit to tbg/cockroach that referenced this pull request Feb 23, 2023
In cockroachdb#94633, I introduced[^1] an assertion that attempted to catch cases
in which we might otherwise accidentally end up applying a proposal
twice.

This assertion had a false positive.

I was able to reproduce the failure within ~minutes via
`./experiment.sh` in cockroachdb#97173 as of 33dcdef.

Better testing of these cases would be desirable. Unfortunately, while
there is an abstraction over command application (`apply.Task`), most
of the logic worth testing lives in `(*replicaAppBatch)` which is
essentially a `*Replica` with more moving parts attached. This does
not lend itself well to unit testing.

I had a run[^2][^3][^4] earlier this year to make log application
standalone, but then didn't have enough time to follow through.
It would be desirable to do so at a later date, perhaps with
the explicit goal of making interactions like the one discussed
in this PR unit-testable.

[^4]: cockroachdb#93309
[^3]: cockroachdb#93266
[^2]: cockroachdb#93239

[^1]: https://github.com/cockroachdb/cockroach/pull/94633/files#diff-50e458584d176deae52b20a7c04461b3e4110795c8c9a307cf7ee6696ba6bc60R238

This assertion was previously trying to assert too much at a distance
and was not only incorrect, but additionally inscrutable.

It was mixing up two assertions, the first one of which is sensible:
If an entry is accepted, it must not be superseded by an inflight proposal.
If this were violated, the superseding inflight proposal could also apply,
resulting in a failure of replay protection. This assertion is now
still around as a stand-alone assertion.

The other half of the assertion was more confused: if an entry is
rejected, it was claiming that it couldn't also be superseded. The
thinking was that if a superseding log entry exists, maybe it could
apply, and that would be bad since we just told the waiting client
that their proposal got rejected.

This reasoning is incorrect, as the following example shows. Consider
the following initial situation:

    [lease seq is 1]
    log idx 99:  unrelated cmd at LAI 10000, lease seq = 1
    log idx 100: cmd X at LAI 10000, lease seq = 1

And next:

- a new lease enters the log at idx 101 (lease seq = 2)
- an identical copy of idx 100 enters the log at idx 102
- we apply idx 100, leading to superseding reproposal at idx 103

resulting in the log:

    [lease seq is 1]
    log idx 99: unrelated cmd at LAI 10000, lease seq = 1
    log idx 100: cmd X at LAI 10000, lease seq = 1
    log idx 101: lease seq = 2
    log idx 102: (same as idx 100)
    log idx 103: cmd X at LAI = 20000, lease seq = 1

During application of idx 102, we get a *permanent* rejection and yet
the entry is superseded (by the proposal at idx 103). This would
erroneously trigger the assertion, even though this is a legal sequence
of events with no detrimental outcomes: the superseding proposal will
always have the same lease sequence as its superseded copies, so it
will also fail.

I initially tried to only soften the assertion a *little bit*. Observing
that the example above led to a *permanent* rejection, should we only
require that a proposal (which in this assertion is always local) is not
superseded if it got rejected due to its lease index (which implies that
it passed the lease check)? It turns out that this is primarily an
assertion on when superseded proposals are counted as "local" at this
point in the code: if there were multiple copies of this rejected
proposal in the current `appTask` (i.e. the current `CommittedEntries`
slice handed to us for application by raft), then all copies are
initially local; and a copy that successfully spawns a superseding
proposal would be made non-local from that point on. On the face
of it, all other copies in the same `appTask` would now hit the
assertion (erroneously): they are local, they are rejected, so
why don't they enter the branch? The magic ingredient is that
if an entry is superseded when we handle the lease index rejection,
we also unlink the proposal from it. So these never enter this
path since it's not local at this point.

For example, if these are the log entries to apply (all at valid lease
seq):

    log idx 99: unrelated cmd at LAI 10000
    log idx 100: cmd X at LAI 10000
    log idx 101: (identical copy of idx 100)

and idxs 99-101 are applied in one batch, then idx 100 would spawn
a reproposal at a new lease applied index:

    log idx 99: unrelated cmd at LAI 10000
    log idx 100: cmd X at LAI 10000             <- applied
    log idx 101: (identical copy of idx 100)
    log idx 102: cmd X at LAI 20000             <- not in current batch

When we apply 101, we observe an illegal lease index, but the proposal
supersedes the entry, so we mark it as non-local and don't enter the
branch that contains the assertion.

The above reasoning is very difficult to understand, and it happens
too far removed from where the interesting state changes happen.

Also, for testing purposes it is interesting to introduce "errors"
in the lease applied index assignment to artificially exercise these
reproposal mechanisms. In doing so, these assertions can trip because
the lease applied index assigned to a reproposal might accidentally
(or intentionally!) match the existing lease applied index, in which
case copies of the command in the same batch now *don't* consider
themselves superseded.

The value of this testing outweighs the very limited benefit of
this branch of the assertion. An argument could even be made that
this assertion alone has negative benefit due to its complexity.

We are removing it in this commit and will instead work towards
simplifying the mechanisms that played a role in explaining the
assertion.

Closes cockroachdb#94633.
Closes cockroachdb#97347.

No release note because unreleased (except perhaps in an alpha).

Epic: none
Release note: None
tbg added a commit to tbg/cockroach that referenced this pull request Feb 24, 2023
craig bot pushed a commit that referenced this pull request Feb 26, 2023
97564: kvserver: narrow down 'finishing a proposal with outstanding reproposal' r=pavelkalinnikov a=tbg

In #94633, I introduced[^1] an assertion that attempted to catch cases
in which we might otherwise accidentally end up applying a proposal
twice.

This assertion had a false positive.

I was able to reproduce the failure within ~minutes via
`./experiment.sh` in #97173 as of 33dcdef.

Better testing of these cases would be desirable. Unfortunately, while
there is an abstraction over command application (`apply.Task`), most
of the logic worth testing lives in `(*replicaAppBatch)` which is
essentially a `*Replica` with more moving parts attached. This does
not lend itself well to unit testing.

I had a run[^2][^3][^4] earlier this year to make log application
standalone, but then didn't have enough time to follow through.
It would be desirable to do so at a later date, perhaps with
the explicit goal of making interactions like the one discussed
in this PR unit-testable.

[^4]: #93309
[^3]: #93266
[^2]: #93239

[^1]: https://github.com/cockroachdb/cockroach/pull/94633/files#diff-50e458584d176deae52b20a7c04461b3e4110795c8c9a307cf7ee6696ba6bc60R238

This assertion was previously trying to assert too much at a distance
and was not only incorrect, but additionally inscrutable.

It was mixing up two assertions, the first one of which is sensible:
If an entry is accepted, it must not be superseded by an inflight proposal.
If this were violated, the superseding inflight proposal could also apply,
resulting in a failure of replay protection. This assertion is now
still around as a stand-alone assertion.

The other half of the assertion was more confused: if an entry is
rejected, it was claiming that it couldn't also be superseded. The
thinking was that if a superseding log entry exists, maybe it could
apply, and that would be bad since we just told the waiting client
that their proposal got rejected.

This reasoning is incorrect, as the following example shows. Consider
the following initial situation:

    [lease seq is 1]
    log idx 99:  unrelated cmd at LAI 10000, lease seq = 1
    log idx 100: cmd X at LAI 10000, lease seq = 1

And next:

- a new lease enters the log at idx 101 (lease seq = 2)
- an identical copy of idx 100 enters the log at idx 102
- we apply idx 100, leading to superseding reproposal at idx 103

resulting in the log:

    [lease seq is 1]
    log idx 99: unrelated cmd at LAI 10000, lease seq = 1
    log idx 100: cmd X at LAI 10000, lease seq = 1
    log idx 101: lease seq = 2
    log idx 102: (same as idx 100)
    log idx 103: cmd X at LAI = 20000, lease seq = 1

During application of idx 102, we get a *permanent* rejection and yet
the entry is superseded (by the proposal at idx 103). This would
erroneously trigger the assertion, even though this is a legal sequence
of events with no detrimental outcomes: the superseding proposal will
always have the same lease sequence as its superseded copies, so it
will also fail.

I initially tried to only soften the assertion a *little bit*. Observing
that the example above led to a *permanent* rejection, should we only
require that a proposal (which in this assertion is always local) is not
superseded if it got rejected due to its lease index (which implies that
it passed the lease check)? It turns out that this is primarily an
assertion on when superseded proposals are counted as "local" at this
point in the code: if there were multiple copies of this rejected
proposal in the current `appTask` (i.e. the current `CommittedEntries`
slice handed to us for application by raft), then all copies are
initially local; and a copy that successfully spawns a superseding
proposal would be made non-local from that point on. On the face
of it, all other copies in the same `appTask` would now hit the
assertion (erroneously): they are local, they are rejected, so
why don't they enter the branch? The magic ingredient is that
if an entry is superseded when we handle the lease index rejection,
we also unlink the proposal from it. So these never enter this
path since it's not local at this point.

For example, if these are the log entries to apply (all at valid lease
seq):

    log idx 99: unrelated cmd at LAI 10000
    log idx 100: cmd X at LAI 10000
    log idx 101: (identical copy of idx 100)

and idxs 99-101 are applied in one batch, then idx 100 would spawn
a reproposal at a new lease applied index:

    log idx 99: unrelated cmd at LAI 10000
    log idx 100: cmd X at LAI 10000             <- applied
    log idx 101: (identical copy of idx 100)
    log idx 102: cmd X at LAI 20000             <- not in current batch

When we apply 101, we observe an illegal lease index, but the proposal
supersedes the entry, so we mark it as non-local and don't enter the
branch that contains the assertion.

The above reasoning is very difficult to understand, and it happens
too far removed from where the interesting state changes happen.

Also, for testing purposes it is interesting to introduce "errors"
in the lease applied index assignment to artificially exercise these
reproposal mechanisms. In doing so, these assertions can trip because
the lease applied index assigned to a reproposal might accidentally
(or intentionally!) match the existing lease applied index, in which
case copies of the command in the same batch now *don't* consider
themselves superseded.

The value of this testing outweighs the very limited benefit of
this branch of the assertion. An argument could even be made that
this assertion alone has negative benefit due to its complexity.

We are removing it in this commit and will instead work towards
simplifying the mechanisms that played a role in explaining the
assertion.

Closes #97102.
Closes #97347.
Closes #97447.
Closes #97612.

No release note because unreleased (except perhaps in an alpha).

Epic: none
Release note: None



Co-authored-by: Tobias Grieger <[email protected]>
craig bot pushed a commit that referenced this pull request Mar 14, 2023
98481: kvserver: revert recent changes to reproposals r=pavelkalinnikov a=tbg

Reverts #97606, #97564, #94825, #94633.

- Revert "kvserver: disable assertion 'finished proposal inserted'"
- Revert "kvserver: narrow down 'finishing a proposal with outstanding reproposal'"
- Revert "kvserver: fill gaps in comment near tryReproposeWithNewLeaseIndex"
- Revert "kvserver: hoist early return out of tryReproposeWithNewLeaseIndex"
- Revert "fixup! kvserver: prevent finished proposal from being present in proposals map"
- Revert "kvserver: prevent finished proposal from being present in proposals map"
- Revert "kvserver: improve reproposal assertions and documentation"

Closes #97973.

Epic: CRDB-25287
Release Note: none


98537: sql: check row level ttl change before truncating a table r=chengxiong-ruan a=chengxiong-ruan

Fixes: #93443

Release note (sql change): This commit fixed a bug where crdb panicked when a user tried to truncate a table which has an ongoing row level ttl change. We still don't support table truncation in this scenario, but a gentler unimplemented error is returned instead of a panic.

98575: cdc: use int64 for emitted bytes telemetry r=miretskiy a=jayshrivastava

Previously, the stored `emitted_bytes` field was an int32, which can hold a maximum value of 2.1GB. This value is too small because the logging period is 24h and changefeeds can emit much more than 2.1GB in 24h. This change updates the field to be an int64, which solves this problem.

Epic: None
Release note: None

98582: ci: allow-list `BUILD_VCS_NUMBER` env var in cloud unit tests r=jlinder a=rickystewart

This job was filing issues linking to the wrong commit.

Epic: none
Release note: None

Co-authored-by: Tobias Grieger <[email protected]>
Co-authored-by: Chengxiong Ruan <[email protected]>
Co-authored-by: Jayant Shrivastava <[email protected]>
Co-authored-by: Ricky Stewart <[email protected]>