Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

release-20.2: kvserver: reintroduce RangeDesc.GenerationComparable #54469

Merged

Conversation

andreimatei
Copy link
Contributor

Backport 1/1 commits from #54199.

/cc @cockroachdb/release


We dropped this field recently, but unfortunately that wasn't safe for
mixed-version clusters. The rub is that 20.1 nodes need to roundtrip the
proto through 20.2 nodes in a fairly subtle way. When it comes back to
the 20.1 node, the descriptor needs to compare Equal() to the original.
We configure our protos to not preserve unrecognized fields, so removing
the field breaks this round-tripping.

Specifically, the scenario which broke is the following:

  1. A 20.1 node processes an AdminSplit, and performs the tranction
    writing the new descriptors. The descriptors have the
    GenerationCompable field set.
  2. The lease changes while the respective txn is running. The lease
    moves to a 20.2 node.
  3. The 20.2 node evaluates the EndTxn, and computes the split trigger
    info that's going to be replicated. The EndTxn has split info in it
    containing the field set, but the field is dropped when converting
    that into the proposed SplitTrigger (since the 20.2 unmarshalls and
    re-marshalls the descriptors).
  4. All the 20.1 replicas of the ranges involved now apply the respective
    trigger via Raft, and their in-memory state doesn't have the field
    set. This doesn't match the bytes written in the database, which have
    the field.
  5. The discrepancy between the in-memory state and the db state is a
    problem, as it causes the 20.1 node to spin if it tries to perform
    subsequent merge/split operations. The reason is that the code
    performing these operations short-circuits itself if it detects that
    the descriptor has changed while the operation was running. This
    detection is performed via the generated Equals() method, and it
    mis-fires because of the phantom field. That detection happens here:
    return desc.Equal(desc2)

This patch takes precautions so that we can remove the field again in
21.1 - I'm merging this in 21.1, I'll backport it to 20.2, and then I'll
come back to 20.2 and remove the field.

Fixes #53535

Release note: None

We dropped this field recently, but unfortunately that wasn't safe for
mixed-version clusters. The rub is that 20.1 nodes need to roundtrip the
proto through 20.2 nodes in a fairly subtle way. When it comes back to
the 20.1 node, the descriptor needs to compare Equal() to the original.
We configure our protos to not preserve unrecognized fields, so removing
the field breaks this round-tripping.

Specifically, the scenario which broke is the following:
1. A 20.1 node processes an AdminSplit, and performs the tranction
   writing the new descriptors. The descriptors have the
   GenerationCompable field set.
2. The lease changes while the respective txn is running. The lease
   moves to a 20.2 node.
3. The 20.2 node evaluates the EndTxn, and computes the split trigger
   info that's going to be replicated. The EndTxn has split info in it
   containing the field set, but the field is dropped when converting
   that into the proposed SplitTrigger (since the 20.2 unmarshalls and
   re-marshalls the descriptors).
4. All the 20.1 replicas of the ranges involved now apply the respective
   trigger via Raft, and their in-memory state doesn't have the field
   set. This doesn't match the bytes written in the database, which have
   the field.
5. The discrepancy between the in-memory state and the db state is a
   problem, as it causes the 20.1 node to spin if it tries to perform
   subsequent merge/split operations. The reason is that the code
   performing these operations short-circuits itself if it detects that
   the descriptor has changed while the operation was running. This
   detection is performed via the generated Equals() method, and it
   mis-fires because of the phantom field. That detection happens here:
   https://github.com/cockroachdb/cockroach/blob/79c01d28da9c379f67bb41beef3d85ad3bee1da1/pkg/kv/kvserver/replica_command.go#L1957

This patch takes precautions so that we can remove the field again in
21.1 - I'm merging this in 21.1, I'll backport it to 20.2, and then I'll
come back to 20.2 and remove the field. Namely, the patch changes
RangeDesc.Equal() to ignore that field (and the method is no longer
generated).

Fixes cockroachdb#53535

Release note: None
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@andreimatei andreimatei merged commit 0b83153 into cockroachdb:release-20.2 Sep 17, 2020
@andreimatei andreimatei deleted the backport20.2-54199 branch September 17, 2020 19:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants