-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
raft: incorrect term on delegated snapshots #127349
Labels
A-kv-replication
Relating to Raft, consensus, and coordination.
branch-master
Failures and bugs on the master branch.
C-bug
Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
T-kv
KV Team
Comments
pav-kv
added
the
C-bug
Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
label
Jul 17, 2024
Hi @pav-kv, please add branch-* labels to identify which branch(es) this C-bug affects. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
pav-kv
changed the title
raft: inconsistent term on delegated snapshots
raft: incorrect term on delegated snapshots
Jul 17, 2024
nvanbenschoten
added a commit
to nvanbenschoten/cockroach
that referenced
this issue
Oct 22, 2024
Informs cockroachdb#125352. This commit adds setters for the HardState fields (Term, Vote, lead, leadEpoch), allowing us to assert that these fields are only modified in expected ways. Some bugs that this change could catch: - term regressions - vote changes - lead changes in a single term - leadEpoch regressions The "lead changes in a single term" class of bugs is especially interesting, as we have found two of these in this thread: https://cockroachlabs.slack.com/archives/G01G8LK77DK/p1720274557285239. These are tracked in the following issues, which we may need to address ASAP: * cockroachdb#127349 * cockroachdb#127348 Release note: None
nvanbenschoten
added a commit
to nvanbenschoten/cockroach
that referenced
this issue
Oct 22, 2024
We can't consider MsgSnap to be from the leader of Message.Term until we address cockroachdb#127348 and cockroachdb#127349. Epic: None Release note: None
nvanbenschoten
added a commit
to nvanbenschoten/cockroach
that referenced
this issue
Oct 22, 2024
Informs cockroachdb#125352. This commit adds setters for the HardState fields (Term, Vote, lead, leadEpoch), allowing us to assert that these fields are only modified in expected ways. Some bugs that this change could catch: - term regressions - vote changes - lead changes in a single term - leadEpoch regressions The "lead changes in a single term" class of bugs is especially interesting, as we have found two of these in this thread: https://cockroachlabs.slack.com/archives/G01G8LK77DK/p1720274557285239. These are tracked in the following issues, which we may need to address ASAP: * cockroachdb#127349 * cockroachdb#127348 Release note: None
nvanbenschoten
added a commit
to nvanbenschoten/cockroach
that referenced
this issue
Oct 22, 2024
We can't consider MsgSnap to be from the leader of Message.Term until we address cockroachdb#127348 and cockroachdb#127349. Epic: None Release note: None
nvanbenschoten
added a commit
to nvanbenschoten/cockroach
that referenced
this issue
Oct 23, 2024
Informs cockroachdb#125352. This commit adds setters for the HardState fields (Term, Vote, lead, leadEpoch), allowing us to assert that these fields are only modified in expected ways. Some bugs that this change could catch: - term regressions - vote changes - lead changes in a single term - leadEpoch regressions The "lead changes in a single term" class of bugs is especially interesting, as we have found two of these in this thread: https://cockroachlabs.slack.com/archives/G01G8LK77DK/p1720274557285239. These are tracked in the following issues, which we may need to address ASAP: * cockroachdb#127349 * cockroachdb#127348 Release note: None
nvanbenschoten
added a commit
to nvanbenschoten/cockroach
that referenced
this issue
Oct 23, 2024
We can't consider MsgSnap to be from the leader of Message.Term until we address cockroachdb#127348 and cockroachdb#127349. Epic: None Release note: None
nvanbenschoten
added a commit
to nvanbenschoten/cockroach
that referenced
this issue
Oct 25, 2024
Informs cockroachdb#125352. This commit adds setters for the HardState fields (Term, Vote, lead, leadEpoch), allowing us to assert that these fields are only modified in expected ways. Some bugs that this change could catch: - term regressions - vote changes - lead changes in a single term - leadEpoch regressions The "lead changes in a single term" class of bugs is especially interesting, as we have found two of these in this thread: https://cockroachlabs.slack.com/archives/G01G8LK77DK/p1720274557285239. These are tracked in the following issues, which we may need to address ASAP: * cockroachdb#127349 * cockroachdb#127348 Release note: None
nvanbenschoten
added a commit
to nvanbenschoten/cockroach
that referenced
this issue
Oct 25, 2024
We can't consider MsgSnap to be from the leader of Message.Term until we address cockroachdb#127348 and cockroachdb#127349. Epic: None Release note: None
nvanbenschoten
added a commit
to nvanbenschoten/cockroach
that referenced
this issue
Oct 28, 2024
Informs cockroachdb#125352. This commit adds setters for the HardState fields (Term, Vote, lead, leadEpoch), allowing us to assert that these fields are only modified in expected ways. Some bugs that this change could catch: - term regressions - vote changes - lead changes in a single term - leadEpoch regressions The "lead changes in a single term" class of bugs is especially interesting, as we have found two of these in this thread: https://cockroachlabs.slack.com/archives/G01G8LK77DK/p1720274557285239. These are tracked in the following issues, which we may need to address ASAP: * cockroachdb#127349 * cockroachdb#127348 Release note: None
nvanbenschoten
added a commit
to nvanbenschoten/cockroach
that referenced
this issue
Oct 28, 2024
We can't consider MsgSnap to be from the leader of Message.Term until we address cockroachdb#127348 and cockroachdb#127349. Epic: None Release note: None
nvanbenschoten
added a commit
to nvanbenschoten/cockroach
that referenced
this issue
Oct 28, 2024
We can't consider MsgSnap to be from the leader of Message.Term until we address cockroachdb#127348 and cockroachdb#127349. Epic: None Release note: None
craig bot
pushed a commit
that referenced
this issue
Oct 28, 2024
133169: raft: assert on HardState field modifications r=nvanbenschoten a=nvanbenschoten Informs #125352. This commit adds setters for the HardState fields (Term, Vote, lead, leadEpoch), allowing us to assert that these fields are only modified in expected ways. Some bugs that this change could catch: - term regressions - vote changes - lead changes in a single term - leadEpoch regressions The "lead changes in a single term" class of bugs is especially interesting, as we have found two of these in this thread: https://cockroachlabs.slack.com/archives/G01G8LK77DK/p1720274557285239. These are tracked in the following issues, which we may need to address ASAP: * #127349 * #127348 The 2nd, 3rd, and 4th commits in the PR address these issues in raft by not assuming that a `MsgSnap` as always coming from the leader of `Message.Term`. Release note: None 133473: sql,kvflowcontrol: include sum of (eval|send) tokens deducted r=sumeerbhola a=kvoli Populate two new `kvflowinspectpb.ConnectedStream` fields: ```go // TotalEvalDeductedTokens represents the total number of tokens deducted for // evaluation, for this connected stream. Only populated on versions >= 24.3. TotalEvalDeductedTokens int64 // TotalSendDeductedTokens represents the total number of tokens deducted // before sending, for this connected stream. Only populated on versions >= // 24.3. TotalSendDeductedTokens int64 ``` These may be useful to gander at the tokens deducted, which are otherwise not tracked. Also, for a quick check. Note that RACv1 does not populate these variables, as it is soon to be deprecated and these fields don't directly relate, unlike RACv2. --- Update `crdb_internal.kv_flow_control_handles_v2` to include: ```sql total_eval_deducted_tokens INT NOT NULL, total_send_deducted_tokens INT NOT NULL, ``` Corresponding to the total number of `eval` and `send` tokens deducted by the connected stream, in RACv2. Note the `handles_v1` virtual table remains unchanged. The DDL is now: ```sql CREATE TABLE crdb_internal.kv_flow_control_handles_v2 ( range_id INT NOT NULL, tenant_id INT NOT NULL, store_id INT NOT NULL, total_tracked_tokens INT NOT NULL, total_eval_deducted_tokens INT NOT NULL, total_send_deducted_tokens INT NOT NULL, INDEX(range_id) ); ``` Resolves: #132777 Release note: None 133497: build: upgrade `rules_go` and `gazelle` r=rail,tbg,stevendanna a=rickystewart We also drop a [commit on top of our `rules_go` fork](RaduBerinde/rules_go@b2411a8) that is in opposition to #102619. Closes #102619 Closes #133289 Closes #133464 Epic: CRDB-17171 Release note: None Co-authored-by: Nathan VanBenschoten <[email protected]> Co-authored-by: Austen McClernon <[email protected]> Co-authored-by: Ricky Stewart <[email protected]>
blathers-crl bot
pushed a commit
that referenced
this issue
Oct 28, 2024
Informs #125352. This commit adds setters for the HardState fields (Term, Vote, lead, leadEpoch), allowing us to assert that these fields are only modified in expected ways. Some bugs that this change could catch: - term regressions - vote changes - lead changes in a single term - leadEpoch regressions The "lead changes in a single term" class of bugs is especially interesting, as we have found two of these in this thread: https://cockroachlabs.slack.com/archives/G01G8LK77DK/p1720274557285239. These are tracked in the following issues, which we may need to address ASAP: * #127349 * #127348 Release note: None
blathers-crl bot
pushed a commit
that referenced
this issue
Oct 28, 2024
Informs #125352. This commit adds setters for the HardState fields (Term, Vote, lead, leadEpoch), allowing us to assert that these fields are only modified in expected ways. Some bugs that this change could catch: - term regressions - vote changes - lead changes in a single term - leadEpoch regressions The "lead changes in a single term" class of bugs is especially interesting, as we have found two of these in this thread: https://cockroachlabs.slack.com/archives/G01G8LK77DK/p1720274557285239. These are tracked in the following issues, which we may need to address ASAP: * #127349 * #127348 Release note: None
nvanbenschoten
added a commit
to nvanbenschoten/cockroach
that referenced
this issue
Oct 30, 2024
Informs cockroachdb#132762. This commit fixes a bug introduced by 58a9f53. Now that we no longer assume that snapshots are coming from the leader, we were not defortifying the raft leadership when receiving a snapshot. This meant that if a snapshot came from a new term, we would hit an assertion failure, which the new test reproduces. This commit addresses this bug by distinguishing between messages that are always sent from the leader of the message's term and messages that indicate that there is a new leader of the message's term, even if the message is not from the leader itself. This is temporary, and can be removed when we address cockroachdb#127349. Release note: None
craig bot
pushed a commit
that referenced
this issue
Oct 30, 2024
132967: schemachanger: add support discarding database/table zc r=rafiss a=annrpom ### scbuild: version gate SetZoneConfig with isV243Active New support was added in 24.3 in `SetZoneConfig` for subzone configs. Since this work was a significant change, we should gate `SetZoneConfig` to minimize the risk of compatibility issues with earlier versions. Epic: none Release note: None --- ### schemachanger: re-enable dsc zone config Since 24.3 has been cut, we can turn zone configs on in the declarative schema changer by default. Epic: none Release note: None --- ### scexec: extend UpdateZoneConfig to allow for deletes This patch extends UpdateZoneConfig to allow for deletes in the system.zones table. Epic: None Release note: None --- ### scbuildstmt: refactor zoneConfigObjBuilder method This patch refactors the `addZoneConfigToBuildCtx` method of our `zoneConfigObjBuilder` to be less specific; we pull the `b.add` out so that we can decide in `SetZoneConfig` whether to add or drop this element. Epic: none Release note: None --- ### schemachanger: add support discarding database/table zc This patch supports discarding a database/table zone config in the declarative schema changer. Informs: #133157 Release note: None 133725: kvserver/rangefeed: add metric for processor-level send timeout r=wenyihu6a a=stevendanna Epic: none Release note: None 133848: bazel-github-helper: sort list of failed tests in summary r=rail a=rickystewart Epic: CRDB-17171 Release note: None Release justification: Non-production code changes 133874: raft: don't panic on defortifying snapshot from new term r=nvanbenschoten a=nvanbenschoten Informs #132762. This commit fixes a bug introduced by 58a9f53. Now that we no longer assume that snapshots are coming from the leader, we were not defortifying the raft leadership when receiving a snapshot. This meant that if a snapshot came from a new term, we would hit an assertion failure, which the new test reproduces. This commit addresses this bug by distinguishing between messages that are always sent from the leader of the message's term and messages that indicate that there is a new leader of the message's term, even if the message is not from the leader itself. This is temporary, and can be removed when we address #127349. Release note: None 133878: revert "logictest: deflake TestLogic_union" r=yuzefovich a=yuzefovich This reverts commit 6218f23. I don't think this commit did anything to remove or reduce flakiness in the union logic test because the error occurs on the SELECT query _after_ the DDL and DML statements modified by this commit. Reverting it to allow for cleaner backports. Epic: None Release note: None Co-authored-by: Annie Pompa <[email protected]> Co-authored-by: Steven Danna <[email protected]> Co-authored-by: Ricky Stewart <[email protected]> Co-authored-by: Nathan VanBenschoten <[email protected]> Co-authored-by: Yahor Yuzefovich <[email protected]>
blathers-crl bot
pushed a commit
that referenced
this issue
Oct 30, 2024
Informs #132762. This commit fixes a bug introduced by 58a9f53. Now that we no longer assume that snapshots are coming from the leader, we were not defortifying the raft leadership when receiving a snapshot. This meant that if a snapshot came from a new term, we would hit an assertion failure, which the new test reproduces. This commit addresses this bug by distinguishing between messages that are always sent from the leader of the message's term and messages that indicate that there is a new leader of the message's term, even if the message is not from the leader itself. This is temporary, and can be removed when we address #127349. Release note: None
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
A-kv-replication
Relating to Raft, consensus, and coordination.
branch-master
Failures and bugs on the master branch.
C-bug
Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
T-kv
KV Team
After #105044, the delegate can generate a snapshot while being at a newer term than the coordinator, but in the
MsgSnap
message it puts the Term of the coordinator, as if it was sent by that leader. This is problematic because the delegate can be already applying entries committed at later terms, and as a result theMsgSnap
can carryTerm < snapshot.Metadata.Term
. On the receiver who applies this snapshot, it can break the raft log/state invariant:HardState.Term >= lastEntryTerm >= RaftTruncatedState.Term
. Additionally, the receiver of the snapshot assumes [#127348] the sender of the message is the leader.To fix this, we should seek to guarantee an invariant on
MsgSnap
:One way to do this is to put the delegate's term instead of the originator. Because of #127348 though, the
MsgSnap
should be sent on behalf of a leader at this term. We can consult theSoftState.Lead
field for this.Another option is to reject snapshot delegations if our term differs. One problem with this is that the snapshot could have been initiated by a leaseholder who was not a leader. Looks like we need to consult
SoftState.Lead
in this case too.Jira issue: CRDB-40402
Epic CRDB-39898
The text was updated successfully, but these errors were encountered: