-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kv: raft application leaks traces on ErrRemoved #53677
Labels
A-kv-replication
Relating to Raft, consensus, and coordination.
C-bug
Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
Comments
nvanbenschoten
added
C-bug
Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
A-kv-replication
Relating to Raft, consensus, and coordination.
labels
Aug 31, 2020
nvanbenschoten
added a commit
to nvanbenschoten/cockroach
that referenced
this issue
Sep 9, 2020
Extension of cockroachdb#53677. This commit ensures that we properly finish tracing spans of Raft commands that follow a command that throws a ErrRemoved. Before this commit, these commands would be abandoned and would never be finished. The effects of this are theoretically even worse than those fixed in the previous commit because these leaked commands could be locally proposed, so we may be abandoning a local proposer indefinitely. It's not clear that we ever saw an instance of this. It seems rare for a local proposal to end up in the same CommittedEntries batch as a command that removes a replica because of the lease requirements, but it doesn't seem impossible, especially of the local proposal was a RequestLease request. Release justification: low risk, high benefit changes to existing functionality.
nvanbenschoten
added a commit
to nvanbenschoten/cockroach
that referenced
this issue
Sep 11, 2020
Extension of cockroachdb#53677. This commit ensures that we properly finish tracing spans of Raft commands that follow a command that throws a ErrRemoved. Before this commit, these commands would be abandoned and would never be finished. The effects of this are theoretically even worse than those fixed in the previous commit because these leaked commands could be locally proposed, so we may be abandoning a local proposer indefinitely. It's not clear that we ever saw an instance of this. It seems rare for a local proposal to end up in the same CommittedEntries batch as a command that removes a replica because of the lease requirements, but it doesn't seem impossible, especially of the local proposal was a RequestLease request. Release justification: low risk, high benefit changes to existing functionality.
craig bot
pushed a commit
that referenced
this issue
Sep 11, 2020
54140: kv: don't leak raft application tracing spans on or after ErrRemoved r=nvanbenschoten a=nvanbenschoten Fixes #53677. This change ensures that we properly finish tracing spans of Raft commands that throw ErrRemoved errors in ApplySideEffects. It then ensures that we properly finish tracing spans of Raft commands that follow a command that throws an `ErrRemoved`. Before this, these commands would be abandoned and would never be finished. The effects of this are theoretically even worse than those fixed in the previous commit because these leaked commands could be locally proposed, so we may be abandoning a local proposer indefinitely. It's not clear that we ever saw an instance of this. It seems rare for a local proposal to end up in the same CommittedEntries batch as a command that removes a replica because of the lease requirements, but it doesn't seem impossible, especially of the local proposal was a RequestLease request. I was originally intending to do something more dramatic and make `replicaStateMachine.ApplySideEffects` responsible for acknowledging proposers in all cases, but doing so turned out to be pretty invasive so I was concerned that it would be harder to backport to v20.2 and to v20.1. I may revisit that in the future. Release justification: low risk, high benefit changes to existing functionality. Co-authored-by: Nathan VanBenschoten <[email protected]>
nvanbenschoten
added a commit
to nvanbenschoten/cockroach
that referenced
this issue
Sep 11, 2020
Fixes cockroachdb#53677. This commit ensures that we properly finish tracing spans of Raft commands that throw ErrRemoved errors in ApplySideEffects. I was originally intending to do something more dramatic and make `replicaStateMachine.ApplySideEffects` responsible for acknowledging proposers in all cases, but doing so turned out to be pretty invasive so I was concerned that it would be harder to backport to v20.2 and to v20.1. I may revisit that in the future. Release justification: low risk, high benefit changes to existing functionality.
nvanbenschoten
added a commit
to nvanbenschoten/cockroach
that referenced
this issue
Sep 11, 2020
Extension of cockroachdb#53677. This commit ensures that we properly finish tracing spans of Raft commands that follow a command that throws a ErrRemoved. Before this commit, these commands would be abandoned and would never be finished. The effects of this are theoretically even worse than those fixed in the previous commit because these leaked commands could be locally proposed, so we may be abandoning a local proposer indefinitely. It's not clear that we ever saw an instance of this. It seems rare for a local proposal to end up in the same CommittedEntries batch as a command that removes a replica because of the lease requirements, but it doesn't seem impossible, especially of the local proposal was a RequestLease request. Release justification: low risk, high benefit changes to existing functionality.
nvanbenschoten
added a commit
to nvanbenschoten/cockroach
that referenced
this issue
Sep 28, 2020
Fixes cockroachdb#53677. This commit ensures that we properly finish tracing spans of Raft commands that throw ErrRemoved errors in ApplySideEffects. I was originally intending to do something more dramatic and make `replicaStateMachine.ApplySideEffects` responsible for acknowledging proposers in all cases, but doing so turned out to be pretty invasive so I was concerned that it would be harder to backport to v20.2 and to v20.1. I may revisit that in the future. Release justification: low risk, high benefit changes to existing functionality.
nvanbenschoten
added a commit
to nvanbenschoten/cockroach
that referenced
this issue
Sep 28, 2020
Extension of cockroachdb#53677. This commit ensures that we properly finish tracing spans of Raft commands that follow a command that throws a ErrRemoved. Before this commit, these commands would be abandoned and would never be finished. The effects of this are theoretically even worse than those fixed in the previous commit because these leaked commands could be locally proposed, so we may be abandoning a local proposer indefinitely. It's not clear that we ever saw an instance of this. It seems rare for a local proposal to end up in the same CommittedEntries batch as a command that removes a replica because of the lease requirements, but it doesn't seem impossible, especially of the local proposal was a RequestLease request. Release justification: low risk, high benefit changes to existing functionality.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
A-kv-replication
Relating to Raft, consensus, and coordination.
C-bug
Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
Noticed while investigating #53403.
In a cluster with net/trace enabled for long periods of time, we begin to see an accumulation of leaked tracing spans. This appears to be fallout from c4adb7a, so it affects v20.1 and v20.2. That commit introduced tracing for raft application of non-local proposals. This means that we need to properly clean up (
FinishAndAckOutcome
) these commands even in the presence of replica removal. We don't currently do so.This currently goes wrong in two ways. First, we don't clean up the command that caused an
ErrRemoved
. Second, we don't clean up any commands after this command in the same batch. This second issue may be even worse than a tracing leak – it may be possible for local proposals to be later in the same batch, in which case we would leak them entirely and leave them hanging indefinitely. It's not clear whether this second hazard is possible in practice.The text was updated successfully, but these errors were encountered: