Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kv: raft application leaks traces on ErrRemoved #53677

Closed
nvanbenschoten opened this issue Aug 31, 2020 · 0 comments · Fixed by #54140
Closed

kv: raft application leaks traces on ErrRemoved #53677

nvanbenschoten opened this issue Aug 31, 2020 · 0 comments · Fixed by #54140
Assignees
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.

Comments

@nvanbenschoten
Copy link
Member

Noticed while investigating #53403.

Screen Shot 2020-08-28 at 4 53 06 PM

In a cluster with net/trace enabled for long periods of time, we begin to see an accumulation of leaked tracing spans. This appears to be fallout from c4adb7a, so it affects v20.1 and v20.2. That commit introduced tracing for raft application of non-local proposals. This means that we need to properly clean up (FinishAndAckOutcome) these commands even in the presence of replica removal. We don't currently do so.

This currently goes wrong in two ways. First, we don't clean up the command that caused an ErrRemoved. Second, we don't clean up any commands after this command in the same batch. This second issue may be even worse than a tracing leak – it may be possible for local proposals to be later in the same batch, in which case we would leak them entirely and leave them hanging indefinitely. It's not clear whether this second hazard is possible in practice.

@nvanbenschoten nvanbenschoten added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-kv-replication Relating to Raft, consensus, and coordination. labels Aug 31, 2020
@nvanbenschoten nvanbenschoten self-assigned this Aug 31, 2020
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Sep 9, 2020
Extension of cockroachdb#53677.

This commit ensures that we properly finish tracing spans of Raft
commands that follow a command that throws a ErrRemoved. Before this
commit, these commands would be abandoned and would never be finished.
The effects of this are theoretically even worse than those fixed in the
previous commit because these leaked commands could be locally proposed,
so we may be abandoning a local proposer indefinitely.

It's not clear that we ever saw an instance of this. It seems rare for a
local proposal to end up in the same CommittedEntries batch as a command
that removes a replica because of the lease requirements, but it doesn't
seem impossible, especially of the local proposal was a RequestLease
request.

Release justification: low risk, high benefit changes to existing
functionality.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Sep 11, 2020
Extension of cockroachdb#53677.

This commit ensures that we properly finish tracing spans of Raft
commands that follow a command that throws a ErrRemoved. Before this
commit, these commands would be abandoned and would never be finished.
The effects of this are theoretically even worse than those fixed in the
previous commit because these leaked commands could be locally proposed,
so we may be abandoning a local proposer indefinitely.

It's not clear that we ever saw an instance of this. It seems rare for a
local proposal to end up in the same CommittedEntries batch as a command
that removes a replica because of the lease requirements, but it doesn't
seem impossible, especially of the local proposal was a RequestLease
request.

Release justification: low risk, high benefit changes to existing
functionality.
craig bot pushed a commit that referenced this issue Sep 11, 2020
54140: kv: don't leak raft application tracing spans on or after ErrRemoved r=nvanbenschoten a=nvanbenschoten

Fixes #53677.

This change ensures that we properly finish tracing spans of Raft commands that throw ErrRemoved errors in ApplySideEffects. It then ensures that we properly finish tracing spans of Raft commands that follow a command that throws an `ErrRemoved`. Before this, these commands would be abandoned and would never be finished. The effects of this are theoretically even worse than those fixed in the previous commit because these leaked commands could be locally proposed, so we may be abandoning a local proposer indefinitely.

It's not clear that we ever saw an instance of this. It seems rare for a local proposal to end up in the same CommittedEntries batch as a command that removes a replica because of the lease requirements, but it doesn't seem impossible, especially of the local proposal was a RequestLease request.

I was originally intending to do something more dramatic and make `replicaStateMachine.ApplySideEffects` responsible for acknowledging proposers in all cases, but doing so turned out to be pretty invasive so I was concerned that it would be harder to backport to v20.2 and to v20.1. I may revisit that in the future.

Release justification: low risk, high benefit changes to existing functionality.

Co-authored-by: Nathan VanBenschoten <[email protected]>
@craig craig bot closed this as completed in dd81162 Sep 11, 2020
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Sep 11, 2020
Fixes cockroachdb#53677.

This commit ensures that we properly finish tracing spans of
Raft commands that throw ErrRemoved errors in ApplySideEffects.

I was originally intending to do something more dramatic and make
`replicaStateMachine.ApplySideEffects` responsible for acknowledging
proposers in all cases, but doing so turned out to be pretty invasive
so I was concerned that it would be harder to backport to v20.2 and
to v20.1. I may revisit that in the future.

Release justification: low risk, high benefit changes to existing
functionality.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Sep 11, 2020
Extension of cockroachdb#53677.

This commit ensures that we properly finish tracing spans of Raft
commands that follow a command that throws a ErrRemoved. Before this
commit, these commands would be abandoned and would never be finished.
The effects of this are theoretically even worse than those fixed in the
previous commit because these leaked commands could be locally proposed,
so we may be abandoning a local proposer indefinitely.

It's not clear that we ever saw an instance of this. It seems rare for a
local proposal to end up in the same CommittedEntries batch as a command
that removes a replica because of the lease requirements, but it doesn't
seem impossible, especially of the local proposal was a RequestLease
request.

Release justification: low risk, high benefit changes to existing
functionality.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Sep 28, 2020
Fixes cockroachdb#53677.

This commit ensures that we properly finish tracing spans of
Raft commands that throw ErrRemoved errors in ApplySideEffects.

I was originally intending to do something more dramatic and make
`replicaStateMachine.ApplySideEffects` responsible for acknowledging
proposers in all cases, but doing so turned out to be pretty invasive
so I was concerned that it would be harder to backport to v20.2 and
to v20.1. I may revisit that in the future.

Release justification: low risk, high benefit changes to existing
functionality.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Sep 28, 2020
Extension of cockroachdb#53677.

This commit ensures that we properly finish tracing spans of Raft
commands that follow a command that throws a ErrRemoved. Before this
commit, these commands would be abandoned and would never be finished.
The effects of this are theoretically even worse than those fixed in the
previous commit because these leaked commands could be locally proposed,
so we may be abandoning a local proposer indefinitely.

It's not clear that we ever saw an instance of this. It seems rare for a
local proposal to end up in the same CommittedEntries batch as a command
that removes a replica because of the lease requirements, but it doesn't
seem impossible, especially of the local proposal was a RequestLease
request.

Release justification: low risk, high benefit changes to existing
functionality.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant