kvserver: fix clearrange/* tests #104699

Merged 1 commit on Jun 10, 2023

irfansharif
Contributor

Fixes #104696.
Fixes #104697.
Fixes #104698.
Part of #98703.

In 072c16d (added as part of #95637) we reworked the locking structure around the RaftTransport's per-RPC-class send queues. Whenever a send queue is created or deleted, we now also update the kvflowcontrol connection tracker, so both operations must happen while holding a kvflowcontrol mutex. When rebasing #95637 onto master, we accidentally kept the earlier queue deletion code, which runs without holding that mutex. As a result, queue deletions happened twice, which made it possible to hit a RaftTransport assertion that expects the right send queue to already exist.

Specifically, the following sequence was possible:

  • (*RaftTransport).SendAsync is invoked, observes no queue for <nodeid,class>, creates it, and tracks it in the queues map.
    • It invokes an async worker W1 to process that send queue through (*RaftTransport).startProcessNewQueue. The async worker is responsible for clearing the tracked queue in the queues map once done.
  • W1 expects to find the tracked queue in the queues map, finds it, proceeds.
  • W1 is done processing. On its way out, W1 clears <nodeid,class> from the queues map the first time.
  • (*RaftTransport).SendAsync is invoked by another goroutine, observes no queue for <nodeid,class>, creates it, and tracks it in the queues map.
    • It invokes an async worker W2 to process that send queue through (*RaftTransport).startProcessNewQueue. The async worker is responsible for clearing the tracked queue in the queues map once done.
  • W1 blindly clears the <nodeid,class> raft send queue a second time, removing the entry that was just created for W2.
  • W2 expects to find the queue in the queues map, but doesn't, and fatals.
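The interleaving above can be replayed deterministically with a small model. This is only a sketch under stated assumptions, not the RaftTransport code: `transport`, `queueKey`, `sendQueue`, `getOrCreate`, `clear`, `tracked`, and `replay` are hypothetical names introduced for illustration, and the steps run in a fixed order on one goroutine rather than relying on real scheduling.

```go
package main

import (
	"fmt"
	"sync"
)

// queueKey is a hypothetical stand-in for the <nodeid,class> pair.
type queueKey struct {
	nodeID int
	class  int
}

// sendQueue is a placeholder for a per-destination raft send queue.
type sendQueue struct{ id int }

// transport models only the queues map and a mutex guarding it.
type transport struct {
	mu     sync.Mutex
	queues map[queueKey]*sendQueue
	nextID int
}

// getOrCreate mimics the "observe no queue, create it, track it" step.
func (t *transport) getOrCreate(k queueKey) (*sendQueue, bool) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if existing, ok := t.queues[k]; ok {
		return existing, false
	}
	t.nextID++
	q := &sendQueue{id: t.nextID}
	t.queues[k] = q
	return q, true
}

// clear removes whatever is tracked under k, with no ownership check.
func (t *transport) clear(k queueKey) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.queues, k)
}

// tracked reports whether q is still the queue registered under k.
// It stands in for the worker's "queue must already exist" assertion.
func (t *transport) tracked(k queueKey, q *sendQueue) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	return t.queues[k] == q
}

// replay walks through the sequence and reports whether W2 finds its queue.
func replay(staleSecondClear bool) bool {
	t := &transport{queues: map[queueKey]*sendQueue{}}
	k := queueKey{nodeID: 1, class: 0}

	q1, _ := t.getOrCreate(k) // first send creates the queue; W1 is started
	if !t.tracked(k, q1) {    // W1 expects to find its queue, and does
		return false
	}
	t.clear(k) // W1 finishes and clears the entry the first time

	q2, _ := t.getOrCreate(k) // second send creates a new queue; W2 is started
	if staleSecondClear {
		t.clear(k) // W1's redundant second clear removes W2's entry
	}
	return t.tracked(k, q2) // W2's assertion
}

func main() {
	fmt.Println("single clear, W2 finds its queue:", replay(false))
	fmt.Println("double clear, W2 finds its queue:", replay(true))
}
```

With a single clear W2's check passes; with the stale second clear it fails, which corresponds to the fatal assertion described above.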

Release note: None

@irfansharif irfansharif requested review from a team June 10, 2023 17:12
@blathers-crl

blathers-crl bot commented Jun 10, 2023

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@irfansharif
Contributor Author

bors r+

@craig
Contributor

craig bot commented Jun 10, 2023

Build succeeded:

@craig craig bot merged commit 6e38f67 into cockroachdb:master Jun 10, 2023
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Jun 12, 2023
Enable kvadmission.flow_control.enabled by default. We didn't observe
noticeable performance regressions while it was disabled (single weekly
run, three nightly runs). There was some minimal fallout that has since been
fixed (cockroachdb#104699). We expect performance regressions now that this commit
enables it by default, and expect more fallout. We'll handle these as
part of cockroachdb#104154.

Release note: None
craig bot pushed a commit that referenced this pull request Jun 12, 2023
104741: kvflowcontrol: enable by default r=irfansharif a=irfansharif

Co-authored-by: irfan sharif
@irfansharif irfansharif deleted the 230610.deflake-clearrange branch June 13, 2023 14:25