Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rangefeed: registrations metric is not always drained on processor termination #106126

Open
aliher1911 opened this issue Jul 4, 2023 · 3 comments · Fixed by #110959
Open

rangefeed: registrations metric is not always drained on processor termination #106126

aliher1911 opened this issue Jul 4, 2023 · 3 comments · Fixed by #110959
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-kv KV Team

Comments

@aliher1911
Copy link
Contributor

aliher1911 commented Jul 4, 2023

Rangefeeds maintain a metric kv.rangefeed.registrations which shows how many range feeds are active.
This metric is a gauge increased when registration is successfully created and must be decreased when registration is removed.

In practice, registrations could be terminated by client (when stream is closed from kv client side) or by server (when replica is removed due to rebalancing or split/merge operations).
In first case registration will terminate its output loop, which will trigger unregistration request to processor and it will perform a cleanup as a part of its work loop. Processor will then wind down itself if that was the last registration.
However, if replica decides to terminate rangefeeds, it will send stop request to processor, which will in turn terminate its registrations. Registrations will update their state and close their output loop, which would trigger unregistration request to processor, but it won't be processed because processor's work loop is already terminated.

Environment:

  • CockroachDB version 23.1 but likely earlier versions as well.

Additional context
Metrics issue makes investigations problematic.

Jira issue: CRDB-29415

Epic CRDB-39959

@aliher1911 aliher1911 added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-kv-replication Relating to Raft, consensus, and coordination. labels Jul 4, 2023
@blathers-crl
Copy link

blathers-crl bot commented Jul 4, 2023

cc @cockroachdb/replication

@aliher1911
Copy link
Contributor Author

There's a similar issue with memory budget, but processor releases all budget unconditionally on termination without waiting for registrations to drain.

craig bot pushed a commit that referenced this issue Sep 22, 2023
110602: roachtest: add c2c/weekly/kv50 roachtest r=adityamaru a=msbutler

This patch adds a new weekly c2c roachtest that tests our 23.2
performance requirements under a given cluster configuraiton.

The workload:
- kv0, 2TB initial scan, 50% writes, 2000 max qps, insert row size uniformly
  between 1 and 4096 bytes.

The cluster specs:
- 8 nodes in each src and dest clusters, 8 vcpus, pdSize 1 TB

Release note: none

Epic: none

110959: rangefeed: fix kv.rangefeed.registrations metric r=erikgrinaker a=aliher1911

Previously when rangefeed processor was stopped by replica, registrations metric was not correctly decreased leading to metric creep on changefeed restarts or range splits. This commit fixes metric decrease for such cases.

Epic: none
Fixes: #106126

Release note: None

Co-authored-by: Michael Butler <[email protected]>
Co-authored-by: Oleg Afanasyev <[email protected]>
@craig craig bot closed this as completed in cdafd14 Sep 22, 2023
@erikgrinaker
Copy link
Contributor

#110959 did not appear to fully fix this, since restarting rangefeeds on 23.2 (e.g. by flipping kv.rangefeed.scheduler.enabled) leaks kv.rangefeed.registrations (shows 125, while there are only 12 outputLoop goroutines running).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-kv KV Team
Projects
No open projects
Status: Incoming
Development

Successfully merging a pull request may close this issue.

2 participants