rangefeed: measure impact of scheduler rangefeeds on system ranges #110344

Closed
aliher1911 opened this issue Sep 11, 2023 · 3 comments
Assignees
Labels
A-kv-rangefeed Rangefeed infrastructure, server+client branch-master Failures and bugs on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. GA-blocker O-premortem Issues identified during premortem exercise.

Comments

@aliher1911
Contributor

aliher1911 commented Sep 11, 2023

We need to measure the impact of rangefeed throughput on the timely delivery of rangefeed messages.

Regardless of the overall combined rangefeed throughput, timely delivery of messages on system rangefeeds is critical.
Processing an individual message should be quick, but we may get bursts of scheduled events for all ranges, and those bursts should not affect timely message delivery.

Jira issue: CRDB-31388

Epic CRDB-26372

@aliher1911 aliher1911 added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-kv-replication Relating to Raft, consensus, and coordination. branch-master Failures and bugs on the master branch. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Sep 11, 2023
@blathers-crl

blathers-crl bot commented Sep 11, 2023

cc @cockroachdb/replication

@erikgrinaker erikgrinaker added A-kv-rangefeed Rangefeed infrastructure, server+client GA-blocker and removed A-kv-replication Relating to Raft, consensus, and coordination. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Sep 12, 2023
@erikgrinaker erikgrinaker added the O-premortem Issues identified during premortem exercise. label Sep 14, 2023
@aliher1911
Contributor Author

So far, scheduler latency reaches 2-3s when the cluster is under heavy write load. I used the cdc benchmarks on 5 nodes with 16 vCPU machines. When writes are made in 10k batches across multiple ranges, latency spikes during changefeed startup.

I experimented with reducing the number of events that each processor handles in one go. First, I attached histograms to the processor to record batch sizes; they show that it handles between 64 and 128 events at a time. Since we support postponing part of the work to the next iteration, I tried limiting the batch to 32 events, but that showed only marginal improvement.
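For illustration, the batch-limiting experiment can be sketched roughly as follows. This is a minimal model, not the actual rangefeed processor code: `processor`, `event`, `maxBatch` and `processOnce` are hypothetical names, and the real scheduler reschedules the processor rather than looping in place.

```go
package main

import "fmt"

// event is a hypothetical stand-in for a queued rangefeed event.
type event int

// processor is a hypothetical model of a rangefeed processor with a queue of
// pending events. maxBatch caps how many events are handled per pass.
type processor struct {
	pending  []event
	maxBatch int
}

// processOnce handles at most maxBatch events and postpones the rest to the
// next pass. It returns the number of events handled and whether the
// processor still has pending work and needs to be scheduled again.
func (p *processor) processOnce(handle func(event)) (handled int, more bool) {
	n := len(p.pending)
	if p.maxBatch > 0 && n > p.maxBatch {
		n = p.maxBatch
	}
	for _, ev := range p.pending[:n] {
		handle(ev)
	}
	p.pending = p.pending[n:]
	return n, len(p.pending) > 0
}

func main() {
	p := &processor{maxBatch: 32}
	for i := 0; i < 100; i++ {
		p.pending = append(p.pending, event(i))
	}
	for pass := 1; ; pass++ {
		n, more := p.processOnce(func(event) {})
		fmt.Printf("pass %d handled %d events\n", pass, n)
		if !more {
			break
		}
	}
}
```

A smaller cap lets other processors run between passes, but it does not change how much total work is queued ahead of a given range, which may be why the improvement was marginal.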

I think we should proceed with the priority shard approach. That will give us better guarantees that low-throughput ranges are unaffected by high throughput, at least within the processor domain.
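A rough sketch of the priority shard idea, to make the isolation argument concrete. This is an assumption-laden toy, not the scheduler implementation: `scheduler`, `shard`, `enqueue` and the `isSystem` flag are hypothetical names, and each shard gets exactly one worker here.

```go
package main

import (
	"fmt"
	"sync"
)

// shard is a hypothetical work queue drained by its own worker goroutine.
type shard struct {
	queue chan func()
}

// scheduler routes work either to a dedicated priority shard or to the
// normal shard, so the two never share a queue or a worker.
type scheduler struct {
	priority, normal *shard
	wg               sync.WaitGroup
}

func newScheduler(queueSize int) *scheduler {
	s := &scheduler{
		priority: &shard{queue: make(chan func(), queueSize)},
		normal:   &shard{queue: make(chan func(), queueSize)},
	}
	for _, sh := range []*shard{s.priority, s.normal} {
		s.wg.Add(1)
		go func(sh *shard) {
			defer s.wg.Done()
			for task := range sh.queue {
				task()
			}
		}(sh)
	}
	return s
}

// enqueue submits a task. isSystem is an assumed flag marking work that
// belongs to a system range.
func (s *scheduler) enqueue(isSystem bool, task func()) {
	if isSystem {
		s.priority.queue <- task
		return
	}
	s.normal.queue <- task
}

// stop closes both queues and waits for the workers to drain them.
func (s *scheduler) stop() {
	close(s.priority.queue)
	close(s.normal.queue)
	s.wg.Wait()
}

func main() {
	s := newScheduler(16)
	gate := make(chan struct{})
	s.enqueue(false, func() { <-gate }) // keep the normal shard busy

	done := make(chan struct{})
	s.enqueue(true, func() { close(done) })
	<-done // completes even though the normal shard is still blocked
	fmt.Println("system task ran while the normal shard was busy")

	close(gate)
	s.stop()
}
```

With a single shared queue and worker, the system task in `main` would sit behind the blocked task; with a separate shard it is only delayed by other priority work.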

@erikgrinaker
Contributor

I believe we did see multi-second scheduling latencies during startup (specifically due to catchup scan iterator construction). While this is being moved off of the scheduler goroutine in #111045, it still seems useful to prioritize system ranges, so we're merging that in #110810. We also have scheduling latency metrics for both the normal and system shards in #110458, so we can continue to investigate this.


2 participants