kv: undesirable behavior as the system load increases past sustainable limits #89142
Comments
From the experiments, the system has 4 "distinct" phases it goes through.
Phase 1During phase 1, the system is mostly well-behaved. The number of transactions/sec is increasing linearly and the CPU utilization reaches 50% near the end. Runnable goroutines/CPU stays stable at about 0.4-0.5. The storage layer is very healthy during this period. During this period the P90 latency stays flat, but P99 latency has a small rise. The latencies going up the stack: Command commit latency - stays close to 30 microseconds (both P90 and P99) Phase 2In phase 2, the system linearly increases throughout, as it goes from 190 to 380 transactions/sec/node. Admission control begins adding delays to requests which increases during the period. Everything begins linear increases during this period (CPU, goroutines/CPU, latencies). However system-level queues are not building as all requests are able to be processed. Command commit latency - stays close to 30 microseconds (both P90 and P99). Phase 3In phase 3, the system starts at close to peak throughput of 500 transactions/sec/node. The CPU stays at 100% for this entire phase. Transaction latency is maxed out at 10s/transactions and queues are building through the system. After it reaches the peak transactions/sec, additional requests are queued and SQL memory begins growing. The transaction throughput decreases during this period as the system becomes increasingly less efficient from about 500/sec to around 300/s near the end. Service and KV latencies are climbing during this period as the number of goroutines/core increases to about 20. At the storage layer, the LSM becomes a little less healthy resulting in read amplification increasing from 5 to a5 near the end of the phase. It is not entirely clear why compaction can not keep up, but likely it is due to the limited CPU on the system not allowing compaction to run enough. Phase 4In phase 4, the system falls apart is it is well past a healthy rate of incoming requests. Memory usage becomes critical and the kernel kills the cockroach process on the nodes. The system declines in throughput to about 50-100/s with a highly unstable number being processed. The first node fails at 2:20 and the others fail by 2:45. The LSM also becomes severely unhealthy during this time. With the high number of goroutines running the system is not even able to correctly monitor clock offsets and they rise to almost 10ms. |
Starting at 2:05 the "KV Admission Slots" are climbing, and so is the "Runnable Goroutines per CPU". The two increases may be on different nodes, and it's not clear to me whether for the former it is the used slots that are climbing or the limit. But admission control is no longer able to exercise control, which suggests there is a lot of work bypassing admission control. The raftScheduler goroutines are bounded to 8*GOMAXPROCS, so it cannot be follower work that is responsible for these goroutine counts. I suspect it is one of the things that explicitly bypasses admission control, akin to https://cockroachlabs.slack.com/archives/C0KB9Q03D/p1661288404006489?thread_ts=1661189220.814819&cid=C0KB9Q03D -- a goroutine dump and CPU profile would tell us.
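For completeness, here is a minimal sketch of how such a goroutine dump and CPU profile could be pulled from a running node, assuming the node exposes Go's standard `net/http/pprof` endpoints on its HTTP port. The address, port, and output file names below are placeholders, not values taken from this issue.

```go
// Minimal sketch: fetch a goroutine dump and a 30-second CPU profile from a
// node's pprof HTTP endpoints (assumed to be served under /debug/pprof).
package main

import (
	"io"
	"net/http"
	"os"
	"time"
)

// fetch downloads url into the file named out, with an overall timeout.
func fetch(url, out string, timeout time.Duration) error {
	client := &http.Client{Timeout: timeout}
	resp, err := client.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	f, err := os.Create(out)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = io.Copy(f, resp.Body)
	return err
}

func main() {
	const base = "http://localhost:8080/debug/pprof" // hypothetical node HTTP address
	// Full goroutine stacks, human readable.
	if err := fetch(base+"/goroutine?debug=2", "goroutines.txt", 30*time.Second); err != nil {
		panic(err)
	}
	// 30-second CPU profile; the client timeout must exceed the sampling window.
	if err := fetch(base+"/profile?seconds=30", "cpu.pprof", 45*time.Second); err != nil {
		panic(err)
	}
}
```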
This test is here to check the behavior of the system as the SQL load greatly exceeds what the nodes are able to handle. In the future we want to evaluate how this is handled, but today this will cause nodes to OOM. Informs cockroachdb#89142. Release note: None
These are taken about 30 minutes before the nodes OOM.

Goroutine dump:

CPU profile:
89813: kvserver: remove legacy snapshot and diff code r=erikgrinaker a=pavelkalinnikov

This commit removes the proto and code for `RaftDataSnapshot`. This type was used for reporting replica snapshots and computing diffs, but this functionality has now been removed in favor of storage checkpoints and offline tooling. Part of #21128. Epic: none. Release note: None

90006: roachtest: Introduce a test to overwhelm nodes r=irfansharif a=andrewbaptist

This test is here to check the behavior of the system as the SQL load greatly exceeds what the nodes are able to handle. In the future we want to evaluate how this is handled, but today this will cause nodes to OOM. Informs #89142. Release note: None

90063: roachtest: de-flake admission-control/tpcc-olap r=irfansharif a=irfansharif

Fixes #89600. This test makes use of `workload querybench`, which is only available in the `workload` binary. This test started failing after #89482, where we attempted to replace all uses of `./workload` with `./cockroach workload`. Release note: None

Co-authored-by: Pavel Kalinnikov <[email protected]>
Co-authored-by: Andrew Baptist <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
There are only 22 goroutines not in …
Previously this test was created but wasn't run, since it would always fail. The test still doesn't pass in the default configuration; however, after the addition of cockroachdb#118781, it is at least possible to configure a cluster to pass. Informs: cockroachdb#110272. Informs: cockroachdb#89142. Release note: None
121833: roachtest: update tpcc overload r=rafiss a=andrewbaptist

Previously this test was created but wasn't run, since it would always fail. The test still doesn't pass in the default configuration; however, after the addition of #118781, it is at least possible to configure a cluster to pass. Informs: #110272. Informs: #89142. Release note: None

Co-authored-by: Andrew Baptist <[email protected]>
Describe the problem
As the system pushes past sustainable limits, a number of undesirable behaviors occur (described in the comments above): latencies climb, queues build throughout the system, the LSM becomes unhealthy, and nodes eventually OOM.
To Reproduce
Instructions are here: https://gist.github.com/andrewbaptist/27cee6c88ac9e704b5efaf15a37d4a37
Expected behavior
TBD what the expected behavior is. Part of the goal of this issue is to define the expected behavior and compare it to the current behavior.
Environment:
Jira issue: CRDB-20133