
kv: undesirable behavior as the system load increases past sustainable limits #89142

Open
andrewbaptist opened this issue Oct 1, 2022 · 8 comments

Comments

andrewbaptist (Collaborator) commented Oct 1, 2022

Describe the problem

As the system pushes past sustainable limits, lots of bad things happen:

  1. Throughput drops to a fraction of its initial value
  2. The LSM gets inverted
  3. Nodes crash with out-of-memory (OOM) errors

To Reproduce

Instructions are here: https://gist.github.com/andrewbaptist/27cee6c88ac9e704b5efaf15a37d4a37

What did you do? Describe in your own words.

  1. Set up a 6 node cluster
  2. Initialize with 10K warehouses
  3. Begin a run with a ramp
  4. Monitor the system

Expected behavior

TBD: part of the goal of this issue is to define the expected behavior and compare it with the current behavior.

Environment:

  • CockroachDB version 22.2 beta2

Jira issue: CRDB-20133

@andrewbaptist andrewbaptist added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-admission-control labels Oct 1, 2022
andrewbaptist (Collaborator, Author) commented Oct 1, 2022

[Attached: eight DB Console screenshots, screencapture-34-74-199-84-26258-2022-10-01-09_35_07 through screencapture-34-74-199-84-26258-2022-10-01-09_37_45]

andrewbaptist (Collaborator, Author)

Attachment: logs.tgz

andrewbaptist (Collaborator, Author)

The first part of the run:
[Attached: three DB Console screenshots, screencapture-34-74-199-84-26258-2022-10-01-09_41_34 through screencapture-34-74-199-84-26258-2022-10-01-09_44_45]

andrewbaptist (Collaborator, Author)

From the experiments, the system goes through four "distinct" phases:

  • Phase 1 (00:45 - 01:25) The SQL throughput is well within the range of the configured system throughput.
  • Phase 2 (01:02 - 01:25) Admission control begins to slow requests down.
  • Phase 3 (01:25 - 02:05) Queues are building in the system and measured throughput begins to drop.
  • Phase 4 (02:05 - 02:45) System memory becomes critical as queues are overloaded. System crashes.

Phase 1

During phase 1, the system is mostly well-behaved. The number of transactions/sec is increasing linearly and the CPU utilization reaches 50% near the end. Runnable goroutines/CPU stays stable at about 0.4-0.5. The storage layer is very healthy during this period.

During this period the P90 latency stays flat, but the P99 latency rises slightly. Going up the stack, the latencies are:

Command commit latency - stays close to 30 microseconds (both P90 and P99)
Log commit latency - P99 has a small increase from ~1ms to ~3ms, P50 is stable at 0.2ms.
KV latency - P99 increases about linearly from 5ms to 75ms. P90 is flat at about 1ms.
Service latency - P99 increases about linearly from 20ms to 50ms. P90 stays flat at about 5ms.
Transaction latency - P99 increases linearly from about 100ms to about 400ms. P90 stays flat at about 100ms.
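(As an aside, a lightweight way to watch signals like these outside the DB Console during such a run is to scrape each node's Prometheus endpoint. The sketch below is only illustrative: it assumes the `/_status/vars` endpoint on the node's HTTP port, 26258 per the screenshots above, and the two metric names are assumptions that should be checked against the actual endpoint output.)

```go
// poll_vars.go: periodically scrape a CockroachDB node's Prometheus endpoint
// and print a couple of load signals. Illustrative sketch only; the address
// and metric names below are assumptions to verify against /_status/vars.
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
	"time"
)

var metrics = []string{
	"sys_cpu_combined_percent_normalized", // assumed name for normalized CPU utilization
	"sys_runnable_goroutines_per_cpu",     // assumed name for runnable goroutines per CPU
}

// scrape fetches the text exposition format and returns the lines for the
// metrics of interest.
func scrape(url string) (map[string]string, error) {
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	out := make(map[string]string)
	sc := bufio.NewScanner(resp.Body)
	for sc.Scan() {
		line := sc.Text()
		for _, m := range metrics {
			if strings.HasPrefix(line, m+" ") || strings.HasPrefix(line, m+"{") {
				out[m] = line
			}
		}
	}
	return out, sc.Err()
}

func main() {
	const url = "http://34.74.199.84:26258/_status/vars" // node HTTP address taken from the screenshots
	for range time.Tick(10 * time.Second) {
		vals, err := scrape(url)
		if err != nil {
			fmt.Println("scrape error:", err)
			continue
		}
		for _, m := range metrics {
			fmt.Println(vals[m])
		}
		fmt.Println("---")
	}
}
```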

Phase 2

In phase 2, the system's throughput continues to increase linearly, going from 190 to 380 transactions/sec/node. Admission control begins adding delays to requests, and those delays grow over the period. CPU, goroutines/CPU, and latencies all begin increasing roughly linearly during this period. However, system-level queues are not yet building, since all requests can still be processed.

Command commit latency - stays close to 30 microseconds (both P90 and P99).
Log commit latency - P99 increases from ~3ms to ~15ms, P50 is stable at 0.2ms.
KV latency - P99 increases about linearly from 5ms to 75ms. P90 increases from 1ms to 10ms.
Service latency - P99 increases about linearly from 50ms to 200ms. P90 increases from 5ms to 50ms.
Transaction latency - P99 increases linearly from about 0.4s to about 1.2s. P90 increases from 100ms to 400ms.

Phase 3

In phase 3, the system starts at close to peak throughput of 500 transactions/sec/node. The CPU stays at 100% for this entire phase.

Transaction latency is maxed out at 10s/transaction and queues are building throughout the system. Once the system reaches its peak transactions/sec, additional requests are queued and SQL memory begins growing. Transaction throughput decreases during this period, from about 500/s to around 300/s near the end, as the system becomes increasingly less efficient. Service and KV latencies climb during this period as the number of goroutines/core increases to about 20.

At the storage layer, the LSM becomes somewhat less healthy, with read amplification climbing from 5 to well above that near the end of the phase. It is not entirely clear why compaction cannot keep up, but it is likely that the CPU-saturated system does not leave compactions enough CPU to run often enough.

Phase 4

In phase 4, the system falls apart, as it is well past a healthy rate of incoming requests. Memory usage becomes critical and the kernel kills the cockroach process on the nodes. Throughput declines to about 50-100 transactions/s, and the number being processed becomes highly unstable. The first node fails at 2:20 and the others fail by 2:45. The LSM also becomes severely unhealthy during this time. With the high number of goroutines running, the system is not even able to monitor clock offsets correctly, and they rise to almost 10ms.

@irfansharif irfansharif removed the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Oct 3, 2022
sumeerbhola (Collaborator)

Starting at 2:05 the "KV Admission Slots" are climbing, and so is "Runnable Goroutines per CPU". The two increases may be on different nodes, and it's not clear to me whether for the former it is the used slots that are climbing or the limit. But admission control is no longer able to exercise control, which suggests there is a lot of work bypassing admission control. The raftScheduler goroutines are bounded to 8*GOMAXPROCS, so it cannot be follower work that is responsible for these goroutine counts. I suspect it is one of the things that explicitly bypasses admission control, akin to https://cockroachlabs.slack.com/archives/C0KB9Q03D/p1661288404006489?thread_ts=1661189220.814819&cid=C0KB9Q03D -- a goroutine dump and CPU profile would tell us.
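For reference, both artifacts can be pulled from a node's standard Go pprof endpoints; a minimal sketch, assuming the `/debug/pprof` handlers are served on the node's HTTP port (26258 here, per the screenshots) and are reachable without authentication on this insecure test cluster:

```go
// fetch_profiles.go: download a goroutine dump and a 30s CPU profile from a
// node's debug endpoints. Sketch only; the address and the no-auth assumption
// are specific to this insecure test cluster.
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

// fetch downloads url into the file dest.
func fetch(url, dest string) error {
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	f, err := os.Create(dest)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = io.Copy(f, resp.Body)
	return err
}

func main() {
	const base = "http://34.74.199.84:26258/debug/pprof"
	// Full text dump of every goroutine stack.
	if err := fetch(base+"/goroutine?debug=2", "goroutines.txt"); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
	// 30-second CPU profile in pprof protobuf format.
	if err := fetch(base+"/profile?seconds=30", "profile.pb.gz"); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```

Equivalently, `go tool pprof` can fetch and open the CPU profile directly from that URL.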

andrewbaptist added a commit to andrewbaptist/cockroach that referenced this issue Oct 14, 2022
This test is here to check behavior of the system as the SQL load
greatly exceeds what the nodes are able to handle. In the future
we want to evaluate how this is handled, but today this will cause
nodes to OOM.

Informs cockroachdb#89142.

Release note: None
andrewbaptist (Collaborator, Author)

These were taken about 30 minutes before the nodes OOM.

Goroutine dump: profile (1).pb.gz

CPU profile: profile.pb.gz
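For a quick look at where the CPU time in a profile like the attached `profile.pb.gz` goes without opening `go tool pprof`, a small program over the `github.com/google/pprof/profile` package can print the top leaf functions by sample value. This is only an illustration of how to read the attachment, not analysis from this run:

```go
// top_flat.go: print the top leaf functions from a pprof CPU profile such as
// the attached profile.pb.gz (file name assumed).
package main

import (
	"fmt"
	"os"
	"sort"

	"github.com/google/pprof/profile"
)

func main() {
	f, err := os.Open("profile.pb.gz")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	p, err := profile.Parse(f) // Parse handles the gzip wrapper itself.
	if err != nil {
		panic(err)
	}

	// Attribute each sample's last value (CPU nanoseconds for a CPU profile)
	// to its leaf frame: Location[0] is the innermost location and Line[0]
	// the innermost (possibly inlined) function.
	flat := make(map[string]int64)
	vi := len(p.SampleType) - 1
	for _, s := range p.Sample {
		if len(s.Location) == 0 || len(s.Location[0].Line) == 0 {
			continue
		}
		fn := s.Location[0].Line[0].Function
		if fn == nil {
			continue
		}
		flat[fn.Name] += s.Value[vi]
	}

	type kv struct {
		name string
		val  int64
	}
	var sorted []kv
	for n, v := range flat {
		sorted = append(sorted, kv{n, v})
	}
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].val > sorted[j].val })
	for i := 0; i < 10 && i < len(sorted); i++ {
		fmt.Printf("%15d  %s\n", sorted[i].val, sorted[i].name)
	}
}
```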

andrewbaptist added a commit to andrewbaptist/cockroach that referenced this issue Oct 17, 2022
This test is here to check behavior of the system as the SQL load
greatly exceeds what the nodes are able to handle. In the future
we want to evaluate how this is handled, but today this will cause
nodes to OOM.

Informs cockroachdb#89142.

Release note: None
craig bot pushed a commit that referenced this issue Oct 17, 2022
89813: kvserver: remove legacy snapshot and diff code r=erikgrinaker a=pavelkalinnikov

This commit removes the proto and code for `RaftDataSnapshot`. This type was used for reporting replica snapshots and computing diffs, but now this functionality has been removed in favor of storage checkpoints and offline tooling.

Part of #21128
Epic: none

Release note: None

90006: roachtest: Introduce a test to overwhelm nodes r=irfansharif a=andrewbaptist

This test is here to check behavior of the system as the SQL load greatly exceeds what the nodes are able to handle. In the future we want to evaluate how this is handled, but today this will cause nodes to OOM.

Informs #89142.

Release note: None

90063: roachtest: de-flake admission-control/tpcc-olap r=irfansharif a=irfansharif

Fixes #89600. This test makes use of `workload querybench`, which is only available in the `workload` binary. This test started failing after #89482, where we attempted to replace all uses of `./workload` with `./cockroach workload`.

Release note: None

Co-authored-by: Pavel Kalinnikov <[email protected]>
Co-authored-by: Andrew Baptist <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
sumeerbhola (Collaborator)

There are only 22 goroutines not in runtime.gopark (which is a waiting state), so there shouldn't have been a high runnable count when this profile was taken.
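For anyone repeating that check, counting goroutines whose stacks are not parked in `runtime.gopark` can be scripted against the `?debug=2` text dump rather than eyeballed. A rough sketch; the parsing of the dump format is informal and `goroutines.txt` is a placeholder file name:

```go
// count_unparked.go: count goroutines in a debug=2 goroutine dump whose stacks
// do not contain runtime.gopark, i.e. goroutines that are not in a waiting
// state. Informal parser for the text dump format.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	f, err := os.Open("goroutines.txt")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	var total, unparked int
	inGoroutine, parked := false, false

	// flush tallies the goroutine block we just finished reading.
	flush := func() {
		if inGoroutine {
			total++
			if !parked {
				unparked++
			}
		}
	}

	sc := bufio.NewScanner(f)
	sc.Buffer(make([]byte, 0, 1<<20), 1<<20) // allow unusually long lines, just in case
	for sc.Scan() {
		line := sc.Text()
		// Each goroutine block starts with a header like "goroutine 123 [select]:".
		if strings.HasPrefix(line, "goroutine ") && strings.HasSuffix(line, ":") {
			flush()
			inGoroutine, parked = true, false
			continue
		}
		if strings.Contains(line, "runtime.gopark") {
			parked = true
		}
	}
	flush()
	if err := sc.Err(); err != nil {
		panic(err)
	}
	fmt.Printf("%d of %d goroutines are not parked in runtime.gopark\n", unparked, total)
}
```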

andrewbaptist added a commit to andrewbaptist/cockroach that referenced this issue Apr 10, 2024
Previously this test was created but wasn't run since it would always
fail. The test still doesn't pass in the default configuration; however,
after the addition of cockroachdb#118781, it is at least possible to
configure a cluster to pass.

Informs: cockroachdb#110272
Informs: cockroachdb#89142

Release note: None

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
10 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

craig bot pushed a commit that referenced this issue Apr 19, 2024
121833: roachtest: update tpcc overload r=rafiss a=andrewbaptist

Previously this test was created but wasn't run since it would always fail. The test still doesn't pass in the default configuration; however, after the addition of #118781, it is at least possible to configure a cluster to pass.

Informs: #110272
Informs: #89142

Release note: None

Co-authored-by: Andrew Baptist <[email protected]>
@andrewbaptist andrewbaptist changed the title Undesirable behavior as the system load increases past sustainable limits kv: undesirable behavior as the system load increases past sustainable limits Nov 7, 2024