
kv: undesirable behavior as the system load increases past sustainable limits #89142

Open
andrewbaptist opened this issue Oct 1, 2022 · 8 comments

Comments

andrewbaptist (Collaborator) commented Oct 1, 2022

Describe the problem

As the system pushes past sustainable limits, lots of bad things happen:

  1. Throughput drops to a fraction of its initial value
  2. The LSM gets inverted
  3. Nodes crash with out-of-memory (OOM) errors

To Reproduce

Instructions are here: https://gist.github.com/andrewbaptist/27cee6c88ac9e704b5efaf15a37d4a37

What did you do? Describe in your own words.

  1. Set up a 6 node cluster
  2. Initialize with 10K warehouses
  3. Begin a run with a ramp
  4. Monitor the system

Expected behavior

TBD: part of the goal of this issue is to define the expected behavior and compare it with the current behavior.

Environment:

  • CockroachDB version 22.2 beta2

Jira issue: CRDB-20133

@andrewbaptist andrewbaptist added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-admission-control labels Oct 1, 2022
andrewbaptist (Collaborator, Author) commented Oct 1, 2022

[Attached: eight DB Console screenshots, screencapture-34-74-199-84-26258-2022-10-01-09_35_07 through screencapture-34-74-199-84-26258-2022-10-01-09_37_45]

andrewbaptist (Collaborator, Author)

Attachment: logs.tgz

andrewbaptist (Collaborator, Author)

The first part of the run:
[Attached: three DB Console screenshots, screencapture-34-74-199-84-26258-2022-10-01-09_41_34 through screencapture-34-74-199-84-26258-2022-10-01-09_44_45]

andrewbaptist (Collaborator, Author)

From the experiments, the system goes through four "distinct" phases:

  • Phase 1 (00:45 - 01:25) The SQL throughput is well within the range of the configured system throughput.
  • Phase 2 (01:02 - 01:25) Admission control begins to slow requests down.
  • Phase 3 (01:25 - 02:05) Queues are building in the system and measured throughput begins to drop.
  • Phase 4 (02:05 - 02:45) System memory becomes critical as queues are overloaded. System crashes.

Phase 1

During phase 1, the system is mostly well-behaved. The number of transactions/sec is increasing linearly and the CPU utilization reaches 50% near the end. Runnable goroutines/CPU stays stable at about 0.4-0.5. The storage layer is very healthy during this period.

During this period the P90 latency stays flat, but the P99 latency rises slightly. Going up the stack, the latencies are:

Command commit latency - stays close to 30 microseconds (both P90 and P99)
Log commit latency - P99 has a small increase from ~1ms to ~3ms, P50 is stable at 0.2ms.
KV latency - P99 increases about linearly from 5ms to 75ms. P90 is flat at about 1ms.
Service latency - P99 increases about linearly from 20ms to 50ms. P90 stays flat at about 5ms.
Transaction latency - P99 increases linearly from about 100ms to about 400ms. P90 stays flat at about 100ms.
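(As an aside, a lightweight way to watch signals like these outside the DB Console during such a run is to scrape each node's Prometheus endpoint. The sketch below is only illustrative: it assumes the `/_status/vars` endpoint on the node's HTTP port, 26258 per the screenshots above, and the two metric names are assumptions that should be checked against the actual endpoint output.)

```go
// poll_vars.go: periodically scrape a CockroachDB node's Prometheus endpoint
// and print a couple of load signals. Illustrative sketch only; the address
// and metric names below are assumptions to verify against /_status/vars.
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
	"time"
)

var metrics = []string{
	"sys_cpu_combined_percent_normalized", // assumed name for normalized CPU utilization
	"sys_runnable_goroutines_per_cpu",     // assumed name for runnable goroutines per CPU
}

// scrape fetches the text exposition format and returns the lines for the
// metrics of interest.
func scrape(url string) (map[string]string, error) {
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	out := make(map[string]string)
	sc := bufio.NewScanner(resp.Body)
	for sc.Scan() {
		line := sc.Text()
		for _, m := range metrics {
			if strings.HasPrefix(line, m+" ") || strings.HasPrefix(line, m+"{") {
				out[m] = line
			}
		}
	}
	return out, sc.Err()
}

func main() {
	const url = "http://34.74.199.84:26258/_status/vars" // node HTTP address taken from the screenshots
	for range time.Tick(10 * time.Second) {
		vals, err := scrape(url)
		if err != nil {
			fmt.Println("scrape error:", err)
			continue
		}
		for _, m := range metrics {
			fmt.Println(vals[m])
		}
		fmt.Println("---")
	}
}
```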

Phase 2

In phase 2, the system's throughput continues to increase linearly, going from 190 to 380 transactions/sec/node. Admission control begins adding delays to requests, and those delays grow over the period. CPU, goroutines/CPU, and latencies all begin increasing roughly linearly during this period. However, system-level queues are not yet building, since all requests can still be processed.

Command commit latency - stays close to 30 microseconds (both P90 and P99).
Log commit latency - P99 increases from ~3ms to ~15ms, P50 is stable at 0.2ms.
KV latency - P99 increases about linearly from 5ms to 75ms. P90 increases from 1ms to 10ms.
Service latency - P99 increases about linearly from 50ms to 200ms. P90 increases from 5ms to 50ms.
Transaction latency - P99 increases linearly from about 0.4s to about 1.2s. P90 increases from 100ms to 400ms.

Phase 3

In phase 3, the system starts at close to peak throughput of 500 transactions/sec/node. The CPU stays at 100% for this entire phase.

Transaction latency is maxed out at 10s/transaction and queues are building throughout the system. Once the system reaches its peak transactions/sec, additional requests are queued and SQL memory begins growing. Transaction throughput decreases during this period, from about 500/s to around 300/s near the end, as the system becomes increasingly less efficient. Service and KV latencies climb during this period as the number of goroutines/core increases to about 20.

At the storage layer, the LSM becomes somewhat less healthy, with read amplification climbing from 5 to well above that near the end of the phase. It is not entirely clear why compaction cannot keep up, but it is likely that the CPU-saturated system does not leave compactions enough CPU to run often enough.

Phase 4

In phase 4, the system falls apart, as it is well past a healthy rate of incoming requests. Memory usage becomes critical and the kernel kills the cockroach process on the nodes. Throughput declines to about 50-100 transactions/s, and the number being processed becomes highly unstable. The first node fails at 2:20 and the others fail by 2:45. The LSM also becomes severely unhealthy during this time. With the high number of goroutines running, the system is not even able to monitor clock offsets correctly, and they rise to almost 10ms.

@irfansharif irfansharif removed the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Oct 3, 2022
sumeerbhola (Collaborator)

Starting at 2:05 the "KV Admission Slots" are climbing, and so is "Runnable Goroutines per CPU". The two increases may be on different nodes, and it's not clear to me whether for the former it is the used slots that are climbing or the limit. But admission control is no longer able to exercise control, which suggests there is a lot of work bypassing admission control. The raftScheduler goroutines are bounded to 8*GOMAXPROCS, so it cannot be follower work that is responsible for these goroutine counts. I suspect it is one of the things that explicitly bypasses admission control, akin to https://cockroachlabs.slack.com/archives/C0KB9Q03D/p1661288404006489?thread_ts=1661189220.814819&cid=C0KB9Q03D -- a goroutine dump and CPU profile would tell us.
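For reference, both artifacts can be pulled from a node's standard Go pprof endpoints; a minimal sketch, assuming the `/debug/pprof` handlers are served on the node's HTTP port (26258 here, per the screenshots) and are reachable without authentication on this insecure test cluster:

```go
// fetch_profiles.go: download a goroutine dump and a 30s CPU profile from a
// node's debug endpoints. Sketch only; the address and the no-auth assumption
// are specific to this insecure test cluster.
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

// fetch downloads url into the file dest.
func fetch(url, dest string) error {
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	f, err := os.Create(dest)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = io.Copy(f, resp.Body)
	return err
}

func main() {
	const base = "http://34.74.199.84:26258/debug/pprof"
	// Full text dump of every goroutine stack.
	if err := fetch(base+"/goroutine?debug=2", "goroutines.txt"); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
	// 30-second CPU profile in pprof protobuf format.
	if err := fetch(base+"/profile?seconds=30", "profile.pb.gz"); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```

Equivalently, `go tool pprof` can fetch and open the CPU profile directly from that URL.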

andrewbaptist added a commit to andrewbaptist/cockroach that referenced this issue Oct 14, 2022
This test is here to check behavior of the system as the SQL load
greatly exceeds what the nodes are able to handle. In the future
we want to evaluate how this is handled, but today this will cause
nodes to OOM.

Informs cockroachdb#89142.

Release note: None
andrewbaptist (Collaborator, Author)

These were taken about 30 minutes before the nodes OOM.

Goroutine dump: profile (1).pb.gz

CPU profile: profile.pb.gz
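For a quick look at where the CPU time in a profile like the attached `profile.pb.gz` goes without opening `go tool pprof`, a small program over the `github.com/google/pprof/profile` package can print the top leaf functions by sample value. This is only an illustration of how to read the attachment, not analysis from this run:

```go
// top_flat.go: print the top leaf functions from a pprof CPU profile such as
// the attached profile.pb.gz (file name assumed).
package main

import (
	"fmt"
	"os"
	"sort"

	"github.com/google/pprof/profile"
)

func main() {
	f, err := os.Open("profile.pb.gz")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	p, err := profile.Parse(f) // Parse handles the gzip wrapper itself.
	if err != nil {
		panic(err)
	}

	// Attribute each sample's last value (CPU nanoseconds for a CPU profile)
	// to its leaf frame: Location[0] is the innermost location and Line[0]
	// the innermost (possibly inlined) function.
	flat := make(map[string]int64)
	vi := len(p.SampleType) - 1
	for _, s := range p.Sample {
		if len(s.Location) == 0 || len(s.Location[0].Line) == 0 {
			continue
		}
		fn := s.Location[0].Line[0].Function
		if fn == nil {
			continue
		}
		flat[fn.Name] += s.Value[vi]
	}

	type kv struct {
		name string
		val  int64
	}
	var sorted []kv
	for n, v := range flat {
		sorted = append(sorted, kv{n, v})
	}
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].val > sorted[j].val })
	for i := 0; i < 10 && i < len(sorted); i++ {
		fmt.Printf("%15d  %s\n", sorted[i].val, sorted[i].name)
	}
}
```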

andrewbaptist added a commit to andrewbaptist/cockroach that referenced this issue Oct 17, 2022
This test is here to check behavior of the system as the SQL load
greatly exceeds what the nodes are able to handle. In the future
we want to evaluate how this is handled, but today this will cause
nodes to OOM.

Informs cockroachdb#89142.

Release note: None
craig bot pushed a commit that referenced this issue Oct 17, 2022
89813: kvserver: remove legacy snapshot and diff code r=erikgrinaker a=pavelkalinnikov

This commit removes the proto and code for `RaftDataSnapshot`. This type was used for reporting replica snapshots and computing diffs, but now this functionality has been removed in favor of storage checkpoints and offline tooling.

Part of #21128
Epic: none

Release note: None

90006: roachtest: Introduce a test to overwhelm nodes r=irfansharif a=andrewbaptist

This test is here to check behavior of the system as the SQL load greatly exceeds what the nodes are able to handle. In the future we want to evaluate how this is handled, but today this will cause nodes to OOM.

Informs #89142.

Release note: None

90063: roachtest: de-flake admission-control/tpcc-olap r=irfansharif a=irfansharif

Fixes #89600. This test makes use of `workload querybench`, which is only available in the `workload` binary. This test started failing after #89482, where we attempted to replace all uses of `./workload` with `./cockroach workload`.

Release note: None

Co-authored-by: Pavel Kalinnikov <[email protected]>
Co-authored-by: Andrew Baptist <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
sumeerbhola (Collaborator)

There are only 22 goroutines not in runtime.gopark (which is a waiting state), so there shouldn't have been a high runnable count when this profile was taken.
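For anyone repeating that check, counting goroutines whose stacks are not parked in `runtime.gopark` can be scripted against the `?debug=2` text dump rather than eyeballed. A rough sketch; the parsing of the dump format is informal and `goroutines.txt` is a placeholder file name:

```go
// count_unparked.go: count goroutines in a debug=2 goroutine dump whose stacks
// do not contain runtime.gopark, i.e. goroutines that are not in a waiting
// state. Informal parser for the text dump format.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	f, err := os.Open("goroutines.txt")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	var total, unparked int
	inGoroutine, parked := false, false

	// flush tallies the goroutine block we just finished reading.
	flush := func() {
		if inGoroutine {
			total++
			if !parked {
				unparked++
			}
		}
	}

	sc := bufio.NewScanner(f)
	sc.Buffer(make([]byte, 0, 1<<20), 1<<20) // allow unusually long lines, just in case
	for sc.Scan() {
		line := sc.Text()
		// Each goroutine block starts with a header like "goroutine 123 [select]:".
		if strings.HasPrefix(line, "goroutine ") && strings.HasSuffix(line, ":") {
			flush()
			inGoroutine, parked = true, false
			continue
		}
		if strings.Contains(line, "runtime.gopark") {
			parked = true
		}
	}
	flush()
	if err := sc.Err(); err != nil {
		panic(err)
	}
	fmt.Printf("%d of %d goroutines are not parked in runtime.gopark\n", unparked, total)
}
```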

andrewbaptist added a commit to andrewbaptist/cockroach that referenced this issue Apr 10, 2024
Previously this test was created but wasn't run since it would always
fail. The test still doesn't pass in the default configuration; however,
after the addition of cockroachdb#118781, it is at least possible to
configure a cluster to pass.

Informs: cockroachdb#110272
Informs: cockroachdb#89142

Release note: None

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
10 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

craig bot pushed a commit that referenced this issue Apr 19, 2024
121833: roachtest: update tpcc overload r=rafiss a=andrewbaptist

Previously this test was created but wasn't run since it would always fail. The test still doesn't pass in the default configuration; however, after the addition of #118781, it is at least possible to configure a cluster to pass.

Informs: #110272
Informs: #89142

Release note: None

Co-authored-by: Andrew Baptist <[email protected]>
@andrewbaptist andrewbaptist changed the title Undesirable behavior as the system load increases past sustainable limits kv: undesirable behavior as the system load increases past sustainable limits Nov 7, 2024