Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: attempt to handle VM overload under tpccbench #62361

Merged
merged 1 commit into from
Apr 2, 2021

Conversation

tbg
Copy link
Member

@tbg tbg commented Mar 22, 2021

See #62039.

tpccbench, by design, pushes CRDB into overload territory. The test
harness handles nodes crashing or tpmc tanking well. However, it was
not prepared to handle the cloud VMs going unresponsive for ~minutes,
which is one common failure mode.

This commit tweaks the line search to be resilient to failures to
communicate with the cloud VM in the one place where it matters
(stopping the cluster at the beginning of a new search attempt).

The hope is that this will allow the search to run to completion,
even in the face of overload-imposed temporary VM outages. It is
not expected to do this reliably, but at least anecdotally most
VMs seem to come back a few minutes in.

Release note: None

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@tbg tbg force-pushed the tpccbench-harden branch from 9d9d3db to dad8261 Compare March 22, 2021 16:54
@tbg
Copy link
Member Author

tbg commented Mar 23, 2021

I realized that this isn't nearly enough. The line search ends on any error:

pass, err := pred(s.current())
if err != nil {
return 0, err
}

I did a quick sanity check running with 2500 warehouses (max is < 2200) and yep, you quickly get a node to crash, the line search ends, the test fails hard. It seems that any overload error encountered by the test would fail the test (though perhaps the stats for any passing run would still get picked up by roachperf, thus providing at least a data point? Not sure)

I'll do some more work here to make it actually resilient to overload, i.e. this should be interpreted as a failing run of TPCC, not a failure of the test.

@irfansharif
Copy link
Contributor

It seems that any overload error encountered by the test would fail the test

I've seen instances where if the overload happens early on during the workload (say 3m in, instead of 8m in), and the unresponsive VM is "back" by the time we've finished the 10m workload (albeit in degraded mode, using compute of 2/3 nodes), and so is roachprod stop-able, the test doesn't actually fail. Not that that's anything relevant to this PR, but maybe you'd find interesting.

@tbg tbg force-pushed the tpccbench-harden branch from 45cb9ac to fe2828f Compare March 25, 2021 14:41
@tbg
Copy link
Member Author

tbg commented Mar 25, 2021

The version I just pushed is starting to look good. In the snippet below (from test.log) shows a node crashing during 2500 warehouses, then we line search to 2488 warehouses, but can't reset the cluster because of VM brown-out; we correctly retry and it works out and now the 2488 warehouse run is running.

I'm running ten instances of this test, so I'll see if anything else shakes out.

14:20:28 tpcc.go:903: test status: running benchmark, warehouses=2500
14:20:28 cluster.go:2318: [w6] > ./cockroach workload run tpcc --warehouses=2500 --active-warehouses=2500 --tolerate-errors --ramp=5m0s --duration=10m0s --histograms=perf/warehouses=2500/stats.json {pgurl:1-3}
3: 7324
2: 7237
1: 7595
2: kill exited nonzero
2: dead
14:31:24 tpcc.go:974: --- SEARCH ITER FAIL: TPCC 2500 resulted in 0.0 tpmC and failed due to monitor failure: unexpected node event: 2: dead
14:31:24 tpcc.go:805: initializing cluster for 2488 warehouses (search attempt: 2)
14:31:24 cluster.go:2207: test status: stopping cluster
14:31:24 cluster.go:387: > /home/tobias/go/src/github.com/cockroachdb/cockroach/roachprod.tpccbench-harden stop tobias-1616680721-05-n4cpu16:1-3
tobias-1616680721-05-n4cpu16: stopping and waiting....................................................................................................................................................................
2: exit status 255: 
I210325 14:34:09.311235 1 (gostd) cluster_synced.go:1732  [-] 1  command failed
14:34:09 tpcc.go:824: unable to stop cluster; retrying to allow vm to recover: /home/tobias/go/src/github.com/cockroachdb/cockroach/roachprod.tpccbench-harden stop tobias-1616680721-05-n4cpu16:1-3 returned: exit status 1
14:34:39 cluster.go:2207: test status: stopping cluster
14:34:39 cluster.go:387: > /home/tobias/go/src/github.com/cockroachdb/cockroach/roachprod.tpccbench-harden stop tobias-1616680721-05-n4cpu16:1-3
tobias-1616680721-05-n4cpu16: stopping and waiting.
14:34:41 cluster.go:2157: test status: starting cluster
14:34:41 cluster.go:387: > /home/tobias/go/src/github.com/cockroachdb/cockroach/roachprod.tpccbench-harden start --encrypt=false tobias-1616680721-05-n4cpu16:1-3
tobias-1616680721-05-n4cpu16: starting nodes
tobias-1616680721-05-n4cpu16: initializing cluster
tobias-1616680721-05-n4cpu16: setting cluster settings
14:35:11 tpcc.go:903: test status: running benchmark, warehouses=2488
14:35:11 cluster.go:2318: [w6] > ./cockroach workload run tpcc --warehouses=2500 --active-warehouses=2488 --tolerate-errors --ramp=5m0s --duration=10m0s --histograms=perf/warehouses=2488/stats.json {pgurl:1-3}
2: 8998
3: 8411
1: 8950

@tbg
Copy link
Member Author

tbg commented Mar 25, 2021

Looks like some of these VMs are browning out pretty long (note the time since "stopping cluster" on some of these):

image

@tbg
Copy link
Member Author

tbg commented Mar 25, 2021

Ugh, all test runs are stopping their clusters. Some of them for 20 minutes :S I wonder if these VMs are ever coming back.

As sort of a trial run, I reset one VM I couldn't ssh to for a while via

CLOUDSDK_CORE_PROJECT=cockroach-ephemeral gcloud compute instances reset --zone us-east1-b tobias-1616680721-04-n4cpu16-0002

and I assumed this was going to backfire due to it losing the local ssd, but it didn't! So I think the test could in theory continue. Unfortunately, it's still stuck on

stopping cluster (set 36m57s ago)

I might need to add a cancellation here, so that retries are eager. But in either case, it's not really a thing to have a roachtest invoke cloud-specific commands to reset a vm, we'd have to properly add that and ugh, doesn't sound like a good idea honestly. But we need to somehow get this test to be resilient as it's not like we can run it on some other infrastructure anytime soon.

@tbg
Copy link
Member Author

tbg commented Mar 25, 2021

Starting a new attempt. I added roachprod reset and this will be invoked before trying to stop the cluster in each invocation (so that it's in the hot path and I can be sure it works all the time).

@tbg
Copy link
Member Author

tbg commented Mar 25, 2021

Hey look at this, this actually does seem to work. Tests reliably seem to make it from the 2500 into the 2488 run:

image

@tbg
Copy link
Member Author

tbg commented Mar 25, 2021

image

image

@tbg
Copy link
Member Author

tbg commented Mar 26, 2021

Some questionable results but at least results! One run failed because my computer went to sleep and this tripped up a gcloud invocation.

tobias@td:~/go/src/github.com/cockroachdb/cockroach$ rg 'MAX WAREHOUSES' -- artifacts/tpccbench/
artifacts/tpccbench/nodes=3/cpu=16/run_3/test.log
469:MAX WAREHOUSES = 2170

artifacts/tpccbench/nodes=3/cpu=16/run_1/test.log
447:MAX WAREHOUSES = 1834

artifacts/tpccbench/nodes=3/cpu=16/run_8/test.log
438:MAX WAREHOUSES = 2086

artifacts/tpccbench/nodes=3/cpu=16/run_4/test.log
633:MAX WAREHOUSES = 961

artifacts/tpccbench/nodes=3/cpu=16/run_2/test.log
389:MAX WAREHOUSES = 2230

artifacts/tpccbench/nodes=3/cpu=16/run_9/test.log
442:MAX WAREHOUSES = 2086

artifacts/tpccbench/nodes=3/cpu=16/run_5/test.log
390:MAX WAREHOUSES = 2170

artifacts/tpccbench/nodes=3/cpu=16/run_6/test.log
647:MAX WAREHOUSES = 961

artifacts/tpccbench/nodes=3/cpu=16/run_10/test.log
463:MAX WAREHOUSES = 2122

irfansharif added a commit to irfansharif/cockroach that referenced this pull request Apr 1, 2021
We arrived at the previous default rate of 10% back in cockroachdb#59379. This was
back when we were creating real tracing spans for all statements, and
for sampled statements, we were propagating additional stats payloads.
Consequently what cockroachdb#59379 ended up measuring (and finding the overhead
acceptable) for was the performance hit we would incur for propagating
stats payloads for statements already using real tracing spans.

Since then, the landscape has changed. Notably we introduced cockroachdb#61777,
which made it so that we were only using real tracing spans for sampled
statements. This was done after performance analysis in cockroachdb#59424 showed
that the use of real tracing spans in all statements resulted in
tremendous overhead, for no real benefit.

What this now leaves us with is a sampling rate that was tuned by only
considering the stats payload overhead. What we want now is to also
consider the overhead of using real tracing spans for sampled
statements, vs. not. Doing this analysis gives us a very different
picture for what the default rate should be.

---

To find out what the overhead for sampled statements are currently, we
experimented with kv95/enc=false/nodes=1/cpu=32. It's a simple
benchmark that does little more than one-off statements, so should give
us a concise picture of the sampling overhead. We ran six experiments
in total (each corresponding to a pair of read+write rows), done in
groups of three (each group corresponding to a table below). Each
run in turn is comprised of 10 iterations of kv95, and what's varied
between each run is the default sampling rate. We pin a sampling rate of
0.0 as the baseline that effectively switches off sampling entirely (and
tracing), and measure the throughput degradation as we vary the sampling
rate.

                          ops/sec            ops/sec
    --------------------|------------------|------------------
    rate   op      grp  | median    diff   | mean      diff
    --------------------|------------------|------------------
    0.00 / read  / #1   | 69817.90         | 69406.37
    0.01 / read  / #1   | 69300.35  -0.74% | 68717.23  -0.99%
    0.10 / read  / #1   | 67743.35  -2.97% | 67601.81  -2.60%
    0.00 / write / #1   |  3672.55         |  3653.63
    0.01 / write / #1   |  3647.65  -0.68% |  3615.90  -1.03%
    0.10 / write / #1   |  3567.20  -2.87% |  3558.90  -2.59%

                          ops/sec            ops/sec
    --------------------|------------------|------------------
    rate   op      grp  | median    diff   | mean      diff
    --------------------|------------------|------------------
    0.00 / read  / #2   | 69440.80          68893.24
    0.01 / read  / #2   | 69481.55  +0.06%  69463.13  +0.82% (probably in the noise margin)
    0.10 / read  / #2   | 67841.80  -2.30%  66992.55  -2.76%
    0.00 / write / #2   |  3652.45           3625.24
    0.01 / write / #2   |  3657.55  -0.14%   3654.34  +0.80%
    0.10 / write / #2   |  3570.75  -2.24%   3526.04  -2.74%

The results above suggest that the current default rate of 10% is too
high, and a 1% rate is much more acceptable.

---

The fact that the cost of sampling is largely dominated by tracing is
extremely unfortunate. We have ideas for how that can be improved
(prototyped in cockroachdb#62227), but they're much too invasive to backport to
21.1. It's unfortunate that we only discovered the overhead this late in
the development cycle. It was due to two major reasons:
- cockroachdb#59992 landed late in the cycle, and enabled tracing for realsies (by
  propagating real tracing spans across rpc boundaries). We had done
  sanity checking for the tracing overhead before this point, but failed
  to realize that cockroachdb#59992 would merit re-analysis.
- The test that alerted us to the degradation (tpccbench) had be
  persistently failing for a myriad of other reasons, so we didn't learn
  until too late that tracing was the latest offendor. tpccbench also
  doesn't deal with VM overload well (something cockroachdb#62361 hopes to
  address), and after tracing was enabled for realsies, this was the
  dominant failure mode. This resulted in perf data not making it's way
  to roachperf, which further hid possible indicators we had a major
  regression on our hands. We also didn't have a healthy process looking
  at roachperf on a continual basis, something we're looking to rectify
  going forward. We would've picked up on this regression had we been
  closely monitoring the kv95 charts.

Release note: None
@irfansharif
Copy link
Contributor

Flakes were due to #62604. This PR worked beautifully for some test runs I did as part of #62998; will merge on green.

@irfansharif
Copy link
Contributor

bors r+

See cockroachdb#62039.

`tpccbench`, by design, pushes CRDB into overload territory. However,
tpccbench was not handling the resulting conditions at all. Any node
death or stalling VM would fail the whole test run, rather than retry
again with a lower warehouse count.

This commit makes most errors "recoverable" in the sense that they
will simply treat the run as failing, but continue the line search.
Exceptions are communicated via `t.Fatal`, which will abort the
whole run instead.

We also make the `c.Stop()` step at the beginning of each search
step resilient to VMs browning out under memory pressure (from
the previous run), by patiently retrying for a few minutes.

The hope is that this will allow the search to run to completion, even
in the face of overload-imposed temporary VM outages. It is not expected
to do this perfectly - after all, VMs don't generally return from
brown-out within any fixed time period - but at least anecdotally most
VMs seem to come back a few minutes in.

Release note: None
@craig
Copy link
Contributor

craig bot commented Apr 1, 2021

Canceled.

@irfansharif
Copy link
Contributor

bors r+

@craig
Copy link
Contributor

craig bot commented Apr 2, 2021

Build succeeded:

@craig craig bot merged commit 0966b27 into cockroachdb:master Apr 2, 2021
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Apr 2, 2021
We arrived at the previous default rate of 10% back in cockroachdb#59379. This was
back when we were creating real tracing spans for all statements, and
for sampled statements, we were propagating additional stats payloads.
Consequently what cockroachdb#59379 ended up measuring (and finding the overhead
acceptable) for was the performance hit we would incur for propagating
stats payloads for statements already using real tracing spans.

Since then, the landscape has changed. Notably we introduced cockroachdb#61777,
which made it so that we were only using real tracing spans for sampled
statements. This was done after performance analysis in cockroachdb#59424 showed
that the use of real tracing spans in all statements resulted in
tremendous overhead, for no real benefit.

What this now leaves us with is a sampling rate that was tuned by only
considering the stats payload overhead. What we want now is to also
consider the overhead of using real tracing spans for sampled
statements, vs. not. Doing this analysis gives us a very different
picture for what the default rate should be.

---

To find out what the overhead for sampled statements are currently, we
experimented with kv95/enc=false/nodes=1/cpu=32. It's a simple
benchmark that does little more than one-off statements, so should give
us a concise picture of the sampling overhead. We ran six experiments
in total (each corresponding to a pair of read+write rows), done in
groups of three (each group corresponding to a table below). Each
run in turn is comprised of 10 iterations of kv95, and what's varied
between each run is the default sampling rate. We pin a sampling rate of
0.0 as the baseline that effectively switches off sampling entirely (and
tracing), and measure the throughput degradation as we vary the sampling
rate.

                          ops/sec            ops/sec
    --------------------|------------------|------------------
    rate   op      grp  | median    diff   | mean      diff
    --------------------|------------------|------------------
    0.00 / read  / #1   | 69817.90         | 69406.37
    0.01 / read  / #1   | 69300.35  -0.74% | 68717.23  -0.99%
    0.10 / read  / #1   | 67743.35  -2.97% | 67601.81  -2.60%
    0.00 / write / #1   |  3672.55         |  3653.63
    0.01 / write / #1   |  3647.65  -0.68% |  3615.90  -1.03%
    0.10 / write / #1   |  3567.20  -2.87% |  3558.90  -2.59%

                          ops/sec            ops/sec
    --------------------|------------------|------------------
    rate   op      grp  | median    diff   | mean      diff
    --------------------|------------------|------------------
    0.00 / read  / #2   | 69440.80          68893.24
    0.01 / read  / #2   | 69481.55  +0.06%  69463.13  +0.82% (probably in the noise margin)
    0.10 / read  / #2   | 67841.80  -2.30%  66992.55  -2.76%
    0.00 / write / #2   |  3652.45           3625.24
    0.01 / write / #2   |  3657.55  -0.14%   3654.34  +0.80%
    0.10 / write / #2   |  3570.75  -2.24%   3526.04  -2.74%

The results above suggest that the current default rate of 10% is too
high, and a 1% rate is much more acceptable.

---

The fact that the cost of sampling is largely dominated by tracing is
extremely unfortunate. We have ideas for how that can be improved
(prototyped in cockroachdb#62227), but they're much too invasive to backport to
21.1. It's unfortunate that we only discovered the overhead this late in
the development cycle. It was due to two major reasons:
- cockroachdb#59992 landed late in the cycle, and enabled tracing for realsies (by
  propagating real tracing spans across rpc boundaries). We had done
  sanity checking for the tracing overhead before this point, but failed
  to realize that cockroachdb#59992 would merit re-analysis.
- The test that alerted us to the degradation (tpccbench) had be
  persistently failing for a myriad of other reasons, so we didn't learn
  until too late that tracing was the latest offendor. tpccbench also
  doesn't deal with VM overload well (something cockroachdb#62361 hopes to
  address), and after tracing was enabled for realsies, this was the
  dominant failure mode. This resulted in perf data not making it's way
  to roachperf, which further hid possible indicators we had a major
  regression on our hands. We also didn't have a healthy process looking
  at roachperf on a continual basis, something we're looking to rectify
  going forward. We would've picked up on this regression had we been
  closely monitoring the kv95 charts.

Release note: None
@tbg tbg deleted the tpccbench-harden branch April 2, 2021 05:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants