roachtest: add replica-circuit-breakers #76147

Closed
wants to merge 2 commits

Conversation

tbg
Member

@tbg tbg commented Feb 7, 2022

This adds a roachtest that runs TPCC (with a low warehouse count and
`--tolerate-errors`), loses quorum half-way through, and checks the
prometheus metrics of the workload for fail-fast behavior.

Touches #33007.

Epic: none

Release note: None
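
For readers unfamiliar with roachtests, the overall shape of such a test is roughly the sketch below. This is illustrative only and assumes the usual roachtest packages (`registry`, `cluster`, `option`, the monitor helper); the node counts, function name, and workload flags are assumptions, not the code from this PR.

```go
import (
	"context"
	"time"

	"github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster"
	"github.com/cockroachdb/cockroach/pkg/cmd/roachtest/option"
	"github.com/cockroachdb/cockroach/pkg/cmd/roachtest/registry"
	"github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test"
)

// Illustrative sketch only; not the code from this PR.
func registerReplicaCircuitBreaker(r registry.Registry) {
	r.Add(registry.TestSpec{
		Name:    "replica-circuit-breaker",
		Owner:   registry.OwnerKV,
		Cluster: r.MakeClusterSpec(6), // assumption: 5 CRDB nodes + 1 workload node
		Run: func(ctx context.Context, t test.Test, c cluster.Cluster) {
			// Assume nodes 1-5 run CockroachDB and node 6 runs the workload.
			m := c.NewMonitor(ctx, c.Range(1, 5))
			m.Go(func(ctx context.Context) error {
				// --tolerate-errors keeps the workload running across the quorum
				// loss, so its prometheus metrics keep reporting successes and
				// errors throughout the test.
				return c.RunE(ctx, c.Node(6),
					"./cockroach workload run tpcc --warehouses=50 --tolerate-errors "+
						"--duration=10m {pgurl:1-5}")
			})

			// Half-way through, stop a majority of nodes so some ranges lose quorum.
			time.Sleep(5 * time.Minute)
			m.ExpectDeaths(3)
			c.Stop(ctx, t.L(), option.DefaultStopOpts(), c.Range(3, 5))

			// The test then inspects the workload's prometheus metrics to check
			// that requests fail fast (circuit breakers trip) rather than hang.
			m.Wait()
		},
	})
}
```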

@cockroach-teamcity
Member

This change is Reviewable

@tbg tbg force-pushed the circ-breaker-roachtest branch 5 times, most recently from 179b88a to 4175eb6 on February 8, 2022 14:00
tbg added a commit to tbg/cockroach that referenced this pull request Feb 9, 2022
Make it *really easy* to look at `prometheus-snapshot.tgz` by putting a
shell script next to it that spins it up via docker, so you can just
run:

```
$ ./prometheus-docker-run.sh
[...]
ts=2022-02-09T09:07:06.254Z caller=main.go:896 level=info msg="Server is ready to receive web requests."
```

I got motivated to do this while working on the roachtest in cockroachdb#76147 but
I imagine it'll come in handy in general as well.

Release note: None
craig bot pushed a commit that referenced this pull request Feb 9, 2022
76284: prometheus: simplify snapshot visualization r=otan a=tbg

Make it *really easy* to look at `prometheus-snapshot.tgz` by putting a
shell script next to it that spins it up via docker, so you can just
run:

```
$ ./prometheus-docker-run.sh
[...]
ts=2022-02-09T09:07:06.254Z caller=main.go:896 level=info msg="Server is ready to receive web requests."
```

I got motivated to do this while working on the roachtest in #76147 but
I imagine it'll come in handy in general as well.

Release note: None


Co-authored-by: Tobias Grieger <[email protected]>
@tbg tbg force-pushed the circ-breaker-roachtest branch 2 times, most recently from 567a3b2 to 131462f on February 9, 2022 15:53
@tbg tbg marked this pull request as ready for review February 9, 2022 15:54
@tbg tbg requested a review from a team February 9, 2022 15:54
Contributor

@erikgrinaker erikgrinaker left a comment

Nice!

@tbg tbg force-pushed the circ-breaker-roachtest branch from 131462f to bd93c1e on March 3, 2022 10:03
@blathers-crl blathers-crl bot requested a review from otan March 3, 2022 10:03
RajivTS pushed a commit to RajivTS/cockroach that referenced this pull request Mar 6, 2022
@otan otan removed their request for review March 20, 2022 11:20
@tbg tbg force-pushed the circ-breaker-roachtest branch from bd93c1e to 69090d7 on April 12, 2022 11:59
@tbg tbg requested a review from a team as a code owner April 12, 2022 11:59
@tbg tbg requested review from otan and removed request for a team April 12, 2022 11:59
@tbg tbg force-pushed the circ-breaker-roachtest branch from 69090d7 to 9cefd3e on April 12, 2022 13:39
Member

@srosenberg srosenberg left a comment

Tour de force of roachtests!

Reviewed 1 of 1 files at r1, 3 of 3 files at r2, 5 of 5 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @otan and @tbg)


pkg/cmd/roachtest/tests/replica_circuit_breaker.go, line 200 at r3 (raw file):

			},
		)
		t.L().Printf("under healthy cluster: ~%.2f qps", min)

Shouldn't we also log max? It could help with debugging in case we're not able to hit anything higher than min for whatever reason.
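
If that's wanted, the log line could simply include both bounds (a trivial sketch; it assumes a `max` computed alongside `min`):

```go
t.L().Printf("under healthy cluster: ~%.2f qps (min), ~%.2f qps (max)", min, max)
```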


pkg/cmd/roachtest/tests/replica_circuit_breaker.go, line 206 at r3 (raw file):

		// Starting at one minute after quorum was lost, we should see at least a
		// steady trickle of errors, or a steady trickle of successes. (If
		// miraculously no range lost quorum it would be all successes, if they all

Should we fail if a miracle does occur? Conversely, how do we know that this test is not always miraculous? If we can guarantee a loss of quorum, then a simple assertion on the non-zero count of workload_tpcc_newOrder_error_total would suffice.
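
For illustration, such an assertion could query the workload's Prometheus endpoint directly. A sketch using the prometheus `client_golang` v1 API; the address handling, function name, and exact query are assumptions:

```go
import (
	"context"
	"fmt"
	"time"

	promapi "github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
	"github.com/prometheus/common/model"
)

// assertNewOrderErrors fails unless at least one TPCC newOrder error was
// recorded, i.e. unless the workload actually observed the quorum loss.
func assertNewOrderErrors(ctx context.Context, promAddr string) error {
	client, err := promapi.NewClient(promapi.Config{Address: promAddr})
	if err != nil {
		return err
	}
	val, _, err := promv1.NewAPI(client).Query(
		ctx, `sum(workload_tpcc_newOrder_error_total)`, time.Now())
	if err != nil {
		return err
	}
	vec, ok := val.(model.Vector)
	if !ok || len(vec) == 0 || vec[0].Value == 0 {
		return fmt.Errorf("expected non-zero newOrder errors after quorum loss, got %v", val)
	}
	return nil
}
```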


pkg/cmd/roachtest/tests/replica_circuit_breaker.go, line 221 at r3 (raw file):

}

func visitPromRangeQuery(

It's a fairly useful helper function which could be made general enough to be reused by other tests; i.e., assertions about the range could be parameterized.
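
A rough sketch of a parameterized version, assuming the prometheus `client_golang` v1 API the test already talks to (the signature below is made up for illustration):

```go
// visitPromRangeQuery runs an arbitrary Prometheus range query and hands each
// sample to the caller, so individual tests can supply their own assertions.
func visitPromRangeQuery(
	ctx context.Context,
	api promv1.API,
	query string,
	from, to time.Time,
	step time.Duration,
	visit func(ts time.Time, val model.SampleValue),
) (int, error) {
	res, _, err := api.QueryRange(ctx, query, promv1.Range{Start: from, End: to, Step: step})
	if err != nil {
		return 0, err
	}
	mtr, ok := res.(model.Matrix)
	if !ok || len(mtr) == 0 {
		return 0, nil
	}
	n := 0
	for _, sample := range mtr[0].Values {
		n++
		visit(sample.Timestamp.Time(), sample.Value)
	}
	return n, nil
}
```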


pkg/cmd/roachtest/tests/replica_circuit_breaker.go, line 256 at r3 (raw file):

	for _, val := range mtr.(model.Matrix)[0].Values {
		n++
		visit(timeutil.Unix(int64(val.Timestamp/1000), 0), val)

Why not val.Timestamp.Time()?
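
For reference, `model.Time` stores milliseconds since the epoch and already provides the conversion, so (assuming `val` is a `model.SamplePair`) the two forms below refer to the same instant, with the built-in one keeping sub-second precision:

```go
visit(timeutil.Unix(int64(val.Timestamp/1000), 0), val) // manual conversion, truncated to seconds
visit(val.Timestamp.Time(), val)                        // model.Time.Time(), millisecond precision
```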


pkg/roachprod/install/cluster_synced.go, line 2043 at r1 (raw file):

	if len(failed) > 0 {
		var err error
		for _, res := range failed {

Do we even need failed? CombineErrors could simply replace the corresponding append(failed, r).
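
A sketch of what that could look like with `errors.CombineErrors` from `github.com/cockroachdb/errors`; the surrounding loop and field names are paraphrased for illustration, not the actual file:

```go
// errors is github.com/cockroachdb/errors.
var combined error
for _, res := range results {
	if res.Err != nil {
		// Instead of append(failed, res), fold each failure into one error.
		combined = errors.CombineErrors(combined, res.Err)
	}
}
if combined != nil {
	return combined
}
```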

Member

@srosenberg srosenberg left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @otan and @tbg)


pkg/cmd/roachtest/tests/replica_circuit_breaker.go, line 136 at r3 (raw file):

				r := sqlutils.MakeSQLRunner(c.Conn(ctx, t.L(), 1))
				queryTS := ts.Add(-10 * time.Second).Format("2006-01-02 15:04:05")
				q1 := `SELECT count(*) FROM tpcc.order_line AS OF SYSTEM TIME '` + queryTS + `'`

Not sure I follow why q1 is not using the stronger form of AOST, i.e., with_min_timestamp?
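
For comparison, a sketch of the bounded-staleness form the comment refers to (`with_min_timestamp` takes a minimum TIMESTAMPTZ and an optional `nearest_only` argument; whether this particular query qualifies for bounded staleness is a separate question):

```go
queryTS := ts.Add(-10 * time.Second).Format("2006-01-02 15:04:05")
q1 := `SELECT count(*) FROM tpcc.order_line
       AS OF SYSTEM TIME with_min_timestamp('` + queryTS + `'::TIMESTAMPTZ, true)`
```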

@otan otan removed their request for review April 18, 2022 20:34
@tbg tbg force-pushed the circ-breaker-roachtest branch from 9cefd3e to b8a7898 on November 2, 2022 04:10
@blathers-crl blathers-crl bot requested a review from otan November 2, 2022 04:10
@tbg
Member Author

tbg commented Nov 2, 2022

Rebased this PR, maybe I broke something. No longer passes.

$ ./pkg/cmd/roachtest/roachstress.sh -c 3 replica-circuit-breaker tag:weekly
=== RUN   replica-circuit-breaker#2
=== RUN   replica-circuit-breaker#3
=== RUN   replica-circuit-breaker#1
Grafana dashboard 0: http://34.75.146.40:3000/
Grafana dashboard 0: http://34.148.31.31:3000/
Grafana dashboard 0: http://34.73.60.89:3000/
--- FAIL: replica-circuit-breaker#3 (451.36s)
test artifacts and logs in: artifacts/b8a78987c1-dirty/043215/replica-circuit-breaker/run_3
(test_impl.go:297).Fatalf: error executing 'SELECT count(*) FROM tpcc.order_line AS OF SYSTEM TIME '2022-11-02 04:37:05'': pq: inbox communication error: rpc error: code = Canceled desc = context canceled
(test_impl.go:291).Fatal: monitor failure: monitor task failed: output in run_043715.018671211_n6_cockroach_workload_run_tpcc: ./cockroach workload run tpcc --warehouses=50 --histograms=perf/stats.json --tolerate-errors --ramp=5m0s --duration=10m0s --prometheus-port=0 --pprofport=33333  {pgurl:1-5} returned: context canceled
(test_impl.go:291).Fatal: monitor failure: monitor task failed: t.Fatal() was called
--- FAIL: replica-circuit-breaker#2 (453.09s)
test artifacts and logs in: artifacts/b8a78987c1-dirty/043215/replica-circuit-breaker/run_2
(test_impl.go:297).Fatalf: error executing 'SELECT count(*) FROM tpcc.order_line AS OF SYSTEM TIME '2022-11-02 04:37:06'': pq: inbox communication error: rpc error: code = Canceled desc = context canceled
(test_impl.go:291).Fatal: monitor failure: monitor task failed: output in run_043716.222616076_n6_cockroach_workload_run_tpcc: ./cockroach workload run tpcc --warehouses=50 --histograms=perf/stats.json --tolerate-errors --ramp=5m0s --duration=10m0s --prometheus-port=0 --pprofport=33333  {pgurl:1-5} returned: context canceled
(test_impl.go:291).Fatal: monitor failure: monitor task failed: t.Fatal() was called

@tbg
Member Author

tbg commented Nov 2, 2022

@yuzefovich do you know which context got canceled here (and ideally which timeout it had on it)?

pq: inbox communication error: rpc error: code = Canceled desc = context canceled

@yuzefovich
Member

Unfortunately, I don't think we can know for sure. Most likely this error indicates that the gRPC stream connecting two parts of the distributed execution plan was broken for whatever reason, and on the "inbox" (consumer) side of the stream it is seen as "context canceled".

@tbg tbg force-pushed the circ-breaker-roachtest branch from b8a7898 to f98c932 on November 3, 2022 13:36
@otan otan removed their request for review November 29, 2022 08:46
tbg added 2 commits December 20, 2022 11:07
It can be helpful to spin up a `Monitor` that monitors only a set of
goroutines, without also watching the nodes in the underlying cluster,
which is the default behavior and which previously could not be
"narrowed down" to an empty set of nodes.

This commit permits that.

Release note: None
This adds a roachtest that runs TPCC (with a low warehouse count and
`--tolerate-errors`), loses quorum half-way through, and checks the
prometheus metrics of the workload for fail-fast behavior. It also
checks that follower reads and bounded staleness reads work as
expected at timestamps at which the cluster was known to have been
healthy.

Touches cockroachdb#33007.

Release note: None
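
Regarding the first of the two commits above (letting a `Monitor` watch no cluster nodes), usage would look roughly like the sketch below; the empty `option.NodeListOption{}` is the new part, and the exact option name plus the helper function are assumptions.

```go
// Monitor only our own goroutines; no cluster nodes are watched for deaths.
m := c.NewMonitor(ctx, option.NodeListOption{})
m.Go(func(ctx context.Context) error {
	return runSomeBackgroundTask(ctx) // hypothetical helper
})
m.Wait()
```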
@tbg tbg force-pushed the circ-breaker-roachtest branch from f98c932 to fa6cd08 on December 20, 2022 10:16
@blathers-crl blathers-crl bot requested a review from otan December 20, 2022 10:16
tbg added a commit to tbg/cockroach that referenced this pull request Dec 20, 2022
This ends up being useful every now and then. I was just brushing it up
for cockroachdb#76147 and figured it
was about time to just land this.

This is a revival of
cockroachdb#65844, which I cannot
reopen at this point.

Epic: none
Release note: None
@tbg
Member Author

tbg commented Dec 20, 2022

I thought I could fix this by setting the circuit breaker timeout to something like 4s, but I hit the same failure mode, sadly. Since I don't have time to dig in, I'll try with some of the checks removed.

craig bot pushed a commit that referenced this pull request Dec 20, 2022
93933: sql: add `indnullsnotdistinct` column to `pg_catalog.pg_index` r=rafiss a=e-mbrown

Fixes: #92583

This commit adds the `indnullsnotdistinct` column to `pg_catalog.pg_index`. Since we do not support `NULLS [ NOT ] DISTINCT` on `CREATE INDEX`, the column is always false.

Release note: None

93938: storage: fix `CheckSSTConflicts` intent conflict above MVCC range tombstone r=erikgrinaker a=erikgrinaker

`CheckSSTConflicts` could fail to handle a point key collision with an intent written above an MVCC range tombstone, because the intent was considered to be deleted by the MVCC range tombstone.

This patch makes `MVCCRangeKeyStack.Covers()` consider a timestamp of 0 (i.e. an intent) as being above all range keys, because intents must be above all range keys by definition. Otherwise, this condition in `CheckSSTConflicts` would consider the intent a point key deleted by an MVCC range tombstone, and therefore ignore it:

```go
extValueDeletedByRange := extHasRange && extHasPoint && extRangeKeys.Covers(extKey)
```

`AddSSTable` operations are only used on offline tables, so this won't affect transaction isolation for ongoing transactions. However, it is possible (but unlikely) for a committed but unresolved write intent left behind by a past transaction to not be considered for conflict handling, and for `AddSSTable` to write below it.

Resolves #93934.

Release note (bug fix): When experimental MVCC range tombstones are enabled (they're disabled by default), a bulk ingestion (e.g. an import) could fail to take a committed-but-unresolved write intent into account during conflict checks when written above an MVCC range tombstone. It was therefore possible in very rare circumstances for the ingestion to write a value below the timestamp of the committed intent, causing the ingested value to disappear.

93942: sql: add flag to allow resolving index on offline tables r=chengxiong-ruan a=chengxiong-ruan

Previously, offline tables were always skipped when resolving an index. This PR adds a flag to allow offline tables to be considered. The main use case is `SHOW RANGES`, where index ranges of tables being imported (and therefore offline) should be shown.

Epic: None
Release note: None

93966: roachtest: add helper to set up jaeger and store data as artifacts r=tbg a=tbg

This ends up being useful every now and then. I was just brushing it up
for #76147 and figured it
was about time to just land this.

This is a revival of
#65844, which I cannot
reopen at this point.

Epic: none
Release note: None


Co-authored-by: e-mbrown <[email protected]>
Co-authored-by: Erik Grinaker <[email protected]>
Co-authored-by: Chengxiong Ruan <[email protected]>
Co-authored-by: Tobias Grieger <[email protected]>
@otan otan removed their request for review February 20, 2023 23:55
@tbg tbg closed this Feb 23, 2023