roachtest: add replica-circuit-breakers #76147

Closed
wants to merge 2 commits

Conversation

tbg
Member

@tbg tbg commented Feb 7, 2022

This adds a roachtest that runs TPCC (with a low warehouse count and
`--tolerate-errors`), loses quorum half-way through, and checks the
prometheus metrics of the workload for fail-fast behavior.

Touches #33007.

Epic: none

Release note: None
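
For readers unfamiliar with roachtests, the overall shape of such a test is roughly the sketch below. This is illustrative only and assumes the usual roachtest packages (`registry`, `cluster`, `option`, the monitor helper); the node counts, function name, and workload flags are assumptions, not the code from this PR.

```go
import (
	"context"
	"time"

	"github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster"
	"github.com/cockroachdb/cockroach/pkg/cmd/roachtest/option"
	"github.com/cockroachdb/cockroach/pkg/cmd/roachtest/registry"
	"github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test"
)

// Illustrative sketch only; not the code from this PR.
func registerReplicaCircuitBreaker(r registry.Registry) {
	r.Add(registry.TestSpec{
		Name:    "replica-circuit-breaker",
		Owner:   registry.OwnerKV,
		Cluster: r.MakeClusterSpec(6), // assumption: 5 CRDB nodes + 1 workload node
		Run: func(ctx context.Context, t test.Test, c cluster.Cluster) {
			// Assume nodes 1-5 run CockroachDB and node 6 runs the workload.
			m := c.NewMonitor(ctx, c.Range(1, 5))
			m.Go(func(ctx context.Context) error {
				// --tolerate-errors keeps the workload running across the quorum
				// loss, so its prometheus metrics keep reporting successes and
				// errors throughout the test.
				return c.RunE(ctx, c.Node(6),
					"./cockroach workload run tpcc --warehouses=50 --tolerate-errors "+
						"--duration=10m {pgurl:1-5}")
			})

			// Half-way through, stop a majority of nodes so some ranges lose quorum.
			time.Sleep(5 * time.Minute)
			m.ExpectDeaths(3)
			c.Stop(ctx, t.L(), option.DefaultStopOpts(), c.Range(3, 5))

			// The test then inspects the workload's prometheus metrics to check
			// that requests fail fast (circuit breakers trip) rather than hang.
			m.Wait()
		},
	})
}
```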

@cockroach-teamcity
Member

This change is Reviewable

@tbg tbg force-pushed the circ-breaker-roachtest branch 5 times, most recently from 179b88a to 4175eb6 on February 8, 2022 14:00
tbg added a commit to tbg/cockroach that referenced this pull request Feb 9, 2022
Make it *really easy* to look at `prometheus-snapshot.tgz` by putting a
shell script next to it that spins it up via docker, so you can just
run:

```
$ ./prometheus-docker-run.sh
[...]
ts=2022-02-09T09:07:06.254Z caller=main.go:896 level=info msg="Server is ready to receive web requests."
```

I got motivated to do this while working on the roachtest in cockroachdb#76147 but
I imagine it'll come in handy in general as well.

Release note: None
craig bot pushed a commit that referenced this pull request Feb 9, 2022
76284: prometheus: simplify snapshot visualization r=otan a=tbg

Make it *really easy* to look at `prometheus-snapshot.tgz` by putting a
shell script next to it that spins it up via docker, so you can just
run:

```
$ ./prometheus-docker-run.sh
[...]
ts=2022-02-09T09:07:06.254Z caller=main.go:896 level=info msg="Server is ready to receive web requests."
```

I got motivated to do this while working on the roachtest in #76147 but
I imagine it'll come in handy in general as well.

Release note: None


Co-authored-by: Tobias Grieger <[email protected]>
@tbg tbg force-pushed the circ-breaker-roachtest branch 2 times, most recently from 567a3b2 to 131462f on February 9, 2022 15:53
@tbg tbg marked this pull request as ready for review February 9, 2022 15:54
@tbg tbg requested a review from a team February 9, 2022 15:54
Contributor

@erikgrinaker erikgrinaker left a comment

Nice!

@tbg tbg force-pushed the circ-breaker-roachtest branch from 131462f to bd93c1e on March 3, 2022 10:03
@blathers-crl blathers-crl bot requested a review from otan March 3, 2022 10:03
RajivTS pushed a commit to RajivTS/cockroach that referenced this pull request Mar 6, 2022
@otan otan removed their request for review March 20, 2022 11:20
@tbg tbg force-pushed the circ-breaker-roachtest branch from bd93c1e to 69090d7 on April 12, 2022 11:59
@tbg tbg requested a review from a team as a code owner April 12, 2022 11:59
@tbg tbg requested review from otan and removed request for a team April 12, 2022 11:59
@tbg tbg force-pushed the circ-breaker-roachtest branch from 69090d7 to 9cefd3e on April 12, 2022 13:39
Member

@srosenberg srosenberg left a comment

Tour de force of roachtests!

Reviewed 1 of 1 files at r1, 3 of 3 files at r2, 5 of 5 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @otan and @tbg)


pkg/cmd/roachtest/tests/replica_circuit_breaker.go, line 200 at r3 (raw file):

			},
		)
		t.L().Printf("under healthy cluster: ~%.2f qps", min)

Shouldn't we also log max? It could help with debugging in case we're not able to hit anything higher than min for whatever reason.
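
If that's wanted, the log line could simply include both bounds (a trivial sketch; it assumes a `max` computed alongside `min`):

```go
t.L().Printf("under healthy cluster: ~%.2f qps (min), ~%.2f qps (max)", min, max)
```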


pkg/cmd/roachtest/tests/replica_circuit_breaker.go, line 206 at r3 (raw file):

		// Starting at one minute after quorum was lost, we should see at least a
		// steady trickle of errors, or a steady trickle of successes. (If
		// miraculously no range lost quorum it would be all successes, if they all

Should we fail if a miracle does occur? Conversely, how do we know that this test is not always miraculous? If we can guarantee a loss of quorum, then a simple assertion on the non-zero count of workload_tpcc_newOrder_error_total would suffice.
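
For illustration, such an assertion could query the workload's Prometheus endpoint directly. A sketch using the prometheus `client_golang` v1 API; the address handling, function name, and exact query are assumptions:

```go
import (
	"context"
	"fmt"
	"time"

	promapi "github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
	"github.com/prometheus/common/model"
)

// assertNewOrderErrors fails unless at least one TPCC newOrder error was
// recorded, i.e. unless the workload actually observed the quorum loss.
func assertNewOrderErrors(ctx context.Context, promAddr string) error {
	client, err := promapi.NewClient(promapi.Config{Address: promAddr})
	if err != nil {
		return err
	}
	val, _, err := promv1.NewAPI(client).Query(
		ctx, `sum(workload_tpcc_newOrder_error_total)`, time.Now())
	if err != nil {
		return err
	}
	vec, ok := val.(model.Vector)
	if !ok || len(vec) == 0 || vec[0].Value == 0 {
		return fmt.Errorf("expected non-zero newOrder errors after quorum loss, got %v", val)
	}
	return nil
}
```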


pkg/cmd/roachtest/tests/replica_circuit_breaker.go, line 221 at r3 (raw file):

}

func visitPromRangeQuery(

It's a fairly useful helper function which could be made general enough to be reused by other tests; i.e., assertions about the range could be parameterized.
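
A rough sketch of a parameterized version, assuming the prometheus `client_golang` v1 API the test already talks to (the signature below is made up for illustration):

```go
// visitPromRangeQuery runs an arbitrary Prometheus range query and hands each
// sample to the caller, so individual tests can supply their own assertions.
func visitPromRangeQuery(
	ctx context.Context,
	api promv1.API,
	query string,
	from, to time.Time,
	step time.Duration,
	visit func(ts time.Time, val model.SampleValue),
) (int, error) {
	res, _, err := api.QueryRange(ctx, query, promv1.Range{Start: from, End: to, Step: step})
	if err != nil {
		return 0, err
	}
	mtr, ok := res.(model.Matrix)
	if !ok || len(mtr) == 0 {
		return 0, nil
	}
	n := 0
	for _, sample := range mtr[0].Values {
		n++
		visit(sample.Timestamp.Time(), sample.Value)
	}
	return n, nil
}
```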


pkg/cmd/roachtest/tests/replica_circuit_breaker.go, line 256 at r3 (raw file):

	for _, val := range mtr.(model.Matrix)[0].Values {
		n++
		visit(timeutil.Unix(int64(val.Timestamp/1000), 0), val)

Why not val.Timestamp.Time()?
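
For reference, `model.Time` stores milliseconds since the epoch and already provides the conversion, so (assuming `val` is a `model.SamplePair`) the two forms below refer to the same instant, with the built-in one keeping sub-second precision:

```go
visit(timeutil.Unix(int64(val.Timestamp/1000), 0), val) // manual conversion, truncated to seconds
visit(val.Timestamp.Time(), val)                        // model.Time.Time(), millisecond precision
```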


pkg/roachprod/install/cluster_synced.go, line 2043 at r1 (raw file):

	if len(failed) > 0 {
		var err error
		for _, res := range failed {

Do we even need failed? CombineErrors could simply replace the corresponding append(failed, r).
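
A sketch of what that could look like with `errors.CombineErrors` from `github.com/cockroachdb/errors`; the surrounding loop and field names are paraphrased for illustration, not the actual file:

```go
// errors is github.com/cockroachdb/errors.
var combined error
for _, res := range results {
	if res.Err != nil {
		// Instead of append(failed, res), fold each failure into one error.
		combined = errors.CombineErrors(combined, res.Err)
	}
}
if combined != nil {
	return combined
}
```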

Member

@srosenberg srosenberg left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @otan and @tbg)


pkg/cmd/roachtest/tests/replica_circuit_breaker.go, line 136 at r3 (raw file):

				r := sqlutils.MakeSQLRunner(c.Conn(ctx, t.L(), 1))
				queryTS := ts.Add(-10 * time.Second).Format("2006-01-02 15:04:05")
				q1 := `SELECT count(*) FROM tpcc.order_line AS OF SYSTEM TIME '` + queryTS + `'`

Not sure I follow why q1 is not using the stronger form of AOST, i.e., with_min_timestamp?
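
For comparison, a sketch of the bounded-staleness form the comment refers to (`with_min_timestamp` takes a minimum TIMESTAMPTZ and an optional `nearest_only` argument; whether this particular query qualifies for bounded staleness is a separate question):

```go
queryTS := ts.Add(-10 * time.Second).Format("2006-01-02 15:04:05")
q1 := `SELECT count(*) FROM tpcc.order_line
       AS OF SYSTEM TIME with_min_timestamp('` + queryTS + `'::TIMESTAMPTZ, true)`
```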

@otan otan removed their request for review April 18, 2022 20:34
@tbg tbg force-pushed the circ-breaker-roachtest branch from 9cefd3e to b8a7898 on November 2, 2022 04:10
@blathers-crl blathers-crl bot requested a review from otan November 2, 2022 04:10
@tbg
Member Author

tbg commented Nov 2, 2022

Rebased this PR, maybe I broke something. No longer passes.

$ ./pkg/cmd/roachtest/roachstress.sh -c 3 replica-circuit-breaker tag:weekly
=== RUN   replica-circuit-breaker#2
=== RUN   replica-circuit-breaker#3
=== RUN   replica-circuit-breaker#1
Grafana dashboard 0: http://34.75.146.40:3000/
Grafana dashboard 0: http://34.148.31.31:3000/
Grafana dashboard 0: http://34.73.60.89:3000/
--- FAIL: replica-circuit-breaker#3 (451.36s)
test artifacts and logs in: artifacts/b8a78987c1-dirty/043215/replica-circuit-breaker/run_3
(test_impl.go:297).Fatalf: error executing 'SELECT count(*) FROM tpcc.order_line AS OF SYSTEM TIME '2022-11-02 04:37:05'': pq: inbox communication error: rpc error: code = Canceled desc = context canceled
(test_impl.go:291).Fatal: monitor failure: monitor task failed: output in run_043715.018671211_n6_cockroach_workload_run_tpcc: ./cockroach workload run tpcc --warehouses=50 --histograms=perf/stats.json --tolerate-errors --ramp=5m0s --duration=10m0s --prometheus-port=0 --pprofport=33333  {pgurl:1-5} returned: context canceled
(test_impl.go:291).Fatal: monitor failure: monitor task failed: t.Fatal() was called
--- FAIL: replica-circuit-breaker#2 (453.09s)
test artifacts and logs in: artifacts/b8a78987c1-dirty/043215/replica-circuit-breaker/run_2
(test_impl.go:297).Fatalf: error executing 'SELECT count(*) FROM tpcc.order_line AS OF SYSTEM TIME '2022-11-02 04:37:06'': pq: inbox communication error: rpc error: code = Canceled desc = context canceled
(test_impl.go:291).Fatal: monitor failure: monitor task failed: output in run_043716.222616076_n6_cockroach_workload_run_tpcc: ./cockroach workload run tpcc --warehouses=50 --histograms=perf/stats.json --tolerate-errors --ramp=5m0s --duration=10m0s --prometheus-port=0 --pprofport=33333  {pgurl:1-5} returned: context canceled
(test_impl.go:291).Fatal: monitor failure: monitor task failed: t.Fatal() was called

@tbg
Member Author

tbg commented Nov 2, 2022

@yuzefovich do you know which context got canceled here (and ideally which timeout it had on it)?

pq: inbox communication error: rpc error: code = Canceled desc = context canceled

@yuzefovich
Member

Unfortunately, I don't think we can know for sure. Most likely this error indicates that the gRPC stream connecting two parts of the distributed execution plan was broken for whatever reason, and on the "inbox" (consumer) side of the stream it is seen as "context canceled".

@tbg tbg force-pushed the circ-breaker-roachtest branch from b8a7898 to f98c932 on November 3, 2022 13:36
@otan otan removed their request for review November 29, 2022 08:46
tbg added 2 commits December 20, 2022 11:07
It can be helpful to spin up a `Monitor` that monitors only a set of
goroutines, without also watching the nodes in the underlying cluster,
which is the default behavior and which previously could not be
"narrowed down" to an empty set of nodes.

This commit permits that.

Release note: None
This adds a roachtest that runs TPCC (with a low warehouse count and
`--tolerate-errors`), loses quorum half-way through, and checks the
prometheus metrics of the workload for fail-fast behavior. It also
checks that follower reads and bounded staleness reads work as
expected at timestamps at which the cluster was known to have been
healthy.

Touches cockroachdb#33007.

Release note: None
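
Regarding the first of the two commits above (letting a `Monitor` watch no cluster nodes), usage would look roughly like the sketch below; the empty `option.NodeListOption{}` is the new part, and the exact option name plus the helper function are assumptions.

```go
// Monitor only our own goroutines; no cluster nodes are watched for deaths.
m := c.NewMonitor(ctx, option.NodeListOption{})
m.Go(func(ctx context.Context) error {
	return runSomeBackgroundTask(ctx) // hypothetical helper
})
m.Wait()
```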
@tbg tbg force-pushed the circ-breaker-roachtest branch from f98c932 to fa6cd08 on December 20, 2022 10:16
@blathers-crl blathers-crl bot requested a review from otan December 20, 2022 10:16
tbg added a commit to tbg/cockroach that referenced this pull request Dec 20, 2022
This ends up being useful every now and then. I was just brushing it up
for cockroachdb#76147 and figured it
was about time to just land this.

This is a revival of
cockroachdb#65844, which I cannot
reopen at this point.

Epic: none
Release note: None
@tbg
Member Author

tbg commented Dec 20, 2022

I thought I could fix this by setting the circuit breaker timeout to something like 4s, but I hit the same failure mode, sadly. Since I don't have time to dig in, I'll try with some of the checks removed.

craig bot pushed a commit that referenced this pull request Dec 20, 2022
93933: sql: add `indnullsnotdistinct` column to `pg_catalog.pg_index` r=rafiss a=e-mbrown

Fixes: #92583

This commit adds the `indnullsnotdistinct` column to `pg_catalog.pg_index`. Since we do not support `NULLS [ NOT ] DISTINCT` on `CREATE INDEX`, the column is always false.

Release note: None

93938: storage: fix `CheckSSTConflicts` intent conflict above MVCC range tombstone r=erikgrinaker a=erikgrinaker

`CheckSSTConflicts` could fail to handle a point key collision with an intent written above an MVCC range tombstone, because the intent was considered to be deleted by the MVCC range tombstone.

This patch makes `MVCCRangeKeyStack.Covers()` consider a timestamp of 0 (i.e. an intent) as being above all range keys, because intents must be above all range keys by definition. Otherwise, this condition in `CheckSSTConflicts` would consider the intent a point key deleted by an MVCC range tombstone, and therefore ignore it:

```go
extValueDeletedByRange := extHasRange && extHasPoint && extRangeKeys.Covers(extKey)
```

`AddSSTable` operations are only used on offline tables, so this won't affect transaction isolation for ongoing transactions. However, it is possible (but unlikely) for a committed but unresolved write intent left behind by a past transaction to not be considered for conflict handling, and for `AddSSTable` to write below it.

Resolves #93934.

Release note (bug fix): When experimental MVCC range tombstones are enabled (they're disabled by default), a bulk ingestion (e.g. an import) could fail to take a committed-but-unresolved write intent into account during conflict checks when written above an MVCC range tombstone. It was therefore possible in very rare circumstances for the ingestion to write a value below the timestamp of the committed intent, causing the ingested value to disappear.

93942: sql: add flag to allow resolving index on offline tables r=chengxiong-ruan a=chengxiong-ruan

Previously, offline tables were always skipped when resolving an index. This PR adds a flag to allow offline tables to be considered. The main use case is `SHOW RANGES`, where index ranges of tables being imported (and therefore offline) should be shown.

Epic: None
Release note: None

93966: roachtest: add helper to set up jaeger and store data as artifacts r=tbg a=tbg

This ends up being useful every now and then. I was just brushing it up
for #76147 and figured it
was about time to just land this.

This is a revival of
#65844, which I cannot
reopen at this point.

Epic: none
Release note: None


Co-authored-by: e-mbrown <[email protected]>
Co-authored-by: Erik Grinaker <[email protected]>
Co-authored-by: Chengxiong Ruan <[email protected]>
Co-authored-by: Tobias Grieger <[email protected]>
@otan otan removed their request for review February 20, 2023 23:55
@tbg tbg closed this Feb 23, 2023