roachtest: add replica-circuit-breakers #76147
Conversation
Force-pushed from 179b88a to 4175eb6.
Make it *really easy* to look at `prometheus-snapshot.tgz` by putting a shell script next to it that spins it up via docker, so you can just run:

```
$ ./prometheus-docker-run.sh
[...]
ts=2022-02-09T09:07:06.254Z caller=main.go:896 level=info msg="Server is ready to receive web requests."
```

I got motivated to do this while working on the roachtest in cockroachdb#76147, but I imagine it'll come in handy in general as well.

Release note: None
76284: prometheus: simplify snapshot visualization r=otan a=tbg

(Same commit message as above.)

Co-authored-by: Tobias Grieger <[email protected]>
Force-pushed from 567a3b2 to 131462f.
Nice!
Force-pushed from 131462f to bd93c1e.
Force-pushed from bd93c1e to 69090d7.
Force-pushed from 69090d7 to 9cefd3e.
Tour de force of roachtests!
Reviewed 1 of 1 files at r1, 3 of 3 files at r2, 5 of 5 files at r3, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @otan and @tbg)
pkg/cmd/roachtest/tests/replica_circuit_breaker.go, line 200 at r3 (raw file):

```go
	},
)
t.L().Printf("under healthy cluster: ~%.2f qps", min)
```

Shouldn't we also log `max`? It could help with debugging in case we're not able to hit anything higher than `min` for whatever reason.
pkg/cmd/roachtest/tests/replica_circuit_breaker.go, line 206 at r3 (raw file):

```go
// Starting at one minute after quorum was lost, we should see at least a
// steady trickle of errors, or a steady trickle of successes. (If
// miraculously no range lost quorum it would be all successes, if they all
```

Should we fail if a miracle does occur? Conversely, how do we know that this test is not always miraculous? If we can guarantee a loss of quorum, then a simple assertion on the non-zero count of `workload_tpcc_newOrder_error_total` would suffice.
pkg/cmd/roachtest/tests/replica_circuit_breaker.go, line 221 at r3 (raw file):

```go
}

func visitPromRangeQuery(
```

It's a fairly useful helper function which could be made general enough to be reused by other tests; i.e., the assertions about the range could be parameterized.
pkg/cmd/roachtest/tests/replica_circuit_breaker.go, line 256 at r3 (raw file):

```go
for _, val := range mtr.(model.Matrix)[0].Values {
	n++
	visit(timeutil.Unix(int64(val.Timestamp/1000), 0), val)
```

Why not `val.Timestamp.Time()`?
pkg/roachprod/install/cluster_synced.go, line 2043 at r1 (raw file):

```go
if len(failed) > 0 {
	var err error
	for _, res := range failed {
```

Do we even need `failed`? `CombineErrors` could simply replace the corresponding `append(failed, r)`.
pkg/cmd/roachtest/tests/replica_circuit_breaker.go, line 136 at r3 (raw file):

```go
r := sqlutils.MakeSQLRunner(c.Conn(ctx, t.L(), 1))
queryTS := ts.Add(-10 * time.Second).Format("2006-01-02 15:04:05")
q1 := `SELECT count(*) FROM tpcc.order_line AS OF SYSTEM TIME '` + queryTS + `'`
```

Not sure I follow why `q1` is not using the stronger form of AOST, i.e., `with_min_timestamp`?
Force-pushed from 9cefd3e to b8a7898.
Rebased this PR, maybe I broke something. No longer passes.

@yuzefovich do you know which context got canceled here (and ideally which timeout it had on it)?

Unfortunately, I don't think we can know for sure. Most likely this error indicates that the gRPC stream connecting two parts of the distributed execution plan was broken for whatever reason, and on the "inbox" (consumer) side of the stream it is seen as "context canceled".
Force-pushed from b8a7898 to f98c932.
It can be helpful to spin up a `Monitor` that monitors only a set of goroutines, without also watching the nodes in the underlying cluster, which is the default and which could not be "narrowed down" to an empty set of nodes. This commit permits that.

Release note: None
This adds a roachtest that runs TPCC (with a low warehouse count and `--tolerate-errors`), loses quorum half-way through, and checks the prometheus metrics of the workload for fail-fast behavior. It also checks that follower reads and bounded staleness reads work as expected at timestamps at which the cluster was known to have been healthy.

Touches cockroachdb#33007.

Release note: None
Force-pushed from f98c932 to fa6cd08.
This ends up being useful every now and then. I was just brushing it up for cockroachdb#76147 and figured it was about time to just land this. This is a revival of cockroachdb#65844, which I cannot reopen at this point.

Epic: none
Release note: None
I thought I could fix this by setting the circuit breaker timeout to something like 4s, but same failure mode, sadly. Since I don't have time to dig in, I'll try with some of the checks removed.
93933: sql: add `indnullsnotdistinct` column to `pg_catalog.pg_index` r=rafiss a=e-mbrown

Fixes: #92583. This commit adds the `indnullsnotdistinct` column to pg_index. Since we do not support `NULLS [ NOT ] DISTINCT` on `CREATE INDEX`, the column is always false. Release note: None

93938: storage: fix `CheckSSTConflicts` intent conflict above MVCC range tombstone r=erikgrinaker a=erikgrinaker

`CheckSSTConflicts` could fail to handle a point key collision with an intent written above an MVCC range tombstone, because the intent was considered to be deleted by the MVCC range tombstone. This patch makes `MVCCRangeKeyStack.Covers()` consider a timestamp of 0 (i.e. an intent) as being above all range keys, because intents must be above all range keys by definition. Otherwise, this condition in `CheckSSTConflicts` would consider the intent as a point key deleted by an MVCC range tombstone, and therefore ignored:

```go
extValueDeletedByRange := extHasRange && extHasPoint && extRangeKeys.Covers(extKey)
```

`AddSSTable` operations are only used on offline tables, so this won't affect transaction isolation for ongoing transactions. However, it is possible (but unlikely) for a committed but unresolved write intent left behind by a past transaction to not be considered for conflict handling, and for `AddSSTable` to write below it. Resolves #93934.

Release note (bug fix): When experimental MVCC range tombstones are enabled (they're disabled by default), a bulk ingestion (e.g. an import) could fail to take a committed-but-unresolved write intent into account during conflict checks when written above an MVCC range tombstone. It was therefore possible in very rare circumstances for the ingestion to write a value below the timestamp of the committed intent, causing the ingested value to disappear.

93942: sql: add flag to allow resolving index on offline tables r=chengxiong-ruan a=chengxiong-ruan

Previously, offline tables were always skipped when resolving an index. This PR adds a flag to allow offline tables to be considered. The main use case is `SHOW RANGES`, where index ranges of tables being imported (offline) should be shown. Epic: None. Release note: None

93966: roachtest: add helper to set up jaeger and store data as artifacts r=tbg a=tbg

This ends up being useful every now and then. I was just brushing it up for #76147 and figured it was about time to just land this. This is a revival of #65844, which I cannot reopen at this point. Epic: none. Release note: None

Co-authored-by: e-mbrown <[email protected]>
Co-authored-by: Erik Grinaker <[email protected]>
Co-authored-by: Chengxiong Ruan <[email protected]>
Co-authored-by: Tobias Grieger <[email protected]>
This adds a roachtest that runs TPCC (with a low warehouse count and `--tolerate-errors`), loses quorum half-way through, and checks the prometheus metrics of the workload for fail-fast behavior.

Touches #33007.

Epic: none
Release note: None