roachtest: harden and extend cancel roachtest #105569
Force-pushed from 36141cb to aedcdc2.
Nice, I just have one question.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @cucaroach and @yuzefovich)
pkg/cmd/roachtest/tests/cancel.go
line 115 at r1 (raw file):
```go
	t.Fatal(err)
}
time.Sleep(10 * time.Millisecond)
```
What are your thoughts on how far along the query will be in ~10ms from when it shows up in the query table, and how it relates to coverage of when the query is canceled? I'm just wondering whether in ~10ms we would expect to still be in query planning or in the distributed execution stage. If the former, is this what we're trying to test here? Should we wait a bit longer before canceling the query?
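For context, here is a minimal sketch (not the actual test code) of the wait-then-cancel pattern under discussion; the helper name, the query-text filter, and the 10ms pause are illustrative assumptions:

```go
package cancelsketch

import (
	"database/sql"
	"fmt"
	"time"
)

// cancelAfterShortDelay polls until the target query shows up in the
// cluster's query table, pauses briefly, then cancels it by ID.
func cancelAfterShortDelay(db *sql.DB) error {
	var queryID string
	for {
		err := db.QueryRow(
			`SELECT query_id FROM [SHOW CLUSTER QUERIES]
			 WHERE query LIKE '%lineitem%'`, // illustrative filter
		).Scan(&queryID)
		if err == nil {
			break
		}
		if err != sql.ErrNoRows {
			return err
		}
		time.Sleep(time.Millisecond)
	}
	// ~10ms after the query becomes visible, it may well still be in the
	// planning stage rather than distributed execution -- the crux of the
	// question above.
	time.Sleep(10 * time.Millisecond)
	_, err := db.Exec(fmt.Sprintf("CANCEL QUERY '%s'", queryID))
	return err
}
```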
Force-pushed from aedcdc2 to 0315e97.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @cucaroach and @rharding6373)
pkg/cmd/roachtest/tests/cancel.go
line 115 at r1 (raw file):
Previously, rharding6373 (Rachael Harding) wrote…
What are your thoughts on how far along the query will be in ~10ms from when it shows up in the query table, and how it relates to coverage of when the query is canceled? I'm just wondering whether in ~10ms we would expect to still be in query planning or in the distributed execution stage. If the former, is this what we're trying to test here? Should we wait a bit longer before canceling the query?
That's a good point. Most likely with a 10ms sleep we'd cancel the query while it's in the planning stage, and this test is primarily about the cancellation of distributed execution plans. I adjusted the test so that we sleep for a random duration in the [1ms, 250ms] range and run each query 5 times, requiring 3 successful cancellations. This should make the test non-flaky while giving us decent coverage that distributed plan shutdown doesn't get stuck and is relatively quick (we assert that it completes within 5s).
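A rough sketch of the retry scheme described here; the `runAndCancelOnce` callback is hypothetical (it runs one query instance, waits the given delay after the query appears, cancels it, and reports whether the cancellation succeeded):

```go
package cancelsketch

import (
	"fmt"
	"math/rand"
	"time"
)

// runCancellationAttempts runs the query numAttempts times and requires
// that at least requiredCancels of them are successfully canceled.
func runCancellationAttempts(rng *rand.Rand, runAndCancelOnce func(time.Duration) bool) error {
	const (
		numAttempts     = 5
		requiredCancels = 3
		minDelay        = time.Millisecond
		maxDelay        = 250 * time.Millisecond
	)
	canceled := 0
	for i := 0; i < numAttempts; i++ {
		// A random delay in [1ms, 250ms] lets different attempts catch the
		// query in different stages (planning vs. distributed execution).
		delay := minDelay + time.Duration(rng.Int63n(int64(maxDelay-minDelay+1)))
		if runAndCancelOnce(delay) {
			canceled++
		}
	}
	if canceled < requiredCancels {
		return fmt.Errorf("only %d of %d attempts were canceled successfully", canceled, numAttempts)
	}
	return nil
}
```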
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @cucaroach and @yuzefovich)
pkg/cmd/roachtest/tests/cancel.go
line 115 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
That's a good point. Most likely with a 10ms sleep we'd cancel the query while it's in the planning stage, and this test is primarily about the cancellation of distributed execution plans. I adjusted the test so that we sleep for a random duration in the [1ms, 250ms] range and run each query 5 times, requiring 3 successful cancellations. This should make the test non-flaky while giving us decent coverage that distributed plan shutdown doesn't get stuck and is relatively quick (we assert that it completes within 5s).
This seems like a reasonable compromise. We should keep an eye on test timeouts, since this might result in much longer test runs.
Force-pushed from 2a9a644 to 3ce92cc.
I ran this updated test 50 times and got some failures (namely, the cancellation didn't occur within 5s). I think the fact that we now run each query 5 times significantly increases the likelihood of this. I'm putting this PR in draft mode for now while I investigate the reason for the slow cancellation.
Forgot to reply to this:
A single run of this test now takes on the order of 2-3 minutes (about 1 minute to restore the backup and 1-2 minutes to run 4 queries 5 times each), so we shouldn't see any timeouts.
Force-pushed from 283f7aa to d22a2e6.
I figured out why the cancellation sometimes didn't occur with my previous patch: we were trying to cancel one of the "setup" queries (either `USE tpch;` or `SET distsql = off;`), so the main TPCH query wasn't being canceled. I refactored the test (we now always require that the main TPCH query is canceled and never finishes) while keeping most of the useful logging. I'd appreciate another look, and I also kicked off 200 runs overnight to ensure no failures.
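A sketch (with hypothetical names, not the actual test code) of the refactored coordination: the runner signals only after its setup statements finish, so the main goroutine can never cancel `USE tpch;` or `SET distsql = off;` by mistake, and the cancellation always targets the TPCH query itself:

```go
package cancelsketch

import (
	"context"
	"database/sql"
	"errors"
	"fmt"
	"time"
)

func runAndCancel(ctx context.Context, db *sql.DB, tpchQuery string) error {
	setupDone := make(chan struct{})
	runnerErr := make(chan error, 1)
	go func() {
		conn, err := db.Conn(ctx) // dedicated connection so USE/SET stick
		if err != nil {
			runnerErr <- err
			return
		}
		defer conn.Close()
		for _, stmt := range []string{"USE tpch", "SET distsql = off"} {
			if _, err := conn.ExecContext(ctx, stmt); err != nil {
				runnerErr <- err
				return
			}
		}
		close(setupDone) // the main goroutine may start polling only now
		_, err = conn.ExecContext(ctx, tpchQuery)
		runnerErr <- err
	}()

	select {
	case <-setupDone:
	case err := <-runnerErr:
		return err // setup failed before the TPCH query even started
	}

	// Poll until the TPCH query itself shows up, then cancel it by ID.
	var queryID string
	for {
		err := db.QueryRowContext(ctx,
			`SELECT query_id FROM [SHOW CLUSTER QUERIES] WHERE query = $1`, tpchQuery,
		).Scan(&queryID)
		if err == nil {
			break
		}
		if !errors.Is(err, sql.ErrNoRows) {
			return err
		}
		time.Sleep(time.Millisecond)
	}
	if _, err := db.ExecContext(ctx, fmt.Sprintf("CANCEL QUERY '%s'", queryID)); err != nil {
		return err
	}
	// The TPCH query must never complete: the runner has to observe
	// a cancellation error.
	if err := <-runnerErr; err == nil {
		return errors.New("query completed instead of being canceled")
	}
	return nil
}
```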
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @cucaroach)
Force-pushed from d22a2e6 to 736c30d.
Got 1 failure out of 200 runs where we slept for 1.81s and Q7 finished before it was canceled, so I just reduced the upper bound on the sleep duration to 1000ms (from 2000ms).
Really nice. Thanks for making all these improvements! x3
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @cucaroach)
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @cucaroach)
This commit hardens the `cancel` roachtest. In particular, this test involves two goroutines: the runner that executes a long-running TPCH query and the main goroutine that cancels that query. Previously, in order to ensure that the main goroutine attempted to cancel the query at the right moment, we slept for 250ms and then canceled all running non-internal queries other than `SHOW CLUSTER QUERIES` itself. This was problematic for a couple of reasons:

- the TPCH query might not have started yet (due to some Go scheduling delays);
- we could actually try to cancel one of the setup queries (the runner does `USE tpch;` and `SET distsql = off;` before running the TPCH query).

To address the first reason, this commit adjusts the runner to notify the main goroutine only after the setup queries are done and introduces a polling loop that waits until the TPCH query shows up. That polling loop now sleeps for a random duration of up to 1000ms (in order to improve the test coverage of both the optimizer and the execution engine). Note that we only check that the cancellation occurred within 3s (down from 5s before this commit), so we don't sufficiently exercise optimizer cancellation (which isn't the primary goal of this test anyway).

The second reason is addressed by blocking the main goroutine until the setup queries are done.

Fixes: #105528.

Release note: None
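To illustrate the 3s bound from the commit message, here is a hedged sketch of the kind of assertion involved; the function name is illustrative, and matching on "canceled" in the error text is an assumption rather than the test's exact check:

```go
package cancelsketch

import (
	"errors"
	"fmt"
	"strings"
	"time"
)

// assertCanceledQuickly checks that, after CANCEL QUERY was issued, the
// runner observed a cancellation error within 3s and the query never
// completed.
func assertCanceledQuickly(cancelIssuedAt time.Time, runnerErr error) error {
	if runnerErr == nil {
		return errors.New("query finished instead of being canceled")
	}
	if !strings.Contains(runnerErr.Error(), "canceled") {
		return fmt.Errorf("unexpected error: %v", runnerErr)
	}
	if elapsed := time.Since(cancelIssuedAt); elapsed > 3*time.Second {
		return fmt.Errorf("cancellation took %s, want under 3s", elapsed)
	}
	return nil
}
```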
Force-pushed from 736c30d to f80b346.
Got no failures in 200 runs, should be solid. TFTR!

bors r+
Build succeeded.
blathers backport 22.2
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool.

error setting reviewers, but backport branch blathers/backport-release-22.2-105569 is ready: POST https://api.github.com/repos/cockroachdb/cockroach/pulls/110896/requested_reviewers: 422 Reviews may only be requested from collaborators. One or more of the teams you specified is not a collaborator of the cockroachdb/cockroach repository.

Backport to branch 22.2 failed. See errors above.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
blathers backport 23.1
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool.

error setting reviewers, but backport branch blathers/backport-release-23.1-105569 is ready: POST https://api.github.com/repos/cockroachdb/cockroach/pulls/110897/requested_reviewers: 422 Reviews may only be requested from collaborators. One or more of the teams you specified is not a collaborator of the cockroachdb/cockroach repository.

Backport to branch 23.1 failed. See errors above.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.