
roachtest: harden and extend cancel roachtest #105569

Merged 1 commit into cockroachdb:master from harden-cancel on Jun 29, 2023

Conversation

@yuzefovich (Member) commented Jun 26, 2023

This commit hardens the `cancel` roachtest. In particular, this test
involves two goroutines: the runner that executes a long-running
TPCH query and the main goroutine that cancels that query. Previously,
in order to ensure that the main goroutine attempted to cancel the query
at the right moment, we slept for 250ms. Then, we would cancel all
running non-internal queries other than `SHOW CLUSTER QUERIES` itself.
This was problematic for a couple of reasons:

  • the TPCH query might not have started yet (due to Go scheduling
    delays)
  • we could actually try to cancel one of the setup queries (the runner
    does `USE tpch;` and `SET distsql = off;` before running the TPCH
    query).

To address the first problem, this commit adjusts the runner to
notify the main goroutine only after the setup queries are done and
introduces a polling loop that waits until the TPCH query shows up. That
polling loop now sleeps for a random duration of up to 1000ms
(in order to improve the test coverage of both the optimizer and the
execution engine). Note that we only check that the cancellation
occurred within 3s (it used to be 5s before this commit), so we don't
sufficiently exercise optimizer cancellation (which isn't the
primary goal of this test anyway).

The second problem is addressed by blocking the main goroutine until the
setup queries are done.

Fixes: #105528.

Release note: None
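
A rough Go sketch of the coordination described above. This is not the actual `pkg/cmd/roachtest/tests/cancel.go` code: the function name, the SQL filters, and the placeholder query handling are simplified assumptions based on the description in this PR.

```go
// Rough sketch of the hardened coordination described in the PR description;
// not the actual roachtest code. Names and SQL filters are illustrative.
package canceltest

import (
	"context"
	"database/sql"
	"fmt"
	"math/rand"
	"time"
)

// runAndCancelQuery runs one TPCH query on runnerConn in a separate goroutine
// and cancels it via ctrlConn once it shows up in SHOW CLUSTER QUERIES.
func runAndCancelQuery(ctx context.Context, runnerConn, ctrlConn *sql.Conn, tpchQuery string) error {
	setupDone := make(chan struct{})
	runnerErr := make(chan error, 1)

	go func() {
		// The setup queries run first; the main goroutine must not try to
		// cancel these.
		if _, err := runnerConn.ExecContext(ctx, "USE tpch"); err != nil {
			runnerErr <- err
			return
		}
		if _, err := runnerConn.ExecContext(ctx, "SET distsql = off"); err != nil {
			runnerErr <- err
			return
		}
		// Unblock the main goroutine only once the setup queries are done.
		close(setupDone)
		_, err := runnerConn.ExecContext(ctx, tpchQuery)
		runnerErr <- err // expected to be a "query canceled" error
	}()

	<-setupDone

	// Poll until the TPCH query shows up among the cluster's active queries.
	var queryID string
	for {
		row := ctrlConn.QueryRowContext(ctx, `
			SELECT query_id FROM [SHOW CLUSTER QUERIES]
			WHERE query NOT LIKE '%SHOW CLUSTER QUERIES%'
			AND application_name NOT LIKE '$ internal%'`)
		if err := row.Scan(&queryID); err == nil {
			break
		}
		time.Sleep(10 * time.Millisecond)
	}

	// Sleep for a random duration of up to 1000ms so that, across runs, the
	// cancellation hits the query at different stages of planning and execution.
	time.Sleep(time.Duration(rand.Intn(1000)) * time.Millisecond)

	start := time.Now()
	if _, err := ctrlConn.ExecContext(ctx, fmt.Sprintf("CANCEL QUERY '%s'", queryID)); err != nil {
		return err
	}
	// The TPCH query must be canceled (it must never finish successfully), and
	// it must return promptly after the cancellation request.
	if err := <-runnerErr; err == nil {
		return fmt.Errorf("expected the TPCH query to be canceled, but it completed")
	}
	if elapsed := time.Since(start); elapsed > 3*time.Second {
		return fmt.Errorf("cancellation took %s, expected at most 3s", elapsed)
	}
	return nil
}
```

Using two separate connections mirrors the two-goroutine structure: the runner's session blocks on the TPCH query, so the controlling session must issue `SHOW CLUSTER QUERIES` and `CANCEL QUERY` on its own connection.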

@yuzefovich yuzefovich requested a review from a team June 26, 2023 19:04
@yuzefovich yuzefovich requested a review from a team as a code owner June 26, 2023 19:04
@yuzefovich yuzefovich requested a review from cucaroach June 26, 2023 19:04
@cockroach-teamcity (Member) commented:
This change is Reviewable

@rharding6373 (Collaborator) left a comment

:lgtm: Nice, I just have one question.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @cucaroach and @yuzefovich)


pkg/cmd/roachtest/tests/cancel.go line 115 at r1 (raw file):

						t.Fatal(err)
					}
					time.Sleep(10 * time.Millisecond)

What are your thoughts on how far along the query will be in ~10ms from when it shows up in the query table, and how it relates to coverage of when the query is canceled? I'm just wondering whether in ~10ms we would expect to still be in query planning or in the distributed execution stage. If the former, is this what we're trying to test here? Should we wait a bit longer before canceling the query?

@yuzefovich changed the title from "roachtest: harden cancel roachtest" to "roachtest: harden and extend cancel roachtest" on Jun 27, 2023
@yuzefovich (Member Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @cucaroach and @rharding6373)


pkg/cmd/roachtest/tests/cancel.go line 115 at r1 (raw file):

Previously, rharding6373 (Rachael Harding) wrote…

What are your thoughts on how far along the query will be in ~10ms from when it shows up in the query table, and how it relates to coverage of when the query is canceled? I'm just wondering whether in ~10ms we would expect to still be in query planning or in the distributed execution stage. If the former, is this what we're trying to test here? Should we wait a bit longer before canceling the query?

That's a good point. Most likely, with a 10ms sleep we'd cancel the query while it's still in the planning stage, and this test is primarily about the cancellation of distributed execution plans. I adjusted the test so that we sleep for a random duration in the [1ms, 250ms] range and run each query 5 times, requiring 3 successful cancellations. This should make the test non-flaky while giving us decent coverage that distributed plan shutdown doesn't get stuck and is relatively quick (we assert that it completes within 5s).
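
A minimal sketch of the intermediate "5 attempts, require 3 successful cancellations" approach described in this comment (later reworked, as the follow-up comments explain). The `runOnce` callback is hypothetical: it runs one query, sleeping for the given duration before canceling, and reports whether the cancellation succeeded in time.

```go
package canceltest

import (
	"fmt"
	"math/rand"
	"time"
)

// runWithRetries runs each query several times and requires a minimum number
// of successful cancellations, tolerating runs where the query finishes first.
func runWithRetries(runOnce func(sleep time.Duration) (canceled bool, err error)) error {
	const attempts = 5
	const requiredCancellations = 3
	successes := 0
	for i := 0; i < attempts; i++ {
		// Sleep for a random duration in the [1ms, 250ms] range before canceling.
		sleep := time.Duration(1+rand.Intn(250)) * time.Millisecond
		canceled, err := runOnce(sleep)
		if err != nil {
			return err
		}
		if canceled {
			successes++
		}
	}
	if successes < requiredCancellations {
		return fmt.Errorf("only %d of %d runs canceled in time, want at least %d",
			successes, attempts, requiredCancellations)
	}
	return nil
}
```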

@rharding6373 (Collaborator) left a comment

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @cucaroach and @yuzefovich)


pkg/cmd/roachtest/tests/cancel.go line 115 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

That's a good point. Most likely, with a 10ms sleep we'd cancel the query while it's still in the planning stage, and this test is primarily about the cancellation of distributed execution plans. I adjusted the test so that we sleep for a random duration in the [1ms, 250ms] range and run each query 5 times, requiring 3 successful cancellations. This should make the test non-flaky while giving us decent coverage that distributed plan shutdown doesn't get stuck and is relatively quick (we assert that it completes within 5s).

This seems like a reasonable compromise. We should keep watch for timeouts, since this might result in much longer test runs.

@yuzefovich force-pushed the harden-cancel branch 2 times, most recently from 2a9a644 to 3ce92cc, on June 27, 2023 22:22
@yuzefovich (Member Author) commented:

I ran this updated test 50 times and got some failures (namely that the cancellation didn't occur within 5s). I think the fact that we now run each query 5 times significantly increases the likelihood of this. I'm putting this PR in draft mode for now while I'm investigating the reason for slow cancellation.

@yuzefovich yuzefovich marked this pull request as draft June 28, 2023 02:24
@yuzefovich (Member Author) commented Jun 28, 2023

Forgot to reply to this:

We should keep watch for timeouts, since this might result in much longer test runs.

A single run of this test now takes on the order of 2-3 minutes (about 1 minute to restore the backup and 1-2 minutes to run 4 queries 5 times each), so we shouldn't see any timeouts.

@yuzefovich (Member Author) left a comment

I figured out why the cancellation sometimes didn't occur with my previous patch: we were trying to cancel one of the "setup" queries (either `USE tpch;` or `SET distsql = off;`), so the main TPCH query wasn't being canceled. I refactored the test (we now always require that the main TPCH query is canceled and never finishes) while keeping most of the useful logging. I'd appreciate another look, and I also kicked off 200 runs overnight to ensure there are no failures.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @cucaroach)

@yuzefovich yuzefovich marked this pull request as ready for review June 29, 2023 16:14
@yuzefovich (Member Author) commented:

Got 1 failure out of 200 runs where we slept for 1.81s and Q7 finished before it was canceled, so I just reduced the upper bound on sleep duration to 1000ms (from 2000ms).

@rharding6373 (Collaborator) left a comment

Really nice. Thanks for making all these improvements! :lgtm: x3

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @cucaroach)

@rharding6373 (Collaborator) left a comment

:lgtm: x4 with approval this time

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @cucaroach)

@yuzefovich (Member Author) commented:

Got no failures in 200 runs, should be solid.

TFTR!

bors r+

@craig (craig bot, Contributor) commented Jun 29, 2023

Build succeeded.

@craig craig bot merged commit da4a10b into cockroachdb:master Jun 29, 2023
@yuzefovich yuzefovich deleted the harden-cancel branch June 29, 2023 20:19
@yuzefovich (Member Author) commented:

blathers backport 22.2

@blathers-crl (bot) commented Sep 19, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error setting reviewers, but backport branch blathers/backport-release-22.2-105569 is ready: POST https://api.github.com/repos/cockroachdb/cockroach/pulls/110896/requested_reviewers: 422 Reviews may only be requested from collaborators. One or more of the teams you specified is not a collaborator of the cockroachdb/cockroach repository. []

Backport to branch 22.2 failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@yuzefovich (Member Author) commented:

blathers backport 23.1

@blathers-crl (bot) commented Sep 19, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error setting reviewers, but backport branch blathers/backport-release-23.1-105569 is ready: POST https://api.github.com/repos/cockroachdb/cockroach/pulls/110897/requested_reviewers: 422 Reviews may only be requested from collaborators. One or more of the teams you specified is not a collaborator of the cockroachdb/cockroach repository. []

Backport to branch 23.1 failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

Successfully merging this pull request may close these issues.

roachtest: cancel/tpch/distsql/queries=7,9,20,21,nodes=3 failed