Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest/mixedversion: monitor nodes in every test #107548

Merged
merged 4 commits into from
Aug 9, 2023

Conversation

renatolabs
Copy link
Contributor

@renatolabs renatolabs commented Jul 25, 2023

This PR includes a few changes related to the roachprod/roachtest monitor itself, and integrates it with the mixedversion package. Specifically, the changes are as follows (each bullet point corresponds to a commit):

  • emit structured events from the roachprod monitor; this removes the need to perform string parsing on callers.
  • change mixedversion framework to monitor nodes by default. An unexpected node death immediately fails the test.
  • move public functions on the *mixedversion.Helper struct to its own file, for ease of browsing.
  • update the README with instructions on background tasks; most importantly, mixedversion tests can't use cluster.NewMonitor like other roachtests.

Epic: CRDB-19321

Release notes: None

@renatolabs renatolabs requested a review from a team as a code owner July 25, 2023 17:20
@renatolabs renatolabs requested review from herkolategan and smg260 and removed request for a team July 25, 2023 17:20
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@renatolabs
Copy link
Contributor Author

Meant to create this as a draft; there are likely rough edges, I'll run more tests and open for review again when ready.

@renatolabs renatolabs marked this pull request as draft July 25, 2023 17:22
@renatolabs renatolabs force-pushed the rc/mixedversion-monitor branch 3 times, most recently from ebadacd to 75e88ce Compare July 25, 2023 18:17
@renatolabs renatolabs marked this pull request as ready for review July 26, 2023 18:47
@renatolabs
Copy link
Contributor Author

Ran this PR with SELECT_PROBABILITY=0.5 and the nothing seems out of the ordinary. This is ready for review.

https://teamcity.cockroachdb.com/viewLog.html?buildId=11055808&buildTypeId=Cockroach_Nightlies_RoachtestNightlyGceBazel&tab=buildResultsDiv

Copy link
Collaborator

@herkolategan herkolategan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! I like the work around the monitors. I'm also in the process of doing some work around the monitors for MT, and eager to have this as a base.

Reviewed 4 of 5 files at r1, 4 of 5 files at r2, 3 of 3 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @smg260)

Previously, the `Monitor` function would return semi-structured data
in the form of a `(node, message)` pair. The `message` field
corresponded to the output (as in, stdout) of the bash script that
implements the monitor logic. This tied callers to the specific
implementation of the function and to the strings passed to `echo` in
that script (e.g., all the checks for the `"dead"` string in the
message).

This commit updates the function to instead return an event
channel. Events can be of different types corresponding to each of the
events that the shell script is able to emit (e.g., when the cockroach
process is found to be running, when it stops, etc). This makes the
parsing of the script output private, and simplifies the logic in most
callers.

The message returned by the `roachtest` wrapper is also changed
slightly in this commit to make it clearer where the error is coming
from: the function passed to `monitor.Go` or the monitor process
itself. This has been a source of confusion in the past.

Finally, the monitor implementation in roachprod is changed slightly
to avoid blocking on channel sends if the context has already been
canceled. This avoids leaked goroutines in tests, as canceling the
context passed to `Monitor()` should cause all goroutines to
eventually finish instead of blocking indefinitely.

Epic: CRDB-19321

Release note: None
This updates the `mixedversion` runner to monitor the status of the
cockroach processes on the cluster during a mixed-version test. This
allows the test to fail immediately if a node crashes unexpectedly.

Previously, if a node crashed, the test would only fail if the crash
resulted in an assertion failure. There is a chance that this would
not happen because the framework could decide to restart the node
shortly after its crash.

With this change, an unexpected node death causes the test to fail
immediately. This commit also exposes the ability for test authors to
use `ExpectDeath()` in their tests, in case the test performs its own
restarts or chaos events.

Epic: CRDB-19321

Release note: None
This makes it easier for test authors to see all helper functions
available without having to browse through internal framework code.

Epic: CRDB-19321

Release note: None
The note about background functions is important to highlight
specifically due to potential habit of using existing roachtest APIs.

Epic: CRDB-19321

Release note: None
@renatolabs renatolabs force-pushed the rc/mixedversion-monitor branch from 75e88ce to 3609615 Compare August 8, 2023 17:46
Copy link
Member

@srosenberg srosenberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Offloading "death monitoring" to the mixedversion framework is great.

@renatolabs
Copy link
Contributor Author

TFTRs!

bors r=herkolategan,srosenberg

@renatolabs renatolabs added the backport-23.1.x Flags PRs that need to be backported to 23.1 label Aug 9, 2023
@craig
Copy link
Contributor

craig bot commented Aug 9, 2023

Build succeeded:

@craig craig bot merged commit 4c34b48 into cockroachdb:master Aug 9, 2023
@renatolabs renatolabs deleted the rc/mixedversion-monitor branch August 10, 2023 13:23
miretskiy pushed a commit to miretskiy/cockroach that referenced this pull request Aug 10, 2023
PR cockroachdb#107548 introduced a bug where monitor context
is not cancelled if the monitor has no tasks running.
In this case, the goroutine waiting for cluster events
never terminates (if the cluster is health) and the
roachtest times out waiting for shutdown.

Fixes cockroachdb#108507
Epic: None

Release note: None
renatolabs added a commit to renatolabs/cockroach that referenced this pull request Aug 11, 2023
In cockroachdb#107548, the `(*monitorImpl).wait()` function was modified such
that, if the caller provided no functions (via the `Go` method), then
`wait()` would block while listening for node events. However, that
was a change in the semantics of the function: previously, a monitor
without goroutines would return immediately after a call to `wait()`
is made. This behaviour is preferred as it's familiar: it's how
`errgroup` and our own `ctxgroup` work.

That said, we still need some way to support the use-case where
callers are only interested in listening for unexpected node deaths
without having to pass a function to `Go()`. To support that scenario,
a new function is added to the `Monitor` interface:
`WaitForNodeDeath`. This function is like `WaitE`, except that only
errors related to unexpected node deaths are returned. When callers
wish to terminate the monitoring goroutine, they call call
`Wait{,E}()` as before.

The internal implementation of the `monitorImpl` is also simplified by
making use of two `errgroup`s to achieve this behaviour, replacing the
previous, harder to understand, logic.

Fixes: cockroachdb#108530.

Release note: None
renatolabs added a commit to renatolabs/cockroach that referenced this pull request Aug 11, 2023
In cockroachdb#107548, the `(*monitorImpl).wait()` function was modified such
that, if the caller provided no functions (via the `Go` method), then
`wait()` would block while listening for node events. However, that
was a change in the semantics of the function: previously, a monitor
without goroutines would return immediately after a call to `wait()`
is made. This behaviour is preferred as it's familiar: it's how
`errgroup` and our own `ctxgroup` work.

That said, we still need some way to support the use-case where
callers are only interested in listening for unexpected node deaths
without having to pass a function to `Go()`. To support that scenario,
a new function is added to the `Monitor` interface:
`WaitForNodeDeath`. This function is like `WaitE`, except that only
errors related to unexpected node deaths are returned. When callers
wish to terminate the monitoring goroutine, they call call
`Wait{,E}()` as before.

The internal implementation of the `monitorImpl` is also simplified by
making use of two `errgroup`s to achieve this behaviour, replacing the
previous, harder to understand, logic.

Fixes: cockroachdb#108530.

Release note: None
renatolabs added a commit to renatolabs/cockroach that referenced this pull request Aug 11, 2023
In cockroachdb#107548, the `(*monitorImpl).wait()` function was modified such
that, if the caller provided no functions (via the `Go` method), then
`wait()` would block while listening for node events. However, that
was a change in the semantics of the function: previously, a monitor
without goroutines would return immediately after a call to `wait()`
is made. This behaviour is preferred as it's familiar: it's how
`errgroup` and our own `ctxgroup` work.

That said, we still need some way to support the use-case where
callers are only interested in listening for unexpected node deaths
without having to pass a function to `Go()`. To support that scenario,
a new function is added to the `Monitor` interface:
`WaitForNodeDeath`. This function is like `WaitE`, except that only
errors related to unexpected node deaths are returned. When callers
wish to terminate the monitoring goroutine, they call call
`Wait{,E}()` as before.

The internal implementation of the `monitorImpl` is also simplified by
making use of two `errgroup`s to achieve this behaviour, replacing the
previous, harder to understand, logic.

Fixes: cockroachdb#108530.

Release note: None
craig bot pushed a commit that referenced this pull request Aug 14, 2023
108594: roachtest: allow callers to listen for node events in monitor r=srosenberg a=renatolabs

In #107548, the `(*monitorImpl).wait()` function was modified such that, if the caller provided no functions (via the `Go` method), then `wait()` would block while listening for node events. However, that was a change in the semantics of the function: previously, a monitor without goroutines would return immediately after a call to `wait()` is made. This behaviour is preferred as it's familiar: it's how `errgroup` and our own `ctxgroup` work.

That said, we still need some way to support the use-case where callers are only interested in listening for unexpected node deaths without having to pass a function to `Go()`. To support that scenario, a new function is added to the `Monitor` interface: `WaitForNodeDeath`. This function is like `WaitE`, except that only errors related to unexpected node deaths are returned. When callers wish to terminate the monitoring goroutine, they call call `Wait{,E}()` as before.

The internal implementation of the `monitorImpl` is also simplified by making use of two `errgroup`s to achieve this behaviour, replacing the previous, harder to understand, logic.

Fixes: #108530.

Release note: None

Co-authored-by: Renato Costa <[email protected]>
blathers-crl bot pushed a commit that referenced this pull request Aug 14, 2023
In #107548, the `(*monitorImpl).wait()` function was modified such
that, if the caller provided no functions (via the `Go` method), then
`wait()` would block while listening for node events. However, that
was a change in the semantics of the function: previously, a monitor
without goroutines would return immediately after a call to `wait()`
is made. This behaviour is preferred as it's familiar: it's how
`errgroup` and our own `ctxgroup` work.

That said, we still need some way to support the use-case where
callers are only interested in listening for unexpected node deaths
without having to pass a function to `Go()`. To support that scenario,
a new function is added to the `Monitor` interface:
`WaitForNodeDeath`. This function is like `WaitE`, except that only
errors related to unexpected node deaths are returned. When callers
wish to terminate the monitoring goroutine, they call call
`Wait{,E}()` as before.

The internal implementation of the `monitorImpl` is also simplified by
making use of two `errgroup`s to achieve this behaviour, replacing the
previous, harder to understand, logic.

Fixes: #108530.

Release note: None
miraradeva pushed a commit to miraradeva/cockroach that referenced this pull request Aug 17, 2023
In cockroachdb#107548, the `(*monitorImpl).wait()` function was modified such
that, if the caller provided no functions (via the `Go` method), then
`wait()` would block while listening for node events. However, that
was a change in the semantics of the function: previously, a monitor
without goroutines would return immediately after a call to `wait()`
is made. This behaviour is preferred as it's familiar: it's how
`errgroup` and our own `ctxgroup` work.

That said, we still need some way to support the use-case where
callers are only interested in listening for unexpected node deaths
without having to pass a function to `Go()`. To support that scenario,
a new function is added to the `Monitor` interface:
`WaitForNodeDeath`. This function is like `WaitE`, except that only
errors related to unexpected node deaths are returned. When callers
wish to terminate the monitoring goroutine, they call call
`Wait{,E}()` as before.

The internal implementation of the `monitorImpl` is also simplified by
making use of two `errgroup`s to achieve this behaviour, replacing the
previous, harder to understand, logic.

Fixes: cockroachdb#108530.

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-23.1.x Flags PRs that need to be backported to 23.1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants