roachtest/mixedversion: monitor nodes in every test #107548

renatolabs · 2023-07-25T17:20:49Z

This PR includes a few changes related to the roachprod/roachtest monitor itself, and integrates it with the mixedversion package. Specifically, the changes are as follows (each bullet point corresponds to a commit):

emit structured events from the roachprod monitor; this removes the need to perform string parsing on callers.
change mixedversion framework to monitor nodes by default. An unexpected node death immediately fails the test.
move public functions on the *mixedversion.Helper struct to its own file, for ease of browsing.
update the README with instructions on background tasks; most importantly, mixedversion tests can't use cluster.NewMonitor like other roachtests.

Epic: CRDB-19321

Release notes: None

cockroach-teamcity · 2023-07-25T17:21:04Z

This change is

renatolabs · 2023-07-25T17:22:17Z

Meant to create this as a draft; there are likely rough edges, I'll run more tests and open for review again when ready.

renatolabs · 2023-07-26T18:47:32Z

Ran this PR with SELECT_PROBABILITY=0.5 and the nothing seems out of the ordinary. This is ready for review.

https://teamcity.cockroachdb.com/viewLog.html?buildId=11055808&buildTypeId=Cockroach_Nightlies_RoachtestNightlyGceBazel&tab=buildResultsDiv

herkolategan

Nice! I like the work around the monitors. I'm also in the process of doing some work around the monitors for MT, and eager to have this as a base.

Reviewed 4 of 5 files at r1, 4 of 5 files at r2, 3 of 3 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @smg260)

Previously, the `Monitor` function would return semi-structured data in the form of a `(node, message)` pair. The `message` field corresponded to the output (as in, stdout) of the bash script that implements the monitor logic. This tied callers to the specific implementation of the function and to the strings passed to `echo` in that script (e.g., all the checks for the `"dead"` string in the message). This commit updates the function to instead return an event channel. Events can be of different types corresponding to each of the events that the shell script is able to emit (e.g., when the cockroach process is found to be running, when it stops, etc). This makes the parsing of the script output private, and simplifies the logic in most callers. The message returned by the `roachtest` wrapper is also changed slightly in this commit to make it clearer where the error is coming from: the function passed to `monitor.Go` or the monitor process itself. This has been a source of confusion in the past. Finally, the monitor implementation in roachprod is changed slightly to avoid blocking on channel sends if the context has already been canceled. This avoids leaked goroutines in tests, as canceling the context passed to `Monitor()` should cause all goroutines to eventually finish instead of blocking indefinitely. Epic: CRDB-19321 Release note: None

This updates the `mixedversion` runner to monitor the status of the cockroach processes on the cluster during a mixed-version test. This allows the test to fail immediately if a node crashes unexpectedly. Previously, if a node crashed, the test would only fail if the crash resulted in an assertion failure. There is a chance that this would not happen because the framework could decide to restart the node shortly after its crash. With this change, an unexpected node death causes the test to fail immediately. This commit also exposes the ability for test authors to use `ExpectDeath()` in their tests, in case the test performs its own restarts or chaos events. Epic: CRDB-19321 Release note: None

This makes it easier for test authors to see all helper functions available without having to browse through internal framework code. Epic: CRDB-19321 Release note: None

The note about background functions is important to highlight specifically due to potential habit of using existing roachtest APIs. Epic: CRDB-19321 Release note: None

srosenberg

Nice! Offloading "death monitoring" to the mixedversion framework is great.

renatolabs · 2023-08-09T12:56:04Z

TFTRs!

bors r=herkolategan,srosenberg

craig · 2023-08-09T13:34:32Z

Build succeeded:

Bazel Essential CI (Cockroach)

PR cockroachdb#107548 introduced a bug where monitor context is not cancelled if the monitor has no tasks running. In this case, the goroutine waiting for cluster events never terminates (if the cluster is health) and the roachtest times out waiting for shutdown. Fixes cockroachdb#108507 Epic: None Release note: None

In cockroachdb#107548, the `(*monitorImpl).wait()` function was modified such that, if the caller provided no functions (via the `Go` method), then `wait()` would block while listening for node events. However, that was a change in the semantics of the function: previously, a monitor without goroutines would return immediately after a call to `wait()` is made. This behaviour is preferred as it's familiar: it's how `errgroup` and our own `ctxgroup` work. That said, we still need some way to support the use-case where callers are only interested in listening for unexpected node deaths without having to pass a function to `Go()`. To support that scenario, a new function is added to the `Monitor` interface: `WaitForNodeDeath`. This function is like `WaitE`, except that only errors related to unexpected node deaths are returned. When callers wish to terminate the monitoring goroutine, they call call `Wait{,E}()` as before. The internal implementation of the `monitorImpl` is also simplified by making use of two `errgroup`s to achieve this behaviour, replacing the previous, harder to understand, logic. Fixes: cockroachdb#108530. Release note: None

108594: roachtest: allow callers to listen for node events in monitor r=srosenberg a=renatolabs In #107548, the `(*monitorImpl).wait()` function was modified such that, if the caller provided no functions (via the `Go` method), then `wait()` would block while listening for node events. However, that was a change in the semantics of the function: previously, a monitor without goroutines would return immediately after a call to `wait()` is made. This behaviour is preferred as it's familiar: it's how `errgroup` and our own `ctxgroup` work. That said, we still need some way to support the use-case where callers are only interested in listening for unexpected node deaths without having to pass a function to `Go()`. To support that scenario, a new function is added to the `Monitor` interface: `WaitForNodeDeath`. This function is like `WaitE`, except that only errors related to unexpected node deaths are returned. When callers wish to terminate the monitoring goroutine, they call call `Wait{,E}()` as before. The internal implementation of the `monitorImpl` is also simplified by making use of two `errgroup`s to achieve this behaviour, replacing the previous, harder to understand, logic. Fixes: #108530. Release note: None Co-authored-by: Renato Costa <[email protected]>

In #107548, the `(*monitorImpl).wait()` function was modified such that, if the caller provided no functions (via the `Go` method), then `wait()` would block while listening for node events. However, that was a change in the semantics of the function: previously, a monitor without goroutines would return immediately after a call to `wait()` is made. This behaviour is preferred as it's familiar: it's how `errgroup` and our own `ctxgroup` work. That said, we still need some way to support the use-case where callers are only interested in listening for unexpected node deaths without having to pass a function to `Go()`. To support that scenario, a new function is added to the `Monitor` interface: `WaitForNodeDeath`. This function is like `WaitE`, except that only errors related to unexpected node deaths are returned. When callers wish to terminate the monitoring goroutine, they call call `Wait{,E}()` as before. The internal implementation of the `monitorImpl` is also simplified by making use of two `errgroup`s to achieve this behaviour, replacing the previous, harder to understand, logic. Fixes: #108530. Release note: None

In cockroachdb#107548, the `(*monitorImpl).wait()` function was modified such that, if the caller provided no functions (via the `Go` method), then `wait()` would block while listening for node events. However, that was a change in the semantics of the function: previously, a monitor without goroutines would return immediately after a call to `wait()` is made. This behaviour is preferred as it's familiar: it's how `errgroup` and our own `ctxgroup` work. That said, we still need some way to support the use-case where callers are only interested in listening for unexpected node deaths without having to pass a function to `Go()`. To support that scenario, a new function is added to the `Monitor` interface: `WaitForNodeDeath`. This function is like `WaitE`, except that only errors related to unexpected node deaths are returned. When callers wish to terminate the monitoring goroutine, they call call `Wait{,E}()` as before. The internal implementation of the `monitorImpl` is also simplified by making use of two `errgroup`s to achieve this behaviour, replacing the previous, harder to understand, logic. Fixes: cockroachdb#108530. Release note: None

renatolabs requested a review from a team as a code owner July 25, 2023 17:20

renatolabs requested review from herkolategan and smg260 and removed request for a team July 25, 2023 17:20

renatolabs marked this pull request as draft July 25, 2023 17:22

renatolabs force-pushed the rc/mixedversion-monitor branch 3 times, most recently from ebadacd to 75e88ce Compare July 25, 2023 18:17

renatolabs mentioned this pull request Jul 26, 2023

Assertion failure during unoptimized-query-oracle/disable-rules=all/rand-tables test #107654

Closed

renatolabs marked this pull request as ready for review July 26, 2023 18:47

herkolategan approved these changes Aug 8, 2023

View reviewed changes

renatolabs added 4 commits August 8, 2023 13:46

roachtest/mixedversion: move helper functions to its own file

7537b92

This makes it easier for test authors to see all helper functions available without having to browse through internal framework code. Epic: CRDB-19321 Release note: None

roachtest/mixedversion: update README with note about helpers

3609615

The note about background functions is important to highlight specifically due to potential habit of using existing roachtest APIs. Epic: CRDB-19321 Release note: None

renatolabs force-pushed the rc/mixedversion-monitor branch from 75e88ce to 3609615 Compare August 8, 2023 17:46

srosenberg approved these changes Aug 8, 2023

View reviewed changes

renatolabs added the backport-23.1.x Flags PRs that need to be backported to 23.1 label Aug 9, 2023

craig bot merged commit 4c34b48 into cockroachdb:master Aug 9, 2023

blathers-crl bot mentioned this pull request Aug 9, 2023

release-23.1: roachtest/mixedversion: monitor nodes in every test #108438

Merged

renatolabs deleted the rc/mixedversion-monitor branch August 10, 2023 13:23

This was referenced Aug 10, 2023

roachtest: Abort monitor wait when no tasks exist #108521

Closed

roachtest: Monitor.Wait should work correctly without active tasks #108530

Closed

renatolabs mentioned this pull request Aug 11, 2023

roachtest: allow callers to listen for node events in monitor #108594

Merged

blathers-crl bot mentioned this pull request Aug 14, 2023

release-23.1: roachtest: allow callers to listen for node events in monitor #108716

Merged

renatolabs mentioned this pull request Aug 28, 2023

roachtest: 255 exit status from Monitor leads to node incorrectly labeled as dead #105143

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

roachtest/mixedversion: monitor nodes in every test #107548

roachtest/mixedversion: monitor nodes in every test #107548

renatolabs commented Jul 25, 2023 •

edited

Loading

cockroach-teamcity commented Jul 25, 2023

renatolabs commented Jul 25, 2023

renatolabs commented Jul 26, 2023

herkolategan left a comment

srosenberg left a comment

renatolabs commented Aug 9, 2023

craig bot commented Aug 9, 2023

roachtest/mixedversion: monitor nodes in every test #107548

roachtest/mixedversion: monitor nodes in every test #107548

Conversation

renatolabs commented Jul 25, 2023 • edited Loading

cockroach-teamcity commented Jul 25, 2023

renatolabs commented Jul 25, 2023

renatolabs commented Jul 26, 2023

herkolategan left a comment

Choose a reason for hiding this comment

srosenberg left a comment

Choose a reason for hiding this comment

renatolabs commented Aug 9, 2023

craig bot commented Aug 9, 2023

renatolabs commented Jul 25, 2023 •

edited

Loading