-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
roachtest/mixedversion: monitor nodes in every test #107548
Conversation
Meant to create this as a draft; there are likely rough edges, I'll run more tests and open for review again when ready. |
ebadacd
to
75e88ce
Compare
Ran this PR with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! I like the work around the monitors. I'm also in the process of doing some work around the monitors for MT, and eager to have this as a base.
Reviewed 4 of 5 files at r1, 4 of 5 files at r2, 3 of 3 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @smg260)
Previously, the `Monitor` function would return semi-structured data in the form of a `(node, message)` pair. The `message` field corresponded to the output (as in, stdout) of the bash script that implements the monitor logic. This tied callers to the specific implementation of the function and to the strings passed to `echo` in that script (e.g., all the checks for the `"dead"` string in the message). This commit updates the function to instead return an event channel. Events can be of different types corresponding to each of the events that the shell script is able to emit (e.g., when the cockroach process is found to be running, when it stops, etc). This makes the parsing of the script output private, and simplifies the logic in most callers. The message returned by the `roachtest` wrapper is also changed slightly in this commit to make it clearer where the error is coming from: the function passed to `monitor.Go` or the monitor process itself. This has been a source of confusion in the past. Finally, the monitor implementation in roachprod is changed slightly to avoid blocking on channel sends if the context has already been canceled. This avoids leaked goroutines in tests, as canceling the context passed to `Monitor()` should cause all goroutines to eventually finish instead of blocking indefinitely. Epic: CRDB-19321 Release note: None
This updates the `mixedversion` runner to monitor the status of the cockroach processes on the cluster during a mixed-version test. This allows the test to fail immediately if a node crashes unexpectedly. Previously, if a node crashed, the test would only fail if the crash resulted in an assertion failure. There is a chance that this would not happen because the framework could decide to restart the node shortly after its crash. With this change, an unexpected node death causes the test to fail immediately. This commit also exposes the ability for test authors to use `ExpectDeath()` in their tests, in case the test performs its own restarts or chaos events. Epic: CRDB-19321 Release note: None
This makes it easier for test authors to see all helper functions available without having to browse through internal framework code. Epic: CRDB-19321 Release note: None
The note about background functions is important to highlight specifically due to potential habit of using existing roachtest APIs. Epic: CRDB-19321 Release note: None
75e88ce
to
3609615
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! Offloading "death monitoring" to the mixedversion framework is great.
TFTRs! bors r=herkolategan,srosenberg |
Build succeeded: |
PR cockroachdb#107548 introduced a bug where monitor context is not cancelled if the monitor has no tasks running. In this case, the goroutine waiting for cluster events never terminates (if the cluster is health) and the roachtest times out waiting for shutdown. Fixes cockroachdb#108507 Epic: None Release note: None
In cockroachdb#107548, the `(*monitorImpl).wait()` function was modified such that, if the caller provided no functions (via the `Go` method), then `wait()` would block while listening for node events. However, that was a change in the semantics of the function: previously, a monitor without goroutines would return immediately after a call to `wait()` is made. This behaviour is preferred as it's familiar: it's how `errgroup` and our own `ctxgroup` work. That said, we still need some way to support the use-case where callers are only interested in listening for unexpected node deaths without having to pass a function to `Go()`. To support that scenario, a new function is added to the `Monitor` interface: `WaitForNodeDeath`. This function is like `WaitE`, except that only errors related to unexpected node deaths are returned. When callers wish to terminate the monitoring goroutine, they call call `Wait{,E}()` as before. The internal implementation of the `monitorImpl` is also simplified by making use of two `errgroup`s to achieve this behaviour, replacing the previous, harder to understand, logic. Fixes: cockroachdb#108530. Release note: None
In cockroachdb#107548, the `(*monitorImpl).wait()` function was modified such that, if the caller provided no functions (via the `Go` method), then `wait()` would block while listening for node events. However, that was a change in the semantics of the function: previously, a monitor without goroutines would return immediately after a call to `wait()` is made. This behaviour is preferred as it's familiar: it's how `errgroup` and our own `ctxgroup` work. That said, we still need some way to support the use-case where callers are only interested in listening for unexpected node deaths without having to pass a function to `Go()`. To support that scenario, a new function is added to the `Monitor` interface: `WaitForNodeDeath`. This function is like `WaitE`, except that only errors related to unexpected node deaths are returned. When callers wish to terminate the monitoring goroutine, they call call `Wait{,E}()` as before. The internal implementation of the `monitorImpl` is also simplified by making use of two `errgroup`s to achieve this behaviour, replacing the previous, harder to understand, logic. Fixes: cockroachdb#108530. Release note: None
In cockroachdb#107548, the `(*monitorImpl).wait()` function was modified such that, if the caller provided no functions (via the `Go` method), then `wait()` would block while listening for node events. However, that was a change in the semantics of the function: previously, a monitor without goroutines would return immediately after a call to `wait()` is made. This behaviour is preferred as it's familiar: it's how `errgroup` and our own `ctxgroup` work. That said, we still need some way to support the use-case where callers are only interested in listening for unexpected node deaths without having to pass a function to `Go()`. To support that scenario, a new function is added to the `Monitor` interface: `WaitForNodeDeath`. This function is like `WaitE`, except that only errors related to unexpected node deaths are returned. When callers wish to terminate the monitoring goroutine, they call call `Wait{,E}()` as before. The internal implementation of the `monitorImpl` is also simplified by making use of two `errgroup`s to achieve this behaviour, replacing the previous, harder to understand, logic. Fixes: cockroachdb#108530. Release note: None
108594: roachtest: allow callers to listen for node events in monitor r=srosenberg a=renatolabs In #107548, the `(*monitorImpl).wait()` function was modified such that, if the caller provided no functions (via the `Go` method), then `wait()` would block while listening for node events. However, that was a change in the semantics of the function: previously, a monitor without goroutines would return immediately after a call to `wait()` is made. This behaviour is preferred as it's familiar: it's how `errgroup` and our own `ctxgroup` work. That said, we still need some way to support the use-case where callers are only interested in listening for unexpected node deaths without having to pass a function to `Go()`. To support that scenario, a new function is added to the `Monitor` interface: `WaitForNodeDeath`. This function is like `WaitE`, except that only errors related to unexpected node deaths are returned. When callers wish to terminate the monitoring goroutine, they call call `Wait{,E}()` as before. The internal implementation of the `monitorImpl` is also simplified by making use of two `errgroup`s to achieve this behaviour, replacing the previous, harder to understand, logic. Fixes: #108530. Release note: None Co-authored-by: Renato Costa <[email protected]>
In #107548, the `(*monitorImpl).wait()` function was modified such that, if the caller provided no functions (via the `Go` method), then `wait()` would block while listening for node events. However, that was a change in the semantics of the function: previously, a monitor without goroutines would return immediately after a call to `wait()` is made. This behaviour is preferred as it's familiar: it's how `errgroup` and our own `ctxgroup` work. That said, we still need some way to support the use-case where callers are only interested in listening for unexpected node deaths without having to pass a function to `Go()`. To support that scenario, a new function is added to the `Monitor` interface: `WaitForNodeDeath`. This function is like `WaitE`, except that only errors related to unexpected node deaths are returned. When callers wish to terminate the monitoring goroutine, they call call `Wait{,E}()` as before. The internal implementation of the `monitorImpl` is also simplified by making use of two `errgroup`s to achieve this behaviour, replacing the previous, harder to understand, logic. Fixes: #108530. Release note: None
In cockroachdb#107548, the `(*monitorImpl).wait()` function was modified such that, if the caller provided no functions (via the `Go` method), then `wait()` would block while listening for node events. However, that was a change in the semantics of the function: previously, a monitor without goroutines would return immediately after a call to `wait()` is made. This behaviour is preferred as it's familiar: it's how `errgroup` and our own `ctxgroup` work. That said, we still need some way to support the use-case where callers are only interested in listening for unexpected node deaths without having to pass a function to `Go()`. To support that scenario, a new function is added to the `Monitor` interface: `WaitForNodeDeath`. This function is like `WaitE`, except that only errors related to unexpected node deaths are returned. When callers wish to terminate the monitoring goroutine, they call call `Wait{,E}()` as before. The internal implementation of the `monitorImpl` is also simplified by making use of two `errgroup`s to achieve this behaviour, replacing the previous, harder to understand, logic. Fixes: cockroachdb#108530. Release note: None
This PR includes a few changes related to the roachprod/roachtest monitor itself, and integrates it with the
mixedversion
package. Specifically, the changes are as follows (each bullet point corresponds to a commit):mixedversion
framework to monitor nodes by default. An unexpected node death immediately fails the test.*mixedversion.Helper
struct to its own file, for ease of browsing.cluster.NewMonitor
like other roachtests.Epic: CRDB-19321
Release notes: None