Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: split Monitor functionality in two components #118214

Open
renatolabs opened this issue Jan 23, 2024 · 1 comment
Open

roachtest: split Monitor functionality in two components #118214

renatolabs opened this issue Jan 23, 2024 · 1 comment
Assignees
Labels
A-testing Testing tools and infrastructure C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-testeng TestEng Team

Comments

@renatolabs
Copy link
Contributor

renatolabs commented Jan 23, 2024

Typical uses of the roachtest monitor are as follows:

m := c.NewMonitor()
m.Go(func(context.Context) error {
    // user logic...
})

m.Wait()

This is a summary of the two main functions provided by the monitor:

  • Go: starts a goroutine. This is useful because we never want bare go calls in roachtests (or anywhere, really -- see *: ban the go keyword in non-test code #58164); having automated panic recovery is nice to have for all tests. Note that calling Go will start the goroutine immediately.
  • Wait: serves two purposes. First, it waits for all goroutines started with Go to return (canceling on error like errgroup). In addition, it also starts a "processs monitor" (hence the name) that will watch for unexpected cockroach process deaths in the nodes it monitors.

This API is less than ideal for the following reasons:

  • Monitoring processes and managing goroutines are two completely separate functionalities. It is unexpected that we will only monitor node deaths if we also have some goroutine to run. We recently had to patch this module in order to allow callers to just monitor nodes, without having to run a function to achieve that goal (roachtest: allow callers to listen for node events in monitor #108594).

  • We will only actually monitor nodes if the test calls Wait() on the monitor. While this is good practice and should happen every time, this is not enforced and tests can continue to pass even if they don't. In addition, node deaths that happen before the test calls Wait() go completely unnoticed.

  • Because goroutine management and process monitoring are coupled, it makes composition really difficult. Suppose a function A creates a new monitor in order to run a background task. Suppose also that we're writing a new test that wants to call A to reuse its functionality. If our test B wishes to restart nodes, we now have to find a way to pass that information down to A so that A's monitor can properly handle that node death event. In the general case, if our test involves creating N monitors (directly or indirectly), and we want to perform a node restart, we need to find a way to pass that information to all N monitors.

Proposal

A proposed way to deal with this is to promote node monitoring to the test runner as a default. Every test should have an active "node monitor" that checks for unexpected node deaths. We could have a single t.ExpectNodeDeath(n1) to indicate that this test expects n1 to die, and that can happen only once. What's currently the Monitor will be deprecated in tests in favour of a simple goroutine runner with builtin panic handling.

Jira issue: CRDB-35614

@renatolabs renatolabs added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-testing Testing tools and infrastructure T-testeng TestEng Team labels Jan 23, 2024
Copy link

blathers-crl bot commented Jan 23, 2024

cc @cockroachdb/test-eng

@herkolategan herkolategan self-assigned this Sep 16, 2024
herkolategan added a commit to herkolategan/cockroach that referenced this issue Oct 23, 2024
Previously, roachtests would use bare goroutines `go func...` to start tasks
during a test to perform various background functions. This is problematic due
to the nature of roachtest, which runs multiple tests in parallel using workers.
If one such bare goroutine panics the whole test suite is interrupted and the
process exits.

This change is the first step and introduces an interface that both the
traditional and mixedversion test frameworks can utilise. It intends to target
both regular and MVT tests by supplying the test with the interface either
through a helper or the test interface.

In addition to providing the `Go` function, there are additional options that
can be passed. The test framework will generally supply the panic and error
handler, but users can supply additional options such as naming the task, or
passing a custom logger. This makes logging more useful to determine what task
panic or encountered an error.

Informs: cockroachdb#118214

Epic: None
Release note: None
herkolategan added a commit to herkolategan/cockroach that referenced this issue Oct 23, 2024
This change provides a task manager implementation that will drive the logic
running behind the Tasker interface, as well as provide the implementation of
it. The manager is meant to be used by the test framework(s), while the Tasker
interface is supplied to tests. The framework will do additional logic such as
terminating tasks at the end of a test, logging uncaught errors or issues and
providing the appropriate panic and error handlers.

Informs: cockroachdb#118214

Epic: None
Release note: None
herkolategan added a commit to herkolategan/cockroach that referenced this issue Oct 23, 2024
Previously, the mixedversion framework had its own implementation for managing
background tasks. This implementation performed well, but in its current state
it is difficult to share with tests that do not require the mixedversion
framework.

This change replace the mixedversion framework background runner with the newly
implemented task manager. The implementation details are relatively similar
making it close to a drop-in replacement. Additionally, some utilities relating
to tasks have been replaced by the task package.

Informs: cockroachdb#118214

Epic: None
Release note: None
herkolategan added a commit to herkolategan/cockroach that referenced this issue Oct 23, 2024
This change now utilises the Tasker interface supplied by the mixedversion
helper to start a task. In essence a background function can also be utilised,
but for low impact go routines the utility provides a quick way of starting a
task.

Informs: cockroachdb#118214

Epic: None
Release note: None
herkolategan added a commit to herkolategan/cockroach that referenced this issue Oct 23, 2024
Previously, roachtests would use bare goroutines `go func...` to start tasks
during a test to perform various background functions. This is problematic due
to the nature of roachtest, which runs multiple tests in parallel using workers.
If one such bare goroutine panics the whole test suite is interrupted and the
process exits.

This change is the first step and introduces an interface that both the
traditional and mixedversion test frameworks can utilise. It intends to target
both regular and MVT tests by supplying the test with the interface either
through a helper or the test interface.

In addition to providing the `Go` function, there are additional options that
can be passed. The test framework will generally supply the panic and error
handler, but users can supply additional options such as naming the task, or
passing a custom logger. This makes logging more useful to determine what task
panic or encountered an error.

Informs: cockroachdb#118214

Epic: None
Release note: None
herkolategan added a commit to herkolategan/cockroach that referenced this issue Oct 23, 2024
This change provides a task manager implementation that will drive the logic
running behind the Tasker interface, as well as provide the implementation of
it. The manager is meant to be used by the test framework(s), while the Tasker
interface is supplied to tests. The framework will do additional logic such as
terminating tasks at the end of a test, logging uncaught errors or issues and
providing the appropriate panic and error handlers.

Informs: cockroachdb#118214

Epic: None
Release note: None
herkolategan added a commit to herkolategan/cockroach that referenced this issue Oct 23, 2024
Previously, the mixedversion framework had its own implementation for managing
background tasks. This implementation performed well, but in its current state
it is difficult to share with tests that do not require the mixedversion
framework.

This change replace the mixedversion framework background runner with the newly
implemented task manager. The implementation details are relatively similar
making it close to a drop-in replacement. Additionally, some utilities relating
to tasks have been replaced by the task package.

Informs: cockroachdb#118214

Epic: None
Release note: None
herkolategan added a commit to herkolategan/cockroach that referenced this issue Oct 23, 2024
This change now utilises the Tasker interface supplied by the mixedversion
helper to start a task. In essence a background function can also be utilised,
but for low impact go routines the utility provides a quick way of starting a
task.

Informs: cockroachdb#118214

Epic: None
Release note: None
herkolategan added a commit to herkolategan/cockroach that referenced this issue Oct 24, 2024
Previously, roachtests would use bare goroutines `go func...` to start tasks
during a test to perform various background functions. This is problematic due
to the nature of roachtest, which runs multiple tests in parallel using workers.
If one such bare goroutine panics the whole test suite is interrupted and the
process exits.

This change is the first step and introduces an interface that both the
traditional and mixedversion test frameworks can utilise. It intends to target
both regular and MVT tests by supplying the test with the interface either
through a helper or the test interface.

In addition to providing the `Go` function, there are additional options that
can be passed. The test framework will generally supply the panic and error
handler, but users can supply additional options such as naming the task, or
passing a custom logger. This makes logging more useful to determine what task
panic or encountered an error.

Informs: cockroachdb#118214

Epic: None
Release note: None
herkolategan added a commit to herkolategan/cockroach that referenced this issue Oct 24, 2024
This change provides a task manager implementation that will drive the logic
running behind the Tasker interface, as well as provide the implementation of
it. The manager is meant to be used by the test framework(s), while the Tasker
interface is supplied to tests. The framework will do additional logic such as
terminating tasks at the end of a test, logging uncaught errors or issues and
providing the appropriate panic and error handlers.

Informs: cockroachdb#118214

Epic: None
Release note: None
herkolategan added a commit to herkolategan/cockroach that referenced this issue Oct 24, 2024
Previously, the mixedversion framework had its own implementation for managing
background tasks. This implementation performed well, but in its current state
it is difficult to share with tests that do not require the mixedversion
framework.

This change replace the mixedversion framework background runner with the newly
implemented task manager. The implementation details are relatively similar
making it close to a drop-in replacement. Additionally, some utilities relating
to tasks have been replaced by the task package.

Informs: cockroachdb#118214

Epic: None
Release note: None
herkolategan added a commit to herkolategan/cockroach that referenced this issue Oct 24, 2024
This change now utilises the Tasker interface supplied by the mixedversion
helper to start a task. In essence a background function can also be utilised,
but for low impact go routines the utility provides a quick way of starting a
task.

Informs: cockroachdb#118214

Epic: None
Release note: None
herkolategan added a commit to herkolategan/cockroach that referenced this issue Oct 24, 2024
Previously, roachtests would use bare goroutines `go func...` to start tasks
during a test to perform various background functions. This is problematic due
to the nature of roachtest, which runs multiple tests in parallel using workers.
If one such bare goroutine panics the whole test suite is interrupted and the
process exits.

This change is the first step and introduces an interface that both the
traditional and mixedversion test frameworks can utilise. It intends to target
both regular and MVT tests by supplying the test with the interface either
through a helper or the test interface.

In addition to providing the `Go` function, there are additional options that
can be passed. The test framework will generally supply the panic and error
handler, but users can supply additional options such as naming the task, or
passing a custom logger. This makes logging more useful to determine what task
panic or encountered an error.

Informs: cockroachdb#118214

Epic: None
Release note: None
herkolategan added a commit to herkolategan/cockroach that referenced this issue Oct 24, 2024
This change provides a task manager implementation that will drive the logic
running behind the Tasker interface, as well as provide the implementation of
it. The manager is meant to be used by the test framework(s), while the Tasker
interface is supplied to tests. The framework will do additional logic such as
terminating tasks at the end of a test, logging uncaught errors or issues and
providing the appropriate panic and error handlers.

Informs: cockroachdb#118214

Epic: None
Release note: None
herkolategan added a commit to herkolategan/cockroach that referenced this issue Oct 24, 2024
Previously, the mixedversion framework had its own implementation for managing
background tasks. This implementation performed well, but in its current state
it is difficult to share with tests that do not require the mixedversion
framework.

This change replace the mixedversion framework background runner with the newly
implemented task manager. The implementation details are relatively similar
making it close to a drop-in replacement. Additionally, some utilities relating
to tasks have been replaced by the task package.

Informs: cockroachdb#118214

Epic: None
Release note: None
herkolategan added a commit to herkolategan/cockroach that referenced this issue Oct 24, 2024
This change now utilises the Tasker interface supplied by the mixedversion
helper to start a task. In essence a background function can also be utilised,
but for low impact go routines the utility provides a quick way of starting a
task.

Informs: cockroachdb#118214

Epic: None
Release note: None
herkolategan added a commit to herkolategan/cockroach that referenced this issue Oct 24, 2024
This change now utilises the Tasker interface supplied by the mixedversion
helper to start a task. In essence a background function can also be utilised,
but for low impact go routines the utility provides a quick way of starting a
task.

Informs: cockroachdb#118214

Epic: None
Release note: None
herkolategan added a commit to herkolategan/cockroach that referenced this issue Oct 24, 2024
This change now utilises the Tasker interface supplied by the mixedversion
helper to start a task. In essence a background function can also be utilised,
but for low impact go routines the utility provides a quick way of starting a
task.

Informs: cockroachdb#118214

Epic: None
Release note: None
herkolategan added a commit to herkolategan/cockroach that referenced this issue Oct 25, 2024
This change now utilises the Tasker interface supplied by the mixedversion
helper to start a task. In essence a background function can also be utilised,
but for low impact go routines the utility provides a quick way of starting a
task.

Informs: cockroachdb#118214

Epic: None
Release note: None
herkolategan added a commit to herkolategan/cockroach that referenced this issue Oct 25, 2024
This change now utilises the Tasker interface supplied by the mixedversion
helper to start a task. In essence a background function can also be utilised,
but for low impact go routines the utility provides a quick way of starting a
task.

Informs: cockroachdb#118214

Epic: None
Release note: None
herkolategan added a commit to herkolategan/cockroach that referenced this issue Oct 25, 2024
Previously, roachtests would use bare goroutines `go func...` to start tasks
during a test to perform various background functions. This is problematic due
to the nature of roachtest, which runs multiple tests in parallel using workers.
If one such bare goroutine panics the whole test suite is interrupted and the
process exits.

This change is the first step and introduces an interface that both the
traditional and mixedversion test frameworks can utilise. It intends to target
both regular and MVT tests by supplying the test with the interface either
through a helper or the test interface.

In addition to providing the `Go` function, there are additional options that
can be passed. The test framework will generally supply the panic and error
handler, but users can supply additional options such as naming the task, or
passing a custom logger. This makes logging more useful to determine what task
panic or encountered an error.

Informs: cockroachdb#118214

Epic: None
Release note: None
herkolategan added a commit to herkolategan/cockroach that referenced this issue Dec 9, 2024
Previously, Monitor was only able to monitor processes already started. It could
not detect a new processes if it was started after monitor started monitoring.

This change uses the new monitor scripts `monitor_local.sh` &
`monitor_remote.sh` which produces a frame of processes when any cockroach
process changes occur including when a new cockroach process is started.

The logic for detecting changes in processes has been moved to the "client"
side. The scripts are now less complicated and only have the responsibility of
sending through a frame of processes. A frame consists of the cluster label,
process id, and processes status of all cockroach processes.

In addition, Monitor has been moved into its own source file for better
separation of logic. The `ignore empty nodes` functionality has also been removed,
as it is an unreliable way of detecting "empty nodes", and does not serve any
real purpose anymore.

Informs: cockroachdb#118214
Epic: None
herkolategan added a commit to herkolategan/cockroach that referenced this issue Dec 9, 2024
Previously, Monitor was mostly untested. This change aims to add a few data
driven tests using the new mock session functionality. The Monitor logic for
checking for process changes have moved to the client, and can now be tested as
well.

Although we can't test that the shell scripts are correct with these tests, we
can test that Monitor interprets the output correctly, and emits the correct
events. We can also test that it correctly handles IO errors.

The data driven tests specify what the scripts output, and we can then verify
the number of events we expect and what those events are. Since events can be
out of order, to ensure the tests do not flake we sort the events. Event timing
can be tested by chaining multiple output write blocks and event checks.

Additionally, a few smalls separate tests are added to check different IO
scenarios.

Informs: cockroachdb#118214
Epic: None
herkolategan added a commit to herkolategan/cockroach that referenced this issue Dec 9, 2024
Previously, the post assertions would check for dead nodes by using roachprod
Monitor. It checks if the data directory of a cockroach node is non-trivial.
This is not a reliable way to exclude empty nodes, or to check for dead nodes.
It also complicates the Monitor functionality unnecessarily.

This change refactors the code to rather use the HTTP health checks to determine
if there are dead nodes. Since we now know which node is used for the workload,
this check only needs to determine if the cockroach nodes are all still live.

Informs: cockroachdb#118214
Epic: None
herkolategan added a commit to herkolategan/cockroach that referenced this issue Dec 9, 2024
Add a new monitor script for monitoring remote processes. `systemctl` is used to
monitor cockroach processes, on remote nodes, and determine the exit status if a
process has died.

It emits the same frame format as the local script version, and depends on
Monitor to implement the logic for detecting changes in the process list frame.

Informs: cockroachdb#118214
Epic: None
herkolategan added a commit to herkolategan/cockroach that referenced this issue Dec 9, 2024
Previously, it was difficult to test code that creates a remote session.

This change adds a mock implementation of session, and the hooks in
SyncedCluster to use it during tests.

The mock session provides several hooks for controlling the IO of a session.
The inputs and outputs are backed by a channel to allow finer control of the
flow of data.

Currently, it only implements what was required to test the Monitor
functionality, but can be extended in the future to cover more use cases.

Informs: cockroachdb#118214
Epic: None
herkolategan added a commit to herkolategan/cockroach that referenced this issue Dec 9, 2024
Previously, Monitor was only able to monitor processes already started. It could
not detect a new processes if it was started after monitor started monitoring.

This change uses the new monitor scripts `monitor_local.sh` &
`monitor_remote.sh` which produces a frame of processes when any cockroach
process changes occur including when a new cockroach process is started.

The logic for detecting changes in processes has been moved to the "client"
side. The scripts are now less complicated and only have the responsibility of
sending through a frame of processes. A frame consists of the cluster label,
process id, and processes status of all cockroach processes.

In addition, Monitor has been moved into its own source file for better
separation of logic. The `ignore empty nodes` functionality has also been removed,
as it is an unreliable way of detecting "empty nodes", and does not serve any
real purpose anymore.

Informs: cockroachdb#118214
Epic: None
herkolategan added a commit to herkolategan/cockroach that referenced this issue Dec 9, 2024
Previously, Monitor was mostly untested. This change aims to add a few data
driven tests using the new mock session functionality. The Monitor logic for
checking for process changes have moved to the client, and can now be tested as
well.

Although we can't test that the shell scripts are correct with these tests, we
can test that Monitor interprets the output correctly, and emits the correct
events. We can also test that it correctly handles IO errors.

The data driven tests specify what the scripts output, and we can then verify
the number of events we expect and what those events are. Since events can be
out of order, to ensure the tests do not flake we sort the events. Event timing
can be tested by chaining multiple output write blocks and event checks.

Additionally, a few smalls separate tests are added to check different IO
scenarios.

Informs: cockroachdb#118214
Epic: None
herkolategan added a commit to herkolategan/cockroach that referenced this issue Dec 9, 2024
Previously, the post assertions would check for dead nodes by using roachprod
Monitor. It checks if the data directory of a cockroach node is non-trivial.
This is not a reliable way to exclude empty nodes, or to check for dead nodes.
It also complicates the Monitor functionality unnecessarily.

This change refactors the code to rather use the HTTP health checks to determine
if there are dead nodes. Since we now know which node is used for the workload,
this check only needs to determine if the cockroach nodes are all still live.

Informs: cockroachdb#118214
Epic: None
herkolategan added a commit to herkolategan/cockroach that referenced this issue Dec 10, 2024
Add a new monitor script for monitoring remote processes. `systemctl` is used to
monitor cockroach processes, on remote nodes, and determine the exit status if a
process has died.

It emits the same frame format as the local script version, and depends on
Monitor to implement the logic for detecting changes in the process list frame.

Informs: cockroachdb#118214
Epic: None
herkolategan added a commit to herkolategan/cockroach that referenced this issue Dec 10, 2024
Previously, it was difficult to test code that creates a remote session.

This change adds a mock implementation of session, and the hooks in
SyncedCluster to use it during tests.

The mock session provides several hooks for controlling the IO of a session.
The inputs and outputs are backed by a channel to allow finer control of the
flow of data.

Currently, it only implements what was required to test the Monitor
functionality, but can be extended in the future to cover more use cases.

Informs: cockroachdb#118214
Epic: None
herkolategan added a commit to herkolategan/cockroach that referenced this issue Dec 10, 2024
Previously, Monitor was only able to monitor processes already started. It could
not detect a new processes if it was started after monitor started monitoring.

This change uses the new monitor scripts `monitor_local.sh` &
`monitor_remote.sh` which produces a frame of processes when any cockroach
process changes occur including when a new cockroach process is started.

The logic for detecting changes in processes has been moved to the "client"
side. The scripts are now less complicated and only have the responsibility of
sending through a frame of processes. A frame consists of the cluster label,
process id, and processes status of all cockroach processes.

In addition, Monitor has been moved into its own source file for better
separation of logic. The `ignore empty nodes` functionality has also been removed,
as it is an unreliable way of detecting "empty nodes", and does not serve any
real purpose anymore.

Informs: cockroachdb#118214
Epic: None
herkolategan added a commit to herkolategan/cockroach that referenced this issue Dec 10, 2024
Previously, Monitor was mostly untested. This change aims to add a few data
driven tests using the new mock session functionality. The Monitor logic for
checking for process changes have moved to the client, and can now be tested as
well.

Although we can't test that the shell scripts are correct with these tests, we
can test that Monitor interprets the output correctly, and emits the correct
events. We can also test that it correctly handles IO errors.

The data driven tests specify what the scripts output, and we can then verify
the number of events we expect and what those events are. Since events can be
out of order, to ensure the tests do not flake we sort the events. Event timing
can be tested by chaining multiple output write blocks and event checks.

Additionally, a few smalls separate tests are added to check different IO
scenarios.

Informs: cockroachdb#118214
Epic: None
herkolategan added a commit to herkolategan/cockroach that referenced this issue Dec 10, 2024
Previously, the post assertions would check for dead nodes by using roachprod
Monitor. It checks if the data directory of a cockroach node is non-trivial.
This is not a reliable way to exclude empty nodes, or to check for dead nodes.
It also complicates the Monitor functionality unnecessarily.

This change refactors the code to rather use the HTTP health checks to determine
if there are dead nodes. Since we now know which node is used for the workload,
this check only needs to determine if the cockroach nodes are all still live.

Informs: cockroachdb#118214
Epic: None
herkolategan added a commit to herkolategan/cockroach that referenced this issue Dec 10, 2024
Previously, the post assertions would check for dead nodes by using roachprod
Monitor. It checks if the data directory of a cockroach node is non-trivial.
This is not a reliable way to exclude empty nodes, or to check for dead nodes.
It also complicates the Monitor functionality unnecessarily.

This change refactors the code to rather use the HTTP health checks to determine
if there are dead nodes. Since we now know which node is used for the workload,
this check only needs to determine if the cockroach nodes are all still live.

Informs: cockroachdb#118214
Epic: None
herkolategan added a commit to herkolategan/cockroach that referenced this issue Dec 10, 2024
Previously, the post assertions would check for dead nodes by using roachprod
Monitor. It checks if the data directory of a cockroach node is non-trivial.
This is not a reliable way to exclude empty nodes, or to check for dead nodes.
It also complicates the Monitor functionality unnecessarily.

This change refactors the code to rather use the HTTP health checks to determine
if there are dead nodes. Since we now know which node is used for the workload,
this check only needs to determine if the cockroach nodes are all still live.

Informs: cockroachdb#118214
Epic: None
herkolategan added a commit to herkolategan/cockroach that referenced this issue Dec 11, 2024
Previously, the post assertions would check for dead nodes by using roachprod
Monitor. It checks if the data directory of a cockroach node is non-trivial.
This is not a reliable way to exclude empty nodes, or to check for dead nodes.
It also complicates the Monitor functionality unnecessarily.

This change refactors the code to rather use the HTTP health checks to determine
if there are dead nodes. Since we now know which node is used for the workload,
this check only needs to determine if the cockroach nodes are all still live.

Informs: cockroachdb#118214
Epic: None
herkolategan added a commit to herkolategan/cockroach that referenced this issue Dec 11, 2024
Previously, a new API for managing tasks were introduced (see cockroachdb#133263). However,
this did not address roachtests that want to manage groups. In an effort to replace
`errgroup`, and possibly using `monitor.Go` for task management this change
introduces a group provider in the task manager.

A group adds the ability to wait on a subset of tasks to finish before
proceeding. The task handler will still report returned errors or panics to the
test framework.

Informs: cockroachdb#118214

Epic: None
Release note: None
herkolategan added a commit to herkolategan/cockroach that referenced this issue Dec 11, 2024
This change groups the default options used by tasks and groups into a function
that can be shared when creating new groups or tasks via the mixedversion test
helper. The mixedversion helper now also implements `GroupProvider` to enable
grouping tasks.

Informs: cockroachdb#118214

Epic: None
Release note: None
herkolategan added a commit to herkolategan/cockroach that referenced this issue Dec 11, 2024
This change groups the default options used by tasks and groups into a function
that can be shared when creating new groups or tasks via the test framework. The
`Test` interface now implements `GroupProvider` as well to enable grouping
tasks.

Informs: cockroachdb#118214

Epic: None
Release note: None
herkolategan added a commit to herkolategan/cockroach that referenced this issue Dec 11, 2024
Previously, a roachtest would need to use wait groups in order to wait on tasks.
The new `GroupProvider` interface now allows tests to manage groups of tasks via
the framework instead of manually.

Informs: cockroachdb#118214

Epic: None
Release note: None
herkolategan added a commit to herkolategan/cockroach that referenced this issue Dec 11, 2024
Previously, the post assertions would check for dead nodes by using roachprod
Monitor. It checks if the data directory of a cockroach node is non-trivial.
This is not a reliable way to exclude empty nodes, or to check for dead nodes.
It also complicates the Monitor functionality unnecessarily.

This change refactors the code to rather use the HTTP health checks to determine
if there are dead nodes. Since we now know which node is used for the workload,
this check only needs to determine if the cockroach nodes are all still live.

Informs: cockroachdb#118214
Epic: None
herkolategan added a commit to herkolategan/cockroach that referenced this issue Dec 11, 2024
Previously, Monitor was only able to monitor processes already started. It could
not detect a new processes if it was started after monitor started monitoring.

This change uses the new monitor scripts `monitor_local.sh` &
`monitor_remote.sh` which produces a frame of processes when any cockroach
process changes occur including when a new cockroach process is started.

The logic for detecting changes in processes has been moved to the "client"
side. The scripts are now less complicated and only have the responsibility of
sending through a frame of processes. A frame consists of the cluster label,
process id, and processes status of all cockroach processes.

In addition, Monitor has been moved into its own source file for better
separation of logic. The `ignore empty nodes` functionality has also been removed,
as it is an unreliable way of detecting "empty nodes", and does not serve any
real purpose anymore.

Informs: cockroachdb#118214
Epic: None
herkolategan added a commit to herkolategan/cockroach that referenced this issue Dec 11, 2024
Previously, Monitor was mostly untested. This change aims to add a few data
driven tests using the new mock session functionality. The Monitor logic for
checking for process changes have moved to the client, and can now be tested as
well.

Although we can't test that the shell scripts are correct with these tests, we
can test that Monitor interprets the output correctly, and emits the correct
events. We can also test that it correctly handles IO errors.

The data driven tests specify what the scripts output, and we can then verify
the number of events we expect and what those events are. Since events can be
out of order, to ensure the tests do not flake we sort the events. Event timing
can be tested by chaining multiple output write blocks and event checks.

Additionally, a few smalls separate tests are added to check different IO
scenarios.

Informs: cockroachdb#118214
Epic: None
herkolategan added a commit to herkolategan/cockroach that referenced this issue Dec 11, 2024
Previously, the post assertions would check for dead nodes by using roachprod
Monitor. It checks if the data directory of a cockroach node is non-trivial.
This is not a reliable way to exclude empty nodes, or to check for dead nodes.
It also complicates the Monitor functionality unnecessarily.

This change refactors the code to rather use the HTTP health checks to determine
if there are dead nodes. Since we now know which node is used for the workload,
this check only needs to determine if the cockroach nodes are all still live.

Informs: cockroachdb#118214
Epic: None
herkolategan added a commit to herkolategan/cockroach that referenced this issue Dec 12, 2024
Previously, the post assertions would check for dead nodes by using
`roachprod.Monitor`. It checks if the data directory of a cockroach node is
non-trivial. This is not a reliable way to exclude empty nodes, or to check for
dead nodes. It also complicates the Monitor functionality unnecessarily. It also
no longer functions, since `roachprod.Monitor` only watches already live
processes.

This change removes this functionality as it no longer works. An issue (See:
cockroachdb#137329) has been created to address this should it still be a requirement.

Informs: cockroachdb#118214
Epic: None
herkolategan added a commit to herkolategan/cockroach that referenced this issue Dec 12, 2024
Previously, the post assertions would check for dead nodes by using
`roachprod.Monitor`. It checks if the data directory of a cockroach node is
non-trivial. This is not a reliable way to exclude empty nodes, or to check for
dead nodes. It also complicates the Monitor functionality unnecessarily. It also
no longer functions, since `roachprod.Monitor` only watches already live
processes.

This change removes this functionality as it no longer works. An issue (See:
cockroachdb#137329) has been created to address this should it still be a requirement.

Informs: cockroachdb#118214
Epic: None
herkolategan added a commit to herkolategan/cockroach that referenced this issue Dec 12, 2024
Previously, the post assertions would check for dead nodes by using
`roachprod.Monitor`. It checks if the data directory of a cockroach node is
non-trivial. This is not a reliable way to exclude empty nodes, or to check for
dead nodes. It also complicates the Monitor functionality unnecessarily. It also
no longer functions, since `roachprod.Monitor` only watches already live
processes.

This change removes this functionality as it no longer works. An issue (See:
cockroachdb#137329) has been created to address this should it still be a requirement.

Informs: cockroachdb#118214
Epic: None
herkolategan added a commit to herkolategan/cockroach that referenced this issue Dec 12, 2024
Previously, the post assertions would check for dead nodes by using
`roachprod.Monitor`. It checks if the data directory of a cockroach node is
non-trivial. This is not a reliable way to exclude empty nodes, or to check for
dead nodes. It also complicates the Monitor functionality unnecessarily. It also
no longer functions, since `roachprod.Monitor` only watches already live
processes.

This change removes this functionality as it no longer works. An issue (See:
cockroachdb#137329) has been created to address this should it still be a requirement.

Informs: cockroachdb#118214
Epic: None
herkolategan added a commit to herkolategan/cockroach that referenced this issue Dec 12, 2024
Previously, `cluster.HealthStatus` seemed to check all nodes, but rely on the
count of a nodes list passed in as a parameter.

This change fixes the function to use the passed nodes during the request phase
as well. It also adds a small retry mechanism during the 15 seconds it checks
the health of the nodes.

The issue was found when updating the health check to only look at
`c.CRDBNodes()` and exclude the workload node if one is part of the cluster.

Informs: cockroachdb#118214
Epic: None
craig bot pushed a commit that referenced this issue Dec 13, 2024
135831: roachtest: add groups to task manager r=DarrylWong a=herkolategan

Previously, a new API for managing tasks were introduced (see #133263). However, this did not address roachtests that want to manage groups. In an effort to replace `errgroup`, and `monitor.Go` for task management, this change introduces a group provider in the task manager.

A group adds the ability to wait on a subset of tasks to finish before proceeding. The task handler will still report returned errors or panics to the test framework.

Informs: #118214

Epic: None
Release note: None

Co-authored-by: Herko Lategan <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-testing Testing tools and infrastructure C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-testeng TestEng Team
Projects
No open projects
Status: Triage
Development

No branches or pull requests

2 participants