roachprod: audit and fix incorrect usages of c.Parallel wrt error handling #104312

srosenberg · 2023-06-05T03:48:24Z

After implementing SSH retries and refactoring c.Parallel, the new implementation expects a user-defined function to return no error unless there was an internal (roachprod) error; instead, if the executed function resulted in an error, it's expected to be in RunResultDetails.Err.

The above (tacit) contract is violated by several existing commands, e.g., Wipe, Stop. When either of these functions errors out, the actual details of the error are not written to the logs. E.g., an attempt to reuse a cluster failed to execute either Wipe or Stop, but the actual details are unknown; all we get is the following message,

13:47:29 test_runner.go:605: [w4] Unable to reuse cluster: srosenberg-1685591519-18-n4cpu32 due to: COMMAND_PROBLEM: exit status 1. Will attempt to create a fresh one

The ask is as follows,

audit all the existing commands which use c.Parallel and fix contract violations
ensure that even internal errors are handled correctly by logging their (nested) context
update the documentation to make it clear that error should be nil unless there was an internal error
- Note, the majority of the cases appear to be return res, nil; perhaps they could use a different API, one which has a single return of type *RunResultDetails?

[1] #90451

The text was updated successfully, but these errors were encountered:

blathers-crl · 2023-06-05T03:48:26Z

cc @cockroachdb/test-eng

Support for disk snapshots has been recently added in [1]. This change adds minor fixes which were necessary to run in a different environment which was previously tested. Specifically, several commands were missing `--project` which resulted in (silent) failures. Error propagation from `c.Parallel` was signaling an internal error instead of passing command's error. (Granted, the existing mechanism for error propagation is to be revisited in [2].) [1] cockroachdb#103757 [2] cockroachdb#104312 Epic: none Release note: None

104573: roachprod: minor fixes for snapshots r=srosenberg a=srosenberg Support for disk snapshots has been recently added in [1]. This change adds minor fixes which were necessary to run in a different environment which was previously tested. Specifically, several commands were missing `--project` which resulted in (silent) failures. Error propagation from `c.Parallel` was signaling an internal error instead of passing command's error. (Granted, the existing mechanism for error propagation is to be revisited in [2].) [1] #103757 [2] #104312 Epic: none Release note: None Co-authored-by: Stan Rosenberg <[email protected]>

104915: roachprod: roachtest: refactoring r=renatolabs,herkolategan a=smg260 This commit addresses some usability issues within roachprod/roachtest, and does some minor clean up. Summary: `roachprod` - `c.Parallel` now accepts `nodes Nodes` and `func(ctx context.Context, node Node)`. Previously, these was an `int n` representing the first `n` nodes to operate on, and `func(ctx context.Context, int n)`, where second arg would confusingly represented the index of the node being operated on, which was wholly dependent on a captured-by-reference slice at the call site. This change makes it explicit to the caller and the function, which exact node is targeted. - `c.Parallel` now returns `[]*RunResultDetails, boolean, error`. The slice will contain results, successful and failed, for each node (or thus far, depending on fail-fast behaviour). The boolean is a signal to the caller that at least one of the results in the slice suffered from a command error, which can be inspected in the result itself. The `error` now more idiomatically represents an unrecoverable error encountered during `Parallel`; i.e. an un retryable error in `roachprod`. In the latter case, results will be `nil`. - All calls to execute a remote command on a node are all funnelled to `c.runCmdOnSingleNode`, ensuring consistency, and reducing duplicated code. This function accepts a `RunCmdOptions` struct to configure it's behaviour, including `stdin/out` - `runCmdOnSingleNode` : in error cases, include stdout/err output (truncated after n bytes - effectively `tail`). This output will show in the `failure_n.log` and give convenient insight. Full output is always retained in the run log. - `RunResultDetails.{Stdout,Stderr,CombinedOut}` are now all `string` to represent the most common use case, and a convenience `r.Output()` added to print out the contents of `Stderr,Stdout,CombinedOut` - Remove complex legacy pipe processing logic from `roachtest` (`execCmd,execCmdEx`), previously used when `roachprod` was not a library of `roachtest` - When showing the result of a command on each node on the CLI, explicitly display `<ok>` or `<err>` followed by output/error `roachtest` - skip post-test assertions on a test failure - removal of `execCmd/execCmdRes` which were doing some complex pipe processing before calling roachprod.Run (not required since roachprod is a lib of roachtest) Minor - Clean up various redundant/unused variables - Typos - Command formatting Informs: #104312 Informs: #104316 (TBD) Epic: none Release note: none 107354: asim: add randomized range generation r=kvoli a=wenyihu6 This patch enables random range configuration to be generated. TestRandomized can now take another setting parameter rangeGen (default: uniform rangeGenType, uniform keySpaceGenType, empty weightedRand). These generators are part of the framework fields which persist across iterations. The numbers produced by the generator shape the distribution across iterations. - rangeKeyGenType: determines range generator type across iterations (default: uniformGenerator, min = 1, max = 1000) - keySpaceGenType: determines key space generator type across iterations (default: uniformGenerator, min = 1000, max = 200000) - weightedRand: if non-empty, enables weighted randomization for range distribution This provides three modes for range generation: 1. Default: currently set to uniform distribution 2. Random: randomly generates range distribution across stores 3. Weighted Randomization: enables weighted randomization for range distribution if and only if given weightedRand is non-empty Part of: #106311 Release note: None 108906: changefeedccl: make tests work offline r=miretskiy a=jayshrivastava This change updates two unit tests to pass when running offline. Fixes: #108867 Release note: None Epic: None Co-authored-by: Miral Gadani <[email protected]> Co-authored-by: wenyihu6 <[email protected]> Co-authored-by: Jayant Shrivastava <[email protected]>

smg260 · 2023-09-07T13:18:13Z

Fixed by #104915

srosenberg changed the title ~~roachtest: audit and fix incorrect usages of c.Parallel wrt error handling~~ roachprod: audit and fix incorrect usages of c.Parallel wrt error handling Jun 5, 2023

srosenberg mentioned this issue Jun 8, 2023

roachprod: minor fixes for snapshots #104573

Merged

exalate-issue-sync bot assigned smg260 Jun 13, 2023

smg260 mentioned this issue Jun 14, 2023

roachprod: roachtest: refactoring #104915

Merged

smg260 mentioned this issue Aug 21, 2023

release-23.1: roachprod: roachtest: refactoring #109144

Merged

smg260 closed this as completed Sep 7, 2023

github-project-automation bot added this to Test Engineering Aug 28, 2024

github-project-automation bot moved this to Done in Test Engineering Aug 28, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

roachprod: audit and fix incorrect usages of c.Parallel wrt error handling #104312

roachprod: audit and fix incorrect usages of c.Parallel wrt error handling #104312

srosenberg commented Jun 5, 2023 •

edited

Loading

blathers-crl bot commented Jun 5, 2023

smg260 commented Sep 7, 2023

roachprod: audit and fix incorrect usages of c.Parallel wrt error handling #104312

roachprod: audit and fix incorrect usages of c.Parallel wrt error handling #104312

Comments

srosenberg commented Jun 5, 2023 • edited Loading

blathers-crl bot commented Jun 5, 2023

smg260 commented Sep 7, 2023

srosenberg commented Jun 5, 2023 •

edited

Loading