Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachprod: audit and fix incorrect usages of c.Parallel wrt error handling #104312

Closed
srosenberg opened this issue Jun 5, 2023 · 2 comments
Closed
Assignees
Labels
A-testing Testing tools and infrastructure C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-testeng TestEng Team

Comments

@srosenberg
Copy link
Member

srosenberg commented Jun 5, 2023

After implementing SSH retries and refactoring c.Parallel, the new implementation expects a user-defined function to return no error unless there was an internal (roachprod) error; instead, if the executed function resulted in an error, it's expected to be in RunResultDetails.Err.

The above (tacit) contract is violated by several existing commands, e.g., Wipe, Stop. When either of these functions errors out, the actual details of the error are not written to the logs. E.g., an attempt to reuse a cluster failed to execute either Wipe or Stop, but the actual details are unknown; all we get is the following message,

13:47:29 test_runner.go:605: [w4] Unable to reuse cluster: srosenberg-1685591519-18-n4cpu32 due to: COMMAND_PROBLEM: exit status 1. Will attempt to create a fresh one

The ask is as follows,

  • audit all the existing commands which use c.Parallel and fix contract violations
  • ensure that even internal errors are handled correctly by logging their (nested) context
  • update the documentation to make it clear that error should be nil unless there was an internal error
    • Note, the majority of the cases appear to be return res, nil; perhaps they could use a different API, one which has a single return of type *RunResultDetails?

[1] #90451

@srosenberg srosenberg added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-testing Testing tools and infrastructure T-testeng TestEng Team labels Jun 5, 2023
@blathers-crl
Copy link

blathers-crl bot commented Jun 5, 2023

cc @cockroachdb/test-eng

@srosenberg srosenberg changed the title roachtest: audit and fix incorrect usages of c.Parallel wrt error handling roachprod: audit and fix incorrect usages of c.Parallel wrt error handling Jun 5, 2023
srosenberg added a commit to srosenberg/cockroach that referenced this issue Jun 8, 2023
Support for disk snapshots has been recently added in [1].
This change adds minor fixes which were necessary to
run in a different environment which was previously tested.
Specifically, several commands were missing `--project`
which resulted in (silent) failures. Error propagation
from `c.Parallel` was signaling an internal error
instead of passing command's error. (Granted, the existing
mechanism for error propagation is to be revisited in [2].)

[1] cockroachdb#103757
[2] cockroachdb#104312

Epic: none

Release note: None
craig bot pushed a commit that referenced this issue Jun 9, 2023
104573: roachprod: minor fixes for snapshots r=srosenberg a=srosenberg

Support for disk snapshots has been recently added in [1]. This change adds minor fixes which were necessary to run in a different environment which was previously tested. Specifically, several commands were missing `--project` which resulted in (silent) failures. Error propagation from `c.Parallel` was signaling an internal error
instead of passing command's error. (Granted, the existing mechanism for error propagation is to be revisited in [2].)

[1] #103757
[2] #104312

Epic: none

Release note: None

Co-authored-by: Stan Rosenberg <[email protected]>
craig bot pushed a commit that referenced this issue Aug 17, 2023
104915: roachprod: roachtest: refactoring r=renatolabs,herkolategan a=smg260

This commit addresses some usability issues within roachprod/roachtest, and does some minor clean up.

Summary:

`roachprod`

- `c.Parallel` now accepts  `nodes Nodes` and `func(ctx context.Context, node Node)`. Previously, these was an `int n` representing the first `n` nodes to operate on, and `func(ctx context.Context, int n)`, where second arg would confusingly represented the index of the node being operated on, which was wholly dependent on a captured-by-reference slice at the call site.  This change makes it explicit to the caller and the function, which exact node is targeted. 

- `c.Parallel` now returns `[]*RunResultDetails, boolean, error`. The slice will contain results, successful and failed, for each node (or thus far, depending on fail-fast behaviour). The boolean is a signal to the caller that at least one of the results in the slice suffered from a command error, which can be inspected in the result itself. The `error` now more idiomatically represents an unrecoverable error encountered during `Parallel`; i.e. an un retryable error in `roachprod`. In the latter case, results will be `nil`.

- All calls to execute a remote command on a node are all funnelled to `c.runCmdOnSingleNode`, ensuring consistency, and reducing duplicated code. This function accepts a `RunCmdOptions`  struct to configure it's behaviour, including `stdin/out`

- `runCmdOnSingleNode` : in error cases, include stdout/err output (truncated after n bytes - effectively `tail`). This output will show in the `failure_n.log` and give convenient insight. Full output is always retained in the run log.

- `RunResultDetails.{Stdout,Stderr,CombinedOut}` are now all `string` to represent the most common use case, and a convenience `r.Output()` added to print out the contents of `Stderr,Stdout,CombinedOut`

- Remove complex legacy pipe processing logic from `roachtest` (`execCmd,execCmdEx`), previously used when `roachprod` was not a library of `roachtest`

- When showing the result of a command on each node on the CLI, explicitly display `<ok>` or `<err>` followed by output/error


`roachtest`

- skip post-test assertions on a test failure

- removal of `execCmd/execCmdRes` which were doing some complex pipe processing before calling roachprod.Run (not required since roachprod is a lib of roachtest)

Minor

- Clean up various redundant/unused variables
- Typos
- Command formatting

Informs: #104312 
Informs: #104316 (TBD)

Epic: none
Release note: none

107354: asim: add randomized range generation  r=kvoli a=wenyihu6

This patch enables random range configuration to be generated.

TestRandomized can now take another setting parameter rangeGen (default: uniform
rangeGenType, uniform keySpaceGenType, empty weightedRand).

These generators are part of the framework fields which persist across
iterations. The numbers produced by the generator shape the distribution across
iterations.

- rangeKeyGenType: determines range generator type across iterations (default:
uniformGenerator, min = 1, max = 1000)

- keySpaceGenType: determines key space generator type across iterations
(default: uniformGenerator, min = 1000, max = 200000)

- weightedRand: if non-empty, enables weighted randomization for range
distribution

This provides three modes for range generation:
1. Default: currently set to uniform distribution
2. Random: randomly generates range distribution across stores
3. Weighted Randomization: enables weighted randomization for range distribution
if and only if given weightedRand is non-empty

Part of: #106311

Release note: None

108906: changefeedccl: make tests work offline r=miretskiy a=jayshrivastava

This change updates two unit tests to pass when
running offline.

Fixes: #108867
Release note: None
Epic: None

Co-authored-by: Miral Gadani <[email protected]>
Co-authored-by: wenyihu6 <[email protected]>
Co-authored-by: Jayant Shrivastava <[email protected]>
@smg260
Copy link
Contributor

smg260 commented Sep 7, 2023

Fixed by #104915

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-testing Testing tools and infrastructure C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-testeng TestEng Team
Projects
No open projects
Status: Done
Development

No branches or pull requests

2 participants