roachtest: use structured errors #88556

tbg · 2022-09-23T09:49:44Z

roachtest predates cockroachdb/errors and as a result so far hasn't
capitalized on many improvements. Today, roachtest errors are often noisy.

This commit adopts cockroachdb/errors and reaps some of the immediate
rewards, while making related improvements that would also have been
possible without the adoption.

The main user-facing change is that error messages in the output are now
a lot more concise; these would previously - sometimes - include the
entire stack trace. Now they contain only the topmost stack record of
the innermost error:

(test_test.go:129).func3: first error
(test_test.go:130).func3: second error

The full error continues to be logged, but we write it to files in
the artifacts, where they can be inspected just in case. This now
happens unconditionally for all errors, whereas the old code only
logged the stacks if the error was reported in a certain way.

Internally, the reorganization has also considerably simplified
roachtest. Stack frame offset tracking has been removed, since
cockroachdb/errors already handles it. Custom rendering code
was similarly significantly trimmed down.

In making this change, I opted to always have roachtest create
loggers backed by real files. This was previously elided in tests,
but this would have caused extra conditionals. It's better to
standardize on the way in which roachtest is always used in
actual invocations.

Looking ahead, structured errors open a few avenues:

can assign the owner based on the type of failure. For example,
roachtest: opt some tests into aggressive stats checks #87106 wants consistency check failures to always go to the KV
team, regardless of which test's cluster was being checked.

Failures during an IMPORT/RESTORE (common across all tests) could be
routed to the Disaster Recovery team by default (assuming we provide a
wrapper around these operations that all tests use and which does this
wrapping).
Similarly, SSH failures can be special cased via a marking error
and can be directed away from the owning team, which can't do
anything about them anyway (roachtest: special-case SSH_PROBLEM as infra flakes #82398).
We can conceivably start grouping failure modes by "error signature".
That is, errors which have a "comparable" chain of errors (e.g. same
types, and within formatted errors the same format string). Issues
could then be reused only for compatible error signatures.

Release note: None

cockroach-teamcity · 2022-09-23T09:49:52Z

This change is

tbg · 2022-09-23T10:20:39Z

Still need to figure out little bits and pieces:

pkg/cmd/roachtest/test_impl.go:301:35: Errorf(): format argument is not a constant expression\n++Tip: use YourFuncf("descriptive prefix %s", ...) or list new formatting wrappers in pkg/testutils/lint/passes/fmtsafe/functions.go. (fmtsafe)

but I added it to functions.go already, not sure why it isn't working.

smg260 · 2022-10-04T00:59:17Z

Some minor updates in the latest push.

use errors.Newf in t.Fatalf since the format was being ignored
add t.Error in the interface for consistency with Skip/f and Fatal/f

The ssh flakes pr has been rebased locally on this and will still work off detecting SSH_PROBLEM in the error output.

PR after that will look at removing roachprod/errors.

@tbg

pkg/cmd/roachtest/test_impl.go

`roachtest` predates `cockroachdb/errors` and as a result so far hasn't capitalized many improvements. Today, roachtest errors are often noisy. This commit adopts `cockroachdb/errors` and reaps some of the immediate rewards, while making related improvements that would also have been possible without the adoption. The main user-facing change is that error messages in the output are now a lot more concise; these would previously - sometimes - include the entire stack trace. Now they contain only the topmost stack record of the innermost error: ``` (test_test.go:129).func3: first error (test_test.go:130).func3: second error ``` The full error continues to be logged, but we write it to files in the artifacts, where they can be inspected just in case. This now happens unconditionally for all errors, whereas the old code only logged the stacks if the error was reported in a certain way. Internally, the reorganization has also considerably simplified roachtest. Stack frame offset tracking has been removed, since `cockroachdb/errors` already handles it. Custom rendering code was similarly significantly trimmed down. In making this change, I opted to always have `roachtest` create loggers backed by real files. This was previously elided in tests, but this would have caused extra conditionals. It's better to standardize on the way in which `roachtest` is always used in actual invocations. Looking ahead, structured errors open a few avenues: - can assign the owner based on the type of failure. For example, cockroachdb#87106 wants consistency check failures to always go to the KV team, regardless of which test's cluster was being checked. Failures during an IMPORT/RESTORE (common across all tests) could be routed to the Disaster Recovery team by default (assuming we provide a wrapper around these operations that all tests use and which does this wrapping). - Similarly, SSH failures can be special cased via a marking error and can be directed away from the owning team, which can't do anything about them anyway (cockroachdb#82398). - We can conceivably start grouping failure modes by "error signature". That is, errors which have a "comparable" chain of errors (e.g. same types, and within formatted errors the same format string). Issues could then be reused only for compatible error signatures. Release note: None

tbg · 2022-10-04T12:06:29Z

I'm going to merge this on green. Thanks Miral!

smg260

Reviewed 1 of 4 files at r1, 4 of 5 files at r2, 1 of 1 files at r3, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @smg260)

tbg · 2022-10-04T13:17:16Z

bors r=smg260

craig · 2022-10-04T14:46:23Z

Build succeeded:

Bazel Essential CI (Cockroach)

renatolabs

A few comments/questions, nothing that would have blocked the PR anyway. 👍

renatolabs · 2022-10-04T17:39:49Z

pkg/cmd/roachtest/test_impl.go

+		// NB: the first failure is not always the relevant one due to:
+		// https://github.com/cockroachdb/cockroach/issues/44436
+		errors []error
+		// If len(errors)>0, this indicates whether the test timed out


This doesn't seem right? len(errors) > 0 indicates that the test timed out?

renatolabs · 2022-10-04T17:40:13Z

pkg/cmd/roachtest/test_impl.go

+		// cancel, if set, is called from the t.Fatal() family of functions when the
+		// test is being marked as failed (i.e. when the failed field above is also
+		// set). This is used to cancel the context passed to t.spec.Run(), so async
+		// test goroutines can be notified.


This seems misplaced, shouldn't this be near the cancel field?

renatolabs · 2022-10-04T17:42:40Z

pkg/cmd/roachtest/test_impl.go

@@ -255,162 +280,100 @@ func (t *testImpl) Skipf(format string, args ...interface{}) {
 // ATTENTION: Since this calls panic(errTestFatal), it should only be called
 // from a test's closure. The test runner itself should never call this.
 func (t *testImpl) Fatal(args ...interface{}) {
-	t.markFailedInner("" /* format */, args...)
+	t.addFailure(argsToErr(1, args...))


Should Fatal call Error to make it explicit that Fatal = Error + panic?

renatolabs · 2022-10-04T20:46:49Z

pkg/testutils/lint/passes/fmtsafe/functions.go

@@ -85,6 +91,14 @@ var requireConstFmt = map[string]bool{
 	"(*github.com/cockroachdb/cockroach/pkg/util/grpcutil.grpcLogger).Errorf":   true,
 	"(*github.com/cockroachdb/cockroach/pkg/util/grpcutil.grpcLogger).Fatalf":   true,

+	// Both of these signatures need to be included for the linter to not flag
+	// roachtest testImpl.Errorf since it is in the main package
+	"(*main.testImpl).Errorf": true,


This is quite unfortunate, as it seems like it's the result of a bug somewhere (we shouldn't be having to define the *main part of this). I spent some time looking into this and I suspect it's an issue with nogo (potentially only in our fork).

nogo just runs checkPackage; something must be up with the passed packagePath (aka importpath). It seems correct in BUILD.bazel. I wonder what might be passing main instead?

importpath = "github.com/cockroachdb/cockroach/pkg/cmd/roachtest/"

Methinks this might be the culprit [1]. It hardcodes importpath to main from the archive action which calls into compilePkg, and then into checkPackage.

[1] bazel-contrib/rules_go#2135

tbg marked this pull request as ready for review September 23, 2022 10:20

tbg requested a review from a team as a code owner September 23, 2022 10:20

tbg requested review from smg260 and removed request for a team September 23, 2022 10:20

tbg mentioned this pull request Sep 23, 2022

Roachtest redirect SSH flakes to test-eng #88492

Merged

tbg force-pushed the roachtest-err-structure branch from 37293cc to 41e734c Compare September 23, 2022 14:08

smg260 force-pushed the roachtest-err-structure branch 4 times, most recently from 8ee1ccb to 2622def Compare October 3, 2022 19:59

tbg commented Oct 4, 2022

View reviewed changes

pkg/cmd/roachtest/test_impl.go Outdated Show resolved Hide resolved

pkg/cmd/roachtest/test_impl.go Outdated Show resolved Hide resolved

tbg force-pushed the roachtest-err-structure branch from 2622def to c9a4c73 Compare October 4, 2022 12:06

smg260 approved these changes Oct 4, 2022

View reviewed changes

craig bot merged commit 60928a6 into cockroachdb:master Oct 4, 2022

renatolabs reviewed Oct 4, 2022

View reviewed changes

smg260 mentioned this pull request Dec 9, 2022

roachtest: use structured errors #93328

Merged

This was referenced Jan 17, 2023

roachtest: fix zipping of artifacts to include other zips #95374

Closed

release-22.1: roachtest structured errors and ssh/scp retries #95448

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

roachtest: use structured errors #88556

roachtest: use structured errors #88556

tbg commented Sep 23, 2022 •

edited

Loading

cockroach-teamcity commented Sep 23, 2022

tbg commented Sep 23, 2022 •

edited

Loading

smg260 commented Oct 4, 2022

tbg commented Oct 4, 2022

smg260 left a comment

tbg commented Oct 4, 2022

craig bot commented Oct 4, 2022

renatolabs left a comment

renatolabs Oct 4, 2022

renatolabs Oct 4, 2022

renatolabs Oct 4, 2022

renatolabs Oct 4, 2022

srosenberg Oct 6, 2022

srosenberg Oct 6, 2022 •

edited

Loading

roachtest: use structured errors #88556

roachtest: use structured errors #88556

Conversation

tbg commented Sep 23, 2022 • edited Loading

cockroach-teamcity commented Sep 23, 2022

tbg commented Sep 23, 2022 • edited Loading

smg260 commented Oct 4, 2022

tbg commented Oct 4, 2022

smg260 left a comment

Choose a reason for hiding this comment

tbg commented Oct 4, 2022

craig bot commented Oct 4, 2022

renatolabs left a comment

Choose a reason for hiding this comment

renatolabs Oct 4, 2022

Choose a reason for hiding this comment

renatolabs Oct 4, 2022

Choose a reason for hiding this comment

renatolabs Oct 4, 2022

Choose a reason for hiding this comment

renatolabs Oct 4, 2022

Choose a reason for hiding this comment

srosenberg Oct 6, 2022

Choose a reason for hiding this comment

srosenberg Oct 6, 2022 • edited Loading

Choose a reason for hiding this comment

tbg commented Sep 23, 2022 •

edited

Loading

tbg commented Sep 23, 2022 •

edited

Loading

srosenberg Oct 6, 2022 •

edited

Loading