Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: use structured errors #88556

Merged
merged 1 commit into from
Oct 4, 2022

Conversation

tbg
Copy link
Member

@tbg tbg commented Sep 23, 2022

roachtest predates cockroachdb/errors and as a result so far hasn't
capitalized on many improvements. Today, roachtest errors are often noisy.

This commit adopts cockroachdb/errors and reaps some of the immediate
rewards, while making related improvements that would also have been
possible without the adoption.

The main user-facing change is that error messages in the output are now
a lot more concise; these would previously - sometimes - include the
entire stack trace. Now they contain only the topmost stack record of
the innermost error:

(test_test.go:129).func3: first error
(test_test.go:130).func3: second error

The full error continues to be logged, but we write it to files in
the artifacts, where they can be inspected just in case. This now
happens unconditionally for all errors, whereas the old code only
logged the stacks if the error was reported in a certain way.

Internally, the reorganization has also considerably simplified
roachtest. Stack frame offset tracking has been removed, since
cockroachdb/errors already handles it. Custom rendering code
was similarly significantly trimmed down.

In making this change, I opted to always have roachtest create
loggers backed by real files. This was previously elided in tests,
but this would have caused extra conditionals. It's better to
standardize on the way in which roachtest is always used in
actual invocations.

Looking ahead, structured errors open a few avenues:

  • can assign the owner based on the type of failure. For example,
    roachtest: opt some tests into aggressive stats checks #87106 wants consistency check failures to always go to the KV
    team, regardless of which test's cluster was being checked.

    Failures during an IMPORT/RESTORE (common across all tests) could be
    routed to the Disaster Recovery team by default (assuming we provide a
    wrapper around these operations that all tests use and which does this
    wrapping).

  • Similarly, SSH failures can be special cased via a marking error
    and can be directed away from the owning team, which can't do
    anything about them anyway (roachtest: special-case SSH_PROBLEM as infra flakes #82398).

  • We can conceivably start grouping failure modes by "error signature".
    That is, errors which have a "comparable" chain of errors (e.g. same
    types, and within formatted errors the same format string). Issues
    could then be reused only for compatible error signatures.

Release note: None

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@tbg
Copy link
Member Author

tbg commented Sep 23, 2022

Still need to figure out little bits and pieces:

pkg/cmd/roachtest/test_impl.go:301:35: Errorf(): format argument is not a constant expression\n++Tip: use YourFuncf("descriptive prefix %s", ...) or list new formatting wrappers in pkg/testutils/lint/passes/fmtsafe/functions.go. (fmtsafe)

but I added it to functions.go already, not sure why it isn't working.

@tbg tbg marked this pull request as ready for review September 23, 2022 10:20
@tbg tbg requested a review from a team as a code owner September 23, 2022 10:20
@tbg tbg requested review from smg260 and removed request for a team September 23, 2022 10:20
@tbg tbg force-pushed the roachtest-err-structure branch from 37293cc to 41e734c Compare September 23, 2022 14:08
@smg260 smg260 force-pushed the roachtest-err-structure branch 4 times, most recently from 8ee1ccb to 2622def Compare October 3, 2022 19:59
@smg260
Copy link
Contributor

smg260 commented Oct 4, 2022

Some minor updates in the latest push.

  • use errors.Newf in t.Fatalf since the format was being ignored

  • add t.Error in the interface for consistency with Skip/f and Fatal/f

The ssh flakes pr has been rebased locally on this and will still work off detecting SSH_PROBLEM in the error output.

PR after that will look at removing roachprod/errors.

@tbg

pkg/cmd/roachtest/test_impl.go Outdated Show resolved Hide resolved
pkg/cmd/roachtest/test_impl.go Outdated Show resolved Hide resolved
`roachtest` predates `cockroachdb/errors` and as a result so far hasn't
capitalized many improvements. Today, roachtest errors are often noisy.

This commit adopts `cockroachdb/errors` and reaps some of the immediate
rewards, while making related improvements that would also have been
possible without the adoption.

The main user-facing change is that error messages in the output are now
a lot more concise; these would previously - sometimes - include the
entire stack trace. Now they contain only the topmost stack record of
the innermost error:

```
(test_test.go:129).func3: first error
(test_test.go:130).func3: second error
```

The full error continues to be logged, but we write it to files in
the artifacts, where they can be inspected just in case. This now
happens unconditionally for all errors, whereas the old code only
logged the stacks if the error was reported in a certain way.

Internally, the reorganization has also considerably simplified
roachtest. Stack frame offset tracking has been removed, since
`cockroachdb/errors` already handles it. Custom rendering code
was similarly significantly trimmed down.

In making this change, I opted to always have `roachtest` create
loggers backed by real files. This was previously elided in tests,
but this would have caused extra conditionals. It's better to
standardize on the way in which `roachtest` is always used in
actual invocations.

Looking ahead, structured errors open a few avenues:

- can assign the owner based on the type of failure. For example,
  cockroachdb#87106 wants consistency check failures to always go to the KV
  team, regardless of which test's cluster was being checked.

  Failures during an IMPORT/RESTORE (common across all tests) could be
  routed to the Disaster Recovery team by default (assuming we provide a
  wrapper around these operations that all tests use and which does this
  wrapping).
- Similarly, SSH failures can be special cased via a marking error
  and can be directed away from the owning team, which can't do
  anything about them anyway (cockroachdb#82398).
- We can conceivably start grouping failure modes by "error signature".
  That is, errors which have a "comparable" chain of errors (e.g. same
  types, and within formatted errors the same format string). Issues
  could then be reused only for compatible error signatures.

Release note: None
@tbg tbg force-pushed the roachtest-err-structure branch from 2622def to c9a4c73 Compare October 4, 2022 12:06
@tbg
Copy link
Member Author

tbg commented Oct 4, 2022

I'm going to merge this on green. Thanks Miral!

Copy link
Contributor

@smg260 smg260 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 4 files at r1, 4 of 5 files at r2, 1 of 1 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @smg260)

@tbg
Copy link
Member Author

tbg commented Oct 4, 2022

bors r=smg260

@craig
Copy link
Contributor

craig bot commented Oct 4, 2022

Build succeeded:

@craig craig bot merged commit 60928a6 into cockroachdb:master Oct 4, 2022
Copy link
Contributor

@renatolabs renatolabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments/questions, nothing that would have blocked the PR anyway. 👍

// NB: the first failure is not always the relevant one due to:
// https://github.com/cockroachdb/cockroach/issues/44436
errors []error
// If len(errors)>0, this indicates whether the test timed out
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't seem right? len(errors) > 0 indicates that the test timed out?

// cancel, if set, is called from the t.Fatal() family of functions when the
// test is being marked as failed (i.e. when the failed field above is also
// set). This is used to cancel the context passed to t.spec.Run(), so async
// test goroutines can be notified.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems misplaced, shouldn't this be near the cancel field?

@@ -255,162 +280,100 @@ func (t *testImpl) Skipf(format string, args ...interface{}) {
// ATTENTION: Since this calls panic(errTestFatal), it should only be called
// from a test's closure. The test runner itself should never call this.
func (t *testImpl) Fatal(args ...interface{}) {
t.markFailedInner("" /* format */, args...)
t.addFailure(argsToErr(1, args...))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should Fatal call Error to make it explicit that Fatal = Error + panic?

@@ -85,6 +91,14 @@ var requireConstFmt = map[string]bool{
"(*github.com/cockroachdb/cockroach/pkg/util/grpcutil.grpcLogger).Errorf": true,
"(*github.com/cockroachdb/cockroach/pkg/util/grpcutil.grpcLogger).Fatalf": true,

// Both of these signatures need to be included for the linter to not flag
// roachtest testImpl.Errorf since it is in the main package
"(*main.testImpl).Errorf": true,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is quite unfortunate, as it seems like it's the result of a bug somewhere (we shouldn't be having to define the *main part of this). I spent some time looking into this and I suspect it's an issue with nogo (potentially only in our fork).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nogo just runs checkPackage; something must be up with the passed packagePath (aka importpath). It seems correct in BUILD.bazel. I wonder what might be passing main instead?

importpath = "github.com/cockroachdb/cockroach/pkg/cmd/roachtest/"

Copy link
Member

@srosenberg srosenberg Oct 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Methinks this might be the culprit [1]. It hardcodes importpath to main from the archive action which calls into compilePkg, and then into checkPackage.

[1] bazel-contrib/rules_go#2135

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants