-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
roachtest: use structured errors #88556
Conversation
Still need to figure out little bits and pieces:
but I added it to |
37293cc
to
41e734c
Compare
8ee1ccb
to
2622def
Compare
Some minor updates in the latest push.
The ssh flakes pr has been rebased locally on this and will still work off detecting SSH_PROBLEM in the error output. PR after that will look at removing roachprod/errors. |
`roachtest` predates `cockroachdb/errors` and as a result so far hasn't capitalized many improvements. Today, roachtest errors are often noisy. This commit adopts `cockroachdb/errors` and reaps some of the immediate rewards, while making related improvements that would also have been possible without the adoption. The main user-facing change is that error messages in the output are now a lot more concise; these would previously - sometimes - include the entire stack trace. Now they contain only the topmost stack record of the innermost error: ``` (test_test.go:129).func3: first error (test_test.go:130).func3: second error ``` The full error continues to be logged, but we write it to files in the artifacts, where they can be inspected just in case. This now happens unconditionally for all errors, whereas the old code only logged the stacks if the error was reported in a certain way. Internally, the reorganization has also considerably simplified roachtest. Stack frame offset tracking has been removed, since `cockroachdb/errors` already handles it. Custom rendering code was similarly significantly trimmed down. In making this change, I opted to always have `roachtest` create loggers backed by real files. This was previously elided in tests, but this would have caused extra conditionals. It's better to standardize on the way in which `roachtest` is always used in actual invocations. Looking ahead, structured errors open a few avenues: - can assign the owner based on the type of failure. For example, cockroachdb#87106 wants consistency check failures to always go to the KV team, regardless of which test's cluster was being checked. Failures during an IMPORT/RESTORE (common across all tests) could be routed to the Disaster Recovery team by default (assuming we provide a wrapper around these operations that all tests use and which does this wrapping). - Similarly, SSH failures can be special cased via a marking error and can be directed away from the owning team, which can't do anything about them anyway (cockroachdb#82398). - We can conceivably start grouping failure modes by "error signature". That is, errors which have a "comparable" chain of errors (e.g. same types, and within formatted errors the same format string). Issues could then be reused only for compatible error signatures. Release note: None
2622def
to
c9a4c73
Compare
I'm going to merge this on green. Thanks Miral! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 4 files at r1, 4 of 5 files at r2, 1 of 1 files at r3, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @smg260)
bors r=smg260 |
Build succeeded: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few comments/questions, nothing that would have blocked the PR anyway. 👍
// NB: the first failure is not always the relevant one due to: | ||
// https://github.com/cockroachdb/cockroach/issues/44436 | ||
errors []error | ||
// If len(errors)>0, this indicates whether the test timed out |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't seem right? len(errors) > 0
indicates that the test timed out?
// cancel, if set, is called from the t.Fatal() family of functions when the | ||
// test is being marked as failed (i.e. when the failed field above is also | ||
// set). This is used to cancel the context passed to t.spec.Run(), so async | ||
// test goroutines can be notified. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems misplaced, shouldn't this be near the cancel
field?
@@ -255,162 +280,100 @@ func (t *testImpl) Skipf(format string, args ...interface{}) { | |||
// ATTENTION: Since this calls panic(errTestFatal), it should only be called | |||
// from a test's closure. The test runner itself should never call this. | |||
func (t *testImpl) Fatal(args ...interface{}) { | |||
t.markFailedInner("" /* format */, args...) | |||
t.addFailure(argsToErr(1, args...)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should Fatal
call Error
to make it explicit that Fatal
= Error
+ panic
?
@@ -85,6 +91,14 @@ var requireConstFmt = map[string]bool{ | |||
"(*github.com/cockroachdb/cockroach/pkg/util/grpcutil.grpcLogger).Errorf": true, | |||
"(*github.com/cockroachdb/cockroach/pkg/util/grpcutil.grpcLogger).Fatalf": true, | |||
|
|||
// Both of these signatures need to be included for the linter to not flag | |||
// roachtest testImpl.Errorf since it is in the main package | |||
"(*main.testImpl).Errorf": true, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is quite unfortunate, as it seems like it's the result of a bug somewhere (we shouldn't be having to define the *main
part of this). I spent some time looking into this and I suspect it's an issue with nogo
(potentially only in our fork).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nogo
just runs checkPackage
; something must be up with the passed packagePath
(aka importpath
). It seems correct in BUILD.bazel
. I wonder what might be passing main
instead?
importpath = "github.com/cockroachdb/cockroach/pkg/cmd/roachtest/"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Methinks this might be the culprit [1]. It hardcodes importpath
to main
from the archive
action which calls into compilePkg
, and then into checkPackage
.
roachtest
predatescockroachdb/errors
and as a result so far hasn'tcapitalized on many improvements. Today, roachtest errors are often noisy.
This commit adopts
cockroachdb/errors
and reaps some of the immediaterewards, while making related improvements that would also have been
possible without the adoption.
The main user-facing change is that error messages in the output are now
a lot more concise; these would previously - sometimes - include the
entire stack trace. Now they contain only the topmost stack record of
the innermost error:
The full error continues to be logged, but we write it to files in
the artifacts, where they can be inspected just in case. This now
happens unconditionally for all errors, whereas the old code only
logged the stacks if the error was reported in a certain way.
Internally, the reorganization has also considerably simplified
roachtest. Stack frame offset tracking has been removed, since
cockroachdb/errors
already handles it. Custom rendering codewas similarly significantly trimmed down.
In making this change, I opted to always have
roachtest
createloggers backed by real files. This was previously elided in tests,
but this would have caused extra conditionals. It's better to
standardize on the way in which
roachtest
is always used inactual invocations.
Looking ahead, structured errors open a few avenues:
can assign the owner based on the type of failure. For example,
roachtest: opt some tests into aggressive stats checks #87106 wants consistency check failures to always go to the KV
team, regardless of which test's cluster was being checked.
Failures during an IMPORT/RESTORE (common across all tests) could be
routed to the Disaster Recovery team by default (assuming we provide a
wrapper around these operations that all tests use and which does this
wrapping).
Similarly, SSH failures can be special cased via a marking error
and can be directed away from the owning team, which can't do
anything about them anyway (roachtest: special-case SSH_PROBLEM as infra flakes #82398).
We can conceivably start grouping failure modes by "error signature".
That is, errors which have a "comparable" chain of errors (e.g. same
types, and within formatted errors the same format string). Issues
could then be reused only for compatible error signatures.
Release note: None