
build,tests: replace go test with a new test runner that splits output into different files #59045

Closed
knz opened this issue Jan 15, 2021 · 22 comments · Fixed by #69378
Labels
A-build-system A-testing Testing tools and infrastructure C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-dev-inf

Comments

@knz
Contributor

knz commented Jan 15, 2021

The design of go test is defective in numerous ways, and those defects are costing CRL a lot of time (and money).

By far the biggest two hurdles are:

  • a lack of traceability of errors back to individual tests
  • improper separation of test output into different files to enable separating causes when there are multiple errors

Generally, we need a new test engine that can run Go tests without the jank and misplaced NIH decisions baked into the go test runner itself: something that runs each package in a separate process, with a separate test controller and separate output files for parallel tests.

See comment below for details.

Epic CRDB-8306

@knz knz added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-testing Testing tools and infrastructure labels Jan 15, 2021
@knz
Contributor Author

knz commented Jan 15, 2021

cc @tbg to put this on your radar.

@knz
Contributor Author

knz commented Jan 15, 2021

FWIW, @RaduBerinde reports a failure where a test (TestTenantLogic) failed, there was no error message, there was no panic object, and although the server logs were collected there was no offender in the server logs.

It's becoming clear that go test quite often swallows test output entirely, so that it completely disappears from our CI artifacts. This is simply unacceptable.

@tbg
Member

tbg commented Jan 15, 2021

Could you be more specific?

a lack of traceability of errors back to individual tests

But go test -json (and, to a lesser degree, go test) does do that: test output is associated with the currently running test. (Don't get me started on t.Parallel(); we shouldn't be using that.)

improper separation of test output into different files to enable separating causes when there are multiple errors

go test doesn't do anything about files, probably by design, so I'm not sure what you're criticizing. We can "easily" decompose the json output into streams for different (pkg,test) pairs and route those to files via postprocessing. It gets tricky when exceptions occur outside of the main test goroutine, but assuming no test parallelism in the package (t.Parallel()), a panic is at least mapped to the right test, though I acknowledge that the json output should pin the failure on the test there, and I believe it does not.
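
For illustration, that postprocessing could be a small demultiplexer over the -json event stream. This is a sketch only, not existing CRDB tooling; the testlogs/ output layout is invented:

```go
// demux (illustrative sketch, not CRDB tooling): read `go test -json`
// events from stdin and append each event's Output to a per-(package,test)
// file, so output is traceable to the test that produced it.
// Package-level output (no Test field) goes to a "_pkg" file.
package main

import (
	"bufio"
	"encoding/json"
	"os"
	"path/filepath"
	"strings"
)

// event mirrors the fields `go tool test2json` emits.
type event struct {
	Action, Package, Test, Output string
}

func main() {
	files := map[string]*os.File{}
	sc := bufio.NewScanner(os.Stdin)
	sc.Buffer(make([]byte, 0, 1<<20), 1<<20) // tolerate long output lines
	for sc.Scan() {
		var ev event
		if err := json.Unmarshal(sc.Bytes(), &ev); err != nil || ev.Action != "output" {
			continue
		}
		name := ev.Test
		if name == "" {
			name = "_pkg"
		}
		path := filepath.Join("testlogs",
			strings.ReplaceAll(ev.Package, "/", "_"),
			strings.ReplaceAll(name, "/", "_")+".txt")
		f, ok := files[path]
		if !ok {
			_ = os.MkdirAll(filepath.Dir(path), 0755)
			f, _ = os.Create(path)
			files[path] = f
		}
		f.WriteString(ev.Output)
	}
	for _, f := range files {
		f.Close()
	}
}
```

Run as `go test -json ./pkg/... | demux`. Note it inherits the caveat above: events are attributed to whatever test the runner believes is current.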

My biggest concern about the go test approach is that of running multiple tests in the same process, because there is no separation between goroutines. This means a test that times out takes the whole package down with it, a panic on a goroutine does too, etc. Rolling our own won't fix that; we need to either move to a one-test-per-binary model (i.e. add preprocessing that lists the tests and invokes the binary one by one), or make it a small enough problem in practice (more willingly leak goroutines, etc).
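
The one-test-per-binary model could look roughly like the sketch below (the -test.* flags are the standard test binary flags; the runner itself is a made-up illustration):

```go
// perbinary (sketch): run each top-level test in a pre-built .test binary
// as its own process, so a timeout or stray panic only kills that one test.
package main

import (
	"fmt"
	"os"
	"os/exec"
	"regexp"
	"strings"
)

func main() {
	bin := os.Args[1] // a binary built with `go test -c`
	// -test.list prints the top-level tests matching the regexp.
	list, err := exec.Command(bin, "-test.list", ".").Output()
	if err != nil {
		panic(err)
	}
	for _, name := range strings.Fields(string(list)) {
		// Anchor the pattern so -test.run matches exactly this test.
		cmd := exec.Command(bin,
			"-test.run", "^"+regexp.QuoteMeta(name)+"$",
			"-test.v", "-test.timeout", "5m")
		cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
		if err := cmd.Run(); err != nil {
			fmt.Printf("FAIL %s: %v\n", name, err)
		}
	}
}
```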

On the whole, though, I think moving off go test in a way that's incompatible with using go test at all is a non-starter. Modern IDEs interop heavily with go test, and the loss of eng productivity from forcing something bespoke will be difficult to recoup.

Mind sharing the link to Radu's build? I am noticing that this is a logic test which uses t.Parallel() and would be a challenge for any testing harness. It's likely that an error is there, somewhere - will take a look when I have the link.

@knz
Contributor Author

knz commented Jan 15, 2021

@knz
Contributor Author

knz commented Jan 15, 2021

I already analyzed this:

  • the offending test is the one given in the message you pasted: TestTenantLogic
  • by looking at the server logs in the artifact directory logTestTenantLogic355061631 I don't see server errors, SQL internal errors or Fatal messages, so that tells me that if there was a panic or crash it did not occur in most of the existing places where there'd be a panic handler.
  • by looking at the artifacts, specifically by looking for the last occurrence of "TestTenantLogic" in raw.0.json.txt I find this:
    TestTenantLogic/3node-tenant/cascade/UpdateCascade_CompositeFKs_MatchSimple
  • There was less than 1 second between the moment this cascade test started and the moment the entire package failed. So in my book I'd attribute the failure to either that test or something in the server "under" it.

@knz
Contributor Author

knz commented Jan 15, 2021

It gets tricky when exceptions occur outside of the main test goroutine, but assuming no test parallelism in the package (t.Parallel())

But that's the thing:

  1. we have test parallelism
  2. we have goroutines other than that of the test, eg in TestServer, grpc, etc.

we want something better. I reckon we do want test parallelism; and this could be achieved by providing a different test harness for the purpose of CI.

(I do not propose to replace go test entirely)

@tbg
Member

tbg commented Jan 15, 2021

I took a look at that log and indeed the output from the logictestccl.test just ends:

{"Time":"2021-01-14T22:30:56.565833625-05:00","Action":"output","Package":"github.com/cockroachdb/cockroach/pkg/ccl/logictestccl","Test":"TestTenantLogic/3node-tenant/cascade/UpdateCascade_CompositeFKs_MatchSimple","Output":"=== RUN   TestTenantLogic/3node-tenant/cascade/UpdateCascade_CompositeFKs_MatchSimple\n"}
{"Time":"2021-01-14T22:30:57.86244934-05:00","Action":"output","Package":"github.com/cockroachdb/cockroach/pkg/ccl/logictestccl","Output":"FAIL\tgithub.com/cockroachdb/cockroach/pkg/ccl/logictestccl\t170.431s\n"}
{"Time":"2021-01-14T22:30:57.862507994-05:00","Action":"fail","Package":"github.com/cockroachdb/cockroach/pkg/ccl/logictestccl","Elapsed":170.431}

This would typically indicate that either something in the code called os.Exit() or that the process was killed by external means.

we want something better. I reckon we do want test parallelism; and this could be achieved by providing a different test harness

I'm still not sure what exactly you're proposing here (if anything). My take-away is that our CI is janky and that that is a problem, which I agree with. Would you mind filing this into dev-inf and referring to and closing this issue? The closest related issue on the dev-inf side that I found is https://github.com/cockroachdb/dev-inf/issues/53.

@knz
Contributor Author

knz commented Jan 15, 2021

I would like to suggest that you've already put up with go test for so long that you've become blind to how unacceptably it behaves.

To start with, I expect as much information about sub-processes run by go test as my command-line shell would give me. That is not available to me today!

That everyone seems to think the status quo is acceptable is flabbergasting to me.

Just the exit code would help distinguish between

  • test failure (make that code 100+)
  • our code called os.Exit (values likely under 100)
  • the kernel killed the process due to oom (exit code indicates killed by signal SIGKILL)
  • segfaults
  • illegal instructions and the like

And for several of these conditions a core file is dumped; I want the test harness to give me a link to the core file so I can investigate.

Additionally, given that multiple processes are running side-by-side, at the very least the thing should print out the PIDs of the test sub-processes, so that if two or more encounter problems it would be clear which processes the left-over files belong to (either core files or dmesg output).
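
A minimal sketch of a harness doing that classification (Unix-only, via syscall.WaitStatus; the >= 100 exit-code convention is the hypothetical one proposed above, not an existing scheme):

```go
// runclassify (sketch; exit-code convention is hypothetical): run a test
// binary and report its PID and how it exited, distinguishing test
// failures, os.Exit calls, and death by signal.
package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

func main() {
	cmd := exec.Command(os.Args[1], os.Args[2:]...)
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	if err := cmd.Start(); err != nil {
		fmt.Fprintln(os.Stderr, "failed to start:", err)
		os.Exit(1)
	}
	// Print the PID up front so left-over core files and dmesg entries
	// can be attributed to the right process.
	fmt.Printf("running %s as pid %d\n", os.Args[1], cmd.Process.Pid)
	_ = cmd.Wait()
	ws := cmd.ProcessState.Sys().(syscall.WaitStatus)
	switch {
	case ws.Signaled() && ws.Signal() == syscall.SIGKILL:
		fmt.Println("killed by SIGKILL (possibly the kernel OOM killer)")
	case ws.Signaled() && ws.Signal() == syscall.SIGSEGV:
		fmt.Println("segfault; look for a core file")
	case ws.Signaled():
		fmt.Printf("killed by signal %v\n", ws.Signal())
	case ws.ExitStatus() >= 100:
		fmt.Printf("test failure (harness convention: exit code %d >= 100)\n", ws.ExitStatus())
	case ws.ExitStatus() != 0:
		fmt.Printf("our code likely called os.Exit(%d)\n", ws.ExitStatus())
	default:
		fmt.Println("success")
	}
}
```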

@knz
Contributor Author

knz commented Jan 15, 2021

I also want the thing to stop mixing the test harness's reporting with the stdout/stderr output of the tests. The unix streams should be redirected to separate files, one per package. The stdout of the test harness should contain exclusively reports about test progress.

@knz
Contributor Author

knz commented Jan 15, 2021

So one concrete proposal is to have go test build the .test executables for all packages, but then use our own program to run these executables (i.e. stop relying on go test to do the running proper).
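
A sketch of that split, with the streams routed to per-package files as requested above (the file names and layout are invented for the example):

```go
// buildrun (sketch): have `go test` only build the .test binary, then run
// it ourselves with stdout/stderr captured in per-package files. The
// harness's own stdout carries only progress reports.
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
)

func main() {
	pkg := os.Args[1] // e.g. ./pkg/ccl/logictestccl
	bin := strings.ReplaceAll(strings.TrimPrefix(pkg, "./"), "/", "_") + ".test"

	// Build only: -c compiles the test binary without running it.
	build := exec.Command("go", "test", "-c", "-o", bin, pkg)
	build.Stdout, build.Stderr = os.Stdout, os.Stderr
	if err := build.Run(); err != nil {
		fmt.Fprintln(os.Stderr, "build failed:", err)
		os.Exit(1)
	}

	// Separate per-package files for the test process's unix streams.
	stdout, _ := os.Create(bin + ".stdout.log")
	stderr, _ := os.Create(bin + ".stderr.log")
	defer stdout.Close()
	defer stderr.Close()

	run := exec.Command("./"+bin, "-test.v")
	run.Stdout, run.Stderr = stdout, stderr
	if err := run.Run(); err != nil {
		fmt.Printf("FAIL %s (see %s)\n", pkg, stderr.Name())
		os.Exit(1)
	}
	fmt.Printf("PASS %s\n", pkg)
}
```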

@tbg
Member

tbg commented Jan 15, 2021

I would like to propose that you've already put up so much with go test that you've become blind to how unacceptable it behaves.

I don't think so. I've spent tons of time on this stuff and would rather not have that toil every week. I am trying to funnel this discussion to dev-inf because that is where it belongs, both by team boundary and because of the interplay with the bazelization project. I do in general agree that we're outgrowing go test ./..., and bazel naturally handles that, among many other benefits. The use of intra-package parallelism, my personal biggest headache, will unfortunately not be solved by this.

@knz
Contributor Author

knz commented Jan 15, 2021

I don't disagree that bazel could help here. We can probably leverage bazel as a replacement for go test to run these processes and retrieve the data we need properly.

Note that the issue at top asks to replace go test; I did not suggest implementing something new from scratch. If bazel does what we need better than go test, then all the better.

@tbg
Member

tbg commented Jan 15, 2021

I think bazel would still invoke go test, but individually at the package level. But - my understanding of bazel and how it applies to go is very limited. Someone like @rickystewart or @irfansharif would know much better.

@knz
Contributor Author

knz commented Jan 15, 2021

Having bazel run go test, even just once per package, would not solve any of our problems here. The system must build the test executable using go test but then run it directly and get all its results -- all the things that go test unhelpfully forgets to report.

@irfansharif
Contributor

Bazel will help here, yes. You're able to define individual test targets within a package (so you could do something like defining 16 test targets for all the tests within pkg/kv/kvserver) and have them then execute in parallel. I think it does what Rafa is describing, which is building out the test binaries using go test and then taking responsibility for running them all the "bazel" way. We haven't really tried it though (and only saw bits of that experimentation occur in #56127 and #56436). As always, UX hurdles and rough edges are abundant, but nothing insurmountable if we wanted to push on this.
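
As a sketch of what that could look like in a BUILD.bazel file (Starlark; target names and counts are illustrative, not what Gazelle generates today):

```python
load("@io_bazel_rules_go//go:def.bzl", "go_test")

# Option (a): one target that Bazel fans out into 16 shard processes,
# each with its own process, timeout, and test.log.
go_test(
    name = "kvserver_test",
    srcs = glob(["*_test.go"]),
    embed = [":kvserver"],
    shard_count = 16,
)

# Option (b): hand-split a troublesome test into its own target.
go_test(
    name = "kvserver_replica_test",
    srcs = ["replica_test.go"],
    embed = [":kvserver"],
)
```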

@tbg
Member

tbg commented Jan 15, 2021 via email

@rickystewart
Collaborator

Yes, testing via Bazel is smarter than just dumbly wrapping go test on a per-package basis. As suggested above, when testing via Bazel, all the sources for a particular test target are compiled into a binary that gets run without support from go test. Since they're all in separate processes, testing of individual targets wouldn't be able to incidentally tear down unrelated targets.

The build targets are currently constructed monolithically, one test target per Go package, since that's what Gazelle does by default, but there's nothing stopping us from overriding that behavior where appropriate. We could do this either on a case-by-case basis if certain tests are known to be particularly troublesome or more broadly across the entire codebase if that's not possible. If the second is desired, a Gazelle feature request might be necessary to make that happen -- if that directive already exists, I can't find it in the docs. (Obviously, splitting tests up to that level of granularity will come with a perf impact, so ideally we wouldn't have to drill down that much.)

As stated, there doesn't seem to be any action to take on this issue immediately. I think as specific issues with Bazel tests and/or their integration with TeamCity come up, we can address them accordingly. Completing the Bazel migration for tests will result in this issue being closed as a matter of course, because we wouldn't use go test any more.

@knz
Contributor Author

knz commented Jan 20, 2021

@rickystewart before closing this issue, please explain here how a bazel-powered run reports:

  1. the exit status of test processes
  2. any left-over artifact files produced in $TMPDIR
  3. the processes' stderr and stdout streams

(we need all 3 things as discrete artifacts. If a bazel-powered run is missing them, we need to keep this issue open until all 3 get properly collected.)

@rickystewart
Collaborator

#69378 should take care of the last bit here, so to review:

  • The stdout and stderr streams of test processes are captured on a per-test basis in the artifacts for Bazel builds. e.g. test logs for pkg/roachpb are found in bazel-testlogs/pkg/roachpb/roachpb_test/test.log. Next to test.log is a file called test.xml which contains structured test output data (this file is parsed by TeamCity to produce test results).
  • We've patched rules_go to write the exit code of the test process to test.log.
  • When #69378 (bazel,ci: propagate TEST_TMPDIR down to go tests and capture artifacts) is merged, we'll also be capturing left-over TMPDIR artifacts in the tmp subdirectory of the artifacts. Our existing "logs left over" messages should "just work" as they do today.
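
(For reference, the .bazelrc side of #69378 amounts to pointing Bazel's test tmpdir at the artifacts directory; a sketch of the change the patch describes:)

```
# .bazelrc (sketch of the change #69378 describes)
test --test_tmpdir=/artifacts/tmp
```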

As we make more Bazelfied build configurations in TC, all of the above should remain true. If any of the above regresses, it should be considered a bug.

@tbg
Member

tbg commented Aug 26, 2021

Note that ./pkg/roachpb isn't a "test", it's a package with hundreds of tests. Some of these might run in parallel (not in the case of roachpb, but it happens in other packages), and so their output might interleave. Honestly this is nearly impossible to avoid, since the Go test framework allows such behavior within individual subtests that aren't even listable. So I think what you're describing above is "fine".

I just looked at a recent bazel build to check what things look like now, and I do have a question - take a look here:

[screenshot: artifact directory listing showing per-shard test log directories]

If a test fails, would the output always prominently display the shard? I'm asking because it would be tough to go through up to 16 directories on any kvserver test failure that requires looking at the logs.

I also noticed we're actually not retaining the server logs:

=== RUN   TestReplicaEvaluationNotTxnMutation
    test_log_scope.go:73: test logs captured to: /tmp/logTestReplicaEvaluationNotTxnMutation596009217
    test_log_scope.go:74: use -show-logs to present logs inline
    replica_test.go:8814: -- test log scope end --

We're having the test servers log to /tmp, so this isn't in artifacts. But I think that's exactly what's being addressed in #69378.

@rickystewart
Collaborator

If a test fails, would the output always prominently display the shard? I'm asking because it would be tough to go through up to 16 directories on any kvserver test failure that requires looking at the logs.

Good question. I'll take a look. We can always munge the logs e.g. to concatenate everything into a single file.

We're having the test servers log to /tmp, so this isn't in artifacts. But I think that's exactly what's being addressed in #69378.

Yes (and again, after it's merged if some logs are still missing, please report it as a bug).

rickystewart added a commit to rickystewart/cockroach that referenced this issue Aug 26, 2021
This fulfills a long-standing request to capture left-over artifacts in
`TMPDIR` (see cockroachdb#59045 (comment)).

Bazel sets the `TEST_TMPDIR` environment variable for the temporary
directory and expects all tests to write temporary files to that
directory. In our Go tests, however, we consult the `TMPDIR` environment
variable to find that directory. So we pull in a custom change to
`rules_go` to copy `TEST_TMPDIR` to `TMPDIR`. Update `.bazelrc` to use
`/artifacts/tmp` as the `TEST_TMPDIR`.

Closes cockroachdb#59045.
Closes cockroachdb#69372.

Release justification: Non-production code change
Release note: None
craig bot pushed a commit that referenced this issue Aug 31, 2021
68983: backupccl: stop including historical databases in cluster backup Descs r=adityamaru a=pbardea

A previous commit attempted to fix a bug where cluster backup would not
include tables in dropped databases between incremental backups. That
fix aimed to find dropped databases and add them to the set of
descriptors. However, this causes issues when a database is recreated
with the same name.

Rather than adding the dropped DBs to the Descriptors field on the
backup manifest, this commit updates how DescriptorChanges are populated
for cluster backups with revision history. Now, the initial scan of
descriptors as of the start time will look for all descriptors in the
cluster rather than just those that were resolved as of the end time of
the backup.

Release note (bug fix): Fix a bug where cluster revision-history backups
may have included dropped descriptors in the "current" snapshot of
descriptors on the cluster.

Release justification: bug fix. Fix a bug where cluster revision-history backups
may have included dropped descriptors in the "current" snapshot of
descriptors on the cluster.

69378: bazel,ci: propagate `TEST_TMPDIR` down to go tests and capture artifacts r=jlinder a=rickystewart

This fulfills a long-standing request to capture left-over artifacts in
`TMPDIR` (see #59045 (comment)).

Bazel sets the `TEST_TMPDIR` environment variable for the temporary
directory and expects all tests to write temporary files to that
directory. In our Go tests, however, we consult the `TMPDIR` environment
variable to find that directory. So we pull in a custom change to
`rules_go` to copy `TEST_TMPDIR` to `TMPDIR`. Update `.bazelrc` to use
`/artifacts/tmp` as the `TEST_TMPDIR`.

Closes #59045.
Closes #69372.

Release justification: Non-production code change
Release note: None

69612: colflow: propagate concurrency info from vectorized to FlowBase r=yuzefovich a=yuzefovich

**colflow: propagate concurrency info from vectorized to FlowBase**

We've recently merged a change to introduce concurrency in the local
flows. Those new concurrent goroutines are started by the vectorized
parallel unordered synchronizer, and `FlowBase` isn't aware of them; as
a result, `FlowBase.Wait` currently might not wait for all goroutines to
exit (which is an optimization when there are no concurrent goroutines).
This commit fixes the problem by propagating the information from the
vectorized flow to the FlowBase.

Addresses: #69419.

Release note: None (no stable release with this bug)

Release justification: bug fix to new functionality.

**sql: loosen up the physical planning of parallel scans**

This commit makes it so that in case we try to plan parallel
TableReaders but encounter an error during planning, we fallback to
having a single TableReader.

Release note: None

Release justification: bug fix to new functionality.

Co-authored-by: Paul Bardea <[email protected]>
Co-authored-by: Ricky Stewart <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
@craig craig bot closed this as completed in 6e27bca Aug 31, 2021
@rickystewart
Collaborator

If a test fails, would the output always prominently display the shard? I'm asking because it would be tough to go through up to 16 directories on any kvserver test failure that requires looking at the logs.

I did some initial investigation here; if a test fails, Bazel does indeed tell you which log file to look in:

FAIL: //pkg/kv/kvserver:kvserver_test (shard 1 of 16) (see /home/roach/.cache/bazel/_bazel_roach/cc377fc379544923cc7508dd261e4a48/execroot/cockroach/bazel-out/k8-fastbuild/testlogs/pkg/kv/kvserver/kvserver_test/shard_1_of_16/test.log)
INFO: From Testing //pkg/kv/kvserver:kvserver_test (shard 1 of 16):
==================== Test output for //pkg/kv/kvserver:kvserver_test (shard 1 of 16):
initialized metamorphic constant "coldata-batch-size" with value 742
...
I210831 21:09:29.223257 1 (gostd) testmain.go:1472  [-] 1  Test //pkg/kv/kvserver:kvserver_test exited with error code 1
================================================================================
Target //pkg/kv/kvserver:kvserver_test up-to-date:
  _bazel/bin/pkg/kv/kvserver/kvserver_test_/kvserver_test
INFO: Elapsed time: 880.000s, Critical Path: 166.13s
INFO: 21 processes: 1 internal, 16 local, 4 processwrapper-sandbox.
INFO: Build completed, 1 test FAILED, 21 total actions
//pkg/kv/kvserver:kvserver_test                                          FAILED in 1 out of 16 in 142.2s
  Stats over 16 runs: max = 142.2s, min = 22.2s, avg = 53.4s, dev = 33.9s
  /home/roach/.cache/bazel/_bazel_roach/cc377fc379544923cc7508dd261e4a48/execroot/cockroach/bazel-out/k8-fastbuild/testlogs/pkg/kv/kvserver/kvserver_test/shard_1_of_16/test.log

INFO: Build completed, 1 test FAILED, 21 total actions

So in this case the failure is in shard 1. (This example is from a local run, but in CI the path to the log will start with /artifacts/tmp and it will match up to the artifacts published on TC.)

Not saying the experience can't be improved at all, but the immediate concern isn't an issue.
