
build,tests: replace go test with a new test runner that splits output into different files #59045

Closed
knz opened this issue Jan 15, 2021 · 22 comments · Fixed by #69378
Labels
A-build-system A-testing Testing tools and infrastructure C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-dev-inf

Comments

@knz
Contributor

knz commented Jan 15, 2021

The design of go test is defective in numerous ways, and those defects are costing CRL a lot of time (and money).

By far the biggest two hurdles are:

  • a lack of traceability of errors back to individual tests
  • improper separation of test output into different files to enable separating causes when there are multiple errors

Generally, we need a new test engine that can run Go tests without the jank and misplaced NIH decisions baked into the go test runner itself: something that runs each package in a separate process, with a separate test controller and separate output files for parallel tests.

See comment below for details.

Epic CRDB-8306

@knz knz added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-testing Testing tools and infrastructure labels Jan 15, 2021
@knz
Contributor Author

knz commented Jan 15, 2021

cc @tbg to put this on your radar.

@knz
Contributor Author

knz commented Jan 15, 2021

FWIW, @RaduBerinde reports a failure where a test (TestTenantLogic) failed, there was no error message, there was no panic object, and although the server logs were collected there was no offender in the server logs.

It's becoming clear that go test quite often swallows test output entirely, so that it completely disappears from our CI artifacts. This is simply unacceptable.

@tbg
Member

tbg commented Jan 15, 2021

Could you be more specific?

a lack of traceability of errors back to individual tests

But go test -json (and, to a lesser degree, go test) does do that: test output is associated with the currently running test. (Don't get me started on t.Parallel(); we shouldn't be using that.)

improper separation of test output into different files to enable separating causes when there are multiple errors

go test doesn't do anything about files, probably by design, so I'm not sure what you're criticizing. We can "easily" decompose the json output into streams for different (pkg,test) pairs and route those to files via postprocessing. It gets tricky when exceptions occur outside of the main test goroutine, but assuming no test parallelism in the package (t.Parallel()), a panic is at least mapped to the right test, though I acknowledge that the json output should pin the failure on the test there, and I believe it does not.
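
For illustration, that postprocessing could be a small demultiplexer over the -json event stream. This is a sketch only, not existing CRDB tooling; the testlogs/ output layout is invented:

```go
// demux (illustrative sketch, not CRDB tooling): read `go test -json`
// events from stdin and append each event's Output to a per-(package,test)
// file, so output is traceable to the test that produced it.
// Package-level output (no Test field) goes to a "_pkg" file.
package main

import (
	"bufio"
	"encoding/json"
	"os"
	"path/filepath"
	"strings"
)

// event mirrors the fields `go tool test2json` emits.
type event struct {
	Action, Package, Test, Output string
}

func main() {
	files := map[string]*os.File{}
	sc := bufio.NewScanner(os.Stdin)
	sc.Buffer(make([]byte, 0, 1<<20), 1<<20) // tolerate long output lines
	for sc.Scan() {
		var ev event
		if err := json.Unmarshal(sc.Bytes(), &ev); err != nil || ev.Action != "output" {
			continue
		}
		name := ev.Test
		if name == "" {
			name = "_pkg"
		}
		path := filepath.Join("testlogs",
			strings.ReplaceAll(ev.Package, "/", "_"),
			strings.ReplaceAll(name, "/", "_")+".txt")
		f, ok := files[path]
		if !ok {
			_ = os.MkdirAll(filepath.Dir(path), 0755)
			f, _ = os.Create(path)
			files[path] = f
		}
		f.WriteString(ev.Output)
	}
	for _, f := range files {
		f.Close()
	}
}
```

Run as `go test -json ./pkg/... | demux`. Note it inherits the caveat above: events are attributed to whatever test the runner believes is current.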

My biggest concern about the go test approach is that of running multiple tests in the same process, because there is no separation between goroutines. This means a test that times out takes the whole package down with it, a panic on a goroutine does too, etc. Rolling our own won't fix that; we need to either move to a one-test-per-binary model (i.e. add preprocessing that lists the tests and invokes the binary one by one), or make it a small enough problem in practice (more willingly leak goroutines, etc).
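
The one-test-per-binary model could look roughly like the sketch below (the -test.* flags are the standard test binary flags; the runner itself is a made-up illustration):

```go
// perbinary (sketch): run each top-level test in a pre-built .test binary
// as its own process, so a timeout or stray panic only kills that one test.
package main

import (
	"fmt"
	"os"
	"os/exec"
	"regexp"
	"strings"
)

func main() {
	bin := os.Args[1] // a binary built with `go test -c`
	// -test.list prints the top-level tests matching the regexp.
	list, err := exec.Command(bin, "-test.list", ".").Output()
	if err != nil {
		panic(err)
	}
	for _, name := range strings.Fields(string(list)) {
		// Anchor the pattern so -test.run matches exactly this test.
		cmd := exec.Command(bin,
			"-test.run", "^"+regexp.QuoteMeta(name)+"$",
			"-test.v", "-test.timeout", "5m")
		cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
		if err := cmd.Run(); err != nil {
			fmt.Printf("FAIL %s: %v\n", name, err)
		}
	}
}
```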

On the whole, though, I think moving off go test in a way that's incompatible with using go test at all is a non-starter. Modern IDEs interop heavily with go test, and the loss of eng productivity from forcing something bespoke will be difficult to recoup.

Mind sharing the link to Radu's build? I am noticing that this is a logic test which uses t.Parallel() and would be a challenge for any testing harness. It's likely that an error is there, somewhere - will take a look when I have the link.

@knz
Contributor Author

knz commented Jan 15, 2021

@knz
Contributor Author

knz commented Jan 15, 2021

I already analyzed this:

  • the offending test is the one given in the message you pasted: TestTenantLogic
  • by looking at the server logs in the artifact directory logTestTenantLogic355061631 I don't see server errors, SQL internal errors or Fatal messages, so that tells me that if there was a panic or crash it did not occur in most of the existing places where there'd be a panic handler.
  • by looking at the artifacts, specifically by looking for the last occurrence of "TestTenantLogic" in raw.0.json.txt I find this:
    TestTenantLogic/3node-tenant/cascade/UpdateCascade_CompositeFKs_MatchSimple
  • There was less than 1 second between the moment this cascade test started and the moment the entire package failed. So in my book I'd attribute the failure to either that test or something in the server "under" it.

@knz
Contributor Author

knz commented Jan 15, 2021

It gets tricky when exceptions occur outside of the main test goroutine, but assuming no test parallelism in the package (t.Parallel())

But that's the thing:

  1. we have test parallelism
  2. we have goroutines other than that of the test, eg in TestServer, grpc, etc.

we want something better. I reckon we do want test parallelism; and this could be achieved by providing a different test harness for the purpose of CI.

(I do not propose to replace go test entirely)

@tbg
Member

tbg commented Jan 15, 2021

I took a look at that log and indeed the output from the logictestccl.test just ends:

{"Time":"2021-01-14T22:30:56.565833625-05:00","Action":"output","Package":"github.com/cockroachdb/cockroach/pkg/ccl/logictestccl","Test":"TestTenantLogic/3node-tenant/cascade/UpdateCascade_CompositeFKs_MatchSimple","Output":"=== RUN   TestTenantLogic/3node-tenant/cascade/UpdateCascade_CompositeFKs_MatchSimple\n"}
{"Time":"2021-01-14T22:30:57.86244934-05:00","Action":"output","Package":"github.com/cockroachdb/cockroach/pkg/ccl/logictestccl","Output":"FAIL\tgithub.com/cockroachdb/cockroach/pkg/ccl/logictestccl\t170.431s\n"}
{"Time":"2021-01-14T22:30:57.862507994-05:00","Action":"fail","Package":"github.com/cockroachdb/cockroach/pkg/ccl/logictestccl","Elapsed":170.431}

This would typically indicate that either something in the code called os.Exit() or that the process was killed by external means.

we want something better. I reckon we do want test parallelism; and this could be achieved by providing a different test harness

I'm still not sure what exactly you're proposing here (if anything). My take-away is that our CI is janky and that that is a problem, which I agree with. Would you mind filing this into dev-inf and referring to and closing this issue? The closest related issue on the dev-inf side that I found is https://github.com/cockroachdb/dev-inf/issues/53.

@knz
Contributor Author

knz commented Jan 15, 2021

I would like to suggest that you've already put up with go test for so long that you've become blind to how unacceptably it behaves.

To start with, I expect as much information about sub-processes run by go test as my command-line shell would give me. That is not available to me today!

That everyone seems to think the status quo is acceptable is flabbergasting to me.

Just the exit code would help distinguish between

  • test failure (make that code 100+)
  • our code called os.Exit (values likely under 100)
  • the kernel killed the process due to oom (exit code indicates killed by signal SIGKILL)
  • segfaults
  • illegal instructions and the like

And for several of these conditions a core file is dumped; I want the test harness to give me a link to the core file so I can investigate.

Additionally, given that multiple processes are running side-by-side, at the very least the thing should print out the PIDs of the test sub-processes, so that if two or more encounter problems it would be clear which processes the left-over files belong to (either core files or dmesg output).
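
A minimal sketch of a harness doing that classification (Unix-only, via syscall.WaitStatus; the >= 100 exit-code convention is the hypothetical one proposed above, not an existing scheme):

```go
// runclassify (sketch; exit-code convention is hypothetical): run a test
// binary and report its PID and how it exited, distinguishing test
// failures, os.Exit calls, and death by signal.
package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

func main() {
	cmd := exec.Command(os.Args[1], os.Args[2:]...)
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	if err := cmd.Start(); err != nil {
		fmt.Fprintln(os.Stderr, "failed to start:", err)
		os.Exit(1)
	}
	// Print the PID up front so left-over core files and dmesg entries
	// can be attributed to the right process.
	fmt.Printf("running %s as pid %d\n", os.Args[1], cmd.Process.Pid)
	_ = cmd.Wait()
	ws := cmd.ProcessState.Sys().(syscall.WaitStatus)
	switch {
	case ws.Signaled() && ws.Signal() == syscall.SIGKILL:
		fmt.Println("killed by SIGKILL (possibly the kernel OOM killer)")
	case ws.Signaled() && ws.Signal() == syscall.SIGSEGV:
		fmt.Println("segfault; look for a core file")
	case ws.Signaled():
		fmt.Printf("killed by signal %v\n", ws.Signal())
	case ws.ExitStatus() >= 100:
		fmt.Printf("test failure (harness convention: exit code %d >= 100)\n", ws.ExitStatus())
	case ws.ExitStatus() != 0:
		fmt.Printf("our code likely called os.Exit(%d)\n", ws.ExitStatus())
	default:
		fmt.Println("success")
	}
}
```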

@knz
Contributor Author

knz commented Jan 15, 2021

I also want the thing to stop mixing the test harness's reporting with the stdout/stderr output of the tests. The unix streams should be redirected to separate files, one per package. The stdout of the test harness should contain exclusively reports about test progress.

@knz
Contributor Author

knz commented Jan 15, 2021

So one concrete proposal is to have go test build the .test executables for all packages, but then use our own program to run these executables (i.e. stop relying on go test to do the running proper).
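
A sketch of that split, with the streams routed to per-package files as requested above (the file names and layout are invented for the example):

```go
// buildrun (sketch): have `go test` only build the .test binary, then run
// it ourselves with stdout/stderr captured in per-package files. The
// harness's own stdout carries only progress reports.
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
)

func main() {
	pkg := os.Args[1] // e.g. ./pkg/ccl/logictestccl
	bin := strings.ReplaceAll(strings.TrimPrefix(pkg, "./"), "/", "_") + ".test"

	// Build only: -c compiles the test binary without running it.
	build := exec.Command("go", "test", "-c", "-o", bin, pkg)
	build.Stdout, build.Stderr = os.Stdout, os.Stderr
	if err := build.Run(); err != nil {
		fmt.Fprintln(os.Stderr, "build failed:", err)
		os.Exit(1)
	}

	// Separate per-package files for the test process's unix streams.
	stdout, _ := os.Create(bin + ".stdout.log")
	stderr, _ := os.Create(bin + ".stderr.log")
	defer stdout.Close()
	defer stderr.Close()

	run := exec.Command("./"+bin, "-test.v")
	run.Stdout, run.Stderr = stdout, stderr
	if err := run.Run(); err != nil {
		fmt.Printf("FAIL %s (see %s)\n", pkg, stderr.Name())
		os.Exit(1)
	}
	fmt.Printf("PASS %s\n", pkg)
}
```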

@tbg
Member

tbg commented Jan 15, 2021

I would like to propose that you've already put up so much with go test that you've become blind to how unacceptable it behaves.

I don't think so. I've spent tons of time on this stuff and would rather not have that toil every week. I am trying to funnel this discussion to dev-inf because that is where it belongs, both by team boundary and because of the interplay with the bazelization project. I do in general agree that we're outgrowing go test ./..., and bazel naturally handles that, among many other benefits. The use of intra-package parallelism, my personal biggest headache, will unfortunately not be solved by this.

@knz
Contributor Author

knz commented Jan 15, 2021

I don't disagree that bazel could help here. We can probably leverage bazel as a replacement for go test to run these processes and retrieve the data we need properly.

Note that the issue at top asks to replace go test; I did not suggest implementing something new from scratch. If bazel does what we need better than go test, then all the better.

@tbg
Member

tbg commented Jan 15, 2021

I think bazel would still invoke go test, but individually at the package level. But - my understanding of bazel and how it applies to go is very limited. Someone like @rickystewart or @irfansharif would know much better.

@knz
Contributor Author

knz commented Jan 15, 2021

Having bazel run go test, even just once per package, would not solve any of our problems here. The system must build the test executable using go test but then run it directly and get all its results -- all the things that go test unhelpfully forgets to report.

@irfansharif
Contributor

Bazel will help here, yes. You're able to define individual test targets within a package (so you could do something like defining 16 test targets for all the tests within pkg/kv/kvserver) and have them then execute in parallel. I think it does what Rafa is describing, which is building out the test binaries using go test and then taking responsibility for running them all the "bazel" way. We haven't really tried it though (and only saw bits of that experimentation occur in #56127 and #56436). As always, UX hurdles and rough edges are abundant, but nothing insurmountable if we wanted to push on this.
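
As a sketch of what that could look like in a BUILD.bazel file (Starlark; target names and counts are illustrative, not what Gazelle generates today):

```python
load("@io_bazel_rules_go//go:def.bzl", "go_test")

# Option (a): one target that Bazel fans out into 16 shard processes,
# each with its own process, timeout, and test.log.
go_test(
    name = "kvserver_test",
    srcs = glob(["*_test.go"]),
    embed = [":kvserver"],
    shard_count = 16,
)

# Option (b): hand-split a troublesome test into its own target.
go_test(
    name = "kvserver_replica_test",
    srcs = ["replica_test.go"],
    embed = [":kvserver"],
)
```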

@tbg
Member

tbg commented Jan 15, 2021 via email

@rickystewart
Collaborator

Yes, testing via Bazel is smarter than just dumbly wrapping go test on a per-package basis. As suggested above, when testing via Bazel, all the sources for a particular test target are compiled into a binary that gets run without support from go test. Since they're all in separate processes, testing of individual targets wouldn't be able to incidentally tear down unrelated targets.

The build targets are currently constructed monolithically, one test target per Go package, since that's what Gazelle does by default, but there's nothing stopping us from overriding that behavior where appropriate. We could do this either on a case-by-case basis if certain tests are known to be particularly troublesome or more broadly across the entire codebase if that's not possible. If the second is desired, a Gazelle feature request might be necessary to make that happen -- if that directive already exists, I can't find it in the docs. (Obviously, splitting tests up to that level of granularity will come with a perf impact, so ideally we wouldn't have to drill down that much.)

As stated, there doesn't seem to be any action to take on this issue immediately. I think as specific issues with Bazel tests and/or their integration with TeamCity come up, we can address them accordingly. Completing the Bazel migration for tests will result in this issue being closed as a matter of course, because we wouldn't use go test any more.

@knz
Contributor Author

knz commented Jan 20, 2021

@rickystewart before closing this issue, please explain here how a bazel-powered run reports:

  1. the exit status of test processes
  2. any left-over artifact files produced in $TMPDIR
  3. the processes' stderr and stdout streams

(we need all 3 things as discrete artifacts. If a bazel-powered run is missing them, we need to keep this issue open until all 3 get properly collected.)

@rickystewart
Collaborator

#69378 should take care of the last bit here, so to review:

  • The stdout and stderr streams of test processes are captured on a per-test basis in the artifacts for Bazel builds. e.g. test logs for pkg/roachpb are found in bazel-testlogs/pkg/roachpb/roachpb_test/test.log. Next to test.log is a file called test.xml which contains structured test output data (this file is parsed by TeamCity to produce test results).
  • We've patched rules_go to write the exit code of the test process to test.log.
  • When #69378 (bazel,ci: propagate TEST_TMPDIR down to go tests and capture artifacts) is merged, we'll also be capturing left-over TMPDIR artifacts in the tmp subdirectory of the artifacts. Our existing "logs left over" messages should "just work" as they do today.
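
(For reference, the .bazelrc side of #69378 amounts to pointing Bazel's test tmpdir at the artifacts directory; a sketch of the change the patch describes:)

```
# .bazelrc (sketch of the change #69378 describes)
test --test_tmpdir=/artifacts/tmp
```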

As we make more Bazelfied build configurations in TC, all of the above should remain true. If any of the above regresses, it should be considered a bug.

@tbg
Member

tbg commented Aug 26, 2021

Note that ./pkg/roachpb isn't a "test", it's a package with hundreds of tests. Some of these might run in parallel (not in the case of roachpb, but it happens in other packages), and so their output might interleave. Honestly this is nearly impossible to avoid, since the Go test framework allows such behavior within individual subtests that aren't even listable. So I think what you're describing above is "fine".

I just looked at a recent bazel build to check what things look like now, and I do have a question - take a look here:

[screenshot: artifact directory listing showing per-shard test log directories]

If a test fails, would the output always prominently display the shard? I'm asking because it would be tough to go through up to 16 directories on any kvserver test failure that requires looking at the logs.

I also noticed we're actually not retaining the server logs:

=== RUN   TestReplicaEvaluationNotTxnMutation
    test_log_scope.go:73: test logs captured to: /tmp/logTestReplicaEvaluationNotTxnMutation596009217
    test_log_scope.go:74: use -show-logs to present logs inline
    replica_test.go:8814: -- test log scope end --

We're having the test servers log to /tmp, so this isn't in artifacts. But I think that's exactly what's being addressed in #69378.

@rickystewart
Collaborator

If a test fails, would the output always prominently display the shard? I'm asking because it would be tough to go through up to 16 directories on any kvserver test failure that requires looking at the logs.

Good question. I'll take a look. We can always munge the logs e.g. to concatenate everything into a single file.

We're having the test servers log to /tmp, so this isn't in artifacts. But I think that's exactly what's being addressed in #69378.

Yes (and again, after it's merged if some logs are still missing, please report it as a bug).

rickystewart added a commit to rickystewart/cockroach that referenced this issue Aug 26, 2021
This fulfills a long-standing request to capture left-over artifacts in
`TMPDIR` (see cockroachdb#59045 (comment)).

Bazel sets the `TEST_TMPDIR` environment variable for the temporary
directory and expects all tests to write temporary files to that
directory. In our Go tests, however, we consult the `TMPDIR` environment
variable to find that directory. So we pull in a custom change to
`rules_go` to copy `TEST_TMPDIR` to `TMPDIR`. Update `.bazelrc` to use
`/artifacts/tmp` as the `TEST_TMPDIR`.

Closes cockroachdb#59045.
Closes cockroachdb#69372.

Release justification: Non-production code change
Release note: None
craig bot pushed a commit that referenced this issue Aug 31, 2021
68983: backupccl: stop including historical databases in cluster backup Descs r=adityamaru a=pbardea

A previous commit attempted to fix a bug where cluster backup would not
include tables in dropped databases between incremental backups. That
fix aimed to find dropped databases and add them to the set of
descriptors. However, this causes issues when a database is recreated
with the same name.

Rather than adding the dropped DBs to the Descriptors field on the
backup manifest, this commit updates how DescriptorChanges are populated
for cluster backups with revision history. Now, the initial scan of
descriptors as of the start time will look for all descriptors in the
cluster rather than just those that were resolved as of the end time of
the backup.

Release note (bug fix): Fix a bug where cluster revision-history backups
may have included dropped descriptors in the "current" snapshot of
descriptors on the cluster.

Release justification: bug fix. Fix a bug where cluster revision-history backups
may have included dropped descriptors in the "current" snapshot of
descriptors on the cluster.

69378: bazel,ci: propagate `TEST_TMPDIR` down to go tests and capture artifacts r=jlinder a=rickystewart

This fulfills a long-standing request to capture left-over artifacts in
`TMPDIR` (see #59045 (comment)).

Bazel sets the `TEST_TMPDIR` environment variable for the temporary
directory and expects all tests to write temporary files to that
directory. In our Go tests, however, we consult the `TMPDIR` environment
variable to find that directory. So we pull in a custom change to
`rules_go` to copy `TEST_TMPDIR` to `TMPDIR`. Update `.bazelrc` to use
`/artifacts/tmp` as the `TEST_TMPDIR`.

Closes #59045.
Closes #69372.

Release justification: Non-production code change
Release note: None

69612: colflow: propagate concurrency info from vectorized to FlowBase r=yuzefovich a=yuzefovich

**colflow: propagate concurrency info from vectorized to FlowBase**

We've recently merged a change to introduce concurrency in the local
flows. Those new concurrent goroutines are started by the vectorized
parallel unordered synchronizer, and `FlowBase` isn't aware of them; as
a result, `FlowBase.Wait` currently might not wait for all goroutines to
exit (which is an optimization when there are no concurrent goroutines).
This commit fixes the problem by propagating the information from the
vectorized flow to the FlowBase.

Addresses: #69419.

Release note: None (no stable release with this bug)

Release justification: bug fix to new functionality.

**sql: loosen up the physical planning of parallel scans**

This commit makes it so that in case we try to plan parallel
TableReaders but encounter an error during planning, we fallback to
having a single TableReader.

Release note: None

Release justification: bug fix to new functionality.

Co-authored-by: Paul Bardea <[email protected]>
Co-authored-by: Ricky Stewart <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
@craig craig bot closed this as completed in 6e27bca Aug 31, 2021
@rickystewart
Collaborator

If a test fails, would the output always prominently display the shard? I'm asking because it would be tough to go through up to 16 directories on any kvserver test failure that requires looking at the logs.

I did some initial investigation here; if a test fails, Bazel does indeed tell you which log file to look in:

FAIL: //pkg/kv/kvserver:kvserver_test (shard 1 of 16) (see /home/roach/.cache/bazel/_bazel_roach/cc377fc379544923cc7508dd261e4a48/execroot/cockroach/bazel-out/k8-fastbuild/testlogs/pkg/kv/kvserver/kvserver_test/shard_1_of_16/test.log)
INFO: From Testing //pkg/kv/kvserver:kvserver_test (shard 1 of 16):
==================== Test output for //pkg/kv/kvserver:kvserver_test (shard 1 of 16):
initialized metamorphic constant "coldata-batch-size" with value 742
...
I210831 21:09:29.223257 1 (gostd) testmain.go:1472  [-] 1  Test //pkg/kv/kvserver:kvserver_test exited with error code 1
================================================================================
Target //pkg/kv/kvserver:kvserver_test up-to-date:
  _bazel/bin/pkg/kv/kvserver/kvserver_test_/kvserver_test
INFO: Elapsed time: 880.000s, Critical Path: 166.13s
INFO: 21 processes: 1 internal, 16 local, 4 processwrapper-sandbox.
INFO: Build completed, 1 test FAILED, 21 total actions
//pkg/kv/kvserver:kvserver_test                                          FAILED in 1 out of 16 in 142.2s
  Stats over 16 runs: max = 142.2s, min = 22.2s, avg = 53.4s, dev = 33.9s
  /home/roach/.cache/bazel/_bazel_roach/cc377fc379544923cc7508dd261e4a48/execroot/cockroach/bazel-out/k8-fastbuild/testlogs/pkg/kv/kvserver/kvserver_test/shard_1_of_16/test.log

INFO: Build completed, 1 test FAILED, 21 total actions

So in this case the failure is in shard 1. (This example is from a local run, but in CI the path to the log will start with /artifacts/tmp and it will match up to the artifacts published on TC.)

Not saying the experience can't be improved at all, but the immediate concern isn't an issue.
