roachtest: skip tests that fail during "setup" #33377
Out of all the badness you've enumerated, I'm only impressed by "tests which start after the timeout will fail during cluster setup". How exactly do the future tests fail? I'll make sure to have a better story here. Other than this, I don't quite see a need for any common "setup" infrastructure. Individual tests can, of course, deal with their individual hardships and retry / mark themselves skipped / do whatever they see fit for their particular situation.
Yeah, there's definitely a real problem here. In an ideal world, we'd add retries/backoff/rescheduling as necessary to get everything to consistently pass, but I agree that it's not worth the time it would take to root out every single one of these. If we had something like you propose, the first thing I'd do is go mark everything up to and including the fixture restore in all the cdc tests as setup. If I could, I'd also mark any errors coming back from

The downside to this is that these things I'd skip have historically been under-tested. For example, for quite a while, cdc was the only thing regularly testing a nightly tpcc-1000. This has gotten better recently, but it's still something we should be careful of.

As for a concrete proposal, I'd suggest something parallel to
Can't the CDC tests do whatever they want themselves by using their own helpers or
You seem to have an opinion. Or, rather, what's the difference y'all are seeking to produce between "skipped" and "failed"? Or, conversely, between "setup-failed" and "failed"?
How would having tests use that be less burdensome than having them... not call t.Fatal() inside the respective block?
The purpose of some sort of @danhhz
For one, we'd have to change the name, but it could also be used for teardown. Also, dunno that I have any examples for setup in current roachtests, but I've found the StartTimer/StopTimer/b.N pattern much more flexible for benchmarks than, for example, what Rust does with just a callback function. However, I think either is workable.
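For illustration, here is a minimal, self-contained sketch of the marker style being described, using made-up `StartSetup`/`EndSetup` names rather than any real roachtest API:

```go
package main

import "fmt"

// test is a stand-in for a roachtest harness type; the real one differs.
type test struct {
	name    string
	inSetup bool
	status  string
}

func (t *test) StartSetup() { t.inSetup = true }
func (t *test) EndSetup()   { t.inSetup = false }

// Fatal classifies the failure by phase: failures during setup are
// recorded as "setup-failed", everything else as a real test failure.
func (t *test) Fatal(err error) {
	if t.inSetup {
		t.status = "setup-failed"
	} else {
		t.status = "failed"
	}
	fmt.Printf("%s: %s: %v\n", t.name, t.status, err)
}

func main() {
	t := &test{name: "cdc/tpcc-1000"}

	t.StartSetup()
	if err := fmt.Errorf("apt-get: connection timed out"); err != nil {
		t.Fatal(err) // recorded as setup-failed, not failed
	}
	t.EndSetup()
	// ...test body; a t.Fatal here would count as a real failure...
}
```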
That's a harmful anti-pattern!
Right. But it seems to me that exactly what behavior we want is still nebulous. We do want some issue to be filed for "setup failure", right? But perhaps not one per test, but one per failure type. So perhaps what's needed is more control over issue creation, not a blunt setup/non-setup false dichotomy.
Except for cluster creation,
Well, no. I don't want an issue filed for "setup failure". I do want notification via Slack. I don't think an issue is the right medium for this "expected" failure.
I think we should let these tests fail nominally, though we may want to hold off on posting an issue. I'm open to introducing a separate error state for this (

For example, data imports on GCE can be baked into the image; creating an image from an existing disk seems straightforward. We could bake this kind of detection into

We're obviously short on resources to work on this sort of thing (but I think roles in that department are being listed soon).
Since #34687 there's a
Copying this comment over from #41227, which I mistakenly filed as a separate issue this morning: The following issues are caused by an I/O timeout during bulk I/O, which prevents the tests from proceeding beyond their initialization stage:
Although the cause of these failures is serious and needs its own investigation, it is immaterial to the main feature being tested by the various roachtests. To facilitate triage and enable different team members to focus on specific issues (as opposed to the same investigation of a common cause by many team members), it would be useful to ensure that roachtest files separate GitHub issues depending on which phase of a test fails.

So my take on this is to have a marker (possibly a Go string) of which "phase" a test is in and add this to the string used to file the GitHub issue.
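A rough sketch of that phase-marker idea, assuming a hypothetical issue-title helper (the real roachtest issue poster works differently):

```go
package main

import "fmt"

// test is a stand-in for a roachtest harness type; phase is the
// proposed marker string ("setup", "import", "run", ...).
type test struct {
	name  string
	phase string
}

func (t *test) SetPhase(p string) { t.phase = p }

// issueTitle builds the string a hypothetical issue filer would use,
// so that failures in different phases land in different GitHub issues.
func issueTitle(t *test, err error) string {
	return fmt.Sprintf("roachtest: %s failed [phase=%s]: %v", t.name, t.phase, err)
}

func main() {
	t := &test{name: "tpccbench/nodes=3"}
	t.SetPhase("import")
	fmt.Println(issueTitle(t, fmt.Errorf("i/o timeout during bulk io")))
}
```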
There are two things that need custom handling, in my opinion.
What I think I would do is make it so that these operations file issues separately (not even related to the particular test that was running), by isolating these operations into dedicated helpers that deal with issue creation, or perhaps in some other way.
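A sketch of what such a helper could look like, with invented names (`installDocker`, `reportInfraIssue`) standing in for whatever the real helpers would be:

```go
package main

import (
	"errors"
	"fmt"
)

// reportInfraIssue stands in for a shared issue filer that attributes
// the failure to the operation itself rather than to the calling test.
func reportInfraIssue(op string, err error) {
	fmt.Printf("filing infra issue for %q: %v\n", op, err)
}

// installDocker is a hypothetical shared setup helper. On failure it
// reports under its own name and returns an error the caller can use
// to skip, rather than fail, its test.
func installDocker(simulateFlake bool) error {
	if simulateFlake {
		err := errors.New("apt-get: connection timed out")
		reportInfraIssue("install-docker", err)
		return fmt.Errorf("infra flake: %w", err)
	}
	return nil
}

func main() {
	if err := installDocker(true); err != nil {
		fmt.Println("skipping test:", err)
		return
	}
	// ...test body...
}
```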
I think addressing this, or something similar, would go a long way toward addressing the very annoying amount of integration test noise we have at CRL. I'll give a very recent example: #59562 (see the 200+ linked issues). When we make changes to our testing infrastructure (like roachprod, workload, or CI images), or there are real infra issues (apt-get doesn't work), or restoring from a fixture fails (maybe we can't do much about this one with this issue), the way we find out about them is that the underlying single issue ends up tripping up virtually every integration test we have, and our test filer files an issue for each one. This is just too much noise to wade through, and real failures very likely end up getting lost because of it. If we could massage our test filer to file "differently" for failures that occurred after "setup", it could improve the signal-to-noise ratio tremendously.
We periodically see nightly roachtest failures which are due to a failure during setup and not a failure of the functionality being tested. For example, if the nightly roachtests run too long, then tests which start after the timeout will fail during cluster setup. We've seen `backup2TB` fail because the store dumps cannot be downloaded. We've seen `cdc` tests fail because of `docker` flakiness, and we've seen other tests fail because of `apt-get` flakiness.

In addition to trying to improve the reliability of the setup steps, we should avoid considering a failure during setup as a test failure. I think this can be achieved by marking tests as "skipped" if they fail during setup. Perhaps "skipped" is the wrong term and tests should be marked as "setup-failed". One possibility for achieving this is to mark a test as skipped if the failure occurs before `cluster.Start` is invoked. That could be a bit too subtle. Another possibility is to add a `cluster.Setup()` function that holds all of the steps performed during setup. If a test doesn't use `cluster.Setup` then any failure would be considered a real test failure. I think that default is fine.

@danhhz do you have any thoughts about this?
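For illustration, a minimal sketch of the callback-style `cluster.Setup()` idea from this proposal, using stand-in `cluster`/`test` types rather than the real roachtest harness:

```go
package main

import "fmt"

type test struct {
	name   string
	status string
}

type cluster struct{}

// Setup runs the provided setup steps and classifies any error as a
// setup failure ("setup-failed") rather than a test failure.
func (c *cluster) Setup(t *test, fn func() error) bool {
	if err := fn(); err != nil {
		t.status = "setup-failed"
		fmt.Printf("%s: setup failed: %v\n", t.name, err)
		return false
	}
	return true
}

func main() {
	c := &cluster{}
	t := &test{name: "backup2TB"}
	// Fixture downloads, apt-get installs, etc. go inside Setup; a
	// failure here does not count against the feature under test.
	if !c.Setup(t, func() error {
		return fmt.Errorf("store dump download timed out")
	}) {
		return
	}
	// ...the actual test body would run here...
}
```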
Epic CRDB-10428
Jira issue: CRDB-4691