roachtest: skip tests that fail during "setup" #33377
Out of all the badness you've enumerated, I'm only impressed by "tests which start after the timeout will fail during cluster setup". How exactly do the future tests fail? I'll make sure to have a better story here. Other than this, I don't quite see a need for any common "setup" infrastructure. Individual tests can, of course, deal with their individual hardships and retry / mark themselves skipped / do whatever they see fit for their particular situation.
Yeah, there's definitely a real problem here. In an ideal world, we'd add retries/backoff/rescheduling as necessary to get everything to consistently pass, but I agree that it's not worth the time it would take to root out every single one of these. If we had something like you propose, the first thing I'd do is go mark everything up to and including the fixture restore in all the cdc tests as setup. If I could, I'd also mark any errors coming back from

The downside to this is that these things I'd skip have historically been under-tested. For example, for quite a while, cdc was the only thing regularly testing a nightly tpcc-1000. This has gotten better recently, but it's still something we should be careful of.

As for a concrete proposal, I'd suggest something parallel to
Can't the CDC tests do whatever they want themselves by using their own helpers or
You seem to have an opinion. Or, rather, what's the difference y'all are seeking to produce between "skipped" and "failed"? Or, conversely, between "setup-failed" and "failed"?
How would having tests use that be less burdensome than having them... not call t.Fatal() inside the respective block?
The purpose of some sort of @danhhz
For one, we'd have to change the name, but it could also be used for teardown. Also, dunno that I have any examples for setup in current roachtests, but I've found the StartTimer/StopTimer/b.N pattern much more flexible for benchmarks than, for example, what Rust does with just a callback function. However, I think either is workable.
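For illustration, here is a minimal, self-contained sketch of the marker style being described, using made-up `StartSetup`/`EndSetup` names rather than any real roachtest API:

```go
package main

import "fmt"

// test is a stand-in for a roachtest harness type; the real one differs.
type test struct {
	name    string
	inSetup bool
	status  string
}

func (t *test) StartSetup() { t.inSetup = true }
func (t *test) EndSetup()   { t.inSetup = false }

// Fatal classifies the failure by phase: failures during setup are
// recorded as "setup-failed", everything else as a real test failure.
func (t *test) Fatal(err error) {
	if t.inSetup {
		t.status = "setup-failed"
	} else {
		t.status = "failed"
	}
	fmt.Printf("%s: %s: %v\n", t.name, t.status, err)
}

func main() {
	t := &test{name: "cdc/tpcc-1000"}

	t.StartSetup()
	if err := fmt.Errorf("apt-get: connection timed out"); err != nil {
		t.Fatal(err) // recorded as setup-failed, not failed
	}
	t.EndSetup()
	// ...test body; a t.Fatal here would count as a real failure...
}
```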
That's a harmful anti-pattern!
Right. But it seems to me that exactly what behavior we want is still nebulous. We do want some issue to be filed for "setup failure", right? But perhaps not one per test, but one per failure type. So perhaps what's needed is more control over issue creation, not a blunt setup/non-setup false dichotomy.
Except for cluster creation,
Well, no. I don't want an issue filed for "setup failure". I do want notification via Slack. I don't think an issue is the right medium for this "expected" failure.
I think we should let these tests fail nominally, though we may want to hold off on posting an issue. I'm open to introducing a separate error state for this (

For example, data imports on GCE can be baked into the image; creating an image from an existing disk seems straightforward. We could bake this kind of detection into

We're obviously short on resources to work on this sort of thing (but I think roles in that department are being listed soon).
Since #34687 there's a
Copying this comment over from #41227, which I mistakenly filed as a separate issue this morning: The following issues are caused by an I/O timeout during bulk I/O, which prevents the tests from proceeding beyond their initialization stage:
Although the cause of these failures is serious and needs its own investigation, it is immaterial to the main feature being tested by the various roachtests. To facilitate triage and enable different team members to focus on specific issues (as opposed to the same investigation of a common cause by many team members), it would be useful to ensure that roachtest files separate GitHub issues depending on which phase of a test fails.

So my take on this is to have a marker (possibly a Go string) of which "phase" a test is in and add this to the string used to file the GitHub issue.
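A rough sketch of that phase-marker idea, assuming a hypothetical issue-title helper (the real roachtest issue poster works differently):

```go
package main

import "fmt"

// test is a stand-in for a roachtest harness type; phase is the
// proposed marker string ("setup", "import", "run", ...).
type test struct {
	name  string
	phase string
}

func (t *test) SetPhase(p string) { t.phase = p }

// issueTitle builds the string a hypothetical issue filer would use,
// so that failures in different phases land in different GitHub issues.
func issueTitle(t *test, err error) string {
	return fmt.Sprintf("roachtest: %s failed [phase=%s]: %v", t.name, t.phase, err)
}

func main() {
	t := &test{name: "tpccbench/nodes=3"}
	t.SetPhase("import")
	fmt.Println(issueTitle(t, fmt.Errorf("i/o timeout during bulk io")))
}
```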
There are two things that need custom handling, in my opinion.
What I think I would do is make it so that these operations file issues separately (not even related to the particular test that was running), by isolating these operations into dedicated helpers that deal with issue creation, or perhaps in some other way.
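A sketch of what such a helper could look like, with invented names (`installDocker`, `reportInfraIssue`) standing in for whatever the real helpers would be:

```go
package main

import (
	"errors"
	"fmt"
)

// reportInfraIssue stands in for a shared issue filer that attributes
// the failure to the operation itself rather than to the calling test.
func reportInfraIssue(op string, err error) {
	fmt.Printf("filing infra issue for %q: %v\n", op, err)
}

// installDocker is a hypothetical shared setup helper. On failure it
// reports under its own name and returns an error the caller can use
// to skip, rather than fail, its test.
func installDocker(simulateFlake bool) error {
	if simulateFlake {
		err := errors.New("apt-get: connection timed out")
		reportInfraIssue("install-docker", err)
		return fmt.Errorf("infra flake: %w", err)
	}
	return nil
}

func main() {
	if err := installDocker(true); err != nil {
		fmt.Println("skipping test:", err)
		return
	}
	// ...test body...
}
```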
I think addressing this, or something similar, would go a long way toward addressing the very annoying amount of integration test noise we have at CRL. I'll give a very recent example: #59562 (see the 200+ linked issues). When we make changes to our testing infrastructure (like roachprod, workload, or CI images), or there are real infra issues (apt-get doesn't work), or restoring from a fixture fails (maybe we can't do much about this one with this issue), the way we find out about them is that the underlying single issue ends up tripping up virtually every integration test we have, and our test filer files an issue for each one. This is just too much noise to wade through, and real failures very likely end up getting lost because of it. If we could massage our test filer to file "differently" for failures that occurred after "setup", it could improve the signal-to-noise ratio tremendously.
We periodically see nightly roachtest failures which are due to a failure during setup and not a failure of the functionality being tested. For example, if the nightly roachtests run too long, then tests which start after the timeout will fail during cluster setup. We've seen `backup2TB` fail because the store dumps cannot be downloaded. We've seen `cdc` tests fail because of `docker` flakiness, and we've seen other tests fail because of `apt-get` flakiness.

In addition to trying to improve the reliability of the setup steps, we should avoid considering a failure during setup as a test failure. I think this can be achieved by marking tests as "skipped" if they fail during setup. Perhaps "skipped" is the wrong term and tests should be marked as "setup-failed". One possibility for achieving this is to mark a test as skipped if the failure occurs before `cluster.Start` is invoked. That could be a bit too subtle. Another possibility is to add a `cluster.Setup()` function that holds all of the steps performed during setup. If a test doesn't use `cluster.Setup` then any failure would be considered a real test failure. I think that default is fine.

@danhhz do you have any thoughts about this?
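For illustration, a minimal sketch of the callback-style `cluster.Setup()` idea from this proposal, using stand-in `cluster`/`test` types rather than the real roachtest harness:

```go
package main

import "fmt"

type test struct {
	name   string
	status string
}

type cluster struct{}

// Setup runs the provided setup steps and classifies any error as a
// setup failure ("setup-failed") rather than a test failure.
func (c *cluster) Setup(t *test, fn func() error) bool {
	if err := fn(); err != nil {
		t.status = "setup-failed"
		fmt.Printf("%s: setup failed: %v\n", t.name, err)
		return false
	}
	return true
}

func main() {
	c := &cluster{}
	t := &test{name: "backup2TB"}
	// Fixture downloads, apt-get installs, etc. go inside Setup; a
	// failure here does not count against the feature under test.
	if !c.Setup(t, func() error {
		return fmt.Errorf("store dump download timed out")
	}) {
		return
	}
	// ...the actual test body would run here...
}
```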
Epic CRDB-10428
Jira issue: CRDB-4691