Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: don't report SSH flake if test times out #91237

Closed
renatolabs opened this issue Nov 3, 2022 · 1 comment · Fixed by #96743
Closed

roachtest: don't report SSH flake if test times out #91237

renatolabs opened this issue Nov 3, 2022 · 1 comment · Fixed by #96743
Assignees
Labels
A-testing Testing tools and infrastructure C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-testeng TestEng Team

Comments

@renatolabs
Copy link
Contributor

renatolabs commented Nov 3, 2022

Describe the problem

Currently, if a roachtest times out, there's a good chance the failure is reported as an SSH-problem (after the improvements we introduced in #88492). When the test times out, roachtest will stop the cluster (including every process running in it), which may cause long-running processes running over SSH to return exit code 255. This leads to our test infrastructure reporting the failure as an SSH problem. We need to improve our handling of timed out tests so that they are properly reported as a test failure to the owning team in these cases.

We should also take the chance to improve the reporting of the "time out" error added to the test, as it uses the Timeout field of the test spec, which will be 0 for tests that don't specify a custom timeout.

Related failure: #90695 (comment)

Jira issue: CRDB-21167

@renatolabs renatolabs added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-testing Testing tools and infrastructure T-testeng TestEng Team labels Nov 3, 2022
@blathers-crl
Copy link

blathers-crl bot commented Nov 3, 2022

cc @cockroachdb/test-eng

@herkolategan herkolategan self-assigned this Feb 7, 2023
@exalate-issue-sync exalate-issue-sync bot assigned smg260 and unassigned herkolategan Feb 7, 2023
craig bot pushed a commit that referenced this issue Feb 9, 2023
96743: roachtest: report timeout failures accordingly r=renatolabs,herkolategan a=smg260

Previously, a timeout failure would be deferred until after artifacts were collected, which sometimes resulted in subsequent failures being attributed as the primary cause.

- Timeout failures are now recorded at actual timeout, with subsequent failures secondary. Context cancellation occurs at the end of test teardown
- `addFailure` accepts a depth parameter and no longer includes context cancellation, which is done separately.

Epic: none
Fixes: #91237
Release note: None

Co-authored-by: Miral Gadani <[email protected]>
@craig craig bot closed this as completed in 25045db Feb 9, 2023
smg260 pushed a commit to smg260/cockroach that referenced this issue Mar 7, 2023
Timeout failures are recorded at actual timeout, with
subsequent failures secondary.

`addFailure` accepts a depth parameter and no longer
includes context cancellation, which is done separately.

Epic: none
Fixes: cockroachdb#91237
Release note: None
smg260 pushed a commit to smg260/cockroach that referenced this issue Mar 7, 2023
Timeout failures are recorded at actual timeout, with
subsequent failures secondary.

`addFailure` accepts a depth parameter and no longer
includes context cancellation, which is done separately.

Epic: none
Fixes: cockroachdb#91237
Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-testing Testing tools and infrastructure C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-testeng TestEng Team
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants