Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: post-validations can sometimes hang indefinitely #99684

Closed
renatolabs opened this issue Mar 27, 2023 · 1 comment · Fixed by #101254
Closed

roachtest: post-validations can sometimes hang indefinitely #99684

renatolabs opened this issue Mar 27, 2023 · 1 comment · Fixed by #101254
Assignees
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-testeng TestEng Team

Comments

@renatolabs
Copy link
Contributor

renatolabs commented Mar 27, 2023

When a test finishes, roachtest attempts to run some post validation checks to make sure the database is not in an inconsistent state. This involves running SQL statements through some live node, if any.

There's a comment in that logic from a long time ago that alludes to the fact that such calls may sometimes hang:

// We avoid trying to do this when t.Failed() (and in particular when there
// are dead nodes) because for reasons @tbg does not understand this gets
// stuck occasionally, which really ruins the roachtest run. The method
// below already uses a ctx timeout and SQL statement_timeout, but it does
// not seem to be enough.
//
// TODO(testinfra): figure out why this can still get stuck despite the
// above.

While we haven't seen that happen frequently, it did happen in a recent roachtest failure [1]. The consequences of the check hanging are pretty bad, since it means the failure won't have any artifacts, making debugging impossible.

A couple of things worth noting:

  • the comment above says "We avoid trying to do this when t.Failed()"; however, we are currently attempting to perform the validations regardless of test status.

  • The logs in [1] indicate that we didn't even attempt to run any validation checks, meaning ConnectToLiveNode was stuck for 1h. This might mean that the db.ExecContext is not respecting context cancelation. The fix might be similar to what was done in roachtest: added connection timeout to loq recovery ops #98851, and we need to set a connection timeout in these calls.

[1] #99586

Jira issue: CRDB-26023

@renatolabs renatolabs added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-testeng TestEng Team labels Mar 27, 2023
@blathers-crl
Copy link

blathers-crl bot commented Mar 27, 2023

cc @cockroachdb/test-eng

smg260 pushed a commit to smg260/cockroach that referenced this issue Apr 20, 2023
Epic: none
Fixes: cockroachdb#99684

Release note: None
craig bot pushed a commit that referenced this issue Apr 21, 2023
101254: roachtest: roachprod: Remove check for Pebble CURRENT file in favour of marker.* file r=renatolabs a=smg260

roachtest: roachprod: Remove check for Pebble CURRENT file in favour
of marker.* files

roachtest: `set_timeout` for `CheckInvalidDescriptors`. Without the `set_timeout`, a cancellation due to timeout is not respected and can cause the step to hang.

Epic: none
Fixes: #95170
Fixes: #99684

Release note: None

Combined with #101222 TC run [here](https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_Nightlies_RoachtestNightlyGceBazel/9551260?showRootCauses=true&expandBuildChangesSection=true)

Co-authored-by: Miral Gadani <[email protected]>
@craig craig bot closed this as completed in 831e830 Apr 21, 2023
renatolabs pushed a commit to renatolabs/cockroach that referenced this issue Jun 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-testeng TestEng Team
Projects
None yet
2 participants