When a test finishes, roachtest attempts to run some post-test validation checks to make sure the database is not in an inconsistent state. This involves running SQL statements through a live node, if there is one.
A long-standing comment in that logic notes that such calls may sometimes hang:
// We avoid trying to do this when t.Failed() (and in particular when there
// are dead nodes) because for reasons @tbg does not understand this gets
// stuck occasionally, which really ruins the roachtest run. The method
// below already uses a ctx timeout and SQL statement_timeout, but it does
// not seem to be enough.
//
// TODO(testinfra): figure out why this can still get stuck despite the
// above.
While we haven't seen that happen frequently, it did happen in a recent roachtest failure [1]. The consequences of the check hanging are pretty bad, since it means the failure won't have any artifacts, making debugging impossible.
A couple of things worth noting:
The comment above says "We avoid trying to do this when t.Failed()"; however, we currently attempt to perform the validations regardless of test status.
The logs in [1] indicate that we didn't even attempt to run any validation checks, meaning ConnectToLiveNode was stuck for 1h. This might mean that db.ExecContext is not respecting context cancelation. The fix might be similar to what was done in "roachtest: added connection timeout to loq recovery ops" (#98851): setting a connection timeout in these calls.
renatolabs added the C-bug (Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.) and T-testeng (TestEng Team) labels on Mar 27, 2023
The comment quoted above comes from cockroach/pkg/cmd/roachtest/test_runner.go, lines 1090 to 1097 in b4f3ae3.
[1] #99586
Jira issue: CRDB-26023