-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
roachtest: collect failure artifacts when restore fails #104868
roachtest: collect failure artifacts when restore fails #104868
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan and @renatolabs)
pkg/cmd/roachtest/cluster.go
line 1476 at r1 (raw file):
// FetchDebugZip downloads the debug zip from the cluster using `roachprod ssh`. // The logs will be placed `dest`, relative to the test's artifacts dir.
Nit: ...placed at dest
pkg/cmd/roachtest/tests/mixed_version_backup.go
line 2030 at r1 (raw file):
// restore failure; if we can't, log the error encountered and // move on. restoreErr, collectionErr := mvb.collectFailureArtifacts(ctx, l, err, len(restoreErrors)+1)
Last param. is errID
, but it seems hardly unique–what if it's invoked twice with the same number of restoreErrors
?
pkg/cmd/roachtest/tests/mixed_version_backup.go
line 2050 at r1 (raw file):
msgs := make([]string, 0, len(restoreErrors)) for j, err := range restoreErrors { msgs = append(msgs, fmt.Sprintf("%d: %s", j+1, err.Error()))
Curious why the switch to 1-based accounting?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan and @srosenberg)
pkg/cmd/roachtest/cluster.go
line 1476 at r1 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
Nit: ...placed at
dest
Fixed.
pkg/cmd/roachtest/tests/mixed_version_backup.go
line 2030 at r1 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
Last param. is
errID
, but it seems hardly unique–what if it's invoked twice with the same number ofrestoreErrors
?
It's unique "by construction" in this case? There's a 1:1 relationship between collectFailureArtifacts
and append(restoreErrors)
calls. The "ID" passed is such that it matches the index reported in the aggregated error message.
pkg/cmd/roachtest/tests/mixed_version_backup.go
line 2050 at r1 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
Curious why the switch to 1-based accounting?
Just personal preference, doesn't matter either way. Humans start counting at 1, and the error is for human consumption 🤷
This commit updates the `backup-restore/mixed-version` roachtest to collect artifacts (cockroach logs and a debug.zip) when a restore fails in the last step of the test (when all backups taken are restored). In that step, we do not immediately fail the test when a restore fails but instead attempt to restore every backup and return a list of errors found when the process is done. However, restoring cluster backups involves wiping the cluster which also deletes existing cockroach logs up to that point. This makes debugging a restore failure that happened prior to a cluster restore impossible. After this commit, a restore failure in that test will cause a `restore_failure_N` directory to be created in the artifacts directory, including the cockroach logs collected right after the failure, as well as a debug.zip created at the same time. This will make issues such as cockroachdb#104604 more actionable. Epic: none Release note: None
It was missing the `coordinator_id` for the node being waited on. Epic: none Release note: None
1b273c6
to
e736904
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan and @renatolabs)
pkg/cmd/roachtest/tests/mixed_version_backup.go
line 2030 at r1 (raw file):
Previously, renatolabs (Renato Costa) wrote…
It's unique "by construction" in this case? There's a 1:1 relationship between
collectFailureArtifacts
andappend(restoreErrors)
calls. The "ID" passed is such that it matches the index reported in the aggregated error message.
Got it, thanks. I was initially thinking this might get invoked multiple times during the lifecycle of the test, but that's not the case.
pkg/cmd/roachtest/tests/mixed_version_backup.go
line 2050 at r1 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Just personal preference, doesn't matter either way. Humans start counting at 1, and the error is for human consumption 🤷
Fair enough, although the humans who will be reading these logs are already forced to switch between 0-based and 1-based thinking :)
This reduces the chance of a test timeout. Over time, we still get good coverage of the different backup scenarios as they are picked randomly. Epic: none Release note: None
@srosenberg pushed another commit after taking another look at #104604 -- that failure is actually a test timeout, and it's hard to diagnose. Let me know if you have comments. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 4 files at r1, 2 of 2 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan and @renatolabs)
pkg/cmd/roachtest/tests/mixed_version_backup.go
line 2045 at r5 (raw file):
verify(clusterupgrade.MainVersion) // If the context was canceled (most likely due to a test timeout),
Timeouts are partial (unclean) runs, so it makes sense to bail out here.
TFTR! bors r=srosenberg |
Build succeeded: |
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool. error creating merge commit from 86c4583 to blathers/backport-release-23.1-104868: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict [] you may need to manually resolve merge conflicts with the backport tool. Backport to branch 23.1.x failed. See errors above. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
This commit updates the
backup-restore/mixed-version
roachtest tocollect artifacts (cockroach logs and a debug.zip) when a restore
fails in the last step of the test (when all backups taken are
restored). In that step, we do not immediately fail the test when a
restore fails but instead attempt to restore every backup and return a
list of errors found when the process is done. However, restoring
cluster backups involves wiping the cluster which also deletes
existing cockroach logs up to that point. This makes debugging a
restore failure that happened prior to a cluster restore impossible.
After this commit, a restore failure in that test will cause a
restore_failure_N
directory to be created in the artifactsdirectory, including the cockroach logs collected right after the
failure, as well as a debug.zip created at the same time.
This will make issues such as #104604 more actionable.
Epic: none
Release note: None