roachtest: collect failure artifacts when restore fails #104910

Merged: 3 commits, Jun 14, 2023

renatolabs
Contributor

Backport 3/3 commits from #104868.

/cc @cockroachdb/release


This commit updates the `backup-restore/mixed-version` roachtest to
collect artifacts (cockroach logs and a `debug.zip`) when a restore
fails in the last step of the test (when all backups taken are
restored). In that step, we do not fail the test immediately when a
restore fails; instead, we attempt to restore every backup and return a
list of the errors found once the process is done. However, restoring
a cluster backup involves wiping the cluster, which also deletes the
cockroach logs written up to that point. This makes it impossible to
debug a restore failure that happened before a cluster restore.

After this commit, a restore failure in that test will cause a
`restore_failure_N` directory to be created in the artifacts
directory, including the cockroach logs collected right after the
failure, as well as a `debug.zip` created at the same time.

This will make issues such as #104604 more actionable.

Epic: none

Release note: None

It was missing the `coordinator_id` for the node being waited on.

Epic: none

Release note: None
This reduces the chance of a test timeout. Over time, we still get
good coverage of the different backup scenarios as they are picked
randomly.

Epic: none

Release note: None
@renatolabs renatolabs requested a review from a team as a code owner June 14, 2023 19:34
@renatolabs renatolabs requested review from srosenberg and smg260 and removed request for a team June 14, 2023 19:34
@renatolabs renatolabs merged commit d7157c2 into cockroachdb:release-23.1 Jun 14, 2023
@renatolabs renatolabs deleted the backport23.1-104868 branch June 14, 2023 21:27