Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: collect failure artifacts when restore fails #104868

Merged

Conversation

renatolabs
Copy link
Contributor

This commit updates the backup-restore/mixed-version roachtest to
collect artifacts (cockroach logs and a debug.zip) when a restore
fails in the last step of the test (when all backups taken are
restored). In that step, we do not immediately fail the test when a
restore fails but instead attempt to restore every backup and return a
list of errors found when the process is done. However, restoring
cluster backups involves wiping the cluster which also deletes
existing cockroach logs up to that point. This makes debugging a
restore failure that happened prior to a cluster restore impossible.

After this commit, a restore failure in that test will cause a
restore_failure_N directory to be created in the artifacts
directory, including the cockroach logs collected right after the
failure, as well as a debug.zip created at the same time.

This will make issues such as #104604 more actionable.

Epic: none

Release note: None

@renatolabs renatolabs requested a review from a team as a code owner June 14, 2023 12:37
@renatolabs renatolabs requested review from herkolategan and srosenberg and removed request for a team June 14, 2023 12:37
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@renatolabs renatolabs added the backport-23.1.x Flags PRs that need to be backported to 23.1 label Jun 14, 2023
Copy link
Member

@srosenberg srosenberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan and @renatolabs)


pkg/cmd/roachtest/cluster.go line 1476 at r1 (raw file):

// FetchDebugZip downloads the debug zip from the cluster using `roachprod ssh`.
// The logs will be placed `dest`, relative to the test's artifacts dir.

Nit: ...placed at dest


pkg/cmd/roachtest/tests/mixed_version_backup.go line 2030 at r1 (raw file):

				// restore failure; if we can't, log the error encountered and
				// move on.
				restoreErr, collectionErr := mvb.collectFailureArtifacts(ctx, l, err, len(restoreErrors)+1)

Last param. is errID, but it seems hardly unique–what if it's invoked twice with the same number of restoreErrors?


pkg/cmd/roachtest/tests/mixed_version_backup.go line 2050 at r1 (raw file):

		msgs := make([]string, 0, len(restoreErrors))
		for j, err := range restoreErrors {
			msgs = append(msgs, fmt.Sprintf("%d: %s", j+1, err.Error()))

Curious why the switch to 1-based accounting?

Copy link
Contributor Author

@renatolabs renatolabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan and @srosenberg)


pkg/cmd/roachtest/cluster.go line 1476 at r1 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

Nit: ...placed at dest

Fixed.


pkg/cmd/roachtest/tests/mixed_version_backup.go line 2030 at r1 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

Last param. is errID, but it seems hardly unique–what if it's invoked twice with the same number of restoreErrors?

It's unique "by construction" in this case? There's a 1:1 relationship between collectFailureArtifacts and append(restoreErrors) calls. The "ID" passed is such that it matches the index reported in the aggregated error message.


pkg/cmd/roachtest/tests/mixed_version_backup.go line 2050 at r1 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

Curious why the switch to 1-based accounting?

Just personal preference, doesn't matter either way. Humans start counting at 1, and the error is for human consumption 🤷

This commit updates the `backup-restore/mixed-version` roachtest to
collect artifacts (cockroach logs and a debug.zip) when a restore
fails in the last step of the test (when all backups taken are
restored). In that step, we do not immediately fail the test when a
restore fails but instead attempt to restore every backup and return a
list of errors found when the process is done. However, restoring
cluster backups involves wiping the cluster which also deletes
existing cockroach logs up to that point. This makes debugging a
restore failure that happened prior to a cluster restore impossible.

After this commit, a restore failure in that test will cause a
`restore_failure_N` directory to be created in the artifacts
directory, including the cockroach logs collected right after the
failure, as well as a debug.zip created at the same time.

This will make issues such as cockroachdb#104604 more actionable.

Epic: none

Release note: None
It was missing the `coordinator_id` for the node being waited on.

Epic: none

Release note: None
@renatolabs renatolabs force-pushed the rc/better-logs-backup-mixed-version branch from 1b273c6 to e736904 Compare June 14, 2023 16:22
Copy link
Member

@srosenberg srosenberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan and @renatolabs)


pkg/cmd/roachtest/tests/mixed_version_backup.go line 2030 at r1 (raw file):

Previously, renatolabs (Renato Costa) wrote…

It's unique "by construction" in this case? There's a 1:1 relationship between collectFailureArtifacts and append(restoreErrors) calls. The "ID" passed is such that it matches the index reported in the aggregated error message.

Got it, thanks. I was initially thinking this might get invoked multiple times during the lifecycle of the test, but that's not the case.


pkg/cmd/roachtest/tests/mixed_version_backup.go line 2050 at r1 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Just personal preference, doesn't matter either way. Humans start counting at 1, and the error is for human consumption 🤷

Fair enough, although the humans who will be reading these logs are already forced to switch between 0-based and 1-based thinking :)

This reduces the chance of a test timeout. Over time, we still get
good coverage of the different backup scenarios as they are picked
randomly.

Epic: none

Release note: None
@renatolabs
Copy link
Contributor Author

@srosenberg pushed another commit after taking another look at #104604 -- that failure is actually a test timeout, and it's hard to diagnose. Let me know if you have comments.

Copy link
Member

@srosenberg srosenberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 4 files at r1, 2 of 2 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan and @renatolabs)


pkg/cmd/roachtest/tests/mixed_version_backup.go line 2045 at r5 (raw file):

	verify(clusterupgrade.MainVersion)

	// If the context was canceled (most likely due to a test timeout),

Timeouts are partial (unclean) runs, so it makes sense to bail out here.

@renatolabs
Copy link
Contributor Author

TFTR!

bors r=srosenberg

@craig
Copy link
Contributor

craig bot commented Jun 14, 2023

Build succeeded:

@craig craig bot merged commit 1d3c777 into cockroachdb:master Jun 14, 2023
@blathers-crl
Copy link

blathers-crl bot commented Jun 14, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 86c4583 to blathers/backport-release-23.1-104868: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 23.1.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-23.1.x Flags PRs that need to be backported to 23.1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants