roachtest: backup-restore/mixed-version failed #104604

Closed
cockroach-teamcity opened this issue Jun 8, 2023 · 2 comments
Labels
C-test-failure (Broken test, automatically or manually discovered), O-roachtest, O-robot (Originated from a bot), T-disaster-recovery

@cockroach-teamcity
Member

cockroach-teamcity commented Jun 8, 2023

roachtest.backup-restore/mixed-version failed with artifacts on release-23.1 @ dcffb6a0a3f8ed7ab55b80d5a65d56be7a574f55:

test artifacts and logs in: /artifacts/backup-restore/mixed-version/run_1
(test_runner.go:1024).runTest: test timed out (8h0m0s)
(cluster.go:1394).FailOnInvalidDescriptors: invalid descriptors check failed: operation "invalid descriptors check" timed out after 59m56.807s (given timeout 5m0s): dial tcp 34.138.132.233:26257: connect: connection refused
(mixedversion.go:410).Run: 4 errors during restore:
0: <current>: waiting for job to finish: error reading (status, payload) for job 872128817642110980: dial tcp 34.138.132.233:26257: connect: connection refused
1: failed to wipe cluster: cluster.WipeE: context canceled
2: <current>: backup 8_22.2.10-to-current_table-bank.bank_all-planned-and-executed-on-current: error creating database restore_8_22_2_10_to_current_table_bank_bank_all_planned_and_executed_on_current_20: context canceled
3: failed to wipe cluster: cluster.WipeE: context canceled

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=true , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

Same failure on other branches

/cc @cockroachdb/disaster-recovery

This test on roachdash | Improve this report!

Jira issue: CRDB-28624

cockroach-teamcity added the branch-release-23.1 (Used to mark GA and release blockers, technical advisories, and bugs for 23.1), C-test-failure, O-roachtest, O-robot, release-blocker (Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.), and T-disaster-recovery labels Jun 8, 2023
cockroach-teamcity added this to the 23.1 milestone Jun 8, 2023
adityamaru removed the release-blocker and branch-release-23.1 labels Jun 8, 2023
@adityamaru
Contributor

looks like a dup of #104481 but we don't have dmesg.txts for some reason. Looking at heap profiles.

renatolabs added a commit to renatolabs/cockroach that referenced this issue Jun 14, 2023
This commit updates the `backup-restore/mixed-version` roachtest to
collect artifacts (cockroach logs and a debug.zip) when a restore
fails in the last step of the test (when all backups taken are
restored). In that step, we do not immediately fail the test when a
restore fails but instead attempt to restore every backup and return a
list of errors found when the process is done. However, restoring
cluster backups involves wiping the cluster which also deletes
existing cockroach logs up to that point. This makes debugging a
restore failure that happened prior to a cluster restore impossible.

After this commit, a restore failure in that test will cause a
`restore_failure_N` directory to be created in the artifacts
directory, including the cockroach logs collected right after the
failure, as well as a debug.zip created at the same time.

This will make issues such as cockroachdb#104604 more actionable.

Epic: none

Release note: None
@renatolabs
Contributor

(test_runner.go:1024).runTest: test timed out (8h0m0s)

This is actually the relevant line, but it's hard to see in the midst of all those other "error messages". Once #104868 is merged, timeouts should become less likely, but if one does happen, the error messaging should improve.

Closing as there's nothing to do here.

craig bot pushed a commit that referenced this issue Jun 14, 2023
103967: build,bazel: upgrade to `rules_js` r=sjbarag a=rickystewart

The library which we were using, `rules_nodejs`, has known deficiencies:

1. The library has been "effectively deprecated" as of the [5.x branch](https://github.com/bazelbuild/rules_nodejs/tree/5.x);
2. the library is incompatible with things we need such as: cross-compilation, Bazel 6.0+, and remote execution;
3. and the library has bugs which we cannot fix, like a race condition which prevents builds from succeeding sporadically, requiring the dev to perform a `clean`.

Here we move to [rules_js](https://github.com/aspect-build/rules_js), the modern alternative.

Epic: none
Release note: None

104820: backupccl: adjust a test to run for secondary tenant codec too r=yuzefovich a=yuzefovich

Fixes: #82882.

Release note: None

104868: roachtest: collect failure artifacts when restore fails r=srosenberg a=renatolabs

This commit updates the `backup-restore/mixed-version` roachtest to
collect artifacts (cockroach logs and a debug.zip) when a restore
fails in the last step of the test (when all backups taken are
restored). In that step, we do not immediately fail the test when a
restore fails but instead attempt to restore every backup and return a
list of errors found when the process is done. However, restoring
cluster backups involves wiping the cluster which also deletes
existing cockroach logs up to that point. This makes debugging a
restore failure that happened prior to a cluster restore impossible.

After this commit, a restore failure in that test will cause a
`restore_failure_N` directory to be created in the artifacts
directory, including the cockroach logs collected right after the
failure, as well as a debug.zip created at the same time.

This will make issues such as #104604 more actionable.

Epic: none

Release note: None

104872: go.mod: bump Pebble to 32834aa62738 r=RaduBerinde a=RaduBerinde

32834aa6 objstorage: support heteorogeneous Storage backends
c75c4d65 db: wrap error when creating Reader with backing filenum
a8a7ebf5 db: Add Option to Filter SSTables

Release note: None
Epic: None

Co-authored-by: Ricky Stewart <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Renato Costa <[email protected]>
Co-authored-by: Radu Berinde <[email protected]>