ccl/backupccl: TestRestorePauseOnError failed #121342

Closed
cockroach-teamcity opened this issue Mar 29, 2024 · 5 comments · Fixed by #125158
Labels: branch-master (Failures and bugs on the master branch), C-test-failure (Broken test, automatically or manually discovered), O-robot (Originated from a bot), P-2 (Issues/test failures with a fix SLA of 3 months), T-disaster-recovery


cockroach-teamcity commented Mar 29, 2024

ccl/backupccl.TestRestorePauseOnError failed on master @ a3ac7ebf958f25201c2696a17f996c0b9f86830f:

=== RUN   TestRestorePauseOnError
    test_log_scope.go:170: test logs captured to: outputs.zip/logTestRestorePauseOnError98142672
    backup_test.go:9235: expected jobID 955604274711265281 to have status paused got succeeded
    testutils.go:289: no Invalid Descriptors
    panic.go:626: -- test log scope end --
test logs left over in: outputs.zip/logTestRestorePauseOnError98142672
--- FAIL: TestRestorePauseOnError (46.33s)

Parameters:

  • attempt=1
  • run=1
  • shard=12
Help

See also: How To Investigate a Go Test Failure (internal)

Same failure on other branches

/cc @cockroachdb/disaster-recovery

This test on roachdash | Improve this report!

Jira issue: CRDB-37200

cockroach-teamcity added the branch-master, C-test-failure, O-robot, release-blocker, and T-disaster-recovery labels Mar 29, 2024
cockroach-teamcity added this to the 24.1 milestone Mar 29, 2024
@cockroach-teamcity

ccl/backupccl.TestRestorePauseOnError failed on master @ 2a5e231716c436781f12452d800651f51c6383b7:

=== RUN   TestRestorePauseOnError
    test_log_scope.go:170: test logs captured to: outputs.zip/logTestRestorePauseOnError2732496999
    test_server_shim.go:157: automatically injected a shared process virtual cluster under test; see comment at top of test_server_shim.go for details.
    backup_test.go:9235: expected jobID 955872203116740609 to have status paused got succeeded
    testutils.go:289: no Invalid Descriptors
    panic.go:626: -- test log scope end --
test logs left over in: outputs.zip/logTestRestorePauseOnError2732496999
--- FAIL: TestRestorePauseOnError (47.10s)

Parameters:

  • attempt=1
  • run=1
  • shard=12

@cockroach-teamcity

ccl/backupccl.TestRestorePauseOnError failed on master @ 7fc4c7bcbbf0c75a62d056da0bf79a5a32714650:

=== RUN   TestRestorePauseOnError
    test_log_scope.go:170: test logs captured to: outputs.zip/logTestRestorePauseOnError3675220197
    backup_test.go:9235: expected jobID 956168747870191617 to have status paused got succeeded
    testutils.go:289: no Invalid Descriptors
    panic.go:626: -- test log scope end --
test logs left over in: outputs.zip/logTestRestorePauseOnError3675220197
--- FAIL: TestRestorePauseOnError (46.48s)

Parameters:

  • attempt=1
  • run=1
  • shard=12

@cockroach-teamcity

ccl/backupccl.TestRestorePauseOnError failed on master @ 7fc4c7bcbbf0c75a62d056da0bf79a5a32714650:

=== RUN   TestRestorePauseOnError
    test_log_scope.go:170: test logs captured to: outputs.zip/logTestRestorePauseOnError2125256115
    test_server_shim.go:157: automatically injected an external process virtual cluster under test; see comment at top of test_server_shim.go for details.
    backup_test.go:9235: expected jobID 956422934812819457 to have status paused got succeeded
    testutils.go:289: no Invalid Descriptors
    panic.go:626: -- test log scope end --
test logs left over in: outputs.zip/logTestRestorePauseOnError2125256115
--- FAIL: TestRestorePauseOnError (47.92s)

Parameters:

  • attempt=1
  • run=3
  • shard=12

rickystewart added a commit to rickystewart/cockroach that referenced this issue Apr 1, 2024
Flaky. See cockroachdb#121342

Epic: None
Release note: None
craig bot pushed a commit that referenced this issue Apr 1, 2024
…121487

121207: rangefeed: add batch version of onValues r=dt a=dt

Release note: none.
Epic: none.

121257: kv/kvclient/kvtenant: allow downloading ones own spans r=stevendanna a=dt

Release note: none.
Epic: none.

121367: roachtest: expand WAL failover test assertions r=sumeerbhola a=jbowens

Update the WAL failover disk stall roachtest to assert that the stalled store does failover to the secondary and that SQL tail latencies remain bounded.

Epic: none
Release note: none

121369: ui: add goroutine scheduling latency graph to runtime dashboard r=mgartner a=mgartner

#### ui: add goroutine scheduling latency graph to runtime dashboard

Epic: None

Release note (ui change): The "Goroutine Scheduling Latency" graph has
been added to the Runtime page in Metrics.


121390: util/json,builtins: miscellaneous improvements around json objects r=yuzefovich a=yuzefovich

Fixes: #121326.

Release note: None

121452: dbconsole: use second granularity for jobs times r=dt a=dt

These milliseconds are not helpful to humans and just add noise that makes the actual times harder to spot.

Release note: none.
Epic: none.

121482: stress: reduce default `--runs_per_test` to 25 r=rail a=rickystewart

The nightlies are taking longer to run since the branch cut, so this should speed things up slightly.

Epic: CRDB-8308
Release note: None

121484: backupccl: skip `TestRestorePauseOnError` r=rail a=rickystewart

Flaky. See #121342

Epic: None
Release note: None

121487: go.mod: bump Pebble to eae0efc2a391 r=aadityasondhi a=itsbilal

Changes:

 * [`eae0efc2`](cockroachdb/pebble@eae0efc2) db: trigger flushable ingest in TestDeterminism
 * [`cc5b658d`](cockroachdb/pebble@cc5b658d) db: minor jobID cleanup
 * [`295c0099`](cockroachdb/pebble@295c0099) db: cancel compactions that overlap with flushable ingest+excise
 * [`a306d7e0`](cockroachdb/pebble@a306d7e0) db: add TestDeterminism
 * [`d3f89bff`](cockroachdb/pebble@d3f89bff) manifest: cleaner implementation of assertNotL0Cmp
 * [`1c7bcd1c`](cockroachdb/pebble@1c7bcd1c) manifest: add a helper for iterating through all levels
 * [`a9087af9`](cockroachdb/pebble@a9087af9) manifest: assert that we don't use Seek on L0 LevelMetadata
 * [`4f4be429`](cockroachdb/pebble@4f4be429) metamorphic: disable download op
 * [`e9624315`](cockroachdb/pebble@e9624315) base: add AssertionFailedf wrapper
 * [`016ef889`](cockroachdb/pebble@016ef889) metamorphic: run the level checker less frequently
 * [`1dc54dfd`](cockroachdb/pebble@1dc54dfd) metamorphic: add timestamps to logs
 * [`34502438`](cockroachdb/pebble@34502438) db: label download compactions logs

Fixes #121263.

Release note: none.
Epic: none.

Co-authored-by: David Taylor
Co-authored-by: Jackson Owens
Co-authored-by: Marcus Gartner
Co-authored-by: Yahor Yuzefovich
Co-authored-by: Ricky Stewart
Co-authored-by: Bilal Akhtar
msbutler removed the release-blocker label Apr 2, 2024
msbutler added the P-2 label Apr 16, 2024
navsetlur self-assigned this Jun 3, 2024

navsetlur commented Jun 4, 2024

The cluster attempts to pause the job, but SucceedsSoon sees the job status as succeeded:

ccl/backupccl/restore_job.go:1722 ⋮ [T1,Vsystem,n1,job=‹RESTORE id=974723906702147585›] 233  Pausing due to setting
W240604 20:22:09.754387 2764 ccl/backupccl/restore_job.go:1724 ⋮ [T1,Vsystem,n1,job=‹RESTORE id=974723906702147585›] 234  job failed with error (testing injected failure) but is being paused due to the ‹debug_pause_on›=‹error› setting
I240604 20:22:09.759110 2764 jobs/update.go:352 ⋮ [T1,Vsystem,n1] 235  job 974723906702147585: pause requested recorded with reason ‹pausing job due to the debug_pause_on=error setting: testing injected failure›
E240604 20:22:09.760430 2764 jobs/adopt.go:456 ⋮ [T1,Vsystem,n1] 236  job 974723906702147585: adoption completed with error pausing due to error; use RESUME JOB to try to proceed once the issue is resolved, or CANCEL JOB to rollback: pausing job due to the ‹debug_pause_on›=‹error› setting: testing injected failure
I240604 20:22:09.767191 915 jobs/adopt.go:108 ⋮ [T1,Vsystem,n1] 237  claimed 1 jobs
I240604 20:22:09.779409 881 jobs/adopt.go:547 ⋮ [T1,Vsystem,n1] 238  job 974723906702147585, session 01018086762f103c8442f29d0981035867fd8f: paused
I240604 20:22:11.531823 233 1@gossip/gossip.go:1374 ⋮ [T1,Vsystem,n1] 239  node has connected to cluster via gossip
I240604 20:22:11.531997 233 kv/kvserver/stores.go:283 ⋮ [T1,Vsystem,n1] 240  wrote 0 node addresses to persistent storage
I240604 20:22:12.858800 15 ccl/backupccl/backup_test.go:9122 ⋮ [-] 241  SucceedsSoon: expected jobID 974723906266660865 to have status paused got ‹succeeded›
I240604 20:22:13.861008 15 ccl/backupccl/backup_test.go:9122 ⋮ [-] 242  SucceedsSoon: expected jobID 974723906266660865 to have status paused got ‹succeeded›
I240604 20:22:14.863964 15 ccl/backupccl/backup_test.go:9122 ⋮ [-] 243  SucceedsSoon: expected jobID 974723906266660865 to have status paused got ‹succeeded›

navsetlur added a commit to navsetlur/cockroach that referenced this issue Jun 5, 2024
…en pausing a restore operation on error

The DEBUG_PAUSE_ON option was previously used to pause a restore operation when it failed before returning the error.
We currently have a standardized way of pausing operations via the pause point interface and it made sense to update
this feature to use this approach while fixing a test failure. This change removes the DEBUG_PAUSE_ON option and
replaces it with a 'restore.after_restore_failure' pause point.

Fixes: cockroachdb#121342

Release note (enterprise): The DEBUG_PAUSE_ON option has been removed entirely and replaced with the
'restore.after_restore_failure' pause point to match other pause points used throughout the codebase.
You can set this pause point by running `SET CLUSTER SETTING jobs.debug.pausepoints = 'restore.after_restore_failure'`
craig bot pushed a commit that referenced this issue Jun 10, 2024
125158: backupccl: Replace DEBUG_PAUSE_ON option with standard pause point wh… r=dt a=navsetlur

…en pausing a restore operation on error

The DEBUG_PAUSE_ON option was previously used to pause a restore operation when it failed before returning the error. We currently have a standardized way of pausing operations via the pause point interface and it made sense to update this feature to use this approach while fixing a test failure. This change removes the DEBUG_PAUSE_ON option and replaces it with a 'restore.after_restore_failure' pause point.

Fixes: #121342

Release note (enterprise): The DEBUG_PAUSE_ON option has been removed entirely and replaced with the 'restore.after_restore_failure' pause point to match other pause points used throughout the codebase. You can set this pause point by running `SET CLUSTER SETTING jobs.debug.pausepoints = 'restore.after_restore_failure'`

Co-authored-by: Naveen Setlur
craig bot closed this as completed in 49a1a4b Jun 10, 2024