roachtest: cdc/poller/rangefeed=false failed [cannot recover PENDING transaction in same epoch] #62064

Closed
cockroach-teamcity opened this issue Mar 16, 2021 · 1 comment · Fixed by #62744 or #62761
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). GA-blocker O-roachtest O-robot Originated from a bot.

Comments

@cockroach-teamcity
Member

(roachtest).cdc/poller/rangefeed=false failed on master@e9387a6e5dfdad71c74ccd0a07c907632613fa3e:

		  |    92.0s        0           14.0           17.9   9663.7  16106.1  34359.7  34359.7 delivery
		  |    92.0s        0           70.1          159.0   5637.1  10200.5  11811.2  13958.6 newOrder
		  |    92.0s        0            9.0           19.8   1744.8   5100.3   5100.3   5100.3 orderStatus
		  |    92.0s        0           69.1          194.6   3221.2   7784.6   9126.8   9663.7 payment
		  |    92.0s        0            8.0           19.7   4831.8  20401.1  20401.1  20401.1 stockLevel
		  | _elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
		  |    93.0s        0           10.0           17.8  10200.5  36507.2  36507.2  36507.2 delivery
		  |    93.0s        0           83.0          158.2   6979.3  15032.4  16643.0  17179.9 newOrder
		  |    93.0s        0            8.0           19.7   2952.8  15032.4  15032.4  15032.4 orderStatus
		  |    93.0s        0           83.0          193.4   4563.4  11811.2  22548.6  27917.3 payment
		  |    93.0s        0           12.0           19.7   4563.4   6710.9   8589.9   8589.9 stockLevel
		  |    94.0s        0           18.0           17.8  10200.5  33286.0  38654.7  38654.7 delivery
		  |    94.0s        0          218.8          158.9   6174.0  10737.4  17179.9  18253.6 newOrder
		  |    94.0s        0           20.0           19.7   4563.4   7784.6  15569.3  15569.3 orderStatus
		  |    94.0s        0          165.8          193.1   4026.5   8589.9  16106.1  18253.6 payment
		  |    94.0s        0           11.0           19.6   3623.9   6442.5   6979.3   6979.3 stockLevel
		Wraps: (4) exit status 20
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *main.withCommandDetails (4) *exec.ExitError

	cluster.go:2688,cdc.go:204,cdc.go:608,test_runner.go:768: monitor failure: monitor task failed: t.Fatal() was called
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitor).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2676
		  | main.(*monitor).Wait
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2684
		  | main.cdcBasicTest
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cdc.go:204
		  | main.registerCDC.func3
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cdc.go:608
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:768
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitor).wait.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2732
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  -- stack trace:
		  | main.init
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2646
		  | runtime.doInit
		  | 	/usr/local/go/src/runtime/proc.go:5652
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:191
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (6) t.Fatal() was called
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError
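
For context on the "Error types" listing above: these chains are produced by the github.com/cockroachdb/errors library, where each wrap layers a message prefix and a captured stack onto the error. A minimal sketch of how such a chain is assembled (the function is illustrative, not roachtest's actual code):

```go
package errsketch

import "github.com/cockroachdb/errors"

// monitorTaskError rebuilds the shape of the chain above: errors.New yields
// the *errutil.leafError (with a captured stack), and each errors.Wrap layers
// an *errutil.withPrefix plus a *withstack.withStack on top, producing the
// alternating pattern in the "Error types" line.
func monitorTaskError() error {
	leaf := errors.New("t.Fatal() was called")
	task := errors.Wrap(leaf, "monitor task failed")
	return errors.Wrap(task, "monitor failure")
}
```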


Artifacts: /cdc/poller/rangefeed=false

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Mar 16, 2021
@stevendanna
Collaborator

workload appears to have failed with:

Error: error in payment: ERROR: TransactionStatusError: programming error: cannot recover PENDING transaction in same epoch: meta={id=476e4aec key=/Table/53/1/679/4 pri=99.99999995 epo=0 ts=1615878898.732187069,1 min=1615878867.391387290,0 seq=31} lock=true stat=PENDING rts=0,0 wto=false gul=0,0 (REASON_UNKNOWN) (SQLSTATE XXUUU)
Error: COMMAND_PROBLEM: exit status 1
(1) COMMAND_PROBLEM
Wraps: (2) Node 4. Command with error:
  | ```
  | ./workload run tpcc --warehouses=1000 --duration=30m  {pgurl:1-3}
  | ```
Wraps: (3) exit status 1
Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
run_071323.531_n4_workload_run_tpcc: 07:15:02 cluster.go:2320: > result: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-2780695-1615874329-24-n4cpu16:4 -- ./workload run tpcc --warehouses=1000 --duration=30m  {pgurl:1-3}  returned: exit status 20
(1) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-2780695-1615874329-24-n4cpu16:4 -- ./workload run tpcc --warehouses=1000 --duration=30m  {pgurl:1-3}  returned
  | stderr:
  | I210316 07:13:25.328503 1 workload/cli/run.go:361  [-] 1  creating load generator...
  | I210316 07:13:28.402072 1 workload/cli/run.go:392  [-] 2  creating load generator... done (took 3.073568195s)
  | Error: error in payment: ERROR: TransactionStatusError: programming error: cannot recover PENDING transaction in same epoch: meta={id=476e4aec key=/Table/53/1/679/4 pri=99.99999995 epo=0 ts=1615878898.732187069,1 min=1615878867.391387290,0 seq=31} lock=true stat=PENDING rts=0,0 wto=false gul=0,0 (REASON_UNKNOWN) (SQLSTATE XXUUU)
  | Error: COMMAND_PROBLEM: exit status 1
  | (1) COMMAND_PROBLEM
  | Wraps: (2) Node 4. Command with error:
  |   | ```
  |   | ./workload run tpcc --warehouses=1000 --duration=30m  {pgurl:1-3}
  |   | ```
  | Wraps: (3) exit status 1
  | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
  |
  | stdout:
  | <... some data truncated by circular buffer; go to artifacts for details ...>

Similar to #61992

@tbg tbg changed the title roachtest: cdc/poller/rangefeed=false failed roachtest: cdc/poller/rangefeed=false failed [cannot recover PENDING transaction in same epoch] Mar 23, 2021
@nvanbenschoten nvanbenschoten added GA-blocker and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Mar 23, 2021
craig bot pushed a commit that referenced this issue Mar 29, 2021
62705: workload/schemachange: check enum's non-public status during ADD REGION r=ajwerner a=otan

Previously, issuing two ADD REGIONs with the same value in the same
transaction would produce an error that the workload changer treated as
"unexpected", because we did not inject the expected error. This is
because the region added by the first ADD REGION is not immediately made
a public enum member, and so does not show up in SHOW REGIONS when the
second ADD REGION is generated.

To fix this, we now detect whether the region already exists as a
non-public enum member and, if so, add the expected error code.

Also re-enables ADD REGION operation generation.

Release note: None
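
A minimal sketch of the shape of that check, assuming hypothetical helper names and using pgcode 42710 ("duplicate object") as an illustrative expected code; this is not the workload's actual implementation:

```go
package schemachange

// Sketch of the check described above: before issuing a second ADD REGION,
// consult the set of regions added earlier in the transaction that are still
// non-public enum members, and register the error code the statement is then
// expected to hit. All names are illustrative stand-ins.

type opGenerator struct {
	// expectedErrCodes accumulates pgcodes the generated statement may
	// legitimately return without being treated as a workload failure.
	expectedErrCodes []string
}

// maybeExpectDuplicateRegion records the expected error when the region was
// already added in this transaction. SHOW REGIONS will not surface such a
// region until it becomes public, which is why a plain SHOW REGIONS check
// misses it.
func (g *opGenerator) maybeExpectDuplicateRegion(nonPublicRegions map[string]bool, region string) {
	if nonPublicRegions[region] {
		g.expectedErrCodes = append(g.expectedErrCodes, "42710") // duplicate_object
	}
}
```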


62744: roachtest: remove cdc/poller/rangefeed=false r=nvanbenschoten a=nvanbenschoten

Closes #62064. This isn't fixed, but consolidating to #61992.

This should have been removed in 8a81eac. The test is now effectively
identical to `cdc/tpcc-1000`, just less aggressive in its assertions.

Co-authored-by: Oliver Tan <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
@craig craig bot closed this as completed in 286ac2e Mar 29, 2021
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Mar 29, 2021
Fixes cockroachdb#61992.
Fixes cockroachdb#62064.

This commit fixes a bug uncovered recently (for less than obvious
reasons) in cdc roachtests where a STAGING transaction could have its
transaction record moved back to a PENDING state without changing epochs
but after its timestamp was bumped. This could result in concurrent
transaction recovery attempts returning `programming error: cannot
recover PENDING transaction in same epoch` errors, because such a state
transition was not expected to be possible by transaction recovery.
However, as we found in cockroachdb#61992, this has actually been possible since
01bc20e.

This commit fixes the bug by detecting cases where a pusher knows of a
failed parallel commit and selectively upgrading PUSH_TIMESTAMP push
attempts to PUSH_ABORTs. This has no effect on pushes that fail with a
TransactionPushError. Such pushes will still wait on the pushee to retry
its commit and eventually commit or abort. It also has no effect on
expired pushees, as they would have been aborted anyway. This only
impacts pushes which would have succeeded due to priority mismatches. In
these cases, the push acts the same as a short-circuited transaction
recovery process, because the transaction recovery procedure always
finalizes target transactions, even if initiated by a PUSH_TIMESTAMP.
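
A minimal sketch of the push-type upgrade described above; the types and names are simplified stand-ins rather than CockroachDB's actual kv API:

```go
package kvsketch

// PushType mirrors the idea of kv push types; these are simplified stand-ins
// for CockroachDB's actual definitions.
type PushType int

const (
	PushTimestamp PushType = iota // bump the pushee's timestamp, leave it running
	PushAbort                     // finalize (abort) the pushee
)

// choosePushType upgrades a PUSH_TIMESTAMP to a PUSH_ABORT when the pusher
// already knows the pushee's parallel commit failed. Aborting finalizes the
// STAGING transaction, matching what transaction recovery would do, instead
// of moving its record back to PENDING within the same epoch.
func choosePushType(desired PushType, knownFailedParallelCommit bool) PushType {
	if desired == PushTimestamp && knownFailedParallelCommit {
		return PushAbort
	}
	return desired
}
```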

This seems very rare in practice, as it requires a few specific
interactions to line up just right, including:
- a STAGING transaction that has one of its in-flight intent writes bumped
- a rangefeed processor listening to that intent write
- a separate request that conflicts with a different intent
- a STAGING transaction which expires to allow transaction recovery
- a rangefeed processor push between the time of the request push and the request recovery

Still, this fix is well contained, so I think we should backport it to
all of the release branches. However, since this issue does seem rare
and also cannot cause corruption or atomicity violations, I wanted to be
conservative with the backport, so I'm going to let this bake on master
+ release-21.1 for a few weeks before merging the backport.

Release notes (bug fix): an improper interaction between conflicting
transactions which could result in spurious `cannot recover PENDING
transaction in same epoch` errors was fixed.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Mar 30, 2021
craig bot pushed a commit that referenced this issue Mar 30, 2021
62761: kv: prevent STAGING -> PENDING transition during high-priority push r=nvanbenschoten a=nvanbenschoten

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Mar 30, 2021
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Apr 1, 2021