Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: tpccbench/nodes=9/cpu=4/chaos/partition failed #47471

Closed
cockroach-teamcity opened this issue Apr 14, 2020 · 4 comments · Fixed by #47492
Closed

roachtest: tpccbench/nodes=9/cpu=4/chaos/partition failed #47471

cockroach-teamcity opened this issue Apr 14, 2020 · 4 comments · Fixed by #47492
Assignees
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.
Milestone

Comments

@cockroach-teamcity
Copy link
Member

(roachtest).tpccbench/nodes=9/cpu=4/chaos/partition failed on master@29c0efdcc5edb5d100449a093b25df107f1df2d6:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20200414-1874488/tpccbench/nodes=9/cpu=4/chaos/partition/run_1
	cluster.go:2418,tpcc.go:729,tpcc.go:570,test_runner.go:753: monitor failure: unexpected node event: 8: dead
		(1) attached stack trace
		  | main.(*monitor).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2406
		  | main.(*monitor).Wait
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2414
		  | main.runTPCCBench
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:729
		  | main.registerTPCCBenchSpec.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:570
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:753
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1357
		Wraps: (2) monitor failure
		Wraps: (3) unexpected node event: 8: dead
		Error types: (1) *withstack.withStack (2) *errutil.withMessage (3) *errors.errorString

	cluster.go:1460,context.go:135,cluster.go:1449,test_runner.go:825: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1874488-1586849395-28-n10cpu4 --oneshot --ignore-empty-nodes: exit status 1 10: skipped
		8: dead
		5: 866
		2: 32290
		4: 32662
		9: 31334
		6: 31590
		3: 1027
		1: 705
		7: 30586
		Error: UNCLASSIFIED_PROBLEM: 8: dead
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) 8: dead
		  | main.glob..func13
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1129
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:272
		  | github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:766
		  | github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:852
		  | github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:800
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1793
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:203
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1357
		Error types: (1) errors.Unclassified (2) *errors.fundamental

More

Artifacts: /tpccbench/nodes=9/cpu=4/chaos/partition
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Apr 14, 2020
@cockroach-teamcity cockroach-teamcity added this to the 20.1 milestone Apr 14, 2020
@nvanbenschoten
Copy link
Member

F200414 10:57:28.377426 598722 kv/kvclient/kvcoord/dist_sender.go:1250  [n8] received 184 results, limit was 92
goroutine 598722 [running]:
github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0x763d201, 0xed6278e18, 0x0, 0xc01237a820)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/get_stacks.go:25 +0xb8
github.com/cockroachdb/cockroach/pkg/util/log.(*loggerT).outputLogEntry(0x7639ea0, 0xc000000004, 0x6b7c7ad, 0x22, 0x4e2, 0xc013af73e0, 0x27)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:212 +0xa0c
github.com/cockroachdb/cockroach/pkg/util/log.addStructured(0x4c69340, 0xc0279808d0, 0x4, 0x2, 0x43e8f85, 0x21, 0xc01237a948, 0x2, 0x2)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/structured.go:66 +0x2c9
github.com/cockroachdb/cockroach/pkg/util/log.logDepth(0x4c69340, 0xc0279808d0, 0x1, 0x4, 0x43e8f85, 0x21, 0xc01237a948, 0x2, 0x2)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:44 +0x8c
github.com/cockroachdb/cockroach/pkg/util/log.Fatalf(...)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:155
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).divideAndSendBatchToRanges(0xc000cc2120, 0x4c69340, 0xc0279808d0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:1250 +0xcfd
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).sendPartialBatch(0xc000cc2120, 0x4c69340, 0xc0279808d0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:1460 +0xca2
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).divideAndSendBatchToRanges(0xc000cc2120, 0x4c69340, 0xc0279808d0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:1079 +0x164a
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).Send(0xc000cc2120, 0x4c69340, 0xc027980870, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:730 +0x785
github.com/cockroachdb/cockroach/pkg/kv.(*CrossRangeTxnWrapperSender).Send(0xc00070c660, 0x4c69340, 0xc027980870, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:223 +0x95
github.com/cockroachdb/cockroach/pkg/internal/client/requestbatcher.(*RequestBatcher).sendBatch.func1.1(0x4c69340, 0xc027980870, 0x4c69340, 0xc0246485d0)
	/go/src/github.com/cockroachdb/cockroach/pkg/internal/client/requestbatcher/batcher.go:275 +0xdc
github.com/cockroachdb/cockroach/pkg/internal/client/requestbatcher.(*RequestBatcher).sendBatch.func1(0x4c69340, 0xc027980870)
	/go/src/github.com/cockroachdb/cockroach/pkg/internal/client/requestbatcher/batcher.go:306 +0x88d
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker.func1(0xc009849520, 0xc000c40000, 0xc00d49b0c0)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:198 +0x13e
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:191 +0xa8

@nvanbenschoten
Copy link
Member

It's interesting that 184 is exactly twice 92. I wonder if we're doing something dumb here.

@nvanbenschoten
Copy link
Member

This looks similar to #46652, but it's not exactly the same. Specifically, this assertion is firing in DistSender, not on the Replica.

Also, I found that we also saw this issue in #40935 and in that case, we saw "received 215 results, limit was 200".

@nvanbenschoten
Copy link
Member

Oh, I think I might see what's going on here. MVCCResolveWriteIntentRangeUsingIter doesn't seem to handle the case where max < 0 correctly, unlike MVCCScan, which does. We're probably seeing a case where the limit is exhausted midway through a batch of ResolveIntentRange requests so then all remaining requests are evaluating as if they had no limit at all. That seems like an issue that could easily cause this.

cc. @tbg, as this class of bug might prove to be an interesting thread to pull on in #46652.

craig bot pushed a commit that referenced this issue Apr 14, 2020
47492: kv: respect exhausted key limit during ranged intent resolution r=nvanbenschoten a=nvanbenschoten

Fixes #47471.
Fixes #40935.

This commit fixes a long-standing bug where ranged intent resolution would not respect the MaxSpanRequestKeys set on a batch once the limit had already been exhausted by other requests in the same batch. Instead of treating the limit as exhausted, ranged intent resolution would consider the limit nonexistent (unbounded). This bug was triggering an assertion in DistSender. We became more likely to hit this issue in v20.1 because we started performing ranged intent resolution more often due to implicit SELECT FOR UPDATE.

This commit fixes the bug in two ways:
1. it addresses the root cause, updating MVCCResolveWriteIntentRangeUsingIter to properly respect the limit placed on the request when it is exhauted.
2. it disables the assertion in DistSender when it detects that we are hitting this bug. This ensures that we don't hit the assertion in mixed version clusters (see #40935). By the time we're in DistSender, the damage is already done and has already potentially resulted in a large Raft entry. Maintaining the assertion doesn't do us any good.

Release notes (bug fix): a bug that could could trigger an assertion with the text "received X results, limit was Y" has been fixed. The underlying bug was only performance related and could not cause user-visible correctness violations.

Release justification: fixes a medium-priority bug in existing functionality. The bug could result in an assertion failure and a node crashing. Even though this was an old bug (present in many releases before v20.1), it became a lot easier to hit in v20.1 because we started performing ranged intent resolution more often due to implicit SELECT FOR UPDATE.

Co-authored-by: Nathan VanBenschoten <[email protected]>
@craig craig bot closed this as completed in 1ab9eca Apr 14, 2020
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Apr 15, 2020
Fixes cockroachdb#47471.
Fixes cockroachdb#40935.

This commit fixes a long-standing bug where ranged intent resolution
would not respect the MaxSpanRequestKeys set on a batch once the limit
had already been exhausted by other requests in the same batch. Instead
of treating the limit as exhausted, ranged intent resolution would
consider the limit nonexistent (unbounded). This bug was triggering an
assertion in DistSender. We became more likely to hit this issue in
v20.1 because we started performing ranged intent resolution more often
due to implicit SELECT FOR UPDATE.

This commit fixes the bug in two ways:
1. it addresses the root cause, updating MVCCResolveWriteIntentRangeUsingIter
  to properly respect the limit placed on the request when it is exhauted.
2. it disables the assertion in DistSender when it detects that we are hitting
  this bug. This ensures that we don't hit the assertion in mixed version
  clusters (see cockroachdb#40935). By the time we're in DistSender, the damage is
  already done and has already potentially resulted in a large Raft entry.
  Maintaining the assertion doesn't do us any good.

Release notes (bug fix): a bug that could could trigger an assertion
with the text "received X results, limit was Y" has been fixed. The
underlying bug was only performance related and could not cause
user-visible correctness violations.

Release justification: fixes a medium-priority bug in existing
functionality. The bug could result in an assertion failure and a node
crashing. Even though this was an old bug (present in many releases
before v20.1), it became a lot easier to hit in v20.1 because we started
performing ranged intent resolution more often due to implicit SELECT
FOR UPDATE.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants