Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: tpccbench/nodes=9/cpu=4/multi-region failed [congmgr assertion failed] #53540

Closed
cockroach-teamcity opened this issue Aug 27, 2020 · 24 comments
Assignees
Labels
C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot.
Milestone

Comments

@cockroach-teamcity
Copy link
Member

(roachtest).tpccbench/nodes=9/cpu=4/multi-region failed on provisional_202008261913_v20.2.0-beta.1@eaa939ce6548a54a23970814ff00f30ad87680ac:

The test failed on branch=provisional_202008261913_v20.2.0-beta.1, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/tpccbench/nodes=9/cpu=4/multi-region/run_1
	cluster.go:1628,context.go:135,cluster.go:1617,test_runner.go:823: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-2225322-1598474350-85-n12cpu4-geo --oneshot --ignore-empty-nodes: exit status 1 4: skipped
		3: dead
		8: skipped
		1: 8011
		2: 8023
		12: skipped
		5: 6731
		6: 6446
		7: 6254
		10: 6741
		9: 6661
		11: 6377
		Error: UNCLASSIFIED_PROBLEM: 3: dead
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) attached stack trace
		  -- stack trace:
		  | main.glob..func14
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1143
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:267
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/pkg/mod/github.com/spf13/[email protected]/command.go:830
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/pkg/mod/github.com/spf13/[email protected]/command.go:914
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/pkg/mod/github.com/spf13/[email protected]/command.go:864
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1839
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:203
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1357
		Wraps: (3) 3: dead
		Error types: (1) errors.Unclassified (2) *withstack.withStack (3) *errutil.leafError

More

Artifacts: /tpccbench/nodes=9/cpu=4/multi-region
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity cockroach-teamcity added branch-provisional_202008261913_v20.2.0-beta.1 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Aug 27, 2020
@cockroach-teamcity cockroach-teamcity added this to the 20.2 milestone Aug 27, 2020
@tbg
Copy link
Member

tbg commented Aug 27, 2020

teamcity-2225322-1598474350-85-n12cpu4-geo-0003> I200827 07:35:10.986842 391758 kv/kvserver/replica_command.go:1654 ⋮ [n3,replicate,s3,r4983/1:‹×›] change replicas (add ‹×› remove ‹×›): existing descriptor r4983:‹×› [(n3,s3):1, (n8,s8):4, (n4,s4):3, (n1,s1):5LEARNER, next=6, gen=162, sticky=1598498913.568048083,0]
teamcity-2225322-1598474350-85-n12cpu4-geo-0003> F200827 07:35:11.168393 402450 kv/kvserver/concurrency/concurrency_manager.go:263 ⋮ [n3,s3,r15632/3:‹×›] ‹×›
goroutine 402450 [running]:
github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0x937f001, 0xc0098a3c90, 0x203009, 0x203009)

or, unredacted:

I200827 07:35:10.986842 391758 kv/kvserver/replica_command.go:1654 ⋮ [n3,replicate,s3,r4983/1:‹/Table/58/1/53{1/5/2…-6/9/1…}›] change replicas (add ‹[(n1,s1):5VOTER_INCOMING]› remove ‹[(n4,s4):3VOTER_DEMOTING]›): existing descriptor r4983:‹/Table/58/1/53{1/5/2170-6/9/1846}› [(n3,s3):1, (n8,s8):4, (n4,s4):3, (n1,s1):5LEARNER, next=6, gen=162, sticky=1598498913.568048083,0]
F200827 07:35:11.168393 402450 kv/kvserver/concurrency/concurrency_manager.go:263 ⋮ [n3,s3,r15632/3:‹/Table/61/1/35{87/19…-90/30…}›] ‹discovered lock by different transaction than existing lock›
goroutine 402450 [running]:
github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0x937f001, 0xc0098a3c90, 0x203009, 0x203009)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/get_stacks.go:25 +0xb8
github.com/cockroachdb/cockroach/pkg/util/log.(*loggerT).outputLogEntry(0x937e120, 0x4, 0x162f0fc042c88f34, 0x62412, 0x8607d93, 0x2e, 0x107, 0xc04672fc20, 0x41, 0xc003a572c0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:243 +0x9e5
github.com/cockroachdb/cockroach/pkg/util/log.addStructured(0x60c6440, 0xc023c35830, 0x4, 0x2, 0x502228b, 0x2, 0xc0098a4308, 0x1, 0x1)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/structured.go:52 +0x1ac
github.com/cockroachdb/cockroach/pkg/util/log.logDepth(0x60c6440, 0xc023c35830, 0x1, 0x4, 0x502228b, 0x2, 0xc0098a4308, 0x1, 0x1)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:44 +0x8c
github.com/cockroachdb/cockroach/pkg/util/log.Fatalf(...)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:153
github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency.(*managerImpl).HandleWriterIntentError(0xc00482e0c0, 0x60c6440, 0xc023c35830, 0xc01148fb90, 0x21, 0xc024185700, 0xc024185700, 0xc024185701)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency/concurrency_manager.go:263 +0x1cb
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleWriteIntentError(0xc004836800, 0x60c6440, 0xc023c35830, 0xc00bbab9e0, 0xc01148fb90, 0x162f0f76805799e9, 0x0, 0x0, 0x300000003, 0x3, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_send.go:455 +0xaa
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).executeBatchWithConcurrencyRetries(0xc004836800, 0x60c6440, 0xc023c35830, 0xc00bbab9e0, 0x56218e8, 0x0, 0xc023d168c0)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_send.go:376 +0x6ec

@tbg tbg changed the title roachtest: tpccbench/nodes=9/cpu=4/multi-region failed roachtest: tpccbench/nodes=9/cpu=4/multi-region failed [congmgr assertion failed] Aug 27, 2020
@tbg
Copy link
Member

tbg commented Aug 28, 2020

@nvanbenschoten can you take a look at this on Monday? This is blocking 20.2.0-beta.1.

@knz knz removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Aug 31, 2020
@nvanbenschoten
Copy link
Member

@knz while here, can we discuss how log redaction should work in this case? We have log.Fatalf(ctx, "%v", errors.AssertionFailedf("...")). Should withAssertionFailure implement the redact.SafeValue interface? Should we be passing it as log.Fatalf(ctx, "%v", redact.Safe(errors.AssertionFailedf("...")))?

@knz
Copy link
Contributor

knz commented Aug 31, 2020

the new error redaction code should have taken care of this already without any changes. I am surprised it didn't.

Or maybe this failure isn't from running off master? if this is a branch forked from before beginning of last week, it's expected the error is marked as redactable. Lastest 20.2 code should not have this problem.

@knz
Copy link
Contributor

knz commented Aug 31, 2020

oh wait I see an oddity specific to log.Fatal. I'm going to file as separate issue for that.

@knz
Copy link
Contributor

knz commented Aug 31, 2020

I've filed #53700 which I believe is restricted to crdb_internal.force_log_fatal().

I manually ran log.Fatalf(ctx, "%v", errors.AssertionFailedf(...)) and I get the expected output (on master).

Are you able to reproduce this error output at will? if so, can you alter the log.Fatal call with a log.Fatalf(ctx, "%+v", errors.Formattable(err)) and see what the object contains in detail?

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Sep 2, 2020
@nvanbenschoten
Copy link
Member

I'm able to hit this pretty easily (though in 4-6 hour) on eaa939c, but haven't been able to hit it yet on 95a13a8. That indicates that the issue was introduced somewhere in the ~200 commits between these two SHAs.

I haven't had much luck digging in with some of the improved logging in #53836 because it takes so long to hit this, so I'm planning on kicking off a long-running git bisect to hopefully narrow the search space down to something a little more manageable.

There were two bumps to Pebble in this period: d15daba and 8b9391d. The first one of these, in particular, is a little scary because it includes these two optimizations to iterator seeking: cockroachdb/pebble@76afcca and cockroachdb/pebble@a24bc6c.

@petermattis
Copy link
Collaborator

@sumeerbhola can speak to the safety of those Pebble iterator optimizations. I believe he thought they were fairly safe. Usually if we add a bug to iterators the Pebble metamorphic test catches it.

@nvanbenschoten
Copy link
Member

nvanbenschoten commented Sep 4, 2020

I was able to hit this on d15daba last night, which, coincidentally, is the commit following 95a13a8. So that might have taken care of the bisect for me. I'm going to spin up another batch of runs on 95a13a8. If they still all come up negative, we have our answer.

@petermattis
Copy link
Collaborator

Eeek! @nvanbenschoten is this error more likely due to an iterator not returning a key when it should, or returning an old version of a key?

@petermattis
Copy link
Collaborator

FWIW, I've been stressing the Pebble metamorphic test for 30min now without failure. I'll leave it running as it sounds like this is a very rare problem.

@nvanbenschoten
Copy link
Member

I think it's more likely due to an iterator returning an old version of a key. For instance, this would be the symptom if an older version of an MVCCMetadata key was revived.

But let's wait to see what this last round of testing shows. I'd like to be very sure about the bisect before pointing the finger in Pebble's direction.

@nvanbenschoten
Copy link
Member

I'm going to spin up another batch of runs on 95a13a8

@petermattis @sumeerbhola This just finished, again with no instance of this error. So this failure does appear to be a result of that change.

@sumeerbhola
Copy link
Collaborator

I'm unclear on the commit history based on the output of git log -- I see the pebble bump in 5d1428a preceded by c50ccc9
I don't see 95a13a8 around there.
What am I missing?

Perhaps the cause is actually an earlier commit to pebble_iterator.go 338e623
I am reading https://github.com/cockroachdb/cockroach/blob/master/pkg/storage/engine.go#L163-L171 as saying both upper and lower bound are optional, so there are cases where one may be provided and not the other. Also, the sentence "UpperBound must be provided unless Prefix is true, in which case the end of the prefix will be used as the upper bound." seems incorrect. If there are cases where pebbleIterator.setOptions is called with only one of the bounds, I believe we have a bug -- it should be doing what SetUpperBound is doing in copying the lower bound to the curBuf index.

@nvanbenschoten
Copy link
Member

I don't see 95a13a8 around there.
What am I missing?

They're back-to-back in https://github.com/cockroachdb/cockroach/commits/d15daba173ac5f7601c70ca6dad57a5a21c2c77b. Am I confusing myself with merge commits? I was comparing d15daba vs. 95a13a8 (its parent).

@sumeerbhola
Copy link
Collaborator

If there are cases where pebbleIterator.setOptions is called with only one of the bounds, I believe we have a bug -- it should be doing what SetUpperBound is doing in copying the lower bound to the curBuf index.

Misread the code -- there isn't a bug there. Unlike SetUpperBound which inherits the lower-bound, here we overwrite both bounds.

@tbg
Copy link
Member

tbg commented Sep 9, 2020

@sumeerbhola try running a bunch of them concurrently, see the --count and --parallelism flags. You're looking for failures that have the "dead node detection" line.
The test has a base failure rate of 25% (overheard from Nathan) so running singles won't be too effective.

@tbg
Copy link
Member

tbg commented Sep 9, 2020

Oh, looks like you got a "dead node" from your description? It should be in the log files, artifacts/..../<n>.logs/cockroach.*

@sumeerbhola
Copy link
Collaborator

There are logs in both artifacts/..../<n>.logs/ and artifacts/..../<n>.logs/logs/ and I was looking in the former. The latter has the failure. I am trying to find where this value is read from the store but got lost.

F200909 03:00:08.898769 475974 kv/kvserver/replica_write.go:160  [n2,s2,r13754/3:/Table/57/3/{22/6/<clipped by me because of copy-paste failing>
lease applied index bumped while the range was subsumed
goroutine 475974 [running]:
github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0x932e501, 0x86a4a7, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/get_stacks.go:25 +0xb8
github.com/cockroachdb/cockroach/pkg/util/log.(*loggerT).outputLogEntry(0x932d660, 0x4, 0x1632fe4967e0509a, 0x74346, 0x85b98db, 0x1c, 0xa0, 0xc0050b1880, 0x37, 0xc0063032c0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:243 +0x9e5
github.com/cockroachdb/cockroach/pkg/util/log.addStructured(0x608f980, 0xc004967c00, 0x4, 0x2, 0x50dc28d, 0x37, 0x0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/structured.go:52 +0x1ac
github.com/cockroachdb/cockroach/pkg/util/log.logDepth(0x608f980, 0xc004967c00, 0x1, 0xc000000004, 0x50dc28d, 0x37, 0x0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:44 +0x8c
github.com/cockroachdb/cockroach/pkg/util/log.Fatalf(...)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:153
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).executeWriteBatch(0xc001c7f800, 0x608f980, 0xc004967c00, 0xc0086abb90, 0x1632faf30f569bc6, 0x0, 0x0, 0x200000002, 0x3, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_write.go:160 +0x1573
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).executeBatchWithConcurrencyRetries(0xc001c7f800, 0x608f980, 0xc004967c00, 0xc0086abb90, 0x55f2bb0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_send.go:357 +0x3f5
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).sendWithRangeID(0xc001c7f800, 0x608f980, 0xc004967c00, 0x35ba, 0xc0086abb90, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_send.go:98 +0x75e
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).Send(0xc001c7f800, 0x608f980, 0xc004967c00, 0x1632fe4967dda72c, 0x0, 0x0, 0x0, 0x0, 0x35ba, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_send.go:37 +0x93
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*pendingLeaseRequest).requestLeaseAsync.func2(0x608f980, 0xc004967c00)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_range_lease.go:401 +0x5b7
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask.func1(0xc00043e900, 0x608f980, 0xc004967c00, 0xc0050b1640, 0x35, 0x61e1460, 0xc0004c1520, 0xc00527f220)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:347 +0x152
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:342 +0x131

@nvanbenschoten
Copy link
Member

There were a collection of patches that I needed to run on top of d15daba to avoid other recently fixed failures. It looks like they were 3c09757 and 5430170. 0a9ee34 is probably also a good one to have.

If you're looking to reproduce this, I'd suggest running tpcc/mixed-headroom/n5cpu16 (#53535) instead. That test is faster and more stable.

@nvanbenschoten
Copy link
Member

@sumeerbhola do you remember what happened here? We tracked this down to d15daba and then what? I'm assuming the issue was resolved, so I'm hoping we can point to the fix and close this issue.

@sumeerbhola
Copy link
Collaborator

sumeerbhola commented Feb 23, 2021

The Pebble optimization bugs were reproduced in Pebble tests, and the fixed optimizations were re-introduced in a series of Pebble PRs cockroachdb/pebble#947 cockroachdb/pebble#949 cockroachdb/pebble#951, and the corresponding Pebble bump in CockroachDB was merged on Oct 23 #55895

@nvanbenschoten
Copy link
Member

Thanks! So it sounds like we can close this then.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot.
Projects
None yet
Development

No branches or pull requests

7 participants