
cli: TestDebugCheckStore failed #112219

Closed
cockroach-teamcity opened this issue Oct 12, 2023 · 7 comments
Labels
branch-master Failures and bugs on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). no-test-failure-activity O-robot Originated from a bot. P-3 Issues/test failures with no fix SLA T-kv KV Team X-stale

@cockroach-teamcity
Member

cockroach-teamcity commented Oct 12, 2023

cli.TestDebugCheckStore failed with artifacts on master @ 619aa6edad0269f75d12cb7d4b88b52f09825c7b:

=== RUN   TestDebugCheckStore
    test_log_scope.go:170: test logs captured to: /artifacts/tmp/_tmp/1855398a359288315eef18ccd1f2c1dd/logTestDebugCheckStore3562071732
    test_log_scope.go:81: use -show-logs to present logs inline
    debug_check_store_test.go:83: 
        	Error Trace:	github.com/cockroachdb/cockroach/pkg/cli/debug_check_store_test.go:83
        	Error:      	Received unexpected error:
        	            	check-store found problems
        	            	(1) attached stack trace
        	            	  -- stack trace:
        	            	  | github.com/cockroachdb/cockroach/pkg/cli.init
        	            	  | 	github.com/cockroachdb/cockroach/pkg/cli/debug_check_store.go:54
        	            	  | runtime.doInit
        	            	  | 	GOROOT/src/runtime/proc.go:6542
        	            	  | runtime.doInit
        	            	  | 	GOROOT/src/runtime/proc.go:6519
        	            	  | runtime.main
        	            	  | 	GOROOT/src/runtime/proc.go:233
        	            	  | runtime.goexit
        	            	  | 	src/runtime/asm_amd64.s:1598
        	            	Wraps: (2) check-store found problems
        	            	Error types: (1) *withstack.withStack (2) *errutil.leafError
        	Test:       	TestDebugCheckStore
        	Messages:   	dir=/artifacts/tmp/_tmp/1855398a359288315eef18ccd1f2c1dd/TestDebugCheckStore1683931705/s1
        	            	out=r52:/Table/5{0-3} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=6]: stats inconsistency:
        	            	- stored:
        	            	{ContainsEstimates:0 LastUpdateNanos:1697101719175390361 LockAge:0 GCBytesAge:0 LiveBytes:0 LiveCount:0 KeyBytes:0 KeyCount:0 ValBytes:0 ValCount:0 IntentBytes:0 IntentCount:0 LockBytes:0 LockCount:0 RangeKeyCount:0 RangeKeyBytes:0 RangeValCount:0 RangeValBytes:0 SysBytes:1210 SysCount:9 AbortSpanBytes:0}
        	            	- recomputed:
        	            	{ContainsEstimates:0 LastUpdateNanos:1697101719175390361 LockAge:0 GCBytesAge:0 LiveBytes:0 LiveCount:0 KeyBytes:0 KeyCount:0 ValBytes:0 ValCount:0 IntentBytes:0 IntentCount:0 LockBytes:0 LockCount:0 RangeKeyCount:0 RangeKeyBytes:0 RangeValCount:0 RangeValBytes:0 SysBytes:1204 SysCount:9 AbortSpanBytes:0}
        	            	- diff:
        	            	SysBytes: 1210 != 1204
        	            	scanned 58 ranges (1 with estimates), total stats contains_estimates:0 last_update_nanos:1697101957442661936 lock_age:0 gc_bytes_age:464602417 live_bytes:9990890 live_count:17302 key_bytes:1114908 key_count:19615 val_bytes:9951584 val_count:24560 intent_bytes:0 intent_count:0 lock_bytes:0 lock_count:0 range_key_count:0 range_key_bytes:0 range_val_count:0 range_val_bytes:0 sys_bytes:34919 sys_count:369 abort_span_bytes:288 
        	            	
    panic.go:540: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/1855398a359288315eef18ccd1f2c1dd/logTestDebugCheckStore3562071732
--- FAIL: TestDebugCheckStore (1115.36s)

Parameters: TAGS=bazel,gss, stress=true

Help

See also: How To Investigate a Go Test Failure (internal)

/cc @cockroachdb/kv @cockroachdb/server

This test on roachdash | Improve this report!

Jira issue: CRDB-32303

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-kv KV Team labels Oct 12, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.2 milestone Oct 12, 2023
@erikgrinaker
Contributor

These SysBytes discrepancies are typically because of lease requests, which don't respect latches and can therefore race with other MVCC stats updates. Fixing it isn't trivial, and the consequences are minor, so it's not worth spending time on.

See e.g. #93896.

@erikgrinaker erikgrinaker removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Oct 12, 2023
@andrewbaptist
Collaborator

Should we update this test to allow for some difference in the stats, e.g. by allowing the stored value to be slightly higher (16 bytes?) than the recomputed value?
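To make the suggestion concrete, here is a minimal sketch of a slack-tolerant comparison. The name statsWithinSlack and the single-field comparison are hypothetical: the real test compares full MVCC stats structs, not just SysBytes.

```go
package main

import "fmt"

// statsWithinSlack reports whether the stored value is at or above the
// recomputed value by at most slackBytes. Hypothetical helper, not the
// actual test code.
func statsWithinSlack(stored, recomputed, slackBytes int64) bool {
	diff := stored - recomputed
	return diff >= 0 && diff <= slackBytes
}

func main() {
	// Values from the failure above: stored SysBytes 1210, recomputed 1204.
	fmt.Println(statsWithinSlack(1210, 1204, 16)) // a 16-byte slack absorbs the 6-byte drift
	fmt.Println(statsWithinSlack(1210, 1204, 0))  // an exact comparison still fails
}
```

Note the asymmetry: the slack only covers the stored value running ahead of the recomputed one, matching the direction of drift observed here.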

@andrewbaptist andrewbaptist added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Oct 19, 2023
@erikgrinaker
Contributor

This test and command (debug check-store) are specifically about detecting stats mismatches, though. It would be better to ensure the initial test fixture always has valid stats, e.g. by explicitly recomputing (and/or verifying) stats before shutting down the cluster and running check-store.
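The ordering being proposed can be sketched as follows. All names here (rangeStats, recomputeStats, checkStore) are hypothetical stand-ins for the cluster-side stats recomputation and the debug check-store command, not CockroachDB APIs.

```go
package main

import "fmt"

// rangeStats is a stand-in for a range's MVCC stats; only SysBytes matters here.
type rangeStats struct{ SysBytes int64 }

// recomputeStats stands in for forcing a stats recomputation on each range
// before the cluster shuts down, making the stored stats authoritative again.
func recomputeStats(stored *rangeStats, actual rangeStats) {
	*stored = actual
}

// checkStore stands in for the check-store pass: it demands an exact match
// between stored and recomputed stats.
func checkStore(stored, actual rangeStats) error {
	if stored != actual {
		return fmt.Errorf("stats inconsistency: SysBytes %d != %d",
			stored.SysBytes, actual.SysBytes)
	}
	return nil
}

func main() {
	stored := rangeStats{SysBytes: 1210} // drifted, e.g. via a racing lease request
	actual := rangeStats{SysBytes: 1204}

	recomputeStats(&stored, actual)         // reconcile before shutdown...
	fmt.Println(checkStore(stored, actual)) // ...so the exact check passes
}
```

The point is that the exact comparison in check-store stays intact; only the test fixture is made consistent before the check runs.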

@kvoli
Collaborator

kvoli commented Dec 1, 2023

These SysBytes discrepancies are typically because of lease requests, which don't respect latches and can therefore race with other MVCC stats updates. Fixing it isn't trivial, and the consequences are minor, so it's not worth spending time on.

Assigning P3.

@kvoli kvoli added the P-3 Issues/test failures with no fix SLA label Dec 1, 2023
@cockroach-teamcity
Member Author

cli.TestDebugCheckStore failed on master @ 22d9482e646ae127981386d77b7ebf729346fc3b:

W231209 09:06:41.506553 2059560 kv/kvserver/replica_gossip.go:158  [T1,Vsystem,n3,s3,r2/3:/System/NodeLiveness{-Max}] 19  could not acquire lease for range gossip: replica unavailable: (n3,s3):3 unable to serve request to r2:/System/NodeLiveness{-Max} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=4]: closed timestamp: 1702112536.902023277,0 (2023-12-09 09:02:16); raft status: {"id":"3","term":42,"vote":"1","commit":1385,"lead":"0","raftState":"StateFollower","applied":1385,"progress":{},"leadtransferee":"0"}: have been waiting 62.50s for slow proposal HeartbeatTxn [/System/NodeLiveness/2,/Min), [txn: 9b793f55]
W231209 09:06:41.618957 57371 kv/kvserver/liveness/liveness.go:668  [T1,Vsystem,n1,liveness-hb] 20  failed node liveness heartbeat: replica unavailable: (n1,s1):1 unable to serve request to r2:/System/NodeLiveness{-Max} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=4]: closed timestamp: 1702112536.902023277,0 (2023-12-09 09:02:16); raft status: {"id":"1","term":53,"vote":"1","commit":1385,"lead":"0","raftState":"StateFollower","applied":1385,"progress":{},"leadtransferee":"0"}: have been waiting 62.50s for slow proposal RequestLease [/System/NodeLiveness,/Min)
W231209 09:06:41.618957 57371 kv/kvserver/liveness/liveness.go:668  [T1,Vsystem,n1,liveness-hb] 20 +
W231209 09:06:41.618957 57371 kv/kvserver/liveness/liveness.go:668  [T1,Vsystem,n1,liveness-hb] 20 +An inability to maintain liveness will prevent a node from participating in a
W231209 09:06:41.618957 57371 kv/kvserver/liveness/liveness.go:668  [T1,Vsystem,n1,liveness-hb] 20 +cluster. If this problem persists, it may be a sign of resource starvation or
W231209 09:06:41.618957 57371 kv/kvserver/liveness/liveness.go:668  [T1,Vsystem,n1,liveness-hb] 20 +of network connectivity problems. For help troubleshooting, visit:
W231209 09:06:41.618957 57371 kv/kvserver/liveness/liveness.go:668  [T1,Vsystem,n1,liveness-hb] 20 +
W231209 09:06:41.618957 57371 kv/kvserver/liveness/liveness.go:668  [T1,Vsystem,n1,liveness-hb] 20 +    https://www.cockroachlabs.com/docs/stable/cluster-setup-troubleshooting.html#node-liveness-issues
I231209 09:06:41.636362 2059645 kv/kvserver/replica_raft.go:1571  [T1,Vsystem,n3,s3,r23/2:/Table/2{0-1},raft] 21  slow non-blocking raft commit: commit-wait 565.125893ms sem 570ns
I231209 09:06:41.636820 2059624 kv/kvserver/store_raft.go:699  [T1,Vsystem,n3,s3,r45/2:/Table/4{2-3},raft] 22  raft ready handling: 0.63s [append=0.00s, apply=0.00s, sync=0.63s, other=0.00s], wrote [append-batch=58 B, append-ent=0 B (1), ] pebble stats: [commit-wait 634.718085ms sem 210ns]; node might be overloaded
I231209 09:06:41.649718 1803165 kv/kvserver/store_raft.go:699  [T1,Vsystem,n2,s2,r1/3:/{Min-System/NodeL…},raft] 23  raft ready handling: 0.61s [append=0.00s, apply=0.00s, sync=0.61s, other=0.00s], wrote [append-batch=58 B, append-ent=0 B (1), ] pebble stats: [commit-wait 608.128618ms sem 720ns]; node might be overloaded
I231209 09:06:41.659433 262294308 gossip/client.go:124  [T1,Vsystem,n2] 24  started gossip client to n0 (127.0.0.1:43731)
W231209 09:06:41.958815 56640 go.etcd.io/raft/v3/raft.go:1233  [T1,Vsystem,n1,s1,r1/1:/{Min-System/NodeL…}] 25  1 stepped down to follower since quorum is not active
I231209 09:06:42.001564 1803170 kv/kvserver/store_raft.go:699  [T1,Vsystem,n2,s2,r18/2:/Table/1{5-6},raft] 26  raft ready handling: 0.85s [append=0.00s, apply=0.00s, sync=0.85s, other=0.00s], wrote [append-batch=58 B, append-ent=0 B (1), ] pebble stats: [commit-wait 851.95415ms sem 750ns]; node might be overloaded
I231209 09:06:42.002171 1803205 kv/kvserver/store_raft.go:699  [T1,Vsystem,n2,s2,r6/3:/Table/{0-4},raft] 27  raft ready handling: 0.84s [append=0.00s, apply=0.00s, sync=0.84s, other=0.00s], wrote [append-batch=57 B, append-ent=0 B (1), ] pebble stats: [commit-wait 835.434212ms sem 549ns]; node might be overloaded
I231209 09:06:42.098698 262372626 gossip/client.go:124  [T1,Vsystem,n3] 28  started gossip client to n0 (127.0.0.1:43411)
W231209 09:06:42.286578 57263 server/node.go:985  [T1,Vsystem,n1] 29  problem getting store descriptor for store {ClusterID:e608c456-31a6-44a9-9a25-3368f31422a8 NodeID:1 StoreID:1}: lstat /var/lib/engflow/worker/work/1/exec/_tmp/dd773376e09f16cb3a6403834f045a44/TestDebugCheckStore988297895: no such file or directory
I231209 09:06:42.323780 1803195 kv/kvserver/store_raft.go:699  [T1,Vsystem,n2,s2,r23/3:/Table/2{0-1},raft] 30  raft ready handling: 0.89s [append=0.00s, apply=0.00s, sync=0.88s, other=0.00s], wrote [append-batch=49 B, ] pebble stats: [commit-wait 884.873326ms sem 130ns]; node might be overloaded
W231209 09:06:42.328562 1803141 kv/kvserver/replica_gossip.go:158  [T1,Vsystem,n2,s2,r2/2:/System/NodeLiveness{-Max}] 31  could not acquire lease for range gossip: replica unavailable: (n2,s2):2 unable to serve request to r2:/System/NodeLiveness{-Max} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=4]: closed timestamp: 1702112536.902023277,0 (2023-12-09 09:02:16); raft status: {"id":"2","term":63,"vote":"2","commit":1385,"lead":"0","raftState":"StateCandidate","applied":1385,"progress":{},"leadtransferee":"0"}: have been waiting 62.00s for slow proposal RequestLease [/System/NodeLiveness,/Min)
W231209 09:06:42.477814 2059560 kv/kvserver/replica_gossip.go:158  [T1,Vsystem,n3,s3,r2/3:/System/NodeLiveness{-Max}] 32  could not acquire lease for range gossip: replica unavailable: (n3,s3):3 unable to serve request to r2:/System/NodeLiveness{-Max} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=4]: closed timestamp: 1702112536.902023277,0 (2023-12-09 09:02:16); raft status: {"id":"3","term":42,"vote":"1","commit":1385,"lead":"0","raftState":"StateFollower","applied":1385,"progress":{},"leadtransferee":"0"}: have been waiting 62.50s for slow proposal HeartbeatTxn [/System/NodeLiveness/2,/Min), [txn: 9b793f55]
W231209 09:06:42.518646 56616 kv/kvserver/replica_gossip.go:158  [T1,Vsystem,n1,s1,r2/1:/System/NodeLiveness{-Max}] 33  could not acquire lease for range gossip: replica unavailable: (n1,s1):1 unable to serve request to r2:/System/NodeLiveness{-Max} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=4]: closed timestamp: 1702112536.902023277,0 (2023-12-09 09:02:16); raft status: {"id":"1","term":53,"vote":"1","commit":1385,"lead":"0","raftState":"StateFollower","applied":1385,"progress":{},"leadtransferee":"0"}: have been waiting 62.50s for slow proposal RequestLease [/System/NodeLiveness,/Min)
I231209 09:06:42.520549 56655 kv/kvserver/store_raft.go:699  [T1,Vsystem,n1,s1,r52/1:/Table/5{0-1},raft] 34  raft ready handling: 0.68s [append=0.00s, apply=0.00s, sync=0.68s, other=0.00s], wrote [append-batch=94 B, append-ent=0 B (1), apply=0 B (4)] pebble stats: [commit-wait 681.988369ms sem 200ns]; node might be overloaded
I231209 09:06:42.521256 56680 kv/kvserver/store_raft.go:699  [T1,Vsystem,n1,s1,r38/1:/Table/3{5-6},raft] 35  raft ready handling: 0.68s [append=0.00s, apply=0.00s, sync=0.68s, other=0.00s], wrote [append-batch=57 B, append-ent=0 B (1), ] pebble stats: [commit-wait 680.000289ms sem 210ns]; node might be overloaded
I231209 09:06:42.589589 1803185 kv/kvserver/store_raft.go:699  [T1,Vsystem,n2,s2,r43/3:/Table/4{0-1},raft] 36  raft ready handling: 0.59s [append=0.00s, apply=0.00s, sync=0.59s, other=0.00s], wrote [append-batch=58 B, append-ent=0 B (1), ] pebble stats: [commit-wait 591.7946ms sem 350ns]; node might be overloaded
I231209 09:06:42.591470 1803200 kv/kvserver/store_raft.go:699  [T1,Vsystem,n2,s2,r38/2:/Table/3{5-6},raft] 37  raft ready handling: 0.71s [append=0.00s, apply=0.00s, sync=0.71s, other=0.00s], wrote [append-batch=57 B, append-ent=0 B (1), ] pebble stats: [commit-wait 709.779536ms sem 110ns]; node might be overloaded
I231209 09:06:42.591809 1803175 kv/kvserver/store_raft.go:699  [T1,Vsystem,n2,s2,r59/2:/Table/5{7-8},raft] 38  raft ready handling: 0.59s [append=0.00s, apply=0.00s, sync=0.59s, other=0.00s], wrote [append-batch=57 B, append-ent=0 B (1), ] pebble stats: [commit-wait 594.26863ms sem 740ns]; node might be overloaded
W231209 09:06:42.641350 1803592 kv/kvserver/liveness/liveness.go:752  [T1,Vsystem,n2,liveness-hb] 39  slow heartbeat took 3.029469002s; err=context deadline exceeded
W231209 09:06:42.641540 1803592 kv/kvserver/liveness/liveness.go:668  [T1,Vsystem,n2,liveness-hb] 40  failed node liveness heartbeat: operation "node liveness heartbeat" timed out after 3.03s (given timeout 3s): context deadline exceeded
W231209 09:06:42.641540 1803592 kv/kvserver/liveness/liveness.go:668  [T1,Vsystem,n2,liveness-hb] 40 +
W231209 09:06:42.641540 1803592 kv/kvserver/liveness/liveness.go:668  [T1,Vsystem,n2,liveness-hb] 40 +An inability to maintain liveness will prevent a node from participating in a
W231209 09:06:42.641540 1803592 kv/kvserver/liveness/liveness.go:668  [T1,Vsystem,n2,liveness-hb] 40 +cluster. If this problem persists, it may be a sign of resource starvation or
W231209 09:06:42.641540 1803592 kv/kvserver/liveness/liveness.go:668  [T1,Vsystem,n2,liveness-hb] 40 +of network connectivity problems. For help troubleshooting, visit:
W231209 09:06:42.641540 1803592 kv/kvserver/liveness/liveness.go:668  [T1,Vsystem,n2,liveness-hb] 40 +
W231209 09:06:42.641540 1803592 kv/kvserver/liveness/liveness.go:668  [T1,Vsystem,n2,liveness-hb] 40 +    https://www.cockroachlabs.com/docs/stable/cluster-setup-troubleshooting.html#node-liveness-issues
W231209 09:06:42.649327 2059594 go.etcd.io/raft/v3/raft.go:1233  [T1,Vsystem,n3,s3,r22/3:/Table/{19-20}] 41  3 stepped down to follower since quorum is not active
W231209 09:06:42.649608 2059594 go.etcd.io/raft/v3/raft.go:1233  [T1,Vsystem,n3,s3,r9/2:/Table/{5-6}] 42  2 stepped down to follower since quorum is not active
W231209 09:06:42.649960 2059594 go.etcd.io/raft/v3/raft.go:1233  [T1,Vsystem,n3,s3,r3/2:/System/{NodeLive…-tsd}] 43  2 stepped down to follower since quorum is not active
W231209 09:06:42.650262 2059594 go.etcd.io/raft/v3/raft.go:1233  [T1,Vsystem,n3,s3,r63/3:/Table/6{1-2}] 44  3 stepped down to follower since quorum is not active
W231209 09:06:42.650509 2059594 go.etcd.io/raft/v3/raft.go:1233  [T1,Vsystem,n3,s3,r43/4:/Table/4{0-1}] 45  4 stepped down to follower since quorum is not active
W231209 09:06:42.650831 2059594 go.etcd.io/raft/v3/raft.go:1233  [T1,Vsystem,n3,s3,r38/3:/Table/3{5-6}] 46  3 stepped down to follower since quorum is not active
W231209 09:06:42.651285 2059594 go.etcd.io/raft/v3/raft.go:1233  [T1,Vsystem,n3,s3,r10/2:/Table/{6-7}] 47  2 stepped down to follower since quorum is not active
W231209 09:06:42.651646 2059594 go.etcd.io/raft/v3/raft.go:1233  [T1,Vsystem,n3,s3,r16/2:/Table/1{3-4}] 48  2 stepped down to follower since quorum is not active
W231209 09:06:42.651894 2059594 go.etcd.io/raft/v3/raft.go:1233  [T1,Vsystem,n3,s3,r59/3:/Table/5{7-8}] 49  3 stepped down to follower since quorum is not active
W231209 09:06:42.652181 2059594 go.etcd.io/raft/v3/raft.go:1233  [T1,Vsystem,n3,s3,r52/3:/Table/5{0-1}] 50  3 stepped down to follower since quorum is not active
W231209 09:06:42.652407 2059594 go.etcd.io/raft/v3/raft.go:1233  [T1,Vsystem,n3,s3,r45/2:/Table/4{2-3}] 51  2 stepped down to follower since quorum is not active
W231209 09:06:42.652672 2059594 go.etcd.io/raft/v3/raft.go:1233  [T1,Vsystem,n3,s3,r50/2:/Table/4{7-8}] 52  2 stepped down to follower since quorum is not active
W231209 09:06:42.669668 249209222 kv/kvserver/liveness/liveness.go:752  [T1,Vsystem,n2,s2,r39/2:/Table/3{6-7}] 53  slow heartbeat took 1m19.467970856s; err=replica unavailable: (n1,s1):1 unable to serve request to r2:/System/NodeLiveness{-Max} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=4]: closed timestamp: 1702112536.902023277,0 (2023-12-09 09:02:16); raft status: {"id":"1","term":53,"vote":"1","commit":1385,"lead":"0","raftState":"StateFollower","applied":1385,"progress":{},"leadtransferee":"0"}: have been waiting 62.50s for slow proposal RequestLease [/System/NodeLiveness,/Min)
W231209 09:06:42.670124 57257 server/node.go:1019  [T1,Vsystem,n1] 54  [n1,s1]: unable to compute metrics: lstat /var/lib/engflow/worker/work/1/exec/_tmp/dd773376e09f16cb3a6403834f045a44/TestDebugCheckStore988297895: no such file or directory
I231209 09:06:42.742970 56665 kv/kvserver/store_raft.go:699  [T1,Vsystem,n1,s1,r9/1:/Table/{5-6},raft] 55  raft ready handling: 0.67s [append=0.00s, apply=0.00s, sync=0.67s, other=0.00s], wrote [append-batch=57 B, append-ent=0 B (1), ] pebble stats: [commit-wait 671.553752ms sem 100ns]; node might be overloaded
--- FAIL: TestDebugCheckStore (1484.90s)

Parameters:

  • attempt=1
  • deadlock=true
  • run=1
  • shard=8

@cockroach-teamcity
Member Author

cli.TestDebugCheckStore failed on master @ e747f6e6857a19d6048cb184b0f55c52cb8a6390:

=== RUN   TestDebugCheckStore
    test_log_scope.go:170: test logs captured to: outputs.zip/logTestDebugCheckStore1292772866
    test_log_scope.go:81: use -show-logs to present logs inline

Parameters:

  • attempt=1
  • race=true
  • run=2
  • shard=8


github-actions bot commented Feb 5, 2024

We have marked this test failure issue as stale because it has been inactive for 1 month. If this failure is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 5 days to keep the test failure queue tidy.

@github-project-automation github-project-automation bot moved this to Incoming in KV Aug 28, 2024