kvserver: TestCheckConsistencyInconsistent failed, data race in TestingSetRedactable #81819
This test will be disabled under race in #82042. However, the underlying race should still be addressed.
82042: kvserver: skip `TestCheckConsistencyInconsistent` under race r=tbg a=erikgrinaker

The test is rather slow and also timing-dependent, which can result in false positives. Furthermore, the call to `TestingSetRedactable` triggers the race detector. This patch therefore disables the test under the race detector.

Touches #81819.

Release note: None

Co-authored-by: Erik Grinaker <[email protected]>
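For reference, a minimal sketch of what such a skip looks like in this codebase, assuming the `pkg/testutils/skip` package; the real test body is elided:

```go
package kvserver_test

import (
	"testing"

	"github.com/cockroachdb/cockroach/pkg/testutils/skip"
)

func TestCheckConsistencyInconsistent(t *testing.T) {
	// Skip under the race detector: the test is slow, timing-dependent,
	// and its use of TestingSetRedactable trips the race detector.
	skip.UnderRace(t, "slow and timing-dependent; TestingSetRedactable races")
	// ... test body elided ...
}
```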
We have marked this test failure issue as stale because it has been …
Hi @shralex, please add branch-* labels to identify which branch(es) this release-blocker affects. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
I unskipped the test under race and stressed it:
This did not hit the data race mentioned above. However, after about 10m of stress, it failed with the following, more concerning error:
I've since been stressing without race and have not hit the same failure. EDIT: I quickly hit this again with …
I've been able to hit this failure as far back as 4f904eb. @nicktrav could we have someone from Replication take a look at this test failure and determine what to do about it? Maybe @pavelkalinnikov, who worked with this test last October. There's an argument that what we're seeing here should be promoted to a …
cc @cockroachdb/replication
Reproduces fairly quickly on my end too. Investigating.
I was bisecting and hit the original race after a while (got it at 4eb5451, but it's probably a long-existing one):
Another error:
Here is what happened during one of the test failures:
Node …

Shortly after, all 3 nodes got a second-phase request: …

Nodes 2 and 3 quickly succeeded in creating the checkpoint @ log index 41: …

5+eps seconds later, the leaseholder ended the second phase of the consistency check: …

Notably, only nodes 2 and 3 returned a checksum. Node 1 timed out at 5 seconds and returned an error. Only 10 more seconds later, node 1 finally reported that it created a checksum: …

And exactly 5 seconds later it gave up on keeping the result of the checksum computation, because the result collection request never showed up (we know that it timed out above): …

So what's probably happening is: …

The problem is that on node 1 the checkpoint directory isn't complete yet. Even the …
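To make that sequence concrete, here is a small, self-contained Go model of the suspected interleaving; the names and durations are illustrative assumptions, not the actual kvserver code. The leaseholder waits at most 5 seconds per replica for a checksum, while each replica populates its checkpoint in the background, so a slow replica can leave behind a directory that exists but is only partially populated.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// collectChecksum models the leaseholder side of the second phase:
// wait for a replica's checksum, but give up after the timeout.
func collectChecksum(done <-chan string, timeout time.Duration) (string, error) {
	select {
	case sum := <-done:
		return sum, nil
	case <-time.After(timeout):
		return "", errors.New("checksum collection timed out")
	}
}

func main() {
	done := make(chan string, 1)
	go func() {
		// Model of node 1: the checkpoint directory is created early,
		// but populating it takes longer than the collection timeout.
		time.Sleep(8 * time.Second) // exaggerated for illustration
		done <- "checksum-of-range"
	}()
	if _, err := collectChecksum(done, 5*time.Second); err != nil {
		// At this point the test sees node 1's checkpoint directory,
		// but it is still incomplete; this is the race described above.
		fmt.Println(err)
	}
}
```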
This commit makes it so that the consistency checker checkpoints are first created in a "_pending" folder, and only after completion are atomically renamed to the intended directory name. This is to help distinguish valid checkpoints from ones that weren't finalized (for example, when the node crashed in the meantime).

Part of cockroachdb#81819
Epic: none

Release note (ops change): checkpoint directories (which can be created in the rare event of range inconsistency) are now clearly indicated as pending until they are fully populated. This helps operators distinguish valid checkpoints from corrupted ones.
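A hedged sketch of the "_pending" pattern the commit describes (function and file names here are assumptions, not the actual kvserver code): populate the checkpoint under a pending name, then atomically rename it, so a directory with the final name is always complete.

```go
package main

import (
	"fmt"
	"log"
	"os"
	"path/filepath"
)

// createCheckpoint writes a checkpoint into "<name>_pending" and only
// renames it to "<name>" once populate has finished. rename(2) is atomic
// on POSIX filesystems, so readers see either no final directory or a
// fully populated one, never a partial checkpoint.
func createCheckpoint(auxDir, name string, populate func(dir string) error) error {
	pending := filepath.Join(auxDir, name+"_pending")
	if err := os.MkdirAll(pending, 0755); err != nil {
		return err
	}
	if err := populate(pending); err != nil {
		// Leave the _pending directory behind: its name tells operators
		// (and tests) that the checkpoint was never finalized.
		return err
	}
	return os.Rename(pending, filepath.Join(auxDir, name))
}

func main() {
	auxDir, err := os.MkdirTemp("", "auxiliary")
	if err != nil {
		log.Fatal(err)
	}
	defer os.RemoveAll(auxDir)

	err = createCheckpoint(auxDir, "checkpoint-0041", func(dir string) error {
		// Stand-in for copying SSTs etc. into the checkpoint.
		return os.WriteFile(filepath.Join(dir, "MANIFEST"), []byte("..."), 0644)
	})
	fmt.Println("checkpoint created, err =", err)
}
```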
98527: sql: deprecate gossip-based physical planning r=dt a=dt

Previously, system tenants and secondary tenants used different physical planning implementations, with the system tenant, and only the system tenant, using NodeIDs while other tenants used the instance table. This unifies those implementations such that all tenants use NodeIDs if running in mixed mode and use the instance table if not.

Release note: none.
Epic: CRDB-16910

98879: sql: support array-flatten subqueries within UDFs r=mgartner a=mgartner

Array-flatten subqueries (e.g., `ARRAY(SELECT a FROM t)`) are now supported within UDFs. They are now converted to a normal subquery with a ScalarGroupBy if they exist within a UDF, even if they are uncorrelated. This allows them to be executed without any changes to the execbuilder or the evaluation logic of `tree.Routine`.

Fixes #98738

Release note: None

98988: ui: add badges for filter elements r=maryliag a=maryliag

Adds badges for each of the selected filters on the SQL Activity and Insights pages.

Part of #98891

https://www.loom.com/share/e7417209fc704d71b2733f58609fb4de

Release note (ui change): Adds badges for each selected filter on the SQL Activity and Insights pages.

99042: sql: added indexes to system.statement_statistics, system.transaction_statistics r=ericharmeling a=ericharmeling

Fixes #98624. This commit adds indexes on new computed columns to the system.statement_statistics and system.transaction_statistics tables.

Epic: none
Release note: None

99054: pkg/ccl: Unskip TestTenantStatusAPI/tenant_ranges/pagination r=dhartunian a=abarganier

Fixes: #92979

Previously, in #97386, we skipped test_tenant_ranges_pagination because it was marked as flaky. The test makes a request for a single range and expects an offset of `1` back. It then uses this offset to request a second range, and expects an offset of `2`. This means that the test requires at least 3 ranges to exist on the tenant. The test was flaking on the assertion that the offset returned by the second request came back as `2`. Instead, it was flaking when the offset came back as `0`, which signifies that there are no more ranges to process. We learned that the tenant creation process asynchronously splits ranges, which is what would lead to this sporadic scenario where not enough ranges existed (yet) for the test to succeed. This patch updates the test with a `testutils.SucceedsSoon` clause that first checks that `crdb_internal.ranges` contains at least 3 ranges, prior to making the second request. This should give the range split queue sufficient time to be processed and eliminate the vast majority of these test flakes.

Release note: none

99119: kvserver: mark in-flight storage checkpoints r=tbg a=pavelkalinnikov

This commit makes it so that the consistency checker checkpoints are first created in a "_pending" folder, and only after completion are atomically renamed to the intended directory name. This is to help distinguish valid checkpoints from ones that weren't finalized (for example, when the node crashed in the meantime).

Part of #81819
Epic: none

Release note (ops change): checkpoint directories (which can be created in the rare event of range inconsistency) are now clearly indicated as pending until they are fully populated. This helps operators distinguish valid checkpoints from corrupted ones.
Co-authored-by: David Taylor <[email protected]>
Co-authored-by: Marcus Gartner <[email protected]>
Co-authored-by: maryliag <[email protected]>
Co-authored-by: Eric Harmeling <[email protected]>
Co-authored-by: Alex Barganier <[email protected]>
Co-authored-by: Pavel Kalinnikov <[email protected]>
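As an aside, the `testutils.SucceedsSoon` retry pattern from #99054 looks roughly like this; the helper name and plumbing are assumptions, and only the query and the threshold come from the PR description:

```go
package kvserver_test

import (
	"database/sql"
	"fmt"
	"testing"

	"github.com/cockroachdb/cockroach/pkg/testutils"
)

// waitForRanges polls until the tenant reports at least min ranges,
// so the paginated request has something to page through.
func waitForRanges(t *testing.T, db *sql.DB, min int) {
	testutils.SucceedsSoon(t, func() error {
		var n int
		if err := db.QueryRow(
			`SELECT count(*) FROM crdb_internal.ranges`,
		).Scan(&n); err != nil {
			return err
		}
		if n < min {
			return fmt.Errorf("have %d ranges, want at least %d", n, min)
		}
		return nil // enough ranges; the retry loop stops here
	})
}
```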
This commit fixes a possible race in TestCheckConsistencyInconsistent:

- Node 2 is corrupted.
- The second phase of the consistency check times out on node 1, and returns early when only nodes 2 and 3 have created the storage checkpoint.
- Node 1 hasn't created the checkpoint, but has started doing so.
- Node 2 "fatals" (mocked out in the test) shortly after the check is complete.
- Node 1 is still creating its checkpoint, but has probably created the directory by now.
- Hence, the test assumes that the checkpoint has been created, and proceeds to open it and compute the checksum of the range.

This commit makes the test wait for the moment when all the checkpoints are known to be fully populated.

Part of #81819
Epic: none
Release note: none
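A hedged sketch of what such a wait could look like, assuming the "_pending" naming from #99119 and the `testutils.SucceedsSoon` helper; the helper below is illustrative, not the actual test code:

```go
package kvserver_test

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
	"testing"

	"github.com/cockroachdb/cockroach/pkg/testutils"
)

// waitForCheckpoints blocks until every store has at least one
// checkpoint and none of them is still marked "_pending", so the test
// only ever opens fully populated checkpoints.
func waitForCheckpoints(t *testing.T, checkpointDirs []string) {
	testutils.SucceedsSoon(t, func() error {
		for _, dir := range checkpointDirs {
			entries, err := os.ReadDir(dir)
			if err != nil {
				return err // checkpoint parent directory not created yet
			}
			if len(entries) == 0 {
				return fmt.Errorf("no checkpoint in %s yet", dir)
			}
			for _, e := range entries {
				if strings.HasSuffix(e.Name(), "_pending") {
					return fmt.Errorf("checkpoint %s still pending",
						filepath.Join(dir, e.Name()))
				}
			}
		}
		return nil
	})
}
```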
Jira issue: CRDB-16183