Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvserver: TestCheckConsistencyInconsistent failed, data race in TestingSetRedactable #81819

Closed
erikgrinaker opened this issue May 25, 2022 · 10 comments · Fixed by #99114
Closed
Assignees
Labels
branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 C-test-failure Broken test (automatically or manually discovered). skipped-test

Comments

@erikgrinaker
Copy link
Contributor

erikgrinaker commented May 25, 2022

=== RUN   TestCheckConsistencyInconsistent
    test_log_scope.go:79: test logs captured to: /home/erik/.cache/bazel/_bazel_erik/58b8816bb117734f0a0eaaef7fcef95b/sandbox/linux-sandbox/11283/execroot/com_github_cockroachdb_cockroach/_tmp/33e1d369c27b9c01b2b6009c561815a3/logTestCheckConsistencyInconsistent3070084177
    test_log_scope.go:80: use -show-logs to present logs inline
==================
WARNING: DATA RACE
Write at 0x00c0007d9b68 by goroutine 47:
  github.com/cockroachdb/cockroach/pkg/util/log.TestingSetRedactable.func1()
      github.com/cockroachdb/cockroach/pkg/util/log/redact.go:199 +0xe4
  github.com/cockroachdb/cockroach/pkg/kv/kvserver_test.TestCheckConsistencyInconsistent()
      github.com/cockroachdb/cockroach/pkg/kv/kvserver_test/pkg/kv/kvserver/consistency_queue_test.go:477 +0x242b
  github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).WaitForFullReplication()
      github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:1244 +0x4f7
  github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).Start()
      github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:370 +0xa24
  github.com/cockroachdb/cockroach/pkg/server.(*TestServer).Start()
      github.com/cockroachdb/cockroach/pkg/server/testserver.go:498 +0x109
  github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).startServer()
      github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:501 +0x10a
  github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).Start()
      github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:327 +0x4b5
  github.com/cockroachdb/cockroach/pkg/server.(*TestServer).Start()
      github.com/cockroachdb/cockroach/pkg/server/testserver.go:498 +0x109
  github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).startServer()
      github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:501 +0x10a
  github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).Start()
      github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:327 +0x4b5
  github.com/cockroachdb/cockroach/pkg/testutils/testcluster.StartTestCluster()
      github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:196 +0xa8
  github.com/cockroachdb/cockroach/pkg/testutils/testcluster.NewTestCluster()
      github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:292 +0xcec
  github.com/cockroachdb/cockroach/pkg/testutils/testcluster.NewTestCluster()
      github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:292 +0xcec
  github.com/cockroachdb/cockroach/pkg/testutils/testcluster.StartTestCluster()
      github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:195 +0x8b
  github.com/cockroachdb/cockroach/pkg/kv/kvserver_test.TestCheckConsistencyInconsistent()
      github.com/cockroachdb/cockroach/pkg/kv/kvserver_test/pkg/kv/kvserver/consistency_queue_test.go:336 +0x9e4
  testing.tRunner()
      GOROOT/src/testing/testing.go:1259 +0x22f
  testing.(*T).Run·dwrap·21()
      GOROOT/src/testing/testing.go:1306 +0x47

Previous read at 0x00c0007d9b68 by goroutine 42:
  github.com/cockroachdb/cockroach/pkg/util/log.(*loggerT).outputLogEntry()
      github.com/cockroachdb/cockroach/pkg/util/log/clog.go:312 +0x7f5
  github.com/cockroachdb/cockroach/pkg/util/log.logfDepth()
      github.com/cockroachdb/cockroach/pkg/util/log/channels.go:60 +0x385
  github.com/cockroachdb/cockroach/pkg/util/log.Infof()
      github.com/cockroachdb/cockroach/bazel-out/k8-fastbuild/bin/pkg/util/log/log_channels_generated.go:263 +0x1f0
  github.com/cockroachdb/cockroach/pkg/util/admission.(*GrantCoordinator).CPULoad()
      github.com/cockroachdb/cockroach/pkg/util/admission/granter.go:956 +0x3e
  github.com/cockroachdb/cockroach/pkg/util/admission.(*GrantCoordinator).CPULoad-fm()
      github.com/cockroachdb/cockroach/pkg/util/admission/granter.go:951 +0x57
  github.com/cockroachdb/cockroach/pkg/util/goschedstats.(*schedStatsTicker).getStatsOnTick()
      github.com/cockroachdb/cockroach/pkg/util/goschedstats/runnable.go:236 +0x3fa
  github.com/cockroachdb/cockroach/pkg/util/goschedstats.init.0.func1()
      github.com/cockroachdb/cockroach/pkg/util/goschedstats/runnable.go:169 +0x1c4

Goroutine 47 (running) created at:
  testing.(*T).Run()
      GOROOT/src/testing/testing.go:1306 +0x726
  testing.runTests.func1()
      GOROOT/src/testing/testing.go:1598 +0x99
  testing.tRunner()
      GOROOT/src/testing/testing.go:1259 +0x22f
  testing.runTests()
      GOROOT/src/testing/testing.go:1596 +0x7ca
  testing.(*M).Run()
      GOROOT/src/testing/testing.go:1504 +0x9d1
  github.com/cockroachdb/cockroach/pkg/kv/kvserver_test.TestMain()
      github.com/cockroachdb/cockroach/pkg/kv/kvserver_test/pkg/kv/kvserver/main_test.go:49 +0x105
  main.main()
      main/bazel-out/k8-fastbuild/bin/pkg/kv/kvserver/kvserver_test_/testmain.go:1435 +0x5c6

Goroutine 42 (running) created at:
  github.com/cockroachdb/cockroach/pkg/util/goschedstats.init.0()
      github.com/cockroachdb/cockroach/pkg/util/goschedstats/runnable.go:157 +0x2b
==================
I220525 12:44:18.543257 1 (gostd) testmain.go:1414  [-] 1  Test //pkg/kv/kvserver:kvserver_test exited with error code 66

Jira issue: CRDB-16183

@erikgrinaker erikgrinaker added C-test-failure Broken test (automatically or manually discovered). T-kv-replication labels May 25, 2022
@erikgrinaker erikgrinaker self-assigned this May 25, 2022
@erikgrinaker erikgrinaker removed their assignment May 25, 2022
@erikgrinaker erikgrinaker changed the title kvserver: data race in TestCheckConsistencyInconsistent kvserver: data race in TestingSetRedactable May 25, 2022
@erikgrinaker
Copy link
Contributor Author

erikgrinaker commented May 25, 2022

This test will be disabled under race in #82042. However, the underlying race should still be addressed.

@erikgrinaker erikgrinaker added T-kv KV Team and removed T-kv-replication labels May 29, 2022
craig bot pushed a commit that referenced this issue May 31, 2022
82042: kvserver: skip `TestCheckConsistencyInconsistent` under race r=tbg a=erikgrinaker

The test is rather slow and also timing-dependent, which can result in
false positives. Furthermore, the call to `TestingSetRedactable`
triggers the race detector. This patch therefore disables the test under
the race detector.

Touches #81819.

Release note: None

Co-authored-by: Erik Grinaker <[email protected]>
@github-actions
Copy link

We have marked this test failure issue as stale because it has been
inactive for 1 month. If this failure is still relevant, removing the
stale label or adding a comment will keep it active. Otherwise,
we'll close it in 5 days to keep the test failure queue tidy.

@blathers-crl
Copy link

blathers-crl bot commented Mar 3, 2023

Hi @shralex, please add branch-* labels to identify which branch(es) this release-blocker affects.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@nvanbenschoten
Copy link
Member

nvanbenschoten commented Mar 17, 2023

I unskipped the test under race and stressed it:

# unskip test
dev test pkg/kv/kvserver -f=TestCheckConsistencyInconsistent --stress --race

This did not hit the data race mentioned above. However, after about 10m of stress, it failed with the following, more concerning error:

--- FAIL: TestCheckConsistencyInconsistent (518.47s)
    test_log_scope.go:161: test logs captured to: /tmp/_tmp/33e1d369c27b9c01b2b6009c561815a3/logTestCheckConsistencyInconsistent646488916
    test_log_scope.go:79: use -show-logs to present logs inline
    consistency_queue_test.go:407:
        	Error Trace:	github.com/cockroachdb/cockroach/pkg/kv/kvserver_test/pkg/kv/kvserver/consistency_queue_test.go:407
        	Error:      	Not equal:
        	            	expected: []byte{0xf1, 0xab, 0xd1, 0x41, 0x75, 0xd1, 0x79, 0x30, 0xd5, 0x36, 0x8d, 0xaf, 0x52, 0xc0, 0xd5, 0x88, 0x73, 0xc5, 0xb3, 0x6f, 0x81, 0x21, 0x67, 0x4f, 0xca, 0xce, 0x1, 0x68, 0xc5, 0x84, 0x6f, 0x7d, 0x54, 0xe5, 0x7e, 0xaa, 0x2e, 0xf1, 0x33, 0xdf, 0x8f, 0x45, 0xa2, 0x4a, 0x1e, 0x77, 0x80, 0xfd, 0xf3, 0x25, 0x10, 0x16, 0xd5, 0x83, 0x57, 0x15, 0x3c, 0x1f, 0x7f, 0x53, 0x94, 0xf0, 0x25, 0x10}
        	            	actual  : []byte{0x1d, 0x77, 0xab, 0xd8, 0x5d, 0x3c, 0xe9, 0x36, 0xc2, 0xaa, 0xef, 0x40, 0x1c, 0xf2, 0xfb, 0x18, 0xcd, 0x55, 0x16, 0x71, 0xf2, 0xf4, 0xfe, 0x39, 0x66, 0xa1, 0xa2, 0xe6, 0xc1, 0x84, 0x77, 0xc0, 0xf8, 0x25, 0x11, 0x1e, 0xc0, 0xc5, 0x79, 0x53, 0x9d, 0xe2, 0x32, 0x6b, 0x73, 0xb5, 0x3, 0xe9, 0x29, 0x5, 0x25, 0x75, 0xe, 0xa4, 0x1d, 0xbd, 0x88, 0xf4, 0x6c, 0x53, 0xa9, 0x22, 0xef, 0xdc}

        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -1,6 +1,6 @@
        	            	 ([]uint8) (len=64) {
        	            	- 00000000  f1 ab d1 41 75 d1 79 30  d5 36 8d af 52 c0 d5 88  |...Au.y0.6..R...|
        	            	- 00000010  73 c5 b3 6f 81 21 67 4f  ca ce 01 68 c5 84 6f 7d  |s..o.!gO...h..o}|
        	            	- 00000020  54 e5 7e aa 2e f1 33 df  8f 45 a2 4a 1e 77 80 fd  |T.~...3..E.J.w..|
        	            	- 00000030  f3 25 10 16 d5 83 57 15  3c 1f 7f 53 94 f0 25 10  |.%....W.<..S..%.|
        	            	+ 00000000  1d 77 ab d8 5d 3c e9 36  c2 aa ef 40 1c f2 fb 18  |.w..]<.6...@....|
        	            	+ 00000010  cd 55 16 71 f2 f4 fe 39  66 a1 a2 e6 c1 84 77 c0  |.U.q...9f.....w.|
        	            	+ 00000020  f8 25 11 1e c0 c5 79 53  9d e2 32 6b 73 b5 03 e9  |.%....yS..2ks...|
        	            	+ 00000030  29 05 25 75 0e a4 1d bd  88 f4 6c 53 a9 22 ef dc  |).%u......lS."..|
        	            	 }
        	Test:       	TestCheckConsistencyInconsistent

I've since been stressing without race and have not hit the same failure.

EDIT: I quickly hit this again with --race.

@nvanbenschoten nvanbenschoten changed the title kvserver: data race in TestingSetRedactable kvserver: TestCheckConsistencyInconsistent failed, data race in TestingSetRedactable Mar 17, 2023
@nvanbenschoten
Copy link
Member

I've been able to hit this failure as far back as 4f904eb. @nicktrav could we have someone from Replication take a look at this test failure and determine what to do about it? Maybe @pavelkalinnikov, who worked with this test last October. There's an argument that what we're seeing here should be promoted to a release-blocker, but let's get another pair of eyes on it before making that decision.

@nvanbenschoten nvanbenschoten added T-kv-replication and removed T-kv KV Team labels Mar 17, 2023
@blathers-crl
Copy link

blathers-crl bot commented Mar 17, 2023

cc @cockroachdb/replication

@nvanbenschoten nvanbenschoten added the branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 label Mar 17, 2023
@pav-kv
Copy link
Collaborator

pav-kv commented Mar 19, 2023

Reproduces fairly quickly on my end too. Investigating.

@pav-kv
Copy link
Collaborator

pav-kv commented Mar 20, 2023

I was bisecting, and hit the original race after a while (got it at 4eb5451, but it's probably a long existing one):

I230320 00:45:32.525877 1 (gostd) rand.go:199  [T1] 1  random seed: 6654103616478701771
==================
WARNING: DATA RACE
Write at 0x00c000e5ba18 by goroutine 35:
  github.com/cockroachdb/cockroach/pkg/util/log.TestingSetRedactable.func1()
      github.com/cockroachdb/cockroach/pkg/util/log/redact.go:199 +0xd8
  runtime.deferreturn()
      GOROOT/src/runtime/panic.go:476 +0x30
  github.com/cockroachdb/cockroach/pkg/kv/kvserver.forceScanAndProcess()
      github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/queue_helpers_testutil.go:42 +0x120
  github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).ForceReplicationScanAndProcess()
      github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/queue_helpers_testutil.go:55 +0xf0
  github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).WaitForFullReplication.func2()
      github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:1367 +0xf4
  github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Stores).VisitStores.func1()
      github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/stores.go:149 +0x54
  github.com/cockroachdb/cockroach/pkg/util/syncutil.(*IntMap).Range()
      github.com/cockroachdb/cockroach/pkg/util/syncutil/int_map.go:352 +0x29c
  github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Stores).VisitStores()
      github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/stores.go:148 +0x60
  github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).WaitForFullReplication()
      github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:1359 +0x36c
  github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).Start()
      github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:438 +0xa50
  github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).startServer()
      github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:578 +0x9c
  github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).Start()
      github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:392 +0x5e4
  github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).startServer()
      github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:578 +0x9c
  github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).Start()
      github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:392 +0x5e4
  github.com/cockroachdb/cockroach/pkg/server.(*TestServer).Start()
      github.com/cockroachdb/cockroach/pkg/server/testserver.go:601 +0x54
  github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).startServer()
      github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:578 +0x9c
  github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).Start()
      github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:392 +0x5e4
  github.com/cockroachdb/cockroach/pkg/testutils/testcluster.StartTestCluster()
      github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:236 +0x6c
  github.com/cockroachdb/cockroach/pkg/testutils/testcluster.NewTestCluster()
      github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:332 +0xb14
  github.com/cockroachdb/cockroach/pkg/testutils/testcluster.NewTestCluster()
      github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:332 +0xb14
  github.com/cockroachdb/cockroach/pkg/testutils/testcluster.StartTestCluster()
      github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:235 +0x5c
  github.com/cockroachdb/cockroach/pkg/kv/kvserver_test.TestCheckConsistencyInconsistent()
      github.com/cockroachdb/cockroach/pkg/kv/kvserver_test/pkg/kv/kvserver/consistency_queue_test.go:290 +0x680
  testing.tRunner()
      GOROOT/src/testing/testing.go:1446 +0x188
  testing.(*T).Run.func1()
      GOROOT/src/testing/testing.go:1493 +0x40

Previous read at 0x00c000e5ba18 by goroutine 15538:
  github.com/cockroachdb/cockroach/pkg/util/log.(*loggerT).outputLogEntry()
      github.com/cockroachdb/cockroach/pkg/util/log/clog.go:335 +0x5f4
  github.com/cockroachdb/cockroach/pkg/util/log.logfDepthInternal()
      github.com/cockroachdb/cockroach/pkg/util/log/channels.go:106 +0x550
  github.com/cockroachdb/cockroach/pkg/util/log.logfDepth()
      github.com/cockroachdb/cockroach/pkg/util/log/channels.go:39 +0x68
  github.com/cockroachdb/cockroach/pkg/util/log.Infof()
      github.com/cockroachdb/cockroach/bazel-out/darwin_arm64-fastbuild/bin/pkg/util/log/log_channels_generated.go:277 +0x34
  github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).Quiesce.func1()
      github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:560 +0x30

Goroutine 35 (running) created at:
  testing.(*T).Run()
      GOROOT/src/testing/testing.go:1493 +0x55c
  testing.runTests.func1()
      GOROOT/src/testing/testing.go:1846 +0x90
  testing.tRunner()
      GOROOT/src/testing/testing.go:1446 +0x188
  testing.runTests()
      GOROOT/src/testing/testing.go:1844 +0x6c0
  testing.(*M).Run()
      GOROOT/src/testing/testing.go:1726 +0x870
  github.com/cockroachdb/cockroach/pkg/kv/kvserver_test.TestMain()
      github.com/cockroachdb/cockroach/pkg/kv/kvserver_test/pkg/kv/kvserver/main_test.go:35 +0x108
  main.main()
      main/bazel-out/darwin_arm64-fastbuild/bin/pkg/kv/kvserver/kvserver_test_/testmain.go:1521 +0x604

Goroutine 15538 (finished) created at:
  time.goFunc()
      GOROOT/src/time/sleep.go:176 +0x44
==================
I230320 00:46:49.303526 1 (gostd) testmain.go:1492  [T1] 1  Test //pkg/kv/kvserver:kvserver_test exited with error code 66


ERROR: exit status 66

@pav-kv
Copy link
Collaborator

pav-kv commented Mar 21, 2023

Another error:

--- FAIL: TestCheckConsistencyInconsistent (696.45s)
    test_log_scope.go:161: test logs captured to: /Users/pavel/go/src/github.com/cockroachdb/cockroach/tmp/_tmp/577bfc8ed46850d1aaafa923fa37ce78/logTestCheckConsistencyInconsistent994443092
    test_log_scope.go:79: use -show-logs to present logs inline
    consistency_queue_test.go:313: 
                Error Trace:    github.com/cockroachdb/cockroach/pkg/kv/kvserver_test/pkg/kv/kvserver/consistency_queue_test.go:313
                                                        github.com/cockroachdb/cockroach/pkg/kv/kvserver_test/pkg/kv/kvserver/consistency_queue_test.go:358
                Error:          Received unexpected error:
                                could not collect checksum from any replica
                                (1)
                                  | (opaque error wrapper)
                                  | type name: github.com/cockroachdb/errors/withstack/*withstack.withStack
                                  | reportable 0:
                                  |
                                  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).runConsistencyCheck
                                  |     github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_consistency.go:376
                                  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).checkConsistencyImpl
                                  |     github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_consistency.go:96
                                  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).CheckConsistency
                                  |     github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_consistency.go:84
                                  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).executeAdminBatch
                                  |     github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_send.go:967
                                  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).SendWithWriteBytes
                                  |     github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_send.go:188
                                  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).SendWithWriteBytes
                                  |     github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_send.go:206
                                  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Stores).SendWithWriteBytes
                                  |     github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/stores.go:203
                                  | github.com/cockroachdb/cockroach/pkg/server.(*Node).batchInternal
                                  |     github.com/cockroachdb/cockroach/pkg/server/node.go:1169
                                  | github.com/cockroachdb/cockroach/pkg/server.(*Node).Batch
                                  |     github.com/cockroachdb/cockroach/pkg/server/node.go:1243
                                  | github.com/cockroachdb/cockroach/pkg/kv/kvpb._Internal_Batch_Handler.func1
                                  |     github.com/cockroachdb/cockroach/pkg/kv/kvpb/bazel-out/darwin_arm64-fastbuild/bin/pkg/kv/kvpb/kvpb_go_proto_/github.com/cockroachdb/cockroach/pkg/kv/kvpb/api.pb.go:9975
                                  | github.com/cockroachdb/cockroach/pkg/util/tracing/grpcinterceptor.ServerInterceptor.func1
                                  |     github.com/cockroachdb/cockroach/pkg/util/tracing/grpcinterceptor/grpc_interceptor.go:96
                                  | google.golang.org/grpc.chainUnaryInterceptors.func1.1
                                  |     google.golang.org/grpc/external/org_golang_google_grpc/server.go:1162
                                  | github.com/cockroachdb/cockroach/pkg/rpc.NewServerEx.func3
                                  |     github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:268
                                  | google.golang.org/grpc.chainUnaryInterceptors.func1.1
                                  |     google.golang.org/grpc/external/org_golang_google_grpc/server.go:1165
                                  | github.com/cockroachdb/cockroach/pkg/rpc.kvAuth.unaryInterceptor
                                  |     github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/auth.go:92
                                  | google.golang.org/grpc.chainUnaryInterceptors.func1.1
                                  |     google.golang.org/grpc/external/org_golang_google_grpc/server.go:1165
                                  | github.com/cockroachdb/cockroach/pkg/rpc.NewServerEx.func1.1
                                  |     github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:235
                                  | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunTaskWithErr
                                  |     github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:322
                                  | github.com/cockroachdb/cockroach/pkg/rpc.NewServerEx.func1
                                  |     github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:233
                                  | google.golang.org/grpc.chainUnaryInterceptors.func1.1
                                  |     google.golang.org/grpc/external/org_golang_google_grpc/server.go:1165
                                  | google.golang.org/grpc.chainUnaryInterceptors.func1
                                  |     google.golang.org/grpc/external/org_golang_google_grpc/server.go:1167
                                  | github.com/cockroachdb/cockroach/pkg/kv/kvpb._Internal_Batch_Handler
                                  |     github.com/cockroachdb/cockroach/pkg/kv/kvpb/bazel-out/darwin_arm64-fastbuild/bin/pkg/kv/kvpb/kvpb_go_proto_/github.com/cockroachdb/cockroach/pkg/kv/kvpb/api.pb.go:9977
                                  | google.golang.org/grpc.(*Server).processUnaryRPC
                                  |     google.golang.org/grpc/external/org_golang_google_grpc/server.go:1340
                                  | google.golang.org/grpc.(*Server).handleStream
                                  |     google.golang.org/grpc/external/org_golang_google_grpc/server.go:1713
                                  | google.golang.org/grpc.(*Server).serveStreams.func1.2
                                  |     google.golang.org/grpc/external/org_golang_google_grpc/server.go:965
                                  | runtime.goexit
                                  |     src/runtime/asm_arm64.s:1172
                                Wraps: (2) could not collect checksum from any replica
                                Error types: (1) *errbase.opaqueWrapper (2) *errutil.leafError
                Test:           TestCheckConsistencyInconsistent
    panic.go:540: -- test log scope end --
FAIL
I230321 09:18:54.842662 1 (gostd) testmain.go:1494  [-] 1  Test //pkg/kv/kvserver:kvserver_test exited with error code 1


ERROR: exit status 1

1 runs completed, 1 failures, over 11m43s
context canceled

@pav-kv
Copy link
Collaborator

pav-kv commented Mar 21, 2023

Here is what happened during one of the test failures:

--- FAIL: TestCheckConsistencyInconsistent (132.90s)
...
    consistency_queue_test.go:410: 
                Error Trace:    github.com/cockroachdb/cockroach/pkg/kv/kvserver_test/pkg/kv/kvserver/consistency_queue_test.go:410
                Error:          Not equal: 
                                expected: []byte{0x55, 0x7a, 0x1, 0x36, 0x95, 0xb1, 0x3, 0x8a, 0xa6, 0x4, 0xd4, 0x50, 0x63, 0x78, 0xf0, 0x57, 0x43, 0x20, 0x5, 0xb2, 0x6d, 0xf0, 0x47, 0x91, 0x36, 0x9c, 0x87, 0xd8, 0xa6, 0xff, 0x72, 0xf6, 0xaa, 0xf1, 0xf2, 0xd1, 0xc8, 0x24, 0x9e, 0xc1, 0x26, 0xa6, 0xe4, 0xae, 0x47, 0xb9, 0x9b, 0xb5, 0xd1, 0x15, 0xa5, 0x42, 0x9e, 0xca, 0xfa, 0x4c, 0x36, 0xc6, 0xf0, 0xce, 0x1f, 0xd5, 0x2e, 0x62}
                                actual  : []byte{0x5e, 0x10, 0x5b, 0x87, 0xb6, 0x42, 0x3, 0x1f, 0xdf, 0x88, 0x73, 0x55, 0x2f, 0x4b, 0xfd, 0x7a, 0x2c, 0xcb, 0x49, 0x55, 0xa6, 0x74, 0x95, 0x1c, 0xe5, 0x90, 0x79, 0xc7, 0x54, 0x8f, 0x51, 0xa1, 0xbe, 0x83, 0xf2, 0x3a, 0xcd, 0xae, 0x1e, 0x4, 0x68, 0x2c, 0x9f, 0x34, 0x8c, 0xda, 0xad, 0xd7, 0xd4, 0x84, 0x22, 0x9f, 0xe, 0x7e, 0xc8, 0x9, 0x53, 0x7, 0xff, 0xaa, 0xda, 0xbd, 0xd, 0x42}
...

Node n2 got intentionally corrupted, and the first phase of the consistency check failed (nodes 1 and 3 agree, 2 is corrupted):

E230321 09:30:19.389914 19038 kv/kvserver/replica_consistency.go:151 ⋮ [T1,n2,s2,r5/2:‹/{Systemtse-Table/0}›,*kvpb.CheckConsistencyRequest] 560
... 560 +(n1,s1):1: checksum 0346578bad85b436b1d0f519907ffb46613982c18e527b425ae8786e034628c800aa089c55346b3b4dc2d029437c6a7140051ba1bd58368ea59c049682f4f524
... 560 +- stats: contains_estimates:0 last_update_nanos:1679391017206907000 intent_age:0 gc_bytes_age:0 live_bytes:40 live_count:2 key_bytes:28 key_count:2 val_bytes:12 val_count:2 intent_bytes:0 intent_count:0 separated_intent_count:0 range_key_count:0 range_key_bytes:0 range_val_count:0 range_val_bytes:0 sys_bytes:578 sys_count:7 abort_span_bytes:0 
... 560 +- stats.Sub(recomputation): last_update_nanos:1679391017206907000 
... 560 +(n3,s3):3: checksum 0346578bad85b436b1d0f519907ffb46613982c18e527b425ae8786e034628c800aa089c55346b3b4dc2d029437c6a7140051ba1bd58368ea59c049682f4f524
... 560 +- stats: contains_estimates:0 last_update_nanos:1679391017206907000 intent_age:0 gc_bytes_age:0 live_bytes:40 live_count:2 key_bytes:28 key_count:2 val_bytes:12 val_count:2 intent_bytes:0 intent_count:0 separated_intent_count:0 range_key_count:0 range_key_bytes:0 range_val_count:0 range_val_bytes:0 sys_bytes:578 sys_count:7 abort_span_bytes:0 
... 560 +- stats.Sub(recomputation): last_update_nanos:1679391017206907000 
... 560 +(n2,s2):2: checksum fa276506de3cdeb9e2518c2272579e9f03101eb5627a70a5c8fc81c3caf7b76cb7d3d49e085a8c481015af8a8804e2831ce0ed9c060677a619d26bc010d0ad6b [minority]
... 560 +- stats: contains_estimates:0 last_update_nanos:1679391017206907000 intent_age:0 gc_bytes_age:0 live_bytes:40 live_count:2 key_bytes:28 key_count:2 val_bytes:12 val_count:2 intent_bytes:0 intent_count:0 separated_intent_count:0 range_key_count:0 range_key_bytes:0 range_val_count:0 range_val_bytes:0 sys_bytes:578 sys_count:7 abort_span_bytes:0 
... 560 +- stats.Sub(recomputation): last_update_nanos:1679391017206907000 live_bytes:-20 live_count:-1 key_bytes:-14 key_count:-1 val_bytes:-6 val_count:-1 
... 561  consistency check failed; fetching details and shutting down minority (n2,s2):2

Shortly, all 3 nodes got a second phase CheckConsistency command which instructs to grab storage checkpoints at the time of application, and to kill node 2:

W230321 09:30:19.659257 2275 kv/kvserver/replica_consistency.go:670 ⋮ [T1,n2,s2,r5/2:‹/{Systemtse-Table/0}›,raft] 563  creating checkpoint ‹r5_at_41› with spans [‹/Local/RangeID/{4""-7""}› ‹/Local/Range/{System/tsd-Table/3}› ‹/Local/Lock/Intent/Local/Range/{System/tsd-Table/3}› ‹/Local/Lock/Intent/{System/tsd-Table/3}› ‹/{System/tsd-Table/3}›]
W230321 09:30:19.738066 489 kv/kvserver/replica_consistency.go:670 ⋮ [T1,n1,s1,r5/1:‹/{Systemtse-Table/0}›,raft] 564  creating checkpoint ‹r5_at_41› with spans [‹/Local/RangeID/{4""-7""}› ‹/Local/Range/{System/tsd-Table/3}› ‹/Local/Lock/Intent/Local/Range/{System/tsd-Table/3}› ‹/Local/Lock/Intent/{System/tsd-Table/3}› ‹/{System/tsd-Table/3}›]
W230321 09:30:19.883564 3298 kv/kvserver/replica_consistency.go:670 ⋮ [T1,n3,s3,r5/3:‹/{Systemtse-Table/0}›,raft] 565  creating checkpoint ‹r5_at_41› with spans [‹/Local/RangeID/{4""-7""}› ‹/Local/Range/{System/tsd-Table/3}› ‹/Local/Lock/Intent/Local/Range/{System/tsd-Table/3}› ‹/Local/Lock/Intent/{System/tsd-Table/3}› ‹/{System/tsd-Table/3}›]

Nodes 2 and 3 quickly succeeded in creating the checkpoint @ log index 41:

W230321 09:30:20.706666 2275 kv/kvserver/replica_consistency.go:674 ⋮ [T1,n2,s2,r5/2:‹/{Systemtse-Table/0}›,raft] 572  created checkpoint ‹auxiliary/checkpoints/r5_at_41›
W230321 09:30:20.811606 3298 kv/kvserver/replica_consistency.go:674 ⋮ [T1,n3,s3,r5/3:‹/{Systemtse-Table/0}›,raft] 574  created checkpoint ‹auxiliary/checkpoints/r5_at_41›

5+eps seconds later, the leaseholder ended the second phase of the consistency check:

E230321 09:30:26.045762 19038 kv/kvserver/replica_consistency.go:151 ⋮ [T1,n2,s2,r5/2:‹/{Systemtse-Table/0}›,*kvpb.CheckConsistencyRequest] 596  
E230321 09:30:26.045762 19038 kv/kvserver/replica_consistency.go:151 ⋮ [T1,n2,s2,r5/2:‹/{Systemtse-Table/0}›,*kvpb.CheckConsistencyRequest] 596 +(n2,s2):2: checksum 3931acdb478bf3a29e62b7c7aaa0ad983475b58c1e74be639d681dbade43ed405e30e97ab8f6678c3859df71bc613562e45eeea43b8cf3ed1e534dc30a481cdc
E230321 09:30:26.045762 19038 kv/kvserver/replica_consistency.go:151 ⋮ [T1,n2,s2,r5/2:‹/{Systemtse-Table/0}›,*kvpb.CheckConsistencyRequest] 596 +- stats: contains_estimates:0 last_update_nanos:1679391017206907000 intent_age:0 gc_bytes_age:0 live_bytes:40 live_count:2 key_bytes:28 key_count:2 val_bytes:12 val_count:2 intent_bytes:0 intent_count:0 separated_intent_count:0 range_key_count:0 range_key_bytes:0 range_val_count:0 range_val_bytes:0 sys_bytes:578 sys_count:7 abort_span_bytes:0 
E230321 09:30:26.045762 19038 kv/kvserver/replica_consistency.go:151 ⋮ [T1,n2,s2,r5/2:‹/{Systemtse-Table/0}›,*kvpb.CheckConsistencyRequest] 596 +- stats.Sub(recomputation): last_update_nanos:1679391017206907000 live_bytes:-20 live_count:-1 key_bytes:-14 key_count:-1 val_bytes:-6 val_count:-1 
E230321 09:30:26.045762 19038 kv/kvserver/replica_consistency.go:151 ⋮ [T1,n2,s2,r5/2:‹/{Systemtse-Table/0}›,*kvpb.CheckConsistencyRequest] 596 +(n3,s3):3: checksum 5e105b87b642031fdf8873552f4bfd7a2ccb4955a67
4951ce59079c7548f51a1be83f23acdae1e04682c9f348cdaadd7d484229f0e7ec8095307ffaadabd0d42 [minority]
E230321 09:30:26.045762 19038 kv/kvserver/replica_consistency.go:151 ⋮ [T1,n2,s2,r5/2:‹/{Systemtse-Table/0}›,*kvpb.CheckConsistencyRequest] 596 +- stats: contains_estimates:0 last_update_nanos:1679391017206907000 intent_age:0 gc_bytes_age:0 live_bytes:40 live_count:2 key_bytes:28 key_count:2 val_bytes:12 val_count:2 intent_bytes:0 intent_count:0 separated_intent_count:0 range_key_count:0 range_key_bytes:0 range_val_count:0 range_val_bytes:0 sys_bytes:578 sys_count:7 abort_span_bytes:0 
E230321 09:30:26.045762 19038 kv/kvserver/replica_consistency.go:151 ⋮ [T1,n2,s2,r5/2:‹/{Systemtse-Table/0}›,*kvpb.CheckConsistencyRequest] 596 +- stats.Sub(recomputation): last_update_nanos:1679391017206907000 
E230321 09:30:26.086872 19038 kv/kvserver/replica_consistency.go:247 ⋮ [T1,n2,s2,r5/2:‹/{Systemtse-Table/0}›,*kvpb.CheckConsistencyRequest] 597  consistency check failed

Notably, only nodes 2 and 3 returned a checksum. Node 1 timed out at 5 seconds and returned an error.

Only 10 more seconds later, node 1 finally reported that it created a checksum:

W230321 09:30:36.510150 489 kv/kvserver/replica_consistency.go:674 ⋮ [T1,n1,s1,r5/1:‹/{Systemtse-Table/0}›,raft] 647  created checkpoint ‹auxiliary/checkpoints/r5_at_41›

And exactly 5 seconds later it gave up on keeping the result of the checksum computation because the result collection request never showed up (we know that it timed out above):

E230321 09:30:41.521566 20700 kv/kvserver/replica_consistency.go:718 ⋮ [T1,n1,s1,r5/1:‹/{Systemtse-Table/0}›] 662  checksum collection did not join: ‹operation "kvserver.Replica: computing checksum" timed out after 5.01s (given timeout 5s)›: context deadline exceeded

So what's probably happening is:

  • runConsistency check returns early when only nodes 2 and 3 have created the storage checkpoint.
  • Node 1 haven't created the checkpoint, but has started doing so.
  • Node 2 "fatals" (mocked out in the test) shortly after the check is completed.
  • Node 1 is still creating its checkpoint, but has probably created the directory by now.
  • Hence, the test assumes that the checkpoint has been created.
  • And proceeds to open it and compute the checksum of the range.

The problem is that on the node 1 the checkpoint directory isn't complete yet. Even the RdbCheckpoints metric check doesn't save it, because it's updated asynchronously and on the basis of the number of directories in the checkpoints folder.

pav-kv added a commit to pav-kv/cockroach that referenced this issue Mar 21, 2023
This commit makes it so that the consistency checker checkpoints are first
created in a "_pending" folder, and only after completion are atomically
renamed to the intended directory name. This is to help distinguish valid
checkpoints from the ones that weren't finalized (for example, when the node
crashed in the meantime).

Part of cockroachdb#81819
Epic: none
Release note (ops change): checkpoint directories (that can be created in the
rare event of range inconsistency) are now clearly indicated as pending until
they are fully populated. This is to help operators to distinguish valid
checkpoints from corrupted ones.
craig bot pushed a commit that referenced this issue Mar 21, 2023
98527: sql: deprecate gossip-based physical planning r=dt a=dt

Previously system tenants and secondary tenants used differnt physical planning implementations, with the system tenant and only the system tenant using nodeIDs while other tenants used the instance table. This unifies those implementations such that all tenants use NodeIDs if running in mixed mode and use the instance table if not.

Release note: none.
Epic: CRDB-16910

98879: sql: support array-flatten subqueries within UDFs r=mgartner a=mgartner

Array-flatten subqueries (e.g., `ARRAY(SELECT a FROM t)`) are now
supported within UDFs. They are now converted to a normal subquery with
a ScalarGroupBy if they exist within a UDF, even if they are
uncorrelated. This allows them to be executed without any changes to the
execbuilder or the evaluation logic of `tree.Routine`.

Fixes #98738

Release note: None


98988: ui: add badges for filter elements r=maryliag a=maryliag

Adds badges for each of the selected filters on SQL Activity and
Insights pages.

Part Of #98891

https://www.loom.com/share/e7417209fc704d71b2733f58609fb4de

<img width="1160" alt="Screenshot 2023-03-19 at 1 30 49 PM" src="https://user-images.githubusercontent.com/1017486/226195412-03d3ef58-5d6d-4c24-84f1-6a55b952f5c0.png">

Release note (ui change): Adds badges for each selected
filter on SQL Activity and Insights pages.

99042: sql: added indexes to system.statement_statistics, system.transaction_statistics r=ericharmeling a=ericharmeling

Fixes #98624.

This commit adds indexes on new computed columns to the system.statement_statistics and system.transaction_statistics tables.

Epic: none

Release note: None

99054: pkg/ccl: Unskip TestTenantStatusAPI/tenant_ranges/pagination r=dhartunian a=abarganier

Fixes: #92979

Previously, in #97386, we skipped test_tenant_ranges_pagination because it was marked as flaky.

The test makes a request for a single range and expects an offset of `1` back. It then uses this offset to request a second range, and expects an offset of `2`. This means that the test requires at least 3 ranges to exist on the tenant.

The test was flaking on the assertion that the offset returned by the second request came back as `2`. Instead, it was flaking when the offset came back as `0`, which signifies that there are no more ranges to process.

We learned that the tenant create process has an asycnhronous splitting of ranges that occurs, which is what would lead to this sporadic scenario where not enough ranges existed (yet) for the test to succeed.

This patch updates the test with a `testutils.SucceedsSoon` clause that checks first that `crdb_internal.ranges` contains at least 3 ranges, prior to making the second request. This should provide sufficient time for the range split queue to be processed and eliminate the vast majority of these test flakes.

Release note: none

99119: kvserver: mark in-flight storage checkpoints r=tbg a=pavelkalinnikov

This commit makes it so that the consistency checker checkpoints are first created in a "_pending" folder, and only after completion are atomically renamed to the intended directory name. This is to help distinguish valid checkpoints from the ones that weren't finalized (for example, when the node crashed in the meantime).

Part of #81819
Epic: none
Release note (ops change): checkpoint directories (that can be created in the rare event of range inconsistency) are now clearly indicated as pending until they are fully populated. This is to help operators to distinguish valid checkpoints from corrupted ones.

Co-authored-by: David Taylor <[email protected]>
Co-authored-by: Marcus Gartner <[email protected]>
Co-authored-by: maryliag <[email protected]>
Co-authored-by: Eric Harmeling <[email protected]>
Co-authored-by: Alex Barganier <[email protected]>
Co-authored-by: Pavel Kalinnikov <[email protected]>
pav-kv added a commit to pav-kv/cockroach that referenced this issue Mar 22, 2023
This commit fixes the race possible in TestCheckConsistencyInconsistent:
- Node 2 is corrupted.
- The second phase of runConsistency check times out on node 1, and returns
  early when only nodes 2 and 3 have created the storage checkpoint.
- Node 1 haven't created the checkpoint, but has started doing so.
- Node 2 "fatals" (mocked out in the test) shortly after the check is complete.
- Node 1 is still creating its checkpoint, but has probably created the
  directory by now.
- Hence, the test assumes that the checkpoint has been created, and proceeds to
  open it and compute the checksum of the range.

This commit makes the test wait for the moment when all the checkpoint are
known to be fully populated.

Part of cockroachdb#81819
Epic: none
Release note: none
@craig craig bot closed this as completed in f592975 Mar 22, 2023
blathers-crl bot pushed a commit that referenced this issue Mar 22, 2023
This commit fixes the race possible in TestCheckConsistencyInconsistent:
- Node 2 is corrupted.
- The second phase of runConsistency check times out on node 1, and returns
  early when only nodes 2 and 3 have created the storage checkpoint.
- Node 1 haven't created the checkpoint, but has started doing so.
- Node 2 "fatals" (mocked out in the test) shortly after the check is complete.
- Node 1 is still creating its checkpoint, but has probably created the
  directory by now.
- Hence, the test assumes that the checkpoint has been created, and proceeds to
  open it and compute the checksum of the range.

This commit makes the test wait for the moment when all the checkpoint are
known to be fully populated.

Part of #81819
Epic: none
Release note: none
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 C-test-failure Broken test (automatically or manually discovered). skipped-test
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants