Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: hotspotsplits/nodes=4 failed [inconsistency] #61990

Closed
cockroach-teamcity opened this issue Mar 15, 2021 · 82 comments · Fixed by #63302
Closed

roachtest: hotspotsplits/nodes=4 failed [inconsistency] #61990

cockroach-teamcity opened this issue Mar 15, 2021 · 82 comments · Fixed by #63302
Assignees
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). GA-blocker O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.

Comments

@cockroach-teamcity
Copy link
Member

(roachtest).hotspotsplits/nodes=4 failed on master@e09b93fe62541c3a94f32a723778660b528a0792:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/hotspotsplits/nodes=4/run_1
	cluster.go:1754,test_runner.go:862: r38 (/Table/53/1/-182447479829764077) is inconsistent: RANGE_INCONSISTENT 
		(1) attached stack trace
		  -- stack trace:
		  | main.(*cluster).CheckReplicaDivergenceOnDB
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:1705
		  | main.(*cluster).FailOnReplicaDivergence.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:1751
		  | github.com/cockroachdb/cockroach/pkg/util/contextutil.RunWithTimeout
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/util/contextutil/context.go:140
		  | main.(*cluster).FailOnReplicaDivergence
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:1748
		  | main.(*testRunner).runTest
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:862
		  | main.(*testRunner).runWorker
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:472
		  | main.(*testRunner).Run.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:275
		  | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:351
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (2) r38 (/Table/53/1/-182447479829764077) is inconsistent: RANGE_INCONSISTENT 
		  | (n3,s3):3: checksum 099280635e0ad488ea26cb5f253acebde95ceaa12eb3a8088e41ce92866507e5a664046b208ad370517d6af5c839c135dd1c27e612bbb0d59f75357a06084b61 [minority]
		  | - stats: contains_estimates:0 last_update_nanos:1615788916721166939 intent_age:0 gc_bytes_age:0 live_bytes:301241874 live_count:1149 key_bytes:28077 key_count:1149 val_bytes:301213797 val_count:1149 intent_bytes:0 intent_count:0 separated_intent_count:0 sys_bytes:702 sys_count:6 abort_span_bytes:0 
		  | - stats.Sub(recomputation): last_update_nanos:1615788916721166939 live_bytes:301241874 live_count:1149 key_bytes:28077 key_count:1149 val_bytes:301213797 val_count:1149 sys_bytes:702 sys_count:6 
		  | (n2,s2):2: checksum ff3cbc427622cc2ebf8ae40b42f37870f5662226a33bd636662a3423b268d19c15ae26c4f10181e9a80e531d00bb5127cbbcd44c21088fe8e5745d10653687fa
		  | - stats: contains_estimates:0 last_update_nanos:1615788916721166939 intent_age:0 gc_bytes_age:0 live_bytes:300979696 live_count:1148 key_bytes:28052 key_count:1148 val_bytes:300951644 val_count:1148 intent_bytes:0 intent_count:0 separated_intent_count:0 sys_bytes:702 sys_count:6 abort_span_bytes:0 
		  | - stats.Sub(recomputation): last_update_nanos:1615788916721166939 live_bytes:300979696 live_count:1148 key_bytes:28052 key_count:1148 val_bytes:300951644 val_count:1148 sys_bytes:702 sys_count:6 
		  | (n1,s1):1: checksum ff3cbc427622cc2ebf8ae40b42f37870f5662226a33bd636662a3423b268d19c15ae26c4f10181e9a80e531d00bb5127cbbcd44c21088fe8e5745d10653687fa
		  | - stats: contains_estimates:0 last_update_nanos:1615788916721166939 intent_age:0 gc_bytes_age:0 live_bytes:300979696 live_count:1148 key_bytes:28052 key_count:1148 val_bytes:300951644 val_count:1148 intent_bytes:0 intent_count:0 separated_intent_count:0 sys_bytes:702 sys_count:6 abort_span_bytes:0 
		  | - stats.Sub(recomputation): last_update_nanos:1615788916721166939 live_bytes:300979696 live_count:1148 key_bytes:28052 key_count:1148 val_bytes:300951644 val_count:1148 sys_bytes:702 sys_count:6
		Error types: (1) *withstack.withStack (2) *errutil.leafError

More

Artifacts: /hotspotsplits/nodes=4

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Mar 15, 2021
@knz
Copy link
Contributor

knz commented Mar 16, 2021

cc @aayushshah15 @nvanbenschoten for triage

@cockroach-teamcity
Copy link
Member Author

(roachtest).hotspotsplits/nodes=4 failed on master@8b137b4f068a0d590a3e86ae91fd60eb84f2750a:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/hotspotsplits/nodes=4/run_1
	cluster.go:1754,test_runner.go:863: r40 (/Table/53/1/4396575136294584631) is inconsistent: RANGE_INCONSISTENT 
		(1) attached stack trace
		  -- stack trace:
		  | main.(*cluster).CheckReplicaDivergenceOnDB
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:1705
		  | main.(*cluster).FailOnReplicaDivergence.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:1751
		  | github.com/cockroachdb/cockroach/pkg/util/contextutil.RunWithTimeout
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/util/contextutil/context.go:140
		  | main.(*cluster).FailOnReplicaDivergence
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:1748
		  | main.(*testRunner).runTest
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:863
		  | main.(*testRunner).runWorker
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:473
		  | main.(*testRunner).Run.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:276
		  | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:351
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (2) r40 (/Table/53/1/4396575136294584631) is inconsistent: RANGE_INCONSISTENT 
		  | (n3,s3):2: checksum f03984771a11e0ff99ba698c20d3e2606680201bad62ed4d08da7c9ae4ec774b3f8323e7b15c5f8ee443abde68a2ab57904f017ff835e8edafe4ce1114b6840f
		  | - stats: contains_estimates:0 last_update_nanos:1616739521894074624 intent_age:0 gc_bytes_age:0 live_bytes:315924490 live_count:1205 key_bytes:30125 key_count:1205 val_bytes:315894365 val_count:1205 intent_bytes:0 intent_count:0 separated_intent_count:0 sys_bytes:629 sys_count:6 abort_span_bytes:0 
		  | - stats.Sub(recomputation): last_update_nanos:1616739521894074624 live_bytes:315924490 live_count:1205 key_bytes:30125 key_count:1205 val_bytes:315894365 val_count:1205 sys_bytes:629 sys_count:6 
		  | (n2,s2):3: checksum f03984771a11e0ff99ba698c20d3e2606680201bad62ed4d08da7c9ae4ec774b3f8323e7b15c5f8ee443abde68a2ab57904f017ff835e8edafe4ce1114b6840f
		  | - stats: contains_estimates:0 last_update_nanos:1616739521894074624 intent_age:0 gc_bytes_age:0 live_bytes:315924490 live_count:1205 key_bytes:30125 key_count:1205 val_bytes:315894365 val_count:1205 intent_bytes:0 intent_count:0 separated_intent_count:0 sys_bytes:629 sys_count:6 abort_span_bytes:0 
		  | - stats.Sub(recomputation): last_update_nanos:1616739521894074624 live_bytes:315924490 live_count:1205 key_bytes:30125 key_count:1205 val_bytes:315894365 val_count:1205 sys_bytes:629 sys_count:6 
		  | (n1,s1):1: checksum 7ad7a7eb605a7f4a0c52e76e4ffb7f075324a842e8f385c10a8e59d3a5da223d17fc992dad8c5838d578206d0e93663d3154e40d9392e17a1610678dd3d4daf9 [minority]
		  | - stats: contains_estimates:0 last_update_nanos:1616739521894074624 intent_age:0 gc_bytes_age:0 live_bytes:315400134 live_count:1203 key_bytes:30075 key_count:1203 val_bytes:315370059 val_count:1203 intent_bytes:0 intent_count:0 separated_intent_count:0 sys_bytes:629 sys_count:6 abort_span_bytes:0 
		  | - stats.Sub(recomputation): last_update_nanos:1616739521894074624 live_bytes:315400134 live_count:1203 key_bytes:30075 key_count:1203 val_bytes:315370059 val_count:1203 sys_bytes:629 sys_count:6
		Error types: (1) *withstack.withStack (2) *errutil.leafError

More

Artifacts: /hotspotsplits/nodes=4

See this test on roachdash
powered by pkg/cmd/internal/issues

@tbg tbg changed the title roachtest: hotspotsplits/nodes=4 failed roachtest: hotspotsplits/nodes=4 failed [inconsistency] Mar 29, 2021
tbg added a commit that referenced this issue Mar 29, 2021
In #61990 we had this method catch a stats divergence on teardown in an
otherwise successful test. The call to `t.Fatal` in that method
unfortunately prevented the logs from being collected, which is not
helpful.

Release note: None
@tbg tbg self-assigned this Mar 29, 2021
@tbg
Copy link
Member

tbg commented Mar 29, 2021

Kicking off a repro stress attempt with --debug on top of #62728

@tbg
Copy link
Member

tbg commented Mar 29, 2021

20 runs passed (in <900s each)

@tbg
Copy link
Member

tbg commented Mar 29, 2021

out of 200 an additional runs, I got one reproduction (i.e. 199 PASS). That's a 0.45% repro rate. Ouch!

       	Wraps: (2) r59 (/Table/53/1/5664974682252175621) is inconsistent: RANGE_INCONSISTENT 
		 | (n2,s2):2: checksum 5907ea1a6182683e1601abded94351e69219280a4ab1d6d271679e82262e4c49349c683c16b07d99763d8eeb3de5e123158b6584366cdae4b2f847dbb4fba601
		 | - stats: contains_estimates:0 last_update_nanos:1617046662819189590 intent_age:0 gc_bytes_age:0 live_bytes:296261140 live_count:1130 key_bytes:28250 key_count:1130 val_bytes:296232890 val_count:1130 intent_bytes:0 intent_count:0 separated_intent_count:0 sys_bytes:368 sys_count:4 abort_span_bytes:0 
		 | - stats.Sub(recomputation): last_update_nanos:1617046662819189590 live_bytes:296261140 live_count:1130 key_bytes:28250 key_count:1130 val_bytes:296232890 val_count:1130 sys_bytes:368 sys_count:4 
		 | (n3,s3):3: checksum 5907ea1a6182683e1601abded94351e69219280a4ab1d6d271679e82262e4c49349c683c16b07d99763d8eeb3de5e123158b6584366cdae4b2f847dbb4fba601
		 | - stats: contains_estimates:0 last_update_nanos:1617046662819189590 intent_age:0 gc_bytes_age:0 live_bytes:296261140 live_count:1130 key_bytes:28250 key_count:1130 val_bytes:296232890 val_count:1130 intent_bytes:0 intent_count:0 separated_intent_count:0 sys_bytes:368 sys_count:4 abort_span_bytes:0 
		 | - stats.Sub(recomputation): last_update_nanos:1617046662819189590 live_bytes:296261140 live_count:1130 key_bytes:28250 key_count:1130 val_bytes:296232890 val_count:1130 sys_bytes:368 sys_count:4 
		 | (n1,s1):1: checksum 9246dccddf4316dea2eebaf4c89a6b3a87826fa8670b00c9c96266359bbf203845c86b348914dc749d5e8b6f0a746dda8f851e3bfcad7374968fdcbfb499cf50 [minority]
		 | - stats: contains_estimates:0 last_update_nanos:1617046662819189590 intent_age:0 gc_bytes_age:0 live_bytes:295212428 live_count:1126 key_bytes:28150 key_count:1126 val_bytes:295184278 val_count:1126 intent_bytes:0 intent_count:0 separated_intent_count:0 sys_bytes:368 sys_count:4 abort_span_bytes:0 
		 | - stats.Sub(recomputation): last_update_nanos:1617046662819189590 live_bytes:295212428 live_count:1126 key_bytes:28150 key_count:1126 val_bytes:295184278 val_count:1126 sys_bytes:368 sys_count:4
		Error types: (1) *withstack.withStack (2) *errutil.leafError

@irfansharif
Copy link
Contributor

irfansharif commented Mar 29, 2021

Do you have the logs/debug zips collected somewhere? I was just going to say that I'd be really happy if we found a range tombstone bug. Out of obvious ideas for #62267.

@tbg
Copy link
Member

tbg commented Mar 29, 2021

I have something even better, namely the cluster! Two of the nodes failed on the real consistency checker and got checkpointed and killed. Go check them out (I'm heading to bed, haven't looked yet):

$ roachprod extend tobias-1617031837-01-n4cpu4 --lifetime 100h
tobias-1617031837-01-n4cpu4: [gce] 122h37m11s remaining
  tobias-1617031837-01-n4cpu4-0001	tobias-1617031837-01-n4cpu4-0001.us-east1-b.cockroach-ephemeral	10.142.0.56	35.243.139.251
  tobias-1617031837-01-n4cpu4-0002	tobias-1617031837-01-n4cpu4-0002.us-east1-b.cockroach-ephemeral	10.142.0.101	35.185.42.134
  tobias-1617031837-01-n4cpu4-0003	tobias-1617031837-01-n4cpu4-0003.us-east1-b.cockroach-ephemeral	10.142.0.41	34.74.192.79
  tobias-1617031837-01-n4cpu4-0004	tobias-1617031837-01-n4cpu4-0004.us-east1-b.cockroach-ephemeral	10.142.0.52	35.185.111.113

Unfortunately the checkpoints are ~20GiB each.

@tbg
Copy link
Member

tbg commented Mar 29, 2021

Kicking off 800 runs overnight on release-21.1 to see if I can get a repro there.

@irfansharif
Copy link
Contributor

irfansharif commented Mar 29, 2021

@sumeerbhola, did we break cockroach debug keys <store>?

$ roachprod ssh tobias-1617031837-01-n4cpu4:2
$ ~/cockroach debug keys /mnt/data1/cockroach/auxiliary/checkpoints/r59_at_2866/
panic: iterator with constraint=2 is being used with key /Min that has constraint=1 [recovered]
        panic: iterator with constraint=2 is being used with key /Min that has constraint=1 [recovered]
        panic: iterator with constraint=2 is being used with key /Min that has constraint=1

@irfansharif
Copy link
Contributor

irfansharif commented Mar 29, 2021

Ditto for range-keys, but that doesn't seem pebble related. #51151, works on a 20.1 binary (broken in 20.2 as well).

$ ~/cockroach debug range-data /mnt/data1/cockroach/auxiliary/checkpoints/r59_at_2866 59
panic: field raftpb.HardState.term has invalid type: got uint64, want pointer [recovered]
        panic: field raftpb.HardState.term has invalid type: got uint64, want pointer [recovered]
        panic: field raftpb.HardState.term has invalid type: got uint64, want pointer

@sumeerbhola
Copy link
Collaborator

panic: iterator with constraint=2 is being used with key /Min that has constraint=1 [recovered]
panic: iterator with constraint=2 is being used with key /Min that has constraint=1 [recovered]
panic: iterator with constraint=2 is being used with key /Min that has constraint=1

This must have been overlooked in all the changes to users of MVCCIterator such that they don't span from local to global keys. Where is the code that is constructing this MVCCIterator? The fix will typically be one of (a) set more accurate bounds if the iterator is only really trying to read one of the local or global keys, (b) use 2 iterators if both need to be read, (c) if one is trying to also read non-MVCC keys like lock table keys, use EngineIterator instead of MVCCIterator.

@irfansharif
Copy link
Contributor

irfansharif commented Mar 29, 2021

The local range ID keys look fine, except the node in the majority (n2) is ahead by a commit. Is that fine?

3c3
< 0,0 /Local/RangeID/59/r/RangeAppliedState (0x0169c3727261736b00): raft_applied_index:2866 lease_applied_index:2704 range_stats:<last_update_nanos:1617046662819189590 live_bytes:295212428 live_count:1126 key_bytes:28150 key_count:1126 val_bytes:295184278 val_count:1126 sys_bytes:436 sys_count:5 >
---
> 0,0 /Local/RangeID/59/r/RangeAppliedState (0x0169c3727261736b00): raft_applied_index:2865 lease_applied_index:2703 range_stats:<last_update_nanos:1617046662819189590 live_bytes:296261140 live_count:1130 key_bytes:28250 key_count:1130 val_bytes:296232890 val_count:1130 sys_bytes:436 sys_count:5 >
6c6
< 0,0 /Local/RangeID/59/u/RaftHardState (0x0169c3757266746800): term:28 vote:2 commit:2866
---
> 0,0 /Local/RangeID/59/u/RaftHardState (0x0169c3757266746800): term:28 vote:2 commit:2865

@sumeerbhola, I think here:

cockroach/pkg/cli/debug.go

Lines 223 to 236 in d29a7be

return db.MVCCIterate(debugCtx.startKey.Key, debugCtx.endKey.Key, storage.MVCCKeyAndIntentsIterKind, func(kv storage.MVCCKeyValue) error {
done, err := printer(kv)
if err != nil {
return err
}
if done {
return iterutil.StopIteration()
}
results++
if results == debugCtx.maxResults {
return iterutil.StopIteration()
}
return nil
})
. This is readily reproducible, start a single node (cockroach start-single-node --insecure --logtostderr=INFO) and use cockroach debug keys in against the store directory.

@irfansharif
Copy link
Contributor

# leaseholder
$ ~/cockroach20.1 debug keys --from='/Table/53/1' --from='/Table/54'  . | wc -l
79150

# minority errant node
$ ~/cockroach20.1 debug keys --from='/Table/53/1' --from='/Table/54'  . | wc -l
79146

@irfansharif
Copy link
Contributor

irfansharif commented Mar 30, 2021

cockroach.tobias-1617031837-01-n4cpu4-0002.ubuntu.2021-03-29T19_27_11Z.019618.log

(n2,s2):2: checksum 335d894333f35cb2cbcf2844defb44a77b8f42b6f286dff07970e79b43e7e2bef3c46de8b4ec1fb06d28629d87e63a92bfb0578e84be5ef2ba9441a0a04fcc9f
  stats: contains_estimates:0 last_update_nanos:1617046662819189590 intent_age:0 gc_bytes_age:0 live_bytes:296261140 live_count:1130 key_bytes:28250 key_count:1130 val_bytes:296232890 val_count:1130 intent_bytes:0 intent_count:0 separated_intent_count:0 sys_bytes:436 sys_count:5 abort_span_bytes:0 
  stats.Sub(recomputation): last_update_nanos:1617046662819189590 
(n3,s3):3: checksum 335d894333f35cb2cbcf2844defb44a77b8f42b6f286dff07970e79b43e7e2bef3c46de8b4ec1fb06d28629d87e63a92bfb0578e84be5ef2ba9441a0a04fcc9f
  stats: contains_estimates:0 last_update_nanos:1617046662819189590 intent_age:0 gc_bytes_age:0 live_bytes:296261140 live_count:1130 key_bytes:28250 key_count:1130 val_bytes:296232890 val_count:1130 intent_bytes:0 intent_count:0 separated_intent_count:0 sys_bytes:436 sys_count:5 abort_span_bytes:0 
  stats.Sub(recomputation): last_update_nanos:1617046662819189590 
(n1,s1):1: checksum 4174178f69296d98d5d821d3998f0e58ba23e4665bda7e8464a6ef20098623918f3eb27f2293812128fbf4a96cea060fd3c9c8fbecdfec001ee79b9a91d8f9e1 [minority]
  stats: contains_estimates:0 last_update_nanos:1617046662819189590 intent_age:0 gc_bytes_age:0 live_bytes:295212428 live_count:1126 key_bytes:28150 key_count:1126 val_bytes:295184278 val_count:1126 intent_bytes:0 intent_count:0 separated_intent_count:0 sys_bytes:436 sys_count:5 abort_span_bytes:0 
  stats.Sub(recomputation): last_update_nanos:1617046662819189590 live_bytes:-524356 live_count:-2 key_bytes:-50 key_count:-2 val_bytes:-524306 val_count:-2 
 consistency check failed; fetching details and shutting down minority (n1,s1):1
 
(n2,s2):2: checksum 2a0745549ed2f4a9b4f7d26f857f85447b3701975a51e74b15a51c1c3f5cb9dc6d8394fc7023ee9a4544e42d92136de092469488dd04c9a883649ef1336fbcc3
  stats: contains_estimates:0 last_update_nanos:1617046662819189590 intent_age:0 gc_bytes_age:0 live_bytes:296261140 live_count:1130 key_bytes:28250 key_count:1130 val_bytes:296232890 val_count:1130 intent_bytes:0 intent_count:0 separated_intent_count:0 sys_bytes:436 sys_count:5 abort_span_bytes:0 
  stats.Sub(recomputation): last_update_nanos:1617046662819189590 
(n3,s3):3: checksum 2a0745549ed2f4a9b4f7d26f857f85447b3701975a51e74b15a51c1c3f5cb9dc6d8394fc7023ee9a4544e42d92136de092469488dd04c9a883649ef1336fbcc3
  stats: contains_estimates:0 last_update_nanos:1617046662819189590 intent_age:0 gc_bytes_age:0 live_bytes:296261140 live_count:1130 key_bytes:28250 key_count:1130 val_bytes:296232890 val_count:1130 intent_bytes:0 intent_count:0 separated_intent_count:0 sys_bytes:436 sys_count:5 abort_span_bytes:0 
  stats.Sub(recomputation): last_update_nanos:1617046662819189590 
+====== diff(2a0745549ed2f4a9b4f7d26f857f85447b3701975a51e74b15a51c1c3f5cb9dc6d8394fc7023ee9a4544e42d92136de092469488dd04c9a883649ef1336fbcc3, [minority]) ======
--- leaseholder
+++ follower
-0,0 �/Local/RangeID/59/r/RangeAppliedState�
-    ts:1970-01-01 00:00:00 +0000 UTC
-    value:�raft_applied_index:2866 lease_applied_index:2704 range_stats:<last_update_nanos:1617046662819189590 live_bytes:296261140 live_count:1130 key_bytes:28250 key_count:1130 val_bytes:296232890 val_count:1130 sys_bytes:436 sys_count:5 > raft_closed_timestamp:<wall_time:1617047719942013857 >�
-    raw mvcc_key/value: �0169c3727261736b00� �1204080010001800200028003240785941710308b2161090151a2709569391525fe770162094aca28d0128ea0830dadc0138ea0840bacfa08d0148ea0860b4036805220a08a187ff9fd78abab816�
+0,0 �/Local/RangeID/59/r/RangeAppliedState�
+    ts:1970-01-01 00:00:00 +0000 UTC
+    value:�raft_applied_index:2866 lease_applied_index:2704 range_stats:<last_update_nanos:1617046662819189590 live_bytes:295212428 live_count:1126 key_bytes:28150 key_count:1126 val_bytes:295184278 val_count:1126 sys_bytes:436 sys_count:5 > raft_closed_timestamp:<wall_time:1617047719942013857 >�
+    raw mvcc_key/value: �0169c3727261736b00� �1204080010001800200028003240ea302a2f0308b2161090151a2709569391525fe77016208cabe28c0128e60830f6db0138e6084096cfe08c0148e60860b4036805220a08a187ff9fd78abab816�
-1617046279.883407392,0 �/Table/53/1/5898728633597431726/0�
-    ts:2021-03-29 19:31:19.883407392 +0000 UTC
-    value:�"\x01\xaa\x98\x1f\n&\x90\x80\x00\u007fw\xf24\xdc\xd0m\x97\xd5~X\xe5\x98\x1f!\x9f\xf6\xb0\x89\xd3\r\xe5.\x03\x8f\xa2\x1d8q\x1ch=\x9e\xcb\x04\xc6\xe34ce\x85cV\x83\xfa\xf4\x12\xbb\xb4\x95\x1e\x97k}\x04\xe0\x9a\b\xb0\rO\xb3\xf3\xec+\xdf\n|gr-\xe2b\xa8\x1a\u007f#mx\xbf\xb5\xc7m\xf8K8\xcd\xf7\x11\xdb&\xa2ww\x98\xc7K\xa5O\x17\x86$s\x0e\xd4\xdf9\x17\x87I\x9c\xb3:\xad~\xec\xddg\x8cF\x85-\x83\xe7\xb6\xc1\x85u\xde\x1d3.f Ү\xf7\xa24\x12\xf38\xce\x1d\xb5\xe4$\x9f\xf8\v\x141\x94\xb0\x88\x1b\u007fg\xde\xe7�\x00\xb2\xfek\xc1\xfb\xb0\x83\x97\x95\x19\x88\xcf@HlO\x1b\xd2\xd6\xe1\xf0B\x87w\xc0\xe7w`V\x98\xda0&\xa9\xf6\xf6\xc0-\xff\xf2\xd9*\x9f\xcd\ty\x9e\xc9\x06l\xb7{\t\x19J\x85I\x85R\xb5,2\xa7\x93\xe7\x9eT\x8b\xf4r\xd3m\xa6s\xbf\x1a\x1aU\xbe\xcd\xe0)�mJ[[\x90\x88\xff<\xbf\x0fj\x14\x93\x89\xb1-\x92\xe6jb\xe3\xf2\"\xa0\xb5\xf7\x15\xdf\xf4>\x15t.\x18\xa9s\x90\x8d?\x8f\xc6\xect\x8b�\x9c\xeecN}\x9e\xb2ɺ\x8f\x8e\xed\xf5\xbfZ\b]\x18d\x04`^\xe4\x1dZ7�@\xd6o\r4\x8c\"\xab\xac\xa3�\xd3V\xcc$\x97\xa1\x8bwS\xf4\xc9\xec
-    raw mvcc_key/value: �bd89fd51dc7e7080000fae88001670e70629d1402009� �01aa981f0a269080007f77f234dcd06d97d57e58e5981f219ff6b089d30de52e038fa21d38711c683d9ecb04c6e334636585635683faf412bbb4951e976b7d04e09a08b00d4fb3f3ec2bdf0a7c67722de262a81a7f236d78bfb5c76df84b38cdf711db26a2777798c74ba54f178624730ed4df391787499cb33aad7eecdd678c46852d83e7b6c18575de1d332e6620d2aef7a23412f338ce1db5e4249ff80b143194b0881b7f67dee7ca8300b2fe6bc1fbb08397951988cf40486c4f1bd2d6e1f0428777c0e777605698da3026a9f6f6c02dfff2d92a9fcd09799ec9066cb77b09194a85498552b52c32a793e79e548bf472d36da673bf1a1a55becde029e394a66d4a5b5b9088ff3cbf0f6a149389b12d92e66a62e3f222a0b5f715dff43e15742e18a973908d3f8fc6ec748bdf929cee634e7d9eb2c9ba8f8eedf5bf5a085d186404605ee41d5a37dc9c40d66f0d348c22abaca3c985d356cc2497a18b7753f4c9ec2b4404bf05c3882ce0a4046d456061160e679eccf09ae40b00d1fd316eff9c537c05a88c637f4f7331658429146b7b5fe22e339f626a203b2bac99fe60b62d4297f6369e4e9838cdd2de45afcbf3ab6431cec7bc70987108897a22487153c9f4a44db16cca7031e17a58daa099c8f83cdf358de6ba1
-1617046280.292398950,0 �/Table/53/1/5938528238689125289/0�
-    ts:2021-03-29 19:31:20.29239895 +0000 UTC
-    value:�"\xa0\xa3A1\n&\x90\x80\x00\x9c\x10\xc4\x04\x801\xb8&\x8f\xcaf\"\xe0.\u008b<\"q~]\xcc\xe3(K\x19\xf9\x81OC5\xfa\x89%,\xac\x10\x99\\b\xf0�\xd9\xf9Ծ\x83�\xbfw\x1a\x9ez\xae5\n]m\u007f\xf9{\xe2\x05\x03\x1c\xb68\x0e\x91\v\xfev\xb9~\xa5\xa8\xc4\x1b\n\xe3\xc7m\xd78\xd9x_\xe6]\x9aCqsj\xc7Q\xa4ը\xc1\x80\xf5&H\xb6l\xc2G\xc0\xb1\xcf\x1eB\v\x8d\xc0Ty\xaaL\x11\x90rL\xbe\x19\xfa\x17\xd7b\x8c]E\xaa\x04\xd5\x02\xdf#$\x9b<\x1d\xda]\xfcNB\xb0O\xbf�\x86\xb1UkWo^\xe5\xf9%\xeaGAdz\xad\xd1\xc5\x10\x93\xaeE\xfb\xab\xac\xc1\x13\xff\xba\xb7\xe4_\xbd\xfc4FF\xe9Hd\xb6\xca�\xec]p\xe1Hl\xad5\xc7\u007f\xf1\x8d\x1a\xbe\xe7z\rkJ\x00\x94-=p@\xd8UecXDl\xa5c\xac\xb0\xa7\xb8Iĺ\xa9\xa8\u007f,\xceh\xbf\vl�N \xf5z!\x1f\xd9#C\xa9\xf0!o\x12�SVLi\x8em\xa5\xab-Y\xccT@#\xc0\x8e\xf8a\x0e&q\x8d\x99\x94Z$,(\xc1*\x84m\xeaN\x90>\x94s9\x1b\xec\x14\xde\xca{\xf6\x94kH\xa4\x1d\x19T\x0f\xfa\x9c}\x16[\x03\x9e@\xdaM=P\xd9;:\xdb_*E0L\xa3\a]\x92\x9f\xb6\x8e}ӯP\fp<VT\xb3\\\xf9W\x80j\xd9{\xb2�:ͬ\x03\x13\xebM\xf7\xea\xba,\x8d\xa9\x16\xdb\xfd!\
-    raw mvcc_key/value: �bd89fd5269e3f82bb103a988001670e7064231f76609� �a0a341310a269080009c10c4048031b8268fca6622e02ec28b3c22717e5dcce3284b19f9814f4335fa89252cac10995c62f0dd9bd9f9d4be83e98ebebf771a9e7aae350a5d6d7ff97be205031cb6380e910bfe76b97ea5a8c41b0ae3c76dd738d9785fe65d9a4371736ac751a4d5a8c180f52648b66cc247c0b1cf1e420b8dc05479aa4c1190724cbe19fa17d7628c5d45aa04d502df23249b3c1dda5dfc4e42b04fbfefa79086b1556b576f5ee5f925ea4741c7b3add1c51093ae45fbabacc113ffbab7e45fbdfc344646e94864b6cada9fec5d70e1486cad35c77ff18d1abee77a0d6b4a00942d3d7040d855656358446ca563acb0a7b849c4baa9a87f2cce68bf0b6ccf844e20f57a211fd92343a9f0216f12de8b53564c698e6da5ab2d59cc544023c08ef8610e26718d99945a242c28c12a846dea4e903e9473391bec14deca7bf6946b48a41d19540ffa9c7d165b039e40da4d3d50d93b3adb5f2a45304ca3075d929fb68e7dd3af500c703c5654b35cf957806ad97bb2d9883acdac0313eb4df7eaba2c8da916dbfd211b3c436e3de3f1ea5b3f88950fd9e0ff394538dc16ff9d247dbe34136111f23161b40152a776d1efa4418782d9a2916b0d4021565e146a856f5712fbd7cef4a74d8990a7c961ecea6b5acf5
- stats: contains_estimates:0 last_update_nanos:1617046662819189590 intent_age:0 gc_bytes_age:0 live_bytes:295212428 live_count:1126 key_bytes:28150 key_count:1126 val_bytes:295184278 val_count:1126 intent_bytes:0 intent_count:0 separated_intent_count:0 sys_bytes:436 sys_count:5 abort_span_bytes:0 
- stats.Sub(recomputation): last_update_nanos:1617046662819189590 live_bytes:-524356 live_count:-2 key_bytes:-50 key_count:-2 val_bytes:-524306 val_count:-2 

Looks like we've dropped two keys.

@tbg
Copy link
Member

tbg commented Mar 30, 2021

on n1 (minority) and n2 (majority):

$ ./cockroach20.1 debug range-data --replicated /mnt/data1/cockroach/auxiliary/checkpoints/r59_at_2866/ 59 | grep -Eo '^[0-9.,]+ +[^ ]+' > r59.txt
$ roachprod get tobias-1617031837-01-n4cpu4:1-2 r59.txt
$ diff 1.r59.txt 2.r59.txt 
966a967
> 1617046279.883407392,0 /Table/53/1/5898728633597431726/0
1129a1131
> 1617046280.292398950,0 /Table/53/1/5938528238689125289/0

These two keys are not adjacent. The first key is on line 967 of 2.r59.txt and the second one is line 1131 (and is one of the last keys in the range). Also, their timestamps are ~500ms apart, so they're most likely just unrelated and we possibly just happened to lose a key, twice. The repro one higher up also loses two keys, but the first one loses only one. Interestingly, it's always just on one range, which is interesting. I'm also assuming here that the stats reflect the truth (as we've confirmed they are doing in this hands-on repro).

The workload here is kv with --read-percent=0 and a large block size. But in particular, the batch size is one. Really, this is close to the most boring workload you can conceive of. Each insert should be just to a single kv pair. And yet, we haven't seen this error on any of our nightly benchmarks (which may just be luck), so possibly it has to do with the large block size (~256kb), or with the splits, though these inserts never straddle the splits (since they're point inserts, in fact should take the 1PC optimization most of the time though I haven't checked). Additionally, it's on a three-node cluster, so there won't be any rebalancing.

At first glance a plausible explanation for the bug could be that we're simply dropping a whole pebble batch on the floor, but that isn't it. We are definitely losing the kv write in tandem with the stats write, but we seem to be updating the raft applied index just fine, or we would see fatal errors - as we do assert that it goes up in increments of exactly one. A more likely explanation is that we are erroneously applying a command with a forced error when we shouldn't (only on the minority replica), but I don't see any particular reason why this would happen in this test and not others (and a cursory review of the code shows no source of nondeterminism).

The timestamp of the first lost key translates to Monday, March 29, 2021 7:31:19 PM. Looking through the combined logs at that time via

cockroach debug merge-logs logs/.unredacted/cockroach.tobias.log | grep -Ev 'handle raft ready|runtime stats' | grep r59 | xsel -b

there are a few things going on. r59 is created via a split ~2 minutes before the lost writes (so that split is likely not to blame). Around 70s before the problematic write there's a slow heartbeat. 60s before the write the raft log queue truncates (due to size), cutting off n3 from the raft log. A snapshot is sent and applied by n3 at 19:31:10, i.e. ~8s before the write. 2s after the write, the raft log queue truncates again, causing another snapshot (but none is sent for a while, it only gets applied at 19:34:11, minutes after the write. We're also seeing a split in the meantime, at 19:32:10, ~50s after the write. Finally, starting at 19:34:27 (minutes after the write) we see a flurry of failed reproposals due to closed timestamps on n2, and also a bunch of "have been waiting" messages from DistSender trying to talk to 459. That's pretty much it - at 19:37:49 the inconsistency is detected.

Nothing really stands out here. Since we have a repro cycle - as bad as it may be - I am going to instrument the binary to print at each raft log position the command and whether we applied it or not. That should provide valuable insight.

@tbg
Copy link
Member

tbg commented Mar 30, 2021

#62791 now running with up to 800 iterations. 💸 💸 💸

It's also looking increasingly likely that release-21.1 isn't going to reproduce - I'm ~700 out of 800 runs in and the only failure that has cropped up was a clock offset fatal.

@tbg
Copy link
Member

tbg commented Mar 30, 2021

Doing some light math here on the repro changes... I got 1 in 200 on master, and the failures are Binomially distributed, so let's say the p is really 0.45 as I said above:

image

Then with 0/800 basically rules out that it'll ever show up (i.e. under the same distribution, I would've been ~guaranteed to see a failure)

image

That tool also has facilities for giving you 95% confidence interval estimates for p (given the observation n=100, x=1). The most optimistic one gives 0.3%, the pessimistic one a meager 0.14%. Let's assume the worst (0.14%) - a >= 1/200 observation would be expected in ~25% of cases, and a >= 1/800 observation in 75% of cases. If I give 21.1 another 800 runs overnight, that will go up to 93%, at which point I'll feel pretty good about saying that the problem does not exist on release-21.

@tbg
Copy link
Member

tbg commented Apr 7, 2021

run 114, tobias-1617787468-46-n4cpu4:2

panic: TBG CRASH DETECTED GOID=158: r40: truncated cache to [143, 207] but got an entry for 208..263
add=false bytesGuessed=0bytesAdded=0 entriesAdded=0 bytesRemoved=0 entriesRemoved=0

@tbg
Copy link
Member

tbg commented Apr 7, 2021

run 126, tobias-1617787468-24-n4cpu4:2

panic: TBG CRASH DETECTED GOID=159: TBG term regression: idx 312 at term=8 appended with logterm=10

I'll leave this running but you get the idea. We catch the raft entry cache assertion quite a bit, and the term regression sometimes.

@tbg
Copy link
Member

tbg commented Apr 7, 2021

These repros all have add=false, which means that (*Cache).Add becomes really trivial - it basically just does

	if truncate {
		truncIdx := ents[0].Index
		if add { // not taken
			// Some entries were already overwritten.
			truncIdx = ents[len(ents)-1].Index + 1
		}
		bytesRemoved, entriesRemoved = p.truncateFrom(truncIdx)
	}

I have the Cache here - should be doable to figure out why this didn't do what was advertised.

One last failure that occurred before I aborted the repro attempts:

run 144, tobias-1617787468-18-n4cpu4:1

panic: TBG CRASH DETECTED GOID=221: r41: truncated cache to [2429, 2551] but got an entry for 2552..2555
add=false bytesGuessed=0bytesAdded=0 entriesAdded=0 bytesRemoved=0 entriesRemoved=0

@tbg
Copy link
Member

tbg commented Apr 7, 2021

I think I have something. The problem is indeed in the code from the last comment. When !add (as it is in every repro I've looked at), we are starting to truncate at the first index we thought we would add (but did not in fact add). Now, when that index is actually lower than the first index we have cached, I think truncateFrom turns into a noop - but it should have removed every entry!

For example, in the dump I'm looking at right now (run 71), we were inserting entries 1154...1253 when the assertion hit. The ringbuf (post assertion, but I think it wasn't mutated at all) has entries 1213..1268. Note 1154 < 1213.

Ok, time to fix that and to see if that indeed does explain everything... but I feel it must. Will have some hopefully not preliminary victory lunch before.

@tbg
Copy link
Member

tbg commented Apr 7, 2021

Spinning up the repro machine again with provisional fix 97bf81b

tbg added a commit to tbg/cockroach that referenced this issue Apr 8, 2021
Likely due to golang/gogo proto incompatibilities around our use of the
`(gogoproto.nullable)` option, calling `.String()` on a `raftpb.Entry`
panics.

However, upstream at our currently vendored commit
578ffd578362a8962194c4c0c68c1a9824cf9ea3,

```
func TestFoo(t *testing.T) {
        ent := pb.Entry{Term: 1, Index: 2}
        _ = ent.String()
}
```

does not panic.

This commit adds a test documenting the undesired behavior and working
around it in `./cockroach debug raft-log` (and the other debug commands
that stringify log entries).

Found while investigating cockroachdb#61990.

Release note: None
craig bot pushed a commit that referenced this issue Apr 8, 2021
62728: roachtest: don't call t.Fatal in FailOnReplicaDivergence r=erikgrinaker a=tbg

In #61990 we had this method catch a stats divergence on teardown in an
otherwise successful test. The call to `t.Fatal` in that method
unfortunately prevented the logs from being collected, which is not
helpful.

Release note: None


63073: roachtest: retire `bank` tests r=nvanbenschoten,andreimatei a=tbg

The `bank` roachtests are a basic form of Jepsen test. We were hoping
that they could provide additional benefit due to being less complex
than a full Jepsen test. In my opinion, this calculus has not worked
out. We've spent many hours staring at this test deadlocked when it
wasn't CockroachDB's fault at all. Besides, it's not adding anything
meaningful on top of our Jepsen tests plus KVNemesis plus the TPCC
invariant checks, so we are removing the `bank` family of tests here.

Should we want to reintroduce these tests, we should do it at the level
of TestCluster, or at least augment the `bank` workload to incorporate
these invariant checks instead.

Closes #62754.
Closes #62749
Closes #53871.

Release note: None


63214: kvserver: reduce ReplicaGCQueueInactivityThreshold to 12 hours r=ajwerner a=erikgrinaker

`ReplicaGCQueueInactivityThreshold` specifies the interval at which the
GC queue checks whether a replica has been removed from the canonical
range descriptor, and was set to 10 days. This is a fallback for when we
fail to detect the removal and GC the replica immediately. However, this
could occasionally cause stale replicas to linger for 10 days, which
surprises users (e.g. by causing alerts if the stale replica thinks the
range has become underreplicated).

This patch reduces the threshold to 12 hours, which is a more reasonable
timeframe for users to expect things to "sort themselves out". The
operation to read the range descriptor is fairly cheap, so this is not
likely to cause any problems, and the interval is therefore not jittered
either.

Resolves #63212, touches #60259.

Release note (ops change): Replica garbage collection now checks
replicas against the range descriptor every 12 hours (down from 10 days)
to see if they should be removed. Replicas that fail to notice they have
been removed from a range will therefore linger for at most 12 hours
rather than 10 days.

Co-authored-by: Tobias Grieger <[email protected]>
Co-authored-by: Tobias Schottdorf <[email protected]>
Co-authored-by: Erik Grinaker <[email protected]>
tbg added a commit to tbg/cockroach that referenced this issue Apr 8, 2021
When entries are appended to the raft log, they are also added to the
raft entry cache. Appending indexes, say, `100, 101, 102` to the raft
log has the (absolutely crucial for correctness) effect of removing any
higher indexes in the raft log ( `103` and up) and for correctness, the
raft entry cache must follow the same semantics, so that it is always
a coherent subset of the raft log.

It turns out it didn't do that in one edge case, namely when appending
a slice of entries that didn't fit into the cache, but whose first
entry preceded the currently cached entries. For a concrete example,
imagine the cache contains

    [101, 102, 103, 104]

and we are attempting to add a slice of large entries at indexes `[50,
51, 52]` which can not fit the cache at all (i.e. their aggregate size
exceeds that of the cache).

What *should* be the result of this addition is an empty cache. This
is because the true raft log now ends at index 52. In reality though,
the cache accidentally turned this operation into a no-op, as it was
using an iterator that would be invalid when positioned to an index
not currently cached (in this example, index `50`, whereas the first
cached index would be `101`).

It took us a while to identify this bug, as it manifested itself to us
only through a very rare replica inconsistency failure (cockroachdb#61990) of the
`hotspotsplits/nodes=4` roachtest (which was only caught thanks to
 cockroachdb#36241) that would then take days to reproduce. When we finally
managed to get a reproduction that contain the entire history of disk
writes for the cluster, we found that the inconsistency had occured
due to entries from a past term being replicated to the divergent
follower. Concretely, where the leader had something like

| idx  | 100 | 102 | 103 | 104 | 105 | 106 | 107 | 108 |
|------|-----|-----|-----|-----|-----|-----|-----|-----|
| term | 9   | 9   | 9   | 9   | 9   | 9   | 9   | 9   |

the divergent follower would have this:

| idx  | 100 | 102 | 103 | 104 | 105 | 106 | 107 | 108 |
|------|-----|-----|-----|-----|-----|-----|-----|-----|
| term | 9   | 9   | *7* | *7* | *7* | 9   | 9   | 9   |

This made it likely that the issue was connected to a raft leadership
change in which the "old" leader for term 7 had proposed entries for
indexes up to and including at least index 105, which the term 8 and
9 leaders had then discarded and replaced by entries for term 9. Of
course, these entries should consequently never have been committed
and in particular, the term can never regress in the raft log, even
for uncommitted entries, at least when Raft is working correctly.

From the WAL writes, we were able to ascertain that the leader (who
itself had the correct writes in the log) had in fact replicated the
incorrect (term 7) entries to the follower. This meant that that we
were either looking at pebble flat-out serving stale reads (highly
unlikely) or a bug in the raft entry cache. Being unable to find
issues through direct inspection, we added invariants and tweaked the
test to exacerbate what we considered the triggering conditions
(relatively frequent term changes).

A crucial assertion that allowed identifying this bug was adding
invariant checks (which will be re-added in a future commit) that
verified that the truncation had indeed removed subsequent entries.
This highlighted that we were hitting this bug in at least ~5% of
hotspotsplits runs, but possibly more frequently (the assertion did not
catch all instances of the bug). `hotspotsplits` brings the crucial
ingredients here: it's doing concentrated 256KiB inserts which puts the
raft group into mild overload; many entries pile up and get processed in
aggregate, which leads to constantly taking the path in which
truncations are erroneously ignored. However, most of the time omitting
the eviction does not yield replica inconsistency. For this to happen,
the erroneously cached entries need to be picked up by raft for sending
to a follower; they need to be the "incorrect" (which necessitates a
discarded log and term change at the right moment), and finally they
must be in log positions at which the correct entries actually had an
effect (which is to be fair mostly the case).

The history of this bug is actually interesting. The code that has
the bug was introduced in cockroachdb#36360. However, that PR fixed a much more
egregious version of this bug - prior to that PR, the raft entry
cache simply didn't evict entries at higher positions at all, i.e.
it exhibited the bug in many more scenarios, and this consequently
lead to more frequent inconsistencies (as referenced in the issue).
So we went from bad, moderately rare bug to bad, very very rare bug
and, in this PR, hopefully, to no bug.

A rough calculation suggests that we spent around 800.000 minutes of
`n2-standard-4` instances minutes on reproductions related to this bug
in this instance alone, which I think boils down to around $39.000. I
like to think we got off cheap here.

Closes cockroachdb#61990.

Release note (bug fix): A rare issue that could cause replica divergence
was fixed. These issues would be reported by the replica consistency
checker, typically within 24 hours of occurrence, which would cause
nodes to terminate.
tbg added a commit to tbg/cockroach that referenced this issue Apr 8, 2021
This productionizes a number of assertions that were added in cockroachdb#62791,
while investigating cockroachdb#61990.
We don't have a good story about whether assertions always need to
be behind a build tag. I added the assertions that are dirt cheap
without a build tag, and where I augmented an existing assertion
I kept it behind the existing build tag gating.

If any of these assertions fires, it's a good chance that we are facing
potential log corruption (the one in the entry cache is often benign,
but once it leads to a term regression, it's a very likely bad time), so
if anything, in the (relatively distant) future these assertions should
fail "gracefully" at the replica level as opposed to taking down the
node; we wouldn't want to disable them. Since we can't fail assertions
at the replica level now, per-node is fine and they will have plenty
of time to bake until they make it into a release.

Release note: None
nanduu04 pushed a commit to nanduu04/cockroach that referenced this issue Apr 9, 2021
In cockroachdb#61990 we had this method catch a stats divergence on teardown in an
otherwise successful test. The call to `t.Fatal` in that method
unfortunately prevented the logs from being collected, which is not
helpful.

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Apr 12, 2021
This productionizes a number of assertions that were added in cockroachdb#62791,
while investigating cockroachdb#61990.
We don't have a good story about whether assertions always need to
be behind a build tag. I added the assertions that are dirt cheap
without a build tag, and where I augmented an existing assertion
I kept it behind the existing build tag gating.

If any of these assertions fires, it's a good chance that we are facing
potential log corruption (the one in the entry cache is often benign,
but once it leads to a term regression, it's a very likely bad time), so
if anything, in the (relatively distant) future these assertions should
fail "gracefully" at the replica level as opposed to taking down the
node; we wouldn't want to disable them. Since we can't fail assertions
at the replica level now, per-node is fine and they will have plenty
of time to bake until they make it into a release.

Release note: None
craig bot pushed a commit that referenced this issue Apr 12, 2021
63302: raftentry: fix accidental no-op in truncateFrom r=nvanbenschoten,erikgrinaker a=tbg

When entries are appended to the raft log, they are also added to the
raft entry cache. Appending indexes, say, `100, 101, 102` to the raft
log has the (absolutely crucial for correctness) effect of removing any
higher indexes in the raft log ( `103` and up) and for correctness, the
raft entry cache must follow the same semantics, so that it is always
a coherent subset of the raft log.

It turns out it didn't do that in one edge case, namely when appending
a slice of entries that didn't fit into the cache, but whose first
entry preceded the currently cached entries. For a concrete example,
imagine the cache contains

    [101, 102, 103, 104]

and we are attempting to add a slice of large entries at indexes `[50,
51, 52]` which can not fit the cache at all (i.e. their aggregate size
exceeds that of the cache).

What *should* be the result of this addition is an empty cache. This
is because the true raft log now ends at index 52. In reality though,
the cache accidentally turned this operation into a no-op, as it was
using an iterator that would be invalid when positioned to an index
not currently cached (in this example, index `50`, whereas the first
cached index would be `101`).

It took us a while to identify this bug, as it manifested itself to us
only through a very rare replica inconsistency failure (#61990) of the
`hotspotsplits/nodes=4` roachtest (which was only caught thanks to
 #36241) that would then take days to reproduce. When we finally
managed to get a reproduction that contain the entire history of disk
writes for the cluster, we found that the inconsistency had occured
due to entries from a past term being replicated to the divergent
follower. Concretely, where the leader had something like

| idx  | 101 | 102 | 103 | 104 | 105 | 106 | 107 | 108 |
|------|-----|-----|-----|-----|-----|-----|-----|-----|
| term | 9   | 9   | 9   | 9   | 9   | 9   | 9   | 9   |

the divergent follower would have this:

| idx  | 101 | 102 | 103 | 104 | 105 | 106 | 107 | 108 |
|------|-----|-----|-----|-----|-----|-----|-----|-----|
| term | 9   | 9   | *7* | *7* | *7* | 9   | 9   | 9   |

This made it likely that the issue was connected to a raft leadership
change in which the "old" leader for term 7 had proposed entries for
indexes up to and including at least index 105, which the term 8 and
9 leaders had then discarded and replaced by entries for term 9. Of
course, these entries should consequently never have been committed
and in particular, the term can never regress in the raft log, even
for uncommitted entries, at least when Raft is working correctly.

From the WAL writes, we were able to ascertain that the leader (who
itself had the correct writes in the log) had in fact replicated the
incorrect (term 7) entries to the follower. This meant that that we
were either looking at pebble flat-out serving stale reads (highly
unlikely) or a bug in the raft entry cache. Being unable to find
issues through direct inspection, we added invariants and tweaked the
test to exacerbate what we considered the triggering conditions
(relatively frequent term changes).

A crucial assertion that allowed identifying this bug was adding
invariant checks (which will be re-added in a future commit) that
verified that the truncation had indeed removed subsequent entries.
This highlighted that we were hitting this bug in at least ~5% of
hotspotsplits runs, but possibly more frequently (the assertion did not
catch all instances of the bug). `hotspotsplits` brings the crucial
ingredients here: it's doing concentrated 256KiB inserts which puts the
raft group into mild overload; many entries pile up and get processed in
aggregate, which leads to constantly taking the path in which
truncations are erroneously ignored. However, most of the time omitting
the eviction does not yield replica inconsistency. For this to happen,
the erroneously cached entries need to be picked up by raft for sending
to a follower; they need to be the "incorrect" (which necessitates a
discarded log and term change at the right moment), and finally they
must be in log positions at which the correct entries actually had an
effect (which is to be fair mostly the case).

The history of this bug is actually interesting. The code that has
the bug was introduced in #36360. However, that PR fixed a much more
egregious version of this bug - prior to that PR, the raft entry
cache simply didn't evict entries at higher positions at all, i.e.
it exhibited the bug in many more scenarios, and this consequently
lead to more frequent inconsistencies (as referenced in the issue).
So we went from bad, moderately rare bug to bad, very very rare bug
and, in this PR, hopefully, to no bug.

A rough calculation suggests that we spent around 800.000 minutes of
`n2-standard-4` instances minutes on reproductions related to this bug
in this instance alone, which I think boils down to around $39.000. I
like to think we got off cheap here.

Closes #61990.

Release note (bug fix): A rare issue that could cause replica divergence
was fixed. These issues would be reported by the replica consistency
checker, typically within 24 hours of occurrence, which would cause
nodes to terminate.


Co-authored-by: Tobias Grieger <[email protected]>
@craig craig bot closed this as completed in 54c27b3 Apr 12, 2021
tbg added a commit to tbg/cockroach that referenced this issue Apr 12, 2021
When entries are appended to the raft log, they are also added to the
raft entry cache. Appending indexes, say, `100, 101, 102` to the raft
log has the (absolutely crucial for correctness) effect of removing any
higher indexes in the raft log ( `103` and up) and for correctness, the
raft entry cache must follow the same semantics, so that it is always
a coherent subset of the raft log.

It turns out it didn't do that in one edge case, namely when appending
a slice of entries that didn't fit into the cache, but whose first
entry preceded the currently cached entries. For a concrete example,
imagine the cache contains

    [101, 102, 103, 104]

and we are attempting to add a slice of large entries at indexes `[50,
51, 52]` which can not fit the cache at all (i.e. their aggregate size
exceeds that of the cache).

What *should* be the result of this addition is an empty cache. This
is because the true raft log now ends at index 52. In reality though,
the cache accidentally turned this operation into a no-op, as it was
using an iterator that would be invalid when positioned to an index
not currently cached (in this example, index `50`, whereas the first
cached index would be `101`).

It took us a while to identify this bug, as it manifested itself to us
only through a very rare replica inconsistency failure (cockroachdb#61990) of the
`hotspotsplits/nodes=4` roachtest (which was only caught thanks to
 cockroachdb#36241) that would then take days to reproduce. When we finally
managed to get a reproduction that contain the entire history of disk
writes for the cluster, we found that the inconsistency had occured
due to entries from a past term being replicated to the divergent
follower. Concretely, where the leader had something like

| idx  | 100 | 102 | 103 | 104 | 105 | 106 | 107 | 108 |
|------|-----|-----|-----|-----|-----|-----|-----|-----|
| term | 9   | 9   | 9   | 9   | 9   | 9   | 9   | 9   |

the divergent follower would have this:

| idx  | 100 | 102 | 103 | 104 | 105 | 106 | 107 | 108 |
|------|-----|-----|-----|-----|-----|-----|-----|-----|
| term | 9   | 9   | *7* | *7* | *7* | 9   | 9   | 9   |

This made it likely that the issue was connected to a raft leadership
change in which the "old" leader for term 7 had proposed entries for
indexes up to and including at least index 105, which the term 8 and
9 leaders had then discarded and replaced by entries for term 9. Of
course, these entries should consequently never have been committed
and in particular, the term can never regress in the raft log, even
for uncommitted entries, at least when Raft is working correctly.

From the WAL writes, we were able to ascertain that the leader (who
itself had the correct writes in the log) had in fact replicated the
incorrect (term 7) entries to the follower. This meant that that we
were either looking at pebble flat-out serving stale reads (highly
unlikely) or a bug in the raft entry cache. Being unable to find
issues through direct inspection, we added invariants and tweaked the
test to exacerbate what we considered the triggering conditions
(relatively frequent term changes).

A crucial assertion that allowed identifying this bug was adding
invariant checks (which will be re-added in a future commit) that
verified that the truncation had indeed removed subsequent entries.
This highlighted that we were hitting this bug in at least ~5% of
hotspotsplits runs, but possibly more frequently (the assertion did not
catch all instances of the bug). `hotspotsplits` brings the crucial
ingredients here: it's doing concentrated 256KiB inserts which puts the
raft group into mild overload; many entries pile up and get processed in
aggregate, which leads to constantly taking the path in which
truncations are erroneously ignored. However, most of the time omitting
the eviction does not yield replica inconsistency. For this to happen,
the erroneously cached entries need to be picked up by raft for sending
to a follower; they need to be the "incorrect" (which necessitates a
discarded log and term change at the right moment), and finally they
must be in log positions at which the correct entries actually had an
effect (which is to be fair mostly the case).

The history of this bug is actually interesting. The code that has
the bug was introduced in cockroachdb#36360. However, that PR fixed a much more
egregious version of this bug - prior to that PR, the raft entry
cache simply didn't evict entries at higher positions at all, i.e.
it exhibited the bug in many more scenarios, and this consequently
lead to more frequent inconsistencies (as referenced in the issue).
So we went from bad, moderately rare bug to bad, very very rare bug
and, in this PR, hopefully, to no bug.

A rough calculation suggests that we spent around 800.000 minutes of
`n2-standard-4` instances minutes on reproductions related to this bug
in this instance alone, which I think boils down to around $39.000. I
like to think we got off cheap here.

Closes cockroachdb#61990.

Release note (bug fix): A rare issue that could cause replica divergence
was fixed. These issues would be reported by the replica consistency
checker, typically within 24 hours of occurrence, which would cause
nodes to terminate.
tbg added a commit to tbg/cockroach that referenced this issue Apr 12, 2021
This productionizes a number of assertions that were added in cockroachdb#62791,
while investigating cockroachdb#61990.
We don't have a good story about whether assertions always need to
be behind a build tag. I added the assertions that are dirt cheap
without a build tag, and where I augmented an existing assertion
I kept it behind the existing build tag gating.

If any of these assertions fires, it's a good chance that we are facing
potential log corruption (the one in the entry cache is often benign,
but once it leads to a term regression, it's a very likely bad time), so
if anything, in the (relatively distant) future these assertions should
fail "gracefully" at the replica level as opposed to taking down the
node; we wouldn't want to disable them. Since we can't fail assertions
at the replica level now, per-node is fine and they will have plenty
of time to bake until they make it into a release.

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Apr 12, 2021
When entries are appended to the raft log, they are also added to the
raft entry cache. Appending indexes, say, `100, 101, 102` to the raft
log has the (absolutely crucial for correctness) effect of removing any
higher indexes in the raft log ( `103` and up) and for correctness, the
raft entry cache must follow the same semantics, so that it is always
a coherent subset of the raft log.

It turns out it didn't do that in one edge case, namely when appending
a slice of entries that didn't fit into the cache, but whose first
entry preceded the currently cached entries. For a concrete example,
imagine the cache contains

    [101, 102, 103, 104]

and we are attempting to add a slice of large entries at indexes `[50,
51, 52]` which can not fit the cache at all (i.e. their aggregate size
exceeds that of the cache).

What *should* be the result of this addition is an empty cache. This
is because the true raft log now ends at index 52. In reality though,
the cache accidentally turned this operation into a no-op, as it was
using an iterator that would be invalid when positioned to an index
not currently cached (in this example, index `50`, whereas the first
cached index would be `101`).

It took us a while to identify this bug, as it manifested itself to us
only through a very rare replica inconsistency failure (cockroachdb#61990) of the
`hotspotsplits/nodes=4` roachtest (which was only caught thanks to
 cockroachdb#36241) that would then take days to reproduce. When we finally
managed to get a reproduction that contain the entire history of disk
writes for the cluster, we found that the inconsistency had occured
due to entries from a past term being replicated to the divergent
follower. Concretely, where the leader had something like

| idx  | 100 | 102 | 103 | 104 | 105 | 106 | 107 | 108 |
|------|-----|-----|-----|-----|-----|-----|-----|-----|
| term | 9   | 9   | 9   | 9   | 9   | 9   | 9   | 9   |

the divergent follower would have this:

| idx  | 100 | 102 | 103 | 104 | 105 | 106 | 107 | 108 |
|------|-----|-----|-----|-----|-----|-----|-----|-----|
| term | 9   | 9   | *7* | *7* | *7* | 9   | 9   | 9   |

This made it likely that the issue was connected to a raft leadership
change in which the "old" leader for term 7 had proposed entries for
indexes up to and including at least index 105, which the term 8 and
9 leaders had then discarded and replaced by entries for term 9. Of
course, these entries should consequently never have been committed
and in particular, the term can never regress in the raft log, even
for uncommitted entries, at least when Raft is working correctly.

From the WAL writes, we were able to ascertain that the leader (who
itself had the correct writes in the log) had in fact replicated the
incorrect (term 7) entries to the follower. This meant that that we
were either looking at pebble flat-out serving stale reads (highly
unlikely) or a bug in the raft entry cache. Being unable to find
issues through direct inspection, we added invariants and tweaked the
test to exacerbate what we considered the triggering conditions
(relatively frequent term changes).

A crucial assertion that allowed identifying this bug was adding
invariant checks (which will be re-added in a future commit) that
verified that the truncation had indeed removed subsequent entries.
This highlighted that we were hitting this bug in at least ~5% of
hotspotsplits runs, but possibly more frequently (the assertion did not
catch all instances of the bug). `hotspotsplits` brings the crucial
ingredients here: it's doing concentrated 256KiB inserts which puts the
raft group into mild overload; many entries pile up and get processed in
aggregate, which leads to constantly taking the path in which
truncations are erroneously ignored. However, most of the time omitting
the eviction does not yield replica inconsistency. For this to happen,
the erroneously cached entries need to be picked up by raft for sending
to a follower; they need to be the "incorrect" (which necessitates a
discarded log and term change at the right moment), and finally they
must be in log positions at which the correct entries actually had an
effect (which is to be fair mostly the case).

The history of this bug is actually interesting. The code that has
the bug was introduced in cockroachdb#36360. However, that PR fixed a much more
egregious version of this bug - prior to that PR, the raft entry
cache simply didn't evict entries at higher positions at all, i.e.
it exhibited the bug in many more scenarios, and this consequently
lead to more frequent inconsistencies (as referenced in the issue).
So we went from bad, moderately rare bug to bad, very very rare bug
and, in this PR, hopefully, to no bug.

A rough calculation suggests that we spent around 800.000 minutes of
`n2-standard-4` instances minutes on reproductions related to this bug
in this instance alone, which I think boils down to around $39.000. I
like to think we got off cheap here.

Closes cockroachdb#61990.

Release note (bug fix): A rare issue that could cause replica divergence
was fixed. These issues would be reported by the replica consistency
checker, typically within 24 hours of occurrence, which would cause
nodes to terminate.
tbg added a commit to tbg/cockroach that referenced this issue Apr 12, 2021
This productionizes a number of assertions that were added in cockroachdb#62791,
while investigating cockroachdb#61990.
We don't have a good story about whether assertions always need to
be behind a build tag. I added the assertions that are dirt cheap
without a build tag, and where I augmented an existing assertion
I kept it behind the existing build tag gating.

If any of these assertions fires, it's a good chance that we are facing
potential log corruption (the one in the entry cache is often benign,
but once it leads to a term regression, it's a very likely bad time), so
if anything, in the (relatively distant) future these assertions should
fail "gracefully" at the replica level as opposed to taking down the
node; we wouldn't want to disable them. Since we can't fail assertions
at the replica level now, per-node is fine and they will have plenty
of time to bake until they make it into a release.

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Apr 12, 2021
When entries are appended to the raft log, they are also added to the
raft entry cache. Appending indexes, say, `100, 101, 102` to the raft
log has the (absolutely crucial for correctness) effect of removing any
higher indexes in the raft log ( `103` and up) and for correctness, the
raft entry cache must follow the same semantics, so that it is always
a coherent subset of the raft log.

It turns out it didn't do that in one edge case, namely when appending
a slice of entries that didn't fit into the cache, but whose first
entry preceded the currently cached entries. For a concrete example,
imagine the cache contains

    [101, 102, 103, 104]

and we are attempting to add a slice of large entries at indexes `[50,
51, 52]` which can not fit the cache at all (i.e. their aggregate size
exceeds that of the cache).

What *should* be the result of this addition is an empty cache. This
is because the true raft log now ends at index 52. In reality though,
the cache accidentally turned this operation into a no-op, as it was
using an iterator that would be invalid when positioned to an index
not currently cached (in this example, index `50`, whereas the first
cached index would be `101`).

It took us a while to identify this bug, as it manifested itself to us
only through a very rare replica inconsistency failure (cockroachdb#61990) of the
`hotspotsplits/nodes=4` roachtest (which was only caught thanks to
 cockroachdb#36241) that would then take days to reproduce. When we finally
managed to get a reproduction that contain the entire history of disk
writes for the cluster, we found that the inconsistency had occured
due to entries from a past term being replicated to the divergent
follower. Concretely, where the leader had something like

| idx  | 100 | 102 | 103 | 104 | 105 | 106 | 107 | 108 |
|------|-----|-----|-----|-----|-----|-----|-----|-----|
| term | 9   | 9   | 9   | 9   | 9   | 9   | 9   | 9   |

the divergent follower would have this:

| idx  | 100 | 102 | 103 | 104 | 105 | 106 | 107 | 108 |
|------|-----|-----|-----|-----|-----|-----|-----|-----|
| term | 9   | 9   | *7* | *7* | *7* | 9   | 9   | 9   |

This made it likely that the issue was connected to a raft leadership
change in which the "old" leader for term 7 had proposed entries for
indexes up to and including at least index 105, which the term 8 and
9 leaders had then discarded and replaced by entries for term 9. Of
course, these entries should consequently never have been committed
and in particular, the term can never regress in the raft log, even
for uncommitted entries, at least when Raft is working correctly.

From the WAL writes, we were able to ascertain that the leader (who
itself had the correct writes in the log) had in fact replicated the
incorrect (term 7) entries to the follower. This meant that that we
were either looking at pebble flat-out serving stale reads (highly
unlikely) or a bug in the raft entry cache. Being unable to find
issues through direct inspection, we added invariants and tweaked the
test to exacerbate what we considered the triggering conditions
(relatively frequent term changes).

A crucial assertion that allowed identifying this bug was adding
invariant checks (which will be re-added in a future commit) that
verified that the truncation had indeed removed subsequent entries.
This highlighted that we were hitting this bug in at least ~5% of
hotspotsplits runs, but possibly more frequently (the assertion did not
catch all instances of the bug). `hotspotsplits` brings the crucial
ingredients here: it's doing concentrated 256KiB inserts which puts the
raft group into mild overload; many entries pile up and get processed in
aggregate, which leads to constantly taking the path in which
truncations are erroneously ignored. However, most of the time omitting
the eviction does not yield replica inconsistency. For this to happen,
the erroneously cached entries need to be picked up by raft for sending
to a follower; they need to be the "incorrect" (which necessitates a
discarded log and term change at the right moment), and finally they
must be in log positions at which the correct entries actually had an
effect (which is to be fair mostly the case).

The history of this bug is actually interesting. The code that has
the bug was introduced in cockroachdb#36360. However, that PR fixed a much more
egregious version of this bug - prior to that PR, the raft entry
cache simply didn't evict entries at higher positions at all, i.e.
it exhibited the bug in many more scenarios, and this consequently
lead to more frequent inconsistencies (as referenced in the issue).
So we went from bad, moderately rare bug to bad, very very rare bug
and, in this PR, hopefully, to no bug.

A rough calculation suggests that we spent around 800.000 minutes of
`n2-standard-4` instances minutes on reproductions related to this bug
in this instance alone, which I think boils down to around $39.000. I
like to think we got off cheap here.

Closes cockroachdb#61990.

Release note (bug fix): A rare issue that could cause replica divergence
was fixed. These issues would be reported by the replica consistency
checker, typically within 24 hours of occurrence, which would cause
nodes to terminate.
tbg added a commit to tbg/cockroach that referenced this issue Apr 12, 2021
This productionizes a number of assertions that were added in cockroachdb#62791,
while investigating cockroachdb#61990.
We don't have a good story about whether assertions always need to
be behind a build tag. I added the assertions that are dirt cheap
without a build tag, and where I augmented an existing assertion
I kept it behind the existing build tag gating.

If any of these assertions fires, it's a good chance that we are facing
potential log corruption (the one in the entry cache is often benign,
but once it leads to a term regression, it's a very likely bad time), so
if anything, in the (relatively distant) future these assertions should
fail "gracefully" at the replica level as opposed to taking down the
node; we wouldn't want to disable them. Since we can't fail assertions
at the replica level now, per-node is fine and they will have plenty
of time to bake until they make it into a release.

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Jun 24, 2021
In cockroachdb#61990 we had this method catch a stats divergence on teardown in an
otherwise successful test. The call to `t.Fatal` in that method
unfortunately prevented the logs from being collected, which is not
helpful.

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). GA-blocker O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants