Fatal consistency check failed with 1 inconsistent replicas #35424

Closed
awoods187 opened this issue Mar 5, 2019 · 26 comments
Assignees
andreimatei
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting

Comments

@awoods187
Contributor

awoods187 commented Mar 5, 2019

Describe the problem
Hit a fatal error that killed a node while importing TPC-C 10k on a 7-node cluster of 72-CPU machines.

F190305 14:52:51.825973 686646 storage/replica_consistency.go:165  [n3,consistencyChecker,s3,r336/3:/Table/56/1/"\x04{\x02…-W\"e…}] consistency check failed with 1 inconsistent replicas
goroutine 686646 [running]:
github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0xc000521100, 0xc0005211a0, 0x5374600, 0x1e)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:1018 +0xd4
github.com/cockroachdb/cockroach/pkg/util/log.(*loggingT).outputLogEntry(0x5b160a0, 0xc000000004, 0x5374685, 0x1e, 0xa5, 0xc0184c80a0, 0x7a)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:874 +0x95a
github.com/cockroachdb/cockroach/pkg/util/log.addStructured(0x3a20040, 0xc009930d20, 0x4, 0x2, 0x32ecb3f, 0x36, 0xc0087407d0, 0x1, 0x1)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/structured.go:85 +0x2d5
github.com/cockroachdb/cockroach/pkg/util/log.logDepth(0x3a20040, 0xc009930d20, 0x1, 0xc000000004, 0x32ecb3f, 0x36, 0xc0087407d0, 0x1, 0x1)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:71 +0x8c
github.com/cockroachdb/cockroach/pkg/util/log.Fatalf(0x3a20040, 0xc009930d20, 0x32ecb3f, 0x36, 0xc0087407d0, 0x1, 0x1)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:182 +0x7e
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).CheckConsistency(0xc04529f680, 0x3a20040, 0xc009930d20, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_consistency.go:165 +0x863
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).CheckConsistency(0xc04529f680, 0x3a20040, 0xc009930d20, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_consistency.go:174 +0x9ce
github.com/cockroachdb/cockroach/pkg/storage.(*consistencyQueue).process(0xc001124480, 0x3a20040, 0xc009930d20, 0xc04529f680, 0x0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/consistency_queue.go:117 +0x1d8
github.com/cockroachdb/cockroach/pkg/storage.(*baseQueue).processReplica.func1(0x3a20040, 0xc009930d20, 0xdf8475800, 0x3a20040)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/queue.go:753 +0x24e
github.com/cockroachdb/cockroach/pkg/util/contextutil.RunWithTimeout(0x3a20040, 0xc009930d20, 0xc0348426f0, 0x2c, 0xdf8475800, 0xc00117ee90, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/contextutil/context.go:90 +0xb5
github.com/cockroachdb/cockroach/pkg/storage.(*baseQueue).processReplica(0xc000756d00, 0x3a20080, 0xc024f38540, 0xc04529f680, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/queue.go:719 +0x215
github.com/cockroachdb/cockroach/pkg/storage.(*baseQueue).processLoop.func1.2(0x3a20080, 0xc01d9bdef0)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/queue.go:646 +0xb8
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask.func1(0xc00055a7e0, 0x3a20080, 0xc01d9bdef0, 0xc00b4524c0, 0x36, 0x0, 0x0, 0xc01b5d09e0)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:325 +0xe6
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:320 +0x134

To Reproduce
roachprod create $CLUSTER -n 8 --clouds=aws --aws-machine-type-ssd=c5d.18xlarge
roachprod run $CLUSTER -- "DEV=$(mount | grep /mnt/data1 | awk '{print $1}'); sudo umount /mnt/data1; sudo mount -o discard,defaults,nobarrier ${DEV} /mnt/data1/; mount | grep /mnt/data1"
roachprod stage $CLUSTER:1-7 cockroach
roachprod stage $CLUSTER:8 workload
roachprod start $CLUSTER:1-7 -e COCKROACH_ENGINE_MAX_SYNC_DURATION=24h
roachprod adminurl --open $CLUSTER:1
roachprod run $CLUSTER:1 -- "./cockroach workload fixtures import tpcc --warehouses=10000 --db=tpcc"

Environment:
v19.1.0-beta.20190304-135-gd8f7e85
cockroach.log

@awoods187 awoods187 added the S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting label Mar 5, 2019
@awoods187
Contributor Author

@tbg
Member

tbg commented Mar 5, 2019

E190305 14:52:51.825947 686646 storage/replica_consistency.go:110  [n3,consistencyChecker,s3,r336/3:/Table/56/1/"\x04{\x02…-W\"e…}] replica (n2,s2):2 is inconsistent: expected checksum 16305128fce4247f9a055ce89d7c1e483c1dae68e63a6e5a366593cf70008ee37e644337eeb35e004236fcdcd645444534c2297530c48ca0ad46460f06e866af, got 15f200e0ed74ba584101e09a0e74f1af7bcd0d226ed18e16bbbf86e9dee77446ce372661c9c89b2377cdaa4d480c47017fefe189c4fcc85dd10baacf215f9c50
--- leaseholder
+++ follower
-0.000000000,0 /Local/RangeID/336/r/RangeAppliedState
-    ts:<zero>
-    value:"\x12\x04\b\x00\x10\x00\x18\x00 \x00(\x0024\x84O\x9c\xac\x03\b2\x10\x16\x1a)\t-\v\xd73\x92\x13\x89\x15 \x83\x80\x80\x10(\xd5\xda\x170\xae\xf3\xbe\x068\xd5\xda\x17@Ռ\xc1\tH\xd5\xda\x17`\xf3\ah\x06"
-    raw mvcc_key/value: 0169f70150727261736b00 1204080010001800200028003234844f9cac03083210161a29092d0bd73392138915208380801028d5da1730aef3be0638d5da1740d58cc10948d5da1760f3076806
+0.000000000,0 /Local/RangeID/336/r/RangeAppliedState
+    ts:<zero>
+    value:"\x12\x04\b\x00\x10\x00\x18\x00 \x00(\x002[\a\xfa\xe4Q\x03\b2\x10\x16\x1aP\t-\v\xd73\x92\x13\x89\x15 \xa3\x98\x95\xfd\xff\xff\xff\xff\xff\x01(\xb2\xe6\xfb\xff\xff\xff\xff\xff\xff\x010\x86\xe6\xec\xfe\xff\xff\xff\xff\xff\x018\xb2\xe6\xfb\xff\xff\xff\xff\xff\xff\x01@\x9d\xb2\xa8\xfe\xff\xff\xff\xff\xff\x01H\xb2\xe6\xfb\xff\xff\xff\xff\xff\xff\x01`\xf3\x04h\x05"
+    raw mvcc_key/value: 0169f70150727261736b00 120408001000180020002800325b07fae45103083210161a50092d0bd7339213891520a39895fdffffffffff0128b2e6fbffffffffffff013086e6ecfeffffffffff0138b2e6fbffffffffffff01409db2a8feffffffffff0148b2e6fbffffffffffff0160f3046805
-0.000000000,0 /Local/Range/Table/56/1/"\x04\x02z\xad\xbe\x14Dɤ\xc5\xf8\xf1w\x153F"/PrefixEnd/Transaction/"c9196ccd-c175-4df5-9469-762432eddbe6"
-    ts:<zero>
-    value:"\x12\x04\b\x00\x10\x00\x18\x00 \x00(\x002\xc1\x02T\xf6\xaf\xce\x03\nE\n\x10\xc9\x19l\xcd\xc1uM\xf5\x94iv$2\xed\xdb\xe6\x1a\x1f\x01k\x12\xc0\x89\x12\x04\x02z\xad\xbe\x14Dɤ\xc5\xf8\xf1w\x153F\x00\xff\x02\x00\x01rdsc*\n\b\xf6\x9d\xd1\xf6\x86\xd5\xc4\xc4\x150\x92\x97&8\v \x01*\n\b\x8c\xcc\xc9ӊ\xd5\xc4\xc4\x152\n\b\xf6\x9d\xd1\xf6\x86\xd5\xc4\xc4\x15Z\x18\x1a\x16\x03\xc0\x89\x12\x04f\x15h\v\xc7O\x9b\xb9F\xe1`Rc\xef(\x00\x02Z\x18\x1a\x16\x03\xc0\x89\x12\x05\xb1\x13\u007f\x1fnN\xa3\x9d\xd7f\x9c\xf8\xed\x98[\x00\x02Z\x19\x1a\x17\x95\x89\x14\xf9\\~w\xbd\xf9\r\xa7-\xa0\xfd\x05\xfdq$0V\x80\x02\x88Z\x1a\x1a\x18\x95\x89\x14\xf9\\~w\xbd\xf9\r\xa7-\xa0\xfd\x05\xfdq$0V\x80\x02\x8a\x89Z\x1a\x1a\x18\x95\x89\x14\xf9\\~w\xbd\xf9\r\xa7-\xa0\xfd\x05\xfdq$0V\x80\x02\x8b\x89Z\x1a\x1a\x18\x95\x89\x14\xf9\\~w\xbd\xf9\r\xa7-\xa0\xfd\x05\xfdq$0V\x80\x02\x8c\x89Z\x1a\x1a\x18\x95\x89\x14\xf9\\~w\xbd\xf9\r\xa7-\xa0\xfd\x05\xfdq$0V\x80\x02\x8d\x89Z\x1a\x1a\x18\x95\x89\x14\xf9\\~w\xbd\xf9\r\xa7-\xa0\xfd\x05\xfdq$0V\x80\x02\x8e\x89"
-    raw mvcc_key/value: 016b12c0891204027aadbe1444c9a4c5f8f17715334600ff02000174786e2dc9196ccdc1754df59469762432eddbe600 12040800100018002000280032c10254f6afce030a450a10c9196ccdc1754df59469762432eddbe61a1f016b12c0891204027aadbe1444c9a4c5f8f17715334600ff020001726473632a0a08f69dd1f686d5c4c41530929726380b20012a0a088cccc9d38ad5c4c415320a08f69dd1f686d5c4c4155a181a1603c08912046615680bc74f9bb946e1605263ef2800025a181a1603c0891205b1137f1f6e4ea39dd7669cf8ed985b00025a191a17958914f95c7e77bdf90da72da0fd05fd712430568002885a1a1a18958914f95c7e77bdf90da72da0fd05fd7124305680028a895a1a1a18958914f95c7e77bdf90da72da0fd05fd7124305680028b895a1a1a18958914f95c7e77bdf90da72da0fd05fd7124305680028c895a1a1a18958914f95c7e77bdf90da72da0fd05fd7124305680028d895a1a1a18958914f95c7e77bdf90da72da0fd05fd7124305680028e89
F190305 14:52:51.825973 686646 storage/replica_consistency.go:165  [n3,consistencyChecker,s3,r336/3:/Table/56/1/"\x04{\x02…-W\"e…}] consistency check failed with 1 inconsistent replicas
goroutine 686646 [running]:

@tbg
Member

tbg commented Mar 5, 2019

cluster: andy-72

@tbg
Member

tbg commented Mar 5, 2019

decoded kv pairs:

# leaseholder range applied state
[tschottdorf@toblerone cockroach (fix/marshal-panic)]$ ./cockroach debug decode-value 0169f70150727261736b00 1204080010001800200028003234844f9cac03083210161a29092d0bd73392138915208380801028d5da1730aef3be0638d5da1740d58cc10948d5da1760f3076806
0.000000000,0 /Local/RangeID/336/r/RangeAppliedState: raft_applied_index:50 lease_applied_index:22 range_stats:<last_update_nanos:1551793065261730605 live_bytes:33554435 live_count:388437 key_bytes:13613486 key_count:388437 val_bytes:19940949 val_count:388437 sys_bytes:1011 sys_count:6 >

# follower range applied state
[tschottdorf@toblerone cockroach (fix/marshal-panic)]$ ./cockroach debug decode-value 0169f70150727261736b00 120408001000180020002800325b07fae45103083210161a50092d0bd7339213891520a39895fdffffffffff0128b2e6fbffffffffffff013086e6ecfeffffffffff0138b2e6fbffffffffffff01409db2a8feffffffffff0148b2e6fbffffffffffff0160f3046805
0.000000000,0 /Local/RangeID/336/r/RangeAppliedState: raft_applied_index:50 lease_applied_index:22 range_stats:<last_update_nanos:1551793065261730605 live_bytes:-5944285 live_count:-68814 key_bytes:-2411770 key_count:-68814 val_bytes:-3532515 val_count:-68814 sys_bytes:627 sys_count:5 >

# leaseholder txn (missing on follower)
[tschottdorf@toblerone cockroach (fix/marshal-panic)]$ ./cockroach debug decode-value 016b12c0891204027aadbe1444c9a4c5f8f17715334600ff02000174786e2dc9196ccdc1754df59469762432eddbe600 12040800100018002000280032c10254f6afce030a450a10c9196ccdc1754df59469762432eddbe61a1f016b12c0891204027aadbe1444c9a4c5f8f17715334600ff020001726473632a0a08f69dd1f686d5c4c41530929726380b20012a0a088cccc9d38ad5c4c415320a08f69dd1f686d5c4c4155a181a1603c08912046615680bc74f9bb946e1605263ef2800025a181a1603c0891205b1137f1f6e4ea39dd7669cf8ed985b00025a191a17958914f95c7e77bdf90da72da0fd05fd712430568002885a1a1a18958914f95c7e77bdf90da72da0fd05fd7124305680028a895a1a1a18958914f95c7e77bdf90da72da0fd05fd7124305680028b895a1a1a18958914f95c7e77bdf90da72da0fd05fd7124305680028c895a1a1a18958914f95c7e77bdf90da72da0fd05fd7124305680028d895a1a1a18958914f95c7e77bdf90da72da0fd05fd7124305680028e89
0.000000000,0 /Local/Range/Table/56/1/"\x04\x02z\xad\xbe\x14Dɤ\xc5\xf8\xf1w\x153F"/PrefixEnd/Transaction/"c9196ccd-c175-4df5-9469-762432eddbe6": id=c9196ccd key=/Local/Range/Table/56/1/"\x04\x02z\xad\xbe\x14Dɤ\xc5\xf8\xf1w\x153F"/PrefixEnd/RangeDescriptor rw=true pri=0.02912963 stat=COMMITTED epo=0 ts=1551792061.229059830,0 orig=1551792061.229059830,0 max=0.000000000,0 wto=false seq=11 int=8

@awoods187
Contributor Author

cockroach.log

@awoods187
Contributor Author

Repro steps:
export CLUSTER=andy-72
roachprod create $CLUSTER -n 8 --clouds=aws --aws-machine-type-ssd=c5d.18xlarge
roachprod run $CLUSTER -- "DEV=$(mount | grep /mnt/data1 | awk '{print $1}'); sudo umount /mnt/data1; sudo mount -o discard,defaults,nobarrier ${DEV} /mnt/data1/; mount | grep /mnt/data1"
roachprod stage $CLUSTER:1-7 cockroach
roachprod stage $CLUSTER:8 workload
roachprod start $CLUSTER:1-7 -e COCKROACH_ENGINE_MAX_SYNC_DURATION=24h
roachprod adminurl --open $CLUSTER:1
roachprod run $CLUSTER:1 -- "./cockroach workload fixtures import tpcc --warehouses=10000 --db=tpcc"

Resulting in:

I190305 13:02:11.581205 1 ccl/workloadccl/cliccl/fixtures.go:321  starting import of 9 tables
I190305 13:02:12.284823 14 ccl/workloadccl/fixture.go:498  imported warehouse (1s, 10000 rows, 0 index entries, 532 KiB)
I190305 13:02:12.941137 15 ccl/workloadccl/fixture.go:498  imported district (1s, 100000 rows, 0 index entries, 9.9 MiB)
I190305 13:02:13.060512 165 ccl/workloadccl/fixture.go:498  imported item (1s, 100000 rows, 0 index entries, 7.8 MiB)
I190305 13:04:31.716139 164 ccl/workloadccl/fixture.go:498  imported new_order (2m20s, 90000000 rows, 0 index entries, 1.3 GiB)
I190305 13:27:50.024077 163 ccl/workloadccl/fixture.go:498  imported order (25m38s, 300000000 rows, 600000000 index entries, 18 GiB)
I190305 13:34:21.681521 162 ccl/workloadccl/fixture.go:498  imported history (32m10s, 300000000 rows, 600000000 index entries, 39 GiB)
I190305 13:50:46.564577 16 ccl/workloadccl/fixture.go:498  imported customer (48m35s, 300000000 rows, 300000000 index entries, 173 GiB)
I190305 14:35:50.869149 166 ccl/workloadccl/fixture.go:498  imported stock (1h33m39s, 1000000000 rows, 1000000000 index entries, 314 GiB)
I190305 14:53:19.206437 167 ccl/workloadccl/fixture.go:498  imported order_line (1h51m8s, 0 rows, 0 index entries, 0 B)
Error: importing fixture: importing table order_line: pq: communication error: rpc error: code = Canceled desc = context canceled
Error:  exit status 1

@andreimatei andreimatei self-assigned this Mar 5, 2019
@andreimatei
Contributor

andreimatei commented Mar 5, 2019

Here's what we know:

r336 has 3 replicas (n2, n4, n7), of which two have corrupt stats. The inconsistency had previously been discovered by n3, which killed itself when it found that it diverged from n2. n3's stats were probably fine. After it killed itself, the villain n2 became the leaseholder and upreplicated to n7, which is how n7 entered the dark side.
Besides the stats difference, n3 also had an extra committed transaction record vs n2. The record is from a split txn (Tobi, we thought it was a more regular transaction, but we misinterpreted the txn key. It is a split.).

r336 has the following history:

13:20: n2 initiates a split (manual) of r331, which creates r336
13:21: n2 initiates the first split of r336 (a manual split) - this is where the phantom split txn record comes from; it creates r634
13:22: n2 removes replica n7 (descriptor is n7, n2, n3, n4)
13:37: n3 becomes the leaseholder through a lease transfer (I think)
13:37: (a second later) n3 initiates a second split of r336 (split because of large size), creating r3289
14:15: n3 tries to upreplicate to n5 and gets rejected
14:52: n3 consistency crash !!!
14:53: n2 becomes leaseholder

16:59: n2 tries to send a preemptive snapshot to n7
16:59: n2 adds a replica on n7. descriptor: n4, n2, n3
17:37: n2 removes n3. descriptor: n4, n2, n7

A couple of interesting things:

  • the 2nd split happens very soon after a lease transfer
  • the corrupted stats show byte counts of -5MB and so on. The first split was a manual one, so it's unlikely that its stats delta was very large. The 2nd split, however, has large deltas. And yet we know that something went wrong around the first split, since we have that phantom txn record.

We have the Raft logs for the range from some index onwards; we have the second split record but not the first.

So what now? Unclear.
Two things that I believe are worth trying:

  • try to recover the truncated Raft log to get the records around the first split. We could look at what stats delta that one had, and we could also hopefully look inside the write batches for the deletion of the txn record. In particular, we could check whether n3 has that deletion command (which it apparently did not apply).
  • try to get historical versions of the RangeAppliedState key and see roughly when the stats got screwed up.

These are long shots, but what is there to do here but go really guerilla? @ajkr would you kindly assist me in squeezing RocksDB?
The other avenue, I guess, is trying a million restores on these 72-core nodes and hope for repros. @awoods187 please give us whatever commands you have.

Also, I've run cockroach debug check-store on two nodes, and it gives interesting stuff. More ranges with stat discrepancies, although none with negative sizes:
n2: https://gist.github.com/andreimatei/328344a7e6ad58db6aa2c07fc9b56710
n4: https://gist.github.com/andreimatei/6963b5d10228fa28d3a88c95c64be8c2

@tbg, what detection do we have for bad stats? Does any queue recompute them? Perhaps we should introduce something that crashes.

Attaching the range report:
r336 report.pdf
And all the logs:
r336-logs.tar.zip
And the Raft log on n2 (identical to n4):
r336-raft-log-n2.txt

Let's make it a party @bdarnell @petermattis @nvanbenschoten @ajwerner

@tbg
Member

tbg commented Mar 5, 2019

What's the check-store output on n3? Specifically, is it OK with the stats for r336?

The stats, when "consistently inconsistent" (i.e. agreeing across the replicas, but not agreeing with a recomputation from scratch), are actually fixed by the consistency checker. Along with the hash, it also computes the stats, and when there's no inconsistency but a stats delta, it triggers a recomputation. In that sense, incorrect stats by themselves are not as huge a problem, but the code we wrote should keep them consistent, though it's been known (#28946) that we sometimes don't manage to. Note that ranges that hold timeseries data always have incorrect stats, but that's OK because they have ContainsEstimates set.
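
To illustrate the mechanism, here is a toy model of the behavior described above (not the actual cockroach code; the type and function names are made up):

package main

import "fmt"

// rangeStats is a stripped-down stand-in for the MVCC stats fields involved.
type rangeStats struct {
	LiveBytes, LiveCount, KeyBytes, KeyCount, ValBytes, ValCount int64
}

// checkRange models the two outcomes of a consistency check run: a checksum
// mismatch between replicas is fatal, while a mere difference between the
// persisted stats and the stats recomputed from the data only triggers a
// recomputation (unless the range is allowed to carry estimates).
func checkRange(persisted, recomputed rangeStats, checksumsAgree, containsEstimates bool) string {
	if !checksumsAgree {
		return "fatal: consistency check failed with inconsistent replicas"
	}
	delta := rangeStats{
		LiveBytes: recomputed.LiveBytes - persisted.LiveBytes,
		LiveCount: recomputed.LiveCount - persisted.LiveCount,
		KeyBytes:  recomputed.KeyBytes - persisted.KeyBytes,
		KeyCount:  recomputed.KeyCount - persisted.KeyCount,
		ValBytes:  recomputed.ValBytes - persisted.ValBytes,
		ValCount:  recomputed.ValCount - persisted.ValCount,
	}
	if containsEstimates || delta == (rangeStats{}) {
		return "stats ok"
	}
	return fmt.Sprintf("triggering stats recomputation to resolve delta of %+v", delta)
}

func main() {
	persisted := rangeStats{LiveBytes: -5944285, LiveCount: -68814}
	recomputed := rangeStats{LiveBytes: 33554435, LiveCount: 388437}
	// Checksums agree here, so the bad stats get repaired rather than killing the node.
	fmt.Println(checkRange(persisted, recomputed, true, false))
}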

What's in the Raft logs? I think that the debug command needs to be brushed up a bit to be useful.

func tryRaftLogEntry(kv engine.MVCCKeyValue) (string, error) {

This code should also pretty-print the WriteBatch. You can just make an iterator from it and iterate through, maybe even calling the existing pretty printer code (so that the values get unmarshaled as well) -- the MVCCStats deltas for the individual log entries will be in there. And some indexes (like 55) aren't even decoded as a Raft entry, so something weird happens for them in SprintKeyValue (tryRaftLogEntry must be returning something unexpected). Maybe those are conf changes, but still, there's something wrong there.
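
For what it's worth, here is a rough sketch of what walking the WriteBatch could look like. It assumes the batch representation contains only plain Put (0x1) and Delete (0x0) records and hand-rolls the decoding; the real tool should instead go through the engine package's own batch reader and the existing pretty-printers.

package main

import (
	"encoding/binary"
	"fmt"
)

// decodeWriteBatch walks a RocksDB WriteBatch representation: an 8-byte
// little-endian sequence number, a 4-byte little-endian record count, then
// one record per mutation. Only Put (0x1) and Delete (0x0) are handled here.
func decodeWriteBatch(repr []byte) error {
	if len(repr) < 12 {
		return fmt.Errorf("batch too short: %d bytes", len(repr))
	}
	count := binary.LittleEndian.Uint32(repr[8:12])
	buf := repr[12:]
	readSlice := func() ([]byte, error) {
		n, sz := binary.Uvarint(buf) // varint length prefix
		if sz <= 0 || uint64(len(buf)-sz) < n {
			return nil, fmt.Errorf("malformed length prefix")
		}
		s := buf[sz : sz+int(n)]
		buf = buf[sz+int(n):]
		return s, nil
	}
	for i := uint32(0); i < count; i++ {
		if len(buf) == 0 {
			return fmt.Errorf("batch ended early after %d records", i)
		}
		tag := buf[0]
		buf = buf[1:]
		key, err := readSlice()
		if err != nil {
			return err
		}
		switch tag {
		case 0x0: // deletion: key only
			fmt.Printf("DEL %x\n", key)
		case 0x1: // put: key followed by value
			val, err := readSlice()
			if err != nil {
				return err
			}
			// The real tool would hand key/value to the existing pretty-printer
			// here, so MVCCStats deltas and other protos get unmarshaled.
			fmt.Printf("PUT %x -> %x\n", key, val)
		default:
			return fmt.Errorf("unhandled record type %#x", tag)
		}
	}
	return nil
}

func main() {
	// Tiny hand-built batch with a single Put, just to exercise the decoder.
	batch := append(make([]byte, 12), 0x1, 3, 'k', 'e', 'y', 5, 'v', 'a', 'l', 'u', 'e')
	binary.LittleEndian.PutUint32(batch[8:12], 1)
	if err := decodeWriteBatch(batch); err != nil {
		fmt.Println("error:", err)
	}
}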

@tbg
Member

tbg commented Mar 5, 2019

> What's the check-store output on n3? Specifically, is it OK with the stats for r336?

Nevermind, n4 is likely consistent with n3, and is ok with r336.

@tbg
Member

tbg commented Mar 5, 2019

In the n2 dump, check r3289 -- it's adjacent to r336, isn't it? And its stats are exactly the negatives of those of the corrupt r336 -- that can't be a coincidence:

r336:/Table/56/1/"\x04{\x02z\xad\xbe\x14Dɤ\xc5\xf8\xf1w\x153F"/PrefixEnd-W\"e\xf2\xecIZ\xa0\x88\x0e!9~K`"} [(n4,s4):4, (n2,s2):2, (n7,s7):5, next=6, gen=2]: diff(actual, claimed): LiveBytes: 33554435 != -5944285
LiveCount: 388437 != -68814
KeyBytes: 13613486 != -2411770
KeyCount: 388437 != -68814
ValBytes: 19940949 != -3532515
ValCount: 388437 != -68814

r3289:/Table/56/1/"\x04{W\"e\xf2\xecIZ\xa0\x88\x0e!9~K`"-f\x15h\v\xc7O\x9b\xb9F\xe1`Rc\xef("/PrefixEnd} [(n4,s4):1, (n2,s2):2, (n1,s1):4, next=5, gen=0]: diff(actual, claimed): LiveBytes: 5944285 != 45443005
LiveCount: 68814 != 526065
KeyBytes: 2411770 != 18437026
KeyCount: 68814 != 526065
ValBytes: 3532515 != 27005979
ValCount: 68814 != 526065

@tbg
Member

tbg commented Mar 5, 2019

r3289 was created through the second split

> 13:37: (a second later) n3 initiates a second split of r336 (split because of large size)

Now I would feel a lot better if someone told me that the stray txn record also belonged to the second split, because then I could stop worrying about the first split.

Anyway, these stats above tell a story: after the second split, r336 claims to have exactly the negative of the actual stats of r3289. This would happen if r336 at the time of the split trigger had completely zero stats. (I have to double-check that in the code, but intuitively it makes sense that if the LHS before the split thinks it has 0, and it gives away 68814, then afterwards it should think it has -68814.) Of course we know that it didn't have zero stats.
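
To spell out that intuition as arithmetic, here is a toy sketch reduced to a single counter (the 500000 starting value is made up for illustration; only the 68814 figure comes from the stats above):

package main

import "fmt"

// applySplit models the stats bookkeeping of a split trigger, reduced to a
// single counter: the RHS is seeded with the recomputed stats for its half
// (rhsDelta), and roughly -rhsDelta is applied to the LHS.
func applySplit(lhsBefore, rhsDelta int64) (lhsAfter, rhsInitial int64) {
	return lhsBefore - rhsDelta, rhsDelta
}

func main() {
	// Healthy replica: the LHS knew about all of its keys before the split.
	fmt.Println(applySplit(500000, 68814)) // 431186 68814
	// Replica whose LHS stats had somehow been zeroed out: it gives away
	// 68814 and is left thinking it holds -68814.
	fmt.Println(applySplit(0, 68814)) // -68814 68814
}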

@tbg
Member

tbg commented Mar 5, 2019

The Raft log entry for the second split has these stats (in the split trigger):

rhs_delta:
    last_update_nanos:1551793064152233192
    live_bytes:45443005
    live_count:526065
    key_bytes:18437026
    key_count:526065
    val_bytes:27005979
    val_count:526065
    sys_bytes:250
    sys_count:4
delta:
    last_update_nanos:1551793064152233192
    live_bytes:-45443005
    live_count:-526065
    key_bytes:-18437026
    key_count:-526065
    val_bytes:-27005979
    val_count:-526065
    sys_bytes:80

I think RhsDelta is essentially what the initial stats of the RHS (r3289) will be. The delta is almost exactly -RhsDelta with the exception of sys_bytes, which makes sense since the txn record (which is being committed) lives on the LHS.

I tried to look up the range status for r3289, but... it says the range was not found! Nothing was in the logs attached here, so I pulled the new logs and found this:

ip-172-31-40-36> I190305 19:34:28.242575 1177054 storage/replica_consistency.go:140 [n4,consistencyChecker,s4,r3289/1:/Table/56/1/"\x04{W"e…-f\x1…}] triggering stats recomputation to resolve delta of {ContainsEstimates:false LastUpdateNanos:1551806265947324605 IntentAge:0 GCBytesAge:0 LiveBytes:39498720 LiveCount:457251 KeyBytes:16025256 KeyCount:457251 ValBytes:23473464 ValCount:457251 IntentBytes:0 IntentCount:0 SysBytes:0 SysCount:0 XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}

(this is the stats recomputation mechanism I described above; in particular, the consistency checker found no divergence, though this is after a bunch of rebalancing that may well have masked any problem).

Somewhat later, the range gets merged away, which explains why I wasn't seeing it.

ip-172-31-34-81> I190305 19:40:44.376563 815244 storage/replica_command.go:385 [n2,merge,s2,r336/2:/Table/56/1/"\x04{\x02…-W"e…}] initiating a merge of r3289:/Table/56/1/"\x04{W"e\xf2\xecIZ\xa0\x88\x0e!9~K"-f\x15h\v\xc7O\x9b\xb9F\xe1Rc\xef("/PrefixEnd} [(n4,s4):1, (n2,s2):2, (n7,s7):5, next=6, gen=0] into this range (lhs+rhs has (size=-5.7 MiB+5.7 MiB qps=0.00+0.00 --> 0.00qps) below threshold (size=0 B, qps=0.00))
ip-172-31-32-186> I190305 19:40:44.382955 780982 storage/store.go:2453 [n1,replicaGC,s1,r3289/4:/Table/56/1/"\x04{W"e…-f\x1…}] removing replica r3289/4
ip-172-31-32-186> I190305 19:40:44.383549 780982 storage/replica_destroy.go:150 [n1,replicaGC,s1,r3289/4:/Table/56/1/"\x04{W"e…-f\x1…}] removed 68821 (68814+7) keys in 0ms [clear=0ms commit=0ms]
ip-172-31-40-36> I190305 19:40:44.384324 714 storage/store.go:2453 [n4,s4,r336/4:/Table/56/1/"\x04{\x02…-W"e…}] removing replica r3289/1
ip-172-31-35-192> I190305 19:40:44.386441 711 storage/store.go:2453 [n7,s7,r336/5:/Table/56/1/"\x04{\x02…-W"e…}] removing replica r3289/5
ip-172-31-34-81> I190305 19:40:44.386514 588 storage/store.go:2453 [n2,s2,r336/2:/Table/56/1/"\x04{\x02…-W"e…}] removing replica r3289/2

Notice something? r336 merges that range back in. This wasn't reflected in the info posted above so far (because it hadn't happened yet), but look what it did to the range status at r336:

[screenshot of the r336 range status]

@andreimatei
Contributor

> Now I would feel a lot better if someone told me that the stray txn record also belonged to the second split, because then I could stop worrying about the first split.

It seems to belong to the first split, tho. The txn record is:

0.000000000,0 /Local/Range/Table/56/1/"\x04\x02z\xad\xbe\x14Dɤ\xc5\xf8\xf1w\x153F"/PrefixEnd/Transaction/"c9196ccd-c175-4df5-9469-762432eddbe6": id=c9196ccd key=/Local/Range/Table/56/1/"\x04\x02z\xad\xbe\x14Dɤ\xc5\xf8\xf1w\x153F"/PrefixEnd/RangeDescriptor rw=true pri=0.02912963 stat=COMMITTED epo=0 ts=1551792061.229059830,0 orig=1551792061.229059830,0 max=0.000000000,0 wto=false seq=11 int=8

That timestamp is 13:21:01 GMT.

@bdarnell
Contributor

bdarnell commented Mar 5, 2019

> The stats, when "consistently inconsistent" (i.e. agreeing across the replicas, but not agreeing with a recomputation from scratch), are actually fixed by the consistency checker.

But when "inconsistently inconsistent", they stay that way, because the consistency checker works by computing a delta and applying it equally to all replicas. The only time we write absolute values to the stats is when creating new stats records for the RHS of a split. At all other times we adjust them with deltas, so if the first split led to a discrepancy it would persist across future splits and the actions of the consistency queue.
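
A toy sketch of why such an error never heals (single counter, made-up deltas): two replicas that start out disagreeing by some amount keep disagreeing by exactly that amount, no matter how many identical deltas both of them apply.

package main

import "fmt"

// applyDeltas mirrors how the stats are maintained in steady state: only
// increments are ever applied, never absolute values.
func applyDeltas(start int64, deltas []int64) int64 {
	for _, d := range deltas {
		start += d
	}
	return start
}

func main() {
	deltas := []int64{+1000, -250, +42} // made-up workload deltas
	good := applyDeltas(68814, deltas)  // replica with correct starting stats
	bad := applyDeltas(0, deltas)       // replica whose stats were wrongly zeroed
	fmt.Println(good - bad)             // still 68814: the discrepancy never heals
}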

@awoods187
Contributor Author

A second node died
cockroach.log

I extended the cluster so it's still available.

@nvanbenschoten
Member

It's possible that #34989 is the same issue.

@tbg
Member

tbg commented Mar 6, 2019

> A second node died

Looks like the same inconsistency. This time, the leaseholder had the incorrect stats and compared itself with n4 (the one with the correct stats), so it killed itself. The new leaseholder is also incorrect, and so the newly added replica is incorrect as well. Effectively the same situation as before.

@tbg
Member

tbg commented Mar 6, 2019

> It's possible that #34989 is the same issue.

That's a shame; I wasn't able to find the logs for it (I was hoping it was a roachtest, but apparently not).

@tbg
Member

tbg commented Mar 6, 2019

I'm poking at this with ldb (cockroach debug rocksdb) now, https://gist.github.com/tbg/afabae1bea75f8f427990f339f1111a3 has my periodically updated notes.

So far I'm able to read the stats key, and I'm waiting for the command that dumps all the historical versions to return (I didn't figure out how to do that for a single key; it seems to dump the whole DB).

@tbg
Member

tbg commented Mar 6, 2019

The RocksDB sleuthing worked out, though I can't say I got very good information out of it. (Next time this happens, we need to immediately shut down the inconsistent nodes to preserve more history at the RocksDB level.)

Below is the earliest applied state from the incorrect node (n2), at lease applied index 18. The Raft command before that one was the second split (which doesn't matter, just pointing it out). Note the sys_count of 4:

0.000000000,0 /Local/RangeID/336/r/RangeAppliedState: raft_applied_index:41 lease_applied_index:18 range_stats:<last_update_nanos:1551793065261730605 live_bytes:-5944285 live_count:-68814 key_bytes:-2411770 key_count:-68814 val_bytes:-3532515 val_count:-68814 sys_bytes:548 sys_count:4 >

Now here's :3 (the consistent node that crashed) at the same lease applied index (not the same Raft log position, but any command that can change sys_count must, I think, assign a lease index):

0.000000000,0 /Local/RangeID/336/r/RangeAppliedState: raft_applied_index:44 lease_applied_index:18 range_stats:<last_update_nanos:1551793065261730605 live_bytes:33554435 live_count:388437 key_bytes:13613486 key_count:388437 val_bytes:19940949 val_count:388437 sys_bytes:932 sys_count:5 >

At the same time :3 has one transaction record extra (the one for the first split), which is a sys_count delta of one.

I verified the range-descriptors on :3 and :6 and they agree (up to the point at which :3 crashed). So both replicas applied all of the splits similarly.

  • n4 initiated the second split right away based on size, n2 didn't bother -- because n2 probably had zero stats on r336
  • when n2 applies the (second) split trigger, it gets the negative of the RHS stats, because that's what happens when you have zero stats. Additionally, the RHS that is created here (r3289) has fairly incorrect stats; however, both replicas agree on what they are
  • r634 (created via the first split) doesn't show up with inconsistent stats anywhere

This doesn't seem to add up. Let's assume the first split entirely messed things up and left n2 (and the other replica) with mostly zero stats, but somehow initialized the RHS successfully so that it never shows up here (and I just manually ran a consistency check, which didn't kill any additional nodes). n4 presumably has both sides of the split 100% correct at this point, so the split trigger computed by n2 must've reflected the reality on the ground stats-wise. Now how would the second split mess up the range stats on the RHS of the second split (r3289)? The consistent node (the one that crashed) shows that as the only range for which the persisted stats don't match, but it itself evaluated that split.

On the other hand, if the first split was fine and the second somehow did the damage (which is more plausible looking at the stats), then why is there a transaction record on the consistent replica and not the others? The txn record is supposed to be there, and it also wouldn't be deleted synchronously with the split (due to the external intents). So what may have deleted that transaction proto? Committing the range descriptor would take place in the same WriteBatch as committing the txn proto, and the range descriptor seems to have committed throughout. The stats on the consistent node suggest that the txn record ought to be there. Then how did it disappear elsewhere?

@tbg
Member

tbg commented Mar 6, 2019

@awoods187 I was trying to run this kind of cluster 5x but ran into the resource limit for all of them. Do you know how to bump the limit?

@tbg
Member

tbg commented Mar 6, 2019

(I'll try with some less beefy machines for now)

@ajwerner
Contributor

ajwerner commented Mar 6, 2019

It’s easy enough to bump the resource limit. You have to go to the support center in the AWS console and file a ticket. It has a form for the region, type, and quota size. For smaller bumps it should be automated. For larger ones they usually get to it within a couple of hours.

@awoods187 awoods187 added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Mar 6, 2019
@andreimatei
Contributor

Let's recap. We seem to have three unexplainable things happening here:

  1. A transaction record was left over on one replica but not on the other two.
  2. n3 computed the wrong stats for the RHS of the second split of r336
  3. n2 apparently had empty stats for r336 when processing the split and ended up with negative values for r336 after applying the split's delta

What's the common thread between these facts? Beats me.
Working on trying to repro and on trying to recover more Raft logs.

@andreimatei
Contributor

Unfortunately I was not able to get any more Raft logs or info.
I've looked on all nodes for Raft logs of r336 and there's nothing since before the second split, even with low-level RocksDB scans.

The Raft log key range for r336 is

0X0169F70150757266746C00 - 0X0169F70150757266746D00

And so I've tried stuff like:

~/cockroach debug rocksdb --db=cockroach idump  --from=0x0169F7015075726674 --max_keys=10 --hex --input_key_hex

And I got nothing before log index 34, which we already had.

I've also taught cockroach debug raft-log to decode the write batches inside the applied commands, and I saw nothing interesting.

I've also been trying to reproduce by running the fixtures import with more aggressive consistency checks, but no dice. I've been doing it at low scale, though; I'm going to try to get more industrial about it.

@tbg
Member

tbg commented Mar 18, 2019

#35861 should help tickle these things as they happen.
