stability: raft log panic on fresh cluster #11591
Both those tests exhibited no problems under 10min of stress (each).
I just wiped cyan (testing continuous-deployment with wipe).
Node
Crash was:
node is still down, and all continuous deployment is paused.
Can you stop the cluster?
Cluster is stopped and all alerts for
Here are some other log messages that reference
The stack trace for the crash is:
So we added a replica to a node and then very quickly decided to remove it. Shortly after the removal the replica received a heartbeat that we sent on through to Raft. This is surprising because we're not supposed to process Raft messages on destroyed replicas.
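For context, here's a minimal sketch of the check being described, with made-up names rather than the actual store code: a tombstone written when a replica is destroyed records the lowest replica ID still allowed, and incoming Raft messages addressed below that ID should be dropped rather than handed to Raft.

```go
package main

import "fmt"

// replicaTombstone is a simplified stand-in for the marker a store writes when
// it destroys a replica: nextReplicaID is the lowest replica ID that is still
// allowed to be created for that range on this store.
type replicaTombstone struct {
	nextReplicaID int
}

// shouldDropRaftMsg models the invariant above: once a replica has been
// destroyed, incoming Raft messages addressed to a replica ID below the
// tombstone's nextReplicaID must be dropped instead of handed to Raft.
func shouldDropRaftMsg(toReplicaID int, ts *replicaTombstone) bool {
	return ts != nil && toReplicaID < ts.nextReplicaID
}

func main() {
	ts := &replicaTombstone{nextReplicaID: 47} // replica 46 was just destroyed
	fmt.Println(shouldDropRaftMsg(46, ts))     // true: message for the dead replica is dropped
	fmt.Println(shouldDropRaftMsg(47, ts))     // false: a later incarnation is allowed through
}
```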
In addition to the Raft panic, we're seeing some sort of rebalancing thrashing going on:
So we've added a different replica for
I'm not sure if I'm interpreting the goroutine output correctly, but I'm suspicious that the replica GC queue is using a different
The panicking goroutine shows:
I'm pretty sure
Ok, looked at how to interpret goroutine stack output a bit more. The
All data/logs backed up on all instances to:
Debugging aid for cockroachdb#11591.
Re: rebalancer thrashing. I can see rebalancer badness using:
cyan restarted with sha 459d9c1
started stop/wipe/start on cyan, letting it run for ~20-30 minutes. no crashes so far.
New crash on cyan after 3 minutes: panic message:
full log:
Crap. My testing of
The goroutine stacks once again indicate that the
that was quick.
Full log:
This was the first reference to |
Rebalancer thrashing:
The above logs are from the leader of |
Here's the timeline I'm seeing:
An indication that this is happening is the
Cc @bdarnell
On Tue, Nov 29, 2016 at 3:36 PM, Peter Mattis wrote:

> n6: GC replica 46 for r22. Note that the tombstone will be for replica ID 46, not 47, because this node hasn't received the add replica notification yet.

Don't we do a consistent lookup of the range descriptor in the GC queue?
We do, but we don't pass that down when we destroy the replica.
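To make the race concrete, here's a toy sketch (hypothetical types, not the real replica GC queue code) of why a tombstone derived from the stale local descriptor doesn't stop the heartbeat that revives the replica, while one derived from the consistent lookup would:

```go
package main

import "fmt"

// tombstone is a toy model: nextReplicaID is the lowest replica ID the
// tombstone still permits on this store (assumed semantics).
type tombstone struct{ nextReplicaID int }

// destroyReplica models GC'ing a replica and writing a tombstone derived from
// whichever replica ID the caller passes down.
func destroyReplica(replicaID int) tombstone {
	return tombstone{nextReplicaID: replicaID + 1}
}

// blocks reports whether an incoming Raft message addressed to the given
// replica ID would be rejected by the tombstone.
func blocks(ts tombstone, toReplicaID int) bool {
	return toReplicaID < ts.nextReplicaID
}

func main() {
	const staleLocalID = 46 // what the node's own (stale) descriptor says
	const consistentID = 47 // what the GC queue's consistent lookup sees

	// Behavior described in the thread: the consistent descriptor decides
	// *whether* to GC, but only the stale local replica ID reaches the destroy path.
	ts := destroyReplica(staleLocalID)
	fmt.Println(blocks(ts, 47)) // false: a heartbeat for replica 47 can revive the replica

	// If the replica ID from the consistent lookup were passed down instead,
	// the tombstone would also reject that heartbeat.
	ts = destroyReplica(consistentID)
	fmt.Println(blocks(ts, 47)) // true
}
```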
I'll do the stop/wipe/start cycle on cyan once this is in.
Writing a test for this is proving more difficult than expected. In order for this panic to trigger, the heartbeat has to address a non-zero
On the other hand, in order for the replica to be "revivable", its range descriptor must be stale at the time of its removal. These goals seem at odds with each other. @petermattis @bdarnell?
Can the test manually construct a problematic heartbeat message and get a local store into a bad state where it is removed but with the wrong tombstone key set?
Manually constructing the heartbeat is not a problem. How do I create a non-empty raft log while keeping the descriptor stale? I'd rather not write a broken tombstone manually because then we're not testing anything (and the test will still fail after the fix).
I'm not sure.
Yes, definitely don't do that.
Tried blocking the application of the additive change replicas in hopes of producing this, but it seems that raftMu protects against removal during log replay. I'm at the end of my rope =/
This happens when something else gets committed to the range's raft log concurrently with the ChangeReplicas, so it appears in the log after the preemptive snapshot is started but before the ChangeReplicas transaction commits. The block writer puts enough data in the raft log that (with the new smaller message sizes since #10929) the post-add-replica log replay takes multiple messages, and that's what gives you a window in which the descriptor is stale but we have processed some of the log. In a test, you could tune the constants in #10929 so low that you don't need anything else happening concurrently; the first part of the ChangeReplicas(add) transaction would go into its own message. Then you could mess with the raft transport to allow that message through but drop any MsgApp containing an EndTransaction. Then you can remove the replica (while still blocking EndTransactions at the raft level), trigger the replica GC, and send the raft message that would revive it.
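Roughly what that transport filter could look like, as a self-contained sketch with simplified stand-in types (the real test would hook the store's Raft transport and inspect the decoded commands, not match strings):

```go
package main

import (
	"fmt"
	"strings"
)

type raftMsgType int

const (
	msgHeartbeat raftMsgType = iota
	msgApp
)

// raftMessage is a simplified stand-in; real entries carry encoded commands.
type raftMessage struct {
	typ     raftMsgType
	entries []string
}

// dropEndTxnApps returns a send-side filter: deliver everything except MsgApp
// messages carrying an EndTransaction, which keeps the descriptor stale while
// earlier parts of the log still get applied.
func dropEndTxnApps() func(raftMessage) bool {
	return func(m raftMessage) bool {
		if m.typ != msgApp {
			return true // heartbeats etc. still flow
		}
		for _, e := range m.entries {
			if strings.Contains(e, "EndTransaction") {
				return false // drop: the ChangeReplicas commit is withheld
			}
		}
		return true
	}
}

func main() {
	filter := dropEndTxnApps()
	fmt.Println(filter(raftMessage{typ: msgApp, entries: []string{"ChangeReplicas(add) begin"}})) // true: delivered
	fmt.Println(filter(raftMessage{typ: msgApp, entries: []string{"EndTransaction"}}))            // false: dropped
	fmt.Println(filter(raftMessage{typ: msgHeartbeat}))                                           // true: heartbeats still flow
}
```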
Thanks for the advice! I got this working, will send a PR shortly.
The replica pointer tag was used for debugging cockroachdb#11591, which has been long since fixed.
For posterity, this test was added in #11798.
cluster: cobalt (new cluster for continuous-deployment from empty clusters)
sha: b5a19ba
Cluster started from scratch at 161124-08:26:19 (well, this is the third node, but all started within 30s).
The third node (40.117.230.120) crashed after 5 minutes with:
This occurred about 100s after starting the block-writer on that node:
Full stderr from this node:
cockroach.stderr.txt
All other nodes are still up and running and processing SQL requests.
I'll leave this node down for now and silence the alert. Continuous deployment has not yet been enabled.