stability: raft log panic on fresh cluster #11591

Closed
mberhault opened this issue Nov 24, 2016 · 34 comments

@mberhault
Contributor

cluster: cobalt (new cluster for continuous-deployment from empty clusters)
sha: b5a19ba

Cluster started from scratch at 161124-08:26:19 (well, that's when the third node started, but all nodes started within 30s).
The third node (40.117.230.120) crashed after 5 minutes with:

E161124 08:31:27.294717 154 vendor/github.com/coreos/etcd/raft/log.go:191  [n3,s3,r22/42:{-}] tocommit(192123) is out of range [lastIndex(0)]. Was the raft log corrupted, truncated, or lost?
panic: tocommit(192123) is out of range [lastIndex(0)]. Was the raft log corrupted, truncated, or lost? [recovered]
        panic: tocommit(192123) is out of range [lastIndex(0)]. Was the raft log corrupted, truncated, or lost?

goroutine 154 [running]:
panic(0x1789040, 0xc42671c9c0)
        /usr/local/go/src/runtime/panic.go:500 +0x1a1
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).Recover(0xc4203ec5a0)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:185 +0x6e
panic(0x1789040, 0xc42671c9c0)
        /usr/local/go/src/runtime/panic.go:458 +0x243
github.com/cockroachdb/cockroach/pkg/storage.(*raftLogger).Panicf(0xc42671c820, 0x19c3716, 0x5d, 0xc4224441a0, 0x2, 0x2)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/raft.go:111 +0x107
github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft.(*raftLog).commitTo(0xc42525afc0, 0x2ee7b)
        /go/src/github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft/log.go:191 +0x166
github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft.(*raft).handleHeartbeat(0xc42542ac30, 0x8, 0x2a, 0x1, 0x6, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft/raft.go:1095 +0x54
github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft.stepFollower(0xc42542ac30, 0x8, 0x2a, 0x1, 0x6, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft/raft.go:1041 +0x31f
github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft.(*raft).Step(0xc42542ac30, 0x8, 0x2a, 0x1, 0x6, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft/raft.go:779 +0x11ef
github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft.(*RawNode).Step(0xc420e64240, 0x8, 0x2a, 0x1, 0x6, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft/rawnode.go:195 +0xc9
github.com/cockroachdb/cockroach/pkg/storage.(*Store).processRaftRequest.func4(0xc420e64240, 0x0, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:2965 +0x89
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).withRaftGroupLocked(0xc422e07000, 0x1b42800, 0xc425777848, 0xadb54b, 0xc422e070e8)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:535 +0x139
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).withRaftGroup(0xc422e07000, 0xc425777848, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:552 +0x92
github.com/cockroachdb/cockroach/pkg/storage.(*Store).processRaftRequest(0xc420904e00, 0x7f07f4ae9c78, 0xc424f982d0, 0xc426c2f500, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:2966 +0xc88
github.com/cockroachdb/cockroach/pkg/storage.(*Store).processRequestQueue(0xc420904e00, 0x16)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3203 +0x180
github.com/cockroachdb/cockroach/pkg/storage.(*raftScheduler).worker(0xc4200f7ad0, 0xc4203ec5a0)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/scheduler.go:241 +0x267
github.com/cockroachdb/cockroach/pkg/storage.(*raftScheduler).Start.func2()
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/scheduler.go:181 +0x33
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker.func1(0xc4203ec5a0, 0xc4207a22a0)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:196 +0x7d
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker
        /go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:197 +0x66

This occurred about 100s after starting the block-writer on that node:

$ tail logs/supervisor.log
2016-11-24 08:06:09,595 INFO supervisord started with pid 3527
2016-11-24 08:26:19,644 INFO spawned: 'cockroach' with pid 4670
2016-11-24 08:26:22,230 INFO success: cockroach entered RUNNING state, process has stayed up for > than 2 seconds (startsecs)
2016-11-24 08:26:22,233 INFO spawned: 'goroutine_profile' with pid 4699
2016-11-24 08:26:24,652 INFO success: goroutine_profile entered RUNNING state, process has stayed up for > than 2 seconds (startsecs)
2016-11-24 08:26:24,655 INFO spawned: 'node_exporter' with pid 4721
2016-11-24 08:26:26,886 INFO success: node_exporter entered RUNNING state, process has stayed up for > than 2 seconds (startsecs)
2016-11-24 08:29:38,161 INFO spawned: 'block_writer' with pid 4921
2016-11-24 08:29:40,451 INFO success: block_writer entered RUNNING state, process has stayed up for > than 2 seconds (startsecs)
2016-11-24 08:31:27,364 INFO exited: cockroach (exit status 2; not expected)

Full stderr from this node:
cockroach.stderr.txt

All other nodes are still up and running and processing SQL requests.

I'll leave this node down for now and silence the alert. Continuous deployment has not yet been enabled.

@petermattis
Collaborator

This is similar to #9037. It's possible this was caused by #11657, though that might be wishful thinking. I'm going to try running both TestRaftRemoveRace and TestRemovePlaceholderRace under stress for a while.

@petermattis
Collaborator

Both of those tests ran under stress for 10 minutes each without exhibiting any problems.

@mberhault
Contributor Author

mberhault commented Nov 29, 2016

I just wiped cyan (testing continuous-deployment with wipe).
Process was:

  • stop all jobs
  • wipe data/logs
  • upgrade binaries to latest master sha (7b73805 in this case)
  • start all nodes
  • start block_writer on all nodes (about 1 minute later)

Node 104.46.105.52 crashed fairly quickly. Full timeline:

2016-11-29 13:08:10,088 INFO spawned: 'cockroach' with pid 14241
2016-11-29 13:08:12,352 INFO success: cockroach entered RUNNING state, process has stayed up for > than 2 seconds (startsecs)
2016-11-29 13:08:12,357 INFO spawned: 'node_exporter' with pid 14306
2016-11-29 13:08:14,960 INFO success: node_exporter entered RUNNING state, process has stayed up for > than 2 seconds (startsecs)
2016-11-29 13:08:14,965 INFO spawned: 'goroutine_profile' with pid 14312
2016-11-29 13:08:17,392 INFO success: goroutine_profile entered RUNNING state, process has stayed up for > than 2 seconds (startsecs)
2016-11-29 13:09:40,198 INFO spawned: 'block_writer' with pid 14571
2016-11-29 13:09:42,379 INFO success: block_writer entered RUNNING state, process has stayed up for > than 2 seconds (startsecs)
2016-11-29 13:14:20,037 INFO exited: cockroach (exit status 2; not expected)

Crash was:

E161129 13:14:19.896082 154 vendor/github.com/coreos/etcd/raft/log.go:191  [n5,s5,r22/67:{-}] tocommit(494999) is out of range [lastIndex(0)]. Was the raft log corrupted, truncated, or lost?
panic: tocommit(494999) is out of range [lastIndex(0)]. Was the raft log corrupted, truncated, or lost? [recovered]
        panic: tocommit(494999) is out of range [lastIndex(0)]. Was the raft log corrupted, truncated, or lost?

cockroach.stderr.txt

@mberhault
Contributor Author

The node is still down, and all continuous deployment is paused.

@petermattis
Collaborator

Can you stop the cluster?

@mberhault
Contributor Author

The cluster is stopped and all alerts for cyan are silenced for 3 days.

@petermattis
Collaborator

Here are some other log messages that reference r22:

I161129 13:14:17.977240 98389 storage/replica_raftstorage.go:621  [n5,s5,r22/?:{-}] applying preemptive snapshot at index 493960 (id=bee82043, encoded size=38677184, 35 rocksdb batches, 4201 log entries)
I161129 13:14:18.356718 98389 storage/replica_raftstorage.go:629  [n5,s5,r22/?:/Table/51{-/1/46111…}] applied preemptive snapshot in 379ms [clear=0ms batch=167ms entries=12ms commit=192ms]
...
I161129 13:14:19.286899 597 storage/store.go:3013  [n5,s5,r22/?:/Table/51{-/1/46111…}] added to replica GC queue (peer suggestion)
I161129 13:14:19.287306 597 storage/store.go:3013  [n5,s5,r22/?:/Table/51{-/1/46111…}] added to replica GC queue (peer suggestion)
...
W161129 13:14:19.879616 187 storage/replica.go:580  [n5,s5,r22/?:/Table/51{-/1/46111…}] raftMu: mutex held by github.com/cockroachdb/cockroach/pkg/storage.(*Store).RemoveReplica for 590.824619ms (>500ms):
E161129 13:14:19.896082 154 vendor/github.com/coreos/etcd/raft/log.go:191  [n5,s5,r22/67:{-}] tocommit(494999) is out of range [lastIndex(0)]. Was the raft log corrupted, truncated, or lost?
panic: tocommit(494999) is out of range [lastIndex(0)]. Was the raft log corrupted, truncated, or lost? [recovered]

The stack trace for the crash is:

goroutine 154 [running]:
panic(0x1793ee0, 0xc427c8ba90)
	/usr/local/go/src/runtime/panic.go:500 +0x1a1
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).Recover(0xc42035de60)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:185 +0x6e
panic(0x1793ee0, 0xc427c8ba90)
	/usr/local/go/src/runtime/panic.go:458 +0x243
github.com/cockroachdb/cockroach/pkg/storage.(*raftLogger).Panicf(0xc42f7debf0, 0x19d0aad, 0x5d, 0xc427d7e380, 0x2, 0x2)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/raft.go:111 +0x107
github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft.(*raftLog).commitTo(0xc428825030, 0x78d97)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft/log.go:191 +0x166
github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft.(*raft).handleHeartbeat(0xc424397590, 0x8, 0x43, 0x2, 0x6, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft/raft.go:1095 +0x54
github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft.stepFollower(0xc424397590, 0x8, 0x43, 0x2, 0x6, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft/raft.go:1041 +0x31f
github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft.(*raft).Step(0xc424397590, 0x8, 0x43, 0x2, 0x6, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft/raft.go:779 +0x11ef
github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft.(*RawNode).Step(0xc42a37b000, 0x8, 0x43, 0x2, 0x6, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft/rawnode.go:195 +0xc9
github.com/cockroachdb/cockroach/pkg/storage.(*Store).processRaftRequest.func4(0xc42a37b000, 0x0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:2969 +0x89
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).withRaftGroupLocked(0xc4256d5000, 0x1b56b00, 0xc424e81848, 0xae6aeb, 0xc4256d50e8)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:535 +0x10d
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).withRaftGroup(0xc4256d5000, 0xc424e81848, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:552 +0x92
github.com/cockroachdb/cockroach/pkg/storage.(*Store).processRaftRequest(0xc42086ae00, 0x25301a0, 0xc42869c120, 0xc42f614480, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:2970 +0xc88
github.com/cockroachdb/cockroach/pkg/storage.(*Store).processRequestQueue(0xc42086ae00, 0x16)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3207 +0x180
github.com/cockroachdb/cockroach/pkg/storage.(*raftScheduler).worker(0xc42000eb00, 0xc42035de60)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/scheduler.go:241 +0x267
github.com/cockroachdb/cockroach/pkg/storage.(*raftScheduler).Start.func2()
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/scheduler.go:181 +0x33
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker.func1(0xc42035de60, 0xc4209bf120)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:196 +0x7d
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:197 +0x66

So we added a replica to a node and then very quickly decided to remove it. Shortly after the removal, the replica received a heartbeat that we sent on through to Raft. This is surprising because we're not supposed to process Raft messages on destroyed replicas.

@petermattis
Collaborator

In addition to the Raft panic, we're seeing some sort of rebalancing thrashing going on:

~ grep -w r22 cockroach.stderr.txt | grep 'applied preemptive' | wc -l
      17
~ grep -w r22 cockroach.stderr.txt | grep 'added to replica GC queue' | wc -l
      22

So we've added a different replica for r22 to this node 17 times. The highest replica ID I saw was 64, which indicates that, in addition to this node, other nodes are experiencing the same replica thrashing for this range.

@petermattis
Collaborator

I'm not sure if I'm interpreting the goroutine output correctly, but I'm suspicious that the replica GC queue is using a different *Replica than the one which panics. The replica GC queue shows:

github.com/cockroachdb/cockroach/pkg/storage.(*Store).RemoveReplica(0xc42086ae00, 0x2530160, 0xc428a1bb60, 0xc4287ce000, 0x16, 0xc42f729930, 0x1, 0x8, 0xc42a374ac0, 0x34, ...)

The panicking goroutine shows:

github.com/cockroachdb/cockroach/pkg/storage.(*Replica).withRaftGroupLocked(0xc4256d5000, 0x1b56b00, 0xc424e81848, 0xae6aeb, 0xc4256d50e8)

I'm pretty sure 0xc4256d5000 is the address of the *Replica. But that address is not the 3rd argument to Store.RemoveReplica. If the replica GC queue is holding on to the wrong *Replica, then locking and all sorts of other fun stuff would be messed up. I'm not sure how this would happen, though, and it is still speculation.

@petermattis
Collaborator

OK, I looked at how to interpret goroutine stack output a bit more. The *Replica pointer passed to Store.RemoveReplica is 0xc4287ce000 (the first argument is the *Store and the second and third are the Context type and value), but moments later, for the same range ID, we call Replica.withRaftGroupLocked on a *Replica pointer at 0xc4256d5000. That is definitely no bueno.
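For anyone reading along, here is a standalone illustration of that reading (not CockroachDB code; it assumes the Go 1.7-era stack-based calling convention these traces come from): on a method frame, the receiver is the first word in the argument list, an interface argument such as context.Context contributes two words (type pointer, data pointer), and pointer arguments follow as single words.

package main

// Standalone illustration of how method frames read in a goroutine dump.
// With the Go 1.7-era stack-based calling convention, the frame for
// RemoveReplica prints roughly as:
//   (*Store).RemoveReplica(<*Store>, <ctx type>, <ctx data>, <*Replica>, ...)
// so the *Replica pointer is the fourth word, matching the reading above.
// (Newer Go toolchains with the register ABI print arguments differently.)

import "context"

type Replica struct{ rangeID int64 }

type Store struct{ id int64 }

func (s *Store) RemoveReplica(ctx context.Context, repl *Replica) {
	panic("dump the stack")
}

func main() {
	s := &Store{id: 6}
	r := &Replica{rangeID: 22}
	s.RemoveReplica(context.Background(), r)
}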

@mberhault
Contributor Author

All data/logs backed up on all instances:

Backed up logs/ to /home/cockroach/logs.11591
Backed up /mnt/data to /mnt/data/backup.11591

petermattis added a commit to petermattis/cockroach that referenced this issue Nov 29, 2016
@petermattis
Collaborator

Re: rebalancer thrashing. I can see rebalancer badness using: rm -fr cockroach-data && allocsim -n 6 -w 6. The number of replicas per node thrashes up and down before eventually settling down. It is possible this is different from what cyan experienced, due to the smaller replicas used by allocsim.

@mberhault
Contributor Author

cyan restarted with sha 459d9c1

@mberhault
Contributor Author

Started the stop/wipe/start cycle on cyan and let it run for ~20-30 minutes. No crashes so far.

@mberhault
Contributor Author

New crash on cyan after 3 minutes:
node 104.209.249.37

panic message:

E161129 16:26:06.990367 164 vendor/github.com/coreos/etcd/raft/log.go:191  [n3,s3,r7/56:{-},@5b6e332c73332c72372f3f3a7b2d7d5d] tocommit(250531) is out of range [lastIndex(0)]. Was the raft log corrupted, truncated, or lost?
panic: tocommit(250531) is out of range [lastIndex(0)]. Was the raft log corrupted, truncated, or lost? [recovered]
        panic: tocommit(250531) is out of range [lastIndex(0)]. Was the raft log corrupted, truncated, or lost?

full log:
cockroach.stderr.txt

@petermattis
Collaborator

Crap. My testing of %x was messed up somehow so that pointer info is not actually a pointer. Sending a fix.
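For reference, this is consistent with the @5b6e332c… tag in the panic above: those hex bytes decode to the ASCII of a replica log tag, which is what you get when %x is applied to a value that implements fmt.Stringer, since %x is a string verb and hex-encodes the String() output instead of printing an address. A standalone illustration (replicaTag is made up for the example):

package main

import "fmt"

// replicaTag stands in for a value with a String method, like the replica's
// log tag; it is made up for this illustration.
type replicaTag struct{}

func (replicaTag) String() string { return "[n3,s3,r7/?:{-}]" }

func main() {
	var v replicaTag
	// %x is a string verb: it invokes String() and hex-encodes the result,
	// printing @5b6e332c73332c72372f3f3a7b2d7d5d rather than an address.
	fmt.Printf("@%x\n", v)
	// %p on a pointer is what prints an actual address.
	fmt.Printf("@%p\n", &v)
}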

@petermattis
Collaborator

The goroutine stacks once again indicate that the *Replica passed to Store.RemoveReplica was different from the one passed to Replica.withRaftGroupLocked for the same range ID. I'm presuming the extra debug info will show this definitively. For now I'm proceeding as if that has been determined.

@mberhault
Contributor Author

That was quick: 104.46.103.35 died in 2 minutes.

E161129 18:02:09.640159 155 vendor/github.com/coreos/etcd/raft/log.go:191  [n6,s6,r22/46:{-},@c42302d800] tocommit(173626) is out of range [lastIndex(0)]. Was the raft log corrupted, truncated, or lost?
panic: tocommit(173626) is out of range [lastIndex(0)]. Was the raft log corrupted, truncated, or lost? [recovered]
        panic: tocommit(173626) is out of range [lastIndex(0)]. Was the raft log corrupted, truncated, or lost?

Full log:
cockroach.stderr.txt

@petermattis
Collaborator

I161129 18:02:07.064133 29975 storage/replica_raftstorage.go:621  [n6,s6,r22/?:{-},@c42302dc00] applying preemptive snapshot at index 172791 (id=8dea7adb, encoded size=27078372, 24 rocksdb batches, 4188 log entries)
I161129 18:02:07.315158 29975 storage/replica_raftstorage.go:629  [n6,s6,r22/?:/{Table/51-Max},@c42302dc00] applied preemptive snapshot in 251ms [clear=0ms batch=82ms entries=33ms commit=135ms]
...
I161129 18:02:09.309849 359 storage/store.go:3015  [n6,s6,r22/?:/{Table/51-Max},@c42302dc00] added to replica GC queue (peer suggestion)
I161129 18:02:09.309948 359 storage/store.go:3015  [n6,s6,r22/?:/{Table/51-Max},@c42302dc00] added to replica GC queue (peer suggestion)
I161129 18:02:09.321922 176 storage/store.go:1991  [n6,replicaGC,s6,r22/?:/{Table/51-Max},@c42302dc00] removing replica
...
E161129 18:02:09.640159 155 vendor/github.com/coreos/etcd/raft/log.go:191  [n6,s6,r22/46:{-},@c42302d800] tocommit(173626) is out of range [lastIndex(0)]. Was the raft log corrupted, truncated, or lost?

This was the first reference to c42302d800 in the log file. I think my earlier theory/concern that we had 2 simultaneous *Replicas for the same range ID was incorrect. My theory now is that the MsgHeartbeat that is being processed is causing the *Replica to be recreated. But why is that MsgHeartbeat causing a problem?
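For context on why the heartbeat is fatal: a MsgHeartbeat carries the leader's commit index, and etcd raft refuses to advance a follower's commit past its own lastIndex. On a just-recreated replica the log is empty (lastIndex 0), so any non-zero commit from the leader trips exactly the panic seen above. A minimal sketch of that guard (simplified from the vendored raft/log.go, not the verbatim code):

package main

import "fmt"

// raftLogState is a simplified stand-in for etcd raft's raftLog.
type raftLogState struct {
	committed uint64
	lastIndex uint64 // 0 on a freshly created, empty replica
}

// commitTo mirrors the guard in raftLog.commitTo from the stack traces above:
// a follower never lets its commit index move past its own lastIndex, and
// panics if asked to, on the assumption that its log was corrupted,
// truncated, or lost.
func (l *raftLogState) commitTo(tocommit uint64) {
	if l.committed < tocommit {
		if l.lastIndex < tocommit {
			panic(fmt.Sprintf("tocommit(%d) is out of range [lastIndex(%d)]. "+
				"Was the raft log corrupted, truncated, or lost?", tocommit, l.lastIndex))
		}
		l.committed = tocommit
	}
}

func main() {
	// A recreated, empty replica receiving the leader's heartbeat: panic.
	l := &raftLogState{committed: 0, lastIndex: 0}
	l.commitTo(173626)
}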

@petermattis
Collaborator

Rebalancer thrashing:

I161129 18:02:07.602474 182290 storage/replica.go:2192  [n2,s2,r22/3:/{Table/51-Max},@c4204c8c00] proposing ADD_REPLICA {NodeID:6 StoreID:6 ReplicaID:46}: [{NodeID:1 StoreID:1 ReplicaID:45} {NodeID:5 StoreID:5 ReplicaID:29} {NodeID:2 StoreID:2 ReplicaID:3} {NodeID:6 StoreID:6 ReplicaID:46}]
...
I161129 18:02:09.152941 183266 storage/replica.go:2192  [n2,s2,r22/3:/{Table/51-Max},@c4204c8c00] proposing REMOVE_REPLICA {NodeID:6 StoreID:6 ReplicaID:46}: [{NodeID:1 StoreID:1 ReplicaID:45} {NodeID:5 StoreID:5 ReplicaID:29} {NodeID:2 StoreID:2 ReplicaID:3}]

The above logs are from the leader of r22. We added the replica and then 2 seconds later we removed it.

@petermattis
Collaborator

Here's the timeline I'm seeing:

  1. n2: Add new replica (ID 46) to r22.
  2. n2: Send preemptive snapshot.
  3. n6: Preemptive snapshot applied for replica ID 46. RangeDescriptor.NextReplicaID will still be 46 at this point because NextReplicaID is not incremented until the ChangeReplicas call.
  4. n2: Add replica ID 46.
  5. n2: Remove replica ID 46.
  6. n6: Add r22 to replica GC queue due to peer notification.
  7. n6: GC replica 46 for r22. Note that the tombstone will be for replica ID 46, not 47, because this node hasn't received the add replica notification yet.
  8. n6: Receive a MsgHeartbeat for r22 for replica ID 46. Store.tryGetOrCreateReplica will recreate the *Replica because the message replica ID (46) is not less than the tombstone replica ID (46).

An indication that this is happening is the removing replica log message, which shows that the replica being removed doesn't have a replica ID yet (r22/?).
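A simplified model of the comparison in step 8 (not the actual Store.tryGetOrCreateReplica code): because the tombstone recorded the stale replica ID 46 rather than 47, the heartbeat addressed to replica 46 passes the check and recreates the replica with an empty log.

package main

import "fmt"

// mayCreateReplica models the rule described in step 8: an incoming Raft
// message may (re)create a replica only if its replica ID is not less than
// the tombstone's next-replica-ID. This is a simplified sketch, not the real
// storage code.
func mayCreateReplica(msgReplicaID, tombstoneNextReplicaID int) bool {
	return msgReplicaID >= tombstoneNextReplicaID
}

func main() {
	const msgReplicaID = 46

	// Buggy path: the GC'd replica only knew its own, stale ID, so the
	// tombstone records 46 and the heartbeat for replica 46 revives it.
	fmt.Println(mayCreateReplica(msgReplicaID, 46)) // true: replica recreated, panic follows

	// Intended path: a tombstone derived from the consistent range descriptor
	// lookup would record 47, and the stale heartbeat would be dropped.
	fmt.Println(mayCreateReplica(msgReplicaID, 47)) // false: message dropped
}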

Cc @bdarnell

@tamird
Contributor

tamird commented Nov 29, 2016 via email

@petermattis
Collaborator

Don't we do a consistent lookup of the range descriptor in the GC queue?

We do, but we don't pass that down when we destroy the replica.

@mberhault
Contributor Author

I'll do the stop/wipe/start cycle on cyan once this is in.

@tamird
Contributor

tamird commented Nov 30, 2016

Writing a test for this is proving more difficult than expected.

In order for this panic to trigger, the heartbeat has to address a non-zero Commit on the "revived" replica, which means that replica needs to have processed some raft traffic.

On the other hand, in order for the replica to be "revivable", its range descriptor must be stale at the time of its removal.

These goals seem at odds with each other. @petermattis @bdarnell ?

@petermattis
Collaborator

Can the test manually construct a problematic heartbeat message and get a local store into a bad state where it is removed but with the wrong tombstone key set?
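For the first half of that, a minimal sketch of what such a heartbeat looks like using the vendored raftpb types (the replica IDs, term, and commit index are illustrative, and in a real test the message would still need to be wrapped in the store's Raft transport envelope):

package main

import (
	"fmt"

	"github.com/coreos/etcd/raft/raftpb" // import path as vendored at the time of this issue
)

func main() {
	// A heartbeat that is "problematic" for a freshly recreated, empty
	// replica: the leader's commit index is far ahead of lastIndex(0).
	msg := raftpb.Message{
		Type:   raftpb.MsgHeartbeat,
		To:     46, // the replica ID that was just GC'd and then revived
		From:   3,  // the leader's replica ID
		Term:   6,
		Commit: 173626,
	}
	fmt.Printf("%+v\n", msg)
}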

@tamird
Contributor

tamird commented Nov 30, 2016

Manually constructing the heartbeat is not a problem.

How do I create a non-empty raft log while keeping the descriptor stale? I'd rather not write a broken tombstone manually because then we're not testing anything (and the test will still fail after the fix).

@petermattis
Collaborator

How do I create a non-empty raft log while keeping the descriptor stale?

I'm not sure.

I'd rather not write a broken tombstone manually because then we're not testing anything (and the test will still fail after the fix).

Yes, definitely don't do that.

@tamird
Contributor

tamird commented Nov 30, 2016

I tried blocking the application of the additive ChangeReplicas in hopes of producing this, but it seems that raftMu protects against removal during log replay.

I'm at the end of my rope =/

@tamird
Contributor

tamird commented Nov 30, 2016

@bdarnell
Contributor

bdarnell commented Dec 2, 2016

How do I create a non-empty raft log while keeping the descriptor stale?

This happens when something else gets committed to the range's raft log concurrently with the ChangeReplicas, so it appears in the log after the preemptive snapshot is started but before the ChangeReplicas transaction commits. The block writer puts enough data in the raft log that (with the new smaller message sizes since #10929) the post-add-replica log replay takes multiple messages, and that's what gives you a window in which the descriptor is stale but we have processed some of the log.

In a test, you could tune the constants in #10929 so low that you don't need anything else happening concurrently; the first part of the ChangeReplicas(add) transaction would go into its own message. Then you could mess with the raft transport to allow that message through but drop any MsgApp containing an EndTransaction. Then you can remove the replica (while still blocking EndTransactions at the raft level), trigger the replica GC, and send the raft message that would revive it.
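A rough sketch of the message filter described above (the hook into the Raft transport and the entry decoding are omitted; entriesContainEndTxn is a hypothetical helper, since the entries carry encoded proposals):

package main

import (
	"fmt"

	"github.com/coreos/etcd/raft/raftpb" // import path as vendored at the time of this issue
)

// entriesContainEndTxn is a hypothetical helper: decoding each entry's data
// into the proposed batch and checking it for an EndTransaction request
// depends on CockroachDB's Raft proposal encoding and is omitted here.
func entriesContainEndTxn(entries []raftpb.Entry) bool {
	for _, ent := range entries {
		_ = ent.Data // decode and inspect in a real test
	}
	return false
}

// dropMsg implements the filtering rule sketched above: let everything
// through except MsgApp messages whose entries would commit the
// ChangeReplicas transaction (i.e. contain an EndTransaction).
func dropMsg(msg raftpb.Message) bool {
	return msg.Type == raftpb.MsgApp && entriesContainEndTxn(msg.Entries)
}

func main() {
	fmt.Println(dropMsg(raftpb.Message{Type: raftpb.MsgHeartbeat})) // false: heartbeats pass through
}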

@tamird
Contributor

tamird commented Dec 3, 2016 via email

@a-robinson
Contributor

For reference, this appears to have been more than just a tombstone issue, as tracked down in #14193 / #14231.

petermattis added a commit to petermattis/cockroach that referenced this issue Apr 20, 2017
The replica pointer tag was used for debugging cockroachdb#11591, which has been
long since fixed.
@benesch
Contributor

benesch commented Jul 28, 2018

Thanks for the advice! I got this working, will send a PR shortly.

For posterity, this test was added in #11798.
