stability: resurrecting registration cluster #6991

Closed · mberhault opened this issue Jun 1, 2016 · 48 comments

Labels: S-1-stability (Severe stability issues that can be fixed by upgrading, but usually don't resolve by restarting)
Milestone: Q3

@mberhault (Contributor)

@bdarnell: I'll be keeping track of actions and results here.

Quick summary:
The registration cluster is falling over repeatedly due to large snapshot sizes. Specifically, recipients of range 1 snapshots OOM during applySnapshot.

For example, on node 2 (ec2-52-91-3-164.compute-1.amazonaws.com):

I160601 16:24:14.957105 storage/replica_raftstorage.go:610  received snapshot for range 1 at index 6818966. encoded size=1204695315, 14475 KV pairs, 384994 log entries
...
I160601 16:24:25.479455 /go/src/google.golang.org/grpc/clientconn.go:499  grpc: Conn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 172.31.14.204:26257: getsockopt: connection refused"; Reconnecting to "ip-172-31-14-204:26257"
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
SIGABRT: abort
PC=0x7f78fae4acc9 m=7
signal arrived during cgo execution

goroutine 86 [syscall, locked to thread]:
runtime.cgocall(0x11b2d80, 0xc822780308, 0x7f7800000000)
        /usr/local/go/src/runtime/cgocall.go:123 +0x11b fp=0xc8227802b0 sp=0xc822780280
github.com/cockroachdb/cockroach/storage/engine._Cfunc_DBApplyBatchRepr(0x7f78ee825b90, 0xc917dce000, 0x208f, 0x0, 0x0)
        ??:0 +0x53 fp=0xc822780308 sp=0xc8227802b0
github.com/cockroachdb/cockroach/storage/engine.dbApplyBatchRepr(0x7f78ee825b90, 0xc917dce000, 0x208f, 0x4000000, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/storage/engine/rocksdb.go:990 +0x138 fp=0xc8227803a0 sp=0xc822780308
github.com/cockroachdb/cockroach/storage/engine.(*rocksDBBatch).ApplyBatchRepr(0xc822348000, 0xc917dce000, 0x208f, 0x4000000, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/storage/engine/rocksdb.go:578 +0x4f fp=0xc8227803d8 sp=0xc8227803a0
github.com/cockroachdb/cockroach/storage/engine.(*rocksDBBatch).flushMutations(0xc822348000)
        /go/src/github.com/cockroachdb/cockroach/storage/engine/rocksdb.go:672 +0x146 fp=0xc822780468 sp=0xc8227803d8
github.com/cockroachdb/cockroach/storage/engine.(*rocksDBBatchIterator).Seek(0xc822348030, 0xc934324b80, 0x10, 0x20, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/storage/engine/rocksdb.go:498 +0x25 fp=0xc8227804a0 sp=0xc822780468
github.com/cockroachdb/cockroach/storage/engine.mvccGetMetadata(0x7f78fbaca080, 0xc822348030, 0xc934324b80, 0x10, 0x20, 0x0, 0x0, 0xc938fc6000, 0xc938fc6000, 0x11, ...)
        /go/src/github.com/cockroachdb/cockroach/storage/engine/mvcc.go:684 +0xcf fp=0xc8227805d8 sp=0xc8227804a0
github.com/cockroachdb/cockroach/storage/engine.mvccPutInternal(0x7f78fba528b0, 0xc82000ef48, 0x7f78fbaca120, 0xc822348000, 0x7f78fbaca080, 0xc822348030, 0x0, 0xc934324b80, 0x10, 0x20, ...)
        /go/src/github.com/cockroachdb/cockroach/storage/engine/mvcc.go:1020 +0x1f9 fp=0xc822780b20 sp=0xc8227805d8
github.com/cockroachdb/cockroach/storage/engine.mvccPutUsingIter(0x7f78fba528b0, 0xc82000ef48, 0x7f78fbaca120, 0xc822348000, 0x7f78fbaca080, 0xc822348030, 0x0, 0xc934324b80, 0x10, 0x20, ...)
        /go/src/github.com/cockroachdb/cockroach/storage/engine/mvcc.go:988 +0x1bb fp=0xc822780bf8 sp=0xc822780b20
github.com/cockroachdb/cockroach/storage/engine.MVCCPut(0x7f78fba528b0, 0xc82000ef48, 0x7f78fbac9f98, 0xc822348000, 0x0, 0xc934324b80, 0x10, 0x20, 0x0, 0xc800000000, ...)
        /go/src/github.com/cockroachdb/cockroach/storage/engine/mvcc.go:926 +0x1bb fp=0xc822780cc8 sp=0xc822780bf8
github.com/cockroachdb/cockroach/storage/engine.MVCCPutProto(0x7f78fba528b0, 0xc82000ef48, 0x7f78fbac9f98, 0xc822348000, 0x0, 0xc934324b80, 0x10, 0x20, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/storage/engine/mvcc.go:549 +0x1c5 fp=0xc822780d98 sp=0xc822780cc8
github.com/cockroachdb/cockroach/storage.(*Replica).append(0xc8237fd9e0, 0x7f78fbac9f98, 0xc822348000, 0x0, 0xc8f810e000, 0x5dfe2, 0x5dfe2, 0x1626, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/storage/replica_raftstorage.go:547 +0x20f fp=0xc822780ed8 sp=0xc822780d98
github.com/cockroachdb/cockroach/storage.(*Replica).applySnapshot(0xc8237fd9e0, 0x7f78fbac9f08, 0xc822348000, 0xc88f6b6000, 0x47ce3113, 0x47ce4000, 0xc821b06cc0, 0x4, 0x4, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/storage/replica_raftstorage.go:670 +0xc06 fp=0xc8227812f8 sp=0xc822780ed8
github.com/cockroachdb/cockroach/storage.(*Replica).handleRaftReady(0xc8237fd9e0, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/storage/replica.go:1425 +0x29e fp=0xc822781bd0 sp=0xc8227812f8
github.com/cockroachdb/cockroach/storage.(*Store).processRaft.func1()
        /go/src/github.com/cockroachdb/cockroach/storage/store.go:2055 +0x35d fp=0xc822781f60 sp=0xc822781bd0
github.com/cockroachdb/cockroach/util/stop.(*Stopper).RunWorker.func1(0xc8202efce0, 0xc82206b240)
        /go/src/github.com/cockroachdb/cockroach/util/stop/stopper.go:139 +0x52 fp=0xc822781f80 sp=0xc822781f60
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:1998 +0x1 fp=0xc822781f88 sp=0xc822781f80
created by github.com/cockroachdb/cockroach/util/stop.(*Stopper).RunWorker
        /go/src/github.com/cockroachdb/cockroach/util/stop/stopper.go:140 +0x62

There is no corresponding "applied snapshot for range 1" message, and the stack trace does list an applySnapshot frame. I can't confirm from the trace that it is for that range (the range ID is not one of the simple arguments), but it most likely is. A similar pattern appeared multiple times.

I will perform the following to try to resurrect the cluster:

  • stop all nodes
  • back up all RocksDB data
  • change the default zone config to have only two replicas

@mberhault (Contributor Author)

data backed up on each node to: /mnt/data/backup.6991

@mberhault (Contributor Author)

the output of cockroach debug raft-log /mnt/data 1 is available at:
[email protected]:raftlog.1

@mberhault (Contributor Author)

Blast. We don't seem to be getting into a stable enough state to actually apply the zone config change.

@mberhault (Contributor Author)

OK, I added swap on each machine and the snapshot for range 1 went through. SQL is usable again (including zone commands).

@bdarnell (Contributor) commented Jun 3, 2016

Looks like you picked the wrong node to run debug raft-log on. The log is tiny on that node, but huge on two of the others. The tiny log runs from position 5706034 to 5706784; the others have logs starting at 6433973. So this is a case of a range being removed from a node, not being GC'd, and then being re-added to that node later.

Did that node have an extended period of downtime prior to this? I think this is just a case of the raft logs growing without bound while a node is down and there is no healthy node to repair onto.
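
As a rough sketch of that failure mode (my own simplified model, not CockroachDB's actual raft_log_queue code; the function below is hypothetical): truncation can only advance to the lowest index that every replica has acknowledged, so a down replica pins the log and it grows without bound.

```go
package main

import "fmt"

// truncationIndex returns the highest log index that is safe to truncate up
// to, given each replica's last-acknowledged ("match") index. A replica that
// is down stops acknowledging, so its stale match index caps truncation.
func truncationIndex(matchIndexes []uint64) uint64 {
	if len(matchIndexes) == 0 {
		return 0
	}
	min := matchIndexes[0]
	for _, m := range matchIndexes[1:] {
		if m < min {
			min = m
		}
	}
	return min
}

func main() {
	// Roughly the situation above: one replica stuck near 5706784 while the
	// others are past 6433973 keeps hundreds of thousands of entries alive.
	fmt.Println(truncationIndex([]uint64{6433973, 6433973, 5706784}))
}
```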

@tbg (Member) commented Jul 1, 2016

One node in the registration cluster died:

W160625 05:51:47.105941 storage/raft_log_queue.go:116  storage/raft_log_queue.go:101: raft log's oldest index (0) is less than the first index (25269) for range 803
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
SIGABRT: abort
PC=0x7f0df6810cc9 m=7
signal arrived during cgo execution

goroutine 57 [syscall, locked to thread]:
runtime.cgocall(0x11b2d80, 0xc84c00c308, 0x7f0d00000000)
        /usr/local/go/src/runtime/cgocall.go:123 +0x11b fp=0xc84c00c2b0 sp=0xc84c00c280
github.com/cockroachdb/cockroach/storage/engine._Cfunc_DBApplyBatchRepr(0x7f0de30dc570, 0xc8884f8000, 0x43f, 0x0, 0x0)
        ??:0 +0x53 fp=0xc84c00c308 sp=0xc84c00c2b0

@tbg (Member) commented Jul 1, 2016

Last runtime stats were

I160625 05:51:40.005810 server/status/runtime.go:160  runtime stats: 6.8 GiB RSS, 344 goroutines, 3.3 GiB active, 66315.43cgo/sec, 0.75/0.24 %(u/s)time, 0.00 %gc (0x)

I think the machines have ~7 GB of RAM, so that points to memory exhaustion here. Some of the other nodes are similarly high:

ubuntu@ip-172-31-8-73:~$ free -m -h
             total       used       free     shared    buffers     cached
Mem:          7.3G       6.9G       450M        32K        57M       535M

with cockroach reporting just shy of 7 GB RSS.

The cluster should still work with a node down, but it clearly doesn't. The UI isn't accessible from the outside, so poking that way is a bit awkward. In any case, some Raft logs are pretty long (I tried range 1, which didn't exist, and then range 2 gave the following):

ubuntu@ip-172-31-3-145:~$ sudo ./cockroach debug raft-log /mnt/data/ 2 | grep Index: | wc -l
11024

@tbg (Member) commented Jul 1, 2016

The version running is Date: Mon May 30 14:59:10 2016 -0400. I think it's missed out on a lot of recent goodness.

@tbg (Member) commented Jul 1, 2016

Some more random tidbits from one of the nodes:

ubuntu@ip-172-31-8-73:~$ curl -k https://localhost:8080/_status/ranges/local | grep raft_state | sort | uniq -c
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 3844k    0 3844k    0     0  6541k      0 --:--:-- --:--:-- --:--:-- 6538k
    756       "raft_state": "StateCandidate"
    520       "raft_state": "StateFollower"
   1443       "raft_state": "StateLeader"
ubuntu@ip-172-31-8-73:~$ curl -k https://localhost:8080/debug/stopper
0xc8203472d0: 380 tasks
6      storage/replica.go:602
367    server/node.go:803 // <- oscillates +-200
3      storage/queue.go:383
2      storage/intent_resolver.go:302
1      ts/db.go:100

Rudimentary elinks-based poking at the debug/requests endpoint shows... well, just a lot of NotLeaderErrors (without a new lease holder hint). We need a way to access the admin port to make this debugging less painful.

In light of all the bugs we've fixed since May 30, I also think we should update that version ASAP. I'm not sure what our protocol is w.r.t. this cluster - can I simply do that?

@petermattis (Collaborator)

I don't know what our protocol is either, but I would lean toward yes.

@tbg (Member) commented Jul 1, 2016

Ok. I'll pull a backup off the dataset and run last night's beta.

@tbg (Member) commented Jul 1, 2016

One node died a few minutes in with OOM, presumably due to snapshotting.

I160701 14:45:05.143248 storage/replica_raftstorage.go:524  generated snapshot for range 403 at index 3533112 in 31.747833551s. encoded size=1072891308, 6966 KV pairs, 1677671 log entries
fatal error: runtime: out of memory

@petermattis (Collaborator)

That raft log is ginormous. Why do we send the full raft log on snapshots?

@tbg (Member) commented Jul 1, 2016

I think this is probably a huge Raft log that was created prior to our truncation improvements, but which was picked up by the replication queue before the truncation queue. Maybe we should put a failsafe into snapshot creation (so that any snapshot which exceeds a certain size isn't even fully created)?

@petermattis (Collaborator)

Seems easier to only snapshot the necessary tail of the raft log. For a snapshot, I think we only have to send the portion of the raft log past the applied index, which should be very small. Ah, strike that. Now I recall that raft log truncation is itself a raft operation.

OK, I think putting in a failsafe to avoid creating excessively large snapshots is reasonable. I'll file an issue.

@tbg (Member) commented Jul 1, 2016

It died again at the same range. I think that failsafe is worthwhile - it would give the truncation queue a chance to pick the range up first. The failsafe could even aggressively queue the truncation.
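
To make the proposed failsafe concrete, here is a minimal sketch (my own illustration, not CockroachDB's actual snapshot code; the names and the 64 MiB cap are hypothetical): give up on building the snapshot once the encoded size crosses a cap, so the caller can queue a Raft log truncation instead of materializing a multi-GB snapshot.

```go
package main

import (
	"errors"
	"fmt"
)

const maxSnapshotBytes = 64 << 20 // hypothetical 64 MiB cap

var errSnapshotTooLarge = errors.New("snapshot too large; truncate raft log first")

type kvPair struct{ key, value []byte }

// buildSnapshot accumulates encoded pairs but gives up early once the running
// size exceeds maxSnapshotBytes, instead of building the whole snapshot.
func buildSnapshot(pairs []kvPair) ([]kvPair, error) {
	var size int
	out := make([]kvPair, 0, len(pairs))
	for _, p := range pairs {
		size += len(p.key) + len(p.value)
		if size > maxSnapshotBytes {
			return nil, errSnapshotTooLarge
		}
		out = append(out, p)
	}
	return out, nil
}

func main() {
	huge := make([]byte, 128<<20) // simulate an oversized payload (e.g. a huge raft log)
	if _, err := buildSnapshot([]kvPair{{key: []byte("k"), value: huge}}); err != nil {
		fmt.Println(err) // snapshot too large; truncate raft log first
	}
}
```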

@tbg (Member) commented Jul 1, 2016

I'm a bit out of ideas as to how to proceed right now. In an ideal world, I could restart the cluster with upreplication turned off, and wait for the truncation queue to do its job.
Instead, I'm periodically running for host in $(cat ~/production/hosts/register); do ssh $host supervisorctl -c supervisord.conf start cockroach; done in the hope that the truncation queue will at some point manage to get there first.

@tamird (Contributor) commented Jul 1, 2016

Is this happening on just one node? Are all ranges fully replicated to other nodes? Can you simply nuke this one node?

@tbg (Member) commented Jul 1, 2016

There's very little visibility since I can't access the admin UI from outside. Does anyone have experience setting up an ssh tunnel/proxy?

If one node tries to send that snapshot, chances are the situation is the same on the other nodes, or the range is underreplicated. In either case, nuking the first node won't help. I also think I saw two nodes die already.

@petermattis (Collaborator)

If you're running insecure you can do: ssh -L 8080:localhost:8080 <some-machine>. I did this earlier today without difficulty.

@tbg (Member) commented Jul 1, 2016

It's a secure cluster. I'll give it a try though.

@petermattis (Collaborator)

Should still work with a secure cluster.

@tbg (Member) commented Jul 1, 2016

It simply works, great. Thanks @petermattis. Would you mind running for host in $(cat ~/production/hosts/register); do ssh $host supervisorctl -c supervisord.conf start cockroach; done in a busy loop? Can't hurt and I'm about to go on the train.

@petermattis (Collaborator)

Where is this ~/production/hosts/register file?

@tbg (Member) commented Jul 1, 2016

It's my local clone of our non-public production repo.

@petermattis (Collaborator)

Got it.

@tbg (Member) commented Jul 1, 2016

I realized that I hadn't actually managed to run the updated version because supervisord needed to reload the config. I did that now, but the cluster is even more unhappy than before - the first range isn't being gossiped.

@tbg (Member) commented Jul 1, 2016

Restarted one of the nodes. Magically that seems to have brought the first range back in the game. Snapshot sending time.

@tbg (Member) commented Jul 1, 2016

Still in critical state, though. Had to restart one of the nodes again (to resuscitate first range gossip).

Sometimes things are relatively quiet, then large swaths of

W160701 17:58:28.362804 raft/raft.go:593  [group 1967] 4 stepped down to follower since quorum is not active
W160701 17:58:29.131940 raft/raft.go:593  [group 1178] 4 stepped down to follower since quorum is not active
W160701 17:58:29.134338 raft/raft.go:593  [group 533] 4 stepped down to follower since quorum is not active
W160701 17:58:29.272449 raft/raft.go:593  [group 3162] 4 stepped down to follower since quorum is not active
W160701 17:58:29.274708 raft/raft.go:593  [group 711] 4 stepped down to follower since quorum is not active
W160701 17:58:29.276705 raft/raft.go:593  [group 1510] 7 stepped down to follower since quorum is not active
W160701 17:58:29.281201 raft/raft.go:593  [group 4096] 4 stepped down to follower since quorum is not active
W160701 17:58:29.281295 raft/raft.go:593  [group 1169] 4 stepped down to follower since quorum is not active
W160701 17:58:29.286367 raft/raft.go:593  [group 593] 4 stepped down to follower since quorum is not active
W160701 17:58:29.377702 raft/raft.go:593  [group 773] 4 stepped down to follower since quorum is not active
W160701 17:58:29.380577 raft/raft.go:593  [group 4637] 4 stepped down to follower since quorum is not active
W160701 17:58:29.471898 raft/raft.go:593  [group 634] 4 stepped down to follower since quorum is not active
W160701 17:58:29.567202 raft/raft.go:593  [group 1375] 7 stepped down to follower since quorum is not active
W160701 17:58:30.137139 raft/raft.go:593  [group 3516] 4 stepped down to follower since quorum is not active

I think those might have to do with us reporting "unreachable" to Raft every time the outgoing message queue is full (cc @tamird). Too bad I'm not running with the per-replica outboxes yet.
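
For illustration, a minimal sketch of the mechanism I mean (an assumption about how the transport behaves, not the actual code; the types below are made up): a bounded outgoing queue whose overflow makes the sender treat the peer as unreachable, which is exactly the kind of signal that can produce the quorum-not-active step-downs above.

```go
package main

import "fmt"

// raftMessage is a stand-in for an outgoing Raft message.
type raftMessage struct {
	To      uint64
	RangeID int64
}

// outbox models a bounded outgoing message queue.
type outbox struct {
	ch chan raftMessage
}

// send enqueues msg if there is room; otherwise it reports false, and the
// caller would then tell Raft the peer is unreachable (e.g. via ReportUnreachable),
// even though the peer may be perfectly healthy.
func (o *outbox) send(msg raftMessage) bool {
	select {
	case o.ch <- msg:
		return true
	default:
		return false // queue full: peer gets reported as unreachable
	}
}

func main() {
	o := &outbox{ch: make(chan raftMessage, 1)}
	fmt.Println(o.send(raftMessage{To: 4, RangeID: 1967})) // true: enqueued
	fmt.Println(o.send(raftMessage{To: 4, RangeID: 1178})) // false: queue full
}
```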

tbg added a commit to tbg/cockroach that referenced this issue Jul 6, 2016
As discovered in
cockroachdb#6991 (comment),
it's possible that we apply a Raft snapshot without writing a corresponding
HardState since we write the snapshot in its own batch first and only then
write a HardState.

If that happens, the server is going to panic on restart: It will have a
nontrivial first index, but a committed index of zero (from the empty
HardState).

This change prevents us from applying a snapshot when there is no HardState
supplied along with it, except when applying a preemptive snapshot (in which
case we synthesize a HardState).
tbg added a commit to tbg/cockroach that referenced this issue Jul 6, 2016
See cockroachdb#6991. It's possible that the HardState is missing after a snapshot was
applied (so there is a TruncatedState). In this case, synthesize a HardState
(simply setting everything that was in the snapshot to committed). Having lost
the original HardState can theoretically mean that the replica was further
ahead or had voted, and so there's no guarantee that this will be correct. But
it will be correct in the majority of cases, and some state *has* to be
recovered.

To illustrate this in the scenario in cockroachdb#6991: There, we (presumably) have
applied an empty snapshot (no real data, but a Raft log which starts and
ends at index ten as designated by its TruncatedState). We don't have a
HardState, so Raft will crash because its Commit index zero isn't in line
with the fact that our Raft log starts only at index ten.

The migration sees that there is a TruncatedState, but no HardState. It will
synthesize a HardState with Commit:10 (and the corresponding Term from the
TruncatedState, which is five).
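
As an illustration of the two commits above (a sketch using simplified stand-in types, not the actual storage code), the recovery boils down to: refuse a non-preemptive snapshot that arrives without a HardState, and otherwise synthesize a HardState whose Commit and Term come from the TruncatedState, as in the index-ten/term-five example:

```go
package main

import (
	"errors"
	"fmt"
)

// Simplified stand-ins for raftpb.HardState and the range's TruncatedState.
type hardState struct {
	Term   uint64
	Vote   uint64
	Commit uint64
}

type truncatedState struct {
	Index uint64 // last index covered by the snapshot
	Term  uint64 // term of that index
}

// synthesizeHardState marks everything covered by the snapshot as committed.
// Vote is unknown and left zero, which is why this recovery is only correct
// in the common case.
func synthesizeHardState(ts truncatedState) hardState {
	return hardState{Commit: ts.Index, Term: ts.Term}
}

// applySnapshotHardState models the guard: refuse a non-preemptive snapshot
// that arrives without a HardState; synthesize one for preemptive snapshots.
func applySnapshotHardState(hs *hardState, preemptive bool, ts truncatedState) (hardState, error) {
	if hs != nil {
		return *hs, nil
	}
	if !preemptive {
		return hardState{}, errors.New("refusing to apply snapshot without HardState")
	}
	return synthesizeHardState(ts), nil
}

func main() {
	// The example from the commit message: log starts and ends at index 10, term 5.
	hs, _ := applySnapshotHardState(nil, true, truncatedState{Index: 10, Term: 5})
	fmt.Printf("%+v\n", hs) // {Term:5 Vote:0 Commit:10}
}
```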
tbg added a commit to tbg/cockroach that referenced this issue Jul 8, 2016
As discovered in
cockroachdb#6991 (comment),
it's possible that we apply a Raft snapshot without writing a corresponding
HardState since we write the snapshot in its own batch first and only then
write a HardState.

If that happens, the server is going to panic on restart: It will have a
nontrivial first index, but a committed index of zero (from the empty
HardState).

This change prevents us from applying a snapshot when there is no HardState
supplied along with it, except when applying a preemptive snapshot (in which
case we synthesize a HardState).

Ensure that the new HardState and Raft log does not break promises made by an
existing one during preemptive snapshot application.

Fixes cockroachdb#7619.

storage: prevent loss of uncommitted log entries
@petermattis petermattis added the S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting label Jul 11, 2016
@petermattis petermattis modified the milestone: Q3 Jul 11, 2016
@cuongdo cuongdo removed their assignment Aug 22, 2016

@tamird (Contributor) commented Sep 9, 2016

via @tschottdorf: raw data is preserved in S3, but has been dumped and imported into the new cluster.

@tamird tamird closed this as completed Sep 9, 2016