stability: resurrecting registration cluster #6991

Closed · mberhault opened this issue Jun 1, 2016 · 48 comments

Labels: S-1-stability (Severe stability issues that can be fixed by upgrading, but usually don't resolve by restarting)
Milestone: Q3

@mberhault (Contributor)

@bdarnell: I'll be keeping track of actions and results here.

Quick summary:
The registration cluster is falling over repeatedly due to large snapshot sizes. Specifically, recipients of range 1 snapshots OOM during applySnapshot.

For example, on node 2 (ec2-52-91-3-164.compute-1.amazonaws.com):

I160601 16:24:14.957105 storage/replica_raftstorage.go:610  received snapshot for range 1 at index 6818966. encoded size=1204695315, 14475 KV pairs, 384994 log entries
...
I160601 16:24:25.479455 /go/src/google.golang.org/grpc/clientconn.go:499  grpc: Conn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 172.31.14.204:26257: getsockopt: connection refused"; Reconnecting to "ip-172-31-14-204:26257"
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
SIGABRT: abort
PC=0x7f78fae4acc9 m=7
signal arrived during cgo execution

goroutine 86 [syscall, locked to thread]:
runtime.cgocall(0x11b2d80, 0xc822780308, 0x7f7800000000)
        /usr/local/go/src/runtime/cgocall.go:123 +0x11b fp=0xc8227802b0 sp=0xc822780280
github.com/cockroachdb/cockroach/storage/engine._Cfunc_DBApplyBatchRepr(0x7f78ee825b90, 0xc917dce000, 0x208f, 0x0, 0x0)
        ??:0 +0x53 fp=0xc822780308 sp=0xc8227802b0
github.com/cockroachdb/cockroach/storage/engine.dbApplyBatchRepr(0x7f78ee825b90, 0xc917dce000, 0x208f, 0x4000000, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/storage/engine/rocksdb.go:990 +0x138 fp=0xc8227803a0 sp=0xc822780308
github.com/cockroachdb/cockroach/storage/engine.(*rocksDBBatch).ApplyBatchRepr(0xc822348000, 0xc917dce000, 0x208f, 0x4000000, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/storage/engine/rocksdb.go:578 +0x4f fp=0xc8227803d8 sp=0xc8227803a0
github.com/cockroachdb/cockroach/storage/engine.(*rocksDBBatch).flushMutations(0xc822348000)
        /go/src/github.com/cockroachdb/cockroach/storage/engine/rocksdb.go:672 +0x146 fp=0xc822780468 sp=0xc8227803d8
github.com/cockroachdb/cockroach/storage/engine.(*rocksDBBatchIterator).Seek(0xc822348030, 0xc934324b80, 0x10, 0x20, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/storage/engine/rocksdb.go:498 +0x25 fp=0xc8227804a0 sp=0xc822780468
github.com/cockroachdb/cockroach/storage/engine.mvccGetMetadata(0x7f78fbaca080, 0xc822348030, 0xc934324b80, 0x10, 0x20, 0x0, 0x0, 0xc938fc6000, 0xc938fc6000, 0x11, ...)
        /go/src/github.com/cockroachdb/cockroach/storage/engine/mvcc.go:684 +0xcf fp=0xc8227805d8 sp=0xc8227804a0
github.com/cockroachdb/cockroach/storage/engine.mvccPutInternal(0x7f78fba528b0, 0xc82000ef48, 0x7f78fbaca120, 0xc822348000, 0x7f78fbaca080, 0xc822348030, 0x0, 0xc934324b80, 0x10, 0x20, ...)
        /go/src/github.com/cockroachdb/cockroach/storage/engine/mvcc.go:1020 +0x1f9 fp=0xc822780b20 sp=0xc8227805d8
github.com/cockroachdb/cockroach/storage/engine.mvccPutUsingIter(0x7f78fba528b0, 0xc82000ef48, 0x7f78fbaca120, 0xc822348000, 0x7f78fbaca080, 0xc822348030, 0x0, 0xc934324b80, 0x10, 0x20, ...)
        /go/src/github.com/cockroachdb/cockroach/storage/engine/mvcc.go:988 +0x1bb fp=0xc822780bf8 sp=0xc822780b20
github.com/cockroachdb/cockroach/storage/engine.MVCCPut(0x7f78fba528b0, 0xc82000ef48, 0x7f78fbac9f98, 0xc822348000, 0x0, 0xc934324b80, 0x10, 0x20, 0x0, 0xc800000000, ...)
        /go/src/github.com/cockroachdb/cockroach/storage/engine/mvcc.go:926 +0x1bb fp=0xc822780cc8 sp=0xc822780bf8
github.com/cockroachdb/cockroach/storage/engine.MVCCPutProto(0x7f78fba528b0, 0xc82000ef48, 0x7f78fbac9f98, 0xc822348000, 0x0, 0xc934324b80, 0x10, 0x20, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/storage/engine/mvcc.go:549 +0x1c5 fp=0xc822780d98 sp=0xc822780cc8
github.com/cockroachdb/cockroach/storage.(*Replica).append(0xc8237fd9e0, 0x7f78fbac9f98, 0xc822348000, 0x0, 0xc8f810e000, 0x5dfe2, 0x5dfe2, 0x1626, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/storage/replica_raftstorage.go:547 +0x20f fp=0xc822780ed8 sp=0xc822780d98
github.com/cockroachdb/cockroach/storage.(*Replica).applySnapshot(0xc8237fd9e0, 0x7f78fbac9f08, 0xc822348000, 0xc88f6b6000, 0x47ce3113, 0x47ce4000, 0xc821b06cc0, 0x4, 0x4, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/storage/replica_raftstorage.go:670 +0xc06 fp=0xc8227812f8 sp=0xc822780ed8
github.com/cockroachdb/cockroach/storage.(*Replica).handleRaftReady(0xc8237fd9e0, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/storage/replica.go:1425 +0x29e fp=0xc822781bd0 sp=0xc8227812f8
github.com/cockroachdb/cockroach/storage.(*Store).processRaft.func1()
        /go/src/github.com/cockroachdb/cockroach/storage/store.go:2055 +0x35d fp=0xc822781f60 sp=0xc822781bd0
github.com/cockroachdb/cockroach/util/stop.(*Stopper).RunWorker.func1(0xc8202efce0, 0xc82206b240)
        /go/src/github.com/cockroachdb/cockroach/util/stop/stopper.go:139 +0x52 fp=0xc822781f80 sp=0xc822781f60
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:1998 +0x1 fp=0xc822781f88 sp=0xc822781f80
created by github.com/cockroachdb/cockroach/util/stop.(*Stopper).RunWorker
        /go/src/github.com/cockroachdb/cockroach/util/stop/stopper.go:140 +0x62

There is no corresponding "applied snapshot for range 1" message, and the stack trace does list an applySnapshot frame. I can't confirm from the trace that it is for that range (the range ID is not one of the simple arguments), but it most likely is. A similar pattern appeared multiple times.

I will perform the following to try to resurrect the cluster:

  • stop all nodes
  • back up all RocksDB data
  • change the default zone config to have only two replicas

@mberhault (Contributor Author)

data backed up on each node to: /mnt/data/backup.6991

@mberhault (Contributor Author)

the output of cockroach debug raft-log /mnt/data 1 is available at:
[email protected]:raftlog.1

@mberhault (Contributor Author)

Blast. We don't seem to be getting into a stable enough state to actually apply the zone config change.

@mberhault (Contributor Author)

OK, I added swap on each machine and the snapshot for range 1 went through. SQL is usable again (including zone commands).

@bdarnell (Contributor) commented Jun 3, 2016

Looks like you picked the wrong node to run debug raft-log on. The log is tiny on that node, but huge on two of the others. The tiny log runs from position 5706034 to 5706784; the others have logs starting at 6433973. So this is a case of a range being removed from a node, not being GC'd, and then being re-added to that node later.

Did that node have an extended period of downtime prior to this? I think this is just a case of the raft logs growing without bound while a node is down and there is no healthy node to repair onto.
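
As a rough sketch of that failure mode (my own simplified model, not CockroachDB's actual raft_log_queue code; the function below is hypothetical): truncation can only advance to the lowest index that every replica has acknowledged, so a down replica pins the log and it grows without bound.

```go
package main

import "fmt"

// truncationIndex returns the highest log index that is safe to truncate up
// to, given each replica's last-acknowledged ("match") index. A replica that
// is down stops acknowledging, so its stale match index caps truncation.
func truncationIndex(matchIndexes []uint64) uint64 {
	if len(matchIndexes) == 0 {
		return 0
	}
	min := matchIndexes[0]
	for _, m := range matchIndexes[1:] {
		if m < min {
			min = m
		}
	}
	return min
}

func main() {
	// Roughly the situation above: one replica stuck near 5706784 while the
	// others are past 6433973 keeps hundreds of thousands of entries alive.
	fmt.Println(truncationIndex([]uint64{6433973, 6433973, 5706784}))
}
```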

@tbg (Member) commented Jul 1, 2016

One node in the registration cluster died:

W160625 05:51:47.105941 storage/raft_log_queue.go:116  storage/raft_log_queue.go:101: raft log's oldest index (0) is less than the first index (25269) for range 803
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
SIGABRT: abort
PC=0x7f0df6810cc9 m=7
signal arrived during cgo execution

goroutine 57 [syscall, locked to thread]:
runtime.cgocall(0x11b2d80, 0xc84c00c308, 0x7f0d00000000)
        /usr/local/go/src/runtime/cgocall.go:123 +0x11b fp=0xc84c00c2b0 sp=0xc84c00c280
github.com/cockroachdb/cockroach/storage/engine._Cfunc_DBApplyBatchRepr(0x7f0de30dc570, 0xc8884f8000, 0x43f, 0x0, 0x0)
        ??:0 +0x53 fp=0xc84c00c308 sp=0xc84c00c2b0

@tbg (Member) commented Jul 1, 2016

Last runtime stats were

I160625 05:51:40.005810 server/status/runtime.go:160  runtime stats: 6.8 GiB RSS, 344 goroutines, 3.3 GiB active, 66315.43cgo/sec, 0.75/0.24 %(u/s)time, 0.00 %gc (0x)

I think the machines have ~7 GB of RAM, so that points to memory exhaustion here. Some of the other nodes are similarly high:

ubuntu@ip-172-31-8-73:~$ free -m -h
             total       used       free     shared    buffers     cached
Mem:          7.3G       6.9G       450M        32K        57M       535M

with cockroach reporting just shy of 7 GB RSS.

The cluster should still work with a node down, but it clearly doesn't. The UI isn't accessible from the outside, so poking that way is a bit awkward. In any case, some Raft logs are pretty long (I tried range 1, which didn't exist, and then range 2 gave the following):

ubuntu@ip-172-31-3-145:~$ sudo ./cockroach debug raft-log /mnt/data/ 2 | grep Index: | wc -l
11024

@tbg (Member) commented Jul 1, 2016

The version running is Date: Mon May 30 14:59:10 2016 -0400. I think it's missed out on a lot of recent goodness.

@tbg (Member) commented Jul 1, 2016

Some more random tidbits from one of the nodes:

ubuntu@ip-172-31-8-73:~$ curl -k https://localhost:8080/_status/ranges/local | grep raft_state | sort | uniq -c
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 3844k    0 3844k    0     0  6541k      0 --:--:-- --:--:-- --:--:-- 6538k
    756       "raft_state": "StateCandidate"
    520       "raft_state": "StateFollower"
   1443       "raft_state": "StateLeader"
ubuntu@ip-172-31-8-73:~$ curl -k https://localhost:8080/debug/stopper
0xc8203472d0: 380 tasks
6      storage/replica.go:602
367    server/node.go:803 // <- oscillates +-200
3      storage/queue.go:383
2      storage/intent_resolver.go:302
1      ts/db.go:100

Rudimentary elinks-based poking at the debug/requests endpoint shows... well, just a lot of NotLeaderErrors (without a new lease holder hint). We need a way to access the admin port to make this debugging less painful.

In light of all the bugs we've fixed since May 30, I also think we should update that version ASAP. I'm not sure what our protocol is w.r.t. this cluster - can I simply do that?

@petermattis (Collaborator)

I don't know what our protocol is either, but I would lean toward yes.

@tbg (Member) commented Jul 1, 2016

Ok. I'll pull a backup off the dataset and run last night's beta.

@tbg (Member) commented Jul 1, 2016

One node died a few minutes in with OOM, presumably due to snapshotting.

I160701 14:45:05.143248 storage/replica_raftstorage.go:524  generated snapshot for range 403 at index 3533112 in 31.747833551s. encoded size=1072891308, 6966 KV pairs, 1677671 log entries
fatal error: runtime: out of memory

@petermattis (Collaborator)

That raft log is ginormous. Why do we send the full raft log on snapshots?

@tbg (Member) commented Jul 1, 2016

I think this is probably a huge Raft log that was created prior to our truncation improvements, but which was picked up by the replication queue before the truncation queue. Maybe we should put a failsafe into snapshot creation (so that any snapshot which exceeds a certain size isn't even fully created)?

@petermattis (Collaborator)

Seems easier to only snapshot the necessary tail of the raft log. For a snapshot, I think we only have to send the portion of the raft log past the applied index, which should be very small. Ah, strike that. Now I recall that raft log truncation is itself a raft operation.

OK, I think putting in a failsafe to avoid creating excessively large snapshots is reasonable. I'll file an issue.

@tbg (Member) commented Jul 1, 2016

It died again at the same range. I think that failsafe is worthwhile - it would give the truncation queue a chance to pick the range up first. The failsafe could even aggressively queue the truncation.
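
To make the proposed failsafe concrete, here is a minimal sketch (my own illustration, not CockroachDB's actual snapshot code; the names and the 64 MiB cap are hypothetical): give up on building the snapshot once the encoded size crosses a cap, so the caller can queue a Raft log truncation instead of materializing a multi-GB snapshot.

```go
package main

import (
	"errors"
	"fmt"
)

const maxSnapshotBytes = 64 << 20 // hypothetical 64 MiB cap

var errSnapshotTooLarge = errors.New("snapshot too large; truncate raft log first")

type kvPair struct{ key, value []byte }

// buildSnapshot accumulates encoded pairs but gives up early once the running
// size exceeds maxSnapshotBytes, instead of building the whole snapshot.
func buildSnapshot(pairs []kvPair) ([]kvPair, error) {
	var size int
	out := make([]kvPair, 0, len(pairs))
	for _, p := range pairs {
		size += len(p.key) + len(p.value)
		if size > maxSnapshotBytes {
			return nil, errSnapshotTooLarge
		}
		out = append(out, p)
	}
	return out, nil
}

func main() {
	huge := make([]byte, 128<<20) // simulate an oversized payload (e.g. a huge raft log)
	if _, err := buildSnapshot([]kvPair{{key: []byte("k"), value: huge}}); err != nil {
		fmt.Println(err) // snapshot too large; truncate raft log first
	}
}
```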

@tbg (Member) commented Jul 1, 2016

I'm a bit out of ideas as to how to proceed right now. In an ideal world, I could restart the cluster with upreplication turned off, and wait for the truncation queue to do its job.
Instead, I'm periodically running for host in $(cat ~/production/hosts/register); do ssh $host supervisorctl -c supervisord.conf start cockroach; done in the hope that the truncation queue will at some point manage to get there first.

@tamird (Contributor) commented Jul 1, 2016

Is this happening on just one node? Are all ranges fully replicated to other nodes? Can you simply nuke this one node?

@tbg (Member) commented Jul 1, 2016

There's very little visibility since I can't access the admin UI from outside. Does anyone have experience setting up an ssh tunnel/proxy?

If one node tries to send that snapshot, chances are the situation is the same on the other nodes, or the range is underreplicated. In either case, nuking the first node won't help. I also think I saw two nodes die already.

@petermattis (Collaborator)

If you're running insecure you can do: ssh -L 8080:localhost:8080 <some-machine>. I did this earlier today without difficulty.

@tbg (Member) commented Jul 1, 2016

It's a secure cluster. I'll give it a try though.

@petermattis (Collaborator)

Should still work with a secure cluster.

@tbg (Member) commented Jul 1, 2016

It simply works, great. Thanks @petermattis. Would you mind running for host in $(cat ~/production/hosts/register); do ssh $host supervisorctl -c supervisord.conf start cockroach; done in a busy loop? Can't hurt and I'm about to go on the train.

@petermattis (Collaborator)

Where is this ~/production/hosts/register file?

@tbg (Member) commented Jul 1, 2016

It's my local clone of our non-public production repo.

@petermattis (Collaborator)

Got it.

@tbg (Member) commented Jul 1, 2016

I realized that I hadn't actually managed to run the updated version because supervisord needed to reload the config. I did that now, but the cluster is even more unhappy than before - the first range isn't being gossiped.

@tbg (Member) commented Jul 1, 2016

Restarted one of the nodes. Magically that seems to have brought the first range back in the game. Snapshot sending time.

@tbg (Member) commented Jul 1, 2016

Still in critical state, though. Had to restart one of the nodes again (to resuscitate first range gossip).

Sometimes things are relatively quiet, then large swaths of

W160701 17:58:28.362804 raft/raft.go:593  [group 1967] 4 stepped down to follower since quorum is not active
W160701 17:58:29.131940 raft/raft.go:593  [group 1178] 4 stepped down to follower since quorum is not active
W160701 17:58:29.134338 raft/raft.go:593  [group 533] 4 stepped down to follower since quorum is not active
W160701 17:58:29.272449 raft/raft.go:593  [group 3162] 4 stepped down to follower since quorum is not active
W160701 17:58:29.274708 raft/raft.go:593  [group 711] 4 stepped down to follower since quorum is not active
W160701 17:58:29.276705 raft/raft.go:593  [group 1510] 7 stepped down to follower since quorum is not active
W160701 17:58:29.281201 raft/raft.go:593  [group 4096] 4 stepped down to follower since quorum is not active
W160701 17:58:29.281295 raft/raft.go:593  [group 1169] 4 stepped down to follower since quorum is not active
W160701 17:58:29.286367 raft/raft.go:593  [group 593] 4 stepped down to follower since quorum is not active
W160701 17:58:29.377702 raft/raft.go:593  [group 773] 4 stepped down to follower since quorum is not active
W160701 17:58:29.380577 raft/raft.go:593  [group 4637] 4 stepped down to follower since quorum is not active
W160701 17:58:29.471898 raft/raft.go:593  [group 634] 4 stepped down to follower since quorum is not active
W160701 17:58:29.567202 raft/raft.go:593  [group 1375] 7 stepped down to follower since quorum is not active
W160701 17:58:30.137139 raft/raft.go:593  [group 3516] 4 stepped down to follower since quorum is not active

I think those might have to do with us reporting "unreachable" to Raft every time the outgoing message queue is full (cc @tamird). Too bad I'm not running with the per-replica outboxes yet.
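
For illustration, a minimal sketch of the mechanism I mean (an assumption about how the transport behaves, not the actual code; the types below are made up): a bounded outgoing queue whose overflow makes the sender treat the peer as unreachable, which is exactly the kind of signal that can produce the quorum-not-active step-downs above.

```go
package main

import "fmt"

// raftMessage is a stand-in for an outgoing Raft message.
type raftMessage struct {
	To      uint64
	RangeID int64
}

// outbox models a bounded outgoing message queue.
type outbox struct {
	ch chan raftMessage
}

// send enqueues msg if there is room; otherwise it reports false, and the
// caller would then tell Raft the peer is unreachable (e.g. via ReportUnreachable),
// even though the peer may be perfectly healthy.
func (o *outbox) send(msg raftMessage) bool {
	select {
	case o.ch <- msg:
		return true
	default:
		return false // queue full: peer gets reported as unreachable
	}
}

func main() {
	o := &outbox{ch: make(chan raftMessage, 1)}
	fmt.Println(o.send(raftMessage{To: 4, RangeID: 1967})) // true: enqueued
	fmt.Println(o.send(raftMessage{To: 4, RangeID: 1178})) // false: queue full
}
```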

tbg added a commit to tbg/cockroach that referenced this issue Jul 6, 2016
As discovered in
cockroachdb#6991 (comment),
it's possible that we apply a Raft snapshot without writing a corresponding
HardState since we write the snapshot in its own batch first and only then
write a HardState.

If that happens, the server is going to panic on restart: It will have a
nontrivial first index, but a committed index of zero (from the empty
HardState).

This change prevents us from applying a snapshot when there is no HardState
supplied along with it, except when applying a preemptive snapshot (in which
case we synthesize a HardState).
tbg added a commit to tbg/cockroach that referenced this issue Jul 6, 2016
See cockroachdb#6991. It's possible that the HardState is missing after a snapshot was
applied (so there is a TruncatedState). In this case, synthesize a HardState
(simply setting everything that was in the snapshot to committed). Having lost
the original HardState can theoretically mean that the replica was further
ahead or had voted, and so there's no guarantee that this will be correct. But
it will be correct in the majority of cases, and some state *has* to be
recovered.

To illustrate this in the scenario in cockroachdb#6991: There, we (presumably) have
applied an empty snapshot (no real data, but a Raft log which starts and
ends at index ten as designated by its TruncatedState). We don't have a
HardState, so Raft will crash because its Commit index zero isn't in line
with the fact that our Raft log starts only at index ten.

The migration sees that there is a TruncatedState, but no HardState. It will
synthesize a HardState with Commit:10 (and the corresponding Term from the
TruncatedState, which is five).
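
As an illustration of the two commits above (a sketch using simplified stand-in types, not the actual storage code), the recovery boils down to: refuse a non-preemptive snapshot that arrives without a HardState, and otherwise synthesize a HardState whose Commit and Term come from the TruncatedState, as in the index-ten/term-five example:

```go
package main

import (
	"errors"
	"fmt"
)

// Simplified stand-ins for raftpb.HardState and the range's TruncatedState.
type hardState struct {
	Term   uint64
	Vote   uint64
	Commit uint64
}

type truncatedState struct {
	Index uint64 // last index covered by the snapshot
	Term  uint64 // term of that index
}

// synthesizeHardState marks everything covered by the snapshot as committed.
// Vote is unknown and left zero, which is why this recovery is only correct
// in the common case.
func synthesizeHardState(ts truncatedState) hardState {
	return hardState{Commit: ts.Index, Term: ts.Term}
}

// applySnapshotHardState models the guard: refuse a non-preemptive snapshot
// that arrives without a HardState; synthesize one for preemptive snapshots.
func applySnapshotHardState(hs *hardState, preemptive bool, ts truncatedState) (hardState, error) {
	if hs != nil {
		return *hs, nil
	}
	if !preemptive {
		return hardState{}, errors.New("refusing to apply snapshot without HardState")
	}
	return synthesizeHardState(ts), nil
}

func main() {
	// The example from the commit message: log starts and ends at index 10, term 5.
	hs, _ := applySnapshotHardState(nil, true, truncatedState{Index: 10, Term: 5})
	fmt.Printf("%+v\n", hs) // {Term:5 Vote:0 Commit:10}
}
```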
tbg added a commit to tbg/cockroach that referenced this issue Jul 8, 2016
As discovered in
cockroachdb#6991 (comment),
it's possible that we apply a Raft snapshot without writing a corresponding
HardState since we write the snapshot in its own batch first and only then
write a HardState.

If that happens, the server is going to panic on restart: It will have a
nontrivial first index, but a committed index of zero (from the empty
HardState).

This change prevents us from applying a snapshot when there is no HardState
supplied along with it, except when applying a preemptive snapshot (in which
case we synthesize a HardState).

Ensure that the new HardState and Raft log does not break promises made by an
existing one during preemptive snapshot application.

Fixes cockroachdb#7619.

storage: prevent loss of uncommitted log entries
@petermattis petermattis added the S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting label Jul 11, 2016
@petermattis petermattis modified the milestone: Q3 Jul 11, 2016
@cuongdo cuongdo removed their assignment Aug 22, 2016

@tamird (Contributor) commented Sep 9, 2016

via @tschottdorf: raw data is preserved in S3, but has been dumped and imported into the new cluster.

@tamird tamird closed this as completed Sep 9, 2016