stability: OOM in applySnapshot #5467
data/logs archived to
How much memory was available to the process? The panic is coming from an allocation in cgo, for which we don't have good instrumentation, so I'm going to look at the stats on the Go side first. The last memory report was:
Memory usage spiked by 700MiB in a single 10-second reporting interval and then kept growing for a while before dropping quickly back to the baseline. During that time no GCs were occurring. Either we had one incredibly long GC and the pause time stats we report are inaccurate, or there really was no GC activity (because time was spent in cgo?). Of course, the process recovered from that incident, but it might give us some clues about what could have happened.

We now know that memory usage is subject to sudden spikes, and that it takes more than 1.5GB allocated on the Go side to kill the process (unless the cgo side had grown between this incident and the one that killed it).

A lot of leases changed hands at 02:26, and a lot of ranges seem to be getting rebalanced away from NodeID 2 (note that node IDs are 1-based but these log filenames are 0-based; @mberhault, can you change that?). There are also a lot of these in the log, which makes me wonder if raft log GC might be failing in a way that is causing the snapshots to grow without bound:
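For reference, the memory figures discussed here only cover the Go heap; allocations made through cgo (i.e. RocksDB) are invisible to them. A minimal sketch of a periodic reporter along these lines (not the server's actual reporting code, just an illustration of where the quoted numbers come from):

```go
package main

import (
	"log"
	"runtime"
	"time"
)

// reportMemory logs Go heap statistics every interval. This is only a
// sketch of the kind of periodic memory report quoted above; note that
// cgo/RocksDB allocations do not show up in these numbers at all.
func reportMemory(interval time.Duration) {
	var ms runtime.MemStats
	for range time.Tick(interval) {
		runtime.ReadMemStats(&ms)
		log.Printf("go allocated: %d MiB, total sys: %d MiB, num GC: %d, total GC pause: %s",
			ms.HeapAlloc>>20, ms.Sys>>20, ms.NumGC, time.Duration(ms.PauseTotalNs))
	}
}

func main() {
	go reportMemory(10 * time.Second)
	select {} // block forever; stand-in for the server's main loop
}
```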
Each machine has approximately 7.3GB (the official number on AWS is 8GB; liars!). For info, a currently-running node is showing:
free:
with ps reporting:
The gRPC complaints roughly 7 minutes before the crash, like
along with the node's inability to establish a quorum,
would lead me to believe that the cluster experienced some kind of network partition which isolated node3. After the node was able to reestablish a connection with the rest of the cluster, it seems like the Raft snapshot was too large for cgo to handle. The only semi-relevant issue I could find was blevesearch/blevex#13, where the author explains that a

In terms of the node2 crash, it looks like that was related. The continual logging of
makes me wonder if a Raft command somehow got stuck, blocking all dependent commands behind it indefinitely.
The "raft group deleted" logging is probably harmless; it can result when a range is rebalanced away and persist until the GC queue runs. I downgraded this logging to V(1) yesterday. |
Happened again today, same error:
Unfortunately, the mallinfo struct uses int32 for uordblks and fordblks (total allocated and total free, respectively). Still, the fact that total free overflowed 2GB seems to indicate that we spiked very high in rocksdb.
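For context, a minimal cgo sketch (assuming glibc's `mallinfo`, which is what the overflow above refers to) that reads those counters and shows why they wrap once the C heap passes ~2GiB, since the struct fields are plain C ints:

```go
package main

/*
#include <malloc.h>
*/
import "C"

import "fmt"

func main() {
	// mallinfo reports C-heap usage (e.g. RocksDB allocations), but its
	// fields are plain C ints, so uordblks (total allocated) and
	// fordblks (total free) wrap around once they exceed ~2GiB.
	mi := C.mallinfo()
	fmt.Printf("C heap allocated (uordblks): %d bytes\n", mi.uordblks)
	fmt.Printf("C heap free      (fordblks): %d bytes\n", mi.fordblks)
}
```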
After looking back at the crash of node 2, I'm fairly optimistic that the issue is the exact same as #5368 and was caught in #5551. It seems like a read-only command got stuck in

That said, I haven't been able to determine what would actually cause requesting the leader lease to get stuck in the first place, so I'm curious if anyone has any ideas. Is it expected for commands proposed to Raft to time out without hearing a positive or negative answer under certain circumstances? I assume not, so another possibility is that we got stuck repeatedly requesting the leader lease in this loop, finding each time that the lease returned did not cover the current timestamp. I'd be curious to look at traces from these nodes to see how many attempts we were making to acquire leader leases. A third alternative is that requesting the leader lease somehow caused a deadlock, but at this point that is just speculation.
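To make that second possibility concrete, here is a purely hypothetical sketch of the retry pattern; none of these names are the actual cockroach code, they just illustrate how a command could spin indefinitely if every lease it acquires fails to cover the current timestamp:

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical types, for illustration only.
type lease struct {
	start, expiration time.Time
}

func (l lease) covers(t time.Time) bool {
	return !t.Before(l.start) && t.Before(l.expiration)
}

// requestLeaderLease stands in for proposing a lease request through Raft.
// Here it always returns an already-expired lease, to model the speculated
// failure: each acquired lease fails to cover the current timestamp.
func requestLeaderLease(now time.Time) (lease, error) {
	return lease{start: now.Add(-2 * time.Second), expiration: now.Add(-time.Second)}, nil
}

func main() {
	// The speculated loop: keep requesting the lease and retry whenever the
	// returned lease does not cover the current timestamp. If that condition
	// never holds, the command (and everything behind it) blocks here.
	for attempt := 1; attempt <= 3; attempt++ { // capped for the example
		now := time.Now()
		l, err := requestLeaderLease(now)
		if err != nil {
			panic(err)
		}
		if l.covers(now) {
			fmt.Println("lease acquired")
			return
		}
		fmt.Printf("attempt %d: lease does not cover current timestamp, retrying\n", attempt)
	}
	fmt.Println("still no usable lease; in the real code this loop would keep spinning")
}
```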
Raft never gives a negative answer; either the command commits or you time out (and in general we use periodic retries of raft commands to ensure that it eventually commits). However, if you're on a replica which has been removed from the range, then none of the other replicas will talk to you. So when a stale range descriptor cache directs a request to a removed replica, we don't handle it well at all. I think we're currently relying on SendNextTimeout to handle this at the rpc layer, but it leaves an orphaned goroutine behind (prior to #5551). Eventually the replicaGCQueue will see that the replica has been removed and cancel all of its outstanding operations (or at least it should), but that takes a long time.

I think we need to add some sort of error messaging to the raft transport, so that when a message is rejected for being from an out-of-date replica, we can send a message to that replica that will trigger a replica GC.
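A rough, hypothetical sketch of that last idea (the type and function names are invented for illustration and are not the actual raft transport or GC queue APIs): the receiver reports the rejection back to the sender, and the sender uses that to queue an immediate replica GC check rather than waiting for the periodic scanner:

```go
package main

import "fmt"

// Hypothetical message and queue types, for illustration only.
type raftMessage struct {
	RangeID       int64
	FromReplicaID int64
	ToReplicaID   int64
}

type replicaGCQueue struct{ pending []int64 }

func (q *replicaGCQueue) maybeAdd(rangeID int64) {
	// In the real system this would enqueue the local replica of rangeID
	// for a GC check instead of waiting for the scanner to reach it.
	q.pending = append(q.pending, rangeID)
}

// handleRejection models the sender learning that its replica has been
// removed from the range and scheduling an immediate replica GC check
// rather than retrying forever against peers that won't talk to it.
func handleRejection(msg raftMessage, gcQueue *replicaGCQueue) {
	fmt.Printf("range %d: replica %d rejected as removed; queueing replica GC\n",
		msg.RangeID, msg.FromReplicaID)
	gcQueue.maybeAdd(msg.RangeID)
}

func main() {
	q := &replicaGCQueue{}
	handleRejection(raftMessage{RangeID: 42, FromReplicaID: 2, ToReplicaID: 5}, q)
	fmt.Println("pending GC checks:", q.pending)
}
```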
build sha: 6388648
~12-13 hours after start (4 nodes, 4 photos apps)
node 3 crashes in cgo, failing to malloc:
Full log:
node3.log.parse.txt
node 2 crashes ~1h later (also OOM), with a suspiciously high number of goroutines:
Full log:
node2.log.parse.txt
Followed eventually by nodes 1 and 0:
node0.log.parse.txt
node1.log.parse.txt