stability: OOM in applySnapshot #5467

Closed

mberhault opened this issue Mar 22, 2016 · 9 comments
@mberhault
Contributor

build sha: 6388648

~12-13 hours after start (4 nodes, 4 photos apps)

node 3 crashes in cgo, failing to malloc:

W160322 02:41:05.506698 storage/intent_resolver.go:270  [node=4,store=4,range=1] failed to push during intent resolution: failed to push "storage/replica_command.go:2075 (*Replica).ChangeReplicas" id=e5cefcb1 key=/Local/Range/"\x04tsd\x12cr.store.valcount\x00\x01\x89\xf8\x06.\xab4"/RangeDescriptor rw=false pri=0.02576397 iso=SERIALIZABLE stat=PENDING epo=0 ts=1458614452.109743706,0 orig=1458614452.109743706,0 max=1458614452.109743706,0
I160322 02:41:07.423443 storage/replica_command.go:1306  range 29: new leader lease replica {4 4 7} 2016-03-22 02:41:06.515680542 +0000 UTC 1.102035073s
I160322 02:41:11.491077 storage/replica_command.go:1306  range 1: new leader lease replica {4 4 4} 2016-03-22 02:41:09.950262787 +0000 UTC 2.035941663s
I160322 02:41:13.619220 server/status/runtime.go:140  runtime stats: 144 goroutines, 373.55MiB active, 100247.14cgo/sec, 0.59/0.10 %(u/s)time, 0.00 %gc (9x)
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
SIGABRT: abort
PC=0x7fd372a94cc9 m=7
signal arrived during cgo execution

rax    0x0
rbx    0x1c317a9
rcx    0xffffffffffffffff
rdx    0x6
rdi    0x6ceb
rsi    0x6cf1
rbp    0x7fd372e1d868
rsp    0x7fd36f8bb8e8
r8     0xa
r9     0x7fd36f8bc700
r10    0x8
r11    0x206
r12    0x7fd2963e6120
r13    0x7fd36f8bbbe0
r14    0x7fd36f8bbbd0
r15    0x3ffff09
rip    0x7fd372a94cc9
rflags 0x206
cs     0x33
fs     0x0
gs     0x0
1: syscall [locked] [Created by stop.(*Stopper).RunWorker @ stopper.go:140]
    runtime   cgocall.go:123              cgocall(0x107c980, 0xc84069ef38, 0)
    engine    ??:0                        _Cfunc_DBPut(#6, #3, 0x1a, #7, 0, #5, 0x2f5, 0, 0)
    engine    rocksdb.go:845              dbPut(#6, #3, 0x1a, 0x20, #7, 0, #5, 0x2f5, 0x300, 0, ...)
    engine    rocksdb.go:512              (*rocksDBBatch).Put(#4, #3, 0x1a, 0x20, #7, 0, #5, 0x2f5, 0x300, 0, ...)
    storage   replica_raftstorage.go:537  (*Replica).applySnapshot(#2, 0x7fd373631640, #4, 0xc837506000, 0x4467977, 0x4468000, 0xc82ab80ac0, 0x3, 0x4, 0, ...)
    storage   replica.go:1159             (*Replica).handleRaftReady(#2, 0, 0)
    storage   store.go:1807               (*Store).processRaft.func1()
    stop      stopper.go:139              (*Stopper).RunWorker.func1(#1, 0xc8201d7ec0)
    runtime   asm_amd64.s:1998            goexit()

Full log:
node3.log.parse.txt

node 2 crashes ~1h later (also OOM), with a suspiciously high number of goroutines:

37374: semacquire [180~251 minutes] [Created by grpc.(*Server).serveStreams.func1 @ server.go:324]
    sync           sema.go:47                runtime_Semacquire(*)
    sync           waitgroup.go:127          (*WaitGroup).Wait(*)
    storage        replica.go:782            (*Replica).beginCmds(#4438, *, #43)
    storage        replica.go:947            (*Replica).addWriteCmd(#4438, #395820, *, *, *, #90, 0x5, 0x1b, 0, 0, ...)
    storage        replica.go:708            (*Replica).Send(#4438, #395820, *, *, *, #90, 0x5, 0x1b, 0, 0, ...)
    storage        store.go:1617             (*Store).Send(#867, #395820, *, *, *, #90, 0x5, 0x1b, 0, 0, ...)
    storage        stores.go:174             (*Stores).Send(#1146, #395820, *, 0, 0, #90, 0x5, 0x1b, 0, 0, ...)
    server         node.go:689               (*Node).Batch.func2()
    stop           stopper.go:166            (*Stopper).RunTask(#1043, *, 0)
    server         node.go:701               (*Node).Batch(#924, #395820, *, *, *, 0, 0)
    roachpb        api.pb.go:1273            _Internal_Batch_Handler(#35, #924, #395820, *, *, 0, 0, 0, 0)
    grpc           server.go:497             (*Server).processUnaryRPC(#6, #395829, #81604, *, #1576, #65, 0, 0, 0)
    grpc           server.go:646             (*Server).handleStream(#6, #395829, #81604, *, 0)
    grpc           server.go:323             (*Server).serveStreams.func1.1(#25664, #6, #395829, #81604, *)

Full log:
node2.log.parse.txt

Followed eventually by nodes 1 and 0:
node0.log.parse.txt
node1.log.parse.txt

@mberhault
Contributor Author

data/logs archived to /mnt/data.<bug#> and ~/logs.<bug#>

@bdarnell
Contributor

How much memory was available to the process? The panic is coming from an allocation in cgo, for which we don't have good instrumentation, so I'm going to look at the stats on the go side first.

The last memory report was:

I160322 02:41:13.619220 server/status/runtime.go:140  runtime stats: 144 goroutines, 373.55MiB active, 100247.14cgo/sec, 0.59/0.10 %(u/s)time, 0.00 %gc (9x)

Steady-state memory usage is generally under 20 MiB, but there are occasional spikes into the 400 MiB range. And then there's this:

I160322 02:26:43.619192 server/status/runtime.go:140  runtime stats: 219 goroutines, 363.26MiB active, 37226.94cgo/sec, 0.38/0.07 %(u/s)time, 0.00 %gc (9x)
I160322 02:26:53.660041 server/status/runtime.go:140  runtime stats: 253 goroutines, 1089.33MiB active, 339514.54cgo/sec, 1.08/0.15 %(u/s)time, 0.02 %gc (17x)
I160322 02:27:03.619394 server/status/runtime.go:140  runtime stats: 245 goroutines, 1417.35MiB active, 87497.51cgo/sec, 0.35/0.07 %(u/s)time, 0.00 %gc (1x)
I160322 02:27:13.619352 server/status/runtime.go:140  runtime stats: 250 goroutines, 1434.23MiB active, 3423.01cgo/sec, 0.06/0.01 %(u/s)time, 0.00 %gc (0x)
I160322 02:27:23.620176 server/status/runtime.go:140  runtime stats: 226 goroutines, 1455.04MiB active, 3459.52cgo/sec, 0.12/0.03 %(u/s)time, 0.00 %gc (0x)
I160322 02:27:33.619577 server/status/runtime.go:140  runtime stats: 238 goroutines, 1466.45MiB active, 872.35cgo/sec, 0.03/0.00 %(u/s)time, 0.00 %gc (0x)
I160322 02:27:43.619138 server/status/runtime.go:140  runtime stats: 246 goroutines, 1476.87MiB active, 714.83cgo/sec, 0.02/0.00 %(u/s)time, 0.00 %gc (0x)
I160322 02:27:53.619174 server/status/runtime.go:140  runtime stats: 250 goroutines, 1487.16MiB active, 704.20cgo/sec, 0.02/0.00 %(u/s)time, 0.00 %gc (0x)
I160322 02:28:03.619212 server/status/runtime.go:140  runtime stats: 254 goroutines, 1497.56MiB active, 790.80cgo/sec, 0.02/0.00 %(u/s)time, 0.00 %gc (0x)
I160322 02:28:13.619147 server/status/runtime.go:140  runtime stats: 254 goroutines, 1507.63MiB active, 664.10cgo/sec, 0.02/0.00 %(u/s)time, 0.00 %gc (0x)
I160322 02:28:23.619195 server/status/runtime.go:140  runtime stats: 263 goroutines, 1518.07MiB active, 702.20cgo/sec, 0.02/0.00 %(u/s)time, 0.00 %gc (0x)
I160322 02:28:33.619076 server/status/runtime.go:140  runtime stats: 263 goroutines, 1530.07MiB active, 1273.71cgo/sec, 0.03/0.00 %(u/s)time, 0.00 %gc (0x)
I160322 02:28:43.619171 server/status/runtime.go:140  runtime stats: 264 goroutines, 1540.93MiB active, 658.29cgo/sec, 0.03/0.00 %(u/s)time, 0.00 %gc (0x)
I160322 02:28:53.619072 server/status/runtime.go:140  runtime stats: 267 goroutines, 1551.70MiB active, 651.51cgo/sec, 0.03/0.00 %(u/s)time, 0.00 %gc (0x)
I160322 02:29:03.619179 server/status/runtime.go:140  runtime stats: 269 goroutines, 18.03MiB active, 654.89cgo/sec, 0.04/0.00 %(u/s)time, 0.00 %gc (1x)

Memory usage spiked by 700MiB in a single 10-second reporting interval and then kept growing for a while before dropping quickly back to the baseline. During that time no GCs were occurring. Either we had one incredibly long GC and the pause time stats we report are inaccurate, or there really was no GC activity (because time was spent in cgo?).
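To make the "no GC activity" question concrete, here's a minimal sketch (assuming nothing about the actual runtime.go reporting code) of cross-checking a heap spike against the runtime's own cumulative counters, using only standard runtime.MemStats fields:

package main

import (
	"log"
	"runtime"
	"time"
)

// Sample runtime.MemStats on the same 10s cadence as the runtime stats log
// line; NumGC and PauseTotalNs are cumulative, so per-interval deltas show
// whether any GC actually ran while the heap was spiking.
func main() {
	var prev runtime.MemStats
	for range time.Tick(10 * time.Second) {
		var cur runtime.MemStats
		runtime.ReadMemStats(&cur)
		log.Printf("heap=%dMiB gcs=%d pause=%s",
			cur.HeapAlloc>>20,
			cur.NumGC-prev.NumGC,
			time.Duration(cur.PauseTotalNs-prev.PauseTotalNs))
		prev = cur
	}
}

If the per-interval deltas stay at zero across the spike, the pause stats are fine and the collector simply never ran during that window.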

Of course, the process recovered from that incident, but it might give us some clues about what could have happened. We now know that memory usage is subject to sudden spikes, and that it takes more than 1.5GB allocated on the Go side to kill the process (unless the cgo side had grown between this incident and the one that killed it).

A lot of leases changed hands at 02:26, and a lot of ranges seem to be getting rebalanced away from NodeID 2 (note that node IDs are 1-based but these log filenames are 0-based. @mberhault can you change that?).

There are also a lot of these in the log, which makes me wonder if raft log GC might be failing in a way that is causing the snapshots to grow without bound:

W160322 02:40:30.749408 storage/raft_log_queue.go:116  storage/raft_log_queue.go:101: raft log's oldest index (0) is less than the first index (12047) for range 65

I'll look at the data directories later to see if we can figure out which range was getting sent over and how big the snapshot would have been.

@mberhault
Contributor Author

Each machine has approx. 7.3GB of memory (the official number on AWS is 8GB; liars!).

For info, a currently-running node is showing:

I160322 19:34:49.750220 server/status/runtime.go:140  runtime stats: 115 goroutines, 10.91MiB active, 46580.60cgo/sec, 1.08/0.11 %(u/s)time, 0.00 %gc (33x)

free:

$ free 
             total       used       free     shared    buffers     cached
Mem:       7659508    7528128     131380        372      89508    4964312
-/+ buffers/cache:    2474308    5185200
Swap:            0          0          0

with ps reporting:

root     25357  124 27.6 2963668 2114556 ?     Sl   17:25 160:14 /home/ubuntu/cockroach
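As a side note on reading these numbers: the Go-side "active" figure can't see anything RocksDB allocates through cgo, so the interesting quantity is the gap between it and the process RSS. A rough sketch of measuring that gap (reading /proc/self/statm, Linux-only; none of this is existing cockroach code):

package main

import (
	"fmt"
	"os"
	"runtime"
	"strconv"
	"strings"
)

// Compare the heap the Go runtime knows about with the process RSS; the
// difference is dominated by cgo (RocksDB) allocations that Go cannot see.
func main() {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)

	statm, err := os.ReadFile("/proc/self/statm") // "size resident shared ..." in pages
	if err != nil {
		panic(err)
	}
	residentPages, _ := strconv.ParseInt(strings.Fields(string(statm))[1], 10, 64)
	rss := residentPages * int64(os.Getpagesize())

	fmt.Printf("go heap: %d MiB, process RSS: %d MiB\n", ms.HeapAlloc>>20, rss>>20)
}

In the ps line above that gap is roughly 2GB of RSS against ~11MiB reported on the Go side, which suggests most of the memory is living on the C++ side.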

@nvanbenschoten
Member

The gRPC complaints roughly 7 minutes before the crash, like

I160322 02:34:55.612588 /go/src/google.golang.org/grpc/server.go:273  grpc: Server.Serve failed to complete security handshake from "172.31.58.174:58481": EOF

along with the node's inability to establish a quorum

E160322 02:40:49.682709 storage/queue.go:444  [replicate] range=3 [/Table/12-/Table/13): error: storage/replicate_queue.go:122: range requires a replication change, but lacks a quorum of live nodes.

would lead me to believe that the cluster experienced some kind of network partition which isolated node3. After the node was able to reestablish a connection with the rest of the cluster, it seems like the Raft snapshot was too large for cgo to handle.

The only semi-relevant issue I could find was blevesearch/blevex#13, where the author explains that a bad_alloc was a deterministic consequence of doing a batch insert above a certain size threshold (in a cgo call to LevelDB). Even if this crash does end up being an issue with letting the Raft log grow indefinitely, resulting in a huge snapshot as @bdarnell suggested, I'd still like to see if we can independently find some notion of a RocksDB batch size limit and how that would relate to total system memory.
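For reference, here's a hedged sketch of the size-bounding idea (none of these types are the real engine API; Batch, Put, and Commit are hypothetical stand-ins): flush whenever the pending batch exceeds a byte threshold, so a single cgo write never needs one giant allocation.

package sketch

// Batch is a hypothetical stand-in for the storage engine's write batch.
type Batch interface {
	Put(key, value []byte) error
	Commit() error
}

// batchWriter flushes to the underlying engine whenever the pending batch
// grows beyond maxBatchBytes, bounding the memory any single cgo write needs.
type batchWriter struct {
	newBatch      func() Batch // e.g. a constructor on the real engine
	maxBatchBytes int
	cur           Batch
	curBytes      int
}

func (w *batchWriter) Put(key, value []byte) error {
	if w.cur == nil {
		w.cur = w.newBatch()
	}
	if err := w.cur.Put(key, value); err != nil {
		return err
	}
	w.curBytes += len(key) + len(value)
	if w.curBytes >= w.maxBatchBytes {
		return w.Flush()
	}
	return nil
}

func (w *batchWriter) Flush() error {
	if w.cur == nil {
		return nil
	}
	err := w.cur.Commit()
	w.cur, w.curBytes = nil, 0
	return err
}

The obvious cost is that a snapshot applied across several batches is no longer atomic, so this only illustrates the memory side of the trade-off.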

As for the node 2 crash, it looks like that was related. The continual logging of

E160322 03:53:47.419284 storage/store.go:1819  error handling raft message: storage/store.go:1758: refusing incoming Raft message MsgHeartbeat for group 32 from {2 2 2} to {3 3 3}: raft group deleted

makes me wonder if a Raft command somehow got stuck, blocking all dependent commands behind it indefinitely.
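To make the "blocked behind a stuck command" failure mode concrete, here is a toy version of the dependency tracking (ignoring key spans and every other real detail of beginCmds; all names are made up):

package sketch

import "sync"

// cmd models an in-flight command in a hypothetical command queue: each new
// command must wait for every earlier overlapping command to finish before it
// may proceed, which is what the beginCmds frames in the dump above are doing.
type cmd struct {
	done sync.WaitGroup
}

type commandQueue struct {
	mu      sync.Mutex
	pending []*cmd
}

// add registers a new command and returns the earlier commands it must wait
// for. If any one of those never calls done.Done() (e.g. a Raft proposal that
// is never resolved), every later command blocks in wait() indefinitely and
// goroutines accumulate.
func (q *commandQueue) add() (*cmd, []*cmd) {
	q.mu.Lock()
	defer q.mu.Unlock()
	waitFor := append([]*cmd(nil), q.pending...)
	c := &cmd{}
	c.done.Add(1)
	q.pending = append(q.pending, c)
	return c, waitFor
}

func (q *commandQueue) wait(deps []*cmd) {
	for _, d := range deps {
		d.done.Wait()
	}
}

With ~37k goroutines parked in WaitGroup.Wait in the node 2 dump, a single command that never releases its WaitGroup would be enough to explain the pile-up.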

@bdarnell
Contributor

The "raft group deleted" logging is probably harmless; it can result when a range is rebalanced away and persist until the GC queue runs. I downgraded this logging to V(1) yesterday.

@mberhault
Contributor Author

Happened again today, same error:

I160324 06:58:23.635550 server/status/runtime.go:144  runtime stats: 161 goroutines, 222.53MiB active, 3061.99cgo/sec, 0.27/0.04 %(u/s)time, 0.00 %gc (8x)
I160324 06:58:23.668877 server/status/linuxstats.go:56  mallinfo stats: ordblks=289 KiB, smblks=904 B, hblks=0 B, hblkhd=0 B, usmblks=0 B, fsmblks=76 KiB, uordblks=16 EiB, fordblks=16 EiB, keepcost=76 KiB
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
SIGABRT: abort
PC=0x7fa975651cc9 m=6
signal arrived during cgo execution

Unfortunately, the mallinfo struct uses a 32-bit int for uordblks and fordblks (total allocated and total free, respectively), so the values above have overflowed.

Still, the fact that total free overflowed 2GB seems to indicate that we spiked very high in RocksDB.
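For anyone poking at this, a tiny cgo sketch of why those fields wrap (glibc's mallinfo() reports every size as a plain C int; the 64-bit mallinfo2() variant only arrived in much later glibc releases):

package main

// #include <malloc.h>
import "C"

import "fmt"

// mallinfo's counters are C ints, so uordblks (in-use bytes) and fordblks
// (free bytes) go negative once the malloc heap passes 2GiB; reinterpreted as
// unsigned byte counts, that is what shows up as the absurd 16 EiB values in
// the linuxstats log line above.
func main() {
	mi := C.mallinfo()
	fmt.Printf("uordblks=%d fordblks=%d\n", int32(mi.uordblks), int32(mi.fordblks))
}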

@nvanbenschoten
Member

After looking back at the crash of node 2, I'm fairly optimistic that the issue is exactly the same as #5368 and was caught in #5551. It seems like a read-only command got stuck in redirectOnOrAcquireLeaderLease and was never timed out because of the incorrect context.WithTimeout linkage.

That said, I haven't been able to determine what would actually cause requesting the leader lease to get stuck in the first place, so I'm curious if anyone has any ideas. Is it expected for commands proposed to Raft to need to time out without hearing a positive or negative answer, under certain circumstances? I assume not, so another possibility is that we got stuck repeatedly requesting the leader lease in this loop, finding each time that the returned lease did not cover the current timestamp. I'd be curious to look at traces from these nodes to see how many attempts we were making to acquire leader leases. A third alternative is that requesting the leader lease somehow caused a deadlock, but at this point that is just speculation.
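For clarity on what the incorrect context.WithTimeout linkage means here, a minimal sketch of the intended shape (the method and parameter names are made up; only context.WithTimeout itself is real): the timeout context has to be derived from the caller's context and then actually be the one the lease acquisition waits on.

package sketch

import (
	"context"
	"time"
)

// Replica is a stand-in for the real storage.Replica.
type Replica struct{}

// withLeaseTimeout shows the intended linkage: the derived ctx (not the
// original one) is what the lease acquisition must wait on, so a stuck Raft
// proposal eventually unblocks the waiting command instead of leaving it
// parked forever.
func (r *Replica) withLeaseTimeout(
	ctx context.Context, timeout time.Duration, acquire func(context.Context) error,
) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	return acquire(ctx)
}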

@bdarnell
Contributor

Is it expected for commands proposed to Raft to need to time out without hearing a positive or negative answer, under certain circumstances?

Raft never gives a negative answer; either the command commits or you time out (and in general we use periodic retries of Raft commands to ensure that they eventually commit). However, if you're on a replica which has been removed from the range, then none of the other replicas will talk to you.

So when a stale range descriptor cache directs a request to a removed replica, we don't handle it well at all. I think we're currently relying on SendNextTimeout to handle this at the RPC layer, but it leaves an orphaned goroutine behind (prior to #5551). Eventually the replicaGCQueue will see that the replica has been removed and cancel all of its outstanding operations (or at least it should), but that takes a long time.

I think we need to add some sort of error messaging to the Raft transport, so that when a message is rejected for coming from an out-of-date replica, we can send a message back to that replica that will trigger a replica GC.
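A sketch of that proposal with entirely hypothetical names (ReplicaTooOldError and replicaGCQueue are placeholders, not existing APIs): the transport's rejection carries an explicit error back to the sender, which queues its stale replica for GC immediately instead of waiting for the periodic scan.

package sketch

// ReplicaTooOldError is the hypothetical error a recipient returns when it
// rejects a Raft message from a replica that is no longer part of the range.
type ReplicaTooOldError struct {
	RangeID int64
}

func (e *ReplicaTooOldError) Error() string { return "sender replica too old" }

// replicaGCQueue stands in for whatever queue performs replica GC.
type replicaGCQueue interface {
	Add(rangeID int64)
}

// handleRaftResponse runs on the node that sent the rejected message: its
// replica was removed from the range, so schedule it for GC now rather than
// leaving its commands blocked until the next scan.
func handleRaftResponse(q replicaGCQueue, rangeID int64, err error) {
	if _, ok := err.(*ReplicaTooOldError); ok {
		q.Add(rangeID)
	}
}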

@bdarnell
Contributor

I'm going to close this since the immediate cause was fixed in #5551. I've filed #5789 to track something that we could be doing better with removed replicas.
