stability: OOM in applySnapshot #5467
data/logs archived to
How much memory was available to the process? The panic is coming from an allocation in cgo, for which we don't have good instrumentation, so I'm going to look at the stats on the Go side first. The last memory report was:
Memory usage spiked by 700MiB in a single 10-second reporting interval and then kept growing for a while before dropping quickly back to the baseline. During that time no GCs were occurring. Either we had one incredibly long GC and the pause time stats we report are inaccurate, or there really was no GC activity (because time was spent in cgo?). Of course, the process recovered from that incident, but it might give us some clues about what could have happened.

We now know that memory usage is subject to sudden spikes, and that it takes more than 1.5GB allocated on the Go side to kill the process (unless the cgo side had grown between this incident and the one that killed it).

A lot of leases changed hands at 02:26, and a lot of ranges seem to be getting rebalanced away from NodeID 2 (note that node IDs are 1-based but these log filenames are 0-based; @mberhault, can you change that?). There are also a lot of these in the log, which makes me wonder if raft log GC might be failing in a way that is causing the snapshots to grow without bound:
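For reference, the memory figures discussed here only cover the Go heap; allocations made through cgo (i.e. RocksDB) are invisible to them. A minimal sketch of a periodic reporter along these lines (not the server's actual reporting code, just an illustration of where the quoted numbers come from):

```go
package main

import (
	"log"
	"runtime"
	"time"
)

// reportMemory logs Go heap statistics every interval. This is only a
// sketch of the kind of periodic memory report quoted above; note that
// cgo/RocksDB allocations do not show up in these numbers at all.
func reportMemory(interval time.Duration) {
	var ms runtime.MemStats
	for range time.Tick(interval) {
		runtime.ReadMemStats(&ms)
		log.Printf("go allocated: %d MiB, total sys: %d MiB, num GC: %d, total GC pause: %s",
			ms.HeapAlloc>>20, ms.Sys>>20, ms.NumGC, time.Duration(ms.PauseTotalNs))
	}
}

func main() {
	go reportMemory(10 * time.Second)
	select {} // block forever; stand-in for the server's main loop
}
```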
Each machine has approximately 7.3GB (the official number on AWS is 8GB; liars!). For info, a currently-running node is showing:
free:
with ps reporting:
The gRPC complaints roughly 7 minutes before the crash, like
along with the node's inability to establish a quorum,
would lead me to believe that the cluster experienced some kind of network partition which isolated node3. After the node was able to reestablish a connection with the rest of the cluster, it seems like the Raft snapshot was too large for cgo to handle. The only semi-relevant issue I could find was blevesearch/blevex#13, where the author explains that a

In terms of the node2 crash, it looks like that was related. The continual logging of
makes me wonder if a Raft command somehow got stuck, blocking all dependent commands behind it indefinitely.
The "raft group deleted" logging is probably harmless; it can result when a range is rebalanced away and persist until the GC queue runs. I downgraded this logging to V(1) yesterday. |
Happened again today, same error:
Unfortunately, the mallinfo struct uses int32 for uordblks and fordblks (total allocated and total free, respectively). Still, the fact that total free overflowed 2GB seems to indicate that we spiked very high in rocksdb.
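For context, a minimal cgo sketch (assuming glibc's `mallinfo`, which is what the overflow above refers to) that reads those counters and shows why they wrap once the C heap passes ~2GiB, since the struct fields are plain C ints:

```go
package main

/*
#include <malloc.h>
*/
import "C"

import "fmt"

func main() {
	// mallinfo reports C-heap usage (e.g. RocksDB allocations), but its
	// fields are plain C ints, so uordblks (total allocated) and
	// fordblks (total free) wrap around once they exceed ~2GiB.
	mi := C.mallinfo()
	fmt.Printf("C heap allocated (uordblks): %d bytes\n", mi.uordblks)
	fmt.Printf("C heap free      (fordblks): %d bytes\n", mi.fordblks)
}
```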
After looking back at the crash of node 2, I'm fairly optimistic that the issue is the exact same as #5368 and was caught in #5551. It seems like a read-only command got stuck in

That said, I haven't been able to determine what would actually cause requesting the leader lease to get stuck in the first place, so I'm curious if anyone has any ideas. Is it expected for commands proposed to Raft to time out without hearing a positive or negative answer under certain circumstances? I assume not, so another possibility is that we got stuck repeatedly requesting the leader lease in this loop, finding each time that the lease returned did not cover the current timestamp. I'd be curious to look at traces from these nodes to see how many attempts we were making to acquire leader leases. A third alternative is that requesting the leader lease somehow caused a deadlock, but at this point that is just speculation.
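To make that second possibility concrete, here is a purely hypothetical sketch of the retry pattern; none of these names are the actual cockroach code, they just illustrate how a command could spin indefinitely if every lease it acquires fails to cover the current timestamp:

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical types, for illustration only.
type lease struct {
	start, expiration time.Time
}

func (l lease) covers(t time.Time) bool {
	return !t.Before(l.start) && t.Before(l.expiration)
}

// requestLeaderLease stands in for proposing a lease request through Raft.
// Here it always returns an already-expired lease, to model the speculated
// failure: each acquired lease fails to cover the current timestamp.
func requestLeaderLease(now time.Time) (lease, error) {
	return lease{start: now.Add(-2 * time.Second), expiration: now.Add(-time.Second)}, nil
}

func main() {
	// The speculated loop: keep requesting the lease and retry whenever the
	// returned lease does not cover the current timestamp. If that condition
	// never holds, the command (and everything behind it) blocks here.
	for attempt := 1; attempt <= 3; attempt++ { // capped for the example
		now := time.Now()
		l, err := requestLeaderLease(now)
		if err != nil {
			panic(err)
		}
		if l.covers(now) {
			fmt.Println("lease acquired")
			return
		}
		fmt.Printf("attempt %d: lease does not cover current timestamp, retrying\n", attempt)
	}
	fmt.Println("still no usable lease; in the real code this loop would keep spinning")
}
```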
Raft never gives a negative answer; either the command commits or you time out (and in general we use periodic retries of raft commands to ensure that it eventually commits). However, if you're on a replica which has been removed from the range, then none of the other replicas will talk to you. So when a stale range descriptor cache directs a request to a removed replica, we don't handle it well at all. I think we're currently relying on SendNextTimeout to handle this at the rpc layer, but it leaves an orphaned goroutine behind (prior to #5551). Eventually the replicaGCQueue will see that the replica has been removed and cancel all of its outstanding operations (or at least it should), but that takes a long time.

I think we need to add some sort of error messaging to the raft transport, so that when a message is rejected for being from an out-of-date replica, we can send a message to that replica that will trigger a replica GC.
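A rough, hypothetical sketch of that last idea (the type and function names are invented for illustration and are not the actual raft transport or GC queue APIs): the receiver reports the rejection back to the sender, and the sender uses that to queue an immediate replica GC check rather than waiting for the periodic scanner:

```go
package main

import "fmt"

// Hypothetical message and queue types, for illustration only.
type raftMessage struct {
	RangeID       int64
	FromReplicaID int64
	ToReplicaID   int64
}

type replicaGCQueue struct{ pending []int64 }

func (q *replicaGCQueue) maybeAdd(rangeID int64) {
	// In the real system this would enqueue the local replica of rangeID
	// for a GC check instead of waiting for the scanner to reach it.
	q.pending = append(q.pending, rangeID)
}

// handleRejection models the sender learning that its replica has been
// removed from the range and scheduling an immediate replica GC check
// rather than retrying forever against peers that won't talk to it.
func handleRejection(msg raftMessage, gcQueue *replicaGCQueue) {
	fmt.Printf("range %d: replica %d rejected as removed; queueing replica GC\n",
		msg.RangeID, msg.FromReplicaID)
	gcQueue.maybeAdd(msg.RangeID)
}

func main() {
	q := &replicaGCQueue{}
	handleRejection(raftMessage{RangeID: 42, FromReplicaID: 2, ToReplicaID: 5}, q)
	fmt.Println("pending GC checks:", q.pending)
}
```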
build sha: 6388648
~12-13 hours after start (4 nodes, 4 photos apps)
node 3 crashes in cgo, failing to malloc:
Full log:
node3.log.parse.txt
node 2 crashes ~1h later (also OOM), with a suspiciously high number of goroutines:
Full log:
node2.log.parse.txt
Followed eventually by nodes 1 and 0:
node0.log.parse.txt
node1.log.parse.txt