high memory usage in cluster #15702
5 nodes at ohq
All nodes originally had 8GB of RAM, but nodes crdb1d and crdb1e were upgraded to 16GB of RAM because they kept using more than 8GB and restarting did not resolve it.
Thanks for the heap profiles, @jlauro! You said that these heap profiles are from after the cluster was idle for some time? If not, could you expand on what workload you were running? Did you run a restore from a backup? crdb1d and crdb1e are both showing the same memory bloat from
What's odd to me is that you say the memory usage wasn't resolved by restarting. That would imply that it isn't just a memory leak staying around forever, but something happening every time the node restarts, which is much stranger.
Yes, the cluster is idle... at least in terms of clients sending any sort of traffic to them. However, most of the nodes are actively exercising CPU/network. As to what they are working on moving around, I have no idea... but I should be able to copy the entire dataset to all the nodes from all the nodes (n x n) in far less time than it's been idle... Load isn't bad on some nodes, and others are using 150+% CPU (dual-CPU VM). How much background load should be expected when "idle"?
Would you mind sharing some of the graphs from the admin UI over the same time period?
I think restarting with only 8GB does resolve it short term, but it did seem to be the same nodes that were prone to using more memory despite being idle. I can wipe all data and recreate it with a fresh import if you think that might help clean out any bad data from beta versions or crashes/ungraceful kills (and it will also clear out the 2 dead nodes).
Which graphs would you like? And from the cluster, or from specific nodes?
There could be raft snapshots getting sent to the two nodes. It's also possible that very uneven latencies between the datacenters (some being much closer together than others) could be making for a lot of lease transfers, which would explain seeing a bunch of raft leadership transfers. I'm not sure why the memory wouldn't be getting GC'ed faster, but long enough delays between GC cycles might explain things.
I'd be curious to see the replication queue and raftlog queue from the Queues page. Leaseholders per store on the Replication page would also be interesting. The memory usage graph from the runtime page and the SQL byte traffic and SQL queries graphs from the SQL page would also be helpful.
I set up a cluster locally with the same locality settings and put some load on it, then let it sit. So without the latencies between nodes, I didn't see any lease or replica rebalancing thrashing. To start, how about the cluster-wide overview, runtime and replication views, and the same for the nodes with high memory usage?
If I could connect to the admin UI myself, that would speed things up a bit, but I really appreciate your quick turnaround times!
To be a little more clear, I'm worried you might be running into #15369 if the latencies between your datacenters differ by a lot, and am hoping to determine how many lease transfers and snapshots are happening in the cluster. If you want to partially test that hypothesis out, you could try running
Knowing the latencies between your datacenters would help me try to repro. I'm not really sure where "ohq" or "ods" might be. I'm sorry we don't have better tooling yet for extracting all the relevant info (cc #13984). If you do just want to do a single data dump to avoid more interruptions to your day, running
OK, I had to set up a temporary tunnel to reach it from the internet... https://208.184.77.179:8080
I think one of the nodes died after it decided to start taking too much memory... Want me to restart cockroach on crdb4a and crdb5a, and/or increase the memory on them or on all nodes?
Thanks! It doesn't look like what's going on is related to lease transfers. I don't care too much what you do with crdb4a and crdb5a, but could you possibly enable remote debugging by running
crdb4a crashed hard... had to reboot the node, even the console wasn't responding... Increased its RAM to 16GB and it's back up. I ran the cockroach debug zip, but it created a 23MB file and this caps uploads at 10MB...
Here is full-mesh latency of 100 packets per check (covering crdb2a, crdb3a, crdb4a, crdb5a, crdb1c, crdb1d, crdb1e, crdb1f, crdb1h, crdb2h, and crdb3h).
Thanks for opening up the debug pages, @jlauro! I haven't fully figured things out, but I just wanted to let you know that I've been actively looking into it this morning, and am starting to get a handle on what's going on.
OK, so here's a summary of what's going on.
I don't know if something recently changed in raft or in our usage of it or if this has just always been around, but this is pretty absurd. I'll see if I can find what might have triggered this, but cc @petermattis @bdarnell in the meantime. Debug page for the problematic range:
Yowzer! That's a lot of unmarshaling! I'm not sure what can be done about this. The Raft leader needs to check for ConfChange entries. We could be mildly more intelligent by providing a specific interface for counting the number of ConfChange entries so we only have to keep 1 unmarshalled entry in memory at a time. Not sure how much that would help. I'd love to know how the Raft group got into this state.
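For illustration, a minimal sketch of that counting idea (not the actual raft API; the iterateRawEntries iterator is a hypothetical stand-in for whatever hands back the raw protobuf bytes of each unapplied log entry, and the raftpb import path is the one etcd's raft package used around this time):

```go
package raftutil

import "github.com/coreos/etcd/raft/raftpb"

// countPendingConfChanges counts EntryConfChange entries while keeping at
// most one unmarshalled entry alive at a time, instead of materializing the
// whole unapplied log in memory.
func countPendingConfChanges(iterateRawEntries func(fn func(raw []byte) error) error) (int, error) {
	n := 0
	err := iterateRawEntries(func(raw []byte) error {
		var ent raftpb.Entry // scratch entry; becomes garbage as soon as this callback returns
		if err := ent.Unmarshal(raw); err != nil {
			return err
		}
		if ent.Type == raftpb.EntryConfChange {
			n++
		}
		return nil
	})
	return n, err
}
```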
@jlauro - to help us figure out how the range got into this state, could you go to node 12 (crdb1e) and run
Just to confirm, I have to shut cockroach down on that node first?
[root@crdb1e data]# cockroach debug raft-log /data/crdb/ 137
@jlauro Yes, that's correct.
The file is 36GB. Compressed it is 218MB. Not sure if I can e-mail that large of a file. I'll try...
Sending via Gmail... looks like it's going to convert it to Google Drive and send you a link.
Even if we can't do it on a per-replica basis or dynamically, putting in a knob to increase the raft election timeout will at least make it possible for a cluster to recover from situations like this, which is better than what we have now. And adding the knob wouldn't add any real risk to 1.0/1.0.1.
Not easily, or at least not as you've described here - we're holding the raftMu too long on the leader, but the election timeout is used on the followers. If we have a way of letting the followers know that they should slow down their election timeouts, the best way to implement this is probably to skip calls to
Maybe. We're already taking heartbeat generation mostly out of raft's hands with coalesced heartbeats, so we might be able to keep heartbeats flowing even when raft processing is blocked. I think there are probably dragons here, though.
The unapplied portion of the log may include both committed and uncommitted entries. All the committed entries will become applied at the next

@danhhz Do you have a backup of the data from #15681 that you think might be the same root cause as this issue? I'd like to look at the raft logs to see what they have in common with the one from @jlauro.
@bdarnell they're in
My description was terse. I was imagining that if we detect the

Adjusting the number of calls to
Seems tricky to do this. We don't even know that the replica is the leader when this is occurring, and we can't determine that it is until the long-running call finishes.
Ah, that's what I was missing. |
Should we revert the removal of
An env var might be better since it'll only really be recommended for people who have hit this problem, but I do think we should expose either it or
Just so that we have some type of escape hatch if needed. |
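To sketch what that escape hatch could look like (purely illustrative; the env var name and default tick count here are assumptions, not the knob CockroachDB actually shipped):

```go
package base

import (
	"os"
	"strconv"
)

// defaultRaftElectionTimeoutTicks is how many raft ticks a follower waits
// without hearing from the leader before it campaigns (illustrative value).
const defaultRaftElectionTimeoutTicks = 15

// raftElectionTimeoutTicks lets operators stretch the election timeout via a
// hypothetical COCKROACH_RAFT_ELECTION_TIMEOUT_TICKS env var, so a cluster
// stuck unmarshalling a huge raft log has a chance to stop re-electing.
func raftElectionTimeoutTicks() int {
	if s := os.Getenv("COCKROACH_RAFT_ELECTION_TIMEOUT_TICKS"); s != "" {
		if v, err := strconv.Atoi(s); err == nil && v > 0 {
			return v
		}
	}
	return defaultRaftElectionTimeoutTicks
}
```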
I think #13869 is going to be the long-term solution here to keep the reproposals from spiraling out of control.
Commands are reproposed if they are not committed within 1-2 election timeouts, which comes to 3-6 seconds with the defaults. To commit a 40MB command in 3s requires network throughput of at least 13MB/sec to one of the followers, or twice that if requests to two followers get routed over the same link. (Enabling RPC compression would really help for these DeleteRange commands.) Our current GRPC window configuration allows 2MB to be transferred per stream per RTT, so for the 70ms RTT shown above we have a GRPC-imposed limit of ~30MB/sec/stream, if the underlying network can give us that much bandwidth. That's making significant demands of the network, so in addition to the generic flow control of #13869 we may want to start considering the size of pending commands before reproposing them.
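To make those back-of-the-envelope numbers easy to re-derive, here is the same arithmetic as a tiny sketch (the 40MB command size, 3s repropose window, 2MB gRPC window, and 70ms RTT are just the figures quoted above):

```go
package main

import "fmt"

func main() {
	const (
		commandMB     = 40.0  // reproposed DeleteRange command size
		reproposeSecs = 3.0   // lower end of the 1-2 election timeout window
		grpcWindowMB  = 2.0   // per-stream gRPC flow-control window
		rttSecs       = 0.070 // observed inter-datacenter round trip
	)
	// Throughput needed to commit the command before it gets reproposed.
	fmt.Printf("required: ~%.1f MB/s\n", commandMB/reproposeSecs) // ~13.3 MB/s
	// Ceiling imposed by the window: one window per RTT per stream.
	fmt.Printf("gRPC cap: ~%.1f MB/s/stream\n", grpcWindowMB/rttSecs) // ~28.6 MB/s
}
```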
The proposal quota in #13869 is 1000 Raft log entries. I think we'd need to figure out how to make the proposal quota based on the size of the Raft log in order to address the issue here. Cc @irfansharif
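As a rough illustration of a size-based quota (not the actual #13869 implementation), the pool would hand out bytes rather than entry slots: proposals acquire their encoded size before going to raft and release it once the entry commits or is abandoned.

```go
package quota

import "sync"

// bytePool is an illustrative byte-denominated proposal quota.
type bytePool struct {
	mu       sync.Mutex
	cond     *sync.Cond
	capacity int64
	avail    int64
}

func newBytePool(capacity int64) *bytePool {
	p := &bytePool{capacity: capacity, avail: capacity}
	p.cond = sync.NewCond(&p.mu)
	return p
}

// acquire blocks until size bytes of quota are free. Oversized requests are
// clamped to the pool capacity so a single huge command can't deadlock.
func (p *bytePool) acquire(size int64) {
	if size > p.capacity {
		size = p.capacity
	}
	p.mu.Lock()
	defer p.mu.Unlock()
	for p.avail < size {
		p.cond.Wait()
	}
	p.avail -= size
}

// release returns quota (clamped the same way) and wakes blocked proposers.
func (p *bytePool) release(size int64) {
	if size > p.capacity {
		size = p.capacity
	}
	p.mu.Lock()
	p.avail += size
	p.mu.Unlock()
	p.cond.Broadcast()
}
```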
Will have a PR out for review for that shortly; we can discuss/incorporate ideas for this there.
Yeah, it definitely needs to be based on size instead of the number of entries. Also, pulling on the thread a little further, even if they are not reproposed, commands this large are sufficient to cause heartbeats to be delayed, perhaps to the point of causing elections. Maybe MsgApp messages should use separate GRPC streams (similar to the way that MsgSnap used to). This would need a lot of testing (we want to minimize the chance that MsgApps are reordered with respect to each other, but I think raft will tolerate reorderings of different message types reasonably well).
Documented this known limitation here: https://github.com/cockroachdb/docs/pull/1381/files#diff-b959712af7fa5ed1c822d0310313666eR11 @petermattis, @a-robinson, if I missed or misunderstood anything, please let me know.
I managed to break it again and crashed crdb1a with rc2 (it was rebuilt fresh).
crdb1a.ces.cvnt.net:/data/crdb/logs# cockroach node status
I am not sure if this is the same or a different issue, but it appears to be a different way to possibly trigger it. I think there might be some higher than normal packet loss between crdb1d and the *a and *c nodes.
When you say that you "managed to break it again", what symptom(s) are you referring to? Also, it looks like the end of that table got truncated. Did it print properly in your terminal?
Main/original symptom for this issue (#15702) of high memory usage, way beyond what it should use. 8GB RAM with 4GB of swap for the VM, and the process grew to a 24GB address space. (I think 11GB actual active+swap for the Go process.) The table printed correctly in the terminal. Looks like spaces were eaten on the cut-and-paste and don't align, but all the columns appear to have made it into the post.
Pasting the current status.
I also got a large Unmarshal memory cost. This is from heap profiling.
@a-robinson How did you figure out that range 137 is problematic?
@ggaaooppeenngg did you run

To determine the problematic range, I ran a couple little Python scripts to process the output of
Originally reported in the forum:
Version: binary: CockroachDB CCL v1.0-rc.1-dirty (linux amd64, built 2017/05/01 18:33:34, go1.8.1)
The cluster should be idle, but a lot of the nodes are showing load... It's not busy from clients, but it never seems to go idle, even after leaving it inactive for days (though it's only been hours on rc.1).