Rapid memory leak in OTP 25 #8044
Just a wild guess, but since we see this in OTP 25 and not in OTP 24, I wonder if it might be related to #5195 (Further optimize off-heap traversal during minor GC)?
Just a bump to highlight that Nick updated the first comment with a trace showing a corrupted linked list: an entry whose next pointer eventually loops back on itself.
Not sure if it's helpful or not, but another core dump has a similar cycle, just a bit smaller, with only two items:
Dumping the offheap terms in the list I found these:
Another interesting thing I noticed is that all the processes which attempt to clean the off-heap list are instances of one particular process in Apache CouchDB: https://github.com/apache/couchdb/blob/main/src/fabric/src/fabric_db_update_listener.erl#L39.
All the other process backtraces also point to this process:
On the surface there isn't anything special about this process. It may be a very short-lived process or it may run for a long time. It does send messages across the dist channel, and there is possibly a race condition where it's sent a message to exit while the main request process it's linked to may also crash. So far I haven't been able to reproduce it on my laptop (intel, mac).
Some debugging info about the schedulers:
First of all, could you show what loaded NIFs or linked-in drivers you have? For example, in your core dump do
and then print the
Thanks for taking a look, @sverker. Here are the nifs/drivers we have:
I am also now able to reproduce the issue on a smaller, isolated test cluster. I compiled a debug emulator and ran it with the load that usually reproduces the issue. The rapid memory leak started, however this time it didn't blow up to 60GB and I haven't gotten to the off_heap cycle there. Instead memory (rss) rose to about 40GB and then the emulator stopped accepting connections. I managed to remsh in and inspect the running processes. Most were stuck in
I took a core dump and the thread stacks look like this (skipping the waiting or yielding ones, and the initial
After about an hour I checked and most of the bif_return_trap/2 and erts_internal:dsend_continue_trap/1 were gone, the memory went down to 20GB, and the nodes with the debug smp emulator are still staying up.
I don't know if the circular offheap lists are part of the root cause or just some secondary symptom. However, I made a branch at https://github.com/sverker/otp/tree/sverker/check-circular-offheap/25.3.2.8 that checks for circular offheap lists at strategic places. It's based on OTP-25.3.2.8 and can be compiled (with the check) both as an opt and a debug VM.
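To make the shape of such a check concrete, a cycle in a singly linked off-heap list can be detected without any extra memory using the classic two-pointer walk. The sketch below is only an illustration of the idea, not the actual code in that branch: the struct is a simplified stand-in for the ERTS off-heap header, and the function name is made up.
#include <stddef.h>

/* Simplified stand-ins; the real definitions live in the ERTS headers. */
typedef unsigned long Eterm;
typedef unsigned long Uint;

struct erl_off_heap_header {
    Eterm thing_word;                  /* header word of the off-heap term */
    Uint size;
    struct erl_off_heap_header *next;  /* NULL-terminated in a healthy process */
};

/* Returns 1 if the list is a proper NULL-terminated chain, 0 if it loops.
 * Floyd's tortoise/hare: the fast pointer moves two steps per iteration,
 * so if there is a cycle the two pointers must eventually meet inside it. */
static int offheap_list_is_acyclic(const struct erl_off_heap_header *first)
{
    const struct erl_off_heap_header *slow = first;
    const struct erl_off_heap_header *fast = first;

    while (fast != NULL && fast->next != NULL) {
        slow = slow->next;
        fast = fast->next->next;
        if (slow == fast)
            return 0;   /* the pointers met: the list is circular */
    }
    return 1;           /* fast reached NULL: the list terminates */
}
An assert built on something like this, dropped in around GC, message copying, and process exit, is roughly what "strategic places" means here.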
Thank you, I'll compile it and give it a try.
I tried it and the issue happened (memory blew up to 43GB) but the assert didn't trigger. I wonder if there are more places to insert the check?
@nickva Did you run the optimized or the debug VM?
I ran the optimized version of the VM.
I pushed a commit to https://github.com/sverker/otp/tree/sverker/check-circular-offheap/25.3.2.8 that also checks for circular offheap lists when processes are exiting.
Thanks, Sverker. I had started on the same path here: master...nickva:otp:sverker/check-circular-offheap/25.3.2.8 I noticed it would be nice to have line numbers, so I altered the assert macro a bit. And probably added checks in way too many places... I added a few of the places you had added and re-ran the tests. So far I got two assert crashes (with that commit above) in the same
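For illustration, an assert-style macro that reports the failing file and line (so different call sites are distinguishable in the crash output) might look like the sketch below; this is an assumption about the general shape of such a change, not the actual macro from either branch.
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical check macro: evaluate Expr, and if it is false, print which
 * check failed and where before aborting, so crashes from different call
 * sites are easy to tell apart in the core dump. */
#define CHK_OFFHEAP(Expr)                                              \
    do {                                                               \
        if (!(Expr)) {                                                 \
            fprintf(stderr, "offheap check failed: %s at %s:%s:%d\n",  \
                    #Expr, __FILE__, __func__, __LINE__);              \
            fflush(stderr);                                            \
            abort();                                                   \
        }                                                              \
    } while (0)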
That's coming from
So, it looks like the circular list happens between the last known successful check at
I've stared at the code and tried to reproduce it myself, without success so far.
I added a few more checks:
And now it fails at
After
Trying to dig a bit deeper, I added a sleep call to attach the debugger right after the condition happens: nickva@f9000c0
watcher = copy_struct(watcher, watcher_sz, &hp, factory.off_heap);
if (!erts_check_circular_offheap(c_p)) {
    fflush(stdout);
    fflush(stderr);
    fprintf(stderr, "XXXXX %s:%s:%d \n", __FILE__, __func__, __LINE__);
    fflush(stderr);
    fflush(stdout);
    wait_debug();
}
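wait_debug() here is a custom helper from the debugging branch; its name and behaviour are only shown in this snippet. One common way to implement that kind of "park until a debugger attaches" hook is to spin on a volatile flag that is flipped by hand from gdb, as in the sketch below (an assumption about the idea, not the actual helper):
#include <stdio.h>
#include <unistd.h>

/* Park the calling thread until a debugger attaches and runs
 *   (gdb) set var debug_continue = 1
 * Printing the pid first makes `gdb -p <pid>` straightforward. */
static volatile int debug_continue = 0;

static void wait_debug(void)
{
    fprintf(stderr, "waiting for debugger, pid = %ld\n", (long) getpid());
    fflush(stderr);
    while (!debug_continue)
        sleep(1);   /* poll slowly so we don't burn a scheduler core */
}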
Moved up to the exit dist handle frame:
Printing a few values:
(This one has the names replaced)
The process
ctx struct
This one is circular too, it repeats up to some depth
Sorry for all the mess, that's a lot of stuff. I didn't know what would be useful, so I opted to dump everything I had.
@nickva First of all, thanks for your quick and competent responses. It looks like someone has already built something on the heap and linked it into the offheap list, but not increased the heap top (c_p->htop). "Our" copy_struct() then builds a term (an external pid) over the existing term(s) in the offheap list, and that is what causes the circular list. I've pushed yet another (one-liner) commit to https://github.com/sverker/otp/tree/sverker/check-circular-offheap/25.3.2.8 that also checks that no terms in the offheap list are above the heap top. You should probably combine that commit with your existing branch with all the additional checks.
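As I read it, the new check encodes the invariant that any off-heap header living on the current process heap must sit below the heap top; an entry at or above c_p->htop means a term was linked into the off-heap list without advancing the heap top, so the next allocation (like the copy_struct() above) can build right over it. A simplified sketch of such a check, ignoring heap fragments and other subtleties (the struct is the same stand-in as in the earlier sketch, and the function name is made up):
/* Sketch: entries inside the process heap [heap, hend) must lie below htop.
 * Entries outside that range (e.g. in heap fragments) are ignored here. */
struct erl_off_heap_header {
    unsigned long thing_word;
    unsigned long size;
    struct erl_off_heap_header *next;
};

static int offheap_entries_below_htop(const struct erl_off_heap_header *first,
                                      const char *heap, const char *htop,
                                      const char *hend)
{
    const struct erl_off_heap_header *p;

    for (p = first; p != NULL; p = p->next) {
        const char *addr = (const char *) p;
        if (addr >= heap && addr < hend && addr >= htop)
            return 0;   /* on the heap, but above the heap top: broken */
    }
    return 1;
}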
Thank you, @sverker. I gave it a try in a new commit, nickva@c898276, but I noticed an assert trigger in
It seems
You switched the condition around. Should be
Oops, sorry. That makes sense. I'll fix the check and retest.
From
When
Thank you for explaining, @sverker. I updated the commit: nickva@e0cd2e7. Now the assert fails in erts_continue_exit_process
I managed to get the rr-project debugger working and captured a few traces with assertions, but I haven't yet figured out how to navigate them properly.
Run the replay again until the assert fails with
Don't forget
You navigate an rr replay as a normal interactive gdb session, but you can also do
Thank you @sverker. I got a bit further: I see the watchpoint trigger, but the backtrace looks unreadable:
I wonder if it's because of the jit, or does it look like something scribbled over the memory? Doing a bunch of
Threads info at that time:
After a few more steps backwards
I tried running with
Just in case it's not, here are some more details: in this case it seems to trigger when establishing a TLS connection. It's reproducible with just a plain erl prompt; it doesn't need to run in our production / test environment.
The surrounding code looks like:
    /*
     * Now allocate the ProcBin on the heap.
     */
    pb = (ProcBin *) HTOP;
    HTOP += PROC_BIN_SIZE;
    pb->thing_word = HEADER_PROC_BIN;
    pb->size = num_bytes;
    pb->next = MSO(c_p).first;
->  MSO(c_p).first = (struct erl_off_heap_header*) pb;
    pb->val = bptr;
    pb->bytes = (byte*) bptr->orig_bytes;
    pb->flags = 0;
    OH_OVERHEAD(&(MSO(c_p)), pb->size / sizeof(Eterm));
    new_binary = make_binary(pb);
Yes, that's a false positive in beam_hot.h. It uses HTOP, which is a cached value of c_p->htop, when it executes beam code. The macros SWAPIN and SWAPOUT switch modes between using c_p->htop and using HTOP. I pushed a commit to my branch where I removed calls to ERTS_CHK_MBUF_SZ when we are "switched in" and HTOP is used instead of c_p->htop. It looks like your first try hit in jit code (which complicates things a bit) that stepped p->htop backwards. But I think it can be worth trying without jit again, with my commit, to see if we can hit the same logic in C code. The jitted instructions are usually the same as the interpreted instructions written in C, just faster.
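To make the false-positive mechanism concrete: while beam code is executing, the heap top lives in a local variable and is only written back to the process struct at certain points, so any check that reads c_p->htop in between sees a stale value. The toy sketch below illustrates that pattern; it is not the real SWAPIN/SWAPOUT macros or the real Process struct, just the shape of the problem:
/* Toy model of the HTOP caching pattern.  While "switched in", allocations
 * bump only the local copy; the process field is refreshed on "swap out".
 * A check that compares term addresses against p->htop between those two
 * points will wrongly conclude that fresh terms lie above the heap top. */
struct toy_process {
    unsigned long *heap;
    unsigned long *htop;   /* authoritative only while swapped out */
};

static unsigned long *toy_alloc(struct toy_process *p, unsigned long need)
{
    unsigned long *local_htop = p->htop;   /* "swap in": cache the heap top */
    unsigned long *res = local_htop;

    local_htop += need;                    /* allocate against the cache */
    /* ... p->htop is stale here, so an offheap-vs-htop check would be a
     *     false positive for anything just allocated ... */
    p->htop = local_htop;                  /* "swap out": write the cache back */
    return res;
}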
We might have found a bug that's causing your problem. I pushed a fix commit to my branch.
That's great news. Thank you, @sverker. I'll give it a try immediately.
So far running with https://github.com/nickva/otp/commits/sverker/check-circular-offheap/25.3.2.8/ (both the false positive assertion fixes and the swap out fix with the jit build), no assertions have triggered. That seems to be good news!
UPDATE: Running with your fix on top of a pristine OTP-25.3.2.8 for a few hours looks great as well. Removing all the cycle checks seems to have helped with the memory usage mentioned earlier in the comment. I think what might have been happening is that some NIFs were called often, and the cycle check that runs before and after NIF calls caused a singleton process in our system, which uses that NIF heavily, to block and slow down.
@nickva Thanks for your excellent troubleshooting. It's always a bit frustrating to not be able to reproduce and instead having to resort to "remote controlling" someone else. But this went really well, at least from my side. I assume you got an initial taste of rr debugging, which I can really recommend for low-level language debugging. The combination of watchpoints and reverse execution can be really powerful and in practice save hours of frustration. I will put this fix in the pipe for the next OTP 25 patch release.
@sverker thank you for your help. I really appreciate your time, effort, and patience debugging this issue. Yeah, it was frustrating not being able to reproduce it at will. I had to resort to developing on a test db node, with emacs tramp mode and various custom building and deployment scripts. Indeed, rr is quite an amazing tool. This was my first time trying it out and I am definitely a fan! I first heard about it from Lukas when debugging another OTP memory issue. Thanks for the advice and pointers on how to use it properly!
@sverker I had noticed that the PR open to fix it, #8104, has
Both are correct and mean the same thing.
That explains it! Thank you for answering.
We have been running the updated version in production for months without issues. Thank you, @sverker and the rest of the OTP team, for taking the time to help debug and fix the issue! |
Describe the bug
beam.smp rapidly consumes memory, up to 58GB in 2 minutes, and then gets killed by oom-killer.
To Reproduce
So far it seems to happen in production, especially when fetching large (50MB) json documents.
Expected behavior
OOM doesn't happen
Affected versions
OTP
25.3.2.8
Doesn't happen in OTP
24.3.4.13
Additional context
I can make the issue happen more frequently on a cluster by repeatedly fetching a large 50MB json document.
Once the node memory starts increasing there are only a few minutes available to dump the core. I captured five core files that way, and a thread backtrace from attaching gdb to the beam.smp:
Process 1
(gdb) thread apply all bt
(from a core file)
Process 2
(gdb) thread apply all bt
(from a core file)
Process 3
thread apply all bt
(from attached process, no core file available)
The common theme seems to be something related to erts_cleanup_offheap when called from erts_continue_exit_process?
Digging into one of the erts_cleanup_offheap calls a bit with gdb, I noticed there is a cycle (!) there instead of a linked list. The $12 next pointer 0x7fcd8c835d98 starts repeating at $20.