Binary memory leak in drv_binary associated with the dist port #5876

Closed
nickva opened this issue Apr 11, 2022 · 20 comments · Fixed by #5892
@nickva
Contributor

nickva commented Apr 11, 2022

Describe the bug

After upgrading from OTP 20 to OTP 23 we noticed increasing binary memory usage associated with the tcp_inet drv_binary allocation type.

It is noticeable in the instrument:allocations() output:

> instrument:allocations()
       ...
       tcp_inet =>
           #{driver_tid => {78,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
             driver_tsd => {75,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
             drv_binary =>
                 {0,4831,944,64,18,192,140,8,8,15830,0,0,0,0,0,0,0,0},
             drv_internal => {62,80,31,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
             port => {0,0,31,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}}}}}

> erlang:memory(binary).
1110936656

Another server:

> instrument:allocations()
       tcp_inet =>
           #{driver_tid => {79,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
             driver_tsd => {76,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
             drv_binary =>
                 {7,15020,3852,493,19,114,208,82,32,66378,0,0,0,0,0,0,0,0},
             drv_internal => {62,82,22,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
             port => {0,0,22,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}}}}}
        ...
> erlang:memory(binary). 
4555266976

A well-behaved node in the same cluster:

> instrument:allocations()
       tcp_inet =>
           #{driver_tid => {79,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
             driver_tsd => {76,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
             drv_binary =>
                 {14,10682,514,65,17,72,212,65,46,2037,0,0,0,0,0,0,0,0},
             drv_internal => {58,87,26,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
             port => {0,0,26,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}}}}}

> erlang:memory(binary).
313718152
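
As an aside, the same view can be narrowed to just binary_alloc with the instrument:allocations/1 options used later in this report, for example:

> f(Hist), {ok, {_, _, #{tcp_inet:=#{drv_binary:=Hist}}}} = instrument:allocations(#{allocator_types=>[binary_alloc], histogram_width=>18}), Hist.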

[graph: bin_mem_leak_90_days]

(The increase on the graph corresponds with the upgrade to OTP 23; the earlier flat section is OTP 20.)

I suspected the dist protocol, and possibly the 64kb fragmentation logic interacting with refc binaries, because when I tried to dump a core with gdb the process was suspended long enough to miss the dist tick timeout and the node disconnected. When that happened the binary memory was released immediately.

[graph: binary_dealloc_on_dist_disconnection]

(The core was dumped around 21:43 to 21:45)
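
For reference, the same connection-tied release can be confirmed deliberately from a shell, which is essentially what is done further down in this thread (the peer node name here is hypothetical):

BinBefore = erlang:memory(binary),
net_kernel:disconnect('[email protected]'),    %% hypothetical peer; drops the dist connection
timer:sleep(1000),                          %% give the driver a moment to free its buffers
BinAfter = erlang:memory(binary),
{BinBefore, BinAfter}.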

To Reproduce

Out of the 6 or so clusters that were upgraded (each with roughly 3 to 12 nodes), we only observed this on one cluster, and on that cluster only on 3 out of 6 nodes.

I couldn't reliably reproduce the issue on my laptop. I had tried simply sending large binaries between two local nodes. Sometimes I see spikes in the drv_binary 64kb histogram slot, but then it clears back out to 0.

Expected behavior

The binary memory should not keep increasing; it should either level off at a reasonably low level and stay there, or be deallocated.

Affected versions

Erlang 23 (erts-11.2.2.10)

$ remsh
Erlang/OTP 23 [erts-11.2.2.10] [source] [64-bit] [smp:80:80] [ds:80:80:10] [async-threads:1]
$ uname -a
Linux db8 4.9.0-17-amd64 #1 SMP Debian 4.9.290-1 (2021-12-12) x86_64 GNU/Linux

Additional context

vm.args used:

-shutdown_time 5000
-kernel error_logger silent
-sasl sasl_error_logger false
+K true
+A32
+zdbbl32768
-ssl session_lifetime 300
+MBacul 0
+MBas aobf

I had suspected it was a simple refc binary memory leak and ran recon:bin_leak(5), which runs a GC for every process, but that didn't affect the binary memory.
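
For reference, a minimal shell sketch of the kind of sweep recon:bin_leak/1 relies on, forcing a GC in every process and comparing binary memory before and after:

BinBefore = erlang:memory(binary),
[erlang:garbage_collect(P) || P <- erlang:processes()],   %% force a GC in every process
BinAfter = erlang:memory(binary),
BinBefore - BinAfter.   %% how much binary memory the sweep released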

Another suspicion was that it may be memory fragmentation, so I had tried running with a smaller limit for binary MBC blocks:

+MBlmbcs 512

and removing +MBacul 0 and +MBas aobf, but that didn't seem to make a difference.

A few more details:

  • Most of the messages are sent between nodes with an erlang:send(Dest, Msg, [noconnect, nosuspend]) call. In case a message cannot be sent (the result =/= ok), there is some buffering with a single head-of-line sender attempting to re-connect: https://github.com/apache/couchdb/blob/3.x/src/rexi/src/rexi_buffer.erl#L73-L84 (see the sketch after this list). Perhaps after Erlang 21 made erlang:send/2,3 more asynchronous, that buffering strategy is now counter-productive and may be involved in this binary leak issue.

  • A lot of the traffic may be binary data being sent through a coordinator node on the cluster to be written to other nodes. So a common path might be http socket <-> node1 <-> [dist] <-> node2 <-> disk, where node1 is the coordinator that writes to or reads from multiple cluster nodes and then responds to the http request.

  • I had tried to determine the exact size in bytes of the 64kb block by varying the start size of the histogram, and the size seems to be 65608:

> f(Hist), {ok, {_, _, #{tcp_inet:=#{drv_binary:=Hist}}}} = instrument:allocations(#{allocator_types=>[binary_alloc], histogram_start=>65609, histogram_width=>4}), Hist.
{78646,2774,0,0}

> f(Hist), {ok, {_, _, #{tcp_inet:=#{drv_binary:=Hist}}}} = instrument:allocations(#{allocator_types=>[binary_alloc], histogram_start=>65608, histogram_width=>4}), Hist.
{58078,23313,0,0}
  • Suspecting it may have something to do with dist protocol fragmentation, and wanting to mitigate the issue somehow, I had tried the release compatibility flag +R 21 on one of the nodes, but that didn't seem to have any effect.

  • I have a gdb core dump and have been trying to inspect it, but so far have not gotten very far:

(gdb) etp-alloc-instances 10

(gdb) print *((Allctr_t *) 0x7f37742df980)
$2473 = {dd = {q = {tail = {data ...
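
As referenced in the first bullet above, a rough sketch of that send-and-buffer pattern (illustrative names only, not the actual rexi_buffer code):

%% Try a non-blocking, non-connecting send first. If the dist buffer is busy
%% (nosuspend) or the connection is down (noconnect), hand the message to a
%% single buffering process that reconnects, retries, and drops messages once
%% its buffer is full.
try_send(Dest, Msg) ->
    case erlang:send(Dest, Msg, [noconnect, nosuspend]) of
        ok -> ok;
        _NotSent -> buffer_server ! {buffer, Dest, Msg}, buffered
    end.
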
nickva added the "bug" label Apr 11, 2022
@garazdawi
Contributor

Erlang/OTP 23 also introduced erlang:term_to_iovec/1 and started using it in the distribution, so that could be another source of this leak.

I don't suppose it is possible for you to see if the issue can be recreated using the latest Erlang/OTP 22?

Normally, when trying to find leaks of refc binaries, the core file does not help, as all it will show is a bunch of binaries that have their refc at 1 without any indication of which code failed to do the decrement it needed to...

rickard-green added the "team:VM" label Apr 11, 2022
@garazdawi
Contributor

I had suspected it was a simple refc binary memory leak and ran recon:bin_leak(5), which runs a GC for every process, but that didn't affect the binary memory.

Have you run recon:proc_count(binary_memory, 10). or something similar?

@nickva
Contributor Author

nickva commented Apr 11, 2022

Thanks for taking a look @garazdawi

I don't suppose it is possible for you to see if the issue can be recreated using the latest Erlang/OTP 22?

Thanks for the idea. I'll try running with OTP 22.

Have you run recon:proc_count(binary_memory, 10). or something similar?

We ran that across the cluster a few times to see what we could find but it didn't seem to reveal large differences between the nodes which leaked memory and the ones which didn't.

I tried again a few times:

> erlang:memory(binary).
4644564408

> recon:proc_count(binary_memory, 4).
[{<0.17633.3366>,3433878391096,
  [{current_function,{gen_server,loop,7}},
   {initial_call,{proc_lib,init_p,5}}]},
 {<0.7305.3556>,1511651870539,
  [{current_function,{gen_server,loop,7}},
   {initial_call,{proc_lib,init_p,5}}]},
 {<0.6597.5061>,110354480,
  [{current_function,{gen,do_call,4}},
   {initial_call,{erlang,apply,2}}]},
 {<0.26249.5057>,15542734,
  [{current_function,{gen_server,loop,7}},
   {initial_call,{proc_lib,init_p,5}}]}]

And one from a node which doesn't leak binary memory:

> erlang:memory(binary).
310430664

> recon:proc_count(binary_memory, 4).
[{<0.23201.3229>,7629199574678,
  [{current_function,{gen_server,loop,7}},
   {initial_call,{proc_lib,init_p,5}}]},
 {<0.12872.3198>,44829366,
  [{current_function,{gen_server,loop,7}},
   {initial_call,{proc_lib,init_p,5}}]},
 {<0.31859.0>,3827967,
  [couch_replicator_scheduler,
   {current_function,{gen_server,loop,7}},
   {initial_call,{proc_lib,init_p,5}}]},
 {<0.6720.4911>,1922939,
  [{current_function,{gen_server,loop,7}},
   {initial_call,{proc_lib,init_p,5}}]}]

Then I thought of naively summing the binary sizes referenced in erlang:process_info(Pid, binary) (it's silly in general, I understand, as those are full binary sizes, and with lots of references this would produce completely unreliable results):

f(SumAllBins), SumAllBins = fun() ->
   %% Walk every process and add up the sizes of the refc binaries its heap
   %% references (a binary shared by several processes is counted once per process)
   lists:foldl(fun(Pid, SzAcc) ->
       Bins = case erlang:process_info(Pid, binary) of
           undefined -> [];
           {binary, BList} -> BList
       end,
       lists:foldl(fun({_Addr, Size, _Refs}, Acc) ->
           Acc + Size
       end, SzAcc, Bins)
   end, 0, processes())
end.

And I noticed that, even with the naive summing, the total from erlang:process_info(Pid, binary) was far less than the memory reported by erlang:memory(binary):

> SumAllBins().
30065317

> erlang:memory(binary).
3276123488
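
For what it's worth, a variant of the same shell snippet that deduplicates by binary address, so each refc binary is only counted once (still only covers binaries referenced from process heaps; names are just for illustration):

f(SumUniqueBins), SumUniqueBins = fun() ->
    Sizes = lists:foldl(fun(Pid, Acc) ->
        case erlang:process_info(Pid, binary) of
            undefined -> Acc;
            {binary, BList} ->
                %% key by address so a binary referenced by many processes is counted once
                lists:foldl(fun({Addr, Size, _Refs}, A) -> maps:put(Addr, Size, A) end, Acc, BList)
        end
    end, #{}, processes()),
    lists:sum(maps:values(Sizes))
end.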

@garazdawi
Contributor

I suspected the dist protocol, and possibly the 64kb fragmentation logic interacting with refc binaries, because when I tried to dump a core with gdb the process was suspended long enough to miss the dist tick timeout and the node disconnected. When that happened the binary memory was released immediately.

I was thinking a bit more about what you wrote here. This would suggest that the refc binaries are kept alive by something that is destroyed when the connection is dropped. I've written a small extension to etp that can dump information about all current distribution entries in a core dump. What do you get if you run that on your dump?

@nickva
Contributor Author

nickva commented Apr 12, 2022

Thanks for the script @garazdawi. Here is the output from the node which had the largest binary memory leak at the time:

https://gist.github.com/nickva/b428fbb52f93c8248c7be67361cdc03c

The node where the core was dumped is db11. db8-db12 are the peer nodes; most of the dist data goes through them. clouseau is a search component written in Java/Scala. The disconnected nodes with random 3-4 letter/number names are periodic erl_call invocations to gather stats.

It is odd to see db3 among the disconnected nodes; that used to be an older (since replaced) node, so something must have tried to connect to it. Also, db11 has an entry for itself, but that may be expected?

EDIT: We suspected the search component (clouseau) initially and restarted it a few times, but that didn't seem to clear the binary memory.

@garazdawi
Contributor

Hmm, that didn't tell us much...

Also, db11 has an entry for itself, but that may be expected?

yes, that is expected.

@nickva
Contributor Author

nickva commented Apr 12, 2022

We managed to start the db11 node with OTP 22 and are currently monitoring it.

Looking at other metrics, we have also noticed that whenever the binary memory bumps up it corresponds with a spike in traffic, and those spikes also coincide with an increase in our rexi.buffered metric. That's a metric we increment when erlang:send(Dest, Msg, [noconnect, nosuspend]) doesn't return ok; in that case we buffer the message, spawn a single process to try to connect and send the data, and drop messages when the buffer fills up.

For example, in this graph the green line is the binary memory on a "good" node (db12), where it stays constant over a few weeks. The blue one is the "bad" node (db11), which keeps increasing.

[graph: db11_db12_vs_buffered_rexi]

The second set of metrics (the right Y axis) is the rate of buffered messages (erlang:send(Dest, Msg, [noconnect, nosuspend]) =/= ok). Those are > 0 for some nodes (including db12) but not for db11. So it seems other nodes are pushing data to db11 and overwhelming the dist port/buffer on db11.
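
One way to confirm that kind of dist-buffer pressure from a shell is the built-in system monitor (a small sketch; run it from a process that stays alive, such as the shell):

%% Ask the VM to send a monitor message whenever a process is suspended
%% because the distribution buffer of a connection is full.
erlang:system_monitor(self(), [busy_dist_port]).
%% ... let some traffic through, then inspect the shell mailbox:
flush().
%% Messages look like {monitor, SuspendedPid, busy_dist_port, Port}.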

Also, I had tried a higher zdbbl value (128MB vs our default 32MB), and that's the db11 run from Mon 08:00 to Mon 22:00. That seems to have smoothed out the rise a bit.
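
For reference, +zdbbl takes its value in kilobytes, so in a vm.args like the one above a 128MB distribution buffer busy limit would be:

+zdbbl 131072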

@nickva
Contributor Author

nickva commented Apr 13, 2022

After running with OTP 22 for some time, it unfortunately seems to exhibit the same memory growth as OTP 23.

[screenshot: Screen Shot 2022-04-13 at 2 57 23 PM]

When discussing the issue with a co-worker, we were wondering if fragmentation and re-assembly of fragments could leak memory when some fragments never arrive, and whether erlang:send(..., [nosuspend]) could contribute to that.

Another idea we had, since we cannot "find" this memory in erlang:process_info(Pid, binary), is that perhaps ets table entries could be holding those binaries alive. But in that case, wouldn't the allocation instrumentation account for it under system binaries instead of tcp_inet in the instrument:allocations() results?

@nickva
Contributor Author

nickva commented Apr 13, 2022

@garazdawi I think we can reproduce it locally with https://gist.github.com/nickva/0805d9e5865438220ae52fbf2e1a102d which, I believe, is one of your own scripts that I found on a mailing list and altered a bit :-)

With that script the memory leak can be seen on OTP 22, 23 and 24, but not OTP 21. That might point to dist fragmentation?

OTP 23

       tcp_inet =>
           #{driver_tid => {3,0,0,0,0,0,0,0,0,0,0},
             driver_tsd => {3,0,0,0,0,0,0,0,0,0,0},
             drv_binary => {0,0,0,1,0,0,0,0,0,5423,0},
             drv_internal => {0,3,1,0,0,0,0,0,0,0,0},
             port => {0,0,1,0,0,0,0,0,0,0,0}}}}}
([email protected])67> erlang:memory(binary).
355819424
...
Sent:1198MB/s, Hist:{0,0,0,12,0,0,0,0,0,0,0} RHist:{0,0,0,0,0,0,0,0,0,5349,0}
Sent:1377MB/s, Hist:{1,0,0,12,0,0,0,0,0,0,0} RHist:{0,0,0,0,0,0,0,0,0,5205,0}
([email protected])49> dist_perf:go('[email protected]').
started
Sent:1259MB/s, Hist:{1,0,0,12,0,0,0,0,0,0,0} RHist:{0,0,0,0,0,0,0,0,0,5481,0}
([email protected])50> dist_perf:stop().

Running dist_perf:go('[email protected]'). (which kills and respawns everything) makes the leak happen much quicker, so I was running that every 5-10 seconds. After binary memory accumulated to 350MB (on n1, the destination node) I stopped the traffic and ran dist_perf:gc() there to show that the memory isn't deallocated by GC.

OTP 22 and 24 look about the same. Maybe on 24 it was a tiny bit slower.

I then waited 10-20 minutes and the memory was still allocated. Finally, to confirm it was tied to the dist connection, I explicitly disconnected from the node and the memory was immediately deallocated:

([email protected])70> net_kernel:disconnect('[email protected]').
true

([email protected])71> erlang:memory(binary).
4184

([email protected])72> instrument:allocations(#{histogram_width => 11}).
       tcp_inet =>
           #{driver_tid => {3,0,0,0,0,0,0,0,0,0,0},
             driver_tsd => {3,0,0,0,0,0,0,0,0,0,0},
             drv_binary => {0,0,0,1,0,0,0,0,0,0,0},
             drv_internal => {0,3,0,0,0,0,0,0,0,0,0}}}}}

OTP 21

       tcp_inet =>
           #{driver_tid => {3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
             driver_tsd => {3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
             drv_binary => {0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0},
             drv_internal => {0,3,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
             port => {0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}}}}}
([email protected])5> erlang:memory(binary).
1271456
...
Sent:1415MB/s, Hist:{0,0,0,12,0,0,0,0,0,0,0} RHist:{0,0,0,0,0,0,0,0,0,0,1}
Sent:1458MB/s, Hist:{0,0,0,12,0,0,0,0,0,0,0} RHist:{0,0,0,0,0,0,0,0,0,0,1}
([email protected])7> dist_perf:go('[email protected]').
Sent:1476MB/s, Hist:{0,0,0,12,0,0,0,0,0,0,0} RHist:{0,0,0,0,0,0,0,0,0,0,1}
started
Sent:1486MB/s, Hist:{0,0,0,12,0,0,0,0,0,0,0} RHist:{0,0,0,0,0,0,0,0,0,0,1}
([email protected])8> dist_perf:st
stats/1  stop/0
([email protected])8> dist_perf:stop().

@garazdawi
Contributor

Thanks for the great reproducer! I was able to capture the fault in an rr session and I think I have figured out what is going on... though I don't have any solution to fix it yet.

Each fragment sent has a sequence ID and a fragment ID. The sequence ID is basically the PID of the sending process, and it is used as the key when re-assembling fragments on the receiving side. When a process receives an exit signal while suspended during a fragmented send, the send operation is aborted and the remaining fragments are never sent. This causes the receiving side to keep the incomplete message in its buffer until the connection is dropped.
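
To make the failure mode concrete, a rough shell sketch of that scenario (node and process names are hypothetical; the real reproducer is the dist_perf gist linked above):

%% Send enough large (>64kB, hence fragmented) messages that the sender gets
%% suspended on a busy distribution buffer, then kill it mid-send.
Peer = '[email protected]',                               %% hypothetical peer node
Big = binary:copy(<<0>>, 10 * 1024 * 1024),        %% a 10 MB refc binary
Sender = spawn(fun() ->
    [{some_process, Peer} ! Big || _ <- lists:seq(1, 1000)]
end),
timer:sleep(100),
exit(Sender, kill).
%% If Sender was suspended in the middle of a fragmented send, the receiving
%% node is left holding the fragments it already got until the connection drops.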

I could have sworn I had handled this scenario when adding the fragmented send functionality, but apparently not...

As I see it we have a couple of options:

  1. Force all the fragments into the distribution buffer if a process receives an exit signal while doing a fragmented send.
  2. Allow the sending process to continue sending the fragments after it has received the exit signal. Processes already send fragmented exit/down messages while exiting, so this would just make it send another message while exiting.
  3. Introduce an "abort sequence" message that is sent if a sequence is aborted before all fragments have been sent.

I'm experimenting a bit to see which approach is better, though I'm currently leaning towards option 2.

garazdawi added a commit to garazdawi/otp that referenced this issue Apr 14, 2022
If a process was suspended doing a fragmented send and then received an exit signal, it was terminated before it could finish sending the message, leading to a memory leak on the receiving side.

This change fixes that so that the message is allowed to
finish being sent before the process exits.

Closes erlang#5876
@garazdawi
Contributor

Please try #5892 which is an implementation of suggestion 2.

@nickva
Contributor Author

nickva commented Apr 14, 2022

That's great news, thanks @garazdawi! Will give it a try

@nickva
Contributor Author

nickva commented Apr 14, 2022

@garazdawi we were curious why the release compat flag +R 21 didn't seem to work. We had tried that earlier on as a mitigation strategy to force the nodes to use the older OTP 21 dist protocol. Is there any configuration parameter we could try as a quick/temporary fix for already deployed production clusters before the full release rolls out?

nickva pushed a commit to cloudant/otp that referenced this issue Apr 14, 2022
If a process was suspended doing a fragmented send and then received an exit signal, it was terminated before it could finish sending the message, leading to a memory leak on the receiving side.

This change fixes that so that the message is allowed to
finish being sent before the process exits.

Closes erlang#5876
@garazdawi
Contributor

I don’t know, and I won’t have time to check until Tuesday at the earliest.

@nickva
Contributor Author

nickva commented Apr 14, 2022

@garazdawi thanks for the quick fix!

Confirmed with the reproducer locally, on maint-23 plus your patch, that the memory doesn't increase, and when dist_perf:stop() is called it gets released:

Sent:2098MB/s, Hist:{1,0,0,8,0,0,0,0,0,0,0} RHist:{0,0,0,0,0,0,0,0,0,78,0}
Sent:2101MB/s, Hist:{2,0,0,7,0,0,0,0,0,0,0} RHist:{0,0,0,0,0,0,0,0,0,37,0}
([email protected])11> dist_perf:go('[email protected]').
started
...
([email protected])20> dist_perf:go('[email protected]').
started
Sent:2088MB/s, Hist:{1,0,0,9,0,0,0,0,0,0,0} RHist:{0,0,0,0,0,0,0,0,0,80,0}
Sent:2088MB/s, Hist:{1,0,0,8,0,0,0,0,0,0,0} RHist:{0,0,0,0,0,0,0,0,0,32,0}
([email protected])21> dist_perf:stop().
([email protected])41> erlang:memory(binary).
104096560

After dist_perf:stop(). on n2

([email protected])42> erlang:memory(binary).
112176

@nickva
Contributor Author

nickva commented Apr 15, 2022

Deployed the update to production as a patch on top of the maint-23 branch and so far it looks good:

[graph: bin_leak_fixed]

@garazdawi
Contributor

Great! Then I just need to make sure that I did not break anything else. Handling of process state is quite complex and it is easy to break something while fixing something else...

Regarding the +R option, it is only used when the feature flags cannot be used to negotiate which distribution features you want to use, so in the case of fragments it has no effect. In fact, the last time the +R flag was used was when migrating from Erlang R7 to R9, so it may be time to remove it completely since it hasn't provided any functionality in 20-ish years...

@nickva
Contributor Author

nickva commented Apr 20, 2022

Thanks for the explanation about the +R flag. I think it would be nice to remove it, as you said, since it hasn't provided any functionality in years.

As for the fix, thanks for the heads-up. For now we haven't deployed it further, but the cluster where it was deployed seems stable.

@nickva
Contributor Author

nickva commented May 3, 2022

@garazdawi thank you for investigating and fixing the binary leak. I see it's already part of the new Erlang 23 patch release https://github.com/erlang/otp/releases/tag/OTP-23.3.4.14!

@garazdawi
Contributor

Happy to help! Thanks for providing such an easy-to-reproduce test case! Having a way to reproduce the problem makes things so much easier.
