Binary memory leak in drv_binary associated with the dist port #5876

Closed
nickva opened this issue Apr 11, 2022 · 20 comments · Fixed by #5892
@nickva
Contributor

nickva commented Apr 11, 2022

Describe the bug

After upgrading from OTP 20 to OTP 23 we noticed increasing binary memory usage associated with the tcp_inet drv_binary allocation type.

It is noticeable in the instrument:allocations() output:

> instrument:allocations()
       ...
       tcp_inet =>
           #{driver_tid => {78,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
             driver_tsd => {75,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
             drv_binary =>
                 {0,4831,944,64,18,192,140,8,8,15830,0,0,0,0,0,0,0,0},
             drv_internal => {62,80,31,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
             port => {0,0,31,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}}}}}

> erlang:memory(binary).
1110936656

Another server:

> instrument:allocations()
       tcp_inet =>
           #{driver_tid => {79,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
             driver_tsd => {76,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
             drv_binary =>
                 {7,15020,3852,493,19,114,208,82,32,66378,0,0,0,0,0,0,0,0},
             drv_internal => {62,82,22,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
             port => {0,0,22,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}}}}}
        ...
> erlang:memory(binary). 
4555266976

A well-behaved node in the same cluster:

> instrument:allocations()
       tcp_inet =>
           #{driver_tid => {79,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
             driver_tsd => {76,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
             drv_binary =>
                 {14,10682,514,65,17,72,212,65,46,2037,0,0,0,0,0,0,0,0},
             drv_internal => {58,87,26,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
             port => {0,0,26,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}}}}}

> erlang:memory(binary).
313718152
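
As an aside, the same view can be narrowed to just binary_alloc with the instrument:allocations/1 options used later in this report, for example:

> f(Hist), {ok, {_, _, #{tcp_inet:=#{drv_binary:=Hist}}}} = instrument:allocations(#{allocator_types=>[binary_alloc], histogram_width=>18}), Hist.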

[graph: bin_mem_leak_90_days]

(The increase on the graph corresponds with the upgrade to OTP 23; the earlier flat section is OTP 20.)

I suspected the dist protocol, and possibly the 64kb fragmentation logic interacting with refc binaries, because when I tried to dump a core with gdb the process was suspended long enough to miss the dist tick timeout and the node disconnected. When that happened the binary memory was released immediately.

[graph: binary_dealloc_on_dist_disconnection]

(The core was dumped around 21:43 to 21:45)
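
For reference, the same connection-tied release can be confirmed deliberately from a shell, which is essentially what is done further down in this thread (the peer node name here is hypothetical):

BinBefore = erlang:memory(binary),
net_kernel:disconnect('[email protected]'),    %% hypothetical peer; drops the dist connection
timer:sleep(1000),                          %% give the driver a moment to free its buffers
BinAfter = erlang:memory(binary),
{BinBefore, BinAfter}.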

To Reproduce

Out of the 6 or so clusters that were upgraded (each with roughly 3 to 12 nodes), we only observed this on one cluster, and on that cluster only on 3 out of 6 nodes.

I couldn't reliably reproduce the issue on my laptop. I had tried simply sending large binaries between two local nodes. Sometimes I see spikes in the drv_binary 64kb histogram slot, but then it clears back out to 0.

Expected behavior

The binary memory should not keep increasing; it should either level off at a reasonably low level and stay there, or be deallocated.

Affected versions

Erlang 23 (erts-11.2.2.10)

$ remsh
Erlang/OTP 23 [erts-11.2.2.10] [source] [64-bit] [smp:80:80] [ds:80:80:10] [async-threads:1]
$ uname -a
Linux db8 4.9.0-17-amd64 #1 SMP Debian 4.9.290-1 (2021-12-12) x86_64 GNU/Linux

Additional context

vm.args used:

-shutdown_time 5000
-kernel error_logger silent
-sasl sasl_error_logger false
+K true
+A32
+zdbbl32768
-ssl session_lifetime 300
+MBacul 0
+MBas aobf

I had suspected it was a simple refc binary memory leak and ran recon:bin_leak(5), which runs a GC for every process, but that didn't affect the binary memory.
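
For reference, a minimal shell sketch of the kind of sweep recon:bin_leak/1 relies on, forcing a GC in every process and comparing binary memory before and after:

BinBefore = erlang:memory(binary),
[erlang:garbage_collect(P) || P <- erlang:processes()],   %% force a GC in every process
BinAfter = erlang:memory(binary),
BinBefore - BinAfter.   %% how much binary memory the sweep released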

Another suspicion was that it may be memory fragmentation, so I had tried running with a smaller limit for binary MBC blocks:

+MBlmbcs 512

and removing +MBacul 0 and +MBas aobf, but that didn't seem to make a difference.

A few more details:

  • Most of the messages are sent between nodes with an erlang:send(Dest, Msg, [noconnect, nosuspend]) call. In case a message cannot be sent (the result =/= ok), there is some buffering with a single head-of-line sender attempting to re-connect: https://github.com/apache/couchdb/blob/3.x/src/rexi/src/rexi_buffer.erl#L73-L84 (see the sketch after this list). Perhaps after Erlang 21 made erlang:send/2,3 more asynchronous, that buffering strategy is now counter-productive and may be involved in this binary leak issue.

  • A lot of the traffic may be binary data being sent through a coordinator node on the cluster to be written to other nodes. So a common path might be http socket <-> node1 <-> [dist] <-> node2 <-> disk, where node1 is the coordinator that writes to or reads from multiple cluster nodes and then responds to the http request.

  • I had tried to determine the exact size in bytes of the 64kb block by varying the start size of the histogram, and the size seems to be 65608:

> f(Hist), {ok, {_, _, #{tcp_inet:=#{drv_binary:=Hist}}}} = instrument:allocations(#{allocator_types=>[binary_alloc], histogram_start=>65609, histogram_width=>4}), Hist.
{78646,2774,0,0}

> f(Hist), {ok, {_, _, #{tcp_inet:=#{drv_binary:=Hist}}}} = instrument:allocations(#{allocator_types=>[binary_alloc], histogram_start=>65608, histogram_width=>4}), Hist.
{58078,23313,0,0}
  • Suspecting it may have something to do with dist protocol fragmentation, and wanting to mitigate the issue somehow, I had tried the release compatibility flag +R 21 on one of the nodes, but that didn't seem to have any effect.

  • I have a gdb core dump and have been trying to inspect it, but so far have not gotten very far:

(gdb) etp-alloc-instances 10

(gdb) print *((Allctr_t *) 0x7f37742df980)
$2473 = {dd = {q = {tail = {data ...
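
As referenced in the first bullet above, a rough sketch of that send-and-buffer pattern (illustrative names only, not the actual rexi_buffer code):

%% Try a non-blocking, non-connecting send first. If the dist buffer is busy
%% (nosuspend) or the connection is down (noconnect), hand the message to a
%% single buffering process that reconnects, retries, and drops messages once
%% its buffer is full.
try_send(Dest, Msg) ->
    case erlang:send(Dest, Msg, [noconnect, nosuspend]) of
        ok -> ok;
        _NotSent -> buffer_server ! {buffer, Dest, Msg}, buffered
    end.
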
nickva added the "bug" label Apr 11, 2022
@garazdawi
Contributor

Erlang/OTP 23 also introduced erlang:term_to_iovec/1 and started using it in the distribution, so that could be another source of this leak.

I don't suppose it is possible for you to see if the issue can be recreated using the latest Erlang/OTP 22?

Normally, when trying to find leaks of refc binaries, the core file does not help, as all it will show is a bunch of binaries that have their refc at 1 without any indication of which code failed to do the decrement it needed to...

rickard-green added the "team:VM" label Apr 11, 2022
@garazdawi
Contributor

I had suspected it was a simple refc binary memory leak and ran recon:bin_leak(5), which runs a GC for every process, but that didn't affect the binary memory.

Have you run recon:proc_count(binary_memory, 10). or something similar?

@nickva
Contributor Author

nickva commented Apr 11, 2022

Thanks for taking a look @garazdawi

I don't suppose it is possible for you to see if the issue can be recreated using the latest Erlang/OTP 22?

Thanks for the idea. I'll try running with OTP 22.

Have you run recon:proc_count(binary_memory, 10). or something similar?

We ran that across the cluster a few times to see what we could find but it didn't seem to reveal large differences between the nodes which leaked memory and the ones which didn't.

I tried again a few times:

> erlang:memory(binary).
4644564408

> recon:proc_count(binary_memory, 4).
[{<0.17633.3366>,3433878391096,
  [{current_function,{gen_server,loop,7}},
   {initial_call,{proc_lib,init_p,5}}]},
 {<0.7305.3556>,1511651870539,
  [{current_function,{gen_server,loop,7}},
   {initial_call,{proc_lib,init_p,5}}]},
 {<0.6597.5061>,110354480,
  [{current_function,{gen,do_call,4}},
   {initial_call,{erlang,apply,2}}]},
 {<0.26249.5057>,15542734,
  [{current_function,{gen_server,loop,7}},
   {initial_call,{proc_lib,init_p,5}}]}]

And one from a node which doesn't leak binary memory:

> erlang:memory(binary).
310430664

> recon:proc_count(binary_memory, 4).
[{<0.23201.3229>,7629199574678,
  [{current_function,{gen_server,loop,7}},
   {initial_call,{proc_lib,init_p,5}}]},
 {<0.12872.3198>,44829366,
  [{current_function,{gen_server,loop,7}},
   {initial_call,{proc_lib,init_p,5}}]},
 {<0.31859.0>,3827967,
  [couch_replicator_scheduler,
   {current_function,{gen_server,loop,7}},
   {initial_call,{proc_lib,init_p,5}}]},
 {<0.6720.4911>,1922939,
  [{current_function,{gen_server,loop,7}},
   {initial_call,{proc_lib,init_p,5}}]}]

Then I thought of naively summing the binary sizes referenced in erlang:process_info(Pid, binary) (it's silly in general, I understand, as those are full binary sizes, and with lots of references this would produce completely unreliable results):

f(SumAllBins), SumAllBins = fun() ->
   %% Walk every process and add up the sizes of the refc binaries its heap
   %% references (a binary shared by several processes is counted once per process)
   lists:foldl(fun(Pid, SzAcc) ->
       Bins = case erlang:process_info(Pid, binary) of
           undefined -> [];
           {binary, BList} -> BList
       end,
       lists:foldl(fun({_Addr, Size, _Refs}, Acc) ->
           Acc + Size
       end, SzAcc, Bins)
   end, 0, processes())
end.

And I noticed that, even with the naive summing, the total from erlang:process_info(Pid, binary) was far less than the memory reported by erlang:memory(binary):

> SumAllBins().
30065317

> erlang:memory(binary).
3276123488
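
For what it's worth, a variant of the same shell snippet that deduplicates by binary address, so each refc binary is only counted once (still only covers binaries referenced from process heaps; names are just for illustration):

f(SumUniqueBins), SumUniqueBins = fun() ->
    Sizes = lists:foldl(fun(Pid, Acc) ->
        case erlang:process_info(Pid, binary) of
            undefined -> Acc;
            {binary, BList} ->
                %% key by address so a binary referenced by many processes is counted once
                lists:foldl(fun({Addr, Size, _Refs}, A) -> maps:put(Addr, Size, A) end, Acc, BList)
        end
    end, #{}, processes()),
    lists:sum(maps:values(Sizes))
end.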

@garazdawi
Contributor

I suspected the dist protocol, and possibly the 64kb fragmentation logic interacting with refc binaries, because when I tried to dump a core with gdb the process was suspended long enough to miss the dist tick timeout and the node disconnected. When that happened the binary memory was released immediately.

I was thinking a bit more about what you wrote here. This would suggest that the refc binaries are kept alive by something that is destroyed when the connection is dropped. I've written a small extension to etp that can dump information about all current distribution entries in a core dump. What do you get if you run that on your dump?

@nickva
Contributor Author

nickva commented Apr 12, 2022

Thanks for the script @garazdawi. Here is the output from the node which had the largest binary memory leak at the time:

https://gist.github.com/nickva/b428fbb52f93c8248c7be67361cdc03c

The node where the core was dumped is db11. db8-db12 are the peer nodes; most of the dist data goes through them. clouseau is a search component written in Java/Scala. The disconnected nodes with random 3-4 letter/number names are periodic erl_call invocations to gather stats.

It is odd to see db3 among the disconnected nodes; that used to be an older (since replaced) node, so something must have tried to connect to it. Also, db11 has an entry for itself, but that may be expected?

EDIT: We suspected the search component (clouseau) initially and restarted it a few times, but that didn't seem to clear the binary memory.

@garazdawi
Contributor

Hmm, that didn't tell us much...

Also, db11 has an entry for itself, but that may be expected?

yes, that is expected.

@nickva
Contributor Author

nickva commented Apr 12, 2022

We managed to start the db11 node with OTP 22 and are currently monitoring it.

Looking at other metrics, we have also noticed that whenever the binary memory bumps up it corresponds with a spike in traffic, and those spikes also coincide with an increase in our rexi.buffered metric. That's a metric we increment when erlang:send(Dest, Msg, [noconnect, nosuspend]) doesn't return ok; in that case we buffer the message, spawn a single process to try to connect and send the data, and drop messages when the buffer fills up.

For example, in this graph the green line is the binary memory on a "good" node (db12), where it stays constant over a few weeks. The blue one is the "bad" node (db11), which keeps increasing.

[graph: db11_db12_vs_buffered_rexi]

The second set of metrics (the right Y axis) is the rate of buffered messages (erlang:send(Dest, Msg, [noconnect, nosuspend]) =/= ok). Those are > 0 for some nodes (including db12) but not for db11. So it seems other nodes are pushing data to db11 and overwhelming the dist port/buffer on db11.
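
One way to confirm that kind of dist-buffer pressure from a shell is the built-in system monitor (a small sketch; run it from a process that stays alive, such as the shell):

%% Ask the VM to send a monitor message whenever a process is suspended
%% because the distribution buffer of a connection is full.
erlang:system_monitor(self(), [busy_dist_port]).
%% ... let some traffic through, then inspect the shell mailbox:
flush().
%% Messages look like {monitor, SuspendedPid, busy_dist_port, Port}.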

Also, I had tried a higher zdbbl value (128MB vs our default 32MB), and that's the db11 run from Mon 08:00 to Mon 22:00. That seems to have smoothed out the rise a bit.
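
For reference, +zdbbl takes its value in kilobytes, so in a vm.args like the one above a 128MB distribution buffer busy limit would be:

+zdbbl 131072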

@nickva
Contributor Author

nickva commented Apr 13, 2022

After running with OTP 22 for some time, it unfortunately seems to exhibit the same memory growth as OTP 23.

[screenshot: Screen Shot 2022-04-13 at 2 57 23 PM]

When discussing the issue with a co-worker, we were wondering if fragmentation and re-assembly of fragments could leak memory when some fragments never arrive, and whether erlang:send(..., [nosuspend]) could contribute to that.

Another idea we had, since we cannot "find" this memory in erlang:process_info(Pid, binary), is that perhaps ets table entries could be holding those binaries alive. But in that case, wouldn't the allocation instrumentation account for it under system binaries instead of tcp_inet in the instrument:allocations() results?

@nickva
Contributor Author

nickva commented Apr 13, 2022

@garazdawi I think we can reproduce it locally with https://gist.github.com/nickva/0805d9e5865438220ae52fbf2e1a102d which, I believe, is one of your own scripts that I found on a mailing list and altered a bit :-)

With that script the memory leak can be seen on OTP 22, 23 and 24, but not OTP 21. That might point to dist fragmentation?

OTP 23

       tcp_inet =>
           #{driver_tid => {3,0,0,0,0,0,0,0,0,0,0},
             driver_tsd => {3,0,0,0,0,0,0,0,0,0,0},
             drv_binary => {0,0,0,1,0,0,0,0,0,5423,0},
             drv_internal => {0,3,1,0,0,0,0,0,0,0,0},
             port => {0,0,1,0,0,0,0,0,0,0,0}}}}}
([email protected])67> erlang:memory(binary).
355819424
...
Sent:1198MB/s, Hist:{0,0,0,12,0,0,0,0,0,0,0} RHist:{0,0,0,0,0,0,0,0,0,5349,0}
Sent:1377MB/s, Hist:{1,0,0,12,0,0,0,0,0,0,0} RHist:{0,0,0,0,0,0,0,0,0,5205,0}
([email protected])49> dist_perf:go('[email protected]').
started
Sent:1259MB/s, Hist:{1,0,0,12,0,0,0,0,0,0,0} RHist:{0,0,0,0,0,0,0,0,0,5481,0}
([email protected])50> dist_perf:stop().

Running dist_perf:go('[email protected]'). (which kills and respawns everything) makes the leak happen much quicker, so I was running that every 5-10 seconds. After binary memory accumulated to 350MB (on n1, the destination node) I stopped the traffic and ran dist_perf:gc() there to show that the memory isn't deallocated by GC.

OTP 22 and 24 look about the same. Maybe on 24 it was a tiny bit slower.

I then waited 10-20 minutes and the memory was still allocated. Finally, to confirm it was tied to the dist connection, I explicitly disconnected from the node and the memory was immediately deallocated:

([email protected])70> net_kernel:disconnect('[email protected]').
true

([email protected])71> erlang:memory(binary).
4184

([email protected])72> instrument:allocations(#{histogram_width => 11}).
       tcp_inet =>
           #{driver_tid => {3,0,0,0,0,0,0,0,0,0,0},
             driver_tsd => {3,0,0,0,0,0,0,0,0,0,0},
             drv_binary => {0,0,0,1,0,0,0,0,0,0,0},
             drv_internal => {0,3,0,0,0,0,0,0,0,0,0}}}}}

OTP 21

       tcp_inet =>
           #{driver_tid => {3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
             driver_tsd => {3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
             drv_binary => {0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0},
             drv_internal => {0,3,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
             port => {0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}}}}}
([email protected])5> erlang:memory(binary).
1271456
...
Sent:1415MB/s, Hist:{0,0,0,12,0,0,0,0,0,0,0} RHist:{0,0,0,0,0,0,0,0,0,0,1}
Sent:1458MB/s, Hist:{0,0,0,12,0,0,0,0,0,0,0} RHist:{0,0,0,0,0,0,0,0,0,0,1}
([email protected])7> dist_perf:go('[email protected]').
Sent:1476MB/s, Hist:{0,0,0,12,0,0,0,0,0,0,0} RHist:{0,0,0,0,0,0,0,0,0,0,1}
started
Sent:1486MB/s, Hist:{0,0,0,12,0,0,0,0,0,0,0} RHist:{0,0,0,0,0,0,0,0,0,0,1}
([email protected])8> dist_perf:st
stats/1  stop/0
([email protected])8> dist_perf:stop().

@garazdawi
Contributor

Thanks for the great reproducer! I was able to capture the fault in an rr session and I think I have figured out what is going on... though I don't have any solution to fix it yet.

Each fragment sent has a sequence ID and a fragment ID. The sequence ID is basically the PID of the sending process, and it is used as the key when re-assembling fragments on the receiving side. When a process receives an exit signal while suspended during a fragmented send, the send operation is aborted and the remaining fragments are never sent. This causes the receiving side to keep the incomplete message in its buffer until the connection is dropped.
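
To make the failure mode concrete, a rough shell sketch of that scenario (node and process names are hypothetical; the real reproducer is the dist_perf gist linked above):

%% Send enough large (>64kB, hence fragmented) messages that the sender gets
%% suspended on a busy distribution buffer, then kill it mid-send.
Peer = '[email protected]',                               %% hypothetical peer node
Big = binary:copy(<<0>>, 10 * 1024 * 1024),        %% a 10 MB refc binary
Sender = spawn(fun() ->
    [{some_process, Peer} ! Big || _ <- lists:seq(1, 1000)]
end),
timer:sleep(100),
exit(Sender, kill).
%% If Sender was suspended in the middle of a fragmented send, the receiving
%% node is left holding the fragments it already got until the connection drops.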

I could have sworn I had handled this scenario when adding the fragmented send functionality, but apparently not...

As I see it we have a couple of options:

  1. Force all the fragments into the distribution buffer if a process receives an exit signal while doing a fragmented send.
  2. Allow the sending process to continue sending the fragments after it has received the exit signal. Processes already send fragmented exit/down messages while exiting, so this would just make it send another message while exiting.
  3. Introduce an "abort sequence" message that is sent if a sequence is aborted before all fragments have been sent.

I'm experimenting a bit to see which approach is better, though I'm currently leaning towards option 2.

garazdawi added a commit to garazdawi/otp that referenced this issue Apr 14, 2022
If a process was suspended doing a fragmented send and then received an exit signal, it was terminated before it could finish sending the message, leading to a memory leak on the receiving side.

This change fixes that so that the message is allowed to
finish being sent before the process exits.

Closes erlang#5876
@garazdawi
Contributor

Please try #5892 which is an implementation of suggestion 2.

@nickva
Contributor Author

nickva commented Apr 14, 2022

That's great news, thanks @garazdawi! Will give it a try

@nickva
Contributor Author

nickva commented Apr 14, 2022

@garazdawi we were curious why the release compat flag +R 21 didn't seem to work. We had tried that earlier on as a mitigation strategy to force the nodes to use the older OTP 21 dist protocol. Is there any configuration parameter we could try as a quick/temporary fix for already deployed production clusters before the full release rolls out?

nickva pushed a commit to cloudant/otp that referenced this issue Apr 14, 2022
If a process was suspended doing a fragmented send and then received an exit signal, it was terminated before it could finish sending the message, leading to a memory leak on the receiving side.

This change fixes that so that the message is allowed to
finish being sent before the process exits.

Closes erlang#5876
@garazdawi
Contributor

I don’t know, and I won’t have time to check until Tuesday at the earliest.

@nickva
Contributor Author

nickva commented Apr 14, 2022

@garazdawi thanks for the quick fix!

Confirmed with the reproducer locally, on maint-23 plus your patch, that the memory doesn't increase, and when dist_perf:stop() is called it gets released:

Sent:2098MB/s, Hist:{1,0,0,8,0,0,0,0,0,0,0} RHist:{0,0,0,0,0,0,0,0,0,78,0}
Sent:2101MB/s, Hist:{2,0,0,7,0,0,0,0,0,0,0} RHist:{0,0,0,0,0,0,0,0,0,37,0}
([email protected])11> dist_perf:go('[email protected]').
started
...
([email protected])20> dist_perf:go('[email protected]').
started
Sent:2088MB/s, Hist:{1,0,0,9,0,0,0,0,0,0,0} RHist:{0,0,0,0,0,0,0,0,0,80,0}
Sent:2088MB/s, Hist:{1,0,0,8,0,0,0,0,0,0,0} RHist:{0,0,0,0,0,0,0,0,0,32,0}
([email protected])21> dist_perf:stop().
([email protected])41> erlang:memory(binary).
104096560

After dist_perf:stop(). on n2

([email protected])42> erlang:memory(binary).
112176

@nickva
Contributor Author

nickva commented Apr 15, 2022

Deployed the update to production as a patch on top of the maint-23 branch and so far it looks good:

[graph: bin_leak_fixed]

@garazdawi
Contributor

Great! Then I just need to make sure that I did not break anything else. Handling of process state is quite complex and it is easy to break something while fixing something else...

Regarding the +R option, it is only used when the feature flags cannot be used to negotiate which distribution features you want to use, so in the case of fragments it has no effect. In fact, the last time the +R flag was used was when migrating from Erlang R7 to R9, so it may be time to remove it completely since it hasn't provided any functionality in 20-ish years...

@nickva
Contributor Author

nickva commented Apr 20, 2022

Thanks for the explanation about the +R flag. I think it would be nice to remove it, as you said, since it hasn't provided any functionality in years.

As for the fix, thanks for the heads-up. For now we haven't deployed it further, but the cluster where it was deployed seems stable.

@nickva
Contributor Author

nickva commented May 3, 2022

@garazdawi thank you for investigating and fixing the binary leak. I see it's already part of the new Erlang 23 patch release https://github.com/erlang/otp/releases/tag/OTP-23.3.4.14!

@garazdawi
Contributor

Happy to help! Thanks for providing such an easy-to-reproduce test case! Having a way to reproduce the problem makes things so much easier.
