serious memory issues since mirage-3.7 update #93
Comments
(some talk with @hannesm on irc later) the build is currently ocaml 4.08.1, with netchannel and mirage-net-xen both pinned to --dev. i added a call to Gc.compact () in front of the mem-pressure reporting code, built and restarted; let's see if that makes any difference.
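roughly, that workaround looks like this (a minimal sketch, not the actual firewall code; `report_free` is a hypothetical stand-in for the mem-pressure reporting hook):

```ocaml
(* Compact first so the report reflects memory the GC can actually give
   back: Gc.compact runs a full major collection and compacts the heap.
   report_free is a hypothetical reporting hook, not a real API. *)
let report_mem_pressure (report_free : int -> unit) =
  Gc.compact ();
  let s = Gc.stat () in
  report_free (s.Gc.heap_words - s.Gc.live_words)
```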
so, the "forced Gc.compact()" version has been running for 20h now, and it seems to make a difference:
the dips to 65 and 69 MB seem to have been at times when i was manually messing around with the downstream VM. will deploy it in a "many clients" role next.
By default, OCaml increases its heap size in 15% increments: https://github.com/ocaml/ocaml/blob/5ad64306d36755b600f2556805effb73627508c8/runtime/caml/config.h#L216. That's about 4 MB for your heap size (a ~4 MB increment at 15% implies a major heap of roughly 27 MB). The main thing that changed recently was the addition of the IP fragments cache (client_net.ml line 87 at commit 0ced0ee in qubes-mirage-firewall):
256 * 1024 entries per client does seem like a lot. I guess it's a few MB of extra memory per client? What happens if you reduce that number?
the 256 * 1024 is the capacity in bytes (see the weight function at https://github.com/mirage/mirage-tcpip/blob/v4.0.0/src/ipv4/fragments.ml#L55) of the fragmentation cache. this is a maximum of 256 kB per client, i.e. if you've got clients not doing fragmentation, the overhead will be rather small.
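to illustrate the capacity-by-weight semantics, a minimal sketch using the lru package (illustrative key/value types, not the actual tcpip modules):

```ocaml
module K = struct
  type t = int
  let equal = Int.equal
  let hash = Hashtbl.hash
end

(* weight is measured in bytes of payload, so the capacity below is a
   byte budget, not an entry count *)
module V = struct
  type t = string
  let weight = String.length
end

module Cache = Lru.M.Make (K) (V)

let cache = Cache.create (256 * 1024)        (* at most 256 kB of payload *)
let () =
  Cache.add 1 (String.make 1500 'x') cache;  (* consumes 1500 bytes of the budget *)
  Cache.trim cache                           (* evicts LRU entries once over capacity *)
```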
I'm not sure how 256 * 1024 will be "a few MB of extra memory per client", sorry. to me, it rather looks like previously the io-page allocator https://github.com/mirage/io-page/blob/v2.3.0/lib/io_page.ml#L36-L45 got called from more call sites (and that allocator includes a hack to call Gc.compact if allocation raises Out_of_memory). this is coherent with #93 (comment), where adding a call to Gc.compact made a difference.
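the pattern in question, paraphrased (a sketch of the io_page.ml hack linked above, not a verbatim copy):

```ocaml
(* If a large aligned allocation fails, compact the heap once and
   retry: compaction can free and coalesce enough memory for the
   request to succeed on the second attempt. *)
let alloc_with_retry alloc n =
  try alloc n
  with Out_of_memory ->
    Gc.compact ();
    alloc n
```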
Hmm. Here's a test program (run in an OCaml toplevel):

```ocaml
#require "tcpip.ipv4";;
#require "fmt";;

(* Print the live heap size in bytes (live words * 8, assuming 64-bit). *)
let show_heap () =
  let stat = Gc.stat () in
  Fmt.pr "Heap size = %a@." Fmt.byte_size (stat.Gc.live_words * 8)

let n = 1024

let () =
  show_heap ();
  (* allocate n fragment caches with the same capacity the firewall uses *)
  let x = List.init n (fun _ -> Fragments.Cache.create (256*1024)) in
  Gc.full_major ();
  show_heap ();
  ignore (Sys.opaque_identity x)  (* keep the caches live across the GC *)
```

It creates 1024 caches. It prints:
That's about 2 MB per cache by my calculation. Which is roughly what I'd expect, since I seem to recall there are some problems with how OCaml calculates when it needs to do a GC when lots of external mallocs are being used. This is why the firewall sometimes does a Gc manually, which shouldn't really be necessary. Large hashtables cause problems because the firewall can run out of memory suddenly (either on creation or on resize). We used to have this problem with mirage-nat before it switched to a different representation, IIRC.
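For illustration, here is roughly where the 2 MB comes from (a sketch, assuming 64-bit words; not firewall code): `Hashtbl.create n` allocates the bucket array up front, so 256 * 1024 buckets × 8 bytes ≈ 2 MB per empty table, and a later resize doubles that array in one step.

```ocaml
(* Measure the up-front cost of an empty Hashtbl with 256 * 1024
   buckets. The type annotation avoids a weak type variable. *)
let () =
  Gc.full_major ();
  let before = (Gc.stat ()).Gc.live_words in
  let t : (int, unit) Hashtbl.t = Hashtbl.create (256 * 1024) in
  Gc.full_major ();
  let after = (Gc.stat ()).Gc.live_words in
  Printf.printf "empty table: ~%d words\n" (after - before);
  ignore (Sys.opaque_identity t)
```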
@talex5 thanks for your example code. that's quite surprising behaviour. given that a hashtbl of that size is allocated up front, I can only repeat from #89:
(or, alternatively, adjust/PR Lru.M to allocate a small hashtbl on create and grow it on demand)
or, given that rather unfriendly Hashtbl memory behaviour (on grow/resize or create), revert the Fragments implementation to use a Lru.F (this is my current favourite)..
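a sketch of that functional alternative (illustrative key/value types, assuming the lru package's Lru.F API): Lru.F.empty only records the weight budget, so memory grows with actual entries instead of being pre-allocated at create time.

```ocaml
module K = struct
  type t = int
  let compare = Int.compare
end

module V = struct
  type t = string
  let weight = String.length
end

module F = Lru.F.Make (K) (V)

(* same 256 kB weight budget as before, but no big upfront allocation *)
let cache = F.empty (256 * 1024)
let cache = F.add 1 (String.make 1500 'x') cache |> F.trim
```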
Yeah, getting rid of the hash table sounds like the best solution to me!
CHANGES:
* Revert "Ipv4.Fragments use a Lru.M.t instead of Lru.F.t" (mirage/mirage-tcpip#423 by @hannesm). A Lru.M.t allocates a Hashtbl.t of size = capacity (= 256 * 1024 in our case); this leads to excessive ~2 MB memory consumption for each fragment cache, reported by @xaki23 in mirage/qubes-mirage-firewall#93.
* Use SOCK_RAW for an ICMP socket in the unix sockets API (previously used SOCK_DGRAM, which did not work). Reported by @justinc1 in mirage/mirage-tcpip#358, fixed in mirage/mirage-tcpip#424 by @hannesm.
* TCP is now compatible with lwt >= 5.0.0 (where Lwt.async requires a function of type unit -> unit Lwt.t) (mirage/mirage-tcpip#370, mirage/mirage-tcpip#425 by @cfcs and @hannesm; issue mirage/mirage-tcpip#392 by @emillon).
* Add a dependency on dune-configurator to support dune 2.0.0 (mirage/mirage-tcpip#421 by @avsm).
built with the two pins above and the matching q-m-fw branch. so +1 from me for the whole set.
i am running into some serious memory problems since the update.
afaict some part of the "on new downstream vm" path tries to allocate 4 MB of unfragmented/aligned memory and fails.
before the update, a mirage-fw with 40 MB of ram was working "just fine" even with a lot of downstream VMs.
after the update, even a m-fw with a single downstream VM starts failing quickly (after 1-3 vm (re)starts).
on new-vm starts, the failure in the m-fw log looks like this:
there seems to be some "leakage" too: the 4 MB doesn't get cleaned up every time the downstream vm shuts down, but sometimes it gets freed up again after a while.
usecase "single downstream vm that gets restarted every hour", with the mem cranked up from 40 to 100 MB to make it more visible:
note:
in another usecase (more client vms, fewer restarts) i set memory to 200 MB and it failed after about a day, while still reporting more than 65 MB free.