
Memory measurements - weird behaviour #205

Closed
PragTob opened this issue Apr 7, 2018 · 12 comments

@PragTob
Member

PragTob commented Apr 7, 2018

(builds on top of #204 for now)

So I took the fast functions example and wanted to see what the general overhead might be like:

lil_range = 1..2
range = 1..10
list_10 = Enum.to_list(range)
range_50 = 1..50
Benchee.run(%{
  "Integer addition"          => fn -> 1 + 1 end,
  "String concatention"       => fn -> "1" <> "1" end,
  "adding a head to an array" => fn -> [1 | [1]] end,
  "++ array concat"           => fn -> [1] ++ [1] end,
  "noop"                      => fn -> 0 end,
  "noop nil"                  => fn -> nil end,
  "Enum.map(empty)"           => fn -> Enum.map([], fn(i) -> i end) end,
  "Enum.map(2)"               => fn -> Enum.map(lil_range, fn(i) -> i end) end,
  "Enum.map(10)"              => fn -> Enum.map(range, fn(i) -> i end) end,
  "Enum.map(10 list)"         => fn -> Enum.map(list_10, fn(i) -> i end) end,
  "just return 10 list"       => fn -> list_10 end,
  "Enum.map(50)"              => fn -> Enum.map(range_50, fn(i) -> i end) end
}, warmup: 0, time: 0.00001, memory_time: 1)

Much to my own surprise the results look like this:

Name                         Memory usage
Integer addition                    616 B
adding a head to an array           616 B - 1.00x memory usage
noop nil                            616 B - 1.00x memory usage
just return 10 list                 616 B - 1.00x memory usage
noop                                616 B - 1.00x memory usage
String concatention                 616 B - 1.00x memory usage
Enum.map(empty)                     664 B - 1.08x memory usage
Enum.map(2)                         784 B - 1.27x memory usage
Enum.map(10 list)                   208 B - 0.34x memory usage
Enum.map(10)                        424 B - 0.69x memory usage
Enum.map(50)                        568 B - 0.92x memory usage
++ array concat                     616 B - 1.00x memory usage

**All measurements for memory usage were the same**

WHAT???

  • At first it seems like the overhead from just the function is 616 B, but then Enum.map(10) sticks out and consumes less memory, although it's the only one that should actually generate any data. (I assume the others could/should be compiler optimizations, as they are clearly static.)
  • Enum.map over an empty list consumes more than over a range with 10 or 50 elements, and also more than over a list of 10 elements
  • flat out returning the list of 10 elements has the usual 616 B memory consumption - mapping over it goes down to 208 B????

I don't know. Maybe we're failing to collect some memory somewhere, or maybe there are weird optimizations I'm unaware of that show up in these very micro examples.
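One way to separate the measurement overhead from the data itself is to ask the runtime for the size of the terms directly. A minimal sketch using `:erts_debug.size/1` (an undocumented but long-standing ERTS helper that returns a term's heap size in machine words, independent of any tracing):

```elixir
# :erts_debug.size/1 returns the size of a term in machine words.
# Small integers are immediates (no extra words); each cons cell is 2 words.
word_size = :erlang.system_info(:wordsize)

list_size = :erts_debug.size([1 | [1]])         # 2 cons cells => 4 words
IO.inspect(list_size * word_size)               # 32 B on a 64-bit system

ten = Enum.to_list(1..10)
IO.inspect(:erts_debug.size(ten) * word_size)   # 10 cons cells => 160 B on 64-bit
```

This gives the expected per-term sizes, which are nowhere near 616 B - supporting the idea that the flat 616 B baseline is measurement overhead rather than the benchmarked data.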

We don't need to solve this (if it is solvable) before we release 0.13 but I at least want to have a look at it.

Pinging @michalmuskala for insight and BEAM knowledge, and @devonestes because 🐰 (+ initial memory implementation research etc. etc.) :D

@PragTob PragTob added this to the 0.13 milestone Apr 7, 2018
@michalmuskala
Contributor

I haven't had time to look at this yet, I'm sorry. But something that might be affecting the result is the size of the function objects themselves. I cannot explain how the Enum.map call is the smallest.

@PragTob
Member Author

PragTob commented Apr 9, 2018

@michalmuskala 👋 Thanks, and no worries or hurry. It's awesome and amazing any time you manage to make time and help us by looking at stuff, and if you don't get to it that's also cool. You already do a ton of things, and in the end we're all human and do this in our free time. We do what we can and that's already 🌟 🌟 🌟

@PragTob PragTob modified the milestones: 0.13, 1.0 Apr 11, 2018
@PragTob
Member Author

PragTob commented Apr 11, 2018

Changed milestone; we didn't get to it in time, and from the basics it seems like there might be some optimizations going on underneath. We should still look into it, but I also wanna get memory measurements out there so people can test them and see if they get any errors or problems.

It seems so far that this weirdness only happens for small/static data. It's unlikely to affect real world examples and might also be correct.

Also we're still 0.x and I shouldn't cling to my perfectionism as much 😎

@michalmuskala
Contributor

I tried looking more into this and I have no idea what is happening.

In general - a lot of the overhead we're seeing there is the measurement itself: the calls to Process.info and the messages from the tracing server consume memory themselves. That does not explain why some calls which should consume more memory actually consume less.
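That measurement footprint can be probed directly: take two `:garbage_collection_info` snapshots back to back with nothing in between, and whatever heap growth shows up is the cost of the bookkeeping itself. A minimal sketch (not part of Benchee; the module name is made up):

```elixir
defmodule MeasurementOverhead do
  @moduledoc false

  # Two back-to-back snapshots with no work in between: any heap growth
  # is allocated by the Process.info call and the keyword list it returns.
  def probe do
    word_size = :erlang.system_info(:wordsize)
    {:garbage_collection_info, first} = Process.info(self(), :garbage_collection_info)
    {:garbage_collection_info, second} = Process.info(self(), :garbage_collection_info)

    total = fn info ->
      Keyword.fetch!(info, :heap_size) + Keyword.fetch!(info, :old_heap_size)
    end

    (total.(second) - total.(first)) * word_size
  end
end

IO.inspect(MeasurementOverhead.probe(), label: "snapshot overhead in bytes")
```

Running this in a fresh process gives the cleanest numbers, since a GC between the two snapshots would skew the diff.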

@devonestes
Collaborator

I have a feeling maybe this was resolved with #216? Maybe you should look at this one more time now that that's hit master.

@PragTob
Member Author

PragTob commented May 4, 2018

Sadly not; for reasons unknown, the 10 case, for instance, is still weird with less memory etc.:

Name                         Memory usage
++ array concat                     616 B
String concatention                 616 B - 1.00x memory usage
adding a head to an array           616 B - 1.00x memory usage
noop                                616 B - 1.00x memory usage
noop nil                            616 B - 1.00x memory usage
just return 10 list                 616 B - 1.00x memory usage
Integer addition                    616 B - 1.00x memory usage
Enum.map(empty)                     664 B - 1.08x memory usage
Enum.map(10 list)                   208 B - 0.34x memory usage
Enum.map(2)                         784 B - 1.27x memory usage
Enum.map(10)                        424 B - 0.69x memory usage
Enum.map(50)                       1032 B - 1.68x memory usage

@devonestes
Collaborator

So I've confirmed that none of these functions trigger even a minor GC (as most of them allocate no data, this was expected). This means all memory usage for these functions comes from Process.info. If something is weird about how we're getting memory data, it's gotta be coming from there. I'm going to do some more digging there to see if we can get more accurate data, or maybe there's even a bug?

@PragTob
Member Author

PragTob commented Jul 26, 2018

All the 616 B cases just optimize to a literal (I decompiled and checked them) - it's still weird to me that Enum.map gives these lower memory consumptions etc.
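A lighter-weight way to confirm the literal optimization, without decompiling, is `:erts_debug.same/2`, which checks whether two terms are physically the same object: compile-time literals are shared across calls, while terms built at runtime are not. A sketch under the assumption that the functions live in a compiled module (where constant expressions become pooled literals):

```elixir
defmodule LiteralCheck do
  # The body is a constant expression, so the compiler stores it as a
  # literal and every call returns the very same term.
  def static, do: [1 | [1]]

  # Mapping builds a fresh list on every call.
  def dynamic, do: Enum.map([1, 1], fn i -> i end)
end

IO.inspect(:erts_debug.same(LiteralCheck.static(), LiteralCheck.static()))
IO.inspect(:erts_debug.same(LiteralCheck.dynamic(), LiteralCheck.dynamic()))
```

If the first prints `true` and the second `false`, the "static" benchmark bodies really do allocate nothing per call.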

@devonestes
Collaborator

Ok, I have a new weird thing that's going on.

I was looking at the "before" and "after" measurements that we're taking with Process.info/2 for Enum.map(50), and even though I wasn't seeing any garbage collection happening, I got these as the before and after measurements:

[
  old_heap_block_size: 0,
  heap_block_size: 233,
  mbuf_size: 0,
  recent_size: 0,
  stack_size: 14,
  old_heap_size: 0,
  heap_size: 115,
  bin_vheap_size: 0,
  bin_vheap_block_size: 46422,
  bin_old_vheap_size: 0,
  bin_old_vheap_block_size: 46422
]
[
  old_heap_block_size: 610,
  heap_block_size: 376,
  mbuf_size: 0,
  recent_size: 100,
  stack_size: 14,
  old_heap_size: 58,
  heap_size: 177,
  bin_vheap_size: 0,
  bin_vheap_block_size: 46422,
  bin_old_vheap_size: 0,
  bin_old_vheap_block_size: 46422
]
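Plugging these dumps into the same arithmetic the measurement code uses (`heap_size + old_heap_size`, after minus before, times the word size) gives, on a 64-bit system:

```elixir
word_size = 8                  # 64-bit system
before_words = 115 + 0         # heap_size + old_heap_size from the first dump
after_words = 177 + 58         # same fields from the second dump
IO.inspect((after_words - before_words) * word_size)
# => 960
```

So the snapshots alone account for 960 B; anything beyond that in the reported figure would have to come from memory the tracer adds in as collected (`mem_collected`).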

What jumped out to me was that not only was an old_heap block allocated, but there's data on there! Which means not only one, but many GC cycles must have taken place - but I wasn't seeing anything being printed out from my little printer that was supposed to be printing all GC info to the console as the process was running.

Is it possible we're missing GC events somehow?

Here are the modifications I made to check out what's happening:

defmodule Benchee.Benchmark.Measure.Memory do
  @moduledoc false

  # Measure memory consumption of a function.
  #
  # Returns `{nil, return_value}` in case the memory measurement went bad.

  @behaviour Benchee.Benchmark.Measure

  def measure(fun) do
    ref = make_ref()
    Process.flag(:trap_exit, true)
    start_runner(fun, ref)

    receive do
      {^ref, memory_usage_info} -> return_memory(memory_usage_info)
      :shutdown -> nil
    end
  end

  defp start_runner(fun, ref) do
    parent = self()

    spawn_link(fn ->
      printer = start_printer()
      tracer = start_tracer(self(), printer)

      try do
        memory_usage_info = measure_memory(fun, tracer, printer)
        send(parent, {ref, memory_usage_info})
      catch
        kind, reason -> graceful_exit(kind, reason, tracer, parent)
      after
        send(tracer, :done)
      end
    end)
  end

  defp start_printer() do
    spawn(fn -> printer_loop() end)
  end

  defp printer_loop() do
    receive do
      info ->
        IO.inspect(info)
        printer_loop()
    end
  end

  defp return_memory({memory_usage, result}) when memory_usage < 0, do: {nil, result}
  defp return_memory({memory_usage, result}), do: {memory_usage, result}

  defp measure_memory(fun, tracer, printer) do
    word_size = :erlang.system_info(:wordsize)
    {:garbage_collection_info, heap_before} = Process.info(self(), :garbage_collection_info)
    send(printer, heap_before)
    result = fun.()
    {:garbage_collection_info, heap_after} = Process.info(self(), :garbage_collection_info)
    send(printer, heap_after)
    mem_collected = get_collected_memory(tracer)

    memory_used =
      (total_memory(heap_after) - total_memory(heap_before) + mem_collected) * word_size

    {memory_used, result}
  end

  @spec graceful_exit(Exception.kind(), any(), pid(), pid()) :: no_return
  defp graceful_exit(kind, reason, tracer, parent) do
    send(tracer, :done)
    send(parent, :shutdown)
    stacktrace = System.stacktrace()
    IO.puts(Exception.format(kind, reason, stacktrace))
    exit(:normal)
  end

  defp get_collected_memory(tracer) do
    ref = Process.monitor(tracer)
    send(tracer, {:get_collected_memory, self(), ref})

    receive do
      {:DOWN, ^ref, _, _, _} -> nil
      {^ref, collected} -> collected
    end
  end

  defp start_tracer(pid, printer) do
    spawn(fn ->
      :erlang.trace(pid, true, [:garbage_collection, tracer: self()])
      tracer_loop(pid, 0, printer)
    end)
  end

  defp tracer_loop(pid, acc, printer) do
    receive do
      {:get_collected_memory, reply_to, ref} ->
        send(reply_to, {ref, acc})

      {:trace, ^pid, :gc_minor_start, info} ->
        send(printer, {:gc_minor_start, info})
        listen_gc_end(pid, :gc_minor_end, acc, total_memory(info), printer)

      {:trace, ^pid, :gc_major_start, info} ->
        send(printer, {:gc_major_start, info})
        listen_gc_end(pid, :gc_major_end, acc, total_memory(info), printer)

      :done ->
        exit(:normal)

      other ->
        send(printer, other)
        tracer_loop(pid, acc, printer)
    end
  end

  defp listen_gc_end(pid, tag, acc, mem_before, printer) do
    receive do
      {:trace, ^pid, ^tag, info} ->
        send(printer, {tag, info})
        mem_after = total_memory(info)
        tracer_loop(pid, acc + mem_before - mem_after, printer)

      other ->
        send(printer, other)
        tracer_loop(pid, acc, printer)
    end
  end

  defp total_memory(info) do
    # `:heap_size` seems to only contain the memory size of the youngest
    # generation `:old_heap_size` has the old generation. There is also
    # `:recent_size` but that seems to already be accounted for.
    Keyword.fetch!(info, :heap_size) + Keyword.fetch!(info, :old_heap_size)
  end
end

@PragTob
Member Author

PragTob commented Jul 27, 2018

Can't we get the count of how many GCs happened?
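A count is in fact available without tracing: `Process.info/2` with the `:garbage_collection` item returns, among other fields, `:minor_gcs` - the number of minor collections since the last full sweep (it resets when a major GC runs, so it's a lower bound). A minimal sketch:

```elixir
# Read the minor-GC counter before and after running the function; if it
# didn't change (and no major GC reset it), no minor collection happened.
read_minor_gcs = fn ->
  {:garbage_collection, info} = Process.info(self(), :garbage_collection)
  Keyword.fetch!(info, :minor_gcs)
end

gcs_before = read_minor_gcs.()
_ = Enum.map(1..50, fn i -> i end)
IO.inspect(read_minor_gcs.() - gcs_before, label: "minor GCs during the call")
```

Reading the counter at the end of the runner process would show whether the trace messages are missing real GC events.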

Other than that... you are right... are we sure that the old heap only comes with a full GC? I think so, just asking... but I think yes, only during a GC are things copied into the old_heap... Unsure.

Code looks ok to me... can't we just IO.inspect inline to make sure we don't miss any events, or even raise? I know it will affect the memory measurements... but only if we print, and once we print in that scenario it's already too late.

So yeah, not sure :)

Thanks for all the work!

@devonestes
Collaborator

You're right - it's not that stuff is only moved to the old heap during a major GC, it's that the old heap is only collected during a major GC. But, yeah, the only way (to my knowledge) for something to get moved to the old heap is through a GC run.

I didn't want to do any IO in the runner process because it uses memory, and these weird things are only happening (that I know of) in scenarios with very little memory use.

I'll see if there's some way to get a count of GC runs before the runner process exits, that's a good idea!

@devonestes
Collaborator

This is resolved by #239
