Mimalloc/marksweep high memory overhead #688

Open
qinsoon opened this issue Oct 26, 2022 · 17 comments
Labels: A-allocator (Area: Allocator), A-policy (Area: Policy), C-enhancement (Category: Enhancement), P-low (Priority: Low. A low-priority issue won't be scheduled and assigned. Any help is welcome.)

Comments

@qinsoon (Member) commented Oct 26, 2022

This tracks leftover issues with our mimalloc allocator and the mark sweep plan that uses it.

Zeroing

Currently, zeroing happens when we allocate a cell, which is very fine-grained. We should use coarser-grained zeroing instead; for example, we could zero memory when we sweep a whole block, or when we sweep contiguous cells.
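As a rough illustration of the idea (not the actual mmtk-core code; the `Block` type, `cell_size`, and the per-cell liveness vector below are hypothetical), block-level sweeping could zero a fully dead block in one call, and otherwise zero each contiguous run of dead cells in one call:

```rust
// Hypothetical sketch of coarse-grained zeroing during sweeping.
const BLOCK_SIZE: usize = 64 * 1024;

struct Block {
    start: *mut u8,
    cell_size: usize,
    live: Vec<bool>, // one liveness flag per cell, set during marking
}

impl Block {
    /// Zero the whole block in one call if nothing in it survived the GC;
    /// otherwise zero each contiguous run of dead cells in one call.
    fn sweep_and_zero(&mut self) {
        if self.live.iter().all(|&l| !l) {
            unsafe { std::ptr::write_bytes(self.start, 0, BLOCK_SIZE) };
            return;
        }
        let cells = BLOCK_SIZE / self.cell_size;
        let mut run_start: Option<usize> = None;
        for i in 0..=cells {
            let dead = i < cells && !self.live[i];
            match (dead, run_start) {
                (true, None) => run_start = Some(i),
                (false, Some(s)) => {
                    // Zero the dead run [s, i) in a single call.
                    let bytes = (i - s) * self.cell_size;
                    unsafe {
                        std::ptr::write_bytes(self.start.add(s * self.cell_size), 0, bytes);
                    }
                    run_start = None;
                }
                _ => {}
            }
        }
    }
}
```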

Min heap

The min heap of the current implementation (based on #643) is:

| benchmark | marksweep | marksweep (without abandoning blocks in GC) | non-moving immix with 64K block | semispace |
|---|---|---|---|---|
| dacapo2006-antlr | 37 | 38 | 20 | 10 |
| dacapo2006-luindex | 57 | 65 | 24 | 12 |
| dacapo2006-lusearch | 92 | 186 | 25 | 15 |
| dacapo2006-pmd | 97 | 97 | 65 | 52 |
| dacapo2006-xalan | 66 | 102 | 69 | 67 |

Mimalloc uses thread-local free lists: instead of one global free list per size class, each mimalloc allocator keeps a thread-local free list per size class. In the original implementation, an allocator always allocates from its own free lists, never from another thread's. We do the same in our GC'd version: GC identifies live objects, and each allocator sweeps its own dead cells. As a result, if an allocator holds recyclable blocks for size classes it rarely allocates from, other allocators cannot reuse those blocks and must instead allocate new blocks from global memory.

We solved part of the problem by abandoning blocks at GC time: in each GC, every allocator abandons its local block lists and returns the blocks to the global pool (in the same way as when a thread dies). For new allocations, allocators then first attempt to get blocks from the global pool. This helps the min heap (compare the second column with the first).
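For illustration only, a minimal sketch of this abandonment scheme might look like the following; the `GlobalPool` and `FreeListAllocator` types, the block identifiers, and the size-class count are assumptions, not the actual mmtk-core API:

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

type BlockId = usize;
const NUM_SIZE_CLASSES: usize = 32; // assumed count, for illustration only

struct GlobalPool {
    // one shared list of reusable blocks per size class
    reusable: Mutex<Vec<VecDeque<BlockId>>>,
}

impl GlobalPool {
    fn new() -> Self {
        GlobalPool { reusable: Mutex::new(vec![VecDeque::new(); NUM_SIZE_CLASSES]) }
    }
}

struct FreeListAllocator {
    // thread-local block lists, one per size class
    local_blocks: Vec<VecDeque<BlockId>>,
}

impl FreeListAllocator {
    /// Called for each mutator during GC: hand every locally held block
    /// back to the global pool so any allocator can reuse it afterwards.
    fn abandon_blocks(&mut self, global: &GlobalPool) {
        let mut pool = global.reusable.lock().unwrap();
        for (size_class, list) in self.local_blocks.iter_mut().enumerate() {
            pool[size_class].append(list); // drains the local list
        }
    }

    /// On allocation, prefer blocks abandoned to the global pool before
    /// requesting a fresh block from global memory (omitted here).
    fn acquire_block(&mut self, size_class: usize, global: &GlobalPool) -> Option<BlockId> {
        if let Some(b) = self.local_blocks[size_class].pop_front() {
            return Some(b);
        }
        global.reusable.lock().unwrap()[size_class].pop_front()
    }
}
```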

However, we are still worse than non-moving Immix, and we will need to investigate further.

qinsoon added the C-enhancement (Category: Enhancement), A-allocator (Area: Allocator), and A-policy (Area: Policy) labels on Oct 26, 2022
@qinsoon (Member, Author) commented Oct 27, 2022

I measured the min heap for our malloc mark sweep using the mimalloc library. In this measurement I am not doing a bulk xor of mark bits in mark sweep, as that is known to have issues with detecting used pages; instead, I iterate through each object and check its mark bit.
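A minimal sketch of that measurement approach (hypothetical names, assuming we can enumerate allocated objects together with their mark bits):

```rust
use std::collections::HashSet;

const PAGE_SIZE: usize = 4096; // or 64 * 1024 when counting 64K pages

/// `objects` is a list of (object address, mark bit) pairs.
/// A page counts as used if at least one marked object lives on it.
fn count_live_pages(objects: &[(usize, bool)]) -> usize {
    let mut live_pages = HashSet::new();
    for &(addr, marked) in objects {
        if marked {
            live_pages.insert(addr / PAGE_SIZE);
        }
    }
    live_pages.len()
}
```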

The results are as below.

| benchmark | mark sweep with library mimalloc, counting 4K pages | mark sweep with library mimalloc, counting 64K pages |
|---|---|---|
| dacapo2006-antlr | 10 | 16 |
| dacapo2006-luindex | 11 | 13 |
| dacapo2006-lusearch | 19 | 58 |
| dacapo2006-pmd | 58 | 73 |
| dacapo2006-xalan | 47 | 93 |

The min heap values for the library mimalloc are lower than for our mimalloc mark sweep implementation, especially for antlr and luindex.

@qinsoon (Member, Author) commented Oct 28, 2022

I further tested the relation between min heap values and the number of benchmark threads. I am using lusearch from dacapo-bach (instead of dacapo-2006, which does not support setting the number of benchmark threads). The results are as below:

| dacapo-bach-lusearch threads | mark sweep | mark sweep with library mimalloc (counting 64K page) |
|---|---|---|
| 1 | 14 | 28 |
| 24 | 50 | 72 |
| 48 | 94 | 129 |

@k-sareen (Collaborator) commented:

> I further tested the relation between min heap values and the number of benchmark threads. I am using lusearch from dacapo-bach (instead of dacapo-2006, which does not support setting the number of benchmark threads). The results are as below:
>
> | dacapo-bach-lusearch threads | mark sweep | mark sweep with library mimalloc (counting 64K page) |
> |---|---|---|
> | 1 | 14 | 28 |
> | 24 | 50 | 72 |
> | 48 | 94 | 129 |

What is the second column in the table? The Rust-mimalloc MarkSweep implementation?

@qinsoon (Member, Author) commented Oct 28, 2022

> > I further tested the relation between min heap values and the number of benchmark threads. I am using lusearch from dacapo-bach (instead of dacapo-2006, which does not support setting the number of benchmark threads). The results are as below:
> >
> > | dacapo-bach-lusearch threads | mark sweep | mark sweep with library mimalloc (counting 64K page) |
> > |---|---|---|
> > | 1 | 14 | 28 |
> > | 24 | 50 | 72 |
> > | 48 | 94 | 129 |
>
> What is the second column in the table? The Rust-mimalloc MarkSweep implementation?

Yes. It is the Rust-mimalloc mark sweep, with thread-local blocks returned to the global pool in each GC.

@k-sareen (Collaborator) commented:

Oh? So it's actually much more competitive than previously thought? Could you try lusearch on the latest DaCapo version instead, as we know what the minheap for different collectors is for that version?

@qinsoon (Member, Author) commented Oct 28, 2022

My current conclusion is that, with mimalloc mark sweep, we need a larger min heap size when we have more mutator threads. As we see the same issue with the library mimalloc, this is not likely a bug in our Rust mimalloc implementation.

> Oh? So it's actually much more competitive than previously thought? Could you try lusearch on the latest DaCapo version instead, as we know what the minheap for different collectors is for that version?

The first table in the original post has a comparison with non-moving immix (with 64K block size) and semispace. It shows that the min heap of mark sweep (with mimalloc) is pretty bad. In the long term, we would either need to fix it in some way or use another allocator for mark sweep. I can run it with the latest dacapo, but I don't think it will change the conclusion.

@k-sareen (Collaborator) commented:

> My current conclusion is that, with mimalloc mark sweep, we need a larger min heap size when we have more mutator threads. As we see the same issue with the library mimalloc, this is not likely a bug in our Rust mimalloc implementation.

While both show the trend, the raw values are different, with Rust-mimalloc having a (considerably) smaller minheap than library mimalloc. What I'm stuck on is that originally (#688 (comment)) the library mimalloc is significantly more space-efficient than Rust-mimalloc and is around the same as SemiSpace, but in the last comment (#688 (comment)) the trend has flipped, with the library mimalloc being less space-efficient than Rust-mimalloc. The key difference is that the benchmarks have changed.

> The first table in the original post has a comparison with non-moving immix (with 64K block size) and semispace. It shows that the min heap of mark sweep (with mimalloc) is pretty bad. In the long term, we would either need to fix it in some way or use another allocator for mark sweep. I can run it with the latest dacapo, but I don't think it will change the conclusion.

Right, but as I've mentioned above, something doesn't quite make sense. If Rust-mimalloc were worse than library mimalloc, you'd expect to see it in the lusearch results as well. The only reason I bring up the newer DaCapo version is that I have minheap values for Immix, SemiSpace, and GenCopy for that version.

Also, did you try tcmalloc or some other malloc like Hoard or jemalloc?

@qinsoon (Member, Author) commented Oct 28, 2022

I didn't try with other malloc libraries. I should do that.

I will get the min heap with those builds on latest dacapo.

qinsoon added a commit that referenced this issue Dec 5, 2022
This pull request introduces an allocator based on Microsoft's mimalloc (https://www.microsoft.com/en-us/research/uploads/prod/2019/06/mimalloc-tr-v1.pdf) used in a MarkSweep GC. This closes #348.

This PR has a few leftover issues: #688

General changes in this PR (if requested, I can separate these changes to different PRs):
* `destroy_mutator()`:
  * now takes `&mut Mutator` instead of `Box<Mutator>`. Reclaiming the `Mutator` memory is now up to the binding, which allows the binding to copy the `Mutator` value into its own thread-local data structure; in that case, Rust cannot reclaim the memory as it would for a `Box`.
  * now calls each allocator for their allocator-specific `on_destroy()` behavior.
* Extract `Chunk` and `ChunkMap` from Immix, and make it general for other policies to use.
* Changes in `VMBinding` constants:
  * `LOG_MIN_ALIGNMENT` is removed. We still provide `MIN_ALIGNMENT`. This avoids confusion that a binding may override one constant but not the other.
  * `MAX_ALIGNMENT_SHIFT` is removed. We provide `MAX_ALIGNMENT` for the same reason as above.
  * Add `USE_ALLOCATION_OFFSET: bool`. If a binding never uses an allocation offset (i.e. `offset == 0`), it can set this to `false`, and MMTk core can use this for some optimizations.
* Changes in `ObjectModel`:
  * Add `UNIFIED_OBJECT_REFERENCE_ADDRESS: bool` to allow a binding to tell us if the object reference uses the same address as its object start and its to_address.
  * Add `OBJECT_REF_OFFSET_LOWER_BOUND` to allow a binding to tell us roughly where the object reference points to (with respect to the allocation address).
  * Changes to related methods to cope with this change.

Mark sweep changes in this PR:
* Add two features for marksweep:
  * `eager_sweeping`: sweep unused memory eagerly in each GC. Without this feature, unused memory is swept when we attempt to allocate from it.
  * `malloc_mark_sweep`: the same as our previous `malloc_mark_sweep`. When this is used, the mark sweep plan uses `MallocSpace` and `MallocAllocator`.
* Move the current `policy::mallocspace` to `policy::marksweepspace::malloc_ms`
* Add `policy::marksweepspace::native_ms` for the mark sweep space with mimalloc.
* Add `util::alloc::free_list_allocator`.

Co-authored-by: Yi Lin <[email protected]>
wenyuzhao pushed a commit to wenyuzhao/mmtk-core that referenced this issue Mar 20, 2023
qinsoon changed the title from "Mimalloc/marksweep issues" to "Mimalloc/marksweep high memory overhead" on Sep 19, 2023
udesou added the P-low (Priority: Low) label on Nov 8, 2023
@qinsoon (Member, Author) commented Apr 16, 2024

@wks spotted an issue in #1112 (comment): we do not reset block states back to unmarked, so once a block is marked, it stays marked. We may sweep a marked block and reuse it, but we never release the block. This causes a memory leak.
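For context, here is a minimal sketch of the kind of fix involved (hypothetical types, not the actual code in #1112): block states are reset to unmarked during GC preparation, so that blocks which receive no marks during the GC can be identified and released afterwards.

```rust
#[derive(Clone, Copy, PartialEq)]
enum BlockState {
    Unallocated,
    Unmarked,
    Marked,
}

struct Block {
    state: BlockState,
}

/// Before marking: reset previously marked blocks to Unmarked.
/// Without this reset, a block marked in an earlier GC stays "Marked"
/// forever and is never released, leaking memory.
fn prepare(blocks: &mut [Block]) {
    for b in blocks.iter_mut() {
        if b.state == BlockState::Marked {
            b.state = BlockState::Unmarked;
        }
    }
}

/// After marking: any block still Unmarked held no live objects, so it
/// can be released back to the global memory.
fn release(blocks: &mut [Block]) {
    for b in blocks.iter_mut() {
        if b.state == BlockState::Unmarked {
            b.state = BlockState::Unallocated;
        }
    }
}
```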

I reran some minheap measurements to compare against the numbers we recorded in this issue. marksweep-fix refers to the PR (#1112), and marksweep (buggy) refers to the master branch at the time of measurement (86da9e9).

Min heap

This compares to the min heap measurement in the original post (#688 (comment))

| benchmark | marksweep-fix (eager sweeping) | marksweep-fix (lazy sweeping) | marksweep (buggy) |
|---|---|---|---|
| dacapo2006-antlr | 10 | 11 | 30 |
| dacapo2006-luindex | 8 | 10 | 59 |
| dacapo2006-lusearch | 37 | 29 | 83 |
| dacapo2006-pmd | 38 | 38 | 75 |
| dacapo2006-xalan | 40 | 40 | 48 |

After the fix, the min heap values are reasonable.

Min heap with the number of mutator threads

This compares to the min heap increase due to the number of mutator threads we observed in #688 (comment).

| dacapo-bach-lusearch threads | marksweep-fix |
|---|---|
| 1 | 10 |
| 24 | 35 |
| 48 | 65 |

Overall, the min heap values are better, but still scale with the number of mutator threads.

@k-sareen (Collaborator) commented:

> Min heap with the number of mutator threads
>
> This compares to the min heap increase due to the number of mutator threads we observed in #688 (comment).
>
> | dacapo-bach-lusearch threads | marksweep-fix |
> |---|---|
> | 1 | 10 |
> | 24 | 35 |
> | 48 | 65 |
>
> Overall, the min heap values are better, but still scale with the number of mutator threads.

But this is just due to multiple queries executing at the same time, so naturally the minheap is larger. I'm not sure it's actually a real problem. One way to confirm is to check the MS heap sizes against SemiSpace for equivalent thread counts.

@wks (Collaborator) commented Apr 16, 2024

> Overall, the min heap values are better, but still scale with the number of mutator threads.

I think it is just the nature of the lusearch benchmark: it allocates lots and lots of objects, but keeps only a small number of objects alive. And each MiMalloc allocator keeps many blocks (for each size class) locally, and cannot defragment. As a result, the memory reserved per mutator is more pronounced for lusearch+MiMalloc. The min heap size for bump-pointer allocators should also scale with the number of mutators, but not with such a steep slope.
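As a back-of-the-envelope illustration of why this scales with the mutator count (the block size, size-class count, and blocks-per-class below are assumed values for illustration, not measurements of mmtk-core's configuration):

```rust
/// Rough upper bound on memory that can sit parked in thread-local
/// block lists, unavailable to other allocators between GCs.
fn reserved_upper_bound(mutators: usize, size_classes: usize,
                        blocks_per_class: usize, block_size: usize) -> usize {
    mutators * size_classes * blocks_per_class * block_size
}

fn main() {
    // e.g. 48 mutators x 32 size classes x 1 partly used block x 64 KiB
    // = 96 MiB that other allocators cannot reuse between GCs.
    println!("{} bytes", reserved_upper_bound(48, 32, 1, 64 * 1024));
}
```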

@qinsoon (Member, Author) commented Apr 17, 2024

I measured the min heap for a few other plans. I have some doubts about the min heap value for mark compact with 48 threads: it keeps timing out within 10 minutes for any heap size smaller than 53 MB. It may have a smaller min heap value if we allow more time, but I am just reporting what I got.

Marksweep with mimalloc still has the worst scaling among them (if we ignore markcompact with 48 threads). Mark sweep is a non-moving plan and suffers from fragmentation, but non-moving Immix also suffers from fragmentation, yet its min heap with 48 threads is only 33 MB (compared to 14 MB with 1 thread).

| dacapo-bach-lusearch threads | immix | immix nonmoving | markcompact |
|---|---|---|---|
| 1 | 6 | 14 | 5 |
| 24 | 8 | 28 | 7 |
| 48 | 18 | 33 | 53 (timeout within 10 mins for heap sizes under 53) |

> And each MiMalloc allocator keeps many blocks (for each size class) locally, and cannot defragment. As a result, the memory reserved per mutator is more pronounced for lusearch+MiMalloc.

Yeah, that is the issue with MiMalloc.

But anyway, with the fix in #1112, the min heap for mark sweep with mimalloc is reasonable now.

@k-sareen (Collaborator) commented:

A potential solution could be to return the allocator-local blocks to the global pool for every GC in mutator.flush() instead of waiting for the mutator thread to get destroyed in mutator.on_destroy().

@qinsoon (Member, Author) commented Apr 17, 2024

> A potential solution could be to return the allocator-local blocks to the global pool for every GC in mutator.flush() instead of waiting for the mutator thread to get destroyed in mutator.on_destroy().

We are already doing that; see the table in the original post. All of the following measurements include that change.

@k-sareen (Collaborator) commented:

Ah right. I forgot the status of what we had already done. Sorry.

github-merge-queue bot pushed a commit that referenced this issue Apr 17, 2024
This PR refactors mark sweep.
* It removes the data race for accessing block lists, as discussed in #1103 (comment).
* It fixes an issue where block states did not get reset before GC, which caused marked blocks to stay alive forever: #688 (comment)
* It fixes an issue where empty blocks did not get released for lazy sweeping.
@qinsoon (Member, Author) commented Apr 17, 2024

Should we close this issue, or keep it open with low priority?

@wks (Collaborator) commented Apr 17, 2024

> Should we close this issue, or keep it open with low priority?

I suggest we keep this open and dig deeper when we have time. It is still worth investigating the details of the per-mutator memory overhead of mark-sweep. One thing we can do is print out how fragmented each block is, whether it is locally reserved or globally available. That way, we would know exactly where the extra memory has gone.
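A possible shape for such a diagnostic (hypothetical types and fields; just a sketch of the suggestion above, not an existing mmtk-core utility):

```rust
/// Who currently holds the block.
enum Ownership {
    LocalToMutator(usize), // mutator id
    GlobalPool,
}

/// Per-block occupancy data collected after a GC.
struct BlockStats {
    ownership: Ownership,
    cell_size: usize,
    total_cells: usize,
    live_cells: usize,
}

/// Print how full each block is, split by local vs. global ownership,
/// so we can see where the extra memory has gone.
fn dump_fragmentation(blocks: &[BlockStats]) {
    for (i, b) in blocks.iter().enumerate() {
        let used = b.live_cells * b.cell_size;
        let capacity = b.total_cells * b.cell_size;
        let owner = match b.ownership {
            Ownership::LocalToMutator(id) => format!("mutator {}", id),
            Ownership::GlobalPool => "global pool".to_string(),
        };
        println!(
            "block {:>4} ({:>11}): size class {:>5}B, {:>3}% used ({} / {} bytes)",
            i,
            owner,
            b.cell_size,
            if capacity == 0 { 0 } else { used * 100 / capacity },
            used,
            capacity
        );
    }
}
```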
