Mimalloc/marksweep high memory overhead #688

Open
qinsoon opened this issue Oct 26, 2022 · 17 comments
Labels: A-allocator (Area: Allocator), A-policy (Area: Policy), C-enhancement (Category: Enhancement), P-low (Priority: Low. A low-priority issue won't be scheduled and assigned. Any help is welcome.)

Comments

@qinsoon (Member) commented Oct 26, 2022

This tracks leftover issues with our mimalloc allocator and the mark sweep plan that uses it.

Zeroing

Currently, zeroing happens when we allocate a cell, which is very fine-grained. We should use coarser-grained zeroing instead; for example, we could zero memory when we sweep a whole block, or when we sweep contiguous cells.
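As a rough illustration of the idea (not the actual mmtk-core code; the `Block` type, `cell_size`, and the per-cell liveness vector below are hypothetical), block-level sweeping could zero a fully dead block in one call, and otherwise zero each contiguous run of dead cells in one call:

```rust
// Hypothetical sketch of coarse-grained zeroing during sweeping.
const BLOCK_SIZE: usize = 64 * 1024;

struct Block {
    start: *mut u8,
    cell_size: usize,
    live: Vec<bool>, // one liveness flag per cell, set during marking
}

impl Block {
    /// Zero the whole block in one call if nothing in it survived the GC;
    /// otherwise zero each contiguous run of dead cells in one call.
    fn sweep_and_zero(&mut self) {
        if self.live.iter().all(|&l| !l) {
            unsafe { std::ptr::write_bytes(self.start, 0, BLOCK_SIZE) };
            return;
        }
        let cells = BLOCK_SIZE / self.cell_size;
        let mut run_start: Option<usize> = None;
        for i in 0..=cells {
            let dead = i < cells && !self.live[i];
            match (dead, run_start) {
                (true, None) => run_start = Some(i),
                (false, Some(s)) => {
                    // Zero the dead run [s, i) in a single call.
                    let bytes = (i - s) * self.cell_size;
                    unsafe {
                        std::ptr::write_bytes(self.start.add(s * self.cell_size), 0, bytes);
                    }
                    run_start = None;
                }
                _ => {}
            }
        }
    }
}
```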

Min heap

The min heap of the current implementation (based on #643) is:

| benchmark | marksweep | marksweep (without abandoning blocks in GC) | non-moving immix with 64K block | semispace |
|---|---|---|---|---|
| dacapo2006-antlr | 37 | 38 | 20 | 10 |
| dacapo2006-luindex | 57 | 65 | 24 | 12 |
| dacapo2006-lusearch | 92 | 186 | 25 | 15 |
| dacapo2006-pmd | 97 | 97 | 65 | 52 |
| dacapo2006-xalan | 66 | 102 | 69 | 67 |

Mimalloc uses thread-local free lists: instead of one global free list per size class, each mimalloc allocator keeps a thread-local free list per size class. In the original implementation, an allocator always allocates from its own free lists, never from another thread's. We do the same in our GC'd version: GC identifies live objects, and each allocator sweeps its own dead cells. As a result, if an allocator holds recyclable blocks for size classes it rarely allocates from, other allocators cannot reuse those blocks and must instead allocate new blocks from global memory.

We solved part of the problem by abandoning blocks at GC time: in each GC, every allocator abandons its local block lists and returns the blocks to the global pool (in the same way as when a thread dies). For new allocations, allocators then first attempt to get blocks from the global pool. This helps the min heap (compare the second column with the first).
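For illustration only, a minimal sketch of this abandonment scheme might look like the following; the `GlobalPool` and `FreeListAllocator` types, the block identifiers, and the size-class count are assumptions, not the actual mmtk-core API:

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

type BlockId = usize;
const NUM_SIZE_CLASSES: usize = 32; // assumed count, for illustration only

struct GlobalPool {
    // one shared list of reusable blocks per size class
    reusable: Mutex<Vec<VecDeque<BlockId>>>,
}

impl GlobalPool {
    fn new() -> Self {
        GlobalPool { reusable: Mutex::new(vec![VecDeque::new(); NUM_SIZE_CLASSES]) }
    }
}

struct FreeListAllocator {
    // thread-local block lists, one per size class
    local_blocks: Vec<VecDeque<BlockId>>,
}

impl FreeListAllocator {
    /// Called for each mutator during GC: hand every locally held block
    /// back to the global pool so any allocator can reuse it afterwards.
    fn abandon_blocks(&mut self, global: &GlobalPool) {
        let mut pool = global.reusable.lock().unwrap();
        for (size_class, list) in self.local_blocks.iter_mut().enumerate() {
            pool[size_class].append(list); // drains the local list
        }
    }

    /// On allocation, prefer blocks abandoned to the global pool before
    /// requesting a fresh block from global memory (omitted here).
    fn acquire_block(&mut self, size_class: usize, global: &GlobalPool) -> Option<BlockId> {
        if let Some(b) = self.local_blocks[size_class].pop_front() {
            return Some(b);
        }
        global.reusable.lock().unwrap()[size_class].pop_front()
    }
}
```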

However, we are still worse than non-moving Immix, and we will need to investigate further.

qinsoon added the C-enhancement (Category: Enhancement), A-allocator (Area: Allocator), and A-policy (Area: Policy) labels on Oct 26, 2022
@qinsoon (Member, Author) commented Oct 27, 2022

I measured the min heap for our malloc mark sweep using the mimalloc library. In this measurement I am not doing a bulk xor of mark bits in mark sweep, as that is known to have issues with detecting used pages; instead, I iterate through each object and check its mark bit.
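A minimal sketch of that measurement approach (hypothetical names, assuming we can enumerate allocated objects together with their mark bits):

```rust
use std::collections::HashSet;

const PAGE_SIZE: usize = 4096; // or 64 * 1024 when counting 64K pages

/// `objects` is a list of (object address, mark bit) pairs.
/// A page counts as used if at least one marked object lives on it.
fn count_live_pages(objects: &[(usize, bool)]) -> usize {
    let mut live_pages = HashSet::new();
    for &(addr, marked) in objects {
        if marked {
            live_pages.insert(addr / PAGE_SIZE);
        }
    }
    live_pages.len()
}
```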

The results are as below.

| benchmark | mark sweep with library mimalloc, counting 4K pages | mark sweep with library mimalloc, counting 64K pages |
|---|---|---|
| dacapo2006-antlr | 10 | 16 |
| dacapo2006-luindex | 11 | 13 |
| dacapo2006-lusearch | 19 | 58 |
| dacapo2006-pmd | 58 | 73 |
| dacapo2006-xalan | 47 | 93 |

The min heap values for the library mimalloc are lower than for our mimalloc mark sweep implementation, especially for antlr and luindex.

@qinsoon (Member, Author) commented Oct 28, 2022

I further tested the relation between min heap values and the number of benchmark threads. I am using lusearch from dacapo-bach (instead of dacapo-2006, which does not support setting the number of benchmark threads). The results are as below:

| dacapo-bach-lusearch threads | mark sweep | mark sweep with library mimalloc (counting 64K page) |
|---|---|---|
| 1 | 14 | 28 |
| 24 | 50 | 72 |
| 48 | 94 | 129 |

@k-sareen (Collaborator) commented:

> I further tested the relation between min heap values and the number of benchmark threads. I am using lusearch from dacapo-bach (instead of dacapo-2006, which does not support setting the number of benchmark threads). The results are as below:
>
> | dacapo-bach-lusearch threads | mark sweep | mark sweep with library mimalloc (counting 64K page) |
> |---|---|---|
> | 1 | 14 | 28 |
> | 24 | 50 | 72 |
> | 48 | 94 | 129 |

What is the second column in the table? The Rust-mimalloc MarkSweep implementation?

@qinsoon (Member, Author) commented Oct 28, 2022

> > I further tested the relation between min heap values and the number of benchmark threads. I am using lusearch from dacapo-bach (instead of dacapo-2006, which does not support setting the number of benchmark threads). The results are as below:
> >
> > | dacapo-bach-lusearch threads | mark sweep | mark sweep with library mimalloc (counting 64K page) |
> > |---|---|---|
> > | 1 | 14 | 28 |
> > | 24 | 50 | 72 |
> > | 48 | 94 | 129 |
>
> What is the second column in the table? The Rust-mimalloc MarkSweep implementation?

Yes. It is the Rust-mimalloc mark sweep, with thread-local blocks returned to the global pool in each GC.

@k-sareen (Collaborator) commented:

Oh? So it's actually much more competitive than previously thought? Could you try lusearch on the latest DaCapo version instead, as we know what the minheap for different collectors is for that version?

@qinsoon (Member, Author) commented Oct 28, 2022

My current conclusion is that, with mimalloc mark sweep, we need a larger min heap size when we have more mutator threads. As we see the same issue with the library mimalloc, this is not likely a bug in our Rust mimalloc implementation.

> Oh? So it's actually much more competitive than previously thought? Could you try lusearch on the latest DaCapo version instead, as we know what the minheap for different collectors is for that version?

The first table in the original post has a comparison with non-moving immix (with 64K block size) and semispace. It shows that the min heap of mark sweep (with mimalloc) is pretty bad. In the long term, we would either need to fix it in some way or use another allocator for mark sweep. I can run it with the latest dacapo, but I don't think it will change the conclusion.

@k-sareen (Collaborator) commented:

> My current conclusion is that, with mimalloc mark sweep, we need a larger min heap size when we have more mutator threads. As we see the same issue with the library mimalloc, this is not likely a bug in our Rust mimalloc implementation.

While both show the trend, the raw values are different, with Rust-mimalloc having a (considerably) smaller minheap than library mimalloc. What I'm stuck on is that originally (#688 (comment)) the library mimalloc is significantly more space-efficient than Rust-mimalloc and is around the same as SemiSpace, but in the last comment (#688 (comment)) the trend has flipped, with the library mimalloc being less space-efficient than Rust-mimalloc. The key difference is that the benchmarks have changed.

> The first table in the original post has a comparison with non-moving immix (with 64K block size) and semispace. It shows that the min heap of mark sweep (with mimalloc) is pretty bad. In the long term, we would either need to fix it in some way or use another allocator for mark sweep. I can run it with the latest dacapo, but I don't think it will change the conclusion.

Right, but as I've mentioned above, something doesn't quite make sense. If Rust-mimalloc were worse than library mimalloc, you'd expect to see it in the lusearch results as well. The only reason I bring up the newer DaCapo version is that I have minheap values for Immix, SemiSpace, and GenCopy for that version.

Also, did you try tcmalloc or some other malloc like Hoard or jemalloc?

@qinsoon (Member, Author) commented Oct 28, 2022

I didn't try with other malloc libraries. I should do that.

I will get the min heap with those builds on latest dacapo.

qinsoon added a commit that referenced this issue Dec 5, 2022
This pull request introduces an allocator based on Microsoft's mimalloc (https://www.microsoft.com/en-us/research/uploads/prod/2019/06/mimalloc-tr-v1.pdf) used in a MarkSweep GC. This closes #348.

This PR has a few leftover issues: #688

General changes in this PR (if requested, I can separate these changes to different PRs):
* `destroy_mutator()`:
  * now takes `&mut Mutator` instead of `Box<Mutator>`. Reclaiming the `Mutator` memory is now up to the binding, which allows the binding to copy the `Mutator` value into its own thread-local data structure; in that case, Rust cannot reclaim the memory as it would for a `Box`.
  * now calls each allocator for their allocator-specific `on_destroy()` behavior.
* Extract `Chunk` and `ChunkMap` from Immix, and make it general for other policies to use.
* Changes in `VMBinding` constants:
  * `LOG_MIN_ALIGNMENT` is removed. We still provide `MIN_ALIGNMENT`. This avoids confusion that a binding may override one constant but not the other.
  * `MAX_ALIGNMENT_SHIFT` is removed. We provide `MAX_ALIGNMENT` for the same reason as above.
  * Add `USE_ALLOCATION_OFFSET: bool`. If a binding never uses an allocation offset (i.e. `offset == 0`), it can set this to `false`, and MMTk core can use this for some optimizations.
* Changes in `ObjectModel`:
  * Add `UNIFIED_OBJECT_REFERENCE_ADDRESS: bool` to allow a binding to tell us if the object reference uses the same address as its object start and its to_address.
  * Add `OBJECT_REF_OFFSET_LOWER_BOUND` to allow a binding to tell us roughly where the object reference points to (with respect to the allocation address).
  * Changes to related methods to cope with this change.

Mark sweep changes in this PR:
* Add two features for marksweep:
  * `eager_sweeping`: sweep unused memory eagerly in each GC. Without this feature, unused memory is swept when we attempt to allocate from it.
  * `malloc_mark_sweep`: the same as our previous `malloc_mark_sweep`. When this is used, the mark sweep plan uses `MallocSpace` and `MallocAllocator`.
* Move the current `policy::mallocspace` to `policy::marksweepspace::malloc_ms`
* Add `policy::marksweepspace::native_ms` for the mark sweep space with mimalloc.
* Add `util::alloc::free_list_allocator`.

Co-authored-by: Yi Lin <[email protected]>
wenyuzhao pushed a commit to wenyuzhao/mmtk-core that referenced this issue Mar 20, 2023
qinsoon changed the title from "Mimalloc/marksweep issues" to "Mimalloc/marksweep high memory overhead" on Sep 19, 2023
udesou added the P-low (Priority: Low) label on Nov 8, 2023
@qinsoon (Member, Author) commented Apr 16, 2024

@wks spotted an issue in #1112 (comment): we do not reset block states back to unmarked, so once a block is marked, it stays marked. We may sweep a marked block and reuse it, but we never release the block. This causes a memory leak.
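For context, here is a minimal sketch of the kind of fix involved (hypothetical types, not the actual code in #1112): block states are reset to unmarked during GC preparation, so that blocks which receive no marks during the GC can be identified and released afterwards.

```rust
#[derive(Clone, Copy, PartialEq)]
enum BlockState {
    Unallocated,
    Unmarked,
    Marked,
}

struct Block {
    state: BlockState,
}

/// Before marking: reset previously marked blocks to Unmarked.
/// Without this reset, a block marked in an earlier GC stays "Marked"
/// forever and is never released, leaking memory.
fn prepare(blocks: &mut [Block]) {
    for b in blocks.iter_mut() {
        if b.state == BlockState::Marked {
            b.state = BlockState::Unmarked;
        }
    }
}

/// After marking: any block still Unmarked held no live objects, so it
/// can be released back to the global memory.
fn release(blocks: &mut [Block]) {
    for b in blocks.iter_mut() {
        if b.state == BlockState::Unmarked {
            b.state = BlockState::Unallocated;
        }
    }
}
```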

I reran some minheap measurements to compare against the numbers we recorded in this issue. marksweep-fix refers to the PR (#1112), and marksweep (buggy) refers to the master branch at the time of measurement (86da9e9).

Min heap

This compares to the min heap measurement in the original post (#688 (comment))

| benchmark | marksweep-fix (eager sweeping) | marksweep-fix (lazy sweeping) | marksweep (buggy) |
|---|---|---|---|
| dacapo2006-antlr | 10 | 11 | 30 |
| dacapo2006-luindex | 8 | 10 | 59 |
| dacapo2006-lusearch | 37 | 29 | 83 |
| dacapo2006-pmd | 38 | 38 | 75 |
| dacapo2006-xalan | 40 | 40 | 48 |

After the fix, the min heap values are reasonable.

Min heap with the number of mutator threads

This compares to the min heap increase due to the number of mutator threads we observed in #688 (comment).

| dacapo-bach-lusearch threads | marksweep-fix |
|---|---|
| 1 | 10 |
| 24 | 35 |
| 48 | 65 |

Overall, the min heap values are better, but still scale with the number of mutator threads.

@k-sareen (Collaborator) commented:

> Min heap with the number of mutator threads
>
> This compares to the min heap increase due to the number of mutator threads we observed in #688 (comment).
>
> | dacapo-bach-lusearch threads | marksweep-fix |
> |---|---|
> | 1 | 10 |
> | 24 | 35 |
> | 48 | 65 |
>
> Overall, the min heap values are better, but still scale with the number of mutator threads.

But this is just due to multiple queries executing at the same time, so naturally the minheap is larger. I'm not sure it's actually a real problem. One way to confirm is to check the MS heap sizes against SemiSpace for equivalent thread counts.

@wks (Collaborator) commented Apr 16, 2024

> Overall, the min heap values are better, but still scale with the number of mutator threads.

I think it is just the nature of the lusearch benchmark: it allocates lots and lots of objects, but keeps only a small number of objects alive. And each MiMalloc allocator keeps many blocks (for each size class) locally, and cannot defragment. As a result, the memory reserved per mutator is more pronounced for lusearch+MiMalloc. The min heap size for bump-pointer allocators should also scale with the number of mutators, but not with such a steep slope.
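As a back-of-the-envelope illustration of why this scales with the mutator count (the block size, size-class count, and blocks-per-class below are assumed values for illustration, not measurements of mmtk-core's configuration):

```rust
/// Rough upper bound on memory that can sit parked in thread-local
/// block lists, unavailable to other allocators between GCs.
fn reserved_upper_bound(mutators: usize, size_classes: usize,
                        blocks_per_class: usize, block_size: usize) -> usize {
    mutators * size_classes * blocks_per_class * block_size
}

fn main() {
    // e.g. 48 mutators x 32 size classes x 1 partly used block x 64 KiB
    // = 96 MiB that other allocators cannot reuse between GCs.
    println!("{} bytes", reserved_upper_bound(48, 32, 1, 64 * 1024));
}
```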

@qinsoon (Member, Author) commented Apr 17, 2024

I measured the min heap for a few other plans. I have some doubts about the min heap value for mark compact with 48 threads: it keeps timing out within 10 minutes for any heap size smaller than 53 MB. It may have a smaller min heap value if we allow more time, but I am just reporting what I got.

Marksweep with mimalloc still has the worst scaling among them (if we ignore markcompact with 48 threads). Mark sweep is a non-moving plan and suffers from fragmentation, but non-moving Immix also suffers from fragmentation, yet its min heap with 48 threads is only 33 MB (compared to 14 MB with 1 thread).

| dacapo-bach-lusearch threads | immix | immix nonmoving | markcompact |
|---|---|---|---|
| 1 | 6 | 14 | 5 |
| 24 | 8 | 28 | 7 |
| 48 | 18 | 33 | 53 (timeout within 10 mins for heap sizes under 53) |

> And each MiMalloc allocator keeps many blocks (for each size class) locally, and cannot defragment. As a result, the memory reserved per mutator is more pronounced for lusearch+MiMalloc.

Yeah, that is the issue with MiMalloc.

But anyway, with the fix in #1112, the min heap for mark sweep with mimalloc is reasonable now.

@k-sareen (Collaborator) commented:

A potential solution could be to return the allocator-local blocks to the global pool for every GC in mutator.flush() instead of waiting for the mutator thread to get destroyed in mutator.on_destroy().

@qinsoon (Member, Author) commented Apr 17, 2024

> A potential solution could be to return the allocator-local blocks to the global pool for every GC in mutator.flush() instead of waiting for the mutator thread to get destroyed in mutator.on_destroy().

We are already doing that; see the table in the original post. All of the following measurements include that change.

@k-sareen (Collaborator) commented:

Ah right. I forgot the status of what we had already done. Sorry.

github-merge-queue bot pushed a commit that referenced this issue Apr 17, 2024
This PR refactors mark sweep.
* It removes the data race for accessing block lists, as discussed in #1103 (comment).
* It fixes an issue where block states did not get reset before GC, which caused marked blocks to stay alive forever: #688 (comment)
* It fixes an issue where empty blocks did not get released for lazy sweeping.
@qinsoon (Member, Author) commented Apr 17, 2024

Should we close this issue, or keep it open with low priority?

@wks (Collaborator) commented Apr 17, 2024

> Should we close this issue, or keep it open with low priority?

I suggest we keep this open and dig deeper when we have time. It is still worth investigating the details of the per-mutator memory overhead of mark-sweep. One thing we can do is print out how fragmented each block is, whether it is locally reserved or globally available. That way, we would know exactly where the extra memory has gone.
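A possible shape for such a diagnostic (hypothetical types and fields; just a sketch of the suggestion above, not an existing mmtk-core utility):

```rust
/// Who currently holds the block.
enum Ownership {
    LocalToMutator(usize), // mutator id
    GlobalPool,
}

/// Per-block occupancy data collected after a GC.
struct BlockStats {
    ownership: Ownership,
    cell_size: usize,
    total_cells: usize,
    live_cells: usize,
}

/// Print how full each block is, split by local vs. global ownership,
/// so we can see where the extra memory has gone.
fn dump_fragmentation(blocks: &[BlockStats]) {
    for (i, b) in blocks.iter().enumerate() {
        let used = b.live_cells * b.cell_size;
        let capacity = b.total_cells * b.cell_size;
        let owner = match b.ownership {
            Ownership::LocalToMutator(id) => format!("mutator {}", id),
            Ownership::GlobalPool => "global pool".to_string(),
        };
        println!(
            "block {:>4} ({:>11}): size class {:>5}B, {:>3}% used ({} / {} bytes)",
            i,
            owner,
            b.cell_size,
            if capacity == 0 { 0 } else { used * 100 / capacity },
            used,
            capacity
        );
    }
}
```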
