Mimalloc/marksweep high memory overhead #688
I measured the min heap for our malloc mark sweep using the mimalloc library. In this measurement I am not doing the bulk XOR of mark bits, as it is known to have issues with detecting used pages; instead, I iterate through each object and check its mark bit. The results are below.
The min heap values for the library mimalloc are lower than for our mimalloc mark sweep implementation, especially for …
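For reference, a minimal sketch of the two detection strategies mentioned above. The bitmap layout and helpers are hypothetical stand-ins, not MMTk's actual side-metadata API, and the bulk variant only illustrates the idea under the assumption that mark bits are a subset of alloc bits.

```rust
// Hypothetical sketch only: `alloc_bits`, `mark_bits`, and `is_marked` are
// stand-ins for MMTk's side metadata, used to contrast the two strategies.

/// Bulk variant (roughly the "bulk XOR" mentioned above): combine whole
/// bitmap words at once. Assuming mark bits are a subset of alloc bits,
/// a non-zero XOR word flags allocated-but-unmarked (dead) objects.
/// Fast, but this path was suspected of mis-detecting used pages.
fn page_has_dead_objects_bulk(alloc_bits: &[u64], mark_bits: &[u64]) -> bool {
    alloc_bits.iter().zip(mark_bits).any(|(a, m)| (a ^ m) != 0)
}

/// Per-object variant used for the measurement above: walk every object on
/// the page and test its mark bit individually.
fn page_has_live_objects_per_object(
    objects: &[usize],
    is_marked: impl Fn(usize) -> bool,
) -> bool {
    objects.iter().any(|&addr| is_marked(addr))
}
```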
I further tested the relation between min heap values and the number of benchmark threads. I am using …
What is the second column in the table? The Rust-mimalloc MarkSweep implementation?
Yes. Rust-mimalloc mark sweep, with thread-local blocks returned in each GC.
Oh? So it's actually much more competitive than previously thought? Could you try lusearch on the latest DaCapo version instead, as we know the min heap for different collectors on that version.
My current conclusion is that we need a larger min heap when there are more mutator threads with mimalloc mark sweep. As we see the same issue with the library mimalloc, this is not likely a bug in our Rust mimalloc implementation.
The first table in the original post compares with non-moving Immix (with a 64K block size) and SemiSpace. It shows that the min heap of mark sweep (with mimalloc) is pretty bad. In the long term, we would either need to fix it in some way, or use another allocator for mark sweep. I can run it with the latest DaCapo, but I don't think it will change the conclusion.
While both show the trend, the raw values are different, with the Rust-mimalloc having a (considerably) smaller min heap than the library mimalloc. What I'm stuck on is that originally here (#688 (comment)), the library mimalloc is significantly more space-efficient than the Rust-mimalloc and is around the same as SemiSpace, but in the last comment here (#688 (comment)), the trend has flipped, with the library mimalloc being less space-efficient than the Rust-mimalloc. The key difference is that the benchmarks have changed.
Right, but as I've mentioned above, something doesn't quite make sense. If Rust-mimalloc were worse than the library mimalloc, you'd expect to see it in the lusearch results as well. The only reason I bring up the newer DaCapo version is that I have min heap values for Immix, SemiSpace, and GenCopy for that version. Also, did you try tcmalloc or some other malloc like Hoard or jemalloc?
I didn't try other malloc libraries. I should do that. I will get the min heap with those builds on the latest DaCapo.
This pull request introduces an allocator based on Microsoft's mimalloc (https://www.microsoft.com/en-us/research/uploads/prod/2019/06/mimalloc-tr-v1.pdf), used in a MarkSweep GC. This closes #348. This PR has a few leftover issues: #688

General changes in this PR (if requested, I can separate these changes into different PRs):
* `destroy_mutator()`:
  * now takes `&mut Mutator` instead of `Box<Mutator>`. Reclaiming the `Mutator` memory is up to the binding. This allows the binding to copy the `Mutator` value into its thread-local data structure, in which case Rust cannot reclaim the memory as if it were a `Box`.
  * now calls each allocator for its allocator-specific `on_destroy()` behavior.
* Extract `Chunk` and `ChunkMap` from Immix, and make them general for other policies to use.
* Changes in `VMBinding` constants:
  * `LOG_MIN_ALIGNMENT` is removed. We still provide `MIN_ALIGNMENT`. This avoids the confusion that a binding may override one constant but not the other.
  * `MAX_ALIGNMENT_SHIFT` is removed. We provide `MAX_ALIGNMENT` for the same reason as above.
  * Add `USE_ALLOCATION_OFFSET: bool`. If a binding never uses an allocation offset (i.e. `offset == 0`), it can set this to `false`, and MMTk core can use this for some optimizations.
* Changes in `ObjectModel`:
  * Add `UNIFIED_OBJECT_REFERENCE_ADDRESS: bool` to allow a binding to tell us whether the object reference uses the same address as its object start and its to_address.
  * Add `OBJECT_REF_OFFSET_LOWER_BOUND` to allow a binding to tell us roughly where the object reference points to (with respect to the allocation address).
  * Change related methods to cope with this change.

Mark sweep changes in this PR:
* Add two features for mark sweep:
  * `eager_sweeping`: sweep unused memory eagerly in each GC. Without this feature, unused memory is swept when we attempt to allocate from it.
  * `malloc_mark_sweep`: the same as our previous `malloc_mark_sweep`. When this is used, the mark sweep plan uses `MallocSpace` and `MallocAllocator`.
* Move the current `policy::mallocspace` to `policy::marksweepspace::malloc_ms`.
* Add `policy::marksweepspace::native_ms` for the mark sweep space with mimalloc.
* Add `util::alloc::free_list_allocator`.

Co-authored-by: Yi Lin <[email protected]>
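For illustration, a binding's thread-exit path under the new `destroy_mutator()` signature might look like the sketch below. Only the `&mut Mutator` signature comes from this PR; the thread-local struct, its fields, and the exact import paths are assumptions.

```rust
// Sketch under assumptions: `BindingThreadLocal` and `on_thread_exit` are
// made-up binding-side names; `destroy_mutator(&mut Mutator)` is the API
// change described in this PR.
use mmtk::memory_manager;
use mmtk::vm::VMBinding;
use mmtk::Mutator;

struct BindingThreadLocal<VM: VMBinding> {
    // The binding copied the `Mutator` value into its own thread-local
    // storage, so it is no longer a `Box` that MMTk core could reclaim.
    mutator: Mutator<VM>,
}

fn on_thread_exit<VM: VMBinding>(tls: &mut BindingThreadLocal<VM>) {
    // Runs each allocator's `on_destroy()` clean-up; reclaiming the
    // `Mutator` memory itself is now the binding's responsibility.
    memory_manager::destroy_mutator(&mut tls.mutator);
    // ... the binding then drops/frees its thread-local storage itself.
}
```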
@wks spotted an issue in #1112 (comment): we do not reset the block state back to unmarked. In this case, once a block is marked, it stays marked. We may sweep a marked block and reuse it, but we never release the block. This causes a memory leak. I reran some min heap measurements to compare with the numbers we recorded in this issue.

Min heap

This compares to the min heap measurement in the original post (#688 (comment)).
After the fix, the min heap values are reasonable.

Min heap with the number of mutator threads

This compares to the min heap increase due to the number of mutator threads that we observed in #688 (comment).
Overall, the min heap values are better, but they still scale with the number of mutator threads.
But this is just due to multiple queries executing at the same time, so naturally the min heap is larger. I'm not sure if it's actually a real problem. One way to confirm is to check the MS heap sizes in comparison to SemiSpace for equivalent thread counts.
I think it is just the nature of the lusearch benchmark. It allocates lots and lots of objects, but keeps only a small number of objects alive. And each MiMalloc allocator keeps many blocks (for each size class) locally, and cannot defragment. As a result, the memory reserved per mutator is more pronounced for lusearch + MiMalloc. The min heap size for bump-pointer allocators should also scale with the number of mutators, but not with such a steep slope.
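To make the per-mutator reservation concrete, here is a rough back-of-the-envelope calculation; every constant below is an assumption for illustration, not a measured value.

```rust
// Back-of-the-envelope only: block size, number of size classes, and blocks
// retained per class are all assumed values, not measurements.
fn main() {
    let block_size: usize = 64 * 1024; // assume 64 KB blocks
    let size_classes: usize = 40;      // assume ~40 size classes in use
    let retained_per_class: usize = 1; // assume one partially-used block each

    for threads in [1usize, 8, 24, 48] {
        // Memory that can sit idle in thread-local block lists, on top of
        // the live data, because other mutators cannot allocate into it.
        let reserved = threads * size_classes * retained_per_class * block_size;
        println!("{threads:>2} mutators: up to ~{} MB reserved locally",
                 reserved / (1024 * 1024));
    }
}
```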
I measured the min heap for a few other plans. I have some doubts about the min heap value for mark compact with 48 threads: it keeps timing out at 10 minutes for any heap size smaller than 53m. It may have a smaller min heap if we allowed more time, but I just report what I got. Mark sweep with mimalloc still has the worst scaling among them (if we ignore mark compact with 48 threads). Mark sweep is a non-moving plan and it suffers from fragmentation, but non-moving Immix also suffers from fragmentation, yet its min heap with 48 threads is only 33m (compared to 14m with 1 thread).
Yeah, that is the issue with MiMalloc. But anyway, with the fix in #1112, the min heap for mark sweep with mimalloc is reasonable now.
A potential solution could be to return the allocator-local blocks to the global pool for every GC in …
We are doing that. See the table in the original post. All the following measurements include that. |
Ah right, I forgot the status of what we had already done. Sorry!
This PR refactors mark sweep.
* It removes the data race for accessing block lists, as discussed in #1103 (comment).
* It fixes an issue where the block state did not get reset before GC, which caused marked blocks to stay alive forever: #688 (comment).
* It fixes an issue where empty blocks did not get released for lazy sweeping.
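A minimal sketch of the block-state fix, with simplified stand-in types (MMTk's real `native_ms` block metadata lives in side metadata and has different names):

```rust
// Simplified stand-ins; not MMTk's actual `native_ms` definitions.
#[derive(Clone, Copy, PartialEq)]
enum BlockState {
    Unallocated,
    Unmarked,
    Marked,
}

struct Block {
    state: BlockState,
}

impl Block {
    /// Called for every allocated block in the Prepare phase of each GC.
    /// Without this reset, a block that was ever marked stays `Marked`
    /// forever, so `release_if_dead` below never frees it -- the leak
    /// described in this issue.
    fn prepare(&mut self) {
        if self.state == BlockState::Marked {
            self.state = BlockState::Unmarked;
        }
    }

    /// Called in the Release phase: a block that is still `Unmarked`
    /// after tracing holds no live objects and can be returned to the
    /// global page resource.
    fn release_if_dead(&mut self) -> bool {
        if self.state == BlockState::Unmarked {
            self.state = BlockState::Unallocated;
            return true; // caller releases the block's pages
        }
        false
    }
}
```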
Should we close this issue, or keep it open with low priority?
I suggest we keep this open and dig deeper when we have time. It is still worth investigating the details of the per-mutator memory overhead of mark sweep. One thing we can do is print out the fragmentation of each block, whether locally reserved or globally available. That way, we know exactly where the extra memory has gone.
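A rough sketch of what such a diagnostic could look like; the `BlockInfo` fields and where the block list comes from are hypothetical, not an existing MMTk API.

```rust
// Hypothetical per-block summary; in a real implementation the fields
// would come from MMTk's block and cell metadata.
struct BlockInfo {
    size_class_bytes: usize,
    cells_total: usize,
    cells_live: usize,
    owned_by_mutator: bool, // locally reserved vs. globally available
}

fn report_fragmentation(blocks: &[BlockInfo]) {
    for (i, b) in blocks.iter().enumerate() {
        let occupancy = b.cells_live as f64 / b.cells_total as f64 * 100.0;
        println!(
            "block {:4}  size-class {:6}B  {:4}/{:4} cells live ({:5.1}%)  {}",
            i,
            b.size_class_bytes,
            b.cells_live,
            b.cells_total,
            occupancy,
            if b.owned_by_mutator { "local" } else { "global" }
        );
    }
}
```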
This tracks leftover issues with our mimalloc allocator and the mark sweep plan that uses it.
Zeroing
Currently, zeroing happens when we allocate into a cell, which is very fine-grained. We should use more coarse-grained zeroing: for example, we can zero when we sweep a whole block, or when we sweep contiguous cells.
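A sketch of the difference, assuming a fixed block size and contiguous cell runs; this is illustrative, not the current implementation.

```rust
use std::ptr;

const BLOCK_BYTES: usize = 64 * 1024; // assumed block size

/// Fine-grained (current): zero one cell right before it is handed out.
unsafe fn zero_cell_on_alloc(cell: *mut u8, cell_bytes: usize) {
    ptr::write_bytes(cell, 0, cell_bytes);
}

/// Coarse-grained (proposed): zero a whole block once, when it is swept
/// and found completely free.
unsafe fn zero_block_on_sweep(block_start: *mut u8) {
    ptr::write_bytes(block_start, 0, BLOCK_BYTES);
}

/// Coarse-grained (proposed): zero a contiguous run of dead cells in one
/// call while sweeping builds the free list.
unsafe fn zero_contiguous_cells(run_start: *mut u8, run_bytes: usize) {
    ptr::write_bytes(run_start, 0, run_bytes);
}
```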
Min heap
The min heap of the current implementation (based on #643) is:
Mimalloc uses thread-local free lists. So instead of having a global free list for each size class, a mimalloc allocator has a thread-local free list for each size class. In the original implementation, an allocator always allocates from its own free lists, not others'. We do the same in our GC'd version: the GC identifies live objects, and lets each allocator sweep its own dead cells. In this case, if an allocator has recyclable blocks for certain size classes but does not use them frequently, other allocators cannot reuse those blocks; they have to allocate new blocks from global memory. A simplified sketch of the allocator-local structure follows.
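The names and the bin count below are stand-ins for illustration, not the actual `FreeListAllocator` definition.

```rust
// Simplified stand-in for the thread-local structure; not MMTk's actual
// `FreeListAllocator`. The bin count is an assumed value.
const SIZE_CLASS_COUNT: usize = 49; // assumed number of size classes ("bins")

/// Blocks of one size class owned by a single mutator.
struct BlockList {
    blocks: Vec<usize>, // block start addresses
}

/// Per-mutator allocator state: one block list per size class. Other
/// mutators cannot allocate into these blocks even when they contain
/// plenty of free cells, which is the source of the overhead above.
struct FreeListAllocatorSketch {
    available_blocks: [BlockList; SIZE_CLASS_COUNT], // blocks with free cells
    consumed_blocks: [BlockList; SIZE_CLASS_COUNT],  // fully used blocks
}
```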
We solved part of the problem by abandoning blocks at GC time: in each GC, every allocator abandons its local block lists and returns the blocks to the global pool (in the same way as if the owning thread had died). For new allocations, allocators first attempt to get blocks from the global pool. This helps the min heap (from Column 2 to Column 1). A sketch of the abandon step is shown below.
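Again, the global pool and lock here are simplified stand-ins, not the actual MMTk types.

```rust
use std::sync::Mutex;

// Simplified stand-ins; not the actual MMTk types.
struct BlockList { blocks: Vec<usize> }
struct GlobalPool { abandoned: Mutex<Vec<BlockList>> }

/// Run for every allocator during the release phase of each GC: hand all
/// thread-local block lists back to the global pool, just as if the owning
/// thread had exited. New allocations then try the abandoned lists in the
/// global pool before requesting fresh blocks.
fn abandon_local_blocks(local_lists: &mut Vec<BlockList>, global: &GlobalPool) {
    let mut pool = global.abandoned.lock().unwrap();
    pool.append(local_lists); // moves everything out, leaving `local_lists` empty
}
```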
However, compared with non-moving Immix, we are still worse. We will need to investigate further.