Native Image GC Improvements #2386

Open
christianwimmer opened this issue Apr 23, 2020 · 9 comments

@christianwimmer

christianwimmer commented Apr 23, 2020

GraalVM Native Image CE currently provides a simple (non-parallel, non-concurrent) stop&copy GC. There are various areas where the GC can be improved. This issue captures ideas. Actual work should be done under separate issues, but linked from this issue so that everyone who wants to work on GC performance gets an overview of who is working on what.

TLAB implementation and sizing algorithm

The heap is divided into chunks, and currently full chunks are used as the TLAB. It is desirable to have a reasonably large chunk size (currently 1 MByte), which is often too big for a TLAB. Especially when many threads are started and some threads have very low allocation rates compared to others, GCs are started too often. It can also lead to pathological cases: when ChunkSize * NumberOfThreads > YoungGenerationSize, there are not enough chunks for all threads and the system starts a GC continuously.

To improve this, the TLAB implementation should be decoupled from the chunk management so that many TLABs can fit into one chunk. The TLAB size can then be adjusted per thread based on that thread's allocation rate, i.e., threads that allocate a lot still get a whole chunk, while threads that barely allocate get a small TLAB.
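
A minimal sketch of what such per-thread TLAB sizing could look like; the class, field, and constant names below are hypothetical and only illustrate the resizing heuristic, not the actual SubstrateVM implementation:

```java
// Hypothetical sketch: a per-thread TLAB whose size is derived from the
// thread's recent allocation rate instead of always being a full 1 MByte chunk.
final class ThreadLocalAllocBuffer {
    static final long CHUNK_SIZE = 1024 * 1024; // aligned chunk size (1 MByte)
    static final long MIN_TLAB_SIZE = 8 * 1024; // lower bound for threads that barely allocate

    private long desiredSize = MIN_TLAB_SIZE;
    private long allocatedSinceLastGc;

    /** Record bytes handed out from this TLAB so the next size can be derived. */
    void recordAllocation(long bytes) {
        allocatedSinceLastGc += bytes;
    }

    /**
     * Called at each GC: threads that allocated a lot grow towards a whole chunk,
     * threads that barely allocated shrink towards the minimum.
     */
    void resizeAfterGc() {
        long target = allocatedSinceLastGc / 2; // simple damping heuristic
        desiredSize = Math.max(MIN_TLAB_SIZE, Math.min(CHUNK_SIZE, target));
        allocatedSinceLastGc = 0;
    }

    long desiredSize() {
        return desiredSize;
    }
}
```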

Performance improvements in the GC/heap implementation itself

  • Cluster related image heap objects to improve cache locality, reduce the footprint, and reduce the number of objects scanned from the remembered set
  • Optimize object pinning: do not keep other objects in the chunk alive and avoid creating additional objects
  • Optimize concurrency in low-level memory management (CommittedMemoryProvider)
  • Use a single survivor space instead of one space per object age to avoid internal fragmentation
  • Use prefetch instructions before copying an object to hide the memory latency
  • Hold more data in the chunk header to avoid querying the space
  • Implement exact write barriers: the current write barrier marks the whole object as dirty, which is a poor fit for large object arrays (see the card-marking sketch after this list)
  • Decrease the size of the write barrier by redesigning it (it would be significantly smaller if objects in unaligned and aligned chunks could be treated the same)
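
As a rough illustration of the exact-barrier item above, here is a card-marking sketch: the exact variant dirties only the card covering the written slot, while an imprecise variant dirties the whole object and forces the GC to rescan a large array entirely. All names and constants are made up for illustration and do not correspond to the SubstrateVM barrier:

```java
// Illustrative card-table write barrier; 512-byte cards are assumed.
final class CardTableBarrier {
    static final int CARD_SHIFT = 9; // 2^9 = 512-byte cards
    static final byte DIRTY = 1;

    final byte[] cardTable;
    final long heapBase;

    CardTableBarrier(long heapBase, long heapSize) {
        this.heapBase = heapBase;
        this.cardTable = new byte[(int) (heapSize >>> CARD_SHIFT) + 1];
    }

    /** Exact barrier: dirty only the card covering the written field or array slot. */
    void postWriteBarrierExact(long slotAddress) {
        cardTable[(int) ((slotAddress - heapBase) >>> CARD_SHIFT)] = DIRTY;
    }

    /** Imprecise barrier: dirty all cards spanned by the object, so the GC must rescan it completely. */
    void postWriteBarrierWholeObject(long objectAddress, long objectSize) {
        int first = (int) ((objectAddress - heapBase) >>> CARD_SHIFT);
        int last = (int) ((objectAddress + objectSize - 1 - heapBase) >>> CARD_SHIFT);
        for (int card = first; card <= last; card++) {
            cardTable[card] = DIRTY;
        }
    }
}
```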

Implement a mark&compact GC for the old generation

The stop&copy GC has a high memory overhead during GC. In the worst case, twice as much memory is needed when the whole heap is reachable, because all objects are copied during a full GC. If the OS cannot provide any memory during GC, then the VM exits because the heap is in an inconsistent state.

For the old generation, a mark&compact algorithm avoids additional memory overhead because compaction happens in place.
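
A high-level sketch of the classic sliding (Lisp-2 style) mark&compact phases, to make the "compaction happens in place" point concrete; the interface below is invented for illustration and is not SubstrateVM code:

```java
// Sketch of in-place sliding compaction: mark, compute forwarding addresses,
// update references, then move objects. No copy reserve is needed.
interface OldGenHeap {
    Iterable<Object> objectsInAddressOrder();
    long sizeOf(Object obj);
    boolean isMarked(Object obj);
    void setForwardingAddress(Object obj, long newAddress);
    long forwardingAddress(Object obj);
    void updateReferences(Object obj); // rewrite each field via forwardingAddress()
    void moveTo(Object obj, long newAddress);
}

final class MarkCompact {
    static void collect(OldGenHeap heap, long oldGenBase) {
        // Phase 1: mark the transitive closure from the roots (omitted here).

        // Phase 2: compute forwarding addresses by sliding live objects left.
        long free = oldGenBase;
        for (Object obj : heap.objectsInAddressOrder()) {
            if (heap.isMarked(obj)) {
                heap.setForwardingAddress(obj, free);
                free += heap.sizeOf(obj);
            }
        }
        // Phase 3: update all references to point at the forwarding addresses.
        for (Object obj : heap.objectsInAddressOrder()) {
            if (heap.isMarked(obj)) {
                heap.updateReferences(obj);
            }
        }
        // Phase 4: slide objects to their new locations, in place.
        for (Object obj : heap.objectsInAddressOrder()) {
            if (heap.isMarked(obj)) {
                heap.moveTo(obj, heap.forwardingAddress(obj));
            }
        }
    }
}
```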

Error handling

  • Throw an out-of-memory error if too much time is spent in the GC (a sketch of such an overhead check follows below).
  • Handle out-of-memory conditions during VM operations more gracefully.
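
For the first item, a minimal sketch of a "GC overhead limit" style check, similar in spirit to HotSpot's -XX:+UseGCOverheadLimit; the thresholds and names are made up for illustration:

```java
// Hypothetical policy: give up with an OutOfMemoryError when almost all time
// is spent in GC over several consecutive full collections.
final class GcOverheadPolicy {
    static final double MAX_GC_TIME_FRACTION = 0.98; // more than 98% of elapsed time in GC
    static final int CONSECUTIVE_COLLECTIONS = 5;

    private int exceededCount;

    /** Called after each full collection with the measured GC and mutator times. */
    void checkAfterGc(long gcNanos, long mutatorNanos) {
        long total = gcNanos + mutatorNanos;
        double gcFraction = total == 0 ? 0.0 : (double) gcNanos / total;
        exceededCount = gcFraction > MAX_GC_TIME_FRACTION ? exceededCount + 1 : 0;
        if (exceededCount >= CONSECUTIVE_COLLECTIONS) {
            throw new OutOfMemoryError("GC overhead limit exceeded");
        }
    }
}
```
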
@fniephaus
Member

fniephaus commented Aug 10, 2020

To your list, @christianwimmer, I'd like to add mechanisms that could be useful to language implementors, such as:

  • Some languages, such as Ruby and Smalltalk, allow the enumeration of all instances of a given class (ObjectSpace.each_object(Numeric) and Behavior>>#allInstances). This is often implemented in the GC using the marking phase. In TruffleSqueak and TruffleRuby, custom tracing algorithms written in Java are used to make things work, but a GC-based solution could provide much better performance (see the sketch after this list).

  • Some languages expose GC mechanisms (e.g. forced full/incremental GC, finalizers, pinning) and statistics (e.g. number of objects, eden size, ...) through some API (e.g. Python's gc module and Ruby's ObjectSpace).

  • Some languages may support other mechanisms that would be best implemented in the GC. Smalltalk, for example, supports the infamous #become: method to swap the references of objects. Some variants of it (e.g. ProtoObject>>becomeForward:, which makes all variables in the entire system that used to point to the receiver point to the argument) are usually implemented in the GC. It would be cool if the SVM GC could provide an API that lets language developers implement such features.
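
To make the instance-enumeration item concrete, here is a hypothetical sketch of the kind of GC-assisted API a Truffle language could build ObjectSpace.each_object or #allInstances on top of; neither the HeapWalker interface nor its implementation exists in SubstrateVM today:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical hook: visit every live object, e.g. piggy-backed on the GC's marking phase. */
interface HeapWalker {
    void walkLiveObjects(Consumer<Object> visitor);
}

final class ObjectSpace {
    private final HeapWalker walker;

    ObjectSpace(HeapWalker walker) {
        this.walker = walker;
    }

    /** Collect all live instances of the given class (e.g. Numeric in Ruby). */
    <T> List<T> allInstances(Class<T> clazz) {
        List<T> result = new ArrayList<>();
        walker.walkLiveObjects(obj -> {
            if (clazz.isInstance(obj)) {
                result.add(clazz.cast(obj));
            }
        });
        return result;
    }
}
```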

@rodrigo-bruno
Contributor

Several of these reference or object tracking features are currently provided in HotSpot through JVMTI. If native-image eventually supports JVMTI (even if only partially), these would come naturally.

The downside of most of these features is that the more hooks you add to the GC implementation to call back into user code, the slower the GC becomes (more checks for callbacks) and the harder it becomes to update, because you are locked into the user-facing APIs. Tracing APIs (in JVMTI) are an example, as they require header space that might also be needed by the GC. Another issue is finalizers, which are tricky to implement correctly and efficiently.

Pinning, on the other hand, is a relatively simple feature to implement and can unlock very interesting use cases where accelerators read/write directly from your managed heap. In HotSpot, only Shenandoah supports pinning, but I think G1 could also support it (and thus Native Image could as well).

@fniephaus
Member

Thanks for your comment, @rodrigo-bruno!

I'm somewhat familiar with JVMTI and know that it supports, for example, IterateOverInstancesOfClass. From within a Truffle language, however, I'm not sure you can easily access this API. If I understand correctly, JVMTI, as the name suggests, is a tooling API intended for external tools such as VisualVM. I believe you could connect to JVMTI from within a Truffle language, but that feels more like a terrible hack. I tried that at some point but gave up on the idea fairly quickly.

But since this functionality exists, it shouldn't be too hard to provide similar APIs in Truffle, for example through TruffleRuntime.

@rodrigo-bruno
Contributor

Hi @fniephaus, that is correct: JVMTI was (AFAIK) originally intended as an interface for external tools. However, JVMTI is implemented on top of JNI, which you can use to build other abstractions.

I am not super familiar with Truffle's internal workings, but couldn't you create an interface on top of JNI and expose it to Truffle languages (some extra plumbing inside Truffle would be required)? Most likely the implementation of such an interface would be JVM-specific (i.e., one for HotSpot, one for native image).

@christianwimmer
Author

@fniephaus Some features you mention, like a #become to change the class of a Java object, are certainly not desirable because they cannot be implemented in every GC. Offering an API that can only be implemented easily in a simple, non-concurrent, non-parallel GC is poisonous for future optimizations.

SVM supports object pinning; this is actually a feature that we need in every GC due to the way the low-level C interface is implemented.
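
For reference, a small usage sketch of that pinning support via the org.graalvm.nativeimage.PinnedObject API (the actual native call is omitted):

```java
import org.graalvm.nativeimage.PinnedObject;
import org.graalvm.word.PointerBase;

public class PinningExample {
    static void passToNativeCode(byte[] buffer) {
        // While the try-with-resources block is open, the GC must not move or
        // free the buffer, so its raw address can be handed to C code.
        try (PinnedObject pin = PinnedObject.create(buffer)) {
            PointerBase rawAddress = pin.addressOfArrayElement(0);
            // ... call into C with rawAddress while the pin is held ...
        } // the buffer is unpinned here and may be moved by the GC again
    }
}
```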

@christianwimmer
Author

@rodrigo-bruno

> However, JVMTI is implemented on top of JNI, which you can use to build other abstractions.

That is not correct; I would say JNI and JVMTI do not have anything to do with each other. JVMTI is an interface into the Java VM that offers hooks JNI does not give you.

@rodrigo-bruno
Contributor

> That is not correct; I would say JNI and JVMTI do not have anything to do with each other. JVMTI is an interface into the Java VM that offers hooks JNI does not give you.

Yup, you are right! I was under the impression that JVMTI relied heavily on existing JNI functions, but after checking the code it seems that it is quite tightly coupled to the JVM internals.

@fniephaus
Member

> @fniephaus Some features you mention, like a #become to change the class of a Java object, are certainly not desirable because they cannot be implemented in every GC. Offering an API that can only be implemented easily in a simple, non-concurrent, non-parallel GC is poisonous for future optimizations.

Of course, there are always trade-offs. If some APIs needed for such mechanisms are not common enough and possibly hinder future optimizations, it may not make sense to provide them. My main goal here was to give some examples of what kind of things language implementers might want to do with the GC. Forcing incremental/full GCs and enumerating objects should be relatively easy to provide. Another thing to consider is API compatibility between the JVM and SVM: it's probably undesirable if a language behaves significantly differently when running on the JVM than it does on SVM. As an example, there's no proper Java API to force incremental GCs on the JVM at the moment, and forcing full GCs instead may cause a significant performance penalty.

@dufoli

dufoli commented Nov 10, 2020

Hello, I worked on Mono a long time ago, and they did a lot of work on their GC. There is a good article describing its features; some of them could be very interesting:
https://www.mono-project.com/docs/advanced/garbage-collector/sgen/

I know that, in the beginning, they used an external GC (the Boehm GC), but it was not precise enough, so they chose to implement a new one, and it took years to get it done. So why not reuse an external library (or fork it)? Just my 2 cents.
