Memory leak on TestContainers IT tests #41156
Comments
/cc @geoand (testing)
Running the tests without … The profiling file: GrpcNoTLSWithCompressionTest_and_7_more_2024_06_12_123021.jfr.zip
Also of note... These containers are not cleaned up right away. They will only go away after all TestContainer tests are finished:
Also on the configuration without … https://drive.google.com/file/d/13hvrMHEYO4KWX13zDd5xV7BkGTqKxDNf/view?usp=drive_link (The file is too big for GitHub)
Configuration without … Uploading GrpcNoTLSWithCompressionTest_and_7_more_2024_06_12_143605.jfr.zip…
@brunobat I think we will need the heap dumps for the tests, not the JFR file.
I've also noticed recent memory increases in dev mode tests, which tipped the pact ecosystem CI tests over the edge on June 5th, and got worse again around June 10th. I have #38814 open for a general frog-boiling in the memory requirements of dev/test mode.
If you want a hungry test to play with ...
That 140m setting passed on 3.11, but fails in 3.12 (you won't see an OOM in most cases, just test failures, but I'm pretty sure it's OOMing under the covers somewhere). In 3.6, 120m would work. With …
It would be interesting to get a heap dump in both cases.
The heap dump thing is kind of interesting and complicated, IMO. There's another comment on it that I put somewhere ... A single heap dump isn't going to be useful, at all (and I know that's not what you're suggesting). So we'd need two heap dumps, from (say) Quarkus 3.6 and 3.7, or Quarkus 3.11 and 3.12. Then we can compare them.

But let's assume the problem isn't a memory leak, but just that ... we do more, so we have more allocations. So the question we want the dumps to answer isn't "which objects are leaking?" but rather "which objects do I have now that I didn't have before?" or "what objects do I have more of now?" ... but there's then another challenge, which is "how do I get dumps at the exact same stage in the good/bad case?" and "what stage should I be getting dumps at?"

The simplest thing to do is to set the heap to an identical, constrained, value in the good and bad cases (because different heap sizes would seem like lunacy) and then wait for the OOM dump to be generated. But because the bad case is hungrier, it will OOM earlier than the good case, so it will be at a different stage in the Quarkus lifecycle, so there will be a ton of 'noise' in the delta between the two dumps. The objects on the heap will be different in the two cases, because the app is doing different things when the dump happens.

To avoid that issue you could start the applications with an unconstrained heap, let them fully finish starting, and then trigger a dump. In that case, you're avoiding the problem where the two apps are at different stages, but you have a new problem, which is that memory usage peaks during test execution. So you're not measuring 'at high tide', you're measuring after things have settled down. The offending objects might have already been freed and removed from the heap.

And there's another question - what problem are we trying to solve? Maybe we did increase memory requirements, and we get lucky and pinpoint the exact classes responsible. But they're just the straw that broke the camel's back - maybe those new classes are doing something really useful. If so, we won't do anything with them after identifying them - we'd need to reduce memory requirements by finding something else in the heap which is waste that we can optimise away. Or maybe the problem is a leak, and then we wouldn't want to compare 3.6 and 3.7, but 3.7 mid-test and 3.7 at the end of tests.

All of which is a long way of saying that I did gather various dumps when I was investigating the 3.6->3.7 increase, and after looking in MAT for a while, nothing was jumping out. I concluded that the increase in memory requirements wasn't a big jump, just an increase that tipped these tests over the threshold into an OOM, but investigating that would have needed decisions about what problem was being solved, and a different set of dumps.
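One way to sidestep the "same stage in the good/bad case" problem described above is to trigger the dump programmatically at a fixed point in the test lifecycle rather than waiting for an OOM. Below is a minimal sketch, not part of the actual investigation: it assumes you can call a helper from a JUnit hook (e.g. an `@AfterAll` callback) at the point you want to compare, and the class name and output path are illustrative.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

// Illustrative helper: dump the heap at a well-defined point in the test
// lifecycle so dumps from two Quarkus versions are comparable in MAT.
public final class HeapDumper {

    public static void dump(String outputFile) {
        try {
            HotSpotDiagnosticMXBean bean = ManagementFactory.newPlatformMXBeanProxy(
                    ManagementFactory.getPlatformMBeanServer(),
                    "com.sun.management:type=HotSpotDiagnostic",
                    HotSpotDiagnosticMXBean.class);
            // 'true' = dump only live objects, i.e. a GC runs before the dump
            bean.dumpHeap(outputFile, true);
        } catch (Exception e) {
            throw new IllegalStateException("Heap dump failed", e);
        }
    }

    public static void main(String[] args) {
        dump("after-startup.hprof"); // hypothetical file name
    }
}
```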
@holly-cummins In any case, I would be interested in having a dump for each version :). I agree we might not find anything but there's something odd. I wonder if it could be due to the new version of Netty because, in my branch which logs something every time we try to access a closed CL, I see things like:
that I wasn't seeing before. So we have some …
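For context, here is a sketch of the kind of instrumentation that comment refers to: logging any class load attempted after the loader has been closed. This is purely illustrative and is not the actual Quarkus class loader change; the class and field names are hypothetical.

```java
// Hypothetical wrapper illustrating the idea of logging access to a closed
// class loader; not the actual QuarkusClassLoader implementation.
public class ClosableLoggingClassLoader extends ClassLoader {

    private volatile boolean closed;

    public ClosableLoggingClassLoader(ClassLoader parent) {
        super(parent);
    }

    public void close() {
        closed = true;
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
        if (closed) {
            // This is where accesses from finalizers and lingering threads show up
            System.err.println("Class " + name + " requested from a closed class loader");
            new Throwable("stacktrace for closed CL access").printStackTrace();
        }
        return super.loadClass(name, resolve);
    }
}
```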
I can help on this @gsmet: there are 2 problems here:
I think both could be fixed on the Netty side, i.e. …
@franz1981 so one problem with the finalizers is that they are actually loading classes and they are doing so from closed class loaders. Ideally, it would be nice to get rid of them in the Quarkus case if we could. I can get you a stacktrace later if you need it.
Yeah, I think I can, regardless, expose a sys property to disable finalizers and just rely on FastThreadLocal::onRemoval, which depends on stopping the event loop threads instead. And that would help. I can clean up the costly fields as soon as the race to free them up is won, too, which would help to defer the problem - because the pool cache will become a stale empty instance, at worst.
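For reference, a minimal sketch of what relying on `FastThreadLocal` cleanup instead of finalizers looks like, under the assumption that the owning threads are actually stopped or `removeAll()` is called on them so the callback runs; the resource type and class names here are made up for illustration:

```java
import io.netty.util.concurrent.FastThreadLocal;

// Sketch: release per-thread resources via FastThreadLocal's removal callback
// instead of a finalizer, so nothing has to touch (or load classes from) a
// class loader that has already been closed.
public class PooledBufferHolder {

    private static final FastThreadLocal<byte[]> SCRATCH = new FastThreadLocal<>() {
        @Override
        protected byte[] initialValue() {
            return new byte[64 * 1024]; // hypothetical per-thread scratch buffer
        }

        @Override
        protected void onRemoval(byte[] value) {
            // Runs when the owning thread calls remove()/removeAll(), making the
            // cleanup deterministic rather than finalizer-driven.
        }
    };

    public static byte[] scratch() {
        return SCRATCH.get();
    }

    public static void shutdownCurrentThread() {
        // What the runtime would do when tearing down event loop threads
        FastThreadLocal.removeAll();
    }
}
```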
Let me know if it is key for me to work on this so I can prioritize it, ok? @gsmet
We are having this issue too
@ikovalyov how are things going? Did you open a separate issue or is it the same issue reported here?
Describe the bug
When trying to run 10 IT tests with TestContainers in this PR: #39032
The CI build fails with an OOM exception on all JDKs. Example: https://github.com/quarkusio/quarkus/actions/runs/9283986768/job/25545787368#step:16:4086
Local execution reveals a memory leak:
Initial analysis pointed to the new feature being added, however that was just a tipping point. The bulk of the memory being used is related to the `QuarkusTestExtension` class or to TestContainers. When running all the tests, the containers are not immediately shut down. They linger in the system until the last test is finished, according to my view of Podman Desktop.
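For comparison, this is what eager container cleanup looks like when Testcontainers is driven directly (as opposed to through Quarkus Dev Services, where the lifecycle is managed for you). A minimal sketch assuming a plain JUnit 5 test and an arbitrary image:

```java
import org.junit.jupiter.api.AfterAll;
import org.junit.jupiter.api.BeforeAll;
import org.junit.jupiter.api.Test;
import org.testcontainers.containers.GenericContainer;

// Sketch of the expected lifecycle: the container is stopped as soon as the
// test class finishes, instead of lingering until the whole suite is done.
class EagerContainerLifecycleTest {

    static GenericContainer<?> container =
            new GenericContainer<>("docker.io/library/nginx:alpine") // arbitrary image
                    .withExposedPorts(80);

    @BeforeAll
    static void start() {
        container.start();
    }

    @AfterAll
    static void stop() {
        // Explicit stop; without it, cleanup is deferred to Ryuk/JVM shutdown.
        container.stop();
    }

    @Test
    void containerIsRunningDuringTheTest() {
        assert container.isRunning();
    }
}
```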
I profiled the test execution and I'm attaching the JFR stream file here:
GrpcNoTLSNoCompressionTest_and_9_more_2024_06_12_102823.jfr.zip
Further analysis reveals this on `QuarkusTestExtension`:

Expected behavior
No out of memory when running multiple IT tests with TestContainers
Actual behavior
CI will fail with an out of memory error
How to Reproduce?
Output of `uname -a` or `ver`

Darwin xxxxxxxxxxx 22.6.0 Darwin Kernel Version 22.6.0: Mon Apr 22 20:49:37 PDT 2024; root:xnu-8796.141.3.705.2~1/RELEASE_ARM64_T6000 arm64
Output of `java -version`

openjdk version "21.0.2" 2024-01-16 LTS
OpenJDK Runtime Environment Temurin-21.0.2+13 (build 21.0.2+13-LTS)
OpenJDK 64-Bit Server VM Temurin-21.0.2+13 (build 21.0.2+13-LTS, mixed mode)
Quarkus version or git rev
git version 2.38.1
Build tool (i.e. output of `mvnw --version` or `gradlew --version`)

Apache Maven 3.9.6 (bc0240f3c744dd6b6ec2920b3cd08dcc295161ae)
Maven home: /Users/xxxxxxx/.sdkman/candidates/maven/current
Java version: 21.0.2, vendor: Eclipse Adoptium, runtime: /Users/xxxxxxx/.sdkman/candidates/java/21.0.2-tem
Default locale: en_PT, platform encoding: UTF-8
OS name: "mac os x", version: "13.6.7", arch: "aarch64", family: "mac"
Additional information
Running Podman Desktop Version 1.10.3 (1.10.3)
See companion discussion on Zulip: https://quarkusio.zulipchat.com/#narrow/stream/187038-dev/topic/TestContainer.20IT.20tests.20out.20of.20memory