Memory leak on TestContainers IT tests #41156
Comments
/cc @geoand (testing)
Running the tests without … The profiling file: GrpcNoTLSWithCompressionTest_and_7_more_2024_06_12_123021.jfr.zip
Also of note... These containers are not cleaned up right away. They will only go away after all TestContainer tests are finished:
Also on the configuration without … https://drive.google.com/file/d/13hvrMHEYO4KWX13zDd5xV7BkGTqKxDNf/view?usp=drive_link (The file is too big for GitHub)
Configuration without … Uploading GrpcNoTLSWithCompressionTest_and_7_more_2024_06_12_143605.jfr.zip…
@brunobat I think we will need the heap dumps for the tests, not the JFR file.
I've also noticed recent memory increases in dev mode tests, which tipped the pact ecosystem CI tests over the edge on June 5th, and got worse again around June 10th. I have #38814 open for a general frog-boiling in the memory requirements of dev/test mode.
If you want a hungry test to play with ...
That 140m setting passed on 3.11, but fails in 3.12 (you won't see an OOM in most cases, just test failures, but I'm pretty sure it's OOMing under the covers somewhere). In 3.6, 120m would work. With …
It would be interesting to get a heap dump in both cases.
The heap dump thing is kind of interesting and complicated, IMO. There's another comment on it that I put somewhere ... A single heap dump isn't going to be useful, at all (and I know that's not what you're suggesting). So we'd need two heap dumps, from (say) Quarkus 3.6 and 3.7, or Quarkus 3.11 and 3.12. Then we can compare them.

But let's assume the problem isn't a memory leak, but just that ... we do more, so we have more allocations. So the question we want the dumps to answer isn't "which objects are leaking?" but rather "which objects do I have now that I didn't have before?" or "what objects do I have more of now?" ... but there's then another challenge, which is "how do I get dumps at the exact same stage in the good/bad case?" and "what stage should I be getting dumps at?"

The simplest thing to do is to set the heap to an identical, constrained, value in the good and bad cases (because different heap sizes would seem like lunacy) and then wait for the OOM dump to be generated. But because the bad case is hungrier, it will OOM earlier than the good case, so it will be at a different stage in the Quarkus lifecycle, so there will be a ton of 'noise' in the delta between the two dumps. The objects on the heap will be different in the two cases, because the app is doing different things when the dump happens.

To avoid that issue you could start the applications with an unconstrained heap, let them fully finish starting, and then trigger a dump. In that case, you're avoiding the problem where the two apps are at different stages, but you have a new problem, which is that memory usage peaks during test execution. So you're not measuring 'at high tide', you're measuring after things have settled down. The offending objects might have already been freed and removed from the heap.

And there's another question - what problem are we trying to solve? Maybe we did increase memory requirements, and we get lucky and pinpoint the exact classes responsible. But they're just the straw that broke the camel's back - maybe those new classes are doing something really useful. If so, we won't do anything with them after identifying them - we'd need to reduce memory requirements by finding something else in the heap which is waste that we can optimise away. Or maybe the problem is a leak, and then we wouldn't want to compare 3.6 and 3.7, but 3.7 mid-test and 3.7 at the end of tests.

All of which is a long way of saying that I did gather various dumps when I was investigating the 3.6->3.7 increase, and after looking in MAT for a while, nothing was jumping out. I concluded that the increase in memory requirements wasn't a big jump, just an increase that tipped these tests over the threshold into an OOM, but investigating that would have needed decisions about what problem was being solved, and a different set of dumps.
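One way to sidestep the "same stage in the good/bad case" problem described above is to trigger the dump programmatically at a fixed point in the test lifecycle rather than waiting for an OOM. Below is a minimal sketch, not part of the actual investigation: it assumes you can call a helper from a JUnit hook (e.g. an `@AfterAll` callback) at the point you want to compare, and the class name and output path are illustrative.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

// Illustrative helper: dump the heap at a well-defined point in the test
// lifecycle so dumps from two Quarkus versions are comparable in MAT.
public final class HeapDumper {

    public static void dump(String outputFile) {
        try {
            HotSpotDiagnosticMXBean bean = ManagementFactory.newPlatformMXBeanProxy(
                    ManagementFactory.getPlatformMBeanServer(),
                    "com.sun.management:type=HotSpotDiagnostic",
                    HotSpotDiagnosticMXBean.class);
            // 'true' = dump only live objects, i.e. a GC runs before the dump
            bean.dumpHeap(outputFile, true);
        } catch (Exception e) {
            throw new IllegalStateException("Heap dump failed", e);
        }
    }

    public static void main(String[] args) {
        dump("after-startup.hprof"); // hypothetical file name
    }
}
```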
@holly-cummins In any case, I would be interested in having a dump for each version :). I agree we might not find anything but there's something odd. I wonder if it could be due to the new version of Netty because, in my branch which logs something every time we try to access a closed CL, I see things like:
that I wasn't seeing before. So we have some …
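For context, here is a sketch of the kind of instrumentation that comment refers to: logging any class load attempted after the loader has been closed. This is purely illustrative and is not the actual Quarkus class loader change; the class and field names are hypothetical.

```java
// Hypothetical wrapper illustrating the idea of logging access to a closed
// class loader; not the actual QuarkusClassLoader implementation.
public class ClosableLoggingClassLoader extends ClassLoader {

    private volatile boolean closed;

    public ClosableLoggingClassLoader(ClassLoader parent) {
        super(parent);
    }

    public void close() {
        closed = true;
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
        if (closed) {
            // This is where accesses from finalizers and lingering threads show up
            System.err.println("Class " + name + " requested from a closed class loader");
            new Throwable("stacktrace for closed CL access").printStackTrace();
        }
        return super.loadClass(name, resolve);
    }
}
```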
I can help on this @gsmet: there are 2 problems here:
I think both could be fixed on the Netty side, i.e. …
@franz1981 so one problem with the finalizers is that they are actually loading classes and they are doing so from closed class loaders. Ideally, it would be nice to get rid of them in the Quarkus case if we could. I can get you a stacktrace later if you need it.
Yeah, I think I can, regardless, expose a sys property to disable finalizers and just rely on FastThreadLocal::onRemoval, which depends on stopping the event loop threads instead. And that would help. I can clean up the costly fields as soon as the race to free them up is won, too, which would help to defer the problem - because the pool cache will become a stale empty instance, at worst.
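For reference, a minimal sketch of what relying on `FastThreadLocal` cleanup instead of finalizers looks like, under the assumption that the owning threads are actually stopped or `removeAll()` is called on them so the callback runs; the resource type and class names here are made up for illustration:

```java
import io.netty.util.concurrent.FastThreadLocal;

// Sketch: release per-thread resources via FastThreadLocal's removal callback
// instead of a finalizer, so nothing has to touch (or load classes from) a
// class loader that has already been closed.
public class PooledBufferHolder {

    private static final FastThreadLocal<byte[]> SCRATCH = new FastThreadLocal<>() {
        @Override
        protected byte[] initialValue() {
            return new byte[64 * 1024]; // hypothetical per-thread scratch buffer
        }

        @Override
        protected void onRemoval(byte[] value) {
            // Runs when the owning thread calls remove()/removeAll(), making the
            // cleanup deterministic rather than finalizer-driven.
        }
    };

    public static byte[] scratch() {
        return SCRATCH.get();
    }

    public static void shutdownCurrentThread() {
        // What the runtime would do when tearing down event loop threads
        FastThreadLocal.removeAll();
    }
}
```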
Let me know if it is key for me to work on this so I can prioritize it, ok? @gsmet
We are having this issue too
@ikovalyov how are things going? Did you open a separate issue or is it the same issue reported here?
Describe the bug
When trying to run 10 IT tests with TestContainers in this PR: #39032
The CI build fails with an OOM exception on all JDKs. Example: https://github.com/quarkusio/quarkus/actions/runs/9283986768/job/25545787368#step:16:4086
Local execution reveals a memory leak:
Initial analysis pointed to the new feature being added, however that was just a tipping point. The bulk of the memory being used is related to the `QuarkusTestExtension` class or to TestContainers. When running all the tests, the containers are not immediately shut down. They linger in the system until the last test is finished, according to my view of Podman Desktop.
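For comparison, this is what eager container cleanup looks like when Testcontainers is driven directly (as opposed to through Quarkus Dev Services, where the lifecycle is managed for you). A minimal sketch assuming a plain JUnit 5 test and an arbitrary image:

```java
import org.junit.jupiter.api.AfterAll;
import org.junit.jupiter.api.BeforeAll;
import org.junit.jupiter.api.Test;
import org.testcontainers.containers.GenericContainer;

// Sketch of the expected lifecycle: the container is stopped as soon as the
// test class finishes, instead of lingering until the whole suite is done.
class EagerContainerLifecycleTest {

    static GenericContainer<?> container =
            new GenericContainer<>("docker.io/library/nginx:alpine") // arbitrary image
                    .withExposedPorts(80);

    @BeforeAll
    static void start() {
        container.start();
    }

    @AfterAll
    static void stop() {
        // Explicit stop; without it, cleanup is deferred to Ryuk/JVM shutdown.
        container.stop();
    }

    @Test
    void containerIsRunningDuringTheTest() {
        assert container.isRunning();
    }
}
```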
I profiled the test execution and I'm attaching the JFR stream file here:
GrpcNoTLSNoCompressionTest_and_9_more_2024_06_12_102823.jfr.zip
Further analysis reveals this on `QuarkusTestExtension`:

Expected behavior
No out of memory when running multiple IT tests with TestContainers
Actual behavior
CI will fail with an out of memory error
How to Reproduce?
Output of `uname -a` or `ver`

Darwin xxxxxxxxxxx 22.6.0 Darwin Kernel Version 22.6.0: Mon Apr 22 20:49:37 PDT 2024; root:xnu-8796.141.3.705.2~1/RELEASE_ARM64_T6000 arm64
Output of `java -version`

openjdk version "21.0.2" 2024-01-16 LTS
OpenJDK Runtime Environment Temurin-21.0.2+13 (build 21.0.2+13-LTS)
OpenJDK 64-Bit Server VM Temurin-21.0.2+13 (build 21.0.2+13-LTS, mixed mode)
Quarkus version or git rev
git version 2.38.1
Build tool (i.e. output of `mvnw --version` or `gradlew --version`)

Apache Maven 3.9.6 (bc0240f3c744dd6b6ec2920b3cd08dcc295161ae)
Maven home: /Users/xxxxxxx/.sdkman/candidates/maven/current
Java version: 21.0.2, vendor: Eclipse Adoptium, runtime: /Users/xxxxxxx/.sdkman/candidates/java/21.0.2-tem
Default locale: en_PT, platform encoding: UTF-8
OS name: "mac os x", version: "13.6.7", arch: "aarch64", family: "mac"
Additional information
Running Podman Desktop Version 1.10.3 (1.10.3)
See companion discussion on Zulip: https://quarkusio.zulipchat.com/#narrow/stream/187038-dev/topic/TestContainer.20IT.20tests.20out.20of.20memory