[Bug]: gradle check failing with java heap OutOfMemoryError #2324
Comments
Please let us know what kind of cleanup you need.
If you have any commands to clear the Java heap, please let us know. Thanks. |
@dreamer-89 can you explore options to break up the gradle check tasks into separate modules? The current gradle check is very monolithic, and we cannot sustain it much longer by continuously adding hardware resources. |
Another occurrence in opensearch-project/OpenSearch#3924. In this case, the build was successful but was still marked as failed due to heap space. |
Thanks @peterzhuamazon for providing the different options. At this point, I am not sure which cleanup will help. I think we need a deep dive to understand which process is consuming the heap space. As this repeats across instances, it would be good to find the root cause first and then try the appropriate options from above. |
Thanks @bbarani for the comment. I suspect the heap space failure is not due to limited hardware but to some resource leak causing a heap memory issue. I think we need to spend brain cycles on the existing failure. It could be a test, but I only started observing these failures recently. |
Hi @dreamer-89, let me clarify: those 3 options have ALL already been applied to our script. |
Additionally, I now remember that I even kill any opensearch processes that still exist. |
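For anyone who wants to confirm on an agent that nothing is left over between runs, here is a minimal Java sketch that lists processes whose command line contains a marker string. It is not the cleanup script used on the Jenkins agents. The class name `LeftoverProcessScan` and the default marker `opensearch` are assumptions made for illustration, and the sketch only reports matches; it does not kill anything. It needs Java 9 or later for `ProcessHandle`.

```java
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

// Hypothetical helper, not part of the actual Jenkins cleanup script.
public class LeftoverProcessScan {

    // Case-insensitive substring match on a command line.
    public static boolean matches(String commandLine, String needle) {
        return commandLine.toLowerCase(Locale.ROOT)
                .contains(needle.toLowerCase(Locale.ROOT));
    }

    // Returns every visible process, other than this JVM, whose command line
    // contains the needle. Processes whose command line cannot be read
    // (for example, those owned by another user) are skipped.
    public static List<ProcessHandle> findByCommandLine(String needle) {
        long self = ProcessHandle.current().pid();
        return ProcessHandle.allProcesses()
                .filter(p -> p.pid() != self)
                .filter(p -> p.info().commandLine()
                        .map(c -> matches(c, needle))
                        .orElse(false))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // "opensearch" is an assumed marker; pass a different one as the first argument.
        String needle = args.length > 0 ? args[0] : "opensearch";
        for (ProcessHandle p : findByCommandLine(needle)) {
            System.out.println(p.pid() + " " + p.info().commandLine().orElse("?"));
        }
    }
}
```

Running it before and after a gradle check on the same agent would show whether any matching process survives the run.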
@dblock @dreamer-89 Please let us know if you can think of any additional cleanups or tweaks that need to be implemented for this Gradle check. We are still seeing a lot of flaky errors and memory issues even after increasing the hardware resources, and we should focus on fixing them to improve developer velocity. CC: @peterzhuamazon @CEHENKLE |
Can we define the recommended heap specifications to ensure container memory? Flags like |
I don't have any useful advice. But one thing that I did notice - we used to run gradle checks without these problems with a previous set of hardware/jenkins/instances, did we downgrade from that capacity-wise? |
We have actually increased the hardware resources, and I think we are using c5.24xlarge instances now. I assume you are seeing more errors because we are running gradle checks more frequently now that we have eliminated the need to comment 'start gradle check' on the PR to begin the process. Having said that, I am seeing a very interesting pattern where it passes for a certain amount of time and then fails continuously for a certain amount of time before it starts passing again. |
I hear from @peterzhuamazon that the JDK may have been changed? I would double check that G1GC is enabled. Are these instances recycled every build? |
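One way to double check the collector is to ask the JVM itself. The following is a minimal sketch, assuming it is run with the same JDK and JVM options as the gradle check workers. The class name `HeapAndGcReport` is hypothetical, and the `G1` name prefix is an assumption based on how HotSpot names its G1 collector beans.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical diagnostic, not part of the OpenSearch build.
public class HeapAndGcReport {

    // Maximum heap the JVM will attempt to use, in MiB.
    public static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    // Names of the garbage collectors registered in this JVM.
    public static List<String> collectorNames() {
        return ManagementFactory.getGarbageCollectorMXBeans().stream()
                .map(GarbageCollectorMXBean::getName)
                .collect(Collectors.toList());
    }

    // Assumption: HotSpot reports G1 collectors with names starting with "G1",
    // such as "G1 Young Generation" and "G1 Old Generation".
    public static boolean looksLikeG1(List<String> names) {
        return names.stream().anyMatch(n -> n.startsWith("G1"));
    }

    public static void main(String[] args) {
        List<String> names = collectorNames();
        System.out.println("Max heap (MiB): " + maxHeapMiB());
        System.out.println("Collectors: " + names);
        System.out.println("G1 in use: " + looksLikeG1(names));
    }
}
```

The reported max heap is also a quick way to see what heap limit the JVM actually picked up on the agent.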
I am currently writing a setup to permanently recycle all the instances. |
Hi @dblock @dreamer-89 @bbarani After this change, gradle check generally completes in 27-36 min, quicker than the original 45-60 min. Most of the remaining failures are legitimate failures: Though a flaky test will still occasionally show up: It seems to me that gradle check leaves behind some zombie process or memory leak that causes the continuous flaky runs on the same runner. Restricting each brand-new runner to a single run temporarily resolves the issue and increases the success rate. The sample size is still small, but it already shows a different trend in success rate: |
Remember, this is not a permanent solution; we would like the core team to help identify the cause within gradle check and fix the root problem. Thanks. |
@peterzhuamazon I think we need to switch gradle jobs to run increasingly with |
Hi @dblock that is already done and that is not related here. We have been using this method since fork Jenkins. |
I saw another out of memory error in https://build.ci.opensearch.org/job/gradle-check/874/ |
I observed the heap issue locally as well. Created opensearch-project/OpenSearch#3973 to track it in core. |
Hi @dreamer-89, let us know when you have the fix on your side so we can implement it in the Jenkins workflow. |
Describe the bug
The public Jenkins gradle check job fails due to a Java heap OutOfMemoryError. I raised this bug to get a better understanding of the existing gradle check function and to prevent this failure on different machines. The EC2 hosts were previously upgraded to c5.24xlarge instances. I am not sure whether the instances need some cleanup.
To reproduce
https://build.ci.opensearch.org/job/gradle-check/381
Expected behavior
Job should not fail with
Screenshots
If applicable, add screenshots to help explain your problem.
Host / Environment
Running on EC2 (Amazon_ec2_cloud) - jenkinsAgentNode-Jenkins-Agent-Ubuntu2004-X64-c524xlarge-Single-Host (i-093f212ad4f5e9583) in /var/jenkins/workspace/gradle-check
Additional context
No response
Relevant log output