[BUG] Cosmos SDK Netty ByteBuf leak #9023
Comments
Thanks for reporting this @allenhumphreys.
@allenhumphreys thanks for reporting this and putting out a great summary of the issue. We are working on this and plan to have a fix out there by the end of this week.
Thanks, @kushagraThapar. Will there be a new release with this fix as well?
@timwhit yes, there will be.
This PR has fixed this issue: #9211. A new version will be released this week.
@allenhumphreys @timwhit - The fix for this has been released - v3.7.1
It will take some time to test this since we’ve switched over to gateway mode.
@allenhumphreys - Sure, please close this issue whenever you like, or if you want, I can close it now, and you can re-open it if you see the issue again. However you want it.
Closing this, @allenhumphreys please reopen if you see the issue again.
@kushagraThapar I've updated my services and re-enabled DIRECT mode in my staging environment. I have observed what seems to be new behavior possibly related to this fix. I am receiving
@allenhumphreys Yes, let's open a new ticket, so we can track it separately. Please paste the complete stack trace in addition to operation details and other important details that can help us. Meanwhile, continue using GATEWAY mode while we fix this.
@kushagraThapar we were explicitly told yesterday by Azure Operations to NOT use Gateway mode as it has no SLA and there are no guarantees about its uptime. This is an extremely serious issue and I'll be escalating the new ticket as soon as @allenhumphreys creates it.
@timwhit - as soon as @allenhumphreys creates the new ticket, we will start working on it. We are taking these fixes with high priority.
@timwhit - Regarding running on GATEWAY mode, I understand that it is advised to NOT use GATEWAY mode, but the reason the mode is still in place is for situations like these. So it is perfectly okay to run on GATEWAY mode until the DIRECT mode is completely fixed.
Describe the bug
There is a leak in the Cosmos SDK when using the DIRECT connection mode.
The problem seems to only become obvious when services are under continuous medium load. We've had Cosmos deployed for many months and this hasn't been an issue, but shortly after going live and having sustained traffic increase above a certain level, all our services started running out of heap memory and experiencing OutOfMemory exceptions.
A leak this bad is highly concerning. Fortunately we were able to switch our service to use GATEWAY mode to bypass the issue.
We went through several mitigation steps, including:
All to have it become abundantly clear that the leak was coming from the Cosmos SDK, and that there was nothing we could do.
We subsequently enabled ADVANCED Netty leak detection to get a record of the last code to access the ByteBuf. That log showed that the last access is a call to release() at RntbdResponse.java:198. That code has suspicious comments, but more importantly it helped me confirm that attempting to switch to GATEWAY connection mode would likely bypass the leaking behavior, which it did.
Netty Leak Detection Log
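For anyone trying to capture a similar record, here is a minimal sketch of one way to request ADVANCED leak detection. It is not taken from the services described in this issue. It assumes Netty reads the `io.netty.leakDetection.level` system property when its leak detector class is first loaded, so the property has to be set before any Netty-based client is created. The class and method names (`LeakDetectionConfig`, `enableAdvancedLeakDetection`) are hypothetical and exist only for illustration.

```java
// Hypothetical helper: only the property name and its "advanced" value come from Netty.
final class LeakDetectionConfig {
    // System property Netty consults for its resource leak detection level.
    static final String LEVEL_PROPERTY = "io.netty.leakDetection.level";

    private LeakDetectionConfig() {}

    /**
     * Requests ADVANCED leak detection unless a level was already supplied,
     * for example via -Dio.netty.leakDetection.level=... on the command line.
     * Returns the level that is configured after the call.
     */
    static String enableAdvancedLeakDetection() {
        String existing = System.getProperty(LEVEL_PROPERTY);
        if (existing != null && !existing.trim().isEmpty()) {
            return existing; // respect an explicit operator setting
        }
        System.setProperty(LEVEL_PROPERTY, "advanced");
        return "advanced";
    }

    public static void main(String[] args) {
        String level = enableAdvancedLeakDetection();
        System.out.println(LEVEL_PROPERTY + "=" + level);
        // Assumption: any Netty-based client (such as the Cosmos client in
        // DIRECT mode) is constructed only after this point.
    }
}
```

Passing the same property as a JVM flag (`-Dio.netty.leakDetection.level=advanced`) avoids any ordering concerns. ADVANCED mode records recent access points for sampled buffers, so it adds overhead and is usually enabled only while investigating.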
To Reproduce
Steps to reproduce the behavior:
Unfortunately, I have not been able to reproduce this in isolation, only in production, as it seems to require a certain amount of sustained traffic to happen.
Screenshots
Heap usage during leaking behavior:
Heap usage after switching to GATEWAY mode:
Setup (please complete the following information):
We are running our services in AKS using OpenJDK 8
Gradle Runtime Dependency Report