High native memory usage in certificate revocation checking #52577
Also, it's worth noting that this seems to be an unmanaged memory leak. We only ran into the problem when setting `CheckCertificateRevocationList`.
Tagging subscribers to this area: @dotnet/ncl
@alexrosenfeld10 I am a bit confused. You mention ServicePointManager on .NET Core 3.1/5.0, but the referenced code of the C# Snowflake connector uses HttpClientHandler. Also, it seems the problem happens only when the profiler is enabled.
@karelz thanks for the reply, I can understand some of your confusion. Allow me to detail further below.

At the time we started debugging and drafting the original issues, the Snowflake Connector set these static properties on the `ServicePointManager`. Thanks for noting that they're obsolete; that's what I found in my research here as well. However, they still caused us a serious problem. It's more or less dumb luck that the Snowflake library is open source and we were able to make our own fork with those settings removed. If it weren't, I guess we'd just be in support-ticket land.

On the profiling front, yes, we only found this to be an issue with New Relic enabled. However, I'd imagine that's less to do with New Relic's agent and more to do with the fact that certs were updated on the exact same day and time this memory leak started to occur in our services.

In order to repro, we'd probably need:
I am unsure of what
The NewRelic profiler code can be found here: https://github.com/newrelic/newrelic-dotnet-agent/tree/main/src/Agent/NewRelic/Profiler. I've tried to
Unless the code in the Snowflake connector is doing something unexpected, I recommend raising an issue against snowflake-connector-net first to get clarification on their internal logic, which is very confusing and might be buggy (intended settings not taking any effect on .NET Core). Unless there is a repro with definite proof of an issue with SocketsHttpHandler, there is not much we can do about this.
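For reference, a minimal repro of the kind asked for above might look like the sketch below: a hedged illustration only, using `SocketsHttpHandler` with online revocation checking. The endpoint URL is a placeholder; a server whose certificate chain actually exercises the revocation lookup is assumed.

```csharp
// Minimal sketch (not the original repro): SocketsHttpHandler with online
// certificate revocation checking. https://example.com/ is a placeholder.
using System;
using System.Net.Http;
using System.Net.Security;
using System.Security.Cryptography.X509Certificates;
using System.Threading.Tasks;

class Program
{
    static async Task Main()
    {
        var handler = new SocketsHttpHandler
        {
            SslOptions = new SslClientAuthenticationOptions
            {
                CertificateRevocationCheckMode = X509RevocationMode.Online
            },
            // Short connection lifetime so each iteration performs a new TLS
            // handshake (and therefore a new revocation check).
            PooledConnectionLifetime = TimeSpan.FromSeconds(1)
        };

        using var client = new HttpClient(handler);

        for (int i = 0; i < 100; i++)
        {
            using var response = await client.GetAsync("https://example.com/");
            Console.WriteLine($"{i}: {response.StatusCode}, working set {Environment.WorkingSet / (1024 * 1024)} MB");
            await Task.Delay(1100); // let the pooled connection expire
        }
    }
}
```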
We get it, and we are also surprised by the behavior we've been seeing. The Snowflake connector is not specifically to blame other than its tweaking of the `ServicePointManager` settings.
OK, so there is a mystery and lots of components involved.
I started working on one while we were actively debugging this. I believe Danny has taken up the helm for the time being on that. I'm happy to work together to get things in a state where this can be reproduced in an MVC way.
Yeah, I am trying to reproduce the problem with fewer components involved. Still missing some part of the incantation thus far, but I will keep y'all posted.
Anyone know of a no-op profiler I can plug in instead of the NewRelic Profiler? I'm all about eliminating variables 😀
I am able to reproduce the problem with this repo: https://github.com/alexrosenfeld10/DotNetNewRelicMemoryLeak. You'll need a New Relic token though. I'm looking for ways to eliminate New Relic, ideas welcome. I fear that even if I remove New Relic as a dependency, I'll still need something I can hit that has a somewhat recently expired certificate. Not sure how to reconcile that.
The repo has been updated so that New Relic isn't needed or included (though we still hit an endpoint whose certificate exercises the revocation check). Seems that the interplay is that Snowflake sets `ServicePointManager.CheckCertificateRevocationList = true`, and subsequent web requests then show the memory growth. If you have questions, let @alexrosenfeld10 or @tdg5 know and we'll help in whatever way we can. https://github.com/alexrosenfeld10/DotNetNewRelicMemoryLeak
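For illustration, a stripped-down sketch of that interplay (assuming the connector-style static switch plus a plain `HttpWebRequest`; this is not the code in the linked repo, and the URL is a placeholder):

```csharp
// Sketch of the described interplay: the global static switch the connector
// flips, followed by ordinary HttpWebRequest calls. Placeholder URL.
using System;
using System.Net;

class Repro
{
    static void Main()
    {
        // What the Snowflake connector effectively does via ServicePointManager:
        ServicePointManager.CheckCertificateRevocationList = true;

        for (int i = 0; i < 50; i++)
        {
            var request = (HttpWebRequest)WebRequest.Create("https://example.com/");
            using (var response = (HttpWebResponse)request.GetResponse())
            {
                Console.WriteLine($"{i}: {response.StatusCode}, working set {Environment.WorkingSet / (1024 * 1024)} MB");
            }
        }
    }
}
```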
Is New Relic in the same process as Snowflake setting `ServicePointManager.CheckCertificateRevocationList`?
@karelz perhaps our messages passed each other in flight. A minimal repro w/o New Relic, but using `ServicePointManager.CheckCertificateRevocationList`, is what's linked above. It hits an HTTPS endpoint that exercises the revocation check. I don't see any reason why it couldn't be a console app instead of an ASP.NET app, but I'm afraid I have to leave that to you, as my employer has already invested significantly in this bug 😢
I'll rename it. Please see https://github.com/alexrosenfeld10/DotNetWebRequestCertMemLeak
It is worth noting that even though the reproduction takes the more granular approach of enabling `CheckCertificateRevocationList` on the handler rather than globally, it is not at all clear to me why this more granular case was impacting WebRequest, but alas, it seemed to be.
Not an issue with `SocketsHttpHandler` itself. Repros under Linux when validating a revoked certificate with online revocation checking enabled: you can observe massive memory growth (up to ~100 MB) per request.
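A sketch of that kind of minimal repro, assuming the revoked certificate is validated directly through `X509Chain` with online revocation checking; the certificate file name is a placeholder and this may differ from the exact repro used here:

```csharp
// Hypothetical minimal repro: build a chain for a known-revoked certificate
// with online revocation checking and watch the working set between builds.
// "revoked.cer" is a placeholder.
using System;
using System.Security.Cryptography.X509Certificates;

class ChainRepro
{
    static void Main()
    {
        using var cert = new X509Certificate2("revoked.cer");

        for (int i = 0; i < 20; i++)
        {
            using (var chain = new X509Chain())
            {
                chain.ChainPolicy.RevocationMode = X509RevocationMode.Online;
                chain.ChainPolicy.RevocationFlag = X509RevocationFlag.EntireChain;
                bool valid = chain.Build(cert);
                Console.WriteLine($"{i}: Build returned {valid}");
            }

            // Rule out managed/finalizer-held memory before reading the working set.
            GC.Collect();
            GC.WaitForPendingFinalizers();
            GC.Collect();
            Console.WriteLine($"    working set {Environment.WorkingSet / (1024 * 1024)} MB");
        }
    }
}
```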
@wfurt as far as I can tell, the
Tagging subscribers to this area: @dotnet/ncl, @vcsjones
We will have to move this one to 7.0 at this point (6.0 is locking down and we're heads down on that).
Yes, @bartonjs is probably most knowledgeable in that regard. We may cache the revocation results as well, but I'm not sure. If you can reproduce it by simply calling the certificate validation directly, that would help narrow it down.
The fact that it eventually stabilizes means it isn't a leak, per se. Since the test is doing a GC.Collect/GC.WaitForPendingFinalizers, that rules out us having some largeish objects tracked by SafeHandles that got pushed out to finalization. My guess is that it's just down to malloc/free's "lazy" implementation (free won't give the memory back to the OS by default; it keeps it around for the next malloc) combined with how many small allocations OpenSSL requests for loading a CRL... and it probably just ends up being a while before the "ready for a future malloc" list ends up nicely aligning with when OpenSSL does need the occasional larger chunk. We already have a similar issue described in #55672
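One way to test that allocator-retention hypothesis (not something done in this thread, just a glibc-specific sketch) is to ask malloc to hand free pages back to the OS and see whether the working set drops:

```csharp
// Hedged sketch: call glibc's malloc_trim(3) and compare the working set
// before and after. glibc-only; e.g. musl has no malloc_trim.
using System;
using System.Runtime.InteropServices;

static class GlibcMalloc
{
    // int malloc_trim(size_t pad); returns 1 if any memory was released to the OS.
    [DllImport("libc.so.6")]
    private static extern int malloc_trim(UIntPtr pad);

    public static void TrimAndReport()
    {
        long before = Environment.WorkingSet;
        int released = malloc_trim(UIntPtr.Zero);
        long after = Environment.WorkingSet;
        Console.WriteLine($"malloc_trim returned {released}: " +
                          $"{before / (1024 * 1024)} MB -> {after / (1024 * 1024)} MB");
    }
}
```

If the working set drops sharply after `malloc_trim`, the memory was sitting on glibc's free lists rather than being leaked.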
Right, it's not exactly a leak, but 100 MB per request seems excessive.
Thanks for putting eyes on this, folks. As a user here, I agree with @MihaZupan: the per-request growth is the big problem for apps. The fact that this was caused by calls happening inside our APM tooling (calls that happen a lot in the background) ballooned our memory pretty darn fast. We saw up to 8 GB after a few minutes of running sometimes.
If there is a simple repro, one can P/Invoke mallinfo(3): https://man7.org/linux/man-pages/man3/mallinfo.3.html
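A hedged sketch of that suggestion: P/Invoking glibc's mallinfo(3) to watch the native heap between requests. This is glibc-specific, and mallinfo's int fields overflow past 2 GB (newer glibc versions offer mallinfo2 with size_t fields):

```csharp
// Sketch: read glibc's malloc statistics via P/Invoke. glibc-only.
using System;
using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Sequential)]
struct MallInfo
{
    public int arena;    // non-mmapped space allocated from the OS (bytes)
    public int ordblks;  // number of free chunks
    public int smblks;   // number of free fastbin blocks
    public int hblks;    // number of mmapped regions
    public int hblkhd;   // space allocated in mmapped regions (bytes)
    public int usmblks;  // unused, always 0
    public int fsmblks;  // space in freed fastbin blocks (bytes)
    public int uordblks; // total allocated space (bytes)
    public int fordblks; // total free space (bytes)
    public int keepcost; // top-most, releasable space (bytes)
}

static class NativeHeap
{
    [DllImport("libc.so.6")]
    private static extern MallInfo mallinfo();

    public static void Report(string label)
    {
        MallInfo info = mallinfo();
        Console.WriteLine($"{label}: allocated {info.uordblks / (1024 * 1024)} MB, " +
                          $"free {info.fordblks / (1024 * 1024)} MB, mmapped {info.hblkhd / (1024 * 1024)} MB");
    }
}
```

Calling `NativeHeap.Report(...)` before and after each request shows whether the growth is live native allocations (`uordblks`) or memory already freed but retained by the allocator (`fordblks`).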
Tested @MihaZupan's minimal reproduction and wasn't able to see such terrible allocation.
I observed the same results on Ubuntu 20.04 and on the Docker images mcr.microsoft.com/dotnet/core/aspnet:3.1-buster-slim and mcr.microsoft.com/dotnet/aspnet:5.0-buster-slim. It is interesting that the issue is not reproducible even on the same Docker images mentioned in the issue description. @alexrosenfeld10 - can you confirm the issue persists and is reproducible with @MihaZupan's repro?
Hey, apologies, but I don't currently have the time in my day-to-day to spend on this anymore. Right now I don't see the issue in my app since removing the cert revocation check setting. If I find more time I can try to look into the other repro, but I don't think that will be anytime soon.
I have tried to reproduce this issue multiple times with different setups and did not observe any unbounded growth of memory. I will close this issue, but feel free to reopen it if more info becomes available.
This might be related: #57213
Definitely seems related, especially if the solution is disabling the same property.
Description
Signs point to a memory leak in the Service Point Manager when the `CheckCertificateRevocationList` property is set to `true` and there is an expired cert.

Configuration

`netcoreapp3.1`, `net5.0` running inside `mcr.microsoft.com/dotnet/core/aspnet:3.1-buster-slim` and `mcr.microsoft.com/dotnet/aspnet:5.0-buster-slim`
Regression?
Not sure, maybe?
Other information
On April 12th, @tdg5 and I noticed that many services we work on and maintain started using orders of magnitude more memory, from around 500 MB to 4.5 GB give or take, increasing rapidly and then ultimately receiving an OOMKill from k8s.
As we debugged the issue, we tracked it down to this line in the C# Snowflake connector. We maintain a private fork of the C# connector and removed that line from our fork. After doing so, all of the memory pressure returned to normal.
The strange part about this is that the memory pressure spiked on running pods; there was no new code or new deployments, it just started seemingly at random. On further investigation (thanks @tdg5), this seems like a likely culprit: https://discuss.newrelic.com/t/important-upcoming-new-relic-server-certificate-update-will-impact-most-users-of-java-agent-version-6-1-and-a-few-users-of-java-agent-version-6-2-0-to-6-4-2/141711 . All of our services also use New Relic, so signs point to this being the issue. Initially we looked into debugging the problem with them (see newrelic/newrelic-dotnet-agent#544), but it seems to be more of a .NET problem with the Service Point Manager.