
High native memory usage in certificate revocation checking #52577

Closed
alexrosenfeld10 opened this issue May 7, 2021 · 49 comments
Labels
area-System.Net.Security, bug, os-linux, tenet-performance
Milestone

Comments

@alexrosenfeld10

Description

Signs point to a memory leak in ServicePointManager when the CheckCertificateRevocationList property is set to true and there is an expired certificate.

Configuration

netcoreapp3.1, net5.0 running inside mcr.microsoft.com/dotnet/core/aspnet:3.1-buster-slim and mcr.microsoft.com/dotnet/aspnet:5.0-buster-slim

Regression?

Not sure, maybe?

Other information

On April 12th, @tdg5 and I noticed that many services we work on and maintain started using orders of magnitude more memory, going from around 500 MB to 4.5 GB give or take, increasing rapidly and then ultimately getting OOMKilled by Kubernetes.

As we debugged the issue, we tracked it down to this line in the C# Snowflake connector. We maintain a private fork of the C# connector and removed that line from our fork. After doing so, all of the memory pressure returned to normal.

The strange part about this is that the memory pressure spiked on running pods: there was no new code and no new deployments, it just started seemingly at random. On further investigation (thanks @tdg5), this looks like a likely culprit: https://discuss.newrelic.com/t/important-upcoming-new-relic-server-certificate-update-will-impact-most-users-of-java-agent-version-6-1-and-a-few-users-of-java-agent-version-6-2-0-to-6-4-2/141711. All of our services also use New Relic, so signs point to this being related. Initially we looked into debugging the problem with them (see newrelic/newrelic-dotnet-agent#544), but it seems to be more of a .NET problem with ServicePointManager.

@alexrosenfeld10
Author

Also, it's worth noting that this seems to be an unmanaged memory leak. We only ran into the problem when setting CORECLR_ENABLE_PROFILING=1.

@GrabYourPitchforks GrabYourPitchforks transferred this issue from dotnet/core May 11, 2021
@dotnet-issue-labeler

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged label May 11, 2021
@ghost

ghost commented May 11, 2021

Tagging subscribers to this area: @dotnet/ncl
See info in area-owners.md if you want to be subscribed.

Issue Details


Author: alexrosenfeld10
Assignees: -
Labels:

area-System.Net, untriaged

Milestone: -

@karelz
Member

karelz commented May 11, 2021

@alexrosenfeld10 I am a bit confused. You mention ServicePointManager on .NET Core 3.1/5.0, but your referenced code of C# Snowflake connector uses HttpClientHandler.
BTW: On .NET Core, ServicePointManager/ServicePoint/HttpWebRequest are obsolete APIs, suboptimal and available only for basic compat. They are not used to implement HttpClient/HttpClientHandler/SocketsHttpHandler unlike on .NET Framework.

Also, it seems the problem happens only when the profiler is enabled (CORECLR_ENABLE_PROFILING=1). Does the repro require New Relic, or does it also happen without New Relic attached?
Overall we need to rule out the New Relic profiler doing something invasive and leaking the memory. For that, it would be best to provide a simplified, standalone HelloWorld-style repro. Is that possible?
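
For reference, on .NET Core the handler itself is what controls revocation checking, not ServicePointManager. A minimal sketch of opting in on the modern stack, assuming SocketsHttpHandler and an illustrative endpoint (not taken from the repro):

```csharp
using System;
using System.Net.Http;
using System.Net.Security;
using System.Security.Cryptography.X509Certificates;

// Revocation checking is configured on the handler, which SocketsHttpHandler honors;
// ServicePointManager settings are ignored by this code path on .NET Core.
var handler = new SocketsHttpHandler
{
    SslOptions = new SslClientAuthenticationOptions
    {
        CertificateRevocationCheckMode = X509RevocationMode.Online
    }
};
using var client = new HttpClient(handler);
Console.WriteLine(await client.GetStringAsync("https://example.com/"));
```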

@karelz karelz added the api-needs-work, new-api-needs-documentation, needs more info labels and removed the api-needs-work, new-api-needs-documentation labels May 11, 2021
@alexrosenfeld10
Author

alexrosenfeld10 commented May 11, 2021

@karelz thanks for the reply, I can understand some of your confusion. Allow me to provide further detail below:

At the time we started debugging and drafting the original issues, the Snowflake connector set these static properties on the ServicePointManager. Traveling back through the commit history, you can find them set here: https://github.com/snowflakedb/snowflake-connector-net/blob/7331e46dea0cc28d59a2660e0ee71d7d9fa4a28a/Snowflake.Data/Core/HttpUtil.cs#L40. In our testing, we've found this to be a problem with both approaches (that is, setting the property on the HttpClientHandler as well as on the ServicePointManager directly).

Thanks for noting that they're obsolete; that's what I found in my research here as well. However, they still caused us a serious problem. It's more or less dumb luck that the Snowflake library is open source and we were able to make our own fork with those settings removed. If it weren't, I guess we'd just be in support-ticket land.

On the profiling front, yes, we only found this to be an issue with NewRelic enabled. However, I'd imagine that's less to do with NewRelic's Agent and more to do with the fact that there were certs updated on the same exact day and time this memory leak started to occur in our services. In order to repro, we'd probably need:

  1. an HttpClient with the CheckCertificateRevocationList property enabled
  2. CORECLR_ENABLE_PROFILING=1
  3. Calls to a service with a revoked cert

I am unsure what CheckCertificateRevocationList really does under the covers when step 3 applies, but clearly something funky is going on. Hopefully this further detail gives you some more ideas.
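
For concreteness, a rough sketch of what steps 1 and 3 above could look like; the endpoint URL is a placeholder, and step 2 is just setting CORECLR_ENABLE_PROFILING=1 in the environment:

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

class ReproSketch
{
    static async Task Main()
    {
        // Step 1: an HttpClient with CRL checking enabled on the handler.
        var handler = new HttpClientHandler { CheckCertificateRevocationList = true };
        using var client = new HttpClient(handler);

        // Step 3: repeated calls to a host whose certificate has been revoked.
        // (Placeholder URL - substitute a service with a revoked certificate.)
        for (int i = 0; i < 100; i++)
        {
            try
            {
                await client.GetAsync("https://revoked.example.com/");
            }
            catch (HttpRequestException)
            {
                // Validation failures are expected; the interesting part is native memory growth.
            }
        }
    }
}
```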

@tdg5

tdg5 commented May 11, 2021

The NewRelic profiler code can be found here: https://github.com/newrelic/newrelic-dotnet-agent/tree/main/src/Agent/NewRelic/Profiler

I've tried to grep for indications that the profiler is using something related to ServicePointManager or HttpClient or Web or Socket or Revocation, but I didn't turn up anything interesting. Any suggestions on what I might look for? I have no experience with the Profiler API or unmanaged code so I could be way off base in my search.

@antonfirsov
Member

antonfirsov commented May 11, 2021

Unless the code in initHttpClient() runs on .NET Framework, the ServicePointManager properties should not have any impact on HttpClient's behavior, since HttpClientHandler's underlying handler is SocketsHttpHandler, which doesn't depend on ServicePointManager in any way.

I recommend raising an issue against snowflake-connector-net first to get clarification on their internal logic, which is very confusing and might be buggy (intended settings not taking effect on .NET Core). Unless there is a repro with definite proof of an issue with SocketsHttpHandler, there is not much we can do about this.

@tdg5

tdg5 commented May 11, 2021

> Unless the code in initHttpClient() runs on .NET Framework, the ServicePointManager properties should not have any impact on HttpClient's behavior, since HttpClientHandler's underlying handler is SocketsHttpHandler, which doesn't depend on ServicePointManager in any way.

We get it, and we are also surprised by the behavior we've been seeing. The Snowflake connector is not specifically to blame other than its tweaking of the ServicePointManager config. I ran a modified version of their library locally to reproduce the problem, and as part of the test I tweaked the code so it would do some configuration of ServicePointManager but never create an HttpClient, because I thought the HTTP requests were likely the source of the problem. That change broke the Snowflake connector, but it didn't stop the unmanaged memory leak. I was also very surprised.

@karelz
Member

karelz commented May 11, 2021

OK, so there is a mystery and lots of components involved.
Is anyone in a position to create a minimal HelloWorld-style repro for this without NewRelic and other products in play? (either from scratch or by removing parts of the end-to-end story)
That should navigate us towards the root cause of the problem - in whichever component/product it is.

@alexrosenfeld10
Author

I started working on one while we were actively debugging this. I believe Danny has taken up the helm for the time being on that. I'm happy to work together to get things in a state where this can be reproduced in an MVC way.

@tdg5

tdg5 commented May 11, 2021

Yeah, I am trying to reproduce the problem with fewer components involved. Still missing some part of the incantation thus far, but I will keep y'all posted.

@tdg5

tdg5 commented May 11, 2021

Anyone know of a NoOp Profiler I can plug in instead of the NewRelic Profiler? I'm all about eliminating variables 😀

@tdg5

tdg5 commented May 11, 2021

I am able to reproduce the problem with this repo: https://github.com/alexrosenfeld10/DotNetNewRelicMemoryLeak

You'll need a New Relic token though. I'm looking for ways to eliminate New Relic, ideas welcome.

I fear that even if I remove New Relic as a dependency, I'll still need something I can hit that has a somewhat recently expired certificate. Not sure how to reconcile that.

@tdg5

tdg5 commented May 11, 2021

Repo has been updated so that New Relic isn't needed or included (though we still hit collector.newrelic.com since it angers the certificate revocation flying spaghetti monster).

Seems that the interplay is: Snowflake sets ServicePointManager.CheckCertificateRevocationList = true, which doesn't actually do anything to benefit Snowflake on .NET Core. However, New Relic uses WebRequest under the hood which does seem to be impacted by ServicePointManager configurations. I don't know where that suggests the problem is, but it seems to be a .NET issue and probably has something to do with WebRequest, I'd guess 😀

If you have questions, let @alexrosenfeld10 or @tdg5 know and we'll help in whatever way we can.

https://github.com/alexrosenfeld10/DotNetNewRelicMemoryLeak
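
For reference, a minimal console sketch of the suspected interplay described above (global ServicePointManager setting plus legacy WebRequest usage); the loop count and error handling are illustrative, not the actual repro project:

```csharp
using System;
using System.Net;

class WebRequestSketch
{
    static void Main()
    {
        // Global setting the Snowflake connector used to apply.
        ServicePointManager.CheckCertificateRevocationList = true;

        // Legacy WebRequest usage, the pattern the New Relic agent relies on under the hood.
        for (int i = 0; i < 10; i++)
        {
            try
            {
                var request = (HttpWebRequest)WebRequest.Create("https://collector.newrelic.com/");
                using var response = (HttpWebResponse)request.GetResponse();
            }
            catch (WebException)
            {
                // Revocation/validation failures surface as WebException; ignore and repeat.
            }
        }
    }
}
```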

@karelz
Member

karelz commented May 11, 2021

> New Relic uses WebRequest under the hood which does seem to be impacted by ServicePointManager configurations

Is New Relic in the same process as Snowflake setting ServicePointManager? If yes, then it can be affected.
We would need a minimal repro with WebRequest then (without New Relic) -- just the same pattern of WebRequest usage.
Once we have that, I hope we can make it a console app instead of an ASP.NET app ...

@tdg5

tdg5 commented May 11, 2021

@karelz perhaps our messages passed each other in flight. A minimal repo w/o New Relic but using WebRequest is available here: https://github.com/alexrosenfeld10/DotNetNewRelicMemoryLeak

It hits a collector.newrelic.com URI (because of recent *.newrelic.com certificate revocation), but otherwise does not include any New Relic libraries or anything like that. Ignore the bad name of the repo 😁

I don't see any reason why it couldn't be a console app instead of an ASP.NET app, but I'm afraid I have to leave that to you as my employer has already invested significantly in this bug 😢

@alexrosenfeld10
Author

I'll rename it. Please see https://github.com/alexrosenfeld10/DotNetWebRequestCertMemLeak

@tdg5

tdg5 commented May 11, 2021

It is worth noting that even though the reproduction takes the configure-the-global-ServicePointManager approach, we were also seeing a memory leak when the configuration appeared to be much more granular, e.g. https://github.com/snowflakedb/snowflake-connector-net/blob/master/Snowflake.Data/Core/HttpUtil.cs#L66

It is not at all clear to me why this more granular case was impacting WebRequest, but alas, it seemed to be.

@karelz
Member

karelz commented May 12, 2021

@tdg5 looks like our responses crossed in flight. @ManickaP is now looking at the repro - she will get back to you if it is sufficient for us.

Is anyone currently blocked on some end-to-end scenarios, or were you able to apply some workarounds higher up in the stack?

@karelz karelz assigned MihaZupan and unassigned ManickaP Jul 16, 2021
@MihaZupan MihaZupan changed the title Potential memory leak in the Service Point Manager High native memory usage in certificate revocation checking Jul 22, 2021
@MihaZupan
Member

Not an issue with ServicePointManager/WebRequest.

Repro with SslStream alone: https://gist.github.com/MihaZupan/2e207c4c703272961b475e482e96afa3#file-certrevocationmemory-cs-L18-L30

It reproduces under Linux when validating a revoked certificate with the X509RevocationMode set to Offline or Online.

You can observe massive memory growth (up to ~100 MB) per AuthenticateAsClientAsync call.
It eventually stops growing at ~1 GB on my machine. If done in parallel, it consumes 2+ GB of memory on my machine.

Offline
 0:  64 MB private, 156 KB managed
 1: 205 MB private, 353 KB managed
 2: 314 MB private, 374 KB managed
 3: 326 MB private, 396 KB managed
 4: 338 MB private, 398 KB managed
 5: 338 MB private, 400 KB managed
 6: 430 MB private, 406 KB managed
 7: 430 MB private, 412 KB managed
 8: 523 MB private, 436 KB managed
 9: 534 MB private, 442 KB managed
10: 628 MB private, 454 KB managed
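
For reference, a minimal sketch of the kind of handshake loop behind numbers like these, along the lines of the linked gist (host name illustrative, not the gist's exact code):

```csharp
using System.Net.Security;
using System.Net.Sockets;
using System.Security.Cryptography.X509Certificates;
using System.Threading;
using System.Threading.Tasks;

class SslStreamRevocationSketch
{
    static async Task Main()
    {
        for (int i = 0; i < 10; i++)
        {
            await HandshakeAsync("collector.newrelic.com");
            // Inspect private/managed memory between iterations.
        }
    }

    static async Task HandshakeAsync(string host)
    {
        using var tcp = new TcpClient();
        await tcp.ConnectAsync(host, 443);

        using var ssl = new SslStream(tcp.GetStream(), leaveInnerStreamOpen: false,
            // Accept all certificates so a revoked cert doesn't abort the handshake.
            (sender, cert, chain, errors) => true);

        await ssl.AuthenticateAsClientAsync(new SslClientAuthenticationOptions
        {
            TargetHost = host,
            // Offline or Online is what triggers the native memory growth described above.
            CertificateRevocationCheckMode = X509RevocationMode.Offline
        }, CancellationToken.None);
    }
}
```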

@wfurt as far as I can tell, the X509RevocationMode isn't really used in SslStream directly, but instead passed to System.Security.Cryptography.X509Certificates's chain. Is that correct?
I have a suspicion that the issue is somewhere in OpenSslX509ChainProcessor's ProcessRevocation if the revocation mode isn't set to X509RevocationMode.NoCheck.

@ghost

ghost commented Jul 22, 2021

Tagging subscribers to this area: @dotnet/ncl, @vcsjones
See info in area-owners.md if you want to be subscribed.

Issue Details


Author: alexrosenfeld10
Assignees: MihaZupan
Labels:

area-System.Net.Security, needs more info

Milestone: 6.0.0

@karelz karelz added the bug, tenet-performance labels and removed the needs more info label Jul 22, 2021
@MihaZupan MihaZupan added the os-linux label Jul 22, 2021
@karelz
Member

karelz commented Jul 22, 2021

We will have to move this one to 7.0 at this point (6.0 is locking down and we're heads down on that).
If it impacts more customers or scenarios, we can always reconsider it for servicing later ...

@karelz karelz modified the milestones: 6.0.0, 7.0.0 Jul 22, 2021
@wfurt
Member

wfurt commented Jul 22, 2021

Yes, @bartonjs is probably most knowledgeable in that regard. We may cache the revocation results as well, but I'm not sure. If you can reproduce it by simply calling X509Chain.Build with revocation, @MihaZupan, it would clearly belong to System.Security.
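
A minimal sketch of what such an isolated X509Chain repro could look like; the certificate bytes are a placeholder input:

```csharp
using System;
using System.Security.Cryptography.X509Certificates;

static class ChainRevocationSketch
{
    // certBytes: DER/PEM-encoded leaf certificate from the revoked chain (placeholder input).
    public static void BuildWithRevocation(byte[] certBytes)
    {
        using var cert = new X509Certificate2(certBytes);
        using var chain = new X509Chain();
        chain.ChainPolicy.RevocationMode = X509RevocationMode.Online;   // or Offline
        chain.ChainPolicy.RevocationFlag = X509RevocationFlag.EntireChain;

        bool ok = chain.Build(cert);
        Console.WriteLine($"Build returned {ok}");
        foreach (X509ChainStatus status in chain.ChainStatus)
            Console.WriteLine($"{status.Status}: {status.StatusInformation}");
    }
}
```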

@bartonjs
Member

The fact that it eventually stabilizes means it isn't a leak, per se. Since the test is doing a GC.Collect/GC.WaitForPendingFinalizers, that rules out us having some largish objects tracked by SafeHandles that got pushed out to finalization.

My guess is that it's just down to malloc/free's "lazy" implementation (free won't give the memory back to the OS by default, it keeps it around for the next malloc) combined with how many small allocations OpenSSL requests for loading a CRL... and it probably just ends up being a while before the "ready for a future malloc" list ends up nicely aligning with when OpenSSL does need the occasional larger chunk.

We already have a similar issue described in #55672

@MihaZupan
Member

Right, it's not exactly a leak, but 100 MB per request seems excessive.
It's even weirder that it keeps growing so much on subsequent requests to the same host.

@alexrosenfeld10
Author

Thanks for putting eyes on this, folks. As a user here, I agree with @MihaZupan: the per-request growth is the big problem for apps. The fact that this was caused by calls happening inside our APM tooling (calls that happen a lot in the background) ballooned our memory pretty darn fast. We saw up to 8 GB after a few minutes of running sometimes.

@wfurt
Member

wfurt commented Jul 22, 2021

If there is a simple repro, one can P/Invoke https://man7.org/linux/man-pages/man3/mallinfo.3.html
That would tell us whether the memory is still in use, or whether the process size is the outcome of fragmented memory or something else.
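
A rough sketch of such a P/Invoke, assuming glibc's mallinfo(3) struct layout (note the counters are plain ints and can wrap on large heaps; newer glibc offers mallinfo2 for that reason):

```csharp
using System;
using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Sequential)]
struct MallInfo
{
    public int arena;    // non-mmapped space allocated from the OS
    public int ordblks;  // number of free chunks
    public int smblks;   // number of free fastbin blocks
    public int hblks;    // number of mmapped regions
    public int hblkhd;   // space allocated in mmapped regions
    public int usmblks;  // unused
    public int fsmblks;  // space in freed fastbin blocks
    public int uordblks; // total allocated (in-use) space
    public int fordblks; // total free space retained by the allocator
    public int keepcost; // releasable space at the top of the heap
}

static class Glibc
{
    [DllImport("libc")]
    public static extern MallInfo mallinfo();
}

class Program
{
    static void Main()
    {
        // Call between repro iterations to see whether native memory is actually
        // in use (uordblks) or just retained by malloc after being freed (fordblks).
        MallInfo info = Glibc.mallinfo();
        Console.WriteLine($"in use: {info.uordblks} B, retained free: {info.fordblks} B");
    }
}
```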

@aik-jahoda
Contributor

I tested @MihaZupan's minimal reproduction and wasn't able to see such terrible allocation.
Tested on:

- .NET Core 3.1, .NET 5
- WSL Ubuntu 20.04
NoCheck
0: 47 MB private, 135 KB managed
1: 77 MB private, 253 KB managed
2: 94 MB private, 270 KB managed
3: 94 MB private, 290 KB managed
4: 95 MB private, 283 KB managed
5: 95 MB private, 284 KB managed
6: 95 MB private, 284 KB managed
7: 95 MB private, 287 KB managed
8: 95 MB private, 304 KB managed
9: 95 MB private, 303 KB managed
10: 95 MB private, 320 KB managed
11: 95 MB private, 330 KB managed
12: 95 MB private, 317 KB managed
13: 95 MB private, 318 KB managed
14: 95 MB private, 334 KB managed
15: 95 MB private, 335 KB managed
16: 95 MB private, 321 KB managed
17: 95 MB private, 321 KB managed
18: 95 MB private, 324 KB managed
19: 95 MB private, 327 KB managed
20: 95 MB private, 345 KB managed
NoCheck
0: 55 MB private, 146 KB managed
1: 101 MB private, 282 KB managed
2: 118 MB private, 292 KB managed
3: 118 MB private, 313 KB managed
4: 119 MB private, 325 KB managed
5: 119 MB private, 335 KB managed
6: 119 MB private, 347 KB managed
7: 119 MB private, 349 KB managed
8: 119 MB private, 363 KB managed
9: 119 MB private, 367 KB managed
0: 119 MB private, 361 KB managed
1: 120 MB private, 375 KB managed
2: 120 MB private, 370 KB managed
3: 120 MB private, 379 KB managed
4: 120 MB private, 371 KB managed
5: 121 MB private, 381 KB managed
6: 121 MB private, 383 KB managed
7: 121 MB private, 388 KB managed
8: 121 MB private, 387 KB managed
9: 121 MB private, 390 KB managed
0: 121 MB private, 390 KB managed

I observed the same results on Ubuntu 20.04 and in Docker (mcr.microsoft.com/dotnet/core/aspnet:3.1-buster-slim, mcr.microsoft.com/dotnet/aspnet:5.0-buster-slim).

It is interesting that the issue is not even reproducible on the same Docker images mentioned in the issue description.

@alexrosenfeld10 - can you confirm the issue persists and is reproducible with @MihaZupan's repro?

@alexrosenfeld10
Author

Hey, apologies, but I don't currently have the time in my day-to-day to spend on this anymore.

Right now I don't see the issues in my app since removing the cert revocation check setting. If I find more time I can try to look into the other repro, but I don't think that will be anytime soon.

@rzikm rzikm self-assigned this May 26, 2022
@rzikm
Member

rzikm commented May 26, 2022

I have tried to reproduce this issue multiple times with different setups and did not observe any unbounded growth of memory. I will close this issue, but feel free to reopen it if more info becomes available.

@rzikm rzikm closed this as completed May 26, 2022
@rzikm rzikm removed their assignment May 26, 2022
@Vandersteen

This might be related: #57213

@alexrosenfeld10
Author

Definitely seems related, especially if the solution is disabling the same property

@ghost ghost locked as resolved and limited conversation to collaborators Jul 23, 2022