Too many FileWatcher threads #486

dtran808 · 2017-09-02T23:58:35Z

Hi,

I'm trying to use my resources as efficiently as possible. Recently, I did some load testing and noticed that I've been hitting the system limit of threads and it caused my app to crash. However, the CPU and memory were just fine. It's a software/config limit rather than an actual physical limit.

I dumped all my threads and seen that almost a third (100+) of my threads were named "file-watcher-/etc/ssl/certs/ca-certificates.crt". Each of these thread's stacktrace gives me:

"sun.misc.Unsafe.park(Native Method)",
"java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)",
"java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)",
"java.util.concurrent.LinkedBlockingDeque.takeFirst(LinkedBlockingDeque.java:492)",
"java.util.concurrent.LinkedBlockingDeque.take(LinkedBlockingDeque.java:680)",
"sun.nio.fs.AbstractWatchService.take(AbstractWatchService.java:118)",
"org.cloudfoundry.security.FileWatcher.call(FileWatcher.java:58)",
"org.cloudfoundry.security.FileWatcher.call(FileWatcher.java:34)",
"java.util.concurrent.FutureTask.run(FutureTask.java:266)",
"java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)",
"java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)",
"java.lang.Thread.run(Thread.java:748)"

Is there some kind of leak here? Threads seemed to be spawning from FileWatcher uncontrollably * gasp *

There also seemed to be a similar issue in the recent past: #459
However, the symptom of each request having logged a watcher instance is not happening with my issue.

The text was updated successfully, but these errors were encountered:

nebhale · 2017-09-05T16:04:15Z

I've just run a simple application in Cloud Foundry that looks like the following:

@SpringBootApplication
public class SecurityProviderTestApplication {

    public static void main(String[] args) {
        SpringApplication.run(SecurityProviderTestApplication.class, args);
    }

    @PostConstruct
    public void test() throws NoSuchAlgorithmException {
        for (int i = 0; i < 500; i++) {
            System.out.println("Getting TrustStore");
            TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
        }

        System.out.printf("Thread Count: %d%n", ManagementFactory.getThreadMXBean().getThreadCount());
    }

}

The full output can be found here but as of version 1.8:

...
2017-09-05T09:01:43.24-0700 [APP/PROC/WEB/0] OUT Getting TrustStore
2017-09-05T09:01:43.24-0700 [APP/PROC/WEB/0] OUT Getting TrustStore
2017-09-05T09:01:43.24-0700 [APP/PROC/WEB/0] OUT Getting TrustStore
2017-09-05T09:01:43.24-0700 [APP/PROC/WEB/0] OUT Thread Count: 8

How are you using the TrustManager or SSLContext to retrieve the trust store?

nebhale · 2017-09-05T16:07:50Z

Nope. I'm totally wrong. This is a legit bug and reproduced. The cache scope isn't sufficient.

nebhale · 2017-09-05T16:19:53Z

Even as I say that, I'm not actually sure that it's a bug. If you take a look at the default implementation, you'll see that TrustManagers are cached to the scope of a TrustManagerFactory. We match that same behavior. The result of this is that the following code caches properly:

public void test() throws NoSuchAlgorithmException, KeyStoreException {
   TrustManagerFactory factory = TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());

   for (int i = 0; i < 500; i++) {
       System.out.println("Getting TrustStore");
       factory.init((KeyStore) null);
       factory.getTrustManagers();
   }

   System.out.printf("Thread Count: %d%n", ManagementFactory.getThreadMXBean().getThreadCount());
}

while the following does not:

public void test() throws NoSuchAlgorithmException, KeyStoreException {
   for (int i = 0; i < 500; i++) {
       System.out.println("Getting TrustStore");
       TrustManagerFactory factory = TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
       factory.init((KeyStore) null);
       factory.getTrustManagers();
   }

   System.out.printf("Thread Count: %d%n", ManagementFactory.getThreadMXBean().getThreadCount());
}

I'm inclined to keep our code matching the default implementation unless you believe that there is a popular networking library that does not properly cache the TrustManagerFactory.

dtran808 · 2017-09-05T16:39:31Z

Hmmmm very strange. Now (since Sunday), no matter how many times I try to deploy, I don't see this named thread anymore. Very odd. However, when I did see it, I saw this in the deploy logs:

[JCD] 2017-09-02 11:28:25.165457: 11:28:25.165 INFO  [] -    2017-09-02T11:27:51.68-0700 [App/0] OUT Sep 2, 2017 18:27:51 +0000 [29 1] com.newrelic INFO: Setting protocol to "https"
[JCD] 2017-09-02 11:28:25.165762: 11:28:25.165 INFO  [] -    2017-09-02T11:27:51.68-0700 [App/0] OUT Sep 2, 2017 18:27:51 +0000 [29 1] com.newrelic INFO: Setting audit_mode to false
[JCD] 2017-09-02 11:28:25.166035: 11:28:25.165 INFO  [] -    2017-09-02T11:27:52.51-0700 [App/0] ERR INFO: Adding System Trust Manager
[JCD] 2017-09-02 11:28:25.166321: 11:28:25.166 INFO  [] -    2017-09-02T11:27:52.51-0700 [App/0] ERR Sep 02, 2017 6:27:52 PM org.cloudfoundry.security.CloudFoundryContainerTrustManagerFactory getSystemTrustManager
[JCD] 2017-09-02 11:28:25.166623: 11:28:25.166 INFO  [] -    2017-09-02T11:27:52.51-0700 [App/0] ERR INFO: Adding System Trust Manager
[JCD] 2017-09-02 11:28:25.166872: 11:28:25.166 INFO  [] -    2017-09-02T11:27:52.53-0700 [App/0] ERR INFO: Adding TrustManager for /etc/ssl/certs/ca-certificates.crt
[JCD] 2017-09-02 11:28:25.167180: 11:28:25.166 INFO  [] -    2017-09-02T11:27:52.53-0700 [App/0] ERR Sep 02, 2017 6:27:52 PM org.cloudfoundry.security.CloudFoundryContainerTrustManagerFactory getContainerTrustManager
[JCD] 2017-09-02 11:28:25.167480: 11:28:25.167 INFO  [] -    2017-09-02T11:27:52.51-0700 [App/0] ERR Sep 02, 2017 6:27:52 PM org.cloudfoundry.security.CloudFoundryContainerTrustManagerFactory getSystemTrustManager
[JCD] 2017-09-02 11:28:25.167762: 11:28:25.167 INFO  [] -    2017-09-02T11:27:52.53-0700 [App/0] ERR Sep 02, 2017 6:27:52 PM org.cloudfoundry.security.CloudFoundryContainerTrustManagerFactory getContainerTrustManager
[JCD] 2017-09-02 11:28:25.168048: 11:28:25.167 INFO  [] -    2017-09-02T11:27:52.53-0700 [App/0] ERR INFO: Adding TrustManager for /etc/ssl/certs/ca-certificates.crt
[JCD] 2017-09-02 11:28:25.168365: 11:28:25.168 INFO  [] -    2017-09-02T11:27:52.54-0700 [App/0] ERR INFO: Start watching /etc/ssl/certs/ca-certificates.crt
[JCD] 2017-09-02 11:28:25.168804: 11:28:25.168 INFO  [] -    2017-09-02T11:27:52.54-0700 [App/0] ERR Sep 02, 2017 6:27:52 PM org.cloudfoundry.security.FileWatcher call
[JCD] 2017-09-02 11:28:25.168993: 11:28:25.168 INFO  [] -    2017-09-02T11:27:52.54-0700 [App/0] ERR Sep 02, 2017 6:27:52 PM org.cloudfoundry.security.FileWatcher call
[JCD] 2017-09-02 11:28:25.169232: 11:28:25.169 INFO  [] -    2017-09-02T11:27:52.54-0700 [App/0] ERR INFO: Start watching /etc/ssl/certs/ca-certificates.crt
[JCD] 2017-09-02 11:28:25.169583: 11:28:25.169 INFO  [] -    2017-09-02T11:27:54.51-0700 [App/0] ERR INFO: Initialized TrustManager for /etc/ssl/certs/ca-certificates.crt
[JCD] 2017-09-02 11:28:25.169769: 11:28:25.169 INFO  [] -    2017-09-02T11:27:54.50-0700 [App/0] ERR Sep 02, 2017 6:27:54 PM org.cloudfoundry.security.FileWatchingX509ExtendedTrustManager <init>
[JCD] 2017-09-02 11:28:25.170005: 11:28:25.169 INFO  [] -    2017-09-02T11:27:54.50-0700 [App/0] ERR INFO: Initialized TrustManager for /etc/ssl/certs/ca-certificates.crt
[JCD] 2017-09-02 11:28:25.170307: 11:28:25.170 INFO  [] -    2017-09-02T11:27:54.51-0700 [App/0] ERR Sep 02, 2017 6:27:54 PM org.cloudfoundry.security.FileWatchingX509ExtendedTrustManager <init>
[JCD] 2017-09-02 11:28:25.170697: 11:28:25.170 INFO  [] -    2017-09-02T11:27:55.08-0700 [App/0] OUT Sep 2, 2017 18:27:55 +0000 [29 1] com.newrelic.agent.config.ConfigServiceImpl INFO: Configuration file is /home/vcap/app/.java-buildpack/new_relic_agent/./newrelic.yml

It stood out because they were ERR logs (in red). But again, I can't seem to see that named thread anymore. I haven't changed code. Just redeployed. Not sure if that helps.

nebhale · 2017-09-05T16:45:57Z

The ERR logs are because your logger configuration writes those log lines to stderr. I'd guess that's Java Util Logging and that's why you're getting the call sites as well.

I do see what appears to be two parallel initializations of the TrustManagerFactory and the resulting TrustManagers, but as pointed out earlier, this isn't necessarily wrong. If you have two networking libraries, or two library initializations that don't cache the TrustManagerFactory this could happen. The last change to the Security Provider was 29 days ago, so I don't believe that we've made any changes that would affect you.

dtran808 · 2017-09-05T17:07:04Z

I guess I spoke too soon. I am now able to reproduce these threads. These threads build up only when I see those ERR logs at beginning of deployment. On other deployments (of same app), those ERR logs do not appear at all and these apps have no longevity issue.

I believe this problem didn't show itself until I bumped up the ceiling of my thread pool to 400. Maybe it was always there?

If X is my pool size, I'm incurring an additional 0.3X threads. So if I lower my pool size to 200, for example, this issue goes away. Since the VM can definitely handle 200 + 60 threads.

I'll try and find out why these logs sometimes show and sometimes don't. That might give a clue.

Thanks

nebhale · 2017-09-05T17:11:24Z

The FileWatcher does not use any thread pool that you can configure so the number of threads it will attempt to start is not governed by any of your configuration.

dtran808 · 2017-09-05T17:51:49Z

It would if my each of my new threads trigger a thread creation on your end. Something akin to #459

lejeunen · 2017-09-11T09:05:58Z

We are also facing the same issue. Our server ran into memory issue, more precisely thread creation issues

java.lang.OutOfMemoryError: unable to create new native thread

I logged into the server (running on PWS), and a thread dump revealed that there hundreds (528 to be precise) of stacks like this one

"file-watcher-/etc/ssl/certs/ca-certificates.crt" #15372 daemon prio=5 os_prio=0 tid=0x00007f9308012000 nid=0x3c6b waiting on condition [0x00007f92ac342000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x00000000ba5881a0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039) at java.util.concurrent.LinkedBlockingDeque.takeFirst(LinkedBlockingDeque.java:492) at java.util.concurrent.LinkedBlockingDeque.take(LinkedBlockingDeque.java:680) at sun.nio.fs.AbstractWatchService.take(AbstractWatchService.java:118) at org.cloudfoundry.security.FileWatcher.call(FileWatcher.java:58) at org.cloudfoundry.security.FileWatcher.call(FileWatcher.java:34) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:748)

This causes the server to reach the hard cap of 1000, and prevents new thread creation.

lejeunen · 2017-09-11T09:11:12Z

I'm attaching the complete thread dump if it can help.
filewatcher-threads.txt

nebhale · 2017-09-11T13:37:33Z

@lejeunen Do you have any idea what the lifecycle of your TrustManagerFactorys are? Specifically, are you using a networking client, or are you using a networking client in such a way, that does not cache the TrustManagerFactory?

I’m interested in getting a better understanding of the instances where this occurs. I’m looking to find out if there are specific frameworks or specific framework recommendations that end up in this non-cached situation.

lejeunen · 2017-09-11T14:08:22Z

We do not manipulate the TrustManagerFactory directly, i.e. in our application code. We do use many libraries though, see below.

The server where we observe the issue is our gateway/edge server (in a microservice sense), so it gets all the incoming REST requests from our web client (a Javascript app, mostly run from Chrome), and forwards them to other microservices.

To achieve this, we use the Spring Cloud stack, where Zuul reverse proxy is used together with a discovery service (Eureka, also bundled with Spring Cloud) to forward requests to other services.

In these other services, also based on Spring cloud but without the reverse proxy part, we only see a few threads (3-4) related to the file watcher.

In our gateway server log, we see many logs related to this (as detailed in #459) whenever our web client sends requests to the server.

2017-09-11T16:03:20.48+0200 [APP/PROC/WEB/1]OUT 2017-09-11 14:03:20.479  INFO 14 --- [nio-8080-exec-7] ContainerTrustManagerFactory$PKIXFactory : Adding System Trust Manager
2017-09-11T16:03:19.61+0200 [APP/PROC/WEB/1]OUT 2017-09-11 14:03:19.619  INFO 14 --- [nio-8080-exec-5] ContainerTrustManagerFactory$PKIXFactory : Adding TrustManager for /etc/ssl/certs/ca-certificates.crt
2017-09-11T16:03:19.61+0200 [APP/PROC/WEB/1]OUT 2017-09-11 14:03:19.619  INFO 14 --- [ertificates.crt] org.cloudfoundry.security.FileWatcher    : Start watching /etc/ssl/certs/ca-certificates.crt
2017-09-11T16:03:19.62+0200 [APP/PROC/WEB/1]OUT 2017-09-11 14:03:19.623  INFO 14 --- [nio-8080-exec-1] ContainerTrustManagerFactory$PKIXFactory : Adding System Trust Manager
2017-09-11T16:03:19.62+0200 [APP/PROC/WEB/1]OUT 2017-09-11 14:03:19.623  INFO 14 --- [nio-8080-exec-1] ContainerTrustManagerFactory$PKIXFactory : Adding TrustManager for /etc/ssl/certs/ca-certificates.crt
2017-09-11T16:03:19.62+0200 [APP/PROC/WEB/1]OUT 2017-09-11 14:03:19.623  INFO 14 --- [ertificates.crt] org.cloudfoundry.security.FileWatcher    : Start watching /etc/ssl/certs/ca-certificates.crt
2017-09-11T16:03:19.62+0200 [APP/PROC/WEB/1]OUT 2017-09-11 14:03:19.626  INFO 14 --- [nio-8080-exec-9] ContainerTrustManagerFactory$PKIXFactory : Adding System Trust Manager
2017-09-11T16:03:19.62+0200 [APP/PROC/WEB/1]OUT 2017-09-11 14:03:19.626  INFO 14 --- [nio-8080-exec-3] ContainerTrustManagerFactory$PKIXFactory : Adding System Trust Manager
2017-09-11T16:03:19.62+0200 [APP/PROC/WEB/1]OUT 2017-09-11 14:03:19.626  INFO 14 --- [nio-8080-exec-9] ContainerTrustManagerFactory$PKIXFactory : Adding TrustManager for /etc/ssl/certs/ca-certificates.crt
2017-09-11T16:03:19.62+0200 [APP/PROC/WEB/1]OUT 2017-09-11 14:03:19.626  INFO 14 --- [nio-8080-exec-3] ContainerTrustManagerFactory$PKIXFactory : Adding TrustManager for /etc/ssl/certs/ca-certificates.crt
2017-09-11T16:03:19.62+0200 [APP/PROC/WEB/1]OUT 2017-09-11 14:03:19.626  INFO 14 --- [ertificates.crt] org.cloudfoundry.security.FileWatcher    : Start watching /etc/ssl/certs/ca-certificates.crt
2017-09-11T16:03:19.62+0200 [APP/PROC/WEB/1]OUT 2017-09-11 14:03:19.627  INFO 14 --- [ertificates.crt] org.cloudfoundry.security.FileWatcher    : Start watching /etc/ssl/certs/ca-certificates.crt

lejeunen · 2017-09-11T14:40:40Z

We have a branch with a more recent version of Spring Boot, and therefore Spring Cloud, and the problem does not seem to happen on that environment!

We are conducting more tests to confirm this.

dtran808 · 2017-09-11T18:41:50Z

Which spring boot are you using? I'm using 1.5.2. however, this problem is intermittent. Out of 5 deployments, 2 of them will have this issue. Maybe a certain version will guarantees the fix.

lejeunen · 2017-09-12T07:13:28Z

We have not seen it happen yet with Boot 1.5.6

nebhale · 2017-09-12T13:30:33Z

@lejeunen Is it possible that your gateway server (where you see those many logs) is running on a different version of the Java Buildpack from the application application without the reverse proxy?

I think at this point, I’m going to find someway to cache the “watcher” bit of the security provider no matter what the behavior of the network clients are. I’m just gathering more information at this point to make sure I cache in the proper location.

lejeunen · 2017-09-15T13:17:16Z

@nebhale No we use the same buildpack version as the other services.

nebhale · 2017-09-15T13:29:46Z

@lejeunen Understood. Thanks for the feedback.

nebhale · 2017-09-18T20:49:25Z

I've changed the design of the Provider such that the container key manager and trust manager are cached for the JVM rather than the instance. This will go out in Security Provider 1.9.0.RELEASE.

nebhale self-assigned this Sep 5, 2017

nebhale added the question label Sep 5, 2017

nebhale added bug and removed question labels Sep 5, 2017

nebhale added this to the v4.6 milestone Sep 5, 2017

nebhale added question and removed bug labels Sep 5, 2017

nebhale removed this from the v4.6 milestone Sep 5, 2017

nebhale added bug framework labels Sep 11, 2017

nebhale added this to the v4.6 milestone Sep 11, 2017

nebhale removed the question label Sep 13, 2017

nebhale closed this as completed Sep 18, 2017

baso53 mentioned this issue Sep 3, 2020

Too many "Adding System Trust Manager" messages, possible memory leak #822

Closed

yusmenrashidov mentioned this issue Dec 13, 2021

Potential FileWatcher memory leak #919

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Too many FileWatcher threads #486

Too many FileWatcher threads #486

dtran808 commented Sep 2, 2017 •

edited by nebhale

Loading

nebhale commented Sep 5, 2017

nebhale commented Sep 5, 2017 •

edited

Loading

nebhale commented Sep 5, 2017 •

edited

Loading

dtran808 commented Sep 5, 2017 •

edited by nebhale

Loading

nebhale commented Sep 5, 2017

dtran808 commented Sep 5, 2017

nebhale commented Sep 5, 2017

dtran808 commented Sep 5, 2017

lejeunen commented Sep 11, 2017

lejeunen commented Sep 11, 2017

nebhale commented Sep 11, 2017 •

edited

Loading

lejeunen commented Sep 11, 2017

lejeunen commented Sep 11, 2017

dtran808 commented Sep 11, 2017

lejeunen commented Sep 12, 2017

nebhale commented Sep 12, 2017 •

edited

Loading

lejeunen commented Sep 15, 2017

nebhale commented Sep 15, 2017

nebhale commented Sep 18, 2017

Too many FileWatcher threads #486

Too many FileWatcher threads #486

Comments

dtran808 commented Sep 2, 2017 • edited by nebhale Loading

nebhale commented Sep 5, 2017

nebhale commented Sep 5, 2017 • edited Loading

nebhale commented Sep 5, 2017 • edited Loading

dtran808 commented Sep 5, 2017 • edited by nebhale Loading

nebhale commented Sep 5, 2017

dtran808 commented Sep 5, 2017

nebhale commented Sep 5, 2017

dtran808 commented Sep 5, 2017

lejeunen commented Sep 11, 2017

lejeunen commented Sep 11, 2017

nebhale commented Sep 11, 2017 • edited Loading

lejeunen commented Sep 11, 2017

lejeunen commented Sep 11, 2017

dtran808 commented Sep 11, 2017

lejeunen commented Sep 12, 2017

nebhale commented Sep 12, 2017 • edited Loading

lejeunen commented Sep 15, 2017

nebhale commented Sep 15, 2017

nebhale commented Sep 18, 2017

dtran808 commented Sep 2, 2017 •

edited by nebhale

Loading

nebhale commented Sep 5, 2017 •

edited

Loading

nebhale commented Sep 5, 2017 •

edited

Loading

dtran808 commented Sep 5, 2017 •

edited by nebhale

Loading

nebhale commented Sep 11, 2017 •

edited

Loading

nebhale commented Sep 12, 2017 •

edited

Loading