Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Too many FileWatcher threads #486

Closed
dtran808 opened this issue Sep 2, 2017 · 19 comments
Closed

Too many FileWatcher threads #486

dtran808 opened this issue Sep 2, 2017 · 19 comments
Assignees
Milestone

Comments

@dtran808
Copy link

dtran808 commented Sep 2, 2017

Hi,

I'm trying to use my resources as efficiently as possible. Recently, I did some load testing and noticed that I've been hitting the system limit of threads and it caused my app to crash. However, the CPU and memory were just fine. It's a software/config limit rather than an actual physical limit.

I dumped all my threads and seen that almost a third (100+) of my threads were named "file-watcher-/etc/ssl/certs/ca-certificates.crt". Each of these thread's stacktrace gives me:

"sun.misc.Unsafe.park(Native Method)",
"java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)",
"java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)",
"java.util.concurrent.LinkedBlockingDeque.takeFirst(LinkedBlockingDeque.java:492)",
"java.util.concurrent.LinkedBlockingDeque.take(LinkedBlockingDeque.java:680)",
"sun.nio.fs.AbstractWatchService.take(AbstractWatchService.java:118)",
"org.cloudfoundry.security.FileWatcher.call(FileWatcher.java:58)",
"org.cloudfoundry.security.FileWatcher.call(FileWatcher.java:34)",
"java.util.concurrent.FutureTask.run(FutureTask.java:266)",
"java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)",
"java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)",
"java.lang.Thread.run(Thread.java:748)"

Is there some kind of leak here? Threads seemed to be spawning from FileWatcher uncontrollably * gasp *

There also seemed to be a similar issue in the recent past: #459
However, the symptom of each request having logged a watcher instance is not happening with my issue.

@nebhale
Copy link
Contributor

nebhale commented Sep 5, 2017

I've just run a simple application in Cloud Foundry that looks like the following:

@SpringBootApplication
public class SecurityProviderTestApplication {

    public static void main(String[] args) {
        SpringApplication.run(SecurityProviderTestApplication.class, args);
    }

    @PostConstruct
    public void test() throws NoSuchAlgorithmException {
        for (int i = 0; i < 500; i++) {
            System.out.println("Getting TrustStore");
            TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
        }

        System.out.printf("Thread Count: %d%n", ManagementFactory.getThreadMXBean().getThreadCount());
    }

}

The full output can be found here but as of version 1.8:

...
2017-09-05T09:01:43.24-0700 [APP/PROC/WEB/0] OUT Getting TrustStore
2017-09-05T09:01:43.24-0700 [APP/PROC/WEB/0] OUT Getting TrustStore
2017-09-05T09:01:43.24-0700 [APP/PROC/WEB/0] OUT Getting TrustStore
2017-09-05T09:01:43.24-0700 [APP/PROC/WEB/0] OUT Thread Count: 8

How are you using the TrustManager or SSLContext to retrieve the trust store?

@nebhale nebhale self-assigned this Sep 5, 2017
@nebhale
Copy link
Contributor

nebhale commented Sep 5, 2017

Nope. I'm totally wrong. This is a legit bug and reproduced. The cache scope isn't sufficient.

@nebhale nebhale added bug and removed question labels Sep 5, 2017
@nebhale nebhale added this to the v4.6 milestone Sep 5, 2017
@nebhale
Copy link
Contributor

nebhale commented Sep 5, 2017

Even as I say that, I'm not actually sure that it's a bug. If you take a look at the default implementation, you'll see that TrustManagers are cached to the scope of a TrustManagerFactory. We match that same behavior. The result of this is that the following code caches properly:

public void test() throws NoSuchAlgorithmException, KeyStoreException {
   TrustManagerFactory factory = TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());

   for (int i = 0; i < 500; i++) {
       System.out.println("Getting TrustStore");
       factory.init((KeyStore) null);
       factory.getTrustManagers();
   }

   System.out.printf("Thread Count: %d%n", ManagementFactory.getThreadMXBean().getThreadCount());
}

while the following does not:

public void test() throws NoSuchAlgorithmException, KeyStoreException {
   for (int i = 0; i < 500; i++) {
       System.out.println("Getting TrustStore");
       TrustManagerFactory factory = TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
       factory.init((KeyStore) null);
       factory.getTrustManagers();
   }

   System.out.printf("Thread Count: %d%n", ManagementFactory.getThreadMXBean().getThreadCount());
}

I'm inclined to keep our code matching the default implementation unless you believe that there is a popular networking library that does not properly cache the TrustManagerFactory.

@dtran808
Copy link
Author

dtran808 commented Sep 5, 2017

Hmmmm very strange. Now (since Sunday), no matter how many times I try to deploy, I don't see this named thread anymore. Very odd. However, when I did see it, I saw this in the deploy logs:

[JCD] 2017-09-02 11:28:25.165457: 11:28:25.165 INFO  [] -    2017-09-02T11:27:51.68-0700 [App/0] OUT Sep 2, 2017 18:27:51 +0000 [29 1] com.newrelic INFO: Setting protocol to "https"
[JCD] 2017-09-02 11:28:25.165762: 11:28:25.165 INFO  [] -    2017-09-02T11:27:51.68-0700 [App/0] OUT Sep 2, 2017 18:27:51 +0000 [29 1] com.newrelic INFO: Setting audit_mode to false
[JCD] 2017-09-02 11:28:25.166035: 11:28:25.165 INFO  [] -    2017-09-02T11:27:52.51-0700 [App/0] ERR INFO: Adding System Trust Manager
[JCD] 2017-09-02 11:28:25.166321: 11:28:25.166 INFO  [] -    2017-09-02T11:27:52.51-0700 [App/0] ERR Sep 02, 2017 6:27:52 PM org.cloudfoundry.security.CloudFoundryContainerTrustManagerFactory getSystemTrustManager
[JCD] 2017-09-02 11:28:25.166623: 11:28:25.166 INFO  [] -    2017-09-02T11:27:52.51-0700 [App/0] ERR INFO: Adding System Trust Manager
[JCD] 2017-09-02 11:28:25.166872: 11:28:25.166 INFO  [] -    2017-09-02T11:27:52.53-0700 [App/0] ERR INFO: Adding TrustManager for /etc/ssl/certs/ca-certificates.crt
[JCD] 2017-09-02 11:28:25.167180: 11:28:25.166 INFO  [] -    2017-09-02T11:27:52.53-0700 [App/0] ERR Sep 02, 2017 6:27:52 PM org.cloudfoundry.security.CloudFoundryContainerTrustManagerFactory getContainerTrustManager
[JCD] 2017-09-02 11:28:25.167480: 11:28:25.167 INFO  [] -    2017-09-02T11:27:52.51-0700 [App/0] ERR Sep 02, 2017 6:27:52 PM org.cloudfoundry.security.CloudFoundryContainerTrustManagerFactory getSystemTrustManager
[JCD] 2017-09-02 11:28:25.167762: 11:28:25.167 INFO  [] -    2017-09-02T11:27:52.53-0700 [App/0] ERR Sep 02, 2017 6:27:52 PM org.cloudfoundry.security.CloudFoundryContainerTrustManagerFactory getContainerTrustManager
[JCD] 2017-09-02 11:28:25.168048: 11:28:25.167 INFO  [] -    2017-09-02T11:27:52.53-0700 [App/0] ERR INFO: Adding TrustManager for /etc/ssl/certs/ca-certificates.crt
[JCD] 2017-09-02 11:28:25.168365: 11:28:25.168 INFO  [] -    2017-09-02T11:27:52.54-0700 [App/0] ERR INFO: Start watching /etc/ssl/certs/ca-certificates.crt
[JCD] 2017-09-02 11:28:25.168804: 11:28:25.168 INFO  [] -    2017-09-02T11:27:52.54-0700 [App/0] ERR Sep 02, 2017 6:27:52 PM org.cloudfoundry.security.FileWatcher call
[JCD] 2017-09-02 11:28:25.168993: 11:28:25.168 INFO  [] -    2017-09-02T11:27:52.54-0700 [App/0] ERR Sep 02, 2017 6:27:52 PM org.cloudfoundry.security.FileWatcher call
[JCD] 2017-09-02 11:28:25.169232: 11:28:25.169 INFO  [] -    2017-09-02T11:27:52.54-0700 [App/0] ERR INFO: Start watching /etc/ssl/certs/ca-certificates.crt
[JCD] 2017-09-02 11:28:25.169583: 11:28:25.169 INFO  [] -    2017-09-02T11:27:54.51-0700 [App/0] ERR INFO: Initialized TrustManager for /etc/ssl/certs/ca-certificates.crt
[JCD] 2017-09-02 11:28:25.169769: 11:28:25.169 INFO  [] -    2017-09-02T11:27:54.50-0700 [App/0] ERR Sep 02, 2017 6:27:54 PM org.cloudfoundry.security.FileWatchingX509ExtendedTrustManager <init>
[JCD] 2017-09-02 11:28:25.170005: 11:28:25.169 INFO  [] -    2017-09-02T11:27:54.50-0700 [App/0] ERR INFO: Initialized TrustManager for /etc/ssl/certs/ca-certificates.crt
[JCD] 2017-09-02 11:28:25.170307: 11:28:25.170 INFO  [] -    2017-09-02T11:27:54.51-0700 [App/0] ERR Sep 02, 2017 6:27:54 PM org.cloudfoundry.security.FileWatchingX509ExtendedTrustManager <init>
[JCD] 2017-09-02 11:28:25.170697: 11:28:25.170 INFO  [] -    2017-09-02T11:27:55.08-0700 [App/0] OUT Sep 2, 2017 18:27:55 +0000 [29 1] com.newrelic.agent.config.ConfigServiceImpl INFO: Configuration file is /home/vcap/app/.java-buildpack/new_relic_agent/./newrelic.yml

It stood out because they were ERR logs (in red). But again, I can't seem to see that named thread anymore. I haven't changed code. Just redeployed. Not sure if that helps.

@nebhale
Copy link
Contributor

nebhale commented Sep 5, 2017

The ERR logs are because your logger configuration writes those log lines to stderr. I'd guess that's Java Util Logging and that's why you're getting the call sites as well.

I do see what appears to be two parallel initializations of the TrustManagerFactory and the resulting TrustManagers, but as pointed out earlier, this isn't necessarily wrong. If you have two networking libraries, or two library initializations that don't cache the TrustManagerFactory this could happen. The last change to the Security Provider was 29 days ago, so I don't believe that we've made any changes that would affect you.

@nebhale nebhale added question and removed bug labels Sep 5, 2017
@nebhale nebhale removed this from the v4.6 milestone Sep 5, 2017
@dtran808
Copy link
Author

dtran808 commented Sep 5, 2017

I guess I spoke too soon. I am now able to reproduce these threads. These threads build up only when I see those ERR logs at beginning of deployment. On other deployments (of same app), those ERR logs do not appear at all and these apps have no longevity issue.

I believe this problem didn't show itself until I bumped up the ceiling of my thread pool to 400. Maybe it was always there?

If X is my pool size, I'm incurring an additional 0.3X threads. So if I lower my pool size to 200, for example, this issue goes away. Since the VM can definitely handle 200 + 60 threads.

I'll try and find out why these logs sometimes show and sometimes don't. That might give a clue.

Thanks

@nebhale
Copy link
Contributor

nebhale commented Sep 5, 2017

The FileWatcher does not use any thread pool that you can configure so the number of threads it will attempt to start is not governed by any of your configuration.

@dtran808
Copy link
Author

dtran808 commented Sep 5, 2017

It would if my each of my new threads trigger a thread creation on your end. Something akin to #459

@lejeunen
Copy link

We are also facing the same issue. Our server ran into memory issue, more precisely thread creation issues

java.lang.OutOfMemoryError: unable to create new native thread

I logged into the server (running on PWS), and a thread dump revealed that there hundreds (528 to be precise) of stacks like this one

"file-watcher-/etc/ssl/certs/ca-certificates.crt" #15372 daemon prio=5 os_prio=0 tid=0x00007f9308012000 nid=0x3c6b waiting on condition [0x00007f92ac342000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x00000000ba5881a0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039) at java.util.concurrent.LinkedBlockingDeque.takeFirst(LinkedBlockingDeque.java:492) at java.util.concurrent.LinkedBlockingDeque.take(LinkedBlockingDeque.java:680) at sun.nio.fs.AbstractWatchService.take(AbstractWatchService.java:118) at org.cloudfoundry.security.FileWatcher.call(FileWatcher.java:58) at org.cloudfoundry.security.FileWatcher.call(FileWatcher.java:34) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:748)

This causes the server to reach the hard cap of 1000, and prevents new thread creation.

@lejeunen
Copy link

I'm attaching the complete thread dump if it can help.
filewatcher-threads.txt

@nebhale
Copy link
Contributor

nebhale commented Sep 11, 2017

@lejeunen Do you have any idea what the lifecycle of your TrustManagerFactorys are? Specifically, are you using a networking client, or are you using a networking client in such a way, that does not cache the TrustManagerFactory?

I’m interested in getting a better understanding of the instances where this occurs. I’m looking to find out if there are specific frameworks or specific framework recommendations that end up in this non-cached situation.

@nebhale nebhale added this to the v4.6 milestone Sep 11, 2017
@lejeunen
Copy link

We do not manipulate the TrustManagerFactory directly, i.e. in our application code. We do use many libraries though, see below.

The server where we observe the issue is our gateway/edge server (in a microservice sense), so it gets all the incoming REST requests from our web client (a Javascript app, mostly run from Chrome), and forwards them to other microservices.

To achieve this, we use the Spring Cloud stack, where Zuul reverse proxy is used together with a discovery service (Eureka, also bundled with Spring Cloud) to forward requests to other services.

In these other services, also based on Spring cloud but without the reverse proxy part, we only see a few threads (3-4) related to the file watcher.

In our gateway server log, we see many logs related to this (as detailed in #459) whenever our web client sends requests to the server.

2017-09-11T16:03:20.48+0200 [APP/PROC/WEB/1]OUT 2017-09-11 14:03:20.479  INFO 14 --- [nio-8080-exec-7] ContainerTrustManagerFactory$PKIXFactory : Adding System Trust Manager
2017-09-11T16:03:19.61+0200 [APP/PROC/WEB/1]OUT 2017-09-11 14:03:19.619  INFO 14 --- [nio-8080-exec-5] ContainerTrustManagerFactory$PKIXFactory : Adding TrustManager for /etc/ssl/certs/ca-certificates.crt
2017-09-11T16:03:19.61+0200 [APP/PROC/WEB/1]OUT 2017-09-11 14:03:19.619  INFO 14 --- [ertificates.crt] org.cloudfoundry.security.FileWatcher    : Start watching /etc/ssl/certs/ca-certificates.crt
2017-09-11T16:03:19.62+0200 [APP/PROC/WEB/1]OUT 2017-09-11 14:03:19.623  INFO 14 --- [nio-8080-exec-1] ContainerTrustManagerFactory$PKIXFactory : Adding System Trust Manager
2017-09-11T16:03:19.62+0200 [APP/PROC/WEB/1]OUT 2017-09-11 14:03:19.623  INFO 14 --- [nio-8080-exec-1] ContainerTrustManagerFactory$PKIXFactory : Adding TrustManager for /etc/ssl/certs/ca-certificates.crt
2017-09-11T16:03:19.62+0200 [APP/PROC/WEB/1]OUT 2017-09-11 14:03:19.623  INFO 14 --- [ertificates.crt] org.cloudfoundry.security.FileWatcher    : Start watching /etc/ssl/certs/ca-certificates.crt
2017-09-11T16:03:19.62+0200 [APP/PROC/WEB/1]OUT 2017-09-11 14:03:19.626  INFO 14 --- [nio-8080-exec-9] ContainerTrustManagerFactory$PKIXFactory : Adding System Trust Manager
2017-09-11T16:03:19.62+0200 [APP/PROC/WEB/1]OUT 2017-09-11 14:03:19.626  INFO 14 --- [nio-8080-exec-3] ContainerTrustManagerFactory$PKIXFactory : Adding System Trust Manager
2017-09-11T16:03:19.62+0200 [APP/PROC/WEB/1]OUT 2017-09-11 14:03:19.626  INFO 14 --- [nio-8080-exec-9] ContainerTrustManagerFactory$PKIXFactory : Adding TrustManager for /etc/ssl/certs/ca-certificates.crt
2017-09-11T16:03:19.62+0200 [APP/PROC/WEB/1]OUT 2017-09-11 14:03:19.626  INFO 14 --- [nio-8080-exec-3] ContainerTrustManagerFactory$PKIXFactory : Adding TrustManager for /etc/ssl/certs/ca-certificates.crt
2017-09-11T16:03:19.62+0200 [APP/PROC/WEB/1]OUT 2017-09-11 14:03:19.626  INFO 14 --- [ertificates.crt] org.cloudfoundry.security.FileWatcher    : Start watching /etc/ssl/certs/ca-certificates.crt
2017-09-11T16:03:19.62+0200 [APP/PROC/WEB/1]OUT 2017-09-11 14:03:19.627  INFO 14 --- [ertificates.crt] org.cloudfoundry.security.FileWatcher    : Start watching /etc/ssl/certs/ca-certificates.crt

@lejeunen
Copy link

We have a branch with a more recent version of Spring Boot, and therefore Spring Cloud, and the problem does not seem to happen on that environment!

We are conducting more tests to confirm this.

@dtran808
Copy link
Author

Which spring boot are you using? I'm using 1.5.2. however, this problem is intermittent. Out of 5 deployments, 2 of them will have this issue. Maybe a certain version will guarantees the fix.

@lejeunen
Copy link

We have not seen it happen yet with Boot 1.5.6

@nebhale
Copy link
Contributor

nebhale commented Sep 12, 2017

@lejeunen Is it possible that your gateway server (where you see those many logs) is running on a different version of the Java Buildpack from the application application without the reverse proxy?

I think at this point, I’m going to find someway to cache the “watcher” bit of the security provider no matter what the behavior of the network clients are. I’m just gathering more information at this point to make sure I cache in the proper location.

@nebhale nebhale removed the question label Sep 13, 2017
@lejeunen
Copy link

@nebhale No we use the same buildpack version as the other services.

@nebhale
Copy link
Contributor

nebhale commented Sep 15, 2017

@lejeunen Understood. Thanks for the feedback.

@nebhale
Copy link
Contributor

nebhale commented Sep 18, 2017

I've changed the design of the Provider such that the container key manager and trust manager are cached for the JVM rather than the instance. This will go out in Security Provider 1.9.0.RELEASE.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants