This repository has been archived by the owner on Dec 18, 2018. It is now read-only.

Kestrel randomly locks up and stops responding #1267

Closed
ZOXEXIVO opened this issue Dec 21, 2016 · 76 comments
Comments

@ZOXEXIVO
Copy link

ZOXEXIVO commented Dec 21, 2016

I run ASP.NET Core 1.1 on Ubuntu 16.04 in a Docker container (based on the Ubuntu image).
It works fine, but three times in two weeks I've hit a state where Kestrel stops responding.
There are no errors in the exception filter; the last log record is an Action Executing entry that never completes.
Why do I think the problem is in Kestrel? Because it is locked up and not processing new requests.
In the locked state, CPU usage is 4-5% (per core). I waited 4 hours, with no result.

TOP command result:

PID    USER      PR    NI    VIRT       RES          SHR S       %CPU      %MEM      TIME+       COMMAND
10788   root      20   0     11.396g    185060       53704 S     4.7       2.3       14:17.56     dotnet

Restarting the Docker container solves the issue.
I can't attach a debugger and trace it, because all the tools I use work only on Windows.

@cesarblum
Copy link
Contributor

Just to be clear, are you running everything (host and containers) on Linux?

Are you doing remote access from a Windows console window? It could be this if so: #1259

@ZOXEXIVO
Copy link
Author

ZOXEXIVO commented Dec 21, 2016

Yes, it's all running on Ubuntu, on a 2-core machine with a non-server CPU (G4400) and 8 GB RAM.
I know about the console-locking problem on Windows (selecting text with the mouse, etc.), but this is not the same issue: the Docker container runs in the background and I don't access it directly. It stops responding without any user activity on the server.
Kestrel is doing something, but I don't understand what:
the 4.7% per-core CPU load in the locked state is roughly what 3-5 quick requests to the main page cost in the normal state.

It is very hard to reproduce and I want to help you fix it. Can you recommend tracing tools for collecting debug info from .NET Core in a Linux environment?

The issue always appears after a few days, without any user activity on the server.

@ZOXEXIVO
Copy link
Author

UPD: the server just locked up again, 2 days after the container was restarted.

@davidfowl
Copy link
Member

@ZOXEXIVO are you passing a cancellation token to any of your response writing API calls (if you have any)?
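
For illustration, a minimal sketch of what is being asked about here: flowing the request-aborted token into a response-writing call (the class and method names are illustrative, not from this app):

using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;

public static class ResponseWritingSketch
{
    // HttpContext.RequestAborted is signalled when the client disconnects, so a
    // write that receives this token can be cancelled instead of lingering.
    public static Task WriteGreetingAsync(HttpContext context) =>
        context.Response.WriteAsync("hello", context.RequestAborted);
}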

@ZOXEXIVO
Copy link
Author

ZOXEXIVO commented Jan 4, 2017

@davidfowl No. Only simple async requests to MongoDB, without passing a CancellationToken.
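
For illustration only (not the actual code from this app): a hedged sketch of flowing the request's cancellation token into MongoDB driver calls, with an invented Item document type:

using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using MongoDB.Driver;

public class Item { public string Id { get; set; } }  // hypothetical document type

public static class MongoQuerySketch
{
    // Passing HttpContext.RequestAborted lets the driver abandon the query when
    // the client disconnects instead of tying up a connection and a thread.
    public static Task<List<Item>> GetAllAsync(IMongoCollection<Item> items, HttpContext context) =>
        items.Find(FilterDefinition<Item>.Empty).ToListAsync(context.RequestAborted);
}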

@philjones88
Copy link

philjones88 commented Jan 5, 2017

We're seeing something similar. I can exec into the container and see the process still running. It's not responding to curl. It sits at 4-6% CPU constantly.

I can reproduce it reliably: it happens under heavy load from our Protractor tests hitting our web app, at a random point anywhere from 0 to 20 minutes in.

Our app listens on port 1924 and has nginx in front of it. Nginx returns a 502 or 504.

Curl inside container:

* Rebuilt URL to: 127.0.0.1:1924/
*   Trying 127.0.0.1...
* Connected to 127.0.0.1 (127.0.0.1) port 1924 (#0)
> GET / HTTP/1.1
> Host: 127.0.0.1:1924
> User-Agent: curl/7.50.1
> Accept: */*
>

Hangs and has to be killed.

netstat inside container:

[root@66e41d245901 vq]# netstat
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp     1976      0 localhost:xiip          localhost:48872         CLOSE_WAIT
tcp      515      0 localhost:xiip          localhost:48866         CLOSE_WAIT
tcp        0      0 localhost:49636         localhost:57050         ESTABLISHED
tcp       79      0 localhost:xiip          localhost:48938         CLOSE_WAIT
tcp       79      0 localhost:xiip          localhost:48980         CLOSE_WAIT
tcp       32      0 66e41d245901:54064      192.168.1.11:https      CLOSE_WAIT
tcp       79      0 localhost:xiip          localhost:48934         CLOSE_WAIT
tcp        0      0 localhost:postgres      localhost:56804         ESTABLISHED
tcp     2076      0 localhost:xiip          localhost:48884         CLOSE_WAIT
tcp     2002      0 localhost:xiip          localhost:48950         CLOSE_WAIT
tcp      236      0 localhost:xiip          localhost:49010         CLOSE_WAIT
tcp        0      0 66e41d245901:54308      192.168.1.11:https      ESTABLISHED
tcp        0      0 localhost:postgres      localhost:56838         ESTABLISHED
tcp        0      0 localhost:xiip          localhost:48754         CLOSE_WAIT
tcp       79      0 localhost:xiip          localhost:48958         CLOSE_WAIT
tcp       77      0 localhost:xiip          localhost:48962         CLOSE_WAIT
tcp        0      0 66e41d245901:50248      mirrors.melbourne.:http TIME_WAIT
tcp       77      0 localhost:xiip          localhost:48960         CLOSE_WAIT
tcp        0      0 localhost:56804         localhost:postgres      ESTABLISHED
tcp        0      0 localhost:57050         localhost:54525         ESTABLISHED
tcp        0      0 localhost:56838         localhost:postgres      ESTABLISHED
tcp      236      0 localhost:xiip          localhost:49008         CLOSE_WAIT
tcp        0      0 localhost:57050         localhost:49636         ESTABLISHED
tcp      515      0 localhost:xiip          localhost:48818         CLOSE_WAIT
tcp       79      0 localhost:xiip          localhost:48966         CLOSE_WAIT
tcp       79      0 localhost:xiip          localhost:48970         CLOSE_WAIT
tcp      489      0 localhost:xiip          localhost:48848         CLOSE_WAIT
tcp        0      0 localhost:54525         localhost:57050         ESTABLISHED
tcp     2002      0 localhost:xiip          localhost:48978         CLOSE_WAIT
tcp       79      0 localhost:xiip          localhost:48888         CLOSE_WAIT
tcp       79      0 localhost:xiip          localhost:48964         CLOSE_WAIT
udp6       0      0 localhost:56619         localhost:56619         ESTABLISHED
Active UNIX domain sockets (w/o servers)
Proto RefCnt Flags       Type       State         I-Node   Path
unix  3      [ ]         STREAM     CONNECTED     22431
unix  3      [ ]         STREAM     CONNECTED     24418    /tmp/kestrel_0a937d9e657e40f98d00f6ec9e1d1388
unix  3      [ ]         STREAM     CONNECTED     27652    /tmp/kestrel_ac09a04d99044ce686585e802a31f7ed
unix  3      [ ]         STREAM     CONNECTED     28605
unix  3      [ ]         STREAM     CONNECTED     28606
unix  3      [ ]         STREAM     CONNECTED     26748

@jchannon
Copy link

jchannon commented Jan 5, 2017

I work with @philjones88 and can confirm this is a self-contained app running against .NET Core 1.0.1 in a CentOS 7 container.

@philjones88
Copy link

We enabled more verbose logging and captured:

2017-01-05 14:06:40.951911  RequestId : 0HL1LEEL4VG15 Component : VQPortal Method : OWIN Pipeline Message : Request Start: /static/themes/base.css?cb=1483621285065
2017-01-05 14:06:40.951911 Microsoft.AspNetCore.StaticFiles.StaticFileMiddleware Sending file. Request path: '/themes/base.css'. Physical path: '/vq/portal/static/themes/base.css'
2017-01-05 14:06:40.977634 Microsoft.AspNetCore.Server.Kestrel Connection id "0HL1LEEL40UDH" communication error Microsoft.AspNetCore.Server.Kestrel.Internal.Networking.UvException: Error -32 EPIPE broken pipe
2017-01-05 14:06:40.977634 Microsoft.AspNetCore.Server.Kestrel Connection id "0HL1LEEL40UDH" disconnecting.```

@ZOXEXIVO
Copy link
Author

After putting NGINX in front of Kestrel, the problem disappeared (2 weeks of uptime so far).

@euclid47
Copy link

euclid47 commented Feb 2, 2017

@ZOXEXIVO we are experiencing the same symptoms in the same kind of containerized environment. Since deploying nginx in front of Kestrel, have you had any lockups?

@ZOXEXIVO
Copy link
Author

ZOXEXIVO commented Feb 2, 2017

@euclid47 Behind an nginx server it works perfectly.
See #1304: the problem was in the StaticFiles middleware.

@runxc1
Copy link

runxc1 commented Feb 3, 2017

Any idea when #1304 will be pushed out? I am running .NET Core 1.1 in an Azure web app and am having a similar problem: it crashes and never starts serving content again until the app is swapped or restarted.

@muratg
Copy link
Contributor

muratg commented Feb 3, 2017

@ZOXEXIVO @runxc1 #1304 should be released in Feb.

@Eilon
Copy link
Member

Eilon commented Feb 8, 2017

Closing as a dup of #1304.

@Eilon Eilon closed this as completed Feb 8, 2017
@tavisca-syedave
Copy link

Hi,
Are there any plans to ship v1.1.1? We are also getting the same error when deploying a .NET Core web app in a Linux Docker container (Ubuntu).

@Eilon
Copy link
Member

Eilon commented Feb 15, 2017

@tavisca-syedave we plan to have the patch out hopefully later this month.

@cesarblum
Copy link
Contributor

Kestrel 1.1.1 is now released. Is anyone still running into this issue?

@pieceofsummer
Copy link

pieceofsummer commented May 13, 2017

I think I've just got it with 1.1.2, using a unix socket (behind nginx). I'll enable verbose logging and come back if I can reproduce it.

I have StaticFiles middleware disabled in production (static files are served by nginx).

@ilyabreev
Copy link

Just got it too with 1.1.2, behind nginx.

@jeff-pang
Copy link

jeff-pang commented Aug 25, 2017

I also just got it on 1.1.2, with an nginx reverse proxy to Kestrel on Debian. In my case it had been working in production for more than a month, then it suddenly stopped working. Now it stops working every hour.

@ZOXEXIVO
Copy link
Author

ZOXEXIVO commented Aug 25, 2017

Best advice -> upgrade to v2.0

@jeff-pang
Copy link

@ZOXEXIVO Did upgrading to v2 fix your problem?

@ZOXEXIVO
Copy link
Author

@jeff-pang The problem disappeared once I hosted Kestrel behind nginx.

@hasanoruc
Copy link

@halter73
On my Ubuntu server, every request to the server and every database transaction leaves a dotnet process sitting in RAM, and it does not go away by itself. I have to clear these, which I do with a crontab job. When that data and the server cache are cleared, the next user connection is a bit slow, but requests arriving within the interval set in crontab are served from the cache again and are fast.

@niemyjski
Copy link

I'm also trying to track down this issue on two of our sites. It seems to happen when the site is under load, and it doesn't recover on its own. We make a lot of async Redis calls and HTTP client calls via the NEST client.

@halter73
Copy link
Member

@niemyjski It could be threadpool starvation. Have you collected a memory dump when Kestrel isn't responding?
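
While waiting for a dump, a rough and entirely unofficial way to check for threadpool starvation from inside the process is to measure how long a trivial queued work item takes to start; this is only a sketch under that assumption:

using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

public static class ThreadPoolProbe
{
    // Periodically queue a no-op work item and measure the queue-to-start delay.
    // If the delay grows into seconds while the app is unresponsive, the worker
    // threads are most likely all blocked (threadpool starvation).
    public static async Task RunAsync(CancellationToken token)
    {
        while (!token.IsCancellationRequested)
        {
            var sw = Stopwatch.StartNew();
            await Task.Run(() => { });  // normally starts almost immediately
            sw.Stop();

            ThreadPool.GetAvailableThreads(out int workers, out int io);
            Console.WriteLine($"queue delay: {sw.ElapsedMilliseconds} ms, available workers: {workers}, available IO threads: {io}");

            // Cancelling the token ends the probe by throwing TaskCanceledException here.
            await Task.Delay(TimeSpan.FromSeconds(5), token);
        }
    }
}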

@ilyabreev
Copy link

ilyabreev commented Dec 18, 2017

Same problem here. Randomly stops responding and consumes 1-2% CPU. .NET Core 2.0.3.

@halter73
Copy link
Member

@ilyabreev Any details?

Information that might be useful:

  1. OS version
  2. Whether or not you're running under IIS/ANCM or behind another reverse proxy
  3. Frequency of occurrence
  4. Amount of load the server was under when it stopped responding
  5. A core dump taken when the server is not responding
  6. Anything out of the ordinary that could be related.

Also, I recommend opening a new issue. Even if the behavior described in this issue is similar to what you're experiencing, this issue describes a bug that was fixed prior to the release of 2.0.0.

@niemyjski
Copy link

@halter73 yes, I have some memory dumps.

@halter73
Copy link
Member

@niemyjski Can you share these? You can email me with a link at the address listed in my github profile. Thanks.

@ilyabreev
Copy link

@halter73

  1. Ubuntu 14.04.5 LTS as the host system; the dotnet process runs inside a Docker container based on ubuntu:xenial.
  2. Running behind nginx as a reverse proxy.
  3. 0-6 times a day.
  4. Don't know exactly. I don't think it's constantly very high; maybe there were some CPU spikes before the lockup.
  5. Will try to email you a link to a memory dump.
  6. Nothing special.

@mazong1123
Copy link

As this issue has been closed, is there another issue tracking the "dotnet process stops" problem? We're facing the same thing: the dotnet process stops unexpectedly.

@muratg
Copy link
Contributor

muratg commented Dec 20, 2017

@mazong1123 please feel free to file another bug to discuss the current issues.

@mshanak
Copy link

mshanak commented Dec 24, 2017

This blog post could help: increase the max open files limit and configure Nginx to close connections for finished requests.

@hasanoruc
Copy link

hasanoruc commented Dec 29, 2017

I think I've solved my own problem. A few days ago I looked at the nginx error logs in /var/log/nginx/error.log and saw this error: "upstream prematurely closed connection while reading response header from upstream". I thought about it, looked at my (bad) nginx config, and changed the following:

location / {
    proxy_pass http://backend;  # this line throws the error
}
location /api {
    proxy_pass http://localhost:5021;  # no error here because this config is ok
}
upstream backend {
    server 127.0.0.1:5020;  # this throws the error; I think incoming requests are split across 2 ports and get confused
    server 127.0.0.1:19920;
}

My new config is:

location / {
    proxy_pass http://localhost:5020;
}
location /api {
    proxy_pass http://localhost:5021;
}

I am no longer using the upstream block because my setup doesn't need it.

Sorry for my bad English.

@niemyjski
Copy link

niemyjski commented Jan 3, 2018

@halter73 just a heads up: I've been running 1.5 weeks in production without issues. I've seen high thread counts and have been trying to get the profiler attached in Azure to take a look (without success). Since I've seen thread exhaustion in the past, I want to make sure I'm not leaking anything, but the high thread count is normal for my load (though I never remember seeing this many threads in the legacy ASP.NET app). I got a dump, but all the threads were native with empty stack traces. We're currently serving 5+ billion requests a month across 4-5 large Azure web app instances, seeing 70-80% CPU and a few hundred threads (600+). I was able to narrow one of my issues down to this: StackExchange/StackExchange.Redis#759.

On a side note, I wish Microsoft would put some resources into that project (the Redis client), as it is used by Microsoft for the distributed cache implementation and in various scenarios inside of ASP.NET ;)

@benaadams
Copy link
Contributor

I've made a BlockingDetector package to output warnings to the logs when you block:
https://github.com/benaadams/Ben.BlockingDetector

It only detects blocking when it actually happens, so it doesn't pick up calls that may block but don't, warn about coding practices that lead to blocking, or catch blocking that happens at the OS rather than the .NET level (e.g. File.Read) - but it may help pick things up.
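
To make the idea concrete, an illustrative example (not taken from this thread) of the sync-over-async pattern such a detector tends to flag, next to the non-blocking equivalent:

using System.Net.Http;
using System.Threading.Tasks;

public static class BlockingExamples
{
    // Blocks a thread pool thread until the HTTP call finishes; under load this
    // kind of call is a common cause of thread pool starvation.
    public static string GetPageBlocking(HttpClient client, string url) =>
        client.GetStringAsync(url).Result;

    // Non-blocking equivalent: return the task and let the caller await it.
    public static Task<string> GetPageAsync(HttpClient client, string url) =>
        client.GetStringAsync(url);
}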

@Itoktsnhc
Copy link

Same problem here: the app randomly stops responding on my ASP.NET Core project on Azure App Service (IIS and the ASP.NET Core Module).
Another related issue: dotnet/aspnetcore#1950

Typical request metrics are shown below; when the no-response error happens, the number of threads used by the dotnet process rapidly jumps to 400 or 500.

[screenshot: request metrics]

@shaykeren
Copy link

Same problem in my environment: ASP.NET Core 2.0 published to Ubuntu 16.04.
The application is deployed in AWS with the following instances:
instance #1 contains our API, which hangs from time to time;
instance #2 contains our static HTML files, the nginx server that hosts the static files and forwards requests to the API (instance #1), and other instances that serve our other microservices.
(The API process hangs every 2-4 days.)

I saw that this process has 650 dead threads; is that relevant?

(lldb) sos Threads
ThreadCount: 677
UnstartedThread: 0
BackgroundThread: 25
PendingThread: 0
DeadThread: 650
Hosted Runtime: no

Any Idea?

@mshanak
Copy link

mshanak commented May 16, 2018

I had the same problem before; it's related to the HttpClient library.

I wrote this blog post about the issue, it may help you: https://medium.com/@mshanak/soved-dotnet-core-too-many-open-files-in-system-when-using-postgress-with-entity-framework-c6e30eeff6d1
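
A common culprit behind that class of "too many open files" failures is constructing a new HttpClient per request; assuming that is the case, a sketch of the usual fix, a single shared instance:

using System;
using System.Net.Http;
using System.Threading.Tasks;

public static class SharedHttpClient
{
    // One HttpClient for the lifetime of the application. Creating a client per
    // request leaves sockets lingering in TIME_WAIT/CLOSE_WAIT and can exhaust
    // the file descriptor limit on Linux.
    private static readonly HttpClient Client = new HttpClient
    {
        Timeout = TimeSpan.FromSeconds(30)
    };

    public static Task<string> GetStringAsync(string url) => Client.GetStringAsync(url);
}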

@runxc1
Copy link

runxc1 commented May 16, 2018

If you update to 2.1, do you still run into the same problem? I am just curious whether the switch from libuv to Sockets inside of Kestrel fixes this issue.

@shaykeren
Copy link

I'm familiar with your post, and we are using a single HttpClient.

@mshanak
Copy link

mshanak commented May 17, 2018

@shaykeren do you use HttpClient to send requests periodically to a URL? If so, use a cron job instead.

@euclid47
Copy link

euclid47 commented May 17, 2018

@mshanak @shaykeren when we experienced this issue, it happened with async calls to the user identity and/or Redis for cached user data within the Razor view. The only solution was to bounce the application.

Edit: Forgot to add razor view

@ZOXEXIVO
Copy link
Author

Caught it many times last week.
.NET Core 2.1, new Sockets transport. The application gets stuck.

@halter73
Copy link
Member

@ZOXEXIVO Are you running on 2.1.0 or one of the patches? It's possible you could be running into dotnet/corefx#29785 if you're still on 2.1.0.

@HassanHashemi
Copy link

HassanHashemi commented Aug 14, 2018

@halter73 Facing the same issue on CentOS.

dotnet --version:
2.1.302

I have used @benaadams's blocking detector and saw plenty of blocking calls; however, I do not think that is the actual issue, because the same app runs on Windows for days without any problem.

That said, if I make the following changes:

// Allow more concurrent outbound connections per endpoint.
System.Net.ServicePointManager.DefaultConnectionLimit = 256;
// Read the current maximum completion-port thread count...
System.Threading.ThreadPool.GetMaxThreads(out int _, out int completionThreads);
// ...and raise the pool minimums: 256 worker threads, with the completion-port minimum set to that maximum.
System.Threading.ThreadPool.SetMinThreads(256, completionThreads);

...the app keeps running for a couple more days.

Any ideas?

@kyuz0
Copy link

kyuz0 commented Nov 15, 2018

Sorry to re-open this. We are experiencing this issue in our application as well; we run on CentOS in Docker, behind nginx. Our app has had this issue with 2.0 and also with 2.1. Increasing the thread pool size seemed to help, but sometimes we see Kestrel hanging under very little load, while most of the time we have long uptimes under high load, which makes it difficult to debug.

Has there been any update on this, or any additional troubleshooting advice? We are planning to take a full core dump of Kestrel/dotnet the next time this happens and look at what's going on. We suspect it has to do with a deadlock, but we can't really see how our code would result in deadlocks, so maybe we are triggering a bug at the framework level.

@halter73
Copy link
Member

@kyuz0 If increasing the number of threadpool threads helped, that's a good sign you're experiencing threadpool starvation. This could be caused by a deadlock like you suspect, but there could be other causes too like blocking I/O.

To find the exact cause, collecting a core dump is your best bet. Unfortunately, analyzing linux core dumps isn't yet quite as easy as it is on Windows, but you can find a guide here. "clrstack -all" should show all the managed stacks and what they're blocked on.

@kyuz0
Copy link

kyuz0 commented Nov 16, 2018

@halter73 thanks for your reply. Is there any debugging that we can enable in Kestrel to get some logs/alerts when the threadpool is full? The difficulty at the moment is that when the application reaches this state, there's simply nothing happening in the logs.

Something else I've seen in our code is that we make raw TCP connections to some endpoints like this:

using (var client = new TcpClient())
{
    try
    {
        var result = client.BeginConnect(IPAddress.Parse(externalIp), port, null, null);
        var success = result.AsyncWaitHandle.WaitOne(TimeSpan.FromMilliseconds(TimeoutReadyCheckMilliseconds));
        if (!success)
        {
            return false;
        }

        client.EndConnect(result);
    }
    catch (SocketException e)
    {
        return false;
    }
}

The using statement should ensure that the TcpClient gets disposed in all cases, so it shouldn't matter that we don't explicitly call EndConnect() in every path. At least that was my understanding, but maybe this is what's causing issues, since the socket never receives an explicit EndConnect when success is false or an exception is thrown.

@halter73
Copy link
Member

halter73 commented Nov 16, 2018

Blocking on the IAsyncResult returned by TcpClient.BeginConnect is a possible candidate for the cause of your threadpool starvation. To fix this, you'll want to call the newer TcpClient.ConnectAsync method and await it in an async method while being careful to avoid calls to Task.Wait(), Task.Result or anything else in your codebase that could block a thread waiting on a Task.

The BeginConnect/EndConnect method pair implements an older asynchronous API pattern called the Asynchronous Programming Model, or APM. It's possible to use APM APIs without blocking, but it's much more difficult. You're almost always better off using a Task-returning API with async/await over using an APM API.

Also, did you leave some code out in the TcpClient initialization sample? The using block ends right after the call to TcpClient.EndConnect which means the connection has only just been established.

As for threadpool starvation events, you could look for the ThreadPoolWorkerThreadAdjustmentAdjustment event with the "0x06 - Starvation" Reason using perfcollect and PerfView. That being said, if this issue causes any health checks to fail, it's probably easier to have a watchdog process collect a core dump before killing the server process when the health check fails.
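
A hedged sketch of the rewrite described above, reusing the names from the earlier snippet (externalIp, port and the timeout value are assumed to come from the surrounding class):

using System;
using System.Net;
using System.Net.Sockets;
using System.Threading.Tasks;

public static class ReadyCheckSketch
{
    // Awaits ConnectAsync instead of blocking on the IAsyncResult wait handle;
    // the timeout is implemented with Task.WhenAny rather than WaitOne.
    public static async Task<bool> CanConnectAsync(string externalIp, int port, int timeoutMilliseconds)
    {
        using (var client = new TcpClient())
        {
            try
            {
                var connectTask = client.ConnectAsync(IPAddress.Parse(externalIp), port);
                var finished = await Task.WhenAny(connectTask, Task.Delay(timeoutMilliseconds));
                if (finished != connectTask)
                {
                    return false;  // timed out; disposing the client tears down the pending connect
                }

                await connectTask;  // surfaces any connection failure as an exception
                return true;
            }
            catch (SocketException)
            {
                return false;
            }
        }
    }
}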
