This repository has been archived by the owner on Dec 18, 2018. It is now read-only.

Kestrel randomly locks up and stops responding #1267

Closed
ZOXEXIVO opened this issue Dec 21, 2016 · 76 comments
Comments

@ZOXEXIVO
Copy link

ZOXEXIVO commented Dec 21, 2016

I run ASP.NET Core 1.1 on Ubuntu 16.04 in a Docker container (based on the Ubuntu image).
It works fine, but three times in two weeks I've hit a state where Kestrel stops responding.
There are no errors in the exception filter; the last log record is an Action Executing entry that never completes.
Why do I think the problem is in Kestrel? Because it is locked up and not processing new requests.
In the locked state, CPU usage is 4-5% (per core). I waited 4 hours, with no result.

TOP command result:

PID    USER      PR    NI    VIRT       RES          SHR S       %CPU      %MEM      TIME+       COMMAND
10788   root      20   0     11.396g    185060       53704 S     4.7       2.3       14:17.56     dotnet

Restarting the Docker container solves the issue.
I can't attach a debugger and trace it, because all the tools I use work only on Windows.

@cesarblum
Copy link
Contributor

Just to be clear, are you running everything (host and containers) on Linux?

Are you doing remote access from a Windows console window? It could be this if so: #1259

@ZOXEXIVO
Copy link
Author

ZOXEXIVO commented Dec 21, 2016

Yes, it's all running on Ubuntu, on a 2-core machine with a non-server CPU (G4400) and 8 GB RAM.
I know about the console-locking problem on Windows (selecting text with the mouse, etc.), but this is not the same issue: the Docker container runs in the background and I don't access it directly. It stops responding without any user activity on the server.
Kestrel is doing something, but I don't understand what:
the 4.7% per-core CPU load in the locked state is roughly what 3-5 quick requests to the main page cost in the normal state.

It is very hard to reproduce and I want to help you fix it. Can you recommend tracing tools for collecting debug info from .NET Core in a Linux environment?

The issue always appears after a few days, without any user activity on the server.

@ZOXEXIVO
Copy link
Author

UPD: the server just locked up again, 2 days after the container was restarted.

@davidfowl
Copy link
Member

@ZOXEXIVO are you passing a cancellation token to any of your response writing API calls (if you have any)?
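
For illustration, a minimal sketch of what is being asked about here: flowing the request-aborted token into a response-writing call (the class and method names are illustrative, not from this app):

using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;

public static class ResponseWritingSketch
{
    // HttpContext.RequestAborted is signalled when the client disconnects, so a
    // write that receives this token can be cancelled instead of lingering.
    public static Task WriteGreetingAsync(HttpContext context) =>
        context.Response.WriteAsync("hello", context.RequestAborted);
}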

@ZOXEXIVO
Copy link
Author

ZOXEXIVO commented Jan 4, 2017

@davidfowl No. Only simple async requests to MongoDB, without passing a CancellationToken.
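
For illustration only (not the actual code from this app): a hedged sketch of flowing the request's cancellation token into MongoDB driver calls, with an invented Item document type:

using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using MongoDB.Driver;

public class Item { public string Id { get; set; } }  // hypothetical document type

public static class MongoQuerySketch
{
    // Passing HttpContext.RequestAborted lets the driver abandon the query when
    // the client disconnects instead of tying up a connection and a thread.
    public static Task<List<Item>> GetAllAsync(IMongoCollection<Item> items, HttpContext context) =>
        items.Find(FilterDefinition<Item>.Empty).ToListAsync(context.RequestAborted);
}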

@philjones88
Copy link

philjones88 commented Jan 5, 2017

We're seeing something similar. I can exec into the container and see the process still running. It's not responding to curl. It sits at 4-6% CPU constantly.

I can reproduce it reliably: it happens under heavy load from our Protractor tests hitting our web app, at a random point anywhere from 0 to 20 minutes in.

Our app listens on port 1924 and has nginx in front of it. Nginx returns a 502 or 504.

Curl inside container:

* Rebuilt URL to: 127.0.0.1:1924/
*   Trying 127.0.0.1...
* Connected to 127.0.0.1 (127.0.0.1) port 1924 (#0)
> GET / HTTP/1.1
> Host: 127.0.0.1:1924
> User-Agent: curl/7.50.1
> Accept: */*
>

Hangs and has to be killed.

netstat inside container:

[root@66e41d245901 vq]# netstat
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp     1976      0 localhost:xiip          localhost:48872         CLOSE_WAIT
tcp      515      0 localhost:xiip          localhost:48866         CLOSE_WAIT
tcp        0      0 localhost:49636         localhost:57050         ESTABLISHED
tcp       79      0 localhost:xiip          localhost:48938         CLOSE_WAIT
tcp       79      0 localhost:xiip          localhost:48980         CLOSE_WAIT
tcp       32      0 66e41d245901:54064      192.168.1.11:https      CLOSE_WAIT
tcp       79      0 localhost:xiip          localhost:48934         CLOSE_WAIT
tcp        0      0 localhost:postgres      localhost:56804         ESTABLISHED
tcp     2076      0 localhost:xiip          localhost:48884         CLOSE_WAIT
tcp     2002      0 localhost:xiip          localhost:48950         CLOSE_WAIT
tcp      236      0 localhost:xiip          localhost:49010         CLOSE_WAIT
tcp        0      0 66e41d245901:54308      192.168.1.11:https      ESTABLISHED
tcp        0      0 localhost:postgres      localhost:56838         ESTABLISHED
tcp        0      0 localhost:xiip          localhost:48754         CLOSE_WAIT
tcp       79      0 localhost:xiip          localhost:48958         CLOSE_WAIT
tcp       77      0 localhost:xiip          localhost:48962         CLOSE_WAIT
tcp        0      0 66e41d245901:50248      mirrors.melbourne.:http TIME_WAIT
tcp       77      0 localhost:xiip          localhost:48960         CLOSE_WAIT
tcp        0      0 localhost:56804         localhost:postgres      ESTABLISHED
tcp        0      0 localhost:57050         localhost:54525         ESTABLISHED
tcp        0      0 localhost:56838         localhost:postgres      ESTABLISHED
tcp      236      0 localhost:xiip          localhost:49008         CLOSE_WAIT
tcp        0      0 localhost:57050         localhost:49636         ESTABLISHED
tcp      515      0 localhost:xiip          localhost:48818         CLOSE_WAIT
tcp       79      0 localhost:xiip          localhost:48966         CLOSE_WAIT
tcp       79      0 localhost:xiip          localhost:48970         CLOSE_WAIT
tcp      489      0 localhost:xiip          localhost:48848         CLOSE_WAIT
tcp        0      0 localhost:54525         localhost:57050         ESTABLISHED
tcp     2002      0 localhost:xiip          localhost:48978         CLOSE_WAIT
tcp       79      0 localhost:xiip          localhost:48888         CLOSE_WAIT
tcp       79      0 localhost:xiip          localhost:48964         CLOSE_WAIT
udp6       0      0 localhost:56619         localhost:56619         ESTABLISHED
Active UNIX domain sockets (w/o servers)
Proto RefCnt Flags       Type       State         I-Node   Path
unix  3      [ ]         STREAM     CONNECTED     22431
unix  3      [ ]         STREAM     CONNECTED     24418    /tmp/kestrel_0a937d9e657e40f98d00f6ec9e1d1388
unix  3      [ ]         STREAM     CONNECTED     27652    /tmp/kestrel_ac09a04d99044ce686585e802a31f7ed
unix  3      [ ]         STREAM     CONNECTED     28605
unix  3      [ ]         STREAM     CONNECTED     28606
unix  3      [ ]         STREAM     CONNECTED     26748

@jchannon
Copy link

jchannon commented Jan 5, 2017

I work with @philjones88 and can confirm this is a self-contained app running against .NET Core 1.0.1 in a CentOS 7 container.

@philjones88
Copy link

We enabled more verbose logging and captured:

2017-01-05 14:06:40.951911  RequestId : 0HL1LEEL4VG15 Component : VQPortal Method : OWIN Pipeline Message : Request Start: /static/themes/base.css?cb=1483621285065
2017-01-05 14:06:40.951911 Microsoft.AspNetCore.StaticFiles.StaticFileMiddleware Sending file. Request path: '/themes/base.css'. Physical path: '/vq/portal/static/themes/base.css'
2017-01-05 14:06:40.977634 Microsoft.AspNetCore.Server.Kestrel Connection id "0HL1LEEL40UDH" communication error Microsoft.AspNetCore.Server.Kestrel.Internal.Networking.UvException: Error -32 EPIPE broken pipe
2017-01-05 14:06:40.977634 Microsoft.AspNetCore.Server.Kestrel Connection id "0HL1LEEL40UDH" disconnecting.```

@ZOXEXIVO
Copy link
Author

After putting NGINX in front of Kestrel, the problem disappeared (2 weeks of uptime so far).

@euclid47
Copy link

euclid47 commented Feb 2, 2017

@ZOXEXIVO we are experiencing the same symptoms in the same kind of containerized environment. Since deploying nginx in front of Kestrel, have you had any lockups?

@ZOXEXIVO
Copy link
Author

ZOXEXIVO commented Feb 2, 2017

@euclid47 Behind an nginx server it works perfectly.
See #1304: the problem was in the StaticFiles middleware.

@runxc1
Copy link

runxc1 commented Feb 3, 2017

Any idea when #1304 will be pushed out? I am running .NET Core 1.1 in an Azure web app and am having a similar problem: it crashes and never starts serving content again until the app is swapped or restarted.

@muratg
Copy link
Contributor

muratg commented Feb 3, 2017

@ZOXEXIVO @runxc1 #1304 should be released in Feb.

@Eilon
Copy link
Member

Eilon commented Feb 8, 2017

Closing as a dup of #1304.

@Eilon Eilon closed this as completed Feb 8, 2017
@tavisca-syedave
Copy link

Hi,
Are there any plans to ship v1.1.1? We are also getting the same error when deploying a .NET Core web app in a Linux Docker container (Ubuntu).

@Eilon
Copy link
Member

Eilon commented Feb 15, 2017

@tavisca-syedave we plan to have the patch out hopefully later this month.

@cesarblum
Copy link
Contributor

Kestrel 1.1.1 is now released. Is anyone still running into this issue?

@pieceofsummer
Copy link

pieceofsummer commented May 13, 2017

I think I've just got it with 1.1.2, using a unix socket (behind nginx). I'll enable verbose logging and come back if I can reproduce it.

I have StaticFiles middleware disabled in production (static files are served by nginx).

@ilyabreev
Copy link

Just got it too with 1.1.2, behind nginx.

@jeff-pang
Copy link

jeff-pang commented Aug 25, 2017

I also just got it on 1.1.2, with an nginx reverse proxy to Kestrel on Debian. In my case it had been working in production for more than a month, then it suddenly stopped working. Now it stops working every hour.

@ZOXEXIVO
Copy link
Author

ZOXEXIVO commented Aug 25, 2017

Best advice -> upgrade to v2.0

@jeff-pang
Copy link

@ZOXEXIVO Did upgrading to v2 fix your problem?

@ZOXEXIVO
Copy link
Author

@jeff-pang The problem disappeared once I hosted Kestrel behind nginx.

@hasanoruc
Copy link

@halter73
On my Ubuntu server, every request to the server and every database transaction leaves a dotnet process sitting in RAM, and it does not go away by itself. I have to clear these, which I do with a crontab job. When that data and the server cache are cleared, the next user connection is a bit slow, but requests arriving within the interval set in crontab are served from the cache again and are fast.

@niemyjski
Copy link

I'm also trying to track down this issue on two of our sites. It seems to happen when the site is under load, and it doesn't recover on its own. We make a lot of async Redis calls and HTTP client calls via the NEST client.

@halter73
Copy link
Member

@niemyjski It could be threadpool starvation. Have you collected a memory dump when Kestrel isn't responding?
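
While waiting for a dump, a rough and entirely unofficial way to check for threadpool starvation from inside the process is to measure how long a trivial queued work item takes to start; this is only a sketch under that assumption:

using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

public static class ThreadPoolProbe
{
    // Periodically queue a no-op work item and measure the queue-to-start delay.
    // If the delay grows into seconds while the app is unresponsive, the worker
    // threads are most likely all blocked (threadpool starvation).
    public static async Task RunAsync(CancellationToken token)
    {
        while (!token.IsCancellationRequested)
        {
            var sw = Stopwatch.StartNew();
            await Task.Run(() => { });  // normally starts almost immediately
            sw.Stop();

            ThreadPool.GetAvailableThreads(out int workers, out int io);
            Console.WriteLine($"queue delay: {sw.ElapsedMilliseconds} ms, available workers: {workers}, available IO threads: {io}");

            // Cancelling the token ends the probe by throwing TaskCanceledException here.
            await Task.Delay(TimeSpan.FromSeconds(5), token);
        }
    }
}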

@ilyabreev
Copy link

ilyabreev commented Dec 18, 2017

Same problem here. Randomly stops responding and consumes 1-2% CPU. .NET Core 2.0.3.

@halter73
Copy link
Member

@ilyabreev Any details?

Information that might be useful:

  1. OS version
  2. Whether or not you're running under IIS/ANCM or behind another reverse proxy
  3. Frequency of occurrence
  4. Amount of load the server was under when it stopped responding
  5. A core dump taken when the server is not responding
  6. Anything out of the ordinary that could be related.

Also, I recommend opening a new issue. Even if the behavior described in this issue is similar to what you're experiencing, this issue describes a bug that was fixed prior to the release of 2.0.0.

@niemyjski
Copy link

@halter73 yes, I have some memory dumps.

@halter73
Copy link
Member

@niemyjski Can you share these? You can email me with a link at the address listed in my github profile. Thanks.

@ilyabreev
Copy link

@halter73

  1. Ubuntu 14.04.5 LTS as the host system; the dotnet process runs inside a Docker container based on ubuntu:xenial.
  2. Running behind nginx as a reverse proxy.
  3. 0-6 times a day.
  4. Don't know exactly. I don't think it's constantly very high; maybe there were some CPU spikes before the lockup.
  5. Will try to email you a link to a memory dump.
  6. Nothing special.

@mazong1123
Copy link

As this issue has been closed, is there another issue tracking the "dotnet process stops" problem? We're facing the same thing: the dotnet process stops unexpectedly.

@muratg
Copy link
Contributor

muratg commented Dec 20, 2017

@mazong1123 please feel free to file another bug to discuss the current issues.

@mshanak
Copy link

mshanak commented Dec 24, 2017

This blog post could help: increase the max open files limit and configure Nginx to close connections for finished requests.

@hasanoruc
Copy link

hasanoruc commented Dec 29, 2017

I think I've solved my own problem. A few days ago I looked at the nginx error logs in /var/log/nginx/error.log and saw this error: "upstream prematurely closed connection while reading response header from upstream". I thought about it, looked at my (bad) nginx config, and changed the following:

location / {
    proxy_pass http://backend;  # this line throws the error
}
location /api {
    proxy_pass http://localhost:5021;  # no error here because this config is ok
}
upstream backend {
    server 127.0.0.1:5020;  # this throws the error; I think incoming requests are split across 2 ports and get confused
    server 127.0.0.1:19920;
}

My new config is:

location / {
    proxy_pass http://localhost:5020;
}
location /api {
    proxy_pass http://localhost:5021;
}

I am no longer using the upstream block because my setup doesn't need it.

Sorry for my bad English.

@niemyjski
Copy link

niemyjski commented Jan 3, 2018

@halter73 just a heads up: I've been running 1.5 weeks in production without issues. I've seen high thread counts and have been trying to get the profiler attached in Azure to take a look (without success). Since I've seen thread exhaustion in the past, I want to make sure I'm not leaking anything, but the high thread count is normal for my load (though I never remember seeing this many threads in the legacy ASP.NET app). I got a dump, but all the threads were native with empty stack traces. We're currently serving 5+ billion requests a month across 4-5 large Azure web app instances, seeing 70-80% CPU and a few hundred threads (600+). I was able to narrow one of my issues down to this: StackExchange/StackExchange.Redis#759.

On a side note, I wish Microsoft would put some resources into that project (the Redis client), as it is used by Microsoft for the distributed cache implementation and in various scenarios inside of ASP.NET ;)

@benaadams
Copy link
Contributor

I've made a BlockingDetector package to output warnings to the logs when you block:
https://github.com/benaadams/Ben.BlockingDetector

It only detects blocking when it actually happens, so it doesn't pick up calls that may block but don't, warn about coding practices that lead to blocking, or catch blocking that happens at the OS rather than the .NET level (e.g. File.Read) - but it may help pick things up.
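
To make the idea concrete, an illustrative example (not taken from this thread) of the sync-over-async pattern such a detector tends to flag, next to the non-blocking equivalent:

using System.Net.Http;
using System.Threading.Tasks;

public static class BlockingExamples
{
    // Blocks a thread pool thread until the HTTP call finishes; under load this
    // kind of call is a common cause of thread pool starvation.
    public static string GetPageBlocking(HttpClient client, string url) =>
        client.GetStringAsync(url).Result;

    // Non-blocking equivalent: return the task and let the caller await it.
    public static Task<string> GetPageAsync(HttpClient client, string url) =>
        client.GetStringAsync(url);
}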

@Itoktsnhc
Copy link

Same problem here: the app randomly stops responding on my ASP.NET Core project on Azure App Service (IIS and the ASP.NET Core Module).
Another related issue: dotnet/aspnetcore#1950

Typical request metrics are shown below; when the no-response error happens, the number of threads used by the dotnet process rapidly jumps to 400 or 500.

[screenshot: request metrics]

@shaykeren
Copy link

Same problem in my environment: ASP.NET Core 2.0 published to Ubuntu 16.04.
The application is deployed in AWS with the following instances:
instance #1 contains our API, which hangs from time to time;
instance #2 contains our static HTML files, the nginx server that hosts the static files and forwards requests to the API (instance #1), and other instances that serve our other microservices.
(The API process hangs every 2-4 days.)

I saw that this process has 650 dead threads; is that relevant?

(lldb) sos Threads
ThreadCount: 677
UnstartedThread: 0
BackgroundThread: 25
PendingThread: 0
DeadThread: 650
Hosted Runtime: no

Any Idea?

@mshanak
Copy link

mshanak commented May 16, 2018

I had the same problem before; it's related to the HttpClient library.

I wrote this blog post about the issue, it may help you: https://medium.com/@mshanak/soved-dotnet-core-too-many-open-files-in-system-when-using-postgress-with-entity-framework-c6e30eeff6d1
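
A common culprit behind that class of "too many open files" failures is constructing a new HttpClient per request; assuming that is the case, a sketch of the usual fix, a single shared instance:

using System;
using System.Net.Http;
using System.Threading.Tasks;

public static class SharedHttpClient
{
    // One HttpClient for the lifetime of the application. Creating a client per
    // request leaves sockets lingering in TIME_WAIT/CLOSE_WAIT and can exhaust
    // the file descriptor limit on Linux.
    private static readonly HttpClient Client = new HttpClient
    {
        Timeout = TimeSpan.FromSeconds(30)
    };

    public static Task<string> GetStringAsync(string url) => Client.GetStringAsync(url);
}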

@runxc1
Copy link

runxc1 commented May 16, 2018

If you update to 2.1, do you still run into the same problem? I am just curious whether the switch from libuv to Sockets inside of Kestrel fixes this issue.

@shaykeren
Copy link

I'm familiar with your post, and we are using a single HttpClient.

@mshanak
Copy link

mshanak commented May 17, 2018

@shaykeren do you use HttpClient to send requests periodically to a URL? If so, use a cron job instead.

@euclid47
Copy link

euclid47 commented May 17, 2018

@mshanak @shaykeren when we experienced this issue, it happened with async calls to the user identity and/or Redis for cached user data within the Razor view. The only solution was to bounce the application.

Edit: Forgot to add razor view

@ZOXEXIVO
Copy link
Author

Caught it many times last week.
.NET Core 2.1, new Sockets transport. The application gets stuck.

@halter73
Copy link
Member

@ZOXEXIVO Are you running on 2.1.0 or one of the patches? It's possible you could be running into dotnet/corefx#29785 if you're still on 2.1.0.

@HassanHashemi
Copy link

HassanHashemi commented Aug 14, 2018

@halter73 Facing the same issue on CentOS.

dotnet --version:
2.1.302

I have used @benaadams's blocking detector and saw plenty of blocking calls; however, I do not think that is the actual issue, because the same app runs on Windows for days without any problem.

That said, if I make the following changes:

// Allow more concurrent outbound connections per endpoint.
System.Net.ServicePointManager.DefaultConnectionLimit = 256;
// Read the current maximum completion-port thread count...
System.Threading.ThreadPool.GetMaxThreads(out int _, out int completionThreads);
// ...and raise the pool minimums: 256 worker threads, with the completion-port minimum set to that maximum.
System.Threading.ThreadPool.SetMinThreads(256, completionThreads);

...the app keeps running for a couple more days.

Any ideas?

@kyuz0
Copy link

kyuz0 commented Nov 15, 2018

Sorry to re-open this. We are experiencing this issue in our application as well; we run on CentOS in Docker, behind nginx. Our app has had this issue with 2.0 and also with 2.1. Increasing the thread pool size seemed to help, but sometimes we see Kestrel hanging under very little load, while most of the time we have long uptimes under high load, which makes it difficult to debug.

Has there been any update on this, or any additional troubleshooting advice? We are planning to take a full core dump of Kestrel/dotnet the next time this happens and look at what's going on. We suspect it has to do with a deadlock, but we can't really see how our code would result in deadlocks, so maybe we are triggering a bug at the framework level.

@halter73
Copy link
Member

@kyuz0 If increasing the number of threadpool threads helped, that's a good sign you're experiencing threadpool starvation. This could be caused by a deadlock like you suspect, but there could be other causes too like blocking I/O.

To find the exact cause, collecting a core dump is your best bet. Unfortunately, analyzing linux core dumps isn't yet quite as easy as it is on Windows, but you can find a guide here. "clrstack -all" should show all the managed stacks and what they're blocked on.

@kyuz0
Copy link

kyuz0 commented Nov 16, 2018

@halter73 thanks for your reply. Is there any debugging that we can enable in Kestrel to get some logs/alerts when the threadpool is full? The difficulty at the moment is that when the application reaches this state, there's simply nothing happening in the logs.

Something else I've seen in our code is that we make raw TCP connections to some endpoints like this:

using (var client = new TcpClient())
{
    try
    {
        var result = client.BeginConnect(IPAddress.Parse(externalIp), port, null, null);
        var success = result.AsyncWaitHandle.WaitOne(TimeSpan.FromMilliseconds(TimeoutReadyCheckMilliseconds));
        if (!success)
        {
            return false;
        }

        client.EndConnect(result);
    }
    catch (SocketException e)
    {
        return false;
    }
}

The using statement should ensure that the TcpClient gets disposed in all cases, so it shouldn't matter that we don't explicitly call EndConnect() in every path. At least that was my understanding, but maybe this is what's causing issues, since the socket never receives an explicit EndConnect when success is false or an exception is thrown.

@halter73
Copy link
Member

halter73 commented Nov 16, 2018

Blocking on the IAsyncResult returned by TcpClient.BeginConnect is a possible candidate for the cause of your threadpool starvation. To fix this, you'll want to call the newer TcpClient.ConnectAsync method and await it in an async method while being careful to avoid calls to Task.Wait(), Task.Result or anything else in your codebase that could block a thread waiting on a Task.

The BeginConnect/EndConnect method pair implements an older asynchronous API pattern called the Asynchronous Programming Model, or APM. It's possible to use APM APIs without blocking, but it's much more difficult. You're almost always better off using a Task-returning API with async/await over using an APM API.

Also, did you leave some code out in the TcpClient initialization sample? The using block ends right after the call to TcpClient.EndConnect which means the connection has only just been established.

As for threadpool starvation events, you could look for the ThreadPoolWorkerThreadAdjustmentAdjustment event with the "0x06 - Starvation" Reason using perfcollect and PerfView. That being said, if this issue causes any health checks to fail, it's probably easier to have a watchdog process collect a core dump before killing the server process when the health check fails.
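
A hedged sketch of the rewrite described above, reusing the names from the earlier snippet (externalIp, port and the timeout value are assumed to come from the surrounding class):

using System;
using System.Net;
using System.Net.Sockets;
using System.Threading.Tasks;

public static class ReadyCheckSketch
{
    // Awaits ConnectAsync instead of blocking on the IAsyncResult wait handle;
    // the timeout is implemented with Task.WhenAny rather than WaitOne.
    public static async Task<bool> CanConnectAsync(string externalIp, int port, int timeoutMilliseconds)
    {
        using (var client = new TcpClient())
        {
            try
            {
                var connectTask = client.ConnectAsync(IPAddress.Parse(externalIp), port);
                var finished = await Task.WhenAny(connectTask, Task.Delay(timeoutMilliseconds));
                if (finished != connectTask)
                {
                    return false;  // timed out; disposing the client tears down the pending connect
                }

                await connectTask;  // surfaces any connection failure as an exception
                return true;
            }
            catch (SocketException)
            {
                return false;
            }
        }
    }
}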
