Kestrel randomly locks up and stops responding. #1267
Comments
Just to be clear, are you running everything (host and containers) on Linux? Are you doing remote access from a Windows console window? It could be this if so: #1259 |
Yes, it's all running on Ubuntu, on a 2-core server (a non-server G4400 CPU) with 8GB RAM. It's very hard to reproduce, and I want to help you fix it. Can you recommend tracing tools for collecting debug info from .NET Core in a Linux environment? The issue always appears after a few days without any user activity on the server. |
Update: the server just locked up again, two days after the container was restarted. |
@ZOXEXIVO are you passing a cancellation token to any of your response writing API calls (if you have any)? |
@davidfowl No. Only simple async requests to MongoDB without passing CancellationToken. |
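For readers following along: what is being asked about here is whether response writes receive the request's cancellation token (HttpContext.RequestAborted). A minimal sketch of that pattern, with the payload as a placeholder, might look like this:

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Http;

public class Startup
{
    public void Configure(IApplicationBuilder app)
    {
        app.Run(async context =>
        {
            // RequestAborted is signalled when the client disconnects, so the
            // write is cancelled instead of being left pending forever.
            await context.Response.WriteAsync("hello", context.RequestAborted);
        });
    }
}
```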
We're seeing something similar. I can exec into the container and see the process still running, but it's not responding to curl. It sits at 4-6% CPU constantly. I can reproduce it 100%: it happens randomly under heavy load from our Protractor unit tests hitting our web app, so it takes anywhere from 0-20 minutes to trigger. Our app listens on port 1924 and has nginx in front of it; nginx returns a 502 or 504. curl inside the container:
Hangs and has to be killed. netstat inside container:
|
I work with @philjones88 and to confirm this is a self-contained app running against NetCore 1.0.1 on a CentOS 7 container. |
we enabled more verbose logging and captured:
|
After putting nginx in front of Kestrel, the problem disappeared (2 weeks of uptime). |
@ZOXEXIVO we are experiencing the same symptoms in the same containerized environment. Since deploying nginx in front of Kestrel, have you had any lockups? |
Any idea when #1304 will be pushed out? I am running .net core 1.1 in an Azure web app and am similarly having the problem where it will crash and never start serving content again until the app is swapped or restarted |
Closing as a dup of #1304. |
Hi, |
@tavisca-syedave we plan to have the patch out hopefully later this month. |
Kestrel 1.1.1 is now released. Is anyone still running into this issue? |
I think I've just got it with 1.1.2 using unix socket (behind nginx). Will enable verbose logging and come back if I can reproduce it. I have StaticFiles middleware disabled in production (static files are served by nginx). |
Just got it too with 1.1.2 behind nginx. |
I also just got it on 1.1.2. Nginx reverse proxy to Kestrel on Debian. In my case it was working in production for more than a month, then it suddenly stopped working. Now it stops working every hour. |
Best advice -> upgrade to v2.0 |
@ZOXEXIVO Did upgrading to v2 fix your problem? |
@jeff-pang The problem disappeared when I hosted Kestrel behind nginx. |
@halter73 |
I'm also trying to track down this issue with two of our sites. It seems to happen when there is load on the site and it doesn't fix itself. We use a lot of async redis calls / http client calls via the nest client |
@niemyjski It could be threadpool starvation. Have you collected a memory dump when Kestrel isn't responding? |
Same problem here. Randomly stops responding and consumes 1-2% CPU. dotnet core 2.0.3 |
@ilyabreev Any details? Information that might be useful:
Also, I recommend opening a new issue. Even if the behavior described in this issue is similar to what you're experiencing, this issue describes a bug that was fixed prior to the release of 2.0.0. |
@halter73 yes, I have some memory dumps. |
@niemyjski Can you share these? You can email me with a link at the address listed in my github profile. Thanks. |
As this issue has been closed, is there another issue to track the "dotnet process stops" problem? We're facing the same issue: the dotnet process stops unexpectedly. |
@mazong1123 please feel free to file another bug to discuss the current issues. |
This blog post could help you increase the max open files limit and configure nginx to close connections for finished requests. |
I think I solved my own problem. A few days ago I looked at the nginx error logs in /var/log/nginx/error.log and saw errors like this: "upstream prematurely closed connection while reading response header from upstream". I thought about it, looked at my nginx config, realized it was bad, and changed the following config:
My new config is:
I'm also no longer using the upstream block because it isn't necessary for my system. Sorry for my bad English. |
@halter73 just a heads up: I've been running 1.5 weeks in production without issues. I've seen a high thread count and have been trying to get the profiler attached in Azure to take a look (without success). Considering I've seen thread exhaustion in the past, I want to ensure I'm not leaking anything, but the high thread count is normal for my load (though I never remember seeing this many threads in the legacy asp.net app). I got a dump, but all the threads were native with empty stack traces. Currently serving up 5+ billion requests a month across 4-5 large Azure web app instances, seeing 70-80% CPU and a couple hundred threads (600+). I was able to narrow down one of my issues to this: StackExchange/StackExchange.Redis#759. Interesting note here: I wish Microsoft would throw some resources at this project (the Redis client), as it is used by Microsoft for the distributed cache implementation and is used for various scenarios inside of asp.net ;) |
I've made a BlockingDetector package that outputs warnings to the logs when you block. It only detects when blocking actually happens, so it doesn't pick up calls that may block but don't, and it doesn't warn about coding practices that lead to blocking (or blocking that happens at the OS rather than the .NET level, e.g. File.Read) - but it may help pick things up. |
Same problem here: random non-responses from my asp.net core project on Azure App Service (IIS and the ASP.NET Core Module). Typical request metrics are shown below; when the no-response error happens, the number of threads used by the dotnet process rapidly jumps to 400 or 500. |
Same problem in my environment, ASP.NET Core 2.0 published to Ubuntu 16.04. I saw that the process has 650 dead threads ((lldb) sos Threads); is that relevant? Any ideas? |
I had the same problem before; it's related to the HttpClient library. I wrote this blog post about the issue, it may help you: https://medium.com/@mshanak/soved-dotnet-core-too-many-open-files-in-system-when-using-postgress-with-entity-framework-c6e30eeff6d1 |
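For anyone hitting the "too many open files" variant of this, the usual mitigation (and, as I understand it, the gist of the post above) is to reuse one HttpClient rather than creating one per request. A minimal sketch of that pattern; the class name and usage are invented for illustration:

```csharp
using System.Net.Http;
using System.Threading.Tasks;

// Hypothetical example: reuse one HttpClient for the app's lifetime instead of
// creating one per request, which can exhaust sockets and open file handles.
public class RemoteStatusClient
{
    private static readonly HttpClient Client = new HttpClient();

    public Task<string> GetStatusAsync(string url)
    {
        // The returned task is awaited by the caller; no .Result or .Wait(),
        // which would block a threadpool thread.
        return Client.GetStringAsync(url);
    }
}
```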
If you update to 2.1 do you still run into the same problem? I am just curious if the switch from libuv to sockets inside of Kestrel fixes this issue? |
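For reference, on 2.1 the transport can be selected explicitly, which makes it possible to compare libuv against sockets when chasing a hang like this. A minimal sketch, assuming the relevant Kestrel transport package is referenced and a Startup class exists elsewhere:

```csharp
using Microsoft.AspNetCore;
using Microsoft.AspNetCore.Hosting;

public class Program
{
    public static void Main(string[] args)
    {
        WebHost.CreateDefaultBuilder(args)
            .UseKestrel()
            // Sockets is the default transport in 2.1; swap in .UseLibuv()
            // (with the Libuv transport package) to compare the old behavior.
            .UseSockets()
            .UseStartup<Startup>()
            .Build()
            .Run();
    }
}
```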
I'm familiar with your post and we are using a single httpclient. |
@shaykeren do you use the HttpClient to send requests periodically to a URL? If so, use a cron job instead. |
@mshanak @shaykeren when we experienced this issue it happened with async calls to the user identity and/or redis for cached user data within the razor view. The only solution was to bounce the application. Edit: Forgot to add razor view |
Caught it many times last week. |
@ZOXEXIVO Are you running on 2.1.0 or one of the patches? It's possible you could be running into dotnet/corefx#29785 if you're still on 2.1.0. |
@halter73 Facing the same issue on CentOS.
I have used @benaadams's blocking detector and saw plenty of blocking calls. However, I don't think that's the actual issue, because the same app runs on Windows for days without any problem. That said, if I make the following changes:
The app remains running for a couple more days... any ideas? |
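One mitigation commonly tried when threadpool starvation is suspected (and consistent with the later comment that increasing the thread pool helped) is raising the ThreadPool minimums at startup. A sketch of that idea only, with purely illustrative numbers - not necessarily the change this commenter made:

```csharp
using System.Threading;

public static class ThreadPoolTuning
{
    public static void Apply()
    {
        // Raise the minimum worker/IOCP threads so the pool ramps up faster
        // under bursts of blocking work. 200 is illustrative only, and this
        // masks starvation rather than fixing the underlying blocking calls.
        ThreadPool.SetMinThreads(workerThreads: 200, completionPortThreads: 200);
    }
}
```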
Sorry to re-open this. We are experiencing this issue in our application as well; we run on CentOS in Docker and behind nginx. Our app has had this issue with 2.0 and also with 2.1. Increasing the thread pool seemed to help, but sometimes we experience Kestrel hanging with very little load, while for the majority of the time we have long uptimes with high load, which makes it difficult to debug. Has there been any update on this, or any additional troubleshooting advice? We are planning to take a full core dump of kestrel/dotnet the next time this happens and look at what's happening. We suspect it's got to do with some deadlocks, but we can't really see how our code would result in deadlocks, so maybe we are triggering some bug at the framework level. |
@kyuz0 If increasing the number of threadpool threads helped, that's a good sign you're experiencing threadpool starvation. This could be caused by a deadlock like you suspect, but there could be other causes too like blocking I/O. To find the exact cause, collecting a core dump is your best bet. Unfortunately, analyzing linux core dumps isn't yet quite as easy as it is on Windows, but you can find a guide here. "clrstack -all" should show all the managed stacks and what they're blocked on. |
@halter73 thanks for your reply. Is there any debugging that we can enable in Kestrel to get some logs/alerts when the threadpool is full? The difficulty at the moment is that when the application reaches this state, there's simply nothing happening in the logs. Something else I've seen in our code is that we make raw TCP connections to some endpoints like this:
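Based on the follow-up discussion (BeginConnect, a success flag, and EndConnect inside a using block), the pattern being described presumably looks roughly like the following. This is an assumed reconstruction for illustration, not the poster's actual code:

```csharp
using System;
using System.Net.Sockets;

public static class PortProbe
{
    // Assumed shape of the pattern under discussion: BeginConnect with a
    // bounded wait on the IAsyncResult. The WaitOne call blocks a threadpool
    // thread for up to the full timeout, which can contribute to starvation.
    public static bool CanConnect(string host, int port, TimeSpan timeout)
    {
        using (var client = new TcpClient())
        {
            IAsyncResult result = client.BeginConnect(host, port, null, null);
            bool success = result.AsyncWaitHandle.WaitOne(timeout);
            if (success)
            {
                client.EndConnect(result);
            }
            return success;
        }
    }
}
```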
The using statement should ensure that the TcpClient gets disposed of in all cases, it should not matter that we don't explicitly call EndConnect() in all cases... at least this was my understanding, but maybe this is what's causing issues as the socket will not receive an explicit EndConnect in case success is false or in case of exceptions. |
Blocking on the IAsyncResult returned by TcpClient.BeginConnect is a possible candidate for the cause of your threadpool starvation. To fix this, you'll want to call the newer TcpClient.ConnectAsync method and await it in an async method, while being careful to avoid calls to Task.Wait(), Task.Result or anything else in your codebase that could block a thread waiting on a Task. The BeginConnect/EndConnect method pair implements an older asynchronous API pattern called the Asynchronous Programming Model, or APM. It's possible to use APM APIs without blocking, but it's much more difficult. You're almost always better off using a Task-returning API with async/await over using an APM API. Also, did you leave some code out of the TcpClient initialization sample? The using block ends right after the call to TcpClient.EndConnect, which means the connection has only just been established. As for threadpool starvation events, you could look for the ThreadPoolWorkerThreadAdjustmentAdjustment event with the "0x06 - Starvation" Reason using perfcollect and perfview. That being said, if this issue causes any health checks to fail, it's probably easier to have a watchdog process collect a core dump before killing the server process when the health check fails. |
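A minimal sketch of the non-blocking alternative described above: TcpClient.ConnectAsync awaited end to end, with the timeout implemented by racing the connect against Task.Delay (one reasonable approach among several; the helper name is invented):

```csharp
using System;
using System.Net.Sockets;
using System.Threading.Tasks;

public static class PortProbeAsync
{
    // Await ConnectAsync instead of blocking on BeginConnect's IAsyncResult.
    // No threadpool thread is blocked while the connection attempt is pending.
    public static async Task<bool> CanConnectAsync(string host, int port, TimeSpan timeout)
    {
        using (var client = new TcpClient())
        {
            Task connectTask = client.ConnectAsync(host, port);
            Task completed = await Task.WhenAny(connectTask, Task.Delay(timeout));
            if (completed != connectTask)
            {
                return false; // timed out; disposing the client abandons the connect
            }

            await connectTask; // propagate any connect exception
            return true;
        }
    }
}
```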
I run ASP.NET Core 1.1 on Ubuntu 16.04 in a Docker container (with an Ubuntu image).
This works fine, but 3 times in two weeks I've hit an error where Kestrel stops responding.
No errors in the exception filter; in the logs, only a last record showing the action starting to execute, but never completing.
Why do I think the problem is in Kestrel? Because it is locked and not processing new requests.
In the locked state, CPU usage is 4-5% (per core). I waited 4 hours, but no result.
TOP command result:
Restarting the Docker container solves the issue.
I can't attach and trace it, because all the tools I use work only on Windows.