Configure nginx worker timeout #1088
Conversation
Looking forward to this! 😄 I'm trying to deploy a WebSocket application to Kubernetes and use NGINX to reverse proxy the WebSocket connections to my application pods, but I have observed very odd memory usage behavior in my ingress pods ever since I deployed this to production. After struggling for a while, I noticed that the increased memory usage was due to lots of worker processes stuck in the "shutting down" state.
Stracing one of those worker processes showed that it was still handling WebSocket traffic:
So, is specifying a worker shutdown timeout (what this PR does) the only way to avoid having worker processes unable to shut down due to active WebSocket connections, or do you know of other ways of handling this?
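For reference, the setting this PR is about corresponds to nginx's worker_shutdown_timeout directive (available since nginx 1.11.11), which bounds how long an old worker may keep draining connections after a reload. A minimal sketch of what this looks like in an nginx.conf; the 240s value is an illustrative assumption, not something taken from this thread:

# Hypothetical excerpt of the generated nginx.conf (main context).
# Old workers are forcibly closed at most 240s after a reload,
# even if they still hold open WebSocket connections.
worker_shutdown_timeout 240s;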
@danielfm you can use this image
@aledbf just tried this image, but could not find the
Sorry about that. Please use this image instead. Edit: this image contains current master.
It did not seem to work for me; I still see several workers stuck in the "shutting down" state.
Any suggestion on how to find out what's going on?
@danielfm how are you testing this?
I simply deployed the image you provided and, when the ingress deployment was fully rolled out, I changed the DNS record for this application to shift traffic to the one deployed in Kubernetes. As soon as I switch the DNS, client applications start connecting to the WebSocket application in Kubernetes via the NGINX ingress load balancer.
When I do that, I can almost immediately see the memory usage of the ingress pods going up, and when I log into any machine where an NGINX pod is running, I can see several workers unable to shut down. When I roll back the DNS change, the memory usage stabilizes.
I'm also seeing lots of reload requests made by the ingress controller that I'm unable to explain (I expect the configuration to be reloaded, but not that frequently); this might give you some insight.
Edit: As soon as I kill those workers, the memory usage returns to "normal".
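As a side note, a quick way to spot such workers from inside an ingress pod is to look at the nginx process titles, since nginx retitles draining workers to "worker process is shutting down". A minimal sketch, assuming ps is available in the container:

# List nginx workers that are still draining after a reload,
# together with how long they have been running.
ps -o pid,etime,args -C nginx | grep 'shutting down'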
I seem to have found the culprit in my case: a misconfigured socket.io ping interval / timeout. After adjusting these according to the NGINX read/send timeouts (and the ELB idle timeout, since I'm running this on AWS), this is how things are working now.
From the client PoV, this is nice because we avoid dropping WebSockets at every configuration reload (which can happen quite frequently in larger deployments). However, from the server PoV, you might end up accumulating workers in the "shutting down" state (which might cause the elevated memory consumption I showed earlier), depending on how long the WebSocket connections are kept open and how many times the configuration gets reloaded by the ingress controller.
After spending some time on this problem, it seems the only way to mitigate it is to keep the rate of configuration reloads as low as possible. One way of achieving this is running a dedicated ingress deployment just for this WebSocket app, but that seems a bit overkill. @aledbf What do you think? Sorry for hijacking this thread, but it has gone too far already. 😅
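A sketch of the timeout relationship described above. The concrete values and names are assumptions for illustration (the thread does not state them); the point is that the socket.io ping interval plus ping timeout should stay below proxy_read_timeout, proxy_send_timeout and the ELB idle timeout, so the proxy always sees traffic on a healthy connection before any idle timeout fires:

# Hypothetical location block for the WebSocket app
location /socket.io/ {
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    # Must be larger than the socket.io pingInterval + pingTimeout
    # (and the ELB idle timeout should also exceed the ping interval),
    # otherwise idle-looking WebSockets get dropped.
    proxy_read_timeout 3600s;
    proxy_send_timeout 3600s;
    proxy_pass http://websocket-app-upstream;  # hypothetical upstream name
}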
Can you increase the log level to 2? (via the corresponding controller flag)
@danielfm are you using slack (k8s channel)?
@aledbf Yes, my handle is 'danielmartins'. I've increased the log level and, as far as I can tell, the only thing triggering configuration reloads is changes in upstreams. (Well, not exactly changes, since apparently the lists of endpoints and servers are identical; the only thing that changed was the order in which they got rendered.)
I0810 23:14:47.866939 5 nginx.go:300] NGINX configuration diff
I0810 23:14:47.866963 5 nginx.go:301] --- /tmp/a072836772 2017-08-10 23:14:47.000000000 +0000
+++ /tmp/b304986035 2017-08-10 23:14:47.000000000 +0000
@@ -163,32 +163,26 @@
proxy_ssl_session_reuse on;
- upstream production-chimera-production-pepper-80 {
+ upstream upstream-default-backend {
# Load balance algorithm; empty for round robin, which is the default
least_conn;
- server 10.2.71.14:3000 max_fails=0 fail_timeout=0;
- server 10.2.32.22:3000 max_fails=0 fail_timeout=0;
+ server 10.2.157.13:8080 max_fails=0 fail_timeout=0;
}
- upstream production-gabarito-production-80 {
+ upstream production-landings-production-80 {
# Load balance algorithm; empty for round robin, which is the default
least_conn;
- server 10.2.110.13:3000 max_fails=0 fail_timeout=0;
- server 10.2.109.195:3000 max_fails=0 fail_timeout=0;
+ server 10.2.82.66:3000 max_fails=0 fail_timeout=0;
+ server 10.2.79.124:3000 max_fails=0 fail_timeout=0;
+ server 10.2.59.21:3000 max_fails=0 fail_timeout=0;
+ server 10.2.45.219:3000 max_fails=0 fail_timeout=0;
}
upstream production-sisu-production-80 {
# Load balance algorithm; empty for round robin, which is the default
least_conn;
- server 10.2.109.177:3000 max_fails=0 fail_timeout=0;
server 10.2.12.161:3000 max_fails=0 fail_timeout=0;
- }
-
- upstream production-lap-production-worker-80 {
- # Load balance algorithm; empty for round robin, which is the default
- least_conn;
- server 10.2.21.37:9292 max_fails=0 fail_timeout=0;
- server 10.2.65.105:9292 max_fails=0 fail_timeout=0;
+ server 10.2.109.177:3000 max_fails=0 fail_timeout=0;
}
upstream production-passepartout-production-80 {
@@ -201,61 +195,67 @@
upstream production-lap-production-80 {
# Load balance algorithm; empty for round robin, which is the default
least_conn;
- server 10.2.45.223:8000 max_fails=0 fail_timeout=0;
+ server 10.2.21.36:8000 max_fails=0 fail_timeout=0;
server 10.2.78.36:8000 max_fails=0 fail_timeout=0;
+ server 10.2.45.223:8000 max_fails=0 fail_timeout=0;
server 10.2.99.151:8000 max_fails=0 fail_timeout=0;
- server 10.2.21.36:8000 max_fails=0 fail_timeout=0;
}
- upstream production-desauth-production-80 {
+ upstream production-chimera-production-pepper-80 {
# Load balance algorithm; empty for round robin, which is the default
least_conn;
- server 10.2.79.126:3000 max_fails=0 fail_timeout=0;
- server 10.2.35.105:3000 max_fails=0 fail_timeout=0;
- server 10.2.114.143:3000 max_fails=0 fail_timeout=0;
- server 10.2.50.44:3000 max_fails=0 fail_timeout=0;
- server 10.2.149.135:3000 max_fails=0 fail_timeout=0;
- server 10.2.45.155:3000 max_fails=0 fail_timeout=0;
+ server 10.2.71.14:3000 max_fails=0 fail_timeout=0;
+ server 10.2.32.22:3000 max_fails=0 fail_timeout=0;
}
- upstream production-live-production-80 {
+ upstream production-gabarito-production-80 {
# Load balance algorithm; empty for round robin, which is the default
least_conn;
- server 10.2.53.23:5000 max_fails=0 fail_timeout=0;
- server 10.2.110.22:5000 max_fails=0 fail_timeout=0;
- server 10.2.35.91:5000 max_fails=0 fail_timeout=0;
- server 10.2.45.221:5000 max_fails=0 fail_timeout=0;
+ server 10.2.110.13:3000 max_fails=0 fail_timeout=0;
+ server 10.2.109.195:3000 max_fails=0 fail_timeout=0;
}
- upstream upstream-default-backend {
+ upstream production-chimera-production-80 {
# Load balance algorithm; empty for round robin, which is the default
least_conn;
- server 10.2.157.13:8080 max_fails=0 fail_timeout=0;
+ server 10.2.78.26:3000 max_fails=0 fail_timeout=0;
+ server 10.2.59.22:3000 max_fails=0 fail_timeout=0;
+ server 10.2.96.249:3000 max_fails=0 fail_timeout=0;
+ server 10.2.32.21:3000 max_fails=0 fail_timeout=0;
+ server 10.2.114.177:3000 max_fails=0 fail_timeout=0;
+ server 10.2.83.20:3000 max_fails=0 fail_timeout=0;
+ server 10.2.118.111:3000 max_fails=0 fail_timeout=0;
+ server 10.2.26.23:3000 max_fails=0 fail_timeout=0;
+ server 10.2.35.150:3000 max_fails=0 fail_timeout=0;
+ server 10.2.79.125:3000 max_fails=0 fail_timeout=0;
+ server 10.2.157.165:3000 max_fails=0 fail_timeout=0;
}
- upstream production-landings-production-80 {
+ upstream production-lap-production-worker-80 {
# Load balance algorithm; empty for round robin, which is the default
least_conn;
- server 10.2.79.124:3000 max_fails=0 fail_timeout=0;
- server 10.2.82.66:3000 max_fails=0 fail_timeout=0;
- server 10.2.45.219:3000 max_fails=0 fail_timeout=0;
- server 10.2.59.21:3000 max_fails=0 fail_timeout=0;
+ server 10.2.21.37:9292 max_fails=0 fail_timeout=0;
+ server 10.2.65.105:9292 max_fails=0 fail_timeout=0;
}
- upstream production-chimera-production-80 {
+ upstream production-desauth-production-80 {
# Load balance algorithm; empty for round robin, which is the default
least_conn;
- server 10.2.96.249:3000 max_fails=0 fail_timeout=0;
- server 10.2.157.165:3000 max_fails=0 fail_timeout=0;
- server 10.2.114.177:3000 max_fails=0 fail_timeout=0;
- server 10.2.118.111:3000 max_fails=0 fail_timeout=0;
- server 10.2.79.125:3000 max_fails=0 fail_timeout=0;
- server 10.2.78.26:3000 max_fails=0 fail_timeout=0;
- server 10.2.59.22:3000 max_fails=0 fail_timeout=0;
- server 10.2.35.150:3000 max_fails=0 fail_timeout=0;
- server 10.2.32.21:3000 max_fails=0 fail_timeout=0;
- server 10.2.83.20:3000 max_fails=0 fail_timeout=0;
- server 10.2.26.23:3000 max_fails=0 fail_timeout=0;
+ server 10.2.114.143:3000 max_fails=0 fail_timeout=0;
+ server 10.2.79.126:3000 max_fails=0 fail_timeout=0;
+ server 10.2.45.155:3000 max_fails=0 fail_timeout=0;
+ server 10.2.35.105:3000 max_fails=0 fail_timeout=0;
+ server 10.2.50.44:3000 max_fails=0 fail_timeout=0;
+ server 10.2.149.135:3000 max_fails=0 fail_timeout=0;
+ }
+
+ upstream production-live-production-80 {
+ # Load balance algorithm; empty for round robin, which is the default
+ least_conn;
+ server 10.2.53.23:5000 max_fails=0 fail_timeout=0;
+ server 10.2.45.221:5000 max_fails=0 fail_timeout=0;
+ server 10.2.35.91:5000 max_fails=0 fail_timeout=0;
+ server 10.2.110.22:5000 max_fails=0 fail_timeout=0;
}
server {
All other configuration reloads look like this.
@danielfm, maybe it is not too late, but "--sort-backends=true" should help :)
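For reference, a sketch of where that flag goes in the controller invocation; only --sort-backends=true comes from this thread, the other argument is just a typical example:

# Illustrative ingress controller command line
/nginx-ingress-controller \
  --default-backend-service=$(POD_NAMESPACE)/default-http-backend \
  --sort-backends=true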
@redbaron Haha thanks, after I hit this, I noticed the latest version introduced that flag and started using it right away. 😄 |