Last night, one of our production servers experienced some partial downtime. From about 8:15 PM to 8:45 PM (Eastern Time), about 8% of requests were failing.
The culprit was that there were too many files open at a system level on the server, which left nginx sporadically unable to fulfill requests. nginx was holding open many thousands of file descriptors to /dev/urandom until we eventually hit the system limit on the number of open files allowed (which causes all sorts of things to start failing). The issue was temporarily solved last night by simply restarting nginx, which cleared all the open file descriptors and got things back to normal. However, nginx's file descriptor count is still growing, so if nginx isn't restarted, we will exhaust the open file descriptor limit again in a couple more weeks.
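For reference, something like the following can be used to keep an eye on the descriptor counts (just a sketch, not what we actually run for monitoring; it assumes a Linux /proc filesystem and enough permissions to read nginx's /proc/&lt;pid&gt;/fd entries, so it typically needs to run as root):

```python
#!/usr/bin/env python3
# Count how many file descriptors nginx processes have open to /dev/urandom.
# Minimal sketch: assumes Linux /proc and sufficient permissions to read
# /proc/<pid>/fd entries (usually root for other users' processes).
import os

def urandom_fd_count():
    count = 0
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/comm") as f:
                if f.read().strip() != "nginx":
                    continue
            fd_dir = f"/proc/{pid}/fd"
            for fd in os.listdir(fd_dir):
                try:
                    if os.readlink(os.path.join(fd_dir, fd)) == "/dev/urandom":
                        count += 1
                except OSError:
                    pass  # fd closed while we were looking
        except OSError:
            pass  # process exited while we were scanning
    return count

if __name__ == "__main__":
    print(urandom_fd_count())
```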
I've pinpointed the issue to our use of the ngx_txid module for nginx. This module had a bug that caused file descriptors to leak every time nginx was sent the HUP signal to reload (which differs from a full nginx restart). We've actually been using this nginx module for quite some time, but when we rolled out the DNS changes a couple weeks ago, we began reloading nginx much more frequently. So while this leak has existed ever since we began using ngx_txid, it didn't become a real problem until we started reloading nginx far more often than before.
I submitted a patch to fix this file descriptor leak in ngx_txid this morning, and it's already been accepted: streadway/ngx_txid#6 So to fix this, we need to build a new version of nginx with the updated module and deploy that. In the meantime, we're not in immediate danger of this happening again, since it takes about 2 weeks for us to exhaust our file descriptor limit, but I'll try to get the patched version deployed today or tomorrow.
In terms of how we could have prevented this problem, this unfortunately was a pretty tricky one to foresee and test for. Since the issue took 2 weeks to reveal itself under load on our production servers, it would have been difficult to catch unless we were explicitly looking for this type of low-level leak. We had these updates running on our staging instances for about a week, but that wasn't long enough for things to actually break (particularly since we were restarting staging more frequently during that time to test things, and those full restarts delayed and masked the issue further).
That being said, now that this is something we know to look for, I am working on a test to add to our integration suite that repeatedly reloads nginx and looks for unexpected growth in the number of open file descriptors on the system. This should effectively test for this situation and help prevent this specific issue from cropping up in future nginx or module updates to our system.
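As a rough sketch of the shape that test might take (not the actual implementation that will land in the suite; the pid file path, reload count, and growth threshold below are all placeholders):

```python
#!/usr/bin/env python3
# Sketch of an integration test: reload nginx repeatedly and fail if the
# number of open file descriptors keeps growing. The nginx pid file path,
# reload count, and allowed growth threshold are illustrative placeholders.
import os
import signal
import time
import unittest

NGINX_PID_FILE = "/var/run/nginx.pid"  # placeholder path

def nginx_open_fd_count():
    """Count open file descriptors across all nginx processes."""
    count = 0
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/comm") as f:
                if f.read().strip() != "nginx":
                    continue
            count += len(os.listdir(f"/proc/{pid}/fd"))
        except OSError:
            pass  # process exited mid-scan
    return count

class TestNginxReloadFdLeak(unittest.TestCase):
    def test_repeated_reloads_do_not_leak_fds(self):
        with open(NGINX_PID_FILE) as f:
            master_pid = int(f.read().strip())

        baseline = nginx_open_fd_count()
        for _ in range(50):
            os.kill(master_pid, signal.SIGHUP)  # same reload signal we use in production
            time.sleep(0.5)  # give old workers time to exit and new ones to spawn

        # Allow a little jitter, but anything near one-leaked-fd-per-reload
        # (what we saw with ngx_txid) should trip this assertion.
        self.assertLess(nginx_open_fd_count(), baseline + 20)

if __name__ == "__main__":
    unittest.main()
```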
The one other thing we might want to look into is a better health check for our load balancers. That could have potentially removed the problematic server from rotation sooner, so the issue would have had less impact (although, given that things were mostly still working, a better health check might still have just led to the problematic server going in and out of rotation).
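As an illustration of what a deeper check might look like, the sketch below makes a real HTTP request and requires a 200 response within a timeout, rather than only checking that the port accepts connections (the health endpoint URL and timeout are hypothetical, not our actual load balancer configuration):

```python
#!/usr/bin/env python3
# Illustrative sketch of a deeper health check: make a real request and
# require a 200 within a timeout. Endpoint URL and timeout are hypothetical.
import sys
import urllib.request

HEALTH_URL = "http://127.0.0.1/api-umbrella/v1/health"  # hypothetical endpoint
TIMEOUT_SECONDS = 2

def healthy():
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=TIMEOUT_SECONDS) as resp:
            return resp.status == 200
    except OSError:
        return False

if __name__ == "__main__":
    sys.exit(0 if healthy() else 1)
```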
This fix has been deployed to production, and everything looks good. We're now holding steady at around 10 open urandom file descriptors for nginx (rather than upwards of 50,000). I'll continue to keep an eye on the overall system to watch for any other growth like this.
I've also committed a test for this to our integration suite on a branch, but I need to get the newer Ubuntu packages built with this same fix before it'll pass in our CI environment; then we can merge to master: NREL/api-umbrella-router@d16c002