Regression issue with keep alive connections #27363
Comments
// cc @nodejs/http |
The slowloris mitigations only apply to the HTTP headers parsing stage. Past that stage, the normal timeouts apply (barring bugs, of course). Is it an option for you to try out 10.15.1 and 10.15.2, to see if they exhibit the same behavior? |
In our test suite, there are about 250 HTTP requests. I have run the test suite four times for each of the following Node versions: 10.15.0, 10.15.1, and 10.15.2. For 10.15.0 and 10.15.1 there were zero HTTP failures. For 10.15.2 there were on average two failures (HTTP 502) per test suite run. In every run a different test case fails, so the failures are not deterministic. I tried to build a simple Node server and reproduce the issue with it, but so far without any success. We will try to figure out the exact pattern and volume of requests needed to reproduce the issue. Timing and the speed of the client might matter. |
I guess that […]. @OrKoN What happens with your test suite if you put a […]? |
Created a test case that reproduces the issue. It fails on 10.15.2 and 10.15.3. To illustrate, here is an example of two requests on a keep-alive connection:
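(A minimal sketch of that scenario, with deliberately shrunk timeouts so the window is easy to hit; the values and structure are illustrative rather than the original test case.)

```js
const http = require('http');

const server = http.createServer((req, res) => {
  res.end('ok');
});

// Keep idle sockets around for 10 s, but only allow 2 s, measured (before the
// fix) from the END of the previous request, for the next request's headers.
server.keepAliveTimeout = 10 * 1000;
server.headersTimeout = 2 * 1000;

server.listen(0, () => {
  const agent = new http.Agent({ keepAlive: true, maxSockets: 1 });
  const options = { port: server.address().port, agent };

  const request = (label) =>
    http
      .get(options, (res) => {
        res.resume();
        res.on('end', () => console.log(label, res.statusCode));
      })
      .on('error', (err) => console.error(label, err.code));

  request('first');
  // Reuse the same socket after headersTimeout has elapsed but while the
  // keep-alive window is still open. On affected versions (e.g. 10.15.2+
  // before the fix in #32329) the second request fails with ECONNRESET.
  setTimeout(() => request('second'), 3 * 1000);
});
```

On a version with the fix, both requests log 200; on an affected version the second request is reset even though the socket is still well within its keep-alive window.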
I wonder whether […] |
@shuhei so you mean that […]? |
@OrKoN Yes, […] |
I see. So it looks like an additional place to reset the timer would be the beginning of a new request? And […]. P.S. I will run our tests with an increased headersTimeout today to see if it helps. |
So we have applied the workaround (increased headersTimeout). |
I faced this issue too. I configured my Nginx load balancer to use keepalive when connecting to Node upstreams. I had already seen it dropping connections and found the reason. I switched to Node 10 after that and was surprised to see this happening again: Nginx reports that Node closed the connection unexpectedly, and then Nginx disables that upstream for a while. I have not seen this problem after tweaking the header timeouts yesterday as proposed by @OrKoN above. I think this is a serious bug, since it results in load balancers switching nodes off and on. Why does nobody else find this bug alarming? My guess is that […] |
We're having the same problem after upgrading from 8.x to 10.15.3. The original code did not fail in a consistent way, which led me to believe there's some kind of race condition where, between the keepAliveTimeout check and the connection termination, a new request can try to reuse the connection. So I tweaked the test so that: […]
The results are pretty consistent:
Error: socket hang up
at createHangUpError (_http_client.js:343:17)
at Socket.socketOnEnd (_http_client.js:444:23)
at Socket.emit (events.js:205:15)
at endReadableNT (_stream_readable.js:1137:12)
at processTicksAndRejections (internal/process/task_queues.js:84:9) {
code: 'ECONNRESET'
}
You can clone the code from yoavain/node8keepAliveTimeout. When setting keepAliveTimeout to 0, the problem is gone. |
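A rough sketch of that timing, assuming a deliberately short keep-alive window (the values are illustrative and not taken from the linked repository):

```js
const http = require('http');

const server = http.createServer((req, res) => res.end('ok'));
server.keepAliveTimeout = 1000; // short window so the expiry is easy to hit

server.listen(0, () => {
  const agent = new http.Agent({ keepAlive: true, maxSockets: 1 });
  const options = { port: server.address().port, agent };

  http.get(options, (res) => res.resume());

  // Fire the next request right around the 1 s expiry. On a bad interleaving
  // the server accepts it on the reused socket and then tears the socket
  // down, which surfaces as ECONNRESET / "socket hang up" on the client.
  setTimeout(() => {
    http
      .get(options, (res) => {
        res.resume();
        console.log('second request:', res.statusCode);
      })
      .on('error', (err) => console.error('second request:', err.code));
  }, 1000);
});
```

Because this is a race, the failure is intermittent, which matches the inconsistent results described above.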
Thanks for the info, guys! This is a nasty issue that reared its head when we went straight from 10.14 to 12. Node kept dropping our connections before the AWS Load Balancer knew about it. Once I set the ELB timeout < keepAliveTimeout < headersTimeout (we weren't even setting that one), the problem went away. |
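Spelled out as a sketch (assuming the load balancer's idle timeout is left at the 60-second ELB/ALB default; the constant name here is just for illustration):

```js
const http = require('http');

const server = http.createServer((req, res) => res.end('ok'));

// Idle timeout configured on the load balancer (60 s is the AWS default).
const ELB_IDLE_TIMEOUT_MS = 60 * 1000;

// Node should hold idle keep-alive sockets longer than the ELB does, and
// header parsing should be allowed at least as long as the keep-alive window.
server.keepAliveTimeout = ELB_IDLE_TIMEOUT_MS + 5 * 1000; // 65 s > ELB timeout
server.headersTimeout = server.keepAliveTimeout + 1000;   // 66 s > keepAliveTimeout

server.listen(8080);
```

The exact numbers matter less than the ordering: the side in front should always give up on an idle connection before Node does.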
I'm pretty sure we're seeing this as well (v10.13.0). We have Nginx in front of Node.js within K8s. We were seeing random "connection reset by peer" or "upstream prematurely closed connection" errors for requests Nginx was sending to Node.js apps. On all these occasions the problem occurred on connections established by Nginx to Node.

Right on the default 5-second keepAliveTimeout on the Node.js side, Nginx decided to reuse its open/established connection to the Node process and send another request (technically outside the 5-second timeout limit on the Node side, by <2 ms). Node.js accepted this new request over the existing connection and responded with an ACK packet, then <2 ms later followed up with a RST packet closing the connection. However, stracing the Node.js process I could see the app code had received the request and was processing it, but before the response could be sent, Node had already closed the connection.

I would second the thought that there is a slight race condition between the point where the connection is about to be closed by Node.js and the point where it is still accepting an incoming request. To avoid this we simply increased the Node.js keepAliveTimeout to be higher than Nginx's, thus giving Nginx control over the keep-alive connections. http://nginx.org/en/docs/http/ngx_http_upstream_module.html#keepalive_timeout

A screenshot of a packet capture taken on the Node.js side of the connection is attached: |
Wow, very interesting thread. I have a suspicion that we're facing a similar issue in AppEngine Node.js Standard: ~100 502 errors a day out of ~1M requests per day total (~0.01% of all requests). |
Can confirm this is still the case in the v10.19.0 release. |
For keep-alive connections, the headersTimeout may fire during subsequent request because the measurement was reset after a request and not before a request. Fixes: nodejs#27363
We have investigated an issue affecting a very small subset of requests as well, and this was the root cause. The behaviour we see is exactly what @markfermor described; you can read even more in our investigation details. The following configuration lines did indeed solve the issue:
server.keepAliveTimeout = 76 * 1000;
server.headersTimeout = 77 * 1000; |
This is what solved it for us in AppEngine:
this.server.keepAliveTimeout = 600 * 1000 |
For keep-alive connections, the headersTimeout may fire during subsequent request because the measurement was reset after a request and not before a request. PR-URL: #32329 Fixes: #27363 Reviewed-By: Anna Henningsen <[email protected]> Reviewed-By: Matteo Collina <[email protected]>
Node 10.15.4 is not happening. At best, it might be backported to Node 10.20.1 or similar. |
Would love to see it in 10.20.x or similar if it’s at all feasible to port it back. |
Note that this also appears to be broken in v12, not just the recent v10s. |
Since #32329 was merged, now I don't need to set […]? |
For keep-alive connections, the headersTimeout may fire during subsequent request because the measurement was reset after a request and not before a request. Backport-PR-URL: #34131 PR-URL: #32329 Fixes: #27363 Reviewed-By: Anna Henningsen <[email protected]> Reviewed-By: Matteo Collina <[email protected]>
@yokomotod if you're asking for v12, v12.19.0 contains the fix: v12.18.4...v12.19.0#diff-feaf3339998a19f0baf3f82414762c22 https://github.com/nodejs/node/blob/master/doc/changelogs/CHANGELOG_V12.md#12.19.0 |
If the default is 60 s, then we still need to override that if the ELB has a longer duration, right? |
Yes, the application's keep-alive timeout should be higher than whatever is in front of it (Nginx, ELB, ...). Whether to change the timeout on the application side or the load balancer side depends on your setup (e.g. the Google Cloud HTTP load balancer does not allow changing the timeout value). I'd suggest making your application agnostic to how it is deployed and making this value configurable, for example through an environment variable. The Node.js default depends on the version you're running (I think it is 5 seconds for all non-obsolete versions); you can check for your version in the docs here. |
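For example, something along these lines (the environment variable name is only an illustration, not an established convention):

```js
const http = require('http');

const server = http.createServer((req, res) => res.end('ok'));

// Let the deployment choose how long idle keep-alive sockets are held,
// falling back to Node's 5 s default when nothing is configured.
const keepAliveTimeoutMs = Number(process.env.KEEP_ALIVE_TIMEOUT_MS || 5000);
server.keepAliveTimeout = keepAliveTimeoutMs;
server.headersTimeout = keepAliveTimeoutMs + 1000; // keep this above keepAliveTimeout

server.listen(8080);
```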
…46052) Resolves #39689, partially resolves #28642 (see notes below). Inspired by #44627. In #28642 it was also asked to expose `server.headersTimeout`, but it is probably not needed for most use cases and is not implemented even in `next start`. It was necessary to change this option before nodejs/node#27363. There also exists a rare bug that is described in nodejs/node#32329 (comment). To fix this, exposing `server.headersTimeout` might be required both in `server.js` and in `next start`. Co-authored-by: JJ Kasper <[email protected]>
Hi everyone. Thank you for sharing these insights. Is there a chance that the fix is missing in v20.11.0? This issue does not occur in 16.16.0 or 18.19.0. However, I'm noticing 502 errors when I use v20.11.0. Would love to get feedback on whether other members in the community are facing this issue. |
I am seeing the same issue on v20.9.0 using Fastify as my server. |
For keep-alive connections, the headersTimeout may fire during subsequent request because the measurement was reset after a request and not before a request. PR-URL: nodejs/node#32329 Fixes: nodejs/node#27363 Reviewed-By: Anna Henningsen <[email protected]> Reviewed-By: Matteo Collina <[email protected]>
Hi,
We updated the Node version from 10.15.0 to 10.15.3 for a service which runs behind the AWS Application Load Balancer. After that, our test suite revealed an issue we didn't see before the update, which results in HTTP 502 errors thrown by the load balancer. Previously, this was happening if the Node.js server closed a connection before the load balancer did. We solved this by setting
server.keepAliveTimeout = X
where X is higher than the keep-alive timeout on the load balancer side. With version 10.15.3, setting
server.keepAliveTimeout = X
does not work anymore and we see regular 502 errors from the load balancer. I have checked the changelog for Node.js, and it seems that there was a change related to keep-alive connections in 10.15.2 (1a7302bd48) which might have caused the issue we are seeing. Does anyone know if the mentioned change can cause it? In particular, I believe the problem is that the connection is closed before the specified keep-alive timeout.