ECONNRESET while doing http 1.1 keep alive requests and server closes the connections #47130
Comments
I've read your description of the problem but I don't think it's actionable without a way to reproduce that doesn't depend on third-party modules like axios. Can you add a test case? |
@bnoordhuis I'll see if I can come up with something. Since this happened across multiple frameworks, this should be reproducible using just the core stuff. Gimme this weekend :) |
Node handles ECONNRESET when the socket is sitting in the socket pool, but there's a fundamental (and inevitable) race condition when it happens just as the socket is being pulled from the pool. Node could theoretically use another socket if the current socket fails on first write, but without a circuit breaker or other fail-safes that creates a DoS-like scenario when the server immediately closes incoming connections: Node would basically busy-loop trying to send the request. |
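For illustration, a minimal sketch of the kind of bounded, application-level retry being alluded to here (this is not Node core behavior; the retry cap is the fail-safe that prevents the busy-loop, and blindly retrying is only safe for idempotent requests or callers that can tolerate replays):

```js
import http from 'node:http';

// Retry a request a bounded number of times when a pooled keep-alive socket turns out to be
// dead (connection reset before any response arrived). The bound avoids busy-looping when the
// server immediately closes every incoming connection.
function requestWithRetry(options, body, retriesLeft = 2) {
  return new Promise((resolve, reject) => {
    const req = http.request(options, (res) => {
      const chunks = [];
      res.on('data', (chunk) => chunks.push(chunk));
      res.on('end', () => resolve(Buffer.concat(chunks)));
    });
    req.on('error', (err) => {
      if ((err.code === 'ECONNRESET' || err.code === 'EPIPE') && retriesLeft > 0) {
        resolve(requestWithRetry(options, body, retriesLeft - 1));
      } else {
        reject(err);
      }
    });
    req.end(body);
  });
}
```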
I understand the inevitable race that can happen and I agree you have to retry in such a scenario, and it's best that the application takes care of that. But based on my TCP dump, the server sends a FIN which causes the client to send an ACK (the socket on the client transitions to CLOSE_WAIT). Whatever happens after the ACK is sent causes the socket to get pushed to the top of the connection pool (which is why the very next request goes on it). So there is time for the LIFO algorithm or underlying code to push the connection to the top of the pool - the same code could remove it from the pool instead of pushing it to the top when the ACK is for a FIN. This is what I wanted to highlight. I am working on a reproducible example. I hope I am successful with it. |
@bnoordhuis So, I don't have guaranteed replicability - but it happens over time (in 5 of the runs I got it within the first 5 minutes or so; for 2 other runs I had to leave it overnight). I'm running them on different physical boxes. I couldn't get
Server code:
import http from 'node:http';
//Time, in seconds, any socket can be alive for. If there is a request already being processed close it as soon as response is done
let maxSocketLifetime = 20;
if (maxSocketLifetime === 0) {
process.exit(1);
} else {
maxSocketLifetime = maxSocketLifetime * 1000;
}
/**
*
* @param socket {node:net.Socket}
* @return {Promise<void>}
*/
async function initiateClose(socket) {
//console.log('Initiate Close' + socket.remotePort);
if (socket.writableEnded || socket.destroyed || socket.closed) {
return;
}
if(socket.requestInFlight) {
socket.closeAfterResponse = true;
return;
}
//Send FIN
//console.log('Timed Close' + socket.remotePort);
socket.end();
}
async function handleClose(hasError) {
console.log('Socket Closed' + this.remotePort);
//when it was closed by client
clearTimeout(this.intervalz);
}
/**
*
* @param socket {node:net.Socket}
* @return {Promise<void>}
*/
async function handleConnection(socket) {
console.log('New Connection' + socket.remotePort);
socket.requestInFlight = false;
socket.closeAfterResponse = false;
socket.intervalz = setTimeout(initiateClose, maxSocketLifetime - 100, socket);
socket.on('close', handleClose);
}
/**
*
* @param req {http.IncomingMessage}
* @param res {http.ServerResponse}
* @return {Promise<void>}
*/
async function handleRequest(req, res) {
if (req.socket.writableEnded) { // If client sends data after we sent FIN
console.log('Reset');
req.socket.resetAndDestroy();
return; // don't treat this request as in-flight on a socket we just reset
}
req.socket.requestInFlight = true;
let body = [];
req.on('data', (chunk) => {
body.push(chunk);
}).on('end', () => {
body = Buffer.concat(body).toString();
res.end(body);
res.socket.requestInFlight = false;
if(res.socket.closeAfterResponse) {
//Send FIN
//console.log('Close after response' + res.socket.remotePort);
res.socket.end();
}
});
}
let serverOpts = {
keepAliveTimeout: maxSocketLifetime,
connectionsCheckingInterval: Math.floor(maxSocketLifetime/3)
}
let server = http.createServer(serverOpts);
server.on('connection', handleConnection);
server.on('request', handleRequest);
server.listen(6969);
console.log('Server Started');
Client code:
import http from "node:http";
let load_arr = [5000, 4000, 3000, 1500, 1000, 600, 400, 300, 200, 100, 50, 20 , 10 , 1];
let interval_arr = [1, 3, 5, 8, 9, 10, 13, 15, 17];
const postData = JSON.stringify([{"_id":"6414532c438f86245e1a9fae","index":0,"guid":"a7bf9526-1972-418d-b251-f865a02131ae","isActive":false,"balance":"$2,510.55","picture":"http://placehold.it/32x32","age":31,"eyeColor":"green","name":"Pittman Green","gender":"male","company":"EVENTEX","email":"[email protected]","phone":"+1 (812) 487-3784","address":"557 Johnson Avenue, Dotsero, South Dakota, 682","about":"Esse voluptate proident duis sunt excepteur eiusmod proident dolore. Ullamco nulla voluptate nostrud ullamco officia quis proident id pariatur Lorem quis culpa. Non ipsum voluptate exercitation reprehenderit ex duis nisi exercitation.\r\n","registered":"2014-05-19T06:47:19 +07:00","latitude":55.881952,"longitude":176.809812,"tags":["dolor","magna","aute","voluptate","ex","eiusmod","nisi"],"friends":[{"id":0,"name":"Tammi Odom"},{"id":1,"name":"Bush Ingram"},{"id":2,"name":"Pennington Sharp"}],"greeting":"Hello, Pittman Green! You have 1 unread messages.","favoriteFruit":"strawberry"},{"_id":"6414532ce611be5510ba2c21","index":1,"guid":"6350b383-bee6-48a1-b19e-00cc20d74db7","isActive":true,"balance":"$2,193.74","picture":"http://placehold.it/32x32","age":38,"eyeColor":"brown","name":"Rice Mccullough","gender":"male","company":"MAGNINA","email":"[email protected]","phone":"+1 (915) 527-2484","address":"689 Hanover Place, Lumberton, Wisconsin, 4844","about":"Labore dolor nostrud anim veniam incididunt dolore fugiat proident. Sit fugiat incididunt culpa aute veniam. Sint labore non magna labore tempor amet. Id mollit ipsum eiusmod reprehenderit dolor consequat nostrud nisi.\r\n","registered":"2019-04-02T08:05:50 +07:00","latitude":34.247422,"longitude":-144.451109,"tags":["elit","adipisicing","voluptate","proident","consectetur","minim","duis"],"friends":[{"id":0,"name":"Abbott Compton"},{"id":1,"name":"Cathryn Wilkins"},{"id":2,"name":"Jeannette Robertson"}],"greeting":"Hello, Rice Mccullough! You have 7 unread messages.","favoriteFruit":"strawberry"},{"_id":"6414532c1896aa6394d16d18","index":2,"guid":"22596015-6d9b-447e-8fa4-9007e80b37bb","isActive":true,"balance":"$2,100.66","picture":"http://placehold.it/32x32","age":24,"eyeColor":"green","name":"Katharine Kaufman","gender":"female","company":"ELENTRIX","email":"[email protected]","phone":"+1 (846) 453-2614","address":"271 Grand Avenue, Tonopah, New Hampshire, 9234","about":"Eiusmod officia amet sit dolor elit id labore ea. Aliquip elit veniam reprehenderit aliquip mollit. Tempor non ipsum ad exercitation sit sunt consectetur amet pariatur. Nisi in non incididunt quis ex laboris. Veniam voluptate proident do commodo laborum elit velit Lorem enim aliquip mollit quis tempor veniam. Anim ea velit fugiat sit ea voluptate esse in officia nostrud consequat amet magna.\r\n","registered":"2020-12-21T05:29:02 +08:00","latitude":-54.93951,"longitude":151.859727,"tags":["proident","sint","Lorem","duis","sint","sit","excepteur"],"friends":[{"id":0,"name":"Corrine Chavez"},{"id":1,"name":"Diana Pate"},{"id":2,"name":"Madden Yang"}],"greeting":"Hello, Katharine Kaufman! You have 2 unread messages.","favoriteFruit":"strawberry"},{"_id":"6414532c404e9ebdabd81650","index":3,"guid":"b8d8605f-a72a-4d8f-954d-2339434d3c5d","isActive":false,"balance":"$2,420.51","picture":"http://placehold.it/32x32","age":29,"eyeColor":"green","name":"Mattie Craft","gender":"female","company":"CINASTER","email":"[email protected]","phone":"+1 (982) 481-3054","address":"525 Church Lane, Statenville, Michigan, 8845","about":"Et irure velit esse reprehenderit id. 
Dolore excepteur labore ullamco eu est nostrud et. Ipsum consectetur exercitation ut elit id irure ullamco irure irure voluptate laborum id enim est. Ipsum elit voluptate ullamco culpa dolor esse mollit sint velit. Cupidatat exercitation ipsum commodo enim labore anim officia ut incididunt ea velit amet ut Lorem. Fugiat id occaecat anim aute ex elit irure pariatur commodo dolore magna pariatur.\r\n","registered":"2014-05-10T04:15:02 +07:00","latitude":-46.535255,"longitude":140.547752,"tags":["esse","ullamco","mollit","quis","proident","aliqua","pariatur"],"friends":[{"id":0,"name":"Irene Delgado"},{"id":1,"name":"Christensen Craig"},{"id":2,"name":"Vilma Mccarthy"}],"greeting":"Hello, Mattie Craft! You have 6 unread messages.","favoriteFruit":"banana"},{"_id":"6414532c1a5cfc33a11dc118","index":4,"guid":"d7dfa4ee-e4be-43c4-bd5a-2993debbbebc","isActive":false,"balance":"$2,646.29","picture":"http://placehold.it/32x32","age":23,"eyeColor":"brown","name":"Lopez Davenport","gender":"male","company":"MEDIFAX","email":"[email protected]","phone":"+1 (885) 442-2655","address":"401 Eastern Parkway, Loretto, Iowa, 7360","about":"Nulla laboris laborum ipsum voluptate sint deserunt eu. Esse velit sit sunt qui. Proident anim do ea do. Elit qui sit cupidatat dolor quis fugiat dolor quis voluptate in. Velit esse irure est duis nulla incididunt do ut. Tempor qui culpa labore eiusmod voluptate quis elit in cillum ea nostrud.\r\n","registered":"2020-04-25T03:40:35 +07:00","latitude":58.560169,"longitude":58.62533,"tags":["voluptate","esse","est","cupidatat","officia","elit","deserunt"],"friends":[{"id":0,"name":"Blanca Wilkinson"},{"id":1,"name":"Aguirre Schmidt"},{"id":2,"name":"Mandy Norris"}],"greeting":"Hello, Lopez Davenport! You have 1 unread messages.","favoriteFruit":"strawberry"}]);
let agent = new http.Agent({
keepAlive:true,
maxSockets: 50,
});
const options = {
agent: agent,
hostname: 'localhost',
port: 6969,
method: 'POST',
headers: {
'Content-Type': 'application/json',
},
};
async function handleResponse(res) {
let body = [];
res.setEncoding('utf8');
res.on('data', (chunk) => {
body.push(chunk);
});
res.on('end', () => {
body = body.join("");
});
}
async function handleRequestError(error) {
console.error(`problem with request: ${error.message}`)
}
while (true) {
let load = load_arr[load_arr.length * Math.random() | 0];
for (let i = 0; i < load; i++) {
let req = http.request(options, handleResponse);
req.on('error', handleRequestError);
req.write(postData);
req.end();
}
let interval = interval_arr[interval_arr.length * Math.random() | 0];
await new Promise(r => setTimeout(r, interval * 1000));
} |
@nodejs/http can you take a look? |
To surface the problem more frequently, I tried to randomly close ~1% of the connections after the response was ended. I changed the response 'end' handler to do this (see the sketch below).
You can change the number to increase the percentage of connections closed by the server. The cool thing I saw in the dump was that these connections were idle until the FIN showed up. Which is why I'm emphasizing that there is some logic causing an outgoing ACK (not FIN, ACK) to an incoming FIN (on my Node.js client) to make the connection appear active, which further causes the LIFO scheduling (this happens with FIFO as well) to push that connection to the top of the pool instead of marking it as closed from the pool's perspective. |
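The exact snippet isn't preserved in this thread; the following is a minimal sketch of the kind of change described, based on the handleRequest function from the server code above and assuming a 1% close probability:

```js
async function handleRequest(req, res) {
  if (req.socket.writableEnded) { // client sent data after we sent FIN
    req.socket.resetAndDestroy();
    return;
  }
  req.socket.requestInFlight = true;
  const body = [];
  req.on('data', (chunk) => {
    body.push(chunk);
  }).on('end', () => {
    res.end(Buffer.concat(body));
    res.socket.requestInFlight = false;
    // In addition to the timed close, close ~1% of connections right after the response.
    if (res.socket.closeAfterResponse || Math.random() < 0.01) {
      res.socket.end(); // send FIN
    }
  });
}
```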
Haven't looked into this in depth but I think this is expected. There is a race condition in the http/1.1 spec for keep alive requests. This can be mostly avoided by applying the keep alive hints that the server provides. Undici does that and I would recommend you switch to that. |
I'm sorry, what do you mean by hints? Is the hint part of the spec or implemented by Node.js servers? I don't know about this - so if you could point me to the info regarding this, I would greatly appreciate it.
Can't - it's not that simple. Our REST API interactions are all generated code based on things like OpenAPI and protobuf schemas. A quick Google search does not find me a generator that uses Undici. If we had to switch, we'd probably choose another language at that point. This is because my colleagues have spent quite a bit of time on this and we don't believe it's the race mentioned in the spec regarding keep-alive. (We haven't faced this problem with the Rust and Python implementations of core HTTP libs.) Based on the TCP dump timelines and the ordering of which connection gets the next request, it's pretty clear that Node.js has the time to apply the LIFO algorithm to place the socket that just sent the ACK to the incoming FIN at the top of the pool. Which also means you have the same amount of time to take that socket out of the connection pool. I believe this is fixable. If you can show me where the code that handles the connection pool management is, I can see if I can submit a PR. |
MDN calls it parameters. See the timeout parameter. https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Keep-Alive |
The client should not make any request on a connection that has been idle for more than the timeout duration (minus some threshold). |
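As an illustration of that rule, a client-side sketch (the helper names and the one-second safety margin are made up here; it assumes the server sends a Keep-Alive: timeout=N header as described on the MDN page above):

```js
const SAFETY_MARGIN_MS = 1000;

// Record the server's advertised idle timeout when a response arrives.
function noteKeepAliveHint(res) {
  const header = res.headers['keep-alive']; // e.g. "timeout=5, max=1000"
  const match = header && /timeout=(\d+)/.exec(header);
  if (match) {
    res.socket.idleDeadline = Date.now() + Number(match[1]) * 1000 - SAFETY_MARGIN_MS;
  }
}

// Before reusing a pooled socket, refuse any socket whose idle deadline has passed.
function socketStillUsable(socket) {
  return socket.idleDeadline === undefined || Date.now() < socket.idleDeadline;
}
```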
@ronag , timeout is not the problem here. Neither is the concept of keep alive. Please have a deep look into my issue. Thanks :) |
@mcollina I think this is something we've resolved in undici. This kind of goes into http core client maintenance issue... |
@DevasiaThomas I believe this is just the known issue as mentioned by @ronag and @bnoordhuis. For your curiosity, here is why your server code sends an HTTP response indicating keep-alive followed by a TCP FIN: async function handleRequest(req, res) {
// ...
res.socket.requestInFlight = false;
if(res.socket.closeAfterResponse) {
//Send FIN
//console.log('Close after response' + res.socket.remotePort);
res.socket.end();
}
// ...
The Node.js client does not take the TCP FIN as a signal for whether to reuse the connection, but it does take the HTTP response's keep-alive indication into account. |
FYI, the TCP FIN, in Node.js terms, is the 'end' event of a socket. I don't see any code in the http agent that makes use of this event to decide whether a socket can be reused. |
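For illustration only - this is not code from lib/_http_agent.js - a pool that wanted to use that event could evict a free socket as soon as the peer's FIN arrives:

```js
// Hypothetical free-socket pool helper: drop a socket from the free list when the remote
// side half-closes it ('end') or it fully closes ('close'), so it is never handed out again.
function trackFreeSocket(freeSockets, socket) {
  freeSockets.push(socket);
  const evict = () => {
    const i = freeSockets.indexOf(socket);
    if (i !== -1) freeSockets.splice(i, 1);
  };
  socket.once('end', evict);   // peer sent FIN
  socket.once('close', evict); // socket fully closed
}
```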
@ywave620
|
This is the default behavior of a Node.js client/server, i.e. Node.js automatically closes any half-open connection. This behavior can be overridden by setting
This is worth digging into further. Did you spot the Node.js client sending a FIN later in this weird scenario? |
Also note that |
Nope, since the client sends the next request right away, the server sends a reset. The client has no opportunity to send a FIN. |
@ywave620 I just checked. EDIT: I don't know if that applies only to the listening socket or to all incoming sockets for the server. |
|
I think your best bet is to either disable keep alive or switch to undici. |
You've mentioned this already. Neither of which I can do. However, I'm fairly certain there's a bug here - please let it get fixed. |
Would you like to send a pull request to resolve this issue? Investigating this problem would likely require multiple days of work, and I'm not convinced either @ronag, myself or other members of @nodejs/http have the time to investigate. |
@mcollina Sure I can try :) |
Just an observation from reading lib/_http_agent.js, but I don't think that's possible. Sockets are removed from the FIFO/LIFO queue, but sockets in the queue are never reordered. What I think can happen is this: destroyed sockets aren't immediately removed from the queue; they're simply discarded (in [FL]IFO order) when the agent looks for a live socket. Assume for simplicity that all sockets expire at the same time, e.g., because the network went down. Node gets notified by the operating system on a per-socket basis. Crucially, node may not get those notifications all at the same time. For the next request, the agent discards the ones it knows are dead. It picks the first one it believes is alive - except it isn't, we just haven't received the notification yet. It's basically the race condition I mentioned earlier. |
@bnoordhuis Thank you for looking into it :). I just reached that part of the code. Looking for your thoughts on the below:
|
Why was this closed without any comment? |
As has been explained, this is a race condition in the HTTP/1.1 spec and there is nothing further we can do here. |
@ronag then how are we supposed to use nodejs/undici when it randomly closes connections from time to time? |
@ronag given the conversation, I've added the "known limitation" label, as there's nothing Node.js can do to fix it, but it's known, correct? Feel free to adjust if needed. |
This issue is about the node core http client. Not undici or fetch. Though they will encounter the same issue. The problem is that the server (unexpectedly for the client but correctly) closes the keep alive connection. Please read through the whole issue if you want to have a better understanding. |
What changed? Node 18.13.0 did not have this issue at all for me. Upgrading to Node 20.12.2 has created this issue. So what's different? Can I get my Node 20.12.2 service to run with the same parameters as Node 18.13.0 so this doesn't happen? I've read through this issue, and unless I've missed it, that topic has not been touched on? No fancy config here: |
The difference is that HTTP keep-alives are turned on by default for the HTTP(S) agents in Node 20. So to go back to the old behavior you can simply turn keep-alives off explicitly. |
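A sketch of that opt-out (assuming Node 19+/20, where the global HTTP(S) agents default to keepAlive: true; example.com is a placeholder):

```js
import http from 'node:http';
import https from 'node:https';

// Option 1: pass a non-keep-alive agent for specific requests.
const noKeepAlive = new http.Agent({ keepAlive: false });
http.get({ hostname: 'example.com', agent: noKeepAlive }, (res) => res.resume());

// Option 2: replace the global agents so every request goes back to the pre-Node-20 behavior.
http.globalAgent = new http.Agent({ keepAlive: false });
https.globalAgent = new https.Agent({ keepAlive: false });
```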
Thanks @h0od I'll give this a go. |
@ronag I'm not looking to criticise the closing of this ticket (I understand it's a gnarly one and time could be spent better elsewhere - e.g. undici), but I just wanted to enquire a little more about the background to this:
Is this effectively saying it's too hard to improve the behaviour, given the current architecture of Node? I rarely see "ECONNRESET" or equivalent errors in other http clients, but I see it a lot in Node - I don't think its behaviour fully aligns with the SHOULD of the http 1.1 spec (copied below) - e.g. around handling transport closures and retries of idempotent requests. Moreover, there are additional issues inside Node itself which cause this issue to be hit much more readily than it seems to be elsewhere; and the change to default to keep-alive in the http client has meant lots of people are hitting this issue who didn't previously:
As a mitigation - you mention in an earlier comment:
Undici does seem pretty great, and I can see that is where the majority of work is being spent - do we expect undici to move inside Node and replace the http client? It looks like Undici has a retry handler which could partly solve this issue, but I imagine it probably still inherits some of the underlying awkwardness from the Node socket layer. I imagine it performs better though, but I haven't tested it against pathological test cases to see whether it assigns an alternative socket if the first attempted socket is detected to be already closed at/before use. But if we expect undici to replace the node http client, then I can see why it's worth us focusing the attention there.
HTTP Spec Reflections
Section 8.1.4 of the HTTP/1.1 spec, titled "Practical Considerations", states:
I don't think Node does a great job of responding appropriately to the transport close / FIN from the socket. In many cases, the first awareness that we're re-using a closed socket is when we first attempt to read/write to it on a next request, which propagates all the way up to the user. I know things are currently very asynchronous, but if we could somehow check the status of the socket before attempting to reuse it -- or detect on failure that we haven't actually sent any of a request down the socket yet -- then we could use a different socket for the request, which would avoid a lot of false-positives, and I think solve a lot of the reports in this thread.
Which I read as roughly saying a client should retry GET requests if the underlying socket has issues. This is one of three recommendations I previously put forward. |
Has nothing to do with Node. It's a general problem.
Yea, undici does this for idempotent requests. Not something we will improve with the core client. PR welcome if someone wants to do it.
Yes that's a possible improvement but I'm sceptical of its practical significance.
That's at least my long term hope. It's difficult to remove the existing client without breaking the ecosystem.
It will retry automatically for idempotent requests. Not for e.g. POST.
Possibly something to improve but again I'm sceptical of its practical significance.
Possibly something to improve but again I'm sceptical of its practical significance.
Undici does this.
Undici does this.
Undici does this. |
@ronag many thanks for the detailed reply 👍. It sounds like undici is better architected and, in general, we (as users in the node ecosystem) should be transitioning across to it over the inbuilt http client. Just regarding the "sceptical of its practical significance" bits - I have a 100% repro of this issue using the node http client to send two consecutive awaited POST requests with
I'm not suggesting we should fix it in the node client though - I guess the next step would be to try undici and see if I get the same issue - if so, we can raise a ticket there to see if it can seamlessly handle socket-was-closed-before-anything-was-sent, even for e.g. POST requests. |
We can reduce the time span where the data race can occur, but we can't solve the root problem. IMHO it's better to properly retry so it works in the general case than micro-optimizing to reduce the likelihood by a little bit (though ofc it doesn't hurt). Regarding this issue, what we maybe could consider is to disable keep-alive by default again and refer to undici for high performance use cases. @mcollina wdyt? |
Doesn't look like this worked for us. We created our own agents with
It looks like a lot of the recent talk is regarding |
Well, unfortunately it seems using
We'll now be downgrading. |
Is it possible to just use |
HTTP keepalive was enabled by default in Node 20, and is apparently causing errors for axios. See e.g. nodejs/node#47130, axios/axios#6113
Completely solved the issue for my use case by installing undici and using it instead of native fetch:
|
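The commenter's exact snippet isn't shown above; a minimal sketch of that kind of swap (URL and payload are placeholders) could look like:

```js
import { fetch } from 'undici';

const res = await fetch('http://localhost:6969/', {
  method: 'POST',
  headers: { 'content-type': 'application/json' },
  body: JSON.stringify({ hello: 'world' }),
});
console.log(res.status, await res.text());
```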
It seems obvious that keep-alive absolutely has to be set to false by default in the default http agent. Automatic keep-alive should not be the default behavior; it is an optimization (reusing sockets instead of creating new ones). This is something that someone would enable when they need/want it. Any project/company that upgrades from |
I think the default mode should be some kind of "auto mode" where, if the server provides a keep-alive hint in the response, then we re-use connections ("random" failures should be basically 0), and otherwise no re-use by default. @mcollina wdyt? I think we could/should do something similar for undici. |
I can attest to that! We've been migrating services to node.js v20.x and we've been scratching our heads over the number of connection errors we've been seeing since. |
Version
v18.15.0
Platform
Running this in a Kubernetes cluster inside a nodejs:lts container. Details for the K8s node:
"nodeInfo": { "architecture": "amd64", "bootID": "a5847523-d870-4bc0-9f1a-7a4ed9885ca3", "containerRuntimeVersion": "docker://20.10.12", "kernelVersion": "3.10.0-1160.53.1.el7.x86_64", "kubeProxyVersion": "v1.21.12", "kubeletVersion": "v1.21.12", "machineID": "", "operatingSystem": "linux", "osImage": "Red Hat Enterprise Linux", "systemUUID": "290B2642-439A-2099-6F2B-7ED84D7C284B" }
Subsystem
http or net
What steps will reproduce the bug?
Run non-stop HTTP/1.1 requests over a keepAlive connection pool using
node-fetch
or axios
against any server that closes the connection from its side after 'N' seconds. In my case the server did a 20 second close. There might be multiple requests to the server over the connection, but if 20 seconds have elapsed it closes the connection after the last HTTP response is sent. I have a default Node.js client configuration (I haven't assigned it more threads or anything). When configuring the custom http agent on the client side, I supplied {keepAlive: true, maxSockets: 50}.
How often does it reproduce? Is there a required condition?
When there are a lot of requests being sent constantly things are fine, but if there is a slowdown (not much to do, hence fewer requests go out), the next request usually ends up getting an ECONNRESET.
Based on my TCP dump, when there's a good load of requests over the connection pool and the server sends a
[FIN, ACK], the client sends a [FIN, ACK], the server sends an ACK back, and the connection closes successfully.
But when there is a "lull" later and there aren't enough requests over the pool, the server sends a [FIN, ACK] for an unused connection in the pool, the Node.js client responds with an [ACK], and the next request in the queue goes on this socket, causing the server to respond with a RESET. (Rightly so - since the server wanted to close the connection.)
Now I believe the reason the next request goes on the socket that just got the FIN probably has to do with the connection-choosing strategy. I think the default in both these frameworks is LIFO, and the ACK (without the FIN) that gets sent makes the connection go to the top of the pool for the next request.
What is the expected behavior? Why is that the expected behavior?
A socket closed from the server side (FIN, ACK sent by the server) must be removed from the connection pool instead of being kept in there - regardless of the fact that a FIN wasn't sent back. And no future requests should go on it.
What do you see instead?
The connection stays in the pool if the FIN wasn't sent back. The next request goes on the connection. The server forcibly closes the connection with a RESET.
Additional information
I tried a few other frameworks apart from node-fetch and axios (same issue, making me think it's a node core problem) - but I can't use them in my code, so I'm not mentioning them.
When I reduced maxSockets from 50 to 20, the issue happened less frequently, which is why I think it is related to activity on those sockets. I switched to the keepaliveagent package, which has a SocketTTL feature - it helps, but doesn't solve the problem. Resets still happen (same issue). It seems like this issue was reported there and they tried to handle it there (still a problem though). I'm assuming this issue has the same problem I'm facing - they were using keepaliveagent as well.
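Assuming the package referred to is the agentkeepalive module on npm (whose freeSocketTimeout and socketActiveTTL options implement the TTL behavior mentioned), the mitigation might look roughly like the sketch below; it narrows the window but, as the thread concludes, does not eliminate the race:

```js
import Agent from 'agentkeepalive';

// Keep client-side socket lifetimes comfortably below the server's 20-second close timeout.
const agent = new Agent({
  maxSockets: 50,
  freeSocketTimeout: 15000, // stop reusing sockets that have been idle this long
  socketActiveTTL: 15000,   // hard cap on how long any socket is reused
});
```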