ECONNRESET while doing http 1.1 keep alive requests and server closes the connections #47130
Comments
I've read your description of the problem but I don't think it's actionable without a way to reproduce that doesn't depend on third-party modules like axios. Can you add a test case? |
@bnoordhuis I'll see if I can come up with something. Since this happened across multiple frameworks, this should be reproducible using just the core stuff. Gimme this weekend :) |
Node handles ECONNRESET when the socket is sitting in the socket pool, but there's a fundamental (and inevitable) race condition when it happens just as the socket is being pulled from the pool. Node could theoretically use another socket if the current socket fails on first write, but without a circuit breaker or other fail-safes that creates a DoS-like scenario when the server immediately closes incoming connections: Node would basically busy-loop trying to send the request. |
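For illustration, a minimal sketch of the kind of bounded, application-level retry being alluded to here (this is not Node core behavior; the retry cap is the fail-safe that prevents the busy-loop, and blindly retrying is only safe for idempotent requests or callers that can tolerate replays):

```js
import http from 'node:http';

// Retry a request a bounded number of times when a pooled keep-alive socket turns out to be
// dead (connection reset before any response arrived). The bound avoids busy-looping when the
// server immediately closes every incoming connection.
function requestWithRetry(options, body, retriesLeft = 2) {
  return new Promise((resolve, reject) => {
    const req = http.request(options, (res) => {
      const chunks = [];
      res.on('data', (chunk) => chunks.push(chunk));
      res.on('end', () => resolve(Buffer.concat(chunks)));
    });
    req.on('error', (err) => {
      if ((err.code === 'ECONNRESET' || err.code === 'EPIPE') && retriesLeft > 0) {
        resolve(requestWithRetry(options, body, retriesLeft - 1));
      } else {
        reject(err);
      }
    });
    req.end(body);
  });
}
```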
I understand the inevitable race that can happen and I agree you have to retry in such a scenario, and it's best that the application takes care of that. But based on my TCP dump, the server sends a FIN which causes the client to send an ACK (the socket on the client transitions to CLOSE_WAIT). Whatever happens after the ACK is sent causes the socket to get pushed to the top of the connection pool (which is why the very next request goes on it). So there is time for the LIFO algorithm or underlying code to push the connection to the top of the pool - the same code could remove it from the pool instead of pushing it to the top when the ACK is for a FIN. This is what I wanted to highlight. I am working on a reproducible example. I hope I am successful with it. |
@bnoordhuis So, I don't have guaranteed replicability - but it happens over time (in 5 of the runs I got it within the first 5 minutes or so; for 2 other runs I had to leave it overnight). I'm running them on different physical boxes. I couldn't get
Server code:
import http from 'node:http';
//Time, in seconds, any socket can be alive for. If there is a request already being processed close it as soon as response is done
let maxSocketLifetime = 20;
if (maxSocketLifetime === 0) {
process.exit(1);
} else {
maxSocketLifetime = maxSocketLifetime * 1000;
}
/**
*
* @param socket {node:net.Socket}
* @return {Promise<void>}
*/
async function initiateClose(socket) {
//console.log('Initiate Close' + socket.remotePort);
if (socket.writableEnded || socket.destroyed || socket.closed) {
return;
}
if(socket.requestInFlight) {
socket.closeAfterResponse = true;
return;
}
//Send FIN
//console.log('Timed Close' + socket.remotePort);
socket.end();
}
async function handleClose(hasError) {
console.log('Socket Closed' + this.remotePort);
//when it was closed by client
clearTimeout(this.intervalz);
}
/**
*
* @param socket {node:net.Socket}
* @return {Promise<void>}
*/
async function handleConnection(socket) {
console.log('New Connection' + socket.remotePort);
socket.requestInFlight = false;
socket.closeAfterResponse = false;
socket.intervalz = setTimeout(initiateClose, maxSocketLifetime - 100, socket);
socket.on('close', handleClose);
}
/**
*
* @param req {http.IncomingMessage}
* @param res {http.ServerResponse}
* @return {Promise<void>}
*/
async function handleRequest(req, res) {
if (req.socket.writableEnded) { // If client sends data after we sent FIN
console.log('Reset');
req.socket.resetAndDestroy();
return; // don't treat this request as in-flight on a socket we just reset
}
req.socket.requestInFlight = true;
let body = [];
req.on('data', (chunk) => {
body.push(chunk);
}).on('end', () => {
body = Buffer.concat(body).toString();
res.end(body);
res.socket.requestInFlight = false;
if(res.socket.closeAfterResponse) {
//Send FIN
//console.log('Close after response' + res.socket.remotePort);
res.socket.end();
}
});
}
let serverOpts = {
keepAliveTimeout: maxSocketLifetime,
connectionsCheckingInterval: Math.floor(maxSocketLifetime/3)
}
let server = http.createServer(serverOpts);
server.on('connection', handleConnection);
server.on('request', handleRequest);
server.listen(6969);
console.log('Server Started');
Client code:
import http from "node:http";
let load_arr = [5000, 4000, 3000, 1500, 1000, 600, 400, 300, 200, 100, 50, 20 , 10 , 1];
let interval_arr = [1, 3, 5, 8, 9, 10, 13, 15, 17];
const postData = JSON.stringify([{"_id":"6414532c438f86245e1a9fae","index":0,"guid":"a7bf9526-1972-418d-b251-f865a02131ae","isActive":false,"balance":"$2,510.55","picture":"http://placehold.it/32x32","age":31,"eyeColor":"green","name":"Pittman Green","gender":"male","company":"EVENTEX","email":"[email protected]","phone":"+1 (812) 487-3784","address":"557 Johnson Avenue, Dotsero, South Dakota, 682","about":"Esse voluptate proident duis sunt excepteur eiusmod proident dolore. Ullamco nulla voluptate nostrud ullamco officia quis proident id pariatur Lorem quis culpa. Non ipsum voluptate exercitation reprehenderit ex duis nisi exercitation.\r\n","registered":"2014-05-19T06:47:19 +07:00","latitude":55.881952,"longitude":176.809812,"tags":["dolor","magna","aute","voluptate","ex","eiusmod","nisi"],"friends":[{"id":0,"name":"Tammi Odom"},{"id":1,"name":"Bush Ingram"},{"id":2,"name":"Pennington Sharp"}],"greeting":"Hello, Pittman Green! You have 1 unread messages.","favoriteFruit":"strawberry"},{"_id":"6414532ce611be5510ba2c21","index":1,"guid":"6350b383-bee6-48a1-b19e-00cc20d74db7","isActive":true,"balance":"$2,193.74","picture":"http://placehold.it/32x32","age":38,"eyeColor":"brown","name":"Rice Mccullough","gender":"male","company":"MAGNINA","email":"[email protected]","phone":"+1 (915) 527-2484","address":"689 Hanover Place, Lumberton, Wisconsin, 4844","about":"Labore dolor nostrud anim veniam incididunt dolore fugiat proident. Sit fugiat incididunt culpa aute veniam. Sint labore non magna labore tempor amet. Id mollit ipsum eiusmod reprehenderit dolor consequat nostrud nisi.\r\n","registered":"2019-04-02T08:05:50 +07:00","latitude":34.247422,"longitude":-144.451109,"tags":["elit","adipisicing","voluptate","proident","consectetur","minim","duis"],"friends":[{"id":0,"name":"Abbott Compton"},{"id":1,"name":"Cathryn Wilkins"},{"id":2,"name":"Jeannette Robertson"}],"greeting":"Hello, Rice Mccullough! You have 7 unread messages.","favoriteFruit":"strawberry"},{"_id":"6414532c1896aa6394d16d18","index":2,"guid":"22596015-6d9b-447e-8fa4-9007e80b37bb","isActive":true,"balance":"$2,100.66","picture":"http://placehold.it/32x32","age":24,"eyeColor":"green","name":"Katharine Kaufman","gender":"female","company":"ELENTRIX","email":"[email protected]","phone":"+1 (846) 453-2614","address":"271 Grand Avenue, Tonopah, New Hampshire, 9234","about":"Eiusmod officia amet sit dolor elit id labore ea. Aliquip elit veniam reprehenderit aliquip mollit. Tempor non ipsum ad exercitation sit sunt consectetur amet pariatur. Nisi in non incididunt quis ex laboris. Veniam voluptate proident do commodo laborum elit velit Lorem enim aliquip mollit quis tempor veniam. Anim ea velit fugiat sit ea voluptate esse in officia nostrud consequat amet magna.\r\n","registered":"2020-12-21T05:29:02 +08:00","latitude":-54.93951,"longitude":151.859727,"tags":["proident","sint","Lorem","duis","sint","sit","excepteur"],"friends":[{"id":0,"name":"Corrine Chavez"},{"id":1,"name":"Diana Pate"},{"id":2,"name":"Madden Yang"}],"greeting":"Hello, Katharine Kaufman! You have 2 unread messages.","favoriteFruit":"strawberry"},{"_id":"6414532c404e9ebdabd81650","index":3,"guid":"b8d8605f-a72a-4d8f-954d-2339434d3c5d","isActive":false,"balance":"$2,420.51","picture":"http://placehold.it/32x32","age":29,"eyeColor":"green","name":"Mattie Craft","gender":"female","company":"CINASTER","email":"[email protected]","phone":"+1 (982) 481-3054","address":"525 Church Lane, Statenville, Michigan, 8845","about":"Et irure velit esse reprehenderit id. 
Dolore excepteur labore ullamco eu est nostrud et. Ipsum consectetur exercitation ut elit id irure ullamco irure irure voluptate laborum id enim est. Ipsum elit voluptate ullamco culpa dolor esse mollit sint velit. Cupidatat exercitation ipsum commodo enim labore anim officia ut incididunt ea velit amet ut Lorem. Fugiat id occaecat anim aute ex elit irure pariatur commodo dolore magna pariatur.\r\n","registered":"2014-05-10T04:15:02 +07:00","latitude":-46.535255,"longitude":140.547752,"tags":["esse","ullamco","mollit","quis","proident","aliqua","pariatur"],"friends":[{"id":0,"name":"Irene Delgado"},{"id":1,"name":"Christensen Craig"},{"id":2,"name":"Vilma Mccarthy"}],"greeting":"Hello, Mattie Craft! You have 6 unread messages.","favoriteFruit":"banana"},{"_id":"6414532c1a5cfc33a11dc118","index":4,"guid":"d7dfa4ee-e4be-43c4-bd5a-2993debbbebc","isActive":false,"balance":"$2,646.29","picture":"http://placehold.it/32x32","age":23,"eyeColor":"brown","name":"Lopez Davenport","gender":"male","company":"MEDIFAX","email":"[email protected]","phone":"+1 (885) 442-2655","address":"401 Eastern Parkway, Loretto, Iowa, 7360","about":"Nulla laboris laborum ipsum voluptate sint deserunt eu. Esse velit sit sunt qui. Proident anim do ea do. Elit qui sit cupidatat dolor quis fugiat dolor quis voluptate in. Velit esse irure est duis nulla incididunt do ut. Tempor qui culpa labore eiusmod voluptate quis elit in cillum ea nostrud.\r\n","registered":"2020-04-25T03:40:35 +07:00","latitude":58.560169,"longitude":58.62533,"tags":["voluptate","esse","est","cupidatat","officia","elit","deserunt"],"friends":[{"id":0,"name":"Blanca Wilkinson"},{"id":1,"name":"Aguirre Schmidt"},{"id":2,"name":"Mandy Norris"}],"greeting":"Hello, Lopez Davenport! You have 1 unread messages.","favoriteFruit":"strawberry"}]);
let agent = new http.Agent({
keepAlive:true,
maxSockets: 50,
});
const options = {
agent: agent,
hostname: 'localhost',
port: 6969,
method: 'POST',
headers: {
'Content-Type': 'application/json',
},
};
async function handleResponse(res) {
let body = [];
res.setEncoding('utf8');
res.on('data', (chunk) => {
body.push(chunk);
});
res.on('end', () => {
body = body.join("");
});
}
async function handleRequestError(error) {
console.error(`problem with request: ${error.message}`)
}
while (true) {
let load = load_arr[load_arr.length * Math.random() | 0];
for (let i = 0; i < load; i++) {
let req = http.request(options, handleResponse);
req.on('error', handleRequestError);
req.write(postData);
req.end();
}
let interval = interval_arr[interval_arr.length * Math.random() | 0];
await new Promise(r => setTimeout(r, interval * 1000));
} |
@nodejs/http can you take a look? |
To surface the problem more frequently, I tried to randomly close ~1% of the connections after the response was ended. I changed the response 'end' handler to do this (see the sketch below).
You can change the number to increase the percentage of connections closed by the server. The cool thing I saw in the dump was that these connections were idle until the FIN showed up. Which is why I'm emphasizing that there is some logic causing an outgoing ACK (not FIN, ACK) to an incoming FIN (on my Node.js client) to make the connection appear active, which further causes the LIFO scheduling (this happens with FIFO as well) to push that connection to the top of the pool instead of marking it as closed from the pool's perspective. |
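The exact snippet isn't preserved in this thread; the following is a minimal sketch of the kind of change described, based on the handleRequest function from the server code above and assuming a 1% close probability:

```js
async function handleRequest(req, res) {
  if (req.socket.writableEnded) { // client sent data after we sent FIN
    req.socket.resetAndDestroy();
    return;
  }
  req.socket.requestInFlight = true;
  const body = [];
  req.on('data', (chunk) => {
    body.push(chunk);
  }).on('end', () => {
    res.end(Buffer.concat(body));
    res.socket.requestInFlight = false;
    // In addition to the timed close, close ~1% of connections right after the response.
    if (res.socket.closeAfterResponse || Math.random() < 0.01) {
      res.socket.end(); // send FIN
    }
  });
}
```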
Haven't looked into this in depth but I think this is expected. There is a race condition in the http/1.1 spec for keep alive requests. This can be mostly avoided by applying the keep alive hints that the server provides. Undici does that and I would recommend you switch to that. |
I'm sorry, what do you mean by hints? Is the hint part of the spec or implemented by Node.js servers? I don't know about this - so if you could point me to the info regarding this, I would greatly appreciate it.
Can't - it's not that simple. Our REST API interactions are all generated code based on things like OpenAPI and protobuf schemas. A quick Google search does not find me a generator that uses Undici. If we had to switch, we'd probably choose another language at that point. This is because my colleagues have spent quite a bit of time on this and we don't believe it's the race mentioned in the spec regarding keep-alive. (We haven't faced this problem with the Rust and Python implementations of core HTTP libs.) Based on the TCP dump timelines and the ordering of which connection gets the next request, it's pretty clear that Node.js has the time to apply the LIFO algorithm to place the socket that just sent the ACK to the incoming FIN at the top of the pool. Which also means you have the same amount of time to take that socket out of the connection pool. I believe this is fixable. If you can show me where the code that handles the connection pool management is, I can see if I can submit a PR. |
MDN calls it parameters. See the timeout parameter. https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Keep-Alive |
The client should not make any request on a connection that has been idle for more than the timeout duration (minus some threshold). |
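As an illustration of that rule, a client-side sketch (the helper names and the one-second safety margin are made up here; it assumes the server sends a Keep-Alive: timeout=N header as described on the MDN page above):

```js
const SAFETY_MARGIN_MS = 1000;

// Record the server's advertised idle timeout when a response arrives.
function noteKeepAliveHint(res) {
  const header = res.headers['keep-alive']; // e.g. "timeout=5, max=1000"
  const match = header && /timeout=(\d+)/.exec(header);
  if (match) {
    res.socket.idleDeadline = Date.now() + Number(match[1]) * 1000 - SAFETY_MARGIN_MS;
  }
}

// Before reusing a pooled socket, refuse any socket whose idle deadline has passed.
function socketStillUsable(socket) {
  return socket.idleDeadline === undefined || Date.now() < socket.idleDeadline;
}
```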
@ronag , timeout is not the problem here. Neither is the concept of keep alive. Please have a deep look into my issue. Thanks :) |
@mcollina I think this is something we've resolved in undici. This kind of goes into http core client maintenance issue... |
@DevasiaThomas I believe this is just the known issue as mentioned by @ronag and @bnoordhuis. For your curiosity, here is why your server code sends an HTTP response indicating keep-alive followed by a TCP FIN: async function handleRequest(req, res) {
// ...
res.socket.requestInFlight = false;
if(res.socket.closeAfterResponse) {
//Send FIN
//console.log('Close after response' + res.socket.remotePort);
res.socket.end();
}
// ...
The Node.js client does not take the TCP FIN as a signal for whether to reuse the connection, but it does take the HTTP response's keep-alive indication into account. |
FYI, the TCP FIN, in Node.js terms, is the 'end' event of a socket. I don't see any code in the http agent that makes use of this event to decide whether a socket can be reused. |
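For illustration only - this is not code from lib/_http_agent.js - a pool that wanted to use that event could evict a free socket as soon as the peer's FIN arrives:

```js
// Hypothetical free-socket pool helper: drop a socket from the free list when the remote
// side half-closes it ('end') or it fully closes ('close'), so it is never handed out again.
function trackFreeSocket(freeSockets, socket) {
  freeSockets.push(socket);
  const evict = () => {
    const i = freeSockets.indexOf(socket);
    if (i !== -1) freeSockets.splice(i, 1);
  };
  socket.once('end', evict);   // peer sent FIN
  socket.once('close', evict); // socket fully closed
}
```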
@ywave620
|
This is the default behavior of a Node.js client/server, i.e. Node.js automatically closes any half-open connection. This behavior can be overridden by setting
This is worth digging into further. Did you spot the Node.js client sending a FIN later in this weird scenario? |
Also note that |
Nope, since the client sends the next request right away, the server sends a reset. The client has no opportunity to send a FIN. |
@ywave620 I just checked. EDIT: I don't know if that applies only to the listening socket or to all incoming sockets for the server. |
|
I think your best bet is to either disable keep alive or switch to undici. |
You've mentioned this already. Neither of which I can do. However, I'm fairly certain there's a bug here - please let it get fixed. |
Would you like to send a pull request to resolve this issue? Investigating this problem would likely require multiple days of work, and I'm not convinced either @ronag, myself or other members of @nodejs/http have the time to investigate. |
@mcollina Sure I can try :) |
Just an observation from reading lib/_http_agent.js, but I don't think that's possible. Sockets are removed from the FIFO/LIFO queue, but sockets in the queue are never reordered. What I think can happen is this: destroyed sockets aren't immediately removed from the queue; they're simply discarded (in [FL]IFO order) when the agent looks for a live socket. Assume for simplicity that all sockets expire at the same time, e.g., because the network went down. Node gets notified by the operating system on a per-socket basis. Crucially, node may not get those notifications all at the same time. For the next request, the agent discards the ones it knows are dead. It picks the first one it believes is alive - except it isn't, we just haven't received the notification yet. It's basically the race condition I mentioned earlier. |
@bnoordhuis Thank you for looking into it :). I just reached that part of the code. Looking for your thoughts on the below:
|
Why was this closed without any comment? |
As has been explained, this is a race condition in the HTTP/1.1 spec and there is nothing further we can do here. |
@ronag then how are we supposed to use nodejs/undici when it randomly closes connections from time to time? |
@ronag given the conversation, I've added the "known limitation" label, as there's nothing Node.js can do to fix it, but it's known, correct? Feel free to adjust if needed. |
This issue is about the node core http client. Not undici or fetch. Though they will encounter the same issue. The problem is that the server (unexpectedly for the client but correctly) closes the keep alive connection. Please read through the whole issue if you want to have a better understanding. |
What changed? Node 18.13.0 did not have this issue at all for me. Upgrading to Node 20.12.2 has created this issue. So what's different? Can I get my Node 20.12.2 service to run with the same parameters as Node 18.13.0 so this doesn't happen? I've read through this issue, and unless I've missed it, that topic has not been touched on? No fancy config here: |
The difference is that HTTP keep-alives are turned on by default for the HTTP(S) agents in Node 20. So to go back to the old behavior you can simply turn keep-alives off explicitly. |
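A sketch of that opt-out (assuming Node 19+/20, where the global HTTP(S) agents default to keepAlive: true; example.com is a placeholder):

```js
import http from 'node:http';
import https from 'node:https';

// Option 1: pass a non-keep-alive agent for specific requests.
const noKeepAlive = new http.Agent({ keepAlive: false });
http.get({ hostname: 'example.com', agent: noKeepAlive }, (res) => res.resume());

// Option 2: replace the global agents so every request goes back to the pre-Node-20 behavior.
http.globalAgent = new http.Agent({ keepAlive: false });
https.globalAgent = new https.Agent({ keepAlive: false });
```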
Thanks @h0od I'll give this a go. |
@ronag I'm not looking to criticise the closing of this ticket (I understand it's a gnarly one and time could be spent better elsewhere - e.g. undici), but I just wanted to enquire a little more about the background to this:
Is this effectively saying it's too hard to improve the behaviour, given the current architecture of Node? I rarely see "ECONNRESET" or equivalent errors in other http clients, but I see it a lot in Node - I don't think its behaviour fully aligns with the SHOULD of the http 1.1 spec (copied below) - e.g. around handling transport closures and retries of idempotent requests. Moreover, there are additional issues inside Node itself which cause this issue to be hit much more readily than it seems to be elsewhere; and the change to default to keep-alive in the http client has meant lots of people are hitting this issue who didn't previously:
As a mitigation - you mention in an earlier comment:
Undici does seem pretty great, and I can see that is where the majority of work is being spent - do we expect undici to move inside Node and replace the http client? It looks like Undici has a retry handler which could partly solve this issue, but I imagine it probably still inherits some of the underlying awkwardness from the Node socket layer. I imagine it performs better though, but I haven't tested it against pathological test cases to see whether it assigns an alternative socket if the first attempted socket is detected to be already closed at/before use. But if we expect undici to replace the node http client, then I can see why it's worth us focusing the attention there.
HTTP Spec Reflections
Section 8.1.4 of the HTTP/1.1 spec, titled "Practical Considerations", states:
I don't think Node does a great job of responding appropriately to the transport close / FIN from the socket. In many cases, the first awareness that we're re-using a closed socket is when we first attempt to read/write to it on a next request, which propagates all the way up to the user. I know things are currently very asynchronous, but if we could somehow check the status of the socket before attempting to reuse it -- or detect on failure that we haven't actually sent any of a request down the socket yet -- then we could use a different socket for the request, which would avoid a lot of false-positives, and I think solve a lot of the reports in this thread.
Which I read as roughly saying a client should retry GET requests if the underlying socket has issues. This is one of three recommendations I previously put forward. |
Has nothing to do with Node. It's a general problem.
Yea, undici does this for idempotent requests. Not something we will improve with the core client. PR welcome if someone wants to do it.
Yes that's a possible improvement but I'm sceptical of its practical significance.
That's at least my long term hope. It's difficult to remove the existing client without breaking the ecosystem.
It will retry automatically for idempotent requests. Not for e.g. POST.
Possibly something to improve but again I'm sceptical of its practical significance.
Possibly something to improve but again I'm sceptical of its practical significance.
Undici does this.
Undici does this.
Undici does this. |
@ronag many thanks for the detailed reply 👍. It sounds like undici is better architected and, in general, we (as users in the node ecosystem) should be transitioning across to it over the inbuilt http client. Just regarding the "sceptical of its practical significance" bits - I have a 100% repro of this issue using the node http client to send two consecutive awaited POST requests with
I'm not suggesting we should fix it in the node client though - I guess the next step would be to try undici and see if I get the same issue - if so, we can raise a ticket there to see if it can seamlessly handle socket-was-closed-before-anything-was-sent, even for e.g. POST requests. |
We can reduce the time span where the data race can occur, but we can't solve the root problem. IMHO it's better to properly retry so it works in the general case than micro-optimizing to reduce the likelihood by a little bit (though ofc it doesn't hurt). Regarding this issue, what we maybe could consider is to disable keep-alive by default again and refer to undici for high performance use cases. @mcollina wdyt? |
Doesn't look like this worked for us. We created our own agents with
It looks like a lot of the recent talk is regarding |
Well, unfortunately it seems using
We'll now be downgrading. |
Is it possible to just use |
HTTP keepalive was enabled by default in Node 20, and is apparently causing errors for axios. See e.g. nodejs/node#47130, axios/axios#6113
Completely solved the issue for my use case by installing undici and using it instead of native fetch:
|
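The commenter's exact snippet isn't shown above; a minimal sketch of that kind of swap (URL and payload are placeholders) could look like:

```js
import { fetch } from 'undici';

const res = await fetch('http://localhost:6969/', {
  method: 'POST',
  headers: { 'content-type': 'application/json' },
  body: JSON.stringify({ hello: 'world' }),
});
console.log(res.status, await res.text());
```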
It seems obvious that keep-alive absolutely has to be set to false by default in the default http agent. Automatic keep-alive should not be the default behavior; it is an optimization (reusing sockets instead of creating new ones). This is something that someone would enable when they need/want it. Any project/company that upgrades from |
I think the default mode should be some kind of "auto mode" where, if the server provides a keep-alive hint in the response, then we re-use connections ("random" failures should be basically 0), and otherwise no re-use by default. @mcollina wdyt? I think we could/should do something similar for undici. |
I can attest to that! We've been migrating services to node.js v20.x and we've been scratching our heads over the number of connection errors we've been seeing since. |
Version
v18.15.0
Platform
Running this in a Kubernetes cluster inside a nodejs:lts container. Details for the K8s node:
"nodeInfo": { "architecture": "amd64", "bootID": "a5847523-d870-4bc0-9f1a-7a4ed9885ca3", "containerRuntimeVersion": "docker://20.10.12", "kernelVersion": "3.10.0-1160.53.1.el7.x86_64", "kubeProxyVersion": "v1.21.12", "kubeletVersion": "v1.21.12", "machineID": "", "operatingSystem": "linux", "osImage": "Red Hat Enterprise Linux", "systemUUID": "290B2642-439A-2099-6F2B-7ED84D7C284B" }
Subsystem
http or net
What steps will reproduce the bug?
Run non-stop HTTP/1.1 requests over a keepAlive connection pool using
node-fetch
or axios
against any server that closes the connection from its side after 'N' seconds. In my case the server did a 20 second close. There might be multiple requests to the server over the connection, but if 20 seconds have elapsed it closes the connection after the last HTTP response is sent. I have a default Node.js client configuration (I haven't assigned it more threads or anything). When configuring the custom http agent on the client side, I supplied {keepAlive: true, maxSockets: 50}.
How often does it reproduce? Is there a required condition?
When there are a lot of requests being sent constantly things are fine, but if there is a slowdown (not much to do, hence fewer requests go out), the next request usually ends up getting an ECONNRESET.
Based on my TCP dump, when there's a good load of requests over the connection pool and the server sends a
[FIN, ACK], the client sends a [FIN, ACK], the server sends an ACK back, and the connection closes successfully.
But when there is a "lull" later and there aren't enough requests over the pool, the server sends a [FIN, ACK] for an unused connection in the pool, the Node.js client responds with an [ACK], and the next request in the queue goes on this socket, causing the server to respond with a RESET. (Rightly so - since the server wanted to close the connection.)
Now I believe the reason the next request goes on the socket that just got the FIN probably has to do with the connection-choosing strategy. I think the default in both these frameworks is LIFO, and the ACK (without the FIN) that gets sent makes the connection go to the top of the pool for the next request.
What is the expected behavior? Why is that the expected behavior?
A socket closed from the server side (FIN, ACK sent by the server) must be removed from the connection pool instead of being kept in there - regardless of the fact that a FIN wasn't sent back. And no future requests should go on it.
What do you see instead?
The connection stays in the pool if the FIN wasn't sent back. The next request goes on the connection. The server forcibly closes the connection with a RESET.
Additional information
I tried a few other frameworks apart from node-fetch and axios (same issue, making me think it's a node core problem) - but I can't use them in my code, so I'm not mentioning them.
When I reduced maxSockets from 50 to 20, the issue happened less frequently, which is why I think it is related to activity on those sockets. I switched to the keepaliveagent package, which has a SocketTTL feature - it helps, but doesn't solve the problem. Resets still happen (same issue). It seems like this issue was reported there and they tried to handle it there (still a problem though). I'm assuming this issue has the same problem I'm facing - they were using keepaliveagent as well.
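Assuming the package referred to is the agentkeepalive module on npm (whose freeSocketTimeout and socketActiveTTL options implement the TTL behavior mentioned), the mitigation might look roughly like the sketch below; it narrows the window but, as the thread concludes, does not eliminate the race:

```js
import Agent from 'agentkeepalive';

// Keep client-side socket lifetimes comfortably below the server's 20-second close timeout.
const agent = new Agent({
  maxSockets: 50,
  freeSocketTimeout: 15000, // stop reusing sockets that have been idle this long
  socketActiveTTL: 15000,   // hard cap on how long any socket is reused
});
```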