Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ECONNRESET while doing http 1.1 keep alive requests and server closes the connections #47130

Closed
DevasiaThomas opened this issue Mar 17, 2023 · 75 comments
Labels
http Issues or PRs related to the http subsystem. known limitation Issues that are identified as known limitations.

Comments

@DevasiaThomas
Copy link

DevasiaThomas commented Mar 17, 2023

Version

v18.15.0

Platform

Running this in a Kubernetes cluster inside nodejs:lts container. Details for K8s node "nodeInfo": { "architecture": "amd64", "bootID": "a5847523-d870-4bc0-9f1a-7a4ed9885ca3", "containerRuntimeVersion": "docker://20.10.12", "kernelVersion": "3.10.0-1160.53.1.el7.x86_64", "kubeProxyVersion": "v1.21.12", "kubeletVersion": "v1.21.12", "machineID": "", "operatingSystem": "linux", "osImage": "Red Hat Enterprise Linux", "systemUUID": "290B2642-439A-2099-6F2B-7ED84D7C284B" },

Subsystem

http or net

What steps will reproduce the bug?

Run non stop http1.1 requests over a keepAlive connection pool using node-fetch or axios against any server that closes the connection from its side after 'N' seconds. In my case the server did a 20 second close. There might be multiple requests to the server over the connection but if 20 seconds have elapsed it closes the connection after the the last http response is sent.I have a default nodejs client configuration ( I haven't assigned it more threads or anything).

When configuring the custom http agent on the client side, i supplied {keepAlive: true, maxSockets:50}.

How often does it reproduce? Is there a required condition?

When the there's a lot of requests being sent constantly things are fine, but if there is a slowdown(not many things do, hence lesser requests to go out) - the next request usually ends up getting an ECONNRESET.

Based on my TCP dump I have, when there's a good load of requests over the connection pool when the server sends a
[FIN, ACK], the client sends a [FIN,ACK] and the server sends an ACK back and the connection closes successfully

But when there is a "lull" later and there's not enough requests over the pool, the server sends a [FIN,ACK] for an unused connection in the pool, the nodejs client responds [ACK] and the next request in the queue goes on this socket causing the server to respond with a RESET. (Rightly so - coz the server wanted to close the connection).

Now I believe the reason for the next request to go on the socket that just got the FIN has probably has to do with connection choosing strategy. I think the default is both these frameworks is lifo, and the ACK (without the FIN) that gets sent makes the connection go to the top of the pool for the next request.

What is the expected behavior? Why is that the expected behavior?

A socket closed from the server side (FIN,ACK sent by server) must be removed from the connection pool instead of it being kept in there - Regardless of the fact that a FIN wasn't sent back. And no future requests should go on it.

What do you see instead?

The connection stays in the pool if the FIN wasn't sent back. The next request goes on the connection. Sever forcibly closes connection with a RESET

Additional information

I tried a few other frameworks apart from node-fetch and axios (same issue, making me think it's a node core problem) - But I can't use them in my code so not mentioning them .

When I reduced maxSockets from 50 to 20 the issue happened less frequently. Which is why i think it is related activity on those sockets. I switch to the keepaliveagent package that has a SocketTTL feature - it helps, but doesn't solve the problem . Resets still happen (same issue). Seems like this issue was reported there and they tried to handle it there (still a problem though). I'm assuming this issue has the same problem I'm facing - they were using keepaliveagent as well.

@bnoordhuis bnoordhuis added the http Issues or PRs related to the http subsystem. label Mar 17, 2023
@bnoordhuis
Copy link
Member

I've read your description of the problem bug I don't think it's actionable without a way to reproduce that doesn't depend on third-party modules like axios. Can you add a test case?

@DevasiaThomas
Copy link
Author

@bnoordhuis i'll see if I can come up with something. Since this happened across multiple frameworks this should be reproducible by using just the core stuff. Gimme this weekend :)
I was hoping you check the TCP FIN handling part of the code and see how the socket is treated on the client side on receiving a FIN. If it doesn't remove the connection from the pool before doing anything else - that would be the problem.

@bnoordhuis
Copy link
Member

Node handles it when ECONNRESET happens when the socket is inside the socket pool but there's a fundamental (and inevitable) race condition when that happens just when the socket is being pulled from the pool.

Node could theoretically use another socket if the current socket fails on first write but without a circuit breaker or other fail-safes that creates a DoS-like scenario when the server immediately closes incoming connections. Node would basically busy-loop trying to send the request.

@DevasiaThomas
Copy link
Author

DevasiaThomas commented Mar 18, 2023

I understand the inevitable race that can happen and I agree you have to retry in such a scenario and its best that the application takes care of that. But based on my tcp dump the server sends a FIN which causes the client to send an ACK(socket on client transitions to close_wait). Whatever happens after the ACK is sent causes the socket to get pushed to the top of the connection pool(which is why the very next request goes on it). So, there is time for the lifo algorithm or underlying code to push the connection to top of the pool - the same code can remove it from the pool instead of pushing it to the top when the ACK is for a FIN. This is what I wanted to highlight. I am working on a reproducible example. I hope I am successful with it.

@DevasiaThomas
Copy link
Author

DevasiaThomas commented Mar 18, 2023

@bnoordhuis So, I don't have guaranteed replicability - but over time (5 of the runs i got it in like the first 5 mins, 2 other runs i had to leave it overnight). I'm running them on different physical boxes. I couldn't get resetAndDestroy to actually send a TCP RST, but the prints happen and socket hangups get reported on the client end. If you can figure out how to send the RST, you'll get the ECONNRESET as reported above. The code tries to mimic what happened in my firm's environment.

Server code
import http from 'node:http';
//Time, in seconds, any socket can be alive for. If there is a request already being processed close it as soon as response is done
let maxSocketLifetime = 20;

if (maxSocketLifetime === 0) {
    process.exit(1);
} else {
    maxSocketLifetime = maxSocketLifetime * 1000;
}

/**
 *
 * @param socket {node:net.Socket}
 * @return {Promise<void>}
 */
async function initiateClose(socket) {
    //console.log('Initiate Close' + socket.remotePort);
    if (socket.writableEnded || socket.destroyed || socket.closed) {
        return;
    }
    if(socket.requestInFlight) {
        socket.closeAfterResponse = true;
        return;
    }
    //Send FIN
    //console.log('Timed Close' + socket.remotePort);
    socket.end();
}
async function handleClose(hasError) {
    console.log('Socket Closed' + this.remotePort);
    //when it was closed by client
    clearTimeout(this.intervalz);
}
/**
 *
 * @param socket {node:net.Socket}
 * @return {Promise<void>}
 */
async function handleConnection(socket) {
    console.log('New Connection' + socket.remotePort);
    socket.requestInFlight = false;
    socket.closeAfterResponse = false;
    socket.intervalz = setTimeout(initiateClose, maxSocketLifetime - 100, socket);
    socket.on('close', handleClose);
}

/**
 *
 * @param req {http.IncomingMessage}
 * @param res {http.ServerResponse}
 * @return {Promise<void>}
 */
async function handleRequest(req, res) {
    if(req.socket.writableEnded) {// If client sends data after we sent FIN
        console.log('Reset')
        req.socket.resetAndDestroy();
    }
    req.socket.requestInFlight = true;
    let body = [];
    req.on('data', (chunk) => {
        body.push(chunk);
    }).on('end', () => {
        body = Buffer.concat(body).toString();
        res.end(body);
        res.socket.requestInFlight = false;
        if(res.socket.closeAfterResponse) {
            //Send FIN
            //console.log('Close after response' + res.socket.remotePort);
            res.socket.end();
        }
    });
}


let serverOpts = {
    keepAliveTimeout: maxSocketLifetime,
    connectionsCheckingInterval: Math.floor(maxSocketLifetime/3)
}
let server = http.createServer(serverOpts);
server.on('connection', handleConnection);
server.on('request', handleRequest);
server.listen(6969);
console.log('Server Started');
Client code
import http from "node:http";
let load_arr = [5000, 4000, 3000, 1500, 1000, 600, 400, 300, 200, 100, 50, 20 , 10 , 1];
let interval_arr = [1, 3, 5, 8, 9, 10, 13, 15, 17];
const postData = JSON.stringify([{"_id":"6414532c438f86245e1a9fae","index":0,"guid":"a7bf9526-1972-418d-b251-f865a02131ae","isActive":false,"balance":"$2,510.55","picture":"http://placehold.it/32x32","age":31,"eyeColor":"green","name":"Pittman Green","gender":"male","company":"EVENTEX","email":"[email protected]","phone":"+1 (812) 487-3784","address":"557 Johnson Avenue, Dotsero, South Dakota, 682","about":"Esse voluptate proident duis sunt excepteur eiusmod proident dolore. Ullamco nulla voluptate nostrud ullamco officia quis proident id pariatur Lorem quis culpa. Non ipsum voluptate exercitation reprehenderit ex duis nisi exercitation.\r\n","registered":"2014-05-19T06:47:19 +07:00","latitude":55.881952,"longitude":176.809812,"tags":["dolor","magna","aute","voluptate","ex","eiusmod","nisi"],"friends":[{"id":0,"name":"Tammi Odom"},{"id":1,"name":"Bush Ingram"},{"id":2,"name":"Pennington Sharp"}],"greeting":"Hello, Pittman Green! You have 1 unread messages.","favoriteFruit":"strawberry"},{"_id":"6414532ce611be5510ba2c21","index":1,"guid":"6350b383-bee6-48a1-b19e-00cc20d74db7","isActive":true,"balance":"$2,193.74","picture":"http://placehold.it/32x32","age":38,"eyeColor":"brown","name":"Rice Mccullough","gender":"male","company":"MAGNINA","email":"[email protected]","phone":"+1 (915) 527-2484","address":"689 Hanover Place, Lumberton, Wisconsin, 4844","about":"Labore dolor nostrud anim veniam incididunt dolore fugiat proident. Sit fugiat incididunt culpa aute veniam. Sint labore non magna labore tempor amet. Id mollit ipsum eiusmod reprehenderit dolor consequat nostrud nisi.\r\n","registered":"2019-04-02T08:05:50 +07:00","latitude":34.247422,"longitude":-144.451109,"tags":["elit","adipisicing","voluptate","proident","consectetur","minim","duis"],"friends":[{"id":0,"name":"Abbott Compton"},{"id":1,"name":"Cathryn Wilkins"},{"id":2,"name":"Jeannette Robertson"}],"greeting":"Hello, Rice Mccullough! You have 7 unread messages.","favoriteFruit":"strawberry"},{"_id":"6414532c1896aa6394d16d18","index":2,"guid":"22596015-6d9b-447e-8fa4-9007e80b37bb","isActive":true,"balance":"$2,100.66","picture":"http://placehold.it/32x32","age":24,"eyeColor":"green","name":"Katharine Kaufman","gender":"female","company":"ELENTRIX","email":"[email protected]","phone":"+1 (846) 453-2614","address":"271 Grand Avenue, Tonopah, New Hampshire, 9234","about":"Eiusmod officia amet sit dolor elit id labore ea. Aliquip elit veniam reprehenderit aliquip mollit. Tempor non ipsum ad exercitation sit sunt consectetur amet pariatur. Nisi in non incididunt quis ex laboris. Veniam voluptate proident do commodo laborum elit velit Lorem enim aliquip mollit quis tempor veniam. Anim ea velit fugiat sit ea voluptate esse in officia nostrud consequat amet magna.\r\n","registered":"2020-12-21T05:29:02 +08:00","latitude":-54.93951,"longitude":151.859727,"tags":["proident","sint","Lorem","duis","sint","sit","excepteur"],"friends":[{"id":0,"name":"Corrine Chavez"},{"id":1,"name":"Diana Pate"},{"id":2,"name":"Madden Yang"}],"greeting":"Hello, Katharine Kaufman! You have 2 unread messages.","favoriteFruit":"strawberry"},{"_id":"6414532c404e9ebdabd81650","index":3,"guid":"b8d8605f-a72a-4d8f-954d-2339434d3c5d","isActive":false,"balance":"$2,420.51","picture":"http://placehold.it/32x32","age":29,"eyeColor":"green","name":"Mattie Craft","gender":"female","company":"CINASTER","email":"[email protected]","phone":"+1 (982) 481-3054","address":"525 Church Lane, Statenville, Michigan, 8845","about":"Et irure velit esse reprehenderit id. Dolore excepteur labore ullamco eu est nostrud et. Ipsum consectetur exercitation ut elit id irure ullamco irure irure voluptate laborum id enim est. Ipsum elit voluptate ullamco culpa dolor esse mollit sint velit. Cupidatat exercitation ipsum commodo enim labore anim officia ut incididunt ea velit amet ut Lorem. Fugiat id occaecat anim aute ex elit irure pariatur commodo dolore magna pariatur.\r\n","registered":"2014-05-10T04:15:02 +07:00","latitude":-46.535255,"longitude":140.547752,"tags":["esse","ullamco","mollit","quis","proident","aliqua","pariatur"],"friends":[{"id":0,"name":"Irene Delgado"},{"id":1,"name":"Christensen Craig"},{"id":2,"name":"Vilma Mccarthy"}],"greeting":"Hello, Mattie Craft! You have 6 unread messages.","favoriteFruit":"banana"},{"_id":"6414532c1a5cfc33a11dc118","index":4,"guid":"d7dfa4ee-e4be-43c4-bd5a-2993debbbebc","isActive":false,"balance":"$2,646.29","picture":"http://placehold.it/32x32","age":23,"eyeColor":"brown","name":"Lopez Davenport","gender":"male","company":"MEDIFAX","email":"[email protected]","phone":"+1 (885) 442-2655","address":"401 Eastern Parkway, Loretto, Iowa, 7360","about":"Nulla laboris laborum ipsum voluptate sint deserunt eu. Esse velit sit sunt qui. Proident anim do ea do. Elit qui sit cupidatat dolor quis fugiat dolor quis voluptate in. Velit esse irure est duis nulla incididunt do ut. Tempor qui culpa labore eiusmod voluptate quis elit in cillum ea nostrud.\r\n","registered":"2020-04-25T03:40:35 +07:00","latitude":58.560169,"longitude":58.62533,"tags":["voluptate","esse","est","cupidatat","officia","elit","deserunt"],"friends":[{"id":0,"name":"Blanca Wilkinson"},{"id":1,"name":"Aguirre Schmidt"},{"id":2,"name":"Mandy Norris"}],"greeting":"Hello, Lopez Davenport! You have 1 unread messages.","favoriteFruit":"strawberry"}]);
let agent = new http.Agent({
    keepAlive:true,
    maxSockets: 50,
});
const options = {
    agent: agent,
    hostname: 'localhost',
    port: 6969,
    method: 'POST',
    headers: {
        'Content-Type': 'application/json',
    },
};

async function handleResponse(res) {
    let body = [];
    res.setEncoding('utf8');
    res.on('data', (chunk) => {
        body.push(chunk);
    });
    res.on('end', () => {
        body = body.join("");
    });
}
async function handleRequestError(error) {
    console.error(`problem with request: ${error.message}`)
}

while (true) {
    let load = load_arr[load_arr.length * Math.random() | 0];
    for (let i = 0; i < load; i++) {
        let req = http.request(options, handleResponse);
        req.on('error', handleRequestError);
        req.write(postData);
        req.end();
    }

    let interval = interval_arr[interval_arr.length * Math.random() | 0];
    await new Promise(r => setTimeout(r, interval * 1000));
}

@bnoordhuis
Copy link
Member

@nodejs/http can you take a look?

@DevasiaThomas
Copy link
Author

To surface the problem more frequently, i tried to randomly close ~ 1% of the connections after the response was ended
so this

        res.socket.requestInFlight = false;
        if(res.socket.closeAfterResponse) {

changed to this

        res.socket.requestInFlight = false;
        let random = Math.random();
        if(random > 0.98 || res.socket.closeAfterResponse) {

You can change the number to increase the percentage of connections closed by the server.
This what the TCP dump will look like (if my node.js server would actually send the reset).

Screen Shot 2023-03-19 at 3 13 41 AM

The cool thing I saw in the dump was that these connections were idle till the FIN showed up. Which is why I'm emphasizing that there is some logic causing an outgoing ACK (not FIN,ACK) to an incoming FIN (on my nodejs client) to appear as an active connection which further causes the lifo scheduling (this happens in fifo as well) to push that connection to the top of the pool instead of marking it as closed from the pool's perspective.

@ronag
Copy link
Member

ronag commented Mar 19, 2023

Haven't looked into this in depth but I think this is expected. There is a race condition in the http/1.1 spec for keep alive requests. This can be mostly avoided by applying the keep alive hints that the server provides.

Undici does that and I would recommend you switch to that.

@DevasiaThomas
Copy link
Author

DevasiaThomas commented Mar 19, 2023

Haven't looked into this in depth but I think this is expected. There is a race condition in the http/1.1 spec for keep alive requests. This can be mostly avoided by applying the keep alive hints that the server provides.

I'm sorry what do you mean by hints? Is the hint part of the spec or implemented by nodejs servers? I don't know about this - so if you could point me to the info regarding this, I would greatly appreciate it.

Undici does that and I would recommend you switch to that.

Can't - its not that simple. Our restAPI interactions are all generated code based on things like OpenApi and protobuff schemas. A quick google search does not find me a generator that uses Undici. If we had to switch - we'd probably choose another language at that point. This is because my colleagues have spend quite a bit of time on this and we don't believe its the race mentioned in spec regarding keep-alive.(We haven't faced this problem in rust and python implementations of core http libs)

Based on the tcpdump timelines and ordering of which connection gets the next request, its pretty clear that nodejs has the time to apply the lifo algorithm to place the socket, that just sent the ACK to the incoming FIN, to the top of the pool. Which also means you have the same amount of time to take that socket out of the connection pool. I believe this is fixable. If you can show me where the code, that handles the connection pool management, is - I can see if I can submit a PR.

@ronag
Copy link
Member

ronag commented Mar 19, 2023

MDN calls it parameters. See the timeout parameter. https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Keep-Alive

@ronag
Copy link
Member

ronag commented Mar 19, 2023

The client should not make any request on a connection that has been idle more than timeout duration (minus some threshold).

@DevasiaThomas
Copy link
Author

DevasiaThomas commented Mar 19, 2023

The client should not make any request on a connection that has been idle more than timeout duration (minus some threshold).

@ronag , timeout is not the problem here. Neither is the concept of keep alive. Please have a deep look into my issue. Thanks :)

@ronag
Copy link
Member

ronag commented Mar 19, 2023

@mcollina I think this is something we've resolved in undici. This kind of goes into http core client maintenance issue...

@ywave620
Copy link
Contributor

@DevasiaThomas I believe this is just the known issue as memtioned by @ronag and @bnoordhuis. For you curiosity why

an outgoing ACK (not FIN,ACK) to an incoming FIN (on my nodejs client) to appear as an active connection which further causes the lifo scheduling (this happens in fifo as well) to push that connection to the top of the pool

your sever code response a HTTP response indicating keep-alive followed by a TCP FIN:

async function handleRequest(req, res) {
// ...
        res.socket.requestInFlight = false;
        if(res.socket.closeAfterResponse) {
            //Send FIN
            //console.log('Close after response' + res.socket.remotePort);
            res.socket.end();
        }
// ...

nodejs client does not take the TCP FIN as a signal to reuse the connection but does take the HTTP response with a keep-alive indication.

@ywave620
Copy link
Contributor

FYI, the TCP FIN, in nodejs spectrum, is the 'end' event of a socket. I don't see any code of http agent makes use of this event to decide whether a socket can be reused.

@DevasiaThomas
Copy link
Author

FYI, the TCP FIN, in nodejs spectrum, is the 'end' event of a socket. I don't see any code of http agent makes use of this event to decide whether a socket can be reused.

@ywave620
There's 2 behaviors on an incoming FIN,ACK from the server based on my TCP dump.

  1. Node.js Client sends FIN, ACK --> This has no problem, works fine
  2. Node.js Client sends ACK --> Idk what code path makes this happen, but this also causes socket to go to top of connection pool. Because the very next request goes on this socket.

@ywave620
Copy link
Contributor

ywave620 commented Mar 20, 2023

Node.js Client sends FIN, ACK --> This has no problem, works fine

This is the default behavior of a nodejs client/server, i.e. nodejs automatically close any half-open connection. This behavior can be overwritten by setting allowHalfOpen to true in the option paased to http or net module when creating outbound requests/connections or creating server

Node.js Client sends ACK --> Idk what code path makes this happen, but this also causes socket to go to top of connection pool. Because the very next request goes on this socket.

This worths further digging. Did you spot Node.js Client send a FIN later in this weird scenario?

@ywave620
Copy link
Contributor

Also note that allowHalfOpen is not officially documented in http module. Is it intended or do we miss it? @ronag

@DevasiaThomas
Copy link
Author

Node.js Client sends ACK --> Idk what code path makes this happen, but this also causes socket to go to top of connection pool. Because the very next request goes on this socket.

This worths further digging. Did you spot Node.js Client send a FIN later in this weird scenario?

Nope, since the client sends the next request right away, server sends reset. client has no opportunity to send FIN.
I haven't passed allowHalfOpen option anywhere so i can't comment about that.

@DevasiaThomas
Copy link
Author

DevasiaThomas commented Mar 20, 2023

@ywave620 I just checked http.createServer call path real quick and it seems that the Server creates a net.Server with halfOpen as true always. Please confirm. But my issue isn't with the server but the client. I'll check that as well

EDIT: idk if that is only for the listening socket or applies to all incoming sockets for the server.

@ywave620
Copy link
Contributor

allowHalfOpen applies to either client or server. It make sense in net abstraction level, however, it is copied to the option paased to net by http. E.g. net.createConnection inherits the option from http.request, which is why http.request({allowHalfOpen: true}) works. Although I don't know whether this is the desired usage, in other words, should it be considered as backdoor?

@ronag
Copy link
Member

ronag commented Mar 20, 2023

I think your best bet is to either disable keep alive or switch to undici.

@DevasiaThomas
Copy link
Author

I think your best bet is to either disable keep alive or switch to undici.

You've mentioned this already. Neither of which I can do. However, I'm fairly certain there's a bug here - please let it get fixed.

@mcollina mcollina added the help wanted Issues that need assistance from volunteers or PRs that need help to proceed. label Mar 20, 2023
@mcollina
Copy link
Member

Would you like to send a pull request to resolve this issue? Investigating this problem would likely require multiple days of work, and I'm not convinced either @ronag, myself or other members of @nodejs/http have the time to investigate.

@DevasiaThomas
Copy link
Author

DevasiaThomas commented Mar 20, 2023

@ywave620 So default behavior for nodejs server is to allowHalfOpen whereas for clients, its to not allowHalfOpen - you can still pass it in options to override this behavior (backdoor or not - up the maintainers)

@DevasiaThomas
Copy link
Author

@mcollina Sure I can try :)

@bnoordhuis
Copy link
Member

there is some logic causing an outgoing ACK (not FIN,ACK) to an incoming FIN (on my nodejs client) to appear as an active connection which further causes the lifo scheduling (this happens in fifo as well) to push that connection to the top of the pool instead of marking it as closed from the pool's perspective

Just an observation from reading lib/_http_agent.js but I don't think that's possible. Sockets are removed from the FIFO/LIFO queue but sockets in the queue are never reordered.

(It's the .freeSockets property, keyed by host.)

What I think can happen is this: destroyed sockets aren't immediately removed from the queue, they're simply discarded (in [FL]IFO order) when the agent looks for a live socket, by inspecting socket.destroyed == true.

Assume for simplicity that all sockets expire at the same time, e.g., because the network went down. Node gets notified by the operating system on a per-socket basis. Crucially, node may not get those notifications all at the same time.

For the next request, the agent discards the ones it knows are dead. It picks the first one it believes is alive - except it isn't, we just haven't received the notification yet. It's basically the race condition I mentioned earlier.

@DevasiaThomas
Copy link
Author

DevasiaThomas commented Mar 20, 2023

@bnoordhuis Thank you looking into it :) . I just reached that part of code

Looking for your thoughts on the below:

  1. in addRequest, the code that clears stale free sockets does not follow the direction of the scheduling algorithm. Any particular reason for this? If it was lifo, you would get a socket from the other end that was not checked for stale sockets.

  2. in places like addRequest, on 'free' and removeSocket, instead of just destroyed or !writable, could we check for half closed scenarios incase allowHalfOpen is not set -> My case where server sent the FIN so its the readable end that was closed but not (yet) the writeable.

@StanislavKharchenko
Copy link

Why it was closed without any comment?
@ronag ?

@ronag
Copy link
Member

ronag commented Jul 22, 2024

As has been explained. This is a race condition in the http 1.1 spec and there is nothing further we can do here.

@ziimakc
Copy link

ziimakc commented Jul 22, 2024

@ronag then how we are supposed to use nodejs/undici when it's randomly closes connections from time to time?

@avivkeller avivkeller added known limitation Issues that are identified as known limitations. and removed help wanted Issues that need assistance from volunteers or PRs that need help to proceed. labels Jul 22, 2024
@avivkeller
Copy link
Member

@ronag given the conversation, I've added known limitation, as there's nothing Node.js can do to fix it, but it's known, correct?

Feel free to adjust if needed.

@ronag
Copy link
Member

ronag commented Jul 22, 2024

This issue is about the node core http client. Not undici or fetch. Though they will encounter the same issue.

The problem is that the server (unexpectedly for the client but correctly) closes the keep alive connection. Please read through the whole issue if you want to have a better understanding.

@rubengmurray
Copy link

What changed? Node 18.13.0 did not have this issue at all for me. Upgrading to Node 20.12.2 has created this issue. So what's different?

Can I get my Node 20.12.2 service to run with the same parameters as Node 18.13.0 so this doesn't happen?

I've read through this issue, and unless I've missed it, that topic has not been touched on?

No fancy config here: http.Server(express)

@h0od
Copy link

h0od commented Jul 23, 2024

What changed? Node 18.13.0 did not have this issue at all for me. Upgrading to Node 20.12.2 has created this issue. So what's different?

The difference is that HTTP keep-alives are turned on by default for the HTTP(S) agents in Node 20. So to go back to the old behavior you can simply just explicitly turn keep-alives off.

@rubengmurray
Copy link

Thanks @h0od I'll give this a go.

@dhedey
Copy link

dhedey commented Jul 23, 2024

@ronag I'm not looking to criticise the closing of this ticket (I understand it's a gnarly one and time could be spent better elsewhere - e.g. undici), but I just wanted to enquire a little more about the background to this:

As has been explained. This is a race condition in the http 1.1 spec and there is nothing further we can do here.

Is this effectively saying it's too hard to improve the behaviour, given the current architecture of Node?

I rarely see an "ECONNRESET" or equivalent errors in other http clients, but I see it a lot in Node - I don't think its behaviour fully aligns with the SHOULD of the http 1.1 spec (copied below) - e.g. around handling transport closures and retries of idempotent requests.

Moreover, there are additional issues inside Node itself which cause this issue to be hit much more readily than it seems to elsewhere; and the change to default to keep-alive in the http client has meant lots of people are hitting this issue who didn't previously:

  • If on first attempted re-use of a socket, it's seen to be already closed at first read, we get an ECONNRESET propagated up to the client; rather than the http client just using a different socket. This feels like an abstraction leak in the connection assignment of the http client.
  • The race condition is worsened because of the single threaded model in Node slowing down handling of close events, and multiple event loops of delay to update the reusable socket queue.

As a mitigation - you mention in an earlier comment:

Haven't looked into this in depth but I think this is expected. There is a race condition in the http/1.1 spec for keep alive requests. This can be mostly avoided by applying the keep alive hints that the server provides.

Undici does that and I would recommend you switch to that.

Undici does seem pretty great, and I can see that is where the majority of work is being spent - do we expect undici to be moving inside Node, and replace the http client?

It looks like Undici has a retry handler which could partly solve this issue; but I imagine it probably still has inherits some of the underlying awkwardness from the Node socket layer.

I imagine it performs better though, but I haven't tested it against pathological test cases, to see whether it assigns an alternative socket if the first attempted socket is detected to be already closed at/before use.

But if we expect undici to replace the node http client, then I can see why it's worth us focusing the attention there.

HTTP Spec Reflections

The section 8.1.4 of the http 1.1 spec titled "Practical Considerations" states:

When a client or server wishes to time-out it SHOULD issue a graceful close on the transport connection. Clients and servers SHOULD both constantly watch for the other side of the transport close, and respond to it as appropriate. If a client or server does not detect the other side's close promptly it could cause unnecessary resource drain on the network.

I don't think Node does a great job of responding appropriately to the transport close / FIN from the socket. In many cases, the first awareness that we're re-using a closed socket is when we first attempt to read/write to it on a next request, which propagates all the way up to the user.

I know things are currently very asynchronous, but if we could somehow check the status of the socket before attempting to reuse it -- or detect on failure that we haven't actually sent any of a request down the socket yet -- then we could use a different socket for the request, which would avoid a lot of false-positives, and I think solve a lot of the reports in this thread.

A client, server, or proxy MAY close the transport connection at any time. For example, a client might have started to send a new request at the same time that the server has decided to close the "idle" connection. From the server's point of view, the connection is being closed while it was idle, but from the client's point of view, a request is in progress.

This means that clients, servers, and proxies MUST be able to recover from asynchronous close events. Client software SHOULD reopen the transport connection and retransmit the aborted sequence of requests without user interaction so long as the request sequence is idempotent (see section 9.1.2).

Non-idempotent methods or sequences MUST NOT be automatically retried, although user agents MAY offer a human operator the choice of retrying the request(s). Confirmation by user-agent software with semantic understanding of the application MAY substitute for user confirmation. The automatic retry SHOULD NOT be repeated if the second sequence of requests fails.

Which I read as roughly saying a client should retry GET requests if the underlying socket has issues. This is one of three recommendations I previously put forward.

@ronag
Copy link
Member

ronag commented Jul 23, 2024

@ronag I'm not looking to criticise the closing of this ticket (I understand it's a gnarly one and time could be spent better elsewhere - e.g. undici), but I just wanted to enquire a little more about the background to this:

As has been explained. This is a race condition in the http 1.1 spec and there is nothing further we can do here.

Is this effectively saying it's too hard to improve the behaviour, given the current architecture of Node?

Has nothing to do with Node. It's a general problem.

I rarely see an "ECONNRESET" or equivalent errors in other http clients, but I see it a lot in Node - I don't think its behaviour fully aligns with the SHOULD of the http 1.1 spec (copied below) - e.g. around handling transport closures and retries of idempotent requests.

Moreover, there are additional issues inside Node itself which cause this issue to be hit much more readily than it seems to elsewhere; and the change to default to keep-alive in the http client has meant lots of people are hitting this issue who didn't previously:

  • If on first attempted re-use of a socket, it's seen to be already closed at first read, we get an ECONNRESET propagated up to the client; rather than the http client just using a different socket. This feels like an abstraction leak in the connection assignment of the http client.

Yea, undici does this for idempotent requests. Not something we will improve with the core client. PR welcome if someone wants to do it.

  • The race condition is worsened because of the single threaded model in Node slowing down handling of close events, and multiple event loops of delay to update the reusable socket queue.

Yes that's a possible improvement but I'm sceptical of its practical significance.

As a mitigation - you mention in an earlier comment:

Haven't looked into this in depth but I think this is expected. There is a race condition in the http/1.1 spec for keep alive requests. This can be mostly avoided by applying the keep alive hints that the server provides.
Undici does that and I would recommend you switch to that.

Undici does seem pretty great, and I can see that is where the majority of work is being spent - do we expect undici to be moving inside Node, and replace the http client?

That's at least my long term hope. It's difficult to remove the existing client without breaking the ecosystem.

It looks like Undici has a retry handler which could partly solve this issue; but I imagine it probably still has inherits some of the underlying awkwardness from the Node socket layer.

I imagine it performs better though, but I haven't tested it against pathological test cases, to see whether it assigns an alternative socket if the first attempted socket is detected to be already closed at/before use.

It will retry automatically for idempotent requests. Not for e.g. POST.

But if we expect undici to replace the node http client, then I can see why it's worth us focusing the attention there.

HTTP Spec Reflections

The section 8.1.4 of the http 1.1 spec titled "Practical Considerations" states:

When a client or server wishes to time-out it SHOULD issue a graceful close on the transport connection. Clients and servers SHOULD both constantly watch for the other side of the transport close, and respond to it as appropriate. If a client or server does not detect the other side's close promptly it could cause unnecessary resource drain on the network.

I don't think Node does a great job of responding appropriately to the transport close / FIN from the socket. In many cases, the first awareness that we're re-using a closed socket is when we first attempt to read/write to it on a next request, which propagates all the way up to the user.

Possibly something to improve but again I'm sceptical to it's practical significance.

I know things are currently very asynchronous, but if we could somehow check the status of the socket before attempting to reuse it -- or detect on failure that we haven't actually sent any of a request down the socket yet -- then we could use a different socket for the request, which would avoid a lot of false-positives, and I think solve a lot of the reports in this thread.

Possibly something to improve but again I'm sceptical to it's practical significance.

A client, server, or proxy MAY close the transport connection at any time. For example, a client might have started to send a new request at the same time that the server has decided to close the "idle" connection. From the server's point of view, the connection is being closed while it was idle, but from the client's point of view, a request is in progress.
This means that clients, servers, and proxies MUST be able to recover from asynchronous close events. Client software SHOULD reopen the transport connection and retransmit the aborted sequence of requests without user interaction so long as the request sequence is idempotent (see section 9.1.2).

Undici does this.

Non-idempotent methods or sequences MUST NOT be automatically retried, although user agents MAY offer a human operator the choice of retrying the request(s). Confirmation by user-agent software with semantic understanding of the application MAY substitute for user confirmation. The automatic retry SHOULD NOT be repeated if the second sequence of requests fails.

Undici does this.

Which I read as roughly saying a client should retry GET requests if the underlying socket has issues. This is one of three recommendations I previously put forward.

Undici does this.

@dhedey
Copy link

dhedey commented Jul 23, 2024

@ronag many thanks for the detailed reply 👍.

It sounds like undici is better architected, and in general, we (as users in the node ecosystem) should be transitioning across to it over the inbuilt http client.

Just regarding the "sceptical to it's practical significance" bits - I have a 100% repro of this issue using the node http client to send two consecutive awaited POST requests with Connection: Close against a localhost Axum server (which causes the server to send a HTTP response with no Connection Close and then an immediate FIN) - you could argue it's dodgy server behaviour, but other clients I've tested are fine with it, and the server did send a FIN, which should be "responded to as appropriate".

I'm not suggesting we should fix it in the node client though - I guess the next step would be to try undici and see if I get the same issue - if so, we can raise a ticket there to see if it can seamlessly handle socket-was-closed-before-anything-was-sent, even for e.g. POST requests.

@ronag
Copy link
Member

ronag commented Jul 24, 2024

Just regarding the "sceptical to it's practical significance" bits

We can reduce the time span where the data race can occur, but we can't solve the root problem. IMHO it's better to properly retry so it works in the general case than micro optimizing to reduce the likelyhood by a little bit (though ofc it doesn't hurt).

Regarding this issue what we maybe could consider is to disable keep-alive by default again and refer to undici for high performance use cases. @mcollina wdyt?

@rubengmurray
Copy link

What changed? Node 18.13.0 did not have this issue at all for me. Upgrading to Node 20.12.2 has created this issue. So what's different?

The difference is that HTTP keep-alives are turned on by default for the HTTP(S) agents in Node 20. So to go back to the old behavior you can simply just explicitly turn keep-alives off.

Doesn't look like this worked for us. We created our own agents with keepAlive: false for http & https and passed them to axios and we're getting the same error of socket hang up.

import http from 'http';
import https from 'https';

export const httpAgent = new http.Agent({ keepAlive: false });
export const httpsAgent = new https.Agent({ keepAlive: false });

....

axios
    .request<T>({
      ...config,
      httpAgent,
      httpsAgent,
      method: 'post',
      url,
      data,
    })

It looks like a lot of the recent talk is regarding undici so I think we will try this for outbound requests instead.

@rubengmurray
Copy link

Well, unfortunately it seems using undici breaks our integration tests as nock doesn't work with it. There is an issue for this here nock/nock#2183 which says support has been added in the beta, but we've found that not to work.

We'll now be downgrading.

@adijesori
Copy link

Is it possible to just use keep-alive = false by default as was originally before node 20?

MattiasOlla added a commit to BonnierNews/lu-common that referenced this issue Jul 29, 2024
HTTP keepalive was enabled by default in Node 20, and is apparently causing errors for axios. See e.g. nodejs/node#47130, axios/axios#6113
@ziimakc
Copy link

ziimakc commented Aug 4, 2024

Completely solved issue for my use case by installing undici and using it instead of native fetch:

import {
	fetch as undiciFetch,
	type RequestInfo as UndiciRequestInfo,
	type RequestInit as UndiciRequestInit,
} from "undici";

// works
async function fetchWrapperUndici(
	input: RequestInfo | URL,
	init?: RequestInit,
): Promise<Response> {
	return undiciFetch(input, init);
}

// ECONNRESET all the time
async function fetchWrapperNative(
	input: RequestInfo | URL,
	init?: RequestInit,
): Promise<Response> {
	return fetch(input, init);
}

@abrie-stubber
Copy link

It seems obvious that keep-alive absolutely has to be set to false by default in the default http agent. Automatic keep-alive should not be default behavior, it is an optimization (reusing sockets instead of creating new ones). This is something that someone would change when they need/want it.

Any project/company that upgrades from node:18 to node:20 would go from 0 requests failing, everything working reliably, to some random amount of requests failing. And from their perspective it is the same http agent now doing some completely unnecessary thing that they did not ask for.
That seems insane.

@ronag
Copy link
Member

ronag commented Oct 25, 2024

I think the default mode should be some kind of "auto mode" where if the server provides a keep alive hint in the response then we re-use connections ("random" failures should be basicaly 0) and otherwise no re-use by default.

@mcollina wdyt? I think we could/should do something similar for undici.

@dreaganluna
Copy link

It seems obvious that keep-alive absolutely has to be set to false by default in the default http agent. Automatic keep-alive should not be default behavior, it is an optimization (reusing sockets instead of creating new ones). This is something that someone would change when they need/want it.

Any project/company that upgrades from node:18 to node:20 would go from 0 requests failing, everything working reliably, to some random amount of requests failing. And from their perspective it is the same http agent now doing some completely unnecessary thing that they did not ask for. That seems insane.

I can attest to that! We've been migrating services to node.js v20.x and we've been scratching our heads for the amount of connection errors we've been seeing since.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
http Issues or PRs related to the http subsystem. known limitation Issues that are identified as known limitations.
Projects
None yet
Development

No branches or pull requests