Network request fails after event loop is blocked for a few seconds (regression in v20) #54293

Closed
nikwen opened this issue Aug 9, 2024 · 11 comments
Labels
http Issues or PRs related to the http subsystem. repro-exists Issues with reproductions.

Comments

@nikwen
Contributor

nikwen commented Aug 9, 2024

Version

v20.16.0

Platform

Darwin MacBook.my.network 23.6.0 Darwin Kernel Version 23.6.0: Fri Jul  5 17:54:52 PDT 2024; root:xnu-10063.141.1~2/RELEASE_ARM64_T8103 arm64

Subsystem

http

What steps will reproduce the bug?

  1. Block the event loop for 10 seconds via a long-running operation
  2. Make a network request via http.request()

Code example:

const http = require("node:http");

const makePostRequest = () =>
  new Promise((resolve, reject) => {
    const onResponse = (res) => {
      let data = "";

      res.on("data", (chunk) => {
        data += chunk;
      });

      res.on("end", () => resolve(data));
    };

    const req = http.request(
      "http://127.0.0.1:3000/",
      { method: "POST" },
      onResponse,
    );

    req.on("error", (e) => reject(e));

    req.end();
  });

const makeRequests = async () => {
  await makePostRequest();

  console.log("First request worked");

  const timestamp = Date.now();

  // Block the event loop for 10 seconds
  while (Date.now() < timestamp + 10000);

  // Enable the following line and it will work:
  // await new Promise((resolve) => setTimeout(resolve, 10));

  await makePostRequest();

  console.log("It crashes before it gets here");
};

const server = http.createServer((req, res) => {
  console.log("Request received");
  res.statusCode = 200;
  res.setHeader("Content-Type", "text/plain");
  res.end("Hello world!\n");
});

server.listen(3000, "127.0.0.1", () => {
  console.log("Server up");

  makeRequests();
});

How often does it reproduce? Is there a required condition?

Reproduces always.

What is the expected behavior? Why is that the expected behavior?

The network request works.

What do you see instead?

  1. The network request fails with ECONNRESET.
  2. The network request does not reach the server.

Log from running the code above:

Server up
Request received
First request worked
node:internal/process/promises:391
    triggerUncaughtException(err, true /* fromPromise */);
    ^

Error: read ECONNRESET
    at TCP.onStreamRead (node:internal/stream_base_commons:218:20) {
  errno: -54,
  code: 'ECONNRESET',
  syscall: 'read'
}

Node.js v20.16.0

Process finished with exit code 1

Additional information

This is a regression. The code works on Node.js 18. It fails on Node.js 20.

If we wait for 10ms before running the network request, it works.

@RedYetiDev
Member

Reproduced with the same script as above (saved as repro.js), this time on Node.js v22.6.0:

$ node repro.js 
Server up
Request received
First request worked
node:internal/process/promises:391
    triggerUncaughtException(err, true /* fromPromise */);
    ^

Error: read ECONNRESET
    at TCP.onStreamRead (node:internal/stream_base_commons:218:20) {
  errno: -104,
  code: 'ECONNRESET',
  syscall: 'read'
}

Node.js v22.6.0

@trivikr
Member

trivikr commented Aug 10, 2024

Reproducible in v20.0.0
Not reproducible in v18.20.4

@jazelly
Contributor

jazelly commented Aug 10, 2024

This is caused by keepAlive, which is true by default after v18. I think this PR made it globally true for the http agent, and that's when this issue surfaced.

If you reduce the while loop to less than 5 seconds (the server's default keep-alive timeout), or add a connection: 'close' header, the problem goes away.

Since the event loop was blocked and the socket destroy actions were scheduled for the next tick, your next request gets assigned to a socket that is about to be destroyed.

IMHO, the best fix is probably to detect this at the request stage.
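
As a rough sketch of the header workaround (illustration only, reusing the 127.0.0.1:3000 server from the repro above; not the eventual fix):

const http = require("node:http");

// Sketch of the `connection: 'close'` workaround: the header tells the
// agent not to keep the socket alive, so every request gets a fresh
// socket instead of reusing one the server may already be tearing down.
const req = http.request(
  "http://127.0.0.1:3000/",
  {
    method: "POST",
    headers: { connection: "close" },
  },
  (res) => {
    res.resume();
    res.on("end", () => console.log("Request completed"));
  },
);

req.on("error", (e) => console.error(e));
req.end();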

@nikwen
Contributor Author

nikwen commented Aug 11, 2024

Thank you so much for the information, @jazelly! I fixed the issue in my application by disabling keepAlive for now (a sketch follows after the list below).

Agree that your suggested approach sounds good:

  1. Before assigning a request to a socket, check if that socket is to be destroyed.
  2. If yes, assign to a different (potentially new) socket instead.
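
A minimal sketch of the keepAlive workaround mentioned above, using an explicit http.Agent and the same local server as in the repro (how it gets wired into a real application will vary):

const http = require("node:http");

// Sketch: an agent with keep-alive disabled, so sockets are not reused
// across requests and a stale kept-alive socket can't be picked up.
const agent = new http.Agent({ keepAlive: false });

const req = http.request(
  "http://127.0.0.1:3000/",
  { method: "POST", agent },
  (res) => {
    res.resume();
    res.on("end", () => console.log("Request completed"));
  },
);

req.on("error", (e) => console.error(e));
req.end();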

@jazelly
Contributor

jazelly commented Aug 22, 2024

Not suggesting Connection: close as a proper workaround, as it does not persist sockets.

Overall, I think this is a limitation in keepAlive or the underlying socket implementation. According to the RFC 9112 standard, the socket should not be closed immediately:

To avoid the TCP reset problem, servers typically close a connection in stages. First, the server performs a half-close by closing only the write side of the read/write connection. The server then continues to read from the connection until it receives a corresponding close by the client, or until the server is reasonably certain that its own TCP stack has received the client's acknowledgement of the packet(s) containing the server's last response. Finally, the server fully closes the connection.
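
Purely to illustrate that staged close with Node's net module (not how the http server is implemented; allowHalfOpen and the port are assumptions made for the sketch):

const net = require("node:net");

// Illustration of the staged close described above, using a raw TCP server.
// `allowHalfOpen: true` keeps the read side open after we half-close.
const server = net.createServer({ allowHalfOpen: true }, (socket) => {
  // 1. Half-close: shut down only the write side (sends FIN to the client).
  socket.end("bye\n");

  // 2. Keep reading until the client closes its side in turn.
  socket.on("data", () => {});
  socket.on("end", () => {
    // 3. Only now fully close the connection.
    socket.destroy();
  });
});

server.listen(4000, "127.0.0.1");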

@jakecastelli
Contributor

Jason (@jazelly) and I have spent some time digging further into this issue.

The reason the second request fails is that the socket is no longer kept alive after 5 seconds and the server tries to destroy it (this is asynchronous); however, the client side does not know the socket is being destroyed and still tries to send a request by reusing that same socket.

What happens next is that the request is assigned to the old socket by the http agent on the client side while the server is about to close that socket. The server then closes the socket and the request fails. Setting connection: close in the header to force a new socket is one way to fix it, but not the optimal way, as opening a new socket is more costly; another workaround is to always retry the request (see the sketch below).

You can find a similar issue here. Hope it helps, and please let us know if you have any other questions.
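
A minimal sketch of the retry workaround, reusing the makePostRequest helper from the repro (the single retry and the ECONNRESET-only check are assumptions, not an official recipe):

// Sketch: retry once on ECONNRESET, assuming the failure came from a
// stale kept-alive socket and the retry will be assigned a fresh one.
const makePostRequestWithRetry = async (retries = 1) => {
  try {
    return await makePostRequest();
  } catch (e) {
    if (e.code === "ECONNRESET" && retries > 0) {
      return makePostRequestWithRetry(retries - 1);
    }
    throw e;
  }
};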

@nikwen
Contributor Author

nikwen commented Aug 26, 2024

Thanks for digging further into it, @jakecastelli and @jazelly!

Let me summarize:

  1. The server closes the connection upon Keep-Alive timeout.
  2. The server can also close the connection at any other time.
  3. In the repro from the issue, the client does not know about the connection closure because it hasn't processed incoming packets.
  4. However, there's also a case where a similar issue happens without a blocked event loop:
  5. Network packets sent by the server take some time to arrive on the client side.
  6. So the client doesn't learn about connection closure immediately.
  7. This leaves open a time window where the client doesn't know the connection is closed.
  8. Requests made during that time window will also trigger ECONNRESET.
  9. In that situation it's impossible to detect on the client side at request time that the connection was closed by the server.

Damn, that sucks. It means there's little we can do except for retrying.

Thanks for investigating and thanks for sharing! I appreciate the time you put into this. 🙌

@jazelly
Contributor

jazelly commented Aug 28, 2024

I found this PR that will fix this issue. I verified it in my local environment.

@nikwen
Contributor Author

nikwen commented Aug 28, 2024

@jazelly That's awesome, thanks! Do you know the right process to bring attention back to the PR? It seems stale.

@jazelly
Contributor

jazelly commented Aug 28, 2024

Collaborators with write permissions can help run CI for the PR. I don't have that permission, but I believe once CI passes it will land soon.

@nikwen
Contributor Author

nikwen commented Aug 28, 2024

Thanks, @jazelly, for getting the PR to the attention of the right people. :)

jazelly added a commit to jazelly/node that referenced this issue Sep 9, 2024
jazelly added a commit to jazelly/node that referenced this issue Sep 9, 2024
jazelly added a commit to jazelly/node that referenced this issue Sep 9, 2024
jazelly added a commit to jazelly/node that referenced this issue Sep 10, 2024
jazelly added a commit to jazelly/node that referenced this issue Sep 10, 2024
jasnell pushed a commit that referenced this issue Sep 13, 2024
Fixes: #52649
Refs: #54293

Co-authored-by: Arrigo Zanette <[email protected]>
PR-URL: #54863
Reviewed-By: James M Snell <[email protected]>
Reviewed-By: LiviaMedeiros <[email protected]>
Reviewed-By: Jake Yuesong Li <[email protected]>
Reviewed-By: Ethan Arrowood <[email protected]>
RafaelGSS pushed a commit that referenced this issue Sep 16, 2024
RafaelGSS pushed a commit that referenced this issue Sep 16, 2024
RafaelGSS pushed a commit that referenced this issue Sep 17, 2024
targos pushed a commit that referenced this issue Sep 22, 2024
targos pushed a commit that referenced this issue Sep 26, 2024
targos pushed a commit that referenced this issue Oct 2, 2024
targos pushed a commit that referenced this issue Oct 2, 2024