Ghost-based benchmark have worse performance on node v20.3.0/v20.3.1 #48654

Open
Septa2112 opened this issue Jul 5, 2023 · 20 comments
Labels
libuv Issues and PRs related to the libuv dependency or the uv binding. performance Issues and PRs related to the performance of Node.js.


@Septa2112
Contributor

Septa2112 commented Jul 5, 2023

Version

v20.3.0/v20.3.1

Platform

Intel ICX Linux 5.15.0-72-generic
Intel SPR Linux 5.15.0-72-generic

Subsystem

No response

What steps will reproduce the bug?

Benchmark - An internal benchmark based on Ghost.js
Test version of node - v20.2.0 and v20.3.0/v20.3.1
OS - Linux
Test Platform - Intel ICX and SPR

We ran our benchmark with both versions of Node.js listed above and found that the performance of v20.3.0 was worse than that of v20.2.0.

How often does it reproduce? Is there a required condition?

No response

What is the expected behavior? Why is that the expected behavior?

No response

What do you see instead?

The measured value is requests per second (RPS); node-v20.3.0 is nearly 40% lower than node-v20.2.0.

After a rollback test, we found that this performance regression is related to the upgrade of libuv to 1.45.0; see #48078.

Additional information

Any thoughts on this issue?

@Septa2112 changed the title from "Ghost-based benchmarks have worse performance on node v20.3.0/v20.3.1" to "Ghost-based benchmark have worse performance on node v20.3.0/v20.3.1" on Jul 5, 2023
@bnoordhuis
Member

What do those benchmarks test? Can you post profiles or flamegraphs of v20.2.0 and v20.3.0?

Even better would be to test the commits between libuv v1.44.2 and v1.45.0 (there are about 100) to pinpoint the exact commit.

If your benchmarks are amenable to something that works with git bisect, you can use this oneliner to land the libuv commits in this repo:

$ git checkout v20.2.0

$ for COMMIT in `cd ../libuv && git log --oneline v1.44.2..v1.45.0 | awk '{print $1}'`; do (cd ../libuv && git archive --prefix=deps/uv/ --format=tar $COMMIT) | tar x; git add deps/uv; git commit deps/uv -m "libuv: upgrade to $COMMIT"; done

Assumes libuv is in a directory adjacent to node.
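
If the benchmark can be reduced to a pass/fail script, git bisect can then drive the search automatically. A hedged sketch only (bisect-bench.sh, bench.sh, and the RPS threshold are hypothetical placeholders, not part of the repo):

$ git bisect start HEAD v20.2.0    # HEAD = last landed libuv commit (bad), v20.2.0 = good
$ git bisect run ./bisect-bench.sh

# bisect-bench.sh
#!/bin/sh
./configure && make -j"$(nproc)" || exit 125   # exit code 125 tells bisect to skip commits that don't build
RPS=$(./bench.sh)                              # your benchmark harness, printing requests per second
[ "${RPS%.*}" -ge 1000 ]                       # exit 0 (good) while RPS stays above a chosen threshold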

@bnoordhuis added the libuv and performance labels on Jul 5, 2023
@Septa2112
Contributor Author

Septa2112 commented Jul 5, 2023

Benchmark

For the benchmark, I run a Ghost.js server on Node.js in a Docker container and use ab to send requests and measure the server's throughput.

Flamegraph

Comparing the two versions, the flame graph of node v20.3.0 clearly shows an additional io_sq_thread component, accounting for about 30%.
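
For reference, a flame graph like that can be produced with perf plus Brendan Gregg's FlameGraph scripts, roughly like this (a sketch; <pid> and the sampling window are placeholders):

$ perf record -F 99 -g -p <pid> -- sleep 30
$ perf script > out.perf
$ git clone https://github.com/brendangregg/FlameGraph
$ ./FlameGraph/stackcollapse-perf.pl out.perf | ./FlameGraph/flamegraph.pl > flame.svg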

Rollback Test

I switched to the v20.3.0 branch and found that the decrease in throughput is related to commit 851508b, which upgrades libuv.

Testing with 1 instance, I found that the throughput after this commit was about 20% lower than before.

@bnoordhuis
Member

bnoordhuis commented Jul 5, 2023

Can you test outside of docker? There's a known docker issue that will be worked around in the next libuv release.

Replace ab with (for example) wrk if at all possible. ab is notoriously unreliable.

I found that the decrease of throughput is related to this commit 851508b.

How do you know for sure? Did you bisect or are you guessing?

@Septa2112
Contributor Author

Thanks, I'm following your suggestion and retesting my benchmark. I expect the results soon and will update here.

By the way, why is ab unreliable?

@bnoordhuis
Member

For one, it's too easy to misuse: it gets confused when the server sends back responses with different sizes, unless you pass -l (and that flag was only added in 2.4, before that it reported such responses as errors!)

It has plenty of other bugs and quirks, even though it's over 2 decades old.
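
For illustration only (the URL, request count, and concurrency are placeholders), the safer invocation with ab >= 2.4 looks like:

$ ab -l -n 10000 -c 50 http://127.0.0.1:2368/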

@Septa2112
Contributor Author

Septa2112 commented Jul 10, 2023

Thank you for the explanation about ab.
I tested both inside and outside the docker container with wrk.

Inside the docker container

I ran just 1 instance.

The test results of wrk and h2load both show that the performance of node-v20.3.0 is about 5% lower than that of node-v20.2.0.
Although not as drastic as ab's test results, the performance drop is still there.

Outside the docker container

Whether it is multi-instance or single-instance, the test results of wrk show that the performance of the two versions is similar.

Results

You may be right: there does indeed seem to be a Docker-related bug, and ab really isn't reliable.

I saw that libuv has been upgraded to 1.46.0, which fixes a Docker-related bug, and that the latest version of node (v20.4.0) also upgrades libuv to 1.46.0.

I'll be retesting our benchmarks on node-v20.4.0 to see if that fixes the slowdown.

Additional

Using top, I found that the CPU utilization of node v20.3.0 is nearly 80% higher than that of node v20.2.0 under full load. Is that expected? Maybe this is caused by the sq_thread introduced in libuv 1.45.0?
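
A quick way to check where the extra CPU goes is to look at per-thread usage (a sketch; the pgrep pattern is a placeholder, and on recent kernels the io_uring submission-queue polling thread is expected to show up with a name like iou-sqp-<pid>):

$ top -H -p "$(pgrep -n node)"             # per-thread view of the node process
$ pidstat -t -p "$(pgrep -n node)" 1 5     # or a non-interactive sample (sysstat)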

How I Tested

About the question

How do you know for sure? Did you bisect or are you guessing?

I rolled back commits on node v20.3.0 in the container. When I reached the commit that upgrades libuv to 1.45.0, performance changed dramatically between the builds before and after it, so I'm sure the libuv upgrade had a negative impact on our benchmark.

@bnoordhuis
Member

Thanks for testing, that's very helpful. I'm also very interested in the results for v20.4.0.

Using top, I found that the CPU utilization of node v20.3.0 is nearly 80% higher than that of node v20.2.0 under full load. Is that expected? Maybe this is caused by the sq_thread introduced in libuv 1.45.0?

That seems likely but can you check with perf record where the process spends its CPU time?

If that does indeed show it's the sqpoll thread, is it an option for you to test with a v6.x kernel?
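
A basic profiling pass could look like this (a sketch; <pid> and the sampling window are placeholders):

$ perf record -g -p <pid> -- sleep 30
$ perf report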

@Septa2112
Contributor Author

Performance of Node v20.4.0

On our benchmark, the performance of node-v20.4.0 is similar to that of node-v20.3.0, and both are much worse than node-v20.2.0. The libuv upgrade to 1.46.0 doesn't seem to help. Maybe there is another bug?

Perf Data of v20.2.0 and v20.3.0

From the perf data, we can see that io_sq_thread accounts for about 30% in v20.3.0.

Node v20.2.0

[perf flame graph screenshot: node-202]

Node v20.3.0

[perf flame graph screenshot: node-203]

@bnoordhuis
Member

Can you test with a newer kernel? I think you're looking at an io_uring performance bug in the 5.x kernels that's been fixed in newer kernels.

Another thing you may want to check is whether the profile looks different with perf record -e cpu-clock. You can also split kernel- and user-side CPU time with -e cycles:k and -e cycles:u (this also works with cpu-clock, task-clock, etc.)
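
Concretely, those variations might look like this (a sketch; <pid> is a placeholder):

$ perf record -e cpu-clock -g -p <pid> -- sleep 30   # timer-based sampling
$ perf record -e cycles:u -g -p <pid> -- sleep 30    # user-space cycles only
$ perf record -e cycles:k -g -p <pid> -- sleep 30    # kernel-space cycles only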

@Septa2112
Contributor Author

I have updated the kernel to v6.4.3 and am retesting everything. It will take a day or two; I will update my data as soon as possible.

@Septa2112
Contributor Author

After updating the kernel to v6.4.3, here are the latest results.

Worse Performance on benchmark

Inside the docker container - Results of wrk

Here I only tested 1 instance. These results were collected manually by running the server and wrk inside Docker.

With node-v20.2.0 as the baseline (100%), the relative performance is:

Version   Relative RPS
v20.2.0   100%
v20.3.0   86%
v20.4.0   87%

Outside the docker container

All results were tested and collected by wrk.

1 instance

The performance of v20.2.0, v20.3.0 and v20.4.0 is about the same; everything looks normal.

full instance (128 instances on our ICX server)

I keep the CPU close to full load and then start an additional instance.
With v20.2.0 as the baseline (100%), the relative performance is:

Version   Relative RPS
v20.2.0   100%
v20.3.0   82%
v20.4.0   78%

Abnormal CPU load

Results of Benchmark

I ran 1 instance outside the Docker container. The CPU utilization of node v20.4.0 and node v20.3.0 is similar, and both are still about 80% higher than node v20.2.0 on kernel v6.4.3, even though the performance of the three versions is very close.

Results of perf

I will update the results of perf later.

@bnoordhuis
Member

Thanks for the update. For my understanding: what does "full instance" (vis-a-vis "1 instance") mean?

@Septa2112
Contributor Author

Single/1 instance means 1 server and 1 client. Full instance corresponds to the number of cores of the machine; on our ICX there are 128 cores, so full instance means 128 servers and 128 clients. Single instance and full instance correspond to different CPU loads.
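
For illustration, a hypothetical sketch of what a full-instance run could look like (the instance count, ports, app entry point, and wrk arguments are placeholders; the real harness is internal):

#!/bin/sh
CORES=128
for i in $(seq 0 $((CORES - 1))); do
  PORT=$((3000 + i)) node app.js &                                   # one server per core, each on its own port
done
sleep 5                                                              # give the servers time to start
for i in $(seq 0 $((CORES - 1))); do
  wrk -t1 -c50 -d30 --latency "http://127.0.0.1:$((3000 + i))/" &    # one wrk client per server
done
wait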

@bnoordhuis
Member

Clear, thanks. Two additional questions if you don't mind:

  1. Is there a single node and a single wrk process or do you use worker_threads/child_process/cluster/etc. to spin up more threads or processes?

  2. With what arguments is wrk started and do you use a custom lua script?

@Septa2112
Contributor Author

  1. With 1 instance, there is only a single node process and a single wrk process.
  2. The wrk command is wrk -t1 -c50 -d30 --latency and I didn't use a custom lua script.

@Septa2112
Contributor Author

Septa2112 commented Jul 18, 2023

Perf Results of node-v20.4.0 and node-v20.2.0 on Kernel 6.4.3

It can be clearly seen that iou_sq_thread occupies a large proportion of the kernel-space CPU time.

perf record -e cycles:u xxxxx

  • node-v20.4.0: [perf screenshot]
  • node-v20.2.0: [perf screenshot]

perf record -e cycles:k xxxxx

  • node-v20.4.0: [perf screenshot]
  • node-v20.2.0: [perf screenshot]

@zo0ok

zo0ok commented Dec 15, 2023

I ran into this when upgrading from Node 18 to 20.10.0.

I have a microservice setup: about 20 Node.js processes that communicate with each other over HTTP, all on the same machine. They also do some reading and writing to disk. CPU utilization is usually very low, in production as well.

With Node 20.0 I started getting HTTP connection errors (timeouts, I suppose) between node processes. I quickly went back to 18.

I have an automated test suite that I run. On one of my development machines (Intel i5 4250U, Debian 12), the time command reports roughly 20s of user CPU and 3s of system CPU time to complete the tests with Node 18. On Node 20, however, I still get about 20s of user CPU time, but 40s of system time! Yes, more than 10x the system CPU time.

Node 18 - all versions good.
Node 20.0.0-20.3.0 - tests fail.
Node 20.3.1+ - tests pass, but with the system CPU time described above.

macOS/Windows 11 - 20.10.0 is slightly faster than 18, so this seems to be Linux-specific.

Please let me know if I can provide more useful information.

@Septa2112
Contributor Author

@zo0ok I'm not sure if your situation is the same as mine. As far as I know, the abnormal CPU utilization is related to io_uring when the node version is >= v20.3.0.

Referring to this comment #49937 (comment), you can try setting the environment variable UV_USE_IO_URING=0 and see if it improves things.
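
For example (server.js is a placeholder for your entry point):

$ UV_USE_IO_URING=0 node server.js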

@zo0ok

zo0ok commented Dec 18, 2023

@Septa2112 , yes UV_USE_IO_URING=0 helped, and Node 20 behaves as expected now. Thank you!

@devicenull

We hit this as well on a 5.10-era kernel; UV_USE_IO_URING=0 fixed it.
