Ghost-based benchmark have worse performance on node v20.3.0/v20.3.1 #48654

Open
Septa2112 opened this issue Jul 5, 2023 · 20 comments
Labels
libuv Issues and PRs related to the libuv dependency or the uv binding. performance Issues and PRs related to the performance of Node.js.


@Septa2112
Contributor

Septa2112 commented Jul 5, 2023

Version

v20.3.0/v20.3.1

Platform

Intel ICX Linux 5.15.0-72-generic
Intel SPR Linux 5.15.0-72-generic

Subsystem

No response

What steps will reproduce the bug?

Benchmark - An internal benchmark based on Ghost.js
Test version of node - v20.2.0 and v20.3.0/v20.3.1
OS - Linux
Test Platform - Intel ICX and SPR

We ran our benchmark with both versions of Node.js listed above and found that the performance of v20.3.0 was worse than that of v20.2.0.

How often does it reproduce? Is there a required condition?

No response

What is the expected behavior? Why is that the expected behavior?

No response

What do you see instead?

The measured value is requests per second (RPS); node-v20.3.0 is nearly 40% lower than node-v20.2.0.

After a rollback test, we found that this performance regression is related to the upgrade of libuv to 1.45.0; see #48078.

Additional information

Any thoughts on this issue?

@Septa2112 changed the title from "Ghost-based benchmarks have worse performance on node v20.3.0/v20.3.1" to "Ghost-based benchmark have worse performance on node v20.3.0/v20.3.1" on Jul 5, 2023
@bnoordhuis
Member

What do those benchmarks test? Can you post profiles or flamegraphs of v20.2.0 and v20.3.0?

Even better would be to test the commits between libuv v1.44.2 and v1.45.0 (there are about 100) to pinpoint the exact commit.

If your benchmarks are amenable to something that works with git bisect, you can use this oneliner to land the libuv commits in this repo:

$ git checkout v20.2.0

$ for COMMIT in `cd ../libuv && git log --oneline v1.44.2..v1.45.0 | awk '{print $1}'`; do (cd ../libuv && git archive --prefix=deps/uv/ --format=tar $COMMIT) | tar x; git add deps/uv; git commit deps/uv -m "libuv: upgrade to $COMMIT"; done

Assumes libuv is in a directory adjacent to node.
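
If the benchmark can be reduced to a pass/fail script, git bisect can then drive the search automatically. A hedged sketch only (bisect-bench.sh, bench.sh, and the RPS threshold are hypothetical placeholders, not part of the repo):

$ git bisect start HEAD v20.2.0    # HEAD = last landed libuv commit (bad), v20.2.0 = good
$ git bisect run ./bisect-bench.sh

# bisect-bench.sh
#!/bin/sh
./configure && make -j"$(nproc)" || exit 125   # exit code 125 tells bisect to skip commits that don't build
RPS=$(./bench.sh)                              # your benchmark harness, printing requests per second
[ "${RPS%.*}" -ge 1000 ]                       # exit 0 (good) while RPS stays above a chosen threshold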

@bnoordhuis added the libuv and performance labels on Jul 5, 2023
@Septa2112
Contributor Author

Septa2112 commented Jul 5, 2023

Benchmark

For the benchmark, I run a Ghost.js server on Node.js in a Docker container and use ab to send requests and measure the server's throughput.

Flamegraph

Comparing the two versions, the flame graph of node v20.3.0 clearly shows an additional io_sq_thread component, accounting for about 30%.
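
For reference, a flame graph like that can be produced with perf plus Brendan Gregg's FlameGraph scripts, roughly like this (a sketch; <pid> and the sampling window are placeholders):

$ perf record -F 99 -g -p <pid> -- sleep 30
$ perf script > out.perf
$ git clone https://github.com/brendangregg/FlameGraph
$ ./FlameGraph/stackcollapse-perf.pl out.perf | ./FlameGraph/flamegraph.pl > flame.svg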

Rollback Test

I switched to the v20.3.0 branch and found that the decrease in throughput is related to commit 851508b, which upgrades libuv.

Testing with 1 instance, I found that the throughput after this commit was about 20% lower than before.

@bnoordhuis
Member

bnoordhuis commented Jul 5, 2023

Can you test outside of docker? There's a known docker issue that will be worked around in the next libuv release.

Replace ab with (for example) wrk if at all possible. ab is notoriously unreliable.

I found that the decrease of throughput is related to this commit 851508b.

How do you know for sure? Did you bisect or are you guessing?

@Septa2112
Contributor Author

Thanks, I'm following your suggestion and retesting my benchmark. I expect the results soon and will update here.

By the way, why is ab unreliable?

@bnoordhuis
Member

For one, it's too easy to misuse: it gets confused when the server sends back responses with different sizes, unless you pass -l (and that flag was only added in 2.4, before that it reported such responses as errors!)

It has plenty of other bugs and quirks, even though it's over 2 decades old.
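
For illustration only (the URL, request count, and concurrency are placeholders), the safer invocation with ab >= 2.4 looks like:

$ ab -l -n 10000 -c 50 http://127.0.0.1:2368/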

@Septa2112
Contributor Author

Septa2112 commented Jul 10, 2023

Thank you for the explanation about ab.
I tested both inside and outside the docker container with wrk.

Inside the docker container

I ran just 1 instance.

The test results of wrk and h2load both show that the performance of node-v20.3.0 is about 5% lower than that of node-v20.2.0.
Although not as drastic as ab's test results, the performance drop is still there.

Outside the docker container

Whether it is multi-instance or single-instance, the test results of wrk show that the performance of the two versions is similar.

Results

You may be right: there does indeed seem to be a Docker-related bug, and ab really isn't reliable.

I saw that libuv has been upgraded to 1.46.0, which fixes a Docker-related bug, and that the latest version of node (v20.4.0) also upgrades libuv to 1.46.0.

I'll be retesting our benchmarks on node-v20.4.0 to see if that fixes the slowdown.

Additional

Using top, I found that the CPU utilization of node v20.3.0 is nearly 80% higher than that of node v20.2.0 under full load. Is that expected? Maybe this is caused by the sq_thread introduced in libuv 1.45.0?
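
A quick way to check where the extra CPU goes is to look at per-thread usage (a sketch; the pgrep pattern is a placeholder, and on recent kernels the io_uring submission-queue polling thread is expected to show up with a name like iou-sqp-<pid>):

$ top -H -p "$(pgrep -n node)"             # per-thread view of the node process
$ pidstat -t -p "$(pgrep -n node)" 1 5     # or a non-interactive sample (sysstat)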

How I Tested

About the question

How do you know for sure? Did you bisect or are you guessing?

I rolled back commits on node v20.3.0 in the container. When I reached the commit that upgrades libuv to 1.45.0, performance changed dramatically between the builds before and after it, so I'm sure the libuv upgrade had a negative impact on our benchmark.

@bnoordhuis
Member

Thanks for testing, that's very helpful. I'm also very interested in the results for v20.4.0.

Using top, I found that the CPU utilization of node v20.3.0 is nearly 80% higher than that of node v20.2.0 under full load. Is that expected? Maybe this is caused by the sq_thread introduced in libuv 1.45.0?

That seems likely but can you check with perf record where the process spends its CPU time?

If that does indeed show it's the sqpoll thread, is it an option for you to test with a v6.x kernel?
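
A basic profiling pass could look like this (a sketch; <pid> and the sampling window are placeholders):

$ perf record -g -p <pid> -- sleep 30
$ perf report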

@Septa2112
Contributor Author

Performance of Node v20.4.0

On our benchmark, the performance of node-v20.4.0 is similar to that of node-v20.3.0, and both are much worse than node-v20.2.0. The libuv upgrade to 1.46.0 doesn't seem to help. Maybe there is another bug?

Perf Data of v20.2.0 and v20.3.0

From the perf data, we can see that io_sq_thread accounts for about 30% in v20.3.0.

Node v20.2.0

[perf flame graph screenshot: node-202]

Node v20.3.0

[perf flame graph screenshot: node-203]

@bnoordhuis
Member

Can you test with a newer kernel? I think you're looking at an io_uring performance bug in the 5.x kernels that's been fixed in newer kernels.

Another thing you may want to check is whether the profile looks different with perf record -e cpu-clock. You can also split kernel- and user-side CPU time with -e cycles:k and -e cycles:u (this also works with cpu-clock, task-clock, etc.)
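
Concretely, those variations might look like this (a sketch; <pid> is a placeholder):

$ perf record -e cpu-clock -g -p <pid> -- sleep 30   # timer-based sampling
$ perf record -e cycles:u -g -p <pid> -- sleep 30    # user-space cycles only
$ perf record -e cycles:k -g -p <pid> -- sleep 30    # kernel-space cycles only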

@Septa2112
Contributor Author

I have updated the kernel to v6.4.3 and am retesting everything. It will take a day or two; I will update my data as soon as possible.

@Septa2112
Contributor Author

After updating the kernel to v6.4.3, here are the latest results.

Worse Performance on benchmark

Inside the docker container - Results of wrk

Here I only tested 1 instance. These results were collected manually by running the server and wrk inside Docker.

With node-v20.2.0 as the baseline (100%), the relative performance is:

Version   Relative RPS
v20.2.0   100%
v20.3.0   86%
v20.4.0   87%

Outside the docker container

All results were tested and collected by wrk.

1 instance

The performance of v20.2.0, v20.3.0 and v20.4.0 is about the same; everything looks normal.

full instance (128 instances on our ICX server)

I keep the CPU close to full load and then start an additional instance.
With v20.2.0 as the baseline (100%), the relative performance is:

Version   Relative RPS
v20.2.0   100%
v20.3.0   82%
v20.4.0   78%

Abnormal CPU load

Results of Benchmark

I ran 1 instance outside the Docker container. The CPU utilization of node v20.4.0 and node v20.3.0 is similar, and both are still about 80% higher than node v20.2.0 on kernel v6.4.3, even though the performance of the three versions is very close.

Results of perf

I will update the results of perf later.

@bnoordhuis
Member

Thanks for the update. For my understanding: what does "full instance" (vis-a-vis "1 instance") mean?

@Septa2112
Contributor Author

Single/1 instance means 1 server and 1 client. Full instance corresponds to the number of cores of the machine; on our ICX there are 128 cores, so full instance means 128 servers and 128 clients. Single instance and full instance correspond to different CPU loads.
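
For illustration, a hypothetical sketch of what a full-instance run could look like (the instance count, ports, app entry point, and wrk arguments are placeholders; the real harness is internal):

#!/bin/sh
CORES=128
for i in $(seq 0 $((CORES - 1))); do
  PORT=$((3000 + i)) node app.js &                                   # one server per core, each on its own port
done
sleep 5                                                              # give the servers time to start
for i in $(seq 0 $((CORES - 1))); do
  wrk -t1 -c50 -d30 --latency "http://127.0.0.1:$((3000 + i))/" &    # one wrk client per server
done
wait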

@bnoordhuis
Member

Clear, thanks. Two additional questions if you don't mind:

  1. Is there a single node and a single wrk process or do you use worker_threads/child_process/cluster/etc. to spin up more threads or processes?

  2. With what arguments is wrk started and do you use a custom lua script?

@Septa2112
Contributor Author

  1. With 1 instance, there is only a single node process and a single wrk process.
  2. The wrk command is wrk -t1 -c50 -d30 --latency and I didn't use a custom lua script.

@Septa2112
Contributor Author

Septa2112 commented Jul 18, 2023

Perf Results of node-v20.4.0 and node-v20.2.0 on Kernel 6.4.3

It can be clearly seen that iou_sq_thread occupies a large proportion of the kernel-space CPU time.

perf record -e cycles:u xxxxx

  • node-v20.4.0: [perf screenshot]
  • node-v20.2.0: [perf screenshot]

perf record -e cycles:k xxxxx

  • node-v20.4.0: [perf screenshot]
  • node-v20.2.0: [perf screenshot]

@zo0ok

zo0ok commented Dec 15, 2023

I ran into this when upgrading from Node 18 to 20.10.0.

I have a microservice setup: about 20 Node.js processes that communicate with each other over HTTP, all on the same machine. They also do some reading and writing to disk. CPU utilization is usually very low, in production as well.

With Node 20.0 I started getting HTTP connection errors (timeouts, I suppose) between node processes. I quickly went back to 18.

I have an automated test suite that I run. On one of my development machines (Intel i5 4250U, Debian 12), the time command reports roughly 20s of user CPU and 3s of system CPU time to complete the tests with Node 18. On Node 20, however, I still get about 20s of user CPU time, but 40s of system time! Yes, more than 10x the system CPU time.

Node 18 - all versions good.
Node 20.0.0-20.3.0 - tests fail.
Node 20.3.1+ - tests pass, but with the system CPU time described above.

macOS/Windows 11 - 20.10.0 is slightly faster than 18, so this seems to be Linux-specific.

Please let me know if I can provide more useful information.

@Septa2112
Contributor Author

@zo0ok I'm not sure if your situation is the same as mine. As far as I know, the abnormal CPU utilization is related to io_uring when the node version is >= v20.3.0.

Referring to this comment #49937 (comment), you can try setting the environment variable UV_USE_IO_URING=0 and see if it improves things.
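
For example (server.js is a placeholder for your entry point):

$ UV_USE_IO_URING=0 node server.js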

@zo0ok

zo0ok commented Dec 18, 2023

@Septa2112 , yes UV_USE_IO_URING=0 helped, and Node 20 behaves as expected now. Thank you!

@devicenull

We hit this as well on a 5.10-era kernel; UV_USE_IO_URING=0 fixed it.
