
Use multiple cores by the API #224

Closed
alexey-yarmosh opened this issue Nov 7, 2022 · 14 comments · Fixed by #322

@alexey-yarmosh
Member

Currently we have a 4-core CPU.
Node effectively uses only one of them.
We should consider options to load the idle cores: cluster, worker_threads, etc.

[screenshot: server CPU usage]

@patrykcieszkowski
Contributor

I'm not seeing the core usage on this screenshot, but generally speaking there's no benefit to using worker_threads for I/O operations, and that's 90% (if not more) of what this API service is doing. I would consider spawning new threads for CPU-intensive work, like hashing, encryption, or parsing, and leaving I/O operations to the main event loop.
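A minimal sketch of that split, assuming ESM and a hypothetical CPU-bound SHA-256 task (the file names and payload are illustrative, not from this codebase); in practice a worker pool such as piscina would avoid paying thread startup costs per task:

```js
// main.js - I/O stays on the main event loop; CPU-bound work is offloaded.
import { Worker } from 'node:worker_threads';

function hashInWorker (payload) {
  return new Promise((resolve, reject) => {
    const worker = new Worker(new URL('hash-worker.js', import.meta.url), {
      workerData: payload,
    });
    worker.once('message', resolve);
    worker.once('error', reject);
  });
}

// hash-worker.js - runs on a separate thread, so a slow hash
// doesn't block the event loop that serves I/O.
import { parentPort, workerData } from 'node:worker_threads';
import { createHash } from 'node:crypto';

parentPort.postMessage(createHash('sha256').update(workerData).digest('hex'));
```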

@alexey-yarmosh
Member Author

worker_threads is only one of the ideas to consider. The key here is running 4 API instances on a 4-core machine to increase I/O throughput, and that most likely should be done either by Node's cluster module or, at a higher level, by Docker Swarm.
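For reference, a minimal sketch of the cluster variant, assuming the app's bootstrap lives in a hypothetical ./server.js (the cluster module shares the listening socket and distributes incoming connections across workers):

```js
import cluster from 'node:cluster';
import { cpus } from 'node:os';

if (cluster.isPrimary) {
  // Fork one worker per core; the primary only supervises.
  for (let i = 0; i < cpus().length; i++) {
    cluster.fork();
  }

  cluster.on('exit', (worker) => {
    console.log(`worker ${worker.process.pid} died, forking a replacement`);
    cluster.fork();
  });
} else {
  // Each worker runs a full API instance.
  await import('./server.js'); // hypothetical bootstrap module
}
```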

@jimaek
Member

jimaek commented Nov 10, 2022

> high level - by docker swarm.

Not possible, we use host networking, meaning only one container can run on a VM at any moment. It would also probably increase pub/sub traffic and make syncing of probes more difficult. So ideally we want to move only specific operations to other threads, not run the whole API per thread.

@alexey-yarmosh
Member Author

We don't have intensive sync work to move off the main thread. So when 2 VMs are no longer enough, we should choose between adding new VMs and running an instance per thread.

@MartinKolarik
Member

MartinKolarik commented Nov 10, 2022

cluster seems like the right thing to do; we just need to check how the messages are routed. We want them always delivered to only one client (app instance), not all clients.

@alexey-yarmosh
Member Author

alexey-yarmosh commented Mar 9, 2023

The key thing here is to set up the socket.io communication. Currently the only way to implement it is by using @socket.io/redis-adapter in every worker. While that works correctly, it means Redis pub/sub traffic will increase dramatically:
currently - 1 fetchSockets call with 2 instances of 1 worker each generates 2 Redis messages;
with 4 workers - 1 fetchSockets call with 2 instances of 4 workers each generates 8 Redis messages.

As Redis is our current bottleneck, I am not sure we should proceed with the solution described above, at least until #269 is implemented.

I've posted a question about combining @socket.io/redis-adapter with @socket.io/cluster-adapter; maybe that is doable. More details can be found there: socketio/socket.io#4659
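For context, the per-worker setup in question follows the standard @socket.io/redis-adapter pattern; a minimal sketch (the Redis URL is a placeholder):

```js
import { createServer } from 'node:http';
import { Server } from 'socket.io';
import { createClient } from 'redis';
import { createAdapter } from '@socket.io/redis-adapter';

const pubClient = createClient({ url: 'redis://localhost:6379' }); // placeholder URL
const subClient = pubClient.duplicate();
await Promise.all([pubClient.connect(), subClient.connect()]);

const io = new Server(createServer());
io.adapter(createAdapter(pubClient, subClient));

// With N adapter instances subscribed, each fetchSockets() call fans out
// over Redis pub/sub to all of them - which is why going from 1 to 4
// workers per instance multiplies the message count.
const sockets = await io.fetchSockets();
```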

@alexey-yarmosh
Member Author

Linked issue:
socketio/socket.io-cluster-adapter#1

@MartinKolarik
Member

The fetchSockets() calls are cached now, aren't they? The effect should not be that significant.

@alexey-yarmosh
Member Author

That is true, it is cached. And I don't think the effect will be that big.

But anyway, our servers usually run at ~15% load even under perf tests, while Redis barely handles the load. So we shouldn't make the bottleneck even narrower just to tune the API servers.

@alexey-yarmosh
Member Author

alexey-yarmosh commented Mar 21, 2023

So what we can see in the image is that currently our bottleneck is the performance of the Node process itself (Redis CPU is ~3%, while API CPU is ~100%).
[screenshot: API vs. Redis CPU usage]

From the flamegraph we can see that there are some sync operations done mostly by npm packages:

  • socket.io request/response encoding/decoding
  • JSON.stringify-ing of Redis data before sending it to the DB
  • HTTP validation
  • a lot of small newrelic operations (not sure what they are)

[screenshot: flamegraph]

We can't get rid of most of these operations, but we should benefit from doing them in parallel, so applying cluster to the API seems to make sense now.

Other possible optimizations:

  • the IPv4 check is worth caching (it is the heaviest part of the HTTP validation) - see the sketch after this list
  • Redis operations are already pipelined (there is an internal queue, and its size grows and shrinks as we create new redis.json.set promises). I'll additionally check that it is used in all cases.
  • maybe we can benefit from Redis Lua scripts to move some set logic from Node to Redis.
  • socket.io perf doesn't seem to be a problem (the API's CPU is at 100% while there are ~4000 RPS going through socket.io; even old benchmarks showed that socket.io can deal with ~10000 RPS).
  • there are a lot of newrelic operations; as I remember, the API itself sends logs to NR, so that can probably be changed, but seemingly not much more than that.
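A sketch of the caching idea from the first bullet, using net.isIPv4 as a stand-in for the real (heavier) validation; the wrapper name and cache bound are illustrative:

```js
import { isIPv4 } from 'node:net';

const cache = new Map();
const MAX_ENTRIES = 10_000; // illustrative bound

function isIpV4Cached (value) {
  if (cache.has(value)) {
    return cache.get(value);
  }

  const result = isIPv4(value); // stand-in for the heavier validation
  if (cache.size >= MAX_ENTRIES) {
    // Evict the oldest entry; Map iterates in insertion order.
    cache.delete(cache.keys().next().value);
  }

  cache.set(value, result);
  return result;
}
```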

P.S.
Redis load generated by mtr (5 RPS * 4 seconds * 500 probes):
5 req * 4 seconds * (3 + 3 + 1 + 1 + 1 + 500 * (1 + 2 + 4 * 2 + 1 + 1 + 2) + 4) = 150260 Redis writes
150260 / 20 seconds of measurement time = ~7500 Redis operations per second.

Socket.io load generated by mtr (5 RPS * 4 seconds * 500 probes):
5 req * 4 seconds * 500 probes * (1 request + 1 ack + 4 progress + 1 result) = 70000 socket.io messages
70000 / 20 seconds of measurement time = 3500 socket.io messages per second.

@MartinKolarik
Member

Can you explain the 500 * (1 + 2 + 4 * 2 + 1 + 1 + 2) part? Does each probe make 15 writes per second per test (since this is multiplied by requests and time in the beginning)?

@alexey-yarmosh
Member Author

alexey-yarmosh commented Mar 21, 2023

Not 15 writes per second, just 15 writes per probe per measurement. So it is ~150000 writes overall, and since such measurements usually take ~20 seconds to finish, it is 150000 / 20 = 7500 writes per second.

#240 (comment)

@MartinKolarik
Member

Oh ok, so the first part "5 req * 4 seconds" means 5 RPS for 4 seconds = 20 measurements overall, right?

@alexey-yarmosh
Member Author

Yeah, that is the case when CPU gets to 100% after a few mtr requests.
