High CPU Load for idle clusters #5400

Closed
markusschaber opened this issue Nov 25, 2021 · 55 comments · Fixed by #7117
Comments

@markusschaber
Contributor

Version Information
reproduced with 1.4.27 and 1.4.28
Akka clustering, lighthouse

Describe the bug
Idle akka clusters burn too much CPU.

To Reproduce
Steps to reproduce the behavior:

  1. With Kubernetes, deploy https://github.com/petabridge/akkadotnet-cluster-workshop/blob/lesson5/k8s/lighthouse-deploy.yaml
  2. Increase the number of replicas and watch the CPU load per pod rise (e.g. using metrics server and "watch kubectl top pods --namespace=akka-cqrs").

Expected behavior
CPU load should be negligible (not exactly 0, as some cluster gossip is happening...)

Actual behavior
Even with 2 replicas, the CPU usage is rather high for an idle system. However, when increasing the number of replicas, the CPU usage per service also increases:

  • 2 replicas: 20-30 mCPU per instance
  • 4 replicas: 50-60 mCPU, with spikes up to 90
  • 8 replicas: 130-140 mCPU, lots of jitter between 110 and 180
  • 12 replicas: 160-190 mCPU per instance, with spikes over 200
  • 16 replicas: 180-210 mCPU
  • 20 replicas: 290-230 mCPU
  • 24 replicas: 175-210 mCPU, total CPU usage between 97 and 100%, machine saturated, "top" output: %Cpu(s): 15,6 us, 74,6 sy, 0,0 ni, 0,0 id, 0,0 wa, 0,0 hi, 9,8 si, 0,0 st
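
To put the list above into perspective, here is a rough back-of-the-envelope calculation, written in Python purely for brevity. It takes the midpoint of each reported range and multiplies by the replica count, assuming 1 CPU = 1000 mCPU and the 6-CPU VM described under "Environment". The 20-replica row is skipped because its range ("290-230") looks like a typo. The names `REPORTED`, `NODE_CAPACITY_MCPU` and `cluster_load` are made up for this sketch.

```python
# Per-instance CPU ranges (mCPU) as reported above. The 20-replica row is
# omitted because its range ("290-230") appears to be a typo in the report.
REPORTED = {
    2: (20, 30),
    4: (50, 60),
    8: (130, 140),
    12: (160, 190),
    16: (180, 210),
    24: (175, 210),
}

# Assumption: the 6-CPU test VM offers 6 * 1000 mCPU in total.
NODE_CAPACITY_MCPU = 6 * 1000


def cluster_load(replicas, low, high):
    """Return (midpoint per instance, total mCPU, share of the VM's capacity)."""
    midpoint = (low + high) / 2
    total = replicas * midpoint
    return midpoint, total, total / NODE_CAPACITY_MCPU


if __name__ == "__main__":
    for replicas, (low, high) in REPORTED.items():
        midpoint, total, share = cluster_load(replicas, low, high)
        print(f"{replicas:2d} replicas: ~{midpoint:.1f} mCPU each, "
              f"~{total:.0f} mCPU total, {share:.0%} of the VM")
```

By this estimate the idle pods alone account for roughly three quarters of the VM at 24 replicas, before any kubelet or kernel overhead is counted, which fits the saturated "top" output.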

Starting 50 replicas right away renders my Kubernetes cluster unusable; kubectl commands fail with various timeout errors.

Screenshots
Output of watch kubectl top pods --namespace=akka-cqrs with 16 replicas:

Every 2,0s: kubectl top pods --namespace=akka-cqrs

NAME            CPU(cores)   MEMORY(bytes)
lighthouse-0    189m         79Mi
lighthouse-1    190m         58Mi
lighthouse-10   186m         45Mi
lighthouse-11   191m         45Mi
lighthouse-12   192m         44Mi
lighthouse-13   187m         43Mi
lighthouse-14   198m         43Mi
lighthouse-15   192m         44Mi
lighthouse-2    189m         43Mi
lighthouse-3    201m         41Mi
lighthouse-4    193m         45Mi
lighthouse-5    186m         41Mi
lighthouse-6    175m         45Mi
lighthouse-7    182m         41Mi
lighthouse-8    184m         44Mi
lighthouse-9    182m         45Mi

Environment
Happens in different environments. The tests above were run in a VM running Ubuntu with 6 CPUs and 8 GB RAM, hosting a single-node Kubernetes cluster (microk8s installed via snap):
kubectl versions:
Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.4", GitCommit:"b695d79d4f967c403a96986f1750a35eb75e75f1", GitTreeState:"clean", BuildDate:"2021-11-18T02:34:11Z", GoVersion:"go1.16.10", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"22+", GitVersion:"v1.22.3-3+9ec7c40ec93c73", GitCommit:"9ec7c40ec93c73c2281bdd2e4a75baf6247366a0", GitTreeState:"clean", BuildDate:"2021-11-03T10:17:37Z", GoVersion:"go1.16.9", Compiler:"gc", Platform:"linux/amd64"}

Additional context
This can be a serious cost factor in environments that are billed by CPU usage, like some cloud services. In our case it's some test and dev environments which are configured "smallish" and burn through their "CPU burst quota" rather quickly.

This might be related to #4537 .

@markusschaber
Contributor Author

Profiling one of our services (not lighthouse) seems to imply that the DedicatedThreadPool and DotNetty are the main culprits:
image

More details on the DedicatedThreadPool:
image

@Aaronontheweb
Member

The ChannelExecutor mentioned in that PR might be a good candidate - I owe @Zetanova a re-review of it again.

@Aaronontheweb Aaronontheweb added this to the 1.4.29 milestone Nov 26, 2021
@Aaronontheweb
Member

#5390 might also help - alters the "waiting" mechanism used by the DedicatedThreadPool.

@markusschaber markusschaber changed the title from "High CPU Load for idle clustersd" to "High CPU Load for idle clusters" Nov 26, 2021
@Zetanova
Contributor

Zetanova commented Nov 26, 2021

@markusschaber You can use my https://github.com/Zetanova/Akka.Experimental.ChannelTaskScheduler
But you need to downgrade to Akka 1.4.21: because of some improvements to Ask, the cluster-startup extensions
are very racy at startup in later versions.

I made a few PRs to fix cluster startup and hope that it is fixed with 1.4.29.

My ChannelTaskScheduler does not reduce idle CPU all the way to zero (as it should be),
but it removes the scaling issue completely.
I currently run 16 nodes on k8s and every one of them idles at between 20m and 50m.

@Zetanova
Contributor

The other option would be to switch the dispatcher to the built-in but unused TaskPoolDispatcher,
but the cluster will down itself under heavy workload, because the cluster packets are processed too late.

@Aaronontheweb
Member

Aaronontheweb commented Nov 26, 2021 via email

@Aaronontheweb
Member

Aaronontheweb commented Nov 26, 2021 via email

@Zetanova
Contributor

@Aaronontheweb
https://github.com/Zetanova/PNet.Mesh UDP traffic, fully encrypted.
Not fully tested, and NAT+TURN is still missing.

Unit test traffic already goes through fully encrypted with crypto routing.

Another option we could start to think about is the multi-home problem
in akka.cluster, akka.remote and akka.discovery.
Found here: #4993

Currently Akka.Cluster even has a problem with normal DNS names.
It will already break the cluster
if some nodes are using: akka.tcp://node1:2334
and others: akka.tcp://node1.myNamespace.cluster.local:2334
and the rest: akka.tcp://node1.myNamespace:2334

@markusschaber
Contributor Author

@Zetanova I'll try to test on Monday. We're in a European time zone :-)

@Zetanova
Contributor

I have now tried the nightly build 1.4.29-betaX and it did fix the cluster startup problems.

@to11mtm
Member

to11mtm commented Nov 27, 2021

@Aaronontheweb https://github.com/Zetanova/PNet.Mesh UDP traffic, fully encrypted. Not fully tested, and NAT+TURN is still missing.

Unit test traffic already goes through fully encrypted with crypto routing.

Another option we could start to think about is the multi-home problem in akka.cluster, akka.remote and akka.discovery. Found here: #4993

Currently Akka.Cluster even has a problem with normal DNS names. It will already break the cluster if some nodes are using: akka.tcp://node1:2334 and others: akka.tcp://node1.myNamespace.cluster.local:2334 and the rest: akka.tcp://node1.myNamespace:2334

@Zetanova this looks pretty nice, it looks like it might be able to support TCP as well with some work? (Just thinking about environments where UDP might be an issue).

FWIW, building a Transport is deceptively simple, with one important caveat. @Aaronontheweb can correct my poor explanation here, but there's a point during handshaking where some of the inbound flows need to remain 'corked' while the AssociationHandle is being created. In the DotNetty transport this is handled by setting the channel's AutoRead to False and then back to True.
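
A toy illustration of the 'corking' idea, written in Python for brevity. This is not DotNetty or Akka.Remote code: the class `SketchChannel`, its `auto_read` flag and the frame names are all invented for this sketch. It only shows the general pattern of leaving arrived frames unread until the upper layer is ready for them.

```python
from collections import deque
from typing import Deque, List, Tuple


class SketchChannel:
    """Toy inbound channel with an auto_read switch (hypothetical, not DotNetty)."""

    def __init__(self) -> None:
        self.auto_read = True
        self._unread: Deque[bytes] = deque()   # frames that arrived but were not read yet
        self.delivered: List[bytes] = []       # frames handed to the upper layer

    def on_frame_arrived(self, frame: bytes) -> None:
        self._unread.append(frame)
        self._pump()

    def set_auto_read(self, enabled: bool) -> None:
        self.auto_read = enabled
        self._pump()                           # uncorking flushes the backlog in order

    def _pump(self) -> None:
        # Frames are only passed upwards while auto_read is switched on.
        while self.auto_read and self._unread:
            self.delivered.append(self._unread.popleft())


def demo() -> Tuple[List[bytes], List[bytes]]:
    channel = SketchChannel()
    channel.set_auto_read(False)               # cork while the handle is being created
    channel.on_frame_arrived(b"handshake-ack")
    channel.on_frame_arrived(b"first-user-message")
    seen_while_corked = list(channel.delivered)
    channel.set_auto_read(True)                # handle is ready: uncork
    return seen_while_corked, channel.delivered
```

Nothing reaches the upper layer while the channel is corked, and the backlog is delivered in arrival order once it is uncorked.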

@Zetanova
Contributor

@to11mtm UDP fits very well for Mesh/VPN and encryption.
TCP is good for streaming and not for single packets. That's why IKEv2, OpenVPN, WireGuard and P2P use UDP
or work best over UDP. There is nearly no "place" where UDP is not supported.
The idea with PNet.Mesh is to have a simple UDP socket with no other OS requirements
and crypto routing/addressing (not relying on IPv4 addresses).

WireGuard itself would be optimal for Akka too, but it has usage restrictions.
It's not easy, or even possible, to use WireGuard in Kubernetes, but a simple UDP socket is easy.

Akka is a purely message-based system and doesn't really need a persistent connection between each pair of nodes;
it only needs connectivity between nodes.
Maybe we can abstract it away.
I wrote up an idea for it here: #4993

@markusschaber
Contributor Author

markusschaber commented Nov 29, 2021

Hmm. When I build our real services against 1.4.29-beta637735681605651069, the load per service seems to be a bit lower, but still around 190 mCPU compared to 210 mCPU with 1.4.27.
So either my build went wrong, or the nightly does not help as much as I hoped.

(Are there nightly builds of lighthouse I could use to nail down where the difference is?)

I could not try the Experimental ChannelTaskScheduler yet. It seems there's no NuGet package available, and our Policy forbids copy-pasting 3rd party code into our projects, so I'll need to package it and host it on our internal NuGet Feed, which takes some time (busy with other work right now...)

@Aaronontheweb
Member

@markusschaber ah, my comment was for @Zetanova to resolve his startup issue with the ChannelDispatcher

@markusschaber
Contributor Author

Ah, I see... And it seems that there's quite some jitter in the mCPU usage; after some time, I also get phases with about 210 mCPU with the nightly...

@markusschaber
Contributor Author

markusschaber commented Nov 29, 2021

After hacking together a solution using the https://github.com/Zetanova/Akka.Experimental.ChannelTaskScheduler with 1.4.29-beta637735681605651069, it got considerably better. Running the same services with the default hocon for the ChannelTaskScheduler, the CPU usage is down to 60-90 mCPU, so this is around 1/2 to 1/3 of the original CPU usage.

@Aaronontheweb
Member

The times I've tested that, there have been some throughput tradeoffs - but on balance that might be the better trade for your use case.

In terms of replacing the DotNetty transport - I'd be interested in @Zetanova's ideas there and I have one of my own (gRPC transport - have some corporate users who rolled their own and had considerably higher throughput than DotNetty) that we can try in lieu of Artery, which is a much bigger project.

@markusschaber
Contributor Author

Thanks for your efforts. I'm looking forward to an official solution, which can be used in production code without bending compliance rules. :-)

@Aaronontheweb
Member

Thanks for your efforts. I'm looking forward to an official solution, which can be used in production code without bending compliance rules. :-)

Naturally - if @Zetanova is up for sending in a PR with the upgraded ChannelDispatcher as part of v1.4.29 I'd be happy to merge that in and make it an "official" dispatcher option even if it's not set as the default. Meaning, we'll accept and triage bug reports for it.

As for some alternative transports, I'd need to write up something lengthier on that in a separate issue but I'm open to doing that as well - even prior to Akka.NET v1.5 and Artery.

@markusschaber
Contributor Author

markusschaber commented Nov 29, 2021

Thank you very much!

As far as I can see, the main issue with the schedulers are the busy loops, things like Thread.Sleep(0) in tight loops seem to burn most of the CPU in our case. I might try to look into that on my own, and submit a pull request if anything valuable comes out.

If at all possible, I'd like to have something more like 10-20 mCPU per service if there's no traffic...

@Aaronontheweb
Member

Aaronontheweb commented Nov 29, 2021

As far as I can see, the main issue with the schedulers are the busy loops, things like Thread.Sleep(0) in tight loops seem to burn most of the CPU in our case. I might try to look into that on my own, and submit a pull request if anything valuable comes out.

"Expensive waiting" is a tricky problem - that and scaling the DedicatedThreadPool without reinventing the hill-climbing algorithm used by the managed thread pool go hand-in-hand. That's what the ChannelDispatcher does well: solves the mutually exclusive scheduling problem that the DedicatedThreadPool does by adding different priority work queues on top of the managed ThreadPool, so we still benefit from the hill-climbing algorithm scaling without suffering from the usual starvation problems that occur when everything runs on the same threadpool.

@markusschaber
Contributor Author

markusschaber commented Nov 29, 2021

I'm not sure whether busy waiting actually brings enough benefits, compared to just using a lock / SemaphoreSlim or similar primitives using the OS scheduler. (As far as I know, "modern" primitives like SemaphoreSlim already use optimized mechanisms like futexes and fine-tuned spinning under the hood.)
As far as I know, the main purpose of busy looping is to reduce the overhead and latency introduced by context switches in case another CPU fulfils the condition we're waiting for. However, Thread.Sleep(0) by definition introduces context switches. To my knowledge, OS schedulers are nowadays rather good at solving things like starvation and priority inversion, so trying to outsmart the OS might not be the optimal solution in all cases.
Checking the "Wait(TimeSpan)" Implementation in the "UnfairSemaphore", I'm not convinced that spinning 50 times through Thread.Sleep(0) on several Threads/CPUs in parallel is actually better than falling back to the SemaphoreSlim after 1 or 2 tries. Maybe the "UnfairSemaphore" could be improved to fine-tune the number of looping threads with the actual load, or it could just be replaced by a SemaphoreSlim directly for some workloads.
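
To make the comparison concrete, here is a small sketch in Python (chosen only for brevity, this is not the UnfairSemaphore code). It measures the process CPU time consumed while a single waiter sits idle for a while, once with a yield-and-poll loop, once with a bounded number of polls followed by a blocking wait, and once with a purely blocking wait. `time.sleep(0)` stands in for Thread.Sleep(0) as a rough analogue, and all function names are invented for the sketch.

```python
import threading
import time


def spin_wait(event: threading.Event) -> None:
    """Busy-wait: poll the flag and give up the time slice on every iteration."""
    while not event.is_set():
        time.sleep(0)  # rough analogue of Thread.Sleep(0): yield, then poll again


def hybrid_wait(event: threading.Event, max_spins: int = 2) -> None:
    """Poll a bounded number of times, then fall back to a blocking wait."""
    for _ in range(max_spins):
        if event.is_set():
            return
        time.sleep(0)
    event.wait()


def blocking_wait(event: threading.Event) -> None:
    """Park the thread until the primitive is signalled; no polling at all."""
    event.wait()


def cpu_seconds_while_idle(wait_fn, idle_seconds: float = 0.2) -> float:
    """Return the process CPU time used while one waiter has nothing to do."""
    event = threading.Event()
    waiter = threading.Thread(target=wait_fn, args=(event,))
    start = time.process_time()
    waiter.start()
    time.sleep(idle_seconds)  # no work arrives during this interval
    event.set()               # finally hand over some "work"
    waiter.join()
    return time.process_time() - start


if __name__ == "__main__":
    for fn in (spin_wait, hybrid_wait, blocking_wait):
        print(f"{fn.__name__:14s} {cpu_seconds_while_idle(fn):.4f} CPU-seconds")
```

The polling variant keeps consuming CPU for the whole idle interval, while the other two should stay close to zero. What the sketch cannot show is the wake-up latency that spinning is meant to buy, which is the other half of the tradeoff.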

Independently, one could argue that any starvation by using the normal thread pool is either a misconfiguration of the thread pool (not enough minimum threads), or a misuse of the thread pool (long running tasks should go to a dedicated thread, blocking I/O should be replaced by async, etc...). Whether that kind of reasoning is acceptable to your users is an entirely different question, and apparently, minds much smarter than me have to fight tricky thread starvation problems (see https://ayende.com/blog/177953/thread-pool-starvation-just-add-another-thread or StephenCleary/AsyncEx#107 (comment) for examples...) - there's a reason one of our services had a line like ThreadPool.SetMinThreads(500, 500); in the startup code for some time... (Btw, according to Microsoft's documentation, those 500 threads are still created "on demand" (just instantly when there's no free thread available), so if 20 threads are enough to saturate the workload, no more threads will ever be created.)

@Aaronontheweb
Member

Independently, one could argue that any starvation by using the normal thread pool is either a misconfiguration of the thread pool (not enough minimum threads), or a misuse of the thread pool (long running tasks should go to a dedicated thread, blocking I/O should be replaced by async, etc...)

In our case, the issue is simple: /system tasks, such as Akka.Cluster heartbeats, have real-time processing requirements - i.e. they fail if not responded to within N seconds. Large ThreadPool work queues that don't allow workload prioritization natively make it difficult for us to uphold those across busy systems where /user workloads are application-dependent and unknown to us. Therefore, we needed a generalizable solution for prioritizing some workloads over others that would work across hundreds of thousands of different use cases. Starvation occurs at the "task in queue" level - given the other items queued for execution, the system wasn't able to service that task in time for it to meet its requirements and keep the cluster available.
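
As a minimal illustration of the "task in queue" point, here is a Python sketch of a work queue with two priority levels. It is not how the DedicatedThreadPool or the ChannelDispatcher is implemented; `SYSTEM`, `USER` and `PriorityWorkQueue` are names made up for the example. It only shows that a heartbeat enqueued behind a backlog of user work still runs first when the queue knows about priorities.

```python
import heapq
import itertools
from typing import Callable, List, Tuple

# Hypothetical priority labels: a lower number is served first.
SYSTEM, USER = 0, 1


class PriorityWorkQueue:
    """Single queue that always hands out the highest-priority task next."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, Callable[[], str]]] = []
        self._sequence = itertools.count()  # tie-breaker: FIFO within one priority

    def enqueue(self, priority: int, task: Callable[[], str]) -> None:
        heapq.heappush(self._heap, (priority, next(self._sequence), task))

    def run_all(self) -> List[str]:
        """Drain the queue on the calling thread and collect the task results."""
        results = []
        while self._heap:
            _, _, task = heapq.heappop(self._heap)
            results.append(task())
        return results


def demo() -> List[str]:
    queue = PriorityWorkQueue()
    for i in range(3):
        queue.enqueue(USER, lambda i=i: f"user-{i}")  # backlog of /user work
    queue.enqueue(SYSTEM, lambda: "heartbeat")        # arrives last
    return queue.run_all()
```

With a plain FIFO queue the heartbeat would wait behind all three user tasks. Here it is served first, and the user tasks keep their original order.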

Of the solutions we tried years ago (i.e. Akka.NET 1.0-1.1), separating the workloads at the thread level was what offered the highest throughput in exchange for the least amount of total complexity. Fine-tuning how that DedicatedThreadPool does its job, either in terms of how it scales (it's statically allocated based on vCPU counts now) or in terms of how it waits for work when idle, could yield better results and would certainly be of interest.

Our job itself isn't so simple - the prioritization has to be handled somewhere; delegating everything to the ThreadPool without it has yielded poor results in busy systems historically. Additionally, the idle CPU vs. throughput tradeoff has historically been won by "throughput" in terms of "what do users care most about?" so that's primarily what's driven our development efforts there, therefore there is probably a lot of low-hanging fruit that could be picked to help optimize it. Offering choices for different use cases (i.e. ChannelDispatcher for users that prefer low resource consumption) or simply putting in the work to reduce idle CPU (see #4031) are both good options for mitigating the issue.

@markusschaber
Contributor Author

Hmm, having a closer look at the DedicatedThreadPool, it says:

It prefers to release threads that have more recently begun waiting, to preserve locality.

Maybe we could just solve this problem with some kind of "Stack" of "SemaphoreSlim" or similar, so we just wake up one thread at a time - the most recent waiter being the one on top of the stack. On the other hand, I'm not really sure whether the implied definition of "locality" really fits modern "big iron" hardware which requires NUMA awareness etc. for best results. I see a contradiction between "the more CPUs we have, the bigger the chance that another CPU will queue some work while we poll" and "the more CPUs we have, the less likely the thread which most recently has begun waiting is actually on the right CPU (or close to it in NUMA sense)."

Of course, this usually does not apply to "small" machines like single-socket desktop machines, but on those, it's also less likely that another CPU can queue other work when all CPUs are busy polling on the UnfairSemaphore. ;-)
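
The "stack of semaphores" idea can be sketched roughly as follows, in Python for brevity. This is only an illustration of the wake-up order, not a proposal for the actual DedicatedThreadPool: `LifoWaitStack` and `demo_wake_order` are invented names, and a real implementation would also have to remember releases that arrive while nobody is waiting.

```python
import threading
import time
from typing import List


class LifoWaitStack:
    """Wakes exactly one waiter per release, most recent waiter first."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._waiters: List[threading.Semaphore] = []

    def wait(self) -> None:
        gate = threading.Semaphore(0)   # private gate for this waiter
        with self._lock:
            self._waiters.append(gate)  # push: this thread is now the newest waiter
        gate.acquire()                  # block without polling

    def release(self) -> bool:
        with self._lock:
            if not self._waiters:
                return False            # simplification: the signal is dropped
            gate = self._waiters.pop()  # pop: the newest waiter
        gate.release()
        return True

    def waiting(self) -> int:
        with self._lock:
            return len(self._waiters)


def demo_wake_order(count: int = 3) -> List[int]:
    stack = LifoWaitStack()
    woken: List[int] = []
    threads: List[threading.Thread] = []

    def worker(index: int) -> None:
        stack.wait()
        woken.append(index)

    for index in range(count):
        thread = threading.Thread(target=worker, args=(index,))
        thread.start()
        while stack.waiting() <= index:  # wait until this worker is on the stack
            time.sleep(0.001)
        threads.append(thread)

    for thread in reversed(threads):     # each release wakes the newest waiter
        stack.release()
        thread.join()
    return woken
```

Each release wakes a single thread, so there is no herd of pollers, and the worker that went to sleep last is the first one to get work again.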

@Aaronontheweb
Member

I'm not sure whether busy waiting actually brings enough benefits, compared to just using a lock / SemaphoreSlim or similar primitives using the OS scheduler. (As far as I know, "modern" primitives like SemaphoreSlim already use optimized mechanisms like futexes and fine-tuned spinning under the hood.)

I bet we could parameterize https://github.com/akkadotnet/akka.net/blob/dev/src/benchmark/Akka.Benchmarks/Actor/PingPongBenchmarks.cs to switch between DedicatedThreadPool and the default ThreadPool so you could measure the impact of these DedicatedThreadPool changes on throughput.

How would you create a benchmark to measure idle CPU? I've wondered about that in the past, but without firing up an external system like Docker and collecting system metrics on a Lighthouse instance I'm not sure how to automate that.
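
One possible direction, sketched in Python for brevity: sample the process's own CPU time over a wall-clock interval while nothing is happening and convert the ratio to mCPU, with no external tooling involved. This is only a sketch of the measurement idea, not an existing Akka.NET benchmark. The names `sample_mcpu` and `idle_cpu_within_budget` are invented, and a .NET version would presumably read something like Process.TotalProcessorTime instead of `time.process_time()`.

```python
import time


def sample_mcpu(interval_seconds: float = 1.0) -> float:
    """Return this process's average CPU usage over the interval, in mCPU.

    1000 mCPU corresponds to one fully used core. CPU time of all threads in
    the process is included, so background pollers show up in the reading.
    """
    cpu_start = time.process_time()
    wall_start = time.perf_counter()
    time.sleep(interval_seconds)  # the system under test is left idle here
    cpu_used = time.process_time() - cpu_start
    wall_elapsed = time.perf_counter() - wall_start
    return 1000.0 * cpu_used / wall_elapsed


def idle_cpu_within_budget(budget_mcpu: float, samples: int = 3,
                           interval_seconds: float = 0.2) -> bool:
    """Check several samples against a budget, e.g. as a regression gate."""
    readings = [sample_mcpu(interval_seconds) for _ in range(samples)]
    return max(readings) <= budget_mcpu


if __name__ == "__main__":
    print(f"idle usage: {sample_mcpu(1.0):.1f} mCPU")
```

A test along these lines would start the system, let it settle, and then fail if the idle reading exceeds a chosen budget. The budget and the settle time are judgment calls, and a single in-process measurement would not reproduce the multi-node scaling effect reported in this issue.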

@Aaronontheweb Aaronontheweb added this to the 1.4.39 milestone May 25, 2022
@Aaronontheweb Aaronontheweb modified the milestones: 1.4.39, 1.4.40 Jun 1, 2022
@Aaronontheweb Aaronontheweb modified the milestones: 1.4.40, 1.4.41 Jul 27, 2022
@Aaronontheweb Aaronontheweb modified the milestones: 1.4.41, 1.4.42 Sep 7, 2022
@Aaronontheweb Aaronontheweb modified the milestones: 1.4.42, 1.4.43, 1.4.44 Sep 23, 2022
@Aaronontheweb Aaronontheweb modified the milestones: 1.4.44, 1.4.45, 1.4.46 Oct 17, 2022
@Aaronontheweb Aaronontheweb modified the milestones: 1.4.46, 1.4.47 Nov 15, 2022
@Aaronontheweb Aaronontheweb modified the milestones: 1.4.47, 1.4.48 Dec 9, 2022
@Aaronontheweb Aaronontheweb modified the milestones: 1.4.48, 1.4.49 Jan 5, 2023
@Aaronontheweb Aaronontheweb modified the milestones: 1.4.49, 1.4.50 Jan 26, 2023
@Aaronontheweb Aaronontheweb removed this from the 1.4.50 milestone Mar 15, 2023
@Aaronontheweb Aaronontheweb added this to the 1.5.18 milestone Mar 11, 2024
@markusschaber
Contributor Author

I just noticed this issue has been fixed in the latest release. Nice! 👍

@Aaronontheweb
Member

I just noticed this issue has been fixed in the latest release. Nice! 👍

You can thank the .NET team for that one - we no longer need our dedicated thread pool, which is where the high CPU utilization was coming from.
