Improve CPU/Memory metrics collection at Akka.Cluster.Metrics #4142

Closed
IgorFedchenko opened this issue Jan 14, 2020 · 23 comments

@IgorFedchenko (Contributor)

Introduction

Once #4126 is merged, we will need to improve the metrics collection implemented in DefaultCollector.

The basic idea is to collect:

  • Current CPU load on the node, in %
  • Current Memory usage, in %

Using that information, AdaptiveLoadBalancingRoutingLogic will calculate the availability of each node and perform "smart" routing.

Ideally, we should collect a more complete list:

  • CPU usage by current process
  • CPU total usage on the node machine (this will be used for routing)
  • Memory usage by current process (in bytes)
  • Memory total usage on the node machine
  • Memory available on the node machine (this, together with the previous one, gives total utilization in %)

What we have right now

CPU

There is no API available in netstandard2.1 to collect CPU metrics out of the box, like PerformanceCounters. So what we are doing now is using the Process.TotalProcessorTime property to get the CPU time dedicated to the current process. Combined with the total wall-clock time elapsed, this gives an estimate of CPU usage by the current process.

But for total CPU usage, this approach would require getting information about all processes with Process.GetProcesses() - which is very time consuming when there are lots of processes (especially since we have to deal with access violation exceptions here).
So total CPU usage is currently just the same as the current process CPU usage. This is more or less fine for routing based on .NET process load, but not ideal if there are other heavy processes running on the machine.
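
For reference, a minimal sketch of the delta-sampling idea described above (this is only an illustration, not the actual DefaultCollector code; the class and method names are made up):

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

// Estimate the current process's CPU usage by sampling Process.TotalProcessorTime twice
// and dividing the consumed CPU time by wall-clock time and logical processor count.
public static class CpuUsageSampler
{
    public static async Task<double> SampleCpuUsagePercentAsync(TimeSpan interval)
    {
        var process = Process.GetCurrentProcess();
        var startCpuTime = process.TotalProcessorTime;
        var stopwatch = Stopwatch.StartNew();

        await Task.Delay(interval);

        process.Refresh(); // re-read cached process information
        var cpuTimeUsed = process.TotalProcessorTime - startCpuTime;

        // Normalized to 0..100% across all logical processors.
        return cpuTimeUsed.TotalMilliseconds
               / (stopwatch.Elapsed.TotalMilliseconds * Environment.ProcessorCount)
               * 100.0;
    }
}
```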

Memory

The candidate list includes:

  • GC.GetTotalMemory to get the currently allocated managed memory size. There is also GC.GetGCMemoryInfo, which returns a struct with a TotalAvailableMemoryBytes property, but that method is only available starting with .NET Core 3.0, and we are targeting netstandard2.1

  • PerformanceCounters, which work under Windows, and there is a Mono implementation. There are some other Windows-only ways to get metrics.

  • Process class, which provides multiple memory-related properties

  • Using P/Invoke and working with native API

  • Parsing the output of OS-specific shell commands

Currently, we are using the cross-platform source available for netstandard2.1 - the Process class.

First issue

Same as for CPU: it is quite heavy to get information about all processes. So the current implementation treats MemoryUsage as the current process's usage, which is useful, but not ideal for node routing.

Second issue

Another issue is defining what "used" memory means, and getting "available" memory info.

To track unmanaged memory as well as managed, Process.PrivateMemorySize64 is used instead of GC.GetTotalMemory. It works well by itself, but it is hard to know the upper limit for this value, because it is not the physical memory allocated from RAM (see documentation).
Getting "available" memory is much trickier, and I did not find anything in the .NET Core SDK to get this value. Ideally we would get the size of the installed physical memory (or the available part of it in a cloud environment). So far, Process.VirtualMemorySize64 is used - but it is just the number of bytes in the virtual address space, and does not correlate much with the memory that is really available. Still, it is one of the upper bounds for available memory, and can be used to get a % of memory load (relative to other nodes).

In my understanding, the ideal would be to read the "Available MBytes" PerformanceCounter (but on all platforms) to get available memory, and to find some way to get the total installed physical memory. These two would allow us to compute % Used Memory on the node and perform routing - and we could expose all the various Process properties in addition, like WorkingSet, PrivateMemorySize64, and others.

Maybe there is some other convenient approach. The main idea here is that while the current used/available relation is Process.PrivateMemorySize64 / Process.VirtualMemorySize64, it is always in the range [0, 1] and reflects the memory load, so we can compare nodes based on it. But a value of 0.5 does not guarantee that there is any available memory on the node at all, so we need more accurate values to calculate a node's memory capacity.
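
For illustration, a rough sketch of the used/available relation described above (it mirrors the idea, not necessarily the exact DefaultCollector code):

```csharp
using System;
using System.Diagnostics;

var process = Process.GetCurrentProcess();
process.Refresh();

// "Used": private (managed + unmanaged) memory committed to this process.
long used = process.PrivateMemorySize64;
// "Available": virtual address space size - a very loose upper bound, not physical RAM.
long available = process.VirtualMemorySize64;

// Always in [0, 1], so nodes can be ranked against each other, but a value of 0.5
// does not mean half of the machine's physical memory is actually free.
double memoryLoad = (double)used / available;
Console.WriteLine($"memory load: {memoryLoad:P1}");
```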

@IgorFedchenko (Contributor, Author)

Also, in Scala they use the Sigar library, which seems to have bindings for .NET. @Aaronontheweb Should we port this? It would require users to have binaries of this library for their OS, but it may be a workable approach anyway.

@Aaronontheweb (Member)

Good to know @IgorFedchenko - I had no idea that they had .NET support.

Do you know if that library does anything that requires elevated permissions?

@IgorFedchenko (Contributor, Author)

Do you know if that library does anything that requires elevated permissions?

I can't find any particular permission requirements in the related articles; it seems the quickest way is to download the binaries from here and check myself (here are some code samples I found).

So we need to give it a try once we start working on this issue. Here is a nice wiki for the library.

@Aaronontheweb Aaronontheweb added this to the 1.4.1 and Later milestone Feb 14, 2020
@Aaronontheweb Aaronontheweb modified the milestones: 1.4.2, 1.4.3, 1.4.4 Mar 13, 2020
@Aaronontheweb Aaronontheweb modified the milestones: 1.4.4, 1.4.5 Mar 31, 2020
@Aaronontheweb Aaronontheweb modified the milestones: 1.4.5, 1.4.6 Apr 29, 2020
@Aaronontheweb Aaronontheweb modified the milestones: 1.4.6, 1.4.7 May 12, 2020
@Aaronontheweb Aaronontheweb modified the milestones: 1.4.7, 1.4.8 May 26, 2020
@Aaronontheweb Aaronontheweb modified the milestones: 1.4.8, 1.4.9 Jun 17, 2020
@Aaronontheweb Aaronontheweb modified the milestones: 1.4.9, 1.4.10 Jul 21, 2020
@Aaronontheweb Aaronontheweb modified the milestones: 1.4.10, 1.4.11 Aug 20, 2020
@Aaronontheweb Aaronontheweb modified the milestones: 1.4.11, 1.4.12 Nov 5, 2020
@Aaronontheweb Aaronontheweb modified the milestones: 1.4.12, 1.4.13 Nov 16, 2020
@Aaronontheweb Aaronontheweb modified the milestones: 1.4.13, 1.4.14 Dec 16, 2020
@Aaronontheweb Aaronontheweb modified the milestones: 1.4.43, 1.4.44 Sep 23, 2022
@Aaronontheweb Aaronontheweb modified the milestones: 1.4.44, 1.4.45 Oct 17, 2022
@Arkatufus (Contributor) commented Oct 18, 2022

A comparison between data collected using Akka.Cluster.Metrics and data collected using the dotnet built-in perf counters available in .NET 6.0:

Chart 1. Memory consumption (working set is not included because it makes the chart harder to read)
[chart image]

Chart 2. CPU usage
[chart image]

@Arkatufus (Contributor)

The CPU load measurements from Akka.Cluster.Metrics and the perf counters on Windows match quite closely.
The memory comparison, however, is quite confusing. What does GC.GetTotalMemory() actually measure?

@Aaronontheweb (Member)

@Arkatufus https://stackoverflow.com/a/7455860/377476

Basically, GC.GetTotalMemory only measures memory allocated onto the heap via the garbage collector - any unmanaged memory, stack memory, and so on can't be measured there. It might be better to use one of the other measures suggested in the SO answer I linked (i.e. Process.TotalWorkingSet) - but those may have accuracy issues as well.

@Arkatufus (Contributor)

This is what the graph would look like if I include the total working set (reported by the perf counter). The difference is quite big - it's twice the size.

[chart image]

@Aaronontheweb (Member)

What about what I suggested above, @Arkatufus? Avoiding perf counters is a good idea given that they aren't x-plat - we want some sort of abstraction that works on all supported runtimes.

@to11mtm (Member) commented Oct 18, 2022

What about what I suggested above, @Arkatufus? Avoiding perf counters is a good idea given that they aren't x-plat - we want some sort of abstraction that works on all supported runtimes.

FWIW, in the past I've found that querying Process and running the math on the times at a given sampling rate works well for CPU usage, together with a combination of the memory queries given in the SO post alongside GC stats. (It's pretty useful in some cases to see both - it has helped me debug at least one IIS-hosting issue.)

Actually, come to think of it, I frequently saw some of the working sets stay fairly constant depending on the environment, so the GC numbers still had a lot of value in seeing spikes, etc.

@Aaronontheweb (Member)

NodeMetrics.Types.Metric.Create(StandardMetrics.MemoryUsed, GC.GetTotalMemory(true)).Value,
- we should probably not be forcing full GC here each time we sample. No bueno. Just need to report on what current usage looks like, not have any side effects (plus this causes the current thread to block until GC is complete.)
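
For illustration, a minimal sketch of the difference between the two modes (forceFullCollection is the standard GC.GetTotalMemory parameter):

```csharp
using System;

// GC.GetTotalMemory(true) forces a full collection and blocks until it completes;
// GC.GetTotalMemory(false) just reports the current best estimate, with no side effects.
long withForcedGc = GC.GetTotalMemory(forceFullCollection: true);
long withoutGc = GC.GetTotalMemory(forceFullCollection: false);

Console.WriteLine($"after forced GC: {withForcedGc} bytes, without GC: {withoutGc} bytes");
```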

@Aaronontheweb (Member)

NodeMetrics.Types.Metric.Create(StandardMetrics.MemoryAvailable, process.VirtualMemorySize64).Value,
- this value measures only virtual memory. Definitely not what users are interested in - we should have three separate memory counters:

  1. Process.WorkingSet64 - allows us to capture total allocated physical memory. Doesn't measure utilization exactly, but utilization for busy processes will be correlated to allocation.
  2. GC.GetTotalMemory - allows us to capture how much memory is being used currently by .NET managed objects. For the majority users, this is the most practical measure of end-to-end utilization.
  3. Process.VirtualMemorySize64 - if this number is going up, your performance might go down. It means that more and more working set memory is being offloaded to disk - not necessarily a perf hit unless page faults also go up.

IMHO, we should probably just track WorkingSet64 and GC.GetTotalMemory - for routing purposes that's probably accurate enough.
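
A rough sketch of what sampling those two values could look like (the class and method names here are only illustrative, not the final StandardMetrics wiring; note the Process.Refresh call so cached counters are re-read each time):

```csharp
using System;
using System.Diagnostics;

public static class MemorySampler
{
    // Returns (working set, managed heap) in bytes.
    public static (long WorkingSetBytes, long ManagedHeapBytes) Sample()
    {
        var process = Process.GetCurrentProcess();
        process.Refresh(); // re-read cached process counters before sampling

        long workingSet = process.WorkingSet64;                           // physical memory allocated to the process
        long managedHeap = GC.GetTotalMemory(forceFullCollection: false); // managed heap estimate, no forced GC

        return (workingSet, managedHeap);
    }
}
```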

In .NET 6, as @Arkatufus pointed out in our call this morning, we can dual target and add support for the new x-plat runtime performance APIs: https://learn.microsoft.com/en-us/dotnet/core/diagnostics/available-counters

@Aaronontheweb (Member)

What's the difference between Process.WorkingSet64 and Environment.WorkingSet?

https://learn.microsoft.com/en-us/dotnet/api/system.environment.workingset?view=netstandard-2.0#system-environment-workingset

Both APIs are available in .NET Standard 2.0.


@Arkatufus (Contributor)

We can implement the latest Microsoft cross-platform performance metrics, EventCounters, and retrieve the System.Runtime counters in v1.5. We can't backport it to v1.4 because it's not available in .NET runtime 3.0 and below.
https://learn.microsoft.com/en-us/dotnet/core/diagnostics/event-counters
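
A minimal sketch of consuming those counters in-process via an EventListener (the payload keys like "Name" and "Mean" follow the EventCounters docs linked above; the listener class name is just illustrative):

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Tracing;

public sealed class RuntimeCountersListener : EventListener
{
    protected override void OnEventSourceCreated(EventSource source)
    {
        if (source.Name == "System.Runtime")
        {
            // Ask the runtime to publish counter values once per second.
            EnableEvents(source, EventLevel.Informational, EventKeywords.All,
                new Dictionary<string, string> { ["EventCounterIntervalSec"] = "1" });
        }
    }

    protected override void OnEventWritten(EventWrittenEventArgs eventData)
    {
        if (eventData.EventName != "EventCounters" || eventData.Payload == null)
            return;

        foreach (var payloadItem in eventData.Payload)
        {
            if (payloadItem is IDictionary<string, object> counter &&
                counter.TryGetValue("Name", out var name))
            {
                // Polling counters (e.g. "cpu-usage", "working-set", "gc-heap-size") report "Mean".
                counter.TryGetValue("Mean", out var mean);
                Console.WriteLine($"{name}: {mean}");
            }
        }
    }
}
```

Keeping an instance of the listener alive (e.g. `var listener = new RuntimeCountersListener();`) is enough to start receiving samples.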

@Arkatufus (Contributor)

I've used it to collect the comparison data to create the graphs above; it's a lot easier to use since we only have a single source of truth to get all our numbers from.

@Arkatufus (Contributor)

These are the graphs from WSL2, running with 4 virtual CPUs and 6 GB RAM:
[chart image]

[chart image]

[chart image]

@Aaronontheweb (Member)

CPU numbers look good x-plat on both of the tested platforms so far - and the memory tracking issues are consistently off on both platforms, which makes me think it's just a matter of calling Process.Refresh during sampling and using the right metrics values. This will be easier than we thought - the built-in metrics are actually pretty good for our purposes.

@Arkatufus (Contributor)

Here are the graphs after the changes; some values make sense, others just don't make any sense.

Graph 1, CPU usage, no code change:
[chart image]

Graph 2, Memory usage, data marked with (A) are from Akka.Cluster.Metrics:
[chart image]

  • (A) used: changed to not force GC
  • (A) available: changed from Process.VirtualMemorySize64 to Process.WorkingSet64
  • Added a new metric StandardMetrics.MemoryVirtual and set it to Process.VirtualMemorySize64, but it reads as 2 terabytes, so I'm not including it in the graph
  • Managed to record StandardMetrics.MaxMemoryRecommended, but it is of no use; Process.MaxWorkingSet always reads as 1.34 megabytes

Graph 3. Adjusted virtual memory. If I remove the excess number from Process.VirtualMemorySize64, it actually has some usable value.
[chart image]

@Arkatufus (Contributor)

Forgot to mention that I induced artificial memory pressure during the test - that's why the memory chart looks different.

@Arkatufus (Contributor)

I'm seeing weird behavior when I'm using Process.WorkingSet64 and GC.GetTotalMemory() inside the MNTR.
The working set is supposed to be the total memory allocated in physical memory, and GC total is supposed to be the total memory allocated to the GC heap (gen-0 + gen-1 + gen-2 + LOH + POH).
This result is consistent over multiple MNTR runs: the reported node 1 GC heap is always bigger than the working set, which isn't supposed to happen.

| Time | Node 1 WorkingSet | Node 1 GC.Total | Node 2 WorkingSet | Node 2 GC.Total | Node 3 WorkingSet | Node 3 GC.Total |
|------|-------------------|-----------------|-------------------|-----------------|-------------------|-----------------|
| 0 | 70.65 | 86.21 | 69.19 | 36.14 | 68.96 | 36.13 |
| 0.478 | 71.65 | 87.60 | 69.99 | 37.70 | 69.52 | 37.64 |
| 1.487 | 71.91 | 87.92 | 70.31 | 38.01 | 69.95 | 38.07 |
| 2.496 | 72.14 | 88.32 | 70.58 | 38.42 | 70.06 | 38.34 |
| 3.508 | 72.21 | 88.63 | 70.70 | 38.70 | 70.22 | 38.78 |
| 4.515 | 72.33 | 88.84 | 70.75 | 39.11 | 70.30 | 39.06 |
| 5.528 | 72.43 | 89.11 | 70.80 | 39.47 | 70.39 | 39.36 |
| 6.54 | 72.62 | 89.46 | 70.85 | 39.74 | 70.41 | 39.56 |
| 7.547 | 72.96 | 89.76 | 70.90 | 40.03 | 70.44 | 39.84 |
| 8.558 | 73.34 | 89.98 | 71.04 | 40.32 | 70.55 | 40.11 |
| 9.568 | 76.50 | 91.24 | 72.68 | 41.12 | 72.19 | 41.01 |
| 10.584 | 78.09 | 87.48 | 72.94 | 41.83 | 72.43 | 41.79 |

@Aaronontheweb (Member)

So the only reason our MNTR specs are passing today:

// Forcing garbage collection to keep metrics more resilent to occasional allocations
NodeMetrics.Types.Metric.Create(StandardMetrics.MemoryUsed, GC.GetTotalMemory(true)).Value,
// VirtualMemorySize64 is not best idea here...
NodeMetrics.Types.Metric.Create(StandardMetrics.MemoryAvailable, process.VirtualMemorySize64).Value,

process.VirtualMemorySize64 will always return 2.1 TB of memory - which is vastly higher than what most Akka.NET applications will have access to. It's not a good metric, in that it doesn't really approximate total physical memory usage on a system, but since that value is always going to be higher than what GC.GetTotalMemory reports, the tests will pass.

@Aaronontheweb (Member)

closed via #6203
