-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Akka.Cluster.Metrics extension implementation #4126
Conversation
Completed port of the extension code base. Need to add/port specs and verify it is working. And need to implement .NET metrics collector (since scala's are jvm-specific). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Long review so far, but I hope you find it helpful
src/contrib/cluster/Akka.Cluster.Metrics/Akka.Cluster.Metrics.csproj
Outdated
Show resolved
Hide resolved
src/contrib/cluster/Akka.Cluster.Metrics/Akka.Cluster.Metrics.csproj
Outdated
Show resolved
Hide resolved
src/contrib/cluster/Akka.Cluster.Metrics/Akka.Cluster.Metrics.csproj
Outdated
Show resolved
Hide resolved
/// Creates new <see cref="ClusterMetrics"/> for given actor system | ||
/// </summary> | ||
/// <param name="system"></param> | ||
public ClusterMetrics(ExtendedActorSystem system) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make this private
- force users to call the ClusterMetrics.Get(ActorSystem)
method so we can guarantee a maximum of one ClusterMetrics
instance per-ActorSystem
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made this internal
since it is used in ClusterMetricsExtensionProvider
.
/// <summary> | ||
/// DynamicAccess | ||
/// </summary> | ||
internal static class DynamicAccess |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So DynamicAccess
, if I am not mistaken, is an internal Scala / Java API that does what the Activator
does more or less.
If you want to use it so it makes a port easier, I'd move this into the core Akka module under the Akka/Utils
folder since it's not specific to Akka.Cluster.Metrics. Just a nitpick.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea. Also I will add Try
class to Akka.Utils
, since it is used by DynamicAccess
and also may be used somewhere else.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might also be worth getting rid of DynamicAccess
altogether and just creating some Activator
extension methods - more idiomatic that way. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mmm what do you mean by Activator
extension methods? Obviously is can not add custom static method like Activator.CreateInstanceFor<TSomeInterface>(string typeName)
. Either use (new Activator()).CreateInstanceFor
, or create another static class like ActivatorHelper
- but I named it DynamicAccess
to just simplify porting scala code in the future (sometimes one needs to check if some akka's class is already implemented, and search by name helps in such cases).
We can also add extension method for string, like typeName.CreateInstanceAs<TInterface>(params object[] args)
, but... It would be hard to find for other developers, and not obvious.
So since DynamicAccess.CreateInstanceFor
wraps getting type by name, then instantiating with Activator
, and handling exceptions (it returns Try<TResult>
instance which may contain error info), it might be useful when porting things in future. But I can rename it to something more intuitive - do you have any suggestions? I agree that nobody would look for DynamicAccess
class if only he is not porting dynamicAccess
class from scala at the moment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@IgorFedchenko let's make DynamicAccess
internal
and then keep it as-is - that way if we decided to clean up the API to make it more idiomatic with .NET in the future we can do that without worrying about user-facing issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To keep it in Akka.Utils
to make reusable in different modules marked it with [InternalApi]
and added INTERNAL USAGE
text into class summary.
/// the sampled value resulting from the previous smoothing iteration. | ||
/// This value is always used as the previous EWMA to calculate the new EWMA. | ||
/// </summary> | ||
public sealed partial class EWMA |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make this a value type (struct
) if it's going to be used primarily in scoped math operations and not frequently passed between actors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This class is partial
, and it may require more efforts to use separate EWMA
struct, because it is used by other partial
classes... So if it is not a significant improvement, maybe better to keep it as it is - do you agree?
/** | ||
* Defines a remote address. | ||
*/ | ||
message Address { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this type already defined somewhere else in our proto
folder? Let's just import that if that's the case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It turned out that yes, we have AddressData
message defined in protobuf/ContainerFormats.proto
(and generated as Akka.Remote.Serialization.Proto.Msg
package).
But for some reason generated c# AddressData
class is internal
there, and I can not reuse it.
Should I regenerate proto messages with public
modifier, or just have my own Address
message defined as we have now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add Akka.Cluster.Metrics as a friend assembly of Akka.Remote:
akka.net/src/core/Akka.Remote/Properties/AssemblyInfo.cs
Lines 24 to 35 in 4463fcb
[assembly: Guid("78986bdb-73f7-4532-8e03-1c9ccbe8148e")] | |
[assembly: InternalsVisibleTo("Akka.Remote.TestKit.Tests")] | |
[assembly: InternalsVisibleTo("Akka.Remote.TestKit")] | |
[assembly: InternalsVisibleTo("Akka.Remote.Tests")] | |
[assembly: InternalsVisibleTo("Akka.Remote.Tests.MultiNode")] | |
[assembly: InternalsVisibleTo("Akka.Remote.Tests.Performance")] | |
[assembly: InternalsVisibleTo("Akka.Cluster")] | |
[assembly: InternalsVisibleTo("Akka.Cluster.Tests")] | |
[assembly: InternalsVisibleTo("Akka.Cluster.TestKit")] | |
[assembly: InternalsVisibleTo("Akka.Cluster.Tests.MultiNode")] | |
[assembly: InternalsVisibleTo("Akka.Cluster.Tools")] | |
[assembly: InternalsVisibleTo("Akka.Cluster.Sharding")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And then remove the duplicate definition afterwards.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, almost forgot about this. Thanks for the tip, fixed now.
public static class StandardMetrics | ||
{ | ||
// Constants for the heap related Metric names | ||
public const string HeapMemoryUsed = "heap-memory-used"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@IgorFedchenko so this is where you're going to run into some fun platform portability issues - the CLR doesn't really have a singular "heap" in the same sense that the JVM does: https://stackoverflow.com/questions/10121943/net-process-memory-usage-5x-clr-heap-memory - it uses several different heaps that are set aside for specific purposes.
Heap memory in the JVM is used to measure the total footprint of all Java objects that have been allocated by a specific JVM process: https://www.yourkit.com/docs/kb/sizes.jsp - non-heap memory is used to measure things like native object allocation and etc.
So the term "heap memory" is not something that's going to be immediately clear to .NET users, since the CLR doesn't really assign that term any specific meaning. What I think end-users are interested in, is total memory allocated to the currently running process - and that value CAN fluctuate up and down as objects get garbage collected and the OS frees up memory that has been unused for a sufficiently long period of time.
What do you think? This is a platform difference we're going to need to reconcile somehow, and replacing "heap" memory with "process" memory, without distinguishing between managed object allocation and unmanaged object allocation (i.e. file handles, socket memory, etc...), probably gives .NET users the thing that they're really trying to monitor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some ways of collecting this information in .NET:
- The garbage collector: https://github.com/petabridge/NBench/blob/dev/src/NBench/Collection/Memory/GcTotalMemoryCollector.cs#L34
- Performance Counters (Windows only)
- Calling into native APIs via P/Invoke - would need different implementations for Windows vs. POSIX
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There might also be some new .NET Core instrumentation that is cross-platform. That's been on the to-do list for CoreCLR for some time now but I think they've made some progress against that since 2.2+
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, indeed - I will need to address this once will work on .NET metrics collector here, in this PR. I totally agree that we should look for something like total allocated memory, and will update collector's name (and metric's name as well) once will get clear understanding of what we are collecting at the end.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for hints about research directions, by the way - and did not know anything about JVM heap :)
@Aaronontheweb big thanks for a deep review! Updated code to reflect changes we discussed - and left questions about few moments. This was very helpful, and I made few notes for the future. In the meantime, added some specs and made them passing (which required some changes to ported code), and working on next specs. |
@IgorFedchenko have this failing in the MNTR: https://dev.azure.com/dotnet/Akka.NET/_build/results?buildId=27573&view=logs&j=f203efd3-114c-5599-f6b4-0fa25ffc9760&t=b10ae0da-ff8b-5c6e-60d8-f20e659bbc17&l=8341
|
@Aaronontheweb Can not reproduce this locally, and can not find anything about this in Azure logs - url you provided shows "Build not found" error, and didn't find such failures in last CI runs. But With that said, we could get real exception in our logs from |
@IgorFedchenko merged in the latest changes - going to re-run the test suite now. My guess is that it's a mismatched manifest in the serializer - sending one thing but trying to read another. |
Turned out that Also, disabled one MNTR spec phase where one of the nodes was allocating memory and we were asserting that this node will receive less messages. This is actually not true in MNTR tests, because all nodes share common system resources and most likely to have same memory capacity from the router point of view. Weird spec, skipped it for now. And tried to fix another racing spec with increasing timeout of waiting for cluster members to be up. Actually, as I can see, there were quite a few errors, so @Aaronontheweb thanks again for pointing me out to the logs output. |
That'll break the next time we generate the protobuf APIs. In that case I'd just create your own |
@Aaronontheweb Updated this shared I have a question about one last MNTR test failing, do not have an idea why it could pass locally but fail on CI. AwaitClusterUp(roles: _config.NodeList.ToArray()); And it fails, with something like this on each node:
But before this failure, there is a weird error:
So how is it possible that one node is entering next barrier while others are waiting for the first one? public void AwaitClusterUp(params RoleName[] roles)
{
// make sure that the node-to-join is started before other join
RunOn(StartClusterNode, roles.First());
EnterBarrier(roles.First().Name + "-started");
if (roles.Skip(1).Contains(Myself)) Cluster.Join(GetAddress(roles.First()));
if (roles.Contains(Myself))
{
AwaitMembersUp(roles.Length);
}
EnterBarrier(roles.Select(r => r.Name).Aggregate((a, b) => a + "-" + b) + "-joined");
} Do you have an idea why this could happen? Also, I would not say that this particular multi node test is very useful - it checks that extension is working at all, while other tests are also using this extension (so checking it is working), so I could skip the test. But just not like the idea that something is not working and I do not understand why. But not sure that this is related to |
lol.... I think I do. A fun but common distributed programming bug. That code assumes that the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Found a minor bug with the way this spec was written - not bug with the MNTR itself, just the way you were encoding the NodeList
on this PR using a HashSet<T>
. The built-in List<T>
for Roles
will solve this specific issue.
public readonly RoleName Node4; | ||
public readonly RoleName Node5; | ||
|
||
public IImmutableSet<RoleName> NodeList => ImmutableHashSet.Create(Node1, Node2, Node3, Node4, Node5); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, this is your bug.... The hashset may not list these in a consistent order.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh.... Exactly... They are unordered here... Missed this... Thanks!!
@@ -93,7 +91,7 @@ private async Task Should_collect_and_publish_metrics_and_gossip_them_around_the | |||
{ | |||
await AwaitAssertAsync(() => | |||
{ | |||
AwaitClusterUp(roles: _config.NodeList.ToArray()); | |||
AwaitClusterUp(Roles.ToArray()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Aaronontheweb Could you give me a hint for the future: what is the difference here? Isn't Roles
just same list of roles stored internally?
I see that I am using this Roles
in another spec that passes without issues, and maybe this will work here too - but why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Roles is something that gets populated automatically by the MultiNodeSpec
base class - the content of the list is ordered and populated by the base classes themselves. The issue was that the HashSet<T>
you were using could change the roles slightly since I think the HashCode
s were different on a process by process basis.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, or course, set does not guarantee same ordering... Great catch, @Aaronontheweb , thanks a lot!
@IgorFedchenko looks like we need API Approvals again here: https://dev.azure.com/dotnet/Akka.NET/_build/results?buildId=27752&view=ms.vss-test-web.build-test-results-tab&runId=1085134&resultId=100383&paneView=debug - this is due to the changes in Akka.Remote. |
Done |
This will close #2070 , and maybe some other related issues (like #2069)
The goal of this PR is to port Akka.Cluster.Metrics into Akka.NET. Maybe this will span into several pull requests, will see.
I will keep this in draft state until will have working specs for this (that is, once this will be ready for final reviews/merge).
I will start from implementing the core functions and metric collectors, and then will add adaptive load balancing routers.