Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Fix/4849 add cluster stress spec #4899

Closed

Conversation

Aaronontheweb
Copy link
Member

No description provided.

@Aaronontheweb
Copy link
Member Author

Looks like I found a small bug in the MNTR here:

private IEnumerable<RoleName> RoleNames(Type specType)
        {
            var ctorWithConfig = FindConfigConstructor(specType);
            var configType = ctorWithConfig.GetParameters().First().ParameterType;
            var args = ConfigConstructorParamValues(configType);
            var configInstance = Activator.CreateInstance(configType, args);
            var roleType = typeof(RoleName);
            var configProps = configType.GetProperties(BindingFlags.Instance | BindingFlags.Public);
            var roleProps = configProps.Where(p => p.PropertyType == roleType && p.Name != "Myself").Select(p => (RoleName)p.GetValue(configInstance));
            var configFields = configType.GetFields(BindingFlags.Instance | BindingFlags.Public);
            var roleFields = configFields.Where(f => f.FieldType == roleType && f.Name != "Myself").Select(f => (RoleName)f.GetValue(configInstance));
            var roles = roleProps.Concat(roleFields).Distinct();
            return roles;
        }

This can only find RoleNames that have been declared as fields on an MNTR config - any code that dynamically returns a number of Roles, such as the StressSpec, will not run. I'll have to patch this also.

@Aaronontheweb
Copy link
Member Author

Not sure why the build failed on Linux here - didn't get much of an error message back.

@Aaronontheweb Aaronontheweb force-pushed the fix/4849-addClusterStressSpec branch 2 times, most recently from 93b3a47 to 67b703f Compare April 6, 2021 21:52
@Aaronontheweb
Copy link
Member Author

StressSpec is still unstable at the moment - flips on my machine. Suspect that there is a problem with Reachability or PhiAccrualFailureDetector that is responsible for some issues with cluster stability. Going to keep working that problem until I can get the flip rate down.

@@ -348,6 +348,7 @@ Target "MultiNodeTests" (fun _ ->

Target "MultiNodeTestsNetCore" (fun _ ->
if not skipBuild.Value then
setEnvironVar "akka.cluster.assert" "on" // needed to enable assert invariants for Akka.Cluster
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Asserts gossip invariants when the MNTR is run

Copy link
Member Author

@Aaronontheweb Aaronontheweb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some comments on these changes thus far - but I have some cleanup to do on the PR.

@@ -175,7 +175,7 @@ protected override void AfterTermination()

//TODO: ExpectedTestDuration?

void MuteLog(ActorSystem sys = null)
public virtual void MuteLog(ActorSystem sys = null)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needed to add this in order to subclass this behavior in other specs

@@ -62,13 +62,29 @@ public static Cluster Get(ActorSystem system)
return system.WithExtension<Cluster, ClusterExtension>();
}

static Cluster()
{
bool GetAssertInvariants()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Enabled assert invariants inside the Cluster when the "akka.cluster.assert" environment variable is set

@@ -649,11 +655,12 @@ public ImmutableHashSet<UniqueAddress> Receivers(UniqueAddress sender)
{
if (iter.MoveNext() == false || n == 0)
{
iter.Dispose(); // dispose enumerator
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We weren't properly disposing the enumerator from the HeartbeatNodeRing before.

? slice1
: take(remaining, NodeRing().Until(sender).Where(c => !c.Equals(sender)).GetEnumerator(), slice1).Item2;
? slice1 // or, wrap0around
: take(remaining, NodeRing().TakeWhile(x => x != sender).GetEnumerator(), slice1).Item2;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cleaned up the LINQ syntax here

else if (x.Uid == y.Uid) return 0;
else return 1;
return result;
var ha = x.Uid;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shifts the ring to primarily factor in the UID in the node ring comparisons, rather than the address. Cheaper and faster while remaining stable so long as the membership doesn't change.

/// Aggregated status of a subject node is defined as (in this order):
/// - Terminated if any observer node considers it as Terminated
/// - Unreachable if any observer node considers it as Unreachable
/// - Reachable otherwise, i.e. no observer node considers it as Unreachable
/// </summary>
internal class Reachability //TODO: ISerializable?
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't actually change anything in this file - just ReSharper'd it.

@@ -508,7 +508,6 @@ public int InitialParticipants
/// </summary>
public void RunOn(Action thunk, params RoleName[] nodes)
{
if (nodes.Length == 0) throw new ArgumentException("No node given to run on.");
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needed to allow the MNTR to call RunOn with no nodes

@@ -30,11 +31,11 @@ public DefaultFailureDetectorRegistry(Func<FailureDetector> factory)

private readonly Func<FailureDetector> _factory;

private AtomicReference<Dictionary<T, FailureDetector>> _resourceToFailureDetector = new AtomicReference<Dictionary<T, FailureDetector>>(new Dictionary<T, FailureDetector>());
private AtomicReference<ImmutableDictionary<T, FailureDetector>> _resourceToFailureDetector = new AtomicReference<ImmutableDictionary<T, FailureDetector>>(ImmutableDictionary<T, FailureDetector>.Empty);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cleaned up some potential mutability issues here.

Copy link
Member Author

@Aaronontheweb Aaronontheweb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some additional comments - this PR is ready for review now.

@@ -1964,8 +1965,11 @@ public ReceiveGossipType ReceiveGossip(GossipEnvelope envelope)
// for all new joining nodes we remove them from the failure detector
foreach (var node in _latestGossip.Members)
{
if (node.Status == MemberStatus.Joining && !localGossip.Members.Contains(node))
if (!localGossip.Members.Contains(node))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the major bugfix for #4849

/// </summary>
internal class Reachability //TODO: ISerializable?
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made no real changes to this class - just ReSharper'd it.

@@ -154,7 +154,7 @@ public State(HeartbeatHistory history, long? timeStamp)

private AtomicReference<State> _state;

private State state
internal State state
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made internal for debugging purposes

@Aaronontheweb
Copy link
Member Author

Replaced via #4940

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant