Make JoinAsync and JoinSeedNodesAsync more robust by checking cluster UP status #6033

Arkatufus
Contributor

@Arkatufus Arkatufus commented Jul 5, 2022

Changes

  • JoinAsync and JoinSeedNodesAsync are now idempotent, always returning the same Task instance when called multiple times.
  • More robust memory clean-up
  • Add IsUp property to Cluster
  • Shortcut RegisterOnMemberUp and JoinAsync to immediately return if cluster is already up.
  • KISS and code reuse: both JoinAsync and JoinSeedNodesAsync share the same async instance.
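
The idempotency and shortcut behaviour listed above could look roughly like the following. This is a minimal sketch, not the code in this PR: `JoinCoordinator`, `OnSelfUp`, and the `startJoin` callback are hypothetical names, and the sketch uses only BCL types rather than Akka.NET's `Cluster` class.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical stand-in for the cluster extension; not Akka.NET's actual Cluster class.
public sealed class JoinCoordinator
{
    private readonly object _gate = new object();
    private TaskCompletionSource<bool> _pendingJoin; // null until the first join attempt
    private volatile bool _isUp;

    public bool IsUp => _isUp;

    // Returns the same Task for every call made while a join is in progress,
    // and an already-completed Task once the node is up.
    public Task JoinAsync(Action startJoin)
    {
        if (_isUp) return Task.CompletedTask;

        lock (_gate)
        {
            if (_isUp) return Task.CompletedTask;
            if (_pendingJoin != null) return _pendingJoin.Task;

            _pendingJoin = new TaskCompletionSource<bool>(
                TaskCreationOptions.RunContinuationsAsynchronously);
            startJoin(); // assumed to issue the join command; runs only once
            return _pendingJoin.Task;
        }
    }

    // Assumed to be called when the node observes itself as Up.
    public void OnSelfUp()
    {
        lock (_gate)
        {
            _isUp = true;
            _pendingJoin?.TrySetResult(true);
        }
    }
}
```

Callers that race on `JoinAsync` all await one shared task, so the join command is sent a single time regardless of how many callers there are.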

@Aaronontheweb
Member

> There is still no way to remove the cluster message listener actors; these listeners will persist until the Cluster is shut down.

By design, although the member-up actors could be programmed to shut themselves down after they've executed.

@Aaronontheweb
Member

> It is possible that variables inside these listener delegates go out of scope or are nullified, causing an NRE to be thrown, filling the log with confusing error messages and, if not guarded properly, propagating to the ClusterDaemon and killing it.

Unlikely with closures, but theoretically possible. I wouldn't worry about it - end-user's responsibility.

@Aaronontheweb Aaronontheweb added this to the 1.5.0 milestone Jul 6, 2022
Member

@Aaronontheweb Aaronontheweb left a comment

These changes don't look 100% safe to me

src/core/Akka.Cluster/Cluster.cs (6 review threads, 4 outdated)
@Arkatufus
Contributor Author

Changed how the state is handled, no longer using null as a marker.
Note that the code assumes that you can only join a cluster successfully once, just like leaving.

@Aaronontheweb @to11mtm, it would be great if you could re-review this code.

{
_isUp.GetAndSet(true);
// If there is an async join operation in progress, complete it.
_asyncJoinTaskSource?.Complete();
Member

This looks a -little- closer...

IMO, ideally we should only lock long enough to see whether an existing continuation is already set up and, if so, reuse it.

Does _asyncJoinTaskSource have a TryComplete to use instead?

Contributor Author

It uses TryComplete internally; the lock is there to guard against a possible null value.

Contributor Author

_asyncJoinTaskSource is actually a custom wrapper class around TaskCompletionSource to provide timeout with custom exception support.
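
For readers without the diff open, a wrapper of the kind described might be shaped like this. It is only a sketch under assumptions: the class name `TimeoutTaskSource`, its constructor, and the `onTimeout` factory are hypothetical, and the real wrapper in the PR may differ.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical wrapper; the name and members are illustrative, not the PR's actual class.
public sealed class TimeoutTaskSource : IDisposable
{
    private readonly TaskCompletionSource<bool> _tcs =
        new TaskCompletionSource<bool>(TaskCreationOptions.RunContinuationsAsynchronously);
    private readonly CancellationTokenSource _timeoutCts;
    private readonly CancellationTokenRegistration _registration;

    public TimeoutTaskSource(TimeSpan timeout, Func<Exception> onTimeout)
    {
        _timeoutCts = new CancellationTokenSource(timeout);
        // When the timer fires, fail the task with the caller-supplied exception.
        _registration = _timeoutCts.Token.Register(() => _tcs.TrySetException(onTimeout()));
    }

    public Task Task => _tcs.Task;

    // Safe to call more than once: only the first completion wins.
    public bool Complete() => _tcs.TrySetResult(true);

    public void Dispose()
    {
        _registration.Dispose();
        _timeoutCts.Dispose();
    }
}
```

Because `Complete()` delegates to `TrySetResult`, a late call after a timeout (or a second call after success) returns `false` instead of throwing.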

@Arkatufus Arkatufus requested a review from to11mtm July 19, 2022 16:47
@Aaronontheweb
Member

Here is how I would simplify this:

  1. Update MemberUp listeners for callbacks to terminate themselves once fired - no need to have them around forever once they've processed their event.
  2. Start a new one for each JoinAsync attempt - no need for shared state concurrency that way. They'll all get the same event regardless of who started the join.
  3. If we're already part of a cluster, just return a completed task (we're not going to handle edge cases where a user wants us to join a node different than the ones we're already in a cluster with - they can read the logs.) You can determine this by checking Cluster.SelfMember.
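
The three steps above can be sketched without any shared join state. This is a simplified model, not Akka.NET code: `MemberUpNotifier`, `SelfUp`, and `ListenerCount` are hypothetical names, and a plain list of callbacks stands in for the listener actors.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical model of the proposal; it does not use Akka.NET types.
public sealed class MemberUpNotifier
{
    private readonly object _gate = new object();
    private readonly List<Action> _listeners = new List<Action>();
    private bool _selfIsUp;

    public int ListenerCount { get { lock (_gate) return _listeners.Count; } }

    // Step 3: already a member, so hand back a completed task.
    // Step 2: otherwise each attempt gets its own listener and its own task source.
    public Task JoinAsync()
    {
        var tcs = new TaskCompletionSource<bool>(
            TaskCreationOptions.RunContinuationsAsynchronously);
        lock (_gate)
        {
            if (_selfIsUp) return Task.CompletedTask;
            _listeners.Add(() => tcs.TrySetResult(true));
        }
        return tcs.Task;
    }

    // Step 1: listeners are removed as soon as they have fired.
    public void SelfUp()
    {
        Action[] toFire;
        lock (_gate)
        {
            _selfIsUp = true;
            toFire = _listeners.ToArray();
            _listeners.Clear();
        }
        foreach (var listener in toFire) listener();
    }
}
```

Each caller gets a distinct task, yet all of them complete on the same event, which is the property the second step relies on.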

@Arkatufus
Contributor Author

I've checked the cluster status change listener actor; it's appropriately designed, so there is no problem there.

….com:Arkatufus/akka.net into cluster/fix_JoinAsync_and_JoinSeedNodesAsync
@Arkatufus Arkatufus changed the title from "Make JoinAsync and JoinSeedNodesAsync more robust by using an async state" to "Make JoinAsync and JoinSeedNodesAsync more robust by checking cluster UP status" Jul 25, 2022
Member

@Aaronontheweb Aaronontheweb left a comment

LGTM

_clusterDaemons.Tell(new InternalClusterAction.AddOnMemberRemovedListener(() => tcs.TrySetResult(null)));
_clusterDaemons.Tell(new InternalClusterAction.AddOnMemberRemovedListener(() =>
{
tcs.TrySetResult(null);
Member

LGTM

@Aaronontheweb Aaronontheweb enabled auto-merge (squash) July 27, 2022 16:21
@Aaronontheweb Aaronontheweb merged commit b1eb688 into akkadotnet:dev Jul 27, 2022
Aaronontheweb added a commit to Aaronontheweb/akka.net that referenced this pull request Feb 28, 2023
We used the `cluster.seed-node-timeout` property incorrectly in akkadotnet#6033 - that is a "how long did it take for me to hear back from a seed" setting, but we're using it like a "how much time do I have to join the cluster?" setting.

Added a function designed to give us at least 20s of leeway.
Aaronontheweb added a commit that referenced this pull request Feb 28, 2023
* fix timing regressions in `Cluster.JoinAsync` methods

We used the `cluster.seed-node-timeout` property incorrectly in #6033 - that is a "how long did it take for me to hear back from a seed" setting, but we're using it like a "how much time do I have to join the cluster?" setting.

Added a function designed to give us at least 20s of leeway.

* Update Cluster.cs

* fixed addr

* Fix timeout calculation

* Fix CancellationTokenSource

---------

Co-authored-by: Gregorius Soedharmo <[email protected]>
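
The follow-up commit's function is not shown in this thread, so the following is only one possible reading of "at least 20s of leeway": treat 20 seconds as a floor on the overall join deadline. `JoinDeadline` and `WithFloor` are hypothetical names, and the actual calculation in the commit may be different.

```csharp
using System;

// Hypothetical helper; the real function added by the follow-up commit is not shown here.
public static class JoinDeadline
{
    // Assumed floor, taken from the "at least 20s of leeway" wording in the commit message.
    public static readonly TimeSpan MinimumWindow = TimeSpan.FromSeconds(20);

    // Never allow less than MinimumWindow for the whole join, even when the
    // configured seed-node timeout is shorter.
    public static TimeSpan WithFloor(TimeSpan seedNodeTimeout)
        => seedNodeTimeout > MinimumWindow ? seedNodeTimeout : MinimumWindow;
}
```

Under this reading, a short seed-node timeout no longer cuts the join short, while a longer configured value is still honoured.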
3 participants