Multicast Join #848
Replies: 12 comments 5 replies
-
I'm afraid this is not the right way to approach the problem. AFAIR, JGroups doesn't allow for this: the join request goes to the coordinator only, so it will always be a unicast. Let's back up a little: do you really require UDP multicast?
-
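Hi Neal, long time no see (if you're the allweatherinc-Neal :-))! I'm not sure I understand your question. Let me explain how a JOIN works:
* The joiner P runs a discovery protocol, at the end of which it gets the address of the coordinator A
* P then sends a unicast JOIN-REQ to A
* After adding P to the view, A sends a unicast JOIN-RSP back to P, followed by a multicast VIEW to all (or the other way round)
If IP multicasting doesn't work in one direction, then the JOIN-REQ/RSP should still work, so P should get the new view containing itself. The VIEW multicast may or may not fail, depending on the direction in which IP multicast fails. You say that sending the requests as multicasts will 'alleviate the problem'; this contradicts your original statement, leaving me slightly confused...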
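To make the JOIN sequence easier to follow, here is a minimal sketch in plain Java. It is a toy model, not JGroups code: the class `JoinSequenceSketch`, its `Outcome` type and the `coordMulticastWorks` flag are all made up for illustration, and it assumes that discovery has already succeeded and that unicast always works.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the JOIN steps between joiner P and coordinator A; not JGroups code. */
final class JoinSequenceSketch {

    /** What P and the existing members end up knowing after one join attempt. */
    static final class Outcome {
        final boolean joinerHasView;   // P received the JOIN-RSP carrying the new view
        final boolean membersHaveView; // the multicast VIEW reached the other members
        final List<String> trace;      // human-readable list of the steps taken

        Outcome(boolean joinerHasView, boolean membersHaveView, List<String> trace) {
            this.joinerHasView = joinerHasView;
            this.membersHaveView = membersHaveView;
            this.trace = trace;
        }
    }

    /**
     * Runs one join attempt.
     * @param coordMulticastWorks assumption: whether multicasts sent by A are delivered
     */
    static Outcome join(boolean coordMulticastWorks) {
        List<String> trace = new ArrayList<>();
        trace.add("P: discovery done, coordinator is A");
        trace.add("P -> A: unicast JOIN-REQ");
        trace.add("A: adds P to the view");
        trace.add("A -> P: unicast JOIN-RSP (new view)");
        boolean joinerHasView = true; // unicast is assumed to work in this model
        trace.add(coordMulticastWorks
                ? "A -> all: multicast VIEW delivered"
                : "A -> all: multicast VIEW lost");
        return new Outcome(joinerHasView, coordMulticastWorks, trace);
    }

    public static void main(String[] args) {
        for (boolean works : new boolean[] {true, false}) {
            Outcome o = join(works);
            System.out.println("coordinator multicast works = " + works);
            for (String step : o.trace)
                System.out.println("  " + step);
            System.out.println("  joiner has view: " + o.joinerHasView
                    + ", members have view: " + o.membersHaveView);
        }
    }
}
```

In this model the joiner always ends up with the view, because both JOIN messages are unicasts; only the VIEW multicast depends on the direction in which multicast is broken.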
-
So you want to use multicast to fail-fast, rather than retrying, but doesn't the unicast req-rsp work?
-
I'm certainly not going to change the way joining a new member works. Sending a multicast when a unicast is enough is overkill. Think about the ramifications if the transport is not UDP, e.g. TCP. This would require N-1 messages to be sent rather than 1.
Also, if multicast doesn't work in one direction, wouldn't your system be affected in other parts, too?
Having said that, I could easily see a protocol which turns a unicast JOIN-REQ into a multicast one, and a unicast JOIN-RSP into a multicast as well. A receiver would drop a JOIN-REQ unless it is the current coordinator. The receiver of a JOIN-RSP would drop it, too, unless it is the joiner.
This is fairly simple to implement, but I wouldn't want it in JGroups, because it is very specific to your environment. Speaking of which, IMO the better solution would be for Cisco to fix their switch issues with multicasting.
In addition, this would not require changes in the JOIN logic.
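The protocol idea described above (rewrite on the way down, filter on the way up) could look roughly like the sketch below. It is plain Java and deliberately does not use the JGroups API: `MulticastJoinFilterSketch`, `Msg`, `down`, `accept` and `physicalSends` are hypothetical names, and members are identified by simple strings. The only JGroups convention it borrows is that a null destination means "send to all members".

```java
/** Toy model of a "multicast JOIN" filter. All types are hypothetical, not JGroups classes. */
final class MulticastJoinFilterSketch {

    enum Type { JOIN_REQ, JOIN_RSP, OTHER }

    /** A message; dest == null means "send to all members". */
    static final class Msg {
        final Type type;
        final String sender;
        final String dest;     // null = multicast
        final String intended; // original unicast target, kept so receivers can filter

        Msg(Type type, String sender, String dest, String intended) {
            this.type = type;
            this.sender = sender;
            this.dest = dest;
            this.intended = intended;
        }
    }

    /** Down path: turn a unicast JOIN-REQ/JOIN-RSP into a multicast and remember its target. */
    static Msg down(Msg m) {
        boolean joinMsg = m.type == Type.JOIN_REQ || m.type == Type.JOIN_RSP;
        if (!joinMsg || m.dest == null)
            return m; // everything else passes through unchanged
        return new Msg(m.type, m.sender, null, m.dest);
    }

    /** Up path: true if the local member should process the message, false to drop it. */
    static boolean accept(Msg m, String local, String coordinator) {
        if (m.dest != null)
            return true; // genuine unicasts are not filtered
        switch (m.type) {
            case JOIN_REQ: return local.equals(coordinator); // only the coordinator handles joins
            case JOIN_RSP: return local.equals(m.intended);  // only the joiner handles the response
            default:       return true;
        }
    }

    /** Physical sends needed for one cluster-wide message in a cluster of n members. */
    static int physicalSends(boolean ipMulticast, int n) {
        return ipMulticast ? 1 : n - 1; // without IP multicast (e.g. TCP): one send per other member
    }

    public static void main(String[] args) {
        Msg req = down(new Msg(Type.JOIN_REQ, "P", "A", null));
        for (String member : new String[] {"A", "B", "C"})
            System.out.println(member + " accepts JOIN-REQ: " + accept(req, member, "A"));
        System.out.println("sends over TCP, 10 members: " + physicalSends(false, 10));
    }
}
```

The `physicalSends` helper only restates the cost argument from the comment: on a transport without IP multicast, each rewritten JOIN message costs N-1 sends instead of 1.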
-
Hello Bela,
It's me.
Understood how JOIN works. For us, the last step is “the other way around”, but I don’t think that matters.
If we were to do it in “core JGroups”, I was thinking of using an @Property (like use_mcast_xmit in NAKACK). That way it would only affect those who wanted the “fail fast” functionality. I think the impact would be minimal.
At present the network “sorts itself out”, but the JOINer keeps retrying, causing VIEW churn. By using multicast as a “quality gate” we hope to eliminate that.
Regards,
Neal
-
There are a few scenarios, but in our case the heartbeat protocol (which extends FD_ALL2 and is similar) will knock out anyone who is not heartbeating (or whose multicasts cannot reach the coordinator). However, that happens only after they are in the VIEW, hence the churn. There are other scenarios as well.
Note that we do not run FD_SOCK, as it just exacerbates the issue since the unicast sockets work when multicast does not.
Oh, and to head off another question: We are currently running 3.6.9-Final “Patch 7”, meaning we have patched it seven times with our changes or changes from newer JGroups. We will be upgrading though. I do not think the situation we are seeing will change on 5.3.x.
Also, it is worth noting that the issue is rare, but JGroups (and the overall system) is stable enough that we are trying to work around anything that can cause an issue – regardless of the real cause.
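The churn described above can be illustrated with a small counting model. This is a sketch in plain Java under simplifying assumptions, not JGroups code: `ViewChurnSketch` and `viewChanges` are made-up names, only the joiner-to-coordinator multicast direction is modelled, and each failed heartbeat check is assumed to lead directly to exclusion followed by a new join attempt.

```java
/** Toy model of VIEW churn caused by one member whose multicasts do not arrive; not JGroups code. */
final class ViewChurnSketch {

    /**
     * Counts the view changes caused by a single member that tries to join up to `attempts` times.
     * @param multicastReachesCoord whether the member's multicasts (heartbeats) reach the coordinator
     * @param requireMulticastJoin  hypothetical gate: the JOIN-REQ must arrive as a multicast
     */
    static int viewChanges(int attempts, boolean multicastReachesCoord, boolean requireMulticastJoin) {
        int changes = 0;
        boolean member = false;
        for (int i = 0; i < attempts && !member; i++) {
            boolean joinAccepted = !requireMulticastJoin || multicastReachesCoord;
            if (!joinAccepted)
                continue;      // join fails fast, no view is installed
            changes++;         // view with the new member added
            member = true;
            if (!multicastReachesCoord) {
                changes++;     // member is suspected for missing heartbeats and removed
                member = false;
            }
        }
        return changes;
    }

    public static void main(String[] args) {
        System.out.println("healthy network:        " + viewChanges(3, true, false));
        System.out.println("broken, unicast join:   " + viewChanges(3, false, false));
        System.out.println("broken, multicast gate: " + viewChanges(3, false, true));
    }
}
```

With three attempts, the model gives one view change on a healthy network, six with a unicast join over broken multicast (join and exclusion each time), and none when the hypothetical gate rejects the join up front.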
-
I agree that multicast in place of unicast is overkill. As mentioned, I was thinking an @Property would be the place for this, with it defaulting to off. Comments would make the purpose clear. It should have minimal impact on the overall code, and I would not expect it to affect the actual JOIN logic. I’ll look at the alternative of adding a new protocol; perhaps that will work just as effectively.
Speaking of protocols, we have several we need to contribute back. I’ll start a separate thread for that so it can be determined which ones should be added.
Yes, I think it would be great if Cisco could fix their switches and routers to work consistently and properly with multicast. It turns out they are not the only ones: Peplink and Ubiquiti might have issues as well. Perhaps we should have a forum to post which vendors have issues that they haven’t fixed. Then we could also list those that fix issues quickly when they are shown a problem (like HP/Aruba and Fortinet), or even vendors whose implementations have always worked. But I digress. In general, we don’t get to pick the networking hardware, so we have to do our best with what is there.
Regards,
Neal
-
If multicast does not ALWAYS work in your environment then you cannot
use it. You already know the network is broken, so reconfiguring
JGroups to not use multicast is the solution.
Your proposal adds complexity to only detect one VERY specific use case
where multicast works exactly one way while a node is joining. You'll
still have problems when it works during the join but then fails later
(which is actually quite common with misconfigured Cisco switches). And
you still have problems if the joining node's outgoing multicast is not
working as it will form its own cluster. So it doesn't fully fix the
problem either.
Detecting broken multicast at run-time in general has limited use, as
the only real fix is to either fix the network or reconfigure JGroups
not to use it, both of which are manual actions outside of JGroups' control.
-Dennis
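The two cases mentioned above that a join-time check does not cover can be sketched as follows. This is a plain Java toy model, not JGroups code: `JoinGateLimitsSketch`, `Result` and `run` are hypothetical names, and the model assumes multicast-based discovery, so a joiner whose outgoing multicast is broken gets no discovery responses at all.

```java
/** Toy model of what a join-time multicast check can and cannot catch; not JGroups code. */
final class JoinGateLimitsSketch {

    enum Result { JOINED_AND_STABLE, JOINED_THEN_SUSPECTED, SINGLETON_CLUSTER }

    /**
     * @param outgoingMulticastOk per-tick flag: does the joiner's outgoing multicast work?
     *                            Tick 0 is the join, later ticks are heartbeats.
     */
    static Result run(boolean[] outgoingMulticastOk) {
        if (outgoingMulticastOk.length == 0)
            throw new IllegalArgumentException("need at least the join tick");
        // Assumption: discovery is multicast-based, so a lost discovery request yields no
        // responses and the joiner becomes coordinator of its own one-member cluster.
        if (!outgoingMulticastOk[0])
            return Result.SINGLETON_CLUSTER;
        // The join passed the check, but nothing protects against a later failure:
        // the first missed heartbeat gets the member suspected.
        for (int tick = 1; tick < outgoingMulticastOk.length; tick++) {
            if (!outgoingMulticastOk[tick])
                return Result.JOINED_THEN_SUSPECTED;
        }
        return Result.JOINED_AND_STABLE;
    }

    public static void main(String[] args) {
        System.out.println(run(new boolean[] {true, true, true}));
        System.out.println(run(new boolean[] {true, true, false}));
        System.out.println(run(new boolean[] {false, true, true}));
    }
}
```

The second and third calls correspond to the two scenarios in the comment: multicast that works during the join and fails later, and a joiner whose outgoing multicast never worked.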
-
Dennis,
Yes, it is a very specific scenario. And the goal is simply to protect against VIEW churn and related issues. It is OK that the host with the problem gets isolated – that is better than continuous VIEW changes. When the problem occurs after JOIN the host gets SUSPECTed and knocked out for lack of heartbeat – then that host would try to reJOIN and get segregated.
Regards,
Neal
-
Hi Neal
-
Hi Bela,
Sorry for the slow response; customer meetings and a vacation got in the way.
Firstly, JGroups has never crashed any planes on our watch, and we aren't going to start any time soon. We do still use a patched 3.6.9 for the reasons you mention, but the changes I am referring to will be on 5.x. The upgrade is overdue.
My concern with a separate protocol is just one of compatibility: I would hate for an upgrade to break it. But I do understand the other perspective as well.
Speaking of additional protocols, we have a few that may be desirable to add to the project. I'll create a separate discussion for that.
Regards,
-
Hi Neal [1] https://github.com/jgroups-extras/upgrade-jgroups
-
We occasionally run into situations where, for no reason related to JGroups, multicast does not function in one direction or another. This can cause, in our environment, a host to join and leave the group over and over (for a few possible reasons).
If multicast were required from the joiner for initiation and acknowledgement of the join (meaning packets in both directions), that problem would be alleviated and the join would fail (which it should). Is it already possible to configure the stack to require a multicast join as described? Failing that, would it be better to make that an option on the existing protocols or is that option "special" enough that subclassing would be preferred (mostly a question for Bela, I guess)?
Regards,
Neal