add non-default config that allows InboundQuarantineCheck to ignore 'harmless' quarantine events #1555
Conversation
Without an active SBR, no node will be shut down: it is the SBR that downs itself when receiving `ThisActorSystemQuarantinedEvent`:

```scala
"eliminate quarantined association when not used (harmless=true)" in withAssociation {
  (remoteSystem, remoteAddress, _, localArtery, localProbe) =>
    // event to watch out for, indicator of the issue
    remoteSystem.eventStream.subscribe(testActor, classOf[ThisActorSystemQuarantinedEvent])

    val remoteEcho =
      remoteSystem.actorSelection("/user/echo").resolveOne(remainingOrDefault).futureValue
    val localAddress = RARP(system).provider.getDefaultAddress
    val localEchoRef =
      remoteSystem
        .actorSelection(RootActorPath(localAddress) / localProbe.ref.path.elements)
        .resolveOne(remainingOrDefault)
        .futureValue
    remoteEcho.tell("ping", localEchoRef)
    localProbe.expectMsg("ping")

    val association = localArtery.association(remoteAddress)
    val remoteUid = futureUniqueRemoteAddress(association).futureValue.uid

    localArtery.quarantine(remoteAddress, Some(remoteUid), "HarmlessTest", harmless = true)
    association.associationState.isQuarantined(remoteUid) shouldBe true

    eventually {
      // trigger sending a message from remote to local, which will cause local
      // to wrongfully notify remote that it is quarantined
      remoteEcho.tell("ping", localEchoRef)
      // this is what remote emits when it learns it is quarantined by local;
      // this is not correct, and is what (with SBR enabled) triggers killing the node
      expectMsgType[ThisActorSystemQuarantinedEvent]
    }
}
```
I added the new test case, but I'm aware that it needs to be moved to the cluster or cluster-tests project and the Split Brain Resolver added. I'm busy on other tasks, so don't expect me to get back to this for a while.
What would moving the test to the cluster or cluster-tests projects add? To me this is a bug of the
I've added a change to `InboundQuarantineCheck` based on #578 (comment). This may not be the best solution, but it seems to help in this one test case.
It looks good to me like that, thank you!
@raboof @mdedetrich @jrudolph what do you think about the runtime change? We could add a config so that users can control whether the new runtime check is enabled.
I finally got a chance to review this change.
The fact that 'harmless' quarantines are not propagated, but are returned in `InboundQuarantineCheck`, indeed looks like a bug.
I think there's a good possibility that this caused the scenario in #578 and that it'd be worth releasing this new behavior. I'd even lean towards making the new behavior the default, but I'm also OK with being conservative and first testing with the reporters of #578.
Whether this fix completely prevents situations like the one described in #578 is harder to say with confidence: there are still quite a few situations that can cause non-harmless quarantines, and it's possible there are still situations where a single misbehaving node may 'take out' its surrounding peers.
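To make the bug discussed above concrete, here is a minimal sketch of the idea under review. This is hypothetical illustration code, not the actual Pekko source: it assumes a quarantine record carrying a `harmless` flag (consistent with the `QuarantinedTimestamp` signature changes filtered below) and a guard deciding whether the quarantined peer should be notified. The names `shouldNotifyQuarantinedPeer` and `propagateHarmlessEvents` are invented for this sketch.

```scala
object InboundQuarantineSketch {
  // Hypothetical stand-in for the quarantine record: the PR's MiMa filters
  // suggest its constructor signature changed, e.g. to carry a `harmless` flag.
  final case class QuarantinedTimestamp(nanoTime: Long, harmless: Boolean = false)

  // Sketch of the guard: only "talk back" to the quarantined peer for
  // non-harmless quarantines, unless propagation is explicitly enabled.
  def shouldNotifyQuarantinedPeer(
      quarantined: Option[QuarantinedTimestamp],
      propagateHarmlessEvents: Boolean): Boolean =
    quarantined match {
      case Some(q) => propagateHarmlessEvents || !q.harmless // harmless quarantines stay local
      case None    => false // not quarantined: nothing to notify
    }
}
```

Under this reading, a harmless quarantine no longer causes the remote side to emit `ThisActorSystemQuarantinedEvent`, which is what an SBR reacts to by downing the node.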
```scala
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.pekko.remote.artery.AssociationState#QuarantinedTimestamp.this")
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.pekko.remote.artery.AssociationState#QuarantinedTimestamp.apply")
ProblemFilters.exclude[IncompatibleSignatureProblem]("org.apache.pekko.remote.artery.AssociationState#QuarantinedTimestamp.unapply")
ProblemFilters.exclude[MissingTypesProblem]("org.apache.pekko.remote.artery.AssociationState$QuarantinedTimestamp$")
```
This class is in a `private[remote]` context, so indeed these changes are safe.
I think you could use a wildcard:

```scala
ProblemFilters.exclude[Problem]("org.apache.pekko.remote.artery.AssociationState#QuarantinedTimestamp*")
```
Thanks for the review. I was running the tests a few weeks ago and I'm not sure that they work. @fredfp tested this a few weeks ago and said it helped, but I don't think this is ready yet; the new tests still need to be improved.
Update OutboundIdleShutdownSpec.scala
A workmate told me about this, but since I'm not using the cluster at work, thanks for taking care of this hard task. I know a team using an akka/pekko cluster with a centralized node server (we called it vipserver) that decides, as a single source of truth, which nodes belong to the same cluster, instead of gossip, where a wrong message spreading through the cluster can take the whole cluster down. Another reason I'm not using clustering at work is the chaos monkey: the SRE team schedules random network partitions, and I think recovering from those will always require a reboot by the application owner; I am a little lazy here.
The tests seem to be working for me today. If reviewers are amenable, we could merge this to the main and 1.1.x branches and document that the config exists to enable an experimental fix for #578.
```
pekko.remote.artery.propagate-harmless-quarantine-events = off
```
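As a usage sketch only (the key is taken from the line above; the comment reflects my reading of this thread, and whether `off` is the shipped default is not confirmed here), opting into the experimental behavior from `application.conf` might look like:

```
pekko.remote.artery {
  # Experimental fix discussed in #578 / #1555: when off, 'harmless' quarantine
  # events are not propagated back to the quarantined peer by InboundQuarantineCheck,
  # so an SBR on that peer is not wrongly triggered into downing the node.
  propagate-harmless-quarantine-events = off
}
```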