Resolved - this was caused by petabridge/akkadotnet-bootstrap#128
Akka.Cluster.SBR vs. Akka.Cluster SplitBrainResolver demo - from Akka.NET v1.4.17 and later
This repository demonstrates the lack of robustness of the old Akka.NET cluster split brain resolvers compared to the new ones introduced into Akka.NET in v1.4.17.
So the classic SBRs, which have been a part of Akka.NET for years - let's run the classic "keep majority" SBR strategy, which we can configure in Lighthouse via Akka.Bootstrap.Docker's environment variable substitution:
spec:
serviceName: lighthouse
replicas: 2
selector:
matchLabels:
app: lighthouse
template:
metadata:
labels:
app: lighthouse
spec:
terminationGracePeriodSeconds: 35
containers:
- name: lighthouse
image: petabridge/lighthouse:1.5.1
lifecycle:
preStop:
exec:
command: ["/bin/sh", "-c", "pbm 127.0.0.1:9110 cluster leave"]
env:
- name: ACTORSYSTEM
value: AkkaTrader
- name: POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: CLUSTER_IP
value: "$(POD_NAME).lighthouse"
- name: CLUSTER_SEEDS
value: akka.tcp://$(ACTORSYSTEM)@lighthouse-0.lighthouse:4053,akka.tcp://$(ACTORSYSTEM)@lighthouse-1.lighthouse:4053,akka.tcp://$(ACTORSYSTEM)@lighthouse-2.lighthouse:4053
- name: AKKA__CLUSTER__DOWNING_PROVIDER_CLASS
value: "Akka.Cluster.SplitBrainResolver, Akka.Cluster"
- name: AKKA__CLUSTER__SPLIT_BRAIN_RESOLVER__ACTIVE_STRATEGY
value: "keep-majority"
livenessProbe:
tcpSocket:
port: 4053
ports:
- containerPort: 4053
protocol: TCP
Let's see what type of Akka.NET cluster forms when we run the following command:
kubectl apply -f ./lighthouse-oldsbr.yml
Using some Petabridge.Cmd
CLI commands to inspect the state of the cluster, here is what we see immediately after initial formation on node 1:
PS> kubectl -n lightouse-sbr1 exec lighthouse-0 -- pbm cluster show
akka.tcp://[email protected]:4053 | [lighthouse] | up | | 1.4.0
Count: 1 nodes
And then on node 2:
PS> kubectl -n lightouse-sbr1 exec lighthouse-1 -- pbm cluster show
akka.tcp://[email protected]:4053 | [lighthouse] | up | | 1.4.0
Count: 1 nodes
So what we have here is a split brain. How did that happen?
Logs from node 1:
[Docker-Bootstrap] IP=lighthouse-0.lighthouse
[Docker-Bootstrap] PORT=4053
[Docker-Bootstrap] SEEDS=["akka.tcp://[email protected]:4053","akka.tcp://[email protected]:4053","akka.tcp://[email protected]:4053"]
[Lighthouse] ActorSystem: AkkaTrader; IP: lighthouse-0.lighthouse; PORT: 4053
[Lighthouse] Performing pre-boot sanity check. Should be able to parse address [akka.tcp://[email protected]:4053]
[Lighthouse] Parse successful.
[INFO][04/02/2021 17:25:27][Thread 0001][remoting (akka://AkkaTrader)] Starting remoting
[INFO][04/02/2021 17:25:27][Thread 0001][remoting (akka://AkkaTrader)] Remoting started; listening on addresses : [akka.tcp://[email protected]:4053]
[INFO][04/02/2021 17:25:27][Thread 0001][remoting (akka://AkkaTrader)] Remoting now listens on addresses: [akka.tcp://[email protected]:4053]
[INFO][04/02/2021 17:25:27][Thread 0001][Cluster (akka://AkkaTrader)] Cluster Node [akka.tcp://[email protected]:4053] - Starting up...
[INFO][04/02/2021 17:25:27][Thread 0001][Cluster (akka://AkkaTrader)] Cluster Node [akka.tcp://[email protected]:4053] - Started up successfully
Press Control + C to terminate.
[INFO][04/02/2021 17:25:27][Thread 0019][akka.tcp://[email protected]:4053/user/petabridge.cmd] petabridge.cmd host bound to [0.0.0.0:9110]
[WARNING][04/02/2021 17:25:27][Thread 0013][akka.tcp://[email protected]:4053/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2FAkkaTrader%40lighthouse-2.lighthouse%3A4053-2/endpointWriter] AssociationError [akka.tcp://[email protected]:4053] -> akka.tcp://[email protected]:4053: Error [Association failed with akka.tcp://[email protected]:4053] []
[WARNING][04/02/2021 17:25:27][Thread 0012][akka.tcp://[email protected]:4053/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2FAkkaTrader%40lighthouse-1.lighthouse%3A4053-1/endpointWriter] AssociationError [akka.tcp://[email protected]:4053] -> akka.tcp://[email protected]:4053: Error [Association failed with akka.tcp://[email protected]:4053] []
[WARNING][04/02/2021 17:25:27][Thread 0012][remoting (akka://AkkaTrader)] Tried to associate with unreachable remote address [akka.tcp://[email protected]:4053]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: [Association failed with akka.tcp://[email protected]:4053] Caused by: [System.AggregateException: One or more errors occurred. (Name or service not known)
---> System.Net.Internals.SocketExceptionFactory+ExtendedSocketException (00000005, 0xFFFDFFFF): Name or service not known
at System.Net.Dns.InternalGetHostByName(String hostName)
at System.Net.Dns.ResolveCallback(Object context)
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw(Exception source)
at System.Net.Dns.HostResolutionEndHelper(IAsyncResult asyncResult)
at System.Net.Dns.EndGetHostEntry(IAsyncResult asyncResult)
at System.Net.Dns.<>c.<GetHostEntryAsync>b__27_1(IAsyncResult asyncResult)
at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic(IAsyncResult iar, Func`2 endFunction, Action`1 endAction, Task`1 promise, Boolean requiresSynchronization)
--- End of stack trace from previous location where exception was thrown ---
at Akka.Remote.Transport.DotNetty.DotNettyTransport.ResolveNameAsync(DnsEndPoint address, AddressFamily addressFamily)
at Akka.Remote.Transport.DotNetty.DotNettyTransport.DnsToIPEndpoint(DnsEndPoint dns)
at Akka.Remote.Transport.DotNetty.TcpTransport.MapEndpointAsync(EndPoint socketAddress)
at Akka.Remote.Transport.DotNetty.TcpTransport.AssociateInternal(Address remoteAddress)
at Akka.Remote.Transport.DotNetty.DotNettyTransport.Associate(Address remoteAddress)
--- End of inner exception stack trace ---
at System.Threading.Tasks.Task`1.GetResultCore(Boolean waitCompletionNotification)
at System.Threading.Tasks.Task`1.get_Result()
at Akka.Remote.Transport.ProtocolStateActor.<>c.<InitializeFSM>b__12_18(Task`1 result)
at System.Threading.Tasks.ContinuationResultTaskFromResultTask`2.InnerInvoke()
at System.Threading.Tasks.Task.<>c.<.cctor>b__274_0(Object obj)
at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
--- End of stack trace from previous location where exception was thrown ---
at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
at System.Threading.Tasks.Task.ExecuteWithThreadLocal(Task& currentTaskSlot, Thread threadPoolThread)]
[WARNING][04/02/2021 17:25:27][Thread 0012][remoting (akka://AkkaTrader)] Tried to associate with unreachable remote address [akka.tcp://[email protected]:4053]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: [Association failed with akka.tcp://[email protected]:4053] Caused by: [System.AggregateException: One or more errors occurred. (Name or service not known)
---> System.Net.Internals.SocketExceptionFactory+ExtendedSocketException (00000005, 0xFFFDFFFF): Name or service not known
at System.Net.Dns.InternalGetHostByName(String hostName)
at System.Net.Dns.ResolveCallback(Object context)
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw(Exception source)
at System.Net.Dns.HostResolutionEndHelper(IAsyncResult asyncResult)
at System.Net.Dns.EndGetHostEntry(IAsyncResult asyncResult)
at System.Net.Dns.<>c.<GetHostEntryAsync>b__27_1(IAsyncResult asyncResult)
at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic(IAsyncResult iar, Func`2 endFunction, Action`1 endAction, Task`1 promise, Boolean requiresSynchronization)
--- End of stack trace from previous location where exception was thrown ---
at Akka.Remote.Transport.DotNetty.DotNettyTransport.ResolveNameAsync(DnsEndPoint address, AddressFamily addressFamily)
at Akka.Remote.Transport.DotNetty.DotNettyTransport.DnsToIPEndpoint(DnsEndPoint dns)
at Akka.Remote.Transport.DotNetty.TcpTransport.MapEndpointAsync(EndPoint socketAddress)
at Akka.Remote.Transport.DotNetty.TcpTransport.AssociateInternal(Address remoteAddress)
at Akka.Remote.Transport.DotNetty.DotNettyTransport.Associate(Address remoteAddress)
--- End of inner exception stack trace ---
at System.Threading.Tasks.Task`1.GetResultCore(Boolean waitCompletionNotification)
at System.Threading.Tasks.Task`1.get_Result()
at Akka.Remote.Transport.ProtocolStateActor.<>c.<InitializeFSM>b__12_18(Task`1 result)
at System.Threading.Tasks.ContinuationResultTaskFromResultTask`2.InnerInvoke()
at System.Threading.Tasks.Task.<>c.<.cctor>b__274_0(Object obj)
at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
--- End of stack trace from previous location where exception was thrown ---
at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
at System.Threading.Tasks.Task.ExecuteWithThreadLocal(Task& currentTaskSlot, Thread threadPoolThread)]
[INFO][04/02/2021 17:25:27][Thread 0004][akka://AkkaTrader/deadLetters] Message [InitJoin] from akka://AkkaTrader/system/cluster/core/daemon/firstSeedNodeProcess-1 to akka://AkkaTrader/deadLetters was not delivered. [1] dead letters encountered. If this is not an expected behavior then akka://AkkaTrader/deadLetters may have terminated unexpectedly. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
[INFO][04/02/2021 17:25:27][Thread 0004][akka://AkkaTrader/deadLetters] Message [InitJoin] from akka://AkkaTrader/system/cluster/core/daemon/firstSeedNodeProcess-1 to akka://AkkaTrader/deadLetters was not delivered. [2] dead letters encountered. If this is not an expected behavior then akka://AkkaTrader/deadLetters may have terminated unexpectedly. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
[INFO][04/02/2021 17:25:27][Thread 0011][akka.tcp://[email protected]:4053/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2FAkkaTrader%40lighthouse-2.lighthouse%3A4053-2] Removing receive buffers for [akka.tcp://[email protected]:4053]->[akka.tcp://[email protected]:4053]
[INFO][04/02/2021 17:25:27][Thread 0013][akka.tcp://[email protected]:4053/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2FAkkaTrader%40lighthouse-1.lighthouse%3A4053-1] Removing receive buffers for [akka.tcp://[email protected]:4053]->[akka.tcp://[email protected]:4053]
[INFO][04/02/2021 17:25:28][Thread 0004][akka://AkkaTrader/deadLetters] Message [InitJoin] from akka://AkkaTrader/system/cluster/core/daemon/firstSeedNodeProcess-1 to akka://AkkaTrader/deadLetters was not delivered. [3] dead letters encountered. If this is not an expected behavior then akka://AkkaTrader/deadLetters may have terminated unexpectedly. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
[INFO][04/02/2021 17:25:28][Thread 0004][akka://AkkaTrader/deadLetters] Message [InitJoin] from akka://AkkaTrader/system/cluster/core/daemon/firstSeedNodeProcess-1 to akka://AkkaTrader/deadLetters was not delivered. [4] dead letters encountered. If this is not an expected behavior then akka://AkkaTrader/deadLetters may have terminated unexpectedly. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
[INFO][04/02/2021 17:25:29][Thread 0009][akka://AkkaTrader/deadLetters] Message [InitJoin] from akka://AkkaTrader/system/cluster/core/daemon/firstSeedNodeProcess-1 to akka://AkkaTrader/deadLetters was not delivered. [5] dead letters encountered. If this is not an expected behavior then akka://AkkaTrader/deadLetters may have terminated unexpectedly. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
[INFO][04/02/2021 17:25:29][Thread 0009][akka://AkkaTrader/deadLetters] Message [InitJoin] from akka://AkkaTrader/system/cluster/core/daemon/firstSeedNodeProcess-1 to akka://AkkaTrader/deadLetters was not delivered. [6] dead letters encountered. If this is not an expected behavior then akka://AkkaTrader/deadLetters may have terminated unexpectedly. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
[INFO][04/02/2021 17:25:30][Thread 0009][akka://AkkaTrader/deadLetters] Message [InitJoin] from akka://AkkaTrader/system/cluster/core/daemon/firstSeedNodeProcess-1 to akka://AkkaTrader/deadLetters was not delivered. [7] dead letters encountered. If this is not an expected behavior then akka://AkkaTrader/deadLetters may have terminated unexpectedly. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
[INFO][04/02/2021 17:25:30][Thread 0009][akka://AkkaTrader/deadLetters] Message [InitJoin] from akka://AkkaTrader/system/cluster/core/daemon/firstSeedNodeProcess-1 to akka://AkkaTrader/deadLetters was not delivered. [8] dead letters encountered. If this is not an expected behavior then akka://AkkaTrader/deadLetters may have terminated unexpectedly. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
[INFO][04/02/2021 17:25:31][Thread 0009][akka://AkkaTrader/deadLetters] Message [InitJoin] from akka://AkkaTrader/system/cluster/core/daemon/firstSeedNodeProcess-1 to akka://AkkaTrader/deadLetters was not delivered. [9] dead letters encountered. If this is not an expected behavior then akka://AkkaTrader/deadLetters may have terminated unexpectedly. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
[INFO][04/02/2021 17:25:31][Thread 0009][akka://AkkaTrader/deadLetters] Message [InitJoin] from akka://AkkaTrader/system/cluster/core/daemon/firstSeedNodeProcess-1 to akka://AkkaTrader/deadLetters was not delivered. [10] dead letters encountered, no more dead letters will be logged in next [00:05:00]. If this is not an expected behavior then akka://AkkaTrader/deadLetters may have terminated unexpectedly. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
[INFO][04/02/2021 17:25:32][Thread 0004][Cluster (akka://AkkaTrader)] Cluster Node [1.4.0] - Node [akka.tcp://[email protected]:4053] is JOINING itself (with roles [lighthouse], version [1.4.0]) and forming a new cluster
[INFO][04/02/2021 17:25:32][Thread 0004][Cluster (akka://AkkaTrader)] Cluster Node [akka.tcp://[email protected]:4053] - is the new leader among reachable nodes (more leaders may exist)
[INFO][04/02/2021 17:25:32][Thread 0004][Cluster (akka://AkkaTrader)] Cluster Node [akka.tcp://[email protected]:4053] - Leader is moving node [akka.tcp://[email protected]:4053] to [Up]
[INFO][04/02/2021 17:26:02][Thread 0022][akka.tcp://[email protected]:4053/user/petabridge.cmd] Received petabridge.cmd connection from client [127.0.0.1:34872]
[INFO][04/02/2021 17:26:02][Thread 0009][akka.tcp://[email protected]:4053/user/petabridge.cmd/127.0.0.1%3A34872] petabridge.cmd client disconnected. Shutting down...
[INFO][04/02/2021 17:26:02][Thread 0022][akka.tcp://[email protected]:4053/user/petabridge.cmd/127.0.0.1%3A34872] Terminating petabridge.cmd connection to client.
lighthouse-2
was never started under this configuration, but it looks like we failed to connect to node 1 and never bothered after that - instead forming a new cluster. As the first node in the seed-nodes
list, this is the intended behavior - it's supposed to form a new cluster with itself.
Now what does node 2's logs look like?
[Docker-Bootstrap] IP=lighthouse-1.lighthouse
[Docker-Bootstrap] PORT=4053
[Docker-Bootstrap] SEEDS=[]
[Lighthouse] ActorSystem: AkkaTrader; IP: lighthouse-1.lighthouse; PORT: 4053
[Lighthouse] Performing pre-boot sanity check. Should be able to parse address [akka.tcp://[email protected]:4053]
[Lighthouse] Parse successful.
[INFO][04/02/2021 17:34:21][Thread 0001][remoting (akka://AkkaTrader)] Starting remoting
[INFO][04/02/2021 17:34:22][Thread 0001][remoting (akka://AkkaTrader)] Remoting started; listening on addresses : [akka.tcp://[email protected]:4053]
[INFO][04/02/2021 17:34:22][Thread 0001][remoting (akka://AkkaTrader)] Remoting now listens on addresses: [akka.tcp://[email protected]:4053]
[INFO][04/02/2021 17:34:22][Thread 0001][Cluster (akka://AkkaTrader)] Cluster Node [akka.tcp://[email protected]:4053] - Starting up...
[INFO][04/02/2021 17:34:22][Thread 0001][Cluster (akka://AkkaTrader)] Cluster Node [akka.tcp://[email protected]:4053] - Started up successfully
Press Control + C to terminate.
[INFO][04/02/2021 17:34:22][Thread 0009][akka.tcp://[email protected]:4053/system/cluster/core/daemon/downingProvider] SBR started. Config: strategy [KeepMajority], stable-after [00:00:20], down-all-when-unstable [00:00:15], selfUniqueAddress [akka.tcp://[email protected]:4053].
[INFO][04/02/2021 17:34:22][Thread 0004][akka.tcp://[email protected]:4053/user/petabridge.cmd] petabridge.cmd host bound to [0.0.0.0:9110]
[INFO][04/02/2021 17:34:22][Thread 0014][Cluster (akka://AkkaTrader)] Cluster Node [1.4.0] - Node [akka.tcp://[email protected]:4053] is JOINING itself (with roles [lighthouse], version [1.4.0]) and forming a new cluster
[INFO][04/02/2021 17:34:22][Thread 0014][Cluster (akka://AkkaTrader)] Cluster Node [akka.tcp://[email protected]:4053] - is the new leader among reachable nodes (more leaders may exist)
[INFO][04/02/2021 17:34:22][Thread 0014][Cluster (akka://AkkaTrader)] Cluster Node [akka.tcp://[email protected]:4053] - Leader is moving node [akka.tcp://[email protected]:4053] to [Up]
[INFO][04/02/2021 17:34:22][Thread 0014][akka.tcp://[email protected]:4053/system/cluster/core/daemon/downingProvider] This node is now the leader responsible for taking SBR decisions among the reachable nodes (more leaders may exist).
[INFO][04/02/2021 17:34:51][Thread 0004][akka.tcp://[email protected]:4053/user/petabridge.cmd] Received petabridge.cmd connection from client [127.0.0.1:39200]
[INFO][04/02/2021 17:34:51][Thread 0024][akka.tcp://[email protected]:4053/user/petabridge.cmd/127.0.0.1%3A39200] petabridge.cmd client disconnected. Shutting down...
[INFO][04/02/2021 17:34:51][Thread 0024][akka.tcp://[email protected]:4053/user/petabridge.cmd/127.0.0.1%3A39200] Terminating petabridge.cmd connection to client.
Notice that the SEEDS
environment variable is empty - that's a problem.