Why
#188 addresses pure network connection issues leading to loss, delay, or out-of-order transmission of messages, as well as some crash-recovery failures of nodes, e.g. when a node crashes without having messages to send. It does not solve all reliability issues: it is still possible for a node, as an emitter, to irrecoverably lose messages, leading to the Head stalling.
What
Nodes are "allowed" to crash cleanly, then restart, reconnect to peers, and not lose messages, so that the Head can progress again.
There are basically a few cases to cover:
messages can be lost in the lower-level Ouroboros layer because they have been pulled from the broadcast channel but not sent; this is covered by the retransmission mechanism
messages can be lost between the moment they are received by the Network layer and the moment they are handled by the HeadLogic. This requires an on-disk log which is cleared when the message is handled (see the sketch below)
sent messages are not persisted and will therefore be lost in case of a crash, preventing retransmission to peers
Note that if some messages have been delivered to a node but not handled, and it crashes, they will be retransmitted by the peer as soon as our node sends a Heartbeat signalling that it lost all knowledge of its peers' message ids.
Important: we do not cover the Byzantine fault model, e.g. peers doing crazy stuff with the protocol :)
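To make the second case above concrete, here is a minimal sketch of such an on-disk log, assuming a hypothetical Msg type and plain file storage; this is not the actual hydra-node persistence API, just an illustration of write-before-handle, clear-after-handle, replay-on-restart:

```haskell
module InboundLog where

import Control.Monad (forM_, when)
import System.Directory (doesFileExist, removeFile)

-- Hypothetical stand-in for a network-layer message (not the real type).
data Msg = Msg { msgId :: Int, payload :: String }
  deriving (Show, Read)

-- Append a received message to the on-disk log *before* handing it to the
-- head logic, so a crash between reception and handling cannot lose it.
logReceived :: FilePath -> Msg -> IO ()
logReceived path msg = appendFile path (show msg <> "\n")

-- Clear the log once the logged messages have been handled. A real
-- implementation would rather clear per message (e.g. by recording the index
-- of the last handled message) instead of deleting the whole file.
clearHandled :: FilePath -> IO ()
clearHandled path = do
  exists <- doesFileExist path
  when exists (removeFile path)

-- On restart, re-deliver any messages that were logged but never cleared.
recoverUnhandled :: FilePath -> (Msg -> IO ()) -> IO ()
recoverUnhandled path handle = do
  exists <- doesFileExist path
  when exists $ do
    entries <- lines <$> readFile path
    forM_ (map read entries) handle
```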
How
We could implement the Logged Reliable Broadcast algorithm, e.g. store pending messages to be resent in a persistent queue instead of keeping them in memory (sketched at the end of this section).
Q: How much history should we keep and persist?
We wanted to GC old messages depending on the peers' view, but we dropped it in Introduce Reliability network layer #1074 because we could not make it work reliably
If we want to provide crash-recovery, we should actually keep all messages we send (and perhaps some we received?) in order to guarantee that a peer recovering without any memory can still catch up
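A minimal sketch of that idea, again with hypothetical types and a plain append-only file rather than the existing Reliability component of hydra-node: every broadcast message is persisted, the history is reloaded on restart, and a recovering peer can be served the portion it never acknowledged:

```haskell
module SentLog where

import System.Directory (doesFileExist)

-- Hypothetical outbound message type; the real network layer has its own.
data SentMsg = SentMsg { sentIx :: Int, sentBody :: String }
  deriving (Show, Read)

-- Append every broadcast message to a persistent, append-only queue instead
-- of keeping it only in memory, so it survives a clean crash of the node.
persistSent :: FilePath -> SentMsg -> IO ()
persistSent path m = appendFile path (show m <> "\n")

-- Reload the full send history on restart so retransmission can resume.
loadSent :: FilePath -> IO [SentMsg]
loadSent path = do
  exists <- doesFileExist path
  if exists
    then map read . lines <$> readFile path
    else pure []

-- Select everything a peer still needs, given the highest message index it
-- acknowledged; a peer recovering without any memory acknowledged nothing,
-- so it gets the whole history back.
catchUpFor :: Int -> [SentMsg] -> [SentMsg]
catchUpFor acked = filter ((> acked) . sentIx)
```

A Heartbeat signalling that a peer lost all knowledge of our message ids would then translate into resending catchUpFor 0 over the loaded history.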
ghost changed the title from "Ensure resilience in the case crash-recovery failures from nodes (nodes can crash "cleanly" and then restarts and reconnects to peers)" to "Ensure resilience in the case crash-recovery failures from nodes" on Sep 19, 2023.
ch1bo changed the title to "Network resilience and crash-recovery to node failure" on Sep 20, 2023.
ch1bo changed the title to "Network crash-recovery to node failure" on Sep 20, 2023.
Work has been done as part of #1101, which seems to cover most of the crash-recovery cases, but there might be some tricky corner cases not covered (e.g. item 2 in the What section).
In order to move forward, we chose to close this issue and have a follow-up issue dedicated to implementing stress tests in the spirit of Jepsen on a cluster of nodes, in order to verify how reliable we are.