Network resilient to node's crash-recovery failure mode #1079

ghost · 2023-09-19T12:38:55Z

Why

#188 addresses pure network connection issues that lead to loss, delay, or out-of-order transmission of messages, as well as some crash-recovery failures of nodes, e.g. when a node crashes with no messages left to send. It does not solve all reliability issues: a node acting as an emitter can still irrecoverably lose messages, which leads to the Head stalling.

What

Nodes are "allowed" to crash cleanly, then restart and reconnect to peers without losing messages, so that the Head can progress again.

There are basically a few cases to cover:

1. Messages can be lost in the lower-level Ouroboros layer because they have been pulled from the broadcast channel but not yet sent; this is covered by the retransmission mechanism.
2. Messages can be lost between the moment they are received by the Network layer and the moment they are handled by the HeadLogic. This requires an on-disk log that is cleared once the message is handled.
3. Sent messages are not persisted and are therefore lost in case of a crash, which prevents retransmission to peers.
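To make the on-disk log for received-but-unhandled messages more concrete, here is a minimal sketch in Python. It is not the Hydra implementation: the `InboxLog` class, its method names, the one-file-per-message layout and the JSON encoding are all assumptions made for illustration.

```python
import json
import os


class InboxLog:
    """Hypothetical on-disk log of messages received but not yet handled."""

    def __init__(self, directory):
        self.directory = directory
        os.makedirs(directory, exist_ok=True)

    def _path(self, seq):
        return os.path.join(self.directory, f"{seq:012d}.json")

    def record(self, seq, message):
        # Write to a temporary file, fsync, then rename, so that a crash
        # never leaves a half-written entry under the final name.
        tmp = self._path(seq) + ".tmp"
        with open(tmp, "w") as f:
            json.dump(message, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self._path(seq))

    def clear(self, seq):
        # Called once the message has been handled.
        os.remove(self._path(seq))

    def pending(self):
        # Entries recorded but never cleared, in sequence order.
        entries = []
        for name in sorted(os.listdir(self.directory)):
            if name.endswith(".json"):
                with open(os.path.join(self.directory, name)) as f:
                    entries.append((int(name[:-5]), json.load(f)))
        return entries
```

On restart, a node built along these lines would call `pending()` and feed the result back to its message handler before accepting new traffic. Note that a message handled just before the crash but not yet cleared would be replayed, so the handler would need to tolerate duplicates.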

Note that if some messages have been delivered to a node but not yet handled when it crashes, they will be retransmitted by the peer as soon as our node sends a Heartbeat signalling that it has lost all knowledge of its peers' message ids.
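The heartbeat-driven retransmission can be illustrated with a small sketch. It assumes, purely for illustration, that each sender numbers its messages from 1 and that a heartbeat carries the highest id the receiver has seen from that sender; the `Sender` class and `on_heartbeat` method are hypothetical names, not the actual network layer API.

```python
class Sender:
    """Hypothetical sender side: keeps every message it broadcast, by id."""

    def __init__(self):
        self.sent = []  # sent[i] holds the message with id i + 1

    def broadcast(self, payload):
        self.sent.append(payload)
        return len(self.sent)  # id of the message just sent

    def on_heartbeat(self, last_seen_id):
        # The peer reports the highest id it knows from us; everything after
        # that id is retransmitted. A peer that restarted without any memory
        # reports 0 and therefore receives the whole history again.
        return [
            (i + 1, payload)
            for i, payload in enumerate(self.sent)
            if i + 1 > last_seen_id
        ]
```

A peer that is up to date gets an empty list back, while a freshly restarted peer reporting `0` gets everything, which is why the sender's history matters for recovery.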

Important: We do not cover the Byzantine fault model, e.g. peers doing crazy stuff with the protocol :)

How

We could implement the Logged Reliable Broadcast algorithm, e.g. store pending and resent messages in a persistent queue instead of keeping them in memory.
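As a rough sketch of what "a persistent queue instead of memory" could look like, the following Python class appends every outgoing message to a file before it is considered sent, and reloads the file on startup. The `PersistentOutbox` name, the JSON-lines format and the `since` helper are assumptions for illustration only; the sketch also does not handle a partially written last line after a crash.

```python
import json
import os


class PersistentOutbox:
    """Hypothetical append-only queue of sent messages, one JSON value per line."""

    def __init__(self, path):
        self.path = path
        self.messages = []
        # Reload whatever was persisted before a previous crash or shutdown.
        if os.path.exists(path):
            with open(path) as f:
                self.messages = [json.loads(line) for line in f if line.strip()]

    def append(self, message):
        # Persist first, so the message survives a crash that happens
        # before or during the actual network send.
        with open(self.path, "a") as f:
            f.write(json.dumps(message) + "\n")
            f.flush()
            os.fsync(f.fileno())
        self.messages.append(message)
        return len(self.messages)  # 1-based id of the message

    def since(self, last_seen_id):
        # Messages a peer still needs, given the last id it has seen.
        return self.messages[last_seen_id:]
```

With this shape, a restarted node can still answer retransmission requests for messages it sent before the crash, which is the third case listed in the What section.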

Q: How much history should we keep and persist?

- We wanted to GC old messages depending on the peers' view, but we dropped this in "Introduce Reliability network layer" #1074 because we could not make it work reliably.
- If we want to provide crash-recovery, we should actually keep all messages we send (and perhaps some we received?) in order to guarantee that a peer recovering without any memory can still catch up.
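The tension between garbage-collecting by peers' view and supporting memoryless recovery can be shown with two small helper functions. Both are hypothetical and assume messages are numbered from 1 and that each peer reports the last id it has seen.

```python
def prunable_upto(acked_by_peer):
    """Highest message id that every peer has reported seeing.

    Pruning by peers' view would drop messages with ids up to this value.
    """
    return min(acked_by_peer.values(), default=0)


def can_catch_up(oldest_kept_id, peer_last_seen_id):
    """True if we still hold every message the peer is missing.

    The peer needs all ids from peer_last_seen_id + 1 onwards.
    """
    return peer_last_seen_id + 1 >= oldest_kept_id
```

For example, with acknowledgements `{"alice": 5, "bob": 3}` pruning would drop ids 1 to 3 and keep ids from 4 onwards. If bob then restarts without any memory and reports `0`, `can_catch_up(4, 0)` is false: the messages he needs are gone. Keeping everything (`oldest_kept_id == 1`) is the only setting in which a peer reporting `0` can always be served.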


ghost · 2023-10-10T12:20:52Z

Work has been done as part of #1101, which seems to cover most of the crash-recovery cases, but there might be some tricky corner cases not covered (e.g. item 2 in the What section).
In order to move forward, we chose to close this issue and open a follow-up issue dedicated to implementing stress tests, in the spirit of Jepsen, on a cluster of nodes in order to verify how reliable we are.

ghost mentioned this issue Sep 19, 2023

Network resilience to disconnects #188

Closed

2 tasks

ghost changed the title ~~Ensure resilience in the case crash-recovery failures from nodes (nodes can crash "cleanly" and then restarts and reconnects to peers)~~ Ensure resilience in the case crash-recovery failures from nodes Sep 19, 2023

ghost added the 💬 feature (A feature on our roadmap) and green 💚 (Low complexity or well understood feature) labels Sep 19, 2023

ghost added this to Hydra Head Roadmap Sep 19, 2023

ghost moved this to Next in Hydra Head Roadmap Sep 19, 2023

ch1bo changed the title ~~Ensure resilience in the case crash-recovery failures from nodes~~ Network resilience and crash-recovery to node failure Sep 20, 2023

ch1bo changed the title ~~Network resilience and crash-recovery to node failure~~ Network crash-recovery to node failure Sep 20, 2023

ch1bo assigned ghost Sep 20, 2023

ghost added this to the 0.14.0 milestone Oct 3, 2023

ghost added the L2 (Affect off-chain part of the Head protocol/network) label Oct 3, 2023

ghost mentioned this issue Oct 3, 2023

Introduce Reliability network layer #1074

Merged

10 tasks

v0d1ch mentioned this issue Oct 5, 2023

Reliable persistence #1101

Merged

4 tasks

ghost changed the title ~~Network crash-recovery to node failure~~ Network resilient to node's crash-recovery failure mode Oct 10, 2023

ghost closed this as completed Oct 10, 2023

github-project-automation bot moved this from Next to Done in Hydra Head Roadmap Oct 10, 2023

ghost mentioned this issue Oct 10, 2023

Provide tests covering the resilience of a Hydra Head cluster #1106

Closed

ch1bo unassigned ghost Oct 30, 2023

v0d1ch mentioned this issue May 7, 2024

Don't persist the network messages and their acknowledgements #1417

Closed
