Network resilient to node's crash-recovery failure mode #1079

Closed
ghost opened this issue Sep 19, 2023 · 1 comment
Labels
  • green 💚 (Low complexity or well understood feature)
  • L2 (Affect off-chain part of the Head protocol/network)
  • 💬 feature (A feature on our roadmap)

@ghost

ghost commented Sep 19, 2023

Why

#188 addresses pure network connection issues leading to loss, delay, or out-of-order transmission of messages, as well as some crash-recovery failures of nodes, e.g. when a node crashes without having messages to send. It does not solve all reliability issues: a node acting as an emitter can still irrecoverably lose messages, which leads to the Head stalling.

What

Nodes are "allowed" to crash cleanly, then restart and reconnect to peers without losing messages, so that the Head can progress again.

There are basically a few cases to cover:

  • messages can be lost in the lower-level Ouroboros layer because they have been pulled from the broadcast channel but not yet sent; this is covered by the retransmission mechanism
  • messages can be lost between the moment they are received by the Network layer and the moment they are handled by the HeadLogic. This requires an on-disk log which is cleared once the message is handled
  • sent messages are not persisted and are therefore lost in case of a crash, which prevents retransmission to peers
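
To illustrate the second case, here is a minimal sketch of an on-disk log for messages that have been received but not yet handled. It is not Hydra code: the `InboundLog` class, its method names, the JSON-lines file format, and the `msg-a`/`msg-b` payloads are all assumptions made for the example. Instead of clearing entries in place, it appends a "done" marker and rebuilds the pending set on restart.

```python
import json
import os
import tempfile


class InboundLog:
    """Hypothetical on-disk log of received-but-not-yet-handled messages."""

    def __init__(self, path):
        self.path = path
        self.pending = {}  # message id -> payload
        # On (re)start, replay the log to rebuild the set of unhandled messages.
        if os.path.exists(path):
            with open(path, encoding="utf-8") as f:
                for line in f:
                    entry = json.loads(line)
                    if entry["op"] == "recv":
                        self.pending[entry["id"]] = entry["payload"]
                    else:  # "done": the message was handled before the crash
                        self.pending.pop(entry["id"], None)

    def _append(self, entry):
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(entry) + "\n")
            f.flush()
            os.fsync(f.fileno())  # make sure the entry survives a crash

    def received(self, msg_id, payload):
        # Persist first, so a crash after this point cannot lose the message.
        self._append({"op": "recv", "id": msg_id, "payload": payload})
        self.pending[msg_id] = payload

    def handled(self, msg_id):
        self._append({"op": "done", "id": msg_id})
        self.pending.pop(msg_id, None)


# Simulate: two messages received, one handled, then a crash and restart.
path = os.path.join(tempfile.mkdtemp(), "inbound.log")
log = InboundLog(path)
log.received(1, "msg-a")
log.received(2, "msg-b")
log.handled(1)

recovered = InboundLog(path)  # a fresh instance stands in for the restarted node
print(recovered.pending)      # only message 2 is still waiting to be handled
```

The sketch ignores log compaction and torn writes (a partially written last line), both of which a real implementation would need to deal with.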

Note that if some messages have been delivered to a node but not handled before it crashes, the peer will retransmit them as soon as our node sends a Heartbeat signalling that it has lost all knowledge of its peers' message ids.
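
The heartbeat-triggered retransmission described above can be sketched roughly as follows. This is a simplified model, not the actual protocol: the `Peer` class, the shape of the heartbeat, and the use of a per-peer message count as the "known message ids" are assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Peer:
    """Hypothetical peer that tracks what it sent and what it has seen."""
    name: str
    sent: list = field(default_factory=list)  # messages we broadcast, in order
    seen: dict = field(default_factory=dict)  # peer name -> number of messages seen

    def broadcast(self, payload):
        self.sent.append(payload)

    def heartbeat(self):
        # The heartbeat advertises how much we know of each peer's messages.
        return {"from": self.name, "seen": dict(self.seen)}

    def on_heartbeat(self, hb):
        # Return the messages the heartbeat's sender is missing from us.
        known = hb["seen"].get(self.name, 0)
        return self.sent[known:]

    def deliver(self, sender, payloads):
        self.seen[sender] = self.seen.get(sender, 0) + len(payloads)

    def crash_and_restart(self):
        # All knowledge of peers' message ids is lost on restart.
        self.seen = {}


alice, bob = Peer("alice"), Peer("bob")
alice.broadcast("m1")
alice.broadcast("m2")

bob.deliver("alice", alice.on_heartbeat(bob.heartbeat()))
up_to_date = alice.on_heartbeat(bob.heartbeat())  # nothing left to resend

bob.crash_and_restart()
resent = alice.on_heartbeat(bob.heartbeat())      # everything is resent
print(up_to_date, resent)
```

Note that this only works while `alice.sent` is still available: if alice herself crashes, her in-memory `sent` list is gone, which is exactly the third case in the list above.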

Important: We do not cover the Byzantine fault model, e.g. peers doing crazy stuff with the protocol :)

How

We could implement the Logged Reliable Broadcast algorithm, e.g. store pending and re-sent messages on a persistent queue instead of keeping them in memory.
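
As a rough sketch of the "persistent queue" idea, assuming SQLite as the storage and a hypothetical `PersistentOutbox` class (neither is prescribed by this issue): each message is written to disk before it is handed to the network, so a restarted node can still retransmit it.

```python
import os
import sqlite3
import tempfile


class PersistentOutbox:
    """Hypothetical persistent queue of sent messages (log first, then send)."""

    def __init__(self, path):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS outbox ("
            "seq INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT NOT NULL)"
        )
        self.db.commit()

    def broadcast(self, payload, send):
        # Persist the message before sending it, so a crash cannot lose it.
        cur = self.db.execute("INSERT INTO outbox (payload) VALUES (?)", (payload,))
        self.db.commit()
        send(cur.lastrowid, payload)
        return cur.lastrowid

    def messages_after(self, seq):
        # Everything a peer that has seen messages up to `seq` is still missing.
        rows = self.db.execute(
            "SELECT seq, payload FROM outbox WHERE seq > ? ORDER BY seq", (seq,)
        )
        return rows.fetchall()

    def close(self):
        self.db.close()


path = os.path.join(tempfile.mkdtemp(), "outbox.db")
wire = []  # stands in for the network layer

outbox = PersistentOutbox(path)
outbox.broadcast("m1", lambda seq, p: wire.append((seq, p)))
outbox.broadcast("m2", lambda seq, p: wire.append((seq, p)))
outbox.close()  # simulate a crash: all in-memory state is gone

restarted = PersistentOutbox(path)
replay = restarted.messages_after(0)  # a peer with no memory gets everything
tail = restarted.messages_after(1)    # a peer that saw message 1 gets the rest
print(replay, tail)
```

This version never deletes anything, so the outbox grows without bound.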

Q: How much history should we keep and persist?

  • We wanted to GC old messages depending on the peers' view, but we dropped this in Introduce Reliability network layer #1074 because we could not make it work reliably
  • If we want to provide crash-recovery, we should actually keep all messages we send (and perhaps some we received?) in order to guarantee that a peer recovering without any memory can still catch up
@ghost ghost mentioned this issue Sep 19, 2023
2 tasks
@ghost ghost changed the title Ensure resilience in the case crash-recovery failures from nodes (nodes can crash "cleanly" and then restarts and reconnects to peers) Ensure resilience in the case crash-recovery failures from nodes Sep 19, 2023
@ghost ghost added 💬 feature A feature on our roadmap green 💚 Low complexity or well understood feature labels Sep 19, 2023
@ghost ghost added this to Hydra Head Roadmap Sep 19, 2023
@ghost ghost moved this to Next in Hydra Head Roadmap Sep 19, 2023
@ch1bo ch1bo changed the title Ensure resilience in the case crash-recovery failures from nodes Network resilience and crash-recovery to node failure Sep 20, 2023
@ch1bo ch1bo changed the title Network resilience and crash-recovery to node failure Network crash-recovery to node failure Sep 20, 2023
@ch1bo ch1bo assigned ghost Sep 20, 2023
@ghost ghost added this to the 0.14.0 milestone Oct 3, 2023
@ghost ghost added the L2 Affect off-chain part of the Head protocol/network label Oct 3, 2023
@ghost ghost mentioned this issue Oct 3, 2023
10 tasks
@v0d1ch v0d1ch mentioned this issue Oct 5, 2023
4 tasks
@ghost ghost changed the title Network crash-recovery to node failure Network resilient to node's crash-recovery failure mode Oct 10, 2023
@ghost
Author

ghost commented Oct 10, 2023

Work has been done as part of #1101, which seems to cover most of the crash-recovery cases, but there might be some tricky corner cases not covered (e.g. item 2 in the What section).
In order to move forward, we chose to close this issue and open a follow-up issue dedicated to implementing stress tests in the spirit of Jepsen on a cluster of nodes, in order to verify how reliable we are.

This issue was closed.