
Don't persist the network messages and their acknowledgements #1417

Closed
v0d1ch opened this issue May 7, 2024 · 9 comments
Assignees
Labels
green 💚 Low complexity or well understood feature 💬 feature A feature on our roadmap

Comments

@v0d1ch
Contributor

v0d1ch commented May 7, 2024

Why

We have already seen a couple of times that resending lost/missed messages was not robust: the Head still got stuck, and it was unclear which values we would need to set for the acks counter in order to fix the problem and resend the lost messages.

The initial idea of having the reliability layer store the index of sent/seen messages and replay them does not seem to work well, because of a couple of factors:

  • The node can crash after sending a message but before storing it on disk
  • The node can crash after storing the message on disk but before putting it to a queue for processing

On top of that, the saved acks do not correspond directly to the stored network messages: we store them separately, and this is not an atomic process, which can lead to failures.

Overall, storing the network messages and acks does not seem beneficial enough to justify the complexity, so I propose removing it completely and keeping them in memory, as before.

By doing this we rely on the other nodes not crashing at the same time, so that the node that crashed can catch up by re-receiving the lost network messages from the other nodes.
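To make the atomicity concern above concrete, here is a minimal, purely illustrative sketch (all names invented, not hydra-node's actual API) of why two separate writes for messages and acks can diverge when the node crashes between them:

```haskell
import Data.IORef

-- Two independent "stores", written by two separate (non-atomic) steps.
persistMessage :: IORef [String] -> String -> IO ()
persistMessage msgs m = modifyIORef' msgs (++ [m])

persistAck :: IORef Int -> IO ()
persistAck acks = modifyIORef' acks (+ 1)

main :: IO ()
main = do
  msgs <- newIORef []
  acks <- newIORef 0
  -- happy path: both writes succeed
  persistMessage msgs "msg1"
  persistAck acks
  -- crash window: the message is written, but the node dies before the ack
  persistMessage msgs "msg2"
  -- (no persistAck here: simulated crash)
  n <- length <$> readIORef msgs
  a <- readIORef acks
  putStrLn ("messages=" ++ show n ++ " acks=" ++ show a)
  -- prints: messages=2 acks=1 -- the two stores now disagree
```

On restart, the node would find one more stored message than acks, with no way to tell whether the extra message was actually delivered.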

There are a couple of issues related to the implementation of the reliability layer of hydra-node, like #1079, and a bug report that could be related: #1202.

What

Remove persistence handle from the hydra-node networking part.

How

Completely remove the MessagePersistence argument to the withNetwork function, which eliminates storing/reading messages from disk.
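Schematically, the change might look like the following. This is a hedged sketch only: withNetwork in hydra-node takes more arguments than shown, and NetworkComponent here is simplified to a single broadcast action. The point is the shape of the change: no persistence handle, with reliability state held only in memory.

```haskell
import Data.IORef

-- Simplified stand-in for hydra-node's network component (invented here).
newtype NetworkComponent msg = NetworkComponent { broadcast :: msg -> IO () }

-- before (sketch):
-- withNetwork :: MessagePersistence IO msg -> (NetworkComponent msg -> IO a) -> IO a

-- after (sketch): the message log lives in an IORef and is lost on crash,
-- so a restarted node must catch up from its peers.
withNetwork :: (NetworkComponent String -> IO a) -> IO a
withNetwork action = do
  seen <- newIORef ([] :: [String])  -- in-memory only, nothing on disk
  action (NetworkComponent (\m -> modifyIORef' seen (m :)))

main :: IO ()
main = withNetwork $ \nc -> do
  broadcast nc "ReqTx"
  broadcast nc "AckSn"
  putStrLn "broadcast ok"
```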

@v0d1ch v0d1ch added the 💭 idea An idea or feature request label May 7, 2024
@v0d1ch v0d1ch self-assigned this May 7, 2024
@ch1bo
Collaborator

ch1bo commented May 7, 2024

I think we should re-open #1079 and maybe reconsider the persisted queue too to mitigate messages that were transmitted, but not handled in the head logic (i.e. on the receiving side).

@v0d1ch
Contributor Author

v0d1ch commented May 8, 2024

You mean also add the option to persist whatever was in the queue at certain point in time? At which point would we persist the data?

@ffakenz
Contributor

ffakenz commented May 16, 2024

If you remove the network messages, then you cannot re-transmit messages anymore.

On the other hand, removing the acknowledgements would result in lots of messages over the network. However, it might work and lead to a head getting unstuck.

@locallycompact
Contributor

Why do we need to persist messages at all? Why can we not do what the cardano-node does? A cardano-node only needs to ask whoever is available whether they have a longer chain. It does not rely on peers storing outbound messages. Why should our situation be different?

@ch1bo
Collaborator

ch1bo commented May 23, 2024

A cardano-node only needs to ask whoever is available whether they have a longer chain.

This is a good idea and something that was considered in the past (as mentioned in this ADR; there are other PRs and logbook items we could dig up). It was called a "pull-based approach", and I think this is the essence of your comment: have the network participants pull data from each other instead of pushing it to them.

However, that approach ultimately leads to similar questions about persistence: how much data would we keep on the other end so that network participants can pull it?
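The pull-based idea can be sketched like this (all names invented for illustration): a recovering node asks a peer for everything after the last index it has seen, instead of relying on peers to resend. Note that the peer must retain its log at least that far back, which is exactly the retention question raised above.

```haskell
type Index = Int

-- A peer serves all entries of its log that the recovering node has
-- not yet seen. How long the log must be kept is the open question.
pullAfter :: Index -> [(Index, msg)] -> [(Index, msg)]
pullAfter lastSeen = filter ((> lastSeen) . fst)

main :: IO ()
main = do
  let peerLog = [(1, "ReqTx"), (2, "ReqSn"), (3, "AckSn")]
  -- recovering node last saw index 1, so it pulls indices 2 and 3
  print (map fst (pullAfter 1 peerLog))
  -- prints: [2,3]
```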

@locallycompact
Contributor

Nothing on the network layer. You only ask another node what it believes the actual chain state to be, and then you verify it: the chain (the history of snapshots) and the working area (signatures from unconfirmed snapshots).

@locallycompact locallycompact added 💬 feature A feature on our roadmap green 💚 Low complexity or well understood feature and removed 💭 idea An idea or feature request labels May 27, 2024
@ch1bo
Collaborator

ch1bo commented Jul 22, 2024

We just stumbled over the acks being persisted again today. In this case someone had wiped their acks persistence, but someone else had not (and/or the --peer configuration had also changed).

We concluded that having the acks persisted is very fragile and it's unclear when and when not to reset that piece of state!

@noonio
Contributor

noonio commented Aug 13, 2024

In order to do this we should have some tests covering the full networking spec (see #1532 and #1080).

@noonio noonio linked a pull request Sep 2, 2024 that will close this issue
@ffakenz ffakenz self-assigned this Sep 2, 2024
@ch1bo ch1bo linked a pull request Sep 2, 2024 that will close this issue
@ffakenz
Contributor

ffakenz commented Sep 2, 2024

Exploration: #1593

Closing as "not planned"

@ffakenz ffakenz closed this as completed Sep 2, 2024
@ch1bo ch1bo closed this as not planned Sep 10, 2024