Introduce Reliability network layer #1074
Conversation
**Transaction costs** — Sizes and execution budgets for Hydra protocol transactions (tables not reproduced here). Costs are reported for the Init, Commit (using ada-only outputs for better comparability), CollectCom, Close, Contest, Abort (some variation because of the random mixture of still-initial and already-committed outputs), and FanOut (involves spending the head output and burning head tokens; uses an ada-only UTxO for better comparability) transactions.

**End-to-end benchmark results** — This comment collects the latest end-to-end benchmark results produced by Hydra's continuous integration system from the latest code. Take these results with a grain of salt as they are currently produced on very limited cloud VMs rather than controlled hardware; instead of focusing on absolute numbers, the emphasis should be on relative results, e.g. how the timings for a scenario evolve as the code changes. Generated at 2023-10-04 07:36:15.915592466 UTC.

- **3-nodes scenario**: a rather typical setup, with 3 nodes forming a Hydra head.
- **Baseline scenario**: a minimal case and as such a good baseline against which to assess the overhead introduced by more complex setups. There is a single hydra-node with a single client submitting single-input, single-output transactions against a constant UTxO set of 1.
I like that there is extensive module documentation, however I was missing comments in the actual implementation, which made reviewing harder than it could have been.

As indicated by comments in the code, the whole "layering" of this network stack is starting to leak heavily left and right. I also think we have reached the maximum complexity we can tackle this way, and I doubt that re-using the `NetworkComponent` interface across these layers helps to tame that complexity. No need to address it as part of this PR, but I feel the urge to collapse the whole stack into a single layer which just calls out into the individual concerns upon receiving a message / when sending a message.

On the reliability layer itself: I think the idea is fine (assuming I understood the algorithm correctly), but the implementation appears to be very fragile! In particular, access to all these vectors is often unchecked and exceptions are raised left and right if some of the assumptions are not met. As this is network code, I think we should be very defensive when dealing with received messages!

Also, I found the choice of an `IntMap` surprising. From what I can see, the operations on `seenMessages` only consist of insert (on one end; on `broadcast`), sequential access (to get the missing messages, which must be accessed in sequence), and delete (on the other end; when everyone has seen an index). Hence, a `Seq` might rather be what we want. Maybe not needed in this PR. What could make sense, though, is to encapsulate parts of the logic into smaller handles / interfaces:
```haskell
data SentMessages m msg = SentMessages
  { insertMsg :: msg -> m ()      -- ^ record a newly broadcast message
  , getMissing :: Int -> m [msg]  -- ^ messages from a given index onwards
  , removeMsg :: msg -> m ()      -- ^ drop a message everyone has seen
  }
```
This should make the code a bit more readable and remove some `TVar` noise all over the place.
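For illustration, here is a minimal sketch of how such a handle could be backed by a `Seq` kept in a `TVar`, with the operations exposed in `STM`. The module name, the `Eq` constraint, and `newSentMessages` are assumptions for this sketch, not the actual hydra-node code:

```haskell
module SentMessagesSketch where

import Control.Concurrent.STM (STM, modifyTVar', newTVarIO, readTVar)
import Data.Foldable (toList)
import Data.Sequence (Seq, (|>))
import qualified Data.Sequence as Seq

-- Same interface as suggested above, instantiated with STM as the monad.
data SentMessages m msg = SentMessages
  { insertMsg :: msg -> m ()
  , getMissing :: Int -> m [msg]
  , removeMsg :: msg -> m ()
  }

-- | Build a handle hiding the 'TVar' behind the three operations, so the
-- reliability layer never touches the underlying 'Seq' directly.
newSentMessages :: Eq msg => IO (SentMessages STM msg)
newSentMessages = do
  var <- newTVarIO Seq.empty
  pure
    SentMessages
      { -- Append a newly broadcast message at one end.
        insertMsg = \msg -> modifyTVar' var (|> msg)
      , -- Messages from the given index onwards, in sending order.
        getMissing = \idx -> toList . Seq.drop idx <$> readTVar var
      , -- Drop a message once every peer has acknowledged it.
        removeMsg = \msg -> modifyTVar' var (Seq.filter (/= msg))
      }
```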
At least not fast enough compared to the time we were giving their messages to arrive: sending a message once every ten seconds and expecting all the messages to reach the peer in less than 100 seconds does not always work.
Bug was exposed by running:

```
cabal test hydra-node --test-options '-m Reliability --seed 1054015251'
```

The problem was caused by Bob increasing his local view of received messages from Alice from 15 to 16 when receiving a Ping from Alice when, actually, he had never received message 16. As a consequence, Alice would not resend message 16, or, when she did resend it, Bob would ignore it anyway as he is already expecting a later message. ❌
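For clarity, a hedged sketch (hypothetical names, not the actual fix) of the invariant that was violated: the counter of messages seen from a peer should only advance when the message with exactly the next expected index arrives, never on a Ping.

```haskell
import Data.IORef (IORef, atomicModifyIORef')

-- | Advance the counter of messages seen from a peer only if @msgIndex@ is
-- exactly the next one expected; a Ping or an out-of-order index leaves the
-- counter untouched, so the sender keeps resending the real gap.
recordReceived :: IORef Int -> Int -> IO Bool
recordReceived seenFromPeer msgIndex =
  atomicModifyIORef' seenFromPeer $ \seen ->
    if msgIndex == seen + 1
      then (msgIndex, True) -- the expected next message: accept and advance
      else (seen, False)    -- Ping, duplicate or out-of-order: ignore
```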
If Alice is lagging behind Bob and Bob is lagging behind Alice, then neither would resend any message to its peer. Here we remove one condition to unlock this.
So we should not include ourselves in the `seenMessages` map; otherwise, in real life, we would never garbage collect.
This is meant to ensure we only try to resend messages when the peer is quiescent, which was the original intent of using Pings in the first place: to avoid resending messages too often. The assumption is that disconnections and message drops should be few and far between in normal operation, so it is fine to rely on the Ping round-trip to check the peer's state.
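A minimal sketch of that behaviour, with hypothetical names: resending is only triggered by an incoming Ping, using the acknowledged index carried by the Ping to decide which messages to push again.

```haskell
import Control.Concurrent.STM (TVar, atomically, readTVar)

-- | On receiving a Ping from a quiescent peer, resend everything the peer
-- has not acknowledged yet. 'sentVar' holds our broadcast messages in order.
resendOnPing :: TVar [msg] -> (msg -> IO ()) -> Int -> IO ()
resendOnPing sentVar send peerAcked = do
  sent <- atomically (readTVar sentVar)
  -- Everything after the peer's acknowledged count is considered missing.
  mapM_ send (drop peerAcked sent)
```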
Timeouts are inherently unreliable, especially given an arbitrary and unknown list of messages and an arbitrary ordering of actions. Tests might fail because one of the peers stops before the other and therefore fails to send the Pings which would notify the peer of its missing messages, or fails to take the peer's Pings into account. This commit replaces the complicated timeout logic with a simple STM-based check that _both_ peers received all the messages.
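For example, the test condition can be expressed as a single STM transaction that retries until both peers have recorded every expected message (names here are assumed, not the actual test code):

```haskell
import Control.Concurrent.STM (TVar, atomically, check, readTVar)

-- | Block until both peers have received all expected messages, with no
-- timing involved: the transaction simply retries until the condition holds.
waitForAllMessages :: Eq msg => [msg] -> TVar [msg] -> TVar [msg] -> IO ()
waitForAllMessages expected aliceReceived bobReceived =
  atomically $ do
    fromAlice <- readTVar aliceReceived
    fromBob <- readTVar bobReceived
    check (all (`elem` fromAlice) expected && all (`elem` fromBob) expected)
```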
This was used for garbage-collecting messages and will be rewritten later.
It also throws an error if we do not find ourselves in the list of all parties; this is an absurd case, given we included ourselves in the list before sorting.
No idea why the CI is red, going to bypass and merge it 🤷
🌻 This PR aims to improve our network stack and make it more resilient by tracking messages and resending those that were missed.
❓ This PR only addresses part of the resilience story from #188, namely the part that deals with connection failures. Future PRs should address the other issues, e.g. resilience to transient node crashes.
🚧 TODO:
- Remove `fromJust` from the code
- Prune the messages backlog list and limit its growth, as we probably don't want it to clog memory (see the sketch below)
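As a hedged sketch of the pruning idea (hypothetical names, assuming the backlog stays keyed by message index in an `IntMap`): drop every message that all peers have already acknowledged, so the backlog only grows by the current in-flight window.

```haskell
import Data.IntMap.Strict (IntMap)
import qualified Data.IntMap.Strict as IntMap

-- | Prune the backlog of sent messages, given the highest index acknowledged
-- by each peer. Keep only indices strictly above the smallest
-- acknowledgement: nobody needs the earlier messages resent anymore.
pruneBacklog :: [Int] -> IntMap msg -> IntMap msg
pruneBacklog acks backlog
  | null acks = backlog
  | otherwise = snd (IntMap.split (minimum acks) backlog)
```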