
Introduce Reliability network layer #1074

Merged
102 commits merged into master from ensemble/network_model
Oct 4, 2023
Conversation

@v0d1ch (Contributor) commented Sep 13, 2023

🌻 This PR aims to improve our network stack and make it more resilient by using message tracking and resending missed messages.

❓ This PR only addresses part of the resilience story from #188, i.e. the part that deals with connection failures. Future PRs should address the other issues, e.g. resilience to transient node crashes.
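The tracking-and-resend idea behind the PR can be sketched in a few lines. This is an illustrative model only, not the merged implementation: all names (`Tagged`, `Verdict`, `receive`) are hypothetical, and it assumes each party tags its broadcasts with a monotonically increasing index so that receivers can detect gaps and ask for resends.

```haskell
import Data.Map.Strict (Map)
import qualified Data.Map.Strict as Map

type Party = String
type MsgIndex = Int

-- A broadcast message tagged with a per-sender sequence number (hypothetical type)
data Tagged msg = Tagged {sender :: Party, msgIndex :: MsgIndex, payload :: msg}

data Verdict msg
  = Deliver msg        -- the next expected message, pass it on
  | Duplicate          -- already seen, drop it
  | Missing [MsgIndex] -- gap detected, these indices need resending
  deriving (Eq, Show)

-- Decide what to do with an incoming message, given the highest index
-- seen so far from each sender (0 = nothing seen yet).
receive :: Map Party MsgIndex -> Tagged msg -> (Map Party MsgIndex, Verdict msg)
receive seen (Tagged p i m)
  | i == lastSeen + 1 = (Map.insert p i seen, Deliver m)
  | i <= lastSeen = (seen, Duplicate)
  | otherwise = (seen, Missing [lastSeen + 1 .. i - 1])
 where
  lastSeen = Map.findWithDefault 0 p seen
```

In this sketch a `Missing` verdict is where the real layer would trigger resending of the missed messages.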

🚧 TODO:

  • Test it in real life with the demo (shut down the network for one node, post a transaction, restart the network; it should work)
  • Move ADR to another PR
  • Write stress/model-based tests to improve confidence in the solution (optional)
  • More test coverage for potential failures
    • Remove use of partial functions, e.g. fromJust, from the code
    • Have unit tests for each error path
  • Prune the message backlog list and limit its growth, as we probably don't want it to clog memory
  • Fix log schemas

github-actions bot commented Sep 13, 2023

Test Results

367 tests (+21): 361 passed ✔️ (+20), 5 skipped 💤 (±0), 1 failed (+1)
124 suites (+6), 6 files (+1); runtime 17m 34s ⏱️ (+2m 30s)

For more details on these failures, see this check.

Results for commit ecb9522. ± Comparison against base commit f0c03e3.

♻️ This comment has been updated with latest results.

@v0d1ch force-pushed the ensemble/network_model branch 2 times, most recently from 9b2b760 to ff34c79 on September 15, 2023 15:37
@ghost force-pushed the ensemble/network_model branch from ff34c79 to 86c3bda on September 17, 2023 15:13
github-actions bot commented Sep 17, 2023

Transactions Costs

Sizes and execution budgets for Hydra protocol transactions. Note that unlisted parameters are currently using arbitrary values, so results are not fully deterministic nor comparable to previous runs.

Metadata
Generated at: 2023-10-04 07:44:29.676200484 UTC
Max. memory units: 14000000
Max. CPU units: 10000000000
Max. tx size (kB): 16384

Script summary

Name Hash Size (Bytes)
νInitial eaf589de11c6c805af24b759e7794d62661d3db4ade79594892ebaec 4106
νCommit 8dcc1fb34d1ba168dfb0b82e7d1a31956a2db5856f268146b0fd7f2a 2051
νHead e35bdf32cd3806596150c1cbab6ab5456bd957b36019ed2746bf481d 8797
μHead 386ad19467be96131379dacf57a9351a762da2dee3486a855f0409c9* 4151
  • The minting policy hash is only usable for comparison. As the script is parameterized, the actual script is unique per Head.

Cost of Init Transaction

Parties Tx size % max Mem % max CPU Min fee ₳
1 4742 11.79 4.65 0.49
2 4949 13.97 5.47 0.52
3 5154 16.67 6.51 0.56
5 5564 21.16 8.21 0.63
10 6590 33.11 12.77 0.80
37 12124 96.69 36.98 1.73

Cost of Commit Transaction

This is using ada-only outputs for better comparability.

UTxO Tx size % max Mem % max CPU Min fee ₳
1 599 12.55 4.94 0.31
2 786 16.26 6.61 0.36
3 975 20.20 8.37 0.42
5 1345 28.23 11.94 0.52
10 2284 50.99 21.84 0.82
18 3792 94.67 40.33 1.37

Cost of CollectCom Transaction

Parties UTxO (bytes) Tx size % max Mem % max CPU Min fee ₳
1 57 814 24.40 9.70 0.45
2 114 1136 36.62 14.72 0.60
3 169 1460 54.75 22.06 0.82
4 226 1774 70.53 28.67 1.00
5 283 2104 92.15 37.53 1.26

Cost of Close Transaction

Parties Tx size % max Mem % max CPU Min fee ₳
1 685 19.23 8.84 0.40
2 932 21.01 10.45 0.44
3 1132 22.38 11.87 0.47
5 1474 25.14 14.47 0.52
10 2534 33.35 22.02 0.69
49 10628 99.03 81.53 2.01

Cost of Contest Transaction

Parties Tx size % max Mem % max CPU Min fee ₳
1 684 22.61 9.94 0.43
2 960 24.59 11.79 0.48
3 1177 27.08 13.65 0.52
5 1513 30.34 16.50 0.58
10 2609 39.68 24.48 0.76
43 9334 97.75 75.04 1.89

Cost of Abort Transaction

Some variation because of random mixture of still initial and already committed outputs.

Parties Tx size % max Mem % max CPU Min fee ₳
1 4965 21.22 9.16 0.61
2 5389 36.24 15.80 0.79
3 5860 54.24 23.80 1.02
4 6270 75.61 33.23 1.28

Cost of FanOut Transaction

Involves spending head output and burning head tokens. Uses ada-only UTxO for better comparability.

Parties UTxO UTxO (bytes) Tx size % max Mem % max CPU Min fee ₳
5 0 0 4768 8.95 3.77 0.46
5 1 57 4800 10.25 4.57 0.48
5 5 283 4945 15.28 7.68 0.55
5 10 568 5129 21.94 11.72 0.64
5 20 1138 5485 34.68 19.56 0.81
5 30 1707 5847 47.94 27.62 0.99
5 40 2279 6209 60.94 35.57 1.17
5 50 2848 6568 73.17 43.21 1.34
5 70 3985 7284 99.73 59.36 1.70

End-To-End Benchmark Results

This page is intended to collect the latest end-to-end benchmarks results produced by Hydra's Continuous Integration system from the latest master code.

Please take these results with a grain of salt, as they are currently produced from very limited cloud VMs rather than controlled hardware. Instead of focusing on the absolute numbers, the emphasis should be on relative results, e.g. how the timings for a scenario evolve as the code changes.

Generated at 2023-10-04 07:36:15.915592466 UTC

3-nodes Scenario

A rather typical setup, with 3 nodes forming a Hydra head.

Number of nodes 3
Number of txs 900
Avg. Confirmation Time (ms) 95.149266526
P99 242.37822907999998ms
P95 228.9985739ms
P50 73.9630525ms
Number of Invalid txs 0

Baseline Scenario

This scenario represents a minimal case and as such is a good baseline against which to assess the overhead introduced by more complex setups. There is a single hydra-node with a single client submitting single-input and single-output transactions with a constant UTxO set of 1.

Number of nodes 1
Number of txs 300
Avg. Confirmation Time (ms) 3.875501526
P99 15.444141319999975ms
P95 8.391805150000003ms
P50 2.9316855ms
Number of Invalid txs 0

ghost commented Sep 20, 2023

I have added some documentation on the network components:
[Screenshot: network components documentation, 2023-09-20]

@ghost force-pushed the ensemble/network_model branch from 3b1f5c9 to 291aaa5 on September 20, 2023 09:32
@v0d1ch marked this pull request as ready for review September 20, 2023 11:22
@v0d1ch marked this pull request as draft September 20, 2023 11:22
@pgrange force-pushed the ensemble/network_model branch 4 times, most recently from 9e5c6a8 to 7317ab6 on September 20, 2023 13:32
@v0d1ch force-pushed the ensemble/network_model branch 2 times, most recently from 8ea1ed6 to bfd3563 on September 21, 2023 16:32
@ghost force-pushed the ensemble/network_model branch 2 times, most recently from 02324dd to 12afd8b on September 26, 2023 09:48
@ghost marked this pull request as ready for review September 26, 2023 10:34
@ghost requested review from ffakenz, locallycompact and ch1bo September 26, 2023 10:34
@ch1bo self-assigned this Sep 26, 2023
@ch1bo (Collaborator) left a comment

I like that there is extensive module documentation; however, I was missing comments in the actual implementation, which made reviewing harder than it could have been.

As indicated by comments in the code, the whole "layering" of this network stack is starting to leak heavily left and right. I also think that we have reached the maximum complexity we can tackle this way, and I doubt that the re-use of the NetworkComponent interface across these layers is helping to tackle it. No need to address this as part of this PR, but I feel the urge to collapse the whole stack into a single layer which just calls out into the individual concerns upon receiving a message / when sending a message.

On the reliability layer itself: I think the idea is fine (assuming I understood the algorithm correctly), but the implementation appears to be very fragile! In particular, access to all these vectors is often unchecked and exceptions are raised left and right if some of the assumptions are not met. As this is network code, I think we should be very defensive when dealing with received messages!

Also, I found the choice for an IntMap surprising. From what I can see the operations on seenMessages only consist of insert (on one end; on broadcast), sequential access (to get the missing messages which must be accessed in sequence), and delete (on the other end; when everyone has seen an index). Hence, a Seq might be rather what we want. Maybe not needed in this PR. What could make sense though is to encapsulate parts of the logic into smaller handles / interfaces:

```
data SentMessages m msg = SentMessages
  { insertMsg :: msg -> m ()
  , getMissing :: Int -> m [msg]
  , removeMsg :: msg -> m ()
  }
```

This should make the code a bit more readable and remove some TVar noise all over the place.
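A handle along these lines could be backed by a `TVar` holding a `Seq` plus the index of its first element. The sketch below is illustrative only, not the merged implementation: since a `Seq` is pruned from the front by index, the per-message `removeMsg` is swapped for a hypothetical index-based `dropThrough`, and it assumes indices grow by one per broadcast starting at 1.

```haskell
import Control.Concurrent.STM
import Data.Foldable (toList)
import Data.Sequence (Seq, (|>))
import qualified Data.Sequence as Seq

-- Hypothetical handle: a variation on the interface proposed above
data SentMessages msg = SentMessages
  { insertMsg :: msg -> STM ()
  , getMissing :: Int -> STM [msg] -- messages with index greater than the given one
  , dropThrough :: Int -> STM ()   -- forget messages everyone has seen
  }

newSentMessages :: STM (SentMessages msg)
newSentMessages = do
  -- (index of the first stored message, stored messages)
  var <- newTVar (1 :: Int, Seq.empty)
  pure
    SentMessages
      { insertMsg = \m ->
          modifyTVar' var (\(lo, s) -> (lo, s |> m))
      , getMissing = \seenUpTo -> do
          (lo, s) <- readTVar var
          pure (toList (Seq.drop (seenUpTo - lo + 1) s))
      , dropThrough = \seenByAll ->
          modifyTVar' var $ \(lo, s) ->
            let n = max 0 (seenByAll - lo + 1)
             in (lo + n, Seq.drop n s)
      }
```

Callers would then only see the three operations, which hides the `TVar` plumbing exactly as the comment suggests.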

Resolved review threads:
  • hydra-node/src/Hydra/Node/Network.hs (3 threads)
  • hydra-node/src/Hydra/Network/Reliability.hs (7 threads)
abailly and others added 25 commits October 4, 2023 09:32
At least not fast enough compared to the time we were giving their
messages to arrive.

Sending a message once every ten seconds and expecting all the
messages to reach the peer in less than 100 seconds does not
always work.
Bug was exposed by running:
```
cabal test hydra-node --test-options '-m Reliability --seed 1054015251'
```

The problem was caused by Bob increasing his local view of received
messages from Alice from 15 to 16 when receiving a Ping from Alice
when, actually, he never received this message 16 before.

As a consequence, Alice would not resend message 16 or, when she
resends message 16, Bob would ignore it anyway as it's expecting
:x
…other

If Alice is lagging behind Bob and Bob is lagging behind Alice then
nobody would resend any message to its peer.

Here we remove one condition to unlock this.
So we should not include ourselves in the `seenMessages` map; otherwise,
in real life, we will never garbage collect.
This is meant to ensure we only try to resend messages whenever the peer is quiescent,
which was the original intent of using Pings in the first place in order to avoid
resending messages too often. The assumption is that disconnections and messages
drop should be few and far between in normal operations and it's therefore fine to
rely on the Ping's roundtrip time to check for peers state.
Timeouts are inherently unreliable, esp. given an arbitrary and
unknown list of messages and an arbitrary ordering of actions. Tests
might fail because one of the peers stops before the other and
therefore fails to send the Pings which would notify the peer that it's
missing messages, or fails to take into account the peer's Pings.

This commit replaces complicated timeout logic with a simple STM-based
check that _both_ peers received all the messages.
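The STM-based check the commit describes can be sketched as follows. This is a hypothetical test helper under assumed names (`waitForAllReceived`, `TVar`s of received messages), not the actual test code: it blocks until both peers' received sets are complete, with no timeout or polling involved.

```haskell
import Control.Concurrent.STM

-- Retry (block) until both peers have received every expected message.
-- 'check' aborts and retries the transaction whenever the condition is
-- False, and the STM runtime re-runs it when either TVar changes.
waitForAllReceived :: Eq msg => [msg] -> TVar [msg] -> TVar [msg] -> STM ()
waitForAllReceived expected aliceReceived bobReceived = do
  a <- readTVar aliceReceived
  b <- readTVar bobReceived
  check (all (`elem` a) expected && all (`elem` b) expected)
```

In a real test suite one would presumably still wrap the `atomically` call in an outer deadline (e.g. `System.Timeout.timeout`) so a genuinely broken run fails instead of hanging.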
This was used for GC messages and will be rewritten later
Also, it throws an error if we do not find ourselves in the list of all parties.
This is absurd given we included ourselves in the list before sorting.
@ghost force-pushed the ensemble/network_model branch from cd64d57 to ecb9522 on October 4, 2023 07:32
ghost commented Oct 4, 2023

No idea why the CI is red, going to bypass and merge it 🤷

@ghost merged commit abf8881 into master Oct 4, 2023
17 of 18 checks passed
@ghost deleted the ensemble/network_model branch October 4, 2023 08:10
5 participants