Make persistence incremental thread-safe for Network Reliability layer #1211
Conversation
Transactions Costs
Sizes and execution budgets for Hydra protocol transactions. Note that unlisted parameters are currently using defaults.

Script summary
Cost of Init Transaction
Cost of Commit Transaction. Uses ada-only outputs for better comparability.
Cost of CollectCom Transaction
Cost of Close Transaction
Cost of Contest Transaction
Cost of Abort Transaction. Some variation because of the random mixture of still-initial and already-committed outputs.
Cost of FanOut Transaction. Involves spending the head output and burning head tokens. Uses ada-only UTxO for better comparability.

End-To-End Benchmark Results
This page is intended to collect the latest end-to-end benchmark results produced by Hydra's Continuous Integration system. Please take those results with a grain of salt, as they are currently produced from very limited cloud VMs and not controlled hardware. Instead of focusing on the absolute results, the emphasis should be on relative results, e.g. how the timings for a scenario evolve as the code changes.
Generated at 2023-12-15 10:50:09.935283054 UTC

Baseline Scenario
This should fix the bug, and I added some minor nitpicks myself.
  (forever $ threadDelay 0.01 >> loadAll)
  (forM_ moreItems $ \item -> append item >> threadDelay 0.01)
  `shouldThrow` \case
    IncorrectAccessException{} -> True
Funnily enough, I had something like that initially, but I found it more legible to give the race_ ... block a proper name, stating explicitly what it's supposed to do instead of requiring the reader to unpack the whole thing.
This small experiment demonstrates how to detect "race conditions" when an atomic result is expected: the idea is to use a TVar to simulate file operations and to show the consequences of non-atomic writes and reads on message resending.
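A minimal, self-contained sketch of that experiment, using plain base/async/stm rather than Hydra's io-sim prelude; the two-TVar "file" and all names below are made up for illustration. The writer updates the simulated file in two separate transactions, so a concurrent reader can observe an inconsistent snapshot:

```haskell
import Control.Concurrent (threadDelay)
import Control.Concurrent.Async (race_)
import Control.Concurrent.STM (atomically, newTVarIO, readTVar, writeTVar)
import Control.Monad (forM_, forever, when)

-- The "file" is simulated by two TVars: one for the number of items written
-- so far, one for the items themselves. Updating them in two separate
-- 'atomically' blocks mimics a non-atomic file write.
main :: IO ()
main = do
  count <- newTVarIO (0 :: Int)
  items <- newTVarIO ([] :: [Int])
  race_
    -- Writer: append items non-atomically (two separate transactions).
    ( forM_ [1 .. 1000] $ \i -> do
        atomically $ readTVar count >>= writeTVar count . succ
        threadDelay 10 -- widen the window in which the state is inconsistent
        atomically $ readTVar items >>= writeTVar items . (i :)
    )
    -- Reader: repeatedly "load all" and check the invariant that the count
    -- always matches the number of stored items.
    ( forever $ do
        (n, xs) <- atomically $ (,) <$> readTVar count <*> readTVar items
        when (n /= length xs) $
          error ("non-atomic write observed: count=" <> show n <> " items=" <> show (length xs))
        threadDelay 10
    )
```

Running this should report an inconsistent snapshot almost immediately; moving both writes into a single atomically block makes the invariant hold.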
This test is not very satisfying, but for now I cannot come up with a better idea. It shows that with large messages (~10kB) an exception is thrown, but the peers can get stuck, which is a different problem from the one observed in production.
We try to improve the naming of the log messages' fields, make it more consistent between different messages, and replace field accessor functions and positional construction with proper field name punning.
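As a hedged illustration of that kind of change (the ReliabilityLog type and its fields below are invented, not the actual log messages touched by this PR), field name punning via NamedFieldPuns replaces positional construction and accessor functions:

```haskell
{-# LANGUAGE NamedFieldPuns #-}

-- A made-up log message type, for illustration only.
data ReliabilityLog
  = Resending {missing :: [Int], acknowledged :: [Int], localCounter :: Int}
  | Received {acknowledged :: [Int], localCounter :: Int}
  deriving (Show)

-- Before: positional construction; the reader has to remember the field order.
resendingOld :: ReliabilityLog
resendingOld = Resending [1, 2] [3] 42

-- Before: accessor functions give no indication at the use site of which
-- constructor the fields belong to.
describeOld :: ReliabilityLog -> String
describeOld msg = "acks " <> show (acknowledged msg) <> " at " <> show (localCounter msg)

-- After: field name punning; bindings named after the fields make the intent
-- explicit both when constructing and when pattern matching.
resendingNew :: [Int] -> [Int] -> Int -> ReliabilityLog
resendingNew missing acknowledged localCounter =
  Resending{missing, acknowledged, localCounter}

describeNew :: ReliabilityLog -> String
describeNew Resending{missing, localCounter} =
  "resending " <> show missing <> " at " <> show localCounter
describeNew Received{acknowledged} = "received acks " <> show acknowledged
```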
We used to do that, but then moved to always reloading from file, which led to race conditions when reading/writing concurrently and to incorrect deserialisation of messages. It seems both safer and faster to use a local cache and only reload messages at the start of the network layer.
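A rough sketch of what such a cache could look like, assuming a simplified store with line-based Show/Read serialisation; the names and shape here are illustrative, not the PR's actual PersistenceIncremental. Messages are loaded from disk once when the network layer starts, appends go to both the file and the in-memory cache, and resending reads only from the cache:

```haskell
import Control.Concurrent.STM (atomically, modifyTVar', newTVarIO, readTVarIO)
import qualified Data.ByteString.Char8 as BS

-- Simplified stand-in for a persistence handle: the file is read once at
-- startup; afterwards all reads go through the in-memory cache.
data MessageStore msg = MessageStore
  { appendMessage :: msg -> IO ()
  , cachedMessages :: IO [msg]
  }

newMessageStore :: (Show msg, Read msg) => FilePath -> IO (MessageStore msg)
newMessageStore fp = do
  -- Load everything from disk once, at start of the network layer.
  -- (For brevity this assumes the file exists; real code would handle a
  -- missing file and use proper serialisation.)
  existing <- map read . lines . BS.unpack <$> BS.readFile fp
  cache <- newTVarIO existing
  pure
    MessageStore
      { appendMessage = \msg -> do
          -- Append to the file *and* the in-memory cache.
          BS.appendFile fp (BS.pack (show msg <> "\n"))
          atomically $ modifyTVar' cache (<> [msg])
      , cachedMessages = readTVarIO cache
      }
```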
This PR attempts to fix #1202 by removing concurrent loading and appending of messages, and using an in-memory cache to resend messages. This is achieved by preventing loading from a different thread in PersistenceIncremental once we have started appending to it.

This is a bit lame, and there are actually more problems lurking in this persistence layer, as we cannot guarantee the vector clock indices are always in sync with the messages we persist. We should start thinking of moving to a persistence backend where we can guarantee atomicity of updates to the vector clock and the persisted message.
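For illustration, a simplified sketch of the guard described above; the IncorrectAccessException name follows the test snippet earlier in this conversation, but the rest (the handle shape and the ThreadId-based check) is an assumption, not the PR's actual implementation:

```haskell
import Control.Concurrent (ThreadId, myThreadId)
import Control.Concurrent.STM (atomically, newTVarIO, readTVar, writeTVar)
import Control.Exception (Exception, throwIO)

newtype IncorrectAccessException = IncorrectAccessException String
  deriving (Show)

instance Exception IncorrectAccessException

data PersistenceIncremental a = PersistenceIncremental
  { append :: a -> IO ()
  , loadAll :: IO [a]
  }

-- Wrap an existing persistence handle so that 'loadAll' is rejected once
-- another thread has started appending.
guardedPersistence :: PersistenceIncremental a -> IO (PersistenceIncremental a)
guardedPersistence PersistenceIncremental{append = doAppend, loadAll = doLoadAll} = do
  appendingThread <- newTVarIO (Nothing :: Maybe ThreadId)
  pure
    PersistenceIncremental
      { append = \x -> do
          tid <- myThreadId
          -- Remember which thread started appending.
          atomically $ writeTVar appendingThread (Just tid)
          doAppend x
      , loadAll = do
          tid <- myThreadId
          appender <- atomically (readTVar appendingThread)
          case appender of
            Just other | other /= tid ->
              throwIO (IncorrectAccessException "loadAll called after another thread started appending")
            _ -> doLoadAll
      }
```

Tracking the appending thread this way keeps loadAll usable at startup, before any append has happened, while concurrent loading afterwards fails fast instead of racing the file writes.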