Context & versions
Seen using version 0.14.0
Steps to reproduce
Process transactions every now and then (we used hydraw)
Get "lucky": an AckSn is missing from a snapshot because one node did not send it / restarted while sending it
Actual behavior
The head becomes "stuck" as the snapshot signature of the restarting party is missing and stays missing. A hydra-node restart will not fix it.
Expected behavior
The head does not become "stuck" at all, or a hydra-node restart fixes it (wishful thinking?)
Hypothesis
We investigated the issue when it happened on Sasha's node on mainnet. Notes are recorded in the logbook.
No detailed stderr logs were available; only the NodeOptions after the BeginEffect of AckSn, plus some errors in the network layer (ouroboros-framework subscription traces), suggest that the hydra-node crashed. We may have seen a PersistenceException on the network-messages trace on stderr when asking for docker logs. This would indicate that the persistence handle used in the Reliability layer failed in the presence of reads and writes from different threads (it is not designed to be thread-safe).
I had a look at the Reliability module and it quite obviously relies on the atomicity of append w.r.t. loadAll from PersistenceIncremental.
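For reference, a minimal sketch of the shape of handle involved (simplified and hypothetical, not the exact hydra-node record), and of why Reliability needs append to be atomic with respect to loadAll:

```haskell
-- Simplified incremental persistence handle (hypothetical, not the exact
-- hydra-node definition): 'append' adds one item to the file, 'loadAll'
-- re-reads everything written so far.
data PersistenceIncremental a m = PersistenceIncremental
  { append  :: a -> m ()
  , loadAll :: m [a]
  }
```

If loadAll can run while an append has written only part of an item, the reader sees a truncated entry, decoding fails, and the node can crash, which would match the suspected PersistenceException.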
Looking at the Persistence module, I noticed that we started using withBinaryFileDurableAtomic, then moved to withBinaryFileDurable, and finally to withBinaryFile, for performance reasons: with durable/atomic writes, the benchmark runs very slowly.
Googling around, I found this interesting SO answer, which says that the guarantees about atomicity of writes w.r.t. reads with O_APPEND are very weak: even though writes are atomic w.r.t. each other, it is perfectly possible that a loadAll sees only partially written data.
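As a small illustration of that failure mode (hypothetical file contents, not an actual hydra-node log): decoding such a file line by line fails as soon as the last line was only partially appended at the time of the read.

```haskell
import Data.Aeson (Value, eitherDecodeStrict)
import qualified Data.ByteString.Char8 as BS

main :: IO ()
main = do
  -- One complete line, followed by a line whose write is only partially
  -- visible to the reader (hypothetical contents).
  let fileContents = BS.pack "{\"msg\":1}\n{\"msg\":"
  mapM_
    (\l -> print (eitherDecodeStrict l :: Either String Value))
    (BS.lines fileContents)
  -- The first line decodes to a Right, the second yields a Left: a loadAll
  -- that decodes every line would fail on the truncated trailing line.
```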
There are tests in PersistenceSpec, but they don't check concurrent access to the file, as this is probably not the intended behaviour anyway.
I wanted to demonstrate the need for atomic reads/writes by writing a test in IOSim, simulating the persistence layer with non-atomic writes to a TVar, and using IOSimPOR to explore schedules until it finds a case where the non-atomic write is problematic, but I wonder whether this is really useful as the case is pretty clear.
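Even without IOSimPOR, the idea can be sketched in plain IO with stm and async (hypothetical names, with a TVar standing in for the file); with an unlucky interleaving, the reader observes a half-written line:

```haskell
import Control.Concurrent.Async (concurrently)
import Control.Concurrent.STM (TVar, atomically, modifyTVar', newTVarIO, readTVarIO)
import Control.Monad (forM_)

-- Model the persistence file as a TVar holding its raw contents.  The write
-- is deliberately split in two, like a write that is only partially visible
-- to a concurrent reader.
appendNonAtomic :: TVar String -> String -> IO ()
appendNonAtomic file line = do
  let (firstHalf, secondHalf) = splitAt (length line `div` 2) line
  atomically $ modifyTVar' file (<> firstHalf)
  atomically $ modifyTVar' file (<> (secondHalf <> "\n"))

-- Poll the "file"; return the first snapshot whose last line is truncated
-- (does not end in '}'), i.e. a read that raced with a half-done append.
watchForPartialRead :: TVar String -> IO (Maybe String)
watchForPartialRead file = go (100000 :: Int)
 where
  go 0 = pure Nothing
  go n = do
    contents <- readTVarIO file
    case reverse (lines contents) of
      lastLine : _ | not (null lastLine), last lastLine /= '}' -> pure (Just lastLine)
      _ -> go (n - 1)

main :: IO ()
main = do
  file <- newTVarIO ""
  (_, partial) <-
    concurrently
      (forM_ [1 :: Int .. 10000] $ \i ->
         appendNonAtomic file ("{\"msg\":" <> show i <> "}"))
      (watchForPartialRead file)
  -- On an unlucky schedule this prints e.g. Just "{\"msg\":42".
  print partial
```

An IOSim/IOSimPOR version would write the same thing against the io-classes type classes instead of concrete IO, so that schedule exploration is deterministic.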
Anyway, we have a few options here:
Implement a PersistenceIncrementalAtomic just for storing messages
Revert the commit that introduced the concurrent read/write (we used to keep the messages in an in-memory cache to prevent exactly that problem of concurrent access to the file)
Add a lock to the existing code to ensure atomicity of access (a sketch of this follows below)
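For that last option, a minimal sketch of what the lock could look like (hypothetical names, not the actual hydra-node code): wrap an existing handle so that append and loadAll are serialised through a single MVar.

```haskell
import Control.Concurrent.MVar (newMVar, withMVar)

-- Same simplified handle shape as in the earlier sketch (hypothetical, not
-- the exact hydra-node record).
data PersistenceIncremental a m = PersistenceIncremental
  { append  :: a -> m ()
  , loadAll :: m [a]
  }

-- Serialise every access to the underlying handle through one MVar, so a
-- loadAll can never observe a half-finished append.
withLocking :: PersistenceIncremental a IO -> IO (PersistenceIncremental a IO)
withLocking inner = do
  lock <- newMVar ()
  pure
    PersistenceIncremental
      { append = \x -> withMVar lock (\() -> append inner x)
      , loadAll = withMVar lock (\() -> loadAll inner)
      }
```

The trade-off is that a loadAll then blocks appends for the duration of a full read of the file.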