Context & versions
Seen using version 0.14.0
Steps to reproduce
Process transactions every now and then (we used hydraw)
Get "lucky": an AckSn is missing from a snapshot because one node did not send it / restarted while sending it
Actual behavior
The head becomes "stuck" as the snapshot signature of the restarting party is missing and stays missing. A hydra-node restart will not fix it.
Expected behavior
The head does not become "stuck" at all, or a hydra-node restart fixes it (wishful thinking?)
Hypothesis
We investigated the issue when it happened on Sasha's node on mainnet. Notes are recorded in the logbook.
No detailed stderr logs were available; only the NodeOptions after the BeginEffect of AckSn, plus some errors in the network layer (ouroboros-framework subscription traces), suggest that the hydra-node crashed. We may have seen a PersistenceException on the network-messages trace on stderr when asking for docker logs. This would indicate that the persistence handle used in the Reliability layer failed in the presence of reads and writes from different threads (it is not designed to be thread-safe).
I had a look at the Reliability module and it quite obviously relies on the atomicity of append w.r.t. loadAll from PersistenceIncremental.
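For reference, a minimal sketch of the shape of handle involved (simplified and hypothetical, not the exact hydra-node record), and of why Reliability needs append to be atomic with respect to loadAll:

```haskell
-- Simplified incremental persistence handle (hypothetical, not the exact
-- hydra-node definition): 'append' adds one item to the file, 'loadAll'
-- re-reads everything written so far.
data PersistenceIncremental a m = PersistenceIncremental
  { append  :: a -> m ()
  , loadAll :: m [a]
  }
```

If loadAll can run while an append has written only part of an item, the reader sees a truncated entry, decoding fails, and the node can crash, which would match the suspected PersistenceException.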
Looking at the Persistence module, I noticed that we started using withBinaryFileDurableAtomic, then moved to withBinaryFileDurable, and finally to withBinaryFile, for performance reasons: with durable/atomic writes, the benchmark runs very slowly.
Googling around, I found this interesting SO answer, which says that the guarantees about atomicity of writes w.r.t. reads with O_APPEND are very weak: even though writes are atomic w.r.t. each other, it is perfectly possible that a loadAll sees only partially written data.
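As a small illustration of that failure mode (hypothetical file contents, not an actual hydra-node log): decoding such a file line by line fails as soon as the last line was only partially appended at the time of the read.

```haskell
import Data.Aeson (Value, eitherDecodeStrict)
import qualified Data.ByteString.Char8 as BS

main :: IO ()
main = do
  -- One complete line, followed by a line whose write is only partially
  -- visible to the reader (hypothetical contents).
  let fileContents = BS.pack "{\"msg\":1}\n{\"msg\":"
  mapM_
    (\l -> print (eitherDecodeStrict l :: Either String Value))
    (BS.lines fileContents)
  -- The first line decodes to a Right, the second yields a Left: a loadAll
  -- that decodes every line would fail on the truncated trailing line.
```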
There are tests in PersistenceSpec, but they don't check concurrent access to the file, as this is probably not the intended behaviour anyway.
I wanted to demonstrate the need for atomic reads/writes by writing a test in IOSim, simulating the persistence layer with non-atomic writes to a TVar, and using IOSimPOR to explore schedules until it finds a case where the non-atomic write is problematic, but I wonder whether this is really useful as the case is pretty clear.
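Even without IOSimPOR, the idea can be sketched in plain IO with stm and async (hypothetical names, with a TVar standing in for the file); with an unlucky interleaving, the reader observes a half-written line:

```haskell
import Control.Concurrent.Async (concurrently)
import Control.Concurrent.STM (TVar, atomically, modifyTVar', newTVarIO, readTVarIO)
import Control.Monad (forM_)

-- Model the persistence file as a TVar holding its raw contents.  The write
-- is deliberately split in two, like a write that is only partially visible
-- to a concurrent reader.
appendNonAtomic :: TVar String -> String -> IO ()
appendNonAtomic file line = do
  let (firstHalf, secondHalf) = splitAt (length line `div` 2) line
  atomically $ modifyTVar' file (<> firstHalf)
  atomically $ modifyTVar' file (<> (secondHalf <> "\n"))

-- Poll the "file"; return the first snapshot whose last line is truncated
-- (does not end in '}'), i.e. a read that raced with a half-done append.
watchForPartialRead :: TVar String -> IO (Maybe String)
watchForPartialRead file = go (100000 :: Int)
 where
  go 0 = pure Nothing
  go n = do
    contents <- readTVarIO file
    case reverse (lines contents) of
      lastLine : _ | not (null lastLine), last lastLine /= '}' -> pure (Just lastLine)
      _ -> go (n - 1)

main :: IO ()
main = do
  file <- newTVarIO ""
  (_, partial) <-
    concurrently
      (forM_ [1 :: Int .. 10000] $ \i ->
         appendNonAtomic file ("{\"msg\":" <> show i <> "}"))
      (watchForPartialRead file)
  -- On an unlucky schedule this prints e.g. Just "{\"msg\":42".
  print partial
```

An IOSim/IOSimPOR version would write the same thing against the io-classes type classes instead of concrete IO, so that schedule exploration is deterministic.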
Anyway, we have a few options here:
Implement a PersistenceIncrementalAtomic just for storing messages
Revert the commit that introduced the concurrent read/write (we used to keep the messages in an in-memory cache to prevent exactly that problem of concurrent access to the file)
Add a lock to the existing code to ensure atomicity of access (a sketch of this follows below)
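For that last option, a minimal sketch of what the lock could look like (hypothetical names, not the actual hydra-node code): wrap an existing handle so that append and loadAll are serialised through a single MVar.

```haskell
import Control.Concurrent.MVar (newMVar, withMVar)

-- Same simplified handle shape as in the earlier sketch (hypothetical, not
-- the exact hydra-node record).
data PersistenceIncremental a m = PersistenceIncremental
  { append  :: a -> m ()
  , loadAll :: m [a]
  }

-- Serialise every access to the underlying handle through one MVar, so a
-- loadAll can never observe a half-finished append.
withLocking :: PersistenceIncremental a IO -> IO (PersistenceIncremental a IO)
withLocking inner = do
  lock <- newMVar ()
  pure
    PersistenceIncremental
      { append = \x -> withMVar lock (\() -> append inner x)
      , loadAll = withMVar lock (\() -> loadAll inner)
      }
```

The trade-off is that a loadAll then blocks appends for the duration of a full read of the file.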