
Node fails to broadcast AckSn / PersistenceException #1202

Closed · ch1bo opened this issue Dec 11, 2023 · 1 comment · Fixed by #1211
Labels: bug 🐛 Something isn't working

ch1bo commented Dec 11, 2023

Context & versions

Seen using version 0.14.0

Steps to reproduce

  1. Open a distributed Hydra Head (we used 5 parties)
  2. Process transactions every now and then (we used hydraw)
  3. Get "lucky": an AckSn is missing from a snapshot because one node did not send it or restarted while sending it

Actual behavior

The head becomes "stuck" as the snapshot signature of the restarting party is missing and stays missing. A hydra-node restart does not fix it.

Expected behavior

The head does not become "stuck" at all, or a hydra-node restart fixes it (wishful thinking?)

Hypothesis

We investigated the issue when it happened on Sasha's node on mainnet. Notes are recorded in the logbook.
No detailed stderr logs were available; only the NodeOptions after the BeginEffect of AckSn plus some errors in the network layer (ouroboros-framework subscription traces) suggest that the hydra-node crashed. We may have seen a PersistenceException on the network-messages trace on stderr when asking for docker logs. This would indicate that the persistence handle used in the Reliability layer failed when read from and written to by different threads (it is not designed to be thread-safe).
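For illustration, here is a minimal, self-contained sketch of the suspected race (not the hydra-node API; the file name and line-based JSON format are assumptions). One thread appends JSON lines while another repeatedly loads and decodes the whole file; a reader that catches a line mid-write fails to decode it, which would surface just like a PersistenceException. Whether the torn read actually triggers depends on the OS and filesystem buffering.

```haskell
-- Hypothetical stand-in for a line-based, incremental persistence handle.
-- It is not thread-safe: append and loadAll touch the same file without coordination.
import Control.Concurrent.Async (race_)
import Control.Monad (forM_, forever, when)
import Data.Aeson (decodeStrict, encode)
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as BS8
import qualified Data.ByteString.Lazy as BSL

append :: FilePath -> Int -> IO ()
append fp msg = BS.appendFile fp (BSL.toStrict (encode msg) <> BS8.pack "\n")

loadAll :: FilePath -> IO [Maybe Int]
loadAll fp = map decodeStrict . BS8.lines <$> BS.readFile fp

main :: IO ()
main = do
  let fp = "network-messages"
  writeFile fp ""
  race_
    -- "network" thread persisting every message it sends/receives
    (forM_ [1 :: Int .. 100000] (append fp))
    -- concurrent reader, e.g. the resending logic loading the whole log
    (forever $ do
       msgs <- loadAll fp
       -- Nothing means a line was observed while only partially written:
       -- decoding fails, analogous to the PersistenceException seen here.
       when (Nothing `elem` msgs) (fail "torn read: undecodable line"))
```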

ch1bo added the bug 🐛 Something isn't working label on Dec 11, 2023

ghost commented Dec 11, 2023

  1. I had a look at the Reliability module and it quite obviously relies on append being atomic with respect to loadAll in PersistenceIncremental.
  2. Looking at the Persistence module, I noticed that we started with withBinaryFileDurableAtomic, then moved to withBinaryFileDurable, and finally to withBinaryFile, for performance reasons: with the durable/atomic variants the benchmark runs very slowly (see the sketch after this list).
  3. Googling around I found this interesting SO answer which basically says that the guarantees about atomicity of writes w.r.t. reads with O_APPEND are very weak: even though writes are atomic w.r.t. each other, it is perfectly possible that a loadAll sees only partially written data.
  4. There are tests in PersistenceSpec, but they don't check for concurrent access to the file, as that is probably not the intended behaviour anyway.
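For reference, a rough sketch of the three flavours mentioned in point 2, assuming they are the helpers from the rio package's RIO.File; the comments paraphrase rio's documented guarantees, and the exact behaviour per IOMode is in those docs.

```haskell
import qualified Data.ByteString.Char8 as BS8
import RIO.File (withBinaryFileDurable, withBinaryFileDurableAtomic)
import System.IO (IOMode (AppendMode), withBinaryFile)

-- Fast: plain buffered handle, no durability and no atomicity w.r.t. readers.
appendPlain :: FilePath -> BS8.ByteString -> IO ()
appendPlain fp line =
  withBinaryFile fp AppendMode $ \h -> BS8.hPutStrLn h line

-- Durable: syncs to disk before returning, so the write survives a crash,
-- but a concurrent reader can still observe a half-written line.
appendDurable :: FilePath -> BS8.ByteString -> IO ()
appendDurable fp line =
  withBinaryFileDurable fp AppendMode $ \h -> BS8.hPutStrLn h line

-- Durable and atomic: goes through a temporary file that replaces the
-- original on completion, so readers see old or new content but never a
-- torn write; also the reason the benchmarks slow down so much.
appendDurableAtomic :: FilePath -> BS8.ByteString -> IO ()
appendDurableAtomic fp line =
  withBinaryFileDurableAtomic fp AppendMode $ \h -> BS8.hPutStrLn h line
```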

I wanted to demonstrate the need for atomic read/write by writing a test in IOSim, simulating the persistence layer with non-atomic writes to a TVar, and using IOSimPOR to explore schedules until it finds a case where the non-atomic write is problematic, but I wonder whether this is really useful as the case is pretty clear.
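A plain stm/async sketch of that idea is below; the Persistence record is a simplified stand-in (not hydra-node's PersistenceIncremental), and the io-sim version would run the same code over the io-classes type classes so that IOSimPOR (exploreSimTrace) can enumerate schedules instead of relying on a lucky interleaving.

```haskell
{-# LANGUAGE NamedFieldPuns #-}

import Control.Concurrent.Async (concurrently)
import Control.Concurrent.STM (atomically, modifyTVar', newTVarIO, readTVarIO)

-- Simplified stand-in for the persistence handle.
data Persistence a = Persistence
  { append  :: a -> IO ()
  , loadAll :: IO [a]
  }

-- Deliberately non-atomic append: the log is updated in two separate
-- transactions, first with a placeholder and then with the real value, so a
-- concurrent loadAll can observe the intermediate, "partially written" state.
nonAtomicPersistence :: a -> IO (Persistence a)
nonAtomicPersistence placeholder = do
  var <- newTVarIO []
  pure
    Persistence
      { append = \x -> do
          atomically $ modifyTVar' var (<> [placeholder]) -- first half of the write
          atomically $ modifyTVar' var (replaceLast x)    -- second half of the write
      , loadAll = readTVarIO var
      }
 where
  replaceLast x xs = init xs <> [x]

main :: IO ()
main = do
  Persistence{append, loadAll} <- nonAtomicPersistence (-1 :: Int)
  -- One thread appends while another loads; under some schedules the reader
  -- sees the placeholder, i.e. an entry that is only half written.
  (_, seen) <- concurrently (mapM_ append [1 .. 1000]) loadAll
  print (filter (< 0) seen)
```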

Anyway, we have a couple of options here:

  • Implement a PersistenceIncrementalAtomic just for storing messages
  • Revert the commit that introduced the concurrent read/write (we used to keep the messages in a memory cache precisely to avoid concurrent access to the file)
  • Add a lock to the existing code to ensure atomicity of access (see the sketch below)
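For the last option, a minimal sketch of what such a lock could look like, reusing the simplified Persistence record from the sketch above (hydra-node's actual PersistenceIncremental has a different shape). Note that this only serialises access within one process; it does not make individual writes durable or crash-atomic.

```haskell
{-# LANGUAGE NamedFieldPuns #-}

import Control.Concurrent.MVar (newMVar, withMVar)

-- Same simplified stand-in as above.
data Persistence a = Persistence
  { append  :: a -> IO ()
  , loadAll :: IO [a]
  }

-- Wrap an existing handle so that append and loadAll never run concurrently:
-- every call takes the same MVar, serialising access to the underlying file.
withLock :: Persistence a -> IO (Persistence a)
withLock Persistence{append, loadAll} = do
  lock <- newMVar ()
  pure
    Persistence
      { append = \x -> withMVar lock (\() -> append x)
      , loadAll = withMVar lock (\() -> loadAll)
      }
```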
