Possible durability bug in the presence of write failures #857

Closed
cfcosta opened this issue Sep 2, 2024 · 1 comment · Fixed by #860

Comments


cfcosta commented Sep 2, 2024

The project I'm working on uses a deterministic simulator to verify correctness over long executions, as well as resilience to failures. I'm using redb as the main storage for the node.

Specifically, what I'm testing is:

An operation of the system takes a list of inputs and generates a list of outputs. If the operation fails, the user should be able to retry with the same inputs, without any of them having been marked as spent.
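For context, the operation has roughly this shape against redb (the table definitions, key/value types, and function name here are illustrative placeholders, not my real schema): marking inputs as spent and recording outputs happen inside a single write transaction, so a failed commit must leave the inputs unspent.

use redb::{Database, TableDefinition};

// Illustrative tables; the real schema differs.
const SPENT: TableDefinition<&str, ()> = TableDefinition::new("spent");
const OUTPUTS: TableDefinition<&str, &[u8]> = TableDefinition::new("outputs");

fn apply(db: &Database, inputs: &[&str], outputs: &[(&str, &[u8])]) -> Result<(), redb::Error> {
    let txn = db.begin_write()?;
    {
        let mut spent = txn.open_table(SPENT)?;
        for input in inputs {
            spent.insert(*input, ())?;
        }
        let mut out = txn.open_table(OUTPUTS)?;
        for (key, value) in outputs {
            out.insert(*key, *value)?;
        }
    }
    // If commit fails, none of the inputs should be observable as spent.
    txn.commit()?;
    Ok(())
}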

To do so, I created a new StorageBackend that injects failures into all storage methods:

use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

use redb::backends::InMemoryBackend;
use redb::StorageBackend;

// `StorageBackend` requires `Debug`.
#[derive(Debug)]
pub struct TestBackend {
    failure_rate: f64,
    should_fail: Arc<AtomicBool>,
    inner: Arc<InMemoryBackend>,
}

impl TestBackend {
    // The snippet as originally filed omitted this helper; this is a plausible
    // reconstruction: while `should_fail` is set, return an injected I/O error
    // with probability `failure_rate` (using the `rand` crate).
    fn maybe_fail(&self) -> Result<(), std::io::Error> {
        if self.should_fail.load(Ordering::Relaxed) && rand::random::<f64>() < self.failure_rate {
            return Err(std::io::Error::new(
                std::io::ErrorKind::Other,
                "injected failure",
            ));
        }
        Ok(())
    }
}

impl StorageBackend for TestBackend {
    fn len(&self) -> Result<u64, std::io::Error> {
        self.maybe_fail()?;
        self.inner.len()
    }

    fn read(&self, offset: u64, len: usize) -> Result<Vec<u8>, std::io::Error> {
        self.maybe_fail()?;
        self.inner.read(offset, len)
    }

    fn set_len(&self, len: u64) -> Result<(), std::io::Error> {
        self.maybe_fail()?;
        self.inner.set_len(len)
    }

    fn sync_data(&self, eventual: bool) -> Result<(), std::io::Error> {
        self.maybe_fail()?;
        self.inner.sync_data(eventual)
    }

    fn write(&self, offset: u64, data: &[u8]) -> Result<(), std::io::Error> {
        self.maybe_fail()?;
        self.inner.write(offset, data)
    }
}
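For completeness, this is roughly how the backend gets wired into redb and exercised. Database::builder().create_with_backend is the actual redb API; the table, values, and failure rate are placeholders:

use std::sync::atomic::AtomicBool;
use std::sync::Arc;

use redb::backends::InMemoryBackend;
use redb::{Database, TableDefinition};

const TABLE: TableDefinition<&str, u64> = TableDefinition::new("test");

fn reproduce() -> Result<(), redb::Error> {
    let backend = TestBackend {
        failure_rate: 0.01,
        should_fail: Arc::new(AtomicBool::new(true)),
        inner: Arc::new(InMemoryBackend::new()),
    };
    let db = Database::builder().create_with_backend(backend)?;

    loop {
        let txn = db.begin_write()?;
        {
            let mut table = txn.open_table(TABLE)?;
            table.insert("key", &1)?;
        }
        // Once an injected failure hits `write` or `sync_data`, a later
        // operation dies on the `needs_recovery` assertion instead of
        // returning an error.
        txn.commit()?;
    }
}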

These are the results:

  • Failures on len, set_len, and read are handled gracefully: an error is returned, and the database remains consistent and readable.
  • Failures on write or sync_data cause a panic with this error:
thread 'main' panicked at {path}/redb-2.1.2/src/tree_store/page_store/page_manager.rs:511:9:
assertion failed: !self.needs_recovery.load(Ordering::Acquire)

The Bug?

I'm not entirely sure whether this should be considered a bug. To my mind, storage failures should not crash the application; I would expect an error to be returned instead. Still, panicking is much better than leaving the database inconsistent.

It could be that I'm misusing the database, or that these interfaces were never meant to be implemented this way. If so, I think adding some guidance to the documentation would help a lot.

What I did not test

  • I did not test re-opening a new instance of the database after the panic.
  • I did not remove the assertion from the code to see whether anything else breaks.

Links

My Setup

I managed to reliably trigger the assertion on both an M2 Mac and an AMD machine running NixOS, on both stable and nightly Rust.


cfcosta commented Sep 2, 2024

Update:

I tested re-opening the database (switching the base backend to FileBackend so the same underlying storage could be re-used), and it still accepts writes as expected. I did not check whether previously written keys are still readable, though.
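For reference, the check I skipped would look roughly like this (the file path, table, and keys are placeholders):

use std::fs::OpenOptions;

use redb::backends::FileBackend;
use redb::{Database, ReadableTable, TableDefinition};

const TABLE: TableDefinition<&str, u64> = TableDefinition::new("test");

fn check_after_crash() -> Result<(), Box<dyn std::error::Error>> {
    // Re-open the same file the panicked process was writing to.
    let file = OpenOptions::new().read(true).write(true).open("test.redb")?;
    let db = Database::builder().create_with_backend(FileBackend::new(file)?)?;

    // New writes still succeed (this part I did verify).
    let txn = db.begin_write()?;
    {
        let mut table = txn.open_table(TABLE)?;
        table.insert("after-crash", &1)?;
    }
    txn.commit()?;

    // Keys committed before the failure should still be readable
    // (this is the part I did not verify).
    let read = db.begin_read()?;
    let table = read.open_table(TABLE)?;
    assert!(table.get("before-crash")?.is_some());
    Ok(())
}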
