Possible durability bug in the presence of write failures #857

Closed
cfcosta opened this issue Sep 2, 2024 · 1 comment · Fixed by #860

Comments


cfcosta commented Sep 2, 2024

The project I'm working on uses a deterministic simulator to verify correctness over long executions, as well as resilience to failures. I'm using redb as the main storage for the node.

Specifically, what I'm testing is:

An operation of the system takes a list of inputs and generates a list of outputs. If the operation fails, the user should be able to retry with the same inputs, without any of them having been marked as spent.
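For context, the operation has roughly this shape against redb (the table definitions, key/value types, and function name here are illustrative placeholders, not my real schema): marking inputs as spent and recording outputs happen inside a single write transaction, so a failed commit must leave the inputs unspent.

use redb::{Database, TableDefinition};

// Illustrative tables; the real schema differs.
const SPENT: TableDefinition<&str, ()> = TableDefinition::new("spent");
const OUTPUTS: TableDefinition<&str, &[u8]> = TableDefinition::new("outputs");

fn apply(db: &Database, inputs: &[&str], outputs: &[(&str, &[u8])]) -> Result<(), redb::Error> {
    let txn = db.begin_write()?;
    {
        let mut spent = txn.open_table(SPENT)?;
        for input in inputs {
            spent.insert(*input, ())?;
        }
        let mut out = txn.open_table(OUTPUTS)?;
        for (key, value) in outputs {
            out.insert(*key, *value)?;
        }
    }
    // If commit fails, none of the inputs should be observable as spent.
    txn.commit()?;
    Ok(())
}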

To do so, I created a new StorageBackend that injects failures into all storage methods:

use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

use redb::backends::InMemoryBackend;
use redb::StorageBackend;

// `StorageBackend` requires `Debug`.
#[derive(Debug)]
pub struct TestBackend {
    failure_rate: f64,
    should_fail: Arc<AtomicBool>,
    inner: Arc<InMemoryBackend>,
}

impl TestBackend {
    // The snippet as originally filed omitted this helper; this is a plausible
    // reconstruction: while `should_fail` is set, return an injected I/O error
    // with probability `failure_rate` (using the `rand` crate).
    fn maybe_fail(&self) -> Result<(), std::io::Error> {
        if self.should_fail.load(Ordering::Relaxed) && rand::random::<f64>() < self.failure_rate {
            return Err(std::io::Error::new(
                std::io::ErrorKind::Other,
                "injected failure",
            ));
        }
        Ok(())
    }
}

impl StorageBackend for TestBackend {
    fn len(&self) -> Result<u64, std::io::Error> {
        self.maybe_fail()?;
        self.inner.len()
    }

    fn read(&self, offset: u64, len: usize) -> Result<Vec<u8>, std::io::Error> {
        self.maybe_fail()?;
        self.inner.read(offset, len)
    }

    fn set_len(&self, len: u64) -> Result<(), std::io::Error> {
        self.maybe_fail()?;
        self.inner.set_len(len)
    }

    fn sync_data(&self, eventual: bool) -> Result<(), std::io::Error> {
        self.maybe_fail()?;
        self.inner.sync_data(eventual)
    }

    fn write(&self, offset: u64, data: &[u8]) -> Result<(), std::io::Error> {
        self.maybe_fail()?;
        self.inner.write(offset, data)
    }
}
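For completeness, this is roughly how the backend gets wired into redb and exercised. Database::builder().create_with_backend is the actual redb API; the table, values, and failure rate are placeholders:

use std::sync::atomic::AtomicBool;
use std::sync::Arc;

use redb::backends::InMemoryBackend;
use redb::{Database, TableDefinition};

const TABLE: TableDefinition<&str, u64> = TableDefinition::new("test");

fn reproduce() -> Result<(), redb::Error> {
    let backend = TestBackend {
        failure_rate: 0.01,
        should_fail: Arc::new(AtomicBool::new(true)),
        inner: Arc::new(InMemoryBackend::new()),
    };
    let db = Database::builder().create_with_backend(backend)?;

    loop {
        let txn = db.begin_write()?;
        {
            let mut table = txn.open_table(TABLE)?;
            table.insert("key", &1)?;
        }
        // Once an injected failure hits `write` or `sync_data`, a later
        // operation dies on the `needs_recovery` assertion instead of
        // returning an error.
        txn.commit()?;
    }
}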

These are the results:

  • Failures on len, set_len, and read are handled gracefully: an error is returned, and the database remains consistent and readable.
  • Failures on write or sync_data cause a panic with this error:
thread 'main' panicked at {path}/redb-2.1.2/src/tree_store/page_store/page_manager.rs:511:9:
assertion failed: !self.needs_recovery.load(Ordering::Acquire)

The Bug?

I'm not entirely sure whether this should be considered a bug. To my mind, storage failures should not crash the application; I would expect an error to be returned instead. Still, panicking is much better than leaving the database inconsistent.

It could be that I'm misusing the database, or that these interfaces were never meant to be implemented this way. If so, I think adding some guidance to the documentation would help a lot.

What I did not test

  • I did not test re-opening a new instance of the database after the panic.
  • I did not remove the assertion from the code to see whether anything else breaks.

Links

My Setup

I managed to reliably trigger the assertion on both an M2 Mac and an AMD machine running NixOS, on both stable and nightly Rust.


cfcosta commented Sep 2, 2024

Update:

I tested re-opening the database (switching the base backend to FileBackend so the same underlying storage could be re-used), and it still accepts writes as expected. I did not check whether previously written keys are still readable, though.
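For reference, the check I skipped would look roughly like this (the file path, table, and keys are placeholders):

use std::fs::OpenOptions;

use redb::backends::FileBackend;
use redb::{Database, ReadableTable, TableDefinition};

const TABLE: TableDefinition<&str, u64> = TableDefinition::new("test");

fn check_after_crash() -> Result<(), Box<dyn std::error::Error>> {
    // Re-open the same file the panicked process was writing to.
    let file = OpenOptions::new().read(true).write(true).open("test.redb")?;
    let db = Database::builder().create_with_backend(FileBackend::new(file)?)?;

    // New writes still succeed (this part I did verify).
    let txn = db.begin_write()?;
    {
        let mut table = txn.open_table(TABLE)?;
        table.insert("after-crash", &1)?;
    }
    txn.commit()?;

    // Keys committed before the failure should still be readable
    // (this is the part I did not verify).
    let read = db.begin_read()?;
    let table = read.open_table(TABLE)?;
    assert!(table.get("before-crash")?.is_some());
    Ok(())
}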
