Add set_quick_repair()

This adds a new quick-repair mode, which gives instant recovery after a crash at the cost of slower commits. To make this work, each commit with quick-repair enabled needs to save the allocator state somewhere. We can't use the region headers, because we'd be overwriting them in place; we might crash partway through the overwrite, and then we'd need a full repair. So we instead save the allocator state to a new table in the system tree. Writing to the table is slightly tricky, because it needs to be done without allocating (see below), but other than that it's a perfectly ordinary transactional write with all the usual guarantees.

The other requirement to avoid full repair is knowing whether the last transaction used 2-phase commit. For this, we add a new two_phase_commit bit to the god byte, which is always updated atomically along with swapping the primary bit. Old redb versions will ignore the new flag when reading and clear it when writing, which is exactly what we want.

This turns out to also fix a longstanding bug where 2-phase commit hasn't been providing any security benefit at all. The checksum forgery attack described in the documentation for 1-phase commit actually works equally well against 2-phase commit! The problem is that even though 2-phase commit guarantees the primary is valid, redb ignores the primary flag when repairing. It always picks whichever commit slot is newer, as long as the checksum is valid. So if you crash partway through a commit, it'll try to recover using the partially-written secondary rather than the fully-written primary, regardless of the commit strategy.

The fix for this is exactly the two_phase_commit bit described above. After a crash, we check whether the last transaction used 2-phase commit; if so, we only look at the primary (which is guaranteed to be valid) and ignore the secondary. Quick-repair needs this check anyway for safety, so we get the bug fix for free.

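As a rough illustration of the god-byte change described above, here is a minimal sketch. It is not redb's actual code: the constant values, the function names, and the convention that the primary bit's value equals the slot index are all assumptions made for the example. It shows the primary bit and the two_phase_commit flag being computed together so they can be stored with one single-byte write, and how recovery can ignore the secondary slot when the flag is set.

```rust
// Hypothetical bit positions, following the order listed in docs/design.md
// (first bit, second bit, third bit). redb's real constants may differ.
const PRIMARY_BIT: u8 = 1;
const RECOVERY_REQUIRED: u8 = 1 << 1;
const TWO_PHASE_COMMIT: u8 = 1 << 2;

/// Compute the god byte for a finished commit: swap the primary slot and
/// record the commit strategy, leaving every other flag untouched. Because
/// the result is a single byte, both changes become visible together.
fn god_byte_after_commit(old: u8, two_phase: bool) -> u8 {
    let mut new = old ^ PRIMARY_BIT;
    if two_phase {
        new |= TWO_PHASE_COMMIT;
    } else {
        new &= !TWO_PHASE_COMMIT;
    }
    new
}

/// Pick the commit slot that recovery should trust (assumed: bit value ==
/// slot index). With the two_phase_commit flag set, only the primary is
/// considered. Otherwise this models the behaviour from the commit message:
/// the secondary wins if it is newer and its checksum is valid.
fn recovery_slot(god_byte: u8, secondary_is_newer_and_valid: bool) -> u8 {
    let primary = god_byte & PRIMARY_BIT;
    if god_byte & TWO_PHASE_COMMIT != 0 {
        primary
    } else if secondary_is_newer_and_valid {
        primary ^ 1
    } else {
        primary
    }
}
```

In this model a forged or partially written secondary can only be selected when the last commit did not use 2-phase commit, which is the property the bug fix is after.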
To write to the allocator state table without allocating, I've introduced a new insert_inplace() function. It's similar to insert_reserve(), but more general and maybe simpler. To use it, you have to first do an ordinary insert() with your desired key and a value of the appropriate length; then later in the same transaction you can call insert_inplace() to replace the value with a new one. Unlike insert_reserve(), this works with values that don't implement MutInPlaceValue, and it lets you hold multiple reservations simultaneously.

insert_inplace() could be safely exposed to users, but I don't think there's any reason to. Since it doesn't give you a mutable reference, there's no benefit over insert() unless you're storing data that cares about its own position in the database. So for now it's private, and I haven't bothered making a new error type for it; it just panics if you don't satisfy the preconditions.

The fuzzer is perfect for testing quick-repair, because it can simulate a crash, reopen the database (using quick-repair if possible), and then verify that the resulting allocator state exactly matches what would happen if it ran a full repair. I've modified the fuzzer to generate quick-repair commits in addition to ordinary commits.
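The insert()-then-insert_inplace() contract described above can be modelled with a toy table. This is only a sketch of the preconditions, not the real redb function: `ToyTable`, its fields, and the `get` helper are hypothetical, and the toy does not model transactions or on-disk pages. The `allocations` counter stands in for "this call may allocate".

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a table whose values occupy fixed-size slots.
struct ToyTable {
    slots: BTreeMap<String, Vec<u8>>,
    /// Counts calls that may allocate; insert_inplace must never bump this.
    allocations: usize,
}

impl ToyTable {
    fn new() -> Self {
        ToyTable { slots: BTreeMap::new(), allocations: 0 }
    }

    /// Ordinary insert: reserves a slot of `value.len()` bytes for `key`.
    fn insert(&mut self, key: &str, value: &[u8]) {
        self.allocations += 1;
        self.slots.insert(key.to_string(), value.to_vec());
    }

    /// Overwrite an existing value without allocating. Like the private
    /// function described in the commit message, this panics if the
    /// preconditions are not met instead of returning an error.
    fn insert_inplace(&mut self, key: &str, value: &[u8]) {
        let slot = self
            .slots
            .get_mut(key)
            .expect("key must already have been inserted");
        assert_eq!(slot.len(), value.len(), "value must match the reserved length");
        slot.copy_from_slice(value);
    }

    fn get(&self, key: &str) -> Option<&[u8]> {
        self.slots.get(key).map(|v| v.as_slice())
    }
}
```

A caller would first `insert` a placeholder of the final length for each key it needs, do the rest of its work, and only then call `insert_inplace` for each key. Because each reservation is just an existing entry, several can be outstanding at once.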
cberner · Nov 17, 2024 · c66447a
1 parent f29c131
commit c66447a
Showing 11 changed files with 588 additions and 99 deletions.
diff --git a/docs/design.md b/docs/design.md
@@ -88,12 +88,16 @@ controls which transaction pointer is the primary.
 `magic number` must be set to the ASCII letters 'redb' followed by 0x1A, 0x0A, 0xA9, 0x0D, 0x0A. This sequence is
 inspired by the PNG magic number.
 
-`god byte`, so named because this byte controls the state of the entire database, is a bitfield containing two flags:
+`god byte`, so named because this byte controls the state of the entire database, is a bitfield containing three flags:
 * first bit: `primary_bit` flag which indicates whether transaction slot 0 or transaction slot 1 contains the latest commit.
-  redb relies on the fact that this is a single bit to perform atomic commits.
-* second bit: `recovery_required` flag, if set then the recovery process must be run when opening the database.
-  During the recovery process, the region tracker and regional allocator states -- described below -- are reconstructed
-  by walking the btree from all active roots.
+* second bit: `recovery_required` flag, if set then the recovery process must be run when opening the database. This can be
+  a full repair, in which the region tracker and regional allocator states -- described below -- are reconstructed by walking
+  the btree from all active roots, or a quick-repair, in which the state is simply loaded from the allocator state table.
+* third bit: `two_phase_commit` flag, which indicates whether the transaction in the primary slot was written using 2-phase
+  commit. If so, the primary slot is guaranteed to be valid, and repair won't look at the secondary slot. This flag is always
+  updated atomically along with the primary bit.
+
+redb relies on the fact that this is a single byte to perform atomic commits.
 
 `page size` is the size of a redb page in bytes
 
@@ -155,7 +159,9 @@ changed during an upgrade.
 
 ### Region tracker
 The region tracker is an array of `BtreeBitmap`s that tracks the page orders which are free in each region.
-It is stored in a page in the data section of a region:
+There are two different places it can be stored: on shutdown, it's written to a page in the data section of
+a region, and when making a commit with quick-repair enabled, it's written to an entry in the allocator state
+table. The former is valid only after a clean shutdown; the latter is usable even after a crash.
 ```
 <-------------------------------------------- 8 bytes ------------------------------------------->
 ==================================================================================================
@@ -216,6 +222,11 @@ range has been allocated
 * n bytes: free index data
 * n bytes: allocated data
 
+Like the region tracker, there are two different places where the regional allocator state can be
+stored. On shutdown, it's written to the region header as described above, and when making a commit
+with quick-repair enabled, it's written to an entry in the allocator state table. The former is valid
+only after a clean shutdown; the latter is usable even after a crash.
+
 ```
 <-------------------------------------------- 8 bytes ------------------------------------------->
 ==================================================================================================
@@ -461,6 +472,12 @@ To repair the database after an unclean shutdown we must:
 2) Update the allocator state, so that it is consistent with all the database roots in the above
    transaction
 
+If the last commit before the crash had quick-repair enabled, then these are both trivial. The
+primary commit slot is guaranteed to be valid, because it was written using 2-phase commit, and
+the corresponding allocator state is stored in the allocator state table.
+
+Otherwise, we need to perform a full repair:
+
 For (1), if the primary commit slot is invalid we switch to the secondary slot.
 
 For (2), we rebuild the allocator state by walking the following trees and marking all referenced

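Read together with the src/db.rs hunks in this commit, the repair section above amounts to a small decision table. The sketch below is a simplified reading of those hunks, not the actual implementation: `CrashState`, `RepairPlan`, and `plan_repair` are hypothetical names, and `allocator_state_loadable` folds "the allocator state table exists" and "its contents could be loaded" into one flag.

```rust
#[derive(Debug, PartialEq)]
enum RepairPlan {
    /// Load the allocator state from the allocator state table.
    Quick,
    /// Walk the trees starting from the primary commit slot.
    FullFromPrimary,
    /// Primary is invalid and was not protected by 2-phase commit.
    FullFromSecondary,
    /// Primary is invalid even though 2-phase commit guarantees it is valid.
    Corrupted,
}

/// Hypothetical summary of what is known when opening after a crash.
struct CrashState {
    two_phase_commit: bool,
    primary_checksums_valid: bool,
    allocator_state_loadable: bool,
}

fn plan_repair(s: &CrashState) -> RepairPlan {
    // Quick-repair requires both the 2-phase commit flag and usable saved state.
    if s.two_phase_commit && s.allocator_state_loadable {
        return RepairPlan::Quick;
    }
    if s.primary_checksums_valid {
        RepairPlan::FullFromPrimary
    } else if s.two_phase_commit {
        // The secondary slot is never consulted after a 2-phase commit.
        RepairPlan::Corrupted
    } else {
        RepairPlan::FullFromSecondary
    }
}
```

Note that saved allocator state alone is not enough in this model: without the two_phase_commit flag the table might describe an older transaction, so the full repair path is taken.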
diff --git a/fuzz/fuzz_targets/common.rs b/fuzz/fuzz_targets/common.rs
@@ -164,6 +164,7 @@ pub(crate) enum FuzzOperation {
 pub(crate) struct FuzzTransaction {
     pub ops: Vec<FuzzOperation>,
     pub durable: bool,
+    pub quick_repair: bool,
     pub commit: bool,
     pub create_ephemeral_savepoint: bool,
     pub create_persistent_savepoint: bool,

diff --git a/fuzz/fuzz_targets/fuzz_redb.rs b/fuzz/fuzz_targets/fuzz_redb.rs
@@ -583,6 +583,7 @@ fn exec_table_crash_support<T: Clone>(config: &FuzzConfig, apply: fn(WriteTransa
         if !transaction.durable {
             txn.set_durability(Durability::None);
         }
+        txn.set_quick_repair(transaction.quick_repair);
         let mut counter_table = txn.open_table(COUNTER_TABLE).unwrap();
         let uncommitted_id = txn_id as u64 + 1;
         counter_table.insert((), uncommitted_id)?;

diff --git a/src/db.rs b/src/db.rs
@@ -19,7 +19,9 @@ use std::sync::{Arc, Mutex};
 
 use crate::error::TransactionError;
 use crate::sealed::Sealed;
-use crate::transactions::SAVEPOINT_TABLE;
+use crate::transactions::{
+    AllocatorStateKey, AllocatorStateTree, ALLOCATOR_STATE_TABLE_NAME, SAVEPOINT_TABLE,
+};
 use crate::tree_store::file_backend::FileBackend;
 #[cfg(feature = "logging")]
 use log::{debug, info, warn};
@@ -429,7 +431,9 @@ impl Database {
             return Err(CompactionError::TransactionInProgress);
         }
         // Commit to free up any pending free pages
-        // Use 2-phase commit to avoid any possible security issues. Plus this compaction is going to be so slow that it doesn't matter
+        // Use 2-phase commit to avoid any possible security issues. Plus this compaction is going to be so slow that it doesn't matter.
+        // Once https://github.com/cberner/redb/issues/829 is fixed, we should upgrade this to use quick-repair -- that way the user
+        // can cancel the compaction without requiring a full repair afterwards
         let mut txn = self.begin_write().map_err(|e| e.into_storage_error())?;
         if txn.list_persistent_savepoints()?.next().is_some() {
             return Err(CompactionError::PersistentSavepointExists);
@@ -609,6 +613,12 @@ impl Database {
         repair_callback: &(dyn Fn(&mut RepairSession) + 'static),
     ) -> Result<[Option<BtreeHeader>; 3], DatabaseError> {
         if !Self::verify_primary_checksums(mem.clone())? {
+            if mem.used_two_phase_commit() {
+                return Err(DatabaseError::Storage(StorageError::Corrupted(
+                    "Primary is corrupted despite 2-phase commit".to_string(),
+                )));
+            }
+
             // 0.3 because the repair takes 3 full scans and the first is done now
             let mut handle = RepairSession::new(0.3);
             repair_callback(&mut handle);
@@ -701,23 +711,31 @@ impl Database {
         )?;
         let mut mem = Arc::new(mem);
         if mem.needs_repair()? {
-            #[cfg(feature = "logging")]
-            warn!("Database {:?} not shutdown cleanly. Repairing", &file_path);
-            let mut handle = RepairSession::new(0.0);
-            repair_callback(&mut handle);
-            if handle.aborted() {
-                return Err(DatabaseError::RepairAborted);
+            // If the last transaction used 2-phase commit and updated the allocator state table, then
+            // we can just load the allocator state from there. Otherwise, we need a full repair
+            if Self::try_quick_repair(mem.clone())? {
+                #[cfg(feature = "logging")]
+                info!("Quick-repair successful, full repair not needed");
+            } else {
+                #[cfg(feature = "logging")]
+                warn!("Database {:?} not shutdown cleanly. Repairing", &file_path);
+                let mut handle = RepairSession::new(0.0);
+                repair_callback(&mut handle);
+                if handle.aborted() {
+                    return Err(DatabaseError::RepairAborted);
+                }
+                let [data_root, system_root, freed_root] =
+                    Self::do_repair(&mut mem, repair_callback)?;
+                let next_transaction_id = mem.get_last_committed_transaction_id()?.next();
+                mem.commit(
+                    data_root,
+                    system_root,
+                    freed_root,
+                    next_transaction_id,
+                    false,
+                    true,
+                )?;
             }
-            let [data_root, system_root, freed_root] = Self::do_repair(&mut mem, repair_callback)?;
-            let next_transaction_id = mem.get_last_committed_transaction_id()?.next();
-            mem.commit(
-                data_root,
-                system_root,
-                freed_root,
-                next_transaction_id,
-                false,
-                true,
-            )?;
         }
 
         mem.begin_writable()?;
@@ -752,6 +770,42 @@ impl Database {
         Ok(db)
     }
 
+    // Returns true if quick-repair was successful, or false if a full repair is needed
+    fn try_quick_repair(mem: Arc<TransactionalMemory>) -> Result<bool> {
+        // Quick-repair is only possible if the primary was written using 2-phase commit
+        if !mem.used_two_phase_commit() {
+            return Ok(false);
+        }
+
+        // See if the allocator state table is present in the system table tree
+        let fake_freed_pages = Arc::new(Mutex::new(vec![]));
+        let system_table_tree = TableTreeMut::new(
+            mem.get_system_root(),
+            Arc::new(TransactionGuard::fake()),
+            mem.clone(),
+            fake_freed_pages.clone(),
+        );
+        let Some(allocator_state_table) = system_table_tree
+            .get_table::<AllocatorStateKey, &[u8]>(ALLOCATOR_STATE_TABLE_NAME, TableType::Normal)
+            .map_err(|e| e.into_storage_error_or_corrupted("Unexpected TableError"))?
+        else {
+            return Ok(false);
+        };
+
+        // Load the allocator state from the table
+        let InternalTableDefinition::Normal { table_root, .. } = allocator_state_table else {
+            unreachable!();
+        };
+        let tree = AllocatorStateTree::new(
+            table_root,
+            Arc::new(TransactionGuard::fake()),
+            mem.clone(),
+            fake_freed_pages,
+        );
+
+        mem.try_load_allocator_state(&tree)
+    }
+
     fn allocate_read_transaction(&self) -> Result<TransactionGuard> {
         let id = self
             .transaction_tracker