Add set_quick_repair()
This adds a new quick-repair mode, which gives instant recovery after a
crash at the cost of slower commits.

To make this work, each commit with quick-repair enabled needs to save
the allocator state somewhere. We can't use the region headers, because
we'd be overwriting them in place; we might crash partway through the
overwrite, and then we'd need a full repair. So we instead save the
allocator state to a new table in the system tree. Writing to the table
is slightly tricky, because it needs to be done without allocating (see
below), but other than that it's a perfectly ordinary transactional write
with all the usual guarantees.

The other requirement to avoid full repair is knowing whether the last
transaction used 2-phase commit. For this, we add a new two_phase_commit
bit to the god byte, which is always updated atomically along with
swapping the primary bit. Old redb versions will ignore the new flag
when reading and clear it when writing, which is exactly what we want.
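As a rough illustration of why a single-byte update is enough, here is a minimal sketch of the flag handling. The constant names, the bit values (taken from the "first/second/third bit" ordering in docs/design.md), and both helper functions are assumptions made for this example, not redb's actual code.

```rust
// Hypothetical constants: bit values assumed from the order in docs/design.md.
const PRIMARY_BIT: u8 = 1;
const RECOVERY_REQUIRED: u8 = 1 << 1;
const TWO_PHASE_COMMIT: u8 = 1 << 2;

/// Compute the god byte a commit would write: flip the primary bit and
/// record the commit strategy in the same byte, so both change atomically.
fn god_byte_after_commit(old: u8, two_phase: bool) -> u8 {
    let mut byte = old ^ PRIMARY_BIT;
    if two_phase {
        byte |= TWO_PHASE_COMMIT;
    } else {
        byte &= !TWO_PHASE_COMMIT;
    }
    byte
}

/// Model of an older version that only knows the first two flags: it
/// rebuilds the byte from those flags, so the unknown third bit is cleared.
fn old_version_commit(old: u8) -> u8 {
    (old ^ PRIMARY_BIT) & (PRIMARY_BIT | RECOVERY_REQUIRED)
}
```

In this model, a commit made by an old version always leaves two_phase_commit cleared, so a newer version reopening the file after a crash would fall back to a full repair rather than trusting a primary it cannot vouch for.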

This turns out to also fix a longstanding bug where 2-phase commit hasn't
been providing any security benefit at all. The checksum forgery attack
described in the documentation for 1-phase commit actually works equally
well against 2-phase commit! The problem is that even though 2-phase
commit guarantees the primary is valid, redb ignores the primary flag
when repairing. It always picks whichever commit slot is newer, as long
as the checksum is valid. So if you crash partway through a commit,
it'll try to recover using the partially-written secondary rather than
the fully-written primary, regardless of the commit strategy.

The fix for this is exactly the two_phase_commit bit described above.
After a crash, we check whether the last transaction used 2-phase commit;
if so, we only look at the primary (which is guaranteed to be valid) and
ignore the secondary. Quick-repair needs this check anyway for safety,
so we get the bug fix for free.
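The slot selection described above can be sketched as follows. This is a simplified model under stated assumptions: `Slot`, `SlotInfo`, and `choose_recovery_slot` are hypothetical names, and the 1-phase branch only approximates the "newer slot with a valid checksum" rule from the text.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Slot {
    Primary,
    Secondary,
}

// Hypothetical summary of what recovery knows about one commit slot.
#[derive(Debug, Clone, Copy)]
struct SlotInfo {
    checksum_valid: bool,
    transaction_id: u64,
}

/// Pick the slot to recover from, or None if nothing usable remains.
fn choose_recovery_slot(
    two_phase_commit: bool,
    primary: SlotInfo,
    secondary: SlotInfo,
) -> Option<Slot> {
    if two_phase_commit {
        // The primary is guaranteed valid, so the secondary is never consulted.
        // An invalid primary here means the file is corrupted.
        return primary.checksum_valid.then_some(Slot::Primary);
    }
    // 1-phase commit: prefer the newer slot as long as its checksum is valid.
    let secondary_is_newer = secondary.transaction_id > primary.transaction_id;
    match (primary.checksum_valid, secondary.checksum_valid) {
        (_, true) if secondary_is_newer => Some(Slot::Secondary),
        (true, _) => Some(Slot::Primary),
        (false, true) => Some(Slot::Secondary),
        (false, false) => None,
    }
}
```

With a valid-looking but newer secondary, the 2-phase branch still returns the primary, which is the behavior that closes the forgery hole.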

To write to the allocator state table without allocating, I've introduced
a new insert_inplace() function. It's similar to insert_reserve(), but
more general and maybe simpler. To use it, you have to first do an
ordinary insert() with your desired key and a value of the appropriate
length; then later in the same transaction you can call insert_inplace()
to replace the value with a new one. Unlike insert_reserve(), this works
with values that don't implement MutInPlaceValue, and it lets you hold
multiple reservations simultaneously.

insert_inplace() could be safely exposed to users, but I don't think
there's any reason to. Since it doesn't give you a mutable reference,
there's no benefit over insert() unless you're storing data that cares
about its own position in the database. So for now it's private, and I
haven't bothered making a new error type for it; it just panics if you
don't satisfy the preconditions.
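To make the insert()/insert_inplace() contract concrete, here is a toy in-memory model. `ToyTable` and its methods are hypothetical and say nothing about how redb implements this on its btree pages; the model also does not track the "same transaction" precondition. It only shows the shape of the contract: reserve space with an ordinary insert, then overwrite in place with a value of the same length, panicking otherwise.

```rust
use std::collections::BTreeMap;

// Toy stand-in for a table; not redb's data structure.
struct ToyTable {
    entries: BTreeMap<String, Vec<u8>>,
}

impl ToyTable {
    fn new() -> Self {
        ToyTable { entries: BTreeMap::new() }
    }

    /// Ordinary insert: may allocate storage for the value.
    fn insert(&mut self, key: &str, value: &[u8]) {
        self.entries.insert(key.to_string(), value.to_vec());
    }

    /// Overwrite an existing value without changing its storage.
    /// Panics if the key is missing or the length differs.
    fn insert_inplace(&mut self, key: &str, value: &[u8]) {
        let slot = self
            .entries
            .get_mut(key)
            .expect("insert_inplace: key must already exist");
        assert_eq!(slot.len(), value.len(), "insert_inplace: length mismatch");
        slot.copy_from_slice(value);
    }

    fn get(&self, key: &str) -> Option<&[u8]> {
        self.entries.get(key).map(|v| v.as_slice())
    }
}
```

Because each reservation is just an existing entry, several can be outstanding at once, which mirrors the "multiple reservations" point above.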

The fuzzer is perfect for testing quick-repair, because it can simulate
a crash, reopen the database (using quick-repair if possible), and then
verify that the resulting allocator state exactly matches what would
happen if it ran a full repair. I've modified the fuzzer to generate
quick-repair commits in addition to ordinary commits.
mconst authored and cberner committed Nov 17, 2024
1 parent f29c131 commit c66447a
Showing 11 changed files with 588 additions and 99 deletions.
29 changes: 23 additions & 6 deletions docs/design.md
@@ -88,12 +88,16 @@ controls which transaction pointer is the primary.
`magic number` must be set to the ASCII letters 'redb' followed by 0x1A, 0x0A, 0xA9, 0x0D, 0x0A. This sequence is
inspired by the PNG magic number.

`god byte`, so named because this byte controls the state of the entire database, is a bitfield containing two flags:
`god byte`, so named because this byte controls the state of the entire database, is a bitfield containing three flags:
* first bit: `primary_bit` flag which indicates whether transaction slot 0 or transaction slot 1 contains the latest commit.
redb relies on the fact that this is a single bit to perform atomic commits.
* second bit: `recovery_required` flag, if set then the recovery process must be run when opening the database.
During the recovery process, the region tracker and regional allocator states -- described below -- are reconstructed
by walking the btree from all active roots.
* second bit: `recovery_required` flag, if set then the recovery process must be run when opening the database. This can be
a full repair, in which the region tracker and regional allocator states -- described below -- are reconstructed by walking
the btree from all active roots, or a quick-repair, in which the state is simply loaded from the allocator state table.
* third bit: `two_phase_commit` flag, which indicates whether the transaction in the primary slot was written using 2-phase
commit. If so, the primary slot is guaranteed to be valid, and repair won't look at the secondary slot. This flag is always
updated atomically along with the primary bit.

redb relies on the fact that this is a single byte to perform atomic commits.

`page size` is the size of a redb page in bytes

@@ -155,7 +159,9 @@ changed during an upgrade.

### Region tracker
The region tracker is an array of `BtreeBitmap`s that tracks the page orders which are free in each region.
It is stored in a page in the data section of a region:
There are two different places it can be stored: on shutdown, it's written to a page in the data section of
a region, and when making a commit with quick-repair enabled, it's written to an entry in the allocator state
table. The former is valid only after a clean shutdown; the latter is usable even after a crash.
```
<-------------------------------------------- 8 bytes ------------------------------------------->
==================================================================================================
@@ -216,6 +222,11 @@ range has been allocated
* n bytes: free index data
* n bytes: allocated data

Like the region tracker, there are two different places where the regional allocator state can be
stored. On shutdown, it's written to the region header as described above, and when making a commit
with quick-repair enabled, it's written to an entry in the allocator state table. The former is valid
only after a clean shutdown; the latter is usable even after a crash.

```
<-------------------------------------------- 8 bytes ------------------------------------------->
==================================================================================================
@@ -461,6 +472,12 @@ To repair the database after an unclean shutdown we must:
2) Update the allocator state, so that it is consistent with all the database roots in the above
transaction

If the last commit before the crash had quick-repair enabled, then these are both trivial. The
primary commit slot is guaranteed to be valid, because it was written using 2-phase commit, and
the corresponding allocator state is stored in the allocator state table.

Otherwise, we need to perform a full repair:

For (1), if the primary commit slot is invalid we switch to the secondary slot.

For (2), we rebuild the allocator state by walking the following trees and marking all referenced
1 change: 1 addition & 0 deletions fuzz/fuzz_targets/common.rs
@@ -164,6 +164,7 @@ pub(crate) enum FuzzOperation {
pub(crate) struct FuzzTransaction {
pub ops: Vec<FuzzOperation>,
pub durable: bool,
pub quick_repair: bool,
pub commit: bool,
pub create_ephemeral_savepoint: bool,
pub create_persistent_savepoint: bool,
1 change: 1 addition & 0 deletions fuzz/fuzz_targets/fuzz_redb.rs
@@ -583,6 +583,7 @@ fn exec_table_crash_support<T: Clone>(config: &FuzzConfig, apply: fn(WriteTransa
if !transaction.durable {
txn.set_durability(Durability::None);
}
txn.set_quick_repair(transaction.quick_repair);
let mut counter_table = txn.open_table(COUNTER_TABLE).unwrap();
let uncommitted_id = txn_id as u64 + 1;
counter_table.insert((), uncommitted_id)?;
90 changes: 72 additions & 18 deletions src/db.rs
@@ -19,7 +19,9 @@ use std::sync::{Arc, Mutex};

use crate::error::TransactionError;
use crate::sealed::Sealed;
use crate::transactions::SAVEPOINT_TABLE;
use crate::transactions::{
AllocatorStateKey, AllocatorStateTree, ALLOCATOR_STATE_TABLE_NAME, SAVEPOINT_TABLE,
};
use crate::tree_store::file_backend::FileBackend;
#[cfg(feature = "logging")]
use log::{debug, info, warn};
@@ -429,7 +431,9 @@ impl Database {
return Err(CompactionError::TransactionInProgress);
}
// Commit to free up any pending free pages
// Use 2-phase commit to avoid any possible security issues. Plus this compaction is going to be so slow that it doesn't matter
// Use 2-phase commit to avoid any possible security issues. Plus this compaction is going to be so slow that it doesn't matter.
// Once https://github.com/cberner/redb/issues/829 is fixed, we should upgrade this to use quick-repair -- that way the user
// can cancel the compaction without requiring a full repair afterwards
let mut txn = self.begin_write().map_err(|e| e.into_storage_error())?;
if txn.list_persistent_savepoints()?.next().is_some() {
return Err(CompactionError::PersistentSavepointExists);
@@ -609,6 +613,12 @@ impl Database {
repair_callback: &(dyn Fn(&mut RepairSession) + 'static),
) -> Result<[Option<BtreeHeader>; 3], DatabaseError> {
if !Self::verify_primary_checksums(mem.clone())? {
if mem.used_two_phase_commit() {
return Err(DatabaseError::Storage(StorageError::Corrupted(
"Primary is corrupted despite 2-phase commit".to_string(),
)));
}

// 0.3 because the repair takes 3 full scans and the first is done now
let mut handle = RepairSession::new(0.3);
repair_callback(&mut handle);
@@ -701,23 +711,31 @@ impl Database {
)?;
let mut mem = Arc::new(mem);
if mem.needs_repair()? {
#[cfg(feature = "logging")]
warn!("Database {:?} not shutdown cleanly. Repairing", &file_path);
let mut handle = RepairSession::new(0.0);
repair_callback(&mut handle);
if handle.aborted() {
return Err(DatabaseError::RepairAborted);
// If the last transaction used 2-phase commit and updated the allocator state table, then
// we can just load the allocator state from there. Otherwise, we need a full repair
if Self::try_quick_repair(mem.clone())? {
#[cfg(feature = "logging")]
info!("Quick-repair successful, full repair not needed");
} else {
#[cfg(feature = "logging")]
warn!("Database {:?} not shutdown cleanly. Repairing", &file_path);
let mut handle = RepairSession::new(0.0);
repair_callback(&mut handle);
if handle.aborted() {
return Err(DatabaseError::RepairAborted);
}
let [data_root, system_root, freed_root] =
Self::do_repair(&mut mem, repair_callback)?;
let next_transaction_id = mem.get_last_committed_transaction_id()?.next();
mem.commit(
data_root,
system_root,
freed_root,
next_transaction_id,
false,
true,
)?;
}
let [data_root, system_root, freed_root] = Self::do_repair(&mut mem, repair_callback)?;
let next_transaction_id = mem.get_last_committed_transaction_id()?.next();
mem.commit(
data_root,
system_root,
freed_root,
next_transaction_id,
false,
true,
)?;
}

mem.begin_writable()?;
@@ -752,6 +770,42 @@ impl Database {
Ok(db)
}

// Returns true if quick-repair was successful, or false if a full repair is needed
fn try_quick_repair(mem: Arc<TransactionalMemory>) -> Result<bool> {
// Quick-repair is only possible if the primary was written using 2-phase commit
if !mem.used_two_phase_commit() {
return Ok(false);
}

// See if the allocator state table is present in the system table tree
let fake_freed_pages = Arc::new(Mutex::new(vec![]));
let system_table_tree = TableTreeMut::new(
mem.get_system_root(),
Arc::new(TransactionGuard::fake()),
mem.clone(),
fake_freed_pages.clone(),
);
let Some(allocator_state_table) = system_table_tree
.get_table::<AllocatorStateKey, &[u8]>(ALLOCATOR_STATE_TABLE_NAME, TableType::Normal)
.map_err(|e| e.into_storage_error_or_corrupted("Unexpected TableError"))?
else {
return Ok(false);
};

// Load the allocator state from the table
let InternalTableDefinition::Normal { table_root, .. } = allocator_state_table else {
unreachable!();
};
let tree = AllocatorStateTree::new(
table_root,
Arc::new(TransactionGuard::fake()),
mem.clone(),
fake_freed_pages,
);

mem.try_load_allocator_state(&tree)
}

fn allocate_read_transaction(&self) -> Result<TransactionGuard> {
let id = self
.transaction_tracker