Add design for a persistent Account storage #2769

Closed
1 change: 1 addition & 0 deletions book/src/SUMMARY.md
@@ -37,6 +37,7 @@
- [Reliable Vote Transmission](reliable-vote-transmission.md)
- [Bank Forks](bank-forks.md)
- [Blocktree Validation](fullnode-with-ledger-notifications.md)
- [Persistent Account Storage](persistent-account-storage.md)
- [Cluster Economics](ed_overview.md)
- [Validation-client Economics](ed_validation_client_economics.md)
- [State-validation Protocol-based Rewards](ed_vce_state_validation_protocol_based_rewards.md)
120 changes: 120 additions & 0 deletions book/src/persistent-account-storage.md
@@ -0,0 +1,120 @@
# Persistent Account Storage

The set of Accounts represents the current computed state of all the transactions
that have been processed by a fullnode. Each fullnode needs to maintain this
entire set. Each block proposed by the network represents a change to this set,
and since each block is a potential rollback point, the changes need to be
reversible.

Persistent storage such as NVMe drives is 20 to 40 times cheaper than DDR. The problem
with persistent storage is that write and read performance is much slower than
DDR, so care must be taken in how data is read and written. Both reads and
writes can be split between multiple storage drives and accessed in parallel.
This design proposes a data structure that allows for concurrent reads and
concurrent writes of storage. Writes are optimized by using an AppendVec data
structure, which allows a single writer to append while allowing access to many
concurrent readers. The accounts index maintains a pointer to the spot where the
account was appended for every fork, thus removing the need for explicit
check-pointing of state.

# AppendVec

AppendVec is a data structure that allows for random reads concurrent with a
single append-only writer. Growing or resizing the capacity of the AppendVec
requires exclusive access. This is implemented with an atomic `len`, which is
updated at the end of a completed append.

The underlying memory for an AppendVec is a memory mapped file. Memory mapped
files allow for fast random access, and paging is handled by the OS.
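
To make the publishing protocol concrete, here is a minimal sketch (not the
actual implementation) of an append-only buffer that exposes new entries by
bumping an atomic `len` only after the bytes are fully written. It uses a plain
in-memory buffer in place of the memory mapped file, and `append` takes
`&mut self` to stay in safe Rust, whereas the real AppendVec appends through
shared access to the mapping; all names here are illustrative.

```
use std::sync::atomic::{AtomicUsize, Ordering};

/// Illustrative append-only buffer: a single writer appends, and readers may
/// read any bytes that lie below the published `len`.
struct AppendBuf {
    data: Box<[u8]>,  // stands in for the memory mapped file
    len: AtomicUsize, // number of bytes visible to readers
}

impl AppendBuf {
    fn with_capacity(capacity: usize) -> Self {
        Self {
            data: vec![0u8; capacity].into_boxed_slice(),
            len: AtomicUsize::new(0),
        }
    }

    /// Append a new entry and return its offset, or None if the buffer is
    /// full (at which point a new AppendVec would be created).
    fn append(&mut self, bytes: &[u8]) -> Option<usize> {
        let offset = self.len.load(Ordering::Acquire);
        if offset + bytes.len() > self.data.len() {
            return None;
        }
        self.data[offset..offset + bytes.len()].copy_from_slice(bytes);
        // Publish the entry only after the bytes are fully written.
        self.len.store(offset + bytes.len(), Ordering::Release);
        Some(offset)
    }

    /// Read a fully published range; anything past `len` is not yet visible.
    fn get(&self, offset: usize, size: usize) -> Option<&[u8]> {
        if offset + size <= self.len.load(Ordering::Acquire) {
            Some(&self.data[offset..offset + size])
        } else {
            None
        }
    }
}
```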

# Account Index

The account index is designed to support a single index for all the currently
forked Accounts.

```
/// Identifies a single AppendVec file.
type AppendVecId = usize;

/// Identifies a fork, which is a potential rollback point.
type Fork = usize;

/// Per-account map from a fork to the (AppendVec, offset) where that
/// version of the account was appended.
struct AccountMap(HashMap<Fork, (AppendVecId, u64)>);

/// The single index from an account's Pubkey to its per-fork locations.
type AccountIndex = HashMap<Pubkey, AccountMap>;
```

The index is a map of account Pubkeys to a map of forks and the location of the
Account data in an AppendVec. To get the latest version of an account:

```
/// Load the account for the pubkey.
/// This function will load the account from the greatest or equal to fork.
pub fn load_slow(&self, fork: u64, pubkey: &Pubkey) -> Option<&Account>
```

The read is satisfied by pointing to a memory mapped location in the AppendVec
identified by `AppendVecId`, at the stored offset.
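
As a hedged sketch of the lookup path, ignoring the fall-back across forks that
`load_slow` describes, the index resolves a pubkey at a given fork to an
`(AppendVecId, offset)` pair, and the account is then read from that mapped
region. The `locate` helper and the placeholder `Pubkey` type are illustrative
only.

```
use std::collections::HashMap;

type AppendVecId = usize;
type Fork = usize;
type Pubkey = [u8; 32]; // placeholder for the real Pubkey type

/// Resolve an account to its storage location for an exact fork.
/// The caller then reads the account from the memory mapped AppendVec
/// identified by the returned id, at the returned offset.
fn locate(
    index: &HashMap<Pubkey, HashMap<Fork, (AppendVecId, u64)>>,
    fork: Fork,
    pubkey: &Pubkey,
) -> Option<(AppendVecId, u64)> {
    index.get(pubkey)?.get(&fork).copied()
}
```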

# Append Only Writes

All updates to Accounts occur as append-only updates, so for every account
update a new version is stored in the AppendVec.

It is possible to optimize updates within a single fork by returning a mutable
reference to an already stored account in a fork. The Bank already tracks
concurrent access to accounts and guarantees that a write to a specific account
fork will not be concurrent with a read of that account at that fork. To support
this operation, AppendVec should implement this function:

`fn get_mut(&self, index: u64) -> &mut T`

This API allows for concurrent mutable access to a memory region at `index`. It
relies on the Bank to guarantee exclusive access to that index.
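
Below is a minimal sketch of the interior mutability this implies. In the
sketch, `get_mut` is marked `unsafe` to make the contract explicit; in the
design above, soundness comes from the Bank's account locks rather than from
the type system, and the `Slot` type is purely illustrative.

```
use std::cell::UnsafeCell;

/// Illustrative slot that hands out a mutable reference from a shared
/// reference, mirroring the `get_mut(&self, ...)` shape above.
struct Slot<T>(UnsafeCell<T>);

// Sharing across threads is only sound because the Bank's account locks
// guarantee that no two threads touch the same slot at the same time.
unsafe impl<T: Send> Sync for Slot<T> {}

impl<T> Slot<T> {
    fn new(value: T) -> Self {
        Slot(UnsafeCell::new(value))
    }

    /// SAFETY: the caller must hold the account lock for this slot so that
    /// no other reference to it exists for the duration of the borrow.
    unsafe fn get_mut(&self) -> &mut T {
        &mut *self.0.get()
    }
}

fn main() {
    let lamports = Slot::new(100u64);
    // The Bank's account locks make this access exclusive in the real design.
    unsafe { *lamports.get_mut() -= 1 };
    assert_eq!(unsafe { *lamports.get_mut() }, 99);
}
```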

# Garbage Collection

As accounts get updated, they move to the end of the AppendVec. Once capacity
has run out, a new AppendVec can be created and updates can be stored there.
Eventually references to an older AppendVec will disappear because all the
accounts have been updated, and the old AppendVec can be deleted.

To speed up this process, it's possible to move Accounts that have not been
recently updated to the front of a new AppendVec. This form of garbage
collection can be done without requiring exclusive locks on any of the data
structures except for the index update.
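
A hedged sketch of that relocation step, assuming the collector has already
copied a cold account into a fresh AppendVec and only needs to repoint the
index; the `relocate` helper and placeholder types are illustrative.

```
use std::collections::HashMap;

type AppendVecId = usize;
type Fork = usize;
type Pubkey = [u8; 32]; // placeholder for the real Pubkey type

/// Repoint an account's index entry at its new location. Only this index
/// update needs the exclusive write lock; copying the account bytes into the
/// new AppendVec does not block concurrent readers of the old one.
fn relocate(
    index: &mut HashMap<Pubkey, HashMap<Fork, (AppendVecId, u64)>>,
    pubkey: &Pubkey,
    fork: Fork,
    new_location: (AppendVecId, u64),
) {
    if let Some(forks) = index.get_mut(pubkey) {
        if let Some(location) = forks.get_mut(&fork) {
            // The old AppendVec entry becomes garbage, and the old file can
            // be deleted once nothing in the index points at it.
            *location = new_location;
        }
    }
}
```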

# Index Recovery

Each bank thread has exclusive access to the accounts during append, since the
account locks cannot be released until the data is committed. But there is no
explicit order of writes between the separate AppendVec files. To create an
ordering, the index maintains an atomic write version counter. Each append to
an AppendVec records the index write version number for that append in the
entry for the Account in the AppendVec.

To recover the index, all the AppendVec files can be read in any order, and the
latest write version for every fork should be stored in the index.
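
A minimal sketch of that recovery pass, assuming each stored account entry
carries the write version it was appended with; the entry layout and helper
below are illustrative, and the sketch keeps the winning write version next to
each location while scanning.

```
use std::collections::HashMap;

type AppendVecId = usize;
type Fork = usize;
type Pubkey = [u8; 32]; // placeholder for the real Pubkey type

/// One account entry as recovered from an AppendVec file (illustrative layout).
struct StoredEntry {
    pubkey: Pubkey,
    fork: Fork,
    write_version: u64,
    id: AppendVecId,
    offset: u64,
}

/// Rebuild the index from entries read in any order: for each (pubkey, fork),
/// keep only the location recorded with the highest write version.
fn rebuild_index(
    entries: impl IntoIterator<Item = StoredEntry>,
) -> HashMap<Pubkey, HashMap<Fork, (u64, AppendVecId, u64)>> {
    let mut index: HashMap<Pubkey, HashMap<Fork, (u64, AppendVecId, u64)>> = HashMap::new();
    for e in entries {
        index
            .entry(e.pubkey)
            .or_default()
            .entry(e.fork)
            .and_modify(|cur| {
                if e.write_version > cur.0 {
                    *cur = (e.write_version, e.id, e.offset);
                }
            })
            .or_insert((e.write_version, e.id, e.offset));
    }
    index
}
```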

# Snapshots

To snapshot, the underlying memory mapped files in the AppendVec need to be
flushed to disk. The index can be written out to disk as well.
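
As a rough sketch, assuming the index has already been serialized by whatever
encoding the node uses, snapshotting amounts to syncing the account data and
writing the index out alongside it; the `snapshot` helper and its arguments are
illustrative.

```
use std::fs::File;
use std::io::{self, Write};

/// Flush the account data and persist a serialized copy of the index so that
/// recovery does not have to rescan every AppendVec file.
fn snapshot(append_vec_file: &File, index_bytes: &[u8], index_path: &str) -> io::Result<()> {
    // For a memory mapped AppendVec this corresponds to flushing the mapping;
    // syncing the backing file expresses the same idea.
    append_vec_file.sync_all()?;

    let mut out = File::create(index_path)?;
    out.write_all(index_bytes)?;
    out.sync_all()?;
    Ok(())
}
```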

# Performance

* Append-only writes are fast. SSDs and NVMe drives, as well as the OS-level
kernel data structures, allow appends to run as fast as PCIe or NVMe bandwidth
will allow (2,700 MB/s).

* Each replay and banking thread writes concurrently to its own AppendVec.

* Each AppendVec could potentially be hosted on a separate NVMe.

* Each replay and banking thread has concurrent read access to all the
AppendVecs without blocking writes.

* The index requires an exclusive write lock for writes. Single-threaded
performance for HashMap updates is on the order of 10 million per second.

* The Banking and Replay stages should use 32 threads per NVMe. NVMe drives have
optimal performance with 32 concurrent readers or writers.