-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Proposal for the new AppendVec storage #28550
Proposal for the new AppendVec storage #28550
Conversation
some other brainstorming ideas.
these days, append vecs are written with all the information present prior to writing. |
goals:
|
we've now activated the feature that stops updating rent epoch for rent exempt accounts. |
#28585 |
Note that we are no longer updating rent_epoch with each epoch. As a result, we will now have a wide range of values for |
Thanks for sharing your thoughts, Jeff! It really helps reorganize the unstructured ideas in my mind and boost more ideas. Probably worth discussing the items/ideas one by one.
Sorting will provide two benefits:
|
The very first version in my mind was one file, but I later found that if we want to keep supporting I will keep thinking about what would be the best layout and format for AppendVec storage together with accounts index. |
For lamports, can I know what is the maximum allowed number of lamports per account?
For this, I am thinking loud that if the basic unit of the data size is 1k or something, then our number here can just represent the size in terms of how many units. (i.e., 1000 means 1000k). In that case, u32 will be able to represent up to 4TB, which should be enough. If the above idea sounds good, and if we are okay with 4 billion maximum for lamports, then we can use one single u64 to store both lamports and account data size? |
Hmm, if we store all accounts data in a different section, then we can probably skip storing the account data size as we can derive the account data size by (the account data offset of the next account entry) - (the offset account data of the current account entry) |
Updated the design. The proposal now includes the general design and several graphs. Here's a quick summary. The new AppendVec storage format uses ~80 bytes for each account on average, down from the existing Specifically:
The new AppendVec file is organized into 6 different blocks: header,
|
Changed to ready-to-review for collecting initial feedback. |
| | up to 4,294,967,295 owners per append vec | | ||
| data_offset | 4 bytes | | ||
+-----------------+-------------------------------------------+ | ||
| rent status | 1 bit | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is able to be deterministically calculated based on lamports and data len
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pointing this out. Yes! So we can remove this bit!
// the id can be used to fetch its pubkey | ||
pub owner_local_id: u32, | ||
data_offset: u32, | ||
data_size: u32, // derived when first loading the AppendVec from the storage |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fwiw, data size is currently limited to 10MB, iirc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we happen to have some size unit for the data size? Such as 1k or 64 bytes or anything like this? If we have a size unit, then we can represent a wider range of data sizes with fewer bytes.
+-------------------------------------------------------------+ | ||
| Account data block (variable size) | | ||
+-------------------------------------------------------------+ | ||
| rent_epoch | 0-8 bytes | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this may have to be aligned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Feel the same way. Then we might need to store the account data size for each account as there might be gaps between two account data entries with paddings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is this accessed? I imagine it's read, not directly cast to a reference, so it can be left unaligned and read with https://doc.rust-lang.org/stable/std/ptr/fn.read_unaligned.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, it's for read operations. Having it aligned should allow reads to be more performant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the initial feedback, @jeffwashington! Will address the alignment for the data block and extend the design to cover ancient AppendVec.
+-------------------------------------------------------------+ | ||
| Account data block (variable size) | | ||
+-------------------------------------------------------------+ | ||
| rent_epoch | 0-8 bytes | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Feel the same way. Then we might need to store the account data size for each account as there might be gaps between two account data entries with paddings.
Updated the proposal:
|
Updated the proposal to include compression and the concept of the account data block. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just some first pass thoughts for now. consider getting input from the firedancer team as well. they are highly skilled in data path optimization and will need to work with us at least wrt snapshot format
+---------------------------------------------+ | ||
| Header | | ||
+---------------------------------------------+ | ||
| format_version | 8 bytes | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
consider adding a header_size
field
| Header | | ||
+---------------------------------------------+ | ||
| format_version | 8 bytes | | ||
| file_size | 8 bytes | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any reason not to recover the file size from the filesystem/inode entry? it can still be included in a hash w/o strictly being serialized
#### Header Block | ||
The header includes high level information of an AppendVec file. | ||
|
||
All the numerical fields in the header are `u64` integers for simplicity as |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this statement seems to be contradicted by several fields with size designated as 4-bytes
below?
| data_block_offset | 8 bytes | | ||
| blob_data_block_offset | 8 bytes | | ||
| compression_algorithm | 8 bytes | | ||
| append_vec_hash | 32 bytes | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
putting this hash in the header complicates its generation. we need to straddle it with two hasher updates. consider one of
- creating a footer and storing the hash there instead
- merklizing the blocks and storing the root in the header
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great we share the same idea. The current prototype uses the footer approach instead. Let me update the doc later today.
| owners_offset | 8 bytes | | ||
| data_block_offset | 8 bytes | | ||
| blob_data_block_offset | 8 bytes | | ||
| compression_algorithm | 8 bytes | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does compression here only apply to snapshot storage or also to any paged out appendvecs? any reason not to prefer a zerocopyable interface for the latter?
upon a complete read, i feel like i'm missing some important context around the use and intention of compression here. can you add a section describing when, where and how compression will be used?
offers forward compability as long as the newer version only adds new fields | ||
and does not change the definition of the existing fields. | ||
|
||
#### (Small) Account Data Blocks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this block page-aligned within the appendvec?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current doc here is a bit outdated as I am currently prototyping.
In the current prototype, the account data blocks will be the first thing in the file, so the first block is guaranteed to be page-aligned. But since each block is compressed, the rest of the blocks will not be page-aligned.
Let me think if we could get benefits from both compress and page alignment.
+-------------------------------------------------------------+ | ||
| Account data (after decompression) | | ||
+-------------------------------------------------------------+ | ||
| rent_epoch | 0-8 bytes | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
zero or 8 bytes, right? this should probably be appended instead of prepended and NonZeroU64
, then we get free memory layout optimization from the compiler (assuming rent-epoch == 0 is an invalid thing)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the correction. Yes, it should be 0 or 8 bytes.
If we read account data more often than rent-epoch (I think this assumption should be yes), then yes it should be appended instead of prepended.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rent_epoch
is quite the story. It is only relevant anymore for rent-paying accounts, which is a shrinking set of ~1M accounts on mnb. For everyone else, I have a feature to set the value to Epoch::MAX for all rent exempt accounts. Thus, we don't need to store it at all for rent exempt accounts (the vast bulk of accounts). For rent paying accounts it is 1 of 4 values. Slot - 1, Slot, Slot + 1, 0. Where Slot is the slot of the append vec itself. It is possible we could reduce these values down to fewer than 4. Maybe even 2 values:
- rent already collected as of this slot
- rent last collected as of prior slot
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am hustling to get this feature ready.
#28683
and does not change the definition of the existing fields. | ||
|
||
#### (Small) Account Data Blocks | ||
One small account data block contains multiple accounts' data. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is each data entry a full page, or variable-length?
cross-page load/store perf is usually pretty abysmal. like an order of magnitude bad
This is a good read if we're considering retaining mmap-based file-backing. tl;dr, we probably shouldn't Are You Sure You Want to Use MMAP in Your Database Management System? (pdf) |
Thanks for the feedback, @t-nelson! The doc is currently out-of-date as I am implementing the prototype but many of your comments are still applied! Will update the doc later today and have fire-dancers folks involved.
Rest assured that the new design will not use MMAP :). We will LRU or some other suitable cache mechanisms for reads. |
Am still fixing bugs in the prototype. I want to make sure everything runs well before updating the proposal. Converting this PR to draft. |
fwiw, typically a proposal should be accepted before the implementation is started 😉 |
Yep yep, it's just a prototype to make sure it can actually produce the expected outcome :p. And here it is: #28790 So far I am seeing up to ~75% size reduction, which is good! Instructions to test the file format is also under the comment in #28790. Will update the proposal shortly. |
Hello @jeffwashington and @t-nelson, Thanks for the feedback so far! While I am still trying to add more sections/descriptions and polish the document, I've included the most important sections and information that allow our discussion to proceed. Let me know what you think about the new format. I've prototyped the new format and tested it by converting ~105GB append-vec files. The new format only uses ~40GB. |
This PR includes a proposal that discusses three potential areas that we can further reduce
+the storage size of AppendVec:
executable
+rent_epoch
fields (saves 8 bytes per account).