Proposal for the new AppendVec storage #28550

yhchiang-sol · 2022-10-23T04:53:28Z

This PR includes a proposal that discusses three potential areas where we can further reduce the storage size of AppendVec:

Use one byte for executable + rent_epoch fields (saves 8 bytes per account).
Skip persisting the account hash for each account (saves ~32 bytes per account).
Store account data separately (better compression rate on account data).

jeffwashington · 2022-10-25T16:50:20Z

some other brainstorming ideas.

use lookup table for common owners per appendvec (we should run a distribution graph for top owners on mnb). Special case the default pubkey owner for 'system'. Store a table of up to N (N=16?) owners that are present in the append vec. Then, for each entry, store a few bits indicating which of the table entries is the owner or we may have to serialize this specific owner.
special case 0 lamport accounts. How many do we store on mnb? Would this matter?
figure out how to reduce the bytes used for # lamports. We already have unaligned data for 'data'. u64 is 8 bytes
reduce the bytes used for data len. We already have unaligned data for 'data'
consider storing the pubkeys in 1 section of the file. We rarely need to scan pubkeys. Having to map them into memory for account load just wastes io, memory, and cpu.
maybe conditionally reorder on write to reduce padding required if we continue to store data side by side
consider tradeoffs of 2 files per append vec vs 1 (storing data separately). Maybe we store them together, but in separate sections and maybe we compress the 2 halves separately for snapshots?
special case for 0 len data. basically, I'm imagining a byte or few bytes with a few common enum values (default account=0, account owned by system program with zero data=1, account owned by program '1' in lookup table with zero data=2, ...). Maybe it is a bit field, where 0 data is 1 bit. owner is 4 bits, lamports < 3 bytes is 1 bit, something like that.
sorted by pubkey for some reason?
figure out how much space we waste on alignment
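
A rough sketch of the owner lookup table idea from the list above, assuming one table per append vec and a `u32` index per account. `OwnerTable`, `intern`, and `get` are hypothetical names used for illustration only, and `Pubkey` is a plain 32-byte array standing in for the real type:

```rust
use std::collections::HashMap;

/// 32-byte public key; a stand-in for the real Pubkey type.
type Pubkey = [u8; 32];

/// Hypothetical per-file owner table: each unique owner is stored once,
/// and every account refers to its owner by a small index.
#[derive(Default)]
struct OwnerTable {
    owners: Vec<Pubkey>,
    index_of: HashMap<Pubkey, u32>,
}

impl OwnerTable {
    /// Returns the index of `owner`, inserting it on first use.
    fn intern(&mut self, owner: Pubkey) -> u32 {
        if let Some(&idx) = self.index_of.get(&owner) {
            return idx;
        }
        let idx = self.owners.len() as u32;
        self.owners.push(owner);
        self.index_of.insert(owner, idx);
        idx
    }

    /// Resolves an index back to the owner pubkey, if it exists.
    fn get(&self, idx: u32) -> Option<&Pubkey> {
        self.owners.get(idx as usize)
    }
}
```

With a table capped at N=16 entries, the index would fit in 4 bits; the `u32` here just keeps the sketch simple.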

these days, append vecs are written with all the information present prior to writing.
This is true except for ancient append vecs, fwiw.

jeffwashington · 2022-10-25T16:51:50Z

goals:

improve compressibility for snapshots
reduce how much has to be mapped into memory to load accounts
reduce file size to save disk, reduce snapshot size, reduce how much has to be read in/written and scanned/loaded/mapped
consider cpu/mem/disk tradeoffs.
eliminate redundant data

jeffwashington · 2022-10-25T16:54:43Z

we've now activated the feature that stops updating rent epoch for rent exempt accounts.
we also don't allow creating rent paying accounts anymore
so, rent_epoch will soon become irrelevant.
rent_epoch is still a component of an accounts hash
at some point, we can stop storing/loading rent_epoch completely.
we should investigate the distribution of rent_epoch and what we expect it to be in the future.
Should we do something to freeze rent_exempt accounts at a known rent_epoch (like SLOT::MAX) so that we can stop saving rent_epoch and use a bit to represent that the rent_epoch is fixed? I'll create a feature for this.

jeffwashington · 2022-10-25T18:00:23Z

#28585
setting rent_epoch to known, constant value for rent exempt accounts.

jeffwashington · 2022-10-25T18:01:27Z

Note that we are no longer updating rent_epoch with each epoch. As a result, we will now have a wide range of values for rent_epoch going forward. Previously they were all within a few epochs of the append vec's slot's epoch.

yhchiang-sol · 2022-10-26T03:17:49Z

Thanks for sharing your thoughts, Jeff! It really helps organize the unstructured ideas in my mind and sparks new ones.

Probably worth discussing the items/ideas one by one.

sorted by pubkey for some reason?

Sorting will provide two benefits:

Enable binary search. This will help reduce the account index size as we don't need to save the reduced_offset inside the AccountInfo, which is 4 bytes per account. This enables us to trade CPU for smaller memory and storage usage for the accounts index.
Possibly better compression rate as sorted pubkeys are more likely to share the same prefix, but this all depends on the density of the pubkeys.
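
A minimal sketch of the binary-search lookup, assuming the pubkeys of one file are stored sorted in ascending byte order. `find_account` is a hypothetical helper and `Pubkey` is a plain 32-byte array standing in for the real type:

```rust
/// 32-byte public key; a stand-in for the real Pubkey type.
type Pubkey = [u8; 32];

/// Finds the position of `target` in a pubkey block sorted in ascending order.
/// The position doubles as the account's entry index, so the accounts index
/// would not need to keep a per-account offset into the file.
fn find_account(sorted_pubkeys: &[Pubkey], target: &Pubkey) -> Option<usize> {
    sorted_pubkeys.binary_search(target).ok()
}
```

Each lookup costs O(log n) comparisons instead of a direct offset jump, which is the CPU-for-memory trade mentioned above.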

yhchiang-sol · 2022-10-26T03:21:45Z

consider tradeoffs of 2 files per append vec vs 1 (storing data separately). Maybe we store them together, but in separate sections and maybe we compress the 2 halves separately for snapshots?

The very first version in my mind was one file, but I later found that if we want to keep supporting append_ptr using the existing mmapped append_vec file, then we are going to need two files as we are not able to "append" anymore since we have more than one section in the file.

I will keep thinking about what would be the best layout and format for AppendVec storage together with accounts index.

yhchiang-sol · 2022-10-26T03:30:05Z

figure out how to reduce the bytes used for # lamports. We already have unaligned data for 'data'. u64 is 8 bytes

For lamports, what is the maximum allowed number of lamports per account?

reduce the bytes used for data len. We already have unaligned data for 'data'

For this, I am thinking out loud: if the basic unit of the data size is 1k or something, then our number here can just represent the size as a count of units (i.e., 1000 means 1000k). In that case, a u32 will be able to represent up to 4TB, which should be enough.

If the above idea sounds good, and if we are okay with 4 billion maximum for lamports, then we can use one single u64 to store both lamports and account data size?
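
To make the packing idea concrete, here is a sketch under the two assumptions stated above: lamports fit in 32 bits, and data size is counted in 1 KiB units. `DATA_UNIT`, `pack`, and `unpack` are hypothetical names. Note that rounding to whole units loses the exact byte length, so a real format would have to recover it some other way:

```rust
use std::convert::TryFrom;

/// Hypothetical size unit: data length is stored as a count of 1 KiB units.
const DATA_UNIT: u64 = 1024;

/// Packs lamports (upper 32 bits) and data size in units (lower 32 bits).
/// Returns None if either value does not fit in 32 bits.
fn pack(lamports: u64, data_len: u64) -> Option<u64> {
    let lamports = u32::try_from(lamports).ok()?;
    // Round up so that a partially filled unit still counts as one unit.
    let units = data_len.checked_add(DATA_UNIT - 1)? / DATA_UNIT;
    let units = u32::try_from(units).ok()?;
    Some((u64::from(lamports) << 32) | u64::from(units))
}

/// Returns (lamports, data capacity in bytes).
fn unpack(packed: u64) -> (u64, u64) {
    (packed >> 32, (packed & 0xFFFF_FFFF) * DATA_UNIT)
}
```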

yhchiang-sol · 2022-10-26T03:37:03Z

special case for 0 len data. basically, I'm imagining a byte or few bytes with a few common enum values (default account=0, account owned by system program with zero data=1, account owned by program '1' in lookup table with zero data=2, ...). Maybe it is a bit field, where 0 data is 1 bit. owner is 4 bits, lamports < 3 bytes is 1 bit, something like that.

Hmm, if we store all account data in a different section, then we can probably skip storing the account data size, as we can derive it as (the account data offset of the next account entry) - (the account data offset of the current account entry).
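
A small sketch of that derivation, assuming data entries are stored back to back with no padding and in the same order as their offsets. `data_sizes` is a hypothetical helper, and `data_block_len` stands for the end of the data section:

```rust
/// Derives each account's data length from consecutive data offsets.
/// `offsets` must be ascending; `data_block_len` marks the end of the
/// data section and bounds the last entry.
fn data_sizes(offsets: &[u32], data_block_len: u32) -> Vec<u32> {
    offsets
        .iter()
        .enumerate()
        .map(|(i, &start)| {
            let end = offsets.get(i + 1).copied().unwrap_or(data_block_len);
            end - start
        })
        .collect()
}
```

If entries are padded for alignment, the difference includes the padding, so the exact size would no longer be recoverable this way.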

yhchiang-sol · 2022-10-31T07:22:00Z

Updated the design. The proposal now includes the general design and several graphs.
While I am still working on adding more text and details, the proposal should be ready for initial feedback.

Here's a quick summary.

The new AppendVec storage format uses ~80 bytes for each account on average, down from the existing
129 bytes per account. The new format is also more compression friendly for account data.

Specifically:

Rent-exempt account w/o shared owner uses 80 bytes each.
Rent-exempt account w/ shared owner uses 48 + 32 / N bytes each,
where N is the number of accounts sharing its owner.
Non-rent exempt account uses an extra 8 bytes for rent epoch.

The new AppendVec file is organized into 6 different blocks: header,
account metas block, account pubkeys block, owners block, account data block,
and footer.

+---------------------------------------------------+
| header                | 40 bytes                  |
+---------------------------------------------------+
| account metas block   | 16 bytes per account      |
+---------------------------------------------------+
| account pubkeys block | 32 bytes per account      |
+---------------------------------------------------+
| owners block          | 32 bytes per unique owner |
+---------------------------------------------------+
| account data block    | N + optional 8 bytes      |
|                       | for rent epoch            |
+---------------------------------------------------+
| footer                | 40 bytes                  |
+---------------------------------------------------+
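
As a worked example, the per-account overhead can be estimated by adding up the per-account block sizes from the table: 16 bytes of meta, 32 bytes of pubkey, a 32-byte owner entry shared by N accounts, and 8 bytes of rent epoch for non-rent-exempt accounts. This is only a sketch; `bytes_per_account` is a hypothetical helper, and it ignores the header, the footer, and the account data itself:

```rust
const META_BYTES: f64 = 16.0;
const PUBKEY_BYTES: f64 = 32.0;
const OWNER_BYTES: f64 = 32.0;
const RENT_EPOCH_BYTES: f64 = 8.0;

/// Estimated storage overhead per account, excluding account data.
/// `accounts_sharing_owner` must be at least 1.
fn bytes_per_account(accounts_sharing_owner: u32, rent_exempt: bool) -> f64 {
    // The owner entry is stored once and amortized over all its accounts.
    let owner_share = OWNER_BYTES / f64::from(accounts_sharing_owner);
    let rent = if rent_exempt { 0.0 } else { RENT_EPOCH_BYTES };
    META_BYTES + PUBKEY_BYTES + owner_share + rent
}
```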

yhchiang-sol · 2022-10-31T07:23:32Z

Changed to ready-to-review for collecting initial feedback.


jeffwashington · 2022-10-31T13:52:01Z

docs/src/proposals/append-vec-storage.md

+|                 | up to 4,294,967,295 owners per append vec |
+| data_offset     | 4 bytes                                   |
+-----------------+-------------------------------------------+
+| rent status     | 1 bit                                     |


this is able to be deterministically calculated based on lamports and data len

Thanks for pointing this out. Yes! So we can remove this bit!
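
For illustration, the rent status can be recomputed from lamports and data length roughly as follows. The constants are assumed default rent parameters (3,480 lamports per byte-year, a 2-year exemption threshold, 128 bytes of per-account overhead); a real implementation would take them from the cluster's rent configuration, and `minimum_balance` and `is_rent_exempt` are hypothetical helpers:

```rust
/// Assumed default rent parameters; real values come from the rent config.
const LAMPORTS_PER_BYTE_YEAR: u64 = 3_480;
const EXEMPTION_THRESHOLD_YEARS: u64 = 2;
const ACCOUNT_STORAGE_OVERHEAD: u64 = 128;

/// Minimum balance for an account with `data_len` bytes to be rent exempt.
fn minimum_balance(data_len: u64) -> u64 {
    (ACCOUNT_STORAGE_OVERHEAD + data_len) * LAMPORTS_PER_BYTE_YEAR * EXEMPTION_THRESHOLD_YEARS
}

/// The rent status follows from lamports and data length alone,
/// so it does not need its own bit in the file.
fn is_rent_exempt(lamports: u64, data_len: u64) -> bool {
    lamports >= minimum_balance(data_len)
}
```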

jeffwashington · 2022-10-31T13:53:30Z

docs/src/proposals/append-vec-storage.md

+                       // the id can be used to fetch its pubkey
+    pub owner_local_id: u32,
+    data_offset: u32,
+    data_size: u32,   // derived when first loading the AppendVec from the storage


fwiw, data size is currently limited to 10MB, iirc.

Do we happen to have some size unit for the data size? Such as 1k or 64 bytes or anything like this? If we have a size unit, then we can represent a wider range of data sizes with fewer bytes.


jeffwashington · 2022-10-31T13:59:23Z

docs/src/proposals/append-vec-storage.md

+-------------------------------------------------------------+
+| Account data block (variable size)                          |
+-------------------------------------------------------------+
+| rent_epoch      |  0-8 bytes                                |


I think this may have to be aligned.

Feel the same way. Then we might need to store the account data size for each account as there might be gaps between two account data entries with paddings.

How is this accessed? I imagine it's read, not directly cast to a reference, so it can be left unaligned and read with https://doc.rust-lang.org/stable/std/ptr/fn.read_unaligned.html

Yep, it's for read operations. Having it aligned should allow reads to be more performant.
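
A minimal sketch of the unaligned read mentioned above, using `std::ptr::read_unaligned` on a byte buffer. `read_u64_at` is a hypothetical helper, and the little-endian encoding is an assumption:

```rust
/// Reads a little-endian u64 at an arbitrary (possibly unaligned) byte offset.
/// Returns None if fewer than 8 bytes are available at `offset`.
fn read_u64_at(buf: &[u8], offset: usize) -> Option<u64> {
    let bytes = buf.get(offset..offset.checked_add(8)?)?;
    // SAFETY: `bytes` is exactly 8 readable bytes, and read_unaligned
    // places no alignment requirement on the source pointer.
    let raw = unsafe { std::ptr::read_unaligned(bytes.as_ptr() as *const u64) };
    Some(u64::from_le(raw))
}
```

Casting the same pointer to `&u64` instead would be undefined behavior when the offset is not 8-byte aligned, which is why the read goes through `read_unaligned`.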

yhchiang-sol

Thanks for the initial feedback, @jeffwashington! Will address the alignment for the data block and extend the design to cover ancient AppendVec.

yhchiang-sol · 2022-11-01T02:44:12Z

docs/src/proposals/append-vec-storage.md

+-------------------------------------------------------------+
+| Account data block (variable size)                          |
+-------------------------------------------------------------+
+| rent_epoch      |  0-8 bytes                                |


Feel the same way. Then we might need to store the account data size for each account as there might be gaps between two account data entries with paddings.

yhchiang-sol · 2022-11-02T00:40:21Z

Updated the proposal:

Removed write_version.
Added account meta entry size and owner entry size in the header to enable forward compatibility.
Added alignment information.
More descriptions.
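
To illustrate why recording the entry size in the header helps forward compatibility, here is a sketch of an older reader that walks the metas block with the stride from the header and ignores any trailing bytes it does not understand. The field layout (`lamports` followed by `data_offset`) and the names `MetaV1` and `parse_metas` are assumptions for the example, not the proposal's actual layout:

```rust
use std::convert::TryInto;

/// Fields this (older) reader understands. Hypothetical layout:
/// 8 bytes of lamports followed by a 4-byte data offset, little-endian.
#[derive(Debug, PartialEq)]
struct MetaV1 {
    lamports: u64,
    data_offset: u32,
}

const KNOWN_BYTES: usize = 12;

/// Walks the metas block using the entry size recorded in the header, so
/// entries written by a newer version with extra trailing fields still parse.
fn parse_metas(block: &[u8], entry_size: usize) -> Option<Vec<MetaV1>> {
    if entry_size < KNOWN_BYTES || block.len() % entry_size != 0 {
        return None;
    }
    block
        .chunks_exact(entry_size)
        .map(|entry| {
            Some(MetaV1 {
                lamports: u64::from_le_bytes(entry[0..8].try_into().ok()?),
                data_offset: u32::from_le_bytes(entry[8..12].try_into().ok()?),
            })
        })
        .collect()
}
```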

yhchiang-sol · 2022-11-10T15:09:33Z

Updated the proposal to include compression and the concept of the account data block.

t-nelson

just some first pass thoughts for now. consider getting input from the firedancer team as well. they are highly skilled in data path optimization and will need to work with us at least wrt snapshot format

t-nelson · 2022-11-21T19:01:10Z

docs/src/proposals/append-vec-storage.md

+---------------------------------------------+
+| Header                                      |
+---------------------------------------------+
+| format_version               |  8 bytes     |


consider adding a header_size field

t-nelson · 2022-11-21T19:03:03Z

docs/src/proposals/append-vec-storage.md

+| Header                                      |
+---------------------------------------------+
+| format_version               |  8 bytes     |
+| file_size                    |  8 bytes     |


any reason not to recover the file size from the filesystem/inode entry? it can still be included in a hash w/o strictly being serialized

t-nelson · 2022-11-21T19:05:57Z

docs/src/proposals/append-vec-storage.md

+#### Header Block
+The header includes high level information of an AppendVec file.
+
+All the numerical fields in the header are `u64` integers for simplicity as


this statement seems to be contradicted by several fields with size designated as 4-bytes below?

t-nelson · 2022-11-21T19:17:49Z

docs/src/proposals/append-vec-storage.md

+| data_block_offset            |  8 bytes     |
+| blob_data_block_offset       |  8 bytes     |
+| compression_algorithm        |  8 bytes     |
+| append_vec_hash              | 32 bytes     |


putting this hash in the header complicates its generation. we need to straddle it with two hasher updates. consider one of

creating a footer and storing the hash there instead

merklizing the blocks and storing the root in the header

Great we share the same idea. The current prototype uses the footer approach instead. Let me update the doc later today.
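
A sketch of the footer approach: the writer hashes each block as it is written and appends the digest at the end, so the hash is produced in one streaming pass with nothing to skip over. The 64-bit FNV-1a hash here is only a non-cryptographic stand-in so the example stays self-contained; `Fnv1a`, `write_with_footer`, and `verify` are hypothetical names:

```rust
/// Tiny streaming FNV-1a hash; a stand-in for the real (cryptographic) hash.
struct Fnv1a(u64);

impl Fnv1a {
    fn new() -> Self {
        Fnv1a(0xcbf2_9ce4_8422_2325)
    }
    fn update(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 ^= u64::from(b);
            self.0 = self.0.wrapping_mul(0x0000_0100_0000_01b3);
        }
    }
}

/// Writes the blocks back to back, then appends an 8-byte digest as a footer.
fn write_with_footer(blocks: &[&[u8]]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut hasher = Fnv1a::new();
    for block in blocks {
        out.extend_from_slice(block);
        hasher.update(block); // hashed in write order, in a single pass
    }
    out.extend_from_slice(&hasher.0.to_le_bytes());
    out
}

/// Recomputes the digest over everything before the footer and compares.
fn verify(file: &[u8]) -> bool {
    if file.len() < 8 {
        return false;
    }
    let (body, footer) = file.split_at(file.len() - 8);
    let mut hasher = Fnv1a::new();
    hasher.update(body);
    &hasher.0.to_le_bytes()[..] == footer
}
```

With the hash in the header instead, the hasher would have to be fed the bytes before and after the hash field separately, which is the "straddling" concern raised above.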

t-nelson · 2022-11-21T19:22:49Z

docs/src/proposals/append-vec-storage.md

+| owners_offset                |  8 bytes     |
+| data_block_offset            |  8 bytes     |
+| blob_data_block_offset       |  8 bytes     |
+| compression_algorithm        |  8 bytes     |


~~does compression here only apply to snapshot storage or also to any paged out appendvecs? any reason not to prefer a zerocopyable interface for the latter?~~

upon a complete read, i feel like i'm missing some important context around the use and intention of compression here. can you add a section describing when, where and how compression will be used?


t-nelson · 2022-11-21T20:46:43Z

docs/src/proposals/append-vec-storage.md

+offers forward compatibility as long as the newer version only adds new fields
+and does not change the definition of the existing fields.
+
+#### (Small) Account Data Blocks


is this block page-aligned within the appendvec?

The current doc here is a bit outdated as I am currently prototyping.

In the current prototype, the account data blocks will be the first thing in the file, so the first block is guaranteed to be page-aligned. But since each block is compressed, the rest of the blocks will not be page-aligned.

Let me think about whether we could get the benefits of both compression and page alignment.


t-nelson · 2022-11-21T21:04:05Z

docs/src/proposals/append-vec-storage.md

+-------------------------------------------------------------+
+| Account data (after decompression)                          |
+-------------------------------------------------------------+
+| rent_epoch      |  0-8 bytes                                |


zero or 8 bytes, right? this should probably be appended instead of prepended and NonZeroU64, then we get free memory layout optimization from the compiler (assuming rent-epoch == 0 is an invalid thing)

Thanks for the correction. Yes, it should be 0 or 8 bytes.

If we read account data more often than rent-epoch (I think this assumption should be yes), then yes it should be appended instead of prepended.
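
The layout optimization referred to above can be checked directly: `Option<NonZeroU64>` occupies the same 8 bytes as a plain `u64`, because the compiler uses the forbidden value 0 to represent `None`. This sketch assumes a rent epoch of 0 is never a valid stored value; `RentEpoch`, `decode_rent_epoch`, and `rent_epoch_field_size` are hypothetical names:

```rust
use std::num::NonZeroU64;

/// Hypothetical optional field: None means "no rent epoch stored".
type RentEpoch = Option<NonZeroU64>;

/// Maps a raw value to the optional field; 0 becomes None.
fn decode_rent_epoch(raw: u64) -> RentEpoch {
    NonZeroU64::new(raw)
}

/// Size of the optional field in memory, in bytes.
fn rent_epoch_field_size() -> usize {
    std::mem::size_of::<RentEpoch>()
}
```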

rent_epoch is quite the story. It is now only relevant for rent-paying accounts, which is a shrinking set of ~1M accounts on mnb. For everyone else, I have a feature to set the value to Epoch::MAX for all rent-exempt accounts. Thus, we don't need to store it at all for rent-exempt accounts (the vast bulk of accounts). For rent-paying accounts it is 1 of 4 values: Slot - 1, Slot, Slot + 1, 0, where Slot is the slot of the append vec itself. It is possible we could reduce these values down to fewer than 4. Maybe even 2 values:

rent already collected as of this slot

rent last collected as of prior slot
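
As a sketch of how the four-value case could be stored compactly, the rent epoch of a rent-paying account can be encoded as a 2-bit code relative to a base value taken from the append vec itself (the value written as Slot above). `RentEpochCode`, `encode`, and `decode` are hypothetical names, and anything outside the four values is reported as not encodable:

```rust
/// Two-bit code for the rent epoch of a rent-paying account.
#[derive(Debug, Clone, Copy, PartialEq)]
enum RentEpochCode {
    Zero = 0,
    Prior = 1,
    Current = 2,
    Next = 3,
}

/// Encodes `rent_epoch` relative to `base`; None if it is not one of the
/// four expected values. A value of 0 always maps to `Zero`.
fn encode(rent_epoch: u64, base: u64) -> Option<RentEpochCode> {
    if rent_epoch == 0 {
        Some(RentEpochCode::Zero)
    } else if rent_epoch == base {
        Some(RentEpochCode::Current)
    } else if Some(rent_epoch) == base.checked_sub(1) {
        Some(RentEpochCode::Prior)
    } else if Some(rent_epoch) == base.checked_add(1) {
        Some(RentEpochCode::Next)
    } else {
        None
    }
}

/// Recovers the rent epoch from its code and the same `base`.
fn decode(code: RentEpochCode, base: u64) -> u64 {
    match code {
        RentEpochCode::Zero => 0,
        RentEpochCode::Prior => base.saturating_sub(1),
        RentEpochCode::Current => base,
        RentEpochCode::Next => base.saturating_add(1),
    }
}
```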

I am hustling to get this feature ready.
#28683

t-nelson · 2022-11-21T21:36:39Z

docs/src/proposals/append-vec-storage.md

+and does not change the definition of the existing fields.
+
+#### (Small) Account Data Blocks
+One small account data block contains multiple accounts' data.


is each data entry a full page, or variable-length?

cross-page load/store perf is usually pretty abysmal. like an order of magnitude bad

t-nelson · 2022-11-21T22:18:25Z

This is a good read if we're considering retaining mmap-based file-backing. tl;dr, we probably shouldn't

Are You Sure You Want to Use MMAP in Your Database Management System? (pdf)

yhchiang-sol · 2022-11-29T21:45:46Z

Thanks for the feedback, @t-nelson! The doc is currently out of date as I am implementing the prototype, but many of your comments still apply! Will update the doc later today and get the Firedancer folks involved.

This is a good read if we're considering retaining mmap-based file-backing. tl;dr, we probably shouldn't
[Are You Sure You Want to Use MMAP in Your Database Management System?](https://www.cidrdb.org/cidr2022/papers/p13-crotty.pdf) (pdf)

Rest assured that the new design will not use MMAP :). We will use an LRU or some other suitable cache mechanism for reads.

yhchiang-sol · 2022-12-01T18:42:24Z

Am still fixing bugs in the prototype. I want to make sure everything runs well before updating the proposal. Converting this PR to draft.

t-nelson · 2022-12-01T19:08:29Z

fwiw, typically a proposal should be accepted before the implementation is started 😉

yhchiang-sol · 2022-12-02T20:35:11Z

fwiw, typically a proposal should be accepted before the implementation is started

Yep yep, it's just a prototype to make sure it can actually produce the expected outcome :p.

And here it is: #28790

So far I am seeing up to ~75% size reduction, which is good! Instructions for testing the file format are also in the comment in #28790.

Will update the proposal shortly.

yhchiang-sol · 2022-12-06T23:22:03Z

Hello @jeffwashington and @t-nelson,

Thanks for the feedback so far!

While I am still trying to add more sections/descriptions and polish the document, I've included the most important sections and information that allow our discussion to proceed.

Let me know what you think about the new format.

I've prototyped the new format and tested it by converting ~105GB append-vec files. The new format only uses ~40GB.
If I further compress the files, it is ~30GB, which can be used as an estimation for its snapshot size.

yhchiang-sol marked this pull request as draft October 23, 2022 04:53

yhchiang-sol changed the title ~~(Draft) Proposal for smaller AppendVec storage~~ (Draft)(WIP) Proposal for smaller AppendVec storage Oct 25, 2022

yhchiang-sol changed the title ~~(Draft)(WIP) Proposal for smaller AppendVec storage~~ (Draft) Proposal for smaller AppendVec storage Oct 31, 2022

yhchiang-sol requested a review from jeffwashington October 31, 2022 07:13

yhchiang-sol changed the title ~~(Draft) Proposal for smaller AppendVec storage~~ Proposal for smaller AppendVec storage (Collecting initial feedback) Oct 31, 2022

yhchiang-sol marked this pull request as ready for review October 31, 2022 07:22

yhchiang-sol changed the title ~~Proposal for smaller AppendVec storage (Collecting initial feedback)~~ Proposal for the new AppendVec storage Oct 31, 2022

jeffwashington reviewed Oct 31, 2022