Proposal for the new AppendVec storage #28550

Closed

yhchiang-sol
Contributor

This PR includes a proposal that discusses three potential areas where we can further reduce the storage size of AppendVec:

  • Use one byte for the executable + rent_epoch fields (saves 8 bytes per account).
  • Skip persisting the account hash for each account (saves ~32 bytes per account).
  • Store account data separately (better compression rate on account data).

@yhchiang-sol marked this pull request as draft on October 23, 2022 04:53
@jeffwashington
Contributor

some other brainstorming ideas.

  1. use lookup table for common owners per appendvec (we should run a distribution graph for top owners on mnb). Special case the default pubkey owner for 'system'. Store a table of up to N (N=16?) owners that are present in the append vec. Then, for each entry, store a few bits indicating which of the table entries is the owner or we may have to serialize this specific owner.
  2. special case 0 lamport accounts. How many do we store on mnb? Would this matter?
  3. figure out how to reduce the bytes used for # lamports. We already have unaligned data for 'data'. u64 is 8 bytes
  4. reduce the bytes used for data len. We already have unaligned data for 'data'
  5. consider storing the pubkeys in 1 section of the file. We rarely need to scan pubkeys. Having to map them into memory for account load just wastes io, memory, and cpu.
  6. maybe conditionally reorder on write to reduce padding required if we continue to store data side by side
  7. consider tradeoffs of 2 files per append vec vs 1 (storing data separately). Maybe we store them together, but in separate sections and maybe we compress the 2 halves separately for snapshots?
  8. special case for 0 len data. basically, I'm imagining a byte or few bytes with a few common enum values (default account=0, account owned by system program with zero data=1, account owned by program '1' in lookup table with zero data=2, ...). Maybe it is a bit field, where 0 data is 1 bit. owner is 4 bits, lamports < 3 bytes is 1 bit, something like that.
  9. sorted by pubkey for some reason?
  10. figure out how much space we waste on alignment

these days, append vecs are written with all the information present prior to writing.
This is true except for ancient append vecs, fwiw.
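A minimal sketch of the per-append-vec owner lookup table from idea 1, assuming a plain in-memory `Vec` and a `u32` index. `OwnerTable`, `intern`, and the `Pubkey` alias are hypothetical names for illustration, not types from the codebase.

```rust
/// Stand-in for a 32-byte pubkey (hypothetical alias for this sketch).
type Pubkey = [u8; 32];

/// Hypothetical owner table: each distinct owner is stored once per file,
/// and every account entry refers to its owner by a small index.
#[derive(Default)]
pub struct OwnerTable {
    owners: Vec<Pubkey>,
}

impl OwnerTable {
    /// Returns the index of `owner`, adding it to the table if it is new.
    /// A linear scan is enough for a small table (N = 16 or so).
    pub fn intern(&mut self, owner: Pubkey) -> u32 {
        if let Some(i) = self.owners.iter().position(|o| *o == owner) {
            return i as u32;
        }
        self.owners.push(owner);
        (self.owners.len() - 1) as u32
    }

    /// Looks up the owner pubkey for a stored index.
    pub fn get(&self, index: u32) -> Option<&Pubkey> {
        self.owners.get(index as usize)
    }

    /// Number of distinct owners in the table.
    pub fn len(&self) -> usize {
        self.owners.len()
    }
}
```

With 16 or fewer entries, the index fits in 4 bits per account instead of a 32-byte owner pubkey; owners outside the table would still need the full pubkey serialized, as the idea notes.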

@jeffwashington
Contributor

goals:

  1. improve compressibility for snapshots
  2. reduce how much has to be mapped into memory to load accounts
  3. reduce file size to save disk, reduce snapshot size, reduce how much has to be read in/written and scanned/loaded/mapped
  4. consider cpu/mem/disk tradeoffs.
  5. eliminate redundant data

@jeffwashington
Contributor

we've now activated the feature that stops updating rent epoch for rent exempt accounts.
we also don't allow creating rent paying accounts anymore
so, rent_epoch will soon become irrelevant.
rent_epoch is still a component of an accounts hash
at some point, we can stop storing/loading rent_epoch completely.
we should investigate the distribution of rent_epoch and what we expect it to be in the future.
Should we do something to freeze rent_exempt accounts at a known rent_epoch (like SLOT::MAX) so that we can stop saving rent_epoch and use a bit to represent that the rent_epoch is fixed? I'll create a feature for this.

@yhchiang-sol changed the title from "(Draft) Proposal for smaller AppendVec storage" to "(Draft)(WIP) Proposal for smaller AppendVec storage" on Oct 25, 2022
@jeffwashington
Contributor

#28585
setting rent_epoch to known, constant value for rent exempt accounts.

@jeffwashington
Contributor

Note that we are no longer updating rent_epoch with each epoch. As a result, we will now have a wide range of values for rent_epoch going forward. Previously they were all within a few epochs of the append vec's slot's epoch.

@yhchiang-sol
Contributor Author

Thanks for sharing your thoughts, Jeff! It really helps reorganize the unstructured ideas in my mind and boost more ideas.

Probably worth discussing the items/ideas one by one.

> sorted by pubkey for some reason?

Sorting will provide two benefits:

  • Enable binary search. This will help reduce the account index size as we don't need to save the reduced_offset inside the AccountInfo, which is 4 bytes per account. This enables us to trade CPU for smaller memory and storage usage for the accounts index.

  • Possibly better compression rate as sorted pubkeys are more likely to share the same prefix, but this all depends on the density of the pubkeys.
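As a rough illustration of the binary-search benefit, assuming the pubkeys block holds fixed-size 32-byte keys in ascending byte order (`find_account` and the `Pubkey` alias are made up for this sketch):

```rust
/// Stand-in for a 32-byte pubkey (hypothetical alias for this sketch).
type Pubkey = [u8; 32];

/// Returns the position of `key` within a pubkey block that is sorted in
/// ascending byte order, or None if the account is not in this file.
/// The position can then be used to index the account metas block, so the
/// accounts index would not need to store a per-account offset.
pub fn find_account(sorted_pubkeys: &[Pubkey], key: &Pubkey) -> Option<usize> {
    sorted_pubkeys.binary_search(key).ok()
}
```

The lookup costs O(log n) key comparisons per account load, which is the CPU side of the trade-off described above.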

@yhchiang-sol
Contributor Author

> consider tradeoffs of 2 files per append vec vs 1 (storing data separately). Maybe we store them together, but in separate sections and maybe we compress the 2 halves separately for snapshots?

The very first version in my mind was one file, but I later found that if we want to keep supporting append_ptr using the existing mmapped append_vec file, then we are going to need two files as we are not able to "append" anymore since we have more than one section in the file.

I will keep thinking about what would be the best layout and format for AppendVec storage together with accounts index.

@yhchiang-sol
Contributor Author

> figure out how to reduce the bytes use for # lamports. We already have unaligned data for 'data'. u64 is 8 bytes

For lamports, what is the maximum allowed number of lamports per account?

> reduce the bytes used for data len. We already have unaligned data for 'data'

For this, I am thinking out loud: if the basic unit of the data size is 1k or something, then our number here can just represent the size as a count of units (i.e., 1000 means 1000k). In that case, a u32 will be able to represent up to 4TB, which should be enough.

If the above idea sounds good, and if we are okay with 4 billion maximum for lamports, then we can use one single u64 to store both lamports and account data size?
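A sketch of that packing, assuming a 1 KiB size unit and a 32-bit cap on lamports. `SIZE_UNIT`, `pack`, and `unpack` are hypothetical names, and the layout (lamports in the upper 32 bits) is an arbitrary choice for illustration.

```rust
/// Assumed size unit: data sizes are stored as a count of 1 KiB units.
const SIZE_UNIT: u64 = 1024;

/// Packs lamports (upper 32 bits) and the data size in units (lower 32 bits)
/// into one u64. Returns None if either value does not fit in 32 bits.
pub fn pack(lamports: u64, data_len: u64) -> Option<u64> {
    // Round the byte length up to a whole number of units.
    let units = data_len / SIZE_UNIT + u64::from(data_len % SIZE_UNIT != 0);
    let max = u64::from(u32::MAX);
    if lamports > max || units > max {
        return None;
    }
    Some((lamports << 32) | units)
}

/// Returns (lamports, data size rounded up to a whole number of units).
pub fn unpack(packed: u64) -> (u64, u64) {
    (packed >> 32, (packed & 0xFFFF_FFFF) * SIZE_UNIT)
}
```

Two caveats show up directly in the sketch: the exact byte length is lost unless data sizes are always a multiple of the unit, and `u32::MAX` lamports is only about 4.29 SOL at 10^9 lamports per SOL, so the lamports cap is the restrictive half of this layout.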

@yhchiang-sol
Contributor Author

yhchiang-sol commented Oct 26, 2022

> special case for 0 len data. basically, I'm imagining a byte or few bytes with a few common enum values (default account=0, account owned by system program with zero data=1, account owned by program '1' in lookup table with zero data=2, ...). Maybe it is a bit field, where 0 data is 1 bit. owner is 4 bits, lamports < 3 bytes is 1 bit, something like that.

Hmm, if we store all account data in a different section, then we can probably skip storing the account data size, as we can derive it as (the account data offset of the next account entry) - (the account data offset of the current account entry).
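The subtraction can be sketched as follows, assuming offsets are stored in ascending order with no gaps between entries, and that the last entry ends where the data block ends (`data_sizes` is a hypothetical helper):

```rust
/// `offsets[i]` is where account i's data starts inside the data block and
/// `data_block_len` is the total length of that block.
/// Returns None if the offsets are not in ascending order.
pub fn data_sizes(offsets: &[u32], data_block_len: u32) -> Option<Vec<u32>> {
    offsets
        .iter()
        .enumerate()
        .map(|(i, &start)| {
            // The next entry's offset marks the end of this entry; the last
            // entry ends at the end of the data block.
            let end = offsets.get(i + 1).copied().unwrap_or(data_block_len);
            end.checked_sub(start)
        })
        .collect()
}
```

This only works while entries are packed back to back; any padding between entries would be counted as part of the preceding account's data.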

@yhchiang-sol changed the title from "(Draft)(WIP) Proposal for smaller AppendVec storage" to "(Draft) Proposal for smaller AppendVec storage" on Oct 31, 2022
@yhchiang-sol
Contributor Author

yhchiang-sol commented Oct 31, 2022

Updated the design. The proposal now includes the general design and several graphs.
While I am still working on adding more text and details, the proposal should be ready for initial feedback.

Here's a quick summary.

The new AppendVec storage format uses ~80 bytes for each account on average, down from the existing
129 bytes per account. The new format is also more compression friendly for account data.

Specifically:

  • Rent-exempt account w/o shared owner uses 80 bytes each.
  • Rent-exempt account w/ shared owner uses 40 + 32 / N bytes each,
    where N is the number of accounts sharing its owner.
  • Non-rent exempt account uses an extra 8 bytes for rent epoch.

The new AppendVec file is organized into 6 different blocks: header,
account metas block, account pubkeys block, owners block, account data block,
and footer.

+---------------------------------------------------+
| header                | 40 bytes                  |
+---------------------------------------------------+
| account metas block   | 16 bytes per account      |
+---------------------------------------------------+
| account pubkeys block | 32 bytes per account      |
+---------------------------------------------------+
| owners block          | 32 bytes per unique owner |
+---------------------------------------------------+
| account data block    | N + optional 8 bytes      |
|                       | for rent epoch            |
+---------------------------------------------------+
| footer                | 40 bytes                  |
+---------------------------------------------------+
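For a back-of-the-envelope check, the fixed (non-data) bytes implied by the table can be added up as below. This is only a sketch that takes the table's sizes at face value; the constant and function names are invented here. Note that the table gives 16 + 32 + 32 / N = 48 + 32 / N bytes per account with a shared owner, which matches the 80-byte figure at N = 1 but not the 40 + 32 / N stated in the list above, so one of the two figures may come from an earlier iteration of the design.

```rust
// Sizes taken from the block table above.
const HEADER_BYTES: u64 = 40;
const FOOTER_BYTES: u64 = 40;
const META_BYTES_PER_ACCOUNT: u64 = 16;
const PUBKEY_BYTES_PER_ACCOUNT: u64 = 32;
const OWNER_BYTES_PER_UNIQUE_OWNER: u64 = 32;

/// Total bytes used by everything except the account data block.
pub fn fixed_bytes(num_accounts: u64, num_unique_owners: u64) -> u64 {
    HEADER_BYTES
        + num_accounts * (META_BYTES_PER_ACCOUNT + PUBKEY_BYTES_PER_ACCOUNT)
        + num_unique_owners * OWNER_BYTES_PER_UNIQUE_OWNER
        + FOOTER_BYTES
}
```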

@yhchiang-sol changed the title from "(Draft) Proposal for smaller AppendVec storage" to "Proposal for smaller AppendVec storage (Collecting initial feedback)" on Oct 31, 2022
@yhchiang-sol marked this pull request as ready for review on October 31, 2022 07:22
@yhchiang-sol changed the title from "Proposal for smaller AppendVec storage (Collecting initial feedback)" to "Proposal for the new AppendVec storage" on Oct 31, 2022
@yhchiang-sol
Contributor Author

Changed to ready-to-review for collecting initial feedback.

| | up to 4,294,967,295 owners per append vec |
| data_offset | 4 bytes |
+-----------------+-------------------------------------------+
| rent status | 1 bit |
Contributor

this is able to be deterministically calculated based on lamports and data len

Contributor Author

Thanks for pointing this out. Yes! So we can remove this bit!
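A sketch of how the rent status could be recomputed instead of stored. The formula shape (a per-byte rate applied to data length plus a fixed overhead, times an exemption threshold) and the integer threshold are simplifying assumptions for illustration; `RentParams` is a hypothetical type, and real parameter values should come from the runtime's rent configuration.

```rust
/// Hypothetical rent parameters (assumed shape, integer-only for simplicity).
pub struct RentParams {
    pub lamports_per_byte_year: u64,
    pub exemption_threshold_years: u64,
    pub account_storage_overhead: u64,
}

impl RentParams {
    /// Minimum balance for an account with `data_len` bytes to be rent exempt.
    pub fn minimum_balance(&self, data_len: u64) -> u64 {
        (self.account_storage_overhead + data_len)
            * self.lamports_per_byte_year
            * self.exemption_threshold_years
    }

    /// The rent status bit is a pure function of lamports and data length,
    /// so it does not need to be persisted per account.
    pub fn is_exempt(&self, lamports: u64, data_len: u64) -> bool {
        lamports >= self.minimum_balance(data_len)
    }
}
```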

// the id can be used to fetch its pubkey
pub owner_local_id: u32,
data_offset: u32,
data_size: u32, // derived when first loading the AppendVec from the storage
Contributor

fwiw, data size is currently limited to 10MB, iirc.

Contributor Author

Do we happen to have some size unit for the data size? Such as 1k or 64 bytes or anything like this? If we have a size unit, then we can represent a wider range of data sizes with fewer bytes.

+-------------------------------------------------------------+
| Account data block (variable size) |
+-------------------------------------------------------------+
| rent_epoch | 0-8 bytes |
Contributor

I think this may have to be aligned.

Contributor Author

Feel the same way. Then we might need to store the account data size for each account, as padding might leave gaps between two account data entries.

Contributor

@alessandrod alessandrod Nov 2, 2022

How is this accessed? I imagine it's read, not directly cast to a reference, so it can be left unaligned and read with https://doc.rust-lang.org/stable/std/ptr/fn.read_unaligned.html

Contributor Author

Yep, it's for read operations. Having it aligned should allow reads to be more performant.
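A small sketch of the unaligned read suggested above, using `std::ptr::read_unaligned` on a byte buffer. `read_u64_unaligned` is a hypothetical helper, and native endianness is an assumption of this example.

```rust
use std::ptr;

/// Reads a native-endian u64 starting at `offset`, which does not have to be
/// 8-byte aligned. Returns None if the read would go out of bounds.
pub fn read_u64_unaligned(buf: &[u8], offset: usize) -> Option<u64> {
    let end = offset.checked_add(8)?;
    if end > buf.len() {
        return None;
    }
    // SAFETY: bytes [offset, offset + 8) are inside `buf`, and
    // `read_unaligned` places no alignment requirement on the pointer.
    Some(unsafe { ptr::read_unaligned(buf.as_ptr().add(offset) as *const u64) })
}
```

A fully safe alternative is to copy the 8 bytes into an array and call `u64::from_le_bytes`, which also pins the byte order.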

Contributor Author

@yhchiang-sol left a comment

Thanks for the initial feedback, @jeffwashington! Will address the alignment for the data block and extend the design to cover ancient AppendVec.

@yhchiang-sol
Contributor Author

Updated the proposal:

  • Removed write_version.
  • Added account meta entry size and owner entry size in the header to enable forward compatibility.
  • Added alignment information.
  • More descriptions.

@yhchiang-sol
Contributor Author

Updated the proposal to include compression and the concept of the account data block.

Contributor

@t-nelson left a comment

just some first pass thoughts for now. consider getting input from the firedancer team as well. they are highly skilled in data path optimization and will need to work with us at least wrt snapshot format

+---------------------------------------------+
| Header |
+---------------------------------------------+
| format_version | 8 bytes |
Contributor

consider adding a header_size field

| Header |
+---------------------------------------------+
| format_version | 8 bytes |
| file_size | 8 bytes |
Contributor

any reason not to recover the file size from the filesystem/inode entry? it can still be included in a hash w/o strictly being serialized

#### Header Block
The header includes high level information of an AppendVec file.

All the numerical fields in the header are `u64` integers for simplicity as
Contributor

this statement seems to be contradicted by several fields with size designated as 4-bytes below?

| data_block_offset | 8 bytes |
| blob_data_block_offset | 8 bytes |
| compression_algorithm | 8 bytes |
| append_vec_hash | 32 bytes |
Contributor

putting this hash in the header complicates its generation. we need to straddle it with two hasher updates. consider one of

  • creating a footer and storing the hash there instead
  • merklizing the blocks and storing the root in the header

Contributor Author

Great we share the same idea. The current prototype uses the footer approach instead. Let me update the doc later today.
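To illustrate why a footer simplifies hashing: the writer can feed each block to one hasher as it is written and append the digest at the end, with no need to skip over a hash field in the middle of the stream. The sketch below uses `std`'s `DefaultHasher` purely as a stand-in (it is not a cryptographic hash and produces 8 bytes, not 32); `write_with_footer_hash` is a hypothetical function.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;
use std::io::{self, Write};

/// Writes `blocks` in order while hashing them in a single pass, then
/// appends the digest as a footer. Returns the digest that was written.
pub fn write_with_footer_hash<W: Write>(out: &mut W, blocks: &[&[u8]]) -> io::Result<u64> {
    // Stand-in hasher; a real format would use a cryptographic hash.
    let mut hasher = DefaultHasher::new();
    for block in blocks {
        out.write_all(block)?;
        hasher.write(block);
    }
    let digest = hasher.finish();
    // The footer comes last, so it is never part of its own input.
    out.write_all(&digest.to_le_bytes())?;
    Ok(digest)
}
```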

| owners_offset | 8 bytes |
| data_block_offset | 8 bytes |
| blob_data_block_offset | 8 bytes |
| compression_algorithm | 8 bytes |
Contributor

does compression here only apply to snapshot storage or also to any paged out appendvecs? any reason not to prefer a zerocopyable interface for the latter?

upon a complete read, i feel like i'm missing some important context around the use and intention of compression here. can you add a section describing when, where and how compression will be used?

docs/src/proposals/append-vec-storage.md
offers forward compatibility as long as the newer version only adds new fields
and does not change the definition of the existing fields.

#### (Small) Account Data Blocks
Contributor

is this block page-aligned within the appendvec?

Contributor Author

The current doc here is a bit outdated as I am currently prototyping.

In the current prototype, the account data blocks will be the first thing in the file, so the first block is guaranteed to be page-aligned. But since each block is compressed, the rest of the blocks will not be page-aligned.

Let me think about whether we could get the benefits of both compression and page alignment.

docs/src/proposals/append-vec-storage.md (outdated)
+-------------------------------------------------------------+
| Account data (after decompression) |
+-------------------------------------------------------------+
| rent_epoch | 0-8 bytes |
Contributor

zero or 8 bytes, right? this should probably be appended instead of prepended and NonZeroU64, then we get free memory layout optimization from the compiler (assuming rent-epoch == 0 is an invalid thing)

Contributor Author

Thanks for the correction. Yes, it should be 0 or 8 bytes.

If we read account data more often than rent-epoch (I think this assumption should be yes), then yes it should be appended instead of prepended.
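The layout optimization mentioned above can be seen with `std::num::NonZeroU64`: `Option<NonZeroU64>` uses the zero bit pattern for `None`, so the in-memory field stays at 8 bytes. This sketch assumes, as the comment does, that a rent epoch of 0 never needs to be represented; `AccountDataTail` and `tail_from_raw` are hypothetical names.

```rust
use std::num::NonZeroU64;

/// Hypothetical in-memory view of the optional trailing rent_epoch field.
pub struct AccountDataTail {
    pub rent_epoch: Option<NonZeroU64>,
}

/// Interprets a raw u64 where 0 means "no rent epoch stored".
pub fn tail_from_raw(raw: u64) -> AccountDataTail {
    // NonZeroU64::new returns None for 0, so 0 doubles as the absent case.
    AccountDataTail {
        rent_epoch: NonZeroU64::new(raw),
    }
}
```

This affects only the in-memory type; whether the field takes 0 or 8 bytes on disk is still a serialization decision.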

Contributor

rent_epoch is quite the story. It is only relevant anymore for rent-paying accounts, which is a shrinking set of ~1M accounts on mnb. For everyone else, I have a feature to set the value to Epoch::MAX for all rent exempt accounts. Thus, we don't need to store it at all for rent exempt accounts (the vast bulk of accounts). For rent paying accounts it is 1 of 4 values. Slot - 1, Slot, Slot + 1, 0. Where Slot is the slot of the append vec itself. It is possible we could reduce these values down to fewer than 4. Maybe even 2 values:

  1. rent already collected as of this slot
  2. rent last collected as of prior slot

Contributor

I am hustling to get this feature ready.
#28683

and does not change the definition of the existing fields.

#### (Small) Account Data Blocks
One small account data block contains multiple accounts' data.
Contributor

is each data entry a full page, or variable-length?

cross-page load/store perf is usually pretty abysmal. like an order of magnitude bad

@t-nelson
Contributor

This is a good read if we're considering retaining mmap-based file-backing. tl;dr, we probably shouldn't

Are You Sure You Want to Use MMAP in Your Database Management System? (pdf)

@yhchiang-sol
Contributor Author

Thanks for the feedback, @t-nelson! The doc is currently out of date as I am implementing the prototype, but many of your comments still apply! Will update the doc later today and get the Firedancer folks involved.

> This is a good read if we're considering retaining mmap-based file-backing. tl;dr, we probably shouldn't
> [Are You Sure You Want to Use MMAP in Your Database Management System?](https://www.cidrdb.org/cidr2022/papers/p13-crotty.pdf) (pdf)

Rest assured that the new design will not use MMAP :). We will use an LRU or some other suitable caching mechanism for reads.

@yhchiang-sol marked this pull request as draft on December 1, 2022 18:41
@yhchiang-sol
Contributor Author

Am still fixing bugs in the prototype. I want to make sure everything runs well before updating the proposal. Converting this PR to draft.

@t-nelson
Contributor

t-nelson commented Dec 1, 2022

fwiw, typically a proposal should be accepted before the implementation is started 😉

@yhchiang-sol
Contributor Author

yhchiang-sol commented Dec 2, 2022

> fwiw, typically a proposal should be accepted before the implementation is started

Yep yep, it's just a prototype to make sure it can actually produce the expected outcome :p.

And here it is: #28790

So far I am seeing up to ~75% size reduction, which is good! Instructions to test the file format are also in the comment in #28790.

Will update the proposal shortly.

@yhchiang-sol
Contributor Author

Hello @jeffwashington and @t-nelson,

Thanks for the feedback so far!

While I am still trying to add more sections/descriptions and polish the document, I've included the most important sections and information that allow our discussion to proceed.

Let me know what you think about the new format.

I've prototyped the new format and tested it by converting ~105GB of append-vec files. The new format only uses ~40GB.
If I further compress the files, the result is ~30GB, which can be used as an estimate of the snapshot size.

@github-actions bot added the stale label ([bot only] Added to stale content; results in auto-close after a week.) on Jan 19, 2023
@github-actions bot closed this on Jan 27, 2023