
Writing account state on every slot is inefficient and reduces incremental snapshot efficiency #17178

Closed
sakridge opened this issue May 11, 2021 · 13 comments
Assignees
Labels
stale [bot only] Added to stale content; results in auto-close after a week.

Comments

@sakridge (Member)

Problem

Some percentage of the accounts is rewritten on each slot.

self.store_account(&pubkey, &account);

This limits the number of append-vecs, but it creates more write IO and makes incremental snapshots larger, since it produces updates to accounts that are not necessary.

Proposed Solution

Don't rewrite on each slot. Be able to combine cross-slot accounts into a single append-vec.
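As a rough illustration of the proposed direction (not the actual AccountsDb code; `Account`, `SlotStorage`, and `store_if_modified` are hypothetical names), the store path could skip unchanged accounts so append-vecs only grow with real updates:

```rust
// Hypothetical sketch: only append an account to this slot's storage when it
// actually changed, instead of rewriting it on every slot.
#[derive(Clone, PartialEq, Debug)]
pub struct Account {
    pub lamports: u64,
    pub data: Vec<u8>,
}

#[derive(Default)]
pub struct SlotStorage {
    // (pubkey, account) pairs appended during this slot.
    pub writes: Vec<([u8; 32], Account)>,
}

impl SlotStorage {
    /// Store `after` only if it differs from `before`; returns whether a
    /// write was actually emitted.
    pub fn store_if_modified(
        &mut self,
        pubkey: [u8; 32],
        before: &Account,
        after: &Account,
    ) -> bool {
        if before == after {
            // Unchanged: no write IO, and nothing for the incremental snapshot.
            return false;
        }
        self.writes.push((pubkey, after.clone()));
        true
    }
}
```

Under this sketch, only genuinely modified accounts land in the slot's append-vec, which is exactly what keeps incremental snapshots small.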

@sakridge (Member Author)

Related to #17088

@sakridge (Member Author)

cc @ryoqun

@ryoqun (Member) commented May 12, 2021

> it creates more write IO and makes incremental snapshots larger since there are updates to accounts which are not necessary.

yeah, this is kind of wasteful. but at the time it was an easy way to address the infinite append-vec issue and to introduce stale accounts' rent collection and a periodic on-chain whole-state check...

I think some slot-based reflink mechanism should alleviate the situation while combining older appendvecs. So we would keep apparently rewriting the accounts for inclusion in the bank hash for on-chain verification. The trick is that the previous updated slot for a given account is deterministic across the cluster, so we could even embed the slot pointer into delta snapshots.

Or rather, could we just give up this on-chain state check altogether and rely completely on the background accounts hash check? I'd rather avoid that, for the sake of the blockchain's very existence.

@sakridge (Member Author)

>> it creates more write IO and makes incremental snapshots larger since there are updates to accounts which are not necessary.
>
> yeah, this is kind of wasteful. but at the time it was an easy way to address the infinite append-vec issue and to introduce stale accounts' rent collection and a periodic on-chain whole-state check...
>
> I think some slot-based reflink mechanism should alleviate the situation while combining older appendvecs. So we would keep apparently rewriting the accounts for inclusion in the bank hash for on-chain verification. The trick is that the previous updated slot for a given account is deterministic across the cluster, so we could even embed the slot pointer into delta snapshots.
>
> Or rather, could we just give up this on-chain state check altogether and rely completely on the background accounts hash check? I'd rather avoid that, for the sake of the blockchain's very existence.

Yea. At the time it was the right thing. I think we are getting to the point where we need to re-think the compaction strategy, though, because of accounts growth and the need for incremental snapshots.

@brooksprumo brooksprumo self-assigned this Oct 21, 2021
@brooksprumo (Contributor)

I'll look into this.

@ryoqun (Member) commented Oct 27, 2021

hi, i revisited this tricky problem.

let's recap some things, before making any calls, to establish a common understanding, considering the recent progress on incremental snapshots and disk-based indexes (great work and progress!).

@brooksprumo @jeffwashington (cc: @sakridge )

problems

(1): incremental snapshots are space-inefficient (@brooksprumo, how bad is it actually, as of now? and is there a target size for incremental snapshots on mainnet-beta? also, dumb question: how often should the full snapshot ideally be recreated?)
(2): the disk-based index is time-inefficient due to the ordered-key requirement of rent collection (@jeffwashington is this right? how much improvement can we expect if we can lift the ordering requirement?)
(3): as the number of accounts increases, the rent collection mechanism as-is will become a bottleneck if done for each bank.

observations

(a): the vast majority of accounts aren't touched by txes in an epoch (or more)
(b): the vast majority of accounts are rent-exempt (meaning there is no strict requirement to rewrite them)
(c): rent won't be dynamic in the foreseeable future. (rent is effectively similar to a burn by nature, considering the actual usage pattern on mainnet-beta. so, there is no strong reason to scan the whole account set every epoch)

requirements

(i): relieve the inefficiency problems mentioned above
(ii): still force staked nodes to keep those seldom-touched account data available (i.e. discourage cheating by cutting storage hardware costs)
(iii): the appendvec count should be bounded
(iv): ideally, a deterministic and differentiated data-access cost should be able to be incorporated into the new comprehensive fee. ref: #20531
(v): as a safety net, detect early any silent corruption of on-chain data and any nondeterministic mutation of it
(vi): free of dos attack vectors

(wild) possibilities

(x): tweak the on-chain rent protocol?
(y): only support rent-exempt?

my current idea

the following is the latest solution i came up with:

it is more concrete than my previous rough idea (#17178 (comment)), with some spices from @aeyakovenko's deletion-slot idea: https://discord.com/channels/428295358100013066/838890116386521088/874445426777022474

summary: introduce a full-snapshot-persisted, last-level, ordered overlay account map for all stale (= not touched in the last 432k slots) rent-exempt accounts, and restrict eager rent collection to non-rent-exempt accounts only.

haha, i guess you can't get it from the summary alone... the basic idea is to offload the costly operations from the hot path as much as possible.

here's the details.

firstly, we introduce a new kind of data persistence alongside AppendVecs, which contains all stale rent-exempt accounts. this will be implemented as a single very large appendvec, accompanied by an ordered disk-based index over its accounts, keyed by pubkey. (let's call this the stalevec)

and it works as the last-level backing store for AccountsDb. so, at a high level, any read from AccountsDb consults the (now unordered) index for the appendvecs and then the ordered index for the stalevec.

naturally, the stalevec will be persisted to the full snapshot (not the incremental snapshot).

next, we tweak eager rent collection to scan only over non-rent-exempt accounts. this in turn means we need a third (small) ordered index over the non-rent-exempt accounts for the much more narrowly scoped new eager rent collection. so, a dos against eager rent collection will force the attacker to burn a sizable amount of SOL.

also, we incorporate stale accounts' hashes into the bank hash via special handling, without actually re-writing them (to satisfy requirement ii). note that this can be calculated separately, at the start of a new bank, on a separate thread. (a bonus would be to mix some epoch-specific seed into it)

thus, new appendvecs will contain only updated accounts and rent-collected accounts. in other words, the bare minimum; nice for incremental snapshots.

lastly, any load from the stalevec will cost a little more than one from an appendvec, to reflect the actual extra system resource usage.

misc:

  • special handling will be needed for 0-lamport accounts which are still in the stalevec
  • replace account.rent_epoch with account.updated_slot (for deterministic eviction from the appendvec to the stalevec, and deterministic fee calculation)
  • i think we can avoid any writes into the stalevec/index on promotion (from stalevec to appendvec) when an account is updated, to avoid any additional dos vector.
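The layered read path described above could be sketched roughly like this (a minimal illustration, not the real AccountsDb API; the index types and the `Location` enum are assumptions):

```rust
use std::collections::{BTreeMap, HashMap};

/// Where an account was found; offsets are into the respective stores.
#[derive(Debug, PartialEq)]
enum Location {
    AppendVec(u64),
    StaleVec(u64),
}

struct LayeredIndex {
    /// Unordered index over the recent appendvecs (the hot path).
    appendvec_index: HashMap<[u8; 32], u64>,
    /// Ordered index over the single large stalevec; the ordering enables the
    /// range scans that the scoped-down eager rent collection still needs.
    stalevec_index: BTreeMap<[u8; 32], u64>,
}

impl LayeredIndex {
    fn lookup(&self, pubkey: &[u8; 32]) -> Option<Location> {
        // A recent write in an appendvec shadows the stale backing store.
        if let Some(&off) = self.appendvec_index.get(pubkey) {
            return Some(Location::AppendVec(off));
        }
        // Fall back to the last-level stalevec.
        self.stalevec_index.get(pubkey).map(|&off| Location::StaleVec(off))
    }
}
```

The two-level lookup is what lets new appendvecs stay minimal: the stalevec absorbs everything untouched, and reads simply fall through to it.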

questions and better ideas are greatly welcome. :)

@jeffwashington (Contributor)

> (2): the disk-based index is time-inefficient due to the ordered-key requirement of rent collection (@jeffwashington is this right? how much improvement can we expect if we can lift the ordering requirement?)

The ordering requirement has a pretty minimal effect on perf, I think. The ordered-range requirement does require binning by the high bits of the pubkey, which is not ideal against attacks.

@jeffwashington (Contributor)

> (2): the disk-based index is time-inefficient due to the ordered-key requirement of rent collection (@jeffwashington is this right? how much improvement can we expect if we can lift the ordering requirement?)
>
> The ordering requirement has a pretty minimal effect on perf, I think. The ordered-range requirement does require binning by the high bits of the pubkey, which is not ideal against attacks.

To follow up on ordering: only non-rent scans require some type of sort now, and we concluded those are only used when secondary indexes don't exist and someone does a scan. Rent collection returns all pubkeys within a range by scanning the one or more buckets known to contain the requested range, and the results are returned unordered. This isn't perfect, but I don't believe this scan performs badly if we HAVE to do a scan of some type. Obviously, if we could avoid a scan like this altogether, performance would be better.

@brooksprumo (Contributor)

> (1): incremental snapshot is space-inefficient (@brooksprumo, actually how bad as of now? and is there target size of incremental snapshot for mainnet-beta? also, dumb question: how often should full snapshot be recreated ideally?)

@ryoqun Not dumb at all! Right now, the default interval for creating full snapshots is every 100,000 slots. The default interval for creating incremental snapshots is 100 slots.

Incremental snapshots can range from very good to merely fine. Refer to this dashboard (and set the duration to the past 7 days). Immediately after a full snapshot is taken, the incremental snapshots are very small, ~200 MB. At the other extreme, just before the next full snapshot is taken (i.e. when the incremental snapshot is at its largest), it's just under 3 GB. For comparison, the full snapshots are now almost 12 GB!

Looking at the charts, the incremental snapshot size grows linearly between two full snapshots. I believe this is due to extra accounts being re-stored by rent collection when they otherwise didn't need to be (since they were unchanged). Eliminating those rewrites could further improve the benefit of incremental snapshots.

@brooksprumo (Contributor)

@ryoqun Your idea is very interesting! I previously discussed this a bit, first with @sakridge and then later with @jeffwashington. Stephen posed the idea of storing multiple slots within a single AppendVec. Jeff and I took this further into the idea of AncientAppendVecs, for accounts that are ~2 (or more) epochs old.

Your stalevec idea sounds very similar! Some things Jeff pointed out were around compaction, and when to shrink an AncientAppendVec. Also, how many should there be? And also keeping them sorted, though we were thinking by slot, not pubkey.

This led into thinking about flush, and how, with the Accounts Cache, flush (and shrink) are the only two places that create new AppendVecs, where the contents are fully known. This could allow flush to create a single AppendVec that includes multiple slots. This has benefits for what Jeff is looking at (and I'll let him talk more about that). I'm trying to get an idea like this working as a proof of concept.

@brooksprumo (Contributor)

@aeyakovenko posted an idea on Discord today that could solve this:

  • Accounts pay for rent upfront
  • Each account would now track when its rent expires (i.e. rent_expiration_slot)
  • Accounts would need to pre-pay for the next rent duration before the next rent_expiration_slot (maybe this could be done automatically?)

This would help the validators, as they'll now be compensated for all accounts (instead of just rent-paying accounts, which are in the minority).

This would help the runtime by removing the need to scan all the accounts every epoch for rent collection. I.e., generate_index() could clean up these accounts when starting up from a snapshot.
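The pre-paid rent idea could look roughly like this (a sketch only; the field and function names such as `rent_expiration_slot` and `extend_rent` come from the description above, the rest is assumed):

```rust
pub struct Account {
    pub lamports: u64,
    /// Slot after which the pre-paid rent runs out.
    pub rent_expiration_slot: u64,
}

/// An expired account is a candidate for cleanup (e.g. by generate_index()
/// at startup); no epoch-wide scan is needed to discover this.
pub fn is_rent_expired(account: &Account, current_slot: u64) -> bool {
    current_slot > account.rent_expiration_slot
}

/// Pre-pay for the next rent duration before expiry (possibly automatically).
/// Returns false if the account cannot afford it, in which case it will
/// expire and become reclaimable.
pub fn extend_rent(account: &mut Account, rent_cost: u64, period_slots: u64) -> bool {
    if account.lamports < rent_cost {
        return false;
    }
    account.lamports -= rent_cost;
    account.rent_expiration_slot += period_slots;
    true
}
```

Note how the expiry check is a single comparison against the current slot, which is what makes the per-epoch scan unnecessary.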

Some questions:

  • How to handle the pre-paying for the next rent duration? Who checks that? We're trying to remove the need to scan all the accounts...
  • How to handle the cleaning up of these accounts? Is generate_index() sufficient? What if a node never reboots? Will it be storing all these now-dead accounts indefinitely? How will this work with/affect the bank capitalization checks?

@brooksprumo (Contributor)

To follow on from my comment above, I was thinking about how to tie lazy rent collection (i.e. no scanning of accounts) together with incentives to find and close accounts that should be rent-collected.

There are parallels to how some DeFi programs I've heard about do this (from what I remember). Specifically, accounts that should be liquidated/closed are closed via a transaction, not by the runtime. For example, let's say Account A should be closed. Others (e.g. Account B) can send a transaction to query that account and identify that it should be closed, then send another transaction telling the program to close it. Account B would then get some portion of the reward for closing Account A. The remaining portion goes to the program (or the runtime/validators, in our case).

Regular users are now incentivized to find and close accounts. These are regular transactions, so the existing economics of transactions already handle this. A new CloseAccountsAndReapRewards transaction would have a high enough cost to dissuade users from sending erroneous transactions without being sure the account should actually be closed.

What remains is how much of the reward should go to the user, and where that reward comes from. If accounts can only be closed-and-collected once they've reached a balance of 0 lamports, there would be no additional rewards to collect.

If accounts can be closed below a threshold (i.e. the rent-exempt minimum), that could work, but traditionally those lamports have gone 100% to the validators; would the validators now get a smaller reward?

Maybe creating new accounts should have a fixed upfront cost in addition to rent? This way, validators get the rewards when accounts are created, and users get the rewards when accounts are closed?
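A tiny sketch of the reward split being discussed (the 50/50 ratio and the function name are purely illustrative; the actual split is exactly the open question above):

```rust
/// Split the lamports reclaimed from a closed account between the user who
/// sent the close transaction and the runtime/validators.
fn split_close_reward(reclaimed_lamports: u64) -> (u64, u64) {
    let closer_share = reclaimed_lamports / 2; // reward for finding the account
    let runtime_share = reclaimed_lamports - closer_share; // remainder to validators
    (closer_share, runtime_share)
}
```

Computing the runtime share as the remainder (rather than a second division) guarantees the two shares always sum exactly to the reclaimed amount, with no lamport lost to rounding.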

Obviously still lots of details to nail down, but is this a feasible path to continue investigating?

@jeffwashington (Contributor)

A lot of progress has been made on this. It is making its way to completion.

@github-actions bot added the stale label Jul 31, 2023
@github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) Aug 8, 2023