
Avoid early clean and bad snapshot by ref-counting #8724

Merged
ryoqun merged 4 commits into master from no-storage-gc-based-early-clean on Mar 13, 2020

Conversation

ryoqun
Member

@ryoqun ryoqun commented Mar 9, 2020

Problem

I've just found another corner case where a validator can create a bad snapshot.

A validator can clean any rooted slot at any time unless the account's balance is 0 lamports. So the online processing of clean_accounts can have an incomplete view of the alive storages for a given pubkey, and it could purge a shielding zero-lamport pubkey too early and create a bad snapshot. If a node is restored from such a snapshot, the ill-formed AccountsIndex could incorrectly revive the 0-lamport account using the old state from an old storage.

For details, please see the accompanying unit test (it's huge...).

Anyway, this bug manifests more often if we enable off-line snapshot cleaning, because the apparently-leaking storages are, ironically, actually shielding against such bad account-view revivals in some cases.
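To make the failure sequence concrete, here is a self-contained toy sketch (hypothetical names and values, not the real AccountsDB types) of how purging the shielding zero-lamport entry too early lets stale state win back after a restore:

```rust
use std::collections::HashMap;

fn main() {
    // Toy model: each slot's storage is a map of pubkey -> lamports.
    let mut storages: HashMap<u64, HashMap<&str, u64>> = HashMap::new();
    storages.insert(10, HashMap::from([("X", 5)])); // old, non-zero state of X
    storages.insert(20, HashMap::from([("X", 0)])); // newer, zero-lamport state of X

    // Early clean: the shielding zero-lamport entry at slot 20 is purged while
    // slot 10's storage is still alive, because clean saw an incomplete view.
    storages.get_mut(&20).unwrap().remove("X");

    // Restore from snapshot: the index is rebuilt by scanning surviving storages,
    // keeping the newest entry per pubkey, so X is revived with stale state.
    let mut rebuilt_index: HashMap<&str, (u64, u64)> = HashMap::new(); // pubkey -> (slot, lamports)
    for (slot, accounts) in &storages {
        for (pubkey, lamports) in accounts {
            let entry = rebuilt_index.entry(*pubkey).or_insert((*slot, *lamports));
            if *slot > entry.0 {
                *entry = (*slot, *lamports);
            }
        }
    }
    assert_eq!(rebuilt_index.get("X"), Some(&(10, 5))); // the stale 5-lamport state wins
}
```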

Summary of Changes

Introduce a ref-count from storages to the index. Its lifetime is bounded so that it is unconditionally incremented at AccountsIndex::insert() and decremented at clean_dead_slots, respectively.

This is a bit costly (roughly: a temporal cost of +1 write lock per account update, and a spatial cost of 8 bytes per pubkey).
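Conceptually, the bookkeeping looks something like the sketch below. Only the method names insert and unref_from_storage mirror this PR's description and diff; ToyIndex, can_purge, and everything else are made up for illustration:

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};

#[derive(Default)]
struct ToyIndex {
    // The claimed ~8 extra bytes per pubkey: one counter piggybacked on the index.
    ref_counts: HashMap<[u8; 32], AtomicU64>,
}

impl ToyIndex {
    // Called on every account update, mirroring the "+1 wlock per account update" cost.
    fn insert(&mut self, pubkey: [u8; 32]) {
        self.ref_counts
            .entry(pubkey)
            .or_insert_with(|| AtomicU64::new(0))
            .fetch_add(1, Ordering::Relaxed);
    }

    // Called while scanning a dead slot's storage during clean_dead_slots.
    fn unref_from_storage(&self, pubkey: &[u8; 32]) {
        if let Some(count) = self.ref_counts.get(pubkey) {
            count.fetch_sub(1, Ordering::Relaxed);
        }
    }

    // Hypothetical check: a zero-lamport entry is only purgeable once every alive
    // storage entry for this pubkey is accounted for by the cleaner.
    fn can_purge(&self, pubkey: &[u8; 32], entries_seen_by_clean: u64) -> bool {
        self.ref_counts
            .get(pubkey)
            .map(|c| c.load(Ordering::Relaxed) == entries_seen_by_clean)
            .unwrap_or(false)
    }
}

fn main() {
    let pk = [7u8; 32];
    let mut index = ToyIndex::default();
    index.insert(pk); // account stored in an old slot
    index.insert(pk); // newer zero-lamport update stored in another slot
    assert!(!index.can_purge(&pk, 1)); // clean only saw one entry: must not purge yet
    index.unref_from_storage(&pk); // the old slot's storage eventually dies
    assert!(index.can_purge(&pk, 1)); // counts line up; purging is now safe
}
```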

@sakridge This PR is still kind of a draft... Is there a better solution? If not, I'll do the remaining clean-ups, like the hideous scattered .0s and outdated names.

@mvines I haven't tested this, but it may be a possible cause of the previously reported snapshot verification error in the SLP cluster with 0.23, where the recently regressed snapshot verification bug shouldn't exist.

Part of #8168

@ryoqun ryoqun added the work in progress (This isn't quite right yet) label Mar 9, 2020
@ryoqun ryoqun requested a review from sakridge March 9, 2020 06:07
@codecov

codecov bot commented Mar 9, 2020

Codecov Report

Merging #8724 into master will increase coverage by 0.0%.
The diff coverage is 99.2%.

@@          Coverage Diff           @@
##           master   #8724   +/-   ##
======================================
  Coverage    79.9%   80.0%           
======================================
  Files         261     261           
  Lines       57141   57238   +97     
======================================
+ Hits        45699   45795   +96     
- Misses      11442   11443    +1     

@ryoqun ryoqun added the v1.0 label Mar 9, 2020
runtime/src/accounts_db.rs (review thread; outdated, resolved)
@sakridge
Member

sakridge commented Mar 9, 2020

@ryoqun Good find. I will take a look.

```rust
for slot in dead_slots.iter() {
    for store in storage.0.get(slot).unwrap().values() {
        for account in store.accounts.accounts(0) {
            index.unref_from_storage(&account.meta.pubkey);
        }
    }
}
```
Member Author

@ryoqun ryoqun Mar 9, 2020

If there is no better approach to this problem, we can parallelize/async this code as a further improvement to this PR.
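For reference, one rough shape such a parallel pass might take, reusing the names from the snippet above and assuming index and storage can be shared across threads and that unref_from_storage only needs &self (a sketch, not code from this PR):

```rust
use rayon::prelude::*;

// Hypothetical parallel variant of the loop above; not part of this change.
dead_slots.par_iter().for_each(|slot| {
    for store in storage.0.get(slot).unwrap().values() {
        for account in store.accounts.accounts(0) {
            index.unref_from_storage(&account.meta.pubkey);
        }
    }
});
```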

@sakridge
Member


Hmm. This is a good way to do it.

I'm not super excited about the storage scan we have to do when cleaning up the slot. The other way I suppose would be to keep another index for each appendvec which is a hashmap of pubkey to ref count for that appendvec. I'm still thinking about it though.

@sakridge
Member


Another option could be to not remove the fork entries from the accounts index until the append vec store is actually removed for that entry.
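Roughly, that earlier alternative would pair each AppendVec with its own ref-count map, something like this hypothetical sketch (not proposed code):

```rust
use std::collections::HashMap;

type Pubkey = [u8; 32];

// Hypothetical per-AppendVec bookkeeping: when the store dies, its map is
// dropped wholesale, so no storage rescan is needed at clean time. The cost is
// a second pubkey-keyed map per AppendVec.
#[derive(Default)]
struct ToyAppendVec {
    ref_counts: HashMap<Pubkey, u64>,
}

impl ToyAppendVec {
    fn record_account(&mut self, pubkey: Pubkey) {
        *self.ref_counts.entry(pubkey).or_insert(0) += 1;
    }
}

fn main() {
    let mut store = ToyAppendVec::default();
    store.record_account([7u8; 32]);
    drop(store); // slot death: the counts disappear with the store
}
```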

@ryoqun
Member Author

ryoqun commented Mar 10, 2020

> @ryoqun Good find. I will take a look.
>
> Hmm. This is a good way to do it.
>
> I'm not super excited about the storage scan we have to do when cleaning up the slot.

Yeah, me neither...

> The other way I suppose would be to keep another index for each appendvec which is a hashmap of pubkey to ref count for that appendvec. I'm still thinking about it though.

I considered that implementation, too, but I opted for the current PR's approach because having two index maps keyed by 32-byte pubkeys (which could be huge) could waste too much memory. I decided that without much thought, though, so your thoughts are very welcome!
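For a sense of scale, here is a back-of-the-envelope comparison of the two layouts, using made-up account counts and per-entry map overheads purely for illustration:

```rust
// All numbers below are illustrative assumptions, not measurements.
const PUBKEY_BYTES: u64 = 32;
const REF_COUNT_BYTES: u64 = 8;
const ASSUMED_MAP_OVERHEAD_PER_ENTRY: u64 = 16; // hypothetical; varies by implementation

fn main() {
    let accounts: u64 = 10_000_000; // hypothetical number of live account entries
    let refcount_in_existing_index = accounts * REF_COUNT_BYTES;
    let extra_pubkey_keyed_maps =
        accounts * (PUBKEY_BYTES + REF_COUNT_BYTES + ASSUMED_MAP_OVERHEAD_PER_ENTRY);
    // ~80 MB for a u64 piggybacked on the existing index entries...
    println!("ref count in existing index: ~{} MB", refcount_in_existing_index / 1_000_000);
    // ...versus hundreds of MB once every entry pays for its 32-byte key again.
    println!("separate pubkey-keyed maps: >= ~{} MB", extra_pubkey_keyed_maps / 1_000_000);
}
```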

@ryoqun
Member Author

ryoqun commented Mar 10, 2020

> @ryoqun Good find. I will take a look.
>
> Hmm. This is a good way to do it.
> I'm not super excited about the storage scan we have to do when cleaning up the slot. The other way I suppose would be to keep another index for each appendvec which is a hashmap of pubkey to ref count for that appendvec. I'm still thinking about it though.
>
> Another option could be to not remove the fork entries from the accounts index until the append vec store is actually removed for that entry.

I decided against that approach, mainly because #8436 recently made it desirable to keep the number of fork entries small: it traverses many entries at each snapshot interval while cleaning for the next run. There is also a similar concern for lazy cleaning's fork-entry traversal inside regular index updates.

As with my previous comment, I decided this without much thought. I'm very open to your opinions!

@ryoqun
Member Author

ryoqun commented Mar 10, 2020

I've tested this PR extensively locally and found that it actually fixes the bank hash mismatch error seen when cleaning a recent TdS snapshot with the new cleaning implementation (not pushed to #8337).

@ryoqun
Member Author

ryoqun commented Mar 11, 2020

@sakridge What are your thoughts? Do you need any benchmark info or anything else to decide which implementation strategy we should follow?

@sakridge
Member

> @sakridge What are your thoughts? Do you need any benchmark info or anything else to decide which implementation strategy we should follow?

I think this is reasonable. I like that it seems like the work could be done async. The other methods would require more locking in the critical update path. Let's move forward with this. It would be nice to get an idea of what the performance is like though before we merge it.

@ryoqun ryoqun changed the title [wip] Avoid early clean and bad snapshot by ref-counting Avoid early clean and bad snapshot by ref-counting Mar 11, 2020
@ryoqun ryoqun force-pushed the no-storage-gc-based-early-clean branch from 9701385 to 1697ca5 on March 11, 2020 09:17
@ryoqun
Member Author

ryoqun commented Mar 11, 2020

I've done minor clean-ups, rebased on the current HEAD of origin/master, and measured the before and after as-is:

grafana-testnet-monitor-edge-ryoqun-storage-gc-before.pdf
grafana-testnet-monitor-edge-ryoqun-storage-gc-after-v0.pdf

As expected, performance is considerably affected.

I'll try some perf improvements.

@ryoqun
Member Author

ryoqun commented Mar 11, 2020

@sakridge My reading of the performance from the preceding benchmark results, analyzed in a similar manner to this and this:

  • Overall, the critical section of snapshot creation got slower (+~1 s, ~2x)
    • Max confirmation time increased (~1.3x for the 99th percentile, ~1.8x for the max) due to the more noticeable stoppage at each snapshot creation, which entails the largest reclaiming
  • However, overall TPS is unchanged (within ~±1%)
  • Lock contention is visible on the "Resource Usage" panel again, as in Do periodic inbound cleaning for rooted slots #8436.

@ryoqun ryoqun marked this pull request as ready for review March 11, 2020 18:17
@ryoqun
Member Author

ryoqun commented Mar 11, 2020

@sakridge Also, for the record, creating the PDFs is as easy as:

  1. ./colo.sh create -n 3 -c 1
  2. ./init-metrics.sh $(whoami)
  3. ./net.sh start
  4. wait 30 min
  5. ./net.sh stop
  6. Chromium: Print to PDF with A3 paper size, a custom scale of 25%, and background colors enabled

@ryoqun ryoqun merged commit 952cd38 into solana-labs:master Mar 13, 2020
mergify bot pushed a commit that referenced this pull request Mar 13, 2020
* Avoid early clean and bad snapshot by ref-counting

* Add measure

* Clean ups

* clean ups

(cherry picked from commit 952cd38)
```rust
for (_slot_id, account_info) in account_infos {
    if *store_counts.get(&account_info.store_id).unwrap() != 0 {
    if *store_counts.get(&account_info.store_id).unwrap() == 0 {
        would_unref_count += 1;
```
Member Author

Sadly, this counting was flawed from the start...

It doesn't align with the incrementing here when there are multiple stores: https://github.com/solana-labs/solana/pull/8724/files#diff-ef68c28d9d63e66355c44904d9877825R125

Fixed in #12462
