Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix zero-lamport accounts preventing slot cleanup #12606

Merged
merged 6 commits into from
Oct 9, 2020

Conversation

carllin
Copy link
Contributor

@carllin carllin commented Oct 1, 2020

Problem

Some slot S containing an update for an account with key K may not be cleaned up if at the time of clean_slots, the latest update in K is in some rooted slot S' > S and K has zero lamports. This is because in clean_slots()

if account_info.lamports == 0 {

and

else if accounts_index.uncleaned_roots.contains(slot)

are mutually exclusive, and uncleaned_roots is purged at the end of the clean_slots() loop, so those slots will not be re-examined if the pubkey is not updated again in some future root that is added to uncleaned root

Summary of Changes

Fixes #

@carllin carllin requested review from ryoqun and sakridge October 1, 2020 02:44
runtime/src/accounts_db.rs Outdated Show resolved Hide resolved
@@ -681,7 +686,8 @@ impl AccountsDB {
let (slot, account_info) = &list[index];
if account_info.lamports == 0 {
purges.insert(*pubkey, accounts_index.would_purge(pubkey));
} else if accounts_index.uncleaned_roots.contains(slot) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing another AccountsDB bug.. ;)

iirc, there was some reason that I intentionally chained with else if. But I can't remember the exact reason. And, sorry for the lack of comment. I'll try to remember and report back here. Anyway, a lot happened since then. So, this is just ok as long as the ci is green and running validator causes no bad snapshot.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ryoqun, no worries! Was it to avoid deleting the slot such that an AppendVec added earlier to purges wouldn't be deleted, i.e. so this wouldn't panic? https://github.com/solana-labs/solana/blob/master/runtime/src/accounts_db.rs#L720. I suspect this might be the reason because if you change the else if to an if on master, and try running this test: https://github.com/solana-labs/solana/pull/12606/files#diff-2099c5256db4eb5975c8834af38f6456R3118, that unwrap() will panic

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(for the record) this line is added #8436.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@carllin Hmm, it seems some of other tests are failing. Could you look at it? Those failing tests will be very cryptic to you. So don't hesitate to ask questions. At first glance, I can't tell these are actually failing by being broken or these test assertions need to just be updated.

I'm still trying to remember something. Still, the memory is blur but it was like this:

  • clean_old_rooted_accounts should conceptually be complement to the cleaning of zero lamport accounts. So, for every given live account, we do either clean_old_rooted_accounts or cleaning of zero lamport accounts. Even if we don't clean old_rooted accounts for zero lamport accounts, the cleaning of zero lamport logic can purge them eventually. So there is no need to do both; it's kind of waste doing both for each account every time we run clean_accounts().
  • There was a problem / bug when we're too eager to clean old index entries of zero lamport accounts. So combined the above bullet point concern, I've avoided doing both and chose if else approach. (That's why tests failing I guess).
    • ... And I failed to comment as such because that assumption was so prevalent at the time. It was obvious thing to me. (my bad).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ryoqun, no worries! Was it to avoid deleting the slot such that an AppendVec added earlier to purges wouldn't be deleted, i.e. so this wouldn't panic? https://github.com/solana-labs/solana/blob/master/runtime/src/accounts_db.rs#L720. I suspect this might be the reason because if you change the else if to an if on master, and try running this test: https://github.com/solana-labs/solana/pull/12606/files#diff-2099c5256db4eb5975c8834af38f6456R3118, that unwrap() will panic

As said in the my above comment, I don't think my thought for the original implementation decision started from these concerns. At the time, I didn't even tried to do if approach. Anyway, I think your observation is correct for the interrelation of two cleaning and panic of wrap with the if approach..

Copy link
Contributor Author

@carllin carllin Oct 1, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@carllin Hmm, it seems some of other tests are failing. Could you look at it? Those failing tests will be very cryptic to you. So don't hesitate to ask questions. At first glance, I can't tell these are actually failing by being broken or these test assertions need to just be updated.

I'm still trying to remember something. Still, the memory is blur but it was like this:

  • clean_old_rooted_accounts should conceptually be complement to the cleaning of zero lamport accounts. So, for every given live account, we do either clean_old_rooted_accounts or cleaning of zero lamport accounts. Even if we don't clean old_rooted accounts for zero lamport accounts, the cleaning of zero lamport logic can purge them eventually. So there is no need to do both; it's kind of waste doing both for each account every time we run clean_accounts().

  • There was a problem / bug when we're too eager to clean old index entries of zero lamport accounts. So combined the above bullet point concern, I've avoided doing both and chose if else approach. (That's why tests failing I guess).

    • ... And I failed to comment as such because that assumption was so prevalent at the time. It was obvious thing to me. (my bad).

Yeah, working on those tests! I think the ones that are failing are a mix of assertions failing and me not handling: #12606 (comment), which is breaking the issue described here: #7157 (This is the issue with old AppendVecs reviving zero-lamport accounts, which should now be addressed by calc_delete_dependencies()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see! It looks that you passed the CI fully. So happy. :)

So, the once independent cleaning mechanisms finally start to interact each other in one pass after battle-tested for half a year. That sounds cool.

@carllin carllin force-pushed the FixAccountsDb2 branch 2 times, most recently from f58d192 to bd34e1e Compare October 2, 2020 07:07
runtime/src/accounts_db.rs Outdated Show resolved Hide resolved
runtime/src/accounts_db.rs Outdated Show resolved Hide resolved
@codecov
Copy link

codecov bot commented Oct 2, 2020

Codecov Report

Merging #12606 into master will increase coverage by 0.0%.
The diff coverage is 100.0%.

@@           Coverage Diff            @@
##           master   #12606    +/-   ##
========================================
  Coverage    82.0%    82.0%            
========================================
  Files         359      359            
  Lines       83989    84123   +134     
========================================
+ Hits        68875    69009   +134     
  Misses      15114    15114            

runtime/src/accounts_db.rs Outdated Show resolved Hide resolved
runtime/src/accounts_db.rs Outdated Show resolved Hide resolved
runtime/src/accounts_db.rs Outdated Show resolved Hide resolved
runtime/src/accounts_db.rs Outdated Show resolved Hide resolved
runtime/src/accounts_db.rs Outdated Show resolved Hide resolved
@ryoqun
Copy link
Member

ryoqun commented Oct 2, 2020

@carllin not strictly, related to this pr, but I found rather worried flaky test failure: https://buildkite.com/solana-labs/solana/builds/32038#b65ccd3b-3b4d-40c6-82bc-51b685ed5948/247-1671

Could you take a look at that before merging this?

---- bank::tests::test_clean_zero_lamport_account_different_hash stdout ----
thread 'bank::tests::test_clean_zero_lamport_account_different_hash' panicked at 'No bank hash was found for this bank, that should not be possible', runtime/src/accounts.rs:671:14
stack backtrace:
   0: rust_begin_unwind
             at /rustc/7e6d6e5f535321c2223f044caba16f97b825009c/library/std/src/panicking.rs:475
   1: core::panicking::panic_fmt
             at /rustc/7e6d6e5f535321c2223f044caba16f97b825009c/library/core/src/panicking.rs:85
   2: core::option::expect_failed
             at /rustc/7e6d6e5f535321c2223f044caba16f97b825009c/library/core/src/option.rs:1211
   3: core::option::Option<T>::expect
             at /rustc/7e6d6e5f535321c2223f044caba16f97b825009c/library/core/src/option.rs:331
   4: solana_runtime::accounts::Accounts::bank_hash_info_at
             at ./src/accounts.rs:669
   5: solana_runtime::bank::Bank::hash_internal_state
             at ./src/bank.rs:3236
   6: solana_runtime::bank::Bank::rehash
             at ./src/bank.rs:1345
   7: solana_runtime::bank::tests::test_clean_zero_lamport_account_different_hash
             at ./src/bank.rs:9365
   8: solana_runtime::bank::tests::test_clean_zero_lamport_account_different_hash::{{closure}}
             at ./src/bank.rs:9359
   9: core::ops::function::FnOnce::call_once
             at /rustc/7e6d6e5f535321c2223f044caba16f97b825009c/library/core/src/ops/function.rs:233
  10: core::ops::function::FnOnce::call_once
             at /rustc/7e6d6e5f535321c2223f044caba16f97b825009c/library/core/src/ops/function.rs:233
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
 
failures:
    bank::tests::test_clean_zero_lamport_account_different_hash

runtime/src/accounts_db.rs Outdated Show resolved Hide resolved
@ryoqun
Copy link
Member

ryoqun commented Oct 2, 2020

@carllin Sorry for late review as always... ;) Also, this time I'm afraid but I have to assign some homework to you.

@ryoqun
Copy link
Member

ryoqun commented Oct 2, 2020

Anyway, I see this pr is great progress.

I'm really looking forward to the less bloated snapshots. And slightly increased write thanks to more granular storage lock? I think zero lamport tombstones will be reclaimed much faster with this pr.

Also, please test this for rather long duration, pointing to mainnet-beta and testnet.

@carllin
Copy link
Contributor Author

carllin commented Oct 3, 2020

@ryoqun for sure, thanks for the review! Definitely letting this run against mainnet for a while.

@carllin not strictly, related to this pr, but I found rather worried flaky test failure: https://buildkite.com/solana-labs/solana/builds/32038#b65ccd3b-3b4d-40c6-82bc-51b685ed5948/247-1671

Could you take a look at that before merging this?

---- bank::tests::test_clean_zero_lamport_account_different_hash stdout ----
thread 'bank::tests::test_clean_zero_lamport_account_different_hash' panicked at 'No bank hash was found for this bank, that should not be possible', runtime/src/accounts.rs:671:14
stack backtrace:
   0: rust_begin_unwind
             at /rustc/7e6d6e5f535321c2223f044caba16f97b825009c/library/std/src/panicking.rs:475
   1: core::panicking::panic_fmt
             at /rustc/7e6d6e5f535321c2223f044caba16f97b825009c/library/core/src/panicking.rs:85
   2: core::option::expect_failed
             at /rustc/7e6d6e5f535321c2223f044caba16f97b825009c/library/core/src/option.rs:1211
   3: core::option::Option<T>::expect
             at /rustc/7e6d6e5f535321c2223f044caba16f97b825009c/library/core/src/option.rs:331
   4: solana_runtime::accounts::Accounts::bank_hash_info_at
             at ./src/accounts.rs:669
   5: solana_runtime::bank::Bank::hash_internal_state
             at ./src/bank.rs:3236
   6: solana_runtime::bank::Bank::rehash
             at ./src/bank.rs:1345
   7: solana_runtime::bank::tests::test_clean_zero_lamport_account_different_hash
             at ./src/bank.rs:9365
   8: solana_runtime::bank::tests::test_clean_zero_lamport_account_different_hash::{{closure}}
             at ./src/bank.rs:9359
   9: core::ops::function::FnOnce::call_once
             at /rustc/7e6d6e5f535321c2223f044caba16f97b825009c/library/core/src/ops/function.rs:233
  10: core::ops::function::FnOnce::call_once
             at /rustc/7e6d6e5f535321c2223f044caba16f97b825009c/library/core/src/ops/function.rs:233
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
 
failures:
    bank::tests::test_clean_zero_lamport_account_different_hash

Addressed this here: #12658

Comment on lines +793 to +795
let count = self
.storage
.read()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@carllin Opps, Sorry to spot this at later stage of this pr. But, I think there is panic! possibility here, because of the shrinker thread, which is running in parallel.

By moving this self.storage.read() lock inside here from the outer loop prologue, now shrinker thread are free to update account storages while this is running. unwrap() at L798 here bite us.

This is because shrinker effectively replaces old AccountStorageEntry with new one while removing the old AccountStorageEntry (we might be accessing this from here). remove_dead_accounts() never cleans up any empty AccountStorageEntries unless all of them for given slot is empty. But the do_shrink_slot do its work diligently by removing any empty StorageEntry manually after handle_reclaims, knowing they can almost certainly can remove empty one of them at the very least.

As a matter of fact, we're not affected from other store codepath, however. Because we're mangling rooted slots. other store codepath only touches non-rooted new slots. The only two which touches the rooted slots are this cleaner and shrinker.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ryoqun, shrink and clean run serially in AccountsBackgroundService, no?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, you're right; Even more, the above comment is wrong: we're actually serializing the executions: https://github.com/solana-labs/solana/pull/12606/files?diff=split&w=1#diff-2099c5256db4eb5975c8834af38f6456R703

nis: So, I think this lock can go away now.

Copy link
Member

@ryoqun ryoqun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with nits

Great work! Also thanks for addressing various review comments! This got a lot clearer. 👍
And, sorry for the false alarm about the shrinker<=>cleaner data race bug... ;)

@ryoqun
Copy link
Member

ryoqun commented Oct 9, 2020

@carllin btw, I'd rather like to wait backporting this to v1.4 for the duration of 1 or 2 point releases. Because this isn't strictly a bug fix, I think we can wait. Thus, I want your previous grand work (#12194) to get ripe a bit on v1.4. What do you think?

@carllin
Copy link
Contributor Author

carllin commented Oct 9, 2020

@ryoqun I think that makes sense! I'll hold off on the backport and let the other PR bake a bit, thanks for the review!

@carllin carllin merged commit 16d45b8 into solana-labs:master Oct 9, 2020
@ryoqun
Copy link
Member

ryoqun commented Oct 19, 2020

@carllin Time to backport this? it looks like testnet is doing well with #12194.

Also, we're queueing #12126 and #12787 (preferably as the next patch release after this pr). Also, I rather want to spend most of time with all these prs backported on v1.4 so that we could be confident when the mainnet finally moves to the v1.4.

@carllin carllin added the v1.4 label Oct 19, 2020
@carllin
Copy link
Contributor Author

carllin commented Oct 19, 2020

@ryoqun definitely, tagged!

mergify bot pushed a commit that referenced this pull request Oct 19, 2020
Co-authored-by: Carl Lin <[email protected]>
(cherry picked from commit 16d45b8)
mergify bot added a commit that referenced this pull request Oct 19, 2020
Co-authored-by: Carl Lin <[email protected]>
(cherry picked from commit 16d45b8)

Co-authored-by: carllin <[email protected]>
@ryoqun
Copy link
Member

ryoqun commented Oct 25, 2020

image

it looks like snapshot got bigger a bit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants