-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix zero-lamport accounts preventing slot cleanup #12606
Conversation
@@ -681,7 +686,8 @@ impl AccountsDB { | |||
let (slot, account_info) = &list[index]; | |||
if account_info.lamports == 0 { | |||
purges.insert(*pubkey, accounts_index.would_purge(pubkey)); | |||
} else if accounts_index.uncleaned_roots.contains(slot) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for fixing another AccountsDB bug.. ;)
iirc, there was some reason that I intentionally chained with else if
. But I can't remember the exact reason. And, sorry for the lack of comment. I'll try to remember and report back here. Anyway, a lot happened since then. So, this is just ok as long as the ci is green and running validator causes no bad snapshot.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ryoqun, no worries! Was it to avoid deleting the slot such that an AppendVec added earlier to purges
wouldn't be deleted, i.e. so this wouldn't panic? https://github.com/solana-labs/solana/blob/master/runtime/src/accounts_db.rs#L720. I suspect this might be the reason because if you change the else if
to an if
on master, and try running this test: https://github.com/solana-labs/solana/pull/12606/files#diff-2099c5256db4eb5975c8834af38f6456R3118, that unwrap() will panic
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(for the record) this line is added #8436.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@carllin Hmm, it seems some of other tests are failing. Could you look at it? Those failing tests will be very cryptic to you. So don't hesitate to ask questions. At first glance, I can't tell these are actually failing by being broken or these test assertions need to just be updated.
I'm still trying to remember something. Still, the memory is blur but it was like this:
clean_old_rooted_accounts
should conceptually be complement to the cleaning of zero lamport accounts. So, for every given live account, we do eitherclean_old_rooted_accounts
or cleaning of zero lamport accounts. Even if we don't clean old_rooted accounts for zero lamport accounts, the cleaning of zero lamport logic can purge them eventually. So there is no need to do both; it's kind of waste doing both for each account every time we runclean_accounts()
.- There was a problem / bug when we're too eager to clean old index entries of zero lamport accounts. So combined the above bullet point concern, I've avoided doing both and chose
if else
approach. (That's why tests failing I guess).- ... And I failed to comment as such because that assumption was so prevalent at the time. It was obvious thing to me. (my bad).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ryoqun, no worries! Was it to avoid deleting the slot such that an AppendVec added earlier to
purges
wouldn't be deleted, i.e. so this wouldn't panic? https://github.com/solana-labs/solana/blob/master/runtime/src/accounts_db.rs#L720. I suspect this might be the reason because if you change theelse if
to anif
on master, and try running this test: https://github.com/solana-labs/solana/pull/12606/files#diff-2099c5256db4eb5975c8834af38f6456R3118, that unwrap() will panic
As said in the my above comment, I don't think my thought for the original implementation decision started from these concerns. At the time, I didn't even tried to do if
approach. Anyway, I think your observation is correct for the interrelation of two cleaning and panic of wrap with the if
approach..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@carllin Hmm, it seems some of other tests are failing. Could you look at it? Those failing tests will be very cryptic to you. So don't hesitate to ask questions. At first glance, I can't tell these are actually failing by being broken or these test assertions need to just be updated.
I'm still trying to remember something. Still, the memory is blur but it was like this:
clean_old_rooted_accounts
should conceptually be complement to the cleaning of zero lamport accounts. So, for every given live account, we do eitherclean_old_rooted_accounts
or cleaning of zero lamport accounts. Even if we don't clean old_rooted accounts for zero lamport accounts, the cleaning of zero lamport logic can purge them eventually. So there is no need to do both; it's kind of waste doing both for each account every time we runclean_accounts()
.There was a problem / bug when we're too eager to clean old index entries of zero lamport accounts. So combined the above bullet point concern, I've avoided doing both and chose
if else
approach. (That's why tests failing I guess).
- ... And I failed to comment as such because that assumption was so prevalent at the time. It was obvious thing to me. (my bad).
Yeah, working on those tests! I think the ones that are failing are a mix of assertions failing and me not handling: #12606 (comment), which is breaking the issue described here: #7157 (This is the issue with old AppendVecs reviving zero-lamport accounts, which should now be addressed by calc_delete_dependencies()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see! It looks that you passed the CI fully. So happy. :)
So, the once independent cleaning mechanisms finally start to interact each other in one pass after battle-tested for half a year. That sounds cool.
f58d192
to
bd34e1e
Compare
Codecov Report
@@ Coverage Diff @@
## master #12606 +/- ##
========================================
Coverage 82.0% 82.0%
========================================
Files 359 359
Lines 83989 84123 +134
========================================
+ Hits 68875 69009 +134
Misses 15114 15114 |
@carllin not strictly, related to this pr, but I found rather worried flaky test failure: https://buildkite.com/solana-labs/solana/builds/32038#b65ccd3b-3b4d-40c6-82bc-51b685ed5948/247-1671 Could you take a look at that before merging this?
|
@carllin Sorry for late review as always... ;) Also, this time I'm afraid but I have to assign some homework to you. |
Anyway, I see this pr is great progress. I'm really looking forward to the less bloated snapshots. And slightly increased write thanks to more granular Also, please test this for rather long duration, pointing to mainnet-beta and testnet. |
@ryoqun for sure, thanks for the review! Definitely letting this run against mainnet for a while.
Addressed this here: #12658 |
bd34e1e
to
8e945cc
Compare
8e945cc
to
c431dc1
Compare
c431dc1
to
6909264
Compare
let count = self | ||
.storage | ||
.read() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@carllin Opps, Sorry to spot this at later stage of this pr. But, I think there is panic!
possibility here, because of the shrinker thread, which is running in parallel.
By moving this self.storage.read()
lock inside here from the outer loop prologue, now shrinker thread are free to update account storages while this is running. unwrap()
at L798 here bite us.
This is because shrinker effectively replaces old AccountStorageEntry with new one while removing the old AccountStorageEntry (we might be accessing this from here). remove_dead_accounts()
never cleans up any empty AccountStorageEntries unless all of them for given slot is empty. But the do_shrink_slot
do its work diligently by removing any empty StorageEntry manually after handle_reclaims
, knowing they can almost certainly can remove empty one of them at the very least.
As a matter of fact, we're not affected from other store codepath, however. Because we're mangling rooted slots. other store codepath only touches non-rooted new slots. The only two which touches the rooted slots are this cleaner and shrinker.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ryoqun, shrink and clean run serially in AccountsBackgroundService, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, you're right; Even more, the above comment is wrong: we're actually serializing the executions: https://github.com/solana-labs/solana/pull/12606/files?diff=split&w=1#diff-2099c5256db4eb5975c8834af38f6456R703
nis: So, I think this lock can go away now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with nits
Great work! Also thanks for addressing various review comments! This got a lot clearer. 👍
And, sorry for the false alarm about the shrinker<=>cleaner data race bug... ;)
@ryoqun I think that makes sense! I'll hold off on the backport and let the other PR bake a bit, thanks for the review! |
@carllin Time to backport this? it looks like testnet is doing well with #12194. Also, we're queueing #12126 and #12787 (preferably as the next patch release after this pr). Also, I rather want to spend most of time with all these prs backported on v1.4 so that we could be confident when the mainnet finally moves to the v1.4. |
@ryoqun definitely, tagged! |
Co-authored-by: Carl Lin <[email protected]> (cherry picked from commit 16d45b8)
Co-authored-by: Carl Lin <[email protected]> (cherry picked from commit 16d45b8) Co-authored-by: carllin <[email protected]>
Problem
Some slot
S
containing an update for an account with keyK
may not be cleaned up if at the time ofclean_slots
, the latest update inK
is in some rooted slotS' > S
andK
has zero lamports. This is because inclean_slots()
if account_info.lamports == 0 {
and
else if accounts_index.uncleaned_roots.contains(slot)
are mutually exclusive, and
uncleaned_roots
is purged at the end of the clean_slots() loop, so those slots will not be re-examined if the pubkey is not updated again in some future root that is added touncleaned root
Summary of Changes
Fixes #