-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[unimpl-ed] Show insufficient purge_zero_lamport_account logic #8176
Conversation
|
||
assert_load_account(&accounts, current_slot, pubkey, some_lamport); | ||
assert_load_account(&accounts, current_slot, purged_pubkey1, 0); | ||
assert_load_account(&accounts, current_slot, purged_pubkey2, 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This assertion fails as follows:
---- accounts_db::tests::test_accounts_purge_chained stdout ----
thread 'accounts_db::tests::test_accounts_purge_chained' panicked at 'assertion failed: `(left == right)`
left: `(100225, 2)`,
right: `(0, 2)`', runtime/src/accounts_db.rs:1928:9
stack backtrace:
0: backtrace::backtrace::libunwind::trace
at /cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.40/src/backtrace/libunwind.rs:88
1: backtrace::backtrace::trace_unsynchronized
at /cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.40/src/backtrace/mod.rs:66
..: ...
...
13: std::panicking::begin_panic_fmt
at src/libstd/panicking.rs:334
14: solana_runtime::accounts_db::tests::assert_load_account
at runtime/src/accounts_db.rs:1928
15: solana_runtime::accounts_db::tests::test_accounts_purge_chained
at runtime/src/accounts_db.rs:2176
This should pass, while purged_pubkey2
's lamport being 0
.
accounts.store(current_slot, &[(&dummy_pubkey, &dummy_account)]); | ||
accounts.add_root(current_slot); | ||
|
||
purge_zero_lamport_accounts(&accounts, current_slot); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Commenting out this purge_zero_lamport_accounts()
invocartion makes this test pass.
accounts.add_root(current_slot); | ||
|
||
current_slot += 1; | ||
accounts.store(current_slot, &[(&purged_pubkey2, &zero_lamport_account)]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem of current purge_zero_lamport_account
is that it cannot recognize these chained/dependent/cascading relationship of zero lamport accounts.
In this database settings, no storage should be removed. But currently that function incorrectly removes a storage entry for this line's store. So, the old version (=not yet purged) of purged_pubkey2
re-appears incorrectly, causing this assertion failure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not an algorithm expert at all; however the cost of these DAG dependency aware purging logic should be costly. It think that logic at least require individual checks and/or recursive checks. And it basically will be light-weight stop-the-world GC. Also, this could be DOS-friendly.
Because I couldn't come up with a simple and encouraging logic, this PR doesn't come with actual implementation...
I think we have two paths:
- Implement that logic rigorously.
- Separate non-zero account stores and zero account stores into different ones (To cut indefinitely-long chains).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sakridge I think the path 2. sounds more reasonable. Could you share your thoughts on this bug? Also, is there any better solution in your mind? :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For #2, it isn't clear to me how that exactly simplifies the problem, because don't you still have to track whether these accounts are shielding others even if they are in different stores or not?
accounts.add_root(current_slot); | ||
|
||
current_slot += 1; | ||
accounts.store(current_slot, &[(&dummy_pubkey, &dummy_account)]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is needed because of another bug
Superseded by #8218 |
Problem
Finally, I found the root cause of #8130 ...:
(Moved from #8148's description)
Snapshot's actual content can become divert from the corresponding bank hash. Currently,
purge_zero_lamport_account
simply causes too-eager storage removals when zero lamport account updates are chained across 3+ slots (For exact situation, see a test).=> This diff shows the bank hash mismatch is caused by the presence of
4ydifDThiWuVtzV92eNGiznuQAnTZtGkb9b2XQoMGAUn
=> This shows bad snapshot doesn't contain some updates of
4ydifDThiWuVtzV92eNGiznuQAnTZtGkb9b2XQoMGAUn
, which should have occurred at slot 136455 and 136546. To be specific, 136546 removed the account itself. However, that update doesn't exist in the bad snapshot. That's why that account appears in the bank hash calculation.=> This shows bad snapshot actually doesn't contain any storage entries for such slots first of all.
New conclusive findings
We can confirm this is really what happened on the bad-behaving TdS validator:
Firstly, relevant slots for
4ydifDThiWuVtzV92eNGiznuQAnTZtGkb9b2XQoMGAUn
account updates are 136546, 136455 and 125761. 136455 is innocent slot which contains no zero lamport accounts and can be reclaimed normally. 136546 is the slot which contains the should-be-kept zero lamport account update (See above). 125761 is the slot which contained the reappeared old and bad account state. So we can say 136546 is chaining to 125761 via a zero lamport account update.And there is another zero lamport account update in 125761:
So, 125761 is definitely chaining to an older storage entry. This satisfies this bug's condition.
Remaining question is why this isn't deterministic and not cluster-wide. However, #8148 can partly explain this because that causes some not-deterministic behavior in validators.
Summary of Changes
Only add a test showing the bad behavior because I need some confirmation for the further direction.
Fixes #8130