[unimpl-ed] Show insufficient purge_zero_lamport_account logic #8176

ryoqun · 2020-02-08T17:42:25Z

Problem

Finally, I found the root cause of #8130 ...:

(Moved from #8148's description)

Snapshot's actual content can become divert from the corresponding bank hash. Currently, purge_zero_lamport_account simply causes too-eager storage removals when zero lamport account updates are chained across 3+ slots (For exact situation, see a test).

--- /good-snapshot-pubkeys  2020-02-06 18:57:12.813230728 +0000
+++ /bad-snapshot-pubkeys  2020-02-06 18:57:12.813230728 +0000
@@ -179,6 +179,7 @@
 key: 4vgoKb76Z2vj9V9z7hoQpZkkwJrkL1z35LWNd9EXSi2o
 key: 4x5cmzjq7MBZG4KZjRQ2hepsWG1X9PvnJs1C2Q68PQmQ
 key: 4xARLQv1SoiVNbmXmVYyQzEjG3YA67X1LfH9KkTLtzKd
+key: 4ydifDThiWuVtzV92eNGiznuQAnTZtGkb9b2XQoMGAUn
 key: 4zAe2W9b3GmQKG2ZbinimDB7VgaNvfNoQDKdNKDiHuBU
 key: 4zoHsMTdhuDAkHAHNwEZqNqEmFyyqRxUxHwGuSdeepyB
 key: 52EW9njXGZzFduTLCZWfQNjPb8QMXgLZRWxEFCN3xaAx

=> This diff shows the bank hash mismatch is caused by the presence of 4ydifDThiWuVtzV92eNGiznuQAnTZtGkb9b2XQoMGAUn

$ less good3.log | grep 4ydifDThiWuVtzV92eNGiznuQAnTZtGkb9b2XQoMGAUn 
[2020-02-06T06:14:58.602035261Z DEBUG solana_runtime::accounts_db] ryoqun: slot_id: 0, pubkey: (StoredMeta { write_version: 176, pubkey: 4ydifDThiWuVtzV92eNGiznuQAnTZtGkb9b2XQoMGAUn, data_len: 0 }, AccountInfo { store_id: 0, offset: 18168, lamports: 500000890880 })
[2020-02-06T06:15:02.010395039Z DEBUG solana_runtime::accounts_db] ryoqun: slot_id: 112211, pubkey: (StoredMeta { write_version: 8079864, pubkey: 4ydifDThiWuVtzV92eNGiznuQAnTZtGkb9b2XQoMGAUn, data_len: 0 }, AccountInfo { store_id: 432426, offset: 0, lamports: 499973176160 })
[2020-02-06T06:15:03.103655633Z DEBUG solana_runtime::accounts_db] ryoqun: slot_id: 125761, pubkey: (StoredMeta { write_version: 9075320, pubkey: 4ydifDThiWuVtzV92eNGiznuQAnTZtGkb9b2XQoMGAUn, data_len: 0 }, AccountInfo { store_id: 484959, offset: 408, lamports: 499973176147 })
[2020-02-06T06:15:03.411181688Z DEBUG solana_runtime::accounts_db] ryoqun: slot_id: 136455, pubkey: (StoredMeta { write_version: 9867711, pubkey: 4ydifDThiWuVtzV92eNGiznuQAnTZtGkb9b2XQoMGAUn, data_len: 0 }, AccountInfo { store_id: 526476, offset: 22152, lamports: 973176143 })
[2020-02-06T06:15:03.439180800Z DEBUG solana_runtime::accounts_db] ryoqun: slot_id: 136546, pubkey: (StoredMeta { write_version: 9874694, pubkey: 4ydifDThiWuVtzV92eNGiznuQAnTZtGkb9b2XQoMGAUn, data_len: 0 }, AccountInfo { store_id: 526846, offset: 22152, lamports: 0 })
$ less bad3.log | grep 4ydifDThiWuVtzV92eNGiznuQAnTZtGkb9b2XQoMGAUn 
[2020-02-06T06:10:31.668645696Z DEBUG solana_runtime::accounts_db] ryoqun: slot_id: 0, pubkey: (StoredMeta { write_version: 176, pubkey: 4ydifDThiWuVtzV92eNGiznuQAnTZtGkb9b2XQoMGAUn, data_len: 0 }, AccountInfo { store_id: 2, offset: 8144, lamports: 500000890880 })
[2020-02-06T06:10:34.653084054Z DEBUG solana_runtime::accounts_db] ryoqun: slot_id: 112211, pubkey: (StoredMeta { write_version: 8076459, pubkey: 4ydifDThiWuVtzV92eNGiznuQAnTZtGkb9b2XQoMGAUn, data_len: 0 }, AccountInfo { store_id: 427648, offset: 29168, lamports: 499973176160 })
[2020-02-06T06:10:35.794195861Z DEBUG solana_runtime::accounts_db] ryoqun: slot_id: 125761, pubkey: (StoredMeta { write_version: 9071781, pubkey: 4ydifDThiWuVtzV92eNGiznuQAnTZtGkb9b2XQoMGAUn, data_len: 0 }, AccountInfo { store_id: 479476, offset: 136, lamports: 499973176147 })
[2020-02-06T06:10:39.134235318Z DEBUG solana_runtime::accounts_db] xoring.. account: Account { lamports: 499973176147 data.len: 0 owner: 11111111111111111111111111111111 executable: false rent_epoch: 12 hash: 8Mvx4iidAhxWNAhuUU2vFt5RLGdXdJdaRCFM1By7nPPM } slot: 125761, BankHash 6d5e6ce8b914d18c0e775feac5f20c08b60b0940b12d0867d8c4ebe8cebd4310 key: 4ydifDThiWuVtzV92eNGiznuQAnTZtGkb9b2XQoMGAUn

=> This shows bad snapshot doesn't contain some updates of 4ydifDThiWuVtzV92eNGiznuQAnTZtGkb9b2XQoMGAUn, which should have occurred at slot 136455 and 136546. To be specific, 136546 removed the account itself. However, that update doesn't exist in the bad snapshot. That's why that account appears in the bank hash calculation.

$ grep -E '125761|136455|136546' bad-files good-files 
bad-files:-rw-r--r-- solanad/solanad 4194304 2020-02-05 07:01 accounts/125761.479477
bad-files:-rw-r--r-- solanad/solanad 4194304 2020-02-05 07:01 accounts/125761.479479
bad-files:-rw-r--r-- solanad/solanad 4194304 2020-02-05 07:01 accounts/125761.479475
bad-files:-rw-r--r-- solanad/solanad 4194304 2020-02-05 07:01 accounts/125761.479476
bad-files:-rw-r--r-- solanad/solanad 4194304 2020-02-05 07:01 accounts/125761.479478
good-files:-rw-r--r-- mvines/mvines 4194304 2020-02-06 01:16 accounts/136546.526849
good-files:-rw-r--r-- mvines/mvines 4194304 2020-02-06 01:18 accounts/136455.526480
good-files:-rw-r--r-- mvines/mvines 4194304 2020-02-06 01:19 accounts/125761.484960
good-files:-rw-r--r-- mvines/mvines 4194304 2020-02-06 01:16 accounts/136546.526848
good-files:-rw-r--r-- mvines/mvines 4194304 2020-02-06 01:18 accounts/136455.526479
good-files:-rw-r--r-- mvines/mvines 4194304 2020-02-06 01:16 accounts/136546.526846
good-files:-rw-r--r-- mvines/mvines 4194304 2020-02-06 01:19 accounts/125761.484959
good-files:-rw-r--r-- mvines/mvines 4194304 2020-02-06 01:18 accounts/136455.526478
good-files:-rw-r--r-- mvines/mvines 4194304 2020-02-06 01:19 accounts/125761.484958
good-files:-rw-r--r-- mvines/mvines 4194304 2020-02-06 01:16 accounts/136546.526851
good-files:-rw-r--r-- mvines/mvines 4194304 2020-02-06 01:16 accounts/136546.526850
good-files:-rw-r--r-- mvines/mvines 4194304 2020-02-06 01:19 accounts/125761.484962
good-files:-rw-r--r-- mvines/mvines 4194304 2020-02-06 01:18 accounts/136455.526476
good-files:-rw-r--r-- mvines/mvines 4194304 2020-02-06 01:18 accounts/136455.526477
good-files:-rw-r--r-- mvines/mvines 4194304 2020-02-06 01:19 accounts/125761.484961
good-files:-rw-r--r-- mvines/mvines 4194304 2020-02-06 01:16 accounts/136546.526847

=> This shows bad snapshot actually doesn't contain any storage entries for such slots first of all.

New conclusive findings

We can confirm this is really what happened on the bad-behaving TdS validator:

Firstly, relevant slots for 4ydifDThiWuVtzV92eNGiznuQAnTZtGkb9b2XQoMGAUn account updates are 136546, 136455 and 125761. 136455 is innocent slot which contains no zero lamport accounts and can be reclaimed normally. 136546 is the slot which contains the should-be-kept zero lamport account update (See above). 125761 is the slot which contained the reappeared old and bad account state. So we can say 136546 is chaining to 125761 via a zero lamport account update.

And there is another zero lamport account update in 125761:

[2020-02-06T06:15:03.103895475Z DEBUG solana_runtime::accounts_db] ryoqun: slot_id: 125761, pubkey: (StoredMeta { write_version: 9075321, pubkey: Ag51oCkwGy5rkbGEYrcP9GDjvFGMJrEdLxvedLSTSR15, data_len: 0 }, AccountInfo { store_id: 484959, offset: 544, lamports: 0 })

So, 125761 is definitely chaining to an older storage entry. This satisfies this bug's condition.

Remaining question is why this isn't deterministic and not cluster-wide. However, #8148 can partly explain this because that causes some not-deterministic behavior in validators.

Summary of Changes

Only add a test showing the bad behavior because I need some confirmation for the further direction.

Fixes #8130

ryoqun · 2020-02-08T17:48:45Z

runtime/src/accounts_db.rs

+
+        assert_load_account(&accounts, current_slot, pubkey, some_lamport);
+        assert_load_account(&accounts, current_slot, purged_pubkey1, 0);
+        assert_load_account(&accounts, current_slot, purged_pubkey2, 0);


This assertion fails as follows:

---- accounts_db::tests::test_accounts_purge_chained stdout ---- thread 'accounts_db::tests::test_accounts_purge_chained' panicked at 'assertion failed: `(left == right)` left: `(100225, 2)`, right: `(0, 2)`', runtime/src/accounts_db.rs:1928:9 stack backtrace: 0: backtrace::backtrace::libunwind::trace at /cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.40/src/backtrace/libunwind.rs:88 1: backtrace::backtrace::trace_unsynchronized at /cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.40/src/backtrace/mod.rs:66 ..: ... ... 13: std::panicking::begin_panic_fmt at src/libstd/panicking.rs:334 14: solana_runtime::accounts_db::tests::assert_load_account at runtime/src/accounts_db.rs:1928 15: solana_runtime::accounts_db::tests::test_accounts_purge_chained at runtime/src/accounts_db.rs:2176

This should pass, while purged_pubkey2's lamport being 0.

ryoqun · 2020-02-08T17:49:32Z

runtime/src/accounts_db.rs

+        accounts.store(current_slot, &[(&dummy_pubkey, &dummy_account)]);
+        accounts.add_root(current_slot);
+
+        purge_zero_lamport_accounts(&accounts, current_slot);


Commenting out this purge_zero_lamport_accounts() invocartion makes this test pass.

ryoqun · 2020-02-08T17:56:23Z

runtime/src/accounts_db.rs

+        accounts.add_root(current_slot);
+
+        current_slot += 1;
+        accounts.store(current_slot, &[(&purged_pubkey2, &zero_lamport_account)]);


The problem of current purge_zero_lamport_account is that it cannot recognize these chained/dependent/cascading relationship of zero lamport accounts.

In this database settings, no storage should be removed. But currently that function incorrectly removes a storage entry for this line's store. So, the old version (=not yet purged) of purged_pubkey2 re-appears incorrectly, causing this assertion failure.

I'm not an algorithm expert at all; however the cost of these DAG dependency aware purging logic should be costly. It think that logic at least require individual checks and/or recursive checks. And it basically will be light-weight stop-the-world GC. Also, this could be DOS-friendly.

Because I couldn't come up with a simple and encouraging logic, this PR doesn't come with actual implementation...

I think we have two paths:

Implement that logic rigorously.

Separate non-zero account stores and zero account stores into different ones (To cut indefinitely-long chains).

@sakridge I think the path 2. sounds more reasonable. Could you share your thoughts on this bug? Also, is there any better solution in your mind? :)

For #2, it isn't clear to me how that exactly simplifies the problem, because don't you still have to track whether these accounts are shielding others even if they are in different stores or not?

ryoqun · 2020-02-08T17:59:06Z

runtime/src/accounts_db.rs

+        accounts.add_root(current_slot);
+
+        current_slot += 1;
+        accounts.store(current_slot, &[(&dummy_pubkey, &dummy_account)]);


This is needed because of another bug

ryoqun · 2020-02-12T22:53:44Z

Superseded by #8218

Show insufficient purge_zero_lamport_account logic

5485fd6

ryoqun added the work in progress This isn't quite right yet label Feb 8, 2020

ryoqun commented Feb 8, 2020

View reviewed changes

ryoqun requested a review from sakridge February 8, 2020 18:13

This was referenced Feb 10, 2020

Remove needless last_root for better reclaims #8148

Merged

'Snapshot bank failed to verify', ledger/src/snapshot_utils.rs:462:9 #8130

Closed

Fix accounts_db store counts in purging accounts logic #8218

Merged

ryoqun closed this Feb 12, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[unimpl-ed] Show insufficient purge_zero_lamport_account logic #8176

[unimpl-ed] Show insufficient purge_zero_lamport_account logic #8176

ryoqun commented Feb 8, 2020 •

edited

Loading

ryoqun Feb 8, 2020

ryoqun Feb 8, 2020

ryoqun Feb 8, 2020 •

edited

Loading

ryoqun Feb 8, 2020 •

edited

Loading

ryoqun Feb 11, 2020

sakridge Feb 11, 2020

ryoqun Feb 8, 2020

ryoqun commented Feb 12, 2020

[unimpl-ed] Show insufficient purge_zero_lamport_account logic #8176

[unimpl-ed] Show insufficient purge_zero_lamport_account logic #8176

Conversation

ryoqun commented Feb 8, 2020 • edited Loading

Problem

New conclusive findings

Summary of Changes

ryoqun Feb 8, 2020

Choose a reason for hiding this comment

ryoqun Feb 8, 2020

Choose a reason for hiding this comment

ryoqun Feb 8, 2020 • edited Loading

Choose a reason for hiding this comment

ryoqun Feb 8, 2020 • edited Loading

Choose a reason for hiding this comment

ryoqun Feb 11, 2020

Choose a reason for hiding this comment

sakridge Feb 11, 2020

Choose a reason for hiding this comment

ryoqun Feb 8, 2020

Choose a reason for hiding this comment

ryoqun commented Feb 12, 2020

ryoqun commented Feb 8, 2020 •

edited

Loading

ryoqun Feb 8, 2020 •

edited

Loading

ryoqun Feb 8, 2020 •

edited

Loading