Validator can exceed expected number of mmaps #19320

Closed
sakridge opened this issue Aug 19, 2021 · 21 comments
Labels
stale [bot only] Added to stale content; results in auto-close after a week.

Comments

@sakridge
Member

Problem

Seen on 1.6 and 1.7: the validator encounters periods where its mmap count exceeds the expected value, which is around 430k.

May take on the order of weeks of running to reproduce.
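
One way to watch this directly on a running node (a sketch; it assumes the process is named solana-validator and that /proc is available) is to count the entries in the process's maps file and compare against the kernel's per-process cap:

# count current memory mappings for the validator process
wc -l /proc/$(pgrep -f solana-validator | head -n1)/maps

# kernel ceiling on mappings per process
sysctl vm.max_map_count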

Proposed Solution

Debug the leak and fix it.

sakridge changed the title from "Validator leaks mmaps" to "Validator can exceed expected number of mmaps" on Aug 19, 2021
@sakridge
Member Author

cc @ryoqun

@ryoqun
Member

ryoqun commented Aug 19, 2021

cc @jeffwashington just in case (I think the shrinker is the problem. I'm cc-ing you because you're more up to date on AccountsDB in general recently)

@jeffwashington
Contributor

Is this (or could this be) related to the issue with shrink holding mem maps open that was fixed very late in 1.7?

@jeffwashington
Contributor

@ryoqun please put your thinking about why it is shrink here so I can leverage your experience. ;-)

@sakridge
Member Author

@jeffwashington this happens on 1.6 as well, which doesn't have those changes.

@t-nelson
Contributor

The rate seems to be much slower on 1.6 than on 1.7: 1.6 took a few weeks to become a problem, while 1.7 is pretty obvious within a couple of days.

@jeffwashington
Contributor

A metric that shows the problem:
num_snapshot_storage continues to increase.
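
Since each account storage entry is backed by one file on disk, a rough way to track this outside of the metrics pipeline (a sketch, assuming the default layout where the storage files live under <ledger>/accounts; adjust the path for your node) is to count the files directly:

# one file per AppendVec; the count should stay roughly flat on a healthy node
watch -n 60 'ls /path/to/ledger/accounts | wc -l'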

@jeffwashington
Contributor

[image: metrics chart shared from Discord]

@ryoqun
Member

ryoqun commented Aug 19, 2021

@jeffwashington thanks for taking a look. here's a brain dump:

bad (from mainnet-beta):

the number of AppendVecs continues to increase (only a node restart resets the AppendVec count back to the bare minimum)

[image: chart of AppendVec count on mainnet-beta]

good (from testnet):

the number of AppendVecs is clearly capped:

[image: chart of AppendVec count on testnet]

@ryoqun
Member

ryoqun commented Aug 19, 2021

things to try:

  • examine a snapshot
    • also compare it with another snapshot created slightly later (a rough sketch of this comparison follows the list).
  • recreate the snapshot with solana-ledger-tool and see which AppendVecs get removed.
  • run v1.7 and v1.6 against mainnet-beta side by side to look for behavioral changes
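
For the snapshot comparison, a rough sketch (assuming the archives are .tar.zst and keep their account storage files under accounts/; the file names here are placeholders):

# list the account storage files in two snapshots taken some time apart
tar -I zstd -tf snapshot-old.tar.zst | grep '^accounts/' | sort > old.txt
tar -I zstd -tf snapshot-new.tar.zst | grep '^accounts/' | sort > new.txt

# storage files present in both archives are candidates for AppendVecs that never get dropped
comm -12 old.txt new.txt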

@ryoqun
Member

ryoqun commented Aug 19, 2021

for this instance of the leak bug, restarts fortunately reduce the number of AppendVecs drastically. so there must be some code path that is doing the correct thing, and differential analysis might be the shorter path.

also, be careful not to introduce the more dangerous bank (account) hash mismatch errors..

@ryoqun
Member

ryoqun commented Aug 19, 2021

also, this is the original PR, which I wrote:

#9527 (comment)

@ryoqun
Member

ryoqun commented Aug 19, 2021

another bad one (from devnet)

so, is this only not happening on testnet?

[image: chart of AppendVec count on devnet]

@jeffwashington
Contributor

Here are 2 pubkeys that show up multiple times in a snapshot I got:

slot: 73280004, pubkey: BrDDGVH7pWmVfyLyWxnZ9o7ETNmZa6oj3G4Z8K7tpszE, lamports: 985455200
slot: 73712004, pubkey: BrDDGVH7pWmVfyLyWxnZ9o7ETNmZa6oj3G4Z8K7tpszE, lamports: 985455200
slot: 74576008, pubkey: BrDDGVH7pWmVfyLyWxnZ9o7ETNmZa6oj3G4Z8K7tpszE, lamports: 985455200
slot: 75008004, pubkey: BrDDGVH7pWmVfyLyWxnZ9o7ETNmZa6oj3G4Z8K7tpszE, lamports: 985455200


slot: 73280004, pubkey: BrD9VecHQSYw27PRsiKSuPQnP9b5AFJa3kjmUqGaayQC, lamports: 2039280
slot: 73712004, pubkey: BrD9VecHQSYw27PRsiKSuPQnP9b5AFJa3kjmUqGaayQC, lamports: 2039280
slot: 74576008, pubkey: BrD9VecHQSYw27PRsiKSuPQnP9b5AFJa3kjmUqGaayQC, lamports: 2039280
slot: 75008004, pubkey: BrD9VecHQSYw27PRsiKSuPQnP9b5AFJa3kjmUqGaayQC, lamports: 2039280
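
To surface more keys like these, one option is to scan whatever dump produced the listing above (accounts_dump.txt is a placeholder name; this assumes the same "slot: ..., pubkey: ..." line format shown here):

# pubkeys that appear in more than one storage entry, most frequent first
grep -o 'pubkey: [A-Za-z0-9]*' accounts_dump.txt | sort | uniq -c | sort -rn | awk '$1 > 1'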

@jeffwashington
Contributor

@lijunwangs and I are continuing to gather data and consider this issue.
Lijun suggests adding --accounts-shrink-optimize-total-space false to a validator to see if this has any effect on the situation. He will try this on dv3.

@sakridge
Member Author

I started 3 experiments:
3R9EwSKB2wiFi5oVmpRXtuQCu9Eae9VmPwGNaeX9iRZ4 - v1.7 vanilla
Djudu8BGnHVjgK1M7jXiLYsce1wMYfkvNQyogW5pGZTV - v1.7 with #19350
3db4R1UCuoSVHsEzPVDvpN6i6SDxDULmkaV4V5fcvh5e - v1.7 with #19350 + --accounts-shrink-optimize-total-space false

@lijunwangs
Contributor

I am doing dv3qDFk1DTF36Z62bNvrCXe9sKATA6xvVy6A798xxAS - v1.7.10 --accounts-shrink-optimize-total-space false

@sakridge
Member Author

sakridge commented Aug 22, 2021

I think this test might reproduce the leak:
#19360

RUST_LOG=warn cargo test --release test_banks_cleanup --lib > test.log 2>&1

AccountsDB only knows about ~427,000 + 2000 stores:

[2021-08-22T12:31:04.190386875Z WARN solana_runtime::bank_forks::tests] i: 4071000 stores: 427728 recycle: 1001 shrink: 0 dirty: 1072 time: 1.50 slots/s = 668.71 avg s/s: 1073.88

But the process's mappings keep growing: after 4 million slots, pmap shows 500k+ entries.

sakridge@merckx:~$ wc -l pmap.bad.maybe-fixed.2021y08m22d05h*
   443570 pmap.bad.maybe-fixed.2021y08m22d05h14m24s231604645ns
   447349 pmap.bad.maybe-fixed.2021y08m22d05h16m04s929941136ns
   452189 pmap.bad.maybe-fixed.2021y08m22d05h17m45s637509786ns
   457607 pmap.bad.maybe-fixed.2021y08m22d05h19m26s353425621ns
   463616 pmap.bad.maybe-fixed.2021y08m22d05h21m07s078650335ns
   470604 pmap.bad.maybe-fixed.2021y08m22d05h22m47s801580856ns
   478610 pmap.bad.maybe-fixed.2021y08m22d05h24m28s540717531ns
   487283 pmap.bad.maybe-fixed.2021y08m22d05h26m09s280400866ns
   496321 pmap.bad.maybe-fixed.2021y08m22d05h27m50s067322430ns
   505087 pmap.bad.maybe-fixed.2021y08m22d05h29m30s858126026ns
  4702236 total
sakridge@merckx:~$
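
For reference, pmap snapshots like the ones above can be collected with a simple loop (a sketch, not the exact script used; the file naming just mirrors the timestamps above, and the pgrep pattern should be adjusted to match the actual test process):

# dump the test process's mappings every 100 seconds; the line counts track mmap growth
pid=$(pgrep -f test_banks_cleanup | head -n1)
while true; do
  pmap "$pid" > "pmap.bad.maybe-fixed.$(date +%Yy%mm%dd%Hh%Mm%Ss%Nns)"
  sleep 100
done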

@sakridge
Member Author

I think this test might reproduce the leak:
#19360

Actually, I didn't have caching enabled, so sometimes it would create multiple stores in a slot. With caching enabled and no shrink, the number of stores is now stable.

@lijunwangs
Contributor

I have created a draft PR for the issues discussed:

#19373

@lijunwangs
Contributor

I am doing dv3qDFk1DTF36Z62bNvrCXe9sKATA6xvVy6A798xxAS - v1.7.10 --accounts-shrink-optimize-total-space false

I did see that my GCE validator with --accounts-shrink-optimize-total-space set to true had its num_snapshot_storage increase at a somewhat faster pace than the v1.7.10 validator running with --accounts-shrink-optimize-total-space false. I also saw that during the window I was testing with --accounts-shrink-optimize-total-space false, num_snapshot_storage still kept increasing.

github-actions bot added the stale label on Dec 27, 2022
github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) on Jan 4, 2023