Validator can exceed expected number of mmaps #19320

Closed
sakridge opened this issue Aug 19, 2021 · 21 comments
Labels
stale [bot only] Added to stale content; results in auto-close after a week.

Comments

@sakridge
Member

Problem

Seen on 1.6 and 1.7: the validator encounters periods where its mmap count exceeds the expected value, which is around 430k.

May take on the order of weeks of running to reproduce.
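
One way to watch this directly on a running node (a sketch; it assumes the process is named solana-validator and that /proc is available) is to count the entries in the process's maps file and compare against the kernel's per-process cap:

# count current memory mappings for the validator process
wc -l /proc/$(pgrep -f solana-validator | head -n1)/maps

# kernel ceiling on mappings per process
sysctl vm.max_map_count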

Proposed Solution

Debug the leak and fix it.

sakridge changed the title from "Validator leaks mmaps" to "Validator can exceed expected number of mmaps" on Aug 19, 2021
@sakridge
Member Author

cc @ryoqun

@ryoqun
Member

ryoqun commented Aug 19, 2021

cc @jeffwashington just in case (I think the shrinker is the problem. I'm cc-ing you because you're more up to date on AccountsDB in general recently)

@jeffwashington
Contributor

Is this (or could this be) related to the issue with shrink holding mem maps open that was fixed very late in 1.7?

@jeffwashington
Contributor

@ryoqun please put your thinking about why it is shrink here so I can leverage your experience. ;-)

@sakridge
Member Author

@jeffwashington this happens on 1.6 as well, which doesn't have those changes.

@t-nelson
Contributor

The rate seems to be much slower on 1.6 than on 1.7: 1.6 took a few weeks to become a problem, while 1.7 is pretty obvious within a couple of days.

@jeffwashington
Contributor

A metric that shows the problem:
num_snapshot_storage continues to increase.
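
Since each account storage entry is backed by one file on disk, a rough way to track this outside of the metrics pipeline (a sketch, assuming the default layout where the storage files live under <ledger>/accounts; adjust the path for your node) is to count the files directly:

# one file per AppendVec; the count should stay roughly flat on a healthy node
watch -n 60 'ls /path/to/ledger/accounts | wc -l'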

@jeffwashington
Contributor

[image: metrics chart shared from Discord]

@ryoqun
Member

ryoqun commented Aug 19, 2021

@jeffwashington thanks for taking a look. here's a brain dump:

bad (from mainnet-beta):

the number of AppendVecs continues to increase (only a node restart resets the AppendVec count back to the bare minimum)

[image: chart of AppendVec count on mainnet-beta]

good (from testnet):

the number of AppendVecs is clearly capped:

[image: chart of AppendVec count on testnet]

@ryoqun
Member

ryoqun commented Aug 19, 2021

things to try:

  • examine a snapshot
    • also compare it with another snapshot created slightly later (a rough sketch of this comparison follows the list).
  • recreate the snapshot with solana-ledger-tool and see which AppendVecs get removed.
  • run v1.7 and v1.6 against mainnet-beta side by side to look for behavioral changes
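
For the snapshot comparison, a rough sketch (assuming the archives are .tar.zst and keep their account storage files under accounts/; the file names here are placeholders):

# list the account storage files in two snapshots taken some time apart
tar -I zstd -tf snapshot-old.tar.zst | grep '^accounts/' | sort > old.txt
tar -I zstd -tf snapshot-new.tar.zst | grep '^accounts/' | sort > new.txt

# storage files present in both archives are candidates for AppendVecs that never get dropped
comm -12 old.txt new.txt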

@ryoqun
Member

ryoqun commented Aug 19, 2021

for this instance of the leak bug, restarts fortunately reduce the number of AppendVecs drastically. so there must be some code path that is doing the correct thing, and differential analysis might be the shorter path.

also, be careful not to introduce the more dangerous bank (account) hash mismatch errors..

@ryoqun
Member

ryoqun commented Aug 19, 2021

also, this is the original PR, which I wrote:

#9527 (comment)

@ryoqun
Member

ryoqun commented Aug 19, 2021

another bad one (from devnet)

so, is this only not happening on testnet?

[image: chart of AppendVec count on devnet]

@jeffwashington
Contributor

Here are 2 pubkeys that show up multiple times in a snapshot I got:

slot: 73280004, pubkey: BrDDGVH7pWmVfyLyWxnZ9o7ETNmZa6oj3G4Z8K7tpszE, lamports: 985455200
slot: 73712004, pubkey: BrDDGVH7pWmVfyLyWxnZ9o7ETNmZa6oj3G4Z8K7tpszE, lamports: 985455200
slot: 74576008, pubkey: BrDDGVH7pWmVfyLyWxnZ9o7ETNmZa6oj3G4Z8K7tpszE, lamports: 985455200
slot: 75008004, pubkey: BrDDGVH7pWmVfyLyWxnZ9o7ETNmZa6oj3G4Z8K7tpszE, lamports: 985455200


slot: 73280004, pubkey: BrD9VecHQSYw27PRsiKSuPQnP9b5AFJa3kjmUqGaayQC, lamports: 2039280
slot: 73712004, pubkey: BrD9VecHQSYw27PRsiKSuPQnP9b5AFJa3kjmUqGaayQC, lamports: 2039280
slot: 74576008, pubkey: BrD9VecHQSYw27PRsiKSuPQnP9b5AFJa3kjmUqGaayQC, lamports: 2039280
slot: 75008004, pubkey: BrD9VecHQSYw27PRsiKSuPQnP9b5AFJa3kjmUqGaayQC, lamports: 2039280
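
To surface more keys like these, one option is to scan whatever dump produced the listing above (accounts_dump.txt is a placeholder name; this assumes the same "slot: ..., pubkey: ..." line format shown here):

# pubkeys that appear in more than one storage entry, most frequent first
grep -o 'pubkey: [A-Za-z0-9]*' accounts_dump.txt | sort | uniq -c | sort -rn | awk '$1 > 1'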

@jeffwashington
Contributor

@lijunwangs and I are continuing to gather data and consider this issue.
Lijun suggests adding --accounts-shrink-optimize-total-space false to a validator to see if this has any effect on the situation. He will try this on dv3.

@sakridge
Member Author

I started 3 experiments:
3R9EwSKB2wiFi5oVmpRXtuQCu9Eae9VmPwGNaeX9iRZ4 - v1.7 vanilla
Djudu8BGnHVjgK1M7jXiLYsce1wMYfkvNQyogW5pGZTV - v1.7 with #19350
3db4R1UCuoSVHsEzPVDvpN6i6SDxDULmkaV4V5fcvh5e - v1.7 with #19350 + --accounts-shrink-optimize-total-space false

@lijunwangs
Contributor

I am doing dv3qDFk1DTF36Z62bNvrCXe9sKATA6xvVy6A798xxAS - v1.7.10 --accounts-shrink-optimize-total-space false

@sakridge
Member Author

sakridge commented Aug 22, 2021

I think this test might reproduce the leak:
#19360

RUST_LOG=warn cargo test --release test_banks_cleanup --lib > test.log 2>&1

AccountsDB only knows about ~427,000 + 2000 stores:

[2021-08-22T12:31:04.190386875Z WARN solana_runtime::bank_forks::tests] i: 4071000 stores: 427728 recycle: 1001 shrink: 0 dirty: 1072 time: 1.50 slots/s = 668.71 avg s/s: 1073.88

But the process's mappings keep growing: after 4 million slots, pmap shows 500k+ entries.

sakridge@merckx:~$ wc -l pmap.bad.maybe-fixed.2021y08m22d05h*
   443570 pmap.bad.maybe-fixed.2021y08m22d05h14m24s231604645ns
   447349 pmap.bad.maybe-fixed.2021y08m22d05h16m04s929941136ns
   452189 pmap.bad.maybe-fixed.2021y08m22d05h17m45s637509786ns
   457607 pmap.bad.maybe-fixed.2021y08m22d05h19m26s353425621ns
   463616 pmap.bad.maybe-fixed.2021y08m22d05h21m07s078650335ns
   470604 pmap.bad.maybe-fixed.2021y08m22d05h22m47s801580856ns
   478610 pmap.bad.maybe-fixed.2021y08m22d05h24m28s540717531ns
   487283 pmap.bad.maybe-fixed.2021y08m22d05h26m09s280400866ns
   496321 pmap.bad.maybe-fixed.2021y08m22d05h27m50s067322430ns
   505087 pmap.bad.maybe-fixed.2021y08m22d05h29m30s858126026ns
  4702236 total
sakridge@merckx:~$
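
For reference, pmap snapshots like the ones above can be collected with a simple loop (a sketch, not the exact script used; the file naming just mirrors the timestamps above, and the pgrep pattern should be adjusted to match the actual test process):

# dump the test process's mappings every 100 seconds; the line counts track mmap growth
pid=$(pgrep -f test_banks_cleanup | head -n1)
while true; do
  pmap "$pid" > "pmap.bad.maybe-fixed.$(date +%Yy%mm%dd%Hh%Mm%Ss%Nns)"
  sleep 100
done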

@sakridge
Member Author

I think this test might reproduce the leak:
#19360

Actually, I didn't have caching enabled, so sometimes it would create multiple stores in a slot. With caching enabled and no shrink, the number of stores is now stable.

@lijunwangs
Contributor

I have created a draft PR for the issues discussed:

#19373

@lijunwangs
Contributor

I am doing dv3qDFk1DTF36Z62bNvrCXe9sKATA6xvVy6A798xxAS - v1.7.10 --accounts-shrink-optimize-total-space false

I did see that my GCE validator with --accounts-shrink-optimize-total-space set to true had its num_snapshot_storage increase at a somewhat faster pace than the v1.7.10 validator running with --accounts-shrink-optimize-total-space false. I also saw that during the window I was testing with --accounts-shrink-optimize-total-space false, num_snapshot_storage still kept increasing.

github-actions bot added the stale label on Dec 27, 2022
github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) on Jan 4, 2023