This repository has been archived by the owner on Jan 22, 2025. It is now read-only.

solana-ledger-tool verify OOM's with master on mnb #26895

Closed
steviez opened this issue Aug 2, 2022 · 6 comments · Fixed by #26914

@steviez
Contributor

steviez commented Aug 2, 2022

Problem

Running solana-ledger-tool verify with the tip of master OOMs partway through processing. For example, with a snapshot from epoch 333, I see an OOM after processing < 250 slots. I believe this has been an issue for a while, as jwash also reported hitting some OOMs with master roughly 1 month back:
https://discord.com/channels/428295358100013066/439194979856809985/992161854686240778

mvines had a potential fix with #26349 that enabled AccountsBackgroundService when running verify; however, I still encountered the OOM with this.

Some things that have already been verified:

  • roots are being made (BankForks::set_root() is getting called)
  • banks are getting dropped (Bank::Drop() is getting called)

Proposed Solution

Debug and fix; there are several ways to approach this:

  • Idea 1: Profile (heaptrack) and look for any obvious offender
  • Idea 2: git bisect (1.10 does not show the issue)
    • The common base between v1.10.32 and tip of master is 1535748, which leaves roughly 1700 commits (assuming the base is good and something that changed on master caused the OOM) ... not great but doable
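
To put "not great but doable" in numbers: a bisect halves the candidate range at each step, so the worst case is about log2(N) build-and-test rounds. A minimal sketch of that arithmetic, assuming the ~1700-commit estimate above (the helper name `bisect_steps` is made up for illustration):

```python
def bisect_steps(num_commits: int) -> int:
    """Worst-case number of halving steps needed to isolate one commit.

    Equivalent to ceil(log2(num_commits)), computed with integer math
    to avoid floating-point rounding.
    """
    if num_commits < 1:
        raise ValueError("need at least one candidate commit")
    return (num_commits - 1).bit_length()


# ~1700 commits between the common base and tip of master (estimate from above)
steps = bisect_steps(1700)
print(f"about {steps} verify runs in the worst case")
```

Each of those rounds is a full build plus a solana-ledger-tool verify run, so the real cost is dominated by how long one run takes to either OOM or get far enough to be called good.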

Testing Results

The following are for attempting to process roughly 10k slots from a recent mainnet beta ledger.

9cdd84bb5e66d9db01ef9aa5c3e0277731c3f00e ==> No OOM // v1.10.32 tag
224550d65fca059a46f6293d41cee7055986a638 ==> OOM    // jwash recent improvement; also OOM'd with mvines PR on top
@steviez steviez self-assigned this Aug 2, 2022
@steviez
Contributor Author

steviez commented Aug 2, 2022

Luckily I pinged @jeffwashington and he had just put in a fix that makes things better: 224550d

  • Without this change, I was sometimes seeing as few as ~200 slots processed before OOM
  • With this change, my single run made it through ~5500 processed slots before OOM

A definite improvement, but I think there is still something worth investigating here

@jeffwashington
Contributor

maybe related to 83e0412
specifically removal of exhaustively_free_unused_resource

@steviez
Contributor Author

steviez commented Aug 4, 2022

Unfortunately, git bisect doesn't have a good path. There were some fairly trivial patches to get compatibility with the new snapshot / ledger on older commits; however, processing sputters out because transactions are not handled the same way in older versions. I would essentially have to bisect to find the commits that would create compatibility ... we could do this if we get desperate, but I'm going to pursue other angles at the moment.

For completeness, in case we come back to this, here were the steps to make the snapshot / ledger compatible:

  • If a commit is unable to open snapshot, cherry-pick 8caced6
  • If a commit is unable to open rocksdb, use ldb to drop optimistic_slots column

@jeffwashington
Contributor

jeffwashington commented Aug 4, 2022 via email

@jeffwashington
Contributor

jeffwashington commented Aug 4, 2022 via email

@steviez
Contributor Author

steviez commented Aug 4, 2022

specifically removal of exhaustively_free_unused_resource

In support of this, I scraped incidence rates, from solana-ledger-tool verify runs, of the datapoint metrics emitted by the gc functions that were previously called by exhaustively_free_unused_resource():

v1.10.32
  accounts_db-flush_accounts_cache ==> 183 instances
  clean_accounts ==> 215 instances
  shrink_stats ==> 278 instances
tip-of-master
  accounts_db-flush_accounts_cache ==> 0 instances
  clean_accounts ==> 0 instances
  shrink_stats ==> 0 instances
tip-of-master + mvines make ABS run in solana-ledger-tool verify
  accounts_db-flush_accounts_cache ==> 0 instances
  clean_accounts ==> 1 instances
  shrink_stats ==> 2 instances

So, these cleanup functions are not getting called at all on tip of master, and are barely getting called even with the mvines PR.
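
For anyone wanting to repeat the comparison, counts like the ones above could be gathered with a small script along these lines. This is only a sketch: it assumes each metric shows up in the log as a line containing `datapoint: <name>` followed by its fields, and the sample lines and field names below are invented for illustration, not copied from a real run.

```python
import re

# Assumption: metrics are logged as "... datapoint: <metric_name> <fields>".
DATAPOINT_RE = re.compile(r"datapoint: (\S+)")

# The three gc-related metrics compared in the table above.
GC_METRICS = (
    "accounts_db-flush_accounts_cache",
    "clean_accounts",
    "shrink_stats",
)


def count_datapoints(lines, metrics=GC_METRICS):
    """Return {metric_name: number of log lines reporting that datapoint}."""
    counts = {name: 0 for name in metrics}
    for line in lines:
        match = DATAPOINT_RE.search(line)
        if match and match.group(1) in counts:
            counts[match.group(1)] += 1
    return counts


# Hypothetical log excerpt; a real run would read the tool's log file instead.
sample_log = [
    "[INFO metrics] datapoint: clean_accounts example_field=1i",
    "[INFO metrics] datapoint: shrink_stats example_field=2i",
    "[INFO metrics] datapoint: shrink_stats example_field=3i",
    "[INFO metrics] datapoint: some_other_metric example_field=4i",
    "[INFO replay] processing slot (not a datapoint line)",
]

for name, count in count_datapoints(sample_log).items():
    print(f"{name} ==> {count} instances")
```

Metrics that never appear still get reported with a count of 0, which is the interesting case here.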
