program cache: reduce contention #1192
Conversation
Codecov Report

Attention: Patch coverage is

Additional details and impacted files

@@            Coverage Diff            @@
##           master    #1192     +/-   ##
=========================================
- Coverage    82.1%    82.1%    -0.1%
=========================================
  Files         886      886
  Lines      236372   236369       -3
=========================================
- Hits       194204   194185      -19
- Misses      42168    42184      +16
/// It is possible that multiple TX batches from different slots need different versions of a
/// program. The deployment slot of a program is only known after load though,
/// so all loads for a given program key are serialized.
loading_entries: Mutex<HashMap<Pubkey, (Slot, std::thread::ThreadId)>>,
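To make the comment above concrete, here is a minimal, hypothetical sketch of how per-key load serialization can work: the first thread to request a missing program key claims it in the map and becomes the loader, while later threads see the existing entry and wait instead of loading it redundantly. The `Pubkey` alias and the `LoadingEntries` type are stand-ins for illustration, not the actual agave implementation.

```rust
use std::collections::HashMap;
use std::sync::Mutex;
use std::thread::{self, ThreadId};

// Stand-ins for the real solana types (illustration only).
type Pubkey = [u8; 32];
type Slot = u64;

/// Sketch of cooperative loading: at most one thread loads a given
/// program key at a time; everyone else backs off and waits.
struct LoadingEntries {
    loading: Mutex<HashMap<Pubkey, (Slot, ThreadId)>>,
}

impl LoadingEntries {
    fn new() -> Self {
        Self { loading: Mutex::new(HashMap::new()) }
    }

    /// Returns true if the caller won the right to load `key`;
    /// false means another thread is already loading it.
    fn try_begin_load(&self, key: Pubkey, slot: Slot) -> bool {
        let mut map = self.loading.lock().unwrap();
        if map.contains_key(&key) {
            false
        } else {
            map.insert(key, (slot, thread::current().id()));
            true
        }
    }

    /// Called by the winning loader once the program is in the cache.
    fn finish_load(&self, key: &Pubkey) {
        self.loading.lock().unwrap().remove(key);
    }
}

fn main() {
    let entries = LoadingEntries::new();
    let key = [7u8; 32];
    assert!(entries.try_begin_load(key, 100));  // first caller becomes the loader
    assert!(!entries.try_begin_load(key, 101)); // concurrent caller must wait
    entries.finish_load(&key);
    assert!(entries.try_begin_load(key, 101));  // key is loadable again afterwards
}
```

This is why loads for one key are serialized even across slots: the deployment slot is only known after the load finishes, so the claim in the map cannot be made slot-aware up front.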
In the replacement of the index structure I am working on, I also experimented with pulling cooperative_loading_lock out of the SecondLevel. We might be able to take this even further: if we used the account's last-modified slot, we could also parallelize the loading of missing programs. Basically, the "the deployment slot of a program is only known after load" part could be improved then.

Do you plan on backporting this to v1.18? Because I also need to add a parameter to
This is patch 1 of N in my attempt to improve skip rate on one of our nodes, which struggles in replay during leader slots. I plan to try to backport some of the patches, but not this one, because of how tricky the program cache logic is. And also because we still don't have multi-threaded tests for the cache 😭. I've been running a node with this for 3 days and it's still making roots, but I definitely don't want to risk it. If your fix is ready, I don't mind merging after yours, as I imagine the conflicts on this PR will be easy to reconcile. Or I'm happy to merge first, up to you really!
Before this change we used to take the write lock to extract(). This means that even in the ideal case (all programs are already cached), the cache was contended by all batches and all operations were serialized. With this change we now take the write lock only when we store a new entry in the cache, and take the read lock to extract(). This means that in the common case where most/all programs are cached, there is no contention and all batches progress in parallel. This improves node replay perf by 20-25% on current mnb traffic.
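The locking strategy described above can be sketched with a plain `std::sync::RwLock`: lookups take the read lock, so any number of batches whose programs are already cached proceed in parallel, and the write lock is only taken to insert a newly loaded entry. This is a simplified illustration, not the agave code; `Cache`, `extract`, and `store` here are placeholder names, and the key/value types are stand-ins.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

// Simplified model of the change: read lock on the hot path,
// write lock only when storing a newly loaded entry.
struct Cache {
    entries: RwLock<HashMap<u64, String>>,
}

impl Cache {
    fn new() -> Self {
        Self { entries: RwLock::new(HashMap::new()) }
    }

    // Common case (program already cached): read lock only, so
    // concurrent batches do not contend with each other.
    fn extract(&self, key: u64) -> Option<String> {
        self.entries.read().unwrap().get(&key).cloned()
    }

    // Miss path: the write lock is taken only here, to store the
    // entry after it has been loaded outside the lock.
    fn store(&self, key: u64, value: String) {
        self.entries.write().unwrap().insert(key, value);
    }
}

fn main() {
    let cache = Cache::new();
    assert!(cache.extract(1).is_none()); // miss: caller loads, then stores
    cache.store(1, "program".to_string());
    assert_eq!(cache.extract(1).as_deref(), Some("program")); // hit under read lock
}
```

The key property is that `RwLock` readers never block each other, so in the steady state where the cache is warm, the previous all-batches-serialized behavior disappears entirely.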
Force-pushed from b688cac to fd8436b, then from fd8436b to 4ebdbcc.
As this PR is similar to mine (#1037), I confirmed that it improves unified scheduler performance favorably as well. Thanks for fixing the last write-lock contention source in the program cache. :)

I measured the perf before/after the merged commit (10e5086). As quick context, this benchmarking was conducted by replaying a short slice of mnb ledgers using the same minimized snapshot (repro steps: solana-labs#35286 (comment)).

As said above, I saw consistent performance improvement in my testing as well, although I couldn't see 20%-25%. I think this is due to being

before (at aa6c69a)
after (at 10e5086)

Similar to #1037, the unified scheduler saw some drastic improvement (5-6% faster), while the blockstore processor saw a small but consistent gain (<1% faster).
@@ -401,49 +400,51 @@ impl<FG: ForkGraph> TransactionBatchProcessor<FG> {
                &program_cache,
            ));
        }
        // Submit our last completed loading task.
        if let Some((key, program)) = program_to_store.take() {
            loaded_programs_for_txs.as_mut().unwrap().loaded_missing = true;
@alessandrod I think you accidentally deleted this line here while moving the code. It was added in #1037, which was open around the same time as this PR, so most likely a merge conflict that was not resolved properly.

When I investigated the validator node of @buffalojoec (which ran out of disk space because it was not configured to rotate log files), I discovered that the program cache was growing far beyond its capacity because it was no longer evicting.

Can you open a PR to add the line back in? I will add a metric for the number of loaded entries (complementary to the evictions metric we already have), recorded in ProgramCache::evict_using_2s_random_selection(), so that this doesn't happen again.
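To illustrate why losing that single line matters: if eviction only runs when a batch reports that it loaded a missing program, a dropped `loaded_missing = true` means the trigger never fires and the cache grows without bound. The sketch below is an assumption-labeled simplification of that failure mode — `maybe_evict` and the struct fields are hypothetical, and only the `loaded_missing` flag name comes from the diff above.

```rust
// Hypothetical model of flag-gated eviction (not the agave code).
struct ProgramCacheForTxBatch {
    loaded_missing: bool, // the flag that was lost in the bad merge
}

struct ProgramCache {
    num_entries: usize,
    capacity: usize,
    evictions: usize,
}

impl ProgramCache {
    // Assumption: eviction is only considered when a batch actually
    // loaded a missing program. If the flag is never set, this branch
    // is never taken and num_entries keeps growing past capacity.
    fn maybe_evict(&mut self, batch: &ProgramCacheForTxBatch) {
        if batch.loaded_missing && self.num_entries > self.capacity {
            self.evictions += self.num_entries - self.capacity;
            self.num_entries = self.capacity;
        }
    }
}

fn main() {
    let mut cache = ProgramCache { num_entries: 120, capacity: 100, evictions: 0 };

    // Bug reproduced: flag lost, so no eviction despite being over capacity.
    cache.maybe_evict(&ProgramCacheForTxBatch { loaded_missing: false });
    assert_eq!(cache.num_entries, 120);
    assert_eq!(cache.evictions, 0);

    // Flag set correctly: the cache shrinks back to capacity.
    cache.maybe_evict(&ProgramCacheForTxBatch { loaded_missing: true });
    assert_eq!(cache.num_entries, 100);
    assert_eq!(cache.evictions, 20);
}
```

A loaded-entries metric alongside the evictions metric, as proposed above, would make this divergence visible on a dashboard instead of surfacing as a full disk.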
Before this change we used to take the write lock to extract(). This means that even in the ideal case (all programs are already cached), the cache was contended by all batches and all operations were serialized.
With this change we now take the write lock only when we store a new entry in the cache, and take the read lock to extract(). This means that in the common case where most/all programs are cached, there is no contention and all batches progress in parallel.
This improves node replay perf by 20-25% on current mnb traffic.
Before: (profile screenshot)

After: ...can't show, because there's nothing to show :P But I can show a zoomed-in profile of replenish.

Before: (profile screenshot)

After: (profile screenshot)

You can see that locking on write() goes from taking 80% of replenish to 0.4%.