wip: pull IndexList into crate, get_mut -> get #32423
Conversation
runtime/src/index_list.rs (outdated):

```rust
/// Move the element at the index to the end.
/// The index remains the same.
pub(crate) fn move_to_last(&mut self, index: Index) {
```
new fn
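For context, a minimal standalone sketch of how such a method can work on a vec-backed doubly linked list, where an element keeps its slot (and hence its `Index`) forever. All names here (`List`, `Node`, `head`, `tail`) are illustrative, not IndexList's actual internals:

```rust
// Hypothetical vec-backed doubly linked list; the field and type names
// are illustrative stand-ins, not IndexList's real implementation.
#[derive(Clone, Copy, PartialEq)]
pub(crate) struct Index(usize);

struct Node<T> {
    data: T,
    prev: Option<Index>,
    next: Option<Index>,
}

pub(crate) struct List<T> {
    nodes: Vec<Node<T>>,
    head: Option<Index>,
    tail: Option<Index>,
}

impl<T> List<T> {
    /// Move the element at `index` to the end; the index itself is unchanged.
    pub(crate) fn move_to_last(&mut self, index: Index) {
        if self.tail == Some(index) {
            return; // already last
        }
        // Detach the node from its current neighbors.
        let (prev, next) = (self.nodes[index.0].prev, self.nodes[index.0].next);
        match prev {
            Some(p) => self.nodes[p.0].next = next,
            None => self.head = next, // the node was the head
        }
        if let Some(n) = next {
            self.nodes[n.0].prev = prev;
        }
        // Re-attach it after the old tail; `index` stays valid as-is.
        let old_tail = self.tail;
        if let Some(t) = old_tail {
            self.nodes[t.0].next = Some(index);
        }
        self.nodes[index.0].prev = old_tail;
        self.nodes[index.0].next = None;
        self.tail = Some(index);
    }
}
```

Because the element never moves within `nodes`, callers holding an `Index` never need it rewritten, which is the property the cache needs.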
Codecov Report

```
@@            Coverage Diff            @@
##           master    #32423    +/-   ##
=========================================
  Coverage    82.0%     82.0%
=========================================
  Files         785       786      +1
  Lines      211736    211939    +203
=========================================
+ Hits       173641    173854    +213
+ Misses      38095     38085     -10
```
Looking at the code change, we still need an exclusive lock on the entry in order to update `entry.index`. Maybe the issue is that the write lock is then held while cloning the account; what about downgrading it first:

```diff
diff --git a/runtime/src/read_only_accounts_cache.rs b/runtime/src/read_only_accounts_cache.rs
index 9956af64bb..cbbb1d901a 100644
--- a/runtime/src/read_only_accounts_cache.rs
+++ b/runtime/src/read_only_accounts_cache.rs
@@ -88,6 +88,7 @@ impl ReadOnlyAccountsCache {
             queue.remove(entry.index);
             entry.index = queue.insert_last(key);
         }
+        let entry = entry.downgrade();
         Some(entry.account.clone())
     }
```
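For reference, a minimal standalone sketch of this downgrade pattern, assuming dashmap's `RefMut::downgrade` (which converts a write guard into a read guard); the key/value types are placeholders:

```rust
use dashmap::DashMap;

fn main() {
    let map: DashMap<u32, String> = DashMap::new();
    map.insert(1, "account".to_string());

    if let Some(mut entry) = map.get_mut(&1) {
        // Do the mutation that actually needs the write lock...
        entry.push_str("-touched");
        // ...then downgrade so the remaining work (e.g. cloning the
        // account data) runs under a read guard instead of a write guard.
        let entry = entry.downgrade();
        let _cloned = entry.value().clone();
    }
}
```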
The odd thing is that … Some possibilities are that: …
Branch: it cycles through 5 'selector' values of how to do the read cache `load()`. Measurement of …, viewing the sum metrics per account. You can try this dashboard.
@behzadnouri here are the top loads on the read only cache in 1 s.
Another adjustment here is that we could never update the LRU for the vote program, and never throw it out of the read only cache. Except, it is rewritten once per epoch, and the entries in the read only cache are (Pubkey, Slot) tuples...
@jeffwashington #32518 is an alternative which uses atomics for the indices in the cache entries. It has an equivalent impact. Can you also test #32518 in your setup and compare?
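Roughly, the atomics idea lets the index be updated through a shared reference, i.e. under a `DashMap` read lock. A sketch with hypothetical names (#32518's actual layout may differ):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical cache entry; the LRU index is atomic so it can be
// written back without a &mut (i.e. without a DashMap write lock).
struct Entry {
    account: Vec<u8>, // stand-in for AccountSharedData
    index: AtomicU64, // position in the LRU queue
}

fn update_lru(entry: &Entry, new_index: u64) {
    // remove()/insert_last() still produce a new index, but storing it
    // atomically only needs a shared reference to the entry.
    entry.index.store(new_index, Ordering::Release);
}
```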
Been busy, but I got numbers on this last night. Another variant I tried was only updating the LRU on the vote account 1/10,000 of the time (selector=6). The bottom graph is a log scale of the middle graph. The bottom 2 graphs show total time (get + LRU update).
The perf difference between #32518 and this PR is pretty strange.
The difference inside the mutex lock is … Here are metrics between three versions on a steady state mnb validator, being explicit about the differences between #3 and #9:

Clearly, 3 (this PR) is the fastest.
It is interesting that …
In theory, yes, we could do something special.
Problem
We are investigating the cost of loading accounts. We found a 100x performance improvement on account load from read cache. The hit rate on the read cache is very high. Here are examples from metrics of a running validator:
This graph shows a steady state validator using `get_mut` (write lock) on the left, then a restart and back to steady state using `get` (read lock) on the right. Note the y scale is log.

Currently, loading from the read cache requires a write lock on the `DashMap`:

solana/runtime/src/read_only_accounts_cache.rs, line 75 in a3171d3
This is because, to keep an LRU, we use `IndexList`. The API to move an entry to the back of the list requires a `remove()` followed by an `insert_last()`:

solana/runtime/src/read_only_accounts_cache.rs, lines 88 to 89 in a3171d3
This process produces a new index, and the new index must be set back into the entry in the `DashMap`. Write locks cause more contention than read locks; `get` doesn't need a write lock.

We need an API on `IndexList` like `move_to_last` which does not modify the index. Then we can get an immutable ref to the entry in the `DashMap`, and thus a read lock instead of a write lock.
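With such an API, the hot path could look roughly like this. A sketch assuming simplified stand-in types and the hypothetical `List`/`Index` from the earlier sketch; not the actual read_only_accounts_cache.rs code:

```rust
use dashmap::DashMap;
use std::sync::Mutex;

type Key = ([u8; 32], u64); // stand-in for (Pubkey, Slot)

struct Entry {
    account: Vec<u8>, // stand-in for AccountSharedData
    index: Index,     // position in the LRU queue; never changes
}

struct ReadOnlyCache {
    cache: DashMap<Key, Entry>,
    queue: Mutex<List<Key>>, // the LRU list, still behind one mutex
}

impl ReadOnlyCache {
    fn load(&self, key: Key) -> Option<Vec<u8>> {
        let entry = self.cache.get(&key)?; // read lock, not get_mut
        {
            let mut queue = self.queue.lock().unwrap();
            // move_to_last keeps the index stable, so nothing has to be
            // written back into the entry: a shared ref is enough.
            queue.move_to_last(entry.index);
        }
        Some(entry.account.clone())
    }
}
```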
One option is to pull the code into our crate, remove the code we don't need, and add the new method `move_to_last`. This PR is to evaluate whether this is the best approach. We'd want to copy over all tests that make sense, and to create new test(s) for `move_to_last`.

Another option is to get a change into the external project.
An alternative might be to create a vec of `DashMap`s, arranged by slot: we'd mod the slot to find the vec index of the `DashMap` that would contain an item of a given slot. With more `DashMap`s, write locks would be available more often. However, the LRU list itself would still be behind a single mutex, since it would presumably be shared across all (slot, pubkey) entries as it is today. A sketch of this sharding layout follows below.
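A sketch of that sharding layout, with hypothetical names:

```rust
use dashmap::DashMap;

type Key = ([u8; 32], u64); // stand-in for (Pubkey, Slot)

// Hypothetical sharded cache: one DashMap per shard, picked by slot.
struct ShardedCache<V> {
    shards: Vec<DashMap<Key, V>>,
}

impl<V> ShardedCache<V> {
    fn shard_for(&self, slot: u64) -> &DashMap<Key, V> {
        // Mod the slot to pick a shard; with more independent DashMaps,
        // any one write lock is contended less often.
        &self.shards[(slot as usize) % self.shards.len()]
    }
}
```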
Other alternatives could be a different model where we are looser with strict LRU usage ordering: some kind of atomic age that we increment on use and then search only occasionally to kick out older things. The previous implementation used a timestamp and searched the whole `DashMap` to find the strictly oldest items. We could store a slot (or a timestamp). When we need to evict (which is pretty uncommon), we could scan, get rid of the first entries we find that are older than some cutoff (10s, 1000 slots, whatever heuristic we came up with), and stop once we've evicted enough. There are obviously downsides to different impls, but this read only cache is very stable in contents, and it is very conservatively sized at the moment. A sketch of the atomic-age idea follows below.
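A sketch of the looser, age-based idea, with hypothetical names:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Global logical clock, bumped on every load instead of maintaining a
// strictly ordered LRU list on the hot path.
static AGE: AtomicU64 = AtomicU64::new(0);

struct Entry {
    last_used: AtomicU64, // updatable through &self, so a read lock suffices
}

fn touch(entry: &Entry) {
    let now = AGE.fetch_add(1, Ordering::Relaxed);
    entry.last_used.store(now, Ordering::Relaxed);
}

// Eviction (uncommon): scan, drop the first entries found that are older
// than some cutoff, and stop once enough has been evicted. Approximate
// LRU, not strict ordering.
fn is_stale(entry: &Entry, cutoff: u64) -> bool {
    entry.last_used.load(Ordering::Relaxed) < cutoff
}
```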
Summary of Changes
Fixes #