
wip: pull IndexList into crate, get_mut -> get #32423

Closed
Conversation

@jeffwashington (Contributor) commented Jul 8, 2023

Problem

We are investigating the cost of loading accounts. We found a ~100x performance improvement on account load from the read cache. The hit rate on the read cache is very high. Here are examples from the metrics of a running validator:

read_only_accounts_cache_hits=11571
read_only_accounts_cache_misses=169

This graph shows a steady-state validator using get_mut (write lock) on the left, then a restart and a return to steady state using get (read lock) on the right. Note the y scale is log.
image

Currently, loading from read cache requires a write lock on the DashMap.

let mut entry = match self.cache.get_mut(&key) {

This is because, to maintain the LRU order, we use IndexList. The API to move an entry to the back of the list requires a remove() followed by an insert_last().

queue.remove(entry.index);
entry.index = queue.insert_last(key);

This process produces a new index, which must be written back into the entry in the DashMap; that is what forces the write lock. Write locks cause more contention than read locks, and get doesn't need a write lock.

We need an API on IndexList like move_to_last that does not change the index. Then we can take an immutable ref to the entry in the DashMap, and thus a read lock instead of a write lock.
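For illustration, here is a minimal sketch of the load path this change aims for. It assumes a DashMap keyed by (Pubkey, Slot), the LRU IndexList behind a Mutex, and an IndexList extended with the proposed move_to_last; Entry and the other names are simplified stand-ins, not the actual cache types.

```rust
use dashmap::DashMap;
use index_list::{Index, IndexList}; // or the copy pulled into this crate, extended with move_to_last
use solana_sdk::{account::AccountSharedData, clock::Slot, pubkey::Pubkey};
use std::sync::Mutex;

struct Entry {
    account: AccountSharedData,
    index: Index, // position in the LRU queue; stays stable with move_to_last
}

struct ReadOnlyCacheSketch {
    cache: DashMap<(Pubkey, Slot), Entry>,
    queue: Mutex<IndexList<(Pubkey, Slot)>>,
}

impl ReadOnlyCacheSketch {
    fn load(&self, pubkey: Pubkey, slot: Slot) -> Option<AccountSharedData> {
        // `get` takes only a shard *read* lock, unlike `get_mut`.
        let entry = self.cache.get(&(pubkey, slot))?;
        {
            let mut queue = self.queue.lock().unwrap();
            // Proposed API: move the element to the back of the LRU without
            // changing its Index, so the DashMap entry never needs mutation.
            queue.move_to_last(entry.index);
        }
        Some(entry.account.clone())
    }
}
```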

One option is to pull the code into our crate, remove the code we don't need, and add the new method move_to_last. This PR is to evaluate whether this is the best approach.
We'd want to copy over all tests that make sense.
We'd want to create new test(s) for move_to_last.

Another option is to get a change into the external project.

An alternative might be to create a Vec of DashMaps, arranged by slot. We'd mod the slot to find the Vec index of the DashMap that would contain an item for a given slot. The result would be more available write locks, since there are more DashMaps. However, the LRU list itself would still be behind a single mutex, since the LRU list would presumably be shared across all (slot, pubkey) entries as it is today.
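A minimal sketch of that slot-sharded layout, with an illustrative shard count and names (the shared LRU queue is deliberately left out of the sketch, but it would remain the common bottleneck):

```rust
use dashmap::DashMap;
use solana_sdk::{account::AccountSharedData, clock::Slot, pubkey::Pubkey};

const NUM_SHARDS: usize = 16; // illustrative

struct SlotShardedCache {
    shards: Vec<DashMap<(Pubkey, Slot), AccountSharedData>>,
}

impl SlotShardedCache {
    fn new() -> Self {
        Self {
            shards: (0..NUM_SHARDS).map(|_| DashMap::new()).collect(),
        }
    }

    // Mod the slot to pick which DashMap holds entries for that slot.
    fn shard_for(&self, slot: Slot) -> &DashMap<(Pubkey, Slot), AccountSharedData> {
        &self.shards[slot as usize % self.shards.len()]
    }

    fn load(&self, pubkey: Pubkey, slot: Slot) -> Option<AccountSharedData> {
        // Note: the shared LRU queue (a single Mutex) would still need to be
        // updated here, which this sketch omits.
        self.shard_for(slot)
            .get(&(pubkey, slot))
            .map(|entry| entry.value().clone())
    }
}
```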

Other alternatives could be a different model where we are looser with strict LRU usage ordering: some kind of atomic age that we increment, and that we only search occasionally to kick out older things. The previous implementation used a timestamp and searched the whole DashMap to find the strictly oldest items. We could store a slot (or a timestamp). When we need to evict (which is pretty uncommon), we could scan, get rid of the first things we find that are older than some heuristic (10s, 1000 slots, whatever we came up with), and stop once we've evicted enough. There are obviously downsides to different impls, but this read-only cache is very stable in its contents. It is also very conservatively sized at the moment.
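A rough sketch of that looser model, assuming we store the last-used slot in each entry; the names and the 1000-slot staleness threshold are illustrative:

```rust
use dashmap::DashMap;
use solana_sdk::{account::AccountSharedData, clock::Slot, pubkey::Pubkey};
use std::sync::atomic::{AtomicU64, Ordering};

struct AgedEntry {
    account: AccountSharedData,
    last_used_slot: AtomicU64,
}

struct LooseLruCache {
    cache: DashMap<(Pubkey, Slot), AgedEntry>,
}

impl LooseLruCache {
    fn load(&self, pubkey: Pubkey, slot: Slot, current_slot: Slot) -> Option<AccountSharedData> {
        let entry = self.cache.get(&(pubkey, slot))?;
        // Read lock only: updating the age is a relaxed atomic store, so there
        // is no shared LRU queue and no DashMap write lock.
        entry.last_used_slot.store(current_slot, Ordering::Relaxed);
        Some(entry.account.clone())
    }

    /// Evict up to `target` entries older than the heuristic threshold, stopping early.
    fn evict_stale(&self, current_slot: Slot, target: usize) {
        // Collect keys first, then remove, so we don't remove while iterating the shards.
        let stale: Vec<(Pubkey, Slot)> = self
            .cache
            .iter()
            .filter(|e| {
                current_slot.saturating_sub(e.last_used_slot.load(Ordering::Relaxed)) > 1000
            })
            .map(|e| *e.key())
            .take(target)
            .collect();
        for key in stale {
            self.cache.remove(&key);
        }
    }
}
```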

Summary of Changes

Fixes #


/// Move the element at the index to the end.
/// The index remains the same.
pub(crate) fn move_to_last(&mut self, index: Index) {
@jeffwashington (Contributor, Author) commented:
new fn
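For illustration, a possible shape of such a method on a simplified doubly linked list whose nodes live in a Vec and are addressed by stable indices. This is not the IndexList crate's actual internals; the types and field names here are invented.

```rust
#[derive(Clone, Copy, PartialEq, Eq)]
struct Index(usize);

struct Node<T> {
    value: T,
    prev: Option<usize>,
    next: Option<usize>,
}

struct SimpleIndexList<T> {
    nodes: Vec<Node<T>>,
    head: Option<usize>,
    tail: Option<usize>,
}

impl<T> SimpleIndexList<T> {
    /// Move the element at `index` to the end. Only the prev/next links around
    /// it change; the index itself stays valid, so callers can cache it.
    fn move_to_last(&mut self, index: Index) {
        let i = index.0;
        if self.tail == Some(i) {
            return; // already last
        }
        // Unlink the node from its current position.
        let (prev, next) = (self.nodes[i].prev, self.nodes[i].next);
        match prev {
            Some(p) => self.nodes[p].next = next,
            None => self.head = next,
        }
        if let Some(n) = next {
            self.nodes[n].prev = prev;
        }
        // Relink it after the current tail (the tail exists here because the
        // list contains `index` and `index` is not the tail).
        let old_tail = self.tail;
        self.nodes[i].prev = old_tail;
        self.nodes[i].next = None;
        if let Some(t) = old_tail {
            self.nodes[t].next = Some(i);
        }
        self.tail = Some(i);
    }
}
```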

@jeffwashington jeffwashington changed the title pull IndexList into crate, get_mut -> get wip: pull IndexList into crate, get_mut -> get Jul 8, 2023
@codecov (bot) commented Jul 8, 2023

Codecov Report

Merging #32423 (efb5166) into master (c9f7cb5) will increase coverage by 0.0%.
Report is 1 commit behind head on master.
The diff coverage is 90.7%.

@@           Coverage Diff            @@
##           master   #32423    +/-   ##
========================================
  Coverage    82.0%    82.0%            
========================================
  Files         785      786     +1     
  Lines      211736   211939   +203     
========================================
+ Hits       173641   173854   +213     
+ Misses      38095    38085    -10     

@behzadnouri (Contributor)

Looking at the code change, we still need an exclusive lock on self.queue. So I am wondering why that does not create lock contention as bad as self.cache.get_mut.

Maybe the issue is that entry.account.clone() happens while we have a write lock on the entry.
Can you try the patch below and see if it shows any difference in performance?

diff --git a/runtime/src/read_only_accounts_cache.rs b/runtime/src/read_only_accounts_cache.rs
index 9956af64bb..cbbb1d901a 100644
--- a/runtime/src/read_only_accounts_cache.rs
+++ b/runtime/src/read_only_accounts_cache.rs
@@ -88,6 +88,7 @@ impl ReadOnlyAccountsCache {
             queue.remove(entry.index);
             entry.index = queue.insert_last(key);
         }
+        let entry = entry.downgrade();
         Some(entry.account.clone())
     }

@jeffwashington (Contributor, Author) commented Jul 8, 2023

@behzadnouri

We ARE also in contention on the IndexList mutex, AND, in master, we are holding the write lock on the DashMap WHILE we wait on and apply the IndexList changes. This has us holding the DashMap write lock even longer.

Here are the results of cycling once every 5 minutes through 4 different read cache.load implementations:
image

For selector:
  • 0 is master
  • 1 is master with a few things moved out of the lock
  • 2 is the downgrade change from behzad
  • 3 is read lock only (this PR)

The bottom right graph is the total time spent looking up in the DashMap + the LRU update time (move to back, however it is done).

The changes in this PR produce a ~10x better overall time per account. The lookup itself in the DashMap is ~100x faster.
Note, however, that the changes in this PR DO increase the LRU update time. This indicates we're no longer blocked on the DashMap write locks, which shows just how much we're blocking on the IndexList mutex lock.
It would obviously be great to speed up the LRU update as well somehow. Brainstorm possibilities include updating the LRU in the background, changing the LRU algorithm, changing the LRU locking mechanism, changing the LRU data structure, changing the LRU model, ...

Note also that cloning the account appears to be very cheap in all cases.

All times are in ns, calculated as the µs totals per second divided by the number of hits in that second.
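For concreteness, this is how I read that calculation (a hypothetical helper, not code from the branch):

```rust
// Per-account latency in ns: a per-second total measured in microseconds,
// divided by the number of read-cache hits in that same second.
fn per_account_ns(total_us_in_second: u64, hits_in_second: u64) -> u64 {
    if hits_in_second == 0 {
        return 0;
    }
    total_us_in_second * 1_000 / hits_in_second
}
```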

@behzadnouri (Contributor)

The odd thing is that self.cache.get_mut only locks one shard of the DashMap, and, last time I looked into this, there are num-cpus-many shards. So the lock here wouldn't necessarily block other threads.
On the other hand, self.queue.lock is across all threads, so I would have expected a lot more lock contention on self.queue rather than on self.cache; but the plots show otherwise.

Some possibilities are that:

  • DashMap is doing something weird.
  • Or, this is an artifact of how the metric is calculated. Can you share the code which computes the above metrics?

@jeffwashington (Contributor, Author)

> The odd thing is that self.cache.get_mut only locks one shard of the DashMap, and, last time I looked into this, there are num-cpus-many shards. So the lock here wouldn't necessarily block other threads. On the other hand, self.queue.lock is across all threads, so I would have expected a lot more lock contention on self.queue rather than on self.cache; but the plots show otherwise.
>
> Some possibilities are that:
>
> • DashMap is doing something weird.
> • Or, this is an artifact of how the metric is calculated. Can you share the code which computes the above metrics?

branch:
https://github.com/jeffwashington/solana/tree/jl3_test_cycle_pr

It cycles through 5 'selector' values for how to do the read cache load().
0 is master.
4 is a read lock on the DashMap and a background update of the LRU using a simple Vec as a queue. I have another version that uses a channel that isn't in this branch. Just trying to reduce the LRU update to the minimal amount.

Measurement of get_mut in the master implementation is in this branch:
https://github.com/jeffwashington/solana/blob/e5a87b462f164159423805596f8abe6451fcee91/runtime/src/read_only_accounts_cache.rs#L132-L140

@jeffwashington (Contributor, Author)

SELECT mean("read_only_accounts_cache_update_lru_per_account_ns")+mean("read_only_accounts_cache_load_get_mut_per_account_ns") AS "read_only_accounts_cache_get_and_update_lru_per_account_ns" FROM "mainnet-beta"."autogen"."accounts_db_store_timings" WHERE time > :dashboardTime: AND time < :upperDashboardTime: :hostids: GROUP BY time(15s) FILL(null)

Viewing the summed metrics per account.

You can try this dashboard.

@jeffwashington (Contributor, Author) commented Jul 17, 2023

@behzadnouri here are the top loads on the read only cache in 1 s.
We are hitting the same shard because we're hitting the same account, over and over.

top read only cache loads (count, pubkey, % of total loads):
17316,Vote111111111111111111111111111111111111111,68%
825,ComputeBudget111111111111111111111111111111,3%
518,BPFLoaderUpgradeab1e11111111111111111111111,2%
457,TokenkegQfeZyiNwAJbNbGKPFXCWuBvf9Ss623VQ5DA,1%
308,11111111111111111111111111111111,1%
219,FsJ3A3u2vn5cTVofAjvy6y5kwABJAqYWpe4975bi2epH,0%
195,HEvSKofvBgfaexv23kMabbYqxasxU3mQ4ibBMEmJWHny,0%
184,cjg3oHmg9uuPsP8D6g29NWvhySJkdYdAo9D25PRbKXJ,0%
177,whirLbMiicVdio4qvUfM5KAg6Ct8VwpYzGff3uctyCc,0%
150,NTYeYJ1wr4bpM5xo6zx5En44SvJFAd35zTxxNoERYqd,0%
131,E8cU1WiRWjanGxmn96ewBgk9vPTcL6AEZ1t6F6fkgUWe,0%
129,M2mx93ekt1fmXSVkTrUL9xVFHkmME8HTUi5Cyc5aF7K,0%
118,GiU5UpxKvYsSEpCBUrpZYKFqQJvzAMDnYjT9C4iqnahx,0%
112,72xbspJP8vtJ91Lp7k6mqJXFiPSvpad5BBRrixfAWh8m,0%
100,CyZuD7RPDcrqCGbNvLCyqk6Py9cEZTKmNKujfPi3ynDd,0%
100,6TDapMxCnEzss5YCJZXZtTSHwe6ZJeWikwvPzLnZvq7m,0%
100,SW1TCH7qEPTdLsDHRgPuMQjbQxKdH2aBStViMFnt64f,0%

Another adjustment here: we could never update the LRU for the vote program and never throw it out of the read-only cache. Except, it is rewritten once per epoch, and the entries in the read-only cache are keyed by (Pubkey, Slot) tuples...

@behzadnouri (Contributor)

@jeffwashington #32518 is an alternative which uses atomics for the indices in the cache entries. It has an equivalent impact, as load no longer needs a mutable entry. So I would expect to see the same improvement.

Can you also test #32518 in your setup and compare?

@jeffwashington (Contributor, Author)

> @jeffwashington #32518 is an alternative which uses atomics for the indices in the cache entries. It has an equivalent impact, as load no longer needs a mutable entry. So I would expect to see the same improvement.
>
> Can you also test #32518 in your setup and compare?

Been busy, but I got numbers on this last night.
selector=0 is master.
selector=5 is the atomic index and get.
The results: the atomic index change is better than master, but not as good as we can get by keeping a stable index (due to other effects of releasing locks and avoiding contention on the lock during load). By just changing to get from get_mut, we now end up blocking on the mutex on the LRU queue. The get is much faster, but the LRU update is now much slower.

Another variant I tried was only updating the LRU for the vote account 1/10,000 of the time (selector=6).
Those numbers were much better, and are similar to the best I could get through other methods (a sketch of the sampling idea follows the graphs below):

The bottom graph is a log-scale version of the same middle graph. The bottom 2 graphs show total time (get + LRU update).
image
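A rough sketch of the selector=6 sampling idea; the counter-based 1-in-10,000 sampling and the names here are illustrative, not necessarily how the experiment was coded:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Skip most LRU updates for an extremely hot key so the queue mutex is taken
// only ~1/10,000 of the time that key is loaded.
const SAMPLE_RATE: u64 = 10_000; // illustrative

struct SampledLruUpdater {
    hot_key_loads: AtomicU64,
}

impl SampledLruUpdater {
    /// `update` stands in for "lock self.queue and move the entry to the back".
    fn maybe_update_lru(&self, is_hot_key: bool, update: impl FnOnce()) {
        if is_hot_key {
            let n = self.hot_key_loads.fetch_add(1, Ordering::Relaxed);
            if n % SAMPLE_RATE != 0 {
                // Skip the queue mutex entirely for most loads of the hot key.
                return;
            }
        }
        update();
    }
}
```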

@behzadnouri (Contributor)

The perf difference between #32518 and this PR is pretty strange.
The only difference is the code after self.queue is locked, but that should only be a few nanoseconds anyway.

@jeffwashington (Contributor, Author)

> The only difference is the code after self.queue is locked, but that should only be a few nanoseconds anyway.

The difference inside the mutex lock is move_to_last vs. remove + insert_last. I suspect move_to_last is much faster. And since we know the critical section inside the mutex lock is the critical path that all the concurrent loads are waiting on, it makes sense that more work inside that lock is serialized and causes delays.

Here are metrics between three versions on a steady-state mainnet-beta validator:
  • 3 is this PR (using the atomic index)
  • 9 is this PR (using the atomic index), but using remove/insert_last instead of move_to_last
  • 5 is the atomic index PR with remove/insert_last
9 should be equivalent to 5 except for the `if selector == 3 {` check.

Being explicit about the differences between #3 and #9:

            let (_, update_lru_us) = measure_us!({
                let mut queue = self.queue.lock().unwrap();
                if selector == 3 {
                    queue.move_to_last(entry.index());
                } else {
                    // selector == 9
                    queue.remove(entry.index());
                    entry.set_index(queue.insert_last(key));
                }
            });

Clearly, 3 (this PR) is the fastest.
9 and 5 are equivalent, as we'd expect.
Thus, the cost of (remove + insert_last + set_index) vs. move_to_last within the mutex lock is the culprit.

image

@HaoranYi (Contributor) commented Aug 1, 2023

> @behzadnouri here are the top loads on the read only cache in 1 s. We are hitting the same shard because we're hitting the same account, over and over.
>
> (table of top read-only cache loads quoted above)
>
> Another adjustment here: we could never update the LRU for the vote program and never throw it out of the read-only cache. Except, it is rewritten once per epoch, and the entries in the read-only cache are keyed by (Pubkey, Slot) tuples...

It is interesting that Vote111111111111111111111111111111111111111 accounts for more than 68% of loads, and those accounts are read-only ones. We could pin these accounts in a separate array in the cache and return them directly when the requested pubkey matches (see the sketch below).
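A minimal sketch of that pinned-accounts fast path; the names and the Vec-based pinned set are illustrative, and, as the next comment notes, rewrites move these accounts to new slots, so the pinned entries would still need refreshing:

```rust
use dashmap::DashMap;
use solana_sdk::{account::AccountSharedData, clock::Slot, pubkey::Pubkey};

struct PinnedReadOnlyCache {
    // A handful of pinned hot entries, checked before the main map.
    pinned: Vec<((Pubkey, Slot), AccountSharedData)>,
    cache: DashMap<(Pubkey, Slot), AccountSharedData>,
}

impl PinnedReadOnlyCache {
    fn load(&self, pubkey: Pubkey, slot: Slot) -> Option<AccountSharedData> {
        // Fast path: linear scan of a few pinned entries, no shard lock on the
        // main map and no LRU bookkeeping.
        if let Some((_, account)) = self
            .pinned
            .iter()
            .find(|((k, s), _)| *k == pubkey && *s == slot)
        {
            return Some(account.clone());
        }
        // Slow path: regular cache lookup (LRU update omitted in this sketch).
        self.cache.get(&(pubkey, slot)).map(|e| e.value().clone())
    }
}
```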

@jeffwashington (Contributor, Author)

> We could pin these accounts in a separate array in the cache and return them directly when the requested pubkey matches.

In theory, yes, we could do something special.
Note that the key of the read-only cache hashmap is really a tuple of (Slot, Pubkey). When we do rewrites, the vote program and such all move to newer slots, and txs stop loading them at the old slot. So the old entry falls out of the read-only cache and the new one moves in.

@jeffwashington (Contributor, Author) commented Aug 11, 2023

Probably spent way too much time on this:
replay_stats.load_us best perf:
best perf of these (400ms, 1/32, 4000ms, 1/64)
image

best read only cache load perf:
1/64 (best), 4000ms, 1/32, 400ms
image

But, this is because we are missing more with 1/32 and 1/64:
image

We also see the cache holding more entries for 1/32 and 1/64:
image

Not sure if that means we're kicking out ~20(?) large programs, so we are holding more, smaller accounts? Not sure yet. But more misses makes the read-only load faster and the overall load slower. This should explain why with 1/k sampling we miss more, the read-only load is faster, but replay.load is slower.

I'm trying again with 1/16 sampling.

Note that all these numbers are with the move_to_last optimization on IndexList. That reduces the load time for hits and for updating the LRU. The important aspects appear to be contention on the lock for the LRU update, the time spent inside that lock while updating, and the hit/miss rate. These all affect load time at all levels.

@github-actions bot added the stale label ([bot only] results in auto-close after a week) Aug 28, 2023
@github-actions bot closed this Sep 5, 2023