
update lru for read only cache if it has been long enough since last access #32560

Merged
3 commits merged on Aug 29, 2023

Conversation

@jeffwashington (Contributor) commented Jul 20, 2023

Problem

Working on improving load time. The read cache spends a lot of time blocked on the global mutex that guards the eviction LRU queue.

Summary of Changes

Update the LRU for the read-only cache only if it has been long enough since the last access. The idea is that entries in the read-only accounts cache fall into two camps: read frequently, or read infrequently. Frequently read items will be accessed again prior to eviction and their LRU position will be updated (assume a ~50s delay from insertion to eviction). Infrequently read items will not be accessed again prior to eviction.

By avoiding an LRU modification on every access, the common case of read-cache hits is greatly improved. The only casualties are accounts which are read twice within the time limit (100ms at the moment) and then evicted after 50s instead of 50.1s. The cache still operates correctly, and loads still return correct results; we just lose some performance for those special items which are infrequently loaded in this pattern. System performance and access patterns on accounts are not deterministic, and the read cache will always have noise.

The fastest code is code that doesn't run at all. By eliminating almost all calls to the LRU update code, loads are on average much faster.
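As a rough sketch of the idea (a standalone model, not the actual solana-labs code; `CacheEntry`, the field name, and the constant name are made up for illustration), the per-entry gate might look like:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical per-entry state; the real cache entry also holds the
// account data and an index into the LRU queue.
struct CacheEntry {
    last_lru_update_ms: AtomicU64,
}

// Skip the global queue update unless at least this long has passed
// since this entry's last LRU update (100ms at the time of this PR).
const LRU_UPDATE_INTERVAL_MS: u64 = 100;

impl CacheEntry {
    // Returns true if the caller should take the queue mutex and move
    // this entry to the back of the LRU queue.
    fn should_update_lru(&self, now_ms: u64) -> bool {
        let last = self.last_lru_update_ms.load(Ordering::Relaxed);
        if now_ms.saturating_sub(last) < LRU_UPDATE_INTERVAL_MS {
            return false; // updated recently; skip the contended lock
        }
        // Claim the update; if another reader raced us and won, it
        // already refreshed the timestamp, so we can skip too.
        self.last_lru_update_ms
            .compare_exchange(last, now_ms, Ordering::Relaxed, Ordering::Relaxed)
            .is_ok()
    }
}
```

Most loads of a hot account take only the cheap early-return path, never touching the queue mutex.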

Fixes #

codecov bot commented Jul 20, 2023

Codecov Report

Merging #32560 (581cfc5) into master (2e5c2e5) will decrease coverage by 0.1%.
Report is 29 commits behind head on master.
The diff coverage is 92.6%.

❗ Current head 581cfc5 differs from pull request most recent head ab78a5c. Consider uploading reports for the commit ab78a5c to get more accurate results

@@            Coverage Diff            @@
##           master   #32560     +/-   ##
=========================================
- Coverage    82.0%    81.9%   -0.1%     
=========================================
  Files         784      785      +1     
  Lines      212542   211208   -1334     
=========================================
- Hits       174343   173174   -1169     
+ Misses      38199    38034    -165     

@jeffwashington (Contributor, Author) commented Jul 31, 2023

@brooksprumo I added you while this was in draft. I wanted to get your thoughts on the concept.

@brooksprumo (Contributor) left a comment

Does this help a validator on mnb? I could see it being beneficial for the very hot accounts that are read a lot. Won't there still be lock contention, just delayed? Maybe this is a win due to the reduction in updates from the hot accounts?

Lastly, I imagine we may fail to refresh an account's LRU position because it doesn't get re-read in time (e.g. read again sooner than the skip amount, so not updated, then evicted, then read just afterwards, when under the original scheme it would still have been in the cache). I'm hoping this is both infrequent and inconsequential.

@jeffwashington (Contributor, Author)

mnb validator startup and catchup/steady state. The y axis is the age (in ms) since the last LRU update of each item evicted from the read-only accounts cache.
The x axis is each eviction from the read-only accounts cache.

768k evictions over this time period.
233 of those occurred less than 2s after the last LRU update.

So, evictions are almost always of old things, indicating we would have had plenty of time to access the item again if it was ever going to be accessed again. Even if something was read twice within 100ms and only updated once, it would still have to go unaccessed for the next 50-90s. If it hasn't been accessed in 50-90s, it is perfectly ok to evict it.

@jeffwashington (Contributor, Author)

jeffwashington commented Aug 3, 2023

mnb validator. Left is before this change; right is with this change.
The gap in the middle is restart/catchup and can be ignored.

@jeffwashington (Contributor, Author)

Hits and misses. Left is without this change; right is with this change. Noise in the middle is startup costs (i.e. not steady state).
The purple line is this machine; the light blue line is an equivalent machine without this change.

hits and misses track with and without this change.

@jeffwashington jeffwashington marked this pull request as ready for review August 3, 2023 15:51
@behzadnouri (Contributor)

My concern is that this is pretty heuristic, and the risk with a heuristic approach is that we don't know when it will perform poorly; e.g. when there is high load, or a certain transaction pattern, or when the set of active accounts is rotated, etc.

That issue aside, shouldn't the update frequency be like every k lookups instead of being time-based?

@jeffwashington (Contributor, Author) commented Aug 3, 2023

That issue aside, shouldn't the update frequency be like every k lookups instead of being time-based?

Pick a k. Say k = 10.

  1. Any account read fewer than 10 times per 50s would be thrown out and reloaded once per 50s.
  2. Any account accessed 15k times per second would update the LRU 1.5k times per second.

Both of these are worse than the time-based behavior. Increasing k makes the first problem worse; decreasing k makes the second problem worse.

For time-based at 100ms:

  1. Any account not read twice within 50.1s, with at least 100ms separating the two reads, will be thrown out 50.1s after the first read. The status quo behavior is that any account risks being thrown out approximately 50s after the last read (at steady state). This is equivalent behavior.
  2. Any read within 100ms of the last read of that pubkey will skip the LRU update. So, an account read 15k times per second will update the LRU at most 10 times per second instead of 15k times per second. Even an account read 100 times per second will update the LRU at most 10 times per second.

This cache exists only to improve the performance of loading common accounts. Behavior is correct whether the account is in the read accounts cache or not. Plenty of accounts are only read once and live briefly in the accounts cache (~50s). It seems fine to use a heuristic that improves performance for all hits and misses and sufficiently approximates the theoretically ideal eviction behavior, when the only possible penalty is decreased performance for loading accounts which are accessed many times close together, then not accessed for a long time (50s), then accessed again. We get an expected large performance increase in exchange for theoretically possible, occasional worse performance. Of course, this is measured against mnb traffic today; traffic could change. LRU is still a reasonable eviction scheme if you have to pick one, and this tweak to the LRU record keeping is still a sufficiently good approximation of LRU.

Worst case, each non-ideally evicted account incurs one cache miss per ~50s. These are only accounts which are accessed twice within 100ms and then accessed again 50-50.1s later. Accessing the account more than 50.1s later would have seen it evicted anyway. Accessing it again within 50s puts it back as the newest entry in the LRU queue. So there is a narrow window of time for any behavior change from today's scheme.
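The rate cap in point 2 above can be sanity-checked with a toy model (illustrative only; `lru_updates` is a made-up helper that replays one account's read times through the gate, not cache code):

```rust
// Toy model of the time gate: given the times (in ms) at which one
// account is read, count how many of those reads would actually
// update the LRU under a per-account gate of `gate_ms`.
fn lru_updates(read_times_ms: &[u64], gate_ms: u64) -> usize {
    let mut last_update: Option<u64> = None;
    let mut updates = 0;
    for &t in read_times_ms {
        match last_update {
            // Read falls within the gate: skip the LRU update.
            Some(prev) if t.saturating_sub(prev) < gate_ms => {}
            _ => {
                last_update = Some(t);
                updates += 1;
            }
        }
    }
    updates
}
```

With a 100ms gate, 1,000 reads spread over one second (one per millisecond) produce only 10 LRU updates: the update rate is capped no matter how hot the account is.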

@behzadnouri (Contributor) commented Aug 3, 2023

Oh, you are tracking the time per entry.
I was suggesting having a counter for the whole cache (not per entry), and updating the LRU every k lookups. Effectively you are sampling the load function at rate 1/k for LRU updates; so if account a is looked up x times more frequently than account b, it has x times more chances to get an LRU update on one of those loads.
Something like below:

diff --git a/runtime/src/read_only_accounts_cache.rs b/runtime/src/read_only_accounts_cache.rs
index 23a0af60f4..98334ad6ae 100644
--- a/runtime/src/read_only_accounts_cache.rs
+++ b/runtime/src/read_only_accounts_cache.rs
@@ -15,6 +15,7 @@ use {
     },
 };
 
+const LRU_UPDATE_CADENCE: u64 = 10;
 const CACHE_ENTRY_SIZE: usize =
     std::mem::size_of::<ReadOnlyAccountCacheEntry>() + 2 * std::mem::size_of::<ReadOnlyCacheKey>();
 
@@ -35,6 +36,7 @@ pub(crate) struct ReadOnlyAccountsCache {
     // always sorted in the order that they have last been accessed. When doing
     // LRU eviction, cache entries are evicted from the front of the queue.
     queue: Mutex<IndexList<ReadOnlyCacheKey>>,
+    counter: AtomicU64,
     max_data_size: usize,
     data_size: AtomicUsize,
     hits: AtomicU64,
@@ -48,6 +50,7 @@ impl ReadOnlyAccountsCache {
         Self {
             max_data_size,
             cache: DashMap::default(),
+            counter: AtomicU64::default(),
             queue: Mutex::<IndexList<ReadOnlyCacheKey>>::default(),
             data_size: AtomicUsize::default(),
             hits: AtomicU64::default(),
@@ -84,7 +87,7 @@ impl ReadOnlyAccountsCache {
             // Move the entry to the end of the queue.
             // self.queue is modified while holding a reference to the cache entry;
             // so that another thread cannot write to the same key.
-            {
+            if self.counter.fetch_add(1, Ordering::Relaxed) % LRU_UPDATE_CADENCE == 0 {
                 let mut queue = self.queue.lock().unwrap();
                 queue.remove(entry.index);
                 entry.index = queue.insert_last(key);

@jeffwashington (Contributor, Author)

I was suggesting to have a counter for the whole cache (not per entry)

Are you more comfortable with this 1/k sampling method?
It seems the 1/k sampling method will cause accounts referenced, say, once per slot to be commonly evicted unnecessarily. I would expect it to be not uncommon for an account to be referenced once per slot, or once per N slots, i.e. periodically with a period of less than 50s. Such accounts would always remain in the cache today but would be expected to be evicted under the 1/k sampling method.

brooksprumo previously approved these changes Aug 3, 2023

@brooksprumo (Contributor) left a comment

Looks good to me.

Since caching is an optimization, and not required for correctness, this feels ok to me. We can tweak the parameters as we test it out on mnb and roll out on testnet/pop-net.

I offer some naming tweaks, but they're just suggestions that would help my brain; not required.

Lastly, I would think the 1/k sampling would heavily weight the updates toward the hottest accounts. Is that right? So we may entirely miss updating luke-warm/cold-ish accounts, since a majority of the updates will happen for the vote program.

Two review comments on runtime/src/read_only_accounts_cache.rs (outdated; resolved).
brooksprumo previously approved these changes Aug 4, 2023

@brooksprumo (Contributor) left a comment

Lgtm

@behzadnouri (Contributor) commented Aug 4, 2023

It seems the 1/k sampling method will result in accounts referenced once per slot, for example, to be expected to be commonly evicted unnecessarily. I would expect it to be not uncommon to have an account be referenced once per slot or once per N slots, certainly periodically with a period less than 50s. Such accounts would always remain in the cache today but are expected to be evicted in the 1/k sampling method.

I don't really understand the above.
If account a is loaded more frequently than account b, we want the cache to be more likely to hold account a than account b.
To see if account a is loaded more often than account b, you can either inspect every load, or inspect a sample of 1/k of loads. The fact that you are sampling loads at rate 1/k will not make account b appear more frequent than account a. If account a is loaded x times more frequently than account b, it will still be x times more frequent under 1/k sampling.

@jeffwashington (Contributor, Author) commented Aug 4, 2023

Here's how I thought about it.
TL;DR:
Accounts read more than once per 50s are likely to be inefficiently evicted in FIFO order and then re-added. The more often an account is read, the more likely it ever gets sampled as a result of a read. An account read once per slot is under-sampled by about 8x relative to having a good chance of being sampled once per 50s.

Details:
50s is about the time on mnb that our 400MB limit on the read accounts cache would cause an entry to be evicted if we used FIFO semantics rather than LRU. mnb has about 20k cache reads/s.
An account that is read less than once per 50s will be inserted into the read cache and evicted 50s later. This is FIFO.
Accounts which are infrequently accessed (but more than once per 50s, such as once per slot) are likely to never be bumped in the LRU if we sample 1/1,000, and will be expected to behave like accounts that are read less than once per 50s. They will be inserted into the cache, maybe used during their time in the cache, then evicted after 50s, then re-added on the next read. This seems unnecessarily wasteful. This will keep a handful of the hottest accounts in memory, and everything else will be FIFO: inserted and evicted on a 50s cycle.

20,000 all accounts reads/s
1,000,000 total reads over 50s
0.001 sample ratio (1/1000)
1,000 samples taken during 50s

account a:
2.5 reads/s (once per slot)
125 reads during 50s
0.000125 ratio of total reads that are account a

It is highly likely that we basically never update the LRU due to a read of account a.
Thus, it is highly likely/expected that we will evict account a once per 50s.
And this is true for every account which is read multiple times in 50s but avoids getting sampled.
For all but the highest-frequency reads, accounts will enter the LRU at the top and make their way steadily to the bottom.
So, the LRU will effectively be FIFO on being added to the cache.
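The "basically never" claim can be quantified with a back-of-envelope model (treating the every-k-th-load counter as random sampling, which interleaved reads from many accounts approximate): with 1,000 samples per 50s window and account a making up 0.000125 of reads, the chance account a is never sampled in a window is about (1 - 0.000125)^1000, roughly 0.88.

```rust
// Back-of-envelope model of 1/k sampling, using the numbers above:
// 1,000,000 reads per 50s window (20k reads/s), sampled 1/1000 so
// 1,000 samples per window, with account a read 125 times per window
// (2.5 reads/s, i.e. once per slot).
fn never_sampled_probability(total_reads: f64, samples: f64, account_reads: f64) -> f64 {
    // Chance that any one sample lands on this account's reads.
    let p = account_reads / total_reads;
    // Chance that none of the window's samples do.
    (1.0 - p).powf(samples)
}
```

Under this model there is an ~88% chance account a's LRU position is never refreshed during a 50s window, i.e. it is usually evicted on the FIFO schedule.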

@behzadnouri (Contributor)

With 1/k sampling rate:

  • if k == 1, then every load will update the lru and the lru policy is implemented precisely.
  • if k == ∞, i.e. infinity, then it is just fifo semantics.

A reasonable k will depend on the size of the cache, preserving enough of the LRU policy's characteristics while mitigating LRU-update overhead at load time.

Assuming that k is reasonably chosen, and despite that:

Highly likely that we basically never update lru due to a read of account a

then that means that account a is indeed very infrequently loaded and so cache capacity shouldn't be wasted for it.

@jeffwashington jeffwashington force-pushed the jul14_2 branch 2 times, most recently from c82d735 to 581cfc5 Compare August 7, 2023 18:05
@jeffwashington (Contributor, Author) commented Aug 8, 2023
All these are built off #32518 (read only accounts use read lock)
Top two are master
middle blue is #32721 (random eviction)
Bottom two are this pr (update lru at most once per 100ms per account).
On 5 equivalent bare metal machines, in theory.

Note that hit percent seems to track as perfectly as we could hope for.

@jeffwashington (Contributor, Author)

This cuts out the 2 master machines. Top is #32721 (random eviction).
Bottom 2 are 100ms max update per account.

@jeffwashington (Contributor, Author) commented Aug 8, 2023

With random eviction, we are getting a slightly lower hit %. Light blue is random eviction. This is hard to show on the graphs, so I stretched them vertically.
As we would expect, random eviction does cause useful things to be evicted more often than an accurate LRU does.
Note that cache misses are faster than cache hits, since we never update the LRU on a cache miss. So, overall load is slower when we miss the cache. But these metrics, which cover only read-only cache loads (hits and misses), would be best/lowest if we missed every time.

@jeffwashington (Contributor, Author)

Some data I don't trust yet.
I think machine #11 is missing more often. Misses are faster than hits.

10 - master 2444e091835580ab2ba003d2aa70d2e2d73fa3ab
14 - jwash/jul14_2_test_no_lock - 53857f0fd88313ea97e339db9081c9d1c96e11f4
12 - behzad/read-only-cache-2rand - a928e028ae1c414b9cbf35a2bcde800af30b1ea0
13 - jwash/jul14_2_test - d08cdb16b0915a2cde5d37d89b4a43ebbf64aace
11 - jwash/jul14_2_test_periodic - ffc16aa30bc80058db49949b9158609a18bfbe2d

10 - orange, master (with read lock commit)
14 - pink, update lru if there is no contention on lru update mutex
12 - dark blue, 2rand eviction from behzad
13 - purple, update lru if it has been 100ms update per account
11 - cyan, periodic sampling 1/64


@jeffwashington (Contributor, Author)

These results indicate periodic sampling (k=64) performed better than the 100ms-per-account update. But it appears the miss rate was higher with k=64, so overall load would be slower due to misses. I'm trying again, adding k=32 and sampling at most once per 400ms per account. We could probably bump 400ms much higher (like 10s) and greatly reduce the number of samples we take and, thus, contention. But I think we can't bump k much higher without increasing misses. Still digging in. Runs are never deterministic; a long ledger-tool run could give us more reliable data, though there will still be a lot of noise.

@jeffwashington (Contributor, Author)

We can eliminate several contenders:

  1. Updating the LRU when try_lock succeeds, indicating no lock contention. This probably just updates the LRU a lot more often. It was 4th slowest.
  2. 2rand, which, at least at those tuning parameters, was 3rd fastest.
  3. master, which was slowest. But we knew that.

@jeffwashington (Contributor, Author)

I have more data on misses now. These are all over a 6-hour window on mnb.

The top line is 1/32 sampling.
The middle lines are both 1/8 sampling (I intended to try 1/16 but I messed up).
The bottom 2 lines are this pr with 400ms and 4000ms updates per account.

hit % is usually clearly won by 400ms/4000ms.

read only cache load time per hit or miss:
The best 2 are 4000ms and 1/32 sampling (they sample the fewest, and 1/32 misses the most).
1/8 and 400ms cost about the same per load, but 1/8 misses more.

misses:

Usually, worst is 1/32; 1/8 is middle; 400ms/4000ms have the fewest misses.
Misses did briefly invert on one 3000s data point, by a small margin.

replay stats load. This should cover all loads: hits and misses, write cache, read cache, from append vec, etc.
This is probably the best metric.
Best is 1/8.
4000ms is middle.
1/32 is fourth best.
400ms is worst.

read only cache entries:

Most entries: 1/32
Middle entries: 1/8
Least entries: 400ms/4000ms
I do not understand why this is happening.

The cache maintains a consistent total size of entries (struct size + data len).

The spikes are temporary while we are evicting.

@jeffwashington (Contributor, Author)

Some observations:

  1. A time-based (400ms or 4000ms) per-account LRU model conceptually lines up with LRU being time-based, and avoids over-sampling high-frequency data. But it probably samples more, costing more.
  2. A 1/k sampling model is easy to understand. Practically, a 1/k sampling model spreads out the LRU updates (one per k loads), meaning we should never have lock contention on the LRU update lock. Sampling less saves time. Never having lock contention while sampling saves time. But we over-sample high-frequency accounts, which causes us to miss more. Yet, for some reason, we keep more accounts in the cache. This seems to imply we are throwing out some low-frequency large accounts that we then have to re-load, hurting the overall load numbers.

There are still unclear effects of these 2 methodologies. The effects may be tied to the specific access patterns in effect at the moment on mnb.

Both methods produce correct results.

@jeffwashington (Contributor, Author)

@behzadnouri
I have been extensively testing variations of the various methods for weeks now.
Conclusion: the best performance comes from a variation of this pr.

A 1/k sampling rate over-samples the high-frequency accounts and then evicts and re-adds enough accounts (including large accounts) that overall replay performance is worse.
This pr approximates LRU by capping how often we update the LRU for a given pubkey. I'm struggling to see a downside to this approach. Even the LRU timestamp within each entry in the read-only cache fits within the packed size of the struct. My recommendation is that we proceed with this pr, with a constant of 16k ms. This proves sufficient to reduce the overall number of samples taken while keeping the LRU accurate enough for maximum hits.

read only cache load time per (hits + misses):
light blue (best) - 11, 16k ms max sampling, this pr
red - 13 - 4k ms max sampling this pr
orange - 14 - 1/8 sampling (I've tried 1/16, 1/32, 1/1000, 1/1, etc.)
purple (worst) - 12 - 1/4 sampling

replay stats, load_us:
16k ms max sampling is best
1/4 sampling is worst

read only accounts cache hit percent:
red - 13 - 4k ms max sampling this pr (4k misses slightly less than
light blue (best) - 11, 16k ms max sampling, this pr
1/k are both worst

bytes of account data loaded as a result of misses (some are unavoidable):

best: 4k and 16k ms this pr
worst: 1/k

misses:
best: 4k, 16kms this pr
worst: 1/k

@jeffwashington (Contributor, Author)

this pr (purple) vs master (blue) against mnb:

@brooksprumo (Contributor) left a comment

lgtm - the numbers for the performance improvements look good.

We can always tune this more. We also own the full stack, so we can pick heuristics/implementations that best match our workloads. I think this PR shows how a workload-aware impl can net some big wins. If/when the workload changes, we can pick the best impl for it.

@behzadnouri (Contributor)

I'm struggling to see a downside to this approach.

The downside I see is that this is too finely tuned for the current mainnet load and, being a heuristic approach, it might not work out well if the load characteristics change.
For example, if the load spikes so that many more distinct accounts are loaded with much higher frequency, then the 100ms skip will effectively approach FIFO semantics and break the cache. 1/k sampling was immune to this.

@jeffwashington (Contributor, Author)

We have discussed and measured various implementations for quite some time. It sounds like we can agree that reducing how often we update the LRU queue is a reasonable approach.
Time since the last LRU update for a given pubkey proved to be the most accurate approximation of an ideal LRU (current master) for the current mnb workload.
This results in the lowest overall load time for replay, which is the metric we're ultimately after.
Both approaches are better than master today, by a lot.
We can always tune or modify the strategy at a later point.
Loads will be correct with either scheme.
We could probably construct cases (many of them pathological) where the read-only cache would begin to behave like a FIFO under various tuning parameters and methodologies, where 'in' is insertion into the cache and 'out' is eviction due to storing a newer entry. In some cases, FIFO is the correct, ideal behavior given the specific access patterns.
The cache size itself is another tuning parameter we could change.
None of this affects consensus with other validators.

@jeffwashington jeffwashington merged commit 4d452fc into solana-labs:master Aug 29, 2023