fix(p2p): cache responses to serve without roundtrip to db #2352
Conversation
Force-pushed from dcdaaf7 to 8079739
crates/services/p2p/src/service.rs
Outdated
impl CachedView {
    fn new(metrics: bool) -> Self {
        Self {
            sealed_block_headers: DashMap::new(),
            transactions_on_blocks: DashMap::new(),
            metrics,
        }
    }
We probably want to also support sub-ranges or even partial ranges here, but that can come in the future :)
Force-pushed from 009dc14 to 5b03fa0
Thanks for implementing this. I hate to be annoying here, but to approve this I need to:
- See the Changelog updated.
- Understand the reasoning behind the current caching strategy and the benefits/drawbacks over an LRU cache.
- Be certain that we don't open the door to OOM attacks by allowing our cache to be overloaded.
Let me know your thoughts on 2 and 3. I'm happy to jump on a call to discuss this and figure out a good path forward.
pub struct CachedView {
    sealed_block_headers: DashMap<Range<u32>, Vec<SealedBlockHeader>>,
    transactions_on_blocks: DashMap<Range<u32>, Vec<Transactions>>,
    metrics: bool,
}
I'm a bit hesitant about the current approach of storing everything and clearing on a regular interval. Right now, there is no memory limit on the cache, and we use ranges as keys. So if someone queries the ranges (1..=4, 1..=2, 3..=4), we'd store all blocks in the 1..=4 range twice - and this could theoretically grow quadratically for larger ranges.
I would assume that the most popular queries at a given time are quite similar. Why not use a normal LRU cache with a fixed memory size? Alternatively, just maintain a cache over the last
Yup, it's still WIP.
Ah right, I see this PR is still a draft :)
We now use block height as the key in 6422210.
We will retain the time-based eviction strategy for now.
… instead of a per-range basis
Co-authored-by: Mårten Blankfors <[email protected]>
Co-authored-by: Rafał Chabowski <[email protected]>
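For illustration, here is a minimal sketch of why keying by height helps (placeholder types, not the actual fuel-core code): overlapping range requests reuse the same per-height entries instead of storing the blocks again, avoiding the quadratic duplication discussed above.

// Sketch only: placeholder types, not the real fuel-core CachedView.
use dashmap::DashMap;

#[derive(Clone, Debug)]
struct SealedBlockHeader(u32);

struct CachedView {
    // Keyed by block height rather than by the requested range.
    sealed_block_headers: DashMap<u32, SealedBlockHeader>,
}

impl CachedView {
    fn insert_headers(&self, headers: impl IntoIterator<Item = (u32, SealedBlockHeader)>) {
        for (height, header) in headers {
            // Re-inserting an existing height overwrites the entry, so overlapping
            // range requests never accumulate duplicate copies of the same block.
            self.sealed_block_headers.insert(height, header);
        }
    }
}

fn main() {
    let cache = CachedView { sealed_block_headers: DashMap::new() };
    cache.insert_headers((1..5).map(|h| (h, SealedBlockHeader(h))));
    cache.insert_headers((1..3).map(|h| (h, SealedBlockHeader(h))));
    // Four distinct heights, even though the two ranges overlapped.
    assert_eq!(cache.sealed_block_headers.len(), 4);
}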
        }
    }

    pub(super) fn clear(&self) {
Maybe we could use some LRU cache instead of wiping the whole cache every 10 seconds? This is just a thought since the current approach should also work.
Yeah, in the future we can use an LRU :) We discussed it somewhere above too.
crates/services/p2p/src/service.rs
Outdated
    cache_reset_interval: Duration,
    next_cache_reset_time: Instant,
Looks like this could be internal logic of the CachedView, and on each insert/get we could clean it.
It could be, but I didn't want to make get_from_cache_or_db require a mutable reference to Self, because it's just a getter. No strong opinion here, so if you want it that way, I can move it around.
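A minimal sketch of how the reset could live inside CachedView while keeping a &self getter, assuming interior mutability via a Mutex around the next reset time (illustrative names and value type, not the actual fuel-core code):

// Sketch only: illustrative types, assuming a DashMap-backed cache as in this PR.
use std::sync::Mutex;
use std::time::{Duration, Instant};

use dashmap::DashMap;

struct CachedView {
    sealed_block_headers: DashMap<u32, Vec<u8>>, // placeholder value type
    cache_reset_interval: Duration,
    next_cache_reset_time: Mutex<Instant>,
}

impl CachedView {
    // Called from insert/get; wipes the cache once the interval has elapsed.
    fn maybe_reset(&self) {
        let mut next = self.next_cache_reset_time.lock().expect("lock poisoned");
        if Instant::now() >= *next {
            self.sealed_block_headers.clear();
            *next = Instant::now() + self.cache_reset_interval;
        }
    }

    fn get(&self, height: u32) -> Option<Vec<u8>> {
        self.maybe_reset();
        self.sealed_block_headers.get(&height).map(|v| v.value().clone())
    }
}

fn main() {
    let view = CachedView {
        sealed_block_headers: DashMap::new(),
        cache_reset_interval: Duration::from_secs(10),
        next_cache_reset_time: Mutex::new(Instant::now() + Duration::from_secs(10)),
    };
    view.sealed_block_headers.insert(1, vec![0u8]);
    assert!(view.get(1).is_some());
}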
for height in range.clone() {
    if let Some(item) = cache.get(&height) {
        items.push(item.clone());
It can be a follow-up PR, but it would be nice if we avoided the heavy clone here and used Arc instead.
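A small sketch of the Arc idea (placeholder types, not the real SealedBlockHeader or cache API): values are stored as Arc<T>, so serving a cached range only bumps reference counts instead of deep-copying each header.

// Sketch only: placeholder types; the real cache stores SealedBlockHeader/Transactions.
use std::ops::Range;
use std::sync::Arc;

use dashmap::DashMap;

#[derive(Debug)]
struct SealedBlockHeader(u32);

struct CachedView {
    sealed_block_headers: DashMap<u32, Arc<SealedBlockHeader>>,
}

impl CachedView {
    fn headers_in_range(&self, range: Range<u32>) -> Option<Vec<Arc<SealedBlockHeader>>> {
        let mut items = Vec::new();
        for height in range {
            // A cache miss for any height means the whole range falls back to the DB.
            let entry = self.sealed_block_headers.get(&height)?;
            // Arc::clone only bumps a reference count; the header itself is not copied.
            items.push(Arc::clone(entry.value()));
        }
        Some(items)
    }
}

fn main() {
    let cache = CachedView { sealed_block_headers: DashMap::new() };
    cache.sealed_block_headers.insert(0, Arc::new(SealedBlockHeader(0)));
    assert!(cache.headers_in_range(0..1).is_some());
    assert!(cache.headers_in_range(0..2).is_none());
}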
added comment here - d897cba
associated issue: #2436
let block_height_range = 0..100;
let sealed_headers = default_sealed_headers(block_height_range.clone());
let result = cached_view
    .get_sealed_headers(&db, block_height_range.clone())
I would expect the cache to be linked to the DB at the time it is created, rather than having to specify the DB when invoking get_sealed_headers or get_transactions. Just curious to know the reason behind this choice?
You will notice that the view of the current tip of the DB (LatestView) is passed into the CachedView when making the calls.
LGTM. I have a side question: should the cache be cleared in the case of a DB rollback, to avoid inconsistencies?
That's a good question! I wonder if we have a hook from the DB to be notified when it gets rolled back.
That's a good question! I wonder if we have a hook from the DB to be notified when it gets rolled back.
We can only do a rollback before the services have started. We don't need to handle the case where we roll back while the node is running.
I think using a concurrent LRU cache would solve all race conditions between updating the cache and actually using it. Plus, it would remove the need for a separate timer.
Could we try to look into it in this PR?
Most of the logic of the PR remains the same; it would just remove the "cleanup" logic.
crates/services/p2p/src/service.rs
Outdated
@@ -444,7 +450,7 @@ impl<V, T> UninitializedTask<V, SharedState, T> {
     }
 }
 
-impl<P: TaskP2PService, V, B: Broadcast, T> Task<P, V, B, T> {
+impl<P: TaskP2PService, V: AtomicView, B: Broadcast, T> Task<P, V, B, T> {
[nit]: Would be better to move all constraints to a where clause.
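A tiny illustration of the suggested style, with placeholder traits and types standing in for the real fuel-core ones:

// Sketch only: placeholder traits and types, not the real fuel-core definitions.
use std::marker::PhantomData;

trait TaskP2PService {}
trait AtomicView {}
trait Broadcast {}

struct Task<P, V, B, T> {
    p2p_service: P,
    view_provider: V,
    broadcast: B,
    _marker: PhantomData<T>,
}

// Instead of inline bounds like
// `impl<P: TaskP2PService, V: AtomicView, B: Broadcast, T> Task<P, V, B, T>`,
// the constraints move to a `where` clause:
impl<P, V, B, T> Task<P, V, B, T>
where
    P: TaskP2PService,
    V: AtomicView,
    B: Broadcast,
{
    fn bounds_style(&self) -> &'static str {
        "constraints in a where clause"
    }
}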
addressed in b677c01
Added in 496071f. There is still the explicit clone that is done before sending the value as a response, though.
self.cache.insert(key.clone(), Arc::new(value));

// Update the access order.
order.retain(|k| k != &key);
I meant using an already existing LRU implementation from crates.io. The current implementation is too slow, and the operation here is expensive. Plus, using a Mutex to manage order destroys all the benefits of using DashMap.
Without using DashMap, we will need a mutable reference to the underlying LruCache (from https://crates.io/crates/lru, for example), and for every p2p request that comes through, we will need to pass a mutable reference to the CachedView.
fuel-core/crates/services/p2p/src/service.rs
Line 581 in 496071f
let cached_view = self.cached_view.clone();
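A sketch of the constraint being described, assuming the lru crate and an illustrative value type: LruCache::get takes &mut self because a lookup updates the recency order, so a shared CachedView would have to guard it with a Mutex.

// Sketch only: illustrative value type, assuming the `lru` crate from crates.io.
use std::num::NonZeroUsize;
use std::sync::{Arc, Mutex};

use lru::LruCache;

struct CachedView {
    // `LruCache::get` needs `&mut self`, so sharing the view forces a Mutex here.
    headers: Mutex<LruCache<u32, String>>,
}

fn main() {
    let view = Arc::new(CachedView {
        headers: Mutex::new(LruCache::new(NonZeroUsize::new(128).expect("non-zero capacity"))),
    });

    // Every read and write serializes on the Mutex, because even a lookup
    // mutates the recency order.
    view.headers.lock().expect("lock poisoned").put(1, "header".to_string());
    let hit = view.headers.lock().expect("lock poisoned").get(&1).cloned();
    assert_eq!(hit.as_deref(), Some("header"));
}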
If you want a lock-free LRU, then we can introduce a nonce for each of the elements being inserted, and to evict them we can sort, iterate, and remove. This would increase time spent in eviction, though.
We could use something like https://docs.rs/crossbeam-skiplist/latest/crossbeam_skiplist/ to order the insertions of elements.
I don't want a lock-free LRU. DashMap has locks inside, and I'm trying to say that using a Mutex at the upper level removes the benefits of DashMap, which uses an RW lock inside.
We don't need to re-invent an LRU implementation; we can just reuse an already existing implementation that works with &self, without us explicitly using a mutex/rwlock.
Replaced DashMap with quick_cache, which supports LRU-style eviction (Clock-PRO) - 1bdcf00
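A hedged sketch of what a quick_cache-backed view could look like (illustrative names and value types, not the exact code in 1bdcf00): the cache is capacity-bounded, eviction is handled internally, and both insert and get work through &self.

// Sketch only: placeholder header type, assuming the `quick_cache` crate.
use std::sync::Arc;

use quick_cache::sync::Cache;

#[derive(Clone, Debug)]
struct SealedBlockHeader(u32);

struct CachedView {
    // Bounded by item capacity; eviction is handled by the cache itself,
    // so no periodic "wipe everything" timer is required.
    sealed_block_headers: Cache<u32, Arc<SealedBlockHeader>>,
}

impl CachedView {
    fn new(capacity: usize) -> Self {
        Self {
            sealed_block_headers: Cache::new(capacity),
        }
    }

    // Both insert and get take `&self`, so no external Mutex is needed.
    fn insert_header(&self, height: u32, header: SealedBlockHeader) {
        self.sealed_block_headers.insert(height, Arc::new(header));
    }

    fn get_header(&self, height: u32) -> Option<Arc<SealedBlockHeader>> {
        self.sealed_block_headers.get(&height)
    }
}

fn main() {
    let cache = CachedView::new(1024);
    cache.insert_header(7, SealedBlockHeader(7));
    assert!(cache.get_header(7).is_some());
}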
Nice stuff!
Linked Issues/PRs
Description
When we request transactions for a given block range, we shouldn't keep using the same peer and putting pressure on it; we should pick a random peer with the same height and try to get the transactions from it instead. This PR caches p2p responses (TTL of 10 seconds by default) and serves requests from the cache, falling back to the DB for everything else.
Checklist
Before requesting review
After merging, notify other teams