
fix(p2p): cache responses to serve without roundtrip to db #2352

Merged — 36 commits merged into master from fix/p2p-round-robin on Nov 19, 2024

Conversation

@rymnc (Member) commented Oct 14, 2024

Linked Issues/PRs

  • the intermittent outages on testnet

Description

When we request transactions for a given block range, we shouldn't keep using the same peer and putting pressure on it. We should pick a random peer with the same height and request the transactions from it instead.

This PR caches p2p responses (with a TTL of 10 seconds by default) and serves requests from the cache, falling back to the database on a miss.
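For context, here is a minimal sketch of the read-through pattern described above: serve each height from the cache when present, otherwise fall back to the database and populate the cache. Types and signatures are illustrative, not fuel-core's actual API.

```rust
use std::ops::Range;

use dashmap::DashMap;

// Illustrative stand-ins for the real fuel-core types.
type BlockHeight = u32;
#[derive(Clone)]
pub struct SealedBlockHeader;

pub struct CachedView {
    sealed_block_headers: DashMap<BlockHeight, SealedBlockHeader>,
}

impl CachedView {
    // Serve from the cache when possible; on a miss, read from the db
    // (modeled here as a closure) and cache the result for later requests.
    pub fn get_sealed_headers(
        &self,
        db: impl Fn(BlockHeight) -> Option<SealedBlockHeader>,
        range: Range<BlockHeight>,
    ) -> Vec<SealedBlockHeader> {
        let mut items = Vec::new();
        for height in range {
            if let Some(item) = self.sealed_block_headers.get(&height) {
                items.push(item.value().clone()); // cache hit
            } else if let Some(item) = db(height) {
                self.sealed_block_headers.insert(height, item.clone());
                items.push(item); // miss: served from db, now cached
            }
        }
        items
    }
}
```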

Checklist

  • Breaking changes are clearly marked as such in the PR description and changelog
  • New behavior is reflected in tests
  • The specification matches the implemented behavior (link update PR if changes are needed)

Before requesting review

  • I have reviewed the code myself
  • I have created follow-up issues caused by this PR and linked them here

After merging, notify other teams

[Add or remove entries as needed]

@rymnc rymnc requested a review from a team October 14, 2024 15:47
@rymnc rymnc changed the title fix(p2p): get transactions from a random peer with the same height fix(p2p): cache responses to serve without roundtrip to db Oct 14, 2024
@rymnc rymnc marked this pull request as draft October 14, 2024 21:37
@rymnc rymnc self-assigned this Oct 14, 2024
@rymnc rymnc added the fuel-p2p label Oct 14, 2024
Comment on lines 507 to 514
```rust
impl CachedView {
    fn new(metrics: bool) -> Self {
        Self {
            sealed_block_headers: DashMap::new(),
            transactions_on_blocks: DashMap::new(),
            metrics,
        }
    }
```
@rymnc (Member Author):

we probably want to also support sub-ranges or even partial ranges here, but that can come in the future :)

@netrome (Contributor) left a comment:

Thanks for implementing this. I hate to be annoying here, but to approve this I need to:

  1. See the Changelog updated.
  2. Understand the reasoning behind the current caching strategy and the benefits/drawbacks over an LRU cache.
  3. Be certain that we don't open the door to OOM attacks by allowing our cache to be overloaded.

Let me know your thoughts on 2 and 3. I'm happy to jump on a call to discuss this and figure out a good path forward.

CHANGELOG.md — review thread resolved
crates/services/p2p/src/service.rs — review thread resolved
Comment on lines 14 to 18
```rust
pub struct CachedView {
    sealed_block_headers: DashMap<Range<u32>, Vec<SealedBlockHeader>>,
    transactions_on_blocks: DashMap<Range<u32>, Vec<Transactions>>,
    metrics: bool,
}
```
Contributor:

I'm a bit hesitant about the current approach of storing everything and clearing on a regular interval. Right now, there is no memory limit on the cache, and we use ranges as keys. So if someone queries the ranges (1..=4, 1..=2, 3..=4), we'd store all blocks in the 1..=4 range twice, and this could theoretically grow quadratically for larger ranges.

I would assume that the most popular queries at a given time are quite similar. Why not use a normal LRU cache with a fixed memory size? Alternatively, just maintain a cache over the last $N$ block headers and their transactions, evicting old ones as new ones get populated?
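For reference, a minimal example of the fixed-capacity alternative suggested here, using the lru crate from crates.io. Note that LruCache::get takes &mut self (it updates recency), which is what the later discussion about mutable references comes back to.

```rust
use std::num::NonZeroUsize;

use lru::LruCache;

fn main() {
    // Capacity of 2 for the demo; a real cache over the last N headers
    // would size this to N.
    let mut cache: LruCache<u32, &str> = LruCache::new(NonZeroUsize::new(2).unwrap());
    cache.put(1, "header 1");
    cache.put(2, "header 2");
    cache.put(3, "header 3"); // evicts key 1, the least recently used
    assert!(cache.get(&1).is_none());
    assert_eq!(cache.get(&3), Some(&"header 3"));
}
```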

@rymnc (Member Author):

yup, it's still WIP.

Contributor:

Ah right, I see this PR is still a draft :)

@rymnc (Member Author):

we now use block height as the key in 6422210

we will retain the time-based eviction strategy for now

@rymnc (Member Author) commented Oct 16, 2024

synced a testnet node and had 2 local nodes sync from it at the same time:
[screenshot: sync metrics]

@rymnc rymnc requested a review from netrome October 29, 2024 09:39
@rymnc rymnc linked an issue Oct 31, 2024 that may be closed by this pull request
```rust
pub(super) fn clear(&self) {
```
Collaborator:

Maybe we could use an LRU cache instead of wiping the whole cache every 10 seconds? This is just a thought, since the current approach should also work.

@rymnc (Member Author):

Yeah, in the future we can use an LRU :) we discussed it somewhere above too

Comment on lines 419 to 420
```rust
    cache_reset_interval: Duration,
    next_cache_reset_time: Instant,
```
Collaborator:

Looks like this could be internal logic of the CachedView, and we could clean it up on each insert/get.

@rymnc (Member Author):

It could be, but I didn't want get_from_cache_or_db to require a mutable reference to Self, because it's just a getter. No strong opinion here, so if you want it that way, I can move it around.
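A sketch of how the suggestion could look while keeping get as &self: hold the reset deadline inside the CachedView behind a Mutex, so expiry happens on access. This is illustrative, not the code the PR landed.

```rust
use std::sync::Mutex;
use std::time::{Duration, Instant};

use dashmap::DashMap;

struct CachedView {
    entries: DashMap<u32, Vec<u8>>,
    cache_reset_interval: Duration,
    // Interior mutability: the deadline can be updated from &self methods.
    next_cache_reset_time: Mutex<Instant>,
}

impl CachedView {
    fn maybe_reset(&self) {
        let mut next = self.next_cache_reset_time.lock().expect("not poisoned");
        if Instant::now() >= *next {
            self.entries.clear();
            *next = Instant::now() + self.cache_reset_interval;
        }
    }

    fn get(&self, height: u32) -> Option<Vec<u8>> {
        self.maybe_reset(); // expire on access instead of via an external timer
        self.entries.get(&height).map(|e| e.value().clone())
    }
}
```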


```rust
for height in range.clone() {
    if let Some(item) = cache.get(&height) {
        items.push(item.clone());
```
Collaborator:

It can be a follow-up PR, but it would be nice if we avoided the heavy clone here and used Arc instead.
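A hedged sketch of the Arc suggestion (types illustrative): with Arc'd values, a cache hit clones a pointer instead of deep-copying the block's transactions.

```rust
use std::sync::Arc;

use dashmap::DashMap;

pub struct Transactions; // stand-in for the real type

pub struct Cache {
    transactions_on_blocks: DashMap<u32, Arc<Vec<Transactions>>>,
}

impl Cache {
    pub fn get(&self, height: u32) -> Option<Arc<Vec<Transactions>>> {
        self.transactions_on_blocks
            .get(&height)
            .map(|entry| Arc::clone(entry.value())) // cheap refcount bump, no deep copy
    }
}
```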

@rymnc (Member Author):

added comment here - d897cba

@rymnc (Member Author):

associated issue: #2436

```rust
let block_height_range = 0..100;
let sealed_headers = default_sealed_headers(block_height_range.clone());
let result = cached_view
    .get_sealed_headers(&db, block_height_range.clone())
```
Contributor:

I would expect the cache to be linked to the DB at the time it is created, rather than having to specify the DB when invoking get_sealed_headers or get_transactions. I'm just curious: what's the reason behind this choice?

@rymnc (Member Author):

You will notice that a view of the current tip of the db (LatestView) is passed into the CachedView while making calls.
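To illustrate the design being described (names illustrative): the CachedView holds no database handle of its own; each call receives a fresh view of the current tip, so the cache never reads through a stale handle.

```rust
use std::ops::Range;

// Stand-in for the database view trait; the caller obtains the latest
// view and passes it per request.
pub trait P2pDbView {
    fn sealed_header(&self, height: u32) -> Option<Vec<u8>>;
}

pub struct CachedView; // cache fields elided for brevity

impl CachedView {
    pub fn get_sealed_headers<V: P2pDbView>(
        &self,
        view: &V, // the latest view, supplied at call time
        heights: Range<u32>,
    ) -> Vec<Vec<u8>> {
        heights.filter_map(|h| view.sealed_header(h)).collect()
    }
}
```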

@acerone85 (Contributor):

LGTM. I have a side question of whether the cache should be cleared in case of a DB rollback, to avoid inconsistencies?

@rymnc (Member Author) commented Nov 14, 2024

> LGTM. I have a side question of whether the cache should be cleared in case of a DB rollback, to avoid inconsistencies?

That's a good question! I wonder if we have a hook from the db to be notified when it gets rolled back.

@rymnc rymnc requested a review from acerone85 November 14, 2024 10:34
@xgreenx (Collaborator) left a comment:

> That's a good question! I wonder if we have a hook from the db to be notified when it gets rolled back.

We can only do a rollback before the services have started. We don't need to handle a rollback while the node is running.

@xgreenx (Collaborator) left a comment:

I think using a concurrent LRU cache would solve all the race conditions between updating the cache and actually using it. Plus, it would remove the need for a separate timer.

Could we try to look into it in this PR?

Most of the PR's logic remains the same; it would just remove the "cleanup" logic.

```diff
@@ -444,7 +450,7 @@ impl<V, T> UninitializedTask<V, SharedState, T> {
     }
 }
 
-impl<P: TaskP2PService, V, B: Broadcast, T> Task<P, V, B, T> {
+impl<P: TaskP2PService, V: AtomicView, B: Broadcast, T> Task<P, V, B, T> {
```
Collaborator:

[nit]: It would be better to move all constraints to a where clause.
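What the nit asks for, sketched with stub definitions so the snippet stands alone (the real traits live in fuel-core):

```rust
// Stubs so this compiles standalone.
trait TaskP2PService {}
trait AtomicView {}
trait Broadcast {}
struct Task<P, V, B, T>(std::marker::PhantomData<(P, V, B, T)>);

// Bounds moved from the impl header into a `where` clause.
impl<P, V, B, T> Task<P, V, B, T>
where
    P: TaskP2PService,
    V: AtomicView,
    B: Broadcast,
{
    // methods elided
}
```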

@rymnc (Member Author):

addressed in b677c01

@rymnc (Member Author) commented Nov 18, 2024

> I think using a concurrent LRU cache would solve all the race conditions between updating the cache and actually using it. Plus, it would remove the need for a separate timer.

added in 496071f

There is still the explicit clone done before sending the value as a response, though:

```rust
self.cache.insert(key.clone(), Arc::new(value));

// Update the access order.
order.retain(|k| k != &key);
```
Collaborator:

I meant to use an already existing LRU implementation from crates.io. The current implementation is too slow, and the operation here is expensive. Plus, using a Mutex to manage the order destroys all the benefits of using DashMap.

@rymnc (Member Author):

Without using DashMap, we would need a mutable reference to the underlying LruCache (from https://crates.io/crates/lru, for example), and for every p2p request that comes through we would need to pass a mutable reference to the CachedView:

```rust
let cached_view = self.cached_view.clone();
```

which isn't possible without a mutex (which I assume is what you want to avoid). This is why I implemented the LRU cache using a DashMap.

@rymnc (Member Author):

If you want a lock-free LRU, then we can introduce a nonce on each of the elements being inserted, and to evict them we can sort, iterate, and remove. This would increase the time spent in eviction, though.
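A sketch of that nonce idea, assuming DashMap (names illustrative): every insert takes a ticket from a shared counter, and eviction sorts entries by ticket and removes the oldest, which is where the extra O(n log n) eviction cost comes from.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

use dashmap::DashMap;

pub struct NonceCache<V> {
    counter: AtomicU64,
    map: DashMap<u32, (u64, V)>,
}

impl<V> NonceCache<V> {
    pub fn insert(&self, key: u32, value: V) {
        // Tag each entry with a monotonically increasing insertion nonce.
        let nonce = self.counter.fetch_add(1, Ordering::Relaxed);
        self.map.insert(key, (nonce, value));
    }

    pub fn evict_oldest(&self, n: usize) {
        // Snapshot (nonce, key) pairs, sort by nonce, drop the n oldest.
        let mut entries: Vec<(u64, u32)> =
            self.map.iter().map(|e| (e.value().0, *e.key())).collect();
        entries.sort_unstable();
        for (_, key) in entries.into_iter().take(n) {
            self.map.remove(&key);
        }
    }
}
```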

@rymnc (Member Author):

we could use something like https://docs.rs/crossbeam-skiplist/latest/crossbeam_skiplist/ to order the insertions of elements

Collaborator:

I don't want a lock-free LRU. DashMap has locks inside, and I'm trying to say that using a Mutex at the upper level removes the benefits of the RW locks DashMap uses internally.

We don't need to re-invent the LRU implementation; we can just reuse an already existing one that works with &self, without requiring explicit use of a mutex/rwlock on our side.

@rymnc (Member Author) — Nov 18, 2024:

replaced dashmap with quick_cache, which supports LRU-style eviction (CLOCK-Pro) - 1bdcf00
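For reference, a minimal example of the quick_cache API that makes this work (key/value types illustrative): its sync::Cache methods take &self, so the cache can be shared through an Arc with no explicit Mutex, and eviction is handled internally.

```rust
use std::sync::Arc;

use quick_cache::sync::Cache;

fn main() {
    // Bounded concurrent cache; eviction policy is quick_cache's
    // CLOCK-Pro-style approximation of LRU.
    let cache: Arc<Cache<u32, String>> = Arc::new(Cache::new(10_000));

    let writer = {
        let cache = Arc::clone(&cache);
        std::thread::spawn(move || {
            cache.insert(1, "sealed header bytes".to_string()); // &self, no explicit lock
        })
    };
    writer.join().unwrap();

    assert_eq!(cache.get(&1), Some("sealed header bytes".to_string()));
}
```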

@rymnc rymnc requested a review from xgreenx November 18, 2024 13:28
xgreenx previously approved these changes Nov 18, 2024
crates/services/p2p/src/service.rs — review thread resolved
xgreenx previously approved these changes Nov 18, 2024
@netrome (Contributor) left a comment:

Nice stuff!

@rymnc rymnc merged commit be4e33c into master Nov 19, 2024
31 checks passed
@rymnc rymnc deleted the fix/p2p-round-robin branch November 19, 2024 08:25
Successfully merging this pull request may close these issues:

  • P2P is doing a lot of database lookups
5 participants