
feat(networking): add backoff period after failed dial #1462

Merged: 7 commits merged into master from the add-exponential-backoff branch on Jan 23, 2023

Conversation

@alrevuelta (Contributor) commented Dec 13, 2022

Closes #1414

Summary:
After failing to dial a peer, a backoff is added so that we don't attempt to dial the same peer again for some time. The wait time depends on the number of consecutive failures and is calculated with the following formula:

initialBackoffInSec*(backoffFactor^(failedAttempts-1))

Note that initialBackoffInSec and backoffFactor are configurable values that control how aggressive the backoffs are. Using initialBackoffInSec=120 and backoffFactor=4, the wait times would be:

120s, 480s, 1920s, 7680s

This PR helps nodes rapidly increase their number of connections, since less time is wasted trying to connect to nodes that fail. The improvement is even more noticeable in networks with a high ratio of unreachable peers. Note that this backoff only applies to relay peers.
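
For illustration, the formula can be read as the following hypothetical helper (a sketch, not the PR's actual code; the names follow the description above):

```nim
import std/math

# Sketch of the backoff formula described above (not the PR's actual code).
proc backoffSeconds(initialBackoffInSec, backoffFactor, failedAttempts: int): int =
  if failedAttempts <= 0:
    return 0
  initialBackoffInSec * (backoffFactor ^ (failedAttempts - 1))

when isMainModule:
  # With initialBackoffInSec=120 and backoffFactor=4 this prints 120, 480, 1920, 7680.
  for attempts in 1 .. 4:
    echo backoffSeconds(120, 4, attempts)
```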

Changes:

  • Add exponential backoff after a failed dial.
  • Some minor test refactoring.
  • Some minor logging improvements.
  • Directly use switch.dial in ping (keep alive)

@alrevuelta changed the title from "feat(p2p): add backoff period after failed dial" to "feat(networking): add backoff period after failed dial" on Dec 13, 2022
@status-im-auto (Collaborator) commented Dec 13, 2022

Jenkins Builds

Commit #️⃣ Finished (UTC) Duration Platform Result
✔️ 41f72ac #1 2022-12-13 23:00:26 ~13 min macos 📦bin
✔️ c2058e0 #2 2022-12-14 23:00:49 ~13 min macos 📦bin
✔️ 8af035c #3 2022-12-15 23:17:09 ~29 min macos 📦bin
7c5aa5a #4 2023-01-04 23:03:53 ~16 min macos 📄log
b28405a #5 2023-01-13 22:49:32 ~2 min macos 📄log
99b3327 #6 2023-01-16 22:49:28 ~2 min macos 📄log
✔️ faac57b #7 2023-01-18 23:02:16 ~14 min macos 📦bin
✔️ 7cf6b15 #8 2023-01-19 23:01:45 ~14 min macos 📦bin
✔️ 75c67bf #9 2023-01-20 23:01:21 ~13 min macos 📦bin

@alrevuelta alrevuelta marked this pull request as ready for review December 15, 2022 09:36
@jm-clius (Contributor) left a comment

Although I agree with the idea of exponential backoffs, I'm not sure I agree that this should be part of (every) dialPeer. In my mind, if an "application" (or just a protocol) calls dialPeer, it should attempt to connect to that peer immediately without considering delays/backoffs (except maybe a very recent failure). The protocol/application may "know better" than the peer manager about when a peer has become available again - especially when we're in the 14h backoff period. If a protocol attempts to continuously dial an unreachable peer, there is a problem with that protocol that should be addressed.

The only protocol that currently requires continuous connection attempts, however, is Relay. In my mind then there should be a connectivity loop that continuously attempts to connect to some peers from the peer store and respects the backoff period for these peers within this connectivity loop.

In this approach, dialPeer will be available to protocols that want to make an ad-hoc connection to a specific peer and should remain a simple way to attempt to dial the peer (there may be scope for some sanity checks here, but I don't think a protocol should attempt to dial a peer and then wait ~14 hours for the dial to be attempted).

@jm-clius (Contributor)

Of course, similar connectivity guarantees could be useful for service protocols in future too. I just think these connectivity attempts and backoffs should be done in parallel and not serially/ad-hoc whenever attempting to dial.

@alrevuelta (Contributor, Author)

Thanks @jm-clius, fair comments. How about only respecting the backoff for the relay protocol, and having a direct path for the service protocols? Also related to "service protocol slots" #1461.

Some comments:

The protocol/application may "know better" than the peer manager about when a peer has become available again

mm not sure I follow. Can you provide an example? Not sure how the application layer can have knowledge of this.

The only protocol that currently requires continuous connection attempts, however, is Relay. In my mind then there should be a connectivity loop that continuously attempts to connect to some peers from the peer store and respects the backoff period for these peers within this connectivity loop

Nice, as stated I will implement that, and have a direct path for service protocols.

it should attempt to connect to that peer immediately without considering delays/backoffs (except maybe a very recent failure)

Sure. Linking this to the "service protocol slots" issue #1461. I can implement n retries for "slotted" peers. If it's fine, I'll leave that for the PR fixing that issue.

@alrevuelta (Contributor, Author) commented Dec 16, 2022

Will leave this PR on hold since, with the existing code, I can't differentiate between service (store, lp, ...) and relay peers, meaning that a store peer configured with setStorePeer can be overridden by any peer that also supports store.

Once I implement feature #1461 I will be able to know which peers are "slotted" (i.e. set as the preferred service/lp peer) and not apply the backoff to them.

Edit: I actually can differentiate with proto, e.g. proto=StoreCodec, but that might be a different peer than the one provided as store-peer.

@alrevuelta (Contributor, Author)

@jm-clius Fixed your comment by adding a flag to make respecting the backoff optional. Let me know what you think :)

This feature is currently unused and will be used in #1477 in the connectivity loop that you also refer to, and only for relay peers.

@jm-clius (Contributor) left a comment

Thanks! I will review in more detail after the weekend, in case I've missed some intricacies. :)

Adding the respectBackoff flag is indeed better, but I'm still not convinced that this logic should be part of the dialer.

As I see it, we have two distinct needs here:

  1. A dialer interface that must allow any protocol/application to attempt to dial peers, maintain connectivity related peer books, etc.
  2. At least one "application" of the peer manager/store (the "connect loop") that wants to use this dialer to continuously attempt to connect to all relevant peers in the peer store. This application should only attempt connection to peers that are not being backed off from, remove peers that it has the authority to do (i.e. not static peers), etc.

Mixing logic from (2) into (1) seems to me to create some confusion. What if an application wants to respectBackoff? Currently it will simply receive a none(Connection) in return if the peer is being backed off from. In my mind it has no reason not to continue attempting to connect to this peer and doesn't gain any information to help it make better decisions in future. Furthermore, the dialer will now make decisions such as removing peers from the peer store after max failed attempts. It seems to me to be doing that because it "knows" that this is what the connect loop would expect of it, since the connect loop is selecting peers from the peer store for attempted connection. This has no bearing, however, on other applications/protocols that may be managing their own peers.
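
To make the split between (1) and (2) concrete, a purely hypothetical sketch (not code from this PR; proc names, signatures, the loop interval and imports are assumptions) could look like this:

```nim
# Hypothetical sketch: (2) the connect loop owns backoff decisions, while (1) dialPeer
# stays a plain dial. Names/signatures are assumed; imports (chronos, peer manager) omitted.
proc relayConnectivityLoop(pm: PeerManager) {.async.} =
  while true:
    for peer in pm.peerStore.getNotConnectedPeers():
      # the loop, not the dialer, decides whether this peer is still backing off
      if not pm.peerStore.canBeConnected(peer.peerId):
        continue
      discard await pm.dialPeer(peer)   # plain ad-hoc dial, no backoff logic inside
    await sleepAsync(30.seconds)        # assumed loop interval
```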

waku/v2/node/peer_manager/peer_manager.nim (outdated; resolved)
Comment on lines 88 to 92
var deadline = sleepAsync(dialTimeout)
let dialFut = pm.switch.dial(peerId, addrs, proto)

var reasonFailed = ""
try:
# Attempt to dial remote peer
if (await dialFut.withTimeout(DefaultDialTimeout)):
await dialFut or deadline
if dialFut.finished():
if not deadline.finished():
deadline.cancel()
Contributor:

Any reason not to use dialFut.withTimeout()?

Contributor (Author):

Just reverted my changes; using withTimeout again.

I thought there was a possible race condition and also wanted to cancel the dial if the timer timed out, but noticed it didn't make much sense.
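
For reference, a simplified sketch of the reverted withTimeout-based pattern (reusing the names from the snippet above; a fragment of dialPeer, not the exact PR code):

```nim
# Simplified sketch of the withTimeout pattern (not the exact PR code).
let dialFut = pm.switch.dial(peerId, addrs, proto)
var reasonFailed = ""
try:
  if await dialFut.withTimeout(DefaultDialTimeout):
    # the dial finished within the timeout; read() yields the connection
    # or re-raises the dial error, which is caught below
    return some(dialFut.read())
  reasonFailed = "timeout"
except CatchableError as exc:
  reasonFailed = exc.msg
```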

waku_peers_dials.inc(labelValues = [reasonFailed])

# If failed too many times, remove peer from peer store
if respectBackoff and pm.peerStore[NumberFailedConnBook][peerId] >= pm.maxFailedAttempts:
Contributor:

This seems a bit weird to me - the fact that me dialing some peer could result in it being removed from the peer store (and doing it again would result in it being added again, presumably?). Of course, we know that for the application of "continuously attempt to connect to all available relay peers" this would make sense, i.e. to eventually stop attempting to dial peers that continue to fail. But this highlights my concern that this logic should not be part of the dialer, even if behind a respectBackoff flag.

Contributor (Author):

fixed

@alrevuelta (Contributor, Author)

@jm-clius I agree with your needs 1. and 2. The main reason I added respectBackoff to dialPeer and mixed the logic is that it's easier to unit test. If I respect the backoff directly in the loop, that makes it more difficult to unit test.

Plan B is to have a separate function, but that involves duplicating lots of logic: updating metrics, dial + timeouts, etc.

Will convert the PR to draft and implement a vanilla connectivity loop in a separate PR. Then I will add the backoff to that connectivity loop, which should comply with your "not mixing logic" requirement.

@alrevuelta force-pushed the add-exponential-backoff branch 2 times, most recently from 83c78b5 to fc238bf, on January 13, 2023 08:42
@alrevuelta marked this pull request as draft on January 13, 2023 09:11
@alrevuelta changed the base branch from master to add-connectivity-loop on January 13, 2023 09:15
@alrevuelta force-pushed the add-exponential-backoff branch 4 times, most recently from 8bfedcb to 99b3327, on January 16, 2023 10:54
Base automatically changed from add-connectivity-loop to master on January 18, 2023 14:17
@alrevuelta marked this pull request as ready for review on January 20, 2023 08:14
@rymnc (Contributor) left a comment

LGTM, just a couple questions :)


trace "Discovered peers", count=discoveredPeers.get().len()
if discoveredPeersRes.isOk:
Contributor:

Suggested change:
- if discoveredPeersRes.isOk:
+ if discoveredPeersRes.isOk():

Nit, but I believe this is the style guide we're adopting.

Contributor:

Maybe we should handle the error if findRandomPeers fails, at least with an error log.

Contributor (Author):

sure thanks!

# If it errored we wait an exponential backoff from last connection
# the more failed attemps, the greater the backoff since last attempt
let now = Moment.init(getTime().toUnix, Second)
let lastFailed = peerStore[LastFailedConnBook][peerId]
Contributor:

Suggested change:
- let lastFailed = peerStore[LastFailedConnBook][peerId]
+ let lastFailed = peerStore.getLastFailedPeer(peerId)

or similar, which would allow us to change the underlying data structure if required in the future

Contributor (Author):

I agree with this, but I'm just trying to follow the nim-libp2p peerstore pattern of using custom books: https://github.com/status-im/nim-libp2p/blob/unstable/libp2p/peerstore.nim#L148

Doing this would require 6-7 new getter functions with just one line of code each, and I'm not sure I see the benefit right now.

But I'm totally open for suggestions.

Contributor:

Sounds good! Just something to keep in mind if we decide to add more complex functionality later.

Comment on lines +40 to +41
InitialBackoffInSec = 120
BackoffFactor = 4
Contributor:

Just wondering what people's views are on making this configurable by the operator. To me, it would allow for more aggressive dialing behaviour. Wdyt?

Contributor (Author):

Mm, these are some safe default values that I've tested. What do you mean by configurable? New CLI flags? Not sure if that would be too low level for an operator. Note, though, that they can be changed when creating the peer manager.

Contributor:

Yeah, I mean CLI flags :)

Contributor:

In general I'm in favour of using (hard-coded) defaults until it becomes clear that making them configurable is useful to an operator. These default values could be part of a BCP RFC, for example, so that other client implementations can follow suit and agreement can be reached on what the most reasonable default is.

@jm-clius (Contributor) left a comment

Thanks! Makes sense to me now that the dialPeer mechanism does not make any decisions on whether to dial a peer or not. Some minor comments below. My biggest concern is prioritising some mechanism to manage the number of peers kept in the store to avoid leaks. Would be good to monitor peer management behaviour closely once this is merged (and auto-deployed to wakuv2.test)



let numPeersToConnect = min(min(maxConnections - numConPeers, disconnectedPeers.len), MaxParalelDials)
var notConnectedPeers = pm.peerStore.getNotConnectedPeers().mapIt(RemotePeerInfo.init(it.peerId, it.addrs))
var withinBackoffPeers = notConnectedPeers.filterIt(pm.peerStore.canBeConnected(it.peerId,
Contributor:

Any reason this is a var?

Contributor (Author):

ouch, leftover. will fix.


let numPeersToConnect = min(min(maxConnections - numConPeers, disconnectedPeers.len), MaxParalelDials)
var notConnectedPeers = pm.peerStore.getNotConnectedPeers().mapIt(RemotePeerInfo.init(it.peerId, it.addrs))
var withinBackoffPeers = notConnectedPeers.filterIt(pm.peerStore.canBeConnected(it.peerId,
Contributor:

Extremely nitpicky: shouldn't these be something like outsideBackoffPeers? :D To me this sounds like these peers are still within the period of backing off.

Contributor (Author):

ah right! will fix.

try:
let conn = await node.switch.dial(peer.peerId, peer.addrs, PingCodec)
let pingDelay = await node.libp2pPing.ping(conn)
except CatchableError as exc:
Contributor:

Not sure I understand the move away from Result (or Option) based error handling here - is it because there are CatchableErrors possible here that we can't enumerate and deal with explicitly? Note that we prefer explicit error handling, see e.g. https://status-im.github.io/nim-style-guide/errors.exceptions.html

Contributor (Author):

The main change here is to use switch.dial instead of peerManager.dialPeer for the ping. The main reason is that dialPeer updates metrics on ok/nok connections, failed attempts, last time failed, can/cannot be connected, etc.

And since here we are just pinging this peer, I don't think using that function makes sense. For example, we would be updating the metrics with ok connections every time we ping a peer.

And actually this should be more like getConnection, because we are not dialing/connecting to any peer, but getting an already existing connection and sending a ping.

I agree with explicit error handling, but here I had to add the try/except because switch.dial doesn't return a Result.
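
As a sketch (based on the snippet above; the log messages and fields are illustrative, not the PR's exact code):

```nim
# Keep-alive ping through switch.dial; a try/except is needed because switch.dial
# raises on failure rather than returning a Result. Log fields are illustrative.
try:
  let conn = await node.switch.dial(peer.peerId, peer.addrs, PingCodec)
  let pingDelay = await node.libp2pPing.ping(conn)
  trace "Ping successful", peerId = peer.peerId, rtt = pingDelay
except CatchableError as exc:
  debug "Failed to ping peer", peerId = peer.peerId, error = exc.msg
```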


# If it errored we wait an exponential backoff from last connection
# the more failed attemps, the greater the backoff since last attempt
let now = Moment.init(getTime().toUnix, Second)
Contributor:

May be worth extracting this in future as another argument for canBeConnected, so that canBeConnected becomes an isolated utility-type function with predictable unit testing outputs and so that you only have to read the current system time once when checking the canBeConnected() status for multiple peers (as is the most common use case, I think)
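
For illustration, such a refactor might look roughly like this (hypothetical sketch, not this PR's code; the book names come from the PR's diff, the rest of the signature is assumed, and imports of chronos/std/math are omitted):

```nim
# Hypothetical refactor of canBeConnected: `now` is an argument, so the proc is a pure
# utility with predictable unit-test outputs and the clock is read once per batch.
proc canBeConnected*(peerStore: PeerStore,
                     peerId: PeerId,
                     now: Moment,
                     initialBackoffInSec: int,
                     backoffFactor: int): bool =
  let failedAttempts = peerStore[NumberFailedConnBook][peerId]
  if failedAttempts == 0:
    return true
  # the more failed attempts, the greater the backoff since the last failed attempt
  let backoff = chronos.seconds(initialBackoffInSec * (backoffFactor ^ (failedAttempts - 1)))
  let lastFailed = peerStore[LastFailedConnBook][peerId]
  now >= lastFailed + backoff
```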

return true
return false

proc delete*(peerStore: PeerStore,
Contributor:

Afaict, this is not yet used anywhere? Given the fact that in existing deployments peer IDs are cycled very often (I think), we should add a mechanism to manage the size of the peer store fairly urgently - unsure of the implication if this memory essentially leaks in the meantime.

Contributor (Author):

Yep, not used in this PR, but tracked here in "Prune peers from peerstore".
It shouldn't leak since it's limited by .withPeerStore(capacity=xxx), but yes, we have to handle it more gracefully.
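
A purely speculative sketch of what such pruning could look like (not part of this PR; peers(), the ConnectionBook and delete are assumed helpers/books):

```nim
# Speculative pruning sketch (not in this PR). Drops non-connected peers once the
# store exceeds `capacity`; peers(), ConnectionBook and delete are assumed to exist.
proc prunePeerStore(pm: PeerManager, capacity: int) =
  let stored = pm.peerStore.peers()
  if stored.len <= capacity:
    return
  var excess = stored.len - capacity
  for peer in stored:
    if excess <= 0:
      break
    # only drop peers we are not currently connected to
    if pm.peerStore[ConnectionBook][peer.peerId] != Connected:
      pm.peerStore.delete(peer.peerId)
      excess.dec()
```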

@LNSD (Contributor) left a comment

LGTM ✅

@alrevuelta merged commit 028efc8 into master on Jan 23, 2023
@alrevuelta deleted the add-exponential-backoff branch on January 23, 2023 20:24
Successfully merging this pull request may close these issues.

chore(networking): too many failed dials, improve strategy