deal with a temporary loss of network connectivity #1354
Conversation
Bootstrap nodes are special and deserve special treatment. We're now retrying failed dials forever, to be more resilient in the face of temporary bootstrap node downtimes at program start.

This means it no longer makes sense to die if we didn't connect to a bootstrap node in 30 seconds. It's not the user who should be redialing by restarting beacon_node, it's beacon_node itself that should do that.

TODO: when invalidating peers that we previously dialed, check if they are bootstrap nodes and, if so, add them back to Eth2Node.connQueue, to deal with a loss of connectivity on our side (ISP hiccup).
The rationale for handling the initial connection attempts in a special way is that the failure may result from misconfiguration. We don't want to make the failure too silent and difficult to diagnose in the intended real-world environment (e.g. running the beacon node as a system service), so I think we should log a warning or an error every 30 seconds or so until a connection is established.
beacon_chain/eth2_network.nim (Outdated)

```nim
network.addSeen(pi, SeenTableTimeDeadPeer)
if remotePeerInfo.peerId in bootstrapPeerIDs:
  # keep trying
  await network.connQueue.addLast(remotePeerInfo)
```
Currently, this add is redundant, as in discovery bootstrap nodes are also treated specially and are never removed from the routing table, so they would keep getting added in the discovery loop. The not adding them to the seen list that you do there is required, of course.
I see that bootstrap nodes are exempt from removal in `Protocol.replaceNode()`, but where are they re-added to `Eth2Node.connQueue`?
In the discoveryLoop they have a chance of being passed again through the `randomNodes` call. Chance, as it is a random selection, but if no bootstrap nodes were reachable (= no other nodes added to the routing table), it will always be those bootstrap nodes. And as you no longer do the `addSeen` for those bootstrap nodes, a new connection should be attempted, I believe.
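For illustration, a rough sketch of the dynamic described above, loosely modeled on the discovery loop in eth2_network.nim (names such as `toPeerInfo`, `isSeen` and `wantedPeers` are approximations here, not the exact API):

```nim
proc runDiscoveryLoop(node: Eth2Node) {.async.} =
  while true:
    # randomNodes samples the discv5 routing table; with an otherwise
    # empty table, only the never-evicted bootstrap nodes can come back.
    let discoveredNodes = node.discovery.randomNodes(wantedPeers)
    for discNode in discoveredNodes:
      let peerInfo = discNode.toPeerInfo()
      # Because failed bootstrap nodes are no longer addSeen-ed,
      # they pass this filter and get dialed again.
      if not node.isSeen(peerInfo):
        await node.connQueue.addLast(peerInfo)
    await sleepAsync(1.seconds)
```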
Here you need to check not only the `PeerID`, but also whether the `PeerInfo` you are sending to the connection worker has a TCP-based address inside. Otherwise the dial will fail immediately, but you will keep adding this item back to the connection queue. If the number of bootstrap nodes is greater than the number of connection workers, it will create an endless loop and the node will never attempt to connect to real nodes.
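A minimal sketch of the suggested check, assuming nim-libp2p's `TCP` multiaddress pattern (the `hasTcpAddress` helper name is hypothetical):

```nim
import libp2p/[multiaddress, peerinfo]

proc hasTcpAddress(info: PeerInfo): bool =
  # Only treat the peer as dialable if at least one of its
  # advertised multiaddresses is TCP-based.
  for ma in info.addrs:
    if TCP.match(ma):
      return true
  false
```

The re-enqueueing in the diff above would then be conditional on `remotePeerInfo.hasTcpAddress` as well as the bootstrap check.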
> Currently, this add is redundant

Removed.
> Here you need to check not only the `PeerID`, but also whether the `PeerInfo` you are sending to the connection worker has a TCP-based address inside. Otherwise the dial will fail immediately, but you will keep adding this item back to the connection queue. If the number of bootstrap nodes is greater than the number of connection workers, it will create an endless loop and the node will never attempt to connect to real nodes.

That filter should be done earlier, in the discovery loop (it already happens now on the `eth2` field).
Looks like a better way to handle the failing bootnodes (the most important fix here is the not adding to the seen list for bootstrap nodes, although perhaps a smaller
The re-adding to the connection queue is not necessary, see the comments I made. And I will probably change that somewhat, so that failing bootnodes are only added again once we drop out of peers (see status-im/nim-eth#280). This would then also not require the smaller
Could somebody explain to me why we are trying to connect via libp2p to bootstrap nodes, which can be discovery5-only nodes?
For example, a bootstrap node can be UDP-only. In such a case it will be impossible to even dial this node, because
From my point of view
@cheatfate, the information in the ENR tells you whether the node also accepts TCP connections (LibP2P). You don't have to have fixed rules - you can check the record and then decide what to do.
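A rough sketch of such a record check, assuming nim-eth's `enr.tryGet` helper (the field names follow the ENR spec, EIP-778; `advertisesTcp` is a hypothetical name):

```nim
import std/options
import eth/p2p/discoveryv5/enr

proc advertisesTcp(record: Record): bool =
  # A node that also accepts libp2p (TCP) connections advertises a
  # "tcp" (or "tcp6") port in its signed record.
  record.tryGet("tcp", uint16).isSome or
    record.tryGet("tcp6", uint16).isSome
```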
Currently, most bootstrap nodes are not discovery-only nodes. But I do agree that we should not just blindly try to connect to these, as some are indeed only bootstrap nodes. And actually, that is what we do. Currently,
@zah I know, but this PR does not check anything and just keeps sending these peers to the connection queue again and again: https://github.com/status-im/nim-beacon-chain/pull/1354/files#diff-9a4cd7edc16fa179b1a30af959c1ab28R750
So this is more of an attempt to keep temporary bootstrap (or local!) hiccups at start-up from causing a 10-minute delay in getting connected to that node. Is it necessary? No, the bootstrap node will still be used at the level of discovery and, if it works, it will help you discover other nodes. The case of endless dials to non-reachable bootnodes could also be resolved in discv5, see my previous comment and the issue linked.
There was already an issue made by @zah with a very good approach to fixing the race condition problems for
Implemented, @zah.
@cheatfate If I'm not mistaken, all points except for 2. are already the case. Yet there are still failures. Anyhow, if this is only about the local simulations, then yes, a better fix would be to fix ^
Bootstrap nodes should not be a crucial point of initial connection. So if the libp2p dial times out, there is no reason to connect to this node again and again. The more important thing here is to check whether discovery5 is healthy enough...
@kdeme you should agree that if discovery returns 3 nodes, then those 3 nodes are already online and have bound their server sockets... And if a connection dial attempt fails, it should be investigated...
We're adding failed bootstrap nodes to the back of a FIFO queue, so they don't prevent other nodes from being dialled.
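For reference, a minimal sketch of that FIFO behavior using a chronos `AsyncQueue` (the worker below is a simplification, not the exact connectWorker from eth2_network.nim, and the element type stands in for `PeerInfo`):

```nim
import chronos

var connQueue = newAsyncQueue[string]()

proc connectWorker() {.async.} =
  while true:
    # Workers consume from the front; failed bootstrap nodes rejoin at
    # the back via addLast, so they cannot starve other pending dials.
    let peer = await connQueue.popFirst()
    echo "dialing ", peer
```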
I don't think we're discovering potential peers any other way when starting with an empty state.
@stefantalpalaru
I know but, last time I asked, Discovery v5 had only one source of initial peers: bootstrap nodes.
@stefantalpalaru discovery5 works independently from
@stefantalpalaru these changes you are trying to apply should be done in
Well, that depends on the implementation (online != discovery online != libp2p online), but in our case I would assume that yes, as the libp2p switch (https://github.com/status-im/nim-beacon-chain/blob/devel/beacon_chain/eth2_network.nim#L845) is started (and awaited) before discovery.
If we can rely on DiscV5 to redial bootstrap nodes all the time, we can indeed get rid of this parallel redialling on the libp2p side. But what if the same bootstrap node we're never connecting to as an Eth2 peer, just because we couldn't connect the first time, is the best source for syncing blocks and getting attestation traffic? Why would we deprive our node of such a valuable peer, just because it was "seen"? How about we periodically empty
We can do much better filtering with discovery5. So, for example, it is absolutely possible that without a properly working internet connection
Random sampling prevents such a scenario, in which some nodes never get a chance to be dialled.
```nim
@@ -1130,17 +1135,19 @@ proc announcedENR*(node: Eth2Node): enr.Record =
proc shortForm*(id: KeyPair): string =
  $PeerID.init(id.pubkey)

let BOOTSTRAP_NODE_CHECK_INTERVAL = 30.seconds
```
This is a constant and should be placed at the beginning of the file, with a comment about where it is used.
The further it is from the place it's used, the harder it is to find. Jumping all over the file just to understand a few lines of code makes our job harder.
```nim
@@ -1130,17 +1135,19 @@ proc announcedENR*(node: Eth2Node): enr.Record =
proc shortForm*(id: KeyPair): string =
  $PeerID.init(id.pubkey)

let BOOTSTRAP_NODE_CHECK_INTERVAL = 30.seconds
proc checkIfConnectedToBootstrapNode(p: pointer) {.gcsafe.} =
```
This procedure looks very ugly, but it can look like this:

```nim
proc checkIfConnected(node: Eth2Node) {.async.} =
  while true:
    await sleepAsync(30.seconds)
    if node.discovery.bootstrapRecords.len > 0 and len(node.peerPool) == 0:
      warn "Failed to connect to any node",
           bootstrapEnrs = node.discovery.bootstrapRecords

traceAsyncErrors checkIfConnected(node)
```

But also, you should not use metric values for program logic; it is better to check the PeerPool for the number of connections currently available.
> you should not use metric values for program logic

Why not? Keeps the number of variables low.
There is already a place which performs tracking of libp2p peers.
> Why not? Keeps the number of variables low.

Metrics are a one-way data flow out of the application - they're an optional feature, and we should be able to compile the application with them completely disabled at compile time. Using them as regular variables makes the code harder to analyze, as they now start to serve multiple orthogonal concerns.
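A small sketch of the separation being argued for, assuming the nim-metrics `declareGauge` API (the names `connectedPeers` and `nbc_peers` here are illustrative):

```nim
import metrics

declareGauge nbc_peers, "Number of connected libp2p peers"

var connectedPeers = 0  # program logic reads this plain variable

proc onPeerConnected() =
  inc connectedPeers
  # The metric is only ever written to, never read back by program
  # logic, so the build still works with metrics compiled out.
  nbc_peers.set(connectedPeers.int64)
```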
Your version might look better to you, but it does the wrong thing by continuing to check for that condition after it became false. You fix that and it becomes uglier than my version. Furthermore, you dropped the part of its name that made clear what it actually does.

Also, do you remember what `traceAsyncErrors()` does without reading its code? I don't. Its name sounds like something that's currently broken in Chronos.
@stefantalpalaru rename your PR title please, because the PR content is absolutely different from the PR title. You should name it: `deal with temporary loss of network connectivity while node is starting`.
BTW, for what you want, "Failed to connect to any node" is the wrong message, because it implies historical data which you don't have. You need "Not connected to any peer right now. There may be a problem with your network connectivity."
I'm not the author of this PR; I have made a proposal. If you do not like the message, you can easily change it.
Exactly. I want that warning to only appear before successfully connecting to a peer. Detecting any other problem is a separate concern, better addressed in a separate procedure, in a separate PR.
Your PR could easily fix the issue for both cases - "network loss on node start" and "network loss while node is working". And this can be done by changing 2 LOCs... So why not introduce such changes in this PR?
> create a dedicated boolean for it then

Done.
> You should name it: `deal with temporary loss of network connectivity while node is starting`.

But reducing those two timeouts is useful all the time, not just at program start.
> traceAsyncErrors

`traceAsyncErrors` catches exceptions in an absolutely legal way: https://github.com/status-im/nim-eth/blob/master/eth/async_utils.nim#L11
The logic inside `catchOrQuit()` is wrong right now: https://github.com/status-im/nim-eth/blob/765883c454be726799f4e724b4dc2ca8fe25bc74/eth/async_utils.nim#L7

Defects don't end up in `Future.error`, but are raised directly, so that `else` branch is unreachable.
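For context, a paraphrase of the `catchOrQuit` logic under discussion (based on the linked async_utils.nim, not a verbatim copy; `trace`/`fatal` come from chronicles):

```nim
import chronicles

proc catchOrQuit(error: ref Exception) =
  if error of CatchableError:
    trace "Async operation ended with a recoverable error", err = error.msg
  else:
    # The branch called unreachable above: Chronos raises Defects
    # synchronously instead of storing them in Future.error, so a
    # failed future never carries a Defect to this point.
    fatal "Fatal exception reached", err = error.msg
    quit 1
```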
> Your PR could easily fix the issue for both cases - "network loss on node start" and "network loss while node is working". And this can be done by changing 2 LOCs...

It might get more complicated than that. Before telling the average user that his network might be down (or that there's a problem with the firewall), I should look at any other network traffic I might have available - stuff like Discovery v5 nodes or any UDP multicasting we might be doing and getting replies from.

I should also have more reliable ways of establishing whether the problem is with the Eth2 network itself (and I should probably ping some servers with high uptime before making a fool of myself by lying to the user).
The overall logic of this PR is already lost in never-ending disputes. The proposed boolean variable will be set to
And because my review comments are not going to be addressed, I'm not going to approve this PR.
Also, this PR title is incorrect: this PR does not deal with a temporary loss of network connectivity at all. It only warns the user, when the node is starting, that it is not connected to bootstrap nodes... So if bootstrap nodes become unavailable or will not accept TCP connections at all,