deal with a temporary loss of network connectivity #1354

Merged
7 commits merged from bootstrapdialing into devel, Jul 23, 2020

Conversation

stefantalpalaru
Contributor

Bootstrap nodes are special and deserve special treatment. We're now
retrying failed dials forever, to be more resilient in the face of
temporary bootstrap node downtimes at program start.

This means it no longer makes sense to die if we didn't connect to a
bootstrap node in 30 seconds. It's not the user who should be redialing
by restarting beacon_node; it's beacon_node itself that should do that.

TODO: when invalidating peers that we previously dialed, check if they
are bootstrap nodes and, if so, add them back to Eth2Node.connQueue, to
deal with a loss of connectivity on our side (ISP hiccup).

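For reference, a minimal sketch of the retry behaviour described above, pieced together from the queue calls quoted later in this thread; the worker structure and the dialPeer helper are assumptions, not the exact code in this PR:

  proc connectWorker(network: Eth2Node) {.async.} =
    while true:
      let remotePeerInfo = await network.connQueue.popFirst()
      let ok = await network.dialPeer(remotePeerInfo)  # assumed helper returning bool
      if not ok:
        if remotePeerInfo.peerId in bootstrapPeerIDs:
          # bootstrap nodes are never marked as "seen": re-queue them so a
          # temporary outage at program start does not lock us out forever
          await network.connQueue.addLast(remotePeerInfo)
        else:
          network.addSeen(remotePeerInfo, SeenTableTimeDeadPeer)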
@zah
Contributor

zah commented Jul 22, 2020

The rationale for handling the initial connection attempts in a special way is that the failure may result from misconfiguration. We don't want to make the failure too silent and difficult to diagnose in the intended real-world environment (e.g. running the beacon node as a system service), so I think we should log a warning or an error every 30 seconds or so until a connection is established.

network.addSeen(pi, SeenTableTimeDeadPeer)
if remotePeerInfo.peerId in bootstrapPeerIDs:
# keep trying
await network.connQueue.addLast(remotePeerInfo)
Contributor

Currently, this add is redundant, as in discovery bootstrap nodes are also treated specially and are never removed from the routing table, so they would keep getting added in the discovery loop. Not adding them to the seen list, as you do here, is of course required.

Contributor Author

I see that bootstrap nodes are exempt from removal in Protocol.replaceNode(), but where are they re-added to Eth2Node.connQueue?

Contributor

In the discoveryLoop they have a chance of being passed again through the randomNodes call. A chance, since it is a random selection, but if no bootstrap nodes were reachable (= no other nodes were added to the routing table), it will always be those bootstrap nodes.

And since you no longer do the addSeen for those bootstrap nodes, a new connection should be attempted, I believe.

Contributor

Here you need to check not only the PeerID, but also whether the PeerInfo you are sending to the connection worker has a TCP-based address inside. Otherwise it will fail immediately, but you will keep adding this item back to the connection queue. If the number of bootstrap nodes is greater than the number of connection workers, this will create an endless loop and the node will never attempt to connect to real nodes.
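A rough sketch of the guard being asked for here; hasTcpAddress is a hypothetical helper, not an existing nim-libp2p API, and is shown only to illustrate checking the PeerInfo's addresses before re-queueing:

  import std/strutils  # for the substring check below

  proc hasTcpAddress(info: PeerInfo): bool =
    # hypothetical helper: true if at least one advertised multiaddress
    # contains a TCP transport that libp2p could actually dial
    for ma in info.addrs:
      if "/tcp/" in $ma:
        return true
    false

  # inside the connection worker, before re-queueing a failed bootstrap node:
  if remotePeerInfo.peerId in bootstrapPeerIDs and remotePeerInfo.hasTcpAddress():
    await network.connQueue.addLast(remotePeerInfo)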

Contributor Author

Currently, this add is redundant

Removed.

Contributor

Here you need to check not only the PeerID, but also whether the PeerInfo you are sending to the connection worker has a TCP-based address inside. Otherwise it will fail immediately, but you will keep adding this item back to the connection queue. If the number of bootstrap nodes is greater than the number of connection workers, this will create an endless loop and the node will never attempt to connect to real nodes.

That filter should be done earlier, in the discovery loop (it already happens now for the eth2 field).

beacon_chain/eth2_network.nim (outdated review thread, resolved)
@kdeme
Contributor

kdeme commented Jul 22, 2020

Looks like a better way to handle the failing bootnodes (the most important fix here is not adding bootstrap nodes to the seen list, although perhaps a smaller SeenTableTimeTimeout and SeenTableTimeDeadPeer for bootnodes would be better there).
But @zah's comment is important; otherwise a user running in just INFO mode will not notice this, or at least not notice it immediately.

The re-adding to the connection queue is not necessary, see the comments I made. And I will probably change that somewhat, so that failing bootnodes are only added again once we run out of peers (see status-im/nim-eth#280). That would then also not require the smaller SeenTableTimeXXX values I mentioned.

@cheatfate
Contributor

Could somebody explain to me why we are trying to connect via libp2p to bootstrap nodes which may be discovery5-only nodes?

@cheatfate
Contributor

cheatfate commented Jul 22, 2020

For example, a bootstrap node may be UDP-only. In that case it will be impossible to even dial this node, because nim-libp2p does not support UDP endpoints. So with the proposed patch we will keep this node in the connection queue, and one of our connection workers will loop forever attempting to connect to it; if there are 10 UDP-only bootstrap nodes, we are going to fill our connection queue with useless attempts to dial UDP-only bootstrap nodes.

@cheatfate
Contributor

From my point of view, discovery5 should never return bootstrap nodes in randomNodes, and we should never attempt to connect to bootstrap nodes.

@zah
Contributor

zah commented Jul 22, 2020

@cheatfate, the information in the ENR tells you whether the node also accepts TCP connections (LibP2P). You don't have to have fixed rules - you can check the record and then decide what to do.

@kdeme
Contributor

kdeme commented Jul 22, 2020

Currently, most bootstrap nodes are not discovery-only nodes. But I do agree that we should not just blindly try to connect to them, as some are indeed only bootstrap nodes.

And actually, that is what we do. Currently, the randomNodes call will not pass on nodes which do not have the correct eth2 field, hence bootstrap nodes that only do discovery should be filtered out (unless they hardcode this field, but that would be wrong). I could/should probably add a check on the tcp field too, thanks for the reminder.
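A sketch of the kind of ENR filter mentioned above, assuming the typed tryGet accessor on enr.Record; the "eth2" and "tcp" field names come from this discussion, while the import path and exact value types are assumptions:

  import std/options
  # assumed import: eth/p2p/discoveryv5/enr from nim-eth

  proc looksDialable(record: enr.Record): bool =
    # only pass on nodes that advertise both an eth2 field and a TCP port;
    # discovery-only bootstrap nodes fail this check and are never dialed
    record.tryGet("eth2", seq[byte]).isSome and
      record.tryGet("tcp", uint16).isSome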

@cheatfate
Contributor

@zah I know, but this PR does not check anything and just keeps sending these peers to the connection queue again and again: https://github.com/status-im/nim-beacon-chain/pull/1354/files#diff-9a4cd7edc16fa179b1a30af959c1ab28R750

@kdeme
Contributor

kdeme commented Jul 22, 2020

So this is more of an attempt to keep temporary bootstrap (or local!) hiccups at start-up from causing a 10-minute delay before getting connected to that node. Is it necessary? No, the bootstrap node will still be used at the discovery level and, if it works, it will help you discover other nodes.

The case of endless dials to non-reachable bootnodes could also be resolved in discv5, see my previous comment and the linked issue.

@cheatfate
Contributor

There was already an issue made by @zah with a very good approach for fixing the race-condition problems in eth2_network_simulation:

  1. Adjust beacon_node to start libp2p networking before discovery5 networking.
  2. Start a discovery5-only bootstrap node.
  3. Start all the nodes, which will connect to the bootstrap node via discovery5 only.
  4. These nodes will form a mesh without any problems, because discovery5 will return in randomNodes only nodes that are currently available to connect to, so there will be no failing dials anymore.

@stefantalpalaru
Contributor Author

We don't want to make the failure too silent and difficult to diagnose in the intended real-world environment (e.g. running the beacon node as a system service), so I think we should log a warning or an error every 30 seconds or so until a connection is established.

Implemented, @zah.

@cheatfate self-requested a review on July 22, 2020 at 11:53
@kdeme
Contributor

kdeme commented Jul 22, 2020

@cheatfate If I'm not mistaken, all points except 2. are already the case. Yet there are still failures.
I think that even though discovery is started after libp2p networking (I switched those two lines a while back), there still seems to be an issue with dials.

Anyhow, if this is only about the local simulations, then yes, a better fix would be to fix the above.

@cheatfate
Contributor

Bootstrap nodes should not be a crucial point of the initial connection. So if a libp2p dial times out, there is no reason to connect to this node again and again. What is more important here is to check whether discovery5 is healthy enough...

@cheatfate
Contributor

@kdeme you should agree that if discovery returns 3 nodes, then those 3 nodes are already online and have bound their server sockets... And if a connection dial attempt fails, it should be investigated...

@stefantalpalaru
Contributor Author

If the number of bootstrap nodes is greater than the number of connection workers, this will create an endless loop and the node will never attempt to connect to real nodes.

We're adding failed bootstrap nodes to the back of a FIFO queue, so they don't prevent other nodes from being dialled.
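As a toy illustration of that FIFO property, using chronos' AsyncQueue directly (a standalone example, not the PR's code):

  import chronos

  proc demo() {.async.} =
    let q = newAsyncQueue[string]()
    await q.addLast("freshly-discovered-peer")
    await q.addLast("another-peer")
    await q.addLast("failed-bootstrap-node")  # re-added to the back after a failed dial
    echo (await q.popFirst())  # "freshly-discovered-peer": other nodes still get dialled first
    echo (await q.popFirst())  # "another-peer"
    echo (await q.popFirst())  # the failed bootstrap node only comes up again at the end

  waitFor demo()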

@stefantalpalaru
Contributor Author

Bootstrap nodes should not be a crucial point of the initial connection.

I don't think we're discovering potential peers any other way, when starting with an empty state.

@cheatfate
Contributor

@stefantalpalaru nim-libp2p does not discover peers.

@stefantalpalaru
Contributor Author

stefantalpalaru commented Jul 22, 2020

nim-libp2p does not discover peers.

I know, but last time I asked, Discovery v5 had only one source of initial peers: bootstrap nodes.

@cheatfate
Contributor

@stefantalpalaru discovery5 works independently of nim-libp2p; discovery5 does not require any libp2p connections to bootstrap nodes to be established.

@cheatfate
Contributor

@stefantalpalaru the changes you are trying to apply should be done in the discovery5 loop, e.g. if we are not getting new peers using discovery5, we should warn about it every N seconds/minutes. But there is no reason to do the same thing on the libp2p side.

@kdeme
Contributor

kdeme commented Jul 22, 2020

@kdeme you should agree that if discovery returns 3 nodes, then those 3 nodes are already online and have bound their server sockets... And if a connection dial attempt fails, it should be investigated...

Well, that depends on the implementation (online != discovery online != libp2p online), but in our case I would assume yes, as the libp2p switch (https://github.com/status-im/nim-beacon-chain/blob/devel/beacon_chain/eth2_network.nim#L845) is started (and awaited) before discovery.
I agree that this is the main issue that needs investigation.

@stefantalpalaru
Contributor Author

if we are not getting new peers using discovery5, we should warn about it every N seconds/minutes. But there is no reason to do the same thing on the libp2p side.

If we can rely on DiscV5 to redial bootstrap nodes all the time, we can indeed get rid of this parallel redialling on the libp2p side.

But what if the same bootstrap node we're never connecting to as an Eth2 peer, just because we couldn't connect the first time, is the best source for syncing blocks and getting attestation traffic? Why would we deprive our node of such a valuable peer, just because it was "seen"?

How about we periodically empty Eth2Node.seenTable?
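A sketch of that idea, assuming Eth2Node.seenTable is a table-like field with a clear() operation and picking an arbitrary interval; both are assumptions:

  import chronos, tables

  proc forgetSeenPeers(node: Eth2Node) {.async.} =
    while true:
      await sleepAsync(10.minutes)  # arbitrary interval, for illustration only
      # drop every previously failed peer so valuable nodes, including
      # bootstrap nodes, get another chance to be dialled
      node.seenTable.clear()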

@cheatfate
Contributor

cheatfate commented Jul 22, 2020

We can do much better filtering with discovery5. SeenTable is a structure that is supposed to work around problems with discovery5: for example, when I call randomNodes I do not want to receive peers which I already received with a previous randomNodes call. If that were the case we wouldn't need a SeenTable, but in practice discovery5 returns the same bunch of nodes as the previous call, just in a different order...

So, for example, it is absolutely possible that without a properly working internet connection, randomNodes will return your bootstrap nodes all the time, just in a different order.

@stefantalpalaru
Contributor Author

it is absolutely possible that without a properly working internet connection, randomNodes will return your bootstrap nodes all the time, just in a different order

Random sampling prevents such a scenario in which some nodes never get a chance to be dialled.

@stefantalpalaru
Contributor Author

We're now sleeping one second between dials in connectWorker(), but it makes no difference on my machine when I stop the network interface while an Altona node is running. The CPU usage is 2-3%, because beacon_node is mostly waiting for I/O, as I expected.

Before the 1s sleep:
(screenshot)

After the 1s sleep:
(screenshot)

@stefantalpalaru
Contributor Author

After reducing a couple of seenTable timeouts to one minute each, we can get comparable results by relying only on Discovery v5 to restore our connectivity:

(screenshot)
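The timeout reduction described above amounts to something like this; the constant names appear earlier in this thread, while the declaration style and previous values are assumptions:

  # seenTable timeouts reduced to one minute each (sketch)
  let
    SeenTableTimeTimeout = 1.minutes   # how long a timed-out peer stays "seen"
    SeenTableTimeDeadPeer = 1.minutes  # same, for peers whose dial failed outright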

@stefantalpalaru changed the title from "keep dialing bootstrap nodes" to "deal with a temporary loss of network connectivity" on Jul 22, 2020
@@ -1130,17 +1135,19 @@ proc announcedENR*(node: Eth2Node): enr.Record =
proc shortForm*(id: KeyPair): string =
$PeerID.init(id.pubkey)

let BOOTSTRAP_NODE_CHECK_INTERVAL = 30.seconds
Contributor

This is a constant and should be placed at the beginning of the file, with a comment about where it is used.

Contributor Author

The further it is from the place it's used, the harder it is to find. Jumping all over the file just to understand a few lines of code makes our job harder.

@@ -1130,17 +1135,19 @@ proc announcedENR*(node: Eth2Node): enr.Record =
proc shortForm*(id: KeyPair): string =
$PeerID.init(id.pubkey)

let BOOTSTRAP_NODE_CHECK_INTERVAL = 30.seconds
proc checkIfConnectedToBootstrapNode(p: pointer) {.gcsafe.} =
Contributor
@cheatfate Jul 23, 2020

This procedure looks very ugly; it could look like this instead:

  proc checkIfConnected(node: Eth2Node) {.async.} =
    while true:
      await sleepAsync(30.seconds)
      if node.discovery.bootstrapRecords.len > 0 and len(node.peerPool) == 0:
        warn "Failed to connect to any node", bootstrapEnrs = node.discovery.bootstrapRecords

  traceAsyncErrors checkIfConnected(node)

But also, you should not use metric values for program logic; it is better to check the PeerPool for the number of connections currently available.

Contributor Author

you should not use metric values for program logic

Why not? Keeps the number of variables low.

Contributor

There is already a place which performs tracking of libp2p peers.

Member

Why not? Keeps the number of variables low.

Metrics are a one-way data flow out of the application. They're an optional feature and we should be able to compile the application with them completely disabled at compile time; using them as regular variables makes the code harder to analyze, as they then start to serve multiple orthogonal concerns.

Contributor Author

Your version might look better to you, but it does the wrong thing by continuing to check for that condition after it has become false. Once you fix that, it becomes uglier than my version. Furthermore, you dropped the part of its name that made clear what it actually does.

Also, do you remember what traceAsyncErrors() does without reading its code? I don't. Its name sounds like something that's currently broken in Chronos.

Contributor

@stefantalpalaru please rename your PR title, because the PR content is quite different from the PR title.

You should name it: deal with a temporary loss of network connectivity while the node is starting.

Contributor

BTW, for what you want, "Failed to connect to any node" is the wrong message, because it implies historical data which you don't have. You need "Not connected to any peer right now. There may be a problem with your network connectivity."

I'm not the author of this PR; I made a proposal. If you do not like the message, you can easily change it.

Contributor

Exactly. I want that warning to only appear before successfully connecting to a peer. Detecting any other problem is a separate concern, better addressed in a separate procedure in a separate PR.

Your PR could easily fix the issue for both cases: "network loss on node start" and "network loss while the node is working". And this can be done by changing 2 LOCs... So why not introduce such changes in this PR?

Contributor Author

create a dedicated boolean for it then

Done.

You should name it: deal with a temporary loss of network connectivity while the node is starting.

But reducing those two timeouts is useful all the time, not just at program start.

traceAsyncErrors catches exceptions in an absolutely legal way: https://github.com/status-im/nim-eth/blob/master/eth/async_utils.nim#L11

The logic inside catchOrQuit() is wrong right now: https://github.com/status-im/nim-eth/blob/765883c454be726799f4e724b4dc2ca8fe25bc74/eth/async_utils.nim#L7

Defects don't end up in Future.error, but are raised directly, so that else branch is unreachable.

Contributor Author

Your PR could easily fix the issue for both cases: "network loss on node start" and "network loss while the node is working". And this can be done by changing 2 LOCs...

It might get more complicated than that. Before telling the average user that his network might be down (or there's a problem with the firewall), I should look at any other network traffic I might have available - stuff like Discovery v5 nodes or any UDP multicasting we might be doing and getting replies from.

I should also have more reliable ways of establishing if the problem is with the Eth2 network itself (and I should probably ping some servers with high uptime before making a fool of myself by lying to the user).

@cheatfate
Contributor

The overall logic of this PR has already been lost in never-ending disputes. The proposed boolean variable will be set to true only when you perform an outgoing connection to some node, but it is possible that you will receive incoming network connections and become fully functional while the check keeps spamming you with "Failed to connect to any bootstrap node". I think the functionality of this PR should be split into at least 2 different PRs.

And because my review comments are not going to be addressed, I'm not going to approve this PR.

@cheatfate
Contributor

cheatfate commented Jul 23, 2020

Also, this PR title is incorrect: this PR does not deal with a temporary loss of network connectivity at all. It only warns the user, while the node is starting, that it is not connected to any bootstrap node... So if the bootstrap nodes become unavailable or do not accept TCP connections at all, beacon_node will keep putting "Failed to connect to any bootstrap node" in the logs forever. The end user can easily be confused into thinking that their network connection is broken, or that the application is broken, and restart the node (or reset the network connection), but this message only means that you can't connect to the specific bootstrap nodes; you are still able to connect to other network nodes.

@stefantalpalaru merged commit c47532f into devel on Jul 23, 2020
@stefantalpalaru deleted the bootstrapdialing branch on July 23, 2020 at 20:51