deal with a temporary loss of network connectivity #1354

Merged
7 commits merged from bootstrapdialing into devel, Jul 23, 2020

Conversation

stefantalpalaru
Contributor

Bootstrap nodes are special and deserve special treatment. We're now
retrying failed dials forever, to be more resilient in the face of
temporary bootstrap node downtimes at program start.

This means it no longer makes sense to die if we didn't connect to a
bootstrap node in 30 seconds. It's not the user who should be redialing
by restarting beacon_node; it's beacon_node itself that should do that.

TODO: when invalidating peers that we previously dialed, check if they
are bootstrap nodes and, if so, add them back to Eth2Node.connQueue, to
deal with a loss of connectivity on our side (ISP hiccup).

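For reference, a minimal sketch of the retry behaviour described above, pieced together from the queue calls quoted later in this thread; the worker structure and the dialPeer helper are assumptions, not the exact code in this PR:

  proc connectWorker(network: Eth2Node) {.async.} =
    while true:
      let remotePeerInfo = await network.connQueue.popFirst()
      let ok = await network.dialPeer(remotePeerInfo)  # assumed helper returning bool
      if not ok:
        if remotePeerInfo.peerId in bootstrapPeerIDs:
          # bootstrap nodes are never marked as "seen": re-queue them so a
          # temporary outage at program start does not lock us out forever
          await network.connQueue.addLast(remotePeerInfo)
        else:
          network.addSeen(remotePeerInfo, SeenTableTimeDeadPeer)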
@zah
Contributor

zah commented Jul 22, 2020

The rationale for handling the initial connection attempts in a special way is that the failure may result from misconfiguration. We don't want to make the failure too silent and difficult to diagnose in the intended real-world environment (e.g. running the beacon node as a system service), so I think we should log a warning or an error every 30 seconds or so until a connection is established.

network.addSeen(pi, SeenTableTimeDeadPeer)
if remotePeerInfo.peerId in bootstrapPeerIDs:
# keep trying
await network.connQueue.addLast(remotePeerInfo)
Contributor

Currently, this add is redundant, as in discovery bootstrap nodes are also treated specially and are never removed from the routing table, so they would keep getting added in the discovery loop. Not adding them to the seen list, as you do here, is of course required.

Contributor Author

I see that bootstrap nodes are exempt from removal in Protocol.replaceNode(), but where are they re-added to Eth2Node.connQueue?

Contributor

In the discoveryLoop they have a chance of being passed again through the randomNodes call. A chance, since it is a random selection, but if no bootstrap nodes were reachable (= no other nodes were added to the routing table), it will always be those bootstrap nodes.

And since you no longer do the addSeen for those bootstrap nodes, a new connection should be attempted, I believe.

Contributor

Here you need to check not only the PeerID, but also whether the PeerInfo you are sending to the connection worker has a TCP-based address inside. Otherwise it will fail immediately, but you will keep adding this item back to the connection queue. If the number of bootstrap nodes is greater than the number of connection workers, this will create an endless loop and the node will never attempt to connect to real nodes.
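A rough sketch of the guard being asked for here; hasTcpAddress is a hypothetical helper, not an existing nim-libp2p API, and is shown only to illustrate checking the PeerInfo's addresses before re-queueing:

  import std/strutils  # for the substring check below

  proc hasTcpAddress(info: PeerInfo): bool =
    # hypothetical helper: true if at least one advertised multiaddress
    # contains a TCP transport that libp2p could actually dial
    for ma in info.addrs:
      if "/tcp/" in $ma:
        return true
    false

  # inside the connection worker, before re-queueing a failed bootstrap node:
  if remotePeerInfo.peerId in bootstrapPeerIDs and remotePeerInfo.hasTcpAddress():
    await network.connQueue.addLast(remotePeerInfo)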

Contributor Author

Currently, this add is redundant

Removed.

Contributor

Here you need to check not only the PeerID, but also whether the PeerInfo you are sending to the connection worker has a TCP-based address inside. Otherwise it will fail immediately, but you will keep adding this item back to the connection queue. If the number of bootstrap nodes is greater than the number of connection workers, this will create an endless loop and the node will never attempt to connect to real nodes.

That filter should be done earlier, in the discovery loop (it already happens now for the eth2 field).

beacon_chain/eth2_network.nim (outdated review thread, resolved)
@kdeme
Contributor

kdeme commented Jul 22, 2020

Looks like a better way to handle the failing bootnodes (the most important fix here is not adding bootstrap nodes to the seen list, although perhaps a smaller SeenTableTimeTimeout and SeenTableTimeDeadPeer for bootnodes would be better there).
But @zah's comment is important; otherwise a user running in just INFO mode will not notice this, or at least not notice it immediately.

The re-adding to the connection queue is not necessary, see the comments I made. And I will probably change that somewhat, so that failing bootnodes are only added again once we run out of peers (see status-im/nim-eth#280). That would then also not require the smaller SeenTableTimeXXX values I mentioned.

@cheatfate
Contributor

Could somebody explain to me why we are trying to connect via libp2p to bootstrap nodes which may be discovery5-only nodes?

@cheatfate
Contributor

cheatfate commented Jul 22, 2020

For example, a bootstrap node may be UDP-only. In that case it will be impossible to even dial this node, because nim-libp2p does not support UDP endpoints. So with the proposed patch we will keep this node in the connection queue, and one of our connection workers will loop forever attempting to connect to it; if there are 10 UDP-only bootstrap nodes, we are going to fill our connection queue with useless attempts to dial UDP-only bootstrap nodes.

@cheatfate
Contributor

From my point of view, discovery5 should never return bootstrap nodes in randomNodes, and we should never attempt to connect to bootstrap nodes.

@zah
Contributor

zah commented Jul 22, 2020

@cheatfate, the information in the ENR tells you whether the node also accepts TCP connections (LibP2P). You don't have to have fixed rules - you can check the record and then decide what to do.

@kdeme
Contributor

kdeme commented Jul 22, 2020

Currently, most bootstrap nodes are not discovery-only nodes. But I do agree that we should not just blindly try to connect to them, as some are indeed only bootstrap nodes.

And actually, that is what we do. Currently, the randomNodes call will not pass on nodes which do not have the correct eth2 field, hence bootstrap nodes that only do discovery should be filtered out (unless they hardcode this field, but that would be wrong). I could/should probably add a check on the tcp field too, thanks for the reminder.
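A sketch of the kind of ENR filter mentioned above, assuming the typed tryGet accessor on enr.Record; the "eth2" and "tcp" field names come from this discussion, while the import path and exact value types are assumptions:

  import std/options
  # assumed import: eth/p2p/discoveryv5/enr from nim-eth

  proc looksDialable(record: enr.Record): bool =
    # only pass on nodes that advertise both an eth2 field and a TCP port;
    # discovery-only bootstrap nodes fail this check and are never dialed
    record.tryGet("eth2", seq[byte]).isSome and
      record.tryGet("tcp", uint16).isSome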

@cheatfate
Contributor

@zah I know, but this PR does not check anything and just keeps sending these peers to the connection queue again and again: https://github.com/status-im/nim-beacon-chain/pull/1354/files#diff-9a4cd7edc16fa179b1a30af959c1ab28R750

@kdeme
Contributor

kdeme commented Jul 22, 2020

So this is more of an attempt to keep temporary bootstrap (or local!) hiccups at start-up from causing a 10-minute delay before getting connected to that node. Is it necessary? No, the bootstrap node will still be used at the discovery level and, if it works, it will help you discover other nodes.

The case of endless dials to non-reachable bootnodes could also be resolved in discv5, see my previous comment and the linked issue.

@cheatfate
Contributor

There was already an issue made by @zah with a very good approach for fixing the race-condition problems in eth2_network_simulation:

  1. Adjust beacon_node to start libp2p networking before discovery5 networking.
  2. Start a discovery5-only bootstrap node.
  3. Start all the nodes, which will connect to the bootstrap node via discovery5 only.
  4. These nodes will form a mesh without any problems, because discovery5 will return in randomNodes only nodes that are currently available to connect to, so there will be no failing dials anymore.

@stefantalpalaru
Contributor Author

We don't want to make the failure too silent and difficult to diagnose in the intended real-world environment (e.g. running the beacon node as a system service), so I think we should log a warning or an error every 30 seconds or so until a connection is established.

Implemented, @zah.

@cheatfate self-requested a review on July 22, 2020 at 11:53
@kdeme
Contributor

kdeme commented Jul 22, 2020

@cheatfate If I'm not mistaken, all points except 2. are already the case. Yet there are still failures.
I think that even though discovery is started after libp2p networking (I switched those two lines a while back), there still seems to be an issue with dials.

Anyhow, if this is only about the local simulations, then yes, a better fix would be to fix the above.

@cheatfate
Contributor

Bootstrap nodes should not be a crucial point of the initial connection. So if a libp2p dial times out, there is no reason to connect to this node again and again. What is more important here is to check whether discovery5 is healthy enough...

@cheatfate
Contributor

@kdeme you should agree that if discovery returns 3 nodes, then those 3 nodes are already online and have bound their server sockets... And if a connection dial attempt fails, it should be investigated...

@stefantalpalaru
Contributor Author

If the number of bootstrap nodes is greater than the number of connection workers, this will create an endless loop and the node will never attempt to connect to real nodes.

We're adding failed bootstrap nodes to the back of a FIFO queue, so they don't prevent other nodes from being dialled.
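As a toy illustration of that FIFO property, using chronos' AsyncQueue directly (a standalone example, not the PR's code):

  import chronos

  proc demo() {.async.} =
    let q = newAsyncQueue[string]()
    await q.addLast("freshly-discovered-peer")
    await q.addLast("another-peer")
    await q.addLast("failed-bootstrap-node")  # re-added to the back after a failed dial
    echo (await q.popFirst())  # "freshly-discovered-peer": other nodes still get dialled first
    echo (await q.popFirst())  # "another-peer"
    echo (await q.popFirst())  # the failed bootstrap node only comes up again at the end

  waitFor demo()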

@stefantalpalaru
Contributor Author

Bootstrap nodes should not be a crucial point of the initial connection.

I don't think we're discovering potential peers any other way, when starting with an empty state.

@cheatfate
Contributor

@stefantalpalaru nim-libp2p does not discover peers.

@stefantalpalaru
Contributor Author

stefantalpalaru commented Jul 22, 2020

nim-libp2p does not discover peers.

I know, but last time I asked, Discovery v5 had only one source of initial peers: bootstrap nodes.

@cheatfate
Contributor

@stefantalpalaru discovery5 works independently of nim-libp2p; discovery5 does not require any libp2p connections to bootstrap nodes to be established.

@cheatfate
Contributor

@stefantalpalaru the changes you are trying to apply should be done in the discovery5 loop, e.g. if we are not getting new peers using discovery5, we should warn about it every N seconds/minutes. But there is no reason to do the same thing on the libp2p side.

@kdeme
Contributor

kdeme commented Jul 22, 2020

@kdeme you should agree that if discovery returns 3 nodes, then those 3 nodes are already online and have bound their server sockets... And if a connection dial attempt fails, it should be investigated...

Well, that depends on the implementation (online != discovery online != libp2p online), but in our case I would assume yes, as the libp2p switch (https://github.com/status-im/nim-beacon-chain/blob/devel/beacon_chain/eth2_network.nim#L845) is started (and awaited) before discovery.
I agree that this is the main issue that needs investigation.

@stefantalpalaru
Contributor Author

if we are not getting new peers using discovery5, we should warn about it every N seconds/minutes. But there is no reason to do the same thing on the libp2p side.

If we can rely on DiscV5 to redial bootstrap nodes all the time, we can indeed get rid of this parallel redialling on the libp2p side.

But what if the same bootstrap node we're never connecting to as an Eth2 peer, just because we couldn't connect the first time, is the best source for syncing blocks and getting attestation traffic? Why would we deprive our node of such a valuable peer, just because it was "seen"?

How about we periodically empty Eth2Node.seenTable?
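A sketch of that idea, assuming Eth2Node.seenTable is a table-like field with a clear() operation and picking an arbitrary interval; both are assumptions:

  import chronos, tables

  proc forgetSeenPeers(node: Eth2Node) {.async.} =
    while true:
      await sleepAsync(10.minutes)  # arbitrary interval, for illustration only
      # drop every previously failed peer so valuable nodes, including
      # bootstrap nodes, get another chance to be dialled
      node.seenTable.clear()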

@cheatfate
Contributor

cheatfate commented Jul 22, 2020

We can do much better filtering with discovery5. SeenTable is a structure that is supposed to work around problems with discovery5: for example, when I call randomNodes I do not want to receive peers which I already received with a previous randomNodes call. If that were the case we wouldn't need a SeenTable, but in practice discovery5 returns the same bunch of nodes as the previous call, just in a different order...

So, for example, it is absolutely possible that without a properly working internet connection, randomNodes will return your bootstrap nodes all the time, just in a different order.

@stefantalpalaru
Contributor Author

it is absolutely possible that without a properly working internet connection, randomNodes will return your bootstrap nodes all the time, just in a different order

Random sampling prevents such a scenario in which some nodes never get a chance to be dialled.

@stefantalpalaru
Contributor Author

We're now sleeping one second between dials in connectWorker(), but it makes no difference on my machine when I stop the network interface while an Altona node is running. The CPU usage is 2-3%, because beacon_node is mostly waiting for I/O, as I expected.

Before the 1s sleep:
(screenshot)

After the 1s sleep:
(screenshot)

@stefantalpalaru
Contributor Author

After reducing a couple of seenTable timeouts to one minute each, we can get comparable results by relying only on Discovery v5 to restore our connectivity:

(screenshot)
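The timeout reduction described above amounts to something like this; the constant names appear earlier in this thread, while the declaration style and previous values are assumptions:

  # seenTable timeouts reduced to one minute each (sketch)
  let
    SeenTableTimeTimeout = 1.minutes   # how long a timed-out peer stays "seen"
    SeenTableTimeDeadPeer = 1.minutes  # same, for peers whose dial failed outright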

@stefantalpalaru changed the title from "keep dialing bootstrap nodes" to "deal with a temporary loss of network connectivity" on Jul 22, 2020
@@ -1130,17 +1135,19 @@ proc announcedENR*(node: Eth2Node): enr.Record =
proc shortForm*(id: KeyPair): string =
$PeerID.init(id.pubkey)

let BOOTSTRAP_NODE_CHECK_INTERVAL = 30.seconds
Contributor

This is a constant and should be placed at the beginning of the file, with a comment about where it is used.

Contributor Author

The further it is from the place it's used, the harder it is to find. Jumping all over the file just to understand a few lines of code makes our job harder.

@@ -1130,17 +1135,19 @@ proc announcedENR*(node: Eth2Node): enr.Record =
proc shortForm*(id: KeyPair): string =
$PeerID.init(id.pubkey)

let BOOTSTRAP_NODE_CHECK_INTERVAL = 30.seconds
proc checkIfConnectedToBootstrapNode(p: pointer) {.gcsafe.} =
Contributor
@cheatfate Jul 23, 2020

This procedure looks very ugly; it could look like this instead:

  proc checkIfConnected(node: Eth2Node) {.async.} =
    while true:
      await sleepAsync(30.seconds)
      if node.discovery.bootstrapRecords.len > 0 and len(node.peerPool) == 0:
        warn "Failed to connect to any node", bootstrapEnrs = node.discovery.bootstrapRecords

  traceAsyncErrors checkIfConnected(node)

But also, you should not use metric values for program logic; it is better to check the PeerPool for the number of connections currently available.

Contributor Author

you should not use metric values for program logic

Why not? Keeps the number of variables low.

Contributor

There is already a place which performs tracking of libp2p peers.

Member

Why not? Keeps the number of variables low.

Metrics are a one-way data flow out of the application. They're an optional feature and we should be able to compile the application with them completely disabled at compile time; using them as regular variables makes the code harder to analyze, as they then start to serve multiple orthogonal concerns.

Contributor Author

Your version might look better to you, but it does the wrong thing by continuing to check for that condition after it has become false. Once you fix that, it becomes uglier than my version. Furthermore, you dropped the part of its name that made clear what it actually does.

Also, do you remember what traceAsyncErrors() does without reading its code? I don't. Its name sounds like something that's currently broken in Chronos.

Contributor

@stefantalpalaru please rename your PR title, because the PR content is quite different from the PR title.

You should name it: deal with a temporary loss of network connectivity while the node is starting.

Contributor

BTW, for what you want, "Failed to connect to any node" is the wrong message, because it implies historical data which you don't have. You need "Not connected to any peer right now. There may be a problem with your network connectivity."

I'm not the author of this PR; I made a proposal. If you do not like the message, you can easily change it.

Contributor

Exactly. I want that warning to only appear before successfully connecting to a peer. Detecting any other problem is a separate concern, better addressed in a separate procedure in a separate PR.

Your PR could easily fix the issue for both cases: "network loss on node start" and "network loss while the node is working". And this can be done by changing 2 LOCs... So why not introduce such changes in this PR?

Contributor Author

create a dedicated boolean for it then

Done.

You should name it: deal with a temporary loss of network connectivity while the node is starting.

But reducing those two timeouts is useful all the time, not just at program start.

traceAsyncErrors catches exceptions in an absolutely legal way: https://github.com/status-im/nim-eth/blob/master/eth/async_utils.nim#L11

The logic inside catchOrQuit() is wrong right now: https://github.com/status-im/nim-eth/blob/765883c454be726799f4e724b4dc2ca8fe25bc74/eth/async_utils.nim#L7

Defects don't end up in Future.error, but are raised directly, so that else branch is unreachable.

Contributor Author

Your PR could easily fix the issue for both cases: "network loss on node start" and "network loss while the node is working". And this can be done by changing 2 LOCs...

It might get more complicated than that. Before telling the average user that his network might be down (or there's a problem with the firewall), I should look at any other network traffic I might have available - stuff like Discovery v5 nodes or any UDP multicasting we might be doing and getting replies from.

I should also have more reliable ways of establishing if the problem is with the Eth2 network itself (and I should probably ping some servers with high uptime before making a fool of myself by lying to the user).

@cheatfate
Contributor

The overall logic of this PR has already been lost in never-ending disputes. The proposed boolean variable will be set to true only when you perform an outgoing connection to some node, but it is possible that you will receive incoming network connections and become fully functional while the check keeps spamming you with "Failed to connect to any bootstrap node". I think the functionality of this PR should be split into at least 2 different PRs.

And because my review comments are not going to be addressed, I'm not going to approve this PR.

@cheatfate
Contributor

cheatfate commented Jul 23, 2020

Also, this PR title is incorrect: this PR does not deal with a temporary loss of network connectivity at all. It only warns the user, while the node is starting, that it is not connected to any bootstrap node... So if the bootstrap nodes become unavailable or do not accept TCP connections at all, beacon_node will keep putting "Failed to connect to any bootstrap node" in the logs forever. The end user can easily be confused into thinking that their network connection is broken, or that the application is broken, and restart the node (or reset the network connection), but this message only means that you can't connect to the specific bootstrap nodes; you are still able to connect to other network nodes.

@stefantalpalaru merged commit c47532f into devel on Jul 23, 2020
@stefantalpalaru deleted the bootstrapdialing branch on July 23, 2020 at 20:51