-
Notifications
You must be signed in to change notification settings - Fork 251
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Peer dialing/kicking system overhaul #3346
Conversation
I don't see the corresponding spike in the kicked peers metric under discovery and dialling - is that a display issue in grafana or something else? |
# between kicks, so we must not kick too many peers at | ||
# once | ||
for _ in 0..<excessPeers div 2: | ||
await node.trimConnections(2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
intuitively, this feels like it's on the aggressive end: consider startup - we just connected to a bunch of peers and they have not necessarily had time to even establish a score / connection / handshake to meaningfully differentiate them, and here we kick them out very aggressively -
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The metadata pinger is used as a grace period, ie we won't kick a peer if we don't have his metadata.
This may seem a bit weird, but the grace period is a tricky subject: if it's too low, we'll kick perfectly good peers that don't have time to be established correctly, if it's too long, we may kick a useful peer because a useless peer is still in his grace period.
The idea is that if we had time to get his metadata, we should be nicely setup, and if he doesn't respond to metadata requests, he'll eventually get kicked anyway
I did try to do a smarter trimmer, that will only kick peers if we have to, in this spirit:
ed3b2fd#diff-84fc3d73d852761d07624e61621b0784b6f144dbeeb57f001990bd8f7bc5d7a7R1650-R1659
But we ended up getting stable at 75% of the hard-capacity (so stabilized at max-peers * 1.5)
I tried later to do a more soft limit, eg max-score = lerp(minScore=300, maxScore=10000, a=0, b=maxPeers, value=excessPeers)
, but then we just stabilized at a random point between max-peers and max-peers * 2, which I find more confusing than anything. + it relies on magic values (300 & 10000)
So anyway, the smart approach proved difficult, I switched back to a dumb one, which is pretty effective, imo
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm thinking this loop could be: while len(node.peerPool) > node.wantedPeers: kick(worst peer); sleep(1)
or something like this - basically, we fetch new peers from discovery at a certain rate - we should not kick peers at a higher rate than that (but perhaps there's logic somewhere that I'm missing that keeps the two in balance?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually we don't know the discovery rate, for instance on mainnet we discover a lot more peers than in prater (because the DHT is shared and there is more nodes in mainnet)
But the result is the same: since only dialing can go over max-peers, if we kick everything over max-peers, we'll kick as much as we discover (and successfully dial)
I've thought about moving this to the connect worker, eg:
if canConnect:
let success = await connect
if success and len(node.peerPool) > node.wantedPeers:
kick(worst peer)
But I think the current approach is more "bullet-proof"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree, kicking should stay in this separate task - what would the downsides of "rate-limiting" kicking be?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rate-limiting being your snippet?
Since we don't know how much peers we discover on each round, it may be too slow. To be safe, it would have to sleep very little, which would be equivalent to the current code
Some numbers on mainnet:
2022-02-01 11:32:45.011+00:00 discovered_nodes:56
2022-02-01 11:33:06.336+00:00 discovered_nodes:8
2022-02-01 11:33:13.220+00:00 discovered_nodes:30
2022-02-01 11:33:22.452+00:00 discovered_nodes:24
As you can see, both the number of discovered nodes and the duration of the discovery loop varies wildly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be safe, it would have to sleep very little, which would be equivalent to the current code
except that it allows the chronos loop to run between each kick as well, which is a good thing in general - how about my loop with some cap on the number of peers kicked in each round - maybe something like max(1, sqrt(peers_in_excess))
which will kick more aggressively when there's lots of overshoot and slow down later..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Last commit switches to
while true:
let excessPeers = connectedPeers - node.wantedPeers
if excessPeers > 0: kick(2 worst peer)
sleep(1.seconds div max(1, excessPeers))
Which works quite well
except CancelledError as exc: | ||
raise exc | ||
except CatchableError as exc: | ||
debug "failed to kick peer", peerId, err=exc.msg |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this something that can really occur? Is the dec toKick
then ok on the line below?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if it can actually happen, though worst case scenario, the next iteration (5 seconds later) will pick up the slack
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And I would rather stay safe and over-estimate the number of kicked peers, than under-estimate
|
||
scores[peer.peerId] = scores[peer.peerId] + thisPeersScore | ||
for peer in node.pubsub.gossipsub.getOrDefault(topic): | ||
if peer.peerId notin scores: continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we skip non-stability peers here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, I read the code wrong - we're looking at peers that nimbus knows about, and filtering out peers that are in gossipsub and not in nimbus - if this happens, it smells of bug, but moving on:
we don't give score to syncnet peers in the initial loop - this needs addressing as well - also, we sort of prefer "large" peers with lots of subnets over smaller peer with fewer subnets - I suspect this ends up being wrong over time because it gives an unfair advantage to large peers which in now way are guaranteed to be "better" - it would be good to strive for a random mix of both kinds - perhaps "has_some_interesting_attnet" and "has_some_interesting_syncnet" would be better conditions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another idea I had, instead of scoring on SUM(topicsPoints), we could score on MAX(topicPoints), wdyt?
The first loop is only here to score peers during sync at this point, ideally I would like to get rid of it, not sure how atm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and filtering out peers that are in gossipsub and not in nimbus - if this happens, it smells of bug, but moving on
It can happen if we don't have a metadata for them yet (ie, they are in their grace period)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Last commit switches to AVG(nonGlobalTopicsPoints)
Global topics are not counted because that would be unfair towards small nodes. They always have low value, and would bring the average down
Indeed, I was wrong, the issue was that the peer trimmer got stuck while disconnecting a peer, switched to an asyncSpawn to avoid it happening again, seems good so far |
when does it get .. unstuck? |
It goes back to the discussion we had a while ago: should Currently, it does wait, so if you are disconnecting someone and don't want to wait, you have to asyncSpawn. And I fell into the trap of not doing it, which caused the trimmer to be stuck for a few minutes when a slow peer is being kicked And I can confirm that it still runs without spikes for 3 days, so that's fixed |
Co-authored-by: Jacek Sieka <[email protected]>
This PR is an ongoing effort to improve peering
The most complex part of this PR is the new force dial mechanism.
Previously, we tried to dial peer, and if we didn't have room to dial peer, we would kick some to make room for the new one. The major downside of this system is that it kick peers before actually dialing them, so it make kick peers and fail to dial the new ones afterward.
Instead, this PR relies on a small modification in libp2p that allows to go over the maxPeers when dialing. A regular trimmer will then kick the excess connections.
The tricky part is that we need to balance between health and incoming starvation. Indeed, by being most of the time above maxPeers, incoming peers will never be allowed, so over time, we'll end up with outgoing only, which is not good for the network as a whole.
Instead, this PR will let a few incoming peers in when we are force dialing.
A few metrics of it running in a few scenarios, with all subnets:
Mainnet, max-peers=300
It's really hard to get a healthy set of peers for every subnet on mainnet, even with 300 peers. This node is never able to stabilize, instead hovering around 310 peers and spend all his time kicking and dialing new peers.
Prater, max-peers=60
On prater, getting healthy peers is easier (bigger nodes on average), so we're stable, hover around 60, and have around half incoming, half outgoing which is good.
Mainnet, max-peers=60
If getting good health with 300 peers is hard, 60 is impossible. But still, a good incoming/outgoing balance
Astute readers will have notice the weird spikes on the last two screenshots. The hard limit of connections is now max-conn * 2, and sometime we will find a lot of peers in a discovery round, and dial them all at once, until we hit the limit. Then the kicker will come in and trim the excess fairly quickly.
EDIT: the spikes were actually caused by the trimmer getting stuck