-
Notifications
You must be signed in to change notification settings - Fork 179
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix for 5575 - occasional stream reset #832
Conversation
@@ -87,3 +98,47 @@ func (c ConnManager) logConnectionUpdate(n network.Network, con network.Conn, lo | |||
Int("total_connections", len(n.Conns())). | |||
Msg(logMsg) | |||
} | |||
|
|||
// ProtectPeer increments the stream setup count for the peer.ID | |||
func (c *ConnManager) ProtectPeer(id peer.ID) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when a stream is being created, we make it as protected
by calling ProtectPeer
|
||
// UnprotectPeer decrements the stream setup count for the peer.ID. | ||
// If the count reaches zero, the id is removed from the map | ||
func (c *ConnManager) UnprotectPeer(id peer.ID) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When the stream creation is done (successfully on unsuccessfully) we mark it as unprotected
Codecov Report
@@ Coverage Diff @@
## master #832 +/- ##
==========================================
- Coverage 56.49% 56.48% -0.02%
==========================================
Files 427 427
Lines 25088 25121 +33
==========================================
+ Hits 14173 14189 +16
- Misses 9001 9013 +12
- Partials 1914 1919 +5
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
@@ -129,6 +129,10 @@ func (l *libp2pConnector) trimAllConnectionsExcept(peerInfos []peer.AddrInfo) { | |||
|
|||
peerInfo := l.host.Network().Peerstore().PeerInfo(peerID) | |||
log := l.log.With().Str("remote_peer", peerInfo.String()).Logger() | |||
if l.host.ConnManager().IsProtected(peerID, "") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if there is an attempt in progress to create a one-to-one stream then do not prune the connection
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the implementation, just please add unit testing for it.
// ProtectPeer increments the stream setup count for the peer.ID | ||
func (c *ConnManager) ProtectPeer(id peer.ID) { | ||
c.streamSetupMapLk.Lock() | ||
defer c.streamSetupMapLk.Unlock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can definitely wait for another PR but: for better maintainability would be great if we either encapsulate this entire logic into a separate struct (e.g., StreamSetupProtector
, or use atomic increment. The reason I suggest this is to deter having a single-purpose mutex lingering around a more general-purpose struct. In other words, streamSetupMapLK
only protects this map, however, there is no guarantee that any other component of this struct
does not utilize it for another purpose.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good
This is a fix for 5575. The bug is in a nut shell is that there are still some (5 in the last 24 hrs) occurrences of
stream reset
observed on canary net.Root cause
Earlier, I had made a change to make sure that the PeerManager, which is responsible for adding and removing new connections for the 1-k pubsub messaging, does not remove connections with active streams on it.
However, that still left a window between when a connection is created and before a stream is initialized on that connection when the peer manager could remove the connection since the stream count on the connection would be zero at that time.
Fix
To address that, the connection creation and the subsequent stream creation on that connection was made atomic to make sure the peer manager connection pruning logic doesn't get in the way.
Alternate approach
Libp2p made a ConnectionManager overhaul which introduces some new functions on the Connection Manager such as
Protect
,Unprotect
andIsProtected
for a Peer ID and tag. It provides a BasicConnMgr that implements these function.However, I decided not to use the BasicConnMgr because:
Protect
andUnProtect
don't do stream reference counting (counting number of streams per connection) which is unfortunately what we need here since we want to unprotect the connection only when the reference count goes to zero indicating no active streams.