Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix for 5575 - occasional stream reset #832

Merged
merged 14 commits into from
Jun 15, 2021

Conversation

vishalchangrani
Copy link
Contributor

@vishalchangrani vishalchangrani commented Jun 14, 2021

This is a fix for 5575. The bug is in a nut shell is that there are still some (5 in the last 24 hrs) occurrences of stream reset observed on canary net.

Screen Shot 2021-06-14 at 3 24 33 PM

Root cause

Earlier, I had made a change to make sure that the PeerManager, which is responsible for adding and removing new connections for the 1-k pubsub messaging, does not remove connections with active streams on it.
However, that still left a window between when a connection is created and before a stream is initialized on that connection when the peer manager could remove the connection since the stream count on the connection would be zero at that time.

Fix

To address that, the connection creation and the subsequent stream creation on that connection was made atomic to make sure the peer manager connection pruning logic doesn't get in the way.

Alternate approach

Libp2p made a ConnectionManager overhaul which introduces some new functions on the Connection Manager such as Protect, Unprotect and IsProtected for a Peer ID and tag. It provides a BasicConnMgr that implements these function.

However, I decided not to use the BasicConnMgr because:

  1. It introduces a decayer which is a feature in itself and may have other side-effects.
  2. The BasicConnMgr runs at a non-configurable one minute period
  3. It also brings in Connection Tagging - again a feature which may have side effects.
  4. The Protect and UnProtect don't do stream reference counting (counting number of streams per connection) which is unfortunately what we need here since we want to unprotect the connection only when the reference count goes to zero indicating no active streams.

@@ -87,3 +98,47 @@ func (c ConnManager) logConnectionUpdate(n network.Network, con network.Conn, lo
Int("total_connections", len(n.Conns())).
Msg(logMsg)
}

// ProtectPeer increments the stream setup count for the peer.ID
func (c *ConnManager) ProtectPeer(id peer.ID) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when a stream is being created, we make it as protected by calling ProtectPeer


// UnprotectPeer decrements the stream setup count for the peer.ID.
// If the count reaches zero, the id is removed from the map
func (c *ConnManager) UnprotectPeer(id peer.ID) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the stream creation is done (successfully on unsuccessfully) we mark it as unprotected

@codecov-commenter
Copy link

codecov-commenter commented Jun 14, 2021

Codecov Report

Merging #832 (08e9413) into master (c628937) will decrease coverage by 0.01%.
The diff coverage is 91.66%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #832      +/-   ##
==========================================
- Coverage   56.49%   56.48%   -0.02%     
==========================================
  Files         427      427              
  Lines       25088    25121      +33     
==========================================
+ Hits        14173    14189      +16     
- Misses       9001     9013      +12     
- Partials     1914     1919       +5     
Flag Coverage Δ
unittests 56.48% <91.66%> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
network/p2p/libp2pConnector.go 0.00% <0.00%> (ø)
network/p2p/connManager.go 100.00% <100.00%> (ø)
network/p2p/libp2pNode.go 66.93% <100.00%> (+0.68%) ⬆️
engine/verification/match/chunks.go 83.33% <0.00%> (-6.67%) ⬇️
engine/execution/ingestion/engine.go 52.19% <0.00%> (-1.53%) ⬇️
engine/verification/match/engine.go 65.10% <0.00%> (-1.08%) ⬇️
cmd/util/ledger/migrations/storage_v4.go 41.56% <0.00%> (-0.61%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c628937...08e9413. Read the comment docs.

@@ -129,6 +129,10 @@ func (l *libp2pConnector) trimAllConnectionsExcept(peerInfos []peer.AddrInfo) {

peerInfo := l.host.Network().Peerstore().PeerInfo(peerID)
log := l.log.With().Str("remote_peer", peerInfo.String()).Logger()
if l.host.ConnManager().IsProtected(peerID, "") {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if there is an attempt in progress to create a one-to-one stream then do not prune the connection

Copy link
Contributor

@yhassanzadeh13 yhassanzadeh13 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the implementation, just please add unit testing for it.

// ProtectPeer increments the stream setup count for the peer.ID
func (c *ConnManager) ProtectPeer(id peer.ID) {
c.streamSetupMapLk.Lock()
defer c.streamSetupMapLk.Unlock()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can definitely wait for another PR but: for better maintainability would be great if we either encapsulate this entire logic into a separate struct (e.g., StreamSetupProtector, or use atomic increment. The reason I suggest this is to deter having a single-purpose mutex lingering around a more general-purpose struct. In other words, streamSetupMapLK only protects this map, however, there is no guarantee that any other component of this struct does not utilize it for another purpose.

Copy link
Member

@zhangchiqing zhangchiqing left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants