feat(connecteventmanager): block Connected() until accepted #435
Conversation
Codecov Report

@@            Coverage Diff             @@
##             main     #435      +/-   ##
==========================================
+ Coverage   49.80%   50.02%   +0.22%
==========================================
  Files         249      249
  Lines       29972    29990      +18
==========================================
+ Hits        14928    15003      +75
+ Misses      13615    13554      -61
- Partials     1429     1433       +4
func (p *peerState) setPending() {
	if !p.isPending() {
		p.accepted = make(chan struct{})
I'd like some help sanity checking this, because I'm not entirely clear about the atomicity of some of these operations.
We rely on connectEventManager's mutex for most things, but there is one unsynchronised call in waitAccept: it just does a <-p.accepted, and that could be happening while we're reassigning accepted to a new channel. We only do the reassignment when the channel is already closed, so the <-p.accepted should have been a quick no-op; but is that atomic? Is there potential for a conflict between <-p.accepted where p.accepted is closed, and p.accepted = make(chan struct{})?
All good with our test suite, which makes heavy use of bitswap and has been heavily flaky because of this problem: filecoin-project/lassie#383
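To make the question above concrete, here is a minimal sketch (a hypothetical reduction of the real peerState, not the actual implementation) showing why the safe pattern is to copy the channel reference while holding the mutex and only then block on it: receiving from a closed channel is cheap, but reading the p.accepted field unsynchronised while setPending reassigns it would be a data race.

```go
package main

import (
	"fmt"
	"sync"
)

// peerState is a hypothetical reduction of the struct under discussion.
type peerState struct {
	mu       sync.Mutex
	accepted chan struct{}
}

// waitAccept copies the channel reference under the lock, then blocks
// outside it. The copy is what avoids the race: the receive itself is
// safe, but reading the field while setPending reassigns it is not.
func (p *peerState) waitAccept() {
	p.mu.Lock()
	ch := p.accepted
	p.mu.Unlock()
	<-ch
}

// setPending replaces the channel only if it is already closed,
// mirroring the "only doing this if the channel is already closed"
// condition from the discussion.
func (p *peerState) setPending() {
	p.mu.Lock()
	defer p.mu.Unlock()
	select {
	case <-p.accepted: // already closed: start a new pending round
		p.accepted = make(chan struct{})
	default: // still open: keep the existing channel
	}
}

// accept releases all current waiters.
func (p *peerState) accept() {
	p.mu.Lock()
	defer p.mu.Unlock()
	close(p.accepted)
}

func main() {
	p := &peerState{accepted: make(chan struct{})}
	released := make(chan struct{})
	go func() {
		p.waitAccept()
		close(released)
	}()
	p.accept()
	<-released
	fmt.Println("waiter released")
}
```

Under Go's memory model, close and receive on the copied channel establish the needed happens-before edge, so the waiter never observes a torn or stale channel value.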
I agree there's no need for a per-peer mutex, since everything is already mutexed.
In your new, simpler implementation there is an edge case I'm not sure whether you should care about:
- setState is called, isPending is set true, a change is created
- before the change is processed, setState is called again (to the same or a different new state)
- since isPending = true, no new change is encoded and waitNoop is returned; however, the change has yet to be handled
Truthfully, the edge case where this causes downstream problems feels possibly non-existent. I just want to identify it.
The potential solution is to put handled back on the peerState and return a func() that uses it whenever isPending = true -- but if you do so, I recommend you copy the reference out of the peerState before returning the wait func, so that any later mutations to the channel in the peer state don't affect the callback. This can be done with a simple closure:
func makeWaitFunc(handled chan struct{}) waitFn {
	return func() {
		<-handled
	}
}
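The closure-capture point can be checked with a small sketch (type names here are hypothetical stand-ins for the real ones): because the channel is passed by value into makeWaitFunc, the returned wait func keeps its own reference, and reassigning the field on the peer state afterwards cannot redirect it.

```go
package main

import "fmt"

type waitFn func()

// makeWaitFunc copies the channel reference via its parameter; later
// mutations to the peerState field cannot redirect this closure.
func makeWaitFunc(handled chan struct{}) waitFn {
	return func() { <-handled }
}

// peerState is a hypothetical stand-in for the real struct.
type peerState struct {
	handled chan struct{}
}

func main() {
	p := &peerState{handled: make(chan struct{})}
	wait := makeWaitFunc(p.handled)

	old := p.handled
	p.handled = make(chan struct{}) // simulate a new pending round

	close(old) // handling the *old* change releases the waiter
	wait()     // returns immediately: the closure captured old
	fmt.Println("released by the original channel")
}
```

Had makeWaitFunc instead read p.handled inside the closure, the wait would have attached to whichever channel happened to be current at call time, which is exactly the mutation hazard the comment warns about.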
Also, are there any edge cases where handled never gets closed? Like if the queue shuts down, for example?
Good feedback! Here's what I've done:
func (c *connectEventManager) makeWaitFunc(handled chan struct{}) waitFn {
	return func() {
		select {
		case <-handled:
		case <-c.done:
		}
	}
}
} else if state.pending {
	// Find the change in the queue and return a wait function for it
	for _, change := range c.changeQueue {
		if change.pid == p {
			return c.makeWaitFunc(change.handled)
		}
	}
	log.Error("a peer was marked as change pending but not found in the change queue")
}
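The select-on-done variant above can be exercised in isolation with a minimal sketch (struct fields here are assumptions, reduced from the snippet): even if a queued change's handled channel is never closed, closing the manager's done channel releases the waiter, so a shutdown cannot leak a blocked goroutine.

```go
package main

import "fmt"

type waitFn func()

// connectEventManager is reduced here to the one field the wait
// function needs; the real struct has more.
type connectEventManager struct {
	done chan struct{}
}

// makeWaitFunc mirrors the snippet above: the wait unblocks either
// when the change is handled or when the manager shuts down.
func (c *connectEventManager) makeWaitFunc(handled chan struct{}) waitFn {
	return func() {
		select {
		case <-handled:
		case <-c.done:
		}
	}
}

func main() {
	c := &connectEventManager{done: make(chan struct{})}
	handled := make(chan struct{}) // never closed
	wait := c.makeWaitFunc(handled)
	close(c.done) // simulate queue shutdown
	wait()        // returns via the done branch
	fmt.Println("wait returned on shutdown")
}
```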
LGTM.
Added a simple test to lock this behaviour in.
Force-pushed from 11fb4a7 to 22fd90d.
The connectEventManager is a very complex piece of code for what it does, and it took tens of minutes to properly understand.
To reiterate the discussion on Slack: I'm not opposed to your approach in #436, but ripping out the existing piece of infrastructure is more radical than attempting to fix it, at least for the purposes we're trying to achieve right now. Perhaps this could be a two-step thing. I see this as a Chesterton's Fence situation: we're all acknowledging that we don't fully appreciate why the connecteventmanager exists and what it's aiming to achieve, and in that case the more prudent approach, rather than just ripping it out, might be to approach it more incrementally ("prudence is to understand why the fence is there in the first place before you attempt to take it down"). There's a singular issue we're trying to fix, and that is the race condition that exists between when a client is … Ditching connecteventmanager possibly fixes that, and other problems that relate to it; but I wouldn't mind having a commit in the …
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SGTM. This is tricky code and I'm not sure about all the interactions, but tests are passing.
Should get #436 merged within a month ideally.
This wasn't caught because the tests hadn't run due to the test.Flaky marker. The tests were testing exactly for the bug #435 fixed.
* feat(connecteventmanager): block Connected() until accepted (Ref: #432; minimal attempt at solving #432)
* fix(connecteventmanager): less complex channel signalling
* fix(connecteventmanager): handle change queue edge cases and closure
* fix(connecteventmanager): add test to confirm sync Connected() call flow
* changelog: put the 435 fix in the right version
* fix(connecteventmanager): clean up tests for new synchronous flow
Fixes: #432
Minimal attempt at solving #432