feat(connecteventmanager): block Connected() until accepted #435
Conversation
Codecov Report

@@            Coverage Diff             @@
##             main     #435      +/-   ##
==========================================
+ Coverage   49.80%   50.02%   +0.22%
==========================================
  Files         249      249
  Lines       29972    29990      +18
==========================================
+ Hits        14928    15003      +75
+ Misses      13615    13554      -61
- Partials     1429     1433       +4
func (p *peerState) setPending() {
	if !p.isPending() {
		p.accepted = make(chan struct{})
I'd like some help sanity checking this, because I'm not entirely clear about the atomicity of some of these operations.
We rely on connectEventManager's mutex for most things, but there is one unsynchronised call in waitAccept: it just does a <-p.accepted, and that could be happening while we're reassigning accepted to a new channel. We only do the reassignment when the channel is already closed, so the <-p.accepted should have been a quick no-op; but is that atomic? Is there potential for a conflict between <-p.accepted where p.accepted is closed, and p.accepted = make(chan struct{})?
All good with our test suite, which makes heavy use of bitswap and has been heavily flaky because of this problem: filecoin-project/lassie#383
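To make the question above concrete, here is a minimal sketch (a hypothetical reduction of the real peerState, not the actual implementation) showing why the safe pattern is to copy the channel reference while holding the mutex and only then block on it: receiving from a closed channel is cheap, but reading the p.accepted field unsynchronised while setPending reassigns it would be a data race.

```go
package main

import (
	"fmt"
	"sync"
)

// peerState is a hypothetical reduction of the struct under discussion.
type peerState struct {
	mu       sync.Mutex
	accepted chan struct{}
}

// waitAccept copies the channel reference under the lock, then blocks
// outside it. The copy is what avoids the race: the receive itself is
// safe, but reading the field while setPending reassigns it is not.
func (p *peerState) waitAccept() {
	p.mu.Lock()
	ch := p.accepted
	p.mu.Unlock()
	<-ch
}

// setPending replaces the channel only if it is already closed,
// mirroring the "only doing this if the channel is already closed"
// condition from the discussion.
func (p *peerState) setPending() {
	p.mu.Lock()
	defer p.mu.Unlock()
	select {
	case <-p.accepted: // already closed: start a new pending round
		p.accepted = make(chan struct{})
	default: // still open: keep the existing channel
	}
}

// accept releases all current waiters.
func (p *peerState) accept() {
	p.mu.Lock()
	defer p.mu.Unlock()
	close(p.accepted)
}

func main() {
	p := &peerState{accepted: make(chan struct{})}
	released := make(chan struct{})
	go func() {
		p.waitAccept()
		close(released)
	}()
	p.accept()
	<-released
	fmt.Println("waiter released")
}
```

Under Go's memory model, close and receive on the copied channel establish the needed happens-before edge, so the waiter never observes a torn or stale channel value.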
I agree there's no need for a per-peer mutex, since everything is already mutexed.
In your new, simpler implementation there is an edge case I'm not sure whether you should care about:
- setState is called, isPending is set true, a change is created
- before the change is processed, setState is called again (to the same or a different new state)
- since isPending = true, no new change is encoded and waitNoop is returned; however, the change has yet to be handled
Truthfully, the edge case where this causes downstream problems feels possibly non-existent. I just want to identify it.
The potential solution is to put handled back on the peerState and return a func() that uses it whenever isPending = true -- but if you do so, I recommend you copy the reference out of the peerState before returning the wait func, so that any later mutations to the channel in the peer state don't affect the callback. This can be done with a simple closure:
func makeWaitFunc(handled chan struct{}) waitFn {
	return func() {
		<-handled
	}
}
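The closure-capture point can be checked with a small sketch (type names here are hypothetical stand-ins for the real ones): because the channel is passed by value into makeWaitFunc, the returned wait func keeps its own reference, and reassigning the field on the peer state afterwards cannot redirect it.

```go
package main

import "fmt"

type waitFn func()

// makeWaitFunc copies the channel reference via its parameter; later
// mutations to the peerState field cannot redirect this closure.
func makeWaitFunc(handled chan struct{}) waitFn {
	return func() { <-handled }
}

// peerState is a hypothetical stand-in for the real struct.
type peerState struct {
	handled chan struct{}
}

func main() {
	p := &peerState{handled: make(chan struct{})}
	wait := makeWaitFunc(p.handled)

	old := p.handled
	p.handled = make(chan struct{}) // simulate a new pending round

	close(old) // handling the *old* change releases the waiter
	wait()     // returns immediately: the closure captured old
	fmt.Println("released by the original channel")
}
```

Had makeWaitFunc instead read p.handled inside the closure, the wait would have attached to whichever channel happened to be current at call time, which is exactly the mutation hazard the comment warns about.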
Also, are there any edge cases where handled never gets closed? Like if the queue shuts down, for example?
Good feedback! Here's what I've done:
func (c *connectEventManager) makeWaitFunc(handled chan struct{}) waitFn {
	return func() {
		select {
		case <-handled:
		case <-c.done:
		}
	}
}
} else if state.pending {
	// Find the change in the queue and return a wait function for it
	for _, change := range c.changeQueue {
		if change.pid == p {
			return c.makeWaitFunc(change.handled)
		}
	}
	log.Error("a peer was marked as change pending but not found in the change queue")
}
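The select-on-done variant above can be exercised in isolation with a minimal sketch (struct fields here are assumptions, reduced from the snippet): even if a queued change's handled channel is never closed, closing the manager's done channel releases the waiter, so a shutdown cannot leak a blocked goroutine.

```go
package main

import "fmt"

type waitFn func()

// connectEventManager is reduced here to the one field the wait
// function needs; the real struct has more.
type connectEventManager struct {
	done chan struct{}
}

// makeWaitFunc mirrors the snippet above: the wait unblocks either
// when the change is handled or when the manager shuts down.
func (c *connectEventManager) makeWaitFunc(handled chan struct{}) waitFn {
	return func() {
		select {
		case <-handled:
		case <-c.done:
		}
	}
}

func main() {
	c := &connectEventManager{done: make(chan struct{})}
	handled := make(chan struct{}) // never closed
	wait := c.makeWaitFunc(handled)
	close(c.done) // simulate queue shutdown
	wait()        // returns via the done branch
	fmt.Println("wait returned on shutdown")
}
```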
LGTM.
Added a simple test to lock this behaviour in.
Force-pushed from 11fb4a7 to 22fd90d.
The connectEventManager is a very complex piece of code for what it does, and it took tens of minutes to properly understand.
To reiterate the discussion on Slack: I'm not opposed to your approach in #436, but ripping out the existing piece of infrastructure is more radical than attempting to fix it, at least for the purposes we're trying to achieve right now. Perhaps this could be a two-step thing. I see this as a Chesterton's Fence situation: we're all acknowledging that we don't fully appreciate why the connecteventmanager exists and what it's aiming to achieve, and in that case the more prudent approach, rather than just ripping it out, might be to approach it more incrementally ("prudence is to understand why the fence is there in the first place before you attempt to take it down"). There's a singular issue we're trying to fix, and that is the race condition that exists between when a client is … Ditching connecteventmanager possibly fixes that, and other problems that relate to it; but I wouldn't mind having a commit in the …
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SGTM. This is tricky code and I'm not sure about all the interactions, but tests are passing.
Should get #436 merged within a month ideally.
This wasn't caught because the tests hadn't run due to the test.Flaky marker. The tests were testing exactly for the bug #435 fixed.
* feat(connecteventmanager): block Connected() until accepted (Ref: #432; minimal attempt at solving #432)
* fix(connecteventmanager): less complex channel signalling
* fix(connecteventmanager): handle change queue edge cases and closure
* fix(connecteventmanager): add test to confirm sync Connected() call flow
* changelog: put the 435 fix in the right version
* fix(connecteventmanager): clean up tests for new synchronous flow
Fixes: #432
Minimal attempt at solving #432