Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle network errors/stalls #101

Merged
merged 2 commits into from
Oct 13, 2020
Merged

Handle network errors/stalls #101

merged 2 commits into from
Oct 13, 2020

Conversation

hannahhoward
Copy link
Collaborator

@hannahhoward hannahhoward commented Oct 13, 2020

Goals

Using ipfs/go-graphsync#102, provide better information about what's going on with a request and when it stalls.

Implementation

  • Change the FSM behvaior of Disconnected - remove the PeerDisconnected state, and instead just have it set an informational message that we believe the other peer to be stalled. The reason we want to do this is the actual idea behavior here should be that we go into a PeerDisconnected state, and when we come out of it, we reset to the state we were in before the disconnection. Our FSM module really doesn't support this well, so for the time being, I'm just removing the state change and setting a message -- we weren't really doing anything based on being in PeerDisconnected and it was ultimately a black-hole state. (I think PeerDisconnected was meant to handle the fact that graphsync requests were going into an error state when the peers were unlinked -- due to a message send failure in OnChannelCompleted)
  • Change the FSM behavior of Restart -- it actually should do nothing to the state. The only reason we were changing it to Ongoing was to resolve the PeerDisconnected state, and there are several issues that could arise from forcing into the Ongoing state -- we should instead preserve the state of the request overall!
  • Convert a channel removed due to inactivity to an Error -- this is not a request cancelled by a user but rather something bad that happened
  • Add a new event to the Events interface -- OnRequestDisconnected, which occurs when the transport protocol believes there's been a network disconnection (different from OnRequestTimedOut which occurs when request is cancelled due to a restart)
  • Convert ChannelRemoveTimeout to a configurable param on the data transfer manager
  • Hook into the new NetworkError hook in go-graphsync to detect disconnections
  • Convert some message sent failure to request disconnections rather than channel errors -- cause data transfer is an ongoing process, we shouldn't fail the channel entirely just cause a message doesn't go through -- we may want to restart the channel
  • Buff out a few tests for the graphsync transport protocol

Handle errors when network messages don't go through
lint, remove unused status
@@ -33,3 +33,9 @@ const ErrRejected = errorType("response rejected")

// ErrUnsupported indicates an operation is not supported by the transport protocol
const ErrUnsupported = errorType("unsupported")

// ErrDisconnected indicates the other peer may have hung up and you should try restarting the channel.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hannahhoward For my own curiosity:

What's the advantage of defining:

type errorType string

func (e errorType) Error() string {
	return string(e)
}

rather than using errors.New()

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it allows you to use const instead of var -- there is a danger of var being changed -- technically anyone using the module can do so.

Copy link
Collaborator

@aarshkshah1992 aarshkshah1992 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hannahhoward Great stuff ! Really learn a lot from you simplify things.

Just one review comment.

@@ -170,12 +168,42 @@ func (m *manager) OnRequestTimedOut(ctx context.Context, chid datatransfer.Chann
go func() {
select {
case <-ctx.Done():
case <-time.After(ChannelRemoveTimeout):
case <-time.After(m.channelRemoveTimeout):
channel, err := m.channels.GetByID(ctx, chid)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if a channel times/out disconnects and then the FSM/data-trasfer crashes ? I think we should have a handler for this state that starts this wait even when the FSM restarts. Does that make sense ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea we can address this when we have a better way to actually track disconencts other than the Message -- see comment above about challenges of essentially pushing state

@hannahhoward hannahhoward merged commit eee8d1c into master Oct 13, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants