
Documentation Of The State Of This Module And Outstanding Design Smells/Questions #158

hannahhoward opened this issue Mar 11, 2021 · 0 comments

What This Module Intends To Do

Data transfer aims to be an abstract data transfer layer for Filecoin. The primary goals in the design of this module are to:

  • Support multiple protocols
  • Add a legible language for negotiating and authenticating requests (and renegotiating and reauthenticating requests) that is protocol neutral
  • Support push and pull requests and augment underlying transport protocols where needed
  • Add state tracking for long-running data transfers to otherwise stateless protocols (Graphsync is mostly stateless, and HTTP often is, though I've never implemented HTTP as a protocol, so I don't know how much internal state it keeps to handle things like retries and restarts)

How it came to being

While development began in 2019, most of this module landed in a single 7000-line PR in July of 2020 (#55). The requirements at the time weren't fully clear, and we probably could have tested things more extensively. We then rapidly iterated in the middle of Space Race to deliver more features, such as restarts. As a result, a variety of "design smells" linger in the back of my mind, which I'm attempting to document here. They may not be smells so much as unfinished work, tech debt, or simply open questions.

Big Design Questions, Current Answers, And Open Questions

The Transport Boundary

Of the goals listed above, one large, open question is where the boundary with the transport should be -- should data transfer expect protocols to do a lot on their own? To eventually support multiple transports, I defined a "Transport" interface that every protocol must satisfy to support working with go-data-transfer.

This means there are three separate layers at which transfer-related functionality might live:

  • the protocol library itself (i.e. go-graphsync)
  • the protocol specific implementation of the Transport interface (currently living in go-data-transfer, but could live outside the repo)
  • the non-protocol specific parts of go-data-transfer
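To make the middle layer concrete, here is a minimal sketch of what a protocol-neutral Transport boundary might look like. All names here (`ChannelID`, `OpenChannel`, `graphsyncTransport`) are illustrative stand-ins, not the actual go-data-transfer API:

```go
package main

import "fmt"

// ChannelID identifies a single transfer channel (illustrative).
type ChannelID struct {
	Initiator string
	ID        uint64
}

// Transport is the hypothetical protocol-neutral boundary: each
// protocol (e.g. Graphsync) would provide one implementation.
type Transport interface {
	// OpenChannel asks the underlying protocol to start moving data.
	OpenChannel(chid ChannelID, root string) error
	// CloseChannel tears the protocol request down.
	CloseChannel(chid ChannelID) error
}

// graphsyncTransport stands in for the protocol-specific glue layer.
type graphsyncTransport struct {
	open map[ChannelID]bool
}

func newGraphsyncTransport() *graphsyncTransport {
	return &graphsyncTransport{open: map[ChannelID]bool{}}
}

func (t *graphsyncTransport) OpenChannel(chid ChannelID, root string) error {
	t.open[chid] = true // in reality: issue a Graphsync request for root
	return nil
}

func (t *graphsyncTransport) CloseChannel(chid ChannelID) error {
	delete(t.open, chid)
	return nil
}

func main() {
	var tp Transport = newGraphsyncTransport()
	chid := ChannelID{Initiator: "peerA", ID: 1}
	if err := tp.OpenChannel(chid, "someRootCid"); err != nil {
		fmt.Println("open failed:", err)
	}
	fmt.Println("channel opened")
}
```

The design question in the text is how much logic lives behind this interface versus in the protocol library below it or the generic core above it.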

Consider two contradictory outstanding requests:

My original thought was that the proper answer here is "put it in the Transport implementation", but that hasn't always been what happened. For example, CID tracking for request restarts bled up into the core of the data transfer module.

Sidebar: does the Transport interface even make sense? Might it be better to have data transfer talk directly to protocols, implementing them one by one and hand-coding the choices?

Here's my own take on design going forward. These are purely my hunches:

  • Any version of data transfer that works with multiple protocols is going to need a generic transport interface that is fairly high level
  • At the same time, historically, when we've built libraries for our protocols (e.g. go-bitswap, go-graphsync), we've often found ourselves mixing low-level protocol implementations with high-level logic for managing work with a protocol. Later, we wished there were a clear separation between the two (see the never fully extracted Bitswap sessions)
  • For this reason, I like the "Transport" implementation as the place for all the glue between what the protocol library does (lower level) and what data transfer does.
  • This, among other things, makes me think the Transport should have a way to read from and write to disk storage -- that would enable pushing CID lists for Graphsync restarts down to the Transport layer.
  • If we take this approach, the Transport implementations become larger, and we might consider moving them into their own repo; go-data-transfer is already large.
  • At the same time, some protocols just can't support every feature data transfer needs, no matter how you implement Transport. So we probably need a way for the Transport interface to describe its capabilities. Right now there's a primitive version of that in the "Transport" vs "PausableTransport" augmentation, but it probably needs to be more robust.
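One way the last point could look is an explicit capability report on the Transport interface, which the manager consults before attempting an operation. This is purely a hypothetical sketch (the `Capabilities` struct and `requirePause` helper are invented names, not existing API), replacing the current Transport/PausableTransport type-assertion split:

```go
package main

import (
	"errors"
	"fmt"
)

// Capabilities describes what a given transport can do (hypothetical).
type Capabilities struct {
	Pausable     bool // can in-flight requests be paused and resumed?
	Restartable  bool // can a new request pick up where an old one left off?
	Revalidation bool // can transfers be interrupted to demand new vouchers?
}

type Transport interface {
	Capabilities() Capabilities
}

// requirePause is how a manager might guard a pause call instead of
// type-asserting to a PausableTransport.
func requirePause(t Transport) error {
	if !t.Capabilities().Pausable {
		return errors.New("transport does not support pause/resume")
	}
	return nil
}

// httpTransport is an imagined transport that can restart (e.g. via
// Range requests) but cannot pause.
type httpTransport struct{}

func (httpTransport) Capabilities() Capabilities {
	return Capabilities{Restartable: true}
}

func main() {
	if err := requirePause(httpTransport{}); err != nil {
		fmt.Println("pause rejected:", err)
	}
}
```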

Also, the work to support multiple transports on a basic level is unfinished:

A giant unused and buggy feature: Push Request Intermediate Validation

The retrieval market needs a complex system for pausing and revalidating transfers. It uses Pull requests, where the data transfer responder is also the sender of data (as opposed to push where the initiator is the sender of data). The data transfer system enables this via the Revalidator interface.

When I was implementing this system (in the aforementioned 7000-line PR), for symmetry's sake I attempted to implement the entire intermediate request validation system for Push requests as well. Since it's an unused feature, it has not been battle-tested. To complete the concept of symmetry, I imagined that the responder to the push request, a.k.a. the receiver of data, would be the one controlling the intermediate vouchers -- essentially, a responder saying it won't accept more data until it receives another voucher. The need to support this kind of push is what motivated the design for requestor pause documented in ipfs/go-graphsync#160 -- I figured cancelling a graphsync request was the only way to ensure we wouldn't accept more data, so I implemented pause as a cancel.

Suffice it to say, these features haven't been used by Lotus yet, so they may contain bugs. I'm not sure whether we should remove this whole feature; if we're going to keep it, we ought to at least identify a case where it could be used! And if we do keep it, then when we think about a way to define Transport capabilities, we should figure out how to express what revalidation we do or don't support (Graphsync can really only support Pull reasonably, and I don't think HTTP can support revalidation at all -- or it would be complicated to do).
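For readers unfamiliar with intermediate validation, the pull-side flow is roughly: the sender counts bytes, pauses when it is owed a voucher, and resumes once one arrives. Here is a toy sketch of that flow; the names (`payPerByteRevalidator`, `onDataSent`, `onVoucherReceived`) are invented for illustration and the real Revalidator interface differs:

```go
package main

import "fmt"

// payPerByteRevalidator pauses the transfer every `interval` bytes
// until a fresh voucher pays for the outstanding data (toy model).
type payPerByteRevalidator struct {
	interval  uint64
	sent      uint64
	unpaidFor uint64
}

// onDataSent is called as the responder sends data; it returns true
// when the transfer should pause to await a new voucher.
func (r *payPerByteRevalidator) onDataSent(n uint64) (pause bool) {
	r.sent += n
	r.unpaidFor += n
	return r.unpaidFor >= r.interval
}

// onVoucherReceived credits payment and reports whether to resume.
func (r *payPerByteRevalidator) onVoucherReceived(paidBytes uint64) (resume bool) {
	if paidBytes >= r.unpaidFor {
		r.unpaidFor = 0
		return true
	}
	r.unpaidFor -= paidBytes
	return false
}

func main() {
	r := &payPerByteRevalidator{interval: 1 << 20} // demand a voucher per MiB
	fmt.Println("pause after 2 MiB:", r.onDataSent(2<<20))
	fmt.Println("resume after payment:", r.onVoucherReceived(2<<20))
}
```

The push-side symmetry the text describes would flip this: the *receiver* would refuse to accept further data until it issued a new voucher, which is what forced the pause-as-cancel design on the Graphsync requestor side.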

Pause / Resume vs Restart

Another complicated aspect of data transfer's design is that it has distinct features for pause/resume AND request restarting. These are two distinct concepts, I think, but their implementations bleed into each other a little. It's worth documenting how I imagine them working -- maybe in the future they are not different enough to merit a distinction:

  • Pause/Resume means an underlying protocol request has been put into a paused state, from which it can be resumed. The state of being paused is kept in the protocol library's memory and, in my mind, is almost always controlled by the party sending data. Upon a node restart, it basically does not exist (though currently we track the state of being paused in the channel state). Pause/Resume is primarily used for revalidation in retrieval, though technically PauseDataTransferChannel and ResumeDataTransferChannel exist as methods on the manager that can be called programmatically. However, as documented in ipfs/go-graphsync#160, requestor pausing probably needs a rearchitect or should be removed entirely: these methods are extremely unreliable when called by the party receiving data over Graphsync, because on the requestor side pause is ultimately implemented like a restart.

  • Restart means data transfer attempts to create a new protocol request that somehow picks up where the last one left off. In the case of Graphsync, we do this by creating a new request with the "graphsync/do-not-send-cids" extension. The ability to restart is controlled entirely by data transfer. The state needed to restart is persisted to disk and should last beyond a node shutdown. Restart is primarily used to survive network disconnects and also node restarts.
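The restart mechanics can be sketched as follows: data transfer persists the set of CIDs already received, and a restarted request carries that set so the sender skips them. The extension name "graphsync/do-not-send-cids" is real (its actual payload is an encoded CID list), but the `restartState` type and helper below are invented for illustration:

```go
package main

import "fmt"

// restartState stands in for what data transfer persists to disk
// per channel so a transfer can survive a node restart.
type restartState struct {
	receivedCIDs map[string]bool
}

func (s *restartState) record(cid string) { s.receivedCIDs[cid] = true }

// buildRestartExtensions produces the extensions for the new request:
// the sender should skip everything we already have.
func buildRestartExtensions(s *restartState) map[string][]string {
	var skip []string
	for c := range s.receivedCIDs {
		skip = append(skip, c)
	}
	return map[string][]string{"graphsync/do-not-send-cids": skip}
}

func main() {
	st := &restartState{receivedCIDs: map[string]bool{}}
	st.record("blockA")
	st.record("blockB")
	ext := buildRestartExtensions(st)
	fmt.Println("cids to skip on restart:", len(ext["graphsync/do-not-send-cids"]))
}
```

This is the CID tracking that, as noted earlier, currently bleeds up into the core of the module rather than living in the Graphsync Transport implementation.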

State tracking & on disk storage

For tracking the state of each channel, I use Finite State Machines from go-statemachine.

I used state machines for two reasons:

  1. I was comfortable with them from go-fil-markets
  2. They have a basic async update interface that works well for not blocking data transfers

However, the statemachine in go-data-transfer is a bit of a mess. And I wonder if it's the right solution. The benefit of the statemachine design is also what makes it hard to work with; since updates are asynchronous, you dispatch events without immediately seeing their effect on channel state. If you need an updated channel state, there are ways to force this to work, but it's awkward.
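The awkwardness described above can be boiled down to a toy model: events are queued and applied on a background goroutine, so the dispatcher cannot observe the new state immediately and has to wait for the machine to catch up. This sketch is my own simplification, not go-statemachine's actual API:

```go
package main

import (
	"fmt"
	"sync"
)

// channelFSM is a toy async state machine: Dispatch queues an event
// and returns before the event is applied.
type channelFSM struct {
	mu     sync.Mutex
	status string
	events chan string
	done   chan struct{}
}

func newChannelFSM() *channelFSM {
	f := &channelFSM{
		status: "Requested",
		events: make(chan string, 16),
		done:   make(chan struct{}),
	}
	go f.run()
	return f
}

// Dispatch queues an event; the caller cannot see its effect yet.
func (f *channelFSM) Dispatch(ev string) { f.events <- ev }

func (f *channelFSM) run() {
	for ev := range f.events {
		f.mu.Lock()
		switch ev {
		case "Accept":
			f.status = "Ongoing"
		case "Complete":
			f.status = "Completed"
		}
		f.mu.Unlock()
	}
	close(f.done)
}

// Sync drains all queued events -- the awkward "force the state to
// catch up" step the text describes (one-shot in this toy version).
func (f *channelFSM) Sync() {
	close(f.events)
	<-f.done
}

func (f *channelFSM) Status() string {
	f.mu.Lock()
	defer f.mu.Unlock()
	return f.status
}

func main() {
	f := newChannelFSM()
	f.Dispatch("Accept")
	// Status here may still read "Requested": the event is queued, not applied.
	f.Sync()
	fmt.Println("status after sync:", f.Status())
}
```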

Voucher system

I feel good about the voucher system in general and I think it works well. The mechanics of voucher revalidation on restart could probably use another look -- I haven't actually taken a look at whether validating vouchers on restart for retrieval works well.

The registry system, while basic, basically works :) I kinda like it.

One design failure comes up here: #150. The various Voucher / VoucherResult methods on ChannelState really should return an error, because they defer deserialization to the method call, which can most definitely fail.
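The shape of the fix proposed in #150 might look like the following. This is a sketch only: the real vouchers are IPLD/CBOR-encoded, and JSON here just stands in for "deserialization that can fail"; the `voucher` and `channelState` types are invented:

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// voucher is a stand-in for a deal voucher payload.
type voucher struct {
	PayloadCID string `json:"payloadCID"`
}

// channelState stores the voucher encoded and decodes it lazily.
type channelState struct {
	rawVoucher []byte
}

// Voucher defers deserialization to the call site, so it must be
// able to report failure -- hence the error return.
func (cs *channelState) Voucher() (*voucher, error) {
	if len(cs.rawVoucher) == 0 {
		return nil, errors.New("no voucher recorded")
	}
	var v voucher
	if err := json.Unmarshal(cs.rawVoucher, &v); err != nil {
		return nil, fmt.Errorf("decoding voucher: %w", err)
	}
	return &v, nil
}

func main() {
	good := &channelState{rawVoucher: []byte(`{"payloadCID":"someCid"}`)}
	if v, err := good.Voucher(); err == nil {
		fmt.Println("voucher cid:", v.PayloadCID)
	}
	bad := &channelState{rawVoucher: []byte(`not json`)}
	if _, err := bad.Voucher(); err != nil {
		fmt.Println("decode failed as expected")
	}
}
```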

Documentation

This feels like the beginning of an architecture doc, like the one we have in go-graphsync. That doc has been really helpful for people working on go-graphsync, and I wonder if we should build one here. Also, if we continue to use state machines: over in go-fil-markets we autogenerate state machine diagrams, which is a nice feature of the FSM code, and I wonder if we should do that here too.
