Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: replace go-merkledag walk with go-ipld-prime traversal for dag export #8506

Merged
merged 2 commits into from
Oct 26, 2021
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
49 changes: 20 additions & 29 deletions core/commands/dag/export.go
Original file line number Diff line number Diff line change
@@ -1,24 +1,26 @@
package dagcmd

import (
"context"
"errors"
"fmt"
"io"
"os"
"time"

"github.com/cheggaaa/pb"
blocks "github.com/ipfs/go-block-format"
cid "github.com/ipfs/go-cid"
"github.com/ipfs/go-ipfs/core/commands/cmdenv"
ipld "github.com/ipfs/go-ipld-format"
mdag "github.com/ipfs/go-merkledag"
iface "github.com/ipfs/interface-go-ipfs-core"

cmds "github.com/ipfs/go-ipfs-cmds"
gocar "github.com/ipld/go-car"
selectorparse "github.com/ipld/go-ipld-prime/traversal/selector/parse"
)

func dagExport(req *cmds.Request, res cmds.ResponseEmitter, env cmds.Environment) error {

c, err := cid.Decode(req.Arguments[0])
if err != nil {
return fmt.Errorf(
Expand All @@ -32,24 +34,6 @@ func dagExport(req *cmds.Request, res cmds.ResponseEmitter, env cmds.Environment
return err
}

// Code disabled until descent-issue in go-ipld-prime is fixed
// https://github.com/ribasushi/gip-muddle-up
//
// sb := gipselectorbuilder.NewSelectorSpecBuilder(gipfree.NodeBuilder())
// car := gocar.NewSelectiveCar(
// req.Context,
// <needs to be fixed to take format.NodeGetter as well>,
// []gocar.Dag{gocar.Dag{
// Root: c,
// Selector: sb.ExploreRecursive(
// gipselector.RecursionLimitNone(),
// sb.ExploreAll(sb.ExploreRecursiveEdge()),
// ).Node(),
// }},
// )
// ...
// if err := car.Write(pipeW); err != nil {}

pipeR, pipeW := io.Pipe()

errCh := make(chan error, 2) // we only report the 1st error
Expand All @@ -61,15 +45,12 @@ func dagExport(req *cmds.Request, res cmds.ResponseEmitter, env cmds.Environment
close(errCh)
}()

if err := gocar.WriteCar(
req.Context,
mdag.NewSession(
req.Context,
api.Dag(),
),
[]cid.Cid{c},
pipeW,
); err != nil {
store := dagStore{dag: api.Dag(), ctx: req.Context}
dag := gocar.Dag{Root: c, Selector: selectorparse.CommonSelector_ExploreAllRecursively}
// TraverseLinksOnlyOnce is safe for an exhaustive selector but won't be when we allow
// arbitrary selectors here
car := gocar.NewSelectiveCar(req.Context, store, []gocar.Dag{dag}, gocar.TraverseLinksOnlyOnce())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rvagg is this going to be really slow for network operations? My understanding is that go-merkledag has things like concurrent DAG traversal (and prefetching, although that might just be for UnixFS data) whereas go-ipld-prime traversal does not currently have that.

My guess is that this means gocar's traversal is also going to be slow and linear.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, well I'm going to claim ignorance on many of those counts, but maybe @warpfork can help out here

it's true that the vast majority of data going through here will be unixfs, so optimisations for that are going to be beneficial if they make a difference, but maybe the case is to build some of that in to go-ipld-prime instead

Copy link
Member

@warpfork warpfork Oct 12, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's unclear to me how much we should worry about "prefetching" or concurrent traversal.

I acknowledge that those features exist in some code, but that does not necessarily imply that they need to exist in the world we're building towards. (Maybe it does; maybe it doesn't. If there's not a benchmark proving it, and a documented user story stating why it's required in context, maybe it's time to review the perceived need.)

ISTM the whole point of CAR files (and a lot of the new systems we're building) are to give people new APIs that explicitly avoid having one query turn into an unbounded number of surprise subqueries of unknown latency. (This in turn is a goal that makes sense because once those horses are allowed to bolt the barn, all possible work is damage control: one can do the Sisyphean work to try to minimize the costs, but one can never make them predictable again, nor as minimal as purely local operation would've been.)

So when I think through what I'd want as an end user, probably I'd want the CAR APIs to explicitly error, or return me a partial CAR, if data would need to be fetched that isn't already local.

Which would mean prefetching is radically less relevant. (It'd be widdling around disk access latency rather than worrying about internet network latency, or heaven forbid DHT lookup latency. At that point, it probably drops into "nice to have, as an optimization, that we can add... someday" territory... especially because simply having a predictable local-only mode would still probably be noticably faster than something that hits even one DHT lookup.)

I don't know if that's the only mode of use to care about; at some point that becomes an "ipfs" decision, and I'll leave that decision for someone else. But if I walk through this, starting from visualizing myself as an end user, I'd be looking for APIs that are predictable and fast, and that means I'm looking to not use the network at all, and that really minimizes the relevance of prefetching.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ISTM users also pretty repeatedly tell us that this is what they're interested in. For example, QRI's bulk transporter, which builds whole graphs of content, and ships them all at once, point to point, is evidence of this: that system does not do network requests for missing pieces, but just moves what they have: and this has worked well for them, to the point that they ditched basically any of our other transport mechanisms that were less predictable.

My understanding is that the holistic goal of CAR-based APIs is to unlock that kind of victory in an even better-supported way.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unclear to me how much we should worry about "prefetching" or concurrent traversal. ... need to exist in the world we're building towards.

@warpfork if you are not building towards reasonable-to-massive parallelism at every step and layer of your designs, you are building for an already obsolete world. Latencies continue to increase in every piece of interconnect, regardless of its position in the stack. The only salvation is "do more things", especially in the world of quickly-branching DAGs we find ourselves in.

On modern systems with locally-attached NVMe even something as trivial as

echo /same-fs/file-{1,2,3,4,5,6,7,8} | xargs cat >/dev/null

is being blown out of the water at a near linear factor by

echo /same-fs/file-{1,2,3,4,5,6,7,8} | xargs -n1 -P8 cat >/dev/null

( I need to preload VFS caches quite often, so know this behavior painfully well )

One of the main point of the batteries-included go-ipfs application is to create (an illusion of) a locality-independent-accessible content. I strongly believe that at the current place of its lifecycle, go-ipfs does not get the luxury of redefining its implicit API contracts. Nor does any greenfield project ( like e.g. Estuary ) gets to avoid the need for parallel traversal either: if anything, removing much of the architectural debt induced roadblocks, makes the inability to saturate any reasonable hardware even more apparent.

Please reconsider your position.

Copy link
Member

@warpfork warpfork Oct 12, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a position.

I'm just saying if somebody finds this important, they'd better figure out how important, and figure out who's going to work on it.

I'd recommend starting with a document about what the user story is that we care about, and a benchmark that gives us a needle to focus on moving. The only thing I'm being cautious about here is letting the thread start asking for work to be done with neither of those in hand.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So when I think through what I'd want as an end user, probably I'd want the CAR APIs to explicitly error, or return me a partial CAR, if data would need to be fetched that isn't already local.

This seems like a reasonable use of the --offline flag (generally already supported across go-ipfs commands and here as well). However, for those who use the online mode being able to download things with reasonable speed is important.

If there's not a benchmark proving it

Is this a request for a benchmark showing that a linear DAG walk with a latency per retrieval of 1s is slower than one with concurrent traversal and prefetching? Seems unnecessary, but making a basic one seems pretty straightforward if necessary.

and a documented user story stating why it's required in context, maybe it's time to review the perceived need

One is using car export to create verifiable gateways (i.e. trusting the gateway only with availability rather than also data integrity).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rvagg I poked around here a bit. The current version of go-car uses slow linear traversal as well so I'm fine with this PR (after we do a go mod tidy so tests pass).

However, it seems likely we're going to have to revisit this topic as linear DAG traversal is going to be a problem for folks interested in verifiable gateways or generally folks interested in pulling CAR files out of the network (or off of high latency disks which is true for a number of our collabs).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@warpfork prefetching is a a very high priority and I feel like I've been saying this for a while. It's already limiting graphsync performance to disk latency, and basically means we can't use selectors anywhere in go-ipfs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can tell you that the rate-limit will be blockSize/rtt which limits us to a few MiB/s assuming 256KiB blocks. Assuming smaller data, it gets much worse. If you want a benchmark, you can pretty easily test ipfs dag export (no prefetching) versus ipfs pin add (parallel download) versus ipfs get (some prefetching that kind of sucks).

if err := car.Write(pipeW); err != nil {
errCh <- err
}
}()
Expand Down Expand Up @@ -153,3 +134,13 @@ func finishCLIExport(res cmds.Response, re cmds.ResponseEmitter) error {
}
}
}

type dagStore struct {
dag iface.APIDagService
ctx context.Context
}

func (ds dagStore) Get(c cid.Cid) (blocks.Block, error) {
obj, err := ds.dag.Get(ds.ctx, c)
return obj, err
}
4 changes: 2 additions & 2 deletions go.mod
Original file line number Diff line number Diff line change
Expand Up @@ -58,9 +58,9 @@ require (
github.com/ipfs/go-verifcid v0.0.1
github.com/ipfs/interface-go-ipfs-core v0.5.1
github.com/ipfs/tar-utils v0.0.1
github.com/ipld/go-car v0.3.1
github.com/ipld/go-car v0.3.2
github.com/ipld/go-codec-dagpb v1.3.0
github.com/ipld/go-ipld-prime v0.12.2
github.com/ipld/go-ipld-prime v0.12.3
github.com/jbenet/go-random v0.0.0-20190219211222-123a90aedc0c
github.com/jbenet/go-temp-err-catcher v0.1.0
github.com/jbenet/goprocess v0.1.4
Expand Down
8 changes: 4 additions & 4 deletions go.sum
Original file line number Diff line number Diff line change
Expand Up @@ -571,16 +571,16 @@ github.com/ipfs/interface-go-ipfs-core v0.5.1 h1:1KMM7RkjUD8W5fSoRsa9xR6ZMzeL8fL
github.com/ipfs/interface-go-ipfs-core v0.5.1/go.mod h1:lNBJrdXHtWS46evMPBdWtDQMDsrKcGbxCOGoKLkztOE=
github.com/ipfs/tar-utils v0.0.1 h1:8Na0KBD6GddGyXwU4rXNtVTE24iuZws8mENJQPLG7W4=
github.com/ipfs/tar-utils v0.0.1/go.mod h1:ACflm9wXvV9w0eMJt6yYXxS2zuIV+yXGNwbuq1bhLeE=
github.com/ipld/go-car v0.3.1 h1:WT+3cdmXlvmWOlGxk9webhj4auGO5QvgqC2vCCkFRXs=
github.com/ipld/go-car v0.3.1/go.mod h1:dPkEWeAK8KaVvH5TahaCs6Mncpd4lDMpkbs0/SPzuVs=
github.com/ipld/go-car v0.3.2 h1:V9wt/80FNfbMRWSD98W5br6fyjUAyVgI2lDOTZX16Lg=
github.com/ipld/go-car v0.3.2/go.mod h1:WEjynkVt04dr0GwJhry0KlaTeSDEiEYyMPOxDBQ17KE=
github.com/ipld/go-codec-dagpb v1.2.0/go.mod h1:6nBN7X7h8EOsEejZGqC7tej5drsdBAXbMHyBT+Fne5s=
github.com/ipld/go-codec-dagpb v1.3.0 h1:czTcaoAuNNyIYWs6Qe01DJ+sEX7B+1Z0LcXjSatMGe8=
github.com/ipld/go-codec-dagpb v1.3.0/go.mod h1:ga4JTU3abYApDC3pZ00BC2RSvC3qfBb9MSJkMLSwnhA=
github.com/ipld/go-ipld-prime v0.9.0/go.mod h1:KvBLMr4PX1gWptgkzRjVZCrLmSGcZCb/jioOQwCqZN8=
github.com/ipld/go-ipld-prime v0.9.1-0.20210324083106-dc342a9917db/go.mod h1:KvBLMr4PX1gWptgkzRjVZCrLmSGcZCb/jioOQwCqZN8=
github.com/ipld/go-ipld-prime v0.11.0/go.mod h1:+WIAkokurHmZ/KwzDOMUuoeJgaRQktHtEaLglS3ZeV8=
github.com/ipld/go-ipld-prime v0.12.2 h1:StIquYvKIRuSEAtjJDr39fyzBtziioHPwVC75tBiXzo=
github.com/ipld/go-ipld-prime v0.12.2/go.mod h1:PaeLYq8k6dJLmDUSLrzkEpoGV4PEfe/1OtFN/eALOc8=
github.com/ipld/go-ipld-prime v0.12.3 h1:furVobw7UBLQZwlEwfE26tYORy3PAK8VYSgZOSr3JMQ=
github.com/ipld/go-ipld-prime v0.12.3/go.mod h1:PaeLYq8k6dJLmDUSLrzkEpoGV4PEfe/1OtFN/eALOc8=
github.com/jackpal/gateway v1.0.5/go.mod h1:lTpwd4ACLXmpyiCTRtfiNyVnUmqT9RivzCDQetPfnjA=
github.com/jackpal/go-nat-pmp v1.0.1/go.mod h1:QPH045xvCAeXUZOxsnwmrtiCoxIr9eob+4orBN1SBKc=
github.com/jackpal/go-nat-pmp v1.0.2 h1:KzKSgb7qkJvOUTqYl9/Hg/me3pWgBmERKrTGD7BdWus=
Expand Down