
Fetch Hamt Children Nodes in Parallel #4979

Closed
wants to merge 16 commits

Conversation

Contributor

@kevina kevina commented Apr 26, 2018

Closes #4908

@kevina kevina requested a review from Kubuxu as a code owner April 26, 2018 07:02
@ghost ghost assigned kevina Apr 26, 2018
@ghost ghost added the status/in-progress In progress label Apr 26, 2018
@kevina kevina force-pushed the kevina/parallel-hamt-fetch branch 3 times, most recently from f78d58b to b6ba27d Compare April 29, 2018 21:55
@kevina kevina changed the title from "WIP: Fetch Hamt Children Nodes in Parallel" to "Fetch Hamt Children Nodes in Parallel" Apr 29, 2018
@kevina kevina force-pushed the kevina/parallel-hamt-fetch branch 2 times, most recently from 7537e29 to c6be2e2 Compare April 30, 2018 10:18
@whyrusleeping
Member

If we're looking for more perf, using bitswap sessions for this would help; here's a small patch that enables that: https://github.com/ipfs/go-ipfs/compare/feat/ls-session?expand=1
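
A rough sketch of what sessions buy here (the helper name and the use of go-merkledag's NewSession are illustrative, using present-day import paths, not the linked patch itself):

package hamtfetch

import (
	"context"

	cid "github.com/ipfs/go-cid"
	ipld "github.com/ipfs/go-ipld-format"
	dag "github.com/ipfs/go-merkledag"
)

// fetchWithSession is a hypothetical helper: NewSession wraps the DAGService
// in a session-scoped NodeGetter, so wants for related blocks are routed to
// the peers that already served earlier blocks of the same directory.
func fetchWithSession(ctx context.Context, ds ipld.DAGService, cids []cid.Cid) ([]ipld.Node, error) {
	getter := dag.NewSession(ctx, ds) // all GetMany calls below share one bitswap session
	var nodes []ipld.Node
	for opt := range getter.GetMany(ctx, cids) {
		if opt.Err != nil {
			return nil, opt.Err
		}
		nodes = append(nodes, opt.Node)
	}
	return nodes, nil
}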

@Kubuxu Kubuxu added the status/WIP This a Work In Progress label May 1, 2018
@Kubuxu
Member

Kubuxu commented May 1, 2018

Flagging as WIP; please remove the flag and ping me when done.

@kevina kevina mentioned this pull request May 3, 2018
@ajbouh

ajbouh commented Jul 17, 2018

@kevina What's the latest here? Eagerly awaiting this change! :)

@whyrusleeping
Member

@ajbouh I think right now we just need some cleanup and review. cc @magik6k and @schomatis for review

@whyrusleeping whyrusleeping added the topic/perf Performance label Jul 17, 2018
@schomatis
Contributor

So, right now my only knowledge of HAMT directories is what @Stebalien explained to me in #5157 (comment). I definitely want to learn more about them since they are on the Files API milestone roadmap, but that will take me some time, so I can't contribute a meaningful review at the moment.

@whyrusleeping
Member

@schomatis no real knowledge of HAMTs in particular is needed for this review. This is just a prefetching algorithm that should work over any tree.
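
To make "works over any tree" concrete, a toy level-by-level prefetch over any ipld.NodeGetter might look like this (function name and breadth-first strategy are illustrative only, using today's go-ipld-format signatures):

package hamtfetch

import (
	"context"

	cid "github.com/ipfs/go-cid"
	ipld "github.com/ipfs/go-ipld-format"
)

// prefetchTree walks a DAG level by level, requesting each level's children
// in one GetMany call so the node getter (e.g. bitswap) can batch the wants.
func prefetchTree(ctx context.Context, ng ipld.NodeGetter, root cid.Cid) error {
	level := []cid.Cid{root}
	for len(level) > 0 {
		var next []cid.Cid
		for opt := range ng.GetMany(ctx, level) {
			if opt.Err != nil {
				return opt.Err
			}
			for _, lnk := range opt.Node.Links() {
				next = append(next, lnk.Cid)
			}
		}
		level = next
	}
	return nil
}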

@schomatis
Contributor

Great, I'll take a look at it then.

(GitHub is acting weird.)

@whyrusleeping
Member

(GitHub is acting weird.)

Yes. It hasn't been doing well today.

Contributor

@schomatis schomatis left a comment

This is a partial review only of the fetcher, not the modified HAMT logic in hamt.go (which I'm not familiar with).

"context"
"time"
//"fmt"
//"os"
Contributor

Could we remove these?


var log = logging.Logger("hamt")

// fetcher implements a background fether to retrieve missing child
Contributor

fether -> fetcher


// fetcher implements a background fether to retrieve missing child
// shards in large batches. It attempts to retrieves the missing
// shards in an order that allow streaming of the complete hamt
Contributor

I'm not sure what this order is referring to.

dserv ipld.DAGService

reqRes chan *Shard
result chan result
Contributor

Could we add some documentation for these two fields?

todo jobStack // stack of jobs that still need to be done
jobs map[*Shard]*job // map of all jobs in which the results have not been collected yet

// stats relevent for streaming the complete hamt directory
Contributor

relevent -> relevant

}

func (f *fetcher) mainLoop() {
var want *Shard
Contributor

I'm still uncertain about how want works; can we document this field?

fetched.vals[string(no.Node.Cid().Bytes())] = hamt
}
for _, job := range bj.jobs {
job.res = fetched
Contributor

Do all of the jobs get all of the results? (Related to comment at line 90.)

Contributor Author

Yes. And I do this for code simplicity and performance reasons.

delete(f.jobs, j.id)
f.doneCnt--
if len(j.res.errs) != 0 {
return
Contributor

If there is a single error, the entire child-fetching chain is cut? Could that be documented in the fetcher?

Contributor Author

@kevina kevina Jul 20, 2018

I am not sure I understand the question.

Edit: Sorry the question makes sense now, it has been a while since I last touched this code.

case bj := <-f.done:
f.doneCnt += len(bj.jobs)
f.cidCnt += len(bj.cids)
f.launch()
Contributor

I'm having trouble following the launch()/idle dynamic, but this unconditional launch() call seems to guarantee that we don't reach a deadlock state where idle is false and no one can call launch() to unblock it, right?
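
A schematic of the launch()-after-done shape being discussed; the type, field, and constant names below (batchJob, inFlight, maxInFlight) are stand-ins for illustration, not the PR's actual code:

package main

import "fmt"

type batchJob struct{}

type fetcher struct {
	todo     []*batchJob    // work not yet started
	inFlight int            // batches currently being fetched
	done     chan *batchJob // completed batches come back here
}

const maxInFlight = 2

// launch starts more work whenever there is queued work and spare capacity.
func (f *fetcher) launch() {
	for f.inFlight < maxInFlight && len(f.todo) > 0 {
		job := f.todo[0]
		f.todo = f.todo[1:]
		f.inFlight++
		go func(j *batchJob) {
			// ... perform the batch fetch ...
			f.done <- j
		}(job)
	}
}

func (f *fetcher) mainLoop() {
	f.launch()
	for f.inFlight > 0 {
		bj := <-f.done
		_ = bj
		f.inFlight--
		// Unconditional launch after every completion: queued work keeps
		// draining even when no new requests arrive, so the loop can never
		// be stuck waiting for a launch() that nobody will make.
		f.launch()
	}
}

func main() {
	f := &fetcher{todo: []*batchJob{{}, {}, {}, {}, {}}, done: make(chan *batchJob)}
	f.mainLoop()
	fmt.Println("all batches done")
}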

// the result of the batch request and not just the single job. In
// particular, if the 'errs' field is empty the 'vals' of the result
// is guaranteed to contain the all the missing child shards, but the
// map may also contain child shards of other jobs in the batch
Contributor

Not sure I understand, so I may get more than I asked for? Or rather, many different get operation results may be returned in a single result of a particular get?

Contributor Author

You get more than you ask for.

@schomatis
Contributor

@kevina Most of the review comments are minor details, so feel free to ignore them; the only two relevant ones are at lines 152 and 162.


func (ds *Shard) missingChildShards() []*cid.Cid {
if len(ds.children) != len(ds.nd.Links()) {
panic("inconsistent lengths between children array and Links array")
Member

Are there any scenarios that this could happen with a maliciously crafted node?

Contributor Author

Not sure. The reason I panic here is because I am unsure what to do with the error.

Contributor Author

@kevina kevina Jul 20, 2018

Actually, the reason for the panic is that it should not happen. This is only called inside preloadChildren, which already verifies the link.

Edit: This needs more careful thought.

Contributor Author

Okay. This should be fixed now.

// are not already loaded they are fetched in parallel using GetMany
func (ds *Shard) preloadChildren(ctx context.Context, f *fetcher) error {
if len(ds.children) != len(ds.nd.Links()) {
return fmt.Errorf("inconsistent lengths between children array and Links array")
Member

feels inconsistent to panic over this in one place, but throw an error here.

Contributor Author

Fixed.

@kevina kevina force-pushed the kevina/parallel-hamt-fetch branch from 37fb98b to c1605a6 Compare July 20, 2018 21:17
@kevina
Contributor Author

kevina commented Jul 20, 2018

Note: Removed the use of bitswap sessions as I don't fully understand that code and I think it is causing tests to fail. Someone else can p.r. that in separately.

@kevina kevina force-pushed the kevina/parallel-hamt-fetch branch 2 times, most recently from 1aae2e2 to 1f37b9c Compare July 21, 2018 11:11
@kevina kevina removed the status/WIP This a Work In Progress label Jul 22, 2018
kevina added 12 commits July 22, 2018 23:29
@kevina kevina force-pushed the kevina/parallel-hamt-fetch branch from 762311a to 80f212c Compare July 23, 2018 03:33
Member

@magik6k magik6k left a comment

Only looked at the fetcher implementation. Logic seems to make sense, but is a bit too hard to read.

Instead of using go logger I'd use go-log - https://github.com/ipfs/go-log/blob/5dc2060baaf8db344f31dafd852340b93811d03f/log.go#L63-L77 (See ipfs/notes#277 for more on tracing stuff)


// startFetcher starts a new fetcher in the background
func startFetcher(ctx context.Context, dserv ipld.DAGService) *fetcher {
log.Infof("fetcher: starting...")
Member

log.Debugf / remove

Contributor Author

I would rather keep this at the same level as the rest of the stats output, for consistency.

// The recommend minimum size is thus a size slightly larger than the
// maximum number children in a HAMT node (which is the largest number
// of CIDs that could be requested in a single job) or 256 + 64 = 320
const batchSize = 320
Member

I'd open an issue to run benchmarks with different values of this once this PR is merged

Contributor Author

@kevina kevina Jul 23, 2018

I can do that, but I already did some informal benchmarking to arrive at this value.

Member

@kevina is that benchmarking reproducible? Do you have scripts we can run to try it out and rederive the number ourselves?

Contributor Author

Actually, that number is based more on reasoning (as outlined in the comments) than on benchmarks. I do not recommend we go below that size. Larger values might be beneficial at the cost of increased resource usage. I don't have any formal benchmarks; I mostly just observed the behavior when fetching a large directory. I can create an issue to consider increasing the value if desirable.
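
Spelling out the arithmetic from the quoted comment (the 256 fanout, the 64 of headroom, and the resulting 320 come from that comment and the diff; the constant names other than batchSize are made up for this sketch):

package hamtfetch

// A single job can request at most one HAMT node's worth of children, so the
// batch must be slightly larger than that to ever combine more than one job.
const (
	maxChildrenPerNode = 256                                  // HAMT fanout: largest number of CIDs one job can request
	extraHeadroom      = 64                                   // slack so a batch can also hold part of another job
	batchSize          = maxChildrenPerNode + extraHeadroom   // = 320, matching the diff
)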

}

// get gets the missing child shards for the hamt object.
// The missing children for the passed in shard is returned. The
Member

s/shard is returned/shard are returned

dserv ipld.DAGService

reqRes chan *Shard // channel for requesting the children of a shard
result chan result // channel for retrieving the results of the request
Member

I'd call this requestCh / resultCh

res result
}

func (f *fetcher) mainLoop() {
Member

I'd split this function into smaller functions as this is rather hard to read

Contributor Author

I agree that might be helpful, but I would rather do this in a separate p.r., as I have other higher priority things I want to work on.

Member

As this is new code, and not a refactor, I'd prefer getting it a bit cleaner before merging it in. @kevina If you'd like to hack on other things, maybe @magik6k could help out cleaning things up?

Contributor Author

I also don't necessarily agree it will be an improvement, but see my other comments.

j.idx = -1
}

func (f *fetcher) launch() {
Member

I'd split this up too

Contributor Author

Ditto.

for no := range ch {
if no.Err != nil {
fetched.errs = append(fetched.errs, no.Err)
continue
Member

Any reason to not abort early here (close context and return)?

Contributor Author

Because it's unnecessary. The fetcher can continue even if it encounters an error here.

Member

From what I can see, at least in https://github.com/ipfs/go-ipfs/pull/4979/files#diff-689271378656b0bd9fc790b4a0a2b784R393 we throw the result away if there are errors, so it would make sense to not fetch more stuff if we know it won't be used (I might be wrong here though)

Contributor Author

Note that that code also aborts the context, so the amount of extra work done is minimal.

hamt, err := NewHamtFromDag(f.dserv, no.Node)
if err != nil {
fetched.errs = append(fetched.errs, err)
continue
Member

And here (abort early)

Contributor Author

Ditto.
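
For readers following along, a simplified, self-contained rendering of the pattern under discussion (the function and struct names and modern signatures here are illustrative; the diff's real code builds Shards via NewHamtFromDag and shares one result across the jobs in a batch):

package hamtfetch

import (
	"context"

	cid "github.com/ipfs/go-cid"
	ipld "github.com/ipfs/go-ipld-format"
)

// result mirrors the shape discussed above: everything fetched plus any errors.
type result struct {
	vals map[string]ipld.Node
	errs []error
}

// fetchBatch drains GetMany, accumulating errors instead of returning on the
// first one. Cancelling the context on the way out means that even when errs
// is non-empty, little extra work is done by the outstanding gets.
func fetchBatch(ctx context.Context, dserv ipld.DAGService, cids []cid.Cid) result {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel()

	fetched := result{vals: make(map[string]ipld.Node)}
	for no := range dserv.GetMany(ctx, cids) {
		if no.Err != nil {
			fetched.errs = append(fetched.errs, no.Err)
			continue // an early `return fetched` here would be the abort-early variant
		}
		fetched.vals[string(no.Node.Cid().Bytes())] = no.Node
	}
	return fetched
}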

@kevina
Contributor Author

kevina commented Jul 23, 2018

Instead of using go logger I'd use go-log

Um I already do:

import (
	...
	logging "gx/ipfs/QmcVVHfdyv15GVPk7NrxdWjh2hLVccXnoD8j2tyQShiXJb/go-log"
)

var log = logging.Logger("hamt")

Or am I missing something?

@magik6k
Member

magik6k commented Jul 23, 2018

Instead of using go logger I'd use go-log

am I missing something?

What I meant is that it would probably be better / more useful to use logging spans here (i.e. https://github.com/ipfs/go-log/blob/5dc2060baaf8db344f31dafd852340b93811d03f/log.go#L63-L77)
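
Concretely, a minimal sketch of what that could look like (EventBegin/Done are the calls at the linked lines; the event name, metadata, wrapper function, and modern import path are illustrative):

package hamtfetch

import (
	"context"

	logging "github.com/ipfs/go-log"
)

var log = logging.Logger("hamt")

// fetchBatch wraps one unit of fetcher work in a span: EventBegin records the
// start (plus any metadata), and the deferred Done records the duration.
func fetchBatch(ctx context.Context, cids int) {
	evt := log.EventBegin(ctx, "hamt.fetchBatch", logging.LoggableMap{"cids": cids})
	defer evt.Done()
	// ... issue the GetMany request and collect the results ...
}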

for {
select {
case id := <-f.requestCh:
if want != nil {
Member

seems like we could just pull most of the code in this select case into a separate function, making this flow easier to read and reason about (also removing the need for continues)

Contributor Author

Okay. I agree this will be helpful. I will do this later today.

Contributor Author

Okay done.

@kevina
Contributor Author

kevina commented Jul 23, 2018

What I meant is that it would probably be better / more useful to use logging spans here (i.e. https://github.com/ipfs/go-log/blob/5dc2060baaf8db344f31dafd852340b93811d03f/log.go#L63-L77)

I am not at all familiar with that code or the proper usage of it.

@kevina
Contributor Author

kevina commented Oct 3, 2018

@Stebalien do you now consider this P.R. dead? The last time this got some activity I was in the middle of the base32 work, so I didn't want to spend a lot of time on this, and I saw the remaining reviews more as nits than anything else. My apologies if this caused this P.R. to get dropped.

@Stebalien Stebalien added the status/deferred Conscious decision to pause or backlog label Oct 3, 2018
@Stebalien
Member

Stebalien commented Oct 3, 2018

It was a combination of everyone being too busy to give a careful review and this patch being complicated enough to make such a review difficult and time-consuming. It's also so targeted at one use case (sharded directories) that it's unclear if the burden of maintaining all of this machinery outweighs the benefit. It may turn out that we can't do any better but, for now, @hannahhoward has picked up this work and is starting with some simpler approaches. Eventually, I'd like to have a more generalized pre-fetcher (one that can prefetch a DAG up to some node-"class").

For now, just focus on the base32 stuff. That's absolutely critical for the browser folks.

@hannahhoward
Contributor

closing per previous comments and ipfs/go-unixfs#19 being merged

@ghost ghost removed status/deferred Conscious decision to pause or backlog status/in-progress In progress labels Oct 15, 2018
@Stebalien Stebalien deleted the kevina/parallel-hamt-fetch branch February 28, 2019 22:44