
Link caching #3642

Closed
wants to merge 1 commit into from

Conversation

@Voker57 (Contributor) commented Jan 28, 2017

This PR introduces link caching in GetLinks, dramatically speeding up, among other things, pinning large sets of data already present in the DB.

Performance on my machine:
Pinning 1.7GB without cache: 1.08m
Pinning the same with cache: ~1s

Includes #3598.
Fixes #3505.
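
Roughly, the approach looks like this (a minimal sketch only; Link, KVStore, and NodeGetter are illustrative stand-ins rather than this PR's actual types, and only the /local/links key prefix comes from this branch):

package linkcache

import (
    "context"
    "encoding/json"
)

// Link is an illustrative stand-in for the merkledag link type.
type Link struct {
    Name string
    Size uint64
    Cid  string
}

// KVStore stands in for the repo datastore used as the persistent cache.
type KVStore interface {
    Get(key string) (value []byte, found bool, err error)
    Put(key string, value []byte) error
}

// NodeGetter stands in for the existing "fetch the node and decode its links" path.
type NodeGetter interface {
    GetLinks(ctx context.Context, c string) ([]Link, error)
}

type CachedLinkService struct {
    cache KVStore
    nodes NodeGetter
}

// GetLinks answers from the /local/links cache when it can; otherwise it
// falls back to fetching the node and stores the result for next time.
func (s *CachedLinkService) GetLinks(ctx context.Context, c string) ([]Link, error) {
    key := "/local/links/" + c
    if raw, found, err := s.cache.Get(key); err == nil && found {
        var links []Link
        if json.Unmarshal(raw, &links) == nil {
            return links, nil
        }
    }
    links, err := s.nodes.GetLinks(ctx, c)
    if err != nil {
        return nil, err
    }
    if raw, err := json.Marshal(links); err == nil {
        _ = s.cache.Put(key, raw) // best-effort; a failed cache write is not fatal
    }
    return links, nil
}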

@Kubuxu (Member) commented Jan 28, 2017

Apart from the test failing, could you compare it against a memory-only ARC cache? A persistent cache raises some problems (for example, deletion).
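
For comparison, the memory-only ARC variant would look something like this (a sketch using hashicorp/golang-lru's ARC cache; Link and the fetch callback are the same kind of illustrative stand-ins as in the sketch above):

package arclinks

import (
    "context"

    lru "github.com/hashicorp/golang-lru"
)

// Link is an illustrative stand-in for the merkledag link type.
type Link struct {
    Name string
    Size uint64
    Cid  string
}

// ARCLinkCache keeps links only in memory; it is fast but empties on
// every daemon restart, which is the trade-off discussed below.
type ARCLinkCache struct {
    arc   *lru.ARCCache
    fetch func(ctx context.Context, c string) ([]Link, error)
}

func NewARCLinkCache(size int, fetch func(ctx context.Context, c string) ([]Link, error)) (*ARCLinkCache, error) {
    arc, err := lru.NewARC(size)
    if err != nil {
        return nil, err
    }
    return &ARCLinkCache{arc: arc, fetch: fetch}, nil
}

// GetLinks serves links from the in-memory ARC cache when possible and
// otherwise fetches them and remembers the result.
func (s *ARCLinkCache) GetLinks(ctx context.Context, c string) ([]Link, error) {
    if v, ok := s.arc.Get(c); ok {
        return v.([]Link), nil
    }
    links, err := s.fetch(ctx, c)
    if err != nil {
        return nil, err
    }
    s.arc.Add(c, links)
    return links, nil
}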

@Voker57 (Contributor, Author) commented Jan 28, 2017

@Kubuxu I don't think a memory-only cache would bring a considerable improvement, and it would obviously only last until the ipfs daemon is restarted (if a daemon is even running).
Deleting or updating the data should not be required, since DAG nodes are immutable.

@ghost commented Jan 28, 2017

I'm not generally opposed to this (although it touches a lot of code).

Here's a different idea though: a while ago we discussed embedding a graph db like cayley as a linkstore. Use cases for this include smarter GC algorithms and IPLD selectors.

@rht (Contributor) commented Jan 28, 2017

Looks neat -- I think the idea in the third commit of this PR could be repurposed as PBNode.UnmarshalLinks (read only the links part of the dagnode, i.e. up to the links/data separator, 0xa). No cache would be necessary, and it would likely have a smaller implementation surface.
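
Concretely, dag-pb serializes the repeated Links field (tag 0x12) ahead of the Data field (tag 0x0a), so a decoder can collect the link bytes and stop at the first 0x0a tag without reading the data at all. A rough sketch of that idea, assuming the canonical field ordering; this is not the proposed PBNode.UnmarshalLinks itself:

package pblinks

import (
    "encoding/binary"
    "fmt"
)

// RawLinks returns the serialized bytes of each link field in a dag-pb
// node, stopping as soon as the Data field's tag (0x0a) is seen so the
// data itself is never read or copied. Decoding each link into a
// name/size/CID struct is left out of this sketch.
func RawLinks(buf []byte) ([][]byte, error) {
    var links [][]byte
    for len(buf) > 0 {
        tag, n := binary.Uvarint(buf)
        if n <= 0 {
            return nil, fmt.Errorf("malformed field tag")
        }
        if tag == 0x0a { // Data field reached: everything after this point is skipped
            break
        }
        if tag != 0x12 { // dag-pb only has Links (0x12) and Data (0x0a)
            return nil, fmt.Errorf("unexpected field tag %#x", tag)
        }
        buf = buf[n:]
        size, m := binary.Uvarint(buf)
        if m <= 0 || uint64(len(buf)-m) < size {
            return nil, fmt.Errorf("truncated link field")
        }
        buf = buf[m:]
        links = append(links, buf[:size])
        buf = buf[size:]
    }
    return links, nil
}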

@whyrusleeping (Member) commented

@Voker57 Nice work! Those are some serious perf improvements. However, the changeset is uncomfortably large. There's a lot we can do to break this up. First, I'll review and merge #3598. Then I have two ideas for moving forward with the rest of it.

We can keep this functionality in the dagservice (increasing the complexity and responsibilities of that one object). If we go this route, I'd like to see the NewDagService constructor remain the same and just have it set the new field to nil, so we don't have to change every invocation of it through the whole codebase (and external things also using that function). We can then add a new constructor, NewDagServiceWithLinkCache (or something less wordy).

We can also extract this functionality into a LinkService abstraction that does this caching, and have the dagservice accept one of those in its extended constructor. Doing it this way should really clean up the interfaces a lot and allow us to mock this out and replace it more easily.
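
In rough Go terms, option 2's LinkService abstraction could even sit behind option 1's constraint of leaving NewDagService untouched. An illustrative sketch, not the actual merkledag code; only the constructor names come from the paragraphs above:

package dagservice

import "context"

// Link and Blocks are illustrative stand-ins for the real types.
type Link struct {
    Name string
    Size uint64
    Cid  string
}

type Blocks interface{}

// LinkService is option 2's abstraction: something that can answer
// GetLinks, possibly backed by a cache, and easy to mock out or replace.
type LinkService interface {
    GetLinks(ctx context.Context, c string) ([]Link, error)
}

type DAGService struct {
    blocks Blocks
    links  LinkService // nil means "no link cache; fetch whole nodes"
}

// NewDagService keeps its current signature so no call sites change
// (option 1's constraint); the new field simply stays nil.
func NewDagService(bs Blocks) *DAGService {
    return &DAGService{blocks: bs}
}

// NewDagServiceWithLinkCache is the extended constructor that wires in a
// caching LinkService.
func NewDagServiceWithLinkCache(bs Blocks, ls LinkService) *DAGService {
    return &DAGService{blocks: bs, links: ls}
}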

Beyond that, we need to think about cleaning up the datastore (this adds a not-insignificant storage-space overhead) and also about tests.

Thanks again!

@rht (Contributor) commented Jan 29, 2017

Caching doesn't change the fact that every node has to have a counterpart that contains just the links but no data. Wouldn't it be sufficient to just read the link bytes of the serialized node stream, without additional caching?

@Voker57 (Contributor, Author) commented Jan 29, 2017

@rht yeah, that sounds viable; however, it requires a fixed node marshaling format. With the filestore it's going to get a bit complicated and will need an expanded blockstore interface.

@rht (Contributor) commented Jan 29, 2017

@Voker57 does "fixed node marshaling format" refer to having the marshaled data be deterministic/stable?
I see this PR as an under-the-hood optimization for DAGService.GetLinks (@whyrusleeping's option 1). And the only extension needed for the multiblockstore is a prefix path specifying where the node data should be read from. In short, I don't see the complications.

On another note, using an in-memory linkstore/DAG cache sounds like the reverse of what zfs does (an in-memory block dedup table/DDT cache). I haven't figured out why that is the case for zfs, other than suspecting that it is optimized for reads/writes on a single file rather than frequent fs DAG walks (the use case for pin and gc -- but then, zfs does gc well too).

@Voker57 (Contributor, Author) commented Jan 29, 2017

@rht

And the only extension needed for the multiblockstore is a prefix path specifying where the node data should be read from. In short, I don't see the complications.

That's assuming the data is stored in flatfs. What if it's, for example, stored in the filestore and the links are in a different place? The blockstore would then need a GetLinks method or something similar. Returning a physical file path from the blockstore would greatly restrict development of other blockstore backends.
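
In other words, the extension would look something like an optional capability interface rather than exposed file paths (illustrative sketch only; none of these names exist in the codebase):

package blockstore

import "context"

// Blockstore is a trimmed-down stand-in for the real interface.
type Blockstore interface {
    Get(ctx context.Context, c string) ([]byte, error)
}

// LinkReader is an optional capability a backend (flatfs, filestore, ...)
// can implement if it knows how to return a node's raw link bytes without
// reading the whole block.
type LinkReader interface {
    GetLinks(ctx context.Context, c string) ([][]byte, error)
}

// Links prefers the capability when the backend has it and otherwise
// falls back to reading the full block and extracting the links from it.
func Links(ctx context.Context, bs Blockstore, c string, extract func([]byte) ([][]byte, error)) ([][]byte, error) {
    if lr, ok := bs.(LinkReader); ok {
        return lr.GetLinks(ctx, c)
    }
    block, err := bs.Get(ctx, c)
    if err != nil {
        return nil, err
    }
    return extract(block)
}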

@rht (Contributor) commented Jan 30, 2017

case flatfs:
Reading just the link bytes up to 0xa should be sufficient.

case filestore:
I haven't fully scanned through the filestore implementation to see where the fs links are stored. I am currently assuming that it stores DAG nodes as usual in a flatfs, but each file's content is stored as a serialized PosInfo object (i.e. "symlinks to a slice of a file"). It operates at a layer between the blockstore and the dagnode, so it doesn't see dagnode links. UnmarshalLinks should still suffice, since the links and the PosInfo metadata are still separated by 0xa.

@rht (Contributor) commented Jan 30, 2017

Here's a different idea though: a while ago we discussed embedding a graph db like cayley as a linkstore. Use cases for this include smarter GC algorithms and IPLD selectors.

Where was this discussed? Is there a link/issue I could look at?

As an aside, does anyone in the community object to having an incentivized bounty on perf improvements (kind of like http://www.spacex.com/hyperloop)? That was how NixOS/nix#341 was pulled off.

@Voker57 (Contributor, Author) commented Feb 13, 2017

Will rework this to make it easier to use EnumerateChildrenAsync, and include the changes from feat/frugal-enumerate.

An inline review comment from a project member on this diff hunk:

if err != nil {
    return err
}
if present == true {

Can be just present.

@kevina (Contributor) commented Feb 17, 2017

Note #3700 is related and will likely conflict with this one.

@whyrusleeping added this to the Ipfs 0.4.7 milestone Feb 17, 2017
@whyrusleeping (Member) commented

Yeah, #3700 has been merged now. It should make this a bit different (hopefully easier).

I think we should try out what @rht suggested: reading just the links from the object on disk. It will be really awkward to wire through (since the datastore interface doesn't allow partial reads), but if we can prove that it will be efficient, it's worth doing: I've been wanting to change the datastore interface to return an io.Reader for quite a while, and that would let us make this change much more easily.
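
Hypothetically, an io.Reader-based datastore would let the links-only read fall out naturally. A sketch of the idea, not the real go-datastore API:

package dsstream

import (
    "context"
    "io"
)

// StreamingDatastore is the hypothetical partial-read interface: Get
// hands back a reader instead of a fully materialized []byte.
type StreamingDatastore interface {
    GetReader(ctx context.Context, key string) (io.ReadCloser, error)
}

// LinksOnly runs a decoder that consumes only as much of the value as it
// needs (e.g. a dag-pb link parser that stops at the 0x0a Data tag) and
// then closes the reader without draining the rest of the block.
func LinksOnly(ctx context.Context, ds StreamingDatastore, key string, decode func(io.Reader) ([][]byte, error)) ([][]byte, error) {
    r, err := ds.GetReader(ctx, key)
    if err != nil {
        return nil, err
    }
    defer r.Close()
    return decode(r)
}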

@Voker57 force-pushed the feat/linkdb branch 2 times, most recently from 12d29e9 to 2ce89db on June 28, 2017 at 17:29
@Voker57 force-pushed the feat/linkdb branch 5 times, most recently from a047c47 to d6f43ff on July 14, 2017 at 21:49
@Voker57 (Contributor, Author) commented Jul 14, 2017

Rebased, fixed the tests, and addressed the review; please let me know if any other adjustments are needed.

…inks

GetLinks() now uses a link cache in the datastore under /local/links and
will not necessarily fetch the linked nodes. This also affects
EnumerateChildrenAsync and EnumerateChildren; their previous behaviour
can be reproduced by using FetchingVisitor from the merkledag module.

Add `ipfs repo flushlinkcache`

License: MIT
Signed-off-by: Iaroslav Gridin <[email protected]>
Labels: need/review (Needs a review)
Successfully merging this pull request may close these issues: ipfs pin add is very slow / hanging
7 participants