IPFS support #4736
Conversation
Hold on, accidentally hit ctrl+enter, this should have been a draft PR… Edit: now we are good to go
Hi @soerface! Great stuff!
That sounds interesting! IIRC, IPFS does have directories, right? So could you clarify why we can't use them the same way we do in all other remotes? I have to clarify that I'm not really familiar with IPFS, so excuse me for potentially naive questions. Our cache/remotes are indeed mostly content-addressable, but we also have things like `.dir` files for handling directories.
Well, kind of. It does have directories; for example, take a look at this listing of all XKCD comics up to 1862: https://ipfs.io/ipfs/QmdmQXB2mzChmMeKY47C43LxUdg1NDJ5MWcKMKxDu7RgQm

It is indeed a directory, and comic number 1862 is in a subdirectory: https://ipfs.io/ipfs/QmdmQXB2mzChmMeKY47C43LxUdg1NDJ5MWcKMKxDu7RgQm/1862%20-%20Particle%20Properties

However, changing anything inside it results in a completely new hash, so a created directory is effectively read-only. For URLs that do not change there is IPNS. It allows pointing the same name to new content, but requires a private key. So this might work for one person publishing content, but I'm not sure it is a feasible solution for collaborative work. Need to think about that. The IPNS example in the official documentation doesn't even work ("could not resolve name"); I haven't yet seen IPNS working reliably (but that might just be me).
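A quick shell sketch of that behavior with the stock `ipfs` CLI (CIDs and filenames below are placeholders):

```sh
# adding a directory yields a root CID derived from its contents
ipfs add -r comics/
# -> added <cid-v1> comics

# any change produces a brand-new root CID; the old one stays immutable
echo "new comic" > comics/2000.txt
ipfs add -r comics/
# -> added <cid-v2> comics

# IPNS: a mutable name, updatable only with this node's private key
ipfs name publish /ipfs/<cid-v2>
# -> Published to <peer-id>: /ipfs/<cid-v2>
ipfs name resolve <peer-id>
# -> /ipfs/<cid-v2>
```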
The directories used in the above XKCD example may work for that use case. The main problem is still that every change changes the hash, except when using IPNS. I'll take a closer look at IPNS and play around with it. It would still be cool if we could find some kind of mechanism to store storage-specific data. Using IPFS without IPNS has the charm of not being dependent on some key owner.
Force-pushed from c4e0af3 to 5dbc44e:
The remote allows updating `.dvc/config` after a successful IPFS upload. IPFS always needs the latest CID, so a new method was introduced on the `Remote` class to allow doing work after the upload.
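Conceptually, the post-upload step boils down to something like this (a sketch of the idea, not the PR's actual code; the MFS path is made up):

```sh
# after uploading, the MFS root directory has a new CID; fetch it...
NEW_CID=$(ipfs files stat --hash /dvc-storage)   # "/dvc-storage" is a hypothetical MFS path

# ...and record it so other users can resolve the latest data
# (the PR writes it into .dvc/config, which is then committed to git)
echo "latest root CID: $NEW_CID"
```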
After giving it some more thought, I don't think it would be desirable to allow talking to a remote IPFS daemon. It will inevitably happen that multiple users work on the same MFS. Scenario: user A pushes a file. User B pushes some other file. User B does not know about the new content from user A. User B will then check a CID into the DVC repo which contains unnecessary data.
Hey @efiop, I'm ready with a draft that is basically working. I'll need some more time to look at the recently merged #4867, since I needed to create a custom remote and that PR looks like a major refactoring at first glance. Also, tests and documentation are still pending, but feedback would be appreciated, since I needed to do some "unusual" things compared to other remotes. Overview of this implementation:
I also thought about allowing connections to a remote IPFS daemon. Users wouldn't be required to download all content when using IPFS, and the progress bars would be more useful. BUT, if multiple users start working with the same daemon, they will get really nasty race conditions, since the MFS is updated on every user request to serve the content matching that user's version of the repository. It would be fine if every user were disciplined enough to use their own directory inside the MFS, but we know that users won't be :) Therefore I decided not to support remote daemons for now. Alternatively, we can think about assigning random directories in the MFS for every user (IPFS will automatically handle deduplication). Quickstart to try it:
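Roughly along these lines (assuming the branch registers an `ipfs://` URL scheme; the exact commands are a sketch, not verbatim from the PR):

```sh
# 1. run a local IPFS daemon (the tree talks to its HTTP API on the default port)
ipfs init && ipfs daemon &

# 2. point a DVC remote at IPFS (the ipfs:// scheme is an assumption)
dvc remote add -d myipfs ipfs://

# 3. push; after the upload, the remote writes the new root CID back into .dvc/config
dvc push
```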
My main questions are in the PR description below. Feel free to nudge me on your Discord with any questions!
@soerface This is very interesting, but it is really complex too. The dynamic CID is a lot to ask from dvc; I'm not sure this is all worth it. What about IPNS, which you've mentioned before? Btw, do you plan on using this, or are you just contributing the change? We will need a lot of testing and a lot of maintenance with the current approach; we won't be willing to take such a complex implementation without anyone actually using it and ready to support it. Please excuse us for potential delays in reviews, we have a lot on our plate right now, so we will have to find some time to properly review it and deep dive into IPFS.
@efiop IPNS would actually add more complexity. First, the hooks I introduced would still be needed, since we would need to update the IPNS entry to point to the new IPFS CID. We would just save ourselves from modifying `.dvc/config`. Second, only the user who owns the private key of the IPNS entry is able to change it. Therefore, only one person can actually change content, which makes collaboration impossible (or at least vastly more challenging, since you would now need some way to distribute a key between members). Last but not least, IPNS entries can expire. As the users in #930 described, IPFS would be useful for public projects where anyone might have an interest in keeping source code and data online. If we stay with plain IPFS, keeping data alive is just a matter of executing `ipfs pin add <CID>`.
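To illustrate the key-ownership point with the stock `ipfs` CLI (the key name is made up):

```sh
# only whoever holds this key can ever re-point the IPNS name
ipfs key gen project-data
ipfs name publish --key=project-data /ipfs/<cid>

# plain IPFS needs no key: anyone interested can keep the data alive
ipfs pin add <cid>
```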
@soerface I see. It looks like rclone also doesn't support IPFS yet: rclone/rclone#128. I see that they've considered these limitations and have come up with a few potential ways to access it. Though raw IPFS is read-only, which is probably what we would have to do here too, if we want to use it. Otherwise IPNS is the only option, it looks like. Making dvc save the CID back to the config is way too strange and will result in merge conflicts. Plus, I don't really see how two users could update the same remote, as they will get different CIDs that don't really merge automatically (as I understand it). So basic scenarios of dvc usage simply won't work, which to me seems like a dealbreaker.
@efiop They will get a merge conflict, yes. But it actually isn't hard to solve. If you get the conflict:
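For instance, a possible resolution could look like this (a sketch, following the rule given further down that the config should keep the remote side's changes):

```sh
# keep the CID your teammate pushed (the "remote" side of the conflict)
git checkout --theirs .dvc/config
git add .dvc/config && git commit

# pushing again merges your files into that root and records a new, combined CID
dvc push
```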
I've tested it on two machines, each running its own IPFS daemon: https://youtu.be/Jz9eGw5xPDE

Also, if the connection between the machines is too slow, updating the MFS leads to a timeout.
However, if editing the config is not allowed and we can't find another place to track the CID, I don't think a read-only implementation would be very useful. Users would need to upload all content manually from time to time, get the CID themselves, and edit the URL manually. That wouldn't be very attractive or useful. Edit: Maybe instead of editing the config, printing the required command and letting the user apply it themselves would be an option.
@soerface What if a team has multiple parallel branches? How would they sync their CIDs? Someone would need to collect all the CIDs and merge them into one remote? In the regular dvc workflow none of these manipulations are needed, as everything simply gets pushed to one single remote. I also don't like the idea of printing a command for the user to run manually. Maybe raw IPFS is simply too low-level to be used as a full-featured remote in dvc. I don't think it is worth all of these dvc workflow sacrifices just to support raw IPFS.
Syncing them would be problematic, yes. But would there even be a need to sync them? If a branch is meant to stay parallel, it has its own files and does not depend on the files of the other branches, does it? If I switch the branch and do a `dvc pull`, I get the files belonging to that branch.
It's a bit tricky, yeah. The alternative to the MFS would be my first approach of saving the CIDs of individual files, which wouldn't have the merging issues and wouldn't require downloading everything when working with IPFS. But as you explained, there are things like the `.dir` files. I would like to learn more about the use cases of the users in issue #930, as already asked in #930 (comment). @remram44 @jnareb @flxai @momack2 @weiji14 @newkozlukov what does your workflow look like? Would you mind testing out this fork and giving feedback on whether it fits you from a user perspective? Until now I had the case of "publishing data" in mind: the data can be kept online by anyone who is interested in it, simply by pinning the CID that is stored in `.dvc/config`.
@soerface Wow, exciting work!
It'll take some time before I get to check it out, though
At a glance, this sounds exactly like the use case I had in mind. In particular, content-addressable storage like IPFS would be a great way to publish weights/training data accompanying a paper (compare the UX to that of Google Drive, ugh).
Correction: this is not true. Pulling is not needed to get the correct CID, since my implementation uses the hooks to always bring the MFS into the correct state before performing operations. The only important thing is to resolve the conflict in the config: it should always contain the changes from remote.

Thanks @newkozlukov! If you need help installing this branch, you can reach me on DVC's Discord. I'll see if I can fix the timeout issue till then. If you run into it, try explicitly connecting your two daemons by executing `ipfs swarm connect <multiaddress>`.
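For example (addresses are placeholders; 4001 is the default swarm port):

```sh
# on machine A: list this node's multiaddresses and peer ID
ipfs id

# on machine B: dial machine A directly instead of waiting for DHT discovery
ipfs swarm connect /ip4/<machine-a-ip>/tcp/4001/p2p/<machine-a-peer-id>
```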
Very nice work, @soerface!
Ideally, the workflow stays the same, with the only difference being the aforementioned CID handling in the config.
I hope for decentralized storage that supports DVC in a fault-tolerant way. DVC currently allows for a select few types of remotes, none of which decentralizes authority over the storage provider. This is an essential feature for achieving fault-tolerant systems that allow data sharing without the limitations induced by central control. I see applications within the field of reproducibility in the computational sciences, which relies on fault-tolerant storage of data within an automated environment. I recognize this as a desirable property for any published work that deems itself reproducible.
I hope I'll get to that soon. Judging from the video linked above and skimming through the changes, I'd say this looks like a very promising UX that should feel intuitive to people familiar with IPFS.
@soerface Thanks again for your work on this PR! Unfortunately, I don't think we are ready to accept an implementation that takes this approach. If we don't find a better way to use IPFS in dvc, then maybe it can live as a dvc fork for now, until we implement a plugin mechanism that your implementation could use. Closing this PR for now, since unfortunately I don't see how it could eventually be merged into the upstream.
@efiop I love this PR - is there any progress on that plugin mechanism?
@LennyPenny We are working on some prerequisites, and we are generally going in the fsspec (#5162) direction right now. Most likely we will end up using it as a base for plugins, but there's no 100% guarantee yet. Also, the CID situation discussed above is a very hard pill to swallow right now, but we could reconsider it.
❗ I have followed the Contributing to DVC checklist.
📖 If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here.
Referring to #930
This PR adds a new tree, `IPFSTree`. It allows pulling from and pushing to IPFS with a running ipfs daemon.
The PR is a work in progress. In particular, I would like core maintainer feedback on the following issue:

IPFS addresses its content via hashes (https://docs.ipfs.io/concepts/content-addressing/). In contrast to regular storage solutions, we can't decide the paths of our files. Therefore, the `path_info` a tree receives in its `exists()` and `_download()` methods is not sufficient to retrieve files. Also, only the `_upload()` method will know the resulting IPFS CID.

I think the `*.dvc` file would be the right place to keep the IPFS CID; it just needs an additional key (alongside `path` and `md5`) - see the sketch after the TODO list. Would you mind adding storage-specific information there? Would it fit the architecture of DVC, or do you have some other place in mind?

Can you also give me feedback on the idea of `_upload()` being allowed to return additional information that gets saved to the `*.dvc` file? If that is fine, we should probably do it in a separate PR and build a more general solution, allowing any tree to save storage-specific information.

Besides this major blocker, a couple more things are missing before this PR can be merged. TODO list:
- `walk_files`: not sure if this method is really necessary for DVC; pulling and pushing worked without it.
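For illustration, the extra key could look like this in a `*.dvc` file (the `ipfs_cid` field name and the values are made up for this sketch; DVC has no such key today):

```yaml
outs:
- md5: <md5-of-the-file>
  path: data.xml
  ipfs_cid: <cid-returned-by-upload>   # hypothetical extra, storage-specific key
```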