IPFS support #4736

Closed
wants to merge 12 commits into from

Conversation


@soerface commented Oct 17, 2020

Referring to #930

This PR adds a new tree, IPFSTree. It allows pulling from and pushing to IPFS via a running ipfs daemon.

This PR is a work in progress. In particular, I would like core maintainer feedback on the following issue:

IPFS addresses its content via hashes. In contrast to regular storage solutions, we can't choose the paths of our files. Therefore, the path_info a tree receives in its exists() and _download() methods is not sufficient to retrieve files. Also, the _upload() method will only know the resulting IPFS CID once the upload is done. https://docs.ipfs.io/concepts/content-addressing/
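
To make the problem concrete, here is a minimal sketch (not code from this PR), assuming a local go-ipfs daemon and its CLI; the file path is made up:

```python
# Minimal sketch: the address (CID) of a file is derived from its content,
# so it cannot be chosen up front and is only known after "ipfs add" runs.
import subprocess

def ipfs_add(path: str) -> str:
    """Add a file to the local IPFS daemon and return its CID."""
    result = subprocess.run(
        ["ipfs", "add", "-Q", path],  # -Q: print only the resulting CID
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()

cid = ipfs_add("data/file.bin")  # hypothetical file; the CID is unknown until now
print(f"content is reachable only as /ipfs/{cid}")
```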

I think the *.dvc file would be the right place to keep the IPFS CID; it just needs an additional key (alongside path and md5). Would you mind adding storage-specific information there? Would it fit the architecture of DVC, or do you have some other place in mind?

Can you also give me feedback on the idea of _upload() being allowed to return additional information that gets saved to the *.dvc file? If that is fine, we should probably do it in a separate PR and build a more general solution - allowing any tree to save storage-specific information.

Apart from this major blocker, a couple more things are missing before this PR can be merged. TODO list:

  • ❗persist IPFS CID in DVC
  • update documentation
  • add unit tests
  • support connecting to an ipfs daemon anywhere, not only on localhost
  • maybe implement walk_files. Not sure if this method is really necessary for DVC; pulling and pushing worked without it
  • implement progress bar for up- and download
  • do not time out when updating the MFS. Showing progress would be needed for a good UX, but I don't know where to get the progress from


soerface commented Oct 17, 2020

Hold on, accidentally hit ctrl+enter, this should have been a draft PR…

Edit: now we are good to go

@soerface closed this Oct 17, 2020
@soerface reopened this Oct 17, 2020
@soerface marked this pull request as draft October 17, 2020 16:20

efiop commented Oct 18, 2020

Hi @soerface !

Great stuff! πŸ”₯

I think the *.dvc file would be the right place to keep the IPFS CID; it just needs an additional key (alongside path and md5). Would you mind adding storage-specific information there? Would it fit the architecture of DVC, or do you have some other place in mind?

Can you also give me feedback on the idea of _upload() being allowed to return additional information that gets saved to the *.dvc file? If that is fine, we should probably do it in a separate PR and build a more general solution - allowing any tree to save storage-specific information.

That sounds like the .dvc file would be aware of the way data is stored in the remote, which is wrong from the current architectural perspective, as .dvc files are remote-agnostic. So it is likely a no-go. πŸ™

IIRC, IPFS does have directories, right? So could you clarify why we can't use them the same way we do in all other remotes? I have to admit that I'm not really familiar with IPFS, so excuse the potentially naive questions.

Our cache/remotes are indeed mostly content-addressable, but we also have things like the run-cache (e.g. .dvc/cache/runs, created after any dvc run with deps and outputs), which requires the ability to list entries too. So we actually need a directory to work as a remote; pure file-content-based storage won't do, at least not for the full functionality.

@efiop added the awaiting response label Oct 18, 2020

soerface commented Oct 18, 2020

IIRC, IPFS does have directories, right? So could you clarify why we can't use them the same way we do in all other remotes? I have to admit that I'm not really familiar with IPFS, so excuse the potentially naive questions.

Well, kind of. It does have directories; for example, take a look at this listing of all XKCD comics up to 1862: https://ipfs.io/ipfs/QmdmQXB2mzChmMeKY47C43LxUdg1NDJ5MWcKMKxDu7RgQm

It is indeed a directory and comic number 1862 is in a subdirectory: https://ipfs.io/ipfs/QmdmQXB2mzChmMeKY47C43LxUdg1NDJ5MWcKMKxDu7RgQm/1862%20-%20Particle%20Properties

However, changing anything inside it results in a completely new hash, so the created directory is effectively read-only.
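
A small sketch of that behaviour, assuming a local go-ipfs daemon and a made-up dataset directory:

```python
# Re-adding a directory after any change yields a different root CID,
# which is why an IPFS "directory" is effectively read-only once created.
import pathlib
import subprocess

def add_dir(path: str) -> str:
    """Recursively add a directory and return its root CID."""
    result = subprocess.run(
        ["ipfs", "add", "-r", "-Q", path],  # -Q: print only the final (root) CID
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()

before = add_dir("dataset")                               # hypothetical directory
pathlib.Path("dataset/new_file.txt").write_text("one more file")
after = add_dir("dataset")
assert before != after  # any modification produces a completely new hash
```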

For addresses that do not change there is IPNS. It allows pointing the same name to new content, but requires a private key. So this might work for one person publishing content, but I'm not sure it is a feasible solution for collaborative work. I need to think about that. The IPNS example in the official documentation doesn't even work ("could not resolve name"); I haven't yet seen IPNS working reliably (but that might just be me).

which requires the ability to list entries too. So we actually need a directory to work as a remote

The directories used in the XKCD example above may work for that use case. The main problem is still that every change produces a new hash - except when using IPNS. I'll take a closer look at IPNS and play around with it. It would still be cool if we could find some mechanism to store storage-specific data. Using IPFS without IPNS has the charm of not depending on a single key owner.

The remote allows updating .dvc/config after a successful IPFS upload. IPFS always needs the latest CID, so a new method for the `Remote` class was introduced to allow doing work after the upload.

After giving it some more thought, I don't think it would be desirable to allow talking to a remote IPFS daemon. It will inevitably happen that multiple users work on the same MFS. Scenario: User A pushes a file. User B pushes some other file. User B does not know about the new content from User A. User B will check a CID into the DVC repo which contains unnecessary data.

soerface commented Nov 13, 2020

Hey @efiop, I'm ready with a draft that is basically working. I'll need some more time to look at the recently merged #4867, since I needed to create a custom remote and that PR looks like a major refactoring at first glance. Also, tests and documentation are still pending, but feedback would be appreciated, since I needed to do some "unusual" things compared to other remotes.

Overview of this implementation:

  • Initially, I wanted to track individual files. I gave up on this approach in favor of the IPFS MFS. It's basically an API that simulates a regular filesystem, which fits DVC's concept of remotes better
  • Because the MFS is mutable, the CID (Content ID) changes with every upload. However, we can now use the single CID of the "project directory" of the MFS instead of the CIDs of individual files
  • After an upload, the new CID of the project is read. This needs to be saved in git. Since .dvc/config is tracked in git, I decided that the easiest option is to change the url of the IPFS remote (see the sketch after this list)
  • DVC uploads to and downloads from the MFS of the local IPFS daemon. But IPFS somehow needs to populate the MFS based on the CID provided by DVC. Therefore, I saw the need for "hooks" on the DVC Remote class. My IPFSRemote class uses these hooks to update the MFS before DVC interacts with it. Major drawback: the actual download happens before DVC pulls the content, so the progress bars become nearly useless (they only provide feedback about the transfer from the local IPFS daemon to the DVC cache)
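
Roughly, the post-upload step could look like the sketch below; it assumes the py-ipfs-http-client package, and the function and MFS path names are illustrative rather than the exact code in this branch:

```python
# Rough sketch of the post-upload hook: read the new root CID of the project
# directory in the local daemon's MFS, then persist it in .dvc/config so that
# git tracks the change together with the updated .dvc files.
import ipfshttpclient

def project_cid(mfs_path: str = "/dvc") -> str:
    """Return the current root CID of the project directory in the MFS."""
    with ipfshttpclient.connect() as client:  # defaults to the daemon on 127.0.0.1:5001
        return client.files.stat(mfs_path)["Hash"]

# After a successful push, the remote would rewrite its own url, e.g.:
#   url = "ipfs://" + project_cid()
# which is what makes .dvc/config change after every upload.
```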

I also thought about allowing connections to a remote IPFS daemon. Users wouldn't be required to download all content when using IPFS, and the progress bars would be more useful. BUT, if multiple users start working with the same daemon, they will get really nasty race conditions, since the MFS is updated on every user request to serve the content matching that user's version of the repository. It would be fine if every user were disciplined enough to use their own directory inside the MFS, but we know that users won't be :) Therefore I decided not to support remote daemons for now. Alternatively, we could think about assigning a random directory in the MFS to every user (IPFS will automatically handle deduplication).

Quickstart to try it:

  • Download the IPFS binary and put it in your PATH
  • Run ipfs daemon
  • In your DVC project, run dvc remote add ipfs ipfs://
  • Optional: explicitly set a path for the MFS: dvc remote modify ipfs mfs_path /my/favorite/place
  • Note that .dvc/config changes after pushing to ipfs
  • Take a look at your files at http://127.0.0.1:5001/webui (select Tab "Files")

Main questions:

  • Is it fine to let the remote edit .dvc/config?
  • Are the new "hook" methods OK, or are there better places for pre-download / post-upload / pre-gc / pre-traversal etc. actions?

Feel free to nudge me on your Discord with any questions!


efiop commented Nov 13, 2020

@soerface This is very interesting, but it is really complex too. The dynamic CID is a lot to ask of dvc; I'm not sure that this is all worth it. What about IPNS, which you've mentioned before?

Btw, do you plan on using this yourself, or are you just contributing the change? The current approach will need a lot of testing and a lot of maintenance; we won't be willing to take such a complex implementation without anyone actually using it and being ready to support it.

Please excuse potential delays in reviews; we have a lot on our plate right now, so we will have to find some time to properly review it and take a deep dive into IPFS.

@soerface

@efiop IPNS would actually add more complexity. First, the hooks I introduced would still be needed, since we would need to update the IPNS entry to point to the new IPFS CID. It would only save us from modifying .dvc/config.

Second, only the user who owns the private key of the IPNS entry is able to change it. Therefore, only one person can actually change the content, which makes collaboration impossible (or at least vastly more challenging, since you would now need some way to distribute a key among members).

Last but not least, IPNS entries can expire. As the users in #930 described, IPFS would be useful for public projects where anyone might have an interest in keeping source code and data online. If we stay with plain IPFS, it is just a matter of executing ipfs pin add CID to keep public data available. IPNS, on the other hand, centralises this, and I therefore wouldn't see any benefit in choosing IPFS over regular cloud storage.
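
For context, a hedged sketch of the extra step an IPNS-based variant would need after every push (using the go-ipfs CLI; only the node holding the key can run it):

```python
# Sketch of an IPNS-based variant: after each push, the key owner has to
# re-publish the new root CID under the IPNS name. Collaborators without the
# key cannot update the name, and the record can expire if not refreshed.
import subprocess

def ipns_publish(cid: str, key: str = "self") -> None:
    """Point the IPNS name backed by `key` at the given CID."""
    subprocess.run(
        ["ipfs", "name", "publish", "--key", key, f"/ipfs/{cid}"],
        check=True,
    )
```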


efiop commented Nov 13, 2020

@soerface I see. It looks like rclone doesn't support ipfs yet either (rclone/rclone#128). I see that they've considered these limitations and have come up with a few potential ways to access it. Raw ipfs is read-only, though, which is probably what we would have to settle for here too if we want to use it. Otherwise ipns looks like the only option.

Making dvc save the CID back to the config is way too strange and will result in merge conflicts. Plus, I don't really see how two users could update the same remote, as they will get different CIDs that don't really merge automatically (as I understand it). So basic scenarios of dvc usage simply won't work, which to me seems like a dealbreaker. πŸ™


soerface commented Nov 14, 2020

Plus, I don't really see how two users could update the same remote, as they will get different CIDs that don't really merge automatically (as I understand it)

@efiop They will get a merge conflict, yes. But it actually isn't hard to resolve. If you get the conflict:

  • edit .dvc/config and accept the changes from remote
  • dvc pull to get the content from the other party
  • dvc push to add your local content
  • git add .dvc/config with the new CID which contains both files

I've tested it on two machines, each running its own IPFS daemon: https://youtu.be/Jz9eGw5xPDE
This is more of an inconvenience than a dealbreaker. However, I noticed that pulling beforehand isn't enforced. Forgetting it will lead to accidentally pushing a CID which does not contain the content from the conflict πŸ™ (Edit: no pull is needed, since the MFS is always updated to the CID in the config when performing any operation)

Also, if the connection between the machines is too slow, updating the MFS leads to a timeout.

  • Do not time out when updating the MFS. Showing progress would be needed for a good UX, but I don't know where to get the progress from

However, if editing the config is not allowed and we can't find another place to track the CID, I don't think a read-only implementation would be very useful. Users would need to upload all content manually from time to time, get the CID themselves, and edit the url manually. That wouldn't be very attractive or useful.

Edit: Maybe instead of editing the config, printing the required dvc remote modify ... command after a dvc push would be a compromise?


efiop commented Nov 14, 2020

@soerface What if a team has multiple parallel branches? How would they sync their CIDs? Would someone need to collect all the CIDs and merge them into one remote? In the regular dvc workflow, none of these manipulations are needed, as everything simply gets pushed to one single remote.

I also don't like the idea of printing a dvc remote modify command, as it again breaks the workflow, which should be unified no matter what cloud you use for storage - that's one of the main features of dvc.

Maybe raw IPFS is simply too low-level to be used as a full-featured remote in dvc. I don't think it is worth all of these dvc workflow sacrifices just to support raw IPFS.


soerface commented Nov 15, 2020

What if a team has multiple parallel branches? How would they sync their CIDs? Would someone need to collect all the CIDs and merge them into one remote?

Syncing them would be problematic, yes. But would there even be a need to sync them? If a branch is meant to stay parallel, it has its own files and does not depend on the files of other branches, does it? If I switch branches and do a dvc pull there, I get the files I need. And since IPFS caches and deduplicates files, it wouldn't be a pain to keep switching. If two branches should be merged, it can be done as described above by pulling the files from both and then pushing. Working with IPFS is a bit different compared to cloud storage, yes, but I would expect IPFS users to know about the drawbacks.

Maybe raw IPFS is simply too low level to be used as a full-featured remote in dvc

It's a bit tricky, yeah. The alternative to the MFS would be my first approach of saving the CID of individual files, which wouldn't pose the merging issues and wouldn't require downloading everything when working with IPFS. But as you explained, there are things like the run-cache which need a directory listing, so… πŸ™


I would like to learn more about the use cases of the users in issue #930, as already asked in #930 (comment). @remram44 @jnareb @flxai @momack2 @weiji14 @newkozlukov what does your workflow look like? Would you mind testing out this fork and giving feedback on whether it fits you from a user perspective? Until now I had the case of "publishing data" in mind: the data can be kept online by anyone who is interested in it by just pinning the CID that is stored in .dvc/config. If a user has the repository and wants to get the data, they just need a running ipfs daemon. dvc pull will get it if anyone in the world cared enough to pin it.

@SomeoneSerge

@soerface Wow, exciting work!

I would like to learn more about the use cases of the users in issue

It'll take some time before I get to check it out, though

The data can be kept online by anyone who is interested in it by just pinning the CID that is stored in .dvc/config. If a user has the repository and wants to get the data, they just need a running ipfs daemon. dvc pull will get it if anyone in the world cared enough to pin it

At a glance, sounds exactly like the use case I had in mind. In particular, content-addressable storage like IPFS would be a great way to publish weights/training data accompanying a paper (compare the UX to that of google drive, ugh)

@soerface

  • dvc pull to get the content from the other party
    […]
    However, I noticed that pulling beforehand isn't enforced. Forgetting it will lead to accidentally pushing a CID which does not contain the content from the conflict πŸ™

Correction: this is not true. Pulling is not needed to get the correct CID, since my implementation uses the hooks to always bring the MFS into the correct state before performing operations. The only important thing is to resolve the conflict in the config - it should always contain the changes from the remote. dvc push will then add your own content and produce a new CID.

Thanks @newkozlukov! If you need help installing this branch, you can reach me on DVC's Discord. I'll see if I can fix the timeout issue by then. If you run into it, try explicitly connecting your two daemons: execute ipfs id on one node, look at the "Addresses" key, and choose an appropriate address to use on the other node with ipfs swarm connect <address> (a small sketch follows below). It's a bit of a bummer that sometimes, even in the same local network, they won't directly detect each other ☹️
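
For anyone trying this out, the manual peering step could look roughly like the sketch below (shown as one script for brevity, although the two halves run on different machines; it also assumes the first listed address is reachable, which may not be the case):

```python
# Sketch of manually connecting two IPFS daemons, as described above.
import json
import subprocess

# On node A: print its peer info and pick a multiaddress from "Addresses".
node_a_info = json.loads(
    subprocess.run(["ipfs", "id"], check=True, capture_output=True, text=True).stdout
)
address = node_a_info["Addresses"][0]  # pick a reachable one, e.g. the LAN address
print("use this on node B:", address)

# On node B: connect directly to node A using that multiaddress.
subprocess.run(["ipfs", "swarm", "connect", address], check=True)
```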


flxai commented Nov 15, 2020

Very nice work, @soerface!

What does your workflow look like?

Ideally, the workflow stays the same, with the only difference being the aforementioned dvc remote add ipfs ipfs://. I see that @efiop sees architectural problems, but personally I wouldn't mind changes. As @soerface already said:

Working with IPFS is a bit different compared to cloud storage, yes, but I would expect IPFS users to know about the drawbacks.

I hope for decentralized storage that supports DVC in a fault-tolerant way. DVC currently allows for a select few types of remotes, none of which decentralize authority over the storage provider. This is an essential feature for achieving fault-tolerant systems that allow data sharing without the limitations induced by central control. I see applications in the field of reproducibility in the computational sciences, which relies on fault-tolerant storage of data within an automated environment. I consider this a desirable property for any published work that deems itself reproducible.

Would you mind testing out this fork and give feedback if it fits you from a user-perspective?

I hope I'll get to that soon. Judging from the video linked above and skimming through the changes I'd say that this looks like a very promising UX that may seem intuitive to people familiar with IPFS.


efiop commented Dec 25, 2020

@soerface Thanks again for your work on this PR! πŸ™ Unfortunately, I don't think we are ready to accept an implementation that takes this approach. If we don't find a better way to use ipfs in dvc, then maybe it can live as a dvc fork for now, until we implement a plugin mechanism that your implementation could use. Closing this PR for now, since unfortunately I don't see how it could eventually be merged upstream. πŸ™

@efiop closed this Dec 25, 2020
@LennyPenny

@efiop I love this PR - is there any progress on that plugin mechanism?


efiop commented Feb 6, 2021

@LennyPenny We are working on some prerequisites, and we are generally going in the fsspec direction (#5162) right now. Most likely we will end up using it as a base for plugins, but there is no 100% guarantee yet. Also, the CID situation discussed above is a very hard pill to swallow right now, but we could reconsider it.
