Support IPFS, DAT, or other distributed storage #6777

remram44 · 2018-07-21T23:55:31Z

remram44
Jul 21, 2018

Having an option to share data files in a peer-to-peer way is probably a good idea. It eliminates the need to pay for external services, and scales much better in the "public open project" situation (where a lot of cloners would mean substantial S3 costs).

IPFS is probably the easiest to support here, together with DAT. Using BitTorrent directly seems complicated.

efiop · 2018-07-22T07:19:44Z

efiop
Jul 22, 2018

Hi @remram44 !

Great idea, thank you! If you wish to implement support for some of those, please feel free to take a look at the dvc/remote/ directory in our project, where all current remote drivers are implemented. To support data pulling/pushing, you would only need to implement the download(), upload() and exists() methods, so it should be pretty easy. We will be happy to merge any proper pull request 🙂

A current workaround for that would be to just pack your .dvc/cache directory and share it with others using any P2P protocol you'd like, including BitTorrent 🙂
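As a rough sketch of that workaround (not something DVC provides), the cache directory could be packed into a single archive with Python's standard tarfile module before being shared. The function names pack_cache and list_archive and the archive name are made up for this illustration; only the .dvc/cache path comes from the comment above.

```python
import tarfile
from pathlib import Path


def pack_cache(cache_dir, archive_path):
    """Pack a cache directory (e.g. .dvc/cache) into a gzip-compressed tarball.

    Hypothetical helper for illustration; not part of DVC.
    """
    cache_dir = Path(cache_dir)
    if not cache_dir.is_dir():
        raise FileNotFoundError(f"no cache directory at {cache_dir}")
    archive_path = Path(archive_path)
    with tarfile.open(archive_path, "w:gz") as tar:
        # Store entries under the directory's own name ("cache/...") so the
        # archive can be unpacked directly inside another clone's .dvc/ folder.
        tar.add(cache_dir, arcname=cache_dir.name)
    return archive_path


def list_archive(archive_path):
    """Return the sorted names of the regular files stored in the archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return sorted(m.name for m in tar.getmembers() if m.isfile())


# Example call (paths are assumptions):
# pack_cache(".dvc/cache", "dvc-cache.tar.gz")
```

The resulting file can then be shared over whatever P2P channel is convenient, and recipients unpack it into their own .dvc/ directory.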

Thanks,
Ruslan

0 replies

remram44 · 2018-07-22T16:47:10Z

remram44
Jul 22, 2018
Author

Indeed this looks easy to add. I don't have the cycles to attempt this now, but I might try in the future.

0 replies

dmpetrov · 2018-07-22T21:58:01Z

dmpetrov
Jul 22, 2018
Maintainer

The idea of storing data in p2p/blockchain looks very appealing. We develop DVC based mostly on our industrial data science experience, where p2p is not a big part of the environment. But it might become one soon! Recently, I got another request (not on GitHub) regarding the denet.pro dApp for storing data for DVC.

It would be great to understand the landscape of p2p datasets:

Who uses these tools for storing datasets for ML?
What types of ML projects: deep NNs for images, NLP, academic analytical projects ...?
What are the use cases: storing common data sources, project results, all data derivatives?
What are the most common protocols in this space, and what are the differences between them?

If there is a demand we can definitely implement this.

@remram44 please let me know if you use this kind of storage. I would really like to discuss the use cases and your thoughts. It would also help if you could connect us to other users or to the tool/protocol creators.

0 replies

remram44 · 2018-07-23T00:16:58Z

remram44
Jul 23, 2018
Author

I'm thinking about the case where you make an analysis public, e.g. publish it on GitHub. Having everyone download from your S3 bucket would incur charges, while hosting it on some box in your lab would provide very limited bandwidth. Peer-to-peer solutions would scale nicely.

0 replies

flxai · 2019-12-18T17:15:17Z

flxai
Dec 18, 2019

Are there any plans to implement this? @remram44 @dmpetrov
Would a contribution still be valued?

Also I can see further applications in the field of scientific reproducibility and general public data sharing.

0 replies

efiop · 2019-12-18T17:54:09Z

efiop
Dec 18, 2019

@icks No such plans from the core team, at least for now. We would appreciate it if you could share your thoughts on this and in which scenarios you would like to use it. Contributions are always welcome, feel free to give it a shot. Ping us here or on Discord if you need any help 🙂

0 replies

shcheklein · 2019-12-18T17:58:01Z

shcheklein
Dec 18, 2019
Maintainer

@icks it would be a good addition indeed. Unfortunately, it would take a while for the core team to prioritize this, as @efiop mentioned :( We would really love for the community to contribute in this case, and we can provide all the support and help on this.

0 replies

efiop · 2021-04-12T19:06:22Z

efiop
Apr 12, 2021

For the record: if anyone is interested in contributing support for any of these, I would highly recommend starting by writing a filesystem class compatible with fsspec (https://github.com/intake/filesystem_spec/), as that's what dvc is migrating to.

0 replies

remram44 · 2021-04-12T19:16:07Z

remram44
Apr 12, 2021
Author

This is unlikely to look like a normal fsspec backend, because with content addressing you cannot choose the name of the destination (it includes a hash of the content).
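To make that constraint concrete, here is a minimal sketch in which the "name" of a blob is simply the SHA-256 hex digest of its bytes. This is a simplification chosen for illustration: a real IPFS CID encodes more than a raw SHA-256 of the file, and content_address is a hypothetical helper, not an IPFS or DVC API.

```python
import hashlib


def content_address(data: bytes) -> str:
    """Derive the storage name from the content itself (simplified stand-in for a CID)."""
    return hashlib.sha256(data).hexdigest()


v1 = b"id,label\n1,cat\n"
v2 = b"id,label\n1,dog\n"  # same dataset after a one-word edit

addr_v1 = content_address(v1)
addr_v2 = content_address(v2)

# The caller never picks the destination name: it is a function of the bytes.
assert content_address(v1) == addr_v1
# Any change to the content yields a different address, so there is no stable
# "remote path" that a new version could overwrite in place.
assert addr_v1 != addr_v2
```

A conventional fsspec-style backend expects to be told where to write; here the location is only known once the content is.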

0 replies

efiop · 2021-04-12T19:25:16Z

efiop
Apr 12, 2021

@remram44 That was one of the major problems with #4736. I'm hoping we can find some way to handle that. With fsspec or without, the fact that the URL changes after each dvc push is a pretty serious challenge to handle elegantly in dvc.

1 reply

Erotemic Jan 4, 2022

I've been thinking about this problem recently, and I think there might be a solution if #3069 is resolved. IIUC a CID of a file is effectively a hash. It's determined by the file contents. If DVC allowed for a "cid" as the hash algorithm and the filesystem backend was allowed to know about the setting (ipfs backend could then make assumptions), how far would that go towards solving this problem?

If we know what the cid is, download, upload, and exists have actionable implementations (unless I'm missing something).
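A toy, in-memory sketch of that idea follows. It assumes the address is a plain SHA-256 hex digest rather than a real CID, and the class ContentAddressedStore is invented for this example; it is not DVC's remote interface or an IPFS client. The point it tries to show is that when the client can compute the same address locally, all three operations can be keyed by it.

```python
import hashlib


class ContentAddressedStore:
    """In-memory stand-in for a content-addressed remote (hypothetical)."""

    def __init__(self):
        self._blobs = {}

    @staticmethod
    def address(data: bytes) -> str:
        # Simplified address: real CIDs are not a bare SHA-256 of the file.
        return hashlib.sha256(data).hexdigest()

    def upload(self, data: bytes) -> str:
        """Store the bytes and return the address the store assigned."""
        key = self.address(data)
        self._blobs[key] = data
        return key

    def exists(self, key: str) -> bool:
        return key in self._blobs

    def download(self, key: str) -> bytes:
        data = self._blobs[key]
        # Content addressing lets the reader verify what it received.
        if self.address(data) != key:
            raise ValueError("stored content does not match its address")
        return data


store = ContentAddressedStore()
payload = b"model weights"

# If the client uses the same algorithm, it knows the key before pushing.
expected_key = ContentAddressedStore.address(payload)
assert not store.exists(expected_key)
assert store.upload(payload) == expected_key
assert store.exists(expected_key)
assert store.download(expected_key) == payload
```

Whether DVC's recorded hash could actually be made to match what an IPFS node computes is exactly the open question here; the sketch only assumes it can.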

efiop · 2021-10-08T18:24:09Z

efiop
Oct 8, 2021

Moving to discussion since this is not actionable yet.

0 replies

remram44 · 2021-10-08T18:37:12Z

remram44
Oct 8, 2021
Author

Using Discussions as feature requests is pretty unusual. Or am I to understand that whether this would be accepted is under discussion? People are unlikely to pick items that were removed from issues to work on them.

3 replies

efiop Oct 8, 2021

@remram44 The issue here is not the implementation, as one can implement ipfs support, but it is not clear how to integrate that into the dvc workflow to propagate the URL changes back. So the issue wasn't actionable in an engineering sense 🙁

remram44 Oct 8, 2021
Author

WONTFIX then, if I understand correctly

efiop Oct 8, 2021

Maybe I'm missing something, but WONTFIX is usually about bugs. This discussion is about a new workflow that we don't really know how to proceed with, and there are no actionable things here so far. Hence keeping it as a discussion, so that maybe we can come up with an acceptable workflow and then proceed to implement it.

Something that could be implemented here right away is an fsspec-compatible IPFS filesystem that we could use, for now, only in read-only mode. Maybe we will figure out an acceptable workflow to use the write functionality as well in the future.
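For a feel of what "read-only" could mean in practice, here is a small sketch that reads content through an HTTP gateway and refuses writes. It is deliberately not a real fsspec subclass: the class name ReadOnlyGatewayFS, its methods, and the injectable fetch callable are assumptions made for this example, and the default gateway host is just the public one that appears elsewhere in this thread.

```python
from typing import Callable, Optional
from urllib.parse import quote
from urllib.request import urlopen


def _http_get(url: str) -> bytes:
    """Default fetcher: plain HTTP GET using the standard library."""
    with urlopen(url) as resp:
        return resp.read()


class ReadOnlyGatewayFS:
    """Hypothetical read-only view of IPFS content via an HTTP gateway."""

    def __init__(self, gateway: str = "https://ipfs.io",
                 fetch: Optional[Callable[[str], bytes]] = None):
        self.gateway = gateway.rstrip("/")
        # The fetcher is injectable so the class can be exercised offline.
        self._fetch = fetch or _http_get

    def url(self, cid: str, path: str = "") -> str:
        """Build the gateway URL for a CID and an optional path below it."""
        base = f"{self.gateway}/ipfs/{quote(cid)}"
        path = path.strip("/")
        return f"{base}/{quote(path)}" if path else base

    def cat(self, cid: str, path: str = "") -> bytes:
        """Return the bytes stored under the given CID (and optional path)."""
        return self._fetch(self.url(cid, path))

    def put(self, *args, **kwargs):
        # Writing would produce a new address, which is the unresolved
        # workflow problem, so this sketch simply refuses.
        raise NotImplementedError("this sketch is read-only")
```

Reading needs only the address, which is why it can be done without settling how new addresses would be propagated back after a push.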

Erotemic · 2021-11-19T16:40:01Z

Erotemic
Nov 19, 2021

I have a use case where I've collected about 20GB of data (so far) I'd like to publish as a free-to-use open dataset. IPFS seems like a good way to accomplish that. Note: I am new to IPFS, so I might not fully appreciate the limitations and challenges.

I see here that there is a POC fsspec for read-only IPFS: https://github.com/fsspec/ipfsspec

Perhaps using something like DNSLink would be the "right way" to handle a read/write dataset?
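For context, DNSLink publishes a DNS TXT record of the form dnslink=/ipfs/<cid>, so a stable domain name can be repointed at the newest immutable address. The sketch below only parses such record values that have already been fetched; the function name parse_dnslink is made up here, and how DVC would consume the result is left open.

```python
from typing import Iterable, Optional, Tuple

PREFIX = "dnslink="


def parse_dnslink(txt_values: Iterable[str]) -> Optional[Tuple[str, str]]:
    """Return (namespace, identifier) from the first DNSLink TXT value found.

    For "dnslink=/ipfs/<cid>/some/path" this returns ("ipfs", "<cid>").
    Returns None when no value carries a well-formed DNSLink entry.
    """
    for value in txt_values:
        value = value.strip().strip('"')
        if not value.startswith(PREFIX):
            continue  # unrelated TXT record, e.g. SPF
        # "/ipfs/<cid>/rest" -> ["", "ipfs", "<cid>", "rest"]
        parts = value[len(PREFIX):].split("/", 3)
        if len(parts) >= 3 and parts[0] == "" and parts[1] and parts[2]:
            return parts[1], parts[2]
    return None


records = [
    "v=spf1 -all",
    "dnslink=/ipfs/QmNj2MbeL183GtPoGkFv569vMY8nupUVGEVvvvqhjoAATG",
]
namespace, identifier = parse_dnslink(records)
assert namespace == "ipfs"
```

With such a pointer, a consumer would keep referring to the domain name while the publisher updates the record after each new upload.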

2 replies

Erotemic Nov 24, 2021

As an update, this is the project I'm referring to: https://github.com/Erotemic/shitspotter

Erotemic Dec 31, 2021

As another update, the data is now on IPFS. I would love to have the dataset managed with DVC:

https://ipfs.io/ipfs/QmNj2MbeL183GtPoGkFv569vMY8nupUVGEVvvvqhjoAATG
