Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fsspec backend #5162

Closed
majidaldo opened this issue Dec 25, 2020 · 19 comments
Closed

fsspec backend #5162

majidaldo opened this issue Dec 25, 2020 · 19 comments
Assignees
Labels
p3-nice-to-have It should be done this or next sprint refactoring Factoring and re-factoring

Comments

@majidaldo
Copy link

I really think dvc should have this backend. There is much history to this filesystem abstraction so it could leverage the backends for fsspec.

@pmrowla
Copy link
Contributor

pmrowla commented Dec 26, 2020

fsspec looks like an interesting project, thanks for the suggestion!

We already use a similar abstraction internally in DVC, and I'm not sure how much we would actually gain by switching to an alternative implementation. Is there a particular remote FS type that is supported in fsspec that you are looking for in DVC?

@pmrowla pmrowla added the awaiting response we are waiting for your reply, please respond! :) label Dec 26, 2020
@efiop
Copy link
Contributor

efiop commented Dec 26, 2020

We have considered it previously while working on pyarrow (it uses fsspec too), but for now decided that it is not really worth refactoring so much stuff just for the sake of it. It will probably happen in the future, once we are ready for it and have time for it.

@pmrowla pmrowla added p3-nice-to-have It should be done this or next sprint refactoring Factoring and re-factoring labels Dec 29, 2020
@efiop efiop mentioned this issue Feb 6, 2021
9 tasks
@efiop
Copy link
Contributor

efiop commented Feb 9, 2021

For the record, we are indeed going in fsspec direction in our internals. Current roadmap is

  • make sure no Tree uses self.path_info(self._path in case of gdrive) outside of __init__. We've done some work there already, with only gdrive being the last blocker right now. Tree.path_info should really move into something like cache_dir in a corresponding CloudCache(they currently use self.tree.path_info).
  • make our Trees __init__ only accept their configuration options, don't pass repo or path_info(if path_info has user/port/etc, these should be extracted and passed directly to the tree's __init__)
  • make our tree methods only accept strings as paths, not path_infos (or maybe make it handle both, but strings/bytes are better when dealing with many-many objects, which is currently a problem for us
    # Prefer string path over PathInfo when possible due to performance

also somewhere after the first step we can rename our Tree/tree to FileSystem/FS/fs/etc so it is less confusing.

@Nantero1
Copy link

Nantero1 commented Feb 27, 2021

Hey, do you have any idea which fsspec methods (mkdir, ls, info, rmdir,...) you require and which are a nice-to-have? I wrote a Nexus3 integration for DVC and thinking about migrating it to fsspec so it can be one day used with DVC by the community.

@efiop
Copy link
Contributor

efiop commented Feb 27, 2021

@BenLinnik There is no formal list yet, but you can see all required methods in https://github.com/iterative/dvc/blob/master/dvc/fs/azure.py , for example. (e.g. see all self.fs.* calls).

@Nantero1
Copy link

@BenLinnik There is no formal list yet, but you can see all required methods in https://github.com/iterative/dvc/blob/master/dvc/fs/azure.py , for example. (e.g. see all self.fs.* calls).

Seems straightforward. Nothing too fancy. I could do that. When looking at the azure code, I wonder. Why don't you use a fs method in the self._upload_fobj method?

@efiop
Copy link
Contributor

efiop commented Feb 27, 2021

@BenLinnik We do, through fs.open. But if there is a more suitable method - we can use it too.

@Nantero1
Copy link

@BenLinnik We do, through fs.open. But if there is a more suitable method - we can use it too.

Ah sorry, you are absolutely right. The fs.open is in self.open.

A great idea to move to fsspec. This will make it possible to integrate dvc with any fsspec supported framework. That's a clever architectural choice, despite it makes my PR obsolete at the moment 👍

@martindurant
Copy link

(fsspec admin here)

I just noticed this issue, and am glad to see the energy, and the appearance of an fsspec backend in your API, alongside your own implementations. Please don't hesitate to help us improve every data user's experience!

Did you know that fsspec is interested in exposing DVC as a filesystem backend? In https://github.com/intake/filesystem_spec/blob/master/fsspec/implementations/dvc.py is some nascent code which is not references in the official docs or tested. It has been a while since I looked at that code, so it may not do anything useful by now. However, I do think it's generally useful to have a simple abstraction over-layer to DVC in fsspec, even if that ends up calling other fsspec backends internally when fetching files.
Indeed, in some ways, DVC could act like a mix of a cache layer and git.

@efiop
Copy link
Contributor

efiop commented Mar 3, 2021

Hi @martindurant !

Thanks a lot for creating and maintaining fsspec! 🙏 Was meaning to send a similar introductory message to you today 😉

We were inventing something similar ourselves with our legacy Tree classes, but then stumbled upon fsspec and were impressed by how thought through the spec is. Really wish fsspec existed in the early days of DVC 😃

Thanks to great work by @isidentical we were able to temporarily wrap and use gcsfs and adlfs in dvc and we are looking forward to migrating to s3fs and others. We are heavy users of these filesystems, so we've been catching some bugs and contributing some fixes to the upstream already (or at least reporting them in some cases). Even more to come 🙂

Did you know that fsspec is interested in exposing DVC as a filesystem backend?

We've had DvcTree and RepoTree(aka Dvc/RepoFileSystem in dvc/fs/) for a long time now and it was a surprise to find existing fsspec implementation for it! It might make sense to just register our DvcFileSystem implementation instead of the existing experimental one, so we take the burden of maintaining it from you and also provide the most feature-complete and supported implementation. Let us know what you think about that. Of course, our DvcFileSystem is not ready right now, but we are working on making it comply with fsspec.

@martindurant
Copy link

It might make sense to just register our DvcFileSystem implementation instead of the existing experimental one, so we take the burden of maintaining it from you and also provide the most feature-complete and supported implementation. Let us know what you think about that.

Certainly! I might delete the existing module, then, since it's incomplete and not used. Let me know when you have something you feel can be exposed in fsspec, and I am happy to give feedback on it too.

@majidaldo
Copy link
Author

This is awesome!

I think the more fundamental project is fsspec. dvc should be on top of fsspec. Then, when using Python, dvc should just present (versioned) files by passing fs objects to the user (from fsspec). The core idea of dvc is that it just adds a version 'argument' to paths: it should be usable on the fs as well as in Python.

@efiop
Copy link
Contributor

efiop commented Mar 3, 2021

@martindurant We are quickly moving towards it, I would say that DvcFileSystem should be ready in a month or so. We'll definitely reach out when we are ready. Thank you so much for the great work! 🙏

@majidaldo It will likely be similar to what we have right now. E.g. you could do fs = DvcFileSystem(rev="mybranch"). And our high-level wrappers https://dvc.org/doc/api-reference would be available too, as well as fsspec's user functions (e.g. open https://filesystem-spec.readthedocs.io/en/latest/api.html#user-functions).

@efiop efiop mentioned this issue Mar 8, 2021
@efiop efiop removed the awaiting response we are waiting for your reply, please respond! :) label Mar 13, 2021
@efiop efiop mentioned this issue Apr 27, 2021
2 tasks
@efiop
Copy link
Contributor

efiop commented Oct 8, 2021

March 3:

We are quickly moving towards it, I would say that DvcFileSystem should be ready in a month or so

Oct 9:
image

On a serious note, we've underestimated this initially, but thanks to tons of great work by @isidentical 🙏 we've migrated all of our currently supported clouds to fsspec and implemented a few fsspec-compatible filesystems ourselves from scratch:

Now we are working on removing some wrappers that we have (mainly related to PathInfo, but there is more) and will soon cleanup dvcfs/repofs to make it sharable as well. Plugins will likely be introduced a bit later, maybe in Q1.

@efiop efiop moved this to Backlog in DVC Apr 12, 2022
@efiop efiop added this to DVC Apr 12, 2022
@efiop efiop self-assigned this Apr 12, 2022
@skshetry
Copy link
Member

Closing, we are already using fsspec-based filesystems. And we have dvc-data and dvc-objects projects that are using those, this is no longer very relevant in dvc. We can consider this as done.

Repository owner moved this from Backlog to Done in DVC May 20, 2022
@efiop
Copy link
Contributor

efiop commented May 20, 2022

For the record: dvcfs is also almost ready to be published, we have a few small things to cleanup.

@shcheklein
Copy link
Member

@efiop @skshetry do we plan to document it in any way?

@efiop
Copy link
Contributor

efiop commented May 20, 2022

@shcheklein Once it is ready - yes. dvcfs wrap up/documentation is in our potential plans for q3

@dberenbaum
Copy link
Collaborator

Hopefully, dvcfs can address some of our pressing Python API needs like #3182.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
p3-nice-to-have It should be done this or next sprint refactoring Factoring and re-factoring
Projects
No open projects
Archived in project
Development

No branches or pull requests

8 participants