Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] api: traversable files API #6850

Closed
wants to merge 1 commit into from

Conversation

skshetry
Copy link
Member

@skshetry skshetry commented Oct 22, 2021

WIP, I don't plan to merge this without more discussions/refinements. The implementation is quite straightforward (half of the changes are type-annotations). See: #6550 (comment) for the motivation.

This API allows users to traverse through Repo, with pathlib.Path-like API.

eg:

from dvc.api import files

with files() as root:
   (root / "data" / "data.xml").read_text()
   print((root / "data" / "data.xml").url(remote=None))
   print(list((root / "data").iterdir()))

Pathlib APIs supported: .open/.read_text/.read_bytes/.iterdir/.exists/.is_dir/.isfile.
Extended APIs: .read (equivalent to dvc.api.read) / .url (eq. to dvc.api.get_url).
Limited support for: .glob/.rglob

Also, all of the dvc.api have been migrated to use files() api.


Given that we are migrating to fsspec, and the fact that we have RepoFileSystem, this does mean that we'll have two similar APIs (though this one is built on top of it). The RepoFileSystem may be more powerful, but the pathlib-like API will definitely be user-friendly. We can keep this API private or leave this PR unmerged till we decide on the API.

@skshetry skshetry requested a review from a team as a code owner October 22, 2021 07:25
@skshetry skshetry requested a review from pared October 22, 2021 07:25
@skshetry skshetry force-pushed the dvc-api-traversable branch 2 times, most recently from 45b2c11 to b1dc02f Compare October 22, 2021 07:38
dvc/api.py Outdated Show resolved Hide resolved
@skshetry skshetry force-pushed the dvc-api-traversable branch 2 times, most recently from 06598c5 to 7ca6f06 Compare October 22, 2021 08:52
@karajan1001
Copy link
Contributor

karajan1001 commented Oct 29, 2021

I guess the built-in_io.TextIOWrapper is more friendly to me on files. The pathlib.Path which supports directories is a paths. So, maybe your files package here are more paths?

@skshetry skshetry requested review from daavoo and dberenbaum November 1, 2021 14:48
dvc/api.py Outdated


@contextmanager
def files(path=os.curdir, repo=None, rev=None) -> Iterator[RepoPath]:
Copy link
Contributor

@daavoo daavoo Nov 2, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a better suggestion but IMO, the name files doesn't feel very intuitive.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@daavoo walk_files?

Copy link
Contributor

@daavoo daavoo Nov 2, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

files()/walk_files() sound to me like what I would expect to be returned by os.listdir(os.curdir)/os.walk(os.curdir).

However, it looks like what files() is returning is more similar to what Path(os.curdir) would return (unless I'm completely misunderstanding the example snippets in the description).

So, for me, it would be more intuitive to have a name that doesn't suggest that you are going to return an iterator of files. Instead, I would like a name that suggest that you are going to return a context manager with a Path-like instance pointing to some path (defaulting to os.curdir).

Does that make sense?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that the usage wasn't immediately intuitive to me. How much value is the 2-line files() function adding? It might be more clear to use Repo.open() and RepoPath directly since they follow conventions of Python built-ins.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am open for changing the name of the API, would love to hear other alternative namings. Regarding Repo, it's an undocumented API and to replace files(), it needs to interact with two different components.

Copy link
Contributor

@daavoo daavoo Nov 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_repo_path / repo_path?

@dberenbaum
Copy link
Collaborator

RepoPath looks great. How is this or files() supposed to work for remote repos?

I tried:

>>> with files("src", repo="https://github.com/iterative/example-dvc-experiments.git") as root:
...     print(list(root.iterdir()))
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/Users/dave/Code/dvc/dvc/repo_path.py", line 201, in iterdir
    for _, dirs, files in repo_walk:
  File "/Users/dave/Code/dvc/dvc/fs/repo.py", line 410, in walk
    if not self.exists(top):
  File "/Users/dave/Code/dvc/dvc/fs/repo.py", line 235, in exists
    if dvc_fs.repo.dvcignore.is_ignored(fs, path_info):
  File "/Users/dave/Code/dvc/dvc/ignore.py", line 382, in is_ignored
    if fs.isfile(path):
  File "/Users/dave/Code/dvc/dvc/fs/git.py", line 67, in isfile
    key = self._get_key(path_info)
  File "/Users/dave/Code/dvc/dvc/fs/git.py", line 35, in _get_key
    relparts = path.relative_to(self._root).parts
  File "/opt/homebrew/Cellar/[email protected]/3.9.5/Frameworks/Python.framework/Versions/3.9/lib/python3.9/pathlib.py", line 929, in relative_to
    raise ValueError("{!r} is not in the subpath of {!r}"
ValueError: 'src' is not in the subpath of '/private/var/folders/24/99_tf1xj3vx8k1k_jkdmnhq00000gn/T/tmp787b4c4bdvc-clone' OR one path is relative and the other is absolute.

@dberenbaum
Copy link
Collaborator

Things I find confusing:

  1. dvc.api.files() returns a RepoPath, so do we need to document RepoPath anyway? Documenting this and dvc.api.files() seems like it might be more complex than using RepoPath directly.
  2. Does the user need a context manager? It's more overhead and it feels weird combining open and Path. I guess without it, you would need to create a repo object whenever a RepoPath method is called. pathlib.Path calls with self.open() in every method where it opens a file, so maybe it's not too different?
  3. If the user needs a context manager, it seems more convenient at the repo level than combining it with a path. Why should users need a separate context manager to open a different path in the same repo?

It would be more intuitive to me for users to directly use RepoPath (or dvc.api.Path), which could create/open repo instances as needed internally, or to have something like dvc.api.repo_open() and pass that to RepoPath.

@skshetry
Copy link
Member Author

skshetry commented Nov 4, 2021

  1. dvc.api.files() returns a RepoPath, so do we need to document RepoPath anyway? Documenting this and dvc.api.files() seems like it might be more complex than using RepoPath directly.

Not sure I understand this, it should be fairly straightforward to document this.
files() returns a `RepoPath` object that is an extended version of pathlib.Path.

I think you are looking it at being two different things, files() API and RepoPath API, but they are the same thing. Functions do have return types and that's how you interact with them, so they are not two things, but one. For example, dvc.api.open returns a io.IOBase compatible object that has methods like read.

  1. Does the user need a context manager? It's more overhead and it feels weird combining open and Path. I guess without it, you would need to create a repo object whenever a RepoPath method is called. pathlib.Path calls with self.open() in every method where it opens a file, so maybe it's not too different?

We do need a context manager to manage resources for the Repo. DVC may be keeping handles to a lots of files (mostly git related libraries are doing this), keeping sqlite state open. We should always cleanup these resources. This is going to be a big issue on Windows if we don't manage them properly, though we should always clean up these resources in all platforms.

We already have dvc.api.read and dvc.api.open that does clone repo each time you call.
This API however is meant to be traversable, so we cannot do that on each open and read, so we do it only once.
Also, context manager is the pythonic way to manage resources.

  1. If the user needs a context manager, it seems more convenient at the repo level than combining it with a path. Why should users need a separate context manager to open a different path in the same repo?

Users don't need to use separate context managers for different paths, the files() returns a root path, and users can use child objects using root / <path>.

It would be more intuitive to me for users to directly use RepoPath (or dvc.api.Path), which could create/open repo instances as needed internally, or to have something like dvc.api.repo_open() and pass that to RepoPath.

dvc.api.repo_open still needs to return Repo instance, which has a huge API surface. We should be hiding those complexities with these APIs as we only need this thing for being traversable with files across the Repo. Also, why separate two different things if we can do so with a single API?

@dberenbaum
Copy link
Collaborator

Thanks for the detailed response @skshetry. Backing up, is this related to any pressing issues on the roadmap? If not, we can take our time making decisions about the interface. To me, it's worth it here because this has potential to be widely useful and maybe even supersede a lot of the existing API while also being way more flexible 🚀 .

Two ideas that would aid a lot of my confusion:

  1. Calling it something like repo_open, get_repo, repo_session, or something that hints that it's opening and managing the repo (and also documenting this). It's initially confusing because it's not obvious why I can't create a RepoPath directly like I can with pathlib.Path. It's not clear (to someone looking at this for the first time) what files() is adding on top of this, why it's a context manager, and what that context manager is handling. It's additionally confusing because a lot of path operations require their own context managers to manage files, but the files() context manager is instead managing the repo.
  2. Removing the path argument in files(). Then it's more clear that this is the entrypoint to access the repo rather than also being a way to get a specific subpath. I know it's weird to suggest making it less flexible, but it helps me understand what's happening. It's also not always obvious what to enter for path, especially for remote repos (should I use the full url, a relative path?). Once they have the base path, it's easy to navigate.

Other thoughts:

  • What about putting this into an entirely new submodule, like dvc.api.repo? Then we could even call it dvc.api.repo.open(). We are already likely to have dvc.api.experiments at some point. The top-level api could be reserved for the most high-level functions.
  • The RepoPath has a repo attribute, so how do we document it without addressing Repo? I guess we could make it a private attribute if needed?
  • I'm still not totally clear on one thing: is it a performance bottleneck (re-cloning, losing info from sqlite, etc.) to create a separate repo instance in each call?
  • We can discuss outside of this PR, but can you clarify the difference between the index idea you were working on vs the repo objects here? I thought the combination of a repo and revisions was basically what we described as an index.

@daavoo daavoo added the A: api Related to the dvc.api label Nov 10, 2021
@skshetry skshetry force-pushed the dvc-api-traversable branch 2 times, most recently from 1842db9 to f504346 Compare January 4, 2022 06:27
@skshetry skshetry force-pushed the dvc-api-traversable branch from f504346 to 9546b36 Compare January 4, 2022 06:36
@iterative iterative deleted a comment from lgtm-com bot Jan 4, 2022
@iterative iterative deleted a comment from lgtm-com bot Jan 4, 2022
@skshetry
Copy link
Member Author

skshetry commented Feb 15, 2022

From discussions in #7379, it seems we need to also expose a way to copy a file/directory similar to dvc get.

with files() as root:
    root.download(to="path")

@dberenbaum
Copy link
Collaborator

From discussions in #7379, it seems we need to also expose a way to copy a file/directory similar to dvc get.

What would be the benefit over exposing Repo.get as part of the public API? I'm not opposed to adding a way to download in the file API, but when trying to mimic the CLI, it might be more straightforward to use the Repo API.

@skshetry
Copy link
Member Author

From discussions in #7379, it seems we need to also expose a way to copy a file/directory similar to dvc get.

What would be the benefit over exposing Repo.get as part of the public API?

Repo API is not a public interface and the fact that it mimics CLI

@dberenbaum
Copy link
Collaborator

Repo API is not a public interface and the fact that it mimics CLI

Not sure I understand. My point is that it might be easier for users to follow an API that mimics the CLI for simple use cases. Anyway, let me open a separate discussion since this is really about Repo.

@skshetry
Copy link
Member Author

Closing for now, will need to revisit after all the filesystem changes.

@skshetry skshetry closed this Apr 21, 2022
@skshetry skshetry deleted the dvc-api-traversable branch April 21, 2022 09:13
@skshetry skshetry restored the dvc-api-traversable branch April 27, 2022 03:55
@skshetry skshetry deleted the dvc-api-traversable branch May 15, 2022 04:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A: api Related to the dvc.api
Projects
No open projects
Archived in project
Development

Successfully merging this pull request may close these issues.

5 participants