Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

datafs: allow caching remote streams to odb #333

Merged
merged 1 commit into from
Mar 15, 2023
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
14 changes: 11 additions & 3 deletions src/dvc_data/fs.py
Original file line number Diff line number Diff line change
Expand Up @@ -64,14 +64,22 @@ def _get_fs_path(self, path: "AnyFSPath"):
if data:
fs, fs_path = data
if fs.exists(fs_path):
return fs, fs_path
return fs, typ, fs_path

raise FileNotFoundError

def open( # type: ignore
self, path: str, mode="r", encoding=None, **kwargs
): # pylint: disable=arguments-renamed, arguments-differ
fs, fspath = self._get_fs_path(path, **kwargs)
cache_odb = kwargs.pop("cache_odb", None)
fs, typ, fspath = self._get_fs_path(path, **kwargs)

if cache_odb and typ == "remote":
from dvc_data.hashfile.build import _upload_file

_, obj = _upload_file(fspath, fs, cache_odb, cache_odb)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may be a bit inefficient API, we need something like add() that can stage as well as add to the odb in one go.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@skshetry This does look convoluted. Why can't plots just cache.add a file they are already reading?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because it may be git-tracked or already cached.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This cache_odb only works for open, which is odd (i would expect it to work for downloading too).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall cache_odb looks very out of place in datafs 🙁

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I decided not to add to get_file at the moment.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally find it out of place in open(), and it probably should be at the instance-level, as we always have a cache_odb (even if it's a readonly).

The datafs does not provide enough visibility on where the data is coming from.
This means it has to be implemented somewhere.

fs, fspath = cache_odb.fs, obj.path

return fs.open(fspath, mode=mode, encoding=encoding)

def ls(self, path, detail=True, **kwargs):
Expand Down Expand Up @@ -118,7 +126,7 @@ def get_file( # pylint: disable=arguments-differ
self, rpath, lpath, callback=DEFAULT_CALLBACK, **kwargs
):
try:
fs, path = self._get_fs_path(rpath)
fs, _, path = self._get_fs_path(rpath)
except IsADirectoryError:
os.makedirs(lpath, exist_ok=True)
return None
Expand Down