datafs: allow caching remote streams to odb #333
if cache_odb and typ == "remote":
    from dvc_data.hashfile.build import _upload_file

    _, obj = _upload_file(fspath, fs, cache_odb, cache_odb)
This may be a somewhat inefficient API; we need something like `add()` that can stage as well as add to the odb in one go.
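To make the "stage and add in one go" idea concrete, here is a rough sketch of a single-pass `add()`. It is not the dvc-data API: `ToyODB`, `path_for`, and the on-disk layout are hypothetical names invented for illustration, and SHA-256 is an arbitrary choice of hash. The stream is hashed while it is written to a temporary file, which is then moved to its content-addressed location.

```python
import hashlib
import io
import os
import tempfile


class ToyODB:
    """Hypothetical content-addressed store; NOT the dvc-data ODB API."""

    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def path_for(self, digest):
        # Assumed layout: first two hex chars form a subdirectory.
        return os.path.join(self.root, digest[:2], digest[2:])

    def add(self, fobj, chunk_size=1024 * 1024):
        """Hash and store a stream in a single pass; return the hex digest."""
        hasher = hashlib.sha256()
        fd, tmp_path = tempfile.mkstemp(dir=self.root)
        try:
            with os.fdopen(fd, "wb") as tmp:
                while True:
                    chunk = fobj.read(chunk_size)
                    if not chunk:
                        break
                    hasher.update(chunk)  # "stage": compute the hash
                    tmp.write(chunk)      # "add": persist the bytes
            digest = hasher.hexdigest()
            dest = self.path_for(digest)
            os.makedirs(os.path.dirname(dest), exist_ok=True)
            os.replace(tmp_path, dest)
        except BaseException:
            # Clean up the temporary file if anything failed before the move.
            if os.path.exists(tmp_path):
                os.unlink(tmp_path)
            raise
        return digest


# Usage: the caller reads the stream once and gets the digest back.
odb = ToyODB(tempfile.mkdtemp())
digest = odb.add(io.BytesIO(b"hello"))
```

The point of the sketch is only that the caller never has to stage and upload as two separate steps; whether something like this belongs on the odb itself is a separate question.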
@skshetry This does look convoluted. Why can't plots just `cache.add` a file they are already reading?
Because it may be git-tracked or already cached.
This `cache_odb` only works for `open`, which is odd (I would expect it to work for downloading too).
Overall, `cache_odb` looks very out of place in datafs 🙁
Yeah, I decided not to add it to `get_file` at the moment.
I personally find it out of place in `open()`, and it should probably live at the instance level, as we always have a `cache_odb` (even if it's read-only).
The `datafs` does not provide enough visibility into where the data is coming from.
This means it has to be implemented somewhere.
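As a rough illustration of the instance-level suggestion, the sketch below configures the cache once on the filesystem object so that both `open()` and `get_file()` share the same caching path. Everything here is hypothetical: `CachingFS`, the `sources` mapping, and the dict-based cache are stand-ins invented for this example, not datafs or dvc-data code, and the cache is keyed by path rather than by hash to keep it short.

```python
import io


class CachingFS:
    """Hypothetical wrapper; names and layout are illustrative only."""

    def __init__(self, sources, cache=None):
        # sources: mapping of path -> (typ, data), where typ is a label
        # such as "remote" or "cache" describing where the data lives.
        self.sources = sources
        # cache: dict-like store set once per instance; None disables caching.
        self.cache = cache

    def _resolve(self, path):
        typ, data = self.sources[path]
        # Only data that comes from a remote gets written to the cache.
        if typ == "remote" and self.cache is not None:
            self.cache[path] = data
        return data

    def open(self, path):
        return io.BytesIO(self._resolve(path))

    def get_file(self, path, dest):
        with open(dest, "wb") as fobj:
            fobj.write(self._resolve(path))


# Usage: one instance-level setting covers streaming and downloading.
fs = CachingFS({"a": ("remote", b"x"), "b": ("cache", b"y")}, cache={})
content = fs.open("a").read()
```

With this shape, a caller that only wants to stream would construct the instance without a cache instead of passing a kwarg to each `open()` call.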
Codecov Report
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #333      +/-   ##
==========================================
- Coverage   57.32%   57.27%   -0.06%
==========================================
  Files          51       51
  Lines        3365     3370       +5
  Branches      589      590       +1
==========================================
+ Hits         1929     1930       +1
- Misses       1350     1353       +3
- Partials       86       87       +1
Was about to ask if this needed to go behind an explicit flag (other than setting `cache_odb`) for cases where users don't actually want caching (and only want to stream data), but I see that's being handled at the DVC level. LGTM
Yeah, this is a bit weird to me as well. But the intention behind I am okay with introducing a separate kwarg as well, we just need to figure out what we should do if there's only
If the intention is to cache to odb from remote only when both are specified, it's better to just use one kwarg.
This PR does cache streams, but
@skshetry It seems the whole approach here is wrong. Plots should fetch stuff and then read it. This was one of the reasons for iterative/dvc#9140. Feels like we should revert this change.
Could you please elaborate why you think this approach is wrong?
For the record: iterative/dvc#9183 (comment)
I think support for caching is a good feature on its own and fits within datafs/dvcfs.
Related to iterative/dvc#9030.
Used in iterative/dvc#9183.