Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

datafs: allow caching remote streams to odb #333

Merged
merged 1 commit into from
Mar 15, 2023
Merged

datafs: allow caching remote streams to odb #333

merged 1 commit into from
Mar 15, 2023

Conversation

skshetry
Copy link
Member

@skshetry skshetry commented Mar 15, 2023

Related to iterative/dvc#9030.

Used in iterative/dvc#9183.

@skshetry skshetry requested review from pmrowla and efiop March 15, 2023 07:00
if cache_odb and typ == "remote":
from dvc_data.hashfile.build import _upload_file

_, obj = _upload_file(fspath, fs, cache_odb, cache_odb)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may be a bit inefficient API, we need something like add() that can stage as well as add to the odb in one go.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@skshetry This does look convoluted. Why can't plots just cache.add a file they are already reading?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because it may be git-tracked or already cached.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This cache_odb only works for open, which is odd (i would expect it to work for downloading too).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall cache_odb looks very out of place in datafs 🙁

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I decided not to add to get_file at the moment.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally find it out of place in open(), and it probably should be at the instance-level, as we always have a cache_odb (even if it's a readonly).

The datafs does not provide enough visibility on where the data is coming from.
This means it has to be implemented somewhere.

@codecov-commenter
Copy link

codecov-commenter commented Mar 15, 2023

Codecov Report

Patch coverage: 37.50% and project coverage change: -0.06 ⚠️

Comparison is base (6757ef9) 57.32% compared to head (11eb97f) 57.27%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #333      +/-   ##
==========================================
- Coverage   57.32%   57.27%   -0.06%     
==========================================
  Files          51       51              
  Lines        3365     3370       +5     
  Branches      589      590       +1     
==========================================
+ Hits         1929     1930       +1     
- Misses       1350     1353       +3     
- Partials       86       87       +1     
Impacted Files Coverage Δ
src/dvc_data/fs.py 59.37% <37.50%> (-2.17%) ⬇️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

Copy link
Contributor

@pmrowla pmrowla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was about to ask if this needed to go behind an explicit flag (other than setting cache_odb) for cases where users don't actually want caching (and only want to stream data), but I see that's being handled at the DVC level. LGTM

@skshetry
Copy link
Member Author

skshetry commented Mar 15, 2023

Was about to ask if this needed to go behind an explicit flag (other than setting cache_odb) for cases where users don't actually want caching

yeah, this is a bit weird to me as well. But the intention behind cache_odb is usually to cache from remotes, I could not think of a usecase where you want to use a different odb other than when streaming from remote. Maybe if there's a remote cache?

I am okay with introducing a separate kwarg as well, we just need to figure out what we should do if there's only cache_odb.

@skshetry
Copy link
Member Author

If the intention is to cache to odb from remote only when both are specified, it's better to just use one kwarg.

@skshetry
Copy link
Member Author

This PR does cache streams, but .dir files are still not cached, so we still end up hitting remotes.

@skshetry skshetry merged commit 226e1de into main Mar 15, 2023
@skshetry skshetry deleted the cache-streams branch March 15, 2023 12:34
@efiop
Copy link
Contributor

efiop commented Mar 15, 2023

@skshetry It seems the whole approach here is wrong. plots should fetch stuff and then read it. This was one of the reasons for iterative/dvc#9140

Feels like we should revert this change.

@skshetry
Copy link
Member Author

Could you please elaborate why you think this approach is wrong?

@efiop
Copy link
Contributor

efiop commented Mar 15, 2023

For the record: iterative/dvc#9183 (comment)

@skshetry
Copy link
Member Author

I think support for caching is a good feature on its own and fits within datafs/dvcfs.
fsspec has similar caching mechanisms on open().

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants