Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dep: repo: use dvcfs #9246

Merged
merged 1 commit into from
Apr 2, 2023
Merged

dep: repo: use dvcfs #9246

merged 1 commit into from
Apr 2, 2023

Conversation

efiop
Copy link
Contributor

@efiop efiop commented Mar 27, 2023

Makes dvc get/import/etc use dvcfs, which already supports cloud versioning, circular imports and other stuff.

Also makes dvc import behave more like dvc import-url, so that we can use the same existing logic for fetching those using index instead of objects.

Fixes #8789
Related https://github.com/iterative/studio/issues/4782

@efiop efiop force-pushed the fetch-import-index branch from a5838e1 to 053c0d6 Compare March 29, 2023 11:01
@efiop efiop force-pushed the fetch-import-index branch 3 times, most recently from 395804c to 8bb918a Compare March 29, 2023 15:47
@efiop efiop closed this Mar 29, 2023
@efiop efiop reopened this Mar 29, 2023
@efiop efiop force-pushed the fetch-import-index branch 2 times, most recently from 4c79caf to 4e94701 Compare March 29, 2023 22:48
@efiop efiop force-pushed the fetch-import-index branch 2 times, most recently from 03e3e10 to 2d7f427 Compare March 30, 2023 14:14
@sjawhar
Copy link
Contributor

sjawhar commented Mar 31, 2023

Reporting back that this PR fixes the use case for which I had previously created #9252 🙏
Relevant issue: #6124

@efiop efiop force-pushed the fetch-import-index branch from 2d7f427 to 8cb5f82 Compare April 1, 2023 01:49
@codecov
Copy link

codecov bot commented Apr 1, 2023

Codecov Report

Patch coverage: 100.00% and project coverage change: +0.02 🎉

Comparison is base (24e10e0) 92.91% compared to head (ccb6797) 92.93%.

❗ Current head ccb6797 differs from pull request most recent head 67900a7. Consider uploading reports for the commit 67900a7 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #9246      +/-   ##
==========================================
+ Coverage   92.91%   92.93%   +0.02%     
==========================================
  Files         459      459              
  Lines       37226    37104     -122     
  Branches     5371     5344      -27     
==========================================
- Hits        34588    34484     -104     
+ Misses       2100     2089      -11     
+ Partials      538      531       -7     
Impacted Files Coverage Δ
dvc/exceptions.py 95.74% <ø> (-0.09%) ⬇️
dvc/dependency/base.py 61.70% <100.00%> (+1.70%) ⬆️
dvc/dependency/repo.py 100.00% <100.00%> (+4.34%) ⬆️
dvc/fs/dvc.py 95.76% <100.00%> (-0.75%) ⬇️
dvc/output.py 89.73% <100.00%> (+1.05%) ⬆️
dvc/repo/imports.py 89.65% <100.00%> (+0.94%) ⬆️
dvc/repo/index.py 92.71% <100.00%> (+0.04%) ⬆️
dvc/repo/init.py 93.87% <100.00%> (+0.39%) ⬆️
tests/func/test_data_cloud.py 99.43% <100.00%> (+<0.01%) ⬆️
tests/func/test_import.py 99.46% <100.00%> (+0.45%) ⬆️

... and 2 files with indirect coverage changes

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

@efiop
Copy link
Contributor Author

efiop commented Apr 1, 2023

@sjawhar Thank you for testing it! 🙏

@efiop efiop force-pushed the fetch-import-index branch 4 times, most recently from b320bc5 to adc194a Compare April 1, 2023 23:34
@efiop efiop changed the title [WIP] dependency: repo: use dvcfs [WIP] dep: repo: use dvcfs Apr 1, 2023
Makes dvc get/import/etc use dvcfs, which already supports cloud versioning,
circular imports and other stuff.

Also makes dvc import behave more like dvc import-url, so that we can use
the same existing logic for fetching those using index instead of objects.

Fixes iterative#8789
Related iterative/studio#4782
@efiop efiop force-pushed the fetch-import-index branch from adc194a to 67900a7 Compare April 2, 2023 00:25
@efiop efiop changed the title [WIP] dep: repo: use dvcfs dep: repo: use dvcfs Apr 2, 2023
@efiop efiop marked this pull request as ready for review April 2, 2023 00:54
@efiop efiop merged commit ecdc44e into iterative:main Apr 2, 2023
@efiop efiop mentioned this pull request Apr 2, 2023
11 tasks
@omesser omesser added the A: Repo Related to the internal Repo api label Apr 4, 2023
@efiop efiop added A: data-sync Related to dvc get/fetch/import/pull/push and removed A: Repo Related to the internal Repo api labels Apr 6, 2023
@shcheklein
Copy link
Member

Continuing the discussion from the #9385 cc @efiop

  • documentation - I meant the PR description in the last comment, something like "we discussed with Peter it makes sense to remove this bc it's covered there" - just for the context and visibility behind the decisions.
  • as it is a normal scenario now and should work by design - you mean circular deps? can it be normal? (I'm not that familiar, but sounds suspicious a bit).
  • also, another question - is it parallel (in case of a dir) with the new implementation (I assume the previous one was downloading everything the regular pull/fetch/checkout that was respecting flags and had optimizations ?

@efiop
Copy link
Contributor Author

efiop commented May 7, 2023

@shcheklein

documentation - I meant the PR description in the last comment, something like "we discussed with Peter it makes sense to remove this bc it's covered there" - just for the context and visibility behind the decisions.

I agree and already admitted that this is not my proudest work in terms of PR comments/descriptions. Not sure if there is anything else I could do to mitigate that right now. Will be sure not to skip out on that next time.

as it is a normal scenario now and should work by design - you mean circular deps? can it be normal? (I'm not that familiar, but sounds suspicious a bit).

Yes, the dependency resolution is lazy and will stop at the first place where it finds the data contents so that test no longer applies as is. Still there are lots of edge cases that might be there beyond this that we never researched deeper.

also, another question - is it parallel (in case of a dir) with the new implementation (I assume the previous one was downloading everything the regular pull/fetch/checkout that was respecting flags and had optimizations ?

Not sure implementation of specifically what you are asking, could you elaborate? dvc get used and uses dvcfs.get that is not parallelized and wasn't before. dvc import was building objects from dvcfs sequentially but now doesn't do that, but it wasn't parallelized either. dvc fetch used to download imports in parallel and doesn't anymore, but that had other problems. The aim here is to have it all in one place specifically so that we don't have varying implementations, features, and optimizations based on a command you are running. dvcfs.get will be parallelized and will automatically benefit, for example, both dvc get and dvc import now (and more) instead of having diverging implementations.

@skshetry
Copy link
Member

skshetry commented May 7, 2023

@efiop, maybe I am not following the discussion here, but dvcfs.get() support concurrent downloads with batch_size.

@efiop
Copy link
Contributor Author

efiop commented May 7, 2023

@efiop, maybe I am not following the discussion here, but dvcfs.get() support concurrent downloads with batch_size.

Maybe I misunderstand something, but ‘dvcfs’ is based on a regular ‘AbstractFileSystem’ and so ‘get’ is just a loop, unlike in ‘async’ one. We don’t currently override ‘get’ there to do anything extra. We do parallelise object transfers, but that uses ‘get_file’ directly.

@shcheklein
Copy link
Member

Yes, the dependency resolution is lazy and will stop at the first place where it finds the data contents

I still don't get it. I'll try it I guess to better understand the current behavior.

The aim here is to have it all in one place specifically

I very well understand the goal. I don't think in this specific case this simplification justifies slowing things down for users (if that happens).

To be precise - I mean: dvc import <repo> large_directory use case. Again, I have no idea was it parallel before or not (it looks like it was) and if it's optimized / parallel now (it looks like it's not). I was just asking to confirm this.

@skshetry
Copy link
Member

skshetry commented May 7, 2023

Maybe I misunderstand something, but ‘dvcfs’ is based on a regular ‘AbstractFileSystem’ and so ‘get’ is just a loop, unlike in ‘async’ one. We don’t currently override ‘get’ there to do anything extra. We do parallelise object transfers, but that uses ‘get_file’ directly.

I think you are confusing between "DvcFileSystem" and dvcfs. dvcfs does download concurrently in a threadloop (which is based on our FileSystem implementation).

https://github.com/iterative/dvc-objects/blob/8ca9eeb34ee16aa89b4719591930f0a83321ee0c/src/dvc_objects/fs/base.py#L657-L660

@shcheklein
Copy link
Member

@skshetry thanks! a few things to check then:

  • do we pass the https://dvc.org/doc/command-reference/import#-j downstream properly (it should be this flag, right?)
  • curious how does it look like from the user perspective (I don't see immediately a callback or anything) - will it just print progress bard one on top of another? :)

@skshetry
Copy link
Member

skshetry commented May 7, 2023

@shcheklein, yes, we do pass the jobs, which goes here. And the self.fs is DvcFileSystem, which inherits from our own FileSystem which does support concurrent downloads.

def download(self, to, jobs=None):
fs_download(self.fs, self.fs_path, to, jobs=jobs)

class DVCFileSystem(FileSystem):

Here's how it looks like:

asciicast

Note: I cancelled it because it was taking too long. :)

The callbacks are lazy, so you won't see them unless they are really being downloaded. Or, may be the files are small, that the progress bar cleared before it rendered or immediately after being rendered.

@shcheklein
Copy link
Member

Yep, perfect. Thanks @skshetry !

@efiop
Copy link
Contributor Author

efiop commented May 7, 2023

Maybe I misunderstand something, but ‘dvcfs’ is based on a regular ‘AbstractFileSystem’ and so ‘get’ is just a loop, unlike in ‘async’ one. We don’t currently override ‘get’ there to do anything extra. We do parallelise object transfers, but that uses ‘get_file’ directly.

I think you are confusing between "DvcFileSystem" and dvcfs. dvcfs does download concurrently in a threadloop (which is based on our FileSystem implementation).

https://github.com/iterative/dvc-objects/blob/8ca9eeb34ee16aa89b4719591930f0a83321ee0c/src/dvc_objects/fs/base.py#L657-L660

@skshetry Yep, you are right! Thank you! I thought we had that in transfer, but indeed thread pool is in our fs class as well. So that's even better news then 🙂

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A: data-sync Related to dvc get/fetch/import/pull/push
Projects
None yet
Development

Successfully merging this pull request may close these issues.

cloud versioning: imports
5 participants