update: only download incremental changes #8808

Open
dberenbaum opened this issue Jan 13, 2023 · 4 comments
Labels
A: data-sync Related to dvc get/fetch/import/pull/push p2-medium Medium priority, should be done, but less important

Comments

@dberenbaum
Collaborator

dvc update will re-download all items in a directory if any of them change. See the comment below:

Stage 1 imports from the cloud and operates locally, so as long as we run dvc update it will be fine. As a side note, from the logs it looks like it always downloads all the files when we do the update (I would expect it to download only what is new). Because of this, even if there are no new files, stage 1 will always run after an update.

Originally posted by @rmlopes in #8759 (comment)

@dberenbaum dberenbaum added p2-medium Medium priority, should be done, but less important A: data-sync Related to dvc get/fetch/import/pull/push labels Jan 13, 2023
@pragadeeshraju

It would help a lot if dvc update downloaded only the incremental changes rather than the entire dataset. This would be really helpful for managing a large dataset when using dvc as a data registry.

@dberenbaum
Collaborator Author

Note that this is a problem for both import and import-url.

@dberenbaum
Collaborator Author

Coming from #9385, it looks like import (unlike import-url) used to download only incremental changes, so bumping the priority to fix this at least for import.

@efiop
Contributor

efiop commented May 6, 2023

IIRC import-url always behaved that way, because we didn't have a means of comparing things remotely without downloading them one way or another (e.g. to compute md5s). Now we could compare indexes and download only the difference (now possible with index diff), but I'll need to take a closer look.

import was only downloading changes by coincidence (just because it knows md5s for some files), but in general we would re-download large git files if you happened to import a git/mixed directory.
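
To make the index-diff idea above concrete, here is a minimal sketch in plain Python. It is not DVC's actual implementation or API: `diff_indexes`, `paths_to_download`, and the sample paths and tokens are all hypothetical. It assumes each index is a mapping from a relative path to some change token (an md5, ETag, or version ID) that can be obtained by listing the remote without downloading file contents.

```python
from typing import Dict, List, NamedTuple


class IndexDiff(NamedTuple):
    """Result of comparing two path -> token mappings (hypothetical type)."""
    added: List[str]
    modified: List[str]
    deleted: List[str]
    unchanged: List[str]


def diff_indexes(old: Dict[str, str], new: Dict[str, str]) -> IndexDiff:
    """Classify every path by comparing its token in the old and new index."""
    added = sorted(p for p in new if p not in old)
    modified = sorted(p for p in new if p in old and new[p] != old[p])
    deleted = sorted(p for p in old if p not in new)
    unchanged = sorted(p for p in new if p in old and new[p] == old[p])
    return IndexDiff(added, modified, deleted, unchanged)


def paths_to_download(old: Dict[str, str], new: Dict[str, str]) -> List[str]:
    """Only added and modified paths need to be fetched on update."""
    diff = diff_indexes(old, new)
    return sorted(diff.added + diff.modified)


# Hypothetical indexes: the one recorded at import time and a fresh remote listing.
old_index = {"data/a.csv": "tok-1", "data/b.csv": "tok-2", "data/c.csv": "tok-3"}
new_index = {"data/a.csv": "tok-1", "data/b.csv": "tok-9", "data/d.csv": "tok-4"}

to_fetch = paths_to_download(old_index, new_index)
print(to_fetch)
```

With these sample indexes only `data/b.csv` and `data/d.csv` would be fetched, while `data/a.csv` is left alone and `data/c.csv` is flagged as deleted. The git/mixed directory case mentioned above is exactly where this breaks down if no token is known for a file: without something to compare, the file has to be treated as modified.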
