import-url/import: follow ups for cloud versioning #8464
Comments
@pmrowla Thoughts on this? This would be nice to have for initial release, since redownloading on every update seems unrealistic in any real scenario. |
@karajan1001 @pmrowla There was a regression that made update download unchanged datasets even for regular remotes; it was fixed in #8752. For cloud-versioned remotes, update still downloads unchanged datasets even after that fix. |
We should also check that data added to a worktree remote doesn't redownload on |
For import-url-update.mov
For import-url-version-aware-update.mov
For a non-imported dataset pushed to a worktree-add-push-update.mov |
@karajan1001 Any updates on this? |
I'm sorry, I'm still debugging what's wrong with my S3 setup. Maybe I should switch to Azure or GCP. |
@dberenbaum, @daavoo An update: I found that the duplicate-download problem had already been solved in #8849, but after #8882 it became terribly slow (it takes several minutes to finish). I should probably work on the performance problem next, but it would be better to open a new PR for it. And for
It still exists. |
@karajan1001 not sure what is Could you share the profile from:
|
Yes, as well as viztracer.dvc-20230202_225242.json.tar.gz. It looks like the time is mostly spent waiting for responses from the
dvc update --viztracer --viztracer-depth 12 version
'version.dvc' didn't change, skipping
Loading finish
Total Entries: 1000005
Circular buffer is full, you lost some early data, but you still have the most recent data.
If you need more buffer, use "viztracer --tracer_entries <entry_number>"
Or, you can try the filter options to filter out some data you don't need
use --quiet to shut me up
Use the following command to open the report:
vizviewer /Users/gao/test/cloud_versioning/viztracer.dvc-20230202_231153.json |
fix: iterative#8464 1. Add some UI message in `dvc update` for unchanged data pushed to some worktree remotes
@karajan1001 I'm not seeing a major performance difference, but could you and @daavoo debug and make sure it's not a problem? We can close the issue once that's addressed. |
A big difference is that in the new version of
I found today that the download problem appears again even in the latest version; it looks like #8849 only partially solved it. This needs a more detailed investigation. |
The |
I can confirm that the redownloading problem has been solved, but the performance problem still exists. |
@karajan1001 Are there still performance issues to investigate? Otherwise, let's close this issue. |
The performance drop happened because, in the previous version, we didn't treat the s3 fs as a version_aware fs. After #8882 this bug was fixed, but we still have performance problems on a |
- [ ] dvc update: we can probably get around downloading when it hasn't changed, but that's a separate feature request I think (import-url: use dvc-data index.save() for fetching imports #8249 (comment))
- [ ] For chained import-url imports I think it will not work right now unless the file has been pushed to a DVC remote, but when we get to refactoring regular cached objects to use dvc-data index for push/pull (so everything is unified to use the dvc-data index) this kind of chaining (from original source location) should work out of the box. (import-url: use dvc-data index.save() for fetching imports #8249 (comment)) Moved to cloud versioning: imports #8789
- [ ] Add support to import/get from another repo where that repo's remote is cloud-versioned. Moved to cloud versioning: imports #8789
dvc import data to remote when remote is configured with worktree = true. (worktree: support push: false #8581)
push: false). (worktree push: do not push existing versions #8606)
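The first follow-up above (skipping the download when the source has not changed) could look roughly like the sketch below. This is only an illustration of the idea, not DVC's implementation: `ImportSource`, `needs_download`, and the `version_id` and `etag` fields are hypothetical names invented for this example. It assumes that a cloud-versioned remote reports some version identifier for each object and that the import stage recorded the identifier it saw last time.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ImportSource:
    """Hypothetical record of what an import last saw at its source."""

    path: str
    version_id: Optional[str] = None  # assumed to be set for cloud-versioned remotes
    etag: Optional[str] = None  # assumed fallback change marker


def needs_download(recorded: ImportSource, current: ImportSource) -> bool:
    """Return True when the source has to be fetched again."""
    if recorded.path != current.path:
        return True
    # Prefer the version ID when both sides report one.
    if recorded.version_id is not None and current.version_id is not None:
        return recorded.version_id != current.version_id
    # Otherwise fall back to comparing etags.
    if recorded.etag is not None and current.etag is not None:
        return recorded.etag != current.etag
    # Nothing comparable is available, so download to be safe.
    return True
```

The check is deliberately conservative: when neither marker can be compared it asks for a download, so a missing version ID never causes a stale import to be kept.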