
import-url/import: follow ups for cloud versioning #8464

Closed
2 of 3 tasks
dberenbaum opened this issue Oct 24, 2022 · 15 comments
Assignees
Labels
A: cloud-versioning Related to cloud-versioned remotes p1-important Important, aka current backlog of things to do

Comments

@dberenbaum
Collaborator

dberenbaum commented Oct 24, 2022

@dberenbaum dberenbaum added p2-medium Medium priority, should be done, but less important A: cloud-versioning Related to cloud-versioned remotes labels Oct 24, 2022
@dberenbaum dberenbaum changed the title import-url: follow ups for cloud versioning import-url/import: follow ups for cloud versioning Nov 17, 2022
@dberenbaum
Collaborator Author

  • For dvc update we can probably get around downloading when it hasn't changed

@pmrowla Thoughts on this? It would be nice to have for the initial release, since redownloading everything on every update is impractical in any real scenario.
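The skip-if-unchanged idea can be sketched as below. This is a hypothetical illustration, not DVC's actual implementation: needs_download, stored_versions, and remote_versions are made-up names standing in for the version IDs DVC records locally and the IDs currently reported by the cloud provider.

```python
# Hypothetical sketch: on `dvc update`, download only files whose cloud
# version ID differs from the one recorded locally.
def needs_download(stored_versions, remote_versions):
    """Return the subset of remote files whose version ID changed."""
    return {
        path: vid
        for path, vid in remote_versions.items()
        if stored_versions.get(path) != vid
    }

stored = {"data/a.csv": "v1", "data/b.csv": "v2"}
remote = {"data/a.csv": "v1", "data/b.csv": "v3"}
print(needs_download(stored, remote))  # {'data/b.csv': 'v3'}
```

With this kind of comparison, an unchanged dataset results in an empty download set and the update can be reported as a no-op.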

@dberenbaum dberenbaum added p1-important Important, aka current backlog of things to do and removed p2-medium Medium priority, should be done, but less important labels Jan 5, 2023
@dberenbaum
Collaborator Author

@karajan1001 @pmrowla There was a regression that was making update download unchanged datasets even for regular remotes that was fixed in #8752. For cloud-versioned remotes, update still downloads unchanged datasets even after that fix.

@karajan1001 karajan1001 self-assigned this Jan 17, 2023
@dberenbaum
Collaborator Author

We should also check that data added to a worktree remote doesn't redownload on dvc update if nothing changed.

@dberenbaum
Collaborator Author

For update after import-url without cloud versioning, here's what I see:

import-url-update.mov

For update after import-url --version-aware, I see that everything gets downloaded again:

import-url-version-aware-update.mov

For a non-imported dataset pushed to a worktree remote, it looks like nothing gets downloaded, but it would be nice to have this be more explicit and match the UI of the first video:

worktree-add-push-update.mov

@dberenbaum
Collaborator Author

@karajan1001 Any updates on this?

@karajan1001
Contributor

karajan1001 commented Jan 31, 2023

@karajan1001 Any updates on this?

I'm sorry, I'm still debugging what's wrong with my S3 setup. Maybe I should switch to Azure or GCP.

@karajan1001
Contributor

karajan1001 commented Feb 2, 2023

@dberenbaum, @daavoo A new situation: I found that the duplicated downloading problem had already been solved in #8849, but after #8882 it became terribly slow (it takes several minutes to finish). Maybe I should make the performance problem the next piece of work, but it would be better to open a new PR for it.

As for:

For a non-imported dataset pushed to a worktree remote, it looks like nothing gets downloaded, but it would be nice to have this be more explicit and match the UI of the first video:

it still exists.

@daavoo
Contributor

daavoo commented Feb 2, 2023

but after #8882 it became terribly slow (takes several minutes to finish).

@karajan1001 I'm not sure what "it" refers to in that phrase. dvc update after dvc import-url --version-aware?

Could you share the profile from:

pip install viztracer
dvc update --viztracer --viztracer-depth 12

@karajan1001
Contributor

karajan1001 commented Feb 2, 2023

@karajan1001 not sure what is it in the phrase. dvc update after dvc import-url --version-aware ?

Yes, as well as import itself, but maybe it is caused by a network problem on my side.

viztracer.dvc-20230202_225242.json.tar.gz

It looks like the time is mostly spent waiting for responses from the S3 server. The output is:

dvc update --viztracer --viztracer-depth 12 version
'version.dvc' didn't change, skipping
Loading finish
Total Entries: 1000005

Circular buffer is full, you lost some early data, but you still have the most recent data.
    If you need more buffer, use "viztracer --tracer_entries <entry_number>"
    Or, you can try the filter options to filter out some data you don't need
    use --quiet to shut me up

Use the following command to open the report:
vizviewer /Users/gao/test/cloud_versioning/viztracer.dvc-20230202_231153.json

karajan1001 added a commit to karajan1001/dvc that referenced this issue Feb 3, 2023
fix: iterative#8464
1. Add some UI message in `dvc update` for unchanged data pushed to some
worktree remotes
@dberenbaum
Collaborator Author

@karajan1001 I'm not seeing a major performance difference, but could you and @daavoo debug and make sure it's not a problem? We can close the issue once that's addressed.

@karajan1001
Contributor

A big difference is that in the new version of dvc-data's get method, there is now a find call inside it.

(profiler screenshots attached)

I found today that the downloading problem appears again even in the latest version; it looks like #8849 only partially solved it. A more detailed investigation is needed.

@pmrowla
Contributor

pmrowla commented Feb 8, 2023

The issue of update after import-url --version-aware downloading files when nothing has changed should be fixed in main (the underlying cause was also related to the import-url --no-download --version-aware bug).

@karajan1001
Contributor

karajan1001 commented Feb 8, 2023

I can confirm that the redownloading problem has been solved, but the performance problem still exists.

@dberenbaum
Collaborator Author

@karajan1001 Are there still performance issues to investigate? Otherwise, let's close this issue.

@karajan1001
Contributor

@karajan1001 Are there still performance issues to investigate? Otherwise, let's close this issue.

The performance drop happened because, in the previous version, we didn't treat the S3 filesystem as a version_aware filesystem; #8882 fixed that bug, but we still have performance problems with a version_aware S3FileSystem. Since it only affects users under particular network conditions, I recommend we close this one and open a new p2 issue for the performance problem.
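A back-of-the-envelope model of why the version_aware path is slower on high-latency links (hypothetical numbers, not DVC internals; the one-listing-call assumption for plain remotes is a simplification): a plain remote can verify a directory with roughly one listing call, while a version-aware remote resolves a version ID per file, so the request count, and with it the total latency, scales with the file count.

```python
# Toy cost model (not DVC code): milliseconds spent on request round
# trips when checking whether an imported dataset has changed.
def check_cost_ms(n_files, rtt_ms, version_aware):
    # Plain remote: ~one listing request for the whole directory.
    # Version-aware remote: ~one version lookup per file.
    requests = n_files if version_aware else 1
    return requests * rtt_ms

print(check_cost_ms(2000, 100, version_aware=False))  # 100
print(check_cost_ms(2000, 100, version_aware=True))   # 200000
```

Under this model, 2000 files over a 100 ms round trip take minutes on the version-aware path, which matches the symptom being only visible under particular network conditions.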
