cloud versioning: update --no-download support #8653
Comments
@pmrowla Do you have an idea of what an example |
I was thinking something along the lines of:

    outs:
    - md5: a304afb96060aad90176268345e10355
      path: data.xml
      locations:
        s3://versioned-s3-bucket/data.xml:
          etag: abc123
          version_id: 1234
        azure://versioned-azure-bucket/data.xml:
          etag: def456
          version_id: 20220101T00:00:00
        ...
I think using URLs over remote names for the location keys works better here since etags and version IDs are filesystem/bucket specific. Using remote names would mean that the user can update the URL in the future if they wish to use a different remote, which is convenient for regular DVC (since the DVC md5 never changes, regardless of storage location). But in a cloud versioning context this doesn't actually help, since the version IDs and etags would not transfer between different storages. For dirs, it would be the same except each location entry would also need to have the corresponding If we had separate |
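To make the URL-keyed idea concrete, here is a minimal sketch of how a lookup against the proposed entry might behave. It assumes the entry has been loaded into a plain dict; `out_entry` and `version_info_for` are hypothetical names for illustration and are not part of DVC.

```python
from typing import Optional

# Hypothetical in-memory form of the proposed `outs` entry (not a DVC API).
out_entry = {
    "md5": "a304afb96060aad90176268345e10355",
    "path": "data.xml",
    "locations": {
        "s3://versioned-s3-bucket/data.xml": {
            "etag": "abc123",
            "version_id": "1234",
        },
        "azure://versioned-azure-bucket/data.xml": {
            "etag": "def456",
            "version_id": "20220101T00:00:00",
        },
    },
}


def version_info_for(out: dict, url: str) -> Optional[dict]:
    """Return the etag/version_id recorded for this exact URL, if any."""
    return out.get("locations", {}).get(url)


# The md5 is storage-independent, but version info only exists per URL.
known = version_info_for(out_entry, "s3://versioned-s3-bucket/data.xml")
moved = version_info_for(out_entry, "s3://some-other-bucket/data.xml")
print(known)  # the recorded etag and version_id for the S3 location
print(moved)  # None: version IDs do not carry over to a different bucket
```

Keying by URL means that pointing a remote at a different bucket simply produces no match, rather than silently reusing version IDs that are meaningless there.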
This feels to me like we are getting back towards #3920 and there may be a cleaner abstraction with something like repo-wide lock files per workspace/worktree, but I have to think about this some more |
@pmrowla I'm not sure I'm following what's unique here for worktree. Doesn't |
For imports we clear the entire |
Right, sorry, was getting mixed up and forgetting that these aren't |
@skshetry @pmrowla

    outs:
    - md5: a304afb96060aad90176268345e10355
      path: data.xml
      remote:
        myworktree:
          etag: abc123
          version_id: 1234
After discussion with @dberenbaum we decided to hold off on adding support for worktree
Due to the way cloud versioning ( What we really need is an actual abstraction for separate worktrees/workspaces with separate data-indexes for each possible worktree location (local vs remote), instead of the current representation where everything is a part of the local repo workspace (with the |
I'm not sure we can actually implement no-download right now with the way cloud versioning is currently implemented. It works for imports because the etag/version/url for the import source is stored separately as a dep, and we just remove the output information for the import stage entirely. The problem with supporting this for worktrees is that we have to collect both the cloud/remote etag/version and the local md5 (which requires downloading) for the newly updated output. I guess we will need to just clear the md5 field for worktree update --no-download and then have fetch account for that possibility.

I've been thinking about this a little bit and it seems like we probably need to change how we handle worktree remotes in the .dvc file. What we really have in this scenario is a local out, and then one or more external data locations, one for each remote. The external locations function like an import's deps entry, but in this case you could have multiple possible locations (if you configure multiple remotes), and we can also write (push) to the external location as well (so it's not a read-only dependency).
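A rough sketch of the "clear the md5, let fetch cope" idea, under the assumption that an out is a plain dict with a `locations` mapping as in the proposals in this thread. Both function names are hypothetical and this is not how DVC implements it.

```python
import copy


def update_no_download(out: dict, url: str, etag: str, version_id: str) -> dict:
    """Record new remote version info without downloading the data.

    The local md5 cannot be computed without the file contents, so it is
    cleared rather than left stale. The input dict is not modified.
    """
    updated = copy.deepcopy(out)
    updated.pop("md5", None)
    updated.setdefault("locations", {})[url] = {
        "etag": etag,
        "version_id": version_id,
    }
    return updated


def needs_fetch_by_version(out: dict) -> bool:
    """True when fetch has no md5 and must rely on the remote version info."""
    return "md5" not in out and bool(out.get("locations"))
```

With this shape, fetch would have to treat a missing md5 as "resolve by etag/version_id first, then compute and store the md5 after download".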
related: #8356
Originally posted by @pmrowla in #8649 (comment)