Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cloud versioning: update --no-download support #8653

Closed
dberenbaum opened this issue Dec 2, 2022 · 8 comments
Closed

cloud versioning: update --no-download support #8653

dberenbaum opened this issue Dec 2, 2022 · 8 comments
Labels
A: cloud-versioning Related to cloud-versioned remotes p2-medium Medium priority, should be done, but less important

Comments

@dberenbaum
Copy link
Collaborator

    > * `--no-download`: It should update the .dvc file but not download locally (seems to ignore the flag right now).

I'm not sure we can actually implement no-download right now with the way cloud versioning is currently implemented. It works for imports because the etag/version/url for the import source is stored separately as a dep, and we just completely remove the output information for the import stage entirely. The problem with supporting this for worktrees is that we have to collect both the cloud/remote etag/version and the local md5 (which requires downloading) for the newly updated output. I guess we will need to just clear the md5 field for worktree update --no-download and then have fetch account for that possibility.

I've been thinking about this a little bit and it seems like we probably need to change how we handle worktree remotes in the .dvc file. What we really have in this scenario is a local out, and then one more external data locations for each remote. The external locations function like an import's deps entry, but in this case you could have multiple possible locations (if you configure multiple remotes), and we can also write (push) to the external location as well (so it's not a read-only dependency).

related: #8356

Originally posted by @pmrowla in #8649 (comment)

@dberenbaum dberenbaum added A: cloud-versioning Related to cloud-versioned remotes p2-medium Medium priority, should be done, but less important labels Dec 2, 2022
@dberenbaum
Copy link
Collaborator Author

@pmrowla Do you have an idea of what an example .dvc file should look like?

@pmrowla
Copy link
Contributor

pmrowla commented Dec 5, 2022

I was thinking something along the lines of

outs:
  - md5: a304afb96060aad90176268345e10355
    path: data.xml
    locations:
      s3://versioned-s3-bucket/data.xml:
        etag: abc123
        version_id: 1234
      azure://versioned-azure-bucket/data.xml:
        etag: def456
        version_id: 20220101T00:00:00
      ...
  • On dvc add (when data has been modified), the main output entry would be updated with a new md5, and locations would be cleared
  • On dvc push, a new locations entry would be appended (for each unique bucket we have pushed to)
  • On dvc fetch/pull, we would fetch from the appropriate location
  • On dvc update --no-download, we would clear the output's md5, update the appropriate location, and clear all other location entries
    • Once the update is actually downloaded (either via update without --no-download or on a subsequent dvc fetch, we would update the output md5

I think using URLs over remote names for the location keys works better here since etags and version IDs are filesystem/bucket specific. Using remote names would mean that the user can update the URL in the future if they wish to use a different remote, which is convenient for regular DVC (since the DVC md5 never changes, regardless of storage location). But in a cloud versioning context this doesn't actually help, since the version IDs and etags would not transfer between different storages.

For dirs, it would be the same except each location entry would also need to have the corresponding files: listing (meaning this .dvc file would get very ugly with each additional location)

If we had separate .dvc/dvc.lock files, locations would go in the lock file

@pmrowla
Copy link
Contributor

pmrowla commented Dec 5, 2022

This feels to me like we are getting back towards #3920 and there may be a cleaner abstraction with something like repo-wide lock files per workspace/worktree, but I have to think about this some more

@dberenbaum
Copy link
Collaborator Author

The problem with supporting this for worktrees is that we have to collect both the cloud/remote etag/version and the local md5 (which requires downloading) for the newly updated output.

@pmrowla I'm not sure I'm following what's unique here for worktree. Doesn't --no-download always involve clearing the output?

@pmrowla
Copy link
Contributor

pmrowla commented Jan 15, 2023

For imports we clear the entire outs entry on update --no-download (other than the relative path where we are importing to in our DVC repo), but we still generate an entire dep entry.
For worktrees we will need to add proper support for incomplete outs, since we store the cloud etag/URL as part of the output for worktree remotes (and not as a completely separate dep like we do for imports).

@dberenbaum
Copy link
Collaborator Author

Right, sorry, was getting mixed up and forgetting that these aren't import-url but instead dvc add scenarios.

@dberenbaum
Copy link
Collaborator Author

dberenbaum commented Jan 16, 2023

@skshetry @pmrowla If we don't actually need multiple remotes (edit: I guess this could also support multiple remotes, but it's not a requirement), can we reuse the remote field here and have the output look like this:

outs:
  - md5: a304afb96060aad90176268345e10355
    path: data.xml
    remote:
      myworktree:
        etag: abc123
        version_id: 1234

@dberenbaum dberenbaum added p1-important Important, aka current backlog of things to do and removed p2-medium Medium priority, should be done, but less important labels Jan 25, 2023
@pmrowla pmrowla self-assigned this Feb 1, 2023
@pmrowla
Copy link
Contributor

pmrowla commented Feb 3, 2023

After discussion with @dberenbaum we decided to hold off on adding support for worktree update --no-download in the initial release

For imports we clear the entire outs entry on update --no-download (other than the relative path where we are importing to in our DVC repo), but we still generate an entire dep entry. For worktrees we will need to add proper support for incomplete outs, since we store the cloud etag/URL as part of the output for worktree remotes (and not as a completely separate dep like we do for imports).

Due to the way cloud versioning (worktree remotes specifically) is currently handled in DVC, update --no-download requires generating partial outputs that are missing md5's for files which we do not download. This causes problems on trying to actually implement fetch/pull, since pull/checkout all expect Output to have valid hash infos for all files in the output. We are able to work around this with import-url --no-download because there is at least a clear separation between dependencies and outputs, and the "unpartial" step is essentially a layer between the dep and output (and we really just generate a new output which is then passed into the pull/checkout step. For worktree we can't do the same thing, since everything is being done in-place within a single output.

What we really need is an actual abstraction for separate worktrees/workspaces with separate data-indexes for each possible worktree location (local vs remote), instead of the current representation where everything is a part of the local repo workspace (with the cloud metadata tacked on as an afterthought). This is also related to the same reasons we can't properly support multiple remotes/worktrees at the moment (see #8356, iterative/dvc-data#280 (comment))

@pmrowla pmrowla added p2-medium Medium priority, should be done, but less important and removed p1-important Important, aka current backlog of things to do labels Feb 3, 2023
@dberenbaum dberenbaum closed this as not planned Won't fix, can't repro, duplicate, stale Feb 3, 2023
@pmrowla pmrowla removed their assignment Feb 4, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A: cloud-versioning Related to cloud-versioned remotes p2-medium Medium priority, should be done, but less important
Projects
No open projects
Archived in project
Development

No branches or pull requests

2 participants