
cloud versioning: external outputs #8411

Closed
dberenbaum opened this issue Oct 7, 2022 · 7 comments
Labels
A: cloud-versioning Related to cloud-versioned remotes p2-medium Medium priority, should be done, but less important

Comments

@dberenbaum
Collaborator

For a cloud bucket/container with versioning turned on, external outputs can work like cloud-versioned remotes to simplify the workflow. No external cache is needed, since a cloud-versioned external output path can serve as its own remote/cache. DVC should only need to collect the version info to store in .dvc or dvc.lock. This can also make external outputs much more performant, because no copies or MD5 computations are needed.
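For illustration, the recorded version info might look something like the sketch below (the field names are assumptions for illustration, not DVC's actual .dvc schema):

```yaml
# Hypothetical .dvc entry for a cloud-versioned external output.
# No MD5 of the data and no local cache entry are needed: the bucket's
# own version ID pins the content.
outs:
- path: s3://bucket/data/file.csv
  size: 102400
  version_id: 3sL4kqdoNRxq8Vb3zYj0M5lGb9   # example S3 version ID
```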

Question: Should dvc checkout update the current versions of external outputs? This should probably be configurable, but it can be left for a follow-up task.

Out of scope: Isolating external output paths from race conditions and conflicts when multiple users write to the same paths. Let's improve the current functionality before worrying about this; we could address it later either with something like #3920 or by adding locks on the cloud.

@dberenbaum added the A: cloud-versioning, p1-important, and p2-medium labels and removed the p1-important label on Oct 7, 2022
@dberenbaum
Collaborator Author

To clarify, this is mostly needed in dvc.lock for cloud-focused pipelines. For data management without pipelines, import-url [--no-download] can already be used in most scenarios to add data from the cloud.

@dberenbaum
Collaborator Author

@pmrowla Let's leave this out of scope for now. I think we can come back to it after we release the initial cloud versioning features.

@dberenbaum added the p3-nice-to-have label and removed the p2-medium label on Nov 17, 2022
@dberenbaum added the p2-medium label and removed the p3-nice-to-have label on Jan 11, 2023
@omesser
Contributor

omesser commented Jan 16, 2023

Hey @dberenbaum - for Q1 2023, my understanding is that the scope is as follows:

must have

  • collect the version info of external outputs to store in .dvc or dvc.lock

nice to have / followups

  • dvc checkout to update versions of external outputs - configurable (non-default?)
  • dealing with multi-user race conditions

Correct me if I got anything wrong or if some things are missing / changed / need clarification.

@dberenbaum
Collaborator Author

Some questions/thoughts:

  • Is any flag needed in dvc stage add to use cloud versioning for an output, or should this be handled automatically?
  • Should it be possible to download/pull the output?
  • What should checkout do?
  • What fields (if any) should be added to the dvc.yaml for a cloud-versioned output?

@dberenbaum added the p3-nice-to-have label and removed the p2-medium label on Jan 20, 2023
@dberenbaum
Collaborator Author

Lowering priority again since it may overlap with DQL work

@dberenbaum added the p1-important and discussion labels and removed the p3-nice-to-have label on May 24, 2023
@dberenbaum
Collaborator Author

I'd like to come back to this since we are deprecating external outputs, and this could help fill the gap.

Ideas for the design:

  1. Add a stage and set version_aware: true for cloud-versioned outputs. As a nice to have, we could add a flag like dvc stage add ... --outs-version-aware s3://....
  2. When the stage is run, DVC can capture the cloud version ID of the latest version in dvc.lock. The file itself doesn't get cached, pushed, pulled, or checked out anywhere by DVC.
  3. If a dvc.lock entry already exists when the stage is run, DVC can find that version of the file. If it exists, DVC doesn't do anything (consider it cached, but don't "checkout" anywhere). If it doesn't exist, DVC treats it like it's missing and tries to reproduce it.

An alternative for step 3 is that if the version ID is found but isn't the current version, it is copied to the latest version. However, this gets back towards previous problems with external outputs and cloud-versioned workspace remotes.
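To make step 2 concrete, a dvc.lock entry after running such a stage might look roughly like this (a sketch only; the exact field names and layout are assumptions, not DVC's settled schema):

```yaml
stages:
  prep:
    cmd: python prep.py
    deps:
    - path: prep.py
      md5: 0123456789abcdef0123456789abcdef   # placeholder hash
    outs:
    - path: s3://bucket/data
      version_aware: true
      version_id: pZx9XKmQ5nB7vC2d            # example version ID captured at run time
```

On a later run, DVC would compare the stored version_id against the bucket's current version to decide whether anything changed.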


A follow-up phase (out of scope here) could be to use these cloud-versioned outputs downstream. For example, assume a dvc.yaml like:

```yaml
stages:
  prep:
    cmd: python prep.py
    deps:
      - prep.py
    outs:
      - s3://bucket/data:
          version_aware: true
  train:
    cmd: python train.py
    deps:
      - train.py
      - s3://bucket/data
    outs:
      - model.pkl
```

In this case, I have s3://bucket/data in 2 places:

  1. As an out of stage prep
  2. As a dep of stage train

I always want the version IDs of these to match (maybe we can have some kind of dep that explicitly uses the output version?), and I want to make sure train.py reads in the expected version (this should work if using the dvc api to read the file).

This was referenced Jun 1, 2023
@dberenbaum
Collaborator Author

> 3. If a dvc.lock entry already exists when the stage is run, DVC can find that version of the file. If it exists, DVC doesn't do anything (consider it cached, but don't "checkout" anywhere). If it doesn't exist, DVC treats it like it's missing and tries to reproduce it.
>
> An alternative for step 3 is that if the version ID is found but isn't the current version, it is copied to the latest version. However, this gets back towards previous problems with external outputs and cloud-versioned workspace remotes.

Even though the "alternative" potentially has problems with conflicts, I think we should warn users and do this (similar to the behavior of normal outputs and previous external outputs). It's the simplest approach. When checking whether a stage has changed, dvc can match the hash info we have for the versions in dvc.lock against the current versions in cloud. If they match, do nothing. If they don't match and the stage is cached, restore those versions. Otherwise, run the stage.
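The decision logic described here can be sketched in a few lines (the function name and action strings below are hypothetical, not real DVC internals):

```python
# Sketch of the proposed stage-change check for one cloud-versioned output.
def decide_action(locked_version_id: str, current_version_id: str,
                  stage_is_cached: bool) -> str:
    """Return what DVC would do for a cloud-versioned output when a stage runs."""
    if locked_version_id == current_version_id:
        return "skip"      # versions match: nothing to do
    if stage_is_cached:
        return "restore"   # restore the locked version as latest (warn about conflicts)
    return "run"           # otherwise treat the stage as changed and run it
```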

> A follow-up phase (out of scope here) could be to use these cloud-versioned outputs downstream.

We can ignore all of this if we use this approach.

@dberenbaum removed the discussion label on Jul 24, 2023
@dberenbaum added the p2-medium label on Sep 5, 2023
@dberenbaum removed the p1-important label on Sep 5, 2023
@dberenbaum closed this as not planned (won't fix, can't repro, duplicate, stale) on Mar 25, 2024