cloud versioning: external outputs #8411
Comments
To clarify, this is mostly needed in |
@pmrowla Let's leave this out of scope for now. I think we can come back to it after we release the initial cloud versioning features. |
Hey @dberenbaum - for Q1 2023 - my understanding, scope is as follows: must have
nice to have / followups
Correct me if I got it wrong or some things missing / changed / need clarification |
Some questions/thoughts:
|
Lowering priority again since it may overlap with DQL work |
I'd like to come back to this since we are deprecating external outputs, and this could help fill the gap. Ideas for the design:
An alternative for step 3 is that if the version ID is found but isn't the current version, it is copied to become the latest version. However, this gets back towards previous problems with external outputs and cloud-versioned workspace remotes.
A follow-up phase (out of scope here) could be to use these cloud-versioned outputs downstream. For example, assume the following stages:
stages:
  prep:
    cmd: python prep.py
    deps:
      - prep.py
    outs:
      - s3://bucket/data:
          version_aware: true
  train:
    cmd: python train.py
    deps:
      - train.py
      - s3://bucket/data
    outs:
      - model.pkl
In this case, I have s3://bucket/data in 2 places:
I always want the version IDs of these to match (maybe we can have some kind of dep that explicitly uses the output version?), and I want to make sure |
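As a rough sketch of the "version IDs should match" requirement, the check below compares what a downstream stage recorded for a dep against what the producing stage recorded for the same path as an out. The lock layout (stage name, then `outs`/`deps`, then path, then a map of relative path to version ID) and the name `mismatched_deps` are assumptions made for illustration; they are not DVC's actual lock format or API.

```python
from typing import Dict, List, Tuple

VersionMap = Dict[str, str]  # relative path -> version ID


def mismatched_deps(lock: Dict[str, dict]) -> List[Tuple[str, str]]:
    """Return (stage, path) pairs whose dep versions differ from the producing out."""
    # Index every output path by the versions its producing stage recorded.
    produced: Dict[str, VersionMap] = {}
    for stage in lock.values():
        produced.update(stage.get("outs", {}))

    mismatches = []
    for name, stage in lock.items():
        for path, versions in stage.get("deps", {}).items():
            # Only deps that are also some stage's output can be compared.
            if path in produced and produced[path] != versions:
                mismatches.append((name, path))
    return sorted(mismatches)


# Hypothetical lock data: train still points at an older version of a.csv.
lock = {
    "prep": {"outs": {"s3://bucket/data": {"a.csv": "v2"}}},
    "train": {"deps": {"s3://bucket/data": {"a.csv": "v1"}}},
}
stale = mismatched_deps(lock)
```

Here `stale` flags the `train` stage, since its recorded dep version no longer matches the version that `prep` produced.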
Even though the "alternative" potentially has problems with conflicts, I think we should warn users and do this (similar to the behavior of normal outputs and previous external outputs). It's the simplest approach. When checking whether a stage has changed, dvc can match the hash info we have for the versions in
We can ignore all of this if we use this approach. |
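To make the proposed behavior concrete, here is a minimal sketch of the change check and the "copy to latest" decision. It assumes the recorded state and the bucket state are plain mappings from object key to version ID; `stage_outputs_changed`, `plan_checkout`, the action names, and the extra "missing" case are all hypothetical and not part of DVC.

```python
from typing import Dict, List, Set, Tuple


def stage_outputs_changed(recorded: Dict[str, str], latest: Dict[str, str]) -> bool:
    """True if the set of keys or any latest version ID differs from what was recorded."""
    return recorded != latest


def plan_checkout(
    recorded: Dict[str, str],
    latest: Dict[str, str],
    existing_versions: Dict[str, Set[str]],
) -> List[Tuple[str, str, str]]:
    """Decide, per recorded key, what a checkout would need to do."""
    actions = []
    for key, version_id in sorted(recorded.items()):
        if latest.get(key) == version_id:
            # The recorded version is already the current one.
            actions.append(("keep", key, version_id))
        elif version_id in existing_versions.get(key, set()):
            # The version exists but is not current: copy it so it becomes the latest.
            actions.append(("copy_to_latest", key, version_id))
        else:
            # The recorded version cannot be found in the bucket at all.
            actions.append(("missing", key, version_id))
    return actions


# Made-up state for illustration.
recorded = {"data/a.csv": "v1", "data/b.csv": "v7", "data/c.csv": "v9"}
latest = {"data/a.csv": "v2", "data/b.csv": "v7"}
existing = {"data/a.csv": {"v1", "v2"}, "data/b.csv": {"v7"}}

changed = stage_outputs_changed(recorded, latest)
plan = plan_checkout(recorded, latest, existing)
```

Note that a "copy_to_latest" action overwrites whatever another user last wrote to that key, which is the conflict the warning mentioned above would need to cover.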
For a cloud bucket/container with versioning turned on, external outputs can work like cloud-versioned remotes to simplify the workflow. No external cache is needed since a cloud-versioned external output path can serve as its own remote/cache. DVC should only need to collect the version info to store in .dvc or dvc.lock. This also can make external outputs much more performant because no copies or md5s are needed.
Question: Should dvc checkout update the current versions of external outputs? This should probably be configurable, but it can be left for a follow-up task.
Out of scope: Isolation of external output paths from race conditions and conflicts when multiple users are writing to the same paths. Let's improve the current functionality before worrying about this, which we could address either with something like #3920 or adding locks on the cloud.