Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dvcfile: support remote per output #6486

Merged
merged 1 commit into from
Aug 27, 2021
Merged

dvcfile: support remote per output #6486

merged 1 commit into from
Aug 27, 2021

Conversation

efiop
Copy link
Contributor

@efiop efiop commented Aug 24, 2021

Basic support for specifying remotes per output in both dvc.yaml and *.dvc files. Only output level is supported (no stage or pipeline). Because this is an advanced feature, for now, this only supports setting it by hand, no special CLI flags (we could use --remote in dvc add in the future, hence why we have --remote and --to-remote there).

Examples:

$ cat data.dvc
outs:
  - md5: a304afb96060aad90176268345e10355
    path: data.xml
    remote: myremote
$ cat dvc.yaml
stages:
  transpose:
    cmd: ./trans.r rows.txt > columns.txt
    deps:
      - rows.txt
    outs:
      - columns.txt:
          remote: myremote

for push/pull/fetch/etc the order or priority is as follows:

  1. remote: per output from *.dvc/dvc.yaml
  2. --remote from CLI
  3. core.remote from config

Related to #2095

Docs: iterative/dvc.org#2761

@efiop efiop requested a review from a team as a code owner August 24, 2021 12:52
@efiop efiop requested a review from pmrowla August 24, 2021 12:52
@dberenbaum
Copy link
Collaborator

🔥

How does this work with other dvc push flags, and can we test those? Specifically:

  • dvc push -r: does -r override the file-specific remote?
  • dvc push --run-cache: what happens with files that need to be pushed to reconstruct results from the run-cache?

Comment on lines +852 to +853
remote = self.repo.cloud.get_remote_odb(self.remote)
self.repo.cloud.pull([obj.hash_info], odb=remote, **kwargs)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need support for better error messages, like where the remote has been specified in case of NoRemoteError or RemoteNotFound.

@@ -185,6 +188,7 @@ def load_from_pipeline(stage, data, typ="outs"):
Output.PARAM_CACHE,
Output.PARAM_PERSIST,
Output.PARAM_CHECKPOINT,
Output.PARAM_REMOTE,
Copy link
Member

@skshetry skshetry Aug 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would there be a need for a stage-level/pipeline-level remote?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe in the future, but not for now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay, I was just wondering if it makes sense in the stage-level, or should there be a different keyword there?

Copy link
Member

@skshetry skshetry Aug 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

push_to: <remote> or something else.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be that we'll need to support pushing/pulling granularly in the future, but for now this is a symmetrical. Interesting to note is that our imports are kinda like pull: orig_remote, push: false, since they are pulled from external repo remote but not pushed anywhere by default. So that might be a better way to handle #4581 (e.g. push: mybackup or just true for default remote). But that's beyond the scope for this PR. Or do you see where remote: might get problematic in the future?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I think of remote to be a data-management thing, so I was just wondering if it is still understandable in terms of stage concept or pipeline concept.

@efiop
Copy link
Contributor Author

efiop commented Aug 25, 2021

dvc push -r: does -r override the file-specific remote?

No, it doesn't. This is similar to how -r doesn't override remote for dvc import-ed artifacts. So the order is:

  1. remote: per output
  2. -r
  3. core.remote from config

dvc push --run-cache: what happens with files that need to be pushed to reconstruct results from the run-cache?

So we've talked about --run-cache before and it is a little special. It is only pushed/pulled with an explicit --run-cache flag, so there is less of a chance of accidentally pushing data. But still, this is a risk. The problem here is that regular pushing/pulling of artifacts and run-cache are really different and somewhat separate things, so I think that maybe what we really need here is a stage-level run-cache options to be able to turn off run-cache (similar to always_changed) explicitly, to specify if it needs to be pushed or if it is allowed to be pulled and stuff like that. E.g.

run-cache: false/true
run-cache:
    push: false/myremote/etc
    pull: false/myremote/etc   # could be useful for slow speeds or giant artifacts that are faster to generate locally. Currently if one uses `dvc repro --pull` , we will download artifacts no matter what, so this option could be useful.

etc. A bit worried about overengineering this one, need to think a bit more about it, but it is obvious that needs to be handled in a special way.

@shcheklein
Copy link
Member

@efiop could you please provide a brief summary in the description on how will it look like (or docs PR?)? Otherwise it makes it hard to review for folks who are not familiar with the code base.

@efiop
Copy link
Contributor Author

efiop commented Aug 25, 2021

@shcheklein Sure, this is WIP still, so I didn't add it yet. Will do.

@@ -326,7 +332,7 @@ def __init__(
self.obj = None
self.isexec = False if self.IS_DEPENDENCY else isexec

self.def_remote = None
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a leftover that haven't been used at all for a while.

@@ -81,6 +81,7 @@ def loadd_from(stage, d_list):
desc = d.pop(Output.PARAM_DESC, False)
isexec = d.pop(Output.PARAM_ISEXEC, False)
live = d.pop(Output.PARAM_LIVE, False)
remote = d.pop(Output.PARAM_REMOTE, None)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding new entries here and in schema is getting crazy these days, we'll need to finally revisit this and parsing/schema in the future.

@efiop efiop changed the title [WIP] dvcfile: support remote per output dvcfile: support remote per output Aug 26, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature is a feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants