dvc.yaml: parametrize imports #5058

Open
tillbier opened this issue Dec 8, 2020 · 8 comments
Labels
A: templating (Related to the templating feature), feature request (Requesting a new feature), p2-medium (Medium priority, should be done, but less important)

Comments

@tillbier

tillbier commented Dec 8, 2020

When importing dependencies via the dvc import feature, I want to be able to set, for a stage in dvc.yaml, the revision from which the dependency should be pulled, so that I can change the revision through the parametrization feature.

This could look like the following:

dvc.yaml

vars:
  - params.yaml

stages:
  train:
    cmd: ./trainModel.sh
    deps:
    - images:
        revision: ${training.images.version}
    outs:
    - model.hdf5
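
To make the requested behaviour concrete, here is a minimal sketch of how a placeholder such as `${training.images.version}` could be resolved against values loaded from params.yaml. It is not DVC's implementation: the `lookup` and `resolve` helpers and the sample value `v1.2.0` are hypothetical, and the params are written as an already-parsed dict so the example needs nothing beyond the standard library.

```python
import re

# Hypothetical contents of params.yaml, already parsed into a nested dict.
params = {"training": {"images": {"version": "v1.2.0"}}}

# Matches ${some.dotted.key} and captures the key.
PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")


def lookup(dotted_key, values):
    """Walk a nested dict using a dotted key such as 'training.images.version'."""
    node = values
    for part in dotted_key.split("."):
        node = node[part]  # raises KeyError if the key is missing
    return node


def resolve(text, values):
    """Replace every ${a.b.c} placeholder in text with its value from values."""
    return PLACEHOLDER.sub(
        lambda match: str(lookup(match.group(1).strip(), values)), text
    )


resolved_rev = resolve("${training.images.version}", params)
print(resolved_rev)
```

Changing `training.images.version` in the params would then be enough to point the dependency at a different revision, which is the workflow described above.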
@efiop added the A: templating label Dec 8, 2020
@efiop
Contributor

efiop commented Dec 8, 2020

We are thinking about moving imports to dvc.yaml too #4841 , but so far we are thinking about a format that looks something like:

vars:
  - params.yaml

imports:
- images:
    repo: https://github.com/iterative/dvc
    path: data
    rev: ${training.images.version}

stages:
  train:
    cmd: ./trainModel.sh
    deps:
    - images
    outs:
    - model.hdf5

or similar, the main idea is that imports are separated from stages. Would something like that work for you?
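
As a rough illustration of what one entry in such an `imports` section would amount to, the sketch below turns an entry into the equivalent `dvc import` invocation. The `imports` format is only a proposal in this thread, so the `import_command` helper and the already-resolved `rev` value are assumptions; the `-o` and `--rev` options are the existing `dvc import` flags. The command is only built, not executed.

```python
import shlex


def import_command(name, spec):
    """Build the `dvc import` argument list for one proposed imports entry."""
    cmd = ["dvc", "import", spec["repo"], spec["path"], "-o", name]
    if spec.get("rev"):
        # Pin the import to a specific revision of the source repo.
        cmd += ["--rev", spec["rev"]]
    return cmd


# Hypothetical entry, with ${training.images.version} already resolved.
entry = {
    "repo": "https://github.com/iterative/dvc",
    "path": "data",
    "rev": "v1.2.0",
}

cmd = import_command("images", entry)
print(shlex.join(cmd))
```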

@efiop added the feature request, awaiting response, and p2-medium labels Dec 8, 2020
@tillbier
Author

tillbier commented Dec 9, 2020

This looks like it would serve our needs pretty well. Thank you!

@efiop
Contributor

efiop commented Dec 9, 2020

CC @skshetry WDYT?

@efiop
Contributor

efiop commented Dec 19, 2020

For the record: I had a discussion with @skshetry earlier and we decided to postpone this, likely until after the 2.0 release; there will be some preparations leading up to that in the near future.

@efiop changed the title from "Imported Dependencies should have a configureable revision" to "dvc.yaml: parametrize imports" Dec 19, 2020
@efiop removed the awaiting response label Dec 19, 2020
@karajan1001
Contributor

karajan1001 commented Dec 22, 2020

We are thinking about moving imports to dvc.yaml too #4841 , but so far we are thinking about a format that looks something like:

vars:
  - params.yaml

imports:
- images:
    repo: https://github.com/iterative/dvc
    path: data
    rev: ${training.images.version}

stages:
  train:
    cmd: ./trainModel.sh
    deps:
    - images
    outs:
    - model.hdf5

or similar, the main idea is that imports are separated from stages. Would something like that work for you?

@efiop
Excuse me, what about the stages created by the dvc add command?

@skshetry
Member

We haven't yet decided on that.

@karajan1001, what do you think? Would it be good to get rid of .dvc files?

@karajan1001
Contributor

karajan1001 commented Dec 23, 2020

@skshetry @efiop

My opinion is in #4278 (comment)

@karajan1001
Contributor

In TensorFlow, a checkpoint is saved as three files:

meta file: describes the saved graph structure.
data file: saves the values of all variables.
index file: a string-to-string immutable table that describes the metadata of each tensor: which of the "data" files contains the tensor's content, the offset into that file, the checksum, some auxiliary data, etc.

Similarly, in DVC's DAG we have:

dvc.yaml: describes the graph structure.
.dvc/cache: saves the values of the nodes.
index file: two options, either distributed across .dvc files or centralized in a single file like dvc.lock.

Centralization is more manageable, while distribution is more readable when no data exists in the current workspace (for example, in a Git repo). I don't know which is more important.
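
The two index layouts can be sketched as two views of the same path-to-hash mapping. This is a deliberately simplified illustration, not the actual .dvc or dvc.lock schema: the paths, the hash strings, and the `distribute` and `centralize` helpers are all hypothetical.

```python
# Hypothetical centralized index: output path -> content hash.
centralized = {
    "data/images": "a1b2c3",
    "model.hdf5": "d4e5f6",
}


def distribute(index):
    """Split one central index into per-output records, keyed by '<path>.dvc'."""
    return {
        f"{path}.dvc": {"path": path, "hash": digest}
        for path, digest in index.items()
    }


def centralize(records):
    """Merge per-output records back into a single path -> hash index."""
    return {record["path"]: record["hash"] for record in records.values()}


distributed = distribute(centralized)

# Both layouts carry the same information; they differ in where it lives.
assert centralize(distributed) == centralized
```

The round trip shows that neither layout loses information, so the choice comes down to the manageability and readability trade-off described above.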
