guide: explain usage of multiple dvc.yaml files #2494

Open
1 task
amin-nejad opened this issue May 21, 2021 · 15 comments
Labels
A: docs Area: user documentation (gatsby-theme-iterative) C: guide Content of /doc/user-guide good first issue Good for newcomers p2-nice-to-have Less of a priority at the moment. We don't usually deal with this immediately. type: enhancement Something is not clear, small updates, improvement suggestions

Comments

@amin-nejad
Contributor

amin-nejad commented May 21, 2021

In #1641, it was added that multiple dvc.yaml files are supported. I think it would be good to give extra information on how this works and even encourage it where relevant.

Specifically one or more of the following:

  • dvc.yaml files can be in any subdirectory or nested subdirectory in the project structure and DVC will find them
  • DVC will process them just the same as if they were one DVC file i.e. dependencies between stages in different dvc.yaml files are still respected
  • Each dvc.yaml file will have its own dvc.lock file in the same directory
  • Splitting a dvc.yaml file into multiple files is encouraged where there are clear logical groupings between stages. It avoids confusion, improves readability and shortens commands by avoiding long paths preceding every filename
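The first and third points can be pictured with a small sketch. It is not how DVC locates pipelines internally; it only walks a directory tree with Python's standard library and pairs each dvc.yaml with the dvc.lock expected next to it. `find_pipelines` is a hypothetical helper name used for illustration.

```python
from pathlib import Path


def find_pipelines(project_root):
    """Return (dvc.yaml, dvc.lock) path pairs found under project_root."""
    root = Path(project_root)
    pairs = []
    # rglob searches the root and every nested subdirectory.
    for dvc_yaml in sorted(root.rglob("dvc.yaml")):
        # The lock file is expected in the same directory as its dvc.yaml.
        # It may not exist yet if the pipeline has never been run.
        pairs.append((dvc_yaml, dvc_yaml.with_name("dvc.lock")))
    return pairs
```

Calling `find_pipelines(".")` from the project root lists one pair per pipeline directory, which is a quick way to check that every dvc.yaml has a committed dvc.lock beside it.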

Other Details

(Added by @shcheklein)

  • you need to use --all-pipelines or --recursive to find and run all pipelines
  • a particular pipeline dvc.yaml can be run with dvc exp run pipeline1/dvc.yaml or cd pipeline1; dvc exp run (works for dvc repro as well)
  • each subdirectory could have its own params.yaml that will be used as a default params file for a particular pipeline

Example

An artificial example. We should modify it a bit to be more realistic when we write docs:

(.venv) √ Projects/test-pipelines % tree .
.
├── pipeline1
│   ├── dvc.lock
│   ├── dvc.yaml
│   └── params.yaml
└── pipeline2
    ├── dvc.lock
    ├── dvc.yaml
    └── params.yaml

2 directories, 6 files
(.venv) √ Projects/test-pipelines % cat pipeline1/dvc.yaml
stages:
  p1-echo:
    cmd: echo ${v}
(.venv) √ Projects/test-pipelines % cat pipeline1/params.yaml
v: 1
(.venv) √ Projects/test-pipelines % cat pipeline2/dvc.yaml
stages:
  p2-echo:
    cmd: echo ${v}
(.venv) √ Projects/test-pipelines % cat pipeline2/params.yaml
v: 2

To reference a param file in a different directory, try an explicit syntax for param files:

      params:
        - params.yaml:

within a stage, or globally per dvc.yaml.

@shcheklein shcheklein added A: docs Area: user documentation (gatsby-theme-iterative) type: enhancement Something is not clear, small updates, improvement suggestions labels May 21, 2021
@jorgeorpinel jorgeorpinel changed the title doc: give more information about multiple dvc.yaml files feature guide: explain usage of multiple dvc.yaml files May 23, 2021
@shcheklein shcheklein added the p1-important Active priorities to deal within next sprints label Jul 12, 2021
@iesahin iesahin added the C: guide Content of /doc/user-guide label Oct 21, 2021
@itcarroll

A mention of the "--all-pipelines" argument to dvc repro would have helped me. It took some searching on Discuss to understand how to get the nested dvc.yaml files picked up, which is with dvc repro --all-pipelines. An explanation of "--recursive" could help too (I, for one, don't understand the help text).

@JulianoLagana
Contributor

JulianoLagana commented Feb 22, 2023

How does one reference parameters when having multiple dvc.yaml files? Should there be one params.yaml file in the same directory as each dvc.yaml? If so, how does one reference parameters from parameter files in other directories?

@shcheklein shcheklein added the good first issue Good for newcomers label Feb 22, 2023
@shcheklein
Member

@JulianoLagana here is a very brief and small example that I tested:

(.venv) √ Projects/test-pipelines % tree .
.
├── pipeline1
│   ├── dvc.lock
│   ├── dvc.yaml
│   └── params.yaml
└── pipeline2
    ├── dvc.lock
    ├── dvc.yaml
    └── params.yaml

2 directories, 6 files
(.venv) √ Projects/test-pipelines % cat pipeline1/dvc.yaml
stages:
  p1-echo:
    cmd: echo ${v}
(.venv) √ Projects/test-pipelines % cat pipeline1/params.yaml
v: 1
(.venv) √ Projects/test-pipelines % cat pipeline2/dvc.yaml
stages:
  p2-echo:
    cmd: echo ${v}
(.venv) √ Projects/test-pipelines % cat pipeline2/params.yaml
v: 2

To reference a param file in a different directory, try an explicit syntax for param files:

      params:
        - params.yaml:

within a stage, or globally per dvc.yaml.

@amdsobhy

amdsobhy commented Feb 25, 2023

I created a directory tree like the one mentioned above. How can I choose to run only one of the pipelines? Will dvc repro --all-pipelines run them all? Let's say I want to run only pipeline1/dvc.yaml; how can I do that?

@shcheklein
Member

@amdsobhy :

One way to run it is to do:

$ cd pipeline1
$ dvc repro    # or: dvc exp run

Another way to do this:

$ dvc repro pipeline1/dvc.yaml

or

$ dvc exp run pipeline1/dvc.yaml
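For scripting, the same choices can be sketched in Python. This is only an illustration: `repro_command` and `run` are hypothetical helpers, and the only dvc arguments used are the ones mentioned in this thread (a dvc.yaml path as the target, or --all-pipelines). By default the sketch prints the command instead of executing it, so it does not require dvc to be installed.

```python
import subprocess


def repro_command(dvc_yaml=None, all_pipelines=False):
    """Build the argument list for a dvc repro call."""
    cmd = ["dvc", "repro"]
    if all_pipelines:
        cmd.append("--all-pipelines")
    elif dvc_yaml is not None:
        # Target a single pipeline by the path to its dvc.yaml.
        cmd.append(dvc_yaml)
    return cmd


def run(cmd, cwd=None, dry_run=True):
    """Print the command (dry run) or execute it in the given directory."""
    if dry_run:
        print(f"[{cwd or '.'}] {' '.join(cmd)}")
        return None
    return subprocess.run(cmd, cwd=cwd, check=True)


# Only pipeline1, addressed by path from the project root.
run(repro_command("pipeline1/dvc.yaml"))
# Only pipeline1, equivalent to `cd pipeline1; dvc repro`.
run(repro_command(), cwd="pipeline1")
# Every pipeline in the project.
run(repro_command(all_pipelines=True))
```

Passing `dry_run=False` executes the command; `check=True` makes a failing dvc call raise `subprocess.CalledProcessError`.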

@amdsobhy

@shcheklein Thank you for your answer. I tried

dvc repro pipeline1/dvc.yaml

before, but it did not work for some reason. I think this might be because I moved the dvc.yaml from its original location in the root directory.

So let's say I currently have one dvc.yaml along with a dvc.lock file in the root directory of my repo, ~/repo, and I want to move the files to ~/repo/pipeline1. Do I need to move the dvc.lock file as well? How should I make this transition? Also, I have already finished training while the dvc.yaml was at ~/repo/dvc.yaml and I do not want to retrain. I just want to relocate the files for future training and to combine multiple models in the same repo.

@shcheklein
Member

Do I need to move the dvc.lock file as well?

Yes, if it's a heavy pipeline and you don't want to run it again. If you need to change dvc.yaml in the process, you could run dvc commit at the end to save time and avoid running it again (assuming that you moved all the outputs, metrics, etc. and you are sure that they are exactly what should be produced).

How should I make this transition?

Moving files is fine. One thing you would need to check, and potentially change, is the paths to dependencies, outputs, etc. You might need to update them or move some additional files. It really depends on the dvc.yaml.
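A minimal sketch of the move itself, assuming dvc.yaml and dvc.lock currently sit in the repo root. `move_pipeline_files` is a hypothetical helper, not part of DVC, and it does not rewrite any paths inside the files; those still have to be reviewed by hand as described above.

```python
import shutil
from pathlib import Path


def move_pipeline_files(repo, dest_dir):
    """Move dvc.yaml and, if present, dvc.lock from the repo root into dest_dir."""
    repo = Path(repo)
    dest = repo / dest_dir
    dest.mkdir(parents=True, exist_ok=True)
    moved = []
    # Keep the lock file next to its dvc.yaml so the recorded state moves too.
    for name in ("dvc.yaml", "dvc.lock"):
        src = repo / name
        if src.exists():
            shutil.move(str(src), str(dest / name))
            moved.append(dest / name)
    return moved
```

For example, `move_pipeline_files(".", "pipeline1")` run from the repo root relocates both files into pipeline1/. In a Git repository, `git mv` would do the same job while keeping the rename staged.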

@amdsobhy

amdsobhy commented Feb 25, 2023

When editing paths in dvc.yaml and dvc.lock, are the paths relative to the root directory of the repo, relative to the location of the dvc.yaml file, or relative to where I execute the dvc repro command?

For example, I have my output in

~/repo/output/pipeline1/p1.weights

and I currently have my dvc file in

~/repo/dvc/pipeline1/dvc.yaml

Before, I had the output as follows:

outs:
- output/pipeline1/p1.weights

Should the new output path be:

outs:
- dvc/../output/pipeline1/p1.weights

so that it is relative to the new dvc file location?

Right now, when I try to run dvc status dvc/pipeline1/dvc.yaml, it reports that the files are deleted because it is looking for them inside the ~/repo/dvc directory, while they are one level up.

@shcheklein
Member

shcheklein commented Feb 25, 2023

When editing paths in dvc.yaml and dvc.lock, are the paths relative to the root directory of the repo, relative to the location of the dvc.yaml file, or relative to where I execute the dvc repro command?

I think they are relative to the dvc.yaml location.

Should the new output path be:

Looks like it should be ../../output/pipeline1/p1.weights?


Optionally, and only if needed, there are a few ways to manipulate this. You can use wdir in the stage definition. You could also use the dvc root command to get the root of the project and then compose a stable path.
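Assuming paths are indeed resolved relative to the dvc.yaml location, as suggested above, the rewritten path can be computed rather than counted by hand. The sketch below uses only the standard library; `path_from_dvc_yaml` is a hypothetical helper, and both arguments are given relative to the repo root.

```python
import posixpath


def path_from_dvc_yaml(target, dvc_yaml):
    """Express `target` relative to the directory that holds `dvc_yaml`."""
    return posixpath.relpath(target, start=posixpath.dirname(dvc_yaml))


# Output file and new dvc.yaml location from the question above.
print(path_from_dvc_yaml("output/pipeline1/p1.weights", "dvc/pipeline1/dvc.yaml"))
# prints ../../output/pipeline1/p1.weights
```

When the dvc.yaml is in the repo root, the function returns the path unchanged, which matches the original outs entry.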

@amdsobhy

amdsobhy commented Feb 25, 2023

I really appreciate your help, @shcheklein. Thank you so much.

Yes, I made a mistake; it's two levels up.

Is it possible for "wdir" to be set globally in dvc.yaml?

Also, what about the paths in the dvc.lock file? Do I need to manually modify them as well if I do not run the pipeline? And when modifying the dvc.lock file, is the wdir variable recognized in this file?

I noticed the paths in the dvc.lock file are the old ones.

@shcheklein
Member

Is it possible for "wdir" to be set globally in dvc.yaml?

No, not at the moment :(

Also, what about the paths in the dvc.lock file? Do I need to manually modify them as well if I do not run the pipeline? And when modifying the dvc.lock file, is the wdir variable recognized in this file?

I think you can run dvc commit to forcefully recreate the lock file.

@dberenbaum dberenbaum removed the p1-important Active priorities to deal within next sprints label Jul 18, 2023
@dberenbaum dberenbaum added the p1-important Active priorities to deal within next sprints label Aug 16, 2023
@dberenbaum
Collaborator

Adding back as a p1 since it relates to general monorepo usage, which we are seeing more and more often.

@dberenbaum dberenbaum self-assigned this Oct 3, 2023
@dberenbaum
Collaborator

Another topic to cover here is how to view experiment results when there are multiple pipelines or projects. From a recent email response:

With the command line and VS Code extension, you can filter the columns to only those relevant to a given pipeline. For example, to hide the columns that belong to pipeline1, you might do something like dvc exp show --drop 'pipeline1.*'. In VS Code, you can duplicate the workspace so that you have a window open for each pipeline.

If you use DVC Studio, you can configure a project directory and have a project for each pipeline without having to manually configure the columns.

@dberenbaum dberenbaum added p2-nice-to-have Less of a priority at the moment. We don't usually deal with this immediately. and removed p1-important Active priorities to deal within next sprints labels Apr 24, 2024