ref: reduce overlap between repro and stage add #4026
Changes from 3 commits
Original file line number | Diff line number | Diff line change | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
|
@@ -37,82 +37,72 @@ are ignored by `dvc stage add`. | |||||||||||
|
||||||||||||
</admon> | ||||||||||||
|
||||||||||||
Stages whose outputs become dependencies for other stages form | ||||||||||||
<abbr>pipelines</abbr>. `dvc repro` can be used to rebuild this [dependency | ||||||||||||
graph] and execute them. | ||||||||||||
Stages whose <abbr>outputs</abbr> become <abbr>dependencies</abbr> for other | ||||||||||||
stages form <abbr>pipelines</abbr>. For example: | ||||||||||||
|
||||||||||||
<admon type="info"> | ||||||||||||
```dvc | ||||||||||||
$ dvc stage add -n printer -d write.sh -o pages ./write.sh | ||||||||||||
$ dvc stage add -n scanner -d read.sh -d pages -o signed.pdf ./read.sh pages | ||||||||||||
``` | ||||||||||||
|
||||||||||||
See the guide on [defining pipeline stages] for more details. | ||||||||||||
<admon icon="book"> | ||||||||||||
|
||||||||||||
[defining pipeline stages]: | ||||||||||||
/doc/user-guide/pipelines/defining-pipelines#pipelines | ||||||||||||
See the guide on [defining pipeline stages] for more details. | ||||||||||||
|
||||||||||||
</admon> | ||||||||||||
|
||||||||||||
`dvc repro` can be used to rebuild this [dependency graph] and run stages. | ||||||||||||
|
||||||||||||
[`command` argument]: | ||||||||||||
/doc/user-guide/project-structure/dvcyaml-files#stage-commands | ||||||||||||
[defining pipeline stages]: | ||||||||||||
/doc/user-guide/pipelines/defining-pipelines#dvcyaml-metafiles | ||||||||||||
[dependency graph]: | ||||||||||||
/doc/user-guide/pipelines/defining-pipelines#directed-acyclic-graph-dag | ||||||||||||
|
||||||||||||
### Dependencies and outputs | ||||||||||||
|
||||||||||||
By specifying lists of <abbr>dependencies</abbr> (`-d` option) and/or | ||||||||||||
<abbr>outputs</abbr> (`-o` and `-O` options) for each stage, we can create a | ||||||||||||
[dependency graph] that connects them, i.e. the output of a stage becomes the | ||||||||||||
input of another, and so on (see `dvc dag`). This graph can be restored by DVC | ||||||||||||
later to modify or [reproduce](/doc/command-reference/repro) the full pipeline. | ||||||||||||
For example: | ||||||||||||
|
||||||||||||
```cli | ||||||||||||
$ dvc stage add -n printer -d write.sh -o pages ./write.sh | ||||||||||||
$ dvc stage add -n scanner -d read.sh -d pages -o signed.pdf ./read.sh pages | ||||||||||||
``` | ||||||||||||
|
||||||||||||
Stage dependencies can be any file or directory, either untracked, or more | ||||||||||||
commonly tracked by DVC or Git. Outputs will be tracked and <abbr>cached</abbr> | ||||||||||||
by DVC when the stage is run. Every output version will be cached when the stage | ||||||||||||
is reproduced (see also `dvc gc`). | ||||||||||||
|
||||||||||||
Relevant notes: | ||||||||||||
is reproduced (see also `dvc gc`). Relevant notes: | ||||||||||||
|
||||||||||||
- Typically, scripts to run (or possibly a directory containing the source code) | ||||||||||||
are included among the specified `-d` dependencies. This ensures that when the | ||||||||||||
source code changes, DVC knows that the stage needs to be reproduced. (You can | ||||||||||||
chose whether to do this.) | ||||||||||||
|
||||||||||||
- `dvc stage add` checks the dependency graph integrity before creating a new | ||||||||||||
- `dvc stage add` checks the [dependency graph] integrity before creating a new | ||||||||||||
stage. For example: two stages cannot specify the same output or overlapping | ||||||||||||
output paths, there should be no cycles, etc. | ||||||||||||
|
||||||||||||
- DVC does not feed dependency files to the command being run. The program will | ||||||||||||
have to read by itself the files specified with `-d`. | ||||||||||||
have to read the files itself. | ||||||||||||
|
||||||||||||
- Entire directories produced by the stage can be tracked as outputs by DVC, | ||||||||||||
which generates a single `.dir` entry in the cache (refer to | ||||||||||||
[Structure of cache directory](/doc/user-guide/project-structure/internal-files#structure-of-the-cache-directory) | ||||||||||||
for more info.) | ||||||||||||
which generates a single `.dir` entry in the cache (refer to [Structure of | ||||||||||||
cache directory] for more info.) | ||||||||||||
|
||||||||||||
- [external dependencies](/doc/user-guide/data-management/importing-external-data) | ||||||||||||
and [external outputs](/doc/user-guide/data-management/managing-external-data) | ||||||||||||
(outside of the <abbr>workspace</abbr>) are also supported (except metrics and | ||||||||||||
plots). | ||||||||||||
- [external dependencies] and [external outputs] (outside of the | ||||||||||||
<abbr>workspace</abbr>) are also supported (except metrics and plots). | ||||||||||||
|
||||||||||||
- Outputs are deleted from the workspace before executing the command (including | ||||||||||||
at `dvc repro`) if their paths are found as existing files/directories (unless | ||||||||||||
`--outs-persist` is used). This also means that the stage command needs to | ||||||||||||
recreate any directory structures defined as outputs every time it's executed | ||||||||||||
by DVC. | ||||||||||||
- Stage commands need to recreate any directory structures defined as outputs | ||||||||||||
every time it's executed by DVC. | ||||||||||||
Suggested change
@jorgeorpinel I will probably go with something like this unless there's a strong reason to drop the first sentence completely.
Oh, I already rephrased in a similar way. Sure, up to you 🙂
||||||||||||
|
||||||||||||
- In some situations, we have previously executed a stage, and later notice that | ||||||||||||
some of the files/directories used by the stage as dependencies, or created as | ||||||||||||
outputs are missing from `dvc.yaml`. It is possible to | ||||||||||||
[add missing dependencies/outputs to an existing stage](/doc/user-guide/how-to/add-deps-or-outs-to-a-stage) | ||||||||||||
without having to execute it again. | ||||||||||||
some of the dependencies or outputs are missing from `dvc.yaml`. It is | ||||||||||||
possible to [add them to an existing stage]. | ||||||||||||
|
||||||||||||
- Renaming dependencies or outputs requires a | ||||||||||||
[manual process](/doc/command-reference/move#renaming-stage-outputs) to update | ||||||||||||
- Renaming dependencies or outputs requires a [manual process] to update | ||||||||||||
`dvc.yaml` and the project's cache accordingly. | ||||||||||||
|
||||||||||||
[dependency graph]: /doc/user-guide/pipelines/defining-pipelines | ||||||||||||
[add them to an existing stage]: | ||||||||||||
/docs/user-guide/how-to/add-deps-or-outs-to-a-stage | ||||||||||||
[structure of cache directory]: | ||||||||||||
/doc/user-guide/project-structure/internal-files#structure-of-the-cache-directory | ||||||||||||
[external dependencies]: /doc/user-guide/external-dependencies | ||||||||||||
[external outputs]: /doc/user-guide/managing-external-data | ||||||||||||
[manual process]: /doc/command-reference/move#renaming-stage-outputs | ||||||||||||
|
||||||||||||
### For displaying and comparing data science experiments | ||||||||||||
|
||||||||||||
|
@@ -140,7 +130,7 @@ data science experiments. | |||||||||||
on. Multiple dependencies can be specified like this: | ||||||||||||
`-d data.csv -d process.py`. Usually, each dependency is a file or a directory | ||||||||||||
with data, or a code file, or a configuration file. DVC also supports certain | ||||||||||||
[external dependencies](/doc/user-guide/data-management/importing-external-data). | ||||||||||||
[external dependencies]. | ||||||||||||
|
||||||||||||
When you use `dvc repro`, the list of dependencies helps DVC analyze whether | ||||||||||||
any dependencies have changed and thus executing stages required to regenerate | ||||||||||||
|
@jorgeorpinel Do you mind explaining why you dropped this note?
Edit: I guess you modified it, but it seems less explicit now without "Outputs are deleted from the workspace before executing the command".
Yeah I was just making it shorter I guess. Idk this was a long time ago. I've reinstated the first part so it's explicit again. PTAL
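For context on why the explicit wording matters, here is a rough sketch of what the note implies for a stage command. It does not call DVC at all: `write_pages` is a hypothetical stand-in for a script like `./write.sh`, and the `rm -rf` only imitates the documented behavior of removing existing outputs before the command runs.

```shell
#!/bin/sh
set -eu

# Hypothetical stand-in for a stage command such as ./write.sh.
write_pages() {
    out_dir="$1"
    # Recreate the output directory on every run, since it may have been removed.
    mkdir -p "$out_dir"
    printf 'page 1\n' > "$out_dir/page1.txt"
}

# Imitate the situation described in the note (assumption: no real DVC involved).
workdir="$(mktemp -d)"
mkdir -p "$workdir/pages"
echo "stale" > "$workdir/pages/old.txt"

# The existing output is removed first, then the command rebuilds it from scratch.
rm -rf "$workdir/pages"
write_pages "$workdir/pages"
```

Without the `mkdir -p`, the redirect into `pages/page1.txt` would fail after the output directory has been removed, which is the case the note warns about.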