ref: reduce overlap between repro and stage add #4026

Merged 4 commits on Mar 13, 2023
Changes from 3 commits

7 changes: 4 additions & 3 deletions content/docs/command-reference/exp/init.md
@@ -27,9 +27,9 @@ locations of your inputs (data, <abbr>parameters</abbr>, and source code) and
outputs (models, <abbr>metrics</abbr>, and
[plots](/doc/command-reference/plots)).

The only required argument is a [shell `command`] to run your experiment(s). It
can be provided directly as an argument (see example below) or by using the
`--interactive` (`-i`) mode, which will prompt for it.
The only required argument is the terminal `command` that runs your
experiment(s). It can be provided directly [as an argument] or by using the
`--interactive` (`-i`) mode (which will prompt for it).

```cli
$ dvc exp init "python src/train.py"
@@ -101,6 +101,7 @@ See the [Pipelines guide] for more on that topic.
/doc/user-guide/project-structure/dvcyaml-files#stage-commands
[checkpoints]: /doc/user-guide/experiment-management/checkpoints
[dvc experiments]: /doc/user-guide/experiment-management
[as an argument]: /doc/user-guide/pipelines/defining-pipelines#stage-commands
[pipelines guide]: /doc/user-guide/pipelines/defining-pipelines

## Options
82 changes: 42 additions & 40 deletions content/docs/command-reference/repro.md
@@ -21,61 +21,63 @@ positional arguments:

## Description

Provides a way to regenerate data pipeline results, by restoring the [dependency
graph] implicitly defined by the stages listed in `dvc.yaml`. The commands
defined in these stages are then executed in the correct order.
Provides a way to regenerate data pipeline results by restoring the [dependency
graph] defined among the stages listed in `dvc.yaml`. Stages are then checked to
decide which ones need to run (see `dvc status`). Finally, [their commands] are
executed.

This is similar to [`make`](https://www.gnu.org/software/make/manual/) in
software build automation, but DVC captures "build requirements" (stage
<abbr>dependencies</abbr>) and <abbr>caches</abbr> the pipeline's
<abbr>outputs</abbr> along the way.

<admon type="info" title="Notes">

Stage outputs are deleted from the <abbr>workspace</abbr> before executing the
stage commands that produce them (unless `persist: true` is used in `dvc.yaml`).

For stages with multiple commands (having a list in the `cmd` field), commands
are run one after the other in the order they are defined. The failure of any
command will halt the remaining stage execution, and raises an error.
command will halt the remaining stage execution and raise an error.

> Pipeline stages are defined in `dvc.yaml` (either manually or by using
> `dvc stage add`) while initial data dependencies can be registered with
> `dvc add`.
Stages without dependencies nor outputs are considered [always changed], so
`dvc repro` always runs them.

`dvc repro` is similar to [Make](https://www.gnu.org/software/make/) in software
build automation, but DVC captures build requirements
([dependencies and outputs](/doc/command-reference/run#dependencies-and-outputs))
and <abbr>caches</abbr> the pipeline's <abbr>outputs</abbr> along the way.
</admon>

💡 For convenience, a Git hook is available to remind you to `dvc repro` when
needed after a `git commit`. See `dvc install` for more details.
This is usually done after one or more <abbr>stages</abbr> are defined (see
`dvc.yaml` and `dvc stage add`) or when code or other dependencies change or are
missing. Note that `dvc repro` does not attempt to `dvc checkout` or `dvc pull`
data unless the `--pull` option is used.

Keep in mind that one `dvc.yaml` file does not necessarily equal one
[pipeline](/doc/command-reference/dag) (although that is typical). DVC evaluates
all the `dvc.yaml` files in the <abbr>workspace</abbr> to rebuild and validate
pipeline(s). Then it executes the corresponding commands (`cmd` field of
`dvc.yaml`).
<admon type="tip">

There are a few ways to restrict what will be regenerated by this command: by
specifying specific reproduction [`targets`](#options), or by using certain
command [options](#options), such as `--single-item` or `--all-pipelines`.
For convenience, a Git hook is available to remind you to `dvc repro` when
needed after a `git commit`. See `dvc install` for more details.

<!-- prettier-ignore -->
> Note that stages without dependencies nor outputs are considered [always
> changed], so `dvc repro` always executes them.
</admon>

<abbr>Stage</abbr> outputs are deleted from the <abbr>workspace</abbr> before
executing the stage commands that produce them (unless `persist: true` is used
in `dvc.yaml`).
Keep in mind that one `dvc.yaml` file does not necessarily equal one
<abbr>pipeline</abbr> (although that is typical). So DVC reads all the
`dvc.yaml` files in the <abbr>workspace</abbr> to rebuild pipeline(s).

`dvc repro` does not run `dvc fetch`, `dvc pull` or `dvc checkout` to get data
files, intermediate or final results (except if the `--pull` option is used).
However, there are a few ways to restrict what gets regenerated: by specifying
reproduction [`targets`](#options), or by using certain command
[options](#options) such as `--single-item` or `--all-pipelines`.

It stores all the data files, intermediate or final results in the
<abbr>cache</abbr> (unless the `--no-commit` option is used), and updates the
hash values of changed dependencies and outputs in the `dvc.lock` and `.dvc`
files.
All the data files, intermediate or final results are <abbr>cached</abbr>
(unless the `--no-commit` option is used), and the hash values of changed
dependencies and outputs are updated in `dvc.lock` and `.dvc` files, as needed.

[dependency graph]: /doc/user-guide/pipelines/defining-pipelines
[their commands]: /doc/user-guide/pipelines/defining-pipelines#stage-commands
[always changed]: /doc/command-reference/status#local-workspace-status
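
The flow described in this hunk (rebuild the dependency graph, check which stages changed, then run their commands in order) can be sketched roughly as follows. This is a conceptual model only, not DVC's implementation: the `STAGES` mapping, the `build_graph` and `plan` helpers, and the plain hash dictionaries are all hypothetical names introduced for illustration, with stage names borrowed from the `dvc stage add` example elsewhere in this PR.

```python
from graphlib import TopologicalSorter

# Hypothetical in-memory model of two stages; not DVC's internal format.
STAGES = {
    "printer": {"cmd": "./write.sh", "deps": ["write.sh"], "outs": ["pages"]},
    "scanner": {
        "cmd": "./read.sh pages",
        "deps": ["read.sh", "pages"],
        "outs": ["signed.pdf"],
    },
}


def build_graph(stages):
    """Map each stage to the stages whose outputs it depends on."""
    producer = {out: name for name, s in stages.items() for out in s["outs"]}
    return {
        name: {producer[d] for d in s["deps"] if d in producer}
        for name, s in stages.items()
    }


def plan(stages, current_hashes, locked_hashes):
    """Return the stages that look changed, in dependency order.

    Simplification: hashes are compared against a single snapshot, so this
    does not model outputs changing while the pipeline is running.
    """
    to_run = []
    for name in TopologicalSorter(build_graph(stages)).static_order():
        paths = stages[name]["deps"] + stages[name]["outs"]
        # A stage with neither dependencies nor outputs is always changed.
        always_changed = not paths
        changed = any(
            p not in current_hashes or current_hashes[p] != locked_hashes.get(p)
            for p in paths
        )
        if always_changed or changed:
            to_run.append(name)
    return to_run
```

With every recorded hash matching, `plan` returns an empty list. Changing only the hash of `read.sh` yields `["scanner"]`, and a stage with empty `deps` and `outs` is always included.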

### Parallel stage execution

Currently, `dvc repro` is not able to parallelize stage execution automatically.
If you need to do this, you can launch `dvc repro` multiple times manually. For
example, let's say a [pipelines](/doc/command-reference/dag) graph looks
something like this:
If you need to parallelize stage execution, you can launch `dvc repro` multiple
times concurrently (e.g. in separate terminals). For example, let's say a
[pipelines](/doc/command-reference/dag) graph looks something like this:

```cli
$ dvc dag
@@ -100,9 +102,9 @@ This pipeline consists of two parallel branches (`A` and `B`), and the final
`train` stage, where the branches merge. If you run `dvc repro` at this point,
it would reproduce each branch sequentially before `train`. To reproduce both
branches simultaneously, you could run `dvc repro A2` and `dvc repro B2` at the
same time (e.g. in separate terminals). After both finish successfully, you can
then run `dvc repro train`: DVC will know that both branches are already
up-to-date and only execute the final stage.
same time. After both finish successfully, you can then run `dvc repro train`:
DVC will know that both branches are already up-to-date and only execute the
final stage.
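
To see why `A2` and `B2` can be reproduced at the same time while `train` has to wait, it may help to group the stages into batches of mutually independent work. The sketch below is illustrative only: the `waves` helper is a hypothetical name, and the edges are an assumption based on the three stage names mentioned above (the rest of the graph is not shown in this hunk).

```python
from graphlib import TopologicalSorter

# Assumed edges: `train` depends on the last stage of each branch.
graph = {"A2": set(), "B2": set(), "train": {"A2", "B2"}}


def waves(graph):
    """Group stages into batches whose members do not depend on each other."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything runnable right now
        batches.append(ready)
        sorter.done(*ready)  # mark the batch finished to unlock successors
    return batches


print(waves(graph))  # → [['A2', 'B2'], ['train']]
```

Each inner list corresponds to `dvc repro <target>` invocations that could be launched concurrently. The next batch should start only after the previous one finishes successfully.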

## Options

78 changes: 34 additions & 44 deletions content/docs/command-reference/stage/add.md
@@ -37,82 +37,72 @@ are ignored by `dvc stage add`.

</admon>

Stages whose outputs become dependencies for other stages form
<abbr>pipelines</abbr>. `dvc repro` can be used to rebuild this [dependency
graph] and execute them.
Stages whose <abbr>outputs</abbr> become <abbr>dependencies</abbr> for other
stages form <abbr>pipelines</abbr>. For example:

<admon type="info">
```dvc
$ dvc stage add -n printer -d write.sh -o pages ./write.sh
$ dvc stage add -n scanner -d read.sh -d pages -o signed.pdf ./read.sh pages
```

See the guide on [defining pipeline stages] for more details.
<admon icon="book">

[defining pipeline stages]:
/doc/user-guide/pipelines/defining-pipelines#pipelines
See the guide on [defining pipeline stages] for more details.

</admon>

`dvc repro` can be used to rebuild this [dependency graph] and run stages.

[`command` argument]:
/doc/user-guide/project-structure/dvcyaml-files#stage-commands
[defining pipeline stages]:
/doc/user-guide/pipelines/defining-pipelines#dvcyaml-metafiles
[dependency graph]:
/doc/user-guide/pipelines/defining-pipelines#directed-acyclic-graph-dag

### Dependencies and outputs

By specifying lists of <abbr>dependencies</abbr> (`-d` option) and/or
<abbr>outputs</abbr> (`-o` and `-O` options) for each stage, we can create a
[dependency graph] that connects them, i.e. the output of a stage becomes the
input of another, and so on (see `dvc dag`). This graph can be restored by DVC
later to modify or [reproduce](/doc/command-reference/repro) the full pipeline.
For example:

```cli
$ dvc stage add -n printer -d write.sh -o pages ./write.sh
$ dvc stage add -n scanner -d read.sh -d pages -o signed.pdf ./read.sh pages
```

Stage dependencies can be any file or directory, either untracked, or more
commonly tracked by DVC or Git. Outputs will be tracked and <abbr>cached</abbr>
by DVC when the stage is run. Every output version will be cached when the stage
is reproduced (see also `dvc gc`).

Relevant notes:
is reproduced (see also `dvc gc`). Relevant notes:

- Typically, scripts to run (or possibly a directory containing the source code)
are included among the specified `-d` dependencies. This ensures that when the
source code changes, DVC knows that the stage needs to be reproduced. (You can
choose whether to do this.)

- `dvc stage add` checks the dependency graph integrity before creating a new
- `dvc stage add` checks the [dependency graph] integrity before creating a new
stage. For example: two stages cannot specify the same output or overlapping
output paths, there should be no cycles, etc.

- DVC does not feed dependency files to the command being run. The program will
have to read by itself the files specified with `-d`.
have to read the files itself.

- Entire directories produced by the stage can be tracked as outputs by DVC,
which generates a single `.dir` entry in the cache (refer to
[Structure of cache directory](/doc/user-guide/project-structure/internal-files#structure-of-the-cache-directory)
for more info.)
which generates a single `.dir` entry in the cache (refer to [Structure of
cache directory] for more info.)

- [external dependencies](/doc/user-guide/data-management/importing-external-data)
and [external outputs](/doc/user-guide/data-management/managing-external-data)
(outside of the <abbr>workspace</abbr>) are also supported (except metrics and
plots).
- [external dependencies] and [external outputs] (outside of the
<abbr>workspace</abbr>) are also supported (except metrics and plots).

- Outputs are deleted from the workspace before executing the command (including
at `dvc repro`) if their paths are found as existing files/directories (unless
`--outs-persist` is used). This also means that the stage command needs to
recreate any directory structures defined as outputs every time its executed
by DVC.
Comment on lines -99 to -103
@dberenbaum (Contributor), Feb 24, 2023

@jorgeorpinel Do you mind explaining why you dropped this note?

Edit: I guess you modified it, but it seems less explicit now without "Outputs are deleted from the workspace before executing the command".

Contributor Author

Yeah I was just making it shorter I guess. Idk this was a long time ago. I've reinstated the first part so it's explicit again. PTAL

- Stage commands need to recreate any directory structures defined as outputs
every time its executed by DVC.
@dberenbaum (Contributor), Feb 24, 2023

Suggested change
- Stage commands need to recreate any directory structures defined as outputs
every time its executed by DVC.
- Outputs are deleted from the workspace before executing the command (unless
`--outs-persist` is used). Stage commands need to recreate any directory structures defined as outputs
every time it's executed by DVC.

@jorgeorpinel I will probably go with something like this unless there's a strong reason to drop the first sentence completely.

Contributor Author

Oh, I already rephrased in a similar way. Sure up to you 🙂


- In some situations, we have previously executed a stage, and later notice that
some of the files/directories used by the stage as dependencies, or created as
outputs are missing from `dvc.yaml`. It is possible to
[add missing dependencies/outputs to an existing stage](/doc/user-guide/how-to/add-deps-or-outs-to-a-stage)
without having to execute it again.
some of the dependencies or outputs are missing from `dvc.yaml`. It is
possible to [add them to an existing stage].

- Renaming dependencies or outputs requires a
[manual process](/doc/command-reference/move#renaming-stage-outputs) to update
- Renaming dependencies or outputs requires a [manual process] to update
`dvc.yaml` and the project's cache accordingly.

[dependency graph]: /doc/user-guide/pipelines/defining-pipelines
[add them to an existing stage]:
/docs/user-guide/how-to/add-deps-or-outs-to-a-stage
[structure of cache directory]:
/doc/user-guide/project-structure/internal-files#structure-of-the-cache-directory
[external dependencies]: /doc/user-guide/external-dependencies
[external outputs]: /doc/user-guide/managing-external-data
[manual process]: /doc/command-reference/move#renaming-stage-outputs
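
The integrity rules mentioned in the notes above (no two stages sharing or overlapping output paths, and no cycles) can be illustrated with a small checker. This is a minimal sketch under stated assumptions, not the validation `dvc stage add` actually performs: `overlaps` and `check_graph` are hypothetical helpers, stages are modeled as plain dictionaries, and dependencies are linked to outputs by exact path only.

```python
from graphlib import CycleError, TopologicalSorter
from pathlib import PurePosixPath


def overlaps(a, b):
    """True if the paths are equal or one is nested inside the other."""
    pa, pb = PurePosixPath(a), PurePosixPath(b)
    return pa == pb or pa in pb.parents or pb in pa.parents


def check_graph(stages):
    """Return a list of problems for a {name: {"deps": [...], "outs": [...]}} map."""
    problems = []
    outs = [(out, name) for name, s in stages.items() for out in s["outs"]]

    # Rule 1: outputs of different stages must not collide or nest.
    for i, (out_a, stage_a) in enumerate(outs):
        for out_b, stage_b in outs[i + 1:]:
            if stage_a != stage_b and overlaps(out_a, out_b):
                problems.append(
                    f"overlapping outputs: {out_a} ({stage_a}) and {out_b} ({stage_b})"
                )

    # Rule 2: the stage graph must be acyclic.
    producer = {out: name for out, name in outs}
    graph = {
        name: {producer[d] for d in s["deps"] if d in producer}
        for name, s in stages.items()
    }
    try:
        TopologicalSorter(graph).prepare()
    except CycleError:
        problems.append("cycle detected")
    return problems
```

For the `printer` and `scanner` stages shown earlier, the function returns an empty list. Two stages that declare `data` and `data/out` as outputs would be reported as overlapping.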

### For displaying and comparing data science experiments

@@ -140,7 +130,7 @@ data science experiments.
on. Multiple dependencies can be specified like this:
`-d data.csv -d process.py`. Usually, each dependency is a file or a directory
with data, or a code file, or a configuration file. DVC also supports certain
[external dependencies](/doc/user-guide/data-management/importing-external-data).
[external dependencies].

When you use `dvc repro`, the list of dependencies helps DVC analyze whether
any dependencies have changed and thus executing stages required to regenerate