diff --git a/content/docs/command-reference/dag.md b/content/docs/command-reference/dag.md
index 43ff3d86a7..8c663f788f 100644
--- a/content/docs/command-reference/dag.md
+++ b/content/docs/command-reference/dag.md
@@ -1,7 +1,7 @@
# dag
-Visualize the pipeline(s) in `dvc.yaml` as one or more graph(s) of
-connected [stages](/doc/command-reference/run).
+Visualize pipelines as one or more stage dependency
+graphs.
## Synopsis
@@ -17,28 +17,15 @@ positional arguments:
## Description
-Displays the stages of a pipeline up to the `target` stage. If the `target` is
-omitted, it will show the full project DAG.
+DVC represents a pipeline internally as a **[Directed Acyclic Graph]** (DAG) where
+the nodes are stages and the edges are dependencies.
-### Directed acyclic graph
+`dvc dag` displays the dependency graph of one or more pipelines, as defined in
+the `dvc.yaml` files found in the project. Provide a `target` stage
+name to show the pipeline up to that point.
-A data pipeline, in general, is a series of data processing
-[stages](/doc/command-reference/run) (for example, console commands that take an
-input and produce an outcome). The connections between stages are formed by the
-output of one turning into the dependency of another.
-A pipeline may produce intermediate data, and has a final result.
-
-Data science and machine learning pipelines typically start with large raw
-datasets, include intermediate featurization and training stages, and produce a
-final model, as well as accuracy [metrics](/doc/command-reference/metrics).
-
-In DVC, pipeline stages and commands, their data I/O, interdependencies, and
-results (intermediate or final) are specified in `dvc.yaml`, which can be
-written manually or built using the helper command `dvc stage add`. This allows
-DVC to restore one or more pipelines later (see `dvc repro`).
-
-> DVC builds a dependency graph
-> ([DAG](https://en.wikipedia.org/wiki/Directed_acyclic_graph)) to do this.
+[directed acyclic graph]:
+ /doc/user-guide/data-pipelines/defining-pipelines#directed-acyclic-graph-dag
### Paginating the output
diff --git a/content/docs/command-reference/exp/init.md b/content/docs/command-reference/exp/init.md
index dd5f323b09..5794a9b50e 100644
--- a/content/docs/command-reference/exp/init.md
+++ b/content/docs/command-reference/exp/init.md
@@ -14,21 +14,38 @@ usage: dvc exp init [-h] [-q | -v] [--run] [--interactive] [-f]
[--metrics METRICS] [--plots PLOTS] [--live LIVE]
[--type {default,checkpoint}]
[command]
+
+positional arguments:
+ command Shell command to run the experiment(s)
```
## Description
This command helps you get started with DVC Experiments quickly. It reduces
-repetitive DVC procedures by creating a necessary `dvc.yaml` file, which assumes
-standard locations of your inputs (data, parameters, and source
-code) and outputs (models, metrics, and
+repetitive DVC procedures by creating a `dvc.yaml` file. It assumes standard
+locations of your inputs (data, parameters, and source code) and
+outputs (models, metrics, and
[plots](/doc/command-reference/plots)).
-These locations can be customized through the [command options](#options) or via
-[configuration](/doc/command-reference/config#exp). Default project structure:
+The only required argument is a [shell `command`] to run your experiment(s). It
+can be provided directly as an argument (see example below) or by using the
+`--interactive` (`-i`) mode, which will prompt for it.
+
+```cli
+$ dvc exp init "python src/train.py"
+Creating dependencies: src, data and params.yaml
+Creating output directories: plots and models
+Creating train stage in dvc.yaml
+```
+
+`dvc exp init` also generates the boilerplate project structure, including input
+files/directories and directories needed for future outputs. These locations can
+also be customized via [CLI options](#options) or interactive mode, or with
+[configuration](/doc/command-reference/config#exp). Default structure:
```
├── data/
+├── dvc.yaml
├── metrics.json
├── models/
├── params.yaml
@@ -36,55 +53,51 @@ These locations can be customized through the [command options](#options) or via
└── src/
```
-The only required argument is the terminal command that runs your experiment(s).
-It can be provided directly [as an argument](#the-command-argument) or by using
-the `--interactive` (`-i`) mode (which will prompt for it). The command will be
-wrapped as a stage that `dvc exp run` can execute.
+Inside `dvc.yaml`, the experiment is wrapped as a stage that
+`dvc exp run` can execute.
-
+
-A special `--type` of stage is supported (`checkpoint`), which monitors
-[checkpoints] during training of ML models.
+### Click to see `dvc.yaml` example
-
+```yaml
+stages:
+ train:
+ cmd: python src/train.py
+ deps:
+ - data
+ - src
+ params:
+ - params.yaml:
+ outs:
+ - models
+ metrics:
+ - metrics.json:
+ cache: false
+ plots:
+ - plots:
+ cache: false
+```
-`dvc exp init` also generates the boilerplate project structure, including input
-files/directories and directories needed for future outputs, or any locations
-determined in interactive mode.
+
-
+
-`dvc exp init` is intended as a quick way to start running [DVC Experiments].
-See the `dvc.yaml` specification for more complex data pipelines.
+A special `--type` of stage is supported (`checkpoint`), which monitors
+[checkpoints] during training of ML models.
+📖 `dvc exp init` is intended as a quick way to start running [DVC Experiments].
+See the [Pipelines guide] for more on that topic.
+
[stage definition]:
/doc/user-guide/project-structure/dvcyaml-files#stage-entries
+[shell `command`]:
+ /doc/user-guide/project-structure/dvcyaml-files#stage-commands
[checkpoints]: /doc/user-guide/experiment-management/checkpoints
[dvc experiments]: /doc/user-guide/experiment-management/experiments-overview
-
-### The `command` argument
-
-The command given to `dvc exp init` can be anything your system terminal would
-accept and run directly, for example a shell built-in, an expression, or a
-binary found in `PATH`. Please note that any flags sent after the `command`
-argument will normally become part of that command itself and ignored by
-`dvc exp init` (so provide it last).
-
-⚠️ While DVC is platform-agnostic, commands defined in `dvc.yaml` (`cmd` field)
-may only work on some operating systems and require certain software packages or
-libraries in the environment.
-
-Surround the command with double quotes `"` if it includes special characters
-like `|` or `<`, `>` -- otherwise they would apply to `dvc exp init` itself. Use
-single quotes `'` instead if there are environment variables in it that should
-be evaluated dynamically.
-
-```dvc
-$ dvc exp init "./a_script.sh > /dev/null 2>&1"
-$ dvc exp init './another_script.sh $MYENVVAR'
-```
+[pipelines guide]: /doc/user-guide/data-pipelines/defining-pipelines
## Options
diff --git a/content/docs/command-reference/move.md b/content/docs/command-reference/move.md
index 0d3b15b1a3..2fe4e00d46 100644
--- a/content/docs/command-reference/move.md
+++ b/content/docs/command-reference/move.md
@@ -87,11 +87,15 @@ model file:
$ mv keras.h5 model.h5
```
-> Note that, often the output of a stage is a dependency in another stage,
-> creating a
-> [dependency graph](/doc/command-reference/run#dependencies-and-outputs). In
-> this case, you may want to also update the `path` in the `deps` field of
-> `dvc.yaml`.
+
+
+Often the output of a stage is a dependency in another stage, creating a
+[dependency graph]. In this case, you may want to also update the `path` in the
+`deps` field of `dvc.yaml`.
+
+[dependency graph]: /doc/user-guide/data-pipelines/defining-pipelines
+
+
Finally, we run `dvc commit` with the `-f` option to force save the changes to
cache:
diff --git a/content/docs/command-reference/repro.md b/content/docs/command-reference/repro.md
index 327d11f49f..ba4e0b9efa 100644
--- a/content/docs/command-reference/repro.md
+++ b/content/docs/command-reference/repro.md
@@ -22,10 +22,9 @@ positional arguments:
## Description
-Provides a way to regenerate data pipeline results, by restoring the dependency
-graph (a [DAG](https://en.wikipedia.org/wiki/Directed_acyclic_graph)) implicitly
-defined by the stages listed in `dvc.yaml`. The commands defined in these stages
-are then executed in the correct order.
+Provides a way to regenerate data pipeline results, by restoring the [dependency
+graph] implicitly defined by the stages listed in `dvc.yaml`. The commands
+defined in these stages are then executed in the correct order.
For stages with multiple commands (having a list in the `cmd` field), commands
are run one after the other in the order they are defined. The failure of any
@@ -69,6 +68,7 @@ It stores all the data files, intermediate or final results in the
hash values of changed dependencies and outputs in the `dvc.lock` and `.dvc`
files.
+[dependency graph]: /doc/user-guide/data-pipelines/defining-pipelines
[always changed]: /doc/command-reference/status#local-workspace-status
### Parallel stage execution
@@ -163,7 +163,7 @@ up-to-date and only execute the final stage.
with the same dependencies and outputs (see the
[details](/doc/user-guide/project-structure/internal-files#run-cache)). Useful
for example if the stage command/s is/are non-deterministic
- ([not recommended](/doc/command-reference/run#avoiding-unexpected-behavior)).
+ ([not recommended](/doc/user-guide/data-pipelines/defining-pipelines#avoiding-unexpected-behavior)).
- `--force-downstream` - in cases like `... -> A (changed) -> B -> C` it will
reproduce `A` first and then `B`, even if `B` was previously executed with the
diff --git a/content/docs/command-reference/run.md b/content/docs/command-reference/run.md
index 5e0146fe3d..8006801088 100644
--- a/content/docs/command-reference/run.md
+++ b/content/docs/command-reference/run.md
@@ -34,44 +34,29 @@ organize data science projects, or build detailed machine learning pipelines.
A stage name is required and can be provided using the `-n` (`--name`) option.
The other available [options](#options) are mostly meant to describe different
kinds of stage [dependencies and outputs](#dependencies-and-outputs). The
-remaining terminal input provided to `dvc run` after `-`/`--` flags will become
-the required [`command` argument](#the-command-argument).
+remaining terminal input provided to `dvc run` after any options/flags will
+become the required [`command` argument].
-`dvc run` executes stage commands, unless the `--no-exec` option is used.
-
-
-
-### 💡 Avoiding unexpected behavior
-
-We don't want to tell anyone how to write their code or what programs to use!
-However, please be aware that in order to prevent unexpected results when DVC
-reproduces pipeline stages, the underlying code should ideally follow these
-rules:
-
-- Read/write exclusively from/to the specified dependencies and
- outputs (including parameters files, metrics, and plots).
+
-- Completely rewrite outputs. Do not append or edit.
+`-`/`--` flags sent after the `command` become part of the command itself and
+are ignored by `dvc run`.
-- Stop reading and writing files when the `command` exits.
+
-Also, if your pipeline reproducibility goals include consistent output data, its
-code should be
-[deterministic](https://en.wikipedia.org/wiki/Deterministic_algorithm) (produce
-the same output for any given input): avoid code that increases
-[entropy](https://en.wikipedia.org/wiki/Software_entropy) (e.g. random numbers,
-time functions, hardware dependencies, etc.).
+`dvc run` executes stage commands, unless the `--no-exec` option is used.
-
+[`command` argument]:
+ /doc/user-guide/project-structure/dvcyaml-files#stage-commands
### Dependencies and outputs
By specifying lists of dependencies (`-d` option) and/or
outputs (`-o` and `-O` options) for each stage, we can create a
-_dependency graph_ ([DAG](https://en.wikipedia.org/wiki/Directed_acyclic_graph))
-that connects them, i.e. the output of a stage becomes the input of another, and
-so on (see `dvc dag`). This graph can be restored by DVC later to modify or
-[reproduce](/doc/command-reference/repro) the full pipeline. For example:
+[dependency graph] that connects them, i.e. the output of a stage becomes the
+input of another, and so on (see `dvc dag`). This graph can be restored by DVC
+later to modify or [reproduce](/doc/command-reference/repro) the full pipeline.
+For example:
```dvc
$ dvc run -n printer -d write.sh -o pages ./write.sh
@@ -122,12 +107,14 @@ Relevant notes:
[manual process](/doc/command-reference/move#renaming-stage-outputs) to update
`dvc.yaml` and the project's cache accordingly.
+[dependency graph]: /doc/user-guide/data-pipelines/defining-pipelines
+
### For displaying and comparing data science experiments
-[parameters](/doc/command-reference/params) (`-p`/`--params` option) are a
-special type of key/value dependencies. Multiple parameter dependencies can be
-specified from within one or more YAML, JSON, TOML, or Python parameters files
-(e.g. `params.yaml`). This allows tracking experimental hyperparameters easily.
+[Parameters][param-deps] (`-p`/`--params` option) are a special type of
+key/value dependency. Multiple params can be specified from within one or more
+structured files (`params.yaml` by default). This makes it easy to track
+experiment hyperparameters in ML projects.
Special types of output files, [metrics](/doc/command-reference/metrics) (`-m`
and `-M` options) and [plots](/doc/command-reference/plots) (`--plots` and
@@ -135,26 +122,8 @@ and `-M` options) and [plots](/doc/command-reference/plots) (`--plots` and
specific formats (JSON, YAML, CSV, or TSV) and allow displaying and comparing
data science experiments.
-### The `command` argument
-
-The `command` sent to `dvc run` can be anything your terminal would accept and
-run directly, for example a shell built-in, expression, or binary found in
-`PATH`. Please remember that any flags sent after the `command` are interpreted
-by the command itself, not by `dvc run`.
-
-⚠️ While DVC is platform-agnostic, the commands defined in your
-[pipeline](/doc/command-reference/dag) stages may only work on some operating
-systems and require certain software packages to be installed.
-
-Wrap the command with double quotes `"` if there are special characters in it
-like `|` (pipe) or `<`, `>` (redirection), otherwise they would apply to
-`dvc run` itself. Use single quotes `'` instead if there are environment
-variables in it that should be evaluated dynamically. Examples:
-
-```dvc
-$ dvc run -n first_stage "./a_script.sh > /dev/null 2>&1"
-$ dvc run -n second_stage './another_script.sh $MYENVVAR'
-```
+[param-deps]:
+ /doc/user-guide/pipelines/defining-pipelines#parameter-dependencies
## Options
@@ -199,12 +168,12 @@ $ dvc run -n second_stage './another_script.sh $MYENVVAR'
makes the stage incompatible with `dvc repro`. Implies `--no-exec`.
- `-p [:]`, `--params [:]` - specify one
- or more [parameter dependencies](/doc/command-reference/params) from a
- parameters file `path` (`./params.yaml` by default). This is done by sending a
- comma separated list (`params_list`) as argument, e.g.
- `-p learning_rate,epochs`. A custom params file can be defined with a prefix,
- e.g. `-p params.json:threshold`. Or use the prefix alone with `:` to use all
- the parameters found in that file, e.g. `-p myparams.toml:`.
+ or more [parameter dependencies][param-deps] from a structured file `path`
+ (`./params.yaml` by default). This is done by sending a comma separated list
+ (`params_list`) as argument, e.g. `-p learning_rate,epochs`. A custom params
+ file can be defined with a prefix, e.g. `-p params.json:threshold`. Or use the
+ prefix alone with `:` to use all the parameters found in that file, e.g.
+ `-p myparams.toml:`.
- `-m `, `--metrics ` - specify a metrics file produced by this
stage. This option behaves like `-o` but registers the file in a `metrics`
@@ -250,7 +219,7 @@ $ dvc run -n second_stage './another_script.sh $MYENVVAR'
run with the same dependencies and outputs (see the
[details](/doc/user-guide/project-structure/internal-files#run-cache)). Useful
for example if the stage command/s is/are non-deterministic
- ([not recommended](#avoiding-unexpected-behavior)).
+ ([not recommended](/doc/user-guide/data-pipelines/defining-pipelines#avoiding-unexpected-behavior)).
- `--no-commit` - do not store the outputs of this execution in the cache
(`dvc.yaml` and `dvc.lock` are still created or updated); useful to avoid
diff --git a/content/docs/command-reference/stage/add.md b/content/docs/command-reference/stage/add.md
index c1b9c268ac..dcfd76b781 100644
--- a/content/docs/command-reference/stage/add.md
+++ b/content/docs/command-reference/stage/add.md
@@ -27,67 +27,40 @@ update an existing stage, overwrite it with the `-f` (`--force`) option.
A stage name is required and can be provided using the `-n` (`--name`) option.
Most of the other [options](#options) help with defining different kinds of
[dependencies and outputs](#dependencies-and-outputs) for the stage. The
-remaining terminal input provided to `dvc stage add` after `-`/`--` flags will
-become the required [`command` argument](#the-command-argument).
+remaining terminal input provided to `dvc stage add` after any options/flags
+will become the required [`command` argument].
-Stages whose dependencies are outputs from other stages form
-[pipelines](/doc/command-reference/dag). `dvc repro` can be used to rebuild
-their dependency graph, and execute them.
-
-
-
-### 💡 Avoiding unexpected behavior
-
-We don't want to tell anyone how to write their code or what programs to use!
-However, please be aware that in order to prevent unexpected results when DVC
-reproduces pipeline stages, the underlying code should ideally follow these
-rules:
-
-- Read/write exclusively from/to the specified dependencies and
- outputs (including parameters files, metrics, and plots).
-
-- Completely rewrite outputs. Do not append or edit.
+
-- Stop reading and writing files when the `command` exits.
+`-`/`--` flags sent after the `command` become part of the command itself and
+are ignored by `dvc stage add`.
-Also, if your pipeline reproducibility goals include consistent output data, its
-code should be
-[deterministic](https://en.wikipedia.org/wiki/Deterministic_algorithm) (produce
-the same output for any given input): avoid code that increases
-[entropy](https://en.wikipedia.org/wiki/Software_entropy) (e.g. random numbers,
-time functions, hardware dependencies, etc.).
+
-
+Stages whose outputs become dependencies of other stages form
+pipelines. `dvc repro` can be used to rebuild their [dependency
+graph] and execute them.
-### The `command` argument
+
-The `command` sent to `dvc stage add` can be anything your terminal would accept
-and run directly, for example a shell built-in, expression, or binary found in
-`PATH`. Please remember that any flags sent after the `command` are considered
-part of the command itself, not of `dvc stage add`.
+See the guide on [defining pipeline stages] for more details.
-⚠️ While DVC is platform-agnostic, the commands defined in your
-[pipeline](/doc/command-reference/dag) stages may only work on some operating
-systems and require certain software packages to be installed.
+[defining pipeline stages]:
+ /doc/user-guide/data-pipelines/defining-pipelines#pipelines
-Wrap the command with double quotes `"` if there are special characters in it
-like `|` (pipe) or `<`, `>` (redirection), otherwise they would apply to
-`dvc stage add` itself. Use single quotes `'` instead if there are environment
-variables in it that should be evaluated dynamically. Examples:
+
-```cli
-$ dvc stage add -n first_stage "./a_script.sh > /dev/null 2>&1"
-$ dvc stage add -n second_stage './another_script.sh $MYENVVAR'
-```
+[`command` argument]:
+ /doc/user-guide/project-structure/dvcyaml-files#stage-commands
### Dependencies and outputs
By specifying lists of dependencies (`-d` option) and/or
outputs (`-o` and `-O` options) for each stage, we can create a
-_dependency graph_ ([DAG](https://en.wikipedia.org/wiki/Directed_acyclic_graph))
-that connects them, i.e. the output of a stage becomes the input of another, and
-so on (see `dvc dag`). This graph can be restored by DVC later to modify or
-[reproduce](/doc/command-reference/repro) the full pipeline. For example:
+[dependency graph] that connects them, i.e. the output of a stage becomes the
+input of another, and so on (see `dvc dag`). This graph can be restored by DVC
+later to modify or [reproduce](/doc/command-reference/repro) the full pipeline.
+For example:
```cli
$ dvc stage add -n printer -d write.sh -o pages ./write.sh
@@ -138,12 +111,14 @@ Relevant notes:
[manual process](/doc/command-reference/move#renaming-stage-outputs) to update
`dvc.yaml` and the project's cache accordingly.
+[dependency graph]: /doc/user-guide/data-pipelines/defining-pipelines
+
### For displaying and comparing data science experiments
-[parameters](/doc/command-reference/params) (`-p`/`--params` option) are a
-special type of key/value dependencies. Multiple parameter dependencies can be
-specified from within one or more YAML, JSON, TOML, or Python parameters files
-(e.g. `params.yaml`). This allows tracking experimental hyperparameters easily.
+[Parameters][param-deps] (`-p`/`--params` option) are a special type of
+key/value dependency. Multiple params can be specified from within one or more
+structured files (`params.yaml` by default). This makes it easy to track
+experiment hyperparameters in ML projects.
Special types of output files, [metrics](/doc/command-reference/metrics) (`-m`
and `-M` options) and [plots](/doc/command-reference/plots) (`--plots` and
@@ -151,6 +126,9 @@ and `-M` options) and [plots](/doc/command-reference/plots) (`--plots` and
specific formats (JSON, YAML, CSV, or TSV) and allow displaying and comparing
data science experiments.
+[param-deps]:
+ /doc/user-guide/pipelines/defining-pipelines#parameter-dependencies
+
## Options
- `-n `, `--name ` (**required**) - specify a name for the stage
@@ -194,12 +172,12 @@ data science experiments.
makes the stage incompatible with `dvc repro`.
- `-p [:]`, `--params [:]` - specify one
- or more [parameter dependencies](/doc/command-reference/params) from a
- parameters file `path` (`./params.yaml` by default). This is done by sending a
- comma separated list (`params_list`) as argument, e.g.
- `-p learning_rate,epochs`. A custom params file can be defined with a prefix,
- e.g. `-p params.json:threshold`. Or use the prefix alone with `:` to use all
- the parameters found in that file, e.g. `-p myparams.toml:`.
+ or more [parameter dependencies][param-deps] from a structured file `path`
+ (`./params.yaml` by default). This is done by sending a comma separated list
+ (`params_list`) as argument, e.g. `-p learning_rate,epochs`. A custom params
+ file can be defined with a prefix, e.g. `-p params.json:threshold`. Or use the
+ prefix alone with `:` to use all the parameters found in that file, e.g.
+ `-p myparams.toml:`.
- `-m `, `--metrics ` - specify a metrics file produced by this
stage. This option behaves like `-o` but registers the file in a `metrics`
diff --git a/content/docs/command-reference/stage/index.md b/content/docs/command-reference/stage/index.md
index e870b6f33c..1bb7c73939 100644
--- a/content/docs/command-reference/stage/index.md
+++ b/content/docs/command-reference/stage/index.md
@@ -24,3 +24,6 @@ organize data science projects, or build detailed machine learning pipelines.
`dvc stage add` can be used to create/update stages in the `dvc.yaml` file. Use
`dvc stage list` or `dvc dag` to discover existing stages without having to
examine `dvc.yaml` files manually.
+
+Learn more about
+[defining stages](/doc/user-guide/data-pipelines/defining-pipelines#stages).
diff --git a/content/docs/sidebar.json b/content/docs/sidebar.json
index 47ef755f6e..a582677b7d 100644
--- a/content/docs/sidebar.json
+++ b/content/docs/sidebar.json
@@ -149,6 +149,10 @@
"checkpoints"
]
},
+ {
+ "slug": "pipelines",
+ "children": ["defining-pipelines"]
+ },
{
"slug": "how-to",
"source": false,
diff --git a/content/docs/start/data-management/metrics-parameters-plots.md b/content/docs/start/data-management/metrics-parameters-plots.md
index 8024534801..e8a7006f9e 100644
--- a/content/docs/start/data-management/metrics-parameters-plots.md
+++ b/content/docs/start/data-management/metrics-parameters-plots.md
@@ -192,7 +192,7 @@ featurize:
### ⚙️ Expand to recall how it was generated.
The `featurize` stage
-[was created](/doc/start/data-pipelines#dependency-graphs-dags) with this
+[was created](/doc/start/data-pipelines#dependency-graphs-dag) with this
`dvc run` command. Notice the argument sent to the `-p` option (short for
`--params`):
diff --git a/content/docs/start/data-management/pipelines.md b/content/docs/start/data-management/pipelines.md
index 6b14b459cf..d61177df46 100644
--- a/content/docs/start/data-management/pipelines.md
+++ b/content/docs/start/data-management/pipelines.md
@@ -151,13 +151,12 @@ along with `git commit` to version DVC metafiles).
[to remote storage]: /doc/start/data-and-model-versioning#storing-and-sharing
-## Dependency graphs (DAGs)
+## Dependency graphs (DAG)
-By using `dvc stage add` multiple times, and specifying outputs of
-a stage as dependencies of another one, we can describe a sequence
-of commands which gets to a desired result. This is what we call a _data
-pipeline_ or
-[_dependency graph_](https://en.wikipedia.org/wiki/Directed_acyclic_graph).
+By running `dvc stage add` multiple times, and specifying outputs of
+a stage as dependencies of another, we can describe a sequence of
+commands that gets to a desired result. This is what we call a _data pipeline_ or
+dependency graph (a [DAG]).
Let's create a second stage chained to the outputs of `prepare`, to perform
feature extraction:
@@ -172,6 +171,8 @@ $ dvc stage add -n featurize \
The `dvc.yaml` file is updated automatically and should include two stages now.
+[dag]: /doc/user-guide/data-pipelines/defining-pipelines
+
### 💡 Expand to see what happens under the hood.
@@ -275,8 +276,8 @@ it also doesn't rerun `train`! The previous run with the same set of inputs
### 💡 Expand to see what happens under the hood.
-`dvc repro` relies on the DAG definition from `dvc.yaml`, and uses
-`dvc.lock` to determine what exactly needs to be run.
+`dvc repro` relies on the [DAG] defined in `dvc.yaml`, and uses `dvc.lock` to
+determine what exactly needs to be run.
The `dvc.lock` file is similar to a `.dvc` file — it captures hashes (in most
cases `md5`s) of the dependencies and values of the parameters that were used.
diff --git a/content/docs/user-guide/basic-concepts/dependency.md b/content/docs/user-guide/basic-concepts/dependency.md
index 453fda173f..29c82a592a 100644
--- a/content/docs/user-guide/basic-concepts/dependency.md
+++ b/content/docs/user-guide/basic-concepts/dependency.md
@@ -2,7 +2,9 @@
name: Dependency
match: [dependency, dependencies, depends, input]
tooltip: >-
- A file or directory (possibly tracked by DVC) recorded in the `deps` section
- of a stage (in `dvc.yaml`) or `.dvc` file file. See `dvc run`. Stages are
- invalidated (considered outdated) when any of their dependencies change.
+ A file (e.g. data, code), directory (e.g. datasets), or parameter used as
+ input for a stage in a DVC pipeline. These are specified as paths in the
+ `deps` field of `dvc.yaml` or `.dvc` files. Stages are invalidated (considered
+ outdated) when any of their dependencies change. See `dvc stage add`, `dvc
+ params`, `dvc repro`.
---
diff --git a/content/docs/user-guide/basic-concepts/output.md b/content/docs/user-guide/basic-concepts/output.md
index d6d53a4821..59ced07a0c 100644
--- a/content/docs/user-guide/basic-concepts/output.md
+++ b/content/docs/user-guide/basic-concepts/output.md
@@ -4,5 +4,5 @@ match: [output, outputs]
tooltip: >-
A file or directory tracked by DVC, recorded in the `outs` section of a stage
(in `dvc.yaml`) or `.dvc` file. Outputs are usually the result of stages. See
- `dvc add`, `dvc run`, `dvc import`, among others.
+ `dvc add`, `dvc repro`, `dvc import`, among others.
---
diff --git a/content/docs/user-guide/basic-concepts/pipeline.md b/content/docs/user-guide/basic-concepts/pipeline.md
index f66383479f..8c58710ed5 100644
--- a/content/docs/user-guide/basic-concepts/pipeline.md
+++ b/content/docs/user-guide/basic-concepts/pipeline.md
@@ -1,7 +1,11 @@
---
-name: Pipeline (DAG)
-match: [DAG, pipeline, 'data pipeline', 'data pipelines']
+name: Pipeline
+match: [pipeline, pipelines, 'data pipeline', 'data pipelines']
tooltip: >-
- A set of inter-dependent stages. This is also called a [dependency
- graph](https://en.wikipedia.org/wiki/Directed_acyclic_graph).
+ DVC pipelines describe data processing workflows in a standard declarative
+ YAML format ([`dvc.yaml`](/doc/user-guide/project-structure/dvcyaml-files)).
+ This guarantees DVC can reproduce them consistently. DVC also helps automate
+ their execution and caches their results. See [Defining
+ Pipelines](/doc/user-guide/data-pipelines/defining-pipelines) for more
+ details.
---
diff --git a/content/docs/user-guide/basic-concepts/stage.md b/content/docs/user-guide/basic-concepts/stage.md
index e73027b460..1a5292a367 100644
--- a/content/docs/user-guide/basic-concepts/stage.md
+++ b/content/docs/user-guide/basic-concepts/stage.md
@@ -2,7 +2,9 @@
name: Stage
match: [stage, stages]
tooltip: >-
- A stage represents individual data processes, including their input and
- resulting output which can be combined to build detailed machine learning
- pipelines.
+ A stage represents an individual command, script, or piece of source code
+ that reaches some milestone in your project's workflow. For example, `python
+ train.py` may generate a machine learning model. DVC stages include data
+ input(s) and resulting output(s), if any. [Learn
+ more](/doc/user-guide/data-pipelines/defining-pipelines#stages).
---
diff --git a/content/docs/user-guide/experiment-management/running-experiments.md b/content/docs/user-guide/experiment-management/running-experiments.md
index 6c52899ec7..8208cffc1d 100644
--- a/content/docs/user-guide/experiment-management/running-experiments.md
+++ b/content/docs/user-guide/experiment-management/running-experiments.md
@@ -45,7 +45,8 @@ once.
> 📖 `dvc exp run` is an experiment-specific alternative to `dvc repro`.
[reproduction targets]: /doc/command-reference/repro#options
-[dependency graph]: /doc/command-reference/dag#directed-acyclic-graph
+[dependency graph]:
+ /doc/user-guide/data-pipelines/defining-pipelines#directed-acyclic-graph-dag
## Tuning (hyper)parameters
diff --git a/content/docs/user-guide/pipelines/defining-pipelines.md b/content/docs/user-guide/pipelines/defining-pipelines.md
new file mode 100644
index 0000000000..003f87b978
--- /dev/null
+++ b/content/docs/user-guide/pipelines/defining-pipelines.md
@@ -0,0 +1,260 @@
+# Defining Pipelines
+
+Pipelines represent data workflows that you want to **reproduce** reliably -- so
+the results are consistent. A typical pipelining process involves these steps:
+
+- Obtain and `dvc add` or `dvc import` the project's initial data requirements
+ (see [Data Management]). This caches the data and generates
+ `.dvc` files.
+
+- Define the pipeline [stages](#stages) in `dvc.yaml` files (more on this
+ later). Example structure:
+
+ ```yaml
+ stages:
+ prepare: ... # stage 1 definition
+ train: ... # stage 2 definition
+ evaluate: ... # stage 3 definition
+ ```
+
+- Capture other useful metadata such as runtime
+ [parameters](#parameter-dependencies), performance [metrics], and [plots] to
+ visualize. DVC supports multiple file formats for these.
+
+
+
+We call this file-based definition _codification_ (YAML format in our case). It
+has the added benefit of allowing you to develop pipelines on standard Git
+workflows ([GitOps]).
+
+[gitops]: /doc/use-cases/versioning-data-and-model-files
+
+
+
+Stages usually take some data and run some code, producing an output (e.g. an ML
+model). The pipeline is formed by making them interdependent, meaning that the
+output of a stage becomes the input of another, and so on. Technically, this is
+called a _dependency graph_ (DAG).
+
+Note that while each pipeline is a graph, it is not necessarily defined in a
+single `dvc.yaml` file. DVC checks the entire project tree and validates all
+such files to find stages, rebuilding all the pipelines that these may define.
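+
+As a minimal sketch of this (the `data/dvc.yaml` location and its contents are
+hypothetical, and paths are assumed to be relative to each file's location), a
+stage in one file could produce the data that a stage in another file depends
+on:
+
+```yaml
+# data/dvc.yaml (hypothetical)
+stages:
+  prepare:
+    cmd: source ../src/cleanup.sh
+    deps:
+      - ../src/cleanup.sh
+      - raw
+    outs:
+      - clean.csv
+```
+
+```yaml
+# dvc.yaml at the project root (hypothetical)
+stages:
+  train:
+    cmd: python src/model.py data/clean.csv
+    deps:
+      - src/model.py
+      - data/clean.csv
+    outs:
+      - data/predict.dat
+```
+
+Under these assumptions, both stages would belong to the same pipeline even
+though they are defined in separate files.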
+
+[data management]: /doc/start/data-management
+[metrics]: /doc/command-reference/metrics
+[plots]: /doc/user-guide/visualizing-plots
+
+
+
+## Directed Acyclic Graph (DAG)
+
+DVC represents a pipeline internally as a _graph_ where the nodes are stages and
+the edges are _directed_ dependencies (e.g. A before B). And in order for DVC to
+run a pipeline, its topology should be _acyclic_ -- because executing cycles
+(e.g. A -> B -> C -> A ...) would continue indefinitely. [More about DAGs].
+
+Use `dvc dag` to visualize (or export) them.
+
+[more about dags]: https://en.wikipedia.org/wiki/Directed_acyclic_graph
+
+
+
+## Stages
+
+
+
+See the full [specification] of stage entries.
+
+[specification]: /doc/user-guide/project-structure/dvcyaml-files#stage-entries
+
+
+
+Each stage wraps around an executable shell [command] and specifies any
+file-based [dependencies](#simple-dependencies) as well as [outputs](#outputs).
+Let's look at a sample stage: it depends on a script file it runs as well as on
+a raw data input (ideally [tracked by DVC][data management] already):
+
+```yaml
+stages:
+  prepare:
+    cmd: source src/cleanup.sh
+    deps:
+      - src/cleanup.sh
+      - data/raw
+    outs:
+      - data/clean.csv
+```
+
+
+
+We use [GNU/Linux](https://www.gnu.org/software/software.html) in these
+examples, but Windows or other shells can be used too.
+
+
+
+Besides writing `dvc.yaml` files manually (recommended), you can also create
+stages with `dvc stage add` -- a limited command-line interface to set up
+pipelines. Let's add another stage this way and look at the resulting
+`dvc.yaml`:
+
+```dvc
+$ dvc stage add --name train \
+ --deps src/model.py \
+ --deps data/clean.csv \
+ --outs data/predict.dat \
+ python src/model.py data/clean.csv
+```
+
+```yaml
+stages:
+  prepare:
+    ...
+    outs:
+      - data/clean.csv
+  train:
+    cmd: python src/model.py data/clean.csv
+    deps:
+      - src/model.py
+      - data/clean.csv
+    outs:
+      - data/predict.dat
+```
+
+
+
+One advantage of using `dvc stage add` is that it will verify the validity of
+the arguments provided (otherwise stage definition won't be checked until
+execution). A disadvantage is that some advanced features such as [templating]
+are not available this way.
+
+[command]: /doc/user-guide/project-structure/dvcyaml-files#stage-commands
+[templating]: /doc/user-guide/project-structure/dvcyaml-files#templating
+
+
+
+Notice that the new `train` stage depends on the output from stage `prepare`
+(`data/clean.csv`), forming the pipeline ([DAG](#directed-acyclic-graph-dag)).
+
+
+
+Stage execution sequences will be determined entirely by the DAG, not by the
+order in which stages are found in `dvc.yaml`.
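+
+As a sketch of this point, the two stages from the example above could be
+listed in reverse order (a hypothetical rearrangement) and `prepare` would
+still be expected to run first, because `train` depends on its output:
+
+```yaml
+stages:
+  train: # listed first, but depends on data/clean.csv
+    cmd: python src/model.py data/clean.csv
+    deps:
+      - src/model.py
+      - data/clean.csv
+    outs:
+      - data/predict.dat
+  prepare: # listed last, but produces data/clean.csv
+    cmd: source src/cleanup.sh
+    deps:
+      - src/cleanup.sh
+      - data/raw
+    outs:
+      - data/clean.csv
+```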
+
+
+
+## Simple dependencies
+
+There's more than one type of stage dependency. A simple dependency is a file or
+directory used as input by the stage command. When its contents have changed,
+DVC "invalidates" the stage -- it knows that it needs to run again (see
+`dvc status`). This in turn may cause a chain reaction in which subsequent
+stages of the pipeline are also reproduced.
+
+
+
+DVC [calculates a hash] of file/dir contents to compare vs. previous versions.
+This distinguishes it from traditional build tools like `make`, which rely on
+file timestamps.
+
+[calculates a hash]:
+ /doc/user-guide/project-structure/internal-files#structure-of-the-cache-directory
+
+
+
+File system-level dependencies are defined in the `deps` field of `dvc.yaml`
+stages, or alternatively with the `--deps` (`-d`) option of `dvc stage add`
+(see the previous section's example).
+
+
+
+### External dependencies: click to learn more.
+
+A less common kind of dependency is a _URL dependency_. Instead of files on a
+local disk, you can `dvc import` data from another DVC project (for
+example hosted on GitHub). External dependencies establish relationships between
+different projects or systems (see `dvc import-url`).
+[Get all the details](/doc/user-guide/external-dependencies).
+
+
+
+DVC will use special methods to check whether the contents of a URL have
+changed for the purpose of stage invalidation.
+
+
+
+
+
+## Parameter dependencies
+
+A more granular type of dependency is the parameter (`params` field of
+`dvc.yaml`), or _hyperparameters_ in machine learning. These represent simple
+values used inside your code to tune data processing, or that affect stage
+execution in any other way. For example, training a [Neural Network] usually
+requires _batch size_ and _epoch_ values.
+
+Instead of hard-coding param values, your code can read them from a structured
+file (e.g. YAML format). DVC can track any key/value pair in a supported
+[parameters file] (`params.yaml` by default). Params are granular dependencies
+because DVC only invalidates stages when the corresponding part of the params
+file has changed.
+
+```yaml
+stages:
+  train:
+    cmd: ...
+    deps: ...
+    params: # from params.yaml
+      - learning_rate
+      - nn.epochs
+      - nn.batch_size
+    outs: ...
+```
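+
+For reference, a `params.yaml` file that these entries could point to might
+look like the sketch below. The values are hypothetical; note that dotted
+names such as `nn.epochs` are assumed to refer to nested keys:
+
+```yaml
+learning_rate: 0.001
+nn:
+  epochs: 20
+  batch_size: 32
+```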
+
+
+
+See [more details] about this syntax.
+
+
+
+Use `dvc params diff` to compare parameters across project versions.
+
+[parameters file]:
+ /doc/user-guide/project-structure/dvcyaml-files#parameters-files
+[neural network]:
+ https://machinelearningmastery.com/difference-between-a-batch-and-an-epoch/
+[more details]: /doc/user-guide/project-structure/dvcyaml-files#parameters
+
+## Outputs
+
+Stage outputs are files (or directories) written by pipelines, for
+example machine learning models, intermediate artifacts, as well as data [plots]
+and performance [metrics]. These files are cached by DVC
+automatically, and tracked with the help of `dvc.lock` files.
+
+Outputs can be dependencies of subsequent stages (as explained earlier). So when
+they change, DVC may need to reproduce downstream stages as well (handled
+automatically).
+
+The types of outputs are:
+
+- Files and directories: Typically data to feed to intermediate stages, as well
+ as the final results of a pipeline (e.g. a dataset or an ML model).
+
+- [Metrics]: DVC supports small text files that usually contain model
+ performance metrics from the evaluation, validation, or testing phases of the
+ ML lifecycle. DVC lets you compare produced metrics with one another using
+ `dvc metrics diff` and presents the results as a table with `dvc metrics show`
+ or `dvc exp show`.
+
+- [Plots]: Different kinds of data that can be visually graphed. For example,
+ you can contrast ML performance statistics or continuous metrics from multiple
+ experiments. `dvc plots show` can generate charts for certain data files or
+ render custom image files for you, or you can compare different ones with
+ `dvc plots diff`.
+
+
+
+Outputs are produced by [stage commands][command]. DVC does not make any
+assumption regarding this process; the files written should just match the
+paths specified in `dvc.yaml`.
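+
+To illustrate how the output types above might be declared together, here is a
+hedged sketch of an `evaluate` stage. The script and file names
+(`src/evaluate.py`, `metrics.json`, `plots/roc.csv`) are hypothetical, and the
+`metrics` and `plots` fields are assumed to sit alongside `outs` in the stage:
+
+```yaml
+stages:
+  evaluate:
+    cmd: python src/evaluate.py data/predict.dat
+    deps:
+      - src/evaluate.py
+      - data/predict.dat
+    metrics:
+      - metrics.json # small file with model performance values
+    plots:
+      - plots/roc.csv # data that can be rendered as a chart
+```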
+
+
diff --git a/content/docs/user-guide/pipelines/index.md b/content/docs/user-guide/pipelines/index.md
new file mode 100644
index 0000000000..5a9f96a823
--- /dev/null
+++ b/content/docs/user-guide/pipelines/index.md
@@ -0,0 +1,19 @@
+# Pipelines
+
+If you find yourself repeating a sequence of actions to get or update the
+results of your project, then you may already have a pipeline. For example, a
+data science workflow could involve:
+
+1. Gathering data for training and validation
+2. Extracting useful features from the training dataset
+3. (Re)training an ML model
+4. Evaluating the results against the validation set
+
+DVC helps you [define] these stages in a standard YAML format (`.dvc` and
+`dvc.yaml` files), making your pipeline more manageable and
+consistent to reproduce.
+
+See [Get Started: Data Pipelines](/doc/start/data-management/pipelines) for a
+hands-on introduction to this topic.
+
+[define]: /doc/user-guide/data-pipelines/defining-pipelines
diff --git a/content/docs/user-guide/project-structure/dvcyaml-files.md b/content/docs/user-guide/project-structure/dvcyaml-files.md
index ad975873d8..e6194ed0dd 100644
--- a/content/docs/user-guide/project-structure/dvcyaml-files.md
+++ b/content/docs/user-guide/project-structure/dvcyaml-files.md
@@ -20,8 +20,8 @@ so you may modify, write, or generate stages and pipelines on your own.
-We use [GNU/Linux](https://www.gnu.org/software/software.html) in most of our
-examples.
+We use [GNU/Linux](https://www.gnu.org/software/software.html) in these
+examples, but Windows or other shells can be used too.
@@ -40,14 +40,19 @@ stages:
- columns.txt
```
-> See also `dvc stage add`, a helper command to write stages in `dvc.yaml`.
+
+
+See also `dvc stage add`, a helper command to write stages in `dvc.yaml`.
+
+
The most important part of a stage is the terminal command(s) it executes (`cmd`
field). This is what DVC runs when the stage is reproduced (see `dvc repro`).
-If a command reads input files, these (or their directory locations) can be
-defined as dependencies (`deps`). DVC will check whether they have
-changed to decide whether the stage requires re-execution (see `dvc status`).
+If a [stage command](#stage-commands) reads input files, these (or their
+directory locations) can be defined as dependencies (`deps`). DVC
+will check whether they have changed to decide whether the stage requires
+re-execution (see `dvc status`).
If it writes files or dirs, they can be defined as outputs
(`outs`). DVC will track them going forward (similar to using `dvc add`).
@@ -64,6 +69,24 @@ See the full stage entry [specification](#stage-entries).
+### Stage commands
+
+The command(s) defined in the `stages` (`cmd` field) can be anything your system
+terminal would accept and run, for example a shell built-in, an expression, or a
+binary found in `PATH`.
+
+Surround the command with double quotes `"` if it includes special characters
+like `|` or `<`, `>`. Use single quotes `'` instead if there are environment
+variables in it that should be evaluated dynamically.
+
+The same applies to the `command` argument for helper commands (`dvc stage add`,
+`dvc exp init`), otherwise the shell would apply these characters to the DVC
+call itself:
+
+```cli
+$ dvc stage add -n a_stage "./a_script.sh > /dev/null 2>&1"
+$ dvc exp init './another_script.sh $MYENVVAR'
+```
+
### Parameter dependencies
[Parameters](/doc/command-reference/params) are a special type of stage
@@ -104,9 +127,9 @@ This allows several stages to depend on values of a shared structured file
### Metrics and Plots outputs
-Like [common outputs](#outputs), metrics and plots
-files are produced by the stage `cmd`. However, their purpose is different.
-Typically they contain metadata to evaluate pipeline processes. Example:
+Like common output files, metrics and plots files are
+produced by the stage `cmd`. However, their purpose is different. Typically they
+contain metadata to evaluate pipeline processes. Example:
```yaml
stages:
@@ -432,7 +455,7 @@ These are the fields that are accepted in each stage:
| Field | Description |
| ---------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `cmd` | (Required) One or more commands executed by the stage (may contain either a single value or a list). Commands are executed sequentially until all are finished or until one of them fails (see `dvc repro`). |
+| `cmd` | (Required) One or more commands executed by the stage (may contain either a single value or a list). [Learn more](#stage-commands). |
| `wdir` | Working directory for the stage command to run in (relative to the file's location). Any paths in other fields are also based on this. It defaults to `.` (the file's location). |
| `deps` | List of dependency paths of this stage (relative to `wdir`). |
| `outs` | List of stage output paths (relative to `wdir`). These can contain optional [subfields](#output-subfields). |
@@ -457,6 +480,14 @@ validation and auto-completion.
> See also
> [How to Merge Conflicts](/doc/user-guide/how-to/merge-conflicts#dvcyaml).
+
+
+While DVC is platform-agnostic, commands defined in `dvc.yaml` (`cmd` field) may
+only work on some operating systems and require certain software packages or
+libraries in the environment.
+
+
+
### Output subfields
> These include a subset of the fields in `.dvc` file
diff --git a/content/docs/user-guide/project-structure/internal-files.md b/content/docs/user-guide/project-structure/internal-files.md
index 9d0edd57aa..f8b705ada2 100644
--- a/content/docs/user-guide/project-structure/internal-files.md
+++ b/content/docs/user-guide/project-structure/internal-files.md
@@ -165,4 +165,7 @@ run.
run-cache to remote storage for sharing and/or as a back up.
> Note that the run-cache assumes that stage commands are deterministic (see
-> **Avoiding unexpected behavior** in `dvc run`).
+> [Avoiding unexpected behavior]).
+
+[avoiding unexpected behavior]:
+ /doc/user-guide/data-pipelines/defining-pipelines#avoiding-unexpected-behavior
diff --git a/content/docs/user-guide/related-technologies.md b/content/docs/user-guide/related-technologies.md
index a6137ed92f..b6d0967004 100644
--- a/content/docs/user-guide/related-technologies.md
+++ b/content/docs/user-guide/related-technologies.md
@@ -63,8 +63,7 @@ bringing best practices from software engineering into the data science field
## Workflow management systems
-Pipelines and dependency graphs
-([DAG](https://en.wikipedia.org/wiki/Directed_acyclic_graph)) such as _Airflow_,
+Systems to manage data pipelines and [dependency graphs] such as _Airflow_,
_Luigi_, etc.
- DVC is focused on data science and modeling. As a result, DVC pipelines are
@@ -79,6 +78,8 @@ _Luigi_, etc.
- See also our sister project, [CML](https://cml.dev/), that helps fill some of
these gaps.
+[dependency graphs]: /doc/user-guide/data-pipelines/defining-pipelines
+
## Experiment management software
> See also the [Experiment Management](/doc/user-guide/experiment-management)
@@ -111,11 +112,9 @@ _Luigi_, etc.
avoid recomputing all dependency file hashes, which would be highly
problematic when working with large files (multiple GB).
-- DVC utilizes a
- [directed acyclic graph](https://en.wikipedia.org/wiki/Directed_acyclic_graph)
- (DAG):
+- DVC utilizes a [Directed Acyclic Graph] (DAG):
- - The DAG or dependency graph is defined implicitly by the connections between
+ - The dependency graph is defined implicitly by the connections between
[stages](/doc/command-reference/run), based on their
dependencies and outputs.
@@ -132,3 +131,6 @@ _Luigi_, etc.
> actual file contents. See **Linking files** in
> [this doc](http://www.tldp.org/LDP/intro-linux/html/sect_03_03.html) for
> technical details (Linux).
+
+[directed acyclic graph]:
+ /doc/user-guide/data-pipelines/defining-pipelines#directed-acyclic-graph-dag
diff --git a/content/docs/user-guide/what-is-dvc.md b/content/docs/user-guide/what-is-dvc.md
index 6c960dbb59..e695443e7a 100644
--- a/content/docs/user-guide/what-is-dvc.md
+++ b/content/docs/user-guide/what-is-dvc.md
@@ -33,8 +33,8 @@ can version experiments, manage large datasets, and make projects reproducible.
transfer large datasets or share a GPU-trained model with others.
- DVC makes data science projects **reproducible** by creating lightweight
- [pipelines](/doc/command-reference/dag) using implicit dependency graphs, and
- by codifying the data and artifacts involved.
+ [pipelines] using implicit dependency graphs, and by codifying the data and
+ artifacts involved.
- DVC is **platform agnostic**: It runs on all major operating systems (Linux,
macOS, and Windows), and works independently of the programming languages
@@ -51,6 +51,7 @@ can version experiments, manage large datasets, and make projects reproducible.
[free]: https://github.com/iterative/dvc/blob/master/LICENSE
[vs code extension]: /doc/vs-code-extension
[command line]: /doc/command-reference
+[pipelines]: /doc/user-guide/data-pipelines
## DVC does not replace Git!