diff --git a/content/docs/command-reference/dag.md b/content/docs/command-reference/dag.md index 43ff3d86a7..8c663f788f 100644 --- a/content/docs/command-reference/dag.md +++ b/content/docs/command-reference/dag.md @@ -1,7 +1,7 @@ # dag -Visualize the pipeline(s) in `dvc.yaml` as one or more graph(s) of -connected [stages](/doc/command-reference/run). +Visualize pipelines as one or more stage dependency +graphs. ## Synopsis @@ -17,28 +17,15 @@ positional arguments: ## Description -Displays the stages of a pipeline up to the `target` stage. If the `target` is -omitted, it will show the full project DAG. +DVC represents a pipeline internally as a **Directed Acyclic Graph** (DAG) where +the nodes are stages and the edges are dependencies. -### Directed acyclic graph +`dvc dag` displays this dependency graph in one or more pipelines, as defined in +the `dvc.yaml` files found in the project. Provide a `target` stage +name to show the pipeline up to that point. -A data pipeline, in general, is a series of data processing -[stages](/doc/command-reference/run) (for example, console commands that take an -input and produce an outcome). The connections between stages are formed by the -output of one turning into the dependency of another. -A pipeline may produce intermediate data, and has a final result. - -Data science and machine learning pipelines typically start with large raw -datasets, include intermediate featurization and training stages, and produce a -final model, as well as accuracy [metrics](/doc/command-reference/metrics). - -In DVC, pipeline stages and commands, their data I/O, interdependencies, and -results (intermediate or final) are specified in `dvc.yaml`, which can be -written manually or built using the helper command `dvc stage add`. This allows -DVC to restore one or more pipelines later (see `dvc repro`). - -> DVC builds a dependency graph -> ([DAG](https://en.wikipedia.org/wiki/Directed_acyclic_graph)) to do this. 
+[directed acyclic graph]: + /doc/user-guide/data-pipelines/defining-pipelines#directed-acyclic-graph-dag ### Paginating the output diff --git a/content/docs/command-reference/exp/init.md b/content/docs/command-reference/exp/init.md index dd5f323b09..5794a9b50e 100644 --- a/content/docs/command-reference/exp/init.md +++ b/content/docs/command-reference/exp/init.md @@ -14,21 +14,38 @@ usage: dvc exp init [-h] [-q | -v] [--run] [--interactive] [-f] [--metrics METRICS] [--plots PLOTS] [--live LIVE] [--type {default,checkpoint}] [command] + +positional arguments: + command Shell command to run the experiment(s) ``` ## Description This command helps you get started with DVC Experiments quickly. It reduces -repetitive DVC procedures by creating a necessary `dvc.yaml` file, which assumes -standard locations of your inputs (data, parameters, and source -code) and outputs (models, metrics, and +repetitive DVC procedures by creating a `dvc.yaml` file. It assumes standard +locations of your inputs (data, parameters, and source code) and +outputs (models, metrics, and [plots](/doc/command-reference/plots)). -These locations can be customized through the [command options](#options) or via -[configuration](/doc/command-reference/config#exp). Default project structure: +The only required argument is a [shell `command`] to run your experiment(s). It +can be provided directly as an argument (see example below) or by using the +`--interactive` (`-i`) mode, which will prompt for it. + +```cli +$ dvc exp init "python src/train.py" +Creating dependencies: src, data and params.yaml +Creating output directories: plots and models +Creating train stage in dvc.yaml +``` + +`dvc exp init` also generates the boilerplate project structure, including input +files/directories and directories needed for future outputs. These locations can +also be customized via [CLI options](#options) or interactive mode, or with +[configuration](/doc/command-reference/config#exp). 
Default structure: ``` ├── data/ +├── dvc.yaml ├── metrics.json ├── models/ ├── params.yaml @@ -36,55 +53,51 @@ These locations can be customized through the [command options](#options) or via └── src/ ``` -The only required argument is the terminal command that runs your experiment(s). -It can be provided directly [as an argument](#the-command-argument) or by using -the `--interactive` (`-i`) mode (which will prompt for it). The command will be -wrapped as a stage that `dvc exp run` can execute. +Inside `dvc.yaml`, the experiment is wrapped as a stage that +`dvc exp run` can execute. - +
-A special `--type` of stage is supported (`checkpoint`), which monitors -[checkpoints] during training of ML models. +### Click to see `dvc.yaml` example - +```yaml +stages: + train: + cmd: python src/train.py + deps: + - data + - src + params: + - params.yaml: + outs: + - models + metrics: + - metrics.json: + cache: false + plots: + - plots: + cache: false +``` -`dvc exp init` also generates the boilerplate project structure, including input -files/directories and directories needed for future outputs, or any locations -determined in interactive mode. +
- + -`dvc exp init` is intended as a quick way to start running [DVC Experiments]. -See the `dvc.yaml` specification for more complex data pipelines. +A special `--type` of stage is supported (`checkpoint`), which monitors +[checkpoints] during training of ML models. +📖 `dvc exp init` is intended as a quick way to start running [DVC Experiments]. +See the [Pipelines guide] for more on that topic. + [stage definition]: /doc/user-guide/project-structure/dvcyaml-files#stage-entries +[shell `command`]: + /doc/user-guide/project-structure/dvcyaml-files#stage-commands [checkpoints]: /doc/user-guide/experiment-management/checkpoints [dvc experiments]: /doc/user-guide/experiment-management/experiments-overview - -### The `command` argument - -The command given to `dvc exp init` can be anything your system terminal would -accept and run directly, for example a shell built-in, an expression, or a -binary found in `PATH`. Please note that any flags sent after the `command` -argument will normally become part of that command itself and ignored by -`dvc exp init` (so provide it last). - -⚠️ While DVC is platform-agnostic, commands defined in `dvc.yaml` (`cmd` field) -may only work on some operating systems and require certain software packages or -libraries in the environment. - -Surround the command with double quotes `"` if it includes special characters -like `|` or `<`, `>` -- otherwise they would apply to `dvc exp init` itself. Use -single quotes `'` instead if there are environment variables in it that should -be evaluated dynamically. 
- -```dvc -$ dvc exp init "./a_script.sh > /dev/null 2>&1" -$ dvc exp init './another_script.sh $MYENVVAR' -``` +[pipelines guide]: /doc/user-guide/data-pipelines/defining-pipelines ## Options diff --git a/content/docs/command-reference/move.md b/content/docs/command-reference/move.md index 0d3b15b1a3..2fe4e00d46 100644 --- a/content/docs/command-reference/move.md +++ b/content/docs/command-reference/move.md @@ -87,11 +87,15 @@ model file: $ mv keras.h5 model.h5 ``` -> Note that, often the output of a stage is a dependency in another stage, -> creating a -> [dependency graph](/doc/command-reference/run#dependencies-and-outputs). In -> this case, you may want to also update the `path` in the `deps` field of -> `dvc.yaml`. + + +Often the output of a stage is a dependency in another stage, creating a +[dependency graph]. In this case, you may want to also update the `path` in the +`deps` field of `dvc.yaml`. + +[dependency graph]: /doc/user-guide/data-pipelines/defining-pipelines + + Finally, we run `dvc commit` with the `-f` option to force save the changes to cache: diff --git a/content/docs/command-reference/repro.md b/content/docs/command-reference/repro.md index 327d11f49f..ba4e0b9efa 100644 --- a/content/docs/command-reference/repro.md +++ b/content/docs/command-reference/repro.md @@ -22,10 +22,9 @@ positional arguments: ## Description -Provides a way to regenerate data pipeline results, by restoring the dependency -graph (a [DAG](https://en.wikipedia.org/wiki/Directed_acyclic_graph)) implicitly -defined by the stages listed in `dvc.yaml`. The commands defined in these stages -are then executed in the correct order. +Provides a way to regenerate data pipeline results, by restoring the [dependency +graph] implicitly defined by the stages listed in `dvc.yaml`. The commands +defined in these stages are then executed in the correct order. 
For stages with multiple commands (having a list in the `cmd` field), commands are run one after the other in the order they are defined. The failure of any @@ -69,6 +68,7 @@ It stores all the data files, intermediate or final results in the hash values of changed dependencies and outputs in the `dvc.lock` and `.dvc` files. +[dependency graph]: /doc/user-guide/data-pipelines/defining-pipelines [always changed]: /doc/command-reference/status#local-workspace-status ### Parallel stage execution @@ -163,7 +163,7 @@ up-to-date and only execute the final stage. with the same dependencies and outputs (see the [details](/doc/user-guide/project-structure/internal-files#run-cache)). Useful for example if the stage command/s is/are non-deterministic - ([not recommended](/doc/command-reference/run#avoiding-unexpected-behavior)). + ([not recommended](/doc/user-guide/data-pipelines/defining-pipelines#avoiding-unexpected-behavior)). - `--force-downstream` - in cases like `... -> A (changed) -> B -> C` it will reproduce `A` first and then `B`, even if `B` was previously executed with the diff --git a/content/docs/command-reference/run.md b/content/docs/command-reference/run.md index 5e0146fe3d..8006801088 100644 --- a/content/docs/command-reference/run.md +++ b/content/docs/command-reference/run.md @@ -34,44 +34,29 @@ organize data science projects, or build detailed machine learning pipelines. A stage name is required and can be provided using the `-n` (`--name`) option. The other available [options](#options) are mostly meant to describe different kinds of stage [dependencies and outputs](#dependencies-and-outputs). The -remaining terminal input provided to `dvc run` after `-`/`--` flags will become -the required [`command` argument](#the-command-argument). +remaining terminal input provided to `dvc run` after any options/flags will +become the required [`command` argument]. -`dvc run` executes stage commands, unless the `--no-exec` option is used. - -
- -### 💡 Avoiding unexpected behavior - -We don't want to tell anyone how to write their code or what programs to use! -However, please be aware that in order to prevent unexpected results when DVC -reproduces pipeline stages, the underlying code should ideally follow these -rules: - -- Read/write exclusively from/to the specified dependencies and - outputs (including parameters files, metrics, and plots). + -- Completely rewrite outputs. Do not append or edit. +`-`/`--` flags sent after the `command` become part of the command itself and +are ignored by `dvc run`. -- Stop reading and writing files when the `command` exits. + -Also, if your pipeline reproducibility goals include consistent output data, its -code should be -[deterministic](https://en.wikipedia.org/wiki/Deterministic_algorithm) (produce -the same output for any given input): avoid code that increases -[entropy](https://en.wikipedia.org/wiki/Software_entropy) (e.g. random numbers, -time functions, hardware dependencies, etc.). +`dvc run` executes stage commands, unless the `--no-exec` option is used. -
+[`command` argument]: + /doc/user-guide/project-structure/dvcyaml-files#stage-commands ### Dependencies and outputs By specifying lists of dependencies (`-d` option) and/or outputs (`-o` and `-O` options) for each stage, we can create a -_dependency graph_ ([DAG](https://en.wikipedia.org/wiki/Directed_acyclic_graph)) -that connects them, i.e. the output of a stage becomes the input of another, and -so on (see `dvc dag`). This graph can be restored by DVC later to modify or -[reproduce](/doc/command-reference/repro) the full pipeline. For example: +[dependency graph] that connects them, i.e. the output of a stage becomes the +input of another, and so on (see `dvc dag`). This graph can be restored by DVC +later to modify or [reproduce](/doc/command-reference/repro) the full pipeline. +For example: ```dvc $ dvc run -n printer -d write.sh -o pages ./write.sh @@ -122,12 +107,14 @@ Relevant notes: [manual process](/doc/command-reference/move#renaming-stage-outputs) to update `dvc.yaml` and the project's cache accordingly. +[dependency graph]: /doc/user-guide/data-pipelines/defining-pipelines + ### For displaying and comparing data science experiments -[parameters](/doc/command-reference/params) (`-p`/`--params` option) are a -special type of key/value dependencies. Multiple parameter dependencies can be -specified from within one or more YAML, JSON, TOML, or Python parameters files -(e.g. `params.yaml`). This allows tracking experimental hyperparameters easily. +[parameters][param-deps] (`-p`/`--params` option) are a special type of +key/value dependencies. Multiple params can be specified from within one or more +structured files (`params.yaml` by default). This allows tracking experimental +hyperparameters easily in ML. 
Special types of output files, [metrics](/doc/command-reference/metrics) (`-m` and `-M` options) and [plots](/doc/command-reference/plots) (`--plots` and @@ -135,26 +122,8 @@ and `-M` options) and [plots](/doc/command-reference/plots) (`--plots` and specific formats (JSON, YAML, CSV, or TSV) and allow displaying and comparing data science experiments. -### The `command` argument - -The `command` sent to `dvc run` can be anything your terminal would accept and -run directly, for example a shell built-in, expression, or binary found in -`PATH`. Please remember that any flags sent after the `command` are interpreted -by the command itself, not by `dvc run`. - -⚠️ While DVC is platform-agnostic, the commands defined in your -[pipeline](/doc/command-reference/dag) stages may only work on some operating -systems and require certain software packages to be installed. - -Wrap the command with double quotes `"` if there are special characters in it -like `|` (pipe) or `<`, `>` (redirection), otherwise they would apply to -`dvc run` itself. Use single quotes `'` instead if there are environment -variables in it that should be evaluated dynamically. Examples: - -```dvc -$ dvc run -n first_stage "./a_script.sh > /dev/null 2>&1" -$ dvc run -n second_stage './another_script.sh $MYENVVAR' -``` +[param-deps]: + /doc/user-guide/pipelines/defining-pipelines#parameter-dependencies ## Options @@ -199,12 +168,12 @@ $ dvc run -n second_stage './another_script.sh $MYENVVAR' makes the stage incompatible with `dvc repro`. Implies `--no-exec`. - `-p [:]`, `--params [:]` - specify one - or more [parameter dependencies](/doc/command-reference/params) from a - parameters file `path` (`./params.yaml` by default). This is done by sending a - comma separated list (`params_list`) as argument, e.g. - `-p learning_rate,epochs`. A custom params file can be defined with a prefix, - e.g. `-p params.json:threshold`. Or use the prefix alone with `:` to use all - the parameters found in that file, e.g. 
`-p myparams.toml:`. + or more [parameter dependencies][param-deps] from a structured file `path` + (`./params.yaml` by default). This is done by sending a comma separated list + (`params_list`) as argument, e.g. `-p learning_rate,epochs`. A custom params + file can be defined with a prefix, e.g. `-p params.json:threshold`. Or use the + prefix alone with `:` to use all the parameters found in that file, e.g. + `-p myparams.toml:`. - `-m `, `--metrics ` - specify a metrics file produced by this stage. This option behaves like `-o` but registers the file in a `metrics` @@ -250,7 +219,7 @@ $ dvc run -n second_stage './another_script.sh $MYENVVAR' run with the same dependencies and outputs (see the [details](/doc/user-guide/project-structure/internal-files#run-cache)). Useful for example if the stage command/s is/are non-deterministic - ([not recommended](#avoiding-unexpected-behavior)). + ([not recommended](/doc/user-guide/data-pipelines/defining-pipelines#avoiding-unexpected-behavior)). - `--no-commit` - do not store the outputs of this execution in the cache (`dvc.yaml` and `dvc.lock` are still created or updated); useful to avoid diff --git a/content/docs/command-reference/stage/add.md b/content/docs/command-reference/stage/add.md index c1b9c268ac..dcfd76b781 100644 --- a/content/docs/command-reference/stage/add.md +++ b/content/docs/command-reference/stage/add.md @@ -27,67 +27,40 @@ update an existing stage, overwrite it with the `-f` (`--force`) option. A stage name is required and can be provided using the `-n` (`--name`) option. Most of the other [options](#options) help with defining different kinds of [dependencies and outputs](#dependencies-and-outputs) for the stage. The -remaining terminal input provided to `dvc stage add` after `-`/`--` flags will -become the required [`command` argument](#the-command-argument). +remaining terminal input provided to `dvc stage add` after any options/flags +will become the required [`command` argument]. 
-Stages whose dependencies are outputs from other stages form -[pipelines](/doc/command-reference/dag). `dvc repro` can be used to rebuild -their dependency graph, and execute them. - -
- -### 💡 Avoiding unexpected behavior - -We don't want to tell anyone how to write their code or what programs to use! -However, please be aware that in order to prevent unexpected results when DVC -reproduces pipeline stages, the underlying code should ideally follow these -rules: - -- Read/write exclusively from/to the specified dependencies and - outputs (including parameters files, metrics, and plots). - -- Completely rewrite outputs. Do not append or edit. + -- Stop reading and writing files when the `command` exits. +`-`/`--` flags sent after the `command` become part of the command itself and +are ignored by `dvc stage add`. -Also, if your pipeline reproducibility goals include consistent output data, its -code should be -[deterministic](https://en.wikipedia.org/wiki/Deterministic_algorithm) (produce -the same output for any given input): avoid code that increases -[entropy](https://en.wikipedia.org/wiki/Software_entropy) (e.g. random numbers, -time functions, hardware dependencies, etc.). + -
+Stages whose outputs become dependencies for other stages form +pipelines. `dvc repro` can be used to rebuild this [dependency +graph] and execute them. -### The `command` argument + -The `command` sent to `dvc stage add` can be anything your terminal would accept -and run directly, for example a shell built-in, expression, or binary found in -`PATH`. Please remember that any flags sent after the `command` are considered -part of the command itself, not of `dvc stage add`. +See the guide on [defining pipeline stages] for more details. -⚠️ While DVC is platform-agnostic, the commands defined in your -[pipeline](/doc/command-reference/dag) stages may only work on some operating -systems and require certain software packages to be installed. +[defining pipeline stages]: + /doc/user-guide/data-pipelines/defining-pipelines#pipelines -Wrap the command with double quotes `"` if there are special characters in it -like `|` (pipe) or `<`, `>` (redirection), otherwise they would apply to -`dvc stage add` itself. Use single quotes `'` instead if there are environment -variables in it that should be evaluated dynamically. Examples: + -```cli -$ dvc stage add -n first_stage "./a_script.sh > /dev/null 2>&1" -$ dvc stage add -n second_stage './another_script.sh $MYENVVAR' -``` +[`command` argument]: + /doc/user-guide/project-structure/dvcyaml-files#stage-commands ### Dependencies and outputs By specifying lists of dependencies (`-d` option) and/or outputs (`-o` and `-O` options) for each stage, we can create a -_dependency graph_ ([DAG](https://en.wikipedia.org/wiki/Directed_acyclic_graph)) -that connects them, i.e. the output of a stage becomes the input of another, and -so on (see `dvc dag`). This graph can be restored by DVC later to modify or -[reproduce](/doc/command-reference/repro) the full pipeline. For example: +[dependency graph] that connects them, i.e. the output of a stage becomes the +input of another, and so on (see `dvc dag`). 
This graph can be restored by DVC +later to modify or [reproduce](/doc/command-reference/repro) the full pipeline. +For example: ```cli $ dvc stage add -n printer -d write.sh -o pages ./write.sh @@ -138,12 +111,14 @@ Relevant notes: [manual process](/doc/command-reference/move#renaming-stage-outputs) to update `dvc.yaml` and the project's cache accordingly. +[dependency graph]: /doc/user-guide/data-pipelines/defining-pipelines + ### For displaying and comparing data science experiments -[parameters](/doc/command-reference/params) (`-p`/`--params` option) are a -special type of key/value dependencies. Multiple parameter dependencies can be -specified from within one or more YAML, JSON, TOML, or Python parameters files -(e.g. `params.yaml`). This allows tracking experimental hyperparameters easily. +[parameters][param-deps] (`-p`/`--params` option) are a special type of +key/value dependencies. Multiple params can be specified from within one or more +structured files (`params.yaml` by default). This allows tracking experimental +hyperparameters easily in ML. Special types of output files, [metrics](/doc/command-reference/metrics) (`-m` and `-M` options) and [plots](/doc/command-reference/plots) (`--plots` and @@ -151,6 +126,9 @@ and `-M` options) and [plots](/doc/command-reference/plots) (`--plots` and specific formats (JSON, YAML, CSV, or TSV) and allow displaying and comparing data science experiments. +[param-deps]: + /doc/user-guide/pipelines/defining-pipelines#parameter-dependencies + ## Options - `-n `, `--name ` (**required**) - specify a name for the stage @@ -194,12 +172,12 @@ data science experiments. makes the stage incompatible with `dvc repro`. - `-p [:]`, `--params [:]` - specify one - or more [parameter dependencies](/doc/command-reference/params) from a - parameters file `path` (`./params.yaml` by default). This is done by sending a - comma separated list (`params_list`) as argument, e.g. - `-p learning_rate,epochs`. 
A custom params file can be defined with a prefix, - e.g. `-p params.json:threshold`. Or use the prefix alone with `:` to use all - the parameters found in that file, e.g. `-p myparams.toml:`. + or more [parameter dependencies][param-deps] from a structured file `path` + (`./params.yaml` by default). This is done by sending a comma separated list + (`params_list`) as argument, e.g. `-p learning_rate,epochs`. A custom params + file can be defined with a prefix, e.g. `-p params.json:threshold`. Or use the + prefix alone with `:` to use all the parameters found in that file, e.g. + `-p myparams.toml:`. - `-m `, `--metrics ` - specify a metrics file produced by this stage. This option behaves like `-o` but registers the file in a `metrics` diff --git a/content/docs/command-reference/stage/index.md b/content/docs/command-reference/stage/index.md index e870b6f33c..1bb7c73939 100644 --- a/content/docs/command-reference/stage/index.md +++ b/content/docs/command-reference/stage/index.md @@ -24,3 +24,6 @@ organize data science projects, or build detailed machine learning pipelines. `dvc stage add` can be used to create/update stages in the `dvc.yaml` file. Use `dvc stage list` or `dvc dag` to discover existing stages without having to examine `dvc.yaml` files manually. + +Learn more about +[defining stages](/doc/user-guide/data-pipelines/defining-pipelines#stages). 
diff --git a/content/docs/sidebar.json b/content/docs/sidebar.json index 47ef755f6e..a582677b7d 100644 --- a/content/docs/sidebar.json +++ b/content/docs/sidebar.json @@ -149,6 +149,10 @@ "checkpoints" ] }, + { + "slug": "pipelines", + "children": ["defining-pipelines"] + }, { "slug": "how-to", "source": false, diff --git a/content/docs/start/data-management/metrics-parameters-plots.md b/content/docs/start/data-management/metrics-parameters-plots.md index 8024534801..e8a7006f9e 100644 --- a/content/docs/start/data-management/metrics-parameters-plots.md +++ b/content/docs/start/data-management/metrics-parameters-plots.md @@ -192,7 +192,7 @@ featurize: ### ⚙️ Expand to recall how it was generated. The `featurize` stage -[was created](/doc/start/data-pipelines#dependency-graphs-dags) with this +[was created](/doc/start/data-pipelines#dependency-graphs-dag) with this `dvc run` command. Notice the argument sent to the `-p` option (short for `--params`): diff --git a/content/docs/start/data-management/pipelines.md b/content/docs/start/data-management/pipelines.md index 6b14b459cf..d61177df46 100644 --- a/content/docs/start/data-management/pipelines.md +++ b/content/docs/start/data-management/pipelines.md @@ -151,13 +151,12 @@ along with `git commit` to version DVC metafiles). [to remote storage]: /doc/start/data-and-model-versioning#storing-and-sharing -## Dependency graphs (DAGs) +## Dependency graphs (DAG) -By using `dvc stage add` multiple times, and specifying outputs of -a stage as dependencies of another one, we can describe a sequence -of commands which gets to a desired result. This is what we call a _data -pipeline_ or -[_dependency graph_](https://en.wikipedia.org/wiki/Directed_acyclic_graph). +By using `dvc stage add` multiple times to use outputs of a stage +as dependencies of another, we can describe a sequence of commands +that gets to a desired result. This is what we call a _data pipeline_ or +dependency graph (a [DAG]). 
Let's create a second stage chained to the outputs of `prepare`, to perform feature extraction: @@ -172,6 +171,8 @@ $ dvc stage add -n featurize \ The `dvc.yaml` file is updated automatically and should include two stages now. +[dag]: /doc/user-guide/data-pipelines/defining-pipelines +
### 💡 Expand to see what happens under the hood. @@ -275,8 +276,8 @@ it also doesn't rerun `train`! The previous run with the same set of inputs ### 💡 Expand to see what happens under the hood. -`dvc repro` relies on the DAG definition from `dvc.yaml`, and uses -`dvc.lock` to determine what exactly needs to be run. +`dvc repro` relies on the [DAG] defined in `dvc.yaml`, and uses `dvc.lock` to +determine what exactly needs to be run. The `dvc.lock` file is similar to a `.dvc` file — it captures hashes (in most cases `md5`s) of the dependencies and values of the parameters that were used. diff --git a/content/docs/user-guide/basic-concepts/dependency.md b/content/docs/user-guide/basic-concepts/dependency.md index 453fda173f..29c82a592a 100644 --- a/content/docs/user-guide/basic-concepts/dependency.md +++ b/content/docs/user-guide/basic-concepts/dependency.md @@ -2,7 +2,9 @@ name: Dependency match: [dependency, dependencies, depends, input] tooltip: >- - A file or directory (possibly tracked by DVC) recorded in the `deps` section - of a stage (in `dvc.yaml`) or `.dvc` file file. See `dvc run`. Stages are - invalidated (considered outdated) when any of their dependencies change. + A file (e.g. data, code), directory (e.g. datasets), or parameter used as + input for a stage in a DVC pipeline. These are specified as paths in the + `deps` field of `dvc.yaml` or `.dvc` files. Stages are invalidated (considered + outdated) when any of their dependencies change. See `dvc stage add`, `dvc + params`, `dvc repro`. --- diff --git a/content/docs/user-guide/basic-concepts/output.md b/content/docs/user-guide/basic-concepts/output.md index d6d53a4821..59ced07a0c 100644 --- a/content/docs/user-guide/basic-concepts/output.md +++ b/content/docs/user-guide/basic-concepts/output.md @@ -4,5 +4,5 @@ match: [output, outputs] tooltip: >- A file or directory tracked by DVC, recorded in the `outs` section of a stage (in `dvc.yaml`) or `.dvc` file. Outputs are usually the result of stages. 
See - `dvc add`, `dvc run`, `dvc import`, among others. + `dvc add`, `dvc repro`, `dvc import`, among others. --- diff --git a/content/docs/user-guide/basic-concepts/pipeline.md b/content/docs/user-guide/basic-concepts/pipeline.md index f66383479f..8c58710ed5 100644 --- a/content/docs/user-guide/basic-concepts/pipeline.md +++ b/content/docs/user-guide/basic-concepts/pipeline.md @@ -1,7 +1,11 @@ --- -name: Pipeline (DAG) -match: [DAG, pipeline, 'data pipeline', 'data pipelines'] +name: Pipeline +match: [pipeline, pipelines, 'data pipeline', 'data pipelines'] tooltip: >- - A set of inter-dependent stages. This is also called a [dependency - graph](https://en.wikipedia.org/wiki/Directed_acyclic_graph). + DVC pipelines describe data processing workflows in a standard declarative + YAML format ([`dvc.yaml`](/doc/user-guide/project-structure/dvcyaml-files)). + This guarantees DVC can reproduce them consistently. DVC also helps automate + their execution and caches their results. See [Defining + Pipelines](/doc/user-guide/data-pipelines/defining-pipelines) for more + details. --- diff --git a/content/docs/user-guide/basic-concepts/stage.md b/content/docs/user-guide/basic-concepts/stage.md index e73027b460..1a5292a367 100644 --- a/content/docs/user-guide/basic-concepts/stage.md +++ b/content/docs/user-guide/basic-concepts/stage.md @@ -2,7 +2,9 @@ name: Stage match: [stage, stages] tooltip: >- - A stage represents individual data processes, including their input and - resulting output which can be combined to build detailed machine learning - pipelines. + A stage represents an individual command, script, or source code that gets to + some milestone as part of your project's workflow. For example, `python + train.py` may generate a machine learning model. DVC stages include data + input(s) and resulting output(s), if any. [Learn + more](/doc/user-guide/data-pipelines/defining-pipelines#stages). 
--- diff --git a/content/docs/user-guide/experiment-management/running-experiments.md b/content/docs/user-guide/experiment-management/running-experiments.md index 6c52899ec7..8208cffc1d 100644 --- a/content/docs/user-guide/experiment-management/running-experiments.md +++ b/content/docs/user-guide/experiment-management/running-experiments.md @@ -45,7 +45,8 @@ once. > 📖 `dvc exp run` is an experiment-specific alternative to `dvc repro`. [reproduction targets]: /doc/command-reference/repro#options -[dependency graph]: /doc/command-reference/dag#directed-acyclic-graph +[dependency graph]: + /doc/user-guide/data-pipelines/defining-pipelines#directed-acyclic-graph ## Tuning (hyper)parameters diff --git a/content/docs/user-guide/pipelines/defining-pipelines.md b/content/docs/user-guide/pipelines/defining-pipelines.md new file mode 100644 index 0000000000..003f87b978 --- /dev/null +++ b/content/docs/user-guide/pipelines/defining-pipelines.md @@ -0,0 +1,260 @@ +# Defining Pipelines + +Pipelines represent data workflows that you want to **reproduce** reliably -- so +the results are consistent. The typical pipelining process involves: + +- Obtain and `dvc add` or `dvc import` the project's initial data requirements + (see [Data Management]). This caches the data and generates + `.dvc` files. + +- Define the pipeline [stages](#stages) in `dvc.yaml` files (more on this + later). Example structure: + + ```yaml + stages: + prepare: ... # stage 1 definition + train: ... # stage 2 definition + evaluate: ... # stage 3 definition + ``` + +- Capture other useful metadata such as runtime + [parameters](#parameter-dependencies), performance [metrics], and [plots] to + visualize. DVC supports multiple file formats for these. + + + +We call this file-based definition _codification_ (YAML format in our case). It +has the added benefit of allowing you to develop pipelines on standard Git +workflows ([GitOps]). 
+ +[gitops]: /doc/use-cases/versioning-data-and-model-files + + + +Stages usually take some data and run some code, producing an output (e.g. an ML +model). The pipeline is formed by making them interdependent, meaning that the +output of a stage becomes the input of another, and so on. Technically, this is +called a _dependency graph_ (DAG). + +Note that while each pipeline is a graph, this doesn't mean a single `dvc.yaml` +file. DVC checks the entire project tree and validates all such +files to find stages, rebuilding all the pipelines that these may define. + +[data management]: /doc/start/data-management +[metrics]: /doc/command-reference/metrics +[plots]: /doc/user-guide/visualizing-plots + +
+ +## Directed Acyclic Graph (DAG) + +DVC represents a pipeline internally as a _graph_ where the nodes are stages and +the edges are _directed_ dependencies (e.g. A before B). And in order for DVC to +run a pipeline, its topology should be _acyclic_ -- because executing cycles +(e.g. A -> B -> C -> A ...) would continue indefinitely. [More about DAGs]. + +Use `dvc dag` to visualize (or export) them. + +[more about dags]: https://en.wikipedia.org/wiki/Directed_acyclic_graph + +
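To make the acyclic requirement concrete, here is a minimal Python sketch that orders a small stage graph and rejects a cyclic one. It is an illustration only, not DVC's internal code: the stage names and the `execution_order` helper are assumptions made for the example, and it relies on the standard library's `graphlib` module (Python 3.9 or later).

```python
from graphlib import CycleError, TopologicalSorter


def execution_order(graph):
    """Return stage names so that every dependency precedes its dependents.

    `graph` maps each stage name to the set of stages it depends on.
    Raises ValueError if the graph contains a cycle.
    """
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError as exc:
        # exc.args[1] holds the list of nodes that form the detected cycle
        raise ValueError(f"pipeline is not acyclic: {exc.args[1]}") from exc


# Hypothetical pipeline: prepare -> train -> evaluate
pipeline = {"prepare": set(), "train": {"prepare"}, "evaluate": {"train"}}
order = execution_order(pipeline)
print(order)
```

Passing a graph such as `{"a": {"c"}, "b": {"a"}, "c": {"b"}}` to the same helper raises `ValueError`, since no valid execution order exists for a cycle.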
+ +## Stages + + + +See the full [specification] of stage entries. + +[specification]: /doc/user-guide/project-structure/dvcyaml-files#stage-entries + + + +Each stage wraps around an executable shell [command] and specifies any +file-based [dependencies](#simple-dependencies) as well as [outputs](#outputs). +Let's look at a sample stage: it depends on a script file it runs as well as on +a raw data input (ideally [tracked by DVC][data management] already): + +```yaml +stages: + prepare: + cmd: source src/cleanup.sh + deps: + - src/cleanup.sh + - data/raw + outs: + - data/clean.csv +``` + + + +We use [GNU/Linux](https://www.gnu.org/software/software.html) in these +examples, but Windows or other shells can be used too. + + + +Besides writing `dvc.yaml` files manually (recommended), you can also create +stages with `dvc stage add` -- a limited command-line interface to set up +pipelines. Let's add another stage this way and look at the resulting +`dvc.yaml`: + +```dvc +$ dvc stage add --name train \ + --deps src/model.py \ + --deps data/clean.csv \ + --outs data/predict.dat \ + python src/model.py data/clean.csv +``` + +```yaml +stages: + prepare: + ... + outs: + - data/clean.csv + train: + cmd: python src/model.py data/clean.csv + deps: + - src/model.py + - data/clean.csv + outs: + - data/predict.dat +``` + + + +One advantage of using `dvc stage add` is that it will verify the validity of +the arguments provided (otherwise the stage definition won't be checked until +execution). A disadvantage is that some advanced features such as [templating] +are not available this way. + +[command]: /doc/user-guide/project-structure/dvcyaml-files#stage-commands +[templating]: /doc/user-guide/project-structure/pipelines-files#templating + + + +Notice that the new `train` stage depends on the output from stage `prepare` +(`data/clean.csv`), forming the pipeline ([DAG](#directed-acyclic-graph-dag)). 
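The way `prepare` and `train` become connected can be sketched in Python: an edge exists whenever a path listed in one stage's `outs` also appears in another stage's `deps`. The dictionaries below simply mirror the example `dvc.yaml`, and `find_edges` is a hypothetical helper written for this illustration; it only matches exact paths and is not how DVC itself is implemented.

```python
def find_edges(stages):
    """Return sorted (upstream, downstream) pairs of connected stages.

    Two stages are connected when an output path of the upstream stage
    is listed as a dependency of the downstream stage.
    """
    # Map every output path to the name of the stage that produces it
    producers = {
        out: name for name, stage in stages.items() for out in stage.get("outs", [])
    }
    edges = set()
    for name, stage in stages.items():
        for dep in stage.get("deps", []):
            upstream = producers.get(dep)
            if upstream is not None and upstream != name:
                edges.add((upstream, name))
    return sorted(edges)


# Plain-dict version of the two stages defined in the example above
stages = {
    "prepare": {"deps": ["src/cleanup.sh", "data/raw"], "outs": ["data/clean.csv"]},
    "train": {"deps": ["src/model.py", "data/clean.csv"], "outs": ["data/predict.dat"]},
}
edges = find_edges(stages)
print(edges)
```

Dependencies such as `src/model.py` or `data/raw` are not produced by any stage, so they create no edge; they are plain inputs to the pipeline.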
+ + +Stage execution sequences will be determined entirely by the DAG, not by the +order in which stages are found in `dvc.yaml`. + + + +## Simple dependencies + +There's more than one type of stage dependency. A simple dependency is a file or +directory used as input by the stage command. When its contents have changed, +DVC "invalidates" the stage -- it knows that it needs to run again (see +`dvc status`). This in turn may cause a chain reaction in which subsequent +stages of the pipeline are also reproduced. + + + +DVC [calculates a hash] of file/dir contents to compare vs. previous versions. +This is a distinctive mechanism over traditional build tools like `make`. + +[calculates a hash]: + /doc/user-guide/project-structure/internal-files#structure-of-the-cache-directory + + + +File system-level dependencies are defined in the `deps` field of `dvc.yaml` +stages or, alternatively, with the `--deps` (`-d`) option of `dvc stage add` (see +the previous section's example). + +
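The content-based invalidation described in this section can be sketched in Python as follows. This is a simplified illustration, not DVC's implementation: the hash algorithm (SHA-256 here), the chunk size, and the `file_hash` and `has_changed` helpers are all assumptions made for the example.

```python
import hashlib
import os
import tempfile


def file_hash(path, chunk_size=1 << 20):
    """Hash a file's contents in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def has_changed(path, recorded_hash):
    """A dependency counts as changed only if its contents hash differently."""
    return file_hash(path) != recorded_hash


with tempfile.TemporaryDirectory() as tmp:
    dep = os.path.join(tmp, "clean.csv")
    with open(dep, "w") as f:
        f.write("a,b\n1,2\n")
    recorded = file_hash(dep)
    unchanged = has_changed(dep, recorded)

    # Rewriting identical contents updates the timestamp but not the hash
    with open(dep, "w") as f:
        f.write("a,b\n1,2\n")
    touched = has_changed(dep, recorded)

    # Editing the contents produces a different hash
    with open(dep, "w") as f:
        f.write("a,b\n1,3\n")
    edited = has_changed(dep, recorded)

print(unchanged, touched, edited)
```

The middle case is where this differs from a timestamp comparison: a file rewritten with the same contents is not reported as changed.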
+ +### External dependencies: click to learn more. + +A less common kind of dependency is a _URL dependency_. Instead of files on a +local disk, you can `dvc import` data from another DVC project (for +example, hosted on GitHub). External dependencies establish relationships between +different projects or systems (see `dvc import-url`). +[Get all the details](/doc/user-guide/external-dependencies). + + + +DVC will use special methods to check whether the contents of a URL have +changed for the purpose of stage invalidation. + + + +
+ +## Parameter dependencies + +A more granular type of dependency is the parameter (`params` field of +`dvc.yaml`), or _hyperparameters_ in machine learning. These represent simple +values used inside your code to tune data processing, or that affect stage +execution in any other way. For example, training a [Neural Network] usually +requires _batch size_ and _epoch_ values. + +Instead of hard-coding param values, your code can read them from a structured +file (e.g. YAML format). DVC can track any key/value pair in a supported +[parameters file] (`params.yaml` by default). Params are granular dependencies +because DVC only invalidates stages when the corresponding part of the params +file has changed. + +```yaml +stages: + train: + cmd: ... + deps: ... + params: # from params.yaml + - learning_rate + - nn.epochs + - nn.batch_size + outs: ... +``` + + + +See [more details] about this syntax. + + + +Use `dvc params diff` to compare parameters across project versions. + +[parameters file]: + /doc/user-guide/project-structure/dvcyaml-files#parameters-files +[neural network]: + https://machinelearningmastery.com/difference-between-a-batch-and-an-epoch/ +[more details]: /doc/user-guide/project-structure/dvcyaml-files#parameters + +## Outputs + +Stage outputs are files (or directories) written by pipelines, for +example machine learning models, intermediate artifacts, as well as data [plots] +and performance [metrics]. These files are cached by DVC +automatically, and tracked with the help of `dvc.lock` files. + +Outputs can be dependencies of subsequent stages (as explained earlier). So when +they change, DVC may need to reproduce downstream stages as well (handled +automatically). + +The types of outputs are: + +- Files and directories: Typically data to feed to intermediate stages, as well + as the final results of a pipeline (e.g. a dataset or an ML model). 
+ +- [Metrics]: DVC supports small text files that usually contain model + performance metrics from the evaluation, validation, or testing phases of the + ML lifecycle. DVC lets you compare the produced metrics with one another using + `dvc metrics diff`, and presents the results as a table with `dvc metrics show` + or `dvc exp show`. + +- [Plots]: Different kinds of data that can be visually graphed. For example, + you can contrast ML performance statistics or continuous metrics from multiple + experiments. `dvc plots show` can generate charts for certain data files or + render custom image files for you, or you can compare different ones with + `dvc plots diff`. + + + +Outputs are produced by [stage commands][command]. DVC does not make any +assumption regarding this process; the resulting files should just match the paths specified in +`dvc.yaml`. + + diff --git a/content/docs/user-guide/pipelines/index.md b/content/docs/user-guide/pipelines/index.md new file mode 100644 index 0000000000..5a9f96a823 --- /dev/null +++ b/content/docs/user-guide/pipelines/index.md @@ -0,0 +1,19 @@ +# Pipelines + +If you find yourself repeating a sequence of actions to get or update the results +of your project, then you may already have a pipeline. For example, a data +science workflow could involve: + +1. Gathering data for training and validation +2. Extracting useful features from the training dataset +3. (Re)training an ML model +4. Evaluating the results against the validation set + +DVC helps you [define] these stages in a standard YAML format (`.dvc` and +`dvc.yaml` files), making your pipeline more manageable and +consistent to reproduce. + +See [Get Started: Data Pipelines](/doc/start/data-management/pipelines) for a +hands-on introduction to this topic. 
+ +[define]: /doc/user-guide/data-pipelines/defining-pipelines diff --git a/content/docs/user-guide/project-structure/dvcyaml-files.md b/content/docs/user-guide/project-structure/dvcyaml-files.md index ad975873d8..e6194ed0dd 100644 --- a/content/docs/user-guide/project-structure/dvcyaml-files.md +++ b/content/docs/user-guide/project-structure/dvcyaml-files.md @@ -20,8 +20,8 @@ so you may modify, write, or generate stages and pipelines on your own. -We use [GNU/Linux](https://www.gnu.org/software/software.html) in most of our -examples. +We use [GNU/Linux](https://www.gnu.org/software/software.html) in these +examples, but Windows or other shells can be used too. @@ -40,14 +40,19 @@ stages: - columns.txt ``` -> See also `dvc stage add`, a helper command to write stages in `dvc.yaml`. + + +See also `dvc stage add`, a helper command to write stages in `dvc.yaml`. + + The most important part of a stage is the terminal command(s) it executes (`cmd` field). This is what DVC runs when the stage is reproduced (see `dvc repro`). -If a command reads input files, these (or their directory locations) can be -defined as dependencies (`deps`). DVC will check whether they have -changed to decide whether the stage requires re-execution (see `dvc status`). +If a [stage command](#stage-commands) reads input files, these (or their +directory locations) can be defined as dependencies (`deps`). DVC +will check whether they have changed to decide whether the stage requires +re-execution (see `dvc status`). If it writes files or dirs, they can be defined as outputs (`outs`). DVC will track them going forward (similar to using `dvc add`). @@ -64,6 +69,24 @@ See the full stage entry [specification](#stage-entries). +### Stage commands + +The command(s) defined in the `stages` (`cmd` field) can be anything your system +terminal would accept and run, for example a shell built-in, an expression, or a +binary found in `PATH`. 
+ +Surround the command with double quotes `"` if it includes special characters +like `|` or `<`, `>`. Use single quotes `'` instead if there are environment +variables in it that should be evaluated dynamically. + +The same applies to the `command` argument for helper commands (`dvc stage add`, +`dvc exp init`), otherwise they would apply to the DVC call itself: + +```cli +$ dvc stage add -n a_stage "./a_script.sh > /dev/null 2>&1" +$ dvc exp init './another_script.sh $MYENVVAR' +``` + ### Parameter dependencies [Parameters](/doc/command-reference/params) are a special type of stage @@ -104,9 +127,9 @@ This allows several stages to depend on values of a shared structured file ### Metrics and Plots outputs -Like [common outputs](#outputs), metrics and plots -files are produced by the stage `cmd`. However, their purpose is different. -Typically they contain metadata to evaluate pipeline processes. Example: +Like common output files, metrics and plots files are +produced by the stage `cmd`. However, their purpose is different. Typically they +contain metadata to evaluate pipeline processes. Example: ```yaml stages: @@ -432,7 +455,7 @@ These are the fields that are accepted in each stage: | Field | Description | | ---------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `cmd` | (Required) One or more commands executed by the stage (may contain either a single value or a list). Commands are executed sequentially until all are finished or until one of them fails (see `dvc repro`). | +| `cmd` | (Required) One or more commands executed by the stage (may contain either a single value or a list). [Learn more](#stage-commands). | | `wdir` | Working directory for the stage command to run in (relative to the file's location). 
Any paths in other fields are also based on this. It defaults to `.` (the file's location). | | `deps` | List of dependency paths of this stage (relative to `wdir`). | | `outs` | List of stage output paths (relative to `wdir`). These can contain optional [subfields](#output-subfields). | @@ -457,6 +480,14 @@ validation and auto-completion. > See also > [How to Merge Conflicts](/doc/user-guide/how-to/merge-conflicts#dvcyaml). + + +While DVC is platform-agnostic, commands defined in `dvc.yaml` (`cmd` field) may +only work on some operating systems and require certain software packages or +libraries in the environment. + + + ### Output subfields > These include a subset of the fields in `.dvc` file diff --git a/content/docs/user-guide/project-structure/internal-files.md b/content/docs/user-guide/project-structure/internal-files.md index 9d0edd57aa..f8b705ada2 100644 --- a/content/docs/user-guide/project-structure/internal-files.md +++ b/content/docs/user-guide/project-structure/internal-files.md @@ -165,4 +165,7 @@ run. run-cache to remote storage for sharing and/or as a back up. > Note that the run-cache assumes that stage commands are deterministic (see -> **Avoiding unexpected behavior** in `dvc run`). +> [Avoiding unexpected behavior]). + +[avoiding unexpected behavior]: + /doc/user-guide/data-pipelines/defining-pipelines#avoiding-unexpected-behavior diff --git a/content/docs/user-guide/related-technologies.md b/content/docs/user-guide/related-technologies.md index a6137ed92f..b6d0967004 100644 --- a/content/docs/user-guide/related-technologies.md +++ b/content/docs/user-guide/related-technologies.md @@ -63,8 +63,7 @@ bringing best practices from software engineering into the data science field ## Workflow management systems -Pipelines and dependency graphs -([DAG](https://en.wikipedia.org/wiki/Directed_acyclic_graph)) such as _Airflow_, +Systems to manage data pipelines and [dependency graphs] such as _Airflow_, _Luigi_, etc. 
- DVC is focused on data science and modeling. As a result, DVC pipelines are @@ -79,6 +78,8 @@ _Luigi_, etc. - See also our sister project, [CML](https://cml.dev/), that helps fill some of these gaps. +[dependency graphs]: /doc/user-guide/data-pipelines/defining-pipelines + ## Experiment management software > See also the [Experiment Management](/doc/user-guide/experiment-management) @@ -111,11 +112,9 @@ _Luigi_, etc. avoid recomputing all dependency file hashes, which would be highly problematic when working with large files (multiple GB). -- DVC utilizes a - [directed acyclic graph](https://en.wikipedia.org/wiki/Directed_acyclic_graph) - (DAG): +- DVC utilizes a [Directed Acyclic Graph] (DAG): - - The DAG or dependency graph is defined implicitly by the connections between + - The dependency graph is defined implicitly by the connections between [stages](/doc/command-reference/run), based on their dependencies and outputs. @@ -132,3 +131,6 @@ _Luigi_, etc. > actual file contents. See **Linking files** in > [this doc](http://www.tldp.org/LDP/intro-linux/html/sect_03_03.html) for > technical details (Linux). + +[directed acyclic graph]: + /doc/user-guide/data-pipelines/defining-pipelines#directed-acyclic-graph-dag diff --git a/content/docs/user-guide/what-is-dvc.md b/content/docs/user-guide/what-is-dvc.md index 6c960dbb59..e695443e7a 100644 --- a/content/docs/user-guide/what-is-dvc.md +++ b/content/docs/user-guide/what-is-dvc.md @@ -33,8 +33,8 @@ can version experiments, manage large datasets, and make projects reproducible. transfer large datasets or share a GPU-trained model with others. - DVC makes data science projects **reproducible** by creating lightweight - [pipelines](/doc/command-reference/dag) using implicit dependency graphs, and - by codifying the data and artifacts involved. + [pipelines] using implicit dependency graphs, and by codifying the data and + artifacts involved. 
- DVC is **platform agnostic**: It runs on all major operating systems (Linux, macOS, and Windows), and works independently of the programming languages @@ -51,6 +51,7 @@ can version experiments, manage large datasets, and make projects reproducible. [free]: https://github.com/iterative/dvc/blob/master/LICENSE [vs code extension]: /doc/vs-code-extension [command line]: /doc/command-reference +[pipelines]: /doc/user-guide/data-pipelines ## DVC does not replace Git!