diff --git a/content/docs/command-reference/config.md b/content/docs/command-reference/config.md index 2bbfd3c1c1..a52bc06c9b 100644 --- a/content/docs/command-reference/config.md +++ b/content/docs/command-reference/config.md @@ -254,7 +254,7 @@ experiments or projects use a similar structure. - `parsing.bool` - Controls the templating syntax for boolean values when used in - [dict unpacking](/doc/user-guide/project-structure/dvcyaml-files#dict-unpacking). + [dict unpacking](/doc/user-guide/project-structure/dvcyaml-files#dictionary-unpacking). Valid values are `"store_true"` (default) and `"boolean_optional"`, named after @@ -289,7 +289,7 @@ experiments or projects use a similar structure. ``` - `parsing.list` - Controls the templating syntax for list values when used in - [dict unpacking](/doc/user-guide/project-structure/dvcyaml-files#dict-unpacking). + [dict unpacking](/doc/user-guide/project-structure/dvcyaml-files#dictionary-unpacking). Valid values are `"nargs"` (default) and `"append"`, named after [Python argparse actions](https://docs.python.org/3/library/argparse.html#action). diff --git a/content/docs/user-guide/project-structure/dvcyaml-files.md b/content/docs/user-guide/project-structure/dvcyaml-files.md index e9d766fbf7..b8a5787c6c 100644 --- a/content/docs/user-guide/project-structure/dvcyaml-files.md +++ b/content/docs/user-guide/project-structure/dvcyaml-files.md @@ -1,34 +1,22 @@ # `dvc.yaml` -You can construct data science or machine learning pipelines by defining -individual [stages](/doc/command-reference/run) in one or more `dvc.yaml` files. -Stages form a pipeline when they connect with each other (forming a _dependency -graph_, see `dvc dag`). Refer to -[Get Started: Data Pipelines](/doc/start/data-management/data-pipelines). +You can construct machine learning pipelines by defining individual +[stages](/doc/command-reference/run) in one or more `dvc.yaml` files. 
Stages +constitute a pipeline when they connect with each other (forming a [dependency +graph], see `dvc dag`). - - -A helper command, `dvc stage`, is available to create and list stages. - - +`dvc.yaml` uses the [YAML 1.2](https://yaml.org/) format and a human-friendly +schema explained below. We encourage you to get familiar with it so you may +modify, write, or generate these files by your own means. -`dvc.yaml` files can be versioned with Git. - -These files use the [YAML 1.2](https://yaml.org/) file format, and a -human-friendly schema explained below. We encourage you to get familiar with it -so you may modify, write, or generate stages and pipelines on your own. - - - -We use [GNU/Linux](https://www.gnu.org/software/software.html) in these -examples, but Windows or other shells can be used too. - - +`dvc.yaml` files are designed to be small enough so you can easily version them +with Git along with other DVC metafiles and your project's code. ## Stages -The list of `stages` contains one or more user-defined stages. Here's a simple -one named `transpose`: +The list of `stages` is typically the most important part of a `dvc.yaml` file. +It contains one or more user-defined stages. Here's a simple one +named `transpose`: ```yaml stages: @@ -42,20 +30,28 @@ stages: -See also `dvc stage add`, a helper command to write stages in `dvc.yaml`. +A helper command group, `dvc stage`, is available to create and list stages. -The most important part of a stage is the terminal command(s) it executes (`cmd` +The only required part of a stage is the shell command(s) it executes (`cmd` field). This is what DVC runs when the stage is reproduced (see `dvc repro`). + + +We use [GNU/Linux](https://www.gnu.org/software/software.html) in our examples, +but Windows or other shells can be used too. + + + If a [stage command](#stage-commands) reads input files, these (or their directory locations) can be defined as dependencies (`deps`). 
DVC will check whether they have changed to decide whether the stage requires re-execution (see `dvc status`). -If it writes files or dirs, they can be defined as outputs -(`outs`). DVC will track them going forward (similar to using `dvc add`). +If it writes files or directories, these can be defined as outputs +(`outs`). DVC will track them going forward (similar to using `dvc add` on +them). @@ -180,7 +176,7 @@ See also `dvc params diff` to compare params across project version. ### Metrics and Plots outputs -Like common output files, metrics and plots files are +Like common outputs, metrics and plots files are produced by the stage `cmd`. However, their purpose is different. Typically they contain metadata to evaluate pipeline processes. Example: @@ -200,12 +196,79 @@ stages: cache: false ``` -> `cache: false` is typical here, since they're small enough for Git to version -> directly. + + +`cache: false` is typical here, since they're small enough for Git to store +directly. + + The commands in `dvc metrics` and `dvc plots` help you display and compare metrics and plots. +## Stage entries + +These are the fields that are accepted in each stage: + +| Field | Description | +| ---------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `cmd` | (Required) One or more shell commands to execute (may contain either a single value or a list). `cmd` values may use [dictionary substitution](#dictionary-unpacking) from param files. Commands are executed sequentially until all are finished or until one of them fails (see `dvc repro`). | +| `wdir` | Working directory for the `cmd` to run in (relative to the file's location). Any paths in other fields are also based on this. It defaults to `.` (the file's location). 
| +| `deps` | List of dependency paths (relative to `wdir`). | +| `outs` | List of output paths (relative to `wdir`). These can contain certain optional [subfields](#output-subfields). | +| `params` | List of parameter dependency keys (field names) to track from `params.yaml` (in `wdir`). The list may also contain other parameter file names, with a sub-list of the param names to track in them. | +| `metrics` | List of [metrics files](/doc/command-reference/metrics), and optionally, whether or not this metrics file is cached (`true` by default). See the `--metrics-no-cache` (`-M`) option of `dvc run`. | +| `plots` | List of [plot metrics](/doc/command-reference/plots), and optionally, their default configuration (subfields matching the options of `dvc plots modify`), and whether or not this plots file is cached (`true` by default). See the `--plots-no-cache` option of `dvc run`. | +| `frozen` | Whether or not this stage is frozen (prevented from execution during reproduction). | +| `always_changed` | Causes this stage to be always considered as [changed] by commands such as `dvc status` and `dvc repro`. `false` by default. | +| `meta` | (Optional) Arbitrary metadata can be added manually with this field. Any YAML content is supported. `meta` contents are ignored by DVC, but they can be meaningful for user processes that read or write `.dvc` files directly. | +| `desc` | (Optional) User description. This doesn't affect any DVC operations. | + +[changed]: /doc/command-reference/status#local-workspace-status + +`dvc.yaml` files also support `# comments`. + + + +We maintain a `dvc.yaml` [schema] that can be used by editors like [VSCode] or +[PyCharm] to enable automatic syntax validation and auto-completion. + +[schema]: https://github.com/iterative/dvcyaml-schema +[vscode]: /doc/install/plugins#visual-studio-code +[pycharm]: /doc/install/plugins#pycharmintellij + + + + + +See also +[How to Merge Conflicts](/doc/user-guide/how-to/merge-conflicts#dvcyaml). 
+ + + +### Output subfields + + + +These include a subset of the fields in `.dvc` file +[output entries](/doc/user-guide/project-structure/dvc-files#output-entries). + + + +| Field | Description | +| ------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `cache` | Whether or not this file or directory is cached (`true` by default). See the `--no-commit` option of `dvc add`. | +| `remote` | (Optional) Name of the remote to use for pushing/fetching | +| `persist` | Whether the output file/dir should remain in place during `dvc repro` (`false` by default: outputs are deleted when `dvc repro` starts) | +| `checkpoint` | (Optional) Set to `true` to let DVC know that this output is associated with [checkpoint experiments](/doc/user-guide/experiment-management/checkpoints). These outputs are reverted to their last cached version at `dvc exp run` and also `persist` during the stage execution. | +| `desc` | (Optional) User description for this output. This doesn't affect any DVC operations. | + + + +Using the `checkpoint` field in `dvc.yaml` is not compatible with `dvc repro`. + + + ## Templating `dvc.yaml` supports a templating format to insert values from different sources @@ -244,52 +307,54 @@ stages: DVC will track simple param values (numbers, strings, etc.) used in `${}` (they will be listed by `dvc params diff`). -### Dict Unpacking +
+ +### Dictionary unpacking Only inside the `cmd` entries, you can also reference a dictionary inside `${}` -and DVC will _unpack_ it. For example, given the following `params.yaml`: +and DVC will _unpack_ it. This can be useful to avoid writing every argument +passed to the command, or having to modify `dvc.yaml` when arguments change. + +For example, given the following `params.yaml`: ```yaml -dict: +mydict: foo: foo - bar: 2 + bar: 1 bool: true nested: - foo: bar - list: [1, 2, 'foo'] + baz: bar + list: [2, 3, 'qux'] ``` -You can reference `dict` in the `cmd` section of a `dvc.yaml`: +You can reference `mydict` in a stage command like this: ```yaml stages: train: - cmd: python train.py ${dict} + cmd: python train.py ${mydict} ``` -And DVC will _unpack_ the values inside `dict`, creating the following `cmd` -call: +DVC will unpack the values inside `mydict`, creating the following `cmd` call: ```cli -$ python train.py --foo 'foo' --bar 2 --bool \ - --nested.foo 'bar' --list 1 2 'foo' +$ python train.py --foo 'foo' --bar 1 --bool \ + --nested.baz 'bar' --list 2 3 'qux' ``` -This can be useful for avoiding to write every argument passed to the `cmd` or -having to modify the `dvc.yaml` when adding or removing arguments. - -The [parsing](/doc/command-reference/config#parsing) section of `dvc config` can -be used to customize the syntax used for some ambiguous types like booleans and -lists. +`dvc config parsing` can be used to customize the syntax used for ambiguous +types like booleans and lists. -### Vars +
+ +### Variables -Alternatively, values for substitution can be listed as top-level `vars` like -this: +As an alternative to relying on parameter files, values for substitution can be +listed as top-level `vars` like this: ```yaml vars: @@ -313,9 +378,6 @@ Values from `vars` are not tracked like parameters. To load additional params files, list them in the top `vars`, in the desired order, e.g.: -> Params file paths will be evaluated based on [`wdir`](#stage-entries), if -> specified. - ```yaml vars: - params.json @@ -323,9 +385,11 @@ vars: - config/myapp.yaml ``` - + -Note that the default `params.yaml` file is always loaded first, if present. +The default `params.yaml` file is always loaded first, if present. +Param file paths will be evaluated based on [`wdir`](#stage-entries), if +specified. @@ -364,13 +428,17 @@ DVC merges values from params files and `vars` in each scope when possible. For example, `{"grp": {"a": 1}}` merges with `{"grp": {"b": 2}}`, but not with `{"grp": {"a": 7}}`. -⚠️ Known limitations of local `vars`: + + +Known limitations of local `vars`: - [`wdir`](#stage-entries) cannot use values from local `vars`, as DVC uses the working directory first (to load any values from params files listed in `vars`). - `foreach` is also incompatible with local `vars` at the moment. + + The substitution expression supports these forms: ```yaml @@ -379,11 +447,21 @@ ${param.key} # Nested values through . (period) ${param.list[0]} # List elements via index in [] (square brackets) ``` -> To use the expression literally in `dvc.yaml` (so DVC does not replace it for -> a value), escape it with a backslash, e.g. `\${...`. + + +To use the expression literally in `dvc.yaml` (so DVC does not replace it with a +value), escape it with a backslash, e.g. `\${...`. + + ## `foreach` stages + + +This feature cannot be combined with [templating](#templating) at the moment. + + + You can define more than one stage in a single `dvc.yaml` entry with the following syntax. 
A `foreach` element accepts a list or dictionary with values to iterate on, while `do` contains the regular stage fields (`cmd`, `outs`, @@ -503,67 +581,6 @@ Both individual foreach stages (`train@1`) and groups of foreach stages -> Note that this feature is not compatible with [templating](#templating) at the -> moment. - -## Stage entries - -These are the fields that are accepted in each stage: - -| Field | Description | -| ---------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `cmd` | (Required) One or more commands executed by the stage (may contain either a single value or a list). [Learn more](#stage-commands). | -| `wdir` | Working directory for the stage command to run in (relative to the file's location). Any paths in other fields are also based on this. It defaults to `.` (the file's location). | -| `deps` | List of dependency paths of this stage (relative to `wdir`). | -| `outs` | List of stage output paths (relative to `wdir`). These can contain optional [subfields](#output-subfields). | -| `params` | List of parameter dependency keys (field names) to track from `params.yaml` (in `wdir`). The list may also contain other parameters file names, with a sub-list of the param names to track in them. | -| `metrics` | List of [metrics files](/doc/command-reference/metrics), and optionally, whether or not this metrics file is cached (`true` by default). See the `--metrics-no-cache` (`-M`) option of `dvc run`. | -| `plots` | List of [plot metrics](/doc/command-reference/plots), and optionally, their default configuration (subfields matching the options of `dvc plots modify`), and whether or not this plots file is cached ( `true` by default). See the `--plots-no-cache` option of `dvc run`. 
| -| `frozen` | Whether or not this stage is frozen from reproduction | -| `always_changed` | Causes this stage to be always considered as [changed] by commands such as `dvc status` and `dvc repro`. `false` by default | -| `meta` | Arbitrary metadata can be added manually with this field. Any YAML content is supported. `meta` contents are ignored by DVC, but they can be meaningful for user processes that read or write `.dvc` files directly. | -| `desc` | User description for this stage. This doesn't affect any DVC operations. | - -[changed]: /doc/command-reference/status#local-workspace-status - -`dvc.yaml` files also support `# comments`. - -Note that we maintain a `dvc.yaml` -[schema](https://github.com/iterative/dvcyaml-schema) that can be used by -editors like [VSCode](/doc/install/plugins#visual-studio-code) or -[PyCharm](/doc/install/plugins#pycharmintellij) to enable automatic syntax -validation and auto-completion. - -> See also -> [How to Merge Conflicts](/doc/user-guide/how-to/resolve-merge-conflicts#dvcyaml). - - - -While DVC is platform-agnostic, commands defined in `dvc.yaml` (`cmd` field) may -only work on some operating systems and require certain software packages or -libraries in the environment. - - - -### Output subfields - -> These include a subset of the fields in `.dvc` file -> [output entries](/doc/user-guide/project-structure/dvc-files#output-entries). - -| Field | Description | -| ------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `cache` | Whether or not this file or directory is cached (`true` by default). See the `--no-commit` option of `dvc add`. 
| -| `remote` | Name of the remote to use for pushing/fetching | -| `persist` | Whether the output file/dir should remain in place during `dvc repro` (`false` by default: outputs are deleted when `dvc repro` starts) | -| `checkpoint` | Set to `true` to let DVC know that this output is associated with [checkpoint experiments](/doc/user-guide/experiment-management/checkpoints). These outputs are reverted to their last cached version at `dvc exp run` and also `persist` during the stage execution. | -| `desc` | User description for this output. This doesn't affect any DVC operations. | -| `type` | User-assigned type of the data. | -| `labels` | User-assigned labels to add to the data. | -| `meta` | Custom metadata about the data. | - -⚠️ Note that using the `checkpoint` field in `dvc.yaml` is not compatible with -`dvc repro`. - ## Top-level plot definitions The `plots` dictionary contains one or more user-defined `dvc plots` @@ -611,8 +628,6 @@ Refer to [Visualizing Plots] and `dvc plots show` for examples. ## dvc.lock file -> ⚠️ Avoid editing these files. DVC will create and update them for you. - To record the state of your pipeline(s) and help track its outputs, DVC will maintain a `dvc.lock` file for each `dvc.yaml`. Their purposes include: @@ -624,6 +639,12 @@ DVC will maintain a `dvc.lock` file for each `dvc.yaml`. Their purposes include: - Needed for several DVC commands to operate, such as `dvc checkout` or `dvc get`. + + +Avoid editing these files. DVC will create and update them for you. + + + Here's an example: ```yaml