From 33a303c440654bc4e1dc80b143b346b643702840 Mon Sep 17 00:00:00 2001
From: Jorge Orpinel
Date: Wed, 22 Dec 2021 20:27:09 -0600
Subject: [PATCH] ref: `exp init` improvements (#3071)

* ref: first copy edits to exp init
* ref: clarify exp init explanations
* ref: clarify `exp init` option descriptions
* ref: re-describe `exp init` + reorder in nav and `exp` help per https://github.com/iterative/dvc.org/pull/3071#pullrequestreview-825977103
* ref: clarify params.yaml is needed only with defaults in params.yaml per https://github.com/iterative/dvc.org/pull/3071#discussion_r768050363
* ref: clarify what --interactive prompts user for per https://github.com/iterative/dvc.org/pull/3071#pullrequestreview-825981863
* ref: link from exp init to config section and mention --explicit avoids a params.yaml file too.
* ref: simplify exp init --explicit explanation
* ref: explain why params.yaml are required (by default) in exp init per https://github.com/iterative/dvc.org/pull/3071#discussion_r769810080
* ref: copy edits to exp init
* ref: add simple example to exp init rel. https://github.com/iterative/dvc.org/pull/3071#issuecomment-995393831
* ref: use model training example in exp init per https://github.com/iterative/dvc.org/pull/3071#pullrequestreview-838656604
* ref: shorten sample block
---
 content/docs/command-reference/exp/index.md   |  14 +-
 content/docs/command-reference/exp/init.md    | 229 ++++++++++++------
 content/docs/sidebar.json                     |  16 +-
 .../project-structure/pipelines-files.md      |   6 +-
 4 files changed, 172 insertions(+), 93 deletions(-)

diff --git a/content/docs/command-reference/exp/index.md b/content/docs/command-reference/exp/index.md
index 0e1e36bfe8..ca88ccbc93 100644
--- a/content/docs/command-reference/exp/index.md
+++ b/content/docs/command-reference/exp/index.md
@@ -26,19 +26,19 @@ usage: dvc exp [-h] [-q | -v]
 
 positional arguments:
   COMMAND
+    init         Quickly setup any project to use DVC Experiments.
+    run          Reproduce complete or partial experiment pipelines.
     show         Print experiments.
-    apply        Apply the changes from an experiment to your
-                 workspace.
     diff         Show changes between experiments in the DVC
                  repository.
-    run          Reproduce complete or partial experiment pipelines.
-    gc           Garbage collect unneeded experiments.
-    branch       Promote an experiment to a Git branch.
     list         List local and remote experiments.
+    apply        Apply the changes from an experiment to your
+                 workspace.
+    branch       Promote an experiment to a Git branch.
+    remove       Remove local experiments.
+    gc           Garbage collect unneeded experiments.
     push         Push a local experiment to a Git remote.
     pull         Pull an experiment from a Git remote.
-    remove       Remove local experiments.
-    init         Codify project using DVC metafiles to run experiments.
 ```
 
 ## Description
diff --git a/content/docs/command-reference/exp/init.md b/content/docs/command-reference/exp/init.md
index 02184b6ce3..0cc93a7b96 100644
--- a/content/docs/command-reference/exp/init.md
+++ b/content/docs/command-reference/exp/init.md
@@ -1,7 +1,6 @@
 # exp init
 
-Codify project using [DVC metafiles](/doc/user-guide/project-structure) to run
-[experiments](/doc/user-guide/experiment-management).
+Quickly setup any project to use [DVC Experiments].
 
 > Requires a DVC repository, created with `git init` and
 > `dvc init`.
@@ -19,43 +18,60 @@ usage: dvc exp init [-h] [-q | -v] [--run] [--interactive] [-f]
 
 ## Description
 
-`dvc exp init` helps you quickly get started with experiments. It reduces
-boilerplate for initializing [pipeline](/doc/command-reference/dag) stages in a
-`dvc.yaml` file by assuming defaults about the location of your data,
-[parameters](/doc/command-reference/params), source code, models,
-[metrics](/doc/command-reference/metrics) and
-[plots](/doc/command-reference/plots), which can be customized through config.
+`dvc exp init` helps you get started with DVC Experiments quickly. It reduces
+boilerplate DVC procedures by creating a `dvc.yaml` file that assumes standard
+locations of your input data, parameters, source code, models,
+metrics and [plots](/doc/command-reference/plots). These locations
+can be customized through the [options](#options) below or via
+[configuration](/doc/command-reference/config#exp).
 
-It also offers guided `--interactive` mode for creating a stage to be
-[`exp run`](/doc/command-reference/exp/run) later. `dvc exp init` supports
-creating different types of stages, eg: `dl` if you are doing deep learning,
-which uses [dvclive](/doc/dvclive) to monitor and checkpoint progress during
-training of machine learning models.
+Repository structure assumed by default:
 
-This command is intended to be a quick way to start running experiments. To
-create more complex stages and pipelines, use `dvc stage add`.
+```
+├── data/
+├── metrics.json
+├── models/
+├── params.yaml # required
+├── plots/
+└── src/
+```
+
+> Note that `dvc exp init` expects at least a `params.yaml` file present. DVC
+> reads it to find parameters to include in the [stage definition]. It can
+> however be omitted when using the `--explicit` and/or `-i` flags.
 
-> 📖 More context in [Experiments Overview].
+You must always provide a command that runs your experiment(s). It can be given
+either directly [as an argument](#the-command-argument), or by using the
+`--interactive` (`-i`) mode which will prompt you for it. This command will be
+wrapped as a stage that `dvc exp run` can execute.
 
-[experiments overview]:
-  /doc/user-guide/experiment-management/experiments-overview
+Different types of stages are supported, such as `dl` (deep learning) which uses
+[DVCLive](/doc/dvclive) to monitor [checkpoints] during training of ML models.
+
+> `dvc exp init` is intended as a quick way to start running [DVC Experiments].
+> See the `dvc.yaml` specification for complex data pipelines.
+
+[stage definition]:
+  /doc/user-guide/project-structure/pipelines-files#stage-entries
+[checkpoints]: /doc/user-guide/experiment-management/checkpoints
+[dvc experiments]: /doc/user-guide/experiment-management/experiments-overview
 
 ### The `command` argument
 
-The `command` argument is optional, if you are using `--interactive` mode. The
-`command` sent to `dvc exp init` can be anything your terminal would accept and
-run directly, for example a shell built-in, expression, or binary found in
-`PATH`. Please remember that any flags sent after the `command` are interpreted
-by the command itself, not by `dvc exp init`.
+The command given to `dvc exp init` can be anything your system terminal would
+accept and run directly, for example a shell built-in, an expression, or a
+binary found in `PATH`. Please note that any flags sent after the `command`
+argument will normally become part of that command itself and ignored by
+`dvc exp init` (so provide it last).
 
-⚠️ While DVC is platform-agnostic, the commands defined in your
-[pipeline](/doc/command-reference/dag) stages may only work on some operating
-systems and require certain software packages to be installed.
+⚠️ While DVC is platform-agnostic, commands defined in `dvc.yaml` (`cmd` field)
+may only work on some operating systems and require certain software packages or
+libraries in the environment.
 
-Wrap the command with double quotes `"` if there are special characters in it
-like `|` (pipe) or `<`, `>` (redirection), otherwise they would apply to
-`dvc exp init` itself. Use single quotes `'` instead if there are environment
-variables in it that should be evaluated dynamically. Examples:
+Surround the command with double quotes `"` if it includes special characters
+like `|` or `<`, `>` -- otherwise they would apply to `dvc exp init` itself. Use
+single quotes `'` instead if there are environment variables in it that should
+be evaluated dynamically.
 
 ```dvc
 $ dvc exp init "./a_script.sh > /dev/null 2>&1"
@@ -64,67 +80,66 @@ $ dvc exp init './another_script.sh $MYENVVAR'
 
 ## Options
 
-- `-i`, `--interactive` - prompts user for the command to execute and different
-  paths for tracking outputs and dependencies, unless they are provided through
-  arguments explicitly. Interactive mode allows users to set those locations
-  from default values or omit them.
+- `-i`, `--interactive` - prompts user for a command that runs your
+  experiment(s) (see [details](#the-command-argument)) and to confirm or define
+  the paths that conform your repo's structure.
 
-- `--explicit` - `dvc exp init` assumes default location of your outputs and
-  dependencies (which can be overriden from the config). By using `--explicit`,
-  it will not use those default values while initializing experiments. In
-  `--interactive` mode, prompt won't set default value and all the values for
-  the prompt needs to be explicitly provided, or omitted.
+- `-n `, `--name ` - specify a custom name for the stage generated
+  by this command. The default is `train`. It can only contain letters, numbers,
+  dash `-` and underscore `_` (same as `dvc stage add --name`).
 
-- `--code` - override the a path to your source file or directory which your
-  experiment depends on. The default is `src` directory for your code.
+- `--run` - automatically run the experiment after creating the stage (same as
+  `dvc exp run`).
 
-- `--data` - override the path to your data file or directory to track, which
-  your experiment depends on. The default is `data` directory.
+- `--type` - selects the type of the stage to create. Currently it provides two
+  alternatives: `dl` and `default` (no need to specify this one).
 
-- `--params` - override the path to
-  [parameter dependencies](/doc/command-reference/params) which your experiment
-  depends on. The default parameters file name is `params.yaml`. Note that
-  `dvc exp init` may fail if the parameters file does not exist at the time of
-  the invocation, as DVC reads the file to find parameters to track for the
-  stage.
+  `dl` stages are intended for use in deep-learning scenarios, where metrics and
+  plots are tracked with [DVCLive](/doc/dvclive). This also supports logging
+  [checkpoints](/doc/command-reference/exp/run#checkpoints) during the training
+  of DL models.
 
-- `--model` - override the path to your models file or directory to track, which
-  your experiment produces. `dvc exp init` assumes `models` directory by
-  default.
+- `--code` - set the path to the file or directory where the source code that
+  your experiment depends on can be found (if any). Overrides other
+  configuration and default value (`src/`).
 
-- `--metrics` - override the path to metrics file to track, which your
-  experiment produces. Default is `metrics.json` file.
+- `--params` - set the path to the file or directory where the
+  parameters that your experiment depends on can be found.
+  Overrides other configuration and default value (`params.yaml`).
 
-- `--plots` - override the path to plots file or directory, which your
-  experiment produces. The default is `plots`.
+  > Note that `dvc exp init` will fail if the params file does not exist. This
+  > is because DVC reads it to find params to include in the [stage definition].
 
-- `--live` - override the directory `path` for [DVCLive](/doc/dvclive), which
-  your experiment will write logs to. The default is `dvclive` directory, which
-  only comes to effect when used with `--type=dl`.
+- `--data` - set the path to the data file or directory that your experiment
+  depends on can be found (if any). Overrides other configuration and default
+  value (`data/`).
 
-- `--type` - selects the type of the stage to create. Currently it provides two
-  different kinds of stages: `default` and `dl`. If unspecified, `default` stage
-  is created.
+- `--model` - set the path to the file or directory where the model(s) produced
+  by your experiment can be found (if any). Overrides other configuration and
+  default value (`models/`).
 
-  `default` stage creates a stage with `metrics` and `plots` tracked by DVC
-  itself, and does not track live-created artifacts (unless explicitly
-  specified).
+  > 💡 This could be used for any artifacts produced by your experiment.
 
-  `dl` stage is intended for use in deep-learning scenarios, where metrics and
-  plots are tracked by [dvclive](/doc/dvclive) and supports tracking progress
-  while training a deep-learning model with
-  [checkpoints](/doc/command-reference/exp/run#checkpoints).
+- `--metrics` - set the path to the file or directory where the metrics produced
+  by your experiment can be found (if any). Overrides other configuration and
+  default value (`metrics.json`).
 
-- `-n `, `--name ` - specify a custom name for the stage generated
-  by this command (e.g. `-n train`). The default is `train`.
+- `--plots` - set the path to the file or directory where the plots produced by
+  your experiment can be found (if any). Overrides other configuration and
+  default value (`plots/`).
 
-  Note that the stage name can only contain letters, numbers, dash `-` and
-  underscore `_`.
+- `--live` - configure the `path` directory for [DVCLive](/doc/dvclive). This is
+  where experiment logs will be written. Overrides other configuration and
+  default value (`dvclive/`).
 
-- `-f`, `--force` - overwrite an existing stage in `dvc.yaml` file without
-  asking for confirmation.
+  > This only has an effect when used with `--type=dl`.
 
-- `--run` - runs the experiment after initializing it.
+- `--explicit` - do not assume default locations of project dependencies and
+  outputs. You'll have to provide specific locations via other options or
+  `dvc config exp`. In `--interactive` this removes default values from prompts.
+
+- `-f`, `--force` - overwrite an existing stage in `dvc.yaml` file without
+  asking for confirmation (same as `dvc stage add --force`).
 
 - `-h`, `--help` - prints the usage/help message, and exit.
 
@@ -132,3 +147,65 @@ $ dvc exp init './another_script.sh $MYENVVAR'
   problems arise, otherwise 1.
 
 - `-v`, `--verbose` - displays detailed tracing information.
+
+## Example: interactive mode
+
+Let's prepare an ML model training script to start running experiments on it.
+The easiest route is using interactive mode and answering a few questions:
+
+```dvc
+$ dvc exp init --interactive
+This command will guide you to set up a train stage in dvc.yaml...
+
+Command to execute: python src/train.py
+
+Enter the paths for dependencies and outputs of the command.
+DVC assumes the following workspace structure:
+├── data
+├── metrics.json
+├── models
+├── params.yaml
+├── plots
+└── src
+
+Path to a code file/directory [src, n to omit]: src/train.py
+Path to a data file/directory [data, n to omit]: data/features
+Path to a model file/directory [models, n to omit]: models/predict.h5
+Path to a parameters file [params.yaml, n to omit]:
+Path to a metrics file [metrics.json, n to omit]:
+Path to a plots file/directory [plots, n to omit]: n
+...
+```
+
+In this example the code, data, and model locations were specified above to
+avoid using the defaults (which are too broad). `params.yaml` and `metrics.json`
+are accepted (pressed Enter) for parameters and
+metrics. Plots are omitted (entered `n`) as none will be written.
+
+The resulting `dvc.yaml` file codifies the meta-information you provided in
+DVC's format:
+
+```yaml
+train:
+  cmd: python src/train.py
+  deps:
+    - data/features
+    - src/train.py
+  params:
+    - epochs
+  outs:
+    - models/predict.h5
+  metrics:
+    - metrics.json:
+        cache: false
+```
+
+> Notes:
+>
+> - `train` is the default stage name unless you provide one with the `--name`
+>   option.
+> - The `epochs` param was obtained from the `params.yaml` file. Any other param
+>   keys found there would all be listed under `params:` automatically.
+
+The next step would be to tune `params.yaml` or improve `src/train.py` directly,
+and start [running experiments](/doc/command-reference/exp/run).
diff --git a/content/docs/sidebar.json b/content/docs/sidebar.json
index f68a665d58..995ed6f3bd 100644
--- a/content/docs/sidebar.json
+++ b/content/docs/sidebar.json
@@ -254,6 +254,10 @@
         "slug": "exp",
         "source": "exp/index.md",
         "children": [
+          {
+            "label": "exp init",
+            "slug": "init"
+          },
           {
             "label": "exp run",
             "slug": "run"
@@ -262,14 +266,14 @@
             "label": "exp show",
             "slug": "show"
           },
-          {
-            "label": "exp init",
-            "slug": "init"
-          },
           {
             "label": "exp diff",
             "slug": "diff"
           },
+          {
+            "label": "exp list",
+            "slug": "list"
+          },
           {
             "label": "exp apply",
             "slug": "apply"
@@ -293,10 +297,6 @@
           {
             "label": "exp pull",
             "slug": "pull"
-          },
-          {
-            "label": "exp list",
-            "slug": "list"
           }
         ]
       },
diff --git a/content/docs/user-guide/project-structure/pipelines-files.md b/content/docs/user-guide/project-structure/pipelines-files.md
index 57e8052792..e9e0bf2875 100644
--- a/content/docs/user-guide/project-structure/pipelines-files.md
+++ b/content/docs/user-guide/project-structure/pipelines-files.md
@@ -20,8 +20,8 @@ so you may modify, write, or generate stages and pipelines on your own.
 
 ## Stages
 
-The `stages` list contains a list of user-defined stages. Here's a simple one
-named `transpose`:
+The list of `stages` contains one or more user-defined stages. Here's a simple
+one named `transpose`:
 
 ```yaml
 stages:
@@ -33,6 +33,8 @@ stages:
       - columns.txt
 ```
 
+> See also `dvc stage add`, a helper command to write stages in `dvc.yaml`.
+
 The most important part of a stage it's the terminal command(s) it executes
 (`cmd` field). This is what DVC runs when the stage is reproduced (see
 `dvc repro`).
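
Reviewer's sketch (not part of the patch): the interactive example in `init.md` maps default locations, user answers, and the keys found in `params.yaml` onto a stage entry. A minimal Python illustration of that mapping follows. It is not DVC's implementation; `DEFAULT_PATHS` and `build_stage` are hypothetical names, and the shape of the `plots` entry is an assumption made by analogy with `metrics`.

```python
# Hypothetical sketch of the mapping described in the `exp init` docs.
# This is NOT DVC's code; names and the plots format are assumptions.
DEFAULT_PATHS = {
    "code": "src",
    "data": "data",
    "models": "models",
    "params": "params.yaml",
    "metrics": "metrics.json",
    "plots": "plots",
}


def build_stage(cmd, param_keys, overrides=None, omit=()):
    """Return a dict shaped like the stage entry shown in the example."""
    paths = {**DEFAULT_PATHS, **(overrides or {})}
    for key in omit:  # answering "n" at a prompt omits that location
        paths.pop(key, None)

    stage = {"cmd": cmd}
    deps = [paths[k] for k in ("data", "code") if k in paths]
    if deps:
        stage["deps"] = deps
    if "params" in paths and param_keys:
        # Keys read from the params file are listed under `params:`.
        stage["params"] = list(param_keys)
    if "models" in paths:
        stage["outs"] = [paths["models"]]
    if "metrics" in paths:
        stage["metrics"] = [{paths["metrics"]: {"cache": False}}]
    if "plots" in paths:
        # Assumed to mirror the metrics entry; not confirmed by the docs.
        stage["plots"] = [{paths["plots"]: {"cache": False}}]
    return stage


# Answers from the interactive session in the example above.
stage = build_stage(
    "python src/train.py",
    param_keys=["epochs"],
    overrides={
        "code": "src/train.py",
        "data": "data/features",
        "models": "models/predict.h5",
    },
    omit=("plots",),
)
print(stage)
```

Under these assumptions, the returned dictionary has the same fields as the `train` entry in the documented `dvc.yaml` sample, with `plots` absent because it was omitted.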