From db3989db6b6170c0310efc301b7a074b83eddb77 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Sat, 11 Apr 2020 17:38:47 -0500 Subject: [PATCH] cmd ref: finish params and related docs --- .../docs/command-reference/metrics/diff.md | 8 +- .../docs/command-reference/metrics/index.md | 6 +- .../docs/command-reference/metrics/modify.md | 2 +- content/docs/command-reference/params/diff.md | 72 +++++++------ .../docs/command-reference/params/index.md | 94 ++++++++-------- content/docs/command-reference/run.md | 101 ++++++++++-------- content/docs/glossary.js | 11 +- .../docs/tutorials/get-started/experiments.md | 2 +- content/docs/user-guide/dvc-file-format.md | 22 ++-- 9 files changed, 175 insertions(+), 143 deletions(-) diff --git a/content/docs/command-reference/metrics/diff.md b/content/docs/command-reference/metrics/diff.md index cf3ec32fba..b67a123c77 100644 --- a/content/docs/command-reference/metrics/diff.md +++ b/content/docs/command-reference/metrics/diff.md @@ -22,9 +22,11 @@ positional arguments: This command means to provide a quick way to compare results from your previous experiments with the current results of your pipeline, as long as you're using -metrics that DVC is aware of (see `dvc metrics add`). Run without arguments, -this command compares all existing metric files currently present in the -workspace (uncommitted changes) with the latest committed version. +metrics that DVC is aware of (see `dvc metrics add`). + +Run without arguments, this command compares all existing metric files currently +present in the workspace (including uncommitted changes) with the +latest committed version. The differences shown by this command include the new value, and numeric difference (delta) from the previous value of metrics (with 3-digit accuracy). 
diff --git a/content/docs/command-reference/metrics/index.md b/content/docs/command-reference/metrics/index.md index da966a8dda..bceb7f2ea6 100644 --- a/content/docs/command-reference/metrics/index.md +++ b/content/docs/command-reference/metrics/index.md @@ -30,9 +30,9 @@ optional arguments: ## Description -In order to track metrics associated to machie learning experiments, DVC has the -ability to mark a certain stage outputs as files containing metrics -to track. (See the `--metrics` option of `dvc run`.) Metrics are +In order to track metrics associated to machine learning experiments, DVC has +the ability to mark a certain stage outputs as files containing +metrics to track. (See the `--metrics` option of `dvc run`.) Metrics are project-specific floating-point values e.g. `AUC`, `ROC`, etc. Supported file formats: JSON. Metrics can be organized in a tree hierarchy in a diff --git a/content/docs/command-reference/metrics/modify.md b/content/docs/command-reference/metrics/modify.md index 938ec21ba8..7295e1da6e 100644 --- a/content/docs/command-reference/metrics/modify.md +++ b/content/docs/command-reference/metrics/modify.md @@ -90,7 +90,7 @@ $ dvc metrics show metrics.json ``` Okay. Let's now imagine we are interested only in a single value of true -posivives (TP). We can specify the `JSON` type (`-t`) and an `xpath` (`-x`) to +positives (TP). We can specify the `JSON` type (`-t`) and an `xpath` (`-x`) to extract the TP value: ```dvc diff --git a/content/docs/command-reference/params/diff.md b/content/docs/command-reference/params/diff.md index f1091b251e..925a945d51 100644 --- a/content/docs/command-reference/params/diff.md +++ b/content/docs/command-reference/params/diff.md @@ -1,6 +1,6 @@ # params diff -Show changes in [project parameters](/doc/command-reference/params), between +Show changes in [parameter dependencies](/doc/command-reference/params) between commits in the DVC repository, or between a commit and the workspace. 
@@ -17,16 +17,21 @@ positional arguments: ## Description -This command means to provide a quick way to compare parameters from your -previous experiments with the current ones of your pipeline, as long as you're -using params that DVC is aware of. The dependencies to parameters can be defined -by `--params` or `-p` option in `dvc run`. To learn more about parameters see -[project parameters](/doc/command-reference/params). +This command provides a quick way to compare parameter values from your previous +experiments with the current one(s) in your workspace. -Run without arguments, this command compares all existing parameters currently -present in the workspace (uncommitted changes) with the latest -committed version. The command shows only parameters that were used in any of -stages and ignores parameters that were not used. +> Parameter dependencies are defined with the `-p` option in `dvc run`. See also +> `dvc params`. + +Run without arguments, this command compares parameters currently present in the +workspace (including uncommitted changes) with the latest committed +version. + +❗ It only shows parameters used in any of the currently present +[stage files](/doc/command-reference/run) (DVC-files). + +Supported parameter _value_ types are: string, integer, float, and arrays. DVC +itself does not ascribe any specific meaning for these values. ## Options @@ -42,23 +47,23 @@ stages and ignores parameters that were not used. 
## Examples -Let's create a simple parameters file and a stage with params dependency (See -`dvc params` and `dvc run` to learn more): +Let's create a simple YAML parameters file named `params.yaml` (default params +file name, see `dvc params` to learn more): -```dvc -$ cat params.yaml +```yaml lr: 0.0041 train: - epochs: 70 - layers: 9 + epochs: 70 + layers: 9 processing: - threshold: 0.98 - bow_size: 15000 + threshold: 0.98 + bow_size: 15000 ``` -Define a pipeline stage with dependencies to parameters: +Define a pipeline [stage](/doc/command-reference/run) with parameter +dependencies: ```dvc $ dvc run -d users.csv -o model.pkl \ @@ -66,7 +71,8 @@ $ dvc run -d users.csv -o model.pkl \ python train.py ``` -Let's print parameter values that we are tracking in this project: +Let's now print parameter values that we are tracking in this +project: ```dvc $ dvc params diff @@ -76,16 +82,16 @@ params.yaml train.layers None 9 params.yaml train.epochs None 70 ``` -The command showed the difference between the workspace and the last commited -version of the `params.yaml` file which does not exist yet. This is why all -`Old` values are `None`. +The command above shows the difference in parameters between the workspace and +the last committed version of the params file `params.yaml`. Since it did not +exist before, all `Old` values are `None`. -Note, not all the parameter were printed. `dvc params diff` prints only changed -parameters that were used in one of the stages and ignors parameters from the -group `processing` that were not used. +❗ Note that not all the parameters were printed. `dvc params diff` prints only +changed parameters that were used in one of the stages and ignores parameters +from the group `processing` that were not used. 
-In a project with parameter file history you will see both `Old` and `New` -values: +In a project with parameters file history (params present in various Git +commits), you will see both `Old` and `New` values: ```dvc $ dvc params diff @@ -95,8 +101,9 @@ params.yaml train.layers 9 7 params.yaml train.epochs 70 110 ``` -To compare parameters with a specific commit, tag or revision it should be -specified as an additional command line parameter: +To compare parameters with a specific commit, a tag or any +[revision](https://git-scm.com/docs/revisions) should be specified as an +additional command line parameter: ```dvc $ dvc params diff e12b167 @@ -105,8 +112,9 @@ params.yaml lr 0.0038 0.0043 params.yaml train.epochs 70 110 ``` -Note, the `train.layers` parameter dissapeared because its value was not changed -between the current version in the workspace and the defined one. +Note that the `train.layers` parameter disappeared because its value was not +changed between the current version in the workspace and the given one +(`e12b167`). To see the difference between two specific commits, both need to be specified: diff --git a/content/docs/command-reference/params/index.md b/content/docs/command-reference/params/index.md index 170183d8f9..6aea5a7206 100644 --- a/content/docs/command-reference/params/index.md +++ b/content/docs/command-reference/params/index.md @@ -1,7 +1,7 @@ # params -Contains a command to show changes in parameters defined with the `-p` option of -`dvc run`: [diff](/doc/command-reference/params/diff). +Contains a command to show changes in parameters: +[diff](/doc/command-reference/params/diff). ## Synopsis @@ -17,40 +17,49 @@ positional arguments: ## Description In order to track parameters and hyperparameters associated to machine learning -experiments, DVC provides a special type of dependencies: -_parameters_ (see the `--params` option of `dvc run`). Parameters are -project version-specific string or array values e.g. 
`epochs`, -`learning-rate`, `batch_size`, `num_classes` etc. - -In contrast to a regular file dependency, a parameter consists of a parameter -_file_ (the file dependency itself) and a parameter _name_ to look for inside -the file. User can specify dependencies to many parameters from a single -parameters file as well as many dependencies from different parameters files. - -Users manualy write parameters file, store them in Git and specify parameter -dependencies for DVC stages. DVC saves the dependent parameters and their values -in the [DVC-file](/doc/user-guide/dvc-file-format) corresponded to the stage. -These values will be compared to the ones in the parameter files whenever -`dvc repro` is used, to determine if dependency to the parameter is invalidated. +experiments in DVC projects, DVC provides a special type of +dependencies: _parameters_. Parameter values are specific to a +version of the project, and defined (using `dvc run`) with simple names like +`epochs`, `learning-rate`, `batch_size`, etc. -The default parameters file name is `params.yaml`. Parameters should be -organized as a tree hierarchy in the params file. DVC addresses the parameters -by the tree path. Supported file formats for parameter file are: YAML and JSON. - -The parameters concept helps to define stage dependencies more granularly when -not only a file change invalidate a stage and requires the stage execution but a -particular parameter or a set of parameters change is required for the stage -invalidation. As a result, it prevents situations when many pipeline stages -depends on a single file and any change in the file invalidates all of these -stages. - -Supported parameter value types are: string, integer, float values and arrays. -DVC itself does not ascribe any specific meaning for these parameter values. -Usually these values are defined by users and serve as a way to generalize and -parametrize an machine learning algorithm or data processing code. 
+In contrast to a regular text file dependency, a parameter dependency consists +of a parameters _file_ (the file dependency itself) and a parameter _name_ (to +find inside the text file). Multiple parameter dependencies can be specified +from one or more parameters files. -`dvc run` is used to define parameters, and `dvc params diff` is available to -manage them. +The default parameters file name is `params.yaml`. Parameters should be +organized as a tree hierarchy in it, as DVC will locate param names by their +tree path. Supported file formats for params files are: YAML and JSON. + +Supported parameter _value_ types are: string, integer, float, and arrays. DVC +itself does not ascribe any specific meaning for these values. They are +user-defined, and serve as a way to generalize and parametrize machine +learning algorithms or data processing code. + +### Benefits and workflow + +The parameters concept helps to define [stage](/doc/command-reference/run) +dependencies more granularly. Only a change to a particular parameter or set of +parameters will invalidate the stage (see `dvc status` and +`dvc repro`). Changes to other parts of the dependency file will not affect the +stage. + +Using parameter dependencies prevents situations where several +[pipeline](/doc/command-reference/pipeline) stages depend on the same file, and +any change in the file causes the reproduction of all those stages +unnecessarily. + +You should manually write or generate the YAML or JSON parameters files needed +for the project, which can be versioned directly with Git. You can then use +`dvc run` with the `-p` (`--params`) option to specify parameter dependencies +for your pipeline's stages (instead of or in addition to regular `-d` deps). DVC +saves the param names and values in the stage file (see +[DVC-file format](/doc/user-guide/dvc-file-format)). 
These values will be +compared to the ones in the params files to determine if the stage is +invalidated upon pipeline [reproduction](/doc/command-reference/repro). + +`dvc params diff` is available to show changes in parameters, displaying the +param names as well as their current and previous values. ## Options @@ -78,8 +87,8 @@ processing: ``` Define a [stage](/doc/command-reference/run) that depends on params `lr`, -`layers`, and `epochs` from the parameters file above. Full paths should be used -to specify `layers` and `epochs` from the `train` group: +`layers`, and `epochs` from the params file above. Full paths should be used to +specify `layers` and `epochs` from the `train` group: ```dvc $ dvc run -d users.csv -o model.pkl \ @@ -87,7 +96,7 @@ $ dvc run -d users.csv -o model.pkl \ python train.py ``` -> Note that we could use the same parameters addressation with JSON parameters +> Note that we could use the same parameter addressing with JSON parameters > files. Alternatively, the entire group of parameters `train` can be referenced, instead @@ -101,8 +110,8 @@ $ dvc run -d users.csv -o model.pkl \ You can find that each parameter and it's value were saved in the [DVC-file](/doc/user-guide/dvc-file-format). These values will be compared to -the ones in the parameter files whenever `dvc repro` is used, to determine if -dependency to the parameter file is invalidated: +the ones in the parameters files whenever `dvc repro` is used, to determine if +dependency to the params file is invalidated: ```yaml md5: 05d178cfa0d1474b6c5800aa1e1b34ac @@ -117,8 +126,8 @@ deps: train.layers: 9 ``` -In the examples above, the default parameters file `params.yaml` was used. The -parameter file name can be redefined with a prefix in the `-p` argument: +In the examples above, the default parameters file name `params.yaml` was used. 
+This file name can be redefined with a prefix in the `-p` argument: ```dvc $ dvc run -d logs/ -o users.csv \ @@ -126,9 +135,6 @@ $ dvc run -d logs/ -o users.csv \ python train.py ``` -Now let's print parameter values that we are tracking in this -project: - ## Examples: Print all parameter values in the workspace Following the previous example, we can use `dvc params diff` to list all of the diff --git a/content/docs/command-reference/run.md b/content/docs/command-reference/run.md index 6aa586d9d7..0395c985b6 100644 --- a/content/docs/command-reference/run.md +++ b/content/docs/command-reference/run.md @@ -22,12 +22,13 @@ positional arguments: ## Description `dvc run` provides an interface to describe stages: individual commands and the -data input and output that go into creating a result. By specifying a list of -dependencies (`-d` option), params (`-p` option) and outputs (`-o`, -`-O`, `-m`, or `-M` options) DVC can later connect each stage by building a -dependency graph ([DAG](https://en.wikipedia.org/wiki/Directed_acyclic_graph)). -This graph is used by DVC to restore a full data -[pipeline](/doc/command-reference/pipeline). +data input and output that go into creating a result. By specifying lists of +dependencies (`-d` option), +[parameters](/doc/command-reference/params) (`-p` option), outputs +(`-o`, `-O` options), and/or [metrics](/doc/command-reference/metrics) (`-m`, +`-M` options), DVC can later connect each stage by building a dependency graph +([DAG](https://en.wikipedia.org/wiki/Directed_acyclic_graph)). This graph is +used by DVC to restore a full data [pipeline](/doc/command-reference/pipeline). The remaining terminal input provided to `dvc run` after the command options (`-`/`--` flags) will become the required `command` argument. Please wrap the @@ -73,8 +74,8 @@ commands, they should ideally follow these rules: at `dvc repro`). - Stop reading and writing files when the `command` exits. 
-Keep in mind that if the pipeline's reproducibility goals include consistent -output data, its code should be as +Keep in mind that if the [pipeline](/doc/command-reference/pipeline)'s +reproducibility goals include consistent output data, its code should be as [deterministic](https://en.wikipedia.org/wiki/Deterministic_algorithm) as possible (produce the same output for a given input). In this case, avoid code that brings [entropy](https://en.wikipedia.org/wiki/Software_entropy) into your @@ -88,21 +89,22 @@ data pipeline (e.g. random numbers, time functions, hardware dependency, etc.) with data, or a code file, or a configuration file. DVC also supports certain [external dependencies](/doc/user-guide/external-dependencies). - DVC builds a dependency graph connecting different stages with each other. - When you use `dvc repro`, the list of dependencies helps DVC analyze whether - any dependencies have changed and thus executing stages as required to - regenerate their output. A special case is when no dependencies are specified. + DVC builds a dependency graph ([pipeline](/doc/command-reference/pipeline)) + connecting different stages with each other. When you use `dvc repro`, the + list of dependencies helps DVC analyze whether any dependencies have changed + and thus executing stages as required to regenerate their output. A special + case is when no dependencies are specified. > Note that a DVC-file without dependencies is considered always changed, so > `dvc repro` always executes it. - `-p [:]`, `--params [:]` - - specify a subset of parameters from a parameter file the stage depends on. The - params subset can be specified by coma separated params list: - `-p learning_rate,epochs`. By default, the params file is `params.yaml` but - this value can be redefined with params prefix: - `-p parse_params.yaml:threshold` See `dvc params` to learn more about using - parameters. 
+ specify a set of [parameter dependencies](/doc/command-reference/params) for + the stage, from a parameters file. This is done by sending a comma-separated + list as argument, e.g. `-p learning_rate,epochs`. The default + parameters file name is `params.yaml`, but this can be redefined with a prefix + in the argument sent to this option, e.g. `-p parse_params.yaml:threshold`. + See `dvc params` to learn more about parameters. - `-o `, `--outs ` - specify a file or directory that is the result of running the `command`. Multiple outputs can be specified: @@ -212,8 +214,9 @@ To track the changes with git, run: > See [DVC-File Format](/doc/user-guide/dvc-file-format) for more details on the > text format above. -Execute a Python script as a DVC pipeline stage. The stage file name is not -specified, so a `model.p.dvc` DVC-file is created: +Execute a Python script as a DVC [pipeline](/doc/command-reference/pipeline) +stage. The stage file name is not specified, so a `model.p.dvc` DVC-file is +created by default based on the registered output (`-o`): ```dvc # Train ML model on the training dataset. 20180226 is a seed value. @@ -222,34 +225,53 @@ $ dvc run -d matrix-train.p -d train_model.py \ python train_model.py matrix-train.p 20180226 model.p ``` -Execute an R script as s DVC pipeline stage: +Place the stage file in a subdirectory: ```dvc -$ dvc run -d parsingxml.R -d Posts.xml \ - -o Posts.csv \ - Rscript parsingxml.R Posts.xml Posts.csv +$ dvc run -d test.txt -f stages/test.dvc -o result.out \ + "cat test.txt | wc -l > result.out" + +$ tree . + +. 
+├── result.out +├── stages +│   └── test.dvc +└── test.txt ``` -Dependency to hyperparameters from the default params file `params.yaml`: +## Example: Using specific hyperparameter dependencies -```dvc -$ cat params.yaml +To use granular [parameter dependencies](/doc/command-reference/params), create +a simple YAML parameters file named `params.yaml` (default params file name, see +`dvc params` to learn more): + +```yaml seed: 20180226 train: - lr: 0.0041 - epochs: 75 - layers: 9 + lr: 0.0041 + epochs: 75 + layers: 9 processing: - threshold: 0.98 - bow_size: 15000 + threshold: 0.98 + bow_size: 15000 +``` + +Define a pipeline stage with both regular and parameter dependencies: +```dvc $ dvc run -d matrix-train.p -d train_model.py -o model.p \ -p seed,train.lr,train.epochs python train_model.py matrix-train.p model.p ``` +## Example: chaining stages (build a pipeline) + +DVC [pipelines](/doc/command-reference/pipeline) are constructed by connecting +one stage's outputs to the next's dependencies: + Extract an XML file from an archive to the `data/` folder: ```dvc @@ -260,17 +282,10 @@ $ dvc run -d Posts.xml.zip \ unzip Posts.xml.zip -d data/ ``` -Place the generated stage file (DVC-file) into a subdirectory: +Execute an R script: ```dvc -$ dvc run -d test.txt -f stages/test.dvc -o result.out \ - "cat test.txt | wc -l > result.out" - -$ tree . - -. -├── result.out -├── stages -│   └── test.dvc -└── test.txt +$ dvc run -d parsingxml.R -d data/Posts.xml \ + -o data/Posts.csv \ + Rscript parsingxml.R data/Posts.xml data/Posts.csv ``` diff --git a/content/docs/glossary.js b/content/docs/glossary.js index fe982958ac..040e57bab2 100644 --- a/content/docs/glossary.js +++ b/content/docs/glossary.js @@ -44,17 +44,18 @@ For more details, please refer to this [document] name: 'Output', match: ['output', 'outputs'], desc: ` -A file or directory that is under DVC control, recorded in the \`outs\` section -of a DVC-file. 
See \`dvc add\` \`dvc run\`, \`dvc import\`, \`dvc import-url\` -commands. A.k.a. **data artifact*. +A file or directory tracked by DVC, recorded in the \`outs\` section of a +DVC-file. Outputs are usually the result of stages. A.k.a. **data artifact**. +See \`dvc add\`, \`dvc run\`, \`dvc import\`, et al. ` }, { name: 'Dependency', match: ['dependency', 'dependencies'], desc: ` -A file or directory (possibly under DVC control) recorded in the \`deps\` -section of a DVC-file. See \`dvc run\`. +A file or directory (possibly tracked by DVC) recorded in the \`deps\` section +of a DVC-file (stage file). See \`dvc run\`. Stages are invalidated when any of +their dependencies change. ` }, { diff --git a/content/docs/tutorials/get-started/experiments.md b/content/docs/tutorials/get-started/experiments.md index b716872a2e..c99199c5e0 100644 --- a/content/docs/tutorials/get-started/experiments.md +++ b/content/docs/tutorials/get-started/experiments.md @@ -1,7 +1,7 @@ # Experiments Data science process is inherently iterative and R&D like. Data scientist may -try many different approaches, different hyper-parameter values, and "fail" many +try many different approaches, different hyperparameter values, and "fail" many times before the required level of a metric is achieved. 
DVC is built to provide a way to capture different experiments and navigate diff --git a/content/docs/user-guide/dvc-file-format.md b/content/docs/user-guide/dvc-file-format.md index b33277c59d..e3256be232 100644 --- a/content/docs/user-guide/dvc-file-format.md +++ b/content/docs/user-guide/dvc-file-format.md @@ -44,23 +44,25 @@ meta: # Special field to contain arbitary user data ## Structure -On the top level, `.dvc` file consists of these fields: +On the top level, a `.dvc` file consists of these possible fields: - `cmd`: Executable command defined in this stage +- `wdir`: Directory to run command in (default `.`) +- `md5`: MD5 hash for this DVC-file - `deps`: List of dependencies for this stage - `outs`: List of outputs for this stage -- `md5`: MD5 hash for this DVC-file - `locked`: Whether or not this stage is locked from reproduction -- `wdir`: Directory to run command in (default `.`) - `always_changed`: Whether or not this stage should always be considered as changed by commands such as `dvc status` and `dvc repro` (default `false`) -A dependency entry consists of a pair of fields: +A dependency entry consists of these possible fields: - `path`: Path to the dependency, relative to the `wdir` path (always present) - `md5`: MD5 hash for the dependency (most [stages](/doc/command-reference/run)) - `etag`: Strong ETag response header (only HTTP external dependencies created with `dvc import-url`) +- `params`: If this is a [parameter dependency](/doc/command-reference/params) + file, contains a list of the parameter names and their current values. 
- `repo`: This entry is only for external dependencies created with `dvc import`, and can contains the following fields: @@ -81,14 +83,12 @@ An output entry consists of these fields: - `path`: Path to the output, relative to the `wdir` path - `md5`: MD5 hash for the output - `cache`: Whether or not dvc should cache the output -- `metric`: Whether or not this file is a - [metric](/doc/command-reference/metrics) file - -A metric entry consists of these fields: +- `metric`: If this file is a [metric](/doc/command-reference/metrics), contains + the following fields: -- `type`: Type of the metric file (e.g. raw/json/tsv/htsv/csv/hcsv) -- `xpath`: Path within the metric file to the metrics data(e.g. `AUC.value` for - `{"AUC": {"value": 0.624321}}`) + - `type`: Type of the metric file (`json`) + - `xpath`: Path within the metric file to the metrics data (e.g. `AUC.value` + for `{"AUC": {"value": 0.624321}}`) A `meta` entry consists of `key: value` pairs such as `name: John`. A meta entry can have any valid YAML structure containing any number of attributes.
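
Reviewer note: the parameter-dependency check these docs describe (param names located by tree path, saved values compared with the current params file) can be illustrated with a rough Python sketch. This is not DVC's implementation. The helper names `lookup` and `changed_params` and the sample values are hypothetical, and JSON is used only because it is one of the supported params formats and needs nothing beyond the standard library.

```python
import json


def lookup(tree, dotted_name):
    """Follow a dotted path such as 'train.epochs' through nested dicts."""
    node = tree
    for key in dotted_name.split("."):
        node = node[key]  # raises KeyError if the path does not exist
    return node


def changed_params(saved, current_tree):
    """Return {name: (old, new)} for saved params whose value now differs."""
    changes = {}
    for name, old in saved.items():
        new = lookup(current_tree, name)
        if new != old:
            changes[name] = (old, new)
    return changes


# Hypothetical content of a JSON params file in the workspace.
current = json.loads(
    '{"lr": 0.0043, "train": {"epochs": 110, "layers": 9},'
    ' "processing": {"threshold": 0.98}}'
)
# Hypothetical values as they might be saved in a stage file's `params` entry.
saved = {"lr": 0.0038, "train.epochs": 70, "train.layers": 9}

diff = changed_params(saved, current)
for name, (old, new) in sorted(diff.items()):
    print(name, old, new)
```

Only `lr` and `train.epochs` are reported here. `train.layers` is unchanged, and `processing.threshold` is never looked at because no saved entry depends on it, which mirrors the behavior the `params diff` examples describe.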