Add params #1128

Merged
merged 13 commits into from
Apr 11, 2020
116 changes: 116 additions & 0 deletions content/docs/command-reference/params/diff.md
@@ -0,0 +1,116 @@
# params diff

Show changes in [project parameters](/doc/command-reference/params) between
commits in the <abbr>DVC repository</abbr>, or between a commit and the
<abbr>workspace</abbr>.

## Synopsis

```usage
usage: dvc params diff [-h] [-q | -v] [--show-json] [a_rev] [b_rev]

positional arguments:
  a_rev          Old Git commit to compare (defaults to HEAD)
  b_rev          New Git commit to compare (defaults to the
                 current workspace)
```

## Description

This command provides a quick way to compare parameters from your previous
experiments with the current ones of your pipeline, as long as you're using
params that DVC is aware of (see `--params` in `dvc run`). Run without
arguments, it compares the parameters currently present in the
<abbr>workspace</abbr> (uncommitted changes) with the latest committed version.
Only parameters used in at least one stage are shown; parameters not used by any
stage are ignored.
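
Conceptually, the comparison can be pictured as flattening each version of the
params file into dotted names and keeping only the entries whose values differ.
The following is a minimal Python sketch of that idea, not DVC's actual
implementation: `flatten` and `diff_params` are hypothetical helpers, and the
dictionaries stand in for two parsed versions of a params file.

```python
def flatten(tree, prefix=""):
    """Turn a nested dict into {"group.name": value} pairs (hypothetical helper)."""
    flat = {}
    for key, value in tree.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


def diff_params(old_tree, new_tree):
    """Return {name: (old, new)} for every parameter whose value changed."""
    old, new = flatten(old_tree), flatten(new_tree)
    changes = {}
    for name in sorted(set(old) | set(new)):
        if old.get(name) != new.get(name):
            # A parameter missing on one side is reported as None.
            changes[name] = (old.get(name), new.get(name))
    return changes


# Illustrative values only; they stand in for two revisions of params.yaml.
old = {"lr": 0.0038, "train": {"epochs": 70, "layers": 7}}
new = {"lr": 0.0043, "train": {"epochs": 110, "layers": 7}}
changes = diff_params(old, new)
for name, (before, after) in changes.items():
    print(name, before, after)
```

In this sketch, `train.layers` is left out of the result because its value is
the same on both sides, and a parameter with no previous version would get
`None` as its old value.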

## Options

- `--show-json` - prints the command's output in easily parsable JSON format,
  instead of a human-readable table.

- `-h`, `--help` - prints the usage/help message and exits.

- `-q`, `--quiet` - does not write anything to standard output. Exits with 0 if
  no problems arise, otherwise 1.

- `-v`, `--verbose` - displays detailed tracing information.

## Examples

Let's create a simple parameters file and a stage with a params dependency (see
`dvc params` and `dvc run` to learn more):

```dvc
$ cat params.yaml
lr: 0.0041

train:
  epochs: 70
  layers: 9

processing:
  threshold: 0.98
  bow_size: 15000
```

Define a pipeline stage with parameter dependencies:

```dvc
$ dvc run -d users.csv -o model.pkl \
          -p lr,train \
          python train.py
```

Let's print parameter values that we are tracking in this <abbr>project</abbr>:

```dvc
$ dvc params diff
Path         Param         Old   New
params.yaml  lr            None  0.0041
params.yaml  train.layers  None  9
params.yaml  train.epochs  None  70
```

The command showed the difference between the workspace and the last committed
version of the `params.yaml` file, which does not exist yet. This is why all
`Old` values are `None`.

Note that not all parameters were printed. `dvc params diff` prints only changed
parameters that were used in one of the stages, and ignores the parameters from
the `processing` group, which were not used.
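
The selection described above can be sketched in Python as follows. This is an
illustration under assumptions rather than DVC's code: `is_tracked` is a
hypothetical helper, and `params` holds the example file's values under dotted
names.

```python
def is_tracked(name, tracked):
    """True if `name` equals a tracked entry or lies inside a tracked group."""
    return any(name == t or name.startswith(t + ".") for t in tracked)


# Contents of the example params.yaml, written with dotted names.
params = {
    "lr": 0.0041,
    "train.epochs": 70,
    "train.layers": 9,
    "processing.threshold": 0.98,
    "processing.bow_size": 15000,
}
tracked = ["lr", "train"]  # mirrors `-p lr,train` from the stage above

shown = {name: value for name, value in params.items() if is_tracked(name, tracked)}
print(shown)
```

Matching on `t + "."` rather than on a bare prefix keeps a group such as `train`
from accidentally matching an unrelated name like `training_set`.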

In a project with parameter file history you will see both `Old` and `New`
values:

```dvc
$ dvc params diff
Path         Param         Old     New
params.yaml  lr            0.0041  0.0043
params.yaml  train.layers  9       7
params.yaml  train.epochs  70      110
```

To compare the workspace parameters with a specific commit, tag, or other
revision, specify it as an additional command line argument:

```dvc
$ dvc params diff e12b167
Path         Param         Old     New
params.yaml  lr            0.0038  0.0043
params.yaml  train.epochs  70      110
```

Note that the `train.layers` parameter disappeared because its value did not
change between the current version in the workspace and the specified commit.

To see the difference between two specific commits, both need to be specified:

```dvc
$ dvc params diff e12b167 HEAD^
Path         Param         Old     New
params.yaml  lr            0.0038  0.0041
params.yaml  train.layers  10      9
params.yaml  train.epochs  50      70
```
148 changes: 148 additions & 0 deletions content/docs/command-reference/params/index.md
@@ -0,0 +1,148 @@
# params

A set of commands to manage and display experiment parameters:
[diff](/doc/command-reference/params/diff).

## Synopsis

```usage
usage: dvc params [-h] [-q | -v] {diff} ...

positional arguments:
  COMMAND
    diff      Show changes in params between commits in the
              DVC repository, or between a commit and the workspace.
```

## Description

In order to track parameters and hyperparameters associated with machine
learning experiments, DVC provides a special type of <abbr>dependencies</abbr>:
_parameters_ (see the `--params` option of `dvc run`). Parameters are
<abbr>project</abbr> version-specific values (strings, numbers, or arrays), e.g.
`epochs`, `learning-rate`, `batch_size`, `num_classes`, etc.

In contrast to a regular file dependency, a parameter consists of a parameter
_file_ (the file dependency itself) and a parameter _name_ to look for inside
the file. Users can specify dependencies on many parameters from a single
parameters file, as well as dependencies on several different parameters files.

Users write parameters files manually, store them in Git, and specify parameter
dependencies for DVC stages. DVC saves the dependent parameters and their values
in the [DVC-file](/doc/user-guide/dvc-file-format) corresponding to the stage.
These values are compared to the ones in the parameters files whenever
`dvc repro` is used, to determine whether the parameter dependency has been
invalidated.
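
As a rough illustration of this check (a sketch under assumptions, not DVC's
implementation), the values saved for a stage can be compared with the current
ones, and any mismatch marks the stage as needing to run again. The function
`changed_params` and both dictionaries are hypothetical.

```python
def changed_params(saved, current):
    """Names whose current value differs from the value saved for the stage."""
    return sorted(name for name, value in saved.items() if current.get(name) != value)


# Values recorded for the stage (hypothetical example data).
saved = {"lr": 0.0041, "train.epochs": 70, "train.layers": 9}
# Values currently found in the parameters file.
current = {
    "lr": 0.0041,
    "train.epochs": 110,
    "train.layers": 9,
    "processing.threshold": 0.5,  # not a dependency of this stage
}

stale = changed_params(saved, current)
print("stage needs to be reproduced:", bool(stale), stale)
```

Only the parameters the stage depends on are inspected, so the edit to
`processing.threshold` does not affect this stage in the sketch.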

The default parameters file name is `params.yaml`. Parameters should be
organized as a tree hierarchy in the params file; DVC addresses each parameter
by its tree path. The supported parameters file formats are YAML and JSON.
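
To make the tree-path addressing concrete, here is a small Python sketch that
resolves a dotted name against a parsed JSON document. It assumes `.` is the
path separator, as in the `train.epochs` examples in this document; `lookup` is
a hypothetical helper, not part of DVC.

```python
import json


def lookup(tree, dotted_name):
    """Walk a nested dict along a dotted path such as "train.epochs"."""
    node = tree
    for part in dotted_name.split("."):
        node = node[part]  # raises KeyError if the path does not exist
    return node


# A JSON parameters file with the same structure as the YAML examples.
params_json = '{"lr": 0.0041, "train": {"epochs": 70, "layers": 9}}'
tree = json.loads(params_json)

print(lookup(tree, "train.epochs"))
print(lookup(tree, "train"))
```

The second call returns the whole `train` subtree, which is the sense in which a
group of parameters can be addressed by a single name.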

Parameters let you define stage dependencies more granularly: instead of any
change to a file invalidating a stage and requiring its execution, only a change
to a particular parameter or set of parameters does. This prevents situations
where many pipeline stages depend on a single file and any change to that file
invalidates all of them.

Supported parameter value types are: string, integer, float, and arrays. DVC
itself does not ascribe any specific meaning to these parameter values. Usually
they are defined by users and serve as a way to generalize and parametrize a
machine learning algorithm or data processing code.

`dvc run` is used to define parameters, and `dvc params diff` is available to
manage them.

## Options

- `-h`, `--help` - prints the usage/help message and exits.

- `-q`, `--quiet` - do not write anything to standard output.

- `-v`, `--verbose` - displays detailed tracing information.

## Examples

First, let's create a simple parameters file in YAML format, using the default
file name `params.yaml`:

```yaml
lr: 0.0041

train:
  epochs: 70
  layers: 9

processing:
  threshold: 0.98
  bow_size: 15000
```

Define a [stage](/doc/command-reference/run) that depends on params `lr`,
`layers`, and `epochs` from the parameters file above. Full paths should be used
to specify `layers` and `epochs` from the `train` group:

```dvc
$ dvc run -d users.csv -o model.pkl \
          -p lr,train.epochs,train.layers \
          python train.py
```

> Note that the same parameter addressing works with JSON parameters files.

Alternatively, the entire group of parameters `train` can be referenced, instead
of specifying each of the group parameters separately:

```dvc
$ dvc run -d users.csv -o model.pkl \
          -p lr,train \
          python train.py
```

You can see that each parameter and its value were saved in the
[DVC-file](/doc/user-guide/dvc-file-format). These values will be compared to
the ones in the parameters file whenever `dvc repro` is used, to determine
whether the dependency on the parameters file has been invalidated:

```yaml
md5: 05d178cfa0d1474b6c5800aa1e1b34ac
cmd: python train.py
deps:
  - md5: 3aec0a6cf36720a1e9b0995a01016242
    path: users.csv
  - path: params.yaml
    params:
      lr: 0.0041
      train.epochs: 70
      train.layers: 9
```

In the examples above, the default parameters file `params.yaml` was used. The
parameter file name can be redefined with a prefix in the `-p` argument:

```dvc
$ dvc run -d logs/ -o users.csv \
          -p parse_params.yaml:threshold,classes_num \
          python train.py
```
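
The `[<filename>:]<params_list>` form of the `-p` argument can be split into its
two parts as in the Python sketch below. This is a simplified illustration, not
DVC's own parser: `parse_params_arg` and `DEFAULT_PARAMS_FILE` are hypothetical
names.

```python
DEFAULT_PARAMS_FILE = "params.yaml"  # default file name stated in this document


def parse_params_arg(arg):
    """Split "[<filename>:]<params_list>" into (filename, [param names])."""
    # rpartition leaves the separator empty when no ":" is present.
    filename, sep, names = arg.rpartition(":")
    if not sep:
        filename = DEFAULT_PARAMS_FILE
    return filename, [n.strip() for n in names.split(",") if n.strip()]


print(parse_params_arg("lr,train"))
print(parse_params_arg("parse_params.yaml:threshold,classes_num"))
```

Without a prefix, the sketch falls back to `params.yaml`; with a prefix,
everything before the last colon is taken as the file name.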

Now let's print parameter values that we are tracking in this
<abbr>project</abbr>:

## Examples: Print all parameter values in the workspace

Following the previous example, we can use `dvc params diff` to list all of the
available param values associated with DVC-files in the <abbr>workspace</abbr>:

```dvc
$ dvc params diff
Path         Param         Old   New
params.yaml  lr            None  0.0041
params.yaml  train.layers  None  9
params.yaml  train.epochs  None  70
```

This command shows the difference in parameters between the workspace and the
last committed version of the `params.yaml` file. In our example, there's no
previous version, which is why all `Old` values are `None`. See `params diff` to
learn more about the `diff` command.
50 changes: 41 additions & 9 deletions content/docs/command-reference/run.md
Expand Up @@ -6,10 +6,12 @@ command and execute the command.
## Synopsis

```usage
usage: dvc run [-h] [-q | -v] [-d <path>] [-o <path>] [-O <path>]
[-m <path>] [-M <path>] [-f <filename>] [-c <path>]
[-w <path>] [--no-exec] [-y] [--overwrite-dvcfile]
usage: dvc run [-h] [-q | -v] [-d DEPS] [-o OUTS] [-O OUTS_NO_CACHE]
[-p PARAMS] [-m METRICS] [-M METRICS_NO_CACHE] [-f FILE]
[-c CWD] [-w WDIR] [--no-exec] [-y] [--overwrite-dvcfile]
[--ignore-build-cache] [--remove-outs] [--no-commit]
[--outs-persist OUTS_PERSIST]
[--outs-persist-no-cache OUTS_PERSIST_NO_CACHE]
[--always-changed]
command
@jorgeorpinel jorgeorpinel Apr 10, 2020


One last comment: dvc run is getting gargantuan 🐋

Isn't there an issue out there somewhere to split it into 2 commands? May be time to revisit that 😬


Expand All @@ -21,10 +23,11 @@ positional arguments:

`dvc run` provides an interface to describe stages: individual commands and the
data input and output that go into creating a result. By specifying a list of
dependencies (`-d` option) and <abbr>outputs</abbr> (`-o`, `-O`, `-m`, or `-M`
options) DVC can later connect each stage by building a dependency graph
([DAG](https://en.wikipedia.org/wiki/Directed_acyclic_graph)). This graph is
used by DVC to restore a full data [pipeline](/doc/command-reference/pipeline).
dependencies (`-d` option), params (`-p` option) and <abbr>outputs</abbr> (`-o`,
`-O`, `-m`, or `-M` options) DVC can later connect each stage by building a
dependency graph ([DAG](https://en.wikipedia.org/wiki/Directed_acyclic_graph)).
This graph is used by DVC to restore a full data
[pipeline](/doc/command-reference/pipeline).

The remaining terminal input provided to `dvc run` after the command options
(`-`/`--` flags) will become the required `command` argument. Please wrap the
Expand Down Expand Up @@ -93,6 +96,14 @@ data pipeline (e.g. random numbers, time functions, hardware dependency, etc.)
> Note that a DVC-file without dependencies is considered always changed, so
> `dvc repro` always executes it.

- `-p [<filename>:]<params_list>`, `--params [<filename>:]<params_list>` -
  specify a subset of parameters, from a parameters file, that the stage
  depends on. The subset is given as a comma-separated list of params:
  `-p learning_rate,epochs`. By default, the params file is `params.yaml`, but
  a different file can be specified with a prefix:
  `-p parse_params.yaml:threshold`. See `dvc params` to learn more about using
  parameters.

- `-o <path>`, `--outs <path>` - specify a file or directory that is the result
of running the `command`. Multiple outputs can be specified:
`-o model.pkl -o output.log`. DVC builds a dependency graph (pipeline) to
Expand Down Expand Up @@ -187,9 +198,10 @@ $ mkdir example && cd example
$ git init
$ dvc init
$ mkdir data
$ dvc run -d data -o metric -f metric.dvc "echo '1' >> metric"
$ dvc run -d data -o metric -f metric.dvc \
"echo '{ \"AUC\": 0.86252 }' >> metric"
Running command:
echo '1' >> metric
echo '{ "AUC": 0.86252 }' >> metric
WARNING: 'data' is empty.

To track the changes with git, run:
Expand Down Expand Up @@ -218,6 +230,26 @@ $ dvc run -d parsingxml.R -d Posts.xml \
Rscript parsingxml.R Posts.xml Posts.csv
```

Dependency on hyperparameters from the default params file `params.yaml`:

```dvc
$ cat params.yaml
seed: 20180226

train:
  lr: 0.0041
  epochs: 75
  layers: 9

processing:
  threshold: 0.98
  bow_size: 15000

$ dvc run -d matrix-train.p -d train_model.py -o model.p \
          -p seed,train.lr,train.epochs \
          python train_model.py matrix-train.p model.p
```

Extract an XML file from an archive to the `data/` folder:

```dvc
11 changes: 11 additions & 0 deletions content/docs/sidebar.json
Expand Up @@ -284,6 +284,17 @@
"label": "move",
"slug": "move"
},
{
"label": "params",
"slug": "params",
"source": "params/index.md",
"children": [
{
"label": "params diff",
"slug": "diff"
}
]
},
{
"label": "pipeline",
"slug": "pipeline",