Get rid of CSV/TSV metrics #1097

Merged
merged 8 commits into from
Apr 7, 2020
65 changes: 30 additions & 35 deletions content/docs/command-reference/metrics/add.md
@@ -18,42 +18,36 @@ defines the given `path` as an <abbr>output</abbr>, marking `path` as a metric
file to track.

Note that outputs can also be marked as metrics via the `-m` or `-M` options of
`dvc run`.
`dvc run`. We recommend using the `-M` option to keep metrics in Git history.

While any text file can be tracked as a metric file, we recommend using TSV,
CSV, or JSON formats. DVC provides a way to parse those formats to get to a
specific value, if the file contains multiple metrics. See the
[options](#options) below and `dvc metrics show` for more info.
While any text file can be tracked as a metric file, we recommend using JSON
formats. DVC provides a way to parse this formats to get to a specific value, if
the file contains multiple metrics. See the [options](#options) below and
`dvc metrics diff` for more info.
Comment on lines +23 to +26
Contributor

Suggested change
While any text file can be tracked as a metric file, we recommend using JSON
formats. DVC provides a way to parse this formats to get to a specific value, if
the file contains multiple metrics. See the [options](#options) below and
`dvc metrics diff` for more info.
While any text file can be tracked as a metric file, we recommend using JSON
format. DVC provides a way to parse this format to get to a specific value, if
the file contains multiple metric values. See the [options](#options) below,
`dvc metrics show`, and `dvc metrics diff` for more info.

Notice I added `dvc metrics show` to the list, which I understand is the only subcommand we're sure should keep the `--xpath` option. Is this correct?


> Note that [external output](/doc/user-guide/managing-external-data) cannot be
> marked as project metrics.

## Options

- `-t <type>`, `--type <type>` - specify a type for the metric file. Accepted
values are: `raw` (default), `json`, `tsv`, `htsv`, `csv`, `hcsv`. It will be
saved into the corresponding DVC-file, and used by `dvc metrics show` to
determine how to handle displaying metrics.
values are: `raw` (default), `json`. It will be saved into the corresponding
DVC-file, and used by `dvc metrics show` to determine how to handle displaying
metrics.

`raw` means that no additional parsing is applied, and `--xpath` is ignored.
`htsv`/`hcsv` are the same as `tsv`/`csv`, but the values in the first row of
the file will be used as the field names and should be used to address columns
in the `--xpath` option.

- `-x <path>`, `--xpath <path>` - specify a path within a metric file to get a
specific metric value. Should be used if the metric file contains multiple
numbers and you want to use only one of them. Only a single path is allowed.
It will be saved into the corresponding DVC-file, and used by
`dvc metrics show` to determine how to display metrics. The accepted value
depends on the metric file type (`--type` option):
`dvc metrics show` and `dvc metrics diff` to determine how to display metrics.
The accepted value depends on the metric file type (`--type` option):

- For `json` - see [JSONPath](https://goessner.net/articles/JsonPath/) or
[jsonpath-ng](https://github.com/h2non/jsonpath-ng) to know the syntax. For
example, `"AUC"` extracts the value from the following JSON-formatted metric
file: `{"AUC": "0.624652"}`.
- For `tsv`/`csv` - `row,column` e.g. `1,2`. Indices are 0-based.
- For `htsv`/`hcsv` - `row,column name` e.g. `0,Name`. Row index is 0-based.
First row is used to specify column names and is not included into index.

- `-h`, `--help` - prints the usage/help message, and exit.
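
To make the `--xpath` description above more concrete, here is a minimal Python sketch of what selecting `AUC` from a JSON metric file amounts to. It is an illustration only, not DVC's implementation: the docs point to JSONPath and jsonpath-ng for the real syntax, whereas this sketch assumes a single top-level key and uses only the standard `json` module. The function name `select_metric` is hypothetical.

```python
import json


def select_metric(metric_text, key):
    """Parse the text of a JSON metric file and keep only the requested key.

    Hypothetical helper: supports one top-level key, not full JSONPath.
    """
    metrics = json.loads(metric_text)
    return {key: metrics[key]}


# Same shape as the metrics.json file created in the examples on this page.
content = '{"AUC": 0.9643, "TP": 527}'
print(select_metric(content, "AUC"))
```

A plain dictionary lookup is enough for flat files like this one; nested files or wildcard selections would need a real JSONPath library.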

@@ -68,22 +62,24 @@ Let's first create a regular <abbr>output</abbr> with the `-o` option of
`dvc run`:

```dvc
$ dvc run -o metrics.txt "echo 0.9643 > metrics.txt"
$ dvc run -o metrics.json \
'echo {\"AUC\": 0.9643, \"TP\": 527} > metrics.json'
```

Even when we named this output file `metrics.txt`, DVC won't know that it's a
metric if we don't specify so. The content of stage file `metrics.txt.dvc` (a
Even when we named this output file `metrics.json`, DVC won't know that it's a
metric if we don't specify so. The content of stage file `metrics.json.dvc` (a
[DVC-file](/doc/user-guide/dvc-file-format)) should look like this: (Notice the
`metric: false` field.)

```yaml
cmd: echo 0.9643 > metrics.txt
md5: f75f291b02ab38530ba659c1e10e577f
md5: 906ea9489e432c85d085b248c712567b
cmd: echo {\"AUC\":0.9643, \"TP\":527} > metrics.json
outs:
  - cache: true
    md5: 235d585fcea283135682457b15c76101
  - md5: 0f0e67dc927aa69cd3fc37435ee1304f
    path: metrics.json
    cache: true
    metric: false
    path: metrics.txt
    persist: false
```

If you run `dvc metrics show` now, you should get an error message:
@@ -97,27 +93,26 @@ ERROR: failed to show metrics - no metric files in
Now, let's mark the output as a metric:

```dvc
$ dvc metrics add metrics.txt

Saving information to 'metrics.txt.dvc'.
$ dvc metrics add metrics.json
```

This command updates `metrics.txt.dvc` to specify that `metrics.txt` is actually
a metric file:
This command updates `metrics.json.dvc` to specify that `metrics.json` is
actually a metric file:

```yaml
cmd: echo 0.9643 > metrics.txt
md5: f75f291b02ab38530ba659c1e10e577f
md5: 906ea9489e432c85d085b248c712567b
cmd: echo {\"AUC\":0.9643, \"TP\":527} > metrics.json
outs:
  - cache: true
    md5: 235d585fcea283135682457b15c76101
  - md5: 0f0e67dc927aa69cd3fc37435ee1304f
    path: metrics.json
    cache: true
    metric:
      type: raw
    path: metrics.txt
    persist: false
```

And if you run `dvc metrics show` you should now see a report like this:

```dvc
metrics.txt: 0.9643
metrics.json: {"AUC":0.9643, "TP":527}
```
54 changes: 20 additions & 34 deletions content/docs/command-reference/metrics/diff.md
@@ -42,9 +42,8 @@ They're calculated between two commits (hash, branch, tag, or any
no directories among the `targets`, this option is ignored.

- `-t <type>`, `--type <type>` - specify a type of the metric file. Accepted
values are: `raw` (default), `json`, `tsv`, `htsv`, `csv`, `hcsv`. It will be
used to determine how to parse and format metrics for display. See
`dvc metrics show` for more details.
values are: `raw` (default), `json`. It will be used to determine how to parse
and format metrics for display. See `dvc metrics show` for more details.

This option will override `type` and `xpath` defined in the corresponding
DVC-file. If no `type` is provided or found in the DVC-file, DVC will try to
@@ -68,53 +67,40 @@ They're calculated between two commits (hash, branch, tag, or any

## Examples

Let's employ a simple <abbr>workspace</abbr> with some data, code, ML models,
pipeline stages, such as the <abbr>DVC project</abbr> created in our
[Get Started](/doc/tutorials/get-started) section. Then we can see what happens
with `dvc install` in different situations.

<details>

### Click and expand to setup the project

Start by cloning our example repo if you don't already have it:
Start by creating a simple metrics file and commit it:

```dvc
$ git clone https://github.com/iterative/example-get-started
$ cd example-get-started
$ dvc run -M metrics.json \
'echo {\"AUC\": 0.9643, \"TP\": 527} > metrics.json'
$ git add metrics.json metrics.json.dvc
$ git commit -m "Add metrics file"
```

</details>

Notice that we have an `auc.metric` metric file:

```
$ cat auc.metric
0.602818
$ cat metrics.json
{"AUC":0.9643, "TP":527}
```

Now let's mock a change in our AUC metric:

```
$ echo '0.5' > auc.metric
$ echo {\"AUC\":0.9671, \"TP\":531} > metrics.json

$ git diff
--- a/metrics.json
+++ b/metrics.json
@@ -1 +1 @@
-{"AUC":0.9643, "TP":527}
+{"AUC":0.9671, "TP":531}
```

To see the change, let's run `dvc metrics diff`. This compares our current
<abbr>workspace</abbr> (including uncommitted local changes) metrics to what we
had in the previous commit:

```
$ git diff
--- a/auc.metric
+++ b/auc.metric
@@ -1 +1 @@
-0.602818
+0.5

$ dvc metrics diff
Path Metric Value Change
auc.metric 0.500 -0.103
Path          Metric   Value    Change
metrics.json  TP       531      4
metrics.json  AUC      0.967    0.003
```

> Note that metric files are typically versioned with Git, so we can use both
> `git diff` and `dvc metrics diff` to understand their changes, as seen above.
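
The arithmetic behind the report above can be sketched in a few lines of Python. This is only an illustration of the idea (new value minus old value, rounded for display), not DVC's actual code; the helper name `metrics_diff` and the choice of three decimal places are assumptions made to match the table shown.

```python
import json


def metrics_diff(old_text, new_text, precision=3):
    """Return {metric: (new_value, change)} for metrics present in both files.

    Hypothetical helper: assumes flat JSON objects with numeric values.
    """
    old = json.loads(old_text)
    new = json.loads(new_text)
    result = {}
    for name, new_value in new.items():
        if name in old:
            change = new_value - old[name]
            result[name] = (round(new_value, precision), round(change, precision))
    return result


# File contents before and after the mocked change from the example above.
committed = '{"AUC":0.9643, "TP":527}'
workspace = '{"AUC":0.9671, "TP":531}'
for name, (value, change) in metrics_diff(committed, workspace).items():
    print(name, value, change)
```

With these inputs the sketch yields a change of 4 for `TP` and 0.003 for `AUC`, in line with the report.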
57 changes: 40 additions & 17 deletions content/docs/command-reference/metrics/index.md
@@ -10,24 +10,37 @@ A set of commands to add, manage, collect, and display project metrics:
## Synopsis

```usage
usage: dvc metrics [-h] [-q | -v] {show,add,modify,remove} ...
usage: dvc metrics [-h] [-q | -v] {show,add,modify,remove,diff} ...

positional arguments:
COMMAND
show Output metric values.
add Tag file as a metric file.
modify Modify metric file values.
remove Remove files's metric tag.
{show,add,modify,remove,diff}
Use `dvc metrics CMD --help` to display command-
specific help.
show Print metrics, with optional formatting.
add Mark a DVC-tracked file as a metric.
modify Modify metric default formatting.
remove Remove metric mark on a DVC-tracked file.
diff Show changes in metrics between commits

optional arguments:
-h, --help show this help message and exit
-q, --quiet Be quiet.
-v, --verbose Be verbose
```

## Description

DVC has the ability to mark a certain stage <abbr>outputs</abbr> as files
containing metrics to track. (See the `--metrics` option of `dvc run`.) Metrics
are project-specific numeric values e.g. `AUC`, `ROC`, etc. DVC itself does not
ascribe any specific meaning for these numbers. Usually these numbers are
produced by the model evaluation script and serve as a way to compare and pick
the best performing experiment.
In order to track metrics associated with machine learning experiments, DVC can
mark certain stage <abbr>outputs</abbr> as files containing metrics to track.
(See the `--metrics` option of `dvc run`.) Metrics are project-specific
floating-point values, e.g. `AUC`, `ROC`, etc.

Supported file formats: JSON. Metrics can be organized in a tree hierarchy in a
JSON file. DVC addresses the metrics by the tree path.
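
As a rough illustration of addressing a metric by its tree path, the Python sketch below walks a nested JSON object key by key. It is not how DVC resolves paths (the command reference points to JSONPath for that); the dot-separated path syntax, the helper name `get_by_path`, and the sample values are all assumptions invented for this example.

```python
import json


def get_by_path(metrics, path):
    """Follow a dot-separated path through nested dicts (hypothetical helper)."""
    node = metrics
    for key in path.split("."):
        node = node[key]
    return node


# Invented sample data for illustration; not taken from a real project.
nested = json.loads('{"train": {"AUC": 0.97}, "eval": {"AUC": 0.65, "TP": 528}}')
print(get_by_path(nested, "eval.AUC"))
```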

DVC itself does not ascribe any specific meaning for these numbers. Usually
these numbers are produced by the model training or model evaluation code and
serve as a way to compare and pick the best performing experiment.

[Add](/doc/command-reference/metrics/add),
[show](/doc/command-reference/metrics/show),
@@ -65,7 +78,12 @@ Now let's print metric values that we are tracking in this <abbr>project</abbr>:
$ dvc metrics show -a

master:
data/eval.json: {"AUC": "0.624652"}
data/eval.json:
{
"AUC": 0.65115,
"error": 0.17304,
"TP": 528
}
```

We can also tell DVC an `xpath` for the metric file, so that it can output only
@@ -74,11 +92,16 @@ the value of AUC. In the case of JSON, use
selectively extract data out of metric files:

```dvc
$ dvc metrics modify data/eval.json --type json --xpath AUC
$ dvc metrics show
$ dvc metrics show --xpath AUC data/eval.json
data/eval.json: {'AUC': 0.65115}
```

master:
data/eval.json: 0.624652
The xpath filter can be saved for a metrics file:

```dvc
$ dvc metrics modify data/eval.json --xpath AUC
$ dvc metrics show
data/eval.json: {'AUC': 0.65115}
```

And finally let's remove `data/eval.json` from the project metrics:
58 changes: 26 additions & 32 deletions content/docs/command-reference/metrics/modify.md
@@ -33,14 +33,11 @@ ERROR: failed to modify metric file settings -
## Options

- `-t <type>`, `--type <type>` - specify a type for the metric file. Accepted
values are: `raw` (default), `json`, `tsv`, `htsv`, `csv`, `hcsv`. It will be
saved into the corresponding DVC-file, and used by `dvc metrics show` to
determine how to handle displaying metrics.
values are: `raw` (default), `json`. It will be saved into the corresponding
DVC-file, and used by `dvc metrics show` to determine how to handle displaying
metrics.

`raw` means that no additional parsing is applied, and `--xpath` is ignored.
`htsv`/`hcsv` are the same as `tsv`/`csv`, but the values in the first row of
the file will be used as the field names and should be used to address columns
in the `--xpath` option.

- `-x <path>`, `--xpath <path>` - specify a path within a metric file to get a
specific metric value. Should be used if the metric file contains multiple
@@ -53,9 +50,6 @@ ERROR: failed to modify metric file settings -
[jsonpath-ng](https://github.com/h2non/jsonpath-ng) to know the syntax. For
example, `"AUC"` extracts the value from the following JSON-formatted metric
file: `{"AUC": "0.624652"}`.
- For `tsv`/`csv` - `row,column` e.g. `1,2`. Indices are 0-based.
- For `htsv`/`hcsv` - `row,column name` e.g. `0,Name`. Row index is 0-based.
First row is used to specify column names and is not included into index.

- `-h`, `--help` - prints the usage/help message, and exit.

@@ -70,52 +64,52 @@ Let's first imagine we have a [stage](/doc/command-reference/run) with a generic
metric file initially. The dummy command below simulates this imaginary setup:

```dvc
$ dvc run -M metrics.csv "echo auc, 0.9567 > metrics.csv"
$ dvc run -o metrics.json \
'echo {\"AUC\": 0.9643, \"TP\": 527} > metrics.json'
```

The resulting stage file `metrics.csv.dvc` should look like this:
The resulting stage file `metrics.json.dvc` should look like this:

```yaml
md5: 6ed9b798bf460e1aa80b27388425a07d
cmd: echo auc, 0.9567 > metrics.csv
wdir: .
md5: 906ea9489e432c85d085b248c712567b
cmd: echo {\"AUC\":0.9643, \"TP\":527} > metrics.json
outs:
  - md5: 13ee80c6b3e238c5097427c2114ae6e4
    path: metrics.csv
    cache: false
    metric: true
  - md5: 0f0e67dc927aa69cd3fc37435ee1304f
    path: metrics.json
    cache: true
    metric: false
    persist: false
```

And if we run `dvc metrics show metrics.csv`, we will get the complete contents
And if we run `dvc metrics show metrics.json`, we will get the complete contents
of the file:

```dvc
$ dvc metrics show metrics.csv
metrics.csv: auc 0.9567
$ dvc metrics show metrics.json
metrics.json: {"AUC":0.9643, "TP":527}
```

Okay. Let's now imagine we are interested only in the numeric value, the second
column of the CSV file. We can specify the `CSV` type (`-t`) and an `xpath`
(`-x`) to extract the second column:
Okay. Let's now imagine we are interested only in a single value, the number of
true positives (TP). We can specify the `JSON` type (`-t`) and an `xpath` (`-x`)
to extract the TP value:

```dvc
$ dvc metrics modify -t csv -x '0,1' metrics.csv
$ dvc metrics modify -t json -x TP metrics.json
```

After this change `dvc metrics show` should always select only the value itself,
and exclude names:
After this change `dvc metrics show` should always select only the specified
value:

```dvc
$ dvc metrics show metrics.csv
metrics.csv: [' 0.9567']
$ dvc metrics show metrics.json
metrics.json: {'TP': 527}
```

Notice that the `metric` field in the `metrics.csv.dvc` stage file changed to
Notice that the `metric` field in the `metrics.json.dvc` stage file changed to
include this information:

```yaml
metric:
  type: csv
  xpath: 0,1
  type: json
  xpath: TP
```