Plots #1186

Merged
merged 9 commits into from
May 7, 2020
Changes from 4 commits
3 changes: 3 additions & 0 deletions config/prismjs/dvc-commands.js
@@ -32,6 +32,9 @@ module.exports = [
'metrics',
'params diff',
'params',
'plot show',
'plot diff',
'plot',
'lock',
'list',
'install',
115 changes: 115 additions & 0 deletions content/docs/command-reference/plot/diff.md
@@ -0,0 +1,115 @@
# plot diff

Show the difference in
[continuous metrics](/doc/command-reference/plot#continuous-metrics) by plotting
Contributor

I wonder whether we want to name them continuous. This word applies to functions. What about, for example, confusion matrix? Data for that type of plot is not continuous.

Member

Agreed with @pared . Probably even plain explicit "non-scalar metric" would be better?

Member Author

yeah. it will be changed. all the terminology around continuous will be removed in the next iteration.

Contributor

  • So let's leave "continuous" in this PR so we have a plot cmd ref for now, and update it in a following PR.

on a single [plot](/doc/command-reference/plot) different versions of metrics
from the <abbr>DVC repository</abbr> or workspace.

## Synopsis

```usage
usage: dvc plot diff [-h] [-q | -v] [-t [TEMPLATE]] [-d [DATAFILE]]
[-r RESULT] [--no-html] [-f FIELDS] [-o]
[--no-csv-header]
[revisions [revisions ...]]

positional arguments:
revisions Git revisions to plot from
```

## Description

This command visualizes the difference between continuous metrics among experiments
in the repository history. It requires that Git is used to version the
metrics files.

The metrics file must be specified with the `-d`/`--datafile` option. A plot
can also be customized with [Vega](https://vega.github.io/) templates through
the `--template` option. To learn more about the file formats and templates, please
see `dvc plot`.

When run without any revision specified, this command compares the metrics currently
present in the workspace (uncommitted changes) with the latest committed
version. A single specified revision shows the difference between that revision
and the version in the workspace.

In contrast to many commands such as `git diff`, `dvc metrics diff`, and
`dvc params diff`, the plot difference shows all the revisions in a single output
and is not limited to two versions. A user can specify as many revisions as
needed.
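
The revision rules described above can be sketched in Python. This is only an illustration of the behavior stated in this description, not DVC's actual implementation; the function name `revisions_to_plot` and the `"workspace"` label are hypothetical.

```python
from typing import List

# Hypothetical label for the uncommitted state of the workspace.
WORKSPACE = "workspace"


def revisions_to_plot(revisions: List[str]) -> List[str]:
    """Return the versions that end up on the single plot."""
    if not revisions:
        # No revision given: latest commit versus the workspace.
        return ["HEAD", WORKSPACE]
    if len(revisions) == 1:
        # One revision given: that revision versus the workspace.
        return [revisions[0], WORKSPACE]
    # Two or more revisions: plot all of them together.
    return list(revisions)


print(revisions_to_plot([]))
print(revisions_to_plot(["HEAD^"]))
print(revisions_to_plot(["HEAD", "11c0bf1", "v1.0"]))
```

The last call shows the point made above: the list is not capped at two versions.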

The metrics files can be files committed to Git as well as data files under
DVC control. In the case of data files, the file revision corresponds to the Git
revision of the [DVC-files](/doc/user-guide/dvc-file-format) that have this file as
an output.

## Options

- `-t [TEMPLATE], --template [TEMPLATE]` - File to be injected with data.

- `-d [DATAFILE], --datafile [DATAFILE]` - Data to be visualized.

- `-r RESULT, --result RESULT` - Name of the generated file.

- `--no-html` - Do not wrap Vega plot JSON with HTML.

- `-f FIELDS, --fields FIELDS` - Choose which fields or JSONPath to put into
plot.

- `--no-csv-header` - Provided CSV or TSV datafile does not have a header.

- `-o, --stdout` - Print plot content to stdout.

- `-h`, `--help` - print the usage/help message and exit.

- `-q`, `--quiet` - do not write anything to standard output. Exit with 0 if no
problems arise, otherwise 1.

- `-v`, `--verbose` - displays detailed tracing information.

## Examples

The difference between an uncommitted version of the file and the last committed
one:

```dvc
$ dvc plot diff -d logs.csv
file:///Users/dmitry/src/plot/logs.html
```

A new file `logs.html` was generated. The user can open it in a web browser.

![](/img/plot_diff_workspace.svg)

The difference between two specified commits:

```dvc
$ dvc plot diff -d logs.csv HEAD 11c0bf1
file:///Users/dmitry/src/plot/logs.html
```

![](/img/plot_diff.svg)

The predefined confusion matrix template shows how the difference between
continuous metrics can be faceted into separate plots:

```csv
actual,predicted
cat,cat
cat,cat
cat,cat
cat,dog
cat,dinosaur
cat,dinosaur
cat,bird
turtle,dog
turtle,cat
...
```

```dvc
$ dvc plot diff -d classes.csv -t confusion_matrix
file:///Users/dmitry/src/test/plot_old/classes.html
```

![](/img/plot_diff_confusion.svg)
189 changes: 189 additions & 0 deletions content/docs/command-reference/plot/index.md
@@ -0,0 +1,189 @@
# plot

Contains commands to visualize
[continuous metrics](/doc/command-reference/plot#continuous-metrics) in
structured files like JSON, CSV, TSV: [show](/doc/command-reference/plot/show),
[diff](/doc/command-reference/plot/diff).

## Synopsis

```usage
usage: dvc plot [-h] [-q | -v] {show,diff} ...

positional arguments:
COMMAND
show Plot data from a file
diff Plot changes between commits in the DVC repository,
or between the last commit and the workspace.
```

## Description

DVC provides a set of commands to visualize continuous metrics of machine
learning experiments. Typical examples of plots are AUC curves, loss functions,
and confusion matrices.
Contributor

I think we should just jump into explaining what continuous metrics are right here on the top of the description, and remove the Continuous metrics H3 header. Start with the

In contrast to [scalar metrics](/doc/command-reference/metrics), continous metrics represents a plot and ...

paragraph that's much further down rn. Idk if we need to explain scalar metrics here, probably not?

And just link to /doc/command-reference/plot directly from places where [continuous metrics] are mentioned.

Member Author

All points are great! I rearranged this part. Please take a look.

It is still important to keep the scalars vs continuous difference somewhere. I keep it as a next section but rearranged from the plot point of view.

Contributor

@jorgeorpinel Apr 29, 2020

It is still important to keep the scalars vs continuous difference somewhere

I agree! But I think the scalar metrics should be explained in the description of dvc metrics cmd ref index, and just mention the word here with a link to there, without explaining. But OK I guess that can be a separate issue (to match dvc metrics with this ref)

Contributor

@jorgeorpinel Apr 29, 2020

  • p.s. unresolving for myself to remember about "that can be a separate issue (to match dvc metrics with this ref)"


Continuous metrics should be saved in files, which are usually created by
users or generated by the user's modeling or data processing code. The plot commands
can work with files committed to the repository history, data files
controlled by DVC, or files from the workspace.
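
As a minimal sketch of how a user's training code might produce such a file, the Python snippet below writes per-epoch values as CSV. The helper name `write_metrics_log` is hypothetical, and the two sample rows are copied from the `logs.csv` example in this document; a real training loop would supply its own values and write to a file on disk instead of an in-memory buffer.

```python
import csv
import io


def write_metrics_log(rows, out):
    """Write one CSV row per epoch, with a header line first."""
    writer = csv.DictWriter(out, fieldnames=["epoch", "accuracy", "loss"])
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


# Sample values taken from the logs.csv example in this document.
history = [
    {"epoch": 0, "accuracy": 0.9418667, "loss": 0.19958884770199656},
    {"epoch": 1, "accuracy": 0.9763333, "loss": 0.07896138601688048},
]

# An in-memory buffer stands in for open("logs.csv", "w", newline="").
buffer = io.StringIO()
write_metrics_log(history, buffer)
print(buffer.getvalue())
```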

### Continuous metrics

DVC has two concepts of metrics for representing the results of machine learning
training or data processing:

1. `dvc metrics` to represent scalar numbers such as AUC, true positive rate, and
others.
2. `dvc plot` to visualize continuous metrics such as AUC curves, loss functions,
confusion matrices, and others.

Scalar metrics should be stored in hierarchical files such as JSON and YAML. The
`dvc metrics diff` command can represent the difference between the metrics in
different experiments as float numbers. For example, the `AUC` metric is `0.801807` and
increased by `+0.037826` from the previous value:

```dvc
$ dvc metrics diff
Path Metric Value Change
summary.json AUC 0.801807 0.037826
```
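
The `Change` column is simply the new value minus the old one. The sketch below reproduces that arithmetic; the previous value `0.763981` does not appear in the output above and is derived here by subtracting the change from the current value, and `metric_change` is a hypothetical helper, not part of DVC.

```python
def metric_change(old: float, new: float, ndigits: int = 6) -> float:
    """Difference between two scalar metric values, rounded for display."""
    return round(new - old, ndigits)


current_auc = 0.801807
# Derived assumption: current value minus the reported change.
previous_auc = round(current_auc - 0.037826, 6)

print(previous_auc)
print(metric_change(previous_auc, current_auc))
```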

In contrast to scalar metrics, continuous metrics represent a plot and should be
stored as an array in a JSON file or as a column in CSV or TSV files. The command
`dvc plot diff` generates a plot with two versions of the metrics:

```dvc
$ dvc plot diff -d logs.csv
file:///Users/dmitry/src/plot/logs.html
```

![](/img/plot_auc.svg)

### File formats

Supported file formats for continuous metrics are: JSON, CSV, TSV. DVC expects
to see an array (or multiple arrays) of objects (usually _float numbers_) in the
file.
Comment on lines +66 to +68
Contributor

Again here, maybe use the "data series" term and avoid complicated descriptions until you get into each format?

Suggested change
Supported file formats for continuous metrics are: JSON, CSV, TSV. DVC expects
to see an array (or multiple arrays) of objects (usually _float numbers_) in the
file.
Supported formats for continuous metrics are: JSON, CSV, and TSV. DVC expects
to find data series (usually containing _float numbers_) in the file.


In tabular file formats such as CSV and TSV, the array is a column. The plot commands
can generate visuals for a specified column or a set of columns, for example the `AUC`
column:
Comment on lines +70 to +72
Contributor

Suggested change
In tabular file formats such as CSV and TSV the array is a column. Plot command
can generate visuals for a specified column or a set of columns. Like `AUC`
column:
In tabular continuous metrics files (CSV and TSV formats), each column is a
series. `dvc plot show` can generate visuals for one, several, or all columns.
For example `AUC` below:


```
epoch, AUC, loss
34, 0.91935, 0.0317345
35, 0.91913, 0.0317829
36, 0.92256, 0.0304632
37, 0.92302, 0.0299015
```
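
To make the "column is a series" idea concrete, here is a small Python sketch that pulls the `AUC` column out of the table above. It only illustrates the data layout and is not how DVC reads the file; `read_series` is a hypothetical helper, and `skipinitialspace=True` is used because this sample has a space after each comma.

```python
import csv
import io

# The sample table from above, embedded as text.
CSV_TEXT = """epoch, AUC, loss
34, 0.91935, 0.0317345
35, 0.91913, 0.0317829
36, 0.92256, 0.0304632
37, 0.92302, 0.0299015
"""


def read_series(text: str, column: str) -> list:
    """Return one column of a CSV table as a list of floats."""
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    return [float(row[column]) for row in reader]


auc = read_series(CSV_TEXT, "AUC")
print(auc)
```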

In hierarchical file formats such as JSON, an array of JSON objects is expected.
The plot commands can generate visuals for a specified field name or a set of fields
from the array's objects, such as the `val_loss` field in the `train` array in this
example:
Comment on lines +82 to +85
Contributor

Suggested change
In hierarchical file formats such as JSON an array of JSON-objects is expected.
Plot command can generate visuals for a specified field name or a set of fields
from the array's object. Like `val_loss` field in the `train` array in this
example:
In JSON files (hierarchical format) a named array of matching JSON objects is
expected. `dvc plot show` can generate visuals for a one, several, or all field
names from the array's objects. For example `val_loss` in the `train` array below:


```
{
  "train": [
    {"val_accuracy": 0.9665, "val_loss": 0.10757},
    {"val_accuracy": 0.9764, "val_loss": 0.07324},
    {"val_accuracy": 0.8770, "val_loss": 0.08136},
    {"val_accuracy": 0.8740, "val_loss": 0.09026},
    {"val_accuracy": 0.8795, "val_loss": 0.07640},
    {"val_accuracy": 0.8803, "val_loss": 0.07608},
    {"val_accuracy": 0.8987, "val_loss": 0.08455}
  ]
}
```
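
The same idea for JSON can be sketched as follows: one field is collected from every object in a named array. This illustrates the expected structure only and is not DVC's parsing code; `read_field` is a hypothetical helper.

```python
import json

# The sample document from above, embedded as text.
JSON_TEXT = """
{
  "train": [
    {"val_accuracy": 0.9665, "val_loss": 0.10757},
    {"val_accuracy": 0.9764, "val_loss": 0.07324},
    {"val_accuracy": 0.8770, "val_loss": 0.08136},
    {"val_accuracy": 0.8740, "val_loss": 0.09026},
    {"val_accuracy": 0.8795, "val_loss": 0.07640},
    {"val_accuracy": 0.8803, "val_loss": 0.07608},
    {"val_accuracy": 0.8987, "val_loss": 0.08455}
  ]
}
"""


def read_field(text: str, array_name: str, field: str) -> list:
    """Collect one field from every object in a named JSON array."""
    data = json.loads(text)
    return [item[field] for item in data[array_name]]


val_loss = read_field(JSON_TEXT, "train", "val_loss")
print(val_loss)
```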

### Plot templates

DVC generates plots as HTML files that a user can open in a web
browser. The HTML files contain plots as [Vega-Lite](https://vega.github.io/)
objects. The files can also be converted to traditional PNG, JPEG, or SVG image
formats using external tools.

Vega is a declarative, programming-language-agnostic format for defining plots as
a JSON specification. DVC gives users the ability to change the specification and
generate plots in the format that best fits their needs. At the same
time, this does not make DVC dependent on the user's visualization code or on any
programming language or environment, which allows DVC to stay programming-language
agnostic.

Plot templates are stored in the `.dvc/plot/` directory as JSON files. A user can
define their own templates or modify the existing ones. Please see more details
in `dvc plot show` and `dvc plot diff`.

## Options

- `-h`, `--help` - print the usage/help message and exit.

- `-q`, `--quiet` - do not write anything to standard output.

- `-v`, `--verbose` - displays detailed tracing information.

## Examples

Visualization of the tabular file `logs.csv`:

```
epoch,accuracy,loss,val_accuracy,val_loss
0,0.9418667,0.19958884770199656,0.9679,0.10217399864746257
1,0.9763333,0.07896138601688048,0.9768,0.07310650711813942
2,0.98375,0.05241111190887168,0.9788,0.06665669009438716
3,0.98801666,0.03681169906261687,0.9781,0.06697812260198989
4,0.99111664,0.027362171787042946,0.978,0.07385754839298315
5,0.9932333,0.02069501801203781,0.9771,0.08009233058886166
6,0.9945,0.017702101902437668,0.9803,0.07830339228538505
7,0.9954,0.01396906608727198,0.9802,0.07247738889862157
```

```dvc
$ dvc plot show logs.csv
file:///Users/dmitry/src/plot/logs.html
```

![](/img/plot_show.svg)

Difference between the current file and the previously committed one:

```dvc
$ dvc plot diff -d logs.csv HEAD^
file:///Users/dmitry/src/plot/logs.html
```

![](/img/plot_diff.svg)

Visualize a specific field:

```dvc
$ dvc plot show --field loss logs.csv
file:///Users/dmitry/src/plot/logs.html
```

![](/img/plot_show_field.svg)

The confusion matrix template is predefined in DVC (file
`.dvc/plot/confusion_matrix.json`):

```csv
actual,predicted
cat,cat
cat,cat
cat,cat
cat,dog
cat,dinosaur
cat,dinosaur
cat,bird
turtle,dog
turtle,cat
...
```

```dvc
$ dvc plot show classes.csv --template confusion_matrix
file:///Users/dmitry/src/plot/classes.html
```

![](/img/plot_show_confusion.svg)
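
For intuition about what a confusion matrix summarizes, the Python sketch below counts each (actual, predicted) pair in the rows shown above. It uses only the nine visible rows (the elided `...` rows are not guessed), and it is an independent illustration rather than what the template itself executes; `confusion_counts` is a hypothetical helper.

```python
import csv
import io
from collections import Counter

# Only the rows visible in the example above; the elided rows are omitted.
CSV_TEXT = """actual,predicted
cat,cat
cat,cat
cat,cat
cat,dog
cat,dinosaur
cat,dinosaur
cat,bird
turtle,dog
turtle,cat
"""


def confusion_counts(text: str) -> Counter:
    """Count how often each (actual, predicted) pair occurs."""
    reader = csv.DictReader(io.StringIO(text))
    return Counter((row["actual"], row["predicted"]) for row in reader)


counts = confusion_counts(CSV_TEXT)
for (actual, predicted), n in sorted(counts.items()):
    print(f"actual={actual} predicted={predicted} count={n}")
```

Each count corresponds to one cell of the matrix: rows are the actual classes and columns the predicted ones.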