add docs for dvc metrics diff (#933)
* add docs for dvc metrics diff

* nav: add `metrics diff` to sidebar

* cmd ref: typos in `metrics diff`

* cmd ref: rewrite `metrics diff` ref
and review related concepts throughout docs e.g. "Git reference", "working tree"

* cmd ref: update descs, review options, link all metrics subcmds
addresses #933 (review)
as well as #933 (review)
and #933 (review)

* cmd ref: update cmd argument descriptions for `diff` and `metrics diff`

* metrics diff: big terminology review around the intro of this new command
per #933 (review) et al.

* term: review usage of "hash", "commit hash", "SHA", and "MD5"
per #933 (review)

* term: rewrite definition of "workspace"
per #933 (review)

* cmd ref: change link from `metrics diff` options to `metrics show`
per #933 (comment)

* cmd ref: update example in `dvc metrics diff` and similar ones
per #933 (review)

* cmd ref: simplify dvc gc -a option
per #933 (review)
and #933 (review)

* cmd ref: use "reference" more than "revision" in diff
per #933 (review)

* cmd ref: link term "revision" in diff and `metrics diff`
also per #933 (review)

* term: put Git ref examples before term and link
per #933 (review)

* cmd ref: friendlier explanation of "tip of default branch"
per #933 (review)

* cmd ref: use tag name instead of term "the revision"
per #933 (review)

* term: revert some "revision"->"reference" changes, and related simplifications
per #933 (review)

* cmd ref: review desc. of `-a` options throughout refs

* cmd ref: update diff params
per iterative/dvc/pull/3244

* cmd ref: update notes around moving/static Git refs in import and update
per #933 (review)

* revert workspace glossary entry
per #933 (review)

* tutorial: use full name of Deep Dive Tutorial in title and links
per #933 (review)

* user-guide: undo change on "binary" literal for analytics example
per #933 (review)

* use-cases: avoid term "revision" in data-registries
per #933 (review)

* term: revert "hash"->"checksum" in this PR
per #933 (comment)

* cmd ref: "revision"->"commit" in get ref
per #933 (review)

* cmd ref: use correct tag names in checkout examples
and double check they still work
per #933 (review)

* diff: remove backquotes around "HEAD" same as in core repo
per iterative/dvc#3244 (review)

* cmd ref: don't use link to git reference doc
per #933 (comment)

* cmd ref: don't use term "revision" in diff, prefer "commit"
per #933 (review)

* cmd ref: no need for word "specific" (or "SHA") in get/import
per #933 (review)

* cmd ref: update "project"->"workspace" term and example intros in `dvc install`
per #933 (review)

* docs: 2 misc updates
per #933 (review)
and #933 (review)

* tutorials: update model->"data or model"
per #933 (review) et al.

* cmd ref: fixed link to `metrics diff` and updated mention of it in `metrics show`
per #933 (review)
and #933 (review)
and #933 (review)

* get-started: typo in pipelines chapter

* cmd ref: rewrite paragraph about fixed revision import stages in `update`
per #933 (review)

* cmd ref: rewrite p about `repro` rewriting artifacts in cache
per #933 (review)

* cmd ref: rephrase and split p about what to compare and about targets in `status`
per #933 (review)

* cmd ref: reorg last part of repro desc

* cmd ref: restore `git tag` sample output in checkout examples
per #933 (review)
and #961 (comment)

* cmd ref: small rewording around tag names

* cmd ref: `metrics diff` and `diff` intro updates
per #933 (review)
and #933 (review)

* term: use "version" instead of "revision" in `import` cmd ref
per #933 (comment)

* cmd ref: updates to `metrics diff` (and `diff`) descriptions
per https://dvc.org/doc/command-reference/status
and #933 (review)
and #933 (review)

* cmd ref: change "ref" -> "rev" per iterative/dvc/pull/3299
and #933 (review)

* cmd ref: "revision"->"version" in a couple more docs
per #933 (review)

* cmd ref: simplify note about `metrics diff` in `metrics show`
per #933 (review)

* cmd ref: use descriptive example-get-started repo tags in `get` examples
per #933 (review)

* term: "commit hash"->"commit SHA hash" to match #962 but
may change the decision in that other PR.

* cmd ref: improve -a and -c option descs
per #933 (review)

* cmd ref: remove p about --targets option in metrics diff
per #933 (review)

* cmd ref: rewrite p about fixed revisions/re-importing in `update`
per #933 (review)

* tutorials: use and instead of data "or" models in versioning tut
per #933 (review)
and #933 (review)

* user-guide: restore bullet about `git` in analytics
per #933 (review)

* you don't usually merge tags
per #933 (review)

* term: don't use "version of repo/project" when referring to commits
per #933 (comment)

* cmd ref: simplify note about metrics diff in metrics show
per #933 (review)

* cmd ref: and->and/or in checkout sample of versioning tut
per #933 (review)

* user-guide: updated analytics details
per #933 (review)

* cmd ref: restore simpler wording about status -aT desc
per #933 (review)

* cmd ref: correct (again) the short desc for diff and metrics diff
per #933 (comment)

Co-authored-by: Jorge Orpinel <[email protected]>
efiop and jorgeorpinel authored Feb 13, 2020
1 parent c32e687 commit bd0e250
Showing 50 changed files with 512 additions and 408 deletions.
4 changes: 2 additions & 2 deletions public/static/docs/changelog/0.18.md
Original file line number Diff line number Diff line change
Expand Up @@ -28,8 +28,8 @@ really excited to share the progress with you:
- 🙂 **Usability improvements** - DVC interface got more informative and easier
to use:

- More heavy operations render dynamic progress bar (e.g. hash computation):
![](/static/img/0.18-progress.gif)
- More heavy operations render dynamic progress bar (e.g. file hash
computation): ![](/static/img/0.18-progress.gif)

- Pipeline visualization via command line. Just run `dvc pipeline show` with
`ascii` option and a target: ![](/static/img/0.18-pipeline.gif)
Expand Down
8 changes: 4 additions & 4 deletions public/static/docs/command-reference/add.md
Expand Up @@ -74,8 +74,8 @@ to work with directory hierarchies with `dvc add`:
directory (with default name `dirname.dvc`). Every file in the hierarchy is
added to the cache (unless `--no-commit` flag is added), but DVC does not
produce individual DVC-files for each file in the directory tree. Instead,
the single DVC-file points to a file in the cache that contains references to
the files in the added hierarchy.
the single DVC-file references a file in the cache that in turn points to the
files in the added hierarchy.
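
The idea of a single cache file that lists the files of an added directory can be sketched roughly as follows. This is only an illustration: the helper names (`dir_manifest`, `manifest_md5`), the JSON layout, and the way the listing itself is hashed are assumptions made for this example, not DVC's actual on-disk format.

```python
import hashlib
import json
import os


def file_md5(path):
    """Return the hex MD5 digest of a (small) file."""
    with open(path, "rb") as fobj:
        return hashlib.md5(fobj.read()).hexdigest()


def dir_manifest(root):
    """List every file under `root` with its hash and relative path.

    Hypothetical layout: one {"md5", "relpath"} entry per file,
    sorted by relative path so the listing is deterministic.
    """
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            entries.append({"md5": file_md5(full), "relpath": rel})
    return sorted(entries, key=lambda entry: entry["relpath"])


def manifest_md5(entries):
    """Hash the listing itself, giving one value for the whole directory."""
    payload = json.dumps(entries, sort_keys=True).encode("utf-8")
    return hashlib.md5(payload).hexdigest()
```

With a listing like this, one hash value can stand for the whole directory, while the individual files remain addressable by their own hashes.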

In a <abbr>DVC project</abbr>, `dvc add` can be used to version control any
<abbr>data artifact</abbr> (input, intermediate, or output files and
Expand Down Expand Up @@ -197,8 +197,8 @@ Saving information to 'pics.dvc'.

There are no [DVC-files](/doc/user-guide/dvc-file-format) generated within this
directory structure, but the images are all added to the <abbr>cache</abbr>. DVC
prints a message about this, mentioning that `md5` values are computed for each
directory. A single `pics.dvc` DVC-file is generated for the top-level
prints a message about this, mentioning that MD5 hash values are computed for
each directory. A single `pics.dvc` DVC-file is generated for the top-level
directory, and it contains:

```yaml
Expand Down
78 changes: 35 additions & 43 deletions public/static/docs/command-reference/checkout.md
Expand Up @@ -33,17 +33,17 @@ The execution of `dvc checkout` does the following:

- Scans the DVC-files to compare against the data files or directories in the
<abbr>workspace</abbr>. DVC knows which data (<abbr>outputs</abbr>) match
because their checksums are saved in the `outs` fields inside the DVC-files.
Scanning is limited to the given `targets` (if any). See also options
`--with-deps` and `--recursive` below.
because the corresponding file hash values are saved in the `outs` fields in
the DVC-files. Scanning is limited to the given `targets` (if any). See also
options `--with-deps` and `--recursive` below.

- Missing data files or directories, or those that don't match with any
DVC-file, are restored from the <abbr>cache</abbr>. See options `--force` and
`--relink`.

By default, this command tries not to copy files between the cache and the
workspace, using reflinks instead, when supported by the file system. (Refer to
[File link types](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache).)
By default, this command tries not to make copies of cached files in the workspace,
using reflinks instead when supported by the file system (refer to
[File link types](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache)).
The next linking strategy default value is `copy` though, so unless other file
link types are manually configured in `cache.type` (using `dvc config`), files
will be copied. Keep in mind that having file copies doesn't present much of a
Expand Down Expand Up @@ -102,10 +102,9 @@ be pulled from remote storage using `dvc pull`.
## Examples

Let's employ a simple <abbr>workspace</abbr> with some data, code, ML models,
pipeline stages, as well as a few Git tags, such as our
[get started example repo](https://github.com/iterative/example-get-started).
Then we can see what happens with `git checkout` and `dvc checkout` as we switch
from tag to tag.
pipeline stages, such as the <abbr>DVC project</abbr> created in our
[Get Started](/doc/get-started) section. Then we can see what happens with
`git checkout` and `dvc checkout` as we switch from tag to tag.

<details>

Expand All @@ -120,8 +119,7 @@ $ cd example-get-started

</details>

The workspace looks almost like in this
[pipeline setup](/doc/tutorials/pipelines):
The workspace looks something like this:

```dvc
.
Expand All @@ -131,17 +129,19 @@ The workspace looks almost like in this
├── featurize.dvc
├── prepare.dvc
├── train.dvc
└── src
└── <code files here>
├── src
│ └── ...
└── train.dvc
```

We have these tags in the repository that represent different iterations of
solving the problem:
This repository includes the following tags, which represent different variants
of the resulting model:

```dvc
$ git tag
baseline-experiment <- first simple version of the model
bigrams-experiment <- use bigrams to improve the model
...
baseline-experiment <- First simple version of the model
bigrams-experiment <- Uses bigrams to improve the model
```

This project comes with a predefined HTTP
Expand All @@ -153,57 +153,52 @@ files that are under DVC control. The model file checksum

```dvc
$ dvc pull
...
Checking out model.pkl with cache '3863d0e317dee0a55c4e59d2ec0eef33'
...
$ md5 model.pkl
MD5 (model.pkl) = 3863d0e317dee0a55c4e59d2ec0eef33
MD5 (model.pkl) = 662eb7f64216d9c2c1088d0a5e2c6951
```

What if we want to rewind history, so to speak? The `git checkout` command lets
us checkout at any point in the commit history, or even checkout other tags. It
What if we want to "rewind history", so to speak? The `git checkout` command
lets us restore any point in the repository history, including any tags. It
automatically adjusts the files, by replacing file content and adding or
deleting files as necessary.

```dvc
$ git checkout baseline
Note: checking out 'baseline'.
...
HEAD is now at 40cc182...
$ git checkout baseline-experiment # Stage where model is first created
```

Let's check the `model.pkl` entry in `train.dvc` now:

```yaml
outs:
md5: a66489653d1b6a8ba989799367b32c43
path: model.pkl
- md5: 43630cce66a2432dcecddc9dd006d0a7
path: model.pkl
```
But if you check `model.pkl`, the file hash is still the same:

```dvc
$ md5 model.pkl
MD5 (model.pkl) = 3863d0e317dee0a55c4e59d2ec0eef33
MD5 (model.pkl) = 662eb7f64216d9c2c1088d0a5e2c6951
```

This is because `git checkout` changed `featurize.dvc`, `train.dvc`, and other
DVC-files. But it did nothing with the `model.pkl` and `matrix.pkl` files. Git
doesn't track those files, DVC does, so we must do this:
doesn't track those files; DVC does, so we must do this:

```dvc
$ dvc fetch
$ dvc checkout
$ md5 model.pkl
MD5 (model.pkl) = a66489653d1b6a8ba989799367b32c43
MD5 (model.pkl) = 43630cce66a2432dcecddc9dd006d0a7
```

What happened is that DVC went through the sole existing DVC-file and adjusted
the current set of files to match the `outs` of that stage. `dvc fetch` is run
once to download missing data from the remote storage to the <abbr>cache</abbr>.
Alternatively, we could have just run `dvc pull` in this case to automatically
do `dvc fetch` + `dvc checkout`.
What happened is that DVC went through the DVC-files and adjusted the current
set of files to match the `outs` in them. `dvc fetch` is run once to
download missing data from the remote storage to the <abbr>cache</abbr>.
(Alternatively, we could have just run `dvc pull` to do `dvc fetch` +
`dvc checkout` in one step.)
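
The manual `md5` checks above can also be scripted. The following is a minimal sketch, not part of DVC: `file_md5` and `matches_dvc_file` are hypothetical helpers, and the recorded hash is assumed to have been copied by hand from the `outs` entry of the DVC-file. DVC's own hashing may differ in details, so treat this only as a way to reproduce what the `md5` tool prints.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fobj:
        # Read fixed-size chunks so large model files don't fill memory.
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_dvc_file(path, recorded_md5):
    """True if the workspace file exists and has the recorded hash."""
    return Path(path).is_file() and file_md5(path) == recorded_md5
```

In the walkthrough above, `matches_dvc_file("model.pkl", "43630cce66a2432dcecddc9dd006d0a7")` would be expected to return `False` right after `git checkout baseline-experiment`, and `True` once `dvc checkout` has restored the file.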

## Example: Automating DVC checkout

Expand All @@ -223,13 +218,10 @@ running `dvc checkout` when needed.
again:

```dvc
$ git checkout bigrams
Previous HEAD position was d171a12 add evaluation stage
HEAD is now at d092b42 try using bigrams
Checking out model.pkl with cache '3863d0e317dee0a55c4e59d2ec0eef33'.
$ git checkout bigrams-experiment # Has the latest model version
$ md5 model.pkl
MD5 (model.pkl) = 3863d0e317dee0a55c4e59d2ec0eef33
MD5 (model.pkl) = 662eb7f64216d9c2c1088d0a5e2c6951
```

Previously this took two commands, `git checkout` followed by `dvc checkout`. We
Expand Down
6 changes: 3 additions & 3 deletions public/static/docs/command-reference/config.md
Expand Up @@ -115,9 +115,9 @@ for more details.) This section contains the following options:

Due to the way DVC handles linking between the data files in the cache and
their counterparts in the <abbr>workspace</abbr>, it's easy to accidentally
corrupt the cached version of a file by editing or overwriting it. Turning
this config option on forces you to run `dvc unprotect` before updating a
file, providing an additional layer of security to your data.
corrupt the cached file by editing or overwriting it. Turning this config
option on forces you to run `dvc unprotect` before updating a file, providing
an additional layer of security to your data.

We highly recommend enabling this option when `cache.type` is set to
`hardlink` or `symlink`.
Expand Down
81 changes: 38 additions & 43 deletions public/static/docs/command-reference/diff.md
@@ -1,49 +1,41 @@
# diff

Show differences between two versions of the <abbr>DVC project</abbr>. It can be
narrowed down to specific target files and directories under DVC control.

> This command requires that the project is a [Git](https://git-scm.com/)
> repository.
Show changes between commits in the <abbr>DVC repository</abbr>, or between a
commit and the <abbr>workspace</abbr>. The comparison can be narrowed down to
specific target files/directories tracked by DVC.

## Synopsis

```usage
usage: dvc diff [-h] [-q | -v] [-t TARGET] a_ref [b_ref]
positional arguments:
a_ref Git reference from which diff calculates
b_ref Git reference until which diff calculates, if omitted diff
shows the difference between current HEAD and a_ref
a_rev Old Git commit to compare (defaults to HEAD)
b_rev New Git commit to compare (defaults to the
current workspace)
```

## Description

Given two Git commit references (commit hash, branch or tag name, etc) `a_ref`
and `b_ref`, this command shows a comparative summary of basic statistics: how
many files were deleted/changed, and the file size differences.

> Note that `dvc diff` does not show the line-to-line comparison among the
> target files in each revision, like `git diff` or
> [GNU `diff`](https://www.gnu.org/software/diffutils/) can. This is because the
> data tracked by DVC can come in many possible formats e.g. structured
> text, or binary blobs, etc.
Given two commit SHA hashes, branch or tag names, etc.
([references](https://git-scm.com/docs/revisions)) `a_ref` and `b_ref`, this
command shows a comparative summary of basic statistics: how many files were
deleted/changed, and the file size differences.

> For an example on how to create line-to-line text file comparison, refer to
> [issue #770](https://github.com/iterative/dvc/issues/770#issuecomment-512693256)
> in our GitHub repository.
> Note that `dvc diff` does not show line-by-line comparisons like `git diff`
> or [GNU `diff`](https://www.gnu.org/software/diffutils/) can. This is because
> the data tracked by DVC comes in many formats, such as structured text,
> binary blobs, etc. For an example of how to create a line-by-line text file
> comparison, refer to
> [issue #770](https://github.com/iterative/dvc/issues/770#issuecomment-512693256).
If the `-t` option is used, the diff is limited to the `TARGET` file or
directory specified.

Note that `dvc diff` does not have an effect when the repository is not tracked
by the Git SCM, for example when `dvc init` was used with the `--no-scm` option.
`dvc diff` does not have an effect when the repository is not tracked by Git,
for example when `dvc init` was used with the `--no-scm` option.
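
To make the kind of summary described here more concrete, the sketch below compares two hypothetical snapshots of tracked files. It is not how `dvc diff` is implemented: `summarize_diff` and the `path -> (md5, size)` mapping are assumptions made purely for illustration.

```python
def summarize_diff(old, new):
    """Summarize changes between two snapshots of tracked files.

    `old` and `new` map a relative path to a (md5, size_in_bytes) tuple.
    """
    old_paths, new_paths = set(old), set(new)
    added = sorted(new_paths - old_paths)
    deleted = sorted(old_paths - new_paths)
    # A file counts as modified when its hash differs between snapshots.
    modified = sorted(
        path for path in old_paths & new_paths if old[path][0] != new[path][0]
    )
    size_change = sum(size for _, size in new.values()) - sum(
        size for _, size in old.values()
    )
    return {
        "added": added,
        "deleted": deleted,
        "modified": modified,
        "size_change": size_change,
    }
```

Comparing hashes rather than file contents is what keeps such a summary cheap for large binary files, at the cost of not showing what changed inside each file.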

## Options

- `-t TARGET`, `--target TARGET` - path to a data file or directory. If not
specified, compares all files and directories that are under DVC control in
the workspace.
- `-t TARGET`, `--target TARGET` - path to a data file or directory to limit
diff for.

- `-h`, `--help` - prints the usage/help message, and exits.

Expand All @@ -64,8 +56,9 @@ For these examples we can use the chapters in our

Start by cloning our example repo if you don't already have it. Then move into
the repo and check out the
[version](https://github.com/iterative/example-get-started/releases/tag/3-add-file)
corresponding to the _Add Files_ chapter:
[3-add-file](https://github.com/iterative/example-get-started/releases/tag/3-add-file)
tag, corresponding to the [Add Files](/doc/get-started/add-files) _Get Started_
chapter:

```dvc
$ git clone https://github.com/iterative/example-get-started
Expand All @@ -83,13 +76,14 @@ Preparing to download data from 'https://remote.dvc.org/get-started'

</details>

## Example: Previous version of the same branch
## Example: Previous commit in the same branch

The minimal `dvc diff` command only includes the "from" reference (`a_ref`) from
which to calculate the difference. The "until" reference (`b_ref`) defaults to
`HEAD` (current Git commit).

The minimal `dvc diff` command only includes the A reference (`a_ref`) from
which the difference is to be calculated. The B reference (`b_ref`) defaults to
Git `HEAD` (the currently checked out version). To find the general differences
with the very previous committed version of the project, we can use the `HEAD^`
Git reference.
To see the difference with the very previous commit of the project, we can use
`HEAD^` as `a_ref`:

```dvc
$ dvc diff HEAD^
Expand All @@ -101,7 +95,7 @@ diff for 'data/data.xml'
added file with size 37.9 MB
```

## Example: Specific targets across Git references
## Example: Specific targets across Git commits

We can base this example on the [Metrics](/doc/get-started/metrics) and
[Compare Experiments](/doc/get-started/compare-experiments) chapters of our _Get
Expand Down Expand Up @@ -131,8 +125,8 @@ example repo.

</details>

To see the difference in `model.pkl` among these versions, we can run the
following command.
To see the difference in `model.pkl` between these tags, we can run the
following command.

```dvc
$ dvc diff -t model.pkl baseline-experiment bigrams-experiment
Expand All @@ -145,7 +139,8 @@ diff for 'model.pkl'
```

The output from this command confirms that there's a difference in the
`model.pkl` file between the 2 Git references we indicated.
`model.pkl` file between the 2 Git commits (tags `baseline-experiment` and
`bigrams-experiment`) we indicated.

### What about directories?

Expand Down Expand Up @@ -193,6 +188,6 @@ diff for 'data/prepared'
```

The command above checks whether there have been any changes to the
`data/prepared` directory after the `5-preparation` version (since the `b_ref`
is the current version, `HEAD` by default). The output tells us that there have
been no changes to that directory (or to any other file).
`data/prepared` directory after the `5-preparation` tag (since the `b_ref` is
`HEAD` by default). The output tells us that there have been no changes to that
directory (or to any other file).
17 changes: 9 additions & 8 deletions public/static/docs/command-reference/fetch.md
Expand Up @@ -94,10 +94,11 @@ specified in DVC-files currently in the project are considered by `dvc fetch`
fetched. The default value is `4 * cpu_count()`. For SSH remotes, the default is
just 4.

- `-a`, `--all-branches` - fetch cache for all Git branches, not just the active
one. This means DVC may download files needed to reproduce different versions
of a DVC-file ([experiments](/doc/get-started/experiments)), not just the
current one.
- `-a`, `--all-branches` - fetch cache for all Git branches instead of just the
current workspace. This means DVC may download files needed to reproduce
different versions of a DVC-file
([experiments](/doc/get-started/experiments)), not just the ones currently in
the workspace.

- `-T`, `--all-tags` - fetch cache for all Git tags. Similar to `-a` above. Note
that both options can be combined, for example using the `-aT` flag.
Expand All @@ -115,9 +116,9 @@ specified in DVC-files currently in the project are considered by `dvc fetch`
## Examples

Let's employ a simple <abbr>workspace</abbr> with some data, code, ML models,
pipeline stages, as well as a few Git tags, such as our
[get started example repo](https://github.com/iterative/example-get-started).
Then we can see what happens with `dvc fetch` as we switch from tag to tag.
pipeline stages, such as the <abbr>DVC project</abbr> created in our
[Get Started](/doc/get-started) section. Then we can see what happens with
`dvc fetch` as we switch from tag to tag.

<details>

Expand Down Expand Up @@ -154,7 +155,7 @@ solving the problem:
$ git tag
baseline-experiment <- first simple version of the model
bigrams-experiment <- use bigrams to improve the model
bigrams-experiment <- use bigrams to improve the model
```

## Example: Default behavior
Expand Down