add docs for dvc metrics diff #933

Merged
merged 65 commits into from
Feb 13, 2020
5f0da08
add docs for dvc metrics diff
efiop Jan 17, 2020
9da0661
nav: add `metrics diff` to sidebar
jorgeorpinel Jan 24, 2020
4a1b775
cmd ref: typos in `metrics diff`
jorgeorpinel Jan 27, 2020
8ae2e6d
cmd ref: rewrite `metrics diff` ref and
jorgeorpinel Jan 27, 2020
041fd22
Merge branch 'master' into 921
jorgeorpinel Jan 27, 2020
c056834
cmd ref: update descs, review options, link all metrics subcmds
jorgeorpinel Jan 27, 2020
344839a
cmd ref: update cmd argument descriptions for `diff` and `metics diff`
jorgeorpinel Jan 27, 2020
5a08d37
metrics diff: big terminology review around the intro of this new com…
jorgeorpinel Jan 29, 2020
f10cf2b
term: review usage of "hash", "commit hash", "SHA", and "MD5"
jorgeorpinel Jan 29, 2020
1d14086
term: rewrite definition of "workspace"
jorgeorpinel Jan 29, 2020
e55c362
cmd ref: change link from `metrics diff` options to `metrics show`
jorgeorpinel Jan 30, 2020
f989586
cmd ref: update example in `dvc metrics diff` and similar ones
jorgeorpinel Jan 30, 2020
734994a
cmd ref: simplify dvc gc -a option
jorgeorpinel Jan 30, 2020
961b513
cmd ref: use "reference" more than "revision" in diff
jorgeorpinel Jan 30, 2020
6b259ba
cmd ref: link term "revision" in diff and `metrics diff`
jorgeorpinel Jan 30, 2020
c006d18
term: put Git ref exapmles before term and link
jorgeorpinel Jan 30, 2020
e76329a
cmd ref: friendlier explanation of "tip of default branch"
jorgeorpinel Jan 30, 2020
d02ccd2
cmd ref: use tag name instead of term "the revision"
jorgeorpinel Jan 30, 2020
14d4c23
term: revert some "revision"->"reference" changes, and related simpli…
jorgeorpinel Jan 30, 2020
e7e0b97
cmd ref: review desc. of `-a` options throughout refs
jorgeorpinel Jan 30, 2020
c5dbb96
cmd ref: update diff params
jorgeorpinel Jan 30, 2020
b30df29
cmd ref: update notes around moving/static Git refs in import and update
jorgeorpinel Jan 30, 2020
1e9f3ae
revert workspace glossary entry
jorgeorpinel Jan 30, 2020
af6fc63
tutorial: use full name of Deep Dive Tutorial in title and links
jorgeorpinel Jan 30, 2020
bd0c9bd
user-guide: undo change on "binary" literal for analytics example
jorgeorpinel Jan 30, 2020
a0c51ff
use-cases: avoid term "revision" in data-registries
jorgeorpinel Jan 30, 2020
a8a7c1d
term: revert "hash"->"checksum" in this PR
jorgeorpinel Jan 31, 2020
a1c782a
cmd ref: "revision"->"commit" in get ref
jorgeorpinel Jan 31, 2020
fc83207
cmd ref: use correct tag names in checkout examples
jorgeorpinel Jan 31, 2020
cc82aca
diff: remove backquotes adound "HEAD" same as in core repo
jorgeorpinel Feb 1, 2020
f611b4c
Merge branch 'master' into 921
jorgeorpinel Feb 9, 2020
4d7c9c0
cmd ref: don't use link to git reference doc
jorgeorpinel Feb 9, 2020
43120ad
cmd ref: don't use term "revision" in diff, prefer "commit"
jorgeorpinel Feb 9, 2020
541c8d3
cmd ref: no need for word "specific" (or "SHA") in get/import
jorgeorpinel Feb 9, 2020
296c2f0
cmd ref: update "project"->"workspace" term and example intros in `dv…
jorgeorpinel Feb 9, 2020
d158a57
docs: 2 misc updates
jorgeorpinel Feb 10, 2020
cbecbb3
tutorials: update model->"data or model"
jorgeorpinel Feb 10, 2020
455e793
cmd ref: fixed link to `metrics diff` and updated mention of it in `m…
jorgeorpinel Feb 10, 2020
88fbb09
get-started: typo in pipelines chapter
jorgeorpinel Feb 10, 2020
57ff736
cmd ref: rewrite paragraph about fixed revision import stages in `upd…
jorgeorpinel Feb 10, 2020
cb134fe
cmd ref: rewrite p about `repro` rewriting artifacts in cache
jorgeorpinel Feb 10, 2020
b50d6a9
cmd ref: rephrase and split p about what to compare and about targets…
jorgeorpinel Feb 10, 2020
5fbec22
cmd ref: reorg last part of repro desc
jorgeorpinel Feb 11, 2020
76cb691
cmd ref: restore `git tag` sample output in checkout examples
jorgeorpinel Feb 11, 2020
c64e70e
cmd ref: small rewording around tag names
jorgeorpinel Feb 11, 2020
1eb88fe
cmd ref: `metrics diff` and `diff` intro updates
jorgeorpinel Feb 11, 2020
fc25128
term: use "version" instead of "revision" in `import` cmd ref
jorgeorpinel Feb 12, 2020
1400f4a
cmd ref: updates to `metrics diff` (and `diff`) descriptions
jorgeorpinel Feb 12, 2020
d52dc5a
cmd ref: change "ref" -> "rev" per iterative/dvc/pull/3299
jorgeorpinel Feb 12, 2020
9e2b173
cmd ref: "revision"->"version" in a couple more docs
jorgeorpinel Feb 12, 2020
630875f
cmd ref: simplify note about `metrics diff` in `metrics show`
jorgeorpinel Feb 12, 2020
70f9ad5
cmd ref: use descriptive exampe-get-started repo tags in `get` examples
jorgeorpinel Feb 12, 2020
51d5605
term: "commit hash"->"commit SHA hash" to match #962 but
jorgeorpinel Feb 12, 2020
6ed1d44
cmd ref: improve -a adn -c option descs
jorgeorpinel Feb 13, 2020
f0f8c2d
cmd ref: remove p about --targets option in metrics diff
jorgeorpinel Feb 13, 2020
da227d6
cmd ref: rewrite pa bout fixed revisions/re-importing in `update`
jorgeorpinel Feb 13, 2020
0ead25e
tutorials: use and instead of data "or" models in versioning tut
jorgeorpinel Feb 13, 2020
4f49733
user-guide: restore bullet about `git` in analytics
jorgeorpinel Feb 13, 2020
bbf573f
you don't usually merge tags
jorgeorpinel Feb 13, 2020
6215bc2
term: don't use "version of repo/project" when referring to commits
jorgeorpinel Feb 13, 2020
fb459f9
cmd ref: simlpify note about metrics diff in metrics show
jorgeorpinel Feb 13, 2020
e12e3d7
cmd ref: and->and/or in checkout sample of versioning tut
jorgeorpinel Feb 13, 2020
2b2bfca
user-guide: updated analytics details
jorgeorpinel Feb 13, 2020
fc6d3e9
cmd ref: restore simpler wording about status -aT desc
jorgeorpinel Feb 13, 2020
1473ebe
cmd ref: correct (again) the short desc for diff and metrics diff
jorgeorpinel Feb 13, 2020
4 changes: 2 additions & 2 deletions public/static/docs/changelog/0.18.md
Expand Up @@ -28,8 +28,8 @@ really excited to share the progress with you:
- 🙂 **Usability improvements** - DVC interface got more informative and easier
to use:

- More heavy operations render dynamic progress bar (e.g. hash computation):
![](/static/img/0.18-progress.gif)
- More heavy operations render dynamic progress bar (e.g. file hash
computation): ![](/static/img/0.18-progress.gif)

- Pipeline visualization via command line. Just run `dvc pipeline show` with
`ascii` option and a target: ![](/static/img/0.18-pipeline.gif)
Expand Down
8 changes: 4 additions & 4 deletions public/static/docs/command-reference/add.md
Expand Up @@ -74,8 +74,8 @@ to work with directory hierarchies with `dvc add`:
directory (with default name `dirname.dvc`). Every file in the hierarchy is
added to the cache (unless `--no-commit` flag is added), but DVC does not
produce individual DVC-files for each file in the directory tree. Instead,
the single DVC-file points to a file in the cache that contains references to
the files in the added hierarchy.
the single DVC-file references a file in the cache that in turn points to the
files in the added hierarchy.

In a <abbr>DVC project</abbr>, `dvc add` can be used to version control any
<abbr>data artifact</abbr> (input, intermediate, or output files and
Expand Down Expand Up @@ -197,8 +197,8 @@ Saving information to 'pics.dvc'.

There are no [DVC-files](/doc/user-guide/dvc-file-format) generated within this
directory structure, but the images are all added to the <abbr>cache</abbr>. DVC
prints a message about this, mentioning that `md5` values are computed for each
directory. A single `pics.dvc` DVC-file is generated for the top-level
prints a message about this, mentioning that MD5 hash values are computed for
each directory. A single `pics.dvc` DVC-file is generated for the top-level
directory, and it contains:

```yaml
Expand Down
78 changes: 35 additions & 43 deletions public/static/docs/command-reference/checkout.md
Expand Up @@ -33,17 +33,17 @@ The execution of `dvc checkout` does the following:

- Scans the DVC-files to compare against the data files or directories in the
<abbr>workspace</abbr>. DVC knows which data (<abbr>outputs</abbr>) match
because their checksums are saved in the `outs` fields inside the DVC-files.
Scanning is limited to the given `targets` (if any). See also options
`--with-deps` and `--recursive` below.
because the corresponding file hash values are saved in the `outs` fields in
the DVC-files. Scanning is limited to the given `targets` (if any). See also
options `--with-deps` and `--recursive` below.

- Missing data files or directories, or those that don't match with any
DVC-file, are restored from the <abbr>cache</abbr>. See options `--force` and
`--relink`.
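
The comparison described in the first step can be pictured with a minimal sketch. It is not DVC's implementation: it assumes the recorded hash is a plain MD5 of the file content, and `file_md5` and `find_mismatched` are hypothetical helper names. The `outs` entries mirror the `md5` and `path` fields shown in DVC-files.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatched(outs, workspace="."):
    """List output paths that are missing or whose content hash differs.

    `outs` is a list of {"md5": ..., "path": ...} dicts (assumed shape).
    """
    mismatched = []
    for out in outs:
        target = Path(workspace) / out["path"]
        if not target.is_file() or file_md5(target) != out["md5"]:
            mismatched.append(out["path"])
    return mismatched
```

Any path returned by `find_mismatched` corresponds to data that the second step would restore from the cache.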

By default, this command tries not to copy files between the cache and the
workspace, using reflinks instead, when supported by the file system. (Refer to
[File link types](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache).)
By default, this command tries not to make copies of cached files in the workspace,
using reflinks instead when supported by the file system (refer to
[File link types](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache)).
The next linking strategy default value is `copy` though, so unless other file
link types are manually configured in `cache.type` (using `dvc config`), files
will be copied. Keep in mind that having file copies doesn't present much of a
Expand Down Expand Up @@ -102,10 +102,9 @@ be pulled from remote storage using `dvc pull`.
## Examples

Let's employ a simple <abbr>workspace</abbr> with some data, code, ML models,
pipeline stages, as well as a few Git tags, such as our
[get started example repo](https://github.com/iterative/example-get-started).
Then we can see what happens with `git checkout` and `dvc checkout` as we switch
from tag to tag.
pipeline stages, such as the <abbr>DVC project</abbr> created in our
[Get Started](/doc/get-started) section. Then we can see what happens with
`git checkout` and `dvc checkout` as we switch from tag to tag.

<details>

Expand All @@ -120,8 +119,7 @@ $ cd example-get-started

</details>

The workspace looks almost like in this
[pipeline setup](/doc/tutorials/pipelines):
The workspace looks something like this:

```dvc
.
Expand All @@ -131,17 +129,19 @@ The workspace looks almost like in this
├── featurize.dvc
├── prepare.dvc
├── train.dvc
└── src
└── <code files here>
├── src
│ └── ...
└── train.dvc
```

We have these tags in the repository that represent different iterations of
solving the problem:
This repository includes the following tags, which represent different variants
of the resulting model:

```dvc
$ git tag
baseline-experiment <- first simple version of the model
bigrams-experiment <- use bigrams to improve the model
...
baseline-experiment <- First simple version of the model
bigrams-experiment <- Uses bigrams to improve the model
```

This project comes with a predefined HTTP
Expand All @@ -153,57 +153,52 @@ files that are under DVC control. The model file checksum

```dvc
$ dvc pull
...
Checking out model.pkl with cache '3863d0e317dee0a55c4e59d2ec0eef33'
...

$ md5 model.pkl
MD5 (model.pkl) = 3863d0e317dee0a55c4e59d2ec0eef33
MD5 (model.pkl) = 662eb7f64216d9c2c1088d0a5e2c6951
```

What if we want to rewind history, so to speak? The `git checkout` command lets
us checkout at any point in the commit history, or even checkout other tags. It
What if we want to "rewind history", so to speak? The `git checkout` command
lets us restore any point in the repository history, including any tags. It
automatically adjusts the files, by replacing file content and adding or
deleting files as necessary.

```dvc
$ git checkout baseline
Note: checking out 'baseline'.
...
HEAD is now at 40cc182...
$ git checkout baseline-experiment # Stage where model is first created
```

Let's check the `model.pkl` entry in `train.dvc` now:

```yaml
outs:
md5: a66489653d1b6a8ba989799367b32c43
path: model.pkl
- md5: 43630cce66a2432dcecddc9dd006d0a7
path: model.pkl
```

But if you check `model.pkl`, the file hash is still the same:

```dvc
$ md5 model.pkl
MD5 (model.pkl) = 3863d0e317dee0a55c4e59d2ec0eef33
MD5 (model.pkl) = 662eb7f64216d9c2c1088d0a5e2c6951
```

This is because `git checkout` changed `featurize.dvc`, `train.dvc`, and other
DVC-files. But it did nothing with the `model.pkl` and `matrix.pkl` files. Git
doesn't track those files, DVC does, so we must do this:
doesn't track those files; DVC does, so we must do this:

```dvc
$ dvc fetch
$ dvc checkout

$ md5 model.pkl
MD5 (model.pkl) = a66489653d1b6a8ba989799367b32c43
MD5 (model.pkl) = 43630cce66a2432dcecddc9dd006d0a7
```

What happened is that DVC went through the sole existing DVC-file and adjusted
the current set of files to match the `outs` of that stage. `dvc fetch` is run
once to download missing data from the remote storage to the <abbr>cache</abbr>.
Alternatively, we could have just run `dvc pull` in this case to automatically
do `dvc fetch` + `dvc checkout`.
What happened is that DVC went through the DVC-files and adjusted the current
set of files to match the `outs` in them. `dvc fetch` is run once to
download missing data from the remote storage to the <abbr>cache</abbr>.
(Alternatively, we could have just run `dvc pull` to do `dvc fetch` +
`dvc checkout` in one step.)
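
As a rough illustration of that restore step, here is a toy sketch. It is not DVC's code: it assumes the cache stores each file under a directory named after the first two characters of its MD5 hash, always copies instead of linking, and uses the hypothetical names `cache_path` and `restore_outputs`.

```python
import shutil
from pathlib import Path


def cache_path(cache_dir, md5):
    """Map a hash to its assumed cache location: <cache>/<2 chars>/<rest>."""
    return Path(cache_dir) / md5[:2] / md5[2:]


def restore_outputs(outs, cache_dir, workspace="."):
    """Copy each output listed in `outs` from the cache into the workspace.

    `outs` is a list of {"md5": ..., "path": ...} dicts (assumed shape).
    Returns the paths whose data is not in the cache; those would first
    have to be downloaded from remote storage.
    """
    missing = []
    for out in outs:
        source = cache_path(cache_dir, out["md5"])
        target = Path(workspace) / out["path"]
        if not source.is_file():
            missing.append(out["path"])
            continue
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(source, target)
    return missing
```

In this picture, a non-empty return value is the situation where a fetch is needed before the checkout can complete.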

## Example: Automating DVC checkout

Expand All @@ -223,13 +218,10 @@ running `dvc checkout` when needed.
again:

```dvc
$ git checkout bigrams
Previous HEAD position was d171a12 add evaluation stage
HEAD is now at d092b42 try using bigrams
Checking out model.pkl with cache '3863d0e317dee0a55c4e59d2ec0eef33'.
$ git checkout bigrams-experiment # Has the latest model version

$ md5 model.pkl
MD5 (model.pkl) = 3863d0e317dee0a55c4e59d2ec0eef33
MD5 (model.pkl) = 662eb7f64216d9c2c1088d0a5e2c6951
```

Previously this took two commands, `git checkout` followed by `dvc checkout`. We
Expand Down
6 changes: 3 additions & 3 deletions public/static/docs/command-reference/config.md
Expand Up @@ -115,9 +115,9 @@ for more details.) This section contains the following options:

Due to the way DVC handles linking between the data files in the cache and
their counterparts in the <abbr>workspace</abbr>, it's easy to accidentally
corrupt the cached version of a file by editing or overwriting it. Turning
this config option on forces you to run `dvc unprotect` before updating a
file, providing an additional layer of security to your data.
corrupt the cached file by editing or overwriting it. Turning this config
option on forces you to run `dvc unprotect` before updating a file, providing
an additional layer of security to your data.

We highly recommend enabling this option when `cache.type` is set to
`hardlink` or `symlink`.
Expand Down
81 changes: 38 additions & 43 deletions public/static/docs/command-reference/diff.md
@@ -1,49 +1,41 @@
# diff

Show differences between two versions of the <abbr>DVC project</abbr>. It can be
narrowed down to specific target files and directories under DVC control.

> This command requires that the project is a [Git](https://git-scm.com/)
> repository.
Show changes between commits in the <abbr>DVC repository</abbr>, or between a
commit and the <abbr>workspace</abbr>. The comparison can be narrowed down to
specific target files/directories tracked by DVC.

## Synopsis

```usage
usage: dvc diff [-h] [-q | -v] [-t TARGET] a_ref [b_ref]

positional arguments:
a_ref Git reference from which diff calculates
b_ref Git reference until which diff calculates, if omitted diff
shows the difference between current HEAD and a_ref
a_rev Old Git commit to compare (defaults to HEAD)
b_rev New Git commit to compare (defaults to the
current workspace)
```

## Description

Given two Git commit references (commit hash, branch or tag name, etc) `a_ref`
and `b_ref`, this command shows a comparative summary of basic statistics: how
many files were deleted/changed, and the file size differences.

> Note that `dvc diff` does not show the line-to-line comparison among the
> target files in each revision, like `git diff` or
> [GNU `diff`](https://www.gnu.org/software/diffutils/) can. This is because the
> data tracked by DVC can come in many possible formats e.g. structured
> text, or binary blobs, etc.
Given two commit SHA hashes, branch or tag names, etc.
([references](https://git-scm.com/docs/revisions)) `a_ref` and `b_ref`, this
command shows a comparative summary of basic statistics: how many files were
deleted/changed, and the file size differences.

> For an example on how to create line-to-line text file comparison, refer to
> [issue #770](https://github.com/iterative/dvc/issues/770#issuecomment-512693256)
> in our GitHub repository.
> Note that `dvc diff` does not show the line-to-line comparisons like
> `git diff` or [GNU `diff`](https://www.gnu.org/software/diffutils/) can. This
> is because the data tracked by DVC comes in many formats such as
> structured text, binary blobs, etc. For an example on how to create
> line-to-line text file comparison, refer to
> [issue #770](https://github.com/iterative/dvc/issues/770#issuecomment-512693256).

If the `-t` option is used, the diff is limited to the `TARGET` file or
directory specified.

Note that `dvc diff` does not have an effect when the repository is not tracked
by the Git SCM, for example when `dvc init` was used with the `--no-scm` option.
`dvc diff` does not have an effect when the repository is not tracked by Git,
for example when `dvc init` was used with the `--no-scm` option.
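
To make the kind of summary described above concrete, here is a small sketch. It is only an illustration under stated assumptions, not how `dvc diff` is implemented: each commit is modeled as a hypothetical `{path: (md5, size_in_bytes)}` mapping, and `summarize_diff` is an invented helper name.

```python
def summarize_diff(old, new):
    """Compare two {path: (md5, size_in_bytes)} snapshots (assumed shape)."""
    old_paths, new_paths = set(old), set(new)
    # A file counts as changed when it exists on both sides with a new hash.
    changed = sorted(
        path for path in old_paths & new_paths if old[path][0] != new[path][0]
    )
    old_size = sum(size for _, size in old.values())
    new_size = sum(size for _, size in new.values())
    return {
        "added": sorted(new_paths - old_paths),
        "deleted": sorted(old_paths - new_paths),
        "changed": changed,
        "size_delta": new_size - old_size,
    }
```

Only hashes and sizes are compared here, which matches the point that no line-by-line comparison is attempted.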

## Options

- `-t TARGET`, `--target TARGET` - path to a data file or directory. If not
specified, compares all files and directories that are under DVC control in
the workspace.
- `-t TARGET`, `--target TARGET` - path to a data file or directory to limit
diff for.

- `-h`, `--help` - prints the usage/help message, and exit.

Expand All @@ -64,8 +56,9 @@ For these examples we can use the chapters in our

Start by cloning our example repo if you don't already have it. Then move into
the repo and checkout the
[version](https://github.com/iterative/example-get-started/releases/tag/3-add-file)
corresponding to the _Add Files_ chapter:
[3-add-file](https://github.com/iterative/example-get-started/releases/tag/3-add-file)
tag, corresponding to the [Add Files](/doc/get-started/add-files) _Get Started_
chapter:

```dvc
$ git clone https://github.com/iterative/example-get-started
Expand All @@ -83,13 +76,14 @@ Preparing to download data from 'https://remote.dvc.org/get-started'

</details>

## Example: Previous version of the same branch
## Example: Previous commit in the same branch

The minimal `dvc diff` command only includes the "from" reference (`a_ref`) from
which to calculate the difference. The "until" reference (`b_ref`) defaults to
`HEAD` (current Git commit).

The minimal `dvc diff` command only includes the A reference (`a_ref`) from
which the difference is to be calculated. The B reference (`b_ref`) defaults to
Git `HEAD` (the currently checked out version). To find the general differences
with the very previous committed version of the project, we can use the `HEAD^`
Git reference.
To see the difference with the very previous commit of the project, we can use
`HEAD^` as `a_ref`:

```dvc
$ dvc diff HEAD^
Expand All @@ -101,7 +95,7 @@ diff for 'data/data.xml'
added file with size 37.9 MB
```

## Example: Specific targets across Git references
## Example: Specific targets across Git commits

We can base this example on the [Metrics](/doc/get-started/metrics) and
[Compare Experiments](/doc/get-started/compare-experiments) chapters of our _Get
Expand Down Expand Up @@ -131,8 +125,8 @@ example repo.

</details>

To see the difference in `model.pkl` among these versions, we can run the
following command.
To see the difference in `model.pkl` among these tags, we can run the following
command.

```dvc
$ dvc diff -t model.pkl baseline-experiment bigrams-experiment
Expand All @@ -145,7 +139,8 @@ diff for 'model.pkl'
```

The output from this command confirms that there's a difference in the
`model.pkl` file between the 2 Git references we indicated.
`model.pkl` file between the 2 Git commits (tags `baseline-experiment` and
`bigrams-experiment`) we indicated.

### What about directories?

Expand Down Expand Up @@ -193,6 +188,6 @@ diff for 'data/prepared'
```

The command above checks whether there have been any changes to the
`data/prepared` directory after the `5-preparation` version (since the `b_ref`
is the current version, `HEAD` by default). The output tells us that there have
been no changes to that directory (or to any other file).
`data/prepared` directory after the `5-preparation` tag (since the `b_ref` is
`HEAD` by default). The output tells us that there have been no changes to that
directory (or to any other file).
17 changes: 9 additions & 8 deletions public/static/docs/command-reference/fetch.md
Expand Up @@ -94,10 +94,11 @@ specified in DVC-files currently in the project are considered by `dvc fetch`
fetched. The default value is `4 * cpu_count()`. For SSH remotes default is
just 4.

- `-a`, `--all-branches` - fetch cache for all Git branches, not just the active
one. This means DVC may download files needed to reproduce different versions
of a DVC-file ([experiments](/doc/get-started/experiments)), not just the
current one.
- `-a`, `--all-branches` - fetch cache for all Git branches instead of just the
current workspace. This means DVC may download files needed to reproduce
different versions of a DVC-file
([experiments](/doc/get-started/experiments)), not just the ones currently in
the workspace.

- `-T`, `--all-tags` - fetch cache for all Git tags. Similar to `-a` above. Note
that both options can be combined, for example using the `-aT` flag.
Expand All @@ -115,9 +116,9 @@ specified in DVC-files currently in the project are considered by `dvc fetch`
## Examples

Let's employ a simple <abbr>workspace</abbr> with some data, code, ML models,
pipeline stages, as well as a few Git tags, such as our
[get started example repo](https://github.com/iterative/example-get-started).
Then we can see what happens with `dvc fetch` as we switch from tag to tag.
pipeline stages, such as the <abbr>DVC project</abbr> created in our
[Get Started](/doc/get-started) section. Then we can see what happens with
`dvc fetch` as we switch from tag to tag.

<details>

Expand Down Expand Up @@ -154,7 +155,7 @@ solving the problem:
$ git tag

baseline-experiment <- first simple version of the model
bigrams-experiment <- use bigrams to improve the model
bigrams-experiment <- use bigrams to improve the model
```

## Example: Default behavior
Expand Down