Commit

cmd ref: dvc add 1.0 update (#1411)
* cmd ref: add note that move creates dirs

* cmd ref: improve structure of add ref desc.

* grammar: add some commas

* term: checksum -> hash value in dvcignore guide

* style: lower case bullet text

* cmd ref: remove some redundancy in metrics index

* cmd ref: update plots refs synopsis and descriptions
per iterative/dvc/issues/3924 et al.

* Add plots modify cmd

* typo: CSV->csv

* term: working tree -> workspace
per iterative/dvc/pull/3914

* cmd ref: couple improvements to add ref
per #1382 (review)
and #1382 (review)

* Update config/prismjs/dvc-commands.js

* cmd ref: update plots modify description

* cmd ref: add plots modify to nav, with a few more improvements

* cmd ref: plots --show-json -> --show-vega
per iterative/dvc#3891 (comment)

* rename x-lab to x-label

* cmd ref: review descriptions of plots index, show, and diff

* cmd ref: review and update old plots cmds options
per iterative/dvc#3948 et al.

* cmd ref: a couple more option updates
per #1382 (review)

* cmd ref: emphasize add works with any large file/dir
per #1382 (review)

* cmd ref: update plots modify top half (definition, description)
per #1382 (review) et al.

* cmd ref: improve all plot cmd option descriptions

* Update content/docs/command-reference/plots/modify.md

* cmd ref: review examples (mainly images) in plots modify
per #1382 (comment) et al.

* cmd ref: rephrase info about how data arrays are injected to plot templates
per #1382 (review)

* cmd ref: update info on how targets work for plots show/diff
per #1382 (review)

* cmd ref: double check all plots examples
per #1382 (comment)

* cmd ref: remove info about plots show --select

* cmd ref: update add desc
per #1382 (review)

* cmd ref: re-explain dvc add for dirs
per #1382 (review)

* cmd ref: improve description about targets in plots diff
per #1382 (review)

* cmd ref: make emoji note in plots index
per #1382 (review)

* cmd ref: remove ineffective CSV code block highlighting from plots refs
per #1382 (review)

* get started: improve intro in index

* glossary: remove external deps entry (no need)

* cmd ref: update add for 1.0 (1) up to...
before Examples

* cmd ref: 1.0 updates for add (2) - examples

* cmd ref: remove note about comments in add example
per #1411 (review)

Co-authored-by: Dmitry Petrov <[email protected]>
jorgeorpinel and dmpetrov authored Jun 11, 2020
1 parent 37f4e90 commit 64182e2
Showing 4 changed files with 66 additions and 73 deletions.
128 changes: 61 additions & 67 deletions content/docs/command-reference/add.md
@@ -6,11 +6,11 @@ Track data files or directories with DVC, by creating a corresponding
## Synopsis

```usage
usage: dvc add [-h] [-q | -v] [-R] [--no-commit] [-f <filename>]
targets [targets ...]
usage: dvc add [-h] [-q | -v] [-R] [--no-commit] [--external]
[-f <filename>] targets [targets ...]
positional arguments:
targets Input files/directories to add.
targets Files or directories to add
```

## Description
@@ -36,29 +36,30 @@ Under the hood, a few actions are taken for each file (or directory) in

1. Calculate the file hash.
2. Move the file contents to the cache (by default in `.dvc/cache`), using the
file hash to form the cached file names. (See
file hash to form the cached file path. (See
[Structure of cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory)
for more details.)
3. Attempt to replace the file with a link to the cached data (more details
further down).
4. Create a corresponding `.dvc` file to store the file (as an
<abbr>output</abbr>), using its path and hash to identify the cached data.
Unless the `-f` option is used, the `.dvc` file name generated by default is
`<file>.dvc`, where `<file>` is the file name of the first target.
5. Unless `dvc init --no-scm` was used when initializing the project, add the
`targets` to `.gitignore` in order to prevent them from being committed to
the Git repository.
3. Attempt to replace the file with a link to the cached data (more details on
file linking further down).
4. Create a corresponding [`.dvc` file](/doc/user-guide/dvc-file-format) to
track the file, using its path and hash to identify the cached data. The
`.dvc` file lists the DVC-tracked file as an <abbr>output</abbr> (`outs`
field). Unless the `-f` option is used, the `.dvc` file name generated by
default is `<file>.dvc`, where `<file>` is the file name of the first target.
5. Add the `targets` to `.gitignore` in order to prevent them from being
committed to the Git repository (unless `dvc init --no-scm` was used when
initializing the DVC project).
6. Instructions are printed showing `git` commands for adding the files, if
appropriate.

Summarizing, the result is that the target data is replaced by small `.dvc`
files that can be tracked with Git. See
files that can easily be tracked with Git. See
[DVC-File Format](/doc/user-guide/dvc-file-format) for more details.
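
Steps 1 and 4 above can be illustrated with a minimal Python sketch. It is not DVC's implementation: it assumes a plain MD5 over the raw file bytes (DVC's actual hashing may handle some files differently), and the helper names `file_md5` and `default_dvc_file_name` are hypothetical.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks.

    Assumption: a plain MD5 over the raw bytes, for illustration only.
    """
    digest = hashlib.md5()
    with open(path, "rb") as fobj:
        # Read fixed-size chunks so large files never sit fully in memory
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def default_dvc_file_name(targets):
    """Default `.dvc` file name: `<file>.dvc`, where `<file>` is the
    file name of the first target (step 4, when `-f` is not used)."""
    return Path(targets[0]).name + ".dvc"
```

For instance, `default_dvc_file_name(["data.xml"])` gives `data.xml.dvc`, matching the file name shown in the examples further down.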

> Note that `.dvc` files created by this command are considered _orphan stage
> files_ because they have no _dependencies_, only outputs. These are always
> treated as _changed_ by `dvc repro`, which always executes them. See `dvc run`
> to learn more about stage files.
> Note that `.dvc` files can be considered _orphan stages_, because they have no
> <abbr>dependencies</abbr>, only outputs. They are treated as _always changed_
> by `dvc status` and `dvc repro`, and the latter always executes them. See
> [`dvc.yaml`](/doc/user-guide/dvc-file-format) to learn more about stages.
To avoid adding files inside a directory accidentally, you can add the
corresponding [patterns](/doc/user-guide/dvcignore) in a `.dvcignore` file.
@@ -111,6 +112,9 @@ undesirable for data directories with a large number of files.
file name of the given target. This option allows setting the name and the path
of the generated `.dvc` file.

- `--external` - allow `targets` that are outside of the DVC repository. See
[Managing External Data](/doc/user-guide/managing-external-data).

- `-h`, `--help` - print the usage/help message and exit.

- `-q`, `--quiet` - do not write anything to standard output. Exit with 0 if no
@@ -124,15 +128,14 @@ Track a file with DVC:

```dvc
$ dvc add data.xml
...
Saving information to 'data.xml.dvc'.
To track the changes with git run:
To track the changes with git, run:
git add .gitignore data.xml.dvc
git add .gitignore data.xml.dvc
```

As shown above, a [`.dvc` file](/doc/user-guide/dvc-file-format) has been
As indicated above, a [`.dvc` file](/doc/user-guide/dvc-file-format) has been
created for `data.xml`. Let's explore the result:

```dvc
@@ -145,32 +148,21 @@ $ tree
Let's check the `data.xml.dvc` file inside:

```yaml
md5: aae37d74224b05178153acd94e15956b
outs:
- cache: true
md5: d8acabbfd4ee51c95da5d7628c7ef74b
metric: false
- md5: 6137cde4893c59f76f005a8123d8e8e6
path: data.xml
meta: # Special field to contain arbitrary user data
name: John
email: [email protected]
```
This is a standard `.dvc` file with only one output (in the `outs` field). The
hash value should correspond to a file path in the <abbr>cache</abbr>.

> Note that the `meta` values above were entered manually for this example. Meta
> values and `#` comments are not preserved when a `.dvc` file is overwritten
> with the `dvc add`, `dvc run`, `dvc import`, or `dvc import-url` commands.
This is a standard `.dvc` file with only one output (`outs` field). The hash
value (`md5` field) corresponds to a file path in the <abbr>cache</abbr>.

```dvc
$ file .dvc/cache/d8/acabbfd4ee51c95da5d7628c7ef74b
.dvc/cache/d8/acabbfd4ee51c95da5d7628c7ef74b: ASCII text
.dvc/cache/61/37cde4893c59f76f005a8123d8e8e6: ASCII text
```
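
The mapping from a hash value to its cache path, as seen in the `file` command above, can be sketched in Python. This assumes only what the example shows (the first two characters of the hash become a subdirectory of `.dvc/cache`, and the rest becomes the file name); the function name `cache_path` is hypothetical.

```python
from pathlib import PurePosixPath


def cache_path(hash_value, cache_dir=".dvc/cache"):
    """Build the cache path for a hash value, mirroring the example above:
    the first two characters form the subdirectory, the rest the file name."""
    return str(PurePosixPath(cache_dir) / hash_value[:2] / hash_value[2:])


print(cache_path("6137cde4893c59f76f005a8123d8e8e6"))
# prints ".dvc/cache/61/37cde4893c59f76f005a8123d8e8e6"
```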

Note that tracking compressed files (e.g. ZIP or TAR archives) is not
recommended, as `dvc add` supports tracking directories. (Details below.)
⚠️ Note that tracking compressed files (e.g. ZIP or TAR archives) is not
recommended, as `dvc add` supports tracking directories (details below).

## Example: Directory

@@ -193,63 +185,64 @@ Tracking a directory with DVC is as simple as with a single file:

```dvc
$ dvc add pics
Computing md5 for a large number of files. This is only done once.
...
Linking directory 'pics'.
Saving information to 'pics.dvc'.
...
```

There are no [`.dvc` files](/doc/user-guide/dvc-file-format) generated within
this directory structure, but the images are all added to the
<abbr>cache</abbr>. DVC prints a message mentioning that MD5 hash values are
computed for each file. A single `pics.dvc` file is generated for the top-level
this directory structure to match each image, but the image files are all
<abbr>cached</abbr>. A single `pics.dvc` file is generated for the top-level
directory, and it contains:

```yaml
md5: df06d8d51e6483ed5a74d3979f8fe42e
outs:
- cache: true
md5: b8f4d5a78e55e88906d5f4aeaf43802e.dir
metric: false
- md5: ce57450aa92ab8f2b957c24b0df73edc.dir
path: pics
wdir: .
```

> Refer to
> [Structure of cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory)
> for more info.

This allows us to treat the entire directory structure as a single <abbr>data
artifact</abbr>. This lets you pass the whole directory tree as a
artifact</abbr>. For example, you can pass the whole directory tree as a
<abbr>dependency</abbr> to a `dvc run` stage definition:

```dvc
$ dvc run -f train.dvc \
$ dvc run -n train \
-d train.py -d pics \
-M metrics.json -o model.h5 \
python train.py
```

> To follow the full example, see the [Versioning](/doc/tutorials/versioning)
> tutorial.
> To try this example, see the [Versioning](/doc/tutorials/versioning) tutorial.

If instead we use the `--recursive` (`-R`) option, the output looks like this:

```dvc
$ dvc add -R pics
Saving information to 'pics/cat1.jpg.dvc'.
Saving information to 'pics/cat3.jpg.dvc'.
Saving information to 'pics/cat2.jpg.dvc'.
Saving information to 'pics/cat4.jpg.dvc'.
...
```

In this case, a `.dvc` file is generated for each file in the `pics/` directory
tree. No top-level `.dvc` file is generated, which is typically less convenient.
For example, we cannot use the directory structure as one unit with `dvc run` or
other commands.
tree:

```dvc
$ tree pics
pics
├── train
| ├── cats
| | ├── img1.jpg
| | ├── img1.jpg.dvc
| | ├── img2.jpg
| | ├── img2.jpg.dvc
| | ├── ...
| └── dogs
| ├── img1.jpg
| ├── img1.jpg.dvc
| ...
```

Note that no top-level `.dvc` file is generated, which is typically less
convenient. For example, we cannot use the directory structure as one unit with
`dvc run` or other commands.
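
To make the difference concrete, here is a rough Python sketch that lists the `.dvc` files one would expect next to each file after a recursive add. It only mirrors the naming pattern shown in the tree above and does not call DVC; the function name `expected_dvc_files` is hypothetical, and skipping already-present `.dvc` files is an assumption made for this illustration.

```python
import os


def expected_dvc_files(root):
    """List the per-file `.dvc` paths expected under `root` (one per data file)."""
    expected = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".dvc"):
                continue  # assumption: existing .dvc files are not data files
            expected.append(os.path.join(dirpath, name + ".dvc"))
    return sorted(expected)
```

Note that the result contains one entry per data file and nothing for `root` itself, which is why the directory cannot be referenced as a single unit in this case.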

## Example: Dvcignore

@@ -290,6 +283,7 @@ $ tree .dvc/cache
└── 4bcc8502a70ac49bf441db350eafc2
```

Only the hash values of directory (`dir/`) and `file2` have been cached.
Only the hash values of the `dir/` directory (with `.dir` file extension) and
`file2` have been cached.

See [Dvcignore](/doc/user-guide/dvcignore) for more details.
3 changes: 1 addition & 2 deletions content/docs/command-reference/plots/diff.md
@@ -102,8 +102,7 @@ file:///Users/dmitry/src/plots/logs.html

> Note that we renamed the X axis label with option `--x-label x`.
Compare two specific versions (commit hashes, tags, or branches can be provided,
for example):
Compare two specific versions (commit hashes, tags, or branches):

```dvc
$ dvc plots diff --targets logs.csv HEAD 0135527
6 changes: 3 additions & 3 deletions content/docs/command-reference/status.md
@@ -65,9 +65,9 @@ describing the changes (described below).
someone manually edited the file).

- _always changed_ means that this is a DVC-file with no dependencies (an
_orphan_ stage file) or that it has the `always_changed: true` value set (see
`--always-changed` option in `dvc run`), so it's considered always changed, and
thus is always executed by `dvc repro`.
_orphan stage_; see `dvc add`) or that it has the `always_changed: true` value
set (see `--always-changed` option in `dvc run`), so it's considered always
changed, and thus is always executed by `dvc repro`.

- _changed deps_ or _changed outs_ means that there are changes in dependencies
or outputs tracked by the <abbr>DVC-file</abbr>. Depending on the use case,
2 changes: 1 addition & 1 deletion content/docs/tutorials/pipelines.md
@@ -110,7 +110,7 @@ hidden from the user. This directory is automatically staged with `git add`, so
it can be easily committed with Git.

Note that the DVC-file created by `dvc add` has no dependencies, a.k.a. an
_orphan_ [stage file](/doc/command-reference/run):
_orphan stage_ (see `dvc add`):

```yaml
md5: c183f094869ef359e87e68d2264b6cdd