Merge pull request #1384 from iterative/fix-886
cmd: document target granularity for push/pull/etc, et al.
jorgeorpinel authored Aug 6, 2020
2 parents 138acdf + acf0196 commit 81fbb66
Showing 10 changed files with 290 additions and 275 deletions.
28 changes: 13 additions & 15 deletions content/docs/command-reference/add.md
@@ -56,13 +56,8 @@ each one:
Summarizing, the result is that the target data is replaced by small `.dvc`
files that can be easily tracked with Git.

> Note that `.dvc` files can be considered _orphan stages_, because they have no
> <abbr>dependencies</abbr>, only outputs. These are treated as _always changed_
> by `dvc status` and `dvc repro`, which always executes them. See `dvc.yaml` to
> learn more about stages.
To avoid adding files inside a directory accidentally, you can add the
corresponding [patterns](/doc/user-guide/dvcignore) in a `.dvcignore` file.
It's possible to prevent files or directories from being added by DVC by adding
the corresponding patterns in a [`.dvcignore`](/doc/user-guide/dvcignore) file.

By default, DVC tries to use reflinks (see
[File link types](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache)
@@ -73,15 +68,14 @@ large files. DVC also supports other link types for use on file systems without

### Adding entire directories

A `dvc add` target can be an individual file or a directory. In the latter case,
a `.dvc` file is created for the top of the directory (with default name
A `dvc add` target can be either a file or a directory. In the latter case, a
`.dvc` file is created for the top of the hierarchy (with default name
`<dir_name>.dvc`).

Every file in the hierarchy is added to the cache (unless the `--no-commit`
option is used), but DVC does not produce individual `.dvc` files for each file
in the directory tree. Instead, the single `.dvc` file references a special JSON
file in the cache (with `.dir` extension), that in turn points to the added
files.
Every file inside is added to the cache (unless the `--no-commit` option is
used), but DVC does not produce individual `.dvc` files for each file in the
entire tree. Instead, the single `.dvc` file references a special JSON file in
the cache (with `.dir` extension), that in turn points to the added files.
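As a rough illustration of the idea (not DVC's actual implementation), the sketch below builds a JSON listing that maps every file under a directory to a content hash. The helper names, the `md5` and `relpath` key names, and the sample file names are assumptions made for this example.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def file_md5(path: Path) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_dir_manifest(root: Path) -> List[Dict[str, str]]:
    """List every file under root with its hash and relative path.

    Hypothetical helper: the key names are assumptions for this sketch.
    """
    entries = [
        {"md5": file_md5(p), "relpath": p.relative_to(root).as_posix()}
        for p in root.rglob("*")
        if p.is_file()
    ]
    # Sort by relative path so the listing is stable between runs.
    return sorted(entries, key=lambda entry: entry["relpath"])


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        pics = Path(tmp) / "pics"
        (pics / "cats").mkdir(parents=True)
        (pics / "dogs").mkdir()
        (pics / "cats" / "cat.1.jpg").write_bytes(b"placeholder cat image")
        (pics / "dogs" / "dog.1.jpg").write_bytes(b"placeholder dog image")
        print(json.dumps(build_dir_manifest(pics), indent=2))
```

A single listing like this is what lets one `.dvc` file stand in for a whole directory tree: the `.dvc` file only needs to reference the listing, and the listing points to the individual files.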

> Refer to
> [Structure of cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-the-cache-directory)
@@ -97,6 +91,9 @@ generated for each file in the same location. This may be helpful to save time
adding several data files grouped in a structural directory, but it's
undesirable for data directories with a large number of files.

To avoid adding files inside a directory accidentally, you can add the
corresponding [patterns](/doc/user-guide/dvcignore) to `.dvcignore`.

## Options

- `-R`, `--recursive` - determines the files to add by searching each target
@@ -180,7 +177,8 @@ pics
└── dogs [more image files]
```

Tracking a directory with DVC is as simple as with a single file:
[Tracking a directory](#tracking-directories) with DVC is as simple as with a
single file:

```dvc
$ dvc add pics
129 changes: 70 additions & 59 deletions content/docs/command-reference/checkout.md
@@ -1,7 +1,7 @@
# checkout

Update data files and directories in the <abbr>workspace</abbr> based on current
DVC-files.
Update DVC-tracked files and directories in the <abbr>workspace</abbr> based on
current `dvc.lock` and `.dvc` files.

## Synopsis

@@ -10,39 +10,39 @@ usage: dvc checkout [-h] [-q | -v] [--summary] [-d] [-R] [-f]
[--relink] [targets [targets ...]]
positional arguments:
targets Limit command scope to these stages or .dvc files.
Using -R, directories to search for stages or .dvc
files can also be given.
targets Limit command scope to these tracked files/directories,
.dvc files, or stage names.
```

## Description

`.dvc` and `dvc.lock` [files](/doc/user-guide/dvc-files-and-directories) act as
pointers to a specific version of data files or directories tracked by DVC. This
command synchronizes the workspace data with the versions specified in the
current `.dvc` and `dvc.lock` files.
This command is usually needed after `git checkout`, `git clone`, or any other
operation that changes the current `dvc.lock` or `.dvc` files. It restores the
corresponding versions of the DVC-tracked files and directories from the
<abbr>cache</abbr> to the workspace.

`dvc checkout` is useful, for example, when using Git in the
<abbr>project</abbr>, after `git clone`, `git checkout`, or any other operation
that changes the DVC files in the workspace.

💡 For convenience, a Git hook is available to automate running `dvc checkout`
after `git checkout`. See the
[Automating example](#example-automating-dvc-checkout) below or `dvc install`
for more details.
The `targets` given to this command (if any) limit what to check out. It accepts
paths to tracked files or directories (including paths inside tracked
directories), `.dvc` files, or stage names (found in `dvc.yaml`).

The execution of `dvc checkout` does the following:

- Scans the `.dvc` and `dvc.lock` files to compare against the data files or
directories in the <abbr>workspace</abbr>. DVC knows which data
(<abbr>outputs</abbr>) match because the corresponding hash values are saved
in the `outs` fields in those files. Scanning is limited to the given
`targets` (if any). See also options `--with-deps` and `--recursive` below.
- Checks `dvc.lock` and `.dvc` files to compare the hash values of their
<abbr>outputs</abbr> against the actual files or directories in the
<abbr>workspace</abbr> (similar to `dvc status`).

> Stage outputs should be defined in `dvc.yaml`. If found there but not in
> `dvc.lock`, they'll be skipped, with a warning.
- Missing data files or directories are restored from the <abbr>cache</abbr>.
Those that don't match with any DVC-file are removed. See options `--force`
- Missing data files or directories are restored from the cache. Those that
don't match with `dvc.lock` or `.dvc` files are removed. See options `--force`
and `--relink`. A list of the changes done is printed.
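
The comparison step described above can be pictured with a minimal sketch. It is not DVC's implementation: it handles single files only, and `plan_checkout`, its return labels, and the `recorded` mapping (relative path to expected MD5) are hypothetical names introduced for illustration.

```python
import hashlib
from pathlib import Path
from typing import Dict


def file_md5(path: Path) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def plan_checkout(recorded: Dict[str, str], workspace: Path) -> Dict[str, str]:
    """Classify each recorded output as 'missing', 'changed', or 'ok'.

    `recorded` maps a workspace-relative path to the hash expected for it
    (an assumed stand-in for the hashes saved in the `outs` fields).
    """
    plan = {}
    for relpath, expected in recorded.items():
        target = workspace / relpath
        if not target.is_file():
            plan[relpath] = "missing"  # candidate for restoring from the cache
        elif file_md5(target) != expected:
            plan[relpath] = "changed"  # workspace copy differs from the record
        else:
            plan[relpath] = "ok"  # nothing to do
    return plan
```

Only entries reported as `missing` or `changed` would need any work, which is why a checkout can be quick when most of the workspace already matches.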

💡 For convenience, a Git hook is available to automate running `dvc checkout`
after `git checkout`. See the
[Automating example](#example-automating-dvc-checkout) below or `dvc install`
for more details.

By default, this command tries not to make copies of cached files in the workspace,
using reflinks instead when supported by the file system (refer to
[File link types](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache)).
@@ -64,9 +64,9 @@ such a case, `dvc checkout` prints a warning message. It also lists the partial
progress made by the checkout.

There are two methods to restore a file missing from the cache, depending on the
situation. In some cases a pipeline must be reproduced (using `dvc repro`) to
regenerate its outputs (see also `dvc dag`). In other cases the cache can be
pulled from remote storage using `dvc pull`.
situation. In some cases the cache can be pulled from
[remote storage](/doc/command-reference/remote) using `dvc pull`. In other cases
the pipeline must be reproduced (using `dvc repro`) to regenerate its outputs.

## Options

@@ -130,9 +130,8 @@ below.

The workspace looks like this:

````dvc
```dvc
.
├── README.md
├── data
│   └── data.xml.dvc
├── dvc.lock
@@ -141,15 +140,11 @@ The workspace looks like this:
├── prc.json
├── scores.json
└── src
├── evaluate.py
├── featurization.py
├── prepare.py
├── requirements.txt
└── train.py```
````
└── <code files here>
```

This repository includes the following tags, that represent different variants
of the resulting model:
Note that this repository includes the following tags, that represent different
variants of the resulting model:

```dvc
$ git tag
@@ -158,10 +153,9 @@ baseline-experiment <- First simple version of the model
bigrams-experiment <- Uses bigrams to improve the model
```

We can now just run `dvc checkout` that will update the most recent `model.pkl`,
`data.xml`, and other files that are tracked by DVC. The model file hash is
defined in the `dvc.lock` file, and in the `data.xml.dvc` file for the
`data.xml`:
We can now run `dvc checkout` to update the most recent `model.pkl`, `data.xml`,
and any other files tracked by DVC. The model file hash (`ab349c2...`) is saved
in `dvc.lock`, and it can be confirmed with:

```dvc
$ dvc checkout
@@ -170,13 +164,15 @@ $ md5 model.pkl
MD5 (model.pkl) = ab349c2b5fa2a0f66d6f33f94424aebe
```

## Example: Switch versions

What if we want to "rewind history", so to speak? The `git checkout` command
lets us restore any point in the repository history, including any tags. It
automatically adjusts the files, by replacing file content and adding or
deleting files as necessary.
lets us restore any commit in the repository history (including tags). It
automatically adjusts the repo files, by replacing, adding, or deleting them as
necessary.

```dvc
$ git checkout baseline-experiment # Stage where model is first created
$ git checkout baseline-experiment # Git commit where model was created
```

Let's check the hash value of `model.pkl` in `dvc.lock` now:
@@ -187,16 +183,10 @@ outs:
md5: 98af33933679a75c2a51b953d3ab50aa
```
But if you check `model.pkl`, the file hash is still the same:

```dvc
$ md5 model.pkl
MD5 (model.pkl) = ab349c2b5fa2a0f66d6f33f94424aebe
```

This is because `git checkout` changed `dvc.lock` and other DVC files. But it
did nothing with the `model.pkl` and `matrix.pkl` files. Git doesn't track those
files; DVC does, so we must do this:
But if you check the MD5 of `model.pkl`, the file hash is still the same
(`ab349c2...`). This is because `git checkout` changed `dvc.lock` and other DVC
files, but it did nothing with `model.pkl`, or any other DVC-tracked files/dirs.
Since Git doesn't track them, we must do this:

```dvc
$ dvc checkout
@@ -207,8 +197,29 @@ $ md5 model.pkl
MD5 (model.pkl) = 98af33933679a75c2a51b953d3ab50aa
```

What happened is that DVC went through the DVC-files and adjusted the current
set of <abbr>output</abbr> files to match the `outs` in them.
DVC went through the stages (in `dvc.yaml`) and adjusted the current set of
<abbr>outputs</abbr> to match the `outs` in the corresponding `dvc.lock`.

## Example: Specific files or directories

`dvc checkout` only affects the tracked data corresponding to any given
`targets`:

```dvc
$ git checkout master
$ dvc checkout # Start with latest version of everything.
$ git checkout baseline-experiment -- dvc.lock
$ dvc checkout model.pkl # Get previous model file only.
```

Note that you can check out data within tracked directories. For example, the
`featurize` stage has the entire `data/features` directory as output, but we can
just get this:

```dvc
$ dvc checkout data/features/test.pkl
```
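
To picture how a path inside a tracked directory relates to the output that contains it, here is a small sketch under stated assumptions. It is not how DVC resolves targets internally; `find_tracked_output` and the `outputs` list are hypothetical names used only for this example.

```python
from pathlib import PurePosixPath
from typing import List, Optional


def find_tracked_output(target: str, outputs: List[str]) -> Optional[str]:
    """Return the tracked output that equals or contains target, if any."""
    target_path = PurePosixPath(target)
    for out in outputs:
        out_path = PurePosixPath(out)
        # Match the output itself, or any path nested below it.
        if target_path == out_path or out_path in target_path.parents:
            return out
    return None


outputs = ["data/features", "model.pkl"]  # assumed tracked outputs
print(find_tracked_output("data/features/test.pkl", outputs))  # data/features
print(find_tracked_output("src/train.py", outputs))  # None
```

Comparing whole path components, rather than string prefixes, keeps a sibling such as `data/features2` from being mistaken for something inside `data/features`.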

## Example: Automating DVC checkout

@@ -235,5 +246,5 @@ MD5 (model.pkl) = ab349c2b5fa2a0f66d6f33f94424aebe
```

Previously this took two commands, `git checkout` followed by `dvc checkout`. We
can now skip the second one, which is automatically run for us. The workspace is
automatically synchronized accordingly.
can now skip the second one, which is automatically run for us. The workspace
files are automatically updated accordingly.