cmd: document target granularity for push/pull/etc, et al. #1384

Merged 71 commits on Aug 6, 2020
08e2a25
docs: document target granularity for push/pull/etc
efiop May 31, 2020
619bdd9
Merge branch 'master' into fix-886
jorgeorpinel Jul 9, 2020
eba2cf5
cmd: move parenthesis
jorgeorpinel Jul 9, 2020
e24e2bb
cmd: prep status for granularity info
jorgeorpinel Jul 9, 2020
76e94b1
cmd: remove _changed checksum_ status
jorgeorpinel Jul 9, 2020
3398775
Merge branch 'master' into fix-886
jorgeorpinel Jul 10, 2020
1bf9af5
cmd: reinstate _changed checksum_ status (but only for .dvc files)
jorgeorpinel Jul 10, 2020
94a86c5
cmd: explain granularity for status, and other 1.0 updates
jorgeorpinel Jul 10, 2020
489a6c1
cmd: add info about granularity to push and pull
jorgeorpinel Jul 10, 2020
169db68
cmd: update target arg desc in several commands
jorgeorpinel Jul 11, 2020
f866f1e
cmd: simplify status target behavior desc.
jorgeorpinel Jul 11, 2020
06fe612
cmd: remove granularity note from status desc, but leave example note
jorgeorpinel Jul 11, 2020
a0292ef
cmd: add granularity example to status
jorgeorpinel Jul 11, 2020
02340f1
cmd: granularity examples for status and fetch
jorgeorpinel Jul 12, 2020
c2d7b5b
cmd: fix example in status
jorgeorpinel Jul 12, 2020
b01ccd9
cmd: fix formatting in a few cmds
jorgeorpinel Jul 12, 2020
44e3396
cmd: fixes to checkout desc
jorgeorpinel Jul 13, 2020
e81060a
cmd: improve fetch ref per coming changes to checkout...
jorgeorpinel Jul 13, 2020
bd3f657
cmd: granularity example for checkout
jorgeorpinel Jul 13, 2020
ae6f85e
cmd: note which commands support granularity in file/dir targets
jorgeorpinel Jul 13, 2020
bcd2c88
cmd: link granularity notes to add directory example and
jorgeorpinel Jul 13, 2020
bcbfe68
cmd: add notes about granular path support for (list) get and import
jorgeorpinel Jul 13, 2020
12bb0c4
cmd: roll back note about orphan stages
jorgeorpinel Jul 13, 2020
cb518f9
cmd: roll back note about usefulness of checkout
jorgeorpinel Jul 13, 2020
5e29b0d
cmd: improve first step bullet in checkout
jorgeorpinel Jul 13, 2020
46404b3
cmd: remove note on orphan stages from add
jorgeorpinel Jul 13, 2020
e30ad99
cmd: fix characters in fetch
jorgeorpinel Jul 14, 2020
9327d61
cmd: small fix to push/pull
jorgeorpinel Jul 14, 2020
d67d0ca
cmd: make granularity note part of path desc in list/get/import
jorgeorpinel Jul 14, 2020
29f7465
cmd: update target granularity notes in add and checkout
jorgeorpinel Jul 14, 2020
d9b7f53
cmd: update example title and note on granularity
jorgeorpinel Jul 14, 2020
b241bf5
cmd: shorten remaining notes on granularity
jorgeorpinel Jul 14, 2020
7502b84
cmd: update specific target example titles
jorgeorpinel Jul 14, 2020
f3f6425
cmd: improve targets explanation in fetch
jorgeorpinel Jul 14, 2020
81db6e1
cmd: simplify granularity example note in checkout
jorgeorpinel Jul 14, 2020
9879c1d
cmd: further simplify notes about granularity
jorgeorpinel Jul 15, 2020
46caf71
cmd: fix typo in fetch
jorgeorpinel Jul 15, 2020
7f354d9
Merge branch 'master' into fix-886
jorgeorpinel Jul 15, 2020
f434350
Merge branch 'master' into fix-886
jorgeorpinel Jul 20, 2020
99a9789
cmd: update granularity note in add: Tracking directories + ex
jorgeorpinel Jul 20, 2020
a36ba41
cmd: update checkout ref to improve tagets and granularity explanations
jorgeorpinel Jul 20, 2020
80074ed
cmd: correct dvc.yaml -> dvc.lock in status, checkout, and fetch
jorgeorpinel Jul 20, 2020
73512e1
cmd: introduce p about fetch targets arg
jorgeorpinel Jul 20, 2020
cac98e3
Update content/docs/command-reference/status.md
jorgeorpinel Jul 20, 2020
b89fed8
cmd: intro status targets arg p and granularity note
jorgeorpinel Jul 20, 2020
b745aa6
cmd: update status and fetch example granularity note
jorgeorpinel Jul 20, 2020
f3c8dd4
Merge branch 'fix-886' of github.com:iterative/dvc.org into fix-886
jorgeorpinel Jul 20, 2020
8e7f63a
cmd: improve status dependency example and text
jorgeorpinel Jul 20, 2020
20927fe
cmd: apply unified notes about target granularity to push and pull
jorgeorpinel Jul 20, 2020
5cc92af
cmd: double check granularity notes in list/import/get are unified
jorgeorpinel Jul 20, 2020
3ecc352
Merge branch 'master' into fix-886
jorgeorpinel Jul 22, 2020
ff3d7aa
cmd: update dvc.lock -> dvc.yaml in some cases
jorgeorpinel Jul 22, 2020
e87e397
cmd: rewrite checkout desc per private feedback
jorgeorpinel Jul 23, 2020
478c8a2
cmd: rewrite checkout desc
jorgeorpinel Jul 24, 2020
1e420b6
cmd: don't use term "synchronize" in checkout
jorgeorpinel Jul 24, 2020
9f1876f
cmd: a couple more updates to checkout
jorgeorpinel Jul 24, 2020
508cf97
cmd: mention checkout deals with several filel/dirs in example
jorgeorpinel Jul 24, 2020
9895f02
Merge branch 'master' into fix-886
jorgeorpinel Aug 2, 2020
a0c869b
cmd: fix fetch description and related passages in other refs
jorgeorpinel Aug 3, 2020
87faff3
cmd: update note about granularity in list, get, import
jorgeorpinel Aug 3, 2020
f280bdf
cmd: update status example title
jorgeorpinel Aug 3, 2020
5370f1f
cmd: more feedback on fetch rewrite
jorgeorpinel Aug 4, 2020
41f4db7
cmd: simplify granularity note in all its docs
jorgeorpinel Aug 4, 2020
4cac81f
term: don't use "as a whole" phrase for tracked dirs
jorgeorpinel Aug 4, 2020
cabc8bf
cmd: make paragraph plural for consistency
jorgeorpinel Aug 4, 2020
6ac8b0f
cmd: update dvc.yaml vs lock file mention in status
jorgeorpinel Aug 4, 2020
8fd47ac
Merge branch 'master' into fix-886
jorgeorpinel Aug 5, 2020
b3eb463
cmd: edit notes about .dvcignore. General one in desc, specific one i…
jorgeorpinel Aug 5, 2020
a919f85
cmd: mention dvc.yaml in checkout desc
jorgeorpinel Aug 6, 2020
d6f9610
cmd: small corrections to fetch
jorgeorpinel Aug 6, 2020
acf0196
cmd: update explanation of stages and output hash values in status
jorgeorpinel Aug 6, 2020
28 changes: 13 additions & 15 deletions content/docs/command-reference/add.md
@@ -56,13 +56,8 @@ each one:
Summarizing, the result is that the target data is replaced by small `.dvc`
files that can be easily tracked with Git.

> Note that `.dvc` files can be considered _orphan stages_, because they have no
> <abbr>dependencies</abbr>, only outputs. These are treated as _always changed_
> by `dvc status` and `dvc repro`, which always executes them. See `dvc.yaml` to
> learn more about stages.

To avoid adding files inside a directory accidentally, you can add the
corresponding [patterns](/doc/user-guide/dvcignore) in a `.dvcignore` file.
It's possible to prevent files or directories from being added by DVC by adding
the corresponding patterns in a [`.dvcignore`](/doc/user-guide/dvcignore) file.

By default, DVC tries to use reflinks (see
[File link types](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache)
@@ -73,15 +68,14 @@ large files. DVC also supports other link types for use on file systems without

### Adding entire directories

A `dvc add` target can be an individual file or a directory. In the latter case,
a `.dvc` file is created for the top of the directory (with default name
A `dvc add` target can be either a file or a directory. In the latter case, a
`.dvc` file is created for the top of the hierarchy (with default name
`<dir_name>.dvc`).

Every file in the hierarchy is added to the cache (unless the `--no-commit`
option is used), but DVC does not produce individual `.dvc` files for each file
in the directory tree. Instead, the single `.dvc` file references a special JSON
file in the cache (with `.dir` extension), that in turn points to the added
files.
Every file inside is added to the cache (unless the `--no-commit` option is
used), but DVC does not produce individual `.dvc` files for each file in the
entire tree. Instead, the single `.dvc` file references a special JSON file in
the cache (with `.dir` extension), that in turn points to the added files.
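
As a rough illustration of the indirection described above, the sketch below parses a `.dir`-style listing. It assumes the listing is a JSON array of objects with `md5` and `relpath` keys. The sample entries and hash values are made up, and `list_dir_entries` is a hypothetical helper, not part of DVC.

```python
import json

# Hypothetical contents of a `.dir` file in the cache (assumed layout:
# a JSON array with one object per file in the tracked directory).
SAMPLE_DIR_LISTING = """
[
  {"md5": "11111111111111111111111111111111", "relpath": "cats/cat1.jpg"},
  {"md5": "22222222222222222222222222222222", "relpath": "dogs/dog1.jpg"}
]
"""


def list_dir_entries(listing_text):
    """Map each relative path in the listing to the hash of its content."""
    entries = json.loads(listing_text)
    return {entry["relpath"]: entry["md5"] for entry in entries}


if __name__ == "__main__":
    # Print one line per file referenced by the single directory listing.
    for relpath, md5 in sorted(list_dir_entries(SAMPLE_DIR_LISTING).items()):
        print(relpath, md5)
```

The point of the sketch is that one small listing is enough to locate every file in the directory, so no per-file `.dvc` files are needed.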

> Refer to
> [Structure of cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-the-cache-directory)
@@ -97,6 +91,9 @@ generated for each file in the same location. This may be helpful to save time
adding several data files grouped in a structural directory, but it's
undesirable for data directories with a large number of files.

To avoid adding files inside a directory accidentally, you can add the
corresponding [patterns](/doc/user-guide/dvcignore) to `.dvcignore`.

## Options

- `-R`, `--recursive` - determines the files to add by searching each target
@@ -180,7 +177,8 @@ pics
└── dogs [more image files]
```

Tracking a directory with DVC as simple as with a single file:
[Tracking a directory](#tracking-directories) with DVC is as simple as with a
single file:

```dvc
$ dvc add pics
129 changes: 70 additions & 59 deletions content/docs/command-reference/checkout.md
@@ -1,7 +1,7 @@
# checkout

Update data files and directories in the <abbr>workspace</abbr> based on current
DVC-files.
Update DVC-tracked files and directories in the <abbr>workspace</abbr> based on
current `dvc.lock` and `.dvc` files.

## Synopsis

@@ -10,39 +10,39 @@ usage: dvc checkout [-h] [-q | -v] [--summary] [-d] [-R] [-f]
[--relink] [targets [targets ...]]

positional arguments:
targets Limit command scope to these stages or .dvc files.
Using -R, directories to search for stages or .dvc
files can also be given.
targets Limit command scope to these tracked files/directories,
.dvc files, or stage names.
```

## Description

`.dvc` and `dvc.lock` [files](/doc/user-guide/dvc-files-and-directories) act as
pointers to specific version of data files or directories tracked by DVC. This
command synchronizes the workspace data with the versions specified in the
current `.dvc` and `dvc.lock` files.
This command is usually needed after `git checkout`, `git clone`, or any other
operation that changes the current `dvc.lock` or `.dvc` files. It restores the
corresponding versions of the DVC-tracked files and directories from the
<abbr>cache</abbr> to the workspace.

`dvc checkout` is useful, for example, when using Git in the
<abbr>project</abbr>, after `git clone`, `git checkout`, or any other operation
that changes the DVC files in the workspace.

💡 For convenience, a Git hook is available to automate running `dvc checkout`
after `git checkout`. See the
[Automating example](#example-automating-dvc-checkout) below or `dvc install`
for more details.
The `targets` given to this command (if any) limit what to checkout. It accepts
paths to tracked files or directories (including paths inside tracked
directories), `.dvc` files, or stage names (found in `dvc.yaml`).

The execution of `dvc checkout` does the following:

- Scans the `.dvc` and `dvc.lock` files to compare against the data files or
directories in the <abbr>workspace</abbr>. DVC knows which data
(<abbr>outputs</abbr>) match because the corresponding hash values are saved
in the `outs` fields in those files. Scanning is limited to the given
`targets` (if any). See also options `--with-deps` and `--recursive` below.
- Checks `dvc.lock` and `.dvc` files to compare the hash values of their
<abbr>outputs</abbr> against the actual files or directories in the
<abbr>workspace</abbr> (similar to `dvc status`).

> Stage outputs should be defined in `dvc.yaml`. If found there but not in
> `dvc.lock`, they'll be skipped, with a warning.

- Missing data files or directories are restored from the <abbr>cache</abbr>.
Those that don't match with any DVC-file are removed. See options `--force`
- Missing data files or directories are restored from the cache. Those that
don't match with `dvc.lock` or `.dvc` files are removed. See options `--force`
and `--relink`. A list of the changes done is printed.
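
The comparison step above can be sketched as follows. This is a simplified illustration that hashes raw file bytes with MD5, which may not match how DVC computes hashes in every case. `file_md5` and `output_state` are hypothetical names, not DVC functions.

```python
import hashlib
import os
import tempfile


def file_md5(path, chunk_size=1024 * 1024):
    """Hash the raw bytes of a file in chunks to keep memory use flat."""
    digest = hashlib.md5()
    with open(path, "rb") as stream:
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def output_state(path, recorded_md5):
    """Classify a workspace file against the hash recorded for it."""
    if not os.path.exists(path):
        return "missing"
    return "unchanged" if file_md5(path) == recorded_md5 else "changed"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        model = os.path.join(workdir, "model.pkl")
        with open(model, "wb") as stream:
            stream.write(b"weights")
        recorded = hashlib.md5(b"weights").hexdigest()
        print(output_state(model, recorded))  # hash matches the record
        print(output_state(model, "0" * 32))  # hash differs from the record
        print(output_state(os.path.join(workdir, "gone.pkl"), recorded))
```

In this picture, a "missing" or "changed" output is a candidate for being restored from the cache, while an "unchanged" one can be left alone.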

💡 For convenience, a Git hook is available to automate running `dvc checkout`
after `git checkout`. See the
[Automating example](#example-automating-dvc-checkout) below or `dvc install`
for more details.

By default, this command tries not to make copies of cached files in the workspace,
using reflinks instead when supported by the file system (refer to
[File link types](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache)).
@@ -64,9 +64,9 @@ such a case, `dvc checkout` prints a warning message. It also lists the partial
progress made by the checkout.

There are two methods to restore a file missing from the cache, depending on the
situation. In some cases a pipeline must be reproduced (using `dvc repro`) to
regenerate its outputs (see also `dvc dag`). In other cases the cache can be
pulled from remote storage using `dvc pull`.
situation. In some cases the cache can be pulled from
[remote storage](/doc/command-reference/remote) using `dvc pull`. In other cases
the pipeline must be reproduced (using `dvc repro`) to regenerate its outputs.
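
Whether `dvc pull` or `dvc repro` is needed comes down to whether the hash recorded in `dvc.lock` or a `.dvc` file has a matching entry in the cache. Below is a minimal sketch, assuming a cache layout where the first two characters of the hash name a subdirectory and the remainder names the file. `cache_path` and `is_cached` are hypothetical helpers, not DVC APIs.

```python
import os


def cache_path(cache_dir, md5):
    """Build the assumed cache location for a given content hash."""
    return os.path.join(cache_dir, md5[:2], md5[2:])


def is_cached(cache_dir, md5):
    """Report whether the cache already holds content for this hash."""
    return os.path.exists(cache_path(cache_dir, md5))


if __name__ == "__main__":
    # Hash value taken from the model.pkl example in this document.
    md5 = "ab349c2b5fa2a0f66d6f33f94424aebe"
    print(cache_path(os.path.join(".dvc", "cache"), md5))
    print(is_cached(os.path.join(".dvc", "cache"), md5))
```

If `is_cached` returned `False` for an output, `dvc checkout` alone could not restore it, which is the situation the paragraph above describes.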

## Options

@@ -130,9 +130,8 @@ below.

The workspace looks like this:

````dvc
```dvc
.
├── README.md
├── data
│   └── data.xml.dvc
├── dvc.lock
@@ -141,15 +140,11 @@ The workspace looks like this:
├── prc.json
├── scores.json
└── src
├── evaluate.py
├── featurization.py
├── prepare.py
├── requirements.txt
└── train.py```
````
└── <code files here>
```

This repository includes the following tags, that represent different variants
of the resulting model:
Note that this repository includes the following tags, that represent different
variants of the resulting model:

```dvc
$ git tag
@@ -158,10 +153,9 @@ baseline-experiment <- First simple version of the model
bigrams-experiment <- Uses bigrams to improve the model
```

We can now just run `dvc checkout` that will update the most recent `model.pkl`,
`data.xml`, and other files that are tracked by DVC. The model file hash is
defined in the `dvc.lock` file, and in the `data.xml.dvc` file for the
`data.xml`:
We can now run `dvc checkout` to update the most recent `model.pkl`, `data.xml`,
and any other files tracked by DVC. The model file hash (`ab349c2...`) is saved
in `dvc.lock`, and it can be confirmed with:

```dvc
$ dvc checkout
@@ -170,13 +164,15 @@ $ md5 model.pkl
MD5 (model.pkl) = ab349c2b5fa2a0f66d6f33f94424aebe
```

## Example: Switch versions

What if we want to "rewind history", so to speak? The `git checkout` command
lets us restore any point in the repository history, including any tags. It
automatically adjusts the files, by replacing file content and adding or
deleting files as necessary.
lets us restore any commit in the repository history (including tags). It
automatically adjusts the repo files, by replacing, adding, or deleting them as
necessary.

```dvc
$ git checkout baseline-experiment # Stage where model is first created
$ git checkout baseline-experiment # Git commit where model was created
```

Let's check the hash value of `model.pkl` in `dvc.lock` now:
@@ -187,16 +183,10 @@ outs:
md5: 98af33933679a75c2a51b953d3ab50aa
```

But if you check `model.pkl`, the file hash is still the same:

```dvc
$ md5 model.pkl
MD5 (model.pkl) = ab349c2b5fa2a0f66d6f33f94424aebe
```

This is because `git checkout` changed `dvc.lock` and other DVC files. But it
did nothing with the `model.pkl` and `matrix.pkl` files. Git doesn't track those
files; DVC does, so we must do this:
But if you check the MD5 of `model.pkl`, the file hash is still the same
(`ab349c2...`). This is because `git checkout` changed `dvc.lock` and other DVC
files, but it did nothing with `model.pkl`, or any other DVC-tracked files/dirs.
Since Git doesn't track them, we must do this:

```dvc
$ dvc checkout
@@ -207,8 +197,29 @@ $ md5 model.pkl
MD5 (model.pkl) = 98af33933679a75c2a51b953d3ab50aa
```

What happened is that DVC went through the DVC-files and adjusted the current
set of <abbr>output</abbr> files to match the `outs` in them.
DVC went through the stages (in `dvc.yaml`) and adjusted the current set of
<abbr>outputs</abbr> to match the `outs` in the corresponding `dvc.lock`.

## Example: Specific files or directories

`dvc checkout` only affects the tracked data corresponding to any given
`targets`:

```dvc
$ git checkout master
$ dvc checkout # Start with latest version of everything.

$ git checkout baseline-experiment -- dvc.lock
$ dvc checkout model.pkl # Get previous model file only.
```

Note that you can checkout data within tracked directories. For example, the
`featurize` stage has the entire `data/features` directory as output, but we can
just get this:

```dvc
$ dvc checkout data/features/test.pkl
```
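
One way to picture this granularity is as a path lookup: a target inside a tracked directory is split into the tracked directory and a path relative to it. The sketch below only illustrates that idea. `resolve_target` is a hypothetical function, and DVC's actual resolution logic may differ.

```python
import posixpath


def resolve_target(target, tracked_dirs):
    """Return (tracked_dir, relpath) when target lies inside a tracked dir.

    An empty relpath means the target is the tracked directory itself.
    Returns None when no tracked directory contains the target.
    """
    target = posixpath.normpath(target)
    for tracked in tracked_dirs:
        tracked = posixpath.normpath(tracked)
        if target == tracked:
            return tracked, ""
        if target.startswith(tracked + "/"):
            return tracked, target[len(tracked) + 1:]
    return None


if __name__ == "__main__":
    # `data/features` is the directory output of the `featurize` stage.
    print(resolve_target("data/features/test.pkl", ["data/features"]))
    print(resolve_target("model.pkl", ["data/features"]))
```

The relative path is what would be looked up in the directory's listing to find the hash of the single file to restore.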

## Example: Automating DVC checkout

@@ -235,5 +246,5 @@ MD5 (model.pkl) = ab349c2b5fa2a0f66d6f33f94424aebe
```

Previously this took two commands, `git checkout` followed by `dvc checkout`. We
can now skip the second one, which is automatically run for us. The workspace is
automatically synchronized accordingly.
can now skip the second one, which is automatically run for us. The workspace
files are automatically updated accordingly.