cmd ref: add checkout --relink option, improve checkout desc. #864

Merged
merged 28 commits into from
Jan 7, 2020
Changes from 20 commits
47e602f
cmd ref: add checkout --relink option
Suor Dec 14, 2019
0d7e4f8
cmd ref: reword --relink option of checkout
jorgeorpinel Dec 27, 2019
73bbb3a
user-guide: link to `dvc checkout --relink` option
jorgeorpinel Dec 28, 2019
8e1fb17
cmd ref: add link to `dvc checkout --relink` in cache.type option of …
jorgeorpinel Dec 28, 2019
d322dac
cmd ref: rewrite description and --relink option in checkout
jorgeorpinel Dec 28, 2019
36e1d80
Merge branch 'master' into relink
jorgeorpinel Dec 28, 2019
8530e9b
checkout: rephrase --relink option explanation
jorgeorpinel Jan 2, 2020
1c783b0
checkout: small impro
jorgeorpinel Jan 2, 2020
101c400
cmd ref: improve link from config to checkout --relink
jorgeorpinel Jan 3, 2020
9ba0a4b
cmd ref: update checkout description
jorgeorpinel Jan 3, 2020
56f0780
cmd ref: introduce "outputs" term (and tooltip) earlier in the descri…
jorgeorpinel Jan 3, 2020
8375746
cmd ref: rewrite --relink option desc. in checkout (again)
jorgeorpinel Jan 3, 2020
f2a5d61
cmd ref: switch order of sentences in checkout --relink option desc.
jorgeorpinel Jan 3, 2020
7deb5e2
cmd ref: remove --relink sentence from description, mention all options
jorgeorpinel Jan 3, 2020
b86d5af
cmd ref: simplify checkout description intro
jorgeorpinel Jan 3, 2020
f9395b4
cmd ref: further simplify checkout desc. intro and remove "output" te…
jorgeorpinel Jan 4, 2020
85eec5f
cmd ref: move `--with-deps` to the first bullet in checkout desc.
jorgeorpinel Jan 4, 2020
10d1eb6
cmd ref: simplify bullets in checkout desc.
jorgeorpinel Jan 4, 2020
7392e5c
cmd ref: remove "checksum" term from checkout --relink option desc.
jorgeorpinel Jan 4, 2020
1813f94
term: replace "recreate" by "restore" for checkout --relink option, e…
jorgeorpinel Jan 4, 2020
01934dc
typo
jorgeorpinel Jan 6, 2020
1f30581
cmd ref: reword last sentence in checkout --relink option desc.
jorgeorpinel Jan 6, 2020
82bff27
user-guide: simplify note about checkout --relink in large-dataset-op…
jorgeorpinel Jan 6, 2020
95a8804
cmd ref: try to use "data files and directories" always in checkout
jorgeorpinel Jan 6, 2020
74f8a8d
Merge branch 'master' into relink
jorgeorpinel Jan 6, 2020
48dc37c
cmd ref: various wording updates to checkout
jorgeorpinel Jan 6, 2020
4f2470e
cmd ref: another small wording update for checkout
jorgeorpinel Jan 6, 2020
99c0a33
cmd ref: update intro and implementation details
jorgeorpinel Jan 6, 2020
80 changes: 38 additions & 42 deletions static/docs/command-reference/checkout.md
@@ -6,7 +6,7 @@ DVC-files.
## Synopsis

```usage
usage: dvc checkout [-h] [-q | -v] [-d] [-f] [-R]
usage: dvc checkout [-h] [-q | -v] [-d] [-R] [-f] [--relink]
[targets [targets ...]]

positional arguments:
@@ -16,60 +16,49 @@ positional arguments:

## Description

[DVC-files](/doc/user-guide/dvc-file-format) in a <abbr>project</abbr> specify
which instance of each data file or directory should be used, with the checksums
saved in the `outs` field. The `dvc checkout` command updates the workspace data
to match with the <abbr>cached</abbr> files that correspond to those checksums.

Using an SCM like Git, the DVC-files are kept under version control. At a given
branch or tag of the SCM repository, the DVC-files will contain checksums for
the corresponding data files kept in the cache. After an SCM command like
`git checkout` is run, the DVC-files will change to the state at the specified
branch or commit or tag. Afterwards, the `dvc checkout` command is required in
order to synchronize the data files with the currently checked out DVC-files.

This command must be executed after `git checkout` since Git doesn't track files
that are under DVC control. For convenience a Git hook is available, simply by
running `dvc install`, that will automate running `dvc checkout` after
`git checkout`. See `dvc install` for more information.

The execution of `dvc checkout` does:

- Scan the `outs` entries in DVC-files to compare with the currently checked out
data files. The scanned DVC-files is limited by the listed `targets` (if any)
on the command line. And if the `--with-deps` option is specified, it scans
backward from the given `targets` in the corresponding
[pipeline](/doc/command-reference/pipeline).

- For any data files where the checksum doesn't match their DVC-file entry, the
data file is restored from the cache. The link strategy used (`reflink`,
`hardlink`, `symlink`, or `copy`) depends on the OS and the configured value
for `cache.type` – See `dvc config cache`.

Note that this command by default tries NOT to copy files between the cache and
the workspace, using reflinks instead when supported by the file system. (Refer
to
[DVC-files](/doc/user-guide/dvc-file-format) are essentially placeholders that
point to the actual data files or directories under DVC control. This command
synchronizes the workspace data with the versions specified in the current
DVC-files. DVC knows which data files (<abbr>outputs</abbr>) to use because
their checksums are saved in the `outs` fields inside the DVC-files.

`dvc checkout` is useful when using Git in the <abbr>project</abbr>, after
`git clone`, `git checkout`, or any other repository operations that change the
currently present DVC-files.

💡 For convenience, a Git hook is available to automate running `dvc checkout`
after `git checkout`. Use `dvc install` to install it.

The execution of `dvc checkout` does the following:

- Scans the DVC-files to compare vs. the data files or directories currently in
the <abbr>workspace</abbr>. Scanning is limited to the given `targets` (if
any). See also options `--with-deps` and `--recursive` below.

- Missing data files or directories, or those that don't match with any
DVC-file, are restored from the <abbr>cache</abbr>. See options `--force` and
`--relink`.
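
The two steps above can be sketched in Python. This is only an illustration under stated assumptions, not DVC's actual implementation: `file_md5`, `cache_path`, and `checkout` are hypothetical helper names, `outs` is a plain list of dicts standing in for the `outs` fields of parsed DVC-files, MD5 checksums and the cache layout (first two checksum characters as a directory name) are assumed, only single files are handled, and data is restored by plain copying rather than linking.

```python
import hashlib
import os
import shutil
import tempfile


def file_md5(path):
    """Return the hex MD5 digest of a file's contents."""
    digest = hashlib.md5()
    with open(path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_path(cache_dir, md5):
    """Assumed cache layout: <cache_dir>/<first 2 hex chars>/<remaining chars>."""
    return os.path.join(cache_dir, md5[:2], md5[2:])


def checkout(outs, workspace, cache_dir):
    """Restore every output that is missing or differs from its checksum.

    Returns two lists of paths: (restored, failed).
    """
    restored, failed = [], []
    for out in outs:
        target = os.path.join(workspace, out["path"])
        if os.path.isfile(target) and file_md5(target) == out["md5"]:
            continue  # already matches the DVC-file entry, nothing to do
        source = cache_path(cache_dir, out["md5"])
        if not os.path.isfile(source):
            failed.append(out["path"])  # missing from cache: report, keep going
            continue
        os.makedirs(os.path.dirname(target), exist_ok=True)
        if os.path.lexists(target):
            os.remove(target)
        shutil.copyfile(source, target)  # plain copy; a real tool may link
        restored.append(out["path"])
    return restored, failed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        workspace = os.path.join(root, "ws")
        cache_dir = os.path.join(root, "cache")
        os.makedirs(workspace)
        content = b"some data\n"
        md5 = hashlib.md5(content).hexdigest()
        os.makedirs(os.path.dirname(cache_path(cache_dir, md5)))
        with open(cache_path(cache_dir, md5), "wb") as fobj:
            fobj.write(content)
        print(checkout([{"md5": md5, "path": "data.txt"}], workspace, cache_dir))
```

Files that cannot be restored are collected instead of aborting the loop, which mirrors the documented behavior of warning about files missing from the cache while still restoring everything else.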

By default, this command tries not to copy files between the cache and the
workspace, using reflinks instead, when supported by the file system. (Refer to
[File link types](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache).)
The next linking strategy default value is `copy` though, so unless other file
link types are manually configured in `cache.type` (using `dvc config`), files
will be copied. Keep in mind that having file copies doesn't present much of a
negative impact unless the project uses very large data (several GBs or more).
But leveraging file links is crucial for large files where checking out a 50Gb
by copying file might take a few minutes for example, whereas with links,
But leveraging file links is crucial with large files: for example, checking
out a 50 GB file by copying might take a few minutes whereas, with links,
restoring a file of any size is almost instantaneous.
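
The fallback from reflink to copy described here can be pictured with a small Python sketch. It is an assumption-laden illustration rather than DVC code: `link_with_fallback` and `LINKERS` are hypothetical names, and because the Python standard library has no reflink call, the `_reflink` stand-in simply fails so that the fallback path is exercised.

```python
import os
import shutil
import tempfile


def _reflink(src, dst):
    # Stand-in only: the standard library offers no reflink call, so this
    # always fails. A real tool would call a platform-specific API here.
    raise OSError("reflink not supported in this sketch")


LINKERS = {
    "reflink": _reflink,
    "hardlink": os.link,
    "symlink": os.symlink,
    "copy": shutil.copyfile,
}


def link_with_fallback(src, dst, link_types=("reflink", "copy")):
    """Try each link type in order and return the name of the one that worked."""
    for name in link_types:
        try:
            LINKERS[name](src, dst)
            return name
        except OSError:
            continue  # not supported here, try the next strategy
    raise OSError(f"none of {link_types} worked for {dst}")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        cached = os.path.join(root, "cached.bin")
        with open(cached, "wb") as fobj:
            fobj.write(b"payload")
        used = link_with_fallback(cached, os.path.join(root, "tracked.bin"))
        print("strategy used:", used)
```

The default tuple `("reflink", "copy")` mirrors the default order described above; passing a different tuple plays the role of setting `cache.type`.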

> When linking files takes longer than expected (10 seconds for any one file)
> and `cache.type` is not set, a warning will be displayed reminding users about
> the faster link types available. These warnings can be turned off setting the
> `cache.slow_link_warning` config option to `false` with `dvc config cache`.

The output of `dvc checkout` does not list which data files were restored. It
does report removed files and files that DVC was unable to restore because
they're missing from the <abbr>cache</abbr>.

This command will fail to checkout files that are missing from the cache. In
such a case, `dvc checkout` prints a warning message. Any files that can be
checked out without error will be restored.
such a case, `dvc checkout` prints a warning message. It also lists removed
files. Any files that can be checked out without error will be restored without
being reported individually.

There are two methods to restore a file missing from the cache, depending on the
situation. In some cases a pipeline must be reproduced (using `dvc repro`) to
@@ -94,6 +83,13 @@ be pulled from remote storage using `dvc pull`.
remove files that don't match those DVC-file references or are missing from
cache. (They are not "committed", in DVC terms.)

- `--relink` - ensures the file linking strategy (`reflink`, `hardlink`,
`symlink`, or `copy`) for all data files in the workspace is consistent with
the project's [`cache.type`](/doc/command-reference/config#cache). This is
achieved by restoring **all data files or directories** referenced in
Member:
probably restoring happens only if link type does not match, right?

Contributor:
Good question. @Suor can you confirm please?

Contributor Author:
if --relink is specified then it always relinks, i.e. removes and links. There is no way in general case to check whether link type is correct. For example, we can't distinguish reflink from a copy. And even checking for hardlink is not completely reliable.

Contributor (@jorgeorpinel, Jan 7, 2020):
Good to know. Leaving unresolved so it's easy to find for future reference.

Merging this PR finally!

current DVC-files (regardless of whether they match a current DVC-file). This
means overwriting the file links or copies from cache to workspace.

- `-h`, `--help` - shows the help message and exits.

- `-q`, `--quiet` - do not write anything to standard output. Exit with 0 if no
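
The review comment on `--relink` (link types cannot be checked reliably, so the option always relinks) can be illustrated with a short Python sketch. `guess_link_type` is a hypothetical helper, not part of DVC, and it assumes a POSIX-like file system where hard links share an inode.

```python
import os
import shutil
import tempfile


def guess_link_type(cached, tracked):
    """Best-effort guess of how `tracked` relates to `cached`.

    Returns "symlink", "hardlink", or "copy-or-reflink".
    """
    if os.path.islink(tracked):
        return "symlink"
    if os.path.samefile(cached, tracked):
        return "hardlink"  # same inode on the same device
    # A reflink has its own inode, just like a copy, so file metadata
    # alone cannot tell the two apart.
    return "copy-or-reflink"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        cached = os.path.join(root, "cached.txt")
        with open(cached, "w") as fobj:
            fobj.write("data")
        os.link(cached, os.path.join(root, "hard.txt"))
        os.symlink(cached, os.path.join(root, "sym.txt"))
        shutil.copyfile(cached, os.path.join(root, "copy.txt"))
        for name in ("hard.txt", "sym.txt", "copy.txt"):
            print(name, "->", guess_link_type(cached, os.path.join(root, name)))
```

Since the last case stays ambiguous, removing and re-creating every link is the simpler way to guarantee the configured link type.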
4 changes: 4 additions & 0 deletions static/docs/command-reference/config.md
@@ -139,6 +139,10 @@ for more details.) This section contains the following options:
[File link types](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache)
for a full explanation of each one.

To apply changes to this option in the workspace, by restoring all file
links/copies from cache, please use `dvc checkout --relink`. See
[checkout options](/doc/command-reference/checkout#options) for more details.

- `cache.slow_link_warning` - used to turn off the warnings about having a slow
cache link type. These warnings are thrown by `dvc pull` and `dvc checkout`
when linking files takes longer than usual, to remind them that there are
8 changes: 4 additions & 4 deletions static/docs/understanding-dvc/related-technologies.md
@@ -100,11 +100,11 @@ http://studio.ml/
- Git-annex is a datafile-centric system whereas DVC is focused on providing a
workflow for machine learning and reproducible experiments. When a DVC or
Git-annex repository is cloned via `git clone`, data files won't be copied to
the local machine as file contents are stored in separate
the local machine, as file contents are stored in separate
[remotes](/doc/command-reference/remote). With DVC,
[DVC-files](/doc/user-guide/dvc-file-format) (that provide the reproducible
workflow) are always included in the Git repository and hence can be recreated
locally with minimal effort.
[DVC-files](/doc/user-guide/dvc-file-format), which provide the reproducible
workflow, are always included in the Git repository. Hence, they can be
executed locally with minimal effort.

- DVC is not fundamentally bound to Git, and users have the option of changing
the repository format.
23 changes: 15 additions & 8 deletions static/docs/user-guide/large-dataset-optimization.md
@@ -38,11 +38,11 @@ Symbolic links, and Reflinks in more recent systems. While reflinks bring all
the benefits and none of the worries, they're not commonly supported in most
platforms yet. Hard/soft links optimize **speed** and **space** in the file
system, but may break your workflow since updating hard/sym-linked files tracked
by DVC in the workspace causes <abbr>cache</abbr> corruption. These 2 link types
thus require using cache **protected mode** (see the `cache.protected` config
option in `dvc config cache`). Finally, a 4th "linking" option is to actually
copy files from/to the cache, which is safe but inefficient – especially for
large files (several GBs or more).
by DVC in the <abbr>workspace</abbr> causes <abbr>cache</abbr> corruption. These
2 link types thus require using cache **protected mode** (see the
`cache.protected` config option in `dvc config cache`). Finally, a 4th "linking"
option is to actually copy files from/to the cache, which is safe but
inefficient – especially for large files (several GBs or more).
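
The corruption risk with hard and symbolic links can be reproduced in a few lines of Python. This is a sketch under assumptions, not DVC behavior: `edit_in_workspace` is a hypothetical helper, the "cache" and "workspace" are two files in one temporary directory, and a POSIX-like file system is assumed.

```python
import os
import shutil
import tempfile


def edit_in_workspace(link_type):
    """Link a 'cached' file into a 'workspace', edit it in place there, and
    return the contents of the cached file afterwards."""
    with tempfile.TemporaryDirectory() as root:
        cached = os.path.join(root, "cached.txt")
        tracked = os.path.join(root, "tracked.txt")
        with open(cached, "w") as fobj:
            fobj.write("original")

        if link_type == "hardlink":
            os.link(cached, tracked)
        elif link_type == "symlink":
            os.symlink(cached, tracked)
        else:
            shutil.copyfile(cached, tracked)

        with open(tracked, "w") as fobj:  # in-place edit of the workspace file
            fobj.write("edited")

        with open(cached) as fobj:
            return fobj.read()


if __name__ == "__main__":
    for link_type in ("hardlink", "symlink", "copy"):
        print(link_type, "->", edit_in_workspace(link_type))
```

With `hardlink` and `symlink` the cached file ends up containing the edit, because both names resolve to the same data; with `copy` the cached file keeps its original contents.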

> Some versions of Windows (e.g. Windows Server 2012+ and Windows 10 Enterprise)
> support hard or soft links on the
@@ -92,9 +92,9 @@ efficiency:

4. **`copy`**: An inefficient "linking" strategy, yet supported on all file
systems. Using `copy` means there will be no file links, but that the tracked
files will be duplicated as copies existing in both the cache and workspace.
Suitable for scenarios with relatively small data files, where copying them
is not a storage performance concern.
files will be duplicated as copies existing in both the cache and
<abbr>workspace</abbr>. Suitable for scenarios with relatively small data
files, where copying them is not a storage performance concern.

> DVC avoids `symlink` and `hardlink` types by default to protect user from
> accidental cache corruption. Refer to the
@@ -120,6 +120,13 @@ file link types. Please refer to the
[Update a Tracked File](/doc/user-guide/updating-tracked-files) on how to manage
tracked files under these cache configurations.

### Re-linking data in the workspace

To re-create the file links in the workspace, for example after changing the
`cache.type` option for a <abbr>project</abbr>, please use
`dvc checkout --relink`. See
[checkout options](/doc/command-reference/checkout#options) for more details.
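
Conceptually, re-linking means removing each workspace file and re-creating it from the cache with the configured link type, without first checking what kind of link is there. The Python sketch below illustrates that idea only; `relink` and `LINKERS` are hypothetical names, reflinks are left out because the standard library has no call for them, and this is not DVC's implementation.

```python
import os
import shutil
import tempfile

LINKERS = {"hardlink": os.link, "symlink": os.symlink, "copy": shutil.copyfile}


def relink(cached, tracked, link_type):
    """Unconditionally remove `tracked` and re-create it from `cached`."""
    if os.path.lexists(tracked):
        os.remove(tracked)  # for a symlink this removes the link, not the cache
    LINKERS[link_type](cached, tracked)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        cached = os.path.join(root, "cached.txt")
        tracked = os.path.join(root, "tracked.txt")
        with open(cached, "w") as fobj:
            fobj.write("data")
        shutil.copyfile(cached, tracked)  # workspace starts out as a copy
        relink(cached, tracked, "hardlink")  # e.g. after changing cache.type
        print("same file now:", os.path.samefile(cached, tracked))
```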

---

> \***copy-on-write links or "reflinks"** are a relatively new way to link files