Merge pull request #345 from jorgeorpinel/master
New cache file linking guide and related updates to existing docs
shcheklein authored May 21, 2019
2 parents 2923fea + a5f3004 commit 0e4cfcb
Showing 16 changed files with 234 additions and 105 deletions.
2 changes: 2 additions & 0 deletions src/Documentation/sidebar.json
@@ -51,6 +51,7 @@
"files": [
"dvc-files-and-directories.md",
"dvc-file-format.md",
"cache-file-linking.md",
"external-dependencies.md",
"external-outputs.md",
"development.md",
@@ -65,6 +66,7 @@
"labels": {
"dvc-files-and-directories.md": "Files and Directories",
"dvc-file-format.md": "File Format (.dvc)",
"cache-file-linking.md": "Large File Optimization",
"development.md": "Development Version",
"external-dependencies.md": "External Dependencies",
"external-outputs.md": "External Outputs",
12 changes: 6 additions & 6 deletions static/docs/commands-reference/add.md
@@ -41,12 +41,12 @@ references the DVC cache entry using the checksum. See
[DVC File Format](/doc/user-guide/dvc-file-format) for the detailed description
of the DVC _metafile_ format.

By default DVC tries a range of link types (`reflink`, `hardlink`, `symlink`, or
`copy`) to try to avoid copying any file contents and to optimize DVC file
operations even for large files. The `reflink` is the best link type available,
but even though it is frequently supported by modern filesystems, many others
still don't support it. DVC has the other link types for use on filesystems
without `reflink` support. See `dvc config` for more information.
By default, DVC tries to use reflinks (see [Cache File
Linking](/docs/user-guide/cache-file-linking)) to avoid copying any file
contents and to optimize DVC file operations for large files. DVC also supports
other link types for use on file systems without `reflink` support, but they
have to be specified manually. Refer to the `cache.type` config option in `dvc
config cache` for more information.

A `dvc add` target can be an individual file or a directory. There are two ways
to work with directory hierarchies with `dvc add`.
2 changes: 1 addition & 1 deletion static/docs/commands-reference/cache.md
@@ -21,7 +21,7 @@ default `cache` directory.
The DVC cache is where your data files, models, etc (anything you want to
version with DVC) are actually stored. The corresponding files you see in the
working directory or "workspace" simply link to the ones in cache. (See `dvc
config cache` `type` setting for more info on file links on different
config cache` `type` setting for more information on file links on different
platforms.)

> For more cache-related configuration options refer to `dvc config cache`.
4 changes: 2 additions & 2 deletions static/docs/commands-reference/checkout.md
@@ -35,8 +35,8 @@ The execution of `dvc checkout` does:
backward in the pipeline from the named targets.
- For any data files where the checksum does not match with the DVC file entry,
the data file is restored from the cache. The link type used (`reflink`,
`hardlink`, `symlink`, or `copy`) by default depends on the OS, or the
configured value is used. (See `cache.type` in `dvc config cache`)
`hardlink`, `symlink`, or `copy`) depends on the OS, or the configured value
is used. (See `cache.type` in `dvc config cache`.)

This command must be executed after `git checkout` since Git does not handle
files that are under DVC control. For convenience a Git hook is available,
65 changes: 18 additions & 47 deletions static/docs/commands-reference/config.md
@@ -73,7 +73,7 @@ This is the main section with the general config options:
These are sections in the config file that describe particular remotes. These
sections contain a `url` value, and can also specify `user`, `port`, `keyfile`,
`timeout`, `ask_password`, and other cloud-specific key/value pairs for each
remote. See `dvc remote` for more info.
remote. See `dvc remote` for more information.

### cache

@@ -87,74 +87,45 @@ files](/doc/user-guide/dvc-files-and-directories) for more details.)
The default value is `cache`, which, resolved relative to the default project
config location, results in `.dvc/cache`.
> See also the helper command `dvc cache dir`, which properly transforms paths
relative to the present working directory into relative to the project config
file.
> relative to the present working directory into relative to the project
> config file.
- `cache.protected` - makes files in the workspace read-only. Possible values
are `true` or `false` (default). Run `dvc checkout` for the change to go into
effect. (It affects only files that are under DVC control.)
effect. (It affects only files that are under DVC control.)
Due to the way DVC handles linking between the data files in the cache and
their counterparts in the working directory, it's easy to accidentally corrupt
the cached version of a file by editing or overwriting it. Turning this config
option on forces you to run `dvc unprotect` before updating a file, providing
an additional layer of security to your data.
an additional layer of security to your data.
It's highly recommended to enable this mode when `cache.type` is set to
`hardlink` or `symlink`.

- `cache.type` - link type that DVC should use to link data files from cache to
your workspace. Possible values: `reflink`, `symlink`, `hardlink`, `copy` or a
combination of those, separated by commas: `reflink,symlink`.
By default, DVC will try `reflink` and `copy` link type in order to choose the
most effective of those two. DVC avoids `symlink` and `hardlink` types by
default to protect users from accidental cache and repository corruption.
> **Note!** Unless your workspace supports `reflinks` – if you are on a recent
Mac chances are you are using `reflinks` – or you've manually specified
`cache.type copy` **you are corrupting** the cache if you edit data files in
the workspace. See the `cache.protected` config option above and corresponding
`dvc unprotect` command to modify files safely.

There are pros and cons to different link types. Each type is explained below,
from the best and most efficient to the least efficient:

1. **`reflink`** - this is the best link type that could be. It is as fast as
hard/symlinks, but doesn't carry a risk of cache corruption, since
filesystem takes care of copying the file if you try to edit it in place,
thus keeping a linked cache file intact.
Unfortunately reflinks are currently supported on a limited number of
filesystems (Linux: Btrfs, XFS, OCFS2; MacOS: APFS), but they are coming to
every new filesystem and in the future will be supported by the majority of
filesystems in use.

2. **`hardlink`** - the most efficient way to link your data to cache if both
your repo and your cache directory are located on the same
filesystem/drive.
Please note that hardlinked data files should never be edited in place, but
instead deleted and then replaced with a new file, otherwise it might cause
cache corruption and automatic deletion of a cache file by dvc.

3. **`symlink`** - The most efficient way to link your data to cache if your
repo and your cache directory are located on different filesystems/drives
(i.e. repo is located on ssd for performance, but cache dir is located on
hdd for bigger storage).
Please note that data file linked with symlink should never be edited in
place, but instead deleted and then replaced with a new file, otherwise it
might cause cache corruption and automatic deletion of a cache file by dvc.

4. **`copy`** - The most inefficient link type, yet the most widely supported
for any repo/cache FS combination. Suitable for scenarios with relatively
small data files, where copying them is not a performance/storage concern.
combination of those, separated by commas, e.g. `reflink,hardlink,copy`.
By default, DVC will try `reflink,copy` link types in order to choose the most
effective of those two. DVC avoids `symlink` and `hardlink` types by default
to protect users from accidental cache and repository corruption.
> **Note!** If you manually set `cache.type` to `hardlink` or `symlink`, **you
> will corrupt the cache** if you modify tracked data files in the workspace.
> See the `cache.protected` config option above and corresponding
> `dvc unprotect` command to modify files safely.
There are pros and cons to different link types. Refer to [Cache File
Linking](/docs/user-guide/cache-file-linking) for a full explanation of each
one.

- `cache.slow_link_warning` - used to turn off the warnings about having a slow
cache link type. These warnings are thrown by `dvc pull` and `dvc checkout`
when linking files takes longer than usual, to remind users that there are
faster cache link types available than the defaults (`reflink` or `copy` – see
faster cache link types available than the defaults (`reflink,copy` – see
`cache.type`). Accepts values `true` and `false`.
> These warnings are automatically turned off when `cache.type` is manually
> set.
- `cache.local` - name of a local remote to use as local cache. This will
overwrite the value provided to `dvc config cache.dir` or `dvc cache dir`.
Refer to `dvc remote` for more info on "local remotes".
Refer to `dvc remote` for more information on "local remotes".

- `cache.ssh` - name of an [SSH remote to use as external
cache](/doc/user-guide/external-outputs#ssh).
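
The fallback order described for `cache.type` above can be pictured with a minimal Python sketch. This is not DVC's implementation: `link_with_fallback` and its `linkers` table are hypothetical names used only for illustration. Because the Python standard library has no portable reflink call, `reflink` is simply treated as unsupported here, so a `reflink,copy` list ends up using `copy`.

```python
import os
import shutil
import tempfile


def link_with_fallback(src, dst, link_types):
    """Try each link type in order and return the name of the first that works.

    Hypothetical helper for illustration only; not part of DVC.
    """
    # Assumption: "reflink" has no entry because the standard library offers
    # no portable reflink call, so it is treated as an unsupported type.
    linkers = {
        "hardlink": os.link,
        "symlink": os.symlink,
        "copy": shutil.copyfile,
    }
    for name in link_types:
        linker = linkers.get(name)
        if linker is None:
            continue  # unsupported type: move on to the next one
        try:
            linker(src, dst)
            return name
        except OSError:
            continue  # e.g. a hardlink across devices: try the next type
    raise RuntimeError("no usable link type among: " + ",".join(link_types))


# Stand-ins for a cache entry and its workspace counterpart.
workdir = tempfile.mkdtemp()
cache_file = os.path.join(workdir, "cache_entry")
with open(cache_file, "w") as f:
    f.write("data")

workspace_file = os.path.join(workdir, "data.txt")
used = link_with_fallback(cache_file, workspace_file, ["reflink", "copy"])
print(used)
```

Passing `["hardlink", "copy"]` instead would create a hard link when both paths are on the same file system, which is the trade-off the `cache.protected` note warns about.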
4 changes: 2 additions & 2 deletions static/docs/commands-reference/status.md
@@ -38,8 +38,8 @@ stages that affect the target stage.

In the `local` mode, changes are detected through the checksum of every file
listed in every stage file in the pipeline against the corresponding file in the
filesystem. The output indicates the detected changes, if any. If no differences
are detected, `dvc status` prints this message:
file system. The output indicates the detected changes, if any. If no
differences are detected, `dvc status` prints this message:

```dvc
$ dvc status
37 changes: 30 additions & 7 deletions static/docs/get-started/add-files.md
@@ -40,8 +40,11 @@ $ git commit -m "add source data to DVC"

### Expand to learn about DVC internals

You can see that actual data file has been moved to the `.dvc/cache` directory
(usually hardlink or reflink is created, so no physical copying is happening).
You can see that the actual data file has been moved to the `.dvc/cache`
directory, while the entries in the working directory may be links to the actual
files in the DVC cache. (See [Cache File Linking](/docs/user-guide/cache-file-linking)
to learn about the supported file linking options, their tradeoffs, and how to
enable them.)

```dvc
$ ls -R .dvc/cache
@@ -55,12 +58,32 @@ hash inside.

</details>

<details>

### Expand for an important note on cache performance

DVC tries to use reflinks\* by default to link your data files from the DVC
cache to the workspace, optimizing speed and storage space. However, reflinks
are not widely supported yet and DVC falls back to actually copying data files
to/from the cache **which can be very slow with large files**, and duplicates
storage requirements.

Hardlinks and symlinks are also available for optimized cache linking but
(unlike reflinks) they carry the risk of accidentally corrupting the cache if
tracked data files are modified in the workspace.

See [Cache File Linking](/docs/user-guide/cache-file-linking) and
`dvc config cache` for more information.

> \***copy-on-write links or "reflinks"** are a relatively new way to link files
> in UNIX-style file systems. Unlike hardlinks or symlinks, they support
> transparent [copy on write](https://en.wikipedia.org/wiki/Copy-on-write). This
> means that editing a reflinked file is always safe: the file system copies the
> data on write, so the other links to the file (such as the cached copy) keep
> their original contents.
</details>
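
The corruption risk mentioned in the note above can be reproduced outside DVC with a short Python sketch. It is only an illustration under stated assumptions: `cache_entry`, `data_copy`, and `data_hardlink` are hypothetical file names in a temporary directory, and no DVC code is involved. Editing the copied file leaves the "cache" file alone, while editing the hardlinked file in place changes it.

```python
import os
import shutil
import tempfile

workdir = tempfile.mkdtemp()

# Stand-in for a file stored in the cache (hypothetical name).
cache_file = os.path.join(workdir, "cache_entry")
with open(cache_file, "w") as f:
    f.write("original")

# Two workspace counterparts: one full copy, one hard link.
copied = os.path.join(workdir, "data_copy")
hardlinked = os.path.join(workdir, "data_hardlink")
shutil.copyfile(cache_file, copied)
os.link(cache_file, hardlinked)

# Editing the copy in place does not touch the cache file.
with open(copied, "w") as f:
    f.write("edited")
with open(cache_file) as f:
    cache_after_copy_edit = f.read()

# Editing the hard link in place rewrites the shared data, cache included.
with open(hardlinked, "w") as f:
    f.write("edited")
with open(cache_file) as f:
    cache_after_hardlink_edit = f.read()

print(cache_after_copy_edit)
print(cache_after_hardlink_edit)
```

This is why hardlinked or symlinked files should be replaced rather than edited in place, or unprotected first with `dvc unprotect`.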

Refer to
[Data and Model Files Versioning](/doc/use-cases/data-and-model-files-versioning),
`dvc add`, and `dvc run` for more information on storing and versioning data
files with DVC.

Note that to modify or replace a data file that is under DVC control you may
need to run `dvc unprotect` or `dvc remove` first (check the
[Update Tracked File](/doc/user-guide/update-tracked-file) guide). Use
`dvc move` to rename or move a data file that is under DVC control.
8 changes: 6 additions & 2 deletions static/docs/get-started/example-pipeline.md
@@ -167,8 +167,12 @@ intermediate result.
Second, you should see by now that the actual data is stored in the `.dvc/cache`
directory, each file having a name in a form of an md5 hash. This cache is
similar to Git's internal objects store but made specifically to handle large
data files. DVC is using reflinks, hardlinks and other optimizations to manage
your actual workspace without copying every time object from/to the cache.
data files.

> **Note!** For performance with large data files, DVC can use file links from
the cache to the workspace to avoid copying actual file contents. Refer to
[Cache File Linking](/docs/user-guide/cache-file-linking) to learn which options
exist and how to enable them.

</details>
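
To make the "name in a form of an md5 hash" idea concrete, here is a minimal Python sketch that hashes a file and derives a cache-style path from the checksum. The helper names `file_md5` and `cache_path` are hypothetical, and the layout used (first two hex characters as a subdirectory, the rest as the file name) is an assumption made for illustration rather than something this page states.

```python
import hashlib
import os
import tempfile


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex md5 of a file, reading it in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_path(md5_hex, cache_dir=".dvc/cache"):
    """Build a cache-style path from a checksum.

    Assumption: the first two hex characters form a subdirectory and the
    remaining characters form the file name.
    """
    return f"{cache_dir}/{md5_hex[:2]}/{md5_hex[2:]}"


# Small sample file standing in for a data file.
fd, sample = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"hello")

checksum = file_md5(sample)
print(checksum)
print(cache_path(checksum))
```

Because the path depends only on the content, two identical files map to the same cache entry, which is what lets the workspace link to a single stored copy.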

4 changes: 0 additions & 4 deletions static/docs/get-started/example-versioning.md
@@ -227,10 +227,6 @@ $ python train.py
$ dvc add model.h5
```

Note! `dvc remove` or `dvc unprotect` is required, otherwise `python train.py`
will overwrite the existing file and may corrupt the cached version. Check this
[guide](/doc/user-guide/update-tracked-file) to learn more.

Let's commit the second version:

```dvc
30 changes: 16 additions & 14 deletions static/docs/tutorial/define-ml-pipeline.md
@@ -91,17 +91,18 @@ copy) might take a few minutes.

DVC was designed with large data files in mind. This means gigabytes or even
hundreds of gigabytes in file size. Instead of copying files from cache to
workspace, DVC creates [hardlinks](https://en.wikipedia.org/wiki/Hard_link).
(This is similar to what [Git-annex](https://git-annex.branchable.com/) does.)

Creating file hardlinks (or reflinks on the modern file systems) is a quick
operation. So, with DVC you can easily checkout a few dozen files of any size. A
hardlink does not require you to have twice as much space in the hard drive.
Even if each of the files contains 41MB of data, the overall size of the
repository is still 41MB. Both of the files correspond to the same `inode` (file
meta-data record) in a file system. Use `ls -i` to see file system inodes. If
you are using a modern file system with reflinks you might see different inodes,
still only one copy if the actual file data is stored.
workspace, DVC creates reflinks or other file link types. (See [Cache File
Linking](/docs/user-guide/cache-file-linking).)

Creating file links is a quick file system operation. So, with DVC you can
easily check out a few dozen files of any size. A file link does not require you
to have twice as much space on the hard drive. Even if each of the files
contains 41MB of data, the overall size of the repository is still 41MB. Both of
the files correspond to the same `inode` (file metadata record) in a file
system. Use `ls -i` to see file system inodes. If you are using a modern file
system with reflinks you might see different inodes, but still only one copy of
the actual file data is stored. (Refer to [Cache File
Linking](/docs/user-guide/cache-file-linking) for more details.)
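
The shared-inode behavior of hard links can also be checked programmatically. The following Python sketch is an illustration only, with hypothetical file names in a temporary directory and no DVC involved: it creates a hard link and compares the inode numbers and link count reported by `os.stat`, which is the same information `ls -i` shows.

```python
import os
import tempfile

workdir = tempfile.mkdtemp()

# Stand-in for a cached data file (hypothetical name).
cache_file = os.path.join(workdir, "cache_entry")
with open(cache_file, "wb") as f:
    f.write(b"x" * 1024)

# Hard link playing the role of the workspace file.
workspace_file = os.path.join(workdir, "data.bin")
os.link(cache_file, workspace_file)

cache_stat = os.stat(cache_file)
workspace_stat = os.stat(workspace_file)

# Both names point at the same inode, so the data is stored only once.
same_inode = cache_stat.st_ino == workspace_stat.st_ino
link_count = cache_stat.st_nlink

print(same_inode, link_count)
```

With a reflink the two paths would typically report different inode numbers, so this check identifies hard links specifically rather than every space-saving link type.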

> Note: In case of systems supporting reflink, use `df` utility to see that free
> space on the drive didn't decline by the file size that we are adding and no
@@ -119,9 +120,10 @@ $ du -sh .
41M .
```

> Note that DVC uses hardlinks in all the supported OSs, including Mac OS, Linux
> and Windows. Some implementation details (like inodes) might differ, but the
> overall DVC behavior is the same.
> Note that DVC tries to use reflinks by default on the platforms that support
> them (Mac OS or Linux, depending on the file system). Some implementation
> details (like inodes) might differ, but the overall DVC behavior is the same
> on those platforms.
## Running commands

17 changes: 13 additions & 4 deletions static/docs/understanding-dvc/related-technologies.md
@@ -81,7 +81,8 @@ process.
want to see in your Git repository) in a local key-value store and use file
symlinks instead of the actual files.

- DVC uses hardlinks instead of symlinks to make user experience better.
- DVC can use reflinks\* or hardlinks (depending on the system) instead of
symlinks to improve performance and make the user experience better.

- DVC optimizes checksum calculation.

@@ -111,11 +112,19 @@ process.
`git clone` command. It gives more granularity on managing data and code
separately. Hooks could be configured to make workflow simpler.

- DVC creates hardlinks (or even reflinks if they are supported). The
- DVC attempts to use reflinks\* and has other
[file linking options](/docs/user-guide/cache-file-linking). The
`dvc checkout` command does not actually copy data files from cache to the
workspace, as copying files is a heavy operation for large files (30
GB+).
workspace, as copying files is a heavy operation for large files (30 GB+).

- `git-lfs` was not made with data science scenarios in mind, so it does not
support certain features, e.g. pipelines and metrics. GitHub also has a limit
of 2 GB per repository.

---

> \***copy-on-write links or "reflinks"** are a relatively new way to link files
> in UNIX-style file systems. Unlike hardlinks or symlinks, they support
> transparent [copy on write](https://en.wikipedia.org/wiki/Copy-on-write). This
> means that editing a reflinked file is always safe: the file system copies the
> data on write, so the other links to the file (such as the cached copy) keep
> their original contents.
4 changes: 2 additions & 2 deletions static/docs/use-cases/share-data-and-model-files.md
@@ -5,8 +5,8 @@ dead easy to consistently get all your data files and code to any machine. All
you need to do is to set up a remote DVC repository that will store cache files
for your project. Currently DVC supports AWS S3, Google Cloud Storage, Microsoft
Azure Blob Storage, SSH and HDFS as remote locations, and the list is constantly
growing. To get a full info about supported remote types and their configuration
take a look at `dvc remote`.
growing. For complete information about supported remote types and their
configuration take a look at `dvc remote`.

![](/static/img/model-sharing-digram.png)

2 changes: 1 addition & 1 deletion static/docs/user-guide/analytics.md
@@ -26,7 +26,7 @@ User and event data have a 14 month retention period.
DVC's analytics record the following information per event:

- The DVC version, e.g. `0.22.0`
- The operating system info, e.g. `linux`, `ubuntu`, `14.04`, etc
- The operating system information, e.g. `linux`, `ubuntu`, `14.04`, etc
- The underlying version control system, e.g. `git`
- Command type, e.g. `CmdDataPull`
- Command return code, e.g. `1`