From 4d66a95f528994e7300d2ae9b96a55e8f21fe5cb Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Thu, 16 May 2019 20:17:12 -0500 Subject: [PATCH 1/5] Updates hardlink references to reflink throughout the docs. Related to https://github.com/iterative/dvc/pull/1841 --- static/docs/commands-reference/add.md | 12 ++++---- static/docs/commands-reference/checkout.md | 4 +-- static/docs/commands-reference/status.md | 4 +-- static/docs/get-started/add-files.md | 2 +- static/docs/get-started/example-pipeline.md | 4 +-- static/docs/tutorial/define-ml-pipeline.md | 30 ++++++++++--------- .../understanding-dvc/related-technologies.md | 10 +++---- 7 files changed, 34 insertions(+), 32 deletions(-) diff --git a/static/docs/commands-reference/add.md b/static/docs/commands-reference/add.md index 56226e7580..31eb3e4628 100644 --- a/static/docs/commands-reference/add.md +++ b/static/docs/commands-reference/add.md @@ -41,12 +41,12 @@ references the DVC cache entry using the checksum. See [DVC File Format](/doc/user-guide/dvc-file-format) for the detailed description of the DVC _metafile_ format. -By default DVC tries a range of link types (`reflink`, `hardlink`, `symlink`, or -`copy`) to try to avoid copying any file contents and to optimize DVC file -operations even for large files. The `reflink` is the best link type available, -but even though it is frequently supported by modern filesystems, many others -still don't support it. DVC has the other link types for use on filesystems -without `reflink` support. See `dvc config` for more information. +By default DVC tries using reflinks (See [Cache File +Linking](/docs/user-guide/cache-file-linking)) to avoid copying any file +contents and to optimize DVC file operations even for large files. DVC also +supports other link types for use on file systems without `reflink` support, but +they have to be specified manually. Refer to the `cache.type` config option in +`dvc config cache` for more information. 
A `dvc add` target can be an individual file or a directory. There are two ways to work with directory hierarchies with `dvc add`. diff --git a/static/docs/commands-reference/checkout.md b/static/docs/commands-reference/checkout.md index 5ae606f069..e838ba61b0 100644 --- a/static/docs/commands-reference/checkout.md +++ b/static/docs/commands-reference/checkout.md @@ -35,8 +35,8 @@ The execution of `dvc checkout` does: backward in the pipeline from the named targets. - For any data files where the checksum does not match with the DVC file entry, the data file is restored from the cache. The link type used (`reflink`, - `hardlink`, `symlink`, or `copy`) by default depends on the OS, or the - configured value is used. (See `cache.type` in `dvc config cache`) + `hardlink`, `symlink`, or `copy`) depends on the OS, or the configured value + is used. (See `cache.type` in `dvc config cache`.) This command must be executed after `git checkout` since Git does not handle files that are under DVC control. For convenience a Git hook is available, diff --git a/static/docs/commands-reference/status.md b/static/docs/commands-reference/status.md index fb12821408..ce6c581e9d 100644 --- a/static/docs/commands-reference/status.md +++ b/static/docs/commands-reference/status.md @@ -38,8 +38,8 @@ stages that affect the target stage. In the `local` mode, changes are detected through the checksum of every file listed in every stage file in the pipeline against the corresponding file in the -filesystem. The output indicates the detected changes, if any. If no differences -are detected, `dvc status` prints this message: +file system. The output indicates the detected changes, if any. 
If no +differences are detected, `dvc status` prints this message: ```dvc $ dvc status diff --git a/static/docs/get-started/add-files.md b/static/docs/get-started/add-files.md index f733fd8b3a..975f8a7b60 100644 --- a/static/docs/get-started/add-files.md +++ b/static/docs/get-started/add-files.md @@ -41,7 +41,7 @@ $ git commit -m "add source data to DVC" ### Expand to learn about DVC internals You can see that actual data file has been moved to the `.dvc/cache` directory -(usually hardlink or reflink is created, so no physical copying is happening). +(usually a reflink is created, so no file content copying is happening). ```dvc $ ls -R .dvc/cache diff --git a/static/docs/get-started/example-pipeline.md b/static/docs/get-started/example-pipeline.md index 78dfd543aa..7265644a79 100644 --- a/static/docs/get-started/example-pipeline.md +++ b/static/docs/get-started/example-pipeline.md @@ -167,8 +167,8 @@ intermediate result. Second, you should see by now that the actual data is stored in the `.dvc/cache` directory, each file having a name in a form of an md5 hash. This cache is similar to Git's internal objects store but made specifically to handle large -data files. DVC is using reflinks, hardlinks and other optimizations to manage -your actual workspace without copying every time object from/to the cache. +data files. DVC uses reflinks, hardlinks, or other optimizations to manage your +actual workspace without copying actual file contents from/to the cache. diff --git a/static/docs/tutorial/define-ml-pipeline.md b/static/docs/tutorial/define-ml-pipeline.md index a403a5468c..654d2a67d4 100644 --- a/static/docs/tutorial/define-ml-pipeline.md +++ b/static/docs/tutorial/define-ml-pipeline.md @@ -91,17 +91,18 @@ copy) might take a few minutes. DVC was designed with large data files in mind. This means gigabytes or even hundreds of gigabytes in file size. Instead of copying files from cache to -workspace, DVC creates [hardlinks](https://en.wikipedia.org/wiki/Hard_link). 
-(This is similar to what [Git-annex](https://git-annex.branchable.com/) does.) - -Creating file hardlinks (or reflinks on the modern file systems) is a quick -operation. So, with DVC you can easily checkout a few dozen files of any size. A -hardlink does not require you to have twice as much space in the hard drive. -Even if each of the files contains 41MB of data, the overall size of the -repository is still 41MB. Both of the files correspond to the same `inode` (file -meta-data record) in a file system. Use `ls -i` to see file system inodes. If -you are using a modern file system with reflinks you might see different inodes, -still only one copy if the actual file data is stored. +workspace, DVC creates reflinks or other file link types. (See [Cache File +Linking](/docs/user-guide/cache-file-linking).) + +Creating file links is a quick file system operation. So, with DVC you can +easily checkout a few dozen files of any size. A file link does not require you +to have twice as much space in the hard drive. Even if each of the files +contains 41MB of data, the overall size of the repository is still 41MB. Both of +the files correspond to the same `inode` (file metadata record) in a file +system. Use `ls -i` to see file system inodes. If you are using a modern file +system with reflinks you might see different inodes, still only one copy if the +actual file data is stored. (Refer to [Cache File +Linking](/docs/user-guide/cache-file-linking) for more details.) > Note: In case of systems supporting reflink, use `df` utility to see that free > space on the drive didn't decline by the file size that we are adding and no @@ -119,9 +120,10 @@ $ du -sh . 41M . ``` -> Note that DVC uses hardlinks in all the supported OSs, including Mac OS, Linux -> and Windows. Some implementation details (like inodes) might differ, but the -> overall DVC behavior is the same. 
+> Note that DVC tries to use reflinks by default in the platforms that support +> them (Mac OS or Linux, depending on the file system). Some implementation +> details (like inodes) might differ, but the overall DVC behavior is the same +> on those platforms. ## Running commands diff --git a/static/docs/understanding-dvc/related-technologies.md b/static/docs/understanding-dvc/related-technologies.md index 1a63f3fc7d..5be8bf7c32 100644 --- a/static/docs/understanding-dvc/related-technologies.md +++ b/static/docs/understanding-dvc/related-technologies.md @@ -81,7 +81,8 @@ process. want to see in your Git repository) in a local key-value store and use file symlinks instead of the actual files. - - DVC uses hardlinks instead of symlinks to make user experience better. + - DVC uses reflinks (or hardlinks) instead of symlinks to make user + experience better. - DVC optimizes checksum calculation. @@ -111,10 +112,9 @@ process. `git clone` command. It gives more granularity on managing data and code separately. Hooks could be configured to make workflow simpler. - - DVC creates hardlinks (or even reflinks if they are supported). The - `dvc checkout` command does not actually copy data files from cache to the - workspace, as copying files is a heavy operation for large files (30 - GB+). + - DVC creates reflinks. The `dvc checkout` command does not actually copy + data files from cache to the workspace, as copying files is a heavy + operation for large files (30 GB+). - `git-lfs` was not made with data science scenarios in mind, thus it does not support certain features, e.g. pipelines and metrics, and thus Github From 067313305399b5e9e0ac016d524d008f5f0d2196 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Fri, 17 May 2019 12:43:52 -0500 Subject: [PATCH 2/5] Adds /doc/user-guide/cache-file-linking and... - Makes updates to several other docs to link to the new guide; - Addresses 1st round of feedback in #345; - Other misc. 
improvements related to this topic --- src/Documentation/sidebar.json | 2 + static/docs/commands-reference/add.md | 8 +- static/docs/commands-reference/cache.md | 2 +- static/docs/commands-reference/config.md | 62 ++++--------- static/docs/get-started/add-files.md | 4 +- static/docs/get-started/example-pipeline.md | 8 +- static/docs/get-started/example-versioning.md | 7 +- .../understanding-dvc/related-technologies.md | 18 +++- .../use-cases/share-data-and-model-files.md | 4 +- static/docs/user-guide/analytics.md | 2 +- static/docs/user-guide/cache-file-linking.md | 93 +++++++++++++++++++ .../user-guide/dvc-files-and-directories.md | 13 ++- static/docs/user-guide/update-tracked-file.md | 20 ++-- 13 files changed, 166 insertions(+), 77 deletions(-) create mode 100644 static/docs/user-guide/cache-file-linking.md diff --git a/src/Documentation/sidebar.json b/src/Documentation/sidebar.json index 4b933e4254..e39c14888a 100644 --- a/src/Documentation/sidebar.json +++ b/src/Documentation/sidebar.json @@ -51,6 +51,7 @@ "files": [ "dvc-files-and-directories.md", "dvc-file-format.md", + "cache-file-linking.md", "external-dependencies.md", "external-outputs.md", "development.md", @@ -65,6 +66,7 @@ "labels": { "dvc-files-and-directories.md": "Files and Directories", "dvc-file-format.md": "File Format (.dvc)", + "cache-file-linking.md": "Cache File Linking", "development.md": "Development Version", "external-dependencies.md": "External Dependencies", "external-outputs.md": "External Outputs", diff --git a/static/docs/commands-reference/add.md b/static/docs/commands-reference/add.md index 31eb3e4628..cd66a79e12 100644 --- a/static/docs/commands-reference/add.md +++ b/static/docs/commands-reference/add.md @@ -43,10 +43,10 @@ of the DVC _metafile_ format. By default DVC tries using reflinks (See [Cache File Linking](/docs/user-guide/cache-file-linking)) to avoid copying any file -contents and to optimize DVC file operations even for large files. 
DVC also -supports other link types for use on file systems without `reflink` support, but -they have to be specified manually. Refer to the `cache.type` config option in -`dvc config cache` for more information. +contents and to optimize DVC file operations for large files. DVC also supports +other link types for use on file systems without `reflink` support, but they +have to be specified manually. Refer to the `cache.type` config option in `dvc +config cache` for more information. A `dvc add` target can be an individual file or a directory. There are two ways to work with directory hierarchies with `dvc add`. diff --git a/static/docs/commands-reference/cache.md b/static/docs/commands-reference/cache.md index 7e158ad42f..fda7167e23 100644 --- a/static/docs/commands-reference/cache.md +++ b/static/docs/commands-reference/cache.md @@ -21,7 +21,7 @@ default `cache` directory. The DVC cache is where your data files, models, etc (anything you want to version with DVC) are actually stored. The corresponding files you see in the working directory or "workspace" simply link to the ones in cache. (See `dvc -config cache` `type` setting for more info on file links on different +config cache` `type` setting for more information on file links on different platforms.) > For more cache-related configuration options refer to `dvc config cache`. diff --git a/static/docs/commands-reference/config.md b/static/docs/commands-reference/config.md index bc97a881a2..5b333eadc4 100644 --- a/static/docs/commands-reference/config.md +++ b/static/docs/commands-reference/config.md @@ -73,7 +73,7 @@ This is the main section with the general config options: These are sections in the config file that describe particular remotes. These sections contain a `url` value, and can also specify `user`, `port`, `keyfile`, `timeout`, `ask_password`, and other cloud-specific key/value pairs for each -remote. See `dvc remote` for more info. +remote. See `dvc remote` for more information. 
### cache @@ -92,69 +92,41 @@ files](/doc/user-guide/dvc-files-and-directories) for more details.) - `cache.protected` - makes files in the workspace read-only. Possible values are `true` or `false` (default). Run `dvc checkout` for the change go into - effect. (It affects only files that are under DVC control.) + effect. (It affects only files that are under DVC control.) Due to the way DVC handles linking between the data files in the cache and their counterparts in the working directory, it's easy to accidentally corrupt the cached version of a file by editing or overwriting it. Turning this config option on forces you to run `dvc unprotect` before updating a file, providing - an additional layer of security to your data. + an additional layer of security to your data. It's highly recommended to enable this mod when `cache.type` is set to `hardlink` or `symlink`. - `cache.type` - link type that DVC should use to link data files from cache to your workspace. Possible values: `reflink`, `symlink`, `hardlink`, `copy` or a - combination of those, separated by commas: `reflink,symlink`. - By default, DVC will try `reflink` and `copy` link type in order to choose the - most effective of those two. DVC avoids `symlink` and `hardlink` types by - default to protect user from accidental cache and repository corruption. - > **Note!** Unless your workspace supports `reflinks` – if you are on a recent - Mac chances are you are using `reflinks` – or you've manually specified - `cache.type copy` **you are corrupting** the cache if you edit data files in - the workspace. See the `cache.protected` config option above and corresponding - `dvc unprotect` command to modify files safely. - - There are pros and cons to different link types. Each type is explained below, - from the best and most efficient to the least efficient: - - 1. **`reflink`** - this is the best link type that could be. 
It is as fast as - hard/symlinks, but doesn't carry a risk of cache corruption, since - filesystem takes care of copying the file if you try to edit it in place, - thus keeping a linked cache file intact. - Unfortunately reflinks are currently supported on a limited number of - filesystems (Linux: Btrfs, XFS, OCFS2; MacOS: APFS), but they are coming to - every new filesystem and in the future will be supported by the majority of - filesystems in use. - - 2. **`hardlink`** - the most efficient way to link your data to cache if both - your repo and your cache directory are located on the same - filesystem/drive. - Please note that hardlinked data files should never be edited in place, but - instead deleted and then replaced with a new file, otherwise it might cause - cache corruption and automatic deletion of a cache file by dvc. - - 3. **`symlink`** - The most efficient way to link your data to cache if your - repo and your cache directory are located on different filesystems/drives - (i.e. repo is located on ssd for performance, but cache dir is located on - hdd for bigger storage). - Please note that data file linked with symlink should never be edited in - place, but instead deleted and then replaced with a new file, otherwise it - might cause cache corruption and automatic deletion of a cache file by dvc. - - 4. **`copy`** - The most inefficient link type, yet the most widely supported - for any repo/cache FS combination. Suitable for scenarios with relatively - small data files, where copying them is not a performance/storage concern. + + combination of those, separated by commas e.g: `reflink,hardlink,copy`. + By default, DVC will try `reflink,copy` link types in order to choose the most + effective of those two. DVC avoids `symlink` and `hardlink` types by default + to protect user from accidental cache and repository corruption. 
+ > **Note!** If you manually set `cache.type` to `hardlink` or `symlink`, **you + will corrupt the cache** if you edit data files in the workspace. See the + `cache.protected` config option above and corresponding `dvc unprotect` + command to modify files safely. + There are pros and cons to different link types. Refer to [Cache File + Linking](/docs/user-guide/cache-file-linking) for a full explanation of each + one. - `cache.slow_link_warning` - used to turn off the warnings about having a slow cache link type. These warnings are thrown by `dvc pull` and `dvc checkout` when linking files takes longer than usual, to remind them that there are - faster cache link types available than the defaults (`reflink` or `copy` – see + faster cache link types available than the defaults (`reflink,copy` – see `cache.type`). Accepts values `true` and `false`. > These warnings are automatically turned off when `cache.type` is manually > set. - `cache.local` - name of a local remote to use as local cache. This will overwrite the value provided to `dvc config cache.dir` or `dvc cache dir`. - Refer to `dvc remote` for more info on "local remotes". + Refer to `dvc remote` for more information on "local remotes". - `cache.ssh` - name of an [SSH remote to use as external cache](/doc/user-guide/external-outputs#ssh). diff --git a/static/docs/get-started/add-files.md b/static/docs/get-started/add-files.md index 975f8a7b60..111e67d49f 100644 --- a/static/docs/get-started/add-files.md +++ b/static/docs/get-started/add-files.md @@ -41,7 +41,9 @@ $ git commit -m "add source data to DVC" ### Expand to learn about DVC internals You can see that actual data file has been moved to the `.dvc/cache` directory -(usually a reflink is created, so no file content copying is happening). 
+(ideally with reflinks if available on the system, otherwise by file copy – see +[Cache File Linking](/docs/user-guide/cache-file-linking) to learn about all the +supported file linking options, their tradeoffs, and how to enable them). ```dvc $ ls -R .dvc/cache diff --git a/static/docs/get-started/example-pipeline.md b/static/docs/get-started/example-pipeline.md index 7265644a79..019fb86236 100644 --- a/static/docs/get-started/example-pipeline.md +++ b/static/docs/get-started/example-pipeline.md @@ -167,8 +167,12 @@ intermediate result. Second, you should see by now that the actual data is stored in the `.dvc/cache` directory, each file having a name in a form of an md5 hash. This cache is similar to Git's internal objects store but made specifically to handle large -data files. DVC uses reflinks, hardlinks, or other optimizations to manage your -actual workspace without copying actual file contents from/to the cache. +data files. + +> **Note!** For performance with large data files, DVC can use file links from +the cache to the workspace to avoid copying actual file contents. Refer to +[Cache File Linking](/docs/user-guide/cache-file-linking) to learn which options +exist and how to enable them. diff --git a/static/docs/get-started/example-versioning.md b/static/docs/get-started/example-versioning.md index 5dc971a4bb..ebaf368b24 100644 --- a/static/docs/get-started/example-versioning.md +++ b/static/docs/get-started/example-versioning.md @@ -227,9 +227,10 @@ $ python train.py $ dvc add model.h5 ``` -Note! `dvc remove` or `dvc unprotect` is required, otherwise `python train.py` -will overwrite the existing file and may corrupt the cached version. Check this -[guide](/doc/user-guide/update-tracked-file) to learn more. +> **Note!** `dvc remove` or `dvc unprotect` is required, otherwise `python +train.py` will overwrite the existing file and may corrupt the cached version. 
+Refer to the [Update a Tracked File](/doc/user-guide/update-tracked-file) guide +to learn more. Let's commit the second version: diff --git a/static/docs/understanding-dvc/related-technologies.md b/static/docs/understanding-dvc/related-technologies.md index 5be8bf7c32..a5ac38ca48 100644 --- a/static/docs/understanding-dvc/related-technologies.md +++ b/static/docs/understanding-dvc/related-technologies.md @@ -81,8 +81,8 @@ process. want to see in your Git repository) in a local key-value store and use file symlinks instead of the actual files. - - DVC uses reflinks (or hardlinks) instead of symlinks to make user - experience better. + - DVC can use reflinks* or hardlinks (depending on the system) instead of + symlinks to improve performance and make the user experience better. - DVC optimizes checksum calculation. @@ -112,10 +112,18 @@ process. `git clone` command. It gives more granularity on managing data and code separately. Hooks could be configured to make workflow simpler. - - DVC creates reflinks. The `dvc checkout` command does not actually copy - data files from cache to the workspace, as copying files is a heavy - operation for large files (30 GB+). + - DVC attempts to use reflinks* and has other [file linking + options](/docs/user-guide/cache-file-linking). The `dvc checkout` + command does not actually copy data files from cache to the workspace, as + copying files is a heavy operation for large files (30 GB+). - `git-lfs` was not made with data science scenarios in mind, thus it does not support certain features, e.g. pipelines and metrics, and thus Github has a limit of 2 GB per repository. + +--- + +> \***copy-on-write links or "reflinks"** are a relatively new way to link files +in UNIX-style file systems. Unlike hardlinks or symlinks, they support +transparent copy on write. This means that editing a reflinked file is always +safe as all the other links to the file will reflect the changes. 
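Reviewer note on the hard link risk described in this series: the warnings about editing hardlinked files in place follow from ordinary file system behavior, and the sketch below tries to make that concrete. It is a minimal illustration using only the Python standard library, not DVC code. The names `cache_entry`, `workspace_link`, and `workspace_copy` are hypothetical stand-ins for a cache file and its workspace counterparts. Reflinks are deliberately left out because they need file system support that cannot be assumed here.

```python
import os
import shutil
import tempfile


def link_demo():
    """Compare an in-place edit through a hard link with one through a copy.

    Returns (same_inode, cache_after_copy_edit, cache_after_hardlink_edit).
    """
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical stand-ins: a cached file and two workspace files.
        cache = os.path.join(tmp, "cache_entry")
        linked = os.path.join(tmp, "workspace_link")
        copied = os.path.join(tmp, "workspace_copy")

        with open(cache, "w") as f:
            f.write("v1")

        os.link(cache, linked)          # hard link: a second name for the same inode
        shutil.copyfile(cache, copied)  # copy: an independent file with its own data

        same_inode = os.stat(cache).st_ino == os.stat(linked).st_ino

        # Editing the copy leaves the cached content untouched.
        with open(copied, "w") as f:
            f.write("v2")
        with open(cache) as f:
            cache_after_copy_edit = f.read()

        # Opening the hard link with "w" truncates the shared inode in place,
        # so the cached content changes too.
        with open(linked, "w") as f:
            f.write("v2")
        with open(cache) as f:
            cache_after_hardlink_edit = f.read()

        return same_inode, cache_after_copy_edit, cache_after_hardlink_edit


if __name__ == "__main__":
    print(link_demo())
```

Because the hard link and the cache entry are the same inode, an in-place write through the workspace name changes the cached data as well. That is the situation `cache.protected` and `dvc unprotect` are meant to guard against.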
diff --git a/static/docs/use-cases/share-data-and-model-files.md b/static/docs/use-cases/share-data-and-model-files.md index 716aeb50fb..6dc6b1f13b 100644 --- a/static/docs/use-cases/share-data-and-model-files.md +++ b/static/docs/use-cases/share-data-and-model-files.md @@ -5,8 +5,8 @@ dead easy to consistently get all your data files and code to any machine. All you need to do is to setup a remote DVC repository, that will store cache files for your project. Currently DVC supports AWS S3, Google Cloud Storage, Microsoft Azure Blob Storage, SSH and HDFS as remote location and the list is constantly -growing. To get a full info about supported remote types and their configuration -take a look at `dvc remote`. +growing. For complete information about supported remote types and their +configuration take a look at `dvc remote`. ![](/static/img/model-sharing-digram.png) diff --git a/static/docs/user-guide/analytics.md b/static/docs/user-guide/analytics.md index 6e0acdd222..0594b416d0 100644 --- a/static/docs/user-guide/analytics.md +++ b/static/docs/user-guide/analytics.md @@ -26,7 +26,7 @@ User and event data have a 14 month retention period. DVC's analytics record the following information per event: - The DVC version, e.g. `0.22.0` -- The operating system info, e.g. `linux`, `ubuntu`, `14.04`, etc +- The operating system information, e.g. `linux`, `ubuntu`, `14.04`, etc - The underlying version control system, e.g. `git` - Command type, e.g. `CmdDataPull` - Command return code, e.g. `1` diff --git a/static/docs/user-guide/cache-file-linking.md b/static/docs/user-guide/cache-file-linking.md new file mode 100644 index 0000000000..e1c3c53a3e --- /dev/null +++ b/static/docs/user-guide/cache-file-linking.md @@ -0,0 +1,93 @@ +# File Link Types for the DVC Cache + +File links are entries in the file system that don't necessarily hold the file +contents, but which point to where the file is actually stored. 
DVC uses file +links in your project's workspace to avoid copying them from/to the DVC cache. + +The DVC cache is a hidden storage (by default located in the `.dvc/cache` +directory) for files that are under DVC control, and their different versions. +(See `dvc cache` and [DVC internal +files](/doc/user-guide/dvc-files-and-directories) for more details.) + +File links are more common in file systems used with UNIX-like operating systems +and come in different flavors that have to do with how they connect filenames to +inodes in the system. Inodes are metadata file records to locate and manage +permissions to the actual file contents. Hard links, Soft or Symbolic links, and +Reflinks (in more recent systems) are the types of file links that DVC leverages +for performance. + +> See **Linking files** in [this +> doc](http://www.tldp.org/LDP/intro-linux/html/sect_03_03.html) for technical +> details on Linux. +> Some versions of Windows (e.g. Windows Server 2012+ and Windows 10 Enterprise) +> also support hard or soft links on the +> [NTFS](https://support.microsoft.com/en-us/help/100108/overview-of-fat-hpfs-and-ntfs-file-systems) +> and +> [ReFS](https://docs.microsoft.com/en-us/windows-server/storage/refs/refs-overview) +> file systems. + +## Supported file links types and their tradeoffs + +There are pros and cons to the 3 supported link types (`reflink`, `hardlink`, +`symlink` or soft link). While reflinks have all the benefits and none of the +worries, they're not commonly supported in most platforms yet. Hard/soft links +also optimize speed and space in the file system, but carry the risk of breaking +your workflow ,since updating tracked files in the workspace causes data +corruption. These 2 link types thus require using cache protected mode (see the +`cache.protected` config option in `dvc config cache`). 
Finally, a 4th "linking" +option is to actually `copy` files from/to the cache, which is safe but quite +inefficient for large files (not recommended for GBs of data). + +Each file linking option is further detailed below, in order of efficiency: + +1. **`reflink`** - copy-on-write links or "reflinks" are the best possible link + type, when available. They're is as fast as hard/symlinks, but don't carry a + risk of cache corruption since file system takes care of copying the file if + you try to edit it in place, thus keeping a linked cache file intact. + Unfortunately reflinks are currently supported on a limited number of file + systems only (Linux: Btrfs, XFS, OCFS2; MacOS: APFS), but in the future they + will be supported by the majority of file systems in use. + +2. **`hardlink`** - hard links are the most efficient way to link your data to + cache if both your repo and your cache directory are located on the same + file system/drive. + > Please note that hardlinked data files should never be edited in place, + but instead deleted and then replaced with a new file, otherwise it might + cause cache corruption and automatic deletion of cached files by DVC. + +3. **`symlink`** - symbolic (aka "soft") links are the most efficient way to + link your data to cache if your repo and your cache directory are located on + different file systems/drives (i.e. repo is located on SSD for performance, + but cache dir is located on HDD for bigger storage). + > Please note that symlinked data files should never be edited in place, + but instead deleted and then replaced with a new file, otherwise it might + cause cache corruption and automatic deletion of cached files by DVC. + +4. **`copy`** - the most inefficient "linking" strategy, yet the most widely + supported for any repo/cache file system combination. Using `copy` means + there will be no links but the actual files will be duplicated by making + copies from the cache into the workspace. 
Suitable for scenarios with + relatively small data files, where copying them is not a storage performance + concern + +> DVC avoids `symlink` and `hardlink` types by default to protect user from + accidental cache and repository corruption. Refer to the [Update a Tracked + File](/doc/user-guide/update-tracked-file) guide to learn more. + +## Configuring DVC cache file link type + +By default DVC tries to use reflinks if available on your system, however this +is not the most common case at this time, so it falls back to the copying +strategy. If you wish to enable hard or soft links you can configure DVC like +this: + +```dvc +$ dvc config cache.type reflink,hardlink,symlink +$ dvc config cache.protected true +``` +> Refer to `dvc config cache` for more options. + +Setting `cache.protected` is important with `hardlink` and/or `symlink` cache +file link types. Please refer to the [Update a Tracked +File](/docs/user-guide/update-tracked-file) to how to manage tracked files +under such a cache configuration. diff --git a/static/docs/user-guide/dvc-files-and-directories.md b/static/docs/user-guide/dvc-files-and-directories.md index b2a9d1f1b3..920b0f2c10 100644 --- a/static/docs/user-guide/dvc-files-and-directories.md +++ b/static/docs/user-guide/dvc-files-and-directories.md @@ -9,14 +9,15 @@ Once initialized in a project, DVC populates its installation directory - `.dvc/config.local` - this is a local configuration file, that will overwrite options in `.dvc/config`. This is useful when you need to specify private options in your config, that you don't want to track and share through git. - The local config file can be edited by hand or with a special command: - `dvc config --local`. + The local config file can be edited by hand or with a special command: `dvc + config --local`. -- `.dvc/cache` - the cache directory will contain your data files (the data +- `.dvc/cache` - the cache directory will contain your data files. 
(The data directories of DVC repositories will only contain links to the data files in - the cache). + the cache, refer to [Cache File + Linking](/docs/user-guide/cache-file-linking).) - **Note:** DVC includes the cache directory in `.gitignore` during the + > **Note:** DVC includes the cache directory in `.gitignore` during the initialization. No data files (with actual content) will ever be pushed to the Git repository, only DVC-files that are needed to reproduce them. @@ -29,8 +30,10 @@ Once initialized in a project, DVC populates its installation directory - `.dvc/state-journal` - temporary file for SQLite operations - `.dvc/state-wal` - another SQLite temporary file + - `.dvc/updater` - this file is used store latest available version of dvc, which is used to remind user to upgrade. + - `.dvc/updater.lock` - a lock file for `.dvc/updater`. - `.dvc/lock` - a lock file for the whole dvc project. diff --git a/static/docs/user-guide/update-tracked-file.md b/static/docs/user-guide/update-tracked-file.md index e7a055368d..074a0ca63a 100644 --- a/static/docs/user-guide/update-tracked-file.md +++ b/static/docs/user-guide/update-tracked-file.md @@ -1,10 +1,14 @@ # Update a Tracked File Due to the way DVC handles linking between the data files in the cache and their -counterparts in the working directory (see -[#799](https://github.com/iterative/dvc/issues/799) and -[#599](https://github.com/iterative/dvc/issues/599) for example), updating -tracked files has to be carried out with caution. +counterparts in the working directory (refer to [Cache File +Linking](/docs/user-guide/cache-file-linking)), updating tracked files has to be +carried out with caution to avoid data corruption when the DVC config option +`cache.type` is set to `hardlink` or/and `symlink`. (See `dvc config cache` for +more details on setting the cache file link types.) 
+ +> For an example of the cache corruption problem see issue +> [#599](https://github.com/iterative/dvc/issues/599) in our code repository. Assume `train.tsv` is tracked by dvc and you want to update it. Here updating may mean either replacing `train.tsv` with a new file having the same name or @@ -15,10 +19,10 @@ manually, DVC removes them for you before running the stage which generates them. If you use DVC to track a file that is generated during your pipeline (e.g. some -intermediate result or a final model file - `model.pkl`) and you don't use -`dvc run` and `dvc repro` to manage your pipeline, use the procedure below (run -`dvc unprotect` or `dvc remove`) to unlink it from DVC cache prior to the -execution of the script that modifies it. +intermediate result or a final model file - `model.pkl`) and you don't use `dvc +run` and `dvc repro` to manage your pipeline, use the procedure below (run `dvc +unprotect` or `dvc remove`) to unlink it from DVC cache prior to the execution +of the script that modifies it. See also `dvc unprotect` and `dvc config cache` to learn more about the recommended ways to protect your data files. From fd23531d457e1c49a680958ce67cd22fabc1026e Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Mon, 20 May 2019 03:58:36 -0500 Subject: [PATCH 3/5] Addesses some feedback from #345 and runs prettier on changed files. 
Feedback addressed: - https://github.com/iterative/dvc.org/pull/345#discussion_r285225280 - https://github.com/iterative/dvc.org/pull/345#pullrequestreview-239181772 - https://github.com/iterative/dvc.org/pull/345#pullrequestreview-239181801 - https://github.com/iterative/dvc.org/pull/345#pullrequestreview-239181897 --- .../understanding-dvc/related-technologies.md | 17 ++-- static/docs/user-guide/cache-file-linking.md | 78 ++++++++++--------- 2 files changed, 51 insertions(+), 44 deletions(-) diff --git a/static/docs/understanding-dvc/related-technologies.md b/static/docs/understanding-dvc/related-technologies.md index a5ac38ca48..84cbb6140a 100644 --- a/static/docs/understanding-dvc/related-technologies.md +++ b/static/docs/understanding-dvc/related-technologies.md @@ -81,7 +81,7 @@ process. want to see in your Git repository) in a local key-value store and use file symlinks instead of the actual files. - - DVC can use reflinks* or hardlinks (depending on the system) instead of + - DVC can use reflinks\* or hardlinks (depending on the system) instead of symlinks to improve performance and make the user experience better. - DVC optimizes checksum calculation. @@ -112,10 +112,10 @@ process. `git clone` command. It gives more granularity on managing data and code separately. Hooks could be configured to make workflow simpler. - - DVC attempts to use reflinks* and has other [file linking - options](/docs/user-guide/cache-file-linking). The `dvc checkout` - command does not actually copy data files from cache to the workspace, as - copying files is a heavy operation for large files (30 GB+). + - DVC attempts to use reflinks\* and has other + [file linking options](/docs/user-guide/cache-file-linking). The + `dvc checkout` command does not actually copy data files from cache to the + workspace, as copying files is a heavy operation for large files (30 GB+). - `git-lfs` was not made with data science scenarios in mind, thus it does not support certain features, e.g. 
pipelines and metrics, and thus Github @@ -124,6 +124,7 @@ process. --- > \***copy-on-write links or "reflinks"** are a relatively new way to link files -in UNIX-style file systems. Unlike hardlinks or symlinks, they support -transparent copy on write. This means that editing a reflinked file is always -safe as all the other links to the file will reflect the changes. +> in UNIX-style file systems. Unlike hardlinks or symlinks, they support +> transparent [copy on write](https://en.wikipedia.org/wiki/Copy-on-write). This +> means that editing a reflinked file is always safe, as the other links to the +> file will not reflect the changes. diff --git a/static/docs/user-guide/cache-file-linking.md b/static/docs/user-guide/cache-file-linking.md index e1c3c53a3e..5ac58ce90f 100644 --- a/static/docs/user-guide/cache-file-linking.md +++ b/static/docs/user-guide/cache-file-linking.md @@ -1,13 +1,15 @@ # File Link Types for the DVC Cache File links are entries in the file system that don't necessarily hold the file -contents, but which point to where the file is actually stored. DVC uses file -links in your project's workspace to avoid copying them from/to the DVC cache. +contents, but which point to where the file is actually stored. DVC can use file +links in your project's workspace instead of copies to avoid duplicating +contents that will exist in the DVC cache. The DVC cache is a hidden storage (by default located in the `.dvc/cache` directory) for files that are under DVC control, and their different versions. -(See `dvc cache` and [DVC internal -files](/doc/user-guide/dvc-files-and-directories) for more details.) +(See `dvc cache` and +[DVC internal files](/doc/user-guide/dvc-files-and-directories) for more +details.) File links are more common in file systems used with UNIX-like operating systems and come in different flavors that have to do with how they connect filenames to @@ -16,9 +18,9 @@ permissions to the actual file contents. 
Hard links, Soft or Symbolic links, and Reflinks (in more recent systems) are the types of file links that DVC leverages for performance. -> See **Linking files** in [this -> doc](http://www.tldp.org/LDP/intro-linux/html/sect_03_03.html) for technical -> details on Linux. +> See **Linking files** in +> [this doc](http://www.tldp.org/LDP/intro-linux/html/sect_03_03.html) for +> technical details on Linux. > Some versions of Windows (e.g. Windows Server 2012+ and Windows 10 Enterprise) > also support hard or soft links on the > [NTFS](https://support.microsoft.com/en-us/help/100108/overview-of-fat-hpfs-and-ntfs-file-systems) @@ -32,47 +34,50 @@ There are pros and cons to the 3 supported link types (`reflink`, `hardlink`, `symlink` or soft link). While reflinks have all the benefits and none of the worries, they're not commonly supported in most platforms yet. Hard/soft links also optimize speed and space in the file system, but carry the risk of breaking -your workflow ,since updating tracked files in the workspace causes data +your workflow, since updating tracked files in the workspace causes data corruption. These 2 link types thus require using cache protected mode (see the `cache.protected` config option in `dvc config cache`). Finally, a 4th "linking" option is to actually `copy` files from/to the cache, which is safe but quite inefficient for large files (not recommended for GBs of data). Each file linking option is further detailed below, in order of efficiency: - + 1. **`reflink`** - copy-on-write links or "reflinks" are the best possible link - type, when available. They're is as fast as hard/symlinks, but don't carry a - risk of cache corruption since file system takes care of copying the file if - you try to edit it in place, thus keeping a linked cache file intact. 
- Unfortunately reflinks are currently supported on a limited number of file - systems only (Linux: Btrfs, XFS, OCFS2; MacOS: APFS), but in the future they - will be supported by the majority of file systems in use. + type, when available. They're is as fast as hard/symlinks, but don't carry a + risk of cache corruption since file system takes care of copying the file if + you try to edit it in place, thus keeping a linked cache file intact. + Unfortunately reflinks are currently supported on a limited number of file + systems only (Linux: Btrfs, XFS, OCFS2; MacOS: APFS), but in the future they + will be supported by the majority of file systems in use. 2. **`hardlink`** - hard links are the most efficient way to link your data to - cache if both your repo and your cache directory are located on the same - file system/drive. - > Please note that hardlinked data files should never be edited in place, - but instead deleted and then replaced with a new file, otherwise it might - cause cache corruption and automatic deletion of cached files by DVC. + cache if both your repo and your cache directory are located on the same file + system/drive. + + > Please note that hardlinked data files should never be edited in place, but + > instead deleted and then replaced with a new file, otherwise it might cause + > cache corruption and automatic deletion of cached files by DVC. 3. **`symlink`** - symbolic (aka "soft") links are the most efficient way to - link your data to cache if your repo and your cache directory are located on - different file systems/drives (i.e. repo is located on SSD for performance, - but cache dir is located on HDD for bigger storage). - > Please note that symlinked data files should never be edited in place, - but instead deleted and then replaced with a new file, otherwise it might - cause cache corruption and automatic deletion of cached files by DVC. 
+ link your data to cache if your repo and your cache directory are located on + different file systems/drives (i.e. repo is located on SSD for performance, + but cache dir is located on HDD for bigger storage). + + > Please note that symlinked data files should never be edited in place, but + > instead deleted and then replaced with a new file, otherwise it might cause + > cache corruption and automatic deletion of cached files by DVC. 4. **`copy`** - the most inefficient "linking" strategy, yet the most widely - supported for any repo/cache file system combination. Using `copy` means - there will be no links but the actual files will be duplicated by making - copies from the cache into the workspace. Suitable for scenarios with - relatively small data files, where copying them is not a storage performance - concern + supported for any repo/cache file system combination. Using `copy` means + there will be no links but the actual files will be duplicated by making + copies from the cache into the workspace. Suitable for scenarios with + relatively small data files, where copying them is not a storage performance + concern > DVC avoids `symlink` and `hardlink` types by default to protect user from - accidental cache and repository corruption. Refer to the [Update a Tracked - File](/doc/user-guide/update-tracked-file) guide to learn more. +> accidental cache and repository corruption. Refer to the +> [Update a Tracked File](/doc/user-guide/update-tracked-file) guide to learn +> more. ## Configuring DVC cache file link type @@ -85,9 +90,10 @@ this: $ dvc config cache.type reflink,hardlink,symlink $ dvc config cache.protected true ``` + > Refer to `dvc config cache` for more options. Setting `cache.protected` is important with `hardlink` and/or `symlink` cache -file link types. Please refer to the [Update a Tracked -File](/docs/user-guide/update-tracked-file) to how to manage tracked files -under such a cache configuration. +file link types. 
Please refer to the +[Update a Tracked File](/docs/user-guide/update-tracked-file) to how to manage +tracked files under such a cache configuration. From 448854346252805acf386a8780622e14d105de9d Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Mon, 20 May 2019 17:58:01 -0500 Subject: [PATCH 4/5] Addressing feedback (round 2) from #345 - https://github.com/iterative/dvc.org/pull/345/files/0aa0b61fd62686b0c39ed36fa05b5c2b5934a30e#pullrequestreview-239181940 - adds cache performance expandable dtails section to get-started/add-files.md https://github.com/iterative/dvc.org/pull/345/files/0aa0b61fd62686b0c39ed36fa05b5c2b5934a30e#pullrequestreview-239182006 - some misc. reformatting in related files --- static/docs/commands-reference/config.md | 11 +++--- static/docs/get-started/add-files.md | 39 ++++++++++++++----- static/docs/get-started/example-versioning.md | 5 --- static/docs/user-guide/cache-file-linking.md | 21 ++++++---- 4 files changed, 48 insertions(+), 28 deletions(-) diff --git a/static/docs/commands-reference/config.md b/static/docs/commands-reference/config.md index 5b333eadc4..ab036f8f6c 100644 --- a/static/docs/commands-reference/config.md +++ b/static/docs/commands-reference/config.md @@ -87,8 +87,8 @@ files](/doc/user-guide/dvc-files-and-directories) for more details.) The default value is `cache`, which resolved relative to the default project config location results in `.dvc/cache`. > See also helper command `dvc cache dir` that properly transform paths - relative to the present working directory into relative to the project config - file. + > relative to the present working directory into relative to the project + > config file. - `cache.protected` - makes files in the workspace read-only. Possible values are `true` or `false` (default). Run `dvc checkout` for the change go into @@ -103,15 +103,14 @@ files](/doc/user-guide/dvc-files-and-directories) for more details.) 
- `cache.type` - link type that DVC should use to link data files from cache to your workspace. Possible values: `reflink`, `symlink`, `hardlink`, `copy` or a - combination of those, separated by commas e.g: `reflink,hardlink,copy`. By default, DVC will try `reflink,copy` link types in order to choose the most effective of those two. DVC avoids `symlink` and `hardlink` types by default to protect user from accidental cache and repository corruption. > **Note!** If you manually set `cache.type` to `hardlink` or `symlink`, **you - will corrupt the cache** if you edit data files in the workspace. See the - `cache.protected` config option above and corresponding `dvc unprotect` - command to modify files safely. + > will corrupt the cache** if you modify tracked data files in the workspace. + > See the `cache.protected` config option above and corresponding + > `dvc unprotect` command to modify files safely. There are pros and cons to different link types. Refer to [Cache File Linking](/docs/user-guide/cache-file-linking) for a full explanation of each one. diff --git a/static/docs/get-started/add-files.md b/static/docs/get-started/add-files.md index 111e67d49f..3a1225e2b1 100644 --- a/static/docs/get-started/add-files.md +++ b/static/docs/get-started/add-files.md @@ -40,10 +40,11 @@ $ git commit -m "add source data to DVC" ### Expand to learn about DVC internals -You can see that actual data file has been moved to the `.dvc/cache` directory -(ideally with reflinks if available on the system, otherwise by file copy – see -[Cache File Linking](/docs/user-guide/cache-file-linking) to learn about all the -supported file linking options, their tradeoffs, and how to enable them). +You can see that actual data file has been moved to the `.dvc/cache` directory, +while the entries in the working directory may be links to the actual files in +the DVC cache. 
(See [Cache File Linking](/docs/user-guide/cache-file-linking) to +learn about the supported file linking options, their tradeoffs, and how to +enable them). ```dvc $ ls -R .dvc/cache @@ -57,12 +58,32 @@ hash inside. +
+ +### Expand for an important note on cache performance + +DVC tries to use reflinks\* by default to link your data files from the DVC +cache to the workspace, optimizing speed and storage space. However, reflinks +are not widely supported yet and DVC falls back to actually copying data files +to/from the cache **which can be very slow with large files**, and duplicates +storage requirements. + +Hardlinks and symlinks are also available for optimized cache linking but +(unlike reflinks) they carry the risk of accidentally corrupting the cache if +tracked data files are modified in the workspace. + +See [Cache File Linking](/docs/user-guide/cache-file-linking) and +`dvc config cache` for more information. + +> \***copy-on-write links or "reflinks"** are a relatively new way to link files +> in UNIX-style file systems. Unlike hardlinks or symlinks, they support +> transparent [copy on write](https://en.wikipedia.org/wiki/Copy-on-write). This +> means that editing a reflinked file is always safe, as the other links to the +> file will not reflect the changes. + +
+ Refer to [Data and Model Files Versioning](/doc/use-cases/data-and-model-files-versioning), `dvc add`, and `dvc run` for more information on storing and versioning data files with DVC. - -Note that to modify or replace a data file that is under DVC control you may -need to run `dvc unprotect` or `dvc remove` first (check the -[Update Tracked File](/doc/user-guide/update-tracked-file) guide). Use -`dvc move` to rename or move a data file that is under DVC control. diff --git a/static/docs/get-started/example-versioning.md b/static/docs/get-started/example-versioning.md index ebaf368b24..58da3085a2 100644 --- a/static/docs/get-started/example-versioning.md +++ b/static/docs/get-started/example-versioning.md @@ -227,11 +227,6 @@ $ python train.py $ dvc add model.h5 ``` -> **Note!** `dvc remove` or `dvc unprotect` is required, otherwise `python -train.py` will overwrite the existing file and may corrupt the cached version. -Refer to the [Update a Tracked File](/doc/user-guide/update-tracked-file) guide -to learn more. - Let's commit the second version: ```dvc diff --git a/static/docs/user-guide/cache-file-linking.md b/static/docs/user-guide/cache-file-linking.md index 5ac58ce90f..eb68ad87a1 100644 --- a/static/docs/user-guide/cache-file-linking.md +++ b/static/docs/user-guide/cache-file-linking.md @@ -42,18 +42,18 @@ inefficient for large files (not recommended for GBs of data). Each file linking option is further detailed below, in order of efficiency: -1. **`reflink`** - copy-on-write links or "reflinks" are the best possible link - type, when available. They're is as fast as hard/symlinks, but don't carry a - risk of cache corruption since file system takes care of copying the file if - you try to edit it in place, thus keeping a linked cache file intact. +1. **`reflink`** - copy-on-write\* links or "reflinks" are the best possible + link type, when available. 
They're as fast as hard/symlinks, but don't + carry a risk of cache corruption since the file system takes care of copying + the file if you try to edit it in place, thus keeping a linked cache file + intact. Unfortunately reflinks are currently supported on a limited number of file systems only (Linux: Btrfs, XFS, OCFS2; MacOS: APFS), but in the future they will be supported by the majority of file systems in use. 2. **`hardlink`** - hard links are the most efficient way to link your data to cache if both your repo and your cache directory are located on the same file - system/drive. - + system/drive. > Please note that hardlinked data files should never be edited in place, but > instead deleted and then replaced with a new file, otherwise it might cause > cache corruption and automatic deletion of cached files by DVC. @@ -61,8 +61,7 @@ Each file linking option is further detailed below, in order of efficiency: 3. **`symlink`** - symbolic (aka "soft") links are the most efficient way to link your data to cache if your repo and your cache directory are located on different file systems/drives (i.e. repo is located on SSD for performance, - but cache dir is located on HDD for bigger storage). - + but cache dir is located on HDD for bigger storage). > Please note that symlinked data files should never be edited in place, but > instead deleted and then replaced with a new file, otherwise it might cause > cache corruption and automatic deletion of cached files by DVC. @@ -97,3 +96,9 @@ Setting `cache.protected` is important with `hardlink` and/or `symlink` cache file link types. Please refer to the [Update a Tracked File](/docs/user-guide/update-tracked-file) to how to manage tracked files under such a cache configuration. + +> \***copy-on-write links or "reflinks"** are a relatively new way to link files +> in UNIX-style file systems. Unlike hardlinks or symlinks, they support +> transparent [copy on write](https://en.wikipedia.org/wiki/Copy-on-write). 
This +> means that editing a reflinked file is always safe, as the other links to the +> file will not reflect the changes. From a5f3004ac2dc9a483e918c03c15649cdafb66d67 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Tue, 21 May 2019 00:49:50 -0500 Subject: [PATCH 5/5] Expands /doc/user-guide/cache-file-linking to full optimization guide Per https://github.com/iterative/dvc.org/pull/345#pullrequestreview-239798153 Fixes #265 --- src/Documentation/sidebar.json | 2 +- static/docs/user-guide/cache-file-linking.md | 71 +++++++++++--------- 2 files changed, 42 insertions(+), 31 deletions(-) diff --git a/src/Documentation/sidebar.json b/src/Documentation/sidebar.json index e39c14888a..92dc5bd436 100644 --- a/src/Documentation/sidebar.json +++ b/src/Documentation/sidebar.json @@ -66,7 +66,7 @@ "labels": { "dvc-files-and-directories.md": "Files and Directories", "dvc-file-format.md": "File Format (.dvc)", - "cache-file-linking.md": "Cache File Linking", + "cache-file-linking.md": "Large File Optimization", "development.md": "Development Version", "external-dependencies.md": "External Dependencies", "external-outputs.md": "External Outputs", diff --git a/static/docs/user-guide/cache-file-linking.md b/static/docs/user-guide/cache-file-linking.md index eb68ad87a1..80fd9cc247 100644 --- a/static/docs/user-guide/cache-file-linking.md +++ b/static/docs/user-guide/cache-file-linking.md @@ -1,24 +1,34 @@ -# File Link Types for the DVC Cache +# Performance Optimization for Large Files -File links are entries in the file system that don't necessarily hold the file -contents, but which point to where the file is actually stored. DVC can use file -links in your project's workspace instead of copies to avoid duplicating -contents that will exist in the DVC cache. - -The DVC cache is a hidden storage (by default located in the `.dvc/cache` -directory) for files that are under DVC control, and their different versions. 
-(See `dvc cache` and +In order to track the data files added with `dvc add` or `dvc run`, DVC moves +all these files to a special cache directory. The DVC cache is a hidden storage +(by default located in `.dvc/cache`) for files that are under DVC control, and +their different versions. (See `dvc cache` and [DVC internal files](/doc/user-guide/dvc-files-and-directories) for more details.) -File links are more common in file systems used with UNIX-like operating systems -and come in different flavors that have to do with how they connect filenames to -inodes in the system. Inodes are metadata file records to locate and manage -permissions to the actual file contents. Hard links, Soft or Symbolic links, and -Reflinks (in more recent systems) are the types of file links that DVC leverages -for performance. +However, the versions of the tracked files +[corresponding to the current code](/doc/get-started/connect-code-and-data) +branch are also needed in the workspace, so a subset of the cache files must be +kept in the working directory at all times (see `dvc checkout`). Does this mean +that some files will be duplicated between the workspace and the cache? +**That would not be efficient!** Especially with large files (several Gigabytes +or larger). + +In order to have the files present in both directories without duplication, DVC +can automatically create **file links** in the workspace that "point" to the +data in cache. + +## File link types for the DVC cache -> See **Linking files** in +File links are entries in the file system that don't necessarily hold the file +contents, but which point to where the file is actually stored. File links are +more common in file systems used with UNIX-like operating systems and come in +different kinds, that differ in how they connect filenames to inodes in the +system. + +> **Inodes** are metadata file records to locate and store permissions to the +> actual file contents. 
See **Linking files** in > [this doc](http://www.tldp.org/LDP/intro-linux/html/sect_03_03.html) for > technical details on Linux. > Some versions of Windows (e.g. Windows Server 2012+ and Windows 10 Enterprise) @@ -28,17 +38,16 @@ for performance. > [ReFS](https://docs.microsoft.com/en-us/windows-server/storage/refs/refs-overview) > file systems. -## Supported file links types and their tradeoffs - -There are pros and cons to the 3 supported link types (`reflink`, `hardlink`, -`symlink` or soft link). While reflinks have all the benefits and none of the -worries, they're not commonly supported in most platforms yet. Hard/soft links -also optimize speed and space in the file system, but carry the risk of breaking -your workflow, since updating tracked files in the workspace causes data -corruption. These 2 link types thus require using cache protected mode (see the -`cache.protected` config option in `dvc config cache`). Finally, a 4th "linking" -option is to actually `copy` files from/to the cache, which is safe but quite -inefficient for large files (not recommended for GBs of data). +There are pros and cons to the 3 supported link types: Hard links, Soft or +Symbolic links, and Reflinks\* in more recent systems. While reflinks bring all +the benefits and none of the worries, they're not commonly supported in most +platforms yet. Hard/soft links also optimize speed and space in the file system, +but carry the risk of breaking your workflow, since updating tracked files in +the workspace causes data corruption. These 2 link types thus require using +cache protected mode (see the `cache.protected` config option in +`dvc config cache`). Finally, a 4th "linking" option is to actually `copy` files +from/to the cache, which is safe but quite inefficient for large files (not +recommended for GBs of data). 
Each file linking option is further detailed below, in order of efficiency: @@ -78,7 +87,7 @@ Each file linking option is further detailed below, in order of efficiency: > [Update a Tracked File](/doc/user-guide/update-tracked-file) guide to learn > more. -## Configuring DVC cache file link type +### Configuring DVC cache file link type By default DVC tries to use reflinks if available on your system, however this is not the most common case at this time, so it falls back to the copying @@ -94,8 +103,10 @@ $ dvc config cache.protected true Setting `cache.protected` is important with `hardlink` and/or `symlink` cache file link types. Please refer to the -[Update a Tracked File](/docs/user-guide/update-tracked-file) to how to manage -tracked files under such a cache configuration. +[Update a Tracked File](/docs/user-guide/update-tracked-file) on how to manage +tracked files under these cache configurations. + +--- > \***copy-on-write links or "reflinks"** are a relatively new way to link files > in UNIX-style file systems. Unlike hardlinks or symlinks, they support