Commit

Merge pull request #2475 from iterative/cases/shared-dev/external-cache
guide: external cache info (extracted from Use Cases)
jorgeorpinel authored May 26, 2021

2 parents b14a5b8 + 11f5986 commit 27e3143
Showing 7 changed files with 59 additions and 38 deletions.
2 changes: 1 addition & 1 deletion content/docs/command-reference/add.md
@@ -341,7 +341,7 @@ Only the hash values of the `dir/` directory (with `.dir` file extension) and
When you have a large dataset in an external location, you may want to add it to
the <abbr>project</abbr> without having to copy it into the workspace. Maybe
your local disk doesn't have enough space, but you have setup an
[external cache](/doc/use-cases/shared-development-server#configure-the-external-shared-cache)
[external cache](/doc/user-guide/managing-external-data#setting-up-an-external-cache)
that could handle it.

The `--out` option lets you add external paths in a way that they are
8 changes: 4 additions & 4 deletions content/docs/command-reference/config.md
@@ -197,10 +197,10 @@ This section contains the following options, which affect the project's
[`os.umask`](https://docs.python.org/3/library/os.html#os.umask).

The following parameters allow setting an
[external cache](/doc/user-guide/managing-external-data#examples) location. A
[DVC remote](/doc/command-reference/remote) name is used (instead of the URL)
because often it's necessary to configure authentication or other connection
settings, and configuring a remote is the way that can be done.
[external cache](/doc/user-guide/managing-external-data#setting-up-an-external-cache)
location. A [DVC remote](/doc/command-reference/remote) name is used (instead of
the URL) because often it's necessary to configure authentication or other
connection settings, and configuring a remote is the way that can be done.

- `cache.local` - name of a _local remote_ to use as external cache (refer to
`dvc remote` for more info. on "local remotes".) This will overwrite the value
23 changes: 11 additions & 12 deletions content/docs/command-reference/destroy.md
@@ -57,27 +57,26 @@ $ ls -a
.git code.py foo
```

## Example: External cache directory
## Example: Preserve an external cache directory

By default, the <abbr>cache</abbr> location is `.dvc/cache`. Let's change the
cache location to `/mnt/cache` and then execute `dvc destroy` command:
By default, the <abbr>cache</abbr> location is `.dvc/cache`. Let's change its
location to `/mnt/cache` using `dvc cache dir`, add some data, and then try
`dvc destroy`:

```dvc
$ dvc init
$ echo foo > foo
$ dvc cache dir /mnt/cache
$ echo foo > foo
$ dvc add foo
```

`dvc cache dir` changed the location of the cache directory to an external
location. Contents of the <abbr>project</abbr>:
Contents of the <abbr>workspace</abbr>:

```dvc
$ ls -a
.dvc .git code.py foo foo.dvc
```

Contents of the external `/mnt/cache` directory:
Contents of the (external) cache (`b1/946a...` contains `foo`):

```dvc
$ tree /mnt/cache
@@ -86,7 +85,7 @@ $ tree /mnt/cache
└── 946ac92492d2347c6235b4d2611184
```
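The `<2 characters>/<30 characters>` layout in the listing above comes from splitting a file's content hash. Here is a minimal sketch of that naming scheme, assuming GNU `md5sum` is available and that the cache uses MD5 content hashes. It only mimics the path layout and is not DVC's implementation. The temporary file and variable names are hypothetical.

```shell
# Sketch only: mimics the <2 hex chars>/<30 hex chars> naming seen in the
# cache listing. This is not how DVC itself is implemented.
tmp_file="$(mktemp)"            # hypothetical stand-in for a tracked file
echo foo > "$tmp_file"

# md5sum prints "<hash>  <filename>"; keep only the 32-character hash.
hash="$(md5sum "$tmp_file" | cut -d' ' -f1)"

# The first two characters become the directory, the remaining 30 the file name.
cache_path="$(printf '%s' "$hash" | cut -c1-2)/$(printf '%s' "$hash" | cut -c3-)"
echo "$cache_path"

rm -f "$tmp_file"
```

Because the path depends only on file content, identical files map to the same cache entry regardless of where the cache directory lives.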

Let's execute `dvc destroy`:
OK, let's destroy the <abbr>DVC project</abbr>:

```dvc
$ dvc destroy
@@ -99,9 +98,9 @@ $ ls -a
.git code.py foo
```

`dvc destroy` removed `foo.dvc` and the internal `.dvc/` directory from
<abbr>project</abbr>. But the cache files that are present in `/mnt/cache`
persist:
`foo.dvc` and the internal `.dvc/` directory were removed (this would include
any cached data prior to changing the cache location). But the cache files in
`/mnt/cache` persist:

```dvc
$ tree /mnt/cache
7 changes: 4 additions & 3 deletions content/docs/start/data-and-model-versioning.md
@@ -250,9 +250,10 @@ In cases where you process very large datasets, you need an efficient mechanism
versions. Do you use network attached storage (NAS)? Or a large external volume?
You can learn more about advanced workflows using these links:

- A shared [external cache](/doc/use-cases/shared-development-server) can be set
up to store, version and access a lot of data on a large shared volume
efficiently.
- A
[shared cache](/doc/use-cases/shared-development-server#configure-the-shared-cache)
can be set up to store, version and access a lot of data on a large shared
volume efficiently.
- A quite advanced scenario is to track and version data directly on the remote
storage (e.g. S3). See
[Managing External Data](https://dvc.org/doc/user-guide/managing-external-data)
16 changes: 7 additions & 9 deletions content/docs/use-cases/shared-development-server.md
@@ -46,18 +46,16 @@ $ sudo find /home/shared/dvc-cache -type f -exec chmod 0444 {} \;
$ sudo chown -R myuser:ourgroup /home/shared/dvc-cache/
```

## Configure the external shared cache
## Configure the shared cache

Tell DVC to use the directory we've set up above as the <abbr>cache</abbr> for
your <abbr>project</abbr>:
A <abbr>cache</abbr> directory outside the <abbr>workspace</abbr> is called an
[external cache](/doc/user-guide/managing-external-data#setting-up-an-external-cache).
Set it to the directory we created earlier with `dvc cache dir` and configure it
with `dvc config cache`:

```dvc
$ dvc cache dir /home/shared/dvc-cache
```

External cache configuration:
```dvc
$ dvc config cache.shared group
$ dvc config cache.type symlink
```
@@ -72,8 +70,8 @@ enable symlinks to avoid having copies from the external cache to the
⚠️ Note that enabling soft/hard links causes DVC to protect the linked data,
because editing them in-place would corrupt the cache. See `dvc unprotect`.

If you're using Git, commit changes to your project's config file (`.dvc/config`
by default):
If you're using Git, commit the changes to your project's config file (usually
`.dvc/config`):

```dvc
$ git add .dvc/config
2 changes: 1 addition & 1 deletion content/docs/user-guide/external-dependencies.md
@@ -6,7 +6,7 @@ For example data on a network attached storage (NAS), processing data on HDFS,
running [Dask](https://dask.org/) via SSH, or for a script that streams data
from S3 to process it.

External dependencies and
_External dependencies_ and
[external outputs](/doc/user-guide/managing-external-data) provide ways to track
and version data outside of the <abbr>project</abbr>.

39 changes: 31 additions & 8 deletions content/docs/user-guide/managing-external-data.md
@@ -1,4 +1,4 @@
# External Outputs
# Managing External Data

> ⚠️ This is an advanced feature for very specific situations and not
> recommended except if there's absolutely no other alternative. In most cases
@@ -15,14 +15,13 @@ versioning existing data on a network attached storage (NAS), processing data on
HDFS, running [Dask](https://dask.org/) via SSH, or any code that generates
massive files directly to the cloud.

External outputs (and
_External outputs_ (and
[external dependencies](/doc/user-guide/external-dependencies)) provide ways to
track and version data outside of the <abbr>project</abbr>.

## How external outputs work

External <abbr>outputs</abbr> are considered part of the (extended)
<abbr>workspace</abbr>: DVC will track them for
External <abbr>outputs</abbr> will be tracked by DVC for
[versioning](/doc/use-cases/versioning-data-and-model-files), detecting when
they change (reported by `dvc status`, for example).

@@ -36,17 +35,41 @@ their remote URLs or external paths to `dvc add`, or put them in `dvc.yaml`
- HDFS
- Local files and directories outside the workspace

⚠️ External outputs require an
[external cache](/doc/use-cases/shared-development-server#configure-the-external-shared-cache)
in the same external/remote file.

> Avoid using the same DVC remote used for `dvc push`, `dvc pull`, etc. as
> external cache, because it may cause data collisions: the hash of an external
> output could collide with that of a local file with different content.
> Note that [remote storage](/doc/command-reference/remote) is a different
> feature.
## Setting up an external cache

DVC requires that the project's <abbr>cache</abbr> is configured in the same
external location as the data that will be tracked (external outputs). This
avoids transferring files to the local environment and enables
[file linking](/doc/user-guide/large-dataset-optimization) within the external
storage.

As an example, let's create a directory external to the workspace and set it up
as cache:

```dvc
$ mkdir -p /home/shared/dvcstore
$ dvc cache dir /home/shared/dvcstore
```

> See `dvc cache dir` and `dvc config cache` for more information.
💡 Note that in real-life scenarios, often the directory will be in a remote
location, e.g. `s3://mybucket/cache` or `ssh://user@example.com/cache` (see the
examples below).

> ⚠️ An external cache could be
> [shared](/doc/use-cases/shared-development-server) among copies of a DVC
> project. Please **do not** use external outputs in that scenario, as
> `dvc checkout` in any project would overwrite the working data for all
> projects.
## Examples

Let's take a look at the following operations on all the supported location
