cache: centralize concept explanation in tooltip
jorgeorpinel committed Jan 4, 2021
1 parent e6858cc commit 193e043
Showing 6 changed files with 29 additions and 37 deletions.
8 changes: 4 additions & 4 deletions content/docs/command-reference/add.md
@@ -76,10 +76,10 @@ A `dvc add` target can be either a file or a directory. In the latter case, a
`.dvc` file is created for the top of the hierarchy (with default name
`<dir_name>.dvc`).

Every file inside is stored in the cache (unless the `--no-commit` option is
used), but DVC does not produce individual `.dvc` files for each file in the
entire tree. Instead, the single `.dvc` file references a special JSON file in
the cache (with `.dir` extension), that in turn points to the added files.
Every file in the dir is cached normally (unless the `--no-commit` option is
used), but DVC does not produce individual `.dvc` files for each one. Instead,
the single `.dvc` file references a special JSON file in the cache (with `.dir`
extension), that in turn points to the added files.

> Refer to
> [Structure of cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-the-cache-directory)
11 changes: 4 additions & 7 deletions content/docs/command-reference/cache/index.md
@@ -15,15 +15,12 @@ positional arguments:

## Description

The DVC Cache is where your data files, models, etc. (anything you want to
version with DVC) are actually stored. The data files and directories visible in
the <abbr>workspace</abbr> are links\* to (or copies of) the ones in cache.
Learn more about its
[structure](/doc/user-guide/dvc-files-and-directories#structure-of-the-cache-directory).
Tracked files and directories visible in the <abbr>workspace</abbr> are links\*
to the ones in the project's <abbr>cache</abbr>.

> \* Refer to
> \* Or copies. Refer to
> [File link types](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache)
> for more information on file links on different platforms.
> for more information on supported linking on different platforms.
For cache configuration options, refer to `dvc config cache`.

7 changes: 2 additions & 5 deletions content/docs/command-reference/config.md
@@ -131,11 +131,8 @@ remote. See `dvc remote` for more information.

### cache

A DVC project <abbr>cache</abbr> is the hidden storage (by default located in
the `.dvc/cache` directory) for files that are tracked by DVC, and their
different versions. (See `dvc cache` and
[DVC Files and Directories](/doc/user-guide/dvc-files-and-directories#structure-of-the-cache-directory)
for more details.) This section contains the following options:
This section contains the following options, which affect the project's
<abbr>cache</abbr>:

- `cache.dir` - set/unset cache directory location. A correct value is either an
absolute path, or a path **relative to the config file location**. The default
6 changes: 3 additions & 3 deletions content/docs/user-guide/basic-concepts/dvc-cache.md
@@ -3,7 +3,7 @@ name: 'DVC Cache'
match: ['DVC cache', cache, caches, cached, 'cache directory']
---

The DVC cache is a hidden storage (by default located in the `.dvc/cache`
directory) for files that are tracked by DVC, and their different versions.
Learn more about its
The DVC cache is a hidden storage (by default in `.dvc/cache`) for files and
directories tracked by DVC, and their different versions. Data is cached into a
special flattened
[structure](/doc/user-guide/dvc-files-and-directories#structure-of-the-cache-directory).
25 changes: 14 additions & 11 deletions content/docs/user-guide/dvc-files-and-directories.md
@@ -256,7 +256,7 @@ Full <abbr>parameters</abbr> (key and value) are listed separately under
- `.dvc/cache`: The <abbr>cache</abbr> directory will store your data in a
special [structure](#structure-of-the-cache-directory). The data files and
directories in the <abbr>workspace</abbr> will only contain links to the data
files in the cache. (Refer to
files in the cache (refer to
[Large Dataset Optimization](/doc/user-guide/large-dataset-optimization)). See
`dvc config cache` for related configuration options.

@@ -297,13 +297,17 @@ Full <abbr>parameters</abbr> (key and value) are listed separately under

## Structure of the cache directory

The DVC cache is a
The DVC cache is a hidden
[content-addressable storage](https://en.wikipedia.org/wiki/Content-addressable_storage)
(by default in `.dvc/cache`), which adds a layer of indirection between code and
(by default in `.dvc/cache`). It adds a layer of indirection between code and
data.

There are two ways in which the data is <abbr>cached</abbr>: As a single file
(eg. `data.csv`), or as a directory.
There are two ways in which the data is <abbr>cached</abbr>, depending on
whether it's a single file, or a directory (which may contain multiple files).

Note that files are renamed and reorganized, and directory trees are flattened,
in the cache, which always has exactly one depth level with 2-character
directories (based on hashes of the data contents, as explained next).
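
As a rough illustration of the flattened layout described above, the following Python sketch derives a cache location from an MD5 hash by splitting off its first two characters as the directory name. The function names and the `suffix` parameter are hypothetical and are not part of DVC's API; hashing the raw bytes with MD5 is an assumption based on the `md5` fields shown in this document, and DVC's actual hashing may differ in detail.

```python
import hashlib


def cache_path_for_hash(md5_hex: str, cache_dir: str = ".dvc/cache", suffix: str = "") -> str:
    """Map a hex digest to <cache_dir>/<first 2 chars>/<remaining chars><suffix>."""
    return f"{cache_dir}/{md5_hex[:2]}/{md5_hex[2:]}{suffix}"


def cache_path_for_content(content: bytes, cache_dir: str = ".dvc/cache") -> str:
    """Hash raw bytes with MD5 (an assumption for this sketch) and map to a cache path."""
    digest = hashlib.md5(content).hexdigest()
    return cache_path_for_hash(digest, cache_dir)


if __name__ == "__main__":
    # Directory entries carry the .dir extension, as in the example cache listing.
    print(cache_path_for_hash("196a322c107c2572335158503c64bfba", suffix=".dir"))
    print(cache_path_for_content(b"example data"))
```

Because the location depends only on the content hash, two files with identical contents would map to the same cache entry regardless of their names or positions in the workspace.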

### Files

@@ -331,9 +335,7 @@ data/images/
$ dvc add data/images
```

The directory is cached as a JSON file with `.dir` extension. The files it
contains are stored in the cache regularly, as explained earlier. It looks like
this:
The resulting cache dir looks like this:

```dvc
.dvc/cache/
@@ -345,13 +347,14 @@
    └── 0b40427ee0998e9802335d98f08cd98f
```

The `.dir` file contains the mapping of files in `data/images` (as a JSON
array), including their hash values:
The files in the directory are cached normally. The directory itself gets a
similar entry, with the `.dir` extension. It contains the mapping of files
inside (as a JSON array), identified by their hash values:

```dvc
$ cat .dvc/cache/19/6a322c107c2572335158503c64bfba.dir
[{"md5": "dff70c0392d7d386c39a23c64fcc0376", "relpath": "cat.jpeg"},
{"md5": "29a6c8271c0c8fbf75d3b97aecee589f", "relpath": "index.jpeg"}]
```

That's how DVC knows that the other two cached files belong in the directory.
That's how DVC knows that those two cached files belong in the directory.
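
To make the lookup concrete, here is a minimal Python sketch that reads a `.dir` listing like the one above and maps each `relpath` to where the corresponding file would sit in the cache. The helper names are hypothetical (this is not DVC's own code), and the 2-character directory split is taken from the cache layout shown in this document.

```python
import json

# The .dir contents shown in the example above.
DIR_JSON = """[{"md5": "dff70c0392d7d386c39a23c64fcc0376", "relpath": "cat.jpeg"},
{"md5": "29a6c8271c0c8fbf75d3b97aecee589f", "relpath": "index.jpeg"}]"""


def cache_path_for_hash(md5_hex: str, cache_dir: str = ".dvc/cache") -> str:
    """Split the hash into a 2-character directory and the remaining file name."""
    return f"{cache_dir}/{md5_hex[:2]}/{md5_hex[2:]}"


def resolve_dir_entries(dir_json: str, cache_dir: str = ".dvc/cache") -> dict:
    """Return a mapping of workspace-relative paths to cache paths."""
    entries = json.loads(dir_json)
    return {e["relpath"]: cache_path_for_hash(e["md5"], cache_dir) for e in entries}


if __name__ == "__main__":
    for relpath, path in resolve_dir_entries(DIR_JSON).items():
        print(f"{relpath} -> {path}")
```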
9 changes: 2 additions & 7 deletions content/docs/user-guide/large-dataset-optimization.md
@@ -1,12 +1,7 @@
# Large Dataset Optimization

In order to track the data files and directories added with `dvc add` or
`dvc run`, DVC moves all these files to the <abbr>cache</abbr>. A
<abbr>project</abbr>'s cache is the hidden storage (by default located in
`.dvc/cache`) for files that are tracked by DVC, and their different versions.
(See `dvc cache` and
[DVC Files and Directories](/doc/user-guide/dvc-files-and-directories) for more
details.)
In order to track the data files and directories added with `dvc add`,
`dvc repro`, etc., DVC moves all these files to the project's <abbr>cache</abbr>.

However, the versions of the tracked files that
[match the current code](/doc/tutorials/get-started/data-pipelines) are also