dvc: still pending 1.x updates (2) #2075

Merged
merged 8 commits into from
Jan 7, 2021
8 changes: 4 additions & 4 deletions content/docs/command-reference/add.md
@@ -76,10 +76,10 @@ A `dvc add` target can be either a file or a directory. In the latter case, a
`.dvc` file is created for the top of the hierarchy (with default name
`<dir_name>.dvc`).

Every file inside is stored in the cache (unless the `--no-commit` option is
used), but DVC does not produce individual `.dvc` files for each file in the
entire tree. Instead, the single `.dvc` file references a special JSON file in
the cache (with `.dir` extension), that in turn points to the added files.
Every file in the dir is cached normally (unless the `--no-commit` option is
used), but DVC does not produce individual `.dvc` files for each one. Instead,
the single `.dvc` file references a special JSON file in the cache (with `.dir`
extension), that in turn points to the added files.

> Refer to
> [Structure of cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-the-cache-directory)
11 changes: 4 additions & 7 deletions content/docs/command-reference/cache/index.md
@@ -15,15 +15,12 @@ positional arguments:

## Description

The DVC Cache is where your data files, models, etc. (anything you want to
version with DVC) are actually stored. The data files and directories visible in
the <abbr>workspace</abbr> are links\* to (or copies of) the ones in cache.
Learn more about its
[structure](/doc/user-guide/dvc-files-and-directories#structure-of-the-cache-directory).
Tracked files and directories visible in the <abbr>workspace</abbr> are links\*
to the ones in the project's <abbr>cache</abbr>.

> \* Refer to
> \* Or copies. Refer to
> [File link types](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache)
> for more information on file links on different platforms.
> for more information on supported linking on different platforms.

For cache configuration options, refer to `dvc config cache`.

7 changes: 2 additions & 5 deletions content/docs/command-reference/config.md
@@ -131,11 +131,8 @@ remote. See `dvc remote` for more information.

### cache

A DVC project <abbr>cache</abbr> is the hidden storage (by default located in
the `.dvc/cache` directory) for files that are tracked by DVC, and their
different versions. (See `dvc cache` and
[DVC Files and Directories](/doc/user-guide/dvc-files-and-directories#structure-of-the-cache-directory)
for more details.) This section contains the following options:
This section contains the following options, which affect the project's
<abbr>cache</abbr>:

- `cache.dir` - set/unset cache directory location. A correct value is either an
absolute path, or a path **relative to the config file location**. The default
28 changes: 28 additions & 0 deletions content/docs/command-reference/list.md
@@ -125,3 +125,31 @@ images/dvc-logo-outlines.png.dvc
images/owl_sticker.png
...
```

## Example: Create an archive of your DVC project

Just like you can use `git archive` to make a quick bundle (ZIP) file of the
current code, `dvc list` can be easily complemented with simple archive tools to
bundle the current data files in the project.

For example, here's a TAR archive of the entire <abbr>workspace</abbr>
(Linux/GNU):

```dvc
$ dvc list . -R | tar -cvf project.tar -T -
```

Or separate ZIP archives of code and DVC-tracked data (POSIX terminal with
`zip`):

```dvc
$ git archive -o code.zip HEAD
$ dvc list . -R --dvc-only | zip -@ data.zip
```

ZIP alternative for [POSIX on Windows](/doc/user-guide/running-dvc-on-windows)
(Python installed):

```dvc
$ dvc list . -R --dvc-only | xargs python -m zipfile -c data.zip
```
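
Where neither `zip` nor `xargs` is available, the same bundling can be sketched with Python's standard library alone. This is only an illustration, not part of DVC: `zip_paths` is a hypothetical helper, and it assumes the paths arrive one per line, as in the `dvc list` pipelines above.

```python
import zipfile
from pathlib import Path


def zip_paths(lines, archive):
    """Write every existing file named in `lines` (one path per line) to `archive`.

    Hypothetical helper for illustration; not a DVC API.
    Returns the list of paths that were actually added.
    """
    added = []
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for line in lines:
            path = line.strip()
            # Skip blank lines and entries that are not regular files
            # (for example, tracked data that is not checked out).
            if path and Path(path).is_file():
                zf.write(path)
                added.append(path)
    return added
```

Saved in a script that calls `zip_paths(sys.stdin, "data.zip")`, it could take the place of the `xargs` step above.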
4 changes: 2 additions & 2 deletions content/docs/command-reference/metrics/show.md
@@ -5,8 +5,8 @@ Print [metrics](/doc/command-reference/metrics), with optional formatting.
## Synopsis

```usage
usage: dvc metrics show [-h] [-q | -v] [-a] [-T] [--all-commits] [-R]
[--show-json]
usage: dvc metrics show [-h] [-q | -v] [-a] [-T] [--all-commits]
[--show-json] [-R]
[targets [targets ...]]

positional arguments:
10 changes: 5 additions & 5 deletions content/docs/command-reference/params/index.md
@@ -87,7 +87,7 @@ Define a [stage](/doc/command-reference/run) that depends on params `lr`,
specify `layers` and `epochs` from the `train` group:

```dvc
$ dvc run -n train -d users.csv -o model.pkl \
$ dvc run -n train -d train.py -d users.csv -o model.pkl \
-p lr,train.epochs,train.layers \
python train.py
```
@@ -130,7 +130,7 @@ Alternatively, the entire group of parameters `train` can be referenced, instead
of specifying each of the group parameters separately:

```dvc
$ dvc run -n train -d users.csv -o model.pkl \
$ dvc run -n train -d train.py -d users.csv -o model.pkl \
-p lr,train \
python train.py
```
@@ -139,7 +139,7 @@ In the examples above, the default parameters file name `params.yaml` was used.
This file name can be redefined with a prefix in the `-p` argument:

```dvc
$ dvc run -n train -d logs/ -o users.csv \
$ dvc run -n train -d train.py -d logs/ -o users.csv -f \
-p parse_params.yaml:threshold,classes_num \
python train.py
```
@@ -182,7 +182,7 @@ The following [stage](/doc/command-reference/run) depends on params `BOOL`,
`INT`, as well as `TrainConfig`'s `EPOCHS` and `layers`:

```dvc
$ dvc run -n train -d users.csv -o model.pkl \
$ dvc run -n train -d train.py -d users.csv -o model.pkl \
-p params.py:BOOL,INT,TrainConfig.EPOCHS,TrainConfig.layers \
python train.py
```
@@ -227,7 +227,7 @@ can be referenced
supported), instead of the parameters in it:

```dvc
$ dvc run -n train -d users.csv -o model.pkl \
$ dvc run -n train -d train.py -d users.csv -o model.pkl \
-p params.py:BOOL,INT,TestConfig \
python train.py
```
8 changes: 4 additions & 4 deletions content/docs/command-reference/repro.md
@@ -24,11 +24,11 @@ positional arguments:

`dvc repro` provides a way to regenerate data pipeline results, by restoring the
dependency graph (a [DAG](https://en.wikipedia.org/wiki/Directed_acyclic_graph))
implicitly defined by the stages listed in `dvc.yaml`. The commands defined in
these stages are then executed in the correct order, reproducing pipeline
results.
implicitly defined by the stages listed in `dvc.yaml` files. The commands
defined in these stages are then executed in the correct order, reproducing
pipeline results.

> Pipeline stages are defined in a `dvc.yaml` file (either manually or by using
> Pipeline stages are defined in `dvc.yaml` (either manually or by using
> `dvc run`) while initial data dependencies can be registered with `dvc add`.
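
The "correct order" mentioned above is a topological order of the stage graph. A minimal Python sketch of the idea follows; it is not DVC's implementation. The stage names and files (`printer`, `scanner`, `pages`) are borrowed from the `dvc run` examples, while the `stages` dictionary layout and the `execution_order` helper are assumptions made for this illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical in-memory view of two stages (not the real dvc.yaml schema).
stages = {
    "scanner": {"deps": ["read.sh", "pages"], "outs": ["signed.pdf"]},
    "printer": {"deps": ["write.sh"], "outs": ["pages"]},
}


def execution_order(stages):
    """Return stage names so that each stage comes after the stages it depends on."""
    # Map every output to the stage that produces it.
    producer = {out: name for name, s in stages.items() for out in s["outs"]}
    # A stage depends on whichever stage produces one of its dependencies.
    graph = {
        name: {producer[dep] for dep in s["deps"] if dep in producer}
        for name, s in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


print(execution_order(stages))  # ['printer', 'scanner']
```

`scanner` consumes `pages`, which `printer` produces, so `printer` has to run first regardless of the order in which the stages are listed.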

This command is similar to [Make](https://www.gnu.org/software/make/) in
5 changes: 4 additions & 1 deletion content/docs/command-reference/root.md
@@ -27,11 +27,14 @@ Use this command to build fixed paths to dependencies, files, or stage

- `-v`, `--verbose` - displays detailed tracing information.

## Example: Basic output
## Examples

Basic demonstration:

```dvc
$ dvc root
.

$ mkdir subdir
$ cd subdir
$ dvc root
20 changes: 10 additions & 10 deletions content/docs/command-reference/run.md
@@ -73,7 +73,7 @@ so on (see `dvc dag`). This graph can be restored by DVC later to modify or

```dvc
$ dvc run -n printer -d write.sh -o pages ./write.sh
$ dvc run -n scanner -d read.sh -d pages -o signed.pdf ./read.sh
$ dvc run -n scanner -d read.sh -d pages -o signed.pdf ./read.sh pages
```

Stage dependencies can be any file or directory, either untracked, or more
@@ -150,8 +150,8 @@ like `|` (pipe) or `<`, `>` (redirection), otherwise they would apply to
variables in it that should be evaluated dynamically. Examples:

```dvc
$ dvc run -n my_stage "./my_script.sh > /dev/null 2>&1"
$ dvc run -n my_stage './my_script.sh $MYENVVAR'
$ dvc run -n first_stage "./a_script.sh > /dev/null 2>&1"
$ dvc run -n second_stage './another_script.sh $MYENVVAR'
```

## Options
@@ -317,17 +317,17 @@ dataset (`20180226` is a seed value):

```dvc
$ dvc run -n train \
-d matrix-train.p -d train_model.py \
-o model.p \
python train_model.py matrix-train.p 20180226 model.p
-d train_model.py -d matrix-train.p -o model.p \
python train_model.py 20180226 model.p
```

To update a stage that is already defined, the `-f` (`--force`) option is
needed. Let's update the seed for the `train` stage:

```dvc
$ dvc run -n train -f -d matrix-train.p -d train_model.py -o model.p \
python train_model.py matrix-train.p 18494003 model.p
$ dvc run -n train --force \
-d train_model.py -d matrix-train.p -o model.p \
python train_model.py 18494003 model.p
```

## Example: Separate stages in a subdirectory
@@ -421,9 +421,9 @@ Define a stage with both regular dependencies as well as parameter dependencies:

```dvc
$ dvc run -n train \
-d matrix-train.p -d train_model.py -o model.p \
-d train_model.py -d matrix-train.p -o model.p \
-p seed,train.lr,train.epochs \
python train_model.py matrix-train.p model.p
python train_model.py 20200105 model.p
```

`train_model.py` will include some code to open and parse the parameters:
2 changes: 1 addition & 1 deletion content/docs/start/data-pipelines.md
@@ -72,7 +72,7 @@ $ dvc run -n prepare \
```

A `dvc.yaml` file is generated. It includes information about the command we ran
(`python src/prepare.py`), its <abbr>dependencies</abbr>, and
(`python src/prepare.py data/data.xml`), its <abbr>dependencies</abbr>, and
<abbr>outputs</abbr>.

<details>
8 changes: 5 additions & 3 deletions content/docs/use-cases/shared-development-server.md
@@ -80,8 +80,9 @@ Let's say you are cleaning up raw data for later stages:

```dvc
$ dvc add raw
$ dvc run -n clean_data -d raw -o clean ./cleanup.py raw clean
# The data is cached in the shared location.
$ dvc run -n clean_data -d cleanup.py -d raw -o clean \
./cleanup.py raw clean
# The data is cached in the shared location.
$ git add raw.dvc dvc.yaml dvc.lock .gitignore
$ git commit -m "cleanup raw data"
$ git push
@@ -97,7 +98,8 @@ manually. After this, they could decide to continue building this
$ git pull
$ dvc checkout
A raw # Data is linked from cache to workspace.
$ dvc run -n process_clean_data -d clean -o processed ./process.py clean process
$ dvc run -n process_clean_data -d process.py -d clean -o processed \
./process.py clean processed
$ git add dvc.yaml dvc.lock
$ git commit -m "process clean data"
$ git push
8 changes: 4 additions & 4 deletions content/docs/user-guide/basic-concepts/dvc-cache.md
@@ -3,7 +3,7 @@ name: 'DVC Cache'
match: ['DVC cache', cache, caches, cached, 'cache directory']
---

The DVC cache is a hidden storage (by default located in the `.dvc/cache`
directory) for files that are tracked by DVC, and their different versions.
Learn more about its
[structure](/doc/user-guide/dvc-files-and-directories#structure-of-the-cache-directory).
The DVC cache is a hidden storage (by default in `.dvc/cache`) for files and
directories tracked by DVC, and their different versions. Learn more about its
structure
[here](/doc/user-guide/dvc-files-and-directories#structure-of-the-cache-directory).
31 changes: 19 additions & 12 deletions content/docs/user-guide/dvc-files-and-directories.md
@@ -187,7 +187,11 @@ the following fields:

`dvc.yaml` files also support `# comments`.

💡 We maintain a `dvc.yaml`
💡 Keep in mind that there may be more than one `dvc.yaml` file in each
<abbr>DVC project</abbr>. DVC checks all of them for consistency during
operations that require rebuilding DAGs (like `dvc dag`).

Note that we maintain a `dvc.yaml`
[schema](https://github.com/iterative/dvcyaml-schema) that can be used by
editors like [VSCode](/doc/install/plugins#visual-studio-code) or
[PyCharm](/doc/install/plugins#pycharmintellij) to enable automatic syntax
@@ -256,7 +260,7 @@ Full <abbr>parameters</abbr> (key and value) are listed separately under
- `.dvc/cache`: The <abbr>cache</abbr> directory will store your data in a
special [structure](#structure-of-the-cache-directory). The data files and
directories in the <abbr>workspace</abbr> will only contain links to the data
files in the cache. (Refer to
files in the cache (refer to
[Large Dataset Optimization](/doc/user-guide/large-dataset-optimization)). See
`dvc config cache` for related configuration options.

@@ -297,13 +301,17 @@ Full <abbr>parameters</abbr> (key and value) are listed separately under

## Structure of the cache directory

The DVC cache is a
The DVC cache is a hidden
[content-addressable storage](https://en.wikipedia.org/wiki/Content-addressable_storage)
(by default in `.dvc/cache`), which adds a layer of indirection between code and
(by default in `.dvc/cache`). It adds a layer of indirection between code and
data.

There are two ways in which the data is <abbr>cached</abbr>: As a single file
(eg. `data.csv`), or as a directory.
There are two ways in which the data is <abbr>cached</abbr>, depending on
whether it's a single file, or a directory (which may contain multiple files).

Note that files are renamed and reorganized, and directory trees are flattened,
in the cache, which always has exactly one depth level with 2-character
directories (based on hashes of the data contents, as explained next).
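
The naming scheme can be sketched in a few lines of Python. This assumes the hash is the MD5 hex digest of the file contents, as in the examples in this document; `cache_path` is a hypothetical helper for illustration, not a DVC function.

```python
import hashlib


def cache_path(content: bytes, cache_dir: str = ".dvc/cache") -> str:
    """Return where a file with this content would sit in the cache (illustrative)."""
    digest = hashlib.md5(content).hexdigest()
    # The first 2 hex characters name the directory; the other 30 name the file.
    return f"{cache_dir}/{digest[:2]}/{digest[2:]}"


print(cache_path(b""))  # .dvc/cache/d4/1d8cd98f00b204e9800998ecf8427e
```

Because the path depends only on the content, two files with identical contents map to the same cache entry, whatever their names in the workspace.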

### Files

@@ -331,9 +339,7 @@ data/images/
$ dvc add data/images
```

The directory is cached as a JSON file with `.dir` extension. The files it
contains are stored in the cache regularly, as explained earlier. It looks like
this:
The resulting cache dir looks like this:

```dvc
.dvc/cache/
@@ -345,13 +351,14 @@ this:
    └── 0b40427ee0998e9802335d98f08cd98f
```

The `.dir` file contains the mapping of files in `data/images` (as a JSON
array), including their hash values:
The files in the directory are cached normally. The directory itself gets a
similar entry, with the `.dir` extension. It contains the mapping of files
inside (as a JSON array), identified by their hash values:

```dvc
$ cat .dvc/cache/19/6a322c107c2572335158503c64bfba.dir
[{"md5": "dff70c0392d7d386c39a23c64fcc0376", "relpath": "cat.jpeg"},
{"md5": "29a6c8271c0c8fbf75d3b97aecee589f", "relpath": "index.jpeg"}]
```

That's how DVC knows that the other two cached files belong in the directory.
That's how DVC knows that those two cached files belong in the directory.
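
To make that lookup concrete, here is a small Python sketch that reads the `.dir` listing shown above and derives where each file should sit in the cache. It only illustrates the mapping and is not DVC code; `resolve_entries` is a hypothetical name.

```python
import json

# Contents of the `.dir` file shown above.
dir_file = """
[{"md5": "dff70c0392d7d386c39a23c64fcc0376", "relpath": "cat.jpeg"},
{"md5": "29a6c8271c0c8fbf75d3b97aecee589f", "relpath": "index.jpeg"}]
"""


def resolve_entries(dir_json, cache_dir=".dvc/cache"):
    """Map each relpath in a `.dir` listing to its expected location in the cache."""
    return {
        entry["relpath"]: f"{cache_dir}/{entry['md5'][:2]}/{entry['md5'][2:]}"
        for entry in json.loads(dir_json)
    }


for relpath, cached in resolve_entries(dir_file).items():
    print(relpath, "->", cached)
# cat.jpeg -> .dvc/cache/df/f70c0392d7d386c39a23c64fcc0376
# index.jpeg -> .dvc/cache/29/a6c8271c0c8fbf75d3b97aecee589f
```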
8 changes: 4 additions & 4 deletions content/docs/user-guide/how-to/add-deps-or-outs-to-a-stage.md
@@ -39,13 +39,13 @@ output. To add a missing dependency (`data/raw.csv`) as well as a missing output
> dependency/output to the stage:
>
> ```dvc
> $ dvc run -f --no-exec \
> -n prepare \
> -d data/raw.csv \
> $ dvc run -n prepare \
> -f --no-exec \
> -d src/prepare.py \
> -d data/raw.csv \
> -o data/train \
> -o data/validate \
> python src/prepare.py
> python src/prepare.py data/raw.csv
> ```
>
> `-f` overwrites the stage in `dvc.yaml`, while `--no-exec` updates the stage
9 changes: 2 additions & 7 deletions content/docs/user-guide/large-dataset-optimization.md
@@ -1,12 +1,7 @@
# Large Dataset Optimization

In order to track the data files and directories added with `dvc add` or
`dvc run`, DVC moves all these files to the <abbr>cache</abbr>. A
<abbr>project</abbr>'s cache is the hidden storage (by default located in
`.dvc/cache`) for files that are tracked by DVC, and their different versions.
(See `dvc cache` and
[DVC Files and Directories](/doc/user-guide/dvc-files-and-directories) for more
details.)
In order to track the data files and directories added with `dvc add`,
`dvc repro`, etc., DVC moves all these files to the project's <abbr>cache</abbr>.

However, the versions of the tracked files that
[match the current code](/doc/tutorials/get-started/data-pipelines) are also