Commit

Merge pull request #1881 from iterative/jorge
Misc. updates
jorgeorpinel authored Oct 23, 2020
2 parents a5831d6 + 9f93ffb commit 1ac5f01
Showing 15 changed files with 75 additions and 72 deletions.
19 changes: 9 additions & 10 deletions content/docs/command-reference/cache/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -15,16 +15,15 @@ positional arguments:

## Description

At DVC initialization, a new `.dvc/` directory is created for internal
configuration and <abbr>cache</abbr>
[files and directories](/doc/user-guide/dvc-files-and-directories#internal-directories-and-files),
that are hidden from the user.

The cache is where your data files, models, etc. (anything you want to version
with DVC) are actually stored. The corresponding files you see in the
<abbr>workspace</abbr> can simply link to the ones in cache. (Refer to
[File link types](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache)
for more information on file links on different platforms.)
The DVC Cache is where your data files, models, etc. (anything you want to
version with DVC) are actually stored. The data files and directories visible in
the <abbr>workspace</abbr> are links\* to (or copies of) the ones in cache.
Learn more about its
[structure](/doc/user-guide/dvc-files-and-directories#structure-of-the-cache-directory).

> \* Refer to
> [File link types](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache)
> for more information on file links on different platforms.
> For more cache-related configuration options refer to `dvc config cache`.
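
The relationship between cached files and their workspace links can be pictured with a small sketch. It is not DVC's implementation: `cache_path` and `add_to_cache` are hypothetical helpers, and the MD5-based layout (first two hex characters as a subdirectory, the remaining 30 as the file name) is an assumption based on the cache structure document linked above.

```python
import hashlib
import os
import shutil
from pathlib import Path


def cache_path(cache_dir: Path, content: bytes) -> Path:
    """Return where `content` would live in a content-addressed cache.

    Assumed layout: <cache_dir>/<first 2 hex chars of MD5>/<remaining 30>.
    """
    digest = hashlib.md5(content).hexdigest()
    return cache_dir / digest[:2] / digest[2:]


def add_to_cache(workspace_file: Path, cache_dir: Path) -> Path:
    """Store a file in the cache, leaving a link (or a copy) in the workspace."""
    target = cache_path(cache_dir, workspace_file.read_bytes())
    target.parent.mkdir(parents=True, exist_ok=True)
    if target.exists():
        # Identical content is already cached; drop the duplicate.
        workspace_file.unlink()
    else:
        shutil.move(str(workspace_file), str(target))
    try:
        # Preferred: the workspace entry is a link to the cached file.
        os.symlink(target.resolve(), workspace_file)
    except OSError:
        # Links may be unavailable on some platforms; fall back to a copy.
        shutil.copy2(target, workspace_file)
    return target
```

Because the path depends only on the file contents, two identical files share a single cache entry, which is why the workspace can hold links instead of duplicates.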
4 changes: 2 additions & 2 deletions content/docs/command-reference/fetch.md
Expand Up @@ -35,7 +35,7 @@ Fetching is performed automatically by `dvc pull` (when the data is not already
in the <abbr>cache</abbr>), along with `dvc checkout`:

```
Controlled files Commands
Tracked files Commands
---------------- ---------------------------------
remote storage
Expand Up @@ -277,4 +277,4 @@ into the workspace (with `dvc repro train.dvc`).
> Note that in this example project, the last stage file `evaluate.dvc` doesn't
> add any more data files than those from previous stages, so at this point all
> of the data for this pipeline is cached and `dvc status -c` would output
> `Data and pipelines are up to date.`
> `Cache and remote 'myremote' are in sync.`
10 changes: 5 additions & 5 deletions content/docs/command-reference/init.md
Expand Up @@ -116,11 +116,11 @@ In rare cases, the `--no-scm` option might be desirable: to initialize DVC in a
directory that is not part of a Git repo, or to make DVC ignore Git. Examples
include:

- Version control other than Git is being used. Even though there are DVC
features that require DVC to be run in the Git repo, DVC can work well with
other version control systems. Since DVC relies on simple `dvc.yaml` files to
manage <abbr>pipelines</abbr>, data, etc., they can be added into any version
control system, thus providing large data files and directories versioning.
- SCM other than Git is being used. Even though there are DVC features that
require DVC to be run in the Git repo, DVC can work well with other version
control systems. Since DVC relies on simple `dvc.yaml` files to manage
<abbr>pipelines</abbr>, data, etc., they can be added into any version control
system, thus providing large data files and directories versioning.

- There is no need to keep the history at all, e.g. having a deployment
automation like running a data pipeline using `cron`.
11 changes: 6 additions & 5 deletions content/docs/command-reference/plots/diff.md
Expand Up @@ -19,9 +19,9 @@ positional arguments:

## Description

This command is a way to visualize the "difference" between metrics among
experiments in the <abbr>repository</abbr> history, by plotting multiple
versions of the metrics. All plots defined in `dvc.yaml` are used by default.
This command is a way to visualize the "difference" between
[certain metrics](/doc/command-reference/plots#supported-file-formats) among
versions of the <abbr>repository</abbr>, by overlaying them in a single plot.

> Note that unlike `dvc metrics diff`, this command does not calculate numeric
> differences between metrics file values.
Expand All @@ -34,8 +34,9 @@ revision results in comparing the workspace and that version.
💡 Note that any number of `revisions` can be provided, and the resulting plot
shows all of them in a single image.

Specific plots files can be specified with the `--targets` option. Note that
these don't have to be defined as `plots` in `dvc.yaml`.
All plots defined in `dvc.yaml` are used by default, but specific plots files
can be specified with the `--targets` option (note that targets don't
necessarily have to be defined in `dvc.yaml`).

The plot style can be customized with
[plot templates](/doc/command-reference/plots#plot-templates), using the
12 changes: 5 additions & 7 deletions content/docs/command-reference/plots/index.md
Expand Up @@ -29,15 +29,13 @@ learning training or data processing:

## Description

DVC provides a set of commands to visualize metrics of machine learning
experiments. Usual plot examples are AUC curves, loss functions, confusion
matrices, among others.
DVC provides a set of commands to visualize certain metrics of machine learning
experiments as plots. Usual plot examples are AUC curves, loss functions,
confusion matrices, among others.

These metrics files are created by users, or generated by user data
processing code, and get defined with the `-p` (`--plots`) and
`--plots-no-cache` options of `dvc run`. `dvc plots` subcommands can work with
plots files committed to a Git repo history, data files controlled by DVC, or
any other file in the system.
processing code, and can be defined in `dvc.yaml` (`plots` field) for tracking
(optional).

DVC generates plots as HTML files that can be opened with a web browser. These
HTML files use [Vega-Lite](https://vega.github.io/vega-lite/). Vega is a
11 changes: 6 additions & 5 deletions content/docs/command-reference/plots/show.md
Expand Up @@ -18,12 +18,13 @@ positional arguments:

## Description

This command provides a quick way to visualize metrics such as loss functions,
AUC curves, confusion matrices, etc. All plots defined in `dvc.yaml` are used by
default.
This command provides a quick way to visualize
[certain metrics](/doc/command-reference/plots#supported-file-formats) such as
loss functions, AUC curves, confusion matrices, etc.

Optionally, specific metric file `targets` to show are accepted. Note that these
don't have to be defined as `plots` in `dvc.yaml`.
All plots defined in `dvc.yaml` are used by default, but specific plots files
can be specified as `targets` (note that targets don't necessarily have to be
defined in `dvc.yaml`).

The plot style can be customized with
[plot templates](/doc/command-reference/plots#plot-templates), using the
7 changes: 4 additions & 3 deletions content/docs/command-reference/pull.md
Expand Up @@ -37,7 +37,7 @@ to `dvc config cache.type`).
It has the same effect as running `dvc fetch` and `dvc checkout`:

```
Controlled files Commands
Tracked files Commands
---------------- ---------------------------------
remote storage
Expand Down Expand Up @@ -112,8 +112,9 @@ used to see what files `dvc pull` would download.
`dvc remote list`).

- `--run-cache` - downloads all available history of stage runs from the remote
repository into the local run-cache. A `dvc repro <stage_name>` is necessary
to checkout these files into the workspace and update the `dvc.lock` file.
repository (to the cache only, like `dvc fetch --run-cache`). Note that
`dvc repro <stage_name>` is necessary to checkout these files (into the
workspace) and update `dvc.lock`.

- `-j <number>`, `--jobs <number>` - parallelism level for DVC to download data
from remote storage. This only applies when the `--cloud` option is used, or a
4 changes: 2 additions & 2 deletions content/docs/command-reference/push.md
Expand Up @@ -168,7 +168,7 @@ $ dvc push --with-deps matrix-train
... Push the rest of the data
$ dvc status --cloud
Data and pipelines are up to date.
Cache and remote 'r1' are in sync.
```

We specified a stage in the middle of this pipeline (`test-posts`) with the
Expand Up @@ -259,7 +259,7 @@ $ tree ~/vault/recursive
10 directories, 10 files
$ dvc status --cloud
Data and pipelines are up to date.
Cache and remote 'r1' are in sync.
```

And running `dvc status --cloud`, DVC verifies that indeed there are no more
12 changes: 8 additions & 4 deletions content/docs/command-reference/repro.md
Expand Up @@ -39,7 +39,7 @@ and caches relevant <abbr>data artifacts</abbr> along the way.
needed after a `git commit`. See `dvc install` for more details.

`dvc repro` does not run `dvc fetch`, `dvc pull` or `dvc checkout` to get data
files, intermediate or final results.
files, intermediate or final results (except if the `--pull` option is used).

By default, this command checks all pipeline stages to determine which ones have
changed. Then it executes the corresponding commands. <abbr>Outputs</abbr> are
Expand Down Expand Up @@ -135,7 +135,7 @@ up-to-date and only execute the final stage.
present in the DVC project.

- `--no-run-cache` - execute stage commands even if they have already been run
with the same command/dependencies/outputs/etc before.
with the same dependencies/outputs/etc. before.

- `--force-downstream` - in cases like `... -> A (changed) -> B -> C` it will
reproduce `A` first and then `B`, even if `B` was previously executed with the
Expand All @@ -157,8 +157,12 @@ up-to-date and only execute the final stage.
corresponding pipelines, including the target stages themselves. This option
has no effect if `targets` are not provided.

- `--pull` - try automatically [pulling](/doc/command-reference/pull) missing
cache for outputs restored from run-cache.
- `--pull` - [pulls](/doc/command-reference/pull) dependencies and outputs
involved in the stages being reproduced, if they are found in the
[default](/doc/command-reference/remote/default) remote storage. Note that it
checks the local run-cache too (available history of stage runs).

> Has no effect if combined with `--no-run-cache`.
- `-h`, `--help` - prints the usage/help message, and exit.
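
The effect of `--no-run-cache` described above can be pictured with a toy model of a run-cache. This is only a sketch, not DVC's implementation: `stage_key`, `RunCache`, and `reproduce` are hypothetical names, and keying a run on the command plus its dependency hashes is an assumption made for illustration.

```python
import hashlib
import json
from typing import Callable, Dict


def stage_key(command: str, dep_hashes: Dict[str, str]) -> str:
    """Derive one key from the command and the hashes of its dependencies."""
    payload = json.dumps({"cmd": command, "deps": dep_hashes}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


class RunCache:
    """Toy history of stage runs (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self._runs: Dict[str, Dict[str, str]] = {}
        self.executions = 0  # how many times a command was really executed

    def reproduce(
        self,
        command: str,
        dep_hashes: Dict[str, str],
        execute: Callable[[], Dict[str, str]],
        use_run_cache: bool = True,
    ) -> Dict[str, str]:
        key = stage_key(command, dep_hashes)
        if use_run_cache and key in self._runs:
            # Same command and dependencies seen before: reuse the outputs.
            return self._runs[key]
        # Otherwise execute the command and record the outputs it produced.
        self.executions += 1
        outputs = execute()
        self._runs[key] = outputs
        return outputs
```

Calling `reproduce(..., use_run_cache=False)` corresponds to `--no-run-cache` in this model: the command is executed again even though an identical run is already recorded.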

23 changes: 11 additions & 12 deletions content/docs/command-reference/run.md
Expand Up @@ -170,9 +170,8 @@ $ dvc run -n my_stage './my_script.sh $MYENVVAR'

- `-O <path>`, `--outs-no-cache <path>` - the same as `-o` except that outputs
are not tracked by DVC. It means that they are not cached, and it's up to a
user to save and version control them. This is useful if the outputs are small
enough to be tracked by Git directly, or if these files are not of future
interest.
user to manage them separately. This is useful if the outputs are small enough
to be tracked by Git directly, or if these files are not of future interest.

- `--outs-persist <path>` - declare output file or directory that will not be
removed upon `dvc repro`.
Expand All @@ -197,9 +196,9 @@ $ dvc run -n my_stage './my_script.sh $MYENVVAR'

- `-M <path>`, `--metrics-no-cache <path>` - the same as `-m` except that DVC
does not track the metrics file. This means that the file is not cached, so
it's up to the user to save and version control it. This is typically
desirable with _metrics_ because they are small enough to be tracked with Git
directly. See also the difference between `-o` and `-O`.
it's up to the user to manage it separately. This is typically desirable
with _metrics_ because they are small enough to be tracked with Git directly.
See also the difference between `-o` and `-O`.

- `--plots <path>` - specify a plot metrics file produced by this stage. This
option behaves like `-o` but registers the file in a `plots` field inside the
Expand All @@ -210,8 +209,8 @@ $ dvc run -n my_stage './my_script.sh $MYENVVAR'

- `--plots-no-cache <path>` - the same as `--plots` except that DVC does not
track the plots metrics file. This means that the file is not cached, so it's
up to the user to save and version control it. See also the difference between
`-o` and `-O`.
up to the user to manage it separately. See also the difference between `-o`
and `-O`.

- `-w <path>`, `--wdir <path>` - specifies a working directory for the `command`
to run in (uses the `wdir` field in `dvc.yaml`). Dependency and output files
Expand All @@ -231,10 +230,10 @@ $ dvc run -n my_stage './my_script.sh $MYENVVAR'
- `-f`, `--force` - overwrite an existing stage in `dvc.yaml` file without
asking for confirmation.

- `--no-run-cache` - forcefully execute the `command` again, even if the same
`dvc run` command has already been run in this workspace. Useful if the
command's code is non-deterministic (meaning it produces different outputs
from the same list of inputs).
- `--no-run-cache` - execute the stage `command` even if it has already been run
with the same dependencies/outputs/etc. before. Useful for example if the
command's code is non-deterministic
([not recommended](#avoiding-unexpected-behavior)).

- `--no-commit` - do not save outputs to cache. A stage is created and an entry is
added to `.dvc/state`, while nothing is added to the cache. In the stage file,
19 changes: 10 additions & 9 deletions content/docs/command-reference/status.md
Expand Up @@ -19,10 +19,10 @@ positional arguments:

## Description

`dvc status` searches for changes in the existing tracked data and pipelines,
either showing which files or directories have changed in the
<abbr>workspace</abbr> and should be added or reproduced again (with `dvc add`
or `dvc repro`); or differences between <abbr>cache</abbr> vs. remote storage
Searches for changes in the existing tracked data and pipelines, either showing
which files or directories have changed in the <abbr>workspace</abbr> and should
be added or reproduced again (with `dvc add` or `dvc repro`); or differences
between <abbr>cache</abbr> vs. [remote storage](/doc/command-reference/remote)
(implying `dvc push` or `dvc pull` should be run to synchronize them). The
_remote_ mode is triggered by using the `--cloud` or `--remote` options:

Expand All @@ -43,11 +43,12 @@ paths to tracked files or directories (including paths inside tracked
directories), `.dvc` files, and stage names (found in `dvc.yaml`).

If no differences are detected, `dvc status` prints
`Data and pipelines are up to date.` If differences are detected by
`dvc status`, the command output indicates the changes. For each stage with
differences, the changes in <abbr>dependencies</abbr> and/or
<abbr>outputs</abbr> that differ are listed. For each item listed, either the
file name or hash is shown, along with a _state description_, as detailed below:
`Data and pipelines are up to date.` or
`Cache and remote 'myremote' are in sync.` (if the `-c` or `-r` options are
used). If differences are detected, the changes in <abbr>dependencies</abbr>
and/or <abbr>outputs</abbr> for each stage that differs are listed. For each
item listed, either the file name or hash is shown, along with a _state
description_, as detailed below:

### Local workspace status

2 changes: 1 addition & 1 deletion content/docs/sidebar.json
Expand Up @@ -71,7 +71,7 @@
"children": ["tutorial"]
},
{
"label": "Sharing Data & Model Files",
"label": "Sharing Data and Model Files",
"slug": "sharing-data-and-model-files"
},
"shared-development-server",
5 changes: 2 additions & 3 deletions content/docs/use-cases/index.md
Expand Up @@ -18,9 +18,8 @@ knowledge, they are still difficult to implement, reuse, and manage.
If you store and process data files or datasets to produce other data or machine
learning models, and you want to

- capture and save <abbr>data artifacts</abbr> the same way you capture code;
- track, control, and switch between different versions of data or models
easily;
- track and save <abbr>data artifacts</abbr> the same way you capture code;
- create and switch among different versions of data or models easily;
- understand how data or ML models were built in the first place;
- compare machine learning models and metrics to each other;
- bring software engineering best practices and tools to your data science team
Original file line number Diff line number Diff line change
Expand Up @@ -18,7 +18,7 @@ to build a powerful image classifier using a pretty small dataset.

> We highly recommend reading François' tutorial itself. It's a great
> demonstration of how a general pre-trained model can be leveraged to build a
> new highly performant model, with very limited resources.
> new high-performance model, with very limited resources.
We first train a classifier model using 1000 labeled images, then we double the
number of images (2000) and retrain our model. We capture both datasets and
6 changes: 3 additions & 3 deletions content/docs/user-guide/basic-concepts/dvc-cache.md
Expand Up @@ -4,6 +4,6 @@ match: ['DVC cache', cache, caches, cached]
---

The DVC cache is a hidden storage (by default located in the `.dvc/cache`
directory) for files that are under DVC control, and their different versions.
For more details, please refer to this
[document](/doc/user-guide/dvc-files-and-directories#structure-of-the-cache-directory).
directory) for files that are tracked by DVC, and their different versions.
Learn more about its
[structure](/doc/user-guide/dvc-files-and-directories#structure-of-the-cache-directory).

0 comments on commit 1ac5f01
