diff --git a/content/docs/command-reference/cache/index.md b/content/docs/command-reference/cache/index.md index 07c8fb54c6..f9c1bc91f4 100644 --- a/content/docs/command-reference/cache/index.md +++ b/content/docs/command-reference/cache/index.md @@ -15,16 +15,15 @@ positional arguments: ## Description -At DVC initialization, a new `.dvc/` directory is created for internal -configuration and cache -[files and directories](/doc/user-guide/dvc-files-and-directories#internal-directories-and-files), -that are hidden from the user. - -The cache is where your data files, models, etc. (anything you want to version -with DVC) are actually stored. The corresponding files you see in the -workspace can simply link to the ones in cache. (Refer to -[File link types](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache) -for more information on file links on different platforms.) +The DVC Cache is where your data files, models, etc. (anything you want to +version with DVC) are actually stored. The data files and directories visible in +the workspace are links\* to (or copies of) the ones in cache. +Learn more about its +[structure](/doc/user-guide/dvc-files-and-directories#structure-of-the-cache-directory). + +> \* Refer to +> [File link types](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache) +> for more information on file links on different platforms. > For more cache-related configuration options refer to `dvc config cache`. 
diff --git a/content/docs/command-reference/fetch.md b/content/docs/command-reference/fetch.md index 8a618ee79c..83725c165e 100644 --- a/content/docs/command-reference/fetch.md +++ b/content/docs/command-reference/fetch.md @@ -35,7 +35,7 @@ Fetching is performed automatically by `dvc pull` (when the data is not already in the cache), along with `dvc checkout`: ``` -Controlled files Commands +Tracked files Commands ---------------- --------------------------------- remote storage @@ -277,4 +277,4 @@ into the workspace (with `dvc repro train.dvc`). > Note that in this example project, the last stage file `evaluate.dvc` doesn't > add any more data files than those form previous stages, so at this point all > of the data for this pipeline is cached and `dvc status -c` would output -> `Data and pipelines are up to date.` +> `Cache and remote 'myremote' are in sync.` diff --git a/content/docs/command-reference/init.md b/content/docs/command-reference/init.md index d4aad4fb53..6a3cb03e46 100644 --- a/content/docs/command-reference/init.md +++ b/content/docs/command-reference/init.md @@ -116,11 +116,11 @@ In rare cases, the `--no-scm` option might be desirable: to initialize DVC in a directory that is not part of a Git repo, or to make DVC ignore Git. Examples include: -- Version control other than Git is being used. Even though there are DVC - features that require DVC to be run in the Git repo, DVC can work well with - other version control systems. Since DVC relies on simple `dvc.yaml` files to - manage pipelines, data, etc, they can be added into any version - control system, thus providing large data files and directories versioning. +- SCM other than Git is being used. Even though there are DVC features that + require DVC to be run in the Git repo, DVC can work well with other version + control systems. 
Since DVC relies on simple `dvc.yaml` files to manage + pipelines, data, etc, they can be added into any version control + system, thus providing large data files and directories versioning. - There is no need to keep the history at all, e.g. having a deployment automation like running a data pipeline using `cron`. diff --git a/content/docs/command-reference/plots/diff.md b/content/docs/command-reference/plots/diff.md index d0bf5504ab..5bcd21353d 100644 --- a/content/docs/command-reference/plots/diff.md +++ b/content/docs/command-reference/plots/diff.md @@ -19,9 +19,9 @@ positional arguments: ## Description -This command is a way to visualize the "difference" between metrics among -experiments in the repository history, by plotting multiple -versions of the metrics. All plots defined in `dvc.yaml` are used by default. +This command is a way to visualize the "difference" between +[certain metrics](/doc/command-reference/plots#supported-file-formats) among +versions of the repository, by overlaying them in a single plot. > Note that unlike `dvc metrics diff`, this command does not calculate numeric > differences between metrics file values. @@ -34,8 +34,9 @@ revision results in comparing the workspace and that version. 💡 Note that any number of `revisions` can be provided, and the resulting plot shows all of them in a single image. -Specific plots files can be specified with the `--targets` option. Note that -these don't have to be defined as `plots` in `dvc.yaml`. +All plots defined in `dvc.yaml` are used by default, but specific plots files +can be specified with the `--targets` option (note that targets don't +necessarily have to be defined in `dvc.yaml`). 
The plot style can be customized with [plot templates](/doc/command-reference/plots#plot-templates), using the diff --git a/content/docs/command-reference/plots/index.md b/content/docs/command-reference/plots/index.md index 7c1616fe9c..9a9af955bf 100644 --- a/content/docs/command-reference/plots/index.md +++ b/content/docs/command-reference/plots/index.md @@ -29,15 +29,13 @@ learning training or data processing: ## Description -DVC provides a set of commands to visualize metrics of machine learning -experiments. Usual plot examples are AUC curves, loss functions, confusion -matrices, among others. +DVC provides a set of commands to visualize certain metrics of machine learning +experiments as plots. Usual plot examples are AUC curves, loss functions, +confusion matrices, among others. This type of metrics files are created by users, or generated by user data -processing code, and get defined with the `-p` (`--plots`) and -`--plots-no-cache`) options of `dvc run`. `dvc plots` subcommands can work with -plots files committed to a Git repo history, data files controlled by DVC, or -any other file in system. +processing code, and can be defined in `dvc.yaml` (`plots` field) for tracking +(optional). DVC generates plots as HTML files that can be open with a web browser. These HTML files use [Vega-Lite](https://vega.github.io/vega-lite/). Vega is a diff --git a/content/docs/command-reference/plots/show.md b/content/docs/command-reference/plots/show.md index 326fb34444..441380cb5e 100644 --- a/content/docs/command-reference/plots/show.md +++ b/content/docs/command-reference/plots/show.md @@ -18,12 +18,13 @@ positional arguments: ## Description -This command provides a quick way to visualize metrics such as loss functions, -AUC curves, confusion matrices, etc. All plots defined in `dvc.yaml` are used by -default. 
+This command provides a quick way to visualize +[certain metrics](/doc/command-reference/plots#supported-file-formats) such as +loss functions, AUC curves, confusion matrices, etc. -Optionally, specific metric file `targets` to show are accepted. Note that these -don't have to be defined as `plots` in `dvc.yaml`. +All plots defined in `dvc.yaml` are used by default, but specific plots files +can be specified as `targets` (note that targets don't necessarily have to be +defined in `dvc.yaml`). The plot style can be customized with [plot templates](/doc/command-reference/plots#plot-templates), using the diff --git a/content/docs/command-reference/pull.md b/content/docs/command-reference/pull.md index bdf334dd51..7ba525ab0c 100644 --- a/content/docs/command-reference/pull.md +++ b/content/docs/command-reference/pull.md @@ -37,7 +37,7 @@ to `dvc config cache.type`). It has the same effect as running `dvc fetch` and `dvc checkout`: ``` -Controlled files Commands +Tracked files Commands ---------------- --------------------------------- remote storage @@ -112,8 +112,9 @@ used to see what files `dvc pull` would download. `dvc remote list`). - `--run-cache` - downloads all available history of stage runs from the remote - repository into the local run-cache. A `dvc repro ` is necessary - to checkout these files into the workspace and update the `dvc.lock` file. + repository (to the cache only, like `dvc fetch --run-cache`). Note that + `dvc repro ` is necessary to checkout these files (into the + workspace) and update `dvc.lock`. - `-j `, `--jobs ` - parallelism level for DVC to download data from remote storage. This only applies when the `--cloud` option is used, or a diff --git a/content/docs/command-reference/push.md b/content/docs/command-reference/push.md index 72cbae4a47..b8e045949b 100644 --- a/content/docs/command-reference/push.md +++ b/content/docs/command-reference/push.md @@ -168,7 +168,7 @@ $ dvc push --with-deps matrix-train ... 
Push the rest of the data $ dvc status --cloud -Data and pipelines are up to date. +Cache and remote 'r1' are in sync. ``` We specified a stage in the middle of this pipeline (`test-posts`) with the @@ -259,7 +259,7 @@ $ tree ~/vault/recursive 10 directories, 10 files $ dvc status --cloud -Data and pipelines are up to date. +Cache and remote 'r1' are in sync. ``` And running `dvc status --cloud`, DVC verifies that indeed there are no more diff --git a/content/docs/command-reference/repro.md b/content/docs/command-reference/repro.md index 21bd84b763..37fc319574 100644 --- a/content/docs/command-reference/repro.md +++ b/content/docs/command-reference/repro.md @@ -39,7 +39,7 @@ and caches relevant data artifacts along the way. needed after a `git commit`. See `dvc install` for more details. `dvc repro` does not run `dvc fetch`, `dvc pull` or `dvc checkout` to get data -files, intermediate or final results. +files, intermediate or final results (except if the `--pull` option is used). By default, this command checks all pipeline stages to determine which ones have changed. Then it executes the corresponding commands. Outputs are @@ -135,7 +135,7 @@ up-to-date and only execute the final stage. present in the DVC project. - `--no-run-cache` - execute stage commands even if they have already been run - with the same command/dependencies/outputs/etc before. + with the same dependencies/outputs/etc. before. - `--force-downstream` - in cases like `... -> A (changed) -> B -> C` it will reproduce `A` first and then `B`, even if `B` was previously executed with the @@ -157,8 +157,12 @@ up-to-date and only execute the final stage. corresponding pipelines, including the target stages themselves. This option has no effect if `targets` are not provided. -- `--pull` - try automatically [pulling](/doc/command-reference/pull) missing - cache for outputs restored from run-cache. 
+- `--pull` - [pulls](/doc/command-reference/pull) dependencies and outputs + involved in the stages being reproduced, if they are found in the + [default](/doc/command-reference/remote/default) remote storage. Note that it + checks the local run-cache too (available history of stage runs). + + > Has no effect if combined with `--no-run-cache`. - `-h`, `--help` - prints the usage/help message, and exit. diff --git a/content/docs/command-reference/run.md b/content/docs/command-reference/run.md index 6eef7e1605..e9e19633cf 100644 --- a/content/docs/command-reference/run.md +++ b/content/docs/command-reference/run.md @@ -170,9 +170,8 @@ $ dvc run -n my_stage './my_script.sh $MYENVVAR' - `-O `, `--outs-no-cache ` - the same as `-o` except that outputs are not tracked by DVC. It means that they are not cached, and it's up to a - user to save and version control them. This is useful if the outputs are small - enough to be tracked by Git directly, or if these files are not of future - interest. + user to manage them separately. This is useful if the outputs are small enough + to be tracked by Git directly, or if these files are not of future interest. - `--outs-persist ` - declare output file or directory that will not be removed upon `dvc repro`. @@ -197,9 +196,9 @@ $ dvc run -n my_stage './my_script.sh $MYENVVAR' - `-M `, `--metrics-no-cache ` - the same as `-m` except that DVC does not track the metrics file. This means that the file is not cached, so - it's up to the user to save and version control it. This is typically - desirable with _metrics_ because they are small enough to be tracked with Git - directly. See also the difference between `-o` and `-O`. + it's up to the user to manage it separately. This is typically desirable + with _metrics_ because they are small enough to be tracked with Git directly. + See also the difference between `-o` and `-O`. - `--plots ` - specify a plot metrics file produces by this stage. 
This option behaves like `-o` but registers the file in a `plots` field inside the @@ -210,8 +209,8 @@ $ dvc run -n my_stage './my_script.sh $MYENVVAR' - `--plots-no-cache ` - the same as `--plots` except that DVC does not track the plots metrics file. This means that the file is not cached, so it's - up to the user to save and version control it. See also the difference between - `-o` and `-O`. + up to the user to manage it separately. See also the difference between `-o` + and `-O`. - `-w `, `--wdir ` - specifies a working directory for the `command` to run in (uses the `wdir` field in `dvc.yaml`). Dependency and output files @@ -231,10 +230,10 @@ $ dvc run -n my_stage './my_script.sh $MYENVVAR' - `-f`, `--force` - overwrite an existing stage in `dvc.yaml` file without asking for confirmation. -- `--no-run-cache` - forcefully execute the `command` again, even if the same - `dvc run` command has already been run in this workspace. Useful if the - command's code is non-deterministic (meaning it produces different outputs - from the same list of inputs). +- `--no-run-cache` - execute the stage `command` even if it has already been run + with the same dependencies/outputs/etc. before. Useful for example if the + command's code is non-deterministic + ([not recommended](#avoiding-unexpected-behavior)). - `--no-commit` - do not save outputs to cache. A stage created and an entry is added to `.dvc/state`, while nothing is added to the cache. 
In the stage file, diff --git a/content/docs/command-reference/status.md b/content/docs/command-reference/status.md index 5b80d814ca..e3f2d26774 100644 --- a/content/docs/command-reference/status.md +++ b/content/docs/command-reference/status.md @@ -19,10 +19,10 @@ positional arguments: ## Description -`dvc status` searches for changes in the existing tracked data and pipelines, -either showing which files or directories have changed in the -workspace and should be added or reproduced again (with `dvc add` -or `dvc repro`); or differences between cache vs. remote storage +Searches for changes in the existing tracked data and pipelines, either showing +which files or directories have changed in the workspace and should +be added or reproduced again (with `dvc add` or `dvc repro`); or differences +between cache vs. [remote storage](/doc/command-reference/remote) (implying `dvc push` or `dvc pull` should be run to synchronize them). The _remote_ mode is triggered by using the `--cloud` or `--remote` options: @@ -43,11 +43,12 @@ paths to tracked files or directories (including paths inside tracked directories), `.dvc` files, and stage names (found in `dvc.yaml`). If no differences are detected, `dvc status` prints -`Data and pipelines are up to date.` If differences are detected by -`dvc status`, the command output indicates the changes. For each stage with -differences, the changes in dependencies and/or -outputs that differ are listed. For each item listed, either the -file name or hash is shown, along with a _state description_, as detailed below: +`Data and pipelines are up to date.` or +`Cache and remote 'myremote' are in sync.` (if the `-c` or `-r` options are +used). If differences are detected, the changes in dependencies +and/or outputs for each stage that differs are listed. 
For each +item listed, either the file name or hash is shown, along with a _state +description_, as detailed below: ### Local workspace status diff --git a/content/docs/sidebar.json b/content/docs/sidebar.json index f86cf0295a..aed67a329e 100644 --- a/content/docs/sidebar.json +++ b/content/docs/sidebar.json @@ -71,7 +71,7 @@ "children": ["tutorial"] }, { - "label": "Sharing Data & Model Files", + "label": "Sharing Data and Model Files", "slug": "sharing-data-and-model-files" }, "shared-development-server", diff --git a/content/docs/use-cases/index.md b/content/docs/use-cases/index.md index bc316c479b..2eb692b461 100644 --- a/content/docs/use-cases/index.md +++ b/content/docs/use-cases/index.md @@ -18,9 +18,8 @@ knowledge, they are still difficult to implement, reuse, and manage. If you store and process data files or datasets to produce other data or machine learning models, and you want to -- capture and save data artifacts the same way you capture code; -- track, control, and switch between different versions of data or models - easily; +- track and save data artifacts the same way you capture code; +- create and switch among different versions of data or models easily; - understand how data or ML models were built in the first place; - compare machine learning models and metrics to each other; - bring software engineering best practices and tools to your data science team diff --git a/content/docs/use-cases/versioning-data-and-model-files/tutorial.md b/content/docs/use-cases/versioning-data-and-model-files/tutorial.md index 44588ad577..0810bf8d1a 100644 --- a/content/docs/use-cases/versioning-data-and-model-files/tutorial.md +++ b/content/docs/use-cases/versioning-data-and-model-files/tutorial.md @@ -11,7 +11,7 @@ to build a powerful image classifier using a pretty small dataset. > We highly recommend reading the François' tutorial itself. 
It's a great > demonstration of how a general pre-trained model can be leveraged to build a -> new highly performant model, with very limited resources. +> new high-performance model, with very limited resources. We first train a classifier model using 1000 labeled images, then we double the number of images (2000) and retrain our model. We capture both datasets and diff --git a/content/docs/user-guide/basic-concepts/dvc-cache.md b/content/docs/user-guide/basic-concepts/dvc-cache.md index 1d080775f4..49c0644100 100644 --- a/content/docs/user-guide/basic-concepts/dvc-cache.md +++ b/content/docs/user-guide/basic-concepts/dvc-cache.md @@ -4,6 +4,6 @@ match: ['DVC cache', cache, caches, cached] --- The DVC cache is a hidden storage (by default located in the `.dvc/cache` -directory) for files that are under DVC control, and their different versions. -For more details, please refer to this -[document](/doc/user-guide/dvc-files-and-directories#structure-of-the-cache-directory). +directory) for files that are tracked by DVC, and their different versions. +Learn more about its +[structure](/doc/user-guide/dvc-files-and-directories#structure-of-the-cache-directory).