From 79ec70b943eb780d6aa04a57b9f6d9080c8dbbfd Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Wed, 23 Dec 2020 21:08:27 -0600 Subject: [PATCH 1/4] docs: misc copy edits --- content/docs/start/index.md | 14 ++++++++------ content/docs/use-cases/data-registries.md | 2 +- .../docs/user-guide/dvc-files-and-directories.md | 10 ++++++---- content/docs/user-guide/how-to/merge-conflicts.md | 5 ++--- .../docs/user-guide/setup-google-drive-remote.md | 2 +- 5 files changed, 18 insertions(+), 15 deletions(-) diff --git a/content/docs/start/index.md b/content/docs/start/index.md index 060edd3e9a..a31786869e 100644 --- a/content/docs/start/index.md +++ b/content/docs/start/index.md @@ -49,13 +49,15 @@ Changes to be committed: $ git commit -m "Initialize DVC" ``` -DVC features can be grouped into functional components. We'll explore them one -by one in the next few sections: +Now you're ready to DVC! -- [**Data versioning**](/doc/start/data-versioning) is the base layer of DVC for - large files, datasets, and machine learning models. It looks like a regular - Git workflow, but without storing large files in the repo (think "Git for - data"). Data is stored separately, which allows for efficient sharing. +DVC's features can be grouped into functional components. We'll explore them one +by one in the next few pages: + +- [**Data versioning**](/doc/start/data-versioning) (try this next) is the base + layer of DVC for large files, datasets, and machine learning models. Use a + regular Git workflow, but without storing large files in the repo (think "Git + for data"). Data is stored separately, which allows for efficient sharing. 
- [**Data access**](/doc/start/data-access) shows how to use data artifacts from outside of the project and how to import data artifacts from another DVC diff --git a/content/docs/use-cases/data-registries.md b/content/docs/use-cases/data-registries.md index 2848f75877..3cdaa4fb2d 100644 --- a/content/docs/use-cases/data-registries.md +++ b/content/docs/use-cases/data-registries.md @@ -30,7 +30,7 @@ Advantages of data registries: management and optimizes space requirements. - **Data as code**: leverage Git workflows such as commits, branching, pull requests, reviews, and even CI/CD for your data and models lifecycle. Think - "Git for cloud storage", but without ad-hoc conventions. + "Git for cloud storage". - **Security**: registries can be setup with read-only remote storage (e.g. an HTTP server). diff --git a/content/docs/user-guide/dvc-files-and-directories.md b/content/docs/user-guide/dvc-files-and-directories.md index 5d687dafb4..d7760b6272 100644 --- a/content/docs/user-guide/dvc-files-and-directories.md +++ b/content/docs/user-guide/dvc-files-and-directories.md @@ -72,14 +72,15 @@ An _output entry_ (`outs`) can have these fields: HTTP, S3, or Azure [external outputs](/doc/user-guide/managing-external-data); and a special _checksum_ for HDFS and WebHDFS. - `size`: Size of the file or directory (sum of all files). -- `nfiles`: If a directory, number of files inside. +- `nfiles`: If this output is a directory, the number of files inside + (recursive). - `cache`: Whether or not this file or directory is cached (`true` by default, if not present). See the `--no-commit` option of `dvc add`. - `persist`: Whether the output file/dir should remain in place while `dvc repro` runs. By default outputs are deleted when `dvc repro` starts (if this value is not present). -- `desc`: User description for this output. This doesn't affect any DVC - operations. +- `desc` (optional): User description for this output. This doesn't affect any + DVC operations. 
A _dependency entry_ (`deps`) can have these fields: @@ -91,7 +92,8 @@ A _dependency entry_ (`deps`) can have these fields: HTTP, S3, or Azure external dependencies; and a special _checksum_ for HDFS and WebHDFS. See `dvc import-url` for more information. - `size`: Size of the file or directory (sum of all files). -- `nfiles`: If a directory, number of files inside. +- `nfiles`: If this dependency is a directory, the number of files inside + (recursive). - `repo`: This entry is only for external dependencies created with `dvc import`, and can contains the following fields: diff --git a/content/docs/user-guide/how-to/merge-conflicts.md b/content/docs/user-guide/how-to/merge-conflicts.md index 89c003b9ab..dd4408ef15 100644 --- a/content/docs/user-guide/how-to/merge-conflicts.md +++ b/content/docs/user-guide/how-to/merge-conflicts.md @@ -36,9 +36,8 @@ stages: ## `dvc.lock` -There's no need to resolve lock file conflicts manually. You can safely delete -this file and then use `dvc repro` after merging `dvc.yaml` to regenerate this -file. +There's no need to resolve lock file conflicts manually. You can safely +overwrite this file by using `dvc repro` after merging `dvc.yaml`. > `dvc commit` can also be a good option, but only for the specific case where > the `HEAD` version is chosen. diff --git a/content/docs/user-guide/setup-google-drive-remote.md b/content/docs/user-guide/setup-google-drive-remote.md index e86e7961fb..cecff6b626 100644 --- a/content/docs/user-guide/setup-google-drive-remote.md +++ b/content/docs/user-guide/setup-google-drive-remote.md @@ -215,7 +215,7 @@ individually. If you use multiple GDrive remotes, by default they will be sharing the same `.dvc/tmp/gdrive-user-credentials.json` file. 
It can be overridden with the -`gdrive_user_credentials_file` setting: +`gdrive_user_credentials_file` parameter: ```dvc $ dvc remote modify myremote gdrive_user_credentials_file \ From 4fe62ce1f8f817ae87f867a1d0a0ca224a80a818 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Wed, 23 Dec 2020 21:17:53 -0600 Subject: [PATCH 2/4] guide: reorg repro (like in master) --- content/docs/command-reference/repro.md | 14 ++++++++------ 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/content/docs/command-reference/repro.md b/content/docs/command-reference/repro.md index c571646d45..886aa560e5 100644 --- a/content/docs/command-reference/repro.md +++ b/content/docs/command-reference/repro.md @@ -39,14 +39,9 @@ and caches the pipeline's outputs along the way. 💡 For convenience, a Git hook is available to remind you to `dvc repro` when needed after a `git commit`. See `dvc install` for more details. -`dvc repro` does not run `dvc fetch`, `dvc pull` or `dvc checkout` to get data -files, intermediate or final results (except if the `--pull` option is used). - By default, this command checks all [pipeline](/doc/command-reference/dag) stages to determine which ones have changed. Then it executes the corresponding -commands (`cmd` field of `dvc.yaml`). [Stage](/doc/command-reference/run) -outputs are deleted from the workspace before executing the stage -commands that produce them (unless `persist: true` is used in `dvc.yaml`). +commands (`cmd` field of `dvc.yaml`). There are a few ways to restrict what will be regenerated by this command: by specifying specific reproduction [`targets`](#options), or by using certain @@ -55,6 +50,13 @@ command [options](#options), such as `--single-item` or `--all-pipelines`. > Note that stages without dependencies are considered _always changed_, so > `dvc repro` always executes them. 
+[Stage](/doc/command-reference/run) outputs are deleted from the +workspace before executing the stage commands that produce them +(unless `persist: true` is used in `dvc.yaml`). + +`dvc repro` does not run `dvc fetch`, `dvc pull` or `dvc checkout` to get data +files, intermediate or final results (except if the `--pull` option is used). + It stores all the data files, intermediate or final results in the cache (unless the `--no-commit` option is used), and updates the hash values of changed dependencies and outputs in the `dvc.lock` and `.dvc` From c2eced990c27fabbbcf37a19af6989ba7d80ee6f Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Sun, 31 Jan 2021 14:56:54 -0600 Subject: [PATCH 3/4] cmd: misc. ref updates --- content/docs/command-reference/checkout.md | 7 +-- content/docs/command-reference/diff.md | 2 +- content/docs/command-reference/init.md | 2 +- content/docs/command-reference/install.md | 9 ++-- .../docs/command-reference/metrics/diff.md | 14 +++--- content/docs/command-reference/repro.md | 44 +++++++++---------- content/docs/command-reference/status.md | 6 +-- 7 files changed, 43 insertions(+), 41 deletions(-) diff --git a/content/docs/command-reference/checkout.md b/content/docs/command-reference/checkout.md index 8c043f4404..9a0e4667f3 100644 --- a/content/docs/command-reference/checkout.md +++ b/content/docs/command-reference/checkout.md @@ -32,8 +32,8 @@ The execution of `dvc checkout` does the following: outputs against the actual files or directories in the workspace (similar to `dvc status`). - > Stage outputs must be defined in `dvc.yaml`. If found there but not in - > `dvc.lock`, they'll be skipped with a warning. + > Stage outputs must be defined in `dvc.yaml` (and `dvc.lock` contain their + > hash values), or they'll be skipped with a warning. - Missing data files or directories are restored from the cache. Those that don't match with `dvc.lock` or `.dvc` files are removed. 
See options `--force` @@ -67,7 +67,8 @@ progress made by the checkout. There are two methods to restore a file missing from the cache, depending on the situation. In some cases the cache can be pulled from [remote storage](/doc/command-reference/remote) using `dvc pull`. In other cases -the pipeline must be reproduced (using `dvc repro`) to regenerate its outputs. +the [pipeline](/doc/command-reference/dag) must be reproduced (using +`dvc repro`) to regenerate its outputs. ## Options diff --git a/content/docs/command-reference/diff.md b/content/docs/command-reference/diff.md index 0f9cf87aeb..6cdbb2667c 100644 --- a/content/docs/command-reference/diff.md +++ b/content/docs/command-reference/diff.md @@ -67,7 +67,7 @@ for example when `dvc init` was used with the `--no-scm` option. Useful for debug purposes. - `--hide-missing` - do not list data missing from both workspace and cache - (`not in cache`). Only list files and directories which have been expliclity + (`not in cache`). Only list files and directories which have been explicitly added, modified, or deleted. This option does nothing when comparing two Git commits. diff --git a/content/docs/command-reference/init.md b/content/docs/command-reference/init.md index 6a3cb03e46..f58d28eac4 100644 --- a/content/docs/command-reference/init.md +++ b/content/docs/command-reference/init.md @@ -51,7 +51,7 @@ initializing DVC in the Git repo root: - DVC [internals](/doc/user-guide/dvc-files-and-directories) (config file, cache directory, etc.) would be shared across different subdirectories. This forces - all of them to use the same DVC settings and + all of them to use the same DVC configuration and [remote storage](/doc/command-reference/remote). 
- By default, DVC commands like `dvc pull` and `dvc repro` explore the whole diff --git a/content/docs/command-reference/install.md b/content/docs/command-reference/install.md index 9043b14d6e..080b2bc2c4 100644 --- a/content/docs/command-reference/install.md +++ b/content/docs/command-reference/install.md @@ -106,20 +106,17 @@ repos: ``` Note that by default, the pre-commit tool only installs `pre-commit` hooks. To -enable the DVC `pre-push` and `post-checkout` hooks with pre-commit, you must -explicitly configure pre-commit to install the appropriate hook types: +enable the `post-checkout` and `pre-push` hooks, you must explicitly configure +the tool this way: ```dvc $ pre-commit install --hook-type pre-push --hook-type post-checkout ``` -This command can be run at any time before or after configuring the DVC hooks in -`.pre-commit-config.yaml`. - ## Options - `--use-pre-commit-tool` - configures DVC pre-commit, pre-push, post-checkout - Git hooks in the [pre-commit](https://pre-commit.com/) config file + Git hooks in the [pre-commit](#using-the-pre-commit-tool) config file (`.pre-commit-config.yaml`). - `-h`, `--help` - prints the usage/help message, and exit. diff --git a/content/docs/command-reference/metrics/diff.md b/content/docs/command-reference/metrics/diff.md index 54d09757ee..3d0b17479c 100644 --- a/content/docs/command-reference/metrics/diff.md +++ b/content/docs/command-reference/metrics/diff.md @@ -21,9 +21,9 @@ positional arguments: ## Description This command provides a quick way to compare metrics among experiments in the -repository history. All metrics defined in `dvc.yaml` are used by default. The -differences shown by this command include the new value, and numeric difference -(delta) from the previous value of metrics (rounded to 5 digits precision). +repository history. The differences shown by this command include the new value, +and numeric difference (delta) from the previous value of metrics (rounded to 5 +digits precision). 
`a_rev` and `b_rev` are Git commit hashes, tag, or branch names. If none are specified, `dvc metrics diff` compares metrics currently present in the @@ -31,6 +31,10 @@ specified, `dvc metrics diff` compares metrics currently present in the (required). A single specified revision results in comparing the workspace and that version. +All metrics defined in `dvc.yaml` are used by default, but specific metrics +files can be specified with the `--targets` option (note that targets don't +necessarily have to be defined in `dvc.yaml`). + Another way to display metrics is the `dvc metrics show` command, which just lists all the current metrics, without comparisons. @@ -38,8 +42,8 @@ lists all the current metrics, without comparisons. - `--targets ` - limit command scope to these metrics files. Using `-R`, directories to search metrics files in can also be given. When specifying - arguments for `--targets` before `revisions`, you should use `--` after this - option's arguments, e.g.: + arguments for `--targets` before `a_rev`/`b_rev`, you should use `--` after + this option's arguments, e.g.: ```dvc $ dvc metrics diff --targets t1.json t2.yaml -- HEAD v1 diff --git a/content/docs/command-reference/repro.md b/content/docs/command-reference/repro.md index 0075af0a61..fbd54cd030 100644 --- a/content/docs/command-reference/repro.md +++ b/content/docs/command-reference/repro.md @@ -2,8 +2,7 @@ Reproduce complete or partial [pipelines](/doc/command-reference/dag) by executing commands defined in their [stages](/doc/command-reference/run) in the -correct order. The commands to be executed are determined by recursively -analyzing dependencies and outputs of the target stages. +correct order. ## Synopsis @@ -14,34 +13,39 @@ usage: dvc repro [-h] [-q | -v] [-f] [-s] [-m] [--dry] [-i] [targets [ ...]] positional arguments: - targets Limit command scope to these .dvc or dvc.yaml files, - or stage names. + targets Stages to reproduce. 'dvc.yaml' by default. 
``` > See [`targets`](#options) for more details. ## Description -`dvc repro` provides a way to regenerate data pipeline results, by restoring the -dependency graph (a [DAG](https://en.wikipedia.org/wiki/Directed_acyclic_graph)) -implicitly defined by the stages listed in `dvc.yaml` files. The commands -defined in these stages are then executed in the correct order, reproducing -pipeline results. +Provides a way to regenerate data pipeline results, by restoring the dependency +graph (a [DAG](https://en.wikipedia.org/wiki/Directed_acyclic_graph)) implicitly +defined by the stages listed in `dvc.yaml`. The commands defined in these stages +are then executed in the correct order. + +For stages with multiple commands (having a list in the `cmd` field), commands +are run one after the other in the order they are defined. The failure of any +command halts the remaining stage execution and raises an error. > Pipeline stages are defined in `dvc.yaml` (either manually or by using > `dvc run`) while initial data dependencies can be registered with `dvc add`. -This command is similar to [Make](https://www.gnu.org/software/make/) in -software build automation, but DVC captures build requirements +`dvc repro` is similar to [Make](https://www.gnu.org/software/make/) in software +build automation, but DVC captures build requirements ([dependencies and outputs](/doc/command-reference/run#dependencies-and-outputs)) and caches the pipeline's outputs along the way. 💡 For convenience, a Git hook is available to remind you to `dvc repro` when needed after a `git commit`. See `dvc install` for more details. -By default, this command checks all [pipeline](/doc/command-reference/dag) -stages to determine which ones have changed. Then it executes the corresponding -commands (`cmd` field of `dvc.yaml`). +By default, all [pipeline](/doc/command-reference/dag) stages are checked to +determine which ones have changed. 
Then the corresponding commands (`cmd` field +of `dvc.yaml`) are executed. + +⚠️ Note that [stage](/doc/command-reference/run) outputs are deleted from the +workspace before executing the stage commands that produce them. There are a few ways to restrict what will be regenerated by this command: by specifying specific reproduction [`targets`](#options), or by using certain @@ -50,12 +54,8 @@ command [options](#options), such as `--single-item` or `--all-pipelines`. > Note that stages without dependencies are considered _always changed_, so > `dvc repro` always executes them. -[Stage](/doc/command-reference/run) outputs are deleted from the -workspace before executing the stage commands that produce them -(unless `persist: true` is used in `dvc.yaml`). - -`dvc repro` does not run `dvc fetch`, `dvc pull` or `dvc checkout` to get data -files, intermediate or final results (except if the `--pull` option is used). +This does not run `dvc fetch`, `dvc pull` or `dvc checkout` to get data files, +intermediate or final results (except if the `--pull` option is used). It stores all the data files, intermediate or final results in the cache (unless the `--no-commit` option is used), and updates the @@ -133,8 +133,8 @@ up-to-date and only execute the final stage. `dvc commit` to finish the operation. - `-m`, `--metrics` - show metrics after reproduction. The target pipelines must - have at least one metrics file defined either with the `dvc metrics` command, - or by the `-M` or `-m` options of the `dvc run` command. + have at least one metrics file defined either with `dvc metrics` or by the + `-M` or `-m` options of `dvc run`. - `--dry` - only print the commands that would be executed without actually executing the commands. diff --git a/content/docs/command-reference/status.md b/content/docs/command-reference/status.md index 0e783b3380..3ad026944a 100644 --- a/content/docs/command-reference/status.md +++ b/content/docs/command-reference/status.md @@ -158,6 +158,9 @@ that. 
```dvc $ dvc status +baz.dvc: + changed outs: + modified: baz dofoo: changed deps: modified: baz @@ -168,9 +171,6 @@ dobar: modified: foo changed outs: deleted: bar -baz.dvc: - changed outs: - modified: baz ``` This shows that for stage `dofoo`, the dependency `baz` and the output `foo` From 71cd972902fa9fa8c56648418ccca0b619669b3a Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Sun, 31 Jan 2021 15:54:29 -0600 Subject: [PATCH 4/4] glossary: there's no pipelines tooltip yet --- content/docs/command-reference/init.md | 5 +++-- content/docs/command-reference/repro.md | 3 ++- 2 files changed, 5 insertions(+), 3 deletions(-) diff --git a/content/docs/command-reference/init.md b/content/docs/command-reference/init.md index c9ba922aea..e6b741dca0 100644 --- a/content/docs/command-reference/init.md +++ b/content/docs/command-reference/init.md @@ -118,8 +118,9 @@ include: - SCM other than Git is being used. Even though there are DVC features that require DVC to be run in the Git repo, DVC can work well with other version control systems. Since DVC relies on simple `dvc.yaml` files to manage - pipelines, data, etc, they can be added into any version control - system, thus providing large data files and directories versioning. + [pipelines](/doc/command-reference/dag), data, etc, they can be added into any + version control system, thus providing large data files and directories + versioning. - There is no need to keep the history at all, e.g. having a deployment automation like running a data pipeline using `cron`. diff --git a/content/docs/command-reference/repro.md b/content/docs/command-reference/repro.md index 64871ce64e..08844a1999 100644 --- a/content/docs/command-reference/repro.md +++ b/content/docs/command-reference/repro.md @@ -69,7 +69,8 @@ files. Currently, `dvc repro` is not able to parallelize stage execution automatically. If you need to do this, you can launch `dvc repro` multiple times manually. 
For -example, let's say a pipeline graph looks something like this: +example, let's say a [pipeline](/doc/command-reference/dag) graph looks +something like this: ```dvc $ dvc dag