From d04f0824b7bf4d29e762dd4c292cfd6fc980a3ce Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Wed, 22 Jul 2020 14:44:00 -0500 Subject: [PATCH 01/54] cmd: review repro examples per https://github.com/iterative/dvc.org/pull/1572#issuecomment-661705102 --- content/docs/command-reference/repro.md | 22 +++++++++++++--------- 1 file changed, 13 insertions(+), 9 deletions(-) diff --git a/content/docs/command-reference/repro.md b/content/docs/command-reference/repro.md index 8245f80434..807db073f7 100644 --- a/content/docs/command-reference/repro.md +++ b/content/docs/command-reference/repro.md @@ -164,9 +164,8 @@ only execute the final stage. ## Examples -For simplicity, let's build a pipeline defined below. (If you want get your -hands-on something more real, see this short -[pipeline tutorial](/doc/tutorials/pipelines)). It takes this `text.txt` file: +For simplicity, let's build a pipeline defined below. It takes this `text.txt` +file: ``` dvc @@ -215,7 +214,7 @@ $ tree You may want to check the contents of `dvc.lock` and `count.txt` for later reference. -Ok, now, let's run the `dvc repro` command: +Ok, now let's run `dvc repro`: ```dvc $ dvc repro @@ -250,6 +249,8 @@ respectively. ## Example: Downstream +> This example continues the previous one. + The `--downstream` option allows us to only reproduce results from commands after a specific stage in a pipeline. To demonstrate how it works, let's make a change in `text.txt` (the input of our first stage, created in the previous @@ -268,14 +269,13 @@ $ dvc repro --downstream Data and pipelines are up to date. ``` -The reason being that the `text.txt` file is a dependency in the last stage of -the pipeline (used by default by `dvc repro`), This last `count` stage is -dependent on `filter` stage, which happens first in this pipeline (shown in the -following figure): +The reason being that the `text.txt` file is not a dependency in the last stage +of the pipeline, used as the default target by `dvc repro`. 
`text.txt` is a +dependency of the `filter` stage, which happens earlier (shown in the figure +below), so it's skipped given the `--downstream` option. ```dvc $ dvc dag - .------------. | filter | `------------' @@ -286,3 +286,7 @@ $ dvc dag | count | `---------' ``` + +> Note that using `dvc repro --downstream` without a target will always have a +> similar effect, where all previous stages are ignored — only if the last stage +> is changed will it have any effect. From a201abf259c40fde7bf4c8b4130d83f5d68f5055 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Wed, 22 Jul 2020 16:53:46 -0500 Subject: [PATCH 02/54] cmd: fic tyupo in get-url --- content/docs/command-reference/get-url.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/docs/command-reference/get-url.md b/content/docs/command-reference/get-url.md index 02356500a1..70442bb699 100644 --- a/content/docs/command-reference/get-url.md +++ b/content/docs/command-reference/get-url.md @@ -98,7 +98,7 @@ By default, DVC expects that AWS CLI is already [configured](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html). DVC will use the AWS credentials file to access S3. To override the -configuration, you can the parameters described in `dvc remote modify`. +configuration, you can use the parameters described in `dvc remote modify`. > We use the `boto3` library to and communicate with AWS. 
The following API > methods may be performed: From fe8687e08815f2e2348fda3e59d26e61a548ecfe Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Thu, 23 Jul 2020 14:08:05 -0500 Subject: [PATCH 03/54] cmd: updates to repro per https://github.com/iterative/dvc.org/pull/1572#pullrequestreview-453722844 --- content/docs/command-reference/repro.md | 30 ++++++++++++------------- 1 file changed, 14 insertions(+), 16 deletions(-) diff --git a/content/docs/command-reference/repro.md b/content/docs/command-reference/repro.md index 58503c9963..ea3fe55b02 100644 --- a/content/docs/command-reference/repro.md +++ b/content/docs/command-reference/repro.md @@ -20,11 +20,9 @@ positional arguments: `dvc repro` provides a way to regenerate data pipeline results, by restoring the dependency graph (a [DAG](https://en.wikipedia.org/wiki/Directed_acyclic_graph)) -implicitly defined by the [stages](/doc/command-reference/run) listed in the -[`dvc.yaml` file](/doc/user-guide/dvc-files-and-directories#dvcyaml-file). The -commands defined in these stages can then be executed in the correct order, -reproducing pipeline results. `dvc repro` relies on the DAG definition that it -reads from `dvc.yaml` file. +implicitly defined by the [stages](/doc/command-reference/run) listed in +`dvc.yaml`. The commands defined in these stages can then be executed in the +correct order, reproducing pipeline results. > Pipeline stages are typically defined by manually adding or editing them in a > [`dvc.yaml` file](/doc/user-guide/dvc-files-and-directories#dvcyaml-file) or @@ -39,25 +37,25 @@ and caches relevant data artifacts along the way. 💡 For convenience, a Git hook is available to remind you to `dvc repro` when needed after a `git commit`. See `dvc install` for more details. -There are a few ways to restrict the stages that will be regenerated by this -command: by specifying stages or `.dvc` files to reproduce, or by using the -`--single-item`, `--cwd`, among other options. 
- `dvc repro` does not run `dvc fetch`, `dvc pull` or `dvc checkout` to get data files, intermediate or final results. -By default, this command recursively searches in pipeline stages, starting from -the `targets`, to determine which ones have changed. Then it executes the -corresponding commands. Outputs are deleted from the -workspace before executing the stages command that produces them. +By default, this command checks all pipeline stages to determine which ones have +changed. Then it executes the corresponding commands. Outputs are +deleted from the workspace before executing the stage commands that +produce them. + +There are a few ways to restrict what will be regenerated by this command: by +specifying stage names (or `.dvc` files) as `targets`, or by using the +`--single-item`, `--cwd`, among other options. > Note that stages without dependencies are considered _always changed_, so > `dvc repro` always executes them. It saves all the data files, intermediate or final results into the DVC cache (unless the `--no-commit` option is used), and updates the hash -values of changed dependencies and outputs in the corresponding stages (or -`.dvc` files). +values of changed dependencies and outputs in the corresponding stages and +`.dvc` files ### Parallel stage execution @@ -99,7 +97,7 @@ only execute the final stage. - `-s`, `--single-item` - reproduce only a single stage by turning off the recursive search for changed dependencies. Multiple stages are executed - (non-recursively) if multiple stages are given as `targets`. + (non-recursively) if multiple stage names are given as `targets`. - `-c `, `--cwd ` - directory within the project to reproduce from. 
Instead of using `--cwd`, one can alternately specify a target in a From f3df9d5e80c035d1d7aee7c87da19d9d0e418a69 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Thu, 23 Jul 2020 14:18:29 -0500 Subject: [PATCH 04/54] cmd: rewrite repro -P desc rel https://github.com/iterative/dvc.org/pull/1572#pullrequestreview-453725568 --- content/docs/command-reference/repro.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/content/docs/command-reference/repro.md b/content/docs/command-reference/repro.md index ea3fe55b02..40f7636836 100644 --- a/content/docs/command-reference/repro.md +++ b/content/docs/command-reference/repro.md @@ -128,8 +128,8 @@ only execute the final stage. - `-p`, `--pipeline` - reproduce the entire pipelines that the `targets` belong to. Use `dvc dag ` to show the parent pipeline of a target. -- `-P`, `--all-pipelines` - reproduce all pipelines, for all the stages present - in `DVC` repository. +- `-P`, `--all-pipelines` - reproduce all pipelines for all the `dvc.yaml` files + present in the DVC repository (there can be one in every directory). - `--no-run-cache` - execute stage commands even if they have already been run with the same command/dependencies/outputs/etc before. 
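Reviewer aside on the option descriptions in this series: the sketch below is a toy model of how a target, with or without `--downstream`, could narrow the set of stages that `dvc repro` looks at. It is an illustration under stated assumptions, not DVC's implementation. The `PIPELINE` mapping and the `upstream_of` and `downstream_of` helpers are hypothetical names, and the two stages (`filter`, then `count`) are borrowed from the repro examples.

```python
# Toy model only: stage names come from the repro examples,
# the data structure and helpers are hypothetical.
# Each stage maps to the list of stages it depends on.
PIPELINE = {
    "filter": [],
    "count": ["filter"],
}


def upstream_of(target, deps):
    """Return the target and everything it depends on, dependencies first."""
    ordered = []

    def visit(stage):
        if stage in ordered:
            return
        for dep in deps[stage]:
            visit(dep)
        ordered.append(stage)

    visit(target)
    return ordered


def downstream_of(target, deps):
    """Return the target plus every stage that depends on it, directly or not."""
    selected = {target}
    grew = True
    while grew:
        grew = False
        for stage, stage_deps in deps.items():
            if stage not in selected and any(d in selected for d in stage_deps):
                selected.add(stage)
                grew = True
    return selected


# Assumed analogue of `dvc repro count`: the target and its dependencies.
print(upstream_of("count", PIPELINE))
# Assumed analogue of `dvc repro --downstream count`: the target and what follows it.
print(downstream_of("count", PIPELINE))
```

With `filter` as the target, `downstream_of` selects both stages, which is in line with the option text saying the target stages themselves are included.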
From ebc1560584dbcb2c8c6c8f829ec32213175259e5 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Thu, 23 Jul 2020 14:35:43 -0500 Subject: [PATCH 05/54] cmd: simplified and generalize repro targets desc and DVC file mention per https://github.com/iterative/dvc.org/pull/1572#discussion_r459675120 and https://github.com/iterative/dvc.org/pull/1572#discussion_r459675807 --- content/docs/command-reference/repro.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/content/docs/command-reference/repro.md b/content/docs/command-reference/repro.md index 40f7636836..9314a82d35 100644 --- a/content/docs/command-reference/repro.md +++ b/content/docs/command-reference/repro.md @@ -46,16 +46,16 @@ deleted from the workspace before executing the stage commands that produce them. There are a few ways to restrict what will be regenerated by this command: by -specifying stage names (or `.dvc` files) as `targets`, or by using the -`--single-item`, `--cwd`, among other options. +specifying stages as `targets`, or by using the `--single-item`, `--cwd`, among +other options. > Note that stages without dependencies are considered _always changed_, so > `dvc repro` always executes them. It saves all the data files, intermediate or final results into the DVC cache (unless the `--no-commit` option is used), and updates the hash -values of changed dependencies and outputs in the corresponding stages and -`.dvc` files +values of changed dependencies and outputs in the DVC files (`dvc.lock` and +`.dvc`). 
### Parallel stage execution From 9a8c9777589bb91629a6d905c7de662e5e2187f5 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Thu, 23 Jul 2020 17:07:54 -0500 Subject: [PATCH 06/54] cmd: minor update for repro desc wording per https://github.com/iterative/dvc.org/pull/1572#discussion_r459684277 --- content/docs/command-reference/repro.md | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/content/docs/command-reference/repro.md b/content/docs/command-reference/repro.md index 9314a82d35..8f7d6284b0 100644 --- a/content/docs/command-reference/repro.md +++ b/content/docs/command-reference/repro.md @@ -54,8 +54,7 @@ other options. It saves all the data files, intermediate or final results into the DVC cache (unless the `--no-commit` option is used), and updates the hash -values of changed dependencies and outputs in the DVC files (`dvc.lock` and -`.dvc`). +values of changed dependencies and outputs in the `dvc.lock` and `.dvc` files. ### Parallel stage execution From c38ce31b9f85d03d113fa298fcccf0deac7f643e Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Fri, 24 Jul 2020 11:06:03 -0500 Subject: [PATCH 07/54] term: don't use "synchronize" in the context of checkout --- content/docs/command-reference/install.md | 7 +++---- content/docs/start/data-versioning.md | 2 +- 2 files changed, 4 insertions(+), 5 deletions(-) diff --git a/content/docs/command-reference/install.md b/content/docs/command-reference/install.md index d04a692973..c3c3954cef 100644 --- a/content/docs/command-reference/install.md +++ b/content/docs/command-reference/install.md @@ -25,8 +25,7 @@ Namely: [DVC-files](/doc/user-guide/dvc-files-and-directories) corresponding to that version. The project's DVC-files in turn refer to data stored in cache, but not necessarily in the workspace. Normally, -it would be necessary to use `dvc checkout` to synchronize workspace and -DVC-files. +it would be necessary to use `dvc checkout` to update the workspace accordingly. 
This hook automates `dvc checkout` after `git checkout`. @@ -49,7 +48,7 @@ This hook automates `dvc push` before `git push`. ## Installed Git hooks - A `post-checkout` hook executes `dvc checkout` after `git checkout` to - automatically synchronize the data files with the new workspace state. + automatically update the workspace with the correct data file versions. - A `pre-commit` hook executes `dvc status` before `git commit` to inform the user about the differences between cache and workspace. - A `pre-push` hook executes `dvc push` before `git push` to upload files and @@ -300,5 +299,5 @@ Data and pipelines are up to date. After reproducing this pipeline up to the "evaluate" stage, the data files are in sync with the code/config files, but we must now commit the changes with Git. Looking closely we see that `dvc status` is used again, informing us that the -data files are synchronized with the `Data and pipelines are up to date.` +data files have been updated, with the `Data and pipelines are up to date.` message. diff --git a/content/docs/start/data-versioning.md b/content/docs/start/data-versioning.md index 8f2c848c0a..5ad583b364 100644 --- a/content/docs/start/data-versioning.md +++ b/content/docs/start/data-versioning.md @@ -232,7 +232,7 @@ $ git commit data/data.xml.dvc -m "Revert dataset updates" Yes, DVC is technically even not a version control system! `.dvc` files content defines data file versions. Git itself serves as the version control system. DVC in turn creates these `.dvc` files, updates them, and synchronizes DVC-tracked -data in the workspace efficiently to match them. +data in the workspace efficiently to match them. 
## Large datasets versioning From 34217b8e2b7e11d178053b844a3b5a2a1a4105d6 Mon Sep 17 00:00:00 2001 From: sarthakforwet Date: Fri, 24 Jul 2020 23:07:27 +0530 Subject: [PATCH 08/54] cmd: rewrite Downstream example and added info for sequential execution of stages --- content/docs/command-reference/repro.md | 65 +++++++++++++++---------- 1 file changed, 38 insertions(+), 27 deletions(-) diff --git a/content/docs/command-reference/repro.md b/content/docs/command-reference/repro.md index 9314a82d35..7422d14c6a 100644 --- a/content/docs/command-reference/repro.md +++ b/content/docs/command-reference/repro.md @@ -54,8 +54,7 @@ other options. It saves all the data files, intermediate or final results into the DVC cache (unless the `--no-commit` option is used), and updates the hash -values of changed dependencies and outputs in the DVC files (`dvc.lock` and -`.dvc`). +values of changed dependencies and outputs in the `dvc.lock` and `.dvc` files. ### Parallel stage execution @@ -83,11 +82,12 @@ $ dvc dag ``` This pipeline consists of two parallel branches (`A` and `B`), and the final -"result" stage, where the branches merge. To reproduce both branches at the same -time, you could run `dvc repro A2` and `dvc repro B2` at the same time (e.g. in -separate terminals). After both finish successfully, you can then run -`dvc repro train`: DVC will know that both branches are already up-to-date and -only execute the final stage. +"result" stage, where the branches merge. If you run `dvc repro` at this point, +it would reproduce the complete pipeline with all stages executing sequentially. +To reproduce both branches at the same time, you could run `dvc repro A2` and +`dvc repro B2` at the same time (e.g. in separate terminals). After both finish +successfully, you can then run `dvc repro train`: DVC will know that both +branches are already up-to-date and only execute the final stage. ## Options @@ -151,7 +151,8 @@ only execute the final stage. 
each execution, meaning the cache cannot be trusted for such stages. - `--downstream` - only execute the stages after the given `targets` in their - corresponding pipelines, including the target stages themselves. + corresponding pipelines, including the target stages themselves. This option + doesn't have any effect if no `targets` are provided. - `-h`, `--help` - prints the usage/help message, and exit. @@ -262,31 +263,41 @@ The answer to universe is 42 - The Hitchhiker's Guide to the Galaxy ``` -Now, using the `--downstream` option results in the following output: +And add a new stage to the pipeline: ```dvc -$ dvc repro --downstream +$ dvc run -n final -d count.txt -o alphabet.txt \ + "cat count.txt | egrep -o '[a-zA-Z]+' > alphabet.txt" +``` + +Now, using the `--downstream` option with `count` as a target stage results in +the following output: + +```dvc +$ dvc repro --downstream count Data and pipelines are up to date. ``` -The reason being that the `text.txt` file is not a dependency in the last stage -of the pipeline, used as the default target by `dvc repro`. `text.txt` is a -dependency of the `filter` stage, which happens earlier (shown in the figure -below), so it's skipped given the `--downstream` option. +The reason being that the `text.txt` file is a dependency in the `filter` stage +of the pipeline which happens before the `count` stage (shown in the following +figure) and hence did not get updated. ```dvc $ dvc dag - .------------. - | filter | - `------------' - * - * - * - .---------. - | count | - `---------' -``` -> Note that using `dvc repro --downstream` without a target will always have a -> similar effect, where all previous stages are ignored — only if the last stage -> is changed will it have any effect. 
+ +--------+ + | filter | + +--------+ + * + * + * + +-------+ + | count | + +-------+ + * + * + * + +-------+ + | final | + +-------+ +``` From a071b7d83c95cbd011ae5e4bb6cc19bc7799c998 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Fri, 24 Jul 2020 15:15:09 -0500 Subject: [PATCH 09/54] Update content/docs/command-reference/repro.md --- content/docs/command-reference/repro.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/docs/command-reference/repro.md b/content/docs/command-reference/repro.md index 7422d14c6a..fd1c0902c7 100644 --- a/content/docs/command-reference/repro.md +++ b/content/docs/command-reference/repro.md @@ -82,7 +82,7 @@ $ dvc dag ``` This pipeline consists of two parallel branches (`A` and `B`), and the final -"result" stage, where the branches merge. If you run `dvc repro` at this point, +`train` stage, where the branches merge. If you run `dvc repro` at this point, it would reproduce the complete pipeline with all stages executing sequentially. To reproduce both branches at the same time, you could run `dvc repro A2` and `dvc repro B2` at the same time (e.g. in separate terminals). After both finish From edff33eb8ab738a20c862d027218f4801f2a1809 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Fri, 24 Jul 2020 15:16:34 -0500 Subject: [PATCH 10/54] Update content/docs/command-reference/repro.md --- content/docs/command-reference/repro.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/docs/command-reference/repro.md b/content/docs/command-reference/repro.md index fd1c0902c7..97699311ef 100644 --- a/content/docs/command-reference/repro.md +++ b/content/docs/command-reference/repro.md @@ -84,7 +84,7 @@ $ dvc dag This pipeline consists of two parallel branches (`A` and `B`), and the final `train` stage, where the branches merge. If you run `dvc repro` at this point, it would reproduce the complete pipeline with all stages executing sequentially. 
-To reproduce both branches at the same time, you could run `dvc repro A2` and +To reproduce both branches simultaneously, you could run `dvc repro A2` and `dvc repro B2` at the same time (e.g. in separate terminals). After both finish successfully, you can then run `dvc repro train`: DVC will know that both branches are already up-to-date and only execute the final stage. From 71a5088371ea48ce19fb9de0db8004a9d7fb7284 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Fri, 24 Jul 2020 15:19:55 -0500 Subject: [PATCH 11/54] Update content/docs/command-reference/repro.md --- content/docs/command-reference/repro.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/docs/command-reference/repro.md b/content/docs/command-reference/repro.md index 97699311ef..1da6e31047 100644 --- a/content/docs/command-reference/repro.md +++ b/content/docs/command-reference/repro.md @@ -152,7 +152,7 @@ branches are already up-to-date and only execute the final stage. - `--downstream` - only execute the stages after the given `targets` in their corresponding pipelines, including the target stages themselves. This option - doesn't have any effect if no `targets` are provided. + has no effect if no `targets` are provided. - `-h`, `--help` - prints the usage/help message, and exit. From 163ed1982c41a99b0f1756f02a5a9017bfab20db Mon Sep 17 00:00:00 2001 From: sarthakforwet Date: Sat, 25 Jul 2020 12:41:34 +0530 Subject: [PATCH 12/54] cmd: Updated Downstream example --- content/docs/command-reference/repro.md | 37 +++++++++++++------------ 1 file changed, 19 insertions(+), 18 deletions(-) diff --git a/content/docs/command-reference/repro.md b/content/docs/command-reference/repro.md index 1da6e31047..abd8351af3 100644 --- a/content/docs/command-reference/repro.md +++ b/content/docs/command-reference/repro.md @@ -83,11 +83,11 @@ $ dvc dag This pipeline consists of two parallel branches (`A` and `B`), and the final `train` stage, where the branches merge. 
If you run `dvc repro` at this point, -it would reproduce the complete pipeline with all stages executing sequentially. -To reproduce both branches simultaneously, you could run `dvc repro A2` and -`dvc repro B2` at the same time (e.g. in separate terminals). After both finish -successfully, you can then run `dvc repro train`: DVC will know that both -branches are already up-to-date and only execute the final stage. +it would reproduce each branch sequentially before train. To reproduce both +branches simultaneously, you could run `dvc repro A2` and `dvc repro B2` at the +same time (e.g. in separate terminals). After both finish successfully, you can +then run `dvc repro train`: DVC will know that both branches are already +up-to-date and only execute the final stage. ## Options @@ -152,7 +152,7 @@ branches are already up-to-date and only execute the final stage. - `--downstream` - only execute the stages after the given `targets` in their corresponding pipelines, including the target stages themselves. This option - has no effect if no `targets` are provided. + has no effect if `targets` are not provided. - `-h`, `--help` - prints the usage/help message, and exit. @@ -263,19 +263,26 @@ The answer to universe is 42 - The Hitchhiker's Guide to the Galaxy ``` -And add a new stage to the pipeline: +And update the `process.py` file to count the number of digits. -```dvc -$ dvc run -n final -d count.txt -o alphabet.txt \ - "cat count.txt | egrep -o '[a-zA-Z]+' > alphabet.txt" +```python +import sys +num_digits = 0 +with open(sys.argv[1], 'r') as f: + for number in f: + num_digits += len(number) - 1 +print("Number of digits:",end=" ") +print(num_digits) ``` -Now, using the `--downstream` option with `count` as a target stage results in +Now, using the `--downstream` option with `count` as a target stage, results in the following output: ```dvc $ dvc repro --downstream count -Data and pipelines are up to date. 
+Running stage 'count' with command: + python3 process.py numbers.txt > count.txt +Updating lock file 'dvc.lock' ``` The reason being that the `text.txt` file is a dependency in the `filter` stage @@ -293,11 +300,5 @@ $ dvc dag * +-------+ | count | - +-------+ - * - * - * - +-------+ - | final | +-------+ ``` From cf873a480b7f0aceedee02c6cf01f75179101ce4 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Mon, 27 Jul 2020 17:38:03 -0500 Subject: [PATCH 13/54] Update content/docs/command-reference/repro.md --- content/docs/command-reference/repro.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/docs/command-reference/repro.md b/content/docs/command-reference/repro.md index abd8351af3..f1657c63da 100644 --- a/content/docs/command-reference/repro.md +++ b/content/docs/command-reference/repro.md @@ -83,7 +83,7 @@ $ dvc dag This pipeline consists of two parallel branches (`A` and `B`), and the final `train` stage, where the branches merge. If you run `dvc repro` at this point, -it would reproduce each branch sequentially before train. To reproduce both +it would reproduce each branch sequentially before `train`. To reproduce both branches simultaneously, you could run `dvc repro A2` and `dvc repro B2` at the same time (e.g. in separate terminals). 
After both finish successfully, you can then run `dvc repro train`: DVC will know that both branches are already From ca04fb0becba7dc250c2f292f32f5ebc7d4172e9 Mon Sep 17 00:00:00 2001 From: sarthakforwet Date: Tue, 28 Jul 2020 12:00:03 +0530 Subject: [PATCH 14/54] repro: Updated Downstream example --- content/docs/command-reference/repro.md | 23 ++++++++++------------- 1 file changed, 10 insertions(+), 13 deletions(-) diff --git a/content/docs/command-reference/repro.md b/content/docs/command-reference/repro.md index abd8351af3..a251f1955c 100644 --- a/content/docs/command-reference/repro.md +++ b/content/docs/command-reference/repro.md @@ -240,7 +240,7 @@ If we now run `dvc repro`, we should see this: $ dvc repro Stage 'filter' didn't change, skipping Running stage 'count' with command: - python3 process.py numbers.txt > count.txt + python process.py numbers.txt > count.txt Updating lock file 'dvc.lock' ``` @@ -263,16 +263,12 @@ The answer to universe is 42 - The Hitchhiker's Guide to the Galaxy ``` -And update the `process.py` file to count the number of digits. +Let's say we want to print the filename also in the description and so we update +the `process.py` as: ```python -import sys -num_digits = 0 -with open(sys.argv[1], 'r') as f: - for number in f: - num_digits += len(number) - 1 -print("Number of digits:",end=" ") -print(num_digits) +print('Number of lines in %s:'%(sys.argv[1])) +print(num_lines) ``` Now, using the `--downstream` option with `count` as a target stage, results in @@ -281,13 +277,14 @@ the following output: ```dvc $ dvc repro --downstream count Running stage 'count' with command: - python3 process.py numbers.txt > count.txt + python process.py numbers.txt > count.txt Updating lock file 'dvc.lock' ``` -The reason being that the `text.txt` file is a dependency in the `filter` stage -of the pipeline which happens before the `count` stage (shown in the following -figure) and hence did not get updated. 
+The change in the `text.txt` file is ignored as it is a dependency in the +`filter` stage which did not get updated in the above command. This is because +the `filter` stage happens before the `count` stage in the pipeline (shown in +the following figure). ```dvc $ dvc dag From 30ce7bb7967432391a12685aa23cee820906c5d0 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Wed, 29 Jul 2020 02:13:25 -0500 Subject: [PATCH 15/54] Update content/docs/command-reference/repro.md --- content/docs/command-reference/repro.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/content/docs/command-reference/repro.md b/content/docs/command-reference/repro.md index b5e69ef132..c0388751e6 100644 --- a/content/docs/command-reference/repro.md +++ b/content/docs/command-reference/repro.md @@ -299,3 +299,5 @@ $ dvc dag | count | +-------+ ``` + +> Refer to `dvc dag` for more details on that command. From 66e0603c960af396429e5510f4f360e4dffccedc Mon Sep 17 00:00:00 2001 From: sarthakforwet Date: Thu, 30 Jul 2020 00:31:02 +0530 Subject: [PATCH 16/54] cmd: updated last para for the description of --downstream and improved formatting --- content/docs/command-reference/repro.md | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/content/docs/command-reference/repro.md b/content/docs/command-reference/repro.md index c0388751e6..22b03c6878 100644 --- a/content/docs/command-reference/repro.md +++ b/content/docs/command-reference/repro.md @@ -267,7 +267,7 @@ Let's say we want to print the filename also in the description and so we update the `process.py` as: ```python -print('Number of lines in %s:'%(sys.argv[1])) +print(f'Number of lines in {sys.argv[1]}:') print(num_lines) ``` @@ -281,10 +281,11 @@ Running stage 'count' with command: Updating lock file 'dvc.lock' ``` -The change in the `text.txt` file is ignored as it is a dependency in the -`filter` stage which did not get updated in the above command. 
This is because -the `filter` stage happens before the `count` stage in the pipeline (shown in -the following figure). +The change in the `text.txt` file is ignored because that file is a dependency +in the `filter` stage, which did not get updated in the above command. This is +because `filter` happens before `count` in the pipeline (shown below) and the +`--downstream` option only executes the stages after a given target stage +(`count` in this case), including the target stage itself. ```dvc $ dvc dag From e5320105b6f6ca2f4f5d1898c0ee4494aeb46d74 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Thu, 30 Jul 2020 13:38:15 -0500 Subject: [PATCH 17/54] cmd: review language of init --subdir --- content/docs/command-reference/init.md | 39 +++++++++++++------------- 1 file changed, 19 insertions(+), 20 deletions(-) diff --git a/content/docs/command-reference/init.md b/content/docs/command-reference/init.md index 11b87d4716..1e74fa6ae0 100644 --- a/content/docs/command-reference/init.md +++ b/content/docs/command-reference/init.md @@ -18,7 +18,7 @@ The command [options](#options) can be used to start an alternative workflow for advanced scenarios: - [Initializing DVC in subdirectories](#initializing-dvc-in-subdirectories) - - support for monorepos, nested DVC projects, etc. + support for monorepos and nested DVC projects. - [Initializing DVC without Git](#initializing-dvc-without-git) - support for SCM other than Git, deployment automation cases, etc. @@ -32,10 +32,10 @@ that are hidden from the user. This directory is automatically staged with `--subdir` must be provided to initialize DVC in a subdirectory of a Git repository. DVC still expects to find the Git repository (will check all -directories up to the root to find `.git`). This options does not affect any -config files, `.dvc` directory is created the same way as in the default mode. 
-This way multiple DVC projects (including nested ones) could be initialized in a -single Git repository providing isolation and granular project management. +directories up to the system root to find `.git`). This options does not affect +any config files, `.dvc/` directory is created the same way as in the default +mode. This way multiple DVC projects can be initialized in a single +Git repository, providing isolation and granular project management. #### When is this useful? @@ -47,7 +47,7 @@ Let's imagine we have an existing Git repository that is split into sub-projects (monorepo). In this case `dvc init --subdir` can be run in one or many sub-projects to mitigate the issues of initializing in the Git repository root: -- Repository maintainers might not allow extra `.dvc` top level directory, +- Repository maintainers might not allow extra `.dvc/` top level directory, especially if DVC is being used by a small number of sub-projects. - Not enough isolation/granularity - DVC config, cache, and other files are @@ -64,16 +64,15 @@ sub-projects to mitigate the issues of initializing in the Git repository root: #### How does it affect DVC commands? -No matter what mode is used, DVC looks for the `.dvc` directory when it starts -(from the current working directory and up). Location of the found `.dvc` -directory determines the root of the DVC project. (In case of `--subdir` it -might happen that Git repository root is located at different path than the DVC -project root.) +No matter what mode is used, DVC looks for the `.dvc/` directory when it starts +(from the current working directory and up). Its location determines the root of +the DVC projects. With `--subdir`, it might happen that the Git +repository root is in a different location than the DVC project root. -DVC project root defines the scope for most DVC commands. Mostly meaning that +The project root defines the scope for most DVC commands. 
Mostly meaning that all `dvc.yaml` and `.dvc` files under the root path are being analyzed. -If there are multiple DVC sub-projects but they _are not_ nested, e.g.: +If there are multiple DVC sub-projects, but they are not nested, e.g.: ``` . @@ -87,12 +86,12 @@ If there are multiple DVC sub-projects but they _are not_ nested, e.g.: │ ... ``` -DVC considers them a two separate DVC projects. Any DVC command that is being -run in the `project-A` is not aware about DVC `project-B`. DVC does not consider -Git repository root an initialized DVC project in this case and commands that -require DVC project will raise an error. +DVC considers them separate DVC projects. Any DVC command that is being run in +`project-A` is not aware about `project-B`. DVC does not consider the Git +repository root a DVC project in this case, and commands that require an +initialized project will produce an error. -On the other hand, if there _are_ nested DVC projects, e.g.: +On the other hand, if there are nested DVC projects, e.g.: ``` project-A @@ -105,8 +104,8 @@ project-A │ ... ``` -Nothing changes for the `project-B`. But for any DVC command being run in the -`project-A` ignores the whole directory `project-B/`, meaning for example: +Nothing changes for `project-B`. But any DVC command being run in `project-A` +ignores the whole directory `project-B/`. For example ```dvc $ cd project-A From b0fe9c1bdd4319afaa7fd2c63f025c2dbabddeae Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Thu, 30 Jul 2020 13:49:05 -0500 Subject: [PATCH 18/54] term: revuew usage of "granular", esp. 
around init --subdir --- content/docs/command-reference/init.md | 22 +++++++++---------- .../what-is-dvc/related-technologies.md | 2 +- 2 files changed, 12 insertions(+), 12 deletions(-) diff --git a/content/docs/command-reference/init.md b/content/docs/command-reference/init.md index 1e74fa6ae0..6f89089553 100644 --- a/content/docs/command-reference/init.md +++ b/content/docs/command-reference/init.md @@ -35,13 +35,13 @@ repository. DVC still expects to find the Git repository (will check all directories up to the system root to find `.git`). This options does not affect any config files, `.dvc/` directory is created the same way as in the default mode. This way multiple DVC projects can be initialized in a single -Git repository, providing isolation and granular project management. +Git repository, providing isolation between projects. #### When is this useful? `--subdir` is mostly used in the scenario of a [monorepo](https://en.wikipedia.org/wiki/Monorepo), but also can be used in -other workflows when such isolation and/or advanced granularity is needed. +other workflows when such isolation is needed. Let's imagine we have an existing Git repository that is split into sub-projects (monorepo). In this case `dvc init --subdir` can be run in one or many @@ -50,17 +50,17 @@ sub-projects to mitigate the issues of initializing in the Git repository root: - Repository maintainers might not allow extra `.dvc/` top level directory, especially if DVC is being used by a small number of sub-projects. -- Not enough isolation/granularity - DVC config, cache, and other files are - shared across different sub-projects. Means that it's not easy to use - different remote storages, for example, for different sub-projects, etc. +- Not enough isolation - DVC config, cache, and other files are shared across + different sub-projects. Means that it's not easy to use different remote + storages, for example, for different sub-projects, etc. 
-- Not enough isolation/granularity - commands like `dvc pull`, `dvc checkout`, - and others analyze the whole repository to look for `dvc.yaml` or `.dvc` files - to download files and directories, to reproduce pipelines, etc. - It can be expensive in the large repositories with a lot of projects. +- Not enough isolation - commands like `dvc pull`, `dvc checkout`, and others + analyze the whole repository to look for `dvc.yaml` or `.dvc` files to + download files and directories, to reproduce pipelines, etc. It + can be expensive in the large repositories with a lot of projects. -- Not enough isolation/granularity - commands like `dvc metrics diff`, `dvc dag` - and others by default dump all the metrics, all the pipelines, etc. +- Not enough isolation - commands like `dvc metrics diff`, `dvc dag` and others + by default dump all the metrics, all the pipelines, etc. #### How does it affect DVC commands? diff --git a/content/docs/user-guide/what-is-dvc/related-technologies.md b/content/docs/user-guide/what-is-dvc/related-technologies.md index 67a0333c31..86049754fa 100644 --- a/content/docs/user-guide/what-is-dvc/related-technologies.md +++ b/content/docs/user-guide/what-is-dvc/related-technologies.md @@ -118,7 +118,7 @@ Luigi, etc. - DVC does not add any hooks to the Git repo by default. To checkout data files, the `dvc checkout` command has to be run after each `git checkout` and - `git clone` command. It gives more granularity on managing data and code + `git clone` command. It provides control for managing data and code separately. Hooks could be configured to make workflows simpler. 
- DVC attempts to use reflinks\* and has other From 8597f533b21a853ad64bedd6aa266344b8546b1f Mon Sep 17 00:00:00 2001 From: sarthakforwet Date: Fri, 31 Jul 2020 01:02:25 +0530 Subject: [PATCH 19/54] repro.md: updated Downstream example --- content/docs/command-reference/repro.md | 12 +++++------- 1 file changed, 5 insertions(+), 7 deletions(-) diff --git a/content/docs/command-reference/repro.md b/content/docs/command-reference/repro.md index 22b03c6878..a273651d0d 100644 --- a/content/docs/command-reference/repro.md +++ b/content/docs/command-reference/repro.md @@ -271,8 +271,8 @@ print(f'Number of lines in {sys.argv[1]}:') print(num_lines) ``` -Now, using the `--downstream` option with `count` as a target stage, results in -the following output: +Now, using the `--downstream` option with `dvc repro`, results in the execution +of stages after the target stage (`count` in this case) in the pipeline. ```dvc $ dvc repro --downstream count @@ -281,11 +281,9 @@ Running stage 'count' with command: Updating lock file 'dvc.lock' ``` -The change in the `text.txt` file is ignored because that file is a dependency -in the `filter` stage, which did not get updated in the above command. This is -because `filter` happens before `count` in the pipeline (shown below) and the -`--downstream` option only execute the stages after a given target stage -(`count` in this case), including the target stage itself. +The change in `text.txt` is ignored because that file is a dependency in the +`filter` stage, which did not get updated in the above command. This is because +`filter` happens before `count` in the pipeline (shown below). 
```dvc $ dvc dag From 8ca9134022a58c01b5d2c2cef625a5b326425735 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Thu, 30 Jul 2020 15:28:45 -0500 Subject: [PATCH 20/54] cmd: improve init --subdir explanation --- content/docs/command-reference/init.md | 72 +++++++++++--------------- 1 file changed, 30 insertions(+), 42 deletions(-) diff --git a/content/docs/command-reference/init.md b/content/docs/command-reference/init.md index 6f89089553..21571df05b 100644 --- a/content/docs/command-reference/init.md +++ b/content/docs/command-reference/init.md @@ -32,66 +32,60 @@ that are hidden from the user. This directory is automatically staged with `--subdir` must be provided to initialize DVC in a subdirectory of a Git repository. DVC still expects to find the Git repository (will check all -directories up to the system root to find `.git`). This options does not affect +directories up to the system root to find `.git/`). This options does not affect any config files, `.dvc/` directory is created the same way as in the default mode. This way multiple DVC projects can be initialized in a single Git repository, providing isolation between projects. #### When is this useful? -`--subdir` is mostly used in the scenario of a -[monorepo](https://en.wikipedia.org/wiki/Monorepo), but also can be used in -other workflows when such isolation is needed. +This option is mostly used in the scenario of a +[monorepo](https://en.wikipedia.org/wiki/Monorepo) (Git repository split into +several project directories), but can also be used with other patterns when such +isolation is needed. `dvc init --subdir` mitigates the issues of initializing +DVC in the Git repository root: -Let's imagine we have an existing Git repository that is split into sub-projects -(monorepo). 
In this case `dvc init --subdir` can be run in one or many -sub-projects to mitigate the issues of initializing in the Git repository root: +- Repository maintainers might not allow a top level `.dvc/` directory, + especially if DVC is being used by several sub-projects (monorepo). -- Repository maintainers might not allow extra `.dvc/` top level directory, - especially if DVC is being used by a small number of sub-projects. +- DVC config file, cache directory, + [etc.](/doc/user-guide/dvc-files-and-directories) are shared across different + sub-projects. This makes it difficult to use different DVC settings, + [remote storage](/doc/command-reference/remote) locations, etc. -- Not enough isolation - DVC config, cache, and other files are shared across - different sub-projects. Means that it's not easy to use different remote - storages, for example, for different sub-projects, etc. - -- Not enough isolation - commands like `dvc pull`, `dvc checkout`, and others - analyze the whole repository to look for `dvc.yaml` or `.dvc` files to - download files and directories, to reproduce pipelines, etc. It - can be expensive in the large repositories with a lot of projects. - -- Not enough isolation - commands like `dvc metrics diff`, `dvc dag` and others - by default dump all the metrics, all the pipelines, etc. +- DVC commands like `dvc pull`, `dvc checkout`, `dvc metrics diff`, and plenty + others analyze the whole repository to find `dvc.yaml` and `.dvc` files to + work with. This may be undesirable and it's inefficient for large repositories + containing many projects. #### How does it affect DVC commands? -No matter what mode is used, DVC looks for the `.dvc/` directory when it starts -(from the current working directory and up). Its location determines the root of -the DVC projects. With `--subdir`, it might happen that the Git -repository root is in a different location than the DVC project root. - -The project root defines the scope for most DVC commands. 
Mostly meaning that -all `dvc.yaml` and `.dvc` files under the root path are being analyzed. +The project root defines the scope for most DVC commands, meaning +that only `dvc.yaml` and `.dvc` files under that location are used by DVC +commands. To determines the root of the project, DVC looks for `.dvc/` from the +current working directory and up. With `--subdir`though, the Git repository root +will be different (superior) to the project root. -If there are multiple DVC sub-projects, but they are not nested, e.g.: +If there are multiple sub-projects, but they are not nested, e.g.: ``` . ├── .git | ├── project-A -│   └── .dvc +│   ├── .dvc │ ... ├── project-B -│ └── .dvc -│ ... +│ ├── .dvc +... ``` -DVC considers them separate DVC projects. Any DVC command that is being run in +DVC considers them separate projects. Any DVC command that is being run in `project-A` is not aware about `project-B`. DVC does not consider the Git repository root a DVC project in this case, and commands that require an initialized project will produce an error. -On the other hand, if there are nested DVC projects, e.g.: +On the other hand, if there are nested projects, e.g.: ``` project-A @@ -101,18 +95,12 @@ project-A └── project-B ├── .dvc ├── data-B.dvc - │ ... + ... ``` Nothing changes for `project-B`. But any DVC command being run in `project-A` -ignores the whole directory `project-B/`. For example - -```dvc -$ cd project-A -$ dvc pull -``` - -won't download or checkout data for the `data-B.dvc` file. +ignores the nested directory `project-B/`. For example, `dvc pull` (run in +`project-A/`) wouldn't download data for the `data-B.dvc` file. 
### Initializing DVC without Git From 8134010b2eeadd275accc8c0e858690e7cb95f42 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Thu, 30 Jul 2020 16:08:23 -0500 Subject: [PATCH 21/54] cmd: add info about nested subrepos to init --- content/docs/command-reference/init.md | 56 ++++++++++++++++++-------- 1 file changed, 40 insertions(+), 16 deletions(-) diff --git a/content/docs/command-reference/init.md b/content/docs/command-reference/init.md index 21571df05b..11c9563a59 100644 --- a/content/docs/command-reference/init.md +++ b/content/docs/command-reference/init.md @@ -39,7 +39,7 @@ Git repository, providing isolation between projects. #### When is this useful? -This option is mostly used in the scenario of a +This option is mostly useful in the scenario of a [monorepo](https://en.wikipedia.org/wiki/Monorepo) (Git repository split into several project directories), but can also be used with other patterns when such isolation is needed. `dvc init --subdir` mitigates the issues of initializing @@ -53,25 +53,24 @@ DVC in the Git repository root: sub-projects. This makes it difficult to use different DVC settings, [remote storage](/doc/command-reference/remote) locations, etc. -- DVC commands like `dvc pull`, `dvc checkout`, `dvc metrics diff`, and plenty - others analyze the whole repository to find `dvc.yaml` and `.dvc` files to - work with. This may be undesirable and it's inefficient for large repositories - containing many projects. +- Many DVC commands can explore the whole DVC repository to find + DVC-tracked data and pipelines to work with. This can be undesirable and + inefficient for large monorepos. #### How does it affect DVC commands? -The project root defines the scope for most DVC commands, meaning -that only `dvc.yaml` and `.dvc` files under that location are used by DVC -commands. To determines the root of the project, DVC looks for `.dvc/` from the -current working directory and up. 
With `--subdir`though, the Git repository root -will be different (superior) to the project root. +The project root defines the possible scope of action for most DVC +commands (e.g. `dvc repro`, `dvc pull`, `dvc metrics diff`), meaning that only +`dvc.yaml`, `dvc.lock`, and `.dvc` files under that location are usable by the +commands. To determine the root of the project, DVC looks for `.dvc/` from the +current working directory, up. With `--subdir`, the project root will be found +before the Git repo root, reducing this scope. -If there are multiple sub-projects, but they are not nested, e.g.: +If there are multiple `--subdir` projects, but they are not nested, e.g.: ``` . ├── .git -| ├── project-A │   ├── .dvc │ ... @@ -85,7 +84,7 @@ DVC considers them separate projects. Any DVC command that is being run in repository root a DVC project in this case, and commands that require an initialized project will produce an error. -On the other hand, if there are nested projects, e.g.: +On the other hand, if there are nested `--subdir` projects, e.g.: ``` project-A @@ -98,9 +97,34 @@ project-A ... ``` -Nothing changes for `project-B`. But any DVC command being run in `project-A` -ignores the nested directory `project-B/`. For example, `dvc pull` (run in -`project-A/`) wouldn't download data for the `data-B.dvc` file. +Nothing changes for inner `project-B`. But any DVC command being run in outer +`project-A` ignores the nested directory `project-B/`. For example, `dvc pull` +(run in `project-A/`) wouldn't download data for the `data-B.dvc` file. + +### Nested DVC repositories + +Similarly, entire DVC repositories (each with it's own Git repo) +can be nested. For example: + +``` +project-A +├── .git +├── .dvc +├── project-B +│   ├── .git +│   ├── .dvc +│ ... +├── project-C # initialized with +│   ├── .dvc # --subdir or --no-scm +... +``` + +> This is a questionable Git practice though, unless employing submodules. 
+ +In these cases all projects are also isolated from each other and commands run +in one or the other only affects itself. Inner ones wouldn't search for +data/pipelines above them anyway, and outer ones know to ignore sub-repos by +default. ### Initializing DVC without Git From a5db93fec4b9b43fc5b25744b1c6643539c5a64a Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Thu, 30 Jul 2020 21:29:48 -0500 Subject: [PATCH 22/54] cmd: fix -P option desc. per https://github.com/iterative/dvc.org/pull/1615#pullrequestreview-458820628 --- content/docs/command-reference/repro.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/content/docs/command-reference/repro.md b/content/docs/command-reference/repro.md index 80d1e17bc7..78af78c30c 100644 --- a/content/docs/command-reference/repro.md +++ b/content/docs/command-reference/repro.md @@ -127,8 +127,8 @@ only execute the final stage. - `-p`, `--pipeline` - reproduce the entire pipelines that the `targets` belong to. Use `dvc dag ` to show the parent pipeline of a target. -- `-P`, `--all-pipelines` - reproduce all pipelines for all the `dvc.yaml` files - in present the DVC repository (there can be one in every directory). +- `-P`, `--all-pipelines` - reproduce all pipelines for all `dvc.yaml` files + present in the DVC project. - `--no-run-cache` - execute stage commands even if they have already been run with the same command/dependencies/outputs/etc before. 
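Reviewer note (not part of any patch): the `init.md` changes in this series describe DVC finding the project root by looking for `.dvc/` from the current working directory, up. A minimal sketch of that lookup is below. It is an illustration only, not DVC's actual implementation: `find_project_root` is a hypothetical helper written for this note, and the directory names in the usage line are assumptions borrowed from the `project-A` examples.

```python
from pathlib import Path
from typing import Optional


def find_project_root(start: Path) -> Optional[Path]:
    """Return the closest directory at or above `start` that contains `.dvc/`.

    Hypothetical helper: it only mirrors the "look for .dvc/ from the
    current working directory, up" behavior described in the docs.
    """
    start = start.resolve()
    # Check `start` itself first, then each parent up to the filesystem root.
    for candidate in (start, *start.parents):
        if (candidate / ".dvc").is_dir():
            return candidate
    # No `.dvc/` found anywhere above `start`: not inside a DVC project.
    return None
```

Under these assumptions, calling `find_project_root(Path("repo/project-A/src"))` in a monorepo where only `project-A/` was initialized with `--subdir` would stop at `project-A/`, before ever reaching the Git root, which is the scoping effect the docs describe.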
From 40512a116519a5a92a01d3c1bec2228b11f6d418 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Thu, 30 Jul 2020 21:47:41 -0500 Subject: [PATCH 23/54] cmd: improve explanation on how --subdir affects commands per https://github.com/iterative/dvc.org/pull/1615#pullrequestreview-458824530 --- content/docs/command-reference/init.md | 20 +++++++++++--------- 1 file changed, 11 insertions(+), 9 deletions(-) diff --git a/content/docs/command-reference/init.md b/content/docs/command-reference/init.md index 11c9563a59..e8127fc419 100644 --- a/content/docs/command-reference/init.md +++ b/content/docs/command-reference/init.md @@ -59,12 +59,15 @@ DVC in the Git repository root: #### How does it affect DVC commands? -The project root defines the possible scope of action for most DVC +The project root is found by DVC by looking for `.dvc/` from the +current working directory, up. It defines the scope of action for most DVC commands (e.g. `dvc repro`, `dvc pull`, `dvc metrics diff`), meaning that only -`dvc.yaml`, `dvc.lock`, and `.dvc` files under that location are usable by the -commands. To determine the root of the project, DVC looks for `.dvc/` from the -current working directory, up. With `--subdir`, the project root will be found -before the Git repo root, reducing this scope. +`dvc.yaml`, `dvc.lock`, and `.dvc` files inside the project are usable by the +commands. + +With `--subdir`, the project root will be found before the Git repo root, making +sure the scope of DVC commands is constrained to this project alone, and not any +other that may be found in the repo. If there are multiple `--subdir` projects, but they are not nested, e.g.: @@ -121,10 +124,9 @@ project-A > This is a questionable Git practice though, unless employing submodules. -In these cases all projects are also isolated from each other and commands run -in one or the other only affects itself. 
Inner ones wouldn't search for -data/pipelines above them anyway, and outer ones know to ignore sub-repos by -default. +In these cases all projects are also isolated from each other, and commands run +in one or the other only affects themselves. Inner ones wouldn't search for +data/pipelines above them anyway, and outer ones ignore sub-repos by default. ### Initializing DVC without Git From 1f77e8fe8e41119e522882bd18df0d5ba55d9e26 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Thu, 30 Jul 2020 23:33:38 -0500 Subject: [PATCH 24/54] cmd: simplify nested structures explanation in init per https://github.com/iterative/dvc.org/pull/1615#discussion_r463272826 --- content/docs/command-reference/init.md | 95 ++++++++++++-------------- 1 file changed, 44 insertions(+), 51 deletions(-) diff --git a/content/docs/command-reference/init.md b/content/docs/command-reference/init.md index e8127fc419..750a3cb770 100644 --- a/content/docs/command-reference/init.md +++ b/content/docs/command-reference/init.md @@ -17,10 +17,11 @@ repository root (a `.git/` directory should be present). The command [options](#options) can be used to start an alternative workflow for advanced scenarios: -- [Initializing DVC in subdirectories](#initializing-dvc-in-subdirectories) - - support for monorepos and nested DVC projects. -- [Initializing DVC without Git](#initializing-dvc-without-git) - support for - SCM other than Git, deployment automation cases, etc. 
+- [Initializing DVC in subdirectories](#initializing-dvc-in-subdirectories) + (`--subdir`) - for monorepos and nested DVC projects +- [Initializing DVC without Git](#initializing-dvc-without-git) (`--no-scm`) - + for very simple projects, SCM other than Git, deployment automation, among + other uses At DVC initialization, a new `.dvc/` directory is created for internal configuration and cache @@ -43,7 +44,7 @@ This option is mostly useful in the scenario of a [monorepo](https://en.wikipedia.org/wiki/Monorepo) (Git repository split into several project directories), but can also be used with other patterns when such isolation is needed. `dvc init --subdir` mitigates the issues of initializing -DVC in the Git repository root: +DVC in the Git repo root: - Repository maintainers might not allow a top level `.dvc/` directory, especially if DVC is being used by several sub-projects (monorepo). @@ -62,71 +63,63 @@ DVC in the Git repository root: The project root is found by DVC by looking for `.dvc/` from the current working directory, up. It defines the scope of action for most DVC commands (e.g. `dvc repro`, `dvc pull`, `dvc metrics diff`), meaning that only -`dvc.yaml`, `dvc.lock`, and `.dvc` files inside the project are usable by the -commands. +`dvc.yaml`, `.dvc` files, etc. inside the project are usable by the commands. -With `--subdir`, the project root will be found before the Git repo root, making -sure the scope of DVC commands is constrained to this project alone, and not any -other that may be found in the repo. +With `--subdir`, the project root will be found before the Git root, making sure +the scope of DVC commands run here is constrained to this project alone, even if +there are more DVC-related files elsewhere in the repo. Similarly, DVC commands +run outside this project root will ignore its contents. 
-If there are multiple `--subdir` projects, but they are not nested, e.g.: +**Simple structures**: multiple `--subdir` projects, but they are not nested, +e.g.: -``` +```dvc . -├── .git +├── .git # plain Git repo ├── project-A -│   ├── .dvc +│   ├── .dvc # dvc init --subdir │ ... ├── project-B -│ ├── .dvc -... -``` - -DVC considers them separate projects. Any DVC command that is being run in -`project-A` is not aware about `project-B`. DVC does not consider the Git -repository root a DVC project in this case, and commands that require an -initialized project will produce an error. - -On the other hand, if there are nested `--subdir` projects, e.g.: - -``` -project-A -├── .dvc -├── data-A.dvc +│ ├── .dvc # dvc init --no-scm │ ... -└── project-B - ├── .dvc - ├── data-B.dvc - ... ``` -Nothing changes for inner `project-B`. But any DVC command being run in outer -`project-A` ignores the nested directory `project-B/`. For example, `dvc pull` -(run in `project-A/`) wouldn't download data for the `data-B.dvc` file. +DVC considers them separate projects. Any DVC command run in `project-A` is not +aware of `project-B`. However, commands that involve versioning (like +`dvc checkout`) can access the commit history from the Git root (`.`), when run +in `project-A`. -### Nested DVC repositories +> `.` is not a DVC project in this case, so most DVC commands can't be run +> there. -Similarly, entire DVC repositories (each with it's own Git repo) -can be nested. For example: +**Advanced structures**: If there are nested projects, either `--subdir`, +[`--no-scm`](#initializing-dvc-without-git), or full DVC +repositories (with their own Git root) e.g.: -``` -project-A -├── .git +```dvc +. # full DVC+Git repo ├── .dvc +├── .git +├── dvc.yaml +├── ... +├── project-A # initialized with +│   ├── .dvc # --subdir or --no-scm +│   ├── data.dvc +│ ... ├── project-B -│   ├── .git -│   ├── .dvc +│   ├── .dvc # a full sub-repo +│   ├── .git # (no --subdir) +│   ├── data.dvc │ ... 
-├── project-C # initialized with -│   ├── .dvc # --subdir or --no-scm -... ``` -> This is a questionable Git practice though, unless employing submodules. +Nothing changes for the inner projects. And any DVC command run in the outer one +(`.`) actively ignores the nested `project-A/` and `project-B/` directories. For +example, using `dvc pull` in `.` wouldn't download data for the `data.dvc` +files. -In these cases all projects are also isolated from each other, and commands run -in one or the other only affects themselves. Inner ones wouldn't search for -data/pipelines above them anyway, and outer ones ignore sub-repos by default. +> Note that nesting Git repos is a questionable practice, unless employing Git +> submodules. ### Initializing DVC without Git From 42b670f1655493f5d31690fb96dad7b2b71b359e Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Fri, 31 Jul 2020 14:34:33 -0500 Subject: [PATCH 25/54] guide: add note aboud `cp` not being a download in external deps per https://github.com/iterative/dvc.org/pull/1643#pullrequestreview-459415977 --- content/docs/user-guide/external-dependencies.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/content/docs/user-guide/external-dependencies.md b/content/docs/user-guide/external-dependencies.md index f9ee3eda93..3139224357 100644 --- a/content/docs/user-guide/external-dependencies.md +++ b/content/docs/user-guide/external-dependencies.md @@ -50,6 +50,9 @@ $ dvc run -n download_file cp /home/shared/data.txt data.txt ``` +> Note in this case it's not necessarily a "download", since a simple `cp` is +> used. 
+ ### SSH ```dvc From d30bc634d76afc8a0b09c5b6880c16b5e2b48763 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Fri, 31 Jul 2020 14:44:46 -0500 Subject: [PATCH 26/54] cmd: add note about what --cwd means to repro per https://github.com/iterative/dvc/issues/4292#issuecomment-667319168 --- content/docs/command-reference/repro.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/content/docs/command-reference/repro.md b/content/docs/command-reference/repro.md index 78af78c30c..1783a752cc 100644 --- a/content/docs/command-reference/repro.md +++ b/content/docs/command-reference/repro.md @@ -99,11 +99,11 @@ only execute the final stage. (non-recursively) if multiple stage names are given as `targets`. - `-c `, `--cwd ` - directory within the project to reproduce from. - Instead of using `--cwd`, one can alternately specify a target in a - subdirectory as `path/to/target.dvc`. This option can be useful for example - with subdirectories containing a separate pipeline that can either be - reproduced as part of the pipeline in the parent directory, or as an - independent unit. + `targets` will be searched relative to this path. Instead of using `--cwd`, + one can alternately specify a target in a subdirectory as + `path/to/target.dvc`. This option can be useful for example with + subdirectories containing a separate pipeline that can either be reproduced as + part of the pipeline in the parent directory, or as an independent unit. - `-R`, `--recursive` - determines the stages to reproduce by searching each target directory (if any) and their subdirectories. From 0ea4bd35a0ed0083fb2078278662ff6d3eb70140 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Fri, 31 Jul 2020 14:48:44 -0500 Subject: [PATCH 27/54] guide: nvmd! 
removing that note in external deps per https://github.com/iterative/dvc.org/commit/42b670f1655493f5d31690fb96dad7b2b71b359e#r41087633 --- content/docs/user-guide/external-dependencies.md | 3 --- 1 file changed, 3 deletions(-) diff --git a/content/docs/user-guide/external-dependencies.md b/content/docs/user-guide/external-dependencies.md index 3139224357..f9ee3eda93 100644 --- a/content/docs/user-guide/external-dependencies.md +++ b/content/docs/user-guide/external-dependencies.md @@ -50,9 +50,6 @@ $ dvc run -n download_file cp /home/shared/data.txt data.txt ``` -> Note in this case it's not necessarily a "download", since a simple `cp` is -> used. - ### SSH ```dvc From 70b7d2a3a22a2ab94a5f6c550e6f97d115e44516 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Fri, 31 Jul 2020 15:05:23 -0500 Subject: [PATCH 28/54] Update content/docs/command-reference/repro.md --- content/docs/command-reference/repro.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/content/docs/command-reference/repro.md b/content/docs/command-reference/repro.md index a273651d0d..613a6fb048 100644 --- a/content/docs/command-reference/repro.md +++ b/content/docs/command-reference/repro.md @@ -271,8 +271,8 @@ print(f'Number of lines in {sys.argv[1]}:') print(num_lines) ``` -Now, using the `--downstream` option with `dvc repro`, results in the execution -of stages after the target stage (`count` in this case) in the pipeline. +Now, using the `--downstream` option with `dvc repro` results in the execution +only of stages after the target (`count`). 
```dvc $ dvc repro --downstream count From 73499f2950e17fc1a04a7dc050b5cc043b089c56 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Fri, 31 Jul 2020 15:05:41 -0500 Subject: [PATCH 29/54] Update content/docs/command-reference/repro.md --- content/docs/command-reference/repro.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/docs/command-reference/repro.md b/content/docs/command-reference/repro.md index 613a6fb048..c3292d4197 100644 --- a/content/docs/command-reference/repro.md +++ b/content/docs/command-reference/repro.md @@ -272,7 +272,7 @@ print(num_lines) ``` Now, using the `--downstream` option with `dvc repro` results in the execution -only of stages after the target (`count`). +only of stages after the target (`count`): ```dvc $ dvc repro --downstream count From e40402ca83e2b392328fd066bede2742894c45fc Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Fri, 31 Jul 2020 15:11:36 -0500 Subject: [PATCH 30/54] Update content/docs/command-reference/repro.md --- content/docs/command-reference/repro.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/docs/command-reference/repro.md b/content/docs/command-reference/repro.md index c3292d4197..5aa5a7e923 100644 --- a/content/docs/command-reference/repro.md +++ b/content/docs/command-reference/repro.md @@ -272,7 +272,7 @@ print(num_lines) ``` Now, using the `--downstream` option with `dvc repro` results in the execution -only of stages after the target (`count`): +only of the target stage (and any following ones): ```dvc $ dvc repro --downstream count From 1696951338434baf2fe91668521f087c7cacd071 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Fri, 31 Jul 2020 15:11:44 -0500 Subject: [PATCH 31/54] Update content/docs/command-reference/repro.md --- content/docs/command-reference/repro.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/content/docs/command-reference/repro.md b/content/docs/command-reference/repro.md index 5aa5a7e923..77056c1d48 
100644 --- a/content/docs/command-reference/repro.md +++ b/content/docs/command-reference/repro.md @@ -282,8 +282,8 @@ Updating lock file 'dvc.lock' ``` The change in `text.txt` is ignored because that file is a dependency in the -`filter` stage, which did not get updated in the above command. This is because -`filter` happens before `count` in the pipeline (shown below). +`filter` stage, which wasn't executed by the `dvc repro` above. This is because +`filter` happens before the target (`count`) in the pipeline, as shown below: ```dvc $ dvc dag From bfe6800d16804fb26f22747e39f0c949562d6c75 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Fri, 31 Jul 2020 15:12:49 -0500 Subject: [PATCH 32/54] Update content/docs/command-reference/repro.md --- content/docs/command-reference/repro.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/docs/command-reference/repro.md b/content/docs/command-reference/repro.md index 77056c1d48..e6aeba6946 100644 --- a/content/docs/command-reference/repro.md +++ b/content/docs/command-reference/repro.md @@ -272,7 +272,7 @@ print(num_lines) ``` Now, using the `--downstream` option with `dvc repro` results in the execution -only of the target stage (and any following ones): +only of the target stage, and following ones (none in these case): ```dvc $ dvc repro --downstream count From 3a20f21d7354b76389c0235888d9399c5765f90d Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Fri, 31 Jul 2020 15:31:40 -0500 Subject: [PATCH 33/54] cmd: more small updates to init --- content/docs/command-reference/init.md | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/content/docs/command-reference/init.md b/content/docs/command-reference/init.md index 750a3cb770..9b8111ee14 100644 --- a/content/docs/command-reference/init.md +++ b/content/docs/command-reference/init.md @@ -70,8 +70,7 @@ the scope of DVC commands run here is constrained to this project alone, even if there are more DVC-related files elsewhere in 
the repo. Similarly, DVC commands run outside this project root will ignore its contents. -**Simple structures**: multiple `--subdir` projects, but they are not nested, -e.g.: +**Simple structures**: multiple `--subdir` projects, not nested, e.g.: ```dvc . @@ -84,10 +83,10 @@ e.g.: │ ... ``` -DVC considers them separate projects. Any DVC command run in `project-A` is not +DVC considers these separate projects. Any DVC command run in `project-A` is not aware of `project-B`. However, commands that involve versioning (like `dvc checkout`) can access the commit history from the Git root (`.`), when run -in `project-A`. +in `--subdir` projects. > `.` is not a DVC project in this case, so most DVC commands can't be run > there. From e83bc5d4aad670ed41a8fe1ddc33250b36bc9368 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Fri, 31 Jul 2020 16:58:07 -0500 Subject: [PATCH 34/54] Update content/docs/command-reference/repro.md --- content/docs/command-reference/repro.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/docs/command-reference/repro.md b/content/docs/command-reference/repro.md index e6aeba6946..04b878fba6 100644 --- a/content/docs/command-reference/repro.md +++ b/content/docs/command-reference/repro.md @@ -272,7 +272,7 @@ print(num_lines) ``` Now, using the `--downstream` option with `dvc repro` results in the execution -only of the target stage, and following ones (none in these case): +only of the target stage, and following ones (none in this case): ```dvc $ dvc repro --downstream count From 012b72fe97b1719330745df36c3992becf104bd7 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Fri, 31 Jul 2020 16:59:22 -0500 Subject: [PATCH 35/54] Update content/docs/command-reference/repro.md --- content/docs/command-reference/repro.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/docs/command-reference/repro.md b/content/docs/command-reference/repro.md index 04b878fba6..8d3d12dd0e 100644 --- 
a/content/docs/command-reference/repro.md +++ b/content/docs/command-reference/repro.md @@ -263,7 +263,7 @@ The answer to universe is 42 - The Hitchhiker's Guide to the Galaxy ``` -Let's say we want to print the filename also in the description and so we update +Let's say we also want to print the filename in the description, and so we update the `process.py` as: ```python From 80d257560c53d61ab6d089ffaa4283c7b95ca600 Mon Sep 17 00:00:00 2001 From: "Restyled.io" Date: Fri, 31 Jul 2020 21:59:32 +0000 Subject: [PATCH 36/54] Restyled by prettier --- content/docs/command-reference/repro.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/content/docs/command-reference/repro.md b/content/docs/command-reference/repro.md index 8d3d12dd0e..0709f40e8e 100644 --- a/content/docs/command-reference/repro.md +++ b/content/docs/command-reference/repro.md @@ -263,8 +263,8 @@ The answer to universe is 42 - The Hitchhiker's Guide to the Galaxy ``` -Let's say we also want to print the filename in the description, and so we update -the `process.py` as: +Let's say we also want to print the filename in the description, and so we +update the `process.py` as: ```python print(f'Number of lines in {sys.argv[1]}:') From a85d8a0ce208b0cab3bc4b626fc6f5c8f244be04 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Mon, 3 Aug 2020 11:53:39 -0500 Subject: [PATCH 37/54] cmd: rewrap metrics diff usage paragraph --- content/docs/command-reference/metrics/diff.md | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/content/docs/command-reference/metrics/diff.md b/content/docs/command-reference/metrics/diff.md index f73fd2ac86..daec243ab7 100644 --- a/content/docs/command-reference/metrics/diff.md +++ b/content/docs/command-reference/metrics/diff.md @@ -6,9 +6,10 @@ Show changes in [metrics](/doc/command-reference/metrics) between commits in the ## Synopsis ```usage -usage: dvc metrics diff [-h] [-q | -v] [--targets [ [ ...]]] [-R] - [--all] [--show-json] 
[--show-md] [--no-path] [--old] - [--precision ] +usage: dvc metrics diff [-h] [-q | -v] + [--targets [ [ ...]]] [-R] + [--all] [--show-json] [--show-md] [--no-path] + [--old] [--precision ] [a_rev] [b_rev] positional arguments: From fd2a9bb1563deaa5eac93556ded07baded83cc09 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Mon, 3 Aug 2020 13:58:19 -0500 Subject: [PATCH 38/54] term: remove "just" from -j desc in 3 refs per eda27fcc52b8c98f6909782461b9652ff7f005a8 --- content/docs/command-reference/fetch.md | 6 +++--- content/docs/command-reference/pull.md | 6 +++--- content/docs/command-reference/push.md | 6 +++--- 3 files changed, 9 insertions(+), 9 deletions(-) diff --git a/content/docs/command-reference/fetch.md b/content/docs/command-reference/fetch.md index 8eeda06825..dde867c9ba 100644 --- a/content/docs/command-reference/fetch.md +++ b/content/docs/command-reference/fetch.md @@ -90,9 +90,9 @@ or `-T` options are used). - `-j `, `--jobs ` - number of threads to run simultaneously to handle the downloading of files from the remote. The default value is - `4 * cpu_count()`. For SSH remotes, the default is just `4`. Using more jobs - may improve the total download speed if a combination of small and large files - are being fetched. + `4 * cpu_count()`. For SSH remotes, the default is `4`. Using more jobs may + improve the total download speed if a combination of small and large files are + being fetched. - `-a`, `--all-branches` - fetch cache for all Git branches instead of just the current workspace. This means DVC may download files needed to reproduce diff --git a/content/docs/command-reference/pull.md b/content/docs/command-reference/pull.md index c05893072c..50c12b51e6 100644 --- a/content/docs/command-reference/pull.md +++ b/content/docs/command-reference/pull.md @@ -98,9 +98,9 @@ reflinks or hardlinks to put it in the workspace without copying. 
See - `-j `, `--jobs ` - number of threads to run simultaneously to handle the downloading of files from the remote. The default value is - `4 * cpu_count()`. For SSH remotes, the default is just `4`. Using more jobs - may improve the total download speed if a combination of small and large files - are being fetched. + `4 * cpu_count()`. For SSH remotes, the default is `4`. Using more jobs may + improve the total download speed if a combination of small and large files are + being fetched. - `-h`, `--help` - prints the usage/help message, and exit. diff --git a/content/docs/command-reference/push.md b/content/docs/command-reference/push.md index 1df93ca313..694b66bb77 100644 --- a/content/docs/command-reference/push.md +++ b/content/docs/command-reference/push.md @@ -105,9 +105,9 @@ the target [stage files](/doc/command-reference/run), through the corresponding - `-j `, `--jobs ` - number of threads to run simultaneously to handle the uploading of files from the remote. The default value is - `4 * cpu_count()`. For SSH remotes, the default is just `4`. Using more jobs - may improve the total download speed if a combination of small and large files - are being fetched. + `4 * cpu_count()`. For SSH remotes, the default is `4`. Using more jobs may + improve the total download speed if a combination of small and large files are + being fetched. - `-h`, `--help` - prints the usage/help message, and exit. 
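Note on the `--jobs` default touched by the patch above: the option text gives the default as `4 * cpu_count()`, or `4` for SSH remotes. A minimal sketch of that rule follows, assuming the remote type is known as a plain scheme string. `default_jobs` and `remote_scheme` are hypothetical names used only for illustration; this is not DVC's own code.

```python
import os


def default_jobs(remote_scheme: str) -> int:
    """Return the default --jobs value as the option text describes it.

    Hypothetical helper for illustration only; not a DVC function.
    """
    if remote_scheme == "ssh":
        # The docs state a fixed default of 4 for SSH remotes.
        return 4
    # os.cpu_count() can return None, so fall back to 1 in that case.
    return 4 * (os.cpu_count() or 1)
```

On an 8-core machine this rule would give 32 jobs for a non-SSH remote and 4 for SSH, which is why the docs mention the two cases separately.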
From 221ed757fbb64672feb0178be89d2d97f14da15a Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Tue, 4 Aug 2020 22:40:59 -0500 Subject: [PATCH 39/54] cmd: add command examples to init --subdir use cases per https://github.com/iterative/dvc.org/pull/1615 --- content/docs/command-reference/init.md | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/content/docs/command-reference/init.md b/content/docs/command-reference/init.md index 9b8111ee14..be7fcd99c5 100644 --- a/content/docs/command-reference/init.md +++ b/content/docs/command-reference/init.md @@ -49,14 +49,14 @@ DVC in the Git repo root: - Repository maintainers might not allow a top level `.dvc/` directory, especially if DVC is being used by several sub-projects (monorepo). -- DVC config file, cache directory, - [etc.](/doc/user-guide/dvc-files-and-directories) are shared across different - sub-projects. This makes it difficult to use different DVC settings, - [remote storage](/doc/command-reference/remote) locations, etc. - -- Many DVC commands can explore the whole DVC repository to find - DVC-tracked data and pipelines to work with. This can be undesirable and - inefficient for large monorepos. +- DVC [internals](/doc/user-guide/dvc-files-and-directories) (config file, cache + directory, etc.) are shared across different sub-projects. This forces all of + them to use the same DVC settings and + [remote storage](/doc/command-reference/remote). + +- By default, DVC commands like `dvc checkout` and `dvc repro` explore the whole + DVC repository to find DVC-tracked data and pipelines to work + with. This can be undesirable and inefficient for large monorepos. #### How does it affect DVC commands? 
From d14e960e145952a684712df415fa2067ffb432c1 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Tue, 4 Aug 2020 23:16:58 -0500 Subject: [PATCH 40/54] cmd: explain nested repo and projects of all kinds outside of --subdir per https://github.com/iterative/dvc.org/pull/1615#pullrequestreview-458917868 --- content/docs/command-reference/init.md | 68 ++++++++++++-------------- 1 file changed, 32 insertions(+), 36 deletions(-) diff --git a/content/docs/command-reference/init.md b/content/docs/command-reference/init.md index be7fcd99c5..ff8fb5e1c1 100644 --- a/content/docs/command-reference/init.md +++ b/content/docs/command-reference/init.md @@ -14,6 +14,12 @@ DVC works best in a Git repository. This enables all features, providing the most value. For this reason, `dvc init` (without flags) expects to run in a Git repository root (a `.git/` directory should be present). +At DVC initialization, a new `.dvc/` directory is created for internal +configuration and cache +[files and directories](/doc/user-guide/dvc-files-and-directories#internal-directories-and-files), +that are hidden from the user. This directory is automatically staged with +`git add`, so it can be easily committed with Git. + The command [options](#options) can be used to start an alternative workflow for advanced scenarios: @@ -23,11 +29,9 @@ advanced scenarios: for very simple projects, SCM other than Git, deployment automation, among other uses -At DVC initialization, a new `.dvc/` directory is created for internal -configuration and cache -[files and directories](/doc/user-guide/dvc-files-and-directories#internal-directories-and-files), -that are hidden from the user. This directory is automatically staged with -`git add`, so it can be easily committed with Git. +> Note that DVC repositories nested inside other DVC repos (for +> example when using Git submodules) are isolated from the outer ones, and vice +> versa. This is because each one has their own Git root. 
### Initializing DVC in subdirectories @@ -73,52 +77,40 @@ run outside this project root will ignore its contents. **Simple structures**: multiple `--subdir` projects, not nested, e.g.: ```dvc -. -├── .git # plain Git repo +. # plain Git repo +├── .git ├── project-A │   ├── .dvc # dvc init --subdir │ ... ├── project-B -│ ├── .dvc # dvc init --no-scm +│ ├── .dvc # dvc init --subdir │ ... ``` -DVC considers these separate projects. Any DVC command run in `project-A` is not -aware of `project-B`. However, commands that involve versioning (like -`dvc checkout`) can access the commit history from the Git root (`.`), when run -in `--subdir` projects. +DVC considers A and B separate projects. Any DVC command run in `project-A` is +not aware of `project-B`. However, commands that involve versioning (like +`dvc checkout`) access the commit history from the Git root (`.`). > `.` is not a DVC project in this case, so most DVC commands can't be run > there. -**Advanced structures**: If there are nested projects, either `--subdir`, -[`--no-scm`](#initializing-dvc-without-git), or full DVC -repositories (with their own Git root) e.g.: +**Nested structures**: If there are nested `--subdir`projects e.g.: ```dvc -. # full DVC+Git repo +project-A # full DVC + Git repo ├── .dvc ├── .git ├── dvc.yaml ├── ... -├── project-A # initialized with -│   ├── .dvc # --subdir or --no-scm -│   ├── data.dvc -│ ... ├── project-B -│   ├── .dvc # a full sub-repo -│   ├── .git # (no --subdir) -│   ├── data.dvc +│   ├── .dvc # dvc init --subdir +│   ├── data-B.dvc │ ... ``` Nothing changes for the inner projects. And any DVC command run in the outer one -(`.`) actively ignores the nested `project-A/` and `project-B/` directories. For -example, using `dvc pull` in `.` wouldn't download data for the `data.dvc` -files. - -> Note that nesting Git repos is a questionable practice, unless employing Git -> submodules. +actively ignores the nested project directories. 
For example, using `dvc pull` +in `project-A` wouldn't download data for the `data-B.dvc` file. ### Initializing DVC without Git @@ -135,15 +127,19 @@ include: - There is no need to keep the history at all, e.g. having a deployment automation like running a data pipeline using `cron`. -In this mode DVC features that depend on Git being present are not available - -e.g. managing `.gitignore` files on `dvc add` or `dvc run` to avoid committing -DVC-tracked files into Git, or `dvc diff` and `dvc metrics diff` that accept -Git-revisions to compare, etc. +In this mode, DVC features related to versioning are not available. For example +automatic creation and updating of `.gitignore` files on `dvc add` or `dvc run`, +as well as `dvc diff` and `dvc metrics diff`, which require Git revisions to +compare. DVC sets the `core.no_scm` config option value to `true` in the DVC -[config](/doc/command-reference/config) when it is initialized this way. It -means that even if the project was Git-tracked already or Git is initialized in -it later, DVC keeps operating in the detached from Git mode. +[config](/doc/command-reference/config) when initialized this way. This means +that even if the project is tracked by Git, or if Git is initialized in it +later, DVC will keep operating detached from Git in this project. + +> Note that like with nested repositories and `--subdir` projects, +> `--no-scm` projects inside regular projects are ignored by any +> parent DVC projects, and vice versa. 
## Options From 5b47bd4f02fc45a595b4e0695d06ad0f75a31278 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Tue, 4 Aug 2020 23:23:46 -0500 Subject: [PATCH 41/54] cmd: remove bold names to nested and not-nested structure examples in init --subdir per https://github.com/iterative/dvc.org/pull/1615#pullrequestreview-459522002 --- content/docs/command-reference/init.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/content/docs/command-reference/init.md b/content/docs/command-reference/init.md index ff8fb5e1c1..f46d2691a0 100644 --- a/content/docs/command-reference/init.md +++ b/content/docs/command-reference/init.md @@ -74,7 +74,7 @@ the scope of DVC commands run here is constrained to this project alone, even if there are more DVC-related files elsewhere in the repo. Similarly, DVC commands run outside this project root will ignore its contents. -**Simple structures**: multiple `--subdir` projects, not nested, e.g.: +If there are multiple `--subdir` projects, but not nested, e.g.: ```dvc . # plain Git repo @@ -94,7 +94,7 @@ not aware of `project-B`. However, commands that involve versioning (like > `.` is not a DVC project in this case, so most DVC commands can't be run > there. -**Nested structures**: If there are nested `--subdir`projects e.g.: +If there are nested `--subdir`projects e.g.: ```dvc project-A # full DVC + Git repo From 80a0f092c3094b668159beb59182bcec5b5830d7 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Tue, 4 Aug 2020 23:32:36 -0500 Subject: [PATCH 42/54] cmd: standardize --jobs option in all refs per https://github.com/iterative/dvc.org/pull/1615#pullrequestreview-461196461 et al. 
--- content/docs/command-reference/fetch.md | 9 ++++----- content/docs/command-reference/gc.md | 6 ++++-- content/docs/command-reference/pull.md | 9 ++++----- content/docs/command-reference/push.md | 9 ++++----- content/docs/command-reference/status.md | 8 ++++---- 5 files changed, 20 insertions(+), 21 deletions(-) diff --git a/content/docs/command-reference/fetch.md b/content/docs/command-reference/fetch.md index dde867c9ba..882047a766 100644 --- a/content/docs/command-reference/fetch.md +++ b/content/docs/command-reference/fetch.md @@ -88,11 +88,10 @@ or `-T` options are used). directory and its subdirectories for `dvc.yaml` and `.dvc` files to inspect. If there are no directories among the `targets`, this option is ignored. -- `-j `, `--jobs ` - number of threads to run simultaneously to - handle the downloading of files from the remote. The default value is - `4 * cpu_count()`. For SSH remotes, the default is `4`. Using more jobs may - improve the total download speed if a combination of small and large files are - being fetched. +- `-j `, `--jobs ` - parallelism level for DVC to download data + from remote storage. This only applies when the `--cloud` option is used, or a + `--remote` is given. The default value is `4 * cpu_count()`. For SSH remotes, + the default is `4`. - `-a`, `--all-branches` - fetch cache for all Git branches instead of just the current workspace. This means DVC may download files needed to reproduce diff --git a/content/docs/command-reference/gc.md b/content/docs/command-reference/gc.md index 130b20b446..2bea48c20b 100644 --- a/content/docs/command-reference/gc.md +++ b/content/docs/command-reference/gc.md @@ -89,8 +89,10 @@ The default remote is cleaned (see `dvc config core.remote`) unless the [remote storage](/doc/command-reference/remote) to collect unused objects from if `-c` option is specified (see `dvc remote list`). -- `-j `, `--jobs ` - garbage collector parallelism level. The - default `JOBS` argument is `4 * cpu_count()`. 
For SSH remotes default is 4. +- `-j `, `--jobs ` - parallelism level for DVC to access data + from remote storage. This only applies when the `--cloud` option is used, or a + `--remote` is given. The default value is `4 * cpu_count()`. For SSH remotes, + the default is `4`. > For now only some phases of garbage collection are parallel. diff --git a/content/docs/command-reference/pull.md b/content/docs/command-reference/pull.md index 50c12b51e6..91067bbd52 100644 --- a/content/docs/command-reference/pull.md +++ b/content/docs/command-reference/pull.md @@ -96,11 +96,10 @@ reflinks or hardlinks to put it in the workspace without copying. See repository into the local run cache. A `dvc repro ` is necessary to checkout these files into the workspace and update the `dvc.lock` file. -- `-j `, `--jobs ` - number of threads to run simultaneously to - handle the downloading of files from the remote. The default value is - `4 * cpu_count()`. For SSH remotes, the default is `4`. Using more jobs may - improve the total download speed if a combination of small and large files are - being fetched. +- `-j `, `--jobs ` - parallelism level for DVC to download data + from remote storage. This only applies when the `--cloud` option is used, or a + `--remote` is given. The default value is `4 * cpu_count()`. For SSH remotes, + the default is `4`. - `-h`, `--help` - prints the usage/help message, and exit. diff --git a/content/docs/command-reference/push.md b/content/docs/command-reference/push.md index 694b66bb77..3527006cae 100644 --- a/content/docs/command-reference/push.md +++ b/content/docs/command-reference/push.md @@ -103,11 +103,10 @@ the target [stage files](/doc/command-reference/run), through the corresponding - `--run-cache` - uploads all available history of stage runs to the remote repository. -- `-j `, `--jobs ` - number of threads to run simultaneously to - handle the uploading of files from the remote. The default value is - `4 * cpu_count()`. 
For SSH remotes, the default is `4`. Using more jobs may - improve the total download speed if a combination of small and large files are - being fetched. +- `-j `, `--jobs ` - parallelism level for DVC to upload data + from remote storage. This only applies when the `--cloud` option is used, or a + `--remote` is given. The default value is `4 * cpu_count()`. For SSH remotes, + the default is `4`. - `-h`, `--help` - prints the usage/help message, and exit. diff --git a/content/docs/command-reference/status.md b/content/docs/command-reference/status.md index a2068465b8..b39de0371c 100644 --- a/content/docs/command-reference/status.md +++ b/content/docs/command-reference/status.md @@ -142,10 +142,10 @@ workspace) is different from remote storage. Bringing the two into sync requires - `-r `, `--remote ` - specifies which remote storage (see `dvc remote list`) to compare against. Implies `--cloud`. -- `-j `, `--jobs ` - specifies the number of jobs DVC can use to - retrieve information from remote servers. This only applies when the `--cloud` - option is used or a remote is given. The default value is `4 * cpu_count()`. - For SSH remotes, the default is `4`. +- `-j `, `--jobs ` - parallelism level for DVC to retrieve + information from remote storage. This only applies when the `--cloud` option + is used, or a `--remote` is given. The default value is `4 * cpu_count()`. For + SSH remotes, the default is `4`. - `-h`, `--help` - prints the usage/help message, and exit. From 5078044b3e82b1d5f2dc254cfb7d590310f9f2b5 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Tue, 4 Aug 2020 23:55:31 -0500 Subject: [PATCH 43/54] cmd: add speed note to --jobs desc in all refs. 
per https://github.com/iterative/dvc.org/pull/1615#pullrequestreview-461196461 --- content/docs/command-reference/fetch.md | 2 +- content/docs/command-reference/gc.md | 2 +- content/docs/command-reference/pull.md | 2 +- content/docs/command-reference/push.md | 2 +- content/docs/command-reference/status.md | 3 ++- 5 files changed, 6 insertions(+), 5 deletions(-) diff --git a/content/docs/command-reference/fetch.md b/content/docs/command-reference/fetch.md index 882047a766..36d3194ff8 100644 --- a/content/docs/command-reference/fetch.md +++ b/content/docs/command-reference/fetch.md @@ -91,7 +91,7 @@ or `-T` options are used). - `-j `, `--jobs ` - parallelism level for DVC to download data from remote storage. This only applies when the `--cloud` option is used, or a `--remote` is given. The default value is `4 * cpu_count()`. For SSH remotes, - the default is `4`. + the default is `4`. Using more jobs may improve the overall transfer speed. - `-a`, `--all-branches` - fetch cache for all Git branches instead of just the current workspace. This means DVC may download files needed to reproduce diff --git a/content/docs/command-reference/gc.md b/content/docs/command-reference/gc.md index 2bea48c20b..d3d17312d5 100644 --- a/content/docs/command-reference/gc.md +++ b/content/docs/command-reference/gc.md @@ -92,7 +92,7 @@ The default remote is cleaned (see `dvc config core.remote`) unless the - `-j `, `--jobs ` - parallelism level for DVC to access data from remote storage. This only applies when the `--cloud` option is used, or a `--remote` is given. The default value is `4 * cpu_count()`. For SSH remotes, - the default is `4`. + the default is `4`. Using more jobs may improve the overall transfer speed. > For now only some phases of garbage collection are parallel. 
diff --git a/content/docs/command-reference/pull.md b/content/docs/command-reference/pull.md index 91067bbd52..19182db6f2 100644 --- a/content/docs/command-reference/pull.md +++ b/content/docs/command-reference/pull.md @@ -99,7 +99,7 @@ reflinks or hardlinks to put it in the workspace without copying. See - `-j `, `--jobs ` - parallelism level for DVC to download data from remote storage. This only applies when the `--cloud` option is used, or a `--remote` is given. The default value is `4 * cpu_count()`. For SSH remotes, - the default is `4`. + the default is `4`. Using more jobs may improve the overall transfer speed. - `-h`, `--help` - prints the usage/help message, and exit. diff --git a/content/docs/command-reference/push.md b/content/docs/command-reference/push.md index 3527006cae..33ff39cc9b 100644 --- a/content/docs/command-reference/push.md +++ b/content/docs/command-reference/push.md @@ -106,7 +106,7 @@ the target [stage files](/doc/command-reference/run), through the corresponding - `-j `, `--jobs ` - parallelism level for DVC to upload data from remote storage. This only applies when the `--cloud` option is used, or a `--remote` is given. The default value is `4 * cpu_count()`. For SSH remotes, - the default is `4`. + the default is `4`. Using more jobs may improve the overall transfer speed. - `-h`, `--help` - prints the usage/help message, and exit. diff --git a/content/docs/command-reference/status.md b/content/docs/command-reference/status.md index b39de0371c..941260e405 100644 --- a/content/docs/command-reference/status.md +++ b/content/docs/command-reference/status.md @@ -145,7 +145,8 @@ workspace) is different from remote storage. Bringing the two into sync requires - `-j `, `--jobs ` - parallelism level for DVC to retrieve information from remote storage. This only applies when the `--cloud` option is used, or a `--remote` is given. The default value is `4 * cpu_count()`. For - SSH remotes, the default is `4`. + SSH remotes, the default is `4`. 
Using more jobs may improve the overall + transfer speed. - `-h`, `--help` - prints the usage/help message, and exit. From f17687217abce6f6e9d62af675906b203934715c Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Wed, 5 Aug 2020 00:02:06 -0500 Subject: [PATCH 44/54] cmd: change versioning command example in init per https://github.com/iterative/dvc.org/pull/1615#pullrequestreview-461337219 --- content/docs/command-reference/init.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/content/docs/command-reference/init.md b/content/docs/command-reference/init.md index f46d2691a0..963a399fa1 100644 --- a/content/docs/command-reference/init.md +++ b/content/docs/command-reference/init.md @@ -89,16 +89,16 @@ If there are multiple `--subdir` projects, but not nested, e.g.: DVC considers A and B separate projects. Any DVC command run in `project-A` is not aware of `project-B`. However, commands that involve versioning (like -`dvc checkout`) access the commit history from the Git root (`.`). +`dvc diff`, among others) access the commit history from the Git root (`.`). > `.` is not a DVC project in this case, so most DVC commands can't be run > there. -If there are nested `--subdir`projects e.g.: +If there are nested `--subdir` projects e.g.: ```dvc -project-A # full DVC + Git repo -├── .dvc +project-A +├── .dvc # full DVC + Git repo ├── .git ├── dvc.yaml ├── ... 
From cac92e586171fd1b422db5caae9daaaf3b24b35d Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Wed, 5 Aug 2020 00:03:35 -0500 Subject: [PATCH 45/54] cd: change repo comments in init --subdir examples for https://github.com/iterative/dvc.org/pull/1615#pullrequestreview-461337818 --- content/docs/command-reference/init.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/content/docs/command-reference/init.md b/content/docs/command-reference/init.md index 963a399fa1..614de06b9f 100644 --- a/content/docs/command-reference/init.md +++ b/content/docs/command-reference/init.md @@ -77,7 +77,7 @@ run outside this project root will ignore its contents. If there are multiple `--subdir` projects, but not nested, e.g.: ```dvc -. # plain Git repo +. # git init ├── .git ├── project-A │   ├── .dvc # dvc init --subdir @@ -98,7 +98,7 @@ If there are nested `--subdir` projects e.g.: ```dvc project-A -├── .dvc # full DVC + Git repo +├── .dvc # git init && dvc init ├── .git ├── dvc.yaml ├── ... From d1f9d24a7f1446c5b98543ad170b6a348cda8d1a Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Wed, 5 Aug 2020 00:08:30 -0500 Subject: [PATCH 46/54] cmd: improve note on DVC submodules a little for https://github.com/iterative/dvc.org/pull/1615#pullrequestreview-461335110 --- content/docs/command-reference/init.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/content/docs/command-reference/init.md b/content/docs/command-reference/init.md index 614de06b9f..7689af3c3d 100644 --- a/content/docs/command-reference/init.md +++ b/content/docs/command-reference/init.md @@ -29,9 +29,9 @@ advanced scenarios: for very simple projects, SCM other than Git, deployment automation, among other uses -> Note that DVC repositories nested inside other DVC repos (for -> example when using Git submodules) are isolated from the outer ones, and vice -> versa. This is because each one has their own Git root. 
+> Note that Git-enabled DVC repositories nested inside parent DVC +> repositories (for example when using Git submodules) are isolated from the +> parent, and vice versa. This is because each one has their own Git root. ### Initializing DVC in subdirectories From 8d854a1e2c01dd78be2fbeb8c72a94fda1c0af28 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Wed, 5 Aug 2020 00:14:41 -0500 Subject: [PATCH 47/54] cmd: better explain why isolation is important in --subdir bullet per https://github.com/iterative/dvc.org/pull/1615#pullrequestreview-461336187 --- content/docs/command-reference/init.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/content/docs/command-reference/init.md b/content/docs/command-reference/init.md index 7689af3c3d..f8c3e1d1aa 100644 --- a/content/docs/command-reference/init.md +++ b/content/docs/command-reference/init.md @@ -60,7 +60,8 @@ DVC in the Git repo root: - By default, DVC commands like `dvc checkout` and `dvc repro` explore the whole DVC repository to find DVC-tracked data and pipelines to work - with. This can be undesirable and inefficient for large monorepos. + with. This can produce undesirable results and/or be inefficient for large + monorepos. #### How does it affect DVC commands? From be5bc6dcd62dd78230cea2f139d776bc89d8438e Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Wed, 5 Aug 2020 00:30:00 -0500 Subject: [PATCH 48/54] cmd: split last --subdir cases explicitly as 2 bullets per https://github.com/iterative/dvc.org/pull/1615#discussion_r465478838 --- content/docs/command-reference/init.md | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/content/docs/command-reference/init.md b/content/docs/command-reference/init.md index f8c3e1d1aa..90e3eab23d 100644 --- a/content/docs/command-reference/init.md +++ b/content/docs/command-reference/init.md @@ -58,10 +58,12 @@ DVC in the Git repo root: them to use the same DVC settings and [remote storage](/doc/command-reference/remote). 
-- By default, DVC commands like `dvc checkout` and `dvc repro` explore the whole +- By default, DVC commands like `dvc pull` and `dvc repro` explore the whole DVC repository to find DVC-tracked data and pipelines to work - with. This can produce undesirable results and/or be inefficient for large - monorepos. + with. This can be inefficient for large monorepos. + +- Other commands such as `dvc status` and `dvc metrics show` would produce + unexpected results if not constrained to a single project scope. #### How does it affect DVC commands? From 9c3ba7751c848df4c425b89a8bfdb9c2007b7b37 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Wed, 5 Aug 2020 14:11:30 -0500 Subject: [PATCH 49/54] cmd: remove most notes and code block examples about nesting projects/repos in init per https://github.com/iterative/dvc.org/pull/1615#pullrequestreview-461338242 --- content/docs/command-reference/init.md | 50 ++------------------------ 1 file changed, 3 insertions(+), 47 deletions(-) diff --git a/content/docs/command-reference/init.md b/content/docs/command-reference/init.md index 90e3eab23d..b962349dbe 100644 --- a/content/docs/command-reference/init.md +++ b/content/docs/command-reference/init.md @@ -29,10 +29,6 @@ advanced scenarios: for very simple projects, SCM other than Git, deployment automation, among other uses -> Note that Git-enabled DVC repositories nested inside parent DVC -> repositories (for example when using Git submodules) are isolated from the -> parent, and vice versa. This is because each one has their own Git root. - ### Initializing DVC in subdirectories `--subdir` must be provided to initialize DVC in a subdirectory of a Git @@ -74,46 +70,10 @@ commands (e.g. `dvc repro`, `dvc pull`, `dvc metrics diff`), meaning that only With `--subdir`, the project root will be found before the Git root, making sure the scope of DVC commands run here is constrained to this project alone, even if -there are more DVC-related files elsewhere in the repo. 
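Note on the project scope these `init.md` patches describe: a `--subdir` project is found before the Git root, so commands run inside it stay within that project. One way to picture this is a lookup that walks up from the working directory and stops at the first directory containing `.dvc/`. The sketch below assumes exactly that behavior; `find_project_root` is a hypothetical name and this is not DVC's implementation.

```python
from pathlib import Path
from typing import Optional


def find_project_root(start: Path) -> Optional[Path]:
    """Return the closest directory at or above `start` that contains `.dvc/`.

    Hypothetical helper for illustration only; not a DVC function.
    """
    current = start.resolve()
    # Check the starting directory first, then each parent up to the filesystem root.
    for candidate in (current, *current.parents):
        if (candidate / ".dvc").is_dir():
            return candidate
    return None
```

With a nested layout, a call from inside the inner project returns the inner root and never reaches the outer one, which matches the isolation the text describes.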
Similarly, DVC commands -run outside this project root will ignore its contents. - -If there are multiple `--subdir` projects, but not nested, e.g.: - -```dvc -. # git init -├── .git -├── project-A -│   ├── .dvc # dvc init --subdir -│ ... -├── project-B -│ ├── .dvc # dvc init --subdir -│ ... -``` - -DVC considers A and B separate projects. Any DVC command run in `project-A` is -not aware of `project-B`. However, commands that involve versioning (like -`dvc diff`, among others) access the commit history from the Git root (`.`). +there are more DVC-related files elsewhere in the repo. -> `.` is not a DVC project in this case, so most DVC commands can't be run -> there. - -If there are nested `--subdir` projects e.g.: - -```dvc -project-A -├── .dvc # git init && dvc init -├── .git -├── dvc.yaml -├── ... -├── project-B -│   ├── .dvc # dvc init --subdir -│   ├── data-B.dvc -│ ... -``` - -Nothing changes for the inner projects. And any DVC command run in the outer one -actively ignores the nested project directories. For example, using `dvc pull` -in `project-A` wouldn't download data for the `data-B.dvc` file. +Similarly, DVC commands run outside this project root (if nested inside another +DVC project, for example) will ignore this project's contents completely. ### Initializing DVC without Git @@ -140,10 +100,6 @@ DVC sets the `core.no_scm` config option value to `true` in the DVC that even if the project is tracked by Git, or if Git is initialized in it later, DVC will keep operating detached from Git in this project. -> Note that like with nested repositories and `--subdir` projects, -> `--no-scm` projects inside regular projects are ignored by any -> parent DVC projects, and vice versa. - ## Options - `-f`, `--force` - remove `.dvc/` if it exists before initialization. 
Will From cde3cd25b0c7cf69c0bc8d8fe4ac81eccee72515 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Wed, 5 Aug 2020 15:35:43 -0500 Subject: [PATCH 50/54] guide: clarify that tracked data is often needed for DVC commands to use a remote and start the GDrive auth process --- .../docs/user-guide/setup-google-drive-remote.md | 14 +++++++++----- 1 file changed, 9 insertions(+), 5 deletions(-) diff --git a/content/docs/user-guide/setup-google-drive-remote.md b/content/docs/user-guide/setup-google-drive-remote.md index d8d7d42b50..2c874bd89f 100644 --- a/content/docs/user-guide/setup-google-drive-remote.md +++ b/content/docs/user-guide/setup-google-drive-remote.md @@ -18,10 +18,13 @@ to establish GDrive remote connections (e.g. CI/CD). ## Quick start To start using a Google Drive remote, you only need to add it with a -[valid URL format](#url-format). Then use any DVC command that needs it (e.g. -`dvc pull`, `dvc fetch`, `dvc push`). For example: +[valid URL format](#url-format). Then use any DVC command that needs to connect +to it (e.g. `dvc pull` or `dvc push` once there's tracked data to synchronize). +For example: ```dvc +$ dvc add data +... $ dvc remote add --default myremote \ gdrive://0AIac4JZqHhKmUk9PDA/dvcstore $ dvc push @@ -192,9 +195,10 @@ authentication is needed. ## Authorization On the first usage of a GDrive [remote](/doc/command-reference/remote), for -example when trying to `dvc push` for the first time after adding the remote -with a [valid URL](#url-format), DVC will prompt you to visit a special Google -authentication web page. There you'll need to sign into your Google account. The +example when trying to `dvc push` tracked data for the first time, DVC will +prompt you to visit a special Google authentication web page. There you'll need +to sign into a Google account with the needed access to the GDrive +[URL](#url-format) in question.
The [auth process](https://developers.google.com/drive/api/v2/about-auth) will ask you to grant DVC the necessary permissions, and produce a verification code needed for DVC to complete the connection. On success, the necessary credentials From d5d16e4ec310ec4ec00d60d2c7361c5d6f2187af Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Wed, 5 Aug 2020 16:59:11 -0500 Subject: [PATCH 51/54] cmd: add/improve note on tracking directories with add and run --- content/docs/command-reference/add.md | 17 +++++++++------- content/docs/command-reference/run.md | 29 ++++++++++++++++++--------- 2 files changed, 30 insertions(+), 16 deletions(-) diff --git a/content/docs/command-reference/add.md b/content/docs/command-reference/add.md index 75501d2eb9..8adb73cd1f 100644 --- a/content/docs/command-reference/add.md +++ b/content/docs/command-reference/add.md @@ -71,7 +71,7 @@ large files. DVC also supports other link types for use on file systems without `reflink` support, but they have to be specified manually. Refer to the `cache.type` config option in `dvc config cache` for more information. -### Tracking directories +### Adding entire directories A `dvc add` target can be an individual file or a directory. In the latter case, a `.dvc` file is created for the top of the directory (with default name @@ -83,9 +83,13 @@ in the directory tree. Instead, the single `.dvc` file references a special JSON file in the cache (with `.dir` extension), that in turn points to the added files. -Note that DVC commands that use tracked files support granular targeting of -files, even when the directory is added as a whole. Examples: `dvc push`, -`dvc pull`, `dvc get`, `dvc import`, etc. +> Refer to +> [Structure of cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory) +> for more info. on `.dir` cache entries. 
+ +Note that DVC commands that use tracked data support granular targeting of files +and directories, even when contained in a parent directory added as a whole. +Examples: `dvc push`, `dvc pull`, `dvc get`, `dvc import`, etc. As a rarely needed alternative, the `--recursive` option causes every file in the hierarchy to be added individually. A corresponding `.dvc` file will be @@ -192,9 +196,8 @@ outs: path: pics ``` -> Refer to -> [Structure of cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory) -> for more info. +> Refer to [Adding entire directories](#adding-entire-directories) for more +> info. This allows us to treat the entire directory structure as a single data artifact. For example, you can pass the whole directory tree as a diff --git a/content/docs/command-reference/run.md b/content/docs/command-reference/run.md index e94a797c6b..d193645099 100644 --- a/content/docs/command-reference/run.md +++ b/content/docs/command-reference/run.md @@ -73,23 +73,34 @@ $ dvc run -n printer -d write.sh -o pages ./write.sh $ dvc run -n scanner -d read.sh -d pages -o signed.pdf ./read.sh ``` +Stage dependencies can be any file in the workspace, either untracked, or much +more commonly, tracked by DVC or Git. Outputs will be tracked and +cached by DVC when the stage is run. Every output version will be +cached when the stage is reproduced (see also `dvc gc`). + Relevant notes: -- Typically, scripts being run (or a directory containing the source code) are - included among the specified `-d` dependencies. This ensures that when the - source code changes, DVC knows that the stage needs to be reproduced. (You can - chose whether to do this.) +- Typically, scripts being run (or possibly a directory containing the source + code) are included among the specified `-d` dependencies. This ensures that + when the source code changes, DVC knows that the stage needs to be reproduced. + (You can chose whether to do this.) 
- `dvc run` checks the dependency graph integrity before creating a new stage. - For example: two stage cannot explicitly specify the same output, there should - be no cycles, etc. + For example: two stage cannot specify the same output or overlapping output + paths, there should be no cycles, etc. - DVC does not feed dependency files to the command being run. The program will have to read by itself the files specified with `-d`. +- Entire directories can be tracked as outputs, which produces a single `.dir` + entry in the cache (refer to + [Structure of cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory) + for more info.) + - Outputs are deleted from the workspace before executing the - command (including at `dvc repro`), so it should be able to recreate any - directories marked as outputs. + command (including at `dvc repro`) if their paths are found as existing + files/directories. This also means that the stage command needs to recreate + any directory structures defined as outputs every time its executed by DVC. ### For displaying and comparing data science experiments @@ -117,7 +128,7 @@ systems and require certain software packages to be installed. Wrap the command with double quotes `"` if there are special characters in it like `|` (pipe) or `<`, `>` (redirection), otherwise they would apply to -`dvc run` as a whole. Use single quotes `'` instead if there are environment +`dvc run` itself. Use single quotes `'` instead if there are environment variables in it that should be evaluated dynamically. 
Examples: ```dvc From 7546c1f45a435ebd7840426a9349c402e72bace1 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Wed, 5 Aug 2020 17:00:44 -0500 Subject: [PATCH 52/54] guide: update title of #structure-of-the-cache-directory section of DVC files & dirs guide --- content/docs/api-reference/get_url.md | 2 +- content/docs/command-reference/add.md | 4 ++-- content/docs/command-reference/cache/dir.md | 2 +- content/docs/command-reference/config.md | 2 +- content/docs/command-reference/fetch.md | 2 +- content/docs/command-reference/gc.md | 2 +- content/docs/command-reference/push.md | 4 ++-- content/docs/command-reference/run.md | 2 +- content/docs/user-guide/basic-concepts/dvc-cache.md | 2 +- content/docs/user-guide/dvc-files-and-directories.md | 4 ++-- content/docs/user-guide/dvcignore.md | 2 +- 11 files changed, 14 insertions(+), 14 deletions(-) diff --git a/content/docs/api-reference/get_url.md b/content/docs/api-reference/get_url.md index 0cdb531271..27aed63526 100644 --- a/content/docs/api-reference/get_url.md +++ b/content/docs/api-reference/get_url.md @@ -36,7 +36,7 @@ URL returned depends on the `remote` used (see the [Parameters](#parameters) section). If the target is a directory, the returned URL will end in `.dir`. Refer to -[Structure of cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory) +[Structure of cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-the-cache-directory) and `dvc add` to learn more about how DVC handles data directories. ⚠️ This function does not check for the actual existence of the file or diff --git a/content/docs/command-reference/add.md b/content/docs/command-reference/add.md index 8adb73cd1f..1173c78b41 100644 --- a/content/docs/command-reference/add.md +++ b/content/docs/command-reference/add.md @@ -38,7 +38,7 @@ each one: 1. Calculate the file hash. 2. Move the file contents to the cache (by default in `.dvc/cache`), using the file hash to form the cached file path.
(See - [Structure of cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory) + [Structure of cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-the-cache-directory) for more details.) 3. Attempt to replace the file with a link to the cached data (more details on file linking further down). @@ -84,7 +84,7 @@ file in the cache (with `.dir` extension), that in turn points to the added files. > Refer to -> [Structure of cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory) +> [Structure of cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-the-cache-directory) > for more info. on `.dir` cache entries. Note that DVC commands that use tracked data support granular targeting of files diff --git a/content/docs/command-reference/cache/dir.md b/content/docs/command-reference/cache/dir.md index cc8bc49ddc..d2c64e119d 100644 --- a/content/docs/command-reference/cache/dir.md +++ b/content/docs/command-reference/cache/dir.md @@ -17,7 +17,7 @@ positional arguments: ## Description Helper to set the `cache.dir` configuration option. (See -[cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory).) +[cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-the-cache-directory).) Unlike doing so with `dvc config cache`, this command transform paths (`value`) that are provided relative to the current working directory into paths **relative to the config file location**. However, if the `value` provided is an diff --git a/content/docs/command-reference/config.md b/content/docs/command-reference/config.md index b683cec054..9c5af7052f 100644 --- a/content/docs/command-reference/config.md +++ b/content/docs/command-reference/config.md @@ -99,7 +99,7 @@ remote. See `dvc remote` for more information. 
A DVC project cache is the hidden storage (by default located in the `.dvc/cache` directory) for files that are tracked by DVC, and their different versions. (See `dvc cache` and -[DVC Files and Directories](/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory) +[DVC Files and Directories](/doc/user-guide/dvc-files-and-directories#structure-of-the-cache-directory) for more details.) This section contains the following options: - `cache.dir` - set/unset cache directory location. A correct value must be diff --git a/content/docs/command-reference/fetch.md b/content/docs/command-reference/fetch.md index 36d3194ff8..af3abf8509 100644 --- a/content/docs/command-reference/fetch.md +++ b/content/docs/command-reference/fetch.md @@ -189,7 +189,7 @@ $ tree .dvc Note that the `.dvc/cache` directory was created and populated. > Refer to -> [Structure of cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory) +> [Structure of cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-the-cache-directory) > for more info. Used without arguments (as above), `dvc fetch` downloads all assets needed by diff --git a/content/docs/command-reference/gc.md b/content/docs/command-reference/gc.md index d3d17312d5..2fb92be3f2 100644 --- a/content/docs/command-reference/gc.md +++ b/content/docs/command-reference/gc.md @@ -29,7 +29,7 @@ of commits (determined by reading the DVC-files in them). See the [Options](#options) section for more details. > Note that `dvc gc` tries to fetch any missing -> [`.dir` files](/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory) +> [`.dir` files](/doc/user-guide/dvc-files-and-directories#structure-of-the-cache-directory) > from [remote storage](/doc/command-reference/remote) to the local > cache, in order to know which files should exist inside cached > directories. 
These files may be missing if the cache directory was previously diff --git a/content/docs/command-reference/push.md b/content/docs/command-reference/push.md index 33ff39cc9b..e9168bb81b 100644 --- a/content/docs/command-reference/push.md +++ b/content/docs/command-reference/push.md @@ -194,7 +194,7 @@ Finally, we used `dvc status` to double check that all data had been uploaded. ## Example: What happens in the cache? Let's take a detailed look at what happens to the -[cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory) +[cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-the-cache-directory) as you run an experiment locally and push data to remote storage. To set the example consider having created a workspace that contains some code and data, and having set up a remote. @@ -242,7 +242,7 @@ the cache having more files in it than the remote – which is what the `new` state means. > Refer to -> [Structure of cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory) +> [Structure of cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-the-cache-directory) > for more info. Next we can copy the remaining data from the cache to the remote using diff --git a/content/docs/command-reference/run.md b/content/docs/command-reference/run.md index d193645099..3d391ccddf 100644 --- a/content/docs/command-reference/run.md +++ b/content/docs/command-reference/run.md @@ -94,7 +94,7 @@ Relevant notes: - Entire directories can be tracked as outputs, which produces a single `.dir` entry in the cache (refer to - [Structure of cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory) + [Structure of cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-the-cache-directory) for more info.) 
- Outputs are deleted from the workspace before executing the diff --git a/content/docs/user-guide/basic-concepts/dvc-cache.md b/content/docs/user-guide/basic-concepts/dvc-cache.md index b1afec5846..1d080775f4 100644 --- a/content/docs/user-guide/basic-concepts/dvc-cache.md +++ b/content/docs/user-guide/basic-concepts/dvc-cache.md @@ -6,4 +6,4 @@ match: ['DVC cache', cache, caches, cached] The DVC cache is a hidden storage (by default located in the `.dvc/cache` directory) for files that are under DVC control, and their different versions. For more details, please refer to this -[document](/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory). +[document](/doc/user-guide/dvc-files-and-directories#structure-of-the-cache-directory). diff --git a/content/docs/user-guide/dvc-files-and-directories.md b/content/docs/user-guide/dvc-files-and-directories.md index d5e78bfea8..48b9c3a34d 100644 --- a/content/docs/user-guide/dvc-files-and-directories.md +++ b/content/docs/user-guide/dvc-files-and-directories.md @@ -236,7 +236,7 @@ separately under `params`, grouped by parameters file. hand or with the command `dvc config --local`. - `.dvc/cache`: The cache directory will store your data in a - special [structure](#structure-of-cache-directory). The data files and + special [structure](#structure-of-the-cache-directory). The data files and directories in the workspace will only contain links to the data files in the cache. (Refer to [Large Dataset Optimization](/doc/user-guide/large-dataset-optimization). See @@ -277,7 +277,7 @@ separately under `params`, grouped by parameters file. dependencies and outputs, to allow safely running multiple DVC commands in parallel -## Structure of cache directory +## Structure of the cache directory There are two ways in which the data is stored in cache: As a single file (eg. `data.csv`), or a directory of files. 
diff --git a/content/docs/user-guide/dvcignore.md b/content/docs/user-guide/dvcignore.md index eceef9dff1..5fc2ca3291 100644 --- a/content/docs/user-guide/dvcignore.md +++ b/content/docs/user-guide/dvcignore.md @@ -85,7 +85,7 @@ Only the hash values of a directory (`data/`) and one file have been (`data1`). > Refer to -> [Structure of cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory) +> [Structure of cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-the-cache-directory) > for more info. Now, let's modify file `data1` and see if it affects `dvc status`. From cb80e026d0f5cb6cd28dc7a579139f79e395f5c4 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Wed, 5 Aug 2020 17:33:38 -0500 Subject: [PATCH 53/54] cmd: mention external deps/outs as a bullet in run notes per https://github.com/iterative/dvc.org/pull/1662#pullrequestreview-462062653 --- content/docs/command-reference/run.md | 20 ++++++++++++-------- 1 file changed, 12 insertions(+), 8 deletions(-) diff --git a/content/docs/command-reference/run.md b/content/docs/command-reference/run.md index 3d391ccddf..de9e3751d6 100644 --- a/content/docs/command-reference/run.md +++ b/content/docs/command-reference/run.md @@ -73,10 +73,10 @@ $ dvc run -n printer -d write.sh -o pages ./write.sh $ dvc run -n scanner -d read.sh -d pages -o signed.pdf ./read.sh ``` -Stage dependencies can be any file in the workspace, either untracked, or much -more commonly, tracked by DVC or Git. Outputs will be tracked and -cached by DVC when the stage is run. Every output version will be -cached when the stage is reproduced (see also `dvc gc`). +Stage dependencies can be any file or directory, either untracked, or more +commonly tracked by DVC or Git. Outputs will be tracked and cached +by DVC when the stage is run. Every output version will be cached when the stage +is reproduced (see also `dvc gc`). 
Relevant notes: @@ -97,10 +97,14 @@ Relevant notes: [Structure of cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-the-cache-directory) for more info.) -- Outputs are deleted from the workspace before executing the - command (including at `dvc repro`) if their paths are found as existing - files/directories. This also means that the stage command needs to recreate - any directory structures defined as outputs every time its executed by DVC. +- [external dependencies](/doc/user-guide/external-dependencies) and + [external outputs](/doc/user-guide/managing-external-data) (outside of the + workspace) are also supported. + +- Outputs are deleted from the workspace before executing the command (including + at `dvc repro`) if their paths are found as existing files/directories. This + also means that the stage command needs to recreate any directory structures + defined as outputs every time its executed by DVC. ### For displaying and comparing data science experiments From c02e81e2ac0a46e745efb4c491b436431a2df579 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Wed, 5 Aug 2020 17:36:18 -0500 Subject: [PATCH 54/54] cmd: improve note about tracking dirs as outputs in run per https://github.com/iterative/dvc.org/pull/1662#pullrequestreview-462062954 --- content/docs/command-reference/run.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/content/docs/command-reference/run.md b/content/docs/command-reference/run.md index de9e3751d6..7746a9cfdf 100644 --- a/content/docs/command-reference/run.md +++ b/content/docs/command-reference/run.md @@ -92,8 +92,8 @@ Relevant notes: - DVC does not feed dependency files to the command being run. The program will have to read by itself the files specified with `-d`. 
-- Entire directories can be tracked as outputs, which produces a single `.dir` - entry in the cache (refer to +- Entire directories produced by the stage can be tracked as outputs by DVC, + which generates a single `.dir` entry in the cache (refer to [Structure of cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-the-cache-directory) for more info.)
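
Reviewer note on the `.dir` wording in this last hunk: the idea of a single cache entry standing in for a whole output directory can be pictured with the rough Python sketch below. It builds a JSON manifest that lists every file in a directory by relative path and MD5 hash. The helper names `build_dir_manifest` and `demo`, and the exact field layout, are assumptions made for this illustration only; this is not DVC's implementation, and the real on-disk format is the one described in the "Structure of the cache directory" guide that the patch links to.

```python
import hashlib
import json
import os
import tempfile


def build_dir_manifest(root):
    """Return a JSON string listing each file under `root` with its MD5 hash.

    Hypothetical helper for illustration only; not part of DVC.
    """
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as fobj:
                digest = hashlib.md5(fobj.read()).hexdigest()
            # Store paths relative to the tracked directory, with "/" separators.
            relpath = os.path.relpath(path, root).replace(os.sep, "/")
            entries.append({"md5": digest, "relpath": relpath})
    # Sort so the manifest (and any hash of it) is stable across runs.
    entries.sort(key=lambda entry: entry["relpath"])
    return json.dumps(entries, sort_keys=True)


def demo():
    """Build a manifest for a small throwaway directory with two files."""
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "sub"))
        with open(os.path.join(root, "a.txt"), "w") as fobj:
            fobj.write("hello")
        with open(os.path.join(root, "sub", "b.txt"), "w") as fobj:
            fobj.write("world")
        return build_dir_manifest(root)
```

Calling `demo()` returns one JSON document covering both files, which is the sense in which a stage such as `dvc run -n printer -d write.sh -o pages ./write.sh` needs only one `-o` entry for the whole `pages` directory.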