From 71b9c234cc7fade4dc76c12df73af0cf86be0baf Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Mon, 1 Feb 2021 18:01:48 -0600 Subject: [PATCH 1/3] run-cache: write basic docs (in internal files guide + tooltip) for #1289 --- content/docs/command-reference/fetch.md | 5 ++- content/docs/command-reference/pull.md | 7 ++-- content/docs/command-reference/push.md | 5 ++- content/docs/command-reference/repro.md | 11 +++-- content/docs/command-reference/run.md | 7 ++-- .../user-guide/basic-concepts/run-cache.md | 11 +++++ .../project-structure/internal-files.md | 41 ++++++++++++++++--- 7 files changed, 68 insertions(+), 19 deletions(-) create mode 100644 content/docs/user-guide/basic-concepts/run-cache.md diff --git a/content/docs/command-reference/fetch.md b/content/docs/command-reference/fetch.md index 5a7a4575c7..60922537c9 100644 --- a/content/docs/command-reference/fetch.md +++ b/content/docs/command-reference/fetch.md @@ -70,8 +70,9 @@ specific one is given with `--remote`. [remote storage](/doc/command-reference/remote) to fetch from (see `dvc remote list`). -- `--run-cache` - downloads all available history of stage runs from the remote - repository. +- `--run-cache` - downloads all available history of + [stage runs](/doc/user-guide/project-structure/internal-files#run-cache) from + the remote repository. See the same option in `dvc push`. - `-d`, `--with-deps` - determines files to download by tracking dependencies to the `targets`. If none are provided, this option is ignored. By traversing all diff --git a/content/docs/command-reference/pull.md b/content/docs/command-reference/pull.md index 42d46f905b..86099655a8 100644 --- a/content/docs/command-reference/pull.md +++ b/content/docs/command-reference/pull.md @@ -110,9 +110,10 @@ used to see what files `dvc pull` would download. [remote storage](/doc/command-reference/remote) to pull from (see `dvc remote list`). 
-- `--run-cache` - downloads all available history of stage runs from the remote - repository (to the cache only, like `dvc fetch --run-cache`). Note that - `dvc repro ` is necessary to checkout these files (into the +- `--run-cache` - downloads all available history of + [stage runs](/doc/user-guide/project-structure/internal-files#run-cache) from + the remote repository (to the cache only, like `dvc fetch --run-cache`). Note + that `dvc repro ` is necessary to checkout these files (into the workspace) and update `dvc.lock`. - `-j `, `--jobs ` - parallelism level for DVC to download data diff --git a/content/docs/command-reference/push.md b/content/docs/command-reference/push.md index 67a1c9dbf2..89c1373a3f 100644 --- a/content/docs/command-reference/push.md +++ b/content/docs/command-reference/push.md @@ -88,8 +88,9 @@ in the cache (compared to the default remote.) It can be used to see what files [remote storage](/doc/command-reference/remote) to push to (see `dvc remote list`). -- `--run-cache` - uploads all available history of stage runs to the remote - repository. +- `--run-cache` - uploads all available history of + [stage runs](/doc/user-guide/project-structure/internal-files#run-cache) to + the remote repository. - `-j `, `--jobs ` - parallelism level for DVC to upload data to remote storage. The default value is `4 * cpu_count()`. For SSH remotes, the diff --git a/content/docs/command-reference/repro.md b/content/docs/command-reference/repro.md index 08844a1999..64d76061a4 100644 --- a/content/docs/command-reference/repro.md +++ b/content/docs/command-reference/repro.md @@ -153,8 +153,11 @@ up-to-date and only execute the final stage. present in the DVC project. Specifying `targets` has no effects with this option, as all possible targets are already included. -- `--no-run-cache` - execute stage commands even if they have already been run - with the same dependencies/outputs/etc. before. 
+- `--no-run-cache` - execute stage command(s) even if they have already been run + with the same dependencies and outputs (see the + [details](/doc/user-guide/project-structure/internal-files#run-cache)). Useful + for example if the stage command(s) are non-deterministic + ([not recommended](/doc/command-reference/run#avoiding-unexpected-behavior)). - `--force-downstream` - in cases like `... -> A (changed) -> B -> C` it will reproduce `A` first and then `B`, even if `B` was previously executed with the @@ -178,8 +181,8 @@ up-to-date and only execute the final stage. - `--pull` - [pulls](/doc/command-reference/pull) dependencies and outputs involved in the stages being reproduced, if they are found in the - [default](/doc/command-reference/remote/default) remote storage. Note that it - checks the local run-cache too (available history of stage runs). + [default remote storage](/doc/command-reference/remote/default). Note that it + tries the local run-cache first. > Has no effect if combined with `--no-run-cache`. diff --git a/content/docs/command-reference/run.md b/content/docs/command-reference/run.md index aab7be0502..c6862d8850 100644 --- a/content/docs/command-reference/run.md +++ b/content/docs/command-reference/run.md @@ -240,9 +240,10 @@ $ dvc run -n second_stage './another_script.sh $MYENVVAR' - `-f`, `--force` - overwrite an existing stage in `dvc.yaml` file without asking for confirmation. -- `--no-run-cache` - execute the stage `command` even if it has already been run - with the same dependencies/outputs/etc. before. Useful for example if the - command's code is non-deterministic +- `--no-run-cache` - execute the stage command(s) even if they have already been + run with the same dependencies and outputs (see the + [details](/doc/user-guide/project-structure/internal-files#run-cache)). Useful + for example if the stage command(s) are non-deterministic ([not recommended](#avoiding-unexpected-behavior)). 
- `--no-commit` - do not store the outputs of this execution in the cache diff --git a/content/docs/user-guide/basic-concepts/run-cache.md b/content/docs/user-guide/basic-concepts/run-cache.md new file mode 100644 index 0000000000..04e2afe24c --- /dev/null +++ b/content/docs/user-guide/basic-concepts/run-cache.md @@ -0,0 +1,11 @@ +--- +name: 'Run-cache' +match: ['run-cache'] +--- + +The DVC run-cache is a log of stage runs in the project. It's comprised of +`dvc.lock` file backups, identified as combinations of dependencies, commands, +and outputs that correspond to each other. `dvc repro` and `dvc run` populate +and reutilize the run-cache. See +[Run-cache](/doc/user-guide/project-structure/internal-files#run-cache) for more +details. diff --git a/content/docs/user-guide/project-structure/internal-files.md b/content/docs/user-guide/project-structure/internal-files.md index 675bdb1f96..4f1932190e 100644 --- a/content/docs/user-guide/project-structure/internal-files.md +++ b/content/docs/user-guide/project-structure/internal-files.md @@ -15,18 +15,22 @@ operation. (credentials, private locations, etc). The local config file can be edited by hand or with the command `dvc config --local`. -- `.dvc/cache`: The cache directory will store your data in a - special [structure](#structure-of-the-cache-directory). The data files and - directories in the workspace will only contain links to the data - files in the cache. (Refer to +- `.dvc/cache`: Default location of the cache directory. The cache + stores the project data in a special + [structure](#structure-of-the-cache-directory). The data files and directories + in the workspace will only contain links to the data files in the + cache (refer to [Large Dataset Optimization](/doc/user-guide/large-dataset-optimization)). See - `dvc config cache` for related configuration options. + `dvc config cache` for related configuration options, including changing its + location. 
 > Note that DVC includes the cache directory in `.gitignore` during > initialization. No data tracked by DVC should ever be pushed to the Git > repository, only the DVC files that are needed to download or > reproduce that data. +- `.dvc/cache/runs`: Default location of the [run-cache](#run-cache). + - `.dvc/plots`: Directory for [plot templates](/doc/command-reference/plots#plot-templates) @@ -120,3 +124,30 @@ $ cat .dvc/cache/19/6a322c107c2572335158503c64bfba.dir ``` That's how DVC knows that the other two cached files belong in the directory. + +### Run-cache + +`dvc repro` and `dvc run` by default populate and reutilize a log of stage runs +known, which is found in the `runs/` directory inside the cache (or +[remote storage](/doc/command-reference/remote)). + +Runs are identified as combinations of dependencies, commands, and +outputs that correspond to each other. These combinations are +hashed into special values that make up the file paths inside the run-cache dir. + +```dvc +$ tree .dvc/cache/runs +.dvc/cache/runs +└── 86 + └── 8632e1555283d6e23ec808c9ee1fadc30630c888d5c08695333609ef341508bf + └── e98a34c44fa6b564ef211e76fb3b265bc67f19e5de2e255217d3900d8f... +``` + +The files themselves are backups of the `dvc.lock` file that resulted from that +run. + +> Note that the run's outputs are stored in and retrieved from the +> regular cache. + +💡 `dvc push` and `dvc pull` (and `dvc fetch`) can upload and download the +run-cache to and from remote storage for sharing and/or as a backup. 
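The Run-cache section added in this patch says that combinations of dependencies, commands, and outputs are hashed into values that make up the file paths under `runs/`. The sketch below illustrates that idea only; it is not DVC's actual algorithm. The function name `run_cache_path`, the JSON serialization, the use of SHA-256, and the split into a "key" (command, dependencies, output paths) and a "value" (the same plus output hashes) are all assumptions chosen so that the result has the same shape as the `tree` listing above: a two-character prefix directory, a 64-character directory, and a 64-character file name.

```python
import hashlib
import json
from pathlib import PurePosixPath


def _sha256(payload):
    """Hash a JSON-serializable payload deterministically (assumed scheme)."""
    data = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(data).hexdigest()


def run_cache_path(cmd, deps, outs, runs_dir=".dvc/cache/runs"):
    """Return a hypothetical run-cache path for one stage run.

    `deps` and `outs` map file paths to content hashes. This layout is an
    illustration, not the one DVC really uses.
    """
    # Key: what went into the run (command, dependency hashes, output paths).
    key = _sha256({"cmd": cmd, "deps": deps, "outs": sorted(outs)})
    # Value: the full combination, including the hashes of the outputs.
    value = _sha256({"cmd": cmd, "deps": deps, "outs": outs})
    return PurePosixPath(runs_dir) / key[:2] / key / value


if __name__ == "__main__":
    print(run_cache_path(
        "python train.py",
        {"data.csv": "196a322c107c2572335158503c64bfba"},
        {"model.pkl": "0123456789abcdef0123456789abcdef"},
    ))
```

Under these assumptions, an identical command with identical dependency hashes always maps to the same directory, which is what would let a later `dvc repro` recognize a run it has already seen.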
From 88fc6883881d262b4b2b09bd8cc4f6299638c3a5 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Mon, 1 Feb 2021 18:19:27 -0600 Subject: [PATCH 2/3] guide: clarify what the run-cache log is about --- content/docs/user-guide/basic-concepts/run-cache.md | 8 ++++---- .../docs/user-guide/project-structure/internal-files.md | 6 +++--- 2 files changed, 7 insertions(+), 7 deletions(-) diff --git a/content/docs/user-guide/basic-concepts/run-cache.md b/content/docs/user-guide/basic-concepts/run-cache.md index 04e2afe24c..0eab76c538 100644 --- a/content/docs/user-guide/basic-concepts/run-cache.md +++ b/content/docs/user-guide/basic-concepts/run-cache.md @@ -3,9 +3,9 @@ name: 'Run-cache' match: ['run-cache'] --- -The DVC run-cache is a log of stage runs in the project. It's comprised of -`dvc.lock` file backups, identified as combinations of dependencies, commands, -and outputs that correspond to each other. `dvc repro` and `dvc run` populate -and reutilize the run-cache. See +The DVC run-cache is a log of stages that have been run in the project. It's +comprised of `dvc.lock` file backups, identified as combinations of +dependencies, commands, and outputs that correspond to each other. `dvc repro` +and `dvc run` populate and reutilize the run-cache. See [Run-cache](/doc/user-guide/project-structure/internal-files#run-cache) for more details. diff --git a/content/docs/user-guide/project-structure/internal-files.md b/content/docs/user-guide/project-structure/internal-files.md index 4f1932190e..2d1f25eec1 100644 --- a/content/docs/user-guide/project-structure/internal-files.md +++ b/content/docs/user-guide/project-structure/internal-files.md @@ -127,9 +127,9 @@ That's how DVC knows that the other two cached files belong in the directory. ### Run-cache -`dvc repro` and `dvc run` by default populate and reutilize a log of stage runs -known, which is found in the `runs/` directory inside the cache (or -[remote storage](/doc/command-reference/remote)). 
+`dvc repro` and `dvc run` by default populate and reutilize a log of stages that +have been run in the project. It is found in the `runs/` directory inside the +cache (or [remote storage](/doc/command-reference/remote)). Runs are identified as combinations of dependencies, commands, and outputs that correspond to each other. These combinations are From d2f8ebbcce1990fa2c0a88ed272b97313f30ae8c Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Wed, 3 Feb 2021 10:27:32 -0600 Subject: [PATCH 3/3] cmd: correct repro --pull per https://github.com/iterative/dvc.org/pull/2137#pullrequestreview-580982343 --- content/docs/command-reference/repro.md | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/content/docs/command-reference/repro.md b/content/docs/command-reference/repro.md index 64d76061a4..30e41612a8 100644 --- a/content/docs/command-reference/repro.md +++ b/content/docs/command-reference/repro.md @@ -182,9 +182,7 @@ up-to-date and only execute the final stage. - `--pull` - [pulls](/doc/command-reference/pull) dependencies and outputs involved in the stages being reproduced, if they are found in the [default remote storage](/doc/command-reference/remote/default). Note that it - tries the local run-cache first. - - > Has no effect if combined with `--no-run-cache`. + tries the local run-cache first (unless `--no-run-cache` is also used). - `-h`, `--help` - prints the usage/help message, and exit.
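The corrected `--pull` wording in PATCH 3/3 describes an order of lookups: the local run-cache is tried first unless `--no-run-cache` is also given, and remote storage is consulted when `--pull` is set. The toy model below sketches one reading of that order. It is not DVC code: `resolve_stage`, the dictionary stand-ins for the run-cache and remote storage, and the returned labels are all hypothetical.

```python
def resolve_stage(key, local_run_cache, remote_storage,
                  pull=False, no_run_cache=False):
    """Decide where a stage's results come from (illustrative model only).

    `local_run_cache` and `remote_storage` are plain dicts standing in for
    the real stores; `key` identifies a combination of dependencies,
    commands, and outputs.
    """
    # The local run-cache is tried first, unless it is disabled.
    if not no_run_cache and key in local_run_cache:
        return ("local run-cache", local_run_cache[key])
    # With --pull, fall back to what the default remote storage holds.
    if pull and key in remote_storage:
        return ("remote storage", remote_storage[key])
    # Otherwise the stage command(s) have to be executed.
    return ("execute", None)


if __name__ == "__main__":
    local = {"run-a": "lock-a"}
    remote = {"run-a": "lock-a", "run-b": "lock-b"}
    print(resolve_stage("run-a", local, remote, pull=True))
    print(resolve_stage("run-a", local, remote, pull=True, no_run_cache=True))
    print(resolve_stage("run-c", local, remote, pull=True))
```

The point of the model is that, on this reading, `--no-run-cache` only removes the first lookup and does not cancel `--pull`, which matches the replacement of the earlier "Has no effect if combined with `--no-run-cache`" note.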