From 2494b952d1b60e0848f1584244d06b85d347e74f Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Fri, 13 Nov 2020 21:45:17 -0800 Subject: [PATCH 01/23] cmd: link add --glob to the glob py mod per https://github.com/iterative/dvc.org/pull/1929#pullrequestreview-529551406 --- content/docs/command-reference/add.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/content/docs/command-reference/add.md b/content/docs/command-reference/add.md index 4d49e0c9fe..b86d51a22a 100644 --- a/content/docs/command-reference/add.md +++ b/content/docs/command-reference/add.md @@ -144,9 +144,9 @@ directory symlinks. - `--external` - allow `targets` that are outside of the DVC repository. See [Managing External Data](/doc/user-guide/managing-external-data). -- `--glob` - allows adding files and directories that match the specified - pattern as specified by `target`. Shell-style wildcards are supported: `*`, - `?`, `[seq]`, `[!seq]`, and `**`. +- `--glob` - allows adding files and directories that match the + [pattern](https://docs.python.org/3/library/glob.html) specified in `targets`. + Shell style wildcards supported: `*`, `?`, `[seq]`, `[!seq]`, and `**` - `-h`, `--help` - prints the usage/help message, and exit. From 8843e949d6c7ae49fb8f6a983a9b1feea005981f Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Fri, 13 Nov 2020 22:02:48 -0800 Subject: [PATCH 02/23] cmd: copy edit symlink targets section of add --- content/docs/command-reference/add.md | 20 +++++++++++--------- 1 file changed, 11 insertions(+), 9 deletions(-) diff --git a/content/docs/command-reference/add.md b/content/docs/command-reference/add.md index b86d51a22a..8a950d2a4e 100644 --- a/content/docs/command-reference/add.md +++ b/content/docs/command-reference/add.md @@ -98,13 +98,13 @@ undesirable for data directories with a large number of files. To avoid adding files inside a directory accidentally, you can add the corresponding [patterns](/doc/user-guide/dvcignore) to `.dvcignore`. 
-### Adding symlinked targets {#add-symlink} +### Adding symlink targets {#add-symlink} -DVC only supports symlinked files as valid targets for `dvc add`. If the target -path is a directory symlink, or if the target path contains any intermediate -directory symlinks, `dvc add` will fail. +`dvc add` supports symlinked files as `targets`. But if a target path is a +directory symlink, or if it contains any intermediate directory symlinks, it +cannot be added to DVC. -So given the following project structure: +For example, given the following project structure: ``` . @@ -117,10 +117,9 @@ So given the following project structure: └── link_to_file -> dir/file ``` -`dir`, `dir/file`, `link_to_external_file` and `link_to_file` are all valid -targets for `dvc add`. `link_to_dir`, `link_to_external_dir` and -`link_to_dir/file` are invalid targets, since the target path would contain -directory symlinks. +`link_to_file` and `link_to_external_file` are both valid symlink targets to +`dvc add`. But `link_to_dir`, `link_to_external_dir`, and `link_to_dir/file` are +not. ## Options @@ -144,6 +143,9 @@ directory symlinks. - `--external` - allow `targets` that are outside of the DVC repository. See [Managing External Data](/doc/user-guide/managing-external-data). + > Note that this option implies `--no-commit`, as external outputs are never + > pushed or pulled from/to remote storage. See link above for more details. + - `--glob` - allows adding files and directories that match the [pattern](https://docs.python.org/3/library/glob.html) specified in `targets`. 
Shell style wildcards supported: `*`, `?`, `[seq]`, `[!seq]`, and `**` From fd47b432786009a20afd083a8366c16525a9e866 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Fri, 13 Nov 2020 22:03:51 -0800 Subject: [PATCH 03/23] cmd: copy edit import(-url) --no-exec text --- content/docs/command-reference/import-url.md | 7 ++++--- content/docs/command-reference/import.md | 11 ++++++----- 2 files changed, 10 insertions(+), 8 deletions(-) diff --git a/content/docs/command-reference/import-url.md b/content/docs/command-reference/import-url.md index 5c18fa65df..b675bd6606 100644 --- a/content/docs/command-reference/import-url.md +++ b/content/docs/command-reference/import-url.md @@ -125,9 +125,10 @@ source. - `--no-exec` - create `.dvc` file without actually downloading `url`. E.g. if the file or directory already exists, this can be used to skip the download. - The data hash is not calculated by this, only the metadata is saved into the - `.dvc` file. You can use `dvc commit .dvc` if you need the hashes in the - new `.dvc` file and save existing data to the cache. + The data hash is not calculated when this option is used, only the import + metadata is saved to the `.dvc` file. `dvc commit .dvc` can be used if + the data hashes are needed in the `.dvc` file, and to save existing data to + the cache. - `-h`, `--help` - prints the usage/help message, and exit. diff --git a/content/docs/command-reference/import.md b/content/docs/command-reference/import.md index f435d8a017..d7abd0f96d 100644 --- a/content/docs/command-reference/import.md +++ b/content/docs/command-reference/import.md @@ -103,11 +103,12 @@ repo at `url`) are not supported. > [Importing and updating fixed revisions](#example-importing-and-updating-fixed-revisions) > example below). -- `--no-exec` - create `.dvc` file without actually downloading the file or - directory. E.g. if the file or directory already exists, this can be used to - skip the download. 
The data hash is not calculated by this, only the metadata - is saved into the `.dvc` file. You can use `dvc commit .dvc` if you need - the hashes in the new `.dvc` file and save existing data to the cache. +- `--no-exec` - create the import `.dvc` file without actually downloading the + file or directory. E.g. if the file or directory already exists, this can be + used to skip the download. The data hash is not calculated when this option is + used, only the import metadata is saved to the `.dvc` file. + `dvc commit .dvc` can be used if the data hashes are needed in the `.dvc` + file, and to save existing data to the cache. - `-h`, `--help` - prints the usage/help message, and exit. From 06d24fd707a01572b5ee5ff55751adafb2a8b5f3 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Fri, 13 Nov 2020 22:05:34 -0800 Subject: [PATCH 04/23] cmd: copy edits to plots and metrics --- content/docs/command-reference/metrics/diff.md | 4 ++-- content/docs/command-reference/plots/diff.md | 4 ++-- content/docs/command-reference/plots/index.md | 2 +- 3 files changed, 5 insertions(+), 5 deletions(-) diff --git a/content/docs/command-reference/metrics/diff.md b/content/docs/command-reference/metrics/diff.md index cfbecae900..54d09757ee 100644 --- a/content/docs/command-reference/metrics/diff.md +++ b/content/docs/command-reference/metrics/diff.md @@ -22,8 +22,8 @@ positional arguments: This command provides a quick way to compare metrics among experiments in the repository history. All metrics defined in `dvc.yaml` are used by default. The -comparison shown by this command includes the new value, and the numeric -difference (delta) with the previous value (rounded to 5 digits precision). +differences shown by this command include the new value, and numeric difference +(delta) from the previous value of metrics (rounded to 5 digits precision). `a_rev` and `b_rev` are Git commit hashes, tag, or branch names. 
If none are specified, `dvc metrics diff` compares metrics currently present in the diff --git a/content/docs/command-reference/plots/diff.md b/content/docs/command-reference/plots/diff.md index 5bcd21353d..f295dda61b 100644 --- a/content/docs/command-reference/plots/diff.md +++ b/content/docs/command-reference/plots/diff.md @@ -31,8 +31,8 @@ versions of the repository, by overlaying them in a single plot. (uncommitted changes) with their latest commit (required). A single specified revision results in comparing the workspace and that version. -💡 Note that any number of `revisions` can be provided, and the resulting plot -shows all of them in a single image. +💡 Note that any number of `revisions` can be provided (the resulting plot shows +all of them in a single image). All plots defined in `dvc.yaml` are used by default, but specific plots files can be specified with the `--targets` option (note that targets don't diff --git a/content/docs/command-reference/plots/index.md b/content/docs/command-reference/plots/index.md index 9a9af955bf..5171024e5f 100644 --- a/content/docs/command-reference/plots/index.md +++ b/content/docs/command-reference/plots/index.md @@ -134,7 +134,7 @@ header (first row) are equivalent to field names. ### DVC template anchors -- `` (**required**) - the plot data from any kind of metrics +- `` (**required**) - the plot data from any type of metrics files is converted to a single JSON array internally, and injected instead of this anchor. Two additional fields will be added: `index` and `rev` (explained above). 
From 6dcba8a636143b6a602449817de8ca24c8c6c4f7 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Sun, 15 Nov 2020 23:46:46 -0800 Subject: [PATCH 05/23] cmd: move add --glob option up 1 place per https://github.com/iterative/dvc.org/pull/1945#pullrequestreview-530601196 --- content/docs/command-reference/add.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/content/docs/command-reference/add.md b/content/docs/command-reference/add.md index 8a950d2a4e..462526d64e 100644 --- a/content/docs/command-reference/add.md +++ b/content/docs/command-reference/add.md @@ -140,16 +140,16 @@ not. the given target. This option allows to set the name and the path of the generated `.dvc` file. +- `--glob` - allows adding files and directories that match the + [pattern](https://docs.python.org/3/library/glob.html) specified in `targets`. + Shell style wildcards supported: `*`, `?`, `[seq]`, `[!seq]`, and `**` + - `--external` - allow `targets` that are outside of the DVC repository. See [Managing External Data](/doc/user-guide/managing-external-data). > Note that this option implies `--no-commit`, as external outputs are never > pushed or pulled from/to remote storage. See link above for more details. -- `--glob` - allows adding files and directories that match the - [pattern](https://docs.python.org/3/library/glob.html) specified in `targets`. - Shell style wildcards supported: `*`, `?`, `[seq]`, `[!seq]`, and `**` - - `-h`, `--help` - prints the usage/help message, and exit. - `-q`, `--quiet` - do not write anything to standard output. 
Exit with 0 if no From 21e37254fa2a451fe84af600a8cad5627d5a4716 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Mon, 16 Nov 2020 12:26:00 -0800 Subject: [PATCH 06/23] cmd: correct note about add --external per https://github.com/iterative/dvc.org/pull/1945#pullrequestreview-530601167 --- content/docs/command-reference/add.md | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/content/docs/command-reference/add.md b/content/docs/command-reference/add.md index 462526d64e..7b4a1c9ec1 100644 --- a/content/docs/command-reference/add.md +++ b/content/docs/command-reference/add.md @@ -147,8 +147,10 @@ not. - `--external` - allow `targets` that are outside of the DVC repository. See [Managing External Data](/doc/user-guide/managing-external-data). - > Note that this option implies `--no-commit`, as external outputs are never - > pushed or pulled from/to remote storage. See link above for more details. + > Note that external outputs require an external cache setup (unless + > `--no-commit` is used), and are never pushed or pulled from/to + > [remote storage](/doc/command-reference/remote). See link above for more + > details. - `-h`, `--help` - prints the usage/help message, and exit. From 89dd826911822b7bee8367a2bdbb68ff014b65ab Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Mon, 16 Nov 2020 14:56:51 -0800 Subject: [PATCH 07/23] cmd: clarify --jobs for status and gc per https://github.com/iterative/dvc.org/pull/1941#discussion_r524541298 --- content/docs/command-reference/gc.md | 4 ++-- content/docs/command-reference/status.md | 4 ++-- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/content/docs/command-reference/gc.md b/content/docs/command-reference/gc.md index 3b30203195..33f454dfee 100644 --- a/content/docs/command-reference/gc.md +++ b/content/docs/command-reference/gc.md @@ -93,8 +93,8 @@ The default remote is cleaned (see `dvc config core.remote`) unless the from remote storage. 
This only applies when the `--cloud` option is used, or a `--remote` is given. The default value is `4 * cpu_count()`. For SSH remotes, the default is `4`. Note that the default value can be set using the `jobs` - config option with `dvc remote modify`. Using more jobs may improve the - overall connection speed. + config option with `dvc remote modify`. Using more jobs may speed up the + operation. > For now only some phases of garbage collection are parallel. diff --git a/content/docs/command-reference/status.md b/content/docs/command-reference/status.md index 4f31cbace5..d7cf0a2225 100644 --- a/content/docs/command-reference/status.md +++ b/content/docs/command-reference/status.md @@ -143,8 +143,8 @@ that. information from remote storage. This only applies when the `--cloud` option is used, or a `--remote` is given. The default value is `4 * cpu_count()`. For SSH remotes, the default is `4`. Note that the default value can be set using - the `jobs` config option with `dvc remote modify`. Using more jobs may improve - the overall connection speed. + the `jobs` config option with `dvc remote modify`. Using more jobs may speed + up the operation. - `-h`, `--help` - prints the usage/help message, and exit. From 2ab35cc2b8e87194235b0ad772e1082561263839 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Mon, 16 Nov 2020 17:47:28 -0800 Subject: [PATCH 08/23] cmd: fix path to .dvc/tmp/state file in --no-commit desc --- content/docs/command-reference/add.md | 2 +- content/docs/command-reference/repro.md | 2 +- content/docs/command-reference/run.md | 8 ++++---- 3 files changed, 6 insertions(+), 6 deletions(-) diff --git a/content/docs/command-reference/add.md b/content/docs/command-reference/add.md index 7b4a1c9ec1..5412c38c5e 100644 --- a/content/docs/command-reference/add.md +++ b/content/docs/command-reference/add.md @@ -129,7 +129,7 @@ not. file is created using the process described in this command's description. - `--no-commit` - do not save outputs to cache. 
A `.dvc` file is created and an - entry is added to `.dvc/state`, while nothing is added to the cache. + entry is added to `.dvc/tmp/state`, while nothing is added to the cache. (`dvc status` will report that the file is `not in cache`.) Use `dvc commit` when ready to commit outputs with DVC. This is analogous to using `git add` before `git commit`. diff --git a/content/docs/command-reference/repro.md b/content/docs/command-reference/repro.md index 5a6379739a..a2024e870c 100644 --- a/content/docs/command-reference/repro.md +++ b/content/docs/command-reference/repro.md @@ -115,7 +115,7 @@ up-to-date and only execute the final stage. If there are no directories among the targets, this option is ignored. - `--no-commit` - do not save outputs to cache. A DVC-file is created and an - entry is added to `.dvc/state`, while nothing is added to the cache. + entry is added to `.dvc/tmp/state`, while nothing is added to the cache. (`dvc status` will report that the file is `not in cache`.) Use `dvc commit` when ready to commit outputs with DVC. Useful to avoid caching unnecessary data repeatedly when running multiple experiments. diff --git a/content/docs/command-reference/run.md b/content/docs/command-reference/run.md index 8eb6714f81..97eed018da 100644 --- a/content/docs/command-reference/run.md +++ b/content/docs/command-reference/run.md @@ -245,10 +245,10 @@ $ dvc run -n my_stage './my_script.sh $MYENVVAR' ([not recommended](#avoiding-unexpected-behavior)). - `--no-commit` - do not save outputs to cache. A stage created and an entry is - added to `.dvc/state`, while nothing is added to the cache. In the stage file, - the file hash values will be empty; They will be populated the next time this - stage is actually executed, or `dvc commit` can be used to force committing - existing output file versions to cache. + added to `.dvc/tmp/state`, while nothing is added to the cache. 
In the stage + file, the file hash values will be empty; They will be populated the next time + this stage is actually executed, or `dvc commit` can be used to force + committing existing output file versions to cache. This is useful to avoid caching unnecessary data repeatedly when running multiple experiments. From 7bad093e8b20c221527c013d0b6412be67e1b9df Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Mon, 16 Nov 2020 18:38:58 -0800 Subject: [PATCH 09/23] cmd: move notes about --external outputs to x data guide per https://github.com/iterative/dvc.org/pull/1945#pullrequestreview-531756262 --- content/docs/command-reference/add.md | 15 +++++------ content/docs/command-reference/repro.md | 10 ++++---- content/docs/command-reference/run.md | 12 ++++----- .../docs/user-guide/managing-external-data.md | 25 +++++++++++-------- 4 files changed, 32 insertions(+), 30 deletions(-) diff --git a/content/docs/command-reference/add.md b/content/docs/command-reference/add.md index 5412c38c5e..257ce0926c 100644 --- a/content/docs/command-reference/add.md +++ b/content/docs/command-reference/add.md @@ -128,11 +128,10 @@ not. among the `targets`, this option is ignored. For each file found, a new `.dvc` file is created using the process described in this command's description. -- `--no-commit` - do not save outputs to cache. A `.dvc` file is created and an - entry is added to `.dvc/tmp/state`, while nothing is added to the cache. - (`dvc status` will report that the file is `not in cache`.) Use `dvc commit` - when ready to commit outputs with DVC. This is analogous to using `git add` - before `git commit`. +- `--no-commit` - do not save outputs to cache. A `.dvc` file is created, while + nothing is added to the cache. (`dvc status` will report that the file is + `not in cache`.) Use `dvc commit` when ready to commit outputs with DVC. This + is analogous to using `git add` before `git commit`. - `--file ` - specify name of the `.dvc` file it generates. 
This option works only if there is a single target. By default the name of the @@ -147,10 +146,8 @@ not. - `--external` - allow `targets` that are outside of the DVC repository. See [Managing External Data](/doc/user-guide/managing-external-data). - > Note that external outputs require an external cache setup (unless - > `--no-commit` is used), and are never pushed or pulled from/to - > [remote storage](/doc/command-reference/remote). See link above for more - > details. + > Note that external outputs typically require an external cache setup. See + > link above for more details. - `-h`, `--help` - prints the usage/help message, and exit. diff --git a/content/docs/command-reference/repro.md b/content/docs/command-reference/repro.md index a2024e870c..733d5c2a90 100644 --- a/content/docs/command-reference/repro.md +++ b/content/docs/command-reference/repro.md @@ -114,11 +114,11 @@ up-to-date and only execute the final stage. target directory and its subdirectories for stages (in `dvc.yaml`) to inspect. If there are no directories among the targets, this option is ignored. -- `--no-commit` - do not save outputs to cache. A DVC-file is created and an - entry is added to `.dvc/tmp/state`, while nothing is added to the cache. - (`dvc status` will report that the file is `not in cache`.) Use `dvc commit` - when ready to commit outputs with DVC. Useful to avoid caching unnecessary - data repeatedly when running multiple experiments. +- `--no-commit` - do not save outputs to cache. A DVC-file is created, while + nothing is added to the cache. (`dvc status` will report that the file is + `not in cache`.) Use `dvc commit` when ready to commit outputs with DVC. + Useful to avoid caching unnecessary data repeatedly when running multiple + experiments. - `-m`, `--metrics` - show metrics after reproduction. 
The target pipelines must have at least one metrics file defined either with the `dvc metrics` command, diff --git a/content/docs/command-reference/run.md b/content/docs/command-reference/run.md index 97eed018da..940f82dd4b 100644 --- a/content/docs/command-reference/run.md +++ b/content/docs/command-reference/run.md @@ -244,11 +244,11 @@ $ dvc run -n my_stage './my_script.sh $MYENVVAR' command's code is non-deterministic ([not recommended](#avoiding-unexpected-behavior)). -- `--no-commit` - do not save outputs to cache. A stage created and an entry is - added to `.dvc/tmp/state`, while nothing is added to the cache. In the stage - file, the file hash values will be empty; They will be populated the next time - this stage is actually executed, or `dvc commit` can be used to force - committing existing output file versions to cache. +- `--no-commit` - do not save outputs to cache. A stage is created, while nothing + is added to the cache. In the stage file, the file hash values will be empty; + they will be populated the next time this stage is actually executed, or + `dvc commit` can be used to force committing existing output file versions to + cache. This is useful to avoid caching unnecessary data repeatedly when running multiple experiments. @@ -260,7 +260,7 @@ $ dvc run -n my_stage './my_script.sh $MYENVVAR' > Note that DVC-files without dependencies are automatically considered > "always changed", so this option has no effect in those cases. -- `--external` - allow outputs that are outside of the DVC repository. See +- `--external` - allow writing outputs outside of the DVC repository. See [Managing External Data](/doc/user-guide/managing-external-data). - `-h`, `--help` - prints the usage/help message, and exit. 
diff --git a/content/docs/user-guide/managing-external-data.md b/content/docs/user-guide/managing-external-data.md index c31545c67b..1393cf4880 100644 --- a/content/docs/user-guide/managing-external-data.md +++ b/content/docs/user-guide/managing-external-data.md @@ -19,12 +19,13 @@ outputs for `dvc.yaml` files (only `outs` field, not metrics or plots). External outputs are considered part of the (extended) DVC project: DVC will track changes in them, and reflect this in `dvc status` reports, for example. -For cached external outputs (e.g. `dvc add`, `dvc run -o`), you will need to -[setup an external cache](/doc/use-cases/shared-development-server#configure-the-external-shared-cache) -in the same external/remote file system first. +Normally, external outputs (e.g. `dvc add`, `dvc run -o`), require an +[external cache](/doc/use-cases/shared-development-server#configure-the-external-shared-cache) +in the same external/remote file. The exception is when they are not +cached (`--no-commit` option). -Currently, the following types (protocols) of external outputs (and -cache) are supported: +Currently, the following types (protocols) of external outputs (and cache) are +supported: - Amazon S3 - Microsoft Azure Blob Storage @@ -33,13 +34,17 @@ Currently, the following types (protocols) of external outputs (and - HDFS - Local files and directories outside the workspace -> Note that these are a subset of the remote storage types supported by +> Note that these are only a subset of the remote storage types supported by > `dvc remote`. -> Avoid using the same [DVC remote](/doc/command-reference/remote) (used for -> `dvc push`, `dvc pull`, etc.) for external outputs, because it may cause file -> hash overlaps: the hash of an external output could collide with a hash -> generated locally for another file with different content. 
+> 💡 Note that external outputs are never pushed or pulled from/to +> [remote storage](/doc/command-reference/remote), as they are already stored in +> an external location. + +> ⚠️ Avoid using the same DVC remote (used for `dvc push`, `dvc pull`, etc.) for +> external outputs, because it may cause file hash overlaps: the hash of an +> external output could collide with a hash generated locally for another file +> with different content. ## Examples From be86844396d2bca358eb63cbc49a5f7a8c0fc12b Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Mon, 16 Nov 2020 19:13:48 -0800 Subject: [PATCH 10/23] guide: improve nots around external types supported per https://github.com/iterative/dvc.org/pull/1945#pullrequestreview-531982134 --- content/docs/user-guide/external-dependencies.md | 3 --- content/docs/user-guide/managing-external-data.md | 8 ++------ 2 files changed, 2 insertions(+), 9 deletions(-) diff --git a/content/docs/user-guide/external-dependencies.md b/content/docs/user-guide/external-dependencies.md index 1b6e2ae883..12a26ea492 100644 --- a/content/docs/user-guide/external-dependencies.md +++ b/content/docs/user-guide/external-dependencies.md @@ -27,9 +27,6 @@ supported: - HTTP - Local files and directories outside the workspace -> Note that these are a subset of the remote storage types supported by -> `dvc remote`. - In order to specify an external dependency for your stage, use the usual `-d` option in `dvc run` with the external path or URL to your desired file or directory. diff --git a/content/docs/user-guide/managing-external-data.md b/content/docs/user-guide/managing-external-data.md index 1393cf4880..dd182efc9f 100644 --- a/content/docs/user-guide/managing-external-data.md +++ b/content/docs/user-guide/managing-external-data.md @@ -34,12 +34,8 @@ supported: - HDFS - Local files and directories outside the workspace -> Note that these are only a subset of the remote storage types supported by -> `dvc remote`. 
- -> 💡 Note that external outputs are never pushed or pulled from/to -> [remote storage](/doc/command-reference/remote), as they are already stored in -> an external location. +💡 Note that external outputs are never pushed or pulled from/to +[remote storage](/doc/command-reference/remote). > ⚠️ Avoid using the same DVC remote (used for `dvc push`, `dvc pull`, etc.) for > external outputs, because it may cause file hash overlaps: the hash of an From 595c9ed5111e778e81f8e244caf839a60b64fb7c Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Mon, 16 Nov 2020 19:15:44 -0800 Subject: [PATCH 11/23] gude: remove mention of --no-commit for ext data per https://github.com/iterative/dvc.org/pull/1945#discussion_r524858084 --- content/docs/user-guide/managing-external-data.md | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/content/docs/user-guide/managing-external-data.md b/content/docs/user-guide/managing-external-data.md index dd182efc9f..f84294a778 100644 --- a/content/docs/user-guide/managing-external-data.md +++ b/content/docs/user-guide/managing-external-data.md @@ -21,8 +21,7 @@ track changes in them, and reflect this in `dvc status` reports, for example. Normally, external outputs (e.g. `dvc add`, `dvc run -o`), require an [external cache](/doc/use-cases/shared-development-server#configure-the-external-shared-cache) -in the same external/remote file. The exception is when they are not -cached (`--no-commit` option). +in the same external/remote file. 
Currently, the following types (protocols) of external outputs (and cache) are supported: From 70d353d404754e754c159f9d2f9de70d8b7db328 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Tue, 17 Nov 2020 01:05:25 -0800 Subject: [PATCH 12/23] guide: simplify notes around external deps/outs docs --- content/docs/user-guide/external-dependencies.md | 8 ++++---- content/docs/user-guide/managing-external-data.md | 2 +- 2 files changed, 5 insertions(+), 5 deletions(-) diff --git a/content/docs/user-guide/external-dependencies.md b/content/docs/user-guide/external-dependencies.md index 12a26ea492..b90e4fcc3c 100644 --- a/content/docs/user-guide/external-dependencies.md +++ b/content/docs/user-guide/external-dependencies.md @@ -10,7 +10,7 @@ External dependencies and [external outputs](/doc/user-guide/managing-external-data) provide ways to track data outside of the project. -## How it works +## How they work You can specify external files or directories as dependencies for your pipeline stages. DVC will track changes in them and reflect this in the output of @@ -27,9 +27,9 @@ supported: - HTTP - Local files and directories outside the workspace -In order to specify an external dependency for your stage, use the -usual `-d` option in `dvc run` with the external path or URL to your desired -file or directory. +Note that these supported locations correspond to a subset of the +[remote storage](/doc/command-reference/remote) types supported by `dvc remote`, +but that is a different thing. ## Examples diff --git a/content/docs/user-guide/managing-external-data.md b/content/docs/user-guide/managing-external-data.md index f84294a778..1236a25a03 100644 --- a/content/docs/user-guide/managing-external-data.md +++ b/content/docs/user-guide/managing-external-data.md @@ -19,7 +19,7 @@ outputs for `dvc.yaml` files (only `outs` field, not metrics or plots). 
External outputs are considered part of the (extended) DVC project: DVC will track changes in them, and reflect this in `dvc status` reports, for example. -Normally, external outputs (e.g. `dvc add`, `dvc run -o`), require an +External outputs (e.g. `dvc add`, `dvc run -o`), require an [external cache](/doc/use-cases/shared-development-server#configure-the-external-shared-cache) in the same external/remote file. From 0543b5e33d9116be447ffa0f04bfdefa4a156e15 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Tue, 17 Nov 2020 01:36:34 -0800 Subject: [PATCH 13/23] guide: review base explanations and remove remote storage notes (for now) from x data docs --- .../docs/user-guide/external-dependencies.md | 10 ++--- .../docs/user-guide/managing-external-data.md | 43 ++++++++----------- 2 files changed, 23 insertions(+), 30 deletions(-) diff --git a/content/docs/user-guide/external-dependencies.md b/content/docs/user-guide/external-dependencies.md index b90e4fcc3c..73ef105b60 100644 --- a/content/docs/user-guide/external-dependencies.md +++ b/content/docs/user-guide/external-dependencies.md @@ -1,16 +1,16 @@ # External Dependencies There are cases when data is so large, or its processing is organized in a way -such that you would like to avoid moving it out of its external/remote location. -For example from a network attached storage (NAS), processing data on HDFS, -running [Dask](https://dask.org/) via SSH, or having a script that streams data -from S3 to process it. +such that it's preferable to avoid moving it from its original location, even if +it's external or remote to the project. For example: data on a network attached +storage (NAS), processing data on HDFS, running [Dask](https://dask.org/) via +SSH, or for a script that streams data from S3 to process it. External dependencies and [external outputs](/doc/user-guide/managing-external-data) provide ways to track data outside of the project. 
-## How they work +## How external dependencies work You can specify external files or directories as dependencies for your pipeline stages. DVC will track changes in them and reflect this in the output of diff --git a/content/docs/user-guide/managing-external-data.md b/content/docs/user-guide/managing-external-data.md index 1236a25a03..8d59304373 100644 --- a/content/docs/user-guide/managing-external-data.md +++ b/content/docs/user-guide/managing-external-data.md @@ -1,10 +1,10 @@ # Managing External Data There are cases when data is so large, or its processing is organized in a way -such that its preferable to avoid moving it from its external/remote location. -For example data on a network attached storage (NAS), processing data on HDFS, -running [Dask](https://dask.org/) via SSH, or having a script that streams data -from S3 to process it. +such that it's preferable to avoid moving it from its original location, even if +it's external or remote to the project. For example: data on a network attached +storage (NAS), processing data on HDFS, running [Dask](https://dask.org/) via +SSH, or for a script that streams data from S3 to process it. External outputs and [external dependencies](/doc/user-guide/external-dependencies) provide ways to @@ -13,18 +13,17 @@ track data outside of the project. ## How external outputs work DVC can track existing files or directories on an external location with -`dvc add` (`out` field). It can also create external files or directories as -outputs for `dvc.yaml` files (only `outs` field, not metrics or plots). +`dvc add`. It can also define external outputs for `dvc.yaml` stages to create. External outputs are considered part of the (extended) DVC project: DVC will -track changes in them, and reflect this in `dvc status` reports, for example. +detect when they change, reporting this in `dvc status` for example. -External outputs (e.g. 
`dvc add`, `dvc run -o`), require an +Note that they require an [external cache](/doc/use-cases/shared-development-server#configure-the-external-shared-cache) in the same external/remote file. -Currently, the following types (protocols) of external outputs (and cache) are -supported: +Currently, the following locations (protocols) are supported for external +outputs (and cache): - Amazon S3 - Microsoft Azure Blob Storage @@ -33,25 +32,19 @@ supported: - HDFS - Local files and directories outside the workspace -💡 Note that external outputs are never pushed or pulled from/to -[remote storage](/doc/command-reference/remote). - -> ⚠️ Avoid using the same DVC remote (used for `dvc push`, `dvc pull`, etc.) for -> external outputs, because it may cause file hash overlaps: the hash of an -> external output could collide with a hash generated locally for another file -> with different content. - ## Examples -Let's take a look at: +Let's take a look at the following operations on all the supported location +types: -1. Adding a `dvc remote` to use as cache for data in the external location, and +1. Adding a `dvc remote` in the same location as the desired outputs, and configure it as external cache with `dvc config`. -2. Tracking existing data on an external location with `dvc add` (this doesn't - download it). This produces a `.dvc` file with an external output. -3. Creating a simple [stage](/doc/command-reference/run) that moves a local file - to the external location. This produces a stage with another external output - in `dvc.yaml`. +2. Tracking existing data on the location using `dvc add` (`--external` option + needed). This produces a `.dvc` file with an external path in its `outs` + field. +3. Creating a simple [stage](/doc/command-reference/run) with `dvc run` + (`--external` option needed) that moves a local file to the external + location. This produces a stage with an external output, in `dvc.yaml`.
From ada0b1c372364401f7dbfa11746843715d26d5dc Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Tue, 17 Nov 2020 01:46:14 -0800 Subject: [PATCH 14/23] guide: improve example description in ext data docs --- content/docs/user-guide/external-dependencies.md | 16 ++++++++++------ .../docs/user-guide/managing-external-data.md | 4 ++-- 2 files changed, 12 insertions(+), 8 deletions(-) diff --git a/content/docs/user-guide/external-dependencies.md b/content/docs/user-guide/external-dependencies.md index 73ef105b60..912944b3cd 100644 --- a/content/docs/user-guide/external-dependencies.md +++ b/content/docs/user-guide/external-dependencies.md @@ -13,8 +13,8 @@ data outside of the project. ## How external dependencies work You can specify external files or directories as dependencies for your pipeline -stages. DVC will track changes in them and reflect this in the output of -`dvc status`. +[stages](/doc/command-reference/run). DVC will track changes in them and reflect +this in the output of `dvc status`. Currently, the following types (protocols) of external dependencies are supported: @@ -33,11 +33,15 @@ but that is a different thing. ## Examples -Let's take a look at a `download_file` [stage](/doc/command-reference/run) that -simply downloads a file from an external location. +To define an external dependency, add the external URL or path to +the `deps` field of `dvc.yaml`. For example, with the usual `-d` option in +`dvc run`, giving it the external URL/path to your desired file or directory. -> Note that some of these commands use the `/home/shared` directory, typical in -> Linux distributions. +Let's take a look at defining and running a `download_file` stage that simply +downloads a file from an external location, on all the supported location types. + +> Note that some of the example commands below use the `/home/shared` directory, +> typical in Linux distributions.
diff --git a/content/docs/user-guide/managing-external-data.md b/content/docs/user-guide/managing-external-data.md index 8d59304373..c8a7c1744e 100644 --- a/content/docs/user-guide/managing-external-data.md +++ b/content/docs/user-guide/managing-external-data.md @@ -40,8 +40,8 @@ types: 1. Adding a `dvc remote` in the same location as the desired outputs, and configure it as external cache with `dvc config`. 2. Tracking existing data on the location using `dvc add` (`--external` option - needed). This produces a `.dvc` file with an external path in its `outs` - field. + needed). This produces a `.dvc` file with an external URL or path in its + `outs` field. 3. Creating a simple [stage](/doc/command-reference/run) with `dvc run` (`--external` option needed) that moves a local file to the external location. This produces a stage with an external output, in `dvc.yaml`. From d6ad0676a998b6b61418e250dcf3db724b6341d8 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Tue, 17 Nov 2020 02:05:02 -0800 Subject: [PATCH 15/23] guide: mention remote storage properly in ext data docs per https://github.com/iterative/dvc.org/pull/1945#pullrequestreview-531992838 --- content/docs/user-guide/external-dependencies.md | 13 ++++++++----- .../docs/user-guide/managing-external-data.md | 16 ++++++++++------ 2 files changed, 18 insertions(+), 11 deletions(-) diff --git a/content/docs/user-guide/external-dependencies.md b/content/docs/user-guide/external-dependencies.md index 912944b3cd..a680f52402 100644 --- a/content/docs/user-guide/external-dependencies.md +++ b/content/docs/user-guide/external-dependencies.md @@ -16,7 +16,14 @@ You can specify external files or directories as dependencies for your pipeline [stages](/doc/command-reference/run). DVC will track changes in them and reflect this in the output of `dvc status`. 
-Currently, the following types (protocols) of external dependencies are + + +The remote URLs or external paths can be defined with the same format as the +`url` of certain `dvc remote` types. Currently, the following protocols are supported: - Amazon S3 @@ -27,10 +34,6 @@ supported: - HTTP - Local files and directories outside the workspace -Note that these supported locations correspond to a subset of the -[remote storage](/doc/command-reference/remote) types supported by `dvc remote`, -but that is a different thing. - ## Examples To define an external dependency, add the external URL or path to diff --git a/content/docs/user-guide/managing-external-data.md b/content/docs/user-guide/managing-external-data.md index c8a7c1744e..787b85d719 100644 --- a/content/docs/user-guide/managing-external-data.md +++ b/content/docs/user-guide/managing-external-data.md @@ -18,12 +18,9 @@ DVC can track existing files or directories on an external location with External outputs are considered part of the (extended) DVC project: DVC will detect when they change, reporting this in `dvc status` for example. -Note that they require an -[external cache](/doc/use-cases/shared-development-server#configure-the-external-shared-cache) -in the same external/remote file. - -Currently, the following locations (protocols) are supported for external -outputs (and cache): +The remote URLs or external paths can be defined with the same format as the +`url` of certain `dvc remote` types. Currently, the following protocols are +supported: - Amazon S3 - Microsoft Azure Blob Storage @@ -32,6 +29,13 @@ outputs (and cache): - HDFS - Local files and directories outside the workspace +> Note [remote storage](/doc/command-reference/remote) is a separate feature, +> and that external outputs are not pushed or pulled from/to DVC remotes. 
+ +Importantly, external outputs require an +[external cache](/doc/use-cases/shared-development-server#configure-the-external-shared-cache) +in the same external/remote file. + ## Examples Let's take a look at the following operations on all the supported location From 9a8dc1913cf3064030d62bf8924cd01c177e3061 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Wed, 18 Nov 2020 11:53:52 -0600 Subject: [PATCH 16/23] guide: forgot to update note about remotes in ext deps --- content/docs/user-guide/external-dependencies.md | 8 ++------ 1 file changed, 2 insertions(+), 6 deletions(-) diff --git a/content/docs/user-guide/external-dependencies.md b/content/docs/user-guide/external-dependencies.md index a680f52402..40db17287d 100644 --- a/content/docs/user-guide/external-dependencies.md +++ b/content/docs/user-guide/external-dependencies.md @@ -16,12 +16,6 @@ You can specify external files or directories as dependencies for your pipeline [stages](/doc/command-reference/run). DVC will track changes in them and reflect this in the output of `dvc status`. - - The remote URLs or external paths can be defined with the same format as the `url` of certain `dvc remote` types. Currently, the following protocols are supported: @@ -34,6 +28,8 @@ supported: - HTTP - Local files and directories outside the workspace +> Note [remote storage](/doc/command-reference/remote) is a separate feature. 
+ ## Examples To define an external dependency, add the external URL or path to From bc7b97cac3a96986bffb3a140ae7e261eee65082 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Thu, 19 Nov 2020 16:58:23 -0600 Subject: [PATCH 17/23] guide: address external data feedback per https://github.com/iterative/dvc.org/pull/1945#pullrequestreview-532908331 and https://github.com/iterative/dvc.org/pull/1945#pullrequestreview-532911315 --- content/docs/user-guide/managing-external-data.md | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-) diff --git a/content/docs/user-guide/managing-external-data.md b/content/docs/user-guide/managing-external-data.md index 787b85d719..a1e77ca95a 100644 --- a/content/docs/user-guide/managing-external-data.md +++ b/content/docs/user-guide/managing-external-data.md @@ -16,7 +16,9 @@ DVC can track existing files or directories on an external location with `dvc add`. It can also define external outputs for `dvc.yaml` stages to create. External outputs are considered part of the (extended) DVC project: DVC will -detect when they change, reporting this in `dvc status` for example. +track them for [versioning](/doc/use-cases/versioning-data-and-model-files), +thus detecting when they change, and reporting their state in `dvc status` for +example. The remote URLs or external paths can be defined with the same format as the `url` of certain `dvc remote` types. Currently, the following protocols are @@ -29,13 +31,13 @@ supported: - HDFS - Local files and directories outside the workspace -> Note [remote storage](/doc/command-reference/remote) is a separate feature, -> and that external outputs are not pushed or pulled from/to DVC remotes. - -Importantly, external outputs require an +External outputs require an [external cache](/doc/use-cases/shared-development-server#configure-the-external-shared-cache) in the same external/remote file. 
+> Note [remote storage](/doc/command-reference/remote) is a separate feature, +> and that external outputs are not pushed or pulled from/to DVC remotes. + ## Examples Let's take a look at the following operations on all the supported location From defd99d8d507f8fed1b6c9b84a40f0622f9ff863 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Fri, 20 Nov 2020 19:29:07 -0600 Subject: [PATCH 18/23] guide: reinstate note about data hash overlaps btw local and ext data per https://github.com/iterative/dvc.org/pull/1945#discussion_r525647321 --- content/docs/user-guide/managing-external-data.md | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/content/docs/user-guide/managing-external-data.md b/content/docs/user-guide/managing-external-data.md index a1e77ca95a..1a359e1930 100644 --- a/content/docs/user-guide/managing-external-data.md +++ b/content/docs/user-guide/managing-external-data.md @@ -35,8 +35,14 @@ External outputs require an [external cache](/doc/use-cases/shared-development-server#configure-the-external-shared-cache) in the same external/remote file. -> Note [remote storage](/doc/command-reference/remote) is a separate feature, -> and that external outputs are not pushed or pulled from/to DVC remotes. +> Note that [remote storage](/doc/command-reference/remote) is a separate +> feature, and that external outputs are not pushed or pulled from/to DVC +> remotes. + +> ⚠️ Avoid using the same DVC remote used for `dvc push`, `dvc pull`, etc. for +> external outputs, because it may cause data collisions: the hash of an +> external output could collide with that of a local file with different +> content. 
## Examples From 8a6892dddac3ba77b021a02a641c1183e02f1f70 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Fri, 27 Nov 2020 12:04:48 -0600 Subject: [PATCH 19/23] guide: simplify ext deps intro per https://github.com/iterative/dvc.org/pull/1945#pullrequestreview-535877407 --- content/docs/user-guide/external-dependencies.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/content/docs/user-guide/external-dependencies.md b/content/docs/user-guide/external-dependencies.md index 40db17287d..3b15f5c185 100644 --- a/content/docs/user-guide/external-dependencies.md +++ b/content/docs/user-guide/external-dependencies.md @@ -1,10 +1,10 @@ # External Dependencies -There are cases when data is so large, or its processing is organized in a way -such that its preferable to avoid moving it from its original location, even if -it's external or remote to the project. For example: data on a network attached -storage (NAS), processing data on HDFS, running [Dask](https://dask.org/) via -SSH, or for a script that streams data from S3 to process it. +There are cases when data is so large, or its processing is organized in such a +way, that its preferable to avoid moving it from its original location. For +example data on a network attached storage (NAS), processing data on HDFS, +running [Dask](https://dask.org/) via SSH, or for a script that streams data +from S3 to process it. 
External dependencies and [external outputs](/doc/user-guide/managing-external-data) provide ways to track From dd1501bbc9d7eaba58c4a9d0c682a2ef447b1397 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Fri, 27 Nov 2020 13:07:17 -0600 Subject: [PATCH 20/23] guide: rewrite ext deps/outs explanations per https://github.com/iterative/dvc.org/pull/1945#pullrequestreview-535877588 --- .../docs/user-guide/external-dependencies.md | 20 ++++++---- .../docs/user-guide/managing-external-data.md | 38 +++++++++---------- 2 files changed, 31 insertions(+), 27 deletions(-) diff --git a/content/docs/user-guide/external-dependencies.md b/content/docs/user-guide/external-dependencies.md index 3b15f5c185..18c2f01c38 100644 --- a/content/docs/user-guide/external-dependencies.md +++ b/content/docs/user-guide/external-dependencies.md @@ -6,19 +6,22 @@ example data on a network attached storage (NAS), processing data on HDFS, running [Dask](https://dask.org/) via SSH, or for a script that streams data from S3 to process it. -External dependencies and +External dependencies and [external outputs](/doc/user-guide/managing-external-data) provide ways to track data outside of the project. ## How external dependencies work -You can specify external files or directories as dependencies for your pipeline -[stages](/doc/command-reference/run). DVC will track changes in them and reflect -this in the output of `dvc status`. +External dependencies are considered part of the (extended) DVC +project: DVC will track them, detecting when they change (triggering stage +executions on `dvc repro`, for example). -The remote URLs or external paths can be defined with the same format as the -`url` of certain `dvc remote` types. Currently, the following protocols are -supported: +DVC can track files or directories on an external location as +[stage](/doc/command-reference/run) dependencies. 
Their remote URLs or external +paths are defined in `dvc.yaml` with the same format as the `url` of certain +`dvc remote` types. + +Currently, the following protocols are supported: - Amazon S3 - Microsoft Azure Blob Storage @@ -28,7 +31,8 @@ supported: - HTTP - Local files and directories outside the workspace -> Note [remote storage](/doc/command-reference/remote) is a separate feature. +> Note that [remote storage](/doc/command-reference/remote) is a different +> feature. ## Examples diff --git a/content/docs/user-guide/managing-external-data.md b/content/docs/user-guide/managing-external-data.md index 649052342d..e1b91ba15e 100644 --- a/content/docs/user-guide/managing-external-data.md +++ b/content/docs/user-guide/managing-external-data.md @@ -1,28 +1,28 @@ # Managing External Data -There are cases when data is so large, or its processing is organized in a way -such that its preferable to avoid moving it from its original location, even if -it's external or remote to the project. For example: data on a network attached -storage (NAS), processing data on HDFS, running [Dask](https://dask.org/) via -SSH, or for a script that streams data from S3 to process it. +There are cases when data is so large, or its processing is organized in such a +way, that its preferable to avoid moving it from its original location. For +example data on a network attached storage (NAS), processing data on HDFS, +running [Dask](https://dask.org/) via SSH, or for a script that streams data +from S3 to process it. -External outputs and +External outputs and [external dependencies](/doc/user-guide/external-dependencies) provide ways to track data outside of the project. ## How external outputs work -DVC can track existing files or directories on an external location with -`dvc add`. It can also define external outputs for `dvc.yaml` stages to create. 
+External outputs are considered part of the (extended) DVC project: +DVC will track them for +[versioning](/doc/use-cases/versioning-data-and-model-files), detecting when +they change (reported by `dvc status`, for example). -External outputs are considered part of the (extended) DVC project: DVC will -track them for [versioning](/doc/use-cases/versioning-data-and-model-files), -thus detecting when they change, and reporting their state in `dvc status` for -example. +DVC can track existing files or directories on an external location with +`dvc add`. It's also possible to use them as [stage](/doc/command-reference/run) +outputs. Their remote URLs or external paths can be defined in `dvc.yaml` with +the same format as the `url` of certain `dvc remote` types. -The remote URLs or external paths can be defined with the same format as the -`url` of certain `dvc remote` types. Currently, the following protocols are -supported: +Currently, the following protocols are supported: - Amazon S3 - Google Cloud Storage @@ -34,7 +34,7 @@ External outputs require an [external cache](/doc/use-cases/shared-development-server#configure-the-external-shared-cache) in the same external/remote file. -> Note that [remote storage](/doc/command-reference/remote) is a separate +> Note that [remote storage](/doc/command-reference/remote) is a different > feature, and that external outputs are not pushed or pulled from/to DVC > remotes. @@ -53,9 +53,9 @@ types: 2. Tracking existing data on the location using `dvc add` (`--external` option needed). This produces a `.dvc` file with an external URL or path in its `outs` field. -3. Creating a simple [stage](/doc/command-reference/run) with `dvc run` - (`--external` option needed) that moves a local file to the external - location. This produces a stage with an external output, in `dvc.yaml`. +3. Creating a simple stage with `dvc run` (`--external` option needed) that + moves a local file to the external location. 
This produces an external output + in `dvc.yaml`.
From de67413d7d5515fe0ad255323d7932e2cca1f4d9 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Fri, 27 Nov 2020 13:15:43 -0600 Subject: [PATCH 21/23] cases: simplify Example intro for ext data docs per https://github.com/iterative/dvc.org/pull/1945#pullrequestreview-535877766 --- content/docs/user-guide/external-dependencies.md | 8 ++------ content/docs/user-guide/managing-external-data.md | 4 ++-- 2 files changed, 4 insertions(+), 8 deletions(-) diff --git a/content/docs/user-guide/external-dependencies.md b/content/docs/user-guide/external-dependencies.md index 18c2f01c38..2f3dc3295a 100644 --- a/content/docs/user-guide/external-dependencies.md +++ b/content/docs/user-guide/external-dependencies.md @@ -18,8 +18,8 @@ executions on `dvc repro`, for example). DVC can track files or directories on an external location as [stage](/doc/command-reference/run) dependencies. Their remote URLs or external -paths are defined in `dvc.yaml` with the same format as the `url` of certain -`dvc remote` types. +paths are defined in `dvc.yaml` (`deps` field) with the same format as the `url` +of certain `dvc remote` types. Currently, the following protocols are supported: @@ -36,10 +36,6 @@ Currently, the following protocols are supported: ## Examples -To define an external dependency, add the external URL or path to -the `deps` field of `dvc.yaml`. For example, with the usual `-d` option in -`dvc run`, giving it the external URL/path to your desired file or directory. - Let's take a look at defining and running a `download_file` stage that simply downloads a file from an external location, on all the supported location types. diff --git a/content/docs/user-guide/managing-external-data.md b/content/docs/user-guide/managing-external-data.md index e1b91ba15e..817faa85a7 100644 --- a/content/docs/user-guide/managing-external-data.md +++ b/content/docs/user-guide/managing-external-data.md @@ -19,8 +19,8 @@ they change (reported by `dvc status`, for example). 
DVC can track existing files or directories on an external location with `dvc add`. It's also possible to use them as [stage](/doc/command-reference/run) -outputs. Their remote URLs or external paths can be defined in `dvc.yaml` with -the same format as the `url` of certain `dvc remote` types. +outputs. Their remote URLs or external paths can be defined in `dvc.yaml` +(`outs` field) with the same format as the `url` of certain `dvc remote` types. Currently, the following protocols are supported: From 6fc053868e52cf5788751e8a995b2fa02ea33cf8 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Fri, 27 Nov 2020 13:23:08 -0600 Subject: [PATCH 22/23] guide: more copy edits on external data docs --- content/docs/user-guide/external-dependencies.md | 3 --- content/docs/user-guide/managing-external-data.md | 6 +++--- 2 files changed, 3 insertions(+), 6 deletions(-) diff --git a/content/docs/user-guide/external-dependencies.md b/content/docs/user-guide/external-dependencies.md index 2f3dc3295a..b34a2add80 100644 --- a/content/docs/user-guide/external-dependencies.md +++ b/content/docs/user-guide/external-dependencies.md @@ -39,9 +39,6 @@ Currently, the following protocols are supported: Let's take a look at defining and running a `download_file` stage that simply downloads a file from an external location, on all the supported location types. -> Note that some of the example commands below use the `/home/shared` directory, -> typical in Linux distributions. -
### Click for Amazon S3 diff --git a/content/docs/user-guide/managing-external-data.md b/content/docs/user-guide/managing-external-data.md index 817faa85a7..c2b9254379 100644 --- a/content/docs/user-guide/managing-external-data.md +++ b/content/docs/user-guide/managing-external-data.md @@ -50,9 +50,9 @@ types: 1. Adding a `dvc remote` in the same location as the desired outputs, and configure it as external cache with `dvc config`. -2. Tracking existing data on the location using `dvc add` (`--external` option - needed). This produces a `.dvc` file with an external URL or path in its - `outs` field. +2. Tracking existing data on the external location using `dvc add` (`--external` + option needed). This produces a `.dvc` file with an external URL or path in + its `outs` field. 3. Creating a simple stage with `dvc run` (`--external` option needed) that moves a local file to the external location. This produces an external output in `dvc.yaml`. From cfe6e07a57bf2a9f72c4174d7832613b5a213a0e Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Fri, 27 Nov 2020 14:11:39 -0600 Subject: [PATCH 23/23] guide: clarify how ext deps/outs work per https://github.com/iterative/dvc.org/pull/1945#pullrequestreview-540175279 --- content/docs/user-guide/external-dependencies.md | 10 ++++------ content/docs/user-guide/managing-external-data.md | 11 +++++------ 2 files changed, 9 insertions(+), 12 deletions(-) diff --git a/content/docs/user-guide/external-dependencies.md b/content/docs/user-guide/external-dependencies.md index b34a2add80..79806d60e8 100644 --- a/content/docs/user-guide/external-dependencies.md +++ b/content/docs/user-guide/external-dependencies.md @@ -16,12 +16,10 @@ External dependencies are considered part of the (extended) DVC project: DVC will track them, detecting when they change (triggering stage executions on `dvc repro`, for example). -DVC can track files or directories on an external location as -[stage](/doc/command-reference/run) dependencies. 
Their remote URLs or external -paths are defined in `dvc.yaml` (`deps` field) with the same format as the `url` -of certain `dvc remote` types. - -Currently, the following protocols are supported: +To define files or directories in an external location as +[stage](/doc/command-reference/run) dependencies, put their remote URLs or +external paths in `dvc.yaml` (`deps` field). Use the same format as the `url` of +certain `dvc remote` types. Currently, the following protocols are supported: - Amazon S3 - Microsoft Azure Blob Storage diff --git a/content/docs/user-guide/managing-external-data.md b/content/docs/user-guide/managing-external-data.md index c2b9254379..d14322c1a3 100644 --- a/content/docs/user-guide/managing-external-data.md +++ b/content/docs/user-guide/managing-external-data.md @@ -17,12 +17,11 @@ DVC will track them for [versioning](/doc/use-cases/versioning-data-and-model-files), detecting when they change (reported by `dvc status`, for example). -DVC can track existing files or directories on an external location with -`dvc add`. It's also possible to use them as [stage](/doc/command-reference/run) -outputs. Their remote URLs or external paths can be defined in `dvc.yaml` -(`outs` field) with the same format as the `url` of certain `dvc remote` types. - -Currently, the following protocols are supported: +To use existing files or directories in an external location as +[stage](/doc/command-reference/run) outputs, give their remote URLs or external +paths to `dvc add`, or put them in `dvc.yaml` (`outs` field). Use the same +format as the `url` of certain `dvc remote` types. Currently, the following +protocols are supported: - Amazon S3 - Google Cloud Storage