
Commit

docs: clarifications around external outputs info. (#2154)
* guide: disclaim x data (impro #2104)

* guide: revert Exp Outs guide rename
per #2154 (review)
jorgeorpinel authored Mar 14, 2021
1 parent 84093cd commit 2bce7d5
Showing 5 changed files with 45 additions and 45 deletions.
6 changes: 4 additions & 2 deletions content/docs/command-reference/add.md
@@ -148,8 +148,10 @@ not.
- `--external` - allow `targets` that are outside of the DVC repository. See
[Managing External Data](/doc/user-guide/managing-external-data).

> Note that external outputs typically require an external cache setup. See
> link above for more details.
> ⚠️ Note that this is an advanced feature for very specific situations and
> not recommended except if there's absolutely no other alternative.
> Additionally, this typically requires an external cache setup (see link
> above).
- `-o <path>`, `--out <path>` - destination `path` to make a local target copy,
or to [transfer](#example-transfer-to-cache) an external target into the cache
6 changes: 3 additions & 3 deletions content/docs/command-reference/config.md
@@ -208,10 +208,10 @@ settings, and configuring a remote is the way that can be done.
- `cache.webhdfs` - name of an HDFS remote with WebHDFS enabled to use as
external cache.

> Avoid using the same [DVC remote](/doc/command-reference/remote) (used for
> `dvc push`, `dvc pull`, etc.) as external cache, because it may cause file
> ⚠️ Avoid using the same [remote storage](/doc/command-reference/remote) used
> for `dvc push` and `dvc pull` as external cache, because it may cause file
> hash overlaps: the hash of an external <abbr>output</abbr> could collide with
> a hash generated locally for another file with different content.
> that of a local file with different content.
### state

14 changes: 7 additions & 7 deletions content/docs/user-guide/external-dependencies.md
@@ -1,8 +1,8 @@
# External Dependencies

There are cases when data is so large, or its processing is organized in such a
way, that it's preferable to avoid moving it from its original location. For
example data on a network attached storage (NAS), processing data on HDFS,
way, that it's preferable to avoid moving it from its current external location.
For example data on a network attached storage (NAS), processing data on HDFS,
running [Dask](https://dask.org/) via SSH, or for a script that streams data
from S3 to process it.

@@ -12,14 +12,14 @@ and version data outside of the <abbr>project</abbr>.

## How external dependencies work

External <abbr>dependencies</abbr> are considered part of the (extended) DVC
project: DVC will track them, detecting when they change (triggering stage
executions on `dvc repro`, for example).
External <abbr>dependencies</abbr> will be tracked by DVC, detecting when they
change (triggering stage executions on `dvc repro`, for example).

To define files or directories in an external location as
[stage](/doc/command-reference/run) dependencies, put their remote URLs or
[stage](/doc/command-reference/run) dependencies, specify their remote URLs or
external paths in `dvc.yaml` (`deps` field). Use the same format as the `url` of
certain `dvc remote` types. Currently, the following protocols are supported:
certain `dvc remote` types. Currently, the following supported `dvc remote`
types/protocols:

- Amazon S3
- Microsoft Azure Blob Storage
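
A minimal sketch of what such a `deps` entry might look like in `dvc.yaml`, assuming an S3 location. The stage name `download_file`, the bucket `mybucket`, and the file names are hypothetical and do not come from the changed files.

```yaml
stages:
  download_file:                      # hypothetical stage name
    cmd: aws s3 cp s3://mybucket/data.txt data.txt
    deps:
      - s3://mybucket/data.txt        # external dependency, same format as an S3 remote `url`
    outs:
      - data.txt                      # regular output inside the workspace
```

With a stage along these lines, `dvc repro` would be expected to rerun the command once DVC detects that the external file has changed.
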
53 changes: 26 additions & 27 deletions content/docs/user-guide/managing-external-data.md
@@ -1,52 +1,51 @@
# Managing External Data
# External Outputs

> ⚠️ This is an advanced feature that we don't recommend using unless you really
> know what you are doing. Artifacts added with --external are not affected by
> `dvc push/pull/status -c`. You are likely looking for straight
> ⚠️ This is an advanced feature for very specific situations and not
> recommended except if there's absolutely no other alternative. In most cases
> alternatives like the
> [to-cache](/doc/command-reference/add#example-transfer-to-the-cache) or
> [to-remote](/doc/command-reference/add#example-transfer-to-remote-storage)
> transfers, or `dvc import-url`).
> strategies of `dvc add` and `dvc import-url` are more convenient. **Note**
> that external outputs are not pushed or pulled from/to
> [remote storage](/doc/command-reference/remote).
There are cases when data is so large, or its processing is organized in such a
way, that it's preferable to avoid moving it from its original location. For
example data on a network attached storage (NAS), processing data on HDFS,
running [Dask](https://dask.org/) via SSH, or for a script that streams data
from S3 to process it.
way, that it's impossible to handle it in the local machine disk. For example
versioning existing data on a network attached storage (NAS), processing data on
HDFS, running [Dask](https://dask.org/) via SSH, or any code that generates
massive files directly to the cloud.

External outputs and
[external dependencies](/doc/user-guide/external-dependencies) provide ways to
External outputs (and
[external dependencies](/doc/user-guide/external-dependencies)) provide ways to
track and version data outside of the <abbr>project</abbr>.

## How external outputs work

External <abbr>outputs</abbr> are considered part of the (extended) DVC project:
DVC will track them for
External <abbr>outputs</abbr> are considered part of the (extended)
<abbr>workspace</abbr>: DVC will track them for
[versioning](/doc/use-cases/versioning-data-and-model-files), detecting when
they change (reported by `dvc status`, for example).

To use existing files or directories in an external location as
[stage](/doc/command-reference/run) outputs, give their remote URLs or external
paths to `dvc add`, or put them in `dvc.yaml` (`deps` field). Use the same
format as the `url` of certain `dvc remote` types. Currently, the following
protocols are supported:
To use existing files or directories in an external location as outputs, give
their remote URLs or external paths to `dvc add`, or put them in `dvc.yaml`
(`deps` field). Use the same format as the `url` of the following supported
`dvc remote` types/protocols:

- Amazon S3
- SSH
- HDFS
- Local files and directories outside the <abbr>workspace</abbr>
- Local files and directories outside the workspace

External outputs require an
⚠️ External outputs require an
[external cache](/doc/use-cases/shared-development-server#configure-the-external-shared-cache)
in the same external/remote file.

> Note that [remote storage](/doc/command-reference/remote) is a different
> feature, and that external outputs are not pushed or pulled from/to DVC
> remotes.
> Avoid using the same DVC remote used for `dvc push`, `dvc pull`, etc. as
> external cache, because it may cause data collisions: the hash of an external
> output could collide with that of a local file with different content.
> ⚠️ Avoid using the same DVC remote used for `dvc push`, `dvc pull`, etc. for
> external outputs, because it may cause data collisions: the hash of an
> external output could collide with that of a local file with different
> content.
> Note that [remote storage](/doc/command-reference/remote) is a different
> feature.
## Examples

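
A minimal sketch of a stage with an external output, assuming a hypothetical bucket `mybucket`, a hypothetical stage name `upload`, and an external cache already configured for S3. Note that in `dvc.yaml` stage outputs are listed under `outs`, while `deps` holds dependencies.

```yaml
stages:
  upload:                             # hypothetical stage name
    cmd: aws s3 cp data.txt s3://mybucket/data.txt
    deps:
      - data.txt                      # local file inside the workspace
    outs:
      - s3://mybucket/data.txt        # external output, same format as an S3 remote `url`
```

As the warning in the guide states, an output like this is not pushed or pulled from/to remote storage.
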
11 changes: 5 additions & 6 deletions content/docs/user-guide/project-structure/dvc-files.md
@@ -1,13 +1,12 @@
# `.dvc` Files

You can use `dvc add` to track data files or directories located in your current
<abbr>workspace</abbr>, or in supported
[external locations](/doc/user-guide/managing-external-data). Additionally,
`dvc import` and `dvc import-url` let you bring data from external locations to
your project, and start tracking it locally.
<abbr>workspace</abbr>\*. Additionally, `dvc import` and `dvc import-url` let
you bring data from external locations to your project, and start tracking it
locally. See [Data Versioning](/doc/start/data-versioning) for more info.

> See [Data Versioning](/doc/start/data-versioning) and
> [Data Access](/doc/start/data-access) for more info.
> \* Certain [external locations](/doc/user-guide/managing-external-data) are
> also supported.
Files ending with the `.dvc` extension ("dot DVC file") are created by these
commands as data placeholders that can be versioned with Git. They contain the
