
Commit

guide: rewrite ext deps/outs explanations
jorgeorpinel committed Nov 27, 2020
1 parent 8a6892d commit dd1501b
Showing 2 changed files with 31 additions and 27 deletions.
20 changes: 12 additions & 8 deletions content/docs/user-guide/external-dependencies.md
@@ -6,19 +6,22 @@ example data on a network attached storage (NAS), processing data on HDFS,
 running [Dask](https://dask.org/) via SSH, or for a script that streams data
 from S3 to process it.

-External <abbr>dependencies</abbr> and
+External dependencies and
 [external outputs](/doc/user-guide/managing-external-data) provide ways to track
 data outside of the <abbr>project</abbr>.

 ## How external dependencies work

-You can specify external files or directories as dependencies for your pipeline
-[stages](/doc/command-reference/run). DVC will track changes in them and reflect
-this in the output of `dvc status`.
+External <abbr>dependencies</abbr> are considered part of the (extended) DVC
+project: DVC will track them, detecting when they change (triggering stage
+executions on `dvc repro`, for example).

-The remote URLs or external paths can be defined with the same format as the
-`url` of certain `dvc remote` types. Currently, the following protocols are
-supported:
+DVC can track files or directories on an external location as
+[stage](/doc/command-reference/run) dependencies. Their remote URLs or external
+paths are defined in `dvc.yaml` with the same format as the `url` of certain
+`dvc remote` types.
+
+Currently, the following protocols are supported:

 - Amazon S3
 - Microsoft Azure Blob Storage
@@ -28,7 +31,8 @@ supported:
 - HTTP
 - Local files and directories outside the <abbr>workspace</abbr>

-> Note [remote storage](/doc/command-reference/remote) is a separate feature.
+> Note that [remote storage](/doc/command-reference/remote) is a different
+> feature.
 ## Examples

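To illustrate the reworded explanation in `external-dependencies.md`, a stage with an external dependency might be written in `dvc.yaml` roughly as follows. This is a minimal sketch and not part of the commit: the stage name, bucket, and file names are hypothetical, and the `aws s3 cp` command assumes the AWS CLI is installed and configured.

```yaml
# dvc.yaml (hypothetical example, not taken from this commit)
stages:
  download_file:
    # Copies the external file into the workspace.
    cmd: aws s3 cp s3://mybucket/data.txt data.txt
    deps:
      # External dependency: an S3 URL written in the same format
      # as the `url` of an S3-type `dvc remote`.
      - s3://mybucket/data.txt
    outs:
      - data.txt
```

With a stage like this, `dvc repro` would be expected to re-run the command once DVC detects that the S3 object has changed.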
38 changes: 19 additions & 19 deletions content/docs/user-guide/managing-external-data.md
@@ -1,28 +1,28 @@
 # Managing External Data

-There are cases when data is so large, or its processing is organized in a way
-such that its preferable to avoid moving it from its original location, even if
-it's external or remote to the project. For example: data on a network attached
-storage (NAS), processing data on HDFS, running [Dask](https://dask.org/) via
-SSH, or for a script that streams data from S3 to process it.
+There are cases when data is so large, or its processing is organized in such a
+way, that its preferable to avoid moving it from its original location. For
+example data on a network attached storage (NAS), processing data on HDFS,
+running [Dask](https://dask.org/) via SSH, or for a script that streams data
+from S3 to process it.

-External <abbr>outputs</abbr> and
+External outputs and
 [external dependencies](/doc/user-guide/external-dependencies) provide ways to
 track data outside of the <abbr>project</abbr>.

 ## How external outputs work

-DVC can track existing files or directories on an external location with
-`dvc add`. It can also define external outputs for `dvc.yaml` stages to create.
+External <abbr>outputs</abbr> are considered part of the (extended) DVC project:
+DVC will track them for
+[versioning](/doc/use-cases/versioning-data-and-model-files), detecting when
+they change (reported by `dvc status`, for example).

-External outputs are considered part of the (extended) DVC project: DVC will
-track them for [versioning](/doc/use-cases/versioning-data-and-model-files),
-thus detecting when they change, and reporting their state in `dvc status` for
-example.
+DVC can track existing files or directories on an external location with
+`dvc add`. It's also possible to use them as [stage](/doc/command-reference/run)
+outputs. Their remote URLs or external paths can be defined in `dvc.yaml` with
+the same format as the `url` of certain `dvc remote` types.

-The remote URLs or external paths can be defined with the same format as the
-`url` of certain `dvc remote` types. Currently, the following protocols are
-supported:
+Currently, the following protocols are supported:

 - Amazon S3
 - Google Cloud Storage
@@ -34,7 +34,7 @@ External outputs require an
 [external cache](/doc/use-cases/shared-development-server#configure-the-external-shared-cache)
 in the same external/remote file.

-> Note that [remote storage](/doc/command-reference/remote) is a separate
+> Note that [remote storage](/doc/command-reference/remote) is a different
 > feature, and that external outputs are not pushed or pulled from/to DVC
 > remotes.
@@ -53,9 +53,9 @@ types:
 2. Tracking existing data on the location using `dvc add` (`--external` option
 needed). This produces a `.dvc` file with an external URL or path in its
 `outs` field.
-3. Creating a simple [stage](/doc/command-reference/run) with `dvc run`
-(`--external` option needed) that moves a local file to the external
-location. This produces a stage with an external output, in `dvc.yaml`.
+3. Creating a simple stage with `dvc run` (`--external` option needed) that
+moves a local file to the external location. This produces an external output
+in `dvc.yaml`.

<details>

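For comparison, the stage described in step 3 of the `managing-external-data.md` hunk (one that moves a local file to the external location) could end up in `dvc.yaml` along these lines. This is a hedged sketch and not part of the commit: the stage name, bucket, and file names are hypothetical, and it assumes an external cache has already been configured for the same location, as the guide requires.

```yaml
# dvc.yaml (hypothetical example, not taken from this commit)
stages:
  upload_file:
    # Moves a local file to the external location.
    cmd: aws s3 cp data.txt s3://mybucket/data.txt
    deps:
      - data.txt
    outs:
      # External output: tracked by DVC, but not pushed to or pulled
      # from DVC remotes.
      - s3://mybucket/data.txt
```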
