Misc. updates #1945

Merged
merged 26 commits into from
Nov 27, 2020
Changes from 25 commits
Commits
2494b95
cmd: link add --glob to the glob py mod
jorgeorpinel Nov 14, 2020
8843e94
cmd: copy edit symlink targets section of add
jorgeorpinel Nov 14, 2020
fd47b43
cmd: copy edit import(-url) --no-exec text
jorgeorpinel Nov 14, 2020
06d24fd
cmd: copy edits to plots and metrics
jorgeorpinel Nov 14, 2020
6dcba8a
cmd: move add --glob option up 1 place
jorgeorpinel Nov 16, 2020
5f88b96
Merge branch 'master' into jorge
jorgeorpinel Nov 16, 2020
21e3725
cmd: correct note about add --external
jorgeorpinel Nov 16, 2020
89dd826
cmd: clarify --jobs for status and gc
jorgeorpinel Nov 16, 2020
2ab35cc
cmd: fix path to .dvc/tmp/state file in --no-commit desc
jorgeorpinel Nov 17, 2020
7bad093
cmd: move notes about --external outputs to x data guide
jorgeorpinel Nov 17, 2020
be86844
guide: improve nots around external types supported
jorgeorpinel Nov 17, 2020
595c9ed
gude: remove mention of --no-commit for ext data
jorgeorpinel Nov 17, 2020
70d353d
guide: simplify notes around external deps/outs docs
jorgeorpinel Nov 17, 2020
0543b5e
guide: review base explanations and remove remote storage notes (for …
jorgeorpinel Nov 17, 2020
ada0b1c
guide: improve example description in ext data docs
jorgeorpinel Nov 17, 2020
d6ad067
guide: mention remote storage properly in ext data docs
jorgeorpinel Nov 17, 2020
9a8dc19
guide: forgot to update note about remotes in ext deps
jorgeorpinel Nov 18, 2020
8d03411
Merge branch 'master' into jorge
jorgeorpinel Nov 19, 2020
bc7b97c
guide: address external data feedback
jorgeorpinel Nov 19, 2020
defd99d
guide: reinstate note about data hash overlaps btw local and ext data
jorgeorpinel Nov 21, 2020
4fe893c
Merge branch 'master' into jorge
jorgeorpinel Nov 27, 2020
8a6892d
guide: simplify ext deps intro
jorgeorpinel Nov 27, 2020
dd1501b
guide: rewrite ext deps/outs explanations
jorgeorpinel Nov 27, 2020
de67413
cases: simplify Example intro for ext data docs
jorgeorpinel Nov 27, 2020
6fc0538
guide: more copy edits on exteral data docs
jorgeorpinel Nov 27, 2020
cfe6e07
guide: clarify how ext deps/outs work
jorgeorpinel Nov 27, 2020
35 changes: 18 additions & 17 deletions content/docs/command-reference/add.md
@@ -98,13 +98,13 @@ undesirable for data directories with a large number of files.
To avoid adding files inside a directory accidentally, you can add the
corresponding [patterns](/doc/user-guide/dvcignore) to `.dvcignore`.

### Adding symlinked targets {#add-symlink}
### Adding symlink targets {#add-symlink}

DVC only supports symlinked files as valid targets for `dvc add`. If the target
path is a directory symlink, or if the target path contains any intermediate
directory symlinks, `dvc add` will fail.
`dvc add` supports symlinked files as `targets`. But if a target path is a
directory symlink, or if it contains any intermediate directory symlinks, it
cannot be added to DVC.

So given the following project structure:
For example, given the following project structure:

```
.
@@ -117,10 +117,9 @@ So given the following project structure:
└── link_to_file -> dir/file
```

`dir`, `dir/file`, `link_to_external_file` and `link_to_file` are all valid
targets for `dvc add`. `link_to_dir`, `link_to_external_dir` and
`link_to_dir/file` are invalid targets, since the target path would contain
directory symlinks.
`link_to_file` and `link_to_external_file` are both valid symlink targets for
`dvc add`. But `link_to_dir`, `link_to_external_dir`, and `link_to_dir/file` are
not.
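
The symlink rule above can be illustrated with a minimal Python sketch. This is not DVC's implementation: the helper name `is_valid_add_target`, the temporary external directory, and the names `ext_dir` and `ext_file` are assumptions made for illustration (the targets of the external links are not shown in the tree above). It also assumes a POSIX system where creating symlinks is allowed.

```python
import os
import tempfile


def is_valid_add_target(root, rel_path):
    """Apply the symlink rule described above to a path relative to `root`.

    Hypothetical helper for illustration only, not DVC's own check.
    """
    parts = rel_path.split("/")
    current = root
    # No intermediate directory in the path may be a symlink.
    for part in parts[:-1]:
        current = os.path.join(current, part)
        if os.path.islink(current):
            return False
    # The target itself may be a symlink only if it points to a file.
    full = os.path.join(root, rel_path)
    return not (os.path.islink(full) and os.path.isdir(full))


results = {}
with tempfile.TemporaryDirectory() as project, tempfile.TemporaryDirectory() as external:
    # Regular directory and file inside the project.
    os.makedirs(os.path.join(project, "dir"))
    with open(os.path.join(project, "dir", "file"), "w") as f:
        f.write("data")
    # Assumed external location (outside the project).
    os.makedirs(os.path.join(external, "ext_dir"))
    with open(os.path.join(external, "ext_file"), "w") as f:
        f.write("external data")
    # Symlinks named as in the example above.
    os.symlink("dir", os.path.join(project, "link_to_dir"))
    os.symlink(os.path.join("dir", "file"), os.path.join(project, "link_to_file"))
    os.symlink(os.path.join(external, "ext_dir"),
               os.path.join(project, "link_to_external_dir"))
    os.symlink(os.path.join(external, "ext_file"),
               os.path.join(project, "link_to_external_file"))

    for target in ["dir", "dir/file", "link_to_file", "link_to_external_file",
                   "link_to_dir", "link_to_external_dir", "link_to_dir/file"]:
        results[target] = is_valid_add_target(project, target)

for target, ok in results.items():
    print(f"{target}: {'valid' if ok else 'invalid'}")
```

The check walks only the path components below the project root, so a symlink higher up in the file system (for example a symlinked temporary directory) does not affect the result.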

## Options

@@ -129,24 +128,26 @@ directory symlinks.
among the `targets`, this option is ignored. For each file found, a new `.dvc`
file is created using the process described in this command's description.

- `--no-commit` - do not save outputs to cache. A `.dvc` file is created and an
entry is added to `.dvc/state`, while nothing is added to the cache.
(`dvc status` will report that the file is `not in cache`.) Use `dvc commit`
when ready to commit outputs with DVC. This is analogous to using `git add`
before `git commit`.
- `--no-commit` - do not save outputs to cache. A `.dvc` file is created, while
nothing is added to the cache. (`dvc status` will report that the file is
`not in cache`.) Use `dvc commit` when ready to commit outputs with DVC. This
is analogous to using `git add` before `git commit`.

- `--file <filename>` - specify name of the `.dvc` file it generates. This
option works only if there is a single target. By default the name of the
generated `.dvc` file is `<target>.dvc`, where `<target>` is the file name of
the given target. This option allows to set the name and the path of the
generated `.dvc` file.

- `--glob` - allows adding files and directories that match the
[pattern](https://docs.python.org/3/library/glob.html) specified in `targets`.
Shell-style wildcards are supported: `*`, `?`, `[seq]`, `[!seq]`, and `**`.

- `--external` - allow `targets` that are outside of the DVC repository. See
[Managing External Data](/doc/user-guide/managing-external-data).

- `--glob` - allows adding files and directories that match the specified
pattern as specified by `target`. Shell-style wildcards are supported: `*`,
`?`, `[seq]`, `[!seq]`, and `**`.
> Note that external outputs typically require an external cache setup. See
> link above for more details.

- `--desc <text>` - user description of the data (optional). This doesn't affect
any DVC operations.
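
To show what the wildcards listed under `--glob` match, here is a small sketch using Python's standard `glob` module (the module the new option text links to). The file names and the `match` helper are made up for illustration, and DVC's own matching may differ in details, so treat this as a demonstration of the pattern syntax rather than of `dvc add` itself.

```python
import glob
import os
import tempfile


def match(root, pattern):
    """Return sorted paths (relative to `root`) that match `pattern`."""
    # glob.escape() protects any special characters in the temporary root path.
    hits = glob.glob(os.path.join(glob.escape(root), pattern), recursive=True)
    return sorted(os.path.relpath(hit, root).replace(os.sep, "/") for hit in hits)


matches = {}
with tempfile.TemporaryDirectory() as root:
    # Hypothetical data files used only for this demonstration.
    for rel in ["data/a1.csv", "data/a2.csv", "data/b1.txt", "data/sub/c1.csv"]:
        path = os.path.join(root, rel)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, "w").close()

    for pattern in ["data/*.csv",      # * matches any run of characters
                    "data/a?.csv",     # ? matches exactly one character
                    "data/[ab]1.*",    # [seq] matches one character in seq
                    "data/[!a]1.*",    # [!seq] matches one character not in seq
                    "data/**/*.csv"]:  # ** matches nested directories
        matches[pattern] = match(root, pattern)

for pattern, found in matches.items():
    print(pattern, "->", found)
```

Note that in Python's `glob`, `**` only recurses into subdirectories when `recursive=True` is passed.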
4 changes: 2 additions & 2 deletions content/docs/command-reference/gc.md
@@ -93,8 +93,8 @@ The default remote is cleaned (see `dvc config core.remote`) unless the
from remote storage. This only applies when the `--cloud` option is used, or a
`--remote` is given. The default value is `4 * cpu_count()`. For SSH remotes,
the default is `4`. Note that the default value can be set using the `jobs`
config option with `dvc remote modify`. Using more jobs may improve the
overall connection speed.
config option with `dvc remote modify`. Using more jobs may speed up the
operation.

> For now only some phases of garbage collection are parallel.
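
The default described for `--jobs` can be written out as a short sketch. The function name `default_jobs` and its `remote_scheme` argument are hypothetical and only restate the rule in the text (`4 * cpu_count()`, or `4` for SSH remotes); this is not taken from DVC's source.

```python
import os


def default_jobs(remote_scheme):
    """Default number of parallel jobs, following the description above.

    Hypothetical helper: `remote_scheme` is an assumed string such as "ssh" or "s3".
    """
    if remote_scheme == "ssh":
        return 4
    # os.cpu_count() can return None when the CPU count cannot be determined.
    return 4 * (os.cpu_count() or 1)


print("ssh:", default_jobs("ssh"))
print("s3:", default_jobs("s3"))
```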

7 changes: 4 additions & 3 deletions content/docs/command-reference/import-url.md
@@ -127,9 +127,10 @@ source.

- `--no-exec` - create `.dvc` file without actually downloading `url`. E.g. if
the file or directory already exists, this can be used to skip the download.
The data hash is not calculated by this, only the metadata is saved into the
`.dvc` file. You can use `dvc commit <out>.dvc` if you need the hashes in the
new `.dvc` file and save existing data to the cache.
The data hash is not calculated when this option is used; only the import
metadata is saved to the `.dvc` file. `dvc commit <out>.dvc` can be used if
the data hashes are needed in the `.dvc` file, and to save existing data to
the cache.

- `--desc <text>` - user description of the data (optional). This doesn't
affect any DVC operations.
11 changes: 6 additions & 5 deletions content/docs/command-reference/import.md
@@ -103,11 +103,12 @@ repo at `url`) are not supported.
> [Importing and updating fixed revisions](#example-importing-and-updating-fixed-revisions)
> example below).

- `--no-exec` - create `.dvc` file without actually downloading the file or
directory. E.g. if the file or directory already exists, this can be used to
skip the download. The data hash is not calculated by this, only the metadata
is saved into the `.dvc` file. You can use `dvc commit <out>.dvc` if you need
the hashes in the new `.dvc` file and save existing data to the cache.
- `--no-exec` - create the import `.dvc` file without actually downloading the
file or directory. E.g. if the file or directory already exists, this can be
used to skip the download. The data hash is not calculated when this option is
used; only the import metadata is saved to the `.dvc` file.
`dvc commit <out>.dvc` can be used if the data hashes are needed in the `.dvc`
file, and to save existing data to the cache.

- `--desc <text>` - user description of the data (optional). This doesn't affect
any DVC operations.
4 changes: 2 additions & 2 deletions content/docs/command-reference/metrics/diff.md
@@ -22,8 +22,8 @@ positional arguments:

This command provides a quick way to compare metrics among experiments in the
repository history. All metrics defined in `dvc.yaml` are used by default. The
comparison shown by this command includes the new value, and the numeric
difference (delta) with the previous value (rounded to 5 digits precision).
differences shown by this command include the new value and the numeric
difference (delta) from the previous value of each metric (rounded to 5 digits
precision).
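
As a rough illustration of the new value plus delta described above, here is a minimal sketch. The metric names and values are invented, the `metrics_delta` helper is hypothetical, and it assumes "5 digits precision" means rounding to 5 decimal places; it is not DVC's own code.

```python
def metrics_delta(old, new, precision=5):
    """Return {name: (new value, delta)} for metrics present in both dicts."""
    return {
        name: (round(new[name], precision),
               round(new[name] - old[name], precision))
        for name in new
        if name in old
    }


# Invented example values for two revisions of the same metrics file.
previous = {"auc": 0.9643, "loss": 0.25}
current = {"auc": 0.9671, "loss": 0.125}

diff = metrics_delta(previous, current)
for name, (value, delta) in diff.items():
    print(f"{name}: value={value} delta={delta}")
```

Rounding the delta hides floating-point noise from the subtraction, which is presumably why the displayed precision is limited.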

`a_rev` and `b_rev` are Git commit hashes, tag, or branch names. If none are
specified, `dvc metrics diff` compares metrics currently present in the
4 changes: 2 additions & 2 deletions content/docs/command-reference/plots/diff.md
@@ -31,8 +31,8 @@ versions of the <abbr>repository</abbr>, by overlaying them in a single plot.
(uncommitted changes) with their latest commit (required). A single specified
revision results in comparing the workspace and that version.

💡 Note that any number of `revisions` can be provided, and the resulting plot
shows all of them in a single image.
💡 Note that any number of `revisions` can be provided (the resulting plot shows
all of them in a single image).

All plots defined in `dvc.yaml` are used by default, but specific plots files
can be specified with the `--targets` option (note that targets don't
2 changes: 1 addition & 1 deletion content/docs/command-reference/plots/index.md
@@ -134,7 +134,7 @@ header (first row) are equivalent to field names.

### DVC template anchors

- `<DVC_METRIC_DATA>` (**required**) - the plot data from any kind of metrics
- `<DVC_METRIC_DATA>` (**required**) - the plot data from any type of metrics
files is converted to a single JSON array internally, and injected instead of
this anchor. Two additional fields will be added: `index` and `rev` (explained
above).
10 changes: 5 additions & 5 deletions content/docs/command-reference/repro.md
@@ -114,11 +114,11 @@ up-to-date and only execute the final stage.
target directory and its subdirectories for stages (in `dvc.yaml`) to inspect.
If there are no directories among the targets, this option is ignored.

- `--no-commit` - do not save outputs to cache. A DVC-file is created and an
entry is added to `.dvc/state`, while nothing is added to the cache.
(`dvc status` will report that the file is `not in cache`.) Use `dvc commit`
when ready to commit outputs with DVC. Useful to avoid caching unnecessary
data repeatedly when running multiple experiments.
- `--no-commit` - do not save outputs to cache. A DVC-file is created, while
nothing is added to the cache. (`dvc status` will report that the file is
`not in cache`.) Use `dvc commit` when ready to commit outputs with DVC.
Useful to avoid caching unnecessary data repeatedly when running multiple
experiments.

- `-m`, `--metrics` - show metrics after reproduction. The target pipelines must
have at least one metrics file defined either with the `dvc metrics` command,
12 changes: 6 additions & 6 deletions content/docs/command-reference/run.md
@@ -244,11 +244,11 @@ $ dvc run -n my_stage './my_script.sh $MYENVVAR'
command's code is non-deterministic
([not recommended](#avoiding-unexpected-behavior)).

- `--no-commit` - do not save outputs to cache. A stage created and an entry is
added to `.dvc/state`, while nothing is added to the cache. In the stage file,
the file hash values will be empty; They will be populated the next time this
stage is actually executed, or `dvc commit` can be used to force committing
existing output file versions to cache.
- `--no-commit` - do not save outputs to cache. A stage is created, while nothing
is added to the cache. In the stage file, the file hash values will be empty;
they will be populated the next time this stage is actually executed, or
`dvc commit` can be used to force committing existing output file versions to
cache.

This is useful to avoid caching unnecessary data repeatedly when running
multiple experiments.
@@ -260,7 +260,7 @@ $ dvc run -n my_stage './my_script.sh $MYENVVAR'
> Note that DVC-files without dependencies are automatically considered
> "always changed", so this option has no effect in those cases.

- `--external` - allow outputs that are outside of the DVC repository. See
- `--external` - allow writing outputs outside of the DVC repository. See
[Managing External Data](/doc/user-guide/managing-external-data).

- `--desc <text>` - user description of the stage (optional). This doesn't
4 changes: 2 additions & 2 deletions content/docs/command-reference/status.md
@@ -143,8 +143,8 @@ that.
information from remote storage. This only applies when the `--cloud` option
is used, or a `--remote` is given. The default value is `4 * cpu_count()`. For
SSH remotes, the default is `4`. Note that the default value can be set using
the `jobs` config option with `dvc remote modify`. Using more jobs may improve
the overall connection speed.
the `jobs` config option with `dvc remote modify`. Using more jobs may speed
up the operation.

- `-h`, `--help` - prints the usage/help message, and exit.

41 changes: 19 additions & 22 deletions content/docs/user-guide/external-dependencies.md
@@ -1,23 +1,27 @@
# External Dependencies

There are cases when data is so large, or its processing is organized in a way
such that you would like to avoid moving it out of its external/remote location.
For example from a network attached storage (NAS), processing data on HDFS,
running [Dask](https://dask.org/) via SSH, or having a script that streams data
There are cases when data is so large, or its processing is organized in such a
way, that it's preferable to avoid moving it from its original location. For
example, data on a network attached storage (NAS), processing data on HDFS,
running [Dask](https://dask.org/) via SSH, or a script that streams data
from S3 to process it.

External <abbr>dependencies</abbr> and
External dependencies and
[external outputs](/doc/user-guide/managing-external-data) provide ways to track

Member commented: track and version?

Contributor Author (@jorgeorpinel, Nov 27, 2020) replied: Do we version dependencies? Only when they're outputs of a previous stage, I think. I guess we do keep track of their versions in any case, but don't really control those versions if they are external.

Member replied: This sentence is about External dependencies and [external outputs]

Contributor Author replied: True. Updating in 60e8055.

data outside of the <abbr>project</abbr>.

## How it works
## How external dependencies work

You can specify external files or directories as dependencies for your pipeline
stages. DVC will track changes in them and reflect this in the output of
`dvc status`.
External <abbr>dependencies</abbr> are considered part of the (extended) DVC
project: DVC will track them, detecting when they change (triggering stage
executions on `dvc repro`, for example).

Currently, the following types (protocols) of external dependencies are
supported:
DVC can track files or directories on an external location as

Member commented: sounds like it repeats a lot of the previous paragraph?

Contributor Author replied: True 🤦. Updated in cfe6e07.


[stage](/doc/command-reference/run) dependencies. Their remote URLs or external
paths are defined in `dvc.yaml` (`deps` field) with the same format as the `url`
of certain `dvc remote` types.

Currently, the following protocols are supported:

- Amazon S3
- Microsoft Azure Blob Storage
@@ -27,20 +31,13 @@ supported:
- HTTP
- Local files and directories outside the <abbr>workspace</abbr>

> Note that these are a subset of the remote storage types supported by
> `dvc remote`.

In order to specify an external <abbr>dependency</abbr> for your stage, use the
usual `-d` option in `dvc run` with the external path or URL to your desired
file or directory.
> Note that [remote storage](/doc/command-reference/remote) is a different
> feature.

## Examples

Let's take a look at a `download_file` [stage](/doc/command-reference/run) that
simply downloads a file from an external location.

> Note that some of these commands use the `/home/shared` directory, typical in
> Linux distributions.
Let's take a look at defining and running a `download_file` stage that simply
downloads a file from an external location, on all the supported location types.

<details>

61 changes: 33 additions & 28 deletions content/docs/user-guide/managing-external-data.md
@@ -1,55 +1,60 @@
# Managing External Data

There are cases when data is so large, or its processing is organized in a way
such that its preferable to avoid moving it from its external/remote location.
For example data on a network attached storage (NAS), processing data on HDFS,
running [Dask](https://dask.org/) via SSH, or having a script that streams data
There are cases when data is so large, or its processing is organized in such a
way, that it's preferable to avoid moving it from its original location. For
example, data on a network attached storage (NAS), processing data on HDFS,
running [Dask](https://dask.org/) via SSH, or a script that streams data
from S3 to process it.

External <abbr>outputs</abbr> and
External outputs and
[external dependencies](/doc/user-guide/external-dependencies) provide ways to
track data outside of the <abbr>project</abbr>.

## How external outputs work

DVC can track existing files or directories on an external location with
`dvc add` (`out` field). It can also create external files or directories as
outputs for `dvc.yaml` files (only `outs` field, not metrics or plots).

External outputs are considered part of the (extended) DVC project: DVC will
track changes in them, and reflect this in `dvc status` reports, for example.
External <abbr>outputs</abbr> are considered part of the (extended) DVC project:
DVC will track them for
[versioning](/doc/use-cases/versioning-data-and-model-files), detecting when
they change (reported by `dvc status`, for example).

For cached external outputs (e.g. `dvc add`, `dvc run -o`), you will need to
[setup an external cache](/doc/use-cases/shared-development-server#configure-the-external-shared-cache)
in the same external/remote file system first.
DVC can track existing files or directories on an external location with
`dvc add`. It's also possible to use them as [stage](/doc/command-reference/run)
outputs. Their remote URLs or external paths can be defined in `dvc.yaml`
(`outs` field) with the same format as the `url` of certain `dvc remote` types.

Currently, the following types (protocols) of external outputs (and
<abbr>cache</abbr>) are supported:
Currently, the following protocols are supported:

- Amazon S3
- Google Cloud Storage
- SSH
- HDFS
- Local files and directories outside the <abbr>workspace</abbr>
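
Since external output locations are written as URLs or paths, a quick way to see which listed protocol a given location falls under is to look at its URL scheme. The sketch below is an illustration only: the scheme-to-protocol mapping, the helper name `external_output_type`, and the example bucket and host names are assumptions, not part of DVC.

```python
from urllib.parse import urlparse

# URL schemes assumed to correspond to the protocols listed above.
EXTERNAL_OUTPUT_SCHEMES = {
    "s3": "Amazon S3",
    "gs": "Google Cloud Storage",
    "ssh": "SSH",
    "hdfs": "HDFS",
}


def external_output_type(path_or_url):
    """Name the listed protocol for a location, judging by its URL scheme."""
    scheme = urlparse(path_or_url).scheme
    if not scheme:
        # No scheme: treat it as a local path outside the workspace.
        return "local path"
    return EXTERNAL_OUTPUT_SCHEMES.get(scheme, "not listed")


# Hypothetical locations used only for this demonstration.
for location in ["s3://mybucket/existing-data",
                 "gs://mybucket/existing-data",
                 "ssh://user@example.com/path/to/data",
                 "hdfs://user@example.com/data",
                 "/home/shared/existing-data",
                 "http://example.com/data.xml"]:
    print(location, "->", external_output_type(location))
```

The HTTP example comes back as "not listed" because HTTP appears only in the external dependencies list, not in the external outputs list above.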

> Note that these are a subset of the remote storage types supported by
> `dvc remote`.
External outputs require an
[external cache](/doc/use-cases/shared-development-server#configure-the-external-shared-cache)
in the same external/remote file system.

> Note that [remote storage](/doc/command-reference/remote) is a different
> feature, and that external outputs are not pushed or pulled from/to DVC
> remotes.

> Avoid using the same [DVC remote](/doc/command-reference/remote) (used for
> `dvc push`, `dvc pull`, etc.) for external outputs, because it may cause file
> hash overlaps: the hash of an external output could collide with a hash
> generated locally for another file with different content.
> ⚠️ Avoid using the same DVC remote used for `dvc push`, `dvc pull`, etc. for
> external outputs, because it may cause data collisions: the hash of an
> external output could collide with that of a local file with different
> content.

## Examples

Let's take a look at:
Let's take a look at the following operations on all the supported location
types:

1. Adding a `dvc remote` to use as cache for data in the external location, and
1. Adding a `dvc remote` in the same location as the desired outputs, and
configuring it as external <abbr>cache</abbr> with `dvc config`.

Member commented: as an external cache?

Contributor Author (@jorgeorpinel, Nov 27, 2020) replied: No "an" needed. Maybe "the" but it's OK like this too (generic to any hypothetical project), I think.


2. Tracking existing data on an external location with `dvc add` (this doesn't
download it). This produces a `.dvc` file with an external output.
3. Creating a simple [stage](/doc/command-reference/run) that moves a local file
to the external location. This produces a stage with another external output
2. Tracking existing data on the external location using `dvc add` (`--external`
option needed). This produces a `.dvc` file with an external URL or path in
its `outs` field.
3. Creating a simple stage with `dvc run` (`--external` option needed) that
moves a local file to the external location. This produces an external output
in `dvc.yaml`.

<details>