Commit

ref: clarifications around add and import-url for external targets (#3210)

* ref: clarify external-data-related options of add
i.e. --external, --out, and --to-remote

* ref: improve add --out example

* ref: clarify add/import-url --to-remote examples

* ref: add/import-url --add option copy edits

* ref: clarify add --out base case
per #3210 (review)

* ref: re-explain add --out example (again)

* Update content/docs/command-reference/add.md

Co-authored-by: Dave Berenbaum <[email protected]>

* ref: rename add --out (external cache) example and
more explanation clarifications heh

* ref: rewrite add/import-url --to-remote example intros to
match previous commits (improvements to add --out example text)

* rephrase --to-remote examples
per #3210 (review)
and #3210 (review)

* ref: corrections around add --to-remote and --out
per #3210 (comment)

* ref: std import --out description

* ref: forgot to remove -o from add --to-remote example and
complete console sample with ls

* ref: add ls to import-url --to-remote example

* ref: consistent --to-remote examples among add and import-url

* ref: more consistency changes around --to-remote

* ref: remove some redundancy in import-url --to-remote example

* ref: add --out/to-remote doesn't move, it copies
per #3210 (review)
and #3210 (review)

* ref: include motivation again in add --to-remote example and
everything else while at it...
Plus std. changes to corresponding import-url example

Co-authored-by: Dave Berenbaum <[email protected]>
2 people authored and iesahin committed Apr 11, 2022
1 parent 8bc3b2c commit bb2e569
Showing 5 changed files with 81 additions and 85 deletions.
87 changes: 43 additions & 44 deletions content/docs/command-reference/add.md
@@ -146,7 +146,8 @@ not.
[pattern](https://docs.python.org/3/library/glob.html) specified in `targets`.
Shell style wildcards supported: `*`, `?`, `[seq]`, `[!seq]`, and `**`

- `--external` - allow `targets` that are outside of the DVC repository. See
- `--external` - allow tracking `targets` outside of the DVC repository
in-place. See
[Managing External Data](/doc/user-guide/managing-external-data).

> ⚠️ Note that this is an advanced feature for very specific situations and
@@ -155,16 +156,15 @@ not.
> above).
- `-o <path>`, `--out <path>` - specify a `path` to the desired location in the
workspace to place the `targets` (instead of using the current working
directory). Directories specified in the path will be created by this command.
workspace to place the `targets` (copying them from their current location).
This enables targeting data outside the project (see an
[example](#example-transfer-to-an-external-cache)).

> Note that this can be combined with `--to-remote` to avoid storing the data
> locally, in which case the given `path` is only used in `dvc.yaml`.
- `--to-remote` - add an external target, but don't move it into the workspace,
nor cache it. [Transfer it](#example-transfer-to-remote-storage) directly
to remote storage (the default one, unless `-r` is specified) instead. Use
`dvc pull` to get the data locally.
- `--to-remote` - add a target that's outside the project, but neither place it
in the workspace nor cache it yet.
[Transfer it](#example-transfer-to-remote-storage) directly to remote storage
instead (the default one unless one is specified with `-r`). Implies
`--out .`. Use `dvc pull` to get the data locally.

- `-r <name>`, `--remote <name>` - name of the
[remote storage](/doc/command-reference/remote) to transfer external target to
@@ -343,22 +343,20 @@ $ tree .dvc/cache
Only the hash values of the `dir/` directory (with `.dir` file extension) and
`file2` have been cached.

## Example: Transfer to the cache
## Example: Transfer to an external cache

When you have a large dataset in an external location, you may want to add it to
the <abbr>project</abbr> without having to copy it into the
<abbr>workspace</abbr>. Maybe your local disk doesn't have enough space, but you
have set up an
[external cache](/doc/user-guide/managing-external-data#setting-up-an-external-cache)
that could handle it.
When you want to add a large dataset that is outside of your
<abbr>project</abbr> (e.g. online), you would normally need to download or copy
it into the <abbr>workspace</abbr> first. But you may not have enough local
storage space.

The `--out` option lets you add external paths in a way that they are
<abbr>cached</abbr> first, and then
[linked](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache)
to a given path inside the workspace.
You can however set up an [external cache] that can handle the data. To avoid
ever making a local copy, target the outside data with `dvc add` while
specifying an `--out` (`-o`) path inside of your project. This way the data will
be transferred to the <abbr>cache</abbr> directly, and then [linked] into your
workspace.

Let's add a `data.xml` file via HTTP for example, putting it in a local path in
our project:
Let's add a `data.xml` file via HTTP, putting it in `./data.xml`:

```dvc
$ dvc add https://data.dvc.org/get-started/data.xml -o data.xml
@@ -379,41 +377,42 @@ outs:
path: data.xml
```

[linked]:
/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache
[external cache]:
/doc/user-guide/managing-external-data#setting-up-an-external-cache

## Example: Transfer to remote storage

When you have a large dataset in an external location, you may want to track it
as if it was in your project, but without downloading it locally (for now). The
`--to-remote` option lets you do so, while storing a copy
[remotely](/doc/command-reference/remote) so it can be
Sometimes there's not enough space in the local environment to import a large
dataset, but you still want to track it in the <abbr>project</abbr> so it can be
[pulled](/doc/command-reference/pull) later.

Let's add the `data.xml` to our remote storage from the given remote location:

```dvc
$ dvc add https://data.dvc.org/get-started/data.xml -o data.xml \
--to-remote
```
As long as you have set up [remote storage] that can handle the data, this can be
achieved with the `--to-remote` flag. It creates a `.dvc` file without
downloading anything, transferring a target directly to a DVC remote instead.

The only difference is that the dataset is transferred straight to the remote,
so DVC won't control the source location you gave but will rather manage your
remote storage, where the data now resides. The operation still results in a
`.dvc` file:
Let's add a `data.xml` file via HTTP straight to remote:

```dvc
$ dvc add https://data.dvc.org/get-started/data.xml --to-remote
...
$ ls
data.xml.dvc
```

Whenever anyone wants to actually download the added data (for example from a
system that can handle it), they can use `dvc pull` as usual:
Since a `.dvc` file is created in the <abbr>workspace</abbr>, whenever anyone
wants to actually download the data they can use `dvc pull`:

```dvc
$ dvc pull data.xml.dvc
A data.xml
1 file added and 1 file fetched
1 file added
```

> For a similar operation that actually keeps a connection to the data source,
> please see an
> [`import-url` example](/doc/command-reference/import-url#example-transfer-to-remote-storage).
> Note that you can also do this [with `dvc import-url`][iutr]. This has the
> added benefit of keeping a connection to the data source so it can be updated
> later (with `dvc update`).

[remote storage]: /doc/command-reference/remote
[iutr]: /doc/command-reference/import-url#example-transfer-to-remote-storage
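
As a rough end-to-end sketch of the `--to-remote` example above, the script below lists the commands in the order they would typically run: configure a default remote, transfer the file straight to it, and pull the data later. The remote name `myremote` and the `s3://mybucket/dvcstore` URL are hypothetical placeholders that do not appear in the original example, and the `run` helper is only an illustration device. By default the script just prints the commands; setting `RUN=1` executes them, assuming DVC is installed and the script runs inside a DVC project.

```shell
#!/bin/sh
# Sketch of the --to-remote workflow. Assumptions: "myremote" and the
# s3:// URL are hypothetical; replace them with your own remote storage.
DATA_URL="https://data.dvc.org/get-started/data.xml"
REMOTE_NAME="myremote"
REMOTE_URL="s3://mybucket/dvcstore"

# Print a command, and execute it only when RUN=1 and dvc is available.
run() {
  echo "+ $*"
  if [ "${RUN:-0}" = "1" ] && command -v dvc >/dev/null 2>&1; then
    "$@"
  fi
}

# 1. Set up a default remote that can hold the data.
run dvc remote add -d "$REMOTE_NAME" "$REMOTE_URL"
# 2. Transfer the target to the remote; only data.xml.dvc lands in the workspace.
run dvc add "$DATA_URL" --to-remote
# 3. Later, on a machine with enough space, fetch the data locally.
run dvc pull data.xml.dvc
```

Keeping the dry-run default makes it possible to review the exact commands before anything is transferred.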
52 changes: 25 additions & 27 deletions content/docs/command-reference/import-url.md
@@ -140,10 +140,10 @@ produces a regular stage in `dvc.yaml`.
finish the operation(s)); or if the target data already exist locally and you
want to "DVCfy" this state of the project (see also `dvc commit`).

- `--to-remote` - import an external target, but don't move it into the
workspace, nor cache it. [Transfer](#example-transfer-to-remote-storage) it
directly to remote storage (the default one, unless `-r` is specified)
instead. Use `dvc pull` to get the data locally.
- `--to-remote` - import a target, but neither move it into the workspace, nor
cache it. [Transfer it](#example-transfer-to-remote-storage) directly to
remote storage (the default one unless one is specified with `-r`) instead.
Use `dvc pull` to get the data locally.

- `-r <name>`, `--remote <name>` - name of the
[remote storage](/doc/command-reference/remote) (can only be used with
@@ -318,11 +318,11 @@ $ tree
.
├── README.md
├── data
   ├── data.xml
   ├── data.xml.dvc
   └── prepared
   ├── test.tsv
   └── train.tsv
├── data.xml
├── data.xml.dvc
└── prepared
├── test.tsv
└── train.tsv
├── dvc.lock
├── dvc.yaml
├── params.yaml
@@ -363,36 +363,34 @@ Running stage 'prepare' with command:

## Example: Transfer to remote storage

When you have a large dataset in an external location, you may want to import it
to your project without downloading it to the local file system (for using it
later/elsewhere). The `--to-remote` option lets you skip the download, while
storing the imported data [remotely](/doc/command-reference/remote).
Sometimes there's not enough space in the local environment to import a large
dataset, but you still want to track it in the <abbr>project</abbr> so it can be
[pulled](/doc/command-reference/pull) later.

Let's create an import `.dvc` file without downloading the target data,
transferring it directly to remote storage instead:
As long as you have set up [remote storage] that can handle the data, this can be
achieved with the `--to-remote` flag. It creates an import `.dvc` file without
downloading anything, transferring a target directly to a DVC remote instead.

Let's import a `data.xml` file via HTTP straight to remote:

```dvc
$ dvc import-url https://data.dvc.org/get-started/data.xml data.xml \
--to-remote
```

The only change in our local <abbr>workspace</abbr> is a newly created import
`.dvc` file:

```dvc
...
$ ls
data.xml.dvc
```

Whenever anyone wants to actually download the imported data (for example from a
system that can handle it), they can use `dvc pull` as usual:
Since a `.dvc` file is created in the <abbr>workspace</abbr>, whenever anyone
wants to actually download the data they can use `dvc pull`:

```dvc
$ dvc pull data.xml.dvc
A data.xml
1 file added and 1 file fetched
1 file added
```

Note that you can also use `dvc update --to-remote` to bring the import up to
date in remote storage, without downloading anything.
Use `dvc update --to-remote` to bring the import up to date in remote storage,
without downloading anything.

[remote storage]: /doc/command-reference/remote
3 changes: 1 addition & 2 deletions content/docs/command-reference/import.md
@@ -82,8 +82,7 @@ To actually [version the data](/doc/start/data-and-model-versioning), `git add`

- `-o <path>`, `--out <path>` - specify a `path` to the desired location in the
workspace to place the downloaded file or directory (instead of using the
current working directory). Directories specified in the path must already
exist, otherwise this command will fail.
current working directory).

- `--file <filename>` - specify a path and/or file name for the `.dvc` file
created by this command (e.g. `--file stages/stage.dvc`). This overrides the
Expand Down
20 changes: 10 additions & 10 deletions content/docs/command-reference/update.md
@@ -1,7 +1,7 @@
# update

Update files or directories imported from external <abbr>DVC
repositories</abbr>, and the corresponding import stage `.dvc` files.
repositories</abbr>, and the corresponding import `.dvc` files.

## Synopsis

@@ -10,7 +10,7 @@ usage: dvc update [-h] [-q | -v] [--rev <commit>] [-R] [--to-remote]
[-r <name>] [-j <number>] targets [targets ...]
positional arguments:
targets Import stage .dvc files to update. Using -R, directories
targets Import .dvc files to update. Using -R, directories
to search for .dvc files can also be given.
```

@@ -46,14 +46,14 @@ $ dvc update --rev master
> revision.
- `-R`, `--recursive` - determines the files to update by searching each target
directory and its subdirectories for import stage `.dvc` files to inspect. If
there are no directories among the targets, this option has no effect.

- `--to-remote` - update a `.dvc` file created with `dvc import-url` and
[transfer](/doc/command-reference/import-url#example-import-straight-to-the-remote)
the new data directly to remote storage (the default one unless `-r` is used).
No changes are done in the <abbr>workspace</abbr>. Use `dvc pull` to get the
data locally. This option can't be used with DVC or Git repository imports.
directory and its subdirectories for import `.dvc` files to inspect. If there
are no directories among the targets, this option has no effect.

- `--to-remote` - update a `.dvc` file created with `dvc import-url` without
downloading the latest data.
[Transfer it](/doc/command-reference/import-url#example-transfer-to-remote-storage)
directly to remote storage instead (the default one unless one is specified
with `-r`). Use `dvc pull` to get the data locally.

- `-r <name>`, `--remote <name>` - name of the
[remote storage](/doc/command-reference/remote) (can only be used with
Expand Down
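
The recursive search described for `-R` above can be pictured with a small shell sketch. This is not DVC's actual implementation: it assumes that import `.dvc` files can be recognized by a top-level `deps:` field (which files created by plain `dvc add` lack), and the function name `find_import_dvc_files` and the demo file names are hypothetical.

```shell
#!/bin/sh
# Sketch: walk a target directory and its subdirectories, listing .dvc files
# that look like imports. Assumption: an import .dvc file has a top-level
# "deps:" field. Paths containing newlines are not handled.
find_import_dvc_files() {
  find "$1" -type f -name '*.dvc' | sort | while IFS= read -r f; do
    if grep -q '^deps:' "$f"; then
      echo "$f"
    fi
  done
}

# Demo on a throwaway directory tree (hypothetical file names and contents).
DEMO_DIR=$(mktemp -d)
mkdir -p "$DEMO_DIR/data/raw"
printf 'deps:\n- path: data.xml\nouts:\n- path: a.xml\n' > "$DEMO_DIR/data/raw/a.xml.dvc"
printf 'outs:\n- path: b.csv\n' > "$DEMO_DIR/data/b.csv.dvc"
touch "$DEMO_DIR/data/notes.txt"

FOUND=$(find_import_dvc_files "$DEMO_DIR/data")
echo "$FOUND"
COUNT=$(printf '%s\n' "$FOUND" | grep -c .)
rm -rf "$DEMO_DIR"
```

Only the nested `a.xml.dvc` is reported here, since `b.csv.dvc` has no `deps:` field and `notes.txt` is not a `.dvc` file.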
4 changes: 2 additions & 2 deletions content/docs/user-guide/managing-external-data.md
@@ -1,12 +1,12 @@
# Managing External Data

> ⚠️ This is an advanced feature for very specific situations and not
> recommended except if there's absolutely no other alternative. In most cases
> recommended except if there's absolutely no other alternative. In most cases,
> alternatives like the [to-cache] or [to-remote] strategies of `dvc add` and
> `dvc import-url` are more convenient. **Note** that external outputs are not
> pushed or pulled from/to [remote storage].
[to-cache]: /doc/command-reference/add#example-transfer-to-the-cache
[to-cache]: /doc/command-reference/add#example-transfer-to-an-external-cache
[to-remote]: /doc/command-reference/add#example-transfer-to-remote-storage
[remote storage]: /doc/command-reference/remote

Expand Down
