cmd: add to-cache docs #2246

Merged
merged 9 commits into from
Mar 10, 2021
52 changes: 50 additions & 2 deletions content/docs/command-reference/add.md
Original file line number Diff line number Diff line change
Expand Up @@ -160,8 +160,10 @@ not.
[remote storage](/doc/command-reference/remote) to transfer external target to
(can only be used with `--to-remote`).

- `-o <path>`, `--out <path>` - destination `path` for the transferred data (can
only be used with `--to-remote`).
- `-o <path>`, `--out <path>` - destination `path` for the transferred data. If
  used with `--to-remote`, the data is transferred to the remote storage.
  Otherwise, it is transferred [to the cache](#example-transfer-to-cache) and
  linked to the workspace.

- `--desc <text>` - user description of the data (optional). This doesn't affect
any DVC operations.
Expand Down Expand Up @@ -378,3 +380,49 @@ system that can handle it), they can use `dvc pull` as usual:
A data.xml
1 file added and 1 file fetched
```

## Example: Transfer to the cache

When you have a large dataset in an external location, you may want to add it to
your cache without copying it into the workspace first. This is useful when the
cache and the workspace are on separate disks and only the cache disk can hold
data of that size. After the data is saved to your cache, DVC links it to your
workspace using the
[preferred link type](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache).

Let's initialize a DVC project:

```dvc
$ mkdir example # workspace
$ cd example
$ git init
$ dvc init
```

Afterwards, let's set up a shared cache by following
[this tutorial](https://dvc.org/doc/use-cases/shared-development-server#preparation).
Once it is ready, we can add `data.xml` to our cache directly:

```
$ dvc add https://data.dvc.org/get-started/data.xml -o data.xml
```

This comment was marked as resolved.

Also not sure about the https://data.dvc.org/get-started/data.xml example... We could keep it hypothetical with some abstract path like /mnt/nfs/raw/data.xml (or some better idea) or ssh://[email protected]/raw/data.xml

Idk, I guess it's not very important but data.dvc.org is kind of an internal company thing (not a secret but still).


Depending on the link type configured for the project (set with the
`cache.type` config option), the data is either linked or copied into the
workspace. For this use case, a reflink or a symlink is recommended.

```
$ ls
data.xml data.xml.dvc
```
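
To make the linking step concrete, the sketch below simulates it with plain
shell tools instead of DVC. The temporary `cache` and `workspace` directories,
the `data-object` file name, and the `describe_link` helper are all
hypothetical stand-ins; DVC's real cache layout is different.

```shell
# Simulate a cache and a workspace as two separate directories
# (hypothetical temp paths, not DVC's actual cache structure).
demo=$(mktemp -d)
mkdir -p "$demo/cache" "$demo/workspace"
printf '<posts/>\n' > "$demo/cache/data-object"

# Link the cached object into the workspace, roughly what a symlink
# cache type does after the data has been saved to the cache.
ln -s "$demo/cache/data-object" "$demo/workspace/data.xml"

# Report whether a workspace file is a symlink or a regular file.
# A regular file may still be a copy, a hardlink, or a reflink.
describe_link() {
  if [ -L "$1" ]; then
    echo "symlink"
  else
    echo "regular file"
  fi
}

kind=$(describe_link "$demo/workspace/data.xml")
echo "$kind"
```

In a real project, the link preference would be set with something like
`dvc config cache.type reflink,symlink` rather than by creating links by hand.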

As can be seen in `data.xml.dvc`, this option doesn't track the source, unlike
[import-url](/doc/command-reference/import-url).

```
$ cat data.xml.dvc
outs:
- md5: a304afb96060aad90176268345e10355
  nfiles: 1
  path: data.xml
```
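
One way to check that the source is not tracked is to look for a `deps`
section, which `.dvc` files created by `dvc import-url` contain. The sketch
below assumes that a top-level `deps:` key is a sufficient signal and works on
a temporary copy of the file shown above, so it runs without DVC; the
`dvcfile` and `tracked` names are illustrative.

```shell
# Recreate the .dvc file from the example in a temporary location.
dvcfile=$(mktemp)
cat > "$dvcfile" <<'EOF'
outs:
- md5: a304afb96060aad90176268345e10355
  nfiles: 1
  path: data.xml
EOF

# A top-level "deps:" key would mean the source is tracked.
if grep -q '^deps:' "$dvcfile"; then
  tracked=yes
else
  tracked=no
fi
echo "source tracked: $tracked"
```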