Drops external outputs and updates external data guides #4574

Merged 4 commits on Jun 8, 2023
64 changes: 10 additions & 54 deletions content/docs/command-reference/add.md
@@ -6,7 +6,7 @@ file.
## Synopsis

```usage
usage: dvc add [-h] [-q | -v] [-R] [--no-commit] [--external]
usage: dvc add [-h] [-q | -v] [-R] [--no-commit]
[--glob] [--file <filename>] [-o <path>]
[--to-remote] [-r <name>] [-j <number>] [-f]
[--desc <text>] [--meta key=value] [--label <str>]
@@ -16,6 +16,14 @@ positional arguments:
targets Files or directories to add
```

<details>

### Options deprecated in 3.0

- `--external`

</details>

## Description

The `dvc add` command is analogous to `git add`, in that it makes DVC aware of
@@ -149,21 +157,9 @@ not.
specified in `targets`. Shell style wildcards supported: `*`, `?`, `[seq]`,
`[!seq]`, and `**`

- `--external` - allow tracking `targets` outside of the DVC repository
in-place. See [Managing External Data].

<admon type="warn">

Note that this is an advanced feature for very specific situations and not
recommended except if there's absolutely no other alternative. Additionally,
this typically requires an external cache setup (see link above).

</admon>

- `-o <path>`, `--out <path>` - specify a `path` to the desired location in the
workspace to place the `targets` (copying them from their current location).
This enables targeting data outside the project (see an
[example](#example-transfer-to-an-external-cache)).
This enables targeting data outside the project.

- `--to-remote` - add a target that's outside the project, neither move it into
the workspace, nor cache it.
@@ -199,7 +195,6 @@ not.
- `-v`, `--verbose` - displays detailed tracing information.

[pattern]: https://docs.python.org/3/library/glob.html
[managing external data]: /doc/user-guide/data-management/managing-external-data
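
The wildcard forms listed for `--glob` follow the Python `glob` module that the `[pattern]` link points to. A minimal sketch of how each form matches, using only the standard library; the file names and the `match` helper are made up for illustration, and DVC's own matching may differ in details:

```python
import glob
import os
import tempfile


def match(root, pattern):
    """Return sorted paths (relative to root) that match a glob pattern."""
    hits = glob.glob(os.path.join(root, pattern), recursive=True)
    return sorted(os.path.relpath(hit, root) for hit in hits)


# Build a small throwaway tree (hypothetical file names).
root = tempfile.mkdtemp()
for rel in ["a1.csv", "a2.csv", "b1.csv", "raw/c1.csv", "raw/deep/d1.csv"]:
    path = os.path.join(root, rel)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        f.write("x\n")

star = match(root, "*.csv")          # any run of characters, one directory level
question = match(root, "a?.csv")     # exactly one character
seq = match(root, "[ab]1.csv")       # one character from the set
not_seq = match(root, "[!a]1.csv")   # one character NOT in the set
recursive = match(root, "**/*.csv")  # any depth, including the top level

for name, result in [("*", star), ("?", question), ("[seq]", seq),
                     ("[!seq]", not_seq), ("**", recursive)]:
    print(name, result)
```

Note that `**` only recurses here because `recursive=True` is passed; without it, Python treats `**` like `*`.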

## Example: Single file

@@ -360,45 +355,6 @@ $ tree .dvc/cache
Only the hash values of the `dir/` directory (with `.dir` file extension) and
`file2` have been cached.

## Example: Transfer to an external cache

When you want to add a large dataset that is outside of your
<abbr>project</abbr> (e.g. online), you would normally need to download or copy
it into the <abbr>workspace</abbr> first. But you may not have enough local
storage space.

You can however set up an [external cache] that can handle the data. To avoid
ever making a local copy, target the outside data with `dvc add` while
specifying an `--out` (`-o`) path inside of your project. This way the data will
be transferred to the <abbr>cache</abbr> directly, and then [linked] into your
workspace.

Let's add a `data.xml` file via HTTP, putting it in `./data.xml`:

```cli
$ dvc add https://data.dvc.org/get-started/data.xml -o data.xml
...
$ ls
data.xml data.xml.dvc
```

The resulting `.dvc` file will save the provided local `path` as if the data was
already in the workspace, while the `md5` hash points to the copy of the data
that has now been transferred to the <abbr>cache</abbr>. Let's check the
contents of `data.xml.dvc` in this case:

```yaml
outs:
- md5: a304afb96060aad90176268345e10355
nfiles: 1
path: data.xml
```

[linked]:
/doc/user-guide/data-management/large-dataset-optimization#file-link-types-for-the-dvc-cache
[external cache]:
/doc/user-guide/data-management/managing-external-data#setting-up-an-external-cache

## Example: Transfer to remote storage

Sometimes there's not enough space in the local environment to import a large
64 changes: 3 additions & 61 deletions content/docs/command-reference/destroy.md
@@ -15,18 +15,12 @@ usage: dvc destroy [-h] [-q | -v] [-f]
`dvc destroy` removes `dvc.yaml`, `.dvc` files, and the internal `.dvc/`
directory from the <abbr>project</abbr>.

Note that the <abbr>cache directory</abbr> will be removed as well, unless it's
set to an
[external location](/doc/user-guide/data-management/managing-external-data#setting-up-an-external-cache)
(by default a local cache is located in `.dvc/cache`). If you have set up
[symlinks](/doc/user-guide/data-management/large-dataset-optimization) (from
cache to workspace) in your project, DVC will replace them with the latest
Note that the <abbr>cache directory</abbr> will be removed as well. If you have
set up [symlinks](/doc/user-guide/data-management/large-dataset-optimization)
(from cache to workspace) in your project, DVC will replace them with the latest
versions of the actual files and directories first, so that your data is intact
after destruction.
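
The symlink replacement described above can be pictured with a small sketch. This is not DVC's implementation; the `materialize` helper and the file names are hypothetical, and it only handles single files on a platform that allows creating symlinks:

```python
import os
import shutil
import tempfile


def materialize(link_path):
    """Replace a symlink with a real copy of the file it points to."""
    if not os.path.islink(link_path):
        return False  # already a regular file, nothing to do
    target = os.path.realpath(link_path)
    os.unlink(link_path)              # removes the link, not the target
    shutil.copy2(target, link_path)   # put the actual data in the workspace
    return True


# Stand-ins for a cached file and the workspace link that points to it.
root = tempfile.mkdtemp()
cache_file = os.path.join(root, "cached_copy")
workspace_file = os.path.join(root, "foo")
with open(cache_file, "w") as f:
    f.write("foo\n")
os.symlink(cache_file, workspace_file)

changed = materialize(workspace_file)

# Deleting the "cache" afterwards no longer affects the workspace file.
os.remove(cache_file)
with open(workspace_file) as f:
    content = f.read()
print(changed, os.path.islink(workspace_file), repr(content))
```

Doing the copy before the cache is deleted is what keeps the data intact: a symlink left in place would dangle once its target is gone.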

[external cache]:
/doc/user-guide/data-management/managing-external-data#setting-up-an-external-cache

> Refer to [Project Structure](/doc/user-guide/project-structure) for more
> details on the directories and files deleted by this command.

@@ -60,55 +54,3 @@ $ ls -a

.git code.py foo
```

## Example: Preserve an external cache directory

By default, the <abbr>cache</abbr> location is `.dvc/cache`. Let's change its
location to `/mnt/cache` using `dvc cache dir`, add some data, and then try
`dvc destroy`:

```cli
$ dvc cache dir /mnt/cache
$ echo foo > foo
$ dvc add foo
```

Contents of the <abbr>workspace</abbr>:

```cli
$ ls -a
.dvc .git code.py foo foo.dvc
```

Contents of the (external) cache (`b1/946a...` contains `foo`):

```cli
$ tree /mnt/cache
/mnt/cache/
└── b1
└── 946ac92492d2347c6235b4d2611184
```

OK, let's destroy the <abbr>DVC project</abbr>:

```cli
$ dvc destroy

This will destroy all information about your pipelines, all data files...
Are you sure you want to continue? [y/n]
yes

$ ls -a
.git code.py foo
```

`foo.dvc` and the internal `.dvc/` directory were removed (this would include
any cached data prior to changing the cache location). But the cache files in
`/mnt/cache` persist:

```cli
$ tree /mnt/cache
/mnt/cache/
└── b1
└── 946ac92492d2347c6235b4d2611184
```
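
The `tree` listings above suggest how a cached file is laid out: a directory named after the first two characters of the hash, and a file named after the rest. A minimal sketch of that mapping, assuming the hash is the MD5 of the file contents (as the `md5` field in `.dvc` files suggests); `cache_relpath` is a hypothetical helper, not a DVC function, and the layout in other DVC versions may differ:

```python
import hashlib
import os


def cache_relpath(data: bytes) -> str:
    """Map file contents to a '<2 hex chars>/<remaining 30>' relative path."""
    digest = hashlib.md5(data).hexdigest()  # 32 hex characters
    return os.path.join(digest[:2], digest[2:])


rel = cache_relpath(b"foo\n")  # contents written by `echo foo > foo`
prefix, rest = os.path.split(rel)
print(rel)
```

Splitting on a two-character prefix keeps any single cache directory from accumulating every object in the project.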
21 changes: 13 additions & 8 deletions content/docs/command-reference/stage/add.md
@@ -8,7 +8,7 @@ Helper command to create or update <abbr>stages</abbr> in `dvc.yaml`.
usage: dvc stage add [-h] [-q | -v] -n <name> [-f]
[-d <path>] [-p [<filename>:]<params_list>]
[-o <filename>] [-O <filename>] [-c <filename>]
[--external] [--outs-persist <filename>]
[--outs-persist <filename>]
[--outs-persist-no-cache <filename>]
[-m <path>] [-M <path>]
[--plots <path>] [--plots-no-cache <path>]
@@ -20,6 +20,14 @@ positional arguments:
command Command to execute
```

<details>

### Options deprecated in 3.0

- `--external`

</details>

## Description

Writes stage definitions to `dvc.yaml` (in the current working directory). To
@@ -84,8 +92,8 @@ is reproduced (see also `dvc gc`). Relevant notes:
which generates a single `.dir` entry in the cache (refer to [Structure of
cache directory] for more info.)

- [external dependencies] and [external outputs] (outside of the
<abbr>workspace</abbr>) are also supported (except metrics and plots).
- [external dependencies and outputs] (outside of the <abbr>workspace</abbr>)
are also supported (except metrics and plots).

- Since <abbr>outputs</abbr> are deleted from the workspace before executing
stage commands, the underlying code should create any directory structures
@@ -102,8 +110,8 @@ is reproduced (see also `dvc gc`). Relevant notes:
/docs/user-guide/how-to/add-deps-or-outs-to-a-stage
[structure of cache directory]:
/doc/user-guide/project-structure/internal-files#structure-of-the-cache-directory
[external dependencies]: /doc/user-guide/external-dependencies
[external outputs]: /doc/user-guide/managing-external-data
[external dependencies and outputs]:
/doc/user-guide/pipelines/external-dependencies-and-outputs
[manual process]: /doc/command-reference/move#renaming-stage-outputs

### For displaying and comparing data science experiments
@@ -209,9 +217,6 @@ data science experiments.
`always_changed` field in `dvc.yaml`). As a result DVC will always execute it
when reproducing the pipeline.

- `--external` - allow writing outputs outside of the DVC repository. See
[Managing External Data](/doc/user-guide/data-management/managing-external-data).

- `--desc <text>` - user description of the stage (optional). This doesn't
affect any DVC operations.

2 changes: 0 additions & 2 deletions content/docs/command-reference/version.md
@@ -19,15 +19,13 @@ usage: dvc version [-h] [-q | -v]
| `Supports` | Types of [remote storage] supported by the current DVC setup (their required dependencies are installed) |
| `Cache types` | [Types of links] supported (between <abbr>workspace</abbr> and <abbr>cache</abbr>) |
| `Cache directory` | Filesystem type (e.g. ext4, FAT, etc.) and drive on which the <abbr>cache</abbr> directory is mounted |
| `Caches` | Cache [location types] configured in the repo (e.g. local, SSH, S3, etc.) |
| `Remotes` | Remote [location types][remote storage] configured in the repo (e.g. SSH, S3, Google Drive, etc.) |
| `Workspace directory` | Filesystem type (e.g. ext4, FAT, etc.) and drive on which the <abbr>workspace</abbr> is mounted |
| `Repo` | Shows whether we are in a DVC repo and/or Git repo |

[remote storage]: /doc/user-guide/data-management/remote-storage
[types of links]:
/doc/user-guide/data-management/large-dataset-optimization#file-link-types-for-the-dvc-cache
[location types]: /doc/user-guide/data-management/managing-external-data

> No info about `Cache` or `Workspace directory` is printed if `dvc version` is
> used outside a DVC project.
10 changes: 7 additions & 3 deletions content/docs/sidebar.json
@@ -166,15 +166,19 @@
},
"cloud-versioning",
"discovering-and-accessing-data",
"importing-external-data",
"managing-external-data",
{ "label": "External Data", "slug": "importing-external-data" },
"large-dataset-optimization"
]
},
{
"slug": "pipelines",
"source": "pipelines/index.md",
"children": ["defining-pipelines", "running-pipelines", "run-cache"]
"children": [
"defining-pipelines",
"running-pipelines",
"run-cache",
"external-dependencies-and-outputs"
]
},
{
"label": "Experiment Management",