Commit

Merge pull request #1370 from iterative/refactor/get-started-1.0
user guide: DVC Files and Directories - 1.0 updates
jorgeorpinel authored Jun 15, 2020
2 parents 5a58811 + cd7bca0 commit 4e82857
Showing 59 changed files with 479 additions and 437 deletions.
8 changes: 4 additions & 4 deletions content/docs/api-reference/get_url.md
@@ -30,10 +30,10 @@ specified by its `path` in a `repo` (<abbr>DVC project</abbr>), is stored.

The URL is formed by reading the project's
[remote configuration](/doc/command-reference/config#remote) and the
[`dvc.yaml`](/doc/user-guide/dvc-file-format) or
[`.dvc` file](/doc/user-guide/dvc-file-format) where the given `path` is found
(`outs` field). The URL schema returned depends on the
[type](/doc/command-reference/remote/add#supported-storage-types) of the
[`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-file) or
[`.dvc` file](/doc/user-guide/dvc-files-and-directories#dvc-files) where the
given `path` is found (`outs` field). The schema of the URL returned depends on
the [type](/doc/command-reference/remote/add#supported-storage-types) of the
`remote` used (see the [Parameters](#parameters) section).
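
As a rough illustration of how such a URL could be assembled from a remote's base URL and the hash recorded in the `outs` field, here is a minimal Python sketch. The function name `build_object_url` and the bucket URL are hypothetical, and the two-character prefix layout is an assumption modeled on the default cache structure, not a statement of how DVC implements this internally.

```python
def build_object_url(remote_url: str, md5: str, is_dir: bool = False) -> str:
    """Join a remote base URL with a content hash (hypothetical helper).

    Assumes the remote mirrors the cache layout: the first two hash
    characters form a directory, the rest form the file name.
    """
    suffix = ".dir" if is_dir else ""  # directory targets end in `.dir`
    return f"{remote_url.rstrip('/')}/{md5[:2]}/{md5[2:]}{suffix}"


# Example with a made-up S3 remote and a hash taken from the fetch examples
url = build_object_url(
    "s3://example-bucket/dvc-store", "3863d0e317dee0a55c4e59d2ec0eef33"
)
print(url)
```

The schema of the result (`s3://`, `https://`, and so on) simply follows the schema of the remote URL passed in.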

If the target is a directory, the returned URL will end in `.dir`. Refer to
45 changes: 25 additions & 20 deletions content/docs/command-reference/add.md
@@ -1,7 +1,7 @@
# add

Track data files or directories with DVC, by creating a corresponding
[`.dvc` file](/doc/user-guide/dvc-file-format).
[`.dvc` file](/doc/user-guide/dvc-files-and-directories#dvc-files).

## Synopsis

@@ -17,7 +17,8 @@ positional arguments:

The `dvc add` command is analogous to `git add`, in that it makes DVC aware of
the target data, in order to start versioning it. It creates a
[`.dvc` file](/doc/user-guide/dvc-file-format) to track the added data.
[`.dvc` file](/doc/user-guide/dvc-files-and-directories#dvc-files) to track the
added data.

This command can be used to
[version control](/doc/use-cases/versioning-data-and-model-files) large files,
@@ -31,8 +32,9 @@ The `targets` are the files or directories to add, which are turned into
> See also `dvc run` for more advanced ways to version intermediate and final
> results (like ML models).
Under the hood, a few actions are taken for each file (or directory) in
`targets`:
After checking that each `target` file (or directory) hasn't been added before
(or tracked with other DVC commands), a few actions are taken under the hood for
each one:

1. Calculate the file hash.
2. Move the file contents to the cache (by default in `.dvc/cache`), using the
@@ -41,25 +43,27 @@ Under the hood, a few actions are taken for each file (or directory) in
for more details.)
3. Attempt to replace the file with a link to the cached data (more details on
file linking further down).
4. Create a corresponding [`.dvc` file](/doc/user-guide/dvc-file-format) to
track the file, using its path and hash to identify the cached data. The
`.dvc` file lists the DVC-tracked file as an <abbr>output</abbr> (`outs`
field). Unless the `-f` option is used, the `.dvc` file name generated by
default is `<file>.dvc`, where `<file>` is the file name of the first target.
4. Create a corresponding
[`.dvc` file](/doc/user-guide/dvc-files-and-directories#dvc-files) to track
the file, using its path and hash to identify the cached data. The `.dvc`
file lists the DVC-tracked file as an <abbr>output</abbr> (`outs` field).
Unless the `-f` option is used, the `.dvc` file name generated by default is
`<file>.dvc`, where `<file>` is the file name of the first target.
5. Add the `targets` to `.gitignore` in order to prevent them from being
committed to the Git repository (unless `dvc init --no-scm` was used when
initializing the DVC project).
6. Instructions are printed showing `git` commands for adding the files, if
appropriate.
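
The first, second, and fourth steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `file_md5` and `add_to_cache` are hypothetical helpers (not DVC functions), the hash is assumed to be MD5, the file is copied rather than moved or linked, and the returned dictionary only loosely mirrors the `outs` field of a `.dvc` file.

```python
import hashlib
import shutil
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so large files never load fully into memory."""
    digest = hashlib.md5()
    with path.open("rb") as fobj:
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def add_to_cache(target: Path, cache_dir: Path) -> dict:
    """Copy `target` into a content-addressed cache and describe it.

    Assumed layout: <cache_dir>/<first 2 hash chars>/<remaining chars>.
    """
    md5 = file_md5(target)
    cached = cache_dir / md5[:2] / md5[2:]
    cached.parent.mkdir(parents=True, exist_ok=True)
    if not cached.exists():  # identical content is stored only once
        shutil.copy2(target, cached)
    # Rough stand-in for what a `.dvc` file records about the output
    return {"outs": [{"md5": md5, "path": target.name}]}
```

Because the cache path is derived from the content hash, adding two files with identical contents stores the data once.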

Summarizing, the result is that the target data is replaced by small `.dvc`
files that can easily be tracked with Git. See
[DVC-File Format](/doc/user-guide/dvc-file-format) for more details.
Summarizing, the result is that the target data is replaced by small
[`.dvc` files](/doc/user-guide/dvc-files-and-directories#dvc-files) that can be
easily tracked with Git.

> Note that `.dvc` files can be considered _orphan stages_, because they have no
> <abbr>dependencies</abbr>, only outputs. These are treated as _always changed_
> by `dvc status` and `dvc repro`, which always executes them. See
> [`dvc.yaml`](/doc/user-guide/dvc-file-format) to learn more about stages.
> [`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-file) to learn
> more about stages.
To avoid adding files inside a directory accidentally, you can add the
corresponding [patterns](/doc/user-guide/dvcignore) in a `.dvcignore` file.
@@ -74,8 +78,8 @@ large files. DVC also supports other link types for use on file systems without
### Tracking directories

A `dvc add` target can be an individual file or a directory. In the latter case,
a [`.dvc` file](/doc/user-guide/dvc-file-format) is created for the top of the
directory (with default name `<dir_name>.dvc`).
a [`.dvc` file](/doc/user-guide/dvc-files-and-directories#dvc-files) is created
for the top of the directory (with default name `<dir_name>.dvc`).

Every file in the hierarchy is added to the cache (unless the `--no-commit`
option is used), but DVC does not produce individual `.dvc` files for each file
@@ -135,7 +139,8 @@ To track the changes with git, run:
git add .gitignore data.xml.dvc
```

As indicated above, a [`.dvc` file](/doc/user-guide/dvc-file-format) has been
As indicated above, a
[`.dvc` file](/doc/user-guide/dvc-files-and-directories#dvc-files) has been
created for `data.xml`. Let's explore the result:

```dvc
@@ -187,10 +192,10 @@ Tracking a directory with DVC as simple as with a single file:
$ dvc add pics
```

There are no [`.dvc` files](/doc/user-guide/dvc-file-format) generated within
this directory structure to match each images, but the image files are all
<abbr>cached</abbr>. A single `pics.dvc` file is generated for the top-level
directory, and it contains:
There are no [`.dvc` files](/doc/user-guide/dvc-files-and-directories#dvc-files)
generated within this directory structure to match each image, but the image
files are all <abbr>cached</abbr>. A single `pics.dvc` file is generated for the
top-level directory, and it contains:

```yaml
outs:
4 changes: 2 additions & 2 deletions content/docs/command-reference/cache/index.md
@@ -17,8 +17,8 @@ positional arguments:

At DVC initialization, a new `.dvc/` directory is created for internal
configuration and <abbr>cache</abbr>
[files and directories](/doc/user-guide/dvc-files-and-directories), that are
hidden from the user.
[files and directories](/doc/user-guide/dvc-files-and-directories#internal-directories-and-files),
that are hidden from the user.

The cache is where your data files, models, etc. (anything you want to version
with DVC) are actually stored. The corresponding files you see in the
7 changes: 4 additions & 3 deletions content/docs/command-reference/checkout.md
@@ -17,9 +17,10 @@ positional arguments:

## Description

[DVC-files](/doc/user-guide/dvc-file-format) act as pointers to specific versions
of data files or directories tracked by DVC. This command synchronizes the
workspace data with the versions specified in the current DVC-files.
[DVC-files](/doc/user-guide/dvc-files-and-directories) act as pointers to
specific versions of data files or directories tracked by DVC. This command
synchronizes the workspace data with the versions specified in the current
DVC-files.

`dvc checkout` is useful, for example, when using Git in the
<abbr>project</abbr>, after `git clone`, `git checkout`, or any other operation
10 changes: 5 additions & 5 deletions content/docs/command-reference/commit.md
@@ -1,8 +1,8 @@
# commit

Record changes to DVC-tracked files in the <abbr>project</abbr>, by updating
[DVC-files](/doc/user-guide/dvc-file-format) and saving <abbr>outputs</abbr> to
the <abbr>cache</abbr>.
[DVC-files](/doc/user-guide/dvc-files-and-directories) and saving
<abbr>outputs</abbr> to the <abbr>cache</abbr>.

## Synopsis

@@ -67,8 +67,8 @@ cache. This is where the `dvc commit` command comes into play. It performs that
last step (saving the data in cache).

Note that it's best to avoid the last two scenarios. They essentially
force-update the [DVC-files](/doc/user-guide/dvc-file-format) and save data to
cache. They are still useful, but keep in mind that DVC can't guarantee
force-update the [DVC-files](/doc/user-guide/dvc-files-and-directories) and save
data to cache. They are still useful, but keep in mind that DVC can't guarantee
reproducibility in those cases.

## Options
@@ -227,7 +227,7 @@ the new instance of `model.pkl` is there.
It is also possible to execute the commands that are executed by `dvc repro` by
hand. You won't have DVC helping you, but you have the freedom to run any
command you like, even ones not defined in a
[DVC-file](/doc/user-guide/dvc-file-format). For example:
[DVC-file](/doc/user-guide/dvc-files-and-directories). For example:

```dvc
$ python src/featurization.py data/prepared data/features
5 changes: 3 additions & 2 deletions content/docs/command-reference/config.md
@@ -179,8 +179,9 @@ for more details.) This section contains the following options:

### state

See [DVC Files and Directories](/doc/user-guide/dvc-files-and-directories) to
learn more about the state file (database) that is used for optimization.
See
[Internal directories and files](/doc/user-guide/dvc-files-and-directories#internal-directories-and-files)
to learn more about the state file (database) that is used for optimization.

- `state.row_limit` - maximum number of entries in the state database, which
affects the physical size of the state file itself, as well as the performance
23 changes: 14 additions & 9 deletions content/docs/command-reference/destroy.md
@@ -12,14 +12,19 @@ usage: dvc destroy [-h] [-q | -v] [-f]

## Description

`dvc destroy` removes DVC-files, and the entire `.dvc/` meta directory from the
<abbr>workspace</abbr>. Note that the <abbr>cache directory</abbr> will normally
be removed as well, unless it's set to an external location with
`dvc cache dir`. (By default a local cache is located in the `.dvc/cache`
directory.) If you were using
`dvc destroy` removes `dvc.yaml`, `.dvc` files, and the internal `.dvc/`
directory from the <abbr>workspace</abbr>.

Note that the <abbr>cache directory</abbr> will be removed as well, unless it's
[set to an external location](/doc/use-cases/shared-development-server#configure-the-external-shared-cache)
(by default a local cache is located in `.dvc/cache`). If you were using
[symlinks for linking](/doc/user-guide/large-dataset-optimization) data from the
cache, DVC will replace them with copies, so that your data is intact after the
project's destruction.
cache, DVC will replace them with the latest versions of the actual files and
directories first, so that your data is intact after the project's destruction.
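
To make the symlink replacement concrete, the following Python sketch swaps a single symlinked file for a regular copy of the data it points to. The helper `replace_symlink_with_copy` is hypothetical and handles only files; it illustrates the idea and is not taken from the DVC code base.

```python
import os
import shutil


def replace_symlink_with_copy(link_path: str) -> None:
    """Swap a symlink to a file for a regular copy of that file."""
    if not os.path.islink(link_path):
        return  # regular files are left untouched
    source = os.path.realpath(link_path)  # resolve the cached file
    os.unlink(link_path)  # removes the link only, not the cached data
    shutil.copy2(source, link_path)  # the workspace now holds real data
```

After this runs, deleting the cache no longer affects the workspace file, which is why the data stays intact once the project is destroyed.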

> Refer to
> [DVC files and directories](/doc/user-guide/dvc-files-and-directories) for
> more details on the directories and files deleted by this command.
## Options

@@ -94,8 +99,8 @@ $ ls -a
.git code.py foo
```

`dvc destroy` command removed DVC-files, and the entire `.dvc/` meta directory
from the <abbr>workspace</abbr>. But the cache files that are present in the
`dvc destroy` command removed DVC-files, and the internal `.dvc/` directory from
the <abbr>workspace</abbr>. But the cache files that are present in the
`/mnt/cache` directory still persist:

```dvc
33 changes: 17 additions & 16 deletions content/docs/command-reference/fetch.md
@@ -23,8 +23,9 @@ of the project, but without placing them in the <abbr>workspace</abbr>. This
makes the data files available for linking (or copying) into the workspace.
(Refer to [dvc config cache.type](/doc/command-reference/config#cache).) Along
with `dvc checkout`, it's performed automatically by `dvc pull` when the target
[`dvc.yaml`](/doc/user-guide/dvc-file-format) or
[`.dvc`](/doc/user-guide/dvc-file-format) files are not already in the cache:
[`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-file) or
[`.dvc`](/doc/user-guide/dvc-files-and-directories#dvc-files) files are not
already in the cache:

```
Controlled files Commands
@@ -52,8 +53,7 @@ on DVC remotes.) These necessary data or model files are listed as
required to [reproduce](/doc/tutorials/get-started/data-pipelines#reproduce) the
corresponding [pipeline](/doc/command-reference/pipeline).

`dvc fetch` ensures that the files needed for a
[stage](/doc/command-reference/run) or `.dvc` file to be
`dvc fetch` ensures that the files needed for a stage or `.dvc` file to be
[reproduced](/doc/tutorials/get-started/data-pipelines#reproduce) exist in
cache. If no `targets` are specified, the set of data files to fetch is
determined by analyzing all `dvc.yaml` and `.dvc` files in the current branch,
@@ -196,11 +196,11 @@ Note that the `.dvc/cache` directory was created and populated.
> for more info.
Used without arguments (as above), `dvc fetch` downloads all assets needed by
all [`dvc.yaml`](/doc/user-guide/dvc-file-format) and
[`.dvc`](/doc/user-guide/dvc-file-format) files in the current branch, including
for directories. The hash values `3863d0e317dee0a55c4e59d2ec0eef33` and
`42c7025fc0edeb174069280d17add2d4` correspond to the `model.pkl` file and
`data/features/` directory, respectively.
all [`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-file) and
[`.dvc`](/doc/user-guide/dvc-files-and-directories#dvc-files) files in the
current branch, including for directories. The hash values
`3863d0e317dee0a55c4e59d2ec0eef33` and `42c7025fc0edeb174069280d17add2d4`
correspond to the `model.pkl` file and `data/features/` directory, respectively.

Let's now link files from the cache to the workspace with:

@@ -214,7 +214,8 @@ $ dvc checkout
> follow this example if you tried the previous one (**Default behavior**).
`dvc fetch` only downloads the data files of a specific stage when the
corresponding `.dvc` file (command target) is specified:
corresponding [`.dvc` file](/doc/user-guide/dvc-files-and-directories#dvc-files)
(command target) is specified:

```dvc
$ dvc fetch prepare.dvc
@@ -280,12 +281,12 @@ $ tree .dvc/cache
```

Fetching using `--with-deps` starts with the target
[`.dvc` file](/doc/user-guide/dvc-file-format) (`train.dvc` stage) and searches
backwards through its pipeline for data to download into the project's cache.
All the data for the second and third stages ("featurize" and "train") has now
been downloaded to the cache. We could now use `dvc checkout` to get the data
files needed to reproduce this pipeline up to the third stage into the workspace
(with `dvc repro train.dvc`).
[`.dvc` file](/doc/user-guide/dvc-files-and-directories#dvc-files) (`train.dvc`)
and searches backwards through its pipeline for data to download into the
project's cache. All the data for the second and third stages ("featurize" and
"train") has now been downloaded to the cache. We could now use `dvc checkout`
to get the data files needed to reproduce this pipeline up to the third stage
into the workspace (with `dvc repro train.dvc`).
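
The backwards search can be pictured as a walk over the stage dependency graph. The Python sketch below is a simplified model: `upstream_stages` and the `PIPELINE` mapping are hypothetical, and the dependency edges are assumed for illustration rather than read from the example project.

```python
def upstream_stages(target: str, deps: dict) -> list:
    """Return the target stage and every stage it depends on."""
    seen = []
    stack = [target]
    while stack:
        stage = stack.pop()
        if stage in seen:
            continue  # shared dependencies are visited only once
        seen.append(stage)
        stack.extend(deps.get(stage, []))  # step back toward earlier stages
    return seen


# Assumed pipeline shape: each stage maps to the stages it depends on
PIPELINE = {
    "prepare.dvc": [],
    "featurize.dvc": ["prepare.dvc"],
    "train.dvc": ["featurize.dvc"],
    "evaluate.dvc": ["train.dvc"],
}

print(upstream_stages("train.dvc", PIPELINE))
```

In this model the walk from `train.dvc` also reaches `prepare.dvc`; a stage whose data is already in the cache simply has nothing left to download. Stages downstream of the target, such as `evaluate.dvc`, are never visited.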

> Note that in this example project, the last stage file `evaluate.dvc` doesn't
> add any more data files than those from previous stages, so at this point all
10 changes: 6 additions & 4 deletions content/docs/command-reference/get.md
@@ -40,8 +40,9 @@ The `path` argument is used to specify the location of the target to be
downloaded within the source repository at `url`. `path` can specify any file or
directory in the source repo, including those tracked by DVC, or by Git. Note
that DVC-tracked targets should be found in a
[`dvc.yaml`](/doc/user-guide/dvc-file-format) or
[`.dvc`](/doc/user-guide/dvc-file-format) file of the project.
[`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-file) or
[`.dvc`](/doc/user-guide/dvc-files-and-directories#dvc-files) file of the
project.

⚠️ The project should have a default
[DVC remote](/doc/command-reference/remote), containing the actual data for this
@@ -183,8 +184,9 @@ get the most recent one, we use a similar command, but with
`-o model.bigrams.pkl` and `--rev bigrams-experiment` (or even without `--rev`
since that tag has the latest model version anyway). In fact, in this case using
`dvc pull` with the corresponding
[`.dvc` files](/doc/user-guide/dvc-file-format) should suffice, downloading the
file as just `model.pkl`. We can then rename it to make its variant explicit:
[`.dvc` files](/doc/user-guide/dvc-files-and-directories#dvc-files) should
suffice, downloading the file as just `model.pkl`. We can then rename it to make
its variant explicit:

```dvc
$ dvc pull train.dvc
19 changes: 10 additions & 9 deletions content/docs/command-reference/import-url.md
@@ -3,7 +3,7 @@
Download a file or directory from a supported URL (for example `s3://`,
`ssh://`, and other protocols) into the <abbr>workspace</abbr>, and track
changes in the remote data source. Creates a
[`.dvc` file](/doc/user-guide/dvc-file-format).
[`.dvc` file](/doc/user-guide/dvc-files-and-directories#dvc-files).

> See `dvc import` to download and track data/model files or directories from
> other <abbr>DVC repositories</abbr> (e.g. hosted on GitHub).
@@ -42,8 +42,8 @@ while `out` can be used to specify the directory and/or file name desired for
the downloaded data. If an existing directory is specified, the file or
directory will be placed inside.

[`.dvc` files](/doc/user-guide/dvc-file-format) support references to data in an
external location, see
[`.dvc` files](/doc/user-guide/dvc-files-and-directories#dvc-files) support
references to data in an external location, see
[External Dependencies](/doc/user-guide/external-dependencies). In such a `.dvc`
file, the `deps` field stores the remote URL, and the `outs` field contains the
corresponding local path in the <abbr>workspace</abbr>. It records enough
@@ -104,10 +104,11 @@ $ dvc run -d https://example.com/path/to/data.csv \
```

`dvc import-url` generates an import stage
[`.dvc` file](/doc/user-guide/dvc-file-format) and `dvc run` a regular stage (in
[`dvc.yaml`](/doc/user-guide/dvc-file-format)). Both have an external
dependency, but the one created by `dvc import-url` preserves the connection to
the data source. We call this an _import stage_.
[`.dvc` file](/doc/user-guide/dvc-files-and-directories#dvc-files) and `dvc run`
a regular stage (in
[`dvc.yaml`](/doc/user-guide/dvc-files-and-directories#dvcyaml-file)). Both have
an external dependency, but the one created by `dvc import-url` preserves the
connection to the data source. We call this an _import stage_.

Note that import stages are considered always
[frozen](/doc/command-reference/freeze), meaning that if you run `dvc repro`,
@@ -192,8 +193,8 @@ The `etag` field in the `.dvc` file contains the
If the remote file changes, its ETag will be different. This metadata allows DVC
to determine whether it's necessary to download it again.
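
The check itself amounts to comparing the recorded value with the one the server currently reports. A minimal Python sketch follows, assuming a hypothetical helper `needs_download` and assuming that surrounding double quotes should be ignored when comparing; it is not DVC's actual logic.

```python
def needs_download(recorded_etag: str, current_etag: str) -> bool:
    """Report whether the remote file differs from the recorded version."""
    # Assumption: servers may wrap ETags in quotes, so strip them first
    return recorded_etag.strip('"') != current_etag.strip('"')
```

When the two values match, the existing copy in the workspace can be kept and no transfer is needed.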

> See [DVC-File Format](/doc/user-guide/dvc-file-format) for more details on the
> text format above.
> See [`.dvc` files](/doc/user-guide/dvc-files-and-directories#dvc-files) for
> more details on the format above.

You may want to get out of and remove the `example-get-started/` directory after
trying this example (especially if trying out the following one).