term: standard way to refer to DVC-files (2) #433

Merged: 7 commits, Jun 12, 2019
43 changes: 25 additions & 18 deletions static/docs/commands-reference/add.md
Original file line number Diff line number Diff line change
@@ -21,29 +21,32 @@ file is committed to the DVC cache. Using the `--no-commit` option, the file
will not be added to the cache and instead the `dvc commit` command is used when
(or if) the file is to be committed to the DVC cache.

Under the hood a few actions are taken for each file in the target(s):
Under the hood, a few actions are taken for each file in the target(s):

1. Calculate the file checksum.
2. Move the file content to the DVC cache (default location is `.dvc/cache`).
3. Replace the file by a link to the file in the cache (see details below).
4. Create a corresponding DVC-file (`.dvc` extension) and store the checksum to
identify the cache entry.
4. Create a corresponding [DVC-file](/doc/user-guide/dvc-file-format) and store
the checksum to identify the cache entry.
5. Add the _target_ filename to `.gitignore` (if Git is used in this workspace)
to prevent it from being committed to the Git repository.
6. Instructions are printed showing `git` commands for adding the files to a Git
repository. If a different SCM system is being used, use the equivalent
command for that system or nothing is printed if `--no-scm` was specified for
the repository.
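
The first two steps above can be sketched in Python. This is only an illustrative sketch, not DVC's actual implementation: the helper names `file_md5` and `cache_path` are hypothetical, and the two-character subdirectory layout is an assumption inferred from the cache path shown in the single-file example (`.dvc/cache/d8/acabbfd4ee51c95da5d7628c7ef74b`).

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks.

    Hypothetical helper: DVC's real checksum logic may differ.
    """
    digest = hashlib.md5()
    with open(path, "rb") as fobj:
        # Read fixed-size chunks so large data files are never fully in memory.
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_path(checksum, cache_dir=".dvc/cache"):
    """Map a checksum to a cache entry path.

    Assumed layout: the first two hex characters name a subdirectory and
    the remaining characters name the file inside it.
    """
    return Path(cache_dir) / checksum[:2] / checksum[2:]
```

For instance, `cache_path("d8acabbfd4ee51c95da5d7628c7ef74b")` points at `.dvc/cache/d8/acabbfd4ee51c95da5d7628c7ef74b` under the default cache location.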

The result is data file is added to the DVC cache, and DVC metafiles (`.dvc`)
can be tracked via Git or other version control system. The DVC-file (metafile)
lists the added file as an `out` (output) of the DVC-file, and references the
DVC cache entry using the checksum. See
[DVC-File Format](/doc/user-guide/dvc-file-format) for the detailed description
of the DVC _metafile_ format.
Unless the `-f` option is used, by default the DVC-file name generated is
`<file>.dvc`, where `<file>` is the file name of the first output (from `targets`).
If neither `-f` nor outputs are specified, the stage name defaults to
`Dvcfile`.
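
The naming rule just described can be expressed as a small Python function. This is a hedged sketch of the rule as stated in the text, not DVC's own code; `default_dvcfile_name` and its parameters are hypothetical names, and the sketch only covers the file name, not the directory where the file is placed.

```python
import os


def default_dvcfile_name(outputs, file_option=None):
    """Pick a DVC-file name following the rule described in the docs.

    outputs:     list of output paths, in the order they were given.
    file_option: value passed with `-f`, or None if the option is absent.
    """
    if file_option:
        # An explicit `-f` value always wins (it may include a path).
        return file_option
    if outputs:
        # Otherwise use the file name of the first output plus `.dvc`.
        return os.path.basename(outputs[0]) + ".dvc"
    # No `-f` and no outputs: fall back to the default stage name.
    return "Dvcfile"
```

With this sketch, `default_dvcfile_name(["model.p"])` gives `model.p.dvc`, an empty output list gives `Dvcfile`, and passing `"stages/stage.dvc"` as `file_option` returns that value unchanged.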

As a result, the data file is added to the DVC cache, and DVC-files can be tracked
via Git or another version control system. The DVC-file lists the added file as an
output (`out`), and references the DVC cache entry using the checksum. See
[DVC-File Format](/doc/user-guide/dvc-file-format) for more details.

> Note that DVC-files created by this command are _orphans_: they have no
> dependencies. _Orphaned_ "stage files" are always considered _changed_ by
> dependencies. _Orphan_ "stage files" are always considered _changed_ by
> `dvc repro`, which always executes them.

By default DVC tries to use reflinks (see
@@ -58,8 +61,8 @@ to work with directory hierarchies with `dvc add`.

1. With `dvc add --recursive`, the hierarchy is traversed and every file is
added individually as described above. This means every file has its own
`.dvc` file, and a corresponding DVC cache entry is made (unless
`--no-commit` flag is added).
DVC-file, and a corresponding DVC cache entry is made (unless `--no-commit`
flag is added).
2. When not using `--recursive` a DVC-file is created for the top of the
directory (`dirname.dvc`), and every file in the hierarchy is added to the
DVC cache (unless `--no-commit` flag is added), but these files do not have
@@ -92,9 +95,12 @@ This way you bring data provenance and make your project reproducible.

- `-v`, `--verbose` - displays detailed tracing information.

- `-f`, `--file` - specify name of the DVC-file it generates. It should be
either `Dvcfile` or have a `.dvc` file extension (e.g. `data.dvc`) in order
for `dvc` to be able to find it later.
- `-f`, `--file` - specify name of the DVC-file it generates. By default the
DVC-file name generated is `<file>.dvc`, where `<file>` is file name of the
first output (from `targets`). The stage file is placed in the same directory
where `dvc run` is run by default, but `-f` can be used to change this
location, by including a path in the provided value (e.g.
`-f stages/stage.dvc`).

## Examples: Single file

@@ -131,16 +137,17 @@ outs:
md5: d8acabbfd4ee51c95da5d7628c7ef74b
metric: false
path: data.xml.jpg
meta: #key to contain arbitary user data
meta: # Special key to contain arbitrary user data
name: John
email: [email protected]
```

This is a standard DVC-file with only an `outs` entry. The checksum should
correspond to an entry in the cache.

If user overwrites the `.dvc` file, comments and meta values are not preserved
between multiple executions of `dvc add` command.
> Note that the `meta` values above were entered manually for this example. Meta
> values and `#` comments are not preserved when a DVC-file is overwritten with
> the `dvc add` command.

```dvc
$ file .dvc/cache/d8/acabbfd4ee51c95da5d7628c7ef74b
8 changes: 4 additions & 4 deletions static/docs/commands-reference/checkout.md
@@ -15,10 +15,10 @@ positional arguments:

## Description

DVC-files (`.dvc`) in the workspace specify which instance of each data file or
directory is to be used, using the checksum saved in the `outs` fields. The
`dvc checkout` command updates the workspace data to match with the cache files
corresponding to those checksums.
[DVC-files](/doc/user-guide/dvc-file-format) in the workspace specify which
instance of each data file or directory is to be used, using the checksum saved
in the `outs` fields. The `dvc checkout` command updates the workspace data to
match with the cache files corresponding to those checksums.

Using an SCM like Git, the DVC-files are kept under version control. At a given
branch or tag of the SCM workspace, the DVC-files will contain checksums for the
9 changes: 5 additions & 4 deletions static/docs/commands-reference/destroy.md
@@ -1,10 +1,11 @@
# destroy

Remove DVC-files from your repository.
Remove [DVC-files](/doc/user-guide/dvc-file-format) from your repository.

It removes `.dvc` and `Dvcfile` files, `.dvc/` directory. It means cache will be
removed as well by default, if it's not set to an external location (by default
local cache is located in the `.dvc/cache` directory).
It removes DVC-files, and the entire `.dvc/` meta directory from the workspace.
Note that the DVC cache will normally be removed as well, unless it's set to an
external location with `dvc cache dir`. (By default a local cache is located in
the `.dvc/cache` directory.)

```usage
usage: dvc destroy [-h] [-q] [-v] [-f]
4 changes: 2 additions & 2 deletions static/docs/commands-reference/fetch.md
@@ -54,7 +54,7 @@ dependencies and outputs.)
`dvc fetch` ensures that the files needed for a DVC-file to be
[reproduced](/doc/get-started/reproduce) exist in the local cache. If no
`targets` are specified, the set of data files to fetch is determined by
analyzing all `.dvc` files in the current branch, unless `--all-branches` or
analyzing all DVC-files in the current branch, unless `--all-branches` or
`--all-tags` is specified.

The default remote is used unless `--remote` is specified. See `dvc remote add`
@@ -216,7 +216,7 @@ Checking out '{'scheme': 'local', 'path': '.../example-get-started/data/...

## Examples: Specific stages

> Please delete the `.dvc/cache/` directory first (with `rm -Rf .dvc/cache`) to
> Please delete the `.dvc/cache` directory first (with `rm -Rf .dvc/cache`) to
> follow this example if you tried the previous one (**Default behavior**).

`dvc fetch` only downloads the data files of a specific stage when the
15 changes: 9 additions & 6 deletions static/docs/commands-reference/import.md
@@ -22,7 +22,8 @@ project might produce occasional data files that are used in other projects, for
example. ETL pipeline running regularly updates some data file. A shared dataset
on a remote storage that is managed and updated outside DVC.

DVC supports `.dvc` files which refer to an external data location, see
DVC supports [DVC-files](/doc/user-guide/dvc-file-format) which refer to an
external data location, see
[External Dependencies](/doc/user-guide/external-dependencies). In such a DVC
file, the `deps` section lists a remote URL specification, and the `outs`
section lists the corresponding local path name in the workspace. It records
@@ -89,9 +90,11 @@ to test its current status.
- `--resume` - resume previously started download. This is useful if the
connection to the remote resource is unstable.

- `-f`, `--file` - specify name of the DVC-file it generates. It should be
either `Dvcfile` or have a `.dvc` file extension (e.g. `data.dvc`) in order
for `dvc` to be able to find it later.
- `-f`, `--file` - specify name of the DVC-file it generates. By default the
DVC-file name generated is `<file>.dvc`, where `<file>` is file name of the
output (`out`). The stage file is placed in the same directory where `dvc run`
is run by default, but `-f` can be used to change this location, by including
a path in the provided value (e.g. `-f stages/stage.dvc`).

- `-h`, `--help` - prints the usage/help message, and exit.

@@ -199,8 +202,8 @@ The `etag` field in the DVC-file contains the ETag recorded from the HTTP
request. If the remote file changes, the ETag changes, letting DVC know when the
file has changed.

While executing `dvc import` command, if user overwrites the `.dvc` file,
comments and meta values are not preserved between multiple executions.
> See [DVC-File Format](/doc/user-guide/dvc-file-format) for more details on the
> text format above.

## Example: Detecting remote file changes

2 changes: 1 addition & 1 deletion static/docs/commands-reference/lock.md
@@ -18,7 +18,7 @@ positional arguments:

Lock is useful to avoid syncing data from the top of a pipeline and keep
iterating on the last stages only. In this sense `lock` causes any DVC-file to
behave as a `.dvc` file that would be created by `dvc add` ran on outputs.
behave as an _orphan_ stage file as if created with `dvc add`.

## Options

11 changes: 5 additions & 6 deletions static/docs/commands-reference/metrics_add.md
@@ -15,10 +15,9 @@ positional arguments:

## Description

Sets a special flag (see [DVC-File Format](/doc/user-guide/dvc-file-format)) in
the relevant `.dvc` file to identify a specified output as a metric file.
Alternatively, an output file could be made a metric via `-M` or `-m` parameter
of the `dvc run` command.
Sets a special flag in the relevant [DVC-file](/doc/user-guide/dvc-file-format)
to identify a specified output as a metric file. Alternatively, an output file
can be marked as a metric via `-M` or `-m` parameter of the `dvc run` command.

While any text file could be used as a metric file to track, it's recommended to
use `TSV`, `CSV`, or `JSON` formats. DVC provides a way (see
@@ -30,7 +29,7 @@ contains multiple metrics.
- `-t`, `--type` - specify a type of the metric file. Accepted values are:
`raw`, `json`, `tsv`, `htsv`, `csv`, `hcsv`. Type will be used to determine
how `dvc metrics show` handles displaying it. This type will be saved into the
corresponding `.dvc` file and will be used automatically in the
corresponding DVC-file and will be used automatically in the
`dvc metrics show`. `htsv`/`hcsv` are the same `tsv`/`csv` but the values in
the first row of the file will be used as the field names and should be used
to address columns in the `--xpath` option. `raw` means that no additional
@@ -40,7 +39,7 @@ contains multiple metrics.
- `-x`, `--xpath` - specify a path within a metric file to get a specific metric
value. Should be used if metric file contains multiple numbers and you need to
get only one of them. Only a single path is allowed. This path will be saved
into the corresponding `.dvc` file and will be used automatically in
into the corresponding DVC-file and will be used automatically in
`dvc metrics show`. Accepted value depends on the metric file type (`-t`
option):

18 changes: 8 additions & 10 deletions static/docs/commands-reference/metrics_modify.md
@@ -6,22 +6,20 @@ etc).
## Synopsis

```usage
usage: dvc metrics modify [-h] [-q] [-v]
[-t TYPE] [-x XPATH]
path
usage: dvc metrics modify [-h] [-q | -v] [-t TYPE] [-x XPATH] path

positional arguments:
path Path to a metric file.
```

## Description

This command finds a corresponding DVC-file for the metric file `path` provided
(i.e. a DVC-file that specifies one of its outputs is the file path in question
– see `dvc metrics add` or `dvc run` with `-m` and `-M` options) and updates the
meta-information that is used to manage and show the metric.
This command finds a corresponding [DVC-file](/doc/user-guide/dvc-file-format)
for the metric file `path` provided (the one that specifies the file path in
question among its outputs – see `dvc metrics add` or `dvc run` with `-m` and
`-M` options), and updates the information that represents the metric.

It the path provided is not part of the pipeline, the following error will be
If the path provided is not part of the pipeline, the following error will be
raised:

```text
@@ -34,7 +32,7 @@ Error: failed to modify metrics - unable
- `-t`, `--type` - specify a type of the metric file. Accepted values are:
`raw`, `json`, `tsv`, `htsv`, `csv`, `hcsv`. It will be used to determine how
`dvc metrics show` handles displaying it. This type will be saved into the
corresponding `.dvc` file and will be used automatically in the
corresponding DVC-file and will be used automatically in the
`dvc metrics show`. `htsv` and `hcsv` are `tsv` and `csv` but the values in
the first row of the file will be used as the field names and can be used to
address columns in the `--xpath` option. `raw` means that no additional
@@ -44,7 +42,7 @@ Error: failed to modify metrics - unable
- `-x`, `--xpath` - specify a path within a metric file to get a specific metric
value. Should be used if metric file contains multiple numbers and you need to
get only one of them. Only a single path is allowed. This path will be saved
into the corresponding `.dvc` file and will be used automatically in
into the corresponding DVC-file and will be used automatically in
`dvc metrics show`. Accepted value depends on the metric file type (`-t`
option):

3 changes: 2 additions & 1 deletion static/docs/commands-reference/pipeline_show.md
@@ -1,7 +1,8 @@
# show

Show [stages](/doc/commands-reference/run) in a pipeline that lead to the
specified stage. By default it lists DVC-files (`.dvc` file extension).
specified stage. By default it lists
[DVC-files](/doc/user-guide/dvc-file-format).

The `-c` and `-o` options list or visualize the pipeline's commands or data
file flow instead.
4 changes: 2 additions & 2 deletions static/docs/commands-reference/repro.md
@@ -20,8 +20,8 @@ positional arguments:

## Description

DVC-file (`target`) can have any name followed by the `.dvc` file extension. If
file name is omitted, `Dvcfile` will be used by default.
If the [DVC-file](/doc/user-guide/dvc-file-format) (`target`) is omitted,
`Dvcfile` will be assumed.

`dvc repro` provides an interface to rerun the commands in the computational
graph (a.k.a. pipeline) defined by the connected stages (DVC-files) in the
25 changes: 12 additions & 13 deletions static/docs/commands-reference/run.md
@@ -32,10 +32,10 @@ data and caches data artifacts along the way. Check this
[example](/doc/get-started/example-pipeline) to learn more and try to build a
pipeline.

By default, unless `-f` options is specified, the DVC-file name generated is
`<file>.dvc` where `<file>` is the file name of the first output (`-o`, `-O`, or
`-M` option). If neither `-f`, nor outputs (with `-o`, `-O`, `-M` options) are
specified, the stage name defaults to `Dvcfile`.
Unless the `-f` option is used, by default the DVC-file name generated is
`<file>.dvc`, where `<file>` is the file name of the first output (`-o`, `-O`, or
`-M` option). If neither `-f` nor outputs are specified, the stage name
defaults to `Dvcfile`.

Since `dvc run` provides a way to build a graph of computations, using
dependencies and outputs to connect different stages it checks computational
@@ -78,9 +78,9 @@ be no cycles, etc.

- `-m`, `--metrics` - another kind of output files. It is usually a small human
readable file (JSON, CSV, text, whatnot) with some numbers or other
meta-information that describes a model or other outputs. Check `dvc metrics`
to learn more about tracking metrics and comparing them across different model
or experiment versions.
information that describes a model or other outputs. Check `dvc metrics` to
learn more about tracking metrics and comparing them across different model or
experiment versions.

- `-M`, `--metrics-no-cache` - the same as `-m` except files are not put
automatically under DVC control. It means that they are not cached, and it's
@@ -91,10 +91,9 @@ be no cycles, etc.

- `-f`, `--file` - specify stage file name. By default the DVC-file name
generated is `<file>.dvc`, where `<file>` is file name of the first output
(`-o`, `-O`, or `-M` option). If neither `-f`, nor outputs (with `-o`, `-O`,
`-M`) are specified, the stage name defaults to `Dvcfile`. By default stage
file is placed in the same directory `dvc run` is executed. `-f` can be used
to change this place, by including path into provided value (e.g.
(`-o`, `-O`, or `-M` option). The stage file is placed in the same directory
where `dvc run` is run by default, but `-f` can be used to change this
location, by including a path in the provided value (e.g.
`-f stages/stage.dvc`).

- `-c`, `--cwd` - deprecated, use `-f` and `-w` to change location and working
@@ -166,8 +165,8 @@ be no cycles, etc.
git add .gitignore metric.dvc
```

While executing `dvc run`, if the user overwrites the `.dvc` file, comments
and meta values are not preserved between multiple executions.
> See [DVC-File Format](/doc/user-guide/dvc-file-format) for more details on the
> text format above.

- Execute a Python script as a DVC pipeline stage. The stage file name is not
specified, so a `model.p.dvc` DVC-file is created:
8 changes: 4 additions & 4 deletions static/docs/get-started/add-files.md
@@ -28,9 +28,9 @@ any **file** or a **directory**:
$ dvc add data/data.xml
```

DVC stores information about your data file in a special `.dvc` file, that has a
human-readable [description](/doc/user-guide/dvc-file-format) and can be
committed to Git to track versions of your file:
DVC stores information about your data file in a special DVC-file, that has a
human-readable [format](/doc/user-guide/dvc-file-format) and can be committed to
Git to track versions of your file:

```dvc
$ git add data/.gitignore data/data.xml.dvc
@@ -55,7 +55,7 @@ $ ls -R .dvc/cache
```

where `a304afb96060aad90176268345e10355` is an MD5 hash of the `data.xml` file.
And if you check the `data/data.xml.dvc` meta-file you will see that it has this
And if you check the `data/data.xml.dvc` DVC-file you will see that it has this
hash inside.
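
One way to look for that hash programmatically is sketched below. It is a rough illustration under stated assumptions: `md5_values` is a hypothetical helper, the `SAMPLE` text is a made-up fragment that only mimics the `md5` and `path` fields shown in these docs, and a real DVC-file is YAML that would normally be read with a proper YAML parser rather than a regular expression.

```python
import re

# Matches lines such as "md5: <32 hex chars>" or "- md5: <32 hex chars>".
MD5_FIELD = re.compile(
    r"^[ \t]*(?:-[ \t]*)?md5:[ \t]*([0-9a-f]{32})[ \t]*$", re.MULTILINE
)

# Hypothetical fragment for illustration only, not a complete DVC-file.
SAMPLE = """\
outs:
- md5: a304afb96060aad90176268345e10355
  path: data.xml
"""


def md5_values(dvcfile_text):
    """Return every md5 value found in the given DVC-file text."""
    return MD5_FIELD.findall(dvcfile_text)
```

Calling `md5_values(SAMPLE)` returns a list containing `a304afb96060aad90176268345e10355`, which can then be compared with the entry under `.dvc/cache`.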

</details>
11 changes: 6 additions & 5 deletions static/docs/get-started/example-pipeline.md
@@ -89,8 +89,9 @@ in general and a user does not interact with these files directly. Check
[DVC Files and Directories](/doc/user-guide/dvc-files-and-directories) to learn
more.

When we run `dvc add Posts.xml.zip` the following happens. DVC creates an
_orphaned_ version of the [DVC-file](/doc/user-guide/dvc-file-format):
When we run `dvc add` `Posts.xml.zip`, DVC creates a
[DVC-file](/doc/user-guide/dvc-file-format) with no dependencies, a.k.a. an
"_orphan_ stage file":

```yaml
md5: 4dbe7a4e5a0d41b652f3d6286c4ae788
@@ -111,7 +112,7 @@ It's enough to run `dvc checkout` or `dvc pull` to restore data files.

</details>

- Commit the data file meta-information to Git repository:
- Commit the changes to Git repository:

```dvc
$ git add data/Posts.xml.zip.dvc data/.gitignore
@@ -229,8 +230,8 @@ $ dvc run -d code/evaluate.py -d data/model.pkl -d data/matrix-test.pkl \

### Expand to learn more about DVC internals

By analyzing dependencies and outputs DVC-files describe we can restore the full
chain (DAG) of commands we need to apply. This is important when you run
By analyzing dependencies and outputs in DVC-files, we can restore the full
chain of commands (DAG) we need to apply. This is important when you run
`dvc repro` to reproduce the final or intermediate result.

`dvc pipeline show` helps to visualize the pipeline (run it with `-c` option to