Merge pull request #1174 from iterative/2020-04-21
Regular updates (Apr 21)
jorgeorpinel authored Apr 27, 2020
2 parents aaf04f0 + 26db8de commit d10763c
Showing 22 changed files with 201 additions and 232 deletions.
4 changes: 2 additions & 2 deletions content/docs/api-reference/get_url.md
@@ -30,8 +30,8 @@ specified by its `path` in a `repo` (<abbr>DVC project</abbr>), is stored.

The URL is formed by reading the project's
[remote configuration](/doc/command-reference/config#remote) and the
[DVC-file](/doc/user-guide/dvc-file-format) where the given `path` is an
<abbr>output</abbr>. The URL schema returned depends on the
[DVC-file](/doc/user-guide/dvc-file-format) where the given `path` is found
(`outs` field). The URL schema returned depends on the
[type](/doc/command-reference/remote/add#supported-storage-types) of the
`remote` used (see the [Parameters](#parameters) section).

4 changes: 2 additions & 2 deletions content/docs/changelog/0.35.md
@@ -57,8 +57,8 @@ improvements) we have done in the last few months:

- 🙂 A lot of **UI improvements**. Starting from the finally fixed nasty issue
with Windows command prompt printing a lot of garbage symbols, to using
progress bars for checkouts, better metrics output, and lots of smaller
things: ![|528x200](/img/0.35-metrics.gif)
progress bars for checkouts, better CLI output for `dvc metrics`, and lots of
smaller things: ![|528x200](/img/0.35-metrics.gif)

- ⚡️ **Performance optimizations.** The most notable one is the migration from
using a plain JSON file to an (embedded) SQLite instance, to cache file and
79 changes: 39 additions & 40 deletions content/docs/command-reference/add.md
@@ -16,45 +16,45 @@ positional arguments:
## Description

The `dvc add` command is analogous to `git add`, in that it makes DVC aware of
the target data, as a first step to version it. Data added with DVC is also
committed to the <abbr>cache</abbr> (use the `--no-commit` option to avoid this,
and `dvc commit` to finish the process when needed).
the target data, as a first step to version it. It creates a
[DVC-file](/doc/user-guide/dvc-file-format) to track the added data.

The `targets` are files or directories to be tracked with DVC. These are turned
into <abbr>outputs</abbr> (`outs` field) in a resulting
[DVC-file](/doc/user-guide/dvc-file-format). (See steps below for more details.)
Note that target data outside the current <abbr>workspace</abbr> is supported,
and becomes [external outputs](/doc/user-guide/managing-external-data).
The `targets` are files or directories to add with this command, that are turned
into <abbr>data artifacts</abbr> of the <abbr>project</abbr>. By default, these
are committed to the <abbr>cache</abbr> (use the `--no-commit` option to avoid
this, and `dvc commit` to finish the process when needed).

Note that [external data](/doc/user-guide/managing-external-data) (targets
outside the <abbr>workspace</abbr>) is supported.

Under the hood, a few actions are taken for each file (or directory) in
`targets`:

1. Calculate the file hashes.
1. Calculate the file hash.
2. Move the file contents to the cache directory (by default in `.dvc/cache`),
using the file hash to form the cached file names. (See
[Structure of cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory)
for more details.)
3. Attempt to replace the file by a link to the file in cache (more details
below).
4. Create a corresponding DVC-file and store the file hash to identify the
cached file. Unless the `-f` option is used, the DVC-file name generated by
default is `<file>.dvc`, where `<file>` is the file name of the first target.
3. Attempt to replace the file with a link to the cached data (more details
further down).
4. Create a corresponding DVC-file to store the file (as an
<abbr>output</abbr>), using its path and hash to identify the cached data.
Unless the `-f` option is used, the DVC-file name generated by default is
`<file>.dvc`, where `<file>` is the file name of the first target.
5. Unless `dvc init --no-scm` was used when initializing the project, add the
`targets` to `.gitignore` in order to prevent them from being committed to
the Git repository.
6. Unless `dvc init --no-scm` was used when initializing the project,
instructions are printed showing `git` commands for adding the files to a Git
repository.
6. Instructions are printed showing `git` commands for adding the files, if
appropriate.
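
The hashing and caching steps above can be sketched in Python as follows. This is only an illustration under stated assumptions, not DVC's actual implementation: `file_md5` and `cache_path` are hypothetical helper names, and the two-character directory split is inferred from the cache layout visible in remote URLs elsewhere in these docs (for example `.../66/2eb7f64216d9c2c1088d0a5e2c6951`).

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fobj:
        # Read fixed-size chunks so large files never have to fit in memory.
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_path(cache_dir, md5):
    """Build a cache file path from a hash (assumed layout: <2 chars>/<rest>)."""
    return Path(cache_dir) / md5[:2] / md5[2:]
```

For instance, `cache_path(".dvc/cache", file_md5("data.xml"))` would give the location where a file with that content could be stored under this assumed layout.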

The result is that the target data gets cached by DVC, and instead small
DVC-files can be tracked with Git. The DVC-file lists the added file as an
output (`outs` field), and references the cached file using its hash. See
[DVC-File Format](/doc/user-guide/dvc-file-format) for more details.
In summary, the target data is replaced by small DVC-files that can be tracked
with Git. See [DVC-File Format](/doc/user-guide/dvc-file-format) for more
details.

> Note that DVC-files created by this command are considered _orphans_ because
> they have no dependencies, only outputs. These _orphan_ "stage files" are
> always treated as _changed_ by `dvc repro`, which always executes them. See
> `dvc run` to learn about regular stage files.
> Note that DVC-files created by this command are considered _orphan stage
> files_ because they have no _dependencies_, only outputs. These are always
> treated as _changed_ by `dvc repro`, which always executes them. See `dvc run`
> to learn more about stage files.
By default DVC tries to use reflinks (see
[File link types](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache)
@@ -78,13 +78,12 @@ to work with directory hierarchies with `dvc add`:
(with `.dir` extension), that in turn points to the files added from the
hierarchy.

In a <abbr>DVC project</abbr>, `dvc add` can be used to version control any
<abbr>data artifact</abbr> (input, intermediate, or output files and
directories, and model files). It is useful by itself to go back and forth
between different versions of datasets or models. We recommend using `dvc run`
and `dvc repro` mechanism to version control intermediate and final results
(like models) though. This way you bring data provenance and make your project
reproducible.
`dvc add` is typically used to version control raw data or initial datasets from
which data processing [pipelines](/doc/command-reference/pipeline) are built,
but it can be used to track any large file or directory. We recommend using
`dvc run` to version control intermediate and final results (like ML models).
This way you bring data provenance and make your project
[reproducible](/doc/command-reference/repro).

## Options

@@ -126,8 +125,8 @@ To track the changes with git run:
git add .gitignore data.xml.dvc
```

As the output says, a [DVC-file](/doc/user-guide/dvc-file-format) has been
created for `data.xml`. Let's explore the result:
As shown above, a [DVC-file](/doc/user-guide/dvc-file-format) has been created
for `data.xml`. Let's explore the result:

```dvc
$ tree
@@ -197,9 +196,9 @@ Saving information to 'pics.dvc'.

There are no [DVC-files](/doc/user-guide/dvc-file-format) generated within this
directory structure, but the images are all added to the <abbr>cache</abbr>. DVC
prints a message about this, mentioning that MD5 hash values are computed for
each directory. A single `pics.dvc` DVC-file is generated for the top-level
directory, and it contains:
prints a message mentioning that MD5 hash values are computed for each file. A
single `pics.dvc` DVC-file is generated for the top-level directory, and it
contains:

```yaml
md5: df06d8d51e6483ed5a74d3979f8fe42e
@@ -215,9 +214,9 @@ wdir: .
> [Structure of cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory)
> for more info.

This allows us to treat the entire directory structure as one unit (a dependency
or an <abbr>output</abbr>) with DVC commands. For example, it lets you pass the
whole directory tree as a dependency to a `dvc run` stage definition:
This allows us to treat the entire directory structure as a single <abbr>data
artifact</abbr>. This lets you pass the whole directory tree as a
<abbr>dependency</abbr> to a `dvc run` stage definition:

```dvc
$ dvc run -f train.dvc \
9 changes: 5 additions & 4 deletions content/docs/command-reference/fetch.md
@@ -44,10 +44,11 @@ project's cache ++ | dvc pull |
Fetching could be useful when first checking out a <abbr>DVC project</abbr>,
since files tracked by DVC should already exist in remote storage, but won't be
in the project's <abbr>cache</abbr>. (Refer to `dvc remote` for more information
on DVC remotes.) These necessary data or model files are listed as dependencies
or outputs in a DVC-file (target [stage](/doc/command-reference/run)) so they
are required to [reproduce](/doc/tutorials/get-started/reproduce) the
corresponding [pipeline](/doc/command-reference/pipeline). (See
on DVC remotes.) These necessary data or model files are listed as
<abbr>dependencies</abbr> or <abbr>outputs</abbr> in a DVC-file (target
[stage](/doc/command-reference/run)) so they are required to
[reproduce](/doc/tutorials/get-started/reproduce) the corresponding
[pipeline](/doc/command-reference/pipeline). (See
[DVC-File Format](/doc/user-guide/dvc-file-format) for more information on
dependencies and outputs.)

14 changes: 6 additions & 8 deletions content/docs/command-reference/get-url.md
@@ -13,23 +13,23 @@ usage: dvc get-url [-h] [-q | -v] url [out]
positional arguments:
url (See supported URLs in the description.)
out Destination path to put data to.
out Destination path to put files in.
```

## Description

In some cases it's convenient to get a <abbr>data artifact</abbr> from a remote
location into the local file system. The `dvc get-url` command helps the user do
just that.
In some cases it's convenient to get a file or directory from a remote location
into the local file system. The `dvc get-url` command helps the user do just
that.

> Note that unlike `dvc import-url`, this command does not track the downloaded
> data files (does not create a DVC-file). For that reason, this command doesn't
> require an existing <abbr>DVC project</abbr> to run in.
The `url` argument should provide the location of the data to be downloaded,
while `out` can be used to specify the directory and/or file name desired for
the downloaded data. If an existing directory is specified, then the output will
be placed inside of it.
the downloaded data. If an existing directory is specified, then the file or
directory will be placed inside.
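
The way `out` is interpreted can be sketched in Python as below. This is a hedged illustration, not DVC's code: `resolve_out` is a hypothetical helper, and the behavior when `out` is omitted (keep the original file name in the current directory) is an assumption borrowed from the `--out` description of `dvc get`.

```python
import os
from urllib.parse import urlparse


def resolve_out(url, out=None):
    """Return the local path where data downloaded from `url` would be placed."""
    # Name of the file or directory at the end of the URL path.
    name = os.path.basename(urlparse(url).path.rstrip("/"))
    if out is None:
        # Assumption: default to the original name in the current directory.
        return name
    if os.path.isdir(out):
        # An existing directory was given: place the download inside it.
        return os.path.join(out, name)
    # Otherwise `out` is taken as the desired path and/or file name.
    return out
```

With an existing `data/` directory, `resolve_out("https://example.com/path/to/data.csv", "data")` would return `data/data.csv` on POSIX systems, while a non-existing `out` such as `raw.csv` is returned unchanged.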

DVC supports several types of (local or) remote locations (protocols):

@@ -48,8 +48,6 @@ DVC supports several types of (local or) remote locations (protocols):
> include them all. The command should look like this: `pip install "dvc[s3]"`.
> (This example installs `boto3` library along with DVC to support S3 storage.)
<!-- Separate MD quote: -->

\* HDFS and HTTP **do not** support downloading entire directories, only single
files.

34 changes: 17 additions & 17 deletions content/docs/command-reference/get.md
@@ -36,14 +36,15 @@ the data source. Both HTTP and SSH protocols are supported for online repos
to an "offline" repo (if it's a DVC repo without a default remote, instead of
downloading, DVC will try to copy the target data from its <abbr>cache</abbr>).

The `path` argument of this command is used to specify the location of the
target to be downloaded within the source repository at `url`. `path` can
specify any file or directory in the source repo, including <abbr>outputs</abbr>
tracked by DVC, as well as files tracked by Git. Note that for DVC repos, the
target should be found in one of the
[DVC-files](/doc/user-guide/dvc-file-format) of the project. The project should
also have a default [DVC remote](/doc/command-reference/remote), containing the
actual data.
The `path` argument is used to specify the location of the target to be
downloaded within the source repository at `url`. `path` can specify any file or
directory in the source repo, including those tracked by DVC, or by Git. Note
that DVC-tracked targets should be found in a
[DVC-file](/doc/user-guide/dvc-file-format) of the project.

⚠️ The project should have a default
[DVC remote](/doc/command-reference/remote), containing the actual data for this
command to work.

> See `dvc get-url` to download data from other supported locations such as S3,
> SSH, HTTP, etc.
@@ -57,8 +58,8 @@ name.
- `-o <path>`, `--out <path>` - specify a path (directory and/or file name) to
the desired location to place the downloaded file in. The default value (when
this option isn't used) is the current working directory (`.`) and original
file name. If an existing directory is specified, then the output will be
placed inside of it.
file name. If an existing directory is specified, then the target data will be
placed inside.

- `--rev <commit>` - commit hash, branch or tag name, etc. (any
[Git revision](https://git-scm.com/docs/revisions)) of the repository to
@@ -76,7 +77,7 @@ name.

- `-v`, `--verbose` - displays detailed tracing information.

## Example: Get a DVC-tracked model file
## Example: Get a DVC-tracked model

> Note that `dvc get` can be used from anywhere in the file system, as long as
> DVC is [installed](/doc/install).
@@ -95,7 +96,7 @@ Note that the `model.pkl` file doesn't actually exist in the
[root directory](https://github.com/iterative/example-get-started/tree/master/)
of the external Git repo. Instead, the corresponding DVC-file
[train.dvc](https://github.com/iterative/example-get-started/blob/master/train.dvc)
is found, that specifies `model.pkl` in its outputs (`outs`). DVC then
is found, that contains `model.pkl` (in the `outs` field). DVC then
[pulls](/doc/command-reference/pull) the file from the default
[remote](/doc/command-reference/remote) of the external DVC project (found in
its
@@ -109,8 +110,7 @@ its
> [CI/CD](https://en.wikipedia.org/wiki/CI/CD) tools.
The same example applies to raw or intermediate <abbr>data artifacts</abbr> as
well, of course, for cases where we want to download those files or directories
and perform some analysis on them.
well, of course.

## Examples: Get a misc. Git-tracked file

@@ -145,9 +145,9 @@ https://remote.dvc.org/get-started/66/2eb7f64216d9c2c1088d0a5e2c6951
`dvc get` provides the `--rev` option to specify which
[commit](https://git-scm.com/docs/revisions) of the repository to download a
<abbr>data artifact</abbr> from. It also has the `--out` option to specify the
location to place the artifact within the workspace. Combining these two options
allows us to do something we can't achieve with the regular `git checkout` +
`dvc checkout` process – see for example the
location to place the target data within the workspace. Combining these two
options allows us to do something we can't achieve with the regular
`git checkout` + `dvc checkout` process – see for example the
[Get Older Data Version](/doc/tutorials/get-started/older-versions) chapter of
our _Get Started_.

37 changes: 20 additions & 17 deletions content/docs/command-reference/import-url.md
@@ -14,7 +14,7 @@ usage: dvc import-url [-h] [-q | -v] [-f <filename>] url [out]
positional arguments:
url (See supported URLs in the description.)
out Destination path to put files to.
out Destination path to put files in.
```

## Description
@@ -32,18 +32,22 @@ external data source changes. Example scenarios:
> (just download the file or directory).
The `dvc import-url` command helps the user create such an external data
dependency. The `url` argument specifies the external location of the data to be
imported, while `out` can be used to specify the directory and/or file name
desired for the downloaded data. If an existing directory is specified, the
<abbr>output</abbr> will be created inside of it.
dependency without having to manually copy files from the supported remote
locations (listed below), which may require installing a different tool for each
type.

DVC supports [DVC-files](/doc/user-guide/dvc-file-format) that refer to data in
external locations, see
The `url` argument specifies the external location of the data to be imported,
while `out` can be used to specify the directory and/or file name desired for
the downloaded data. If an existing directory is specified, the file or
directory will be placed inside.

[DVC-files](/doc/user-guide/dvc-file-format) support references to data in an
external location, see
[External Dependencies](/doc/user-guide/external-dependencies). In such a
DVC-file, the `deps` field stores the remote URL, and the `outs` field contains
the corresponding local path in the workspace. It records metadata from the
external file or directory, allowing DVC to efficiently check it later and
determine whether the local copy is out of date.
the corresponding local path in the <abbr>workspace</abbr>. It records enough
metadata about the imported data to let DVC efficiently determine whether
the local copy is out of date.
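
The out-of-date check described above can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the dictionary only loosely mirrors the `deps`/`outs` layout, the `etag` key and its values are made up, and `is_outdated` is a hypothetical helper rather than a DVC function.

```python
# Hypothetical in-memory view of an import stage (values are made up).
import_stage = {
    "deps": [{"path": "https://example.com/path/to/data.csv", "etag": "abc123"}],
    "outs": [{"path": "data.csv"}],
}


def is_outdated(stage, current_metadata):
    """Return True if any recorded dependency metadata differs from the source's.

    `current_metadata` maps each dependency URL to metadata freshly read from
    the external location (assumed to be fetched elsewhere).
    """
    for dep in stage["deps"]:
        # Everything except the URL itself counts as recorded metadata.
        recorded = {key: value for key, value in dep.items() if key != "path"}
        if current_metadata.get(dep["path"]) != recorded:
            return True
    return False
```

Comparing a small piece of recorded metadata avoids downloading the data again just to find out whether it changed.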

DVC supports several types of (local or) remote locations (protocols):

@@ -97,10 +101,9 @@ $ dvc run -d https://example.com/path/to/data.csv \
wget https://example.com/path/to/data.csv -O data.csv
```

Both methods generate an equivalent [stage file](/doc/command-reference/run)
(DVC-file) with an external dependency. The `dvc import-url` command saves the
user from having to manually copy files from each of the remote storage schemes,
and from having to install CLI tools for each service.
Both methods generate a [DVC-file](/doc/user-guide/dvc-file-format) with an
external dependency, but the one created by `dvc import-url` preserves the
connection to the data source. We call this an _import stage_.

Note that import stages are considered always locked, meaning that if you run
`dvc repro`, they won't be updated. Use `dvc update` on them to bring the import
@@ -110,8 +113,8 @@ up to date from the external data source.

- `-f <filename>`, `--file <filename>` - specify a path and/or file name for the
DVC-file created by this command (e.g. `-f stages/stage.dvc`). This overrides
the default file name: `<file>.dvc`, where `<file>` is the file name of the
output (`out`).
the default file name: `<file>.dvc`, where `<file>` is the desired file name
of the imported data (`out`).

- `-h`, `--help` - prints the usage/help message, and exits.

@@ -192,7 +195,7 @@ trying this example (especially if trying out the following one).
## Example: Detecting remote file changes

What if that remote file is updated regularly? The project goals might include
regenerating a <abbr>data artifact</abbr> based on the updated data source.
regenerating some results based on the updated data source.
[Pipeline](/doc/command-reference/pipeline) reproduction can be triggered based
on a changed external dependency.
