Update dvc install and dvc import documentation #260

Merged Apr 25, 2019 (6 commits)
4 changes: 4 additions & 0 deletions static/docs/commands-reference/checkout.md
@@ -51,6 +51,10 @@

The output of `dvc checkout` does not list which data files were restored. It
does report removed files, and files that DVC was unable to restore because
they are missing from the cache.

This command will fail to check out files that are missing from the cache. In
such a case, `dvc checkout` prints a warning message. Any files that can be
checked out without error will be restored.
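
As a quick sketch of the two recovery methods described below (the stage file
name `data.xml.dvc` and the remote name `myremote` are assumptions, not part of
the original text):

```dvc
$ dvc repro data.xml.dvc    # re-run the stage that produces the missing file
$ dvc pull -r myremote      # or fetch the missing files from a remote cache
```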

Review comment (Member): suggest "any files that are found in the cache" rather
than "without error"; it's not usually an error, it's a warning, as mentioned
above.

There are two methods to restore a file missing from the cache, depending on the
situation. In some cases the pipeline must be rerun using the `dvc repro`
command. In other cases the cache can be pulled from a remote cache using the
2 changes: 1 addition & 1 deletion static/docs/commands-reference/commit.md
@@ -121,7 +121,7 @@

Now, we can install requirements for the project:
Then download the precomputed data using:

```dvc
$ dvc pull
$ dvc pull --all-branches --all-tags
```

This data will be retrieved from a preconfigured remote cache.
Expand Down
293 changes: 272 additions & 21 deletions static/docs/commands-reference/import.md
@@ -1,15 +1,6 @@
# import

Import file from URL to local directory and track changes in remote file.

Supported schemes:

* `local` - Local path
* `s3` - URL to a file on Amazon S3
* `gs` - URL to a file on Google Storage
* `ssh` - URL to a file on another machine with SSH access
* `hdfs` - URL to a file on HDFS
* `http` - URL to a file with a _strong ETag_ served with HTTP or HTTPS
Import a file from any supported URL or a local directory into the local
workspace, and track changes in the remote file.

Review comment (Member): this line is probably over the 80-character limit.

## Synopsis

@@ -21,28 +12,288 @@ Supported schemes:

## Description

In some cases it is convenient to add a data file to a workspace such that it
will be automatically updated when the data source is updated. For example, one
project might produce occasional data files that are used in other projects, or
a government agency might publish occasionally updated data that a project uses.

DVC supports `.dvc` files which refer to an external data file; see
[External Dependencies](/doc/user-guide/external-dependencies). In such a DVC
file, the `deps` section lists a remote file specification, and the `outs`
section lists the corresponding local file name in the workspace. The DVC file
records enough information about the remote file for DVC to efficiently check
whether the local copy is out of date. DVC uses this information to download
the file to the workspace, and to re-download it when the remote file changes.

The `dvc import` command helps the user create such an external data dependency.

DVC supports several types of remote files:

Type | Description | URL format
-----|-------------|------------
`local` | Local path | `/path/to/local/file`
`s3` | Amazon S3 | `s3://mybucket/data.csv`
`gs` | Google Storage | `gs://mybucket/data.csv`
`ssh` | SSH server | `ssh://[email protected]:/path/to/data.csv`
`hdfs` | HDFS | `hdfs://[email protected]/path/to/data.csv`
`http` | HTTP to file with _strong ETag_ | `https://example.com/path/to/data.csv`
`remote` | Remote path | `remote://myremote/path/to/file`

Review comment (Member): there is actually one more, `remote://`. See issue #108.
It would be great to add it here and propagate the explanation to the External
Dependencies section, and to `dvc run` if necessary.

Reply (Contributor Author): the remote URL needs to be documented in the
`dvc remote` command documentation. It should then be enough to reference that
documentation from here.

Reply (Member): I would still explain it briefly here; just give an example of
the transformation.

Another way to understand the `dvc import` command is as a shortcut for more
verbose `dvc run` commands. This is discussed in the
[External Dependencies](/doc/user-guide/external-dependencies) documentation,
where an alternative is demonstrated for each of these schemes.

Instead of `dvc import`:

```dvc
$ dvc import https://example.com/path/to/data.csv data.csv
```

It is possible to instead use `dvc run`:

```dvc
$ dvc run -d https://example.com/path/to/data.csv \
          -o data.csv \
          wget https://example.com/path/to/data.csv -O data.csv
```

Both methods generate a DVC file with an external dependency, and they produce
roughly equivalent results. The `dvc import` command saves the user from having
to know the download command for each remote storage scheme, and from having to
install the CLI tools for each service.

When DVC inspects a DVC file, one step is checking the dependencies to see
whether any have changed. A changed dependency will appear in the `dvc status`
report, indicating that the corresponding part of the pipeline needs to be
re-run. To check an external dependency, DVC uses a method appropriate to the
remote storage type, such as an ETag for HTTP or a checksum for local files.
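
For example (a minimal sketch; `data.csv.dvc` is an assumed DVC file created by
an earlier `dvc import`), the check can be limited to a single import stage:

```dvc
$ dvc status data.csv.dvc
```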

## Options

* `--resume` - resume previously started download. This is useful if the
connection to the remote resource is unstable.

* `-f`, `--file` - specify name of the DVC file it generates. It should be
either `Dvcfile` or have a `.dvc` suffix (e.g. `data.dvc`) in order for `dvc`
to be able to find it later.

* `-h`, `--help` - prints the usage/help message and exits.

* `-q`, `--quiet` - does not write anything to standard output. Exits with 0 if
  no problems arise, otherwise 1.

* `-v`, `--verbose` - displays detailed tracing information.

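For instance (a minimal sketch; the URL, output path, and DVC file name below
are placeholders), the options can be combined in one command:

```dvc
$ dvc import --resume -f data.csv.dvc \
             https://example.com/path/to/data.csv data.csv
```
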
## Example: Tracking a remote file

The [DVC getting started tutorial](/doc/get-started) demonstrates a simple DVC
pipeline. In the [Add Files step](/doc/get-started/add-files) we are told to
download a file, then use `dvc add` to integrate it with the workspace.

An advanced alternative way to initialize the _Getting Started_ workspace, using
`dvc import`, is:

```dvc
$ mkdir get-started
$ cd get-started
$ git init
$ dvc init
$ mkdir data
$ dvc import https://dvc.org/s3/get-started/data.xml data/data.xml
Importing 'https://dvc.org/s3/get-started/data.xml' -> '/Volumes/Extra/dvc/get-started/data/data.xml'
[##############################] 100% data.xml
Adding 'data/data.xml' to 'data/.gitignore'.
Saving 'data/data.xml' to cache '.dvc/cache'.
Saving information to 'data.xml.dvc'.

To track the changes with git run:

git add data/.gitignore data.xml.dvc
```

If you wish, it's possible to set up the other stages from the _Getting Started_
example. Since we do not need those stages for this example, we'll skip that.
Instead, let's look at the resulting DVC file `data.xml.dvc`:

```yaml
deps:
- etag: '"f432e270cd634c51296ecd2bc2f5e752-5"'
  path: https://dvc.org/s3/get-started/data.xml
md5: 61e80c38c1ce04ed2e11e331258e6d0d
outs:
- cache: true
  md5: a304afb96060aad90176268345e10355
  metric: false
  path: data/data.xml
  persist: false
wdir: .
```

The `etag` field in the DVC file contains the ETag recorded from the HTTP
request. If the remote file changes, the ETag changes, letting DVC know when
the file has changed.
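
As an aside (a sketch only; the header value will change whenever the remote
file does), you can inspect the ETag yourself with a plain HTTP HEAD request:

```dvc
$ curl -sI https://dvc.org/s3/get-started/data.xml | grep -i etag
ETag: "f432e270cd634c51296ecd2bc2f5e752-5"
```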

## Example: Detecting remote file changes

What if that remote file is one which will be updated regularly? The project
goal might include regenerating some artifact based on the updated data. With a
DVC external dependency, the pipeline can be triggered to re-execute when the
external dependency changes.

Let us again use the [Getting Started](/doc/get-started) example, in a way which
will mimic an updated external data source.

To make it easy to experiment with this, let us use a local directory as our
remote data store. In real life the data file will probably be on a remote
server, of course. Run these commands:

```dvc
$ mkdir /path/to/data-store
$ cd /path/to/data-store
$ wget https://dvc.org/s3/get-started/data.xml
```

In a production system you might have a process to update data files you need.
That's not what we have here, so in this case we'll set up a data store where we
can edit the data file.

On your laptop initialize the workspace again:

```dvc
$ mkdir get-started
$ cd get-started
$ git init
$ dvc init
$ mkdir data
$ dvc import /path/to/data-store/data.xml data/data.xml
Importing '/path/to/data-store/data.xml' -> '/Volumes/Extra/dvc/get-started/data/data.xml'
[##############################] 100% data.xml
Adding 'data/data.xml' to 'data/.gitignore'.
Saving 'data/data.xml' to cache '.dvc/cache'.
Saving information to 'data.xml.dvc'.

To track the changes with git run:

git add data/.gitignore data.xml.dvc
```

At this point we have the workspace set up in a similar fashion. The difference
is that the DVC file now references the editable data file in the data store
directory we just set up. We did this to make it easy to edit the data file.

```yaml
deps:
- md5: a86ca87250ed8e54a9e2e8d6d34c252e
  path: /path/to/data-store/data.xml
md5: 361728a3b037c9a4bcb897cdf856edfc
outs:
- cache: true
  md5: a304afb96060aad90176268345e10355
  metric: false
  path: data/data.xml
  persist: false
wdir: .
```

The DVC file is nearly the same as before. The `path` now points into the data
store, and instead of an `etag` we have an `md5` checksum.
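
As a sanity check (a sketch, assuming a Linux-style `md5sum`; use `md5` on
macOS), the dependency checksum is simply the MD5 of the external file:

```dvc
$ md5sum /path/to/data-store/data.xml
a86ca87250ed8e54a9e2e8d6d34c252e  /path/to/data-store/data.xml
```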

Let's also set up one of the processing stages from the Getting Started example.

```dvc
$ wget https://dvc.org/s3/get-started/code.zip
$ unzip code.zip
$ rm -f code.zip
$ pip install -U -r requirements.txt
$ git add .
$ git commit -m 'add code'
$ dvc run -f prepare.dvc \
          -d src/prepare.py -d data/data.xml \
          -o data/prepared \
          python src/prepare.py data/data.xml
```

Having this stage means that when we later run `dvc repro`, the pipeline will
be executed.
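
If your version of DVC provides the `dvc pipeline show` command (an assumption
here, not something this guide relies on), you can list the stages that
`dvc repro` will consider:

```dvc
$ dvc pipeline show prepare.dvc   # lists the stages leading up to prepare.dvc
```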

The workspace looks right, and the pipeline is up to date:

```dvc
$ tree
.
├── data
│   ├── data.xml
│   └── prepared
│   ├── test.tsv
│   └── train.tsv
├── data.xml.dvc
├── prepare.dvc
├── requirements.txt
└── src
├── evaluate.py
├── featurization.py
├── prepare.py
└── train.py

3 directories, 10 files

$ dvc status
Pipeline is up to date. Nothing to reproduce.
```

Then, in the data store directory, edit `data.xml`. It doesn't matter what you
change, as long as the file remains valid XML, because any change will change
the checksum.
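
For example (a sketch; appending a trailing XML comment keeps the file well
formed):

```dvc
$ echo '<!-- data refreshed -->' >> /path/to/data-store/data.xml
```

Once we've made a change, `dvc status` reports it: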

```dvc
$ dvc status
data.xml.dvc:
changed deps:
modified: /path/to/data-store/data.xml
```

DVC has noticed that the external dependency has changed, and is telling us
that we need to run `dvc repro`.

```dvc
$ dvc repro prepare.dvc

WARNING: Dependency '/path/to/data-store/data.xml' of 'data.xml.dvc' changed because it is 'modified'.
WARNING: Stage 'data.xml.dvc' changed.
Reproducing 'data.xml.dvc'
Importing '/path/to/data-store/data.xml' -> '/Volumes/Extra/dvc/get-started/data/data.xml'
[##############################] 100% data.xml
Saving 'data/data.xml' to cache '.dvc/cache'.
Saving information to 'data.xml.dvc'.

WARNING: Dependency 'data/data.xml' of 'prepare.dvc' changed because it is 'modified'.
WARNING: Stage 'prepare.dvc' changed.
Reproducing 'prepare.dvc'
Running command:
python src/prepare.py data/data.xml
Saving 'data/prepared' to cache '.dvc/cache'.
Linking directory 'data/prepared'.
Saving information to 'prepare.dvc'.

To track the changes with git run:

git add data.xml.dvc prepare.dvc

$ git add .
$ git commit -a -m 'updated data'

[master a8d4ce8] updated data
2 files changed, 6 insertions(+), 6 deletions(-)

$ dvc status
Pipeline is up to date. Nothing to reproduce.
```

Because the external source of the data file changed, `dvc status` noticed the
change. Running `dvc repro` then executed both stages of the pipeline, and if
we had set up the other stages they would also have been run. It first
downloaded the updated data file; then, noticing that `data/data.xml` had
changed, it re-executed the `prepare.dvc` stage.