Update dvc install and dvc import documentation #260
Changes from all commits
508fe08
588132a
e834ee2
4f275a1
e515ccb
b45a86b
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,15 +1,6 @@ | ||
# import | ||
|
||
Import file from URL to local directory and track changes in remote file. | ||
|
||
Supported schemes: | ||
|
||
* `local` - Local path | ||
* `s3` - URL to a file on Amazon S3 | ||
* `gs` - URL to a file on Google Storage | ||
* `ssh` - URL to a file on another machine with SSH access | ||
* `hdfs` - URL to a file on HDFS | ||
* `http` - URL to a file with a _strong ETag_ served with HTTP or HTTPS | ||
Import a file from any supported URL or local directory into the local workspace and track changes in the remote file. | ||
Review comment: Probably not 80 symbols here |
||
|
||
## Synopsis | ||
|
||
|
@@ -21,28 +12,288 @@ Supported schemes: | |
out Output | ||
``` | ||
|
||
## Description | ||
|
||
In some cases it is convenient to add a data file to a workspace such that it | ||
will be automatically updated when the data source is updated. One project might | ||
produce occasional data files that are used in other projects, for example. Or | ||
a government agency might produce occasionally updated data of use in a project. | ||
|
||
DVC supports `.dvc` files which refer to an external data file, see | ||
[External Dependencies](/doc/user-guide/external-dependencies). In such a DVC | ||
file, the `deps` section lists a remote file specification, and the `outs` | ||
section lists the corresponding local file name in the workspace. The DVC file | ||
records enough data about the remote file for DVC to check efficiently whether | ||
the local file is out of date. DVC uses this data to download the file to the | ||
workspace, and to re-download it when the remote file changes. | ||
|
||
The `dvc import` command helps the user create such an external data dependency. | ||
|
||
DVC supports several types of remote files: | ||
|
||
Type | Description | URL format | ||
-----|------------|------------ | ||
`local` | Local path | `/path/to/local/file` | ||
`s3` | Amazon S3 | `s3://mybucket/data.csv` | ||
`gs` | Google Storage | `gs://mybucket/data.csv` | ||
`ssh` | SSH server | `ssh://user@example.com:/path/to/data.csv` | ||
`hdfs` | HDFS | `hdfs://user@example.com/path/to/data.csv` | ||
`http` | HTTP to file with _strong ETag_ | `https://example.com/path/to/data.csv` | ||
Review comment: there is one more actually -
Review comment: The remote URL needs to be documented in the
Review comment: I would still explain it briefly here - just give an example of the transformation.
`remote` | Remote path | `remote://myremote/path/to/file` | ||
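To illustrate how the URL formats in the table map to the types, here is a minimal Python sketch that classifies a source string by its scheme. The function name `remote_type` and the lookup table are hypothetical and are not taken from the DVC code base; how DVC itself dispatches on URLs may differ.

```python
from urllib.parse import urlsplit

# Hypothetical lookup table built from the types listed above.
# Both http and https URLs are treated as the `http` type.
SCHEME_TO_TYPE = {
    "s3": "s3",
    "gs": "gs",
    "ssh": "ssh",
    "hdfs": "hdfs",
    "http": "http",
    "https": "http",
    "remote": "remote",
}


def remote_type(url: str) -> str:
    """Return the type name for a source string (illustrative only)."""
    scheme = urlsplit(url).scheme
    if not scheme:
        # No scheme at all: assume a plain local path.
        return "local"
    if scheme not in SCHEME_TO_TYPE:
        raise ValueError(f"unsupported scheme: {scheme!r}")
    return SCHEME_TO_TYPE[scheme]


for source in (
    "/path/to/local/file",
    "s3://mybucket/data.csv",
    "https://example.com/path/to/data.csv",
    "remote://myremote/path/to/file",
):
    print(source, "->", remote_type(source))
```

This sketch assumes POSIX-style local paths; a Windows path with a drive letter would need separate handling.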
|
||
Another way to understand the `dvc import` command is as a short-cut for more | ||
verbose `dvc run` commands. This is discussed in the | ||
[External Dependencies](/doc/user-guide/external-dependencies) documentation, | ||
where an alternative is demonstrated for each of these schemes. | ||
|
||
Instead of `dvc import`: | ||
|
||
```dvc | ||
$ dvc import https://example.com/path/to/data.csv data.csv | ||
``` | ||
|
||
It is possible to instead use `dvc run`: | ||
|
||
```dvc | ||
$ dvc run -d https://example.com/path/to/data.csv \ | ||
-o data.csv \ | ||
wget https://example.com/path/to/data.csv -O data.csv | ||
``` | ||
|
||
Both methods generate a DVC file with an external dependency, and they produce | ||
roughly equivalent results. The `dvc import` command saves the user from having | ||
to write a copy command for each remote storage scheme, and from having to | ||
install the CLI tool for each service. | ||
|
||
When DVC inspects a DVC file, one step is inspecting the dependencies to see if | ||
any have changed. A changed dependency will appear in the `dvc status` report, | ||
indicating the need to re-run the corresponding part of the pipeline. When DVC | ||
inspects an external dependency, it uses a method appropriate to that dependency | ||
to test its current status. | ||
|
||
## Options | ||
|
||
* `--resume` - resume previously started download. This is useful if the | ||
connection to the remote resource is unstable. | ||
|
||
* `-f`, `--file` - specify name of the DVC file it generates. It should be | ||
either `Dvcfile` or have a `.dvc` suffix (e.g. `data.dvc`) in order for `dvc` | ||
to be able to find it later. | ||
|
||
* `-h`, `--help` - prints the usage/help message, and exits. | ||
|
||
* `-q`, `--quiet` - does not write anything to standard output. Exit with 0 if | ||
no problems arise, otherwise 1. | ||
|
||
* `-v`, `--verbose` - displays detailed tracing information. | ||
|
||
* `--resume` - resume previously started download. | ||
## Example: Tracking a remote file | ||
|
||
The [DVC getting started tutorial](/doc/get-started) demonstrates a simple DVC | ||
pipeline. In the [Add Files step](/doc/get-started/add-files) we are told to | ||
download a file, then use `dvc add` to integrate it with the workspace. | ||
|
||
A more advanced, alternative way to initialize the _Getting Started_ workspace, | ||
using `dvc import`, is: | ||
|
||
```dvc | ||
$ mkdir get-started | ||
$ cd get-started | ||
$ git init | ||
$ dvc init | ||
$ mkdir data | ||
$ dvc import https://dvc.org/s3/get-started/data.xml data/data.xml | ||
Importing 'https://dvc.org/s3/get-started/data.xml' -> '/Volumes/Extra/dvc/get-started/data/data.xml' | ||
[##############################] 100% data.xml | ||
Adding 'data/data.xml' to 'data/.gitignore'. | ||
Saving 'data/data.xml' to cache '.dvc/cache'. | ||
Saving information to 'data.xml.dvc'. | ||
|
||
To track the changes with git run: | ||
|
||
git add data/.gitignore data.xml.dvc | ||
``` | ||
|
||
If you wish, it's possible to set up the other stages from the _Getting Started_ | ||
example. Since we do not need those stages for this example, we'll skip that. | ||
Instead we can look at the resulting DVC file `data.xml.dvc`: | ||
|
||
```yaml | ||
deps: | ||
- etag: '"f432e270cd634c51296ecd2bc2f5e752-5"' | ||
path: https://dvc.org/s3/get-started/data.xml | ||
md5: 61e80c38c1ce04ed2e11e331258e6d0d | ||
outs: | ||
- cache: true | ||
md5: a304afb96060aad90176268345e10355 | ||
metric: false | ||
path: data/data.xml | ||
persist: false | ||
wdir: . | ||
``` | ||
|
||
The `etag` field in the DVC file contains the ETag recorded from the HTTP | ||
request. If the remote file changes, its ETag changes, letting DVC know that | ||
the file has changed. | ||
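The ETag comparison described here can be sketched in a few lines of Python. This is only an illustration of the idea, not DVC's implementation: the helper names `fetch_etag` and `etag_changed` are hypothetical, and treating a missing ETag as a change is an assumption of this sketch.

```python
import urllib.request


def fetch_etag(url: str):
    """Send a HEAD request and return the ETag header, or None if absent."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request) as response:
        return response.headers.get("ETag")


def etag_changed(recorded: str, current) -> bool:
    """Report a change when the current ETag is missing or differs."""
    return current is None or current != recorded


# The recorded value is the `etag` field from the DVC file shown above.
recorded = '"f432e270cd634c51296ecd2bc2f5e752-5"'

# Same ETag as recorded: the remote file is considered unchanged.
print(etag_changed(recorded, recorded))

# A made-up ETag standing in for an updated remote file.
print(etag_changed(recorded, '"0123456789abcdef-5"'))
```

In real use, the second argument would come from `fetch_etag(url)`, which needs network access and is therefore not called in the sketch.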
|
||
* `-f`, `--file` - specify name of the DVC file it generates. It should be | ||
either `Dvcfile` or have a `.dvc` suffix (e.g. `data.dvc`) in order | ||
for `dvc` to be able to find it later. | ||
## Example: Detecting remote file changes | ||
|
||
## Examples | ||
What if that remote file is one which will be updated regularly? The project | ||
goal might include regenerating some artifact based on the updated data. With a | ||
DVC external dependency, the pipeline can be triggered to re-execute based on a | ||
changed external dependency. | ||
|
||
Let us again use the [Getting Started](/doc/get-started) example, in a way which | ||
will mimic an updated external data source. | ||
|
||
To make it easy to experiment with this, let us use a local directory as our | ||
remote data store. In real life the data file will probably be on a remote | ||
server, of course. Run these commands: | ||
|
||
```dvc | ||
$ dvc import /path/to/data.csv local_data.csv | ||
$ dvc import s3://mybucket/data.csv s3_data.csv | ||
$ dvc import gs://mybucket/data.csv gs_data.csv | ||
$ dvc import ssh://user@example.com:/path/to/data.csv ssh_data.csv | ||
$ dvc import hdfs://user@example.com/path/to/data.csv hdfs_data.csv | ||
$ dvc import https://example.com/path/to/data.csv http_data.csv | ||
$ mkdir /path/to/data-store | ||
$ cd /path/to/data-store | ||
$ wget https://dvc.org/s3/get-started/data.xml | ||
``` | ||
|
||
In a production system you might have a process to update data files you need. | ||
That's not what we have here, so in this case we'll set up a data store where we | ||
can edit the data file. | ||
|
||
On your laptop initialize the workspace again: | ||
|
||
```dvc | ||
$ mkdir get-started | ||
$ cd get-started | ||
$ git init | ||
$ dvc init | ||
$ mkdir data | ||
$ dvc import /path/to/data-store/data.xml data/data.xml | ||
Importing '/path/to/data-store/data.xml' -> '/Volumes/Extra/dvc/get-started/data/data.xml' | ||
[##############################] 100% data.xml | ||
Adding 'data/data.xml' to 'data/.gitignore'. | ||
Saving 'data/data.xml' to cache '.dvc/cache'. | ||
Saving information to 'data.xml.dvc'. | ||
|
||
To track the changes with git run: | ||
|
||
git add data/.gitignore data.xml.dvc | ||
``` | ||
|
||
At this point we have the workspace set up in a similar fashion. The difference | ||
is that the DVC file now references the editable data file in the data store | ||
directory we just set up. We did this to make it easy to edit the data file. | ||
|
||
```yaml | ||
deps: | ||
- md5: a86ca87250ed8e54a9e2e8d6d34c252e | ||
path: /path/to/data-store/data.xml | ||
md5: 361728a3b037c9a4bcb897cdf856edfc | ||
outs: | ||
- cache: true | ||
md5: a304afb96060aad90176268345e10355 | ||
metric: false | ||
path: data/data.xml | ||
persist: false | ||
wdir: . | ||
``` | ||
|
||
The DVC file is nearly the same as before. The `path` now holds the path to the | ||
file in the data store, and instead of an `etag` we have an `md5` checksum. | ||
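A checksum like this can be recomputed from the file at any time and compared with the recorded value. The following Python sketch shows the general idea with a throwaway file in a temporary directory. The helper names are hypothetical, and DVC's own checksum computation may differ in detail, so the digests produced here are not expected to match the values in the DVC file above.

```python
import hashlib
import os
import tempfile


def file_md5(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def dependency_changed(path: str, recorded_md5: str) -> bool:
    """Compare the file's current checksum with the recorded one."""
    return file_md5(path) != recorded_md5


with tempfile.TemporaryDirectory() as store:
    # Stand-in for the data store file; the content is made up.
    data_file = os.path.join(store, "data.xml")
    with open(data_file, "wb") as handle:
        handle.write(b"<rows><row id='1'/></rows>\n")

    recorded = file_md5(data_file)  # what would be stored at import time
    before_edit = dependency_changed(data_file, recorded)

    # Any edit to the file alters the checksum.
    with open(data_file, "ab") as handle:
        handle.write(b"<!-- edited -->\n")
    after_edit = dependency_changed(data_file, recorded)

print(before_edit, after_edit)
```

Reading in chunks keeps memory use constant, which matters for large data files.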
|
||
Let's also set up one of the processing stages from the Getting Started example. | ||
|
||
```dvc | ||
$ wget https://dvc.org/s3/get-started/code.zip | ||
$ unzip code.zip | ||
$ rm -f code.zip | ||
$ pip install -U -r requirements.txt | ||
$ git add . | ||
$ git commit -m 'add code' | ||
$ dvc run -f prepare.dvc \ | ||
-d src/prepare.py -d data/data.xml \ | ||
-o data/prepared \ | ||
python src/prepare.py data/data.xml | ||
``` | ||
|
||
Having this stage means that later when we run `dvc repro` a pipeline will be | ||
executed. | ||
|
||
The workspace now looks like this, and `dvc status` reports that it is up to date: | ||
|
||
```dvc | ||
$ tree | ||
. | ||
├── data | ||
│ ├── data.xml | ||
│ └── prepared | ||
│ ├── test.tsv | ||
│ └── train.tsv | ||
├── data.xml.dvc | ||
├── prepare.dvc | ||
├── requirements.txt | ||
└── src | ||
├── evaluate.py | ||
├── featurization.py | ||
├── prepare.py | ||
└── train.py | ||
|
||
3 directories, 10 files | ||
|
||
$ dvc status | ||
Pipeline is up to date. Nothing to reproduce. | ||
``` | ||
|
||
Then in the data store directory, edit `data.xml`. It doesn't matter what you | ||
change, as long as the file remains valid XML, because any change alters the | ||
checksum. Once we do so, we'll see this: | ||
|
||
```dvc | ||
$ dvc status | ||
data.xml.dvc: | ||
changed deps: | ||
modified: /path/to/data-store/data.xml | ||
``` | ||
|
||
DVC has noticed that the external dependency has changed, which tells us that | ||
we now need to run `dvc repro`. | ||
|
||
```dvc | ||
$ dvc repro prepare.dvc | ||
|
||
WARNING: Dependency '/path/to/data-store/data.xml' of 'data.xml.dvc' changed because it is 'modified'. | ||
WARNING: Stage 'data.xml.dvc' changed. | ||
Reproducing 'data.xml.dvc' | ||
Importing '/path/to/data-store/data.xml' -> '/Volumes/Extra/dvc/get-started/data/data.xml' | ||
[##############################] 100% data.xml | ||
Saving 'data/data.xml' to cache '.dvc/cache'. | ||
Saving information to 'data.xml.dvc'. | ||
|
||
WARNING: Dependency 'data/data.xml' of 'prepare.dvc' changed because it is 'modified'. | ||
WARNING: Stage 'prepare.dvc' changed. | ||
Reproducing 'prepare.dvc' | ||
Running command: | ||
python src/prepare.py data/data.xml | ||
Saving 'data/prepared' to cache '.dvc/cache'. | ||
Linking directory 'data/prepared'. | ||
Saving information to 'prepare.dvc'. | ||
|
||
To track the changes with git run: | ||
|
||
git add data.xml.dvc prepare.dvc | ||
|
||
$ git add . | ||
$ git commit -a -m 'updated data' | ||
|
||
[master a8d4ce8] updated data | ||
2 files changed, 6 insertions(+), 6 deletions(-) | ||
|
||
$ dvc status | ||
Pipeline is up to date. Nothing to reproduce. | ||
``` | ||
|
||
Because the external source for the data file changed, the change was noticed | ||
by the `dvc status` command. Running `dvc repro` then ran both stages of | ||
the pipeline, and if we had set up the other stages they also would have been | ||
run. It first downloaded the updated data file. Then, noticing that | ||
`data/data.xml` had changed, it executed the `prepare.dvc` stage. |
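The way a changed dependency cascades through the two stages can be modelled with a small Python sketch. This is a simplified illustration, not how `dvc repro` is implemented: the `Stage` class and `stages_to_rerun` function are hypothetical, the stages are assumed to be listed in dependency order, and every output of a re-run stage is assumed to change.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    deps: list
    outs: list


def stages_to_rerun(stages, changed_paths):
    """Return the names of stages affected by the changed paths."""
    changed = set(changed_paths)
    rerun = []
    for stage in stages:  # assumed to be listed in dependency order
        if changed.intersection(stage.deps):
            rerun.append(stage.name)
            # Simplification: treat all outputs of a re-run stage as changed.
            changed.update(stage.outs)
    return rerun


# The two stages from this example, with paths taken from the text above.
pipeline = [
    Stage("data.xml.dvc", ["/path/to/data-store/data.xml"], ["data/data.xml"]),
    Stage("prepare.dvc", ["src/prepare.py", "data/data.xml"], ["data/prepared"]),
]

print(stages_to_rerun(pipeline, ["/path/to/data-store/data.xml"]))
# prints ['data.xml.dvc', 'prepare.dvc']
```

Editing only `src/prepare.py` would, under the same assumptions, affect just the `prepare.dvc` stage.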
Review comment:
any files that are found in cache
instead of an error. It's not an error usually - it's a warning like you mentioned above.