From 508fe08ca8a8faf2e4cf7fc98f7af4fe0d11bd5e Mon Sep 17 00:00:00 2001 From: David Herron Date: Fri, 19 Apr 2019 12:57:52 -0700 Subject: [PATCH 1/6] Update dvc install documentation --- static/docs/commands-reference/commit.md | 2 +- static/docs/commands-reference/install.md | 236 +++++++++++++++++++++- 2 files changed, 228 insertions(+), 10 deletions(-) diff --git a/static/docs/commands-reference/commit.md b/static/docs/commands-reference/commit.md index 02cc781d9b..caa090bf9e 100644 --- a/static/docs/commands-reference/commit.md +++ b/static/docs/commands-reference/commit.md @@ -121,7 +121,7 @@ Now, we can install requirements for the project: Then download the precomputed data using: ```dvc - $ dvc pull + $ dvc pull --all-branches --all-tags ``` This data will be retrieved from a preconfigured remote cache. diff --git a/static/docs/commands-reference/install.md b/static/docs/commands-reference/install.md index 642b3b1b28..89ed9a279d 100644 --- a/static/docs/commands-reference/install.md +++ b/static/docs/commands-reference/install.md @@ -1,6 +1,6 @@ # install -Install dvc hooks into the repository +Install DVC hooks into the Git repository to automate certain common actions ## Synopsis @@ -8,9 +8,36 @@ Install dvc hooks into the repository usage: dvc install [-h] [-q] [-v] ``` -## Installed hooks -- pre-commit : dvc status -- post-checkout : dvc checkout +## Description + +As designed DVC combines an intelligent data repository with using a regular SCM +like Git to store code and configuration files. With `dvc install` the two are +more tightly integrated, causing certain actions to happen automatically. This +saves the DVC user from having to remember to perform both DVC and SCM actions. 
+ +There are two modes of tracking files in a DVC workspace: + +* Data files are managed by DVC in the DVC cache +* Code, configuration and DVC files are managed by an SCM like Git + +**Checkout** For any given SCM branch or tag, the SCM checks-out the DVC files +corresponding to that branch or tag. The DVC files in turn refer to data files +in the DVC cache by checksum. Hence, switching from one SCM branch or tag to +another, the SCM retrieves the corresponding DVC files. By default that leaves +the workspace in a state where the DVC files refer to data files other than what +is currently in the workspace. The user at this point should run `dvc checkout` +so that the data files will match the current DVC files. + +**Commit** When committing a change to the SCM repository, that change possibly +requires rerunning the pipeline to reproduce the workspace results. It is +helpful to know automatically this as a reminder to run `dvc repro`. + +## Installed SCM hooks + +* Git `pre-commit` hook executes `dvc status` to inform the user about the + workspace status. +* Git `post-checkout` hook executes `dvc checkout` to automatically synchronize + the data files with the new workspace state. ## Options @@ -23,13 +50,204 @@ Install dvc hooks into the repository ## Examples +To explore `dvc install` let's consider a simple workspace with several stages, +the example workspace used in the [Getting Started](/doc/get-started) tutorial. + +
+ +### Click and expand to set up the project + +This step is optional, and you can run it only if you want to run these examples +in your environment. First, you need to download the project: + +```dvc + $ git clone https://github.com/iterative/example-get-started +``` + +Second, let's install the requirements. But before we do that, we **strongly** +recommend creating a virtual environment with `virtualenv` or a similar tool: + +```dvc + $ cd example-get-started + $ virtualenv -p python3 .env + $ source .env/bin/activate +``` + +Now, we can install requirements for the project: + +```dvc + $ pip install -r requirements.txt +``` + +Then download the precomputed data using: + +```dvc + $ dvc pull --all-branches --all-tags +``` + +This data will be retrieved from a preconfigured remote cache. + +
+ +## Example: Checkout both DVC and SCM + +Let's start our exploration with the impact of `dvc install` on the +`dvc checkout` command. Remember that switching from one SCM tag or branch to +another changes the set of DVC files in the workspace, which then also changes +the data files that should be in the workspace. + +With the Getting Started example workspace described above, let's first list +the available tags: + +```dvc + $ git tag + + 0-empty + 1-initialize + 2-remote + 3-add-file + 4-sources + 5-preparation + 6-featurization + 7-train + 8-evaluation + 9-bigrams + baseline-experiment + bigrams-experiment +``` + +These tags are used to mark points in the development of this workspace, and to +document specific experiments conducted in the workspace. To take a look at one +we check-out the workspace using the SCM (in this case Git): + +```dvc + $ git checkout 6-featurization + Note: checking out '6-featurization'. + + You are in 'detached HEAD' state. ... + + $ dvc status + + featurize.dvc: + changed outs: + modified: data/features + + $ dvc checkout + + [##############################] 100% Checkout finished! + + $ dvc status + + Pipeline is up to date. Nothing to reproduce. +``` + +After running `git checkout` we are also shown a message saying _You are in +'detached HEAD' state_, and the Git documentation explains what that means. +Bottom line is returning the workspace to a normal state requires the +command `git checkout master`. + +We also see that `dvc status` tells us about differences between the workspace +and the data files currently in the workspace. Git changed the DVC files in +the workspace, which changed references to data files. What `dvc status` did is +inform us the data files in the workspace no longer matched the checksums in +the DVC files. Running `dvc checkout` then checks out the corresponding data +files, and now `dvc status` tells us the data files match the DVC files. 
+ +Now let's see how this changes after running `dvc install` + +```dvc + $ git checkout master + Previous HEAD position was d13ba9a add featurization stage + Switched to branch 'master' + Your branch is up to date with 'origin/master'. + + $ dvc checkout + [##############################] 100% Checkout finished! +``` + +Start by resetting the workspace to he at the head commit. + ```dvc $ dvc install + $ cat .git/hooks/pre-commit - #!/bin/sh - exec dvc status + #!/bin/sh + exec dvc status + $ cat .git/hooks/post-checkout - #!/bin/sh - exec dvc checkout - $ git checkout mybranch # will call `dvc checkout` automatically + #!/bin/sh + exec dvc checkout ``` + +The two Git hooks have been installed, and the one of interest for this exercise +is the `post-checkout` script which runs after `git checkout`. + +```dvc + $ git checkout 6-featurization + Note: checking out '6-featurization'. + + You are in 'detached HEAD' state. ... + + HEAD is now at d13ba9a add featurization stage + [##############################] 100% Checkout finished! + + $ dvc status + + Pipeline is up to date. Nothing to reproduce. +``` + +Look carefully at this output and it is clear that the `dvc checkout` command +has indeed been run. As a result the workspace is up-to-date with the data files +matching what is referenced by the DVC files. + +## Example: Showing DVC status on Git commit + +The other hook installed by `dvc install` runs before the `git commit` operation. +To see what that does, start with the same workspace, making sure it is +not in the detached HEAD state from the previous example. + +If we simply edit one of the code files: + +```dvc + $ vi src/featurization.py + + $ git commit -a -m 'modified featurization' + + featurize.dvc: + changed deps: + modified: src/featurization.py + [master 1116ddc] modified featurization + 1 file changed, 1 insertion(+), 1 deletion(-) +``` + +We see that `dvc status` output has appeared in the `git commit` interaction. 
+This new behavior corresponds to the Git hook which was installed, and it +helpfully informs us the workspace is out of sync. We should therefore run +the `dvc repro` command. + +```dvc + $ dvc repro evaluate.dvc + + ... much output + To track the changes with git run: + + git add featurize.dvc train.dvc evaluate.dvc + + $ git status -s + M auc.metric + M evaluate.dvc + M featurize.dvc + M src/featurization.py + M train.dvc + + $ git commit -a -m 'updated data after modified featurization' + Pipeline is up to date. Nothing to reproduce. + [master 78d0c44] modified featurization + 5 files changed, 12 insertions(+), 12 deletions(-) + +``` + +After rerunning the DVC pipeline, of course the data files are in sync with +the other files but we must now commit some files to the Git repository. +Looking closely we see that `dvc status` is again run, informing us that the +data files are synchronized. From 588132a61d91cd76c28cee1be90b35c9395bcfb5 Mon Sep 17 00:00:00 2001 From: David Herron Date: Sat, 20 Apr 2019 14:34:46 -0700 Subject: [PATCH 2/6] Update dvc import documentation --- static/docs/commands-reference/import.md | 281 +++++++++++++++++++++-- 1 file changed, 261 insertions(+), 20 deletions(-) diff --git a/static/docs/commands-reference/import.md b/static/docs/commands-reference/import.md index c5d40a55ef..a4ae4aa54b 100644 --- a/static/docs/commands-reference/import.md +++ b/static/docs/commands-reference/import.md @@ -2,15 +2,6 @@ Import file from URL to local directory and track changes in remote file. 
-Supported schemes: - -* `local` - Local path -* `s3` - URL to a file on Amazon S3 -* `gs` - URL to a file on Google Storage -* `ssh` - URL to a file on another machine with SSH access -* `hdfs` - URL to a file on HDFS -* `http` - URL to a file with a _strong ETag_ served with HTTP or HTTPS - ## Synopsis ```usage @@ -21,8 +12,65 @@ Supported schemes: out Output ``` +## Description + +DVC supports `.dvc` files which refer to an external data file, see +[External Dependencies](/doc/user-guide/external-dependencies). In such a DVC +file, the `deps` section will list the remote file specification and the `outs` +section will list the local file name in the workspace. It records enough data +from the remote file to enable DVC to efficiently check the remote file to +determine if the local file is out of date. + +The `dvc import` command helps the user create such an external data dependency. + +It generates a DVC file listing in the `deps` section the resource named in +the `url` parameter, and in the `outs` section the file named in the `out` +parameter. + +DVC supports several types of remote files: + +Type | Discussion | URL format +-----|------------|------------ +`local` | Local path | `/path/to/local/file` +`s3` | Amazon S3 | `s3://mybucket/data.csv` +`gs` | Google Storage | `gs://mybucket/data.csv` +`ssh` | SSH server | `ssh://user@example.com:/path/to/data.csv` +`hdfs` | HDFS | `hdfs://user@example.com/path/to/data.csv` +`http` | HTTP to file with _strong ETag_ | `https://example.com/path/to/data.csv` + +In the _External Dependencies_ documentation, an alternative is demonstrated for +each of these schemes. Instead of: + +```dvc + $ dvc import https://example.com/path/to/data.csv data.csv +``` + +It is possible to instead run + +```dvc + $ dvc run -d https://example.com/path/to/data.csv \ + -o data.csv \ + wget https://example.com/path/to/data.csv -O data.csv +``` + +Both methods generate a DVC file with an external dependency. 
The `dvc import` +command saves the user from using the command to copy files from each of the +remote storage schemes, and from having to install CLI tools for each service. + +When DVC inspects a DVC file, one step is inspecting the dependencies to see if +any have changed. A changed dependency will appear in the `dvc status` report, +indicating the need to re-run the corresponding part of the pipeline. When DVC +inspects an external dependency, it uses a method appropriate to that dependency +to test its current status. + ## Options +* `--resume` - resume previously started download. + +* `-f`, `--file` - specify name of the DVC file it generates. It should be + either `Dvcfile` or have a `.dvc` suffix (e.g. `data.dvc`) in order + for `dvc` to be able to find it later. + * `-h`, `--help` - prints the usage/help message, and exit. * `-q`, `--quiet` - does not write anything to standard output. Exit with 0 if @@ -30,19 +78,212 @@ Supported schemes: * `-v`, `--verbose` - displays detailed tracing information. -* `--resume` - resume previously started download. +## Example: Initializing a workspace using a remote file -* `-f`, `--file` - specify name of the DVC file it generates. It should be - either `Dvcfile` or have a `.dvc` suffix (e.g. `data.dvc`) in order - for `dvc` to be able to find it later. +The [DVC getting started tutorial](/doc/get-started) demonstrates a simple DVC +pipeline. In the [Add Files step](/doc/get-started/add-files) we are told to +download a file, then use `dvc add` to integrate it with the workspace. 
+ +An alternate way to initialize the _Getting Started_ workspace, using +`dvc import`, is + +```dvc + $ mkdir get-started + $ cd get-started + $ git init + $ dvc init + $ mkdir data + $ dvc import https://dvc.org/s3/get-started/data.xml data/data.xml + Importing 'https://dvc.org/s3/get-started/data.xml' -> '/Volumes/Extra/dvc/get-started/data/data.xml' + [##############################] 100% data.xml + Adding 'data/data.xml' to 'data/.gitignore'. + Saving 'data/data.xml' to cache '.dvc/cache'. + Saving information to 'data.xml.dvc'. + + To track the changes with git run: + + git add data/.gitignore data.xml.dvc +``` + +If you wish, it's possible to set up the other stages from the _Getting Started_ +example. Since we do not need those stages for this example, we'll skip that. +Instead we can look at the resulting DVC file `data.xml.dvc`: + +```yaml + deps: + - etag: '"f432e270cd634c51296ecd2bc2f5e752-5"' + path: https://dvc.org/s3/get-started/data.xml + md5: 61e80c38c1ce04ed2e11e331258e6d0d + outs: + - cache: true + md5: a304afb96060aad90176268345e10355 + metric: false + path: data/data.xml + persist: false + wdir: . +``` + +The `etag` field in the DVC file contains the ETAG recorded from the HTTP +request. If the remote file changes, the ETAG changes, letting DVC know when +the file has changed. + +## Example: Remote file that is updated -## Examples +What if that remote file is one which will be updated regularly? The project +goal might include regenerating some artifact based on the updated data. With a +DVC external dependency, the pipeline can be triggered to re-execute based on a +changed external dependency. + +Let us again use the _Getting Started_ example, in a way which will mimic an +updated external data source. + +The first step is to set up an SSH remote for the data file. 
On a server you can +access using SSH, run these commands: + +```dvc + $ mkdir /path/to/data-store + $ cd /path/to/data-store + $ wget https://dvc.org/s3/get-started/data.xml +``` + +In a production system you might have a process to update data files you need. +That's not what we have here, so in this case we'll set up a data store where we +can edit the data file. + +On your laptop initialize the workspace again: + +```dvc + $ mkdir get-started + $ cd get-started + $ git init + $ dvc init + $ mkdir data + $ dvc import ssh://USER-NAME@HOST-NAME:/path/to/data-store/data.xml data/data.xml + Importing '/path/to/data-store/data.xml' -> '/Volumes/Extra/dvc/get-started/data/data.xml' + [##############################] 100% data.xml + Adding 'data/data.xml' to 'data/.gitignore'. + Saving 'data/data.xml' to cache '.dvc/cache'. + Saving information to 'data.xml.dvc'. + + To track the changes with git run: + + git add data/.gitignore data.xml.dvc +``` + +At this point we have the workspace set up in a similar fashion. The difference +is that DVC file references now references the editable data file on the SSH +data store we just set up. We did this to make it easy to edit the data file. + +```yaml + deps: + - md5: a86ca87250ed8e54a9e2e8d6d34c252e + path: ssh://USER-NAME@HOST-NAME:/tmp/data-store/data.xml + md5: 361728a3b037c9a4bcb897cdf856edfc + outs: + - cache: true + md5: a304afb96060aad90176268345e10355 + metric: false + path: data/data.xml + persist: false + wdir: . +``` + +The DVC file is nearly the same as before. The `path` has the URL for the SSH +data store, and instead of an `etag` we have an `md5` checksum. + +Let's also set up one of the processing stages from the Getting Started example. 
```dvc - $ dvc import /path/to/data.csv local_data.csv - $ dvc import s3://mybucket/data.csv s3_data.csv - $ dvc import gs://mybucket/data.csv gs_data.csv - $ dvc import ssh://user@example.com:/path/to/data.csv ssh_data.csv - $ dvc import hdfs://user@example.com/path/to/data.csv hdfs_data.csv - $ dvc import https://example.com/path/to/data.csv http_data.csv + $ wget https://dvc.org/s3/get-started/code.zip + $ unzip code.zip + $ rm -f code.zip + $ pip install -U -r requirements.txt + $ git add . + $ git commit -m 'add code' + $ dvc run -f prepare.dvc \ + -d src/prepare.py -d data/data.xml \ + -o data/prepared \ + python src/prepare.py data/data.xml ``` + +Having this stage means that later when we run `dvc repro` a pipeline will be +executed. + +The workspace says it is fine: + +```dvc + $ tree + . + ├── data + │   ├── data.xml + │   └── prepared + │   ├── test.tsv + │   └── train.tsv + ├── data.xml.dvc + ├── prepare.dvc + ├── requirements.txt + └── src + ├── evaluate.py + ├── featurization.py + ├── prepare.py + └── train.py + + 3 directories, 10 files + + $ dvc status + Pipeline is up to date. Nothing to reproduce. +``` + +Then over on the SSH server, edit `data.xml`. It doesn't matter what you +change, other than it still being a valid XML file, just that a change is made +because any change will change the checksum. Once we do so, we'll see this: + +```dvc + $ dvc status + data.xml.dvc: + changed deps: + modified: ssh://USER-NAME@HOST-NAME:/path/to/data-store/data.xml +``` + +DVC has noticed the external dependency has changed. It is telling us that it +is necessary to now run `dvc repro`. + +```dvc + $ dvc repro prepare.dvc + + WARNING: Dependency 'ssh://USER-NAME@HOST-NAME:/path/to/data-store/data.xml' of 'data.xml.dvc' changed because it is 'modified'. + WARNING: Stage 'data.xml.dvc' changed. 
+ Reproducing 'data.xml.dvc' + Importing '/path/to/data-store/data.xml' -> '/Volumes/Extra/dvc/get-started/data/data.xml' + [##############################] 100% data.xml + Saving 'data/data.xml' to cache '.dvc/cache'. + Saving information to 'data.xml.dvc'. + + WARNING: Dependency 'data/data.xml' of 'prepare.dvc' changed because it is 'modified'. + WARNING: Stage 'prepare.dvc' changed. + Reproducing 'prepare.dvc' + Running command: + python src/prepare.py data/data.xml + Saving 'data/prepared' to cache '.dvc/cache'. + Linking directory 'data/prepared'. + Saving information to 'prepare.dvc'. + + To track the changes with git run: + + git add data.xml.dvc prepare.dvc + + $ git add . + $ git commit -a -m 'updated data' + [master a8d4ce8] updated data + 2 files changed, 6 insertions(+), 6 deletions(-) + + $ dvc status + Pipeline is up to date. Nothing to reproduce. + +``` + +Because the external source for the data file changed, the change was noticed +by the `dvc status` command. Running `dvc repro` then ran both stages of +the pipeline, and if we had set up the other stages they also would have been +run. It first downloaded the updated data file. And then noticing that +`data/data.xml` had changed, that triggered the `prepare.dvc` stage to execute. From e834ee2fa968be405a6909399d5688cb1d976abe Mon Sep 17 00:00:00 2001 From: David Herron Date: Sat, 20 Apr 2019 17:00:27 -0700 Subject: [PATCH 3/6] Trailing blanks --- static/docs/commands-reference/install.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/static/docs/commands-reference/install.md b/static/docs/commands-reference/install.md index 89ed9a279d..5d2b487f8d 100644 --- a/static/docs/commands-reference/install.md +++ b/static/docs/commands-reference/install.md @@ -1,6 +1,6 @@ # install -Install DVC hooks into the Git repository to automate certain common actions +Install DVC hooks into the Git repository to automate certain common actions. 
## Synopsis @@ -17,8 +17,8 @@ saves the DVC user from having to remember to perform both DVC and SCM actions. There are two modes of tracking files in a DVC workspace: -* Data files are managed by DVC in the DVC cache -* Code, configuration and DVC files are managed by an SCM like Git +* Data files are managed by DVC in the DVC cache. +* Code, configuration and DVC files are managed by an SCM like Git. **Checkout** For any given SCM branch or tag, the SCM checks-out the DVC files corresponding to that branch or tag. The DVC files in turn refer to data files @@ -209,7 +209,7 @@ not in the detached HEAD state from the previous example. If we simply edit one of the code files: ```dvc - $ vi src/featurization.py + $ vi src/featurization.py $ git commit -a -m 'modified featurization' @@ -226,7 +226,7 @@ helpfully informs us the workspace is out of sync. We should therefore run the `dvc repro` command. ```dvc - $ dvc repro evaluate.dvc + $ dvc repro evaluate.dvc ... much output To track the changes with git run: From 4f275a174fc8217b85f190b9c99f0843a9f550bb Mon Sep 17 00:00:00 2001 From: David Herron Date: Sun, 21 Apr 2019 19:09:01 -0700 Subject: [PATCH 4/6] Update dvc install documentation per feedback --- static/docs/commands-reference/install.md | 34 ++++++++++++----------- 1 file changed, 18 insertions(+), 16 deletions(-) diff --git a/static/docs/commands-reference/install.md b/static/docs/commands-reference/install.md index 5d2b487f8d..c85a904b33 100644 --- a/static/docs/commands-reference/install.md +++ b/static/docs/commands-reference/install.md @@ -12,13 +12,10 @@ Install DVC hooks into the Git repository to automate certain common actions. As designed DVC combines an intelligent data repository with using a regular SCM like Git to store code and configuration files. With `dvc install` the two are -more tightly integrated, causing certain actions to happen automatically. This -saves the DVC user from having to remember to perform both DVC and SCM actions. 
+more tightly integrated, to conveniently cause certain useful actions to happen +automatically. -There are two modes of tracking files in a DVC workspace: - -* Data files are managed by DVC in the DVC cache. -* Code, configuration and DVC files are managed by an SCM like Git. +Namely: **Checkout** For any given SCM branch or tag, the SCM checks-out the DVC files corresponding to that branch or tag. The DVC files in turn refer to data files @@ -29,15 +26,16 @@ is currently in the workspace. The user at this point should run `dvc checkout` so that the data files will match the current DVC files. **Commit** When committing a change to the SCM repository, that change possibly -requires rerunning the pipeline to reproduce the workspace results. It is -helpful to know automatically this as a reminder to run `dvc repro`. +requires rerunning the pipeline to reproduce the workspace results, which is a +reminder to run `dvc repro`. Or there might be files not yet in the cache, which +is a reminder to run `dvc commit`. ## Installed SCM hooks -* Git `pre-commit` hook executes `dvc status` to inform the user about the - workspace status. -* Git `post-checkout` hook executes `dvc checkout` to automatically synchronize - the data files with the new workspace state. +* Git `pre-commit` hook executes `dvc status` before `git commit` to inform the + user about the workspace status. +* Git `post-checkout` hook executes `dvc checkout` after `git checkout` to + automatically synchronize the data files with the new workspace state. ## Options @@ -153,8 +151,6 @@ inform us the data files in the workspace no longer matched the checksums in the DVC files. Running `dvc checkout` then checks out the corresponding data files, and now `dvc status` tells us the data files match the DVC files. 
+ -Now let's see how this changes after running `dvc install` - ```dvc $ git checkout master Previous HEAD position was d13ba9a add featurization stage Switched to branch 'master' @@ -165,7 +161,9 @@ Now let's see how this changes after running `dvc install` [##############################] 100% Checkout finished! ``` -Start by resetting the workspace to he at the head commit. +We've seen the default behavior with no Git hooks installed. We want +to see how the behavior changes after installing the Git hooks. We must first +reset the workspace to be at the head commit before installing the hooks. ```dvc $ dvc install @@ -180,7 +178,9 @@ Start by resetting the workspace to he at the head commit. ``` The two Git hooks have been installed, and the one of interest for this exercise -is the `post-checkout` script which runs after `git checkout`. +is the `post-checkout` script which runs after `git checkout`. + +We can now repeat the command run earlier, to see the difference. ```dvc $ git checkout 6-featurization @@ -241,7 +241,9 @@ the `dvc repro` command. M train.dvc $ git commit -a -m 'updated data after modified featurization' + Pipeline is up to date. Nothing to reproduce. + [master 78d0c44] modified featurization 5 files changed, 12 insertions(+), 12 deletions(-) From e515ccb190f708af62cc8c50fdcb06764d0bb0c2 Mon Sep 17 00:00:00 2001 From: David Herron Date: Mon, 22 Apr 2019 12:28:53 -0700 Subject: [PATCH 5/6] Clarify what happens when dvc checkout fails --- static/docs/commands-reference/checkout.md | 7 ++++++- static/docs/commands-reference/install.md | 9 +++++++-- 2 files changed, 13 insertions(+), 3 deletions(-) diff --git a/static/docs/commands-reference/checkout.md b/static/docs/commands-reference/checkout.md index ab350e572a..901fe14096 100644 --- a/static/docs/commands-reference/checkout.md +++ b/static/docs/commands-reference/checkout.md @@ -51,10 +51,15 @@ The output of `dvc checkout` does not list which data files were restored. 
It does report removed files and files that DVC was unable to restore due to it missing from the cache. +This command will fail to checkout files that are missing from the cache. In +such a case, `dvc checkout` prints a warning message. Any files that can be +checked out without error will be restored. + There are two methods to restore a file missing from the cache, depending on the situation. In some cases the pipeline must be rerun using the `dvc repro` command. In other cases the cache can be pulled from a remote cache using the -`dvc pull` command. +`dvc pull` command. It may be necessary to use `--all-tags` or `--all-branches` +with the `dvc pull` command. ## Options diff --git a/static/docs/commands-reference/install.md b/static/docs/commands-reference/install.md index c85a904b33..e61c0793da 100644 --- a/static/docs/commands-reference/install.md +++ b/static/docs/commands-reference/install.md @@ -19,17 +19,22 @@ Namely: **Checkout** For any given SCM branch or tag, the SCM checks-out the DVC files corresponding to that branch or tag. The DVC files in turn refer to data files -in the DVC cache by checksum. Hence, switching from one SCM branch or tag to -another, the SCM retrieves the corresponding DVC files. By default that leaves +in the DVC cache by checksum. When switching from one SCM branch or tag to +another the SCM retrieves the corresponding DVC files. By default that leaves the workspace in a state where the DVC files refer to data files other than what is currently in the workspace. The user at this point should run `dvc checkout` so that the data files will match the current DVC files. +The installed Git hook automates running `dvc checkout`. + **Commit** When committing a change to the SCM repository, that change possibly requires rerunning the pipeline to reproduce the workspace results, which is a reminder to run `dvc repro`. Or there might be files not yet in the cache, which is a reminder to run `dvc commit`. 
+The installed Git hook automates reminding the user to run either `dvc repro` +or `dvc commit`. + ## Installed SCM hooks * Git `pre-commit` hook executes `dvc status` before `git commit` to inform the From b45a86badd74dede43a44d10748ab093cf79d261 Mon Sep 17 00:00:00 2001 From: David Herron Date: Wed, 24 Apr 2019 12:15:17 -0700 Subject: [PATCH 6/6] Update to dvc import and dvc install commands --- static/docs/commands-reference/checkout.md | 3 +- static/docs/commands-reference/import.md | 84 ++++++++++++---------- static/docs/commands-reference/install.md | 4 +- 3 files changed, 50 insertions(+), 41 deletions(-) diff --git a/static/docs/commands-reference/checkout.md b/static/docs/commands-reference/checkout.md index 901fe14096..f6dfad668e 100644 --- a/static/docs/commands-reference/checkout.md +++ b/static/docs/commands-reference/checkout.md @@ -58,8 +58,7 @@ checked out without error will be restored. There are two methods to restore a file missing from the cache, depending on the situation. In some cases the pipeline must be rerun using the `dvc repro` command. In other cases the cache can be pulled from a remote cache using the -`dvc pull` command. It may be necessary to use `--all-tags` or `--all-branches` -with the `dvc pull` command. +`dvc pull` command. ## Options diff --git a/static/docs/commands-reference/import.md b/static/docs/commands-reference/import.md index a4ae4aa54b..0de0b2ce63 100644 --- a/static/docs/commands-reference/import.md +++ b/static/docs/commands-reference/import.md @@ -1,6 +1,6 @@ # import -Import file from URL to local directory and track changes in remote file. +Import file from any supported URL or local directory to local workspace and track changes in remote file. ## Synopsis @@ -14,19 +14,21 @@ Import file from URL to local directory and track changes in remote file. ## Description +In some cases it is convenient to add a data file to a workspace such that it +will be automatically updated when the data source is updated. 
One project might +produce occasional data files that are used in other projects, for example. Or +a government agency might produce occasionally updated data of use in a project. + DVC supports `.dvc` files which refer to an external data file, see [External Dependencies](/doc/user-guide/external-dependencies). In such a DVC -file, the `deps` section will list the remote file specification and the `outs` -section will list the local file name in the workspace. It records enough data -from the remote file to enable DVC to efficiently check the remote file to -determine if the local file is out of date. +file, the `deps` section lists a remote file specification, and the `outs` +section lists the corresponding local file name in the workspace. It records +enough data from the remote file to enable DVC to efficiently check the remote +file to determine if the local file is out of date. DVC uses this data to then +download the file to the workspace, and to re-download it upon changes. The `dvc import` command helps the user create such an external data dependency. -It generates a DVC file listing in the `deps` section the resource named in -the `url` parameter, and in the `outs` section the file named in the `out` -parameter. - DVC supports several types of remote files: Type | Discussion | URL format @@ -37,15 +39,20 @@ Type | Discussion | URL format `ssh` | SSH server | `ssh://user@example.com:/path/to/data.csv` `hdfs` | HDFS | `hdfs://user@example.com/path/to/data.csv` `http` | HTTP to file with _strong ETag_ | `https://example.com/path/to/data.csv` +`remote` | Remote path | `remote://myremote/path/to/file` + +Another way to understand the `dvc import` command is as a short-cut for more +verbose `dvc run` commands. This is discussed in the +[External Dependencies](/doc/user-guide/external-dependencies) documentation, +where an alternative is demonstrated for each of these schemes. 
+ -In the _External Dependencies_ documentation, an alternative is demonstrated for -each of these schemes. Instead of: +Instead of `dvc import`: ```dvc $ dvc import https://example.com/path/to/data.csv data.csv ``` -It is possible to instead run +It is possible to instead use `dvc run`: ```dvc $ dvc run -d https://example.com/path/to/data.csv \ @@ -53,9 +60,10 @@ It is possible to instead run wget https://example.com/path/to/data.csv -O data.csv ``` -Both methods generate a DVC file with an external dependency. The `dvc import` -command saves the user from using the command to copy files from each of the -remote storage schemes, and from having to install CLI tools for each service. +Both methods generate a DVC file with an external dependency, and they produce +a roughly equivalent result. The `dvc import` command saves the user from running +a separate copy command for each of the remote storage schemes, and from +having to install CLI tools for each service. When DVC inspects a DVC file, one step is inspecting the dependencies to see if any have changed. A changed dependency will appear in the `dvc status` report, @@ -65,11 +73,12 @@ to test its current status. ## Options -* `--resume` - resume previously started download. +* `--resume` - resume previously started download. This is useful if the + connection to the remote resource is unstable. -* `-f`, `--file` - specify name of the DVC file it generates. It should be - either `Dvcfile` or have a `.dvc` suffix (e.g. `data.dvc`) in order - for `dvc` to be able to find it later. +* `-f`, `--file` - specify name of the DVC file it generates. It should be + either `Dvcfile` or have a `.dvc` suffix (e.g. `data.dvc`) in order for `dvc` + to be able to find it later. * `-h`, `--help` - prints the usage/help message, and exit. @@ -78,14 +87,14 @@ to test its current status. * `-v`, `--verbose` - displays detailed tracing information. 
-## Example: Initializing a workspace using a remote file +## Example: Tracking a remote file The [DVC getting started tutorial](/doc/get-started) demonstrates a simple DVC pipeline. In the [Add Files step](/doc/get-started/add-files) we are told to download a file, then use `dvc add` to integrate it with the workspace. -An alternate way to initialize the _Getting Started_ workspace, using -`dvc import`, is +An advanced alternate way to initialize the _Getting Started_ workspace, using +`dvc import`, is: ```dvc $ mkdir get-started @@ -127,18 +136,19 @@ The `etag` field in the DVC file contains the ETAG recorded from the HTTP request. If the remote file changes, the ETAG changes, letting DVC know when the file has changed. -## Example: Remote file that is updated +## Example: Detecting remote file changes What if that remote file is one which will be updated regularly? The project goal might include regenerating some artifact based on the updated data. With a DVC external dependency, the pipeline can be triggered to re-execute based on a changed external dependency. -Let us again use the _Getting Started_ example, in a way which will mimic an -updated external data source. +Let us again use the [Getting Started](/doc/get-started) example, in a way which +will mimic an updated external data source. -The first step is to set up an SSH remote for the data file. On a server you can -access using SSH, run these commands: +To make it easy to experiment with this, let us use a local directory as our +remote data store. In real life the data file will probably be on a remote +server, of course. 
Run these commands: ```dvc $ mkdir /path/to/data-store @@ -158,7 +168,7 @@ On your laptop initialize the workspace again: $ git init $ dvc init $ mkdir data - $ dvc import ssh://USER-NAME@HOST-NAME:/path/to/data-store/data.xml data/data.xml + $ dvc import /path/to/data-store/data.xml data/data.xml Importing '/path/to/data-store/data.xml' -> '/Volumes/Extra/dvc/get-started/data/data.xml' [##############################] 100% data.xml Adding 'data/data.xml' to 'data/.gitignore'. @@ -171,13 +181,13 @@ On your laptop initialize the workspace again: ``` At this point we have the workspace set up in a similar fashion. The difference -is that DVC file references now references the editable data file on the SSH -data store we just set up. We did this to make it easy to edit the data file. +is that the DVC file now references the editable data file in the data +store directory we just set up. We did this to make it easy to edit the data file. ```yaml deps: - md5: a86ca87250ed8e54a9e2e8d6d34c252e - path: ssh://USER-NAME@HOST-NAME:/tmp/data-store/data.xml + path: /path/to/data-store/data.xml md5: 361728a3b037c9a4bcb897cdf856edfc outs: - cache: true @@ -188,8 +198,8 @@ data store we just set up. We did this to make it easy to edit the data file. wdir: . ``` -The DVC file is nearly the same as before. The `path` has the URL for the SSH -data store, and instead of an `etag` we have an `md5` checksum. +The DVC file is nearly the same as before. The `path` points to the file in the data +store, and instead of an `etag` we have an `md5` checksum. Let's also set up one of the processing stages from the Getting Started example. @@ -234,7 +244,7 @@ The workspace says it is fine: Pipeline is up to date. Nothing to reproduce. ``` -Then over on the SSH server, edit `data.xml`. It doesn't matter what you +Then in the data store directory, edit `data.xml`.
It doesn't matter what you change, other than it still being a valid XML file, just that a change is made because any change will change the checksum. Once we do so, we'll see this: @@ -242,7 +252,7 @@ because any change will change the checksum. Once we do so, we'll see this: $ dvc status data.xml.dvc: changed deps: - modified: ssh://USER-NAME@HOST-NAME:/path/to/data-store/data.xml + modified: /path/to/data-store/data.xml ``` DVC has noticed the external dependency has changed. It is telling us that it @@ -251,7 +261,7 @@ is necessary to now run `dvc repro`. ```dvc $ dvc repro prepare.dvc - WARNING: Dependency 'ssh://USER-NAME@HOST-NAME:/path/to/data-store/data.xml' of 'data.xml.dvc' changed because it is 'modified'. + WARNING: Dependency '/path/to/data-store/data.xml' of 'data.xml.dvc' changed because it is 'modified'. WARNING: Stage 'data.xml.dvc' changed. Reproducing 'data.xml.dvc' Importing '/path/to/data-store/data.xml' -> '/Volumes/Extra/dvc/get-started/data/data.xml' @@ -274,12 +284,12 @@ is necessary to now run `dvc repro`. $ git add . $ git commit -a -m 'updated data' + [master a8d4ce8] updated data 2 files changed, 6 insertions(+), 6 deletions(-) $ dvc status Pipeline is up to date. Nothing to reproduce. - ``` Because the external source for the data file changed, the change was noticed diff --git a/static/docs/commands-reference/install.md b/static/docs/commands-reference/install.md index e61c0793da..8a0d16cf67 100644 --- a/static/docs/commands-reference/install.md +++ b/static/docs/commands-reference/install.md @@ -251,10 +251,10 @@ the `dvc repro` command. [master 78d0c44] modified featurization 5 files changed, 12 insertions(+), 12 deletions(-) - ``` After rerunning the DVC pipeline, of course the data files are in sync with the other files but we must now commit some files to the Git repository. Looking closely we see that `dvc status` is again run, informing us that the -data files are synchronized. 
+data files are synchronized with the statement: _Pipeline is up to date. Nothing +to reproduce_.