Merge pull request #800 from iterative/jorgeorpinel
Improve re-importing example in `dvc import`
jorgeorpinel authored Nov 19, 2019
2 parents def161a + a8f4520 commit 0360771
Showing 7 changed files with 82 additions and 65 deletions.
11 changes: 5 additions & 6 deletions .github/PULL_REQUEST_TEMPLATE.md
@@ -3,10 +3,9 @@
- Have you followed the guidelines in
[Contributing Documentation](https://dvc.org/doc/user-guide/contributing/docs)?

- The PR title should start with "Fix #bugnum: " (if applicable), followed by a
clear one-line present-tense summary of the changes introduced in the PR. For
example: "Fix #bugnum: Introduce the first version of the collection editor.".
- Please use the title to provide a clear one-line present-tense summary of the
changes introduced in the PR. For example: "Introduce the first version of the
collection editor.".

- Please make sure to mention "Fix #bugnum" (if applicable) somewhere in the
description of the PR. This enables GitHub to link the PR to the corresponding
bug.
- Please make sure to mention "Fix #bugnum" (if applicable) in the description
of the PR. This enables GitHub to link the PR to the corresponding bug.
8 changes: 1 addition & 7 deletions static/docs/command-reference/import-url.md
@@ -159,10 +159,7 @@ _Get Started_ section is to use `dvc import-url`:
$ dvc import-url https://data.dvc.org/get-started/data.xml \
data/data.xml
Importing 'https://data.dvc.org/get-started/data.xml' -> 'data/data.xml'
Saving information to 'data.xml.dvc'.
...
To track the changes with git, run:
git add data.xml.dvc data/.gitignore
@@ -224,7 +221,6 @@ edit the data file.
```dvc
$ dvc import-url /tmp/dvc-import-url-example/data.xml data/data.xml
Importing '../../../tmp/dvc-import-url-example/data.xml' -> 'data/data.xml'
...
```

Check `data.xml.dvc`:
@@ -318,8 +314,6 @@ do so, we can run `dvc update` to make sure the import stage is up to date:
$ dvc update data.xml.dvc
...
Importing '.../tmp/dvc-import-url-example/data.xml' -> 'data/data.xml'
...
Saving information to 'data.xml.dvc'.
```

DVC has noticed the "external" data source has changed, and updated the import
54 changes: 35 additions & 19 deletions static/docs/command-reference/import.md
@@ -24,7 +24,7 @@ DVC provides an easy way to reuse datasets, intermediate results, ML models, or
other files and directories tracked in another <abbr>DVC repository</abbr> into
the workspace. The `dvc import` command downloads such a <abbr>data
artifact</abbr> in a way that it is tracked with DVC, so it can be updated when
the data source changes.
the data source changes. (See `dvc update`.)

The `url` argument specifies the address of the Git repository containing the
source <abbr>project</abbr>. Both HTTP and SSH protocols are supported for
@@ -92,10 +92,10 @@ repository</abbr>, such as our
[get started example repo](https://github.com/iterative/example-get-started).

```dvc
$ dvc import git@github.com:iterative/example-get-started data/data.xml
Importing 'data/data.xml (git@github.com:iterative/example-get-started)' -> 'data.xml'
...
Saving information to 'data.xml.dvc'.
$ dvc import git@github.com:iterative/example-get-started \
data/data.xml
Importing 'data/data.xml (git@github.com:iterative/example-get-started)'
-> 'data.xml'
```

In contrast with `dvc get`, this command doesn't just download the data file,
@@ -121,14 +121,27 @@ outs:
```
Several of the values above are pulled from the original stage file
`model.pkl.dvc` in the external DVC repository. `url` and `rev_lock` fields are
used to specify the origin and version of the dependency.
`model.pkl.dvc` in the external DVC repository. The `url` and `rev_lock`
subfields under `repo` are used to save the origin and version of the
dependency.

## Example: fixed revisions & re-importing

When the `--rev` option is used, the import stage
([DVC-file](/doc/user-guide/dvc-file-format)) will include a `rev` field under
`repo` like this:
To import a specific revision of a <abbr>data artifact</abbr>, we may use the
`--rev` option:

```dvc
$ dvc import --rev cats-dogs-v1 \
git@github.com:iterative/dataset-registry.git \
use-cases/cats-dogs
Importing
'use-cases/cats-dogs (git@github.com:iterative/dataset-registry.git)'
-> 'cats-dogs'
```

When using this option, the import stage
([DVC-file](/doc/user-guide/dvc-file-format)) will also have a `rev` subfield
under `repo`:

```yaml
deps:
@@ -139,22 +152,25 @@ deps:
rev_lock: 0547f5883fb18e523e35578e2f0d19648c8f2d5c
```

If the Git revision moves, such as a branch, this doesn't have much of an effect
on the import/update workflow. However, for static refs such as tags (unless
manually updated), or for SHA commits, `dvc update` will not have any effect on
the import. In this cases, in order to actually "update" an import, it's
necessary to **re-import the data** instead, by using `dvc import` again without
or with a different `--rev`. For example:
If the
[Git revision](https://git-scm.com/book/en/v2/Git-Internals-Git-References)
moves (e.g. a branch), you may use `dvc update` to bring the data up to date.
However, for typically static references (e.g. tags), or for SHA commits, in
order to actually "update" an import, it's necessary to **re-import the data**
instead, by using `dvc import` again without or with a different `--rev`. This
will overwrite the import stage (DVC-file), either removing or replacing the
`rev` field, respectively. This can produce an import stage that is able to be
updated normally with `dvc update` going forward. For example:

```dvc
$ dvc import --rev master \
git@github.com:iterative/dataset-registry.git \
use-cases/cats-dogs
```

This will overwrite the import stage (DVC-file) either removing or replacing the
`rev` field. This can produce an import stage that is able to be updated
normally with `dvc update` going forward.
> In the above example, the value for `rev` in the new import stage will be
> `master`, which happens to be the default branch in this Git repository, so
> the command is equivalent to not using `--rev` at all.
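
As a rough illustration of the rule described above, the following Python sketch takes the `repo` entry of an import stage (already parsed into a dictionary) and suggests how to refresh the import. The function name `update_strategy` and the `rev_is_moving` hint are hypothetical and not part of DVC. Since DVC does not know what kind of Git revision `rev` is, the caller has to supply that knowledge. The `url` value is only illustrative.

```python
def update_strategy(repo: dict, rev_is_moving: bool = False) -> str:
    """Suggest how to refresh an import, given the `repo` entry of its DVC-file.

    `rev_is_moving` is a caller-supplied hint (hypothetical parameter): set it
    to True when `rev` names a branch, False for a tag or a commit SHA.
    """
    if "rev" not in repo:
        # No fixed revision was requested, so the import can be updated normally.
        return "dvc update"
    if rev_is_moving:
        # A moving reference such as a branch can still be brought up to date.
        return "dvc update"
    # Typically static references (tags, SHA commits) need a re-import.
    return "dvc import again, without --rev or with a different --rev"


# Illustrative `repo` entry; only the `rev_lock` value is taken from the example above.
stage_repo = {
    "url": "https://github.com/iterative/dataset-registry",
    "rev": "cats-dogs-v1",
    "rev_lock": "0547f5883fb18e523e35578e2f0d19648c8f2d5c",
}
print(update_strategy(stage_repo))
```

Passing `rev_is_moving=True` for a branch such as `master`, or an entry without `rev`, yields the plain `dvc update` suggestion instead.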

## Example: Data registry

21 changes: 11 additions & 10 deletions static/docs/command-reference/update.md
@@ -25,16 +25,17 @@ DVC-file `targets` as command arguments.

Note that import stages are considered always "locked", meaning that if you run
`dvc repro`, they won't be updated. `dvc update` is the only command that can
update them. Also, for `dvc import` import stages, the `rev_lock` field is
updated by `dvc update`.
update them.

Another detail to note is that when the `--rev` (revision) option of
`dvc import` has been used to create an import stage, DVC is not aware of what
kind of
[Git revision](https://git-scm.com/book/en/v2/Git-Internals-Git-References) this
is, for example a branch or a tag. For static refs such as tags (unless manually
updated), or for SHA commits, `dvc update` will not have any effect on the
import.
is, for example a branch or a tag. For typically static references (e.g. tags),
or for SHA commits, `dvc update` will not have any effect on the import. Refer
to the
[re-importing example](/doc/command-reference/import#example-fixed-revisions-re-importing)
to learn how to "update" fixed-revision imports.

## Options

@@ -52,10 +53,8 @@ Let's first import a data artifact from our

```dvc
$ dvc import git@github.com:iterative/example-get-started model.pkl
Importing 'model.pkl (git@github.com:iterative/example-get-started)' -> 'model.pkl'
...
Saving information to 'model.pkl.dvc'.
...
Importing 'model.pkl (git@github.com:iterative/example-get-started)'
-> 'model.pkl'
```

As DVC mentions, the import stage (DVC-file) `model.pkl.dvc` is created. This
@@ -73,4 +72,6 @@ Saving information to 'model.pkl.dvc'.
This time nothing has changed, since the source <abbr>project</abbr> is rather
stable.

> Refer to this [re-importing example]() for
> Note that `dvc update` updates the `rev_lock` field of the corresponding
> [DVC-file](/doc/user-guide/dvc-file-format) (when there are changes to bring
> in).
22 changes: 13 additions & 9 deletions static/docs/get-started/import-data.md
@@ -3,12 +3,12 @@
We've seen how to [push](/doc/get-started/store-data) and
[pull](/doc/get-started/retrieve-data) data from/to a <abbr>DVC project</abbr>'s
[remote](/doc/command-reference/remote). But what if we wanted to integrate a
dataset or ML model produced in one project into another project?
dataset or ML model produced in one project into another one?

One way is to download the data (with `wget` or `dvc get`, for example) and use
`dvc add` to track it, but the connection between projects would be lost. We
wouldn't be able to tell where the data came from or whether there are new
versions available. A better alternative is the `dvc import` command:
One way is to manually download the data (with `wget` or `dvc get`, for example)
and use `dvc add` to track it, but the connection between the projects would be
lost. We wouldn't be able to tell where the data came from or whether there are
new versions available. A better alternative is the `dvc import` command:

<!--
In the [Add Files](/doc/get-started/add-files) chapter, for example, we download
@@ -31,7 +31,7 @@ This downloads `data.xml` from our
[dataset-registry](https://github.com/iterative/dataset-registry) project into
the current working directory, adds it to `.gitignore`, and creates the
`data.xml.dvc` [DVC-file](/doc/user-guide/dvc-file-format) to track changes in
the source data. With _imports_, we can use `dvc update` to check for changes in
the source data. With _imports_, we can use `dvc update` to bring in changes in
the external data source before [reproducing](/doc/get-started/reproduce) any
<abbr>pipeline</abbr> that depends on this data.

@@ -67,9 +67,11 @@ outs:
persist: false
```
The `url` subfield points to the source project, while `rev_lock` lets DVC know
which Git repository version did the data come from. Note that `dvc update`
updates the `rev_lock` value.
The `url` and `rev_lock` subfields under `repo` are used to save the origin and
version of the dependency.

> Note that `dvc update` updates the `rev_lock` field of the corresponding
> DVC-file (when there are changes to bring in).

</details>

@@ -80,3 +82,5 @@ to normal with:
$ git reset --hard
$ rm -f data.*
```

> See also `dvc import-url`.
9 changes: 7 additions & 2 deletions static/docs/user-guide/dvc-file-format.md
@@ -63,11 +63,16 @@ A dependency entry consists of a pair of fields:
- `etag`: Strong ETag response header (only HTTP <abbr>external
dependencies</abbr> created with `dvc import-url`)
- `repo`: This entry is only for external dependencies created with
`dvc import`, and in itself contains the following fields:
`dvc import`, and can contain the following fields:

- `url`: URL of Git repository with source DVC project
- `rev`: Only present when the `--rev` option of `dvc import` is used.
Specific
[Git revision](https://git-scm.com/book/en/v2/Git-Internals-Git-References)
used to import the dependency from.
- `rev_lock`: Revision or version (Git commit hash) of the external <abbr>DVC
repository</abbr> at the time of importing the dependency
repository</abbr> at the time of importing or updating (with `dvc update`)
the dependency.

> See the examples in
> [External Dependencies](/doc/user-guide/external-dependencies) for more
22 changes: 10 additions & 12 deletions static/docs/user-guide/external-dependencies.md
@@ -108,7 +108,7 @@ $ dvc run -d remote://example/data.txt \
```

Please refer to `dvc remote add` for more details like setting up access
credentials for certain remotes.
credentials for the different remotes.

## Example: import-url command

@@ -119,7 +119,6 @@ external path or URL types.
```dvc
$ dvc import-url https://data.dvc.org/get-started/data.xml
Importing 'https://data.dvc.org/get-started/data.xml' -> 'data.xml'
...
```

The command above creates the <abbr>import stage</abbr> (DVC-file)
@@ -144,22 +143,21 @@ outs:
DVC checks the headers returned by the server, looking for a strong
[ETag](https://en.wikipedia.org/wiki/HTTP_ETag) or a
[Content-MD5](https://tools.ietf.org/html/rfc1864) header, and uses it to know
if the file has changed and we need to download it again.
[Content-MD5](https://tools.ietf.org/html/rfc1864) header, and uses it to
determine whether the source has changed and we need to download the file again.
</details>
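
The header check described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not DVC's actual implementation: the helper names `source_checksum` and `source_changed` are hypothetical, and the sketch assumes a strong ETag is preferred over `Content-MD5` when both are present.

```python
from typing import Mapping, Optional


def source_checksum(headers: Mapping[str, str]) -> Optional[str]:
    """Return a strong ETag if present, else Content-MD5, else None."""
    # HTTP header names are case-insensitive, so normalize them first.
    lowered = {name.lower(): value for name, value in headers.items()}
    etag = lowered.get("etag")
    # Weak ETags carry a "W/" prefix and are skipped in this sketch.
    if etag and not etag.startswith("W/"):
        return etag.strip('"')
    return lowered.get("content-md5")


def source_changed(saved: str, headers: Mapping[str, str]) -> bool:
    """Compare the saved value against the one the server reports now."""
    return source_checksum(headers) != saved
```

With this sketch, a file would be downloaded again only when `source_changed` returns `True` for the value saved at import time.
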
## Example: Using import
`dvc import` can download a <abbr>data artifact</abbr> from an external
<abbr>DVC repository</abbr>repository. It also creates an external dependency in
its <abbr>import stage</abbr> (DVC-file).
`dvc import` can download a <abbr>data artifact</abbr> from any <abbr>DVC
repository</abbr>. It also creates an external dependency in its <abbr>import
stage</abbr> (DVC-file).

```dvc
$ dvc import git@github.com:iterative/example-get-started model.pkl
Importing 'model.pkl (git@github.com:iterative/example-get-started)' -> 'model.pkl'
Preparing to download data from 'https://remote.dvc.org/get-started'
...
Importing 'model.pkl (git@github.com:iterative/example-get-started)'
-> 'model.pkl'
```

The command above creates `model.pkl.dvc`, where the external dependency is
@@ -184,7 +182,7 @@ outs:
persist: false
```

For external sources that are <abbr>DVC repositories</abbr>, `url` and
`rev_lock` fields are used to specify the origin and version of the dependency.
The `url` and `rev_lock` subfields under `repo` are used to save the origin and
version of the dependency.

</details>
