From 131af1ea370592c9237f8e55a2e0202c7fe4246d Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Tue, 29 Oct 2019 21:41:21 -0600 Subject: [PATCH] import,update: explain rev field and update vs re-importing for #735, but also for use-case: add expandable sections to new data registry case per https://github.com/iterative/dvc.org/pull/679#issuecomment-544789692 and other misc. copy edits. Also standardizes term "external" (repo) vs. "source" data/project in this context and introduces the term "revision fixing". --- static/docs/command-reference/import.md | 70 ++++++++++++++++------ static/docs/command-reference/update.md | 27 ++++++--- static/docs/use-cases/data-registry.md | 79 ++++++++++++++++++------- 3 files changed, 129 insertions(+), 47 deletions(-) diff --git a/static/docs/command-reference/import.md b/static/docs/command-reference/import.md index 4dca2bd27f..13f67c2748 100644 --- a/static/docs/command-reference/import.md +++ b/static/docs/command-reference/import.md @@ -3,7 +3,7 @@ Download or copy file or directory from any DVC project in a Git repository (e.g. hosted on GitHub) into the workspace, and track changes in this [external dependency](/doc/user-guide/external-dependencies). -Creates a DVC-file. +Creates a special DVC-file a.k.a _import stage_. > See also `dvc get`, that corresponds to the first step this command performs > (just download the data). @@ -23,11 +23,11 @@ positional arguments: DVC provides an easy way to reuse datasets, intermediate results, ML models, or other files and directories tracked in another DVC repository into the workspace. The `dvc import` command downloads such a data artifact -in a way that it is tracked with DVC, so it can be updated when the external -data source changes. +in a way that it is tracked with DVC, so it can be updated when the data source +changes. The `url` argument specifies the address of the Git repository containing the -external project. 
Both HTTP and SSH protocols are supported for +source project. Both HTTP and SSH protocols are supported for online repositories (e.g. `[user@]server:project.git`). `url` can also be a local file system path to an "offline" repository. @@ -35,31 +35,31 @@ The `path` argument of this command is used to specify the location of the data to be downloaded within the source project. It should point to a data file or directory tracked by that project – specified in one of the [DVC-files](/doc/user-guide/dvc-file-format) of the repository at `url`. (You -will not find these files directly in the source Git repository.) The source +will not find these files directly in the external Git repository.) The source project should have a default [DVC remote](/doc/command-reference/remote) configured, containing them.) > See `dvc import-url` to download and tack data from other supported URLs. After running this command successfully, the imported data is placed in the -current working directory with its original file name e.g. `data.txt`. An import -stage (DVC-file) is then created extending the full file or directory name of -the imported data e.g. `data.txt.dvc` – similar to having used `dvc run` to -generate the same output. +current working directory with its original file name e.g. `data.txt`. An +_import stage_ (DVC-file) is then created, extending the full file or directory +name of the imported data e.g. `data.txt.dvc` – similar to having used `dvc run` +to generate the same output. DVC supports DVC-files that refer to data in an external DVC repository (hosted -on a Git server). In such a DVC-file, the `deps` section specifies the `repo` -URL and data `path`, and the `outs` section contains the corresponding local -path in the workspace. It records enough data from the external file or -directory to enable DVC to efficiently check it to determine whether the local -copy is out of date. +on a Git server) a.k.a _import stages_. 
In such a DVC-file, the `deps` section +specifies the `repo` URL and data `path`, and the `outs` section contains the +corresponding local path in the workspace. It records enough data from the +external file or directory to enable DVC to efficiently check it to determine +whether the local copy is out of date. To actually [track the data](https://dvc.org/doc/get-started/add-files), -`git add` (and `git commit`) the import stage (DVC-file). +`git add` (and `git commit`) the import stage. Note that import stages are considered always "locked", meaning that if you run `dvc repro`, they won't be updated. Use `dvc update` on them to update the -downloaded data artifact from the external DVC repository. +downloaded data artifact from the source DVC repository. ## Options @@ -72,8 +72,10 @@ downloaded data artifact from the external DVC repository. - `--rev` - specific [Git revision](https://git-scm.com/book/en/v2/Git-Internals-Git-References) (such as a branch name, a tag, or a commit hash) of the DVC repository to - import the data from. The tip of the default branch is used by default when - this option is not specified. + import the data from. The tip of the repository's default branch is used by + default when this option is not specified. Note that this adds a `rev` field + in the import stage that fixes it to this revision. This can impact the + behavior of `dvc update`. - `-h`, `--help` - prints the usage/help message, and exit. @@ -120,3 +122,35 @@ outs: Several of the values above are pulled from the original stage file `model.pkl.dvc` in the external DVC repo. `url` and `rev_lock` fields are used to specify the origin and version of the dependency. 
+ +## Example: fixed revisions & re-importing + +When the `--rev` option is used, the import stage +([DVC-file](/doc/user-guide/dvc-file-format)) will include a `rev` field under +`repo` like this: + +```yaml +deps: + - path: data/data.xml + repo: + url: git@github.com:iterative/dataset-registry.git + rev: cats-dogs-v1 + rev_lock: 0547f5883fb18e523e35578e2f0d19648c8f2d5c +``` + +If the Git revision is one that moves, such as a branch, this doesn't have much +of an effect on the import/update workflow. However, for static refs such as tags +(unless manually updated), or for SHA commits, `dvc update` will not have any +effect on the import. In these cases, in order to actually "update" an import, +it's necessary to **re-import the data** instead, by using `dvc import` again, +either without `--rev` or with a different one. For example: + +```dvc +$ dvc import --rev master \ + git@github.com:iterative/dataset-registry.git \ + use-cases/cats-dogs +``` + +This will overwrite the import stage (DVC-file), either removing or replacing the +`rev` field. This can produce an import stage that can be updated normally with +`dvc update` going forward. diff --git a/static/docs/command-reference/update.md b/static/docs/command-reference/update.md index 9c99f92125..c7c8df6f44 100644 --- a/static/docs/command-reference/update.md +++ b/static/docs/command-reference/update.md @@ -1,6 +1,6 @@ # update -Update data artifacts imported from other DVC repositories. +Update data artifacts imported from external DVC repositories. ## Synopsis @@ -15,16 +15,24 @@ positional arguments: After creating import stages ([DVC-files](/doc/user-guide/dvc-file-format)) with `dvc import` or -`dvc import-url`, the external data source can change. Use `dvc update` to bring -these imported file, directory, or data artifact up to date. +`dvc import-url`, the data source can change. Use `dvc update` to bring these +imported files, directories, or data artifacts up to date. 
+ +To indicate which import stages to update, we must specify the corresponding +DVC-file `targets` as command arguments. Note that import stages are considered always "locked", meaning that if you run `dvc repro`, they won't be updated. `dvc update` is the only command that can -update them. Also, for `dvc import` DVC-files, the `rev_lock` field is updated -by `dvc update`. +update them. Also, for `dvc import` import stages, the `rev_lock` field is +updated by `dvc update`. -To indicate which import stages to update, we must specify the corresponding -DVC-file `targets` as command arguments. +Another detail to note is that when the `--rev` (revision) option of +`dvc import` has been used to create an import stage, DVC is not aware of what +kind of +[Git revision](https://git-scm.com/book/en/v2/Git-Internals-Git-References) this +is, for example a branch or a tag. For static refs such as tags (unless manually +updated), or for SHA commits, `dvc update` will not have any effect on the +import. ## Options @@ -60,4 +68,7 @@ Output 'model.pkl' didn't change. Skipping saving. Saving information to 'model.pkl.dvc'. ``` -This time nothing has changed, since the source repository is rather stable. +This time nothing has changed, since the source project is rather +stable. + +> Refer to this [re-importing example]() for diff --git a/static/docs/use-cases/data-registry.md b/static/docs/use-cases/data-registry.md index 842adb6c3a..e2b7bb79ec 100644 --- a/static/docs/use-cases/data-registry.md +++ b/static/docs/use-cases/data-registry.md @@ -47,22 +47,24 @@ containing 2800 images of cats and dogs. We partitioned the dataset in two for our [Versioning Tutorial](/doc/tutorials/versioning), and backed up the parts on a storage server, downloading them with `wget` in our examples. 
This setup was then revised to download the dataset with `dvc get` instead, so we created the -[dataset-registry](https://github.com/iterative/dataset-registry)) project, a +[dataset-registry](https://github.com/iterative/dataset-registry) repository, a DVC project hosted on GitHub, to version the dataset (see its [`tutorial/ver`](https://github.com/iterative/dataset-registry/tree/master/tutorial/ver) directory). -However, there are a few problems with the way this dataset is structured (in 2 -parts). Most importantly, this single dataset is tracked by 2 different +However, there are a few problems with the way this dataset is structured. Most +importantly, this single dataset is tracked by 2 different [DVC-files](/doc/user-guide/dvc-file-format), instead of 2 versions of the same one, which would better reflect the intentions of this dataset... Fortunately, we have also prepared an improved alternative in the [`use-cases/`](https://github.com/iterative/dataset-registry/tree/master/use-cases) directory of the same repository. -As step one, we extracted the first part of the dataset into the -`use-cases/cats-dogs` directory (illustrated below), and ran dvc add -use-cases/cats-dogs to +To create a +[first version](https://github.com/iterative/dataset-registry/tree/cats-dogs-v1/use-cases) +of our dataset, we extracted the first part into the `use-cases/cats-dogs` +directory (illustrated below), and ran `dvc add use-cases/cats-dogs` +to [track the entire directory](https://dvc.org/doc/command-reference/add#example-directory). ```dvc @@ -77,14 +79,11 @@ use-cases/cats-dogs └── dogs [400 image files] ``` -This first version uses the -[`cats-dogs-v1`](https://github.com/iterative/dataset-registry/tree/cats-dogs-v1/use-cases) -Git tag. 
In a local DVC project, we can obtain this dataset with the following -command (note the usage of `--rev`): +In a local DVC project, we could have obtained this dataset at this point with +the following command: ```dvc -$ dvc import --rev cats-dogs-v1 \ - git@github.com:iterative/dataset-registry.git \ +$ dvc import git@github.com:iterative/dataset-registry.git \ use-cases/cats-dogs ``` @@ -92,18 +91,37 @@ $ dvc import --rev cats-dogs-v1 \ > always needs to run from an [initialized](/doc/command-reference/init) DVC > project. +
+ +### Expand for actionable command (optional) + +The command above is meant for informational purposes only. If you actually run +it in a DVC project, although it should work, it will import the latest version +of `use-cases/cats-dogs` from `dataset-registry`. The following command would +actually bring in the version in question: + +```dvc +$ dvc import --rev cats-dogs-v1 \ + git@github.com:iterative/dataset-registry.git \ + use-cases/cats-dogs +``` + +See the `dvc import` command reference for more details on the `--rev` +(revision) option. + +
+ Importing keeps the connection between the local project and data registry where we are downloading the dataset from. This is achieved by creating a special -DVC-file (a.k.a. an _import stage_) – which can be used for versioning the -import with Git in the local project. This connection will come in handy when -the source data changes, and we want to obtain these updates... +DVC-file (a.k.a. _import stage_) – that can be used for versioning the import +with Git. This connection will come in handy when the source data changes, and +we want to obtain these updates... -Back in our **dataset-registry** repository, the second (and last) version of -our dataset exists under the -[`cats-dogs-v2`](https://github.com/iterative/dataset-registry/tree/cats-dogs-v2/use-cases) -tag. It was created by extracting the second part of the dataset, with 1000 -additional images (500 cats, 500 dogs) in the same directory structure, and -simply running `dvc add use-cases/cats-dogs` again. +Back in our **dataset-registry** repository, a +[second version](https://github.com/iterative/dataset-registry/tree/cats-dogs-v2/use-cases) +of our dataset was created by extracting the second part, with 1000 additional +images (500 cats, 500 dogs), into the same directory structure. Then, we simply +ran `dvc add use-cases/cats-dogs` again. In our local project, all we have to do in order to obtain this latest version of the dataset is to run: @@ -112,6 +130,25 @@ of the dataset is to run: $ dvc update cats-dogs.dvc ``` +
+ +### Expand for actionable command (optional) + +As with the previous hidden note, actually trying the commands above should +produce the expected results, but not for obvious reasons. Specifically, the +initial `dvc import` command would have already obtained the latest version of +the dataset (as noted before), so this `dvc update` is unnecessary and won't +have an effect. + +If you ran the `dvc import --rev cats-dogs-v1 ...` command instead, its import +stage (DVC-file) would be fixed to that Git tag (`cats-dogs-v1`). In order to +update it, do not use `dvc update`. Instead, re-import the data by using the +original import command (without `--rev`). Refer to +[this example](/doc/command-reference/import#example-fixed-revisions-re-importing) +for more information. + +
+ This downloads new and changed files in `cats-dogs/` from the source project, and updates the metadata in the import stage DVC-file.
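Reviewer's sketch (not part of the patch): the update-versus-re-import rule that this patch documents can be illustrated with a few lines of Python. Everything here is an assumption made for illustration: the helper names `fixed_rev` and `suggested_action` are hypothetical and not part of DVC, a regex stands in for a real YAML parser, and the caller is assumed to already know which names are branches in the source repository.

```python
import re
from typing import Iterable, Optional


def fixed_rev(stage_text: str) -> Optional[str]:
    """Return the `rev` value found in an import stage, or None if absent.

    Simplification: a regex replaces a real YAML parser. The pattern
    requires `rev:` exactly, so the `rev_lock` field is never matched.
    """
    match = re.search(r"^\s*rev:\s*(\S+)\s*$", stage_text, flags=re.MULTILINE)
    return match.group(1) if match else None


def suggested_action(stage_text: str, branches: Iterable[str]) -> str:
    """Suggest a command, following the rule described in the patch.

    No `rev` field, or a `rev` that names a (moving) branch: `dvc update`
    can pick up changes. A tag or commit hash: re-import is needed.
    """
    rev = fixed_rev(stage_text)
    if rev is None or rev in set(branches):
        return "dvc update"
    return "dvc import (re-import)"


# Import stage fragment taken from the example added to import.md.
STAGE = """\
deps:
  - path: data/data.xml
    repo:
      url: git@github.com:iterative/dataset-registry.git
      rev: cats-dogs-v1
      rev_lock: 0547f5883fb18e523e35578e2f0d19648c8f2d5c
"""

if __name__ == "__main__":
    print(fixed_rev(STAGE))
    print(suggested_action(STAGE, branches=["master"]))
```

Passing the branch list in explicitly mirrors the patch's point that DVC itself is not aware of what kind of Git revision `rev` holds; that knowledge has to come from the user.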