diff --git a/static/docs/command-reference/get.md b/static/docs/command-reference/get.md index 120b3c98a3..f1cbc6c6e2 100644 --- a/static/docs/command-reference/get.md +++ b/static/docs/command-reference/get.md @@ -163,7 +163,7 @@ different names, and not currently tracked by Git: $ git status ... Untracked files: - (use "git add ..." to include in what will be committed) + (use "git add ..." to include in what will be committed) model.bigrams.pkl model.monograms.pkl diff --git a/static/docs/command-reference/install.md b/static/docs/command-reference/install.md index cda7101d8b..ff2c9710a2 100644 --- a/static/docs/command-reference/install.md +++ b/static/docs/command-reference/install.md @@ -155,7 +155,7 @@ checkout the `6-featurization` tag: $ git checkout 6-featurization Note: checking out '6-featurization'. -You are in 'detached HEAD' state. ... +You are in 'detached HEAD' state... $ dvc status @@ -216,7 +216,7 @@ We can now repeat the command run earlier, to see the difference. $ git checkout 6-featurization Note: checking out '6-featurization'. -You are in 'detached HEAD' state. ... +You are in 'detached HEAD' state... HEAD is now at d13ba9a add featurization stage @@ -257,8 +257,7 @@ helpfully informs us the workspace is out of sync. We should therefore run the ```dvc $ dvc repro evaluate.dvc - -... much output +... To track the changes with git run: git add featurize.dvc train.dvc evaluate.dvc diff --git a/static/docs/tutorials/deep/reproducibility.md b/static/docs/tutorials/deep/reproducibility.md index 1e3ad9fcb3..25d1e7024f 100644 --- a/static/docs/tutorials/deep/reproducibility.md +++ b/static/docs/tutorials/deep/reproducibility.md @@ -34,7 +34,7 @@ $ dvc repro model.p.dvc $ dvc repro ``` -Tries to reproduce the same pipeline... But there is still nothing to reproduce. +Tries to reproduce the same pipeline, but there is still nothing to reproduce. 
## Adding bigrams diff --git a/static/docs/use-cases/data-registry.md b/static/docs/use-cases/data-registry.md index a5eead5b21..b03433b9dc 100644 --- a/static/docs/use-cases/data-registry.md +++ b/static/docs/use-cases/data-registry.md @@ -7,30 +7,24 @@ tracking of datasets and any other data artifacts. With the aim to enable reusability of these versioned artifacts between different projects (similar to package management systems, but for data), DVC -also includes the `dvc get`, `dvc import`, and `dvc update` commands. For -example, project A may use a data file to begin its data -[pipeline](/doc/command-reference/pipeline), but project B also requires this -same file; Instead of -[adding it](/doc/command-reference/add#example-single-file) it to both projects, -B can simply import it from A. Furthermore, the version of the data file -imported to B can be an older iteration than what's currently used in A. +also includes the `dvc get`, `dvc import`, and `dvc update` commands. This means +that a project can depend on data from an external DVC project. Keeping this in mind, we could build a DVC project dedicated to tracking and versioning datasets (or any kind of large files). This way we would -have a repository that has all the metadata and change history for the project's -data. We can see who updated what, and when; use pull requests to update data -the same way you do with code; and we don't need ad-hoc conventions to store -different data versions. Other projects can share the data in the registry by -downloading (`dvc get`) or importing (`dvc import`) them for use in different -data processes. +have a repository with all the metadata and history of changes in the project's +data. We could see who updated what, and when, use pull requests to update data +(the same way we do with code), and avoid ad-hoc conventions to store different +data versions. This is what we call a data registry. 
Other projects can share +datasets in a registry by downloading (`dvc get`) or importing (`dvc import`) +them for use in different data processes. -The advantages of using a DVC **data registry** project are: +Advantages of using a DVC **data registry** project: - Data as code: Improve _lifecycle management_ with versioning of simple directory structures (like Git for your cloud storage), without ad-hoc - conventions. Leverage Git and Git hosting features such as change history, - branching, pull requests, reviews, and even continuous deployment of ML - models. + conventions. Leverage Git and Git hosting features such as commits, branching, + pull requests, reviews, and even continuous deployment of ML models. - Reusability: Reproduce and organize _feature stores_ with a simple CLI (`dvc get` and `dvc import` commands, similar to software package management systems like `pip`). @@ -49,29 +43,30 @@ The advantages of using a DVC **data registry** project are: ## Example -A dataset we use for several of our examples and tutorials is one containing -2800 images of cats and dogs. We partitioned the dataset in two for our -[Versioning Tutorial](/doc/tutorials/versioning), and backed up the parts on a -storage server, downloading them with `wget` in our examples. This setup was -then revised to download the dataset with `dvc get` instead, so we created the -[dataset-registry](https://github.com/iterative/dataset-registry)) repository, a -DVC project hosted on GitHub, to version the dataset (see its +A dataset we commonly use for several of our examples and tutorials contains +2800 images of cats and dogs, which was split in two for our +[Versioning Tutorial](/doc/tutorials/versioning). Originally, the parts were +backed up on a storage server, and downloaded with +[`wget`](https://www.gnu.org/software/wget/).
This was then revised in order to +download the parts with `dvc get` instead, so we created the +[dataset-registry](https://github.com/iterative/dataset-registry) +project to version the dataset (in the [`tutorial/ver`](https://github.com/iterative/dataset-registry/tree/master/tutorial/ver) directory). -However, there are a few problems with the way this dataset is structured. Most -importantly, this single dataset is tracked by 2 different -[DVC-files](/doc/user-guide/dvc-file-format), instead of 2 versions of the same -one, which would better reflect the intentions of this dataset... Fortunately, -we have also prepared an improved alternative in the +However, there are a few problems with the way that dataset is versioned. Most +importantly, this split dataset is tracked by 2 different +[DVC-files](/doc/user-guide/dvc-file-format) (one for each part), instead of 2 +versions of a single DVC-file. An initial version could have the first part +only, while an update would have the entire, unified dataset. Fortunately, we +have also prepared this improved alternative in the [`use-cases/`](https://github.com/iterative/dataset-registry/tree/master/use-cases) directory of the same DVC repository. -To create a -[first version](https://github.com/iterative/dataset-registry/tree/cats-dogs-v1/use-cases) +To create the +[initial version](https://github.com/iterative/dataset-registry/tree/cats-dogs-v1/use-cases) of our dataset, we extracted the first part into the `use-cases/cats-dogs` -directory (illustrated below), and ran `dvc add use-cases/cats-dogs` to -[track the entire directory](https://dvc.org/doc/command-reference/add#example-directory).
+directory, illustrated below: ```dvc $ tree use-cases/cats-dogs --filelimit 3 @@ -85,7 +80,10 @@ use-cases/cats-dogs └── dogs [400 image files] ``` -In a local DVC project, we could have obtained this dataset at this point with +Then we ran `dvc add use-cases/cats-dogs` to +[track the entire directory](https://dvc.org/doc/command-reference/add#example-directory). + +At this point, we could have obtained this dataset in another DVC project with the following command: ```dvc @@ -95,15 +93,16 @@ $ dvc import git@github.com:iterative/dataset-registry.git \ > Note that unlike `dvc get`, which can be used from any directory, `dvc import` > always needs to run from an [initialized](/doc/command-reference/init) DVC -> project. +> project. Remember also that with both commands, the data comes from the source +> project's remote storage, not from the Git repository itself.
### Expand for actionable command (optional) The command above is meant for informational purposes only. If you actually run -it in a DVC project, although it should work, it will import the latest version -of `use-cases/cats-dogs` from `dataset-registry`. The following command would +it, although it will work, it will import the latest version of +`use-cases/cats-dogs` from `dataset-registry`. The following command would actually bring in the version in question: ```dvc @@ -117,54 +116,52 @@ See the `dvc import` command reference for more details on the `--rev`
-Importing keeps the connection between the local project and the source data -registry where we are downloading the dataset from. This is achieved by creating -a particular kind of [DVC-file](/doc/user-guide/dvc-file-format) that uses the -`repo` field (a.k.a. _import stage_). (This file can be used for versioning the -import with Git.) +Importing keeps the connection between the local project and the +data source (registry repository). This is achieved by creating a +particular kind of [DVC-file](/doc/user-guide/dvc-file-format) (a.k.a. _import +stage_) that includes a `repo` field. (This file can be staged and +committed with Git.) > For a sample DVC-file resulting from `dvc import`, refer to > [this example](/doc/command-reference/import#example-data-registry). -Back in our **dataset-registry** project, a +Back in our **dataset-registry** project, the [second version](https://github.com/iterative/dataset-registry/tree/cats-dogs-v2/use-cases) of our dataset was created by extracting the second part, with 1000 additional -images (500 cats, 500 dogs), into the same directory structure. Then, we simply -ran `dvc add use-cases/cats-dogs` again. +images (500 cats, 500 dogs) on top of the existing directory structure. Then, we +simply ran `dvc add use-cases/cats-dogs` again. -In our local project, all we have to do in order to obtain this latest version -of the dataset is to run: +All we would have to do in order to obtain this latest version in another +project where the first version was previously imported, is to run: ```dvc $ dvc update cats-dogs.dvc ``` -This is possible because of the connection that the import stage saved among -local and source projects, as explained earlier. -
### Expand for actionable command (optional) -As with the previous hidden note, actually trying the commands above should -produced the expected results, but not for obvious reasons. Specifically, the -initial `dvc import` command would have already obtained the latest version of -the dataset (as noted before), so this `dvc update` is unnecessary and won't -have an effect. +As with the previous hidden note, actually trying the command above will produce +the desired results, but not for obvious reasons. The initial `dvc import` +command would have already obtained the latest version of the dataset (as noted +before), so this `dvc update` is unnecessary and won't have any effect. -If you ran the `dvc import --rev cats-dogs-v1 ...` command instead, its import -stage (DVC-file) would be fixed to that Git tag (`cats-dogs-v1`). In order to -update it, do not use `dvc update`. Instead, re-import the data by using the -original import command (without `--rev`). Refer to -[this example](http://localhost:3000/doc/command-reference/import#example-fixed-revisions-re-importing) -for more information. +And if you ran the `dvc import --rev cats-dogs-v1 ...` command instead, its +import stage (DVC-file) would be +[fixed to that revision](/doc/command-reference/import#example-fixed-revisions-re-importing) +(`cats-dogs-v1` tag), so `dvc update` would also be ineffective. In order to +actually "update" it, re-import the data instead, by now running the initial +import command (the one without `--rev`): -
+```dvc +$ dvc import git@github.com:iterative/dataset-registry.git \ + use-cases/cats-dogs +``` -This downloads new and changed files in `cats-dogs/` from the source project, -and updates the metadata in the import stage DVC-file. + -As an extra detail, notice that so far our local project is working only with a -local cache. It has no need to setup a -[remotes](/doc/command-reference/remote) to [pull](/doc/command-reference/pull) -or [push](/doc/command-reference/push) this dataset. +This is possible because of the connection that the import stage saved between +local and source projects, as explained earlier. The update downloads new and +changed files in `cats-dogs/` based on the source project, and updates the +metadata in the import stage DVC-file.