From eb088b56fd0185183c95e21528b5c66f756a76f7 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Sun, 6 Oct 2019 15:25:34 -0400 Subject: [PATCH 01/40] use-cases: fix H3->H2 levels in shared-development-server --- static/docs/use-cases/shared-development-server.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/static/docs/use-cases/shared-development-server.md b/static/docs/use-cases/shared-development-server.md index 44cee2f9da..65e6987e63 100644 --- a/static/docs/use-cases/shared-development-server.md +++ b/static/docs/use-cases/shared-development-server.md @@ -11,7 +11,7 @@ your team to store and share data for your projects effectively, and to have instantaneous workspace restoration/switching speed – similar to `git checkout` for your code. -### Preparation +## Preparation Create a shared directory to be used as cache location for everyone's projects, so that all your colleagues can use the same @@ -27,7 +27,7 @@ written by others. The most straightforward way to do this is to make sure that everyone's users are members of the same group, and that your shared cache directory is owned by this group, with the aforementioned permissions. -### Transfer existing cache (Optional) +## Transfer existing cache (Optional) This step is optional. You can skip it if you are setting up a new DVC project whose cache directory is not stored in the default location, `.dvc/cache`. If @@ -39,7 +39,7 @@ to simply move it from an old cache location to the new one: $ mv .dvc/cache/* /path/to/dvc-cache ``` -### Configure shared cache +## Configure shared cache Tell DVC to use the directory we've set up above as an shared cache location by running: @@ -55,7 +55,7 @@ $ git add .dvc/config $ git commit -m "dvc: shared external cache dir" ``` -### Examples +## Examples You and your colleagues can work in your own separate workspaces as usual, and DVC will handle all your data in the most effective way possible. From 98271d7f30f8b0c505b55f9cd746554e41d0a131 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Mon, 7 Oct 2019 10:51:57 -0400 Subject: [PATCH 02/40] cmd ref: remove comment from import --- static/docs/command-reference/import.md | 2 -- 1 file changed, 2 deletions(-) diff --git a/static/docs/command-reference/import.md b/static/docs/command-reference/import.md index e55588e68b..adbaae5a9f 100644 --- a/static/docs/command-reference/import.md +++ b/static/docs/command-reference/import.md @@ -115,5 +115,3 @@ outs: Several of the values above are pulled from the original stage file `model.pkl.dvc` in the external DVC repo. `url` and `rev_lock` fields are used to specify the origin and version of the dependency. - - From 77da9ba6945a2f0aa57f28886bb1d3e3bbe31205 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Mon, 7 Oct 2019 12:41:36 -0400 Subject: [PATCH 03/40] tutorials: explain use of `dvc remove` in versionin; other improvements... 
in preparation for new data registry use case (#674) --- static/docs/tutorials/versioning.md | 32 +++++++++++++++++------------ 1 file changed, 19 insertions(+), 13 deletions(-) diff --git a/static/docs/tutorials/versioning.md b/static/docs/tutorials/versioning.md index 2fb0189b56..2e0a4c3e2f 100644 --- a/static/docs/tutorials/versioning.md +++ b/static/docs/tutorials/versioning.md @@ -160,12 +160,12 @@ $ git tag -a "v1.0" -m "model v1.0, 1000 images" ### Expand to learn more about DVC internals As we mentioned briefly, DVC does not commit the `data/` directory and -`model.h5` file into git, `dvc add` pushed them into the DVC cache and added to -the `.gitignore`. Instead, we commit DVC-files that serve as pointers to the -cache (usually in the `.dvc/cache` directory inside the repository) where actual -data resides. +`model.h5` file to the Git repository, `dvc add` placed them into the DVC cache +and added them to `.gitignore`. Instead, we commit DVC-files that serve as +pointers to the cache (usually in the `.dvc/cache` directory inside the +repository) where actual data resides. -In this case we created `data.dvc` and `model.h5.dvc` files. Refer to the +In this case we created `data.dvc` and `model.h5.dvc`. Refer to [DVC-File Format](/doc/user-guide/dvc-file-format) to learn more about how these files work. @@ -194,8 +194,9 @@ $ unzip new-labels.zip $ rm -f new-labels.zip ``` -For simplicity we keep the validation dataset the same. Now our dataset has 2000 -images for training and 800 images for validation, with a total size of 67 MB: +For simplicity, we keep the validation dataset the same. Now our dataset has +2000 images for training and 800 images for validation, with a total size of 67 +MB: ```sh data @@ -219,7 +220,7 @@ data └── cat.1400.jpg ``` -Of course, we want to leverage these new labels and retrain the model. +Of course, we want to leverage these new labels and retrain the model: ```dvc $ dvc add data @@ -228,6 +229,10 @@ $ python train.py $ dvc add model.h5 ``` +> `dvc remove` is necessary here because `model.h5` was already added with +> `dvc add` earlier, but we want to do so again. Later we'll see how `dvc run` +> eliminates this extra step. + Let's commit the second version: ```dvc @@ -324,11 +329,12 @@ $ dvc run -f Dvcfile \ python train.py ``` -Similar to `dvc add`, `dvc run` creates a DVC-file (forced to have file name -`Dvcfile` with the `-f` option). It puts all outputs (`-o`) under DVC control -the same way as `dvc add` does. Unlike, `dvc add`, `dvc run` also tracks -dependencies (`-d`) and the command (`python train.py`) that was run to produce -the result. We also such a DVC-file a "stage file". +Similar to `dvc add`, `dvc run` creates a +[DVC-file](/doc/user-guide/dvc-file-format) named `Dvcfile` (specified using the +`-f` option). It puts all outputs (`-o`) under DVC control the same way as +`dvc add` does. Unlike, `dvc add`, `dvc run` also tracks dependencies (`-d`) and +the command (`python train.py`) that was run to produce the result. We also such +a DVC-file a "stage file". > BTW, at this point you could `git add .` and `git commit` to save the > `Dvcfile` stage file and its changed output files to the repository. 
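A quick sketch of how the retraining loop could look once the stage file above exists, to illustrate the earlier note about `dvc run` removing the manual `dvc remove`/`dvc add model.h5` step. This assumes the stage lists the dataset directory among its `-d` dependencies (as the tutorial's full `dvc run` command does); the exact file names follow the tutorial and the commit message is illustrative:

```dvc
$ dvc add data          # record the updated dataset
$ dvc repro Dvcfile     # re-runs `python train.py`; model.h5 is refreshed
                        # automatically because it is an output of the stage
$ git add data.dvc Dvcfile
$ git commit -m "Retrain model on the updated dataset"
```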
From b0f871a61edf878dc61051d144b4be276530f8c4 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Mon, 7 Oct 2019 12:49:59 -0400 Subject: [PATCH 04/40] use-cases: incomplete draft of new data registry case study for #674 --- src/Documentation/sidebar.json | 3 +- static/docs/use-cases/shared-data-registry.md | 85 +++++++++++++++++++ 2 files changed, 87 insertions(+), 1 deletion(-) create mode 100644 static/docs/use-cases/shared-data-registry.md diff --git a/src/Documentation/sidebar.json b/src/Documentation/sidebar.json index 50b73c5049..f00c064fa3 100644 --- a/src/Documentation/sidebar.json +++ b/src/Documentation/sidebar.json @@ -75,7 +75,8 @@ "label": "Share Data & Model Files", "slug": "share-data-and-model-files" }, - "shared-development-server" + "shared-development-server", + "shared-data-registry" ] }, { diff --git a/static/docs/use-cases/shared-data-registry.md b/static/docs/use-cases/shared-data-registry.md new file mode 100644 index 0000000000..c6cc7bdc1c --- /dev/null +++ b/static/docs/use-cases/shared-data-registry.md @@ -0,0 +1,85 @@ +# Shared Data Registry + +In the [Versioning Tutorial](/doc/tutorials/versioning) we use two ZIP files +containing parts of a dataset with labeled images of cat and dogs. For +simplicity, these archives are downloaded with `dvc get` from our +[dataset registry](https://github.com/iterative/dataset-registry), a DVC +project hosted on GitHub. + +In this document we'll explain the idea behind shared data registries, how to +easily create one with DVC, and the best ways to version datasets based on the +same example above (without ZIP files!) + +## Concept + +We developed the `dvc get`, `dvc import`, and `dvc update` commands with the aim +to enable reusability of any outputs (datasets, intermediate +results, models, etc) between different projects. For example, project A may use +a raw dataset to begin a data [pipeline](/doc/command-reference/pipeline), but +project B also requires this same dataset; Instead of +[adding](/doc/command-reference/add) it to both projects, B can simply import it +from A. (This can bring many benefits that we'll explain a little later.) + +Taking this idea to a useful extreme, we could setup a project that +is dedicated to +[tracking and versioning](/doc/use-cases/data-and-model-files-versioning) +datasets (or any kind of large files) – by mainly using `dvc add` to build it. +Such a project would not have [stages](/doc/command-reference/run), but its data +files may be updated manually as they evolve. Other projects can then share +these files by downloading (`dvc get`) or importing (`dvc import`) them for use +in different data processes – and these don't even have to be _DVC projects_, as +`dvc get` can work anywhere in your system. + +The advantages of using a data registry are: + +- Tracked files can be safely stored in a **centralized** remote location, with + the ability to create any amount of distributed copies on other remote + storage. +- Several projects can **share** the same files, trusting that everyone is using + the same data versions. +- Projects that import data from the registry don't need to push these large + files to their own remotes, **saving space** on storage – they may not even + need to a remote at all, using only the local cache. +- Easier to manage **access control** per remote storage configured in a single + data registry project. 
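As a minimal sketch of the two access modes described above (the registry URL and path are the ones used throughout this document; command output is omitted):

```dvc
# One-off download; works anywhere, even outside a DVC project:
$ dvc get https://github.com/iterative/dataset-registry \
          tutorial/ver/data.zip

# Import into another DVC project (must be run inside one), recording the
# source repository so that `dvc update` can later check it for new versions:
$ dvc import https://github.com/iterative/dataset-registry \
             tutorial/ver/data.zip
```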
+ +A possible weakness of data registries is that, if the source project or its +remote storage are lost for any reason, several other projects depending on it +may stop being reproducible. So this strategy is best when the registry is owned +and controlled internally by the same team as the projects that employ it. +Trusting 3rd party data registries should be considered a risk. + +## Example: A sub-optimal approach + +For illustration purposes, our own + +[dataset registry](https://github.com/iterative/dataset-registry) contains a +poorly handled dataset in the `tutorial/ver` directory. It contains 2 +[DVC-files](/doc/user-guide/dvc-file-format) that track a couple of ZIP files +(problem #1). Each archive, when extracted, contains the same directory +structure, but with complementary files, that together form a single dataset +(problem #2) of 2000 images of cats and dogs. + +> As mentioned in the introduction, these ZIP files are used as-is for +> simplicity in our [Versioning Tutorial](/doc/tutorials/versioning). + +Let's download and extract the first half of this dataset to better understand +its structure: + +```dvc +$ dvc get https://github.com/iterative/dataset-registry tutorial/ver/data.zip +... +$ unzip -q ... +$ rm data.zip +$ tree ... -L 3 +... +``` + +```dvc +$ $ dvc get https://github.com/iterative/dataset-registry \ + tutorial/ver/new-labels.zip +``` + +## A properly versioned registry + +... From cef3e1473e71d45cd1f16de719aa763901fba4b0 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Mon, 7 Oct 2019 16:18:19 -0400 Subject: [PATCH 05/40] use-cases: shared-data-registry -> data-registry per https://github.com/iterative/dvc.org/pull/679#pullrequestreview-298352500 --- src/Documentation/sidebar.json | 2 +- .../use-cases/{shared-data-registry.md => data-registry.md} | 5 +++-- 2 files changed, 4 insertions(+), 3 deletions(-) rename static/docs/use-cases/{shared-data-registry.md => data-registry.md} (97%) diff --git a/src/Documentation/sidebar.json b/src/Documentation/sidebar.json index f00c064fa3..59a27ec4e8 100644 --- a/src/Documentation/sidebar.json +++ b/src/Documentation/sidebar.json @@ -76,7 +76,7 @@ "slug": "share-data-and-model-files" }, "shared-development-server", - "shared-data-registry" + "data-registry" ] }, { diff --git a/static/docs/use-cases/shared-data-registry.md b/static/docs/use-cases/data-registry.md similarity index 97% rename from static/docs/use-cases/shared-data-registry.md rename to static/docs/use-cases/data-registry.md index c6cc7bdc1c..fee27187d7 100644 --- a/static/docs/use-cases/shared-data-registry.md +++ b/static/docs/use-cases/data-registry.md @@ -1,4 +1,4 @@ -# Shared Data Registry +# Data Registry In the [Versioning Tutorial](/doc/tutorials/versioning) we use two ZIP files containing parts of a dataset with labeled images of cat and dogs. For @@ -67,7 +67,8 @@ Let's download and extract the first half of this dataset to better understand its structure: ```dvc -$ dvc get https://github.com/iterative/dataset-registry tutorial/ver/data.zip +$ dvc get https://github.com/iterative/dataset-registry \ + tutorial/ver/data.zip ... $ unzip -q ... $ rm data.zip From 4ccd9329382ac6fbeb2b1849464a928ad30d9d56 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Wed, 9 Oct 2019 00:08:26 -0500 Subject: [PATCH 06/40] engine: clarify what [Edit on GitHub] button is for. 
--- src/Documentation/RightPanel/RightPanel.js | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/src/Documentation/RightPanel/RightPanel.js b/src/Documentation/RightPanel/RightPanel.js index 5ee797d4bc..f6327590c4 100644 --- a/src/Documentation/RightPanel/RightPanel.js +++ b/src/Documentation/RightPanel/RightPanel.js @@ -136,7 +136,9 @@ export default class RightPanel extends React.PureComponent { )} - Found an issue? Let us know or fix it: + + Found an issue in this doc? Let us know or fix it: + From 76d200df611d2fc81f9d220962ec40ac099bcd6d Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Wed, 9 Oct 2019 19:38:59 -0500 Subject: [PATCH 07/40] use-cases: first draft of new data registry case study --- static/docs/use-cases/data-registry.md | 136 ++++++++++++++++++++----- 1 file changed, 112 insertions(+), 24 deletions(-) diff --git a/static/docs/use-cases/data-registry.md b/static/docs/use-cases/data-registry.md index fee27187d7..5be012928b 100644 --- a/static/docs/use-cases/data-registry.md +++ b/static/docs/use-cases/data-registry.md @@ -6,29 +6,30 @@ simplicity, these archives are downloaded with `dvc get` from our [dataset registry](https://github.com/iterative/dataset-registry), a DVC project hosted on GitHub. -In this document we'll explain the idea behind shared data registries, how to -easily create one with DVC, and the best ways to version datasets based on the -same example above (without ZIP files!) +In this document we'll explain the idea behind **shared data registries**, how +to easily create one with DVC, and the best ways to version datasets based on +the same example above (without ZIP files!) ## Concept We developed the `dvc get`, `dvc import`, and `dvc update` commands with the aim -to enable reusability of any outputs (datasets, intermediate +to enable reusability of any outputs (raw data, intermediate results, models, etc) between different projects. For example, project A may use -a raw dataset to begin a data [pipeline](/doc/command-reference/pipeline), but -project B also requires this same dataset; Instead of -[adding](/doc/command-reference/add) it to both projects, B can simply import it -from A. (This can bring many benefits that we'll explain a little later.) - -Taking this idea to a useful extreme, we could setup a project that -is dedicated to +a data file to begin its data [pipeline](/doc/command-reference/pipeline), but +project B also requires this same file; Instead of +[adding it](/doc/command-reference/add#example-single-file) it to both projects, +B can simply import it from A. (This can bring many benefits that we'll explain +a little later.) + +Taking this idea to a useful extreme, we could create a project +that is exclusively dedicated to [tracking and versioning](/doc/use-cases/data-and-model-files-versioning) datasets (or any kind of large files) – by mainly using `dvc add` to build it. Such a project would not have [stages](/doc/command-reference/run), but its data files may be updated manually as they evolve. Other projects can then share these files by downloading (`dvc get`) or importing (`dvc import`) them for use in different data processes – and these don't even have to be _DVC projects_, as -`dvc get` can work anywhere in your system. +`dvc get` works anywhere in your system. The advantages of using a data registry are: @@ -38,8 +39,9 @@ The advantages of using a data registry are: - Several projects can **share** the same files, trusting that everyone is using the same data versions. 
- Projects that import data from the registry don't need to push these large - files to their own remotes, **saving space** on storage – they may not even - need to a remote at all, using only the local cache. + files to their own [remotes](/doc/command-reference/remote), **saving space** + on storage – they may not even need to a remote at all, using only the local + cache. - Easier to manage **access control** per remote storage configured in a single data registry project. @@ -47,14 +49,13 @@ A possible weakness of data registries is that, if the source project or its remote storage are lost for any reason, several other projects depending on it may stop being reproducible. So this strategy is best when the registry is owned and controlled internally by the same team as the projects that employ it. -Trusting 3rd party data registries should be considered a risk. +Trusting 3rd party registries should be considered a risk. ## Example: A sub-optimal approach For illustration purposes, our own - [dataset registry](https://github.com/iterative/dataset-registry) contains a -poorly handled dataset in the `tutorial/ver` directory. It contains 2 +poorly handled dataset, in the `tutorial/ver` directory. It contains 2 [DVC-files](/doc/user-guide/dvc-file-format) that track a couple of ZIP files (problem #1). Each archive, when extracted, contains the same directory structure, but with complementary files, that together form a single dataset @@ -63,24 +64,111 @@ structure, but with complementary files, that together form a single dataset > As mentioned in the introduction, these ZIP files are used as-is for > simplicity in our [Versioning Tutorial](/doc/tutorials/versioning). -Let's download and extract the first half of this dataset to better understand -its structure: +Let's download and extract the first half of this dataset (in an empty +directory) to better understand its structure: ```dvc $ dvc get https://github.com/iterative/dataset-registry \ tutorial/ver/data.zip +$ unzip -q data.zip +$ tree --filelimit 3 +. +├── data +│   ├── train +│   │   ├── cats [500 entries ...] +│   │   └── dogs [500 entries ...] +│   └── validation +│   ├── cats [400 entries ...] +│   └── dogs [400 entries ...] +└── data.zip + +7 directories, 1 file +$ rm -f data.zip +``` + +### Problem #1: Compressed data files + +`dvc add`, the command to place existing data under DVC control, supports +tracking both files +[and directories](/doc/command-reference/add#example-directory). For this +reason, adding compressed directories to a project is not +recommended, especially when the files contained are already compressed (like +typical image file formats). + +While compression can save space for some files, such as tabular data stored in +text files (CSV, TSV, JSON, etc.), versioning compressed files risks storing +repeated files in the cache or +[remote storage](/doc/command-reference/remote), when the dataset is not +partitioned correctly. You will also need an extra step after downloading the +files to uncompress the data. + +Let's add the entire `data/` dir to DVC instead to a new, Git-backed DVC +project: + +```dvc +$ git init +$ dvc init ... -$ unzip -q ... -$ rm data.zip -$ tree ... -L 3 +$ git commit -m "Initialize DVC project" +$ dvc add data +Adding 'data' to '.gitignore'... +Saving information to 'data.dvc'. ... 
+$ git add data.dvc .gitignore +$ git commit -m "Add 1800 cats and dogs images dataset" ``` +> Refer to +> [Adding a directory example](/doc/command-reference/add#example-directory) and +> [Structure of cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory) +> for more details on what happens under the hood. + +### Problem #2: Dataset partitioning + +Under some contexts, such as distributed storage or distribution (p2p), data +partitioning (i.e. the automatic kind) can be a great tool. Manually separating +data directories however, besides prone to human error, is unnecessary with DVC. + +Let's extract the remaining cats and dogs images from our example, to see what +changes in our data directory: + ```dvc -$ $ dvc get https://github.com/iterative/dataset-registry \ +$ dvc get https://github.com/iterative/dataset-registry \ tutorial/ver/new-labels.zip +$ unzip -q new-labels.zip +$ tree --filelimit 3 +. +├── data +│   ├── train +│   │   ├── cats [1000 entries ...] +│   │   └── dogs [1000 entries ...] +│   └── validation +│   ├── cats [400 entries ...] +│   └── dogs [400 entries ...] +├── data.dvc +└── new-labels.zip + +7 directories, 2 files ``` -## A properly versioned registry +It seems an additional 500 images of cats and another 500 of dogs have been +added to the `data/train` directory. To update the DVC cache with +this merged dataset, we simply need to run `dvc add data` again: +```dvc +$ dvc add data +Computing md5 for a large number of files. This is only done once. +WARNING: Output 'data' of 'data.dvc' changed because it is 'modified' +... +Saving information to 'data.dvc'. ... +``` + +Finally, let's commit the new dataset version to Git, and list the 2 versions: + +```dvc +$ git commit -m "Add 1000 more cats and dogs images to dataset" +$ git log --format="%h %s" +cbcf466 Add 1800 cats and dogs images dataset +5b058a3 Initialize DVC project +``` From 32bbcd0e94fdccdb81a2fae8fba30b345792a1e2 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Thu, 10 Oct 2019 02:01:51 -0500 Subject: [PATCH 08/40] use-cases: fixes and improvements on first draft mainly involving compression and partitioning topics --- static/docs/use-cases/data-registry.md | 49 ++++++++++++++------------ 1 file changed, 26 insertions(+), 23 deletions(-) diff --git a/static/docs/use-cases/data-registry.md b/static/docs/use-cases/data-registry.md index 5be012928b..902e92a22b 100644 --- a/static/docs/use-cases/data-registry.md +++ b/static/docs/use-cases/data-registry.md @@ -2,7 +2,7 @@ In the [Versioning Tutorial](/doc/tutorials/versioning) we use two ZIP files containing parts of a dataset with labeled images of cat and dogs. For -simplicity, these archives are downloaded with `dvc get` from our +simplicity, these compressed archives are downloaded with `dvc get` from our [dataset registry](https://github.com/iterative/dataset-registry), a DVC project hosted on GitHub. @@ -56,10 +56,10 @@ Trusting 3rd party registries should be considered a risk. For illustration purposes, our own [dataset registry](https://github.com/iterative/dataset-registry) contains a poorly handled dataset, in the `tutorial/ver` directory. It contains 2 -[DVC-files](/doc/user-guide/dvc-file-format) that track a couple of ZIP files -(problem #1). Each archive, when extracted, contains the same directory -structure, but with complementary files, that together form a single dataset -(problem #2) of 2000 images of cats and dogs. 
+[DVC-files](/doc/user-guide/dvc-file-format) that track a couple of compressed +ZIP files (problem #1). Each archive, when extracted, contains the same +directory structure, but with complementary files, that together form a single +dataset (problem #2) of 2000 images of cats and dogs. > As mentioned in the introduction, these ZIP files are used as-is for > simplicity in our [Versioning Tutorial](/doc/tutorials/versioning). @@ -90,19 +90,14 @@ $ rm -f data.zip `dvc add`, the command to place existing data under DVC control, supports tracking both files + [and directories](/doc/command-reference/add#example-directory). For this -reason, adding compressed directories to a project is not +reason, adding compressed directory contents to a project is not recommended, especially when the files contained are already compressed (like -typical image file formats). - -While compression can save space for some files, such as tabular data stored in -text files (CSV, TSV, JSON, etc.), versioning compressed files risks storing -repeated files in the cache or -[remote storage](/doc/command-reference/remote), when the dataset is not -partitioned correctly. You will also need an extra step after downloading the -files to uncompress the data. +typical image file formats). Also, uncompressing files after downloading them is +an extra step we may prefer to avoid. -Let's add the entire `data/` dir to DVC instead to a new, Git-backed DVC +Let's add the entire `data/` dir to DVC instead, in a new Git-backed DVC project: ```dvc @@ -121,13 +116,16 @@ $ git commit -m "Add 1800 cats and dogs images dataset" > Refer to > [Adding a directory example](/doc/command-reference/add#example-directory) and > [Structure of cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory) -> for more details on what happens under the hood. +> for more details on what happens under the hood when adding directories. ### Problem #2: Dataset partitioning -Under some contexts, such as distributed storage or distribution (p2p), data -partitioning (i.e. the automatic kind) can be a great tool. Manually separating -data directories however, besides prone to human error, is unnecessary with DVC. +Consistent data partitioning can be very valuable for some applications, such as +distributed storage and distribution (p2p). Versioning parts of a dataset +separately with DVC, however, is an unnecessary complication. This would also +deprive us from neat performance features such as automatic "deduplication" +(avoidance of file repetition) among dataset versions on DVC caches +or [remotes](/doc/command-reference/remote). Let's extract the remaining cats and dogs images from our example, to see what changes in our data directory: @@ -149,11 +147,12 @@ $ tree --filelimit 3 └── new-labels.zip 7 directories, 2 files +$ rm -f new-labels.zip ``` -It seems an additional 500 images of cats and another 500 of dogs have been -added to the `data/train` directory. To update the DVC cache with -this merged dataset, we simply need to run `dvc add data` again: +An additional 500 images of cats and another 500 of dogs have been added to the +corresponding `data/train` directories. To update the DVC cache +with this reconstructed dataset, we simply need to run `dvc add data` again: ```dvc $ dvc add data @@ -164,11 +163,15 @@ Saving information to 'data.dvc'. ... ``` +Note that when adding an updated data directory, DVC only needs to move the new +and changed files to the cache. 
+ Finally, let's commit the new dataset version to Git, and list the 2 versions: ```dvc -$ git commit -m "Add 1000 more cats and dogs images to dataset" +$ git commit -am "Add 1000 more cats and dogs images to dataset" $ git log --format="%h %s" +162b2e7 Add 1000 more cats and dogs images to dataset cbcf466 Add 1800 cats and dogs images dataset 5b058a3 Initialize DVC project ``` From 7a98c302b6c4ad8b3d5c6a99cf0016baa915e097 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Thu, 10 Oct 2019 19:40:02 -0500 Subject: [PATCH 09/40] use-cases: remove Concept header and move its intro to the top, compress example per https://github.com/iterative/dvc.org/pull/679#pullrequestreview-300352295 also https://github.com/iterative/dvc.org/pull/679#pullrequestreview-300349894 --- static/docs/tutorials/versioning.md | 5 ++- static/docs/use-cases/data-registry.md | 58 +++++++++----------------- 2 files changed, 23 insertions(+), 40 deletions(-) diff --git a/static/docs/tutorials/versioning.md b/static/docs/tutorials/versioning.md index 2e0a4c3e2f..6c4df810db 100644 --- a/static/docs/tutorials/versioning.md +++ b/static/docs/tutorials/versioning.md @@ -87,8 +87,9 @@ $ rm -f data.zip > `dvc get` can download data artifacts from any DVC > project hosted on a Git repository into the current working directory > (similar to `wget` but for DVC repositories). In this case we use our own -> [iterative/dataset-registry](https://github.com/iterative/dataset-registry)) -> project as the external data source. +> [dataset registry](https://github.com/iterative/dataset-registry)) project as +> the external data source. (Refer to +> [Data Registry](/doc/use-cases/data-registry) for more info about this setup.) This command downloads and extracts our raw dataset, consisting of 1000 labeled images for training and 800 labeled images for validation. In summary, it's a 43 diff --git a/static/docs/use-cases/data-registry.md b/static/docs/use-cases/data-registry.md index 902e92a22b..ab3abb3fd0 100644 --- a/static/docs/use-cases/data-registry.md +++ b/static/docs/use-cases/data-registry.md @@ -1,17 +1,5 @@ # Data Registry -In the [Versioning Tutorial](/doc/tutorials/versioning) we use two ZIP files -containing parts of a dataset with labeled images of cat and dogs. For -simplicity, these compressed archives are downloaded with `dvc get` from our -[dataset registry](https://github.com/iterative/dataset-registry), a DVC -project hosted on GitHub. - -In this document we'll explain the idea behind **shared data registries**, how -to easily create one with DVC, and the best ways to version datasets based on -the same example above (without ZIP files!) - -## Concept - We developed the `dvc get`, `dvc import`, and `dvc update` commands with the aim to enable reusability of any outputs (raw data, intermediate results, models, etc) between different projects. For example, project A may use @@ -19,7 +7,7 @@ a data file to begin its data [pipeline](/doc/command-reference/pipeline), but project B also requires this same file; Instead of [adding it](/doc/command-reference/add#example-single-file) it to both projects, B can simply import it from A. (This can bring many benefits that we'll explain -a little later.) +soon.) Taking this idea to a useful extreme, we could create a project that is exclusively dedicated to @@ -45,27 +33,25 @@ The advantages of using a data registry are: - Easier to manage **access control** per remote storage configured in a single data registry project. 
-A possible weakness of data registries is that, if the source project or its -remote storage are lost for any reason, several other projects depending on it -may stop being reproducible. So this strategy is best when the registry is owned -and controlled internally by the same team as the projects that employ it. +A possible weakness of shared data registries is that, if the source project or +its remote storage are lost for any reason, several other projects depending on +it may stop being reproducible. So this strategy is best when the registry is +owned and controlled internally by the same team as the projects that employ it. Trusting 3rd party registries should be considered a risk. ## Example: A sub-optimal approach -For illustration purposes, our own -[dataset registry](https://github.com/iterative/dataset-registry) contains a -poorly handled dataset, in the `tutorial/ver` directory. It contains 2 -[DVC-files](/doc/user-guide/dvc-file-format) that track a couple of compressed -ZIP files (problem #1). Each archive, when extracted, contains the same +In the [Versioning Tutorial](/doc/tutorials/versioning) we use two ZIP files +containing parts of a dataset with labeled images of cat and dogs. For +simplicity, these compressed archives are downloaded with `dvc get` from our +[dataset registry](https://github.com/iterative/dataset-registry), a DVC +project hosted on GitHub. Each archive, when extracted, contains the same directory structure, but with complementary files, that together form a single -dataset (problem #2) of 2000 images of cats and dogs. - -> As mentioned in the introduction, these ZIP files are used as-is for -> simplicity in our [Versioning Tutorial](/doc/tutorials/versioning). +dataset of 2800 images of cats and dogs. -Let's download and extract the first half of this dataset (in an empty -directory) to better understand its structure: +Let's see a better approach to versioning this same dataset with DVC. First, we +download and extract the first half of this dataset (in an empty directory) to +better understand its structure: ```dvc $ dvc get https://github.com/iterative/dataset-registry \ @@ -90,12 +76,11 @@ $ rm -f data.zip `dvc add`, the command to place existing data under DVC control, supports tracking both files - [and directories](/doc/command-reference/add#example-directory). For this -reason, adding compressed directory contents to a project is not -recommended, especially when the files contained are already compressed (like -typical image file formats). Also, uncompressing files after downloading them is -an extra step we may prefer to avoid. +reason, adding compressed archives to a project is not recommended, +especially when the files contained are already compressed (like typical image +file formats). Also, uncompressing files after downloading them is an extra step +we may prefer to avoid. Let's add the entire `data/` dir to DVC instead, in a new Git-backed DVC project: @@ -122,10 +107,7 @@ $ git commit -m "Add 1800 cats and dogs images dataset" Consistent data partitioning can be very valuable for some applications, such as distributed storage and distribution (p2p). Versioning parts of a dataset -separately with DVC, however, is an unnecessary complication. This would also -deprive us from neat performance features such as automatic "deduplication" -(avoidance of file repetition) among dataset versions on DVC caches -or [remotes](/doc/command-reference/remote). +separately with DVC, however, is an unnecessary complication. 
Let's extract the remaining cats and dogs images from our example, to see what changes in our data directory: @@ -166,7 +148,7 @@ Saving information to 'data.dvc'. Note that when adding an updated data directory, DVC only needs to move the new and changed files to the cache. -Finally, let's commit the new dataset version to Git, and list the 2 versions: +Finally, let's commit the new dataset version to Git, and list all the commits: ```dvc $ git commit -am "Add 1000 more cats and dogs images to dataset" From de6103dd19e50d0e6e8e6a3c99a8670683f92f49 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Fri, 11 Oct 2019 17:12:28 -0500 Subject: [PATCH 10/40] use-cases: Rewrite second half of new case study draft --- src/Documentation/glossary.js | 5 +- static/docs/get-started/add-files.md | 7 +- static/docs/get-started/import-data.md | 12 ++-- static/docs/tutorials/pipelines.md | 7 +- static/docs/tutorials/versioning.md | 8 +-- static/docs/use-cases/data-registry.md | 96 ++++++++++++-------------- 6 files changed, 64 insertions(+), 71 deletions(-) diff --git a/src/Documentation/glossary.js b/src/Documentation/glossary.js index 654e50d25a..58065740ed 100644 --- a/src/Documentation/glossary.js +++ b/src/Documentation/glossary.js @@ -59,8 +59,9 @@ commands. They represent files or directories from external sources. name: 'Output', match: ['output', 'outputs'], desc: ` -A file or a directory that is under DVC control. See \`dvc add\` \`dvc run\`, -\`dvc import\`, \`dvc import-url\` commands. +A file or a directory that is under DVC control, recorded in the \`outs\` +section of a DVC-file. See \`dvc add\` \`dvc run\`, \`dvc import\`, +\`dvc import-url\` commands. ` }, { diff --git a/static/docs/get-started/add-files.md b/static/docs/get-started/add-files.md index 02b6757470..1ccdda1be2 100644 --- a/static/docs/get-started/add-files.md +++ b/static/docs/get-started/add-files.md @@ -11,10 +11,11 @@ $ dvc get https://github.com/iterative/dataset-registry \ ``` > `dvc get` can download data artifacts from any DVC -> project hosted on a Git repository into the current working directory -> (similar to `wget` but for DVC repositories). In this case we use our own +> project hosted on a Git repository (similar to `wget` but for DVC +> repositories). In this case we use our own > [iterative/dataset-registry](https://github.com/iterative/dataset-registry)) -> project as the external data source. +> project as the external data source. (Refer to +> [Data Registry](/doc/use-cases/data-registry) for more info about this setup.) To take a file (or a directory) under DVC control just run `dvc add` on it. For example: diff --git a/static/docs/get-started/import-data.md b/static/docs/get-started/import-data.md index 7010ab7155..79d28b5a55 100644 --- a/static/docs/get-started/import-data.md +++ b/static/docs/get-started/import-data.md @@ -27,12 +27,12 @@ $ dvc import https://github.com/iterative/dataset-registry \ get-started/data.xml ``` -This downloads `data.xml` from our -[dataset-registry](https://github.com/iterative/dataset-registry) DVC project -into the current working directory, adds it to `.gitignore`, and creates the -`data.xml.dvc` [DVC-file](/doc/user-guide/dvc-file-format) to track changes in -the source data. 
With _imports_, we can use `dvc update` to check for changes in -the external data source before [reproducing](/doc/get-started/reproduce) any +This downloads `data.xml` from our own +[iterative/dataset-registry](https://github.com/iterative/dataset-registry) +project into the current working directory, adds it to `.gitignore`, and creates +the `data.xml.dvc` [DVC-file](/doc/user-guide/dvc-file-format) to track changes +in the source data. With _imports_, we can use `dvc update` to check for changes +in the external data source before [reproducing](/doc/get-started/reproduce) any pipeline that depends on this data.
diff --git a/static/docs/tutorials/pipelines.md b/static/docs/tutorials/pipelines.md index f362a34bb8..272a85effd 100644 --- a/static/docs/tutorials/pipelines.md +++ b/static/docs/tutorials/pipelines.md @@ -38,11 +38,12 @@ $ git add code/ $ git commit -m "Download and add code to new Git repo" ``` -> `dvc get` can download data artifacts from any DVC project hosted on a Git -> repository into the current working directory (similar to `wget` but for DVC +> `dvc get` can download data artifacts from any DVC +> project hosted on a Git repository (similar to `wget` but for DVC > repositories). In this case we use our own > [iterative/dataset-registry](https://github.com/iterative/dataset-registry)) -> project as the external data source. +> project as the external data source. (Refer to +> [Data Registry](/doc/use-cases/data-registry) for more info about this setup.) Now let's install the requirements. But before we do that, we **strongly** recommend creating a virtual environment with a tool such as diff --git a/static/docs/tutorials/versioning.md b/static/docs/tutorials/versioning.md index 6c4df810db..47d24c32e4 100644 --- a/static/docs/tutorials/versioning.md +++ b/static/docs/tutorials/versioning.md @@ -85,10 +85,10 @@ $ rm -f data.zip ``` > `dvc get` can download data artifacts from any DVC -> project hosted on a Git repository into the current working directory -> (similar to `wget` but for DVC repositories). In this case we use our own -> [dataset registry](https://github.com/iterative/dataset-registry)) project as -> the external data source. (Refer to +> project hosted on a Git repository (similar to `wget` but for DVC +> repositories). In this case we use our own +> [iterative/dataset-registry](https://github.com/iterative/dataset-registry)) +> project as the external data source. (Refer to > [Data Registry](/doc/use-cases/data-registry) for more info about this setup.) This command downloads and extracts our raw dataset, consisting of 1000 labeled diff --git a/static/docs/use-cases/data-registry.md b/static/docs/use-cases/data-registry.md index ab3abb3fd0..2c34e6fd5c 100644 --- a/static/docs/use-cases/data-registry.md +++ b/static/docs/use-cases/data-registry.md @@ -1,13 +1,12 @@ # Data Registry We developed the `dvc get`, `dvc import`, and `dvc update` commands with the aim -to enable reusability of any outputs (raw data, intermediate +to enable reusability of any data artifacts (raw data, intermediate results, models, etc) between different projects. For example, project A may use a data file to begin its data [pipeline](/doc/command-reference/pipeline), but project B also requires this same file; Instead of [adding it](/doc/command-reference/add#example-single-file) it to both projects, -B can simply import it from A. (This can bring many benefits that we'll explain -soon.) +B can simply import it from A. Taking this idea to a useful extreme, we could create a project that is exclusively dedicated to @@ -15,9 +14,9 @@ that is exclusively dedicated to datasets (or any kind of large files) – by mainly using `dvc add` to build it. Such a project would not have [stages](/doc/command-reference/run), but its data files may be updated manually as they evolve. Other projects can then share -these files by downloading (`dvc get`) or importing (`dvc import`) them for use -in different data processes – and these don't even have to be _DVC projects_, as -`dvc get` works anywhere in your system. 
+these artifacts by downloading (`dvc get`) or importing (`dvc import`) them for +use in different data processes – and these don't even have to be _DVC +projects_, as `dvc get` works anywhere in your system. The advantages of using a data registry are: @@ -33,25 +32,33 @@ The advantages of using a data registry are: - Easier to manage **access control** per remote storage configured in a single data registry project. -A possible weakness of shared data registries is that, if the source project or -its remote storage are lost for any reason, several other projects depending on -it may stop being reproducible. So this strategy is best when the registry is -owned and controlled internally by the same team as the projects that employ it. -Trusting 3rd party registries should be considered a risk. +A possible risk of shared data registries is that, if the source project or its +remote storage are lost for any reason, several other projects depending on it +may stop being reproducible. So this strategy is best when the registry is owned +and controlled internally by the same team as the projects that employ it. ## Example: A sub-optimal approach -In the [Versioning Tutorial](/doc/tutorials/versioning) we use two ZIP files -containing parts of a dataset with labeled images of cat and dogs. For -simplicity, these compressed archives are downloaded with `dvc get` from our -[dataset registry](https://github.com/iterative/dataset-registry), a DVC -project hosted on GitHub. Each archive, when extracted, contains the same -directory structure, but with complementary files, that together form a single -dataset of 2800 images of cats and dogs. +In the [Versioning Tutorial](/doc/tutorials/versioning) we use a ZIP file +containing a dataset with images of cats and dogs, and later on we get a second +archive to update the dataset with more images. For simplicity, these compressed +archives are downloaded with `dvc get` from our own +[iterative/dataset-registry](https://github.com/iterative/dataset-registry)), a +DVC project hosted on GitHub. These data files can be found as +outputs of 2 separate +[DVC-files](/doc/user-guide/dvc-files-and-directories) in the `tutorial/ver` +directory. -Let's see a better approach to versioning this same dataset with DVC. First, we -download and extract the first half of this dataset (in an empty directory) to -better understand its structure: +There are a few possible problems with the way this dataset is stored (as 2 +parts, in compressed archives). One is that dataset partitioning is complicated. +It can cause data duplication and other hurdles, so it's best to avoid. Another +issue is that storing file archives requires the extra steps of bundling and +extracting them. The data compression also raises questions in this approach. + +## Example: Better dataset versioning + +Let's download and extract the first archive we discussed above (in an empty +directory) to visualize the structure of the dataset: ```dvc $ dvc get https://github.com/iterative/dataset-registry \ @@ -72,24 +79,16 @@ $ tree --filelimit 3 $ rm -f data.zip ``` -### Problem #1: Compressed data files - -`dvc add`, the command to place existing data under DVC control, supports -tracking both files -[and directories](/doc/command-reference/add#example-directory). For this -reason, adding compressed archives to a project is not recommended, -especially when the files contained are already compressed (like typical image -file formats). 
Also, uncompressing files after downloading them is an extra step -we may prefer to avoid. - -Let's add the entire `data/` dir to DVC instead, in a new Git-backed DVC -project: +Instead of creating an archive containing the `data/` directory, we can simply +put it under DVC control as-is! This is done with the `dvc add` command, and +this first version can be saved with Git: ```dvc -$ git init -$ dvc init +$ git init # Initialize Git repository +$ dvc init # Initialize DVC project ... $ git commit -m "Initialize DVC project" +... $ dvc add data Adding 'data' to '.gitignore'... Saving information to 'data.dvc'. @@ -103,13 +102,7 @@ $ git commit -m "Add 1800 cats and dogs images dataset" > [Structure of cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory) > for more details on what happens under the hood when adding directories. -### Problem #2: Dataset partitioning - -Consistent data partitioning can be very valuable for some applications, such as -distributed storage and distribution (p2p). Versioning parts of a dataset -separately with DVC, however, is an unnecessary complication. - -Let's extract the remaining cats and dogs images from our example, to see what +Let's add the remaining cats and dogs images from the archive, to see what changes in our data directory: ```dvc @@ -132,9 +125,10 @@ $ tree --filelimit 3 $ rm -f new-labels.zip ``` -An additional 500 images of cats and another 500 of dogs have been added to the -corresponding `data/train` directories. To update the DVC cache -with this reconstructed dataset, we simply need to run `dvc add data` again: +Now that an additional 500 images of each kind have been added to their +corresponding subdirectories, we'll want to save the updates with DVC. To do +this, we simply need to use `dvc add` again, commit the new dataset version with +Git: ```dvc $ dvc add data @@ -143,17 +137,13 @@ WARNING: Output 'data' of 'data.dvc' changed because it is 'modified' ... Saving information to 'data.dvc'. ... -``` - -Note that when adding an updated data directory, DVC only needs to move the new -and changed files to the cache. - -Finally, let's commit the new dataset version to Git, and list all the commits: - -```dvc $ git commit -am "Add 1000 more cats and dogs images to dataset" $ git log --format="%h %s" 162b2e7 Add 1000 more cats and dogs images to dataset cbcf466 Add 1800 cats and dogs images dataset 5b058a3 Initialize DVC project ``` + +Note that when adding an updated data directory, DVC only needs to move the new +and changed files to the cache, as all the previous ones were +already there after the initial use of `dvc add`. From 79f974b3b580f0eca82a31eeabe8b79c648a572a Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Fri, 11 Oct 2019 22:02:34 -0500 Subject: [PATCH 11/40] use-cases: full first draft of new data registry case --- static/docs/use-cases/data-registry.md | 89 +++++++++++++++++++------- 1 file changed, 66 insertions(+), 23 deletions(-) diff --git a/static/docs/use-cases/data-registry.md b/static/docs/use-cases/data-registry.md index 2c34e6fd5c..89efc2dd5a 100644 --- a/static/docs/use-cases/data-registry.md +++ b/static/docs/use-cases/data-registry.md @@ -42,7 +42,7 @@ and controlled internally by the same team as the projects that employ it. In the [Versioning Tutorial](/doc/tutorials/versioning) we use a ZIP file containing a dataset with images of cats and dogs, and later on we get a second archive to update the dataset with more images. 
For simplicity, these compressed -archives are downloaded with `dvc get` from our own +archives are downloaded from our own [iterative/dataset-registry](https://github.com/iterative/dataset-registry)), a DVC project hosted on GitHub. These data files can be found as outputs of 2 separate @@ -50,10 +50,13 @@ archives are downloaded with `dvc get` from our own directory. There are a few possible problems with the way this dataset is stored (as 2 -parts, in compressed archives). One is that dataset partitioning is complicated. -It can cause data duplication and other hurdles, so it's best to avoid. Another -issue is that storing file archives requires the extra steps of bundling and -extracting them. The data compression also raises questions in this approach. +parts, in compressed archives). One is that dataset partitioning is an +unnecessary complication in this case; It can cause data duplication and other +hurdles. Another issue is that storing file archives requires the extra steps of +bundling and extracting them. The data compression also raises questions in this +approach. Most importantly, both files are parts of the same dataset, however +they are tracked by 2 different DVC-files, instead of as 2 versions of the same +one, as we'll explain next. ## Example: Better dataset versioning @@ -80,8 +83,8 @@ $ rm -f data.zip ``` Instead of creating an archive containing the `data/` directory, we can simply -put it under DVC control as-is! This is done with the `dvc add` command, and -this first version can be saved with Git: +put it under DVC control as-is! This is done with the `dvc add` command, which +accepts entire directories. We can then save this first version using Git: ```dvc $ git init # Initialize Git repository @@ -102,8 +105,8 @@ $ git commit -m "Add 1800 cats and dogs images dataset" > [Structure of cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory) > for more details on what happens under the hood when adding directories. -Let's add the remaining cats and dogs images from the archive, to see what -changes in our data directory: +Let's add the remaining images from the second archive, and see what changes in +our data directory: ```dvc $ dvc get https://github.com/iterative/dataset-registry \ @@ -115,20 +118,14 @@ $ tree --filelimit 3 │   ├── train │   │   ├── cats [1000 entries ...] │   │   └── dogs [1000 entries ...] -│   └── validation -│   ├── cats [400 entries ...] -│   └── dogs [400 entries ...] -├── data.dvc -└── new-labels.zip - -7 directories, 2 files +... $ rm -f new-labels.zip ``` -Now that an additional 500 images of each kind have been added to their -corresponding subdirectories, we'll want to save the updates with DVC. To do -this, we simply need to use `dvc add` again, commit the new dataset version with -Git: +An additional 500 images of each kind have been added to the corresponding +subdirectories, and we probably want to save this update in our +project. To do this, we simply run `dvc add` again on the same +directory, and commit the new version with Git: ```dvc $ dvc add data @@ -144,6 +141,52 @@ cbcf466 Add 1800 cats and dogs images dataset 5b058a3 Initialize DVC project ``` -Note that when adding an updated data directory, DVC only needs to move the new -and changed files to the cache, as all the previous ones were -already there after the initial use of `dvc add`. +As shown by the output of `git log` above, now we have 2 different versions of +the project which we can switch between at any time. 
(Refer to `dvc checkout`.) +Both have a single `data.dvc` +[DVC-file](/doc/user-guide/dvc-files-and-directories), but the first version +tacks a `data/` directory with the first 1800 images, while the latest one has +the entire 2800 set. + +Note that when adding an updated data directory, DVC only needs to move the +_delta_ (new and changed files) to the cache, as the unchanged +files were already placed there by previous use(s) of `dvc add`. This optimizes +file system performance, and avoids file duplication in caches or +[remotes](/doc/command-reference/remote). + +This strategy enables exact reproducibility of old experiments (originally +performed with the initial dataset). It also provides the ability to track the +history of changes to datasets, via Git. + +## Taking full advantage of data registries + +If we want to keep the connection between the current project and +an external DVC repository (e.g. A data registry)) we would use `dvc import` +instead of `dvc get`. (Please refer to the command reference.) Let's try this +with our example +[dataset registry](https://github.com/iterative/dataset-registry), where we +already registered the dataset, in the `use-cases` directory (with +[2 versions](https://github.com/iterative/dataset-registry/commits/master/use-cases), +as described in the previous section): + +```dvc +dvc import --rev 0547f58 \ + git@github.com:iterative/dataset-registry.git \ + use-cases/data +``` + +This downloads the `data/` directory from Git version +[0547f58](https://github.com/iterative/dataset-registry/tree/0547f58), which +corresponds to the 1800 image dataset, and creates a local `data.dvc` _import +stage_. Unlike typical DVC-files, this one records the source (project) of the +imported data. This connection between projects allows us to check for updates +in the data, using `dvc update`: + +```dvc +$ dvc update data.dvc +``` + +This brings the `data/` directory up to its +[latest version](https://github.com/iterative/dataset-registry/commit/99d1cdb) +with 2800 images. Note that DVC only downloads the _delta_ when updating +imports. From a16ac703981d6e1169e595000a68965992f482b8 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Wed, 16 Oct 2019 20:20:28 -0500 Subject: [PATCH 12/40] use-cases: add note about read-only remotes for data registries per https://github.com/iterative/dvc.org/pull/679#pullrequestreview-301023910 --- static/docs/use-cases/data-registry.md | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/static/docs/use-cases/data-registry.md b/static/docs/use-cases/data-registry.md index 89efc2dd5a..67ed50982b 100644 --- a/static/docs/use-cases/data-registry.md +++ b/static/docs/use-cases/data-registry.md @@ -27,10 +27,11 @@ The advantages of using a data registry are: the same data versions. - Projects that import data from the registry don't need to push these large files to their own [remotes](/doc/command-reference/remote), **saving space** - on storage – they may not even need to a remote at all, using only the local + on storage – they may not even need a remote at all, using only their local cache. -- Easier to manage **access control** per remote storage configured in a single - data registry project. +- It may be easier to manage **access control** for remote storage configured in + a single data registry project. A possible setup would use a read-only remote, + so other projects can't affect each other accidentally. 
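For instance, a rough sketch of such a setup (the bucket name is hypothetical, and the read-only behavior comes from the storage provider's access policy rather than from DVC itself):

```dvc
# In the data registry project, maintainers with write credentials
# configure the remote and upload the cached data:
$ dvc remote add -d storage s3://example-dataset-registry
$ dvc push

# Consumer projects only ever read from the registry, for example:
$ dvc get https://github.com/iterative/dataset-registry use-cases/data
```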
A possible risk of shared data registries is that, if the source project or its remote storage are lost for any reason, several other projects depending on it From 0f7b1a16a08bafd663372dc59bea15d38a86fe24 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Wed, 16 Oct 2019 20:22:39 -0500 Subject: [PATCH 13/40] use-cases: merge and compress example H2s with new title per https://github.com/iterative/dvc.org/pull/679#pullrequestreview-301024121 --- static/docs/use-cases/data-registry.md | 27 ++++++++++++-------------- 1 file changed, 12 insertions(+), 15 deletions(-) diff --git a/static/docs/use-cases/data-registry.md b/static/docs/use-cases/data-registry.md index 67ed50982b..8ee62df47d 100644 --- a/static/docs/use-cases/data-registry.md +++ b/static/docs/use-cases/data-registry.md @@ -38,28 +38,25 @@ remote storage are lost for any reason, several other projects depending on it may stop being reproducible. So this strategy is best when the registry is owned and controlled internally by the same team as the projects that employ it. -## Example: A sub-optimal approach +## Implementing proper data versioning In the [Versioning Tutorial](/doc/tutorials/versioning) we use a ZIP file -containing a dataset with images of cats and dogs, and later on we get a second -archive to update the dataset with more images. For simplicity, these compressed -archives are downloaded from our own +containing images of cats and dogs, and then we update the dataset with more +images from a second ZIP file. These compressed archives are downloaded from our +own [iterative/dataset-registry](https://github.com/iterative/dataset-registry)), a -DVC project hosted on GitHub. These data files can be found as +DVC project hosted on GitHub. They can be found as outputs of 2 separate [DVC-files](/doc/user-guide/dvc-files-and-directories) in the `tutorial/ver` directory. -There are a few possible problems with the way this dataset is stored (as 2 -parts, in compressed archives). One is that dataset partitioning is an -unnecessary complication in this case; It can cause data duplication and other -hurdles. Another issue is that storing file archives requires the extra steps of -bundling and extracting them. The data compression also raises questions in this -approach. Most importantly, both files are parts of the same dataset, however -they are tracked by 2 different DVC-files, instead of as 2 versions of the same -one, as we'll explain next. - -## Example: Better dataset versioning +There are a few possible problems with the way this dataset is stored (in 2 +parts, as compressed archives). One is that dataset partitioning is an +unnecessary complication; It can cause data duplication and other hurdles. +Another issue is that extra steps are needed to bundle and extract file +archives. The data compression also raises questions. But most importantly, this +single dataset is tracked by 2 different DVC-files, instead of 2 versions of the +same one, as we'll explain next. Let's download and extract the first archive we discussed above (in an empty directory) to visualize the structure of the dataset: From b9d8475fac86c2989eaa51b7744b13c970be766c Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Thu, 17 Oct 2019 02:19:03 -0500 Subject: [PATCH 14/40] use-cases: shorten "proper data versioning" section of data-registry, ... 
add note about file deduplication in cache, and misc related updates --- static/docs/command-reference/cache/index.md | 6 +- static/docs/get-started/configure.md | 4 +- static/docs/use-cases/data-registry.md | 72 +++++++------------ .../user-guide/dvc-files-and-directories.md | 19 ++--- 4 files changed, 41 insertions(+), 60 deletions(-) diff --git a/static/docs/command-reference/cache/index.md b/static/docs/command-reference/cache/index.md index 90b9cd712c..2ac04dc3ab 100644 --- a/static/docs/command-reference/cache/index.md +++ b/static/docs/command-reference/cache/index.md @@ -21,9 +21,9 @@ including the default cache directory. The cache is where your data files, models, etc (anything you want to version with DVC) are actually stored. The corresponding files you see in the -workspace simply link to the ones in cache. (See -`dvc config cache`, `type` config option, for more information on file links on -different platforms.) +workspace can simply link to the ones in cache. (Refer to +[File link types](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache) +for more information on file links on different platforms.) > For more cache-related configuration options refer to `dvc config cache`. diff --git a/static/docs/get-started/configure.md b/static/docs/get-started/configure.md index e64e354389..8248de19ea 100644 --- a/static/docs/get-started/configure.md +++ b/static/docs/get-started/configure.md @@ -31,8 +31,8 @@ $ git commit .dvc/config -m "Configure local remote" > use DVC. For most [use cases](/doc/use-cases), other "more remote" types of > remotes will be required. -Adding a remote should be specified by both its type prefix (protocol) and its -path. DVC currently supports seven types of remotes: +Adding a remote should be specified by both its type (protocol) and its path. +DVC currently supports seven types of remotes: - `local`: Local Directory - `s3`: Amazon Simple Storage Service diff --git a/static/docs/use-cases/data-registry.md b/static/docs/use-cases/data-registry.md index 8ee62df47d..6ace4d4371 100644 --- a/static/docs/use-cases/data-registry.md +++ b/static/docs/use-cases/data-registry.md @@ -65,19 +65,17 @@ directory) to visualize the structure of the dataset: $ dvc get https://github.com/iterative/dataset-registry \ tutorial/ver/data.zip $ unzip -q data.zip +$ rm -f data.zip $ tree --filelimit 3 . -├── data -│   ├── train -│   │   ├── cats [500 entries ...] -│   │   └── dogs [500 entries ...] -│   └── validation -│   ├── cats [400 entries ...] -│   └── dogs [400 entries ...] -└── data.zip - -7 directories, 1 file -$ rm -f data.zip +└── data +    ├── train +    │   ├── cats [500 entries ...] +    │   └── dogs [500 entries ...] +    └── validation +    ├── cats [400 entries ...] +    └── dogs [400 entries ...] +... ``` Instead of creating an archive containing the `data/` directory, we can simply @@ -103,13 +101,16 @@ $ git commit -m "Add 1800 cats and dogs images dataset" > [Structure of cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory) > for more details on what happens under the hood when adding directories. -Let's add the remaining images from the second archive, and see what changes in -our data directory: +Let's add the remaining images from the second archive (500 of each kind), and +see what changes in our data directory. 
To save the updates in our +project, we simply run `dvc add` again on the same directory, and +commit the new version with Git: ```dvc $ dvc get https://github.com/iterative/dataset-registry \ tutorial/ver/new-labels.zip $ unzip -q new-labels.zip +$ rm -f new-labels.zip $ tree --filelimit 3 . ├── data @@ -117,15 +118,6 @@ $ tree --filelimit 3 │   │   ├── cats [1000 entries ...] │   │   └── dogs [1000 entries ...] ... -$ rm -f new-labels.zip -``` - -An additional 500 images of each kind have been added to the corresponding -subdirectories, and we probably want to save this update in our -project. To do this, we simply run `dvc add` again on the same -directory, and commit the new version with Git: - -```dvc $ dvc add data Computing md5 for a large number of files. This is only done once. WARNING: Output 'data' of 'data.dvc' changed because it is 'modified' @@ -133,37 +125,25 @@ WARNING: Output 'data' of 'data.dvc' changed because it is 'modified' Saving information to 'data.dvc'. ... $ git commit -am "Add 1000 more cats and dogs images to dataset" -$ git log --format="%h %s" -162b2e7 Add 1000 more cats and dogs images to dataset -cbcf466 Add 1800 cats and dogs images dataset -5b058a3 Initialize DVC project ``` -As shown by the output of `git log` above, now we have 2 different versions of -the project which we can switch between at any time. (Refer to `dvc checkout`.) -Both have a single `data.dvc` -[DVC-file](/doc/user-guide/dvc-files-and-directories), but the first version -tacks a `data/` directory with the first 1800 images, while the latest one has -the entire 2800 set. - -Note that when adding an updated data directory, DVC only needs to move the -_delta_ (new and changed files) to the cache, as the unchanged -files were already placed there by previous use(s) of `dvc add`. This optimizes -file system performance, and avoids file duplication in caches or -[remotes](/doc/command-reference/remote). +> Note that when adding updates to a tracked data directory, DVC only moves new +> and changed files to the cache. This optimizes file system performance, and +> avoids file duplication in cache and [remotes](/doc/command-reference/remote). -This strategy enables exact reproducibility of old experiments (originally -performed with the initial dataset). It also provides the ability to track the -history of changes to datasets, via Git. +Done! We now have a straightforward dataset stored as a full directory in our +data registry. Its first version tracks 1800 images, while the updated second +version tracks 1000 more. This strategy enables easy reproducibility of any +experiments (matching their dataset versions, see `dvc checkout`). It also +provides the ability to track the history of changes to the dataset with Git. ## Taking full advantage of data registries If we want to keep the connection between the current project and an external DVC repository (e.g. A data registry)) we would use `dvc import` -instead of `dvc get`. (Please refer to the command reference.) Let's try this -with our example +instead of `dvc get`. 
Let's try this with our example [dataset registry](https://github.com/iterative/dataset-registry), where we -already registered the dataset, in the `use-cases` directory (with +already registered the dataset in the `use-cases` directory (with [2 versions](https://github.com/iterative/dataset-registry/commits/master/use-cases), as described in the previous section): @@ -186,5 +166,5 @@ $ dvc update data.dvc This brings the `data/` directory up to its [latest version](https://github.com/iterative/dataset-registry/commit/99d1cdb) -with 2800 images. Note that DVC only downloads the _delta_ when updating -imports. +with 2800 images. Note that DVC only downloads new and changed files when +updating imports. diff --git a/static/docs/user-guide/dvc-files-and-directories.md b/static/docs/user-guide/dvc-files-and-directories.md index be43dfb1f1..0b67384da4 100644 --- a/static/docs/user-guide/dvc-files-and-directories.md +++ b/static/docs/user-guide/dvc-files-and-directories.md @@ -43,19 +43,20 @@ operation: ## Structure of cache directory -There are two ways in which the data is stored in cache. It depends -on whether the data in question is a single file (eg. `data.csv`) or a directory -of files. +There are two ways in which the data is stored in cache: As a +single file (eg. `data.csv`), or a directory of files. For the first case, we calculate the file's checksum, a 32 characters long string (usually MD5). The first two characters are used to name the directory -inside `.dvc/cache` and the rest become the file name of the cached file. For +inside `.dvc/cache`, and the rest become the file name of the cached file. For example, if a data file `Posts.xml.zip` has checksum `ec1d2935f811b77cc49b031b999cbf17`, its cache entry will be -`.dvc/cache/ec/1d2935f811b77cc49b031b999cbf17` locally. If pushed to -[remote storage](/doc/command-reference/remote), its location will be -`/ec/1d2935f811b77cc49b031b999cbf17`, where prefix is the name of the -DVC remote. +`.dvc/cache/ec/1d2935f811b77cc49b031b999cbf17` locally. + +> **Note!** File checksums are calculated from file contents only. 2 or more +> files with different names but the same contents can exist in the workspace +> and be tracked by DVC, but only one copy can be stored in the cache! This +> helps avoid data duplication in cache and remotes. For the second case, let us consider a directory with 2 images. @@ -95,7 +96,7 @@ $ tree .dvc/cache     └── 0b40427ee0998e9802335d98f08cd98f ``` -The cache file with `.dir` extension is a special text file that stores the +The cache file with `.dir` extension is a special text file that records the mapping of files in the `data/` directory (as a JSON array), along with their checksums. The other two cache files are the files inside `data/`. A typical `.dir` cache file looks like this: From 5112f127d4deae3c85fe1a93df608330dd9e83ce Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Thu, 17 Oct 2019 13:31:17 -0500 Subject: [PATCH 15/40] use-cases: address feedback in #679... Per https://github.com/iterative/dvc.org/pull/679#pullrequestreview-303045500 --- static/docs/use-cases/data-registry.md | 19 +++++++++---------- 1 file changed, 9 insertions(+), 10 deletions(-) diff --git a/static/docs/use-cases/data-registry.md b/static/docs/use-cases/data-registry.md index 6ace4d4371..28abc825ee 100644 --- a/static/docs/use-cases/data-registry.md +++ b/static/docs/use-cases/data-registry.md @@ -20,17 +20,16 @@ projects_, as `dvc get` works anywhere in your system. 
The advantages of using a data registry are: -- Tracked files can be safely stored in a **centralized** remote location, with - the ability to create any amount of distributed copies on other remote - storage. -- Several projects can **share** the same files, trusting that everyone is using - the same data versions. +- Tracked data is stored in a **centralized** remote location, with the ability + to create distributed copies on other remotes. +- Several projects can **share** the same files, guaranteeing that everyone has + access to the same data versions. - Projects that import data from the registry don't need to push these large files to their own [remotes](/doc/command-reference/remote), **saving space** on storage – they may not even need a remote at all, using only their local cache. -- It may be easier to manage **access control** for remote storage configured in - a single data registry project. A possible setup would use a read-only remote, +- Its easier to manage **access control** for remote storage configured in a + single data registry project. A possible setup would use a read-only remote, so other projects can't affect each other accidentally. A possible risk of shared data registries is that, if the source project or its @@ -89,10 +88,10 @@ $ dvc init # Initialize DVC project $ git commit -m "Initialize DVC project" ... $ dvc add data -Adding 'data' to '.gitignore'... +... Saving information to 'data.dvc'. ... -$ git add data.dvc .gitignore +$ git add data.dvc .gitignore $ git commit -m "Add 1800 cats and dogs images dataset" ``` @@ -119,7 +118,7 @@ $ tree --filelimit 3 │   │   └── dogs [1000 entries ...] ... $ dvc add data -Computing md5 for a large number of files. This is only done once. +... WARNING: Output 'data' of 'data.dvc' changed because it is 'modified' ... Saving information to 'data.dvc'. From bdab21bd0344b23994ff34229c011d2f506fc1e9 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Mon, 21 Oct 2019 20:25:17 -0500 Subject: [PATCH 16/40] use-cases: add general benefits to the list in data-registry per https://github.com/iterative/dvc.org/pull/679#pullrequestreview-303043210 --- static/docs/use-cases/data-registry.md | 13 +++++++++---- 1 file changed, 9 insertions(+), 4 deletions(-) diff --git a/static/docs/use-cases/data-registry.md b/static/docs/use-cases/data-registry.md index 28abc825ee..cf1f68b121 100644 --- a/static/docs/use-cases/data-registry.md +++ b/static/docs/use-cases/data-registry.md @@ -23,14 +23,19 @@ The advantages of using a data registry are: - Tracked data is stored in a **centralized** remote location, with the ability to create distributed copies on other remotes. - Several projects can **share** the same files, guaranteeing that everyone has - access to the same data versions. + access to the same data versions. See + [Share Data and Model Files](/doc/use-cases/share-data-and-model-files) for + more information. - Projects that import data from the registry don't need to push these large files to their own [remotes](/doc/command-reference/remote), **saving space** on storage – they may not even need a remote at all, using only their local cache. -- Its easier to manage **access control** for remote storage configured in a - single data registry project. A possible setup would use a read-only remote, - so other projects can't affect each other accidentally. +- DVC data registries can handle multiple versions of data and ML modes with a + familiar CLI. 
See + [Data and Model Files Versioning](/doc/use-cases/data-and-model-files-versioning) + for more information. +- DVC data registries are versioned with Git, so you can always track the + history of the project the same as you manage your source code repository. A possible risk of shared data registries is that, if the source project or its remote storage are lost for any reason, several other projects depending on it From f5a66d0909c551f114d9a6a00c4d76fce3a78d83 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Mon, 21 Oct 2019 22:06:32 -0500 Subject: [PATCH 17/40] use-cases: simplify 2nd half of data-registry per https://github.com/iterative/dvc.org/pull/679#discussion_r334254796 and https://github.com/iterative/dvc.org/pull/679#pullrequestreview-303045500 --- static/docs/use-cases/data-registry.md | 157 ++++++++----------------- 1 file changed, 52 insertions(+), 105 deletions(-) diff --git a/static/docs/use-cases/data-registry.md b/static/docs/use-cases/data-registry.md index cf1f68b121..231f98edd2 100644 --- a/static/docs/use-cases/data-registry.md +++ b/static/docs/use-cases/data-registry.md @@ -42,133 +42,80 @@ remote storage are lost for any reason, several other projects depending on it may stop being reproducible. So this strategy is best when the registry is owned and controlled internally by the same team as the projects that employ it. -## Implementing proper data versioning +## Using properly versioned registries In the [Versioning Tutorial](/doc/tutorials/versioning) we use a ZIP file containing images of cats and dogs, and then we update the dataset with more images from a second ZIP file. These compressed archives are downloaded from our own [iterative/dataset-registry](https://github.com/iterative/dataset-registry)), a -DVC project hosted on GitHub. They can be found as +DVC project hosted on GitHub using `dvc get`. They can be found as outputs of 2 separate [DVC-files](/doc/user-guide/dvc-files-and-directories) in the `tutorial/ver` directory. -There are a few possible problems with the way this dataset is stored (in 2 -parts, as compressed archives). One is that dataset partitioning is an -unnecessary complication; It can cause data duplication and other hurdles. -Another issue is that extra steps are needed to bundle and extract file -archives. The data compression also raises questions. But most importantly, this -single dataset is tracked by 2 different DVC-files, instead of 2 versions of the -same one, as we'll explain next. +There are a few problems with the way this dataset is structured (in 2 parts, as +compressed archives). One is that dataset partitioning is an unnecessary +complication; It can cause data duplication, among other hurdles. Another issue +is that extra steps are needed to bundle or extract file archives. The data +compression also raises questions. But most importantly, this single dataset is +tracked by 2 different DVC-files, instead of 2 versions of the same one, +preventing us from leveraging Git's central features to track changes. + +Fortunately, we have also prepared a better alternative in the `use-cases` +directory of the same +[repository](https://github.com/iterative/dataset-registry)). 
First, we used +dvc add cats-dogs to +[track the entire directory](https://dvc.org/doc/command-reference/add#example-directory) +(without bundling or compression) in first version of this dataset, which looks +like this: -Let's download and extract the first archive we discussed above (in an empty -directory) to visualize the structure of the dataset: - -```dvc -$ dvc get https://github.com/iterative/dataset-registry \ - tutorial/ver/data.zip -$ unzip -q data.zip -$ rm -f data.zip -$ tree --filelimit 3 -. -└── data -    ├── train -    │   ├── cats [500 entries ...] -    │   └── dogs [500 entries ...] -    └── validation -    ├── cats [400 entries ...] -    └── dogs [400 entries ...] -... ``` - -Instead of creating an archive containing the `data/` directory, we can simply -put it under DVC control as-is! This is done with the `dvc add` command, which -accepts entire directories. We can then save this first version using Git: - -```dvc -$ git init # Initialize Git repository -$ dvc init # Initialize DVC project -... -$ git commit -m "Initialize DVC project" -... -$ dvc add data -... -Saving information to 'data.dvc'. -... -$ git add data.dvc .gitignore -$ git commit -m "Add 1800 cats and dogs images dataset" + cats-dogs + ├── train + │   ├── cats [500 image files] + │   └── dogs [500 image files] + └── validation + ├── cats [400 image files] + └── dogs [400 image files] ``` -> Refer to -> [Adding a directory example](/doc/command-reference/add#example-directory) and -> [Structure of cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory) -> for more details on what happens under the hood when adding directories. - -Let's add the remaining images from the second archive (500 of each kind), and -see what changes in our data directory. To save the updates in our -project, we simply run `dvc add` again on the same directory, and -commit the new version with Git: +This first version uses the +[`cats-dogs-v1`](https://github.com/iterative/dataset-registry/tree/cats-dogs-v1/use-cases) +Git tag. In a local DVC project, we can obtain this dataset with the following +command: ```dvc -$ dvc get https://github.com/iterative/dataset-registry \ - tutorial/ver/new-labels.zip -$ unzip -q new-labels.zip -$ rm -f new-labels.zip -$ tree --filelimit 3 -. -├── data -│   ├── train -│   │   ├── cats [1000 entries ...] -│   │   └── dogs [1000 entries ...] -... -$ dvc add data -... -WARNING: Output 'data' of 'data.dvc' changed because it is 'modified' -... -Saving information to 'data.dvc'. -... -$ git commit -am "Add 1000 more cats and dogs images to dataset" +$ dvc import --rev cats-dogs-v1 \ + git@github.com:iterative/dataset-registry.git \ + use-cases/cats-dogs ``` -> Note that when adding updates to a tracked data directory, DVC only moves new -> and changed files to the cache. This optimizes file system performance, and -> avoids file duplication in cache and [remotes](/doc/command-reference/remote). - -Done! We now have a straightforward dataset stored as a full directory in our -data registry. Its first version tracks 1800 images, while the updated second -version tracks 1000 more. This strategy enables easy reproducibility of any -experiments (matching their dataset versions, see `dvc checkout`). It also -provides the ability to track the history of changes to the dataset with Git. 
- -## Taking full advantage of data registries +> Unlike downloading with `dvc get`, which can be used from any directory, +> `dvc import` has to be run from an [initialized](/doc/command-reference/init) +> DVC project. For illustrative purposes, the optional `--rev` option is used in +> the example above to specify an exact version of the dataset. -If we want to keep the connection between the current project and -an external DVC repository (e.g. A data registry)) we would use `dvc import` -instead of `dvc get`. Let's try this with our example -[dataset registry](https://github.com/iterative/dataset-registry), where we -already registered the dataset in the `use-cases` directory (with -[2 versions](https://github.com/iterative/dataset-registry/commits/master/use-cases), -as described in the previous section): +Importing keeps the connection between the local project and an external DVC +repository (e.g. a data registry)) where we are downloading data from. This is +achieved by creating a DVC-file (a.k.a. an _import stage_), in this case +`cats-dogs.dvc` – which can be used for versioning the import with Git in the +local project. This connection will come in handy when the source data changes, +and we want to obtain these updates... -```dvc -dvc import --rev 0547f58 \ - git@github.com:iterative/dataset-registry.git \ - use-cases/data -``` +Back in our **dataset-registry** repository, the second (and last) version of +our dataset exists under the +[`cats-dogs-v2`](https://github.com/iterative/dataset-registry/tree/cats-dogs-v2/use-cases) +tag. It was created by placing the additional 500 cat images in +`cats-dogs/training/cats` and 500 dog images in `cats-dogs/training/dogs`, and +simply running dvc add cats-dogs again. -This downloads the `data/` directory from Git version -[0547f58](https://github.com/iterative/dataset-registry/tree/0547f58), which -corresponds to the 1800 image dataset, and creates a local `data.dvc` _import -stage_. Unlike typical DVC-files, this one records the source (project) of the -imported data. This connection between projects allows us to check for updates -in the data, using `dvc update`: +In our local project, all we have to do in order to obtain this latest version +of the dataset is to run: ```dvc -$ dvc update data.dvc +$ dvc update cats-dogs.dvc ``` -This brings the `data/` directory up to its -[latest version](https://github.com/iterative/dataset-registry/commit/99d1cdb) -with 2800 images. Note that DVC only downloads new and changed files when -updating imports. +This downloads new and changed files in `cats-dogs/` from the source project, +and updates the metadata in `cats-dogs.dvc`. From bdeeb2430bdd268b50df16efb2191866b94f60f2 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Tue, 22 Oct 2019 00:52:40 -0500 Subject: [PATCH 18/40] use-cases: fix 2 typos in data-registry --- static/docs/use-cases/data-registry.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/static/docs/use-cases/data-registry.md b/static/docs/use-cases/data-registry.md index 231f98edd2..edea10e294 100644 --- a/static/docs/use-cases/data-registry.md +++ b/static/docs/use-cases/data-registry.md @@ -62,7 +62,7 @@ compression also raises questions. But most importantly, this single dataset is tracked by 2 different DVC-files, instead of 2 versions of the same one, preventing us from leveraging Git's central features to track changes. 
-Fortunately, we have also prepared a better alternative in the `use-cases` +Fortunately, we have also prepared a better alternative in the `use-cases/` directory of the same [repository](https://github.com/iterative/dataset-registry)). First, we used dvc add cats-dogs to @@ -97,8 +97,8 @@ $ dvc import --rev cats-dogs-v1 \ > the example above to specify an exact version of the dataset. Importing keeps the connection between the local project and an external DVC -repository (e.g. a data registry)) where we are downloading data from. This is -achieved by creating a DVC-file (a.k.a. an _import stage_), in this case +repository (e.g. a data registry) where we are downloading data from. This is +achieved by creating a special DVC-file a.k.a. an _import stage_, in this case `cats-dogs.dvc` – which can be used for versioning the import with Git in the local project. This connection will come in handy when the source data changes, and we want to obtain these updates... From 42335a1f8e1fa75028d4d7c9033f13f30e4b7710 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Tue, 22 Oct 2019 12:20:36 -0500 Subject: [PATCH 19/40] use-cases: removed paragraph about risk in data-registry per https://github.com/iterative/dvc.org/pull/679#pullrequestreview-304974770 --- static/docs/use-cases/data-registry.md | 5 ----- 1 file changed, 5 deletions(-) diff --git a/static/docs/use-cases/data-registry.md b/static/docs/use-cases/data-registry.md index edea10e294..667b481339 100644 --- a/static/docs/use-cases/data-registry.md +++ b/static/docs/use-cases/data-registry.md @@ -37,11 +37,6 @@ The advantages of using a data registry are: - DVC data registries are versioned with Git, so you can always track the history of the project the same as you manage your source code repository. -A possible risk of shared data registries is that, if the source project or its -remote storage are lost for any reason, several other projects depending on it -may stop being reproducible. So this strategy is best when the registry is owned -and controlled internally by the same team as the projects that employ it. - ## Using properly versioned registries In the [Versioning Tutorial](/doc/tutorials/versioning) we use a ZIP file From ed6025b58179c52660d7b781d4a56c9d0d912d36 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Tue, 22 Oct 2019 12:30:42 -0500 Subject: [PATCH 20/40] use-cases: rename H2 to "Example" per https://github.com/iterative/dvc.org/pull/679#discussion_r337340174 --- static/docs/use-cases/data-registry.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/static/docs/use-cases/data-registry.md b/static/docs/use-cases/data-registry.md index 667b481339..5958105ae5 100644 --- a/static/docs/use-cases/data-registry.md +++ b/static/docs/use-cases/data-registry.md @@ -37,7 +37,7 @@ The advantages of using a data registry are: - DVC data registries are versioned with Git, so you can always track the history of the project the same as you manage your source code repository. 
-## Using properly versioned registries +## Example In the [Versioning Tutorial](/doc/tutorials/versioning) we use a ZIP file containing images of cats and dogs, and then we update the dataset with more From 7a7e2e6bdceb2b7ae93a7405246c62d32f64d94d Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Tue, 22 Oct 2019 13:45:07 -0500 Subject: [PATCH 21/40] use-cases: rewrite parts of the example in data-registry and and related updates in other docs per https://github.com/iterative/dvc.org/pull/679#discussion_r337341000 and https://github.com/iterative/dvc.org/pull/679#pullrequestreview-304977884 --- static/docs/command-reference/add.md | 10 +-- static/docs/get-started/add-files.md | 6 +- static/docs/get-started/import-data.md | 15 ++--- static/docs/tutorials/pipelines.md | 6 +- static/docs/tutorials/versioning.md | 3 +- static/docs/use-cases/data-registry.md | 93 +++++++++++++------------- 6 files changed, 64 insertions(+), 69 deletions(-) diff --git a/static/docs/command-reference/add.md b/static/docs/command-reference/add.md index 5782764a84..94440595e4 100644 --- a/static/docs/command-reference/add.md +++ b/static/docs/command-reference/add.md @@ -171,14 +171,14 @@ Your goal might be to build an algorithm to identify dogs and cats in pictures, and this is your training dataset: ```dvc -$ tree pics +$ tree pics --filelimit 3 pics ├── train -│   ├── cats <-- a lot of images of cats -│   └── dogs <-- a lot of images of dogs +│   ├── cats [many image files] +│   └── dogs [many image files] └── validation - ├── cats <-- images of cats - └── dogs <-- images of dogs + ├── cats [more image files] + └── dogs [more image files] ``` Taking a directory under DVC control as simple as taking a single file: diff --git a/static/docs/get-started/add-files.md b/static/docs/get-started/add-files.md index 1ccdda1be2..41cb7a7b41 100644 --- a/static/docs/get-started/add-files.md +++ b/static/docs/get-started/add-files.md @@ -12,9 +12,9 @@ $ dvc get https://github.com/iterative/dataset-registry \ > `dvc get` can download data artifacts from any DVC > project hosted on a Git repository (similar to `wget` but for DVC -> repositories). In this case we use our own -> [iterative/dataset-registry](https://github.com/iterative/dataset-registry)) -> project as the external data source. (Refer to +> repositories). In this case we use our +> [dataset-registry](https://github.com/iterative/dataset-registry)) project as +> the external data source. (Refer to > [Data Registry](/doc/use-cases/data-registry) for more info about this setup.) To take a file (or a directory) under DVC control just run `dvc add` on it. For diff --git a/static/docs/get-started/import-data.md b/static/docs/get-started/import-data.md index 79d28b5a55..87137bf64f 100644 --- a/static/docs/get-started/import-data.md +++ b/static/docs/get-started/import-data.md @@ -27,20 +27,19 @@ $ dvc import https://github.com/iterative/dataset-registry \ get-started/data.xml ``` -This downloads `data.xml` from our own -[iterative/dataset-registry](https://github.com/iterative/dataset-registry) -project into the current working directory, adds it to `.gitignore`, and creates -the `data.xml.dvc` [DVC-file](/doc/user-guide/dvc-file-format) to track changes -in the source data. 
With _imports_, we can use `dvc update` to check for changes -in the external data source before [reproducing](/doc/get-started/reproduce) any +This downloads `data.xml` from our +[dataset-registry](https://github.com/iterative/dataset-registry) project into +the current working directory, adds it to `.gitignore`, and creates the +`data.xml.dvc` [DVC-file](/doc/user-guide/dvc-file-format) to track changes in +the source data. With _imports_, we can use `dvc update` to check for changes in +the external data source before [reproducing](/doc/get-started/reproduce) any pipeline that depends on this data.
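To help picture what `dvc import` records, here is a rough sketch of what
`data.xml.dvc` could contain, assuming the field layout shown in the command
reference examples later in these docs; every hash and the `rev_lock` value
below are placeholders, not real values:

```yaml
# Hypothetical contents of data.xml.dvc (all hash values are placeholders)
md5: 00000000000000000000000000000000
locked: true
deps:
  - path: get-started/data.xml
    repo:
      url: https://github.com/iterative/dataset-registry
      rev_lock: 1111111111111111111111111111111111111111
outs:
  - md5: 22222222222222222222222222222222
    path: data.xml
    cache: true
```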
### Expand to learn more about imports -Note that the -[iterative/dataset-registry](https://github.com/iterative/dataset-registry) +Note that the [dataset-registry](https://github.com/iterative/dataset-registry) repository doesn't actually contain a `get-started/data.xml` file. Instead, DVC inspects [get-started/data.xml.dvc](https://github.com/iterative/dataset-registry/blob/master/get-started/data.xml.dvc) diff --git a/static/docs/tutorials/pipelines.md b/static/docs/tutorials/pipelines.md index 14818e2fee..2ed4280a79 100644 --- a/static/docs/tutorials/pipelines.md +++ b/static/docs/tutorials/pipelines.md @@ -40,9 +40,9 @@ $ git commit -m "Download and add code to new Git repo" > `dvc get` can download data artifacts from any DVC > project hosted on a Git repository (similar to `wget` but for DVC -> repositories). In this case we use our own -> [iterative/dataset-registry](https://github.com/iterative/dataset-registry)) -> project as the external data source. (Refer to +> repositories). In this case we use our +> [dataset-registry](https://github.com/iterative/dataset-registry)) project as +> the external data source. (Refer to > [Data Registry](/doc/use-cases/data-registry) for more info about this setup.) Now let's install the requirements. But before we do that, we **strongly** diff --git a/static/docs/tutorials/versioning.md b/static/docs/tutorials/versioning.md index c0772e9af5..2cbf3f3555 100644 --- a/static/docs/tutorials/versioning.md +++ b/static/docs/tutorials/versioning.md @@ -83,8 +83,7 @@ $ rm -f data.zip > `dvc get` can download data artifacts from any DVC project hosted > on a Git repository (similar to `wget` but for DVC repositories). In this case -> we use our own -> [iterative/dataset-registry](https://github.com/iterative/dataset-registry) +> we use our [dataset-registry](https://github.com/iterative/dataset-registry) > project as the external data source. (Refer to > [Data Registry](/doc/use-cases/data-registry) for more info about this setup.) diff --git a/static/docs/use-cases/data-registry.md b/static/docs/use-cases/data-registry.md index 5958105ae5..b1269949c0 100644 --- a/static/docs/use-cases/data-registry.md +++ b/static/docs/use-cases/data-registry.md @@ -39,46 +39,45 @@ The advantages of using a data registry are: ## Example -In the [Versioning Tutorial](/doc/tutorials/versioning) we use a ZIP file -containing images of cats and dogs, and then we update the dataset with more -images from a second ZIP file. These compressed archives are downloaded from our -own -[iterative/dataset-registry](https://github.com/iterative/dataset-registry)), a -DVC project hosted on GitHub using `dvc get`. They can be found as -outputs of 2 separate -[DVC-files](/doc/user-guide/dvc-files-and-directories) in the `tutorial/ver` -directory. - -There are a few problems with the way this dataset is structured (in 2 parts, as -compressed archives). One is that dataset partitioning is an unnecessary -complication; It can cause data duplication, among other hurdles. Another issue -is that extra steps are needed to bundle or extract file archives. The data -compression also raises questions. But most importantly, this single dataset is -tracked by 2 different DVC-files, instead of 2 versions of the same one, -preventing us from leveraging Git's central features to track changes. - -Fortunately, we have also prepared a better alternative in the `use-cases/` -directory of the same -[repository](https://github.com/iterative/dataset-registry)). 
First, we used -dvc add cats-dogs to -[track the entire directory](https://dvc.org/doc/command-reference/add#example-directory) -(without bundling or compression) in first version of this dataset, which looks -like this: +A dataset we use for several of our examples and tutorials in these docs is one +containing 2800 images of cats and dogs. We partitioned the dataset in two for +our [Versioning Tutorial](/doc/tutorials/versioning), and backed up the parts on +a storage server, downloading them with `wget` in our examples. This setup was +then revised to download the dataset with `dvc get` instead, so we created the +[dataset-registry](https://github.com/iterative/dataset-registry)) project, a +DVC project hosted on GitHub, to version the dataset (see its +[`tutorial/ver`](https://github.com/iterative/dataset-registry/tree/master/tutorial/ver) +directory). + +However, there are a few problems with the way this dataset is structured (in 2 +parts). Most importantly, this single dataset is tracked by 2 different +[DVC-files](/doc/user-guide/dvc-file-format), instead of 2 versions of the same +one, which would better reflect the intentions of this dataset... Fortunately, +we have also prepared an improved alternative in the +[`use-cases/`](https://github.com/iterative/dataset-registry/tree/master/use-cases) +directory of the same repository. + +As step one, we extracted the first part of the dataset into the +`use-cases/cats-dogs` directory (illustrated below), and ran dvc add +use-cases/cats-dogs to +[track the entire directory](https://dvc.org/doc/command-reference/add#example-directory). -``` - cats-dogs - ├── train - │   ├── cats [500 image files] - │   └── dogs [500 image files] - └── validation - ├── cats [400 image files] - └── dogs [400 image files] +```dvc +$ tree use-cases/cats-dogs --filelimit 3 +use-cases/cats-dogs +└── data + ├── train + │   ├── cats [500 image files] + │   └── dogs [500 image files] + └── validation + ├── cats [400 image files] + └── dogs [400 image files] ``` This first version uses the [`cats-dogs-v1`](https://github.com/iterative/dataset-registry/tree/cats-dogs-v1/use-cases) Git tag. In a local DVC project, we can obtain this dataset with the following -command: +command (note the usage of `--rev`): ```dvc $ dvc import --rev cats-dogs-v1 \ @@ -86,24 +85,22 @@ $ dvc import --rev cats-dogs-v1 \ use-cases/cats-dogs ``` -> Unlike downloading with `dvc get`, which can be used from any directory, -> `dvc import` has to be run from an [initialized](/doc/command-reference/init) -> DVC project. For illustrative purposes, the optional `--rev` option is used in -> the example above to specify an exact version of the dataset. +> Note that unlike `dvc get`, which can be used from any directory, `dvc import` +> always needs to run from an [initialized](/doc/command-reference/init) DVC +> project. -Importing keeps the connection between the local project and an external DVC -repository (e.g. a data registry) where we are downloading data from. This is -achieved by creating a special DVC-file a.k.a. an _import stage_, in this case -`cats-dogs.dvc` – which can be used for versioning the import with Git in the -local project. This connection will come in handy when the source data changes, -and we want to obtain these updates... +Importing keeps the connection between the local project and data registry where +we are downloading the dataset from. This is achieved by creating a special +DVC-file (a.k.a. 
an _import stage_) – which can be used for versioning the +import with Git in the local project. This connection will come in handy when +the source data changes, and we want to obtain these updates... Back in our **dataset-registry** repository, the second (and last) version of our dataset exists under the [`cats-dogs-v2`](https://github.com/iterative/dataset-registry/tree/cats-dogs-v2/use-cases) -tag. It was created by placing the additional 500 cat images in -`cats-dogs/training/cats` and 500 dog images in `cats-dogs/training/dogs`, and -simply running dvc add cats-dogs again. +tag. It was created by extracting the second part of the dataset, with 1000 +additional images (500 cats, 500 dogs) in the same directory structure, and +simply running dvc add use-cases/cats-dogs again. In our local project, all we have to do in order to obtain this latest version of the dataset is to run: @@ -113,4 +110,4 @@ $ dvc update cats-dogs.dvc ``` This downloads new and changed files in `cats-dogs/` from the source project, -and updates the metadata in `cats-dogs.dvc`. +and updates the metadata in the import stage DVC-file. From 008e35818f135aa82b53e613ae4d36bbaec1e75b Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Tue, 22 Oct 2019 21:12:26 -0500 Subject: [PATCH 22/40] use-cases: rewrite data-registry list of benefits per https://github.com/iterative/dvc.org/pull/679#pullrequestreview-304974615 --- static/docs/use-cases/data-registry.md | 40 +++++++++++++++----------- 1 file changed, 24 insertions(+), 16 deletions(-) diff --git a/static/docs/use-cases/data-registry.md b/static/docs/use-cases/data-registry.md index b1269949c0..842adb6c3a 100644 --- a/static/docs/use-cases/data-registry.md +++ b/static/docs/use-cases/data-registry.md @@ -20,22 +20,25 @@ projects_, as `dvc get` works anywhere in your system. The advantages of using a data registry are: -- Tracked data is stored in a **centralized** remote location, with the ability - to create distributed copies on other remotes. -- Several projects can **share** the same files, guaranteeing that everyone has - access to the same data versions. See - [Share Data and Model Files](/doc/use-cases/share-data-and-model-files) for - more information. -- Projects that import data from the registry don't need to push these large - files to their own [remotes](/doc/command-reference/remote), **saving space** - on storage – they may not even need a remote at all, using only their local - cache. -- DVC data registries can handle multiple versions of data and ML modes with a - familiar CLI. See - [Data and Model Files Versioning](/doc/use-cases/data-and-model-files-versioning) - for more information. -- DVC data registries are versioned with Git, so you can always track the - history of the project the same as you manage your source code repository. +- Centralization: Data [shared](/doc/use-cases/share-data-and-model-files) by + multiple projects can be stored in a single location (with the ability to + create distributed copies on other remotes). This simplifies data management + and helps use storage space efficiently. +- [Versioning](/doc/use-cases/data-and-model-files-versioning): Any version of + the stored data or ML modes can be used in other projects at any + time. +- Persistence: The registry controlled + [remote storage](/doc/command-reference/remote) (e.g. an S3 bucket) improves + data security. There are less chances someone can delete or rewrite a model, + for example. 
+- Lifecycle management: Manage your data like you do with code, leveraging Git + and GitHub features such as version history, pull requests, reviews, or even + continuous deployment of ML models. +- Security: Registries can be setup to have read-only remote storage (e.g. an + HTTP location). Git versioning of DVC-files allows us to track and audit data + changes. +- Reusability: Reproduce and organizing _feature stores_ with `dvc get` and + `dvc import`. ## Example @@ -111,3 +114,8 @@ $ dvc update cats-dogs.dvc This downloads new and changed files in `cats-dogs/` from the source project, and updates the metadata in the import stage DVC-file. + +As an extra detail, notice that so far our local project is working only with a +local cache. It has no need to setup a +[remotes](/doc/command-reference/remote) to [pull](/doc/command-reference/pull) +or [push](/doc/command-reference/push) this dataset. From 131af1ea370592c9237f8e55a2e0202c7fe4246d Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Tue, 29 Oct 2019 21:41:21 -0600 Subject: [PATCH 23/40] import,update: explain rev field and update vs re-importing for #735, but also for use-case: add expandable sections to new data registry case per https://github.com/iterative/dvc.org/pull/679#issuecomment-544789692 and other misc. copy edits. Also standardizes term "external" (repo) vs. "source" data/project in this context and introduces the term "revision fixing". --- static/docs/command-reference/import.md | 70 ++++++++++++++++------ static/docs/command-reference/update.md | 27 ++++++--- static/docs/use-cases/data-registry.md | 79 ++++++++++++++++++------- 3 files changed, 129 insertions(+), 47 deletions(-) diff --git a/static/docs/command-reference/import.md b/static/docs/command-reference/import.md index 4dca2bd27f..13f67c2748 100644 --- a/static/docs/command-reference/import.md +++ b/static/docs/command-reference/import.md @@ -3,7 +3,7 @@ Download or copy file or directory from any DVC project in a Git repository (e.g. hosted on GitHub) into the workspace, and track changes in this [external dependency](/doc/user-guide/external-dependencies). -Creates a DVC-file. +Creates a special DVC-file a.k.a _import stage_. > See also `dvc get`, that corresponds to the first step this command performs > (just download the data). @@ -23,11 +23,11 @@ positional arguments: DVC provides an easy way to reuse datasets, intermediate results, ML models, or other files and directories tracked in another DVC repository into the workspace. The `dvc import` command downloads such a data artifact -in a way that it is tracked with DVC, so it can be updated when the external -data source changes. +in a way that it is tracked with DVC, so it can be updated when the data source +changes. The `url` argument specifies the address of the Git repository containing the -external project. Both HTTP and SSH protocols are supported for +source project. Both HTTP and SSH protocols are supported for online repositories (e.g. `[user@]server:project.git`). `url` can also be a local file system path to an "offline" repository. @@ -35,31 +35,31 @@ The `path` argument of this command is used to specify the location of the data to be downloaded within the source project. It should point to a data file or directory tracked by that project – specified in one of the [DVC-files](/doc/user-guide/dvc-file-format) of the repository at `url`. (You -will not find these files directly in the source Git repository.) The source +will not find these files directly in the external Git repository.) 
The source project should have a default [DVC remote](/doc/command-reference/remote) configured, containing them.) > See `dvc import-url` to download and tack data from other supported URLs. After running this command successfully, the imported data is placed in the -current working directory with its original file name e.g. `data.txt`. An import -stage (DVC-file) is then created extending the full file or directory name of -the imported data e.g. `data.txt.dvc` – similar to having used `dvc run` to -generate the same output. +current working directory with its original file name e.g. `data.txt`. An +_import stage_ (DVC-file) is then created, extending the full file or directory +name of the imported data e.g. `data.txt.dvc` – similar to having used `dvc run` +to generate the same output. DVC supports DVC-files that refer to data in an external DVC repository (hosted -on a Git server). In such a DVC-file, the `deps` section specifies the `repo` -URL and data `path`, and the `outs` section contains the corresponding local -path in the workspace. It records enough data from the external file or -directory to enable DVC to efficiently check it to determine whether the local -copy is out of date. +on a Git server) a.k.a _import stages_. In such a DVC-file, the `deps` section +specifies the `repo` URL and data `path`, and the `outs` section contains the +corresponding local path in the workspace. It records enough data from the +external file or directory to enable DVC to efficiently check it to determine +whether the local copy is out of date. To actually [track the data](https://dvc.org/doc/get-started/add-files), -`git add` (and `git commit`) the import stage (DVC-file). +`git add` (and `git commit`) the import stage. Note that import stages are considered always "locked", meaning that if you run `dvc repro`, they won't be updated. Use `dvc update` on them to update the -downloaded data artifact from the external DVC repository. +downloaded data artifact from the source DVC repository. ## Options @@ -72,8 +72,10 @@ downloaded data artifact from the external DVC repository. - `--rev` - specific [Git revision](https://git-scm.com/book/en/v2/Git-Internals-Git-References) (such as a branch name, a tag, or a commit hash) of the DVC repository to - import the data from. The tip of the default branch is used by default when - this option is not specified. + import the data from. The tip of the repository's default branch is used by + default when this option is not specified. Note that this adds a `rev` field + in the import stage that fixes it to this revision. This can impact the + behavior of `dvc update`. - `-h`, `--help` - prints the usage/help message, and exit. @@ -120,3 +122,35 @@ outs: Several of the values above are pulled from the original stage file `model.pkl.dvc` in the external DVC repo. `url` and `rev_lock` fields are used to specify the origin and version of the dependency. + +## Example: fixed revisions & re-importing + +When the `--rev` option is used, the import stage +([DVC-file](/doc/user-guide/dvc-file-format)) will include a `rev` field under +`repo` like this: + +```yaml +deps: + - path: data/data.xml + repo: + url: git@github.com:iterative/dataset-registry.git + rev: cats-dogs-v1 + rev_lock: 0547f5883fb18e523e35578e2f0d19648c8f2d5c +``` + +If the Git revision moves, such as a branch, this doesn't have much of an effect +on the import/update workflow. 
However, for static refs such as tags (unless +manually updated), or for SHA commits, `dvc update` will not have any effect on +the import. In this cases, in order to actually "update" an import, it's +necessary to **re-import the data** instead, by using `dvc import` again without +or with a different `--rev`. For example: + +```dvc +$ dvc import --rev master \ + git@github.com:iterative/dataset-registry.git \ + use-cases/cats-dogs +``` + +This will overwrite the import stage (DVC-file) either removing or replacing the +`rev` field. This can produce an import stage that is able to be updated +normally with `dvc update` going forward. diff --git a/static/docs/command-reference/update.md b/static/docs/command-reference/update.md index 9c99f92125..c7c8df6f44 100644 --- a/static/docs/command-reference/update.md +++ b/static/docs/command-reference/update.md @@ -1,6 +1,6 @@ # update -Update data artifacts imported from other DVC repositories. +Update data artifacts imported from external DVC repositories. ## Synopsis @@ -15,16 +15,24 @@ positional arguments: After creating import stages ([DVC-files](/doc/user-guide/dvc-file-format)) with `dvc import` or -`dvc import-url`, the external data source can change. Use `dvc update` to bring -these imported file, directory, or data artifact up to date. +`dvc import-url`, the data source can change. Use `dvc update` to bring these +imported file, directory, or data artifact up to date. + +To indicate which import stages to update, we must specify the corresponding +DVC-file `targets` as command arguments. Note that import stages are considered always "locked", meaning that if you run `dvc repro`, they won't be updated. `dvc update` is the only command that can -update them. Also, for `dvc import` DVC-files, the `rev_lock` field is updated -by `dvc update`. +update them. Also, for `dvc import` import stages, the `rev_lock` field is +updated by `dvc update`. -To indicate which import stages to update, we must specify the corresponding -DVC-file `targets` as command arguments. +Another detail to note is that when the `--rev` (revision) option of +`dvc import` has been used to create an import stage, DVC is not aware of what +kind of +[Git revision](https://git-scm.com/book/en/v2/Git-Internals-Git-References) this +is, for example a branch or a tag. For static refs such as tags (unless manually +updated), or for SHA commits, `dvc update` will not have any effect on the +import. ## Options @@ -60,4 +68,7 @@ Output 'model.pkl' didn't change. Skipping saving. Saving information to 'model.pkl.dvc'. ``` -This time nothing has changed, since the source repository is rather stable. +This time nothing has changed, since the source project is rather +stable. + +> Refer to this [re-importing example]() for diff --git a/static/docs/use-cases/data-registry.md b/static/docs/use-cases/data-registry.md index 842adb6c3a..e2b7bb79ec 100644 --- a/static/docs/use-cases/data-registry.md +++ b/static/docs/use-cases/data-registry.md @@ -47,22 +47,24 @@ containing 2800 images of cats and dogs. We partitioned the dataset in two for our [Versioning Tutorial](/doc/tutorials/versioning), and backed up the parts on a storage server, downloading them with `wget` in our examples. 
This setup was then revised to download the dataset with `dvc get` instead, so we created the -[dataset-registry](https://github.com/iterative/dataset-registry)) project, a +[dataset-registry](https://github.com/iterative/dataset-registry)) repository, a DVC project hosted on GitHub, to version the dataset (see its [`tutorial/ver`](https://github.com/iterative/dataset-registry/tree/master/tutorial/ver) directory). -However, there are a few problems with the way this dataset is structured (in 2 -parts). Most importantly, this single dataset is tracked by 2 different +However, there are a few problems with the way this dataset is structured. Most +importantly, this single dataset is tracked by 2 different [DVC-files](/doc/user-guide/dvc-file-format), instead of 2 versions of the same one, which would better reflect the intentions of this dataset... Fortunately, we have also prepared an improved alternative in the [`use-cases/`](https://github.com/iterative/dataset-registry/tree/master/use-cases) directory of the same repository. -As step one, we extracted the first part of the dataset into the -`use-cases/cats-dogs` directory (illustrated below), and ran dvc add -use-cases/cats-dogs to +To create a +[first version](https://github.com/iterative/dataset-registry/tree/cats-dogs-v1/use-cases) +of our dataset, we extracted the first part into the `use-cases/cats-dogs` +directory (illustrated below), and ran dvc add use-cases/cats-dogs +to [track the entire directory](https://dvc.org/doc/command-reference/add#example-directory). ```dvc @@ -77,14 +79,11 @@ use-cases/cats-dogs └── dogs [400 image files] ``` -This first version uses the -[`cats-dogs-v1`](https://github.com/iterative/dataset-registry/tree/cats-dogs-v1/use-cases) -Git tag. In a local DVC project, we can obtain this dataset with the following -command (note the usage of `--rev`): +In a local DVC project, we could have obtained this dataset at this point with +the following command: ```dvc -$ dvc import --rev cats-dogs-v1 \ - git@github.com:iterative/dataset-registry.git \ +$ dvc import git@github.com:iterative/dataset-registry.git \ use-cases/cats-dogs ``` @@ -92,18 +91,37 @@ $ dvc import --rev cats-dogs-v1 \ > always needs to run from an [initialized](/doc/command-reference/init) DVC > project. +
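If it helps to picture that requirement, a minimal sketch of a local setup
before running the import might look like this (the `example-project`
directory name is arbitrary and used only for illustration):

```dvc
$ mkdir example-project && cd example-project
$ git init
$ dvc init
$ git commit -m "Initialize DVC project"
$ dvc import git@github.com:iterative/dataset-registry.git \
             use-cases/cats-dogs
```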
+ +### Expand for actionable command (optional) + +The command above is meant for informational purposes only. If you actually run +it in a DVC project, although it should work, it will import the latest version +of `use-cases/cats-dogs` from `dataset-registry`. The following command would +actually bring in the version in question: + +```dvc +$ dvc import --rev cats-dogs-v1 \ + git@github.com:iterative/dataset-registry.git \ + use-cases/cats-dogs +``` + +See the `dvc import` command reference for more details on the `--rev` +(revision) option. + +
+ Importing keeps the connection between the local project and data registry where we are downloading the dataset from. This is achieved by creating a special -DVC-file (a.k.a. an _import stage_) – which can be used for versioning the -import with Git in the local project. This connection will come in handy when -the source data changes, and we want to obtain these updates... +DVC-file (a.k.a. _import stage_) – that can be used for versioning the import +with Git. This connection will come in handy when the source data changes, and +we want to obtain these updates... -Back in our **dataset-registry** repository, the second (and last) version of -our dataset exists under the -[`cats-dogs-v2`](https://github.com/iterative/dataset-registry/tree/cats-dogs-v2/use-cases) -tag. It was created by extracting the second part of the dataset, with 1000 -additional images (500 cats, 500 dogs) in the same directory structure, and -simply running dvc add use-cases/cats-dogs again. +Back in our **dataset-registry** repository, a +[second version](https://github.com/iterative/dataset-registry/tree/cats-dogs-v2/use-cases) +of our dataset was created by extracting the second part, with 1000 additional +images (500 cats, 500 dogs), into the same directory structure. Then, we simply +ran dvc add use-cases/cats-dogs again. In our local project, all we have to do in order to obtain this latest version of the dataset is to run: @@ -112,6 +130,25 @@ of the dataset is to run: $ dvc update cats-dogs.dvc ``` +
+ +### Expand for actionable command (optional) + +As with the previous hidden note, actually trying the commands above should +produced the expected results, but not for obvious reasons. Specifically, the +initial `dvc import` command would have already obtained the latest version of +the dataset (as noted before), so this `dvc update` is unnecessary and won't +have an effect. + +If you ran the `dvc import --rev cats-dogs-v1 ...` command instead, its import +stage (DVC-file) would be fixed to that Git tag (`cats-dogs-v1`). In order to +update it, do not use `dvc update`. Instead, re-import the data by using the +original import command (without `--rev`). Refer to +[this example](http://localhost:3000/doc/command-reference/import#example-fixed-revisions-re-importing) +for more information. + +
+ This downloads new and changed files in `cats-dogs/` from the source project, and updates the metadata in the import stage DVC-file. From f01f86045c77bc7ea5c66de75e5c5f9b5ecec37b Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Wed, 30 Oct 2019 11:36:42 -0600 Subject: [PATCH 24/40] cmd ref: add data registry example to import cmd for #487 --- static/docs/command-reference/import.md | 51 +++++++++++++++++++++++++ static/docs/use-cases/data-registry.md | 13 +++++-- 2 files changed, 60 insertions(+), 4 deletions(-) diff --git a/static/docs/command-reference/import.md b/static/docs/command-reference/import.md index 13f67c2748..b341647ef9 100644 --- a/static/docs/command-reference/import.md +++ b/static/docs/command-reference/import.md @@ -154,3 +154,54 @@ $ dvc import --rev master \ This will overwrite the import stage (DVC-file) either removing or replacing the `rev` field. This can produce an import stage that is able to be updated normally with `dvc update` going forward. + +## Example: Data registry + +If you take a look at our +[dataset-registry](https://github.com/iterative/dataset-registry) +project, you'll see that it's organized into different directories +such as `tutorial/ver` and `use-cases/`, and these contain +[DVC-files](/doc/user-guide/dvc-file-format) that track different datasets. +Given this simple structure, these files can be easily shared among several +other projects, using `dvc get` and `dvc import`. For example: + +```dvc +$ dvc get https://github.com/iterative/dataset-registry \ + tutorial/ver/data.zip +``` + +> Used in our [versioning tutorial](/doc/tutorials/versioning) + +Or + +```dvc +$ dvc import git@github.com:iterative/dataset-registry.git \ + use-cases/cats-dogs +``` + +`dvc import` provides a better way to incorporate data files tracked in external +projects because it saves the connection between the current project and the +source project. This means that enough information is recorded in an import +stage (DVC-file) in order to [reproduce](/doc/command-reference/repro) +downloading of this same data version in the future, where and when needed. This +is achieved with the `repo` field, for example (matching the import command +above): + +```yaml +md5: 96fd8e791b0ee4824fc1ceffd13b1b49 +locked: true +deps: + - path: use-cases/cats-dogs + repo: + url: git@github.com:iterative/dataset-registry.git + rev_lock: 0547f5883fb18e523e35578e2f0d19648c8f2d5c +outs: + - md5: b6923e1e4ad16ea1a7e2b328842d56a2.dir + path: cats-dogs + cache: true + metric: false + persist: false +``` + +See a full explanation in our [Data Registry](/doc/use-cases/data-registry) use +case. diff --git a/static/docs/use-cases/data-registry.md b/static/docs/use-cases/data-registry.md index e2b7bb79ec..33a1ce550b 100644 --- a/static/docs/use-cases/data-registry.md +++ b/static/docs/use-cases/data-registry.md @@ -113,11 +113,13 @@ See the `dvc import` command reference for more details on the `--rev` Importing keeps the connection between the local project and data registry where we are downloading the dataset from. This is achieved by creating a special -DVC-file (a.k.a. _import stage_) – that can be used for versioning the import -with Git. This connection will come in handy when the source data changes, and -we want to obtain these updates... +[DVC-file](/doc/user-guide/dvc-file-format) (a.k.a. _import stage_) that uses +the `repo` field. (This file can be used for versioning the import with Git.) 
-
+> For a sample DVC-file resulting from `dvc import`, refer to
+> [this example](/doc/command-reference/import#example-data-registry).
+
-Back in our **dataset-registry** repository, a
+Back in our **dataset-registry** project, a
 [second version](https://github.com/iterative/dataset-registry/tree/cats-dogs-v2/use-cases)
 of our dataset was created by extracting the second part, with 1000 additional
 images (500 cats, 500 dogs), into the same directory structure. Then, we simply
@@ -130,6 +132,9 @@ of the dataset is to run:
 $ dvc update cats-dogs.dvc
 ```
 
+This is possible because of the connection that the import stage saved between
+the local and source projects, as explained earlier.
+
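+To make that connection concrete, here is a rough, hypothetical sketch of the
+relevant part of `cats-dogs.dvc` (the values below are placeholders, not actual
+hashes):
+
+```yaml
+deps:
+  - path: use-cases/cats-dogs
+    repo:
+      url: git@github.com:iterative/dataset-registry.git
+      # `dvc update` refreshes this lock to the registry's latest revision:
+      rev_lock: '<Git commit hash of dataset-registry at import time>'
+```
+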
### Expand for actionable command (optional) From 45fb574d2eb0575eaddd4c197c450edebc0c380e Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Thu, 31 Oct 2019 23:12:47 -0400 Subject: [PATCH 25/40] get: clarify that external repos are NOT data sources per https://github.com/iterative/dvc.org/pull/679#pullrequestreview-305538837 --- static/docs/get-started/add-files.md | 11 ++++++----- static/docs/tutorials/interactive.md | 5 +++-- static/docs/tutorials/pipelines.md | 17 +++++++++-------- static/docs/tutorials/versioning.md | 14 ++++++++------ static/docs/use-cases/data-registry.md | 9 +++++---- 5 files changed, 31 insertions(+), 25 deletions(-) diff --git a/static/docs/get-started/add-files.md b/static/docs/get-started/add-files.md index c23eaef96d..11dab42057 100644 --- a/static/docs/get-started/add-files.md +++ b/static/docs/get-started/add-files.md @@ -10,12 +10,13 @@ $ dvc get https://github.com/iterative/dataset-registry \ get-started/data.xml -o data/data.xml ``` -> `dvc get` can download data artifacts from any DVC +> `dvc get` can download data artifacts from the +> [remote storage](/doc/command-reference/remote) of any DVC > project hosted on a Git repository (similar to `wget` but for DVC -> repositories). In this case we use our -> [dataset-registry](https://github.com/iterative/dataset-registry)) project as -> the external data source. (Refer to -> [Data Registry](/doc/use-cases/data-registry) for more info about this setup.) +> repositories). In this case we use +> [dataset-registry](https://github.com/iterative/dataset-registry)) as the +> source project. (Refer to [Data Registry](/doc/use-cases/data-registry) for +> more info about this setup.) To take a file (or a directory) under DVC control just run `dvc add` on it. For example: diff --git a/static/docs/tutorials/interactive.md b/static/docs/tutorials/interactive.md index c37ec7c9cb..7d4c5a4275 100644 --- a/static/docs/tutorials/interactive.md +++ b/static/docs/tutorials/interactive.md @@ -27,8 +27,9 @@ Learn basic concepts and features of DVC with interactive lessons: pipeline end-to-end. 6. [Importing Data](https://katacoda.com/dvc/courses/basics/importing)
- Download and track data from another DVC project that is hosted in a Git - repository. + Download and track data from the + [remote storage](/doc/command-reference/remote) of another DVC project that + is hosted on a Git repository. ## Simple ML Scenarios diff --git a/static/docs/tutorials/pipelines.md b/static/docs/tutorials/pipelines.md index bd6c00256a..de1c92912e 100644 --- a/static/docs/tutorials/pipelines.md +++ b/static/docs/tutorials/pipelines.md @@ -44,12 +44,13 @@ $ git add code/ $ git commit -m "Download and add code to new Git repo" ``` -> `dvc get` can download data artifacts from any DVC +> `dvc get` can download data artifacts from the +> [remote storage](/doc/command-reference/remote) of any DVC > project hosted on a Git repository (similar to `wget` but for DVC -> repositories). In this case we use our -> [dataset-registry](https://github.com/iterative/dataset-registry)) project as -> the external data source. (Refer to -> [Data Registry](/doc/use-cases/data-registry) for more info about this setup.) +> repositories). In this case we use +> [dataset-registry](https://github.com/iterative/dataset-registry)) as the +> source project. (Refer to [Data Registry](/doc/use-cases/data-registry) for +> more info about this setup.) Now let's install the requirements. But before we do that, we **strongly** recommend creating a @@ -66,9 +67,9 @@ Next, we will create a [pipeline](/doc/command-reference/pipeline) step-by-step, utilizing the same set of commands that are described in earlier [Get Started](/doc/get-started) chapters. -> Note that its possible to define more than one pipeline in each DVC project. -> This will be determined by the interdependencies between DVC-files, mentioned -> below. +> Note that its possible to define more than one pipeline in each DVC +> project. This will be determined by the interdependencies between +> DVC-files, mentioned below. Initialize DVC repository (run it inside your Git repository): diff --git a/static/docs/tutorials/versioning.md b/static/docs/tutorials/versioning.md index 3cbbe24263..023a378289 100644 --- a/static/docs/tutorials/versioning.md +++ b/static/docs/tutorials/versioning.md @@ -82,11 +82,13 @@ $ unzip data.zip $ rm -f data.zip ``` -> `dvc get` can download data artifacts from any DVC project hosted -> on a Git repository (similar to `wget` but for DVC repositories). In this case -> we use our [dataset-registry](https://github.com/iterative/dataset-registry) -> project as the external data source. (Refer to -> [Data Registry](/doc/use-cases/data-registry) for more info about this setup.) +> `dvc get` can download data artifacts from the +> [remote storage](/doc/command-reference/remote) of any DVC +> project hosted on a Git repository (similar to `wget` but for DVC +> repositories). In this case we use +> [dataset-registry](https://github.com/iterative/dataset-registry)) as the +> source project. (Refer to [Data Registry](/doc/use-cases/data-registry) for +> more info about this setup.) This command downloads and extracts our raw dataset, consisting of 1000 labeled images for training and 800 labeled images for validation. In total, it's a 43 @@ -300,7 +302,7 @@ place. ## Automating capturing `dvc add` makes sense when you need to keep track of different versions of -datasets or model files that come from external sources. The `data/` directory +datasets or model files that come from source projects. The `data/` directory above (with cats and dogs images) is a good example. 
On the other hand, there are files that are the result of running some code. In diff --git a/static/docs/use-cases/data-registry.md b/static/docs/use-cases/data-registry.md index 33a1ce550b..e3ba0d7a9c 100644 --- a/static/docs/use-cases/data-registry.md +++ b/static/docs/use-cases/data-registry.md @@ -111,10 +111,11 @@ See the `dvc import` command reference for more details on the `--rev`
-Importing keeps the connection between the local project and data registry where -we are downloading the dataset from. This is achieved by creating a special -[DVC-file](/doc/user-guide/dvc-file-format) (a.k.a. _import stage_) that uses -the `repo` field. (This file can be used for versioning the import with Git.) +Importing keeps the connection between the local project and the source data +registry where we are downloading the dataset from. This is achieved by creating +a special [DVC-file](/doc/user-guide/dvc-file-format) (a.k.a. _import stage_) +that uses the `repo` field. (This file can be used for versioning the import +with Git.) > For a sample DVC-file resulting from `dvc import`, refer to > [this example](/doc/command-reference/import#example-data-registry). From c7d695d6fddd091bc4bbc0ab9fd0304150a65f30 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Fri, 1 Nov 2019 14:55:40 -0400 Subject: [PATCH 26/40] use-cases: feedback round for new data registry case and related cmd refs: See https://github.com/iterative/dvc.org/pull/679#pullrequestreview-305542187, https://github.com/iterative/dvc.org/pull/679#pullrequestreview-305542425, https://github.com/iterative/dvc.org/pull/679#pullrequestreview-309436031, https://github.com/iterative/dvc.org/pull/679#pullrequestreview-309438214, https://github.com/iterative/dvc.org/pull/679#pullrequestreview-309439290, and https://github.com/iterative/dvc.org/pull/679#pullrequestreview-309442485. --- static/docs/command-reference/import.md | 5 +-- static/docs/command-reference/update.md | 3 +- static/docs/use-cases/data-registry.md | 42 ++++++++++++------------- 3 files changed, 26 insertions(+), 24 deletions(-) diff --git a/static/docs/command-reference/import.md b/static/docs/command-reference/import.md index b341647ef9..ef45be2824 100644 --- a/static/docs/command-reference/import.md +++ b/static/docs/command-reference/import.md @@ -3,7 +3,7 @@ Download or copy file or directory from any DVC project in a Git repository (e.g. hosted on GitHub) into the workspace, and track changes in this [external dependency](/doc/user-guide/external-dependencies). -Creates a special DVC-file a.k.a _import stage_. +Creates [DVC-files](/doc/user-guide/dvc-file-format). > See also `dvc get`, that corresponds to the first step this command performs > (just download the data). @@ -75,7 +75,8 @@ downloaded data artifact from the source DVC repository. import the data from. The tip of the repository's default branch is used by default when this option is not specified. Note that this adds a `rev` field in the import stage that fixes it to this revision. This can impact the - behavior of `dvc update`. + behavior of `dvc update`. (See + [re-importing](#example-fixed-revisions-re-importing) example below.) - `-h`, `--help` - prints the usage/help message, and exit. diff --git a/static/docs/command-reference/update.md b/static/docs/command-reference/update.md index c7c8df6f44..173e9ab84e 100644 --- a/static/docs/command-reference/update.md +++ b/static/docs/command-reference/update.md @@ -1,6 +1,7 @@ # update -Update data artifacts imported from external DVC repositories. +Update data artifacts imported from external DVC repositories, and +corresponding [DVC-files](/doc/user-guide/dvc-file-format). 
## Synopsis diff --git a/static/docs/use-cases/data-registry.md b/static/docs/use-cases/data-registry.md index e3ba0d7a9c..acf1e4d201 100644 --- a/static/docs/use-cases/data-registry.md +++ b/static/docs/use-cases/data-registry.md @@ -12,40 +12,40 @@ Taking this idea to a useful extreme, we could create a project that is exclusively dedicated to [tracking and versioning](/doc/use-cases/data-and-model-files-versioning) datasets (or any kind of large files) – by mainly using `dvc add` to build it. -Such a project would not have [stages](/doc/command-reference/run), but its data -files may be updated manually as they evolve. Other projects can then share -these artifacts by downloading (`dvc get`) or importing (`dvc import`) them for -use in different data processes – and these don't even have to be _DVC -projects_, as `dvc get` works anywhere in your system. - -The advantages of using a data registry are: - -- Centralization: Data [shared](/doc/use-cases/share-data-and-model-files) by - multiple projects can be stored in a single location (with the ability to - create distributed copies on other remotes). This simplifies data management - and helps use storage space efficiently. +Other projects can then share these artifacts by downloading (`dvc get`) or +importing (`dvc import`) them for use in different data processes – and these +don't even have to be _DVC projects_, as `dvc get` works anywhere in your +system. + +The advantages of using a DVC **data registry** project are: + - [Versioning](/doc/use-cases/data-and-model-files-versioning): Any version of - the stored data or ML modes can be used in other projects at any - time. -- Persistence: The registry controlled + the data or ML modes tracked by a DVC registry can be used in other projects + at any time. +- Reusability: Reproduce and organizing _feature stores_ with `dvc get` and + `dvc import`. +- Persistence: The DVC registry controlled [remote storage](/doc/command-reference/remote) (e.g. an S3 bucket) improves data security. There are less chances someone can delete or rewrite a model, for example. +- Storage Optimization: Track data + [shared](/doc/use-cases/share-data-and-model-files) by multiple projects + centralized in a single location (with the ability to create distributed + copies on other remotes). This simplifies data management and helps use + storage space efficiently. - Lifecycle management: Manage your data like you do with code, leveraging Git and GitHub features such as version history, pull requests, reviews, or even continuous deployment of ML models. - Security: Registries can be setup to have read-only remote storage (e.g. an HTTP location). Git versioning of DVC-files allows us to track and audit data changes. -- Reusability: Reproduce and organizing _feature stores_ with `dvc get` and - `dvc import`. ## Example -A dataset we use for several of our examples and tutorials in these docs is one -containing 2800 images of cats and dogs. We partitioned the dataset in two for -our [Versioning Tutorial](/doc/tutorials/versioning), and backed up the parts on -a storage server, downloading them with `wget` in our examples. This setup was +A dataset we use for several of our examples and tutorials is one containing +2800 images of cats and dogs. We partitioned the dataset in two for our +[Versioning Tutorial](/doc/tutorials/versioning), and backed up the parts on a +storage server, downloading them with `wget` in our examples. 
This setup was then revised to download the dataset with `dvc get` instead, so we created the [dataset-registry](https://github.com/iterative/dataset-registry)) repository, a DVC project hosted on GitHub, to version the dataset (see its From a829f934bb1e673af314edfbb227f537a470e1e4 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Sat, 2 Nov 2019 14:46:17 -0400 Subject: [PATCH 27/40] use-cases: use regular back ticks for DVC commands instead of even when atm it causes a bug per https://github.com/iterative/dvc.org/pull/679#pullrequestreview-304978198 --- static/docs/use-cases/data-registry.md | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/static/docs/use-cases/data-registry.md b/static/docs/use-cases/data-registry.md index acf1e4d201..2d548ce7d0 100644 --- a/static/docs/use-cases/data-registry.md +++ b/static/docs/use-cases/data-registry.md @@ -63,8 +63,7 @@ directory of the same repository. To create a [first version](https://github.com/iterative/dataset-registry/tree/cats-dogs-v1/use-cases) of our dataset, we extracted the first part into the `use-cases/cats-dogs` -directory (illustrated below), and ran dvc add use-cases/cats-dogs -to +directory (illustrated below), and ran `dvc add use-cases/cats-dogs` to [track the entire directory](https://dvc.org/doc/command-reference/add#example-directory). ```dvc @@ -124,7 +123,7 @@ Back in our **dataset-registry** project, a [second version](https://github.com/iterative/dataset-registry/tree/cats-dogs-v2/use-cases) of our dataset was created by extracting the second part, with 1000 additional images (500 cats, 500 dogs), into the same directory structure. Then, we simply -ran dvc add use-cases/cats-dogs again. +ran `dvc add use-cases/cats-dogs` again. In our local project, all we have to do in order to obtain this latest version of the dataset is to run: From 216bccb75fc54436a10a5c5a21858c43f9323b5c Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Sat, 2 Nov 2019 17:15:11 -0400 Subject: [PATCH 28/40] use-cases: revise advanatage list again per https://github.com/iterative/dvc.org/pull/679#pullrequestreview-309442485 --- static/docs/use-cases/data-registry.md | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/static/docs/use-cases/data-registry.md b/static/docs/use-cases/data-registry.md index 2d548ce7d0..4f18832a2f 100644 --- a/static/docs/use-cases/data-registry.md +++ b/static/docs/use-cases/data-registry.md @@ -22,8 +22,11 @@ The advantages of using a DVC **data registry** project are: - [Versioning](/doc/use-cases/data-and-model-files-versioning): Any version of the data or ML modes tracked by a DVC registry can be used in other projects at any time. -- Reusability: Reproduce and organizing _feature stores_ with `dvc get` and - `dvc import`. +- Data as code: Version straightforward dataset directory structures without + special ad-hoc conventions. +- Reusability: Reproduce and organize _feature stores_ with a simple CLI + (`dvc get` and `dvc import` commands, similar to software package management + systems like `pip`). - Persistence: The DVC registry controlled [remote storage](/doc/command-reference/remote) (e.g. an S3 bucket) improves data security. There are less chances someone can delete or rewrite a model, @@ -34,7 +37,7 @@ The advantages of using a DVC **data registry** project are: copies on other remotes). This simplifies data management and helps use storage space efficiently. 
- Lifecycle management: Manage your data like you do with code, leveraging Git - and GitHub features such as version history, pull requests, reviews, or even + (and GitHub) features such as version history, pull requests, reviews, or even continuous deployment of ML models. - Security: Registries can be setup to have read-only remote storage (e.g. an HTTP location). Git versioning of DVC-files allows us to track and audit data From 9f0d7290033b0799135c9bebd016fec8f273b917 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Wed, 6 Nov 2019 12:41:09 -0500 Subject: [PATCH 29/40] glossary: add "DVC Repository" term; reorder glossary; and avoid "repo" everywhere --- src/Documentation/glossary.js | 34 +++++++++++++++-------- static/docs/changelog/0.18.md | 2 +- static/docs/changelog/0.35.md | 2 +- static/docs/command-reference/destroy.md | 4 +-- static/docs/command-reference/fetch.md | 3 +- static/docs/command-reference/get.md | 3 +- static/docs/command-reference/import.md | 16 +++++------ static/docs/command-reference/init.md | 2 +- static/docs/command-reference/install.md | 3 +- static/docs/tutorials/pipelines.md | 2 +- static/docs/use-cases/data-registry.md | 4 +-- static/docs/user-guide/dvc-file-format.md | 8 +++--- 12 files changed, 47 insertions(+), 36 deletions(-) diff --git a/src/Documentation/glossary.js b/src/Documentation/glossary.js index 804f114923..4e24df9061 100644 --- a/src/Documentation/glossary.js +++ b/src/Documentation/glossary.js @@ -21,10 +21,20 @@ form part of your expanded workspace, technically. name: 'DVC Project', match: ['DVC project', 'project', 'projects'], desc: ` -Initialized by running \`dvc init\` in the **workspace**. It will contain the +Initialized by running \`dvc init\` in the **workspace** (typically in a Git +repository). It will contain the [\`.dvc/\` directory](/doc/user-guide/dvc-files-and-directories) and [DVC-files](/doc/user-guide/dvc-file-format) created with commands such as -\`dvc add\` or \`dvc run\`. It's typically also a Git repository. +\`dvc add\` or \`dvc run\`. It may also be a Git repository. + ` + }, + { + name: 'DVC Repository', + match: ['DVC repository'], + desc: ` +**DVC project** initialized using \`dvc init\` in a Git repository. It will +contain \`.git/\` and [\`.dvc/\`](/doc/user-guide/dvc-files-and-directories) +directories, as well as any DVC-files created by DVC. ` }, { @@ -37,6 +47,15 @@ For more details, please refer to this [document](/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory). ` }, + { + name: 'Output', + match: ['output', 'outputs'], + desc: ` +A file or a directory that is under DVC control, recorded in the \`outs\` +section of a DVC-file. See \`dvc add\` \`dvc run\`, \`dvc import\`, +\`dvc import-url\` commands. A.k.a. **data artifact*. + ` + }, { name: 'Data Artifact', match: ['data artifact', 'data artifacts'], @@ -44,7 +63,7 @@ For more details, please refer to this Any data file or directory, as well as intermediate or final result (such as extracted features or a ML model file) that is under DVC control. Refer to [Versioning Data and Model Files](/doc/use-cases/versioning-data-and-model-files) -for more details. +for more details. A.k.a. **output*. ` }, { @@ -55,15 +74,6 @@ Stage (DVC-file) created with the \`dvc import\` or \`dvc import-url\` commands. They represent files or directories from external sources. ` }, - { - name: 'Output', - match: ['output', 'outputs'], - desc: ` -A file or a directory that is under DVC control, recorded in the \`outs\` -section of a DVC-file. 
See \`dvc add\` \`dvc run\`, \`dvc import\`, -\`dvc import-url\` commands. - ` - }, { name: 'External Dependency', match: ['external dependency', 'external dependencies'], diff --git a/static/docs/changelog/0.18.md b/static/docs/changelog/0.18.md index aa674de08c..6602052fb8 100644 --- a/static/docs/changelog/0.18.md +++ b/static/docs/changelog/0.18.md @@ -43,5 +43,5 @@ really excited to share the progress with you: Please use the discussion forum [discuss.dvc.org](discuss.dvc.org) and [issue tracker]() and don't hesitate to [⭐](https://github.com/iterative/dvc) -our [DVC repository](https://github.com/iterative/dvc) if you haven't yet. We +the [DVC repository](https://github.com/iterative/dvc) if you haven't yet. We are waiting for your feedback! diff --git a/static/docs/changelog/0.35.md b/static/docs/changelog/0.35.md index 29a5483c63..d5388abfb0 100644 --- a/static/docs/changelog/0.35.md +++ b/static/docs/changelog/0.35.md @@ -72,5 +72,5 @@ There are new [integrations and plugins](/doc/install/plugins) available: (PyCharm, IntelliJ, etc). Don't hesitate to -[like\star DVC repository](https://github.com/iterative/dvc/stargazers) if you +[star the DVC repository](https://github.com/iterative/dvc/stargazers) if you haven't yet. We are waiting for your feedback! diff --git a/static/docs/command-reference/destroy.md b/static/docs/command-reference/destroy.md index 1da1280bad..533ba4f3b5 100644 --- a/static/docs/command-reference/destroy.md +++ b/static/docs/command-reference/destroy.md @@ -17,9 +17,9 @@ usage: dvc destroy [-h] [-q | -v] [-f] be removed as well, unless it's set to an external location with `dvc cache dir`. (By default a local cache is located in the `.dvc/cache` directory.) If you were using -[symlinks for linking data](/doc/user-guide/large-dataset-optimization) from the +[symlinks for linking](/doc/user-guide/large-dataset-optimization) data from the cache, DVC will replace them with copies, so that your data is intact after the -DVC repository destruction. +project's destruction. ## Options diff --git a/static/docs/command-reference/fetch.md b/static/docs/command-reference/fetch.md index 99fdf2997a..64e40c551b 100644 --- a/static/docs/command-reference/fetch.md +++ b/static/docs/command-reference/fetch.md @@ -117,8 +117,7 @@ specified in DVC-files currently in the project are considered by `dvc fetch` ## Examples Let's employ a simple workspace with some data, code, ML models, -pipeline stages, as well as a few Git tags, such as the DVC project -created in our +pipeline stages, as well as a few Git tags, such as our [get started example repo](https://github.com/iterative/example-get-started). Then we can see what happens with `dvc fetch` as we switch from tag to tag. diff --git a/static/docs/command-reference/get.md b/static/docs/command-reference/get.md index 755978e663..49308231b3 100644 --- a/static/docs/command-reference/get.md +++ b/static/docs/command-reference/get.md @@ -20,7 +20,8 @@ positional arguments: Provides an easy way to download datasets, intermediate results, ML models, or other files and directories (any data artifact) tracked in another -DVC repository, by downloading them into the current working directory. +DVC repository, by downloading them into the current working +directory. Note that this command doesn't require an existing DVC project to run in. It's a single-purpose command that can be used out of the box after installing DVC. 
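+For a quick, illustrative run (reusing the registry repository and file path
+that appear in other examples throughout these docs), usage can look like this:
+
+```dvc
+# Downloads the tracked archive into the current working directory:
+$ dvc get https://github.com/iterative/dataset-registry \
+          tutorial/ver/data.zip
+```
+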
diff --git a/static/docs/command-reference/import.md b/static/docs/command-reference/import.md index ef45be2824..14e65f0960 100644 --- a/static/docs/command-reference/import.md +++ b/static/docs/command-reference/import.md @@ -21,10 +21,10 @@ positional arguments: ## Description DVC provides an easy way to reuse datasets, intermediate results, ML models, or -other files and directories tracked in another DVC repository into the -workspace. The `dvc import` command downloads such a data artifact -in a way that it is tracked with DVC, so it can be updated when the data source -changes. +other files and directories tracked in another DVC repository into +the workspace. The `dvc import` command downloads such a data +artifact in a way that it is tracked with DVC, so it can be updated when +the data source changes. The `url` argument specifies the address of the Git repository containing the source project. Both HTTP and SSH protocols are supported for @@ -87,8 +87,8 @@ downloaded data artifact from the source DVC repository. ## Examples -A simple case for this command is to import a dataset from an external DVC repo, -such as our +A simple case for this command is to import a dataset from an external DVC +repository, such as our [get started example repo](https://github.com/iterative/example-get-started). ```dvc @@ -121,8 +121,8 @@ outs: ``` Several of the values above are pulled from the original stage file -`model.pkl.dvc` in the external DVC repo. `url` and `rev_lock` fields are used -to specify the origin and version of the dependency. +`model.pkl.dvc` in the external DVC repository. `url` and +`rev_lock` fields are used to specify the origin and version of the dependency. ## Example: fixed revisions & re-importing diff --git a/static/docs/command-reference/init.md b/static/docs/command-reference/init.md index 553ac7f32c..f6a69fe0f6 100644 --- a/static/docs/command-reference/init.md +++ b/static/docs/command-reference/init.md @@ -44,7 +44,7 @@ is a local cache and you cannot `git push` it. ## Examples -Create a new DVC repository (requires Git): +Create a new DVC repository (requires Git): ```dvc $ mkdir example && cd example diff --git a/static/docs/command-reference/install.md b/static/docs/command-reference/install.md index f1011cf1af..cda7101d8b 100644 --- a/static/docs/command-reference/install.md +++ b/static/docs/command-reference/install.md @@ -1,6 +1,7 @@ # install -Install Git hooks into the DVC repository to automate certain common actions. +Install Git hooks into the DVC repository to automate certain +common actions. ## Synopsis diff --git a/static/docs/tutorials/pipelines.md b/static/docs/tutorials/pipelines.md index de1c92912e..7dd30fcc54 100644 --- a/static/docs/tutorials/pipelines.md +++ b/static/docs/tutorials/pipelines.md @@ -71,7 +71,7 @@ utilizing the same set of commands that are described in earlier > project. This will be determined by the interdependencies between > DVC-files, mentioned below. 
-Initialize DVC repository (run it inside your Git repository): +Initialize DVC repository (run it inside your Git repository): ```dvc $ dvc init diff --git a/static/docs/use-cases/data-registry.md b/static/docs/use-cases/data-registry.md index 4f18832a2f..658a4a250c 100644 --- a/static/docs/use-cases/data-registry.md +++ b/static/docs/use-cases/data-registry.md @@ -34,8 +34,8 @@ The advantages of using a DVC **data registry** project are: - Storage Optimization: Track data [shared](/doc/use-cases/share-data-and-model-files) by multiple projects centralized in a single location (with the ability to create distributed - copies on other remotes). This simplifies data management and helps use - storage space efficiently. + copies on other remotes). This simplifies data management and optimizes space + requirements. - Lifecycle management: Manage your data like you do with code, leveraging Git (and GitHub) features such as version history, pull requests, reviews, or even continuous deployment of ML models. diff --git a/static/docs/user-guide/dvc-file-format.md b/static/docs/user-guide/dvc-file-format.md index c3359a7dd3..a75726d494 100644 --- a/static/docs/user-guide/dvc-file-format.md +++ b/static/docs/user-guide/dvc-file-format.md @@ -62,12 +62,12 @@ A dependency entry consists of a pair of fields: [stages](/doc/command-reference/run)) - `etag`: Strong ETag response header (only HTTP external dependencies created with `dvc import-url`) -- `repo`: This entry is only for DVC repository external dependencies created - with `dvc import`, and in itself contains the following fields: +- `repo`: This entry is only for external dependencies created with + `dvc import`, and in itself contains the following fields: - `url`: URL of Git repository with source DVC project - - `rev_lock`: Revision or version (Git commit hash) of the DVC repo at the - time of importing the dependency + - `rev_lock`: Revision or version (Git commit hash) of the external DVC + repository at the time of importing the dependency > See the examples in > [External Dependencies](/doc/user-guide/external-dependencies) for more From 2c56d992228797f2852fdd871a078f2f4e157d0f Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Wed, 6 Nov 2019 13:51:19 -0500 Subject: [PATCH 30/40] glossary: add plural form "DVC repositories" to reviously introduced term, review usage throughout --- src/Documentation/glossary.js | 2 +- static/docs/command-reference/get-url.md | 4 +-- static/docs/command-reference/import-url.md | 2 +- static/docs/command-reference/import.md | 34 +++++++++---------- static/docs/command-reference/update.md | 5 +-- static/docs/tutorials/deep/preparation.md | 11 +++--- static/docs/tutorials/deep/sharing-data.md | 6 ++-- static/docs/tutorials/pipelines.md | 6 ++-- static/docs/use-cases/data-registry.md | 2 +- .../user-guide/dvc-files-and-directories.md | 4 +-- .../docs/user-guide/external-dependencies.md | 10 +++--- 11 files changed, 44 insertions(+), 42 deletions(-) diff --git a/src/Documentation/glossary.js b/src/Documentation/glossary.js index a5d073e32a..18d70cf9f5 100644 --- a/src/Documentation/glossary.js +++ b/src/Documentation/glossary.js @@ -30,7 +30,7 @@ repository). It will contain the }, { name: 'DVC Repository', - match: ['DVC repository'], + match: ['DVC repository', 'DVC repositories'], desc: ` **DVC project** initialized using \`dvc init\` in a Git repository. 
It will contain \`.git/\` and [\`.dvc/\`](/doc/user-guide/dvc-files-and-directories) diff --git a/static/docs/command-reference/get-url.md b/static/docs/command-reference/get-url.md index 2e7808e504..bc5256b3ef 100644 --- a/static/docs/command-reference/get-url.md +++ b/static/docs/command-reference/get-url.md @@ -31,8 +31,8 @@ be placed inside of it. Note that this command doesn't require an existing DVC project to run in. It's a single-purpose command that can be used out of the box after installing DVC. -> See `dvc get` to download data or model files or directories from other DVC -> repositories (e.g. GitHub URLs). +> See `dvc get` to download data or model files or directories from other +> DVC repository (e.g. GitHub URLs). DVC supports several types of (local or) remote locations (protocols): diff --git a/static/docs/command-reference/import-url.md b/static/docs/command-reference/import-url.md index 0c7fdb84a2..891361cd6e 100644 --- a/static/docs/command-reference/import-url.md +++ b/static/docs/command-reference/import-url.md @@ -36,7 +36,7 @@ desired for the downloaded data. If an existing directory is specified, then the output will be placed inside of it. > See `dvc import` to download and tack data or model files or directories from -> other DVC repositories (e.g. GitHub URLs). +> other DVC repositories (e.g. GitHub URLs). DVC supports [DVC-files](/doc/user-guide/dvc-file-format) that refer to data in external locations, see diff --git a/static/docs/command-reference/import.md b/static/docs/command-reference/import.md index 14e65f0960..842b031560 100644 --- a/static/docs/command-reference/import.md +++ b/static/docs/command-reference/import.md @@ -35,7 +35,7 @@ The `path` argument of this command is used to specify the location of the data to be downloaded within the source project. It should point to a data file or directory tracked by that project – specified in one of the [DVC-files](/doc/user-guide/dvc-file-format) of the repository at `url`. (You -will not find these files directly in the external Git repository.) The source +will not find data files directly in the external Git repository.) The source project should have a default [DVC remote](/doc/command-reference/remote) configured, containing them.) @@ -51,15 +51,16 @@ DVC supports DVC-files that refer to data in an external DVC repository (hosted on a Git server) a.k.a _import stages_. In such a DVC-file, the `deps` section specifies the `repo` URL and data `path`, and the `outs` section contains the corresponding local path in the workspace. It records enough data from the -external file or directory to enable DVC to efficiently check it to determine -whether the local copy is out of date. +imported data to enable DVC to efficiently check it to determine whether the +local copy is out of date. To actually [track the data](https://dvc.org/doc/get-started/add-files), `git add` (and `git commit`) the import stage. Note that import stages are considered always "locked", meaning that if you run -`dvc repro`, they won't be updated. Use `dvc update` on them to update the -downloaded data artifact from the source DVC repository. +`dvc repro`, they won't be updated. Use `dvc update` or +[re-import](#example-fixed-revisions-re-importing) them to update the downloaded +data artifact from the source project. ## Options @@ -75,8 +76,7 @@ downloaded data artifact from the source DVC repository. import the data from. The tip of the repository's default branch is used by default when this option is not specified. 
Note that this adds a `rev` field in the import stage that fixes it to this revision. This can impact the - behavior of `dvc update`. (See - [re-importing](#example-fixed-revisions-re-importing) example below.) + behavior of `dvc update`. (See **re-importing** example below.) - `-h`, `--help` - prints the usage/help message, and exit. @@ -87,8 +87,8 @@ downloaded data artifact from the source DVC repository. ## Examples -A simple case for this command is to import a dataset from an external DVC -repository, such as our +A simple case for this command is to import a dataset from an external DVC +repository, such as our [get started example repo](https://github.com/iterative/example-get-started). ```dvc @@ -121,8 +121,8 @@ outs: ``` Several of the values above are pulled from the original stage file -`model.pkl.dvc` in the external DVC repository. `url` and -`rev_lock` fields are used to specify the origin and version of the dependency. +`model.pkl.dvc` in the external DVC repository. `url` and `rev_lock` fields are +used to specify the origin and version of the dependency. ## Example: fixed revisions & re-importing @@ -181,12 +181,12 @@ $ dvc import git@github.com:iterative/dataset-registry.git \ ``` `dvc import` provides a better way to incorporate data files tracked in external -projects because it saves the connection between the current project and the -source project. This means that enough information is recorded in an import -stage (DVC-file) in order to [reproduce](/doc/command-reference/repro) -downloading of this same data version in the future, where and when needed. This -is achieved with the `repo` field, for example (matching the import command -above): +DVC repositories because it saves the connection between the +current project and the source project. This means that enough information is +recorded in an import stage (DVC-file) in order to +[reproduce](/doc/command-reference/repro) downloading of this same data version +in the future, where and when needed. This is achieved with the `repo` field, +for example (matching the import command above): ```yaml md5: 96fd8e791b0ee4824fc1ceffd13b1b49 diff --git a/static/docs/command-reference/update.md b/static/docs/command-reference/update.md index 173e9ab84e..f0007a4e49 100644 --- a/static/docs/command-reference/update.md +++ b/static/docs/command-reference/update.md @@ -1,7 +1,8 @@ # update -Update data artifacts imported from external DVC repositories, and -corresponding [DVC-files](/doc/user-guide/dvc-file-format). +Update data artifacts imported from external DVC +repositories, and corresponding +[DVC-files](/doc/user-guide/dvc-file-format). ## Synopsis diff --git a/static/docs/tutorials/deep/preparation.md b/static/docs/tutorials/deep/preparation.md index d6d2871ae8..3c2682e234 100644 --- a/static/docs/tutorials/deep/preparation.md +++ b/static/docs/tutorials/deep/preparation.md @@ -93,11 +93,12 @@ $ cat .dvc/.gitignore $ git commit -m "init DVC" ``` -The `.dvc/cache` directory is one of the most important parts of any DVC -repository. The directory contains all the content of data files and will be -described in the next chapter in more detail. Note that the cache directory is -contained in the `.dvc/.gitignore` file, which means that it's not under Git -control — this is your local directory and you cannot push it to any Git remote. +The `.dvc/cache` directory is one of the most important parts of any DVC +repositories. 
The directory contains all the content of data files and +will be described in the next chapter in more detail. Note that the cache +directory is contained in the `.dvc/.gitignore` file, which means that it's not +under Git control — this is your local directory and you cannot push it to any +Git remote. For more information refer to [DVC Files and Directories](/doc/user-guide/dvc-files-and-directories). diff --git a/static/docs/tutorials/deep/sharing-data.md b/static/docs/tutorials/deep/sharing-data.md index 352e5b8ac1..ed6c2e2902 100644 --- a/static/docs/tutorials/deep/sharing-data.md +++ b/static/docs/tutorials/deep/sharing-data.md @@ -3,9 +3,9 @@ ## Pushing data to the cloud We've gone over how source code and [DVC-files](/doc/user-guide/dvc-file-format) -can be shared using a Git repository. These DVC repositories will contain all -the information needed for reproducibility, so it might be a good idea to share -them with your team using Git hosting services (such as +can be shared using a Git repository. These DVC repositories will +contain all the information needed for reproducibility, so it might be a good +idea to share them with your team using Git hosting services (such as [GitHub](https://github.com/)). DVC is able to push the cache to cloud storage. diff --git a/static/docs/tutorials/pipelines.md b/static/docs/tutorials/pipelines.md index 9827ab2b78..bc0e6df7bd 100644 --- a/static/docs/tutorials/pipelines.md +++ b/static/docs/tutorials/pipelines.md @@ -71,9 +71,9 @@ Next, we will create a [pipeline](/doc/command-reference/pipeline) step-by-step, utilizing the same set of commands that are described in earlier [Get Started](/doc/get-started) chapters. -> Note that its possible to define more than one pipeline in each DVC -> project. This will be determined by the interdependencies between -> DVC-files, mentioned below. +> Note that its possible to define more than one pipeline in each DVC project. +> This will be determined by the interdependencies between DVC-files, mentioned +> below. Initialize DVC repository (run it inside your Git repository): diff --git a/static/docs/use-cases/data-registry.md b/static/docs/use-cases/data-registry.md index 658a4a250c..10a4824371 100644 --- a/static/docs/use-cases/data-registry.md +++ b/static/docs/use-cases/data-registry.md @@ -61,7 +61,7 @@ importantly, this single dataset is tracked by 2 different one, which would better reflect the intentions of this dataset... Fortunately, we have also prepared an improved alternative in the [`use-cases/`](https://github.com/iterative/dataset-registry/tree/master/use-cases) -directory of the same repository. +directory of the same DVC repository. To create a [first version](https://github.com/iterative/dataset-registry/tree/cats-dogs-v1/use-cases) diff --git a/static/docs/user-guide/dvc-files-and-directories.md b/static/docs/user-guide/dvc-files-and-directories.md index 0b67384da4..5c6551ff18 100644 --- a/static/docs/user-guide/dvc-files-and-directories.md +++ b/static/docs/user-guide/dvc-files-and-directories.md @@ -14,8 +14,8 @@ operation: hand or with the command `dvc config --local`. - `.dvc/cache`: The [cache directory](#structure-of-cache-directory) will store - your data. The data files and directories in DVC repositories will only - contain links to the data files in the cache. (Refer to + your data. The data files and directories in DVC repositories + will only contain links to the data files in the cache. (Refer to [Large Dataset Optimization](/docs/user-guide/large-dataset-optimization). 
See `dvc config cache` for related configuration options.
 
diff --git a/static/docs/user-guide/external-dependencies.md b/static/docs/user-guide/external-dependencies.md
index c59b1ece1c..1b5d988962 100644
--- a/static/docs/user-guide/external-dependencies.md
+++ b/static/docs/user-guide/external-dependencies.md
@@ -151,9 +151,9 @@ if the file has changed and we need to download it again.
 
 ## Example: Using import
 
-`dvc import` can download a data artifact from an external DVC
-repository. It also creates an external dependency in its import
-stage (DVC-file).
+`dvc import` can download a data artifact from an external
+DVC repository. It also creates an external dependency in
+its import stage (DVC-file).
 
 ```dvc
 $ dvc import git@github.com:iterative/example-get-started model.pkl
@@ -184,7 +184,7 @@ outs:
   persist: false
 ```
 
-For external sources that are DVC repositories, `url` and `rev_lock` fields are
-used to specify the origin and version of the dependency.
+For external sources that are DVC repositories, `url` and
+`rev_lock` fields are used to specify the origin and version of the dependency.
 
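+If a specific version of the source data is needed, the `--rev` option of
+`dvc import` (described in its command reference) can pin the dependency. As an
+illustrative sketch, reusing a registry URL and Git tag mentioned elsewhere in
+these docs:
+
+```dvc
+# Fixes the import stage to the `cats-dogs-v1` Git tag via the `rev` field:
+$ dvc import --rev cats-dogs-v1 \
+             git@github.com:iterative/dataset-registry.git \
+             use-cases/cats-dogs
+```
+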
From 64996443fc86b9116b41cd6834cfd1ad7b4a65eb Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Wed, 6 Nov 2019 16:21:31 -0500 Subject: [PATCH 31/40] remove a couple unnecessary link anchors --- static/docs/command-reference/add.md | 4 +--- static/docs/command-reference/fetch.md | 5 ++--- 2 files changed, 3 insertions(+), 6 deletions(-) diff --git a/static/docs/command-reference/add.md b/static/docs/command-reference/add.md index 70d0ac2fcc..8175664c3a 100644 --- a/static/docs/command-reference/add.md +++ b/static/docs/command-reference/add.md @@ -165,9 +165,7 @@ $ file .dvc/cache/d8/acabbfd4ee51c95da5d7628c7ef74b ``` Note that tracking compressed files (e.g. ZIP or TAR archives) is not -recommended, as `dvc add` supports tracking directories. (Details below.) For -more context, refer to -[Data Registry](/doc/use-cases/data-registry#problem-1-compressed-data-files) +recommended, as `dvc add` supports tracking directories. (Details below.) ## Example: Directory diff --git a/static/docs/command-reference/fetch.md b/static/docs/command-reference/fetch.md index 64e40c551b..9b50b6c910 100644 --- a/static/docs/command-reference/fetch.md +++ b/static/docs/command-reference/fetch.md @@ -1,8 +1,7 @@ # fetch Get files that are under DVC control from -[remote](/doc/command-reference/remote#description) storage into the -cache. +[remote storage](/doc/command-reference/remote) into the cache. ## Synopsis @@ -74,7 +73,7 @@ specified in DVC-files currently in the project are considered by `dvc fetch` ## Options - `-r REMOTE`, `--remote REMOTE` - name of the - [remote storage](/doc/command-reference/remote#description) to fetch from (see + [remote storage](/doc/command-reference/remote) to fetch from (see `dvc remote list`). If not specified, the default remote is used (see `dvc config core.remote`). The argument `REMOTE` is a remote name defined using the `dvc remote` command. From 9d34076d03a18608cadcd458b667583b11231299 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Thu, 7 Nov 2019 18:35:50 -0500 Subject: [PATCH 32/40] use-cases: address a couple typos per https://github.com/iterative/dvc.org/pull/679#pullrequestreview-313619970 and https://github.com/iterative/dvc.org/pull/679#pullrequestreview-313620557 --- static/docs/user-guide/dvc-files-and-directories.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/static/docs/user-guide/dvc-files-and-directories.md b/static/docs/user-guide/dvc-files-and-directories.md index 5c6551ff18..54e3e0ed8d 100644 --- a/static/docs/user-guide/dvc-files-and-directories.md +++ b/static/docs/user-guide/dvc-files-and-directories.md @@ -55,8 +55,8 @@ example, if a data file `Posts.xml.zip` has checksum > **Note!** File checksums are calculated from file contents only. 2 or more > files with different names but the same contents can exist in the workspace -> and be tracked by DVC, but only one copy can be stored in the cache! This -> helps avoid data duplication in cache and remotes. +> and be tracked by DVC, but only one copy is stored in the cache. This helps +> avoid data duplication in cache and remotes. For the second case, let us consider a directory with 2 images. @@ -96,7 +96,7 @@ $ tree .dvc/cache     └── 0b40427ee0998e9802335d98f08cd98f ``` -The cache file with `.dir` extension is a special text file that records the +The cache file with `.dir` extension is a special text file that contains the mapping of files in the `data/` directory (as a JSON array), along with their checksums. The other two cache files are the files inside `data/`. 
A typical `.dir` cache file looks like this: From e37e4baca4b5ba87edf62257d9573f0a4fb40824 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Thu, 7 Nov 2019 18:49:24 -0500 Subject: [PATCH 33/40] use-cases: merge "Data as code" and "Lifecycle management" in data registry for https://github.com/iterative/dvc.org/pull/679#pullrequestreview-313628287 --- static/docs/use-cases/data-registry.md | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/static/docs/use-cases/data-registry.md b/static/docs/use-cases/data-registry.md index 10a4824371..621d0f9c30 100644 --- a/static/docs/use-cases/data-registry.md +++ b/static/docs/use-cases/data-registry.md @@ -22,8 +22,6 @@ The advantages of using a DVC **data registry** project are: - [Versioning](/doc/use-cases/data-and-model-files-versioning): Any version of the data or ML modes tracked by a DVC registry can be used in other projects at any time. -- Data as code: Version straightforward dataset directory structures without - special ad-hoc conventions. - Reusability: Reproduce and organize _feature stores_ with a simple CLI (`dvc get` and `dvc import` commands, similar to software package management systems like `pip`). @@ -36,9 +34,10 @@ The advantages of using a DVC **data registry** project are: centralized in a single location (with the ability to create distributed copies on other remotes). This simplifies data management and optimizes space requirements. -- Lifecycle management: Manage your data like you do with code, leveraging Git - (and GitHub) features such as version history, pull requests, reviews, or even - continuous deployment of ML models. +- Lifecycle management: Manage _data as code_, versioning simple directory + structures without ad-hoc conventions, and leverage Git (and GitHub) features + such as change history, pull requests, reviews, and even continuous deployment + of ML models. - Security: Registries can be setup to have read-only remote storage (e.g. an HTTP location). Git versioning of DVC-files allows us to track and audit data changes. From ba2e28e32c61e3b250c4611390702c2164707a4b Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Thu, 7 Nov 2019 18:58:12 -0500 Subject: [PATCH 34/40] use-cases: merge "Versioning" and "Data as code" (and "Lifecycle management") per https://github.com/iterative/dvc.org/pull/679#discussion_r343868042 --- static/docs/use-cases/data-registry.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/static/docs/use-cases/data-registry.md b/static/docs/use-cases/data-registry.md index 621d0f9c30..66ef29b205 100644 --- a/static/docs/use-cases/data-registry.md +++ b/static/docs/use-cases/data-registry.md @@ -19,9 +19,13 @@ system. The advantages of using a DVC **data registry** project are: -- [Versioning](/doc/use-cases/data-and-model-files-versioning): Any version of - the data or ML modes tracked by a DVC registry can be used in other projects - at any time. +- Data as code: Improve lifecycle management with + [versioning](/doc/use-cases/data-and-model-files-versioning) of simple + directory structures (without ad-hoc conventions); Any version of the data or + results tracked by a DVC registry can be used in other projects at any time. + Leverage Git and Git hosting (e.g. GitHub) features such as change history, + branching, pull requests, reviews, and even continuous deployment of ML + models. 
- Reusability: Reproduce and organize _feature stores_ with a simple CLI (`dvc get` and `dvc import` commands, similar to software package management systems like `pip`). @@ -34,10 +38,6 @@ The advantages of using a DVC **data registry** project are: centralized in a single location (with the ability to create distributed copies on other remotes). This simplifies data management and optimizes space requirements. -- Lifecycle management: Manage _data as code_, versioning simple directory - structures without ad-hoc conventions, and leverage Git (and GitHub) features - such as change history, pull requests, reviews, and even continuous deployment - of ML models. - Security: Registries can be setup to have read-only remote storage (e.g. an HTTP location). Git versioning of DVC-files allows us to track and audit data changes. From 9ce52cd352bbb074a9749fe5c337575f6716227e Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Fri, 8 Nov 2019 12:53:14 -0500 Subject: [PATCH 35/40] get: update description to clarify data is downloaded from remote storage per https://github.com/iterative/dvc.org/pull/679#pullrequestreview-305538837 --- static/docs/command-reference/get.md | 7 ++++--- static/docs/get-started/add-files.md | 8 ++++---- static/docs/tutorials/interactive.md | 4 ++-- static/docs/tutorials/pipelines.md | 8 ++++---- static/docs/tutorials/versioning.md | 8 ++++---- static/docs/use-cases/data-registry.md | 8 +++++++- 6 files changed, 25 insertions(+), 18 deletions(-) diff --git a/static/docs/command-reference/get.md b/static/docs/command-reference/get.md index 49308231b3..120b3c98a3 100644 --- a/static/docs/command-reference/get.md +++ b/static/docs/command-reference/get.md @@ -1,7 +1,8 @@ # get -Download or copy file or directory from any DVC project in a Git -repository (e.g. hosted on GitHub) into the current working directory. +Download or copy file or directory from the +[remote storage](/doc/command-reference/remote) of any DVC project +in a Git repository (e.g. hosted on GitHub) into the current working directory. > Unlike `dvc import`, this command does not track the downloaded data files > (does not create a DVC-file). @@ -21,7 +22,7 @@ positional arguments: Provides an easy way to download datasets, intermediate results, ML models, or other files and directories (any data artifact) tracked in another DVC repository, by downloading them into the current working -directory. +directory. (It works like `wget`, but for DVC repositories.) Note that this command doesn't require an existing DVC project to run in. It's a single-purpose command that can be used out of the box after installing DVC. diff --git a/static/docs/get-started/add-files.md b/static/docs/get-started/add-files.md index 11dab42057..3092c0ae69 100644 --- a/static/docs/get-started/add-files.md +++ b/static/docs/get-started/add-files.md @@ -10,10 +10,10 @@ $ dvc get https://github.com/iterative/dataset-registry \ get-started/data.xml -o data/data.xml ``` -> `dvc get` can download data artifacts from the -> [remote storage](/doc/command-reference/remote) of any DVC -> project hosted on a Git repository (similar to `wget` but for DVC -> repositories). In this case we use +> `dvc get` can use any DVC project hosted on a Git repository to +> find the appropriate [remote storage](/doc/command-reference/remote) and +> download data artifacts from it. (It works like `wget`, but for +> DVC repositories.) In this case we use > [dataset-registry](https://github.com/iterative/dataset-registry)) as the > source project. 
(Refer to [Data Registry](/doc/use-cases/data-registry) for > more info about this setup.) diff --git a/static/docs/tutorials/interactive.md b/static/docs/tutorials/interactive.md index 7d4c5a4275..09dcb5029e 100644 --- a/static/docs/tutorials/interactive.md +++ b/static/docs/tutorials/interactive.md @@ -28,8 +28,8 @@ Learn basic concepts and features of DVC with interactive lessons: 6. [Importing Data](https://katacoda.com/dvc/courses/basics/importing)
Download and track data from the - [remote storage](/doc/command-reference/remote) of another DVC project that - is hosted on a Git repository. + [remote storage](/doc/command-reference/remote) of any DVC project that is + hosted on a Git repository. ## Simple ML Scenarios diff --git a/static/docs/tutorials/pipelines.md b/static/docs/tutorials/pipelines.md index bc0e6df7bd..767576d9ae 100644 --- a/static/docs/tutorials/pipelines.md +++ b/static/docs/tutorials/pipelines.md @@ -48,10 +48,10 @@ $ git add code/ $ git commit -m "Download and add code to new Git repo" ``` -> `dvc get` can download data artifacts from the -> [remote storage](/doc/command-reference/remote) of any DVC -> project hosted on a Git repository (similar to `wget` but for DVC -> repositories). In this case we use +> `dvc get` can use any DVC project hosted on a Git repository to +> find the appropriate [remote storage](/doc/command-reference/remote) and +> download data artifacts from it. (It works like `wget`, but for +> DVC repositories.) In this case we use > [dataset-registry](https://github.com/iterative/dataset-registry)) as the > source project. (Refer to [Data Registry](/doc/use-cases/data-registry) for > more info about this setup.) diff --git a/static/docs/tutorials/versioning.md b/static/docs/tutorials/versioning.md index 7c89291e82..8c759bf364 100644 --- a/static/docs/tutorials/versioning.md +++ b/static/docs/tutorials/versioning.md @@ -83,10 +83,10 @@ $ unzip -q data.zip $ rm -f data.zip ``` -> `dvc get` can download data artifacts from the -> [remote storage](/doc/command-reference/remote) of any DVC -> project hosted on a Git repository (similar to `wget` but for DVC -> repositories). In this case we use +> `dvc get` can use any DVC project hosted on a Git repository to +> find the appropriate [remote storage](/doc/command-reference/remote) and +> download data artifacts from it. (It works like `wget`, but for +> DVC repositories.) In this case we use > [dataset-registry](https://github.com/iterative/dataset-registry)) as the > source project. (Refer to [Data Registry](/doc/use-cases/data-registry) for > more info about this setup.) diff --git a/static/docs/use-cases/data-registry.md b/static/docs/use-cases/data-registry.md index 66ef29b205..084cc14d3c 100644 --- a/static/docs/use-cases/data-registry.md +++ b/static/docs/use-cases/data-registry.md @@ -1,5 +1,11 @@ # Data Registry +## Introduction + +Sell it, advantages + +## Workflow (How) + We developed the `dvc get`, `dvc import`, and `dvc update` commands with the aim to enable reusability of any data artifacts (raw data, intermediate results, models, etc) between different projects. For example, project A may use @@ -19,7 +25,7 @@ system. The advantages of using a DVC **data registry** project are: -- Data as code: Improve lifecycle management with +- Data as code: Improve _lifecycle management_ with [versioning](/doc/use-cases/data-and-model-files-versioning) of simple directory structures (without ad-hoc conventions); Any version of the data or results tracked by a DVC registry can be used in other projects at any time. 
From 3e4ef7db28376b808eb4f52d0cafcec0c472aa4e Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Fri, 8 Nov 2019 18:04:45 -0500 Subject: [PATCH 36/40] use-cases: un-do WIP text from data registry --- static/docs/use-cases/data-registry.md | 6 ------ 1 file changed, 6 deletions(-) diff --git a/static/docs/use-cases/data-registry.md b/static/docs/use-cases/data-registry.md index 084cc14d3c..25716bf310 100644 --- a/static/docs/use-cases/data-registry.md +++ b/static/docs/use-cases/data-registry.md @@ -1,11 +1,5 @@ # Data Registry -## Introduction - -Sell it, advantages - -## Workflow (How) - We developed the `dvc get`, `dvc import`, and `dvc update` commands with the aim to enable reusability of any data artifacts (raw data, intermediate results, models, etc) between different projects. For example, project A may use From f7061249924118ab40b52b8c7975bdd3454a5324 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Fri, 8 Nov 2019 19:15:24 -0500 Subject: [PATCH 37/40] user-guide: small improvement in dvc-file-format intro --- static/docs/user-guide/dvc-file-format.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/static/docs/user-guide/dvc-file-format.md b/static/docs/user-guide/dvc-file-format.md index a75726d494..0b61870593 100644 --- a/static/docs/user-guide/dvc-file-format.md +++ b/static/docs/user-guide/dvc-file-format.md @@ -3,9 +3,9 @@ When you add a file (with `dvc add`) or a command (with `dvc run`) to a [pipeline](/doc/command-reference/pipeline), DVC creates a special text metafile with the `.dvc` file extension (e.g. `process.dvc`), or with the default name -`Dvcfile`. DVC-files a.k.a. **stage files** contain all the needed information -to track your data and reproduce pipeline stages. The file itself contains a -simple YAML format that could be easily written or altered manually. +`Dvcfile`. These **DVC-files** (a.k.a. stage files) contain all the needed +information to track your data and reproduce pipeline stages. The file itself +contains a simple YAML format that could be easily written or altered manually. See the [Syntax Highlighting](/doc/install/plugins) to learn how to enable the highlighting for your editor. From 2e31691bc99bb66ae8199eb3935eeb46ba964a5c Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Fri, 8 Nov 2019 19:18:31 -0500 Subject: [PATCH 38/40] use-cases: revise intros of existing cases in prep for https://github.com/iterative/dvc.org/pull/679#pullrequestreview-305542903 --- .../use-cases/shared-development-server.md | 18 ++++++++---------- .../use-cases/sharing-data-and-model-files.md | 14 +++++++------- .../versioning-data-and-model-files.md | 14 +++++++------- 3 files changed, 22 insertions(+), 24 deletions(-) diff --git a/static/docs/use-cases/shared-development-server.md b/static/docs/use-cases/shared-development-server.md index 2a83d890e3..6bbd866534 100644 --- a/static/docs/use-cases/shared-development-server.md +++ b/static/docs/use-cases/shared-development-server.md @@ -1,21 +1,19 @@ # Shared Development Server Some teams may prefer using one single shared machine to run their experiments. -This allows them to have better resource utilization such as the ability to use -multiple GPUs, centralize all data storage, etc. +This allows better resource utilization, such as the ability to use multiple +GPUs, centralized data storage, etc. 
With DVC, you can easily setup shared data +storage on a server accessed by several users, in a way that enables almost +instantaneous workspace restoration/switching speed for everyone – +similar to `git checkout` for your code. ![](/static/img/shared-server.png) -With DVC, you can easily setup shared data storage on the server. This allows -your team to store and share data for your projects effectively, and to have -almost instantaneous workspace restoration/switching speed – -similar to `git checkout` for your code. - ## Preparation -Create a shared directory to be used as cache location for -everyone's projects, so that all your colleagues can use the same -project cache: +Create a shared directory to be used as the cache location for +everyone's DVC projects, so that all your colleagues can use the +same project cache: ```dvc $ mkdir -p /path/to/dvc-cache diff --git a/static/docs/use-cases/sharing-data-and-model-files.md b/static/docs/use-cases/sharing-data-and-model-files.md index 6a3dce4194..fc5dc395ab 100644 --- a/static/docs/use-cases/sharing-data-and-model-files.md +++ b/static/docs/use-cases/sharing-data-and-model-files.md @@ -1,14 +1,14 @@ # Sharing Data and Model Files Like Git, DVC allows for a distributed environment and collaboration. We make it -easy to consistently get all your data files and directories, along with -matching source code to any machine. All you need to do is to setup +easy to consistently get all your data files and directories into any machine, +along with matching source code. All you need to do is to setup [remote storage](/doc/command-reference/remote) for your DVC -project to store data files online, where others can reach them. -Currently DVC supports Amazon S3, Google Cloud Storage, Microsoft Azure Blob -Storage, SSH, HDFS, and other remote locations, and the list is constantly -growing. (For a complete list of supported remote types and their configuration, -take a look at the examples in `dvc remote add`.) +project, and push the data there, so others can reach it. Currently DVC +supports Amazon S3, Google Cloud Storage, Microsoft Azure Blob Storage, SSH, +HDFS, and other remote locations, and the list is constantly growing. (For a +complete list and configuration instructions, take a look at the examples in +`dvc remote add`.) ![](/static/img/model-sharing-digram.png) diff --git a/static/docs/use-cases/versioning-data-and-model-files.md b/static/docs/use-cases/versioning-data-and-model-files.md index 4e590a4f7e..134d2e4f29 100644 --- a/static/docs/use-cases/versioning-data-and-model-files.md +++ b/static/docs/use-cases/versioning-data-and-model-files.md @@ -5,13 +5,13 @@ > [Versioning](/doc/tutorials/versioning) tutorial. DVC allows versioning data files and directories, intermediate results, and ML -models using Git, but without storing the file contents in the repository. It's -useful in general when dealing with files that are too large for Git to handle -properly. DVC records information about your data in a special -[DVC-file](/doc/user-guide/dvc-file-format). This description of files or -directories can be used for versioning. DVC supports various types of -[remote storage](/doc/command-reference/remote) for data that allows easily -saving and sharing data alongside code. +models using Git, but without storing the file contents in the Git repository. +It's useful when dealing with files that are too large for Git to handle +properly in general. 
DVC saves information about your data in special +[DVC-files](/doc/user-guide/dvc-file-format), and these metafiles can be used +for versioning. To actually store the data, DVC supports various types of +[remote storage](/doc/command-reference/remote). This allows easily saving and +sharing data alongside code. ![](/static/img/model-versioning-diagram.png) From 6425a5dba1fbe306e83590766f97059729283e54 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Sat, 9 Nov 2019 16:42:18 -0500 Subject: [PATCH 39/40] use-cases: rewrite data registry intro (1) per https://github.com/iterative/dvc.org/pull/679#discussion_r341730114 --- static/docs/use-cases/data-registry.md | 43 +++++++++++++++----------- static/docs/use-cases/index.md | 14 +++++---- 2 files changed, 33 insertions(+), 24 deletions(-) diff --git a/static/docs/use-cases/data-registry.md b/static/docs/use-cases/data-registry.md index 25716bf310..933e304e2b 100644 --- a/static/docs/use-cases/data-registry.md +++ b/static/docs/use-cases/data-registry.md @@ -1,21 +1,28 @@ # Data Registry -We developed the `dvc get`, `dvc import`, and `dvc update` commands with the aim -to enable reusability of any data artifacts (raw data, intermediate -results, models, etc) between different projects. For example, project A may use -a data file to begin its data [pipeline](/doc/command-reference/pipeline), but -project B also requires this same file; Instead of +One of the main uses of DVC repositories is the +[versioning of data and model files](/doc/use-cases/data-and-model-files-versioning). +This is provided by commands such as `dvc add` and `dvc run`, that allow +tracking of datasets and any other data artifacts. + +With the aim to enable reusability of these versioned artifacts between +different projects (similar to package management systems, but for data), DVC +also includes the `dvc get`, `dvc import`, and `dvc update` commands. For +example, project A may use a data file to begin its data +[pipeline](/doc/command-reference/pipeline), but project B also requires this +same file; Instead of [adding it](/doc/command-reference/add#example-single-file) it to both projects, -B can simply import it from A. - -Taking this idea to a useful extreme, we could create a project -that is exclusively dedicated to -[tracking and versioning](/doc/use-cases/data-and-model-files-versioning) -datasets (or any kind of large files) – by mainly using `dvc add` to build it. -Other projects can then share these artifacts by downloading (`dvc get`) or -importing (`dvc import`) them for use in different data processes – and these -don't even have to be _DVC projects_, as `dvc get` works anywhere in your -system. +B can simply import it from A. Furthermore, the version of the data file +imported to B can be an older iteration than what's currently used in A. + +Keeping this in mind, we could build a DVC project dedicated to +tracking and versioning datasets (or any kind of large files). This way we would +have a repository that has all the metadata and change history for the project's +data. We can see who updated what, and when; use pull requests to update data +the same way you do with code; and we don't need ad-hoc conventions to store +different data versions. Other projects can share the data in the registry by +downloading (`dvc get`) or importing (`dvc import`) them for use in different +data processes. 
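
As a rough sketch of the registry-side workflow described above, a new dataset
version could be tracked with `dvc add`, recorded with Git, and published with
`dvc push` (assuming a default DVC remote is already configured). The directory
and file names below are illustrative only:

```dvc
$ # Inside the registry repository: track a new version of a dataset...
$ dvc add datasets/images
$ git add datasets/images.dvc datasets/.gitignore
$ git commit -m "Track new version of the images dataset"

$ # ...and upload the data itself to the default remote storage:
$ dvc push
```

Consumer projects can then point `dvc get` or `dvc import` at this repository,
and later refresh imported artifacts with `dvc update` once the registry
publishes a new version.
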
The advantages of using a DVC **data registry** project are: @@ -114,9 +121,9 @@ See the `dvc import` command reference for more details on the `--rev` Importing keeps the connection between the local project and the source data registry where we are downloading the dataset from. This is achieved by creating -a special [DVC-file](/doc/user-guide/dvc-file-format) (a.k.a. _import stage_) -that uses the `repo` field. (This file can be used for versioning the import -with Git.) +a particular kind of [DVC-file](/doc/user-guide/dvc-file-format) that uses the +`repo` field (a.k.a. _import stage_). (This file can be used for versioning the +import with Git.) > For a sample DVC-file resulting from `dvc import`, refer to > [this example](/doc/command-reference/import#example-data-registry). diff --git a/static/docs/use-cases/index.md b/static/docs/use-cases/index.md index 1c6db65adf..e1d10566ff 100644 --- a/static/docs/use-cases/index.md +++ b/static/docs/use-cases/index.md @@ -9,13 +9,15 @@ range from basic to more advanced: - [Data Versioning](/doc/use-cases/versioning-data-and-model-files) describes our most primary use: tracking and versioning large files with Git + DVC. - [Sharing Data and Model Files](/doc/use-cases/sharing-data-and-model-files) - goes over basic collaboration possibilities enabled by DVC. -- [Shared Development Server](/doc/use-cases/shared-development-server) - describes a single development machine setup for teams that prefer so. + goes over the basic collaboration possibilities enabled by DVC. +- [Shared Development Server](/doc/use-cases/shared-development-server) provides + instructions to setup a single development machine for teams that prefer so. +- [Data Registry](/doc/use-cases/data-registry) explains how to use a DVC + repository as a shared hub for reusing datasets among several projects. -This list of use cases is _not_ exhaustive. We keep reviewing our docs and will -include interesting scenarios that surface in our community. Please, -[contact us](/support) if you need help or have suggestions! +> This list of use cases is **not** exhaustive. We keep reviewing our docs and +> will include interesting scenarios that surface in the community. Please, +> [contact us](/support) if you need help or have suggestions! Use cases are not written to be run end-to-end. For more general, hands-on experience with DVC, we recommend following the [Get Started](/doc/get-started), From cb0726ff7f6217ed9ebb563b281ba62db59e0762 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Mon, 11 Nov 2019 13:43:41 -0500 Subject: [PATCH 40/40] use-cases: add "or models" to the short data registry description per https://github.com/iterative/dvc.org/pull/679#pullrequestreview-314590334 --- static/docs/use-cases/index.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/static/docs/use-cases/index.md b/static/docs/use-cases/index.md index e1d10566ff..90c9288132 100644 --- a/static/docs/use-cases/index.md +++ b/static/docs/use-cases/index.md @@ -13,7 +13,8 @@ range from basic to more advanced: - [Shared Development Server](/doc/use-cases/shared-development-server) provides instructions to setup a single development machine for teams that prefer so. - [Data Registry](/doc/use-cases/data-registry) explains how to use a DVC - repository as a shared hub for reusing datasets among several projects. + repository as a shared hub for reusing datasets or models among several + projects. > This list of use cases is **not** exhaustive. 
We keep reviewing our docs and > will include interesting scenarios that surface in the community. Please,