From 4296de0bfa8afc5ff74823611993ee814111a6a6 Mon Sep 17 00:00:00 2001
From: Jeremy DesRoches <18587991+jeremydesroches@users.noreply.github.com>
Date: Thu, 24 Sep 2020 22:00:17 +0300
Subject: [PATCH 01/15] Added machine learning keyword

---
 content/docs/user-guide/what-is-dvc.md | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/content/docs/user-guide/what-is-dvc.md b/content/docs/user-guide/what-is-dvc.md
index 18f86f0acb..06279ef300 100644
--- a/content/docs/user-guide/what-is-dvc.md
+++ b/content/docs/user-guide/what-is-dvc.md
@@ -1,10 +1,11 @@
 # What Is DVC?
 
-**Data Version Control** is a new type of data versioning, workflow and
-experiment management software, that builds upon [Git](https://git-scm.com/)
-(although it can work stand-alone). DVC reduces the gap between established
-engineering tool sets and data science needs, allowing users to take advantage
-of new [features](#core-features) while reusing existing skills and intuition.
+**Data Version Control** is a new type of data versioning, workflow and machine
+learning experiment management software, that builds upon
+[Git](https://git-scm.com/) (although it can work stand-alone). DVC reduces the
+gap between established engineering tool sets and data science needs, allowing
+users to take advantage of new [features](#core-features) while reusing existing
+skills and intuition.
 
 ![](/img/reproducibility.png) _DVC codifies data and ML experiments_
 

From 1039affb8926cd7459cc77ff3d1a78c9db124f5b Mon Sep 17 00:00:00 2001
From: Jeremy DesRoches <18587991+jeremydesroches@users.noreply.github.com>
Date: Thu, 24 Sep 2020 22:19:02 +0300
Subject: [PATCH 02/15] Added additional ML references for SEO

---
 content/docs/user-guide/what-is-dvc.md | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/content/docs/user-guide/what-is-dvc.md b/content/docs/user-guide/what-is-dvc.md
index 06279ef300..b4a7e17203 100644
--- a/content/docs/user-guide/what-is-dvc.md
+++ b/content/docs/user-guide/what-is-dvc.md
@@ -1,17 +1,18 @@
 # What Is DVC?
 
 **Data Version Control** is a new type of data versioning, workflow and machine
-learning experiment management software, that builds upon
+learning experiment management software that builds upon
 [Git](https://git-scm.com/) (although it can work stand-alone). DVC reduces the
 gap between established engineering tool sets and data science needs, allowing
 users to take advantage of new [features](#core-features) while reusing existing
 skills and intuition.
 
-![](/img/reproducibility.png) _DVC codifies data and ML experiments_
+![](/img/reproducibility.png) _DVC codifies data and machine learning
+experiments_
 
-Data science experiment sharing and collaboration can be done through a regular
-Git flow (commits, branching, pull requests, etc.), the same way it works for
-software engineers.
+Using DVC, data scientists can apply a regular Git flow to ML project sharing
+and collaboration (commits, branching, pull requests, etc.), the same way it
+works for software engineers.
 
 ## Core Features
 
@@ -23,7 +24,7 @@ software engineers.
   [versioning](/doc/use-cases/versioning-data-and-model-files) capabilities.
 
 - **Data versioning** is enabled by replacing large files, dataset directories,
-  ML models, etc. with small
+  machine learning models, etc. with small
   [metafiles](/doc/user-guide/dvc-files-and-directories) (easy to handle with
   Git). These placeholders point to the original data, which is decoupled from
   source code management.

From f71ef34dfb3fe4e396a50257b39131c865f6eb07 Mon Sep 17 00:00:00 2001
From: Jeremy DesRoches <18587991+jeremydesroches@users.noreply.github.com>
Date: Thu, 24 Sep 2020 22:46:58 +0300
Subject: [PATCH 03/15] Added keywords to use cases index doc to expand search terms

---
 content/docs/use-cases/index.md | 24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/content/docs/use-cases/index.md b/content/docs/use-cases/index.md
index 971ad2e99f..3f876a8e56 100644
--- a/content/docs/use-cases/index.md
+++ b/content/docs/use-cases/index.md
@@ -1,18 +1,17 @@
 # Use Cases
 
-We provide short articles on common ML workflow or data management scenarios
-that DVC can help with or improve. Our use cases are not written to be run
-end-to-end like tutorials. For more general, hands-on experience with DVC,
-please see our [Get Started](/doc/tutorials/get-started) instead.
+We provide short articles on common ML workflow and data science use cases that
+DVC can help with or improve. Our use cases are not written to be run end-to-end
+like tutorials. For more general, hands-on experience with DVC, please see our
+[Get Started](/doc/tutorials/get-started) instead.
 
 ## Why DVC?
 
 Even with all the success we've seen today in machine learning (ML), especially
-with deep learning and its applications in business, the data science community
-still lacks good practices for organizing their projects and collaborating
-effectively. This is a critical challenge: while ML algorithms and methods are
-no longer tribal knowledge, they are still difficult to implement, reuse, and
-manage.
+with deep learning and its applications in business, data scientists still lack
+good tools for organizing their projects and collaborating effectively. This is
+a critical challenge: while ML algorithms and methods are no longer tribal
+knowledge, they are still difficult to implement, reuse, and manage.
 
 ## Basic uses of DVC
 
@@ -20,9 +19,10 @@ If you store and process data files or datasets to produce other data or machine
 learning models, and you want to
 
 - capture and save data artifacts the same way you capture code;
-- track and switch between different versions of data or models easily;
-- understand how data or models were built in the first place;
-- be able to compare models and metrics to each other;
+- track, control, and switch between different versions of data or models
+  easily;
+- understand how machine learning models were built in the first place;
+- compare ML models and metrics to each other;
 - bring software engineering best practices to your data science team
 
 DVC is for you!

From 916ca5fa861e8847053cd6da86f0125a5082ed4e Mon Sep 17 00:00:00 2001
From: Jeremy DesRoches <18587991+jeremydesroches@users.noreply.github.com>
Date: Thu, 24 Sep 2020 23:08:05 +0300
Subject: [PATCH 04/15] Added model and data versioning references to expand search terms

---
 .../versioning-data-and-model-files/index.md | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/content/docs/use-cases/versioning-data-and-model-files/index.md b/content/docs/use-cases/versioning-data-and-model-files/index.md
index be28edd905..068d1bad31 100644
--- a/content/docs/use-cases/versioning-data-and-model-files/index.md
+++ b/content/docs/use-cases/versioning-data-and-model-files/index.md
@@ -11,8 +11,8 @@ pull requests, etc.)
 
 To actually store the data, DVC uses a built-in cache, and supports
 synchronizing it with various types of
-[remote storage](/doc/command-reference/remote). This allows storing and sharing
-data easily, and alongside code.
+[remote storage](/doc/command-reference/remote). This allows for easy data and
+model versioning, storage, and sharing — right alongside code.
 
 ![](/img/model-versioning-diagram.png) _Code and data flows in DVC_
 
@@ -30,9 +30,9 @@ on-premises storage (e.g. SSH, NAS) as well as any major cloud storage provider
 ## DVC is not Git!
 
 DVC metafiles such as `dvc.yaml` and `.dvc` files serve as placeholders to track
-data files and directories (among other purposes). They point to specific data
-contents in the cache, providing the ability to store multiple data
-versions out-of-the-box.
+the version of data files and directories (among other purposes). They point to
+specific data contents in the cache, providing the ability to store
+multiple data versions out-of-the-box.
 
 Full-fledged
 [version control](https://git-scm.com/book/en/v2/Getting-Started-About-Version-Control)
@@ -46,7 +46,7 @@ several other novel features (see [Get Started](/doc/start/) for a primer.)
 
 Let's say you have an empty DVC repository and put a dataset of
 images in the `images/` directory. You can start tracking it with `dvc add`.
-This generate a `.dvc` file, which can be committed to Git in order to save the
+This generates a `.dvc` file, which can be committed to Git in order to save the
 project's version:
 
 ```dvc
@@ -116,7 +116,8 @@ M model.pkl
 ```
 
 However, we can checkout certain parts only, for example if we want to keep the
-latest source code and model but rewind to the previous dataset only:
+latest source code and model versions, but rewind to the previous version of the
+dataset:
 
 ```dvc
 $ git checkout v1.0 images.dvc
@@ -125,5 +126,5 @@ M images
 ```
 
 DVC [optimizes](/doc/user-guide/large-dataset-optimization) this operation by
-avoiding copying files each time, so checking out data is quick even if you have
-large data files.
+avoiding copying files each time, so checking out data is quick even if you are
+versioning large data files.

From 5f1708e838e02c30a0f9d14d13e0a1fd484e71e1 Mon Sep 17 00:00:00 2001
From: Jeremy DesRoches <18587991+jeremydesroches@users.noreply.github.com>
Date: Thu, 24 Sep 2020 23:15:26 +0300
Subject: [PATCH 05/15] Added 'data' to title for SEO

---
 .../docs/use-cases/versioning-data-and-model-files/tutorial.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/docs/use-cases/versioning-data-and-model-files/tutorial.md b/content/docs/use-cases/versioning-data-and-model-files/tutorial.md
index ad8c5a628e..a08c958923 100644
--- a/content/docs/use-cases/versioning-data-and-model-files/tutorial.md
+++ b/content/docs/use-cases/versioning-data-and-model-files/tutorial.md
@@ -1,4 +1,4 @@
-# Tutorial: Versioning
+# Tutorial: Data Versioning
 
 The goal of this example is to give you some hands-on experience with a basic
 machine learning version control scenario: working with multiple versions of

From 508da285cd9506ae4240942ea9e9e46757abfc69 Mon Sep 17 00:00:00 2001
From: Jeremy DesRoches <18587991+jeremydesroches@users.noreply.github.com>
Date: Fri, 25 Sep 2020 15:49:59 +0300
Subject: [PATCH 06/15] Added 'data' and 'ml model' versioning/version references for SEO

---
 .../tutorial.md | 22 +++++++++----------
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/content/docs/use-cases/versioning-data-and-model-files/tutorial.md b/content/docs/use-cases/versioning-data-and-model-files/tutorial.md
index a08c958923..206cc48352 100644
--- a/content/docs/use-cases/versioning-data-and-model-files/tutorial.md
+++ b/content/docs/use-cases/versioning-data-and-model-files/tutorial.md
@@ -1,8 +1,8 @@
-# Tutorial: Data Versioning
+# Tutorial: Data & Model Versioning
 
 The goal of this example is to give you some hands-on experience with a basic
-machine learning version control scenario: working with multiple versions of
-datasets and ML models using DVC commands. We'll work with a
+machine learning version control scenario: managing multiple dataset and ML
+model versions using DVC commands. We'll work with a
 [tutorial](https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html)
 that [François Chollet](https://twitter.com/fchollet) put together to show how
 to build a powerful image classifier using a pretty small dataset.
@@ -237,9 +237,9 @@ $ git commit -m "Second model, trained with 2000 images"
 $ git tag -a "v2.0" -m "model v2.0, 2000 images"
 ```
 
-That's it! We have tracked a second dataset, model, and metrics versioned DVC,
-and the DVC-files that point to them committed with Git. Let's now look at how
-DVC can help us go back to the previous version if we need to.
+That's it! We've tracked the second version of the dataset, model, and metrics
+in DVC and committed the DVC-files that point to them with Git. Now let's look
+at how DVC can help us go back to the previous version if we need to.
 
 ## Switching between workspace versions
 
@@ -338,15 +338,15 @@ changed. For example, when we added new images to built the second version of
 our model, that was a dependency change. It also updates outputs and puts them
 into the cache.
 
-To make things a little simpler: if `dvc add` and `dvc checkout` provide a basic
-mechanism to version control large data files or models, `dvc run` and
-`dvc repro` provide a build system for ML models, which is similar to
+To make things a little simpler: `dvc add` and `dvc checkout` provide a basic
+mechanism for model and large dataset versioning. `dvc run` and `dvc repro`
+provide a build system for machine learning models, which is similar to
 [Make](https://www.gnu.org/software/make/) in software build automation.
 
 ## What's next?
 
-In this example, our focus was on giving you hands-on experience with versioning
-ML models and datasets. We specifically looked at the `dvc add` and
+In this example, our focus was on giving you hands-on experience with dataset
+and ML model versioning. We specifically looked at the `dvc add` and
 `dvc checkout` commands. We'd also like to outline some topics and ideas you
 might be interested to try next to learn more about DVC and how it makes
 managing ML projects simpler.

From 9da95fa96be963d8450ec8bd71af4af4ec078503 Mon Sep 17 00:00:00 2001
From: Jeremy DesRoches <18587991+jeremydesroches@users.noreply.github.com>
Date: Fri, 25 Sep 2020 21:33:51 +0300
Subject: [PATCH 07/15] Removes extra "ML" reference that changes meaning

---
 content/docs/user-guide/what-is-dvc.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/content/docs/user-guide/what-is-dvc.md b/content/docs/user-guide/what-is-dvc.md
index b4a7e17203..0a8b1294e4 100644
--- a/content/docs/user-guide/what-is-dvc.md
+++ b/content/docs/user-guide/what-is-dvc.md
@@ -10,9 +10,9 @@ skills and intuition.
 ![](/img/reproducibility.png) _DVC codifies data and machine learning
 experiments_
 
-Using DVC, data scientists can apply a regular Git flow to ML project sharing
-and collaboration (commits, branching, pull requests, etc.), the same way it
-works for software engineers.
+Using DVC, data scientists can apply a regular Git flow to project sharing and
+collaboration (commits, branching, pull requests, etc.), the same way it works
+for software engineers.
 
 ## Core Features
 

From 10d3ff3f2cb94b3f32e9879756dbec8e5c642beb Mon Sep 17 00:00:00 2001
From: Jeremy DesRoches <18587991+jeremydesroches@users.noreply.github.com>
Date: Wed, 30 Sep 2020 18:03:43 +0300
Subject: [PATCH 08/15] Reverts previous additions and expands second paragraph.

---
 content/docs/user-guide/what-is-dvc.md | 25 ++++++++++++-------------
 1 file changed, 12 insertions(+), 13 deletions(-)

diff --git a/content/docs/user-guide/what-is-dvc.md b/content/docs/user-guide/what-is-dvc.md
index 0a8b1294e4..89451e22cd 100644
--- a/content/docs/user-guide/what-is-dvc.md
+++ b/content/docs/user-guide/what-is-dvc.md
@@ -1,18 +1,17 @@
 # What Is DVC?
 
-**Data Version Control** is a new type of data versioning, workflow and machine
-learning experiment management software that builds upon
-[Git](https://git-scm.com/) (although it can work stand-alone). DVC reduces the
-gap between established engineering tool sets and data science needs, allowing
-users to take advantage of new [features](#core-features) while reusing existing
-skills and intuition.
-
-![](/img/reproducibility.png) _DVC codifies data and machine learning
-experiments_
-
-Using DVC, data scientists can apply a regular Git flow to project sharing and
-collaboration (commits, branching, pull requests, etc.), the same way it works
-for software engineers.
+**Data Version Control** is a new type of data versioning, workflow and
+experiment management software that builds upon [Git](https://git-scm.com/)
+(although it can work stand-alone). DVC reduces the gap between established
+engineering tool sets and data science needs, allowing users to take advantage
+of new [features](#core-features) while reusing existing skills and intuition.
+
+![](/img/reproducibility.png) _DVC codifies data and ML experiments_
+
+The same way that software engineers use Git, data scientists can use DVC to
+apply a regular flow to project sharing and collaboration (commits, branching,
+pull requests, etc.). Using Git and DVC, data science and machine learning teams
+can version experiments, manage large datasets, and make projects reproducible.
 
 ## Core Features
 

From 960db419c506d9298af113bea3778bef53073c3f Mon Sep 17 00:00:00 2001
From: Jeremy DesRoches <18587991+jeremydesroches@users.noreply.github.com>
Date: Wed, 30 Sep 2020 18:09:34 +0300
Subject: [PATCH 09/15] Fixes missing comma in first paragraph.

---
 content/docs/user-guide/what-is-dvc.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/docs/user-guide/what-is-dvc.md b/content/docs/user-guide/what-is-dvc.md
index 89451e22cd..9979fc8cdd 100644
--- a/content/docs/user-guide/what-is-dvc.md
+++ b/content/docs/user-guide/what-is-dvc.md
@@ -1,7 +1,7 @@
 # What Is DVC?
 
 **Data Version Control** is a new type of data versioning, workflow and
-experiment management software that builds upon [Git](https://git-scm.com/)
+experiment management software, that builds upon [Git](https://git-scm.com/)
 (although it can work stand-alone). DVC reduces the gap between established
 engineering tool sets and data science needs, allowing users to take advantage
 of new [features](#core-features) while reusing existing skills and intuition.

From 6edacb2d2e68ab4d4c2f4e24dc7957cf2db81236 Mon Sep 17 00:00:00 2001
From: Jeremy DesRoches <18587991+jeremydesroches@users.noreply.github.com>
Date: Wed, 30 Sep 2020 18:15:07 +0300
Subject: [PATCH 10/15] Fixes another missing comma in first paragraph.

---
 content/docs/user-guide/what-is-dvc.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/docs/user-guide/what-is-dvc.md b/content/docs/user-guide/what-is-dvc.md
index 9979fc8cdd..2dbe9294f2 100644
--- a/content/docs/user-guide/what-is-dvc.md
+++ b/content/docs/user-guide/what-is-dvc.md
@@ -1,6 +1,6 @@
 # What Is DVC?
 
-**Data Version Control** is a new type of data versioning, workflow and
+**Data Version Control** is a new type of data versioning, workflow, and
 experiment management software, that builds upon [Git](https://git-scm.com/)
 (although it can work stand-alone). DVC reduces the gap between established
 engineering tool sets and data science needs, allowing users to take advantage

From d599f1ab24044df146ab51957e0735325f11a33a Mon Sep 17 00:00:00 2001
From: Jorge Orpinel
Date: Wed, 30 Sep 2020 18:51:53 -0500
Subject: [PATCH 11/15] Update content/docs/use-cases/index.md

---
 content/docs/use-cases/index.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/docs/use-cases/index.md b/content/docs/use-cases/index.md
index 3f876a8e56..7e4eb6d84c 100644
--- a/content/docs/use-cases/index.md
+++ b/content/docs/use-cases/index.md
@@ -2,7 +2,7 @@
 
 We provide short articles on common ML workflow and data science use cases that
 DVC can help with or improve. Our use cases are not written to be run end-to-end
-like tutorials. For more general, hands-on experience with DVC, please see our
+like tutorials. For more general, hands-on experience with DVC, please see
 [Get Started](/doc/tutorials/get-started) instead.
 
 ## Why DVC?

From dd880e499a196e08efeec912d617c2c67e245b8a Mon Sep 17 00:00:00 2001
From: Jeremy DesRoches <18587991+jeremydesroches@users.noreply.github.com>
Date: Sat, 3 Oct 2020 14:50:27 +0300
Subject: [PATCH 12/15] Reverts to the original 1st sentence, adds 2nd sentence to paragraph 2

---
 content/docs/user-guide/what-is-dvc.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/content/docs/user-guide/what-is-dvc.md b/content/docs/user-guide/what-is-dvc.md
index 2dbe9294f2..ab7e2c2753 100644
--- a/content/docs/user-guide/what-is-dvc.md
+++ b/content/docs/user-guide/what-is-dvc.md
@@ -8,9 +8,9 @@ of new [features](#core-features) while reusing existing skills and intuition.
 
 ![](/img/reproducibility.png) _DVC codifies data and ML experiments_
 
-The same way that software engineers use Git, data scientists can use DVC to
-apply a regular flow to project sharing and collaboration (commits, branching,
-pull requests, etc.). Using Git and DVC, data science and machine learning teams
+Data science experiment sharing and collaboration can be done through a regular
+Git flow (commits, branching, pull requests, etc.), the same way it works for
+software engineers. Using Git and DVC, data science and machine learning teams
 can version experiments, manage large datasets, and make projects reproducible.
 
 ## Core Features

From 7605dadf0068d4fa9d8cbed601496068f22c8f98 Mon Sep 17 00:00:00 2001
From: Jeremy DesRoches <18587991+jeremydesroches@users.noreply.github.com>
Date: Sat, 3 Oct 2020 15:13:15 +0300
Subject: [PATCH 13/15] Adds "best practices" to into paragraph, "tools" to final bullet

---
 content/docs/use-cases/index.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/content/docs/use-cases/index.md b/content/docs/use-cases/index.md
index 7e4eb6d84c..bc316c479b 100644
--- a/content/docs/use-cases/index.md
+++ b/content/docs/use-cases/index.md
@@ -9,8 +9,8 @@ like tutorials. For more general, hands-on experience with DVC, please see
 
 Even with all the success we've seen today in machine learning (ML), especially
 with deep learning and its applications in business, data scientists still lack
-good tools for organizing their projects and collaborating effectively. This is
-a critical challenge: while ML algorithms and methods are no longer tribal
+best practices for organizing their projects and collaborating effectively. This
+is a critical challenge: while ML algorithms and methods are no longer tribal
 knowledge, they are still difficult to implement, reuse, and manage.
 
 ## Basic uses of DVC
@@ -21,9 +21,9 @@ learning models, and you want to
 - capture and save data artifacts the same way you capture code;
 - track, control, and switch between different versions of data or models
   easily;
-- understand how machine learning models were built in the first place;
-- compare ML models and metrics to each other;
-- bring software engineering best practices to your data science team
+- understand how data or ML models were built in the first place;
+- compare machine learning models and metrics to each other;
+- bring software engineering best practices and tools to your data science team
 
 DVC is for you!
 

From d9e82289ff67086729ffe839e213153696a81cee Mon Sep 17 00:00:00 2001
From: jeremydesroches <18587991+jeremydesroches@users.noreply.github.com>
Date: Tue, 6 Oct 2020 10:42:48 +0300
Subject: [PATCH 14/15] Switch reference to 'for versioning' from 'the version of' for clarity

Co-authored-by: Jorge Orpinel
---
 content/docs/use-cases/versioning-data-and-model-files/index.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/docs/use-cases/versioning-data-and-model-files/index.md b/content/docs/use-cases/versioning-data-and-model-files/index.md
index 068d1bad31..448bdeab55 100644
--- a/content/docs/use-cases/versioning-data-and-model-files/index.md
+++ b/content/docs/use-cases/versioning-data-and-model-files/index.md
@@ -30,7 +30,7 @@ on-premises storage (e.g. SSH, NAS) as well as any major cloud storage provider
 ## DVC is not Git!
 
 DVC metafiles such as `dvc.yaml` and `.dvc` files serve as placeholders to track
-the version of data files and directories (among other purposes). They point to
+data files and directories for versioning (among other purposes). They point to
 specific data contents in the cache, providing the ability to store
 multiple data versions out-of-the-box.
 

From 450c87ff6bb53bbb5a883fd617eb5a0e170296b2 Mon Sep 17 00:00:00 2001
From: Jeremy DesRoches <18587991+jeremydesroches@users.noreply.github.com>
Date: Tue, 6 Oct 2020 11:00:47 +0300
Subject: [PATCH 15/15] Add plural 'datasets' to intro and style fixes in Second model version.

---
 .../use-cases/versioning-data-and-model-files/tutorial.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/content/docs/use-cases/versioning-data-and-model-files/tutorial.md b/content/docs/use-cases/versioning-data-and-model-files/tutorial.md
index 206cc48352..20d6766ca3 100644
--- a/content/docs/use-cases/versioning-data-and-model-files/tutorial.md
+++ b/content/docs/use-cases/versioning-data-and-model-files/tutorial.md
@@ -1,7 +1,7 @@
 # Tutorial: Data & Model Versioning
 
 The goal of this example is to give you some hands-on experience with a basic
-machine learning version control scenario: managing multiple dataset and ML
+machine learning version control scenario: managing multiple datasets and ML
 model versions using DVC commands. We'll work with a
 [tutorial](https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html)
 that [François Chollet](https://twitter.com/fchollet) put together to show how
@@ -237,9 +237,9 @@ $ git commit -m "Second model, trained with 2000 images"
 $ git tag -a "v2.0" -m "model v2.0, 2000 images"
 ```
 
-That's it! We've tracked the second version of the dataset, model, and metrics
-in DVC and committed the DVC-files that point to them with Git. Now let's look
-at how DVC can help us go back to the previous version if we need to.
+That's it! We've tracked a second version of the dataset, model, and metrics in
+DVC and committed the DVC-files that point to them with Git. Let's now look at
+how DVC can help us go back to the previous version if we need to.
 
 ## Switching between workspace versions
 