docs: add key terms to use case intros/tutorial and what is dvc? docs [SEO] #1806

Merged
merged 16 commits into from
Oct 8, 2020
26 changes: 13 additions & 13 deletions content/docs/use-cases/index.md
@@ -1,29 +1,29 @@
# Use Cases

-We provide short articles on common ML workflow or data management scenarios
-that DVC can help with or improve. Our use cases are not written to be run
-end-to-end like tutorials. For more general, hands-on experience with DVC,
-please see our [Get Started](/doc/tutorials/get-started) instead.
+We provide short articles on common ML workflow and data science use cases that

This comment was marked as resolved.

Contributor Author

Yes there is an SEO motivation here: the search term is "data science use cases".

Contributor

I see! Going fwd if you can make some notes in the PR file changes on terms each change is for, or a list of terms in the PR description at least, that would be helpful for reviews 😃

Contributor Author

👍 Definitely. That makes a lot of sense and I'll do that.

Contributor

Does it matter that probably users looking for "data science use cases" are not looking for DVC use cases? I don't want to assume what 1000s of people want, but it sounds like a basic data science question rather than anything to do with structuring DS projects (e.g. using DVC).

So maybe changes like this will bring more traffic but also up the bounce rate. We'll have to try and see, I guess!

Contributor Author

@jeremydesroches, Oct 3, 2020

> Does it matter that probably users looking for "data science use cases" are not looking for DVC use cases?

It's true that the term is not a perfect match, but it is related to the primary subject area (data science). Most non-brand terms are going to be partially related but inexact, as searches for discovery are imprecise (because they don't know what DVC is yet).

The search engine is trying to fill in the gaps, so we want to expand on terms that are showing interest within the correct subject area in order to meet them halfway. This article already has some impressions for "use cases", including ML and data science so that's the motivation for this change.

Contributor

@jorgeorpinel, Oct 5, 2020

OK, cool! Keeping unresolved for future reference.

+DVC can help with or improve. Our use cases are not written to be run end-to-end
+like tutorials. For more general, hands-on experience with DVC, please see
+[Get Started](/doc/tutorials/get-started) instead.

## Why DVC?

Even with all the success we've seen today in machine learning (ML), especially
-with deep learning and its applications in business, the data science community
-still lacks good practices for organizing their projects and collaborating
-effectively. This is a critical challenge: while ML algorithms and methods are
-no longer tribal knowledge, they are still difficult to implement, reuse, and
-manage.
+with deep learning and its applications in business, data scientists still lack
+best practices for organizing their projects and collaborating effectively. This
+is a critical challenge: while ML algorithms and methods are no longer tribal
+knowledge, they are still difficult to implement, reuse, and manage.

## Basic uses of DVC

If you store and process data files or datasets to produce other data or machine
learning models, and you want to

 - capture and save <abbr>data artifacts</abbr> the same way you capture code;
-- track and switch between different versions of data or models easily;
-- understand how data or models were built in the first place;
-- be able to compare models and metrics to each other;
-- bring software engineering best practices to your data science team
+- track, control, and switch between different versions of data or models
+easily;
+- understand how data or ML models were built in the first place;
+- compare machine learning models and metrics to each other;
+- bring software engineering best practices and tools to your data science team

DVC is for you!

19 changes: 10 additions & 9 deletions content/docs/use-cases/versioning-data-and-model-files/index.md
@@ -11,8 +11,8 @@ pull requests, etc.)

To actually store the data, DVC uses a built-in <abbr>cache</abbr>, and supports
synchronizing it with various types of
-[remote storage](/doc/command-reference/remote). This allows storing and sharing
-data easily, and alongside code.
+[remote storage](/doc/command-reference/remote). This allows for easy data and
+model versioning, storage, and sharing — right alongside code.

![](/img/model-versioning-diagram.png) _Code and data flows in DVC_

@@ -30,9 +30,9 @@ on-premises storage (e.g. SSH, NAS) as well as any major cloud storage provider
## DVC is not Git!

DVC metafiles such as `dvc.yaml` and `.dvc` files serve as placeholders to track
-data files and directories (among other purposes). They point to specific data
-contents in the <abbr>cache</abbr>, providing the ability to store multiple data
-versions out-of-the-box.
+data files and directories for versioning (among other purposes). They point to
+specific data contents in the <abbr>cache</abbr>, providing the ability to store
+multiple data versions out-of-the-box.

Full-fledged
[version control](https://git-scm.com/book/en/v2/Getting-Started-About-Version-Control)
@@ -46,7 +46,7 @@ several other novel features (see [Get Started](/doc/start/) for a primer.)

Let's say you have an empty <abbr>DVC repository</abbr> and put a dataset of
images in the `images/` directory. You can start tracking it with `dvc add`.
-This generate a `.dvc` file, which can be committed to Git in order to save the
+This generates a `.dvc` file, which can be committed to Git in order to save the
project's version:

```dvc
@@ -116,7 +116,8 @@ M model.pkl
```

However, we can checkout certain parts only, for example if we want to keep the
-latest source code and model but rewind to the previous dataset only:
+latest source code and model versions, but rewind to the previous version of the
+dataset:
jorgeorpinel marked this conversation as resolved.

```dvc
$ git checkout v1.0 images.dvc
@@ -125,5 +126,5 @@ M images
```

DVC [optimizes](/doc/user-guide/large-dataset-optimization) this operation by
-avoiding copying files each time, so checking out data is quick even if you have
-large data files.
+avoiding copying files each time, so checking out data is quick even if you are
+versioning large data files.
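
The paragraph above says that checking out data stays quick because files are not copied each time. A minimal sketch of that idea, assuming a checksum-keyed cache and hard links: the names `toy_cache`, `toy_add`, `toy_checkout`, and the `.ptr` files are hypothetical stand-ins for illustration, not DVC commands or DVC's actual file formats, and copying a pointer file stands in for `git checkout v1.0 images.dvc`.

```shell
#!/bin/sh
# Toy illustration only: a checksum-keyed cache plus small pointer files.
# toy_cache, toy_add, toy_checkout and *.ptr are hypothetical names,
# not part of DVC.
set -eu

cd "$(mktemp -d)"
mkdir toy_cache

# Register a file: link it into the cache under its checksum and
# record that checksum in a small pointer file next to it.
toy_add() {
    sum=$(cksum < "$1" | cut -d ' ' -f 1)
    [ -e "toy_cache/$sum" ] || ln "$1" "toy_cache/$sum"
    printf '%s\n' "$sum" > "$1.ptr"
}

# Restore the version named by the pointer file by linking, not copying.
toy_checkout() {
    sum=$(cat "$1.ptr")
    rm -f "$1"
    ln "toy_cache/$sum" "$1"
}

printf 'dataset v1\n' > data.txt
toy_add data.txt
cp data.txt.ptr v1.ptr          # stands in for committing the pointer to Git

rm -f data.txt                  # replace rather than edit: the link shares the cached bytes
printf 'dataset v2\n' > data.txt
toy_add data.txt

cp v1.ptr data.txt.ptr          # stands in for checking out the old pointer from Git
toy_checkout data.txt
cat data.txt                    # prints: dataset v1
```

Both versions remain in `toy_cache`, and switching between them only swaps a link, so the cost does not grow with file size. Only the small pointer file would need to live in Git.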
22 changes: 11 additions & 11 deletions content/docs/use-cases/versioning-data-and-model-files/tutorial.md
@@ -1,8 +1,8 @@
-# Tutorial: Versioning
+# Tutorial: Data & Model Versioning

The goal of this example is to give you some hands-on experience with a basic
-machine learning version control scenario: working with multiple versions of
-datasets and ML models using DVC commands. We'll work with a
+machine learning version control scenario: managing multiple datasets and ML
+model versions using DVC commands. We'll work with a
[tutorial](https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html)
that [François Chollet](https://twitter.com/fchollet) put together to show how
to build a powerful image classifier using a pretty small dataset.
@@ -237,9 +237,9 @@ $ git commit -m "Second model, trained with 2000 images"
$ git tag -a "v2.0" -m "model v2.0, 2000 images"
```

-That's it! We have tracked a second dataset, model, and metrics versioned DVC,
-and the DVC-files that point to them committed with Git. Let's now look at how
-DVC can help us go back to the previous version if we need to.
+That's it! We've tracked a second version of the dataset, model, and metrics in
+DVC and committed the DVC-files that point to them with Git. Let's now look at
+how DVC can help us go back to the previous version if we need to.

## Switching between workspace versions

@@ -338,15 +338,15 @@ changed. For example, when we added new images to built the second version of
our model, that was a dependency change. It also updates outputs and puts them
into the <abbr>cache</abbr>.

-To make things a little simpler: if `dvc add` and `dvc checkout` provide a basic
-mechanism to version control large data files or models, `dvc run` and
-`dvc repro` provide a build system for ML models, which is similar to
+To make things a little simpler: `dvc add` and `dvc checkout` provide a basic
+mechanism for model and large dataset versioning. `dvc run` and `dvc repro`
+provide a build system for machine learning models, which is similar to
[Make](https://www.gnu.org/software/make/) in software build automation.

## What's next?

-In this example, our focus was on giving you hands-on experience with versioning
-ML models and datasets. We specifically looked at the `dvc add` and
+In this example, our focus was on giving you hands-on experience with dataset
+and ML model versioning. We specifically looked at the `dvc add` and
`dvc checkout` commands. We'd also like to outline some topics and ideas you
might be interested to try next to learn more about DVC and how it makes
managing ML projects simpler.
7 changes: 4 additions & 3 deletions content/docs/user-guide/what-is-dvc.md
@@ -1,6 +1,6 @@
# What Is DVC?

-**Data Version Control** is a new type of data versioning, workflow and
+**Data Version Control** is a new type of data versioning, workflow, and
experiment management software, that builds upon [Git](https://git-scm.com/)
(although it can work stand-alone). DVC reduces the gap between established
engineering tool sets and data science needs, allowing users to take advantage
@@ -10,7 +10,8 @@ of new [features](#core-features) while reusing existing skills and intuition.

Data science experiment sharing and collaboration can be done through a regular
Git flow (commits, branching, pull requests, etc.), the same way it works for
-software engineers.
+software engineers. Using Git and DVC, data science and machine learning teams
+can version experiments, manage large datasets, and make projects reproducible.

## Core Features

@@ -22,7 +23,7 @@ software engineers.
[versioning](/doc/use-cases/versioning-data-and-model-files) capabilities.

 - **Data versioning** is enabled by replacing large files, dataset directories,
-ML models, etc. with small
+machine learning models, etc. with small
[metafiles](/doc/user-guide/dvc-files-and-directories) (easy to handle with
Git). These placeholders point to the original data, which is decoupled from
source code management.