cases: Versioning Data and Models (rewrite) #1747

Merged: 119 commits, Dec 6, 2020
39d4400
Merge branch 'master' into use-cases
jorgeorpinel Sep 1, 2020
87264eb
cases: [WIP] befin rewriting Versioning:
jorgeorpinel Sep 1, 2020
8bdae1d
cases: give some sense of why versioning data and models is important
jorgeorpinel Sep 2, 2020
1679c59
guide: why DVC is the way to Version data (sell philosophy)
jorgeorpinel Sep 2, 2020
a794600
Merge branch 'master' into use-cases-versng
jorgeorpinel Sep 2, 2020
ab35693
cases: add example section explaining why data versionig is
jorgeorpinel Sep 3, 2020
5ed0d75
cases: wrap up Versioning full draft
jorgeorpinel Sep 3, 2020
c291558
cases: rename demo section in Versioning, roll back checkout img, et al.
jorgeorpinel Sep 3, 2020
a3008c8
Merge branch 'master' into use-cases-versng
jorgeorpinel Sep 15, 2020
ca062b2
Merge branch 'master' into use-cases-versng
jorgeorpinel Sep 16, 2020
7a4cec6
Merge branch 'master' into use-cases-versng
jorgeorpinel Sep 16, 2020
9e564e3
cases: some more versioning updates
jorgeorpinel Sep 16, 2020
9d5dca1
cases: shorten versioning intro
jorgeorpinel Sep 16, 2020
c188146
cases: add bullet list of Versioning advantages
jorgeorpinel Sep 17, 2020
1f4f2f0
cases: shorten Why DVC section in Versioning
jorgeorpinel Sep 17, 2020
c554b67
Merge branch 'master' into use-cases-versng
jorgeorpinel Sep 30, 2020
de07dfa
term: data modeling -> data engineering
jorgeorpinel Sep 30, 2020
daa99a0
cases: make advantages section in Data Registry (consistency)
jorgeorpinel Sep 30, 2020
02568e7
cases: make separate Versioned storage section
jorgeorpinel Sep 30, 2020
79d9b0b
cases: rewrite intro and other changes to Versioning
jorgeorpinel Sep 30, 2020
8ddc976
cases: cover gap between Versioning and (remote) storage, link to GS
jorgeorpinel Sep 30, 2020
ae8e7ad
Merge branch 'master' into use-cases-versng
jorgeorpinel Oct 8, 2020
09d1eab
use-cases: reapply SEO keyword changes from #1806
jorgeorpinel Oct 8, 2020
7ea83bf
cases: make p about storage less overlapping to previous one
jorgeorpinel Oct 8, 2020
e90d332
cases: add paragraph about versioning advantages before DVC's motivation
jorgeorpinel Oct 9, 2020
5f47377
cases: simplify lists of advantages in Versioning (and Data Reg)
jorgeorpinel Oct 9, 2020
f36d10b
cases: limitation->constraint (to avoid a redundancy)
jorgeorpinel Oct 9, 2020
b9ea7ec
guide: move DVC is not Git! from use cases to What is DVC?
jorgeorpinel Oct 9, 2020
3fdddf2
cases: ~~Summary of~~ Advantages (H2)
jorgeorpinel Oct 9, 2020
ce647f5
cases: rewrite parts of the DVC motivation paragraphs in Versioning
jorgeorpinel Oct 9, 2020
bc7018b
Merge branch 'master' into use-cases-versng
jorgeorpinel Oct 12, 2020
62a34db
cases: improve vrsng intro and dedupe bullet lists
jorgeorpinel Oct 13, 2020
81d848b
cases: rename Advantages sectino of vrsng
jorgeorpinel Oct 13, 2020
1898ccf
cases: expand on How it looks (vrsng) with focus on workspace
jorgeorpinel Oct 13, 2020
c6d969c
Merge branch 'master' into use-cases-versng
jorgeorpinel Oct 14, 2020
f447b14
guide: improve DVC is not Git! section
jorgeorpinel Oct 14, 2020
384218d
cases: rename Versioning use case (why "Files"?)
jorgeorpinel Oct 14, 2020
7facdf7
cases: rewrite (again) the intro to vrsng
jorgeorpinel Oct 15, 2020
8f3cb70
Merge branch 'master' into use-cases-versng
jorgeorpinel Oct 18, 2020
c0b6b58
cases: improve versioning intro (more coherent)
jorgeorpinel Oct 18, 2020
17f80b8
cmd: quick term update
jorgeorpinel Oct 20, 2020
018c3f3
Merge branch 'master' into use-cases-versng
jorgeorpinel Oct 21, 2020
b63bc7f
cases: update links to Versioning use case
jorgeorpinel Oct 21, 2020
0ec7350
cases: refine Versioning intro, add proposed figure
jorgeorpinel Oct 21, 2020
68619f1
cases: summarize, simplify, focus on the essence, et al.
jorgeorpinel Oct 21, 2020
12bc7ed
cases: add redirect for new Versioning use case location
jorgeorpinel Oct 21, 2020
78426d1
cases: merge How it looks + Version control sections
jorgeorpinel Oct 22, 2020
bc7ff0a
cases: simplify versioning-data-and-models#how-it-looks
jorgeorpinel Oct 22, 2020
53b65c2
Merge branch 'master' into use-cases-versng
jorgeorpinel Oct 22, 2020
579cc5b
Revert "redirect for new Versioning use case URL" 12bc7ed and
jorgeorpinel Oct 23, 2020
fb7265c
cases: rewrite intro to improve motivation and
jorgeorpinel Oct 24, 2020
c9d0444
cases: update Why DVC and benefits list
jorgeorpinel Oct 24, 2020
f49cce6
cases: actually revert URL change from recent commit
jorgeorpinel Oct 24, 2020
5b95d36
cases: more updates to the benefits bullets in Versioning
jorgeorpinel Oct 24, 2020
aeb860e
cases: rewrite How it looks (& feels) section
jorgeorpinel Oct 25, 2020
2c7e2ea
cases: remove non-essential info. from How it looks section of Versio…
jorgeorpinel Oct 25, 2020
da9390a
Merge branch 'master' into use-cases-versng
jorgeorpinel Oct 27, 2020
30ad4e7
cases: simplify How it looks per David and some of Ivan's feedback
jorgeorpinel Oct 27, 2020
c5e34ce
cases: remove H2s temporarily, simplify benefits bullet list, et al.
jorgeorpinel Oct 27, 2020
04e42cb
cses: rewrite benefit bullets and simplify how it feels section
jorgeorpinel Oct 27, 2020
531071a
cases: make bullet list into paragraph temp.
jorgeorpinel Oct 27, 2020
40f09df
cases: wrap up Vrsng? (text)
jorgeorpinel Oct 28, 2020
8fcd2e6
cases: hardcode colums in How it feels section of Vrsng
jorgeorpinel Oct 28, 2020
669f9f5
cache: simplify it's structure explanation and add CAS term (from Vrs…
jorgeorpinel Oct 28, 2020
67c6beb
guide: revert changes to this section for now
jorgeorpinel Oct 28, 2020
4329b60
cases: polish latest iteration of Versioning use case
jorgeorpinel Oct 28, 2020
8ca6ef1
Merge branch 'master' into use-cases-versng
jorgeorpinel Oct 29, 2020
8b81ac3
Merge branch 'master' into use-cases-versng
jorgeorpinel Oct 31, 2020
66b0829
cases: next iteration of Versioning page
jorgeorpinel Oct 31, 2020
49adf55
Merge branch 'master' into use-cases-versng
jorgeorpinel Nov 3, 2020
aa6c43e
cases: polishing my last iteration of the Vsng page
jorgeorpinel Nov 3, 2020
3c61ea7
remove a bunch of info from Vrsng to simplify again
jorgeorpinel Nov 4, 2020
b74e687
cases: minor iteration of Vrsng, pending benefits list
jorgeorpinel Nov 5, 2020
f02c1a7
guide: updates to What is DVC
jorgeorpinel Nov 5, 2020
63970bc
cmd: roll-back unrelated changes (stashed elsewhere for now)
jorgeorpinel Nov 5, 2020
e6ce632
cases: work on benefits of Vrsng
jorgeorpinel Nov 5, 2020
88ff11a
cases: more work on benefits of Vrsng
jorgeorpinel Nov 5, 2020
aeacb1f
cases: remove emojis; improve benefits list; add refs to other cases
jorgeorpinel Nov 10, 2020
3956590
cses: clarify about cache and about metafiles in Versioning
jorgeorpinel Nov 10, 2020
eeccb68
cases: simplify p about roll back/fwds; split benefit about data regs
jorgeorpinel Nov 10, 2020
7c22613
Merge branch 'master' into use-cases-versng
jorgeorpinel Nov 11, 2020
00b88e1
cases: change BEFORE to be similar to the top fig.
jorgeorpinel Nov 11, 2020
4436600
Merge branch 'master' into use-cases-versng
jorgeorpinel Nov 13, 2020
702d619
cases: another iteration of Versioning
jorgeorpinel Nov 13, 2020
00dc2d6
cases: simplify Versioning again
jorgeorpinel Nov 13, 2020
5edf502
cases: improvements on Vrsng per direct feedback
jorgeorpinel Nov 13, 2020
21fce9b
cases: more updates to latest text and figures
jorgeorpinel Nov 13, 2020
c0142c2
Merge branch 'master' into use-cases-versng
jorgeorpinel Nov 14, 2020
6cab1f3
cases: rephrase Vrsng benefits list
jorgeorpinel Nov 14, 2020
24a00cd
Merge branch 'master' into use-cases-versng
jorgeorpinel Nov 18, 2020
b1d75c7
cases: revert to previous draft fig
jorgeorpinel Nov 18, 2020
8307583
cases: update 2nd figure draft, and reorder codification p
jorgeorpinel Nov 18, 2020
79a071a
cases: rework Vrsng benefits and
jorgeorpinel Nov 18, 2020
953c16b
cases: draft What's Next section added with advanced scenarios for Vrsng
jorgeorpinel Nov 18, 2020
52ea945
cases: simplify 2nd figure
jorgeorpinel Nov 19, 2020
2b0d183
Merge branch 'master' into use-cases-versng
jorgeorpinel Nov 22, 2020
98b3135
cases: make first Vrsg figure shorter
jorgeorpinel Nov 22, 2020
87b8333
cases: merge advanced scenarios with benefits list
jorgeorpinel Nov 23, 2020
59fc3f9
cases: roll back changes to Data Regs
jorgeorpinel Nov 23, 2020
7677d25
cases: improvements per Dmitry's feedback...
jorgeorpinel Nov 23, 2020
a6f3352
cases: train_feats > features in figures for Vrsng
jorgeorpinel Nov 23, 2020
9644a38
cases: rename Vrng Tutorial label in nav (use emoji)
jorgeorpinel Nov 23, 2020
0e47977
cases: explain simple file naming a bit more
jorgeorpinel Nov 23, 2020
8bd9d96
cases: Vrng copy edits
jorgeorpinel Nov 23, 2020
1686639
cases: add efficient data mgmt benefit
jorgeorpinel Nov 24, 2020
4ab5ed4
cases: reorder Vrsg benefits list
jorgeorpinel Nov 24, 2020
1b48db2
cases: rewrite file naming and data mgmt benefits of Vrsg
jorgeorpinel Nov 24, 2020
0fbf200
cases: expand story to cover storage and data management
jorgeorpinel Nov 24, 2020
c7617ef
cases: generalized Vrsg benefits
jorgeorpinel Nov 24, 2020
aa62d9d
Merge branch 'master' into use-cases-versng
jorgeorpinel Nov 25, 2020
c14a271
cases: separate data mgmt from versioning (through codification) in Vrsg
jorgeorpinel Nov 25, 2020
020fb3f
Make note about other guides, refs, and tutorial (Vrng)
jorgeorpinel Nov 25, 2020
1a3a58c
cases: emphasize Simplicity benefit of Vrng is the opposite of "compl…
jorgeorpinel Nov 25, 2020
f6598bd
cases: another rewrite of text and benefits
jorgeorpinel Nov 27, 2020
98d97c7
Merge branch 'master' into use-cases-versng
jorgeorpinel Nov 27, 2020
6a75449
cases: copy edits to latest Vrng iteration, and append next steps par…
jorgeorpinel Nov 27, 2020
d6697dc
Merge branch 'master' into use-cases-versng
jorgeorpinel Dec 4, 2020
2ccd32e
cases: another iteration of Versioning use case
jorgeorpinel Dec 4, 2020
0d093b8
cases: clarify data mgmt is for data in Vrng benefits
jorgeorpinel Dec 6, 2020
9 changes: 7 additions & 2 deletions content/docs/sidebar.json
@@ -65,10 +65,15 @@
"source": "use-cases/index.md",
"children": [
{
"label": "Versioning Data & Model Files",
"label": "Versioning Data and Models",
"slug": "versioning-data-and-model-files",
"source": "versioning-data-and-model-files/index.md",
"children": ["tutorial"]
"children": [
{
"label": "Tutorial 👩‍💻",
"slug": "tutorial"
}
]
},
{
"label": "Sharing Data and Model Files",
2 changes: 1 addition & 1 deletion content/docs/start/data-versioning.md
@@ -85,7 +85,7 @@ outs:

> \* See
> [Large Dataset Optimization](/doc/user-guide/large-dataset-optimization) and
> `dvc config cache` for more information on file linking.
> `dvc config cache` for more info. on file linking.

</details>

4 changes: 2 additions & 2 deletions content/docs/use-cases/index.md
@@ -18,8 +18,8 @@ knowledge, they are still difficult to implement, reuse, and manage.
If you store and process data files or datasets to produce other data or machine
learning models, and you want to

- track and save data and ML models the same way you capture code;
- create and switch among different
- track and save data and machine learning models the same way you capture code;
- create and switch between
[versions of data and ML models](/doc/use-cases/versioning-data-and-model-files)
easily;
- understand how datasets and ML artifacts were built in the first place;
218 changes: 88 additions & 130 deletions content/docs/use-cases/versioning-data-and-model-files/index.md
@@ -1,130 +1,88 @@
# Versioning Data and Model Files

DVC enables versioning large files and directories such as datasets, data
science features, and machine learning models using Git, but without storing the
contents in Git.

This is achieved by saving information about the data in special
[metafiles](/doc/user-guide/dvc-files-and-directories) that replace the data in
the repository. These can be versioned with regular Git workflows (branches,
pull requests, etc.)

To actually store the data, DVC uses a built-in <abbr>cache</abbr>, and supports
synchronizing it with various types of
[remote storage](/doc/command-reference/remote). This allows for easy data and
model versioning, storage, and sharing — right alongside code.

![](/img/model-versioning-diagram.png) _Code and data flows in DVC_

In this basic use case, DVC is a better alternative to
[Git-LFS / Git-annex](/doc/user-guide/related-technologies) and to ad-hoc
scripts used to manage ML <abbr>artifacts</abbr> (training data, models, etc.)
on cloud storage. DVC doesn't require special services, and works with
on-premises storage (e.g. SSH, NAS) as well as any major cloud storage provider
(Amazon S3, Microsoft Azure, Google Drive,
[among others](/doc/command-reference/remote/add#supported-storage-types)).

> For hands-on experience, we recommend following the
> [versioning tutorial](/doc/use-cases/versioning-data-and-model-files).

## DVC is not Git!

DVC metafiles such as `dvc.yaml` and `.dvc` files serve as placeholders to track
data files and directories for versioning (among other purposes). They point to
specific data contents in the <abbr>cache</abbr>, providing the ability to store
multiple data versions out-of-the-box.

Full-fledged
[version control](https://git-scm.com/book/en/v2/Getting-Started-About-Version-Control)
is left for Git and its hosting platforms (e.g. GitHub, GitLab) to handle. These
are designed for source code management (SCM) however, and thus ill-equipped to
support data science needs. That's where DVC comes in: with its built-in data
<abbr>cache</abbr>, reproducible [pipelines](/doc/start/data-pipelines), among
several other novel features (see [Get Started](/doc/start/) for a primer.)

## Track data and models for versioning

Let's say you have an empty <abbr>DVC repository</abbr> and put a dataset of
images in the `images/` directory. You can start tracking it with `dvc add`.
This generates a `.dvc` file, which can be committed to Git in order to save the
project's version:

```dvc
$ ls images/
0001.jpg 0002.jpg 0003.jpg 0004.jpg ...

$ dvc add images/

$ git add images.dvc .gitignore
$ git commit -m "Track images dataset with DVC."
```

DVC also lets you define the processes that build artifacts based on tracked
data, such as an ML model, by writing a simple `dvc.yaml` file that connects the
pieces together:

> `dvc.yaml` files can be written manually or generated with `dvc run`.

```yaml
stages:
  train:
    cmd: python train.py images/
    deps:
      - images
    outs:
      - model.pkl
```

> See [Data Pipelines](/doc/start/data-pipelines) for a comprehensive intro to
> this feature.

`dvc repro` can now execute the `train` stage for you. DVC will track all of its
outputs (`outs`) automatically. Let's do that, and commit this project version:

```dvc
$ dvc repro
Running stage 'train' with command:
python train.py images/
Updating lock file 'dvc.lock'
...

$ git add dvc.yaml dvc.lock .gitignore
$ git commit -m "Train model via DVC."
$ git tag -a "v1.0" -m "First model" # We'll use this soon ;)
```

> See also `dvc.lock`.

## Switching versions

After iterating on this process and producing several versions, you can combine
`git checkout` and `dvc checkout` to perform full or partial
<abbr>workspace</abbr> restorations.

![](/img/versioning.png) _Code and data checkout_

> Note that `dvc install` enables auto-checkouts of data after `git checkout`.

A full checkout brings the whole <abbr>project</abbr> back to a previous version
— code, dataset and model files all match each other:

```dvc
$ git checkout v1.0
$ dvc checkout
M images
M model.pkl
```

However, we can checkout certain parts only, for example if we want to keep the
latest source code and model versions, but rewind to the previous version of the
dataset:

```dvc
$ git checkout v1.0 images.dvc
$ dvc checkout images.dvc
M images
```

DVC [optimizes](/doc/user-guide/large-dataset-optimization) this operation by
avoiding copying files each time, so checking out data is quick even if you are
versioning large data files.
# Versioning Data and Models

Data science teams face data management questions around versions of data and
machine learning models. How do we keep track of changes in data, source code,
and ML models together? What's the best way to organize and store variations of
these files and directories?

![](/img/data-ver-complex.png) _Exponential complexity of data science projects_

Another problem in the field has to do with bookkeeping: being able to identify
past data inputs and processes to understand their results, for knowledge
sharing, or for debugging.

**Data Version Control** (DVC) lets you capture the versions of your data and
models in
[Git commits](https://git-scm.com/book/en/v2/Git-Basics-Recording-Changes-to-the-Repository),
while storing them on-premises or in cloud storage. It also provides a mechanism
to switch between these different data contents. The result is a single history
for data, code, and ML models that you can traverse — a proper journal of your
work!

![](/img/project-versions.png) _DVC matches the right versions of data, code,
and models for you 💘._

DVC enables data _versioning through codification_. You write simple
[metafiles](/doc/user-guide/dvc-files-and-directories) once, describing what
datasets, ML artifacts, etc. to track. This metadata can be put in Git in lieu
of large files. Now you can use DVC to create
[snapshots](/doc/command-reference/add) of the data,
[restore](/doc/command-reference/checkout) previous versions,
[reproduce](/doc/command-reference/repro) experiments, record evolving
[metrics](/doc/command-reference/metrics), and more!
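
To make the idea of codification concrete, here is a minimal sketch that uses only Git and a hand-written pointer file. It is not DVC itself: the file name `data.csv.meta`, its `hash:` field, and the `snapshot` helper are hypothetical stand-ins for a real metafile and for the `dvc add` plus `git commit` steps. It only illustrates how a small text file in Git can stand in for a large file kept elsewhere.

```shell
#!/bin/sh
# Sketch: Git versions a tiny pointer file while the data stays out of Git.
# The file name and "hash:" field are simplified assumptions, not DVC's format.
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q .

# Hypothetical helper: commit the current pointer file and tag the commit.
snapshot() {
    git add data.csv.meta
    git -c user.name=demo -c user.email=demo@example.com commit -q -m "$1"
    git tag "$2"
}

echo 'hash: aaaa1111' > data.csv.meta   # stands for version 1 of a large file
snapshot "Track dataset v1" v1

echo 'hash: bbbb2222' > data.csv.meta   # the data changed, so the pointer changes
snapshot "Update dataset to v2" v2

# Restore only the pointer from the earlier version; a data tool could then
# use it to bring back the matching contents from its own store.
git checkout -q v1 -- data.csv.meta
restored=$(cat data.csv.meta)
echo "restored pointer: $restored"
```

Because only the pointer lives in Git, the repository stays small no matter how large the referenced data is.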

👩‍💻 **Intrigued?** Try our
[versioning tutorial](/doc/use-cases/versioning-data-and-model-files/tutorial)
to learn how DVC looks and feels firsthand.

As you use DVC, unique versions of your data files and directories are
[cached](/doc/user-guide/dvc-files-and-directories#structure-of-the-cache-directory) in a
systematic way (preventing file duplication). The working datastore is separated
from your <abbr>workspace</abbr> to keep the project light, but stays connected
via file
[links](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache)
handled automatically by DVC.
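
The caching idea can be sketched with a toy content-addressed store. This is an illustration under stated assumptions, not DVC's implementation: the `cache_add` helper, the `cache/` and `workspace/` directory names, and the use of hard links are all hypothetical choices, and the script assumes GNU `md5sum` is available.

```shell
#!/bin/sh
# Toy content-addressed cache; NOT DVC's actual implementation.
set -eu

demo=$(mktemp -d)
cd "$demo"
mkdir -p cache workspace

# Hypothetical helper: store a file in the cache under its MD5 hash, then
# replace the workspace copy with a hard link to the cached file.
cache_add() {
    file=$1
    hash=$(md5sum "$file" | cut -d' ' -f1)
    prefix=$(printf '%s' "$hash" | cut -c1-2)
    rest=$(printf '%s' "$hash" | cut -c3-)
    mkdir -p "cache/$prefix"
    # Identical contents map to the same path, so they are stored only once.
    [ -e "cache/$prefix/$rest" ] || cp "$file" "cache/$prefix/$rest"
    ln -f "cache/$prefix/$rest" "$file"
    printf '%s\n' "$hash"
}

printf 'label,value\ncat,1\n' > workspace/train.csv
cp workspace/train.csv workspace/train_copy.csv

hash_a=$(cache_add workspace/train.csv)
hash_b=$(cache_add workspace/train_copy.csv)

cached_files=$(find cache -type f | wc -l | tr -d ' ')
echo "hashes match: $hash_a = $hash_b"
echo "files in cache: $cached_files"
```

Two workspace files with the same contents end up as links to a single cached file, which is the deduplication effect described above.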

Benefits of our approach include:

- **Lightweight**: DVC is a
[free](https://github.com/iterative/dvc/blob/master/LICENSE), open-source
[command line](/doc/command-reference) tool that doesn't require databases,
servers, or any other special services.

- **Consistency**: Keep your projects readable with stable file names — they
don't need to change because they represent variable data. No need for
complicated paths like `data/20190922/labels_v7_final` or for constantly
editing these in source code.

- **Efficient data management**: Use a familiar and cost-effective storage
solution of your choice (e.g. SFTP, S3, HDFS,
[etc.](/doc/command-reference/remote/add#supported-storage-types)), free from
Git hosting
[constraints](https://docs.github.com/en/free-pro-team@latest/github/managing-large-files/what-is-my-disk-quota).
Storing and transferring are
[optimized](/doc/user-guide/large-dataset-optimization) by DVC.

Review comment (Member): A final ask in private to update this a bit

- **Collaboration**: Easily distribute your project development and share its
data [internally](/doc/use-cases/shared-development-server) and
[remotely](/doc/use-cases/sharing-data-and-model-files), or
[reuse](/doc/start/data-access) it in other places.

- **Data compliance**: Review data modification attempts as Git
[pull requests](https://www.dummies.com/web-design-development/what-are-github-pull-requests/).
Audit the project's immutable history to learn when datasets or models were
approved, and why.

- **GitOps**: Connect your data science projects with the Git-powered universe.
Git workflows open the door to advanced tools such as continuous integration
(like [CML](https://cml.dev/) CI/CD), specialized patterns such as
[data registries](/doc/use-cases/data-registries), and other best practices.

In summary, data science and ML are iterative processes where the lifecycles of
data, models, and code happen at different paces. DVC helps you manage and
enforce them.

And this is just the beginning. DVC supports multiple advanced features
out-of-the-box: build, run, and version
[data pipelines](/doc/command-reference/dag),
[manage experiments](/doc/start/experiments) effectively, and more.
14 changes: 14 additions & 0 deletions content/docs/user-guide/what-is-dvc.md
@@ -47,3 +47,17 @@ can version experiments, manage large datasets, and make projects reproducible.

> Git servers, as well as SSH and cloud storage providers are supported,
> however.

## DVC does not replace Git!

DVC metafiles such as `dvc.yaml` and `.dvc` files serve as placeholders to track
large data files and directories for versioning (among other
[purposes](/doc/user-guide/dvc-files-and-directories)). These metafiles change
along with your data, and you can use Git to place them under
[version control](https://git-scm.com/book/en/v2/Getting-Started-About-Version-Control)
as a proxy to the actual data versions, which are stored in the <abbr>DVC
cache</abbr> (outside of Git). This does not replace features of Git.

DVC does, however, provide several commands similar to Git such as `dvc init`,
`dvc add`, `dvc checkout`, or `dvc push`, which interact with the underlying Git
repo (if one is being used, which is not required).
Binary file added static/img/data-ver-complex.png
Binary file added static/img/project-versions.png
Binary file removed static/img/versioning.png
Binary file not shown.