cases: Versioning Data and Models (rewrite) #1747

Merged
merged 119 commits into from
Dec 6, 2020
Changes from 5 commits
39d4400
Merge branch 'master' into use-cases
jorgeorpinel Sep 1, 2020
87264eb
cases: [WIP] befin rewriting Versioning:
jorgeorpinel Sep 1, 2020
8bdae1d
cases: give some sense of why versioning data and models is important
jorgeorpinel Sep 2, 2020
1679c59
guide: why DVC is the way to Version data (sell philosophy)
jorgeorpinel Sep 2, 2020
a794600
Merge branch 'master' into use-cases-versng
jorgeorpinel Sep 2, 2020
ab35693
cases: add example section explaining why data versionig is
jorgeorpinel Sep 3, 2020
5ed0d75
cases: wrap up Versioning full draft
jorgeorpinel Sep 3, 2020
c291558
cases: rename demo section in Versioning, roll back checkout img, et al.
jorgeorpinel Sep 3, 2020
a3008c8
Merge branch 'master' into use-cases-versng
jorgeorpinel Sep 15, 2020
ca062b2
Merge branch 'master' into use-cases-versng
jorgeorpinel Sep 16, 2020
7a4cec6
Merge branch 'master' into use-cases-versng
jorgeorpinel Sep 16, 2020
9e564e3
cases: some more versioning updates
jorgeorpinel Sep 16, 2020
9d5dca1
cases: shorten versioning intro
jorgeorpinel Sep 16, 2020
c188146
cases: add bullet list of Versioning advantages
jorgeorpinel Sep 17, 2020
1f4f2f0
cases: shorten Why DVC section in Versioning
jorgeorpinel Sep 17, 2020
c554b67
Merge branch 'master' into use-cases-versng
jorgeorpinel Sep 30, 2020
de07dfa
term: data modeling -> data engineering
jorgeorpinel Sep 30, 2020
daa99a0
cases: make advantages section in Data Registry (consistency)
jorgeorpinel Sep 30, 2020
02568e7
cases: make separate Versioned storage section
jorgeorpinel Sep 30, 2020
79d9b0b
cases: rewrite intro and other changes to Versioning
jorgeorpinel Sep 30, 2020
8ddc976
cases: cover gap between Versioning and (remote) storage, link to GS
jorgeorpinel Sep 30, 2020
ae8e7ad
Merge branch 'master' into use-cases-versng
jorgeorpinel Oct 8, 2020
09d1eab
use-cases: reapply SEO keyword changes from #1806
jorgeorpinel Oct 8, 2020
7ea83bf
cases: make p about storage less overlapping to previous one
jorgeorpinel Oct 8, 2020
e90d332
cases: add paragraph about versioning advantages before DVC's motivation
jorgeorpinel Oct 9, 2020
5f47377
cases: simplify lists of advantages in Versioning (and Data Reg)
jorgeorpinel Oct 9, 2020
f36d10b
cases: limitation->constraint (to avoid a redundancy)
jorgeorpinel Oct 9, 2020
b9ea7ec
guide: move DVC is not Git! from use cases to What is DVC?
jorgeorpinel Oct 9, 2020
3fdddf2
cases: ~~Summary of~~ Advantages (H2)
jorgeorpinel Oct 9, 2020
ce647f5
cases: rewrite parts of the DVC motivation paragraphs in Versioning
jorgeorpinel Oct 9, 2020
bc7018b
Merge branch 'master' into use-cases-versng
jorgeorpinel Oct 12, 2020
62a34db
cases: improve vrsng intro and dedupe bullet lists
jorgeorpinel Oct 13, 2020
81d848b
cases: rename Advantages sectino of vrsng
jorgeorpinel Oct 13, 2020
1898ccf
cases: expand on How it looks (vrsng) with focus on workspace
jorgeorpinel Oct 13, 2020
c6d969c
Merge branch 'master' into use-cases-versng
jorgeorpinel Oct 14, 2020
f447b14
guide: improve DVC is not Git! section
jorgeorpinel Oct 14, 2020
384218d
cases: rename Versioning use case (why "Files"?)
jorgeorpinel Oct 14, 2020
7facdf7
cases: rewrite (again) the intro to vrsng
jorgeorpinel Oct 15, 2020
8f3cb70
Merge branch 'master' into use-cases-versng
jorgeorpinel Oct 18, 2020
c0b6b58
cases: improve versioning intro (more coherent)
jorgeorpinel Oct 18, 2020
17f80b8
cmd: quick term update
jorgeorpinel Oct 20, 2020
018c3f3
Merge branch 'master' into use-cases-versng
jorgeorpinel Oct 21, 2020
b63bc7f
cases: update links to Versioning use case
jorgeorpinel Oct 21, 2020
0ec7350
cases: refine Versioning intro, add proposed figure
jorgeorpinel Oct 21, 2020
68619f1
cases: summarize, simplify, focus on the essence, et al.
jorgeorpinel Oct 21, 2020
12bc7ed
cases: add redirect for new Versioning use case location
jorgeorpinel Oct 21, 2020
78426d1
cases: merge How it looks + Version control sections
jorgeorpinel Oct 22, 2020
bc7ff0a
cases: simplify versioning-data-and-models#how-it-looks
jorgeorpinel Oct 22, 2020
53b65c2
Merge branch 'master' into use-cases-versng
jorgeorpinel Oct 22, 2020
579cc5b
Revert "redirect for new Versioning use case URL" 12bc7ed and
jorgeorpinel Oct 23, 2020
fb7265c
cases: rewrite intro to improve motivation and
jorgeorpinel Oct 24, 2020
c9d0444
cases: update Why DVC and benefits list
jorgeorpinel Oct 24, 2020
f49cce6
cases: actually revert URL change from recent commit
jorgeorpinel Oct 24, 2020
5b95d36
cases: more updates to the benefits bullets in Versioning
jorgeorpinel Oct 24, 2020
aeb860e
cases: rewrite How it looks (& feels) section
jorgeorpinel Oct 25, 2020
2c7e2ea
cases: remove non-essential info. from How it looks section of Versio…
jorgeorpinel Oct 25, 2020
da9390a
Merge branch 'master' into use-cases-versng
jorgeorpinel Oct 27, 2020
30ad4e7
cases: simplify How it looks per David and some of Ivan's feedback
jorgeorpinel Oct 27, 2020
c5e34ce
cases: remove H2s temporarily, simplify benefits bullet list, et al.
jorgeorpinel Oct 27, 2020
04e42cb
cses: rewrite benefit bullets and simplify how it feels section
jorgeorpinel Oct 27, 2020
531071a
cases: make bullet list into paragraph temp.
jorgeorpinel Oct 27, 2020
40f09df
cases: wrap up Vrsng? (text)
jorgeorpinel Oct 28, 2020
8fcd2e6
cases: hardcode colums in How it feels section of Vrsng
jorgeorpinel Oct 28, 2020
669f9f5
cache: simplify it's structure explanation and add CAS term (from Vrs…
jorgeorpinel Oct 28, 2020
67c6beb
guide: revert changes to this section for now
jorgeorpinel Oct 28, 2020
4329b60
cases: polish latest iteration of Versioning use case
jorgeorpinel Oct 28, 2020
8ca6ef1
Merge branch 'master' into use-cases-versng
jorgeorpinel Oct 29, 2020
8b81ac3
Merge branch 'master' into use-cases-versng
jorgeorpinel Oct 31, 2020
66b0829
cases: next iteration of Versioning page
jorgeorpinel Oct 31, 2020
49adf55
Merge branch 'master' into use-cases-versng
jorgeorpinel Nov 3, 2020
aa6c43e
cases: polishing my last iteration of the Vsng page
jorgeorpinel Nov 3, 2020
3c61ea7
remove a bunch of info from Vrsng to simplify again
jorgeorpinel Nov 4, 2020
b74e687
cases: minor iteration of Vrsng, pending benefits list
jorgeorpinel Nov 5, 2020
f02c1a7
guide: updates to What is DVC
jorgeorpinel Nov 5, 2020
63970bc
cmd: roll-back unrelated changes (stashed elsewhere for now)
jorgeorpinel Nov 5, 2020
e6ce632
cases: work on benefits of Vrsng
jorgeorpinel Nov 5, 2020
88ff11a
cases: more work on benefits of Vrsng
jorgeorpinel Nov 5, 2020
aeacb1f
cases: remove emojis; improve benefits list; add refs to other cases
jorgeorpinel Nov 10, 2020
3956590
cses: clarify about cache and about metafiles in Versioning
jorgeorpinel Nov 10, 2020
eeccb68
cases: simplify p about roll back/fwds; split benefit about data regs
jorgeorpinel Nov 10, 2020
7c22613
Merge branch 'master' into use-cases-versng
jorgeorpinel Nov 11, 2020
00b88e1
cases: change BEFORE to be similar to the top fig.
jorgeorpinel Nov 11, 2020
4436600
Merge branch 'master' into use-cases-versng
jorgeorpinel Nov 13, 2020
702d619
cases: another iteration of Versioning
jorgeorpinel Nov 13, 2020
00dc2d6
cases: simplify Versioning again
jorgeorpinel Nov 13, 2020
5edf502
cases: improvements on Vrsng per direct feedback
jorgeorpinel Nov 13, 2020
21fce9b
cases: more updates to latest text and figures
jorgeorpinel Nov 13, 2020
c0142c2
Merge branch 'master' into use-cases-versng
jorgeorpinel Nov 14, 2020
6cab1f3
cases: rephrase Vrsng benefits list
jorgeorpinel Nov 14, 2020
24a00cd
Merge branch 'master' into use-cases-versng
jorgeorpinel Nov 18, 2020
b1d75c7
cases: revert to previous draft fig
jorgeorpinel Nov 18, 2020
8307583
cases: update 2nd figure draft, and reorder codification p
jorgeorpinel Nov 18, 2020
79a071a
cases: rework Vrsng benefits and
jorgeorpinel Nov 18, 2020
953c16b
cases: draft What's Next section added with advanced scenarios for Vrsng
jorgeorpinel Nov 18, 2020
52ea945
cases: simplify 2nd figure
jorgeorpinel Nov 19, 2020
2b0d183
Merge branch 'master' into use-cases-versng
jorgeorpinel Nov 22, 2020
98b3135
cases: make first Vrsg figure shorter
jorgeorpinel Nov 22, 2020
87b8333
cases: merge advanced scenarios with benefits list
jorgeorpinel Nov 23, 2020
59fc3f9
cases: roll back changes to Data Regs
jorgeorpinel Nov 23, 2020
7677d25
cases: improvements per Dmitry's feedback...
jorgeorpinel Nov 23, 2020
a6f3352
cases: train_feats > features in figures for Vrsng
jorgeorpinel Nov 23, 2020
9644a38
cases: rename Vrng Tutorial label in nav (use emoji)
jorgeorpinel Nov 23, 2020
0e47977
cases: explain simple file naming a bit more
jorgeorpinel Nov 23, 2020
8bd9d96
cases: Vrng copy edits
jorgeorpinel Nov 23, 2020
1686639
cases: add efficient data mgmt benefit
jorgeorpinel Nov 24, 2020
4ab5ed4
cases: reorder Vrsg benefits list
jorgeorpinel Nov 24, 2020
1b48db2
cases: rewrite file naming and data mgmt benefits of Vrsg
jorgeorpinel Nov 24, 2020
0fbf200
cases: expand story to cover storage and data management
jorgeorpinel Nov 24, 2020
c7617ef
cases: generalized Vrsg benefits
jorgeorpinel Nov 24, 2020
aa62d9d
Merge branch 'master' into use-cases-versng
jorgeorpinel Nov 25, 2020
c14a271
cases: separate data mgmt from versioning (through codification) in Vrsg
jorgeorpinel Nov 25, 2020
020fb3f
Make note about other guides, refs, and tutorial (Vrng)
jorgeorpinel Nov 25, 2020
1a3a58c
cases: emphasize Simplicity benefit of Vrng is the opposite of "compl…
jorgeorpinel Nov 25, 2020
f6598bd
cases: another rewrite of text and benefits
jorgeorpinel Nov 27, 2020
98d97c7
Merge branch 'master' into use-cases-versng
jorgeorpinel Nov 27, 2020
6a75449
cases: copy edits to latest Vrng iteration, and append next steps par…
jorgeorpinel Nov 27, 2020
d6697dc
Merge branch 'master' into use-cases-versng
jorgeorpinel Dec 4, 2020
2ccd32e
cases: another iteration of Versioning use case
jorgeorpinel Dec 4, 2020
0d093b8
cases: clarify data mgmt is for data in Vrng benefits
jorgeorpinel Dec 6, 2020
136 changes: 31 additions & 105 deletions content/docs/use-cases/versioning-data-and-model-files/index.md
@@ -1,28 +1,38 @@
# Versioning Data and Model Files

DVC enables versioning large files and directories such as datasets, data
science features, and machine learning models using Git, but without storing the
contents in Git.
[Version control](https://en.wikipedia.org/wiki/Version_control) has become a
staple in software engineering because it allows effective collaboration on
source code. This means having a change history to go back to (commits),
developing features in parallel (branching), assisted merging, peer-reviews
(pull requests), tagging key revisions, etc. Imagine if we could use these
features for data modeling!

This is achieved by saving information about the data in special
[metafiles](/doc/user-guide/dvc-files-and-directories) that replace the data in
the repository. These can be versioned with regular Git workflows (branches,
pull requests, etc.)
Unfortunately, versioning tools like [Git](https://git-scm.com/) are designed to
handle small text files. While other assets can exist in the repository, storage
itself is not the goal, and is limited by Git hosting services
[such as GitHub](https://docs.github.com/en/github/managing-large-files/what-is-my-disk-quota).
Traditional storage solutions like hard drives or NAS, or cloud services like
Amazon S3 or Google Drive, are much better options for saving and transferring
large files.

To actually store the data, DVC uses a built-in <abbr>cache</abbr>, and supports
synchronizing it with various types of
[remote storage](/doc/command-reference/remote). This allows storing and sharing
data easily, and alongside code.
What if we could **combine effective data storage with robust versioning
features**?

![](/img/model-versioning-diagram.png) _Code and data flows in DVC_
![](/img/model-versioning-diagram.png) _DVC's hybrid versioned storage model_

In this basic use case, DVC is a better alternative to
[Git-LFS / Git-annex](/doc/user-guide/related-technologies) and to ad-hoc
scripts used to manage ML <abbr>artifacts</abbr> (training data, models, etc.)
on cloud storage. DVC doesn't require special services, and works with
on-premises storage (e.g. SSH, NAS) as well as any major cloud storage provider
(Amazon S3, Microsoft Azure, Google Drive,
[among others](/doc/command-reference/remote/add#supported-storage-types)).
DVC brings the best of both worlds together by replacing the data in the repo
with small, human-readable
[metafiles](/doc/user-guide/dvc-files-and-directories). Tracked data is
<abbr>cached</abbr> locally outside the Git repo, and can easily be synchronized
with on-premises or cloud storage. Unlike other alternatives (like Git-LFS),
[remote storage](/doc/command-reference/remote) is optional — no server setup or
special services are required.
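
The metafile-plus-cache idea described above can be sketched with plain shell tools. This is a conceptual illustration only, not DVC's implementation: the `repo/` and `cache/` directories, the `data.csv.meta` file, and its `checksum:` field are hypothetical names, and `cksum` stands in for whatever hash a real tool would use.

```shell
#!/bin/sh
# Conceptual sketch: replace a data file in a repo with a small metafile,
# keeping the real content in a cache outside the repo.
# All names here are hypothetical; this is not how DVC is implemented.
set -eu

work=$(mktemp -d)
cd "$work"
mkdir -p repo cache

# A stand-in for a large data file.
printf 'lots of training data\n' > repo/data.csv

# A checksum identifies this exact version of the content.
sum=$(cksum repo/data.csv | cut -d' ' -f1)

# Move the content into the cache, keyed by its checksum.
mv repo/data.csv "cache/$sum"

# Leave a small, human-readable metafile in the repo instead.
printf 'path: data.csv\nchecksum: %s\n' "$sum" > repo/data.csv.meta

# Restoring the data means looking up the checksum recorded in the metafile.
wanted=$(sed -n 's/^checksum: //p' repo/data.csv.meta)
cp "cache/$wanted" repo/data.csv
restored=$(cat repo/data.csv)
echo "$restored"
```

Only the metafile would need to live in version control; the cached content can be stored and synchronized separately.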

## How it looks

... reference to a problem (exemplify why to version data)

... demo DVC's look&feel (more philosophy?)

> For hands-on experience, we recommend following the
> [versioning tutorial](/doc/use-cases/versioning-data-and-model-files).
@@ -37,93 +47,9 @@ versions out-of-the-box.
Full-fledged
[version control](https://git-scm.com/book/en/v2/Getting-Started-About-Version-Control)
is left for Git and its hosting platforms (e.g. GitHub, GitLab) to handle. These
are designed for source code management (SCM) however, and thus ill-equipped to
are designed for source code versioning however, and thus ill-equipped to
support data science needs. That's where DVC comes in: with its built-in data
<abbr>cache</abbr>, reproducible [pipelines](/doc/start/data-pipelines), among
several other novel features (see [Get Started](/doc/start/) for a primer.)

## Track data and models for versioning

Let's say you have an empty <abbr>DVC repository</abbr> and put a dataset of
images in the `images/` directory. You can start tracking it with `dvc add`.
This generates a `.dvc` file, which can be committed to Git in order to save the
project's version:

```dvc
$ ls images/
0001.jpg 0002.jpg 0003.jpg 0004.jpg ...

$ dvc add images/

$ git add images.dvc .gitignore
$ git commit -m "Track images dataset with DVC."
```

DVC also lets you define the processes that build artifacts based on tracked
data, such as an ML model, by writing a simple `dvc.yaml` file that connects the
pieces together:

> `dvc.yaml` files can be written manually or generated with `dvc run`.

```yaml
stages:
  train:
    cmd: python train.py images/
    deps:
      - images
    outs:
      - model.pkl
```

> See [Data Pipelines](/doc/start/data-pipelines) for a comprehensive intro to
> this feature.

`dvc repro` can now execute the `train` stage for you. DVC will track all of its
outputs (`outs`) automatically. Let's do that, and commit this project version:

```dvc
$ dvc repro
Running stage 'train' with command:
python train.py images/
Updating lock file 'dvc.lock'
...

$ git add dvc.yaml dvc.lock .gitignore
$ git commit -m "Train model via DVC."
$ git tag -a "v1.0" -m "First model" # We'll use this soon ;)
```

> See also `dvc.lock`.

## Switching versions

After iterating on this process and producing several versions, you can combine
`git checkout` and `dvc checkout` to perform full or partial
<abbr>workspace</abbr> restorations.

![](/img/versioning.png) _Code and data checkout_

> Note that `dvc install` enables auto-checkouts of data after `git checkout`.

A full checkout brings the whole <abbr>project</abbr> back to a previous version
— code, dataset and model files all match each other:

```dvc
$ git checkout v1.0
$ dvc checkout
M images
M model.pkl
```

However, we can also check out only certain parts, for example to keep the
latest source code and model but rewind only the dataset to its previous version:

```dvc
$ git checkout v1.0 images.dvc
$ dvc checkout images.dvc
M images
```
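
The Git half of such a partial checkout can be tried in isolation. The following is a minimal sketch that assumes only `git` is installed; `images.meta` is a hypothetical stand-in for a metafile, and no DVC command is run, so the data itself is never touched here.

```shell
#!/bin/sh
# Sketch: rewind a single tracked file to a tagged version with Git,
# leaving every other file at the latest commit.
# File names and contents are made up for illustration.
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q
# Set an identity locally so commits work in a clean environment.
git config user.email "demo@example.com"
git config user.name "Demo"

echo "dataset v1" > images.meta
echo "code v1" > train.py
git add images.meta train.py
git commit -q -m "First version"
git tag v1.0

echo "dataset v2" > images.meta
echo "code v2" > train.py
git commit -q -am "Second version"

# Restore only the metafile from the v1.0 tag.
git checkout -q v1.0 -- images.meta

meta_now=$(cat images.meta)
code_now=$(cat train.py)
echo "$meta_now / $code_now"
```

After the last step the metafile is back at its tagged content while `train.py` stays at the latest commit, which mirrors the "rewind the dataset only" scenario.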

DVC [optimizes](/doc/user-guide/large-dataset-optimization) this operation by
avoiding copying files each time, so checking out data is quick even if you have
large data files.
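
One general way to avoid copying is to link a workspace file to the cached content. The sketch below assumes hard links on a single file system, with hypothetical `cache/` and `workspace/` directories; which link type DVC actually uses on a given system is covered in the linked optimization guide, not here.

```shell
#!/bin/sh
# Sketch: expose a cached file in a workspace through a hard link,
# so no data is duplicated on disk. Paths are hypothetical.
set -eu

work=$(mktemp -d)
cd "$work"
mkdir cache workspace

printf 'large file contents\n' > cache/blob

# Link instead of copy: both names refer to the same data on disk.
ln cache/blob workspace/data.bin

# Matching inode numbers confirm that the content is shared.
cache_inode=$(ls -i cache/blob | awk '{print $1}')
ws_inode=$(ls -i workspace/data.bin | awk '{print $1}')
echo "cache inode: $cache_inode, workspace inode: $ws_inode"
```

Because creating a link does not depend on file size, this kind of "checkout" stays fast for large files.
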
... connect with other cases