Skip to content

Commit

Permalink
cases: copy edits to latest Vrng iteration, and append next steps par…
Browse files Browse the repository at this point in the history
…agaph (bottom)
  • Loading branch information
jorgeorpinel committed Nov 27, 2020
1 parent 98d97c7 commit 6a75449
Showing 1 changed file with 33 additions and 28 deletions.
61 changes: 33 additions & 28 deletions content/docs/use-cases/versioning-data-and-model-files/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,58 +8,58 @@ these files and directories?
![](/img/data-ver-complex.png) _Exponential complexity of data science projects_

Another problem in the field has to do with bookkeeping: being able to identify
past research inputs and processes to understand results, for knowledge sharing,
or for debugging. There's also the matter of defining and enforcing data
lifecycles.
past research inputs and processes to understand results, for knowledge sharing
or for debugging.

**Data Version Control** (DVC) helps you organize data and models effectively,
and capture their versions with
[Git commits](https://git-scm.com/book/en/v2/Git-Basics-Recording-Changes-to-the-Repository).
It supports data _versioning through codification_: describing which datasets,
ML artifacts, etc. should be in the <abbr>workspace</abbr> at any given time.
DVC also provides mechanisms to cache different data contents, and to switch
between their versions.
between these versions.

![](/img/project-versions.png) _DVC matches the right versions of data to the
rest of your project._

DVC makes projects fully _reproducible_:
[Restore](/doc/command-reference/checkout) any project version and find the
corresponding data instantly. At the same time, the data stays with you — free
from Git hosting storage
[constraints](https://docs.github.com/en/free-pro-team@latest/github/managing-large-files/what-is-my-disk-quota).

This is achieved with special
[metafiles](/doc/user-guide/dvc-files-and-directories) that can be put in Git
instead of the actual data. And by storing unique data versions (no file
duplication) separate from the workspace.
instead of the actual data. And by storing unique data versions (preventing file
duplication) separate from the workspace. Now file names don't need to change,
because they can variable data. The result is a single immutable history for
data, code, models, etc. — a proper experiments journal.

The result is a single immutable history for data, code, models, etc., like an
experiments journal. This makes the project fully _reproducible_:
[Restore](/doc/command-reference/checkout) any project version and find the
corresponding data instantly. And at the same time, the data stays with you —
free from the storage constraints of Git repos and
[hosting](https://docs.github.com/en/free-pro-team@latest/github/managing-large-files/what-is-my-disk-quota).

> To learn how it looks and feels, try the
> To learn how DVC looks and feels, try the
> [versioning tutorial](/doc/use-cases/versioning-data-and-model-files/tutorial)
> 👩‍💻. And there are many other [guides](/doc/user-guide) and
> [references](/doc/command-reference) that explain DVC in more detail.
> 👩‍💻. <br/> There are also other [guides](/doc/user-guide) and
> [references](/doc/command-reference) that explain DVC in detail.
These are the benefits of our approach:
Benefits of our approach include:

- **Lightweight**: DVC is a
[free](https://github.com/iterative/dvc/blob/master/LICENSE), open-source
[command line](/doc/command-reference) tool that doesn't require databases or
servers.
[command line](/doc/command-reference) tool that doesn't require databases,
servers, or any other special services.

- **Consistency**: Keep your projects readable with stable file names (which can
contain variable data). No need for complicated paths like
`data/20190922/labels_v7_final` or for constantly editing them in source code.
- **Consistency**: Keep your projects readable with stable file names. No need
for complicated paths like `data/20190922/labels_v7_final` or for constantly
editing these in source code.

- **Efficient data management**: Use a cost-effective on-premises or cloud
storage of your choice for your data and ML models. DVC
[optimizes](/doc/user-guide/large-dataset-optimization) their storage and
[optimizes](/doc/user-guide/large-dataset-optimization) data storage and
transfers.

- **Collaboration**: Easily share project data
[internally](/doc/use-cases/shared-development-server) or
- **Collaboration**: Easily distribute your project development and share its
data [internally](/doc/use-cases/shared-development-server) and
[remotely](/doc/use-cases/sharing-data-and-model-files), or
[reuse](/doc/start/data-access) it elsewhere.
[reuse](/doc/start/data-access) it in other places.

- **Data compliance**: Review data modification attempts as Git
[pull requests](https://www.dummies.com/web-design-development/what-are-github-pull-requests/).
Expand All @@ -72,5 +72,10 @@ These are the benefits of our approach:
[data registries](/doc/use-cases/data-registries), and other best practices.

In summary, data science and machine learning are iterative processes where the
lifecycles of data, models, and code happen at different paces. DVC helps
integrate and manage them effectively. There are other things that DVC can do
lifecycles of data, models, and code happen at different paces. DVC helps you
define, manage, and enforce them.

And this is only the beginning. DVC comes with other [features](/features)
specific to data science out-of-the-box. Build, run, and versioning
[data pipelines](/doc/command-reference/dag), manage
[experiments](/doc/start/experiments) effectively, and more!

0 comments on commit 6a75449

Please sign in to comment.