diff --git a/content/docs/use-cases/versioning-data-and-model-files/index.md b/content/docs/use-cases/versioning-data-and-model-files/index.md index 1b27d3a9ed..423264d7e3 100644 --- a/content/docs/use-cases/versioning-data-and-model-files/index.md +++ b/content/docs/use-cases/versioning-data-and-model-files/index.md @@ -8,9 +8,8 @@ these files and directories? ![](/img/data-ver-complex.png) _Exponential complexity of data science projects_ Another problem in the field has to do with bookkeeping: being able to identify -past research inputs and processes to understand results, for knowledge sharing, -or for debugging. There's also the matter of defining and enforcing data -lifecycles. +past research inputs and processes to understand results, for knowledge sharing +or for debugging. **Data Version Control** (DVC) helps you organize data and models effectively, and capture their versions with @@ -18,48 +17,49 @@ and capture their versions with It supports data _versioning through codification_: describing which datasets, ML artifacts, etc. should be in the workspace at any given time. DVC also provides mechanisms to cache different data contents, and to switch -between their versions. +between these versions. ![](/img/project-versions.png) _DVC matches the right versions of data to the rest of your project._ +DVC makes projects fully _reproducible_: +[Restore](/doc/command-reference/checkout) any project version and find the +corresponding data instantly. At the same time, the data stays with you — free +from Git hosting storage +[constraints](https://docs.github.com/en/free-pro-team@latest/github/managing-large-files/what-is-my-disk-quota). + This is achieved with special [metafiles](/doc/user-guide/dvc-files-and-directories) that can be put in Git -instead of the actual data. And by storing unique data versions (no file -duplication) separate from the workspace. +instead of the actual data. And by storing unique data versions (preventing file +duplication) separate from the workspace. Now file names don't need to change, +because they can variable data. The result is a single immutable history for +data, code, models, etc. — a proper experiments journal. -The result is a single immutable history for data, code, models, etc., like an -experiments journal. This makes the project fully _reproducible_: -[Restore](/doc/command-reference/checkout) any project version and find the -corresponding data instantly. And at the same time, the data stays with you — -free from the storage constraints of Git repos and -[hosting](https://docs.github.com/en/free-pro-team@latest/github/managing-large-files/what-is-my-disk-quota). - -> To learn how it looks and feels, try the +> To learn how DVC looks and feels, try the > [versioning tutorial](/doc/use-cases/versioning-data-and-model-files/tutorial) -> 👩‍💻. And there are many other [guides](/doc/user-guide) and -> [references](/doc/command-reference) that explain DVC in more detail. +> 👩‍💻.
There are also other [guides](/doc/user-guide) and +> [references](/doc/command-reference) that explain DVC in detail. -These are the benefits of our approach: +Benefits of our approach include: - **Lightweight**: DVC is a [free](https://github.com/iterative/dvc/blob/master/LICENSE), open-source - [command line](/doc/command-reference) tool that doesn't require databases or - servers. + [command line](/doc/command-reference) tool that doesn't require databases, + servers, or any other special services. -- **Consistency**: Keep your projects readable with stable file names (which can - contain variable data). No need for complicated paths like - `data/20190922/labels_v7_final` or for constantly editing them in source code. +- **Consistency**: Keep your projects readable with stable file names. No need + for complicated paths like `data/20190922/labels_v7_final` or for constantly + editing these in source code. - **Efficient data management**: Use a cost-effective on-premises or cloud storage of your choice for your data and ML models. DVC - [optimizes](/doc/user-guide/large-dataset-optimization) their storage and + [optimizes](/doc/user-guide/large-dataset-optimization) data storage and transfers. -- **Collaboration**: Easily share project data - [internally](/doc/use-cases/shared-development-server) or +- **Collaboration**: Easily distribute your project development and share its + data [internally](/doc/use-cases/shared-development-server) and [remotely](/doc/use-cases/sharing-data-and-model-files), or - [reuse](/doc/start/data-access) it elsewhere. + [reuse](/doc/start/data-access) it in other places. - **Data compliance**: Review data modification attempts as Git [pull requests](https://www.dummies.com/web-design-development/what-are-github-pull-requests/). @@ -72,5 +72,10 @@ These are the benefits of our approach: [data registries](/doc/use-cases/data-registries), and other best practices. In summary, data science and machine learning are iterative processes where the -lifecycles of data, models, and code happen at different paces. DVC helps -integrate and manage them effectively. There are other things that DVC can do +lifecycles of data, models, and code happen at different paces. DVC helps you +define, manage, and enforce them. + +And this is only the beginning. DVC comes with other [features](/features) +specific to data science out-of-the-box. Build, run, and versioning +[data pipelines](/doc/command-reference/dag), manage +[experiments](/doc/start/experiments) effectively, and more!