diff --git a/content/docs/sidebar.json b/content/docs/sidebar.json index 5caa56b848..68b5ca0b26 100644 --- a/content/docs/sidebar.json +++ b/content/docs/sidebar.json @@ -65,10 +65,15 @@ "source": "use-cases/index.md", "children": [ { - "label": "Versioning Data & Model Files", + "label": "Versioning Data and Models", "slug": "versioning-data-and-model-files", "source": "versioning-data-and-model-files/index.md", - "children": ["tutorial"] + "children": [ + { + "label": "Tutorial 👩‍💻", + "slug": "tutorial" + } + ] }, { "label": "Sharing Data and Model Files", diff --git a/content/docs/start/data-versioning.md b/content/docs/start/data-versioning.md index d509e68a5b..552bc63a8f 100644 --- a/content/docs/start/data-versioning.md +++ b/content/docs/start/data-versioning.md @@ -85,7 +85,7 @@ outs: > \* See > [Large Dataset Optimization](/doc/user-guide/large-dataset-optimization) and -> `dvc config cache` for more information on file linking. +> `dvc config cache` for more info. on file linking. diff --git a/content/docs/use-cases/index.md b/content/docs/use-cases/index.md index b028d58650..55933d74d5 100644 --- a/content/docs/use-cases/index.md +++ b/content/docs/use-cases/index.md @@ -18,8 +18,8 @@ knowledge, they are still difficult to implement, reuse, and manage. 
If you store and process data files or datasets to produce other data or machine learning models, and you want to -- track and save data and ML models the same way you capture code; -- create and switch among different +- track and save data and machine learning models the same way you capture code; +- create and switch between [versions of data and ML models](/doc/use-cases/versioning-data-and-model-files) easily; - understand how datasets and ML artifacts were built in the first place; diff --git a/content/docs/use-cases/versioning-data-and-model-files/index.md b/content/docs/use-cases/versioning-data-and-model-files/index.md index 448bdeab55..df3e8307a8 100644 --- a/content/docs/use-cases/versioning-data-and-model-files/index.md +++ b/content/docs/use-cases/versioning-data-and-model-files/index.md @@ -1,130 +1,88 @@ -# Versioning Data and Model Files - -DVC enables versioning large files and directories such as datasets, data -science features, and machine learning models using Git, but without storing the -contents in Git. - -This is achieved by saving information about the data in special -[metafiles](/doc/user-guide/dvc-files-and-directories) that replace the data in -the repository. These can be versioned with regular Git workflows (branches, -pull requests, etc.) - -To actually store the data, DVC uses a built-in cache, and supports -synchronizing it with various types of -[remote storage](/doc/command-reference/remote). This allows for easy data and -model versioning, storage, and sharing — right alongside code. - -![](/img/model-versioning-diagram.png) _Code and data flows in DVC_ - -In this basic use case, DVC is a better alternative to -[Git-LFS / Git-annex](/doc/user-guide/related-technologies) and to ad-hoc -scripts used to manage ML artifacts (training data, models, etc.) -on cloud storage. DVC doesn't require special services, and works with -on-premises storage (e.g. 
SSH, NAS) as well as any major cloud storage provider -(Amazon S3, Microsoft Azure, Google Drive, -[among others](/doc/command-reference/remote/add#supported-storage-types)). - -> For hands-on experience, we recommend following the -> [versioning tutorial](/doc/use-cases/versioning-data-and-model-files). - -## DVC is not Git! - -DVC metafiles such as `dvc.yaml` and `.dvc` files serve as placeholders to track -data files and directories for versioning (among other purposes). They point to -specific data contents in the cache, providing the ability to store -multiple data versions out-of-the-box. - -Full-fledged -[version control](https://git-scm.com/book/en/v2/Getting-Started-About-Version-Control) -is left for Git and its hosting platforms (e.g. GitHub, GitLab) to handle. These -are designed for source code management (SCM) however, and thus ill-equipped to -support data science needs. That's where DVC comes in: with its built-in data -cache, reproducible [pipelines](/doc/start/data-pipelines), among -several other novel features (see [Get Started](/doc/start/) for a primer.) - -## Track data and models for versioning - -Let's say you have an empty DVC repository and put a dataset of -images in the `images/` directory. You can start tracking it with `dvc add`. -This generates a `.dvc` file, which can be committed to Git in order to save the -project's version: - -```dvc -$ ls images/ -0001.jpg 0002.jpg 0003.jpg 0004.jpg ... - -$ dvc add images/ - -$ git add images.dvc .gitignore -$ git commit -m "Track images dataset with DVC." -``` - -DVC's also allows to define the processes that build artifacts based on tracked -data, such as an ML model, by writing a simple `dvc.yaml` file that connects the -pieces together: - -> `dvc.yaml` files can be written manually or generated with `dvc run`. 
- -```yaml -stages: - train: - cmd: python train.py images/ - deps: - - images - outs: - - model.pkl -``` - -> See [Data Pipelines](/doc/start/data-pipelines) for a comprehensive intro to -> this feature. - -`dvc repro` can now execute the `train` stage for you. DVC will track all of its -outputs (`outs`) automatically. Let's do that, and commit this project version: - -```dvc -$ dvc repro -Running stage 'train' with command: - python train.py images/ -Updating lock file 'dvc.lock' -... - -$ git add dvc.yaml dvc.lock .gitignore -$ git commit -m "Train model via DVC." -$ git tag -a "v1.0" -m "Fist model" # We'll use this soon ;) -``` - -> See also `dvc.lock`. - -## Switching versions - -After iterating on this process and producing several versions, you can combine -`git checkout` and `dvc checkout` to perform full or partial -workspace restorations. - -![](/img/versioning.png) _Code and data checkout_ - -> Note that `dvc install` enables auto-checkouts of data after `git checkout`. - -A full checkout brings the whole project back to a previous version -— code, dataset and model files all match each other: - -```dvc -$ git checkout v1.0 -$ dvc checkout -M images -M model.pkl -``` - -However, we can checkout certain parts only, for example if we want to keep the -latest source code and model versions, but rewind to the previous version of the -dataset: - -```dvc -$ git checkout v1.0 images.dvc -$ dvc checkout images.dvc -M images -``` - -DVC [optimizes](/doc/user-guide/large-dataset-optimization) this operation by -avoiding copying files each time, so checking out data is quick even if you are -versioning large data files. +# Versioning Data and Models + +Data science teams face data management questions around versions of data and +machine learning models. How do we keep track of changes in data, source code, +and ML models together? What's the best way to organize and store variations of +these files and directories? 
+ +![](/img/data-ver-complex.png) _Exponential complexity of data science projects_ + +Another problem in the field has to do with bookkeeping: being able to identify +past data inputs and processes to understand their results, for knowledge +sharing, or for debugging. + +**Data Version Control** (DVC) lets you capture the versions of your data and +models in +[Git commits](https://git-scm.com/book/en/v2/Git-Basics-Recording-Changes-to-the-Repository), +while storing them on-premises or in cloud storage. It also provides a mechanism +to switch between these different data contents. The result is a single history +for data, code, and ML models that you can traverse — a proper journal of your +work! + +![](/img/project-versions.png) _DVC matches the right versions of data, code, +and models for you 💘._ + +DVC enables data _versioning through codification_. You write simple +[metafiles](/doc/user-guide/dvc-files-and-directories) once, describing what +datasets, ML artifacts, etc. to track. This metadata can be put in Git in lieu +of large files. Now you can use DVC to create +[snapshots](/doc/command-reference/add) of the data, +[restore](/doc/command-reference/checkout) previous versions, +[reproduce](/doc/command-reference/repro) experiments, record evolving +[metrics](/doc/command-reference/metrics), and more! + +👩‍💻 **Intrigued?** Try our +[versioning tutorial](/doc/use-cases/versioning-data-and-model-files/tutorial) +to learn how DVC looks and feels firsthand. + +As you use DVC, unique versions of your data files and directories are +[cached](dvc-files-and-directories#structure-of-the-cache-directory) in a +systematic way (preventing file duplication). The working datastore is separated +from your workspace to keep the project light, but stays connected +via file +[links](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache) +handled automatically by DVC. 
+ +Benefits of our approach include: + +- **Lightweight**: DVC is a + [free](https://github.com/iterative/dvc/blob/master/LICENSE), open-source + [command line](/doc/command-reference) tool that doesn't require databases, + servers, or any other special services. + +- **Consistency**: Keep your projects readable with stable file names; they + don't need to change because they represent variable data. No need for + complicated paths like `data/20190922/labels_v7_final` or for constantly + editing these in source code. + +- **Efficient data management**: Use a familiar and cost-effective storage + solution for your data and models (e.g. SFTP, S3, HDFS, + [etc.](/doc/command-reference/remote/add#supported-storage-types)), free from + Git hosting + [constraints](https://docs.github.com/en/free-pro-team@latest/github/managing-large-files/what-is-my-disk-quota). + DVC [optimizes](/doc/user-guide/large-dataset-optimization) storing and + transferring large files. + +- **Collaboration**: Easily distribute your project development and share its + data [internally](/doc/use-cases/shared-development-server) and + [remotely](/doc/use-cases/sharing-data-and-model-files), or + [reuse](/doc/start/data-access) it in other places. + +- **Data compliance**: Review data modification attempts as Git + [pull requests](https://www.dummies.com/web-design-development/what-are-github-pull-requests/). + Audit the project's immutable history to learn when datasets or models were + approved, and why. + +- **GitOps**: Connect your data science projects with the Git-powered universe. + Git workflows open the door to advanced tools such as continuous integration + (like [CML](https://cml.dev/) CI/CD), specialized patterns such as + [data registries](/doc/use-cases/data-registries), and other best practices. + +In summary, data science and ML are iterative processes where the lifecycles of +data, models, and code happen at different paces. DVC helps you manage and +enforce them. 
+ +And this is just the beginning. DVC supports multiple advanced features +out-of-the-box: build, run, and version +[data pipelines](/doc/command-reference/dag), +[manage experiments](/doc/start/experiments) effectively, and more. diff --git a/content/docs/user-guide/what-is-dvc.md b/content/docs/user-guide/what-is-dvc.md index ab7e2c2753..7e92379f90 100644 --- a/content/docs/user-guide/what-is-dvc.md +++ b/content/docs/user-guide/what-is-dvc.md @@ -47,3 +47,17 @@ can version experiments, manage large datasets, and make projects reproducible. > Git servers, as well as SSH and cloud storage providers are supported, > however. + +## DVC does not replace Git! + +DVC metafiles such as `dvc.yaml` and `.dvc` files serve as placeholders to track +large data files and directories for versioning (among other +[purposes](/doc/user-guide/dvc-files-and-directories)). These metafiles change +along with your data, and you can use Git to place them under +[version control](https://git-scm.com/book/en/v2/Getting-Started-About-Version-Control) +as a proxy to the actual data versions, which are stored in the DVC +cache (outside of Git). This does not replace features of Git. + +DVC does, however, provide several commands similar to Git such as `dvc init`, +`dvc add`, `dvc checkout`, or `dvc push`, which interact with the underlying Git +repo (if one is being used, which is not required). diff --git a/static/img/data-ver-complex.png b/static/img/data-ver-complex.png new file mode 100644 index 0000000000..f633a53d22 Binary files /dev/null and b/static/img/data-ver-complex.png differ diff --git a/static/img/project-versions.png b/static/img/project-versions.png new file mode 100644 index 0000000000..8ef0bbc3f7 Binary files /dev/null and b/static/img/project-versions.png differ diff --git a/static/img/versioning.png b/static/img/versioning.png deleted file mode 100644 index 1b92fcb0b5..0000000000 Binary files a/static/img/versioning.png and /dev/null differ