diff --git a/content/docs/sidebar.json b/content/docs/sidebar.json
index 6b544fa8b24..fe8bb25d0d7 100644
--- a/content/docs/sidebar.json
+++ b/content/docs/sidebar.json
@@ -87,7 +87,6 @@
         "slug": "what-is-dvc",
         "source": "what-is-dvc.md"
       },
-      "basic-concepts",
       {
         "label": "DVC Files and Directories",
         "slug": "dvc-files-and-directories"
diff --git a/content/docs/user-guide/basic-concepts.md b/content/docs/user-guide/basic-concepts.md
deleted file mode 100644
index a3906308388..00000000000
--- a/content/docs/user-guide/basic-concepts.md
+++ /dev/null
@@ -1,108 +0,0 @@
-# Basic Concepts of DVC
-
-DVC streamlines large data files and binary models into a single Git
-environment. This approach will not require storing binary files in your Git
-repository.
-
-## DVC Project
-
-Initialized by running `dvc init` in a directory, it will contain all the
-[DVC files and directories](/doc/user-guide/dvc-files-and-directories),
-including the cache, `dvc.yaml` and `.dvc` files, etc. Any other
-files referenced from special DVC files are also considered part of the project
-(for example [metrics files](/doc/command-reference/metrics)).
-
-> `dvc destroy` can be used to remove all DVC-specific files from the directory,
-> in effect deleting the DVC project.
-
-## DVC repository
-
-DVC project initialized in a Git repository. This enables the
-versioning features of DVC (recommended). Files tracked by Git are considered
-part of the DVC project when referenced from special DVC files such as
-`dvc.lock`, for example source code that is used as a stage
-dependency.
-
-## Data Files
-
-Large files (or directories) that are tracked and cached by DVC.
-Data files are too large to be added to a Git repository. DVC stores them on a
-local/shared hard drive, and/or _remote storage_. `dvc.lock` or `.dvc` files
-describing the data are put in the project as placeholders for DVC
-needs (to maintain pipelines and reproducibility). These can be committed to Git
-instead of the data files themselves.
-
-Examples of data files are raw datasets, extracted features, ML models,
-performance data, etc.
-
-> A.k.a. data artifacts and outputs
-
-## Workspace
-
-It's comprised by the non-internal project files, as well as the
-currently present set of _data files_ and directories (see `dvc checkout`).
-Similar to the
-[working tree](https://git-scm.com/docs/gitglossary#def_working_tree) in Git.
-
-## DVC Cache
-
-A DVC project's cache is an
-[internal directory](/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory)
-used to store all data files outside of the Git repository. It's a local hard
-drive or external location. See `dvc cache dir`.
-
-## Remote Storage
-
-Storage location external to the DVC project, which is used to share and backup
-all or parts of the cache. See `dvc remote` for more details.
-
-## Processing Stage
-
-An individual process that transforms a data input (dependency)
-into some result (usually a data output). DVC stages execute
-terminal commands to (re)generate their results.
-
-## Data Pipeline
-
-Dependency graph ([DAG](https://en.wikipedia.org/wiki/Directed_acyclic_graph)),
-or series of [data processing stages](#stage) to (re)produce certain results.
-Multiple stages can be chained by their dependencies and outputs. Pipelines are
-defined in special `dvc.yaml` files. Refer to `dvc dag` for more information.
-
-See [Data Pipelines](/doc/start/data-pipelines) for a hands-on explanation.
-
-## Reproducibility
-
-Action to reproduce an experiment state. This regenerates output files (or
-directories) based on a set of input files and source code. This action usually
-changes experiment state.
-
-> This is one of the biggest challenges in reusing, and hence managing ML
-> projects.
-
-## Experiment
-
-An attempt at a data science task. Each one can be performed in a separate Git
-branch or tag, and its states identified by different
-[revisions](https://git-scm.com/docs/revisions). Examples: add a new data
-source, extract data features, change model hyperparameters, etc. DVC doesn't
-need to recompute the results after a successful merge that integrates an
-experiment into the repository history.
-
-> See [Experiments](/doc/start/experiments) for a hands-on explanation.
-
-## Run Cache
-
-DVC's run-cache is an automatic performance feature that stores both the context
-and results of past experiment runs. It's located in the `.dvc/cache/runs`
-directory.
-
-`dvc run` and `dvc repro` look in the run-cache first before executing any
-stages, to see if this exact same configuration has been run before (and if so
-use the cached results). The run-cache can be uploaded and downloaded to/from
-remote storage, along with the rest of the cache.
-
-## Workflow
-
-Set of experiments and relationships among them. Corresponds to the entire
-project and may contain several [data pipelines](#data-pipelines).
diff --git a/content/docs/user-guide/related-technologies.md b/content/docs/user-guide/related-technologies.md
index 2f789a7f640..391cef04b6d 100644
--- a/content/docs/user-guide/related-technologies.md
+++ b/content/docs/user-guide/related-technologies.md
@@ -6,11 +6,10 @@ bringing best practices from software engineering into the data science field
 
 ## Git
 
-- DVC builds upon Git by introducing the concept of
-  [data files](/doc/user-guide/basic-concepts#data-files) – large files that
-  should not be stored in a Git repository, but still need to be tracked and
-  versioned. It leverages Git's features to enable managing different versions
-  of data itself, data pipelines, and experiments.
+- DVC builds upon Git by introducing the concept of data files – large files
+  that should not be stored in a Git repository, but still need to be tracked
+  and versioned. It leverages Git's features to enable managing different
+  versions of data itself, data pipelines, and experiments.
 
 - DVC is not fundamentally bound to Git, and can work without it (except
   versioning-related features). This also applies to Git-LFS and Git-annex,
@@ -27,7 +26,7 @@ bringing best practices from software engineering into the data science field
   [available](/doc/command-reference/install)).
 
 - Git-LFS was not made with data science in mind, so it doesn't provide related
-  features (e.g. [pipelines](/doc/user-guide/basic-concepts#data-pipeline),
+  features (e.g. [pipelines](/doc/command-reference/pipeline),
   [metrics](/doc/command-reference/metrics), etc.).
 
 - Github (most common Git hosting service) has a limit of 2 GB per repository.
@@ -116,14 +115,13 @@ _Luigi_, etc.
   (DAG):
 
   - The DAG or dependency graph is defined implicitly by the connections between
-    pipeline [stages](/doc/user-guide/basic-concepts#data-processing-stage),
-    based on their dependencies and outputs.
+    pipeline [stages](/doc/command-reference/run), based on their
+    dependencies and outputs.
 
   - Each stage defines one node in the DAG. All DVC-files in a repository make
-    up a [pipelines](/doc/user-guide/basic-concepts#data-pipeline) (think a
-    single Makefile). All stages (and corresponding processes) are implicitly
-    combined through their inputs and outputs, simplifying conflict resolution
-    during merges.
+    up a [pipelines](/doc/command-reference/pipeline) (think a single Makefile).
+    All stages (and corresponding processes) are implicitly combined through
+    their inputs and outputs, simplifying conflict resolution during merges.
 
   - DVC stages can be written manually in an intuitive `dvc.yaml` file, or
     generated by the helper command `dvc run`, based on a terminal command, its
diff --git a/content/docs/user-guide/what-is-dvc.md b/content/docs/user-guide/what-is-dvc.md
index 15904d4c9e5..b63352f6534 100644
--- a/content/docs/user-guide/what-is-dvc.md
+++ b/content/docs/user-guide/what-is-dvc.md
@@ -21,8 +21,7 @@ software engineers.
   interface and flow as Git. DVC can also work stand-alone, but without
   versioning capabilities.
 
-- **Data versioning** is enabled by replacing
-  [large files](/doc/user-guide/basic-concepts#data-files), dataset directories,
+- **Data versioning** is enabled by replacing large files, dataset directories,
   ML models, etc. with small
   [metafiles](/doc/user-guide/dvc-files-and-directories) (easy to handle with
   Git). These placeholders point to the original data, which is decoupled from
@@ -33,8 +32,8 @@ software engineers.
   transfer large datasets or share a GPU-trained model with others.
 
 - DVC makes data science projects **reproducible** by creating lightweight
-  [pipelines](/doc/user-guide/basic-concepts#data-pipelines) using implicit
-  dependency graphs,and codifying the data and artifacts involved.
+  [pipelines](/doc/command-reference/pipeline) using implicit dependency
+  graphs,and codifying the data and artifacts involved.
 
 - DVC is **platform agnostic**: It runs on all major operating systems (Linux,
   MacOS, and Windows), and works independently of the programming languages