Commit

guide: intro run-cache, change H3->H2 in Basic Concepts
and other copy edits
jorgeorpinel committed Aug 10, 2020
1 parent f7869ef commit d3cdb87
Showing 1 changed file with 24 additions and 15 deletions.
39 changes: 24 additions & 15 deletions content/docs/user-guide/basic-concepts.md
@@ -4,7 +4,7 @@ DVC streamlines large data files and binary models into a single Git
environment. This approach will not require storing binary files in your Git
repository.

-### DVC Project
+## DVC Project

Initialized by running `dvc init` in a directory, it will contain all the
[DVC files and directories](/doc/user-guide/dvc-files-and-directories),
@@ -15,15 +15,15 @@ files referenced from special DVC files are also considered part of the project
> `dvc destroy` can be used to remove all DVC-specific files from the directory,
> in effect deleting the DVC project.
-### DVC repository
+## DVC repository

<abbr>DVC project</abbr> initialized in a Git repository. This enables the
versioning features of DVC (recommended). Files tracked by Git are considered
part of the DVC project when referenced from special DVC files such as
`dvc.lock`, for example source code that is used as a stage
<abbr>dependency</abbr>.

-### Data Files
+## Data Files

Large files (or directories) that are tracked and <abbr>cached</abbr> by DVC.
Data files are too large to be added to a Git repository. DVC stores them on a
@@ -37,32 +37,32 @@ performance data, etc.

> A.k.a. <abbr>data artifacts</abbr> and <abbr>outputs</abbr>
-### Workspace
+## Workspace

It's comprised by the non-internal <abbr>project</abbr> files, as well as the
currently present set of _data files_ and directories (see `dvc checkout`).
Similar to the
[working tree](https://git-scm.com/docs/gitglossary#def_working_tree) in Git.

-### DVC Cache
+## DVC Cache

A DVC project's <abbr>cache</abbr> is an
[internal directory](/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory)
used to store all data files outside of the Git repository. It's a local hard
drive or external location. See `dvc cache dir`.

-### Remote Storage
+## Remote Storage

-Storage location external to the DVC project, which is used to backup all or
-parts of the <abbr>cache</abbr>. See `dvc remote` for more details.
+Storage location external to the DVC project, which is used to share and backup
+all or parts of the <abbr>cache</abbr>. See `dvc remote` for more details.

-### Processing Stage
+## Processing Stage

An individual process that transforms a data input (<abbr>dependency</abbr>)
into some result (usually a data <abbr>output</abbr>). DVC stages execute
terminal commands to (re)generate their results.

-### Data Pipeline
+## Data Pipeline

Dependency graph ([DAG](https://en.wikipedia.org/wiki/Directed_acyclic_graph)),
or series of [data processing stages](#stage) to (re)produce certain results.
@@ -71,7 +71,7 @@ defined in special `dvc.yaml` files. Refer to `dvc dag` for more information.

See [Data Pipelines](/doc/start/data-pipelines) for a hands-on explanation.

-### Reproducibility
+## Reproducibility

Action to reproduce an experiment state. This regenerates output files (or
directories) based on a set of input files and source code. This action usually
@@ -80,9 +80,7 @@ changes experiment state.
> This is one of the biggest challenges in reusing, and hence managing ML
> projects.
-## Advanced Concepts
-
-### Experiment
+## Experiment

An attempt at a data science task. Each one can be performed in a separate Git
branch or tag, and its states identified by different
@@ -93,7 +91,18 @@ experiment into the <abbr>repository</abbr> history.

> See [Experiments](/doc/start/experiments) for a hands-on explanation.
-### Workflow
+## Run Cache
+
+DVC's run-cache is an automatic performance feature that stores both the context
+and results of past experiment runs. It's located in the `.dvc/cache/runs`
+directory.
+
+`dvc run` and `dvc repro` look in the run-cache first before executing any
+stages, to see if this exact same configuration has been run before (and if so
+use the cached results). The run-cache can be uploaded and downloaded to/from
+remote storage, along with the rest of the <abbr>cache</abbr>.
+
+## Workflow

Set of experiments and relationships among them. Corresponds to the entire
<abbr>project</abbr> and may contain several [data pipelines](#data-pipelines).
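
Note on the new Run Cache section: the lookup it describes (check whether the exact same configuration has been run before, and reuse the stored results if so) can be illustrated with a small sketch. This is not DVC's implementation, which the diff does not show. The names `run_key` and `RunCache`, the choice of SHA-256, and the in-memory dictionary are all assumptions made for illustration only.

```python
import hashlib
import json


def run_key(command, dep_hashes):
    """Build a deterministic key from a stage command and its dependency hashes.

    Hypothetical helper: the real run-cache key format is not described here.
    """
    payload = json.dumps({"cmd": command, "deps": dep_hashes}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class RunCache:
    """Toy in-memory stand-in for a run-cache (an assumption, not DVC code)."""

    def __init__(self):
        self._entries = {}
        self.executions = 0  # counts how often a stage actually ran

    def run(self, command, dep_hashes, execute):
        key = run_key(command, dep_hashes)
        if key in self._entries:
            # Same command and same inputs seen before: reuse stored outputs.
            return self._entries[key]
        self.executions += 1
        outputs = execute()
        self._entries[key] = outputs
        return outputs


cache = RunCache()
deps = {"data.csv": "abc123", "train.py": "def456"}

# First run executes the stage; the identical second run is served from cache.
first = cache.run("python train.py", deps, lambda: {"model.pkl": "hash-1"})
second = cache.run("python train.py", deps, lambda: {"model.pkl": "hash-2"})

# Changing a dependency hash produces a new key, so the stage runs again.
changed = cache.run(
    "python train.py",
    {**deps, "data.csv": "zzz999"},
    lambda: {"model.pkl": "hash-3"},
)
```

Keying on both the command and the dependency hashes is what makes "this exact same configuration" checkable: any change to either yields a different key and forces a fresh execution.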
