Commit

Merge branch 'master' into get-started-1.0-experiments
jorgeorpinel committed Jun 26, 2020
2 parents 7871e43 + 87e83e1 commit befb19b
Showing 22 changed files with 424 additions and 488 deletions.
3 changes: 0 additions & 3 deletions config/prismjs/dvc-commands.js
@@ -25,9 +25,6 @@ module.exports = [
'plots modify',
'plots diff',
'plots',
- 'pipeline show',
- 'pipeline list',
- 'pipeline',
'move',
'metrics show',
'metrics diff',
106 changes: 53 additions & 53 deletions content/blog/2020-06-22-dvc-1-0-release.md
@@ -2,13 +2,13 @@
title: 'DVC 1.0 release: new features for MLOps'
date: 2020-06-22
description: |
-  Today we're releasing DVC 1.0. New exciting features that users were waiting
-  for ❤️ . All the details in this blog post.
+  Today we're releasing DVC 1.0 with new exciting features that users were
+  waiting for ❤️. Find all the details in this blog post.
descriptionLong: |
-  Today we're releasing DVC 1.0. New exciting features that users were waiting
-  for ❤️. DVC is a more mature product now with stable release cycles and
-  benchmarks. All the details in this blog post.
+  Today we're releasing DVC 1.0. It brings new exciting features that users
+  were waiting for ❤️. DVC is a more mature product now, with stable release
+  cycles and benchmarks. Find all the details in this blog post.
picture: 2020-06-22/release.png
pictureComment: DVC 1.0 release
@@ -24,31 +24,31 @@ tags:
## Introduction

3 years ago, I was concerned about good engineering standards in data science:
-data versioning, reproducibility, workflow automation - like continuous
-integration and continuous delivery (CI/CD) - but for machine learning. I wanted
-there to be Git for data to make this possible. So I made DVC (Data Version
-Control), which works as version control for data projects.
+data versioning, reproducibility, workflow automation like continuous
+integration and continuous delivery (CI/CD), but for machine learning. I wanted
+there to be a "Git for data" to make all this possible. So I created DVC (Data
+Version Control), which works as version control for data projects.

Technically, DVC codifies your data and machine learning pipelines as text
metafiles (with pointers to actual data in S3/GCP/Azure/SSH), while you use Git
-for the actual versioning. DevOps folks call this approach GitOps or more
-specificaly in this case - _DataOps_ or _MLOps_.
+for the actual versioning. DevOps folks call this approach GitOps or, more
+specifically, in this case _DataOps_ or _MLOps_.

-The new DVC 1.0. is inspired by discussions and contributions from our community
-of data scientists, ML engineers, developers and software engineers.

## DVC 1.0

-The new DVC 1.0 is inspired by discussions and contributions from our
-community - both fresh ideas and bug reports 😅. All these contributions, big
-and small, have a collective impact on DVC's development - I'm confident 1.0
-wouldn't be possible without our community. They tell us what features matter
-most, what approaches work (and what don't!), and what they need from DVC to
-support their ML projects.
+The new DVC 1.0 is inspired by discussions and contributions from our community
+both fresh ideas and bug reports 😅. All these contributions, big and small,
+have a collective impact on DVC's development. I'm confident 1.0 wouldn't be
+possible without our community. They tell us what features matter most, which
+approaches work (and which don't!), and what they need from DVC to support their
+ML projects.

-A few weeks ago we announced the 1.0 prerelease. After lots of helpful feedback
+A few weeks ago we announced the 1.0 pre-release. After lots of helpful feedback
from brave users, it's time to go live. Now, DVC 1.0 is available with all the
-standard installation methods including pip, conda, brew, choco, and
+standard installation methods including `pip`, `conda`, `brew`, `choco`, and
system-specific packages: deb, rpm, msi, pkg. See https://dvc.org/doc/install
for more details.
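For instance, upgrading an existing installation with pip (`conda` and `brew` work analogously; this is just one of the supported methods):

```dvc
$ pip install --upgrade dvc
```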

Expand All @@ -62,21 +62,21 @@ learned in 3 years of this journey and how these are reflected in the new DVC.

Our users taught us that ML pipelines evolve much faster than data engineering
pipelines with data processing steps. People need to change the commands of the
-pipeline often and it was not easy to do with the old DVC files.
+pipeline often and it was not easy to do this with the old DVC-files.

-In DVC 1.0, the DVC file format was changed in three big ways. First, instead of
-multiple DVC stage files (`*.dvc`), each project has a single DVC file
-`dvc.yaml`. By default, all stages go in this single `.yaml` file.
+In DVC 1.0, the DVC metafile format was changed in three big ways. First,
+instead of multiple DVC stage files (`*.dvc`), each project has a single
+`dvc.yaml` file. By default, all stages go in this single YAML file.

-Second, we made clear connections between the `dvc run` command, where pipeline
-stages are defined, and how stages appear in `dvc.yaml`. Many of the `dvc run`
-options are mirrored in the metafile. We wanted to make it far less complicated
-to edit an existing pipeline by making `dvc.yaml` more human readable and
-writable.
+Second, we made clear connections between the `dvc run` command (a helper to
+define pipeline stages), and how stages are defined in `dvc.yaml`. Many of the
+options of `dvc run` are mirrored in the metafile. We wanted to make it far less
+complicated to edit an existing pipeline by making `dvc.yaml` more human
+readable and writable.

-Third, data hash values are no longer stored in the pipeline metafile. This
-approach aligns better with GitOps paradigms and simplifies the usage of DVC by
-tremendously improving metafile human-readability:
+Third, file and directory hash values are no longer stored in the pipeline
+metafile. This approach aligns better with the GitOps paradigms and simplifies
+the usage of DVC by tremendously improving metafile human-readability:

```yaml
stages:
Expand All @@ -99,53 +99,53 @@ stages:
      - dropout
    metrics:
      - logs.csv
-      - summary.json
+      - summary.json:
+          cache: false
    outs:
      - model.pkl
```
-All the hashes have been moved to a special file, `dvc.lock`, which is a lot
-like the old DVC file format. DVC uses the `.lock` file to define what data
-files need to be restored to the workspace from data remotes (cloud storage) and
-if a particular pipeline stage needs to be rerun. In other words, we're
-separating the human-readable parts of the pipeline into `dvc.yaml` and
-auto-generated "machine" parts into `dvc.lock`.
+All of the hashes have been moved to a special file, `dvc.lock`, which is a lot
+like the old DVC-file format. DVC uses this lock file to define which data files
+need to be restored to the workspace from data remotes (cloud storage) and if a
+particular pipeline stage needs to be rerun. In other words, we're separating
+the human-readable parts of the pipeline into `dvc.yaml`, and the auto-generated
+"machine" parts into `dvc.lock`.

-Another cool change: the auto-generated part doesn't necessarily need to be
-stored in your Git repository. The new run-cache feature eliminates the need of
-storing the lock file in Git repositories. That brings us to our next big
-feature:
+Another cool change: the auto-generated part (`dvc.lock`) doesn't necessarily
+have to be stored in your Git repository. The new run-cache feature eliminates
+the need of storing the lock file in Git repositories. That brings us to our
+next big feature:

### [Run cache](https://github.com/iterative/dvc/issues/1234)

We built DVC with a workflow in mind: one experiment to one commit. Some users
love it, but this approach gets clunky fast for others (like folks who are
-grid-searching hyperparameter space). Forcing users to make Git commits for each
-ML experiment was a requirement for the old DVC, if you wanted to snapshot your
+grid-searching a hyperparameter space). Making Git commits for each ML
+experiment was a requirement with the old DVC, if you wanted to snapshot your
project or pipelines on each experiment. Moving forward, we want to give users
more flexibility to decide how often they want to commit.

We had an insight that data remotes (S3, Azure Blob, SSH etc) can be used
instead of Git for storing the codified meta information, not only data. In DVC
-1.0 a special structure is implemented - run-cache - that preserves the state
-including all the hashes. Basically, all the information that is stored in the
+1.0, a special structure is implemented, the run-cache, that preserves the state
+(including all the hashes). Basically, all the information that is stored in the
new `dvc.lock` file is replicated in the run-cache.

The advantage of the run-cache is that pipeline runs (and output file versions)
-are not directly connected to Git commits anymore. New DVC can store all the
-runs in run-cache - even if they were never committed to Git.
+are not directly connected to Git commits anymore. The new DVC can store all the
+runs in the run-cache, even if they were never committed to Git.

This approach gives DVC a "long memory" of DVC stages runs. If a user runs a
command that was run before (whether Git committed or not), then DVC can return
the result of the command from the cache without rerunning it. It is a useful
feature for a hyperparameter optimization stage - when users return to the
This approach gives DVC a "long memory" of DVC stages runs. If a user tries to
run a stage that was previously run (whether committed to Git or not), then DVC
can return the result from the run-cache without rerunning it. It is a useful
feature for a hyperparameter optimization stage when users return to the
previous sets of the parameters and don't want to wait for ML retraining.
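A sketch of that workflow (the stage and file names here are hypothetical):

```dvc
$ dvc repro train                  # executes the stage, saves results to run-cache
$ git checkout HEAD~1 params.yaml  # return to an earlier parameter set
$ dvc repro train                  # inputs seen before: restored from run-cache
```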

Another benefit of the run-cache is related to CI/CD systems for ML, which is a
holy grail of MLOps. The long memory means users don't have to make auto-commits
in their CI/CD system side - see
-[this stackowerflow question](https://stackoverflow.com/questions/61245284/will-you-automate-git-commit-into-ci-cd-pipline-to-save-dvc-run-experiments).
+[this Stackowerflow question](https://stackoverflow.com/questions/61245284/will-you-automate-git-commit-into-ci-cd-pipline-to-save-dvc-run-experiments).

### [Plots](https://github.com/iterative/dvc/issues/3409)

Expand Down
4 changes: 2 additions & 2 deletions content/docs/command-reference/checkout.md
@@ -65,8 +65,8 @@ progress made by the checkout.

There are two methods to restore a file missing from the cache, depending on the
situation. In some cases a pipeline must be reproduced (using `dvc repro`) to
-regenerate its outputs (see also `dvc pipeline`). In other cases the cache can
-be pulled from remote storage using `dvc pull`.
+regenerate its outputs (see also `dvc dag`). In other cases the cache can be
+pulled from remote storage using `dvc pull`.

## Options

108 changes: 108 additions & 0 deletions content/docs/command-reference/dag.md
@@ -0,0 +1,108 @@
# dag

Show [stages](/doc/command-reference/run) in a pipeline that lead to the
specified stage. By default it lists
[DVC-files](/doc/user-guide/dvc-files-and-directories).

## Synopsis

```usage
usage: dvc dag [-h] [-q | -v] [--dot] [--full] [target]
positional arguments:
  target          Stage or output to show pipeline for (optional)
                  Finds all stages in the workspace by default.
```

## Description

A data pipeline, in general, is a series of data processing
[stages](/doc/command-reference/run) (for example console commands that take an
input and produce an <abbr>output</abbr>). A pipeline may produce intermediate
data, and has a final result. Machine learning (ML) pipelines typically start
with large raw datasets, include intermediate featurization and training stages,
and produce a final model, as well as accuracy
[metrics](/doc/command-reference/metrics).

In DVC, pipeline stages and commands, their data I/O, interdependencies, and
results (intermediate or final) are specified with `dvc add` and `dvc run`,
among other commands. This allows DVC to restore one or more pipelines of stages
interconnected by their dependencies and outputs later. (See `dvc repro`.)

> DVC builds a dependency graph
> ([DAG](https://en.wikipedia.org/wiki/Directed_acyclic_graph)) to do this.

`dvc dag` displays the stages of a pipeline up to the target stage. If `target`
is omitted, it will show the full project DAG.

## Options

- `--dot` - show DAG in
[DOT](<https://en.wikipedia.org/wiki/DOT_(graph_description_language)>)
format. It can be passed to third party visualization utilities.

- `--full` - show the full DAG that the `target` belongs to, instead of showing
part that consists only of the target ancestors.

- `-h`, `--help` - prints the usage/help message, and exit.

- `-q`, `--quiet` - do not write anything to standard output. Exit with 0 if no
problems arise, otherwise 1.

- `-v`, `--verbose` - displays detailed tracing information.
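For example, the `--dot` output can be rendered to an image with Graphviz (assuming the `dot` utility is installed; the output file name is arbitrary):

```dvc
$ dvc dag --dot | dot -Tpng -o pipeline.png
```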

## Paging the output

This command's output is automatically piped to
[Less](<https://en.wikipedia.org/wiki/Less_(Unix)>), if available in the
terminal. (The exact command used is `less --chop-long-lines --clear-screen`.)
If `less` is not available (e.g. on Windows), the output is simply printed out.

> It's also possible to
> [enable Less paging on Windows](/doc/user-guide/running-dvc-on-windows#enabling-paging-with-less).

### Providing a custom pager

It's possible to override the default pager via the `DVC_PAGER` environment
variable. For example, the following command will replace the default pager with
[`more`](<https://en.wikipedia.org/wiki/More_(command)>), for a single run:

```dvc
$ DVC_PAGER=more dvc dag
```

For a persistent change, define `DVC_PAGER` in the shell configuration. For
example in Bash, we could add the following line to `~/.bashrc`:

```bash
export DVC_PAGER=more
```

## Examples

Visualize DVC pipeline:

```dvc
$ dvc dag
+---------+
| prepare |
+---------+
*
*
*
+-----------+
| featurize |
+-----------+
** **
** *
* **
+-------+ *
| train | **
+-------+ *
** **
** **
* *
+----------+
| evaluate |
+----------+
```
6 changes: 2 additions & 4 deletions content/docs/command-reference/fetch.md
@@ -6,10 +6,8 @@ Get tracked files or directories from
## Synopsis

```usage
-usage: dvc fetch [-h] [-q | -v] [-j <number>]
-                 [-r <name>] [-a] [-T]
-                 [--all-commits] [-d] [-R]
-                 [--run-cache]
+usage: dvc fetch [-h] [-q | -v] [-j <number>] [-r <name>] [-a] [-T]
+                 [--all-commits] [-d] [-R] [--run-cache]
                 [targets [targets ...]]
positional arguments:
