Commit

Merge pull request #877 from iterative/jorgeorpinel
Regular doc updates (mid Dec)
shcheklein authored Dec 21, 2019
2 parents 650356d + 0b605f4 commit 5b3bf77
Showing 16 changed files with 115 additions and 105 deletions.
2 changes: 1 addition & 1 deletion .prettierrc
@@ -4,4 +4,4 @@ trailingComma: none
printWidth: 80
tabWidth: 2
useTabs: false
proseWrap: always
proseWrap: "always"
6 changes: 4 additions & 2 deletions package.json
@@ -9,8 +9,10 @@
"test": "jest",
"start": "NODE_ENV=production node server.js",
"format-staged": "pretty-quick --staged --no-restage --bail",
"format-check": "pretty-quick --check --verbose",
"lint-check": "eslint src pages"
"format-check": "prettier --check '{.,pages/**,src/**,static/**}/*.{js,jsx,md}'",
"lint-check": "eslint src pages",
"format-all": "prettier --write '{.,pages/**,src/**,static/**}/*.{js,jsx,md}'",
"format": "prettier --write"
},
"repository": {
"type": "git",
6 changes: 3 additions & 3 deletions pages/features.js
@@ -128,9 +128,9 @@ export default function FeaturesPage() {
<Description>
No matter which programming language or libraries are in use or
how code is structured, reproducibility and pipelines are based on
input and output files. Python, R, Julia, Scala Spark, custom
binary, Notebooks, flatfiles/TensorFlow, PyTorch, etc. are all
supported.
input and output files or directories. Python, R, Julia, Scala
Spark, custom binary, Notebooks, flatfiles/TensorFlow, PyTorch,
etc. are all supported.
</Description>
</Feature>
<Feature>
2 changes: 1 addition & 1 deletion src/Tooltip/index.js
@@ -2,7 +2,7 @@ import React, { Component } from 'react'
import PropTypes from 'prop-types'
import includes from 'lodash.includes'

import glossary from '../Documentation/glossary'
import glossary from '../../static/docs/glossary'
import { OnlyDesktop, OnlyMobile } from '../styles'
import DesktopView from './desktop-view'
import MobileView from './mobile-view'
2 changes: 1 addition & 1 deletion static/docs/command-reference/add.md
@@ -21,7 +21,7 @@ an added file or directory is also committed to the <abbr>cache</abbr>. (Use the
ready.)

The `targets` are files or directories to be placed under DVC control. These are
turned into outputs (`outs` field) in a resulting
turned into <abbr>outputs</abbr> (`outs` field) in a resulting
[DVC-file](/doc/user-guide/dvc-file-format). (See steps below for more details.)
Note that target data outside the current <abbr>workspace</abbr> is supported
and becomes [external outputs](/doc/user-guide/managing-external-data).
7 changes: 3 additions & 4 deletions static/docs/command-reference/checkout.md
@@ -17,10 +17,9 @@ positional arguments:
## Description

[DVC-files](/doc/user-guide/dvc-file-format) in a <abbr>project</abbr> specify
which instance of each data file or directory is to be used, using the checksum
saved in the `outs` fields. The `dvc checkout` command updates the workspace
data to match with the <abbr>cached</abbr> files corresponding to those
checksums.
which instance of each data file or directory should be used, with the checksums
saved in the `outs` field. The `dvc checkout` command updates the workspace data
to match the <abbr>cached</abbr> files that correspond to those checksums.
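
The idea of restoring workspace files from copies identified by a checksum can be sketched with standard tools. This is a simplified illustration only, not DVC's actual cache layout or implementation: the `cache/` directory and the file names are hypothetical, and `md5sum` (GNU coreutils) is assumed to be available.

```shell
workdir=$(mktemp -d)
mkdir "$workdir/cache"

# "Track" a data file: record its checksum and keep a copy under that name.
printf 'version 1\n' > "$workdir/data.txt"
checksum=$(md5sum "$workdir/data.txt" | cut -d ' ' -f 1)
cp "$workdir/data.txt" "$workdir/cache/$checksum"

# The workspace copy changes later on.
printf 'local edits\n' > "$workdir/data.txt"

# "Check out": the recorded checksum identifies which cached copy to restore.
cp "$workdir/cache/$checksum" "$workdir/data.txt"
restored=$(cat "$workdir/data.txt")
echo "restored content: $restored"

rm -rf "$workdir"
```

Because the checksum is stored separately from the data (in DVC's case, in the `outs` field of a DVC-file), switching to another recorded checksum is enough to select a different version of the file.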

Using an SCM like Git, the DVC-files are kept under version control. At a given
branch or tag of the SCM repository, the DVC-files will contain checksums for
10 changes: 5 additions & 5 deletions static/docs/command-reference/commit.md
@@ -1,8 +1,8 @@
# commit

Record changes to the repository by updating
[DVC-files](/doc/user-guide/dvc-file-format) and saving outputs to the
<abbr>cache</abbr>.
[DVC-files](/doc/user-guide/dvc-file-format) and saving <abbr>outputs</abbr> to
the <abbr>cache</abbr>.

## Synopsis

@@ -38,9 +38,9 @@ time tying stages or a pipeline.
- Sometimes we want to clean up a code or configuration file in a way that
doesn't cause a change in its results. We might write in-line documentation
with comments, change indentation, remove some debugging printouts, or any
other change that doesn't produce different output of pipeline stages.
`dvc commit` can help avoid having to reproduce a pipeline in these cases by
forcing the update of the DVC-files.
other change that doesn't produce different <abbr>outputs</abbr> of pipeline
stages. `dvc commit` can help avoid having to reproduce a pipeline in these
cases by forcing the update of the DVC-files.

Let's take a look at what is happening in the first scenario closely. Normally
DVC commands like `dvc add`, `dvc repro` or `dvc run` commit the data to the
11 changes: 6 additions & 5 deletions static/docs/command-reference/get.md
@@ -37,11 +37,12 @@ copy the target data from the external source project or its

The `path` argument of this command is used to specify the location, within the
source repository at `url`, of the target(s) to be downloaded. It can point to
any file or directory in the source project, including all files tracked by Git.
Note that data tracked by DVC should be specified in one of the
[DVC-files](/doc/user-guide/dvc-file-format) of the source repository. (In this
case, a default [DVC remote](/doc/command-reference/remote) needs to be
configured in the project, containing the actual data.)
any file or directory in the source project, including <abbr>outputs</abbr>
tracked by DVC as well as files tracked by Git. Note that for the former, data
should be specified in one of the [DVC-files](/doc/user-guide/dvc-file-format)
of the source repository. (In this case, a default
[DVC remote](/doc/command-reference/remote) needs to be configured in the
project, containing the actual data.)

> See `dvc get-url` to download data from other supported URLs.
16 changes: 8 additions & 8 deletions static/docs/command-reference/import.md
@@ -1,8 +1,7 @@
# import

Download a file or directory, possibly being an <abbr>output</abbr>, from any
<abbr>DVC repository</abbr> (e.g. hosted on GitHub) into the
<abbr>workspace</abbr>. This also creates a
Download a file or directory from any <abbr>DVC repository</abbr> (e.g. hosted
on GitHub) into the <abbr>workspace</abbr>. This also creates a
[DVC-file](/doc/user-guide/dvc-file-format) with information about the data
source, which can later be used to [update](/doc/command-reference/update) the
import.
@@ -38,11 +37,12 @@ copy the target data from the external source project or its

The `path` argument of this command is used to specify the location, within the
source repository at `url`, of the target(s) to be downloaded. It can point to
any file or directory in the source project, including all files tracked by Git.
Note that data tracked by DVC should be specified in one of the
[DVC-files](/doc/user-guide/dvc-file-format) of the source repository. (In this
case, a default [DVC remote](/doc/command-reference/remote) needs to be
configured in the project, containing the actual data.)
any file or directory in the source project, including <abbr>outputs</abbr>
tracked by DVC as well as files tracked by Git. Note that for the former, data
should be specified in one of the [DVC-files](/doc/user-guide/dvc-file-format)
of the source repository. (In this case, a default
[DVC remote](/doc/command-reference/remote) needs to be configured in the
project, containing the actual data.)

> See `dvc import-url` to download and track data from other supported URLs.
3 changes: 2 additions & 1 deletion static/docs/command-reference/repro.md
@@ -37,7 +37,8 @@ omitted, `Dvcfile` will be assumed.

By default, this command recursively searches in pipeline stages, starting from
the `targets`, to determine which ones have changed. Then it executes the
corresponding commands.
corresponding commands.<br /> Note that DVC removes cached <abbr>outputs</abbr>
before running the stages that produce them.

`dvc repro` does not run `dvc fetch`, `dvc pull` or `dvc checkout` to get data
files, intermediate or final results. It saves all the data files, intermediate
91 changes: 45 additions & 46 deletions static/docs/command-reference/run.md
@@ -26,13 +26,13 @@ options) DVC can later connect each stage by building a dependency graph
([DAG](https://en.wikipedia.org/wiki/Directed_acyclic_graph)). This graph is
used by DVC to restore a full data [pipeline](/doc/command-reference/pipeline).
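
To get a feel for how a dependency graph yields an execution order, here is a rough sketch using the POSIX `tsort` utility rather than DVC itself. The three stage names are hypothetical, and each input line means "the stage on the left produces something the stage on the right depends on".

```shell
# Each line is "producer consumer": the stage on the left writes an
# output that the stage on the right lists as a dependency.
# Stage names are hypothetical.
order=$(tsort <<'EOF'
prepare featurize
featurize train
prepare train
EOF
)

# Prints one stage per line, each after the stages it depends on.
echo "$order"
```

DVC derives the equivalent edges from the dependencies and outputs recorded in DVC-files, so the order never has to be written out by hand.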

The remainder of command line input provided to `dvc run` after the options (`-`
or `--` arguments) will become the required `command` argument. Please wrap the
The remaining terminal input provided to `dvc run` after the options (`-`/`--`
arguments) will become the required `command` argument. Please wrap the
`command` with `"` quotes if there are special characters in it like `|` (pipe)
or `<`, `>` (redirection) that would otherwise apply to the entire `dvc run`
command e.g. `dvc run -d script.sh "./script.sh > /dev/null 2>&1"`. Use single
quotes `'` instead of `"` to wrap the `command` if there are environment
variables in it, that you want to be evaluated dynamically. E.g.
or `<`, `>` (redirection) that would otherwise apply to `dvc run` itself, e.g.
`dvc run -d script.sh "./script.sh > /dev/null 2>&1"`. Use single quotes `'`
instead of `"` to wrap the `command` if there are environment variables in it
that you want to be evaluated dynamically, e.g.
`dvc run -d script.sh './myscript.sh $MYENVVAR'`
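
The difference between the two quoting styles comes from the shell, not from DVC. A minimal sketch, using `sh -c` as a stand-in for whatever later executes the stored command (the variable name `MYENVVAR` is taken from the example above; its values here are made up):

```shell
# With double quotes, the calling shell expands $MYENVVAR right away,
# so the stored command text already contains the value.
MYENVVAR="set-by-caller"
expanded_early="echo $MYENVVAR"

# With single quotes, the text stays literal...
expanded_late='echo $MYENVVAR'

# ...and the variable is only evaluated when the command actually runs.
early_result=$(MYENVVAR="set-at-run-time" sh -c "$expanded_early")
late_result=$(MYENVVAR="set-at-run-time" sh -c "$expanded_late")

echo "double quotes: $early_result"
echo "single quotes: $late_result"
```

The same reasoning applies to `|`, `<` and `>`: unquoted, they are interpreted by the shell that launches `dvc run`, and never become part of the `command` argument.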

Unless the `-f` option is used, by default the DVC-file name generated is
@@ -52,25 +52,25 @@ captures data and <abbr>caches</abbr> relevant <abbr>data artifacts</abbr> along
the way. See [this example](/doc/get-started/example-pipeline) to learn more and
try creating a pipeline.

### Well-behaved commands
### Avoiding unexpected behavior

DVC is simple to use, you only need to wrap your commands with `dvc run`, and
define your dependencies and outputs.
We don't want to tell you how to write your code! However, please be aware that
in order to prevent unexpected results when DVC executes or reproduces your
commands, they should ideally follow these rules:

However, to prevent unexpected behaviors, ideally, your commands should follow
some principles:
- Read/write exclusively from/to the specified <abbr>dependencies</abbr> and
<abbr>outputs</abbr>.
- Completely rewrite outputs (i.e. do not append or edit).<br/> Note that DVC
removes cached outputs before running the stages that produce them (including
at `dvc repro`).
- Stop reading and writing files when the `command` exits.

- Read exclusively from specified dependencies.
- Write exclusively to specified outputs.
- Completely rewrite the outputs (i.e. do not append or edit).
- Stop reading and writing when the command exits.

To guarantee reproducibilty, your command should be
[deterministic](https://en.wikipedia.org/wiki/Deterministic_algorithm) (i.e. it
must produce the same results given the same inputs/dependencies).

Have in mind what brings entropy to your command (e.g. random generators, time,
hardware, etc.) and try to minize it (e.g. fix seeds).
Keep in mind that if the pipeline's reproducibility goals include consistent
output data, its code should be as
[deterministic](https://en.wikipedia.org/wiki/Deterministic_algorithm) as
possible (produce the same output for a given input). In this case, avoid code
that brings [entropy](https://en.wikipedia.org/wiki/Software_entropy) into your
data pipeline (e.g. random numbers, time functions, hardware dependency, etc.).
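
The rule above about completely rewriting outputs can be illustrated without DVC. In this sketch the two functions stand in for hypothetical stage commands, and the file names are made up: a command that appends gives a different result each time it is run again, while one that rewrites its output does not.

```shell
workdir=$(mktemp -d)

# Hypothetical stage command that appends: every run grows the output.
append_stage() { echo "result" >> "$workdir/appended.txt"; }

# Hypothetical stage command that rewrites: every run yields the same file.
rewrite_stage() { echo "result" > "$workdir/rewritten.txt"; }

# Run each "stage" twice, as if it had been executed and then reproduced.
append_stage; append_stage
rewrite_stage; rewrite_stage

appended_lines=$(wc -l < "$workdir/appended.txt" | tr -d ' ')
rewritten_lines=$(wc -l < "$workdir/rewritten.txt" | tr -d ' ')
echo "appended: $appended_lines line(s), rewritten: $rewritten_lines line(s)"

rm -rf "$workdir"
```

Since DVC removes the outputs before running a stage, an appending command would also behave differently under DVC than when run by hand, which is another reason to prefer full rewrites.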

## Options

@@ -91,9 +91,9 @@ hardware, etc.) and try to minize it (e.g. fix seeds).
- `-o`, `--outs` - specify a file or directory that is the result of running the
`command`. Multiple outputs can be specified: `-o model.pkl -o output.log`.
DVC builds a dependency graph (pipeline) to connect different stages with each
other based on this list of outputs, along with dependencies (see `-d`). DVC
takes all output files and directories under its control and puts them into
the cache (this is similar to what's happening when you use `dvc add`).
other based on this list of outputs and dependencies (see `-d`). DVC takes all
output files and directories under its control and puts them into the cache
(this is similar to what's happening when you use `dvc add`).

- `-O`, `--outs-no-cache` - the same as `-o` except outputs are not put
automatically under DVC control. It means that they are not cached, and it's
@@ -120,54 +120,53 @@ hardware, etc.) and try to minize it (e.g. fix seeds).
this location, by including a path in the provided value (e.g.
`-f stages/stage.dvc`).

- `-c`, `--cwd` - deprecated, use `-f` and `-w` to change location and working
directory of a stage file.
- `-c`, `--cwd` (_deprecated_) - Use `-f` and `-w` to change the name and
location (working directory) of a stage file.

- `-w`, `--wdir` - specifies a working directory for the `command` to run in.
`dvc run` expects that dependencies, outputs, metric files are specified
relative to this directory. This value is saved in the `wdir` field of the
stage file generated (as a relative path to the location of the DVC-file) and
is used by `dvc repro` to change the working directory before execuring the
command.
is used by `dvc repro` to change the working directory before executing the
`command`.

- `--no-exec` - create a stage file, but do not execute the command defined in
- `--no-exec` - create a stage file, but do not execute the `command` defined in
it, nor take dependencies or outputs under DVC control. In the DVC-file
contents, the `md5` hash sums will be empty; they will be populated the next
time this stage is actually executed. This command is useful, if for example,
you need to build a pipeline (dependency graph) first, and then run it all at
once.
time this stage is actually executed. This is useful if, for example, you need
to build a pipeline (dependency graph) first, and then run it all at once.

- `-y`, `--yes` - deprecated, use `--overwrite-dvcfile` instead.
- `-y`, `--yes` (_deprecated_) - See `--overwrite-dvcfile` below.

- `--overwrite-dvcfile` - overwrite an existing DVC-file (with file name
determined by the logic described in the `-f` option) without asking for
confirmation.

- `--ignore-build-cache` - This option has an effect if an equivalent stage
file exists (same dependencies, outputs, and command to execute) which has
file exists (same dependencies, outputs, and `command` to execute) which has
been already executed and is up to date. In this case, `dvc run` won't
normally execute the command again. The exception is when the existing stage
normally execute the `command` again. The exception is when the existing stage
is considered always changed (see `--always-changed` option). This option
gives a way to forcefully execute the command anyway. It's useful if the
command is non-deterministic (meaning it produces different outputs from the
same list of inputs).
gives a way to forcefully execute the `command` anyway. Useful if the
command's code is non-deterministic (meaning it produces different outputs
from the same list of inputs).

- `--remove-outs` - it removes stage outputs before executing the command. If
`--no-exec` specified outputs are removed anyway. This option is enabled by
default and deprecated. See `dvc remove` as well for more details.
- `--remove-outs` (_deprecated_) - remove stage outputs before executing the
`command`. If `--no-exec` is specified, outputs are removed anyway. See
`dvc remove` as well for more details. This is the default behavior.

- `--no-commit` - do not save outputs to cache. A DVC-file is created, and an
entry is added to `.dvc/state`, while nothing is added to the cache. (The
`dvc status` command will report that the file is `not in cache`.) Useful when
entry is added to `.dvc/state`, while nothing is added to the cache.
(`dvc status` will report that the file is `not in cache`.) Useful when
running different experiments and you don't want to fill up your cache with
temporary files. Use `dvc commit` when ready to commit the results to cache.

- `--always-changed` - always consider this DVC-file as changed. As a result
`dvc status` will report it as `always changed` and `dvc repro` will always
execute it.

> Note that a DVC-file without dependencies is automatically considered always
> changed, so this option has no effect in that case.
> Note that a DVC-file without dependencies is considered always changed, so
> this option has no effect in that case.
- `-h`, `--help` - prints the usage/help message, and exit.

14 changes: 11 additions & 3 deletions src/Documentation/glossary.js → static/docs/glossary.js
@@ -51,9 +51,17 @@ For more details, please refer to this [document]
name: 'Output',
match: ['output', 'outputs'],
desc: `
A file or a directory that is under DVC control, recorded in the \`outs\`
section of a DVC-file. See \`dvc add\` \`dvc run\`, \`dvc import\`,
\`dvc import-url\` commands. A.k.a. **data artifact*.
A file or directory that is under DVC control, recorded in the \`outs\` section
of a DVC-file. See \`dvc add\`, \`dvc run\`, \`dvc import\`, \`dvc import-url\`
commands. A.k.a. **data artifact**.
`
},
{
name: 'Dependency',
match: ['dependency', 'dependencies'],
desc: `
A file or directory (possibly under DVC control) recorded in the \`deps\`
section of a DVC-file. See \`dvc run\`.
`
},
{
22 changes: 11 additions & 11 deletions static/docs/tutorials/deep/define-ml-pipeline.md
@@ -144,20 +144,20 @@ into a ML [pipeline](/doc/command-reference/pipeline).

`dvc run` executes any command that you pass it as a list of parameters.
However, the command to run alone is not as interesting as its role within a
larger data pipeline, so we'll need to specify its dependencies and output
files. We call all this a pipeline _stage_. Dependencies may include input files
or directories, and the actual command to run. Outputs are files written to by
the command, if any.
larger data pipeline, so we'll need to specify its <abbr>dependencies</abbr> and
<abbr>outputs</abbr>. We call all this a pipeline _stage_. Dependencies may
include input files or directories, and the actual command to run. Outputs are
files written to by the command, if any.

- Option `-d file.tsv` should be used to specify a dependency file or directory.
The dependency can be a regular file from a repository or a data file.

- `-o file.tsv` (lower case o) specifies output data file. DVC will track this
data file by creating a corresponding
- `-o file.tsv` (lower case o) specifies an output data file. DVC will track
this data file by creating a corresponding
[DVC-file](/doc/user-guide/dvc-file-format) (as if running `dvc add file.tsv`
after `dvc run` instead).

- `-O file.tsv` (upper case O) specifies a regular output file (not to be added
- `-O file.tsv` (upper case O) specifies a simple output file (not to be added
to DVC).

It's important to specify dependencies and outputs before the command to run
@@ -194,10 +194,10 @@ The `unzip` command extracts data file `data/Posts.xml.zip` to a regular file
`data/Posts.xml`. It knows nothing about data files or DVC. DVC executes the
command and does some additional work if the command was successful:

1. DVC transforms all the output files (`-o` option) into tracked data files
(similar to using `dvc add` for each of them). As a result, all the actual
data contents goes to the <abbr>cache</abbr> directory `.dvc/cache`, and each
of the file names will be added to `.gitignore`.
1. DVC transforms all the outputs (`-o` option) into tracked data files (similar
to using `dvc add` for each of them). As a result, all the actual data
contents go to the <abbr>cache</abbr> directory `.dvc/cache`, and each of
the file names will be added to `.gitignore`.

2. For reproducibility purposes, `dvc run` creates the `Posts.xml.dvc` stage
file in the <abbr>project</abbr> with information about this pipeline stage.
2 changes: 1 addition & 1 deletion static/docs/tutorials/versioning.md
@@ -337,7 +337,7 @@ the command (`python train.py`) that was run to produce the result. We call such
a DVC-file a "stage file".

> At this point you could run `git add .` and `git commit` to save the `Dvcfile`
> stage file and its changed output files to the repository.
> stage file and its changed outputs to the repository.
`dvc repro` will run `Dvcfile` if any of its dependencies (`-d`) changed. For
example, when we added new images to build the second version of our model, that