docs: regular update (mid June) [2] #441

Merged
merged 21 commits into from
Jun 21, 2019
b17630e
term: link first mention of "DVC-files" in `install` cmd ref
jorgeorpinel Jun 18, 2019
690938e
format: use capital letter after `:` (colon)
jorgeorpinel Jun 18, 2019
8cb17fe
cmdref: update to `version` "Binary" explanation and
jorgeorpinel Jun 18, 2019
2a6f0c0
term: remove "experimentation software" synonim in favor of
jorgeorpinel Jun 18, 2019
a44d4de
term: " Review Dvcfile usage and correctness."
jorgeorpinel Jun 18, 2019
7eb12cd
term: "filename" -> "file name" or just "file".
jorgeorpinel Jun 18, 2019
7e76b62
format: tutorial/define-ml-pipeline
jorgeorpinel Jun 18, 2019
075efcb
cmd ref: std `-d` option in related commands
jorgeorpinel Jun 18, 2019
531b335
term: eliminate "rerun" in favor of "repro" or "run again"
jorgeorpinel Jun 18, 2019
cd345b9
term: use "stage file" in `import` cmd ref and
jorgeorpinel Jun 19, 2019
7ef9d93
term: use "DVC-file" instead of "pipeline" in `metrics modify` and
jorgeorpinel Jun 19, 2019
e85e2c3
cmd ref: change `text` for `dvc` in code block in `metrics add`
jorgeorpinel Jun 19, 2019
bcc7764
cmd ref: reorg the ##Description of `repro` and
jorgeorpinel Jun 19, 2019
8623097
format: review a couple get-started chapters with better terminology use
jorgeorpinel Jun 19, 2019
c0b39f9
term: Avoid using "THE pipeline" in singular
jorgeorpinel Jun 19, 2019
8bd9b40
cmd ref: Update `-d` option again in all cmds with better terminology
jorgeorpinel Jun 20, 2019
df0e650
cmd ref: Update explanation and example for `--downstream` option in …
jorgeorpinel Jun 20, 2019
837e6d2
cmd ref: clarify `move` description, also
jorgeorpinel Jun 20, 2019
e8452d2
term: kill "named" word since it was used in confusing ways, and
jorgeorpinel Jun 20, 2019
05ec711
feedback: PR #441 "docs: regular update (mid June) [2]" round 1
jorgeorpinel Jun 20, 2019
bacc356
cmd ref: update remote `add` and `move` S3 compatible API examples
jorgeorpinel Jun 20, 2019
12 changes: 7 additions & 5 deletions static/docs/commands-reference/add.md
@@ -21,15 +21,15 @@ file is committed to the DVC cache. Using the `--no-commit` option, the file
will not be added to the cache and instead the `dvc commit` command is used when
(or if) the file is to be committed to the DVC cache.

Under the hood, a few actions are taken for each file in the target(s):
Under the hood, a few actions are taken for each file in `targets`:

1. Calculate the file checksum.
2. Move the file content to the DVC cache (default location is `.dvc/cache`).
3. Replace the file by a link to the file in the cache (see details below).
4. Create a corresponding [DVC-file](/doc/user-guide/dvc-file-format) and store
the checksum to identify the cache entry.
5. Add the _target_ filename to `.gitignore` (if Git is used in this workspace)
to prevent it from being committed to the Git repository.
5. Add the target(s) to `.gitignore` (if Git is used in this workspace) to
prevent it from being committed to the Git repository.
6. Instructions are printed showing `git` commands for adding the files to a Git
repository. If a different SCM system is being used, use the equivalent
command for that system or nothing is printed if `--no-scm` was specified for
@@ -79,8 +79,10 @@ This way you bring data provenance and make your project reproducible.

## Options

- `-R`, `--recursive` - recursively add each file under the named directory. For
each file a new DVC-file is created using the process described earlier.
- `-R`, `--recursive` - `targets` is expected to contain directory path(s).
Determines the files to add by searching each target directory and its
subdirectories for data files. For each file found, a new DVC-file is created
using the process described in this command's description.

- `--no-commit` - do not put files/directories into cache. A DVC-file is
created, and an entry is added to `.dvc/state`, while nothing is added to the
24 changes: 12 additions & 12 deletions static/docs/commands-reference/checkout.md
@@ -37,7 +37,8 @@ The execution of `dvc checkout` does:
- Scan the `outs` entries in DVC-files to compare with the currently checked out
data files. The scanned DVC-files are limited by the listed `targets` (if any)
on the command line. And if the `--with-deps` option is specified, it scans
backward in the [pipeline](/doc/get-started/pipeline) from the named targets.
backward from the given `targets` in the corresponding
[pipeline](/doc/get-started/pipeline).
- For any data files where the checksum doesn't match their DVC-file entry, the
data file is restored from the cache. The link strategy used (`reflink`,
`hardlink`, `symlink`, or `copy`) depends on the OS and the configured value
@@ -68,18 +69,17 @@ such a case, `dvc checkout` prints a warning message. Any files that can be
checked out without error will be restored.

There are two methods to restore a file missing from the cache, depending on the
situation. In some cases the pipeline must be rerun using the `dvc repro`
command. In other cases the cache can be pulled from a remote cache using the
`dvc pull` command. See also `dvc pipeline`
situation. In some cases a pipeline must be reproduced (using `dvc repro`) to
regenerate its outputs. (See also `dvc pipeline`.) In other cases the cache can
be pulled from a remote cache using `dvc pull`.

## Options

- `-d`, `--with-deps` - determine workspace files to update by tracking
dependencies to the named target DVC-file(s). This option only has effect when
one or more `targets` are specified. By traversing all stage dependencies, DVC
searches backward through the pipeline from the named target(s). This means
DVC will not checkout files referenced later in the pipeline than the named
target(s).
- `-d`, `--with-deps` - determine files to update by tracking dependencies to
the target DVC-file(s) (stages). This option only has effect when one or more
`targets` are specified. By traversing all stage dependencies, DVC searches
backward from the target stage(s) in the corresponding pipeline(s). This means
DVC will not checkout files referenced in later stage(s) than `targets`.

- `-f`, `--force` - do not prompt when removing workspace files. Changing the
current set of DVC-files with SCM commands like `git checkout` can result in
@@ -134,8 +134,8 @@ $ pip install -r requirements.txt

</details>

The existing pipeline looks almost like in this
[example](/doc/get-started/example-pipeline):
The workspace looks almost like in this
[pipeline setup](/doc/get-started/example-pipeline):

```dvc
.
55 changes: 28 additions & 27 deletions static/docs/commands-reference/commit.md
@@ -17,10 +17,10 @@ positional arguments:
## Description

The `dvc commit` command is useful for several scenarios where a dataset is
being changed: a [stage](/doc/commands-reference/run) or
being changed: when a [stage](/doc/commands-reference/run) or
[pipeline](/doc/get-started/pipeline) is in development, when one wishes to run
commands outside the control of DVC, or to force DVC-files updates to save time
rerunning the stage or pipeline.
commands outside the control of DVC, or to force DVC-file updates to avoid
spending time running stages or a pipeline again.

- Code or data for a stage is under active development, with rapid iteration of
code, configuration, or data. Run DVC commands (`dvc run`, `dvc repro`, and
@@ -36,8 +36,8 @@ rerunning the stage or pipeline.
doesn't cause a change in its results. We might write in-line documentation
with comments, change indentation, remove some debugging printouts, or any
other change which doesn't introduce a change in the output of pipeline
stages. `dvc commit` can help avoid rerunning the pipeline in these cases by
forcing the update of the DVC-files.
stages. `dvc commit` can help avoid having to reproduce a pipeline in these
cases by forcing the update of the DVC-files.

The last two use cases are **not recommended**, and essentially force update the
DVC-files and save data to cache. They are still useful, but keep in mind that
@@ -65,16 +65,16 @@ It handles that last step of adding the file to the DVC cache.
## Options

- `-d`, `--with-deps` - determine files to commit by tracking dependencies to
the named target DVC-file(s). This option only has effect when one or more
the target DVC-file(s) (stages). This option only has effect when one or more
`targets` are specified. By traversing all stage dependencies, DVC searches
backward through the pipeline from the named target(s). This means DVC will
not commit files referenced later in the pipeline than the named target(s).
backward from the target stage(s) in the corresponding pipeline(s). This means
DVC will not commit files referenced in later stage(s) than `targets`.

- `-R`, `--recursive` - the `targets` value is expected to be a directory path.
With this option, `dvc commit` determines the files to commit by searching the
named directory, and its subdirectories, for DVC-files for which to commit
data. Along with providing a `target`, or `target` along with `--with-deps`,
it is yet another way to limit the scope of DVC-files to upload.
- `-R`, `--recursive` - `targets` is expected to contain directory path(s).
Determines the files to commit by searching each target directory and its
subdirectories for DVC-files to inspect. Along with providing `targets`, or
`targets` and `--with-deps`, this is another way to limit the scope of
DVC-files to commit.

- `-f`, `--force` - commit data even if checksums for dependencies or outputs
did not change.
@@ -131,11 +131,11 @@ This data will be retrieved from a preconfigured remote cache.

## Example: Rapid iterations

Sometimes we want to iterate through multiple changes to configuration, or to
code, sometimes to data, trying multiple options, and improving the output of a
stage. To avoid filling the DVC cache with undesired intermediate results, we
can rerun the whole pipeline using `dvc repro --no-commit`, or a single stage
with `dvc run --no-commit`. This prevents data from being pushed to cache. When
Sometimes we want to iterate through multiple changes to configuration, code, or
data, trying multiple options to improve the output of a stage. To avoid filling
the DVC cache with undesired intermediate results, we can run a single stage
with `dvc run --no-commit`, or reproduce an entire pipeline using
`dvc repro --no-commit`. This prevents data from being pushed to cache. When
development of the stage is finished, `dvc commit` can be used to store data
files in the DVC cache.

@@ -195,11 +195,11 @@ outs:
wdir: .
```

To verify this instance of `model.pkl` is not in the cache, we must know how the
cache files are named. In the DVC cache the first two characters of the checksum
are used as a directory name, and the file name is the remaining characters.
Therefore, if the file had been committed to the cache it would appear in the
directory `.dvc/cache/70`. But:
To verify this instance of `model.pkl` is not in the cache, we must know the
names of the cache files. In the DVC cache the first two characters of the
checksum are used as a directory name, and the file name is the remaining
characters. Therefore, if the file had been committed to the cache it would
appear in the directory `.dvc/cache/70`. But:

```dvc
$ ls .dvc/cache/70
@@ -256,9 +256,10 @@ train.dvc:
Let's edit one of the source files. It doesn't matter which one. You'll see that
both Git and DVC recognize a change was made.

If we ran `dvc repro` at this point the pipeline would be rerun. But since the
change was inconsequential, that would be a waste of time and CPU resources.
That's especially critical if the pipeline takes a long time to execute.
If we ran `dvc repro` at this point, this pipeline would be reproduced. But
since the change was inconsequential, that would be a waste of time and CPU.
That's especially critical if the corresponding stages take lots of resources to
execute.

```dvc
$ git add src/train.py
@@ -277,4 +278,4 @@ Pipeline is up to date. Nothing to reproduce.
```

Nothing special is required, we simply `commit` to both the SCM and DVC. Since
the pipeline is up to date, `dvc repro` will not do anything.
this pipeline is up to date, `dvc repro` will not do anything.
32 changes: 17 additions & 15 deletions static/docs/commands-reference/fetch.md
@@ -46,7 +46,7 @@ files under DVC control could already exist in remote storage, but won't be in
your local cache. (Refer to `dvc remote` for more information on DVC remotes.)
These necessary data or model files are listed as dependencies or outputs in a
DVC-file (target [stage](/doc/commands-reference/run)) so they are required to
[reproduce](/doc/get-started/reproduce) the
[reproduce](/doc/get-started/reproduce) the corresponding
[pipeline](/doc/get-started/pipeline). (See
[DVC-File Format](/doc/user-guide/dvc-file-format) for more information on
dependencies and outputs.)
@@ -78,10 +78,10 @@ specified in DVC-files currently in the workspace are considered by `dvc fetch`
using the `dvc remote` command.

- `-d`, `--with-deps` - determine files to download by tracking dependencies to
the named target DVC-file(s). This option only has effect when one or more
the target DVC-file(s) (stages). This option only has effect when one or more
`targets` are specified. By traversing all stage dependencies, DVC searches
backward through the pipeline from the named target(s). This means DVC will
not fetch files referenced later in the pipeline than the named target(s).
backward from the target stage(s) in the corresponding pipeline(s). This means
DVC will not fetch files referenced in later stage(s) than `targets`.

- `-R`, `--recursive` - this option tells DVC that `targets` are directories
(not DVC-files), and to traverse them recursively. All DVC-files found will be
@@ -112,9 +112,10 @@ specified in DVC-files currently in the workspace are considered by `dvc fetch`

## Examples

To explore `dvc fetch` let's consider a simple pipeline with several stages and
a few Git tags. Then we can see what happens with `fetch` as we shift from tag
to tag with `git`.
To explore `dvc fetch` let's consider a simple
[pipeline](/doc/get-started/pipeline) with several stages and a few Git tags.
Then we can see what happens with `fetch` as we shift from tag to tag with
`git`.

<details>

@@ -145,8 +146,8 @@ $ pip install -r requirements.txt

</details>

The existing pipeline looks almost like in this
[example](/doc/get-started/example-pipeline):
The workspace looks almost like in this
[pipeline setup](/doc/get-started/example-pipeline):

```dvc
.
@@ -241,7 +242,7 @@ $ tree .dvc/cache
└── 603888ec04a6e75a560df8678317fb
```

> Note that `prepare.dvc` is the first stage in our example's implicit pipeline.
> Note that `prepare.dvc` is the first stage in our example's pipeline.
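
As a rough illustration of how a cache entry like the one listed above is located: the DVC cache uses the first two characters of a checksum as a directory name and the remaining characters as the file name. The sketch below assumes that layout and the default `.dvc/cache` location. The `cache_path` helper and the example checksum are hypothetical, made up for this illustration, and are not part of DVC.

```python
from pathlib import Path


def cache_path(checksum: str, cache_dir: str = ".dvc/cache") -> Path:
    """Map a checksum to a cache location.

    The first two characters name the directory and the remaining
    characters name the file (hypothetical helper, not a DVC API).
    """
    if len(checksum) < 3:
        raise ValueError("checksum is too short to split")
    return Path(cache_dir) / checksum[:2] / checksum[2:]


# Hypothetical checksum, invented for illustration only.
example = "70" + "0123456789abcdef0123456789abcd"
print(cache_path(example).as_posix())
```

With this mapping, checking whether a file is cached comes down to testing whether the resulting path exists.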

Cache entries for the necessary directories, as well as the actual
`data/prepared/test.tsv` and `data/prepared/train.tsv` files were downloaded,
@@ -251,7 +252,8 @@ checksums shown above.

After following the previous example (**Specific stages**), only the files
associated with the `prepare.dvc` stage file have been fetched. Several
dependencies/outputs for the full pipeline are still missing from local cache:
dependencies/outputs of other pipeline stages are still missing from local
cache:

```dvc
$ dvc status -c
@@ -296,13 +298,13 @@ $ tree .dvc/cache
```

Fetching using `--with-deps` starts with the target DVC-file (stage) and
searches backwards through the pipeline for data files to download into the
searches backwards through its pipeline for data files to download into the
local cache. All the data for the second and third stages ("featurize" and
"train") has now been downloaded to cache. We could now use `dvc checkout` to
get the data files needed to reproduce the pipeline up to the third stage into
get the data files needed to reproduce this pipeline up to the third stage into
the workspace (with `dvc repro train.dvc`).
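
The backward search that `--with-deps` performs can be pictured as a walk over a stage dependency graph. This is a minimal sketch, not DVC's implementation: the `PIPELINE` mapping is a hypothetical, simplified linear version of the sample project, the stage file name `featurize.dvc` is assumed, and `stages_with_deps` is a helper invented for this illustration.

```python
from typing import Dict, List, Set

# Hypothetical pipeline: each stage maps to the stages it depends on.
PIPELINE: Dict[str, List[str]] = {
    "prepare.dvc": [],
    "featurize.dvc": ["prepare.dvc"],
    "train.dvc": ["featurize.dvc"],
    "evaluate.dvc": ["train.dvc"],
}


def stages_with_deps(target: str, graph: Dict[str, List[str]]) -> Set[str]:
    """Collect the target stage and every stage it depends on, directly or not."""
    seen: Set[str] = set()
    stack = [target]
    while stack:
        stage = stack.pop()
        if stage in seen:
            continue  # already visited through another path
        seen.add(stage)
        stack.extend(graph[stage])  # walk backward to earlier stages
    return seen


print(sorted(stages_with_deps("train.dvc", PIPELINE)))
```

Starting from `train.dvc`, the walk reaches the earlier stages but never visits `evaluate.dvc`, which mirrors why files referenced only in later stages are not fetched.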

> Note that in this sample project, the last stage file `evaluate.dvc` doesn't
> add any more data files than those from previous stages so at this point all
> the pipeline's files are in local cache and `dvc status -c` would output
> "Pipeline is up to date. Nothing to reproduce."
> of the files for this pipeline are in local cache and `dvc status -c` would
> output "Pipeline is up to date. Nothing to reproduce."
45 changes: 22 additions & 23 deletions static/docs/commands-reference/import.md
@@ -61,8 +61,8 @@ DVC supports several types of (local or) remote locations:
> running it internally expands this URL into a regular S3, SSH, GS, etc URL by
> appending `/path/to/file` to the `myremote`'s configured base path.

Another way to understand the `dvc import` command is as a short-cut for more
verbose `dvc run` commands. This is discussed in the
Another way to understand the `dvc import` command is as a short-cut for a more
verbose `dvc run` command. This is discussed in the
[External Dependencies](/doc/user-guide/external-dependencies) documentation,
where an alternative is demonstrated for each of these schemes.

@@ -80,16 +80,16 @@ $ dvc run -d https://example.com/path/to/data.csv \
wget https://example.com/path/to/data.csv -O data.csv
```

Both methods generate a DVC-file with an external dependency, and they perform a
roughly equivalent result. The `dvc import` command saves the user from using
the command to copy files from each of the remote storage schemes, and from
Both methods generate a stage file (DVC-file) with an external dependency, and
they produce equivalent results. The `dvc import` command saves the user from
having to manually copy files from each of the remote storage schemes, and from
having to install CLI tools for each service.

When DVC inspects a DVC-file, one step is inspecting the dependencies to see if
any have changed. A changed dependency will appear in the `dvc status` report,
indicating the need to re-run the corresponding part of the pipeline. When DVC
inspects an external dependency, it uses a method appropriate to that dependency
to test its current status.
When DVC inspects a DVC-file, its dependencies will be checked to see if any
have changed. A changed dependency will appear in the `dvc status` report,
indicating the need to reproduce this import stage. When DVC inspects an
external dependency, it uses a method appropriate to that dependency to test its
current status.
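
The idea of checking dependencies for changes can be sketched by comparing a recorded checksum with a freshly computed one. This is only an illustration under stated assumptions: it handles local files, uses MD5 as a convenient stand-in, and the `md5_of` and `changed_deps` helpers are hypothetical. External dependencies are checked with a method appropriate to each location type, which this sketch does not attempt to model.

```python
import hashlib
import os
import tempfile
from typing import Dict, List


def md5_of(path: str) -> str:
    """Return the MD5 hex digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_deps(recorded: Dict[str, str]) -> List[str]:
    """List dependencies whose current checksum differs from the recorded one."""
    return [path for path, checksum in recorded.items() if md5_of(path) != checksum]


# Demo: a temporary file stands in for a dependency.
with tempfile.TemporaryDirectory() as workdir:
    dep = os.path.join(workdir, "data.csv")
    with open(dep, "w") as handle:
        handle.write("a,b\n1,2\n")
    recorded = {dep: md5_of(dep)}  # what a stage file would record
    unchanged = changed_deps(recorded)  # nothing has changed yet
    with open(dep, "a") as handle:
        handle.write("3,4\n")  # simulate an updated data source
    changed = changed_deps(recorded)  # the dependency is now reported
    print(len(unchanged), len(changed))
```

A non-empty result plays the role of a changed dependency showing up in a status report, signalling that the stage needs to be reproduced.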

## Options

@@ -146,7 +146,7 @@ $ pip install -r requirements.txt

## Example: Tracking a remote file

The [DVC getting started tutorial](/doc/get-started) demonstrates a simple DVC
The [DVC getting started tutorial](/doc/get-started) demonstrates a simple
pipeline. In the [Add Files step](/doc/get-started/add-files) we are told to
download a file, then use `dvc add` to integrate it with the workspace.

@@ -189,7 +189,7 @@ $ git add data/.gitignore data.xml.dvc
> [stages](/doc/commands-reference/run) from the _Getting Started_ example, but
> since we don't need them for this example, we'll skip it.

Let's take a look at the resulting DVC-file `data.xml.dvc`:
Let's take a look at the resulting stage file (DVC-file) `data.xml.dvc`:

```yaml
deps:
@@ -215,9 +215,8 @@ file has changed.
## Example: Detecting remote file changes

What if that remote file is one which will be updated regularly? The project
goal might include regenerating some artifact based on the updated data. With a
DVC external dependency, the pipeline can be triggered to re-execute based on a
changed external dependency.
goal might include regenerating some artifact based on the updated data. A
pipeline can be triggered to re-execute based on a changed external dependency.

Let us again use the [Getting Started](/doc/get-started) example, in a way which
will mimic an updated external data source.
@@ -242,8 +241,8 @@ On your machine initialize the workspace again:

### Click and expand to prepare the workspace

This is needed to actually run the command below in case you are reproducing
this example:
This is needed to actually run the command below in case you are trying this
example:

```dvc
$ git checkout 2-remote
@@ -269,9 +268,9 @@ To track the changes with git run:
```

At this point we have the workspace set up in a similar fashion. The difference
is that DVC-file references now references the editable data file in the data
store directory we just set up. We did this to make it easy to edit the data
file:
is that the stage file (DVC-file) outputs (`outs`) now reference the editable file
in the data store directory we just set up. We did this to make it easy to edit
the data file:

```yaml
deps:
@@ -316,8 +315,8 @@ $ dvc run -f prepare.dvc \
python src/prepare.py data/data.xml
```

Having setup this "prepare" stage means that later when we run `dvc repro` a
pipeline will be executed.
> Having set up this "prepare" stage means that later when we run `dvc repro`, a
> pipeline will be executed.

The workspace says it is fine:

@@ -393,7 +392,7 @@ Pipeline is up to date. Nothing to reproduce.
```

Because the external source for the data file changed, the change was noticed by
the `dvc status` command. Running `dvc repro` then ran both stages of the
the `dvc status` command. Running `dvc repro` then ran both stages of this
pipeline, and if we had set up the other stages they also would have been run.
It first downloaded the updated data file. Then, noticing that
`data/data.xml` had changed, it triggered the `prepare.dvc` stage to execute.