docs: regular updates (early Oct) #667

Merged: 1 commit, Oct 2, 2019
4 changes: 2 additions & 2 deletions src/Documentation/glossary.js
@@ -1,8 +1,8 @@
export default {
name: 'Glossary',
desc:
'This guide is aimed to familiarize the users with definitions to ' +
'relevant DVC concepts and terminologies which are frequently used.',
'This guide is aimed to provide definitions for relevant DVC concepts ' +
'and terminologies that are frequently used.',
contents: [
{
name: 'Workspace',
10 changes: 5 additions & 5 deletions static/docs/command-reference/add.md
@@ -24,7 +24,7 @@ The `targets` are files or directories to be places under DVC control. These are
turned into outputs (`outs` field) in a resulting
[DVC-file](/doc/user-guide/dvc-file-format). (See steps below for more details.)
Note that target data outside the current <abbr>workspace</abbr> is supported,
which becomes [external outputs](/doc/user-guide/managing-external-data).
that becomes [external outputs](/doc/user-guide/managing-external-data).

Under the hood, a few actions are taken for each file (or directory) in
`targets`:
@@ -126,8 +126,8 @@ To track the changes with git run:
git add .gitignore data.xml.dvc
```

As the output says, a DVC-file has been created for `data.xml`. Let us explore
the result:
As the output says, a [DVC-file](/doc/user-guide/dvc-file-format)) has been
created for `data.xml`. Let us explore the result:

```dvc
$ tree
@@ -229,8 +229,8 @@ Saving 'pics/train/cats/cat.438.jpg' to cache '.dvc/cache'.
In this case a DVC-file corresponding to each file is generated, and no
top-level DVC-file is generated. But this is less convenient.

With the `dvc add pics` a single DVC-file is generated, `pics.dvc`, which lets
us treat the entire directory structure in one unit. It lets you pass the whole
With `dvc add pics`, a single `pics.dvc` DVC-file is generated, that lets us
treat the entire directory structure in one unit. It lets you pass the whole
directory tree as a dependency to a `dvc run` stage definition, like this:

```dvc
16 changes: 8 additions & 8 deletions static/docs/command-reference/commit.md
@@ -38,9 +38,9 @@ time tying stages or a pipeline.
- Sometimes we want to clean up a code or configuration file in a way that
doesn't cause a change in its results. We might write in-line documentation
with comments, change indentation, remove some debugging printouts, or any
other change which doesn't introduce a change in the output of pipeline
stages. `dvc commit` can help avoid having to reproduce a pipeline in these
cases by forcing the update of the DVC-files.
other change that doesn't produce different output of pipeline stages.
`dvc commit` can help avoid having to reproduce a pipeline in these cases by
forcing the update of the DVC-files.

Let's take a look at what is happening in the fist scenario closely. Normally
DVC commands like `dvc add`, `dvc repro` or `dvc run` commit the data to the
@@ -145,8 +145,8 @@ bag_of_words = CountVectorizer(stop_words='english',
max_features=6000, ngram_range=(1, 2))
```

This option not only changes the trained model, it also introduces a change
which would cause the `featurize.dvc`, `train.dvc` and `evaluate.dvc` stages to
This option not only changes the trained model, it also introduces a change that
would cause the `featurize.dvc`, `train.dvc` and `evaluate.dvc` stages to
execute if we ran `dvc repro`. But if we want to try several values for this
option and save only the best result to the cache, we can execute as so:

@@ -230,15 +230,15 @@ $ python src/train.py data/features model.pkl
$ python src/evaluate.py model.pkl data/features auc.metric
```

As before, `dvc status` will show which the files have changed, and when your
work is finalized `dvc commit` will commit everything to the <abbr>cache</abbr>.
As before, `dvc status` will show which files have changed, and when your work
is finalized `dvc commit` will commit everything to the <abbr>cache</abbr>.

## Example: Updating dependencies

Sometimes we want to clean up a code or configuration file in a way that doesn't
cause a change in its results. We might write in-line documentation with
comments, change indentation, remove some debugging printouts, or any other
change which doesn't introduce a change in the output of pipeline stages.
change that doesn't produce different output of pipeline stages.

```dvc
$ git status -s
12 changes: 6 additions & 6 deletions static/docs/command-reference/config.md
@@ -96,8 +96,8 @@ for more details.)

- `cache.dir` - set/unset cache directory location. A correct value must be
either an absolute path or a path **relative to the config file location**.
The default value is `cache`, which resolved relative to the default project
config location results in `.dvc/cache`.
The default value is `cache` that, resolved relative to the default project
config location, results in `.dvc/cache`.

> See also helper command `dvc cache dir` to intuitively set this config
> option, properly transforming paths relative to the current working
@@ -171,10 +171,10 @@ for more details.)

State config options. See
[DVC Files and Directories](/doc/user-guide/dvc-files-and-directories) to learn
more about the state file that is used for optimization.
more about the state file (database) that is used for optimization.

- `state.row_limit` - maximum number of entries in the state database which
affects the physical size of the state file itself as well as the performance
- `state.row_limit` - maximum number of entries in the state database, which
affects the physical size of the state file itself, as well as the performance
of certain DVC operations. The bigger the limit the more checksum history DVC
can keep in order to avoid sequential checksum recalculations for the files.
Default limit is set to 10 000 000 rows.
@@ -226,7 +226,7 @@ Clear default remote value:
$ dvc config --unset core.remote
```

which is equivalent to:
The above command is equivalent to:

```dvc
$ dvc config core.remote -u
12 changes: 6 additions & 6 deletions static/docs/command-reference/diff.md
@@ -104,8 +104,8 @@ added file with size 37.9 MB

We can base this example in the [Metrics](/doc/get-started/metrics) and
[Compare Experiments](/doc/get-started/compare-experiments) chapters of our _Get
Started_ section, which describe different experiments to produce the
`model.pkl` file. Our example repository has the `bigrams-experiment` and
Started_ section, that describe different experiments to produce the `model.pkl`
file. Our example repository has the `bigrams-experiment` and
`baseline-experiment`
[tags](https://github.com/iterative/example-get-started/tags) respectively to
reference these experiments.
@@ -155,7 +155,7 @@ command dependencies or <abbr>outputs</abbr>).

We can use `dvc diff` to check for changes in a directory by specifying the
directory as the target (with option `-t`). Note that we skip the `b_ref`
argument this time, which defaults to `HEAD`.
argument this time, that defaults to `HEAD`.

```dvc
$ dvc diff -t data/features baseline-experiment
@@ -171,11 +171,11 @@ diff for 'data/features'

## Example: Confirming that a target has not changed

Let's use our example repo once again, which has several
Let's use our example repo once again, that has several
[available tags](https://github.com/iterative/example-get-started/tags) for
conveniency. The `5-preparation` tag corresponds to the
[Connect Code and Data](/doc/get-started/connect-code-and-data) section of our
_Get Started_ section, in which the `dvc run` command is used to create the
[Connect Code and Data](/doc/get-started/connect-code-and-data) chapter of our
_Get Started_ section, where the `dvc run` command is used to create a
`prepare.dvc` stage file. This DVC-file tracks the `data/prepared` directory
<abbr>output</abbr>.

2 changes: 1 addition & 1 deletion static/docs/command-reference/gc.md
@@ -15,7 +15,7 @@ This command deletes (garbage collects) data files or directories that may exist
in the cache (or [remote storage](/doc/command-reference/remote)) but no longer
referred to in [DVC-files](/doc/user-guide/dvc-file-format) currently
[checked out](/doc/command-reference/checkout) in the <abbr>project</abbr>. By
default this command only cleans up the local cache, which is typically located
default this command only cleans up the local cache, that is typically located
on the same machine as the project in question. This usually helps to free up
disk space.
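
For readers of this hunk, the behavior described above can be pictured as a set difference between what the cache holds and what the checked-out DVC-files reference. The following is only a minimal sketch of that idea, not DVC's implementation; the function name and the sample checksums are hypothetical.

```python
def unreferenced_cache_entries(cached, referenced):
    """Return cache checksums that no checked-out DVC-file refers to."""
    return sorted(set(cached) - set(referenced))


# Hypothetical checksums, for illustration only.
cached = ["a1b2", "c3d4", "e5f6"]  # entries present in the local cache
referenced = ["c3d4"]              # entries listed in the current DVC-files

# Candidates that a garbage collection pass would be free to delete.
to_delete = unreferenced_cache_entries(cached, referenced)
print(to_delete)
```

Anything still referenced by a checked-out DVC-file stays in the cache; only the remainder is a candidate for removal.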

17 changes: 9 additions & 8 deletions static/docs/command-reference/get.md
@@ -66,10 +66,10 @@ created in the current working directory, with its original file name.
> DVC is [installed](/doc/get-started/install).

We can use `dvc get` to download the resulting model file from our
[get started example repo](https://github.com/iterative/example-get-started),
which is a <abbr>DVC project</abbr> external to the current working directory.
The desired <abbr>output</abbr> file would be located in the root of the
external project (if the
[get started example repo](https://github.com/iterative/example-get-started), a
<abbr>DVC project</abbr> external to the current working directory. The desired
<abbr>output</abbr> file would be located in the root of the external project
(if the
[`train.dvc` stage](https://github.com/iterative/example-get-started/blob/master/train.dvc)
was reproduced) and named `model.pkl`.

@@ -83,7 +83,7 @@ Note that the `model.pkl` file doesn't actually exist in the
[root directory](https://github.com/iterative/example-get-started/tree/master/)
of the external Git repository. Instead, the corresponding DVC-file
[train.dvc](https://github.com/iterative/example-get-started/blob/master/train.dvc)
is found, which specifies `model.pkl` in its outputs (`outs`). DVC then
is found, that specifies `model.pkl` in its outputs (`outs`). DVC then
[pulls](/doc/command-reference/pull) the file from the default
[remote](/doc/command-reference/remote) of the external DVC project (found in
its
@@ -140,9 +140,10 @@ The `model.monograms.pkl` file now contains the older version of the model. To
get the most recent one, we use a similar command, but with

`-o model.bigrams.pkl` and `--rev 9-bigrams-model` or even without `--rev`
(since it's the latest version anyway). In fact in this case using `dvc pull`
should suffice, downloading the file as just `model.pkl`, which we can then
rename to make it extra obvious:
(since it's the latest version anyway). In fact, in this case using `dvc pull`
with the corresponding [DVC-files](/doc/user-guide/dvc-file-format) should
suffice, downloading the file as just `model.pkl`. We can then rename it to make
its version explicit:

```dvc
$ dvc pull train.dvc
14 changes: 7 additions & 7 deletions static/docs/command-reference/import-url.md
@@ -5,7 +5,7 @@ Download or copy a file or directory from any supported URL (for example
<abbr>workspace</abbr>, and track changes in the remote data source with DVC.
Creates a DVC-file.

> See also `dvc get-url` which corresponds to the first half of what this
> See also `dvc get-url`, that corresponds to the first half of what this
> command does (downloading the <abbr>data artifact</abbr>).

## Synopsis
@@ -37,8 +37,8 @@ for the imported data file or directory in the workspace.
> See `dvc import` to download and tack data or model files or directories from
> other DVC repositories (e.g. Github URLs).

DVC supports [DVC-files](/doc/user-guide/dvc-file-format) which refer to data in
an external location, see
DVC supports [DVC-files](/doc/user-guide/dvc-file-format) that refer to data in
external locations, see
[External Dependencies](/doc/user-guide/external-dependencies). In such a
DVC-file, the `deps` section stores the remote URL, and the `outs` section
contains the corresponding local path in the workspace. It records metadata from
@@ -196,10 +196,10 @@ trying this example (especially if trying out the following one).

## Example: Detecting remote file changes

What if that remote file is one which will be updated regularly? The project
goals might include regenerating a <abbr>data artifact</abbr> based on the
updated data source. [Pipeline](/doc/command-reference/pipeline) reproduction
can be triggered based on a changed external dependency.
What if that remote file is updated regularly? The project goals might include
regenerating a <abbr>data artifact</abbr> based on the updated data source.
[Pipeline](/doc/command-reference/pipeline) reproduction can be triggered based
on a changed external dependency.

Let's use the [Get Started](/doc/get-started) project again, simulating an
updated external data source. (Remember to prepare the <abbr>workspace</abbr>,
4 changes: 2 additions & 2 deletions static/docs/command-reference/import.md
@@ -5,7 +5,7 @@ repository (e.g. hosted on Github) into the <abbr>workspace</abbr>, and track
changes in this [external dependency](/doc/user-guide/external-dependencies).
Creates a DVC-file.

> See also `dvc get` which corresponds to the first step this command performs
> See also `dvc get`, that corresponds to the first step this command performs
> (just download the data).

## Synopsis
@@ -47,7 +47,7 @@ stage (DVC-file) is then created extending the full file or directory name of
the imported data e.g. `data.txt.dvc` – similar to having used `dvc run` to
generate the same output.

DVC supports [DVC-files](/doc/user-guide/dvc-file-format) which refer to data in
DVC supports [DVC-files](/doc/user-guide/dvc-file-format) that refer to data in
an external DVC repository (hosted on a Git server). In such a DVC-file, the
`deps` section specifies the `repo` URL and data `path`, and the `outs` section
contains the corresponding local path in the workspace. It records enough data
6 changes: 3 additions & 3 deletions static/docs/command-reference/index.md
@@ -6,9 +6,9 @@ DVC is a command-line tool. The typical DVC workflow goes as follows:
`dvc init`.
- Copy source code files for modeling into the repository and track the files
with DVC using the `dvc add` command.
- Process raw data with your own data processing and modeling code using the
`dvc run` command, using the `--outs` option to outputs which will also be
tracked by DVC after the code is executed.
- Process raw data with your own data processing and modeling code, using the
`dvc run` command, along with its `--outs` option for outputs that should also
be tracked by DVC after the code is executed.
- Sharing a Git repository with the source code of your ML
[pipeline](/doc/command-reference/pipeline) will not include the project's
<abbr>cache</abbr>. Use [remote storage](/doc/command-reference/remote) and
17 changes: 9 additions & 8 deletions static/docs/command-reference/install.md
@@ -29,8 +29,8 @@ The installed Git hook automates running `dvc checkout`.
**Commit**: When committing a change to the Git repository, that change possibly
requires reproducing the corresponding
[pipeline](/doc/command-reference/pipeline) (using `dvc repro`) to regenerate
the project results. Or there might be new data not yet in cache, which requires
running `dvc commit` to update.
the project results. Or there might be new data files not yet in cache, which
requires running `dvc commit` to store them.

The installed Git hook automates reminding the user to run either `dvc repro` or
`dvc commit`, as needed.
@@ -121,9 +121,10 @@ $ dvc pull --all-branches --all-tags

Let's start our exploration with the impact of `dvc install` on the
`dvc checkout` command. Remember that switching from one Git version to another
(with `git checkout`) changes the set of DVC-files in the project, which then
also changes the data files that should be placed in the workspace (with
`dvc checkout`).
(with `git checkout`) changes the set of
[DVC-files](/doc/user-guide/dvc-file-format) in the project. This changes the
set of data files that should be located in the workspace (which can be achieved
with `dvc checkout`).

Let's first list the available tags in the _Get Started_ project:

@@ -209,7 +210,7 @@ exec dvc checkout
```

The two Git hooks have been installed, and the one of interest for this exercise
is the `post-checkout` script which runs after `git checkout`.
is the `post-checkout` script that runs after `git checkout`.

We can now repeat the command run earlier, to see the difference.

@@ -252,8 +253,8 @@ featurize.dvc:
1 file changed, 1 insertion(+), 1 deletion(-)
```

We see that `dvc status` output has appeared in the `git commit` interaction.
This new behavior corresponds to the Git hook which was installed, and it
We see that the output of `dvc status` has appeared in the `git commit`
interaction. This new behavior corresponds to the Git hook installed, and it
helpfully informs us the workspace is out of sync. We should therefore run the
`dvc repro` command.

6 changes: 3 additions & 3 deletions static/docs/command-reference/metrics/add.md
@@ -72,9 +72,9 @@ $ dvc run -o metrics.txt "echo 0.9643 > metrics.txt"
```

Even when we named this output file `metrics.txt`, DVC won't know that it's a
metric if we don't specify so. The content of stage file `metrics.txt.dvc`
(which is a [DVC-file](/doc/user-guide/dvc-file-format)) should look like this:
(Notice the `metric: false` field.)
metric if we don't specify so. The content of stage file `metrics.txt.dvc` (a
[DVC-file](/doc/user-guide/dvc-file-format)) should look like this: (Notice the
`metric: false` field.)

```yaml
cmd: echo 0.9643 > metrics.txt
2 changes: 1 addition & 1 deletion static/docs/command-reference/pull.md
@@ -41,7 +41,7 @@ only the files (or directories) missing from the workspace by searching all
[DVC-files](/doc/user-guide/dvc-file-format) currently in the
<abbr>project</abbr>. It will not download files associated with earlier
versions or branches of the repository if using Git, nor will it download files
which have not changed.
that have not changed.

The command `dvc status -c` can list files referenced in current DVC-files, but
missing in the <abbr>cache</abbr>. It can be used to see what files `dvc pull`
4 changes: 2 additions & 2 deletions static/docs/command-reference/push.md
@@ -52,7 +52,7 @@ configure a remote.
With no arguments, just `dvc push` or `dvc push --remote REMOTE`, it uploads
only the files (or directories) that are new in the local repository to remote
storage. It will not upload files associated with earlier versions or branches
of the <abbr>project</abbr> directory, nor will it upload files which have not
of the <abbr>project</abbr> directory, nor will it upload files that have not
changed.

The `dvc status -c` command can list files tracked by DVC that are new in the
@@ -259,7 +259,7 @@ $ tree ../vault/recursive

The directory `.dvc/cache` is the local cache, while `../vault/recursive` is the
remote storage – a "local remote" in this case. This listing shows the cache
having more files in it than the remote does (which is what `new` means).
having more files in it than the remote which is what the `new` state means.

Next we can upload part of the data from the cache to the remote using the
command `dvc push --with-deps <stage>.dvc`. Remember that `--with-deps` searches
7 changes: 3 additions & 4 deletions static/docs/command-reference/remote/add.md
@@ -49,10 +49,9 @@ url = /tmp/dvc-storage
remote = myremote
```

DVC supports the concept of a _default remote_. For the commands which take a
DVC supports the concept of a _default remote_. For the commands that accept a
`--remote` option (`dvc pull`, `dvc push`, `dvc status`, `dvc gc`, `dvc fetch`),
this option can be left off the command line and the default remote will be used
instead.
the default remote is used if that option is not used.

Use `dvc config` to unset/change the default remote as so:
`dvc config -u core.remote`.
@@ -327,7 +326,7 @@ $ export OSS_ACCESS_KEY_ID='AccessKeyID'
$ export OSS_ACCESS_KEY_SECRET='AccessKeySecret'
```

> Use default key id and key secret when they are not given, which gives read
> Uses default key id and key secret when they are not given, which gives read
> access to public read bucket and public bucket.

</details>
6 changes: 3 additions & 3 deletions static/docs/command-reference/remote/default.md
@@ -38,9 +38,9 @@ This command assigns the default remote in the core section of the DVC
remote = myremote
```

For the commands which take a `--remote` option (`dvc pull`, `dvc push`,
`dvc status`, `dvc gc`, `dvc fetch`), default remote is used if that option is
not specified.
For the commands that accept a `--remote` option (`dvc pull`, `dvc push`,
`dvc status`, `dvc gc`, `dvc fetch`), the default remote is used if that option
is not used.

You can also use `dvc config`, `dvc remote add` and `dvc remote modify` commands
to set/unset/change the default remote configurations.
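
The fallback rule reworded in this hunk can be sketched in a few lines of Python. This is an illustrative sketch only, not code from DVC: `resolve_remote` is a hypothetical helper, the `config` dict merely mirrors the `core` section shown above, and raising an error when no default exists is an assumption made for the example.

```python
def resolve_remote(cli_remote, config):
    """Pick the remote a command should use.

    cli_remote: value passed with --remote, or None when the option is omitted.
    config: dict mirroring the DVC config file, e.g. {"core": {"remote": "myremote"}}.
    """
    if cli_remote is not None:
        # An explicit --remote always wins over the configured default.
        return cli_remote
    default = config.get("core", {}).get("remote")
    if default is None:
        # Assumption for this sketch: with no default configured, fail loudly.
        raise ValueError("no --remote given and no default remote configured")
    return default


config = {"core": {"remote": "myremote"}}
print(resolve_remote(None, config))      # option omitted: default remote is used
print(resolve_remote("backup", config))  # option given: it takes precedence
```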