Misc updates (1.0) #2053

Merged
merged 11 commits into from
Dec 24, 2020
14 changes: 10 additions & 4 deletions content/docs/command-reference/repro.md
@@ -42,10 +42,9 @@ and <abbr>caches</abbr> the pipeline's <abbr>outputs</abbr> along the way.
💡 For convenience, a Git hook is available to remind you to `dvc repro` when
needed after a `git commit`. See `dvc install` for more details.

[Stage](/doc/command-reference/run) outputs are deleted from the
<abbr>workspace</abbr> before executing the stage commands that produce them.
`dvc repro` does not run `dvc fetch`, `dvc pull` or `dvc checkout` to get data
files, intermediate or final results (except if the `--pull` option is used).
By default, this command checks all [pipeline](/doc/command-reference/dag)
stages to determine which ones have changed. Then it executes the corresponding
commands (`cmd` field of `dvc.yaml`).

There are a few ways to restrict what will be regenerated by this command: by
specifying specific reproduction [`targets`](#options), or by using certain
@@ -54,6 +53,13 @@ command [options](#options), such as `--single-item` or `--all-pipelines`.
> Note that stages without dependencies are considered _always changed_, so
> `dvc repro` always executes them.

[Stage](/doc/command-reference/run) outputs are deleted from the
<abbr>workspace</abbr> before executing the stage commands that produce them
(unless `persist: true` is used in `dvc.yaml`).
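
As a minimal sketch of the exception mentioned above, a `dvc.yaml` stage could mark an output as persistent like this. The stage name, command, and file paths are hypothetical and only serve as an illustration:

```yaml
# Hypothetical stage; names and paths are illustrative only.
stages:
  train:
    cmd: python train.py
    deps:
      - train.py
    outs:
      - model.pkl:
          persist: true  # model.pkl is not deleted before the stage command runs
```

Outputs listed without `persist: true` keep the default behavior and are removed from the workspace before the stage command is executed.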

`dvc repro` does not run `dvc fetch`, `dvc pull` or `dvc checkout` to get data
files, intermediate or final results (except if the `--pull` option is used).

It stores all the data files, intermediate or final results in the
<abbr>cache</abbr> (unless the `--no-commit` option is used), and updates the
hash values of changed dependencies and outputs in the `dvc.lock` and `.dvc`
14 changes: 8 additions & 6 deletions content/docs/start/index.md
@@ -48,13 +48,15 @@ Changes to be committed:
$ git commit -m "Initialize DVC"
```

DVC features can be grouped into functional components. We'll explore them one
by one in the next few sections:
Now you're ready to DVC!
Comment on lines 48 to +51
This paragraph helps close the previous section (install/init steps). Note that the previous paragraph is preserved too (lines 53-54).


- [**Data versioning**](/doc/start/data-versioning) is the base layer of DVC for
large files, datasets, and machine learning models. It looks like a regular
Git workflow, but without storing large files in the repo (think "Git for
data"). Data is stored separately, which allows for efficient sharing.
DVC's features can be grouped into functional components. We'll explore them one
by one in the next few pages:

- [**Data versioning**](/doc/start/data-versioning) (try this next) is the base
layer of DVC for large files, datasets, and machine learning models. Use a
regular Git workflow, but without storing large files in the repo (think "Git
for data"). Data is stored separately, which allows for efficient sharing.

- [**Data access**](/doc/start/data-access) shows how to use data artifacts from
outside of the project and how to import data artifacts from another DVC
2 changes: 1 addition & 1 deletion content/docs/use-cases/data-registries.md
@@ -30,7 +30,7 @@ Advantages of data registries:
management and optimizes space requirements.
- **Data as code**: leverage Git workflows such as commits, branching, pull
requests, reviews, and even CI/CD for your data and models lifecycle. Think
"Git for cloud storage", but without ad-hoc conventions.
"Git for cloud storage".
- **Security**: registries can be set up with read-only remote storage (e.g. an
HTTP server).

19 changes: 8 additions & 11 deletions content/docs/user-guide/dvc-files.md
@@ -79,7 +79,8 @@ and `dvc commit` commands, but not when a `.dvc` file is overwritten by
HTTP, S3, or Azure [external outputs](/doc/user-guide/managing-external-data);
and a special _checksum_ for HDFS and WebHDFS.
- `size`: Size of the file or directory (sum of all files).
- `nfiles`: If a directory, number of files inside.
- `nfiles`: If this output is a directory, the number of files inside
(recursive).
- `isexec`: Whether this is an executable file. DVC preserves execute
permissions upon `dvc checkout` and `dvc pull`. This has no effect on
directories, or in general on Windows.
@@ -88,8 +89,8 @@ and `dvc commit` commands, but not when a `.dvc` file is overwritten by
- `persist`: Whether the output file/dir should remain in place while
`dvc repro` runs. By default outputs are deleted when `dvc repro` starts (if
this value is not present).
- `desc`: User description for this output. This doesn't affect any DVC
operations.
- `desc` (optional): User description for this output. This doesn't affect any
DVC operations.
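
To make the output fields above more concrete, here is a rough sketch of what the `outs` section of a `.dvc` file tracking a directory might look like. Every value (hash, size, file count, path, description) is made up for illustration and does not come from a real project:

```yaml
# Hypothetical .dvc file contents; all values below are invented.
outs:
  - md5: 00000000000000000000000000000000.dir  # placeholder hash, not a real one
    size: 1048576   # sum of the sizes of all files in the directory
    nfiles: 3       # files inside the directory, counted recursively
    path: data
    desc: Example dataset directory  # optional, ignored by DVC operations
```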

### Dependency entries

@@ -103,7 +104,8 @@ and `dvc commit` commands, but not when a `.dvc` file is overwritten by
HTTP, S3, or Azure <abbr>external dependencies</abbr>; and a special
_checksum_ for HDFS and WebHDFS. See `dvc import-url` for more information.
- `size`: Size of the file or directory (sum of all files).
- `nfiles`: If a directory, number of files inside.
- `nfiles`: If this dependency is a directory, the number of files inside
(recursive).
- `repo`: This entry is only for external dependencies created with
`dvc import`, and can contain the following fields:

@@ -142,9 +144,7 @@ stages:
- performance.json
training:
desc: Training stage description
cmd:
- pip install -r requirements.txt
- python train.py
cmd: python train.py
deps:
- train.py
- features
@@ -163,10 +163,7 @@ stages:
by the user with the `--name` (`-n`) option of `dvc run`. Each stage can contain
the following fields:

- `cmd` (always present): One or more commands executed by the stage (may
contain either a single value, or a list). Commands are executed sequentially
until all are finished or until one of them fails (see
[`dvc repro`](/doc/command-reference/repro) for details).
- `cmd` (always present): Executable command defined in this stage
- `wdir`: Working directory for the stage command to run in (relative to the
file's location). If this field is not present explicitly, it defaults to `.`
(the file's location).
5 changes: 2 additions & 3 deletions content/docs/user-guide/how-to/merge-conflicts.md
@@ -36,9 +36,8 @@ stages:

## `dvc.lock`

There's no need to resolve lock file conflicts manually. You can safely delete
this file and then use `dvc repro` after merging `dvc.yaml` to regenerate this
file.
There's no need to resolve lock file conflicts manually. You can safely
overwrite this file by using `dvc repro` after merging `dvc.yaml`.

> `dvc commit` can also be a good option, but only for the specific case where
> the `HEAD` version is chosen.
2 changes: 1 addition & 1 deletion content/docs/user-guide/setup-google-drive-remote.md
@@ -215,7 +215,7 @@ individually.

If you use multiple GDrive remotes, by default they will be sharing the same
`.dvc/tmp/gdrive-user-credentials.json` file. It can be overridden with the
`gdrive_user_credentials_file` setting:
`gdrive_user_credentials_file` parameter:

```dvc
$ dvc remote modify myremote gdrive_user_credentials_file \