Misc. updates #1881

Merged
merged 13 commits into from
Oct 23, 2020
2 changes: 1 addition & 1 deletion content/docs/command-reference/fetch.md
@@ -35,7 +35,7 @@ Fetching is performed automatically by `dvc pull` (when the data is not already
in the <abbr>cache</abbr>), along with `dvc checkout`:

```
- Controlled files Commands
+ Tracked files Commands
---------------- ---------------------------------

remote storage
10 changes: 5 additions & 5 deletions content/docs/command-reference/init.md
@@ -116,11 +116,11 @@ In rare cases, the `--no-scm` option might be desirable: to initialize DVC in a
directory that is not part of a Git repo, or to make DVC ignore Git. Examples
include:

- - Version control other than Git is being used. Even though there are DVC
- features that require DVC to be run in the Git repo, DVC can work well with
- other version control systems. Since DVC relies on simple `dvc.yaml` files to
- manage <abbr>pipelines</abbr>, data, etc, they can be added into any version
- control system, thus providing large data files and directories versioning.
+ - SCM other than Git is being used. Even though there are DVC features that
+ require DVC to be run in the Git repo, DVC can work well with other version
+ control systems. Since DVC relies on simple `dvc.yaml` files to manage
+ <abbr>pipelines</abbr>, data, etc, they can be added into any version control
+ system, thus providing large data files and directories versioning.

- There is no need to keep the history at all, e.g. having a deployment
automation like running a data pipeline using `cron`.
8 changes: 4 additions & 4 deletions content/docs/command-reference/plots/index.md
@@ -34,10 +34,10 @@ experiments. Usual plot examples are AUC curves, loss functions, confusion
matrices, among others.

This type of metrics files are created by users, or generated by user data
- processing code, and get defined with the `-p` (`--plots`) and
- `--plots-no-cache`) options of `dvc run`. `dvc plots` subcommands can work with
- plots files committed to a Git repo history, data files controlled by DVC, or
- any other file in system.
+ processing code, and can be defined in `dvc.yaml` stages (using the `--plots`
+ and `--plots-no-cache` options if using `dvc run`). `dvc plots show` and
+ `dvc plots diff` can work with any valid plots files in the system, whether
+ tracked by Git or DVC, or not.
Contributor Author

@jorgeorpinel jorgeorpinel Oct 21, 2020

Finishes #1800 @pared @shcheklein, but note that since `dvc plots modify` does require that plots are defined in `dvc.yaml`, I couldn't make this as general as initially suggested (summarized in #1809 (comment)):

> if we explain in the plots concept itself that plots are not only dvc.yaml based people won't expect dvc.yaml to be present in the first place.

Contributor

I like the shift from "get defined" to "can be defined". This matches what we actually did.

> can work with any valid plots files in the system

This is something we still need to do (as part of iterative/dvc#4446); currently we cannot run `dvc plots show {target}` in the no-repo case, as we get an error that '.' is not a Git repo.

Contributor

@pared pared Oct 21, 2020

Created iterative/dvc#4761 to address this

Contributor Author

OK great, thanks for addressing that. In that case we can leave this text as-is, I think.

Member

I think we are still overcomplicating this. The most important points to make in this doc are that users can output certain types of metrics (arrays, etc.) into JSON, YAML, or whatnot, and that DVC provides a bunch of commands to deal with them: visualize, compare, etc. We should definitely mention that it is similar to metrics, but for "continuous", etc.

Details such as certain commands requiring `dvc.yaml` to be present can be moved into the docs for those commands. Or there should be a separate section that starts with some motivation for creating a `dvc.yaml`: what benefits people can get from using it on top of regular files.

Contributor Author

@jorgeorpinel jorgeorpinel Oct 22, 2020

Well, I was following the original idea to emphasize that plots don't have to be tracked by DVC... But you're right, this was too low level for an index. I just removed a bunch of redundant info and left the basics + some refs to where the details are. Also updated show and diff a bit again. See 27519b4.
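
For readers following the thread, here is a minimal sketch of the kind of plots file being discussed: a user script that dumps an array of values to JSON. The function name `write_plots_file` and the field names `step` and `loss` are hypothetical, chosen only for illustration; they are not taken from the DVC docs or from this PR.

```python
import json
import tempfile
from pathlib import Path


def write_plots_file(path, losses):
    """Write one JSON record per training step (hypothetical field names)."""
    records = [{"step": i, "loss": loss} for i, loss in enumerate(losses)]
    Path(path).write_text(json.dumps(records, indent=2))
    return records


# Demo: write a small file into the system temp directory.
demo_path = Path(tempfile.gettempdir()) / "loss.json"
records = write_plots_file(demo_path, [0.9, 0.5, 0.3])
```

A file like this is the sort of target one would hand to `dvc plots show`, subject to the no-repo limitation mentioned above.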


DVC generates plots as HTML files that can be open with a web browser. These
HTML files use [Vega-Lite](https://vega.github.io/vega-lite/). Vega is a
7 changes: 4 additions & 3 deletions content/docs/command-reference/pull.md
@@ -37,7 +37,7 @@ to `dvc config cache.type`).
It has the same effect as running `dvc fetch` and `dvc checkout`:

```
- Controlled files Commands
+ Tracked files Commands
---------------- ---------------------------------

remote storage
@@ -112,8 +112,9 @@ used to see what files `dvc pull` would download.
`dvc remote list`).

- `--run-cache` - downloads all available history of stage runs from the remote
- repository into the local run-cache. A `dvc repro <stage_name>` is necessary
- to checkout these files into the workspace and update the `dvc.lock` file.
+ repository (to the cache only, like `dvc fetch --run-cache`). Note that
+ `dvc repro <stage_name>` is necessary to checkout these files (into the
+ workspace) and update `dvc.lock`.

- `-j <number>`, `--jobs <number>` - parallelism level for DVC to download data
from remote storage. This only applies when the `--cloud` option is used, or a
12 changes: 8 additions & 4 deletions content/docs/command-reference/repro.md
@@ -39,7 +39,7 @@ and caches relevant <abbr>data artifacts</abbr> along the way.
needed after a `git commit`. See `dvc install` for more details.

`dvc repro` does not run `dvc fetch`, `dvc pull` or `dvc checkout` to get data
- files, intermediate or final results.
+ files, intermediate or final results (except if the `--pull` option is used).

By default, this command checks all pipeline stages to determine which ones have
changed. Then it executes the corresponding commands. <abbr>Outputs</abbr> are
@@ -135,7 +135,7 @@ up-to-date and only execute the final stage.
present in the DVC project.

- `--no-run-cache` - execute stage commands even if they have already been run
- with the same command/dependencies/outputs/etc before.
+ with the same dependencies/outputs/etc. before.

- `--force-downstream` - in cases like `... -> A (changed) -> B -> C` it will
reproduce `A` first and then `B`, even if `B` was previously executed with the
@@ -157,8 +157,12 @@ up-to-date and only execute the final stage.
corresponding pipelines, including the target stages themselves. This option
has no effect if `targets` are not provided.

- - `--pull` - try automatically [pulling](/doc/command-reference/pull) missing
- cache for outputs restored from run-cache.
+ - `--pull` - [pulls](/doc/command-reference/pull) dependencies and outputs
+ involved in the stages being reproduced, if they are found in the
+ [default](/doc/command-reference/remote/default) remote storage. Note that it
+ checks the local run-cache too (available history of stage runs).
+
+ > Has no effect if combined with `--no-run-cache`.

- `-h`, `--help` - prints the usage/help message, and exit.

23 changes: 11 additions & 12 deletions content/docs/command-reference/run.md
@@ -170,9 +170,8 @@ $ dvc run -n my_stage './my_script.sh $MYENVVAR'

- `-O <path>`, `--outs-no-cache <path>` - the same as `-o` except that outputs
are not tracked by DVC. It means that they are not cached, and it's up to a
- user to save and version control them. This is useful if the outputs are small
- enough to be tracked by Git directly, or if these files are not of future
- interest.
+ user to manage them separately. This is useful if the outputs are small enough
+ to be tracked by Git directly, or if these files are not of future interest.

- `--outs-persist <path>` - declare output file or directory that will not be
removed upon `dvc repro`.
@@ -197,9 +196,9 @@ $ dvc run -n my_stage './my_script.sh $MYENVVAR'

- `-M <path>`, `--metrics-no-cache <path>` - the same as `-m` except that DVC
does not track the metrics file. This means that the file is not cached, so
- it's up to the user to save and version control it. This is typically
- desirable with _metrics_ because they are small enough to be tracked with Git
- directly. See also the difference between `-o` and `-O`.
+ it's up to the user to manage them separately. This is typically desirable
+ with _metrics_ because they are small enough to be tracked with Git directly.
+ See also the difference between `-o` and `-O`.

- `--plots <path>` - specify a plot metrics file produces by this stage. This
option behaves like `-o` but registers the file in a `plots` field inside the
@@ -210,8 +209,8 @@ $ dvc run -n my_stage './my_script.sh $MYENVVAR'

- `--plots-no-cache <path>` - the same as `--plots` except that DVC does not
track the plots metrics file. This means that the file is not cached, so it's
- up to the user to save and version control it. See also the difference between
- `-o` and `-O`.
+ up to the user to manage them separately. See also the difference between `-o`
+ and `-O`.

- `-w <path>`, `--wdir <path>` - specifies a working directory for the `command`
to run in (uses the `wdir` field in `dvc.yaml`). Dependency and output files
@@ -231,10 +230,10 @@ $ dvc run -n my_stage './my_script.sh $MYENVVAR'
- `-f`, `--force` - overwrite an existing stage in `dvc.yaml` file without
asking for confirmation.

- - `--no-run-cache` - forcefully execute the `command` again, even if the same
- `dvc run` command has already been run in this workspace. Useful if the
- command's code is non-deterministic (meaning it produces different outputs
- from the same list of inputs).
+ - `--no-run-cache` - execute the stage `command` even if it has already been run
+ with the same dependencies/outputs/etc. before. Useful for example if the
+ command's code is non-deterministic
+ ([not recommended](#avoiding-unexpected-behavior)).

- `--no-commit` - do not save outputs to cache. A stage created and an entry is
added to `.dvc/state`, while nothing is added to the cache. In the stage file,
5 changes: 2 additions & 3 deletions content/docs/use-cases/index.md
@@ -18,9 +18,8 @@ knowledge, they are still difficult to implement, reuse, and manage.
If you store and process data files or datasets to produce other data or machine
learning models, and you want to

- - capture and save <abbr>data artifacts</abbr> the same way you capture code;
- - track, control, and switch between different versions of data or models
- easily;
+ - track and save <abbr>data artifacts</abbr> the same way you capture code;
+ - create and switch among different versions of data or models easily;
- understand how data or ML models were built in the first place;
- compare machine learning models and metrics to each other;
- bring software engineering best practices and tools to your data science team
4 changes: 2 additions & 2 deletions content/docs/user-guide/basic-concepts/dvc-cache.md
@@ -4,6 +4,6 @@ match: ['DVC cache', cache, caches, cached]
---

The DVC cache is a hidden storage (by default located in the `.dvc/cache`
- directory) for files that are under DVC control, and their different versions.
- For more details, please refer to this
+ directory) for files that are tracked by DVC, and their different versions. For
+ more details, please refer to this
[document](/doc/user-guide/dvc-files-and-directories#structure-of-the-cache-directory).
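
As a rough illustration of what "hidden storage" means here, the sketch below computes where a file's content would land in a content-addressed cache. It assumes the layout used by DVC releases around the time of this PR (MD5 hex digest, first two characters as a subdirectory); the linked document is the authority, and the helper name `cache_path` is hypothetical.

```python
import hashlib
from pathlib import Path


def cache_path(data: bytes, cache_dir: str = ".dvc/cache") -> Path:
    """Return the assumed cache location for some file content.

    Assumption: the cache is keyed by the MD5 hex digest of the content,
    split as <first 2 chars>/<remaining 30 chars>.
    """
    digest = hashlib.md5(data).hexdigest()
    return Path(cache_dir) / digest[:2] / digest[2:]


example = cache_path(b"hello")
```

Because the location depends only on content, two versions of a file with different bytes map to different cache entries, which is how the cache can hold "their different versions" side by side.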