dvc.lock updates #2139

Merged
merged 14 commits into from
Feb 3, 2021
28 changes: 15 additions & 13 deletions content/docs/command-reference/params/index.md
@@ -249,19 +249,21 @@ stages:
```

```yaml
-train:
-  cmd: python train.py
-  deps:
-    - path: users.csv
-      md5: 23be4307b23dcd740763d5fc67993f11
-  params:
-    INT: 5
-    BOOL: true
-    TrainConfig.EPOCHS: 70
-    TrainConfig.layers: 9
-  outs:
-    - path: model.pkl
-      md5: 1c06b4756f08203cc496e4061b1e7d67
+schema: '2.0'
+stages:
+  train:
+    cmd: python train.py
+    deps:
+      - path: users.csv
+        md5: 23be4307b23dcd740763d5fc67993f11
+    params:
+      INT: 5
+      BOOL: true
+      TrainConfig.EPOCHS: 70
+      TrainConfig.layers: 9
+    outs:
+      - path: model.pkl
+        md5: 1c06b4756f08203cc496e4061b1e7d67
```
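
Editorial note: the hunk above moves the stages of `dvc.lock` from the top level of the file to a `stages` mapping next to a `schema` key. The sketch below shows one way a reader's own tooling could accept both layouts. It is not DVC's implementation: `get_stages` is a hypothetical helper, and it assumes the file has already been parsed into a plain dictionary.

```python
from typing import Any, Dict


def get_stages(lock: Dict[str, Any]) -> Dict[str, Any]:
    """Return the stage mapping of a parsed dvc.lock, whichever layout it uses."""
    # Newer layout: a top-level `schema` key plus a `stages` mapping.
    if "schema" in lock and isinstance(lock.get("stages"), dict):
        return lock["stages"]
    # Older layout: stage names sit directly at the top level.
    return lock


# The two layouts shown in the diff, reduced to the `cmd` field.
old_lock = {"train": {"cmd": "python train.py"}}
new_lock = {"schema": "2.0", "stages": {"train": {"cmd": "python train.py"}}}
```

Requiring both keys before treating the file as the newer layout keeps an older file that happens to contain a stage named `stages` from being misread.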

Alternatively, the entire `TestConfig` params group
30 changes: 16 additions & 14 deletions content/docs/start/data-pipelines.md
@@ -278,20 +278,22 @@ same set of inputs (parameters + data) and reused it.
considered a _state_ of the pipeline:

```yaml
-prepare:
-  cmd: python src/prepare.py data/data.xml
-  deps:
-    - path: data/data.xml
-      md5: a304afb96060aad90176268345e10355
-    - path: src/prepare.py
-      md5: 285af85d794bb57e5d09ace7209f3519
-  params:
-    params.yaml:
-      prepare.seed: 20170428
-      prepare.split: 0.2
-  outs:
-    - path: data/prepared
-      md5: 20b786b6e6f80e2b3fcf17827ad18597.dir
+schema: '2.0'
+stages:
+  prepare:
+    cmd: python src/prepare.py data/data.xml
+    deps:
+      - path: data/data.xml
+        md5: a304afb96060aad90176268345e10355
+      - path: src/prepare.py
+        md5: 285af85d794bb57e5d09ace7209f3519
+    params:
+      params.yaml:
+        prepare.seed: 20170428
+        prepare.split: 0.2
+    outs:
+      - path: data/prepared
+        md5: 20b786b6e6f80e2b3fcf17827ad18597.dir
```
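
Editorial note: the recorded `md5` values are what make this file a _state_ that can be compared with the workspace. The sketch below illustrates the idea for file dependencies only. It is an assumption-laden illustration, not how DVC computes or compares hashes: `file_md5` and `changed_deps` are hypothetical helpers, the hash is a plain MD5 of the file bytes (DVC may normalize some files before hashing), and directory entries ending in `.dir` are skipped.

```python
import hashlib
from pathlib import Path
from typing import Any, Dict, List


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Plain MD5 of a file's bytes, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_deps(stage: Dict[str, Any], root: str = ".") -> List[str]:
    """List dependency paths whose current MD5 differs from the recorded one."""
    changed = []
    for dep in stage.get("deps", []):
        recorded = dep.get("md5")
        # Skip entries without an md5 field and directory hashes (`.dir`).
        if recorded is None or recorded.endswith(".dir"):
            continue
        target = Path(root) / dep["path"]
        if not target.is_file() or file_md5(target) != recorded:
            changed.append(dep["path"])
    return changed
```

Passing one stage mapping (for example the `prepare` entry above, parsed into a dictionary) returns the dependency paths that no longer match their recorded hash under these assumptions.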

> `dvc status` command can be used to compare this state with an actual state of
21 changes: 13 additions & 8 deletions content/docs/user-guide/project-structure/pipelines-files.md
@@ -344,7 +344,7 @@ stages:
| `cmd` | (Required) One or more commands executed by the stage (may contain either a single value or a list). Commands are executed sequentially until all are finished or until one of them fails (see `dvc repro`). |
| `wdir` | Working directory for the stage command to run in (relative to the file's location). Any paths in other fields are also based on this. It defaults to `.` (the file's location). |
| `deps` | List of <abbr>dependency</abbr> paths of this stage (relative to `wdir`). |
-| `outs` | List of <abbr>output</abbr> paths of this stage (relative to `wdir`). See [Output entries](#output-entries) for more details. |
+| `outs` | List of <abbr>output</abbr> paths of this stage (relative to `wdir`). See [Simple output entries](#simple-output-entries) for more details. |
| `params` | List of <abbr>parameter</abbr> dependency keys (field names) to track from `params.yaml` (in `wdir`). The list may also contain other parameters file names, with a sub-list of the param names to track in them. |
| `metrics` | List of [metrics files](/doc/command-reference/metrics), and optionally, whether or not this metrics file is <abbr>cached</abbr> (`true` by default). See the `--metrics-no-cache` (`-M`) option of `dvc run`. |
| `plots` | List of [plot metrics](/doc/command-reference/plots), and optionally, their default configuration (subfields matching the options of `dvc plots modify`), and whether or not this plots file is <abbr>cached</abbr> ( `true` by default). See the `--plots-no-cache` option of `dvc run`. |
@@ -364,7 +364,11 @@ validation and auto-completion.
> See also
> [How to Merge Conflicts](/doc/user-guide/how-to/merge-conflicts#dvcyaml).

-### Output entries
+### Simple output entries
Member

Hmm, I don't like the term "simple". There is nothing specifically simple or complex about those entries. It's totally fine to have them defined differently, with a slightly different schema, if we are very clear in the docs about this.

This comment was marked as resolved.

Member

To my mind, we are already in the section that describes the dvc.yaml schema, and ideally it should be clear from the context that this is about dvc.yaml. I can see how an additional message could help, but anything that comes to mind would create even more confusion (people are not aware of other files, different types of outputs, etc.).

This comment was marked as resolved.

Contributor Author

OK I simplified by calling the section in dvc.yaml "Output subfields" (which is what it is), PTAL.

If it's still confusing we could consider calling the sections in .dvc "Output/Dependency tracking entries".

Contributor Author

P.S. TBH the ideal solution would be to have tables inside tables, but that's not doable in Markdown; it would need some design and frontend work.

Member

sounds good!


+> _Simple_ compared to full
+> [output entries](/doc/user-guide/project-structure/dvc-files#output-entries)
+> used in `dvc.lock` and `.dvc` files

| Field | Description |
| --------- | ------------------------------------------------------------------------------------------------------------------------------------------ |
@@ -382,8 +386,8 @@ DVC will maintain a `dvc.lock` file for each `dvc.yaml`. Their purposes include:
- Allow DVC to detect when stage definitions, or their <abbr>dependencies</abbr>
have changed. Such conditions invalidate stages, requiring their reproduction
(see `dvc status`).
-- Tracking of intermediate and final <abbr>outputs</abbr> of a pipeline —
-  similar to `.dvc` files.
+- Tracking of intermediate and final outputs of a pipeline — similar to `.dvc`
+  files.
- Needed for several DVC commands to operate, such as `dvc checkout` or
`dvc get`.

@@ -415,12 +419,13 @@ stages:
Stages are listed again in `dvc.lock`, in order to know if their definitions
change in `dvc.yaml`.

-Regular <abbr>dependencies</abbr> and all kinds of <abbr>outputs</abbr>
+Full
+[dependency entries](/doc/user-guide/project-structure/dvc-files#dependency-entries)
+and all forms of
+[output entries](/doc/user-guide/project-structure/dvc-files#output-entries)
(including [metrics](/doc/command-reference/metrics) and
[plots](/doc/command-reference/plots) files) are also listed (per stage) in
-`dvc.lock`, but with an additional field storing a hash of their last known
-contents. Specifically: `md5`, `etag`, or `checksum` are used (same as in `deps`
-and `outs` entries of `.dvc` files).
+`dvc.lock`, including a content hash field (`md5`, `etag`, or `checksum`).

Full <abbr>parameter dependencies</abbr> (key and value) are listed too (under
`params`), grouped by parameters file.
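
Editorial note: the two statements above (each entry carries one of `md5`, `etag`, or `checksum`, and parameters are grouped by parameters file) can be sketched in a few lines. This is an illustration under stated assumptions, not DVC code: `content_hash` and `iter_params` are hypothetical helpers, the stage is assumed to be an already parsed dictionary, and `params` is assumed to be grouped by file as in the `params.yaml` example from the data-pipelines diff.

```python
from typing import Any, Dict, Iterator, Optional, Tuple

# Hash field names mentioned in the docs text above.
HASH_FIELDS = ("md5", "etag", "checksum")


def content_hash(entry: Dict[str, Any]) -> Optional[Tuple[str, str]]:
    """Return (field name, value) of the first hash field found, else None."""
    for name in HASH_FIELDS:
        if name in entry:
            return name, entry[name]
    return None


def iter_params(stage: Dict[str, Any]) -> Iterator[Tuple[str, str, Any]]:
    """Yield (params file, key, value) for every recorded parameter."""
    for params_file, values in stage.get("params", {}).items():
        for key, value in values.items():
            yield params_file, key, value


# The `prepare` stage from the data-pipelines example, abbreviated.
stage = {
    "deps": [{"path": "data/data.xml", "md5": "a304afb96060aad90176268345e10355"}],
    "params": {"params.yaml": {"prepare.seed": 20170428, "prepare.split": 0.2}},
}
```

Returning the field name together with the value lets a caller tell which kind of hash was recorded without guessing from its format.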