Misc. updates (2.0ish) #2062
Changes from 31 commits
1cf6519
8dd77a9
df113ec
4dd5322
fa30e85
27b3007
2e370de
ec5cb83
e7297b5
523af6b
442d325
3e5435a
95965ea
7a59869
a7f35ef
194e1e4
4e1cb47
8ede12f
0930fa4
bf0f04c
59fe782
082d52d
7165723
917b0ab
7dc0596
3cdd408
64bf6b7
dc38b81
919f6fd
0579261
5d38ea0
edf7565
a163a35
25834cc
64b31de
cbd6d62
53ed2d2
d8221e0
19c846a
ff5665d
3e7d40f
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -18,44 +18,64 @@ positional arguments: | |
|
||
In order to track parameters and hyperparameters associated to machine learning | ||
experiments in <abbr>DVC projects</abbr>, DVC provides a different type of | ||
dependencies: _parameters_. Parameters are defined using the the `-p` | ||
(`--params`) option of `dvc run`, using simple names like `epochs`, | ||
dependencies: _parameters_. They usually have simple names like `epochs`, | ||
`learning-rate`, `batch_size`, etc. | ||
|
||
In contrast to a regular <abbr>dependency</abbr>, a parameter is not a file (or | ||
directory). Instead, it consists of a _parameter name_ (or key) to find inside a | ||
YAML 1.2, JSON, TOML, or [Python](#examples-python-parameters-file) _parameters | ||
file_. Multiple parameter dependencies can be specified from one or more | ||
parameters files. | ||
To start tracking parameters, list them under the `params` field of `dvc.yaml` | ||
stages (manually or with the `-p`/`--params` option of `dvc run`). For | ||
example: | ||
|
||
The default parameters file name is `params.yaml`. Parameters should be | ||
organized as a tree hierarchy inside, as DVC will locate param names by their | ||
tree path. Parameters files have to be manually written, or generated, and these | ||
can be versioned directly with Git. | ||
```yaml | ||
stages: | ||
mystage: | ||
cmd: ./myscript.sh | ||
params: | ||
we use
Yes, that is way better 😅. Done! |
||
- foo | ||
jorgeorpinel marked this conversation as resolved.
|
||
- bar.baz | ||
- myparams.toml: | ||
- qux | ||
``` | ||
|
||
Supported parameter _value_ types are: string, integer, float, and arrays. DVC | ||
itself does not ascribe any specific meaning for these values. They are | ||
user-defined, and serve as a way to generalize and parametrize an machine | ||
learning algorithms or data processing code. | ||
> By default, parameters are read from `params.yaml`. Other params files can be | ||
> listed too, with sub-lists of the params found in them (as shown above). | ||
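A minimal sketch of how the `params` list in the example above could be read, assuming it has already been parsed from YAML into Python objects. This is not DVC's implementation, and the helper name `group_params_by_file` is hypothetical: plain entries are attributed to the default `params.yaml`, while mapping entries name another params file with its own sub-list.

```python
DEFAULT_PARAMS_FILE = "params.yaml"


def group_params_by_file(params_field):
    """Group a stage's `params` entries by the params file they belong to.

    Hypothetical helper for illustration only; not part of DVC.
    """
    grouped = {}
    for entry in params_field:
        if isinstance(entry, dict):
            # Mapping entry, e.g. {"myparams.toml": ["qux"]}
            for file_name, names in entry.items():
                grouped.setdefault(file_name, []).extend(names or [])
        else:
            # Plain entry, read from the default params file
            grouped.setdefault(DEFAULT_PARAMS_FILE, []).append(entry)
    return grouped


# The `params` field of `mystage`, as it would look after parsing the YAML
stage_params = ["foo", "bar.baz", {"myparams.toml": ["qux"]}]
print(group_params_by_file(stage_params))
# {'params.yaml': ['foo', 'bar.baz'], 'myparams.toml': ['qux']}
```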
jorgeorpinel marked this conversation as resolved.
|
||
|
||
DVC saves the param names and their latest values in the `dvc.yaml` file. These | ||
values will be compared to the ones in the params files to determine if the | ||
stage is invalidated upon pipeline [reproduction](/doc/command-reference/repro). | ||
In contrast to a regular <abbr>dependency</abbr>, a parameter is not a file or | ||
a parameter dependency ...
Updated. |
||
directory. Instead, it consists of a _parameter name_ (or key) in a | ||
[parameters file](#parameters-file), where the _parameter value_ should be | ||
found. This allows you to define [stage](/doc/command-reference/run) | ||
dependencies more granularly: changes to other parts of the params file will not | ||
affect the stage. Parameter dependencies also prevent situations where several | ||
stages share a regular dependency (e.g. a config file), and any change in it | ||
invalidates all these stages, causing unnecessary re-executions upon | ||
`dvc repro`. | ||
|
||
> Note that DVC does not pass the parameter values to stage commands. The | ||
> associated command executed by `dvc run` or `dvc repro` will have to open and | ||
> parse the parameters file by itself, and use the params specified with `-p`. | ||
The `dvc params diff` command is available to show parameter changes, displaying | ||
their current and previous [values](#parameter-values). | ||
|
||
### Parameters files | ||
This content should be simple enough to avoid additional structure to my mind. Keep it simpler, remove repetitions, move from high level explanation (an example) to the details or/and advanced cases (custom file name). Also keep in mind, that
Yes that would be ideal although I'm not sure that sections hurt in this instance (there's several relevant aspects of params so maybe sections are appropriate for a reference doc.) But 1. all of this content was already there (in fact this PR already makes the text shorter compared to https://dvc.org/doc/command-reference/params and 2. again, since we will move a lot of this info to basic concepts soonish, should we let it be for now?
Good point. It will also operate on parameters inserted to dvc.yaml from
I guess I don't see any value in those sections, they have a lot of repetition with the previous introductory section. They don't have any motivation behind them (not clear why do they exist, how do they connect to the other text).
Is it addressed already? UPDATE: See #2062 (review) below.
Maybe in past iterations? I'm not seeing the repetition right now (just 2 mentions of
They are the references for parameters files and for parameter values, which are not detailed in the Description, but they are linked as part of the explanation.
I see a few concerns (overall comes down to complicating and making things less clear):
Any repetition I should definitely address (⌛), it's kind of a separate issue (it could be there with or without sections).
Good points. OK, I'll remove the headers ⌛
Not sure what you mean by w/o ref. since they're linked to from the main Desc. but indeed there's no intro paragraph giving them some motivation, that's because these are reference docs so we don't have a story in each one. Sections can be pretty useful in these kinds of docs so I'm not sure why we're trying to avoid them in general (your specific points did convince me this time though). For example what about https://dvc.org/doc/command-reference/metrics#supported-file-formats? We have many of those.
OK I removed the H3s and addressed the other specific feedback. I also just realized that the Examples had some problems so I threw that in. PTAL |
||
|
||
The parameters concept helps to define [stage](/doc/command-reference/run) | ||
dependencies more granularly. A particular parameter or set of parameters will | ||
be required for the stage invalidation (see `dvc status` and `dvc repro`). | ||
Changes to other parts of the dependency file will not affect the stage. This | ||
prevents situations where several stages share a (configuration) file as a | ||
common dependency, and any change in this dependency invalidates all these | ||
stages and causes their reproduction unnecessarily. | ||
The default params file name is `params.yaml`, but any other YAML 1.2, JSON, | ||
jorgeorpinel marked this conversation as resolved.
|
||
TOML, or [Python](#examples-python-parameters-file) files can be used | ||
additionally. These files are typically written manually (or they can be | ||
generated) and they can be versioned directly with Git. | ||
|
||
`dvc params diff` is available to show changes in parameters, displaying the | ||
param names as well as their current and previous values. | ||
### Parameter values | ||
|
||
Param values should be organized in tree-like hierarchies (dictionaries) inside | ||
param files (see [Examples](#examples)). DVC will interpret param names as the | ||
tree path to find those values. | ||
|
||
Supported types are: string, integer, float, and arrays. Note that DVC does not | ||
ascribe any specific meaning to these values. | ||
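To illustrate the "tree path" idea, here is a rough sketch of resolving a dotted param name against a parsed params file. It assumes the file has already been loaded into nested dictionaries; the function name `get_param` and the sample values are hypothetical, and DVC's own lookup may differ in details.

```python
def get_param(params, name):
    """Follow a dotted param name (e.g. "bar.baz") through nested dicts.

    Illustrative only; raises KeyError if any part of the path is missing.
    """
    value = params
    for key in name.split("."):
        value = value[key]
    return value


# Hypothetical contents of a params file after parsing
params = {"foo": 1, "bar": {"baz": 0.01, "other": "unused"}}

print(get_param(params, "foo"))      # 1
print(get_param(params, "bar.baz"))  # 0.01
print(get_param(params, "bar"))      # the whole `bar` group as a dict
```

Note that a name pointing at a group (here `bar`) returns the entire sub-tree, which mirrors referencing a whole group of parameters instead of each one separately.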
|
||
DVC saves parameter names and values in the project's | ||
[DVC files](/doc/user-guide/dvc-files) in order to track them over time. They | ||
in this case we can be specific?
Agree. Specified. |
||
will be compared to the latest params files to determine if the stage is | ||
outdated upon `dvc repro` (or `dvc status`). | ||
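The comparison described above can be pictured roughly as follows. This is a simplified sketch under stated assumptions, not DVC's actual logic: `saved` stands in for the param names and values recorded by DVC, `current` for the freshly parsed params file, and the names `lookup` and `changed_params` are hypothetical.

```python
_MISSING = object()  # sentinel for params that no longer exist in the file


def lookup(params, name):
    """Resolve a dotted param name in nested dicts; _MISSING if absent."""
    value = params
    for key in name.split("."):
        if not isinstance(value, dict) or key not in value:
            return _MISSING
        value = value[key]
    return value


def changed_params(saved, current):
    """Return tracked param names whose current value differs from the saved one."""
    return [name for name, old in saved.items() if lookup(current, name) != old]


# Hypothetical saved values and current params file contents
saved = {"train.epochs": 10, "train.layers": 3}
current = {"train": {"epochs": 20, "layers": 3}, "seed": 42}

print(changed_params(saved, current))  # ['train.epochs']
```

Because only tracked names are compared, the untracked `seed` value has no effect on the result, which is the granularity that parameter dependencies are meant to provide.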
|
||
> Note that DVC does not pass the parameter values to stage commands. The | ||
> commands executed by DVC will have to load and parse the parameters file by | ||
> itself. | ||
|
||
## Options | ||
|
||
|
@@ -95,8 +115,8 @@ $ dvc run -n train -d users.csv -o model.pkl \ | |
> Note that we could use the same parameter addressing with JSON, TOML, or | ||
> Python parameters files. | ||
|
||
The `train.py` script will have some code to parse the needed parameters. For | ||
example: | ||
The `train.py` script will have some code to parse and load the needed | ||
parameters. For example: | ||
|
||
```py | ||
import yaml | ||
|
@@ -109,11 +129,12 @@ epochs = params['train']['epochs'] | |
layers = params['train']['layers'] | ||
``` | ||
|
||
You can find that each parameter and it's value were saved to `dvc.yaml`. These | ||
values will be compared to the ones in the parameters files whenever `dvc repro` | ||
is used, to determine if dependency to the params file is invalidated: | ||
You can find that each parameter and its value were saved to `dvc.yaml` and | ||
`dvc.lock`. These are compared to the values in the params files when | ||
`dvc repro` is used to determine if the parameter dependency has changed. | ||
|
||
```yaml | ||
# dvc.yaml | ||
stages: | ||
train: | ||
cmd: python train.py | ||
|
@@ -127,7 +148,7 @@ stages: | |
``` | ||
|
||
Alternatively, the entire group of parameters `train` can be referenced, instead | ||
of specifying each of the group parameters separately: | ||
of specifying each of the params separately: | ||
|
||
```dvc | ||
$ dvc run -n train -d users.csv -o model.pkl \ | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -143,9 +143,12 @@ stages: | |
metrics: | ||
- performance.json | ||
training: | ||
desc: Training stage description | ||
cmd: python train.py | ||
desc: Train model with Python | ||
cmd: | ||
- pip install -r requirements.txt | ||
- python train.py --out ${model_file} | ||
deps: | ||
- requirements.txt | ||
jorgeorpinel marked this conversation as resolved.
|
||
- train.py | ||
- features | ||
outs: | ||
|
@@ -163,15 +166,19 @@ stages: | |
by the user with the `--name` (`-n`) option of `dvc run`. Each stage can contain | ||
the following fields: | ||
|
||
- `cmd` (always present): Executable command defined in this stage | ||
- `cmd` (always present): One or more commands executed by the stage (may | ||
contain either a single value, or a list). Commands are executed sequentially | ||
until all are finished or until one of them fails (see `dvc repro` for | ||
details). | ||
- `wdir`: Working directory for the stage command to run in (relative to the | ||
file's location). If this field is not present explicitly, it defaults to `.` | ||
(the file's location). | ||
- `deps`: List of <abbr>dependency</abbr> file or directory paths of this stage | ||
(relative to `wdir` which defaults to the file's location). See | ||
[Dependency entries](#dependency-entries) above for more details. | ||
- `params`: List of <abbr>parameter</abbr> dependency keys (field names) that | ||
are read from a YAML, JSON, TOML, or Python file (`params.yaml` by default) | ||
- `params`: List of <abbr>parameter</abbr> dependency keys (field names) to | ||
track in `params.yaml`. The list may also contain other YAML, JSON, TOML, or | ||
Python file names, with a sub-list of the param names to track in them. | ||
Comment on lines -173 to +181
Note that I also expanded on this dvc.yaml field since we're encouraging users to avoid dvc run now (so we should better explain all the possible ways to write this manually or generate it). |
||
- `outs`: List of <abbr>output</abbr> file or directory paths of this stage | ||
(relative to `wdir` which defaults to the file's location). See | ||
[Output entries](#output-entries) above for more details. | ||
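The sequential behavior described for a list-valued `cmd` can be sketched as below. This is an illustration only, not how DVC runs stages: `run_commands` is a hypothetical helper that runs each command through the shell and stops at the first non-zero exit code.

```python
import subprocess
import sys


def run_commands(commands):
    """Run shell commands in order, stopping at the first failure.

    Returns the list of commands that were actually started.
    Hypothetical helper for illustration; not part of DVC.
    """
    started = []
    for command in commands:
        started.append(command)
        result = subprocess.run(command, shell=True)
        if result.returncode != 0:
            break  # later commands are skipped
    return started


# Stand-in commands that use the current Python interpreter
py = f'"{sys.executable}"'
commands = [
    f'{py} -c "import sys; sys.exit(0)"',  # succeeds
    f'{py} -c "import sys; sys.exit(3)"',  # fails
    f'{py} -c "import sys; sys.exit(0)"',  # never started
]
print(len(run_commands(commands)))  # 2
```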
|
@@ -207,8 +214,8 @@ For every `dvc.yaml` file, a matching `dvc.lock` (YAML) file usually exists. | |
It's created or updated by DVC commands such as `dvc run` and `dvc repro`. | ||
`dvc.lock` describes the latest pipeline state. It has several purposes: | ||
|
||
- Tracking of intermediate and final results of a pipeline — similar to | ||
[`.dvc` files](#dvc-files). | ||
- Tracking of intermediate and final <abbr>outputs</abbr> of a pipeline — | ||
similar to [`.dvc` files](#dvc-files). | ||
- Allow DVC to detect when stage definitions, or their dependencies have | ||
changed. Such conditions invalidate stages, requiring their reproduction (see | ||
`dvc status`, `dvc repro`). | ||
|
I think `not used by DVC` is not very clear, can we rephrase it somehow?
OK changed to "regardless of whether `dvc.yaml` is currently tracking any metrics in them". WDYT?
it's not even about `dvc.yaml` at this point, even DVC repo might not be initialized. I see that the whole change for that ticket might be out of scope since it's a much bigger change
You're right but yes, since iterative/dvc/issues/4446 is still open and the core team hasn't really sent many docs updates per their work on that... We could leave for later.
Then again since I just added that note about not needing a DVC project... This text in --targets is kind of contradicting again...
I'll simplify it a little for now (less specific).
UPDATE: Actually, I'm not sure how to generalize since we didn't want to say "not used by DVC". Sry
p.s. the current text is not incorrect so... I think it should do for now 🙂