guide: add allow-missing scenarios #4585
Changes from 3 commits
b923fa7
91c7610
ba5ba08
c7a74d7
c2d806b
1216a91
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -86,8 +86,7 @@ DVC will skip that stage: | |
Stage 'prepare' didn't change, skipping | ||
``` | ||
|
||
DVC will also recover the outputs from previous runs using the | ||
[run cache](/doc/user-guide/pipelines/run-cache): | ||
DVC will also recover the outputs from previous runs using the [run cache]. | ||
|
||
``` | ||
Stage 'prepare' is cached - skipping run, checking out outputs | ||
|
@@ -108,6 +107,165 @@ stages: | |
always_changed: true | ||
``` | ||
|
||
## Pull Missing Data | ||
|
||
`--pull` will download missing dependencies (and will download the cached | ||
outputs of previous runs saved in the [run cache]), so you don't need to pull | ||
all data for your project before running the pipeline. `--allow-missing` will | ||
skip stages with no other changes than missing data. You can combine the | ||
`--pull` and `--allow-missing` flags to run a pipeline while only pulling the | ||
data that is actually needed to run the changed stages. | ||
|
||
Given the pipeline used in | ||
[example-get-started-experiments](https://github.com/iterative/example-get-started-experiments): | ||
|
||
```cli | ||
$ dvc dag | ||
+--------------------+ | ||
| data/pool_data.dvc | | ||
+--------------------+ | ||
* | ||
* | ||
* | ||
+------------+ | ||
| data_split | | ||
+------------+ | ||
** ** | ||
** ** | ||
* ** | ||
+-------+ * | ||
| train | ** | ||
+-------+ ** | ||
** ** | ||
** ** | ||
* * | ||
+----------+ | ||
| evaluate | | ||
+----------+ | ||
``` | ||
|
||
If we are on a machine where all the data is missing: | ||
|
||
```cli | ||
$ dvc status | ||
Not in cache: | ||
(use "dvc fetch <file>..." to download files) | ||
models/model.pkl | ||
data/pool_data/ | ||
data/test_data/ | ||
data/train_data/ | ||
``` | ||
|
||
We can modify the `evaluate` stage (for example, by changing the code to add a | ||
new evaluation method), and DVC will pull only the data needed to run that | ||
stage (`models/model.pkl` and `data/test_data/`) while skipping the rest of the | ||
stages: | ||

Review comment: Minor.
Reply: I didn't think about it. I think I just copied from your example in https://dvc.org/doc/command-reference/repro#example-only-pull-pipeline-data-as-needed, which used
|
||
```cli | ||
$ dvc exp run --pull --allow-missing | ||
'data/pool_data.dvc' didn't change, skipping | ||
Stage 'data_split' didn't change, skipping | ||
Stage 'train' didn't change, skipping | ||
Running stage 'evaluate': | ||
... | ||
``` | ||
|
||
## Verify Pipeline Status | ||
|
||
In scenarios like CI jobs, you may want to check that the pipeline is up to date | ||
without pulling or running anything. You can check that nothing has changed: | ||
|
||
<details> | ||
|
||
### Clean example | ||
|
||
In the example below, data is missing because nothing has been pulled, but | ||
otherwise the pipeline is up to date. | ||
|
||
```cli | ||
$ dvc status | ||
data_split: | ||
changed deps: | ||
deleted: data/pool_data | ||
changed outs: | ||
not in cache: data/test_data | ||
not in cache: data/train_data | ||
train: | ||
changed deps: | ||
deleted: data/train_data | ||
changed outs: | ||
not in cache: models/model.pkl | ||
evaluate: | ||
changed deps: | ||
deleted: data/test_data | ||
deleted: models/model.pkl | ||
data/pool_data.dvc: | ||
changed outs: | ||
not in cache: data/pool_data | ||
``` | ||
|
||
</details> | ||
|
||
```cli | ||
$ dvc exp run --allow-missing --dry | ||
Reproducing experiment 'agley-nuke' | ||
'data/pool_data.dvc' didn't change, skipping | ||
Stage 'data_split' didn't change, skipping | ||
Stage 'train' didn't change, skipping | ||
Stage 'evaluate' didn't change, skipping | ||
``` | ||

Review comment: tbh, if I were new to DVC it would not have been clear to me why it's not the default behavior ... if I run
Reply: It doesn't pull anything but it will fail because of the missing data even during
|
||
If anything is not up to date, the pipeline will fail: | ||
|
||
<details> | ||
|
||
### Dirty example | ||
|
||
In the example below, the `data_split` parameter in `params.yaml` was modified, | ||
so the pipeline is not up to date. | ||
|
||
```cli | ||
$ dvc status | ||
data_split: | ||
changed deps: | ||
deleted: data/pool_data | ||
params.yaml: | ||
modified: data_split | ||
changed outs: | ||
not in cache: data/test_data | ||
not in cache: data/train_data | ||
train: | ||
changed deps: | ||
deleted: data/train_data | ||
changed outs: | ||
not in cache: models/model.pkl | ||
evaluate: | ||
changed deps: | ||
deleted: data/test_data | ||
deleted: models/model.pkl | ||
data/pool_data.dvc: | ||
changed outs: | ||
not in cache: data/pool_data | ||
``` | ||
|
||
</details> | ||
|
||
```cli | ||
$ dvc exp run --allow-missing --dry | ||
Reproducing experiment 'dozen-jogs' | ||
'data/pool_data.dvc' didn't change, skipping | ||
ERROR: failed to reproduce 'data_split': [Errno 2] No such file or directory: '/private/tmp/example-get-started-experiments/data/pool_data' | ||
``` | ||
|
||
You can also check that all data exists on the remote. The command below will | ||
succeed (exit code `0`) if all data is found on the remote. | ||
Otherwise, it will fail (exit code `1`). | ||

Review comment: Why do we need this check?
Reply: I had to update the command, but it's needed because
|
||
```cli | ||
$ dvc data status --not-in-remote --json | grep -v not_in_remote | ||
true | ||
``` | ||
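
As a rough sketch of how the two checks in this section might be combined in a CI step (this is not part of the PR): the function names `all_data_in_remote` and `check_pipeline` are hypothetical, and the script assumes that `dvc` is on `PATH`, that a default remote is configured, and that `dvc exp run` exits with a non-zero code when the dry run reports an error. Unlike the `grep -v` form above, the helper fails whenever the string `not_in_remote` appears anywhere in the output.

```shell
#!/bin/sh
# Sketch only: the function names below are hypothetical, not part of DVC.

# Succeeds (exit 0) when the text on stdin never mentions "not_in_remote",
# i.e. when the status output reports no data missing from the remote.
all_data_in_remote() {
    ! grep -q not_in_remote
}

# Assumption: `dvc` is on PATH and a default remote is configured.
# Assumption: `dvc exp run` exits non-zero when the dry run hits an error.
check_pipeline() {
    # Step 1: fail if any stage is out of date for reasons other than missing data.
    dvc exp run --allow-missing --dry || return 1
    # Step 2: fail if any tracked data is missing from the remote.
    dvc data status --not-in-remote --json | all_data_in_remote
}
```

A CI job could source this file, call `check_pipeline`, and fail the job on a non-zero exit code. Nothing is pulled or run in either step.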
|
||
## Debugging Stages | ||
|
||
If you are using advanced features to interpolate values for your pipeline, like | ||
|
@@ -132,3 +290,4 @@ stage train: {'model': {'batch_size': 512, 'latent_dim': 8, | |
|
||
[templating]: /doc/user-guide/project-structure/pipelines-files#templating | ||
[hydra composition]: /docs/user-guide/experiment-management/hydra-composition | ||
[run cache]: /doc/user-guide/pipelines/run-cache |
Review comment: it's hard to understand that "missing data" is a change