guide: add allow-missing scenarios #4585
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -108,6 +108,160 @@ stages: | |
always_changed: true | ||
``` | ||
|
||
## Pulling Data | ||
|
||
You can combine the `--pull` and `--allow-missing` flags to run a pipeline while | ||
only pulling the data that is actually needed to run the changed stages. | ||
|
||
Given the pipeline used in | ||
[example-get-started-experiments](https://github.com/iterative/example-get-started-experiments): | ||
|
||
```cli | ||
$ dvc dag | ||
+--------------------+ | ||
| data/pool_data.dvc | | ||
+--------------------+ | ||
* | ||
* | ||
* | ||
+------------+ | ||
| data_split | | ||
+------------+ | ||
** ** | ||
** ** | ||
* ** | ||
+-------+ * | ||
| train | ** | ||
+-------+ ** | ||
** ** | ||
** ** | ||
* * | ||
+----------+ | ||
| evaluate | | ||
+----------+ | ||
``` | ||
|
||
If we are on a machine where all the data is missing: | ||
|
||
```cli | ||
$ dvc data status | ||
Not in cache: | ||
(use "dvc fetch <file>..." to download files) | ||
models/model.pkl | ||
data/pool_data/ | ||
data/test_data/ | ||
data/train_data/ | ||
``` | ||
|
||
We can modify the `evaluate` stage, and DVC will pull only the data needed to | ||
run that stage (`models/model.pkl` and `data/test_data/`) while skipping the | ||
rest of the stages: | ||
|
||
```cli | ||
$ dvc exp run --pull --allow-missing | ||
'data/pool_data.dvc' didn't change, skipping | ||
Stage 'data_split' didn't change, skipping | ||
Stage 'train' didn't change, skipping | ||
Running stage 'evaluate': | ||
... | ||
``` | ||
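
As a rough mental model of why only `models/model.pkl` and `data/test_data/` are needed when `evaluate` changes, the sketch below walks a small stage table and works out which inputs would have to be pulled. This is not DVC's implementation. The `STAGES` table lists only the dependencies and outputs that this guide shows for the example pipeline, and `stages_to_run` and `data_to_pull` are hypothetical helper names.

```python
# Simplified view of the example pipeline: only the paths shown in this guide.
STAGES = {
    "data_split": {
        "deps": ["data/pool_data"],
        "outs": ["data/train_data", "data/test_data"],
    },
    "train": {"deps": ["data/train_data"], "outs": ["models/model.pkl"]},
    "evaluate": {"deps": ["data/test_data", "models/model.pkl"], "outs": []},
}


def stages_to_run(stages, changed):
    """Return the changed stages plus every stage downstream of them."""
    to_run = set(changed)
    grew = True
    while grew:
        grew = False
        produced = {out for name in to_run for out in stages[name]["outs"]}
        for name, spec in stages.items():
            # A stage must rerun if it consumes something a rerun stage produces.
            if name not in to_run and produced.intersection(spec["deps"]):
                to_run.add(name)
                grew = True
    return to_run


def data_to_pull(stages, changed):
    """Return inputs of the rerun stages that no rerun stage regenerates."""
    to_run = stages_to_run(stages, changed)
    produced = {out for name in to_run for out in stages[name]["outs"]}
    needed = {dep for name in to_run for dep in stages[name]["deps"]}
    return needed - produced


print(sorted(data_to_pull(STAGES, {"evaluate"})))
# ['data/test_data', 'models/model.pkl']
```

Under the same assumptions, changing `train` instead would require `data/train_data` and `data/test_data`, because `models/model.pkl` would be regenerated rather than pulled.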
|
||
## Verify Pipeline Status | ||
|
||
In scenarios like CI jobs, you may want to check that the pipeline is up to date | ||
without pulling or running anything. You can check that nothing has changed: | ||
|
||
<details> | ||
|
||
### Clean example | ||
|
||
In the example below, data is missing because nothing has been pulled, but | ||
otherwise the pipeline is up to date. | ||
|
||
```cli | ||
$ dvc status | ||
data_split: | ||
changed deps: | ||
deleted: data/pool_data | ||
changed outs: | ||
not in cache: data/test_data | ||
not in cache: data/train_data | ||
train: | ||
changed deps: | ||
deleted: data/train_data | ||
changed outs: | ||
not in cache: models/model.pkl | ||
evaluate: | ||
changed deps: | ||
deleted: data/test_data | ||
deleted: models/model.pkl | ||
data/pool_data.dvc: | ||
changed outs: | ||
not in cache: data/pool_data | ||
``` | ||
|
||
</details> | ||
|
||
```cli | ||
$ dvc exp run --allow-missing --dry | ||
tbh, if I were new to DVC it would not have been clear to me why it's not the default behavior ... if I run
It doesn't pull anything but it will fail because of the missing data even during |
||
Reproducing experiment 'agley-nuke' | ||
'data/pool_data.dvc' didn't change, skipping | ||
Stage 'data_split' didn't change, skipping | ||
Stage 'train' didn't change, skipping | ||
Stage 'evaluate' didn't change, skipping | ||
``` | ||
|
||
If anything is not up to date, the pipeline will fail: | ||
|
||
<details> | ||
|
||
### Dirty example | ||
|
||
In the example below, the `data_split` parameter in `params.yaml` was modified, | ||
so the pipeline is not up to date. | ||
|
||
```cli | ||
$ dvc status | ||
data_split: | ||
changed deps: | ||
deleted: data/pool_data | ||
params.yaml: | ||
modified: data_split | ||
changed outs: | ||
not in cache: data/test_data | ||
not in cache: data/train_data | ||
train: | ||
changed deps: | ||
deleted: data/train_data | ||
changed outs: | ||
not in cache: models/model.pkl | ||
evaluate: | ||
changed deps: | ||
deleted: data/test_data | ||
deleted: models/model.pkl | ||
data/pool_data.dvc: | ||
changed outs: | ||
not in cache: data/pool_data | ||
``` | ||
|
||
</details> | ||
|
||
```cli | ||
$ dvc exp run --allow-missing --dry | ||
Reproducing experiment 'dozen-jogs' | ||
'data/pool_data.dvc' didn't change, skipping | ||
ERROR: failed to reproduce 'data_split': [Errno 2] No such file or directory: '/private/tmp/example-get-started-experiments/data/pool_data' | ||
``` | ||
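
In a CI job, the dry run above could be wrapped in a small script that turns its result into a boolean. The following is a minimal sketch, assuming that a failed dry run (as in the dirty example) exits with a non-zero status. `pipeline_is_up_to_date` is a hypothetical helper name, and the `run` parameter exists only so the check can be exercised without DVC installed.

```python
import subprocess
from typing import Callable

# Command taken from this guide; run it from the root of a DVC project.
DRY_RUN_CMD = ["dvc", "exp", "run", "--allow-missing", "--dry"]


def pipeline_is_up_to_date(
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> bool:
    """Return True when the dry run exits with status 0."""
    # Assumption: DVC signals a failed dry run through a non-zero exit code.
    result = run(DRY_RUN_CMD, capture_output=True, text=True)
    return result.returncode == 0
```

A CI step could call `pipeline_is_up_to_date()` and fail the job when it returns `False`.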
|
||
You can also check that all data exists on the remote. The command below will | ||
Why do we need this check?
I had to update the command, but it's needed because |
||
succeed (return `true` and set the exit code to `0`) if all data is found in the | ||
remote. Otherwise, it will fail (return `false` and set the exit code to `1`). | ||
|
||
```cli | ||
$ dvc status -c --json | jq -e '. == {}' | ||
true | ||
``` | ||
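
Where `jq` is not available, the same test can be written in a few lines of Python. This is a sketch of the comparison only: it assumes the JSON text has already been captured from `dvc status -c --json`, and `cloud_status_is_clean` is a hypothetical helper name.

```python
import json


def cloud_status_is_clean(status_json: str) -> bool:
    """Mirror `jq -e '. == {}'`: True only for an empty JSON object."""
    return json.loads(status_json) == {}


print(cloud_status_is_clean("{}"))  # True
# Hypothetical non-empty payload; the real keys and values depend on DVC's output.
print(cloud_status_is_clean('{"data/pool_data.dvc": "placeholder"}'))  # False
```

A wrapper script would then exit with `0` or `1` depending on the returned value, matching the exit codes described above.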
|
||
## Debugging Stages | ||
|
||
If you are using advanced features to interpolate values for your pipeline, like | ||
|
Since the title is a bit general, should we give a one-sentence intro: pull gives a way to ... then explain how allow-missing helps?
Added more in the intro