guide: add allow-missing scenarios #4585

Merged: 6 commits, Jun 13, 2023
Changes from 3 commits
163 changes: 161 additions & 2 deletions content/docs/user-guide/pipelines/running-pipelines.md
@@ -86,8 +86,7 @@ DVC will skip that stage:
Stage 'prepare' didn't change, skipping
```

-DVC will also recover the outputs from previous runs using the
-[run cache](/doc/user-guide/pipelines/run-cache):
+DVC will also recover the outputs from previous runs using the [run cache].

```
Stage 'prepare' is cached - skipping run, checking out outputs
@@ -108,6 +107,165 @@ stages:
always_changed: true
```

## Pull Missing Data

`--pull` will download missing dependencies (and will download the cached
outputs of previous runs saved in the [run cache]), so you don't need to pull
all data for your project before running the pipeline. `--allow-missing` will
skip stages with no other changes than missing data. You can combine the

[Review comment, Member] It's hard to understand that "missing data" is a change.

`--pull` and `--allow-missing` flags to run a pipeline while only pulling the
data that is actually needed to run the changed stages.

Given the pipeline used in
[example-get-started-experiments](https://github.com/iterative/example-get-started-experiments):

```cli
$ dvc dag
+--------------------+
| data/pool_data.dvc |
+--------------------+
*
*
*
+------------+
| data_split |
+------------+
** **
** **
* **
+-------+ *
| train | **
+-------+ **
** **
** **
* *
+----------+
| evaluate |
+----------+
```

If we are on a machine where all the data is missing:

```cli
$ dvc status
Not in cache:
(use "dvc fetch <file>..." to download files)
models/model.pkl
data/pool_data/
data/test_data/
data/train_data/
```

We can modify the `evaluate` stage (for example, we changed the code to add a

[Review comment, Contributor] Minor. Why not use --set-param? Is it to keep the focus on only the 2 flags being explained?

[Reply, Contributor Author] I didn't think about it. I think I just copied from your example in https://dvc.org/doc/command-reference/repro#example-only-pull-pipeline-data-as-needed, which used repro, so I guess it couldn't rely on --set-param. I could update with an example that uses it.

new evaluation method) and DVC will only pull the data needed to run that
stage (`models/model.pkl` and `data/test_data/`) while skipping the rest of the
stages:

```cli
$ dvc exp run --pull --allow-missing
'data/pool_data.dvc' didn't change, skipping
Stage 'data_split' didn't change, skipping
Stage 'train' didn't change, skipping
Running stage 'evaluate':
...
```
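
As a rough illustration of why only `models/model.pkl` and `data/test_data/` are needed here, the toy model below walks the example pipeline. The stage names and paths are taken from the `dvc dag` and `dvc status` output in this guide; the `PIPELINE` table and both functions are hypothetical, and the selection rule is an assumption rather than DVC's actual algorithm.

```python
# Toy model only: stage names and paths come from the example pipeline,
# but the selection rule is an assumption, not DVC's actual algorithm.
PIPELINE = {
    "data_split": {"deps": ["data/pool_data"],
                   "outs": ["data/train_data", "data/test_data"]},
    "train": {"deps": ["data/train_data"], "outs": ["models/model.pkl"]},
    "evaluate": {"deps": ["models/model.pkl", "data/test_data"], "outs": []},
}


def stages_to_rerun(changed):
    """Changed stages plus every stage that consumes their outputs, transitively."""
    rerun = set(changed)
    grew = True
    while grew:
        grew = False
        produced = {out for name in rerun for out in PIPELINE[name]["outs"]}
        for name, stage in PIPELINE.items():
            if name not in rerun and produced & set(stage["deps"]):
                rerun.add(name)
                grew = True
    return rerun


def data_to_pull(changed):
    """Dependencies of the rerun stages that no rerun stage regenerates."""
    rerun = stages_to_rerun(changed)
    produced = {out for name in rerun for out in PIPELINE[name]["outs"]}
    needed = {dep for name in rerun for dep in PIPELINE[name]["deps"]}
    return sorted(needed - produced)


print(data_to_pull(["evaluate"]))  # ['data/test_data', 'models/model.pkl']
```

In this model, changing `train` instead would require `data/train_data` and `data/test_data`, because `models/model.pkl` would be regenerated rather than pulled.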

## Verify Pipeline Status

In scenarios like CI jobs, you may want to check that the pipeline is up to date
without pulling or running anything. You can check that nothing has changed:

<details>

### Clean example

In the example below, data is missing because nothing has been pulled, but
otherwise the pipeline is up to date.

```cli
$ dvc status
data_split:
changed deps:
deleted: data/pool_data
changed outs:
not in cache: data/test_data
not in cache: data/train_data
train:
changed deps:
deleted: data/train_data
changed outs:
not in cache: models/model.pkl
evaluate:
changed deps:
deleted: data/test_data
deleted: models/model.pkl
data/pool_data.dvc:
changed outs:
not in cache: data/pool_data
```

</details>

```cli
$ dvc exp run --allow-missing --dry
[Review comment, Member] tbh, if I were new to DVC it would not have been clear to me why it's not the default behavior ... if I run dry why would it try to pull anything ... does it btw still pull anything at all?

[Reply, Contributor Author] It doesn't pull anything, but it will fail because of the missing data even during --dry since it's considered deleted.

Reproducing experiment 'agley-nuke'
'data/pool_data.dvc' didn't change, skipping
Stage 'data_split' didn't change, skipping
Stage 'train' didn't change, skipping
Stage 'evaluate' didn't change, skipping
```

If anything is not up to date, the pipeline will fail:

<details>

### Dirty example

In the example below, the `data_split` parameter in `params.yaml` was modified,
so the pipeline is not up to date.

```cli
$ dvc status
data_split:
changed deps:
deleted: data/pool_data
params.yaml:
modified: data_split
changed outs:
not in cache: data/test_data
not in cache: data/train_data
train:
changed deps:
deleted: data/train_data
changed outs:
not in cache: models/model.pkl
evaluate:
changed deps:
deleted: data/test_data
deleted: models/model.pkl
data/pool_data.dvc:
changed outs:
not in cache: data/pool_data
```

</details>

```cli
$ dvc exp run --allow-missing --dry
Reproducing experiment 'dozen-jogs'
'data/pool_data.dvc' didn't change, skipping
ERROR: failed to reproduce 'data_split': [Errno 2] No such file or directory: '/private/tmp/example-get-started-experiments/data/pool_data'
```
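
One way to read the difference between the clean and dirty examples is sketched below. The change labels (`deleted`, `not in cache`, `modified`) are copied from the `dvc status` output above; the rule that a stage counts as unchanged when every reported change is missing data is an assumption made for illustration, and the helper names are hypothetical. This is not DVC's implementation.

```python
# Toy model only: NOT DVC's implementation. Change labels are copied from the
# `dvc status` output in this guide; the classification rule is an assumption.
MISSING_DATA_CHANGES = {"deleted", "not in cache"}


def only_missing_data(stage_changes):
    """Return True if every reported change is just missing data."""
    return all(kind in MISSING_DATA_CHANGES for kind, _path in stage_changes)


def stages_to_run(status):
    """Return the stages that --allow-missing could not skip in this model.

    Downstream stages of a changed stage are ignored to keep the sketch short.
    """
    return [stage for stage, changes in status.items()
            if not only_missing_data(changes)]


clean_status = {
    "data_split": [("deleted", "data/pool_data"),
                   ("not in cache", "data/test_data"),
                   ("not in cache", "data/train_data")],
    "train": [("deleted", "data/train_data"),
              ("not in cache", "models/model.pkl")],
    "evaluate": [("deleted", "data/test_data"),
                 ("deleted", "models/model.pkl")],
}

# Same as the clean case, plus the modified parameter from the dirty example.
dirty_status = dict(clean_status)
dirty_status["data_split"] = clean_status["data_split"] + [
    ("modified", "params.yaml:data_split")
]

print(stages_to_run(clean_status))  # []
print(stages_to_run(dirty_status))  # ['data_split']
```

In the dirty case, `data_split` has a real change, so it must run, and the run fails because its input `data/pool_data` was never pulled.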

You can also check that all data exists on the remote. The command below will

[Review comment, Member] Why do we need this check?

[Reply, Contributor Author] I had to update the command, but it's needed because --allow-missing won't actually check whether the missing data exists on the remote. However, it's probably important to users to ensure all data has been pushed so that the pipeline could be reproduced from scratch if needed. See the initial request for this feature in iterative/dvc#5369.

succeed (set the exit code to `0`) if all data is found in the remote.
Otherwise, it will fail (set the exit code to `1`).

```cli
$ dvc data status --not-in-remote --json | grep -v not_in_remote
true
```
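
If a CI job needs more than an exit code, the same check could be done by parsing the JSON instead of using `grep`. The sketch below assumes the report is a JSON object that lists missing paths under a `not_in_remote` key, which is inferred from the `grep` pattern above rather than confirmed; the function name and the sample inputs are hypothetical.

```python
import json


def all_data_in_remote(status_json):
    """Return True when the report lists nothing under "not_in_remote".

    Assumes the report is a JSON object; the key name is taken from the
    grep pattern in the command above, and the value shape is a guess.
    """
    report = json.loads(status_json)
    return not report.get("not_in_remote")


# Made-up sample reports, for illustration only.
print(all_data_in_remote('{}'))  # True
print(all_data_in_remote('{"not_in_remote": ["data/pool_data/"]}'))  # False
```

A wrapper script could feed the output of `dvc data status --not-in-remote --json` to this function and exit with a non-zero code when it returns `False`.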

## Debugging Stages

If you are using advanced features to interpolate values for your pipeline, like
@@ -132,3 +290,4 @@ stage train: {'model': {'batch_size': 512, 'latent_dim': 8,

[templating]: /doc/user-guide/project-structure/pipelines-files#templating
[hydra composition]: /docs/user-guide/experiment-management/hydra-composition
[run cache]: /doc/user-guide/pipelines/run-cache