guide: add allow-missing scenarios #4585

Merged
merged 6 commits into from
Jun 13, 2023
154 changes: 154 additions & 0 deletions content/docs/user-guide/pipelines/running-pipelines.md
@@ -108,6 +108,160 @@ stages:
always_changed: true
```

## Pulling Data
Member

Since the title is a bit general, should we give a one-sentence intro (pull gives a way to ...) and then explain how allow-missing helps?

Contributor Author

Added more in the intro


You can combine the `--pull` and `--allow-missing` flags to run a pipeline while
only pulling the data that is actually needed to run the changed stages.

Given the pipeline used in
[example-get-started-experiments](https://github.com/iterative/example-get-started-experiments):

```cli
$ dvc dag
      +--------------------+
      | data/pool_data.dvc |
      +--------------------+
                 *
                 *
                 *
          +------------+
          | data_split |
          +------------+
           **        **
         **            **
        *                **
  +-------+                *
  | train |              **
  +-------+            **
           **        **
             **    **
               *  *
          +----------+
          | evaluate |
          +----------+
```

If we are on a machine where all the data is missing:

```cli
$ dvc status
Not in cache:
  (use "dvc fetch <file>..." to download files)
        models/model.pkl
        data/pool_data/
        data/test_data/
        data/train_data/
```

We can modify the `evaluate` stage and DVC will pull only the data needed to
run that stage (`models/model.pkl` and `data/test_data/`) while skipping the
rest of the stages:

```cli
$ dvc exp run --pull --allow-missing
'data/pool_data.dvc' didn't change, skipping
Stage 'data_split' didn't change, skipping
Stage 'train' didn't change, skipping
Running stage 'evaluate':
...
```
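
To make the "only pull what is needed" idea concrete, here is a minimal toy model in Python. It is not DVC's implementation: the `STAGES` table, `stages_to_run`, and `data_to_pull` are hypothetical names, and the dependency lists are simplified from the `dvc status` output shown in this guide (code and parameter dependencies are left out).

```python
# Toy model of the example pipeline (an assumption for illustration only).
# Stages are listed in topological order, matching the `dvc dag` output.
STAGES = {
    "data_split": {
        "deps": ["data/pool_data"],
        "outs": ["data/train_data", "data/test_data"],
    },
    "train": {
        "deps": ["data/train_data"],
        "outs": ["models/model.pkl"],
    },
    "evaluate": {
        "deps": ["models/model.pkl", "data/test_data"],
        "outs": [],
    },
}


def stages_to_run(changed):
    """Return the changed stages plus every stage downstream of them."""
    regenerated = set()  # outputs that will be rebuilt by stages that run
    to_run = []
    for name, stage in STAGES.items():
        if name in changed or regenerated & set(stage["deps"]):
            to_run.append(name)
            regenerated |= set(stage["outs"])
    return to_run


def data_to_pull(changed):
    """Return deps of the stages that run, minus data those stages rebuild."""
    run = stages_to_run(changed)
    regenerated = {out for name in run for out in STAGES[name]["outs"]}
    needed = {dep for name in run for dep in STAGES[name]["deps"]}
    return sorted(needed - regenerated)


if __name__ == "__main__":
    # Only `evaluate` changed, so only its two inputs are needed.
    print(data_to_pull({"evaluate"}))
```

In this sketch, changing only `evaluate` requires `data/test_data` and `models/model.pkl`, which matches the two paths named above; `data/pool_data` and `data/train_data` are never requested.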

## Verify Pipeline Status

In scenarios like CI jobs, you may want to check that the pipeline is up to date
without pulling or running anything. You can check that nothing has changed:

<details>

### Clean example

In the example below, data is missing because nothing has been pulled, but
otherwise the pipeline is up to date.

```cli
$ dvc status
data_split:
    changed deps:
        deleted: data/pool_data
    changed outs:
        not in cache: data/test_data
        not in cache: data/train_data
train:
    changed deps:
        deleted: data/train_data
    changed outs:
        not in cache: models/model.pkl
evaluate:
    changed deps:
        deleted: data/test_data
        deleted: models/model.pkl
data/pool_data.dvc:
    changed outs:
        not in cache: data/pool_data
```

</details>

```cli
$ dvc exp run --allow-missing --dry
Reproducing experiment 'agley-nuke'
'data/pool_data.dvc' didn't change, skipping
Stage 'data_split' didn't change, skipping
Stage 'train' didn't change, skipping
Stage 'evaluate' didn't change, skipping
```

Member

tbh, if I were new to DVC, it would not have been clear to me why it's not the default behavior ... if I run dry, why would it try to pull anything ... does it, btw, still pull anything at all?

Contributor Author

It doesn't pull anything, but without `--allow-missing` it will fail because of the missing data, even during `--dry`, since the data is considered deleted.

If anything is not up to date, the pipeline will fail:

<details>

### Dirty example

In the example below, the `data_split` parameter in `params.yaml` was modified,
so the pipeline is not up to date.

```cli
$ dvc status
data_split:
    changed deps:
        deleted: data/pool_data
        params.yaml:
            modified: data_split
    changed outs:
        not in cache: data/test_data
        not in cache: data/train_data
train:
    changed deps:
        deleted: data/train_data
    changed outs:
        not in cache: models/model.pkl
evaluate:
    changed deps:
        deleted: data/test_data
        deleted: models/model.pkl
data/pool_data.dvc:
    changed outs:
        not in cache: data/pool_data
```

</details>

```cli
$ dvc exp run --allow-missing --dry
Reproducing experiment 'dozen-jogs'
'data/pool_data.dvc' didn't change, skipping
ERROR: failed to reproduce 'data_split': [Errno 2] No such file or directory: '/private/tmp/example-get-started-experiments/data/pool_data'
```
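
For a CI job, the check above can be wrapped in a small script. The following is a minimal sketch that assumes the dry run exits with a non-zero code when a stage is out of date, as the error in the dirty example suggests; `pipeline_is_up_to_date` and `CHECK_CMD` are hypothetical names, not part of DVC.

```python
import subprocess
import sys

# Command taken from the example above; it must run inside a DVC repository.
CHECK_CMD = ["dvc", "exp", "run", "--allow-missing", "--dry"]


def pipeline_is_up_to_date(cmd=CHECK_CMD):
    """Run the check command and report whether it exited with code 0."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        # Surface the tool's own error message to make CI logs useful.
        sys.stderr.write(result.stderr)
    return result.returncode == 0


if __name__ == "__main__":
    # Propagate the result as the job's exit code: 0 passes, 1 fails.
    sys.exit(0 if pipeline_is_up_to_date() else 1)
```

The command is a parameter so the wrapper can be exercised with a stand-in command where DVC is not installed.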

You can also check that all data exists on the remote. The command below will
succeed (print `true` and exit with code `0`) if all data is found on the
remote. Otherwise, it will fail (print `false` and exit with code `1`).

Member

Why do we need this check?

Contributor Author

I had to update the command, but it's needed because `--allow-missing` won't actually check whether the missing data exists on the remote. However, it's probably important to users to ensure all data has been pushed so that the pipeline could be reproduced from scratch if needed. See the initial request for this feature in iterative/dvc#5369.


```cli
$ dvc status -c --json | jq -e '. == {}'
true
```
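
Where `jq` is not available, the same comparison can be done in Python. This is a minimal sketch, assuming `dvc status -c --json` prints an empty JSON object when nothing is missing from the remote; `remote_is_in_sync` and `check_remote` are hypothetical helper names.

```python
import json
import subprocess


def remote_is_in_sync(status_json):
    """Mirror `jq -e '. == {}'`: True only for an empty JSON object."""
    return json.loads(status_json) == {}


def check_remote():
    """Run `dvc status -c --json` in the current repository and compare."""
    result = subprocess.run(
        ["dvc", "status", "-c", "--json"],
        capture_output=True,
        text=True,
        check=True,  # raise if the dvc command itself fails
    )
    return remote_is_in_sync(result.stdout)
```

Keeping the JSON comparison separate from the subprocess call makes the logic easy to check without a DVC repository.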

## Debugging Stages

If you are using advanced features to interpolate values for your pipeline, like