Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

checkout: consistency in handling files that are missing version info #5913

Closed
efiop opened this issue May 3, 2021 · 21 comments
Closed

checkout: consistency in handling files that are missing version info #5913

efiop opened this issue May 3, 2021 · 21 comments
Assignees
Labels
A: data-management Related to dvc add/checkout/commit/move/remove discussion requires active participation to reach a conclusion enhancement Enhances DVC product: VSCode Integration with VSCode extension ui user interface / interaction

Comments

@efiop
Copy link
Contributor

efiop commented May 3, 2021

dvc checkout(and dvc pull since it uses it internally) will error-out if version-info is missing, while dvc push will just warn us, which creates a confusing inconsistency. For example, it makes CML guys use || true for dvc pull https://github.com/DavidGOrtega/cml-dvc-test/runs/2438634469 CC @DavidGOrtega , as they don't have dvc.lock on initial run.

Current dvc push behaviour with warnings is also quite annoying and doesn't scale well, as it might get way to noisy for a big pipeline. So we should probably agree on one common behaviour for all such cases that will make sense to our users. CC @dberenbaum

@efiop efiop added enhancement Enhances DVC ui user interface / interaction discussion requires active participation to reach a conclusion labels May 3, 2021
@dberenbaum
Copy link
Collaborator

Current dvc push behaviour with warnings is also quite annoying and doesn't scale well, as it might get way to noisy for a big pipeline.

Isn't this what --quiet is for? It would be great to have some nice summary option that would be a compromise between no output and overly verbose output, but I think that's separate from the issue here, which is about an inconsistency between push and pull. If I have done dvc stage add ... && dvc push, then dvc pull right after should not fail.

@dberenbaum dberenbaum added the product: VSCode Integration with VSCode extension label May 3, 2021
@dberenbaum dberenbaum self-assigned this May 18, 2021
@dberenbaum
Copy link
Collaborator

@efiop Are you waiting on product decisions here? I think we can revisit the warnings, but seems like we can implement pull behavior to match push for now.

@karajan1001
Copy link
Contributor

I'm sorry, may I asked which version info is missing

@dberenbaum
Copy link
Collaborator

dberenbaum commented May 19, 2021

@karajan1001 This is the scenario as far as I remember:

  • dvc add data and commit .dvc file.
  • dvc stage add and commit dvc.yaml (the stage does not get run).
  • git push and dvc push, which warns the user that version info is missing for the outputs of the dvc.yaml stage.
  • On remote server in CML, git pull and dvc pull, which throws an error because version info is missing for the outputs of the dvc.yaml stage.

EDIT: A much simpler summary is that no dvc.lock file exists.

@efiop
Copy link
Contributor Author

efiop commented May 19, 2021

@dberenbaum We might consider starting to treat no-version-info checkout errors as nonfatal. We have some visualization of result in checkout right now (A, D etc), so maybe there is something we can do there instead of push-like warnings, need to take a look.

@skshetry WDYT? ^

There is also a question of backward compatibility, seems like if we switch to nonfatal behaviour by-default, we might break this for guys that use this to check if they were able to successfully checkout/pull (their cases and behavior could be separate, it is just an implementation detail that pull uses checkout inside). But it does make sense to do that switch, since this is closer to being a bug. At the same time, looks like we might want to provide a flag (or alternative way) to check that version info is present for particular outs/stages/etc.

@skshetry
Copy link
Member

skshetry commented May 20, 2021

I think that the incomplete checkout is a failure, be it due to missing cache or missing version info. || true is the correct way to handle that. We could provide a --ignore-errors flags to not error out though. push and fetch could be seen as loosening the guarantee/consistency as what matters most to us is the user's current workspace.

@pared
Copy link
Contributor

pared commented May 20, 2021

Related #6039

@dberenbaum
Copy link
Collaborator

@skshetry Should push also fail?

In the scenario above, it seems surprising to me that push or pull would fail. Should any stage that hasn't been run yet cause a failure on push or pull?

@dberenbaum
Copy link
Collaborator

Should any stage that hasn't been run yet cause a failure on push or pull?

Ping to keep this discussion moving. Any thoughts?

@skshetry
Copy link
Member

Ping to keep this discussion moving. Any thoughts?

For the recently created stage, maybe we should not fail, but there could be other reasons why the version info might be missing:

  1. git-ignored/dvc-ignored dvc.lock,
  2. missing/deleted dvc.lock file,
  3. unreproduced stage,
  4. mismatched entries in the dvc.yaml and dvc.lock file, and
  5. missing md5 hash on .dvc file, etc.

@dberenbaum
Copy link
Collaborator

Sure, there could be lots of reasons for not having version info. If there's no way to determine whether this is expected, isn't that what warnings are for?

push and fetch could be seen as loosening the guarantee/consistency as what matters most to us is the user's current workspace.

So checkout and pull can be inconsistent with push and fetch because they don't touch the user's workspace? In the case of missing version info, wouldn't the user's workspace remain untouched anyway? I don't see the danger.

More importantly, how would a user know this, and how can we avoid confusion? This seems like a subtle and unexpected distinction if I'm running various commands and some fail and some don't despite doing encountering the same issues.

@skshetry
Copy link
Member

Sure, there could be lots of reasons for not having version info. If there's no way to determine whether this is expected, isn't that what warnings are for?

@dberenbaum, I am not clear on how it should be handled. I understand dvc checkout to mean "checkout everything" from the index (i.e. stages). Users might depend on it to exist later in their workspace, so just throwing warnings might make it fail later on the user's script.

wouldn't the user's workspace remain untouched anyway? I don't see the danger.

Right now, we delete the file if we are missing version info, even though we fail.

@dberenbaum
Copy link
Collaborator

Right now, we delete the file if we are missing version info, even though we fail.

Okay, that does seem more dangerous! Although if it's still happening on failure, then it doesn't seem like this behavior has much to do with whether the command fails or not. Also, why does dvc delete the file?

@dberenbaum
Copy link
Collaborator

Ping @skshetry

@skshetry
Copy link
Member

skshetry commented Jun 23, 2021

I don't have a good idea on why it deletes the file on missing version info. DVC behaves like this in a lot of places, which I'd like to fix (eg: dvc add removing stage files on failure, etc). Pinging @efiop.

@efiop
Copy link
Contributor Author

efiop commented Jun 23, 2021

I don't have a good idea on why it deletes the file on missing version info.

Just an old assumption so that you are not confused by checkout that will keep some random file even though you don't have version info in a corresponding dvcfile. Could reconsider it, sure.

@dberenbaum
Copy link
Collaborator

Okay, rethinking this from the high-level perspective. There's inconsistency between push and pull, but I think I'm at least as bothered by why either command should throw an error here.

In CML, they set up a pipeline but don't run that pipeline yet. Then they set up a job that pulls the data and runs the pipeline. DVC is trying to push/pull the outputs of the pipeline, which don't exist yet since the pipeline hasn't run.

For both push and pull, the user expects that the outputs don't exist. There are lots of scenarios where users don't expect pipeline outputs to exist:

  1. New pipelines.
  2. Pipelines that have been partially run.
  3. New stages added to existing pipelines.
  4. Modifying output paths in a pipeline.

There are two possible approaches:

  1. Tightly couple dvc.yaml and dvc.lock, assuming that dvc.lock should always match whatever is in dvc.yaml. This seems useful if the user reproduced the latest dvc.yaml version but ended up committing some missing or corrupted state.
  2. Ignore dvc.yaml for data management purposes, focusing only on what's in dvc.lock and .dvc files. This seems useful in the scenarios above where users make changes to pipelines without re-running everything.

It seems like we have opted for the first option, but what do you think about the second approach? It seems worth considering since we are increasingly trying to separate data management and pipelines.

@pmrowla
Copy link
Contributor

pmrowla commented Jul 6, 2021

related #4746

@dberenbaum
Copy link
Collaborator

@skshetry What do you think about introducing this --ignore-missing option or similar from #4746 to checkout and pull?

@agrizzli
Copy link

It would be great to have such an option to --ignore-missing outputs (including metrics and plots) during dvc checkout!

In my use case, I define pipeline stages in dvc.yaml to be run on a remote cluster. I cannot prepopulate them locally using dvc repro (or similar) due to hardware requirements imposed by those stages. The only workaround is to manually create them as empty files and then use dvc run -f --no-exec as described here, which is quite annoying.

@daavoo daavoo added the A: data-management Related to dvc add/checkout/commit/move/remove label Feb 22, 2022
@dberenbaum
Copy link
Collaborator

Closing this in favor of #4746, but feel free to reopen if there's any other aspects you want to keep discussing.

@dberenbaum dberenbaum added this to DVC Dec 1, 2022
@dberenbaum dberenbaum moved this to Backlog in DVC Dec 1, 2022
@dberenbaum dberenbaum removed the status in DVC Dec 6, 2022
@dberenbaum dberenbaum removed this from DVC Dec 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A: data-management Related to dvc add/checkout/commit/move/remove discussion requires active participation to reach a conclusion enhancement Enhances DVC product: VSCode Integration with VSCode extension ui user interface / interaction
Projects
None yet
Development

No branches or pull requests

8 participants