Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cli: add allow-missing flag to commit command #10555

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

anunayasri
Copy link

Fixes #10524

Thank you for the contribution - we'll try to review it as soon as possible. πŸ™

@anunayasri
Copy link
Author

@skshetry Could you review this please, especially --allow-missing description. If it looks fine, I will update the docs repo as well and raise the PR.

@shcheklein
Copy link
Member

@anunayasri It would be great to have a test for this. Also, could you confirm its semantics - that it keeps md5s of the missing dirs / files / deps as-is?

@skshetry
Copy link
Member

skshetry commented Sep 16, 2024

Looks like we don't have tests for allow_missing flag for commit. We don't write tests for CLI, as they are thin wrapper over Repo API. But since we are now going to depend on it, we need a test for it.

Do you mind adding a test for this API?

PTAL https://github.com/iterative/dvc/blob/main/tests/func/test_commit.py, which might help you how to write tests. If you have any questions, please feel free to ask here or in discord.

@anunayasri
Copy link
Author

@shcheklein There were no tests for cli commands, hence I had not added tests. It seems tests need to be added for the allow_missing logic in Repo.commit(). Will do so.

@skshetry Thanks. I will look into it.

@shcheklein
Copy link
Member

@anunayasri hey, are there any updates on this? do you need some help on this? (thanks for the PR btw!)

@anunayasri
Copy link
Author

Hi @shcheklein. Sorry. I was busy with office work and missed your comment. I have been chatting with @skshetry on discord DM. He helped me with my doubts. I am planning to work on the issue this week.

@anunayasri
Copy link
Author

Hi. I need help with writing the test case for dvc commit usecase for a pipeline. I tried checking the existing test case like dvc repro but I am not clear about few things. Could you help or point me to a code piece/doc.

  1. In the test case, generate a dvc.yaml file. It should have a dep on data file that is missing from the repo. How to generate this setup? I was trying out allow-missing through the example project example-get-started.
  2. dvc commit --allow-missing dataset should do nothing. I believe allow-missing is relevant to commit of pipelines.
  3. Could you explain the diff between stage.save() and stage.commit().

cc: @skshetry @shcheklein

@skshetry
Copy link
Member

skshetry commented Oct 22, 2024

Hey @anunayasri. Sorry for the late response.

I took a look at the implementation, and it looks like --allow-missing is only implemented for the things that we need for experiments, where we run it in --force mode and only use it to commit data sources (see data_only=True).

Without --force, we show a prompt about changes to the users before committing. This is done through Stage.changed_entries(). It uses Output.workspace_status() to find out the changes, but how it has changed is getting lost inside it. So, we need to re-work that API and teach it about allow_missing.

Additionally, it does not work with missing dependencies at this time. It's due to "run cache," which fails on missing dependencies, but it should be easy to work around.


Hi. I need help with writing the test case for dvc commit usecase for a pipeline. I tried checking the existing test case like dvc repro but I am not clear about few things. Could you help or point me to a code piece/doc.

  1. In the test case, generate a dvc.yaml file. It should have a dep on data file that is missing from the repo. How to generate this setup? I was trying out allow-missing through the example project example-get-started.

Here's an example test case for pipelines:

def test_allow_missing(tmp_dir, dvc):
    tmp_dir.dvc_gen("foo", "foo")
    dvc.run(
        name="copy",
        cmd=["cp foo foo_copy1", "cp foo foo_copy2"],
        deps=["foo"],
        outs=["foo_copy1", "foo_copy2"],
    )

    (tmp_dir / "foo_copy1").unlink()

    (stage,) = dvc.commit("dvc.yaml", allow_missing=True)
    outs = {out.def_path: out.hash_info.value for out in stage.outs}
    assert outs == {
        "foo_copy1": "acbd18db4cc2f85cedef654fccc4a4d8",
        "foo_copy2": "acbd18db4cc2f85cedef654fccc4a4d8",
    }

    (tmp_dir / "foo_copy2").write_text("foobar", encoding="utf-8")
    (stage,) = dvc.commit("dvc.yaml", allow_missing=True)
    outs = {out.def_path: out.hash_info.value for out in stage.outs}
    assert outs == {
        "foo_copy1": "acbd18db4cc2f85cedef654fccc4a4d8",
        "foo_copy2": "3858f62230ac3c915f300c664312c63f",
    }

Tip

If you pass force=True to dvc.commit above, the test passes.

  1. dvc commit --allow-missing dataset should do nothing. I believe allow-missing is relevant to commit of pipelines.

--allow-missing dataset should do nothing if dataset is missing. That is correct. If it does exist, it should work the same way as without --allow-missing.

  1. Could you explain the diff between stage.save() and stage.commit().

save() hashes the output/dependency and updates its internal model (well, saves it), whereas commit hashes output/dependency, copies it to the cache, and checkout them back to the workspace (without changing its internal model as possible).
See Output.save() and Output.commit() for more information.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add --allow-missing for dvc commit
3 participants