docs: use-cases: versioning: add a note about replacing/modifying data #82

Closed
efiop opened this issue Sep 10, 2018 · 8 comments
Labels
A: docs Area: user documentation (gatsby-theme-iterative) C: cases Content of /doc/use-cases

Comments

@efiop
Contributor

efiop commented Sep 10, 2018

iterative/dvc#599 (comment)

@shcheklein
Member

@drorata could you please clarify why we need this section: https://dvc.org/doc/user-guide/update-tracked-file#updating-generated-files ?

@shcheklein
Member

@drorata I mean this specific paragraph:

If `train.tsv` is generated during your pipeline (e.g. some intermediate
result), you have to be careful and remove it from tracking prior to the
execution of the pipeline which modifies it.

@drorata
Contributor

drorata commented Sep 26, 2018

I guess what I have in mind is, for example, the case where a model is persisted to disk and you want to track it. In this case, the file is generated when some code runs, so rerunning that code can change the file and break the cache.
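
To make the failure mode concrete, here is a minimal sketch of how an in-place write can change a cached copy. It assumes the workspace file is a hardlink to the cached file, which is only one of the ways a cache can be linked into a workspace. The file names and the function `in_place_write_changes_cache` are hypothetical, and this is not DVC's own code.

```python
import os
import tempfile


def in_place_write_changes_cache():
    """Return the cached copy's content after the workspace file is overwritten in place."""
    with tempfile.TemporaryDirectory() as tmp:
        cached = os.path.join(tmp, "cached_copy")  # hypothetical stand-in for a cache entry
        output = os.path.join(tmp, "model.pkl")    # hypothetical workspace file

        with open(cached, "w") as f:
            f.write("old model")

        # Assumption: the workspace file is a hardlink to the cached copy.
        os.link(cached, output)

        # Opening with "w" truncates the existing inode instead of creating a new file,
        # so the cached copy, which shares that inode, is rewritten as well.
        with open(output, "w") as f:
            f.write("new model")

        with open(cached) as f:
            return f.read()


print(in_place_write_changes_cache())
```

Under that assumption the cached copy no longer holds the data it was stored for, which is the kind of breakage described above.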

@shcheklein
Member

Thanks for your answer, @drorata.

Let's imagine, for example, that we have a `train.py` script that takes some `data.tsv` and produces a `model.pkl` file. If you create a stage DVC file like this:

dvc run -f train.dvc -d data.tsv -d train.py -o model.pkl python train.py 0.2

then when you run `dvc repro train.dvc`, it will automatically remove the `model.pkl` file before running the command. That should be safe, so in this case there is no need to run `dvc remove` or anything else.
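
The remove-before-run behavior described above can be illustrated with a small sketch. This is not DVC's implementation. It only shows why deleting an output before regenerating it is safe when, by assumption, the old output was a hardlink to a cached copy. The names `write_output_fresh` and `demo_remove_then_write` are hypothetical.

```python
import os
import tempfile


def write_output_fresh(path, content):
    """Remove any existing file at `path`, then write `content` to a new file."""
    if os.path.exists(path):
        # Unlinking drops only this name; other hardlinks to the old data are untouched.
        os.remove(path)
    with open(path, "w") as f:
        f.write(content)


def demo_remove_then_write():
    """Return (cached content, output content) after regenerating the output."""
    with tempfile.TemporaryDirectory() as tmp:
        cached = os.path.join(tmp, "cached_copy")  # hypothetical stand-in for a cache entry
        output = os.path.join(tmp, "model.pkl")    # hypothetical workspace file

        with open(cached, "w") as f:
            f.write("old model")
        os.link(cached, output)  # assumption: the output is a hardlink to the cached copy

        write_output_fresh(output, "new model")

        with open(cached) as f:
            cached_content = f.read()
        with open(output) as f:
            output_content = f.read()
    return cached_content, output_content


print(demo_remove_then_write())
```

Because the old name is unlinked first, the new output gets its own file and the cached copy keeps its original content.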

Do you mean that even though there is a train.dvc file (which essentially takes inputs and outputs under DVC control) you still sometimes want to run python train.py manually, or even the same command again:

dvc run -f train.dvc -d data.tsv -d train.py -o model.pkl python train.py 0.3

(maybe providing slightly different parameters)?

Is it the case you were thinking about?

@drorata
Contributor

drorata commented Sep 26, 2018

This is indeed the case I have in mind. So far I used dvc for tracking data files and artifacts, and skipped the dvc run part. Therefore, I found myself running train.py manually and potentially corrupting my cache.

@shcheklein
Member

@drorata gotcha! Thank you for the clarification. It's a valid use case and worth mentioning. I'm not sure we need a separate section specifically for it, especially since it's a little confusing: we also have `dvc repro` and `dvc run`, which manage intermediate artifacts. Let me think about it and mention this case in the documentation. I'll let you know so you can check whether it's fine with you. Thanks again!

@shcheklein
Member

@drorata please check this document again: https://dvc.org/doc/user-guide/update-tracked-file. I put a note about generated files at the very beginning; otherwise it's easy to miss, and it's not very well connected with the content above. Let me know if you have other ideas.

@drorata
Contributor

drorata commented Oct 1, 2018

See #88. Otherwise, great! Thanks.

4 participants