How to return to an older version of data #599
Comments
Hi @rahulguptajss ! Versions of data files are determined by the appropriate dvc files that store their md5 checksums. I.e. if you've added your training set with Example:
Edit: fixed typos.
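The idea described above, a small pointer file that records the md5 of the data and is itself versioned in Git, can be illustrated with a toy model. This is only a sketch under stated assumptions, not DVC's implementation: `toy_add` and `toy_checkout` are hypothetical helpers, the pointer file here holds nothing but the hash (real dvc files carry more metadata), and checking out an older pointer file with Git is simulated by rewriting it.

```python
import hashlib
import tempfile
from pathlib import Path


def toy_add(workspace, cache, name):
    """Hypothetical helper: copy the file into the cache under its md5
    and write a pointer file '<name>.dvc' that holds only that md5."""
    data = (workspace / name).read_bytes()
    digest = hashlib.md5(data).hexdigest()
    (cache / digest).write_bytes(data)
    (workspace / (name + ".dvc")).write_text(digest)
    return digest


def toy_checkout(workspace, cache, name):
    """Hypothetical helper: restore the data file named by the pointer file."""
    digest = (workspace / (name + ".dvc")).read_text()
    (workspace / name).write_bytes((cache / digest).read_bytes())


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        workspace, cache = Path(tmp, "workspace"), Path(tmp, "cache")
        workspace.mkdir()
        cache.mkdir()

        # Version 1 of the training set; Git would commit train.tsv.dvc here.
        (workspace / "train.tsv").write_bytes(b"v1\n")
        old_pointer = toy_add(workspace, cache, "train.tsv")

        # Version 2 replaces it; the pointer file now holds a different md5.
        (workspace / "train.tsv").write_bytes(b"v2\n")
        toy_add(workspace, cache, "train.tsv")
        newest = (workspace / "train.tsv").read_bytes()

        # Simulate checking out the older pointer file from Git history,
        # then restore the matching data from the cache.
        (workspace / "train.tsv.dvc").write_text(old_pointer)
        toy_checkout(workspace, cache, "train.tsv")
        restored = (workspace / "train.tsv").read_bytes()
        return newest, restored
```

The point of the sketch is that the data itself never goes through Git: only the pointer file changes between commits, and the cache keeps every version that was added.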
Thank you for the detailed explanation.
@rahulguptajss FYI: there was a typo in my last comment. There should be 'git add train.tsv.dvc' instead of 'git add train.tsv'. Kudos to @dmpetrov for noticing.
@efiop Thanks for the details! Very helpful, and it might be worth adding to the docs. Could you also elaborate on how to handle data versioning when the data is stored in the cloud (S3, for example)?
Hi @drorata !
Thank you for the feedback! Created iterative/dvc.org#78 to track the progress on it.
If you are talking about the external output scenario, then it is no different from what I've described above. If you are talking about a cache stored on S3 (e.g. https://dvc.org/doc/use-cases/share-data-and-model-files ), then it stores all the versions that you've pushed to it from your local workspace, so all the versioning happens locally and, once again, it is no different from my example above 🙂 Thanks,
The more I think about it, the more I see that this is a very important and central use case, and it has a tricky step:
Good point! I agree, it should be in https://dvc.org/doc/use-cases/data-and-model-files-versioning. I will add it ASAP. Thanks,
@drorata You are absolutely right! Fixed. Thank you!
I am a little confused. If I run What I just tried is to simply edit the file in the working directory and then
If you want to simply modify the file, then in general this:
should be replaced by this:
This is a general flow that is needed for hardlink/symlink cache types in order to avoid corrupting the cache for the previous version of the file.
Unless your workspace supports reflinks (if you are on a recent Mac, then chances are you are using reflinks) or you've manually specified Thanks,
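Why hardlinked files need this extra care can be shown with the standard library alone. The sketch below is an illustration under assumptions, not DVC code: the file names and the helpers `edit_in_place` and `edit_after_unlink` are hypothetical, and it assumes a filesystem that supports hardlinks. A workspace file hardlinked to a cache entry shares its contents, so overwriting it in place also changes the cached previous version, while removing the link first and writing a fresh file leaves the cache intact.

```python
import os
import tempfile
from pathlib import Path


def edit_in_place(cache_file, work_file):
    """Overwrite a hardlinked workspace file directly (the unsafe way)."""
    os.link(cache_file, work_file)   # workspace file shares the cache inode
    work_file.write_text("v2\n")     # truncates and rewrites the shared inode
    return cache_file.read_text()    # what the "old version" in the cache holds now


def edit_after_unlink(cache_file, work_file):
    """Remove the link first, then write a fresh file (the safe way)."""
    os.link(cache_file, work_file)
    work_file.unlink()               # drop the link; the cache entry survives
    work_file.write_text("v2\n")     # creates a brand-new, unrelated inode
    return cache_file.read_text()


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        cache_a, cache_b = root / "cache_a", root / "cache_b"
        cache_a.write_text("v1\n")
        cache_b.write_text("v1\n")
        corrupted = edit_in_place(cache_a, root / "work_a.tsv")
        preserved = edit_after_unlink(cache_b, root / "work_b.tsv")
        return corrupted, preserved
```

In the first case the cached "v1" is silently replaced by the new contents, which is the corruption the flow above is meant to avoid; in the second case the old version is still recoverable.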
Thanks for your detailed responses.
Try
If it was corrupted, dvc will print a warning and remove the corrupted cache.
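Conceptually, detecting a corrupted cache entry amounts to recomputing the checksum of the cached data and comparing it with the checksum the entry is stored under. The following is a toy sketch of that idea only; `drop_corrupted` is a hypothetical helper, and the flat cache directory where each file is named by its md5 is an assumption made for the example, not DVC's actual cache layout.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_hex(data):
    return hashlib.md5(data).hexdigest()


def drop_corrupted(cache):
    """Hypothetical helper: delete entries whose content no longer
    matches the md5 they are named after; return the removed names."""
    removed = []
    for entry in sorted(cache.iterdir()):
        if md5_hex(entry.read_bytes()) != entry.name:
            entry.unlink()
            removed.append(entry.name)
    return removed


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        cache = Path(tmp)
        good, bad = b"v1\n", b"v2\n"
        (cache / md5_hex(good)).write_bytes(good)
        # Simulate corruption: the entry named after v2's md5 holds other data.
        (cache / md5_hex(bad)).write_bytes(b"overwritten\n")
        removed = drop_corrupted(cache)
        remaining = sorted(p.name for p in cache.iterdir())
        return removed, remaining, md5_hex(good), md5_hex(bad)
```

Once a corrupted entry is dropped, that version can only be restored from another copy, for example one that was previously pushed to remote storage.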
I agree, and we are working on it. We track the progress in #799. A v1.0 is coming pretty soon, in which we are trying to improve dvc based on all the feedback we've received so far. We will be sure to highlight this point in the docs. Thank you for the feedback!
For the modification, shouldn't we use dvc unprotect?
@bayethiernodiop Yes, but it was added fairly recently and this post is from a year ago 🙂 We have the process described at https://dvc.org/doc/user-guide/how-to/update-tracked-files
The link is broken.
@safijari Fixed https://dvc.org/doc/user-guide/updating-tracked-files . Thanks for the heads up! 🙂
Hi,
If I change a data file (let's say my training set) and then run dvc repro, how do I revert to an older version of the data file, i.e. my older training set?
Thanks.