[UX] dvc remove safety #1524
Comments
Hi, @colllin. That sounds bad. First of all, let's try to run |
Sorry for the confusion though. Maybe we should put a note after |
@colllin just want to follow up on this and make sure that you were able to check out your data again. Please let us know how things are going.
Hey @shcheklein, thank you for the follow-up! I think I misled you into thinking that there was a trivial solution based on my example. Here's the backstory:
I guess it wasn't clear to me the correct way to go about modifying my data and updating the dvc references to replace the old dataset. I went into this by overwriting the tarball first, as I would in git (modify my file first, add/commit the changes later), but it wasn't clear how to do the same thing with dvc. I thought I might want to "replace the file [in dvc]", which led me to follow the instructions to erase the tarball I had just spent hours creating. That's obviously not what I wanted. Then I found Does my confusion and complaint about "safety" make more sense now? (Also, what's the right way to accomplish this?) Have you considered a stage & commit strategy more like git? As for making |
Another comment would be that I don't know the answers — I'm just trying to clarify the problems I'm feeling right now. |
@colllin your scenario makes sense! Thank you for such a detailed follow-up! There are a lot of questions in it. I'll try to address all of them (maybe not within a single comment, and once I have more clarity, and hopefully you do too). But before we go deep into this, could you clarify a little what your expectations are and what value you expect to get from DVC? I'm asking because the way to organize your DVC project might depend on this. For example, it looks like you don't care that much about tracking different versions of the images dataset, you are fine with a single version, and you can update it outside of the DVC project. Is that correct?
Yes, you can manually remove it by looking for an md5 in the remote cache. Mind, though, that it has a hierarchical structure internally - something like
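A minimal sketch of that lookup, for illustration only. It assumes the cache shards entries by the first two hex characters of the md5 (a directory) followed by the remaining thirty (the file name); that layout is an assumption here and may differ between DVC versions and remotes. `cache_relpath` is a hypothetical helper, not part of DVC, and the md5 would be the one recorded in the corresponding `.dvc` file.

```python
from pathlib import PurePosixPath


def cache_relpath(md5_hex):
    """Hypothetical helper: map an md5 to '<first 2 chars>/<remaining 30 chars>'.

    Assumption: the cache is sharded by the first two hex characters.
    """
    md5_hex = md5_hex.strip().lower()
    if len(md5_hex) != 32 or any(c not in "0123456789abcdef" for c in md5_hex):
        raise ValueError("expected a 32-character hex md5")
    return PurePosixPath(md5_hex[:2]) / md5_hex[2:]


if __name__ == "__main__":
    # md5 of an empty file, used here only as a well-known example value
    print(cache_relpath("d41d8cd98f00b204e9800998ecf8427e"))
```

The printed relative path would then be joined onto the remote's base location before deleting anything by hand; double-check the path against the remote first, since a manual delete cannot be undone by DVC.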
It definitely makes sense to store them as a tar at least; it should be easier to read a single file. I don't quite understand why you use compression, though. It might be you can switch to
Yes, protected mode is opt-in now. Probably we will make it a default soon and there will be no way to edit/modify files w/o running Regarding |
As we said, I don’t care about tracking changes to my dataset. I’m more interested in the linking between datasets and git commits, and the `dvc repro` framework for running experiments.
That said, sometimes I do need to modify my dataset, for example I discovered some corrupt labels and removed those samples.
Note that at this point I’m talking about modifying the extracted, uncached dataset — image files and mask files.
So then I re-tarred this dataset in the hopes of replacing the original tarball, while keeping the accompanying pipeline which extracts the tarball to a specific location.
After I had recreated the tarball, it was not clear how to add the new tarball to dvc as a replacement for the original.
At this point I tried `dvc remove some.tgz.dvc`, which then deleted the tarball I had just spent hours creating.
Are my state and goal clearer now? What was the correct way to accomplish it?
@colllin I think I have a better sense now. Just to be completely on the same page. Let's say you have To go back, and answer your initial questions. The workflow to update the tarball should have looked like this:
What does What does From what I see, it's definitely confusing to see that Another thing is that |
@shcheklein Thank you for your time and thorough responses. ❤️ I think the only confusing thing about my situation at the beginning is that I started out with the As a new user, this Feel free to close this issue if it is not helping you track anything. |
Sure, thank you @colllin for the valuable feedback! |
Backstory: I just `dvc remove`d 2 tarballs which took me several hours to generate 😭.

Suggestions:

`dvc remove some.tgz.dvc` makes the user think they're removing the dvc file, which would seem to have some sort of "unlinking" functionality. Instead, it leaves the `.dvc` file and deletes the file which is under dvc control. This is unintuitive and destructive, a nasty combination. If I wanted to delete that file, why wouldn't I just `rm some.tgz`? What, if anything, did dvc do beyond removing that file? It's not clear to me. I expected `dvc remove` to do the inverse of `dvc add`. Why doesn't it? Is there some other way to do the inverse of `dvc add`?

`This will remove train.tsv from the working dir` did not match the file in the command. Another suggestion might be to put the non-destructive option (`dvc unlink`) first, or clearly label those headings as [Destructive] and [Non-destructive].