Data files protection #1599
@tdeboissiere @sotte @ternaus @ophiry @colllin @villasv @polvoazul @drorata @Casyfill Guys, would love to have your feedback on this discussion. I have tried to explain it as simply as possible. Basically it's a tradeoff between performance and UX simplicity in the default DVC setup. As the most active users, you definitely have something to say on this.
Not an answer to your question, but maybe relevant. When not using DVC I set all my data to be immutable (locally as well as remote storage buckets). That means I can only add data (newer version, transformations, etc.), I can access all versions of the data at the same time, and I can not change existing data. Protecting data from accidental changes/corruption is super important and a requirement for reproducibility. Regarding copy: assume you have to copy >100GB, that would be a deal breaker. It just takes way too long.
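For illustration, the local half of such a setup can be as simple as making every published version read-only (this is just a sketch, the directory names are made up):

```
# publish a new, immutable version alongside the old ones; never edit in place
cp -r staging/ data/v2/
chmod -R a-w data/v1/ data/v2/
```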
Wonderful thread. Thanks for bringing it up and for inviting me. If I understand correctly, (2) means that linking will be reflink if possible and otherwise copy. My remarks next are given this understanding. There is no magic solution between performance and protection. As our data is so valuable, its protection should come first. I would favor (2) as it is not breaking workflows. I would also bring this fundamental issue to users' attention, ideally as part of the installation process. If a user prefers option (1) they can take it manually. In this case it is important that unprotecting a file should be accessible also from within python. This way, scripts can be adapted to the needed flow. As the last option, users can go back to the current default behavior - but this should be an active choice of the user and the ramifications should be clearly communicated.
I'm down for [1]. Every workflow that breaks is trying to corrupt the cache, at least that's what I get at first glance. [2] could be a first stage on the adoption plan for [1] though. Instead of just enforcing protection at first, make it a best practice in the tutorials, let the community sink that in, and only then break their workflows because they should've known better. No one should expect their workflows not to break if dvc has a new major release anyway.
@sotte thank you, it's really valuable! Let me clarify a few questions:
What about end results? Let's imagine we have a script that produces a model file, or a metrics file. Do you make your scripts specifically create a new copy every time? Have you ever had a script that was overwriting some results? In dvc terms it could look something like this:
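(A rough illustration of what I mean, the file and script names are made up:)

```
# train.py overwrites both outputs in place on every run
dvc run -d data.csv -d train.py -o model.pkl -o metrics.json "python train.py"

# ...later, re-running the script directly:
python train.py   # under option 1 this fails with a permission error,
                  # because model.pkl and metrics.json are now read-only
```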
Basically, the difference between the two options means that the first might fail at the very end of the training cycle if the files already exist (both model.pkl and metrics.json will be set to read-only after the first run). The user has to remember to run `dvc unprotect` first.
Totally! Do you think a message that an advanced mode is available would not be enough in this scenario? Basically, we can detect that files are large and that it would take too long to copy them, and warn the user. It might actually be an improvement to both options - introduce a flag that specifies what cache type is used per output. What do you think about this?
@drorata thanks! You are absolutely right, protection first. Just to clarify, both proposed solutions do protect files. The tradeoff is performance vs UX complexity in the default DVC setup. What do you think about that possible improvement, btw (check the previous answer, last paragraph)?
@villasv thanks!
Remember, the setup I described does not use DVC.
I normally create a lot of artifacts and metrics. The tracking tool also tracks the git version and parameters. DVC's metric tracking system never cut it for me. We use some external services for tracking parameters and metrics.
In my case the data would be written to another location automatically. I'd argue that a failed training is also a valuable training :)
Wait if
I have to say that I'm relatively happy with my workflow right now and haven't used DVC for our bigger projects. I also don't see myself overwriting files, so options 1 and 2 are not really relevant for me. I want access to different versions at the same time. This is one of DVC's big design decisions that I disagree with.
These two downsides shouldn't be happening anyway, so you'll only break people's projects if they were already broken, in the sense that they were likely causing cache corruption.
@villasv gotcha! I was thinking more about new users. It's not about backward compatibility. They already have a certain way of writing training scripts and they don't use `dvc unprotect`.
@shcheklein Absolutely. My point is that the barrier has always been there. Before, the barrier was a frustrating experience with caches being corrupted. The barrier then becomes some human-readable error in the CLI, very well detailed in a tutorial/FAQ on it. It was quite a barrier for me: I had to change all my scripts to make sure none of them would reuse the fs node, even if they were only truncating the file.
@villasv yep, agreed. It's a little bit nuanced because it's hard to make the error human-readable - users' code will be failing with some IOExceptions and we don't control that :( What about option [2]? It also gives protection (by using copy, or advanced reflinks if they are available). Imagine yourself doing your first project with DVC or migrating an existing project: what would be more painful - copying (with some message that a more efficient mode is available if it takes too long and files are big) or hitting some random IO errors and not being able to edit/move files?
My first few DVC projects dealt with datasets small enough that using copy would be fine, so option [2] would definitely have been easier. I was lucky, though. But again, I defend both options as good options. Both enhance protection. The trade-off is between harder adoption and lower default performance. If I was a new user today I can imagine each scenario.
Thank you for inviting me to this thread. I'm sensitive to the scenario you describe in [1] where a days-long script results in an IOException due to forgetting to unprotect an output file. I feel that it will be difficult to justify the value of DVC if/when work is lost in this way.

At work, I deal exclusively with annoyingly-large image datasets, so I'm also sensitive to the "copy" strategy in [2], mostly because my manager is sensitive to EBS costs. It appears that 2 out of 3 downsides to [2] are related to large datasets. Assuming that my team and I will usually end up on filesystems which don't support reflinks, I think I could survive the "copy" strategy in [2] if there was a mechanism for caching directly to the dvc remote. I would probably prefer this direct-to-remote cache mechanism over everyone remembering to upgrade their file system in option [2], or remembering to unprotect their dvc-tracked files in option [1]. I might need help thinking through any other ramifications of this behavior.

More on my thinking: I agree that it is of utmost importance to protect the integrity of the data which is added to dvc. If dvc has only one guarantee, it should be that data added to dvc is never silently corrupted.
1️⃣

I guess it would be something like the following:

```
sleep 1000 && echo "hello" > hello
dvc add hello                                       # chmod -w hello
dvc run -d hello -o greetings "cp hello greetings"
sleep 1000 && echo "hola" > hello
# permission denied: hello
```

Indeed, it is awful to have an exception like that after waiting for so long. The question is, why wouldn't you have something like the following instead:

```
dvc run -o hello "sleep 1000 && echo hello > hello"
dvc run -d hello -o greetings "cp hello greetings"
# sed -i "/cmd/ s/hello/hola/" hello.dvc
dvc repro hello.dvc
```

If you are modifying a protected file by hand, the worst that could happen is seeing a "permission denied" message, but you won't lose any data.

NOTE: This is only an example, not an accurate implementation:

```
autocmd BufWritePre * if !&modifiable | :!dvc unprotect <afile>
```
Why can't DVC be smarter about this one? It can use a cross-platform implementation to check the file system in use: https://github.com/giampaolo/psutil
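For example (just a sketch, not DVC's actual implementation; it assumes psutil is installed), something like this reports the filesystem type that the cache directory lives on:

```
python -c '
import os, psutil
path = os.path.abspath(".dvc/cache")
# pick the mounted partition whose mountpoint is the longest prefix of the path
parts = [p for p in psutil.disk_partitions(all=True) if path.startswith(p.mountpoint)]
best = max(parts, key=lambda p: len(p.mountpoint))
print(best.fstype)  # e.g. ext4, btrfs, apfs, ntfs
'
```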
2️⃣

I really like the idea of informing the user when DVC is doing some heavy work (copying big files); "friendly" approaches are always better (you can't assume users will read the manual looking to improve their experience with the tool).

Conclusion

My vote goes with number 2️⃣. I value the ease of use that it offers, and that you can have efficiency as an opt-in (with a recommendation from the tool when the cache starts to grow).
@shcheklein Just to make sure I understand: the difference between [2] and the current behavior is that currently, if reflink is not available, the fallback would be another available linking method which is not safe. Am I right? The more I think about it, the performance price of [2] looks like a price worth paying.
@drorata thank you for your thoughtful analysis! :)
Yep! To be precise, it's not that the link itself is unsafe; it's that we don't put files under protection (make them read-only) for the link types that require that protection.
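(In other words, roughly - a simplified sketch, not the exact implementation:)

```
# "putting a file under protection" essentially boils down to:
chmod a-w some.csv      # and `dvc unprotect some.csv` undoes this,
                        # making a full copy first if the link type requires it
```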
You are absolutely right again!
Great discussion! Thank you guys for your thoughtful opinions. It is not an easy choice... The 2️⃣-nd option definitely provides a better user experience while the 1️⃣-st one is better for large datasets. DVC was initially designed with large dataset scenarios in mind and this is why it (still) prioritizes hardlinks/symlinks by default.

Yet another opinion

I think that it might be a good choice to set the 2️⃣-nd option by default as the most common use case, but give an easy ability to switch to the 1️⃣-st option. The logic behind it: the best user experience for mass users, but if you have large data files then please do an extra step.

Issues

There are a few issues related to this opinion/choice:
- It can look like a flag for a particular output or stage.
So, this approach gives the ability to switch to 1️⃣ for a particular file. It requires some development - specify the cache type in dvc-files, not in the config file. What do you guys think about this combination: 2️⃣ by default and 1️⃣ per file?
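For what it's worth, a sketch of how the repository-level switch could look with the config options discussed in this thread (the per-output flag in the last line is hypothetical, not an existing option):

```
# option 2-style default: copy (or reflink where supported), always safe to edit
dvc config cache.type "reflink,copy"

# option 1-style advanced mode: links + read-only (protected) workspace files
dvc config cache.type "reflink,hardlink,symlink,copy"
dvc config cache.protected true

# hypothetical per-output override, not implemented at the time of this discussion:
# dvc add --cache-type hardlink data/images.tar
```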
@shcheklein wrote:
I believe the correct "formula" is a bit different. @dmpetrov In your example, just to make sure, is "slow" referring to the copy operation itself? Personally, I am worried that the on-a-file-level solution introduces too much complexity for a second-order problem. I think KISS is to be preferred, namely, play the safe side with option [2] and make sure the user is aware that the performance can be improved, but then there's a workflow complexity trade-off.
@drorata slow means copying, of course, and we can easily measure that. We cannot judge the performance of the user's command. It is a good point regarding complexity. We should think carefully about whether DVC can get enough benefits from this complexity increase. We can easily implement per-repository level protection and keep the per-file level as a feature request, and see if there is demand for that. It won't break backward compatibility.
@dmpetrov doesn't option [2] address the per-repository issue?
@drorata Both are per-repository (the strategy is defined in the config file). There is a possibility to make them per-file (override the strategy in the dvc file).
@shcheklein Yes, that sums it up — the difference is paying for 2TB EBS vs. 1TB EBS. EBS costs (without DVC) account for 15-20% of our annual AWS bill. Regarding @drorata's comment, I don't think this should be dismissed as O(1). While the cache size is constant (does not grow) with the number of versions, it does grow linearly with the size of my data, effectively doubling my disk requirements. If there is no (built-in) way to offload this cached copy (per my proposal above), then every time I modify my dataset I temporarily need room for even more copies. For reference, I ran some rough timings comparing copying into the cache plus uploading against uploading alone.
It surprises me, but it appears that the copy+upload approximately doubles the time compared to the upload alone, and I think it speaks to the validity of my proposal of caching directly to the dvc remote. Again, this gives me more control over my time and disk requirements, without adding steps to non-dvc-specific workflows. To clarify, because I think I understand better now what I was proposing: I think I proposed two features, both aimed at treating the remote as the primary cache.
Possibly dvc could warn when copying large files, and in that warning point users at the more advanced options. I understand that the use cases I'm discussing might be rare, and I fully expect you to take that into consideration.
I'm missing something: how (and why) is the remote cache influenced by the linking/copying strategy of DVC? If @colllin has a dataset which evolves and gets bigger, then each iteration has a unique copy in the cache and this is the only copy in the cache (local and remote alike). Each new version of the dataset has to be pushed to the remote cache upon `dvc push`.
The remote cache is not influenced by the linking/copying strategy of DVC. The remote cache is only relevant due to my observations/assumptions that (a) the remote cache is generally the true/safe destination which fulfills the full promise of DVC and that (b) I rarely revisit past versions — I need to have access to them, but it doesn't need to be immediate. Thus, in the context of option 2️⃣, I don't really care about having any local cache once it is safe in my remote cache, especially at the expense of time (copying) and disk space (copying plus dvc-imposed local cache behavior). So my proposal prioritizes the use of the remote cache in order to reduce the local time and disk space requirements. I'm not suggesting anything about space requirements for the remote cache — it's taken for granted that it grows with the number of versions and size of your data files.
I've also been maybe-too-indirectly suggesting that, if the main drawbacks of option 2️⃣ are related to large files, then 1️⃣ solves that by linking them and imposing a protect/unprotect workflow, whereas my proposal might offer an alternative which doesn't impose any changes to normal workflows outside of dvc-specific workflows, while minimizing the negative impacts of the "copy" strategy for large datasets. It's totally fine if it's not interesting — my goal right now isn't to push my proposal, but to make sure it is understood.
@colllin I think we're slowly converging :) Can you clarify what you meant with the x1, x2 and x3 factors?
I'll try. Based on my understanding of the "copy" strategy in option 2️⃣, the best case scenario, meaning that you've just added and pushed your data and cleaned up any older versions, is that you need x2 your dataset's size on disk: the working copy plus the copy in the local cache.
Now imagine that you want to modify this file and add the new version to dvc. You modify or overwrite your working copy, you `dvc add` it, and now the local cache holds both the old and the new version while the workspace holds the new one.
This is the "peak demand" situation I was referring to, where in order for dvc to work I need enough room for 3 copies of my data on disk. Sometime later, I can run `dvc gc` to drop the old versions from the local cache.
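To put rough numbers on the x2/x3 factors (illustrative only, using the 300GB dataset size mentioned later in the thread; these are not measurements):

```
# "copy" strategy, 300 GB dataset:
#   steady state:          300 GB working copy + 300 GB cache copy          ~= 600 GB
#   adding a new version:  new working copy + old cache copy + new cache    ~= 900 GB peak
#   after `dvc push` and `dvc gc` of old versions: back to                  ~= 600 GB
```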
Thanks for the clarification. I still think the extra local copy is a secondary concern. If you're short of space on your local machine, then you probably want to persist the cache remotely and garbage collect old stages/copies/versions/snapshots locally. In this case, when garbage collecting regularly, the local footprint stays bounded. Lastly, regardless of the chosen strategy, the remote storage costs won't change and they depend on how much "old" data you want to keep.
I see that when you're not working with large datasets, or when you have plenty of disk space, then you're typically keeping a substantial local cache, treating the remote as a backup, and the extra disk allocation for the local cache is not a concern. In this situation, I'm actually doing the opposite — I'm minimizing my local cache and treating the remote as the primary, safe copy. So, for me, the significant change from the current behavior to proposed option 2️⃣ is that I would now need to allocate a minimum of 3x the size of my working copy in order to meet the peak disk space required for dvc to function properly. Even if we take the EBS costs as negligible, this will still likely be surprising to users, and will certainly result in headaches around migrating to larger EBS volumes and trying to understand why I need to maintain a 1TB volume in order to work with a 300GB dataset. My proposal directly addresses these concerns, is a potential complete replacement for the proposed option 1️⃣, and in my understanding, my proposed strategy provides better UX and better control over the use of my machine's disk space and time.
@colllin it feels like in your case the initial data set is actually an external dependency. Can we do something like this to avoid caching (and even storing a single copy of the tarball on your machine):
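(A rough sketch of what that could look like; the S3 bucket, paths and the command inside the stage are made up, and it assumes the `aws` CLI and DVC's external dependencies are available. The tarball is tracked as an external dependency, so it is never copied into the local cache; only the extracted result is.)

```
dvc run -d s3://mybucket/raw/images.tar.gz -o images \
    "mkdir -p images && aws s3 cp s3://mybucket/raw/images.tar.gz - | tar -xzf - -C images"
```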
Would it solve the problem? I assume you do use the cache for all the subsequent steps. And space is not a big issue for them, is that correct?
I would vote for 2️⃣, since it's less likely to surprise a newbie. It should lessen friction in the beginning; there's already plenty to be careful of when you're starting out, so you don't need additional cognitive load. After you're done experimenting and learning, feel more comfortable, and want to optimize your disk usage (in the specific case where you're on a system that does not support reflinks!), then you can switch to "advanced mode". Kind of like the pythonic approach: make things easy until there's good enough reason to make them hard.
If @colllin is trying to minimize the number of versions he keeps locally, that is a different question. A working directory comprises one (and only one) copy of each item from the cache. You have to have enough space to host these items regardless of whether it is on a local machine or some cloud instance. The cache's size is the total of the sizes of all items and all their versions. This is naturally much larger than your local copy. If you need/want to have many versions, be prepared to pay for the storage. I don't see a way around it. If you're short on space but your data is huge, you will have to reduce the cache size. The strategy used by dvc is rather secondary and it won't magically allow you to enjoy both many versions and large volumes. Furthermore, @colllin actually suggests that the local cache is a minor consideration, as he runs `dvc gc` and relies on the remote.
Thanks for this thread. I feel a little dumb :( 'cause I was thinking that there is no way someone could prefer option 2 (I wasn't even thinking of it), 'cause I am new to DVC and just trying to understand it.
Related #1821
Guys, thanks a lot for this amazing discussion! We've decided to go with
I would like to bring more community attention to an important topic: protecting data artifacts that are under DVC control. Before we decide to implement a certain workflow (more on this below), it would be great if everyone shared their thoughts on which option looks more appealing and why.
First, let me state the problem - a shorter version and a longer one, which you can skip if you feel that you understand the shorter one. So, bear with me please, I'll try to give more explanations along the way.
Short version:
DVC links data files in the workspace (usually with a `hardlink`, sometimes a `reflink` or `symlink`, see the longer version for the detailed explanation) with their counterparts in the local cache (`.dvc/cache`). It is an optimization to avoid copying, to save time and space. The downside is that such files are not safe to edit in place: to modify or replace them, users have to run `dvc unprotect` or `dvc remove`.

⬇ LONG VERSION ⬇
When you do `dvc add` or `dvc checkout`, or other operations that result in DVC taking some files under its control, DVC calculates the `md5` of the file and puts it into the local `.dvc/cache`. This is done to save the file, and it is semantically similar to `git commit`. The naive way to perform this operation would be something like this:
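(Roughly, as an illustration, not DVC's actual code; the hash is made up:)

```
md5sum some.csv                               # -> z1234567890abcdefgh
cp some.csv .dvc/cache/z1234567890abcdefgh    # full copy stored under its checksum
```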
It basically creates a full copy of the file (`z1234567890abcdefgh`) which is addressable by its md5 sum. Instead, DVC tries to optimize this operation and runs something like this:
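(Again roughly; a hardlink is shown, but it could be a reflink or symlink:)

```
mv some.csv .dvc/cache/z1234567890abcdefgh     # content moves into the cache once
ln .dvc/cache/z1234567890abcdefgh some.csv     # workspace file becomes a link to it
```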
And your workspace looks like this now:
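(For the hardlink case, something like the following; inode numbers, sizes and dates are made up:)

```
$ ls -lih some.csv .dvc/cache/z1234567890abcdefgh
12345678 -rw-r--r-- 2 user user 1.5G Mar  1 10:00 .dvc/cache/z1234567890abcdefgh
12345678 -rw-r--r-- 2 user user 1.5G Mar  1 10:00 some.csv
# same inode, link count 2: both paths share the same content
```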
The goal is to avoid copying files! Thus, it gives you speed and saves space - two great benefits, especially if you deal with GBs of data! Nice, right? :)
To give you even more details, internally this link operation actually tries different types of links one by one (from the very best to the worst): `reflink`, `hardlink`, `symlink`, with a plain copy as the safe fallback.

| Type | Safe to write in the workspace |
| --- | --- |
| `reflink` | yes |
| `copy` | yes (but costs time and space) |
| `hardlink` | no |
| `symlink` | no |

To summarize: only `reflink` and `copy` are safe to modify in place (`dvc unprotect` is not required for them).

This optimization with links does not come for free. As you can see from the "Safe to write" column above, not all link types support file updates directly from the workspace. It ends up in a workflow like this. Let's imagine we have a file `some.csv` in the project. Before it was DVC-managed, nothing prevented us from adding more entries, modifying it in any other way, or even rewriting it (for example, if there is a script that generates it). Now, right after we did `dvc add some.csv`, even though from the user perspective it looks like a regular `some.csv`, it's not that simple file anymore - it might be a link and it might not be safe to edit anymore.

Before: `some.csv` is a plain standalone file, safe to edit.

Copy: it's safe to edit `some.csv` - it has its own copy of the content. But it takes time and space to create it, obviously.

Hardlink: it's not safe to edit. Both files share the same content. If you edit `some.csv`, the file in the cache that has its `md5` as an address will be corrupted. Basically, it will have an inconsistency - the path will not correlate with the content.

To mitigate this problem, we had to introduce two things:

- an optional protected mode, which makes files under DVC control read-only;
- the `dvc unprotect` command. It is syntax sugar to remove the read-only mode and, if needed, make a full copy so it's safe to edit the file.

Bottom line: having files unprotected by default and a mandatory `dvc unprotect`/`dvc remove` workflow to modify files creates problems and is confusing: #799, #599 (comment), #1524, etc.

Possible solutions:

There are two possible ways to make it consistent and safe for users:

1️⃣ Enable the `protected=true` mode by default. All files that are under DVC control are read-only. Users must run `dvc unprotect` (or `dvc remove`, depending on the use case) to modify, overwrite, or replace a file. Downsides:

- Users have to run `dvc unprotect` before modifying any file under DVC control. It means that a very regular workflow with data looks different.
- Scripts will fail if `dvc unprotect` has not been run. Might be confusing and frustrating to users. Let's imagine that a script that takes a few days to train a model fails at the very end with an `IOException` only because we put the model file under DVC control.
- `xfs` (default on some Linux distros) users will have to use the `dvc unprotect` workflow even though it's not required for them.

2️⃣ Enable `cache type=reflink,copy` by default. It means that users on almost all widespread file systems that do not support reflinks (like ext4, NTFS, etc.) will experience a performance downgrade and increased space consumption. Let users opt in to the `hardlinks` and `symlinks` + protected (`dvc unprotect` workflow) advanced mode only if it's needed, for example when they have to deal with GBs of data. Another option for these advanced users is to update the FS - use `btrfs` or `apfs`:

- Performance degradation by default on the most widespread file systems (`ext4` on Linux, NTFS on Windows, older macOS systems) because CoW links (reflinks) are not supported on them. We can write a message about the advanced mode if files are too big and it takes too long to add them to DVC.
- On the plus side, no need to learn/remember `dvc unprotect` unless you have GBs of data and really need to optimize for it.

Finally, the question is - which one should we pick as the default option? 1️⃣ vs 2️⃣? Am I missing some other arguments in favor of or against one of the options? Am I missing other ways to solve this nasty issue? Please share your thoughts, vote for one of the options, and explain your motivation. It's extremely important to hear you guys out!