cache: generalize save and checkout #5412
@@ -138,7 +138,7 @@ def changed(self, path_info, hash_info, filter_info=None):
"checking if '%s'('%s') has changed.", path_info, hash_info
)

if not self.tree.exists(path):
One of the sources of dvcignore-related problems, as self.tree (cache tree) doesn't use it.
dvc/cache/base.py
Outdated
self.makedirs(cache_info.parent)
self.move(path_info, cache_info)
self.tree.upload_fobj(fobj, cache_info)
Compared to the previous implementation, we don't move things to cache but rather copy them. This might be a problem if you want to add a file that can only fit once on your filesystem, but that's rather unusual, and you now have --to-remote to help you with that. A giant PRO here is that it is much easier to recover from dvc add failures (no space, permission problems, etc.), as the file/dir is lying right there, intact. Keeping the old behavior is not a problem and fits into the architecture, so I'll bring that back for now, but will start a discussion elsewhere. Might be a good start for the "design principles" doc.
On the other hand, we were about to look into recovering from such errors, and this is a good and intuitive way to do it. We've received multiple reports this year of dvc badly handling the initial dvc add when there are problems with space or permissions. That's especially bad with directories, where we might be stuck in a limbo with half the files in the cache and half in the workspace; recovering from such things is a nightmare. Keeping the safer new behavior after all.
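To illustrate the trade-off discussed above, here is a minimal standalone sketch, not the actual DVC implementation. The helper names (file_md5, cache_path, save_by_copy, save_by_move) and the ".tmp" suffix are hypothetical; only the standard library is used. With the copy variant, a failure partway through leaves the workspace file intact; with the move variant, the workspace file is gone once the call returns.

```python
import hashlib
import os
import shutil


def file_md5(path):
    """Return the hex MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_path(cache_dir, md5):
    # Assumed layout for this sketch: first two hex chars as a subdirectory.
    return os.path.join(cache_dir, md5[:2], md5[2:])


def save_by_copy(path, cache_dir):
    """Copy `path` into the cache; the workspace file is never modified."""
    dest = cache_path(cache_dir, file_md5(path))
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    tmp = dest + ".tmp"  # hypothetical temporary name
    try:
        shutil.copyfile(path, tmp)
        os.replace(tmp, dest)  # rename within the same directory
    except OSError:
        # On failure (no space, permissions), drop the partial copy.
        # The original at `path` is still there, untouched.
        if os.path.exists(tmp):
            os.remove(tmp)
        raise
    return dest


def save_by_move(path, cache_dir):
    """Move `path` into the cache; cheap on the same filesystem, but the
    workspace file no longer exists afterwards."""
    dest = cache_path(cache_dir, file_md5(path))
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    shutil.move(path, dest)
    return dest
```

For a directory, the move variant is what can leave half the files in the cache and half in the workspace if it fails midway, whereas the copy variant only ever leaves extra files in the cache.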
Sorry, not much context on my end, just reading your comments. It means we'll hit some performance issues as well, right? E.g., will it do a full copy even on a FS that supports reflinks?
Correct. 1 copy when adding the file, that's it.
So the process itself will be way, way longer on these modern file systems. Not sure this is good, to be honest.
... or any file system where people set symlinks/hardlinks I guess
Agreed, it would be better to be able to properly recover. I'm keeping it in this PR for now (in WIP), will tackle transfer now, and will then bring the optimization back.
OK, checked the next steps with transfer; a few more steps will be required to make it merge nicely. So I brought back the move optimization for now, and the behavior is unchanged.
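As a rough illustration of why the move optimization matters on filesystems with cheap links, here is a hedged sketch (again, not DVC's code). The function names link_or_copy and move_then_link_back are hypothetical, and a hardlink stands in for whichever link type is configured, since the standard library has no portable reflink call.

```python
import os
import shutil


def link_or_copy(src, dst):
    """Try a hardlink first; fall back to a full copy if linking fails
    (e.g. across filesystems or where hardlinks are unsupported)."""
    try:
        os.link(src, dst)
        return "hardlink"
    except OSError:
        shutil.copyfile(src, dst)
        return "copy"


def move_then_link_back(path, cache_file):
    """Move the workspace file into the cache, then recreate it in the
    workspace as a link, so no data is duplicated when linking works."""
    os.makedirs(os.path.dirname(cache_file), exist_ok=True)
    shutil.move(path, cache_file)
    return link_or_copy(cache_file, path)
```

When the link succeeds, adding a file costs a rename plus a link, with no data copied; the copy-based approach always pays for one full copy.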
Tried to combine this with |
def already_cached(self, path_info, tree):
assert path_info.scheme in ["", "local"]

return super().already_cached(path_info)
return super().already_cached(path_info, tree)
Will be handled in the followup related to checkout.
Pre-requisites for merging save and transfer as well as separating trees and their path_info's, and for the upcoming ODB.
I have followed the Contributing to DVC checklist.
If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here.
Thank you for the contribution - we'll try to review it as soon as possible.