Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cache: generalize save and checkout #5412

Merged
merged 1 commit into from
Feb 4, 2021
Merged

Conversation

efiop
Copy link
Contributor

@efiop efiop commented Feb 4, 2021

Pre-requisites for merging save and transfer as well as separating trees and their path_info's, and for the upcoming ODB.

Thank you for the contribution - we'll try to review it as soon as possible. πŸ™

@efiop efiop added refactoring Factoring and re-factoring skip-changelog Skips changelog labels Feb 4, 2021
@@ -138,7 +138,7 @@ def changed(self, path_info, hash_info, filter_info=None):
"checking if '%s'('%s') has changed.", path_info, hash_info
)

if not self.tree.exists(path):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of the sources of dvcignore-related problems, as self.tree(cache tree) doesn't use it.

self.makedirs(cache_info.parent)
self.move(path_info, cache_info)
self.tree.upload_fobj(fobj, cache_info)
Copy link
Contributor Author

@efiop efiop Feb 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Compared to the previous implementation, we don't move things to cache but rather copy them. This might be a problem in case you want to add a file that can only fit once in your filesystem, but that's rather unusual and you have --to-remote now to help you with that. A giant PRO here is that it is much easier to recover from dvc add failures (no space, permission problems, etc), as the file/dir is lying right there, intact. Keeping old behavior is not a problem and fits into the arch, so I'll bring that back for now, but will start a discussion elsewhere. Might be a good start for the "design principles" doc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the other hand we were about to look into recovering from such errors and this is a good and intuitive way to do it. We've received multiple reports over this year of dvc badly handling initial dvc add when there are problems with space or permissions and that's especially bad with directories, where we might be stuck in a limbo of half the files being in cache and half being in the workspace, recovering from such things is a nightmare. Keeping the safer new behavior after all.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry not much context on my end, just reading your comments. it means we'll hit some performance issues as well, right? E.g. will it do a full copy even on a FS that supports reflinks?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. 1 copy when adding the file, that's it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so, the process itself will be way way longer on these modern file systems. Not sure this is good to be honest.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... or any file system where people set symlinks/hardlinks I guess

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, it would be better to be able to properly recover. I'm keeping it in this PR for now(in WIP), will tackle transfer now and will then bring the optimization back.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, checked the next steps with transfer, there will be a bit more steps required to make it merge nicely. So brought back the move optimization for now. So the behavior is unchanged.

@efiop efiop mentioned this pull request Feb 4, 2021
11 tasks
@efiop
Copy link
Contributor Author

efiop commented Feb 4, 2021

Tried to combine this with transfer, but our State got in the way again, as it is not really meant for multithreading(we were noticing some race conditions in trees too because of it). Might be a time to use some ORM to handle that for us.

@efiop efiop changed the title [WIP] cache: generalize save and checkout cache: generalize save and checkout Feb 4, 2021
Comment on lines +84 to +87
def already_cached(self, path_info, tree):
assert path_info.scheme in ["", "local"]

return super().already_cached(path_info)
return super().already_cached(path_info, tree)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will be handled in the followup related to checkout.

@efiop efiop merged commit 8676617 into iterative:master Feb 4, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
refactoring Factoring and re-factoring skip-changelog Skips changelog
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants