-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. Weβll occasionally send you account related emails.
Already on GitHub? Sign in to your account
remote: continue remote/cache/tree refactoring #4041
Conversation
* operates on obj.tree, so that GC can be done with both remote and cache instances
b27d57c
to
aef4701
Compare
|
||
class CloudCache: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quite strange why it is CloudCache and not just Cache? And why LocalCache is the only exception?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, right, because of the push/pull. Ok, makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went with CloudCache to avoid the collision with dvc.cache.Cache
, but maybe this should be Cache, and the other one should be RepoCache
or something along those lines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pmrowla Makes sense. I'm fine with CloudCache for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can also get rid of LocalCache in the future by making push/pull/status more generic for syncing data between any Cache and any Remote, rather than having it be specific to LocalCache, but I don't think there's any pressing need for that right now.
def get_remote(repo, **kwargs): | ||
tree = get_cloud_tree(repo, **kwargs) | ||
if tree.scheme == "local": | ||
return LocalRemote(tree) | ||
if tree.scheme == "ssh": | ||
return SSHRemote(tree) | ||
return Remote(tree) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For initializing caches, we now just do
tree = get_cloud_tree(<remote config>)
cache = CloudCache(tree)
But for remotes, we still have this helper method. It would be ideal if we could just always use Remote(tree)
but LocalRemote
and SSHRemote
still have their own overridden batch exists methods. It may be worth investigating whether or not the custom ssh/local methods are still needed with the other performance improvements that we have now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or maybe those custom methods could be moved to the trees. IIRC ssh uses some custom sftp channel handling, which might be abstracted away too. But surely could do that later. Maybe consider creating an issue for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One other question I had was whether or not the indexing related code should also just go into the trees at some point instead of being DVC Remote specific
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pmrowla Oh, that's a very good question! To me, it seems purely remote-related, as it solely relies on remote structure. But how do you see it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since it's remote specific it makes sense to keep it in the Remote class for now, but I wasn't sure if indexing any kind of content (as opposed to just DVC Remotes) in a remote/cloud tree is something we might want to support in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure there is a good way to index arbitrary data in that way, it is really tailored for our cache structure, so I really doubt that full index logic will ever make its way into the Tree. But, I'm sure that parts of it will, as many functions are actually quite generic (like walk_files replacing _list_cache).
β I have followed the Contributing to DVC checklist.
π If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here. If the CLI API is changed, I have updated tab completion scripts.
β I will check DeepSource, CodeClimate, and other sanity checks below. (We consider them recommendatory and don't expect everything to be addressed. Please fix things that actually improve code or fix bugs.)
Thank you for the contribution - we'll try to review it as soon as possible. π
Related to #3882.
remote.Remote
andremote.Cache
helper functions with actualRemote
andCloudCache
classesRemote
andCloudCache
constructors take a RemoteTree parameterremote.get_cloud_tree
helper method can be used to return the proper tree for a given URL