odb: treat external repo dependencies as objects #6109
Conversation
For the record: discussed with @pmrowla that it might be worth not creating a new object type, but rather collecting regular objects and returning them with an odb from the erepo.
path_info = PathInfo(repo.root_dir) / str(self.def_path)
try:
    for odb, objs in repo.used_objs(
This ends up making us support #3305 - if there's a chained import here, we will end up recursively calling dep.get_used_objs for each repo in the chain.
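For illustration, a minimal sketch of how that recursion plays out; used_objs/get_used_objs come from the diff above, while the repo.deps shape and the helper itself are assumptions made up for the example:

from collections import defaultdict


def collect_used_objs(repo, **kwargs):
    """Hypothetical aggregation helper (not the actual DVC code).

    If a dependency is itself an import from another repo, its
    get_used_objs() opens that repo and calls used_objs() there, so a
    chain of imports is walked one repo at a time.
    """
    used = defaultdict(set)
    for dep in repo.deps:  # assumed: iterable of dependency objects
        for odb, objs in dep.get_used_objs(**kwargs).items():
            used[odb].update(objs)
    return used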
from dvc.objects.stage import stage
from dvc.path_info import PathInfo
from dvc.scm.base import CloneError

def _fetch_naive_objs(repo, objs, **kwargs):
Once the remote push/pull are unified into ODB save, we will be able to get rid of these separate fetch helpers
Just to clarify, you mean that all three will go away, meaning that the default odb, custom odbs (e.g. for externals, or future per-output remotes) and the git odb will all have the same fetching logic, right?
Yes, that's correct.
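To sketch what that unification could look like (purely hypothetical shape; only the add() signature mirrors the ObjectDB.add in this PR, and obj having path_info/hash_info attributes is an assumption):

def transfer_used_objs(dest_odb, used):
    # Hypothetical post-unification path: one loop regardless of whether
    # the source is the default ODB, a custom ODB (externals, per-output
    # remotes) or a GitObjectDB.
    for src_odb, objs in used.items():
        for obj in objs:
            # saving into the destination ODB replaces the per-kind
            # fetch helpers such as _fetch_naive_objs()
            dest_odb.add(obj.path_info, src_odb.fs, obj.hash_info, move=False)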
return hash(self.odb)

@classmethod
def from_odb(cls, odb):
This is a temporary workaround until the push/pull/save unification is completed (since we will no longer need Remote)
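Roughly, the bridge amounts to something like the sketch below; the Remote constructor arguments and the odb attribute are assumptions for illustration, not the exact code:

@classmethod
def from_odb(cls, odb):
    # Temporary bridge: push/pull (and the remote index) still live on
    # Remote, so rebuild a Remote around the ODB's filesystem and path
    # until the push/pull/save unification removes the need for it.
    remote = cls(odb.fs, odb.path_info, tmp_dir=odb.tmp_dir)  # assumed ctor
    remote.odb = odb
    return remote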
if status_ == cloud.STATUS_OK:
    for odb, objs in used.items():
        if odb is not None:
            # ignore imported objects
As discussed, we may want to start reporting status for imports later, but for now we can just skip any objects that come from a non-default ODB (anything other than the default None entry).
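Spelled out, the filter amounts to something like this (the continue and the counting step are assumed; the loop header mirrors the diff):

for odb, objs in used.items():
    if odb is not None:
        # objects staged from an imported/external repo: skipped from
        # status reporting for now
        continue
    # only objects from the default (None) ODB feed the status output
    report_status(objs)  # hypothetical helper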
@@ -29,6 +29,23 @@ def __init__(self, fs, path_info, **config):
    self.cache_types = config.get("type") or copy(self.DEFAULT_CACHE_TYPES)
    self.cache_type_confirmed = False
    self.slow_link_warning = config.get("slow_link_warning", True)
    self.tmp_dir = config.get("tmp_dir")
tmp_dir isn't actually used anywhere in ODB yet, but it will be once the current remote index is moved out of Remote into ODB.
In the meantime, this is needed to be able to construct a Remote object from an ODB instance.
# objs contains staged import objects which should be saved
# last (after all other objects have been pulled)
external.update(objs)
Is this only to keep previous behavior or is there some additional reason to keep fetching these last?
This is to preserve the current fetch optimization for imported DVC objects and should be able to go away once pull/save are unified.
If we just call save() on the GitODB objects right now, we will end up streaming/downloading any imported DVC-tracked files during the save(), but as we discussed it would be better to use fetch/pull to get all of the DVC-tracked files for now.
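Condensed, the ordering looks roughly like this; _fetch_naive_objs comes from the diff above, while _fetch_external_objs is an assumed name for the deferred step:

external = set()
for odb, objs in used.items():
    if odb is None:
        # regular objects: pull them from the default remote first
        _fetch_naive_objs(repo, objs, **kwargs)
    else:
        # staged import objects: defer so they can go through the
        # optimized fetch path instead of being streamed during save()
        external.update(objs)
if external:
    # fetched last, once everything else is already in the local cache
    _fetch_external_objs(repo, external, **kwargs)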
class GitObjectDB(ObjectDB):
    """Dummy read-only ODB for uncached objects in external Git repos."""

    def __init__(self, fs, path_info, **config):
        from dvc.fs.repo import RepoFileSystem

        assert isinstance(fs, RepoFileSystem)
        super().__init__(fs, path_info)

    def get(self, hash_info):
        raise NotImplementedError

    def add(self, path_info, fs, hash_info, move=True, **kwargs):
        raise NotImplementedError

    def gc(self, used, jobs=None):
        raise NotImplementedError
So currently for git files we go the stage() + save() way for simplicity, but how do you see this odb in the future? Right now we want md5s from git files, but the git odb can provide us with sha*s (and even then there are some details to it), so we'll need some translation layer? Or is there a more elegant way?
Or maybe we just need to convert them (aka stage() & save(), like we do right now) on-the-fly from GitFS to LocalODB, so that we just use the resulting local objects right away? Since that is essentially the translation layer, and once these files are converted into our md5-based objs, they are no longer in the real git odb (I mean the real-real git odb).
Yeah I've been thinking about what we can do here, since git files won't be structured like a regular DVC ODB.
In theory though, if we move to using SHAs for git imports (or in general eventually), we could do a complete ODB implementation for translating git blob storage into DVC object storage. And lookups like hashes_exist could be done directly through git instead of using the fs walk/find method we do for other ODBs.
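One way to picture that translation layer (purely illustrative, using plain git plumbing rather than any real DVC class): an ODB keyed by git blob SHAs whose existence checks go straight to git instead of walking a filesystem.

import subprocess


class GitBlobODB:
    """Illustrative only: an object DB backed directly by a git object
    store, assuming object hashes are git blob SHAs rather than md5s."""

    def __init__(self, git_dir):
        self.git_dir = git_dir

    def hashes_exist(self, hashes):
        # `git cat-file --batch-check` answers existence from the object
        # store itself, replacing the fs walk/find used by other ODBs
        proc = subprocess.run(
            ["git", "--git-dir", self.git_dir, "cat-file", "--batch-check"],
            input="\n".join(hashes),
            capture_output=True,
            text=True,
            check=True,
        )
        present = set()
        for line in proc.stdout.splitlines():
            name, *rest = line.split()
            if rest and rest[0] != "missing":
                present.add(name)
        return present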
Staging ODB will be addressed in a follow-up PR.
I understand this implies that chained imports are now possible per https://github.com/iterative/dvc/issues/3305#issuecomment-871048690. If so, it may require some docs updates depending on the answer to https://github.com/iterative/dvc/issues/3305#issuecomment-871052955.
I have followed the Contributing to DVC checklist.
If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here.
Thank you for the contribution - we'll try to review it as soon as possible.