Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

odb: treat external repo dependencies as objects #6109

Merged
merged 23 commits into from
Jun 17, 2021

Conversation

pmrowla
Copy link
Contributor

@pmrowla pmrowla commented Jun 3, 2021

Thank you for the contribution - we'll try to review it as soon as possible. πŸ™

@pmrowla pmrowla added the refactoring Factoring and re-factoring label Jun 3, 2021
@pmrowla pmrowla self-assigned this Jun 3, 2021
dvc/repo/fetch.py Outdated Show resolved Hide resolved
dvc/repo/fetch.py Outdated Show resolved Hide resolved
dvc/objects/external.py Outdated Show resolved Hide resolved
dvc/objects/__init__.py Outdated Show resolved Hide resolved
@pmrowla pmrowla changed the title [WIP] odb: treat external repo dependencies as objects odb: treat external repo dependencies as objects Jun 4, 2021
@pmrowla pmrowla marked this pull request as ready for review June 4, 2021 09:01
@pmrowla pmrowla requested a review from efiop June 4, 2021 09:03
@efiop
Copy link
Contributor

efiop commented Jun 4, 2021

For the record: discussed with @pmrowla that it might be worth it to not create a new object type, but rather to collect regular objects and return them with an odb from the erepo.

@pmrowla pmrowla marked this pull request as draft June 7, 2021 07:03
@pmrowla pmrowla changed the title odb: treat external repo dependencies as objects [WIP] odb: treat external repo dependencies as objects Jun 8, 2021
dvc/objects/__init__.py Outdated Show resolved Hide resolved
dvc/dependency/repo.py Outdated Show resolved Hide resolved

path_info = PathInfo(repo.root_dir) / str(self.def_path)
try:
for odb, objs in repo.used_objs(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This ends up making us support #3305 - if there's a chained import here, we will end up recursively calling dep.get_used_objs for each repo in the chain.

dvc/objects/db/base.py Outdated Show resolved Hide resolved
from dvc.objects.stage import stage
from dvc.path_info import PathInfo
from dvc.scm.base import CloneError
def _fetch_naive_objs(repo, objs, **kwargs):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once the remote push/pull are unified into ODB save, we will be able to get rid of these separate fetch helpers

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to clarify, you mean that all three will go away, meaning that default odb, custom odb(e.g. for externals, or future per-output remotes) and git odb will all have the same fetching logic, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that's correct

return hash(self.odb)

@classmethod
def from_odb(cls, odb):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a temporary workaround until the push/pull/save unification is completed (since we will no longer need Remote)

if status_ == cloud.STATUS_OK:
for odb, objs in used.items():
if odb is not None:
# ignore imported objects
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed, we may want to start reporting status for imports later, but for now we can just skip any objects that come from non-default (None) odbs

@@ -29,6 +29,23 @@ def __init__(self, fs, path_info, **config):
self.cache_types = config.get("type") or copy(self.DEFAULT_CACHE_TYPES)
self.cache_type_confirmed = False
self.slow_link_warning = config.get("slow_link_warning", True)
self.tmp_dir = config.get("tmp_dir")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tmp_dir isn't actually used anywhere in ODB yet, but it will be once the current remote index is moved out of Remote into ODB.

In the meantime, this is needed to be able to construct a Remote object from an ODB instance.

@pmrowla pmrowla marked this pull request as ready for review June 10, 2021 08:50
@pmrowla pmrowla changed the title [WIP] odb: treat external repo dependencies as objects odb: treat external repo dependencies as objects Jun 10, 2021
Comment on lines +80 to +82
# objs contains staged import objects which should be saved
# last (after all other objects have been pulled)
external.update(objs)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this only to keep previous behavior or is there some additional reason to keep fetching these last?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is to preserve the current fetch optimization for imported DVC objects and should be able to go away once pull/save are unified.

If we just call save() on the GitODB objects right now, we will end up streaming/downloading any imported DVC-tracked files, during the save(), but as we discussed it would be better to use fetch/pull to get all of the DVC-tracked files for now.

Comment on lines +8 to +24
class GitObjectDB(ObjectDB):
"""Dummy read-only ODB for uncached objects in external Git repos."""

def __init__(self, fs, path_info, **config):
from dvc.fs.repo import RepoFileSystem

assert isinstance(fs, RepoFileSystem)
super().__init__(fs, path_info)

def get(self, hash_info):
raise NotImplementedError

def add(self, path_info, fs, hash_info, move=True, **kwargs):
raise NotImplementedError

def gc(self, used, jobs=None):
raise NotImplementedError
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So currently for git files we go the stage() + save() way for simplicity, but how do you see this odb in the future? Right now we want md5s from git files, but git odb can provide us with sha*s (and even then there are some details to it), so we'll need some translation layer? Or is there a more elegant way?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or maybe we just need to convert them (aka stage()&save() like we do right now) on-the-fly from GitFS to LocalODB, so that we just use the resulting local objects right away? Since that is essentially the translation layer and once these files are converted into our md5-based objs, they are no longer in real git odb (i mean real-real git odb)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I've been thinking about what we can do here, since git files won't be structured like a regular DVC ODB.

In theory though, if we move to using SHA's for git imports (or in general eventually), we could do an complete ODB implementation for translating git blob storage into DVC object storage. And lookups like hashes_exist could be done directly through git instead of using the fs walk/find method we do for other ODBs.

@pmrowla
Copy link
Contributor Author

pmrowla commented Jun 17, 2021

Staging ODB will addressed in a follow-up PR

@efiop efiop merged commit 1d3524f into iterative:master Jun 17, 2021
@pmrowla pmrowla deleted the odb-external branch June 18, 2021 01:30
@jorgeorpinel
Copy link
Contributor

I understand this implies that chained imports are now possible per https://github.com/iterative/dvc/issues/3305#issuecomment-871048690 πŸŽ‰ If so, it may require some docs updates depending on the answer to https://github.com/iterative/dvc/issues/3305#issuecomment-871052955.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
refactoring Factoring and re-factoring
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants