odb: treat external repo dependencies as objects #6109
Conversation
For the record: discussed with @pmrowla that it might be worth not creating a new object type, but rather collecting regular objects and returning them with an odb from the erepo.
path_info = PathInfo(repo.root_dir) / str(self.def_path)
try:
    for odb, objs in repo.used_objs(
This ends up making us support #3305 - if there's a chained import here, we will end up recursively calling dep.get_used_objs for each repo in the chain.
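For illustration, a minimal sketch of how that recursion plays out; used_objs/get_used_objs come from the diff above, while the repo.deps shape and the helper itself are assumptions made up for the example:

from collections import defaultdict


def collect_used_objs(repo, **kwargs):
    """Hypothetical aggregation helper (not the actual DVC code).

    If a dependency is itself an import from another repo, its
    get_used_objs() opens that repo and calls used_objs() there, so a
    chain of imports is walked one repo at a time.
    """
    used = defaultdict(set)
    for dep in repo.deps:  # assumed: iterable of dependency objects
        for odb, objs in dep.get_used_objs(**kwargs).items():
            used[odb].update(objs)
    return used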
from dvc.objects.stage import stage
from dvc.path_info import PathInfo
from dvc.scm.base import CloneError

def _fetch_naive_objs(repo, objs, **kwargs):
Once the remote push/pull are unified into ODB save, we will be able to get rid of these separate fetch helpers
Just to clarify, you mean that all three will go away, meaning that the default odb, custom odbs (e.g. for externals, or future per-output remotes) and the git odb will all have the same fetching logic, right?
Yes, that's correct.
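To sketch what that unification could look like (purely hypothetical shape; only the add() signature mirrors the ObjectDB.add in this PR, and obj having path_info/hash_info attributes is an assumption):

def transfer_used_objs(dest_odb, used):
    # Hypothetical post-unification path: one loop regardless of whether
    # the source is the default ODB, a custom ODB (externals, per-output
    # remotes) or a GitObjectDB.
    for src_odb, objs in used.items():
        for obj in objs:
            # saving into the destination ODB replaces the per-kind
            # fetch helpers such as _fetch_naive_objs()
            dest_odb.add(obj.path_info, src_odb.fs, obj.hash_info, move=False)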
return hash(self.odb)

@classmethod
def from_odb(cls, odb):
This is a temporary workaround until the push/pull/save unification is completed (since we will no longer need Remote)
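Roughly, the bridge amounts to something like the sketch below; the Remote constructor arguments and the odb attribute are assumptions for illustration, not the exact code:

@classmethod
def from_odb(cls, odb):
    # Temporary bridge: push/pull (and the remote index) still live on
    # Remote, so rebuild a Remote around the ODB's filesystem and path
    # until the push/pull/save unification removes the need for it.
    remote = cls(odb.fs, odb.path_info, tmp_dir=odb.tmp_dir)  # assumed ctor
    remote.odb = odb
    return remote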
if status_ == cloud.STATUS_OK:
    for odb, objs in used.items():
        if odb is not None:
            # ignore imported objects
As discussed, we may want to start reporting status for imports later, but for now we can just skip any objects that come from a non-default ODB (anything other than the default None entry).
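Spelled out, the filter amounts to something like this (the continue and the counting step are assumed; the loop header mirrors the diff):

for odb, objs in used.items():
    if odb is not None:
        # objects staged from an imported/external repo: skipped from
        # status reporting for now
        continue
    # only objects from the default (None) ODB feed the status output
    report_status(objs)  # hypothetical helper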
@@ -29,6 +29,23 @@ def __init__(self, fs, path_info, **config):
    self.cache_types = config.get("type") or copy(self.DEFAULT_CACHE_TYPES)
    self.cache_type_confirmed = False
    self.slow_link_warning = config.get("slow_link_warning", True)
    self.tmp_dir = config.get("tmp_dir")
tmp_dir isn't actually used anywhere in ODB yet, but it will be once the current remote index is moved out of Remote into ODB.
In the meantime, this is needed to be able to construct a Remote object from an ODB instance.
# objs contains staged import objects which should be saved
# last (after all other objects have been pulled)
external.update(objs)
Is this only to keep previous behavior or is there some additional reason to keep fetching these last?
This is to preserve the current fetch optimization for imported DVC objects and should be able to go away once pull/save are unified.
If we just call save() on the GitODB objects right now, we will end up streaming/downloading any imported DVC-tracked files during the save(), but as we discussed it would be better to use fetch/pull to get all of the DVC-tracked files for now.
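Condensed, the ordering looks roughly like this; _fetch_naive_objs comes from the diff above, while _fetch_external_objs is an assumed name for the deferred step:

external = set()
for odb, objs in used.items():
    if odb is None:
        # regular objects: pull them from the default remote first
        _fetch_naive_objs(repo, objs, **kwargs)
    else:
        # staged import objects: defer so they can go through the
        # optimized fetch path instead of being streamed during save()
        external.update(objs)
if external:
    # fetched last, once everything else is already in the local cache
    _fetch_external_objs(repo, external, **kwargs)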
class GitObjectDB(ObjectDB):
    """Dummy read-only ODB for uncached objects in external Git repos."""

    def __init__(self, fs, path_info, **config):
        from dvc.fs.repo import RepoFileSystem

        assert isinstance(fs, RepoFileSystem)
        super().__init__(fs, path_info)

    def get(self, hash_info):
        raise NotImplementedError

    def add(self, path_info, fs, hash_info, move=True, **kwargs):
        raise NotImplementedError

    def gc(self, used, jobs=None):
        raise NotImplementedError
So currently for git files we go the stage() + save() way for simplicity, but how do you see this odb in the future? Right now we want md5s from git files, but the git odb can provide us with sha*s (and even then there are some details to it), so we'll need some translation layer? Or is there a more elegant way?
Or maybe we just need to convert them (aka stage() & save(), like we do right now) on-the-fly from GitFS to LocalODB, so that we just use the resulting local objects right away? Since that is essentially the translation layer, and once these files are converted into our md5-based objs, they are no longer in the real git odb (I mean the real-real git odb).
Yeah I've been thinking about what we can do here, since git files won't be structured like a regular DVC ODB.
In theory though, if we move to using SHAs for git imports (or in general eventually), we could do a complete ODB implementation for translating git blob storage into DVC object storage. And lookups like hashes_exist could be done directly through git instead of using the fs walk/find method we do for other ODBs.
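One way to picture that translation layer (purely illustrative, using plain git plumbing rather than any real DVC class): an ODB keyed by git blob SHAs whose existence checks go straight to git instead of walking a filesystem.

import subprocess


class GitBlobODB:
    """Illustrative only: an object DB backed directly by a git object
    store, assuming object hashes are git blob SHAs rather than md5s."""

    def __init__(self, git_dir):
        self.git_dir = git_dir

    def hashes_exist(self, hashes):
        # `git cat-file --batch-check` answers existence from the object
        # store itself, replacing the fs walk/find used by other ODBs
        proc = subprocess.run(
            ["git", "--git-dir", self.git_dir, "cat-file", "--batch-check"],
            input="\n".join(hashes),
            capture_output=True,
            text=True,
            check=True,
        )
        present = set()
        for line in proc.stdout.splitlines():
            name, *rest = line.split()
            if rest and rest[0] != "missing":
                present.add(name)
        return present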
Staging ODB will be addressed in a follow-up PR.
I understand this implies that chained imports are now possible per https://github.com/iterative/dvc/issues/3305#issuecomment-871048690. If so, it may require some docs updates depending on the answer to https://github.com/iterative/dvc/issues/3305#issuecomment-871052955.
I have followed the Contributing to DVC checklist.
If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here.
Thank you for the contribution - we'll try to review it as soon as possible.