objects: use object IDs and references instead of naive objs in status/transfer #6360

pmrowla · 2021-07-23T11:16:51Z

❗ I have followed the Contributing to DVC checklist.
📖 If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here.

Thank you for the contribution - we'll try to review it as soon as possible. 🙏

- Adds ReferenceHashFile object type and ReferenceObjectDB ODB type
  - Ref ODB stores references to files outside the ODB fs, so the object at hash abc123 is a serialized reference to the filesystem and path where the actual source file with hash abc123 exists
- objects.stage now stages objects as references in a memfs-based ref ODB
- objects.save is removed in favor of a "stage, then transfer from staging to ODB" pattern
- objects.status and objects.transfer now accept a collection of object IDs (hashinfos) instead of naive objects
  - repo/stage/output object collection now collects object IDs instead of naive objects as well
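
To make the reference idea above concrete, here is a rough toy sketch. It is not the code from this PR: `Reference`, `ToyRefODB`, `stage` and `transfer` are hypothetical names, a plain dict stands in for the destination ODB, and the local filesystem is assumed to be the only source fs.

```python
import hashlib
import json
import os
import tempfile
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Reference:
    """Hypothetical serialized pointer to a file that lives outside the ODB."""
    fs: str    # name of the source filesystem (only "local" in this sketch)
    path: str  # path of the real file on that filesystem


class ToyRefODB:
    """In-memory stand-in for a reference ODB: hash -> serialized Reference."""

    def __init__(self):
        self._refs = {}

    def add(self, oid, ref):
        self._refs[oid] = json.dumps(asdict(ref))

    def get(self, oid):
        return Reference(**json.loads(self._refs[oid]))


def md5_of(path):
    with open(path, "rb") as fobj:
        return hashlib.md5(fobj.read()).hexdigest()


def stage(ref_odb, path):
    """Stage a file as a reference and return its object ID (the hash)."""
    oid = md5_of(path)
    ref_odb.add(oid, Reference(fs="local", path=path))
    return oid


def transfer(src, dest, oids):
    """Copy the data behind each object ID from staging into dest (a dict)."""
    for oid in oids:
        ref = src.get(oid)
        with open(ref.path, "rb") as fobj:
            dest[oid] = fobj.read()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        data_path = os.path.join(tmp, "data.txt")
        with open(data_path, "wb") as fobj:
            fobj.write(b"hello")
        staging, odb = ToyRefODB(), {}
        oid = stage(staging, data_path)
        transfer(staging, odb, {oid})
        print(oid, odb[oid])
```

The point of the sketch is that staging only records where the data lives, and the actual bytes are read once, when the object IDs are transferred to the destination.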

efiop · 2021-07-27T18:01:59Z

dvc/objects/db/reference.py

+    def get(self, hash_info: "HashInfo"):
+        if hash_info.isdir:
+            return super().get(hash_info)
+        path_info = self.hash_to_path(hash_info.value)


For the record: as we've discussed with @pmrowla before, it might make sense to either store these ref objects in some subdir in the odb (e.g. refs/ or similar) or use a file extension (e.g. .ref, similar to .dir) to distinguish these objects from actual real objects.

This might be useful in cases where you need to store actual objects as well as ref objects. It is kind of like git's object header, but encoded into a path or filename, which lets us distinguish them very easily on sight and through API calls like list_objects.
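
A minimal sketch of the file-extension variant, assuming the usual two-character prefix layout. The `.ref` suffix and both helper names are hypothetical and only illustrate how the object kind could be read off the path without opening the object:

```python
from pathlib import PurePosixPath

REF_SUFFIX = ".ref"  # hypothetical, by analogy with the existing ".dir"
DIR_SUFFIX = ".dir"


def hash_to_path(value, kind="file"):
    """Map a hash to an ODB-relative path using the 2-char prefix layout."""
    suffix = {"file": "", "dir": DIR_SUFFIX, "ref": REF_SUFFIX}[kind]
    return PurePosixPath(value[:2]) / (value[2:] + suffix)


def classify(path):
    """Tell object kinds apart from the path alone."""
    name = PurePosixPath(path).name
    if name.endswith(REF_SUFFIX):
        return "ref"
    if name.endswith(DIR_SUFFIX):
        return "dir"
    return "file"


if __name__ == "__main__":
    path = hash_to_path("abc123", kind="ref")
    print(path, classify(path))
```

With this kind of layout, a listing call only needs the returned paths to separate ref objects from real ones.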

efiop · 2021-07-27T18:15:07Z

Btw, even though refodb can't make dvc provide a 100% git-like staging experience (we can't save the actual files since they are too big), we could maybe improve --no-commit by just showing that state in dvc status. We won't be able to roll your data back to the staged state, but at least we won't need to store the md5 of an uncommitted (to .dvc/cache) file/dir, which is less hazardous, so the uncommitted and missing-cache cases could be more easily distinguished in functionality that cares (e.g. dvc status, dvc diff, etc). This will, of course, only be possible if the staging refodb gets permanently saved to disk, which is a possible future step. But I just wanted to mention it, since it seems to open the way to a better UI or better functionality in general in the future.

pmrowla · 2021-07-30T11:00:57Z

For the record:

Had a discussion with @efiop regarding potential future use cases for refodb:

- replacing get with the ability to "checkout" directly from a remote odb seems promising
- checkout itself would be generalized into a "transfer/link from source odb into checkout refodb" operation, where the refodb contains references to the output paths

efiop · 2021-08-02T17:45:19Z

@Mergifyio rebase


mergify · 2021-08-02T17:45:41Z

Command rebase: success

Branch has been successfully rebased

efiop · 2021-08-02T19:14:01Z

dvc/state.py

            return None

-        return HashInfo("md5", value[2], size=int(size))
+        return HashInfo("md5", value[2], size=size)


This breaks backward compatibility for this db 🙁

We could change the layout, but just need to make sure we do corresponding version adjustments.

I'll adjust it so that we go back to writing size as a string in the state DB.

The main thing here was that I don't think the utils function should have been returning both as strings (the returned mtime does have to be a string, since for dirs it's actually the hash of all the file mtimes, but that doesn't apply to size).

@pmrowla Agreed, that's part of old code that used to live directly in the old state class 🙁
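
For illustration, the compromise discussed in this thread (int size in the utils return value, string size in the state DB rows) could look roughly like the sketch below. A plain dict stands in for the state DB, and `save_entry` and `load_entry` are hypothetical helpers, not the actual dvc/state.py API:

```python
import os


def get_mtime_and_size(path):
    """Return mtime as a string and size as an int for a single file."""
    stat = os.stat(path)
    return str(stat.st_mtime_ns), stat.st_size


def save_entry(db, key, mtime, size, hash_value):
    # Size is written as a string so rows keep the existing on-disk layout.
    db[key] = (mtime, str(size), hash_value)


def load_entry(db, key, mtime, size):
    """Return (hash_value, size) if the cached row still matches, else None."""
    value = db.get(key)
    if value is None or value[0] != mtime or value[1] != str(size):
        return None
    # Convert back to int at the boundary so callers never see a string size.
    return value[2], int(value[1])
```

Keeping the conversion at the read/write boundary means older rows stay readable without a DB version bump.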


efiop · 2021-08-02T19:21:56Z

dvc/dependency/repo.py

            rev = repo.get_rev()
            if locked and self.def_repo.get(self.PARAM_REV_LOCK) is None:
                self.def_repo[self.PARAM_REV_LOCK] = rev

            path_info = PathInfo(repo.root_dir) / str(self.def_path)
            if not obj_only:
                try:
-                    for odb, objs in repo.used_objs(
+                    for odb, obj_ids in repo.used_objs(


Should we go obj_ids -> oids and coin this name? Sure, it is a bit more cryptic, but short and sweet and makes sense if you know what you are doing. Just feels like we should get the terminology straight right away so it is settled, as it will be used more and more around the code.


pmrowla · 2021-08-04T10:36:37Z

Thinking about staging some more (with regard to the partial add issue), we will need to always write staged trees to disk somewhere prior to the formal save/transfer (whether it's in the actual ODB or somewhere else like .dvc/tmp).

With the current behavior for this PR, where it's all done in memory, we wouldn't be able to recover from a partial/failed add, since we would lose the dir cache/tree for the original "complete" directory.
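
As a rough sketch of what "write staged trees to disk first" could mean, assuming a tree is just a mapping of relative path to file hash and that a JSON file in some temp directory is good enough. The helper names and the file layout are hypothetical:

```python
import json
import os
import tempfile


def write_staged_tree(staging_dir, tree_oid, entries):
    """Persist a staged tree (relative path -> file hash) before any transfer."""
    os.makedirs(staging_dir, exist_ok=True)
    path = os.path.join(staging_dir, tree_oid + ".json")
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as fobj:
        json.dump(entries, fobj, sort_keys=True)
    # Rename into place so readers never see a half-written tree file.
    os.replace(tmp_path, path)
    return path


def load_staged_tree(staging_dir, tree_oid):
    """Reload the complete tree after a partial or failed add."""
    path = os.path.join(staging_dir, tree_oid + ".json")
    with open(path, encoding="utf-8") as fobj:
        return json.load(fobj)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        staging_dir = os.path.join(tmp, "staging")
        write_staged_tree(staging_dir, "deadbeef", {"a.txt": "111"})
        print(load_staged_tree(staging_dir, "deadbeef"))
```

Because the tree is on disk before any file data moves, a failed add can reload the original "complete" tree instead of re-hashing the directory.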

efiop · 2021-08-04T10:56:01Z

@pmrowla Thank you! 🙏

efiop · 2021-08-08T21:12:53Z

For the record: some followups that we've discussed before:

- obj type is part of oid
- base odb should be aware of obj type (will likely result in removal of ReferenceODB)
- use protobuf? (not needed right now, as we don't need to serialize anything to an actual disk, but still)
- hash type might be part of oid, but we could keep the plain 12/34 structure for now in refs/
- ref objects could be a compatibility layer between different hash types
- should be possible to transfer all objects from one odb to another (e.g. run cache too), aka cloning

pmrowla added the refactoring label Jul 23, 2021

pmrowla self-assigned this Jul 23, 2021

pmrowla force-pushed the odb-transfer-oid branch from b82c49a to 8222337 Compare July 27, 2021 08:05

efiop reviewed Jul 27, 2021


pmrowla force-pushed the odb-transfer-oid branch from db27b63 to 76b2498 Compare July 29, 2021 12:17

pared self-requested a review July 30, 2021 14:03

pmrowla added 17 commits August 2, 2021 17:45

objects.status: use object IDs instead of naive objects

418b6d8

objects.transfer: use object IDs instead of naive objects

578775b

objects: add ReferenceHashFile

f9df44e

odb: add ReferenceObjectDB

f138939

repo: collect used object IDs instead of naive objs

513843d

support basic/naive fs serialization

d54b0a2

objects.stage: stage file objects in addition to trees

a91520b

objects: remove save(), update transfer()

8dc5ce5

output/dep: update for osave removal

3c1d7e8

path_info: support both url and path attributes

5c2fcf7

status/gc: use object IDs instead of naive objs

04ad49a

diff: update staging usage

25d8dfb

commit: update staging usage

7edddb5

tests: clean staging when needed

cc127a1

tests: update for save/transfer/stage changes

0753943

dvcfs: don't allow config serialization (repofs should be used instead)

c2c2224

utils: return mtime as str, size as int

7857a06

pmrowla added 12 commits August 2, 2021 17:45

odb: only protect on check() if hash was verified

0e17368

staging: use unique memfs namespace for staging per dest ODB

c9fdf5b

tests: clean full staging namespace

d637cbc

refodb: preserve repofs cache dir

903fe31

repodependency: stage imports to local ODB and strict check circular imports

9311147

fetch: fetch staging last

5cd6da8

staging: make dest odb mandatory and preserve state

3caedf8

refodb: cache nonlocal fs's

30ef0e5

fix rebase errors

98ea83c

repofs: support full instantiation from fs config

5ec45fb

repofs: preserve subrepo factory for erepos

f100f0a

tests: clarify repofs dirty file behavior

e9e0ce1

efiop force-pushed the odb-transfer-oid branch from e5f8690 to e9e0ce1 Compare August 2, 2021 17:45

efiop reviewed Aug 2, 2021


tests/utils/__init__.py (resolved)

efiop reviewed Aug 2, 2021


dvc/hash_info.py (resolved)

pmrowla marked this pull request as ready for review August 3, 2021 11:31

pmrowla requested a review from a team as a code owner August 3, 2021 11:31

efiop approved these changes Aug 4, 2021


pmrowla changed the title ~~[WIP] objects: use object IDs and references instead of naive objs in status/transfer~~ objects: use object IDs and references instead of naive objs in status/transfer Aug 4, 2021

efiop merged commit 114a07e into iterative:master Aug 4, 2021

This was referenced Aug 8, 2021

refodb: cache ref objects #6392

Merged

DVC import file hashing only runs on one CPU thread #5546

Closed

pmrowla deleted the odb-transfer-oid branch December 16, 2021 06:27
