Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking β€œSign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

objects: use object IDs and references instead of naive objs in status/transfer #6360

Merged
merged 38 commits into from
Aug 4, 2021

Conversation

pmrowla
Copy link
Contributor

@pmrowla pmrowla commented Jul 23, 2021 β€’

Thank you for the contribution - we'll try to review it as soon as possible. πŸ™

  • Adds ReferenceHashFile object type and ReferenceObjectDB ODB type
    • Ref ODB stores references to files outside the ODB fs, so the object at hash abc123 is a serialized reference to some filesystem and path where the actual source file w/hash abc123 exists
  • objects.stage now stages objects as references in a memfs based ref ODB
  • objects.save is removed in favor of stage + transfer from staging to ODB pattern
  • objects.status and objects.transfer now accept a collection of object IDs (hashinfos) instead of naive objects
    • repo/stage/output object collection now collects object IDs instead of naive objects as well

Sorry, something went wrong.

@pmrowla pmrowla added the refactoring Factoring and re-factoring label Jul 23, 2021
@pmrowla pmrowla self-assigned this Jul 23, 2021
@pmrowla pmrowla force-pushed the odb-transfer-oid branch from b82c49a to 8222337 Compare July 27, 2021 08:05
def get(self, hash_info: "HashInfo"):
if hash_info.isdir:
return super().get(hash_info)
path_info = self.hash_to_path(hash_info.value)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the record: as we've discussed with @pmrowla before, it might make sense to either store these ref objects in a some subdir in odb (e.g. refs/ or smth) or maybe use a file extention (similar to .dir, e.g. .ref) to distinguish these objects from actual real objects.

This might get useful in cases where you need to store actual objects as well as ref objects and is kinda like git's object header, but encoded into a path or filename, which allows us to very easilly distinguish them on sight and through API calls like list_objects.

@efiop
Copy link
Contributor

efiop commented Jul 27, 2021

Btw, even though refodb is not able to make dvc provide 100% git-like staging experience (we can't save actual files since they are too big), we could maybe improve --no-commit by just showing that in dvc status. We won't be able to make your data roll back to the staged state, but at least we won't need to store md5 of an uncommited (to .dvc/cache) file/dir, which is less hazardous and so uncommited and missing cache cases could be more easilly distinguished in functionality that cares (e.g. dvc status, dvc diff, etc). This will, of course, be possible, if staging refodb becomes permanently saved to disk, which is a future possible step. But just wanted to mention this, since it seems to allow us for a better ui or better functionality in general in the future.

@pmrowla pmrowla force-pushed the odb-transfer-oid branch from db27b63 to 76b2498 Compare July 29, 2021 12:17
@pmrowla
Copy link
Contributor Author

pmrowla commented Jul 30, 2021

For the record:

had discussion with @efiop regarding potential future use cases for refodb

  • replacing get with the ability to "checkout" directly from a remote odb seems promising
  • checkout itself would be generalized into a "transfer/link from source odb into checkout refodb" where the refodb contains references to the output paths

@pared pared self-requested a review July 30, 2021 14:03
@efiop
Copy link
Contributor

efiop commented Aug 2, 2021

@Mergifyio rebase

@mergify
Copy link
Contributor

mergify bot commented Aug 2, 2021

Command rebase: success

Branch has been successfully rebased

@efiop efiop force-pushed the odb-transfer-oid branch from e5f8690 to e9e0ce1 Compare August 2, 2021 17:45
return None

return HashInfo("md5", value[2], size=int(size))
return HashInfo("md5", value[2], size=size)
Copy link
Contributor

@efiop efiop Aug 2, 2021 β€’

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This breaks backward compatibility for this db πŸ™

We could change the layout, but just need to make sure we do corresponding version adjustments.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll adjust it so that we go back to writing size as a string in the state DB.

The main thing here was that I don't think the utils function should have been returning both as strings (the returned mtime does have to be a string since for dirs it's actually the hash of all the file mtimes, but that doesn't apply to size)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pmrowla Agreed, that's part of an old code that used to be directly in old state class πŸ™

rev = repo.get_rev()
if locked and self.def_repo.get(self.PARAM_REV_LOCK) is None:
self.def_repo[self.PARAM_REV_LOCK] = rev

path_info = PathInfo(repo.root_dir) / str(self.def_path)
if not obj_only:
try:
for odb, objs in repo.used_objs(
for odb, obj_ids in repo.used_objs(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we go obj_ids -> oids and coin this name? Sure, it is a bit more cryptic, but short and sweet and makes sense if you know what you are doing. Just feels like we should get the terminology straight right away so it is settled, as it will be used more and more around the code.

dvc/hash_info.py Show resolved Hide resolved
@pmrowla pmrowla marked this pull request as ready for review August 3, 2021 11:31
@pmrowla pmrowla requested a review from a team as a code owner August 3, 2021 11:31
@pmrowla
Copy link
Contributor Author

pmrowla commented Aug 4, 2021

Thinking about staging some more (with regard to the partial add issue), we will need to always write staged trees to disk somewhere prior to the formal save/transfer (whether it's in the actual ODB or somewhere else like .dvc/tmp).

With the current behavior for this PR where it's all done in memory, we wouldn't be able to recover from a partial/failed add since we would lose the dir cache/tree for the original "complete" directory

@pmrowla pmrowla changed the title [WIP] objects: use object IDs and references instead of naive objs in status/transfer objects: use object IDs and references instead of naive objs in status/transfer Aug 4, 2021
@efiop efiop merged commit 114a07e into iterative:master Aug 4, 2021
@efiop
Copy link
Contributor

efiop commented Aug 4, 2021

@pmrowla Thank you! πŸ™

@efiop
Copy link
Contributor

efiop commented Aug 8, 2021

For the record: some followups that we've discussed before:

  1. obj type is part of oid
  2. base odb should be aware of obj type (will likely result in removal of ReferenceODB)
  3. use protobuf? (not needed right now, as we don't need to serialize anything to an actual disk, but still)
  4. hash type might be part of oid but we could keep plain 12/34 structure for now in refs/
  5. ref objects could be a compatibility layer between different hash types.
  6. should be possible to transfer all objects from one odb to another (e.g. run cache too). Aka cloning

@pmrowla pmrowla deleted the odb-transfer-oid branch December 16, 2021 06:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
refactoring Factoring and re-factoring
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants