objects: use object IDs and references instead of naive objs in status/transfer #6360
b82c49a to 8222337
def get(self, hash_info: "HashInfo"):
    if hash_info.isdir:
        return super().get(hash_info)
    path_info = self.hash_to_path(hash_info.value)
For the record: as we've discussed with @pmrowla before, it might make sense to either store these ref objects in some subdir in the odb (e.g. refs/ or something similar) or maybe use a file extension (similar to .dir, e.g. .ref) to distinguish these objects from actual real objects.
This might be useful in cases where you need to store actual objects as well as ref objects. It is kind of like git's object header, but encoded into a path or filename, which lets us very easily distinguish them on sight and through API calls like list_objects.
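A minimal sketch of the two layouts floated above, assuming a two-level `ab/c123...` object layout. The `refs` subdir name, the `.ref` suffix, and the helpers `hash_to_path` and `is_ref_path` as written here are hypothetical illustrations of the idea, not DVC's actual API.

```python
from pathlib import PurePosixPath

REF_SUBDIR = "refs"  # hypothetical subdir name for ref objects
REF_SUFFIX = ".ref"  # hypothetical extension, by analogy with ".dir"


def hash_to_path(
    hash_value: str, *, ref: bool = False, use_subdir: bool = True
) -> PurePosixPath:
    """Map a hash to a relative object path, marking ref objects in the path."""
    base = PurePosixPath(hash_value[:2]) / hash_value[2:]
    if not ref:
        return base
    if use_subdir:
        # Option 1: keep ref objects under their own subdir.
        return PurePosixPath(REF_SUBDIR) / base
    # Option 2: keep the same location but tag the filename.
    return base.with_name(base.name + REF_SUFFIX)


def is_ref_path(path: PurePosixPath) -> bool:
    """Tell ref objects from real objects by looking at the path alone."""
    return path.parts[0] == REF_SUBDIR or path.name.endswith(REF_SUFFIX)
```

Either variant lets a listing call separate ref objects from real ones without opening any file, which is the property the comment is after.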
Btw, even though refodb is not able to make dvc provide 100% git-like staging experience (we can't save actual files since they are too big), we could maybe improve
db27b63 to 76b2498
For the record: had discussion with @efiop regarding potential future use cases for refodb
@Mergifyio rebase
      return None

-     return HashInfo("md5", value[2], size=int(size))
+     return HashInfo("md5", value[2], size=size)
This breaks backward compatibility for this db.
We could change the layout, but just need to make sure we do corresponding version adjustments.
I'll adjust it so that we go back to writing size as a string in the state DB.
The main thing here was that I don't think the utils function should have been returning both as strings (the returned mtime does have to be a string since for dirs it's actually the hash of all the file mtimes, but that doesn't apply to size).
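A rough sketch of the compromise described above: size stays an int in Python code but is written to the state DB as a string, so existing rows keep working. The `(mtime, size, md5)` row layout and the `encode_state` / `decode_state` helpers are assumptions made for illustration, not the actual state DB code.

```python
from typing import Optional, Tuple

Row = Tuple[str, str, str]  # assumed layout: (mtime, size, md5), all strings


def encode_state(mtime: str, size: int, md5: str) -> Row:
    """Serialize a state entry; size is stringified to match the old layout."""
    return (mtime, str(size), md5)


def decode_state(
    row: Row, actual_mtime: str, actual_size: int
) -> Optional[Tuple[str, int]]:
    """Return (md5, size) if the row still matches the file, else None."""
    mtime, size, md5 = row
    # mtime stays a string: for dirs it is a hash of all file mtimes.
    if mtime != actual_mtime or int(size) != actual_size:
        return None
    return (md5, int(size))
```

Converting at the DB boundary keeps the on-disk format unchanged while callers only ever see an int size.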
@pmrowla Agreed, that's part of some old code that used to be directly in the old state class.
          rev = repo.get_rev()
          if locked and self.def_repo.get(self.PARAM_REV_LOCK) is None:
              self.def_repo[self.PARAM_REV_LOCK] = rev

          path_info = PathInfo(repo.root_dir) / str(self.def_path)
          if not obj_only:
              try:
-                 for odb, objs in repo.used_objs(
+                 for odb, obj_ids in repo.used_objs(
Should we go obj_ids -> oids and coin this name? Sure, it is a bit more cryptic, but short and sweet and makes sense if you know what you are doing. Just feels like we should get the terminology straight right away so it is settled, as it will be used more and more around the code.
Thinking about staging some more (with regard to the partial add issue), we will need to always write staged trees to disk somewhere prior to the formal save/transfer (whether it's in the actual ODB or somewhere else like .dvc/tmp). With the current behavior for this PR, where it's all done in memory, we wouldn't be able to recover from a partial/failed add, since we would lose the dir cache/tree for the original "complete" directory.
@pmrowla Thank you!
For the record: some followups that we've discussed before:
I have followed the Contributing to DVC checklist.
If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here.
Thank you for the contribution - we'll try to review it as soon as possible.
- ReferenceHashFile object type and ReferenceObjectDB ODB type
- abc123 is a serialized reference to some filesystem and path where the actual source file w/hash abc123 exists
- objects.stage now stages objects as references in a memfs based ref ODB
- objects.save is removed in favor of stage + transfer from staging to ODB pattern
- objects.status and objects.transfer now accept a collection of object IDs (hashinfos) instead of naive objects
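The stage + transfer pattern described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the PR's implementation: `Ref`, `MemoryRefODB`, and `transfer` are hypothetical stand-ins, a plain dict replaces the memfs based ref ODB, and object IDs are bare md5 hex strings rather than hashinfos.

```python
import hashlib
import os
import shutil
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Ref:
    """Hypothetical reference object: records where the real bytes live."""

    path: str


class MemoryRefODB:
    """Hypothetical in-memory ref ODB mapping object ID -> reference."""

    def __init__(self) -> None:
        self.refs: Dict[str, Ref] = {}

    def stage(self, path: str) -> str:
        """Hash the file and store only a reference to it, not its contents."""
        with open(path, "rb") as fobj:
            oid = hashlib.md5(fobj.read()).hexdigest()
        self.refs[oid] = Ref(path)
        return oid


def transfer(staging: MemoryRefODB, odb_dir: str, oids: Iterable[str]) -> None:
    """Copy the files behind the given object IDs into an on-disk ODB."""
    for oid in oids:
        dest = os.path.join(odb_dir, oid[:2], oid[2:])
        if os.path.exists(dest):
            continue  # already present, nothing to transfer
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        shutil.copyfile(staging.refs[oid].path, dest)
```

The point of the split is that staging is cheap (no data is copied), and the caller passes around only object IDs until a transfer actually needs the underlying files.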