cloud versioning: get rid of .dir files #8357
Comments
Locally it is still beneficial for us to use the .dir files for performance reasons. The dvc-data object DB isn't aware of the DVC repo and pipeline files within the repo, and it's faster to do directory tree lookups via the content-addressed storage version of the .dir file than it is to collect outs from all of the possible pipeline files in the local repo. |
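As a rough illustration of the lookup described above, here is a minimal Python sketch. It assumes the cache layout visible in the path quoted in this issue (the first two hex characters of the hash form a subdirectory), and it assumes a `.dir` object is a JSON list of entries with `relpath` and `md5` keys. The function names are hypothetical and are not part of dvc-data.

```python
import json
from pathlib import PurePosixPath


def cache_path(cache_root, obj_hash):
    """Map a hash such as 'f437...49.dir' to its location in the local cache.

    Assumption: objects are stored as <root>/<first two chars>/<rest>.
    """
    return str(PurePosixPath(cache_root) / obj_hash[:2] / obj_hash[2:])


def parse_dir_listing(raw):
    """Turn the JSON body of a .dir object into a {relpath: md5} mapping.

    Assumption: the body is a JSON list of {"md5": ..., "relpath": ...}.
    """
    return {entry["relpath"]: entry["md5"] for entry in json.loads(raw)}
```

With a mapping like this, resolving one file inside a tracked directory is a single dictionary lookup, with no need to collect outs from pipeline files.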
Maybe I'm naively assuming all operations are file-based and missing how the DB is used, but even if DVC needs the |
Having a It's described here - https://github.com/iterative/iterative.ai/issues/690. And unfortunately, we won't be able to use it until it's solved. If we use it only locally, we should definitely find a way to move it outside the user-facing metadata. |
Is it possible to just delete the .dir entries from the .dvc files when encountering merge conflicts? |
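A minimal sketch of what deleting the .dir entries could look like, assuming the .dvc file has already been parsed into a dictionary with an `outs` list whose items may carry an `md5` value ending in `.dir`. The helper name `drop_dir_hashes` is hypothetical, and this is not how the DVC merge driver is implemented.

```python
import copy


def drop_dir_hashes(dvc_data):
    """Return a copy of parsed .dvc data without directory-level hashes.

    Assumption: each item in "outs" is a dict, and a directory output is
    recognizable by an "md5" value ending in ".dir".
    """
    cleaned = copy.deepcopy(dvc_data)
    for out in cleaned.get("outs", []):
        if str(out.get("md5", "")).endswith(".dir"):
            out.pop("md5")
    return cleaned
```

The input is left unmodified, so both sides of a conflicting merge can be cleaned and compared independently.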
From the support channel:
|
@shcheklein Do you have a link for the support request? I haven't seen that one. I don't mean to push back on the feature, which is planned for the current sprint, but I want to get some clarification in case something's not working as expected with our merge driver after it was updated in #8360.
Should we add this comment to #770? |
It was an email to [email protected] (let me know if you don't see them).
Yes, I also wonder if @iterative/cml could help here with a report that does diffs? |
@shcheklein This may be hard to accomplish solely for cloud versioning since it is unknown during
The other option is to drop/hide .dir entries for all .dvc files, not only cloud-versioned ones. It might make sense, but I think we are probably not ready to do that today because we need to use cloud versioning to test out how well it works and scales. Without researching too much, I would also guess it would take much more effort. |
Q: Do we need to support mixed modes? Can we make them exclusive? (I know this was probably discussed before; I wonder whether that is a scenario we want to maintain from a user perspective.)
That can be better than nothing; we already do an update to the file, right? I think it's fine if .dir is hidden, as long as it doesn't affect users. |
I'm not sure I follow how we would make it exclusive. When you do
Yes, so far it may be the best option we have if we need to avoid .dir entries. It would require some changes in the .dvc file and how we read data locally. Right now, it looks like:
Other than YAML validation failing, it basically works if the format instead looks like:
(There need to be some unrelated changes to the cloud fields If we make changes here, they should probably be blockers for release since they are likely to break the .dvc files. Theoretically, the same changes could be useful without cloud versioning, but maybe it's good to make cloud versioning a testing ground. It may also relate somehow to #4657 (comment).
If we go with the format above, we could probably hash the .dvc file and keep a reference to it in a temp dir or database to do the same optimizations we do today with |
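To make the idea of hashing the .dvc file concrete, here is a small Python sketch. It assumes the key is an MD5 digest of the .dvc file's text and that the "database" is an in-memory dictionary. `TreeIndex` and `file_digest` are hypothetical names, not existing DVC APIs.

```python
import hashlib


def file_digest(dvc_text):
    """Hash the raw text of a .dvc file (assumption: MD5 over UTF-8 bytes)."""
    return hashlib.md5(dvc_text.encode("utf-8")).hexdigest()


class TreeIndex:
    """Hypothetical stand-in for a temp dir or database of directory trees."""

    def __init__(self):
        self._trees = {}

    def remember(self, dvc_text, tree):
        # Store the {relpath: hash} tree under the digest of the .dvc file.
        key = file_digest(dvc_text)
        self._trees[key] = dict(tree)
        return key

    def lookup(self, dvc_text):
        # Returns None when this exact .dvc content has not been seen before.
        return self._trees.get(file_digest(dvc_text))
```

Any change to the .dvc file changes the digest, so a stale tree is never returned for edited metadata.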
When a specific mode is set in its config and DVC tries to read, push, or pull to/from some incompatible format, it should fail fast without trying to do anything with files, etc. E.g. if the config is set to operate in a cloud versioning format, it won't be able to do |
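The fail-fast behavior suggested here could be sketched as follows. The mode names and the `IncompatibleFormatError` exception are assumptions made for illustration; DVC does not necessarily expose a config option like this.

```python
class IncompatibleFormatError(Exception):
    """Hypothetical error raised before any file operation is attempted."""


def check_mode(config_mode, data_format):
    """Raise immediately if the configured mode and the data format differ.

    Assumption: both values are plain strings such as "cloud_versioning"
    or "content_addressed"; these names are illustrative only.
    """
    if config_mode != data_format:
        raise IncompatibleFormatError(
            f"repo is configured for {config_mode!r} "
            f"but the data uses {data_format!r}"
        )
```

Calling the check at the start of a read, push, or pull means a mismatch is reported before any file is touched.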
Thanks @shcheklein. What do you think about trying to keep the formats compatible and dropping .dir entries on push for now as described above? Seems like a simpler starting point and prevents cloud versioning from diverging too much from the existing dvc workflow for now. |
If it's simpler to implement and maintain then yes, it makes sense of course. |
Need to catch up on this 😅 . Just wanted to say that my current impression is that trying to:
It is causing more problems than benefits (internally but also at the UX level, IMO). |
This feature in particular is not really specific to cloud versioning except that cloud versioning started down the path of saving individual file entries into .dvc files, and I think that info isn't used for add/commit/etc. but instead only for remote transfer operations (cc @pmrowla). Another option here is to configure whether to use |
I didn't understand that the scope was to get rid of
If we want to make the scope for DVC in general, I think we should discuss that separately and reconsider prioritization as it would take a different effort and I am not even sure if we really want to drop
add/commit/etc do behave differently if the |
Part of #7995

With cloud versioned remotes, the resulting .dvc file for directories looks like:

There is a corresponding entry in .dvc/cache/f4/37247ec66d73ba66b0ade0246fcb49.dir:

Is the .dir entry needed?