Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cloud versioning: get rid of .dir files #8357

Closed
dberenbaum opened this issue Sep 23, 2022 · 16 comments · Fixed by #8849
Closed

cloud versioning: get rid of .dir files #8357

dberenbaum opened this issue Sep 23, 2022 · 16 comments · Fixed by #8849
Assignees
Labels
A: cloud-versioning Related to cloud-versioned remotes p2-medium Medium priority, should be done, but less important

Comments

@dberenbaum
Copy link
Collaborator

Part of #7995

With cloud versioned remotes, the resulting .dvc file for directories looks like:

outs:
- md5: f437247ec66d73ba66b0ade0246fcb49.dir
  size: 0
  nfiles: 2
  path: data
  files:
  - size: 2
    version_id: mHkZWv4Pa0rZm0SaFdew0WKu7cxPUHrW
    etag: b026324c6904b2a9cb4b88d6d61c81d1
    md5: b026324c6904b2a9cb4b88d6d61c81d1
    relpath: '1'
  - size: 2
    version_id: GO04I1aowL.OoiDYYcoX90W96PufrgFE
    etag: 26ab0db90d72e28ad0ba1e22ee510510
    md5: 26ab0db90d72e28ad0ba1e22ee510510
    relpath: '2'

There is a corresponding entry in .dvc/cache/f4/37247ec66d73ba66b0ade0246fcb49.dir:

[{"md5": "b026324c6904b2a9cb4b88d6d61c81d1", "relpath": "1"}, {"md5": "26ab0db90d72e28ad0ba1e22ee510510", "relpath": "2"}]

Is the .dir entry needed?

@dberenbaum dberenbaum added the A: cloud-versioning Related to cloud-versioned remotes label Sep 23, 2022
@pmrowla
Copy link
Contributor

pmrowla commented Sep 28, 2022

Locally it is still beneficial for us to use the .dir files for performance reasons. The dvc-data object DB isn't aware of the DVC repo and pipeline files within the repo, and it's faster to do directory tree lookups via the content-addressed storage version of the .dir file than it is to collect outs from all of the possible pipeline files in the local repo.

@dberenbaum
Copy link
Collaborator Author

it's faster to do directory tree lookups via the content-addressed storage version of the .dir file than it is to collect outs from all of the possible pipeline files in the local repo.

Maybe I'm naively assuming all operations are file-based and missing how the DB is used, but even if DVC needs the .dir entry in the database and in the .dvc file above, does DVC actually need to save a copy of the .dir file in .dvc/cache? Even if the DB were deleted, it seems like the .dvc/cache/...dir file is only needed when called from a .dvc file that contains the same info already and could rebuild its contents. The only meaningful differences I see between the .dvc file and the .dir file is YAML vs. JSON parsing speed (which I think is a valid and important concern), but the .dvc file needs to be parsed anyway.

@dberenbaum dberenbaum added the p3-nice-to-have It should be done this or next sprint label Oct 12, 2022
@shcheklein
Copy link
Member

Having a .dir entry in the metafiles breaks the scenario when cloud versioning could be used to simplify the workflow that includes updating the directory. Namely, ideally we would not need a special merge driver, etc.

It's described here - https://github.com/iterative/iterative.ai/issues/690. And unfortunately, we won't be able to use it until it's solved.

If we use it only locally, we should definitely find a way to move outside the use-facing metadata.

@dberenbaum dberenbaum added p2-medium Medium priority, should be done, but less important and removed p3-nice-to-have It should be done this or next sprint labels Jan 4, 2023
@dberenbaum
Copy link
Collaborator Author

Is it possible to just delete the .dir entries from the .dvc files when encountering merge conflicts?

@daavoo daavoo self-assigned this Jan 10, 2023
@shcheklein
Copy link
Member

shcheklein commented Jan 10, 2023

From the support channel:

We work with documents (PDF, PNG, JPEG) and recently switched from GitLFS to DVC for security and efficiency reasons. However, we are currently experiencing some issues with its use because our annotation team, which previously only worked with Git, now has to deal with DVC as well. Conflicts can be frequent and are difficult to resolve for them. Additionally, we are having challenges with pull request reviews, as before it was directly done through our user interface (Bitbucket), and it was easy to see the differences on JSON/text files.

@dberenbaum
Copy link
Collaborator Author

@shcheklein Do you have a link for the support request? I haven't seen that one.

I don't mean to push back on the feature, which is planned for the current sprint, but I want to get some clarification in case something's not working as expected with our merge driver after it was updated in #8360.

Additionally, we are having challenges with pull request reviews, as before it was directly done through our user interface (Bitbucket), and it was easy to see the differences on JSON/text files.

Should we add this comment to #770?

@shcheklein
Copy link
Member

I was en email to [email protected] (let me know if you don't see them).

Should we add this comment to #770?

Yes, also wonder if @iterative/cml could help here with a report that does diffs?

@dberenbaum
Copy link
Collaborator Author

@shcheklein This may be hard to accomplish solely for cloud versioning since it is unknown during dvc add that this will later be pushed to a cloud-versioned remote. Some options discussed with @daavoo (TBH I don't really like them):

  • Encourage dvc add --to-remote to push at the same time as adding, and skip the .dir file.
  • Delete the .dir file from .dvc during dvc push to a cloud-versioned remote.
  • Same as the options above, except the .dir file gets hidden somewhere that's not tracked instead of being deleted.

The other option is to drop/hide .dir entries for all .dvc files, not only cloud-versioned ones. It might make sense, but I think we are probably not ready to do that today because we need to use cloud versioning to test out how well it works and scales. Without researching too much, I would also guess it would take much more effort.

@shcheklein
Copy link
Member

that this will later be pushed to a cloud-versioned remote.

Q: do we need to support mixed modes? can we make them exclusive (I know this was probably discussed before, I wonder if that is a scenario from a user perspective that we want to maintain?)

Delete the .dir file from .dvc during dvc push to a cloud-versioned remote.

That can be better than nothing, we already do an update to the file, right?

I think if .dir is hidden is fine, if it doesn't affect users.

@dberenbaum
Copy link
Collaborator Author

Q: do we need to support mixed modes? can we make them exclusive (I know this was probably discussed before, I wonder if that is a scenario from a user perspective that we want to maintain?)

I'm not sure I follow how we would make it exclusive. When you do dvc add, would we ask if you plan to push to a cloud-versioned remote? Maybe it's related to the dvc add --to-remote suggestion, where we are not only adding locally but also doing some operation on the remote simultaneously?

Delete the .dir file from .dvc during dvc push to a cloud-versioned remote.

That can be better than nothing, we already do an update to the file, right?

Yes, so far it may be the best option we have if we need to avoid .dir entries.

It would require some changes in the .dvc file and how we read data locally. Right now, it looks like:

outs:
- md5: 22e3f61e52c0ba45334d973244efc155.dir
  size: 64128504
  nfiles: 2800
  path: cats-dogs
  files:
  - size: 16880
    version_id: CxBqqlHnOmHSG09sYlP36Mytf.48xmaJ
    etag: ed779276108738fdb2179ccabf9680d9
    md5: ed779276108738fdb2179ccabf9680d9
    relpath: data/train/cats/cat.1.jpg
  - size: 34315
    version_id: 4QWq043ohGMxOk_pilnTkVPnJFJEV2lc
    etag: 10d2a131081a3095726c5721ed31c21f
    md5: 10d2a131081a3095726c5721ed31c21f
    relpath: data/train/cats/cat.10.jpg
...

Other than YAML validation failing, it basically works if the format instead looks like:

outs:
- size: 16880
  version_id: LgpYHQnSyJ9EF0sJV1KWv58p7wOK.sHL
  etag: ed779276108738fdb2179ccabf9680d9
  md5: ed779276108738fdb2179ccabf9680d9
  path: cats-dogs/data/train/cats/cat.1.jpg
- size: 34315
  version_id: wBYroaUEZSYH088HetJD.0XfaDUjdhAF
  etag: 10d2a131081a3095726c5721ed31c21f
  md5: 10d2a131081a3095726c5721ed31c21f
  path: cats-dogs/data/train/cats/cat.10.jpg

(There need to be some unrelated changes to the cloud fields etag and version_id because of #8356, and the field ordering could probably be improved also)

If we make changes here, they should probably be blockers for release since they are likely to break the .dvc files.

Theoretically, the same changes could be useful without cloud versioning, but maybe it's good to make cloud versioning a testing ground. It may also relate somehow to #4657 (comment).

I think if .dir is hidden is fine, if it doesn't affect users.

If we go with the format above, we could probably hash the .dvc file and keep a reference to it in a temp dir or database to do the same optimizations we do today with .dir files.

@shcheklein
Copy link
Member

I'm not sure I follow how we would make it exclusive.

DVC when a specific mode is set in its config and when it's trying to read, push, pull to/from some incompatible format should fail fast w/o trying to do anything with files, etc. E.g. if a config set to operate in a cloud versioning format it won't be able to do dvc push to a regular remote at all.

@dberenbaum
Copy link
Collaborator Author

Thanks @shcheklein. What do you think about trying to keep the formats compatible and dropping .dir entries on push for now as described above? Seems like a simpler starting point and prevents cloud versioning from diverging too much from the existing dvc workflow for now.

@shcheklein
Copy link
Member

If it's simpler to implement and maintain then yes, it makes sense of course.

@daavoo
Copy link
Contributor

daavoo commented Jan 13, 2023

Need to catch up on this 😅 .

Just wanted to say that my current impression is that trying to:

prevents cloud versioning from diverging too much from the existing dvc workflow for now.

It is causing more problems than benefits (internally but also at the UX level, IMO).

@dberenbaum
Copy link
Collaborator Author

This feature in particular is not really specific to cloud versioning except that cloud versioning started down the path of saving individual file entries into .dvc files, and I think that info isn't used for add/commit/etc. but instead only for remote transfer operations (cc @pmrowla). Another option here is to configure whether to use .dir files separate from cloud versioning with something like dvc config dir false.

@daavoo
Copy link
Contributor

daavoo commented Jan 16, 2023

This feature in particular is not really specific to cloud versioning except that cloud versioning started down the path of saving individual file entries into .dvc files,

I didn't understand that the scope was to get rid of .dir files on all cases.

Another option here is to configure whether to use .dir files separate from cloud versioning with something like dvc config dir false.

If we want to make the scope for DVC in general, I think we should discuss that separately and reconsider prioritization as it would take a different effort and I am not even sure if we really want to drop .dir files for the regular workflow.

and I think that info isn't used for add/commit/etc. but instead only for remote transfer operations (cc @pmrowla).

add/commit/etc do behave differently if the .dvc is "cloud-versioned".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A: cloud-versioning Related to cloud-versioned remotes p2-medium Medium priority, should be done, but less important
Projects
No open projects
Archived in project
Development

Successfully merging a pull request may close this issue.

4 participants