Dropping .dir #8884
Comments
I would say that we wait to see how cloud versioning goes before making any decisions here.
Also, yaml parsing is many times slower than json parsing.
Any reason for lock files to be in YAML?
We wanted the lock file to be human-readable too.
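For a sense of that tradeoff, here is the same made-up dvc.lock-style entry in both formats; the stage and hash values are illustrative, not from a real project:

```yaml
stages:
  train:
    cmd: python train.py
    deps:
    - path: data/train.csv
      md5: 4f8a1ca6059b8b4b0a1c2f9e3d7c5a10
```

```json
{"stages": {"train": {"cmd": "python train.py", "deps": [{"path": "data/train.csv", "md5": "4f8a1ca6059b8b4b0a1c2f9e3d7c5a10"}]}}}
```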
Possibly related: https://github.com/iterative/studio/issues/4968. Studio checks the size limit of outputs but not at a granular level. Would it be simpler for them to do this granularly if they had the individual files listed in the .dvc/dvc.lock files?
What would be the harm in at least adding a flag to opt in?
I agree that we should just drop it, especially for imports. This makes much more sense than creating .dir files. It seems like it worked out pretty nicely for cloud versioning, so I think we are ready to switch.
We have had some reports that parsing large .dvc/dvc.lock files is slow.
I have tested a cloud-versioned CelebA dataset; it took 30 seconds to parse. It was a .dvc file, though.
@skshetry Do you know how long it takes currently with the .dir file?
Also, how hard would it be to use a more efficient yaml parser?
@dberenbaum yaml is pretty bad overall, but json should be significantly better. We might be able to shave down unneeded features from the yaml parser or use something that does that for us out of the box, but yaml is just not quite suitable for that many entries.
We use a yaml parser written in C. You can check if you have it installed using `python -c "import ruamel.yaml; print(ruamel.yaml.__with_libyaml__)"`. Here are some benchmarks on yaml/json parsing based on celeba:
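The measured numbers depend on the setup; a minimal sketch of how such a comparison can be run, using a synthetic ~200k-file listing rather than the actual CelebA data, might look like this (assumes ruamel.yaml is installed):

```python
# A minimal sketch, not the original benchmark: build a synthetic
# directory listing, dump it to both formats, and time parsing each.
# Absolute numbers depend on the machine and on whether the libyaml
# C extension is available.
import io
import json
import time

from ruamel.yaml import YAML

# Synthetic stand-in for an expanded .dvc/dvc.lock directory listing.
entries = [
    {"relpath": f"images/{i:06d}.jpg", "md5": f"{i:032x}", "size": i % 100_000}
    for i in range(200_000)
]
doc = {"outs": [{"path": "data", "files": entries}]}

yaml = YAML(typ="safe")  # uses the C-based loader when libyaml is present

# Serialize once up front so only parsing is measured.
buf = io.StringIO()
yaml.dump(doc, buf)
yaml_text = buf.getvalue()
json_text = json.dumps(doc)

start = time.perf_counter()
yaml.load(io.StringIO(yaml_text))
print(f"yaml: {time.perf_counter() - start:.2f}s")

start = time.perf_counter()
json.loads(json_text)
print(f"json: {time.perf_counter() - start:.2f}s")
```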
In #8849, we stopped serializing the directory info in the resulting .dvc/dvc.lock files for cloud versioned remotes. Can we do the same everywhere?
This would help with a bunch of existing issues:
- `dvc diff`: unexpected output in Github Actions #8875: `dvc diff` fails to compare refs unless the data associated with those refs has been pulled locally. `dvc data status` also reports an `unknown` status when data hasn't been pulled. By having all the files listed in the .dvc/dvc.lock file, it would always be possible to get the granular file info of any commit.
- `dvc pull` fails after version aware import-url for non version aware s3 remote #8872: `dvc pull` on an `import-url` target is now supposed to be able to pull the data directly from the source without having to push a copy to the remote, but it doesn't work for directories because only the high-level directory info is saved to the .dvc file.
- `dvc add/remove` could work at a granular level by only modifying part of the .dvc file.

Automatically pushing and pulling the .dir files from the remote could also solve a lot of these problems, but it seems like a worse UX. It's less transparent, harder for users to manage, and fails when users don't have access to the remote or forgot to push something.
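To make the proposed change concrete, here is a sketch of the two shapes a .dvc file for a small directory could take. The expanded form is modeled on what #8849 produces for cloud-versioned remotes; the hashes, sizes, and exact field layout are illustrative:

```yaml
# Today: the directory is a single entry pointing at a .dir object
# that lives only in the cache/remote.
outs:
- path: data
  md5: f437247ec66d73ba66b0ade0246fcb49.dir
  size: 6
  nfiles: 2
```

```yaml
# Proposed: the individual files are listed inline, so no .dir object
# is needed to know the directory's contents.
outs:
- path: data
  files:
  - relpath: foo.txt
    md5: acbd18db4cc2f85cedef654fccc4a4d8
    size: 3
  - relpath: bar.txt
    md5: 37b51d194a7513e45b56f6524f2d51f2
    size: 3
```

The tradeoff, as discussed in the comments above, is file size and parse time for very large directories.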
How much do we really need the reference to the .dir file? If necessary, could we serialize that reference somewhere that's not git-tracked, like in a shadow `.dvc/tmp/mydataset.dvc` file?