Ignore empty files #9329

Closed
johnyaku opened this issue Apr 14, 2023 · 1 comment
Labels
A: data-management (Related to dvc add/checkout/commit/move/remove), feature request (Requesting a new feature)

Comments

@johnyaku

The chance of a hash collision between two different files is extraordinarily low, unless both files are "empty", in which case they will both have the checksum d41d8cd98f00b204e9800998ecf8427e.
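For reference, that checksum is simply the MD5 digest of zero bytes, which is easy to confirm with Python's standard library. This is only a quick check, not DVC code; `md5_of` and `EMPTY_MD5` are illustrative names made up for this snippet.

```python
import hashlib

# The checksum quoted above for empty files.
EMPTY_MD5 = "d41d8cd98f00b204e9800998ecf8427e"


def md5_of(data):
    """Return the hex MD5 digest of a bytes object (illustrative helper)."""
    return hashlib.md5(data).hexdigest()


# Every zero-byte file hashes to the same value, whatever its name or mtime.
assert md5_of(b"") == EMPTY_MD5

# A file containing even a single newline gets a different digest.
assert md5_of(b"\n") != EMPTY_MD5
```

So any two empty files necessarily share one checksum, regardless of where they live in the workspace.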

Empty files can get created for various reasons, including by workflow tools such as Snakemake. Snakemake creates .snakemake_timestamp files that exist only for their mtime, which is then lost when the file is added to the cache. (#8602)

There is not much to be gained by caching/tracking these empty files either. We could explicitly ignore them via .dvcignore when we know that they might turn up, but perhaps DVC could ignore empty files by default?
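To make the explicit route concrete, here is a rough sketch of a script that lists zero-byte files and prints one pattern per file, which could then be reviewed and pasted into .dvcignore. It is not part of DVC: `find_empty_files` and `as_ignore_patterns` are hypothetical helper names, and it assumes that a leading `/` anchors a pattern to the directory holding the ignore file (gitignore-style behaviour). Paths with special pattern characters (spaces, `#`, `!`, `*`) are not escaped here.

```python
import os
from pathlib import Path


def find_empty_files(root):
    """Return sorted POSIX-style paths, relative to root, of zero-byte regular files."""
    root_path = Path(root)
    empty = []
    for dirpath, dirnames, filenames in os.walk(root_path):
        # Do not descend into Git or DVC internals.
        dirnames[:] = [d for d in dirnames if d not in {".git", ".dvc"}]
        for name in filenames:
            path = Path(dirpath) / name
            # Skip symlinks so only real files in the workspace are reported.
            if path.is_symlink() or not path.is_file():
                continue
            if path.stat().st_size == 0:
                empty.append(path.relative_to(root_path).as_posix())
    return sorted(empty)


def as_ignore_patterns(paths):
    """Prefix each path with '/' (assumed to anchor it gitignore-style)."""
    return ["/" + p for p in paths]


if __name__ == "__main__":
    # Print candidate patterns for manual review; nothing is written to disk.
    for pattern in as_ignore_patterns(find_empty_files(".")):
        print(pattern)
```

The script only prints candidates, so the decision about which empty files really don't matter stays with whoever maintains the repository.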

By "ignore", I think I mean "leave in the workspace, don't add to cache". Not sure if they should be tracked by .dir files.

Not sure if there would be unintended consequences. If so, perhaps "ignore empty files" could be configurable.

@daavoo added the feature request (Requesting a new feature) and A: data-management (Related to dvc add/checkout/commit/move/remove) labels Apr 14, 2023
@efiop
Contributor

efiop commented Apr 14, 2023

This would break backward compatibility. Also, imagine that your pipeline created only one file and it is empty: ignoring it would look like no output was created at all, which looks like an error. Implicitly ignoring empty files seems too opinionated; you could indeed use .dvcignore for files that don't matter in your particular scenario instead.

@efiop efiop closed this as completed Apr 14, 2023
3 participants