Misclassification of binary files as text files #3364

Closed
kevin-hanselman opened this issue Feb 19, 2020 · 2 comments
Labels
bug (Did we break something?), p3-nice-to-have (It should be done this or next sprint)

Comments

@kevin-hanselman

When a binary file begins with a large text header, the current
checksum routine treats it as a text file and normalizes its
line endings before computing the checksum. Manipulating binary files
this way is dangerous, and it also doubles the runtime of the
checksum routine, because every block of data must be read twice.

As noted in #3264, at a minimum, DVC should probably match
Git's text file detection routine, which inspects the first 8 kilobytes
(and doesn't apply a heuristic based on the ratio of printable characters,
as DVC currently does).
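
To illustrate the kind of check being proposed, here is a minimal sketch. It assumes the detection reduces to looking for a NUL byte in the leading block of the file. The names `PROBE_SIZE`, `looks_binary`, `is_binary_file`, and `demo` are hypothetical, and the 8000-byte window is an assumption standing in for "the first 8 kilobytes"; this is neither DVC's nor Git's actual code.

```python
import os
import tempfile

# Assumed probe window, standing in for "the first 8 kilobytes".
PROBE_SIZE = 8000


def looks_binary(head):
    """Treat data as binary if the probed prefix contains a NUL byte."""
    return b"\x00" in head


def is_binary_file(path, probe_size=PROBE_SIZE):
    """Read only the leading probe_size bytes of the file and check them."""
    with open(path, "rb") as fobj:
        return looks_binary(fobj.read(probe_size))


def demo():
    """Classify a binary file with a text header and a plain text file."""
    with tempfile.TemporaryDirectory() as tmp:
        # 1 KiB of printable header text followed by a binary payload.
        with_header = os.path.join(tmp, "with_header.bin")
        with open(with_header, "wb") as fobj:
            fobj.write(b"# header line\n" * 74)  # 1036 bytes of text
            fobj.write(bytes(range(256)) * 16)   # payload containing NULs

        # An ordinary text file with CRLF line endings.
        plain = os.path.join(tmp, "plain.txt")
        with open(plain, "wb") as fobj:
            fobj.write(b"hello\r\nworld\r\n")

        return is_binary_file(with_header), is_binary_file(plain)


if __name__ == "__main__":
    print(demo())
```

Because the probe window here is larger than the text header, the first file is classified as binary even though it starts with printable text. A file whose first NUL byte falls beyond the probe window would still be classified as text under this sketch.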

@efiop
Contributor

efiop commented Feb 19, 2020

Note that changing this heuristic will break backward compatibility, so we won't be able to adjust this right now. Related #992

@efiop added the bug and p3-nice-to-have labels Feb 19, 2020
@triage-new-issues bot removed the triage label Feb 19, 2020
@efiop
Contributor

efiop commented Dec 8, 2020

Closing in favor of #4658

@efiop efiop closed this as completed Dec 8, 2020