Misclassification of binary files as text files #3364

kevin-hanselman · 2020-02-19T14:50:55Z

When a binary file has a large text header, the current checksum
routine treats it as a text file and normalizes line endings before
computing the checksum. Not only is it dangerous to manipulate binary
files this way, it also doubles the runtime of the checksum routine,
because every block of data must be read twice.
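
To illustrate the risk, here is a minimal sketch (not DVC's actual code). The helper `normalized_md5` is a hypothetical stand-in for a text-mode checksum that replaces CRLF with LF before hashing. Two binary payloads that differ only in one `\r\n` versus `\n` byte sequence then produce the same checksum, even though their raw bytes differ.

```python
import hashlib


def raw_md5(data: bytes) -> str:
    """Checksum of the bytes exactly as stored."""
    return hashlib.md5(data).hexdigest()


def normalized_md5(data: bytes) -> str:
    """Hypothetical text-mode checksum: CRLF is replaced with LF before hashing."""
    return hashlib.md5(data.replace(b"\r\n", b"\n")).hexdigest()


# A long text header followed by a small binary payload (made-up data).
header = b"# text header\n" * 64
blob_a = header + b"\x01\x02\r\n\x03"
blob_b = header + b"\x01\x02\n\x03"

# The raw checksums differ, but the normalized checksums collide.
print(raw_md5(blob_a) == raw_md5(blob_b))
print(normalized_md5(blob_a) == normalized_md5(blob_b))
```

If such files were misclassified as text, the two distinct binaries would be indistinguishable by checksum.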

As noted in #3264, at a minimum DVC should probably match
Git's text file detection routine, which inspects only the first 8 kilobytes
(and doesn't apply a heuristic on the ratio of printable characters,
as DVC currently does).
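
A rough sketch of what a Git-style check could look like in Python follows. It assumes that the check amounts to looking for a NUL byte in a fixed-size prefix of the file. The names `FIRST_FEW_BYTES`, `looks_binary`, and `file_looks_binary` are illustrative, the 8000-byte limit is my reading of "the first 8 kilobytes", and none of this is taken from DVC's code base.

```python
FIRST_FEW_BYTES = 8000  # assumption: size of the prefix that gets inspected


def looks_binary(data: bytes, limit: int = FIRST_FEW_BYTES) -> bool:
    """Return True if a NUL byte appears within the first `limit` bytes."""
    return b"\x00" in data[:limit]


def file_looks_binary(path: str, limit: int = FIRST_FEW_BYTES) -> bool:
    """Read only the prefix of the file, so the cost is independent of file size."""
    with open(path, "rb") as fobj:
        return looks_binary(fobj.read(limit), limit)
```

With this approach, a binary file with a text header of a few hundred bytes is still classified as binary, provided a NUL byte shows up somewhere in the inspected prefix. A NUL byte that first appears beyond the prefix is not seen.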

efiop · 2020-02-19T18:08:26Z

Note that changing this heuristic will break backward compatibility, so we won't be able to adjust it right now. Related: #992

efiop · 2020-12-08T00:38:09Z

Closing in favor of #4658

triage-new-issues bot added the triage (Needs to be triaged) label Feb 19, 2020

kevin-hanselman mentioned this issue Feb 19, 2020

utils: more robust text file detection on checksum #3264

Closed

3 tasks

efiop added the bug (Did we break something?) and p3-nice-to-have (It should be done this or next sprint) labels Feb 19, 2020

triage-new-issues bot removed the triage (Needs to be triaged) label Feb 19, 2020

weekly-digest bot mentioned this issue Feb 23, 2020

Weekly Digest (16 February, 2020 - 23 February, 2020) #3387

Closed

efiop closed this as completed Dec 8, 2020
