Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BOM error message always present #65

Closed
affimnem opened this issue Feb 14, 2024 · 1 comment
Closed

BOM error message always present #65

affimnem opened this issue Feb 14, 2024 · 1 comment

Comments

@affimnem
Copy link

affimnem commented Feb 14, 2024

The BOM check has an error message, but it is outside the "then" part of the code, making the error message always appear. https://github.com/DigitalTrustCenter/sectxt/blob/main/sectxt/__init__.py#L419C1-L432C57 has:

   def _get_str(self, content: bytes) -> str:
        try:
            if content.startswith(codecs.BOM_UTF8):
                content = content.replace(codecs.BOM_UTF8, b'')
            self._add_error(
                "bom_in_file",
                "The Byte-Order Mark was found in the UTF-8 File. "
                "Security.txt must be encoded using UTF-8 in Net-Unicode form, "
                "the BOM signature must not appear at the beginning."
            )
            return content.decode('utf-8')
        except UnicodeError:
            self._add_error("utf8", "Content must be utf-8 encoded.")
        return content.decode('utf-8', errors="replace")

I did a simple test and by indenting self._add_error(...) to be on the same level as the line before it only gave error when there is a BOM at the beginning of the file.

In addition, the error message doesn't state that the BOM is at the start of the file, which is the problem. I'd suggest to change the first sentence to: "The Byte-Order Mark (BOM) was found at the start of the UTF-8 file." (Not really sure if "UTF-8" is necessary.)

Finally, content.replace(codecs.BOM_UTF8, b'') will replace all BOM occurences, not only the starting BOM. I don't know if this is an oversight, or if it is intended to make the following parsing easier, but conceptually it feels wrong, and the following would fix the issue, and only that, as bytes.replace() has a third count argument:

                content = content.replace(codecs.BOM_UTF8, b'', 1)
@DigitalTrustCenter
Copy link
Owner

Thank you for reporting this issue. This was a serious oversight and your analysis was correct. The error should only be reported if the BOM was found in the file. We also agree with your proposal to improve the error message to include that the bom was at the start of the file and that the replace only does the first occurrence, even though the BOM should only be able to occur at the start it does make it more clear what it is doing.

This issue is resolved with the update to version 0.9.2

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants