Request to change Parser from utf-8 to bytes #69

Closed
bwbroersma opened this issue Apr 5, 2024 · 1 comment

@bwbroersma
Contributor

Thanks for fixing:

However, the UTF-8 and BOM check is currently done in the SecurityTXT class, not the Parser class. Could the following code move to the Parser class? This would of course change the Parser from accepting a UTF-8 string to accepting bytes.

sectxt/sectxt/__init__.py

Lines 422 to 435 in ad85c74

    def _get_str(self, content: bytes) -> str:
        try:
            if content.startswith(codecs.BOM_UTF8):
                content = content.replace(codecs.BOM_UTF8, b'', 1)
                self._add_error(
                    "bom_in_file",
                    "The Byte-Order Mark was found at the start of the file. "
                    "Security.txt must be encoded using UTF-8 in Net-Unicode form, "
                    "the BOM signature must not appear at the beginning."
                )
            return content.decode('utf-8')
        except UnicodeError:
            self._add_error("utf8", "Content must be utf-8 encoded.")
            return content.decode('utf-8', errors="replace")

Since Internet.nl uses the Parser class directly, this would remove the need to duplicate the UTF-8 and BOM checks in Internet.nl.
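
To illustrate the behaviour being requested, here is a minimal standalone sketch of the same BOM and UTF-8 handling applied to raw bytes. It is not the sectxt API: the function name `decode_security_txt` and the returned list of error codes are assumptions made for illustration, standing in for the class method and `_add_error` shown above.

```python
import codecs
from typing import List, Tuple


def decode_security_txt(content: bytes) -> Tuple[str, List[str]]:
    """Decode raw bytes and collect error codes (hypothetical helper)."""
    errors: List[str] = []
    if content.startswith(codecs.BOM_UTF8):
        # Strip only the leading BOM and record that it was present.
        content = content[len(codecs.BOM_UTF8):]
        errors.append("bom_in_file")
    try:
        return content.decode("utf-8"), errors
    except UnicodeError:
        # Invalid UTF-8: report it, then decode leniently so parsing can continue.
        errors.append("utf8")
        return content.decode("utf-8", errors="replace"), errors


# A caller that only has the raw response body can pass it straight in.
text, errors = decode_security_txt(codecs.BOM_UTF8 + b"Contact: mailto:security@example.com\n")
```

With a bytes-accepting entry point like this, a caller would not need its own BOM or encoding checks before handing the content over.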

@DigitalTrustCenter
Owner

In the new version, the parser accepts bytes instead of a string, and the get_str function has been moved to the parser, as requested in your comment. If you use the parser directly, you will now still see the BOM error.
This was added in version 0.9.3.
