Request to change Parser from utf-8 to bytes #69

bwbroersma · 2024-04-05T10:23:54Z

Thanks for fixing:

Improve parser error for Byte order mark (BOM) #57

However, the UTF-8 and BOM checks are currently done in the SecurityTXT class, not the Parser class. Could the following code move to the Parser class? This would of course change the Parser from accepting a UTF-8 string to accepting bytes.

sectxt/sectxt/__init__.py

Lines 422 to 435 in ad85c74

    
    def _get_str(self, content: bytes) -> str:
        try:
            if content.startswith(codecs.BOM_UTF8):
                content = content.replace(codecs.BOM_UTF8, b'', 1)
                self._add_error(
                    "bom_in_file",
                    "The Byte-Order Mark was found at the start of the file. "
                    "Security.txt must be encoded using UTF-8 in Net-Unicode form, "
                    "the BOM signature must not appear at the beginning."
                )
            return content.decode('utf-8')
        except UnicodeError:
            self._add_error("utf8", "Content must be utf-8 encoded.")
        return content.decode('utf-8', errors="replace")

Since Internet.nl uses the Parser class this would remove the need to duplicate these UTF-8 and BOM checks in Internet.nl.
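
To illustrate the request, here is a minimal standalone sketch of the same decode logic operating on bytes. It is not the sectxt implementation: the function name `decode_security_txt` and the returned list of error dicts are hypothetical, chosen only so the example runs without the library. The error codes and the UTF-8 message are copied from the snippet quoted above; the BOM message is shortened.

```python
from __future__ import annotations

import codecs


def decode_security_txt(content: bytes) -> tuple[str, list[dict[str, str]]]:
    """Hypothetical helper: decode raw security.txt bytes and collect errors.

    Mirrors the logic of the quoted _get_str method, but returns the
    errors instead of recording them on a parser object.
    """
    errors: list[dict[str, str]] = []

    # Strip a leading UTF-8 BOM and report it, as the quoted code does.
    if content.startswith(codecs.BOM_UTF8):
        content = content[len(codecs.BOM_UTF8):]
        errors.append({
            "code": "bom_in_file",
            "message": "The Byte-Order Mark was found at the start of the file.",
        })

    try:
        return content.decode("utf-8"), errors
    except UnicodeError:
        errors.append({
            "code": "utf8",
            "message": "Content must be utf-8 encoded.",
        })
        # Fall back to a lossy decode so parsing can continue.
        return content.decode("utf-8", errors="replace"), errors


if __name__ == "__main__":
    raw = codecs.BOM_UTF8 + b"Contact: mailto:security@example.com\n"
    text, errors = decode_security_txt(raw)
    print(repr(text), [e["code"] for e in errors])
```

With the check living next to the parsing code, a caller that only has the raw HTTP response body can hand over the bytes and get both the decoded text and the encoding errors from one place.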


DigitalTrustCenter · 2024-04-09T12:31:43Z

In the new version, the parser accepts bytes instead of a string, and the get_str function has been moved to the parser, as requested in your comment. If you use the parser directly, you will now still see the BOM error.
This was added in version 0.9.3.

bwbroersma mentioned this issue Apr 5, 2024

Update sectxt to 0.9.0 internetstandards/Internet.nl#1046

Open

DigitalTrustCenter closed this as completed Apr 9, 2024
