cohesion crashes when using system-dependent encoding #21

Snezhnaya-chan · 2025-01-14T17:28:27Z

Currently cohesion uses the default encoding for open():

Lines 10 to 11 in b9add60

        with open(filename) as fd:
            return fd.read()

This takes the encoding from locale.getpreferredencoding(), which on my Windows installation is cp1252. So even though my file is saved as UTF-8, cohesion decodes it as cp1252 and crashes with:

    UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 9011: character maps to <undefined>

Adding encoding="utf-8" would solve this issue.
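For illustration, here is a minimal sketch of the failure and the proposed fix. The `encoding` parameter on `get_file_contents` and the temporary file are assumptions added for the demo; cohesion's actual function only takes a filename. The sample character U+0410 is chosen because its UTF-8 encoding contains the byte 0x90, which cp1252 leaves undefined.

```python
import os
import tempfile


def get_file_contents(filename, encoding="utf-8"):
    # Hypothetical variant of cohesion's helper with an explicit encoding,
    # so the result no longer depends on the system locale.
    with open(filename, encoding=encoding) as fd:
        return fd.read()


# Write a small UTF-8 source file. U+0410 encodes to the bytes D0 90,
# and 0x90 has no mapping in cp1252.
with tempfile.NamedTemporaryFile("wb", suffix=".py", delete=False) as tmp:
    tmp.write('name = "\u0410"\n'.encode("utf-8"))
    path = tmp.name

try:
    try:
        # Simulates what open() does by default on a cp1252 Windows locale.
        get_file_contents(path, encoding="cp1252")
        cp1252_failed = False
    except UnicodeDecodeError:
        cp1252_failed = True

    # With the explicit UTF-8 default the file reads cleanly.
    text = get_file_contents(path)
finally:
    os.remove(path)
```

Passing `encoding="cp1252"` explicitly is only a way to reproduce the Windows behaviour on any platform.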


mschwager · 2025-01-16T16:25:03Z

Hmm, interesting. Thanks for opening this issue and even including a fix PR. I do have one question, though: could this cause issues in reverse? In other words, if we hardcode encoding="utf-8", could we run into decoding errors when we try to open a file that's been encoded as cp1252? Sorry, I'm not too familiar with how this works on Windows.
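The reverse case can be checked directly. This is a small sketch using an arbitrary sample string: text with a non-ASCII character, encoded as cp1252, is generally not valid UTF-8, so a hardcoded encoding="utf-8" would raise on such a file.

```python
# "é" is the single byte 0xE9 in cp1252.
data = "caf\u00e9".encode("cp1252")

try:
    # 0xE9 starts a multi-byte sequence in UTF-8, but no continuation
    # bytes follow, so decoding fails.
    data.decode("utf-8")
    reverse_failed = False
except UnicodeDecodeError:
    reverse_failed = True
```

Pure-ASCII files are unaffected either way, since ASCII bytes decode identically under both encodings.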

Would another solution here be to use UTF-8 mode? In other words you could run python with the -X utf8 flag, or set PYTHONUTF8=1 when running cohesion.
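To see whether UTF-8 mode is actually in effect for a given run, something like the following sketch can help. The helper name `describe_text_encoding` is made up for this example; `sys.flags.utf8_mode` and `locale.getpreferredencoding()` are standard library.

```python
import locale
import sys


def describe_text_encoding():
    # utf8_mode is truthy when Python was started with -X utf8 or PYTHONUTF8=1.
    # preferred_encoding is what open() uses when no encoding is passed.
    return {
        "utf8_mode": bool(sys.flags.utf8_mode),
        "preferred_encoding": locale.getpreferredencoding(False),
    }


info = describe_text_encoding()
print(info)
```

Note that this approach moves the fix to whoever invokes cohesion, rather than changing how cohesion reads files.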

Snezhnaya-chan · 2025-01-16T16:37:18Z

Hi, it's possible, but since UTF-8 is the default encoding for Python source code, it would be a saner default than relying on the system locale.
Python does support a different encoding if it's declared in a comment on the first or second line of the file. I'm not really sure if anyone uses that, but it does make it possible to detect that comment and switch the encoding automatically, which would avoid such errors.
I'll look into that and update the PR accordingly.
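As a sketch of what detecting that comment could look like, the standard library's `tokenize.detect_encoding()` already implements the lookup (encoding declaration or BOM, falling back to UTF-8). The wrapper `detect_source_encoding` is a hypothetical name for this example, not something from cohesion.

```python
import io
import tokenize


def detect_source_encoding(source_bytes):
    # detect_encoding() reads at most two lines via the readline callable
    # and returns (encoding_name, lines_read).
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding


declared = detect_source_encoding(b"# -*- coding: latin-1 -*-\nx = 1\n")
default = detect_source_encoding(b"x = 1\n")
```

For reading straight from disk, `tokenize.open(filename)` applies the same detection and returns a text-mode file object.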

Snezhnaya-chan · 2025-01-16T17:42:42Z

Looking into it further, ast.parse() seems to accept bytes as well as str, and respects the encoding comment. We can just return bytes from get_file_contents and let the parser figure it out. I think that'd be the best solution right now, unless it's necessary for get_file_contents to only ever return str.

I've updated the PR in case that's satisfactory. Thanks.
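A minimal sketch of the bytes-based approach, assuming `get_file_contents` is changed to open the file in binary mode (the actual PR may differ in detail). The latin-1 sample file is made up to show that the parser honours the encoding declaration.

```python
import ast
import os
import tempfile


def get_file_contents(filename):
    # Return raw bytes; decoding is left to the parser.
    with open(filename, "rb") as fd:
        return fd.read()


# A source file saved as latin-1 with a matching encoding declaration.
source = '# -*- coding: latin-1 -*-\ngreeting = "caf\u00e9"\n'.encode("latin-1")
with tempfile.NamedTemporaryFile("wb", suffix=".py", delete=False) as tmp:
    tmp.write(source)
    path = tmp.name

try:
    # ast.parse() accepts bytes and decodes them using the declaration.
    tree = ast.parse(get_file_contents(path), filename=path)
finally:
    os.remove(path)

# The string literal from the first statement, decoded by the parser.
value = tree.body[0].value.value
```

The trade-off is that any caller expecting `str` from `get_file_contents` would need to be updated.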

Snezhnaya-chan linked a pull request Jan 14, 2025 that will close this issue

Fix encoding crashes when reading files #22

Open
