cohesion crashes when using system-dependent encoding #21

Open
Snezhnaya-chan opened this issue Jan 14, 2025 · 3 comments · May be fixed by #22

Comments

@Snezhnaya-chan

Currently cohesion uses the default encoding for open():

with open(filename) as fd:
    return fd.read()

This takes the encoding from locale.getpreferredencoding(), which on my Windows installation is cp1252. So even though my file is saved as UTF-8, cohesion decodes it as cp1252 and crashes with:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 9011: character maps to <undefined>

Adding encoding="utf-8" would solve this issue.
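For illustration, here is a minimal sketch of that change. It assumes the helper is the get_file_contents function mentioned later in this thread and that it only wraps open(); the demo function and file name are hypothetical. The sample uses a Cyrillic letter whose UTF-8 form contains byte 0x90, which cp1252 cannot decode.

```python
import os
import tempfile


def get_file_contents(filename):
    # Assumption: same shape as the helper quoted above, with an explicit
    # encoding so the result no longer depends on the system locale.
    with open(filename, encoding="utf-8") as fd:
        return fd.read()


def demo():
    # U+0410 encodes to b"\xd0\x90" in UTF-8; byte 0x90 is undefined in cp1252.
    source = 'name = "\u0410"\n'
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.py")  # hypothetical file name
        with open(path, "wb") as fd:
            fd.write(source.encode("utf-8"))
        return get_file_contents(path)
```

With the explicit encoding, demo() returns the original text on any platform, whereas decoding the same bytes as cp1252 raises UnicodeDecodeError.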

@Snezhnaya-chan linked a pull request on Jan 14, 2025 that will close this issue
@mschwager
Owner

Hmm, interesting. Thanks for opening this issue and even including a fix PR. I do have one question though: could this cause issues in reverse? In other words, if we hardcode encoding="utf-8", could we run into decoding errors when we try to open a file that's been encoded as cp1252? Sorry, I'm not too familiar with how this works on Windows.

Would another solution here be to use UTF-8 mode? In other words you could run python with the -X utf8 flag, or set PYTHONUTF8=1 when running cohesion.
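To check whether UTF-8 mode is actually in effect for a given run, something like the following sketch could help. describe_default_encoding is a hypothetical helper name, not part of cohesion; sys.flags.utf8_mode and locale.getpreferredencoding() are standard library calls (the flag is available from Python 3.7).

```python
import locale
import sys


def describe_default_encoding():
    # Hypothetical helper: reports whether UTF-8 mode is active and which
    # encoding the locale machinery reports as preferred for text files.
    return {
        "utf8_mode": bool(sys.flags.utf8_mode),
        "preferred_encoding": locale.getpreferredencoding(False),
    }


if __name__ == "__main__":
    print(describe_default_encoding())
```

Running this script once normally and once with python -X utf8 (or with PYTHONUTF8=1 set) should show the difference on a machine whose locale encoding is not UTF-8.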

@Snezhnaya-chan
Author

Hi, it's possible, but since UTF-8 is the default encoding for Python source code, it'd be a saner default than relying on the system locale.
Python does support a different encoding if it's declared in a comment on the first or second line of the file. I'm not sure whether anyone uses that, but it does make it possible to detect that comment and switch the encoding automatically, which would avoid such errors.
I'll look into that and update the PR accordingly.
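One way the encoding-comment detection could look, as a sketch only: the standard library's tokenize.open() opens a file in text mode using the encoding from a BOM or a coding comment, falling back to UTF-8. read_python_source, the demo function, and the file name are hypothetical, and this is not necessarily what the PR does.

```python
import os
import tempfile
import tokenize


def read_python_source(filename):
    # tokenize.open() inspects the first two lines for a coding comment
    # (or a UTF-8 BOM) and decodes the file accordingly, defaulting to UTF-8.
    with tokenize.open(filename) as fd:
        return fd.read()


def demo_declared_encoding():
    # A legacy file that declares cp1252 and contains a non-ASCII character.
    source = "# -*- coding: cp1252 -*-\nname = 'caf\u00e9'\n"
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "legacy.py")  # hypothetical file name
        with open(path, "wb") as fd:
            fd.write(source.encode("cp1252"))
        return read_python_source(path)
```

This keeps the return type as str, so callers that expect text would not need to change.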

@Snezhnaya-chan
Author

Looking into it further, ast.parse() seems to accept bytes as well as str, and respects the encoding comment. We can just return bytes from get_file_contents and let the parser figure it out. I think that'd be the best solution right now, unless it's necessary for get_file_contents to only ever return str.

I've updated the PR in case that's satisfactory. Thanks.
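A rough sketch of the bytes-based approach, assuming get_file_contents simply reads the file in binary mode (the actual PR may differ). parse_source is a hypothetical wrapper. ast.parse() decodes the bytes itself, honoring a coding comment if present and assuming UTF-8 otherwise.

```python
import ast


def get_file_contents(filename):
    # Assumption: return raw bytes and leave decoding to the parser.
    with open(filename, "rb") as fd:
        return fd.read()


def parse_source(raw):
    # Hypothetical wrapper: ast.parse() accepts bytes as well as str.
    return ast.parse(raw)


# Bytes with a cp1252 declaration, as they might come from a legacy file.
declared = "# -*- coding: cp1252 -*-\nname = 'caf\u00e9'\n".encode("cp1252")
declared_value = ast.literal_eval(parse_source(declared).body[0].value)

# Bytes without a declaration are treated as UTF-8.
plain = "name = '\u0410'\n".encode("utf-8")
plain_value = ast.literal_eval(parse_source(plain).body[0].value)
```

Both string literals come back correctly decoded, so neither the locale nor a hardcoded encoding is involved in reading the source.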
