vsg::convert_utf does not work properly on platforms where wchar_t is not equivalent to char32_t #1133

Open
AnyOldName3 opened this issue Mar 21, 2024 · 1 comment

@AnyOldName3
Contributor

On some platforms, wchar_t is a 32-bit type, the same width as char32_t, and is intended to hold UCS-4/UTF-32 code points as fixed-width strings. On others, in particular ones that first added Unicode support in the early 90s, before UTF-16 existed and while The Unicode Consortium thought that sixteen bits would be enough to hold any character from any writing system humans had ever used, wchar_t is a 16-bit type, the same width as char16_t, and is intended to hold UCS-2 fixed-width strings or UTF-16 variable-width strings.

src/vsg/io/convert_utf.cpp assumes that a single wchar_t can hold an entire Unicode code point, which isn't guaranteed. This is easy to demonstrate by converting strings containing emoji between std::string and std::wstring in either direction on Windows: most emoji occupy code points above U+FFFF, and Windows is one of the platforms where wchar_t is sixteen bits. When converting wide strings to narrow, each half of a surrogate pair is treated as a code point in its own right and converted to three bytes, giving six nonsensical code units per code point, instead of being combined with its partner and converted to the four correct code units. When converting narrow strings to wide, the four code units are correctly decoded to the right code point held in a uint32_t, which is then static_cast to wchar_t. That truncates the most significant sixteen bits, which is harmless for the first 65536 code points (most non-emoji text, which is why it hasn't been noticed), but makes anything higher wrap around.
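
To make the two failure modes concrete, here is a minimal sketch of the arithmetic involved. It uses char16_t rather than wchar_t so that it behaves the same on every platform. None of these helpers are taken from vsg; the names truncate_to_16_bits, encode_utf16 and encode_utf8_unchecked are made up for illustration, and the UTF-8 encoder is a generic one that deliberately does not reject surrogates.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: what a plain static_cast to a 16-bit unit does to a
// code point. Only the low sixteen bits survive.
inline char16_t truncate_to_16_bits(char32_t cp)
{
    return static_cast<char16_t>(cp);
}

// Hypothetical helper: the UTF-16 encoding of one code point, using a
// surrogate pair for anything above U+FFFF.
inline std::u16string encode_utf16(char32_t cp)
{
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    if (cp < 0x10000) return std::u16string(1, static_cast<char16_t>(cp));

    cp -= 0x10000;
    std::u16string out;
    out += static_cast<char16_t>(0xD800 + (cp >> 10));   // high surrogate
    out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF)); // low surrogate
    return out;
}

// Hypothetical helper: a generic UTF-8 encoder that does NOT reject
// surrogates, so a lone surrogate comes out as three bytes.
inline std::string encode_utf8_unchecked(char32_t cp)
{
    std::string out;
    if (cp < 0x80)
    {
        out += static_cast<char>(cp);
    }
    else if (cp < 0x800)
    {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else if (cp < 0x10000)
    {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else
    {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For U+1F600, truncate_to_16_bits gives 0xF600, whereas encode_utf16 gives the pair 0xD83D 0xDE00. Feeding those two surrogates to encode_utf8_unchecked one at a time yields three bytes each, six in total, while feeding it the combined code point yields the four bytes F0 9F 98 80.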

I noticed this because I was poking around, and have seen this bug lots of times in different projects, and not because I'm affected by it, so there's no pressing need to fix this immediately, but it'll end up affecting someone eventually.

@lufriem
Contributor

lufriem commented Apr 3, 2024

I wonder if one could simply check for the size of wchar_t and then assume UTF-16? Or check whether it's a Windows build (I think that information is easily available?) and then assume UTF-16?
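
A rough sketch of the size-check idea, assuming C++17 for if constexpr. The helpers append_code_point and next_code_point are hypothetical and are not part of vsg's API; they only show where the branch on sizeof(wchar_t) would sit, and they leave the handling of unpaired surrogates to the caller.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: append one code point to a wide string, using a
// surrogate pair when wchar_t is only sixteen bits wide.
inline void append_code_point(std::wstring& out, char32_t cp)
{
    assert(cp <= 0x10FFFF);
    if constexpr (sizeof(wchar_t) == 2)
    {
        if (cp >= 0x10000)
        {
            cp -= 0x10000;
            out += static_cast<wchar_t>(0xD800 + (cp >> 10));   // high surrogate
            out += static_cast<wchar_t>(0xDC00 + (cp & 0x3FF)); // low surrogate
            return;
        }
    }
    out += static_cast<wchar_t>(cp);
}

// Hypothetical helper: read one code point starting at index i and advance i
// past it. The caller must ensure i < in.size(). An unpaired surrogate is
// returned unchanged.
inline char32_t next_code_point(const std::wstring& in, std::size_t& i)
{
    char32_t unit = static_cast<char32_t>(in[i++]);
    if constexpr (sizeof(wchar_t) == 2)
    {
        unit &= 0xFFFF; // guard against sign extension of a 16-bit unit
        const bool is_high = unit >= 0xD800 && unit <= 0xDBFF;
        if (is_high && i < in.size())
        {
            const char32_t low = static_cast<char32_t>(in[i]) & 0xFFFF;
            if (low >= 0xDC00 && low <= 0xDFFF)
            {
                ++i; // consume the low surrogate as well
                return 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00);
            }
        }
    }
    return unit;
}
```

Branching on sizeof(wchar_t) rather than on a Windows-specific macro such as _WIN32 has the advantage of covering any other platform with a 16-bit wchar_t, and both branches are still type-checked on every platform.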

AnyOldName3 changed the title from "vag::convert_utf does not work properly on platforms where wchar_t is not equivalent to char32_t" to "vsg::convert_utf does not work properly on platforms where wchar_t is not equivalent to char32_t" on Apr 3, 2024