vsg::convert_utf does not work properly on platforms where wchar_t is not equivalent to char32_t #1133

AnyOldName3 · 2024-03-21T18:12:47Z

On some platforms, wchar_t is a 32-bit type, the same width as char32_t, intended to hold UCS-4/UTF-32 code points as fixed-width strings. On others, in particular ones that adopted Unicode in the early 90s, before UTF-16 existed and while the Unicode Consortium still thought that sixteen bits would be enough to hold any character from any writing system humans had ever used, wchar_t is a 16-bit type, the same width as char16_t, intended to hold UCS-2 fixed-width strings or UTF-16 variable-width strings.

src/vsg/io/convert_utf.cpp works under the assumption that wchar_t can hold an entire Unicode code point on its own, which isn't guaranteed. This can be easily demonstrated by attempting to convert strings containing emoji between std::string and std::wstring in either direction on Windows, as most emoji occupy code points above U+FFFF (65535), and Windows is one of the platforms where wchar_t is sixteen bits. When converting wide strings to narrow, the two surrogates of each pair are treated as separate code points and converted to three bytes each, giving six nonsensical code units per code point, instead of being combined with their partner and converted to four correct code units. When converting narrow strings to wide, the four code units are correctly decoded to the right code point held in a uint32_t, which is then static_cast to wchar_t. That discards the most significant sixteen bits, so it works for the first 65536 code points (which covers most non-emoji text, hence why it's not been noticed), and wraps around for everything above them.
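To make the narrow-to-wide half concrete, here is a minimal C++17 sketch of what appending a decoded code point could look like when the code unit width is taken into account. `append_code_point` is a hypothetical helper, not a function from vsg, and it is templated on the character type only so that the 16-bit and 32-bit paths can both be exercised on any platform.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, not part of vsg: appends one Unicode code point to a
// string whose code units are either 16 or 32 bits wide.
template<typename CharT>
void append_code_point(std::basic_string<CharT>& out, char32_t cp)
{
    static_assert(sizeof(CharT) == 2 || sizeof(CharT) == 4,
                  "expected a 16-bit or 32-bit code unit type");

    // Precondition: a valid scalar value (in range and not a surrogate).
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));

    if constexpr (sizeof(CharT) == 4)
    {
        // UTF-32: one code unit per code point, so a plain cast is fine.
        out.push_back(static_cast<CharT>(cp));
    }
    else if (cp < 0x10000)
    {
        // UTF-16, Basic Multilingual Plane: still a single code unit.
        out.push_back(static_cast<CharT>(cp));
    }
    else
    {
        // UTF-16, supplementary planes: split into a surrogate pair
        // instead of truncating to the low sixteen bits.
        const char32_t offset = cp - 0x10000;
        out.push_back(static_cast<CharT>(0xD800 + (offset >> 10)));
        out.push_back(static_cast<CharT>(0xDC00 + (offset & 0x3FF)));
    }
}
```

Calling it with a `std::u16string` and U+1F600 appends the pair 0xD83D 0xDE00, whereas a bare `static_cast<char16_t>` of the same code point gives 0xF600, which is the wrap-around described above. With a `std::wstring` the same template picks whichever branch matches the platform's wchar_t.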

I noticed this because I was poking around, and have seen this bug lots of times in different projects, and not because I'm affected by it, so there's no pressing need to fix this immediately, but it'll end up affecting someone eventually.


lufriem · 2024-04-03T15:17:39Z

I wonder if one could simply check for the size of wchar_t and then assume UTF-16? Or check whether it's a Windows build (I think that information is easily available?) and then assume UTF-16?
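As a rough illustration of the size-check idea for the wide-to-narrow direction, the following C++17 sketch branches on `sizeof` at compile time and only looks for surrogate pairs when the code units are sixteen bits wide. The names `append_utf8` and `to_utf8` are made up for this example and are not vsg's API; replacing unpaired surrogates and out-of-range values with U+FFFD is an assumption about the desired error handling, not something the issue specifies.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: encodes one valid code point as UTF-8.
inline void append_utf8(std::string& out, char32_t cp)
{
    assert(cp <= 0x10FFFF);
    if (cp < 0x80)
    {
        out.push_back(static_cast<char>(cp));
    }
    else if (cp < 0x800)
    {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
    else if (cp < 0x10000)
    {
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
    else
    {
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
}

// Hypothetical converter: treats 16-bit code units as UTF-16 and
// 32-bit code units as UTF-32, decided purely by sizeof(CharT).
template<typename CharT>
std::string to_utf8(const std::basic_string<CharT>& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i)
    {
        char32_t cp = static_cast<char32_t>(in[i]);

        if constexpr (sizeof(CharT) == 2)
        {
            // Combine a high surrogate with the low surrogate that follows it.
            if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size())
            {
                const char32_t low = static_cast<char32_t>(in[i + 1]);
                if (low >= 0xDC00 && low <= 0xDFFF)
                {
                    cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                    ++i; // the low surrogate has been consumed
                }
            }
        }

        // Assumed policy: anything still invalid becomes U+FFFD.
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) cp = 0xFFFD;

        append_utf8(out, cp);
    }
    return out;
}
```

With this shape, `to_utf8(std::wstring(L"\U0001F600"))` should yield the four bytes F0 9F 98 80 whether wchar_t is 16 or 32 bits, so a separate Windows check would not be needed; the size of wchar_t alone decides which decoding path is compiled.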

AnyOldName3 changed the title ~~vag::convert_utf does not work properly on platforms where wchar_t is not equivalent to char32_t~~ vsg::convert_utf does not work properly on platforms where wchar_t is not equivalent to char32_t Apr 3, 2024

AnyOldName3 mentioned this issue Jul 31, 2024

Check and report errors for Win32 functions that may fail #1255

Closed
