On some platforms, wchar_t is a 32-bit type, the same width as char32_t, intended to hold UCS-4/UTF-32 code points as fixed-width strings. On others, in particular ones that attempted to support Unicode in the early 90s, before UTF-8 and UTF-16 were established and when the Unicode Consortium thought that sixteen bits would be enough to hold any character from any writing system humans had ever used, wchar_t is a 16-bit type, the same width as char16_t, intended to hold UCS-2 fixed-width strings or UTF-16 variable-width strings.
src/vsg/io/convert_utf.cpp works under the assumption that wchar_t can hold an entire Unicode code point on its own, which isn't guaranteed. This can be easily demonstrated by attempting to convert strings containing emoji between std::string and std::wstring in either direction on Windows, as most emoji occupy code points at or above 65536 (U+10000), and Windows is one of the platforms where wchar_t is sixteen bits. When converting wide strings to narrow, each surrogate is treated as if it were unpaired and converted to three bytes, giving six nonsensical code units per code point, instead of being glued to its partner and converted to a combined four correct code units. When converting narrow strings to wide, the four code units are correctly decoded to the right code point held in a uint32_t, which is then static_cast to wchar_t. That discards the most significant sixteen bits, which is harmless for the first 65536 code points (which covers most non-emoji text, hence why it's not been noticed), and wraps around for anything above.
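To make the narrow-to-wide half of this concrete, here is a minimal sketch of what a 16-bit cast does to a code point above U+FFFF compared with proper UTF-16 encoding. The names `truncate_to_16` and `append_utf16` are hypothetical helpers written for illustration, not functions from vsg, and char16_t stands in for a 16-bit wchar_t so that the behaviour is the same on every platform.

```cpp
#include <cassert>
#include <string>

// Hypothetical illustration of the bug: casting a code point to a 16-bit
// type keeps only the low sixteen bits.
inline char16_t truncate_to_16(char32_t cp)
{
    return static_cast<char16_t>(cp);
}

// Hypothetical helper: append one code point to a UTF-16 string,
// emitting a surrogate pair for code points at or above U+10000.
inline void append_utf16(std::u16string& out, char32_t cp)
{
    assert(cp <= 0x10FFFF); // largest valid Unicode code point
    if (cp < 0x10000)
    {
        out.push_back(static_cast<char16_t>(cp));
    }
    else
    {
        cp -= 0x10000; // leaves a 20-bit value
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));   // high surrogate: top 10 bits
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF))); // low surrogate: bottom 10 bits
    }
}
```

For U+1F600, the cast yields the unrelated code unit 0xF600, whereas the surrogate encoding yields the two code units 0xD83D 0xDE00.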
I noticed this because I was poking around and have seen this bug lots of times in different projects, not because I'm affected by it, so there's no pressing need to fix it immediately, but it'll end up affecting someone eventually.
I wonder if one could simply check for the size of wchar_t and then assume UTF-16? Or check whether it's a Windows build (I think that information is easily available?) and then assume UTF-16?
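One way the size check could look, as a rough sketch rather than a proposed patch: branch on `sizeof(wchar_t)` at compile time and treat a 16-bit wchar_t as UTF-16. The helpers `append_wide` and `next_code_point` are hypothetical names, not existing vsg functions, and the sketch assumes C++17 for `if constexpr`. Unpaired surrogates are passed through unchanged here, which a real implementation might prefer to replace or reject.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: append one code point to a wide string, using a
// surrogate pair when wchar_t is too narrow to hold it.
inline void append_wide(std::wstring& out, char32_t cp)
{
    if constexpr (sizeof(wchar_t) >= 4)
    {
        out.push_back(static_cast<wchar_t>(cp)); // UTF-32: one unit per code point
    }
    else if (cp < 0x10000)
    {
        out.push_back(static_cast<wchar_t>(cp)); // fits in one UTF-16 unit
    }
    else
    {
        cp -= 0x10000;
        out.push_back(static_cast<wchar_t>(0xD800 + (cp >> 10)));   // high surrogate
        out.push_back(static_cast<wchar_t>(0xDC00 + (cp & 0x3FF))); // low surrogate
    }
}

// Hypothetical helper: read the code point starting at index i and
// advance i past it, joining a surrogate pair when wchar_t is 16 bits.
inline char32_t next_code_point(const std::wstring& in, std::size_t& i)
{
    assert(i < in.size());
    char32_t c = static_cast<char32_t>(in[i++]);
    if constexpr (sizeof(wchar_t) < 4)
    {
        if (c >= 0xD800 && c <= 0xDBFF && i < in.size())
        {
            char32_t low = static_cast<char32_t>(in[i]);
            if (low >= 0xDC00 && low <= 0xDFFF)
            {
                ++i;
                return 0x10000 + ((c - 0xD800) << 10) + (low - 0xDC00);
            }
        }
    }
    return c; // BMP code point, UTF-32 unit, or an unpaired surrogate left as-is
}
```

Checking the size rather than the platform means the same code path covers any other target with a 16-bit wchar_t, not only Windows.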
AnyOldName3 changed the title from "vag::convert_utf does not work properly on platforms where wchar_t is not equivalent to char32_t" to "vsg::convert_utf does not work properly on platforms where wchar_t is not equivalent to char32_t" on Apr 3, 2024.