call "C:\Program Files (x86)\Microsoft Visual Studio\2019\Enterprise\VC\Auxiliary\Build\vcvars64.bat"
The source-charset is the encoding Visual Studio uses to interpret source files into its internal representation. Specifically, for narrow string literals in the source files, the compiler uses UTF-8 (why not UTF-16?) encoded strings as the internal representation; these strings are then converted to the execution-charset and stored in the compiled object files.
To sum up, the compiler converts narrow string literals in source files from source-charset to Unicode and then to execution-charset, and finally stores the results in the compiled binaries. source-charset must match the encoding in which the source files are stored on disk. execution-charset is the encoding of const char[] in memory when the program runs.
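One way to observe the execution-charset is to dump the bytes of a narrow literal. The sketch below is illustrative only; `hex_bytes` and `literal_bytes` are hypothetical helper names, not part of the test sources. The literal is written as a universal-character-name so that the result should depend on the execution-charset rather than on how the file itself is saved.

```cpp
#include <cstdio>
#include <string>

// Format each byte of a narrow string as two uppercase hex digits,
// separated by single spaces.
std::string hex_bytes(const std::string& s) {
    std::string out;
    char buf[4];
    for (unsigned char c : s) {
        std::snprintf(buf, sizeof buf, "%02X", static_cast<unsigned>(c));
        if (!out.empty()) out += ' ';
        out += buf;
    }
    return out;
}

// Bytes of U+00E9 as stored in the binary. The result depends on the
// execution-charset chosen at compile time (assumption: "C3 A9" under
// UTF-8, "E9" under Windows-1252).
std::string literal_bytes() {
    return hex_bytes("\u00E9");
}
```

Compiling the same file with and without /utf-8 and printing `literal_bytes()` should make the difference between the two settings visible.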
source-charset and execution-charset are independent. If a character in the source file cannot be represented in the execution character set, the Unicode conversion substitutes a question mark '?' character; see the /validate-charset option.
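The substitution rule can be sketched in a few lines. This is not the compiler's code: ISO-8859-1 (Latin-1) is used here only as a simple stand-in for an execution character set, and `to_latin1_lossy` is a hypothetical name.

```cpp
#include <string>

// Convert Unicode code points to Latin-1, substituting '?' for every
// code point the target character set cannot represent.
std::string to_latin1_lossy(const std::u32string& text) {
    std::string out;
    for (char32_t cp : text) {
        out += (cp <= 0xFF) ? static_cast<char>(static_cast<unsigned char>(cp))
                            : '?';
    }
    return out;
}
```

For instance, U+00E9 survives as the single byte 0xE9, while U+4E2D has no Latin-1 encoding and becomes '?'.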
By default, execution-charset is the Windows code page, a.k.a. the ANSI code page (ACP), unless you have specified a character set name or code page with the /execution-charset option. For source-charset, if no /source-charset option is specified, Visual Studio looks for a byte-order mark (BOM) to determine whether a source file is in an encoded Unicode format, for example UTF-16 or UTF-8. If no BOM is found, it assumes the source file is encoded in the ACP.
The test source file test\execution_charset.c is encoded as Windows-1252, which cannot be auto-detected, and it contains characters that are invalid in the ACP. Without /source-charset, the compiler performs an ACP-to-Unicode conversion on the Windows-1252 strings and reports C4819 for the bytes that are invalid in the ACP.
cl /c test\execution_charset.c
warning C4819: The file contains a character that cannot be represented in the current code page (936). Save the file in
Unicode format to prevent data loss.
Once the compiler is told the real encoding of the source file, the Unicode-to-ACP conversion is finally performed, and the compiler reports C4566 for the Unicode characters that the ACP cannot represent, substituting a question mark '?' for each.
cl /c /source-charset:.1252 test\execution_charset.c
warning C4566: character represented by universal-character-name '\u00FF' cannot be represented in the current code page (936).
Windows Console (conhost.exe) is a Win32 GUI app that consists of:
- InputBuffer: Stores keyboard and mouse event records generated by user input.
- OutputBuffer: Stores the text rendered on the Console's window client area.
OutputBuffer was essentially a 2D array of CHAR_INFO structs, which contain each cell's character data and attributes. That means only UCS-2 text was supported. Since the Windows 10 October 2018 Update (Version 1809, Build Number 10.0.17763), a new OutputBuffer has been introduced to fully support all Unicode characters.
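Why a one-code-unit-per-cell buffer is limited to UCS-2 can be seen from how UTF-16 encodes code points above U+FFFF. The following is a minimal sketch (the function name `encode_utf16` is made up for illustration, and the input is assumed to be a valid Unicode scalar value):

```cpp
#include <string>

// Encode one Unicode scalar value as UTF-16.
// Code points up to U+FFFF take one 16-bit unit; anything above needs a
// surrogate pair, i.e. two units, which cannot fit in a single cell that
// holds only one 16-bit character.
std::u16string encode_utf16(char32_t cp) {
    std::u16string out;
    if (cp < 0x10000) {
        out += static_cast<char16_t>(cp);
    } else {
        cp -= 0x10000;
        out += static_cast<char16_t>(0xD800 + (cp >> 10));    // high surrogate
        out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));  // low surrogate
    }
    return out;
}
```

U+1F600, for example, encodes as the two units 0xD83D 0xDE00.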
Another issue is that Console uses GDI for text rendering, which doesn't support font-fallback. So some complex glyphs can't be displayed even if the OutputBuffer could store them. ConPTY is introduced together with the new OutputBuffer. Then Console becomes a true "Console Host", which is windowless and not responsible for user input and rendering, supporting all Command-Line apps and/or GUI apps that communicate with Command-Line apps through Console Virtual Terminal Sequences. Terminal (TTY) is such a typical GUI app responsible for user input and rendering. With ConPTY infrastructure, Windows Terminal uses a new rendering engine that supports font-fallback and displays all testing characters correctly.
Command-Line apps use WriteConsoleW to write Unicode text to OutputBuffer and ReadConsoleW to read Unicode text from InputBuffer. WriteConsoleA/WriteFile can also be used for output, but that involves an encoding conversion from ConsoleOutputCP (defaults to OEMCP) to Unicode before the text is stored in OutputBuffer. Accordingly, using ReadConsoleA/ReadFile for input performs the conversion from Unicode to ConsoleInputCP (also defaults to OEMCP). Note that ConsoleInputCP only supports DBCS, see ms-terminal/src/host/dbcs.cpp#TranslateUnicodeToOem.
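The wide path can be sketched as follows. This is an illustrative helper under stated assumptions, not code from the test directory: `write_utf8` is a hypothetical name, the input is assumed to be valid UTF-8, and the non-Windows branch exists only so the sketch compiles elsewhere.

```cpp
#include <cstdio>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Write UTF-8 text to standard output. Returns false on failure.
bool write_utf8(const std::string& utf8) {
    if (utf8.empty()) return true;
#ifdef _WIN32
    HANDLE out = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD mode = 0;
    if (GetConsoleMode(out, &mode)) {
        // A real console: convert to UTF-16 and call WriteConsoleW, so
        // ConsoleOutputCP is never consulted.
        int n = MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                                    static_cast<int>(utf8.size()), nullptr, 0);
        if (n <= 0) return false;
        std::wstring wide(static_cast<std::size_t>(n), L'\0');
        MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                            static_cast<int>(utf8.size()), &wide[0], n);
        DWORD written = 0;
        return WriteConsoleW(out, wide.data(), static_cast<DWORD>(n),
                             &written, nullptr) != 0;
    }
    // Redirected to a file or pipe: emit the raw UTF-8 bytes.
    DWORD written = 0;
    return WriteFile(out, utf8.data(), static_cast<DWORD>(utf8.size()),
                     &written, nullptr) != 0;
#else
    return std::fwrite(utf8.data(), 1, utf8.size(), stdout) == utf8.size();
#endif
}
```

Checking GetConsoleMode first keeps redirected output byte-exact, while interactive output bypasses the code page conversion described above.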
The builtin command type of the "Command Prompt" shell (cmd.exe) checks the start of a file for a UTF-16LE BOM. If it finds such a mark, it displays the file content using WriteConsoleW, otherwise using WriteConsoleA/WriteFile. So type displays correctly only for UTF-16LE BOM-ed files and those encoded in the current ConsoleOutputCP. In PowerShell, type detects the BOM for UTF-16 and UTF-8. To verify these, just run type words\word-*.txt in Cmd and PowerShell.
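BOM sniffing of this kind amounts to comparing the first two or three bytes of the file. The sketch below shows the general technique only; it is not how cmd.exe or PowerShell is actually implemented, and `Bom` and `detect_bom` are hypothetical names.

```cpp
#include <cstddef>
#include <string>

enum class Bom { None, Utf8, Utf16LE, Utf16BE };

// Inspect the leading bytes of a buffer and report which BOM, if any,
// it starts with.
Bom detect_bom(const std::string& bytes) {
    auto b = [&](std::size_t i) { return static_cast<unsigned char>(bytes[i]); };
    if (bytes.size() >= 3 && b(0) == 0xEF && b(1) == 0xBB && b(2) == 0xBF)
        return Bom::Utf8;
    if (bytes.size() >= 2 && b(0) == 0xFF && b(1) == 0xFE)
        return Bom::Utf16LE;
    if (bytes.size() >= 2 && b(0) == 0xFE && b(1) == 0xFF)
        return Bom::Utf16BE;
    return Bom::None;
}
```

In these terms, cmd's type reacts only to `Bom::Utf16LE`, whereas PowerShell's type also handles the UTF-8 case.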
UCRT is Windows' equivalent of the GNU C Library (glibc); it has provided C99 and POSIX functionality and some extensions since Visual Studio 2015. Some POSIX functions have historically used the ACP for narrow->wide conversions. In order to support UTF-8, the utf8 locale has been implemented in ucrt/locale/get_qualified_locale.cpp since UCRT 10.0.17134.0, and those functions have been modified so that they use CP_UTF8 when the current locale is utf8, but the ACP otherwise, in order to preserve backwards compatibility. These POSIX functions call ucrt/inc/corecrt_internal_win32_buffer.h#__acrt_get_utf8_acp_compatibility_codepage to obtain the code page they should use for their conversions. An example is fopen: it converts the narrow path to a wide path using the obtained code page and then delegates to the wide version of ucrt/lowio/open.cpp#_sopen_nolock. In addition, the narrow string representation of std::filesystem::path is also encoded in the obtained code page.
The I/O flow path in the UCRT is
C++ I/O -> C I/O -> POSIX I/O -> Win32 File/Console I/O
filebuf -> FILE* -> read/write -> ReadFile/WriteFile/ReadConsoleW/WriteConsoleW
[w]cin/f[w]scanf/fget[w]s -> fget[w]c
[w]cout/f[w]printf/fput[w]s -> fput[w]c
fgetwc -> fgetc (*2, compose for _O_U16TEXT and _O_BINARY, mbtowc(DBCS) for _O_TEXT) -> fread -> read
fputwc -> (wctomb -> fputc, for _O_TEXT) -> fwrite -> write
The details of read in the different modes:
- _O_BINARY or _O_TEXT: ReadFile
- _O_U8TEXT: ReadFile -> UTF-8 -> UTF-16
- File _O_U16TEXT: ReadFile
- Console _O_U16TEXT: ReadConsoleW
The details of write in the different modes:
- _O_BINARY: WriteFile
- File _O_U8TEXT: UTF-16 -> UTF-8 -> WriteFile
- File otherwise: WriteFile
- Console Unicode: WriteConsoleW for each wchar, so only supports UCS-2
- Console _O_TEXT with LC_CTYPE:
  - C: WriteFile
  - utf8: UTF-8 -> UTF-16 -> ConsoleOutputCP (ConsoleInputCP before UCRT 10.0.19041.0) -> WriteFile
  - otherwise: DBCS (mbtowc) -> UTF-16 -> ConsoleOutputCP (ConsoleInputCP before UCRT 10.0.19041.0) -> WriteFile
Win32 Direct Console I/O and C Wide I/O are always available for Unicode Console I/O.
Since UCRT 10.0.17763.0, print functions treat the text data as UTF-8 encoded if the locale is set to utf8. The changes are in ucrt/lowio/write.cpp#write_double_translated_ansi_nolock. The translation to ConsoleInputCP is strange; I think it should be ConsoleOutputCP (this bug is fixed in UCRT 10.0.19041.0), and the double translation is unnecessary. UCRT should be reworked to use WriteConsoleW after translating to UTF-16 so that no code page is involved: ANSI (including UTF-8) -> UTF-16 -> WriteConsoleW.
ReadConsoleA/ReadFile return characters encoded in ConsoleInputCP, but SetConsoleCP(CP_UTF8) doesn't work since it only supports DBCS. There are two workarounds to support UTF-8 console input: delegating to wide input, or doing the ConsoleInputCP -> UTF-16 -> UTF-8 conversion. The example test\utf8_io.cpp illustrates both workarounds. UCRT should implement input as the reverse of the reworked output, i.e. ReadConsoleW -> UTF-16 -> ANSI (including UTF-8).
Since May 2021, UCRT64 for gcc toolchain and CLANG64 for clang toolchain are available as MSYS2 environments. They link against UCRT instead of MSVCRT.
Windows uses UTF-16 internally, so command-line arguments and environment variables are all UTF-16. The Visual C++ compiler provides a Unicode version of the C/C++ program entry point, named wmain.
For the ANSI version, main, both argv (an array of null-terminated strings representing the command-line arguments entered by the user of the program) and envp (an array of key=value formatted null-terminated strings representing a "frozen" copy of the variables set in the user's environment at program startup) are encoded in the ACP (converted from Unicode). So even a simple C program that uses printf to echo its command-line arguments doesn't work, since usually ACP != OEMCP. For example, on English-language Windows the ACP is 1252 while the OEMCP is 437, and byte 0xF7 is "÷" in 1252 but "≈" in 437.
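The mismatch can be made concrete with a toy decoder. This sketch deliberately covers only ASCII, the 0xA0..0xFF range of Windows-1252 (which coincides with U+00A0..U+00FF), and two entries of code page 437; `CodePage` and `decode_byte` are hypothetical names, and everything the sketch does not cover is reported as U+FFFD.

```cpp
enum class CodePage { Windows1252, Oem437 };

// Decode a single byte under the given code page.
// Returns U+FFFD for bytes this partial sketch does not cover.
char32_t decode_byte(unsigned char b, CodePage cp) {
    if (b < 0x80) return b;  // ASCII is shared by both code pages
    if (cp == CodePage::Windows1252) {
        if (b >= 0xA0) return b;  // identical to U+00A0..U+00FF
        return 0xFFFD;            // 0x80..0x9F omitted in this sketch
    }
    switch (b) {                  // a tiny excerpt of code page 437
        case 0xF6: return 0x00F7; // DIVISION SIGN
        case 0xF7: return 0x2248; // ALMOST EQUAL TO
        default:   return 0xFFFD;
    }
}
```

A byte 0xF7 produced under the ACP (1252) as "÷" is therefore rendered as "≈" by a console that interprets output using the OEMCP (437).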
To get UTF-8 encoded argv, simply link wmain into the final executable; take a look at the example test\echo.c.
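A wmain of this kind might look roughly like the sketch below. It is not the content of test\echo.c: `utf16_to_utf8` and `utf8_main` are hypothetical names, the conversion is hand-written so the sketch stays portable, and unpaired surrogates are replaced with U+FFFD as an arbitrary choice.

```cpp
#include <cstdio>
#include <cwchar>
#include <string>
#include <vector>

// Convert UTF-16 to UTF-8; unpaired surrogates become U+FFFD.
std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;  // consumed the low surrogate as well
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// The program's real entry point; receives UTF-8 arguments.
int utf8_main(int argc, char** argv) {
    for (int i = 1; i < argc; ++i) std::printf("%s\n", argv[i]);
    return 0;
}

#ifdef _WIN32
// Unicode entry point: convert each UTF-16 argument, then forward.
int wmain(int argc, wchar_t** wargv) {
    std::vector<std::string> storage;
    for (int i = 0; i < argc; ++i) {
        const wchar_t* w = wargv[i];
        storage.push_back(utf16_to_utf8(std::u16string(w, w + std::wcslen(w))));
    }
    std::vector<char*> argv;
    for (auto& s : storage) argv.push_back(&s[0]);
    argv.push_back(nullptr);
    return utf8_main(argc, argv.data());
}
#endif
```

On Windows, WideCharToMultiByte with CP_UTF8 could replace the hand-written converter; the portable version is used here only to keep the sketch self-contained.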
As of the Windows 10 May 2019 Update (Version 1903, Build Number 10.0.18362), one can set the active code page per process in the manifest. With a UTF-8 process code page, the command-line arguments and the ANSI variants of the Win32 APIs are all UTF-8 encoded, as test\win32_gui.cpp demonstrates. This model has the benefit of supporting existing code built with -A APIs without any code changes, but the program must still handle legacy code page detection and conversion as usual if it targets or runs on earlier Windows builds.
- Use Visual Studio 2015 or later with UCRT 10.0.17763.0 or later.
- Add /utf-8 to compile options to make all narrow string literals UTF-8.
- Link to wmain to get UTF-8 encoded argv.
- setlocale(LC_CTYPE, ".utf8") to support UTF-8 output and filenames (e.g. printf, fopen and std::filesystem::path).
- SetConsoleCP(CP_UTF8) due to the bug of double translation for console output before UCRT 10.0.19041.0. No need for C locale.
- Skip above three items if using UTF-8 process code page.
- SetConsoleOutputCP(CP_UTF8) to display characters correctly due to the encoding conversion from ConsoleOutputCP to Unicode in Windows Console.
- Use wide console input. The typical structure of a Command-Line app is: input somewhere, output everywhere.
HANDLE hConsoleInput = GetStdHandle(STD_INPUT_HANDLE);
DWORD mode;
if (GetConsoleMode(hConsoleInput, &mode)) {
    _setmode(_fileno(stdin), _O_U16TEXT);
    wstring ws;
    string s; // a UTF-8 string
    while (getline(std::wcin, ws)) {
        s.resize(WideCharToMultiByte(CP_UTF8, 0, ws.data(), ws.size(), NULL, 0, NULL, NULL));
        WideCharToMultiByte(CP_UTF8, 0, ws.data(), ws.size(), s.data(), s.size(), NULL, NULL);
        // process(s);
    }
} else {
    string s; // a UTF-8 string
    while (getline(std::cin, s)) {
        // process(s);
    }
}
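The locale and console code page items from the checklist could be gathered into one startup routine, roughly as sketched here. `init_utf8_console` is a hypothetical name, and the sketch assumes the process is attached to a console (the console calls may fail otherwise); off Windows it does nothing.

```cpp
#include <clocale>
#ifdef _WIN32
#include <windows.h>
#endif

// Apply the locale and console code page settings recommended above.
// Returns true if every step succeeded (trivially true off Windows).
bool init_utf8_console() {
#ifdef _WIN32
    // UTF-8 for printf, fopen, std::filesystem::path and friends.
    bool ok = std::setlocale(LC_CTYPE, ".utf8") != nullptr;
    // Works around the double-translation bug before UCRT 10.0.19041.0.
    ok = SetConsoleCP(CP_UTF8) != 0 && ok;
    // Makes the console decode narrow output as UTF-8.
    ok = SetConsoleOutputCP(CP_UTF8) != 0 && ok;
    return ok;
#else
    return true;
#endif
}
```

Calling it once at the top of main (or of the function that wmain forwards to) keeps the rest of the program free of encoding setup.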