Fix character encoding in source files #469

marktsuchida · 2024-06-19T18:05:26Z

Convert all source files to UTF-8.

GitHub appears to be quite good at guessing the encoding, so many of the diffs may look like they are not changing anything (which is a good thing).

TODO:

Update vcxproj source encoding (if necessary)
Enforce UTF-8 for .h, .cpp in CI
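A check like the second TODO item could look roughly like the following. This is only a sketch and not the CI step actually used in this repository; the function name `find_non_utf8` and the default suffix list are made up for illustration.

```python
from pathlib import Path


def find_non_utf8(root, suffixes=(".h", ".cpp")):
    """Return the sorted paths under root (with a matching suffix) whose
    contents are not valid UTF-8."""
    offenders = []
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            try:
                # Strict decoding raises on any byte sequence that is not UTF-8.
                path.read_bytes().decode("utf-8")
            except UnicodeDecodeError:
                offenders.append(path)
    return sorted(offenders)
```

A CI job could call `find_non_utf8(".")` and fail when the returned list is non-empty. Pure ASCII files pass, since ASCII is valid UTF-8.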

`iconv -f CP1252 -t UTF-8`

Many of these files are pure iso-8859-1 (latin1), but CP-1252 is a superset of latin1. Some files contain CP-1252-only symbols.
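To illustrate the latin1 vs. CP-1252 distinction: the two agree on bytes 0xA0-0xFF, and the CP-1252-only symbols live in 0x80-0x9F, which latin1 assigns to control characters. A small sketch using Python's built-in codecs; the sample bytes and the helper name `cp1252_only_chars` are invented for illustration and are not taken from the converted files.

```python
def cp1252_only_chars(data: bytes) -> set:
    """Characters in data that come from bytes 0x80-0x9F, i.e. printable
    symbols in CP-1252 but control characters in latin1.

    Raises UnicodeDecodeError for the few bytes CP-1252 leaves unassigned.
    """
    return {bytes([b]).decode("cp1252") for b in data if 0x80 <= b <= 0x9F}


# Micro sign (0xB5, same in both encodings) and en dash (0x96, CP-1252 only).
sample = b"5 \xb5m \x96 10 \xb5m"

# Bytes 0xA0-0xFF decode identically under both encodings.
high_range = bytes(range(0xA0, 0x100))
same_above_a0 = high_range.decode("cp1252") == high_range.decode("latin-1")

print(cp1252_only_chars(sample))
print(same_above_a0)
```

Decoding as CP-1252 is therefore safe for the pure latin1 files as well, which is why a single `-f CP1252` covers both groups.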

iconv -f GB18030 -t UTF-8

Used `nkf -w`, because `iconv -f Shift-JIS -t UTF-8` introduces spurious changes to ASCII `~` and `\`.
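One way to check a converted file for this kind of damage is to look for the characters that a JIS X 0201 interpretation substitutes for the two ASCII bytes. The sketch below assumes the spurious replacements are YEN SIGN (U+00A5) for `\` and OVERLINE (U+203E) for `~`; the commit message does not say which characters iconv produced, so treat that mapping, and the helper name, as assumptions.

```python
# Assumed substitutions (JIS X 0201 reading of bytes 0x5C and 0x7E).
SUSPECTS = {"\u00a5": "\\", "\u203e": "~"}


def find_spurious_substitutions(text):
    """Return (line, column, found, expected_ascii) for each suspect character.

    Line and column numbers are 1-based.
    """
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in SUSPECTS:
                hits.append((lineno, col, ch, SUSPECTS[ch]))
    return hits
```

A yen sign that was genuinely intended (for example in a comment) would also be flagged, so hits need a manual look.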

iconv -f UTF-16 -t UTF-8

This file already contained ASCII '?' characters, suggesting that an earlier encoding conversion had already lost some characters. This seems to have happened before the file was first committed (2012) to this repository or its predecessors. The character in question here is the 8-bit byte F8, which in CP-1252 (or latin1) would be 'ø'. Maybe that is what it was, but in CP-1253 (or iso-8859-7, Latin/Greek) it is 'ψ', which seems more likely here. In any case, we are not losing any information by this conversion.
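The ambiguity of the F8 byte can be reproduced with Python's built-in codecs. This only shows what the byte means under each candidate encoding; it does not settle which one the original author used.

```python
byte = bytes([0xF8])
candidates = ["cp1252", "latin-1", "cp1253", "iso8859_7"]

# Decode the same single byte under each candidate encoding.
decoded = {name: byte.decode(name) for name in candidates}

for name, ch in decoded.items():
    print(f"{name:10} 0xF8 -> {ch} (U+{ord(ch):04X})")
```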

/utf-8 is equivalent to /source-charset:utf-8 /execution-charset:utf-8.

Now that all source files (except for included third-party headers outside of this repository) are in UTF-8, we must set source-charset to utf-8, because otherwise the default (on our build machines) is CP-1252. (The compiler does not auto-detect UTF-8 source files unless they have a UTF-8 BOM, which we do not use.)

The execution charset determines how (non-wide) string literals are encoded in the executable binary. UTF-8 is usually appropriate for our string constants (such as property names and values), because JNI and SWIG assume UTF-8 and C/C++ library functions work with UTF-8.

Our only examples of non-ASCII characters in string literals (at least among the files recently converted to UTF-8) are 8-bit characters from iso-8859-1 (latin1), apart from one or two insignificant exceptions. These, if exposed to the CMMCore API (to Java or Python), were presumably working just because the first 256 Unicode code points coincide with iso-8859-1 (which is not true of CP-1252): they were stored in the binary as CP-1252 but treated by SWIG-generated code as UTF-8. (Use of non-ASCII characters in property names and other such strings is still not a very good idea, but we don't want to change names in a way that would invalidate configuration files.)

As far as I can tell, the conversion of source code to UTF-8 and the introduction of /utf-8 do not interact with the project Character Set setting (Unicode or Multi-byte), which only controls whether the Win32 API functions default to the "A" or "W" version.

Incidentally, use of /utf-8 also helps to prepare for the switch to Meson, which adds /utf-8 by default.
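To make the effect of the execution charset concrete, the sketch below mimics in Python what the compiler does when it stores a narrow string literal: it encodes the literal's characters in the execution charset. The helper name `literal_bytes` and the property name are hypothetical, and this models the encoding step only, not the compiler itself.

```python
def literal_bytes(text: str, execution_charset: str) -> bytes:
    """Bytes a narrow string literal would occupy in the binary
    (without the trailing NUL) for a given execution charset."""
    return text.encode(execution_charset)


name = "Exposure (\u00b5s)"  # hypothetical property name containing a micro sign

print(literal_bytes(name, "cp1252"))  # micro sign stored as the single byte 0xB5
print(literal_bytes(name, "utf-8"))   # micro sign stored as the two bytes 0xC2 0xB5

# ASCII-only literals are byte-identical under both settings.
print(literal_bytes("Exposure", "cp1252") == literal_bytes("Exposure", "utf-8"))
```

Only literals containing non-ASCII characters change in the binary when the execution charset switches from CP-1252 to UTF-8.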

"The file contains a character starting at offset 0x____ that is illegal in the current source character set (____)." These are caused by included third-party headers that are not UTF-8. Hopefully their use of non-UTF-8 characters is limited to comments.

marktsuchida · 2024-06-19T22:11:49Z

I've taken care to make sure that these changes don't break anything on Windows (details in commit messages); if anything this may have fixed some non-ASCII strings on Windows computers set to a non-latin1 language. Non-Windows machines are probably not affected. But it's always possible that I forgot to consider some case, and it's hard to test these things. Fingers crossed.

marktsuchida added 5 commits June 19, 2024 12:33

Convert CP-1252 to UTF-8

c397e1b

Convert GB 18030 to UTF-8

eb9d96d

Convert Shift-JIS to UTF-8

dd29d76

Convert UTF-16 to UTF-8

264666a

marktsuchida mentioned this pull request Jun 19, 2024

use keyword for Camera tag added in circular buffer metadata #468

Merged

marktsuchida added 3 commits June 19, 2024 14:03

CI: Enforce UTF-8 for .cpp/.h/.txt files

32f4db9

Disable MSVC warning C4828

057499f

marktsuchida marked this pull request as ready for review June 19, 2024 22:07

marktsuchida merged commit 446fab8 into main Jun 19, 2024
1 check passed

marktsuchida deleted the fix-encoding branch June 19, 2024 22:11
