Fix character encoding in source files #469

Merged
merged 8 commits into from
Jun 19, 2024

@marktsuchida marktsuchida commented Jun 19, 2024

All to UTF-8.

GitHub appears to be quite good at guessing the encoding, so many of the diffs may look like they are not changing anything (which is a good thing).

TODO:

  • Update vcxproj source encoding (if necessary)
  • Enforce UTF-8 for .h, .cpp in CI
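
One way the second TODO item might be approached is a small script, run in CI, that reports any `.h` or `.cpp` file that is not valid UTF-8. The following is a minimal sketch under stated assumptions, not something this PR implements: the function names, the set of suffixes, and the exit-code convention are all hypothetical.

```python
from pathlib import Path

# Assumption: only these extensions are checked, per the TODO item.
SOURCE_SUFFIXES = {".h", ".cpp"}


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def find_non_utf8_files(root: Path) -> list[Path]:
    """Return sorted paths under root with a checked suffix that are not valid UTF-8."""
    bad = []
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix in SOURCE_SUFFIXES:
            if not is_valid_utf8(path.read_bytes()):
                bad.append(path)
    return bad


def main(argv: list[str]) -> int:
    """Print offending files and return a process exit code (0 means all OK)."""
    root = Path(argv[1]) if len(argv) > 1 else Path(".")
    offenders = find_non_utf8_files(root)
    for path in offenders:
        print(f"not UTF-8: {path}")
    return 1 if offenders else 0
```

A CI step could call `sys.exit(main(sys.argv))` from a wrapper script. Note that pure-ASCII files pass trivially, since ASCII is a subset of UTF-8.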

iconv -f CP1252 -t UTF-8

Many of these files are pure iso-8859-1 (latin1), but CP-1252 is a
superset of latin1. Some files contain CP-1252-only symbols.
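
The difference between "pure latin1" and "CP-1252-only" content can be illustrated in Python. This is only a sketch: the helper names and sample bytes are hypothetical, and it uses Python's `cp1252` codec rather than iconv, so edge cases (such as the few bytes CP-1252 leaves unassigned) may behave differently.

```python
def uses_cp1252_only_bytes(data: bytes) -> bool:
    """True if data contains a byte in 0x80-0x9F, where CP-1252 and latin1 differ."""
    return any(0x80 <= b <= 0x9F for b in data)


def cp1252_to_utf8(data: bytes) -> bytes:
    """Re-encode CP-1252 bytes as UTF-8 (roughly what the iconv call does)."""
    return data.decode("cp1252").encode("utf-8")


latin1_sample = b"10 \xb5m"        # 0xB5 is MICRO SIGN in both encodings
cp1252_sample = b"\x93quoted\x94"  # 0x93/0x94 are curly quotes only in CP-1252

print(uses_cp1252_only_bytes(latin1_sample), cp1252_to_utf8(latin1_sample))
print(uses_cp1252_only_bytes(cp1252_sample), cp1252_to_utf8(cp1252_sample))
```

For a file with no bytes in 0x80-0x9F, decoding as CP-1252 or as latin1 gives the same text, which is why a single `-f CP1252` conversion covers both kinds of file.
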

iconv -f GB18030 -t UTF-8

Used `nkf -w`.

(iconv -f Shift-JIS -t UTF-8 introduces spurious changes to ASCII `~`
and `\`)

iconv -f UTF-16 -t UTF-8
This file already contained ASCII '?' characters, suggesting a previous
encoding conversion had already lost some characters. (This seems to
have happened before the file was first committed (2012) to this
repository or its predecessors.)

The character in question here is 8-bit F8, which in CP-1252 (or latin1)
would be 'ø'. Maybe that is what it was, but in CP-1253 (iso-8859-7,
Latin/Greek) it is 'ψ', which seems more likely here.
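
The ambiguity of a lone 0xF8 byte can be checked with Python's codecs. This is a small sketch; the list of candidate encodings is just the set mentioned above, not an exhaustive search.

```python
raw = b"\xf8"

# Candidate encodings taken from the discussion above.
candidates = ["cp1252", "latin-1", "cp1253", "iso8859_7"]
decoded = {name: raw.decode(name) for name in candidates}

for name, ch in decoded.items():
    print(f"{name:>10}: {ch!r} (U+{ord(ch):04X})")
```

The first two decode the byte as 'ø' (U+00F8) and the Greek code pages decode it as 'ψ' (U+03C8); the byte alone cannot say which was intended.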

In any case, we are not losing any information by this conversion.

/utf-8 is equivalent to /source-charset:utf-8 /execution-charset:utf-8.

Now that all source files (except for included third-party headers
outside of this repository) are in UTF-8, we must set source-charset to
utf-8, because otherwise the default (on our build machines) is CP-1252.
(The compiler does not auto-detect UTF-8 source files unless they have a
UTF-8 BOM, which we do not use.)
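
For reference, the UTF-8 BOM is the three-byte sequence EF BB BF at the very start of a file. A minimal Python sketch for checking whether some file contents carry one (the helper name is hypothetical):

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """True if data begins with the UTF-8 byte order mark (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


print(has_utf8_bom(b"\xef\xbb\xbfint x;"))  # file saved with a BOM
print(has_utf8_bom(b"int x;"))              # BOM-less file, as used here
```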

The execution charset determines how (non-wide) string literals are
encoded in the executable binary. UTF-8 is usually appropriate for our
string constants (such as property names and values), because JNI and
SWIG assume UTF-8 and C/C++ library functions work with UTF-8.
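
To make the effect of the execution charset concrete: the same literal ends up as different bytes in the binary. The sketch below uses Python's codecs as a stand-in for what the compiler does with a narrow string literal; the example string is hypothetical and not taken from the repository.

```python
literal = "Exposure (\u00b5s)"  # hypothetical property name containing MICRO SIGN

as_cp1252 = literal.encode("cp1252")  # bytes under a CP-1252 execution charset
as_utf8 = literal.encode("utf-8")     # bytes under /utf-8

print(as_cp1252)
print(as_utf8)
```

The micro sign is the single byte B5 in CP-1252 but the two bytes C2 B5 in UTF-8, so code that consumes the literal must agree with the compiler on the encoding.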

The only non-ASCII characters in our string literals (at least among the
files recently converted to UTF-8) are, with one or two insignificant
exceptions, 8-bit characters from iso-8859-1 (latin1). Where these were
exposed through the CMMCore API (to Java or Python), they were presumably
working only because the first 256 Unicode code points coincide with
iso-8859-1 (but not with CP-1252) -- they were stored in the binary as
CP-1252 but treated by SWIG-generated code as UTF-8.

(Use of non-ASCII characters in property names and other such strings is
still not a very good idea, but we don't want to change names that will
invalidate configuration files.)

As far as I can tell, the conversion of source code to UTF-8 and the
introduction of /utf-8 do not interact with the project Character Set
setting (Unicode or Multi-byte), which only controls whether the Win32
API functions default to the "A" or "W" version.

Incidentally, use of /utf-8 also helps to prepare for the switch to
Meson, which adds /utf-8 by default.

"The file contains a character starting at offset 0x____ that is illegal
in the current source character set (____)."

These are caused by included third-party headers that are not UTF-8.
Hopefully their use of non-UTF-8 characters is limited to comments.
@marktsuchida marktsuchida marked this pull request as ready for review June 19, 2024 22:07
@marktsuchida

I've taken care to make sure that these changes don't break anything on Windows (details in commit messages); if anything this may have fixed some non-ASCII strings on Windows computers set to a non-latin1 language. Non-Windows machines are probably not affected. But it's always possible that I forgot to consider some case, and it's hard to test these things. Fingers crossed.

@marktsuchida marktsuchida merged commit 446fab8 into main Jun 19, 2024
1 check passed
@marktsuchida marktsuchida deleted the fix-encoding branch June 19, 2024 22:11