-
Notifications
You must be signed in to change notification settings - Fork 111
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix character encoding in source files #469
Merged
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
iconv -f CP1252 -t UTF-8 Many of these files are pure iso-8859-1 (latin1), but CP-1252 is a superset of latin1. Some files contain CP-1252-only symbols.
iconv -f GB18030 -t UTF-8
Used `nkf -w`. (iconv -f Shift-JIS -t UTF-8 introduces spurious changes to ASCII `~` and `\`)
iconv -f UTF-16 -t UTF-8
This file already contained ASCII '?' characters, suggesting a previous encoding conversion had already lost some characters. (This seems to have happened before the file was first committed (2012) to this repository or its predecessors.) The character in question here is 8-bit F8, which in CP-1252 (or latin1) would be 'ø'. Maybe that is what it was, but in CP-1253 (iso-8859-7, Latin/Greek) it is 'ψ', which seems more likely here. In any case, we are not losing any information by this conversion.
/utf-8 is equivalent to /source-charset:utf-8 /execution-charset:utf-8. Now that all source files (except for included third-party headers outside of this repository) are in UTF-8, we must set source-charset to utf-8, because otherwise the default (on our build machines) is CP-1252. (The compiler does not auto-detect UTF-8 source files unless they have a UTF-8 BOM, which we do not use.) The execution charset determines how (non-wide) string literals are encoded in the executable binary. UTF-8 is usully appropriate for our string constants (such as property names and values), because JNI and SWIG assume UTF-8 and C/C++ library functions work with UTF-8. Our only examples of non-ASCII characters in string literals (at least among the files recently converted to UTF-8) are 8-bit characters from iso-8859-1 (latin1) (there are one or two exceptions that are insignificant). These, if exposed to the CMMCore API (to Java or Python) were presumably working just because UTF-8 is a superset of iso-8859-1 (but not of CP-1252) -- they were stored in the binary as CP-1252 but treated by SWIG-generated code as UTF-8. (Use of non-ASCII characters in property names and other such strings is still not a very good idea, but we don't want to change names that will invalidate configuration files.) As far as I can tell, the conversion of source code to UTF-8 and the introduction of /utf-8 do not interact with the project Character Set setting (Unicode or Multi-byte), which only control whether the Win32 API functions default to the "A" or "W" version. Incidentally, use of /utf-8 also helps to prepare for the switch to Meson, which adds /utf-8 by default.
"The file contains a character starting at offset 0x____ that is illegal in the current source character set (____)." These are caused by included third-party headers that are not UTF-8. Hopefully their use of non-UTF-8 characters is limited to comments.
I've taken care to make sure that these changes don't break anything on Windows (details in commit messages); if anything this may have fixed some non-ASCII strings on Windows computers set to a non-latin1 language. Non-Windows machines are probably not affected. But it's always possible that I forgot to consider some case, and it's hard to test these things. Fingers crossed. |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
All to UTF-8.
GitHub appears to be quite good at guessing the encoding, so many of the diffs may look like they are not changing anything (which is a good thing).
TODO:
.h
,.cpp
in CI