Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unicode characters are not recognised as alphanumeric on some versions of macOS #746

Closed
mbonani opened this issue Dec 12, 2017 · 12 comments
Closed

Comments

@mbonani
Copy link
Member

mbonani commented Dec 12, 2017

aseba on windows (probably linux) accept in the code accentuate character.
on macosx it make error " indentifier must begin with _ or an alpha numeric character, found unicode character instead"
In the description of the language it is normally not permit. But generated code coming from blockly could have some. We should be consistent.

@stephanemagnenat
Copy link
Member

stephanemagnenat commented Dec 13, 2017

Yes unicode identifiers should be supported, so this is a bug. On my mac (High Sierra) with my compiler and Qt from MacPorts, unicode works. My locale is:

LANG=
LC_COLLATE="C"
LC_CTYPE="UTF-8"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL=

However, with the version from the build server, it does not. So the problem is probably linked to the default setup in the C library or the Qt version used to compile Aseba from the build server.

@stephanemagnenat stephanemagnenat added this to the Version 1.6 milestone Dec 13, 2017
@stephanemagnenat stephanemagnenat changed the title caractarer like é ü are not accept to compile in macosx Unicode characters are not recognised as alphanumeric on some versions of macOS Dec 13, 2017
@stephanemagnenat
Copy link
Member

stephanemagnenat commented Dec 14, 2017

It seems some locales do not consider unicode characters to be alphanumeric. According to cpp reference.com, it could be forced. Maybe forcing the locale to "UTF8" could help.

Hence, a potential hack is to do:

std::locale utf8Locale("UTF8");
std::isalpha(c, utf8Locale);

in the Aseba compiler.

@stephanemagnenat
Copy link
Member

Do we move that to 1.6.1?

@mbonani
Copy link
Member Author

mbonani commented Jan 9, 2018

yes

@ypiguet-epfl
Copy link
Collaborator

The error message has typos which make it difficult to find in the source code. It's Identifiers must begin with _ or an alphanumeric character, found unicode character 0x?? instead where 0x?? stands for the hexadecimal unicode code point for the Basic Multilingual Plane (up to U+FFFF), something else beyond. It's for error ERROR_INVALID_IDENTIFIER thrown by method Aseba::Compiler::tokenize in aseba/compiler/lexer.cpp.

Suggestion: write your own unicode handling code and don't depend on Qt, c++ standard libraries, the OS, user settings, or any short-term hack.

@cor3ntin
Copy link
Contributor

The underlying issue have been found, it's indeed because the tool rely on the underlying environment settings ( which it should not) and for some reason on OSX that environment may not be utf8.
Expect a fix in the coming weeks.

However, we should definitively not reinvent the wheel. Why locale support in the STL is somewhat lacking, it's top of the class in Qt, when used properly.

@ypiguet-epfl
Copy link
Collaborator

What about tools which use the compiler without a GUI, such as the switch? In mine, when compiled on macOS, I get the same error:

./vmshell --dis --src -
var é
Error at Line: 1 Col: 5 : Identifiers must begin with _ or an alphanumeric character, found unicode character 0xe9 instead
Compilation error.

Unicode character categories and utf-8 aren't rocket science. A platform-independent, Qt-independent, definitive solution would be nice.

@cor3ntin
Copy link
Contributor

Unicode is anything but simple, especially if we don't want a dependency on icu ( which Qt has on most systems).
For this particular issue It should be doable to have a reasonably well behaving solution, don't worry.

@ypiguet-epfl
Copy link
Collaborator

While the whole unicode standard is large and prone to bugs, that isn't needed by the compiler or any Aseba non-graphical tool. Classifying code points between what's valid as first character in an identifier, what's valid as remaining characters in an identifier, and everything else would be enough. If by "unicode is anything but simple" you mean the text conversion to canonical forms, it's uncommon: Javascript explicitly doesn't do it for identifier matching, for instance. I'm not impressed by Qt unicode support, which fails to display reliably "e acute" with a combining acute accent (U+0065 U+0301) in Studio 1.6 on Mac.

I don't worry, I just hope there won't be dependencies difficult to fulfill or huge for the simple tools.

@cor3ntin
Copy link
Contributor

See 09835e6#diff-0e1c4b3b848444324ad5dc75a312633cR195 for a temporary fix and a detailed explanation.

Nothing is simple if you want to do things properly, and I'm not a huge fan of half-backed, wheel reinventing solutions.

@cor3ntin
Copy link
Contributor

cor3ntin commented Mar 7, 2018

It has been merge in master.
A PR for 1.6 has been made too #849

@cor3ntin
Copy link
Contributor

cor3ntin commented Mar 7, 2018

Fixed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants