Allow Unicode code points as chars #2024
Labels: Ren.important, Status.important, Type.Unicode, Type.wish, Waiting for future
Submitted by: Ladislav
Currently, only BMP code points (U+0000 through U+FFFF) are supported by the CHAR! datatype; code points from the supplementary planes cannot be represented as a CHAR!.
Imported from: CureCode [ Version: r3 master Type: Wish Platform: All Category: Datatype Reproduce: Always Fixed-in: none ]
Imported from: metaeducation#2024
Comments:
Submitted by: BrianH
#683 is related, though I would prefer to support such characters rather than just trigger an error. Which method would you prefer for encoding them?
One model is to start with UCS1 (aka Latin-1), then upgrade to UCS2 when characters not in UCS1 are added, then to UCS4 when characters not in UCS2 are added. Strings wouldn't downgrade to a lower UCS automatically; we would need a function to do that on demand. This model has O(1) access for indexed operations and LENGTH?. It is what Python 3 and Red have done.
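A minimal sketch of the widening model in C (all names here are hypothetical illustrations, not the actual R3 series code): indexed access stays O(1) at every width, and widening is a one-way copy.

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint8_t width;   /* bytes per element: 1 (UCS1), 2 (UCS2), or 4 (UCS4) */
    size_t  length;  /* number of codepoints, independent of width */
    void   *data;
} Str;

/* O(1) indexed read at any internal width. */
static uint32_t str_get(const Str *s, size_t i) {
    switch (s->width) {
    case 1:  return ((const uint8_t  *)s->data)[i];
    case 2:  return ((const uint16_t *)s->data)[i];
    default: return ((const uint32_t *)s->data)[i];
    }
}

/* Widen when a new codepoint doesn't fit the current width.
 * Strings never narrow automatically; that would be on demand. */
static void str_widen(Str *s, uint8_t new_width) {
    if (new_width <= s->width)
        return;
    void *wide = malloc(s->length * new_width);
    for (size_t i = 0; i < s->length; i++) {
        uint32_t c = str_get(s, i);
        if (new_width == 2)
            ((uint16_t *)wide)[i] = (uint16_t)c;
        else
            ((uint32_t *)wide)[i] = c;
    }
    free(s->data);
    s->data = wide;
    s->width = new_width;
}
```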
Another model would be to use UTF-8 or UTF-16 internally, depending on what the platform supports: Windows would be UTF-16, Linux would be UTF-8. This would have lower memory usage when higher codepoints are used, but indexed operations and LENGTH? would be O(n). Many other programming languages have done this; notably, .NET and Java use UTF-16, and Go uses UTF-8. Note that strings would be series of codepoints, so all UTF encoding and decoding would need to be done internally, and no partial encodings (surrogate pairs or individual UTF-8 bytes) would be exposed.
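For contrast, a small sketch of why LENGTH? becomes O(n) under a UTF-8 internal encoding (a hypothetical helper, not an actual R3 function): counting codepoints requires scanning the whole string.

```c
#include <stddef.h>
#include <stdint.h>

/* Every byte that is not a 10xxxxxx continuation byte starts a new
 * codepoint, so the codepoint count needs a full pass over the bytes. */
static size_t utf8_length(const uint8_t *s, size_t nbytes) {
    size_t count = 0;
    for (size_t i = 0; i < nbytes; i++) {
        if ((s[i] & 0xC0) != 0x80)  /* skip continuation bytes */
            count++;
    }
    return count;
}
```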
Submitted by: BrianH
That was the plan, but he hasn't actually implemented it yet.
Much of the string/binary series code can already handle 1-byte and 2-byte element sizes in the macros, so we'd need to add support for 4-byte elements. The autoconversion code isn't written yet. Most code that refers to individual characters uses the REBUNI type, which is 16-bit, even though Unicode codepoints should be 32-bit when held in individual variables rather than arrays. There is no 32-bit Unicode character type like REBUNI defined, so code that works on full codepoints tends to repurpose other defined types, and almost always mixes signed and unsigned types, especially by using pointers of one to refer to the other. Finally, I have asked whether we should be using signed or unsigned 32-bit values to store codepoints, but no one I've asked has been able to answer that question yet.
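As a hypothetical illustration of the missing type (REBUCS2 and REBUCS1 are names used later in this thread; REBUCS4 is invented here), the three element widths could be pinned down like this, with unsigned types to avoid the signed/unsigned mixing described above:

```c
#include <stdint.h>

/* Unsigned matches the existing 8- and 16-bit character convention. */
typedef uint8_t  REBUCS1;  /* Latin-1 element */
typedef uint16_t REBUCS2;  /* BMP element (what REBUNI is today) */
typedef uint32_t REBUCS4;  /* full Unicode codepoint in a variable */
```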
To implement the UCS switching model:
In comparison, to implement the UTF model:
Either way, that is a lot of work, but it's doable once we put our open-source many-hands effort to it. Fortunately we already include Unicode Consortium code, so we have some code to adapt that can do almost everything mentioned above.
Submitted by: Ladislav
I slightly prefer unsigned.
Submitted by: BrianH
The main problem I ran into is that I didn't understand the hash calculation. Does it need unsigned integers or not? Does it need to be adjusted to be able to handle 32-bit values? The Unicode Consortium code doesn't have any hash code, so I can't just go off of its preferences.
Not knowing any better, I also slightly prefer unsigned, because that is the convention for 8- and 16-bit characters. The only downside is that some functions return -1 to indicate an error, with all non-negative values considered non-erroneous, and you can't do that with an unsigned datatype. This could be solved by designating another value as the error indicator, or by using a signed return type for those functions and converting to an unsigned value after screening out the negative error values.
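A sketch of that second option (the decoder here is a stub that only handles the ASCII range, for illustration; it is not an actual R3 function):

```c
#include <stdint.h>

/* Stub decoder: a real one would also handle multi-byte UTF-8
 * sequences. Returns the codepoint, or -1 for bad data. */
static int32_t decode_next(const uint8_t **pos, const uint8_t *end) {
    if (*pos >= end)
        return -1;
    uint8_t b = *(*pos)++;
    return b < 0x80 ? (int32_t)b : -1;
}

/* Screen the negative error values while the result is still signed,
 * then convert to unsigned for storage. */
static int process(const uint8_t *pos, const uint8_t *end) {
    while (pos < end) {
        int32_t c = decode_next(&pos, end);
        if (c < 0)
            return -1;                      /* propagate the error */
        uint32_t codepoint = (uint32_t)c;   /* safe: c >= 0 here */
        (void)codepoint;                    /* ... store/use it ... */
    }
    return 0;
}
```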
Submitted by: abolka
Preferences:
One remark: the "UCS-1" to UCS-2 widening was not only planned, it is already implemented and used (in R3). I think that suggests continuing with this model and providing full Unicode support via widening to UCS-4, internally.
Brian's implementation plan looks very solid to me and nicely summarises many related issues.
What hash calculation are you concerned about? The one used for computing internal hash values?
Submitted by: BrianH
Yes, that hash calculation.
After looking it up, UCS-4 is defined as covering the non-negative portion of a 32-bit signed integer's range. In practice, valid codepoints stop well short of that, at U+10FFFF. So it would be OK for REBUNI to be either a signed or an unsigned value, though since REBUCS2 and REBUCS1 should still be unsigned to cover their respective ranges, REBUNI should be unsigned too. However, internal functions like the UTF-8 decoder that return a negative number for bad data or other errors can probably get away with returning a signed 32-bit integer, as long as they don't do so through a pointer (some functions have this problem) and don't use REBCNT as the type, in case that type changes for 64-bit builds. Negative characters should never be put in a string, though; they're out of range.
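A sketch of the range check this implies (a hypothetical helper): valid codepoints stop at U+10FFFF, and the UTF-16 surrogate range is excluded as well, consistent with the earlier note that partial encodings would never be exposed.

```c
#include <stdbool.h>
#include <stdint.h>

/* Valid Unicode codepoints run from U+0000 to U+10FFFF, and the
 * surrogate range U+D800..U+DFFF is reserved for UTF-16; anything
 * outside that should never be put in a string. */
static bool valid_codepoint(uint32_t c) {
    if (c > 0x10FFFF)
        return false;
    if (c >= 0xD800 && c <= 0xDFFF)
        return false;
    return true;
}
```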
Weird, Wikipedia seems to have been changed since I last looked at its Unicode stuff. It now refers to UCS2 as UCS-2 and UCS4 as UCS-4 (the hyphens used to be specifically disallowed), no longer mentions UCS1 as a synonym for Latin-1, and scarcely mentions the UCS* encodings at all anymore, preferring to talk about the UTF-* stuff.
Given the compatibility between ASCII and the ASCII range of UTF-8, it might be a good idea to just use the byte-element strings for characters in the ASCII range. That would let us use UTF-8 (Linux) or ANSI (Windows) APIs with no conversion necessary in that mode. Similarly, we can pass UCS-2 mode strings to UTF-16 APIs with no conversion needed (on Windows, OSX?, .NET and Java). We would only consistently need to convert the UCS-4/UTF-32 mode strings to call string APIs, since no one uses UTF-32 APIs.
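A sketch of the zero-conversion check that idea implies (hypothetical helper): a byte-element string whose bytes are all below 0x80 is simultaneously valid ASCII and valid UTF-8, so it could be handed to a UTF-8 or ANSI API as-is.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if the byte string stays in the ASCII range and therefore
 * needs no conversion before calling a UTF-8 (or ANSI) API. */
static bool is_all_ascii(const uint8_t *s, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (s[i] >= 0x80)
            return false;
    }
    return true;
}
```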
The plan in Ren-C is more radical than the above but hopefully, when all is said and done, simpler: always keep strings encoded in UTF-8, and convert them only at the edges that require it (e.g. the Windows print and input devices would need to do this):
http://utf8everywhere.org/
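As a rough sketch of what "convert only at the edges" could look like on Windows (an illustration using the Win32 wide-console API, not actual Ren-C code; error handling omitted for brevity):

```c
#ifdef _WIN32
#include <windows.h>

/* Strings stay UTF-8 internally; only the console edge converts to
 * UTF-16 just before calling the wide API. */
static void print_utf8(const char *utf8, int nbytes) {
    int nwide = MultiByteToWideChar(CP_UTF8, 0, utf8, nbytes, NULL, 0);
    WCHAR *wide = (WCHAR *)HeapAlloc(GetProcessHeap(), 0,
                                     nwide * sizeof(WCHAR));
    MultiByteToWideChar(CP_UTF8, 0, utf8, nbytes, wide, nwide);

    DWORD written;
    WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE), wide, nwide,
                  &written, NULL);
    HeapFree(GetProcessHeap(), 0, wide);
}
#endif
```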