String issues with lone surrogate code units #184
Comments
I discussed this a couple of weeks ago on Reddit. Servo's HTML parser had to deal with this, and they created a superset of One way of fixing this would be to disable D's checking of UTF-16 strings, maybe by using plain arrays of
I also checked what V8 and SpiderMonkey do: they treat it as two distinct entities.
Possibly we can use plain wchar arrays. That won't solve the issue of how to display these invalid characters, though.
I tried to find out,
I think displaying a broken character is the best we can hope to do. I'm guessing we should avoid using wstring and use wchar when referring to raw strings from JS, and filter the strings before displaying them from D code, replacing broken characters with empty squares or something similar?
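To make the filtering idea above concrete, here is a minimal sketch of replacing lone surrogates before output. It is written in JavaScript only for illustration; the function name `replaceLoneSurrogates` is hypothetical and is not part of Higgs, and the choice of U+FFFD (the Unicode replacement character) as the stand-in for "empty squares" is an assumption.

```javascript
// Hypothetical helper, not part of Higgs: returns a copy of `str` in which
// every lone surrogate code unit is replaced by U+FFFD, so the result is
// well-formed UTF-16. Valid surrogate pairs are copied through unchanged.
function replaceLoneSurrogates(str) {
    let out = "";
    for (let i = 0; i < str.length; i++) {
        const unit = str.charCodeAt(i);
        const isHigh = unit >= 0xD800 && unit <= 0xDBFF;
        const isLow = unit >= 0xDC00 && unit <= 0xDFFF;

        if (isHigh && i + 1 < str.length) {
            const next = str.charCodeAt(i + 1);
            if (next >= 0xDC00 && next <= 0xDFFF) {
                // Valid pair: keep both code units and skip the low half.
                out += str[i] + str[i + 1];
                i++;
                continue;
            }
        }

        // Any surrogate reaching this point is unpaired.
        out += (isHigh || isLow) ? "\uFFFD" : str[i];
    }
    return out;
}
```

The same scan would apply to a plain array of 16-bit code units on the D side: only the output path needs the filtered copy, so the raw JS string can stay untouched.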
I'm not really sure what to do. It seems to be a gray area of text management. Obviously, having lone surrogates shouldn't be allowed, but the way JavaScript strings work (no The best solution might be to avoid using D's strings altogether, and create a type like
Just so I'm clear (you know more about Unicode than I do), a lone surrogate is basically an extension to a char code, but missing the char code it should be attached to? I guess the next question is, is it the creation of a wstring that chokes, or is it trying to write invalid UTF-16 data for output?
Yeah, a surrogate pair is two I just tried creating a lone surrogate without printing it, and no error occurred. So it seems to be in the IO layer.
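For readers following along, the arithmetic behind a surrogate pair can be sketched as below. This is the standard UTF-16 encoding of a code point above U+FFFF; the function names `toSurrogatePair` and `fromSurrogatePair` are hypothetical and exist only for this illustration.

```javascript
// Split a code point in the range U+10000..U+10FFFF into its two UTF-16
// code units: a high surrogate (0xD800..0xDBFF) followed by a low
// surrogate (0xDC00..0xDFFF).
function toSurrogatePair(codePoint) {
    const offset = codePoint - 0x10000;      // 20-bit value
    const high = 0xD800 + (offset >> 10);    // top 10 bits
    const low = 0xDC00 + (offset & 0x3FF);   // bottom 10 bits
    return [high, low];
}

// Combine a high and a low surrogate back into a single code point.
function fromSurrogatePair(high, low) {
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
}
```

For U+1F0A1 this gives the units 0xD83C and 0xDCA1. Either unit appearing without its partner is a lone surrogate: a valid 16-bit value, but not valid UTF-16 text.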
Sorry that I haven't taken care of this yet. I've been very busy with all sorts of things that are more directly related to my thesis work. Would you care to take a stab at a fix? Bounce some more specific ideas for a fix? I think the main target for the fix should be the output. We should identify specifically which D function chokes on the invalid string.
This issue is caused by UTF-16. It encodes characters that are not on the BMP (the first Unicode plane) using two wide chars (a surrogate pair) instead of one. Issues arise when a string is created containing a lone surrogate. For instance:
This can also escalate into a segfault, for example:
It also breaks on things like:
'🂡'.split("")
new String('🂡')
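As a rough illustration of why an operation like split("") is involved, here is what a standards-conforming JavaScript engine is expected to produce. This is a sketch, not output captured from Higgs; the helper `hasLoneSurrogate` is hypothetical, and the character is written as the escape \u{1F0A1} on the assumption that this is the card character used above.

```javascript
// One code point outside the BMP, stored as two UTF-16 code units.
const card = "\u{1F0A1}";

// split("") cuts between code units, so each piece is a lone surrogate.
const units = card.split("");

// Array.from iterates by code point and keeps the pair together.
const points = Array.from(card);

// Hypothetical helper: with the `u` flag a regex matches whole code points,
// so this character class only matches surrogates that are not part of a
// valid pair.
function hasLoneSurrogate(str) {
    return /[\uD800-\uDFFF]/u.test(str);
}

console.log(units.length, points.length);
console.log(units.map(hasLoneSurrogate), hasLoneSurrogate(card));
```

Each element of `units` is exactly the kind of string that has to survive being stored and printed without the VM failing.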