String issues with lone surrogate code units #184

sbstp · 2015-04-03T04:20:47Z

This issue is caused by UTF-16, which encodes characters outside the BMP (the Basic Multilingual Plane, the first Unicode plane) using two wide chars (a surrogate pair) instead of one. Problems arise when a string is created containing a lone surrogate. For instance:

> String.fromCodePoint(0x1F0A1)[0] // [1] too
std.utf.UTFException@/usr/include/dmd/phobos/std/utf.d(1260): surrogate UTF-16 high value past end of string
----------------
higgs(pure @safe void std.array.Appender!(immutable(char)[]).Appender.put!(immutable(wchar)[]).put(immutable(wchar)[])+0x6b) [0x56ebdb]
higgs(immutable(char)[] runtime.vm.ValuePair.toString()+0x659) [0x5f5479]
higgs(void repl.repl(runtime.vm.VM)+0x15a) [0x6907aa]
higgs(_Dmain+0x835) [0x69b39d]
higgs(_D2rt6dmain211_d_run_mainUiPPaPUAAaZiZ6runAllMFZ9__lambda1MFZv+0x1f) [0x6a1ed7]
higgs(void rt.dmain2._d_run_main(int, char**, extern (C) int function(char[][])*).tryExec(scope void delegate())+0x2a) [0x6a1e2a]
higgs(void rt.dmain2._d_run_main(int, char**, extern (C) int function(char[][])*).runAll()+0x30) [0x6a1e90]
higgs(void rt.dmain2._d_run_main(int, char**, extern (C) int function(char[][])*).tryExec(scope void delegate())+0x2a) [0x6a1e2a]
higgs(_d_run_main+0x1dc) [0x6a1da4]
higgs(main+0x17) [0x69b557]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7f75c741aec5]

This can also escalate into a segfault, for example:

h> print('🂡'[0].toString())

Caught segmentation fault
IP=null
exiting

It also breaks on things like:

'🂡'.split("")
new String('🂡')

sbstp · 2015-04-03T04:26:01Z

I discussed this a couple of weeks ago on reddit. Servo's HTML parser ran into the same problem, and they created a superset of UTF-8 to handle it. A script can produce partial surrogate pairs, and that is perfectly valid.

One way of fixing this would be to disable D's checking of UTF-16 strings, maybe by using plain arrays of u16s.

sbstp · 2015-04-03T04:31:36Z

I also checked what V8 and SpiderMonkey do: they treat the two halves of the pair as distinct entities.

// Chrome:
> '🂡'.split("")
["�", "�"]
// Firefox:
> '🂡'.split("")
Array [ "�", "�" ]

maximecb · 2015-04-03T04:43:46Z

Possibly we can use plain wchar arrays. That won't solve the issue of how to display these invalid characters though.

sbstp · 2015-04-03T04:55:52Z

I tried to find out: alert('🂡'.split("")[0]) crashed Chromium, while Firefox displays a broken character.

maximecb · 2015-04-03T22:56:27Z

I think displaying a broken character is the best we can hope to do? I'm guessing we should avoid wstring and use wchar arrays when referring to raw strings from JS, and filter the strings before displaying them from D code, replacing broken characters with empty squares or something?
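To make the filtering idea concrete, here is a minimal sketch of the logic, written in JavaScript for illustration only (the real filter would live in D code). The function name replaceLoneSurrogates is hypothetical, and substituting U+FFFD (the Unicode replacement character) is an assumption; any placeholder glyph would work the same way.

```javascript
// Hypothetical helper: copy a string, keeping well-formed surrogate pairs
// and replacing every lone surrogate with U+FFFD.
function replaceLoneSurrogates(str) {
  let out = "";
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    const isHigh = unit >= 0xD800 && unit <= 0xDBFF;
    const isLow = unit >= 0xDC00 && unit <= 0xDFFF;
    if (isHigh && i + 1 < str.length) {
      const next = str.charCodeAt(i + 1);
      if (next >= 0xDC00 && next <= 0xDFFF) {
        // High surrogate followed by a low surrogate: keep the pair intact.
        out += str[i] + str[i + 1];
        i++;
        continue;
      }
    }
    // Any surrogate reaching this point has no partner.
    out += (isHigh || isLow) ? "\uFFFD" : str[i];
  }
  return out;
}
```

With this, a string like '🂡'[0] would be turned into a single replacement character before it reaches the output layer, while '🂡' itself passes through unchanged.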

sbstp · 2015-04-03T23:29:01Z

I'm not really sure what to do. It seems to be a gray area of text management. Obviously, having lone surrogates shouldn't be allowed, but the way JavaScript strings work (no char/wchar type), we don't have a choice.

The best solution might be to avoid using D's strings altogether, and create a type like jsstring. At some point though, the jsstring has to be converted from UTF-16 (maybe with lone surrogates) to some encoding. We need something that makes it easy to convert a jsstring for IO.

jsstring is basically the same as wstring, but it doesn't check for lone surrogates.
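As a rough sketch of what a conversion for IO could look like, the function below turns UTF-16 code units into bytes without rejecting lone surrogates, similar in spirit to the superset of UTF-8 mentioned above. It is written in JavaScript purely to illustrate the logic; the name encodeGeneralizedUtf8 is hypothetical, and this is not a claim about how Servo or any other project implements it.

```javascript
// Hypothetical helper: encode UTF-16 code units as UTF-8-style bytes.
// Well-formed surrogate pairs become one 4-byte sequence; a lone surrogate
// is encoded as a 3-byte sequence instead of raising an error.
function encodeGeneralizedUtf8(str) {
  const bytes = [];
  for (let i = 0; i < str.length; i++) {
    let cp = str.charCodeAt(i);
    if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < str.length) {
      const next = str.charCodeAt(i + 1);
      if (next >= 0xDC00 && next <= 0xDFFF) {
        // Combine the pair into a single supplementary code point.
        cp = 0x10000 + ((cp - 0xD800) << 10) + (next - 0xDC00);
        i++;
      }
    }
    if (cp < 0x80) {
      bytes.push(cp);
    } else if (cp < 0x800) {
      bytes.push(0xC0 | (cp >> 6), 0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
      // Lone surrogates (0xD800..0xDFFF) end up in this branch.
      bytes.push(0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F));
    } else {
      bytes.push(
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F)
      );
    }
  }
  return bytes;
}
```

Note that the bytes produced for a lone surrogate are not valid UTF-8, so output like this would only suit internal storage or consumers that know about the extension; for a terminal, replacing the lone surrogate first is probably safer.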

maximecb · 2015-04-04T00:16:22Z

Just so I'm clear (you know more about Unicode than I do), a lone surrogate is basically an extension to a char code, but missing the char code it should be attached to?

I guess the next question is, is it the creation of a wstring that chokes, or is it trying to write invalid UTF-16 data for output?

sbstp · 2015-04-04T00:44:53Z

Yea, a surrogate pair is two wchars that together represent one code point. UTF-16 uses it for all code points from 2^16 (U+10000) upward. A lone surrogate is one element of the pair separated from the other, which leaves a malformed sequence.
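For reference, the arithmetic behind a surrogate pair can be sketched as follows. This is illustrative JavaScript; toSurrogatePair and findLoneSurrogates are hypothetical names, not part of Higgs.

```javascript
// Split a code point from a supplementary plane into its UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("code point is not in a supplementary plane");
  }
  const offset = codePoint - 0x10000;      // 20 bits remain
  const high = 0xD800 + (offset >> 10);    // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);   // bottom 10 bits
  return [high, low];
}

const isHighSurrogate = (u) => u >= 0xD800 && u <= 0xDBFF;
const isLowSurrogate = (u) => u >= 0xDC00 && u <= 0xDFFF;

// Return the indices of code units that are surrogates without a partner.
function findLoneSurrogates(str) {
  const lone = [];
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    if (isHighSurrogate(unit) && i + 1 < str.length &&
        isLowSurrogate(str.charCodeAt(i + 1))) {
      i++; // well-formed pair, skip the low half
    } else if (isHighSurrogate(unit) || isLowSurrogate(unit)) {
      lone.push(i);
    }
  }
  return lone;
}
```

For U+1F0A1 this gives the pair 0xD83C, 0xDCA1, so '🂡'[0] is the lone high surrogate 0xD83C.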

I just tried creating a lone surrogate without printing it, and no error occurred. So it seems to be in the IO layer.

maximecb · 2015-04-11T05:56:23Z

Sorry that I haven't taken care of this yet. I've been very busy with all sorts of things that are more directly related to my thesis work. Would you care to take a stab at a fix? Bounce some more specific ideas for a fix? I think the main target for the fix should be the output. We should identify specifically which D function chokes on the invalid string.
