Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

String issues with lone surrogate code units #184

Open
sbstp opened this issue Apr 3, 2015 · 9 comments
Open

String issues with lone surrogate code units #184

sbstp opened this issue Apr 3, 2015 · 9 comments

Comments

@sbstp
Copy link
Member

sbstp commented Apr 3, 2015

This issue is caused by UTF-16. It encodes characters that are not on the BMP (first unicode plane) using two wide chars (surrogate pair), instead of one. Issues stem when a string is created containing a lone surrogate. For instance:

> String.fromCodePoint(0x1F0A1)[0] // [1] too
std.utf.UTFException@/usr/include/dmd/phobos/std/utf.d(1260): surrogate UTF-16 high value past end of string
----------------
higgs(pure @safe void std.array.Appender!(immutable(char)[]).Appender.put!(immutable(wchar)[]).put(immutable(wchar)[])+0x6b) [0x56ebdb]
higgs(immutable(char)[] runtime.vm.ValuePair.toString()+0x659) [0x5f5479]
higgs(void repl.repl(runtime.vm.VM)+0x15a) [0x6907aa]
higgs(_Dmain+0x835) [0x69b39d]
higgs(_D2rt6dmain211_d_run_mainUiPPaPUAAaZiZ6runAllMFZ9__lambda1MFZv+0x1f) [0x6a1ed7]
higgs(void rt.dmain2._d_run_main(int, char**, extern (C) int function(char[][])*).tryExec(scope void delegate())+0x2a) [0x6a1e2a]
higgs(void rt.dmain2._d_run_main(int, char**, extern (C) int function(char[][])*).runAll()+0x30) [0x6a1e90]
higgs(void rt.dmain2._d_run_main(int, char**, extern (C) int function(char[][])*).tryExec(scope void delegate())+0x2a) [0x6a1e2a]
higgs(_d_run_main+0x1dc) [0x6a1da4]
higgs(main+0x17) [0x69b557]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7f75c741aec5]

This can also escalate into a segfault, for example:

h> print('🂡'[0].toString())

Caught segmentation fault
IP=null
exiting

It also breaks on things like:

  • '🂡'.split("")
  • new String('🂡')
@sbstp
Copy link
Member Author

sbstp commented Apr 3, 2015

I discussed this a couple of weeks ago on reddit. Servo's html parser had to deal with this, and they created a superset of UTF-8 to deal with it. Someone could write partial surrogate pairs and it's perfectly valid.

One way of fixing this would be to disable D's checking of UTF-16 strings, maybe by using plain arrays of u16s.

@sbstp
Copy link
Member Author

sbstp commented Apr 3, 2015

I also checked what V8 and SpiderMonkey do, they treat it as two distinct entities.

// Chrome:
> '🂡'.split("")
["�", "�"]
// Firefox:
> '🂡'.split("")
Array [ "�", "�" ]

@maximecb
Copy link
Contributor

maximecb commented Apr 3, 2015

Possibly we can use plain wchar arrays. That won't solve the issue of how to display these invalid characters though.

@sbstp
Copy link
Member Author

sbstp commented Apr 3, 2015

I tried to find out, alert('🂡'.split("")[0]) crashed Chromium, and Firefox displays a broken character.

@maximecb
Copy link
Contributor

maximecb commented Apr 3, 2015

I think displaying a broken character is the best we can hope to do? I'm guessing we should avoid using wstring and use wchar when referring to raw strings from JS, and filter the strings before displaying them from D code, to replace broken characters with empty squares or something?

@sbstp
Copy link
Member Author

sbstp commented Apr 3, 2015

I'm not really sure what to do. It seems to be a gray area of text management. Obviously, having lone surrogates shouldn't be allowed, but the way JavaScript strings work (no char/wchar type), we don't have a choice.

The best solution might be to avoid using D's strings altogether, and create a type like jsstring. At some point though, the jsstring has to be converted from UTF-16 (maybe with lone surrogates) to some encoding. We need something that makes it easy to convert a jsstring for IO.

jsstring is basically the same as wstring, but it doesn't check for lone surrogates.

@maximecb
Copy link
Contributor

maximecb commented Apr 4, 2015

Just so I'm clear (you know more about unicode than I do), a lone surrogate is basically an extension to a char code, but missing the char code it should be attached to?

I guess the next question is, is it the creation of a wstring that chokes, or is it trying to write invalid UTF-16 data for output?

@sbstp
Copy link
Member Author

sbstp commented Apr 4, 2015

Yea, a surrogate pair is two wchar that represent one code point. It's used for all code points over 2^16 in UTF-16. A lone surrogate is one element of the pair separated from the other, creating a malformed code point.

I just tried creating a lone surrogate without printing it, and no error occurred. So it seems to be in the IO layer.

@maximecb
Copy link
Contributor

Sorry that I haven't taken care of this yet. I've been very busy with all sorts of things that are more directly related to my thesis work. Would you care to take a stab at a fix? Bounce some more specific ideas for a fix? I think the main target for the fix should be the output. We should identify specifically which D function chokes on the invalid string.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants