comp_ui.NiceDisplay should take into account unicode character width #269

andychu · 2019-03-24T04:15:38Z

@jyn514 I'm pretty sure this is the issue you just mentioned in #257. There are two separate bugs:

1. If you keep typing past the end of the terminal, OSH has no idea (but zsh and fish do). This is #257 (Detect when the user types past the last column in the terminal).
2. If your prompt contains a unicode character that is, say, 4 bytes long in utf-8, the code will currently move right FOUR cursor positions instead of ONE. I think that explains the screenshot, since you have a unicode character there.
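
To illustrate the second bug, here is a minimal sketch of the gap between byte count and column count. It is only an approximation built on Python 3's `unicodedata` module, not libc's wcwidth(); the function name `approx_display_width` and the sample prompt are made up for the example, and control characters are not handled.

```python
import unicodedata

def approx_display_width(text: str) -> int:
    """Rough column count for text. An approximation, not wcwidth()."""
    width = 0
    for ch in text:
        # Combining marks and format characters occupy no column of their own.
        if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
            continue
        # Wide and fullwidth East Asian characters occupy two columns.
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            width += 2
        else:
            width += 1
    return width

# Hypothetical prompt: U+00E9 (2 bytes), U+4E2D (3 bytes), '$', ' '
prompt = "\u00e9\u4e2d$ "
num_bytes = len(prompt.encode("utf-8"))
num_columns = approx_display_width(prompt)
print(num_bytes, num_columns)
```

Moving the cursor by `num_bytes` instead of `num_columns` is the kind of overshoot described in the second item.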

I think we can just call the POSIX function wcwidth(), although I guess that involves some conversion to wchar_t, which is annoying:

http://man7.org/linux/man-pages/man3/wcwidth.3.html

There is a pure Python version of it, but I'm not sure if we should use it.

https://pypi.org/project/wcwidth/#files


andychu · 2019-03-24T04:46:40Z

It looks like wchar_t might introduce some portability problems? Not sure how much of an issue this is in practice.

https://en.wikipedia.org/wiki/Wide_character#C/C++

Is there a version of wcwidth() that works on utf-8 data?

Note: this function might be useful for short-circuiting?

http://man7.org/linux/man-pages/man3/iswprint.3.html

andychu · 2019-03-24T04:51:26Z

Fairly related, since OSH essentially does this!

https://www.openbsd.org/papers/eurobsdcon2016-utf8.pdf

The OpenBSD base system supports exactly two LC_CTYPE locales:

UTF-8
C=POSIX = US-ASCII

We don’t even support ISO-LATIN-1 any longer in the base system.

Apparently there is mbtowc(3), but it's not multi-threaded! Gah... author recommends not using the reentrant versions?

andychu · 2019-03-24T05:00:22Z

Here's an interesting alternative:

https://unix.stackexchange.com/questions/245013/get-the-display-width-of-a-string-of-characters

We could just print the prompt, and query cursor positions before and after! Then we know exactly how many displayable characters it is. Downsides?

Querying the cursor position is probably also necessary for #257, for proper wrapping.

jyn514 · 2019-03-24T21:57:29Z

Why do we need multithreading? Shouldn't we have at most one main thread at a time?

andychu · 2019-03-24T22:29:21Z

Yeah we don't really need multi-threading in the near term, but functions that use globals are a "smell" to me ... it's not what I'm used to.

Some people have asked for Oil to be a Lua-like library, in which case it should use no globals. This would be nice in theory, but it's way down on the priority list now.

It's possible that the prompt could use globals but it's separate from the OSH interpreter as a library, though.

Although querying the cursor position might seem like a hack, I think it might be the right thing for the case of the prompt, if not in general.

jyn514 · 2019-06-22T18:23:38Z

Since this is the last blocker keeping me from using osh as my full-time shell, I think I'll take a look at it.

Unfortunately I don't think this is going to play nicely with readline escape codes (https://superuser.com/questions/301353/escape-non-printing-characters-in-a-function-for-a-bash-prompt). The python wcwidth library's wcswidth() returns -1 when it encounters non-printing characters, and its wcwidth() only accepts a single code point, so it fails on ▶️ (which is two). That means we can't just modify the existing loop in core/comp_ui.py:

$ printf '(\x01\x1b[0;33m\x02osh\x01\x1b[0;0m\x02) ' | python -c 'import wcwidth; print(wcwidth.wcswidth(input()))'
-1
$ printf '▶️' | python -c 'import wcwidth; print(wcwidth.wcwidth(input()))'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/joshua/.local/lib/python3.6/site-packages/wcwidth/wcwidth.py", line 157, in wcwidth
    ucs = ord(wc)
TypeError: ord() expected a character, but string of length 2 found

I think your idea of querying readline for the cursor position is the best way to do this without becoming platform dependent, but I'm not sure how to do that ...
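
One possible way around the -1 result is to drop the regions bracketed by \x01 and \x02 before measuring, since those markers are how the prompt flags non-printing sequences. This is a minimal sketch, not the code in core/comp_ui.py; the helper name `visible_prompt` is made up for the example.

```python
import re

# \x01 ... \x02 brackets a non-printing sequence in a readline-style prompt.
_NON_PRINTING = re.compile(r"\x01[^\x02]*\x02")

def visible_prompt(prompt: str) -> str:
    """Return the prompt with every \\x01...\\x02 region removed."""
    return _NON_PRINTING.sub("", prompt)

# The same prompt as in the printf example above.
prompt = "(\x01\x1b[0;33m\x02osh\x01\x1b[0;0m\x02) "
visible = visible_prompt(prompt)
print(repr(visible))
```

Whatever width function is used would then only see the visible text.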

jyn514 · 2019-06-22T18:29:58Z

It looks like line_span is an offender as well: I see a lot of calls to arena.AddLineSpan that use len(line), but len counts the number of bytes, not their width:

$ python -c 'print(len("▶️"))'
2
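
For reference, the 2 above looks like a Python 3 result, where len() counts code points. A byte string, which is what OSH sees under Python 2, reports 6 for the same text. Neither number is the display width. A small sketch in Python 3, with the character written as escapes:

```python
s = "\u25b6\ufe0f"  # U+25B6 followed by the variation selector U+FE0F

code_points = len(s)                  # what Python 3's len() reports on str
utf8_bytes = len(s.encode("utf-8"))   # what len() reports on the encoded bytes

print(code_points, utf8_bytes)
```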

andychu · 2019-06-22T18:37:38Z

@jyn514 I heard of that ability here:

http://ballingt.com/rich-terminal-applications-2/

Finding this is a bit tricky. We can get at it by querying the terminal for the cursor’s position with \x1b[6n, but it’s a bit of a hassle because the response comes back from the terminal on stdin and needs to be distinguished from user input.
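
A rough sketch of the parsing side of that approach: the terminal answers \x1b[6n with a report of the form ESC [ row ; col R, which has to be picked out of whatever else arrived on stdin. The function names here are made up, no terminal I/O is done, and the width calculation assumes the prompt did not wrap.

```python
import re

CPR_REQUEST = b"\x1b[6n"  # written to the terminal to ask for the cursor position
_CPR_REPLY = re.compile(rb"\x1b\[(\d+);(\d+)R")

def parse_cursor_report(data: bytes):
    """Return (row, col, rest) for the first report in data, or None.

    rest is data with the report removed, i.e. the bytes the user typed.
    """
    m = _CPR_REPLY.search(data)
    if m is None:
        return None
    rest = data[:m.start()] + data[m.end():]
    return int(m.group(1)), int(m.group(2)), rest

def columns_advanced(col_before: int, col_after: int) -> int:
    """Columns the prompt occupied, assuming it stayed on one line."""
    return col_after - col_before

# The report arrives interleaved with user input on stdin.
reply = parse_cursor_report(b"ls\x1b[12;7R -l")
print(reply)
print(columns_advanced(1, 7))
```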

(BTW there are several blog posts on that site about terminals worth reading)

I'm not sure if this is the best way to do it, since I didn't try it. Other shells seem to use wcwidth.

Can we just write our own wcwidth wrapper? I think wcwidth.py is pure Python.

But we could put a wrapper in native/libc.c.

Oh I remember the issue is that we don't use wchar_t in OSH. We use utf-8 encoded strings. But I think that is solvable? let me think...

andychu · 2019-06-22T18:40:25Z

Oh sorry I didn't read back on this thread... that was already discussed.

I'm fine with using mbtowc(3) if it works, despite the globals...

Also the stackexchange link above talks about the cursor position.

I think the "safer" thing to do is probably to try wcwidth... But either one seems doable with enough effort (?)

andychu · 2019-06-22T18:49:29Z

Oh I guess a gotcha for mbtowc is what happens when the locale isn't utf-8 or c-posix ?

The complication is that OSH is basically following the OpenBSD utf-8 philosophy, whereas most shells use fixed width representations in memory.

(However, they do it very badly.

Take a look at how bash's length operator behaves... It basically gives you garbage when there are encoding errors. The number of characters it reports is not a monotonically increasing function of the number of bytes!!!

https://github.com/oilshell/oil/blob/master/spec/var-op-len.test.sh#L18
)

So OSH is somewhat of an outlier, but it also has more of a chance of being correct.

andychu · 2019-06-22T19:00:19Z

FWIW if you're not familiar with these Unicode issues (I wasn't until recently and still have more to learn), I just updated this section of the FAQ, and it has some relevant links:

http://www.oilshell.org/blog/2018/03/04.html#faq

http://lucumr.pocoo.org/2014/1/9/ucs-vs-utf8/

We can chat about it on Zulip too

jyn514 · 2019-06-22T19:59:57Z

Ok I have this working using the wcwidth python library. How do you like to add libraries to the project? I know it gets compiled so we can't use pip, but it's short enough to copy the whole thing into pylib if we need to.

Addresses oils-for-unix#269. Uses the [wcwidth](https://pypi.org/project/wcwidth/) python library to find the display width of a character.

andychu · 2019-06-22T20:34:12Z

The problem is that we don't have Python's unicode types in the release build! It will work in the dev build because that's running with a plain Python interpreter, e.g. /usr/bin/python2.

I shaved the Python interpreter down to this:

http://www.oilshell.org/release/0.6.pre22/metrics.wwz/line-counts/nativedeps.txt

e.g. there's a stringobject.c but no unicodeobject.c

This blog post talks about that:

http://www.oilshell.org/blog/2018/11/15.html

and the conclusion is

Oil is now more like a C program, and less like a Python program.

And this is not just for vanity / minimalism. Speed is probably the top problem right now! Python is too slow, e.g. the benchmarks show that running a configure script is 6x slower :-(

http://www.oilshell.org/release/0.6.pre22/benchmarks.wwz/osh-runtime/

andychu · 2019-06-22T20:36:44Z

But it's good that this works for your use case! That means we can just write libc.wcwidth(), which does utf-8 decoding from a Python string to wchar_t, and then calls libc's wcwidth function.

The way I would start is to copy your prompt into native/libc_test.py, and then iterate on libc.wcwidth(myprompt) until it matches what the Python version does:

This command can quickly iterate on it. (Although you might want to re-enable stdout; it's piped to /dev/null)

build/dev.sh pylibc

I would love that as a contribution because I bet a lot of other people have unicode prompts too!

andychu · 2019-06-22T20:38:30Z

FWIW using the Python/C API should be pretty straightforward for this case (although I find it hard in general)

The libc.fnmatch() function already takes a string and returns a boolean. Taking a string and returning an integer is pretty much the same, so that could be used as a template for libc.wcwidth().

https://github.com/oilshell/oil/blob/master/native/libc.c#L396

jyn514 · 2019-06-22T20:49:54Z

The tricky thing here is wcwidth assumes you have a valid unicode character already. I'm not sure how to parse ' \xe2\x96\xb6\xef\xb8\x8f ' into the 4 unicode characters u' \u25b6\ufe0f '

andychu · 2019-06-22T20:51:44Z

If I understand correctly, that's what mbtowc does.

So the Python version of wcwidth is basically just a wrapper that calls mbtowc and then wcwidth. It should do no work on its own. Basically just error checking and so forth.

jyn514 · 2019-06-22T21:57:28Z

I'm kind of stumped. I wrote a wrapper that seems right to me, but it keeps trying to decode the arguments as ASCII and I'm not sure why.

static PyObject *
func_wcswidth(PyObject *self, PyObject *args){
    const char *string;
    if (!PyArg_ParseTuple(args, "es", "utf-8", &string)) {
        return NULL;
    }
    printf("made it past argparsing, string is %s\n", string);
    int len = mbstowcs(NULL, string, 0);
    printf("len is %d", len);
    if (len == -1) {
        PyErr_SetString(PyExc_UnicodeError, "Invalid UTF-8 string");
        return NULL;
    }
    wchar_t unicode[len + 1];
    mbstowcs(unicode, string, len + 1);
    printf("unicode is %ls\n", unicode);
    int width = wcswidth(unicode, len + 1);
    PyMem_Free(string);
    return PyInt_FromLong(width);
}

When I call it with libc.wcswidth("▶️") I get the following traceback, which happens before it even finishes parsing the arguments:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)

andychu · 2019-06-22T21:59:48Z

PyArg_ParseTuple shouldn't be doing any decoding (is that Python 3?). You should just get a raw char* and then pass it to mbstowcs (if I understand correctly, I haven't actually done it before)

andychu · 2019-06-22T22:00:53Z

That is, you go from Python string -> char* -> wchar_t -> int essentially. Python should not do any decoding or encoding. Using libc is convenient because it's available everywhere "for free".

(I'll be offline without Github access until later tonight.)

jyn514 · 2019-06-22T22:02:22Z

Oh I'm silly, it's the same reason that unicode doesn't work: it's not compiled in

jyn514 · 2019-06-22T22:02:48Z

Ok yeah now I'm getting different errors :P

andychu · 2019-06-22T22:03:19Z

I think the invocation from fnmatch is pretty close to what we want:

https://github.com/oilshell/oil/blob/master/native/libc.c#L59

jyn514 · 2019-06-22T22:05:54Z

I'm not sure this is on my end, mbstowcs is returning -1 for valid unicode strings :(

static PyObject *
func_wcswidth(PyObject *self, PyObject *args){
    char *string;
    if (!PyArg_ParseTuple(args, "s", &string)) {
        return NULL;
    }
    printf("made it past argparsing, string is %s\n", string);
    int len = mbstowcs(NULL, string, 0);
    printf("len is %d\n", len);
    if (len == -1) {
        PyErr_SetString(PyExc_UnicodeError, "Invalid UTF-8 string");
        return NULL;
    }
    wchar_t unicode[len + 1];
    mbstowcs(unicode, string, len + 1);
    printf("unicode is %ls\n", unicode);
    int width = wcswidth(unicode, len + 1);
    //PyMem_Free(string);
    return PyInt_FromLong(width);
}

>>> import libc; libc.wcswidth("▶️")
made it past argparsing, string is ▶️
len is -1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeError: Invalid UTF-8 string

jyn514 · 2019-06-22T22:06:45Z

@andychu I tried the call from fnmatch earlier and it gave me the traceback I just posted, I didn't realize it was coming from my code instead of the python library

jyn514 · 2019-06-22T22:13:55Z

stackoverflow says this is platform dependent https://stackoverflow.com/questions/21120965/converting-a-utf-8-text-to-wchar-t

andychu · 2019-06-23T02:23:19Z

I copied the example here and it worked:

http://man7.org/linux/man-pages/man3/mbstowcs.3.html

https://github.com/oilshell/blog-code/tree/master/libc-unicode

But you do have to do setlocale() utf-8, even though that's my system's locale.

I am sort of uncomfortable with the globals... but this might be the way to do it.

I guess we could write a test with utf-8 and ucs2/utf-16 or whatever as the default encoding, and see what happens.

I only care about utf-8 for Oil, but I don't want it to do something fantastically buggy if that's not your locale...

The other thing we can do is decode utf-8 ourselves, which is not hard. The wcwidth() thing is hard, so we want to reuse that, but we don't have to reuse the decoding.

There is actually code to do that in osh/string_ops.py. It's not quite decoding but it's close -- it counts utf-8 chars, which I guess is not enough here.
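
Counting utf-8 characters can be sketched as counting the bytes that are not continuation bytes. This is not the code from osh/string_ops.py, just an illustration of the idea; the function name is made up, and it assumes the input is already valid utf-8.

```python
def count_utf8_chars(data: bytes) -> int:
    """Count code points in valid UTF-8 by skipping continuation bytes."""
    # Continuation bytes have the bit pattern 10xxxxxx.
    return sum(1 for byte in data if (byte & 0xC0) != 0x80)

arrow = b"\xe2\x96\xb6\xef\xb8\x8f"  # U+25B6 U+FE0F, six bytes
print(count_utf8_chars(arrow))
```

The count says how many code points there are, but wcwidth() needs the code point values themselves, which is why counting alone is not enough here.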

(It would also be nice to move that to C at some point, but that's a different story than getting it working.)

andychu · 2019-06-23T02:27:25Z

FWIW if you want to geek out there are a lot of tiny utf-8 decoders here:

https://news.ycombinator.com/item?id=15423674

e.g.

https://bjoern.hoehrmann.de/utf-8/decoder/dfa/

(I don't vouch for that code, it's not the most straightforward code, but it shows how small the problem is)

Writing our own or copying one into native/libc.c as a shortcut is not a horrible idea, I think... it is a small enough calculation that it shouldn't have any bugs (at least that was my experience with the Python implementation).
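
To give a sense of the size, here is a minimal sketch of a hand-written decoder. It is not taken from the linked decoders or from Oil, the function name is made up, and it is written in Python for readability even though the real thing would live in C.

```python
def decode_utf8(data: bytes) -> list:
    """Decode UTF-8 bytes into a list of code points, or raise ValueError."""
    code_points = []
    i = 0
    while i < len(data):
        first = data[i]
        if first < 0x80:
            length, cp = 1, first
        elif 0xC2 <= first <= 0xDF:   # 0xC0 and 0xC1 would be overlong
            length, cp = 2, first & 0x1F
        elif 0xE0 <= first <= 0xEF:
            length, cp = 3, first & 0x0F
        elif 0xF0 <= first <= 0xF4:
            length, cp = 4, first & 0x07
        else:
            raise ValueError("invalid start byte at offset %d" % i)
        if i + length > len(data):
            raise ValueError("truncated sequence at offset %d" % i)
        for byte in data[i + 1:i + length]:
            if (byte & 0xC0) != 0x80:
                raise ValueError("invalid continuation byte after offset %d" % i)
            cp = (cp << 6) | (byte & 0x3F)
        # Reject overlong forms, surrogates, and values past U+10FFFF.
        if ((length == 3 and cp < 0x800) or (length == 4 and cp < 0x10000)
                or 0xD800 <= cp <= 0xDFFF or cp > 0x10FFFF):
            raise ValueError("invalid code point at offset %d" % i)
        code_points.append(cp)
        i += length
    return code_points

decoded = decode_utf8(b" \xe2\x96\xb6\xef\xb8\x8f ")
print([hex(cp) for cp in decoded])
```

Each resulting code point could then be handed to wcwidth() without going through the locale-dependent conversion functions.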

jyn514 · 2019-06-23T13:38:41Z

Hmm, maybe the best way would be to save the original locale, setlocale to UTF8, convert the string, and then set it back?
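
A sketch of that save, switch, and restore sequence, written with ctypes so it can be tried without building native/libc.c. Several things here are assumptions: a POSIX system where `ctypes.CDLL(None)` exposes libc, the locale names in `_UTF8_LOCALES`, and the function name. It returns None when no UTF-8 locale is available or the conversion fails.

```python
import ctypes
import locale

_libc = ctypes.CDLL(None)  # assumption: POSIX, libc symbols are visible here
_WCharPtr = ctypes.POINTER(ctypes.c_wchar)
_libc.mbstowcs.argtypes = [_WCharPtr, ctypes.c_char_p, ctypes.c_size_t]
_libc.mbstowcs.restype = ctypes.c_size_t
_libc.wcswidth.argtypes = [_WCharPtr, ctypes.c_size_t]
_libc.wcswidth.restype = ctypes.c_int

_CONVERSION_FAILED = ctypes.c_size_t(-1).value   # (size_t)-1
_UTF8_LOCALES = ("C.UTF-8", "en_US.UTF-8")       # assumption: names vary by system

def utf8_display_width(data: bytes):
    """Display width of UTF-8 bytes via libc, or None if it can't be computed."""
    saved = locale.setlocale(locale.LC_CTYPE)    # query the current setting
    try:
        for name in _UTF8_LOCALES:
            try:
                locale.setlocale(locale.LC_CTYPE, name)
                break
            except locale.Error:
                continue
        else:
            return None                          # no UTF-8 locale could be set
        n = _libc.mbstowcs(None, data, 0)        # count wide chars needed
        if n == _CONVERSION_FAILED:
            return None
        buf = (ctypes.c_wchar * (n + 1))()
        _libc.mbstowcs(buf, data, n + 1)
        width = _libc.wcswidth(buf, n)
        return width if width >= 0 else None     # -1 means a non-printable char
    finally:
        locale.setlocale(locale.LC_CTYPE, saved) # always put the locale back

print(utf8_display_width(b"abc"))
```

The `finally` block is the important part: the process locale is global state, so it gets restored even when the conversion fails.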

jyn514 · 2019-06-23T14:04:02Z

Got this working!

andychu · 2019-07-27T03:41:08Z

Released with 0.7.pre1 I think

andychu added the interactive-shell label Mar 24, 2019

jyn514 added a commit to jyn514/oil that referenced this issue Jun 22, 2019

Take unicode into account when finding the length of the prompt

dcea3b9

Addresses oils-for-unix#269. Uses the [wcwidth](https://pypi.org/project/wcwidth/) python library to find the display width of a character.

jyn514 added a commit to jyn514/oil that referenced this issue Jun 22, 2019

Take unicode into account when finding the length of the prompt

08efae8

Addresses oils-for-unix#269. Uses the [wcwidth](https://pypi.org/project/wcwidth/) python library to find the display width of a character.

andychu mentioned this issue Jun 23, 2019

Consider using a utf-8 library in C #366

Closed

jyn514 mentioned this issue Jun 23, 2019

Take unicode into account when calculating display width #368

Merged

andychu closed this as completed Jul 27, 2019
