comp_ui.NiceDisplay should take into account unicode character width #269
Comments
It looks like https://en.wikipedia.org/wiki/Wide_character#C/C++ Is there a version of wcwidth() that works on UTF-8 data? Note: this function might be useful for short-circuiting? |
Fairly related, since OSH essentially does this! https://www.openbsd.org/papers/eurobsdcon2016-utf8.pdf The OpenBSD base system supports exactly two LC_CTYPE locales:
We don’t even support ISO-LATIN-1 any longer in the base system. Apparently there is mbtowc(3), but it's not thread-safe! Gah... author recommends not using the reentrant versions? |
Here's an interesting alternative: https://unix.stackexchange.com/questions/245013/get-the-display-width-of-a-string-of-characters We could just print the prompt, and query cursor positions before and after! Then we know exactly how many displayable characters it is. Downsides? Querying the cursor position is probably also necessary for #257, for proper wrapping. |
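A minimal sketch of the cursor-position idea above, written in Python 3 for illustration. It assumes the terminal answers the standard ANSI "Device Status Report" request (`ESC [ 6 n`) with `ESC [ row ; col R`. The helper names `parse_cursor_report` and `displayed_width` are hypothetical, and the code that puts the terminal in raw mode, writes the query, and reads the reply is left out.

```python
import re

# Request that asks the terminal to report the cursor position.
CURSOR_QUERY = b"\x1b[6n"

# The reply has the form ESC [ <row> ; <col> R, with 1-based numbers.
_REPORT_RE = re.compile(rb"\x1b\[(\d+);(\d+)R")


def parse_cursor_report(reply):
    """Return (row, col) parsed from a cursor position report."""
    m = _REPORT_RE.search(reply)
    if m is None:
        raise ValueError("not a cursor position report: %r" % (reply,))
    return int(m.group(1)), int(m.group(2))


def displayed_width(before, after, term_cols):
    """Cells the cursor advanced between two (row, col) reports.

    Hypothetical helper: assumes the screen did not scroll and that the
    printed text moved the cursor only by drawing characters.
    """
    (r0, c0), (r1, c1) = before, after
    return (r1 - r0) * term_cols + (c1 - c0)
```

The prompt would be printed between two queries, and the difference between the two reports gives its width in cells without any knowledge of Unicode widths on the shell side.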
Why do we need multithreading? Shouldn't we have at most one main thread at a time? |
Yeah we don't really need multi-threading in the near term, but functions that use globals are a "smell" to me ... it's not what I'm used to. Some people have asked for Oil to be a Lua-like library, in which case it should use no globals. This would be nice in theory, but it's way down on the priority list now. It's possible that the prompt could use globals but it's separate from the OSH interpreter as a library, though. Although querying the cursor position might seem like a hack, I think it might be the right thing for the case of the prompt, if not in general. |
Since this is the last blocker to my using osh as my full-time shell, I think I'll take a look at it. Unfortunately I don't think this is going to play nicely with readline escape codes (https://superuser.com/questions/301353/escape-non-printing-characters-in-a-function-for-a-bash-prompt). The python
I think your idea of querying readline for the cursor position is the best way to do this without becoming platform dependent, but I'm not sure how to do that ... |
It looks like
|
@jyn514 I heard of that ability here: http://ballingt.com/rich-terminal-applications-2/
(BTW there are several blog posts on that site about terminals worth reading) I'm not sure if this is the best way to do it, since I didn't try it. Other shells seem to use wcwidth. Can we just write our own wcwidth wrapper? I think wcwidth.py is pure Python. But we could put a wrapper in Oh I remember the issue is that we don't use |
Oh sorry I didn't read back on this thread... that was already discussed. I'm fine with using Also the stackexchange link above talks about the cursor position. I think the "safer" thing to do is probably to try wcwidth... But either one seems doable with enough effort (?) |
Oh I guess a gotcha for The complication is that OSH is basically following the OpenBSD utf-8 philosophy, whereas most shells use fixed width representations in memory. (However, they do it very badly. Take a look at how bash's length operator behaves... It basically gives you garbage when there are encoding errors. The number of characters it reports is not a monotonically increasing function of the number of bytes!!! https://github.com/oilshell/oil/blob/master/spec/var-op-len.test.sh#L18) So OSH is somewhat of an outlier, but it also has more of a chance of being correct. |
FWIW if you're not familiar with these Unicode issues (I wasn't until recently and still have more to learn), I just updated this section of the FAQ, and it has some relevant links: http://www.oilshell.org/blog/2018/03/04.html#faq http://lucumr.pocoo.org/2014/1/9/ucs-vs-utf8/ We can chat about it on Zulip too |
Ok I have this working using the |
Addresses oils-for-unix#269. Uses the [wcwidth](https://pypi.org/project/wcwidth/) python library to find the display width of a character.
The problem is that we don't have Python's unicode types in the release build! It will work in the dev build because that's running with a plain Python interpreter, e.g. /usr/bin/python2. I shaved the Python interpreter down to this: http://www.oilshell.org/release/0.6.pre22/metrics.wwz/line-counts/nativedeps.txt e.g. there's a This blog post talks about that: http://www.oilshell.org/blog/2018/11/15.html and the conclusion is
And this is not just for vanity / minimalism. Speed is probably the top problem right now! Python is too slow, e.g. the benchmarks show that running a configure script is 6x slower :-( http://www.oilshell.org/release/0.6.pre22/benchmarks.wwz/osh-runtime/ |
But it's good that this works for your use case! That means we can just write The way I would start is to copy your prompt into native/libc_test.py, and then iterate on This command can quickly iterate on it. (Although you might want to re-enable stdout; it's piped to /dev/null)
I would love that as a contribution because I bet a lot of other people have unicode prompts too! |
FWIW using the Python/C API should be pretty straightforward for this case (although I find it hard in general) The https://github.com/oilshell/oil/blob/master/native/libc.c#L396 |
The tricky thing here is |
If I understand correctly, that's what So the Python version of wcwidth is basically just a wrapper that calls mbtowc and then wcwidth. It should do no work on its own. Basically just error checking and so forth. |
I'm kind of stumped. I wrote a wrapper that seems right to me, but it keeps trying to decode the arguments as ASCII and I'm not sure why.
When I call it with
|
PyArg_ParseTuple shouldn't be doing any decoding (is that Python 3?) You should just get a raw |
That is, you go from Python string -> (I'll be offline without Github access until later tonight.) |
Oh I'm silly it's the same reason that |
Ok yeah now I'm getting different errors :P |
I think the invocation from fnmatch is pretty close to what we want: https://github.com/oilshell/oil/blob/master/native/libc.c#L59 |
I'm not sure this is my end,
|
@andychu I tried the call from fnmatch earlier and it gave me the traceback I just posted, I didn't realize it was coming from my code instead of the python library |
stackoverflow says this is platform dependent https://stackoverflow.com/questions/21120965/converting-a-utf-8-text-to-wchar-t |
I copied the example here and it worked: http://man7.org/linux/man-pages/man3/mbstowcs.3.html https://github.com/oilshell/blog-code/tree/master/libc-unicode But you do have to call setlocale() with a utf-8 locale, even though that's my system's locale. I am sort of uncomfortable with the globals... but this might be the way to do it. I guess we could write a test with utf-8 and ucs2/utf-16 or whatever as the default encoding, and see what happens. I only care about utf-8 for Oil, but I don't want it to do something fantastically buggy if that's not your locale... The other thing we can do is decode utf-8 ourselves, which is not hard. The wcwidth() thing is hard, so we want to reuse that, but we don't have to reuse the decoding. There is actually code to do that in osh/string_ops.py. It's not quite decoding but it's close -- it counts utf-8 chars, which I guess is not enough here. (It would also be nice to move that to C at some point, but that's a different story than getting it working.) |
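To make the difference between counting utf-8 chars and decoding them concrete, here is a small Python 3 sketch. It is not the code in osh/string_ops.py; the function names are hypothetical, and the decoder is deliberately simplified (it rejects truncated sequences and bad continuation bytes, but does not reject overlong encodings or surrogates).

```python
def count_utf8_chars(data):
    """Count code points by counting bytes that are not continuation bytes.

    Continuation bytes look like 10xxxxxx. Assumes `data` is valid UTF-8.
    """
    return sum(1 for b in bytearray(data) if (b & 0xC0) != 0x80)


def decode_utf8(data):
    """Decode UTF-8 bytes into a list of integer code points."""
    buf = bytearray(data)
    n = len(buf)
    out = []
    i = 0
    while i < n:
        b = buf[i]
        # The start byte tells us how many continuation bytes follow.
        if b < 0x80:
            cp, extra = b, 0
        elif (b & 0xE0) == 0xC0:
            cp, extra = b & 0x1F, 1
        elif (b & 0xF0) == 0xE0:
            cp, extra = b & 0x0F, 2
        elif (b & 0xF8) == 0xF0:
            cp, extra = b & 0x07, 3
        else:
            raise ValueError("invalid start byte at offset %d" % i)
        if i + extra >= n:
            raise ValueError("truncated sequence at offset %d" % i)
        for j in range(1, extra + 1):
            c = buf[i + j]
            if (c & 0xC0) != 0x80:
                raise ValueError("bad continuation byte at offset %d" % (i + j))
            cp = (cp << 6) | (c & 0x3F)
        out.append(cp)
        i += extra + 1
    return out
```

Counting only tells you how many code points there are; a width calculation needs the code points themselves, which is why the decoder returns them.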
FWIW if you want to geek out there are a lot of tiny utf-8 decoders here: https://news.ycombinator.com/item?id=15423674 e.g. https://bjoern.hoehrmann.de/utf-8/decoder/dfa/ (I don't vouch for that code, it's not the most straightforward code, but it shows how small the problem is) Writing our own or copying one into |
Hmm, maybe the best way would be to save the original locale, |
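One possible reading of the save-and-restore idea, sketched in Python 3 with the standard `locale` module rather than in C. The context manager name `temp_ctype_locale` is hypothetical, and whether a given UTF-8 locale name exists on a system is an assumption the caller has to check.

```python
import contextlib
import locale


@contextlib.contextmanager
def temp_ctype_locale(name):
    """Switch LC_CTYPE to `name` for the duration of a block, then restore it."""
    # Calling setlocale() with only a category queries the current setting.
    saved = locale.setlocale(locale.LC_CTYPE)
    try:
        # Raises locale.Error if `name` is not installed on this system.
        locale.setlocale(locale.LC_CTYPE, name)
        yield
    finally:
        locale.setlocale(locale.LC_CTYPE, saved)
```

A width calculation could then run inside `with temp_ctype_locale("en_US.UTF-8"):` (a locale name assumed here for illustration), leaving the process-wide setting unchanged afterwards. The locale is still global state while the block runs.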
Got this working! |
Released with 0.7.pre1 I think |
@jyn514 I'm pretty sure this is the issue you just mentioned in #257. There are two separate bugs:
I think we can just call the POSIX function wcwidth(), although I guess that involves some conversion to wchar_t, which is annoying: http://man7.org/linux/man-pages/man3/wcwidth.3.html
There is a pure Python version of it, but I'm not sure if we should use it.
https://pypi.org/project/wcwidth/#files
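For a rough sense of what a width function has to decide, here is a Python 3 approximation built only on the standard `unicodedata` module. It is neither the POSIX wcwidth() nor the wcwidth package linked above, its tables depend on the Python version, and the names `approx_char_width` and `approx_str_width` are hypothetical.

```python
import unicodedata


def approx_char_width(ch):
    """Approximate the number of terminal cells one character occupies."""
    category = unicodedata.category(ch)
    if category == "Cc":
        return -1  # control characters are not printable, as with wcwidth()
    if category in ("Mn", "Me"):
        return 0  # combining marks draw on top of the previous character
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2  # wide and fullwidth characters take two cells
    return 1


def approx_str_width(s):
    """Sum the widths of a string, or return -1 if any character is unprintable."""
    total = 0
    for ch in s:
        w = approx_char_width(ch)
        if w < 0:
            return -1
        total += w
    return total
```

This ignores zero-width format characters and many other special cases, which is part of why reusing an existing wcwidth implementation is attractive.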