Provide APIs to allow users to write their own line editor / interactive interface #663

Open
andychu opened this issue Mar 15, 2020 · 13 comments

Comments

@andychu
Contributor

andychu commented Mar 15, 2020

Great section on how ble.sh works, that I recommend everyone read:

https://github.com/oilshell/oil/wiki/How-Interactive-Shells-Work

@andychu
Contributor Author

andychu commented Mar 15, 2020

API Requirements: To summarize, ble.sh only requires primitive I/O operations receive byte (bind -x) and poll (read -t 0) for its essential part. In other words, Bash/Readline doesn't provide any satisfactory high-level APIs for user-input processing (Bash/Readline provides bind for key bindings but it has tight limitations). If a shell provides some high-level support, a customizable key-binding system and a coroutine system would help users to develop interactive interfaces.

  • I think Oil could do the utf-8 decoding rather than doing it in shell? I think that would help other authors of line editors and could speed things up?
    • In issue "Consider using a utf-8 library in C" (#366) I mentioned possibly using https://github.com/JuliaStrings/utf8proc
    • @akinomyoga I wonder if we can use that? One issue with utf-8 decoding is that interactive decoding might need to be a state machine (push) rather than a normal parser (pull) ? I didn't yet look at decode.sh or the utf8proc but it's an issue I suspect.
    • That seems a bit related to the coroutine part, but I guess ble.sh uses that for something else, i.e. to make sure copy-paste doesn't get bottlenecked on byte-by-byte code
  • Note: zsh has a select() interface ... it wasn't mentioned but maybe it helps.

API Requirements: ble.sh requires a primitive I/O operation output string (printf). In addition, the means to get the current terminal size (LINES and COLUMNS) is needed. The same information can be obtained by external commands such as tput lines and tput cols (ncurses) or resize (xterm utility), yet it is useful to provide them as builtin features (as these commands might not be available in the system). If a shell provides high-level support for this, layout and rendering can be performed by the shell but not by the shell scripts so that the shell scripts only have to specify the characters and their graphic attributes. If the shell provides the prompt calculation, it should also provide the cursor position information after the prompt is printed. The means to suppress/control the I/O of the original shell is also needed.

  • Oil provides write -- in addition to printf
  • LINES and COLUMNS are tracked in "Set $LINES and $COLUMNS" (#459)
  • Yes to prompt width
  • Need to figure out what C APIs provide the others. This is useful

API Requirements: ble.sh requires a means to execute commands in the top-level context (direct eval in bind -x). ble.sh also uses the external command stty to adjust the pty handler state, which might be better built into the shell.

  • OK this is interesting. Possibly Oil can provide something cleaner than bind -x -- eval but I'll have to think about it.

@andychu
Contributor Author

andychu commented Mar 15, 2020

For context, this issue is related / complementary to #653 but I want to work on both. #653 gives some more immediate benefits while this is a large amount of work that's further in the future. But I think it's important as it will save us the effort of writing 30K lines of code for interactivity -- we just provide the APIs and users can customize the behavior. Much like they do with zsh (but not fish?)

@andychu
Contributor Author

andychu commented Mar 15, 2020

Of course I also accept ideas / patches for all of this stuff :) e.g. I think it probably makes sense for shell to have a terminal parser somehow?

If we get closer to running ble.sh (#653, will probably take a while), then we could provide some alternate / faster APIs to "fast path" some code in ble.sh in native code, e.g. the decoding seems like a good candidate, and there are others.

Maybe the "drawing buffer" needs a special data structure to be fast, etc.

And maybe native integers will avoid converting back and forth between strings and ints all the time? I think that the performance of ble.sh must suffer from bash doing that? I imagine you want to stay in the "integer domain" most of the time for performance

@andychu
Contributor Author

andychu commented Mar 15, 2020

  • I should also say that one way to reduce the amount of work is to include some time-tested C libraries with Oil
    • for Unicode: utf-8, character width.
    • for terminal parsing? I guess the line editor mainly needs to emit terminal codes, not parse them.
      • there might be some use cases for parsing, e.g. people including terminal codes in the prompt
    • what else? certain data structures?
  • Possible "accelerators":
    • I already know I want to include something like ByteReplacer for shell quoting, glob quoting, ERE quoting, URL quoting, HTML quoting, etc.
    • there could be other primitives in the Oil language that make certain kinds of text- or byte-processing fast

@andychu andychu changed the title APIs to allow users to write their own line editor / interactive interface Provide APIs to allow users to write their own line editor / interactive interface Mar 15, 2020
@akinomyoga
Collaborator

@akinomyoga I wonder if we can use that? One issue with utf-8 decoding is that interactive decoding might need to be a state machine (push) rather than a normal parser (pull) ? I didn't yet look at decode.sh or the utf8proc but it's an issue I suspect.

In principle, it can be implemented with either strategy, push or pull. But I think push is easier in this situation because the pull approach blocks when the user input has not yet arrived. It needs additional workarounds such as threads or select(2), which may require yet more adjustments. Also, the input stream (stdin) can be shared by several sub-programs (subshells and external programs), so it is complicated to keep things consistent with the pull approach.

ble.sh uses the push approach (a state machine). The state is saved in the global variable _ble_decode_byte__utf_8__mode, and additional data is stored in _ble_decode_byte__utf_8__code.

It seems utf8proc only provides the pull approach.
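To make the push style concrete, here is a minimal sketch in Python (the language Oil itself is written in). It is not a port of ble.sh's decode.sh: the class name and method are made up for illustration, and the two fields `mode` and `code` only mirror the roles described above for the two global variables. It skips overlong-form and surrogate checks.

```python
class PushUtf8Decoder:
    """Push-style decoder: the caller feeds bytes as they arrive, one at a time."""

    def __init__(self):
        self.mode = 0   # number of continuation bytes still expected
        self.code = 0   # bits of the code point accumulated so far

    def feed(self, byte):
        """Consume one byte; return the list of code points it completed (maybe empty)."""
        out = []
        if self.mode and (byte & 0xC0) != 0x80:
            # The sequence was cut short: emit U+FFFD, then treat this byte as a new start.
            out.append(0xFFFD)
            self.mode = 0
        if self.mode:
            self.code = (self.code << 6) | (byte & 0x3F)
            self.mode -= 1
            if self.mode == 0:
                out.append(self.code)
        elif byte < 0x80:
            out.append(byte)
        elif 0xC0 <= byte < 0xE0:
            self.mode, self.code = 1, byte & 0x1F
        elif 0xE0 <= byte < 0xF0:
            self.mode, self.code = 2, byte & 0x0F
        elif 0xF0 <= byte < 0xF8:
            self.mode, self.code = 3, byte & 0x07
        else:
            out.append(0xFFFD)   # stray continuation byte or invalid lead byte
        return out


decoder = PushUtf8Decoder()
code_points = []
for b in "a\u20ac".encode("utf-8"):   # bytes 61 E2 82 AC
    code_points += decoder.feed(b)    # never blocks waiting for more input
```

Because `feed` returns immediately, the caller can stop between any two bytes, let another program read from stdin, and resume later; a pull-style decoder would instead be stuck inside its own read call.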


Need to figure out what C APIs provide the others. This is useful

For LINES and COLUMNS:

#include <sys/ioctl.h>

struct winsize ws;
/* ioctl returns -1 if tty_fd does not refer to a terminal */
if (ioctl(tty_fd, TIOCGWINSZ, &ws) == 0) {
  COLUMNS = ws.ws_col;
  LINES = ws.ws_row;
}

The command stty ... corresponds to the termios API:

#include <termios.h>

struct termios termios;
tcgetattr(tty_fd, &termios);

/* modify termios */

tcsetattr(tty_fd, TCSAFLUSH, &termios);

there might be some use cases for parsing, e.g. people including terminal codes in the prompt

I don't think there is significant demand for a terminal-parser API available from scripts, but ble.sh does have a terminal code parser/processor. ble.sh parses the expanded result of PS1 to trace the cursor movements. Bash/Readline doesn't do that, which is why users have to enclose control sequences in \[ and \]. I think Bash/Readline just sums up the result of wcwidth for the characters after the last \n. The terminal code parser/processor of ble.sh is implemented as a shell function ble/canvas/trace in the following place:

ble.sh accepts user settings with ANSI control sequences in many places. All of these settings are processed by the above function ble/canvas/trace. But an interactive interface need not be so fancy. I'm not sure whether this kind of API is in significant demand.
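As a rough illustration of the simpler Bash/Readline behaviour guessed at above (sum the widths of the characters after the last newline, skipping whatever sits between \[ and \]), here is a Python sketch. It assumes the markers have already been turned into the bytes \001 and \002 that Readline uses for ignored regions. `char_width` is only a stand-in for wcwidth(3) built on `unicodedata`, and all function names are hypothetical.

```python
import unicodedata

START_IGNORE, END_IGNORE = "\x01", "\x02"   # what \[ and \] become in the expanded prompt


def char_width(ch):
    """Approximate wcwidth(3): 0 for combining/control, 2 for wide, else 1."""
    if unicodedata.combining(ch) or unicodedata.category(ch) in ("Cc", "Cf"):
        return 0
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1


def prompt_width(expanded_ps1):
    """Columns used by the last prompt line, not counting marked regions."""
    last_line = expanded_ps1.rsplit("\n", 1)[-1]
    width, ignoring = 0, False
    for ch in last_line:
        if ch == START_IGNORE:
            ignoring = True
        elif ch == END_IGNORE:
            ignoring = False
        elif not ignoring:
            width += char_width(ch)
    return width


marked = "\x01\x1b[31m\x02user\x01\x1b[0m\x02$ "
unmarked = "\x1b[31muser\x1b[0m$ "
widths = (prompt_width(marked), prompt_width(unmarked))   # (6, 13)
```

The unmarked prompt is over-counted by the seven printable characters inside the two escape sequences, which is the misplacement users see when they forget the markers. A tracer in the style of ble/canvas/trace would instead parse the sequences themselves.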

@andychu
Contributor Author

andychu commented Mar 21, 2020

Yes I suspected push/pull would be an issue.

I don't like having that "mismatch" which is why I like the high-level declarative style -- then you can generate both styles from a specification.

re2c apparently supports this, though I haven't used it:

https://re2c.org/manual/manual.html#storable-state

And actually this ties in very closely to what I think you're doing in ble.sh with syntax. You at least have a shell lexer, but lexing shell requires parsing shell:

https://github.com/oilshell/oil/wiki/Parsing-Shell-Quotes-Requires-Parsing-the-Whole-Language

In theory you can imagine "inverting" the whole OSH parser from pull to push, although that sounds a little scary... right now for completion, Oil just parses the entire line again, which is fast.

However I can certainly imagine cases for advanced UI where the push/event style (which I presume ble uses) is better.


I still want to look at your coroutine abstraction more closely ... It would be nice if you had some short docs / links to source in doc/ because I bet a lot of other people would be interested.

This is far in the future, but I would like Oil to have some kind of high level language / DSL for coroutines/state machines, to get rid of the push/pull problem perhaps. I think the ragel state machine generator does the "push" style by default.

I noticed Go HTTP libraries are all done in the pull style because the runtime supports goroutines / lightweight threads. As opposed to node.js / nginx in C where they have to write out error prone state machines. So yeah I'm not sure what will happen here but I think it is interesting, and also relates to whether we have select() in shell like zsh does, or if we have libev / libevent, etc.

@andychu
Contributor Author

andychu commented Mar 21, 2020

Another way of saying it is you can divide programmers in two:

  1. those who are comfortable writing explicit state machines (rather than implicit ones, e.g. [1])
  2. those who aren't

I feel like programmers in category 1 usually have some kind of EE background. I used to sit next to a guy doing FPGAs and he said it's all state machines.

Most shell programmers (including myself) are probably in category 2. (Although I think Oil is the only shell that encodes IFS as an explicit state machine, in frontend/consts.py) But I find it takes a long time to get it right, even though I like the property of considering every transition, which normal imperative code often doesn't do.

Then again I know that node.js and nginx often have (security) bugs lurking in their HTTP parsing state machine for years ... so it is not easy for anyone. Writing state machines in a higher level language is appealing for that reason.
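For readers in category 2, this is roughly what "considering every transition" looks like when the machine is written as a table. The sketch below splits on whitespace only; it is not the table in Oil's frontend/consts.py, and real IFS splitting also has to deal with non-whitespace separators and backslashes. All names are illustrative.

```python
# Character classes and states for a whitespace-only word splitter.
SPACE, OTHER = "space", "other"
OUTSIDE, IN_WORD = "outside", "in_word"

# Every (state, class) pair is listed, so no transition is left implicit.
TRANSITIONS = {
    (OUTSIDE, SPACE): (OUTSIDE, None),
    (OUTSIDE, OTHER): (IN_WORD, "begin"),
    (IN_WORD, SPACE): (OUTSIDE, "end"),
    (IN_WORD, OTHER): (IN_WORD, None),
}


def split_words(s):
    """Split s on runs of space, tab and newline using the table above."""
    state, start, words = OUTSIDE, 0, []
    for i, ch in enumerate(s):
        cls = SPACE if ch in " \t\n" else OTHER
        state, action = TRANSITIONS[state, cls]
        if action == "begin":
            start = i                   # remember where the word starts
        elif action == "end":
            words.append(s[start:i])    # the word ended just before this character
    if state == IN_WORD:
        words.append(s[start:])         # flush a word that runs to the end of input
    return words


words = split_words("  a  bc\td ")      # ['a', 'bc', 'd']
```

A missing table entry fails loudly with a KeyError, whereas a forgotten branch in imperative code usually fails silently.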


This is all in the future, but I think ble.sh has some very interesting use cases that I will dig deeper into.

[1] https://eli.thegreenplace.net/2009/08/29/co-routines-as-an-alternative-to-state-machines/

@andychu
Contributor Author

andychu commented Mar 21, 2020

Actually I tried the example here in ble.sh and it seems to understand everything? It knows that } matches ${, that " matches inner quote ", that ) matches $( ?

https://github.com/oilshell/oil/wiki/Parsing-Shell-Quotes-Requires-Parsing-the-Whole-Language

Do you think there are cases where it will get confused? How much of shell parsing does it implement?

@andychu
Contributor Author

andychu commented Mar 21, 2020

Actually I think the example wasn't really showing what I thought it did... I made a note on the wiki page and I will try to come up with some harder examples :)

@akinomyoga
Collaborator

akinomyoga commented Mar 22, 2020

push/pull interface

Yes I suspected push/pull would be an issue.

Actually there is a hybrid interface of push/pull or the generalized interface which can be used in both push and pull. I think the famous example is Zlib and other compression libraries (Bzip2, XZ Utils, etc.). For example, you can find this idea in the function declaration of C++ std::codecvt<I,E,S>::in (which is used for code conversions between character encodings). As you can see in the first parameter in the following declaration, it is like a state machine but switches the behavior depending on the capacity of the input buffer and output buffer. When input buffer is larger (smaller) than the output buffer, it behaves like a pull (push) approach. In the case of compression libraries, all of these parameters are stored in a structure like z_stream, but the essential structure is the same.

template<typename InternT, typename ExternT, typename StateT>
std::codecvt_base::result std::codecvt<InternT, ExternT, StateT>::in(
        StateT& state,
        const ExternT* from,
        const ExternT* from_end,
        const ExternT*& from_next,
        InternT* to,
        InternT* to_end,
        InternT*& to_next ) const;
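Python's standard library exposes the same general idea through `codecs.getincrementaldecoder` and `zlib.decompressobj`: an object holds the conversion state and accepts whatever input is available right now. The sketch below is only an analogy to the interface above, since Python manages the output buffer itself; the two helper function names are made up for illustration.

```python
import codecs
import zlib


def decode_in_chunks(chunks):
    """Decode UTF-8 that arrives in arbitrary pieces; state lives in the decoder object."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    pieces = [decoder.decode(chunk) for chunk in chunks]
    pieces.append(decoder.decode(b"", final=True))   # fails here if a sequence is left open
    return pieces


def inflate_in_chunks(chunks):
    """Decompress a zlib stream piece by piece."""
    inflater = zlib.decompressobj()
    return b"".join(inflater.decompress(chunk) for chunk in chunks) + inflater.flush()


# The euro sign is E2 82 AC; here it is split across three separate "arrivals".
pieces = decode_in_chunks([b"a\xe2", b"\x82", b"\xacb"])   # ['a', '', '\u20acb', '']
```

Feeding one byte per call gives the push behaviour, and handing over a whole buffer gives the pull behaviour, with the same decoder object in both cases.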

Coroutines/fibers in ble.sh

I still want to look at your coroutine abstraction more closely ... It would be nice if you had some short docs / links to source in doc/ because I bet a lot of other people would be interested.

Unfortunately, my implementation of coroutines in ble.sh is not abstract at all. Each coroutine is manually written in the Duff's device style using case $state in ... esac. There are no fancy abstractions in ble.sh, i.e., no language-level syntax sugar like the yield/async/await of modern languages. But, of course, I recommend implementing language-level support in Oil rather than forcing users to write explicit state machines. Note: ble/util/idle.*, which I mentioned somewhere, is not a utility for defining coroutines; it's a kind of scheduler for coroutines [ In this sense, the usage is more like fibers than coroutines ].
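The difference between the two styles can be sketched in Python. The class below stores its resume point in a `state` field and dispatches on it, which is the analogue of case $state in ... esac. The generator does the same work with language-level support. The round-robin loop is a toy stand-in for a scheduler and is not modelled on the actual ble/util/idle code; all names are hypothetical.

```python
class CountdownTask:
    """Hand-written 'coroutine': the resume point is stored in self.state."""

    def __init__(self, name, n):
        self.name, self.n, self.state = name, n, "start"

    def resume(self, log):
        """Run until the next suspension point; return False once finished."""
        if self.state == "start":
            log.append(f"{self.name}: begin")
            self.state = "loop"
            return True
        if self.state == "loop":
            if self.n > 0:
                log.append(f"{self.name}: {self.n}")
                self.n -= 1
                return True
            log.append(f"{self.name}: end")
            self.state = "done"
        return False


def countdown_gen(name, n, log):
    """The same logic as a generator: each yield is a suspension point."""
    log.append(f"{name}: begin")
    yield
    while n > 0:
        log.append(f"{name}: {n}")
        n -= 1
        yield
    log.append(f"{name}: end")


def as_resumer(gen):
    """Wrap a generator so it can be resumed like CountdownTask.resume."""
    def resume():
        try:
            next(gen)
            return True
        except StopIteration:
            return False
    return resume


def run_round_robin(resumers):
    """Toy scheduler: resume every pending task in turn until all have finished."""
    pending = list(resumers)
    while pending:
        pending = [r for r in pending if r()]


log_a, log_b = [], []
tasks = [CountdownTask("a", 1), CountdownTask("b", 2)]
run_round_robin([lambda t=t: t.resume(log_a) for t in tasks])
run_round_robin([as_resumer(countdown_gen("a", 1, log_b)),
                 as_resumer(countdown_gen("b", 2, log_b))])
```

Both runs interleave the two tasks identically, but only the first version forces its author to name every resume point by hand.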

Parser in ble.sh

And actually this ties in very closely to what I think you're doing in ble.sh with syntax. You at least have a shell lexer, but lexing shell requires parsing shell:

ble.sh parses the command line and internally constructs an AST. You can inspect the internal parsing state by setting ble_debug=1 in a ble.sh session [ Note: this inspection feature is only available on the master branch. It is dropped from the release versions ble-0.3, etc. ]. (edit 2024-11-14): In the latest version, this inspection feature is available with the setting bleopt syntax_debug=1.

Example: Internal state of ble.sh for echo "A${V:-"$(f(){ echo f;};f)"}Z"

$ ble_debug=1
$ echo "A${V:-"$(f(){ echo f;};f)"}Z"
_ble_syntax_attr/tree/nest/stat?
 2*aw   000 'e' |       stat=(CMDX w=- n=- t=-:-)
 |*aw   001 'c' |
 |*aw   002 'h' |
 |*aw   003 'o' +       word=CMDI:0-4/(wattr=72057594037930240)
 3*a    004 ' '
 9*a    005 '"' ||      nest=(ARGI w=ARGX:5- n=- t=-:$4) stat=(ARGX w=- n=- t=$4:-)
 5*a    006 'A' ||
14*a    007 '$' |||     nest=(QUOT w=- n='"${':5- t=-:-) stat=(QUOT w=- n=@5 t=-:-)
 |*a    008 '{' |||
26*a    009 'V' |||
14*a    010 ':' |||     stat=(PARAM w=- n=@7 t=-:-)
 |*a    011 '-' |||
 9*a    012 '"' ||||    nest=(PWORD w=- n='none':7- t=-:-) stat=(PWORD w=- n=@7 t=-:-)
14*a    013 '$' |||||   nest=(QUOT w=- n='$(':12- t=-:-) stat=(QUOT w=- n=@12 t=-:-)
 |*a    014 '(' |||||
 2*aw   015 'f' |||||+  word=_ble_attr_FUNCDEF:15-16/(wattr=72057594037951489) stat=(CMDX w=- n=@13 t=-:-)
12*a    016 '(' |||||
 |*a    017 ')' |||||
18*a    018 '{' |||||++ word=CMDI:@15>18-19>@18/(wattr=d) word="none":18-19 nest=(CMDI w=CMDXC:18- n='none':13- t=-:$16) stat=(CMDXC w=- n=@13 t=$16:-)
17*a    019 ' ' |||||   stat=(CMDX1 w=- n=@13 t=$19:-)
 2*aw   020 'e' ||||||  stat=(CMDX1 w=- n=@13 t=$19:-)
 |*aw   021 'c' ||||||
 |*aw   022 'h' ||||||
 |*aw   023 'o' |||||+  word=CMDI:@18>20-24/(wattr=72057594037930240)
 3*a    024 ' ' |||||
 4*a    025 'f' |||||+  word=ARGI:@23>25-26/(wattr=d) stat=(ARGX w=- n=@13 t=$24:-)
12*a    026 ';' |||||   stat=(ARGX w=- n=@13 t=$26:-)
19*a    027 '}' |||||+  word=CMDI:@25>27-28/(wattr=d) stat=(CMDX w=- n=@13 t=$26:-)
12*a    028 ';' |||||   stat=(CMDXE w=- n=@13 t=$28:-)
 2*aw   029 'f' |||||+  word=CMDI:@27>29-30/(wattr=72057594037929472) stat=(CMDX w=- n=@13 t=$28:-)
14*a    030 ')' ||||+   word="$(":13-31>@29 stat=(ARGX w=- n=@13 t=$30:-)
 9*a    031 '"' |||+    word="none":12-32>@30 stat=(QUOT w=- n=@12 t=$31:-)
14*a    032 '}' ||+     word=""${":7-33>@31 stat=(PWORD w=- n=@7 t=$32:-)
 5*a    033 'Z' ||      stat=(QUOT w=- n=@5 t=$33:-)
 9*a    034 '"' ++      word=ARGI:@3>5-35>@34/(wattr=d) word="none":5-35>@32 stat=(QUOT w=- n=@5 t=$33:-)
 |    s 035 ^@         stat=(ARGX w=- n=- t=$35:-)
\_ 'echo'
\_ '"A${V:-"$(f(){ echo f;};f)"}Z"'
    \_ '"A${V:-"$(f(){ echo f;};f)"}Z"'
        \_ '${V:-"$(f(){ echo f;};f)"}'
            \_ '"$(f(){ echo f;};f)"'
                \_ '$(f(){ echo f;};f)'
                    \_ 'f'
                    \_ '{'
                    |   \_ '{'
                    \_ 'echo'
                    \_ 'f'
                    \_ '}'
                    \_ 'f'

In theory you can imagine "inverting" the whole OSH parser from pull to push, although that sounds a little scary... right now for completion, Oil just parses the entire line again, which is fast.

The parser in ble.sh is more like a state machine. In addition, it saves the parser states at arbitrary positions of the command line so that it can partially update the AST. As far as possible it parses only the changed part, because shell script is too slow to parse the entire line on every keystroke.
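The idea of saving states at positions and re-parsing only from the edit point can be shown with a toy. The sketch below tracks nothing but single and double quotes (no backslashes, no nesting) and has no relation to the real ble.sh data structures. It keeps the state before every character, restarts from the saved state at the insertion point, and stops as soon as the new state agrees with the old one over unchanged text. All names are hypothetical.

```python
def step(state, ch):
    """One transition of a toy quote tracker (no backslash handling)."""
    if state == "normal":
        return {"'": "squote", '"': "dquote"}.get(ch, "normal")
    closing = "'" if state == "squote" else '"'
    return "normal" if ch == closing else state


def lex_all(text):
    """Reference lexer: states[i] is the state before text[i], plus one final entry."""
    states = ["normal"]
    for ch in text:
        states.append(step(states[-1], ch))
    return states


class IncrementalLexer:
    def __init__(self):
        self.text = ""
        self.states = ["normal"]
        self.steps = 0              # transitions executed so far, to show the savings

    def insert(self, pos, s):
        old = self.states
        text = self.text[:pos] + s + self.text[pos:]
        states = old[:pos + 1]      # everything before the edit is reused as-is
        state, i = old[pos], pos
        while i < len(text):
            state = step(state, text[i])
            self.steps += 1
            i += 1
            states.append(state)
            j = i - len(s)          # the same place in the old text, once past the insertion
            if j >= pos and old[j] == state:
                states.extend(old[j + 1:])   # states have converged: reuse the old tail
                break
        self.text, self.states = text, states


lexer = IncrementalLexer()
lexer.insert(0, "echo 'a b' c")     # first parse: 12 transitions
lexer.insert(4, "x")                # one keystroke: 1 more transition
```

Inserting a quote character is the bad case: the state flips for everything after it, so the loop has to run to the end of the line.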

Actually I tried the example here in ble.sh and it seems to understand everything? It knows that } matches ${, that " matches inner quote ", that ) matches $( ?

Yes.

1. Known unsupported syntax in ble.sh

How much of shell parsing does it implement?

I try to implement all the Bash syntax that I recognize. But there are several things that I haven't yet implemented. One is a heredoc terminator that looks like a command substitution. See the following example. Bash treats $(echo hello) literally, i.e., it doesn't perform command substitution. But ble.sh highlights it as if it were a real command substitution.

cat << $(echo hello)
hello
$(echo hello)

Another one I know is that ble.sh doesn't check the number of words in the pattern of case. In Bash, one cannot specify more than one word in the case pattern. But ble.sh doesn't detect this as a syntax error. Actually it is not so difficult to support it, but I'm just lazy...

hello='a b'
case hello in
(a b) # <-- this is a syntax error
  echo hello ;;
esac

Also, I haven't implemented detailed syntax checks inside the conditional command [[ ... ]].

2. Parsing with alias/history expansions

Do you think there are cases where it will get confused?

Actually there are several known cases that I gave up on. The most obvious ones are history expansions and alias expansions, which are like macros in C. They can change the syntactic structure when expanded. Maybe they are even worse than C macros because they can even change the lexical structure. For example, one can include an opening single quote in an alias (see the following example). ble.sh doesn't parse the inside of history expansions and alias expansions.

$ alias hello="echo 'hello"
$ hello world'
hello world

3. Inconsistent Multipass Parsing of Bash

Another type of problematic case is related to the multipass parsing of Bash. The syntactic structure that Bash recognizes can be inconsistent between different passes over one substring, so there are conceptually multiple ASTs in Bash. ble.sh is a single-pass parser and, in addition, I don't know how to deal with such ambiguous multiple ASTs, so I don't support them [ The list of such ambiguous cases is given near the beginning of note.txt, but I'm sorry everything is written in Japanese...].

Here I list the cases with a simple illustration for each.

echo [@(echo|[...])]

# first pass (word)
        echo [...]
      @(....|.....)
echo [.............]

# second pass (pathname expansion)
      @(echo|[...
     [...........]
echo .............)]

echo {@(,)}

# first pass (word)
        ,
      @(.)
echo {....}

# second pass (brace expansion)
      @( )
echo {..,.}

echo ${var:-{a,b}{a,b}

# first pass (word, brace expansion)
             a b  a b
       var:-{.,.}{.,.}
echo ${...............

# second pass (parameter expansion)
echo ${var:-{a,b}{a,b}
       var:-{a,b
echo ${.........}{a,b}

echo a[]b]=~

# first pass (tilde expansion)
      []
echo a..b]=~

# second pass (pathname expansion)
       ]b
echo a[..]=~

4. Active/inactive expansions

The expansions can be active or inactive depending on the syntax context. ble.sh doesn't handle the context completely because full support would require multipass parsing. Actually I'm not sure if this should be handled within the syntax analysis or in the semantic analysis, but in the case of Bash, everything happens in multipass syntax analysis. For example, in the following example, @(a|b) and [abc] are enclosed in [...] after the brace expansions, so the special meaning of the patterns @(a|b) and [abc] should be lost. But it is difficult to detect such inactivation in an arbitrary case without performing the actual brace expansions.

echo [{@(a|b),[abc]}]

Another example is the following. It will become ~root ~root after the brace expansions, so the tilde expansions are activated. But ble.sh doesn't highlight them as tilde expansions because the tilde is apparently not the first character of the word. It needs to perform the actual brace expansion to test if it can be activated or not.

echo {~root,~root}

Another example related to tilde expansions is as follows. In Bash, tildes after : or = in a word that has the form of a variable assignment are subject to tilde expansion. But when there is a brace expansion in the word, these tilde expansions are inactivated. ble.sh highlights such inactive tildes before the brace expansion as if they were active.

echo a=~:{a,b}:~:echo

@andychu
Contributor Author

andychu commented Mar 30, 2020

Sorry I meant to test Oil's parser out on all these cases and see what it does, but I haven't had time yet. But I'm impressed!

I noted recently that Oil uses its parser for history and completion, but:

  • bash doesn't use its own parser for completion
  • It punts the problem to bash-completion, a shell project, which tries to parse bash in bash, and does a really bad job.

http://www.oilshell.org/blog/2020/01/history-and-completion.html

It looks like ble.sh is doing a really good job but I haven't tested it more.

I would like to provide Oil users with some methods of parsing shell but I haven't figured out exactly how yet! And the push/pull issue is relevant there.

Oil's lexer can be "inverted" with re2c but there's no straightforward way to do that with the parser.


BTW Oil has a binding for WINSZ for its own use, in Python. But I think it does make sense to "hoist" it up to the shell level, so users can access it. If you have any ideas for builtins that should exist, let me know. (Although maybe it's better to run existing code before adding new features.)

@akinomyoga
Collaborator

Thank you!

I noted recently that Oil uses its parser for history and completion, but:

ble.sh uses its parser for completion but doesn't use it for history expansions. Actually, ble.sh doesn't implement the history expansion but just uses history -p of Bash.


If you have any ideas for builtins that should exist,

I think some kind of extended trap would be useful. Several non-signal traps are already supported by trap, such as EXIT, ERR, DEBUG, RETURN, etc. In principle, additional non-signal traps could be added to the supported traps, but the problem with trap is that only one handler can be registered for a trap at a time. There is exactly the same problem with the PROMPT_COMMAND approach. We may handle this problem by writing something like PROMPT_COMMAND="new-handler;$PROMPT_COMMAND", but this approach is not robust. It is also non-trivial to safely remove a specific trap handler. Also, it would be useful if the user could fire the trap.

ble.sh actually implements its own event handling system "blehook", which allows registering/unregistering multiple handlers for each "hook" and also firing the "hook". ble.sh uses this to provide customization points to users (uppercase hooks) and also for inter-module communication (lowercase hooks) to loosen the coupling between modules. If you are interested, you can dump the current list of hooks with the command blehook in a ble.sh session:

$ blehook
blehook ADDHISTORY=
blehook CHPWD=
blehook DA1R=
blehook DA2R+='ble/term/DA2R.hook'
blehook DA2R+='ble/color/initialize-term-colors'
(snip)
blehook keymap_load+='mshex/my-key-bindings'
blehook keymap_load+='mshex/bashrc/bind-keys'
blehook keymap_vi_load+='ble/util/invoke-hook _ble_keymap_vi_load_hook'
blehook keymap_vi_load+='blerc/vim-load-hook'
blehook keymap_vi_load+='blerc/debug/vim-hook'
blehook syntax_load=
blehook widget_bell+='ble/complete/menu/clear'

The detailed usage of the shell function blehook is described here. It may help to design the new API.
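The three operations asked for here (register several handlers, remove one specific handler, fire the hook on demand) fit in a few lines. The Python sketch below only illustrates the shape such an API might take; it does not reproduce blehook's actual interface or semantics, and the class and handler names are made up. The hook name CHPWD is borrowed from the dump above.

```python
class Hooks:
    """Minimal multi-handler hook registry."""

    def __init__(self):
        self._handlers = {}     # hook name -> list of callables, in registration order

    def add(self, hook, handler):
        self._handlers.setdefault(hook, []).append(handler)

    def remove(self, hook, handler):
        # Removes exactly this handler and leaves the others untouched.
        # Raises ValueError if it was never registered.
        self._handlers.get(hook, []).remove(handler)

    def fire(self, hook, *args):
        # Iterate over a copy so that a handler may add or remove handlers safely.
        for handler in list(self._handlers.get(hook, [])):
            handler(*args)


hooks = Hooks()
seen = []


def update_title(path):
    seen.append(("title", path))


def update_prompt(path):
    seen.append(("prompt", path))


hooks.add("CHPWD", update_title)
hooks.add("CHPWD", update_prompt)
hooks.fire("CHPWD", "/tmp")             # both handlers run, in registration order
hooks.remove("CHPWD", update_title)
hooks.fire("CHPWD", "/home")            # only update_prompt runs
```

Compared with splicing strings into PROMPT_COMMAND, removal here does not depend on how other handlers were written or quoted.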

@andychu
Contributor Author

andychu commented Mar 30, 2020

Ah I see, I've seen history -p but it's not yet implemented in Oil. That's very much within scope.


Thanks for the useful observations about the bash APIs. I filed #682 to keep track of that. I think there are some more immediate things on #653, but I would accept patches for builtins based on common usage in ble.sh. If there is anything you want to prototype, feel free :)

That is, I like to design the APIs based on a real usage, i.e. not just imagine what people want. So that is why I think ble.sh is compelling, because it already works and we can extract some APIs from it.


A big theme is that a lot of ble.sh is "inverted" vs. a "normal" shell script, i.e. in the push/event style. There is heavy use of traps/hooks, etc. That's something I did experience when implementing interactive features in Oil, but I didn't fully appreciate.

In other words there are two halves to the shell: the interpreter, and the line editor. And they use opposite paradigms: the interpreter is entirely the "pull" style, and the line editor is almost entirely the "push" style.

And in the case of bash, the bridge is very small.

  • All it really does is call readline(), and then the line editor takes over with its push style.
  • And then there is the TAB callback which in Oil returns a list of lines with completion candidates. e.g. if the prompt line is ls c, then Oil returns ['ls configure', 'ls cook'] to GNU readline, and that's about it.

So it's a very thin interface. I think Oil will be richer, but it will also support all the old hooks.
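To show how thin that TAB bridge is, here is a Python sketch of a callback that turns the current line into whole-line candidates, as in the ls c example. It is not Oil's completion code. The candidate words are passed in by hand, and the `(text, state)` wrapper follows the calling convention of Python's `readline.set_completer`, which is an assumption about how one might hook it up.

```python
def complete_line(line, words):
    """Return whole-line candidates that complete the last word of `line`."""
    head, sep, prefix = line.rpartition(" ")
    return [head + sep + w for w in sorted(words) if w.startswith(prefix)]


def make_completer(words):
    """Adapt complete_line to a (text, state) callback: state 0, 1, 2, ... then None."""
    def completer(text, state):
        matches = complete_line(text, words)
        return matches[state] if state < len(matches) else None
    return completer


candidates = complete_line("ls c", ["cook", "build", "configure"])
# candidates == ['ls configure', 'ls cook']
```

Everything richer than this (menus, highlighting, incremental redraw) has to happen on the line editor's side of the bridge.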
