FAQ: Why Not Write Oil in X?
These questions come up often enough that I'm making a wiki page about them.
- Background: Oil Is Being Implemented "Middle Out" (with language-oriented programming, by translating Python to C++)
- Particularly the appendix has links
(2024 update: Renamed Oil -> Oils, though the titles of linked blog posts haven't been changed)
I don't claim that Oils' strategy is the only way! (However I will note that every POSIX-compatible shell is written in C, and there are deeper reasons for that than you might expect. Shell is a thin layer over the Unix kernel, which has an interface specified in C. Its major dependency is libc.)
How to Rewrite Oil in Nim, C++, D, or Rust (or C) (Summer 2020)
I think OCaml is a "top contender" -- it has algebraic data types, garbage collection, and a predictable native compiler.
However, a huge part of a shell codebase is lexing and parsing, and I believe those are more naturally done with imperative / stateful languages. (On the other hand, type checking is probably more natural in OCaml. I say "probably" because location info and errors tend to drown out attempts at clean, short code.)
For example, ocamlyacc is no different than a parser generator in C. There is no real advantage to OCaml there, and C/C++ are faster. re2c ended up being a perfect code generator for Oils, and I would have to write OCaml bindings for its generated code, which I don't know how to do.
- Related: The Morbig paper parsed POSIX shell with Menhir (similar to yacc), but Menhir required new features in order for this to work. Oils handles the much larger bash language with "lexer modes", recursive descent, and algebraic data types. See How To Parse Shell Like a Programming Language (most work was done in 2016)
- Good Subthread With Examples of Imperative Programming in OCaml -- to my eyes, writing a simple for loop and continue is awkward (short circuiting). Shell and parsing are full of such code.
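To make the point about for loops and continue concrete, here is a minimal sketch of the kind of stateful scanning loop meant here. It is a toy written for this page, not code from Oils; the function name `tokenize` and the token kinds are hypothetical.

```python
from typing import List, Tuple


def tokenize(line: str) -> List[Tuple[str, str]]:
    """Split one line into (kind, text) tokens, skipping blanks and comments."""
    tokens: List[Tuple[str, str]] = []
    i = 0
    n = len(line)
    while i < n:
        ch = line[i]
        if ch in " \t":
            i += 1
            continue  # whitespace produces no token
        if ch == "#":
            break  # a '#' at the start of a word begins a comment
        if ch in "|;&":
            tokens.append(("op", ch))
            i += 1
            continue
        # Otherwise scan a word up to the next blank or operator.
        start = i
        while i < n and line[i] not in " \t|;&":
            i += 1
        tokens.append(("word", line[start:i]))
    return tokens
```

Each early exit (`continue`, `break`) is one line in an imperative language. In a purely functional style the same logic is usually expressed with recursion or a fold that threads the position and the token list through every case.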
I think D would be another top contender -- it's a "fast systems language" with garbage collection. It has builtin dictionaries! (which are used all over Oils)
Though I think we would still want something like Zephyr ASDL for algebraic data types.
You Can't Write a Portable POSIX Shell in Portable Go due to its threaded runtime, and the fact that it doesn't use libc.
It doesn't necessarily mean Go is a bad choice, but it will cause more work.
Related comment and discussion about threads and fork(), which don't mix: https://news.ycombinator.com/item?id=31741222
Also, as of March 2024, the native Oils binary is a little above 2 MB. As far as I remember, binaries in other compiled languages may be 10x as big, and I believe that matters for a shell.
I think Rust is promising in general, and it's obviously possible with enough effort.
But the shell AST is actually a big graph, and the parser is reused in an unusual way for interactive completion, giving it odd ownership semantics. I think garbage collection is natural for this problem.
Also, garbage collection must occur somewhere -- either at the OSH/YSH level or at the host language level (C++ / Rust). I think having it in the host language is very nice. You can write garbage collectors in (unsafe) Rust, but I don't know how to do it.
- Another answer here: https://old.reddit.com/r/oilshell/comments/ralaw3/backlog_rough_progress_assessments/hnlksxc/
- GCC and Clang still support many more architectures than the Rust compiler. I'd like Oils to be built on weird embedded systems with limited compiler support. That's not a dealbreaker, but it's a consideration.
- Note that Zephyr ASDL is more expressive than Rust's algebraic data types in at least one dimension. (comment on that)
I think you could probably "compress" bash's 142K lines of C into 100K lines of C++ or so. But then YSH would bring that to perhaps 200K lines.
I don't think I'm capable of writing 200K lines of C++ from scratch with a good architecture! (A few people I've worked with probably could, but I can't, and I think even most "good" C++ programmers can't.)
The 10-40K lines of Python let me aggressively refactor the code for years! And after living with this code for many years, I'm happy with how it turned out.
Also:
- Parsing Shell Was Like Black Box Reverse Engineering, and the low latency of an interpreter helps. C++ is slow to compile.
- Python has garbage collection, and it also turned out to be a rich source of metalanguages:
- Zephyr ASDL for algebraic data types
- pgen2 for LL parsing
- a regex parser, which, along with re2c, produces state machines in pure C
- Gradual typing with MyPy
C has the advantage that many kernel and systems programmers are familiar with C.
It has the same disadvantages as C++ -- you'd have to write a very large C program to implement OSH + YSH (i.e. most of bash, and a lot more), and it would be hard to re-design and refactor.
In addition, we use fine-grained static types to represent shell, which both C++ and MyPy support. (In the Zephyr ASDL code generator, sum types are implemented with inheritance, which turns out to work just great and be very useful.)
In contrast, shells generally use a very homogeneous / "untyped" WORD* representation, which makes it hard to reason about parsing, evaluation, and quoting.
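A rough illustration of the difference, as a sketch only: the classes below are hypothetical and are not the actual Oils schema or the output of the Zephyr ASDL code generator. They show how a sum type can be modeled as a base class with one subclass per variant, so that a question like "is this part subject to word splitting?" is answered from the type rather than by re-inspecting a flat string.

```python
from dataclasses import dataclass
from typing import List


class WordPart:
    """Base class standing in for a sum type; each subclass is one variant."""


@dataclass
class Literal(WordPart):
    text: str


@dataclass
class SingleQuoted(WordPart):
    text: str


@dataclass
class VarSub(WordPart):
    name: str
    quoted: bool  # True when the substitution appears inside double quotes


@dataclass
class CompoundWord:
    parts: List[WordPart]


def is_subject_to_splitting(part: WordPart) -> bool:
    """Only the result of an unquoted substitution is split into fields."""
    if isinstance(part, VarSub):
        return not part.quoted
    return False


# Roughly the word  pre$x'a b'  as typed parts rather than one flat string.
word = CompoundWord([Literal("pre"), VarSub("x", False), SingleQuoted("a b")])
flags = [is_subject_to_splitting(p) for p in word.parts]
```

With a homogeneous representation, the same decision would depend on flags or escape characters embedded in the word itself, which is where quoting bugs tend to hide.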
We also use:
- strongly typed containers like List<T> and Dict<K, V>
- virtual functions
- exceptions and deterministic cleanup with C++ destructors
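The three features in the list above can be sketched on the Python side as follows. This is an illustration under assumptions, not Oils code: all class and function names are hypothetical, `List[str]` and `Dict[str, str]` stand in for the typed containers, method overriding stands in for virtual functions, and a context manager plays the role that a destructor plays in C++.

```python
from typing import Dict, List, Optional


class Builtin:
    """Base class; run() is overridden per builtin (a virtual function in C++)."""

    def run(self, argv: List[str], env: Dict[str, str]) -> int:
        raise NotImplementedError()


class HasVar(Builtin):
    def run(self, argv: List[str], env: Dict[str, str]) -> int:
        return 0 if argv[0] in env else 1


class Broken(Builtin):
    def run(self, argv: List[str], env: Dict[str, str]) -> int:
        raise ValueError("simulated failure")


class TempBinding:
    """Bind env[name] for the duration of a with block, then restore it."""

    def __init__(self, env: Dict[str, str], name: str, value: str) -> None:
        self.env = env
        self.name = name
        self.value = value
        self.saved: Optional[str] = None

    def __enter__(self) -> "TempBinding":
        self.saved = self.env.get(self.name)
        self.env[self.name] = self.value
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        # Runs on normal exit and on exceptions, like a C++ destructor.
        if self.saved is None:
            del self.env[self.name]
        else:
            self.env[self.name] = self.saved
        return False  # do not swallow the exception


def run_with_ifs(builtin: Builtin, env: Dict[str, str]) -> int:
    try:
        with TempBinding(env, "IFS", ","):
            return builtin.run(["IFS"], env)
    except ValueError:
        return 1
```

Whether `run()` returns or raises, the temporary binding is undone before `run_with_ifs` returns, so callers never observe the intermediate state.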
Related FAQ: Why Generate C++, and Not C?
Features like constexpr also turn out to be pretty nice for the compile-time reflection needed for garbage collection.
Note: a manual translation to Nim is being attempted: https://forum.nim-lang.org/t/6756#42018
There seems to be a belief that automatically translating Python to Nim is easier than translating Python to C++, but I don't think that's true. The similar indentation-based syntax doesn't make translation easier; the semantics and libraries are what matter.
https://old.reddit.com/r/oilshell/comments/gqrixg/oil_08pre5_progress_in_c/frw0sl7/
From what I understand, I think Nim could be a good language for writing a shell from scratch. Although one thing I didn't like is that the generated C code is not readable.
It is more like a control flow graph serialized into C, from what I remember.
An explicit goal of Oils' C++ translation is to be able to read, debug, and profile the generated code with standard C++ tools, which are powerful and numerous. Functions are functions; loops are loops; ifs are ifs; etc.
One answer here: https://lobste.rs/s/e6u4zi/garbage_collected_heap_c_shaped_like#c_kgepb7
EDIT: It's possible that this would work; we have done an experiment along these lines.
It's less straightforward than what we're doing, since it has to infer types at build time.
This has become a bit of a FAQ too! Related: Oil 0.12.7 - Garbage Collector Problems
- The technique is inherently unportable
- I would like Oils to be able to bootstrap OSes on weird CPU architectures, without writing assembly code. (recall that one of the first shell programs I ran was "Aboriginal Linux")
- Compared to the shell, it's big and complex
- It's at least 33K lines of C code, and some assembly.
- We have less than 7K lines of hand-written C++ in Oils -- that's the collector + data structures + OS bindings
- It's supposed to be "drop in", but in reality ...
- To make good use of it, you are supposed to give it hints about where pointers may or may not be
- There are many tuning parameters, and they can be tuned incorrectly. (author of Nix evaluator rewrite commented on this)
- The Nix evaluator appears to be carrying around Boehm GC patches for Darwin. I don't want to become a Boehm maintainer!
- Two anti-recommendations from 2012 here; Boehm was removed from their codebases: https://news.ycombinator.com/item?id=3576396
- It's also true that some people have good experiences
- The risk of imprecision is higher on 32-bit systems; a shell has good use cases on 32-bit systems.
- The safety is questionable -- it changes when compilers change, and they have changed a lot since Boehm GC was initially developed
- Good perspective from Henderson about this in Accurate Garbage Collection in Uncooperative Environments (2002)
- My summary of that work: https://old.reddit.com/r/ProgrammingLanguages/comments/y93yvv/oil_0127_garbage_collector_problems/it9uv1h/
- 2016 Emacs controversy: https://lists.gnu.org/archive/html/emacs-devel/2016-11/msg00551.html
- My point is not to say it's unsafe, but to say it's questionable, which is definitely true! Oils' philosophy is not to use questionable assumptions about compilers, even if they happen to be true for some compilers.
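One way to see why imprecision is a bigger risk on 32-bit systems is a back-of-the-envelope calculation. It rests on a crude assumption that is mine, not from the linked discussions: a non-pointer word on the stack is treated as a uniformly random integer, and a conservative collector retains whatever it appears to point at. The heap size is an arbitrary example.

```python
def false_pointer_probability(heap_bytes: int, address_bits: int) -> float:
    """Chance that a uniformly random word lands inside a heap of this size."""
    return heap_bytes / float(2 ** address_bits)


heap = 64 * 1024 * 1024  # 64 MiB of live heap, i.e. 2**26 bytes

p32 = false_pointer_probability(heap, 32)  # 2**26 / 2**32 = 2**-6
p64 = false_pointer_probability(heap, 64)  # 2**26 / 2**64 = 2**-38
```

Under that assumption, about 1 in 64 random words looks like a heap pointer with 32-bit addresses, versus roughly 1 in 275 billion with 64-bit addresses. Real integers are not uniformly distributed, so this indicates the trend rather than an actual failure rate.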
Some more comments on the blog: https://www.oilshell.org/blog/2023/01/garbage-collector.html
Oils is written with an "executable spec", so in theory any memory management strategy can be plugged in.
Some comments here: https://lobste.rs/s/de28g8/pictures_working_garbage_collector#c_vc5l9a
Some comments here: https://lobste.rs/s/s2remb/oil_is_being_implemented_middle_out#c_1itid1
- Zulip: Implementation Language FAQ (requires login). Go, Rust, D, Nim, etc.
- I may copy more links here as necessary