FAQ: Why Not Write Oil in X?

andychu edited this page Jan 17, 2023 · 57 revisions

These questions come up often enough that I'm making a wiki page about them.

First: I Encourage Parallel Experiments in Other Languages

I don't claim that Oil's strategy is the only way! (However, I will note that every POSIX-compatible shell is written in C, and there are deeper reasons for that than you might expect. Shell is a thin layer over the Unix kernel, which has an interface specified in C. Its major dependency is libc.)
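To make "thin layer over the Unix kernel" concrete, here is a minimal sketch of the core of running an external command: fork, exec, wait. It is not Oil's code; `run_command` is a hypothetical name, and Python's `os` module is used only because it wraps the same libc calls.

```python
import os


def run_command(argv):
    """Run argv in a child process and return a shell-style exit status."""
    pid = os.fork()  # fork(2): duplicate the current process
    if pid == 0:
        # Child: replace this process image with the requested program.
        try:
            os.execvp(argv[0], argv)  # returns only by raising on failure
        except OSError:
            os._exit(127)  # conventional "command not found" status

    # Parent: block until the child terminates, then decode its status.
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    # Killed by a signal: shells conventionally report 128 + signal number.
    return 128 + os.WTERMSIG(status)
```

For example, `run_command(["sh", "-c", "exit 3"])` returns 3. Almost everything here is a direct system call, which is part of why the host language's relationship to libc matters.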

How to Rewrite Oil in Nim, C++, D, or Rust (or C) (Summer 2020)

Why Not OCaml?

I think OCaml is a "top contender" -- it has algebraic data types, garbage collection, and a predictable native compiler.

However, a huge part of a shell codebase is lexing and parsing, and I believe those are more naturally done with imperative / stateful languages. (On the other hand, type checking is probably more natural in OCaml. I say "probably" because location info and errors tend to drown out attempts at clean, short code.)

For example, ocamlyacc is no different than a parser generator in C. There is no real advantage to OCaml there, and C/C++ are faster. re2c ended up being a perfect code generator for Oil, and I would have to write OCaml bindings for its generated code, which I don't know how to do.

Why Not D?

I think D would be another top contender -- it's a "fast systems language" with garbage collection. It has builtin dictionaries! (which are used all over Oil)

Though I think we would still want something like Zephyr ASDL for algebraic data types.

Why Not Go?

You Can't Write a Portable POSIX Shell in Portable Go due to its threaded runtime, and the fact that it doesn't use libc.

It doesn't necessarily mean Go is a bad choice, but it will cause more work.

Related comment and discussion about threads and fork(), which don't mix: https://news.ycombinator.com/item?id=31741222

Also, as of Jan 2023, the native Oil binary is a little above 1 MB. As far as I remember, binaries in other compiled languages may be 10x as big, and I believe that matters for a shell.

Why Not Rust?

I think Rust is promising in general, and it's obviously possible with enough effort.

But the shell AST is actually a big graph, and the parser is reused in an unusual way for interactive completion, giving it odd ownership semantics. I think garbage collection is natural for this problem.

Also, garbage collection must occur somewhere -- either in the Oil language or the host language (C++ / Rust). I think having it in the host language is very nice. You can write garbage collectors in (unsafe) Rust, but I don't know how to do it.
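One way to see why a graph-shaped data structure fits a tracing collector is the sketch below. It is my illustration, not a description of Oil's AST: `Node`, `add_child`, and `cycle_demo` are hypothetical names, the parent back-pointer is an assumed source of cycles, and the behavior shown is specific to CPython (reference counting plus a separate cycle collector).

```python
import gc
import weakref


class Node:
    """Hypothetical syntax-tree node; not Oil's actual AST type."""

    def __init__(self, name):
        self.name = name
        self.parent = None
        self.children = []


def add_child(parent, child):
    parent.children.append(child)
    child.parent = parent  # the back-pointer turns the tree into a cyclic graph


def cycle_demo():
    """Return (alive with refcounting only, alive after a tracing pass)."""
    was_enabled = gc.isenabled()
    gc.disable()  # leave only reference counting active
    try:
        root = Node("command")
        add_child(root, Node("word"))
        probe = weakref.ref(root)  # observe root without keeping it alive
        del root  # drop the last external reference
        alive_without_tracing = probe() is not None
        gc.collect()  # run the cycle collector explicitly
        alive_after_tracing = probe() is not None
    finally:
        if was_enabled:
            gc.enable()
    return alive_without_tracing, alive_after_tracing
```

On CPython, `cycle_demo()` returns `(True, False)`: the cycle outlives its last external reference until a tracing pass runs. With single-owner or reference-counted pointers, the programmer has to break such cycles by hand.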

Why Not Write it By Hand in C++?

I think you could probably "compress" bash's 142K lines of C into 100K lines of C++ or so. But then the Oil language would bring that to perhaps 200K lines.

I don't think I'm capable of writing 200K lines of C++ from scratch with a good architecture! (A few people I've worked with probably could, but I can't, and I think even most "good" C++ programmers can't.)

The 10-40K lines of Python let me aggressively refactor the code for years! And after living with this code for many years, I'm happy with how it turned out.

Also:

  • Parsing Shell Was Like Black Box Reverse Engineering, and the low latency of an interpreter helps. C++ is slow to compile.
  • Python has garbage collection, and it also turned out to be a rich source of metalanguages:
    • Zephyr ASDL for algebraic data types
    • pgen2 for LL parsing
    • a regex parser, which, along with re2c, produces state machines in pure C
    • Gradual typing with MyPy

Why Not C?

C has the advantage that many kernel and systems programmers are familiar with C.

It has the same disadvantages as C++ -- you'd have to write a very large C program to implement OSH + Oil (i.e. most of bash, and a lot more), and it would be hard to re-design and refactor.

In addition, we use fine-grained static types to represent shell, which both C++ and MyPy support. (In the Zephyr ASDL code generator, sum types are implemented with inheritance, which turns out to work just great and be very useful.)

In contrast, shells generally use a very homogeneous / "untyped" WORD* representation, which makes it hard to reason about parsing, evaluation, and quoting.

  • We also use strongly typed containers like List<T> and Dict<K, V>, as well as virtual functions and exceptions.
  • C++ features like constexpr also turn out to be pretty nice for the compile-time reflection needed for garbage collection.
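As a rough sketch of what "sum types implemented with inheritance" can look like on the typed-Python side, consider the following. The class names (`WordPart`, `Literal`, `SingleQuoted`, `VarSub`) and the evaluator are hypothetical and much simpler than Oil's real ASDL schema; the point is only that each variant carries its own fields, so quoting rules are visible in the types rather than hidden in a homogeneous word list.

```python
from dataclasses import dataclass
from typing import Dict, List


class WordPart:
    """Base class standing in for the sum type; each subclass is one variant."""


@dataclass
class Literal(WordPart):
    text: str  # unquoted text, used as-is


@dataclass
class SingleQuoted(WordPart):
    text: str  # contents of '...', never expanded


@dataclass
class VarSub(WordPart):
    name: str  # $name, looked up at evaluation time


def eval_part(part: WordPart, env: Dict[str, str]) -> str:
    """Evaluate one variant; the isinstance checks act as the 'match'."""
    if isinstance(part, Literal):
        return part.text
    if isinstance(part, SingleQuoted):
        return part.text
    if isinstance(part, VarSub):
        return env.get(part.name, "")
    raise AssertionError("unknown WordPart variant: %r" % (part,))


def eval_word(parts: List[WordPart], env: Dict[str, str]) -> str:
    """Concatenate the evaluated parts of a single word."""
    return "".join(eval_part(p, env) for p in parts)
```

Here `eval_word([Literal("hi-"), VarSub("USER"), SingleQuoted("$x")], {"USER": "andy"})` evaluates to `"hi-andy$x"`: the single-quoted `$x` stays literal because of its type, not because of a flag someone remembered to check.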

Related FAQ: Why Generate C++, and Not C?

Nim?

Note: a manual translation to Nim is being attempted: https://forum.nim-lang.org/t/6756#42018

Semi-Automatic Translation

There seems to be a belief that automatically translating Python to Nim is easier than translating Python to C++, but I don't think that's true. The similar indentation-based syntax doesn't make translation easier; the semantics and libraries are what matter.

https://old.reddit.com/r/oilshell/comments/gqrixg/oil_08pre5_progress_in_c/frw0sl7/

Writing From Scratch

From what I understand, Nim could be a good language for writing a shell from scratch. One thing I didn't like, though, is that the generated C code is not readable. From what I remember, it is more like a control flow graph serialized into C.

An explicit goal of Oil's C++ translation is to be able to read, debug, and profile the generated code with standard C++ tools, which are powerful and numerous. Functions are functions; loops are loops; ifs are ifs; etc.

Why Not Use [Dynamic Language With JIT] instead of C++?

Why Not Run the Oil Interpreter with PyPy?

One answer here: https://lobste.rs/s/e6u4zi/garbage_collected_heap_c_shaped_like#c_kgepb7

Why Not Rewrite the Oil Interpreter in RPython, and generate C without a JIT?

EDIT: It's possible that this would work; we have done an experiment along these lines.

It's less straightforward than what we're doing, since it has to infer types at build time.

Memory Management Strategies

Why Not Use Boehm GC?

This has become a bit of a FAQ too! Related: Oil 0.12.7 - Garbage Collector Problems

  • The technique is inherently unportable
    • I would like Oil to be able to bootstrap OSes on weird CPU architectures, without writing assembly code. (recall that one of the first shell programs I ran was "Aboriginal Linux")
  • Compared to the shell, it's big and complex
    • It's at least 33K lines of C code, and some assembly.
    • We have less than 7K lines of hand-written C++ in Oil -- that's the collector + data structures + OS bindings
  • It's supposed to be "drop in", but in reality ...
    • To make good use of it, you are supposed to give it hints about where pointers may or may not be
    • There are many tuning parameters, and they can be tuned incorrectly. (The author of the Nix evaluator rewrite commented on this.)
    • The Nix evaluator appears to be carrying around Boehm GC patches for Darwin. I don't want to become a Boehm maintainer!
  • Two anti-recommendations from 2012 here; Boehm was removed from their codebases: https://news.ycombinator.com/item?id=3576396
    • It's also true that some people have good experiences
  • The risk of imprecision is higher on 32-bit systems; a shell has good use cases on 32-bit systems.
  • The safety is questionable -- it changes when compilers change, and they have changed a lot since Boehm GC was initially developed
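The 32-bit point can be made concrete with back-of-envelope arithmetic. A conservative collector treats any word that looks like a heap address as a pointer. Under the deliberately simplistic assumption that a non-pointer word is uniformly random over the address space (real data is not), the chance that it lands inside the heap is just heap size divided by address space size. `false_pointer_probability` is a hypothetical helper, not part of any collector's API.

```python
def false_pointer_probability(heap_bytes: int, address_bits: int) -> float:
    """Chance that a uniformly random word falls inside a heap of this size.

    Simplifying assumption: word values are uniform over the whole address
    space. This is an illustration of scale, not a model of real programs.
    """
    return heap_bytes / float(2 ** address_bits)


one_gib = 2 ** 30
on_32_bit = false_pointer_probability(one_gib, 32)  # 2**30 / 2**32
on_64_bit = false_pointer_probability(one_gib, 64)  # 2**30 / 2**64
```

With a 1 GiB heap, `on_32_bit` is 0.25, while `on_64_bit` is 2^-34, which is below one in ten billion. Under this model, accidental retention is a realistic concern on 32-bit systems and close to negligible on 64-bit ones.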

Some more comments on the blog: https://www.oilshell.org/blog/2023/01/garbage-collector.html

What About Reference Counting?

Oil is written with an "executable spec", so in theory any memory management strategy can be plugged in.

Some comments here: https://lobste.rs/s/de28g8/pictures_working_garbage_collector#c_vc5l9a

What About shared_ptr?

Some comments here: https://lobste.rs/s/s2remb/oil_is_being_implemented_middle_out#c_1itid1

Links
