How to find the minimum of three double numbers? It may be surprising to you (it certainly was to me), but there is more than one way to do it, with a big difference in performance as well. It is possible to make this simple calculation significantly faster by exploiting CPU-level parallelism.
+
The phenomenon described in this blog post was observed in this
+thread of the Rust forum. I am not the one who found out what is
+going on, I am just writing it down :)
+
We will be using Rust, but the language is not important, the original program
+was in Java. What will turn out to be important is CPU architecture. The laptop
+on which the measurements are done has i7-3612QM.
We will be measuring dynamic time warping algorithm. This algorithm
+calculates a distance between two real number sequences, xs and ys. It is
+very similar to edit distance or Needleman–Wunsch,
+because it uses the same dynamic programming structure.
+
The main equation is
+
+
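For reference, the standard DTW recurrence has roughly this shape (a reconstruction, assuming absolute difference as the pointwise cost):

$$
\mathrm{dtw}(i, j) = |xs_i - ys_j| + \min\bigl(\mathrm{dtw}(i-1, j),\ \mathrm{dtw}(i, j-1),\ \mathrm{dtw}(i-1, j-1)\bigr)
$$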
+
That is, we calculate the distance between each pair of prefixes of xs and
+ys using the distances from three smaller pairs. This calculation can be
+represented as a table where each cell depends on three others:
+
+
+
It is possible to avoid storing the whole table explicitly. Each row depends
+only on the previous one, so we need to store only two rows at a time.
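A minimal sketch of this two-row scheme (function and variable names are mine, not necessarily the ones from the measured program):

```rust
// Dynamic time warping distance, keeping only two rows of the DP table.
fn dtw(xs: &[f64], ys: &[f64]) -> f64 {
    let inf = f64::INFINITY;
    let mut prev = vec![inf; ys.len() + 1];
    let mut curr = vec![inf; ys.len() + 1];
    prev[0] = 0.0;
    for &x in xs {
        curr[0] = inf;
        for (iy, &y) in ys.iter().enumerate() {
            let d = (x - y).abs();
            // minimum of the left, upper and diagonal cells
            let m = curr[iy].min(prev[iy]).min(prev[iy + 1]);
            curr[iy + 1] = d + m;
        }
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[ys.len()]
}
```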
Is it fast? If we compile it in --release mode with
+
+
+
in ~/.cargo/config, it takes 435 milliseconds for two
+random sequences of length 10000.
+
What is the bottleneck? Let’s look at the instruction level profile of the main
+loop using perf annotate command:
+
+
+
perf annotate uses AT&T assembly syntax, which means that the destination register is on the right.
+
The xmm0 register holds the value of curr[iy], which was calculated on the
+previous iteration. Values of prev[iy - 1] and prev[iy] are fetched into
+xmm1 and xmm2. Note that although the original code contained three if
+expressions, the assembly does not have any jumps and instead uses two min and
+one blend instruction to select the minimum. Nevertheless, a significant
+amount of time, according to perf, is spent calculating the minimum.
This version completes in 430 milliseconds, which is a nice win of 5
+milliseconds over the first version, but is not that impressive. The assembly
+looks cleaner though:
+
+
+
Up to this point it was a rather boring blog post about Rust with some assembly
+thrown in. But let’s tweak the last variant just a little bit …
This version takes only 287 milliseconds to run, which is roughly 1.5 times
+faster than the previous one! However, the assembly looks almost the same …
+
+
+
The only difference is that two vminsd instructions are swapped.
+But it is definitely much faster.
A possible explanation is a synergy of CPU level parallelism and speculative
+execution. It was proposed by @krdln and @vitalyd. I don’t know how to
+falsify it, but it at least looks plausible to me!
+
Imagine for a second that instead of vminsd %xmm0,%xmm1,%xmm0 instruction
+in the preceding assembly there is just vmovsd %xmm1,%xmm0. That is, we don’t
+use xmm0 from the previous iteration at all! This corresponds to the following
+update rule:
+
+
+
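Concretely, the hypothetical rule would be something like this (a schematic line, with curr/prev as before and d standing for the pointwise cost):

```rust
// The freshly computed cell to the left, curr[i - 1], is not used at all.
curr[i] = d + prev[i - 1].min(prev[i]);
```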
The important property of this update rule is that CPU can calculate two cells
+simultaneously in parallel, because there is no data dependency between
+curr[i] and curr[i + 1].
+
We do have vminsd %xmm0,%xmm1,%xmm0, but it is equivalent to vmovsd %xmm1,%xmm0 if xmm1 is smaller than xmm0. And this is often the case: xmm1 holds the minimum of the upper and diagonal cells, so it is likely to be less than the single cell to the left. Also, the diagonal path is taken slightly more often than the two alternatives, which adds to the bias.
+
So it looks like the CPU is able to speculatively execute vminsd and
+parallelise the following computation based on this speculation! Isn’t that
+awesome?
Despite the fact that Rust is a high-level language, there is a strong correlation between the source code and the generated assembly. Small tweaks to the source result in small changes to the assembly, with potentially big implications for performance. Also, perf is great!
+
That’s all :)
Min of Three Part 2
It calculates dynamic time warping distance between two double
+vectors using an update rule which is structured like this:
+
+
+
This code takes 293 milliseconds to run on a particular input
+data. The speedup from 435 milliseconds stated in the previous post is
+due to Moore’s law: I’ve upgraded the CPU :)
+
We can bring run time down by tweaking how we calculate the minimum of
+three elements.
This version takes only 210 milliseconds, presumably because the
+minimum of two elements in the previous row can be calculated without
+waiting for the preceding element in the current row to be computed.
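The tweak is presumably a reassociation along these lines (a sketch; the names are mine):

```rust
// The two cells from the previous row are combined first, so this part of
// the minimum does not have to wait for the cell to the left, which is
// produced only at the end of the previous iteration.
fn min3(curr_left: f64, prev_diag: f64, prev_up: f64) -> f64 {
    prev_diag.min(prev_up).min(curr_left)
}
```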
+
The assembly for the main loop looks like this (AT&T syntax, destination register on the right).
Can we loosen dependencies between cells even more to benefit from instruction-level parallelism? What if, instead of filling the table row by row, we fill it by diagonals?
+
+
+
We’d need to remember two previous diagonals instead of one previous
+row, but all the cells on the next diagonal would be independent! In
+theory, compiler should be able to use SIMD instructions to make the
+computation truly parallel.
It takes 185 milliseconds to run. The assembly for the main loop is quite interesting:
+
+
+
First of all, we don’t see any vectorized instructions; the code does roughly the same operations as in the previous version. Also, there is a whole bunch of extra branching instructions at the top. These are bounds checks which were not eliminated this time. And this is great: if I added up all the off-by-one errors I’ve made implementing diagonal indexing, I would get an integer overflow! Nevertheless, we’ve got some speedup.
+
Can we go further and get SIMD instructions here? At the moment, Rust does not have a stable way to explicitly emit SIMD (it’s going to change some day) (UPDATE: we have SIMD on stable now!), so the only choice we have is to tweak the source code until LLVM sees an opportunity for vectorization.
How can we get the same results with safe Rust? One possible way is to use iterators, but in this case the resulting code would be rather ugly, because you’d need a lot of nested .zips. So let’s try the simple trick of hoisting the bounds checks out of the loop. The idea is to transform this:
+
+
+
into this:
+
+
+
In Rust, this is possible by explicitly slicing the buffer before the loop:
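A sketch of the pattern (the real loop is the DTW update; this only shows the slicing trick):

```rust
// Slicing up front lets the compiler prove that every index below `n` is in
// bounds, so the per-iteration bounds checks can be eliminated.
fn add_prev_row(prev: &[f64], curr: &mut [f64], n: usize) {
    let prev = &prev[..n];
    let curr = &mut curr[..n];
    for i in 0..n {
        curr[i] += prev[i];
    }
}
```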
This is definitely an improvement over the best safe version, but is
+still twice as slow as the unsafe variant. Looks like some bounds
+checks are still there! It is possible to find them by selectively
+using unsafe to replace some indexing operations.
We’ve gone from almost 300 milliseconds to only 50 in safe Rust. That
+is quite impressive! However, the resulting code is rather brittle and
+even small changes can prevent vectorization from triggering.
+
It’s also important to understand that to allow for SIMD, we had to
+change the underlying algorithm. This is not something even a very
+smart compiler could do!
I tried installing the stable 16.09 version first, but the live CD didn’t manage to start the X server properly. This was easy to fix by switching to the then-beta 17.03.
It is my first system which uses UEFI instead of BIOS, and I was pleasantly surprised by how everything just worked. The documentation contains only a short paragraph about UEFI, but it’s everything you need. The only hiccup on my side happened when I enabled GRUB together with systemd-boot: you don’t need GRUB at all, systemd-boot is a bootloader which handles everything.
After I installed everything, I was presented with a blank screen instead of my desktop environment (with the live CD everything worked). It took me ages to debug the issue, while the fix was super trivial: add videoDrivers = [ "intel" ]; to the xserver config and "nouveau" to blacklistedKernelModules.
While nix is the best way to manage a Linux desktop I am aware of, rustup is the most convenient way of managing Rust toolchains. Unfortunately it’s not easy to make rustup play nicely with NixOS (UPDATE: rustup is now packaged in nixpkgs and just works). Rustup downloads binaries of the compiler and Cargo, but it is impossible to launch unmodified binaries on NixOS because it lacks a conventional loader.
+
The fix I came up with is a horrible hack which goes against
+everything in NixOS. Here it is:
+
+
+
It makes the loader and shared libraries (rustup needs zlib) visible
+to binaries compiled for x64 Linux.
Another software which I wish to update somewhat more frequently than
+other packages is IntelliJ IDEA (I write a fair amount of Kotlin and
+Rust). NixOS has a super convenient mechanism to do this:
+packageOverrides. Here is my ~/nixpkgs/config.nix:
+
+
+
It lets me use the most recent IDEA with the stable NixOS channel.
If you are wondering how debuggers work, I suggest reading Eli Bendersky’s eli-on-debuggers. However, after having read these notes myself, I still had one question unanswered. Namely, how can a debugger show the fields of a class, if the type of the class is known only at runtime?
Consider this situation: you have a pointer of type A*, which at runtime holds
+a value of some subtype of A. Could the debugger display the fields of the
+actual type? Turns out, it can handle cases like the one below just fine!
Could it be possible that information about dynamic types is present in DWARF? If we look at the DWARF, we’ll see that there’s layout information for both the Base and Derived types, as well as an entry for the x parameter, which says that it has type Base. And this makes sense: we don’t know that x is Derived until runtime! So the debugger must somehow figure out the type of the variable dynamically.
As usual, there’s no magic. For example, LLDB has hard-coded knowledge of the C++ programming language, which allows the debugger to inspect types at runtime. Specifically, this is handled by the LanguageRuntime LLDB plugin, which has a curious function GetDynamicTypeAndAddress, whose job is to poke at the representation of a value to get its real type and adjust the pointer, if necessary (remember, with multiple inheritance, casts may change the value of the pointer).
+
The implementation of this function for the C++ language lives in ItaniumABILanguageRuntime.cpp. Although, unlike C, C++ lacks a standardized ABI, almost all compilers on all non-Windows platforms use a specific ABI, confusingly called Itanium (after a now effectively dead 64-bit CPU architecture).
Make your own make
One of my favorite features of Cargo is that it is not a general
+purpose build tool. This allows Cargo to really excel at the task of building
+Rust code, without usual Turing tarpit of build configuration files. I have yet
+to see a complicated Cargo.toml file!
+
However, once a software project grows, it’s almost inevitable that it will require some tasks besides building Rust code. For example, you might need to integrate several languages together, or to set up some elaborate testing for non-code aspects of your project, like checking the licenses, or to establish an involved release procedure.
+
For such use-cases, a general purpose task automation solution is needed. In
+this blog post I want to describe one possible approach, which leans heavily on
+Cargo’s built-in functionality.
The simplest way to automate something is to write a shell script. However there
+are few experts in the arcane art of shell scripting, and shell scripts are
+inherently platform dependent.
+
The same goes for make, with its many annoyingly similar flavors.
+
Two tools which significantly improve on the ease of use and ergonomics are
+just and cargo make. Alas, they still mostly rely on the
+shell to actually execute the tasks.
An obvious idea is to use Rust for task automation. Originally, I have proposed
+creating a special Cargo subcommand to execute build tasks, implemented as Rust
+programs, in this
+thread.
+However, since then I realized that there are built-in tools in Cargo which
+allow one to get a pretty ergonomic solution. Namely, the combination of
+workspaces, aliases and ability to define binaries seems to do the trick.
If you just want a working example, see this
+commit.
+
A typical Rust project looks like this
+
+
+
Suppose that we want to add a couple of tasks, like generating some code from
+some specification in the RON format, or
+grepping the source code for TODO marks.
+
First, create a special tools package:
+
+
+
The tools/Cargo.toml might look like this:
+
+
+
Then, we add a
+[workspace]
+to the parent package:
+
+
+
We need this section because tools is not a dependency of frobnicator, so it
+won’t be picked up automatically.
+
Then, we write code to accomplish the tasks in tools/src/bin/gen.rs and
+tools/src/bin/todo.rs.
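For example, a minimal tools/src/bin/todo.rs could look like this (a sketch; the real task can be arbitrarily involved):

```rust
use std::{fs, io, path::Path};

// Recursively walk the source tree and print lines containing TODO marks.
fn visit(path: &Path) -> io::Result<()> {
    for entry in fs::read_dir(path)? {
        let path = entry?.path();
        if path.is_dir() {
            visit(&path)?;
        } else if path.extension().map_or(false, |ext| ext == "rs") {
            let text = fs::read_to_string(&path)?;
            for (i, line) in text.lines().enumerate() {
                if line.contains("TODO") {
                    println!("{}:{}: {}", path.display(), i + 1, line.trim());
                }
            }
        }
    }
    Ok(())
}

fn main() -> io::Result<()> {
    visit(Path::new("src"))
}
```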
+
Finally, we add frobnicator/.cargo/config with the following contents:
+
+
+
Voilà! Now, running cargo gen or cargo todo will execute the tasks!
This is a small post about a specific pattern for cancellation in the Rust
+programming language. The pattern is simple and elegant, but it’s rather
+difficult to come up with it by yourself.
To be able to stop a worker, we need to have one in the first place! So, let’s implement a model program.

The task is to read input line by line, sending these lines to another thread for processing (echoing the line back, with ❤️).
My solution looks like this:
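Roughly along these lines (a sketch, not necessarily identical to the original solution):

```rust
use std::io::BufRead;
use std::sync::mpsc;
use std::thread;

enum Msg {
    Line(String),
}

fn main() {
    let (sender, receiver) = mpsc::channel();

    // The worker echoes every line it receives.
    let worker = thread::spawn(move || {
        while let Ok(Msg::Line(line)) = receiver.recv() {
            println!("{} ❤️", line);
        }
    });

    let stdin = std::io::stdin();
    for line in stdin.lock().lines() {
        let line = line.unwrap();
        sender.send(Msg::Line(line)).unwrap();
    }

    drop(sender);
    worker.join().unwrap();
}
```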
Now that we have a worker, let’s add a new requirement.
+
When the user types stop, the worker (but not the program itself) should be halted.
+
How can we do this? The most obvious way is to add a new variant, Stop, to the Msg
+enum, and break out of the worker’s loop:
+
+
+
This works, but only partially:
+
+
+
We can add more code to fix the panic, but let’s stop for a moment and try
+to invent a more elegant way to stop the worker. The answer will be below this
+beautiful Ukiyo-e print :-)
The answer is: the cleanest way to cancel something in Rust is to drop it.
+For our task, we can stop the worker by dropping the Sender:
+
+
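A sketch of what that might look like (continuing the hypothetical program above):

```rust
use std::io::BufRead;
use std::sync::mpsc;
use std::thread;

enum Msg {
    Line(String),
}

fn main() {
    let (sender, receiver) = mpsc::channel();

    let worker = thread::spawn(move || {
        while let Ok(Msg::Line(line)) = receiver.recv() {
            println!("{} ❤️", line);
        }
    });

    // Wrapping the Sender in an Option lets us drop it on demand.
    let mut sender = Some(sender);

    let stdin = std::io::stdin();
    for line in stdin.lock().lines() {
        let line = line.unwrap();
        if line.trim() == "stop" {
            // Dropping the Sender makes recv() in the worker return Err,
            // which ends its loop.
            sender.take();
            continue;
        }
        match &sender {
            Some(sender) => sender.send(Msg::Line(line)).unwrap(),
            None => println!("the worker is stopped"),
        }
    }

    drop(sender);
    worker.join().unwrap();
}
```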
+
Note the interesting parts of the solution:
+
+
+no need to invent an additional message type,
+
+
+the Sender is stored inside an Option, so that we can
+drop it with the .take method,
+
+
+the Option forces us to check if the worker is alive
+before sending a message.
+
+
+
More generally, previously the worker had two paths for termination: a normal termination via the Stop message and an abnormal termination after a panic in recv (which might happen if the parent thread panics and drops the Sender). Now there is a single code path for both cases. That means we can be more confident that if something somewhere dies with a panic, the shutdown will still proceed in an orderly fashion; it is not a special case anymore.
+
The only thing left to make this ultimately neat is to replace a hand-written while let
+with a for loop:
Recently I’ve been sending a lot of pull requests to various GitHub-hosted
+projects. It had been a lot of trial and error before I settled on the git
+workflow which doesn’t involve “Nah, I’ll just rm -rf this folder and do a
+fresh git clone” somewhere. This post documents the workflow. In a nutshell,
+it is
+
+
+do not use the master branch for pull requests
+
+
+use the master branch to track upstream repository
+
+
+automate
+
+
+
Note that the hub utility exists to handle these issues automatically. I personally haven’t used it, for no real reason; you definitely should check it out!
The natural thing to do, when sending a pull request, is to fork the upstream
+repository, git clone your fork locally, make a fix, git commit -am and
+git push it to the master branch of your fork and then send a PR.
+
It even seems to work at first, but breaks down in these two cases:
+
+
+
You want to send a second PR, and now you don’t have a clean branch
+to base your work off.
+
+
+
The upstream was updated, your PR does not merge cleanly anymore,
+you need to do a rebase, but you don’t have a clean branch to rebase
+onto.
+
+
+
Tip 1: always start with creating a feature branch for PR:
+
+
+
However it is easy to forget this step, so it is important to be able
+to move to a separate branch after you erroneously committed code to
+master. It is also crucial to reset master to clean state, otherwise
+you’ll face some bewildering merge conflicts, when you try to update
+your fork several days later.
+
Tip 2: don’t forget to reset master after a mix-up:
+
+
+
Update: I’ve learned that magit has a dedicated utility for this “create a branch and reset master to a clean state” workflow — git spinoff. My implementation is here.
If you work regularly on a particular project, you’d want to keep your
+fork in sync with upstream repository. One way to do that would be to
+add upstream repository as a git remote, and set the local master
+branch to track the master from upstream:
+
Tip 3: tracking remote repository
+
+
+
With this setup, you can easily update your pull requests if they don’t merge cleanly because of upstream changes:
+
Tip 4: updating a PR
+
+
+
Update: worth automating as well, here’s my git
+refresh
There are several steps to get the repo setup just right, and doing it
+manually every time would lead to errors and mysterious merge
+conflicts. It might be useful to define a shell function to do this
+for you! It could look like this
Bonus 1: another useful function to have is for reviewing PRs:
+
+
+
Bonus 2:
There are a lot of learning materials about Git out there. However, a lot of these materials are either comprehensive references, or just present a handful of the most useful git commands. I once accidentally stumbled upon Git from the bottom up and I highly recommend reading it: it is a moderately long article which explains the inner mechanics of Git.
Suppose you have some struct which holds some references inside. Now,
+you want to store a reference to this structure inside some larger
+struct. It could look like this:
+
+
+
The code, as written, does not compile:
+
+
+
To fix it, we need to thread Foo’s lifetime through Context as an additional parameter:
+
+
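Something like this (Foo and Context are simplified stand-ins for the real types):

```rust
struct Foo<'a> {
    buf: &'a [u8],
}

// Foo's lifetime now leaks into Context's interface,
// together with the 'a: 'f bound.
struct Context<'a, 'f>
where
    'a: 'f,
{
    foo: &'f Foo<'a>,
}
```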
+
And this is the problem which is the subject of this post. Although Foo is supposed to be an implementation detail, its lifetime, 'a, bleeds into Context’s interface, so most clients of Context would need to name this lifetime together with the 'a: 'f bound. Note that this effect is transitive: in general, a Rust struct has to name the lifetimes of its contained types, and their contained types, and their contained types, … But let’s concentrate on this two-level example!
+
The question is, can we somehow hide this 'a from users of Context? It’s
+interesting that I’ve first distilled this problem about half a year ago in this
+urlo
+post,
+and today, while refactoring some of Cargo internals in
+#5476 with
+@dwijnand, I’ve stumbled upon something, which
+could be called a solution, if you squint hard enough.
Surprisingly, it works! I’ll show a case where this approach breaks down
+in a moment, but let’s first understand why this works. The magic
+happens in the new method, which could be written more explicitly as
+
+
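In other words, something like the following sketch, where Context itself mentions only 'f:

```rust
struct Foo<'a> {
    buf: &'a [u8],
}

// No 'a in sight: Context talks only about 'f.
struct Context<'f> {
    foo: &'f Foo<'f>,
}

impl<'f> Context<'f> {
    fn new<'a: 'f>(foo: &'f Foo<'a>) -> Context<'f> {
        // Foo is covariant over 'a, so &'f Foo<'a> coerces to &'f Foo<'f>.
        let foo: &'f Foo<'f> = foo;
        Context { foo }
    }
}
```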
+
Here, we assign a &'f Foo<'a> to a variable of a different type &'f
+Foo<'f>. Why is this allowed? We use 'a lifetime in Foo only for
+a shared reference. That means that Foo is
+covariant over
+'a. And that means that the compiler can use Foo<'a> instead of
+Foo<'f> if 'a: 'f. In other words rustc is allowed to shorten the
+lifetime.
+
It’s interesting to note that the original new function didn’t say
+that 'a: 'f, although we had to add this bound to the impl block
+explicitly. For functions, the compiler infers such bounds from
+parameters.
+
Hopefully, I’ve mixed polarity an even number of times in this
+variance discussion :-)
What we want to say is that, inside the Context, there is some
+lifetime 'a which the consumers of Context need not care about,
+because it outlives 'f anyway. I think that the syntax for that
+would be something like
+
+
+
Alas, for is supported only for traits and function pointers, and
+there it has the opposite polarity of for all instead of exists,
+so using it for a struct gives
We’ve added a Push trait, which has the same interface as the Foo
+struct, but is not parametrized over the lifetime. This is
+possible because Foo’s interface doesn’t actually depend on the 'a
+lifetime. And this allows us to magically write foo: &'f mut (Push + 'f).
+This + 'f is what hides 'a as “some unknown lifetime, which outlives 'f”.
There are many problems with the previous solution: it is ugly, complicated and introduces dynamic dispatch. I don’t know how to solve those problems, so let’s talk about something I know how to deal with :-)

The Push trait duplicated the interface of the Foo struct. It wasn’t that bad, because Foo had only one method. But what if Bar has a dozen methods? Could we write a more general trait, which gives us access to Foo directly? Looks like it is possible, at least to some extent:
How does this work? Generally, we want to say that “there exists some lifetime 'a, which we know nothing about except that 'a: 'f”. Rust supports similar constructions only for functions, where for<'a> fn foo(&'a i32) means that a function works for all lifetimes 'a. The trick is to turn one into the other! The desugared type of the callback f is &mut for<'x> FnMut(&'f mut Foo<'x>). That is, it is a function which accepts Foo with any lifetime. Given that callback, we are able to feed our Foo with a particular lifetime to it.
While the code examples in the post juggled Foos and Bars, the
+core problem is real and greatly affects the design of Rust code. When
+you add a lifetime to a struct, you “poison” it, and all structs which
+contain it as a member need to declare this lifetime as well. I would
+love to know a proper solution for this problem: the described trait
+object workaround is closer to code golf than to the practical
+approach.
In this post, I’ll talk about a pattern for extracting values from a
+weakly typed map. This pattern applies to all statically typed
+languages, and even to dynamically typed ones, but the post is rather
+Rust-specific.
You have an untyped Map<String, Object> and you need to get a typed
+Foo out of it by the "foo" key. The untyped map is often some kind
+of configuration, like a JSON file, but it can be a real map with
+type-erased Any objects as well.
+
In the common case of statically known configuration, the awesome
+solution that Rust offers is serde. You stick derive(Deserialize)
+in front of the Config struct and read it from JSON, YML, TOML or
+even just environment variables!
+
+
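A sketch of that approach (the Config fields here are invented for the example):

```rust
use serde::Deserialize;

#[derive(Deserialize)]
struct Config {
    name: String,
    threads: u32,
    features: Vec<String>,
}

fn parse_config(json: &str) -> Result<Config, serde_json::Error> {
    serde_json::from_str(json)
}
```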
+
However, occasionally you can’t use serde. Some of the cases where
+this might happen are:
+
+
+
merging configuration from several sources, which requires writing a
+non-trivial serde deserializer,
+
+
+
lazy deserialization, when you don’t want to care about invalid values
+until you actually use them,
+
+
+
extensible plugin architecture, where various independent modules
+contribute options to a shared global config, and so the shape of
+the config is not known upfront.
+
+
+
you are working with Any objects or otherwise don’t do
+serialization per se.
The simplest approach here is to just grab an untyped object using a
+string literal and specify its type on the call site:
+
+
+
I actually think that this is a fine approach as long as such snippets
+are confined within a single module.
+
One possible way to make it better is to extract "foo" constant to a
+variable:
+
+
+
This does bring certain benefits:
+
+
+
fewer places to make a typo in,
+
+
+
behavior is moved from the code (.get("foo")) into data (const FOO), which makes it easier to reason about the code (at a glance, you can see all available config options and get an idea why they might be useful),
+
+
+
there’s now an obvious place to document keys: write a doc-comment for a
+constant.
+
+
+
While great in theory, I personally feel that this brings little tangible benefit in most cases, especially if some constants are used only once. This is the case where the implementation, a literal "foo", is more clear than the abstraction, a constant FOO.
However, the last pattern can become much more powerful and
+interesting if we associate types with string constants. The idea is
+to encode that the "foo" key can be used to extract an object of
+type Foo, and make it impossible to use it for, say,
+Vec<String>. To do this, we’ll need a pinch of
+PhantomData:
+
+
+
Now, we can add type knowledge to the "foo" literal:
+
+
+
And we can take advantage of this in the get method:
+
+
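Putting these pieces together, a sketch of how it might look (the Config type and error handling are simplified, and this is not the exact API of typed_key):

```rust
use std::any::Any;
use std::collections::HashMap;
use std::marker::PhantomData;

// A key is a name plus a phantom type parameter.
struct Key<T> {
    name: &'static str,
    marker: PhantomData<T>,
}

struct Foo {
    value: i32,
}

// The "foo" literal now knows that it refers to a Foo.
const FOO: Key<Foo> = Key { name: "foo", marker: PhantomData };

struct Config {
    map: HashMap<String, Box<dyn Any>>,
}

impl Config {
    // No turbofish at the call site: the key carries the type.
    // Usage: let foo: Option<&Foo> = config.get(FOO);
    fn get<T: Any>(&self, key: Key<T>) -> Option<&T> {
        self.map.get(key.name)?.downcast_ref::<T>()
    }
}
```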
+
Note how we were able to get rid of the turbofish at the call site! Moreover, the understandability aspect of the previous pattern is also enhanced: if you know both the type and the name of a config option, you can pretty reliably predict how it is going to be used.
I first encountered this pattern in IntelliJ code. It uses UserDataHolder, which is basically a Map<String, Object>, everywhere. It helps plugin authors to extend built-in objects in crazy ways, but is rather hard to reason about, and type-safety improves the situation a lot. I’ve also changed Exonum’s config to employ this pattern in this PR. It was also a case of plugin extensibility, where an upfront definition of all configuration options is impossible.
+
Finally, I’ve written a small crate for this typed_key :)
Similarly to the previous post, we will once again add types to the Rust
+code which works perfectly fine without them. This time, we’ll try to improve
+the pervasive pattern of using indexes to manage cyclic data structures.
Often one wants to work with a data structure which contains a cycle of some form: object foo references bar, which references baz, which references foo again. The textbook example here is a graph of vertices and edges. In practice, however, true graphs are a rare encounter. Instead, you are more likely to see a tree with parent pointers, which contains a lot of trivial cycles. And sometimes cyclic graphs are implicit: an Employee can be the head of a Department, and a Department has a Vec<Employee> of personnel. This is sort of a graph in disguise: in usual graphs, all vertices are of the same type, whereas here Employee and Department are different types.
+
Working with such data structures is hard in any language. To arrive
+at a situation when A points to B which points back to A, some
+form of mutability is required. Indeed, either A or B must be
+created first, and so it can not point to the other immediately after
+construction. You can paper over this mutability with let rec, as in
+OCaml, or with laziness, as in Haskell, but it is still there.
+
Rust tends to surface subtle problems in the form of compile-time
+errors, so implementing such graphs in Rust is challenging. The three
+usual approaches are:
+arena and real cyclic references, explanation by
+simonsapin (this one is really neat!),
+
+
+arena and integer indices, explanation by nikomatsakis.
+
+
+
(apparently, rewriting a Haskell monad tutorial in Rust results in a
+graphs blog post).
+
I personally like the indexing approach the most. However, it presents an interesting readability challenge. With references, you have a foo of type &Foo, and it is immediately clear what that foo is, and what you can do with it. With indexes, however, you have a foo: usize, and it is not obvious that you can somehow get a Foo out of it. Even worse, if indexes are used for two types of objects, like Foo and Bar, you may end up with a thing: usize. While writing code with usize actually works pretty well (I don’t think I’ve ever used the wrong index type), reading it later is more complicated, because usize is much less suggestive of what you could do.
One way to ameliorate this problem is to introduce a newtype wrapper
+around usize:
+
+
+
Here, “one should use FooIdx to index into Vec<Foo>” is still just
+a convention. A cool thing about Rust is that we can turn this
+convention into a property verified during type checking. By adding an
+appropriate impl, we should be able to index into Vec<Foo> with
+FooIdx directly:
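A sketch of both the newtype and the impl (Foo is a placeholder type):

```rust
use std::ops::Index;

struct Foo {
    name: String,
}

// The newtype says: this usize indexes into a Vec<Foo>.
#[derive(Clone, Copy)]
struct FooIdx(usize);

impl Index<FooIdx> for Vec<Foo> {
    type Output = Foo;

    fn index(&self, idx: FooIdx) -> &Foo {
        &self[idx.0]
    }
}
```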
It’s insightful to study why this impl is allowed. In Rust, types, traits and impls are separate. This creates room for a problem: what if there are two impl blocks for a given (trait, type) pair? The obvious choice is to forbid having two impls in the first place, and this is what Rust does.
+
Actually enforcing this restriction is tricky! The simplest rule of “error if the set of crates currently compiled contains duplicate impls” has severe drawbacks. First of all, this is a global check, which requires the knowledge of all compiled crates. This postpones the check until the later stages of compilation. It also plays awfully with dependencies, because two completely unrelated crates might fail the compilation if present simultaneously. What’s more, it doesn’t actually solve the problem, because the compiler does not necessarily know the set of all crates beforehand. For example, you may load additional code at runtime via dynamic libraries, and silent bad things might happen if your program and a dynamic library have duplicate impls.
+
To be able to combine crates freely, we want a much stronger property:
+not only the set of crates currently compiled, but all existing and
+even future crates must not violate the one impl restriction. How on
+earth is it possible to check this? Should cargo publish look for
+conflicting impls across all of the crates.io?
+
Luckily, and this is stunningly beautiful, it is possible to loosen
+this world-global property to a local one. In the simplest form, we
+can place a restriction that impl Foo for Bar can appear either in
+the crate that defines Foo, or in the one that defines
+Bar. Crucially, whichever one defines the impl has to use the other,
+which makes it possible to detect the conflict.
+
This is all really nifty, but we’ve just defined an Index impl for Vec, and both Index and Vec are from the standard library! How is it possible? The trick is that Index has a type parameter: trait Index<Idx: ?Sized>. It is a template for a trait of sorts, and we get a “real” trait when we substitute the type parameter with a type. Because FooIdx is a local type, the resulting Index<FooIdx> trait is also considered local. The precise rules here are quite tricky; this RFC explains them pretty well.
Because Index<FooIdx> and Index<BarIdx> are different traits, one
+type can implement both of them. This is convenient for containers
+which hold distinct types:
+
+
+
It’s also helpful to define arithmetic operations and conversions for
+the newtyped indexes. I’ve put together a
+typed_index_derive crate to automate this boilerplate via a
+proc macro, the end result looks like this:
Hi! During the last couple of years, I’ve spent a lot of time writing
+parsers and parser generators, and I want to write down my thoughts
+about this topic. Specifically, I want to describe some properties of
+a parser generator that I would enjoy using. Note that this is not an
+“introduction to parsing” blog post, some prior knowledge is assumed.
+
Why do I care about this at all? The broad reason is that today a lot
+of tools and even most editors use regular expressions to
+approximately parse programming languages, and I find this outright
+b҉a͡rb̢ari͞c͘. I understand
+that in practice parsing is not as easy as it is in theory:
+
+
+
However, I do believe we could do better if we use better tools!
+
The specific reason is that I care way too much about the Rust
+programming language and
+
+
+
I think today it is the best language for writing compiler-like
+stuff (yes, better than OCaml!),
+
+
+
I’d love to see an awesome parser generator written in and
+targeting Rust,
I’ve used various parser generators, implemented one,
+fall, and still haven’t met a parser generator
+that I love.
+
The post is split into three major chapters:
+
+
+
UX — how to make using a parser generator easy, enjoyable and fun?



API — what API the generated parser should have.



Parsing Techniques — how exactly do we get from text to the parsed tree?
+
+
+
I’ll be using a rather direct and assertive language in the following,
+but the fact is I am totally not sure about anything written here, and
+would love to know more about alternatives!
Although this text is written in Emacs, I strongly believe that a
+semantic-based, reliable, and fast support from tooling is a great
+boon to learnability and productivity. A great IDE support is a must
+for a modern parser generator, and this chapter talks mostly about
+IDE-related features.
+
The most important productivity boost of a parser generator is the ability to fiddle with the grammar interactively. The UI for this might look like a three-pane view, where the grammar is in the first pane, example code to parse is in the second pane and the resulting parse tree is in the third one. Editing the first two panes should reactively update the last one. This is difficult to implement with most yacc-like parser generators; I’ll talk more about it in the next section.
+
The second most important feature is inline tests: for complex grammars it could be really hard to map from a particular rule specification to the actual code that is parsed by the rule. Having a test written alongside the rule is invaluable! The test should be just a snippet of code in the target language. The “gold” value of the parse tree for the snippet should be saved in a file alongside the grammar and should be updated automatically when the grammar changes. Having inline tests allows fitting the “three-pane UI” from the previous paragraph into two panes, because you can just use the test as your second pane.
Note that even if you write your parser by hand, you should still use such “inline tests”. To do so, write them as comments with special markers, and write a small script which extracts such comments and turns them into tests proper. Here’s an example from one experimental hand-written parser of mine. Having such examples of “what does this if parse?” greatly simplifies reading the parser’s code!
+
Here’s the list of important misc IDE features, from super important to very
+important. They are not specific to parser generators, so, if you are using a
+parser generator to implement IDE support for your language, look into these
+first!
+
+
+
Extend selection to the enclosing syntactic structure (and not just
+to a braced block). A super simple feature, but this combined with
+multiple cursors is arguably more powerful than vim’s text objects,
+and most definitely easier to use.
+
+
+
Fuzzy search of symbols in the current file/in the project: super
+handy for navigation, both more important and easier to implement
+than goto definition.
+
+
+
Precise syntax highlighting. Highlighting is not a super-important
+feature and actually works ok even with regex approximations, but
+if you already have the syntax tree, then why not use it?
+
+
+
Go to definition/find references.
+
+
+
Errors and warnings inline, with fixes if available.
+
+
+
Extract rule refactoring, pairs well with extend selection.
+
+
+
Code formatting.
+
+
+
Smart typing: indenting code on Enter, adding/removing trailing
+commas when joining/splitting lines, and in general auto magically
+fixing punctuation.
+
+
+
Code completion: although for parser generators dumb word-based
+completion tends to work OK.
I want to emphasize that most of these features are ridiculously easy to implement if you have a parse tree for your language. Take, for example, “fuzzy search of symbols in the project”. This is a super awesome feature for navigation. Basically, it is CTAGS done right: first, you parse each file (in parallel) and build a list of symbols for it. Then, as the user types, you incrementally update the changed files. Using fall, I’ve implemented this feature for Rust, and it took me three small files:
+
+
+
find_symbols.rs
+to extract symbols from a single file, 21(!) lines.
+
+
+
indxr.rs,
+a generic infra to watch files for changes and recompute the index incrementally, 155 lines.
+
+
+
symbol_index.rs
+glues the previous two together, and adds
+fst by ever-awesome BurntSushi
+on top for fuzzy search, 122 lines.
+
+
+
This is actually practical: initial indexing of rust-lang/rust repo
+takes about 30 seconds using a single core and fall’s ridiculously
+slow parser, and after that everything just works:
A small note on how to pack all this IDE functionality: make a library. That
+way, anyone could use it anywhere. For example, as a web-assembly module in the
+online version. On top of the library you could implement whatever protocol you
+like, Microsoft’s LSP, or some custom one. If you go the protocol-first way,
+using your code outside of certain editors could be harder.
Traditionally, parser generators work by allowing the user to specify
+custom code for each rule, which is then copy-pasted into the
+generated parser. This is typically used to construct an abstract
+syntax tree, but could be used, for example, to evaluate arithmetic
+expressions during parsing.
+
I don’t think this is the right API for a parser generator, though, for three reasons.

It feels like a layering violation, because it allows intermixing parsing with basically everything else. You can literally do code generation during parsing. It makes things like the lexer hack possible.
+
It would be very hard to implement reactive rendering of the parse
+tree if the result of parsing is some user-defined type.
+
Most importantly, I don’t think that producing an abstract syntax tree as the result of parsing is the right choice. The problem with an AST is that it, by definition, loses information. The most commonly lost things are whitespace and comments. While they are not important for a command-line batch compiler, they are crucial for IDEs, which work very close to the original source code. Another important IDE-specific aspect is support for incomplete code. If a function is missing a body and a closing parenthesis on the parameter list, it should still be recognized as a function. It’s difficult to support such missing pieces in a traditional AST.
+
I am pretty confident that a better API for the generated parser is to
+produce a parse tree which losslessly represents both the input text
+and associated tree structure. Losslessness is a very important
+property: it guarantees that we could implement anything in principle.
+
I’ve outlined one possible design of such lossless representation in the
+libsyntax2 RFC, the simplified
+version looks like this:
+
+
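A sketch of such a homogeneous node (the design in the RFC itself is more elaborate):

```rust
// The kind of a node: a function definition, a parameter, a comment, …
#[derive(Clone, Copy, PartialEq, Eq)]
struct SyntaxKind(u16);

struct Node {
    kind: SyntaxKind,
    // Region of the source text covered by the node, as byte offsets.
    range: (usize, usize),
    children: Vec<Node>,
}
```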
+
That is, the result of parsing is a homogeneous tree, with nodes
+having two bits of information besides the children:
+
+
+
Type of a node: is it a function definition, a parameter, a
+comment?
+
+
+
Region of the source text covered by the node.
+
+
+
A cool thing about such a representation is that every language uses the same type of syntax tree. In fall, features like extend selection are implemented once and work for all languages.
+
If you need it, you can do the conversion to AST in a separate
+pass. Alternatively, it’s possible to layer AST on top of the
+homogeneous tree, using newtype wrappers like
+
+
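For example, something along these lines (reusing Node and SyntaxKind from the sketch above; FN_DEF and NAME are made-up kinds):

```rust
const FN_DEF: SyntaxKind = SyntaxKind(1);
const NAME: SyntaxKind = SyntaxKind(2);

// A typed wrapper over an untyped node.
struct FnDef(Node);

impl FnDef {
    fn cast(node: Node) -> Option<FnDef> {
        if node.kind == FN_DEF { Some(FnDef(node)) } else { None }
    }

    fn name(&self) -> Option<&Node> {
        self.0.children.iter().find(|child| child.kind == NAME)
    }
}
```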
+
The parser generator should automatically generate such AST wrappers. However, it shouldn’t directly infer them from the grammar: not every node kind needs an AST wrapper, and method names are important. It is better to let the user specify the AST structure separately, and check that the AST and the parse tree agree. As an example from fall, here is the grammar rule for Rust paths, the corresponding ast definition, and the generated code.
Another important feature for modern parser generator is support for
+incremental reparsing, which is obviously useful for IDEs.
+
One thing that greatly helps here is the split between parser and
+lexer phases.
+
It is much simpler (and more efficient) to make lexing
+incremental. When lexing, almost any change affects at most a couple
+of tokens, so in theory incremental lexing could be pretty
+efficient. Beware though that worst-case relexing still has to be
+linear, because insertion of unclosed quote changes all the following
+tokens.
+
In contrast, it is much easier to change the tree structure significantly with a small edit, which places an upper bound on the effectiveness of incremental reparsing. Besides, making parsing incremental is more complicated, because you have to deal with trees instead of a linear structure.
+
An interesting middle ground here is an incremental lexer combined
+with a fast non-incremental parser.
Traditional lex-style lexers struggle with special cases like ML-style properly nested comments or Rust raw literals, which are not even context-free. The problem is typically solved by injecting custom code into the lexer, which maintains some sort of state, like the nesting level of comments. In my experience, making this work properly is very frustrating.
+
These two tricks may make writing a lexer simpler.

Instead of supporting lexer states and injecting custom code, allow pairing a regex, which defines a token, with a function which takes a string slice and outputs a usize. If the lexer matches such an external token, it then calls the supplied function to determine the other end of the token. Here’s an example from fall: external token, custom functions.
+
Often it is better to use layered languages instead of lexer states. Parsing string literals is a great example of this. String literals usually have some notion of a well-formed escape sequence. The traditional approach to parsing string literals is to switch to a separate lexer state after ", which handles escapes. This is bad for error recovery: if there’s a typo in an escape sequence, it should still be possible to recognize the literal correctly. So an alternative approach is to parse a string literal as, basically, “anything between two quotes”, and then use a separate lexer for escapes specifically, later in the compiler pipeline.
+
Another interesting lexing problem which arises in practice is context sensitivity: things like contextual keywords or >> can represent different token types, depending on the surrounding code. To deal with this case nicely, the parser should support token remapping. While most of the tokens appear in the final parse tree as is, the parser should be able to, for example, substitute two consecutive > tokens with a single >>, so that later stages of compilation need not handle this special case.
A nice trick to make the parser more general and fast is not to construct the parse tree directly, but to emit a stream of events like “start internal node”, “eat token”, “finish internal node”. That way, parsing itself does not allocate and, for example, you can use the stream of events to patch an existing tree, doing minimal allocations. This also divorces the parser from a particular tree structure, so it is easier to plug in different tree backends.
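A sketch of such an event stream (the names mirror the description above; SyntaxKind is as in the earlier sketch):

```rust
#[derive(Clone, Copy, PartialEq, Eq)]
struct SyntaxKind(u16);

// The parser produces a flat list of these; a separate pass turns the
// list into an actual tree (or patches an existing one).
enum Event {
    StartNode { kind: SyntaxKind },
    Token { kind: SyntaxKind, len: usize },
    FinishNode,
}
```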
+
Events also help with reshuffling the tree structure. For example, during event processing we can turn left-leaning trees into right-leaning ones or flatten them into lists. Another interesting form of tree reshuffling is attachment of comments. If a comment immediately precedes some definition, it should be a part of this definition. This is not specified by the language, but it is the result that a human would expect. With events, we can hand only the significant tokens to the parser and deal with attaching comments and whitespace when reconstructing the tree from the flat list of events.
To properly implement incremental reparsing, we should start with a data structure for text which is more efficient to update than String. While we do have quite a few extremely high-quality implementations of ropes, the ecosystem is critically missing a way to talk about them generically. That is, there’s nothing like Java’s CharSequence in Rust (which would need a much more involved design in Rust to avoid unnecessary overhead).
+
Luckily, the parse tree needs to remember only the offsets, so we can
+avoid hard-coding a particular text representation, and we don’t even
+need a generic parameter for that.
+
Homogeneous trees make reactive testing of the grammar possible in
+theory because you can always produce a text representation of a tree
+from them. But in practice reactivity requires that “read grammar,
+compile parser, run it on input” loop is fast. Literally generating
+source code of the parser and then compiling it would be too slow, so
+some kind of interpreted mode is required. However, this conflicts
+with the need to be able to extend lexer with custom code. I don’t
+know of a great solution here, but something like this would work:
+
+
+
require that all lexer extensions are specified in the verbatim
+block of the grammar file and don’t have external dependencies,
+
+
+
for IDE support, compile the lexer, and only the lexer, in a temp
+dir and communicate with it via IPC.
+
+
+
A possible alternative is to use a different, approximate lexer for
+interactive testing of the grammar. In my experience this makes such
+testing almost useless because you get different results in
+interesting cases and interesting cases are what is important for this
+feature.
+
In IDEs, a surprisingly complicated problem is managing a list of open
+and modified files, synchronizing them with the file system, providing
+consistent file-system snapshots and making sure that things like
+in-memory buffers are also possible. For parser generators, all this
+complexity might be dodged by requiring that all of the grammar needs
+to be specified in a single file.
So we want to write a parser generator that produces lossless parse
+trees and which has an awesome IDE support. How do we actually parse
+a text into a tree? Unfortunately, while there are many ways to parse
+text, there’s no accepted best one. I’ll try to do a broad survey of
+various options.
+
I’d love to discuss the challenges of the textbook approach of just
+using a context-free grammar/BNF notation. However, let’s start with a
+simpler, “solved” case: regular expressions.
+
Languages which could be described by regular expressions are called
+regular. They are exactly the same languages which could be recognized
+by finite state machines. These two definition mechanisms have nice
+properties which explain the usefulness of regular languages in real
+life:
+
+
+
Regular expressions map closely to our thinking and are easy for humans to understand. Note that there are meta-languages for describing regular languages which are equivalent in power but much less “natural”: raw finite state machines or regular grammars.
+
+
+
Finite state machines are easy for computers to execute. FSM is
+just a program which is guaranteed to use constant amount of
+memory.
+
+
+
Regular languages are rather inexpressive, but they work great for
+lexers. On the opposite side of expressivity spectrum are Turing
+machines. For them, we also have a number of meta-languages (like
+Rust), which work great for humans. It’s interesting that a Turing
+machine is equivalent to a finite state machine with a pair of stacks:
+to get two stacks from a tape, cut the tape in half where the head
+is. Moving the head then corresponds to popping from one stack and
+pushing to another.
+
And the context-free languages, which are described by CFGs, are
+exactly in between languages recognized by finite state machines and
+languages recognized by Turing machines. You need a push-down
+automaton, or a state machine with one stack, to recognize a
+context-free language.
+
CFGs are powerful enough to describe arbitrary nesting structures and
+seem to be a good fit for describing programming languages. However,
+there are a couple of problems with CFGs. Let’s write a grammar for
+arithmetic expressions with additions, multiplications, parenthesis
+and numbers. The obvious answer,
+
+
+
has a problem. It is underspecified and does not tell whether 1 + 2 * 3 is (1 + 2) * 3 or 1 + (2 * 3). We need to tweak the grammar to get rid of this ambiguity:
+
+
+
I think the necessity of such transformations is a problem! Humans don’t think
+like this: it took me three or four courses in formal grammars to really
+internalize this transformation. And if we look at language references, we’ll
+typically see a
+precedence
+table instead of BNF.
+
Another problem here is that we can’t even work around ambiguity by simply forbidding it: checking whether a CFG is unambiguous is undecidable.
+
So CFGs turn out to be much less practical and simple than regular
+expressions. What options do we have then?
The first choice is to parse something, not necessarily a context-free language. A good way to do it is to write a parser by hand. A hand-written parser is usually called a recursive descent parser, but in reality it includes two crucial techniques in addition to just recursive descent. Pure recursive descent works by translating grammar rules like T -> A B into a set of recursive functions:
+
+
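Schematically (Parser, A and B are placeholders):

```rust
struct Parser {
    // token stream, current position, etc.
}

// T -> A B
fn parse_t(p: &mut Parser) {
    parse_a(p);
    parse_b(p);
}

fn parse_a(_p: &mut Parser) { /* ... */ }
fn parse_b(_p: &mut Parser) { /* ... */ }
```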
+
The theoretical problem here is that it can’t deal with left recursion. That is, rules like Statements -> Statements ';' OneStatement make a recursive descent parser loop infinitely. In theory, this problem is solved by rewriting the grammar and eliminating the left recursion. If you had a formal grammars class, you have probably done this! In practice, this is a completely non-existent problem, because we have loops:
+
+
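That is, instead of rewriting the grammar, we just write a loop (continuing the sketch above; eat_token and SEMICOLON are hypothetical helpers):

```rust
// Statements -> Statements ';' OneStatement, without left recursion.
fn parse_statements(p: &mut Parser) {
    parse_one_statement(p);
    while p.eat_token(SEMICOLON) {
        parse_one_statement(p);
    }
}
```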
+
The next problem with recursive descent is that parsing expressions with
+precedence requires that weird grammar rewriting. Luckily, there’s a simpler
+technique to deal with expressions. Suppose you want to parse 1 + 2 * 3. One
+way to do that would be to parse it with a loop as a list of atoms separated
+by operators and then reconstruct a tree separately. If you fuse these two
+stages together, you get a loop, which could recursively call itself and nest,
+a
+Pratt parser. Understanding it for the first time is hard, but you only need to
+do it once :)
+
The most important feature of hand-written parsers is a great support
+for error recovery and partial parses. It boils down to two simple
+tricks.
+
If you are parsing a homogeneous sequence of things (i.e, you are inside the
+loop), and the current token does not look like it can begin a new element, you
+just skip over it and start the next iteration of the loop. Here’s an
+example
+from Kotlin. At
+this
+line, we’ll get null if current token could not begin a class member
+declaration.
+Here
+we just skip over it.
+
If you are parsing a particular thing T, and you expect token foo,
+but see bar, then, roughly:
+
+
+if bar is not in the FOLLOW(T), you skip over it and emit error,
+
+
+if bar is in FOLLOW(T), you emit error, but don’t skip the
+token.
+
+
+
That way, parsing something like
+
+
+
would correctly recognize the incomplete function foo (again, it’s easier to represent such an incomplete function with homogeneous parse trees than with an AST), and the complete struct S. Here’s another example from Kotlin.
+
Although hand-written parsers are good at producing high-quality error messages as well, I don’t think that this is important. In the IDE context, for syntax errors it is much more important and beneficial to get a red squiggly under the error immediately after you’ve typed invalid code. Instantaneous feedback and a precise location are, in my personal experience, enough to fix syntax errors. The error message can be just “Syntax error”; more elaborate messages often make things worse, because mapping from an error message to what is actually wrong is harder than just typing and deleting stuff and checking if it works.
+
It is possible to simplify authoring of this style of parsers by
+generating all recursive functions, loop and Pratt parsers from
+declarative BNF/PEG style description. This is what Grammar Kit and
+fall do.
Another choice is to stay within the CFG class but avoid dealing with ambiguity by producing all possible parse trees for a given input. This is typically achieved with non-determinism and memoization, using GLR- and GLL-style techniques.
+
Here I’d like to call out
+tree-sitter project, which actually
+ticks quite a few boxes outlined in this blog post. In particular, it uses
+homogeneous trees, is fully incremental and has surprisingly good support for
+error recovery (though not quite as good as hand-written style parsers, at least
+when I’ve last checked it).
Yet another choice is to give up full generality and restrict the
+parser generator to a subset of unambiguous grammars, for which we
+actually could verify the absence of ambiguity. This is how traditional
+parser generators like yacc, happy, menhir or LALRPOP work.
+
The very important advantage of these parsers is that you get a strong
+guarantee that the grammar works and does not have nasty
+surprises. The price you have to pay, though, is that sometimes it is
+necessary to tweak an already unambiguous grammar to make the stupid
+tool understand that there’s no ambiguity.
+
I also haven’t seen deterministic LR parsers with great support for error recovery, but it looks like it should be possible in theory? Recursive descent parsers, which are more or less LL(1), recover from errors splendidly, and an LR(1) parser has strictly more information than an LL(1) one.
+
So, what is the best choice for writing a parser/parser generator?
+
It seems to me that the two extremes are the most promising: hand
+written parser gives you utmost control over everything, which is
+important when you need to parse some language, not designed by you,
+which is hostile to the usual parsing techniques. On the other hand,
+classical LR-style parsers give you a proof that the grammar is
+unambiguous, which is very useful if you are creating your own
+language. Ultimately, I think that being able to produce lossless
+parse trees supporting partial parses is more important than any
+particular parsing technique, so perhaps supporting both approaches
+with a single API is the right choice?
This is a post about an interesting testing technique which feels like it should
+be well known. However, I haven’t seen it mentioned anywhere. I don’t even have
+a good name for it, I’ve semi-discovered it in the wild. If you know how this
+thing is called, please leave a comment!
I was reading the Dart analysis server source code, and came across this line. Immediately I was struck as if by lightning. Well, not exactly in the same way, but you get the idea.
+
What does this line do? I actually don’t know, but I have a guess. My
+explanation is further down (to give you a chance to discover the
+trick as well!), but the general idea is that this line helps
+tremendously with making tests more maintainable.
Two tasks which programmers typically enjoy less than furiously
+cranking out new features are maintaining existing code and writing
+tests. And, as an old Russian joke says, maintaining tests is the
+worst. Here are some pain points specific to the post:
+
Negative tests. You want to check that something does not
+happen. Writing a test in this situation is tricky because the test
+might actually pass for a trivial reason instead of the intended
+one. The rule of thumb is to verify that the test actually fails if
+the specific condition which it covers is commented out. The problem
+with this rule of thumb is that it works in a single point in time. As
+the code evolves, the test might begin to pass for a trivial reason.
+
Duplicated tests. Test suites are usually append-only and grow
+indefinitely. Almost inevitably this leads to a situation where
+different tests are testing essentially the same features, or where
+one test is a superset of another.
+
Bifurcated suites. Somewhat similar to the previous point, you may
+end up in a situation where a single component has two separate
+test-suites in different parts of the code base. I’d want to say that
+this happens when two developers write tests independently, but
+practice says that me and me one month later are enough to create such
+a mess :)
+
Test discoverability. This is a problem a new contributor usually faces. Finding the piece of code where a bug fix should be applied is usually easier than locating the corresponding tests.
+
The underlying issue is that it is non-trivial to answer these two
+questions:
+
+
+
Given a line of code, where is the test for this specific line?
+
+
+
Given a test, where is the code that is being tested?
The beautiful solution to this problem (which I hypothesise the
_coverageMarker() line in Dart does) is to track code coverage on a
+test-by-test basis. That is, when running a test, verify that
+specific lines of code were covered by this test.
+
I’ve put together a small Rust library to do this, called
+uncover. It provides two macros:
+covered_by and covers.
+
The first macro is used in the code under test, like
+this:
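Here is a sketch of how the two macros might be used (the function and label names are invented for illustration, and I assume the macros are already in scope as the crate's docs describe):

```rust
// In the code under test: record that this specific branch was executed.
fn divide(dividend: u32, divisor: u32) -> Option<u32> {
    if divisor == 0 {
        covered_by!("divide_by_zero");
        return None;
    }
    Some(dividend / divisor)
}

// In the test: the guard created by `covers!` verifies, at the end of this
// block, that the `covered_by!` line above was actually executed.
#[test]
fn test_divide_by_zero() {
    covers!("divide_by_zero");
    assert_eq!(divide(92, 0), None);
}
```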
If the block where covers is used does not cause the execution of
the corresponding covered_by line, then an error is raised at
+the end of the block.
+
Under the hood, this is implemented as a global HashMap<String, u64> which
+counts how many times each line was executed. So covered_by!
+increments
+the corresponding count, and covers! returns a guard object that
+checks
+in Drop that the count was incremented. It is possible to disable these checks
+at compile time. And yes, the library actually
+exposes
+a macro which defines macros :)
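A minimal sketch of that mechanism in plain std Rust (an illustration of the idea, not the crate's actual code):

```rust
use std::collections::HashMap;
use std::sync::{Mutex, OnceLock};

// Global map from label to the number of times it was hit.
fn counters() -> &'static Mutex<HashMap<String, u64>> {
    static MAP: OnceLock<Mutex<HashMap<String, u64>>> = OnceLock::new();
    MAP.get_or_init(|| Mutex::new(HashMap::new()))
}

// Roughly what `covered_by!("label")` could expand to: bump the counter.
fn hit(label: &str) {
    *counters().lock().unwrap().entry(label.to_string()).or_insert(0) += 1;
}

// Roughly what `covers!("label")` could expand to: remember the current count
// and check in Drop that it has grown by the end of the enclosing block.
struct CoversGuard { label: String, before: u64 }

fn expect_coverage(label: &str) -> CoversGuard {
    let before = counters().lock().unwrap().get(label).copied().unwrap_or(0);
    CoversGuard { label: label.to_string(), before }
}

impl Drop for CoversGuard {
    fn drop(&mut self) {
        let after = counters().lock().unwrap().get(&self.label).copied().unwrap_or(0);
        if after == self.before && !std::thread::panicking() {
            panic!("expected the `covered_by!({:?})` line to be executed", self.label);
        }
    }
}
```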
+
I haven’t had a chance to apply this technique in large projects (and
+it is less useful for smaller ones), but it looks very promising.
+
It’s now easy to navigate between code and tests: just ripgrep the
+string literal (or write a plugin for this for your IDE). You will be
+able to find the test for the specific if-branch! This should be
+especially handy for new contributors.
+
If this technique is used pervasively, you also get an idea about the
+overall test coverage.
+
During refactorings, you become aware of tests which might be affected. Moreover, because coverage is actually checked by the tests themselves, you'll notice if some test stops exercising the code it was intended to check.
+
Once again, if you know what this thing is called, please do enlighten
+me in comments! Discussion on /r/rust.
This is partially a mild instance of xkcd://386 with
+respect to the great don’t
+panic post by
+@vorner (yes, it’s 2 am here) and partially a
+discussion of error-handling in the framework of structured concurrency, which
+was recently popularized by @njsmith.
In the blog post, @vorner argues that unwinding sometimes may do more
+harm than good, if it manages to break some unsafe invariants,
+cross FFI boundary or put the application into an impossible state. I
+fully agree that these all are indeed significant dangers of panics.
+
However, I don’t think that just disabling unwinding and using panic
+= "abort" is the proper fix to the problem for the majority of use
+cases. A lot of programs work in a series of requests and responses
+(often implicit), and I argue that for this pattern it is desirable to
+be able to handle bugs in requests gracefully.
+
I’ve spent quite some time working on an
IDE, and, although it might not be apparent at first sight, IDEs are also based on requests/responses:
+
+
a user types a character, and the IDE updates its internal data structures
+
+
the user requests completion, and the IDE runs some calculations on the data and gives back the results
+
+
+
As IDEs are large and have a huge number of features, it is inevitable
that some not-very-important lint inspection will fail due to an index
+out of bounds access on this particular macro invocation in this
+particular project. Killing the whole IDE process would definitely be
+a bad user experience. On the other hand, just showing a non-modal
+popup “Something went wrong, would you like to submit a bug report” is
+usually only a minor irritation: errors are more common in the
+numerous “additional” features, while the smaller core tends to be
+more correct.
+
I do think that this pattern of "show an error message and chug along" is
+applicable to a significant number of applications. Of course, even in
+this setting a bug in the code can in theory have dire consequences,
+but in practice this is mitigated by the following:
+
+
+
The majority of requests are read-only and can't corrupt data.
+
+
+
The low-level implementation of write requests usually has relatively bug-free transactional semantics, so bugs in write requests which lead to transaction aborts don't corrupt data either.
+
+
+
Most applications have some kind of backup/undo functionality, and even if a bug leads to a commit of invalid data, the user can often restore a good state (of course, this works only for relatively unimportant data).
+
+
+
However, @vorner identifies a very interesting specific problem with
+unwinding which I feel we should really try to solve better: if you
+have a bunch of threads running, and one of them catches fire, what
happens? It turns out that often nothing in particular happens: some more threads might die from poisoned mutexes and closed channels, but other threads might continue, and, as a result, the application will exist in a half-dead state for an indefinite period of time.
You are right! Erlang and especially OTP behaviors are great for managing errors
at scale. However, a full actor system might be overkill if all you want is
+just an OS thread.
+
If you haven’t done this already, pack some snacks, prepare lots of coffee/tea
+and do read the structured
+concurrency
blog post. The crux of the pattern is to avoid fire-and-forget concurrency:
+
+
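A sketch of the fire-and-forget shape (illustrative only):

```rust
fn main() {
    // Fire and forget: the thread is not tied to any scope, nobody joins it,
    // and a panic inside it goes unnoticed.
    std::thread::spawn(|| {
        println!("doing some work in the background");
    });
    // `main` may well finish before the thread even gets to run.
}
```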
+
Instead, each thread should be confined to some lexical scope and
+never escape it:
+
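For example, with crossbeam's scoped threads (a minimal sketch; std's std::thread::scope has the same shape):

```rust
fn main() {
    let greeting = String::from("hello from a scoped thread");
    crossbeam::scope(|s| {
        s.spawn(|_| {
            // The thread can borrow from the enclosing stack frame,
            // because it cannot outlive the scope.
            println!("{}", greeting);
        });
        // Every thread spawned on `s` is joined at the end of the scope.
    })
    .unwrap();
}
```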
+
+
The benefit of this organization is that all threads form a tree,
+which gives you greater control, because you know for sure which parts
+are sequential and which are concurrent. Concurrency is explicitly
+scoped.
And we have a really, really interesting API design problem if we
+combine structured concurrency and unwinding. What should be the
+behavior of the following program?
+
+
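Something along these lines (a sketch, again using crossbeam's scoped threads):

```rust
use std::{thread, time::Duration};

fn main() {
    crossbeam::scope(|s| {
        s.spawn(|_| {
            panic!("this thread dies almost immediately");
        });
        s.spawn(|_| {
            // ... a long-running computation ...
            thread::sleep(Duration::from_secs(60 * 60));
        });
    })
    .unwrap();
}
```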
+
Now, for crossbeam specifically there’s little choice here due to
+the boring requirement for memory safety. But let’s pretend for now
+that this is a garbage collected language.
+
So, we have two concurrent threads in a single scope, one of which is
+currently running and another one is, unfortunately, dead.
+
The most obvious choice is to wait for the running thread to finish
+(we don’t want to let it escape the scope) and then to reraise the
+panic at scope exit. The problem with this approach is that there’s a
+potentially unbounded window between the instant the panic is created,
+and its propagation.
+
This is not a theoretical concern: some time ago a friend of mine had
+a fascinating debugging session with a Python machine learning
+application. The program was processing a huge amount of data, so, to
+speed things up, it partitioned the data and spawned a thread per
+partition (actual processing was in native code, so GIL was avoided):
+
+
+
The observed behavior was that a single thread died, but no exception or stack trace was printed anywhere. This was because the executor
+was waiting for all other threads before propagating the
+exception. Although technically the exception was not lost, in
+practice you’d have to wait for several hours to actually see it!
+
The Trio library uses an
+interesting
+refinement of this strategy: when one of the tasks in scope fails, all others
are immediately cancelled and then awaited. I think this should work well
+for Trio, because it has first-class support for cancellation; any async
+operation is a cancellation point. So all children tasks will be cancelled in a
+timely manner, although I wouldn’t be surprised if there are some pathological
+cases where exception propagation is delayed.
+
Unfortunately, this solution doesn't work for native threads, because
+there are just no good cancellation points. And I don’t know of any
+approach that would work :(
+
One vague idea I have is inspired by the handling of orphaned processes in Unix: if a thread in a scope dies, the scope is torn down immediately, and all the still-running threads are attached to the value that is thrown. If anyone wants to handle the failure, they must wait for all attached threads to finish first. This way, the initial panic and all in-progress threads could be propagated to the top-level
+init scope, which then can attempt either a clean exit by waiting
+for all children, or do a process::abort.
+
However this attachment to the parent violates the property that a
+thread never leaves its original scope. Because crossbeam relies on
+this property for memory safety, this approach is just not applicable
+for threads which share stack data.
+
It’s already 4 am here, so I really should be wrapping the post up :)
+So, a challenge: design a Rust library for scoped concurrency based on
+native OS threads that:
+
+
never loses a thread or a panic,
+
+
+immediately propagates panics,
+
+
allows (optionally?) sharing stack data between the threads.
+
I've spent years looking for a good tool to make slides. I've tried LaTeX Beamer, Google Docs, Slides.com and several reveal.js offsprings, but none of them was satisfactory for me. Last year, I stumbled upon Asciidoctor.js PDF (which had something like three GitHub stars at that moment), and it is perfect.
+
At least, it is perfect for my use case, your requirements might be different.
+I make presentations for teaching programming at Computer Science Center, so my slides are full of code, bullet lists, and sometimes have moderately complex layout.
To make reviewing course material easier, slides need to have high information density.
+
If you want to cut down straight to the code, see the repository with slides for my Rust course:
My requirements are:

A source markup language: I like to keep my slides on GitHub
+
+
+Ease of styling and layout.
A good test here is a two-column layout with a code snippet on the left and a bullet list on the right.
+
+
+The final output should be a PDF.
I don't use animations, but I need the slides to look exactly the same on different computers.
+
+
+
All the tools I’ve tried don’t quite fit the bill.
+
While TeX is good for formatting formulas, LaTeX is a relatively poor language for describing the structure of a document. The awesome Emacs mode fixes the issue partially, but still, \begin{itemize} is way too complex for a bullet list. Additionally, the quality of implementation is not perfect: unicode support needs opt-in, and the build process is fiddly.
+
Google Docs and Slides.com are pretty solid choices if you want WYSIWYG. In fact, I primarily used these two tools before AsciiDoctor. However, WYSIWYG and the limited flexibility that comes with it are significant drawbacks.
+
I think I’ve never made a serious presentation in any of the JavaScript presentation frameworks.
+I’ve definitely tried reveal.js, remark and shower, but turned back to Google Docs in the end.
+The two main reasons for this were:
+
+
+Less than ideal source language:
+
+
+if it is Markdown, I struggled with creating complex layouts like the two column one;
+
+
+if it is HTML, simple things like bullet lists or emphasis are hard.
+
+
+
+
+Cross browser CSS.
+These frameworks pack a lot of JS and CSS, which I don’t really need, but which makes tweaking stuff difficult for me, as I am not a professional web developer.
+
The killer feature behind Asciidoctor.js PDF is the AsciiDoc markup language.
+Like Markdown, it’s a lightweight markup language.
+When I was translating this blog from .md to .adoc the only significant change in the syntax was for links, from
+
+
+
to
+
+
+
However, unlike Markdown and LaTeX, AsciiDoc has native support for a rich hierarchical document model. AsciiDoc source is parsed into a tree of nested elements with attributes (historically, AsciiDoc was created as an easier way to author DocBook XML). This allows expressing complex document structure without ad-hoc syntax extensions. Additionally, the concrete syntax feels very orthogonal and well rounded.
+We’ve seen the syntax for links before, and this is how one includes an image:
+
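With a made-up file name, it looks like this:

```asciidoc
image::images/ferris.png[]
```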
+
+
Or a snippet from another file:
+
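The file name and line range here are invented for illustration:

```asciidoc
include::code/main.rs[lines=1..10]
```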
+
+
A couple more examples, just to whet your appetite (Asciidoctor has extensive documentation):
+
+
+
+
This is a paragraph
+
This is a paragraph with an attribute (which translates to CSS class)
+
+
+
+
+
+
+
This is a bullet list
+
+
+
Bullet with table (+ joins blocks)
+
+
+
Are tables in lists stupid?
+
Probably!
+
+
+
+
+
+
+
+
+
+
That is, in addition to the usual syntax highlighting, the &xs[0] bit is wrapped into a <span class="hl-error">.
+This can be used to call out specific bits of code, or, like in this case, to show compiler errors:
+
Here’s an example of a complex slide:
+
+
+
+
+.two-col sets the css class for two-column flex layout.
+
+
+[.language-rust] sets css class for inline <code> element, so mut gets highlighted.
+
+
+This bullet-point contains a longer snippet of code.
+
+
+Have you noticed these circled numbered callouts? They are another useful feature of AsciiDoc!
+
AsciiDoc markup language is a powerful primitive, but how do we turn it into pixels on the screen?
+The hard part of making slides is laying out the contents: breaking paragraphs in lines, aligning images, arranging columns.
As was pointed out by the Asciidoctor maintainer, browsers are extremely powerful layout engines, and HTML + CSS is a decent way to describe the layout.
+
And here’s where Asciidoctor.js PDF comes in: it allows one to transform AsciiDoc DOM into HTML, by supplying a functional-style visitor.
+This HTML is then rendered to PDF by chromium (but you can totally use HTML slides directly if you like it more).
+
Here’s the visitor which produces the slides for my Rust course:
In contrast to reveal.js, I have full control over the resulting HTML and CSS.
+As I don’t need cross browser support or complex animations, I can write a relatively simple modern CSS, which I myself can understand.
Note that Asciidoctor.js PDF is a relatively new piece of technology (although the underlying Asciidoctor project is very mature).
For this reason, I just vendor a specific version of the tool for my slides.
+
Because the intermediate result is HTML, the development workflow is very smooth.
+It’s easy to make a live preview with a couple of editor plugins, and you can use browser’s dev-tools to debug CSS.
+I’ve also written a tiny bit of JavaScript to enable keyboard navigation for slides during preview.
+Syntax highlighting is also a bespoke pile of regexes :-)
+
One thing I am worried about is the depth of the stack of technologies of Asciidoctor.js PDF.
+
+
+Original AsciiDoc tool was written in Python.
+
+
+Asciidoctor is a modern enhanced re-implementation in Ruby.
+
+
Asciidoctor.js PDF runs on NodeJS via Opal, a Ruby -> JavaScript compiler
+
+
+It is used to produce HTML which is then fed into chromium to produce PDF!
+
+
+
Oh, and syntax highlighting on this blog is powered by pygments, so Ruby calls into Python!
+
This is quite a Zoo, but it works reliably for me!
Course slides are available under CC-BY at https://github.com/matklad/rust-course.
+See the sibling post if you want to learn more about how the slides were made
+(TL;DR: Asciidoctor is better than beamer, Google Docs, slides.com, reveal.js, remark).
+
High-quality recordings of lectures are available on YouTube:
Teaching is hard, but very rewarding.
+Teaching Rust feels especially good because the language is very well designed and the quality of the implementation is great.
+Overall, I don’t feel like this was a particularly hard course for the students.
+In the end most of the folks successfully completed all assignments, which were fairly representative of the typical Rust code.
There was one extremely hard topic and one poorly explained topic.
+
The hard one was the module system.
+Many students were completely stumped by it.
+It’s difficult to point out the specific hard aspect of the current (Rust 2018) module system: each student struggled in their own way.
+
Here’s a selection of points of confusion:
+
+
+you don’t need to wrap contents of foo.rs in mod foo { ... }
+
+
+you don’t need to add mod lib; to main.rs
+
+
+child module lives in the parent/child.rs file, unless the parent is lib.rs or main.rs
+
+
+
I feel like my explanation of modules was an OK one, it contained all the relevant details and talked about how things work under the hood.
+However, it seems like just explaining the modules is not enough: one really needs to arrange a series of exercises about modules, and make sure that all students successfully pass them.
+
I don’t think that modules are the hardest feature of the language: advanced lifetimes and unsafe subtleties are more difficult.
+However, you don’t really write mem::transmute or HRTB every day, while you face modules pretty early.
+
The poorly explained topic was Send/Sync.
I was like "the compiler infers Send/Sync automatically, and after that your code just fails to compile if it would have a data race, isn't Rust wonderful?".
+But this misses the crucial point: in generic code (both for impl T and dyn T), you’ll need to write : Sync bounds yourself.
+Of course the homework was about generic code, and there were a number of solutions with (unsound) unsafe impl<T> Sync for MyThing<T> :-)
It’s very hard to google Rust documentation at the moment, because google links
+you to redirect stubs of the old book, which creates that weird feeling that you
+are inside of a science-fiction novel.
+I know that the problem is already fixed, and we just need to wait until the new version of the old book is deployed, but I wish we could have fixed it earlier.
+
Editions are a minor annoyance as well. I’ve completely avoided talking about Rust 2015, hoping that I’ll just teach the shiny new thing.
+But of course students google for help and get outdated info.
+
+
+many used extern crate syntax
+
+
+dyn in dyn T was sometimes omitted
+
+
+there was a couple of mod.rs
+
+
+
Additionally, several students somehow ended up with edition = "2015" in Cargo.toml.
Over time I have accumulated a number of tricks and hacks that make “linux desktop” more natural for me.
+Today I’ve discovered another one: a way to minimize Firefox on close.
+This seems like a good occasion to write about things I’ve been doing!
I’ve never understood the appeal of multiple desktops, tiling window managers,
+or Mac style “full screen window is outside of your desktop”.
They do let you neatly organize several applications at once, but I never need an overview of all applications.
+What I need most of the time is switching to a specific application, like a browser.
+
Windows has a feature that fits this workflow perfectly.
+If you pin an application to start menu, then win + number will launch or focus that app.
+That is, if the app is already running, its window will be raised and focused.
+
For some reason, this is not available out of the box in any of the Linux window
+managers I’ve tried. What is easy is binding launching an application to a
shortcut, but I rarely use more than one instance of Firefox!
+
Luckily, jumpapp is exactly what is needed
+to implement this properly.
+
I use Xbindkeys for global
+shortcuts, with the following config:
+
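I won't reproduce the whole config; a minimal sketch of the xbindkeys format, with commands and keys that are only examples:

```
# ~/.xbindkeysrc: launch-or-focus Firefox on F1, a terminal on F2
"jumpapp firefox"
  F1

"jumpapp kitty"
  F2
```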
+
+
Note that I bind F? keys without any modifiers: these keys are rarely used
+by applications and are very convenient for personal use.
I’ve always liked Quake-style terminals, which you can bring to front with a
+single keypress.
+For this reason, I was stuck with
+yakuake for a really long time.
+
jumpapp allows me to use any terminal in
+this fashion, so now I use full screen kitty.
Because switching windows/applications is easy for me, I typically look at a single maximized window.
+However, sometimes I like to have two windows side-by-side, for example an editor and a browser with preview.
A full-blown tiling window manager would be overkill for this use case, but another Windows feature comes in handy.
+In Windows, Win + ← and Win + → tiles active window to the left and right side of the screen.
+Luckily, this is a built in feature in most window managers, including KWin and Openbox (the two I use the most).
This one is tricky!
+On one hand, because I use one maximized window at a time, I feel comfortable with smaller displays.
+I was even disappointed with a purchase of external display for my laptop: turns out, bigger screen doesn’t really help me!
+On the other hand, I really like when all pixels I have are utilized fully.
+
I’ve tried to work in full screen windows, but that wasn’t very convenient for two reasons:
+
+
+Tray area is useful for current time, other status information, and notifications.
+
+
+Full screen doesn’t play well with jumpapp window switching.
+
+
+
After some experiments, I’ve settled with the following setup:
+
+
+
Use of maximized, but not full screen windows.
+
+
+
When a window is maximized, its borders and title bar are hidden. To do this in KWin, add the following to ~/.config/kwinrc:
+
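To the best of my knowledge, the relevant setting is BorderlessMaximizedWindows in the [Windows] section (double-check against your KWin version):

```ini
[Windows]
BorderlessMaximizedWindows=true
```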
+
+
+
+
To still have the ability to close/minimize the window with the mouse, I use the Active Window Menu Plasmoid. It packs the window title and close/maximize/minimize buttons into the desktop panel, without spending extra pixels:
+
+
+
+
+
Another thing I've noticed is that I look at the bottom part of the screen much more often.
+For this reason, I move desktop panel to the top.
+You can imagine how inconvenient Mac’s dock is for me: it wastes so many pixels in the most important area of the display :-)
After several years of using Emacs and a number of short detours into Vim-land, I grew a profound dislike for the arrow keys.
+It’s not that they make me slower: they distract me because I need to think about moving my hands.
+
For a long time I've tried to banish arrow keys from my life by making every
+application understand ctrl+b, ctrl+f and the like.
+But that was always a whack-a-mole game without a chance to win.
+
A much better approach is Home Row Computing.
I rebind, at a low level, CapsLock + i/j/k/l to the arrow keys.
+This works in every app.
+It also works with alt and shift modifiers.
+
I use xkbcomp with this config to set this up.
+I have no idea how this actually works :-)
I used to pile up everything on the desktop.
But now my desktop is completely empty, and I enjoy an uncluttered view of
+The Hunters in the Snow
+every time I boot my laptop.
+
The trick is to realize that accreting “junk” files is totally normal, and
+“just don’t put garbage on desktop” is not a solution.
+Instead, one can create a dedicated place for hoarding.
+
I have two of those:
+
+
+~/downloads which I remove automatically on every reboot
+
+
+~/tmp which I rm -fr ~/tmp manually once in a while
+
I used to use Zsh with a bunch of plugins, hoping that I'd learn bash this way.
+I still google “How to if in bash?” every single time though.
+
For this reason, I’ve switched to fish with mostly default config.
+The killer feature for me is autosuggestions: completion of the commands based on the history.
+Zsh has something similar, via a plugin, but this crucial feature works in fish out of the box.
+
One slightly non-standard thing I do is a two-line prompt that looks like this:
+
+
+
Two line prompts are great! You can always see a full working directory, and commands are always visually in the same place.
+Having current time in the prompt is also useful in case you run a long command and forget to time it.
I don’t use a lot of desktop apps, but I keep a browser with at least five tabs for different messaging apps.
+By the way, Tree Style Tab is the best tool for taming modern “apps”!
+
The problem with this is that I automatically Alt+F4 Firefox once I am done with it, but launching it every time is slow.
Ideally, I want to minimize it on close, just as I do with qBittorrent and Telegram.
+Unfortunately, there’s no built-in feature for this in Firefox.
+
I once tried to build it with Xbindkeys and Xdotool.
The idea was to intercept Alt+F4 and minimize the active window if it is Firefox.
+That didn’t work too well: to close all other applications, I tried to forward Alt+F4, but that recursed badly :-)
+
Luckily, today I’ve realized that I can write a KWin script for this!
+This turned out to be much harder than anticipated, because the docs are thin and setup is fiddly.
+
This
+post was instrumental for me to figure this stuff out. Thanks Chris!
+
I’ve created two files:
+
+
+
+
+
After that, I've ticked a box in front of Smart Close Window in System Settings › Window Management › KWin Scripts and added a shortcut in System Settings › Shortcuts › Global Shortcuts › System Settings. The last step took a while to figure out: although it looks like we set the shortcut in the script itself, that doesn't actually work for some reason.
Finally, my life has become significantly easier since I've settled on NixOS. I had mainly used Arch and a bit of Ubuntu before, but NixOS is so much easier to control. I highly recommend checking it out!
One of my favorite blog posts about Rust is Things Rust Shipped Without by Graydon Hoare.
+To me, footguns that don’t exist in a language are usually more important than expressiveness.
+In this slightly philosophical essay, I want to tell about a missing Rust feature I especially like: constructors.
Constructors are typically found in Object Oriented languages.
+The job of a constructor is to fully initialize an object before the rest of the world sees it.
At first blush, this seems like a really good idea:
+
+
+You establish invariants in the constructor.
+
+
+Each method takes care to maintain invariants.
+
+
+Together, these two properties mean that it is possible to reason about the object in terms of coarse-grained invariants, instead of fine-grained internal state.
+
+
+
The constructor plays the role of the induction base here, as it is the only way to create a new object.
+
Unfortunately, there's a hole in this reasoning: the constructor itself observes the object in an inconsistent state, and that creates a number of problems.
When the constructor initializes the object, it starts with some dummy state.
+But how do you define a dummy state for an arbitrary object?
+
The easiest answer is to set all fields to default values: booleans to false, numbers to 0, and reference types to null.
+But this requires that every type has a default value, and forces the infamous null into the language.
+This is exactly the path that Java took: at the start of construction, all fields are zero or null.
+
It’s really hard to paper over this if you want to get rid of null afterwards.
+A good case study here is Kotlin.
Kotlin uses non-nullable types by default, but has to work with pre-existing JVM semantics.
+The language-design heroics to hide this fact are really impressive and work well in practice, but are unsound.
+That is, with constructors it is possible to circumvent Kotlin null-checking.
+
Kotlin’s main trick is to encourage usage of so-called “primary constructors”, which simultaneously declare a field and set it before any user code runs:
+
+
+
Alternatively, if the field is not declared in the constructor, the programmer is encouraged to immediately initialize it:
+
+
+
Trying to use a field before initialization is forbidden statically, on a best-effort basis:
+
+
+
But, with some creativity, one can get around these checks.
+For example, a method call would do:
+
+
+
As well as capturing this by a lambda (spelled { args -> body } in Kotlin):
+
+
+
Examples like these seem contorted (and they are), but I did hit similar issues
+in real code
+(Kolmogorov’s zero–one law of software engineering: in a sufficiently large code base, every code pattern exists almost surely, unless it is statically rejected by the compiler, in which case it almost surely doesn’t exist).
+
The reason why Kotlin can get away with this unsoundness is the same as with Java’s covariant arrays: runtime does null checks anyway.
+All in all, I wouldn’t want to complicate Kotlin’s type system to make the above cases rejected at compile time:
+given existing constraints (JVM semantics), cost/benefit ratio of a runtime check is much better than that of a static check.
+
What if the language doesn't have a reasonable default for every type? For example, in C++, where user-defined types are not necessarily references, one cannot just assign nulls to every field and call it a day! Instead, C++ invents a special kind of syntactic machinery for specifying the initial values of fields: initializer lists:
+
+
+
Being a special syntax, initializer lists don't interact completely flawlessly with the rest of the language. For example, it's hard to fit arbitrary statements into initializer lists, because C++ is not an expression-oriented language (which by itself is OK!).
+Working with exceptions from initializer lists needs yet another obscure language feature.
As Kotlin examples alluded, all hell breaks loose if one calls a method from a constructor.
+Generally, methods expect that this object is fully constructed and valid (adheres to invariants).
But, in Java or Kotlin, nothing prevents you from calling a method in the constructor, and that way a semi-alive object can "escape".
+Constructor promises to establish invariants, but is actually the easiest place to break them!
+
A particularly bizarre thing happens when the base class calls a method overridden in the subclass:
+
+
+
Just think about it: code for Derived runs before its constructor!
+Doing a similar thing in C++ leads to even curiouser results.
+Instead of calling the function from Derived, a function from Base will be called.
+This makes some sense, because Derived is not at all initialized (remember, we can’t just say that all fields are null).
+However, if the function in Base happens to be pure virtual, undefined behavior occurs.
Breaking invariants isn’t the only problem with constructors.
They also have a rigid signature: a fixed name and a fixed return type (the class itself). That makes constructor overloads confusing for humans.
+
+
The problem with return type usually comes up if construction can fail.
+You can’t return Result<MyClass, io::Error> or null from a constructor!
+
This is often used as an argument that C++ with exceptions disabled is not viable, and that using constructors forces one to use exceptions as well. I don't think that's a valid argument though: factory functions solve both problems, because they can have arbitrary names and can return arbitrary types. I actually find this to be an occasionally useful pattern in OO languages:
+
+
+
Make a single private constructor that accepts all the fields as arguments and just sets them.
+That is, this constructor acts almost like a record literal in Rust.
+It can also validate any invariants, but it shouldn’t do anything else with arguments or fields.
+
+
+
For public API, provide the necessary public factory functions, with
+appropriate naming and adjusted return types.
+
+
+
A similar problem with constructors is that, because they are a special kind of thing, it's hard to be generic over them. In C++, "default constructible" or "copy constructible" can't be expressed more directly than "certain syntax works".
+Contrast this with Rust, where these concepts have appropriate signatures:
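Simplified, the corresponding std traits look roughly like this:

```rust
// "Default constructible" is just a trait with a static method ...
pub trait Default {
    fn default() -> Self;
}

// ... and "copy constructible" is a method from an existing value to a new one.
pub trait Clone {
    fn clone(&self) -> Self;
}
```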
In Rust, there’s only one way to create a struct: providing values for all the fields.
+Factory functions, like the conventional new, play the role of constructors, but, crucially, don’t allow calling any methods until you have at least a basically valid struct instance on hand.
+
A perceived downside of this approach is that any code can create a struct, so there's no single place, like the constructor, to enforce invariants. In practice, this is easily solved by privacy: if a struct's fields are private, it can only be created inside its declaring module.
+Within a single module, it’s not at all hard to maintain a convention like “all construction must go via the new method”.
+One can even imagine a language extension that allows one to mark certain functions with a #[constructor] attribute, with the effect that the record literal syntax is available only in the marked functions.
+But, again, additional language machinery seems unnecessary: maintaining local conventions needs little effort.
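A sketch of the pattern (the type and names are invented for illustration):

```rust
mod email {
    pub struct Email {
        // The field is private, so the record literal `Email { .. }`
        // only works inside this module.
        address: String,
    }

    impl Email {
        // An ordinary function plays the role of the constructor: it can
        // validate, carry a descriptive name, and return an Option.
        pub fn parse(address: String) -> Option<Email> {
            if address.contains('@') {
                Some(Email { address })
            } else {
                None
            }
        }
    }
}

fn main() {
    assert!(email::Email::parse("ferris@example.com".to_string()).is_some());
    assert!(email::Email::parse("not an email".to_string()).is_none());
}
```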
+
I personally think that this tradeoff looks the same for first-class contract programming in general.
+Contracts like “not null” or “positive” are best encoded in types.
+For complex invariants, just writing assert!(self.validate()) in each method manually is not that hard.
+Between these two patterns there’s little room for language-level or macro-based #[pre] and #[post] conditions.
An interesting language to look at the constructor machinery is Swift.
+Like Kotlin, Swift is a null-safe language.
+Unlike Kotlin, Swift’s null-checking needs to be sound, so it employs interesting tricks to mitigate constructor-induced damage.
+
First, Swift embraces named arguments, and that helps quite a bit with “all constructors have the same name”.
+In particular, having two constructors with the same types of parameters is not a problem:
+
+
+
Second, to solve the "constructor calls a virtual function of an object whose class hasn't come into existence yet" problem, Swift uses an elaborate two-phase initialization protocol. Although there's no special syntax for initializer lists, the compiler statically checks that the constructor's body has just the right, safe and sound, form.
+For example, calling methods is only allowed after all fields of the class and its ancestors are set.
+
Third, there’s special language-level support for failable constructors.
+A constructor can be declared nullable, which makes the result of a call to a constructor an option.
A constructor can also have a throws modifier, which works somewhat nicer with Swift's semantic two-phase initialization than with C++'s syntactic initializer lists.
+
Swift manages to plug all of the holes in constructors I am ranting about.
This comes at a price, however: the initialization chapter is one of the longest in the Swift book!
However, I can think of at least two reasons why constructors can’t be easily substituted with Rust-style record literals.
+
First, inheritance more or less forces the language to have constructors.
+One can imagine extending the record syntax with support for base classes:
+
+
+
But this won't work with the object layout of a typical single-inheritance OO language!
+Usually, an object starts with a header and continues with fields of classes, from the base one to the most derived one.
+This way, a prefix of an object of a derived class forms a valid object of a base class.
+For this layout to work though, constructor needs to allocate memory for the whole object at once.
It can't allocate just enough space for the base and then append the derived fields afterwards. But such piece-wise allocation is required if we want a record syntax where we can just specify a value for the base class.
+
Second, unlike records, constructors have a placement-friendly ABI.
+Constructor acts on the this pointer, which points to a chunk of memory which a newborn object should occupy.
Crucially, a constructor can easily pass pointers to the subobjects' constructors, allowing a complex tree of values to be created in place.
+In contrast, in Rust constructing records semantically involves quite a few copies of memory, and we are at the mercy of the optimizer here.
+It’s not a coincidence that there’s still no accepted RFC for placement in Rust!
This is a short note about yet another way to look at Rust’s unsafe.
+
Today, an interesting bug was found in rustc, which made me aware of just how useful unsafe is for making code maintainable.
+The story begins a couple of months ago, when I was casually browsing through recent pull requests for rust-lang/rust.
+I was probably waiting for my code to compile at that moment :]
+Anyway, a pull request caught my attention, and, while I was reading the diff, I noticed a usage of unsafe.
+It looked roughly like this:
+
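Reconstructed from memory rather than copied from the PR, the function had roughly this shape:

```rust
use std::ptr;

fn map_in_place<T>(t: &mut T, f: impl FnOnce(T) -> T) {
    unsafe {
        let old = ptr::read(t); // move the value out of the unique reference
        let new = f(old);       // arbitrary user code runs here!
        ptr::write(t, new);     // move the result back, without dropping the old value
    }
}
```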
+
+
This function applies a T -> T function to a &mut T value, a-la take_mut crate.
+
There is a safe way to do this in Rust, by temporarily replacing the value with something useless (Jones's trick):
+
+
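For a type with a cheap placeholder value, the trick looks roughly like this (a sketch):

```rust
use std::mem;

fn map_in_place_safe<T: Default>(t: &mut T, f: impl FnOnce(T) -> T) {
    let old = mem::replace(t, T::default()); // leave a dummy value behind
    *t = f(old); // if `f` panics, `*t` still holds the (valid) dummy
}
```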
+
In map_in_place we don’t have a T: Default bound, so the trick is not applicable.
+Instead, the function uses (unsafe) ptr::read to get an owned value out of a unique reference, and then uses ptr::write to store the new value back, without calling the destructor.
+
However, the code has a particular unsafe code smell: it calls user-supplied code (f) from within an unsafe block.
+This is usually undesirable, because it makes reasoning about invariants harder: arbitrary code can do arbitrary unexpected things.
+
+
And, indeed, this function is unsound: if f panics and unwinds, the t value would be dropped twice!
+The solution here (which I know from the take_mut crate) is to just abort the process if the closure panics.
+Stern, but effective!
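One way to implement the trick, in the spirit of take_mut (this is a sketch, not the code that actually landed in rustc):

```rust
use std::{mem, process, ptr};

fn map_in_place<T>(t: &mut T, f: impl FnOnce(T) -> T) {
    // If this guard is dropped during unwinding, kill the process
    // before the double drop can be observed.
    struct AbortOnPanic;
    impl Drop for AbortOnPanic {
        fn drop(&mut self) {
            process::abort();
        }
    }

    unsafe {
        let old = ptr::read(t);
        let guard = AbortOnPanic;
        let new = f(old);   // if this unwinds, `guard` aborts the process
        mem::forget(guard); // happy path: defuse the guard
        ptr::write(t, new);
    }
}
```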
+
I felt really torn about bringing this issue up: clearly, inside the compiler we know what we are doing, and the error case seems extremely marginal.
+Nevertheless, I did leave the comment, and the abort trick was implemented.
+
And guess what?
Today a bug report came in (#62894), demonstrating that the closure does panic in some cases, and rustc aborts.
+To be clear, the abort in this case is a good thing!
+If rustc didn’t abort, it would be a use-after-free.
+
Note how cool this is: a casual code-reviewer was able to prevent a memory-safety issue by looking at just a single one-line function.
+This was possible for two reasons:
+
+
+The code was marked unsafe which made it stand out.
+
+
+The safety reasoning was purely local: I didn’t need to understand the PR (or surrounding code) as a whole to reason about the unsafe block.
+
+
+
The last bullet point is especially interesting, because it is what makes type systems [1] in general effective in large-scale software development:
+
+
+Checking types is a local (per-expression, per-function, per-module, depending on the language) procedure.
+Every step is almost trivial: verify that sub-expressions have the right type and work out the result type.
+
+
+Together, these local static checks guarantee a highly non-trivial global property:
+during runtime, actual types of all the values match inferred static types of variables.
+
+
+
Rust’s unsafe is similar: if we verify every usage of unsafe (local property!) to be correct, then we guarantee that the program as a whole does not contain undefined behavior.
+
The devil is in the details, however, so the reality is slightly more nuanced.
+
First, unsafe should be checked by humans, thus a human-assisted type system.
+The problem with humans, however, is that they make mistakes all the time.
+
Second, checking unsafe can involve a rather large chunk of code.
+For example, if you implement Vec, you can (safely) write to its length field from anywhere in the defining module.
+That means that correctness of Deref impl for Vec depends on the whole module.
+Common wisdom says that the boundary for unsafe code is a module, but I would love to see a more precise characteristic.
+For example, in map_in_place case it’s pretty clear that only a single function should be examined.
On the other hand, if Vec's fields are pub(super), the parent module should be scrutinized as well.
+
Third, it’s trivial to make all unsafe blocks technically correct by just making every function unsafe.
+That wouldn’t be a useful thing to do though!
+Similarly, if unsafe is used willy-nilly across the ecosystem, its value is decreased, because there would be many incorrect unsafe blocks, and reviewing each additional block would be harder.
+
Fourth, and probably most disturbing, correctness of two unsafe blocks in isolation does not guarantee that they together are correct!
+We shouldn’t panic though: in practice, realistic usages of unsafe do compose.
This is a note on how to make multithreaded programs more robust.
+It’s not really specific to Rust, but I get to advertise my new jod-thread micro-crate :)
+
Let's say you've created a fresh new thread with std::thread::spawn, but haven't called JoinHandle::join anywhere in your program.
+What can go wrong in this situation?
+As a reminder, join blocks until the thread represented by handle completes successfully or with a panic.
+
First, if the main function finishes earlier, some destructors on that other thread’s stack might not run.
It's not a big deal if all that the destructors do is just freeing memory: the OS cleans up after the process exits anyway.
+However, Drop could have been used for something like flushing IO buffers, and that is more problematic.
+
Second, not joining threads can lead to surprising interference between unrelated parts of the program and in general to more chaotic behavior.
+Imagine, for example, running a test suite with many tests.
+In this situation typical “singleton” threads may accumulate during a test run.
+Another scenario is spawning helper threads when processing tasks.
+If you don’t join these threads, you might end up using more resources than there are concurrent tasks, making it harder to measure the load.
+To be clear, if you don’t call join, the thread will complete at some point anyway, it won’t leak or anything.
+But this some point is non-deterministic.
+
Third, if a thread panics in a forest and no one is around to hear it, does it make a sound?
The join method returns a Result, which is an Err if the thread has panicked.
+If you don’t join the thread, you won’t get a chance to react to this event.
So, unless you are looking at stderr at that moment, you might not realize that something is wrong!
+
+
It seems like joining the threads by default is a good idea.
+However, just calling JoinHandle::join is not enough:
+
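Roughly like this (the function name is invented for the sketch):

```rust
fn do_work() -> std::io::Result<()> {
    let handle = std::thread::spawn(|| {
        println!("background thread is running");
    });
    // ... code that can return early with `?` or panic ...
    handle.join().unwrap(); // never reached on the unhappy path
    Ok(())
}
```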
+
+
The problem is, code in … might use ? (or some other form of early return), or it can panic, and in both cases the thread won’t be joined.
+As usual, the solution is to put the “cleanup” operation into a Drop impl.
+That’s exactly what my crate, jod_thread, does!
+Note that this is really a micro crate, so consider just rolling your own join on drop.
+The value is not in the code, it’s in the pattern of never leaving a loose thread behind!
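A minimal home-grown version might look like this (a sketch, not the jod_thread code itself):

```rust
use std::thread;

struct JoinOnDrop(Option<thread::JoinHandle<()>>);

impl Drop for JoinOnDrop {
    fn drop(&mut self) {
        if let Some(handle) = self.0.take() {
            let result = handle.join();
            // Propagate the child's panic, unless we are already unwinding.
            if !thread::panicking() {
                result.unwrap();
            }
        }
    }
}

fn main() {
    let _worker = JoinOnDrop(Some(thread::spawn(|| {
        println!("doing some work");
    })));
    // Whatever happens below (early return, panic), the worker is joined
    // when `_worker` goes out of scope.
}
```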
As usual, it is instructive to contrast and compare Rust and C++.
+
In C++, std::thread has the interesting peculiarity that it terminates the process in its destructor unless you call .join (which works just like in Rust) or .detach (which says "I won't be joining this thread at all").
+In other words, C++ mandates that you explicitly choose between joining and detaching.
+Why is that?
+
It’s easy to argue that detach by default is a wrong choice for C++: it can easily lead to undefined behavior if the lambda passed to the thread uses values from parent’s stack frame.
+
Or, as Scott Meyers poetically puts it in Item 37 of Effective Modern C++ (which is probably the best book to read if you are into both Rust and C++):
+
+
This also happens to be one of my favorite arguments for “why Rust?” :)
+
The reasoning behind not making join the default is less clear cut.
The book says that join by default would be counterintuitive, but that is somewhat circular: it is surprising precisely because it is not the default.
+
In Rust, unlike C++, implicit detach can’t cause undefined behavior (compiler will just refuse the code if the lambda borrows from the stack).
+I suspect this “we can, so why not?” is the reason why Rust detaches by default.
+
However, there’s a twist!
+C++ core guidelines now recommend to always use gsl::joining_thread (which does implicit join) over std::thread in CP.25.
+The following CP.26 reinforces the point by advising against .detach() method.
+The reasoning is roughly similar to my post: detached threads make the program more chaotic, as they add superfluous degrees of freedom to the runtime behavior.
+
It’s interesting that I’ve learned about these two particular guidelines only today, when refreshing my C++ for this section of the post!
+
So, it seems like both C++ and Rust picked the wrong default for the thread API in this case. But at least C++ has official guidelines recommending the better approach.
+And Rust, … well, Rust has my blog post now :-)
Of course there isn’t one!
+Joining on drop seems to be a better default, but it brings its own problems.
+The nastiest one is deadlocks: if you are joining a thread which waits for something else, you might wait forever.
+I don’t think there’s an easy solution here: not joining the thread lets you forget about the deadlock, and may even make it go away (if a child thread is blocked on the parent thread), but you’ll get a detached thread on your hands!
+The fix is to just arrange the threads in such a way that shutdown is always orderly and clean.
+Ideally, shutdown should work the same for both the happy and panicking path.
+
I want to discuss a specific instructive issue that I’ve solved in rust-analyzer.
+It was about the usual setup with a worker thread that consumes items from a channel, roughly like this:
+
+
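A condensed sketch of the shape of that code (not the actual rust-analyzer source), using a join-on-drop handle from the jod_thread crate:

```rust
use std::sync::mpsc::channel;

fn frobnicate() {
    let (sender, receiver) = channel::<u32>();
    // Join-on-drop handle: the thread is joined when `worker` goes out of scope.
    let worker = jod_thread::spawn(move || {
        // The worker stops once the channel is closed,
        // i.e. once every `Sender` is dropped.
        for item in receiver {
            println!("processing {}", item);
        }
    });
    // ... prepare some work ...
    sender.send(92).unwrap();
    // Locals are dropped in reverse order: `worker` (which joins!) is dropped
    // before `sender`, so the join waits forever.
}
```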
+
Here, the worker thread has a simple termination condition: it stops when the channel is closed.
+However, here lies the problem: we create the channel before the thread, so the sender is dropped after the worker.
+This is a deadlock: frobnicate waits for worker to exit, and worker waits for frobnicate to drop the sender!
+
There’s a straightforward fix: drop the sender first!
+
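That is, the happy path gains an explicit drop (same sketch as above):

```rust
use std::sync::mpsc::channel;

fn frobnicate() {
    let (sender, receiver) = channel::<u32>();
    let worker = jod_thread::spawn(move || {
        for item in receiver {
            println!("processing {}", item);
        }
    });
    // ... prepare some work ...
    sender.send(92).unwrap();
    drop(sender); // close the channel first, so the worker can exit ...
    // ... and only then is `worker` dropped and joined.
}
```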
+
+
This solution, while obvious, has a pretty serious problem!
The prepare some work ... bit of code can contain early returns due to error handling, or it may panic. In both cases the result is a deadlock. What is worse, now the deadlock happens only on the unhappy path!
+
There is an elegant, but tricky fix for this. Take a minute to think about it! How would you change the above snippet so that the worker thread is guaranteed to be joined, without deadlocks, regardless of how frobnicate exits (normal termination, ?, panic)?
+
The answer will be below these beautiful Ukiyo-e prints :-)
+
+
+
+
+
First of all, the problem we are seeing here is an instance of a very general setup.
+We have a bug which only manifests itself if a rare error condition arises.
+In some sense, we have a bug in the (implicit) error handling (just like 92% of critical bugs).
The solutions here are classic:
+
+
+Artificially trigger unhappy path often (“restoring from backup every night”).
+
+
+Make sure that there aren’t different happy and unhappy paths (“crash only software”).
+
+
+
We are going to do the second one. Specifically, we'll arrange the code in such a way that the compiler automatically drops the sender before the worker, without the need for an explicit drop.
+
Something like this:
+
+
+
The problem here is that we need receiver inside the worker, but moving let (sender, receiver) up brings us back to square one.
+Instead, we do this:
+
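I don't have the exact code at hand, but the shape of the trick is to bundle the sender and the handle so that declaration order gives the right drop order (a sketch; the real rust-analyzer code differs):

```rust
use std::sync::mpsc::{channel, Sender};

struct Worker {
    // Struct fields are dropped in declaration order: the sender goes first,
    // closing the channel, so the join in `_thread`'s drop can finish.
    sender: Sender<u32>,
    _thread: jod_thread::JoinHandle<()>,
}

fn frobnicate() {
    let worker = {
        let (sender, receiver) = channel::<u32>();
        let _thread = jod_thread::spawn(move || {
            for item in receiver {
                println!("processing {}", item);
            }
        });
        Worker { sender, _thread }
    };
    // ... prepare some work: early returns and panics are fine now ...
    worker.sender.send(92).unwrap();
}
```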
+
+
Beautiful, isn’t it?
+And super cryptic: the real code has a sizable comment chunk!
+
The second big issue with join by default is that, if you have many threads in the same scope, and one of them errors, you really want to not only wait until others are finished, but to actually cancel them.
+Unfortunately, cancelling a thread is a notoriously thorny problem, which I’ve explained a bit in another post.
So, yeah, join your threads, but be on guard about deadlocks!
+Note that most of the time one shouldn’t actually spawn threads manually: instead, tasks should be spawned to a common threadpool.
+This way, physical parallelism is nicely separated from logical concurrency.
+However, tasks should generally be joined for the same reason threads should be joined.
A nice additional property of tasks is that joining the threadpool itself at the end ensures, in a single place, that no tasks are leaked.
+
A part of the inspiration for this post was the fact that I once forgot to join a thread :(
+This rather embarrassingly happened in my other post.
Luckily, my current colleague Stjepan Glavina noticed this.
+Thank you, Stjepan!
If you are finding rust-analyzer useful in your work, consider talking to management about sponsoring rust-analyzer.
+We are specifically seeking sponsorship from companies that use Rust!
There are exciting projects to improve data-processing capabilities of shells, like nushell.
+However, I personally don’t use this capability of shell a lot: 90% of commands I enter are simpler than some cmd | rg pattern.
+
I primarily use shell as a way to use my system, and it is these interactive capabilities that I find lacking.
+So I want something closer in spirit to notty.
The commands I type most are cd, exa, rm, git ..., cargo ....
+I also type mg, which launches a GUI version of Emacs with Magit:
+
+
+
These tools make me productive.
+Keyboard-only input is fast and “composable” (I can press up to see previous commands, I can copy-paste paths, etc).
+Colored character-box based presentation is very clear and predictable, I can scan it very quickly.
+
+
However, there are serious gaps in the UX:
+
+
+
ctrl+c doesn’t work as it works in every other application.
+
+
+
I launch the GUI version of Emacs: the terminal one changes some keybindings, which is confusing to me.
+For example, I have splits inside emacs, and inside my terminal as well, and I just get confused as to which shortcut I should use.
+
+
+
The output of programs is colored with escape codes, which are horrible, and not flexible enough.
+When my Rust program panics and prints that it failed in my_crate::foo::bar function, I want this to be a hyperlink to the source code of the function.
+I want to cat images and PDFs in my terminal (and html, obviously).
+
+
+
My workflow after I’ve done a bunch of changes is:
+
+
+type cargo test to launch tests
+
+
+type ctrl+shift+Enter to split the terminal
+
+
+type git status or mg in the split to start making a commit in parallel to testing
+
+
+
+
+
The last step is crazy!
+
Like, cargo test is being run by my shell (fish), the split is handled by the terminal emulator (kitty), which launches a fresh instance of fish and arranges the working directory to be saved.
+
As a user, I don’t care about this terminal/terminal emulator/shell split.
+I want to launch a program, and just type commands.
Why does cargo test block my input?
+Why can’t I type cargo test, Enter, exa -l, Enter and have this program to automatically create the split?
+
+
+
Additionally, while magit is awesome, I want an option to use such an interface for all my utilities.
+Like, for tar?
+And, when I type cargo test --package, I really want completion for the set of packages which are available in the current directory.
Isn’t it Emacs that I am trying to describe?
+Well, sort-of.
+Emacs is definitely in the same class of “application containers”, but it has some severe problems, in my opinion:
+
+
+Emacs Lisp is far from the best possible language for writing extensions.
+
+
+Plugin ecosystem is not really dependable.
+
+
+It doesn’t define out-of-process plugin API (things like hyperlinking output).
+
+
+Async support is somewhere between non-existent and awkward.
+
+
+Its main focus is text editing.
+
+
+Its defaults are not really great (fish shell is a great project to learn from here).
+
+
+ctrl+c, ctrl+v do not work by default, M-x is not really remappable.
+
A “terminals are a mess” story from today.
I wanted a "kill other split" shortcut for my terminal, bound to ctrl+k, 1.
+Implementing it was easy, as kitty has a nice plugin API.
+After that I’ve realized that I need to remap kill_line from ctrl+k to ctrl+shift+k, so that it doesn’t conflict with the ctrl+k, 1 chord.
+It took me a while to realize that searching for kill_line in kitty is futile — editing is handled by the shell.
+Ok, so it looks like I can just remap the key in fish, by bind \cK kill_line, except that, no, ctrl shortcuts do not work with Shift because of some obscure terminal limitation.
+So, let’s go back to kitty and add a ctrl+shift+k shortcut that sends ^k to the fish!
+An hour wasted.
In this post, I will be expressing strong opinions about a topic I have relatively little practical experience with, so feel free to roast and educate me in comments (link at the end of the post) :-)
+
Specifically, I’ll talk about:
+
+
+spinlocks,
+
+
+spinlocks in Rust with #[no_std],
+
+
+priority inversion,
+
+
+CPU interrupts,
+
+
+and a couple of neat/horrible systemsy Rust hacks.
+
I maintain the once_cell crate, which is a synchronization primitive.
+It uses std blocking facilities under the hood (specifically, std::thread::park), and as such is not compatible with #[no_std].
+A popular request is to add a spin-lock based implementation for use in #[no_std] environments: #61.
+
More generally, this seems to be a common pattern in Rust ecosystem:
+
+
+A crate uses Mutex or other synchronization mechanism from std
+
+
+Someone asks for #[no_std] support
+
+
Mutex is swapped for some variation of a spinlock.
+
A spinlock is the simplest possible implementation of a mutex; its general form looks like this:
+
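A minimal sketch in Rust; compare_exchange_weak plays the role of the compare-and-swap, and the numbered comments correspond to the notes below:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

pub struct SpinLock {
    locked: AtomicBool,
}

impl SpinLock {
    pub const fn new() -> SpinLock {
        SpinLock { locked: AtomicBool::new(false) }
    }

    pub fn lock(&self) {
        // (1) Repeatedly try to flip `locked` from `false` to `true`.
        while self
            .locked
            .compare_exchange_weak(false, true, Ordering::Acquire, Ordering::Relaxed)
            .is_err()
        {
            // (4) Hint to the CPU that we are busy-waiting.
            std::hint::spin_loop();
        }
        // (2) Only one thread at a time makes it past the loop.
    }

    pub fn unlock(&self) {
        // (3) Releasing the lock is a single atomic store.
        self.locked.store(false, Ordering::Release);
    }
}
```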
+
+
+
To grab a lock, we repeatedly execute compare_and_swap until it succeeds. The CPU "spins" in this very short loop.
+
+
+Only one thread at a time can be here.
+
+
+To release the lock, we do a single atomic store.
+
+
+Spinning is wasteful, so we use an intrinsic to instruct the CPU to enter a low-power mode.
+
+
+
Why we need Ordering::Acquire and Ordering::Release is very interesting, but beyond the scope of this article.
+
The key take-away here is that a spinlock is implemented entirely in user space: from OS point of view, a “spinning” thread looks exactly like a thread that does a heavy computation.
+
An OS-based mutex, like std::sync::Mutex or parking_lot::Mutex, uses a system call to tell the operating system that a thread needs to be blocked. In pseudo code, an implementation might look like this:
+
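In the same spirit, with park_this_thread and unpark_some_thread standing in for the real blocking system calls (pseudo code, not a usable mutex):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

fn lock(locked: &AtomicBool) {
    while locked
        .compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed)
        .is_err()
    {
        // Blocking system call: take this thread off the CPU until it is woken up.
        park_this_thread();
    }
}

fn unlock(locked: &AtomicBool) {
    locked.store(false, Ordering::Release);
    // System call: wake one of the threads waiting on this lock, if any.
    unpark_some_thread();
}

// Stubs so that the sketch compiles; in reality these bottom out in
// futex-like syscalls.
fn park_this_thread() {}
fn unpark_some_thread() {}
```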
+
+
The main difference is park_this_thread, a blocking system call.
+It instructs the OS to take current thread off the CPU until it is woken up by an unpark_some_thread call.
+The kernel maintains a queue of threads waiting for a mutex.
The park call enqueues the current thread onto this queue, while unpark dequeues some thread. The park system call returns when the thread is dequeued.
+In the meantime, the thread waits off the CPU.
+
If there are several different mutexes, the kernel needs to maintain several queues.
The address of a lock can be used as a token to identify a specific queue (this is the futex API).
+
System calls are expensive, so production implementations of Mutex usually spin for several iterations before calling into OS, optimistically hoping that the Mutex will be released soon.
+However, the waiting always bottoms out in a syscall.
Because spin locks are so simple and fast, it seems to be a good idea to use them for short-lived critical sections.
+For example, if you only need to increment a couple of integers, should you really bother with complicated syscalls? In the worst case, the other thread will spin just for a couple of iterations…
+
Unfortunately, this logic is flawed!
+A thread can be preempted at any time, including during a short critical section.
+If it is preempted, that means that all other threads will need to spin until the original thread gets its share of CPU again.
And, because a spinning thread looks like a good, busy thread to the OS, the other threads will spin until they exhaust their time quanta, preventing the unlucky thread from getting back on the processor!
+
If this sounds like a series of unfortunate events, don’t worry, it gets even worse. Enter Priority Inversion. Suppose our threads have priorities, and OS tries to schedule high-priority threads over low-priority ones.
+
Now, what happens if the thread that enters a critical section is a low-priority one, but competing threads have high priority?
+It will likely get preempted: there are higher priority threads after all.
And, if the number of cores is smaller than the number of high-priority threads that try to lock the mutex, it likely won't be able to complete the critical section at all: the OS will keep scheduling all the other threads!
But wait! — you would say — we only use spin locks in #[no_std] crates, so there’s no OS to preempt our threads.
+
First, it’s not really true: it’s perfectly fine, and often even desirable, to use #[no_std] crates for usual user-space applications.
+For example, if you write a Rust replacement for a low-level C library, like zlib or openssl, you will probably make the crate #[no_std], so that non-Rust applications can link to it without pulling the whole of the Rust runtime.
+
Second, if there’s really no OS to speak about, and you are on the bare metal (or in the kernel), it gets even worse than priority inversion.
+
On bare metal, we generally don’t worry about thread preemption, but we need to worry about processor interrupts. That is, while the processor is executing some code, it might receive an interrupt from some peripheral device and temporarily switch to the interrupt handler’s code.
+
And here comes the disaster: if the main code is in the middle of the critical section when the interrupt arrives, and if the interrupt handler tries to enter the critical section as well, we get a guaranteed deadlock!
+There’s no OS to switch threads after a time slice expires.
+Here are Linux kernel docs discussing this issue.
Let’s trigger priority inversion!
+Our victim is the getrandom crate.
+I don’t pick on getrandom specifically here: the pattern is pervasive across the ecosystem.
+
The crate uses spinning in the LazyUsize utility type:
+
+
+
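I won't paste the crate's code verbatim; a simplified sketch of the pattern (names and sentinel constants are mine) looks like this:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

const UNINIT: usize = usize::MAX;
const IN_PROGRESS: usize = usize::MAX - 1;

pub struct LazyUsize(AtomicUsize);

impl LazyUsize {
    pub const fn new() -> LazyUsize {
        LazyUsize(AtomicUsize::new(UNINIT))
    }

    pub fn get_or_init(&self, init: impl FnOnce() -> usize) -> usize {
        match self.0.compare_exchange(UNINIT, IN_PROGRESS, Ordering::Acquire, Ordering::Acquire) {
            // We won the race: run the (potentially slow) initializer.
            Ok(_) => {
                let val = init();
                self.0.store(val, Ordering::Release);
                val
            }
            // Somebody else is initializing right now: spin until they finish.
            Err(mut val) => {
                while val == IN_PROGRESS {
                    std::hint::spin_loop();
                    val = self.0.load(Ordering::Acquire);
                }
                val
            }
        }
    }
}
```
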
There’s a static instance of LazyUsize which caches the file descriptor for /dev/random.
This descriptor is used when calling getrandom — the only function exported by the crate.
+
To trigger priority inversion, we will create 1 + N threads, each of which will call getrandom::getrandom.
+We arrange it so that the first thread has a low priority, and the rest are high priority.
+We stagger threads a little bit so that the first one does the initialization.
+We also make creating the file descriptor slow, so that the first thread gets preempted while in the critical section.
The setup uses a couple of systems programming hacks to make this disaster scenario easy to reproduce.
+To simulate slow /dev/random, we want to intercept the poll syscall getrandom is using to ensure that there’s enough entropy.
+We can use strace to log system calls issued by a program.
+I don’t know if strace can be used to make a syscall run slow (now, once I’ve looked at the website, I see that it can in fact be used to tamper with syscalls, sigh), but we actually don’t need to!
+getrandom does not use the syscall directly; it uses the poll function from libc.
+We can substitute this function by using LD_PRELOAD, but there’s an even simpler way!
+We can trick the static linker into using a function which we define ourselves:
+
+
+
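Something along these lines (a sketch assuming the libc crate; the actual reproduction may differ in details):

```rust
use std::{thread, time::Duration};

// Our own `poll` with C linkage: the linker resolves getrandom's call to libc's
// poll to this function instead. Sleeping simulates a slow /dev/random.
#[no_mangle]
pub extern "C" fn poll(
    fds: *mut libc::pollfd,
    nfds: libc::nfds_t,
    _timeout: libc::c_int,
) -> libc::c_int {
    thread::sleep(Duration::from_millis(500));
    unsafe {
        if !fds.is_null() && nfds > 0 {
            (*fds).revents = libc::POLLIN; // pretend the descriptor is ready
        }
    }
    1
}
```
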
The name of the function accidentally ( :) ) clashes with a well-known POSIX function.
+
However, this alone is not enough.
+getrandom tries to use the getrandom syscall first, and that code path does not use a spin lock.
+We need to fool getrandom into believing that the syscall is not available.
+Our extern "C" trick wouldn’t have worked if getrandom literally used the syscall instruction.
+However, as inline assembly (which you need to issue a syscall manually) is not available on stable Rust, getrandom goes via the syscall function from libc.
+That function we can override with the same trick.
+
However, there’s a wrinkle!
+Traditionally, libc API used errno for error reporting.
+That is, on a failure the function would return a single specific invalid value and set the errno thread-local variable to the specific error code. syscall follows this pattern.
+
The errno interface is cumbersome to use.
+The worst part of errno is that the specification requires it to be a macro, and so you can only really use it from C source code.
+Internally, on Linux the macro calls the __errno_location function to get the thread local, but this is an implementation detail (which we will gladly take advantage of, in this land of reckless systems hacking!). The irony is that the Linux syscall ABI just returns error codes directly, so libc has to do some legwork to adapt them to the awkward errno interface.
+
So, here’s a strong contender for the most cursed function I’ve written so far:
+
+
+
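My reconstruction of it would look roughly like this (again assuming the libc crate; __errno_location is a glibc implementation detail):

```rust
// The real libc `syscall` is variadic; we define a non-variadic version and rely
// on the C calling convention to simply ignore the extra arguments.
#[no_mangle]
pub extern "C" fn syscall(_number: libc::c_long) -> libc::c_long {
    unsafe {
        // "There is no such syscall", says errno.
        *libc::__errno_location() = libc::ENOSYS;
    }
    -1
}
```
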
It makes getrandom believe that there’s no getrandom syscall, which causes it to fall back to the /dev/random implementation.
+
To set thread priorities, we use thread_priority crate, which is a thin wrapper around pthread APIs.
+We will be using real time priorities, which require sudo.
+
And here are the results:
+
+
+
Note that I had to kill the program after two minutes.
+Also note the impressive system time, as well as the load average:
+
+
+
If we patch getrandom to use std::sync::Once instead, we get a much better result:
+
+
+
+
+Note how real is half a second, but user and sys are small.
+That’s because we are waiting for 500 milliseconds in our poll
+
+
+
This is because Once uses OS facilities for blocking, and so OS notices that high priority threads are actually blocked and gives the low priority thread a chance to finish its work.
First, if you only use a spin lock because “it’s faster for small critical sections”, just replace it with a mutex from std or parking_lot.
+They already do a small number of spinning iterations before calling into the kernel, so they are as fast as a spinlock in the best case, and infinitely faster in the worst case.
+
Second, it seems like most problematic uses of spinlocks come from one-time initialization (which is exactly what my once_cell crate helps with). I think it is usually possible to get away without using spinlocks. For example, instead of storing the state itself, the library may just delegate state storing to the user. For getrandom, it can expose two functions:
+
+
+
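In signatures, something like this (hypothetical names and error types, not the crate's actual API):

```rust
pub struct RandomState {
    // For the /dev/random code path this could be a cached file descriptor.
    fd: std::os::unix::io::RawFd,
}

// Performs the (potentially slow, potentially blocking) one-time setup.
pub fn init() -> std::io::Result<RandomState> {
    todo!("open /dev/random, wait for entropy, etc.")
}

// Fills `buf` with random bytes using an already initialized state.
pub fn getrandom(state: &RandomState, buf: &mut [u8]) -> std::io::Result<()> {
    let _ = (state, buf);
    todo!("read from state.fd into buf")
}
```
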
It then becomes the user’s problem to cache RandomState appropriately.
+For example, std may continue using a thread local (src) while rand, with std feature enabled, could use a global variable, protected by Once.
+
Another option, if the state fits into usize and the initializing function is idempotent and relatively quick, is to do a racy initialization:
+
+
+
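A sketch of the idea (the names are made up):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// 0 is reserved to mean "not initialized yet".
static CACHE: AtomicUsize = AtomicUsize::new(0);

fn get(init: impl Fn() -> usize) -> usize {
    let mut val = CACHE.load(Ordering::Relaxed);
    if val == 0 {
        // Several threads may race here; because `init` is idempotent and cheap,
        // it is fine for each of them to compute and store an equivalent value.
        val = init();
        CACHE.store(val, Ordering::Relaxed);
    }
    val
}
```
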
Take a second to appreciate the absence of unsafe blocks and cross-core communication in the above example!
+At worst, init will be called as many times as there are cores (EDIT: this is wrong, thanks to /u/pcpthm for pointing this out!).
+
There’s also a nuclear option: parametrize the library by blocking behavior, and allow the user to supply their own synchronization primitive.
+
Third, sometimes you just know that there’s only a single thread in the program, and you might want to use a spinlock just to silence those annoying compiler errors about static mut.
+The primary use case here I think is WASM. A solution for this case is to assume that blocking just doesn’t happen, and panic otherwise. This is what std does for Mutex on WASM, and what is implemented for once_cell in this PR: #82.
(at least on commodity desktop Linux with stock settings)
+
This is a followup to the previous post about spinlocks.
+The gist of the previous post was that spinlocks have some pretty bad worst-case behaviors, and, for that reason, one shouldn’t blindly use a spinlock if using a sleeping mutex or avoiding blocking altogether is cumbersome.
+
In the comments, I was pointed to this interesting article, which made me realize that there’s another misconception:
+
+
Until today, I haven’t benchmarked any mutexes, so I don’t know for sure.
+However, what I know in theory about mutexes and spinlocks makes me doubt this claim, so let’s find out.
I do understand why people might think that way though.
+The simplest mutex just makes lock / unlock syscalls when entering and exiting a critical section, offloading all synchronization to the kernel.
+However, syscalls are slow and so, if the length of critical section is smaller than the length of two syscalls, spinning would be faster.
+
It’s easy to eliminate the syscall on entry in an uncontended state.
+We can try to optimistically CAS lock to the locked state, and call into kernel only if we failed and need to sleep.
+Eliminating the syscall on exit is trickier, and so I think historically many implementations did at least one syscall in practice.
+Thus, mutexes were, in fact, slower than spinlocks in some benchmarks.
+
However, modern mutex implementations avoid all syscalls if there’s no contention.
+The trick is to make the state of the mutex an enum: unlocked, locked with some waiting threads, locked without waiting threads.
+This way, we only need to call into the kernel if there are in fact waiters.
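
A sketch of the state machine (a toy model; the actual futex calls are omitted):

```rust
use std::sync::atomic::{AtomicU8, Ordering};

const UNLOCKED: u8 = 0;
const LOCKED: u8 = 1;         // locked, nobody is waiting
const LOCKED_WAITERS: u8 = 2; // locked, some threads are parked in the kernel

pub struct Mutex {
    state: AtomicU8,
}

impl Mutex {
    pub fn lock(&self) {
        // Uncontended fast path: a single CAS, no syscall.
        if self
            .state
            .compare_exchange(UNLOCKED, LOCKED, Ordering::Acquire, Ordering::Relaxed)
            .is_ok()
        {
            return;
        }
        self.lock_contended(); // set LOCKED_WAITERS and futex-wait, omitted
    }

    pub fn unlock(&self) {
        // Only call into the kernel if somebody is actually parked.
        if self.state.swap(UNLOCKED, Ordering::Release) == LOCKED_WAITERS {
            self.wake_one(); // futex-wake, omitted
        }
    }

    fn lock_contended(&self) {
        unimplemented!()
    }
    fn wake_one(&self) {
        unimplemented!()
    }
}
```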
+
Another historical benefit of spinlocks is that they are smaller in size.
+The state of a spinlock is just a single boolean variable, while for a mutex you also need a queue of waiting threads. But there’s a trick to combat this inefficiency as well.
+We can use the address of the boolean flag as a token to identify the mutex, and store non-empty queues in a side table.
+Note how this also reduces the (worst case) total number of queues from the number of mutexes to the number of threads!
+
So a modern mutex, like the one in WTF::ParkingLot, is a single boolean, which behaves more or less like a spinlock in an uncontended case but doesn’t have pathological behaviors of the spinlock.
Our hypothesis is that mutexes are faster, so we need to pick a workload which favors spinlocks.
+That is, we need to pick a very short critical section, and so we will just be incrementing a counter (1).
+
This is better than doing a dummy lock/unlock.
+At the end of the benchmark, we will assert that the counter is indeed incremented the correct number of times (2).
+This has a number of benefits:
+
+
+This is a nice smoke test which at least makes sure that we haven’t made an off-by-one error anywhere.
+
+
+As we will be benchmarking different implementations, it’s important to verify that they indeed give the same answer! More than once I’ve made some piece of code ten times faster by accidentally eliminating some essential logic :D
+
+
+We can be reasonably sure that the compiler won’t outsmart us and won’t remove empty critical sections.
+
+
+
Now, we can just make all the threads hammer a single global counter, but that would only test a situation of extreme contention.
+We need to structure the benchmark in a way that allows us to vary the contention level.
+
So instead of a single global counter, we will use an array of counters (3).
+Each thread will be incrementing random elements of this array.
+By varying the size of the array, we will be able to control the level of contention.
+To avoid false sharing between neighboring elements of the array we will use crossbeam’s CachePadded.
+To make the benchmark more reproducible, we will vendor a simple PRNG (4), which we seed manually.
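
A sketch of the benchmark skeleton (simplified: it assumes crossbeam_utils for CachePadded and hard-codes std's Mutex, while the real harness is generic over the lock implementation):

```rust
use std::sync::Mutex;
use std::thread;

use crossbeam_utils::CachePadded;

fn run(n_threads: usize, n_locks: usize, n_ops: usize) {
    let counters: Vec<CachePadded<Mutex<u32>>> =
        (0..n_locks).map(|_| CachePadded::new(Mutex::new(0))).collect();

    thread::scope(|scope| {
        for seed in 0..n_threads as u32 {
            let counters = &counters;
            scope.spawn(move || {
                let mut rng = seed + 1; // toy xorshift PRNG, seeded per thread (4)
                for _ in 0..n_ops {
                    rng ^= rng << 13;
                    rng ^= rng >> 17;
                    rng ^= rng << 5;
                    let idx = rng as usize % n_locks;
                    *counters[idx].lock().unwrap() += 1; // the whole critical section (1)
                }
            });
        }
    });

    // Sanity check (2): every increment must be accounted for.
    let total: u32 = counters.iter().map(|c| *c.lock().unwrap()).sum();
    assert_eq!(total as usize, n_threads * n_ops);
}
```
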
We are testing std::sync::Mutex, parking_lot::Mutex, spin::Mutex and a bespoke implementation of spinlock from probablydance article.
+We use 32 threads (on a 4-core/8-hyperthread CPU), and each thread increments some counter 10 000 times.
+We run each benchmark 100 times and compute average, min and max times (we are primarily measuring throughput, so average makes more sense than median this time).
+Finally, we run the whole suite twice, to sanity check that the results are reproducible.
First, we reproduce the result that the variance of spinlocks on Linux with default scheduling settings can be huge:
+
+
+
Note that these are extreme results for 100 runs, where each run does 32 * 10_000 lock operations.
+That is, individual lock/unlock operations probably have an even higher spread.
+
Second, the uncontended case looks like I expected: mutexes and spinlocks are not that different, because they essentially use the same code:
+
+
+
Third, under heavy contention mutexes annihilate spinlocks:
+
+
+
Now, this is the opposite of what I would naively expect.
+Even in the heavily contended state, the critical section is still extremely short, so, for each thread, the most efficient strategy seems to be to spin for a couple of iterations.
+
But I think I can explain why mutexes are so much better in this case.
+One reason is that with spinlocks a thread can get unlucky and be preempted in the critical section.
+The other more important reason is that, at any given moment in time, there are many threads trying to enter the same critical section.
+With spinlocks, all cores can be occupied by threads that compete for the same lock.
+With mutexes, there is a queue of sleeping threads for each lock, and the kernel generally tries to make sure that only one thread from the group is awake.
+
This is a funny example of a mechanical race to the bottom. Due to the short length of the critical section, each individual thread would spend fewer CPU cycles in total if it were spinning, but this increases the overall cost.
+
EDIT: simpler and more plausible explanation from the author of Rust’s parking lot is that it does exponential backoff when spinning, unlike the two spinlock implementations.
+
Fourth, even under heavy contention spin locks can luck out and finish almost as fast as mutexes:
+
+
+
This again shows that a good mutex is roughly equivalent to a spinlock in the best case.
+
Fifth, the amount of contention required to disrupt spinlocks seems to be small. Even if 32 threads compete for 1 000 locks, spinlocks are still considerably slower:
+
+
+
EDIT: someone on Reddit noticed that the number of threads is significantly higher than the number of cores, which is an unfortunate situation for spinlocks.
+And, although the number of threads in the benchmark is configurable, it never occurred to me to actually vary it 😅!
+Lowering the number of threads to four gives a picture similar to the “no contention” situation above: spinlocks are slightly, but not massively, faster.
+Which makes total sense: as there are more cores than threads, there’s no harm in spinning.
+And, if you can carefully architect your application such that it runs a small fixed number of threads, ideally pinned to specific CPUs (like in the seastar architecture), using spinlocks might make sense!
As usual, each benchmark exercises only a narrow slice from the space of possible configurations, so it would be wrong to draw a sweeping conclusion that mutexes are always faster.
+For example, if you are in a situation where preemption is impossible (interrupts are disabled, cooperative multitasking, realtime scheduling, etc), spinlocks might be better (or even the only!) choice.
+And there’s also a chance the benchmark doesn’t measure what I think it measures :-)
+
But I find this particular benchmark convincing enough to disprove that “spinlocks are faster than mutexes for short critical sections”.
+In particular, I find enlightening the qualitative observation that, under contention, mutexes allow for better scheduling even when critical sections are short and are not preempted in the middle.
+Efficient Userspace Optimistic Spinning Locks — a presentation about making fast-path spinning in futex-based locks even more efficient.
+The main problem with optimistic spinning is how much of it you want (that is, tweaking the number-of-iterations parameter).
+The proposal solves this in an ingenious self-tweaking way (with the help of the kernel): we spin until the holder of the lock itself goes to sleep.
+
Rust is my favorite programming language (other languages I enjoy are Kotlin and Python).
+In this post I want to explain why I, somewhat irrationally, find this language so compelling.
+The post does not try to explain why Rust is the most loved language according to
+StackOverflow survey :-)
+
Additionally, this post does not cover the actual good reasons why one might want to use Rust.
+Briefly:
+
+
+If you use C++ or C, Rust allows you to get roughly the same binary, but with compile-time guaranteed absence of undefined behavior.
+This is a big deal and the reason why Rust exists.
+
+
+If you use a statically typed managed language (Java, C#, Go, etc), the benefit of Rust is a massive simplification of multithreaded programming: data races are eliminated at compile time.
+Additionally, you get the benefits of a lower level language (less RAM, less CPU, direct access to platform libraries) without paying as much cost as you would with C++.
+This is not free: you’ll pay with compile times and cognitive complexity, but it would be “why my code does not compile” complexity, rather than “why my heap is corrupted” complexity.
+
+
+
If you’d like to hear more about the above, this post will disappoint you :-)
The reason why I irrationally like Rust is that it, subjectively, gets a lot of small details just right (or at least better than other languages I know).
+The rest of the post would be a laundry list of those things, but first I’d love to mention why I think Rust is the way it is.
+
First, it is a relatively young language, so it can have many “obviously good” things.
+For example, I feel like there’s a general consensus now that, by default, local variables should not be reassignable.
+This probably was much less obvious in the 90s, when today’s mainstream languages were designed.
+
Second, it does not try to maintain source/semantic compatibility with any existing language.
+Even if we think that const by default is a good idea, we can’t employ it in TypeScript, because it needs to stay compatible with JavaScript.
+
Third, (and this is a pure speculation on my part) I feel that the initial bunch of people who designed the language and its design principles just had an excellent taste!
To set the right mood for the rest of the discussion, let me start with claiming that snake_case is more readable than camelCase :-)
+Similarly, XmlRpcRequest is better than XMLRPCRequest.
+
I believe that readability is partially a matter of habit.
+But it also seems logical that _ is better at separating words than case change or nothing at all.
+And, subjectively, after writing a bunch of camelCase and snake_case, I much prefer _.
How would you Ctrl+F the definition of foo function in a Java file on GitHub?
+Probably just foo(, which would give you both the definition and all the calls.
+In Rust, you’d search for fn foo.
+In general, every construct is introduced by a leading keyword, which makes the code much easier for a human to read.
+When I read C++, I always have a hard time distinguishing field declarations from method declarations: they start the same.
+Leading keywords also make it easier to do stupid text searches for things.
+If you don’t find this argument compelling because “one should just use an IDE to look for methods”, well, it actually makes implementing an IDE slightly easier as well:
+
+
+Parsing has a nice LL(1) vibe to it, you just dispatch on the current token.
+
+
+Parser resilience is easy, you can synchronize on leading keywords like fn, struct etc.
+
+
+It’s easier for the IDE to guess the intention of a user.
+If you type fn, IDE recognizes that you want to add a new function and can, for example, complete function overrides for you.
+
C-family languages usually use Type name order.
+Languages with type inference, including Rust, usually go for name: Type.
+Technically, this is more convenient because in a recursive descent parser it’s easier to make the second part optional.
+It’s also more readable, because you put the most important part, the name, first.
+Because names are usually more uniform in length than types, groups of fields/local variables align better.
Many languages use if (condition) { then_branch } syntax, where parenthesis around condition are mandatory, and braces around then_branch are optional.
+Rust does the opposite, which has the following benefits:
+
+
+There’s no need for a special rule to associate else with just the right if. Instead, else if is an indivisible unambiguous bit of syntax.
+
+
+The goto fail; bug is impossible; more generally, you don’t have to decide whether it is OK to omit the braces.
+
I think “everything is an expression” is generally a good idea, because it makes things composable.
+Just the other day I tried to handle null in TypeScript in a Kotlin way, with foo() ?? return false, and failed because return is not an expression.
+
The problem with the traditional functional (Haskell/OCaml) approach is that it uses let name = expr in expression for introducing new variables, which just feels bulky.
+Specifically, the closing in keyword feels verbose, and also emphasizes the nesting of expression.
+The nesting is undoubtedly there, but usually it is very boring, and calling it out is not very helpful.
+
Rust doesn’t have a let expression per se; instead, it has flat-feeling blocks which can contain many let statements:
+
+
+
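For example (my own tiny illustration):

```rust
fn main() {
    let total = {
        let width = 10;
        let height = 20;
        width * height // a block is an expression: it evaluates to its last expression
    };
    assert_eq!(total, 200);
}
```
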
This gives, subjectively, a lighter-weight syntax for introducing bindings and side-effecting statements, as well as an ability to nicely scope local variables to sub-blocks!
In Rust, reassignable variables are declared with let mut and non-reassignable with let.
+Note how the rarer option is more verbose, and how it is expressed as a modifier, and not a separate keyword, like let and const.
In Rust, enums (sum types, algebraic data types) are namespaced.
+
You declare enums like this:
+
+
+
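A hypothetical declaration (my reconstruction, based on the Haskell comparison below):

```rust
enum Expr {
    Int(i32),
    Bool(bool),
    Sum(Box<Expr>, Box<Expr>),
}
```
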
And use them like Expr::Int, without worrying that it might collide with
+
+
+
No more repetitive data Expr = ExprInt Int | ExprBool Bool | ExprSum Expr Expr!
+
+Swift does an even nicer trick here, by using the .VariantName syntax to refer to a namespaced enum (docs).
+This makes matching less verbose and completely dodges the sad Rust ambiguity between constants and bindings:
Fields and methods are declared in separate blocks (like in Go):
+
+
+
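For example:

```rust
struct Rectangle {
    width: u32,
    height: u32,
}

impl Rectangle {
    fn area(&self) -> u32 {
        self.width * self.height
    }

    fn is_square(&self) -> bool {
        self.width == self.height
    }
}
```
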
This is a huge improvement to readability: there are usually far fewer fields than methods, but by looking at the fields you can usually understand which set of methods can exist.
u32 and i64 are shorter and clearer than unsigned int or long.
+usize and isize cover the most important use case for arch-dependent integer type, and also make it clearer at the type level which things are addresses/indices, and which are quantities.
+There’s also no question of how integer literals of various types look: it’s just 1i8 or 92u64.
+
Overflow during arithmetic operations is considered a bug: it traps in debug builds and wraps in release builds.
+However, there’s a plethora of methods like wrapping_add, saturating_sub, etc, so you can exactly specify behavior on overflow in specific cases where it is not a bug.
+In general, methods on primitives allow exposing a ton of compiler intrinsics in a systematic way, like u64::count_ones.
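
A few examples of what this looks like in practice:

```rust
fn main() {
    let x: u8 = 250;
    assert_eq!(x.wrapping_add(10), 4);      // explicitly wrap on overflow
    assert_eq!(x.saturating_add(10), 255);  // clamp at the maximum value
    assert_eq!(x.checked_add(10), None);    // report overflow as an Option
    assert_eq!(92u64.count_ones(), 4);      // a compiler intrinsic behind a method
}
```
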
Rust uses control flow analysis to check that every local variable is assigned before the first use.
+This is a much better default than making this UB, or initializing all locals to some default value.
+Additionally, Rust has first-class support for diverging control flow (the ! type and the loop {} construct), which protects it from at-a-distance changes like
+this example
+from Java.
+
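A small illustration:

```rust
fn describe(condition: bool) -> &'static str {
    let label: &'static str;
    if condition {
        label = "yes";
    } else {
        label = "no";
    }
    // Fine: the compiler proved that `label` is assigned on every path before use.
    // Drop the `else` branch and this becomes a compile-time error,
    // not undefined behavior and not a silent default value.
    label
}
```
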
Definitive initialization analysis is an interesting example of a language feature which requires relatively high-brow implementation techniques, but whose effects seem very intuitive, almost trivial, to the users of the language.
Rust libraries (“crates”) don’t have names.
+More generally, Rust doesn’t have any kind of global shared namespace.
+
This is in contrast to languages which have a concept of library path (PYTHONPATH, classpath, -I).
+If you have a library path, you are exposed to name/symbol clashes between libraries.
+While a name clash between two libraries seems pretty unlikely, there’s a special case where collision happens regularly.
+One of your dependencies can depend on libfoo v1, and another one on libfoo v2.
+Usually this means that you either can’t use the two libraries together, or need to implement some pretty horrific workarounds.
+
In Rust the name you use for a library is a property of the dependency edge between upstream and downstream crate.
+That is, a single crate can be known under different names in different dependent crates or, vice versa, two different crates might be known under equal names in different parts of the crate graph!
+This (and semver discipline, which is a social thing) is the reason why Cargo doesn’t suffer from dependency hell as much as some other ecosystems.
Related to the previous point, crates are also an important visibility boundary, which allows you to clearly delineate the public API of a library from implementation details.
+This is a major improvement over class-level visibility controls.
+
It’s interesting though that it took Rust two tries to get first-class “exported from the library” (pub) and “internal to the library” (pub(crate)) visibilities.
+That is also the reason why the more restrictive pub(crate) is unfortunately longer to write; I wish we used pub and pub* instead.
+
Before 2018 edition, Rust had a simpler and more orthogonal system, where you can only say “visible in the parent”, which happens to be “exported” if the parent is root or is itself exported.
+But the old system is less convenient in practice, because you can’t look at the declaration and immediately say if it is a part of the crate’s public API or not.
+
The next language should use these library-level visibilities from the start.
The canonical comparison function returns an enum Ordering { Less, Equal, Greater }, you don’t need to override all six comparison operators.
+Rust also manages this without introducing a separate <=> spaceship operator just for this purpose.
+And you still can implement fast path for == / != checks.
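
A sketch of what this looks like for a custom type:

```rust
use std::cmp::Ordering;

#[derive(PartialEq, Eq)]
struct Version {
    major: u32,
    minor: u32,
}

impl Ord for Version {
    fn cmp(&self, other: &Self) -> Ordering {
        // One function returning Less / Equal / Greater covers <, <=, >, >=.
        self.major
            .cmp(&other.major)
            .then(self.minor.cmp(&other.minor))
    }
}

impl PartialOrd for Version {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
```
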
Rust defines two ways to turn something into a string: Display, which is intended for user-visible strings, and Debug, which is generally intended for printf debugging.
+This is similar to Python’s __str__ and __repr__.
+
Unlike Python, the compiler derives Debug for you.
+Being able to inspect all data structures is a huge productivity boost.
+I hope some day we’ll be able to call custom user-provided Debug from a debugger.
+
A nice bonus is that you can debug-print things in two modes:
+
+
+compactly on a single-line
+
+
+verbosely, on multiple lines as an indented tree
+
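For example:

```rust
#[derive(Debug)]
struct Point {
    x: i32,
    y: i32,
}

fn main() {
    let p = Point { x: 1, y: 2 };
    println!("{:?}", p);  // compact: Point { x: 1, y: 2 }
    println!("{:#?}", p); // verbose: an indented, multi-line tree
}
```
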
Strings are represented as utf-8 byte buffers.
+The encoding is fixed, can’t be changed, and its validity is enforced.
+There’s no random access to “characters”, but you can slice a string with a byte index, provided that it doesn’t fall in the middle of a multi-byte character.
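
For example:

```rust
fn main() {
    let s = "été";
    assert_eq!(s.len(), 5);     // length in bytes, not in characters
    assert_eq!(&s[0..3], "ét"); // slicing at a character boundary is fine
    // &s[0..1] would panic: byte 1 is in the middle of a multi-byte character.
}
```
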
This post describes a simple technique for writing interners in Rust which I haven’t seen documented before.
+
String interning is a classical optimization when you have to deal with many equal strings.
+The canonical example would be a compiler: most identifiers in a program are repeated several times.
+
Interning works by ensuring that there’s only one canonical copy of each distinct string in memory.
+It can give the following benefits:
+
+
+Less memory allocated to hold strings.
+
+
+If all strings are canonicalized, comparison can be done in O(1) (instead of O(n)) by using pointer equality.
+
+
+Interned strings themselves can be represented with an index (typically u32) instead of a (ptr, len) pair.
+This makes data structures which embed strings more compact.
+
+
+
The simplest possible interner in Rust could look like this:
+
+
+
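Roughly like this (a sketch along the lines of the original snippet):

```rust
use std::collections::HashMap;

#[derive(Default)]
pub struct Interner {
    map: HashMap<String, u32>,
    vec: Vec<String>,
}

impl Interner {
    pub fn intern(&mut self, name: &str) -> u32 {
        if let Some(&idx) = self.map.get(name) {
            return idx;
        }
        let idx = self.vec.len() as u32;
        self.map.insert(name.to_owned(), idx);
        self.vec.push(name.to_owned());
        idx
    }

    pub fn lookup(&self, idx: u32) -> &str {
        self.vec[idx as usize].as_str()
    }
}
```
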
To remove duplicates, we store strings in a HashMap.
+To map from an index back to the string, we also store strings in a Vec.
+
I didn’t quite like this solution yesterday, for two reasons:
+
+
+It allocates a lot — each interned string is two separate allocations.
+
+
+Using a HashMap feels like cheating, surely there should be a better, more classical data structure!
+
+
+
So I’ve spent a part of the evening cobbling together a non-allocating trie-based interner.
+The result: the trie does indeed asymptotically reduce the number of allocations, from O(n) to O(log(n)).
+Unfortunately, it is slower, larger and way more complex than the above snippet.
+Minimizing allocations is important, but allocators are pretty fast, and that shouldn’t be done at the expense of everything else.
+Also, Rust HashMap (implemented by @Amanieu based on Swiss Table) is fast.
+
+
+For the curious, the Trie design I've used
+
The trie is built on a per-byte basis (each node has at most 256 children).
+Each internal node is marked with a single byte.
+Leaf nodes are marked with substrings, so that only the common prefix requires a node per byte.
+
To avoid allocating individual interned strings, we store them in a single long String.
+An interned string is represented by a Span (pair of indexes) inside the big buffer.
+
The trie itself is a tree structure, and we can use a standard trick of packing its nodes into an array and using indexes to avoid allocating every node separately.
+However, nodes themselves can be of varying size, as each node can have a different number of children.
+We can still array-allocate them, by rolling our own mini-allocator (using a segregated free list)!
+
Node’s children are represented as a sorted array of links.
+We use binary search for indexing and simple linear shift insertion.
+With at most 256 children per node, it shouldn’t be that bad.
+Additionally, we pre-allocate 256 nodes and use array indexing for the first transition.
+
Links are organized in layers.
+The layer n stores a number of [Link] chunks of length 2^n (in a single contiguous array).
+Each chunk represents the links for a single node (with possibly some extra capacity).
+A node can find its chunk because it knows its number of links (which determines the layer) and the first link in the layer.
+A new link for the node is added to the current chunk if there’s space.
+If the chunk is full, it is copied to a chunk twice as big first.
+The old chunk is then added to the list of free chunks for reuse.
+
Here’s the whole definition of the data structure:
+
+
+
Isn’t it incredibly cool that you can look only at the fields and understand how the thing works,
+without even seeing the remaining 150 lines of the relatively tricky implementation?
+
+
+
However, implementing a trie made me realize that there’s a simple optimization we can apply to our naive interner to get rid of extra allocations.
+In the trie, I concatenate all interned strings into one giant String and use (u32, u32) index pairs as an internal representation of string slice.
+
If we translate this idea to our naive interner, we get:
+
+
+
The problem here is that we can’t actually write implementations of Eq and Hash for Span to make this work.
+In theory, this is possible: to compare two Spans, you resolve them to &str via buf, and then compare the strings.
+However, the HashMap API does not allow expressing this idea.
+Moreover, even if HashMap allowed supplying a key closure at construction time, it wouldn’t help!
+
+
+
Such API would run afoul of the borrow checker.
+The key_fn would have to borrow from the same struct.
+What would work is supplying a key_fn at call-site for every HashMap operation, but that would hurt ergonomics and ease of use a lot.
+This exact problem requires
+slightly unusual
+design of lazy values in Rust.
+
+
However, with a bit of unsafe, we can make something similar work.
+The trick is to add strings to buf in such a way that they are never moved, even if more strings are added on top.
+That way, we can just store &str in the HashMap.
+To achieve address stability, we use another trick from the typed_arena crate.
+If the buf is full (so that adding a new string would invalidate old pointers), we allocate a new buffer, twice as large,
+without copying the contents of the old one.
+
Here’s the full implementation:
+
+
+
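A sketch of it (reconstructed from the description above, so treat it as illustrative rather than authoritative):

```rust
use std::{collections::HashMap, mem};

pub struct Interner {
    map: HashMap<&'static str, u32>,
    vec: Vec<&'static str>,
    buf: String,
    full: Vec<String>,
}

impl Interner {
    pub fn with_capacity(cap: usize) -> Interner {
        let cap = cap.next_power_of_two();
        Interner {
            map: HashMap::default(),
            vec: Vec::new(),
            buf: String::with_capacity(cap),
            full: Vec::new(),
        }
    }

    pub fn intern(&mut self, name: &str) -> u32 {
        if let Some(&id) = self.map.get(name) {
            return id;
        }
        let name = unsafe { self.alloc(name) };
        let id = self.map.len() as u32;
        self.map.insert(name, id);
        self.vec.push(name);
        id
    }

    pub fn lookup(&self, id: u32) -> &str {
        self.vec[id as usize]
    }

    // The `'static` lifetime is a lie we keep private: the returned reference is
    // only valid while `self` is alive, which `lookup` enforces by shortening
    // the lifetime back to `&self`.
    unsafe fn alloc(&mut self, name: &str) -> &'static str {
        let cap = self.buf.capacity();
        if cap < self.buf.len() + name.len() {
            // Start a new, larger buffer instead of growing in place, so that
            // previously handed-out references are never invalidated.
            let new_cap = (cap.max(name.len()) + 1).next_power_of_two();
            let new_buf = String::with_capacity(new_cap);
            let old_buf = mem::replace(&mut self.buf, new_buf);
            self.full.push(old_buf);
        }

        let interned = {
            let start = self.buf.len();
            self.buf.push_str(name); // guaranteed not to reallocate
            &self.buf[start..]
        };

        &*(interned as *const str)
    }
}
```
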
The precise rule for increasing capacity is slightly more complicated:
+
+
+
Just doubling won’t be enough; we also need to make sure that the new string actually fits.
+
We could have used a single bufs: Vec<String> in place of both buf and full.
+The benefit of splitting the last buffer into a dedicated field is that we statically guarantee that there’s at least one buffer.
+That way, we avoid a bounds check and/or .unwrap when accessing the active buffer.
+
We also use &'static str to fake interior references.
+Miri (Rust’s in-progress UB checker) is not entirely happy about this.
+I haven’t dug into this yet, it might be another instance of
+rust-lang/rust#61114.
+To be on the safe side, we can use *const str instead, with a bit of boilerplate to delegate PartialEq and Hash.
+Some kind of (hypothetical) 'unsafe lifetime could also be useful here!
+The critical detail that makes our use of fake 'static sound here is that the alloc function is private.
+The public lookup function shortens the lifetime to that of &self (via lifetime elision).
+
For the real implementation, I would change two things:
+
+
+
Use rustc_hash::FxHashMap.
+It’s a standard Rust HashMap with a faster (but not DoS-resistant) hash function — FxHash.
+Fx stands for Firefox, this is a modification of FNV hash originally used in the browser.
+
+
+
Add a newtype wrapper for string indexes:
+
+
+
+
+
That’s all I have to say about fast and simple string interning in Rust!
+Discussion on /r/rust.
Welcome to my article about Pratt parsing — the monad tutorial of syntactic analysis.
+The number of Pratt parsing articles is so large that there exists a survey post :)
+
The goals of this particular article are:
+
+
+Raising an issue that the so-called left-recursion problem is overstated.
+
+
+Complaining about inadequacy of BNF for representing infix expressions.
+
+
+Providing a description and implementation of Pratt parsing algorithm which sticks to the core and doesn’t introduce a DSL-y abstraction.
+
+
+Understanding the algorithm myself for hopefully the last time. I’ve
+implemented
+a production-grade Pratt parser once, but I no longer immediately understand that code :-)
+
+
+
This post assumes a fair bit of familiarity with parsing techniques, and, for example, does not explain what a context free grammar is.
The pinnacle of syntactic analysis theory is discovering the context free grammar
+notation (often using BNF concrete syntax) for decoding linear structures into trees:
+
+
+
I remember being fascinated by this idea, especially by parallels with natural language sentence structure.
+However, my optimism quickly waned once we got to describing expressions.
+The natural expression grammar indeed allows one to see what is an expression.
+
+
+
Although this grammar looks great, it is in fact ambiguous and imprecise, and needs to be rewritten to be amenable to automated parser generation.
+Specifically, we need to specify precedence and associativity of operators.
+The fixed grammar looks like this:
+
+
+
To me, the “shape” of expressions feels completely lost in this new formulation.
+Moreover, it took me three or four courses in formal languages before I was able to reliably create this grammar myself.
+
And that’s why I love Pratt parsing — it is an enhancement of recursive descent parsing algorithm, which uses the natural terminology of precedence and associativity for parsing expressions, instead of grammar obfuscation techniques.
The simplest technique for hand-writing a parser is recursive descent, which
+models the grammar as a set of mutually recursive functions. For example, the
+above item grammar fragment can look like this:
+
+
+
Traditionally, text-books point out left-recursive grammars as the Achilles heel
+of this approach, and use this drawback to motivate more advanced LR parsing
+techniques. An example of problematic grammar can look like this:
+
+
+
Indeed, if we naively code the sum function, it wouldn’t be too useful:
+
+
+
+
+At this point we immediately loop and overflow the stack
+
+
+
A theoretical fix to the problem involves rewriting the grammar to eliminate the left recursion.
+However, in practice, for a hand-written parser, the solution is much simpler — breaking away from the purely recursive paradigm and using a loop:
I have a confession to make: I am always confused by “high precedence” and “low precedence”. In a + b * c, addition has a lower precedence, but it is at the top of the parse tree…
+
So instead, I find thinking in terms of binding power more intuitive.
+
+
+
The * is stronger, it has more power to hold together B and C, and so the expression is parsed as
+A + (B * C).
+
What about associativity though? In A + B + C all operators seem to have the same power, and it is unclear which + to fold first.
+But this can also be modelled with power, if we make it slightly asymmetric:
+
+
+
Here, we pumped the right power of + just a little bit, so that it holds the right operand tighter.
+We also added zeros at both ends, as there are no operators to bind from the sides.
+Here, the first (and only the first) + holds both of its arguments tighter than the neighbors, so we can reduce it:
+
+
+
Now we can fold the second plus and get (A + B) + C.
+Or, in terms of the syntax tree, the second + really likes its right operand more than the left one, so it rushes to get hold of C.
+While it does that, the first + captures both A and B, as they are uncontested.
+
What Pratt parsing does is find these badass, stronger-than-their-neighbors operators by processing the string left to right.
+We are almost at a point where we finally start writing some code, but let’s first look at the other running example.
+We will use function composition operator, . (dot) as a right associative operator with a high binding power.
+That is, f . g . h is parsed as f . (g . h), or, in terms of power
We will be parsing expressions where basic atoms are single character numbers and variables, and which uses punctuation for operators.
+Let’s define a simple tokenizer:
+
+
+
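Something along these lines (close in spirit to the original, with single-character atoms and operators):

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Token {
    Atom(char),
    Op(char),
    Eof,
}

struct Lexer {
    tokens: Vec<Token>,
}

impl Lexer {
    fn new(input: &str) -> Lexer {
        let mut tokens = input
            .chars()
            .filter(|it| !it.is_ascii_whitespace())
            .map(|c| match c {
                '0'..='9' | 'a'..='z' | 'A'..='Z' => Token::Atom(c),
                _ => Token::Op(c),
            })
            .collect::<Vec<_>>();
        tokens.reverse(); // so that `pop` yields tokens left to right
        Lexer { tokens }
    }

    fn next(&mut self) -> Token {
        self.tokens.pop().unwrap_or(Token::Eof)
    }
    fn peek(&mut self) -> Token {
        self.tokens.last().copied().unwrap_or(Token::Eof)
    }
}
```
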
To make sure that we got the precedence binding power correctly, we will be transforming infix expressions into a gold-standard (not so popular in Poland, for whatever reason) unambiguous notation — S-expressions:
+1 + 2 * 3 == (+ 1 (* 2 3)).
+
+
+
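The S-expression type and its printer might look like this:

```rust
use std::fmt;

enum S {
    Atom(char),
    Cons(char, Vec<S>),
}

impl fmt::Display for S {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            S::Atom(i) => write!(f, "{}", i),
            S::Cons(head, rest) => {
                write!(f, "({}", head)?;
                for s in rest {
                    write!(f, " {}", s)?;
                }
                write!(f, ")")
            }
        }
    }
}
```
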
And let’s start with just this: expressions with atoms and two infix binary operators, + and *:
+
+
+
So, the general approach is roughly the one we used to deal with left recursion — start by parsing the first number, and then loop, consuming operators and doing … something?
+
+
+
+
+Note that we already can parse this simple test!
+
+
+
We want to use this power idea, so let’s compute both left and right powers of the operator.
+We’ll use u8 to represent power, so, for associativity, we’ll add 1.
+And we’ll reserve the 0 power for the end of input, so the lowest power operator can have is 1.
+
+
+
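A sketch of such a function:

```rust
fn infix_binding_power(op: char) -> (u8, u8) {
    match op {
        '+' | '-' => (1, 2), // left associative: the right side binds a bit tighter
        '*' | '/' => (3, 4),
        _ => panic!("bad op: {:?}", op),
    }
}
```
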
And now comes the tricky bit, where we introduce recursion into the picture.
+Let’s think about this example (with powers below):
+
+
+
The cursor is at the first +, we know that the left bp is 1 and the right one is 2.
+The lhs stores a.
+The next operator after + is *, so we shouldn’t add b to a.
+The problem is that we haven’t yet seen the next operator, we are just past +.
+Can we add a lookahead?
+Looks like no — we’d have to look past all of b, c and d to find the next operator with lower binding power, which sounds pretty unbounded.
+But we are onto something!
+Our current right priority is 2, and, to be able to fold the expression, we need to find the next operator with lower priority.
+So let’s recursively call expr_bp starting at b, but also tell it to stop as soon as bp drops below 2.
+This necessitates the addition of min_bp argument to the main function.
+
+And lo, we have a fully functioning minimal Pratt parser:
+
+
+
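My rendition of the core, building on the Lexer and S types sketched above (the remarks below refer to the corresponding places in the code):

```rust
fn expr(input: &str) -> S {
    let mut lexer = Lexer::new(input);
    // Start the recursion with the lowest possible binding power: there is
    // no operator to the left of the whole expression.
    expr_bp(&mut lexer, 0)
}

fn expr_bp(lexer: &mut Lexer, min_bp: u8) -> S {
    let mut lhs = match lexer.next() {
        Token::Atom(it) => S::Atom(it),
        t => panic!("bad token: {:?}", t),
    };

    loop {
        let op = match lexer.peek() {
            Token::Eof => break,
            Token::Op(op) => op,
            t => panic!("bad token: {:?}", t),
        };

        let (l_bp, r_bp) = infix_binding_power(op);
        if l_bp < min_bp {
            break; // the operator binds weaker than the one to our left: stop
        }

        lexer.next(); // bump past the operator itself
        // Recursively parse everything that binds tighter than this operator.
        let rhs = expr_bp(lexer, r_bp);

        lhs = S::Cons(op, vec![lhs, rhs]);
    }

    lhs
}
```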
+
+min_bp argument is the crucial addition. expr_bp now parses expressions with relatively high binding power. As soon as it sees something weaker than min_bp, it stops.
+
+
+This is the “it stops” point.
+
+
+And here we bump past the operator itself and make the recursive call.
+Note how we use l_bp to check against min_bp, and r_bp as the new min_bp of the recursive call.
+So, you can think about min_bp as the binding power of the operator to the left of the current expression.
+
+
+Finally, after parsing the correct right hand side, we assemble the new current expression.
+
+
+To start the recursion, we use binding power of zero.
+Remember, at the beginning the binding power of the operator to the left is the lowest possible, zero, as there’s no actual operator there.
+
+
+
So, yup, these 40 lines are the Pratt parsing algorithm.
+They are tricky, but, if you understand them, everything else is straightforward additions.
Now let’s add all kinds of weird expressions to show the power and flexibility of the algorithm.
+First, let’s add a high-priority, right associative function composition operator: .:
+
+
+
Yup, it’s a single line!
+Note how the left side of the operator binds tighter, which gives us desired right associativity:
+
+
+
Now, let’s add unary -, which binds tighter than binary arithmetic operators, but less tight than composition.
+This requires changes to how we start our loop, as we no longer can assume that the first token is an atom, and need to handle minus as well.
+But let the types drive us.
+First, we start with binding powers.
+As this is a unary operator, it really only has a right binding power, so, ahem, let’s just code this:
+
+
+
+
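A sketch, matching the priorities discussed just below:

```rust
fn prefix_binding_power(op: char) -> ((), u8) {
    match op {
        '+' | '-' => ((), 5),
        _ => panic!("bad op: {:?}", op),
    }
}
```
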
+Here, we return a dummy () to make it clear that this is a prefix, and not a postfix operator, and thus can only bind things to the right.
+
+
+Note, as we want to add unary - between . and *, we need to shift priorities of . by two.
+The general rule is that we use an odd priority as base, and bump it by one for associativity, if the operator is binary. For unary minus it doesn’t matter and we could have used either 5 or 6, but sticking to odd is more consistent.
+
+
+
Plugging this into expr_bp, we get:
+
+
+
Now, we only have r_bp and not l_bp, so let’s just copy-paste half of the code from the main loop?
+Remember, we use r_bp for recursive calls.
+
+
+
Amusingly, this purely mechanical, type-driven transformation works.
+You can also reason why it works, of course.
+The same argument applies; after we’ve consumed a prefix operator, the operand consists of operators that bind tighter, and we just so conveniently happen to have a function which can parse expressions tighter than the specified power.
+
Ok, this is getting stupid.
+If using ((), u8) “just worked” for prefix operators, can (u8, ()) deal with postfix ones?
+Well, let’s add ! for factorials. It should bind tighter than -, because -(92!) is obviously more useful than (-92)!.
+So, the familiar drill — new priority function, shifting priority of . (this bit is annoying in Pratt parsers), copy-pasting the code…
+
+
+
Wait, something’s wrong here.
+After we’ve parsed the prefix expression, we can see either a postfix or an infix operator.
+But we bail on unrecognized operators, which is not going to work…
+So, let’s make postfix_binding_power return an Option, for the case where the operator is not postfix:
+
+
+
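Roughly like this:

```rust
fn postfix_binding_power(op: char) -> Option<(u8, ())> {
    let res = match op {
        '!' => (11, ()),
        _ => return None,
    };
    Some(res)
}
```
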
Amusingly, both the old and the new tests pass.
+
Now, we are ready to add a new kind of expression: parenthesised expression.
+It is actually not that hard, and we could have done it from the start, but it makes sense to handle this here; you’ll see in a moment why.
+Parens are just primary expressions and are handled similarly to atoms:
+
+
+
Unfortunately, the following test fails:
+
+
+
The panic comes from the loop below — the only termination condition we have is reaching eof, and ) is definitely not eof.
+The easiest way to fix that is to change infix_binding_power to return None on unrecognized operators.
+That way, it’ll become similar to postfix_binding_power again!
+
+
+
And now let’s add array indexing operator: a[i].
+What kind of -fix is it?
+Around-fix?
+If it were just a[], it would clearly be postfix.
+If it were just [i], it would work exactly like parens.
+And that is the key: the i part doesn’t really participate in the whole power game, as it is unambiguously delimited. So, let’s do this:
+
+
+
+
+Note that we use the same priority for ! as for [.
+In general, for the correctness of our algorithm it’s pretty important that, when we make decisions, priorities are never equal.
+Otherwise, we might end up in a situation like the one before the tiny adjustment for associativity, where there were two equally-good candidates for reduction.
+However, we only compare right bp with left bp!
+So for two postfix operators it’s OK to have priorities the same, as they are both right.
+
+
+
Finally, the ultimate boss of all operators, the dreaded ternary:
+
+
+
Is this … an all-over-the-place-fix operator?
+Well, let’s change the syntax of ternary slightly:
+
+
+
And let’s recall that a[i] turned out to be a postfix operator + parenthesis…
+So, yeah, ? and : are actually a weird pair of parens!
+And let’s handle them as such!
+Now, what about priority and associativity?
+What associativity even is in this case?
+
+
+
To figure it out, we just squash the parens part:
+
+
+
This can be parsed as
+
+
+
or as
+
+
+
What is more useful?
+For ?-chains like this:
+
+
+
the right-associative reading is more useful.
+Priority-wise, the ternary is low priority.
+In C, only = and , have lower priority.
+While we are at it, let’s add C-style right associative = as well.
+
Here’s the most complete and perfect version of our simple Pratt parser:
This is a sequel to the previous post about Pratt parsing.
+Here, we’ll study the relationship between top-down operator precedence (Pratt parsing) and the more famous shunting yard algorithm.
+Spoiler: they are the same algorithm, the difference is implementation style with recursion (Pratt) or a manual stack (Dijkstra).
+
Unlike the previous educational post, this one is going to be an excruciatingly boring pile of technicalities — we’ll just slowly and mechanically refactor our way to victory.
+Specifically,
+
+
+We start with refactoring Pratt parser to minimize control flow variations.
+
+
+Then, having arrived at the code with only one return and only one recursive call, we replace recursion with an explicit stack.
+
+
+Finally, we streamline control in the iterative version.
+
+
+At this point, we have a bona fide shunting yard algorithm.
+
+
+
To further reveal the connection, we then verify that the original recursive and the iterative formulations produce syntax nodes in the same order.
+
Really, the most exciting bit about this post is the conclusion, and you already know it :)
Last time, we’ve ended up with the following code:
+
+
+
First, to not completely drown in minutia, we’ll simplify it by removing support for indexing operator [] and ternary operator ?:.
+We will keep parenthesis, left and right associative operators, and the unary minus (which is somewhat tricky to handle in shunting yard).
+So this is our starting point:
+
+
+
What I like about this code is how up-front it is about all special cases and control flow.
+This is a “shameless green” code!
+However, it is clear that we have a bunch of duplication between prefix, infix and postfix operators.
+Our first step would be to simplify the control flow to its core.
First, let’s merge postfix and infix cases, as they are almost the same.
+The idea is to change priorities for ! from (11, ()) to (11, 100), where 100 is a special, very strong priority, which means that the right hand side of a “binary” operator is empty.
+We’ll handle this in a pretty crude way right now, but all the hacks would go away once we refactor the rest.
+
+
+
Yup, we just check for the hard-coded 100 constant and use a bunch of unwraps all over the place.
+But the code is already smaller.
+
Let’s apply the same treatment for prefix operators.
+We’ll need to move their handling into the loop, and we also need to make lhs optional, which is now not a big deal, as the function as a whole returns an Option.
+On a happier note, this will allow us to remove the if 100 wart.
+What’s more problematic is handling priorities: minus has different binding powers depending on whether it is in an infix or a prefix position.
+We solve this problem by just adding a prefix: bool argument to the binding_power function.
+
+
+
Keen readers might have noticed that we use 99 and not 100 here for “no operand” case.
+This is not important yet, but will be during the next step.
+
We’ve unified prefix, infix and postfix operators.
+The next logical step is to treat atoms as nullary operators!
+That is, we’ll parse 92 into (92) S-expression, with None for both lhs and rhs.
+We get this by using (99, 100) binding power.
+At this stage, we can get rid of the distinction between atom tokens and operator tokens, and make the lexer return the underlying chars directly.
+We’ll also get rid of S::Atom, which gives us this somewhat large change:
+
+
+
This is the stage where it becomes important that “fake” binding power of unary - is 99.
+After parsing the first constant in 1 - 2, the r_bp is 100, and we need to avoid eating the following minus.
+
The only thing left outside the main loop are parenthesis.
+We can deal with them using (99, 0) priority — after ( we enter a new context where all operators are allowed.
+
+
+
Or, after some control flow cleanup:
+
+
+
This is still recognizably a Pratt parser, with its characteristic shape
+
+
+
What we’ll do next is mechanical replacement of recursion with a manual stack.
This is a general transformation and (I think) it can be done mechanically.
+The interesting bits during transformation are recursive calls themselves and returns.
+The underlying goal of the preceding refactorings was to reduce the number of recursive invocations to one.
+We still have two return statements there, so let’s condense that to just one as well:
+
+
+
Next, we should reify locals which are live across the recursive call into a data structure.
+If there were more than one recursive call, we’d have to reify control-flow as enum as well, but we’ve prudently removed all but one recursive invocation.
+
So let’s start with introducing a Frame struct, without actually adding a stack just yet.
+
+
+
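Something like this (the exact fields are reconstructed from what has to survive across the recursive call, so treat them as illustrative):

```rust
struct Frame {
    min_bp: u8,
    lhs: Option<S>,
    token: Option<char>, // the operator whose right-hand side we are parsing
}
```
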
And now, let’s add a stack: Vec<Frame>.
+This is the point where the magic happens.
+We’ll still keep the top local variable: representing a stack as (T, Vec<T>) and not as just Vec<T> gives us a compile-time guarantee of non-emptiness.
+We replace the expr_bp(lexer, r_bp) recursive call with pushing to the stack.
+All operations after the call are moved after return.
+return itself is replaced with popping off the stack.
+
+
+
Tada! No recursion anymore, and still passes the tests!
+Let’s cleanup this further though.
+First, let’s treat ) more like a usual operator.
+The correct binding powers here are the opposite of (: (0, 100):
+
+
+
Finally, let’s note that continue inside the match is somewhat wasteful — when we hit it, we’ll re-peek the same token again.
+So let’s repeat just the match until we know we can make progress.
+This also allows replacing peek() / next() pair with just next().
+
+
+
And guess what? This is the shunting yard algorithm, with its characteristic shape of
+
+
+
To drive the point home, let’s print the tokens we pop off the stack, to verify that we get reverse Polish notation without any kind of additional tree rearrangement, just like in the original algorithm description:
+
+
+
+
+
We actually could have done it with the original recursive formulation as well.
+Placing print statements at all points where we construct an S node prints the expression in reverse Polish notation,
+proving that the recursive algorithm does the same steps and in the same order as the shunting yard.
This is a short ad of a Rust programming language targeting experienced C++ developers.
+Being an ad, it will only whet your appetite, consult other resources for fine print.
This program creates a vector of 32-bit integers (std::vector<int32_t>), takes a reference to the first element, x, pushes one more number onto the vector and then uses x.
+The program is wrong: extending the vector may invalidate references to its elements, and *x might dereference a dangling pointer.
+
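The program in question might look like this in Rust (a sketch matching the description; the compiler's error output is shown next):

```rust
fn main() {
    let mut xs = vec![1i32, 2, 3];
    let x: &i32 = &xs[0]; // a reference into the vector's heap storage
    xs.push(92);          // may reallocate and invalidate `x`
    println!("{}", *x);   // rejected: `xs` is still borrowed by `x`
}
```
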
The beauty of this program is that it doesn’t compile:
+
+
+
Rust compiler tracks the aliasing status of every piece of data and forbids mutations of potentially aliased data.
+In this example, x and xs alias the first integer in the vector’s storage in the heap.
This program creates an integer counter protected by a mutex, spawns 10 threads, increments the counter 10 times from each thread, and prints the total.
+
The counter variable lives on the stack, and a pointer to this stack data is shared with other threads.
+The threads have to lock the mutex to do the increments.
+When printing the total, the counter is read bypassing the mutex, without any synchronization.
+
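A sketch of such a program, using std's scoped threads (the original may differ in details):

```rust
use std::sync::Mutex;
use std::thread;

fn main() {
    let counter = Mutex::new(0u32); // lives on main's stack

    thread::scope(|scope| {
        for _ in 0..10 {
            scope.spawn(|| {
                for _ in 0..10 {
                    // Child threads can only reach the data through the mutex.
                    *counter.lock().unwrap() += 1;
                }
            });
        }
    }); // all child threads are guaranteed to have finished here

    // Read the total without locking: no other thread can observe it anymore.
    let total = counter.into_inner().unwrap();
    println!("{total}");
}
```
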
The beauty of this program is that it relies on several bits of subtle reasoning for correctness, each of which is checked by compiler:
+
+
+Child threads don’t escape the main function and so can read counter from its stack.
+
+
+Child threads only access counter through the mutex.
+
+
+Child threads will have terminated by the time we read total out of counter without mutex.
+
+
+
If any of these constraints are broken, the compiler rejects the code.
+There’s no need for std::shared_ptr just to defensively make sure that the memory isn’t freed under your feet.
+
Rust allows doing dangerous, clever, and fast things without fear of introducing undefined behavior.
+
If you like what you see, here are two books I recommend for diving deeper into Rust:
Hey, unlike all other articles on this blog, this one isn’t about programming, it’s about my personal life.
+It’s nothing important, just some thoughts that have been on my mind recently.
+So, if you come here for technical content, feel free to skip this one!
+
I do, however, intentionally post this together with other articles, for two main reasons:
+
+
+
There are some things here which I wish I had understood earlier.
+So, I would have liked it if I had accidentally read about them in some technical blog.
+
+
+
+I am always casually interested in the people behind the technical blogs I read, so, again, I would have liked to read a similar article.
I think giving some background info about me would be useful.
+I come from a middle class Russian family.
+I was born in 1992, so my earliest years fell onto a rather fun historical period, of which I don’t really remember anything.
+I grew up in Stavropol — a city circa 400_000 in the southern part of Russia.
+After finishing school (I was sixteen), I moved to St. Petersburg to study at the state university there.
+I had spent 10 years in that city before moving to Berlin, the place I currently live, last year.
+
In terms of understanding “how life works”, I became somewhat actively self-conscious at about 14.
+The set of important beliefs I learned/discovered then didn’t change until about 2017 or so.
+This latter change (which I feel is still very much ongoing) gives the title to the present article.
I guess the biggest deal for me is discovering that polyamory 1) exists 2) is something I’ve been missing a lot in my interpersonal relations.
+It’s the big one because it most directly affected me, and because other stuff I’ve learned, I’ve learned from my poly partners.
+
+In a nutshell, polyamory is the idea that it is OK to love several people at the same time.
+That if you love A, and also love B, it doesn’t mean that your love for A is somehow fake or untrue.
+I find the analogy with kids illuminating — if it’s OK to love both your kids, then it should be OK to love both your partners, right?
+I highly recommend everyone to read More than Two, on the basis that it’s a rare book that directly affected my life, and that it would probably have affected it even if polyamory weren’t my thing (which is, of course, totally valid as well!).
+
A more general point is that until 2017, I didn’t have a real working model of romantic relationships.
+I am reasonably sure that a lot of people are in a similar situation: it’s hard to encounter a reasonable relationship model in society to learn from!
+(This might be biased by my culture, but I suspect that it might not).
+
We aren’t taught how to be with another person (if we are lucky enough, we are taught how to practice safe sex at least), so we have to learn on our own by observing.
+One model is the relationships of our parents, which are quite often at least somewhat broken (like in my case).
+The other model is the art, and the portrayal of romance in art is (and this is an uncomfortably strong opinion for me) actively harmful garbage.
+
What I now hold as the most important thing in romantic relations is a very clear, direct and honest communication.
+Honest with yourself and honest with your partner.
+Honesty includes the ability to feel your genuine needs and desires (as opposed to following the model of what you think you should feel).
+
An example that is near and dear to my heart is when you are in a relationship with A, but there’s also this other person B whom you find attractive.
+Honesty is accepting that “attractive” means “my body (and quite probably my consciousness) wants to have sex with this person” and acting on that observation, rather than pretending that it doesn’t exist or shaming yourself into thinking it shouldn’t exist.
+
Or a more concrete example: one of my favorite dishes (code named “the dish I find the most yummy”) is bananas mixed with sour cream and quark.
+Me and my partner O enjoyed eating this dish in the morning, and I was usually tasked with preparing it.
+There are two variates of quark — a hard grainy one and a soft one.
+O had a preference for the soft one, so, naturally, I made morning meals using the soft one, because I don’t really care, and eating the same thing is oh sooo romantic.
+This continued until one day O said “Kladov, stop bullshitting yourself and admit that you love the grainy one. Let’s buy both varieties and make two portions”.
+O was totally right.
+And the thing is, I haven’t even noticed my (useless, stupid, and most egregiously, not called for) sacrifice for the sake of the relationship until it was called out by my partner.
+(In the end, O came to the conclusion that the grainy quark is actually yummier, but that’s beside the point).
+
And the depiction of love in art is the opposite of this.
+Which is understandable — the reason why romance (and death) is featured so prominently in art is that a major component of art’s success is its capacity for evoking emotions, and there’s little as heart-wrenching as romantic drama (and death).
+And the model of “speak with words through the mouth” relationships is very good at minimizing drama.
+(Reminder: this is a non-technical post, so if I say here that something is or isn’t the case, it doesn’t mean I’ve performed due diligence to confirm that it is true).
+My relations with poly partners were more boring than my relations with monogamous partners.
+This is great for participating people, but bad for art (unless it is some kind of slow-cinema piece).
+
Recently, I re-read Anna Karenina by Leo Tolstoy.
+I highly recommend this novel if you can read Russian.
+(I am not sure if it is translatable to English, a big part of its merit is the exquisite language).
+There are two romantic lines there: a passionate, forbidden and fatal love between Anna (who is married) and Vronski (who is not the guy Anna is married to), and a homely love/family of Levin and Kity.
+The second one is portrayed in a favorable light, as a representative of the isomorphism class of happy families.
+The scene of engagement between Levin and Kity made my blood boil.
+They are sitting at the table, with a piece of chalk.
+Levin feels that it’s kind of an appropriate moment to ask Kity to marry him.
+So he takes the chalk and writes:
+
+
+
Which are the initial letters of a phrase
+
+
+
Which asks about Kity’s original rejection of Levin several years ago.
+Kity decodes this message, and answers in a likewise manner.
+This “dialog” continues for some time, at the end of which they are happily engaged, and I am enraged.
+Such implicit, subtle and ellipsis-based communication is exactly how you wreck any relationship.
+
The saddest part here is that I wasn’t enraged when I read the book for the first time, at 15 or so.
+Granted, I had a full understanding that the book is about the late XIX century, and that the models of relations there are questionable.
+But still, I think I subconsciously binned Levin and Kity’s relationship with the good ones, and this is why I find the art harmful in this respect.
+
My smaller quibble is that sex is both a fetishized and a taboo topic.
+It’s hinted at, today not so subtly, but is rarely shown or studied as a subject of art.
+Von Trier and Gaspar Noé are two great exceptions among the artists I like.
So, how did I go from a default void model of romance, to my current place, where I know what I want and can actively build my relationships as I like, and not “as they are supposed to be”?
+This is the most fascinating thing about this, and one of the primary reasons for me to write this down for other people to read.
+
I think I am a pretty introspective person — I like to think about things, form models and opinions, adjust and tweak them.
+And I did think about relationships a lot.
+And, for example, one conclusion was that I don’t really understand jealousy, and I don’t want to “own” or otherwise restrict my partner.
+I was always OK with the fact that a person I love has a relationship with someone else, both in theory and, a couple of times, in practice.
+
But I didn’t make the jump to “it’s OK for me to love more than a single person”, and I don’t really understand why.
+It feels like a very simple theorem, which you should be able to just prove yourself.
+Instead, it took me several chance encounters to get to this truth.
+(To clarify again, I don’t claim that polyamory is a universal truth, this is just something that works for me, you are likely different).
+Once I got it, it turned out obvious and self evident.
+But to get it, I needed:
+
+
+A relation with a poly person S, who was literally reading More Than Two when we were together.
+
+
+A relation with an extremely monogamous (as in, expressing a lot of distress due to jealousy) S.
+
+
+A relation with another poly person A, at which point it finally clicked that if I like 1 & 3, and don’t like 2, then maybe it makes sense for me to read that book as well.
+
+
+
So, surprise: it’s possible to carry some hugely important, but not so subtly broken, things over from childhood/early adolescence without ever reconsidering them.
+If they are pointed out, it’s clear how to fix them, but noticing them is the tricky bit…
Speaking of things which are hard to notice…
+Surprisingly, mental health exists!
+Up until very recently, my association for mental health was The Cabinet of Dr. Caligari: something which just doesn’t happen “in real life”.
+Very, very far from the truth.
+A lot of people seriously struggle with their minds.
+Major depression or borderline personality disorder (examples I am familiar with) affect the very way you think, and are not that uncommon.
+And many people struggle with smaller problems, like anxiety, self-loathing, low self-esteem, etc.
+
My own emotional responses are pretty muted.
+I’d pass a Voight Kampff test I guess. Maybe.
+My own self-esteem is adequate, and I love myself.
+
So, it was eye-opening to realize that this might not be the case for other people.
+Empathy is also not my strongest point, hehe.
+
Well, it gets even better than this.
+I suspect I might be autistic :-)
+Thanks M for pointing that out to me:
+
M: I am autistic. +
+A: Wait wat? On the contrary, you are the first person I’ve met who doesn’t seem insane. Wait a second…
+
(
+Actually, S had made a bet that I am an aspie a couple of years before that…
+Apparently, just telling me something important about myself never works?
+)
+
To clarify, I’ve never been to a counselor, so I don’t know what labels are applicable to me, if any, but I do think that I can be described as a person demonstrating certain unusual/autistic traits.
+They don’t bother me (on the contrary, having learned a bit about minds of other people, I feel super lucky about the way my brain works), so I don’t think I’d get counseling any time soon.
+However, if something in your life bothers you (or even if it doesn’t), counseling is probably a good idea to try!
+Several people I trust highly recommend it.
+Keep in mind that a lot of what is called psychology oscillates between science and, well, bullshit, so be careful with your choice.
+Check that it is indeed a science based thing (Cognitive Behavioral Therapy being one of the most properly researched approaches).
+
Anyway, I guess it makes sense to share a bit of my experiences, in case someone reads this and thinks “oh shit, that’s me” :-)
+Hypothetical me from ten years ago would have appreciated this.
+
I think the single most telling thing is that I am Meursault, from Camus’s The Stranger.
+I read a lot, but characters rarely make sense to me, even less so than people.
+Meursault is the exception — I can associate myself with him.
+Not as “he is in a similar situation to mine” but “I understand the motives of his actions in any given situation”.
Apparently, Meursault had a real-life prototype, Camus’s best friend, and it looks like that friend had Asperger’s before it even was named!
+Hey, the hypothesis that I am autistic has predictive power!
+
Another thing where I find myself different from other people is that I am introverted.
+Well, a lot of folks I know claim “I am introverted”, but the amount of social life they have gives me chills :-)
+Kladov’s radius — the minimal degree of introversion such that you are the most introverted person you know, because for any person more introverted than yourself, you two have zero chance to meet.
+
I don’t really have a need for social interactions I think — I like being by myself.
+Not uttering a single word in a day (or a weekend) is something which happens to me pretty regularly, and I enjoy that.
+By the way, did you know that Gandhi had one day in the week when he spoke to no one?
+
What do I do instead of people?
+(formerly) Mathematics, programming, watching good movies, reading good books.
+Programming is a big one for the past six years or so — I rather easily lose myself in the state of flow (although my overall productivity is super unstable, and sometimes I can’t get anything done for the whole day, just because).
+I also occasionally get mildly annoyed by the work-life balance articles on reddit (I am thinking about a specific one which contrasted having a life with building a career).
+Of course everyone should do what works best for them.
+But if someone codes at work, and then codes at home, it doesn’t necessarily mean they are optimizing their salary or are trying to get better at coding or something.
+They might just really like writing code, and sometimes practice it during working hours as well because what else would you do between the meetings?
+
Otherwise, I am pretty uninterested in stuff.
+I don’t like traveling or trying out new things.
+
I don’t have any super specific physical or psychological sensitivities.
+I don’t go outside of my apartment without headphones; music helps me to create a sort of bubble of my space around myself.
+I am pretty easily overwhelmed in groups of people (which is different from not enjoying people generally — I might get overwhelmed even among people I like to be around).
+
My interpersonal relations are funny — I always perceive myself much colder than the other person (and I project much fewer emotions in stressful situations).
+Note that “colder” here is a positive thing — I wish other people were more like me, not the other way around.
+
I am awkward and avoidant of “casual” social contact.
+As in, I don’t eat alone in cafes and such, as that means interacting with the waiter.
+I do that in company though, where I can just observe and repeat what others are doing.
+
In general, I am pretty happy to be at the place where I am.
+Well, I guess it would have helped a tiny bit if I could go to the supermarket in the next building, and not to the one three blocks away where I had already been before and where I know how to behave.
+But, really, I perceive these as small things which are not worth fixing.
The next discovery (or rather, subtle shift in the world view) is from a slightly earlier era (2014 maybe?).
+I don’t believe that people are X.
+Or rather, I believe that it’s generally unimportant that “this person is X” when explaining their actions.
+I weigh circumstances as relatively more important than personalities when explaining events.
+In other words, there are no “good” or “bad” people; the same person can display a wide range of behaviors, depending on the current (not necessarily historical) environment.
+This is what I’ve learned from The Lucifer Effect.
+
More generally, I feel that systems, mechanisms and institutions in place define the broad outlook of the world, and, if something is wrong, we should not make it right, but understand what force makes it wrong, and try to build a counter-mechanism.
+
A specific example here is that, if I see a less than polite/constructive/respectful comment on reddit making a point I disagree with, I answer with two comments.
+One is a factual comment about the point being discussed; the other is a templated response along the lines of “I read your comment as _, I find this unacceptable, please avoid antagonistic rhetorical constructs like _”.
+
That is:
+
+
+Clarify my subjective interpretation of the comment.
+
+
+State that I don’t find it appropriate.
+
+
+Point out specific ways to improve the comment.
+
+
+
The goal here is not to argue with a single specific comment, or to change the behavior of a single specific commenter so that they write better comments in the future.
+The goal is to create a culture which I think promotes healthy discussion, so that, when other people read the exchange, they get a strong signal what is ok and what is not.
A more recent development of this idea is that mechanisms rule me as well (thanks to O again for this one!).
+
Specifically, I now separate my mind from myself.
+What my mind feels/wants is not necessarily what I want.
+I am not my brain.
+
If I feel a craving for a bit of chocolate, that doesn’t mean that I actually want sweets!
+It only means that some chemistry in my brain decided that I need to experience the feeling of wanting something sweet right now.
+
An interesting aspect of this is that the “desires” part of our brain is older and more primitive than “proving theorems” part of our brain.
+As it is simpler, it is more reliable and powerful.
+So, it takes a disproportionately large amount of willpower to override your primitive wanting brain.
+
This flipped me from “If I want to stop doing X, I’d easily do that” to “ok, I should not start wanting X, otherwise getting rid of that would be a pain”.
+Somehow, I’ve never tried alcohol, tobacco or drugs before (yes, I voluntarily moved to Berlin).
+There wasn’t a strong reason for that; I am totally OK with all those things, it’s just that (I guess) I am too introverted to end up in the kind of company where I’d start.
+However, now I think I would deliberately avoid addictive substances, because I value my thinking about complicated stuff.
+And when I am dealing with a hard math-y problem, I don’t want to think “and don’t drink that extra bottle of beer” on top, as that’s too hard.
+
I am less successful with the torrent of low-quality superficial info from the internet.
+Luckily, I’ve never had any social network profiles (I guess for the same reason as with alcohol), but I started reading reddit at some point, and that eats into my attention.
+/etc/hosts and RSS help a lot here.
This discussion about mind, cognitive biases, mechanisms etc sounds a lot like something from rationalists community.
+I am somewhat superficially familiar with it, and it does sound like a good thing.
+If I were to optimize my life to better achieve my goals, I would probably dedicate some time studying https://www.lesswrong.com/.
+Perhaps even me not having any particular goals (besides locally optimizing for what I find the most desirable at any given moment) is some form of a bias?
To conclude, a small, but crisp observation.
+I often find myself in emotionally non-neutral debates about whether doing “X” is good.
+If there’s an actual disagreement, I tend to find myself on the relatively more cold/cynical side, and my interlocutor on the more empathetic one.
+Surprisingly to me, many of such disagreements are traced to a single fundamental difference in decision-making process.
+
When I make a decision (especially an ethical one), I tend to go for what I feel is “right” in some abstract sense.
+I can’t explain this any better, this is really a gut feeling (and is not categorical imperative, at least not consciously).
+
+Apparently, another mode for making ethical decisions is common — weighing the consequences of a specific action in a specific context, and making a decision based on that, without taking the “properness” of the action itself into consideration.
+
+With these two different underlying algorithms, it’s pretty easy to heatedly disagree about some specific conclusion!
+(Tip: to unearth such deep disagreements more efficiently, use the following rule: as soon as anyone notices that a debate is happening, the debate is paused, and each side explains the position the other side is arguing for).
I guess that’s it for now and the nearest future!
+If you have comments, suggestions or just want to say hello, feel free to drop me an email (it’s in my GitHub profile) or contact me on Telegram (@matklad).
This is a short note on the builder pattern, or, rather, on the builder method pattern.
+
TL;DR: if you have Foo and FooBuilder, consider adding a builder method to Foo:
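(A minimal sketch; the frobnicate field is made up to stand in for Foo’s real configuration.)

```rust
#[derive(Default)]
pub struct FooBuilder {
    frobnicate: bool,
}

pub struct Foo {
    frobnicate: bool,
}

impl Foo {
    // The builder method: it shows up right in Foo's docs, next to the type itself.
    pub fn builder() -> FooBuilder {
        FooBuilder::default()
    }
}

impl FooBuilder {
    pub fn frobnicate(mut self, yes: bool) -> FooBuilder {
        self.frobnicate = yes;
        self
    }
    pub fn build(self) -> Foo {
        Foo { frobnicate: self.frobnicate }
    }
}
```

At the call site this reads as Foo::builder().frobnicate(true).build(), without FooBuilder ever being spelled out.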
+
+
+
A more minimal solution is to rely just on FooBuilder::default or FooBuilder::new.
+There are two problems with that:
+
First, it is hard to discover.
+Nothing in the docs/signature of Foo mentions FooBuilder, you need to look elsewhere to learn how to create a Foo.
+I remember being puzzled at how to create a GlobSet for exactly this reason.
+In contrast, the builder method is right there on Foo, probably the first one.
+
+Second, it is more annoying to use, as you need to import both Foo and FooBuilder.
+With Foo::builder method often only one import suffices, as you don’t need to name the builder type.
This is a hand-wavy philosophical article about programming, without quantifiable justification, but with some actionable advice and a case study.
+
Suppose that there are two types in the program, Blorb and Gonk.
+Suppose also that they both can blag.
+
Does it make sense to add the following trait?
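Say, something like this (a sketch — assuming blag takes no arguments and returns nothing):

```rust
trait Blag {
    fn blag(&mut self);
}
```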
+
+
+
I claim that it makes sense only if you have a function like
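(A hypothetical example, reusing the Blag trait sketched above.)

```rust
fn blag_twice<T: Blag>(x: &mut T) {
    x.blag();
    x.blag();
}
```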
+
+
+
+That is, if some part of your program is generic over T: Blag.
+
If in every x.blag() the x is either Blorb or Gonk, but never a T (each usage is concrete), you don’t need this abstraction.
+“Need” is used in a literal sense here: replace a trait with two inherent methods named blag, and the code will be essentially the same.
+Using a trait here doesn’t achieve any semantic compression.
+
+Given that abstractions have costs, “don’t need” can be strengthened to “probably shouldn’t”.
+
+
+
+Not going for an abstraction often allows for a more specific interface.
+A monad in Haskell is a thing with >>=.
+Which isn’t telling much.
+Languages like Rust and OCaml can’t express a general monad, but they still have concrete monads.
+The >>= is called and_then for futures and flat_map for lists.
+These names are more specific than >>= and are easier to understand.
+The >>= is only required if you want to write code generic over type of monad itself, which happens rarely.
+
+Another example of an abstraction which is used mostly concretely is collection hierarchies.
+In Java or Scala, there’s a whole type hierarchy for things which can hold other things.
+Rust’s type system can’t express Collection trait, so we have to get by with using Vec, HashSet and BTreeSet directly.
+And it isn’t actually a problem in practice.
+Turns out, writing code which is generic over collections (and not just over iterators) is not that useful.
+The “but I can change the collection type later” argument also seems overrated — often, there’s only single collection type that makes sense.
+Moreover, swapping HashSet for BTreeSet is mostly just a change at the definition site, as the two happen to have almost identical interfaces anyway.
+The only case where I miss Java collections is when I return Vec<T>, but mean a generic unordered collection.
+In Java, the difference is captured by List<T> vs Collection<T>.
+In Rust, there’s nothing built-in for this.
+It is possible to define a VecSet<T>(Vec<T>), but it doesn’t seem worth the effort.
+
Collections also suffer from >>= problem — collapsing similar synonyms under a single name.
+Java’s
+Queue
+has add, offer, remove, and poll methods, because it needs to be a collection, but also is a special kind of collection.
+In C++, you have to spell push_back for vector’s push operation, so that it duck-types with deque, which can push at both the front and the back.
+
+
Finally, the promised case study!
+rust-analyzer needs to convert a bunch of internal types into types suitable for conversion into JSON messages of the Language Server Protocol.
+ra::Completion is converted into lsp::Completion; ra::Completion contains ra::TextRange which is converted to lsp::Range, etc.
+
The first implementation started with an abstraction for conversion:
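(Sketched from the description above; the real trait in rust-analyzer’s code base may have differed in details.)

```rust
pub trait Conv {
    type Output;
    fn conv(self) -> Self::Output;
}
```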
+
+
+
This abstraction doesn’t work for all cases — sometimes the conversion requires additional context.
+For example, to convert a rust-analyzer’s offset (a position of byte in the file) to an LSP position ((line, column) pair), a table with positions of newlines is needed.
+This is easy to handle:
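(Again a sketch: thread a context type parameter through the trait.)

```rust
pub trait ConvWith<CTX> {
    type Output;
    fn conv_with(self, ctx: CTX) -> Self::Output;
}
```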
+
+
+
Naturally, there was an intricate web of delegating impls.
+The typical one looked like this:
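(Reconstructed as a toy below, building on the ConvWith trait sketched above; LineIndex, TextRange and the lsp module are simplified stand-ins for the real rust-analyzer and LSP types.)

```rust
pub struct LineIndex;                     // table of newline positions
pub struct TextRange { start: u32, end: u32 }
pub mod lsp {
    pub struct Position(pub u32, pub u32); // (line, column)
    pub struct Range(pub Position, pub Position);
}

// offset -> (line, column); a real impl would consult the line index
impl ConvWith<&LineIndex> for u32 {
    type Output = lsp::Position;
    fn conv_with(self, _line_index: &LineIndex) -> lsp::Position {
        lsp::Position(0, self)
    }
}

// the typical delegating impl: convert a range by converting both endpoints
impl ConvWith<&LineIndex> for TextRange {
    type Output = lsp::Range;
    fn conv_with(self, line_index: &LineIndex) -> lsp::Range {
        lsp::Range(
            self.start.conv_with(line_index),
            self.end.conv_with(line_index),
        )
    }
}
```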
+
+
+
There were a couple of genuinely generic impls for converting iterators of convertible things.
+
The code was hard to understand.
+It also was hard to use: if calling .conv didn’t work immediately, it took a lot of time to find which specific impl didn’t apply.
+Finally, there were many accidental (as in “accidental complexity”) changes to the shape of code: CTX being passed by value or by reference, switching between generic parameters and associated types, etc.
+
I was really annoyed by how this conceptually simple pure boilerplate operation got expressed as clever and fancy abstraction.
+Crucially, almost all of the usages of the abstraction (besides those couple of iterator impls) were concrete.
+So I replaced the whole edifice with much simpler code, a bunch of functions:
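(A sketch with approximated signatures, reusing the stand-in types from the snippet above; the real module is rust-analyzer’s to_proto.)

```rust
mod to_proto {
    use super::{lsp, LineIndex, TextRange};

    pub(crate) fn position(_line_index: &LineIndex, offset: u32) -> lsp::Position {
        lsp::Position(0, offset) // a real impl would consult the line index
    }

    pub(crate) fn range(line_index: &LineIndex, range: TextRange) -> lsp::Range {
        lsp::Range(
            position(line_index, range.start),
            position(line_index, range.end),
        )
    }
}
```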
+
+
+
Simplicity and ease of use went up tremendously.
+Now instead of typing x.conv() and trying to figure out why an impl I think should apply doesn’t apply, I just auto-complete to_proto::range and let the compiler tell me exactly which types don’t line up.
+
I’ve lost fancy iterator impls, but the
+total diff
+for the commit was +999,-1123.
+There was some genuine code re-use in those impls, but it was not justified by the overall compression, even disregarding additional complexity tax.
+
To sum up, “is this abstraction used exclusively concretely?” is a meaningful question about the overall shape of code.
+If the answer is “Yes!”, then the abstraction can be replaced by a number of equivalent non-abstract implementations.
+As the latter tend to be simpler, shorter, and more direct, “Concrete Abstraction” can be considered a code smell.
+As usual though, any abstract programming advice can be applied only in a concrete context — don’t blindly replace abstractions with concretions, check if provided justifications work for your particular case!
This is my response for this year’s call for blog posts.
+I am writing this as a language implementor, not as a language user.
+I also don’t try to prioritize the problems.
+The two things I’ll mention are the things that worry me most without reflecting on the overall state of the project.
+They are not necessarily the most important things.
For the past several years, I’ve been a maintainer of “Sponsored Open Source” projects (rust-analyzer & IntelliJ Rust).
+These projects:
+
+
+have a small number of core developers who work full-time at company X, and whose job is to maintain the project,
+
+
+are explicitly engineered for active open source:
+
+
+a significant fraction of the maintainers’ time goes to contribution documentation, issue mentoring, etc,
+
+
+a non-trivial number of features end up being implemented by the community.
+
+
+
+
+
This experience taught me that there’s a great deal of difference between the work done by the community and the work done during paid hours.
+To put it bluntly, a small team of 2-3 people working full-time on a specific project with a long time horizon can do a lot.
+Not because paid hours == higher quality work, but because of the cumulative effect of:
+
+
+being able to focus on a single thing,
+
+
+keeping the project in a mental cache and accumulating knowledge,
+
+
+being able to “invest” into the code and do long-term planning effectively.
+
+
+
In other words, the community gives breadth of contributions, while paid hours give depth.
+Both are important, but I feel that Rust could use a lot of the latter at the moment, in two senses.
+
First, the marginal utility of adding a full-time developer to the Rust project will remain high for quite a few more full-time developers.
+
Second, perhaps more worrying, I have a nagging feeling that the imbalance between community and paid hours can affect the quality of the technical artifact, and not just the speed of development.
+The two styles of work lend themselves to different kinds of work actually getting done.
+Most of pull requests I merge are about new features, and some are about bug-fixes.
+Most of pull requests I submit are about refactoring existing code.
+Community naturally picks the work of incrementally adding new code, maintainers can refactor and rewrite existing code.
+It’s easy to see that, in the limit, this could end with an effectively immutable/append only code base.
+I think we are pretty far from the limit today, but I don’t exactly like the current dynamics.
+I keep coming back to this Rust 2019 post when I think about this issue.
+
The conclusion from this section is that we should find ways to fund teams of people to focus on improving the Rust programming language.
+Through luck, hard work of my colleagues at JetBrains and Ferrous Systems, and my own efforts it became possible to move in this direction for both IntelliJ Rust and rust-analyzer.
+This was pretty stressful, and, well, I feel that the marginal utility of one more compiler engineer is still huge in the IDE domain at least.
And now to something completely different!
+I want this:
+
+
+
That is, I want to simplify working on the compiler itself to it being just a crate.
+This section of the article expands on the comment I’ve made on the
+irlo
+a while ago.
+
For the past couple of months, I have been slowly pivoting from doing mostly green-field development in the rust-analyzer code base to refactoring rustc internals towards merging the two.
+The process has been underwhelming, and the slow and complicated build process plays a significant part in this: I feel like my own productivity is at least five times greater when I work on rust-analyzer in comparison to rustc.
+
Before I go into details about my vision here, I want to give shout-outs to
+@Mark-Simulacrum, @mark-i-m, and @jyn514
+who already did a lot of work on simplifying the build process in the recent several months.
+
Note that I am going to make a slightly deeper than “Rust in 20XX” dive into the topic, feel free to skip the rest of the post if technical details about bootstrapping process are not your cup of tea.
+
+Finally, I should also warn that I have an intern’s advantage here — I have absolutely no idea how Rust’s current build process works, so I describe how it should work from a position of ignorance. Without further ado,
rustc is a bootstrapping compiler.
+This means that, to compile rustc itself, one needs to have a previous version of rustc available.
+This could make the compiler’s build process peculiar.
+My thesis is that this doesn’t need to be the case, and that the compiler could be just a crate.
+
Bootstrapping does make this harder to see though, so, as a thought experiment, let’s imagine what rustc’s build process would look like were it not written in Rust.
+Let’s imagine the world where rustc is implemented in Go.
+How would one build and test this rust compiler?
+
First, we clone the rust-lang/rust repository.
+Then we download the latest version of the Go compiler — as we are shipping rustc binaries to the end user, it’s OK to require a cutting-edge compiler.
+But there’s probably some script or gvm config file to make getting the latest Go compiler easier.
+After that, go test builds the compiler and runs the unit tests.
+Unit tests take a snippet of Rust code as an input and check that the compiler correctly analyses the snippet: that the parse tree is correct, that diagnostics are emitted, that borrow checker correctly accepts or rejects certain problems.
+
What we can not check in this way is that the compiler is capable of producing a real binary which we can run (that is, run-pass tests).
+The reason for that is slightly subtle — to produce a binary, compiler needs to link the tested code with the standard library.
+But we’ve only compiled the compiler, we don’t have a standard library yet!
+
So, in addition to unit-tests, we also need somewhat ad-hoc integration tests, which assume that the compiler has been built already, use it to compile the standard library, and then compile, link, and run the corpus of test programs.
+Running std’s own #[test] tests is also a part of this integration testing.
+
Now, let’s see if the above setup has any bottlenecks:
+
+
+
Getting the Go compiler is fast and straightforward.
+In fact, it’s reasonable to assume that the user already has a recent Go compiler installed, and that they are familiar with standard Go workflows.
+
+
+
Compiling rustc would take a little while.
+On the one hand, Rust is a big language, and you need to spend quite a few lines of code to implement it.
+On the other hand, compilers are very straightforward programs, which don’t do a lot of IO, don’t have to deal with changing business requirements and don’t have a lot of dependencies.
+Besides, Go is a language known for fast compile times.
+So, spending something like five minutes on a quad-core machine for compiling the compiler seems reasonable.
+
+
+
After that, running unit-tests is a breeze: unit-tests do not depend on any state external to the test itself; we are testing pure functions.
+
+
+
The first integration test is compiling and #[test]ing std.
+As std is relatively small, compiling it with our compiler should be relatively fast.
+
+
+
Running tens of thousands of full integration tests will be slow.
+Each such test would need to do IO to read the source code, write the executable, and run the process.
+It is reasonable to assume that most potential failures are covered by the compiler’s and std’s unit tests.
+But it would be foolish to rely solely on those tests — a fully integrated test suite is important to make sure that the compiler indeed does what it is supposed to, and it is vital for comparing several independent implementations — who knows, maybe one day we’ll rewrite rustc from Go to Rust, and re-using the compiler’s unit tests would be much harder in that context.
+
+
+
So, it seems like, except for the final integration test suite, there are no complexity/performance bottlenecks in our setup for a from-scratch build.
+The problem with the integrated suite can be handled by running a subset of smoke tests by default, and only running the full set of integrated tests on CI.
+Testing is embarrassingly parallel, so a beefy CI fleet should handle that just fine.
+
What about incremental builds?
+Let’s say we want to contribute a change to std.
+First time around, this requires building the compiler, which is unfortunate.
+This is a one-time cost though, and it shouldn’t be prohibitive (or we will have troubles with changes to the compiler itself anyway).
+We can also cheat here, and just download some version of rustc from the internet to check std.
+This will mostly work, except for the bits where std and rustc need to know about each other (lang items and intrinsics).
+For those, we can use #[cfg(not(bootstrap))] in the std to compile different code for older versions of the compiler.
+This makes the std implementation mind-bending though, so a better alternative might be to just make CI publish the artifacts for the compiler built off the master branch.
+That is, if you only contribute to std, you download the latest compiler instead of building it yourself.
+We have a trade off between implementation complexity and compile times.
+
If we want to contribute a change to the compiler, then we are golden as long as it can be checked by the unit-tests (which, again, in theory is everything except for run-pass tests).
+If we need to run integrated tests with std, then we need to recompile std with the new compiler, after every change to the compiler.
+This is pretty unfortunate, but:
+
+
+if you fundamentally need to recompile std (for example, you change lang-items), there’s no way around this,
+
+
+if you don’t need to recompile std, then you can probably write an std-less unit-test,
+
+
+as an escape hatch, there might be some kind of KEEP_STDLIB env var, which causes integrated tests to re-use existing std, even if the compiler is newer.
+
+
+
To sum up, the compiler is just a program which does some text processing.
+In the modern world, full of distributed, highly-available, long-running systems, a compiler is actually a pretty simple program.
+It also is fairly easy to test.
+The hard bit is not the compiler itself, but the standard library: to even start building the standard library, we need to compile the compiler.
+However, most of the compiler can be tested without std, and std itself can be tested using compiler binary built from the master branch by CI.
In theory, it should be possible to replace Go from the last section with Rust, and get a similarly simple bootstrapping compiler.
+That is, we would use latest stable/beta Rust to compile rustc, then we’ll use this rustc to compile std, and we are done.
+We might add a sanity check — using the freshly built compiler & std, recompile the compiler again and check that everything works.
+This is optional, and in a sense just a subset of a crater run, where we check one specific crate — compiler itself.
+
However, today’s build is more complicated than that.
+
First, instead of using a “standard distribution” of the compiler for bootstrapping, x.py downloads a custom beta toolchain.
+This could and should be replaced with using rustup by default.
+
Second, master rustc requires master std to build.
+This is the bit which makes rustc not a simple crate.
+Remember how before the build started with just compiling the compiler as a usual program?
+Today, the rustc build starts with compiling master std using the beta compiler, then with compiling master rustc using master std and the beta compiler.
+So, there’s a requirement that std builds with both the master and beta compilers, and we also have this weird state where the versions of the compiler and std used to compile the code do not match. In other words, while #[cfg(not(bootstrap))] was an optimization in the previous section (which could be replaced with downloading a binary rustc from CI), today it is required.
+
Third, there’s not much in the way of unit tests in the compiler.
+Almost all tests require std, which means that, to test anything, one needs to rebuild everything.
+
Fourth, LLVM & linkers.
+A big part of “compilers are easy to test” is the fact that they are, in theory, closed systems interacting with the outside world in a limited well-defined way.
+In the real world, however, rustc relies on a bunch of external components to work, the biggest one of which is LLVM.
+Luckily, these external components are required only for making the final binary.
+The bulk of the compiler, analysis phases which reject invalid programs and lower valid ones, does not need them.
With all this in mind, here are specific steps which I believe would make the build process easier:
+
+
+Gear the overall build process and defaults to the “hacking on the compiler” use case.
+
+
+By default, rely on rust-toolchain file and rustup to get the beta compiler.
+
+
+Switch from x.py to something like cargo-xtask, to remove dependency on Python.
+
+
+Downgrade rustc’s libstd requirements to beta.
+Note that this refers solely to the std used to build rustc itself.
+rustc will use master std for building user’s code.
+
+
+Split compiler and std into separate Cargo workspaces.
+
+
+Make sure that, by default, rustc is using system llvm, or llvm downloaded from a CI server.
+Building llvm from source should require explicit opt-in.
+
+
+Make sure that cd compiler && cargo test just works.
+
+
+Add the ability to make a build of the compiler which can run check, but doesn’t do llvm-dependent codegen.
+
+
+Split the test suite into cross-platform codegen-less check part, and the fully-integrated part.
+
+
+Split the compiler itself into frontend and codegen parts, such that changes in frontend can be tested without linking backend, and changes in backend can be tested without recompiling the frontend.
+
+
+Stop building std with beta compiler and remove all #[cfg(bootstrap)].
+
+
+Somehow make cargo test just work in std.
+This will require some hackery to plug the logic for “build compiler from source or download from CI” somewhere.
+
+
+
At this stage, we have a compiler which is a 100% bog-standard crate, and std, which is almost a typical crate (it only requires a very recent compiler to build).
+
After this, we can start the standard procedure to optimize compile and test times, just how you would do for any other Rust project (I am planning to write a couple of posts on these topics).
+I have a suspicion that there’s a lot of low-hanging fruit there — one of the reasons why I am writing this post is that I’ve noticed that doctests in std are insanely slow, and that nobody complains about that just because everything else is even slower!
+
This post ended up being too technical for the genre, but, to recap, there seem to be two force multipliers we could leverage to develop Rust itself:
+
+
+Creating a space for small teams of people to work full-time on Rust.
+
+
+Simplifying hacking on the compiler to just cargo test.
+
This post describes my own pet theory of programming languages popularity.
+My understanding is that no one knows why some languages are popular and others aren’t, so there’s no harm done if I add my own thoughts to the overall confusion.
+Obviously, this is all wild speculation and a just-so story without any kind of data-backed research.
+
The central thesis is that the actual programming language (syntax, semantics, paradigm) doesn’t really matter.
+What matters is characteristics of the runtime — roughly, what does memory of the running process look like?
+
To start, an observation.
+A lot of software is written in vimscript and emacs lisp (magit being one example I can’t live without).
+And these languages are objectively bad.
+This happens even with less esoteric technologies, notable examples being PHP and JavaScript.
+While JavaScript is great in some aspects (it’s the first mainstream language with lambdas!), it surely isn’t hard to imagine a trivially better version of it (for example, without two different nulls).
+
This is a general rule — as soon as you have a language which is Turing-complete, and has some capabilities for building abstractions, people will just get the things done with it.
+Surely, some languages are more productive, some are less productive, but, overall, FP vs OOP vs static types vs dynamic types doesn’t seem super relevant.
+It’s always possible to overcome the language by spending some more time writing a program.
+
In contrast, overcoming language runtime is not really possible.
+If you want to extend vim, you kinda have to use vimscript.
+If you want your code to run in the browser, JavaScript is still the best bet.
+Need to embed your code anywhere? GC is probably not an option for you.
+
These two observations lead to the following hypothesis:
+
+
Let’s see some examples which can be “explained” by this theory.
+
+
C
+
+
C has a pretty spartan runtime, which is notable for two reasons.
+First, it was the first fast enough runtime for a high-level language.
+It was possible to write the OS kernel in C, which had been typically done in assembly before that for performance.
+Second, C is the language of Unix.
+(And yes, I would put C into the “easily improved upon” category of languages. Null-terminated strings are just a bad design).
+
+
JavaScript
+
+
+For quite some time, this has been the only language available in the browser.
+
+
Java
+
+
This case I think is the most interesting for the theory.
+A common explanation for Java’s popularity is “marketing by Sun”, and subsequent introduction of Java into University’s curricula.
+This doesn’t seem convincing to me.
+Let’s look at the 90’s popular languages (I am not sure about percentage and relative ranking here, but the composition seems broadly correct to me):
+
+
+
On this list, Java is the only non-dynamic cross-platform memory safe language.
+That is, Java is both memory safe (no manual error-prone memory management) and can be implemented reasonably efficiently (field access is a load and not a dictionary lookup).
+This seems like a pretty compelling reason to choose Java, irrespective of what the language itself actually looks like.
+
+
Go
+
+
One can argue whether focus on simplicity at the expense of everything else is good or bad, but statically linked zero dependency binaries definitely were a reason for Go popularity in the devops sphere.
+In a sense, Go is an upgrade over “memory safe & reasonably fast” Java runtime, when you no longer need to install JVM separately.
+
+
+
Naturally, there are also some things which are not explained by my hypothesis.
+One is scripting languages.
+A highly dynamic runtime with eval and ability to easily link C extensions indeed would be a differentiator, so we would expect a popular scripting language.
+However, it’s unclear why they are Python and PHP, and not Ruby and Perl.
+
Another one is language evolutions: C++ and TypeScript don’t innovate runtime-wise, yet they are still major languages.
+
Finally, let’s make some bold predictions using the theory.
+
First, I expect Rust to become a major language, naturally :)
+This needs some explanation — on the first blush, Rust is runtime-equivalent to C and C++, so the theory should predict just the opposite.
+But I would argue that memory safety is a runtime property, despite the fact that it is, uniquely to Rust, achieved exclusively via language machinery.
+
Second, I predict Julia to become more popular.
+It’s pretty unique, runtime-wise, with its stark rejection of Ousterhout’s Dichotomy and insisting that, yeah, we’ll just JIT a highly dynamic language to suuuper fast numeric code at runtime.
+
Third, I wouldn’t be surprised if Dart grows.
+On the one hand, it’s roughly in the same boat as Go and Java, with memory safe runtime with fixed layout of objects and pervasive dynamic dispatch.
+But the quality of implementation of the runtimes is staggering: it has first-class JIT, AOT and JS compilers.
+Moreover, it has top-notch hot-reload support.
+Nothing here is a breakthrough, but the combination is impressive.
+
Fourth, I predict that Nim, Crystal and Zig (which is very interesting, language design wise) would not become popular.
+
Fifth, I predict that Swift will be pretty popular on Apple hardware due to platform exclusivity, but won’t grow much outside of it, despite being very innovative in language design (generics in Swift are the opposite of the generics in Go).
I’ve recently read an article criticizing Rust, and, while it made a bunch of good points, I didn’t enjoy it — it was an easy to argue with piece.
+In general, I feel that I can’t recommend an article criticizing Rust.
+This is a shame — confronting drawbacks is important, and debunking low-effort/misinformed attempts at critique sadly inoculates against actually good arguments.
+
So, here’s my attempt to argue against Rust:
+
+
Not All Programming is Systems Programming
+
+
Rust is a systems programming language.
+It offers precise control over data layout and runtime behavior of the code, granting you maximal performance and flexibility.
+Unlike other systems programming languages, it also provides memory safety — buggy programs terminate in a well-defined manner, instead of unleashing (potentially security-sensitive) undefined behavior.
+
However, in many (most) cases, one doesn’t need ultimate performance or control over hardware resources.
+For these situations, modern managed languages like Kotlin or Go offer decent speed, enviable
+time to performance, and are memory safe by virtue of using a garbage collector for dynamic memory management.
+
+
Complexity
+
+
Programmer’s time is valuable, and, if you pick Rust, expect to spend some of it on learning the ropes.
+Rust community poured a lot of time into creating high-quality teaching materials, but the Rust language is big.
+Even if a Rust implementation would provide value for you, you might not have resources to invest into growing the language expertise.
+
Rust’s price for improved control is the curse of choice:
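(A rough sketch of the decision space for the simplest “Foo has a Bar” relationship; the type names are made up for illustration.)

```rust
use std::rc::Rc;
use std::sync::Arc;

struct Bar;

struct FooOwned   { bar: Bar }          // store Bar inline, by value
struct FooRef<'a> { bar: &'a Bar }      // borrow it; a lifetime shows up in the type
struct FooBoxed   { bar: Box<Bar> }     // own it, but behind a heap allocation
struct FooRc      { bar: Rc<Bar> }      // shared ownership, single-threaded
struct FooArc     { bar: Arc<Bar> }     // shared ownership, thread-safe
```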
+
+
+
In Kotlin, you write class Foo(val bar: Bar), and proceed with solving your business problem.
+In Rust, there are choices to be made, some important enough to have dedicated syntax.
+
All this complexity is there for a reason — we don’t know how to create a simpler memory safe low-level language.
+But not every task requires a low-level language to solve it.
Compile times are a multiplier for everything.
+A program written in a slower-to-run but faster-to-compile programming language can be faster to run, because the programmer will have more time to optimize!
+
Rust intentionally picked slow compilers in the generics dilemma.
+This is not necessarily the end of the world (the resulting runtime performance improvements are real), but it does mean that you’ll have to fight tooth and nail for reasonable build times in larger projects.
+
rustc implements what is probably the most advanced incremental compilation algorithm in production compilers, but this feels a bit like fighting with language compilation model.
+
+Unlike C++, the Rust build is not embarrassingly parallel; the amount of parallelism is limited by the length of the critical path in the dependency graph.
+If you have 40+ cores to compile, this shows.
+
Rust also lacks an analog for the pimpl idiom, which means that changing a crate requires recompiling (and not just relinking) all of its reverse dependencies.
+
+
Maturity
+
+
Five years old, Rust is definitely a young language.
+Even though its future looks bright, I will bet more money on “C will be around in ten years” than on “Rust will be around in ten years”
+(See Lindy Effect).
+If you are writing software to last decades, you should seriously consider risks associated with picking new technologies.
+(But keep in mind that picking Java over Cobol for banking software in 90s retrospectively turned out to be the right choice).
+
There’s only one complete implementation of Rust — the rustc compiler.
+The most advanced alternative implementation, mrustc, purposefully omits many static safety checks.
+rustc at the moment supports only a single production-ready backend — LLVM.
+Hence, its support for CPU architectures is narrower than that of C, which has GCC implementation as well as a number of vendor specific proprietary compilers.
+
Finally, Rust lacks an official specification.
+The reference is a work in progress, and does not yet document all the fine implementation details.
+
+
Alternatives
+
+
There are other languages besides Rust in systems programming space, notably, C, C++, and Ada.
+
Modern C++ provides tools and guidelines for improving safety.
+There’s even a proposal for a Rust-like lifetimes mechanism!
+Unlike Rust, using these tools does not guarantee the absence of memory safety issues.
+Modern C++ is safer, Rust is safe.
+However, if you already maintain a large body of C++ code, it makes sense to check if following best practices and using sanitizers helps with security issues.
+This is hard, but clearly is easier than rewriting in another language!
+
If you use C, you can use formal methods to prove the absence of undefined behaviors, or just exhaustively test everything.
+
Ada is memory safe if you don’t use dynamic memory (never call free).
+
Rust is an interesting point on the cost/safety curve, but is far from the only one!
+
+
Tooling
+
+
Rust tooling is a bit of a hit and miss.
+The baseline tooling, the compiler and the build system
+(cargo), are often cited as best in class.
+
But, for example, some runtime-related tools (most notably, heap profiling) are just absent — it’s hard to reflect on the runtime of the program if there’s no runtime!
+Additionally, while IDE support is decent, it is nowhere near the Java-level of reliability.
+Automated complex refactors of multi-million line programs are not possible in Rust today.
+
+
Integration
+
+
Whatever the Rust promise is, it’s a fact of life that today’s systems programming world speaks C, and is inhabited by C and C++.
+Rust intentionally doesn’t try to mimic these languages — it doesn’t use C++-style classes or C ABI.
+
That means that integration between the worlds needs explicit bridges.
+These are not seamless.
+They are unsafe, not always completely zero-cost and need to be synchronized between the languages.
+While the general promise of piece-wise integration holds up and the tooling catches up, there is accidental complexity along the way.
+
One specific gotcha is that Cargo’s opinionated world view (which is a blessing for pure Rust projects) might make it harder to integrate with a bigger build system.
+
+
Performance
+
+
“Using LLVM” is not a universal solution to all performance problems.
+While I am not aware of benchmarks comparing performance of C++ and Rust at scale, it’s not too hard to come up with a list of cases where Rust leaves some performance on the table relative to C++.
+
The biggest one is probably the fact that Rust’s move semantics is based on values (memcpy at the machine code level).
+In contrast, C++ semantics uses special references you can steal data from (pointers at the machine code level).
+In theory, the compiler should be able to see through a chain of copies; in practice, it often doesn’t: #57077.
+A related problem is the absence of placement new — Rust sometimes needs to copy bytes to/from the stack, while C++ can construct the thing in place.
+
Somewhat amusingly, Rust’s default ABI (which is not stable, to make it as efficient as possible) is sometimes worse than that of C: #26494.
+
Finally, while in theory Rust code should be more efficient due to the significantly richer aliasing information, enabling aliasing-related optimizations triggers LLVM bugs and miscompilations: #54878.
+
But, to reiterate, these are cherry-picked examples, sometimes the field is tilted the other way.
+For example, std::unique_ptr has a performance problem which Rust’s Box lacks.
+
A potentially bigger issue is that Rust, with its definition time checked generics, is less expressive than C++.
+So, some C++ template tricks for high performance are not expressible in Rust using a nice syntax.
+
+
Meaning of Unsafe
+
+
An idea which is even more core to Rust than ownership & borrowing is perhaps that of unsafe boundary.
+That, by delineating all dangerous operations behind unsafe blocks and functions and insisting on providing a safe higher-level interface to them, it is possible to create a system which is both
+
+
+sound (non-unsafe code can’t cause undefined behavior),
+
+
+and modular (different unsafe blocks can be checked separately).
+
+
+
It’s pretty clear that the promise works out in practice: fuzzing Rust code unearths panics, not buffer overruns.
+
But the theoretical outlook is not as rosy.
+
+First, there’s no definition of the Rust memory model, so it is impossible to formally check if a given unsafe block is valid or not.
+There’s an informal definition of “things rustc does or might rely on” and an in-progress runtime verifier, but the actual model is in flux.
+So there might be some unsafe code somewhere which works OK in practice today, but might be declared invalid tomorrow and broken by a new compiler optimization next year.
+
Second, there’s also an observation that unsafe blocks are not, in fact, modular.
+Sufficiently powerful unsafe blocks can, in effect, extend the language.
+Two such extensions might be fine in isolation, but lead to undefined behavior if used simultaneously:
+Observational equivalence and unsafe code.
Here are some thing I have deliberately omitted from the list:
+
+
+Economics (“it’s harder to hire Rust programmers”) — I feel that the “maturity” section captures the essence of it that is not reducible to a chicken-and-egg problem.
+
+
+Dependencies (“stdlib is too small / everything has too many deps”) — given how good Cargo and the relevant parts of the language are, I personally don’t see this as a problem.
+
+
+Dynamic linking (“Rust should have stable ABI”) — I don’t think this is a strong argument. Monomorphization is pretty fundamentally incompatible with dynamic linking and there’s C ABI if you really need to. I do think that the situation here can be improved, but I don’t think that improvement needs to be Rust-specific.
+
Rust thread-locals are slower than they could be.
+This is because they violate the zero-cost abstraction principle, specifically the “you don’t pay for what you don’t use” bit.
+
Rust’s thread-local implementation(
+1,
+2
+) comes with built-in support for laziness — thread locals are initialized on the first access.
+Sometimes this overhead is a big deal, as thread locals are a common tool for writing high-performance code.
+For example, allocator fast path often involves looking into thread-local heap.
+
There’s an unstable #[thread_local] attribute for a zero-cost implementation
+(see the tracking issue).
+
Let’s see how much the “is the thread local initialized?” check costs by comparing these two programs:
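(The Rust side looks roughly like the sketch below; the iteration count here is arbitrary, and the C version mirrors it with a _Thread_local global.)

```rust
use std::cell::Cell;
use std::time::Instant;

thread_local! {
    static ACC: Cell<u64> = Cell::new(0);
}

fn main() {
    let n: u64 = 1_000_000_000; // arbitrary iteration count for the sketch
    let start = Instant::now();
    for step in 0..n {
        let term = step.wrapping_mul(step) ^ step;
        ACC.with(|acc| acc.set(acc.get().wrapping_add(term)));
    }
    let result = ACC.with(|acc| acc.get());
    println!("{} in {:?}", result, start.elapsed());
}
```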
+
+
+
+
+
In this test, we declare an integer thread-local variable, and use it as an accumulator for the summation.
+
+We use a non-trivial summation term, (step * step) ^ step — this is to prevent LLVM from evaluating the sum at compile time.
+If a term of a summation is a polynomial (like 1, step or step * step), then the sum itself is a one degree higher polynomial, and LLVM can figure this out!
+We rely on wrapping overflow of unsigned integers in C, and use wrapping_mul and wrapping_add in Rust.
+To make sure that both programs are equivalent, we also print the result.
+
One optimization we specifically don’t protect from is caching thread-local access.
+That is, instead of doing a billion thread-local loads and stores, the compiler could generate code to compute the sum into a local variable, and do a single store at the end.
+This is because “can the compiler optimize thread-local access?” is exactly the property we want to measure.
+
There’s no standard way to get monotonic wall-clock time in C, so the C version is not cross-platform.
+
This code gives the following results on my machine:
+
+
+
This benchmark doesn’t allow us to measure the cost of thread-local access per se, but the overall time is about 2x longer for Rust.
+
Can we make Rust faster?
+I don’t know how to do that, but I know how to cheat.
+We can apply a general Rust extension trick — write some C code and link it with Rust!
+
Let’s implement a simple C library which declares a thread-local and provides access to it:
+
+
+
Link it with Rust:
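(One way to do it — a sketch of a build.rs using the cc crate, assuming the C file is named tls.c and cc is listed under [build-dependencies].)

```rust
// build.rs
fn main() {
    // Compile the C file and link the resulting static library into the crate.
    cc::Build::new().file("tls.c").compile("tls");
}
```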
+
+
+
And use it:
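(A sketch; the get_tls/set_tls names are hypothetical stand-ins for whatever the C library actually exports.)

```rust
extern "C" {
    fn get_tls() -> u64;       // hypothetical accessor exported by the C library
    fn set_tls(value: u64);    // hypothetical setter exported by the C library
}

fn main() {
    let n: u64 = 1_000_000_000; // same arbitrary iteration count as before
    for step in 0..n {
        let term = step.wrapping_mul(step) ^ step;
        unsafe { set_tls(get_tls().wrapping_add(term)) }
    }
    println!("{}", unsafe { get_tls() });
}
```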
+
+
+
The results are underwhelming:
+
+
+
This is expected — we replaced access to a thread local with a function call.
+As we are crossing the language boundary, the compiler can’t inline it, which destroys performance.
+However, there’s a way around that: Rust allows cross-language Link Time Optimization (docs).
+That is, Rust and C compilers can cooperate, to allow the linker to do inlining across the languages.
+
This requires manually aligning a bunch of stars:
+
+
+
The C compiler, the Rust compiler and the linker must use the same version of LLVM.
+As you might have noticed, this excludes gcc.
+I had luck with rustc 1.46.0, clang 10.0.0, and LLD 10.0.0.
+
+
+
-flto=thin in the C compiler flags.
+
+
+
RUSTFLAGS:
+
+
+
+
+
Now, just recompiling the old code gives the same performance for C and Rust:
+
+
+
Interestingly, this is the same performance we get without any thread-locals at all:
+
+
+
+So, either the compiler/linker was able to lift thread-local access out of the loop, or its cost is masked by the arithmetic.
+
Full code for the benchmarks is available at https://github.com/matklad/ftl.
+Note that this research only scratches the surface of the topic: thread locals are implemented differently on different OSes.
+Even on a single OS, there can be differences depending on compilation flags (dynamic libraries differ from static libraries, for example).
+Looking at the generated assembly could also be illuminating (code on Compiler Explorer).
In this article, we’ll dissect the implementation of the std::io::Error type from Rust’s standard library.
+The code in question is here:
+library/std/src/io/error.rs.
+
You can read this post as either of:
+
+
+A study of a specific bit of standard library.
+
+
+An advanced error management guide.
+
+
+A case of a beautiful API design.
+
+
+
+The article requires basic familiarity with Rust error handling.
+
+
When designing an Error type for use with Result<T, E>, the main question to ask is “how will the error be used?”.
+Usually, one of the following is true.
+
+
+
The error is handled programmatically.
+The consumer inspects the error, so its internal structure needs to be exposed to a reasonable degree.
+
+
+
The error is propagated and displayed to the user.
The consumer doesn't inspect the error beyond its fmt::Display output, so its internal structure can be encapsulated.
+
+
+
Note that there’s a tension between exposing implementation details and encapsulating them. A common anti-pattern for implementing the first case is to define a kitchen-sink enum:
+
+
+
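To make the shape concrete, such an enum might look something like this (the variants other than ConnectionDiscovery, which is discussed below, are invented for illustration):

```rust
use std::io;

// Stand-ins for error types coming from dependencies.
#[derive(Debug)]
pub struct TlsError { pub details: String }
#[derive(Debug)]
pub struct DiscoveryError { pub attempts: u32, pub endpoints: Vec<String> }

// The kitchen-sink enum: every failure mode of every layer, exposed directly.
#[derive(Debug)]
pub enum ClientError {
    Io(io::Error),
    Tls(TlsError),                       // a dependency's error in the public API
    ConnectionDiscovery(DiscoveryError), // a large variant inflating size_of::<ClientError>()
    InvalidInput(String),
}
```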
There are a number of problems with this approach.
+
First, exposing errors from underlying libraries makes them a part of your public API.
A major semver bump in a dependency would require you to make a new major version as well.
+
Second, it sets all the implementation details in stone.
+For example, if you notice that the size of ConnectionDiscovery is huge, boxing this variant would be a breaking change.
+
Third, it is usually indicative of a larger design issue.
+Kitchen sink errors pack dissimilar failure modes into one type.
+But, if failure modes vary widely, it probably isn’t reasonable to handle them!
This is an indication that the situation looks more like case two.
+
+
+
However bad the enum approach might be, it does achieve maximum inspectability of the first case.
+
The propagation-centered second case of error management is typically handled by using a boxed trait object.
+A type like Box<dyn std::error::Error> can be constructed from any specific concrete error, can be printed via Display, and can still optionally expose the underlying error via dynamic downcasting.
+The anyhow crate is a great example of this style.
+
The case of std::io::Error is interesting because it wants to be both of the above and more.
+
+
+This is std, so encapsulation and future-proofing are paramount.
+
+
+IO errors coming from the operating system often can be handled (for example, EWOULDBLOCK).
+
+
+For a systems programming language, it’s important to expose the underlying OS error exactly.
+
+
The set of potential future OS errors is unbounded.
+
+
+io::Error is also a vocabulary type, and should be able to represent some not-quite-os errors.
For example, Rust Paths can contain internal 0 bytes, and opening such a path should return an io::Error before making a syscall.
+
+
+
Here’s what std::io::Error looks like:
+
+
+
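Abridged and slightly simplified, the layout is roughly the following (consult the linked error.rs for the real definitions, which may differ in detail):

```rust
use std::io::ErrorKind;

pub struct Error {
    repr: Repr,
}

enum Repr {
    Os(i32),             // a raw OS error code
    Simple(ErrorKind),   // just a category, no payload
    Custom(Box<Custom>), // an arbitrary boxed error plus its category
}

struct Custom {
    kind: ErrorKind,
    error: Box<dyn std::error::Error + Send + Sync>,
}
```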
First thing to notice is that it’s an enum internally, but this is a well-hidden implementation detail.
To allow inspecting and handling of various error conditions, there's a separate public fieldless kind enum:
+
+
+
Although both ErrorKind and Repr are enums, publicly exposing ErrorKind is much less scary.
A #[non_exhaustive] Copy fieldless enum's design space is a point — there are no plausible alternatives or compatibility hazards.
+
Some io::Errors are just raw OS error codes:
+
+
+
The platform-specific sys::decode_error_kind function takes care of mapping error codes to the ErrorKind enum.
+All this together means that code can handle error categories in a cross-platform way by inspecting the .kind().
+However, if the need arises to handle a very specific error code in an OS-dependent way, that is also possible.
+The API carefully provides a convenient abstraction without abstracting away important low-level details.
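For example, a caller can handle the portable category via .kind() and still peek at the raw code when needed (a sketch):

```rust
use std::io::{self, ErrorKind, Read};

fn read_some(src: &mut impl Read, buf: &mut [u8]) -> io::Result<usize> {
    match src.read(buf) {
        Ok(n) => Ok(n),
        // Cross-platform handling: inspect the category via .kind().
        Err(e) if e.kind() == ErrorKind::Interrupted => Ok(0),
        Err(e) => {
            // OS-specific handling is still possible via the raw code.
            if let Some(code) = e.raw_os_error() {
                eprintln!("raw OS error: {}", code);
            }
            Err(e)
        }
    }
}
```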
+
An std::io::Error can also be constructed from an ErrorKind:
+
+
+
This provides cross-platform access to error-code style error handling.
+This is handy if you need the fastest possible errors.
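For instance (a sketch, not code from std):

```rust
use std::io::{self, ErrorKind};

fn lookup(key: &str) -> io::Result<String> {
    // Constructed from just a kind: no allocation, no OS error code.
    if key.is_empty() {
        return Err(io::Error::from(ErrorKind::InvalidInput));
    }
    Ok(format!("value for {}", key))
}
```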
+
Finally, there’s a third, fully custom variant of the representation:
+
+
+
Things to note:
+
+
+
Generic new function delegates to monomorphic _new function.
+This improves compile time, as less code needs to be duplicated during monomorphization.
+I think it also improves the runtime a bit: the _new function is not marked as inline, so a function call would be generated at the call-site.
+This is good, because error construction is the cold-path and saving instruction cache is welcome.
+
+
+
The Custom variant is boxed — this is to keep overall size_of smaller.
+On-the-stack size of errors is important: you pay for it even if there are no errors!
+
+
+
Both these types refer to a 'static error:
+
+
+
In a dyn Trait + '_, the '_ is elided to 'static, unless the trait object is behind a reference, in which case it is elided as &'a dyn Trait + 'a.
+
+
+
get_ref, get_mut and into_inner provide full access to the underlying error.
Similarly to the OS error case, the abstraction blurs details, but also provides hooks to get the underlying data as-is.
+
+
+
Similarly, Display implementation reveals the most important details about internal representation.
+
+
+
To sum up, std::io::Error:
+
+
encapsulates its internal representation and optimizes it by boxing the large enum variant,


provides a convenient way to handle errors based on category via the ErrorKind pattern,


fully exposes the underlying OS error, if any,
+
+
+can transparently wrap any other error type.
+
+
+
The last point means that io::Error can be used for ad-hoc errors, as &str and String are convertible to Box<dyn std::error::Error>:
+
+
+
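A sketch of what that enables:

```rust
use std::io;

fn parse_header(bytes: &[u8]) -> io::Result<u32> {
    if bytes.len() < 4 {
        // A &str (or String) converts into Box<dyn Error + Send + Sync>,
        // so it can be used directly as an ad-hoc error payload.
        return Err(io::Error::new(io::ErrorKind::InvalidData, "header is truncated"));
    }
    Ok(u32::from_le_bytes([bytes[0], bytes[1], bytes[2], bytes[3]]))
}
```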
It also can be used as a simple replacement for anyhow.
I think some libraries might simplify their error handling with this:
+
+
+
For example, serde_json provides the following method:
+
+
+
Read can fail with io::Error, so serde_json::Error needs to be able to represent io::Error internally.
+I think this is backwards (but I don’t know the whole context, I’d be delighted to be proven wrong!), and the signature should have been this instead:
+
+
+
Then, serde_json::Error wouldn't have an Io variant, and would instead be stashed into an io::Error with the InvalidData kind.
+
+
+
I think std::io::Error is a truly marvelous type, which manages to serve many different use-cases without much compromise.
+But can we perhaps do better?
+
The number one problem with std::io::Error is that, when a file-system operation fails, you don’t know which path it has failed for!
+This is understandable — Rust is a systems language, so it shouldn’t add much fat over what OS provides natively.
The OS returns an integer return code, and coupling that with a heap-allocated PathBuf could be an unacceptable overhead!
+
+
+
I don’t know an obviously good solution here.
One option would be to add a compile-time (once we get std-aware cargo) or runtime (a-la RUST_BACKTRACE) switch to heap-allocate all path-related IO errors.
+A similarly-shaped problem is that io::Error doesn’t carry a backtrace.
+
The other problem is that std::io::Error is not as efficient as it could be:
+
+
+
Its size is pretty big:
+
+
+
+
+
For the custom case, it incurs double indirection and an allocation:
+
+
+
+
+
I think we can fix this now!
+
First, we can get rid of double indirection by using a thin trait object, a-la
+failure or
+anyhow.
Now that GlobalAlloc exists, it's a relatively straightforward implementation.
+
Second, we can make use of the fact that pointers are aligned, and stash both Os and Simple variants into usize with the least significant bit set.
+I think we can even get creative and use the second least significant bit, leaving the first one as a niche.
+That way, even something like io::Result<i32> can be pointer-sized!
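Here is a rough sketch of that bit-packing idea (not the actual std layout, and glossing over provenance and Drop):

```rust
// A Box is at least 4-byte aligned, so a real pointer always has its two
// low bits equal to 00. The pointer-less variants can then be tagged with
// the remaining bit patterns.
struct PackedError(usize);

const TAG_OS: usize = 0b01;
const TAG_SIMPLE: usize = 0b10;

impl PackedError {
    fn os(code: i32) -> PackedError {
        PackedError(((code as u32 as usize) << 2) | TAG_OS)
    }
    fn simple(kind_discriminant: u8) -> PackedError {
        PackedError((usize::from(kind_discriminant) << 2) | TAG_SIMPLE)
    }
    fn custom(payload: Box<String>) -> PackedError {
        // Low bits 00: the untouched pointer itself.
        PackedError(Box::into_raw(payload) as usize)
    }
    fn is_os(&self) -> bool {
        self.0 & 0b11 == TAG_OS
    }
}
```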
+
And this concludes the post.
+Next time you’ll be designing an error type for your library, take a moment to peer through
+sources
+of std::io::Error, you might find something to steal!
These are my notes after learning the Paxos algorithm.
+The primary goal here is to sharpen my own understanding of the algorithm, but maybe someone will find this explanation of Paxos useful!
+This post assumes fluency with mathematical notation.
Paxos is an algorithm for implementing distributed consensus.
+Suppose you have N machines which communicate over a faulty network.
The network may delay, reorder, and lose messages (it cannot corrupt them, though).
Some machines might die, and might return later.
Due to network delays, "machine is dead" and "machine is temporarily unreachable" are indistinguishable.
What we want to do is to make machines agree on some value.
"Agree" here means that if some machine says "value is X", and another machine says "value is Y", then X is necessarily equal to Y.
It is OK for a machine to answer "I don't know yet".
+
The problem with this formulation is that Paxos is an elementary, but subtle algorithm.
+To understand it (at least for me), a precise, mathematical formulation is needed.
+So, let’s try again.
+
What is Paxos?
+Paxos is a theorem about sets!
+This is definitely mathematical, and is true (as long as you base math on set theory), but is not that helpful.
+So, let’s try again.
+
What is Paxos?
+Paxos is a theorem about nondeterministic state machines!
+
A system is characterized by a state.
+The system evolves in discrete steps: each step takes system from state to state'.
Transitions are non-deterministic: from a single current state s1, you may get to different next states s2 and s3
(non-determinism models a flaky network).
+An infinite sequence of system’s states is called a behavior:
+
+
+
Due to non-determinism, there’s a potentially infinite number of possible behaviors.
+Nonetheless, depending on the transition function, we might be able to prove that some condition is true for any state in any behavior.
+
Let’s start with a simple example, and also introduce some notation.
+I won’t use TLA+, as I don’t enjoy its concrete syntax.
+Instead, math will be set in monospaced unicode.
+
The example models an integer counter.
+Each step the counter decrements or increments (non-deterministically), but never gets too big or too small
+
+
+
The state of the system is a single variable — counter.
+It holds a natural number.
+In general, we will represent a state of any system by a fixed set of variables.
+Even if the system logically consists of several components, we model it using a single unified state.
+
The Init formula specifies the initial state, the counter is zero.
+Note that = is a mathematical equality, and not an assignment.
+Init is a predicate on states.
+
Init is true for {counter: 0}.
+Init is false for {counter: 92}.
+
Next defines a non-deterministic transition function.
+It is a predicate on pairs of states, s1 and s2.
+counter is a variable in the s1 state, counter' is the corresponding variable in the s2 state.
+In plain English, transition from s1 to s2 is valid if one of these is true:
+
+
+Value of counter in s1 is less than 9 and value of counter in s2 is greater by 1.
+
+
+Value of counter in s1 is greater than 0, and value of counter in s2 is smaller by 1.
+
+
+
Next is true for ({counter: 5}, {counter: 6}).
+Next is false for ({counter: 5}, {counter: 5}).
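To make the notation concrete, here is the same specification transliterated into executable Rust predicates (purely illustrative; the post itself sticks to monospaced math):

```rust
// The state is the single variable `counter`.
type State = i64;

fn init(s: State) -> bool {
    s == 0
}

// Next is a predicate on pairs of states: increment while below 9,
// or decrement while above 0.
fn next(s1: State, s2: State) -> bool {
    (s1 < 9 && s2 == s1 + 1) || (s1 > 0 && s2 == s1 - 1)
}

fn main() {
    assert!(init(0) && !init(92));
    assert!(next(5, 6) && !next(5, 5));
}
```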
+
Here are some behaviors of this system:
+
+
+0 → 1 → 2 → 3 → 4 → 5 → 6 → 7 → 8 → 9
+
+
+0 → 1 → 0 → 1 → 0 → 1
+
+
+0 → 1 → 2 → 3 → 2 → 1 → 0
+
+
+
Here are some non-behaviors of this system:
+
+
+1 → 2 → 3 → 4 → 5: Init does not hold for initial state
+
+
+0 → 2: Next does not hold for (0, 2) pair
+
+
+0 → 1 → 0 → -1: Next does not hold for (0, -1) pair
+
+
+
“behavior” means that the initial state satisfies Init, and each transition satisfies Next.
+
We can state and prove a theorem about this system: for every state in every behavior, the value of counter is between 0 and 9.
+Proof is by induction:
+
+
+The condition is true in the initial state.
+
+
+If the condition is true for state s1, and Next holds for (s1, s2), then the condition is true for s2.
+
+
+QED.
+
+
+
As usual with induction, sometimes we would want to prove a stronger property, because it gives us more powerful base for an induction step.
+
To sum up, we define a non-deterministic state machine using two predicates Init and Next.
+Init is a predicate on states which restricts possible initial states.
+Next is a predicate on pairs of states, which defines a non-deterministic transition function.
+Vars section describes the state as a fixed set of typed variables.
+Sets defines auxiliary fixed sets, elements of which are values of variables.
+Theorem section specifies a predicate on behaviors: sequences of steps evolving according to Init and Next.
+
The theorem does not automatically follow from Init and Next, it needs to be proven.
+Alternatively, we can simulate a range of possible behaviors on a computer and check the theorem for the specific cases.
+If the set of reachable states is small enough (finite would be a good start), we can enumerate all behaviors and produce a brute force proof.
+If there are too many reachable states, we can’t prove the theorem this way, but we often can prove it to be wrong, by finding a counter example.
+This is the idea behind model checking in general and TLA+ specifically.
Having mastered the basic vocabulary, let’s start slowly building towards Paxos.
+We begin with defining what consensus is.
+As this is math, we’ll do it using sets.
+
+
+
The state of the system is a set of chosen values.
+For this set to constitute consensus (over time) we need two conditions to hold:
+
+
+at most one value is chosen
+
+
+if we choose a value at one point in time, we stick to it (math friendly: any two chosen values are equal to each other)
+
+
+
Here’s the simplest possible implementation of consensus:
+
+
+
In the initial state, the set of chosen values is empty.
+We can make a step if the current set of chosen values is empty, in which case we select an arbitrary value.
+
This technically breaks our behavior theory: we require behaviors to be infinite, but, for this spec, we can only make a single step.
+The fix is to allow empty steps: a step which does not change the state at all is always valid.
+We call such steps “stuttering steps”.
+
The proof of the first condition of the consensus theorem is a trivial induction.
+The proof of the second part is actually non-trivial, here’s a sketch.
+Assume that i and j are indices, which violate the condition.
+They might be far from each other in state-space, so we can’t immediately apply Next.
+So let’s choose the smallestj1 ∈ [i+1;j] such that the condition is violated.
+Let i1 = j1 - 1.
+The condition is still violated for (i1, j1) pair, but this time they are subsequent steps, and we can show that Next does not hold for them, concluding the proof.
+
Yay! We have a distributed consensus algorithm which works for 1 (one) machine:
Let’s try to extend this to a truly distributed case, where we have N machines (“acceptors”).
+We start with formalizing the naive consensus algorithm: let acceptors vote for values, and select the value which gets a majority of votes.
+
+
+
The state of the system is the set of all votes cast by all acceptors.
+We represent a vote as a pair of an acceptor and the value it voted for.
+Initially, the set of votes is empty.
+On each step, some acceptor casts a vote for some value (adds (a, v) pair to the set of votes), but only if it hasn’t voted yet.
+Remember that Next is a predicate on pairs of states, so we check votes for existing vote, but add a new one to votes'.
The value is chosen if the set of acceptors which voted for the value ({a ∈ 𝔸: (a, v) ∈ votes}) is more than half as large as the set of all acceptors.
+In other words, if a majority of acceptors has voted for the value.
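Purely as an illustration, the same rules can be transliterated into Rust predicates (assuming, say, five acceptors):

```rust
use std::collections::BTreeSet;

type Acceptor = u32;
type Value = u32;
// The state: the set of all (acceptor, value) votes cast so far.
type Votes = BTreeSet<(Acceptor, Value)>;

const ACCEPTORS: u32 = 5;

fn init(votes: &Votes) -> bool {
    votes.is_empty()
}

// One acceptor casts one vote, and only if it hasn't voted before.
fn next(votes1: &Votes, votes2: &Votes) -> bool {
    if !(votes1.is_subset(votes2) && votes2.len() == votes1.len() + 1) {
        return votes1 == votes2; // otherwise, only a stuttering step is allowed
    }
    let &(a, _v) = votes2.difference(votes1).next().unwrap();
    votes1.iter().all(|&(a1, _)| a1 != a)
}

// A value is chosen if a majority of acceptors voted for it.
fn chosen(votes: &Votes, v: Value) -> bool {
    let supporters = votes.iter().filter(|&&(_, v1)| v1 == v).count() as u32;
    2 * supporters > ACCEPTORS
}
```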
+
+
+
Let’s prove consensus theorem for Majority Vote protocol.
+TYPE ERROR, DOES NOT COMPUTE.
The consensus theorem is a predicate on behaviors of states consisting of the chosen variable.
Here, chosen isn't a variable, votes is!
chosen is a function which maps the current state to the set of chosen values.
+
While it is intuitively clear what “consensus theorem” would look like for this case, let’s make this precise.
+Let’s map states with votes variable to states with chosen variable using the majority rule, f.
+This mapping naturally extends to a mapping between corresponding behaviors (sequences of steps):
+
+
+
Now we can precisely state that for every behavior B of majority voting spec, the theorem holds for f(B).
+This yields a better way to prove this!
+Instead of proving the theorem directly (which would again require i1, j1 trick), we prove that our mapping f is a homomorphism.
+That is, we prove that if votes_0 → votes_1 → ... is a behavior of the majority voting spec, then f(votes_0) → f(votes_1) → ... is a behavior of the consensus spec.
This lets us re-use the existing proof.
+
The proof for the initial step is trivial, but let's spell it out just to appreciate the amount of detail a human mind can glance through:
+
+
+
Let’s show that if Majority Vote’s Next_m holds for (votes, votes'), then Consensus’s Next_c holds for (f(votes), f(votes')).
+There’s one obstacle on our way: this claim is false!
+Consider a case with three acceptors and two values: 𝔸 = {a1, a2, a3}, 𝕍 = {v1, v2}.
+Consider these values of votes and votes':
+
+
+
If you just mechanically check Next, you see that it works!
+a3 hasn’t cast its vote, so it can do this now.
+The problem is that chosen(votes) = {v1} and chosen(votes') = {v1, v2}.
+
We are trying to prove too much!
+f works correctly only for states reachable from Init, and the bad value of votes where a1 votes twice is not reachable.
+
So, we first should prove a lemma: each acceptor votes at most once.
+After that, we can prove Next_m(votes, votes') = Next_c(f(votes), f(votes')) under the assumption of at most once voting.
Specifically, if |f(votes')| turns out to be larger than 1, then we can pick two majorities which voted for different values, which allows us to pin down a single acceptor which voted twice, which is a contradiction.
+The rest is left as an exercise for the reader :)
+
So, majority vote indeed implements consensus.
+Let’s look closer at the “majority” condition.
+It is clearly important.
+If we define chosen as
+
+
+
then it's easy to construct a behavior with several chosen values.
+The property of majority we use is that any two majorities have at least one acceptor in common.
+But any other condition with this property would work as well as majority.
For example, we can assign an integer weight to each acceptor, and require the sum of the voters' weights to be more than half of the total weight.
As a more specific example, consider a set of four acceptors {a, b, c, d}.
+
Its majorities are:
+
+
+
But the following set of sets would also satisfy non-empty intersection condition:
+
+
+
Operationally, it is strictly better, as fewer acceptors are needed to reach a decision.
+
So let’s refine the protocol to a more general form.
+
+
+
We require specifying a set of quorums — a set of subsets of acceptors such that every two quorums have at least one acceptor in common.
+The value is chosen if there exists a quorum such that its every member voted for the value.
+
There’s one curious thing worth noting here.
+Consensus is a property of the whole system, there’s no single “place” where we can point to and say “hey, this is it, this is consensus”.
+Imagine 3 acceptors, sitting on Earth, Venus, and Mars, and choosing between values v1 and v2.
+They can execute Quorum Vote algorithm without communicating with each other at all.
They will necessarily reach consensus without knowing which specific value they agreed on!
+An external observer can then travel to the three planets, collect the votes and discover the chosen value, but this feature isn’t built into the algorithm itself.
+
OK, so we’ve just described an algorithm for finding consensus among N machines, proved the consensus theorem for it, and noted that it has staggering communication efficiency: zero messages.
+Should we collect our Turing Award?
+
Well, no, there’s a big problem with Quorum Vote — it can get stuck.
+Specifically, if there are three values, and the votes are evenly split between them, then no value is chosen, and only stuttering steps are possible.
If acceptors can vote for different values, it might happen that no value receives a majority of votes.
+Voting satisfies the safety property, but not the liveness property — the algorithm can get stuck even if all machines are on-line and communication is perfect.
+
There is a simple fix to the problem, with a rich historical tradition among many “democratic” governments.
+Let’s have a vote, and let’s pick the value chosen by the majority, but let’s allow to vote only for a single candidate value:
+
+
+
The new condition says that an acceptor is only allowed to cast a vote if all other votes are for the same value.
+As a special case, if the set of votes is empty, the acceptor can vote for any value (but all other acceptors would have to vote for this value afterwards).
+
From a mathematical point of view, this algorithm is perfect.
From a practical standpoint, not so much: an acceptor casting the first vote somehow needs to make sure that it is indeed the first one.
+The obvious fix to this problem is to assign a unique integer number to each acceptor, call the highest-numbered acceptor “leader”, and allow only the leader to cast the first decisive vote.
+
So acceptors first communicate with each other to figure out who the leader is, then the leader casts the vote, and the followers follow.
+But this also violates liveness: if the leader dies, then the followers would wait indefinitely.
A fix for this problem is to let the second highest acceptor take over the leadership if the leader perishes.
But under our assumptions, it's impossible to distinguish a situation where the leader is dead from a situation where it just has a really bad internet connection.
So naively picking a successor would lead to a split vote and a standstill again (power transitions are known to be problematic for authoritarian regimes in real life too!).
+If only there were some kind of … distributed consensus algorithm for picking the leader!
This is the place where we start discussing real Paxos :-)
+It starts with a “ballot voting” algorithm.
+This algorithm, just like the ones we’ve already seen, does not define any messages.
+Rather, message passing is an implementation detail, so we’ll get to it later.
+
Recall that rigged voting requires all acceptors to vote for a single value.
+It is immune to split voting, but is susceptible to getting stuck when the leader goes offline.
+The idea behind ballot voting is to have many voting rounds, ballots.
+In each ballot, acceptors can vote only for a single value, so each ballot individually can get stuck.
+However, as we are running many ballots, some ballots will make progress.
+The value is chosen in a ballot if it is chosen by some quorum of acceptors.
+The value is chosen in an overall algorithm if it is chosen in some ballot.
+
The Turing award question is: how do we make sure that no two ballots choose different values?
+Note that it is OK if two ballots choose the same value.
+
Let’s just brute force this question, really.
+First, assume that the ballots are ordered (for example, by numbering them with natural numbers).
+And let’s say we want to pick some value v to vote for in ballot b.
When is v safe?
+Well, when no other value v1 can be chosen by any other ballot.
+Let’s tighten this up a bit.
+
Value v is safe at ballot b if any smaller ballot b1 (b1 < b) did not choose and will not choose any value other than v.
+
So yeah, easy-peasy, we just need to predict which values will be chosen in the future, and we are done!
+We’ll deal with it in a moment, but let’s first convince ourselves that, if we only select safe values for voting, we won’t violate consensus spec.
+
So, when we select a safe value v to vote for in a particular ballot, it might get chosen in this ballot.
+We need to check that it won’t conflict with any other value.
+For smaller ballots that’s easy — it’s the definition of safety condition.
+What if we conflict with some value v1 chosen in a future ballot?
+Well, that value is also safe, so whoever chose v1, was sure that it won’t conflict with v.
+
How do we tackle the precognition problem?
+We’ll ask acceptors to commit to not voting in certain ballots.
+For example, if you are looking for a safe value for ballot b and know that there’s a quorum q such that each quorum member never voted in smaller ballots, and promised to never vote in smaller ballots, you can be sure that any value is safe.
+Indeed, any quorum in smaller ballots will have at least one member which would refuse to vote for any value.
+
Ok, but what if there’s some quorum member which has already voted for some v1 in some ballot b1 < b?
+(Take a deep breath, the next sentence is the kernel of the core idea of Paxos).
+Well, that means that v1 was safe at b1, so, if there will be no votes between b1 and b, v1 is also safe at b!
+(Exhale).
+In other words, to pick a safe value at b we:
+
+
+Take some quorum q.
+
+
+Make everyone in q promise to never vote in ballots earlier than b.
+
+
+Among all of the votes already cast by the quorum members we pick the one with the highest ballot number.
+
+
+If such vote exists, its value is a safe value.
+
+
+Otherwise, any value is safe.
+
+
+
To implement the “never vote” promise, each acceptor will maintain maxBal value.
+It will never vote in ballots smaller or equal to maxBal.
+
Let’s stop hand-waving and put this algorithm in math.
+Again, we are not thinking about messages yet, and just assume that each acceptor can observe the state of the whole system.
+
+
+
Let’s unwrap this top-down.
+First, the chosen condition says that it is enough for some quorum to cast votes in some ballot for a value to be accepted.
+It’s trivial to see that, if we fix the ballot, then any two quorums would vote for the same value — quorums intersect.
+Showing that quorums vote for the same value in different ballots is the tricky bit.
+
The Init condition is simple — no votes, any acceptor can vote in any ballot (= any ballot with number larger than -1).
+
The Next consists of two cases.
On each step of the protocol, some acceptor either votes for some value in some ballot, ∃ v ∈ 𝕍: Vote(a, b, v), or declares that it won't cast additional votes in smaller ballots, AdvanceMaxBal(a, b).
Advancing the ballot just sets maxBal for this acceptor (but takes care not to rewind older decisions).
+Casting a vote is more complicated and is predicated on three conditions:
+
+
+We haven’t forfeited our right to vote in this ballot.
+
+
+If there’s some vote in this ballot already, we are voting for the same value.
+
+
+If there are no votes, then the value should be safe.
+
+
+
Note that the last two checks overlap a bit: if the set of votes cast in a ballot is not empty, we immediately know that the value is safe: somebody has proven this before.
+But it doesn’t harm to check for safety again: a safe value can not become unsafe.
+
Finally, the safety check.
It is done in relation to some quorum — if q proves that v is safe, then members of this quorum would prevent any other value from being accepted in earlier ballots.
To be able to do this, we first need to make sure that q has indeed finalized its votes for ballots less than b (maxBal is at least b - 1).
+Then, we need to find the latest vote of q.
+There are two cases
+
+
+No one in q ever voted (b1 = -1).
+In this case, there are no additional conditions on v, any value would work.
+
+
+Someone in q voted, and b1 is the last ballot when someone voted.
+Then v must be the value voted for in b1.
+This implies Safe(v, b1).
+
+
+
If all of these conditions are fulfilled, we cast our vote and advance maxBal.
+
This is the hardest part of the article.
+Take time to fully understand Ballot Vote.
+
+
+
Rigorously proving that Ballot Voting satisfies Consensus would be tedious — the specification is large, and the proof would necessarily use every single piece of the spec!
+But let’s add some hand-waving.
Again, we want to provide a homomorphism from Ballot Voting to Consensus.
Cases where the image of a step is a stuttering step (the set of chosen values is the same) are obvious.
It's also obvious that the set of chosen values never decreases (we never remove votes, so a value cannot become unchosen).
+It also increases by at most one value with each step.
+
The complex case is to prove that, if currently only v1 is chosen, no other v2 can be chosen as a result of the current step.
+Suppose the contrary, let v2 be the newly chosen value, and v1 be a different value chosen some time ago.
+v1 and v2 can’t belong to the same ballot, because every ballot contains votes only for a single value (this needs proof!).
Let's say they belong to b1 and b2, and that b1 < b2.
Note that v2 might belong to b1 — nothing prevents a smaller ballot from finishing later.
+When we chose v2 for b2, it was safe.
+This means that some quorum either promised not to vote in b1 (but then v1 couldn’t have been chosen in b1), or someone from the quorum voted for v2 in b1 (but then v1 = v2 (proving this might require repeated application of safety condition)).
+
Ok, but is this better than Majority Voting?
+Can Ballot Voting get stuck?
No — if at least one quorum of machines is online, they can bump their maxBal to a ballot bigger than any existing one.
After they do this, there will necessarily be a safe value relative to this quorum, which they can then vote for.
+
However, Ballot Voting is prone to a live lock — if acceptors continue to bump maxBal instead of voting, they’ll never select any value.
+In fact, in the current formulation one needs to be pretty lucky to not get stuck.
+To finish voting, there needs to be a quorum which can vote in ballot b, but not in any smaller ballot, and in the above spec this can only happen by luck.
+
It is impossible to completely eliminate live locks without assumptions about real time. However, when we implement Ballot Voting with real message passing, we try to reduce the probability of a live lock.
One final push left!
+Given the specification of Ballot Voting, how do we implement it using message passing?
+Specifically, how do we implement the logic for selecting the first (safe) value for the ballot?
+
The idea is to have a designated leader for each ballot.
+As there are many ballots, we don’t need a leader selection algorithm, and can just statically assign ballot leaders.
+For example, if there are N acceptors, acceptor 0 can lead ballots 0, N, 2N, …, acceptor 1 can lead 1, N + 1, 2N + 1, … etc.
+
To select a value for ballot b, the ballot’s leader broadcasts a message to initiate the ballot.
Upon receiving this message, each acceptor advances its maxBal to b - 1, and sends the leader its latest vote, unless the acceptor has already made a promise not to vote in b.
+If the leader receives replies from some quorum, it can be sure that this quorum won’t vote in smaller ballots.
+Besides, the leader knows quorum’s votes, so it can pick a safe value.
+
In other words, the practical trick for picking a safe value is to ask some quorum to abstain from voting in small ballots and to pick a value consistent with votes already cast.
+This is the first phase of Paxos, consisting of two message types, 1a and 1b.
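In Rust-flavoured pseudocode, the leader's value-picking step might look like this (a sketch; ballots and values are plain numbers here):

```rust
// Each element is one quorum member's reply to the 1a message:
// None if the acceptor never voted, Some((ballot, value)) for its latest vote.
fn pick_safe_value(quorum_replies: &[Option<(u64, u64)>], any_value: u64) -> u64 {
    quorum_replies
        .iter()
        .flatten()
        .max_by_key(|&&(ballot, _)| ballot) // highest-ballot vote among the quorum
        .map(|&(_, value)| value)           // its value is safe
        .unwrap_or(any_value)               // nobody voted: any value is safe
}
```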
+
The second phase is to ask the quorum to cast the votes.
+The leader picks a safe value and broadcasts it for the quorum.
Quorum members vote for the value, unless in the meantime they have promised a leader of a bigger ballot not to vote.
+After a member voted, it broadcasts its vote.
+When a quorum of votes is observed, the value is chosen and the consensus is reached.
+This is the second phase of Paxos with messages 2a and 2b.
+
Let’s write this in math!
To model message passing, we will use the msgs variable: a set of messages which have ever been sent.
+Sending a message is adding it to this set.
+Receiving a message is asserting that it is contained in the set.
+By not removing messages, we model reorderings and duplications.
+
The messages themselves will be represented by records. For example, phase 1a message which initiates voting in ballot b will look like this:
+
+
+
Another bit of state we'll need is lastVote — for each acceptor, the last ballot the acceptor voted in, together with the corresponding vote.
+It will be null if the acceptor hasn’t voted.
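A Rust-flavoured sketch of the four message kinds described above (the spec itself uses mathematical records):

```rust
type Ballot = u64;
type Acceptor = u32;
type Value = u64;

#[derive(Clone, PartialEq, Eq, Hash)]
enum Msg {
    // Phase 1a: the leader of ballot b initiates the ballot.
    OneA { b: Ballot },
    // Phase 1b: acceptor a promises not to vote below b and reports its last vote.
    OneB { a: Acceptor, b: Ballot, last_vote: Option<(Ballot, Value)> },
    // Phase 2a: the leader asks the quorum to vote for v in b.
    TwoA { b: Ballot, v: Value },
    // Phase 2b: acceptor a has voted for v in b.
    TwoB { a: Acceptor, b: Ballot, v: Value },
}
```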
+
Without further ado,
+
+
+
Let’s go through each of the phases.
+
Phase1a initiates ballot b.
+It is executed by the ballot’s leader, but there’s no need to model who exactly the leader is, as long as it is unique.
+This stage simply broadcasts 1a message.
+
Phase1b is executed by an acceptor a.
+If a receives 1a message for ballot b and it can vote in b, then it replies with its lastVote.
+If it can’t vote (it has already started some larger ballot), it simply doesn’t respond.
+If enough acceptors don’t respond, the ballot will get stuck, but some other ballot might succeed.
+
Phase2a is the tricky bit: it checks if the value v is safe for ballot b.
+
First, we need to make sure that we haven’t already initiated Phase2a for this ballot.
+Otherwise, we might initiate Phase2a for different values.
+Here is the bit where it is important that the ballot’s leader is stable.
+The leader needs to remember if it has already picked a safe value.
+
Then, we collect 1b messages from some quorum (we need to make sure that every quorum member has sent a 1b message for this ballot).
+Value v is safe if the whole quorum didn’t vote (vote is null), or if it is the value of the latest vote of some quorum member.
+We know that quorum members won’t vote in earlier ballots, because they had increased maxBal before sending 1b messages.
+
If the value indeed turns out to be safe, we broadcast 2a message for this ballot and value.
+
Finally, in Phase2b an acceptor a votes for this value, if its maxBal is still good.
+The bookkeeping is updating maxBal, lastVote, and sending the 2b message.
+
The set of 2b messages corresponds to the votes variable of the Ballot Voting specification.
There’s a famous result called FLP impossibility: Impossibility of Distributed Consensus with One Faulty Process.
+But we’ve just presented Paxos algorithm, which works as long as more than half of the processes are alive.
+What gives?
The FLP theorem states that there's no consensus algorithm with only finite behaviors.
+Stated in a positive way, any asynchronous distributed consensus algorithm is prone to live-lock.
+This is indeed the case for Paxos.
+
Liveness can be improved under partial synchrony assumptions.
That is, if we give each process a good enough clock, we can say things like "if no process fails, Paxos completes in t seconds".
If this is the case, we can fix live locking (ballots conflicting with each other) by using a naive leader selection algorithm to select the single acceptor which can initiate ballots.
+If we don’t reach consensus after t seconds, we can infer that someone has failed and re-run naive leader selection.
+If we are unlucky, naive leader selection will produce two leaders, but this won’t be a problem for safety.
+
Paxos requires atomicity and durability to function correctly.
For example, once the leader has picked a safe value and broadcast a 2a message, it should persist the selected value.
+Otherwise, if it goes down and then resurrects, it might choose a different value.
+How to make a choice of value atomic and durable?
+Write it to a local database!
How to make a local transaction atomic and durable?
Write it first into the write-ahead log.
+How to write something to WAL?
+Using the write syscall/DMA.
+What happens if the power goes down exactly in the middle of the write operation?
+Well, we can write a chunk of bytes with a checksum!
+Even if the write itself is not atomic, a checksummed write is!
+If we read the record from disk and checksum matches, then the record is valid.
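As a sketch of that last idea (with a hand-rolled FNV-1a checksum, purely for illustration):

```rust
// FNV-1a, just to have *some* checksum without external dependencies.
fn checksum(bytes: &[u8]) -> u64 {
    bytes.iter().fold(0xcbf29ce484222325u64, |h, &b| {
        (h ^ u64::from(b)).wrapping_mul(0x100000001b3)
    })
}

// A record is: length, payload, checksum of the payload.
fn encode_record(payload: &[u8]) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    out.extend_from_slice(payload);
    out.extend_from_slice(&checksum(payload).to_le_bytes());
    out
}

// A torn write almost certainly breaks the checksum, so a record that
// verifies is treated as having been written completely.
fn decode_record(bytes: &[u8]) -> Option<&[u8]> {
    let len = u32::from_le_bytes(bytes.get(..4)?.try_into().ok()?) as usize;
    let payload = bytes.get(4..4 + len)?;
    let stored = u64::from_le_bytes(bytes.get(4 + len..4 + len + 8)?.try_into().ok()?);
    (checksum(payload) == stored).then_some(payload)
}
```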
+
I use a slightly different definition of maxBal (less by one) than the one in the linked lecture, so don't get confused about this!
Some time ago I wrote a reddit comment explaining the benefits of IDEs.
+Folks refer to it from time to time, so I decided to edit it into an article form.
+Enjoy!
+
I think I have a rather balanced perspective on IDEs.
+I used to be a heavy Emacs user (old config, current config).
+I worked at JetBrains on IntelliJ Rust for several years.
+I used evil mode and vim for a bit, and tried tmux and kakoune.
+Nowadays, I primarily use VS Code to develop rust-analyzer: LSP-based editor-independent IDE backend for Rust.
+
I will be focusing on IntelliJ family of IDEs, as I believe these are the most advanced IDEs today.
+
The main distinguishing feature of IntelliJ is semantic understanding of code.
+The core of IntelliJ is a compiler which parses, type checks and otherwise understands your code.
+PostIntelliJ is the canonical post about this.
+That article also refutes the claim that “Smalltalk IDE is the best we’ve ever had”.
+
Note that “semantic understanding” is mostly unrelated to the traditional interpretation of “IDE” as Integrated Development Environment.
+I personally don’t feel that the “Integrated” bit is all that important.
+I commit&push from the command line using Julia scripts, rebase in magit, and do code reviews in a browser.
+If anything, there’s an ample room for improvement for the integration bits.
For me, the I in "IDE" stands for "intelligent", smart.
+
Keep in mind this terminology difference.
+I feel it is a common source of misunderstanding.
"Unix and command line can do anything an IDE can do" is correct about the integrated bits, but is wrong about the semantic bits.
+
Traditional editors like Vim or Emacs understand programming languages very approximately, mostly via regular expressions.
+For me, this feels very wrong.
+It’s common knowledge that HTML shall not be parsed with regex.
+Yet this is exactly what happens every time one does vim index.html with syntax highlighting on.
+I sincerely think that almost every syntax highlighter out there is wrong and we, as an industry, should do better.
+I also understand that this is a tall order, but I do my best to change the status quo here :-)
+
These are mostly theoretical concerns though.
+The question is, does semantic understanding help in practice?
+I am pretty sure that it is non-essential, especially for smaller code bases.
+My first non-trivial Rust program was written in Emacs, and it was fine.
+Most of rust-analyzer was written using pretty spartan IDE support.
+There are a lot of insanely-productive folks who are like “sometimes I type vim, sometimes I type vi, they are sufficiently similar”.
+Regex-based syntax highlighting and regex based fuzzy symbol search (ctags) get you a really long way.
+
However, I do believe that features unlocked by deep understanding of the language help.
+The funniest example here is extend/shrink selection.
This feature allows you to extend the current selection to the next encompassing syntactic construct.
+It’s the simplest feature a PostIntelliJ IDE can have, it only needs the parser.
+But it is sooo helpful when writing code, it just completely blows vim’s text objects out of the water, especially when combined with multiple cursors.
+In a sense, this is structural editing which works for text.
+
+
+
If you add further knowledge of the language into the mix, you'll get the "assists" system: micro-refactorings which are available in a particular context.
+For example, if the cursor is on a comma in a list of function arguments, you can alt+enter > “swap arguments”, and the order of arguments will be changed in the declaration and on various call-sites as well.
+(See this post to learn how assists are implemented).
+
These small dwim things add up to a really nice editing experience, where you mostly express the intention, and the IDE deals with boring syntactical aspects of code editing:
+
+
+
For larger projects, complex refactors are a huge time-saver.
+Doing project-wide renames and signature changes automatically and without thinking reduces the cost of keeping the code clean.
+
Another transformative experience is navigation.
+In IntelliJ, you generally don’t “open a file”.
Instead you think directly in terms of functions, types and modules, and navigate to those using file structure, goto symbol, goto definition/implementation/type, etc:
When I used Emacs, I really admired its buffer management facilities, because they made opening a file I want a breeze.
+When I later switched to IntelliJ, I stopped thinking in terms of a set of opened files altogether.
+I disabled editor tabs and started using editor splits less often — you don’t need bookmarks if you can just find things.
+
For me, there’s one aspect of traditional editors which is typically not matched in IDEs out of the box — basic cursor motion.
+Using arrow keys for that is slow and flow-breaking, because one needs to move the hand from the home row.
+Even Emacs’ horrific C-p, C-n are a big improvement, and vim’s hjkl go even further.
+One fix here is to configure each tool to use your favorite shortcuts, but this is a whack-a-mole game.
What I do is remap CapsLock to act as an extra modifier, such that ijkl are arrow keys.
(There are also keyboards with hardware support for this.)
This works the same way in all applications.
Easy motion / ace jump functionality for jumping to any visible character is also handy, and is usually available via a plugin.
+
Recent advancements with LSP protocol promise to give one the best of both worlds, where semantic-aware backend and light-weight editor frontend are different processes, which can be mixed and matched.
+This is nice in theory, but not as nice in practice as IntelliJ yet, mostly because IntelliJ is way more polished.
+
To give a simple example, in IntelliJ for “go to symbol by fuzzy name” functionality, I can filter the search scope by:
+
+
+is this my code/code from a dependency?
+
+
+is this test/production code?
+
+
+is a symbol a type-like thing, or a method-like thing?
+
+
+path to the module where the symbol is defined.
+
+
+
VS Code and LSP simply do not have capabilities for such filters yet, they have to be bolted on using hacks.
+Support for LSP in other editors is even more hit-and-miss.
+
LSP did achieve a significant breakthrough — it made people care about implementing IDE backends.
+Experience shows that re-engineering an existing compiler to power an IDE is often impossible, or isomorphic to a rewrite.
+How a compiler talks to an editor is the smaller problem.
+The hard one is building a compiler that can do IDE stuff in the first place.
+Check out this post for some of the technical details.
+Starting with this use-case in mind saves a lot of effort down the road.
+
This I think is a big deal.
+I hypothesize that the reason why IDEs do not completely dominate tooling landscape is the lack of good IDE backends.
+
If we look at the set of languages that have been fairly popular recently, a significant fraction of them are dynamically typed: PHP, JavaScript, Python, Ruby.
+The helpfulness of an IDE for dynamically typed languages is severely limited: while approximations and heuristics can get you a long way, you still need humans in the loop to verify IDE’s guesses.
+
There’s C++, but its templates are effectively dynamically typed, with exactly the same issues (and a very complex base language to boot).
+Curiously, C looks like a language for which implementing a near-perfect IDE is pretty feasible.
+I don’t know why it didn’t happen before CLion.
+
This leaves C# and Java.
+Indeed, these languages are dominated by IDEs.
+There’s a saying that you can’t write Java without an IDE.
+I think it gets the causation direction backwards: Java is one of the few languages for which it is possible to implement a great IDE without great pain.
+Supporting evidence here is Go.
According to survey results, text editors are steadily declining in popularity in favor of IDEs.
+
I think this is because Go actually has good IDEs.
+This is possible because the language is sufficiently statically typed for an IDE to be a marked improvement.
+Additionally, the language is very simple, so the amount of work you need to put in to make a decent IDE is much lower than for other languages.
+If you have something like JavaScript…
+Well, you first need to build an alternative language for which you can actually implement an IDE (TypeScript) and only then you can build the IDE itself (VS Code).
The Midori error model makes a sharp distinction between two kinds of errors:
+
+
+bugs in the program, like indexing an array with -92
+
+
error conditions in the program's environment (reading a file which doesn't exist)
+
+
+
In Rust, those correspond to panics and Results.
+It’s important to not mix the two.
+
std, I think, sadly does mix them in the sync API.
+The following APIs convert panics to recoverable results:
+
+
+Mutex::lock
+
+
+thread::JoinHandle::join
+
+
+mpsc::Sender::send
+
+
+
All those APIs return an error when the other thread has panicked.
This leads to people using ? with these methods, using recoverable error handling for bugs in the program.
+
In my mind, a better design would be to make those API panic by default.
Sometimes synchronization points also happen to be failure isolation boundaries.
+More verbose result-returning catching_lock, catching_join, catching_send would work for those special cases.
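Here is a sketch of what that shape could look like for Mutex (catching_lock is the hypothetical name from above, not an actual std API):

```rust
use std::sync::{Mutex, MutexGuard, PoisonError};

struct PanickingMutex<T>(Mutex<T>);

impl<T> PanickingMutex<T> {
    // The default: a poisoned lock means a bug somewhere, so propagate the panic.
    fn lock(&self) -> MutexGuard<'_, T> {
        self.0
            .lock()
            .expect("mutex poisoned: another thread panicked while holding it")
    }

    // The explicit opt-in for when the lock really is a failure isolation boundary.
    fn catching_lock(&self) -> Result<MutexGuard<'_, T>, PoisonError<MutexGuard<'_, T>>> {
        self.0.lock()
    }
}
```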
+
If std::Mutex did implement lock poisoning, but the lock method returned a LockGuard<T>, rather than Result<LockGuard<T>, PoisonError>, then we wouldn’t be discussing poisoning in the rust book, in every mutex example, and wouldn’t consider changing the status quo.
+At the same time, we’d preserve “safer” semantics of lock poisoning.
+
There’s an additional consideration here.
+In a single-threaded program, panic propagation is linear.
+One panic is unwound past a sequence of frames.
If we get a second panic in some Drop, the result is a process abort.
+
In a multi-threaded program, the stack is tree-shaped.
+What should happen if one of the three parallel threads panics?
+I believe the right semantics here is that siblings are cancelled, and then the panic is propagated to the parent.
+How to implement cancellation is an open question.
+If two children panic, we should propagate a pair of panics.
A topic closely related to lock poisoning is unwind safety — the UnwindSafe and RefUnwindSafe traits.
I want to share an amusing story about how this machinery almost, but not quite, saved my bacon.
+
rust-analyzer implements cancellation via unwinding.
+After a user types something and we have new code to process, we set a global flag.
+Long-running background tasks like syntax highlighting read this flag and, if it is set, panic with a struct Cancelled payload.
+We use resume_unwind and not panic to avoid printing backtrace.
+After the stack is unwound, we can start processing new code.
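A minimal sketch of the scheme (the names CANCELLED, Cancelled, and check_cancelled are invented for this illustration; the real machinery in rust-analyzer is more involved):

```rust
use std::panic::{catch_unwind, resume_unwind, AssertUnwindSafe};
use std::sync::atomic::{AtomicBool, Ordering};

static CANCELLED: AtomicBool = AtomicBool::new(false);

struct Cancelled;

fn check_cancelled() {
    if CANCELLED.load(Ordering::Relaxed) {
        // resume_unwind skips the panic hook, so no backtrace is printed.
        resume_unwind(Box::new(Cancelled));
    }
}

fn highlight() -> Option<Vec<String>> {
    let result = catch_unwind(AssertUnwindSafe(|| {
        let mut spans = Vec::new();
        for i in 0..1_000 {
            check_cancelled(); // long-running work polls the flag
            spans.push(format!("span {}", i));
        }
        spans
    }));
    match result {
        Ok(spans) => Some(spans),
        // Cancellation: drop the partial result and let the caller start over.
        Err(payload) if payload.is::<Cancelled>() => None,
        // Anything else is a real bug: keep unwinding.
        Err(payload) => resume_unwind(payload),
    }
}
```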
+
This means that rust-analyzer’s data, stored in the Db type, needs to be unwind safe.
+
One day, while I was idly hacking on rust-analyzer during a Rust all-hands, I noticed a weird compilation error telling me that Db doesn't implement the corresponding trait.
+What’s worse, removing the target directory fixed the bug.
+This was an instance of incorrect incremental compilation.
+
The problem stemmed from two issues:
+
+
+UnwindSafe and RefUnwindSafe are auto traits, and inference rules for those are complicated
+
+
the Db type has a curiously recurring template structure
+
+
+
With incremental compilation in the mix, something somewhere went wrong.
+
The compiler bug was fixed after several months, but, to work around it in the meantime, we’ve added a manual impl UnwindSafe for Db which masked the bug.
+
A couple more months passed, and we started integrating chalk into rust-analyzer.
At that time, chalk had its own layer of caching, in addition to the incremental compilation of rust-analyzer itself.
+So we had something like this:
+
+
+
(We used parking_lot for perf, and to share mutex impl between salsa and rust-analyzer).
+
Now, one of the differences between std::Mutex and parking_lot::Mutex is lock poisoning.
+And that means that std::Mutex is unwind safe (as it just becomes poisoned), while parking_lot::Mutex is not.
+Chalk used some RefCell’s internally, so it wasn’t unwind safe.
+So the whole Db stopped being UnwindSafe after addition of chalk.
But because we had that manual impl UnwindSafe for Db, we didn't notice this.
+
And that led to a heisenbug.
If cancellation happened during trait solving, we unwound past ChalkSolver.
And, as it didn't have strict exception safety guarantees, that messed up its internal state.
So the next trait solving query would observe really weird errors, like index out of bounds inside chalk.
+
The solution was to:
+
+
+remove the manual impl (by that time the underlying compiler bug was fixed).
+
+
+get the Db: !UnwindSafe expected error.
+
+
+replace parking_lot::Mutex with std::Mutex to get unwind-safety.
+
+
+change calls to .lock to propagate cancellation.
+
+
+
The last point is interesting, it means that we need support for recoverable poisoning in this case.
+We need to understand that the other thread was cancelled mid-operation (so that chalk’s state might be inconsistent).
+And we also need to re-raise the panic with a specific payload — the Cancelled struct.
+This is because the situation is not a bug.
This post documents call site dependency injection pattern.
+It is a rather low level specimen and has little to do with enterprise DI.
+The pattern is somewhat Rust-specific.
+
Usually, when you implement a type which needs some user-provided functionality, the first thought is to supply it in constructor:
+
+
+
In this example, we implement Engine and the caller supplies Config.
+
An alternative is to pass the dependency to every method call:
+
+
+
In Rust, the latter (call-site injection) sometimes works better with lifetimes.
+Let’s see the examples!
In the first example, we want to lazily compute a field’s value based on other fields.
+Something like this:
+
+
+
The problem with this design is that it doesn’t work in Rust.
+The closure in Lazy needs access to self, and that would create a self-referential data structure!
+
The solution is to supply the closure at the point where the Lazy is used:
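One way this can look, using the once_cell crate (an assumption on my part; the post's own example may differ in detail):

```rust
use once_cell::unsync::OnceCell;

struct Widget {
    name: String,
    // No closure is stored here, so the struct is not self-referential.
    shouting_name: OnceCell<String>,
}

impl Widget {
    fn shouting_name(&self) -> &str {
        // The closure is supplied at the call site and may freely borrow `self`.
        self.shouting_name.get_or_init(|| self.name.to_uppercase())
    }
}
```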
The next example is about plugging a custom hash function into a hash table.
+In Rust’s standard library, this is only possible on the type level, by implementing the Hash trait for a type.
+A more general design would be to parameterize the table with a hash function at run-time.
+This is what C++ does.
+However in Rust this won’t be general enough.
+
Consider a string interner, which stores strings in a vector and additionally maintains a hash-based index:
+
+
+
The set field stores the strings in a hash table, but it represents them using indices into neighboring vec.
+
Constructing the set with a closure won't work for the same reason Lazy didn't work — this creates a self-referential structure.
+In C++ there exists a work-around — it is possible to box the vec and share a stable pointer between Interner and the closure.
+In Rust, that would create aliasing, preventing the use of &mut Vec.
+
Curiously, using a sorted vec instead of a hash works with std APIs:
+
+
+
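A sketch of how that can look (simplified; the comparison closure borrows the neighboring vec right at the call site):

```rust
#[derive(Default)]
struct Interner {
    vec: Vec<String>,
    // Indices into `vec`, kept sorted by the string they point to.
    index: Vec<u32>,
}

impl Interner {
    fn intern(&mut self, s: &str) -> u32 {
        let vec = &self.vec;
        match self.index.binary_search_by(|&i| vec[i as usize].as_str().cmp(s)) {
            Ok(pos) => self.index[pos],
            Err(pos) => {
                let id = self.vec.len() as u32;
                self.vec.push(s.to_string());
                self.index.insert(pos, id);
                id
            }
        }
    }
}
```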
This is because the closure is supplied at the call site rather than at the construction site.
+
The hashbrown crate provides this style of API for hashes via RawEntry.
The third example is from the Zig programming language.
+Unlike Rust, Zig doesn’t have a blessed global allocator.
+Instead, containers in Zig come in two flavors.
+The “Managed” flavor accepts an allocator as a constructor parameter and stores it as a field
+(Source).
+The “Unmanaged” flavor adds an allocator parameter to every method
+(Source).
+
The second approach is more frugal — it is possible to use a single allocator reference with many containers.
The final example comes from the Rust language itself.
+To implement dynamic dispatch, Rust uses fat pointers, which are two words wide.
+The first word points to the object, the second one to the vtable.
+These pointers are manufactured at the point where a concrete type is used generically.
+
This is different from C++, where vtable pointer is embedded into the object itself during construction.
+
+
Having seen all these examples, I am warming up to Scala-style implicit parameters.
+Consider this hypothetical bit of Rust code with Zig-style vectors:
+
+
+
The problem here is Drop — freeing the vectors requires access to the allocator, and it's unclear how to provide one.
+Zig dodges the problem by using defer statement rather than destructors.
+In Rust with implicit parameters, I imagine the following would work:
+
+
+
+
To conclude, I want to share one last example where CSDI thinking helped me to discover a better application-level architecture.
+
A lot of rust-analyzer’s behavior is configurable.
+There are toggles for inlay hints, completion can be tweaked, and some features work differently depending on the editor.
+The first implementation was to store a global Config struct together with the rest of analysis state.
+Various subsystems then read bits of this Config.
+To avoid coupling distinct features together via this shared struct, config keys were dynamic:
+
+
+
This system worked, but felt rather awkward.
+
The current implementation is much simpler.
+Rather than storing a single Config as a part of the state, each method now accepts a specific config parameter:
+
+
+
Not only is the code simpler, it is also more flexible.
+Because configuration is no longer a part of the state, it is possible to use different configs for the same functionality depending on the context.
+For example, explicitly invoked completion might be different from the asynchronous one.
I’ve read a book about management and it helped me to solve a long-standing personal conundrum about the code review process.
+The book is “High Output Management”.
+Naturally, I recommend it (and read this “review” as well: https://apenwarr.ca/log/20190926).
+
One of the smaller ideas of the book is that of the managerial meddling.
+If my manager micro-manages me and always tells me what to do, I’ll grow accustomed to that and won’t be able to contribute without close supervision.
+This is a facet of a more general Task-Relevant Maturity framework.
+Irrespective of the overall level of seniority, a person has some expertise level for each specific task.
+The optimal quantity and quality of supervisor’s involvement depends on this level (TRM).
+When TRM grows, the management style should go from structured control to supervision to nudges and consultations.
+I don’t need a ton of support when writing Rust, I can benefit a lot from a thorough review when coding in Julia and I certainly require hand-holding when attempting to write Spanish!
+But the overarching goal is to improve my TRM, as that directly improves my productivity and frees up my supervisor’s time.
+The problem with meddling is not excessive control (it might be appropriate in low-TRM situations), it is that meddling removes the motivation to learn to take the wheel yourself.
+
Now, how on earth all this managerial gibberish relates to the pull request review?
+I now believe that there are two largely orthogonal (and even conflicting) goals to a review process.
+
One goal of a review process is good code.
+The review ensures that each change improves the overall quality of a code base.
+Without continuous betterment any code under change reverts to the default architecture: a ball of goo.
+
Another goal of a review is good coders.
+The review is a perfect mentorship opportunity, it is a way to increase contributor’s TRM.
+This is vital for community-driven open-source projects.
+
I personally always felt that the review process I use falls quite short of the proper level of quality.
+Which didn’t really square with me bootstrapping a couple of successful open source projects.
+Now I think that I just happen to optimize for the people’s aspect of the review process, while most guides
+(with a notable exception of Optimistic Merging) focus on code aspects.
+
Now, (let me stress this point), I do not claim that the second goal is inherently better (though it sounds nicer).
+It’s just that in the context of both IntelliJ Rust and rust-analyzer (green-field projects with massive scope, big uncertainties and limited payed-for hours) growing the community of contributors and maintainers was more important than maintaining perfectly clean code.
+
Reviews for quality are hard and time consuming.
+I personally can’t really review the code looking at the diff, I can give only superficial comments.
+To understand the code, most of the time I need to fetch it locally and to try to implement the change myself in a different way.
+To make a meaningful suggestion, I need to implement and run it on my machine (and the first two attempts won’t fly).
+Hence, a proper review for me takes roughly the same time as the implementation itself.
+Taking into account the fact that there are many more contributors than maintainers, this is an instant game over for reviews for quality.
+
Luckily, folks submitting PRs generally have medium/high TRM.
+They were able to introduce themselves to the codebase, find an issue to work on and come up with working code without me!
+So, instead of scrutinizing away every last bit of diff’s imperfection, my goal is to promote the contributor to an autonomous maintainer status.
+This is mostly just a matter of trust.
+I don’t read every line of code, as I trust the author of the PR to handle ifs and whiles well enough (this is the major time saver).
+I trust that people address my comments and let them merge their own PRs (bors d+).
+I trust that people can review others’ code, and share commit access (r+) liberally.
+
+
+
What new contributors don’t have and what I do talk about in reviews is the understanding of project-specific architecture and values.
+These are best demonstrated on specific issues with the diff.
+But the focus isn’t the improvement of a specific change, the focus is teaching the author of (hopefully) subsequent changes.
+I liberally digress into discussing general code philosophy issues.
+As disseminating this knowledge 1-1 is not very efficient, I also try to document it.
+Rather than writing a PR comment, I put the text into
+architecture.md or
+style.md
+and link that instead.
+I also try to do only a small fixed number of review rounds.
+Roughly, the PR is merged after two round-trips, not when there’s nothing left to improve.
+
All this definitely produces warm fuzzy feelings, but what about code quality?
+Gating PRs on quality is one way, but not the only way, to maintain clean code.
+The approach I use instead is continuous refactoring / asynchronous reviews.
+One of the (documented) values in rust-analyzer is that anyone is allowed and encouraged to refactor all the code, old and new.
+
Instead of blocking the PR, I merge it and then refactor the code in a follow-up (ccing the original author), when I touch this area next time.
+This gives me a much better context than a diff view, as I can edit the code in-place and run the tests.
+I also don’t waste time transforming the change I have in mind to a PR comment (the motivation bits go directly into comment/commit message).
+It’s also easy to do unrelated drive-by fixes!
+
I wish this asynchronous review workflow was better supported by tools.
+By default, changes are merged by the author, but the PR also goes to a review queue.
+Later, the reviewer looks at the merged code in the main branch.
+Any suggestions are submitted as a new PR, with the original author set as a reviewer.
+(The in-editor reviewing reminds me of the iron workflow.)
+
+
To conclude, let me reference another book.
+I like item 32 from “C++ Coding Standards”: be clear what kind of class you’re writing.
+A value type is not an interface is not a base class.
+All three are classes, but each needs a unique set of rules.
+
When doing/receiving a code review, understand the context and purpose.
+If this is a homework assignment, you want to share knowledge.
+In a critical crypto library, you need perfect code.
+And for a young open source project, you aim to get a co-maintainer!
If you maintain an open-source project in the range of 10k-200k lines of code, I strongly encourage you to add an ARCHITECTURE document next to README and CONTRIBUTING.
+Before going into the details of why and how, I want to emphasize that this is not another “docs are good, write more docs” advice.
+I am pretty sloppy about documentation, and, e.g., I often use just “simplify” as a commit message.
+Nonetheless, I feel strongly about the issue, even to the point of pestering you :-)
+
I have experience with both contributing to and maintaining open-source projects.
+One of the lessons I’ve learned is that the biggest difference between an occasional contributor and a core developer lies in the knowledge about the physical architecture of the project.
+Roughly, being unfamiliar with the project makes writing the patch itself 2x slower, but figuring out where to change the code 10x slower.
+This difference might be hard to perceive if you’ve been working with the project for a while.
+If I am new to a code base, I read each file as a sequence of logical chunks specified in some pseudo-random order.
+If I’ve made significant contributions before, the perception is quite different.
+I have a mental map of the code in my head, so I no longer read sequentially.
+Instead, I just jump to where the thing should be, and, if it is not there, I move it.
+One’s mental map is the source of truth.
+
I find the ARCHITECTURE file to be a low-effort high-leverage way to bridge this gap.
+As the name suggests, this file should describe the high-level architecture of the project.
+Keep it short: every recurring contributor will have to read it.
+Additionally, the shorter it is, the less likely it will be invalidated by some future change.
+This is the main rule of thumb for an ARCHITECTURE document: only specify things that are unlikely to change frequently.
+Don’t try to keep it synchronized with code.
+Instead, revisit it a couple of times a year.
+
Start with a bird’s eye overview of the problem being solved.
+Then, specify a more-or-less detailed codemap.
+Describe coarse-grained modules and how they relate to each other.
+The codemap should answer “where’s the thing that does X?”.
+It should also answer “what does the thing that I am looking at do?”.
+Avoid going into details of how each module works, pull this into separate documents or (better) inline documentation.
+A codemap is a map of a country, not an atlas of maps of its states.
+Use this as a chance to reflect on the project structure.
+Are the things you want to put near each other in the codemap adjacent when you run tree .?
+
Do name important files, modules, and types.
+Do not directly link them (links go stale).
+Instead, encourage the reader to use symbol search to find the mentioned entities by name.
+This doesn’t require maintenance and will help to discover related, similarly named things.
+
Explicitly call out architectural invariants.
+Often, important invariants are expressed as an absence of something, and it’s pretty hard to divine that from reading the code.
+Think about a common example from web development: the model layer does not depend on the views, yet no single line of code states this explicitly.
+
Point out boundaries between layers and systems as well.
+A boundary implicitly contains information about the implementation of the system behind it.
+It even constrains all possible implementations.
+But finding a boundary by just randomly looking at the code is hard — good boundaries have measure zero.
+
After finishing the codemap, add a separate section on cross-cutting concerns.
+
A good example of an ARCHITECTURE document is this one from rust-analyzer:
+architecture.md.
I want a better profiler for Rust.
+Here’s what a rust-analyzer benchmark looks like:
+
+
+
Here’s how I want to profile it:
+
+
+
First, the profiler prints to stderr:
+
+
+
Otherwise, if everything is set up correctly, the output is
+
+
+
The profile-results folder contains the following:
+
+
+report.txt with
+
+
+user, cpu, sys time
+
+
+cpu instructions
+
+
+stats for caches & branches, a la perf stat
+
+
+top ten functions by cumulative time
+
+
+top ten functions by self-time
+
+
+top ten hot-spots
+
+
+
+
+flamegraph.svg
+
+
+data.smth, which can be fed into some existing profiler UI (kcachegrind, firefox profiler, etc).
+
+
+report.html which contains a basic interactive UI.
+
+
+
To tweak settings, the following API is available:
+
+
+
Naturally, the following also works and produces an aggregate profile:
+
+
+
I don’t know how this should work.
+I think I would be happy with a perf-based Linux-only implementation.
+The perf-event crate by Jim Blandy (co-author of “Programming Rust”) is good.
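+
To make the wish a bit more concrete, here is a minimal sketch of counting CPU instructions with the perf-event crate (API recalled from its docs, so double-check there; the workload is a placeholder):
+
```rust
use perf_event::Builder;

fn main() -> std::io::Result<()> {
    // The default Builder counts retired instructions.
    let mut counter = Builder::new().build()?;

    counter.enable()?;
    let v: Vec<u64> = (0..1_000_000).collect(); // placeholder workload
    counter.disable()?;

    println!("{} instructions for {} elements", counter.read()?, v.len());
    Ok(())
}
```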
+
Have I missed something?
+Does this tool already exist?
+Or is it impossible for some reason?
I’ve been re-reading Ted Kaminski blog about software design.
+I highly recommend all the posts, especially the earlier ones
+(here’s the first).
+He manages to offer design advice which is both non-trivial and sound (a subjective judgment of course), a rare specimen!
+
Anyway, one of the insights of the series is that, when designing an abstraction, we always face the inherent tradeoff between power and properties.
+The more we can express using a particular abstraction, the less we can say about the code using it.
+Our human bias for more expressive power is not inherent, however.
+This is evident in programming language communities, where users unceasingly ask for new features while language designers say no.
+
Macros are a language feature which is very far in the “more power” side of the chart.
+Macros give you an ability to abstract over the source code.
+In exchange, you give up the ability to (automatically) reason about the surface syntax.
+As a specific example, rename refactoring doesn’t work 100% reliably in languages with powerful macro systems.
+
I do think that, in the ideal world, this is a wrong trade for a language which wants to scale to gigantic projects.
+The ability to automatically reason about and transform source code gains in importance when you add more programmers, more years, and more millions of lines of code.
+But take this with a huuuge grain of salt — I am obviously biased, having spent several years developing Rust IDEs.
+
That said, macros have a tremendous appeal — they are a language designer’s duct tape.
+Macros are rarely the best tool for the job, but they can do almost any job.
+The language design is incremental.
+A macro system relieves the design pressure by providing a ready poor man’s substitute for many features.
+
In this post, I want to explore what macros are used for in Rust.
+The intention is to find solutions which do not give up the “reasoning about source code” property.
By far, the most common use-case is the format! family of macros.
+The macro-less solution here is straightforward — a string interpolation syntax:
+
+
+
In Rust, interpolation probably shouldn’t construct a string directly.
+Instead, it can produce a value implementing Display (just like format_args!), which can avoid allocations.
+An interesting extension would be to allow iterating over format string pieces.
+That way, the interpolation syntax could be used for things like SQL statements or command line arguments, without the fear of introducing injection vulnerabilities:
+
+
+
+This post about the Julia programming language explains the issue.
+The xshell crate implements this idea for Rust.
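+
As a taste of the idea, a minimal sketch using xshell; the {message} interpolation expands to a single argument, so no shell injection is possible (error handling simplified):
+
```rust
use xshell::{cmd, Shell};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let sh = Shell::new()?;
    // The interpolated value is passed verbatim as one argument:
    // spaces, quotes and $(...) inside `message` are not interpreted by a shell.
    let message = "fix: handle spaces & $(subshells) literally";
    cmd!(sh, "git commit -m {message}").run()?;
    Ok(())
}
```
+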
I think the second most common, and probably the most important use of macros in Rust are derives.
+Rust is one of the few languages which gets equality right (and forbids comparing apples and oranges), but this crucially depends on the ability to derive(Eq).
+Common solutions in this space are special casing in the compiler (Haskell’s deriving) or runtime reflection.
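+
A tiny illustration of the “apples and oranges” point (the newtypes are made up):
+
```rust
// Distinct id types: equal only to themselves, thanks to the derived impls.
#[derive(PartialEq, Eq, Debug)]
struct AppleId(u32);

#[derive(PartialEq, Eq, Debug)]
struct OrangeId(u32);

fn main() {
    assert_eq!(AppleId(1), AppleId(1));
    // assert_eq!(AppleId(1), OrangeId(1)); // compile error: apples vs oranges
}
```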
+
+But the solution I am most excited about is C# source generators.
+They are nothing new — this is just good old (source) code generation with a nice quality of implementation.
+You can supply custom code which gets run during the build and which can read existing sources and generate additional files, which are then added back to the compilation.
+
The beauty of this solution is that it moves all the complexity out of the language and into the build system.
+This means that you get baseline tooling support for free.
+Goto definition for generated code? Just works.
+Want to step into some serialization code while debugging? There’s actual source code on disk, so feel free to!
+You are more of a printf person? Well, you’d need to convince the build system to not stomp over your changes, but, otherwise, why not?
+
Additionally, source generators turn out to be significantly more expressive.
+They can call into the Roslyn compiler to analyze the source code, so they are capable of type-directed code generation.
+
To be useful, source generators require some language level support for splitting a single entity across several files.
+In C#, partial classes play this role.
The raison d’être of macros is implementation of embedded DSLs.
+We want to introduce custom syntax within the language for succinctly modeling the program’s domain.
+For example, a macro can be used to embed HTML fragments in Rust code.
+
+To me personally, an eDSL is not a problem to be solved, but just a problem.
+Introducing a new sublanguage (even if small) spends a lot of cognitive complexity budget.
+If you need it once in a while, it’s better to stick to chaining together somewhat verbose function calls.
+If you need it a lot, it makes sense to introduce an external DSL, with a compiler, a language server, and all the tooling that makes programming productive.
+To me, macro-based DSLs just don’t feel like an interesting point on the cost-benefit curve.
+
That being said, the Kotlin programming language solves the problem of strongly-typed, tooling-friendly DSL nicely (example).
+Infuriatingly, it’s hard to pinpoint what specifically the solution is.
+It’s … just the concrete syntax mostly.
+Here are some ingredients:
+
+
+The syntax for closures is { arg -> body }, or just { body }, so closures syntactically resemble blocks.
+
+
+Extension methods (which are just sugar for static methods).
+
+
+Java style implicit this, which introduces names into scope without an explicit declaration.
+
+
+TCP-preserving inline closures (this is the single non-syntactical feature).
+
+
+
+Nonetheless, this was not enough to implement the Jetpack Compose UI DSL; it also needs a compiler plugin.
An interesting case of a DSL I want to call out is sqlx::query.
+It allows one to write code like this:
+
+
+
+This, I think, is one of the few cases where an eDSL really does pull its weight.
+I don’t know how to do this without macros.
+Using string interpolation (the advanced version to protect from injection), it is possible to specify the query.
+Using a source generator, it is possible to check the syntax of the query and verify the types, to, e.g., raise a type error in this case:
+
+
+
But this won’t be enough to generate an anonymous struct, or to get rid of dynamic casts.
Rust also uses macros for conditional compilation.
+This use case convincingly demonstrates the “lack of properties” aspect of power.
+Dealing with feature combinations is a perpetual headache for Cargo.
+Users have to repeatedly recompile large chunks of the crate graph when feature flags change.
+Catching a type error on CI with cargo test --no-default-features is pretty annoying, especially if you did run cargo test before submitting a PR.
+“Additive Features” is uncheckable wishful thinking.
+
In this case, I don’t know a good macro-less alternative.
+But, in principle, this seems doable, if conditional compilation is pushed further down the compiler pipeline, to the code generation and linking stage.
+Rather than discarding some code early during parsing, the compiler can select the platform-specific version just before producing machine code for a function.
+Before that, it checks that all conditionally-compiled versions of the function have the same interface.
+That way, platform-specific type errors are impossible.
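+
For reference, the status quo that this alternative would change, in a minimal made-up example of cfg-based conditional compilation:
+
```rust
// Today, the non-matching item is discarded before type checking, so a type
// error in, say, the Windows version stays invisible until someone actually
// builds with that configuration.
#[cfg(unix)]
fn null_device() -> &'static str {
    "/dev/null"
}

#[cfg(windows)]
fn null_device() -> &'static str {
    "NUL"
}

fn main() {
    println!("{}", null_device());
}
```
+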
The final use-case I want to cover is that of a placeholder syntax.
+Rust’s macro_call!(...) syntax carves out a well-isolated region where anything goes, syntax wise, as long as the parentheses are balanced.
+In theory, this allows language designers to experiment with provisional syntax before setting something in stone.
+In practice, it looks like this is not all that beneficial?
+There was some opposition to stabilizing postfix .await without going through an intermediate period with an await! macro.
+And, after stabilization, all syntax discussions were immediately forgotten?
+On the other hand, we did have try! -> ? transition, and I don’t think it helped to uncover any design pitfalls?
+At least, we managed to stabilize the unnecessarily restrictive desugaring on that one.
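+
For the record, here is what that transition looked like in code (a made-up example; the macro form is deprecated by now):
+
```rust
use std::fs::File;
use std::io;

// Before `?` was stabilized, error propagation went through a macro:
//
//     let file = try!(File::open("Cargo.toml"));
//
// The dedicated syntax that replaced it:
fn open_manifest() -> io::Result<File> {
    let file = File::open("Cargo.toml")?;
    Ok(file)
}

fn main() {
    if open_manifest().is_err() {
        eprintln!("no Cargo.toml here");
    }
}
```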
+
+
To conclude, I want to circle back to source generators.
+What exactly makes them easier for tooling than macros?
+I think the following three properties do.
+First, both input and output are, fundamentally, text.
+There’s no intermediate representation (like token trees) used by this meta-programming facility.
+This means that it doesn’t need to be integrated deeply with the compiler.
+Of course, internally the tool is free to parse, typecheck and transform the code however it likes.
+Second, there is a phase distinction.
+Source generators are executed once, in unordered fashion.
+There’s no back and forth between meta programming and name resolution, which, again, allows keeping the “meta” part outside.
+Third, source generators can only add code, they can not change the meaning of the existing code.
+This means that semantically sound source code transformations remain so in the presence of a code generator.
Hey, I have a short announcement to make: I am joining NEAR (sharded proof of stake public blockchain)!
+TL;DR: I’ll be spending 60% of my time on WASM runtime for smart contracts and 40% on rust-analyzer.
+
Why NEAR?
+One of the problems I have with current popular blockchain technologies is that they are not scalable.
+Every node needs to process every transaction in the network.
+For a network with N nodes that is roughly O(N^2) total work.
+NEAR aims to solve exactly this problem using the classic big data trick — sharding the data across several partitions.
+
Another aspect of NEAR I am particularly excited about is the strategic focus on the smart contract’s developer experience.
+That’s why NEAR is particularly interested in supporting rust-analyzer.
+Rust, with its top-notch WASM ecosystem and focus on correctness is a natural choice for writing contracts.
+At the same time, it is not the most approachable language there is.
+Good tooling can help a lot with surmounting the language’s inherent complexity, making writing smart contracts in Rust easy.
+
What does it mean for rust-analyzer?
+We’ll see: I will still be putting significant hours into it, although a bit less than previously.
+I’ll also help to manage rust-analyzer Open Collective.
+And, naturally, my know-how about building IDEs isn’t going anywhere :)
+At the same time, I am excited about lowering the bus factor and distributing rust-analyzer maintainership.
+I do want to take credit for initiating the effort, but it’s high time for some structured leadership rotation.
+It’s exciting to see @jonas-schievink from Ferrous System taking on more team leadership tasks.
+(I am hyped about support for inner items, kudos Jonas!)
+I am also delighted with the open source community that formed around rust-analyzer.
+@edwin0cheng,
+@flodiebold,
+@kjeremy,
+@lnicola,
+@SomeoneToIgnore,
+@Veetaha,
+@Veykril
+you are awesome, and rust-analyzer wouldn’t be possible without you ❤️
+
Finally, I can’t help but notice that IntelliJ Rust which I left completely a while ago is doing better than ever.
+Overall, I must say I am quite happy with today’s state of Rust IDE tooling.
+The basics are firmly in place.
+Let’s just finish the remaining 90%!
Any language has parametric polymorphism, eventually
+
If you start with just dynamic dispatch, you’ll end up adding generics down the road.
+This happened with C++ and Java, and is now happening with Go.
+The last one is interesting — even if you don’t carry accidental OOP baggage (inheritance), interfaces alone are not enough.
+
Why does it happen?
+Well, because generics are useful for simple things.
+Even if the language special-cases several parametric data structures, like Go does with slices, maps and channels, it is impossible to abstract over them.
+In particular, it’s impossible to write list_reverse or list_sort functions without some awkward workarounds.
+
Ok, but where’s the dilemma?
+The dilemma is that adding parametric polymorphism to the language opens floodgates of complexity.
+At least in my experience, Rust traits, Haskell type classes, and Java generics are the main reason why some libraries in those languages are hard to use.
+
It’s not that generics are inherently hard, fn reverse<T>(xs: [T]) -> [T] is simple.
+It’s that they allow creating complicated solutions, and this doesn’t play well with our human bias for complexity.
+
One thing I am wondering is whether a polymorphic language without bounded quantification would be practical?
+Again, in my anecdotal experience, cognitive complexity soars when there are bounds on type parameters: T: This<S> + That.
+But parametric polymorphism can be useful without them:
+
+
+
is equivalent to
+
+
+
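To illustrate the general idea (this is my own example, not necessarily the one elided above): a bound such as T: Ord can often be traded for passing the needed operation in as an explicit argument, leaving the type parameter itself unbounded.
+
```rust
// Unbounded parametric polymorphism: `T` can be anything, the only
// "capability" comes in as a plain function argument.
fn sort_by<T>(xs: &mut Vec<T>, less: fn(&T, &T) -> bool) {
    // Selection sort, to keep the sketch dependency-free.
    for i in 0..xs.len() {
        for j in i + 1..xs.len() {
            if less(&xs[j], &xs[i]) {
                xs.swap(i, j);
            }
        }
    }
}

fn main() {
    let mut xs = vec![3, 1, 2];
    sort_by(&mut xs, |a, b| a < b);
    assert_eq!(xs, vec![1, 2, 3]);
}
```
+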
Can we build an entire language out of this pattern?
Click bait title!
+We’ll actually look into how integration and unit tests are implemented in Cargo.
+A few guidelines for organizing test suites in large Cargo projects naturally arise out of these implementation differences.
+And, yes, one of those guidelines will turn out to be: “delete all integration tests but one”.
+
Keep in mind that this post is explicitly only about Cargo concepts.
+It doesn’t discuss relative merits of integration or unit styles of testing.
+I’d love to, but that’s going to be a loooong article some other day!
When you use Cargo, you can put #[test] functions directly next to code, in files inside src/ directory.
+Alternatively, you can put them into dedicated files inside tests/:
+
+
+
I stress that unit/integration terminology is based purely on the location of the #[test] functions, and not on what those functions actually do.
+
To build unit tests, Cargo runs
+
+
+
Rustc then compiles the library with --cfg test.
+It also injects a generated fn main(), which invokes all functions annotated with #[test].
+The result is an executable file which, when run subsequently by Cargo, executes the tests.
+
Integration tests are built differently.
+First, Cargo uses rustc to compile the library as usual, without --cfg test:
+
+
+
This produces an .rlib file — a compiled library.
+
Then, for each file in the tests directory, Cargo runs the equivalent of
+
+
+
That is, each integration test is compiled into a separate binary.
+Running those binaries executes the test functions.
Note that rustc needs to repeatedly re-link the library crate with each of the integration tests.
+This can add up to a significant compilation time blow up for tests.
+That is why I recommend that large projects should have only one integration test crate with several modules.
+That is, don’t do this:
+
+
+
Do this instead:
+
+
+
When a refactoring along these lines was applied to Cargo itself, the effects were substantial (numbers).
+The time to compile the test suite decreased 3x.
+The size of on-disk artifacts decreased 5x.
+
It can’t get better than this, right?
+Wrong!
+Rust tests by default are run in parallel.
+The main function that rustc generates spawns several threads to saturate all of the CPU cores.
+However, Cargo itself runs test binaries sequentially.
+This makes sense — otherwise, concurrently executing test binaries would oversubscribe the CPU.
+But this means that multiple integration tests leave performance on the table.
+The critical path is the sum of longest tests in each binary.
+The more binaries, the longer the path.
+For one of my projects, consolidating several integration tests into one reduced the time to run the test suite from 20 seconds to just 13.
+
A nice side-effect of a single modularized integration test is that sharing the code between separate tests becomes trivial, you just pull it into a submodule.
+There’s no need to awkwardly repeat mod common; for each integration test.
If the project I am working with is small, I don’t worry about test organization.
+There’s no need to make tests twice as fast if they are already nearly instant.
+
Conversely, if the project is large (a workspace with many crates) I worry about test organization a lot.
+Slow tests are a boiling frog kind of problem.
+If you do not proactively fix it, everything is fine up until the moment you realize you need to sink a week to untangle the mess.
+
For a library with a public API which is published to crates.io, I avoid unit tests.
+Instead, I use a single integration test, called it (short for integration test):
+
+
+
Integration tests use the library as an external crate.
+This forces the usage of the same public API that consumers use, resulting in better design feedback.
+
For an internal library, I avoid integration tests altogether.
+Instead, I use Cargo unit tests for “integration” bits:
+
+
+
That way, I avoid linking the separate integration tests binary altogether.
+I also have access to non-pub API of the crate, which is often useful.
First, documentation tests are extremely slow.
+Each doc test is linked as a separate binary.
+For this reason, avoid doc tests in internal libraries for big projects and add this to Cargo.toml:
+
+
+
Second, prefer
+
+
+
to
+
+
+
+This way, when you modify just the tests, Cargo is smart enough not to recompile the library crate.
+It knows that the contents of tests.rs only affects compilation when --test is passed to rustc.
+Learned this one from @petrochenkov, thanks!
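+
Assuming the two elided snippets contrast an out-of-line test module with an inline one (the mention of tests.rs suggests so), the preferred shape is roughly:
+
```rust
// src/lib.rs: the test module lives in its own file...
#[cfg(test)]
mod tests;

pub fn add(a: u32, b: u32) -> u32 {
    a + b
}
```
+
```rust
// src/tests.rs: ...so editing the tests only touches this file, which is
// compiled only when --test is passed.
use super::*;

#[test]
fn it_works() {
    assert_eq!(add(2, 2), 4);
}
```
+
The alternative is the same module written inline as #[cfg(test)] mod tests { … } at the bottom of src/lib.rs.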
+
Third, even if you stick to unit tests, the library is recompiled twice: once with, and once without --test.
+For this reason, folks from pernosco go even further.
+They add
+
+
+
to Cargo.toml, make all APIs they want to unit test public and have a single test crate for the whole workspace.
+This crate links everything and contains all the unit tests.
The most commonly cited drawback of OS-level threads is that they use a lot of RAM.
+This is not true on Linux.
+
Let’s compare memory footprint of 10_000 Linux threads with 10_000 goroutines.
+We spawn 10k workers, which sleep for about 10 seconds, waking up every 10 milliseconds.
+Each worker is staggered by a pseudorandom delay up to 200 milliseconds to avoid the thundering herd problem.
+
+
+
+
+
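The benchmark programs themselves are not shown here; purely as an illustration, a Rust sketch of the thread version matching the description above could look like this (the staggering uses a cheap deterministic stand-in for a real RNG):
+
```rust
use std::{thread, time::Duration};

fn main() {
    let workers: Vec<_> = (0..10_000u64)
        .map(|i| {
            thread::spawn(move || {
                // Stagger start-up by a pseudorandom delay of up to 200ms.
                thread::sleep(Duration::from_millis((i * 7919) % 200));
                // Wake up every 10 milliseconds for roughly 10 seconds.
                for _ in 0..1_000 {
                    thread::sleep(Duration::from_millis(10));
                }
            })
        })
        .collect();
    for w in workers {
        w.join().unwrap();
    }
}
```
+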
We use time utility to measure memory usage:
+
+
+
The results:
+
+
+
A thread is only 3 times as large as a goroutine.
+Absolute numbers are also significant: 10k threads require only 100 megabytes of overhead.
+If the application does 10k concurrent things, 100mb might be negligible.
+
+
+
+
Note that it is wrong to use this benchmark to compare performance of threads and goroutines.
+The workload is representative for measuring absolute memory overhead, but is not representative for time overhead.
+
That being said, it is possible to explain why threads need 21 seconds of CPU time while goroutines need only 14.
+Go runtime spawns a thread per CPU-core, and tries hard to keep each goroutine tied to specific thread (and, by extension, CPU).
+Threads by default migrate between CPUs, which incurs synchronization overhead.
+Pinning threads to cores in a round-robin fashion removes this overhead:
+
+
+
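One way to do the pinning from Rust is the core_affinity crate (an illustration, not the exact code behind the numbers below):
+
```rust
fn main() {
    // Pin each worker to a core, round-robin.
    let cores = core_affinity::get_core_ids().unwrap();
    let workers: Vec<_> = (0..10_000usize)
        .map(|i| {
            let core = cores[i % cores.len()];
            std::thread::spawn(move || {
                core_affinity::set_for_current(core);
                // ... the same sleepy workload as before ...
            })
        })
        .collect();
    for w in workers {
        w.join().unwrap();
    }
}
```
+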
The total CPU time now is approximately the same, but the distribution is different.
+On this workload, the goroutine scheduler spends roughly the same number of cycles in userspace as the thread scheduler spends in the kernel.
I don’t understand performance characteristics of “async” programming when applied to typical HTTP based web applications.
+Let’s say we have a CRUD app with a relational database, where a typical request results in N queries to the database and transfers M bytes over the network.
+How much (orders of magnitude?) faster/slower would an “async” solution be in comparison to a “threaded” solution?
+
+In this live post, I am collecting benchmarks that help shed light on this and related questions.
+Note that I am definitely not the right person to do this work, so, if there is a better resource, I’ll gladly just use that instead.
+Feel free to send pull requests with benchmarks!
+Every benchmark will be added, but some might go to the rejected section.
+
I am interested in understanding differences between several execution models, regardless of programming language:
+
+
Threads:
+
+
Good old POSIX threads, as implemented on modern Linux.
+
+
Stackful Coroutines
+
+
M:N threading, which exposes the same programming model as threads, but is implemented by multiplexing several user-space coroutines over a single OS-level thread.
+The most prominent example here is Go.
+
+
Stackless Coroutines
+
+
In this model, each concurrent computation is represented by a fixed-size state machine which reacts to events.
+This model often uses async / await syntax for describing and composing state machines using standard control flow constructs.
+
+
Threads With Cooperative Scheduling
+
+
This is a mostly hypothetical model of OS threads with an additional primitive for directly switching between two threads of the same process.
+It is not implemented on Linux (see this presentation for some old work towards that).
+It is implemented on Windows under the “fiber” branding.
+
+
+
I am also interested in Rust’s specific implementation of stackless coroutines
This is a micro benchmark comparing the cost of primitive operations of threads and of stackless coroutines as implemented in Rust.
+Findings:
+
+
+Thread creation is an order of magnitude slower.
+
+
+Threads use an order of magnitude more RAM.
+
+
+IO-related context switches take the same time
+
+
+Thread-to-thread context switches (channel sends) take the same time, if threads are pinned to one core.
+This is surprising to me.
+I’d expect channel send to be significantly more efficient for either stackful or stackless coroutines.
+
+
+Thread-to-thread context switches are an order of magnitude slower if there’s no pinning.
+
+
+Threads hit non-memory resource limitations quickly (it’s hard to spawn > 50k threads).
+
Micro benchmark which compares Rust’s implementation of stackless coroutines with a manually coded state machine.
+Rust’s async/await turns out to not be zero-cost, pure overhead is about 4x.
+The absolute numbers are still low though, and adding even a single syscall of work reduces the difference to only 10%
This is a micro benchmark comparing just the memory overhead of threads and stackful coroutines as implemented in Go.
+Threads are “times”, but not “orders of magnitude” larger.
Macro benchmark which compares many different Python web frameworks.
+The conclusion is that async is worse for both latency and throughput.
+Note two important things.
+First, the servers are run behind a reverse proxy (nginx), which drastically changes IO patterns that are observed by the server.
+Second, Python is not the fastest language, so throughput is roughly correlated with the amount of C code in the stack.
This is a macro benchmark comparing performance of sync and async Rust web servers.
+This is the kind of benchmark I want to see, and the analysis is exceptionally good.
+Sadly, a big part of the analysis is fighting with unreleased versions of software and working around bugs, so I don’t trust that the results are representative.
This is a micro benchmark that pretends to be a macro benchmark.
+The code is overly optimized to fit a very specific task.
+I don’t think the results are easily transferable to real-world applications.
+At the same time, lack of the analysis and the “macro” scale of the task itself doesn’t help with building a mental model for explaining the observed performance.
The opposite of a benchmark actually.
+This post gives a good theoretical overview of why async might lead to performance improvements.
+Sadly, it drops the ball when it comes to practice:
I am struggling with designing concurrent code.
+In this post, I want to share a model problem which exemplifies some of the issues.
+It is reminiscent of the famous expression problem in that there’s a two dimensional design grid, and a win along one dimension translates to a loss along the other.
+If you want a refresher on the expression problem (not required to understand this article), take a look at this post.
+It’s not canonical, but I like it.
+
Without further ado, concurrent expression problem:
+
+
+
I am not sure that’s exactly the right formulation, I feel like I am straining it a bit to fit the expression problem shape.
+The explanation that follows matters more.
+
I think there are two ways to code the system described.
+The first approach is to use a separate thread / goroutine / async task for each concurrent activity, with some synchronization around the access to the shared state.
+The alternative approach is to write an explicit state machine / actor loop to receive the next event and process it.
+
In the first scheme, adding new activities is easy, as you just write straight-line code with maybe some .awaits here and there.
+In the second scheme, it’s easy to check and act on invariants, as there is only a single place where the state is modified.
+
Let’s take a look at a concrete example.
+We’ll be using a pseudo code for a language with cooperative concurrency and explicit yield points (think Python with async/await).
+
The state consists of two counters.
+One activity decrements the first counter every second.
+The other activity does the same to the other counter.
+When both counters reach zero, we want to print something.
+
The first approach would look roughly like this:
+
+
+
And the second one like this:
+
+
+
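The pseudo code itself is elided; to make the second shape concrete, here is a rough Rust rendition of it using plain threads and a channel instead of async pseudo code (counter values are arbitrary):
+
```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

fn main() {
    // Every activity only *sends* events; a single loop owns the state.
    let (tx, rx) = mpsc::channel();
    for i in 0..2usize {
        let tx = tx.clone();
        thread::spawn(move || loop {
            thread::sleep(Duration::from_secs(1));
            if tx.send(i).is_err() {
                return; // the event loop is gone, stop ticking
            }
        });
    }
    drop(tx);

    let mut counters = [3u32, 5];
    for i in rx {
        if counters[i] > 0 {
            counters[i] -= 1;
        }
        // The invariant is checked in exactly one place.
        if counters == [0, 0] {
            println!("both counters reached zero");
            break;
        }
    }
}
```
+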
It’s much easier to see what the concurrent activities are in the first case.
+It’s more clear how the overall state evolves in the second case.
+
The second approach also gives you more control — if several events are ready, you can process them in the order of priority (usually it makes sense to prioritize writes over reads).
+You can trivially add some logging at the start and end of the loop to collect data about slow events and overall latency.
+But the hit to the programming model is big.
+If you are new to the code and don’t know which conceptual activities are there, it’s hard to figure that out just from the code.
+The core issue is that causal links between asynchronous events are not reified in the code:
These are the notes on a design pattern I noticed in several contexts.
+
Suppose, metaphorically, you have a neatly organized bookcase which categorizes the books by their topics.
+And now, suppose you’ve got a new book, which doesn’t fit clearly into any existing category.
+What would you do?
+
Here are some common solutions I’ve seen:
+
+
+Put the book somewhere in the bookcase.
+
+
+Start rearranging the shelves until you have a proper topic for this new book.
+Maybe introduce a dedicated topic just for this single book.
+
+
+Don’t store the book in the bookcase, keep it on the bedside table.
+
+
+
Here’s the “kitchen sink pattern” solution for this problem: have the “Uncategorized” shelf for books which don’t clearly fit into the existing hierarchy.
+
The idea here is that the overall organization becomes better, if you explicitly designate some place as “stuff that doesn’t fit goes here by default”.
+Let’s see the examples.
+
First, the Django web framework has a shortcuts module which contains convenience functions that don’t fit the model/view separation.
+The get_object_or_404 function looks up an object in the database and returns HTTP 404 if it is not found.
+Models (SQL) and views (HTTP) don’t know about each other, so the function doesn’t belong to either of these modules.
+Placing it in shortcuts allows this separation to be more crisp.
+
Second, I have two tricks to keep my home folder organized.
+I have a script that clears ~/downloads on every reboot, and I have a ~/tmp as my dumping ground.
+Before ~/tmp, various semi-transient things polluted my otherwise perfectly organized workspace.
+
Third, I asked my colleague recently about some testing infrastructure.
+They replied that they have an extensive document for it in their fork, because it’s unclear where it would belong in the main repo.
+In this case the absence of a “dumping ground” prevented useful work for no good reason.
+
Fourth, in rust-analyzer we have a ast::make module which is intended to contain the minimal orthogonal set of constructors for AST nodes.
+Historically, people kept adding non-minimal, non-orthogonal constructors there as well.
+Useful work was done, but it muddied the design.
+This was fixed by adding a dedicated ast::make::ext submodule for convenient shortcuts.
+
Fifth, for big projects I like having stdext modules, which fill-in missing batteries for the standard library.
+Without it, various modules tend to accumulate unrelated, and often slightly duplicated, functionality.
+
Sixth, to avoid overthinking and setup costs to start a new hobby project (of which I have a tonne), I have a single monorepo for all incomplete things.
+Adding a folder there is much easier than creating a GitHub repo.
+
To sum up, many classifications work best if there is an explicit “can’t classify this” category.
+If there’s no obvious place to put things which don’t fit, a solid design might erode with time.
+Note that for this pattern to be useful, the existence of a good solid design is a prerequisite, lest all the code end up in a utils module.
Alternative titles:
+ Unit Tests are a Scam
+ Test Features, Not Code
+ Data Driven Integrated Tests
+
+
This post describes my current approach to testing.
+When I started programming professionally, I knew how to write good code, but good tests remained a mystery for a long time.
+This is not due to the lack of advice — on the contrary, there’s abundance of information & terminology about testing.
+This celestial emporium of benevolent knowledge includes TDD, BDD, unit tests, integrated tests, integration tests, end-to-end tests, functional tests, non-functional tests, blackbox tests, glassbox tests, …
+
Knowing all this didn’t help me to create better software.
+What did help was trying out different testing approaches myself, and looking at how other people write tests.
+Keep in mind that my background is mostly in writing compiler front-ends for IDEs.
+This is a rather niche area, which is especially amenable to testing.
+Compilers are pure self-contained functions.
+I don’t know how to best test modern HTTP applications built around inter-process communication.
+
Without further ado, let’s see what I have learned.
This is something I inflicted upon myself early in my career, and something I routinely observe.
+You want to refactor some code, say add a new function parameter.
+Turns out, there are a dozen tests calling this function, so now a simple refactor also involves fixing all the tests.
+
There is a simple, mechanical fix to this problem: introduce the check function which encapsulates API under test.
+It’s easier to explain using a toy example.
+Let’s look at testing something simple, like a binary search, just to illustrate the technique.
+
We start with direct testing:
+
+
+
Some time passes, and we realize that -> bool is not the best signature for binary search.
+It’s better if it returned an insertion point (an index where element should be inserted to maintain sortedness).
+That is, we want to change the signature to
+
+
+
Now we have to change every test, because the tests are tightly coupled to the specific API.
+
My solution to this problem is making the tests data driven.
+Instead of every test interacting with the API directly, I like to define a single check function which calls the API.
+This function takes a pair of input and expected result.
+For binary search example, it will look like this:
+
+
+
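A sketch of the shape this takes, assuming the original signature was fn binary_search(xs: &[u32], x: u32) -> bool:
+
```rust
fn check(xs: &[u32], x: u32, expected: bool) {
    assert_eq!(binary_search(xs, x), expected);
}

#[test]
fn binary_search_works() {
    check(&[1, 3, 5], 3, true);
    check(&[1, 3, 5], 4, false);
}
```
+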
Now, when the API of the binary_search function changes, we only need to adjust a single place — the check function:
+
+
+
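After the switch to -> Result<usize, usize>, only the body of check changes, so the existing calls keep compiling and passing:
+
```rust
fn check(xs: &[u32], x: u32, expected: bool) {
    // Ok(index) means the element is present, Err(index) is the insertion point.
    assert_eq!(binary_search(xs, x).is_ok(), expected);
}
```
+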
To be clear, after you’ve done the refactor, you’ll need to adjust the tests to check the index as well, but this can be done separately.
+Existing test suite does not impede changes.
+
+
Keep in mind that the binary search example is artificially simple.
+The main danger here is that this is a boiling frog type of situation.
+While the project is small and the tests are few, you don’t notice that refactors are ever so slightly longer than necessary.
+Then, several tens of thousands lines of code later, you realize that to make a simple change you need to fix a hundred tests.
Almost no one likes to write tests.
+I’ve noticed many times how, upon fixing a trivial bug, I am prone to skipping the testing work.
+Specifically, if writing a test is more effort than the fix itself, testing tends to go out of the window.
+Hence,
+
+
Coming back to the binary search example, note how check function reduces the amount of typing to add a new test.
+For tests, this is a significant saving, not because typing is hard, but because it lowers the cognitive barrier to actually do the work.
The over-simplified binary search example can be stretched further.
+What if you replace the sorted array with a hash map inside your application?
+Or what if the calling code no longer needs to search at all, and wants to process all of the elements instead?
+
Good code is easy to delete.
+Tests represent an investment into existing code, and make it costlier to delete (or change).
+
The solution is to write tests for features in such a way that they are independent of the code.
+I like to use the neural network test for this:
+
+
Neural Network Test
+
+
Can you re-use the test suite if your entire software is replaced with an opaque neural network?
+
+
+
To give a real-life example this time, suppose that you are writing that part of code-completion engine which sorts potential completions according to relevance.
+(something I should probably be doing right now, instead of writing this article :-) )
+
Internally, you have a bunch of functions that compute relevance facts, like:
+
+
+Is there a direct type match (.foo has the desired type)?
+
+
+Is there an indirect type match (.foo.bar has the right type)?
+
+
+How frequently is this completion used in the current module?
+
+
+
Then, there’s the final ranking function that takes these facts and comes up with an overall rank.
+
The classical unit-test approach here would be to write a bunch of isolated tests for each of the relevance functions,
+and a separate bunch of tests which feeds the ranking function a list of relevance facts and checks the final score.
+
This approach obviously fails the neural network test.
+
An alternative approach is to write a test to check that at a given position a specific ordered list of entries is returned.
+That suite could work as a cross-validation for an ML-based implementation.
+
+In practice, it’s unlikely (but not impossible) that we’ll use actual ML here.
+But it’s highly probable that the naive independent-weights model isn’t the end of the story.
+At some point there will be special cases which would necessitate change of the interface.
+
+
Note that this advice goes directly against one common understanding of unit-testing.
+I am fairly confident that it results in better software over the long run.
There’s one talk about software engineering, which stands out for me, and which is my favorite.
+It is Boundaries by Gary Bernhardt.
+There’s a point there though, which I strongly disagree with:
+
+
Integration Tests are Superlinear?
+
+
When you use integration tests, any new feature is accompanied by a bit of new code and a new test.
+However, new code slows down all other tests, so the overall test suite becomes slow, as the total time grows super-linearly.
+
+
+
I don’t think more code under test translates to slower test suite.
+Merge sort spends more lines of code than bubble sort, but it is way faster.
+
In the abstract, yes, more code generally means more execution time, but I doubt this is the defining factor in tests execution time.
+What actually happens is usually:
+
+
+Input/Output — reading just a bit from a disk, network or another process slows down the tests significantly.
+
+
+Outliers — very often, testing time is dominated by only a couple of slow tests.
+
+
+Overly large input — throwing enough data at any software makes it slow.
+
+
+
The problem with integrated tests is not code volume per se, but the fact that they typically mean doing a lot of IO.
+But this doesn’t need to be the case.
+
+
Nonetheless, some tests are going to be slow.
+It pays off to introduce the concept of slow tests early on, arrange the skipping of such tests by default and only exercise them on CI.
+You don’t need to be fancy, just checking an environment variable at the start of the test is perfectly fine:
+
+
+
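Something along these lines (the environment variable name is arbitrary):
+
```rust
#[test]
fn slow_end_to_end_test() {
    // Skipped by default; CI opts in with `RUN_SLOW_TESTS=1 cargo test`.
    if std::env::var("RUN_SLOW_TESTS").is_err() {
        return;
    }
    // ... the actual slow test ...
}
```
+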
Definitely do not use conditional compilation to hide slow tests — it’s an obvious solution which makes your life harder
+(similar observation from the Go ecosystem).
+
To deal with outliers, print each test’s execution time by default.
+Having the numbers fly by gives you immediate feedback and incentive to improve.
All these together lead to a particular style of architecture and tests, which I call data driven testing.
+The bulk of the software is a pure function, where the state is passed in explicitly.
+Removing IO from the picture necessitates that the interface of software is specified in terms of data.
+Value in, value out.
+
One property of data is that it can be serialized and deserialized.
+That means that check style tests can easily accept arbitrarily complex input, which is specified in a structured format (JSON), an ad-hoc plain text format, or via an embedded DSL (a builder-style interface for data objects).
+
+Similarly, the “expected” argument of check is data.
+It is a result which is more-or-less directly displayed to the user.
+
A convincing example of a data driven test would be a “Goto Definition” test from rust-analyzer (source):
+
+
+
In this case, the check function has only a single argument — a string which specifies both the input and the expected result.
+The input is a rust project with three files (//- /file.rs syntax shows the boundary between the files).
+The current cursor position is also a part of the input and is specified with the $0 syntax.
+The result is the //^^^ comment which marks the target of the “Goto Definition” call.
+The check function creates an in-memory Rust project, invokes “Goto Definition” at the position signified by $0, and checks that the result is the position marked with ^^^.
+
Note that this is decidedly not a unit test.
+Nothing is stubbed or mocked.
+This test invokes the whole compilation pipeline: virtual file system, parser, macro expander, name resolution.
+It runs on top of our incremental computation engine.
+It touches a significant fraction of the IDE APIs.
+Yet, it takes 4ms in debug mode (and 500µs in release mode).
+And note that it absolutely does not depend on any internal API — if we replace our dumb compiler with sufficiently smart neural net, nothing needs to be adjusted in the tests.
+
There’s one question though: why on earth am I using a png image to display a bit of code?
+Only to show that the raw string literal (r#""#) which contains Rust code is highlighted as such.
+This is possible because we re-use the same input format (with //-, $0 and couple of other markup elements) for almost every test in rust-analyzer.
+As such, we can invest effort into building cool things on top of this format, which subsequently benefit all our tests.
The previous example had a complex data input, but a relatively simple data output — a position in the file.
+Often, the output is messy and has a complicated structure as well (a symptom of the problem).
+Worse, sometimes the output is the part that changes frequently.
+This often necessitates updating a lot of tests.
+Going back to the binary search example, the change from -> bool to -> Result<usize, usize> was an example of this effect.
+
There is a technique to make such simultaneous changes to all gold outputs easy — testing with expectations.
+You specify the expected result as a bit of data inline with the test.
+There’s a special mode of running the test suite for updating this data.
+Instead of failing the test, a mismatch between expected and actual causes the gold value to be updated in-place.
+That is, the test framework edits the code of the test itself.
+
Here’s an example of this workflow in rust-analyzer, used for testing code completion:
+
+
+
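The rust-analyzer example itself is elided; as a stand-in, here is a minimal sketch with the expect-test crate, which implements this workflow (running UPDATE_EXPECT=1 cargo test rewrites the gold values in place):
+
```rust
use expect_test::expect;

fn frobnicate(input: &str) -> String {
    input.to_uppercase() // stand-in for the real functionality
}

#[test]
fn frobnicate_works() {
    let actual = frobnicate("hello");
    // On mismatch, `UPDATE_EXPECT=1 cargo test` edits the literal below in place.
    expect![["HELLO"]].assert_eq(&actual);
}
```
+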
Often, just Debug representation of the type works well for expect tests, but you can do something more fun.
+See this post from Jane Street for a great example:
+Using ASCII waveforms to test hardware designs.
An extremely popular genre for a testing library is a collection of fluent assertions:
+
+
+
+The benefit of this style is better error messages.
+Instead of just “false is not true”, the testing framework can print values for x and y.
+
I don’t find this useful.
+Using the check style testing, there are very few assertions actually written in code.
+Usually, I start with plain asserts without messages.
+The first time I debug an actual test failure for a particular function, I spend some time to write a detailed assertion message.
+To me, fluent assertions are not an attractive point on the curve that includes plain asserts and hand-written, context-aware explanations of failures.
+A notable exception here is the pytest approach — this testing framework overrides the standard assert to provide a rich diff without ceremony.
One apparent limitation of the style of integrated testing I am describing is checking for properties which are not part of the output.
+For example, if some kind of caching is involved, you might want to check that the cache is actually being hit, and is not just sitting there.
+But, by definition, cache is not something that an outside client can observe.
+
The solution to this problem is to make this extra data a part of the system’s output by adding extra observability points.
+A good example here is Cargo’s test suite.
+It is set-up in an integrated, data-driven fashion.
+Each test starts with a succinct DSL for setting up a tree of files on disk.
+Then, a full cargo command is invoked.
+Finally, the test looks at the command’s output and the resulting state of the file system, and asserts the relevant facts.
+
Tests for caching additionally enable verbose internal logging.
+In this mode, Cargo prints information about cache hits and misses.
+These messages are then used in assertions.
+
A close idea is coverage marks.
+Sometimes, you want to check that something does not happen.
+Tests for this tend to be fragile — often the thing does not happen, but for the wrong reason.
+You can add a side channel which explains the reasoning behind particular behavior, and additionally assert this as well.
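+
A minimal sketch of the pattern (the cov_mark crate provides hit!/check! macros along these lines; the function here is made up):
+
```rust
// Production code: record *why* nothing happened.
fn shorten(input: &str, limit: usize) -> Option<String> {
    if input.len() <= limit {
        cov_mark::hit!(short_input_left_alone);
        return None;
    }
    Some(input[..limit].to_string())
}

// Test: fails unless the marked branch was actually taken.
#[test]
fn short_input_left_alone() {
    cov_mark::check!(short_input_left_alone);
    assert!(shorten("hi", 80).is_none());
}
```
+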
In the ultimate stage of data driven tests the definitions of test cases are moved out of test functions and into external files.
+That is, you don’t do this:
+
+
+
Rather, there is a single test that looks like this:
+
+
+
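In other words, something like this (paths and the check helper are illustrative):
+
```rust
#[test]
fn externalized_tests() {
    // One #[test] that walks a directory of test-data files.
    for entry in std::fs::read_dir("test_data").unwrap() {
        let path = entry.unwrap().path();
        let input = std::fs::read_to_string(&path).unwrap();
        check(&input);
    }
}
```
+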
I have a love-hate relationship with this approach.
+It has at least two attractive properties.
+First, it forces a data driven approach without any cheating.
+Second, it makes the test suite more re-usable.
+An alternative implementation in a different programming language can use the same tests.
+
But there’s a drawback as well — without literal #[test] attributes, integration with tooling suffers.
+For example, you don’t automatically get “X out of Y tests passed” at the end of test run.
+You can’t conveniently debug just a single test; there isn’t a helpful “Run” icon/shortcut you can use in an IDE.
+
When I do externalized test cases, I like to leave a trivial smoke test behind:
+
+
+
If I need to debug a failing external test, I first paste the input into this smoke test, and then get my IDE tooling back.
Reading from a file is not the most fun way to come up with a data input for a check function.
+
Here are a few other popular ones:
+
+
Property Based Testing
+
+
Generate the input at random and verify that the output makes sense.
+For a binary search, check that the needle indeed lies between the two elements at the insertion point.
+
+
Full Coverage
+
+
Better still, instead of generating some random inputs, just check that the answer is correct for all inputs.
+This is how you should be testing binary search — generate every sorted list of length at most 7 with numbers in the 0..=6 range.
+Then, for each list and for each number, check that the binary search gives the same result as a naive linear search (a sketch follows after this list).
+
+
Coverage Guided Fuzzing
+
+
Just throw random bytes at the check function.
+Random bytes probably don’t make much sense, but it’s good to verify that the program returns an error instead of summoning nasal demons.
+Instead of piling bytes completely at random, observe which branches are taken, and try to invent byte sequences which cover more branches.
+Note that this test is polymorphic in the system under test.
Use random bytes as a seed to generate “syntactically valid” inputs, then see your software crash and burn when the most hideous edge cases are uncovered.
+If you use Rust, check out wasm-smith and arbitrary crates.
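+
Here is the full-coverage sketch promised above, assuming the fn binary_search(xs: &[u32], x: u32) -> Result<usize, usize> signature from the earlier example:
+
```rust
#[test]
fn exhaustive() {
    // Recursively generate every sorted (non-decreasing) list of length <= 7
    // with elements in 0..=6, and compare membership against a naive linear search.
    fn go(xs: &mut Vec<u32>) {
        for x in 0..=6u32 {
            let naive = xs.iter().any(|&y| y == x);
            assert_eq!(binary_search(xs.as_slice(), x).is_ok(), naive, "xs = {xs:?}, x = {x}");
        }
        if xs.len() == 7 {
            return;
        }
        let lo = xs.last().copied().unwrap_or(0);
        for y in lo..=6 {
            xs.push(y);
            go(xs);
            xs.pop();
        }
    }
    go(&mut Vec::new());
}
```
+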
What if isolating IO is not possible, and the application is fundamentally built around interacting with external systems?
+In this case, my advice is to just accept that the tests are going to be slow, and might need extra effort to avoid flakiness.
+
Cargo is the perfect case study here.
+Its raison d’être is orchestrating a herd of external processes.
+Let’s look at the basic test:
+
+
+
+The project() part is a builder, which describes the state of the system.
+First, .build() writes the specified files to disk in a temporary directory.
+Then, p.cargo("build").run() executes the real cargo build command.
+Finally, a bunch of assertions are made about the end state of the file system.
+
Neural network test: this is completely independent of internal Cargo APIs, by virtue of interacting with a cargo process via IPC.
+
To give an order-of-magnitude feeling for the cost of IO, Cargo’s test suite takes around seven minutes (-j 1), while rust-analyzer finishes in less than half a minute.
+
An interesting case is the middle ground, when the IO-ing part is just big and important enough to be annoying.
+That is the case for rust-analyzer — although almost all code is pure, there’s a part which interacts with a specific editor.
+What makes this especially finicky is that, in the case of Cargo, it’s Cargo who calls external processes.
+With rust-analyzer, it’s something which we don’t control, the editor, which schedules the IO.
+This often results in hard-to-imagine bugs which are caused by particularly weird environments.
+
I don’t have good answers here, but here are the tricks I use:
+
+
+Accept that something will break during integration.
+Even if you always create perfect code and never make bugs, your upstream integration point will be buggy sometimes.
+
+
+Make integration bugs less costly:
+
+
+use release trains,
+
+
+make patch release process non-exceptional and easy,
+
+
+have a checklist for manual QA before the release.
+
+
+
+
+Separate the tricky to test bits into a separate project.
+This allows you to write slow and not 100% reliable tests for integration parts, while keeping the core test suite fast and dependable.
+
This API is fundamentally untestable.
+Can you see why?
+It spawns a concurrent computation, but it doesn’t allow waiting for this computation to be finished.
+So, any test that calls do_stuff_in_background can’t check that the “Stuff” is done.
+Worse, even tests which do not call this function might start to fail — they now can get interference from other tests.
+The concurrent computation can outlive the test that originally spawned it.
+
This problem plagues almost every concurrent application I see.
+A common symptom is adding timeouts and sleeps to test, to increase the probability of stuff getting done.
+Such timeouts are another common cause of slow test suites.
+
What makes this problem truly insidious is that there’s no work-around.
+Once broken, the causality link cannot be reforged by a layer above.
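+
A minimal illustration of the two shapes (names besides do_stuff_in_background are made up; the actual work is elided):
+
```rust
use std::thread;

// Untestable: fire-and-forget, the caller has nothing to wait on.
fn do_stuff_in_background() {
    let _detached = thread::spawn(|| {
        // ... stuff ...
    });
}

// Testable: hand the causality link back to the caller.
fn do_stuff_in_background_joinable() -> thread::JoinHandle<()> {
    thread::spawn(|| {
        // ... stuff ...
    })
}

#[test]
fn stuff_gets_done() {
    do_stuff_in_background_joinable().join().unwrap();
    // ... now it is safe to assert on the results ...
}
```
+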
Another common problem I see in complex projects is a beautifully layered architecture, which is “inverted” in tests.
+
Let’s say you have something fabulous, like L1 <- L2 <- L3 <- L4.
+To test L1, the path of least resistance is often to write tests which exercise L4.
+You might even think that this is the setup I am advocating for.
+Not exactly.
+
The problem with L1 <- L2 <- L3 <- L4 <- Tests is that working on L1 becomes slower, especially in compiled languages.
+If you make a change to L1, then, before you get to the tests, you need to recompile the whole chain of reverse dependencies.
+My “favorite” example here is rustc — when I worked on the lexer (L1), I spent a lot of time waiting for the rest of the compiler to be rebuilt to check my small change.
+
The right setup here is to write integrated tests for each layer:
+
+
+
+Note that testing L4 involves testing L1, L2 and L3.
+This is not a problem.
+Due to layering, only L4 needs to be recompiled.
+Other layers don’t affect run time meaningfully.
+Remember — it’s IO (and sleep-based synchronization) that kills performance, not just code volume.
In a nutshell, a #[test] is just a bit of code which is plugged into the build system to be executed automatically.
+Use this to your advantage, simplify the automation by moving as much as possible into tests.
+
Here are some things in rust-analyzer which are just tests:
+
+
+Code formatting (the most common one — you don't need an extra pile of YAML in CI, you can shell out to the formatter from the test; a sketch follows this list).
+
+
+Checking that the history does not contain merge commits and teaching new contributors git survival skills.
+
+
+Collecting the manual from specially-formatted doc comments across the code base.
+
+
+Checking that the code base is, in fact, reasonably well-documented.
+
+
+Ensuring that the licenses of dependencies are compatible.
+
+
+Ensuring that high-level operations are linear in the size of the input.
+Syntax-highlight a synthetic file of 1, 2, 4, 8, 16 kilobytes, run linear regression, check that result looks like a line rather than a parabola.
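
To make the first item concrete, here is a minimal sketch of a formatting check written as an ordinary test (illustrative only; rust-analyzer's real test is more involved):

```rust
use std::process::Command;

#[test]
fn code_is_formatted() {
    // Shell out to the formatter in check mode; the test fails if any
    // file would be reformatted.
    let status = Command::new("cargo")
        .args(["fmt", "--", "--check"])
        .status()
        .expect("failed to run cargo fmt");
    assert!(status.success(), "code is not formatted, run `cargo fmt`");
}
```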
+
This essay already mentioned a couple of cognitive tricks for better testing: reducing the fixed costs for adding new tests, and plotting/printing test times.
+The best trick in a similar vein is the “not rocket science” rule of software engineering.
+
The idea is to have a robot which checks that the merge commit passes all the tests, before advancing the state of the main branch.
+
Besides the evergreen master, such a bot adds pressure to keep the test suite fast and non-flaky.
+This is another boiling frog, something you need to constantly keep an eye on.
+If you have a single flaky test, it's very easy to miss when a second one is added.
This was a long essay.
+Let’s look back at some of the key points:
+
+
+There is a lot of information about testing, but it is not always helpful.
+At least, it was not helpful for me.
+
+
+The core characteristic of the test suite is how easy it makes changing the software under test.
+
+
+To that end, a good strategy is to focus on testing the features of the application, rather than on testing the code used to implement those features.
+
+
+A good test suite passes the neural network test — it is still useful if the entire application is replaced by an ML model which just comes up with the right answer.
+
+
+Corollary: good tests are not helpful for design in the small — a good test won’t tell you the best signatures for functions.
+
+
+Testing time is something worth optimizing for.
+Tests are sensitive to IO and IPC.
+Tests are relatively insensitive to the amount of code under test.
+
+
+There are useful techniques which are underused — expectation tests, coverage marks, externalized tests.
+
+
+There are not so useful techniques which are over-represented in the discourse: fluent assertions, mocks, BDD.
+
+
+The key for unlocking many of the above techniques is thinking in terms of data, rather than interfaces or objects.
+
+
+Corollary: good tests are helpful for design in the large.
+They help to crystallize the data model your application is built around.
+
There’s a lot of tribal knowledge surrounding #[inline] attribute in Rust.
+I often find myself teaching how it works, so I finally decided to write this down.
+
Caveat Emptor: this is what I know, not necessarily what is true.
+Additionally, exact semantics of #[inline] is not set in stone and may change in future Rust versions.
In other words, for an ahead-of-time compiled language, inlining is the mother of all other optimizations.
+It gives the compiler the necessary context to apply further transformations.
Inlining is at odds with another important idea in compilers — that of separate compilation.
+When compiling big programs, it is desirable to separate them into modules which can be compiled independently to:
+
+
+Process everything in parallel.
+
+
+Scope incremental recompilations to individual changed modules.
+
+
+
To achieve separate compilation, compilers expose signatures of functions, but keep function bodies invisible to other modules, preventing inlining.
+This fundamental tension is what makes #[inline] in Rust trickier than just a hint for the compiler to inline the function.
In Rust, a unit of (separate) compilation is a crate.
+If a function f is defined in a crate A, then all calls to f from within A can be inlined, as the compiler has full access to f.
+If, however, f is called from some downstream crate B, such calls can’t be inlined.
+B has access only to the signature of f, not its body.
+
That’s where the main usage of #[inline] comes from — it enables cross-crate inlining.
+Without #[inline], even the most trivial of functions can’t be inlined across the crate boundary.
+The benefit is not without a cost — the compiler implements this by compiling a separate copy of the #[inline] function with every crate it is used in, significantly increasing compile times.
+
Besides #[inline], there are two more exceptions to this.
+Generic functions are implicitly inlinable.
+Indeed, the compiler can only compile a generic function when it knows the specific type arguments it is instantiated with.
+As that is known only in the calling crate, bodies of generic functions have to be always available.
+
The other exception is link-time optimization.
+LTO opts out of separate compilation — it makes bodies of all functions available, at the cost of making compilation much slower.
Now that the underlying semantics is explained, it's possible to infer some rules of thumb for using #[inline].
+
First, it’s not a good idea to apply #[inline] indiscriminately, as that makes compile time worse.
+If you don’t care about compile times, a much better solution is to set lto = true in Cargo profile (docs).
+
Second, it usually isn’t necessary to apply #[inline] to private functions — within a crate, the compiler generally makes good inline decisions.
+There’s a joke that LLVM’s heuristic for when the function should be inlined is “yes”.
+
Third, when building an application, apply #[inline] reactively when profiling shows that a particular small function is a bottleneck.
+Consider using lto for releases.
+It might make sense to proactively #[inline] trivial public functions.
+
Fourth, when building libraries, proactively add #[inline] to small non-generic functions.
+Pay special attention to impls: Deref, AsRef and the like often benefit from inlining.
+A library can't anticipate all usages upfront, so it makes sense not to prematurely pessimize future users.
+Note that #[inline] is not transitive: if a trivial public function calls a trivial private function, you need to #[inline] both.
+See this benchmark for details.
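
A small, hypothetical illustration: for the public function to stay inlinable downstream, the private helper it delegates to needs the attribute as well.

```rust
// Somewhere in a library crate.
#[inline]
pub fn check(x: u32) -> bool {
    is_small(x)
}

#[inline] // without this, downstream crates inline `check` but still call `is_small`
fn is_small(x: u32) -> bool {
    x < 16
}
```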
+
Fifth, mind generic functions.
+It’s not too wrong to say that generic functions are implicitly inline.
+As a result, they often are a cause for code bloat.
+Generic functions, especially in libraries, should be written to minimize unwanted inlining.
+To give an example from wat:
+@alexcrichton explains inline.
+Note that, in reality, the compile time costs are worse than what I described — inline functions are compiled per codegen-unit, not per crate.
+
This is a follow up to the previous post about #[inline] in Rust specifically.
+This post is a bit more general, and a bit more ranty.
+Reader, beware!
+
When inlining optimization is discussed, the following is almost always mentioned: “inlining can also make code slower, because inlining increases the code size, blowing the instruction cache size and causing cache misses”.
+
I myself have seen this repeated in various forms many times.
+I have also seen a lot of benchmarks where judicious removal of inlining annotations did increase performance.
+However, not once have I seen the performance improvement being traced to ICache specifically.
+To me at least, this explanation doesn’t seem to be grounded — people know that ICache is to blame because other people say this, not because there’s a benchmark everyone points to.
+It doesn’t mean that the ICache explanation is wrong — just that I personally don’t have evidence to believe it is better than any other explanation.
+
Anyway, I’ve decided to look at a specific case where I know #[inline] to cause an observable slow down, and understand why it happens.
+Note that the goal here is not to explain real-world impact of #[inline], the benchmark is artificial.
+The goal is, first and foremost, to learn more about the tools to use for explaining results.
+The secondary goal is to either observe ICache effects in practice, or else to provide an alternative hypothesis for why removing inlining can speed things up.
+
The benchmark is based on my once_cell Rust library.
+The library provides a primitive, similar to double-checked locking.
+There’s a function that looks like this:
+
+
+
I know that performance improves significantly when the initialize function is not inlined.
+It’s somewhat obvious that this is the case (that’s why the benchmark is synthetic — real world examples are about cases where we don’t know if inline is needed).
+But it is unclear why, exactly, inlining initialize leads to slower code.
+
For the experiment, I wrote a simple high-level benchmark calling get_or_try_init in a loop:
+
+
+
I also added compile-time toggle to force or forbid inlining:
Running both versions shows that #[inline(never)] is indeed measurably faster:
+
+
+
+
How do we explain the difference?
+The first step is to remove cargo from the equation and make two binaries for comparison:
+
+
+
On Linux, the best tool to quickly assess the performance of any program is perf stat.
+It runs the program and shows a bunch of CPU-level performance counters, which might explain what’s going on.
+As we suspect that ICache might be to blame, let’s include the counters for caches:
+
+
+
There is some difference in L1-icache-load-misses, but there’s also a surprising difference in instructions.
+What’s more, the L1-icache-load-misses difference is hard to estimate, because it’s unclear what L1-icache-loads are.
+As a sanity check, statistics for dcache are the same, just as we expect.
+
While perf takes the real data from the CPU, an alternative approach is to run the program in a simulated environment.
+That’s what cachegrind tool does.
+Fun fact: the primary author of cachegrind is @nnethercote, whose Rust Performance Book we saw in the last post.
+Let’s see what cachegrind thinks about the benchmark.
+
+
+
Note that, because cachegrind simulates the program, it runs much slower.
+Here, we don’t see a big difference in ICache misses (I1 — first level instruction cache, LLi — last level instruction cache).
+We do see a difference in ICache references.
+Note that the number of times the CPU refers to the ICache should correspond to the number of instructions it executes.
+Cross-checking the number with perf, we see that both perf and cachegrind agree on the number of instructions executed.
+They also agree that inline_always version executes more instructions.
+It's still hard to say what perf's L1-icache-loads means.
+Judging by the name, it should correspond to cachegrind’s I refs, but it doesn’t.
+
Anyway, it seems there's one thing that bears further investigation: why does inlining change the number of instructions executed?
+Inlining doesn’t actually change the code the CPU runs, so the number of instructions should stay the same.
+Let’s look at the asm then!
+The right tool here is cargo-asm.
+
Again, here’s the function we will be looking at:
+
+
+
The call to get_or_init will be inlined, and the nested call to initialize will be inlined depending on the flag.
+
Let’s first look at the inline_never version:
+
+
+
And then at the inline_always version:
+
+
+
I’ve slightly edited the code and also highlighted the hot loop which constitutes the bulk of the benchmark.
+
Looking at the assembly, we can see the following:
+
+
+code is much larger — inlining happened!
+
+
+function prologue is bigger, compiler pushes more callee-saved registers to the stack
+
+
+function epilogue is bigger, compiler needs to restore more registers
+
+
+stack frame is larger
+
+
+compiler hoisted some of the initialize code to before the loop
+
+
+the core loop is very tight in both cases, just a handful of instructions
+
+
+the core loop counts upwards rather than downwards, adding an extra cmp instruction
+
+
+
Note that it’s highly unlikely that ICache affects the running code, as it’s a small bunch of instructions next to each other in memory.
+On the other hand, an extra cmp with a large immediate precisely accounts for the amount of extra instructions we observe (the loop is run 800_000_000 times).
It's hard enough to come up with a benchmark which demonstrates the difference between two alternatives.
+It's even harder to explain the difference — there might be many readily available explanations, but they are not necessarily true.
+Nonetheless, today we have a wealth of helpful tooling.
+Two notable examples are perf and valgrind.
+Tools are not always correct — it’s a good idea to sanity check different tools against each other and against common-sense understanding of the problem.
+
For inlining in particular, we found the following reasons why inlining S into C might cause a slow down:
+
+
+Inlining might cause C to use more registers.
+This means that prologue and epilogue grow additional push/pop instructions, which also use stack memory.
+Without inlining, these instructions are hidden in S and are only paid for when C actually calls into S, as opposed to every time C itself is called.
+
+
+Generalizing from the first point, if S is called in a loop or in an if, the compiler might hoist some instructions of S to before the branch, moving them from the cold path to the hot path.
+
+
+With more local variables and control flow in the stack frame to juggle, the compiler might accidentally pessimize the hot loop.
+
+
+
If you are curious under which conditions ICache does become an issue, there’s this excellent article about one such case.
This is an introductory article about shell injection, a security vulnerability allowing an attacker to execute arbitrary code on the user’s machine.
+This is a well-studied problem, and there are simple and efficient solutions to it.
+It's relatively easy to design a library API in such a way as to shield the application developer from the risk of shell injections.
+
There are two reasons why I am writing this post.
+First, this year I've pointed out this issue in three different libraries.
+It seems that, although the problem is well-studied, it's not well known, so just repeating some things might help.
+Second, I’ve recently reported a related problem about one of the VS Code APIs, and I want to use this piece as an extended GitHub comment :-)
Shell injection can happen when a program needs to execute another program, and one of the arguments is controlled by the user/attacker.
+As a model example, let’s write a quick script to read a list of URLs from stdin, and run curl for each one of those.
+
That’s not realistic, but small and illustrative.
+This is what the script could look like in NodeJS:
+
+
+
I would have written this in Rust, but, alas, it’s not vulnerable to this particular attack :)
+
The interesting line is this one:
+
+
+
Here, we are using the exec API from node to spawn a child curl process, passing a line of input as an argument.
+
Seems to work for simple cases?
+
+
+
But what if we use a slightly more imaginative input?
+
+
+
That feels bad — it seems that the script somehow reads the contents of my /etc/passwd.
+How did this happen? We've only invoked curl!
To understand what has just happened, we need to learn a bit about how spawning a process works in general.
+This section is somewhat UNIX-specific — things are implemented a bit differently on Windows.
+Nonetheless, the big picture conclusions hold there as well.
+
The main API to run a program with command line arguments is the exec family of functions.
+For example, here’s execve:
+
+
+
It takes the name of the program (pathname), a list of command line arguments (argv), and a list of environment variables for the new process (envp), and uses those to run the specified binary.
+How exactly this happens is a fascinating story with many forks in the plot, but it is beyond the scope of the article.
+
What is curious though, is that while the underlying system API wants an array of arguments, the child_process.exec function from node takes only a single string: exec("curl http://example.com").
+
How does the single string turn into an array of arguments? Let's find out!
+To do that, we’ll use the strace tool.
+This tool inspects (traces) all the system calls invoked by the program.
+We’ll ask strace to look for execve in particular, to understand how node’s exec maps to the underlying system’s API.
+We’ll need the --follow argument to trace all processes, and not just the top-level one.
+To reduce the amount of output and only print execve, we’ll use the --trace flag:
+
+
+
The first execve we see here is our original invocation of the node binary itself.
+The last one is what we want to do — spawn curl with a single argument, a URL.
+And the middle one is what node’s exec actually does.
+
Let’s take a closer look:
+
+
+
Here, node invokes the sh binary (system’s shell) with two arguments: -c and the string we originally passed to child_process.exec.
+-c stands for command, and instructs the shell to interpret the value as a shell command, parse it, and then run it.
+
In other words, rather than running the command directly, node asks the shell to do the heavy lifting.
+But the shell is an interpreter of the shell language, and, by carefully crafting the input to exec, we can ask it to run arbitrary code.
+In particular, that’s what we used as a payload in the bad example above:
+
+
+
After the string interpolation, the resulting command was
+
+
+
That is, first run curl, then echo, then read the /etc/passwd.
There’s an equivalent safe API in node: spawn.
+Unlike exec, it takes an array of arguments rather than a single string.
+
+
+
Internally, the API bypasses the shell and uses execve directly.
+Thus, this API is not vulnerable to shell injection — an attacker can make curl run with bad arguments, but can't run anything other than curl.
+
Note that it’s easy to implement exec in terms of spawn:
+
+
+
It’s a common pattern among many languages:
+
+
+there’s an exec-style function that takes a string and spawns /bin/sh -c under the hood,
+
+
+the docs for this function include a giant disclaimer, saying that using it with user input is a bad idea,
+
+
+there’s a safe alternative which takes arguments as an array and spawns the process directly.
+
+
+
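Rust's standard library, for instance, only exposes the argument-array form. std::process::Command takes the program and its arguments separately and never involves the shell (the fetch wrapper below is just for illustration):

```rust
use std::io;
use std::process::{Command, ExitStatus};

// `url` is passed as a single argv element; no shell ever re-parses it,
// so `$(cat /etc/passwd)` is just a useless literal argument to curl.
fn fetch(url: &str) -> io::Result<ExitStatus> {
    Command::new("curl").arg(url).status()
}
```
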
Why provide an exploitable API when a safe version is possible and more direct?
+I don’t know, but my guess is that it’s mostly just history.
+C has system, Perl’s backticks correspond directly to that, Ruby got backticks from Perl, Python just has system, node was probably influenced by all these scripting languages.
+
Note that security isn’t the only issue with /bin/sh -c based API.
+Read this other post to learn about the rest of the problems.
If you are an application developer, be aware that this issue exists.
+Read the language documentation carefully — most likely, there are two flavors of process spawning functions.
+Note how shell injection is similar to SQL injection and XSS.
+
If you develop a library for conveniently working with external processes, use and expose only the shell-less API from the underlying platform.
+
If you build a new platform, don't provide a /bin/sh -c API in the first place.
+Be like deno (and also Go, Rust, Julia), don’t be like node (and also Python, Ruby, Perl, C).
+If you have to maintain such API for legacy reasons, clearly document the issue about shell injection.
+Documenting how to do /bin/sh -c by hand might also be a good idea.
+
If you are designing a programming language, be careful with string interpolation syntax.
+It’s important that string interpolation can be used to spawn a command in a safe way.
+That mostly means that library authors should be able to deconstruct a "cmd -j $arg1 -f $arg2" literal into two (compile-time) arrays: ["cmd -j ", " -f "] and [arg1, arg2].
+If you don’t provide this feature in the language, library authors will split the interpolated string, which would be unsafe (not only for shelling out — for SQLing or HTMLing as well).
+Good examples to learn from are JavaScript’s
+tagged templates
+and Julia’s
+backticks.
I was happily hacking on some Rust library.
+At some point I pressed the “run tests” button in rust-analyzer.
+And, surprised, I accidentally pwned myself!
+
+
+
That was disappointing.
+C’mon, how come there’s a shell injection in the code I help to maintain?
+While this is not a big problem for rust-analyzer (our security model assumes trusted code, as each of rustup, cargo, and rustc can execute arbitrary code by design), it definitely was a big blow to my aesthetic sensibilities!
+
Looking at the git history, it was me who had missed “concatenate arguments into a single string” during review.
+So I was definitely a part of the problem here.
+But the other part is that the API that takes a single string exists at all.
+
Let’s look at the API:
+
+
+
So, this is exactly what I am describing — a process-spawning API that takes a single string.
+I guess, in this case this might even be justified — the API opens a literal shell in the GUI, and the user can interact with it after the command finishes.
+
Anyway, after looking around I quickly found another API, which seemed (ominous music in the background) like what I was looking for:
+
+
+
The API takes an array of strings.
+It also tries to say something about quoting, which is a good sign!
+The wording is perplexing, but it seems to struggle to explain to me that passing ["ls", ">", "out.txt"] won't actually redirect, because > will get quoted.
+This is exactly what I want!
+The absence of any kind of a security note on both APIs is concerning, but oh well.
+
So, I refactored the code to use this second constructor, and, 🥁 🥁 🥁, it still had the exact same behavior!
+Turns out that this API takes an array of arguments, and just concatenates them, unless I explicitly say that each argument needs to be escaped.
+
And this is what I am complaining about — that the API looks like it is safe for an untrusted user input, while it is not.
+This is misuse-resistance resistance.
In this article, I’ll share my experience with organizing large Rust projects.
+This is in no way authoritative — just some tips I’ve discovered through trial and error.
+
Cargo, Rust’s build system, follows convention over configuration principle.
+It provides a set of good defaults for small projects, and it is especially well-tailored for public crates.io libraries.
+The defaults are not perfect, but they are good enough.
+The resulting ecosystem-wide consistency is also welcome.
+
However, Cargo is less opinionated when it comes to large, multi-crate projects, organized as a Cargo workspace.
+Workspaces are flexible — Cargo doesn’t have a preferred layout for them.
+As a result, people try different things, with varying degrees of success.
+
To cut to the chase, I think for projects in between ten thousand and one million lines of code, the flat layout makes the most sense.
+rust-analyzer (200k lines) is a good example here.
+The repository is laid out like this:
+
+
+
In the root of the repo, Cargo.toml defines a virtual manifest:
+
+
+
Everything else (including rust-analyzer “main” crate) is nested one-level deep under crates/.
+The name of each directory is equal to the name of the crate:
+
+
+
At the time of writing, there are 32 different subfolders in crates/.
It’s interesting that this advice goes against the natural tendency to just organize everything hierarchically:
+
+
+
There are several reasons why trees are inferior in this case.
+
First, the Cargo-level namespace of crates is flat.
+It’s not possible to write hir::def in Cargo.toml, so crates typically have prefixes in their names.
+Tree layout creates an alternative hierarchy, which adds a possibility for inconsistencies.
+
Second, a comparatively large flat list is easier to understand at a glance than even a small tree.
+ls ./crates gives an immediate bird's-eye view of the project, and this view is small enough:
+
+
+
Doing the same for a tree-based layout is harder.
+Looking at a single level doesn't tell you which folders contain nested crates.
+Looking at all levels lists too many folders.
+Looking only at folders that contain a Cargo.toml gives the right result, but is not as trivial as just ls.
+
It is true that nested structure scales better than a flat one.
+But the constant matters — until you hit a million lines of code, the number of crates in the project will probably fit on one screen.
+
Finally, the last problem with hierarchical layout is that there are no perfect hierarchies.
+With a flat structure, adding or splitting the crates is trivial.
+With a tree, you need to figure out where to put the new crate, and, if there isn’t a perfect match for it already, you’ll have to either:
+
+
+add a stupid mostly empty folder near the top
+
+
+add a catch-all utils folder
+
+
+place the code in a known suboptimal directory.
+
+
+
This is a significant issue for long-lived multi-person projects — tree structure tends to deteriorate over time, while flat structure doesn’t need maintenance.
Make the root of the workspace a virtual manifest.
+It might be tempting to put the main crate into the root, but that pollutes the root with src/, requires passing --workspace to every Cargo command, and adds an exception to an otherwise consistent structure.
+
Don’t succumb to the temptation to strip common prefix from folder names.
+If each crate is named exactly as the folder it lives in, navigation and renames become easier.
+The Cargo.tomls of reverse dependencies mention both the folder and the crate name; it's useful when they are exactly the same.
+
For large projects a lot of repository bloat often comes from ad-hoc automation — Makefiles and various prepare.sh scripts here and there.
+To avoid both the bloat and proliferation of ad-hoc workflows, write all automation in Rust in a dedicated crate.
+One pattern useful for this is cargo xtask.
+
Use version = "0.0.0" for internal crates you don’t intend to publish.
+If you do want to publish a subset of crates with proper semver API, be very deliberate about them.
+It probably makes sense to extract all such crates into a separate top-level folder, libs/.
+It makes it easier to check that things in libs/ don’t use things from crates/.
+
Some crates consist of only a single file.
+For those, it is tempting to flatten out the src directory and keep lib.rs and Cargo.toml in the same directory.
+I suggest not doing that — even if a crate is a single file now, it might get expanded later.
It’s common knowledge that Rust code is slow to compile.
+But I have a strong gut feeling that most Rust code out there compiles much slower than it could.
This doesn’t make sense to me.
+rust-analyzer CI takes 8 minutes on GitHub actions.
+It is a fairly large and complex project with 200k lines of own code and 1 million lines of dependencies on top.
+
It is true that Rust is slow to compile in a rather fundamental way.
+It picked “slow compiler” in the generic dilemma, and its overall philosophy prioritizes runtime over compile time (an excellent series of posts about that:
+1,
+2,
+3,
+4).
+But rustc is not a slow compiler — it implements the most advanced incremental compilation in industrial compilers, it takes advantage of compilation model based on proper modules (crates), and it has been meticulously optimized.
+Fast to compile Rust projects are a reality, even if they are not common.
+Admittedly, some care and domain knowledge is required to do that.
+
So let's take a closer look at what it took for us to keep the compilation time within reasonable bounds for rust-analyzer!
One thing I want to make clear is that optimizing a project's build time is in some sense busy-work.
+Reducing compilation time provides very small direct benefits to the users, and is pure accidental complexity.
+
That being said, compilation time is a multiplier for basically everything.
+Whether you want to ship more features, to make code faster, to adapt to a change of requirements, or to attract new contributors, build time is a factor in that.
+
It also is a non-linear factor.
+Just waiting for the compiler is the smaller problem.
+The big one is losing the state of flow, or (worse) the mental context switch to do something else while the code is compiling.
+One minute of work for the compiler wastes more than one minute of work for the human.
+
It’s hard for me to quantify the impact, but my intuitive understanding is that, as soon as the project grows beyond several thousands lines written by a single person, build times become pretty darn important!
+
The most devilish property of build times is that they creep up on you.
+While the project is small, build times are going to be acceptable.
+As projects grow incrementally, build times start to slowly increase as well.
+And if you let them grow, it might be rather hard to get them back in check later!
+
If the project is already too slow to compile, then:
+
+
+Improving build times will be time consuming, because each iteration of "try a change, trigger the build, measure improvement" will take a long time (yes, build times are a multiplier for everything, including build times themselves!)
+
+
+There won't be easy wins: in contrast to runtime performance, the Pareto principle doesn't apply!
+If you write a thousand lines of code, maybe one hundred of them will be performance-sensitive, but each line will add to compile times!
+
+
+Small wins will seem too small until they add up: shaving off five seconds is a much bigger deal for a five minute build than for an hour-long build.
+
+
+Dually, small regressions will go unnoticed.
+
+
+
There's also a cultural aspect to it: if you join a project and its CI takes one hour, then an hour-long CI is normal, right?
+
Luckily, there’s one simple trick to solve the problem of build times …
You need to care about build times, keep an eye on them, and fix them before they become a problem.
+Build times are a fairly easy optimization problem: it’s trivial to get direct feedback (just time the build), there are a bunch of tools for profiling, and you don’t even need to come up with a representative benchmark.
+The task is to optimize a particular project’s build time, not performance of the compiler in general.
+That’s a nice property of most instances of accidental complexity — they tend to be well defined engineering problems with well understood solutions.
+
The only hard bit about compilation time is that you don’t know that it is a problem until it actually is one!
+So, the most valuable thing you can get from this post is this:
+if you are working on a Rust project, take some time to optimize its build today, and try to repeat the exercise once in a while.
+
Now, with the software engineering bits cleared, let’s finally get to some actionable programming advice!
I like to use CI time as one of the main metrics to keep an eye on.
+
Part of that is that CI time is important in itself.
+While you are not bound by CI when developing features, CI time directly affects how annoying it is to context switch when finishing one piece of work and starting the next one.
+Juggling five outstanding PRs waiting for CI to complete is not productive.
+Longer CI also creates a pressure to not split the work into independent chunks.
+If correcting a typo requires keeping a PR tab open for half an hour, it's better to just make a drive-by fix in the next feature branch, right?
+
But a bigger part is that CI gives you a standardized benchmark.
+Locally, you compile incrementally, and the time of build varies greatly with the kinds of changes you are doing.
+Often, you compile just a subset of the project.
+Due to this inherent variability, local builds give poor continuous feedback about build times.
+Standardized CI though runs for every change and gives you a time series where numbers are directly comparable.
+
To increase this standardization pressure of CI, I recommend following the not rocket science rule and setting up a merge robot which guarantees that every state of the main branch passes CI.
+bors is a particular implementation I use, but there are others.
+
While it’s by far not the biggest reason to use something like bors, it gives two benefits for healthy compile times:
+
+
+It ensures that every change goes via CI, and creates pressure to keep CI healthy overall
+
+
+The time between leaving r+ comment on the PR and receiving the “PR merged” notification gives you an always on feedback loop.
+You don’t need to specifically time the build, every PR is a build benchmark.
+
If you think about it, it’s pretty obvious how a good caching strategy for CI should work.
+It makes sense to cache stuff that changes rarely, but it’s useless to cache frequently changing things.
+That is, cache all the dependencies, but don’t cache project’s own crates.
+
Unfortunately, almost nobody does this.
+A typical example would just cache the whole of ./target directory.
+That’s wrong — the ./target is huge, and most of it is useless on CI.
+
It's not super trivial to fix though — sadly, Cargo doesn't make it too easy to figure out which parts of ./target are durable dependencies, and which parts are volatile local crates.
+So, you’ll need to write some code to clean the ./target before storing the cache.
+For GitHub actions in particular you can also use Swatinem/rust-cache.
Caching is usually the low-hanging watermelon, but there are several more things to tweak.
+
Split CI into separate cargo test --no-run and cargo test.
+It is vital to know which part of your CI is the build, and which are the tests.
+
Disable incremental compilation.
+CI builds often are closer to from-scratch builds, as changes are typically much bigger than in a local edit-compile cycle.
+For from-scratch builds, incremental adds an extra dependency-tracking overhead.
+It also significantly increases the amount of IO and the size of ./target, which make caching less effective.
+
Disable debuginfo — it makes ./target much bigger, which again harms caching.
+Depending on your preferred workflow, you might consider disabling debuginfo unconditionally, this brings some benefits for local builds as well.
+
While we are at it, add -D warnings to the RUSTFLAGS environment variable to deny warnings for all crates at the same time.
+It’s a bad idea to #![deny(warnings)] in code: you need to repeat it for every crate, it needlessly makes local development harder, and it might break your users when they upgrade their compiler.
+It might also make sense to bump cargo network retry limits.
Another obvious advice is to use fewer, smaller dependencies.
+
This is nuanced: libraries do solve actual problems, and it would be stupid to roll your own solution to something already solved by crates.io.
+And it’s not like it’s guaranteed that your solution will be smaller.
+
But it’s important to realise what problems your application is and is not solving.
+If you are building a CLI utility for thousands of people to use, you absolutely need clap with all of its features.
+If you are writing a quick script to run during CI, which only the team will be using, it’s probably fine to start with simplistic command line parsing, but faster builds.
+
One tremendously useful exercise here is to read Cargo.lock (not Cargo.toml) and for each dependency think about the actual problem this dependency solves for the person in front of your application.
+Very frequently, you'll find dependencies that just don't make sense at all in your context.
+
As an illustrative example, rust-analyzer depends on regex.
+This doesn’t make sense — we have exact parsers and lexers for Rust and Markdown, we don’t need to interpret regular expressions at runtime.
+regex is also one of the heavier dependencies — it’s a full implementation of a small language!
+The reason this dependency is there is that the logging library we use allows you to say something like:
+
+
+
where parsing of the filtering expression is done by regular expressions.
+
This is undoubtedly a very useful feature to have for some applications, but in the context of rust-analyzer we don’t need it.
+Simple env_logger-style filtering would be enough.
+
Once you identify a similar redundant dependency, it’s usually enough to tweak features field somewhere, or to send a PR upstream to make non-essential bits configurable.
+
Sometimes it is a bigger yak to shave :)
+For example, rust-analyzer optionally uses the jemalloc crate, and its build script pulls in fs_extra and (of all the things!) paste.
+The ideal solution here would be of course to have a production grade, stable, pure rust memory allocator.
Now that we’ve dealt with things which are just sensible to do, it’s time to start measuring before cutting.
+A tool to use here is the -Z timings flag for Cargo (documentation).
+Sadly, I lack the eloquence to adequately express the level of quality and polish of this feature, so let me just say ❤️ and continue with my dry prose.
+
cargo build -Z timings records profiling data during the build, and then renders it as a very legible and information-dense HTML file.
+This is a nightly feature, so you’ll need the +nightly toggle.
+This isn’t a problem in practice, as you only need to run this manually once in a while.
+
Here’s an example from rust-analyzer:
+
+
+
+
+
Not only can you see how long each crate took to compile, but you'll also see how individual compilations were scheduled, when each crate started to compile, and its critical dependency.
This last point is important — crates form a directed acyclic graph of dependencies and, on a multicore CPU, the shape of this graph affects the compilation time a lot.
+
This is slow to compile, as all the crates need to be compiled sequentially:
+
+
+
This version is much faster, as it enables significantly more parallelism:
+
+
+
There's also a connection between parallelism and incrementality.
+In the wide graph, changing B doesn’t entail recompiling C and D.
+
The first advice you get when complaining about compile times in Rust is: “split the code into crates”.
+It is not that easy — if you ended up with a graph like the first one, you are not winning much.
+It is important to architect the applications to look like the second picture — a common vocabulary crate, a number of independent features, and a leaf crate to tie everything together.
+The most important property of a crate is which crates it doesn’t (transitively) depend on.
+
Another important consideration is the number of final artifacts (most typically binaries).
+Rust is statically linked, so, if two different binaries use the same library, each binary contains a separately linked copy of the library.
+If you have n binaries and m libraries, and each binary uses each library, then the amount of work to do during the linking is m * n.
+For this reason, it’s better to minimize the number of artifacts.
+One common technique here is BusyBox-style Swiss Army knife executables.
+The idea is that you can hardlink the same executable as several files with different names.
+The program then can look at the zeroth command line argument to learn the name it was invoked with, and use it effectively as a name of a subcommand.
+One cargo-specific gotcha here is that, by default, each file in ./examples or ./tests folder creates a new executable.
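
A rough sketch of the argv[0] dispatch (the applet names are made up):

```rust
use std::env;
use std::path::Path;

fn main() {
    // The name we were invoked with is the zeroth argument; strip the
    // directory part and dispatch on it like on a subcommand.
    let argv0 = env::args().next().unwrap_or_default();
    let applet = Path::new(&argv0)
        .file_stem()
        .and_then(|it| it.to_str())
        .unwrap_or("")
        .to_string();
    match applet.as_str() {
        "frobnicate" => frobnicate(),
        "defrobnicate" => defrobnicate(),
        _ => eprintln!("unknown applet: {applet}"),
    }
}

fn frobnicate() { /* ... */ }
fn defrobnicate() { /* ... */ }
```
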
But Cargo is even smarter than that!
+It does pipelined compilation — splitting the compilation of a crate into metadata and codegen phases, and starting compilation of dependent crates as soon as the metadata phase is over.
+
This has interesting interactions with procedural macros (and build scripts).
+rustc needs to run procedural macros to compute a crate's metadata.
+That means that procedural macros can’t be pipelined, and crates using procedural macros are blocked until the proc macro is fully compiled to the binary code.
+
Separately from that, procedural macros need to parse Rust code, and that is a relatively complex task.
+The de-facto crate for this, syn, takes quite some time to compile (not because it is bloated — just because parsing Rust is hard).
+
This generally means that projects tend to have a syn/serde-shaped hole in the CPU utilization profile during compilation.
+It’s relatively important to use procedural macros only where they pull their weight, and try to push crates before syn in the cargo -Z timings graph.
+
The latter can be tricky, as proc macro dependencies can sneak up on you.
+The problem here is that they are often hidden behind feature flags, and those feature flags might be enabled by downstream crates.
+Consider this example:
+
You have a convenient utility type — for example, an SSO string, in a small_string crate.
+To implement serialization, you don’t actually need derive (just delegating to String works), so you add an (optional) dependency on serde:
+
+
+
SSO string is a rather useful abstraction, so it gets used throughout the codebase.
+Then, in some leaf crate which, e.g., needs to expose a JSON API, you add a dependency on small_string with the serde feature, as well as on serde with derive itself:
+
+
+
The problem here is that json-api enables the derive feature of serde, and that means that small_string and all of its reverse dependencies now need to wait for syn to compile!
+Similarly, if a crate depends on a subset of syn’s features, but something else in the crate graph enables all features, the original crate gets them as a bonus as well!
+
It’s not necessarily the end of the world, but it shows that dependency graph can get tricky with the presence of features.
+Luckily, cargo -Z timings makes it easy to notice that something strange is happening, even if it might not be always obvious what exactly went wrong.
+
There's also a much more direct way for procedural macros to slow down compilation — if the macro generates a lot of code, the result will take some time to compile.
+That is, some macros allow you to write just a bit of source code, which feels innocuous enough, but expands to a substantial amount of logic.
+The prime example is serialization — I've noticed that converting values to/from JSON accounts for a surprisingly big amount of compile time.
+Thinking in terms of overall crate graph helps here — you want to keep serialization at the boundary of the system, in the leaf crates.
+If you put serialization near the foundation, then all intermediate crates would have to pay its build-time costs.
+
All that being said, an interesting side-note here is that procedural macros are not inherently slow to compile.
+Rather, it’s the fact that most proc macros need to parse Rust or to generate a lot of code that makes them slow.
+Sometimes, a macro can accept a simplified syntax which can be parsed without syn, and emit a tiny bit of Rust code based on that.
+Producing valid Rust is not nearly as complicated as parsing it!
Now that we’ve covered macro issues at the level of crates, it’s time to look closer, at the code-level concerns.
+The main thing to look at here is generics.
+It’s vital to understand how they are compiled, which, in case of Rust, is achieved by monomorphization.
+Consider a run of the mill generic function:
+
+
+
When Rust compiles this function, it doesn’t actually emit machine code.
+Instead, it stores an abstract representation of function body in the library.
+The actual compilation happens when you instantiate the function with a particular type parameter.
+The C++ terminology gives the right intuition here — frobnicate is a "template", it produces an actual function when a concrete type is substituted for the parameter T.
+
In other words, in the following case
+
+
+
on the level of machine code there will be two separate copies of frobnicate, which would differ in the details of how they deal with the parameter, but would be otherwise identical.
+
Sounds pretty bad, right?
+It seems like you can write a gigantic generic function, and then write just a small bit of code to instantiate it with a bunch of types, to create a lot of load for the compiler.
+
Well, I have bad news for you — the reality is much, much worse.
+You don’t even need different types to create duplication.
+Let’s say we have four crates which form a diamond
+
+
+
The frobnicate is defined in A, and is used by B and C
+
+
+
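A sketch of that setup, one file per crate (the crate and function names follow the prose; the bodies are improvised):

```rust
// a/src/lib.rs
pub fn frobnicate<T: std::fmt::Debug>(x: T) {
    println!("{:?}", x);
}

// b/src/lib.rs
pub fn do_b() {
    a::frobnicate(String::from("b")) // B compiles its own frobnicate::<String>
}

// c/src/lib.rs
pub fn do_c() {
    a::frobnicate(String::from("c")) // C compiles frobnicate::<String> again
}

// d/src/main.rs
fn main() {
    b::do_b();
    c::do_c();
}
```
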
In this case, we only ever instantiate frobnicate with String, but it will get compiled twice, because monomorphization happens per crate.
+B and C are compiled separately, and each includes machine code for do_* functions, so they need frobnicate<String>.
+If optimizations are disabled, rustc can share template instantiations with dependencies, but that doesn’t work for sibling dependencies.
+With optimizations, rustc doesn’t share monomorphizations even with direct dependencies.
+
In other words, generics in Rust can lead to accidentally-quadratic compilation times across many crates!
+
If you are wondering whether it gets worse than that, the answer is yes.
+I think the actual unit of monomorphization is codegen unit, so duplicates are possible even within one crate.
Besides just duplication, generics add one more problem — they shift the blame for compile times to consumers.
+Most of the compile time cost of generic functions is borne by the crates that use the functionality, while the defining crate just typechecks the code without doing any code generation.
+Coupled with the fact that at times it is not at all obvious what gets instantiated where and why (example), this makes it hard to directly see the footprint of generic APIs.
+
Luckily, you don't need to figure this out by hand — there's a tool for that!
+cargo llvm-lines tells you which monomorphizations are happening in a specific crate.
It shows, for each generic function, how many copies of it were generated, and what their total size is.
+The size is measured very coarsely, in the number of LLVM IR lines it takes to encode the function.
+A useful fact: LLVM doesn't have generic functions, it's the job of rustc to turn a function template and a set of instantiations into a set of actual functions.
Now that we understand the pitfalls of monomorphization, a rule of thumb becomes obvious: do not put generic code at the boundaries between the crates.
+When designing a large system, architect it as a set of components where each component does something concrete and has a non-generic interface.
+
If you do need generic interface for better type-safety and ergonomics, make sure that the interface layer is thin, and that it immediately delegates to a non-generic implementation.
+The classical examples to internalize here are various functions from the std::fs module which operate on paths:
+
+
+
The outer function is parameterized — it is ergonomic to use, but is compiled afresh for every downstream crate.
+That’s not a problem though, because it is very small, and immediately delegates to a non-generic function that gets compiled in the std.
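
The shape of that trick, sketched outside of std (read_file is a made-up stand-in, not the actual standard library source):

```rust
use std::io;
use std::path::Path;

// Thin generic shim: monomorphized anew in every downstream crate, but it
// only converts the argument and delegates.
pub fn read_file<P: AsRef<Path>>(path: P) -> io::Result<Vec<u8>> {
    // Non-generic worker: compiled once, in the defining crate
    // (here it simply forwards to std).
    fn inner(path: &Path) -> io::Result<Vec<u8>> {
        std::fs::read(path)
    }
    inner(path.as_ref())
}
```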
+
If you are writing a function which takes a path as an argument, either use &Path, or use impl AsRef<Path> and delegate to a non-generic implementation.
+If you care about API ergonomics enough to use impl Trait, you should use the inner-function trick — compile times are as big a part of ergonomics as the syntax used to call the function.
+
A second common case here are closures: by default, prefer &dyn Fn() over impl Fn().
+Similarly to paths, a nice impl-based API might be a thin wrapper around a dyn-based implementation which does the bulk of the work.
+
Another idea along these lines is “generic, inline hotpath; concrete, outline coldpath”.
+In the once_cell crate, there’s this curious pattern (simplified, here’s the actual source):
+
+
+
Here, the initialize function is generic twice: first, the OnceCell is parametrized with the type of value being stored, and then initialize takes a generic closure parameter.
+The job of initialize is to make sure (even if it is called concurrently from many threads) that at most one f is run.
+This mutual exclusion task doesn’t actually depend on specific T and F and is implemented as non-generic synchronize_access, to improve compile time.
+One wrinkle here is that, ideally, we’d want an init: dyn FnOnce() argument, but that’s not expressible in today’s Rust.
+The let mut f = Some(f) / let f = f.take().unwrap() is a standard work-around for this case.
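
A rough sketch of that shape, with the actual synchronization elided (this is not the real once_cell source):

```rust
use std::marker::PhantomData;

struct OnceCell<T> {
    // storage and state are elided in this sketch
    _marker: PhantomData<T>,
}

impl<T> OnceCell<T> {
    // Generic over both the stored type and the closure: this thin wrapper
    // is re-instantiated for every use site...
    fn initialize<F>(&self, f: F)
    where
        F: FnOnce() -> T,
    {
        let mut f = Some(f); // the `dyn FnOnce` work-around
        synchronize_access(&mut || {
            let f = f.take().unwrap();
            let _value = f();
            // ... store `_value` into the cell ...
        });
    }
}

// ...while the tricky mutual-exclusion logic lives here, compiled only once.
fn synchronize_access(init: &mut dyn FnMut()) {
    // ... make sure `init` runs at most once, even under contention ...
    init();
}
```
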
Build times are a big factor in the overall productivity of the humans working on the project.
+Optimizing this is a straightforward engineering task — the tools are there.
+What might be hard is not letting them slowly regress.
+I hope this post provides enough motivation and inspiration for that!
+As a rough baseline, a 200k-line Rust project somewhat optimized for reasonable build times should take about 10 minutes of CI on GitHub actions.
In this post, we’ll look at one technique from property-based testing repertoire: full coverage / exhaustive testing.
+Specifically, we will learn how to conveniently enumerate any kind of combinatorial object without using recursion.
+
To start, let’s assume we have some algorithmic problem to solve.
+For example, we want to sort an array of numbers:
+
+
+
To test that the sort function works, we can write a bunch of example-based test cases.
+This approach has two flaws:
+
+
+Generating examples by hand is time consuming.
+
+
+It might be hard to come up with interesting examples — any edge cases we've thought about are probably already handled in the code.
+We want to find cases which we didn’t think of before.
+
+
+
A better approach is randomized testing: just generate a random array and check that it is sorted:
+
+
+
Here, we generated one hundred thousand completely random test cases!
+
Sadly, the result might actually be worse than a small set of hand-picked examples.
+The problem here is that, if you pick an array completely at random (sample uniformly), it will be a rather ordinary array.
+In particular, given that the elements are arbitrary u32 numbers, it’s highly unlikely that we generate an array with at least some equal elements.
+And when I write quick sort, I always have that nasty bug that it just loops infinitely when all elements are equal.
+
There are several fixes for the problem.
+The simplest one is to just make the sampling space smaller:
+
+
+
If we generate not an arbitrary u32, but a number between 0 and 10, we’ll get some short arrays where all elements are equal.
+Another trick is to use a property-based testing library, which comes with some strategies for generating interesting sequences predefined.
+Yet another approach is to combine property-based testing and coverage guided fuzzing.
+When checking a particular example, we will collect coverage information for this specific input.
+Given a set of inputs with coverage info, we can apply targeted genetic algorithms to try to cover more of the code.
+A particularly fruitful insight here is that we don’t have to invent a novel structure-aware fuzzer for this.
+We can take an existing fuzzer which emits a sequence of bytes, and use those bytes as a sequence of random numbers to generate structured input.
+Essentially, we say that the fuzzer is a random number generator.
+That way, when the fuzzer flips bits in the raw bytes array, it applies local semantically valid transformations to the random data structure.
+
But this post isn’t about those techniques :)
+Instead, it is about the idea of full coverage.
+Most of the bugs involve small, tricky examples.
+If a sorting routine breaks on some array with ten thousand elements, it's highly likely that there's a much smaller array (a handful of elements) which exposes the same bug.
+So what we can do is to just generate every array of length at most n with numbers up to m and exhaustively check them all:
+
+
+
The problem here is that implementing every_array is tricky.
+It is one of those puzzlers you know how to solve, but which are excruciatingly annoying to implement for the umpteenth time:
+
+
+
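For reference, a recursive enumerator of this kind might look as follows (a sketch; the signature is improvised from the prose):

```rust
// Call `f` for every array of length at most `n` whose elements are all
// below `m`.
fn every_array(n: usize, m: u32, f: &mut dyn FnMut(&[u32])) {
    fn go(n: usize, m: u32, buf: &mut Vec<u32>, f: &mut dyn FnMut(&[u32])) {
        f(buf.as_slice());
        if buf.len() == n {
            return;
        }
        for x in 0..m {
            buf.push(x);
            go(n, m, buf, f);
            buf.pop();
        }
    }
    go(n, m, &mut Vec::new(), f)
}

#[test]
fn exhaustive_sort_check() {
    every_array(5, 3, &mut |xs| {
        let mut ys = xs.to_vec();
        ys.sort(); // stand-in for the sort under test
        assert!(ys.windows(2).all(|w| w[0] <= w[1]));
    });
}
```
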
What’s more, for algorithms you often need to generate permutations, combinations and subsets, and they all have similar simple but tricky recursive solutions.
+
Yesterday I needed to generate a sequence of up to n segments with integer coordinates up to m, which finally pushed me to realize that there’s a relatively simple way to exhaustively enumerate arbitrary combinatorial objects.
+I don’t recall seeing it anywhere else, which is surprising, as the technique seems rather elegant.
+
+
Let’s look again at how we generate a random array:
+
+
+
This is definitely much more straightforward than the every_array function above, although it does sort of the same thing.
+The trick is to take this “generate a random thing” code and just make it generate every thing instead.
+In the above code, we base decisions on random numbers.
+Specifically, an input sequence of random numbers generates one element in the search space.
+If we enumerate all sequences of random numbers, we then explore the whole space.
+
Essentially, we’ll rig the rng to not be random, but instead to emit all finite sequences of numbers.
+By writing a single generator of such sequences, we gain an ability to enumerate arbitrary objects.
+As we are interested in generating all “small” objects, we always pass an upper bound when asking for a “random” number.
+We can use the bounds to enumerate only the sequences which fit under them.
+
So, the end result will look like this:
+
+
+
The implementation of Gen is relatively straightforward.
+On each iteration, we will remember the sequence of numbers we generated together with bounds the user requested, something like this:
+
+
+
To advance to the next iteration, we will find the smallest sequence of values which is larger than the current one, but still satisfies all the bounds.
+“Smallest” means that we’ll try to increment the rightmost number.
+In the above example, the last two fours already match the bound, so we can’t increment them.
+However, we can increment the 1 to get 3 2 4 4.
+This isn’t the smallest sequence though, 3 2 0 0 would be smaller.
+So, after incrementing the rightmost number we can increment, we zero the rest.
+
Here’s the full implementation:
+
+
+
Some notes:
+
+
+We need start field to track the first iteration, and to make while !g.done() syntax work.
+It’s a bit more natural to remove start and use a do { } while !g.done() loop, but it’s not available in Rust.
+
+
+v stores (value, bound) pairs.
+
+
+p tracks the current position in the middle of the iteration.
+
+
+v is conceptually an infinite vector with finite number of non-zero elements.
+So, when p gets past the end of v, we just materialize the implicit zero by pushing it onto v.
+
+
+As we store zeros implicitly anyway, we can just truncate the vector in done instead of zeroing-out the elements after the incremented one.
+
+
+Somewhat unusually, the bounds are treated inclusively.
+This removes the panic when the bound is zero, and allows generating the full set of numbers via gen(u32::MAX).
+
+
+
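Piecing the notes above together, the whole thing can be sketched like this (close in spirit to, though not necessarily identical with, the real implementation):

```rust
struct Gen {
    started: bool,
    /// (value, inclusive bound) pairs of the current sequence; conceptually
    /// an infinite vector with finitely many non-zero elements.
    v: Vec<(u32, u32)>,
    /// Position within `v` during the current iteration.
    p: usize,
}

impl Gen {
    fn new() -> Gen {
        Gen { started: false, v: Vec::new(), p: 0 }
    }

    /// Advance to the next sequence; returns `true` once the space is exhausted.
    fn done(&mut self) -> bool {
        if !self.started {
            self.started = true;
            return false;
        }
        self.p = 0;
        // Increment the rightmost value that is still below its bound,
        // truncating (implicitly zeroing) everything after it.
        while let Some((value, bound)) = self.v.pop() {
            if value < bound {
                self.v.push((value + 1, bound));
                return false;
            }
        }
        true
    }

    /// A "random" number in `0..=bound`, replayed from the current sequence.
    fn gen(&mut self, bound: u32) -> u32 {
        if self.p == self.v.len() {
            self.v.push((0, bound)); // materialize an implicit zero
        }
        let res = self.v[self.p].0;
        self.p += 1;
        res
    }
}

#[test]
fn all_small_arrays() {
    // Enumerate all arrays of length at most 2 with elements in 0..=1.
    let mut g = Gen::new();
    while !g.done() {
        let n = g.gen(2);
        let xs: Vec<u32> = (0..n).map(|_| g.gen(1)).collect();
        assert!(xs.len() <= 2); // stand-in for the property under test
    }
}
```
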
Let's see how our gen fares for generating random arrays of length at most n.
+We’ll count how many distinct cases were covered:
+
+
+
This test passes.
+That is, the gen approach for this case is both exhaustive (it generates all arrays) and efficient (each array is generated once).
+
As promised in the post’s title, let’s now generate all the things.
+
First case: there should be only one nothing (that’s the reason why we need start):
+
+
+
Second case: we expect to see n numbers and n^2 ordered pairs of numbers.
+
+
+
Third case: we expect to see n * (n - 1) / 2 unordered pairs of numbers.
+This one is interesting — here, our second decision is based on the first one, but we still enumerate all the cases efficiently (without duplicates).
+(Aside: did you ever realise that the number of ways to pick two objects out of n is equal to the sum of first n natural numbers?)
+
+
+
We’ve already generated all arrays, so let’s try to create all permutations.
+Still efficient:
+
+
+
Subsets:
+
+
+
Combinations:
+
+
+
Now, this one actually fails — while this code generates all combinations, some combinations are generated more than once.
+Specifically, what we are generating here are k-permutations (combinations with significant order of elements).
+While this is not efficient, it is OK for the purposes of exhaustive testing (as we still generate every combination).
+Nonetheless, there’s an efficient version as well:
+
+
+
I think this covers all standard combinatorial structures.
+What’s interesting is that this approach works for non-standard structures as well.
+For example, for https://cses.fi/problemset/task/2168, the problem which started all this, I need to generate sequences of segments:
+
+
+
Due to the .contains check there are some duplicates, but that’s not a problem as long as all sequences of segments are generated.
+Additionally, examples are strictly ordered by their complexity — earlier examples have fewer segments with smaller coordinates.
+That means that the first example which fails a property test is actually guaranteed to be the smallest counterexample! Nifty!
+
That’s all!
+Next time when you need to test something, consider if you can just exhaustively enumerate all “sufficiently small” inputs.
+If that’s feasible, you can either write the classical recursive enumerator, or use this imperative Gen thing.
+
Update(2021-11-28):
+
There are now Rust (crates.io link) and C++ (GitHub link) implementations.
+“Capturing the Future by Replaying the Past” is a related paper which includes the above technique as a special case of “simulate any monad by simulating delimited continuations via exceptions and replay” trick.
Unedited summary of what I think a better module system for a Rust-like
+language would look like.
+
Today’s Rust module system is its most exciting feature, after the borrow checker.
+Explicit separation between crates (which form a DAG) and modules (which might
+be mutually dependent) and the absence of a single global namespace (crates
+don’t have innate names; instead, the name is written on a dependency edge
+between two crates, and the same crate might be known under different names in
+two of its dependents) makes decentralized ecosystems of libraries a-la
+crates.io robust. Specifically, Rust allows linking-in several versions of the
+same crate without the fear of naming conflicts.
+
However, I feel the specific surface syntax we use to express the model is
+suboptimal. The module system is pretty confusing (in the pre-2018 surveys, it was
+by far the most confusing aspect of the language after lifetimes; the post-2018
+system is better, but there are still regular questions about the module system).
+What can we do better?
+
First, be more precise about visibilities. The single most important
+question about an item is “can it be visible outside of the CU (compilation unit)?”. Depending on the
+answer to that, you have either closed world (all usages are known) or open
+world (usages are not-knowable) assumption. This should be reflected in the
+modules system. pub is for “visible inside the whole CU, but not further”.
+export or (my favorite) pub* is for “visible to the outer world”. You sorta
+can have these in today’s Rust with pub(crate), -Dunreachable_pub and some
+tolerance for compiler false-positives.
+
I am not sure if the rest of Rust’s visibility system pulls its weight. It is OK,
+but pub(in some::path) is pretty complex and doesn’t really help —
+making visibilities more precise within a single CU doesn’t meaningfully make
+the code better, as you can control and rewrite all the code anyway. A CU doesn’t
+have internal boundaries which can be reflected in visibilities. If we go this
+way, we get a nice, simple system: fn foo() is visible in the current module
+only (not its children), pub fn foo() is visible anywhere inside the current
+crate, and pub* fn foo() is visible to other crates using ours. But then,
+again, the current tree-based visibility is OK; we can leave it in as long as
+pub/pub* is made more explicit and -Dunreachable_pub is an error by default.
+
In a similar way, the fact that use is an item (i.e., a::b can use items
+imported in a) is an unnecessary cuteness. Imports should only introduce the
+name into module’s namespace, and should be separate from intentional
+re-exports. It might make sense to ban glob re-export — this’ll give you a
+nice property that all the names existing in the module are spelled out
+explicitly, which is useful for tooling. Though, as Rust has namespaces, looking
+at pub use submod::thing doesn’t tell you whether the thing is a type or a
+value, so this might not be a meaningful property after all.
+
The second thing to change would be module tree/directory structure mapping.
+The current system creates quite some visible problems:
+
+
+
library/binary confusion. It’s common for new users to have mod foo; in both
+src/main.rs and src/lib.rs.
+
+
+
mod {} file confusion — it’s common (even for some production code I’ve
+seen) to have mod foo { stuff } inside foo.rs.
+
+
+
duplicate inclusion — again, it’s common to start every file in tests/ with
+mod common;. Rust book even recommends some awful work-around to put common
+into common/mod.rs, just so it itself isn’t treated as a test.
+
+
+
inconsistency — large projects which don’t have super-strict code style
+process end up using both the older foo/mod.rs and the newer foo.rs, foo/*
+conventions.
+
+
+
forgotten files — it is again pretty common to have some file somewhere in
+src/ which isn’t actually linked into the module tree at all by mistake.
+
+
+
A bunch of less-objective issues:
+
+
+mod.rs-less system is self-inconsistent. lib.rs and main.rs still
+behave like mod.rs, in the sense that nested modules are their direct
+siblings, and not in the lib directory.
+
+
+naming for crates roots (lib.rs and main.rs) is ad-hoc
+
+
+current system doesn’t work well for tools, which have to iteratively
+discover the module tree. You can’t process all of the crate’s files in
+parallel, because you don’t know what those files are until you process them.
+
+
+
I think a better system would say that a compilation unit is equivalent to a
+directory with Rust source files, and that (relative) file paths correspond to
+module paths. There’s neither mod foo; nor mod foo {} (yes, sometimes those
+are genuinely useful. No, the fact that something can be useful doesn’t mean
+it should be part of the language — it’s very hard to come up with a language
+feature which would be completely useless (though mod foo {} I think can be
+added back relatively painlessly)). We use mod.rs, but we name it
+_$name_of_the_module$.rs instead, to solve two issues: sort it first
+alphabetically, and generate a unique fuzzy-findable name. So, something like
+this:
+
+
+
The library there would give the following module tree:
+
+
+
To do conditional compilation, you’d do:
+
+
+
where _mutex.rs is
+
+
+
and linux_mutex.rs starts with #![cfg(linux)]. But of course we shouldn’t
+implement conditional compilation by barbarically cutting the AST, and instead
+should push conditional compilation to after the type checking, so that you at
+least can check, on Linux, that the windows version of your code wouldn’t fail
+due to some stupid typos in the name of #[cfg(windows)] functions. Alas, I
+don’t know how to design such a conditional compilation system.
+
The same re-export idiom would be used for specifying non-default visibility:
+pub* use rt; would make regex::rt a public module (yeah, this
+particular bit is sketchy :-) ).
+
I think this approach would make most of the pitfalls impossible. E.g., it wouldn’t
+be possible to mix several different crates in one source tree. Additionally,
+it’d be a great help for IDEs, as each file can be processed independently, and
+it would be clear just from the file contents and path where in the crate
+namespace the items are mounted, unlocking
+map-reduce
+style IDE.
+
While we are at it, use definitely should use exactly the same path resolution
+rules as the rest of the language, without any kind of “implicit leading ::”
+special cases. Oh, and we shouldn’t have nested use groups:
+
+
+
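For reference, by nested use groups I mean imports like this (an illustrative example):

```rust
use std::{
    collections::{hash_map::Entry, HashMap},
    sync::{
        atomic::{AtomicBool, Ordering},
        Arc,
    },
};
```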
Some projects use them, some projects don’t use them, sufficiently large
+projects inconsistently both use and don’t use them.
+
Afterword: as I’ve said in the beginning, this is unedited and not generally
+something I’ve thought very hard and long about. Please don’t take this as one
+true way to do things, my level of confidence about these ideas is about 0.5 I
+guess.
I’ve learned a thing I wish I didn’t know.
+As a revenge, I am going to write it down so that you, my dear reader, also learn about this.
+You probably want to skip this post unless you are interested and somewhat experienced in all of Rust, NixOS, and dynamic linking.
I use NixOS and Rust.
+For linking my Rust code, I would love to use lld, the LLVM linker, as it is significantly faster.
+Unfortunately, this often leads to errors when trying to run the resulting binary:
We’ll be using evdev-rs as a running example.
+It is binding to the evdev shared library on Linux.
+First, we’ll build it with the default linker, which just works (haha, nope, this is NixOS).
+
Let’s get the crate:
+
+
+
And run the example
+
+
+
This of course doesn’t just work and spits out a humongous error message, which contains one line of important information: we are missing the libevdev library.
+As this is NixOS, we are not going to barbarically install it globally.
+Let’s create an isolated environment instead, using nix-shell:
+
+
+
And activate it:
+
+
+
This environment gives us two things — the pkg-config binary and the evdev library.
+pkg-config is a sort of half of a C package manager for UNIX: it can’t install libraries, but it helps to locate them.
+Let’s ask it about libevdev:
+
+
+
Essentially, it resolved the library’s short name (libevdev) to the full path of the directory where the library resides:
+
+
+
The libevdev.so.2.3.0 file is the actual dynamic library.
+The symlinks stuff is another bit of a C package manager which implements somewhat-semver: libevdev.so.2 version requirement gets resolved to libevdev.so.2.3.0 version.
+
Anyway, this works well enough to allow us to finally run the example
+
+
+
Success!
+
Ooook, so let’s now do what we wanted to from the beginning and configure cargo to use lld, for blazingly fast linking.
The magic spell you need to put into .cargo/config is (courtesy of @lnicola):
+
+
+
To unpack this:
+
+
+-C sets the codegen option link-arg=-fuse-ld=lld.
+
+
+link-arg means that rustc will pass “-fuse-ld=lld” to the linker.
+
+
+Because linkers are not in the least confusing, the “linker” here is actually the whole gcc/clang.
+That is, rather than invoking the linker, rustc will call cc and that will then call the linker.
+
+
+So -fuse-ld (unlike -C, I think this is an atomic option, not -f use-ld) is an argument to gcc/clang,
+which asks it to use lld linker.
+
+
+And note that it’s lld rather than ldd which confusingly exists and does something completely different.
+
+
+
Anyhow, the end result is that we switch the linker from ld (default slow GNU linker) to lld (fast LLVM linker).
Ok, what now?
+Now, let’s understand why the first example, with ld rather than lld, works at all :-)
+
As a reminder, we use NixOS, so there’s no global folder a-la /usr/lib where all shared libraries are stored.
+Coming back to our pkgconfig example,
+
+
+
the libevdev.so is well-hidden behind the hash.
+So we need the pkg-config binary at compile time to get from the libevdev name to the actual location.
+
However, as this is a dynamic library, we need it not only during compilation, but during runtime as well.
+And at runtime the loader (also known as the dynamic linker (its binary name is something like ld-linux-x86-64.so, but despite the .so suffix, it’s an executable (I kid you not, this stuff is indeed this confusing))) loads the executable together with the shared libraries required by it.
+Normally, the loader looks for libraries in well-known locations, like the aforementioned /usr/lib or LD_LIBRARY_PATH.
+So we need something which would tell the loader that libevdev lives at /nix/store/$HASH/lib.
+
That something is rpath (also known as RUNPATH) — this is more or less LD_LIBRARY_PATH, just hard-coded into the executable.
+We can use readelf to inspect a program’s rpath.
+
When the binary is linked with the default linker, the result is as follows (lightly edited for clarity):
+
+
+
And sure, we see path to libevdev right there!
+
With rustflags = ["-Clink-arg=-fuse-ld=lld"], the result is different, the library is missing from rpath:
+
+
+
At this point, I think we know what’s going on.
+To recap:
+
+
+With both ld and lld, we don’t have problems at compile time, because pkg-config helps the compiler to find the library.
+
+
+At runtime, the library linked with lld fails to find the shared library, while the one linked with ld works.
+
+
+The difference between the two binaries is the value of rpath in the binary itself.
+ld somehow manages to include an rpath which contains the path to the library.
+This rpath is what allows the loader to locate the library at runtime.
+
+
+
Curious observation: dynamic linking on NixOS is not entirely dynamic.
+Because executables expect to find shared libraries in specific locations marked with hashes of the libraries themselves, it’s not possible to just upgrade .so on disk for all the binaries to pick it up.
Why do we have that magical rpath thing in one of the binaries?
+The answer is simple — to set rpath, one passes -rpath /nix/store/... flag to the linker at compile time.
+The linker then just embeds the specified string as rpath field in the executable, without really inspecting it in any way.
+
And here comes the magical/hacky bit — the thing that adds that -rpath argument to the linker’s command line is the NixOS wrapper script!
+That is, the ld on NixOS is not a proper ld, but rather a shell script which does a bit of extra fudging here and there, including the rpath:
+
+
+
+There’s a lot going on in that wrapper script, but the relevant thing for us, as far as I understand, is that everything that gets passed as -L at compile time gets embedded into the binary’s rpath, so that it can be used at runtime as well.
+
Now, let’s take a look at lld’s wrapper:
+
+
+
Haha, nope, there’s no wrapper!
+Unlike ld, lld on NixOS is an honest-to-Bosch binary file, and that’s why we can’t have great things!
+This is tracked in issue #24744 in the nixpkgs repo :)
+
Update:
+
So….. turns out there’s more than one lld on NixOS.
+There’s pkgs.lld, the thing I have been using in the post.
+And then there’s the pkgs.llvmPackages.bintools package, which also contains lld.
+And that version is actually wrapped into an rpath-setting shell script, the same way ld is.
+
That is, pkgs.lld is the wrong lld, the right one is pkgs.llvmPackages.bintools.
This post has nothing to do with JIT-like techniques for patching machine code on the fly (though they are cool!).
+Instead, it describes a cute/horrible trick/hack you can use to generate source code if you are not a huge fan of macros.
+The final technique is going to be independent of any particular programming language, but the lead-up is going to be Rust-specific.
+The pattern can be applied to a wide variety of tasks, but we’ll use a model problem to study different solutions.
I have a field-less enum representing various error conditions:
+
+
+
This is a type I expect to change fairly often.
+I predict that it will grow a lot.
+Even the initial version contains half a dozen variants already!
+For brevity, I am showing only a subset here.
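For concreteness, picture something along these lines (the variant names here are made up):

```rust
#[derive(Clone, Copy, Debug)]
pub enum Error {
    InvalidSignature,
    AccountNotFound,
    InsufficientBalance,
    // ... and many more to come.
}
```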
+
For the purposes of serialization, I would like to convert this error to and from an error code.
+One direction is easy, there’s a built-in mechanism for this in Rust:
+
+
+
The other direction is more annoying: it isn’t handled by the language automatically yet (although there’s an in-progress PR which adds just that!), so we have to write some code ourselves:
+
+
+
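A sketch of both directions, mirroring the made-up enum above:

```rust
impl Error {
    // The easy direction: a fieldless enum casts to an integer.
    pub fn code(self) -> u32 {
        self as u32
    }

    // The annoying direction: a hand-written match.
    pub fn from_code(code: u32) -> Option<Error> {
        match code {
            0 => Some(Error::InvalidSignature),
            1 => Some(Error::AccountNotFound),
            2 => Some(Error::InsufficientBalance),
            _ => None,
        }
    }
}
```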
Now, given that I expect this type to change frequently, this is asking for trouble!
+It’s very easy for the match and the enum definition to get out of sync!
Now, seasoned Rust developers are probably already thinking about macros (or maybe even about specific macro crates).
+And we’ll get there!
+But first, let’s see how I usually solve the problem, when (as I am by default) I am not keen on adding macros.
+
The idea is to trick the compiler into telling us the number of elements in the enum, which would allow us to implement some sanity checking.
+We can do this by adding a fake element at the end of the enum:
+
+
+
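A sketch of the trick, continuing with the made-up variants:

```rust
#[derive(Clone, Copy)]
pub enum Error {
    InvalidSignature,
    AccountNotFound,
    InsufficientBalance,
    __LAST,
}

// The array length is derived from __LAST, so adding a variant without
// extending ALL becomes a compile-time error.
pub const ALL: [Error; Error::__LAST as usize] = [
    Error::InvalidSignature,
    Error::AccountNotFound,
    Error::InsufficientBalance,
];
```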
Now, if we add a new error variant, but forget to update the ALL array, the code will fail to compile — exactly the reminder we need.
+The major drawback here is that __LAST variant has to exist.
+This is fine for internal stuff, but not really great for a public, clean API.
Now, let’s get to macros, and let’s start with the simplest possible one I can think of!
+
+
+
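Something like this, say (same made-up variants as before):

```rust
define_error![
    InvalidSignature,
    AccountNotFound,
    InsufficientBalance,
];
```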
Pretty simple, heh? Let’s look at the definition of define_error! though:
+
+
+
That’s … quite literally a puzzle!
+Declarative macro machinery is comparatively inexpressive, so you need to get creative to get what you want.
+Here, ideally I’d write
+
+
+
Alas, counting in macro by example is possible, but not trivial.
+It’s a subpuzzle!
+Rather than solving it, I use the following work-around:
+
+
+
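One way such a work-around can look (a reconstruction, so the details may differ from the original; the key idea is to reuse the variant names as constants that double as match patterns):

```rust
macro_rules! define_error {
    ($($variant:ident),* $(,)?) => {
        pub enum Error {
            $($variant,)*
        }

        impl Error {
            pub fn from_code(code: u32) -> Option<Error> {
                #![allow(non_upper_case_globals)]
                // A constant per variant, named exactly like the variant...
                $(const $variant: u32 = Error::$variant as u32;)*
                match code {
                    // ...so that it can be used as a pattern here.
                    $($variant => Some(Error::$variant),)*
                    _ => None,
                }
            }
        }
    };
}
```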
And then I have to #![allow(non_upper_case_globals)], to prevent the compiler from complaining.
The big problem with the macro is that it’s not only the internal implementation which is baroque!
+The call-site is pretty inscrutable as well!
+Let’s imagine we are new to a codebase, and come across the following snippet:
+
+
+
The question I would ask here would be “what is that Error thing?”.
+Luckily, we live in the age of powerful IDEs, so we can just “goto definition” to answer that, right?
+
+
+
Well, not really.
+An IDE says that the Error token is produced by something inside that macro invocation.
+That’s a correct answer, if not the most useful one!
+So I have to read the definition of the define_error macro and understand how that works internally to get the idea about public API available externally (e.g., that the Error refers to a public enum).
+And here the puzzler nature of declarative macros is exacerbated.
+It’s hard enough to figure out how to express the idea you want using the restricted language of macros.
+It’s doubly hard to understand the idea the macro’s author had when you can’t peek inside their brain and can only observe the implementation of the macro.
+
One remedy here is to make macro input look more like the code we want to produce.
+Something like this:
+
+
+
This indeed is marginally friendlier for IDEs and people to make sense of:
+
+
+
The cost for this is a more complicated macro implementation.
+Generally, a macro needs to do two things: parse arbitrary token stream input, and emit valid Rust code as output.
+Parsing is usually the more complicated task.
+That’s why in our minimal attempt we used maximally simple syntax, just a list of identifiers.
+However, if we want to make the input of the macro look more like Rust, we have to parse a subset of Rust, and that’s more involved:
+
+
+
We have to carefully deal with all those visibilities and attributes.
+Even after we do that, the connection between the input Rust-like syntax and the output Rust is skin-deep.
+This is mostly smoke and mirrors, and is not much different from, e.g., using Haskell syntax here:
We can meaningfully increase the fidelity between macro input and macro output by switching to a derive macro.
+In contrast to function-like macros, derives require that their input is syntactically and even semantically valid Rust.
+
So the result looks like this:
+
+
+
Again, the enum Error here is an honest, simple enum!
+It’s not an alien beast which just wears an enum’s skin.
+
And the implementation of the macro doesn’t look too bad either, thanks to @dtolnay’s tasteful API design:
+
+
+
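A sketch of what such a derive can look like, using syn and quote (the derive’s name and details here are assumptions, not the exact original):

```rust
use proc_macro::TokenStream;
use quote::quote;
use syn::{parse_macro_input, Data, DeriveInput};

// Lives in a dedicated crate with `proc-macro = true` in Cargo.toml.
#[proc_macro_derive(FromCode)]
pub fn from_code(input: TokenStream) -> TokenStream {
    let input = parse_macro_input!(input as DeriveInput);
    let name = &input.ident;
    let variants = match &input.data {
        Data::Enum(data) => &data.variants,
        _ => panic!("FromCode only works on fieldless enums"),
    };
    // A match over consecutive natural numbers, written out directly.
    let arms = variants.iter().enumerate().map(|(code, variant)| {
        let code = code as u32;
        let ident = &variant.ident;
        quote! { #code => Some(#name::#ident), }
    });
    let expanded = quote! {
        impl #name {
            pub fn from_code(code: u32) -> Option<#name> {
                match code {
                    #(#arms)*
                    _ => None,
                }
            }
        }
    };
    expanded.into()
}
```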
Unlike declarative macros, here we just directly express the syntax that we want to emit — a match over consecutive natural numbers.
+
The biggest drawback here is that on the call-site now we don’t have any idea about the extra API generated by the macro.
+If, with declarative macros, you can notice a pub fn from_code in the same file and guess that that’s a part of the API, with a procedural macro that string is in a completely different crate!
+While proc-macro can greatly improve the ergonomics of using and implementing macros (inflated compile times notwithstanding), for the reader, they are arguably even more opaque than declarative macros.
Finally, let’s see the promised hacky solution :)
+While, as you might have noticed, I am not a huge fan of macros, I like plain old code generation — text in, text out.
+Text manipulation is much worse-is-betterer than advanced macro systems.
+
So what we are going to do is:
+
+
+Read the file with the enum definition as a string (file!() macro will be useful here).
+
+
+“Parse” enum definition using unsophisticated string splitting (str::split_once, aka cut would be our parser).
+
+
+Generate the code we want by concatenating strings.
+
+
+Paste the resulting code into a specially marked position.
+
+
+Overwrite the file in place, if there are changes.
+
+
+And we are going to use a #[test] to drive the process!
+
+
+
+
+
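A sketch of how this can be wired up (the markers, helper names and the parsing are all illustrative; the real thing would be adapted to the actual file layout):

```rust
#[test]
fn generate_from_code() {
    // 1. Read this very file.
    let path = file!();
    let text = std::fs::read_to_string(path).unwrap();

    // 2. "Parse" the enum body with plain string splitting.
    let (_, rest) = text.split_once("pub enum Error {").unwrap();
    let (body, _) = rest.split_once('}').unwrap();
    let variants: Vec<&str> = body
        .lines()
        .map(|line| line.trim().trim_end_matches(','))
        .filter(|line| !line.is_empty() && !line.starts_with("//"))
        .collect();

    // 3. Generate the code by concatenating strings.
    let mut arms = String::new();
    for (i, variant) in variants.iter().enumerate() {
        arms += &format!("        {i} => Some(Error::{variant}),\n");
    }
    let generated = format!("match code {{\n{arms}        _ => None,\n    }}");

    // 4. Paste it between the markers (assumed to sit inside from_code).
    //    In a real setup, keep the markers in a different file than this
    //    test, or build the marker strings so they don't match themselves.
    let start = "// region:generated";
    let end = "// endregion:generated";
    let (prefix, rest) = text.split_once(start).unwrap();
    let (_, suffix) = rest.split_once(end).unwrap();
    let new_text = format!("{prefix}{start}\n    {generated}\n    {end}{suffix}");

    // 5. Overwrite the file only if something actually changed.
    if new_text != text {
        std::fs::write(path, new_text).unwrap();
        panic!("generated code was stale and has been updated; re-run the tests");
    }
}
```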
That’s the whole pattern!
+Note how, unlike every other solution, it is crystal clear how the generated code works.
+It’s just code which you can goto-definition into, or step through in a debugger.
+You can be completely oblivious about the shady #[test] machinery, and that won’t harm understanding in any way.
+
The code of the “macro” is also easy to understand — that’s literally string manipulation.
+What’s more, you can easily see how it works by just running the test!
+
The “read and update your own source code” part is a bit mind-bending!
+But the implementation is tiny and only uses the standard library, so it should be easy to understand.
+
Unlike macros, this doesn’t try to enforce at compile time that the generated code is fresh.
+If you update the Error definition, you need to re-run the test for the generated code to be updated as well.
+But this will be caught by the tests.
+Note the important detail — the test only tries to update the source code if there are, in fact, changes.
+That is, writable src/ is required only during development.
+
That’s all, hope this survey was useful! Discussion on /r/rust.
LSP (language server protocol) is fairly popular today.
+There’s a standard explanation of why that is the case.
+You probably have seen this picture before:
+
+
+
I believe that this standard explanation of LSP popularity is wrong.
+In this post, I suggest an alternative picture.
There are M editors and N languages.
+If you want to support a particular language in a particular editor, you need to write a dedicated plugin for that.
+That means M * N work, as the picture on the left vividly demonstrates.
+What LSP does is cutting that to M + N, by providing a common thin waist, as shown in the picture on the right.
The problem with this explanation is also best illustrated pictorially.
+In short, the picture above is not drawn to scale.
+Here’s a better illustration of how, for example, rust-analyzer + VS Code combo works together:
+
+
+
The (big) ball on the left is rust-analyzer — a language server.
+The similarly sized ball on the right is VS Code — an editor.
+And the small ball in the center is the code to glue them together, including LSP implementations.
+
That code is relatively and absolutely tiny.
+The codebases behind either the language server or the editor are enormous.
+
If the standard theory were correct, then, before LSP, we would have lived in a world where some languages had superb IDE support in some editors.
+For example, IntelliJ would have been great at Java, Emacs at C++, Vim at C#, etc.
+My recollection of that time is quite different.
+To get decent IDE support, you either used a language supported by JetBrains (IntelliJ or ReSharper), or you went without.
+
There was just a single editor providing meaningful semantic IDE support.
I would say that the reason for such poor IDE support in the days of yore is different.
+Rather than M * N being too big, it was too small, because N was zero and M just slightly more than that.
+
+I’d start with N — the number of language servers; this is the side I am relatively familiar with.
+Before LSP, there simply weren’t a lot of working language-server shaped things.
+The main reason for that is that building a language server is hard.
+
The essential complexity for a server is pretty high.
+It is known that compilers are complicated, and a language server is a compiler and then some.
+
First, like a compiler, a language server needs to fully understand the language, it needs to be able to distinguish between valid and invalid programs.
+However, while for invalid programs a batch compiler is allowed to emit an error message and exit promptly, a language server must analyze any invalid program as best as it can.
+Working with incomplete and invalid programs is the first complication of a language server in comparison to a compiler.
+
Second, while a batch compiler is a pure function which transforms source text into machine code, a language server has to work with a code base which is constantly being modified by the user.
+It is a compiler with a time dimension, and evolution of state over time is one of the hardest problems in programming.
+
Third, a batch compiler is optimized for maximum throughput, while a language server aims to minimize latency (while not completely forgoing throughput).
+Adding a latency requirement doesn’t mean that you need to optimize harder.
+Rather, it means that you generally need to turn the architecture on its head to have an acceptable latency at all.
+
And this brings us to a related cluster of accidental complexity surrounding language servers.
+It is well understood how to write a batch compiler.
+It’s common knowledge.
+While not everyone has read the dragon book (I didn’t meaningfully get past the parsing chapters), everyone knows that that book contains all the answers.
+So most existing compilers end up looking like a typical compiler.
+And, when compiler authors start thinking about IDE support, the first thought is “well, IDE is kinda a compiler, and we have a compiler, so problem solved, right?”.
+This is quite wrong — internally an IDE is very different from a compiler but, until very recently, this wasn’t common knowledge.
+
Language servers are a counter-example to the “never rewrite” rule.
+The majority of well-regarded language servers are rewrites or alternative implementations of batch compilers.
+
Both IntelliJ and Eclipse wrote their own compilers rather than re-using javac inside an IDE.
+To provide adequate IDE support for C#, Microsoft rewrote their C++-implemented batch compiler into an interactive, self-hosted one (project Roslyn).
+Dart, despite being a from-scratch, relatively modern language, ended up with three implementations (host AOT compiler, host IDE compiler (dart-analyzer), on-device JIT compiler).
+Rust tried both — incremental evolution of rustc (RLS) and from-scratch implementation (rust-analyzer), and rust-analyzer decisively won.
+
The two exceptions I know are C++ and OCaml.
+Curiously, both require forward declarations and header files, and I don’t think this is a coincidence.
+See the Three Architectures for a Responsive IDE post for details.
+
To sum up, on the language server’s side things were in a bad equilibrium.
+It was totally possible to implement language servers, but that required a bit of an iconoclastic approach, and it’s hard to be a pioneering iconoclast.
+
I am less certain what was happening on the editor’s side.
+Still, I do want to claim that we had no editors capable of being an IDE.
+
IDE experience consists of a host of semantic features.
+The most notable example is, of course, completion.
+If one wants to implement custom completion for VS Code, one needs to implement
+CompletionItemProvider interface:
+
+
+
This means that, in VS Code, code completion (as well as dozens of other IDE related features) is an editor’s first-class concept, with uniform user UI and developer API.
+
Contrast this with Emacs and Vim.
+They just don’t have proper completion as an editor’s extension point.
+Rather, they expose low-level cursor and screen manipulation API, and then people implement competing completion frameworks on top of that!
+
And that’s just code completion!
+What about parameter info, inlay hints, breadcrumbs, extend selection, assists, symbol search, find usages (I’ll stop here :) )?
+
To sum the above succinctly, the problem with decent IDE support was not of N * M, but rather of an inadequate equilibrium of a two-sided market.
+
Language vendors were reluctant to create language servers, because it was hard, the demand was low (= no competition from other languages), and, even if one creates a language server, one would find a dozen editors absolutely unprepared to serve as a host for a smart server.
+
On the editor’s side, there was little incentive for adding high-level APIs needed for IDEs, because there were no potential providers for those APIs.
I don’t think it was a big technical innovation (it’s obvious that you want to separate a language-agnostic editor and a language-specific server).
+I think it’s a rather bad (aka, “good enough”) technical implementation (stay tuned for “Why LSP sucks?” post I guess?).
+But it moved us from a world where not having a language IDE was normal and no one was even thinking about language servers, to a world where a language without working completion and goto definition looks unprofessional.
+
Notably, the two-sided market problem was solved by Microsoft, who were a vendor of both languages (C# and TypeScript) and editors (VS Code and Visual Studio), and who were generally losing in the IDE space to a competitor (JetBrains).
+While I may rant about particular technical details of LSP, I absolutely admire their strategic vision in this particular area.
+They:
+
+
+built an editor on web technologies.
+
+
+identified webdev as a big niche where JetBrains struggles (supporting JS in an IDE is next to impossible).
+
+
+built a language (!!!!) to make it feasible to provide IDE support for webdev.
+
+
+built an IDE platform with a very forward-looking architecture (stay tuned for a post where I explain why vscode.d.ts is a marvel of technical excellence).
+
+
+launched LSP to increase the value of their platform in other domains for free (moving the whole world to a significantly better IDE equilibrium as a collateral benefit).
+
+
+and now, with code spaces, are poised to become the dominant player in “remote first development”, should we indeed stop editing, building, and running code on our local machines.
+
+
+
Though, to be fair, I still hope that, in the end, the winner would be JetBrains with their idea of Kotlin as a universal language for any platform :-)
+While Microsoft takes full advantage of worse-is-better technologies which are dominant today (TypeScript and Electron), JetBrains tries to fix things from the bottom up (Kotlin and Compose).
Now I am just going to hammer it in that it’s really not M * N :)
+
+First, the M * N argument ignores the fact that this is an embarrassingly parallel problem.
+Language designers don’t need to write plugins for all editors, nor do editors need to add special support for all languages.
+Rather, a language should implement a server which speaks some protocol, an editor needs to implement language agnostic APIs for providing completions and such, and, if both the language and the editor are not esoteric, someone who is interested in both would just write a bit of glue code to bind the two together!
+rust-analyzer’s VS Code plugin is 3.2k lines of code, neovim plugin is 2.3k and Emacs plugin is 1.2k.
+All three are developed independently by different people.
+That’s the magic of decentralized open source development at its finest!
+If the plugins were to support a custom protocol instead of LSP (provided that the editor supports a high-level IDE API inside), I’d expect to add maybe 2k lines for that, which is still well within a hobbyist’s part-time budget.
+
+Second, for an M * N optimization you’d expect the protocol implementation to be generated from some machine-readable description.
+But until the latest release, the source of truth for LSP spec was an informal markdown document.
+Every language and client was coming up with their own way to extract the protocol out of it; many (including rust-analyzer) were just syncing the changes manually, with quite a bit of duplication.
+
Third, if M * N is a problem, you’d expect to see only one LSP implementation for each editor.
+In reality, there are two competing Emacs implementations (lsp-mode and eglot) and, I kid you not, at the time of writing rust-analyzer’s manual contains instructions for integration with 6 (six) different LSP clients for vim.
+To echo the first point, this is open source!
+The total amount of work is almost irrelevant, the thing that matters is the amount of coordination to get things done.
+
Fourth, Microsoft itself doesn’t try to take advantage of M + N.
+There’s no universal LSP implementation in VS Code.
+Instead, each language is required to have a dedicated plugin with physically independent implementations of LSP.
Please demand better IDE support!
+I think today we crossed the threshold of general availability of baseline IDE support, but there’s so much we can do beyond the basics.
+In the ideal world, it should be possible to inspect every little semantic detail about the expression at the cursor, using the same simple API one can use today to inspect the contents of the editor’s buffer.
+
+
Text Editor Authors
+
+
Pay attention to the architecture of VS Code.
+While electron delivers questionable user experience, the internal architecture has a lot of wisdom in it.
+Do orient editor’s API around presentation-agnostic high-level features.
+Basic IDE functionality should be a first-class extension point, it shouldn’t be re-invented by every plugin’s author.
+In particular, add assist/code action/💡 as a first-class UX concept already.
+It’s the single most important UX innovation of IDEs, which is very old at this point.
+It’s outright ridiculous that this isn’t a standard interface across all editors.
+
But don’t make LSP itself a first class concept.
+Surprising as it might seem, VS Code knows nothing about LSP.
+It just provides a bunch of extension points without caring in the least how they are implemented.
+LSP implementation then is just a library, which is used by language-specific plugins.
+E.g., Rust and C++ extensions for VS Code do not share the same LSP implementation at runtime, there are two copies of LSP library in memory!
+
Also, try to harness the power of open-source.
+Don’t enforce centralization of all LSP implementations!
+Make it possible for separate groups of people to independently work on perfect Go support and perfect Rust support for your editor.
+VS Code is one possible model, with a marketplace and distributed, independent plugins.
+But it probably should be possible to organize the work as a single shared repo/source tree, as long as languages can have independent maintainer sets.
+
+
Language Server Authors
+
+
You are doing a great job!
+The quality of IDE support is improving rapidly for all the languages, though I feel this is only a beginning of a long road.
+One thing to keep in mind is that LSP is an interface to a semantic info about the language, but it isn’t the interface.
+A better thing might come along.
+Even today, limitations of LSP prevent us from shipping useful features.
+So, try to treat LSP as a serialization format, not as an internal data model.
+And try to write more about how to implement language servers — I feel like there’s still not enough knowledge about this out there.
This post documents one rule of thumb I find useful when coding:
+
+
+
Being a rule-of-thumb, it naturally has exceptions, but those are relatively few.
+The primary context here is application development.
+Libraries with a semver-constrained API have other guidelines — the rules are different at the boundaries.
+
This privacy rule is a manifestation of the fact that the two most popular kinds of entities in programs are:
+
+
+Abstract data types — complex objects with opaque implementation which guard interior invariants and expose intentionally limited API to the outside world
+
+
+Data — relatively simple objects which group a bunch of related attributes together
+
+
+
If some fields of a type are private, it can’t be data.
+If some fields of a type are public, it can still be an ADT, but the abstraction boundary will be a bit awkward.
+Better to just add getters for (usually few) fields which can be public, to make it immediately obvious what role is played by the type.
+
An example of ADT would be FileSet from rust-analyzer’s virtual file system implementation.
+
+
+
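A simplified sketch of its shape (the real type uses VfsPath and FileId newtypes; String and u32 are stand-ins here):

```rust
use std::collections::HashMap;

#[derive(Default)]
pub struct FileSet {
    files: HashMap<String, u32>,
    paths: HashMap<u32, String>,
}

impl FileSet {
    pub fn insert(&mut self, file_id: u32, path: String) {
        self.files.insert(path.clone(), file_id);
        self.paths.insert(file_id, path);
    }
    pub fn file_for_path(&self, path: &str) -> Option<u32> {
        self.files.get(path).copied()
    }
    pub fn path_for_file(&self, file_id: u32) -> Option<&String> {
        self.paths.get(&file_id)
    }
}
```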
This type maintains a bidirectional mapping between string paths and integral file ids.
+How exactly the mapping is maintained (hash map, search tree, trie?) is irrelevant, this implementation detail is abstracted away.
+Additionally, there’s an invariant: the files and paths fields are consistent, complementary mappings.
+So this is the case where all fields are private and there’s a bunch of accessor functions.
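In contrast, a plain-data type from the same area might look something like this (a simplified sketch; the real type and field names differ):

```rust
use std::path::PathBuf;

pub struct Directories {
    pub extensions: Vec<String>,
    pub include: Vec<PathBuf>,
    pub exclude: Vec<PathBuf>,
}
```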
This type specifies a set of paths to include in VFS, a sort-of simplified gitignore.
+This is an inert piece of data — a bunch of extensions, include paths and exclude paths.
+Any combination of the three is valid, so there’s no need for privacy here.
This rule is very mechanical, but it reflects a deeper distinction between flavors of types.
+For a more thorough treatment of the underlying phenomenon, see “Be clear what kind of class you’re writing” chapter from Alexandrescu’s “C++ Coding Standards” and
+“The Expression Problem” from ever thought-provoking Kaminski.
In this short post, I describe and name a cousin of the builder pattern — builder lite.
+
Unlike a traditional builder, which uses a separate builder object, builder lite re-uses the object itself to provide builder functionality.
+
Here’s an illustrative example
+
+
+
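An illustrative sketch (the type and its fields are made up):

```rust
#[derive(Default)]
pub struct Shape {
    position: (f32, f32, f32),
    label: Option<String>,
    visible: bool,
}

impl Shape {
    pub fn new() -> Shape {
        Shape::default()
    }
    pub fn with_position(mut self, position: (f32, f32, f32)) -> Shape {
        self.position = position;
        self
    }
    pub fn with_label(mut self, label: impl Into<String>) -> Shape {
        self.label = Some(label.into());
        self
    }
}

// Call site: start from `new` and chain only the options you need.
// let shape = Shape::new().with_position((0.0, 9.8, 0.0)).with_label("ball");
```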
In contrast, the full builder is significantly wordier at the definition site, and requires a couple of extra invocations at the call site:
+
+
+
The primary benefit of builder-lite is that it is an incremental, zero-cost evolution from the new method.
+As such, it is especially useful in the context where the code evolves rapidly, in an uncertain direction.
+That is, when building applications rather than libraries.
+
To pull a motivational example from work, we had the following typical code:
+
+
+
Here’s a new method with a whole bunch of arguments for various dependencies.
+What we needed to do is to add yet another dependency, so that it could be overwritten in tests.
+The first attempt just added one more parameter to the new method:
+
+
+
However, this change required updating the seven call-sites where new was called, to supply the default counter.
+Switching that to builder lite allowed us to only modify a single call-site where we cared to override the counter.
+
A note on naming:
+If builder methods are to be used only occasionally, with_foo is the best naming.
+If most call-sites make use of builder methods, just .foo might work better.
+For boolean properties, sometimes it makes sense to have both:
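Continuing the Shape sketch from above:

```rust
impl Shape {
    // The short form for the common case...
    pub fn visible(self) -> Shape {
        self.with_visible(true)
    }
    // ...and the explicit form for when the flag is computed.
    pub fn with_visible(mut self, visible: bool) -> Shape {
        self.visible = visible;
        self
    }
}
```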
In this post I’ll describe how to implement caches in Rust.
+It is inspired by two recent refactors I landed at nearcore (nearcore#6549, nearcore#6811).
+Based on that experience, it seems that implementing caches wrong is rather easy, and making a mistake there risks “spilling over”, and spoiling the overall architecture of the application a bit.
+
Let’s start with an imaginary setup with an application with some configuration and a database:
+
+
+
The database is an untyped key-value store:
+
+
+
And the App encapsulates the database and provides typed access to a domain-specific Widget:
+
+
+
Now, for the sake of argument let’s assume that database access and subsequent deserialization are costly, and that we want to add a cache of Widgets in front of the database.
+Data-oriented thinking would compel us to get rid of deserialization step instead, but we will not pursue that idea this time.
+
We’ll use a simple HashMap for the cache:
+
+
+
And we need to modify get_widget method to return the value from the cache, if there is one:
+
+
+
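Putting the pieces together, a compact sketch of this path-of-least-resistance version (Db, Widget and the key handling are stand-ins):

```rust
use std::collections::HashMap;

struct Db;
impl Db {
    fn load(&self, _id: u32) -> Option<Vec<u8>> {
        unimplemented!("imagine a real database here")
    }
}

#[derive(Clone)]
struct Widget {
    title: String,
}

fn deserialize(_bytes: &[u8]) -> Widget {
    unimplemented!("imagine a costly deserialization step here")
}

struct App {
    db: Db,
    cache: HashMap<u32, Widget>,
}

impl App {
    fn get_widget(&mut self, id: u32) -> Option<&Widget> {
        if !self.cache.contains_key(&id) {
            let bytes = self.db.load(id)?;
            let widget = deserialize(&bytes);
            self.cache.insert(id, widget);
        }
        self.cache.get(&id)
    }
}
```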
The biggest change is the &mut self.
+Even when reading the widget, we need to modify the cache, and the easiest way to get that ability is to require an exclusive reference.
+
I want to argue that this path of least resistance doesn’t lead to a good place.
+There are many problems with methods of the following shape:
+
+
+
First, such methods conflict with each other.
+For example, the following code won’t work, because we’ll try to borrow the app exclusively twice.
+
+
+
Second, the &mut methods conflict even with & methods.
+Naively, it would seem that, as get_widget returns a shared reference, we should be able to call & methods.
+So, one can expect something like this to work:
+
+
+
Alas, it doesn’t.
+Rust borrow checker doesn’t distinguish between mut and non-mut lifetimes (for a good reason: doing that would be unsound).
+So, although w is just &Widget, the lifetime there is the same as on the &mut self, so the app remains mutably borrowed while the widget exists.
+
Third, perhaps the most important point, the &mut self becomes viral — most of functions in the program begin requiring &mut, and you lose type-system distinction between read-only and read-write operations.
+There’s no distinction between “this function can only modify the cache” and “this function can modify literally everything”.
+
Finally, even implementing get_widget is not pleasant.
+Seasoned rustaceans among you might twitch at the needlessly-repeated hashmap lookups.
+But trying to get rid of those with the help of the entry-API runs into current borrow checker limitations.
+
Let’s look at how we can better tackle this!
+
The general idea for this class of problems is to think what the ownership and borrowing situation should be and try to achieve that, as opposed to merely following suggestions by the compiler.
+That is, most of the time just using &mut and & as the compiler guides you is a path to success, as, it turns out, the majority of the code naturally follows simple aliasing rules.
+But there are exceptions, it’s important to recognize them as such and make use of interior mutability to implement the aliasing structure which makes sense.
+
Let’s start with a simplified case.
+Suppose that there’s only one Widget to deal with.
+In this case, we’d want something like this:
+
+
+
This doesn’t work as is — modifying the cache needs &mut which we’d very much prefer to avoid.
+However, thinking about this pattern, it feels like it should be valid — we enforce at runtime that the contents of the cache is never overwritten.
+That is, we actually do have exclusive access to cache on the highlighted line at runtime, we just can’t explain that to the type system.
+But we can reach out for unsafe for that.
+What’s more, Rust’s type system is powerful enough to encapsulate that usage of unsafe into a safe and generally re-usable API.
+So let’s pull once_cell crate for this:
+
+
+
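A sketch of that single-widget version, reusing the stand-ins from above (once_cell::unsync::OnceCell is the single-threaded flavor):

```rust
use once_cell::unsync::OnceCell;

struct App {
    db: Db,
    widget: OnceCell<Widget>,
}

impl App {
    // Note: &self, not &mut self. The cell is written at most once,
    // so handing out &Widget is fine.
    fn get_widget(&self) -> &Widget {
        self.widget.get_or_init(|| {
            let bytes = self.db.load(0).unwrap(); // placeholder key
            deserialize(&bytes)
        })
    }
}
```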
Coming back to the original hash-map example, we can apply the same logic here:
+as long as we never overwrite, delete or move values, we can safely return references to them.
+This is handled by the elsa crate:
+
+
+
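A sketch with elsa's FrozenMap, again reusing the stand-ins from above. Because entries are never removed or overwritten, the map can hand out plain &Widget through &self:

```rust
use elsa::FrozenMap;

struct App {
    db: Db,
    cache: FrozenMap<u32, Box<Widget>>,
}

impl App {
    fn get_widget(&self, id: u32) -> Option<&Widget> {
        if let Some(widget) = self.cache.get(&id) {
            return Some(widget);
        }
        let bytes = self.db.load(id)?;
        let widget = deserialize(&bytes);
        Some(self.cache.insert(id, Box::new(widget)))
    }
}
```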
The third case is that of a bounded cache.
+If you need to evict values, then the above reasoning does not apply.
+If the user of a cache gets a &T, and then the corresponding entry is evicted, the reference would dangle.
+In this situation, we want the clients of the cache to co-own the value.
+This is easily handled by an Rc:
+
+
+
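One possible shape of that, with a RefCell<HashMap> standing in for a real LRU and the same stand-ins as above:

```rust
use std::cell::RefCell;
use std::collections::HashMap;
use std::rc::Rc;

struct App {
    db: Db,
    cache: RefCell<HashMap<u32, Rc<Widget>>>,
}

impl App {
    fn get_widget(&self, id: u32) -> Option<Rc<Widget>> {
        if let Some(widget) = self.cache.borrow().get(&id) {
            return Some(Rc::clone(widget));
        }
        let bytes = self.db.load(id)?;
        let widget = Rc::new(deserialize(&bytes));
        let mut cache = self.cache.borrow_mut();
        if cache.len() >= 128 {
            // Evict something; a real implementation would use an LRU policy.
            cache.clear();
        }
        cache.insert(id, Rc::clone(&widget));
        Some(widget)
    }
}
```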
To sum up: when implementing a cache, the path of the least resistance is to come up with a signature like this:
+
+
+
This often leads to problems down the line.
+It’s usually better to employ some interior mutability and get either of these instead:
+
+
+
This is an instance of the more general effect: despite the “mutability” terminology, Rust references track not mutability, but aliasing.
+Mutability and exclusive access are correlated, but not perfectly.
+It’s important to identify instances where you need to employ interior mutability, often they are architecturally interesting.
+
To learn more about relationships between aliasing and mutability, I recommend the following two posts:
There’s a bit of discussion happening in Rust community on the generic associated types topic.
+I can not help but add my own thoughts to the pile :-)
+
I don’t intend to write a well-edited post considering all pros and cones (intentional typo to demonstrate how unedited this is).
+Rather, I just want to dump my experience as is.
+Ultimately I trust the lang team to make the right call here way more than I trust myself.
+The post could be read as a bit inflammatory, but my stated goal here is not to sway someone’s mind by the arguments, but rather expose my own thinking process.
+
This post is partially prompted by the following comment from the RFC:
+
+
+
It stuck with me, because this is very much the opposite of the experience I have.
+I’ve been using Rust extensively for a while, mostly as an application (as opposed to library) developer, and I can’t remember a single instance where I really wanted to have GATs.
+This is a consequence of my overall code style — I try to use abstraction sparingly and rarely reach out for traits.
+I don’t think I’ve ever built a meaningful abstraction which was expressed via traits?
+On the contrary, I try hard to make everything concrete and non-generic on the language level.
+
What’s more, when I do reach out for traits, most of the time this is to use trait objects, which give me a new runtime capability to use different, substitutable concrete type.
+For the static, monomorphization-based subset of traits I find that most of the time non-trait solutions seem to work.
+
And I think GATs (and associated types in general) don’t work with trait objects, which probably explains why, even when I use traits, I don’t generally need GATs.
+Though, it seems to me that lifetime-only subset of GATs actually works with trait objects?
+That is, lending iterator seems to be object safe?
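For reference, the lifetime-only GAT in question is something like this (whether this trait can actually be made object safe is exactly the open question above):

```rust
trait LendingIterator {
    type Item<'a>
    where
        Self: 'a;

    // Unlike Iterator, the returned item may borrow from the iterator itself.
    fn next(&mut self) -> Option<Self::Item<'_>>;
}
```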
+
I guess, the only place where I do, indirectly, want GATs is to make async trait work, but even then, I usually am interested in object-safe async traits, which I think don’t need and can’t use GATs?
+
+
Another disconnection between my usage of Rust and discussion surrounding the GATs is in one of the prominent examples — parser combinator library.
+In practice, for me parser combinator’s primary use-case was always a vehicle for teaching advanced types (eg, the monads paper uses parsers as one of the examples).
+For production use-cases I’ve encountered, it was always either a hand-written parser, or a full-blown parser generator.
In this post I argue that integration-vs-unit is a confused, and harmful, distinction.
+I provide a more useful two-dimensional mental model instead.
+The model is descriptive (it allows one to think more clearly about any test), but I also include my personal prescriptions (the model shows which metrics are and aren’t worth optimizing).
+
Credit for the idea goes to the SWE book.
+I always felt that integration versus unit debate is confused, the book helped me to formulate in which way exactly.
+
I won’t actually rigorously demonstrate the existing confusion — I find it self-evident.
+As just two examples:
+
+
+Unit-testing is used as a synonym for automated testing (x-unit frameworks).
+
+
+Cargo uses “unit” and “integration” terminology to describe Rust-specific properties of the compilation model, which is orthogonal to the traditional, however fuzzy, meaning of these terms.
+
+
+
Most of the time, it’s more productive to speak about just “tests”, or maybe “automated tests”, rather than argue whether something should be considered a unit or an integration test.
+
But I argue that a useful, more precise classification exists.
The first axis of classification is, broadly speaking, performance.
+“How much time would a thousand similar tests take?” is a very useful metric.
+The dependency between the time from making an edit to getting the test results and most other interesting metrics in software (performance, time to fix defects, security) is super-linear.
+Tests longer than attention span obliterate productivity.
+
It’s useful to take a closer look at what constitutes a performant test.
+One non-trivial observation here is that test speed is categorical, rather than numerical.
+Certain tests are order-of-magnitude slower than others.
+Consider the following list:
+
+
+Single-threaded pure computation
+
+
+Multi-threaded parallel computation
+
+
+Multi-threaded concurrent computation with time-based synchronization and access to disk
+
+
+Multi-process computation
+
+
+Distributed computation
+
+
+
+Each step of this ladder adds half an order of magnitude to a test’s runtime.
+
Time is not the only thing affected — the higher you go, the bigger is the fraction of flaky tests.
+It’s nigh impossible to make a test for a pure function flaky.
+If you add threads into the mix, keeping flakiness out requires some careful thinking about synchronization.
+And if the test spans several processes, it is almost bound to fail under some more unusual circumstances.
+
Yet another effect we observe along this axis is resilience to unrelated changes.
+The more of the operating system and other processes are involved in the test, the higher the probability that some upgrade somewhere breaks something.
+
I think the “purity” concept from functional programming is a good way to generalize this axis of the differences between the tests.
+Pure tests do little-to-no IO; they are independent of timings and environment.
+Less pure tests do more of the impure things.
+Purity is correlated with performance, repeatability and stability.
+Test purity is non-binary, but it is mostly discrete.
+Threads, time, file-system, network, processes are the notches to think about.
The second axis is the fraction of the code which gets exercised, potentially indirectly, by the test.
+Does the test exercise only the business logic module, or is the database API and the HTTP handling also required?
+This is distinct from performance: running more code doesn’t mean that the code will run slower.
+An infinite loop takes very little code.
+What affects performance is not whether tests for business logic touch persistence, but whether, in tests, persistence is backed by an in-memory hash-map or by an out-of-process database server.
+
The “extent” of the tests is a good indicator of the overall architecture of the application, but usually it isn’t a worthy metric to optimize by itself.
+On the contrary, artificially limiting the extent of tests by mocking your own code (as opposed to mocking impure IO) reduces fidelity of the tests, and makes the code more brittle in the face of refactors.
+
One potential exception here is the impact on compilation time.
+In a layered application A < B < C, it’s possible to test A either through its interface to B (small-extent test) or by driving A indirectly through C.
+The latter has a problem that, after changing A, running tests might require, depending on the language, rebuilding B and C as well.
+
+
Summing up:
+
+
+Don’t think about tests in terms of opposition between unit and integration, whatever that means. Instead,
+
+
+Think in terms of test’s purity and extent.
+
+
+Purity corresponds to the amount of generalized IO the test is doing and is correlated with desirable metrics, namely performance and resilience.
+
+
+Extent corresponds to the amount of code the test exercises. Extent somewhat correlates with impurity, but generally does not directly affect performance.
+
+
+
And, the prescriptive part:
+
+
+Ruthlessly optimize purity, moving one step down on the ladder of impurity gives huge impact.
+
+
+Generally, just let the tests have their natural extent. Extent isn’t worth optimizing by itself, but it can tell you something about your application’s architecture.
+
+
+
If you enjoyed this post, you might like How to Test as well.
+It goes further in the prescriptive direction, but, when writing it, I didn’t have the two dimensional purity-extent vocabulary yet.
+
+
As I’ve said, this framing is lifted from the SWE book.
+There are two differences, one small and one big.
+The small difference is that the book uses “size” terminology in place of “purity”.
+The big difference is that the second axis is different: rather than looking at which fraction code gets exercised by the test, the book talks about test “scope”: how large is the bit we are actually testing?
+
I do find scope concept useful to think about!
+And, unlike extent, keeping most tests focused is a good active prescriptive advice.
+
I however find the scope concept a bit too fuzzy for actual classification.
+
Consider this test from rust-analyzer, which checks that we can complete a method from a trait if the trait is implemented:
+
+
+
I struggle with determining the scope of this test.
+On the one hand, this clearly tests very narrow, very specific scenario.
+On the other hand, to make this work, all the layers of the system have to work just right.
+The lexer, the parser, name resolution and type checking all have to be prepared for incomplete code.
+This test tests not so much the completion logic itself, as all the underlying infrastructure for semantic analysis.
+
The test is very easy to classify in the purity/extent framework.
+It’s 100% pure — no IO, just a single thread.
+It has maximal extent — the test exercises the bulk of the rust-analyzer codebase; the only thing that isn’t touched here is the LSP itself.
+
Also, as a pitch for the How to Test post, take a second to appreciate how simple the test is, considering that it tests an error-resilient, highly incremental compiler :)
This is going to be a philosophical post, vaguely about language design, and vaguely about Rust.
+If you’ve been following this blog for a while, you know that one theme I consistently hammer at is that of boundaries.
+This article is no exception!
The most important boundary for a software project is its external interface, that which the users directly interact with and which you give backwards compatibility guarantees for.
+For a web-service, this would be the URL scheme and the shape of JSON request and responses.
+For a command line application — the set and the meaning of command-line flags.
+For an OS kernel — the set of syscalls (Linux) or the blessed user-space libraries (Mac).
+And, for a programming language, this would be the definition of the language itself, its syntax and semantics.
+
Sometimes, however, it is beneficial to install somewhat artificial, internal boundaries, a sort-of macro level layers pattern.
+Boundaries have a high cost.
+They prevent changes.
+But a skillfully placed internal (or even an artificial external) boundary can also help.
+
It cuts the system in two, and, if the cut is relatively narrow in comparison to the overall size of the system (hourglass shape), this boundary becomes a great way to understand the system.
+Understanding just the boundary allows you to imagine how the subsystem beneath it could be implemented.
+Most of the time, your imaginary version would be pretty close to what actually happens, and this mental map would help you a great deal to peel off the layers of glue code and get a gut feeling for where the core logic is.
+
Even if an internal boundary starts out in the right place, it, unlike an external one, is ever in danger of being violated.
+“Internal boundary” is a very non-physical thing, most of the time it’s just informal rules like “module A shall not import module B”.
+It’s very hard to notice that something is not being done!
+That’s why, I think, larger companies can benefit from a microservices architecture: in theory, if we just solve the human coordination problem, a monolith can be architected just as cleanly, while offering much better performance.
+In practice, at sufficient scale, maintaining good architecture across teams is hard, and becomes much easier if the intended internal boundaries are reified as processes.
+
It’s hard enough to protect from accidental breaching of internal boundaries.
+But there’s a bigger problem: often, internal boundaries stand in the way of user-visible system features, and it takes a lot of authority to protect internal system’s boundary at the cost of not shipping something.
+
In this post, I’d like to catalog some of the cases I’ve seen in the Rust programming language where I think internal boundaries were eroded with time.
It’s a somewhat obscure feature of Rust’s name resolution, but various things that inhabit Rust’s scopes (structs, modules, traits, variables) are split into three namespaces: types, values and macros.
+This makes it possible to have two things with the same name in the same scope without causing a conflict:
+
+
+
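For instance, a minimal example along these lines (my own reconstruction, not necessarily the post's original snippet):

```rust
#![allow(non_camel_case_types)]

// `x` the struct lives in the type namespace...
struct x {}

// ...while `x` the function lives in the value namespace, so the two coexist.
fn x() -> x {
    x {}
}
```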
The above is legal Rust, because the x struct lives in the types namespace, while the x function lives in the values namespace.
+The namespaces are reflected syntactically: . is used to traverse the value namespace, while :: traverses types.
+
Except that this is only almost a rule.
+There are some cases where the compiler gives up on clear syntax-driven namespacing rules and just does ad-hoc disambiguation.
+For example:
+
+
+
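A snippet matching the description below could look like this (my reconstruction, so the original may have differed in detail):

```rust
use std::str;

fn main() {
    let s: &str = "hello";
    let n = str::len(s);                       // `str` the primitive type
    let w = str::from_utf8(b"world").unwrap(); // `str` the module
    assert_eq!((n, w), (5, "world"));
}
```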
Here, the str in &str and str::len is the str type, from the type namespace.
+The two other strs are the str module.
+In other words, str::len is a method of the str type, while str::from_utf8 is a free-standing function in the str module.
+Like types, modules inhabit the types namespace, so normally the code here would cause a compilation error.
+The compiler (and rust-analyzer) just hacks around the primitive types case.
+
Another recently added case is that of const generics.
+Previously, the T in foo::<T>() was a syntactically-unambiguous reference to something from the types namespace.
+Today, it can refer either to a type or to a value.
+This begs the question: is splitting type and value namespaces a good idea?
+If we have to disambiguate anyway, perhaps we could have just a single namespace and avoid introducing a second lookup syntax?
+That is, just use std.collections.HashMap;.
+
I think these namespace aspirations re-enact similar developments from C.
+I haven’t double-checked my history here, so take the following with a grain of salt and do your own research before quoting, but I think that C, in its initial versions, used to have a very strict syntactic separation between types and values.
+That’s why you are required to write struct when declaring a local variable of struct type:
+
+
+
The struct keyword tells the parser that it is parsing a type, and, therefore, a declaration.
+But then, at a later point, typedefs were added, and so the parser was taught to disambiguate types and values via the lexer hack:
Rust has separate grammatical categories for patterns and expressions.
+It used to be the case that any utterance can be unambiguously classified, depending solely on the syntactic context, as either an expression or a pattern.
+But then a minor exception happened:
+
+
+
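Roughly this kind of code (a sketch of mine, not the original example):

```rust
fn describe(x: Option<i32>) -> &'static str {
    match x {
        None => "nothing",  // `None` resolves to the existing constant Option::None
        none => {           // `none` introduces a brand new binding
            let _ = none;
            "something"
        }
    }
}
```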
Syntactically, None and none are indistinguishable.
+But they play quite different roles: None refers to the Option::None constant, while none introduces a fresh binding into the scope.
+Swift elegantly disambiguates the two at the syntax level, by requiring a leading . for enum variants.
+Rust just hacks this at the name-resolution layer, by defaulting to a new binding unless there’s a matching constant in the scope.
+
+Recently, the scope of the hack was increased greatly: with destructuring assignment implemented, an expression can now be re-classified as a pattern:
+
+
+
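For instance (my illustration of the feature):

```rust
fn main() {
    let (mut a, mut b) = (1, 2);
    // Syntactically, the lhs `(b, a)` is a tuple expression;
    // semantically it is re-checked as a pattern.
    (b, a) = (a, b);
    assert_eq!((a, b), (2, 1));
}
```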
Syntactically, = is a binary expression, so both the left hand side and the right hand side are expressions.
+But now the lhs is re-interpreted as a pattern.
+
So perhaps the syntactic boundary between expressions and patterns is a fake one, and we should have used unified expression syntax throughout?
A boundary which stands intact is the class of the grammar.
+Rust is still an LL(k) language: it can be parsed using a straightforward single-pass algorithm which doesn’t require backtracking.
+The cost of this boundary is that we have to type .collect::<Vec<_>>() rather than .collect<Vec<_>>() (nowadays, I type just .collect() and use the light-bulb to fill-in the turbofish).
Another recent development is the erosion of the boundary between the lexer and the parser.
+Rust has tuple structs, and uses the cutesy .0 syntax to access numbered fields.
+This is problematic for nested tuple structs.
+They need syntax like foo.1.2, but to the lexer this string looks like three tokens: foo, ., 1.2.
+That is, 1.2 is a floating point number, 6/5.
+So, historically one had to write this expression as foo.1 .2, with a meaningful whitespace.
+
Today, this is hacked in the parser, which takes the 1.2 token from the lexer, inspects its text and further breaks it up into 1, . and 2 tokens.
+
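To make this concrete, a small illustration (mine, not from the original post):

```rust
struct Foo(u32, (u32, u32, u32));

fn main() {
    let foo = Foo(1, (2, 3, 4));
    // To the lexer, `foo.1.2` is `foo`, `.`, `1.2`, where `1.2` is one float token;
    // the parser re-splits that token. Historically you had to write `foo.1 .2`.
    assert_eq!(foo.0, 1);
    assert_eq!(foo.1.2, 4);
}
```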
The last example is quite interesting: in Rust, unlike many programming languages, the separation between the lexer and the parser is not an arbitrary internal boundary, but is actually a part of an external, semver protected API.
+Tokens are the input to macros, so macro behavior depends on how exactly the input text is split into tokens.
+
And there’s a second boundary violation here: in theory, “token” as seen by a macro is just its text plus hygiene info.
+In practice though, to implement captures in macros by example ($x:expr things), a token could also be a fully-formed fragment of the compiler’s internal AST data structure.
+The API is carefully future proofed such that, as soon as the macro looks at such a magic token, it gets decomposed into underlying true tokens, but there are some examples where the internal details leak via changes in observable behavior.
To end this on a more positive note, here’s one pretty important internal boundary which is holding up pretty well.
+In Rust, lifetimes don’t affect code generation.
+In fact, lifetimes are fully stripped from the data which is passed to codegen.
+This is pretty important: although the inferred lifetimes are opaque and hard to reason about, you can be sure that, for example, the exact location where a value is dropped is independent from the whims of the borrow checker.
+
+
Conclusion: not really? It seems that we are generally overly-optimistic about internal boundaries, and they seem to crumble under the pressure of feature requests, unless the boundary in question is physically reified (please don’t take this as an endorsement of microservice architecture for compilers).
This is a sequel to Notes on Paxos post.
+Similarly, the primary goal here is for me to understand in detail why the BFT consensus algorithm works.
+This might, or might not be useful for other people!
+The Paxos article is a prerequisite, best to read that now, and return to this article tomorrow :)
+
Note also that while Paxos was more or less a direct translation of Lamport’s lecture, this post is a mish-mash of the original BFT paper by Liskov and Castro, my own thinking, and a cursory glance at this formalization.
+As such, the probability that there are no mistakes here is quite low.
BFT stands for Byzantine Fault Tolerant consensus.
+Similarly to Paxos, we imagine a distributed system of computers communicating over a faulty network which can arbitrarily reorder, delay, and drop messages.
+And we want computers to agree on some specific choice of value among the set of possibilities, such that any two computers pick the same value.
+Unlike Paxos though, we also assume that computers themselves might be faulty or malicious.
+So, we add a new condition to our list of bad things.
+Besides reordering, duplication, delaying and dropping, fake messages can now also be manufactured out of thin air.
+
Of course, if absolutely arbitrary messages can be forged, then no consensus is possible — each machine lives in its own solipsistic world which might be completely unlike the world of every other machine.
+So there’s one restriction — messages are cryptographically signed by the senders, and it is assumed that it is impossible for a faulty node to impersonate non-faulty one.
+
Can we still achieve consensus?
+We can, as long as for every f faulty, malicious nodes we have at least 2f + 1 honest ones.
+
Similarly to the Paxos post, we will capture this intuition into a precise mathematical statement about trajectories of state machines.
Our plan is to start with vanilla Paxos, and then patch it to allow byzantine behavior.
+Here’s what we’ve arrived at last time:
+
+
+
Our general idea is to add some “evil” acceptors 𝔼 to the mix and allow them to send arbitrary messages, while at the same time making sure that the subset of “good” acceptors continues to run Paxos.
+What makes this complex is that we don’t know which acceptors are good and which are bad.
+So this is our setup:
+
+
+
If previously the quorum condition was “any two quorums have an acceptor in common”, it is now “any two quorums have a good acceptor in common”.
+An alternative way to say that is “a byzantine quorum is a super-set of normal quorum”, which corresponds to the intuition where we are running normal Paxos, and there are just some extra evil guys whom we try to ignore.
+For Paxos, we allowed f faulty out of 2f + 1 total nodes with f+1 quorums.
+For Byzantine Paxos, we’ll have f byzantine nodes out of 3f + 1 total, with 2f + 1 quorums.
+As I’ve said, if we forget about the byzantine folks, we get exactly the f + 1 out of 2f + 1 picture of normal Paxos.
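To double-check the arithmetic: with 3f + 1 acceptors in total and quorums of size 2f + 1, any two quorums overlap in at least (2f + 1) + (2f + 1) − (3f + 1) = f + 1 acceptors, and since at most f of those can be byzantine, the intersection always contains at least one good acceptor.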
+
The next step is to determine behavior for byzantine nodes.
+They can send any message, as long as they are the author:
+
+
+
That is, a byzantine acceptor can send any 1a or 2a message at any time, while for 1b and 2b the author should match.
+
What breaks?
+The most obvious thing is Phase2b, that is, voting.
+In Paxos, as soon as an acceptor receives a 2a message, it votes for it.
+The correctness of Paxos hinges on the Safe check before we send 2a message, but a Byzantine node can send an arbitrary 2a.
+
The solution here is natural: rather than blindly trust 2a messages, acceptors would themselves double-check the safety condition, and reject the message if it doesn’t hold:
+
+
+
Implementation wise, this means that, when a coordinator sends a 2a, it also wants to include 1b messages proving the safety of 2a.
+But in the spec we can just assume that all messages are broadcasted, for simplicity.
+Ideally, for correct modeling you also want to model how each acceptor learns new messages, to make sure that negative reasoning about a certain message not being sent doesn’t creep in, but we’ll avoid that here.
+
However, just re-checking safety doesn’t fully solve the problem.
+It might be the case that several values are safe at a particular ballot (indeed, in the first ballot any value is safe), and it is exactly the job of a coordinator / 2a message to pick one value to break the tie.
+And in our case a byzantine coordinator can send two 2a for different valid values.
+
And here we’ll make the single non-trivial modification to the algorithm.
+Just as the Safe condition is at the heart of Paxos, the Confirmed condition is the heart here.
+
So basically we expect a good coordinator to send just one 2a message, but a bad one can send many.
+And we want to somehow distinguish the two cases.
+One way to do that is to broadcast ACKs for 2a among acceptors.
+If I received a 2a message, checked that the value therein is safe, and also know that everyone else received this same 2a message, I can safely vote for the value.
+
So we introduce a new message type, 2ac, which confirms a valid 2a message:
+
+
+
Naturally, evil acceptors can confirm whatever:
+
+
+
+But, if we get a quorum of confirmations, we can be sure that no other value will be confirmed in a given ballot (each good acceptor confirms at most a single message in a ballot (and we need a bit of state for that as well))
+
+
+
Putting everything so far together, we get
+
+
+
In the above, I’ve also removed phases 1a and 2a, as byzantine acceptors are allowed to send arbitrary messages as well (we’ll need explicit 1a/2a for liveness, but we won’t discuss that here).
+
+The most important conceptual addition is Phase2ac — if an acceptor receives a new 2a message for some ballot with a safe value, it sends out the confirmation, provided that it hasn’t done that already.
+In Phase2b we can then vote for confirmed values: confirmation by a quorum guarantees both that the value is safe at this ballot, and that it is the single value that can be voted for in this ballot (two different values can’t be confirmed in the same ballot, because quorums have an honest acceptor in common).
+This almost works, but there’s still a problem.
+Can you spot it?
+
The problem is in the Safe condition.
+Recall that the goal of the Safe condition is to pick a value v for ballot b, such that, if any earlier ballot b1 concludes, the value chosen in b1 would necessarily be v.
+The way Safe works for ballot b in normal Paxos is that the coordinator asks a certain quorum to abstain from further voting in ballots earlier than b, collects existing votes, and uses those votes to pick a safe value.
+Specifically, it looks at the vote for the highest-numbered ballot in the set, and declares a value from it as safe (it is safe: it was safe at that ballot, and for all future ballots there’s a quorum which abstained from voting).
+
This procedure puts a lot of trust in that highest vote, which makes it vulnerable.
+An evil acceptor can just say that it voted in some high ballot, and force the choice of an arbitrary value.
+So, we need some independent confirmation that the vote was cast for a safe value.
+And we can re-use 2ac messages for this:
+
+
+
And … that’s it, really.
+Now we can sketch a proof that this thing indeed achieves BFT consensus, because it actually models normal Paxos among non-byzantine acceptors.
+
Phase1a messages of Paxos are modeled by Phase1a messages of BFT Paxos, as they don’t have any preconditions, the same goes for Phase1b.
+Phase2a message of Paxos is emitted when a value becomes confirmed in BFT Paxos.
+This is correct modeling, because BFT’s Safe condition models normal Paxos Safe condition (this … is a bit inexact I think, to make this exact, we want to separate “this value is safe” from “we are voting for this value” in original Paxos as well).
+Finally, Phase2b also displays direct correspondence.
+
As a final pop-quiz, I claim that the Confirmed(m.vote.bal, v) condition in Safe above can be relaxed.
+As stated, Confirmed needs a byzantine quorum of confirmations, which guarantees both that the value is safe and that it is the single confirmed value, which is a bit more than we need here.
+Do you see what would be enough?
This post is a case study of writing a Rust application using only minimal, artificially constrained API (eg, no dynamic memory allocation).
+It assumes a fair bit of familiarity with the language.
The back story here is a particular criticism of Rust and C++ from hard-core C programmers.
+This criticism is aimed at RAII— the language-defining feature of C++, which was wholesale imported to Rust as well.
+RAII makes using various resources requiring cleanups (file descriptors, memory, locks) easy — any place in the program can create a resource, and the cleanup code will be invoked automatically when needed.
+And herein lies the problem — because allocating resources becomes easy, RAII encourages a sloppy attitude to resources, where they are allocated and destroyed all over the place.
+In particular, this leads to:
+
+
+Decrease in reliability. Resources are usually limited in principle, but actual resource exhaustion happens rarely.
+If resources are allocated throughout the program, there are many virtually untested codepaths.
+
+
+Lack of predictability. It is usually impossible to predict up-front how many resources the program will consume.
+Instead, resource-consumption is observed empirically.
+
+
+Poor performance. Usually, it is significantly more efficient to allocate and free resources in batches.
+Cleanup code for individual resources is scattered throughout the codebase, increasing code bloat.
+
+
+Spaghetti architecture. Resource allocation is an architecturally salient thing.
+If all resource management is centralized to a single place, it becomes significantly easier to understand the lifecycle of resources.
+
+
+
I think this is a fair criticism.
+In fact, I think this is the same criticism that C++ and Rust programmers aim at garbage collected languages.
+This is a spectrum:
+
+
+
Rust programmers typically are not exposed to the lowest level of this pyramid.
+But there’s a relatively compact exercise to gain the relevant experience: try re-implementing your favorite Rust programs on hard mode.
+
Hard Mode means that you split your program into an std binary and a #![no_std] no-alloc library.
+Only the small binary is allowed to directly ask OS for resources.
+For the library, all resources must be injected.
+In particular, to do memory allocation, the library receives a slice of bytes of a fixed size, and should use that for all storage.
+Something like this:
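Schematically, something like this (the names below are mine, not the actual crt API):

```rust
// lib.rs: #![no_std], no allocator; every resource is injected by the caller.
#![no_std]

pub struct Error;

pub fn render(memory: &mut [u8], scene: &str) -> Result<(), Error> {
    // All storage is carved out of `memory`; the library never asks the OS for anything.
    let _ = (memory, scene);
    Ok(())
}

// main.rs (the std binary, not shown): allocates one big Vec<u8> up front,
// reads the input, and hands both to `render`.
```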
So, this is what the post is about: my experience implementing a toy hard mode ray tracer.
+You can find the code on GitHub: http://github.com/matklad/crt.
+
The task of a ray tracer is to convert a description of a 3D scene like the following one:
+
+
+
Into a rendered image like this:
+
+
+
Conceptually, this works rather intuitively.
+First, imagine the above scene, with an infinite fuchsia colored plane and a red Utah teapot hovering above that.
+Then, imagine a camera standing at 0,10,-50 (in cartesian coordinates) and aiming at the origin.
+Now, draw an imaginary rectangular 80x60 screen at a focus distance of 50 from the camera along its line of sight.
+To get a 2D picture, we shoot a ray from the camera through each “pixel” on the screen, note which object in the scene is hit (plane, teapot, background), and color the pixel accordingly.
+See PBRT Book if you feel like falling further into this particular rabbit hole (warning: it is very deep) (I apologize for “little square pixels” simplification I use throughout the post :-) ).
+
I won’t focus on specific algorithms to implement that (indeed, crt is a very naive tracer), but rather highlight Hard Mode Rust specific concerns.
Ultimately, the output of a ray tracer is a 2D buffer with 8-bit RGB pixels.
+One would typically represent it as follows:
+
+
+
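That is, something along these lines (a sketch; the field names are my guess):

```rust
#[derive(Clone, Copy)]
pub struct Color {
    pub r: u8,
    pub g: u8,
    pub b: u8,
}

pub struct Buf {
    dim: [u32; 2],
    // row-major, dim[0] * dim[1] pixels, owned by the buffer itself
    data: Box<[Color]>,
}
```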
For us, we want someone else (main) to allocate that box of colors for us, so instead we do the following:
+
+
+
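Roughly like this, reusing Color from above (again, field names are my guess):

```rust
pub struct Buf<'m> {
    dim: [u32; 2],
    // borrowed from the big slice of memory that `main` hands to the library
    data: &'m mut [Color],
}
```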
The 'm lifetime is the one we use for the abstract memory managed elsewhere.
+Note how the struct grew an extra lifetime!
+This is the extra price we have to pay for not relying on RAII to clean up resources for us:
+
+
+
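The shape of the problem, roughly (the actual fields of Ctx in crt differ; what matters is the lifetime structure):

```rust
pub struct Ctx<'a, 'm> {
    buf: &'a mut Buf<'m>,
    // ...plus whatever other borrowed state rendering needs
}
```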
Note in particular how the Ctx struct now has to include two lifetimes.
+This feels unnecessary: 'a is shorter than 'm.
+I wish it was possible to somehow abstract that away:
+
+
+
I don’t think that’s really possible (earlier post about this).
+In particular, the following would run into variance issues:
+
+
+
Ultimately, this is annoying, but not a deal breaker.
+
With this rgb::Buf<'_>, we can sketch the program:
Ray tracing is an embarrassingly parallel task — the color of each output pixel can be computed independently.
+Usually, the excellent rayon library is used to take advantage of parallelism, but for our raytracer I want to show a significantly simpler API design for taking advantage of many cores.
+I’ve seen this design in Sorbet, a type checker for Ruby.
+
Here’s how a render function with support for parallelism looks:
+
+
+
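A sketch reconstructed from the description that follows; the trait name and signatures are my guesses rather than crt's exact API, and the real function writes pixels instead of merely counting rows:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

pub trait ThreadPool {
    /// Runs `job` once on every worker thread and blocks until all of them return.
    fn in_parallel(&self, job: &(dyn Fn() + Sync));
}

pub fn render(pool: &dyn ThreadPool, n_rows: usize) {
    // A shared cursor: each thread repeatedly grabs the next unprocessed row.
    let next_row = AtomicUsize::new(0);
    pool.in_parallel(&|| loop {
        let row = next_row.fetch_add(1, Ordering::Relaxed);
        if row >= n_rows {
            break;
        }
        // ...trace every pixel of `row` here...
    });
}
```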
The interface here is the in_parallel function, which takes another function as an argument and runs it, in parallel, on all available threads.
+You typically use it like this:
+
+
+
This is similar to a typical threadpool, but different.
+Similar to a threadpool, there’s a number of threads (typically one per core) which execute arbitrary jobs.
+The first difference is that a typical threadpool sends a job to a single thread, while in this design the same job is broadcast to all threads.
+The job is Fn + Sync rather than FnOnce + Send.
+The second difference is that we block until the job is done on all threads, so we can borrow data from the stack.
+
+It’s on the caller to explicitly implement a concurrent queue to distribute specific work items.
+In my implementation, I slice the image into rows:
+
+
+
In main, we implement a concrete ThreadPool by spawning a thread per core:
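With just std and scoped threads, that can look something like this (my sketch, using the ThreadPool trait from above):

```rust
struct Threads {
    n_threads: usize, // e.g. from std::thread::available_parallelism()
}

impl ThreadPool for Threads {
    fn in_parallel(&self, job: &(dyn Fn() + Sync)) {
        std::thread::scope(|scope| {
            for _ in 0..self.n_threads {
                scope.spawn(|| job());
            }
            // leaving the scope joins all the spawned threads
        });
    }
}
```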
The scenes we are going to render are fundamentally dynamically sized.
+They can contain an arbitrary number of objects.
+So we can’t just statically allocate all the memory up-front.
+Instead, there’s a CLI argument which sets the amount of memory a ray tracer can use, and we should either manage with that, or return an error.
+So we do need to write our own allocator.
+But we’ll try very hard to only allocate the memory we actually need, so we won’t have to implement memory deallocation at all.
+So a simple bump allocator would do:
+
+
+
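In outline (the field name is mine):

```rust
pub struct Mem<'m> {
    raw: &'m mut [u8],
}

impl<'m> Mem<'m> {
    pub fn new(raw: &'m mut [u8]) -> Mem<'m> {
        Mem { raw }
    }
}
```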
We can create an allocator from a slice of bytes, and then ask it to allocate values and arrays.
+Schematically, alloc looks like this:
+
+
+
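Roughly like this; as noted right below, alignment (and graceful out-of-memory handling) is elided, so treat it as a schematic rather than the real thing:

```rust
impl<'m> Mem<'m> {
    pub fn alloc<T>(&mut self, value: T) -> &'m mut T {
        let size = core::mem::size_of::<T>();
        // Chop `size` bytes off the front of the region (panics if we run out).
        let raw = core::mem::take(&mut self.raw);
        let (chunk, rest) = raw.split_at_mut(size);
        self.raw = rest;
        // NB: a real version must also respect align_of::<T>() here.
        let ptr = chunk.as_mut_ptr().cast::<T>();
        unsafe {
            ptr.write(value);
            &mut *ptr
        }
    }
}
```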
To make this fully kosher we need to handle alignment as well, but I cut that bit out for brevity.
+
+For allocating arrays, it’s useful if the all-zeros bit pattern is a valid default instance of the type, as that allows skipping element-wise initialization.
+This condition isn’t easily expressible in today’s Rust though, so we require initializing every array member.
+
The result of an allocation is &'m T— this is how we spell Box<T> on hard mode.
The scene contains various objects, like spheres and planes:
+
+
+
Usually, we’d represent a scene as
+
+
+
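Something like this, with Sphere, Plane and Mesh standing for the shape types mentioned above (a sketch):

```rust
struct Scene {
    spheres: Vec<Sphere>,
    planes: Vec<Plane>,
    meshes: Vec<Mesh>,
}
```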
We could implement a resizable array (Vec), but doing that would require us to either leak memory, or to implement proper deallocation logic in our allocator, and add destructors to reliably trigger that.
+But destructors are exactly the thing we are trying to avoid in this exercise.
+So our scene will have to look like this instead:
+
+
+
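That is, roughly (again a sketch): sizes are fixed at allocation time and everything is borrowed from 'm.

```rust
struct Scene<'m> {
    spheres: &'m mut [Sphere],
    planes: &'m mut [Plane],
    meshes: &'m mut [Mesh<'m>],
}
```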
And that means we want to know the number of objects we’ll need upfront.
+The way we solve this problem is by doing two-pass parsing.
+In the first pass, we just count things, then we allocate them, then we actually parse them into allocated space.
+
+
+
If an error is encountered during parsing, we want to create a helpful error message.
+If the message is fully dynamic, we’d have to allocate it into 'm, but it seems simpler to just re-use bits of the input for the error message.
+Hence, Error<'i> is tied to the input lifetime 'i, rather than to the memory lifetime 'm.
One interesting type of object on the scene is a mesh of triangles (for example, the teapot is just a bunch of triangles).
+A naive way to represent a bunch of triangles is to use a vector:
+
+
+
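Something like this, with Vec3 being the usual three-component vector type (my sketch):

```rust
struct Mesh {
    // each triangle stores its three corners directly
    triangles: Vec<[Vec3; 3]>,
}
```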
This is wasteful: in a mesh, each edge is shared by two triangles.
+So a single vertex belongs to a bunch of triangles.
+If we store a vector of triangles, we are needlessly duplicating vertex data.
+A more compact representation is to store unique vertexes once, and to use indexes for sharing:
+
+
+
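For example (a sketch):

```rust
struct Mesh {
    vertices: Vec<Vec3>,
    // each triangle is three indices into `vertices`
    triangles: Vec<[u32; 3]>,
}
```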
Again, on hard mode that would be
+
+
+
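Which could look like this (my guess at the shape, not necessarily crt's exact fields):

```rust
struct Mesh<'m> {
    vertices: &'m mut [Vec3],
    triangles: &'m mut [[u32; 3]],
}
```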
+And a scene contains a bunch of meshes:
+
+
+
Note how, if the structure is recursive, we have “owned pointers” of &'m mut T<'m> shape.
+Originally I worried that that would cause problems with variance, but it seems to work fine for ownership specifically.
+During processing, you still need &'a mut T<'m> though.
+
And that’s why parsing functions hold an uncomfortable bunch of lifetimes:
+
+
+
The parser p holds &'i str input and a &'a mut Mem<'m> memory.
+It parses input into a &'b mut Mesh<'m>.
With Scene<'m> fully parsed, we can finally get to rendering the picture.
+A naive way to do this would be to iterate through each pixel, shooting a ray through it, and then do a nested iteration over every shape, looking for the closest intersection.
+That’s going to be slow!
+The teapot model contains about 1k triangles, and we have 640*480 pixels, which gives us 307_200_000 ray-triangle intersection tests, which is quite slow even with multithreading.
+
So we are going to speed this up.
+The idea is simple — just don’t intersect a ray with each triangle.
+It is possible to quickly discard batches of triangles.
+If we have a batch of triangles, we can draw a 3D box around them as a pre-processing step.
+Now if the ray doesn’t intersect the bounding box, we know that it can’t intersect any of the triangles.
+So we can use one test with a bounding box instead of many tests for each triangle.
+
This is of course one-sided — if the ray intersects the box, it might still miss all of the triangles.
+But, if we place bounding boxes smartly (small boxes which cover many adjacent triangles), we can hope to skip a lot of work.
+
We won’t go for really smart ways of doing that, and instead will use a simple divide-and-conquer scheme.
+Specifically, we’ll draw a large box around all triangles we have.
+Then, we’ll note which dimension of the resulting box is the longest.
+If, for example, the box is very tall, we’ll cut it in half horizontally, such that each half contains half of the triangles.
+Then, we’ll recursively subdivide the two halves.
+
In the end, we get a binary tree, where each node contains a bounding box and two children, whose bounding boxes are contained in the parent’s bounding box.
+Leaves contain triangles.
+This construction is called a bounding volume hierarchy, bvh.
+
To intersect the ray with bvh, we use a recursive procedure.
+Starting at the root node, we descend into children whose bounding boxes are intersected by the ray.
+Sometimes we’ll have to descend into both children, but often enough at least one child’s bounding box won’t touch the ray, allowing us to completely skip the subtree.
+
On easy mode Rust, we can code it like this:
+
+
+
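One possible easy-mode shape, with boxes for children and vectors for triangle indices (a sketch of mine, not crt's actual definitions):

```rust
struct BoundingBox {
    min: Vec3,
    max: Vec3,
}

enum Bvh {
    Split {
        bb: BoundingBox,
        children: Box<[Bvh; 2]>,
    },
    Leaf {
        bb: BoundingBox,
        // indices of the triangles covered by this leaf
        triangles: Vec<u32>,
    },
}
```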
On hard mode, we don’t really love all those separate boxes, we love arrays!
+So what we’d rather have is
+
+
+
So we want to write the following function which recursively constructs a bvh for a mesh:
+
+
+
The problem is, unlike the parser, we can’t cheaply determine the number of leaves and splits without actually building the whole tree.
So what we are going to do here is to allocate a pointer-tree structure into some scratch space, and then copy that into an &'m mut array.
+How do we find the scratch space?
+Our memory is &'m [u8].
+We allocate stuff from the start of the region.
+So we can split off some amount of scratch space from the end:
+
+
+
Stuff we allocate into the first half is allocated “permanently”.
+Stuff we allocate into the second half is allocated temporarily.
+When we drop temp buffer, we can reclaim all that space.
+
This… probably is the most sketchy part of the whole endeavor.
+It is unsafe, requires lifetime casting, and I actually can’t get it past miri.
+But it should be fine, right?
+
+So, I have the following API:
+
+
+
It can be used like this:
+
+
+
+And here’s how with_scratch is implemented:
+
+
+
With this infrastructure in place, we can finally implement bvh construction!
+We’ll do it in three steps:
+
+
+Split off half the memory into a scratch space.
+
+
+Build a dynamically-sized tree in that space, counting leaves and interior nodes.
+
+
+Allocate arrays of the right size in the permanent space, and copy data over once.
+
+
+
+
+
And that’s it!
+The thing actually works, miri complaints notwithstanding!
Actually, I am impressed.
+I was certain that this wouldn’t actually work out, and that I’d have to write copious amounts of unsafe to get the runtime behavior I want.
+Specifically, I believed that &'m mut T<'m> variance issue would force my hand to add 'm, 'mm, 'mmm and further lifetimes, but that didn’t happen.
+For “owning” pointers, &'m mut T<'m> turned out to work fine!
+It’s only during processing that you might need extra lifetimes.
+Parser<'m, 'i, 'a> is at least two lifetimes more than I am completely comfortable with, but I guess I can live with that.
+
I wonder how far this style of programming can be pushed.
+Aesthetically, I quite like that I can tell precisely how much memory the program would use!
A short post on how to create better troubleshooting documentation, prompted by me spending last evening trying to get the builtin display of my laptop working with Linux.
+
What finally fixed the blank screen for me was this advice from NixOS wiki:
+
+
+
While this particular approach worked, in contrast to a dozen different ones I tried before, I think it shares a very common flaw, which is endemic to troubleshooting documentation.
+Can you spot it?
+
The advice tells you the remedy (“add this kernel parameter”), but it doesn’t explain how to verify that this indeed is the problem.
+That is, if the potential problem is a kernel driver that isn’t loaded, it would really help me to know how to check which kernel driver is in use, so that I can do both:
+
+
+Before adding the parameter, check that 46a6 doesn’t have a driver
+
+
+After the fix, verify that i915 is indeed used.
+
+
+
If a “fix” doesn’t come with a linked “diagnostic”, a very common outcome is:
+
+
+Apply some random fix from the Internet
+
+
+Observe that the final problem (blank screen) isn’t fixed
+
+
+Wonder which of the two is the case:
+
+
+the fix is not relevant for the problem,
+
+
+the fix is relevant, but is applied wrong.
+
+
+
+
+
So, call to action: if you are writing any kind of documentation, before explaining how to fix the problem, teach the user how to diagnose it.
+
When helping with git, start with explaining git log and git status, not with git reset or git reflog.
+
+
While the post might come across as just a tiny bit angry, I want to explicitly mention that I am eternally grateful to all the people who write any kind of docs for using Linux on desktop.
+I’ve been running it for more than 10 years at this point, and I am still completely clueless as to how debug issues from the first principles.
+If not for all of the wikis, stackoverflows and random forum posts out there, I wouldn’t be able to use the OS, so thank you all!
This post contains some inconclusive musing on lightweight markup languages (Markdown, AsciiDoc, LaTeX, reStructuredText, etc).
+The overall mood is that I don’t think a genuinely great markup language exists.
+I wish it did though.
+As an appropriate disclosure, this text is written in AsciiDoctor.
+
EDIT: if you like this post, you should definitely check out https://djot.net.
+
EDIT: welp, that escalated quickly, this post is now written in Djot.
This I think is the big one.
+Very often, a particular markup language is married to a particular output format, either syntactically (markdown supports HTML syntax), or by the processor just not making a crisp enough distinction between the input document and the output (AsciiDoctor).
+
Roughly, if the markup language is for emitting HTML, or PDF, or DocBook XML, that’s bad.
+A good markup language describes an abstract hierarchical structure of the document, and lets a separate program adapt that structure to the desired output.
+
More or less, what I want from markup is to convert a text string into a document tree:
+
+
+
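In Rust-ish terms, the target data structure is something like this (a sketch; the tag and attribute names deliberately carry no output-format semantics):

```rust
struct Document {
    children: Vec<Node>,
}

enum Node {
    Text(String),
    Element {
        tag: String,
        attrs: Vec<(String, String)>,
        children: Vec<Node>,
    },
}

// and the markup processor is, conceptually, `fn parse(text: &str) -> Document`
```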
A markup language which nails this perfectly is HTML.
+It directly expresses this tree structure.
+Various viewers for HTML can then render the document in a particular fashion.
+HTML’s syntax itself doesn’t really care about tag names and semantics: you can imagine authoring HTML documents using an alternative set of tag names.
+
A markup language which completely falls over here is Markdown.
+There’s no way to express generic tree structure, conversion to HTML with specific browser tags is hard-coded.
+
A language which does this half-well is AsciiDoctor.
+
In AsciiDoctor, it is possible to express genuine nesting.
+Here’s a bunch of nested blocks with some inline content and attributes:
+
+
+
The problem with AsciiDoctor is that generic blocks come off as a bit of an implementation detail, not as the foundation.
+It is difficult to untangle presentation-specific semantics of particular blocks (examples, admonitions, etc) from the generic document structure.
+As a fun consequence, a semantic-neutral block (equivalent of a </div>) is the only kind of block which can’t actually nest in AsciiDoctor, due to syntactic ambiguity.
Syntax matters.
+For lightweight text markup languages, syntax is of utmost importance.
+
The only right way to spell a list is
+
+
+
Not
+
+
+
And most definitely not
+
+
+
Similarly, you lose if you spell links like this:
+
+
+
Markdown is the trailblazer here, it picked a lot of great concrete syntaxes.
+Though, some choices are questionable, like trailing double space rule, or the syntax for including images.
+
AsciiDoctor is the treasure trove of tasteful syntactic decisions.
For example http://example.com[] gets parsed as a link to http://example.com, and the converter knows basic url schemes.
+And of course there’s a generic link syntax for corner cases where a URL syntax isn’t a valid AsciiDoctor syntax:
+
+
+
(image: produces an inline element, while image:: emits a block. Again, this isn’t hard-coded to images, it is a generic syntax for whatever::).
To convert our nice, sweet syntax into a general tree and then into the final output, we need some kind of a tool.
+One way to do that is by direct translation from our source document to, eg, html.
+
Such one-step translation is convenient for all-inclusive tools, but is a barrier for extensibility.
+Amusingly, AsciiDoctor is both a positive and a negative example here.
+
On the negative side of things, classical AsciiDoctor is an extensible Ruby processor.
+To extend it, you essentially write a “compiler plugin” — a bit of Ruby code which hooks into the main processor and gets invoked as a callback when certain “tags” are parsed.
+This plugin interacts with the Ruby API of the processor itself, and is tied to a particular toolchain.
+
In contrast, asciidoctor-web-pdf, a newer thing (which nonetheless uses the same Ruby core), approaches the task a bit differently.
+There’s no API to extend the processor itself.
+Rather, the processor produces an abstract document tree, and then a user-supplied JavaScript function can convert that piece of data into whatever html it needs, by following a lightweight visitor pattern.
+I think this is the key to a rich ecosystem: strictly separate converting input text to an abstract document model from rendering the model through some template.
+The two parts could be done by two separate processes which exchange serialized data.
+It’s even possible to imagine some canonical JSON encoding of the parsed document.
+
+There’s one more behavior where the all-inclusive approach of AsciiDoctor gets in the way of doing the right thing.
+AsciiDoctor supports includes, and they are textual, preprocessor includes, meaning that syntax of the included file affects what follows afterwards.
+A much cleaner solution would have been to keep includes in the document tree as distinct nodes (with the path to the included file as an attribute), and leave it to the output layer to interpret those as either verbatim text, or subdocuments.
+
Another aspect of composability is that the parsing part of the processing should have, at minimum, a lightweight, embeddable implementation.
+Ideally, of course, there’s a spec and an array of implementations to choose from.
+
+Markdown fares fairly well here: there never was a shortage of implementations, and today we even have a bunch of different specs!
+
AsciiDoctor…
+Well, I am amazed.
+The original implementation of AsciiDoc was in Python.
+AsciiDoctor, the current tool, is in Ruby.
+Neither is too embeddable.
+But! AsciiDoctor folks are crazy, they compiled Ruby to JavaScript (and Java), and so the toolchain is available on JVM and Node.
+At least for Node, I can confidently say that that’s a real production-ready thing which is quite convenient to use!
+Still, I’d prefer a Rust library or a small WebAssembly blob instead.
+
A different aspect of composability is extensibility.
+In Markdown land, when Markdown doesn’t quite do everything needed (i.e., in 90% of cases), the usual answer is to extend the concrete syntax.
+This is quite unfortunate, changing syntax is hard.
+A much better avenue I think is to take advantage of the generic tree structure, and extend the output layer instead.
+Tree-with-attributes should be enough to express whatever structure is needed, and then it’s up to the converter to pattern-match this structure and emit its special thing.
+
Do you remember the fancy two-column rendering above with source-code on the left, and rendered document on the right?
+This is how I’ve done it:
+
+
+
That is, a generic block, with .two-col attribute and two children — a listing block and a list.
+Then there’s a separate css which assigns an appropriate flexbox layout for .two-col elements.
+There’s no need for special “two column layout” extension.
+It would be perhaps nice to have a dedicated syntax here, but just re-using generic -- block is quite ok!
Not quite there, I would think!
+AsciiDoctor at least half-ticks quite a few of the checkboxes, but it is still not perfect.
+
There is a specification in progress, I have high hopes that it’ll spur alternative implementations (and most of AsciiDoctor problems are implementation issues).
+At the same time, I am not overly-optimistic.
+The overriding goal for AsciiDoctor is compatibility, and rightfully so.
+There’s a lot of content already written, and I would hate to migrate this blog, for example :)
+
At the same time, there are quite a few rough edges in AsciiDoctor:
+
+
+includes
+
+
+non-nestable generic blocks
+
+
+many ways to do certain things (AsciiDoctor essentially supports the union of Markdown and AsciiDoc concrete syntaxes)
+
+
+lack of some concrete sugar (reference-style links are notably better in Markdown)
+
+
+
It feels like there’s a smaller, simpler language somewhere (no, I will not link that xkcd for once (though xkcd:927[] would be a nice use of AsciiDoctor extensibility))
+
On the positive side of things, it seems that in the recent years we built a lot of infrastructure to make these kinds of projects more feasible.
+
Rust is just about the perfect language to take a String from a user and parse it into some sort of a tree, while packaging the whole thing into a self-contained zero-dependency, highly
+embeddable, reliable, and reusable library.
+
WebAssembly greatly extends reusability of low-level libraries: between a static library with a C ABI, and a .wasm module, you got all important platforms covered.
+
True extensibility fundamentally requires taking code as input data.
+A converter from a great markup language to HTML should accept some user-written script file as an argument, to do fine tweaking of the conversion process.
+WebAssembly can be a part of the solution, it is a toolchain-neutral way of expressing computation.
+But we have something even more appropriate.
+Deno, with its friendly scripting language, nice template literals, and a capabilities-based security model, is just about the perfect runtime for implementing a static site generator which takes a bunch of input documents and a custom conversion script, and outputs a bunch of HTML files.
+
If I didn’t have anything else to do, I’d certainly be writing my own lightweight markup language today!
The genre of this post is: “I am having opinions on something I am not an expert at, so hopefully the Internet would correct me”.
+
The specific question in question is:
+
+
+
I am not a web developer, but I do have a blog where I write CSS myself, and I very much want to do the right thing.
+I was researching and agonizing over this question for years, as I wasn’t able to find a conclusive argument one way or another.
+So I am writing one.
+
This isn’t ideal, but I am lazy, so this post assumes that you already did the research and understand the mechanics of and the difference between px, em, and rem.
+And so, you position is probably:
+
+
+
Although there are buts:
+
But the default font-size is 16px, and that’s just too small.
+If you just roll with the intended defaults, then the text will be painful to read even for folks with great vision!
+
But default font-size of x pixels just doesn’t make sense: the actual perceived font size very much depends on the font itself.
+At 16px, some fonts will be small, some tiny, and some maybe even just about right.
+
But the recommended way to actually use rem boils down to setting a percentage font-size for the root element, such that 1rem is not the intended “font size of the root element”, but is equal to 1px (under default settings).
+Which, at this point, sounds like using pixels, just with more steps?
+After all, the modern browsers can zoom the pixels just fine?
+
So, yeah, lingering doubts…
+If you are like me, you painstakingly used rem’s everywhere, and then html { font-size: 22px } because default is unusable, and percentage of default is stupidly ugly :-)
+
+
So lets settle the question then.
+
The practical data we want is what do the users actually do in practice?
+Do they zoom or do they change default font size?
+I have spent 10 minutes googling that, didn’t find the answer.
+
After that, I decided to just check how it actually works.
+So, I opened browser’s settings, cranked the font size to the max, and opened Google.
+
To be honest, that was the moment where the question was mentally settled for me.
+If Google’s search page doesn’t respect the user-agent’s default font-size, that is indirect, but also very strong, evidence that that’s not a meaningful thing to do.
+
The result of my ad-hoc survey:
+
+
+
Don’t care:
+
+
+
+Google
+
+
+Lobsters
+
+
+Hackernews
+
+
+Substack
+
+
+antirez.com
+
+
+tonsky.me
+
+
+New Reddit
+
+
+
+
+
+
+
+
Embiggen:
+
+
+
+Wikipedia
+
+
+Discourse
+
+
+Old Reddit
+
+
+
+
+
+
Google versus Wikipedia it is, eh?
+But this is actually quite informative: if you adjust your browser’s default font-size, you are in an “Alice in the Wonderland” version of the web which alternates between too large and too small.
+
The next useful question is: what about mobile?
+After some testing and googling, it seems that changing browser’s default font-size is just not possible on the iPhone?
+That the only option is page zoom?
+
Again, I don’t actually have the data on whether users rely on zoom or on font size.
+But so far it looks like the user doesn’t really have a choice?
+Only zoom seems to actually work in practice?
+
The final bit of evidence which completely settled the question in my mind comes from this post:
My reading of the above text: it’s on me, as an author, to ensure that my readers can scale the content using whatever method their user agent employs.
+If the UA can zoom, that’s perfect, we are done.
+
If the reader’s actual UA can’t zoom, but it can change default font size (eg, IE 6), then I need to support that.
+
That’s … most reasonable I guess?
+Just make sure that your actual users, in their actual use, can read stuff.
+And I am pretty sure my target audience doesn’t use IE 6, which I don’t support anyway.
+
TL;DR for the whole post:
+
Use pixels.
+The goal is not to check the “I suffered pain to make my website accessible” checkbox, the goal is to make the site accessible to real users.
+There’s an explicit guideline about that.
+There’s a strong evidence that, barring highly unusual circumstances, real users zoom, and pixels zoom just fine.
+
+
As a nice bonus, if you don’t use rem, you make browser’s font size setting more useful, because it can control the scale of the browser’s own chrome (which is fixed) independently from the scale of websites (which vary).
+
+
+
+
+
+
+
If a Tree Falls in a Forest, Does It Overflow the Stack?
A well-known pitfall when implementing a linked list in Rust is that the default recursive drop implementation causes a stack overflow for long lists.
+A similar problem exists for tree data structures as well.
+This post describes a couple of possible solutions for trees.
+This is a rather esoteric problem, so the article is denser than is appropriate for a tutorial.
+
Let’s start with our beloved linked list:
+
+
+
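For concreteness, a list along these lines (my sketch of the usual definition):

```rust
struct Node<T> {
    value: T,
    next: Option<Box<Node<T>>>,
}

type List<T> = Option<Box<Node<T>>>;
```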
It’s easy to cause this code to crash:
+
+
+
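A long enough list is all it takes, for example:

```rust
fn main() {
    let mut list: List<u64> = None;
    for i in 0..10_000_000 {
        list = Some(Box::new(Node { value: i, next: list }));
    }
    // Dropping the head drops the next node, which drops the next node, ...
    // one stack frame per element: stack overflow.
    drop(list);
}
```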
The crash happens in the automatically generated recursive drop function.
+The fix is to write drop manually, in a non-recursive way:
+
+
+
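One way to write it (a sketch):

```rust
impl<T> Drop for Node<T> {
    fn drop(&mut self) {
        // Detach the tail and walk it iteratively; every node dropped inside the
        // loop already has `next == None`, so its own drop doesn't recurse.
        let mut next = self.next.take();
        while let Some(mut node) = next {
            next = node.next.take();
        }
    }
}
```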
What about trees?
+
+
+
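Say, a binary tree like this (again my sketch):

```rust
struct Tree<T> {
    value: T,
    left: Option<Box<Tree<T>>>,
    right: Option<Box<Tree<T>>>,
}
```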
If the tree is guaranteed to be balanced, the automatically generated drop is actually fine, because the height of the tree will be logarithmic.
+If the tree is unbalanced though, the same stack overflow might happen.
+
Let’s write an iterative Drop to fix this.
+The problem though is that the “swap with self” trick we used for the list doesn’t work, as we have two children to recur into.
+The standard solution would be to replace the call stack with an explicit vector of work items:
+
+
+
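For example (my take on it):

```rust
impl<T> Drop for Tree<T> {
    fn drop(&mut self) {
        // An explicit stack of subtrees to free, instead of call-stack recursion.
        let mut work: Vec<Box<Tree<T>>> = Vec::new();
        work.extend(self.left.take());
        work.extend(self.right.take());
        while let Some(mut node) = work.pop() {
            work.extend(node.left.take());
            work.extend(node.right.take());
            // `node` is dropped here with both children already detached.
        }
    }
}
```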
This works, but also makes my internal C programmer scream: we allocate a vector to free memory!
+Can we do better?
+
+One approach would be to build on the balanced-trees observation.
+If we recur into the shorter branch, and iteratively drop the longer one, we should be fine:
+
+
+
This requires maintaining the depths though.
+Can we make do without?
+My C instinct (not that I wrote any substantial amount of C though) would be to go down the tree, and stash the parent links into the nodes themselves.
+And we actually can do something like that:
+
+
+If the current node has only a single child, we can descend into the node
+
+
+If there are two children, we can rotate the tree. If we always rotate into a
+single direction, eventually we’ll get into the single-child situation.
+
+
+
Here’s how a single rotation could look:
+
+
+
Or, in code,
+
+
+
Ok, what if we have an n-ary tree?
+
+
+
I think the same approach works: we can treat the first child as left, and the last child as right, and do essentially the same rotations.
+Though, we will rotate in the other direction (as removing the right child is cheaper), and we’ll also check that we have at least two grandchildren (to avoid allocation when pushing to an empty vector).
+
Which gives something like this:
+
+
+
I am not sure this works, and I am not sure this works in linear time, but I am fairly certain that something like this could be made to work if need be.
+
Though, practically, if something like this is a concern, you probably want to re-design the tree structure to be something like this instead:
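Perhaps something along these lines: an index-based tree where all nodes live in one flat vector, so dropping the whole thing is just dropping two Vecs, with no recursion at all (my sketch):

```rust
struct Tree<T> {
    nodes: Vec<Node<T>>,
}

struct Node<T> {
    value: T,
    // indices into `Tree::nodes` instead of owned boxes
    children: Vec<u32>,
}
```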
Ray or path tracing is an algorithm for getting a 2D picture out of a 3D virtual scene, by simulating a trajectory of a particle of light which hits the camera.
+It’s one of the fundamental techniques of computer graphics, but that’s not why it is the topic for today’s blog post.
+Implementing a toy ray tracer is one of the best exercises for learning a particular programming language (and a great deal about software architecture in general as well), and that’s the “why?” for this text.
+My goal here is to teach you to learn new programming languages better, by giving a particularly good exercise for that.
Learning a programming language consists of learning the theory (knowledge) and the set of tricks to actually make computer do things (skills).
+For me, the best way to learn skills is to practice them.
+Ray tracer is an exceptionally good practice dummy, because:
+
+
+It is a project of an appropriate scale: a couple of weekends.
+
+
+It is a project with a flexible scale — if you get carried away, you can sink a lot of weekends before you hit diminishing returns on effort.
+
+
+Ray tracer can make use of a lot of aspects of the language — modules, static and runtime polymorphism, parallelism, operator overloading, IO, string parsing, performance optimization, custom data structures.
+Really, I think the project doesn’t touch only a couple of big things, namely networking and evented programming.
+
+
+It is a very visual and feedback-friendly project — a bug is not some constraint violation deep in the guts of the database, it’s a picture upside-down!
+
+
+
I want to stress once again that here I view ray tracer as a learning exercise.
+We aren’t going to draw any beautiful photorealistic pictures here, we’ll settle for ugly things with artifacts.
+
Eg, this “beauty” is the final result of my last exercise:
+
+
+
And, to maximize learning, I think its better to do everything yourself from scratch.
+A crappy teapot which you did from the first principles is full to the brim with knowledge, while a beautiful landscape which you got by following step-by-step instructions is hollow.
+
And that’s the gist of the post: I’ll try to teach you as little about ray tracing as possible, to give you just enough clues to get some pixels to the screen.
+To be more poetic, you’ll draw the rest of the proverbial owl.
+
+This is in contrast to Ray Tracing in One Weekend which does a splendid job teaching ray tracing, but contains way too many spoilers if you want to learn software architecture (rather than graphics programming).
+In particular, it contains snippets of code.
+We won’t see that here — as a corollary, all the code you’ll write is fully your invention!
+
Sadly, there’s one caveat to the plan: as the fundamental task is tracing a ray as it gets reflected through the 3D scene, we’ll need a hefty amount of math.
+Not an insurmountable amount — everything is going to be pretty visual and logical.
+But still, we’ll need some of the more advanced stuff, such as vectors and cross product.
+
If you are very comfortable with that, you can approach the math parts the same way as the programming parts — grab a pencil and a stack of paper and try to work out formulas yourself.
+If solving math puzzlers is not your cup of tea, feel absolutely free to just look up formulas online.
+https://avikdas.com/build-your-own-raytracer is a great resource for that.
+If, however, linear algebra is your worst nightmare, you might want to look for a more step-by-step tutorial (or maybe pick a different problem altogether! Another good exercise is a small chat server, for example).
So, what exactly is ray tracing?
+Imagine a 3D scene with different kinds of objects: an infinite plane, a sphere, a bunch of small triangles which resemble a teapot from afar.
+The scene is illuminated by some distant light source, and so objects cast shadows and reflect each other.
+We observe the scene from a particular view point.
+Roughly, a ray of light is emitted by a light source, bounces off scene objects and eventually, if it gets into our eye, we perceive a sensation of color, which is mixed from the light’s original color, as well as the colors of all the objects the ray reflected from.
+
Now, we are going to crudely simplify the picture.
+Rather than casting rays from the light source, we’ll cast rays from the point of view.
+Whatever is intersected by the ray will be painted as a pixel in the resulting image.
The ultimate result of our ray tracer is an image.
+A straightforward way to represent an image is to use a 2D grid of pixels, where each pixel is a “red, green, blue” triple whose color values vary from 0 to 255.
+How do we display the image?
+One can reach out for graphics libraries like OpenGL, or image formats like BMP or PNG.
+
But, in the spirit of simplifying the problem so that we can do everything ourselves, we will simplify the problem!
+As a first step, we’ll display image as text in the terminal.
+That is, we’ll print . for “white” pixels and x for “black” pixels.
+
So, as the very first step, let’s write some code to display such image by just printing it.
+A good example image would be 64 by 48 pixels, with a 5 pixel large circle in the center.
+And here’s the first encounter with math: to do this, we want to iterate over all (x, y) pixels and fill them in if they are inside the circle.
+It’s useful to recall the equation of a circle centered at the origin: x^2 + y^2 = r^2, where r is the radius.
+
🎉 we got hello-world working!
+Now, let’s go for more image-y images.
+We can roll our own “real” format like BMP (I think that one is comparatively simple), but there’s a cheat code here.
+There are text-based image formats!
+In particular, PPM is the one especially convenient.
+Wikipedia Article should be enough to write our own impl.
+I suggest using P3 variation, but P6 is also nice if you want something less offensively inefficient.
+
So, rewrite your image outputting code to produce a .ppm file, and also make sure that you have an image viewer that can actually display it.
+Spend some time viewing your circle in its colorful glory (can you color it with a gradient?).
+
If you made it this far, I think you understand the spirit of the exercise — you’ve just implemented an encoder for a real image format, using nothing but a Wikipedia article.
+It might not be the fastest encoder out there, but it’s the thing you did yourself.
+You probably want to encapsulate it in a module or something, and do a nice API over it.
+Go for it! Experiment with various abstractions in the language.
Now that we can display stuff, let’s do an absolutely basic ray tracer.
+We’ll use a very simple scene: just a single sphere with the camera looking directly at it.
+And we’ll use a trivial ray tracing algorithm: shoot the ray from the camera, if it hit the sphere, paint black, else, paint white.
+If you do this as a mental experiment, you’ll realize that the end result is going to be exactly what we’ve got so far: a picture with a circle in it.
+Except now, it’s going to be in 3D!
+
This is going to be the most annoying part, as there are a lot of fiddly details to get this right, while the result is, ahem, underwhelming.
+Let’s do this though.
+
First, the sphere.
+For simplicity, let’s assume that its center is at the origin and it has radius 5, and so its equation is
+
+
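x^2 + y^2 + z^2 = 5^2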
+
Or, in vector form:
+
+
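v̅ ⋅ v̅ = r^2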
+
Here, v̅ is a point on a sphere (an (x, y, z) vector) and ⋅ is the dot product.
+As a bit of foreshadowing, if you are brave enough to take a stab at deriving various formulas, keeping to vector notation might be simpler.
+
Now, let’s place the camera.
+It is convenient to orient axes such that Y points up, X points to the right, and Z points at the viewer (ie, Z is depth).
+So let’s say that camera is at (0, 0, -20) and it looks at (0, 0, 0) (so, directly at the sphere’s center).
+
Now, the fiddly bit.
+It’s somewhat obvious how to cast a ray from the camera. If camera’s position is C̅, and we cast the ray in the direction d̅, then the equation of points on the ray is
+
+
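C̅ + t d̅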
+
where t is a scalar parameter.
+Or, in the cartesian form,
+
+
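(x, y, z) = (Cx + t dx, Cy + t dy, Cz + t dz)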
+
where (dx, dy, dz) is the direction vector for a particular ray.
+For example, for a ray which goes straight to the center of the sphere, that would be (0, 0, 1).
+
What is not obvious is how do we pick direction d?
+We’ll figure that out later.
+For now, assume that we have some magical box, which, given (x, y) position of the pixel in the image, gives us the (dx, dy, dz) of the corresponding ray.
+With that, we can use the following algorithm:
+
Iterate through all (x, y) pixels of our 64x48 image.
+From the (x, y) of each pixel, compute the corresponding ray’s (dx, dy, dz).
+Check if the ray intersects the sphere.
+If it does, paint the (x, y) pixel black.
+
To check for intersection, we can plug the ray equation, C̅ + t d̅, into the sphere equation, v̅ ⋅ v̅ = r^2.
+That is, we can substitute C̅ + t d̅ for v̅.
+As C̅, d̅ and r are specific numbers, the resulting equation would have only a single variable, t, and we could solve for that.
+For details, either apply pencil and paper, or look up “ray sphere intersection”.
+
But how do we find d̅ for each pixel?
+To do that, we actually need to add the screen to the scene.
+Our image is a 64x48 rectangle.
+So let’s place that between the camera and the sphere.
+
We have camera at (0, 0, -20) our rectangular screen at, say, (0, 0, -10) and a sphere at (0, 0, 0).
+Now, each pixel in our 2D image has a corresponding point in our 3D scene, and we’ll cast the ray from camera’s position through this point.
+
The full list of parameters to define the scene is:
+
+
+
Focal distance is the distance from the camera to the screen.
+If we know the direction the camera is looking along and the focal distance, we can calculate the position of the center of the screen, but that’s not enough.
+The screen can rotate, as we haven’t fixed which side is up, so we need an extra parameter for that.
+We also add a parameter for the direction to the right for convenience, though it’s possible to derive “right” from the “up” and “forward” directions.
+
Given this set of parameters, how do we calculate the ray corresponding to, say, (10, 20) pixel?
+Well, I’ll leave that up to you, but one hint I’ll give is that you can calculate the middle of the screen (camera position + view direction × focal distance).
+If you have the middle of the screen, you can get to (x, y) pixel by stepping x steps up (and we know up!) and y steps right (and we know right!).
+Once we know the coordinates of the point of the screen through which the ray shoots, we can compute ray’s direction as the difference between that point and camera’s origin.
+
Again, this is super fiddly and frustrating!
+My suggestion would be:
+
+
+Draw some illustrations to understand relation between camera, screen, sphere, and rays.
+
+
+Try to write the code which, given (x, y) position of the pixel in the image, gives (dx, dy, dz) coordinates of the direction of the ray from the camera through the pixel.
+
Coding wise, we obviously want to introduce some machinery here.
+The basic unit we need is a 3D vector — a triple of three real numbers (x, y, z).
+It should support all the expected operations — addition, subtraction, multiplication by scalar, dot product, etc.
+If your language supports operator overloading, you might look that up now.
+Is it a good idea to overload operator for dot product?
+You won’t know unless you try!
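For reference, a minimal Rust version of such a vector type might look like this (the dot product is a plain method here; whether an overloaded operator reads better is exactly the experiment worth running):

```rust
use std::ops::{Add, Mul, Sub};

#[derive(Clone, Copy, Debug, PartialEq)]
struct Vec3 {
    x: f64,
    y: f64,
    z: f64,
}

impl Vec3 {
    fn new(x: f64, y: f64, z: f64) -> Vec3 { Vec3 { x, y, z } }
    fn dot(self, other: Vec3) -> f64 { self.x * other.x + self.y * other.y + self.z * other.z }
    fn length(self) -> f64 { self.dot(self).sqrt() }
    fn normalized(self) -> Vec3 { self * (1.0 / self.length()) }
}

impl Add for Vec3 {
    type Output = Vec3;
    fn add(self, o: Vec3) -> Vec3 { Vec3::new(self.x + o.x, self.y + o.y, self.z + o.z) }
}

impl Sub for Vec3 {
    type Output = Vec3;
    fn sub(self, o: Vec3) -> Vec3 { Vec3::new(self.x - o.x, self.y - o.y, self.z - o.z) }
}

// Multiplication by a scalar.
impl Mul<f64> for Vec3 {
    type Output = Vec3;
    fn mul(self, k: f64) -> Vec3 { Vec3::new(self.x * k, self.y * k, self.z * k) }
}
```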
+
We also need something to hold the info about sphere, camera and the screen and to do the ray casting.
+
If everything works, you should get a familiar image of the circle.
+But it’s now powered by a real ray tracer and it’s real, honest-to-god 3D, even if it doesn’t look like it!
+Indeed, with ray casting and ray-sphere intersection code, all the essential aspects are in place; from now on, everything else is just bells and whistles.
Ok, now that we can see one sphere, let’s add the second one.
+We need to solve two subproblems for this to make sense.
+First, we need to parameterize our single sphere with the color (so that the second one looks different once we add it).
+Second, we should no longer hard-code (0, 0, 0) as a center of the sphere, and make that a parameter, adjusting the formulas accordingly.
+This is a good place to debug the code.
+If you think you moved the sphere up, does it actually move up in the image?
+
Now, the second sphere can be added with different radius, position and color.
+The ray casting code now needs to be adjusted to say which sphere intersected the ray.
+Additionally, it needs to handle the case where the ray intersects both spheres and figure out which one is closer.
+
With this machinery in hand, we can now create some true 3D scenes.
+If one sphere is fully in front of the other, that’s just concentric circles.
+But if the spheres intersect, the picture is somewhat more interesting.
The next step is going to be comparatively easy implementation wise, but it will fill our spheres with vibrant colors and make them spring out in their full 3D glory.
+We will add light to the scene.
+
Light source will be parameterized by two values:
+
+
+Position of the light source.
+
+
+Color and intensity of light.
+
+
+
For the latter, we can use a vector with three components (red, green, blue), where each component varies from 0.0 (no light) to 1.0 (maximally bright light).
+We can use a similar vector to describe a color of the object.
+Now, when the light hits the object, the resulting color would be a componentwise product of the light’s color and the object’s color.
+
Another contributor is the direction of light.
+If the light falls straight at the object, it seems bright.
+If the light falls obliquely, it is more dull.
+
Let’s get more specific:
+
+
+P̅ is a point on our sphere where the light falls.
+
+
+N̅ is the normal vector at P̅.
+That is, it’s a vector with length 1, which is locally perpendicular to the surface at P̅
+
+
+L̅ is the position of the light source
+
+
+R̅ is a vector of length one from P̅ to L̅: R̅ = (L̅ - P̅) / |L̅ - P̅|
+
+
+
Then, R̅ ⋅ N̅ gives us this “is the light falling straight at the surface?” coefficient between 0 and 1 (if the dot product comes out negative, the light is behind the surface, and we clamp the coefficient to 0).
+Dot product between two unit vectors measures how similar their direction is (it is 0 for perpendicular vectors, and 1 for collinear ones).
+So “does the light fall straight at the surface” is the same question as “is the direction to the light collinear with the normal”, which is exactly what the dot product measures.
+
The final color will be the componentwise product of the light’s color and the sphere’s color, multiplied by this attenuating coefficient.
+Putting it all together:
+
For each pixel (x, y) we cast a C̅ + t d̅ ray through it.
+If the ray hits the sphere, we calculate the point P̅ where it happens, as well as the sphere’s normal at P̅.
+For a sphere, the normal is the vector from the sphere’s center to P̅, scaled to length one.
+Then we cast a ray from P̅ to the light source L̅.
+If this ray hits the other sphere, the point is occluded and the pixel remains dark.
+Otherwise, we compute the color using the angle between the normal and the direction to the light.
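A sketch of the shading part of that logic, reusing the Vec3 type from the earlier sketch (a color here is also just a Vec3 with components in 0.0..=1.0):

```rust
// Diffuse shading at a point on the sphere: componentwise product of the
// light's color and the object's color, attenuated by how directly the light
// falls on the surface.
fn diffuse_color(point: Vec3, normal: Vec3, light_pos: Vec3, light_color: Vec3, object_color: Vec3) -> Vec3 {
    let to_light = (light_pos - point).normalized();
    // 1.0 when the light falls straight on the surface, 0.0 when it merely
    // grazes it (or comes from behind).
    let attenuation = normal.dot(to_light).max(0.0);
    Vec3::new(
        light_color.x * object_color.x,
        light_color.y * object_color.y,
        light_color.z * object_color.z,
    ) * attenuation
}
```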
+
With this logic in place, the picture now should display two 3D-looking spheres, rather than a pair of circles.
+In particular, our spheres now cast shadows!
+
What we implemented here is a part of Phong reflection model, specifically, the diffuse part.
+Extending the code to include ambient and specular parts is a good way to get some nicer looking pictures!
At this point, we accumulated quite a few parameters: camera config, positions of the spheres, their colors, light sources (you totally can have many of them!).
+Specifying all those things as constants in the code makes experimentation hard, so a next logical step is to devise some kind of textual format which describes the scene.
+That way, our ray tracer reads a textual scene description as an input, and renders a .ppm as an output.
+
One obvious choice is to use JSON, though it’s not too convenient to edit by hand, and bringing in a JSON parser is contrary to our “do it yourself” approach.
+So I would suggest designing your own small language to specify the scene.
+You might want to take a look at https://kdl.dev for inspiration.
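Purely as an illustration, a hand-rolled scene description can be a line-oriented format along these lines (the keywords and layout are entirely made up; the only real requirement is that it stays trivial to parse):

```
camera   position 0 0 -20   look_at 0 0 0   focal 10
sphere   center 0 0 0       radius 5        color 1.0 0.2 0.2
sphere   center 8 0 3       radius 2        color 0.2 0.2 1.0
light    position -10 10 -10                color 1.0 1.0 1.0
```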
+
Note how the program grows bigger — there are now distinctive parts for input parsing, output formatting, rendering per-se, as well as the underlying nascent 3D geometry library.
+As usual, if you feel like organizing all that somewhat better, go for it!
So far, we’ve only rendered spheres.
+There’s a huge variety of other shapes we can add, and it makes sense to tackle at least a couple.
+A good candidate is a plane.
+To specify a plane, we need a normal, and a point on a plane.
+For example, N̅ ⋅ v̅ = 0 is the equation of the plane which goes through the origin and is orthogonal to N̅.
+We can plug our ray equation instead of v̅ and solve for t as usual.
+
The second shape to add is a triangle.
+A triangle can be naturally specified using its three vertexes.
+One of the more advanced math exercises would be to derive a formula for ray-triangle intersection.
+As usual, math isn’t the point of the exercise, so feel free to just look that up!
+
With spheres, planes and triangles which are all shapes, there clearly is some amount of polymorphism going on!
+You might want to play with various ways to best express that in your language of choice!
Triangles are interesting, because there are a lot of existing 3D models specified as a bunch of triangles.
+If you download such a model and put it into the scene, you can render somewhat impressive images.
+
There are many formats for storing 3D meshes, but for our purposes .obj files are the best.
+Again, this is a plain text format which you can parse by hand.
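For the first pass, the relevant subset of the format is tiny: v lines are vertex positions, and f lines are faces with 1-based indices into the vertex list (possibly written as i/j/k). A hedged sketch of a reader for just that subset, assuming the mesh is already triangulated:

```rust
// A minimal .obj reader: only `v` (vertex position) and `f` (face) records.
// Face indices in .obj are 1-based and may look like `7//3` or `7/2/3`;
// we only keep the first (position) index for now.
fn parse_obj(text: &str) -> (Vec<[f64; 3]>, Vec<[usize; 3]>) {
    let mut vertices = Vec::new();
    let mut triangles = Vec::new();
    for line in text.lines() {
        let mut words = line.split_whitespace();
        match words.next() {
            Some("v") => {
                let mut coord = || words.next().unwrap().parse::<f64>().unwrap();
                vertices.push([coord(), coord(), coord()]);
            }
            Some("f") => {
                let mut index = || {
                    let word = words.next().unwrap();
                    let first = word.split('/').next().unwrap();
                    first.parse::<usize>().unwrap() - 1 // .obj indices are 1-based
                };
                triangles.push([index(), index(), index()]);
            }
            _ => {} // ignore vn, vt, comments, etc. for now
        }
    }
    (vertices, triangles)
}
```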
+
There are plenty of .obj models to download, with the Utah teapot being the most famous one.
+
Note that the model specifies three parameters for each triangle’s vertex:
+
+
+coordinate (v)
+
+
+normal (vn)
+
+
+texture (vt)
+
+
+
For the first implementation, you’d want to ignore vn and vt, and aim at getting a highly polygonal teapot on the screen.
+Note that the model contains thousands of triangles, and would take significantly more time to render.
+You might want to downscale the resolution a bit until we start optimizing performance.
+
To make the picture less polygony, you’d want to look at those vn normals.
+The idea here is that, instead of using the triangle’s true normal when calculating light, we use a fake normal, as if the triangle weren’t actually flat.
+To do that, the .obj files specifies “fake” normals for each vertex of a triangle.
+If a ray intersects a triangle somewhere in the middle, you can compute a fake normal at that point by taking a weighted average of the three normals at the vertexes.
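If your ray-triangle intersection hands you the barycentric coordinates of the hit point (most derivations do), the fake normal is just a weighted blend of the three vertex normals; a sketch, again reusing the earlier Vec3 type:

```rust
// "Smooth" normal at a point inside a triangle: blend the three vertex normals
// from the .obj file, weighted by the barycentric coordinates (u, v, w) of the
// hit point (u + v + w == 1), and re-normalize the result.
fn interpolated_normal(n0: Vec3, n1: Vec3, n2: Vec3, u: f64, v: f64, w: f64) -> Vec3 {
    (n0 * u + n1 * v + n2 * w).normalized()
}
```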
+
At this point, you should get a picture roughly comparable to the one at the start of the article!
With all bells and whistles, our ray tracer should be rather slow, especially for larger images.
+There are three tricks I suggest to make it faster (and also to learn a bunch of stuff).
+
First, ray tracing is an embarrassingly parallel task: each pixel is independent of the others.
+So, as a quick win, make sure that your program uses all the cores for rendering.
+Did you manage to get a linear speedup?
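One low-tech way to use all the cores with just the standard library is to hand each thread its own horizontal band of the image via scoped threads. A sketch, where render_pixel stands for whatever your existing per-pixel code is:

```rust
use std::thread;

// Render `height` rows of `width` pixels each, splitting the rows between threads.
fn render_parallel<F>(width: usize, height: usize, threads: usize, render_pixel: F) -> Vec<[u8; 3]>
where
    F: Fn(usize, usize) -> [u8; 3] + Sync,
{
    let mut image = vec![[0u8; 3]; width * height];
    let rows_per_thread = (height + threads - 1) / threads;
    thread::scope(|scope| {
        // Each thread gets an exclusive chunk of the output buffer.
        for (chunk_index, chunk) in image.chunks_mut(rows_per_thread * width).enumerate() {
            let render_pixel = &render_pixel;
            scope.spawn(move || {
                let first_row = chunk_index * rows_per_thread;
                for (i, pixel) in chunk.iter_mut().enumerate() {
                    let (x, y) = (i % width, first_row + i / width);
                    *pixel = render_pixel(x, y);
                }
            });
        }
    });
    image
}
```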
+
Second, it’s a good opportunity to look into profiling tools.
+Can you figure out what specifically is the slowest part?
+Can you make it faster?
+
Third, our implementation which loops over each shape to find the closest intersection is a bit naive.
+It would be cool if we had something like a binary search tree, which would show us the closest shape automatically.
+As far as I know, there isn’t a general algorithmically optimal index data structure for doing spatial lookups.
+However, there’s a bunch of somewhat heuristic data structures which tend to work well in practice.
+
One that I suggest implementing is the bounding volume hierarchy.
+The crux of the idea is that we can take a bunch of triangles and place them inside a bigger object (eg, a gigantic sphere).
+Then, if a ray doesn’t intersect this bigger object, we don’t need to check any triangles contained within.
+There’s a certain freedom in how one picks such bounding objects.
+
For our BVH, we will use axis-aligned bounding boxes (AABBs) as the bounding volumes.
+An AABB is a cuboid whose edges are parallel to the coordinate axes.
+You can parametrize an AABB with two points — the one with the lowest coordinates, and the one with the highest.
+It’s also easy to construct an AABB which bounds a set of shapes — take the minimum and maximum coordinates of all vertexes.
+Similarly, intersecting an AABB with a ray is fast.
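A sketch of the AABB type, a helper to merge two boxes, and the standard “slab” ray-box test, again with bare [f64; 3] points:

```rust
#[derive(Clone, Copy)]
struct Aabb {
    min: [f64; 3],
    max: [f64; 3],
}

impl Aabb {
    // Smallest box containing both `a` and `b` -- used to merge boxes bottom-up.
    fn union(a: Aabb, b: Aabb) -> Aabb {
        let mut result = a;
        for axis in 0..3 {
            result.min[axis] = result.min[axis].min(b.min[axis]);
            result.max[axis] = result.max[axis].max(b.max[axis]);
        }
        result
    }

    // "Slab" test: intersect the ray with the three pairs of parallel planes
    // and check that the three parameter intervals overlap.
    fn hit_by(&self, origin: [f64; 3], dir: [f64; 3]) -> bool {
        let (mut t_min, mut t_max) = (0.0_f64, f64::INFINITY);
        for axis in 0..3 {
            let inv = 1.0 / dir[axis]; // +-inf for axis-parallel rays mostly does the right thing
            let mut t0 = (self.min[axis] - origin[axis]) * inv;
            let mut t1 = (self.max[axis] - origin[axis]) * inv;
            if t0 > t1 {
                std::mem::swap(&mut t0, &mut t1);
            }
            t_min = t_min.max(t0);
            t_max = t_max.min(t1);
            if t_min > t_max {
                return false;
            }
        }
        true
    }
}
```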
+
The next idea is to define a hierarchy of AABBs.
+First, we define a root AABB for the whole scene.
+If the ray doesn’t hit it, we are done.
+The root box is then subdivided into two smaller boxes.
+The ray can hit one or two of them, and we recur into each box that got hit.
+Worst case, we are recurring into both subdivisions, which isn’t any faster, but in the common case we can skip at least a half.
+For simplicity, we also start with computing an AABB for each triangle we have in a scene, so we can think uniformly about a bunch of AABBs.
+
Putting everything together, we start with a bunch of small AABBs for our primitives.
+As a first step, we compute their common AABB.
+This will be the basis of our recursion step: a bunch of small AABBs, and a huge AABB encompassing all of them.
+We want to subdivide the big box.
+To do that, we select its longest axis (eg, if the big box is very tall, we aim to cut it in two horizontally), and find a midpoint.
+Then, we sort the small AABBs into those whose center is before the midpoint along this axis and those whose center is after it.
+Finally, for each of the two subsets we compute a pair of new AABBs, and then recur.
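Putting that recursion into code, the build step might look roughly like this (Aabb and Aabb::union are from the previous sketch; leaves keep indices into your flat list of primitives):

```rust
// A node either stores a handful of primitive indices (leaf), or two children.
enum Bvh {
    Leaf { bounds: Aabb, primitives: Vec<usize> },
    Node { bounds: Aabb, left: Box<Bvh>, right: Box<Bvh> },
}

// `boxes[i]` is the AABB of primitive `i`; `items` is the subset we are splitting.
fn build_bvh(boxes: &[Aabb], mut items: Vec<usize>) -> Bvh {
    let bounds = items
        .iter()
        .map(|&i| boxes[i])
        .reduce(Aabb::union)
        .expect("building a BVH of zero primitives");

    if items.len() <= 4 {
        return Bvh::Leaf { bounds, primitives: items };
    }

    // Split along the longest axis of the bounding box, at its midpoint.
    let mut axis = 0;
    for a in 1..3 {
        if bounds.max[a] - bounds.min[a] > bounds.max[axis] - bounds.min[axis] {
            axis = a;
        }
    }
    let midpoint = (bounds.min[axis] + bounds.max[axis]) / 2.0;

    let center = |i: usize| (boxes[i].min[axis] + boxes[i].max[axis]) / 2.0;
    let (left, right): (Vec<usize>, Vec<usize>) =
        items.drain(..).partition(|&i| center(i) < midpoint);

    // Degenerate split (everything landed on one side): just make a leaf.
    if left.is_empty() || right.is_empty() {
        let primitives = if left.is_empty() { right } else { left };
        return Bvh::Leaf { bounds, primitives };
    }
    Bvh::Node {
        bounds,
        left: Box::new(build_bvh(boxes, left)),
        right: Box::new(build_bvh(boxes, right)),
    }
}
```

Traversal then mirrors the structure: test the ray against a node’s box and recur into the children only on a hit.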
+
Crucially, the two new bounding boxes might intersect.
+We can’t just cut the root box in two and unambiguously assign the small AABBs to the two halves, as they might not be entirely within one.
+But, we can expect the intersection to be pretty small in practice.
If you’ve made it this far, you have a pretty amazing piece of software!
+While it probably clocks in at only a couple of thousand lines of code, it covers a pretty broad range of topics, from text file parsing to advanced data structures for spatial data.
+I deliberately spent no time explaining how best to fit all these pieces into a single box; that’s the main thing for you to experiment with and to learn.
+
There are two paths one can take from here:
+
+
+If you liked the graphics programming aspect of the exercise, there’s a lot you can do to improve the quality of the output.
+https://pbrt.org is the canonical book on the topic.
+
+
+If you liked the software engineering side of the project, you can try to re-implement it in different programming languages, to get a specific benchmark to compare different programming paradigms.
+Alternatively, you might want to look for other similar self-contained hand-made projects.
+Some options include:
+
+
+Software rasterizer: rather than simulating a path of a ray, we can project triangles onto the screen.
+This is potentially much faster, and should allow for real-time rendering.
+
+
+A highly concurrent chat server: a program which listens on a TCP port, allows clients to connect to it and exchange messages.
+
+
+A toy programming language: going the full road from a text file to an executable .wasm. Bonus points if you also do an LSP server for your language.
+
+
+A distributed key-value store based on Paxos or Raft.
+
For cryptographic purposes (eg, generating a key pair for public key cryptography), you want to use real random numbers, derived from genuinely stochastic physical signals
+(hardware random number generator, keyboard input, etc).
+The shape of the API here is:
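The elided snippet boils down to “fill this buffer with unpredictable bytes, courtesy of the OS”. Roughly this shape, with a made-up function name:

```rust
// "Give me real randomness": the OS fills the buffer with unpredictable bytes.
// The name is made up; in practice this is what the getrandom crate wraps.
fn fill_from_os_entropy(buf: &mut [u8]) -> std::io::Result<()> {
    let _ = buf; // the real thing would hand this buffer to the OS
    unimplemented!("getrandom(2), /dev/urandom, BCryptGenRandom, ...")
}
```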
+
+
+
As this fundamentally requires talking to some physical devices, this task is handled by the operating system.
+Different operating systems provide different APIs, covering which is beyond the scope of this article (and my own knowledge).
+
In Rust, getrandom crate provides a cross-platform wrapper for this functionality.
+
It is a major deficiency of Rust standard library that this functionality is not exposed there.
+Getting cryptographically secure random data is in the same class of OS services as getting the current time or reading standard input.
+Arguably, it’s even more important, as most applications for this functionality are security-critical.
For various non-cryptographic randomized algorithms, you want to start with a fixed, deterministic seed, and generate a stream of numbers, statistically indistinguishable from random.
+The shape of the API here is:
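Here, in contrast, the shape is a pure state machine: a seed goes in, a deterministic stream of numbers comes out. Again with made-up names:

```rust
// "Give me statistically random-looking numbers": no OS involved, fully
// deterministic given the seed.
struct Prng {
    state: u64,
}

impl Prng {
    fn new(seed: u64) -> Prng {
        Prng { state: seed }
    }
    fn next_u32(&mut self) -> u32 {
        unimplemented!("a few arithmetic operations mixing self.state")
    }
}
```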
+
+
+
There are many different algorithms to do that.
+fastrand crate implements something sufficiently close to the state of the art.
+
Alternatively, a good-enough PRNG can be implemented in 9 lines of code:
+
+
+
This code was lifted from Rust’s standard library (source).
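That particular snippet isn’t reproduced here. Purely as an illustration of how little code such a generator needs, here is a classic xorshift*-style generator of about the same size (this is not the code from std):

```rust
// A tiny xorshift* generator: an illustration, not the snippet from std.
// Fine for everyday randomized algorithms, completely unsuitable for anything
// security-related. The state must be initialized to a non-zero value.
struct XorShift64Star {
    state: u64,
}

impl XorShift64Star {
    fn next_u64(&mut self) -> u64 {
        self.state ^= self.state >> 12;
        self.state ^= self.state << 25;
        self.state ^= self.state >> 27;
        self.state.wrapping_mul(0x2545_F491_4F6C_DD1D)
    }
}
```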
+
The best way to seed a PRNG is usually by using a fixed constant.
+If you absolutely need some amount of randomness in the seed, you can use the following hack:
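One way to spell that hack using only the standard library (a sketch; it leans on the fact that std’s RandomState picks a random key once per process):

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hasher};

// Extract a few random bits from std's HashMap seed. Good enough to randomize
// a PRNG seed, not good enough for anything cryptographic.
fn random_seed() -> u64 {
    RandomState::new().build_hasher().finish()
}
```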
+
+
+
In Rust, hash maps include some amount of randomization to avoid exploitable pathological behavior due to collisions.
+The above snippet extracts that randomness.
Good PRNG gives you a sequence of u32 numbers where each number is as likely as every other one.
+You can convert that to a number from 0 to 10 with random_u32() % 10.
+This will be good enough for most purposes, but will fail rigorous statistical tests.
+Because 2^32 isn’t evenly divisible by 10, 0 would be ever so slightly more frequent than 9.
+There is an algorithm to do this correctly (if random_u32() is very large, and falls into the literal remainder after dividing 2^32 by 10, throw it away and try again).
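In code, the unbiased version is a small rejection loop on top of whatever random_u32 you already have (a sketch):

```rust
// Unbiased `random value in 0..n` (n must be non-zero): reject the topmost
// values of the u32 range that don't form a complete group of `n`, then take
// the remainder.
fn random_below(mut random_u32: impl FnMut() -> u32, n: u32) -> u32 {
    // Largest multiple of n representable as a count of accepted values;
    // anything at or above it would bias the result.
    let limit = u32::MAX - u32::MAX % n;
    loop {
        let value = random_u32();
        if value < limit {
            return value % n;
        }
    }
}
```

The loop almost never runs more than once: for small n, only a handful of the 2^32 possible values get rejected.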
+
Sometimes you want to use random_u32() to generate other kinds of random things, like a random point on a 3D sphere, or a random permutation.
+There are also algorithms for that.
+
Sphere: generate random point in the unit cube; if it is also in the unit ball, project it onto the surface, otherwise throw it away and try again.
+
Permutation: naive algorithm of selecting a random element to be the first, then selecting a random element among the rest to be the second, etc, works.
+
There are libraries which provide collections of such algorithms.
+For example, fastrand includes most common ones, like generating numbers in range, generating floating point numbers or shuffling slices.
+
rand includes more esoteric cases like the aforementioned point on a sphere or a normal distribution.
It is customary to expect the existence of a global random number generator, seeded for you.
+This is an anti-pattern — in the overwhelming majority of cases, passing a random number generator explicitly leads to better software.
+In particular, this is a requirement for deterministic tests.
+
+In any case, this functionality can be achieved by storing the state of a PRNG in a thread local:
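A sketch of that, reusing the xorshift generator and the random_seed hack from the sketches above:

```rust
use std::cell::Cell;

// A lazily seeded PRNG, one instance per thread. Roughly what the "global"
// convenience APIs of crates like fastrand do under the hood.
thread_local! {
    static RNG_STATE: Cell<u64> = Cell::new(random_seed() | 1); // keep it non-zero
}

fn global_random_u64() -> u64 {
    RNG_STATE.with(|state| {
        let mut rng = XorShift64Star { state: state.get() };
        let value = rng.next_u64();
        state.set(rng.state);
        value
    })
}
```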
rand is an umbrella crate which includes all of the above.
+rand also provides flexible trait-based “plugin” interface, allowing you to mix and match different combinations of PRNGs and algorithms.
+User interface of rand is formed primarily by extension traits.
Circling back to the beginning of the post, it is very important to distinguish between the two use-cases:
+
+
+using unpredictable data for cryptography
+
+
+using statistically uniform random data for stochastic algorithms
+
+
+
Although the two use-cases both have “randomness” in their name, they are disjoint, and underlying algorithms and APIs don’t have anything in common.
+They are physically different: one is a syscall, another is a pure function mapping integers to integers.
In his “Rust in 2023” post, @nrc floated an idea of a Rust compiler rewrite.
+As my hobby is writing Rust compiler frontends (1, 2), I have some (but not very many) thoughts here!
+The post consists of two parts, covering organizational and technical aspects.
Writing a production-grade compiler is not a small endeavor.
+The questions of who writes the code, who pays the people writing the code, and what’s the economic incentive to fund the work in the first place are quite important.
+
+My naive guesstimate is that Rust is currently at that stage of its life where it’s clear that the language won’t die and will be deployed quite widely, but where, at the same time, said deployment hasn’t quite happened to the full extent yet.
+From within the Rust community, it seems like Rust is everywhere.
+My guess is that from the outside it looks like there’s Rust in at least some places.
+
In other words, it’s high time to invest substantially into Rust ecosystem, as the risk that the investment sinks completely is relatively low, but the expected growth is still quite high.
+This makes me think that a next-gen rust compiler isn’t too unlikely: I feel that rustc is stuck in a local optimum, and that, with some boldness, it is possible to deliver something more awesome.
Here’s what I think an awesome rust compiler would do:
+
+
rust-native compilation model
+
+
Like C++, Rust (ab)uses the C compilation model — compilation units are separately compiled into object files, which are then linked into a single executable by the linker.
+This model is at odds with how the language works.
+In particular, compiling a generic function isn’t actually possible until you know specific type parameters at the call-site.
+Rust and C++ hack around that by compiling a separate copy for every call-site (C++ even re-type-checks every call-site), and deduplicating instantiations during the link step.
+This creates a lot of wasted work, which is only there because we try to follow “compile to object files then link” model of operation.
+It would be significantly more efficient to merge compiler and linker, such that only the minimal amount of code is compiled, compiled code is fully aware about surrounding context and can be inlined across crates, and where the compilation makes the optimal use of all available CPU and RAM.
+
+
intra-crate parallelism
+
+
C compilation model is not stupid — it is the way it is to enable separate compilation.
+Back in the day, compiling whole programs was simply not possible due to the limitations of the hardware.
+Rather, a program had to be compiled in separate parts, and then the parts linked together into the final artifact.
+With bigger computers today, we don’t think about separate compilation as much.
+It is still important though: not only are our computers more powerful, our programs are also much bigger.
+Moreover, computing power comes not from increasing clock speeds, but from a larger number of cores.
+
Rust’s DAG of anonymous crates with well-defined declaration-site checked interfaces is actually quite great for compiling Rust in parallel (especially if we get rid of completely accidental interactions between monomorphization and existing linkers).
+However, even a single crate can be quite large, and is compiled sequentially.
+For example, in the recent compile time benchmark, a significant chunk of time was spent compiling just this file with a bunch of functions.
+Intuitively, as all these functions are completely independent, compiler should be able to process them in parallel.
+In reality, Rust doesn’t actually make that as easy as it seems, but it definitely is possible to do better than the current compiler.
+
+
open-world compiling; stable MIR
+
+
Today, Rust tooling is a black box: you feed it source text, and you get an executable binary as output.
+This solves the problem of producing executable binaries quite well!
+
However, for more complex projects you want to have more direct relationship with the code.
+You want tools other than compiler to understand the meaning of the code, and to act on it.
+For example automated large scale refactors and code analysis, project-specific linting rules or formal proofs of correctness all could benefit from having an access to semantically rich model of the language.
+
Providing such a semantic model, where the AST is annotated with resolved names and inferred types, and bodies are converted to a simple and precise IR, is a huge ask.
+Not because it is technically hard to implement, but because this adds an entirely new stable API to the language.
+Nonetheless, such an API would unlock quite a few use cases, so the tradeoff is worth it.
+
+
hermetic deterministic compilation
+
+
It is increasingly common to want reproducible builds.
+With NixOS and Guix, whole Linux distros are built in a deterministic fashion.
+It is possible to achieve reproducibility by carefully freezing whatever mess you are currently in, the docker way.
+But a better approach is to start with inherently pure and hermetic components, and assemble them into a larger system.
+
Today, Rust has some amount of determinism in its compilation, but it is achieved by plugging loopholes, rather than by not admitting impurities into the system in the first place.
+For example, the env! macro literally looks up a value in compiler’s environment, without any attempt at restricting or at least enumerating available inputs.
+Procedural macros are an unrestricted RCE.
+
It feels like we can do better, and that we should do better, if the goal is still less mess.
+
+
lazy and error-resilient compilation
+
+
For the task of providing immediate feedback right in the editor when the user types the code, compilation “pipeline” needs to be changed significantly.
+It should be lazy (so that only the minimal amount of code is inspected and re-analyzed on typing) and resilient and robust to errors (IDE job mostly ends when the code is error free).
+rust-analyzer shows one possible way to do that, with the only drawback of being a completely separate tool for IDE, and only IDE.
+There’s no technical limitation why the full compiler can’t be like that, just the organizational limitation of it being very hard to re-architect existing entrenched code, perfected for its local optimum.
+
+
cargo install rust-compiler
+
+
Finally, for the benefit of compiler writers themselves, a compiler should be a simple rust crate, which builds with stable Rust and is otherwise a very boring text processing utility.
+Again, rust-analyzer shows that it is possible, and that the benefits for development velocity are enormous.
+I am glad to see a recent movement to making the build process for the compiler simpler!
People complain about Rust syntax.
+I think that most of the time when people think they have an issue with Rust’s syntax, they actually object to Rust’s semantics.
+In this slightly whimsical post, I’ll try to disentangle the two.
+
Let’s start with an example of an ugly Rust syntax:
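The function in question is std::fs::read; approximately (imports added, body slightly simplified), it looks like this:

```rust
use std::fs::File;
use std::io::{self, Read};
use std::path::Path;

pub fn read<P: AsRef<Path>>(path: P) -> io::Result<Vec<u8>> {
    fn inner(path: &Path) -> io::Result<Vec<u8>> {
        let mut file = File::open(path)?;
        let mut bytes = Vec::new();
        file.read_to_end(&mut bytes)?;
        Ok(bytes)
    }
    inner(path.as_ref())
}
```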
+
+
+
This function reads contents of a given binary file.
+This is lifted straight from the standard library, so it is very much not a strawman example.
+And, at least to me, it’s definitely not a pretty one!
+
Let’s try to imagine what this same function would look like if Rust had a better syntax.
+Any resemblance to real programming languages, living or dead, is purely coincidental!
+
Let’s start with Rs++:
+
+
+
A Rhodes variant:
+
+
+
Typical RhodesScript:
+
+
+
Rattlesnake:
+
+
+
And, to conclude, CrabML:
+
+
+
As a slightly more serious and useful exercise, let’s do the opposite — keep the Rust syntax, but try to simplify semantics until the end result looks presentable.
+
Here’s our starting point:
+
+
+
The biggest source of noise here is the nested function.
+The motivation for it is somewhat esoteric.
+The outer function is generic, while the inner function isn’t.
+With the current compilation model, that means that the outer function is compiled together with the user’s code, gets inlined and is optimized down to nothing.
+In contrast, the inner function is compiled when the std itself is being compiled, saving time when compiling user’s code.
+One way to simplify this (losing a bit of performance) is to say that generic functions are always separately compiled, but accept an extra runtime argument under the hood which describes the physical dimensions (size and alignment) of their type parameters.
+
With that, we get
+
+
+
The next noisy element is the <P: AsRef<Path>> constraint.
+It is needed because Rust loves exposing physical layout of bytes in memory as an interface, specifically for cases where that brings performance.
+In particular, the meaning of Path is not that it is some abstract representation of a file path, but that it is just literally a bunch of contiguous bytes in memory.
+So we need AsRef to make this work with any abstraction which is capable of representing such a slice of bytes.
+But if we don’t care about performance, we can require that all interfaces are fairly abstract and mediated via virtual function calls, rather than direct memory access.
+Then we won’t need AsRef at all:
+
+
+
Having done this, we can actually get rid of Vec<u8> as well — we can no longer use generics to express efficient growable array of bytes in the language itself.
+We’d have to use some opaque Bytes type provided by the runtime:
+
+
+
Technically, we are still carrying ownership and borrowing system with us, but, without direct control over memory layout of types, it no longer brings massive performance benefits.
+It still helps to avoid GC, prevent iterator invalidation, and statically check that non-thread-safe code isn’t actually used across threads.
+Still, we can easily get rid of those &-pretzels if we just switch to GC.
+We don’t even need to worry about concurrency much — as our objects are separately allocated and always behind a pointer, we can hand-wave data races away by noticing that operations with pointer-sized things are atomic on x86 anyway.
+
+
+
+Finally, we are being overly pedantic with error handling here: not only do we mention the possibility of failure in the return type, we even use ? to highlight each specific expression that might fail.
+It would be much simpler to not think about error handling at all, and let some top-level
+try { } catch (...) { /* intentionally empty */ }
+handler deal with it:
+
+
+
Much better now!
+
Zig is a very interesting language from an IDE point of view.
+Some aspects of it are friendly to IDEs, like a very minimal and simple-to-parse syntax
+(Zig can even be correctly lexed line-by-line, very cool!),
+the absence of syntactic macros, and ability to do a great deal of semantic analysis on a file-by-file basis, in parallel.
+On the other hand, comptime.
+I accidentally spent some time yesterday thinking about how to build an IDE for it; this post is the result.
It’s useful to discuss a bit how the compiler works today.
+For something more thorough, refer to this excellent series of posts: https://mitchellh.com/zig.
+
First, each Zig file is parsed into an AST.
+Delightfully, parsing doesn’t require any context whatsoever, it’s a pure []const u8 -> Ast function, and the resulting Ast is just a piece of data.
+
After parsing, the Ast is converted to an intermediate representation, Zir.
+This is where Zig diverges a bit from more typical statically compiled languages.
+Zir actually resembles something like Python’s bytecode — an intermediate representation that an interpreter for a dynamically-typed language would use.
+That’s because it is an interpreter’s IR — the next stage would use Zir to evaluate comptime.
+
Let’s look at an example:
+
+
+
Here, the Zir for generic_add would encode addition as a typeless operation, because we don’t know types at this point.
+In particular, T can be whatever.
+When the compiler instantiates generic_add with different Ts, like generic_add(u32, ...), generic_add(f64, ...), it will re-use the same Zir for the different instantiations.
+That’s the two purposes of Zir: to directly evaluate code at compile time, and to serve as a template for monomorphisation.
+
The next stage is where the magic happens — the compiler partially evaluates dynamically typed Zir to convert it into a fairly standard statically typed IR.
+The process starts at the main function.
+The compiler more or less tries to evaluate the Zir.
+If it sees something like 90 + 2, it directly evaluates that to 92.
+For something which can’t be evaluated at compile time, like a + 2 where a is a runtime variable, the compiler generates typed IR for addition (as, at this point, we already know the type of a).
+
When the compiler sees something like
+
+
+
the compiler monomorphises the generic call.
+It checks that all comptime arguments (T) are fully evaluated, and starts partial evaluation of the called function, with comptime parameters fixed to particular values (this of course is memoized).
+
The whole process is lazy — only things transitively used from main are analyzed.
+Compiler won’t complain about something like
+
+
+
This looks perfectly fine at the Zir level, and the compiler will not move beyond Zir unless the function is actually called somewhere.
So much for the compiler; what does an IDE need? Unlike a batch compiler, an IDE:
+
+
+works with code which rapidly changes over time
+
+
+gives results immediately, there is no edit/compile cycle
+
+
+provides source to source transformations
+
+
+
The hard bit is the combination of rapid changes and immediate results.
+This is usually achieved using some smart, language-specific combination of
+
+
+
Incrementality: although changes are frequent and plentiful, they are local, and it is often possible to re-use large chunks of previous analysis.
+
+
+
Laziness: unlike a compiler, an IDE does not need full analysis results for the entirety of the codebase.
+Usually, analysis of the function which is currently being edited is the only time-critical part, everything else can be done asynchronously, later.
+
+
+
This post gives an overview of some specific fruitful combinations of the two ideas:
It’s useful to start with a pedantically correct approach.
+Let’s run our usual compilation (recursively monomorphising called functions starting from the main).
+The result would contain a bunch of different monomorphisations of guinea_pig, for different values of T.
+For each specific monomorphisation it’s now clear what is the correct answer.
+For the unspecialized case as written in the source code, the IDE can now show something reasonable by combining partial results from each monomorphisation.
+
There are several issues with this approach.
+
First, collecting the full set of monomorphisations is not well-defined in the presence of conditional compilation.
+Even if you run the “full” compilation starting from main, today the compiler assumes some particular environment (eg, Windows or Linux), which doesn’t give you a full picture.
+There’s a fascinating issue about multibuilds — making the compiler process all combinations of conditional compilation flags at the same time: zig#3028.
+With my IDE writer hat on, I really hope it gets in, as it will move IDE support from inherently heuristic territory to something where, in principle, there’s a correct result (even if it might not be particularly easy to compute).
+
The second problem is that this probably is going to be much too slow.
+If you think about IDE support for the first time, a very tantalizing idea is to try to lean just into incremental compilation.
+Specifically, you can imagine a compiler that maintains fully type-checked and resolved view of the code at all times.
+If a user edits something, the compiler just incrementally changes what needs to be changed.
+So the trick for IDE-grade interactive performance is just to implement sufficiently advanced incremental compilation.
+
+The problem with a sufficiently incremental compiler is that even perfect incrementality, which does the minimal required amount of work, will be slow in a non-negligible number of cases.
+The nature of code is that a small change to the source in a single place might lead to a large change to resolved types all over the project.
+For example, changing the name of some popular type invalidates all the code that uses this type.
+That’s the fundamental reason why IDEs try hard to maintain the ability to not analyze everything.
+
On the other hand, at the end of the day you’ll have to do this work at least by the time you run the tests.
+And Zig’s compiler is written from the ground up to be very incremental and very fast, so perhaps this will be good enough?
+My current gut feeling is that the answer is no — even if you can re-analyze everything in, say, 100ms, that’ll still require burning the battery for essentially useless work.
+Usually, there are many small atomic edits for every single test run.
+
+The third problem with the approach of collecting all monomorphisations is that it simply does not work if the function isn’t actually called, yet.
+Which is common in incomplete code that is being written, exactly the use-case where the IDE is most useful!
Thinking about the “full” approach more, it feels like it could be, at least in theory, optimized somewhat.
+Recall that in this approach we have a graph of function instantiations, which starts at the root (main), and contains various monomorphisations of guinea_pig on paths reachable from the root.
+
It is clear we actually don’t need the full graph to answer queries about instantiations of guinea_pig.
+For example, if we have something like
+
+
+
and the helper does not (transitively) call guinea_pig, we can avoid looking into its body, as the signature is enough to analyze everything else.
+
More precisely, given the graph of monomorphisations, we can select minimal subgraph which includes all paths from main to guinea_pig instantiations, as well as all the functions whose bodies we need to process to understand their signatures.
+My intuition is that the size of that subgraph is going to be much smaller than the whole thing, and, in principle, an algorithm which would analyze only that subgraph should be speedy enough in practice.
+
The problem though is that, as far as I know, it’s not possible to understand what belongs to the subgraph without analysing the whole thing!
+In particular, using compile-time reflection our guinea_pig can be called through something like comptime "guinea" ++ "_pig".
+It’s impossible to infer the call graph just from Zir.
+
And of course this does not help the case where the function isn’t called at all.
Let’s approach the problem from a different direction.
+What if we just treat this function as the root of our graph?
+We can’t do that exactly, because it has some comptime parameters.
+But we can say that we have some opaque values for the parameters: T = <opaque>.
+Of course, we won’t be able to fully evaluate everything and things like if (T == int) would probably need to propagate opaqueness.
+At the same time, something like the result of BoundedArray(opaque) would still be pretty useful for an IDE.
+
I am wondering if there’s even perhaps some compilation-time savings in this approach?
+My understanding (which might be very wrong!) is that if a generic function contains something like 90 + 2, this expression would be comptime-evaluated anew for every instantiation.
+In theory, what we could do is to partially evaluate this function substituting opaque values for comptime parameters, and then, for any specific instantiation, we can use the result of this partial evaluation as a template.
+Not sure what that would mean precisely though: it definitely would be more complicated than just substituting Ts in the result.
Ast and Zir infra is good.
+It is per-file, so it naturally just works in an IDE.
+
Multibuilds are important.
+I am somewhat skeptical that they’ll actually fly, and it’s not a complete game over if they don’t
+(Rust has the same problem with conditional compilation, and it does create fundamental problems for both the users and authors of IDEs, but the end result is still pretty useful).
+Still, if Zig does ship multibuilds, that’d be awesome.
+
Given the unused function problem, I think it’s impossible to avoid at least some amount of abstract interpretation, so Sema has to learn to deal with opaque values.
+
With abstract interpretation machinery in place, it can be used as a first, responsive layer of IDE support.
+
+Computing the full set of monomorphisations in the background can be used to augment these limited synchronous features with precise results asynchronously.
+Though, this might be tough to express in existing editor UIs.
+Eg, the goto definition result is now an asynchronous stream of values.
Deno is a relatively new JavaScript runtime.
+I find it quite interesting and aesthetically appealing, in line with the recent trend of reining in the worse-is-better law of software evolution.
+This post explains why.
+
The way I see it, the primary goal of Deno is to simplify development of software, relative to the status quo.
+Simplifying means removing the accidental complexity.
+To me, a big source of accidental complexity in today’s software is implicit dependencies.
+Software is built of many components, and while some components are relatively well-defined (Linux syscall interface, amd64 ISA), others are much less so.
+Example: upgrading OpenSSL for your Rust project from 1.1.1 to 3.0.0 works on your machine, but breaks on CI, because 3.0.0 now needs some new perl module, which is expected to usually be there together with the perl installation, but that is not universally so.
+One way to solve these kinds of problems is by putting an abstraction boundary (read: a docker container) around them.
+But a different approach is to very carefully avoid creating the issues.
+Deno, in the general sense, picks this second noble hard path.
+
One of the first problems in this area is bootstrapping.
+In general, you can paper over quite a bit of complexity by writing some custom script to do all the grunt work.
+But how do you run it?
+
One answer is to use a shell script, as the shell is already installed.
+Which shell? Bash, sh, powershell?
+Probably POSIX sh is a sane choice; Windows users can just run a docker container (er, a Linux in their subsystem).
+You’ll also want to install shellcheck to make sure you don’t accidentally use bashisms.
+At some point your script grows too large, and you rewrite it in Python.
+You now have to install Python, I’ve heard it’s much easier these days on Windows.
+Of course, you’ll run that inside a docker container (er, a virtual environment).
+And you would be careful to use python3 -m pip rather than pip3 to make sure you use the right thing.
+
+Although scripting and plumbing should be a way to combat complexity, just getting to the point where every contributor to your software can run scripts requires a docker container (er, a great deal of futzing with the environment)!
+
Deno doesn’t solve the problem of just being already there on every imaginable machine.
+However, it strives very hard to not create additional problems once you get the deno binary onto the machine.
+Some manifestations of that:
+
Deno comes with a code formatter (deno fmt) and an LSP server (deno lsp) out of the box.
+The high order bit here is not that these are high-value features which drive productivity (though that is so), but that you don’t need to pull extra deps to get these features.
+Similarly, Deno is a TypeScript runtime — there’s no transpilation step involved, you just deno main.ts.
+
Deno does not rely on system’s shell.
+Most scripting environments, including node, python, and ruby, make a grave mistake of adding an API to spawn a process intermediated by the shell.
+This is slow, insecure, and brittle (which shell was that, again?).
+I have a longer post about the issue.
+Deno doesn’t have this vulnerable API.
+Not that “not having an API” is a particularly challenging technical achievement, but it is better than the current default.
+
Deno has a correctly designed tasks system.
+Whenever you do a non-trivial software project, there inevitably comes a point where you need to write some software to orchestrate your software.
+Accidental complexity creeps in the form of a Makefile (which make is that?) or a ./scripts/*.sh directory.
+Node (as far as I know) pioneered a great idea to treat these as a first-class concern of the project, by including a scripts field in the package.json.
+It then botched the execution by running the scripts through system’s shell, which downgrades it to ./scripts directory with more indirection.
+In contrast, Deno runs the scripts in deno_task_shell, a purpose-built small cross-platform shell.
+You no longer need to worry that rm might behave differently depending on which rm it is, because it’s a shell’s built-in now.
+
These are all engineering nice-to-haves.
+They don’t necessarily matter much in isolation, but together they point at project values which align very well with my own.
+But there are a couple of innovative, bigger features as well.
+
The first big feature is the permissions system.
+When you run a Deno program, you need to specify explicitly which OS resources it can access.
+Pinging google.com would require an explicit opt-in.
+You can safely run
+
+
+
and be sure that this won’t steal your secrets.
+Of course, it can still burn the CPU indefinitely or fill out.txt with garbage, but it won’t be able to read anything beyond explicitly passed input.
+For many, if not most, scripting tasks this is a nice extra protection from supply chain attacks.
+
The second big feature is Deno’s interesting, minimal, while still practical, take on dependency management.
+First, it goes without saying that there are no global dependencies.
+Everything is scoped to the current project.
+Naturally, there are also lockfiles with checksums.
+
However, there’s no package registry or even a separate package manager.
+In Deno, a dependency is always a URL.
+The runtime itself understands URLs, downloads their contents and loads the resulting TypeScript or JavaScript.
+Surprisingly, it feels like this is enough to express various dependency patterns.
+For example, if you need a centralized registry, like https://deno.land/x, you can use URLs pointing to that!
+URLs can also express semver, with foo@1 redirecting to foo@1.2.3.
+Import maps are a standard, flexible way to remap dependencies, for when you need to tweak something deep in the tree.
+Crucially, in addition to lockfiles Deno comes with a built in deno vendor command, which fetches all of the dependencies of the current project and puts them into a subfolder, making production deployments immune to dependencies’ hosting failures.
+
Deno’s approach to built-in APIs beautifully bootstraps from its url-based dependency management.
+First, Deno provides a set of runtime APIs.
+These APIs are absolutely stable, follow existing standards (eg, fetch for doing networking), and play the role of providing cross-platform interface for the underlying OS.
+Then there’s the standard library.
+There’s an ambition to provide a comprehensive batteries included standard library, which is vetted by core developers, a-la Go.
+At the same time, huge stdlib requires a lot of work over many years.
+So, as a companion to the stable 1.30.3 runtime APIs, which are a part of the deno binary, there’s the 0.177.0 version of the stdlib, which is downloaded just like any other dependency.
+I am fairly certain that in time this will culminate in actually stable, comprehensive, and high quality stdlib.
+
All these together mean that you can be sure that, if you got deno --version working, then deno run your-script.ts will always work, as the surface area for things to go wrong due to differences in the environment is drastically cut.
+
The only big drawback of Deno is the language — all this runtime awesomeness is tied to TypeScript.
+JavaScript is a curious beast — post ES6, it is actually quite pleasant to use, and has some really good parts, like injection-proof template literal semantics.
+But all the old WATs like
+
+
+
are still there.
+TypeScript does an admirable job with typing JavaScript, as it exists in the wild, but the resulting type system is not simple.
+It seems that, linguistically, something substantially better than TypeScript is possible in theory.
+But among the actually existing languages, TypeScript seems like a solid choice.
+
To sum up, historically the domain of “scripting” and “glue code” was plagued by the problem of accidentally supergluing oneself to a particular UNIX flavor at hand.
+Deno finally seems like a technology that tries to solve this issue of implicit dependencies by not having the said dependencies instead of putting everything in a docker container.
Usually, when discussing stability of the APIs (in a broad sense; databases and programming languages are also APIs), only two states are mentioned:
+
+
+an API is stable if there’s a promise that all future changes would be backwards compatible
+
+
+otherwise, it is unstable
+
+
+
This is reflected in, e.g., SemVer: before 1.0 anything goes; after 1.0, breaking the API is only allowed if you bump the major version.
+
I think the actual situation in the real world is a bit more nuanced than that.
+In addition to clearly stable or clearly unstable, there’s often a poorly defined third category.
+It often manifests as either:
+
+
+some technically non-stable version of the project (e.g., 0.2) becoming widely used and de facto stable
+
+
+some minor but technically breaking change quietly slipping in shortly after 1.0
+
+
+
Here’s what I think happens over a lifetime of a typical API:
+
In the first phase, the API is actively evolving.
+There is a promise of anti-stability — there’s constant change and a lot of experimentation.
+Almost no one is using the project seriously:
+
+
+the API is simply incomplete, there are large gaps in functionality
+
+
+chasing upstream requires continuous, large effort
+
+
+there’s no certainty that the project will, in fact, ship a stable version, rather than die
+
+
+
In the second phase, the API is mostly settled.
+It does everything it needs to do, and the shape feels mostly right.
+Transition to this state happens when the API maintainers feel like they nailed down everything.
+However, no wide deployment has happened yet, so there might still be minor but backwards-incompatible adjustments waiting to be made.
+It makes sense to use the API for all active projects (though it costs you an innovation token).
+The thing basically works, you might need to adjust your code from time to time, occasionally an adjustment is not trivial, but the overall expected effort is low.
+The API is fully production ready, and has everything except stability.
+If you write a program on top of the API today, and try to run it ten years later, it will fail.
+But if you are making your own releases a couple of times a year, you should be fine.
+
In the third phase, the API is fully stable, and no backwards-incompatible changes are expected.
+Otherwise, it is identical to the second phase.
+Transition to this phase happens after:
+
+
+early adopters empirically stop uncovering deficiencies in the API
+
+
+API maintainers make a commitment to maintain stability.
+
+
+
In other words, it is not unstable -> stable, it is rather:
+
+
+experimental (unstable, not fit for production)
+
+
+production ready (still unstable, but you can budget-in a bounded amount of upgrade work)
+
+
+stable (no maintenance work is required)
+
+
+
We don’t have great, catchy terms to describe the second bullet, so it gets lumped together with the first or the last one.
An introductory post about complexity theory today!
+It is relatively well-known that there exist so-called NP-complete problems — particularly hard problems, such that, if you solve one of them efficiently, you can solve all of them efficiently.
+I think I’ve learned relatively early that, e.g., SAT is such a hard problem.
+I’ve similarly learned a bunch of specific examples of equally hard problems, where solving one solves the other.
+However, why SAT is at least as hard as any other NP problem remained a mystery to me for a rather long time.
+It is a shame — this fact is rather intuitive and easy to understand.
+This post is my attempt at an explanation.
+It assumes some familiarity with the space, but it’s not going to be too technical or thorough.
Let’s say you are solving some search problem, like “find a path that visits every vertex in a graph once”.
+It is often possible to write a naive algorithm for it, where we exhaustively check every possible prospective solution:
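For concreteness, here is what such a naive “search and check” could look like for the path problem above (a sketch; the graph representation is arbitrary):

```rust
// Brute force "search and check" for the Hamiltonian path problem: try every
// permutation of the vertices and check whether consecutive vertices are
// connected. Checking one candidate is cheap; the problem is that there are
// n! candidates.
fn has_hamiltonian_path(n: usize, edges: &[(usize, usize)]) -> bool {
    let mut adj = vec![vec![false; n]; n];
    for &(u, v) in edges {
        adj[u][v] = true;
        adj[v][u] = true;
    }
    let mut vertices: Vec<usize> = (0..n).collect();
    let mut check = |candidate: &[usize]| candidate.windows(2).all(|w| adj[w[0]][w[1]]);
    search(&mut vertices, 0, &mut check)
}

// Enumerate permutations of `items[start..]`, stopping early if `check` passes.
fn search(items: &mut Vec<usize>, start: usize, check: &mut dyn FnMut(&[usize]) -> bool) -> bool {
    if start == items.len() {
        return check(items.as_slice());
    }
    for i in start..items.len() {
        items.swap(start, i);
        if search(items, start + 1, check) {
            return true;
        }
        items.swap(start, i);
    }
    false
}

fn main() {
    // The path graph 0 - 1 - 2 - 3 has a Hamiltonian path...
    assert!(has_hamiltonian_path(4, &[(0, 1), (1, 2), (2, 3)]));
    // ...while the star graph with center 0 does not.
    assert!(!has_hamiltonian_path(4, &[(0, 1), (0, 2), (0, 3)]));
}
```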
+
+
+
Although checking each specific candidate is pretty fast, the whole algorithm is exponential, because there are exponentially many candidates.
+Turns out, it is possible to write “check if solution fits” part as a SAT formula!
+And, if you have a magic algorithm which solves SAT, you can use that to find a candidate solution which would work instead of enumerating all solutions!
+
In other words, solving SAT removes “search” from “search and check”.
+
That’s more or less everything I wanted to say today, but let’s make this a tiny bit more formal.
We will be discussing algorithms and their runtime.
+Big-O notation is a standard instrument for describing performance of algorithms, as it erases small differences which depend on a particular implementation of the algorithm.
+Both 2N + 1000 and 100N are O(N), linear.
+
In this post we will be even less precise.
+We will talk about polynomial time: an algorithm is polynomial if it is O(N^k) for some k.
+For example, N^100 is polynomial, while 2^N is not.
+
We will also be thinking about Turing machines (TMs) as our implementation device.
+Programming algorithms directly on Turing machines is cumbersome, but TMs have two advantages for our use case:
+
+
+it’s natural to define runtime of TM
+
+
+it’s easy to simulate a TM as a part of some larger algorithm (an interpreter for a TM is a small program)
+
+
+
+Finally, we will only think about problems with binary answers (decision problems).
+“Is there a solution to this formula?” rather than “what is the solution to this formula?”.
+“Is there a path in the graph of length at least N?” rather than “what is the longest path in this graph?”.
Intuitively, a problem is NP if it’s easy to check that a solution is valid (even if finding the solution might be hard).
+This intuition doesn’t exactly work for yes/no problems we are considering.
+To fix this, we will also provide a “hint” for the checker.
+For example, if the problem is “is there a path of length N in a given graph?” the hint will be a path.
+
A decision problem is NP, if there’s an algorithm that can verify a “yes” answer in polynomial time, given a suitable hint.
+
That is, for every input where the answer is “yes” (and only for those inputs) there should be a hint that makes our verifying algorithm answer “yes”.
+
Boolean satisfiability, or SAT is a decision problem where an input is a boolean formula like
+
+
+
and the answer is “yes” if the formula evaluates to true for some variable assignment.
+
+It’s easy to see that SAT is NP: the hint is a variable assignment which satisfies the formula, and the verifier evaluates the formula.
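To make “the verifier evaluates the formula” concrete, here is a sketch of such a verifier for formulas in CNF (a conjunction of clauses, each clause a disjunction of possibly negated variables):

```rust
// The polynomial-time verifier for SAT, for formulas in CNF. The "hint" is an
// assignment of booleans to variables; verification is two nested loops,
// clearly polynomial in the size of the formula.
struct Literal {
    var: usize,
    negated: bool,
}

type Clause = Vec<Literal>;
type Formula = Vec<Clause>;

fn verify(formula: &Formula, assignment: &[bool]) -> bool {
    formula.iter().all(|clause| {
        clause.iter().any(|lit| assignment[lit.var] != lit.negated)
    })
}

fn main() {
    // (x0 or not x1) and (not x0 or x1)
    let formula: Formula = vec![
        vec![Literal { var: 0, negated: false }, Literal { var: 1, negated: true }],
        vec![Literal { var: 0, negated: true }, Literal { var: 1, negated: false }],
    ];
    assert!(verify(&formula, &[true, true]));
    assert!(!verify(&formula, &[true, false]));
}
```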
Turns out, there is the “hardest” problem in NP — solving just that single problem in polynomial time automatically solves every other NP problem in polynomial time (we call such problems NP-complete).
+Moreover, there’s actually a bunch of such problems, and SAT is one of them.
+Let’s see why!
+
First, let’s define a (somewhat artificial) problem which is trivially NP-complete.
+
+Let’s start with this one: “Given a Turing machine and an input for it of length N, will the machine output “yes” after N^k steps?”
+(here k is a fixed parameter; pedantically, I describe a family of problems, one for each k)
+
+This is very similar to the halting problem, but also much easier.
+We explicitly bound the runtime of the Turing machine by a polynomial, so we don’t need to worry about “looping forever” case — that would be a “no” for us.
+The naive algorithm here works: we just run the given machine on a given input for a given amount of steps and look at the answer.
+
+Now, if we formulate the problem as “Is there an input I for a given Turing machine M such that M(I) answers “yes” after N^k steps?” we get our NP-complete problem.
+It’s trivially NP: the hint is the input that makes the machine answer “yes”, and the verifier just runs our TM with this input for N^k steps.
+It can also be used to efficiently solve any other NP problem (e.g. SAT).
+Indeed, we can use the verifying TM as M, and that way find if there’s any hint that makes it answer “yes”.
+
+This is a bit circular and hard to wrap one’s head around, but, at the same time, trivial.
+We essentially just carefully stare at the definition of an NP problem, specifically produce an algorithm that can solve any NP problem by directly using the definition, and notice that the resulting algorithm is also NP.
+Now there’s no surprise that there exists the hardest NP problem — we essentially defined NP such that this is the case.
+
What is still a bit mysterious is why non-weird problems like SAT also turn out to be NP-complete?
+This is because SAT is powerful enough to encode a Turing machine!
+
First, note that we can encode a state of a Turing machine as a set of boolean variables.
+We’ll need a boolean variable T_i for each position on the tape.
+The tape is in general infinite, but all our Turing machines run for polynomial (finite) time, so they use only a finite number of cells, and it’s enough to create variables only for those cells.
+The position of the head can also be described by a set of boolean variables.
+For example, we can have a P_i (“is the head at cell i?”) variable for each cell.
+Similarly, we can encode the finite number of states our machine can be in as a set of S_i variables (“is the machine in state i?”).
+
Second, we can write a set of boolean equations which describe a single transition of our Turing machine.
+For example, the value of cell i at the second step, T^2_i, will depend on its value at the previous step, T^1_i, on whether the head was at i (P^1_i), and on the rules of our specific states.
+For example, if our machine flips bits in state 0 and keeps them in state 1, then the formula we get for each cell is
+
+T^2_i = (¬P^1_i ∧ T^1_i) ∨ (P^1_i ∧ S^1_1 ∧ T^1_i) ∨ (P^1_i ∧ S^1_0 ∧ ¬T^1_i)
+
We can write similar formulas for changes of P and S families of variables.
+
Third, after we wrote the transition formula for a single step, we can stack several such formulas on top of each other to get a formula for N steps.
+
+Now let’s come back to our universal problem: “is there an input which makes a given Turing machine answer “yes” in N^k steps?”.
+At this point, it’s clear that we can replace a “Turing machine with N^k steps” with our transition formula duplicated N^k times.
+So, the question of existence of an input for a Turing machine reduces to the question of existence of a solution to a (big, but still polynomial) SAT formula.
SAT is hard, because it allows encoding Turing machine transitions.
+We can’t encode loops in SAT, but we can encode “N steps of a Turing machine” by repeating the same formula N times with small variations.
+So, if we know that a particular Turing machine runs in polynomial time, we can encode it by a polynomially-sized formula.
+(see also pure meson ray-tracer for a significantly more practical application of a similar idea).
+
And that means that every problem that can be solved by a brute-force search over all solutions can be reduced to a SAT instance, by encoding the body of the search loop as a SAT formula!
A common trope is how, if one wants to build a game, one should build a game, rather than a game engine, because it is all too easy to fall into a trap of building a generic solution, without getting to the game proper.
+It seems to me that the situation with code editors is the opposite — many people build editors, but few are building “editor engines”.
+What’s an “editor engine”? A made-up term I use to denote the thin waist the editor is built upon: the set of core concepts, entities, and APIs which power the variety of the editor’s components.
+In this post, I will highlight Emacs’ thin waist, which I think is worthy of imitation!
+
Before we get to Emacs, let’s survey various APIs for building interactive programs.
+
+
Plain text
+
+
The simplest possible thing, the UNIX way of programs as filters, reading input from stdin and writing data to stdout.
+The language here is just plain text.
+
+
ANSI escape sequences
+
+
Adding escape codes to plain text (and a bunch of ioctls) allows changing colors and clearing the screen.
+The language becomes a sequence of commands for the terminal (with “print text” being a fairly frequent one).
+This already is rich enough to power a variety of terminal applications, such as vim!
+
+
HTML
+
+
With more structure, we can disentangle ourselves from text, and say that all the stuff is made of trees of attributed elements (whose content might be text).
+That turns out to be enough to express basically whatever, as the world of modern web apps testifies.
+
+
Canvas
+
+
Finally, to achieve maximal flexibility, we can start with a clean 2d canvas with pixels and an event stream, and let the app draw however it likes.
+Desktop GUIs usually work that way (using some particular widget library to encapsulate common patterns of presentation and event handling).
+
+
+
+
Emacs is different.
+Its thin waist consists of (using idiosyncratic olden editor terminology) frames, windows, buffers and attributed text.
+This is less general than canvas or HTML, but more general (and way more principled) than ANSI escapes.
+Crucially, this also retains most of plain text’s composability.
+
The foundation is a text with attributes — a pair of a string and a map from string’s subranges to key-value dictionaries.
+Attributes express presentation (color, font, text decoration), but also semantics.
+A range of text can be designated as clickable.
+Or it can specify a custom keymap, which is only active when the cursor is on this range.
+
I find this to be a sweet spot for building efficient user interfaces.
+Consider magit:
+
+
+
The interface is built from text, but it is more discoverable, more readable, and more efficient than GUI solutions.
+
Text is surprisingly good at communicating with humans!
+Forgoing arbitrary widgets and restricting oneself to a grid of characters greatly constrains the set of possible designs, but designs which come out of these constraints tend to be better.
+
+
The rest (buffers, windows, and frames) serve to present attributed strings to the user.
+A Buffer holds a piece of text and stores the position of the cursor (and the rest of the editor’s state for this particular piece of text).
+A tiling window manager displays buffers:
+
+
+there’s a set of floating windows (frames in Emacs terminology) managed by a desktop environment
+
+
+each floating window is subdivided into a tree of vertical and horizontal splits (windows) managed by Emacs
+
+
+each split displays a buffer, although some buffers might not have a corresponding split
+
+
+
There’s also a tasteful selection of extras outside this orthogonal model.
+A buffer holds a status bar at the bottom and a set of fringe decorations at the left edge.
+Each floating window has a minibuffer — an area to type commands into (minibuffer is a buffer though — only presentation is slightly unusual).
+
But the vast majority of everything else is not special — every significant thing is a buffer.
+So, a ./main.rs file, the ./src file tree, and a terminal session where you type cargo build are all displayed as attributed text.
+All use the same tools for navigation and manipulation.
+
Universality is the power of the model.
+Good old UNIX pipes, except interactive.
+With a GUI file manager, mass-renaming files requires a dedicated utility.
+In Emacs, file manager’s state is text, so you can use standard text-manipulation tools (regexes, multiple cursors, vim’s .) for the same task.
Pay more attention to the editor’s thin waist.
+Don’t take it as a given that an editor should be a terminal, HTML, or GUI app — there might be a better vocabulary.
+In particular, Emacs seems to hit the sweet spot with its language of attributed strings and buffers.
+
I am not sure that Emacs is the best we can do, but having a Rust library which implements Emacs model more or less as is would be nice!
+The two best resources to learn about this model are
This post will be a bit all over the place.
+Several months ago, I wrote Hard Mode Rust, exploring an allocation-conscious style of programming.
+In the ensuing discussion, @jamii name-dropped TigerBeetle, a reliable, distributed, fast, and small database written in Zig in a similar style, and, well, I now find myself writing Zig full-time, after more than seven years of Rust.
+This post is a hand-wavy answer to the “why?” question.
+It is emphatically not a balanced and thorough comparison of the two languages.
+I haven’t yet written my 100k lines of Zig to do that.
+(if you are looking for a more general “what the heck is Zig”, I can recommend @jamii’s post).
+In fact, this post is going to be less about languages, and more about styles of writing software (but pre-existing knowledge of Rust and Zig would be very helpful).
+Without further caveats, let’s get started.
To the first approximation, we all strive to write bug-free programs.
+But I think a closer look reveals that we don’t actually care about programs being correct 100% of the time, at least in the majority of the domains.
+Empirically, almost every program has bugs, and yet it somehow works out OK.
+To pick one specific example, most programs use the stack, but almost no programs understand what their stack usage is exactly, and how far they can go.
+When we call malloc, we just hope that we have enough stack space for it; we almost never check.
+Similarly, all Rust programs abort on OOM, and can’t state their memory requirements up-front.
+Certainly good enough, but not perfect.
+
The second approximation is that we strive to balance program usefulness with the effort to develop the program.
+Bugs reduce usefulness a lot, and there are two styles of software engineering to deal with them:
+
Erlang style, where we embrace failability of both hardware and software and explicitly design programs to be resilient to partial faults.
+
SQLite style, where we overcome an unreliable environment at the cost of rigorous engineering.
+
rust-analyzer and TigerBeetle are perfect specimens of the two approaches; let me describe them.
rust-analyzer is an LSP server for the Rust programming language.
+By its nature, it’s expansive.
+Great developer tools usually have a feature for every niche use-case.
+It also is a fast-moving open source project which has to play catch-up with the rustc compiler.
+Finally, the nature of IDE dev tooling makes availability significantly more important than correctness.
+An erroneous completion option would cause a smirk (if it is noticed at all), while the server crashing and all syntax highlighting turning off will be noticed immediately.
+
For this cluster of reasons, rust-analyzer is shifted far towards the “embrace software imperfections” side of the spectrum.
+rust-analyzer is designed around having bugs.
+All the various features are carefully compartmentalized at runtime, such that panicking code in just a single feature can’t bring down the whole process.
+Critically, almost no code has access to any mutable state, so usage of catch_unwind can’t lead to a rotten state.
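+A hand-wavy Rust sketch of that shape (not rust-analyzer’s actual code): each feature runs over an immutable snapshot behind catch_unwind, so a panic is downgraded to a missing result.
+
+```rust
+use std::panic::{catch_unwind, AssertUnwindSafe};
+
+/// An immutable view of the analysis state; features cannot mutate it.
+struct Snapshot;
+
+/// Run a single feature; a panic inside it yields None instead of
+/// killing the process or leaving shared state half-updated.
+fn run_feature<T>(snapshot: &Snapshot, feature: impl FnOnce(&Snapshot) -> T) -> Option<T> {
+    catch_unwind(AssertUnwindSafe(|| feature(snapshot))).ok()
+}
+```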
+
Development process itself is informed by this calculus.
+For example, PRs with new features land when there’s a reasonable certainty that the happy case works correctly.
+If some weird incomplete code would cause the feature to crash, that’s OK.
+It might even be a benefit — fixing a well-reproducible bug in an isolated feature is a gateway drug to heavy contribution to rust-analyzer.
+Our tight weekly release schedule (and the nightly release) help to get bug fixes out there faster.
+
Overall, the philosophy is to maximize provided value by focusing on the common case.
+Edge cases become eventually correct over time.
TigerBeetle is a database, with a domain model fixed at compile time (we currently do double-entry bookkeeping).
+The database is distributed, meaning that there are six TigerBeetle replicas running on different geographically and operationally isolated machines, which together implement a replicated state machine.
+That is, TigerBeetle replicas exchange messages to make sure every replica processes the same set of transactions, in the same order.
+That’s a surprisingly hard problem if you allow machines to fail (the whole point of using many machines for redundancy), so we use a smart consensus algorithm (non-byzantine) for this.
+Traditionally, consensus algorithms assume reliable storage — data once written to disk can be always retrieved later.
+In reality, storage is unreliable, nearly byzantine — a disk can return bogus data without signaling an error, and even a single such error can break consensus.
+TigerBeetle combats that by allowing a replica to repair its local storage using data from other replicas.
+
On the engineering side of things, we are building a reliable, predictable system.
+And predictable means really predictable.
+Rather than reining in sources of non-determinism, we build the whole system from the ground up from a set of fully deterministic, hand crafted components.
+Here are some of our unconventional choices (design doc):
+
It’s hard mode!
+We allocate all the memory at startup, and there’s zero allocation after that.
+This removes all the uncertainty about allocation.
+
The code is architected with brutal simplicity.
+As a single example, we don’t use JSON, or ProtoBuf, or Cap’n’Proto for serialization.
+Rather, we just cast the bytes we received from the network to a desired type.
+The motivation here is not so much performance, as reduction of the number of moving parts.
+Parsing is hard, but, if you control both sides of the communication channel, you don’t need to do it, you can send checksummed data as is.
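+For illustration only (TigerBeetle itself is Zig, and its actual wire format and field names differ), reading a fixed-layout header without a parsing step might look like this in Rust:
+
+```rust
+use std::mem::{size_of, MaybeUninit};
+
+/// A fixed-layout header; the fields here are made up.
+#[repr(C)]
+#[derive(Clone, Copy)]
+struct Header {
+    checksum: u64,
+    size: u32,
+    command: u32,
+}
+
+/// "Parse" by copying the bytes into a properly aligned Header;
+/// there is no field-by-field decoding.
+fn read_header(bytes: &[u8]) -> Option<Header> {
+    if bytes.len() < size_of::<Header>() {
+        return None;
+    }
+    let mut header = MaybeUninit::<Header>::uninit();
+    unsafe {
+        std::ptr::copy_nonoverlapping(bytes.as_ptr(), header.as_mut_ptr().cast::<u8>(), size_of::<Header>());
+        Some(header.assume_init())
+    }
+}
+```
+
+In the real system the checksum would of course be verified before the rest of the header is trusted.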
+
We aggressively minimize all dependencies.
+We know exactly the system calls our system is making, because all IO is our own code (on Linux, our main production platform, we don’t link libc).
+
There’s little abstraction between components — all parts of TigerBeetle work in concert.
+For example, one of our core types, Message, is used throughout the stack:
+
+
+network receives bytes from a TCP connection directly into a Message
+
+
+consensus processes and sends Messages
+
+
+similarly, storage writes Messages to disk
+
+
+
This naturally leads to very simple and fast code.
+We don’t need to do anything special to be zero copy — given that we allocate everything up-front, we simply don’t have any extra memory to copy the data to!
+(A separate issue is that, arguably, you just can’t treat storage as a separate black box in a fault-tolerant distributed system, because storage is also faulty).
+
Everything in TigerBeetle has an explicit upper-bound.
+There’s not a thing which is just a u32 — all data is checked to meet specific numeric limits at the edges of the system.
+
This includes Messages.
+We just upper-bound how many messages can be in-memory at the same time, and allocate precisely that amount of messages (source).
+Getting a new message from the message pool can’t allocate and can’t fail.
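+A toy Rust sketch of such a pool (names and sizes invented): everything is allocated once, and acquiring never allocates.
+
+```rust
+struct Message {
+    buffer: Box<[u8; 4096]>,
+}
+
+struct MessagePool {
+    /// Filled to capacity at startup; never grows afterwards.
+    free: Vec<Message>,
+}
+
+impl MessagePool {
+    fn new(capacity: usize) -> MessagePool {
+        let free = (0..capacity)
+            .map(|_| Message { buffer: Box::new([0; 4096]) })
+            .collect();
+        MessagePool { free }
+    }
+
+    /// The protocol bounds the number of in-flight messages, so by
+    /// construction the pool is never empty when this is called.
+    fn acquire(&mut self) -> Message {
+        self.free.pop().expect("message bound violated")
+    }
+
+    fn release(&mut self, message: Message) {
+        self.free.push(message);
+    }
+}
+```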
+
With all that strictness and explicitness about resources, of course we also fully externalize any IO, including time.
+All inputs are passed in explicitly; there are no ambient influences from the environment.
+And that means that the bulk of our testing consists of trying all possible permutations of effects of the environment.
+Deterministic randomized simulation is very effective at uncovering issues in real implementations of distributed systems.
+
What I am getting at is that TigerBeetle isn’t really a normal “program” program.
+It strictly is a finite state machine, explicitly coded as such.
I find myself often returning to the first Rust slide deck.
+A lot of core things are different (Rust no longer uses only the old ideas), but a lot is the same.
+To be a bit snarky, while Rust “is not for lone genius hackers”, Zig … kinda is.
+On more peaceable terms, while Rust is a language for building modular software, Zig is in some sense anti-modular.
That’s the core of what Rust is doing: it provides you with a language to precisely express the contracts between components, such that components can be integrated in a machine-checkable way.
+
Zig doesn’t do that. It isn’t even memory safe. My first experience writing a non-trivial Zig program went like this:
+
+
+
However!
+Zig is a much smaller language than Rust.
+Although you’ll have to keep the entirety of the program in your head, and control heaven and earth so as not to mess up resource management, doing that could be easier.
+
It’s not true that rewriting a Rust program in Zig would make it simpler.
+On the contrary, I expect the result to be significantly more complex (and segfaulty).
+I noticed that a lot of Zig code written in “let’s replace RAII with defer” style has resource-management bugs.
+
But it often is possible to architect the software such that there’s little resource management to do (eg, allocating everything up-front, like TigerBeetle, or even at compile time, like many smaller embedded systems).
+It’s hard — simplicity is always hard.
+But, if you go this way, I feel like Zig can provide substantial benefits.
+
Zig has just a single feature, dynamically-typed comptime, which subsumes most of the special-cased Rust machinery.
+It is definitely a tradeoff: instantiation-time errors are much worse for complex cases.
+But a lot more of the cases are simple, because there’s no need for programming in the language of types.
+Zig is very spartan when it comes to the language.
+There are no closures — if you want them, you’ll have to pack a wide-pointer yourself.
+Zig’s expressiveness is aimed at producing just the right assembly, not at allowing maximally concise and abstract source code.
+In the words of Andrew Kelley, Zig is a DSL for emitting machine code.
+
Zig strongly prefers explicit resource management.
+A lot of Rust programs are web-servers.
+Most web servers have a very specific execution pattern of processing multiple independent short-lived requests concurrently.
+The most natural way to code this would be to give each request a dedicated bump allocator, which turns drops into no-ops and “frees” the memory in bulk after each request by resetting the offset to zero.
+This would be pretty efficient, and would provide per-request memory profiling and limiting out of the box.
+I don’t think any popular Rust frameworks do this — using the global allocator is convenient enough and creates a strong local optimum.
+Zig forces you to pass the allocator in, so you might as well think about the most appropriate one!
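+A minimal bump-arena sketch in Rust (my own illustration; it ignores alignment and hands out raw byte slices):
+
+```rust
+struct Bump {
+    buffer: Vec<u8>,
+    offset: usize,
+}
+
+impl Bump {
+    fn with_capacity(capacity: usize) -> Bump {
+        Bump { buffer: vec![0; capacity], offset: 0 }
+    }
+
+    /// Hand out `len` bytes of the request's budget, or None if it is exhausted.
+    fn alloc(&mut self, len: usize) -> Option<&mut [u8]> {
+        let start = self.offset;
+        let end = start.checked_add(len)?;
+        if end > self.buffer.len() {
+            return None;
+        }
+        self.offset = end;
+        Some(&mut self.buffer[start..end])
+    }
+
+    /// Called at the end of each request: individual "drops" are no-ops,
+    /// and the whole arena is freed at once by resetting the offset.
+    fn reset(&mut self) {
+        self.offset = 0;
+    }
+}
+```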
+
Similarly, the standard library is very conscious about allocation, more so than Rust’s.
+Collections are not parametrized by an allocator, like in C++ or (future) Rust.
+Rather, an allocator is passed in explicitly to every method which actually needs to allocate.
+This is Call Site Dependency Injection, and it is more flexible.
+For example in TigerBeetle we need a couple of hash maps.
+These maps are sized at startup time to hold just the right number of elements, and are never resized.
+So we pass an allocator to the init method, but we don’t pass it to the event loop.
+We get to both use the standard hash-map, and to feel confident that there’s no way we can allocate in the actual event loop, because it doesn’t have access to an allocator.
First, I think Zig’s strength lies strictly in the realm of writing “perfect” systems software.
+It is a relatively thin slice of the market, but it is important.
+One of the problems with Rust is that we don’t have a reliability-oriented high-level programming language with a good quality of implementation (modern ML, if you will).
+This is a blessing for Rust, because it makes its niche bigger, increasing the amount of community momentum behind the language.
+This is also a curse, because a bigger niche makes it harder to maintain focus.
+For Zig, Rust already plays this role of “modern ML”, which creates bigger pressure to specialize.
+
+Second, my biggest worry about Zig is its semantics around the aliasing, provenance, mutability, and self-reference ball of problems.
+I don’t worry all that much about this creating “iterator invalidation” style of UB.
+TigerBeetle runs in -DReleaseSafe, which mostly solves spatial memory safety; it doesn’t really do dynamic memory allocation, which unasks the question of temporal memory safety;
+and it has a very thorough fuzzer-driven test suite, which squashes the remaining bugs.
+I do worry about the semantics of the language itself.
+My current understanding is that, to correctly compile a C-like low-level language, one really needs to nail down semantics of pointers.
+I am not sure “portable assembly” is really a thing: it is possible to create a compiler which does little optimization and “works as expected” most of the time, but I am doubtful that it’s possible to correctly describe the behavior of such a compiler.
+If you start asking questions about what are pointers, and what is memory, you end up in a fairly complicated land, where bytes are poison.
+Rust tries to define that precisely, but writing code which abides by the Rust rules without a borrow-checker isn’t really possible — the rules are too subtle.
+Zig’s implementation today is very fuzzy around potentially aliased pointers, copies of structs with interior-pointers and the like.
+I wish that Zig had a clear answer to what the desired semantics is.
+
Third, IDE support.
+I’ve written about that before on this blog.
+As of today, developing Zig is quite pleasant — the language server is pretty spartan, but already quite helpful, and, for the rest, Zig is exceptionally greppable.
+But, with the lazy compilation model and the absence of out-of-the-language meta programming, I feel like Zig could be more ambitious here.
+To position itself well for the future in terms of IDE support, I think it would be nice if the compiler got a basic data model for the IDE use-case.
+That is, there should be an API to create a persistent analyzer process, which ingests a stream of code edits, and produces a continuously updated model of the code without explicit compilation requests.
+The model can be very simple, just “give me an AST of this file at this point in time” would do — all the fancy IDE features can be filled in later.
+What matters is a shape of data flow through the compiler — not an edit-compile cycle, but rather a continuously updated view of the world.
+
Fourth, one of the values of Zig which resonates with me a lot is a preference for low-dependency, self-contained processes.
+Ideally, you get yourself a ./zig binary, and go from there.
+The preference, at this time of changes, is to bundle a particular version of ./zig with a project, instead of using a system-wide zig.
+There are two aspects that could be better.
+
“Getting yourself a Zig” is a finicky problem, because it requires bootstrapping.
+To do that, you need to run some code that will download the binary for your platform, but each platform has its own way to “run code”.
+I wish that Zig provided a blessed set of scripts, get_zig.sh, get_zig.bat, etc (or maybe a small actually portable binary?), which projects could just vendor, so that the contribution experience becomes fully project-local and self-contained:
+
+
+
Once you have ./zig, you can use that to drive the rest of the automation.
+You already can ./zig build to drive the build, but there’s more to software than just building.
+There’s always a long tail of small things which traditionally get solved with a pile of platform-dependent bash scripts.
+I wish that Zig pushed the users harder towards specifying all that automation in Zig.
+A picture is worth a thousand words, so
+
+
+
Attempting to summarize,
+
+
+Rust is about compositional safety, it’s a more scalable language than Scala.
+
+
+Zig is about perfection.
+It is a very sharp, dangerous, but, ultimately, more flexible tool.
+
Rust is vertically scalable, in that you can write all kinds of software in it.
+You can write an advanced zero-alloc image compression library, build a web server exposing the library to the world as an HTTP SAAS, and cobble together a “script” for building, testing, and deploying it to wherever people deploy software these days.
+And you would only need Rust — while it excels in the lowest half of the stack, it’s pretty ok everywhere else too.
Rust is horizontally scalable, in that you can easily parallelize development of large software artifacts across many people and teams.
+Rust itself moves at breakneck speed, which is surprising for such a loosely coordinated and chronically understaffed open source project of this scale.
+The relatively small community managed to put together a comprehensive ecosystem of composable high-quality crates on short notice.
+Rust is so easy to compose reliably that even the stdlib itself does not shy away from pulling dependencies from crates.io.
+
Steve Klabnik wrote about Rust’s Golden Rule,
+how function signatures are mandatory and authoritative and explicitly define the interface both for the callers of the function and for the function’s body.
+This thinking extends to other parts of the language.
+
My second most favorite feature of Rust (after safety) is its module system.
+It has first-class support for the concept of a library.
+A library is called a crate and is a tree of modules, a unit of compilation, and a principal visibility boundary.
+Modules can contain circular dependencies, but libraries always form a directed acyclic graph.
+There’s no global namespace of symbols — libraries are anonymous, names only appear on dependency edges between two libraries, and are local to the downstream crate.
+
The benefits of this core compilation model are then greatly amplified by Cargo, which is not a generalized task runner, but rather a rigid specification for what is a package of Rust code:
+
+
+a (library) crate,
+
+
+a manifest, which defines dependencies between packages in a declarative way, using semver,
+
+
+an ecosystem-wide agreement on the semantics of dependency specification, and accompanied dependency resolution algorithm.
+
+
+
Crucially, there’s absolutely no way in Cargo to control the actual build process.
+The build.rs file can be used to provide extra runtime inputs, but it’s cargo who calls rustc.
+
Again, Cargo defines a rigid interface for a reusable piece of Rust code.
+Both producers and consumers must abide by these rules; there is no way around them.
+As a reward, they get a super-power of working together by working apart.
+I don’t need to ping dtolnay in Slack when I want to use serde-json because we implicitly pre-agreed to a shared golden rule.
+
+
+
+
+
+
+
+
diff --git a/2023/04/02/ub-might-be-the-wrong-term-for-newer-languages.html b/2023/04/02/ub-might-be-the-wrong-term-for-newer-languages.html
new file mode 100644
index 00000000..7e8c9cbd
--- /dev/null
+++ b/2023/04/02/ub-might-be-the-wrong-term-for-newer-languages.html
@@ -0,0 +1,132 @@
+
+
+
+
+
+
+ UB Might Be a Wrong Term for Newer Languages
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
A short note on undefined behavior, which assumes familiarity with the subject (see this article for the introduction).
+The TL;DR is that I think that carrying the wording from the C standard into newer languages, like Zig and Rust, might be a mistake.
+This is strictly about the word choice, a “lexical syntax of the comments” kind of argument.
+
The C standard leaves many behaviors undefined.
+However, it allows any particular implementation to fill in the gaps and define some of undefined-in-the-standard behaviors.
+For example, C23 makes realloc(ptr, 0) into an undefined behavior, so that POSIX can further refine it without interfering with the standard (source).
+
It’s also valid for an implementation to leave UB undefined.
+If a program compiled with this implementation hits this UB path, the behavior of the program as a whole is undefined
+(or rather, bounded by the execution environment. It is not actually possible to summon nasal daemons, because a user-space process can not escape its memory space other than by calling syscalls, and there are no nasal daemons summoning syscalls).
+
C implementations are not required to but may define behaviors left undefined by the standard.
+A C program written for a specific implementation may rely on undefined-in-the-standard but defined-in-the-implementation behavior.
+
Modern languages like Rust and Zig re-use the “undefined behavior” term.
+However, the intended semantics is subtly different.
+A program exhibiting UB is always considered invalid.
+Even if an alternative implementation of Rust defines some of Rust’s UB, the programs hitting those behaviors would still be incorrect.
+
For this reason, I think it would be better to use a different term here.
+I am not ready to suggest a specific wording, but a couple of reasonable options would be “non-trapping programming error” or “invalid behavior”.
+The intended semantics being that any program execution containing illegal behavior is invalid under any implementation.
+
Curiously, C++ is ahead of the pack here, as it has an explicit notion of “ill-formed, no diagnostic required”.
+
Update: I’ve since learned that Zig is updating its terminology.
+The new term is illegal behavior.
+This is perfect, “illegal” has just the right connotation of being explicitly declared incorrect by a written specification.
+
+
+
+
+
+
+
diff --git a/2023/04/09/can-you-trust-a-compiler-to-optimize-your-code.html b/2023/04/09/can-you-trust-a-compiler-to-optimize-your-code.html
new file mode 100644
index 00000000..78c9f09f
--- /dev/null
+++ b/2023/04/09/can-you-trust-a-compiler-to-optimize-your-code.html
@@ -0,0 +1,557 @@
+
+
+
+
+
+
+ Can You Trust a Compiler to Optimize Your Code?
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
More or less the title this time, but first, a story about SIMD. There are three
+levels of understanding how SIMD works (well, at least I am level 3 at the moment):
+
+
+
Compilers are smart! They will auto-vectorize all the code!
+
+
+
Compilers are dumb, auto-vectorization is fragile, it’s very easy to break it
+by unrelated changes to the code. It’s always better to manually write
+explicit SIMD instructions.
+
+
+
Writing SIMD by hand is really hard — you’ll need to re-do the work for
+every different CPU architecture. Also, you probably think that, for scalar
+code, a compiler writes better assembly than you. What makes you think that
+you’d beat the compiler at SIMD, where there are more funky instructions and
+constraints? Compilers are tools. They can reliably vectorize code if it is
+written in an amenable-to-vectorization form.
+
+
+
+I’ve recently moved from the second level to the third one, and that is when the model used by a compiler for optimization clicked in my head.
+In this post, I want to explain the general framework for reasoning about compiler optimizations for static languages such as Rust or C++.
+After that, I’ll apply that framework to auto-vectorization.
+
I haven’t worked on backends of production optimizing compilers, so the following will not be academically correct, but these models are definitely helpful at least to me!
The first bit of a puzzle is understanding how a compiler views code. Some useful references here include
+The SSA Book or LLVM’s
+Language Reference.
+
Another interesting choice would be WebAssembly Specification.
+While WASM would be a poor IR for an optimizing compiler, it has a lot of structural similarities, and the core spec is exceptionally readable.
+
A unit of optimization is a function.
+Let’s take a simple function like the following:
+
+
+
In some pseudo-IR, it would look like this:
+
+
+
The most important characteristic here is that there are two kinds of entities:
+
First, there is program memory, very roughly an array of bytes.
+Compilers generally can not reason about the contents of the memory very well, because it is shared by all the functions, and different functions might interpret the contents of the memory differently.
+
Second, there are local variables.
+Local variables are not bytes — they are integers, they obey mathematical properties which a compiler can reason about.
+
For example, if a compiler sees a loop like
+
+
+
It can reason that on each iteration tmp holds i * 4 and optimize the code to
+
+
+
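+A hedged Rust rendering of that kind of rewrite (the post's own IR snippets are elided): the multiplication is replaced by an accumulator which the compiler knows equals i * 4 on every iteration.
+
+```rust
+// Before: `i * 4` is recomputed on every iteration.
+fn before(n: usize) -> usize {
+    let mut total = 0;
+    for i in 0..n {
+        let tmp = i * 4;
+        total += tmp;
+    }
+    total
+}
+
+// After: `tmp` is threaded through the loop and bumped by 4 instead.
+fn after(n: usize) -> usize {
+    let mut total = 0;
+    let mut tmp = 0;
+    for _ in 0..n {
+        total += tmp;
+        tmp += 4;
+    }
+    total
+}
+```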
This works, because all locals are just numbers.
+If we did the same computation, but all numbers were located in memory, it would be significantly harder for a compiler to reason that the transformation is actually correct.
+What if the storage for n and total actually overlaps?
+What if tmp overlaps with something which isn’t even in the current function?
+
+However, there’s a bridge between the worlds of mathematical local variables and the world of memory bytes — load and store instructions.
+The load instruction takes a range of bytes in memory, interprets the bytes as an integer, and stores that integer into a local variable.
+The store instruction does the opposite.
+By loading something from memory into a local, a compiler gains the ability to reason about it precisely.
+Thus, the compiler doesn’t need to track the general contents of memory.
+It only needs to check that it would be correct to load from memory at a specific point in time.
+
So, a compiler really doesn’t see all that well — it can only really reason about a single function at a time, and only about the local variables in that function.
Compilers are myopic.
+This can be fixed by giving more context to the compiler, which is the task of two core optimizations.
+
The first core optimization is inlining.
+It substitutes callee’s body for a specific call.
+The benefit here is not that we eliminate function call overhead; that’s relatively minor.
+The big thing is that locals of both the caller and the callee are now in the same frame, and a compiler can optimize them together.
+
Let’s look again at that Rust code:
+
+
+
The xs[i] expression there is actually a function call.
+The indexing function does a bounds check before accessing the element of an array.
+After inlining it into the sum, the compiler can see that the bounds check is dead code and eliminate it.
+
+If you look at various standard optimizations, they often look like getting rid of dumb things, which no one would actually write in the first place, so it’s not immediately clear whether it is worth implementing such optimizations.
+But the thing is, after inlining a lot of dumb things appear, because functions tend to handle the general case, and, at a specific call-site, there are usually enough constraints to dismiss many edge cases.
+
The second core optimization is scalar replacement of aggregates.
+It is a generalization of the “let’s use load to avoid reasoning about memory and reason about a local instead” idea we’ve already seen.
+
If you have a function like
+
+
+
it’s pretty difficult for the compiler to reason about it.
+It receives a pointer to some memory which holds a complex struct (ptr, len, capacity triple), so reasoning about evolution of this struct is hard.
+What the compiler can do is to load this struct from memory, replacing the aggregate with a bunch of scalar local variables:
+
+
+
This way, a compiler again gains reasoning power.
+SROA is like inlining, but for memory rather than code.
+
+To sum up, a compiler:
+
+
+is great at noticing relations between local variables and rearranging the code based on that,
+
+
+is capable of limited reasoning about the memory (namely, deciding when it’s safe to load or store).
+
+
+
+With this model in mind, we can describe which code is reliably optimizable, and which code prevents optimizations, explaining zero cost abstractions.
+
To enable inlining, a compiler needs to know which function is actually called.
+If a function is called directly, it’s pretty much guaranteed that a compiler would try to inline it.
+If the call is indirect (via function pointer, or via a table of virtual functions), in the general case a compiler won’t be able to inline that.
+Even for indirect calls, sometimes the compiler can reason about the value of the pointer and de-virtualize the call, but that relies on successful optimization elsewhere.
+
This is the reason why, in Rust, every function has a unique, zero-sized type with no runtime representation.
+It statically guarantees that the compiler could always inline the code, and makes this abstraction zero cost, because any decent optimizing compiler will melt it to nothing.
+
A higher level language might choose to always represent functions with function pointers.
+In practice, in many cases the resulting code would be equivalently optimizable.
+But there won’t be any indication in the source whether this is an optimizable case (the actual pointer is knowable at compile time) or a genuinely dynamic call.
+With Rust, the difference between guaranteed to be optimizable and potentially optimizable is reflected in the source language:
+
+
+
So, the first rule is to make most of the calls statically resolvable, to allow inlining.
+Function pointers and dynamic dispatch prevent inlining.
+Separate compilation might also get in a way of inlining, see this separate essay on the topic.
+
Similarly, indirection in memory can cause troubles for the compiler.
+
For something like this
+
+
+
the Foo struct is completely transparent for the compiler.
+
While here:
+
+
+
it is not clear cut.
+Proving something about the memory occupied by Foo does not in general transfer to the memory occupied by Bar.
+Again, in many cases a compiler can reason through boxes thanks to uniqueness, but this is not guaranteed.
+
A good homework at this point is to look at Rust’s iterators and understand why they look the way they do.
Another important point about memory is that, in general, a compiler can’t change the overall layout of stuff.
+SROA can load some data structure into a bunch of local variables, which then can, eg, replace “a pointer and an index” representation with “a pair of pointers”.
+But at the end of the day SROA would have to materialize “a pointer and an index” back and store that representation back into the memory.
+This is because memory layout is shared across all functions, so a function can not unilaterally dictate a more optimal representation.
+
Together, these observations give a basic rule for the baseline of performant code.
Let’s apply this general framework of giving a compiler optimizable code to work with to auto-vectorization.
+We will be optimizing the function which computes the longest common prefix between two slices of bytes (thanks @nkkarpov for the example).
+
A direct implementation would look like this:
+
+
+
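+The snippet itself is elided in this rendering; a plausible reconstruction of the direct version is:
+
+```rust
+/// Length of the longest common prefix, one byte at a time.
+fn common_prefix(xs: &[u8], ys: &[u8]) -> usize {
+    let mut result = 0;
+    for (x, y) in xs.iter().zip(ys.iter()) {
+        if x != y {
+            break;
+        }
+        result += 1;
+    }
+    result
+}
+```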
If you already have a mental model for auto-vectorization, or if you look at the assembly output, you can realize that the function as written works one byte at a time, which is much slower than it needs to be.
+Let’s fix that!
+
SIMD works on many values simultaneously.
+Intuitively, we want the compiler to compare a bunch of bytes at the same time, but our current code does not express that.
+Let’s make the structure explicit, by processing 16 bytes at a time, and then handling remainder separately:
+
+
+
Amusingly, this is already a bit faster, but not quite there yet.
+Specifically, SIMD needs to process all values in the chunk in parallel in the same way.
+In our code above, we have a break, which means that processing of the nth pair of bytes depends on the (n-1)st pair.
+Let’s fix that by disabling short-circuiting.
+We will check if the whole chunk of bytes matches or not, but we won’t care which specific byte is a mismatch:
+
+
+
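+Again a reconstruction rather than the post's exact code: 16-byte chunks, one whole-chunk comparison, and a byte-by-byte scan only once a chunk (or the remainder) disagrees.
+
+```rust
+fn common_prefix_chunks(xs: &[u8], ys: &[u8]) -> usize {
+    const CHUNK: usize = 16;
+    let mut result = 0;
+    for (xs_chunk, ys_chunk) in xs.chunks_exact(CHUNK).zip(ys.chunks_exact(CHUNK)) {
+        // No per-byte break inside the chunk: every pair of bytes is
+        // compared in exactly the same way.
+        if xs_chunk != ys_chunk {
+            break;
+        }
+        result += CHUNK;
+    }
+    // Finish off the mismatching chunk and the remainder one byte at a time.
+    let limit = xs.len().min(ys.len());
+    while result < limit && xs[result] == ys[result] {
+        result += 1;
+    }
+    result
+}
+```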
And this version finally lets vectorization kick in, reducing the runtime almost by an order of magnitude.
+We can now compress this version using iterators.
+
+
+
Note how the code is meaningfully different from our starting point.
+We do not blindly rely on the compiler’s optimization.
+Rather, we are aware about specific optimizations we need in this case, and write the code in a way that triggers them.
+
Specifically, for SIMD:
+
+
+we express the algorithm in terms of processing chunks of elements,
+
+
+within each chunk, we make sure that there’s no branching and all elements are processed in the same way.
+
Compilers are tools.
+While there’s a fair share of “optimistic” transformations which sometimes kick in, the bulk of the impact of an optimizing compiler comes from guaranteed optimizations with specific preconditions.
+Compilers are myopic — they have a hard time reasoning about code outside of the current function and values not held in the local variables.
+Inlining and scalar replacement of aggregates are two optimizations to remedy the situation.
+Zero cost abstractions work by expressing opportunities for guaranteed optimizations in the language’s type system.
Compilers for systems programming languages (C, C++, Rust, Zig) tend to be implemented in the languages themselves.
+The idea being that the current version of the compiler is built using some previous version.
+But how can you get a working compiler if you start out from nothing?
+
The traditional answer has been “via bootstrap chain”.
+You start with the first version of the compiler implemented in assembly, use that to compile the latest version of the compiler it is capable of compiling, then repeat.
+This historically worked OK because older versions of GCC were implemented in C (and C is easy to provide a compiler for) and, even today, GCC itself is very conservative in using language features.
+I believe GCC 10.4 released in 2022 can be built with just a C++98 compiler.
+So, if you start with a C compiler, it’s not too many hops to get to the latest GCC.
+
This doesn’t feel entirely satisfactory, as this approach requires artificially constraining the compiler itself to be very conservative.
+Rust does the opposite of that.
+Rust requires that rustc 1.x.0 is built by rustc 1.x-1.0, and there’s a new rustc version every six weeks.
+This seems like a very reasonable way to build compilers, but it also is incompatible with chain bootstrapping.
+In the limit, one would need infinite time to compile modern rustc ex nihilo!
+
I think there’s a better way if the goal is to compile the world from nothing.
+To cut to the chase, the minimal bootstrap seed for Rust could be:
+
+
+source code for current version of the compiler
+
+
+this source code compiled to core WebAssembly
+
+
+
Bootstrapping from this should be easy.
+WebAssembly is a very small language, so a runtime for it can be built out of nothing.
+Using this runtime and rustc-compiled-to-wasm, we can re-compile rustc itself.
+Then we can cross-compile it to the architecture we need, if that architecture is supported by rustc.
+If the architecture is not supported, we can implement a new backend for that arch in Rust, compile our modified compiler to wasm, and then cross-compile to the desired target.
+
More complete bootstrap seed would include:
+
+
+Informal specification of the Rust language, to make sense of the source code.
+
+
+Rust source code for the compiler, which also doubles as a formal specification of the language.
+
+
+Informal specification of WebAssembly, to make sense of .wasm parts of the bootstrap seed.
+
+
+.wasm code for the rust compiler, which triple-checks the Rust specification.
+
+
+Rust implementation of a WebAssembly interpreter, which doubles as a formal spec for WebAssembly.
+
+
+
And this seed is provided for every version of a language.
+This way, it is possible to bootstrap, in constant time, any version of Rust.
+
Specific properties we use for this setup:
+
+
+Compilation is deterministic.
+Compiling bootstrap sources with bootstrap .wasm blob should result in a byte-for-byte identical wasm blob.
+
+
+WebAssembly is target-agnostic.
+It describes abstract computation, which is completely independent from the host architecture.
+
+
+WebAssembly is simple.
+Implementing a WebAssembly interpreter is easy in whatever computation substrate you have.
+
+
+Compiler is a cross compiler.
+We don’t want to bootstrap just the WebAssembly backend, we want to bootstrap everything.
+This requires that the WebAssembly version of the compiler can generate the code for arbitrary architectures.
+
+
+
This setup does not prevent the trusting trust attack.
+However, it is possible to rebuild the bootstrap seed using a different compiler.
+Using that compiler to compile rustc to .wasm will produce a different blob.
+But using that .wasm to recompile rustc again should produce the blob from the seed (unless, of course, there’s a trojan in the seed).
+
This setup does not minimize the size of opaque binary blobs in the seed.
+The size of the .wasm would be substantial.
+This setup, however, does minimize the total size of the seed.
+In the traditional bootstrap, source code for rustc 1.0.0, rustc 1.1.0, rustc 1.2.0, etc would also have to be part of the seed.
+For the suggested approach, you need only one version, at the cost of a bigger binary blob.
+
This idea is not new.
+I think it was popularized by Pascal with p-code.
+OCaml uses a similar strategy.
+Finally, Zig makes an important observation that we no longer need to implement language-specific virtual machines, because WebAssembly is a good fit for the job.
In this post, I will present a theoretical design for an interner.
+It should be fast, but there will be no benchmarks as I haven’t implemented the thing.
+So it might actually be completely broken or super slow for one reason or another.
+Still, I think there are a couple of neat ideas, which I would love to call out.
+
The context for the post is this talk by Andrew Kelley, which notices that it’s hard to reconcile interning and parallel compilation.
+This is something I have been thinking about a lot in the context of rust-analyzer, which relies heavily on pointers, atomic reference counting and indirection to make incremental and parallel computation possible.
+
And yes, interning (or, more generally, assigning unique identities to things) is a big part of that.
+
Usually, compilers intern strings, but we will be interning trees today.
+Specifically, we will be looking at something like a Value type from the Zig compiler.
+In a simplified RAII style it could look like this:
+
+
+
Such values are individually heap-allocated and in general are held behind pointers.
+Zig’s compiler adds a couple of extra tricks to this structure, like not overallocating for small enum variants:
+
+
+
But how do we intern this stuff, such that:
+
+
+values are just u32 rather than full pointers,
+
+
+values are deduplicated,
+
+
+and this whole construct works efficiently even if there are multiple threads
+using our interner simultaneously?
+
+
+
Let’s start with concurrent SegmentedList:
+
+
+
Segmented list is like ArrayList with an extra super power that pushing new items does not move/invalidate old ones.
+In normal ArrayList, when the backing storage fills up, you allocate a slice twice as long, copy over the elements from the old slice and then destroy it.
+In SegmentList, you leave the old slice where it is, and just allocate a new one.
+
Now, as we are writing an interner and want to use u32 for an index, we know that we need to store 1<<32 items max.
+But that means that we’ll need at most 31 segments for our SegmentList:
+
+
+
+So we can just “pre-allocate” an array of 31 pointers to the segments, hence
+
+
+
If we want to be more precise with types, we can even use a tuple whose elements are nullable pointers to arrays of power-of-two sizes:
+
+
+
Indexing into such an echeloned array is still O(1).
+Here’s how echelons look in terms of indexes
+
+
+
The first n echelons hold 2**n - 1 elements.
+So, if we want to find the ith item, we first find the echelon it is in, by computing the nearest smaller power of two of i + 1, and then index into the echelon with i - (2**n - 1), give or take a +1 here or there.
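+In Rust for illustration (the post's actual code is Zig), and assuming echelon n holds 2^n items, the index math could look like this:
+
+```rust
+/// Map a flat index to (echelon, offset within that echelon).
+/// Assumes i < u32::MAX, so that i + 1 does not overflow.
+fn locate(i: u32) -> (u32, u32) {
+    // Echelon number = position of the highest set bit of i + 1.
+    let echelon = 31 - (i + 1).leading_zeros();
+    // The first `echelon` echelons hold 2^echelon - 1 items in total.
+    let offset = i - ((1u32 << echelon) - 1);
+    (echelon, offset)
+}
+```
+
+For example, locate(0) is (0, 0), and locate(6) is (2, 3): the last slot of the four-element echelon.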
+
+
+
Note that we pre-allocate an array of pointers to segments, but not the segments themselves.
+Pointers are nullable, and we allocate new segments lazily, when we actually write to the corresponding indexes.
+This structure is very friendly to parallel code.
+Reading items works because items are never reallocated.
+Lazily allocating new echelons is easy, because the position of the pointer is fixed.
+That is, we can do something like this to insert an item at position i:
+
+
+compute the echelon index
+
+
+@atomicLoad(.Acquire) the pointer
+
+
+if the pointer is null
+
+
+allocate the echelon
+
+
+@cmpxchgStrong(.Acquire, .Release) the pointer
+
+
+free the redundant echelon if exchange failed
+
+
+
+
+insert the item
+
+
+
Notice how we don’t need any locks or even complicated atomics, at the price of sometimes doing a second redundant allocation.
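+Here is what that lazy initialization might look like, sketched in Rust with AtomicPtr (the original is Zig, and the echelon size here is arbitrary):
+
+```rust
+use std::ptr;
+use std::sync::atomic::{AtomicPtr, Ordering};
+
+const ECHELON_LEN: usize = 1024; // stand-in size for one particular echelon
+
+fn get_or_init_echelon(slot: &AtomicPtr<[u64; ECHELON_LEN]>) -> *mut [u64; ECHELON_LEN] {
+    let existing = slot.load(Ordering::Acquire);
+    if !existing.is_null() {
+        return existing;
+    }
+    // Optimistically allocate; a concurrent thread may win the race.
+    let fresh = Box::into_raw(Box::new([0u64; ECHELON_LEN]));
+    match slot.compare_exchange(ptr::null_mut(), fresh, Ordering::Release, Ordering::Acquire) {
+        Ok(_) => fresh,
+        // Lost the race: free our copy and use the winner's echelon.
+        Err(winner) => {
+            unsafe { drop(Box::from_raw(fresh)) };
+            winner
+        }
+    }
+}
+```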
+
One thing this data structure is bad at is doing bounds checks and tracking which items are actually initialized.
+For the interner use-case, we will rely on an invariant that we always use indexes provided to us by someone else, such that possession of the index signifies that:
+
+
+the echelon holding the item is allocated
+
+
+the item itself is initialized
+
+
+there’s the relevant happens-before established
+
+
+
If, instead, we manufacture an index out of thin air, we might hit all kinds of nasty behavior without any bullet-proof way to check that.
+
Okay, now that we have this SegmentList, how would we use them?
+
Recall that our simplified value is
+
+
+
Of course we will struct-of-array it now, to arrive at something like this:
+
+
+
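+Transposed to Rust for illustration (the post's code is Zig, and it would use the SegmentLists above rather than Vec):
+
+```rust
+#[repr(u8)]
+#[derive(Clone, Copy)]
+enum Tag {
+    U64,
+    Aggregate,
+}
+
+/// A value is just an index into the `tag` and `data` columns.
+#[derive(Clone, Copy)]
+struct Value(u32);
+
+struct ValueTable {
+    tag: Vec<Tag>,               // one tag byte per value
+    data: Vec<u32>,              // small payload, or an index into a payload column
+    u64s: Vec<u64>,              // payloads for Tag::U64
+    aggregates: Vec<Vec<Value>>, // payloads for Tag::Aggregate
+}
+```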
A Value is now an index.
+This index works for two fields of ValueTable, tag and data.
+That is, the index addresses five bytes of payload, which is all that is needed for small values.
+For large tags like aggregate, the data field stores an index into the corresponding payload SegmentList.
+
+That is, every value allocates tag and data elements, but only actual u64s occupy a slot in the u64 SegmentList.
+
So now we can write a lookup function which takes a value index and reconstructs a value from pieces:
+
+
+
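+Continuing the Rust sketch from above (still purely illustrative):
+
+```rust
+/// A borrowed, non-owning view of a value.
+enum ValueFull<'a> {
+    U64(u64),
+    Aggregate(&'a [Value]),
+}
+
+impl ValueTable {
+    fn lookup(&self, value: Value) -> ValueFull<'_> {
+        let i = value.0 as usize;
+        match self.tag[i] {
+            Tag::U64 => ValueFull::U64(self.u64s[self.data[i] as usize]),
+            Tag::Aggregate => ValueFull::Aggregate(self.aggregates[self.data[i] as usize].as_slice()),
+        }
+    }
+}
+```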
+Note that here ValueFull is a non-owning type; it is a reference into the actual data.
+Note as well that aggregates now store a slice of indexes, rather than a slice of pointers.
+
Now let’s deal with creating and interning values.
+We start by creating a ValueFull using data owned by us
+(e.g. if we are creating an aggregate, we may use a stack-allocated array as a backing store for []Value slice).
+Then we ask ValueTable to intern the data:
+
+
+
If the table already contains an equal value, its index is returned.
+Otherwise, the table copies the ValueFull data such that it is owned by the table itself, and returns a freshly allocated index.
+
For bookkeeping, we’ll need a hash table with existing values and a counter to use for a fresh index, something like this:
+
+
+
Pay attention to _count fields — we have value_count guarding the tag and index, and separate counts for specific kinds of values, as we don’t want to allocate, e.g. an u64 for every value.
+
Our hashmap is actually a set which stores u32 integers, but uses ValueFull to do a lookup: when we consider interning a new ValueFull, we don’t know its index yet.
+Luckily, getOrPutAdapted API provides the required flexibility.
+We can use it to compare a Value (index) and a ValueFull by hashing a ValueFull and doing component-wise comparisons in the case of a collision.
+
Note that, because of interning, we can also hash ValueFull efficiently!
+As any subvalues in ValueFull are guaranteed to be already interned, we can rely on shallow hash and hash only child value’s indexes, rather than their data.
+
This is a nice design for a single thread, but how do we make it thread safe?
+The straightforward solution would be to slap a mutex around the logic in intern.
+
This actually is not as bad as it seems, as we’d need a lock only in intern, and lookup would work without any synchronization whatsoever.
+Recall that obtaining an index of a value is a proof that the value was properly published.
+Still, we expect to intern a lot of values, and that mutex is all but guaranteed to become a point of contention.
+And some amount of contention is inevitable here — if two threads try to intern two identical values, we want them to clash, communicate, and end up with a single, shared value.
+
There’s a rather universal recipe for dealing with contention — you can shard the data.
+In our case, rather than using something like
+
+
+
we can do
+
+
+
That is, we create not one, but sixteen hashmaps, and use, e.g., lower 4 bits of the hash to decide which mutex and hashmap to use.
+Depending on the structure of the hashmap, such locks could even be pushed as far as individual buckets.
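+Sketching the shape of it in Rust (the shard count and hashing details are arbitrary here):
+
+```rust
+use std::collections::hash_map::DefaultHasher;
+use std::collections::HashMap;
+use std::hash::{Hash, Hasher};
+use std::sync::Mutex;
+
+struct Sharded<K, V> {
+    shards: [Mutex<HashMap<K, V>>; 16],
+}
+
+impl<K: Hash + Eq, V> Sharded<K, V> {
+    fn new() -> Sharded<K, V> {
+        Sharded { shards: std::array::from_fn(|_| Mutex::new(HashMap::new())) }
+    }
+
+    /// Pick the (mutex, map) pair by the low 4 bits of the key's hash.
+    fn shard(&self, key: &K) -> &Mutex<HashMap<K, V>> {
+        let mut hasher = DefaultHasher::new();
+        key.hash(&mut hasher);
+        &self.shards[(hasher.finish() & 0xF) as usize]
+    }
+}
+```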
+
This doesn’t solve all our contention problems — now that several threads can simultaneously intern values (as long as they are hashed into different shards) we have to make all count variables atomic.
+So we essentially moved the single global point of contention from a mutex to value_count field, which is incremented for every interned value.
+
We can apply the sharding trick again, and shard all our SegmentLists.
+But that would mean that we have to dedicate some bits from Value index to the shard number, and to waste some extra space for non-perfectly balanced shards.
+
There’s a better way — we can amortize atomic increments by allowing each thread to bulk-allocate indexes.
+That is, if a thread wants to allocate a new value, it atomically increments value_count by, say, 1024, and uses those indexes for the next thousand allocations.
+In addition to ValueTable, each thread now gets its own distinct LocalTable:
+
+
+
An attentive reader would notice a bonus here: in this setup, a thread allocates a contiguous chunk of values.
+It is reasonable to assume that values allocated together would also be used together, so we potentially increase future spatial locality here.
+
Putting everything together, the pseudo-code for interning would look like this:
+
+
+
+Note that it is important that we don’t release the mutex immediately after assigning the index for a value, but rather keep it locked all the way until we have fully copied the value into the ValueTable.
+If we release the lock earlier, a different thread which tries to intern the same value would get the correct index, but would risk accessing partially-initialized data.
+This can be optimized a bit by adding a value-specific lock (or rather, a Once).
+So we use the shard lock to assign an index, then release the shard lock, and use value-specific lock to do the actual (potentially slow) initialization.
+
And that’s all I have for today!
+Again, I haven’t implemented this, so I have no idea how fast or slow it actually is.
+But the end result looks rather beautiful, and builds upon many interesting ideas:
+
+
+
+SegmentList allows maintaining index stability despite insertions.
+
+
+
+There will be at most 31 echelons in a SegmentList, so you can put pointers to them into an array, removing the need to synchronize to read an echelon.
+
+
+
With this setup, it becomes easy to initialize a new echelon with a single CAS.
+
+
+
Synchronization is required only when creating a new item.
+If you trust indexes, you can use them to carry happens-before.
+
+
+
In a struct-of-arrays setup for enums, you can save space by requiring that an array for a specific variant is just as long as it needs to be.
+
+
+
One benefit of interning trees is that hash function becomes a shallow operation.
An amateur note on language design which explores two important questions:
+
+
+How to do polymorphism?
+
+
+How to do anything at all?
+
+
+
Let’s start with the second question.
+What is the basic stuff that everything else is made of?
+
+Not so long ago, the most popular answer to that question was “objects” — blobs of mutable state with references to other blobs.
+This turned out to be problematic — local mutation of an object might accidentally cause unwanted changes elsewhere.
+Defensive copying of collections at the API boundary was a common pattern.
+
Another answer to the question of basic stuff is “immutable values”, as exemplified by functional programming.
+This fixes the ability to reason about programs locally at the cost of developer ergonomics and expressiveness.
+A lot of code is naturally formulated in terms of “let’s mutate this little thing”, and functionally threading the update through all the layers is tiresome.
+
The C answer is that everything is made of “memory (*)”.
+It is almost as if memory is an array of bytes.
+Almost, but not quite — to write portable programs amenable to optimization, certain restrictions must be placed on the ways memory is accessed and manipulated, hence (*).
+These restrictions not being checked by the compiler (and not even visible in the source code) create a fertile ground for subtle bugs.
+
Rust takes this basic C model and:
+
+
+Makes the (*) explicit:
+
+
+pointers always carry the size of addressed memory, possibly at runtime (slices),
+
+
+pointers carry lifetime, accessing the data past the end of the lifetime is forbidden.
+
+
+
+
+Adds aliasing information to the type system, such that it becomes possible to tell if there are other pointers pointing at a particular piece of memory.
+
+
+
+Curiously, this approach allows Rust to have an “immutable values” feel, without requiring the user to thread updates manually,
+“In Rust, Ordinary Vectors are Values”.
+But the cognitive cost for this approach is pretty high, as the universe of values is now forked by different flavors of owning/referencing.
+
Let’s go back to the pure FP model.
+Can we just locally fix it?
+Let’s take a look at an example:
+
+
+
It is pretty clear that we can allow mutation of local variables via a simple rewrite, as that won’t compromise local reasoning:
+
+
+
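+The post's own snippets are elided here; a hedged Rust stand-in for the kind of rewrite meant is:
+
+```rust
+// Pure version: the running total is threaded through a fold.
+fn total_pure(xs: &[i32]) -> i32 {
+    xs.iter().fold(0, |total, x| total + x)
+}
+
+// Rewritten version: mutation of a local variable, invisible from the
+// outside, so local reasoning is not compromised.
+fn total_mutable(xs: &[i32]) -> i32 {
+    let mut total = 0;
+    for x in xs {
+        total += x;
+    }
+    total
+}
+```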
Similarly, we can introduce a rewrite rule for the ubiquitous x = f(x) pattern, such that the code looks like this:
+
+
+
Does this actually work?
+Yes, it does, as popularized by Swift and distilled in its pure form by Val.
+
Formalizing the rewriting reasoning, we introduce second-class references, which can only appear in function arguments (inout parameters), but, eg, can’t be stored as fields.
+With these restrictions, “borrow checking” becomes fairly simple — at each function call it suffices to check that no two inout arguments overlap.
+
Now, let’s switch gears and explore the second question — polymorphism.
+
Starting again with OOP, you can use subtyping with its familiar class Dog extends Triangle, but that is not very flexible.
+In particular, expressing something like “sorting a list of items” with pure subtyping is not too natural.
+What works better is parametric polymorphism, where you add type parameters to your data structures:
+
+
+
+Except that it doesn’t quite work, as we also need to specify how to sort the Ts.
+One approach here would be to introduce some sort of type-of-types, to group types with similar traits into a class:
+
+
+
A somewhat simpler approach is to just explicitly pass in a comparison function:
+
+
+
How does this relate to value oriented programming?
+It happens that, when programming with values, a very common pattern is to use indexes to express relationships.
+For example, to model parent-child relations (or arbitrary graphs), the following setup works:
+
+
+
Using direct references hits language limitations:
+
+
+
Another good use-case is interning, where you have something like this:
+
+
+
How do we sort a Vec<Name>?
+We can’t use the type class approach here, as knowing the type of Name isn’t enough to sort names lexicographically; an instance of NameTable is also required to fetch the actual string data.
+The approach with just passing in a comparison function works, as it can close over the correct NameTable in scope.
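+A small Rust sketch of that approach (Name and NameTable are the post's hypothetical types, fleshed out here with assumed fields):
+
+```rust
+struct Name(u32);
+
+struct NameTable {
+    strings: Vec<String>, // Name(i) refers to strings[i]
+}
+
+impl NameTable {
+    fn resolve(&self, name: &Name) -> &str {
+        &self.strings[name.0 as usize]
+    }
+}
+
+/// The comparison closure closes over the NameTable in scope; the type
+/// of Name alone would not be enough to sort lexicographically.
+fn sort_names(names: &mut [Name], table: &NameTable) {
+    names.sort_by(|a, b| table.resolve(a).cmp(table.resolve(b)));
+}
+```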
+
The problem with “just pass a function” is that it gets tedious quickly.
+Rather than xs.print() you now need to say xs.print(Int::print).
+Luckily, similarly to how the compiler infers the type parameter T by default, we can allow limited inference of value parameters, which should remove most of the boilerplate.
+So, something which looks like names.print() would desugar to Vec::print_vec(self.name_table.print, names).
+
This could also synergize well with compile-time evaluation.
+If (as is the common case), the value of the implicit function table is known at compile time, no table needs to be passed in at runtime (and we don’t have to repeatedly evaluate the table itself).
+We can even compile-time partially evaluate things within the compilation unit, and use runtime parameters at the module boundaries, just like Swift does.
+
And that’s basically it!
+TL;DR: value oriented programming / mutable value semantics is an interesting “everything is X” approach to get the benefits of functional purity without giving up on mutable hash tables.
+This style of programming doesn’t work with cyclic data structures (values are always trees), so indexes are often used to express auxiliary relations.
+This, however, gets in the way of type-based generic programming — a T is no longer Comparable, only T + Context is.
+A potential fix for that is to base generic programming on explicit dictionary passing combined with implicit value parameter inference.
I already have a dedicated post about a hypothetical Zig language server.
+But perhaps the most important thing I’ve written so far on the topic is the short note at the end of Zig and Rust.
+
If you want to implement an LSP for a language, you need to start with a data model.
+If you correctly implement a store of source code which evolves over time and allows computing (initially trivial) derived data, then filling in the data until it covers the whole language is a question of incremental improvement.
+If, however, you don’t start with a rock-solid data model, and rush to implement language features, you might find yourself needing to make a sharp U-turn several years down the road.
+
I find this pretty insightful!
+At least, this evening I’ve been pondering a particular aspect of the data model, and I think I realized something new about the problem space!
+The aspect is cancellation.
Consider this.
+Your language server is happily doing something very useful and computationally-intensive —
+typechecking a giant typechecker,
+computing comptime Ackermann function,
+or talking to Postgres.
+Now, the user comes in and starts typing in the very file the server is currently processing.
+What is the desired behavior, and how could it be achieved?
+
One useful model here is strong consistency.
+If the language server acknowledged a source code edit, all future semantic requests (like “go to definition” or “code completion”) reflect this change.
+The behavior is as if all changes and requests are sequentially ordered, and the server fully processes all preceding edits before responding to a request.
+There are two great benefits to this model.
+First, for the implementor it’s an easy model to reason about: it’s always clear what the answer to a particular request should be, as the model is fully deterministic.
+Second, the model gives maximally useful guarantees to the user: strict serializability.
+
So consider this sequence of events:
+
+
+User types fo.
+
+
+The editor sends the edit to the language server.
+
+
+The editor requests completions for fo.
+
+
+The server starts furiously typechecking modified file to compute the result.
+
+
+User types o.
+
+
+The editor sends the o.
+
+
+The editor re-requests completions, now for foo.
+
+
+
How does the server deal with this?
+
The trivial solution is to run everything sequentially to completion.
+So, on step 6, the server doesn’t immediately acknowledge the edit, but rather blocks until it fully completes step 4.
+This is a suboptimal behavior, because reads (computing completion) block writes (updating source code).
+As a rule of thumb, writes should be prioritized over reads, because they reflect more up-to-date and more useful data.
+
A more optimal solution is to make the whole data model of the server immutable, such that edits do not modify data inplace, but rather create a separate, new state.
+In this model, computing results for requests 3 and 7 proceeds in parallel, and, crucially, the edit from step 6 is accepted immediately.
+The cost of this model is the requirement that all data structures are immutable.
+It also is a bit wasteful — burning CPU to compute code completion for an already old file is useless, better dedicate all cores to the latest version.
+
A third approach is cancellation.
+On step 6, when the server becomes aware of the pending edit, it actively cancels all in-flight work pertaining to the old state and then applies the modification in-place.
+That way we don’t need to defensively copy the data, and also avoid useless CPU work.
+This is the strategy employed by rust-analyzer.
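rust-analyzer's actual machinery is more elaborate, but the core of the idea can be sketched with a shared flag that in-flight work checks cooperatively (all names here are hypothetical):

```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering};

struct Cancelled;

#[derive(Clone, Default)]
struct CancellationToken(Arc<AtomicBool>);

impl CancellationToken {
    fn cancel(&self) {
        self.0.store(true, Ordering::Relaxed);
    }
    fn check(&self) -> Result<(), Cancelled> {
        if self.0.load(Ordering::Relaxed) { Err(Cancelled) } else { Ok(()) }
    }
}

// Long-running analysis sprinkles `check` calls and unwinds promptly when
// an edit arrives; only then does the server mutate the state in place.
fn typecheck(token: &CancellationToken) -> Result<(), Cancelled> {
    for _item in 0..1_000_000 {
        token.check()?;
        // ... analyze one item ...
    }
    Ok(())
}
```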
+
It’s useful to think about why the server can’t just, like, apply the edit in place completely ignoring any possible background work.
+The edit ultimately changes some memory somewhere, which might be concurrently read by the code completion thread, yielding a data race and full-on UB.
+It is possible to work-around this by applying feral concurrency control and just wrapping each individual bit of data in a mutex.
+This removes the data race, but leads to excessive synchronization, sprawling complexity and broken logical invariants (function body might change in the middle of typechecking).
+
Finally, there’s one more solution, or rather, an idea for a solution.
+One interesting approach for dealing with memory which is needed now, but not in the future, is semi-space garbage collection.
+We divide the available memory in two equal parts, use one half as a working copy which accumulates useful objects and garbage, and then at some point switch the halves, copying the live objects (but not the garbage) over.
+Another place where this idea comes up is Carmack’s architecture for functional games.
+On every frame, a game copies over the game state applying frame update function.
+Because frames happen sequentially, you only need two copies of game state for this.
+We can think about applying something like that for cancellation — without going for full immutability, we can let the cancelled analysis work with the old half-state, while we switch to the new one.
+
This … is not particularly actionable, but a good set of ideas to start thinking about evolution of a state in a language server.
+And now for something completely different!
Strict consistency is a good default, and works especially well for languages with good support for separate compilation, as the amount of work a language server needs to do after an update is proportional to the size of the update, and to the amount of code on the screen, both of which are typically O(1).
+For Zig, whose compilation model is “start from the entry point and lazily compile everything that’s actually used”, this might be difficult to pull off.
+It seems that Zig naturally gravitates to a smalltalk-like image-based programming model, where the server stores fully resolved code all the time, and, if some edit triggers re-analysis of a huge chunk of code, the user just has to wait until the server catches up.
+
But what if we don’t do strong consistency?
+What if we allow the IDE to temporarily return non-deterministic and wrong results?
+I think we can get some nice properties in exchange, if we use that semi-space idea.
+
The state of our language server would be comprised of three separate pieces of data:
+
+
+A fully analyzed snapshot of the world, ready.
+This is a bunch of source files, plus their ASTs, ZIRs and AIRs.
+This also probably contains an index of cross-references, so that finding all usages of an identifier requires just listing already precomputed results.
+
+
+The next snapshot, which is being analyzed, working.
+This is essentially the same data, but the AIR is being constructed.
+We need two snapshots because we want to be able to query one of them while the second one is being updated.
+
+
+Finally, we also hold ASTs for the files which are currently being modified, pending.
+
+
+
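In Rust-ish pseudocode, the state might be shaped like this (all types are placeholders, used only to pin down the structure):

```rust
use std::collections::HashMap;
use std::sync::Arc;

struct Snapshot;        // sources, ASTs, ZIR/AIR, cross-reference index
struct WorkingSnapshot; // same data, with AIR still under construction
struct Ast;
type FileId = u32;

struct ServerState {
    /// Fully analyzed, inert snapshot: safe to read from any thread.
    ready: Option<Arc<Snapshot>>,
    /// At most one background analysis in flight.
    working: Option<WorkingSnapshot>,
    /// Per-file ASTs, updated synchronously on every edit.
    pending: HashMap<FileId, Ast>,
}
```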
The overall evolution of data is as follows.
+
All edits synchronously go to the pending state.
+pending is organized strictly on a per-file basis, so updating it can be done quickly on the main thread (maaaybe we want to move the parsing off the main thread, but my gut feeling is that we don’t need to).
+pending always reflects the latest state of the world, it is the latest state of the world.
+
Periodically, we collect a batch of changes from pending, create a new working and kick off a full analysis in background.
+A good point to do that would be when there are no syntax errors, or when the user saves a file.
+There’s at most one analysis in progress, so we accumulate changes in pending until the previous analysis finishes.
+
When working is fully processed, we atomically update the ready.
+As ready is just an inert piece of data, it can be safely accessed from whatever thread.
+
When processing requests, we only use ready and pending.
+Processing requires some heuristics.
+ready and pending describe different states of the world.
+pending guarantees that its state is up-to-date, but it only has AST-level data.
+ready is outdated, but it has every bit of semantic information pre-computed.
+In particular, it includes cross-reference data.
+
So, our choices for computing results are:
+
+
+
Use the pending AST.
+Features like displaying the outline of the current file or globally fuzzy-searching function by name can be implemented like this.
+These features always give correct results.
+
+
+
Find the match between the pending AST and the ready semantics.
+This works perfectly for non-local “goto definition”.
+Here, we can temporarily get “wrong” results, or no result at all.
+However, the results we get are always instant.
+
+
+
Re-analyze pending AST using results from ready for the analysis of the context.
+This is what we’ll use for code completion.
+For code completion, pending will be maximally diverging from ready (especially if we use “no syntax errors” as a heuristic for promoting pending to working),
+so we won’t be able to complete based purely on ready.
+At the same time, completion is heavily semantics-dependent, so we won’t be able to drive it through pending.
+And we also can’t launch full semantic analysis on pending (what we effectively do in rust-analyzer), due to the “from the root” nature of the analysis.
+
But we can merge two analysis techniques.
+For example, if we are completing in a function which starts as fn f(comptime T: type, param: T),
+we can use ready to get a set of values of T the function is actually called with, to complete param. in a useful way.
+Dually, if inside f we have something like const list = std.ArrayList(u32){}, we don’t have to comptime evaluate the ArrayList function, we can fetch the result from ready.
+
Of course, we must also handle the case where there’s no ready yet (it’s a first compilation, or we switched branches), so completion would be somewhat non-deterministic.
+
+
+
One important flow where non-determinism would get in the way is refactoring.
+When you rename something, you should be 100% sure that you’ve found all usages.
+So, any refactor would have to be a blocking operation where we first wait for the current working to complete, then update working with the pending accumulated so far, and wait for that to complete, to, finally, apply the refactor using only up-to-date ready.
+Luckily, refactoring is almost always a two-phase flow, reminiscent of a GET/POST flow for HTTP form (more about that).
+Any refactor starts with read-only analysis to inform the user about available options and to gather input.
+For “rename”, you wait for the user to type the new name, for “change signature” the user needs to rearrange params.
+This brief interactive window should give enough headroom to flush all pending changes, masking the latency.
+
I am pretty excited about this setup.
+I think that’s the way to go for Zig.
+
+
+The approach meshes extremely well with the ambition of doing incremental binary patching, both because it leans on complete global analysis, and because it contains an explicit notion of switching from one snapshot to the next one
+(in contrast, rust-analyzer never really thinks about “previous” state of the code. There’s always only the “current” state, with lazy, partially complete analysis).
+
+
+Zig lacks declared interfaces, so a quick “find all calls to this function” operation is required for useful completion.
+Fully resolved historical snapshot gives us just that.
+
+
+Zig is carefully designed to make a lot of semantic information obvious just from the syntax.
+Unlike Rust, Zig lacks syntactic macros or glob imports.
+This makes it possible to do a lot of analysis correctly using only pending ASTs.
+
+
+This approach nicely dodges the cancellation problem I’ve spent half of the blog post explaining, and has a relatively simple threading story, which reduces implementation complexity.
+
+
+Finally, it feels like it should be super fast (if not the most CPU efficient).
+
In this tutorial, I will explain a particular approach to parsing, which gracefully handles syntax errors and is thus suitable for language servers, which, by their nature, have to handle incomplete and invalid code.
+Explaining the problem and the solution requires a somewhat non-trivial worked example, and I want to share a couple of tricks not directly related to resilience, so the tutorial builds a full, self-contained parser, instead of explaining abstractly just the resilience.
+
The tutorial is descriptive, rather than prescriptive — it tells you what you can do, not what you should do.
+
+
+If you are looking into building a production grade language server, treat it as a library of ideas, not as a blueprint.
+
+
+If you want to get something working quickly, I think today the best answer is “just use Tree-sitter”, so you’d better read its docs rather than this tutorial.
+
+
+If you are building an IDE-grade parser from scratch, then techniques presented here might be directly applicable.
+
Let’s look at one motivational example for resilient parsing:
+
+
+
Here, a user is in the process of defining the fib_rec helper function.
+For a language server, it’s important that the incompleteness doesn’t get in the way.
+In particular:
+
+
+
The following function, fib, should be parsed without any errors such that syntax and semantic highlighting is not disturbed, and all calls to fib elsewhere typecheck correctly.
+
+
+
The fib_rec function itself should be recognized as a partially complete function, so that various language server assists can help complete it correctly.
+
+
+
In particular, a smart language server can actually infer the expected type of fib_rec from a call we already have, and suggest completing the whole prototype.
+rust-analyzer doesn’t do that today, but one day it should.
+
+
+
Generalizing this example, what we want from our parser is to recognize as much of the syntactic structure as feasible.
+It should be able to localize errors — a mistake in a function generally should not interfere with parsing unrelated functions.
+As the code is read and written left-to-right, the parser should also recognize valid partial prefixes of various syntactic constructs.
+
Academic literature suggests another lens to use when looking at this problem: error recovery.
+Rather than just recognizing incomplete constructs, the parser can attempt to guess a minimal edit which completes the construct and gets rid of the syntax error.
+From this angle, the above example would look rather like fn fib_rec(f1: u32, /* ) {} */ , where the stuff in a comment is automatically inserted by the parser.
+
Resilience is a more fruitful framing to use for a language server — incomplete code is the ground truth, and only the user knows how to correctly complete it.
+A language server can only offer guesses and suggestions, and they are more precise if they employ post-parsing semantic information.
+
Error recovery might work better when emitting understandable syntax errors, but, in a language server, the importance of clear error messages for syntax errors is relatively lower, as highlighting such errors right in the editor synchronously with typing usually provides tighter, more useful tacit feedback.
The classic approach for handling parser errors is to explicitly encode error productions and synchronization tokens into the language grammar.
+This approach isn’t a natural fit for resilience framing — you don’t want to anticipate every possible error, as there are just too many possibilities.
+Rather, you want to recover as much of a valid syntax tree as possible, and more or less ignore arbitrary invalid parts.
+
Tree-sitter does something more interesting.
+It is a GLR parser, meaning that it non-deterministically tries many possible LR (bottom-up) parses, and looks for the best one.
+This allows Tree-sitter to recognize many complete valid small fragments of a tree, but it might have trouble assembling them into incomplete larger fragments.
+In our example fn fib_rec(f1: u32, , Tree-sitter correctly recognizes f1: u32 as a formal parameter, but doesn’t recognize fib_rec as a function.
+
Top-down (LL) parsing paradigm makes it harder to recognize valid small fragments, but naturally allows for incomplete large nodes.
+Because code is written top-down and left-to-right, LL seems to have an advantage for typical patterns of incomplete code.
+Moreover, there isn’t really anything special you need to do to make LL parsing resilient.
+You sort of… just not crash on the first error, and everything else more or less just works.
+
Details are fiddly though, so, in the rest of the post, we will write a complete implementation of a hand-written recursive descent + Pratt resilient parser.
For lack of imagination on my side, the toy language we will be parsing is called L.
+It is a subset of Rust, which has just enough features to make some syntax mistakes.
+Here’s Fibonacci:
+
+
+
Note that there’s no base case, because L doesn’t have syntax for if.
+Here’s the syntax it does have, as an ungrammar:
+
+
+
The meta syntax here is similar to BNF, with two important differences:
+
+
+the notation is better specified and more familiar (recursive regular expressions),
+
+
+it describes syntax trees, rather than strings (sequences of tokens).
+
+
+
Single quotes signify terminals: 'fn' and 'return' are keywords, 'name' stands for any identifier token, like foo, and '(' is punctuation.
+Unquoted names are non-terminals. For example, x: i32, would be an example of Param.
+Unquoted punctuation denotes meta symbols of ungrammar itself, with semantics identical to regular expressions: zero-or-more repetition is *, zero-or-one is ?, | is alternation, and () is used for grouping.
+
The grammar doesn’t nail the syntax precisely. For example, the rule for Param, Param = 'name' ':' Type ','? , says that Param syntax node has an optional comma, but there’s nothing in the above ungrammar specifying whether the trailing commas are allowed.
+
Overall, L has very little to it — a program is a series of function declarations, each function has a body which is a sequence of statements, the set of expressions is spartan, not even an if. Still, it’ll take us some time to parse all that.
+But you can already try the end result in the text-box below.
+The syntax tree is updated automatically on typing.
+Do make mistakes to see how a partial tree is recovered.
A traditional AST for L might look roughly like this:
+
+
+
Extending this structure to be resilient is non-trivial. There are two problems: trivia and errors.
+
For resilient parsing, we want the AST to contain every detail about the source text.
+We actually don’t want to use an abstract syntax tree, and need a concrete one.
+In a traditional AST, the tree structure is rigidly defined — any syntax node has a fixed number of children.
+But there can be any number of comments and whitespace anywhere in the tree, and making space for them in the structure requires some fiddly data manipulation.
+Similarly, errors (e.g., unexpected tokens) can appear anywhere in the tree.
+
One trick to handle these in the AST paradigm is to attach trivia and error tokens to other tokens.
+That is, for something like
+fn /* name of the function -> */ f() {} ,
+the fn and f tokens would be explicit parts of the AST, while the comment and surrounding whitespace would belong to the collection of trivia tokens hanging off the fn token.
+
+One complication here is that it’s not always just tokens that can appear anywhere; sometimes you can have full trees like that.
+For example, comments might support markdown syntax, and you might actually want to parse that properly (e.g., to resolve links to declarations).
+Syntax errors can also span whole subtrees.
+For example, when parsing pub(crate) nope in Rust, it would be smart to parse pub(crate) as a visibility modifier, and nest it into a bigger Error node.
+
SwiftSyntax meticulously adds error placeholders between any two fields of an AST node, giving rise to
+unexpectedBetweenModifiersAndDeinitKeyword
+and such (source, docs).
+
An alternative approach, used by IntelliJ and rust-analyzer, is to treat the syntax tree as a somewhat dynamically-typed data structure:
+
+
+
This structure does not enforce any constraints on the shape of the syntax tree at all, and so it naturally accommodates errors anywhere.
+It is possible to layer a well-typed API on top of this dynamic foundation.
+An extra benefit of this representation is that you can use the same tree type for different languages; this is a requirement for universal tools.
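Concretely, the tree type might look like this (a sketch along the lines of what rowan and IntelliJ do, minus all the performance tricks; TokenKind and TreeKind are plain enums, listed a bit later):

```rust
struct Token {
    kind: TokenKind,
    text: String,
}

struct Tree {
    kind: TreeKind,
    children: Vec<Child>,
}

enum Child {
    Token(Token),
    Tree(Tree),
}
```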
+
Discussing specifics of syntax tree representation goes beyond this article, as the topic is vast and lacks a clear winning solution.
+To learn about it, take a look at Roslyn, SwiftSyntax, rowan and IntelliJ.
+
To simplify things, we’ll ignore comments and whitespace, though you’ll absolutely want those in a real implementation.
+One approach would be to do the parsing without comments, like we do here, and then attach comments to the nodes in a separate pass.
+Attaching comments needs some heuristics — for example, non-doc comments generally want to be a part of the following syntax node.
+
Another design choice is handling of error messages.
+One approach is to treat error messages as properties of the syntax tree itself, by either inferring them from the tree structure, or just storing them inline.
+Alternatively, errors can be considered to be a side-effect of the parsing process (that way, trees constructed manually during, eg, refactors, won’t carry any error messages, even if they are invalid).
+
Here’s the full set of token and tree kinds for our language L:
+
+
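A plausible version of these enums, reconstructed from the grammar described above (treat the exact list as illustrative):

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TokenKind {
    ErrorToken, Eof,

    LParen, RParen, LCurly, RCurly,
    Eq, Semi, Comma, Colon, Arrow,
    Plus, Minus, Star, Slash,

    FnKeyword, LetKeyword, ReturnKeyword, TrueKeyword, FalseKeyword,

    Name, Int,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TreeKind {
    ErrorTree,
    File, Fn,
    TypeExpr,
    ParamList, Param,
    Block, StmtLet, StmtReturn, StmtExpr,
    ExprLiteral, ExprName, ExprParen, ExprBinary, ExprCall,
    ArgList, Arg,
}
```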
+
Things to note:
+
+
+explicit Error kinds;
+
+
+no whitespace or comments, as an unrealistic simplification;
+
+
+Eof virtual token simplifies parsing, removing the need to handle Option<Token>;
+
+
+punctuators are named after what they are, rather than after what they usually mean: Star, rather than Mult;
+
+
+a good set of names for the various kinds of braces is {L,R}{Paren,Curly,Brack,Angle}.
+
We won’t be covering the lexer here; let’s just say we have a fn lex(text: &str) -> Vec<Token> function. Two points worth mentioning:
+
+
+The lexer itself should be resilient, but that’s easy — produce an Error token for anything which isn’t a valid token.
+
+
+Writing a lexer by hand is somewhat tedious, but is very simple relative to everything else.
+If you are stuck in analysis-paralysis picking a lexer generator, consider cutting the Gordian knot and writing the lexer by hand.
+
With homogeneous syntax trees, the task of parsing admits an elegant formalization — we want to insert extra parentheses into a stream of tokens.
+
+
+
Note how the sequence of tokens with extra parentheses is still a flat sequence.
+The parsing will be two-phase:
+
+
+in the first phase, the parser emits a flat list of events,
+
+
+in the second phase, the list is converted to a tree.
+
+
+
Here’s the basic setup for the parser:
+
+
+
+
+
open, advance, and close form the basis for constructing the stream of events.
+
+
+
Note how kind is stored in the Open event, but is supplied with the close method.
+This is required for flexibility — sometimes it’s possible to decide on the type of syntax node only after it is parsed.
+The way this works is that the open method returns a Mark which is subsequently passed to close to modify the corresponding Open event.
+
+
+
There’s a set of short, convenient methods to navigate through the sequence of tokens:
+
+
+nth is the lookahead method. Note how it doesn’t return an Option, and uses Eof special value for “out of bounds” indexes.
+This simplifies the call-site: “no more tokens” and “token of a wrong kind” are always handled the same.
+
+
+at is a convenient specialization to check for a specific next token.
+
+
+eat is at combined with consuming the next token.
+
+
+expect is eat combined with error reporting.
+
+
+
These methods are not a very orthogonal basis, but they are a convenient basis for parsing.
+Finally, advance_with_error advances over any token, but also wraps it into an error node.
+
+
+
When writing parsers by hand, it’s very easy to accidentally write the code which loops or recurses forever.
+To simplify debugging, it’s helpful to add an explicit notion of “fuel”, which is replenished every time the parser makes progress,
+and is spent every time it does not.
+
+
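Putting the notes above together, here is a minimal sketch of this machinery (simplified relative to any real implementation; error reporting is stubbed out with eprintln):

```rust
use std::cell::Cell;

enum Event {
    Open { kind: TreeKind },
    Close,
    Advance,
}

struct MarkOpened {
    index: usize,
}

struct Parser {
    tokens: Vec<Token>,
    pos: usize,
    fuel: Cell<u32>,
    events: Vec<Event>,
}

impl Parser {
    fn open(&mut self) -> MarkOpened {
        let mark = MarkOpened { index: self.events.len() };
        // The kind is a placeholder; `close` patches in the real one.
        self.events.push(Event::Open { kind: TreeKind::ErrorTree });
        mark
    }

    fn close(&mut self, m: MarkOpened, kind: TreeKind) {
        self.events[m.index] = Event::Open { kind };
        self.events.push(Event::Close);
    }

    fn advance(&mut self) {
        assert!(!self.eof());
        self.fuel.set(256); // progress made: replenish fuel (256 is arbitrary)
        self.events.push(Event::Advance);
        self.pos += 1;
    }

    fn eof(&self) -> bool {
        self.pos == self.tokens.len()
    }

    fn nth(&self, lookahead: usize) -> TokenKind {
        if self.fuel.get() == 0 {
            panic!("parser is stuck")
        }
        self.fuel.set(self.fuel.get() - 1); // no progress: spend fuel
        self.tokens
            .get(self.pos + lookahead)
            .map_or(TokenKind::Eof, |it| it.kind)
    }

    fn at(&self, kind: TokenKind) -> bool {
        self.nth(0) == kind
    }

    fn eat(&mut self, kind: TokenKind) -> bool {
        if self.at(kind) {
            self.advance();
            true
        } else {
            false
        }
    }

    fn expect(&mut self, kind: TokenKind) {
        if self.eat(kind) {
            return;
        }
        eprintln!("expected {kind:?}"); // error reporting is a stub here
    }

    fn advance_with_error(&mut self, error: &str) {
        let m = self.open();
        eprintln!("{error}");
        self.advance();
        self.close(m, TreeKind::ErrorTree);
    }
}
```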
+
The function to transform a flat list of events into a tree is a bit involved.
+It juggles three things: an iterator of events, an iterator of tokens, and a stack of partially constructed nodes (we expect the stack to contain just one node at the end).
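One way to write it (a sketch; the invariants are enforced with asserts):

```rust
fn build_tree(tokens: Vec<Token>, mut events: Vec<Event>) -> Tree {
    let mut tokens = tokens.into_iter();
    let mut stack: Vec<Tree> = Vec::new();

    // Remove the very last `Close`, so that the root tree stays on the
    // stack when the loop finishes.
    assert!(matches!(events.pop(), Some(Event::Close)));

    for event in events {
        match event {
            // Start a new node: push an empty tree onto the stack.
            Event::Open { kind } => {
                stack.push(Tree { kind, children: Vec::new() })
            }
            // A node is finished: pop it and attach it to its parent.
            Event::Close => {
                let tree = stack.pop().unwrap();
                stack.last_mut().unwrap().children.push(Child::Tree(tree));
            }
            // Attach the next token to the current node.
            Event::Advance => {
                let token = tokens.next().unwrap();
                stack.last_mut().unwrap().children.push(Child::Token(token));
            }
        }
    }

    // The parser guarantees a single root covering all of the tokens.
    assert!(tokens.next().is_none());
    let tree = stack.pop().unwrap();
    assert!(stack.is_empty());
    tree
}
```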
We are finally getting to the actual topic of resilient parsing.
+Now we will write a full grammar for L as a sequence of functions.
+Usually both atomic parser operations, like fn advance, and grammar productions, like fn parse_fn are implemented as methods on the Parser struct.
+I prefer to separate the two and to use free functions for the latter category, as the code is a bit more readable that way.
+
Let’s start with parsing the top level.
+
+
+
+
+
Wrap the whole thing into a File node.
+
+
+
Use the while loop to parse a file as a series of functions.
+Importantly, the entirety of the file is parsed; we break out of the loop only when the eof is reached.
+
+
+
To not get stuck in this loop, it’s crucial that every iteration consumes at least one token.
+If the token is fn, we’ll parse at least a part of a function.
+Otherwise, we consume the token and wrap it into an error node.
+
+
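Putting those three points together, the top-level function might look like this (a sketch, building on the parser above):

```rust
// File = Fn*
fn file(p: &mut Parser) {
    let m = p.open();

    while !p.eof() {
        if p.at(TokenKind::FnKeyword) {
            func(p);
        } else {
            p.advance_with_error("expected a function");
        }
    }

    p.close(m, TreeKind::File);
}
```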
+
Let’s parse functions now:
+
+
+
+
+
When parsing a function, we assert that the current token is fn.
+There’s some duplication with the if p.at(FnKeyword) check at the call-site, but this duplication actually helps readability.
+
+
+
Again, we surround the body of the function with open/close pair.
+
+
+
Although parameter list and function body are mandatory, we precede them with an at check.
+We can still report the syntax error by analyzing the structure of the syntax tree (or we can report it as a side effect of parsing in the else branch if we want).
+It wouldn’t be wrong to just remove the if altogether and try to parse param_list unconditionally, but the if helps with reducing cascading errors.
+
+
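In code, something like this (a sketch; the optional return type is an assumption about L's full grammar):

```rust
// Fn = 'fn' 'name' ParamList ('->' TypeExpr)? Block
fn func(p: &mut Parser) {
    assert!(p.at(TokenKind::FnKeyword));
    let m = p.open();

    p.expect(TokenKind::FnKeyword);
    p.expect(TokenKind::Name);
    if p.at(TokenKind::LParen) {
        param_list(p);
    }
    // Assuming the grammar allows an optional return type.
    if p.eat(TokenKind::Arrow) {
        type_expr(p);
    }
    if p.at(TokenKind::LCurly) {
        block(p);
    }

    p.close(m, TreeKind::Fn);
}
```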
+
Now, the list of parameters:
+
+
+
+
+Inside, we have a standard code shape for parsing a bracketed list.
+It can be extracted into a higher-order function, but typing out the code manually is not a problem either.
+This bit of code starts and ends with consuming the corresponding parenthesis.
+
+
+In the happy case, we loop until the closing parenthesis.
+However, it could also be the case that there’s no closing parenthesis at all, so we add an eof condition as well.
+Generally, every loop we write would have && !p.eof() tacked on.
+
+
+As with any loop, we need to ensure that each iteration consumes at least one token to not get stuck.
+If the current token is an identifier, everything is ok, as we’ll parse at least some part of the parameter.
+
+
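The corresponding code might look like this (a sketch of the basic, pre-recovery version):

```rust
// ParamList = '(' Param* ')'
fn param_list(p: &mut Parser) {
    assert!(p.at(TokenKind::LParen));
    let m = p.open();

    p.expect(TokenKind::LParen);
    while !p.at(TokenKind::RParen) && !p.eof() {
        if p.at(TokenKind::Name) {
            param(p);
        } else {
            break;
        }
    }
    p.expect(TokenKind::RParen);

    p.close(m, TreeKind::ParamList);
}
```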
+
Parsing parameter is almost nothing new at this point:
+
+
+
+
+This is the only interesting bit.
+To parse a comma-separated list of parameters with a trailing comma, it’s enough to check if the following token after parameter is ).
+This correctly handles all three cases:
+
+
+if the next token is ), we are at the end of the list, and no comma is required;
+
+
+if the next token is ,, we correctly advance past it;
+
+
+finally, if the next token is anything else, then it’s not a ), so we are not at the last element of the list and correctly emit an error.
+
+
+
+
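Here is the whole thing (sketch):

```rust
// Param = 'name' ':' TypeExpr ','?
fn param(p: &mut Parser) {
    assert!(p.at(TokenKind::Name));
    let m = p.open();

    p.expect(TokenKind::Name);
    p.expect(TokenKind::Colon);
    type_expr(p);
    // The trailing-comma trick: require a comma unless `)` comes next.
    if !p.at(TokenKind::RParen) {
        p.expect(TokenKind::Comma);
    }

    p.close(m, TreeKind::Param);
}
```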
+
Parsing types is trivial:
+
+
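A sketch, assuming types in L are just names:

```rust
// TypeExpr = 'name'
fn type_expr(p: &mut Parser) {
    let m = p.open();
    p.expect(TokenKind::Name);
    p.close(m, TreeKind::TypeExpr);
}
```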
+
The notable aspect here is naming.
+The production is deliberately named TypeExpr, rather than Type, to avoid confusion down the line.
+Consider fib(92) .
+It is an expression, which evaluates to a value.
+The same thing happens with types.
+For example, Foo<Int> is not a type yet, it’s an expression which can be “evaluated” (at compile time) to a type (if Foo is a type alias, the result might be something like Pair<Int, Int>).
+
Parsing a block gets a bit more involved:
+
+
+
Block can contain many different kinds of statements, so we branch on the first token in the loop’s body.
+As usual, we need to maintain an invariant that the body consumes at least one token.
+For let and return statements that’s easy, they consume the fixed first token.
+For the expression statement (things like 1 + 1;) it gets more interesting, as an expression can start with many different tokens.
+For the time being, we’ll just kick the can down the road and require stmt_expr to deal with it (that is, to guarantee that at least one token is consumed).
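In code (sketch):

```rust
// Block = '{' Stmt* '}'
fn block(p: &mut Parser) {
    assert!(p.at(TokenKind::LCurly));
    let m = p.open();

    p.expect(TokenKind::LCurly);
    while !p.at(TokenKind::RCurly) && !p.eof() {
        match p.nth(0) {
            TokenKind::LetKeyword => stmt_let(p),
            TokenKind::ReturnKeyword => stmt_return(p),
            // `stmt_expr` promises to consume at least one token.
            _ => stmt_expr(p),
        }
    }
    p.expect(TokenKind::RCurly);

    p.close(m, TreeKind::Block);
}
```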
+
Statements themselves are straightforward:
+
+
+
Again, for stmt_expr, we push “must consume a token” invariant onto expr.
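For reference, the three statement parsers might look like this (sketch):

```rust
// StmtLet = 'let' 'name' '=' Expr ';'
fn stmt_let(p: &mut Parser) {
    assert!(p.at(TokenKind::LetKeyword));
    let m = p.open();

    p.expect(TokenKind::LetKeyword);
    p.expect(TokenKind::Name);
    p.expect(TokenKind::Eq);
    expr(p);
    p.expect(TokenKind::Semi);

    p.close(m, TreeKind::StmtLet);
}

// StmtReturn = 'return' Expr ';'
fn stmt_return(p: &mut Parser) {
    assert!(p.at(TokenKind::ReturnKeyword));
    let m = p.open();

    p.expect(TokenKind::ReturnKeyword);
    expr(p);
    p.expect(TokenKind::Semi);

    p.close(m, TreeKind::StmtReturn);
}

// StmtExpr = Expr ';'
fn stmt_expr(p: &mut Parser) {
    let m = p.open();

    expr(p);
    p.expect(TokenKind::Semi);

    p.close(m, TreeKind::StmtExpr);
}
```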
+
Expressions are tricky.
+They always are.
+For starters, let’s handle just the clearly-delimited cases, like literals and parentheses:
+
+
+
In the catch-all arm, we take care to consume the token, to make sure that the statement loop in block can always make progress.
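A sketch of that first version (the literal token kinds Int, TrueKeyword, and FalseKeyword are assumptions):

```rust
fn expr_delimited(p: &mut Parser) {
    let m = p.open();
    match p.nth(0) {
        TokenKind::Int | TokenKind::TrueKeyword | TokenKind::FalseKeyword => {
            p.advance();
            p.close(m, TreeKind::ExprLiteral);
        }
        TokenKind::Name => {
            p.advance();
            p.close(m, TreeKind::ExprName);
        }
        TokenKind::LParen => {
            p.expect(TokenKind::LParen);
            expr(p);
            p.expect(TokenKind::RParen);
            p.close(m, TreeKind::ExprParen);
        }
        _ => {
            // Catch-all: consume the offending token, so that the
            // statement loop in `block` always makes progress.
            if !p.eof() {
                p.advance();
            }
            p.close(m, TreeKind::ErrorTree);
        }
    }
}
```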
+
Next expression to handle would be ExprCall.
+This requires some preparation.
+Consider this example: f(1)(2) .
+
We want the following parenthesis structure here:
+
+
+
The problem is, when the parser is at f, it doesn’t yet know how many Open events it should emit.
+
We solve the problem by adding an API to go back and inject a new Open event into the middle of existing events.
+
+
+
+
+
Here we adjust close to also return a MarkClosed, such that we can go back and add a new event before it.
+
+
+
The new API. It is like open, but also takes a MarkClosed which carries an index of an Open event in front of which we are to inject a new Open.
+In the current implementation, for simplicity, we just inject into the middle of the vector, which is an O(N) operation worst-case.
+A proper solution here would be to use an index-based linked list.
+That is, open_before can push the new open event to the end of the list, and also mark the old event with a pointer to the freshly inserted one.
+To store a pointer, an extra field is needed:
+
+
+
The loop in build_tree needs to follow the open_before links.
+
+
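In the simple flavor (inserting into the middle of the vector), the adjustment might look like this sketch; close replaces the earlier version:

```rust
struct MarkClosed {
    index: usize,
}

impl Parser {
    fn close(&mut self, m: MarkOpened, kind: TreeKind) -> MarkClosed {
        self.events[m.index] = Event::Open { kind };
        self.events.push(Event::Close);
        MarkClosed { index: m.index }
    }

    fn open_before(&mut self, m: MarkClosed) -> MarkOpened {
        let mark = MarkOpened { index: m.index };
        // Simple variant: splice a placeholder `Open` right before the
        // existing one. O(N) in the worst case; the linked-list trick
        // described above avoids this.
        self.events.insert(
            m.index,
            Event::Open { kind: TreeKind::ErrorTree },
        );
        mark
    }
}
```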
+
With this new API, we can parse function calls:
+
+
+
+
+
expr_delimited now returns a MarkClosed rather than ().
+No code changes are required for this, as close calls are already in the tail position.
+
+
+
To parse function calls, we check whether we are at ( and use open_before API if that is the case.
+
+
+
Parsing argument list should be routine by now.
+Again, as an expression can start with many different tokens, we don’t add an if p.at check to the loop’s body, and require arg to consume at least one token.
+
+
+
Inside arg, we use an already familiar construct to parse an optionally trailing comma.
+
+
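Putting it together (a sketch; per the note above, expr_delimited now returns the MarkClosed produced by its close calls, and this expr is superseded by the Pratt version shortly):

```rust
fn expr(p: &mut Parser) {
    let mut lhs = expr_delimited(p);

    // ExprCall = Expr ArgList
    while p.at(TokenKind::LParen) {
        let m = p.open_before(lhs);
        arg_list(p);
        lhs = p.close(m, TreeKind::ExprCall);
    }
}

// ArgList = '(' Arg* ')'
fn arg_list(p: &mut Parser) {
    assert!(p.at(TokenKind::LParen));
    let m = p.open();

    p.expect(TokenKind::LParen);
    while !p.at(TokenKind::RParen) && !p.eof() {
        arg(p);
    }
    p.expect(TokenKind::RParen);

    p.close(m, TreeKind::ArgList);
}

// Arg = Expr ','?
fn arg(p: &mut Parser) {
    let m = p.open();

    expr(p);
    if !p.at(TokenKind::RParen) {
        p.expect(TokenKind::Comma);
    }

    p.close(m, TreeKind::Arg);
}
```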
+
Now only binary expressions are left.
+We will use a Pratt parser for those.
+This is genuinely tricky code, so I have a dedicated article explaining how it all works:
Here, I’ll just dump a pageful of code without much explanation:
+
+
+
+
+
In this version of the Pratt parser, rather than passing numerical precedence, I pass the actual token (learned that from jamii’s post).
+So, to determine whether to break or recur in the Pratt loop, we ask which of the two tokens binds tighter and act accordingly.
+
+
+
When we start parsing an expression, we don’t have an operator to the left yet, so I just pass Eof as a dummy token.
+
+
+
The code naturally handles the case when the next token is not an operator (that is, when expression is complete, or when there’s some syntax error).
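For completeness, here is a sketch of what that pageful might look like, following the description above (the operator set and the two precedence levels are assumptions):

```rust
fn expr(p: &mut Parser) {
    expr_rec(p, TokenKind::Eof);
}

fn expr_rec(p: &mut Parser, left: TokenKind) {
    let mut lhs = expr_delimited(p);

    while p.at(TokenKind::LParen) {
        let m = p.open_before(lhs);
        arg_list(p);
        lhs = p.close(m, TreeKind::ExprCall);
    }

    loop {
        let right = p.nth(0);
        if right_binds_tighter(left, right) {
            let m = p.open_before(lhs);
            p.advance();
            expr_rec(p, right);
            lhs = p.close(m, TreeKind::ExprBinary);
        } else {
            break;
        }
    }
}

fn right_binds_tighter(left: TokenKind, right: TokenKind) -> bool {
    fn tightness(kind: TokenKind) -> Option<usize> {
        [
            // Precedence table, from loosest to tightest binding.
            [TokenKind::Plus, TokenKind::Minus].as_slice(),
            &[TokenKind::Star, TokenKind::Slash],
        ]
        .iter()
        .position(|level| level.contains(&kind))
    }
    let Some(right_tightness) = tightness(right) else {
        // Not an operator at all: the expression is complete, or there
        // is a syntax error.
        return false;
    };
    let Some(left_tightness) = tightness(left) else {
        // No operator to the left: the `Eof` dummy, so the right side wins.
        assert_eq!(left, TokenKind::Eof);
        return true;
    };
    right_tightness > left_tightness
}
```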
Let’s see how resilient our basic parser is.
+Let’s check our motivational example:
+
+
+
Here, the syntax tree our parser produces is surprisingly exactly what we want:
+
+
+
For the first incomplete function, we get Fn, Param and ParamList, as we should.
+The second function is parsed without any errors.
+
+Curiously, we get this great result without much explicit effort to make parsing resilient; it’s a natural outcome of just not failing in the presence of errors.
+The following ingredients help us:
+
+
+homogeneous syntax tree supports arbitrary malformed code,
+
+
+any syntactic construct is parsed left-to-right, and valid prefixes are always recognized,
+
+
+our top-level loop in file is greedy: it either parses a function, or skips a single token and tries to parse a function again.
+That way, if there’s a valid function somewhere, it will be recognized.
+
+
+
Thinking about the last case both reveals the limitations of our current code, and shows avenues for improvement.
+In general, parsing works as a series of nested loops:
+
+
+
If something goes wrong inside a loop, our choices are:
+
+
+skip a token, and continue with the next iteration of the current loop,
+
+
+break out of the inner loop, and let the outer loop handle recovery.
+
+
+
The top-most loop must use the “skip a token” solution, because it needs to consume all of the input tokens.
Right now, each loop either always skips, or always breaks.
+This is not optimal.
+Consider this example:
+
+
+
Here, for f1 we want to break out of param_list loop, and our code does just that.
+For f2 though, the error is a duplicated comma (the user will add a new parameter between x and z shortly), so we want to skip here.
+We don’t, and, as a result, the syntax tree for f2 is a train wreck:
+
+
+
For parameters, it is reasonable to skip tokens until we see something which implies the end of the parameter list.
+For example, if we are parsing a list of parameters and see an fn token, then we’d better stop.
+If we see some less salient token, it’s better to gobble it up.
+Let’s implement the idea:
+
+
+
Here, we use at_any helper function, which is like at, but takes a list of tokens.
+The real implementation would use bitsets for this purpose.
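A sketch of the adjusted param_list (at_any is a straightforward addition to the Parser, and the exact recovery set follows the discussion below):

```rust
impl Parser {
    fn at_any(&self, kinds: &[TokenKind]) -> bool {
        kinds.contains(&self.nth(0))
    }
}

const PARAM_LIST_RECOVERY: &[TokenKind] =
    &[TokenKind::FnKeyword, TokenKind::LCurly];

fn param_list(p: &mut Parser) {
    assert!(p.at(TokenKind::LParen));
    let m = p.open();

    p.expect(TokenKind::LParen);
    while !p.at(TokenKind::RParen) && !p.eof() {
        if p.at(TokenKind::Name) {
            param(p);
        } else {
            // Stop at salient tokens, gobble up everything else.
            if p.at_any(PARAM_LIST_RECOVERY) {
                break;
            }
            p.advance_with_error("expected parameter");
        }
    }
    p.expect(TokenKind::RParen);

    p.close(m, TreeKind::ParamList);
}
```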
+
The example now parses correctly:
+
+
+
What is a reasonable RECOVERY set in a general case?
+I don’t know the answer to this question, but follow sets from formal grammar theory give a good intuition.
+We don’t want exactly the follow set: for ParamList, { is in follow, and we do want it to be a part of the recovery set, but fn is not in follow, and yet it is important to recover on it.
+fn is included because it’s in the follow for Fn, and ParamList is a child of Fn: we also want to recursively include ancestor follow sets into the recovery set.
+
+For expressions and statements, we have the opposite problem — block and arg_list loops eagerly consume erroneous tokens, but sometimes it would be wise to break out of the loop instead.
+
Consider this example:
+
+
+
It gives another train wreck syntax tree, where the g function is completely missed:
+
+
+
Recall that the root cause here is that we require expr to consume at least one token, because it’s not immediately obvious which tokens can start an expression.
+It’s not immediately obvious, but easy to compute — that’s exactly the first set from formal grammars.
+
Using it, we get:
+
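Something along these lines (a sketch; the STMT_RECOVERY set is an assumption, with fn as the obvious member):

```rust
const STMT_RECOVERY: &[TokenKind] = &[TokenKind::FnKeyword];
const EXPR_FIRST: &[TokenKind] = &[
    TokenKind::Int,
    TokenKind::TrueKeyword,
    TokenKind::FalseKeyword,
    TokenKind::Name,
    TokenKind::LParen,
];

fn block(p: &mut Parser) {
    assert!(p.at(TokenKind::LCurly));
    let m = p.open();

    p.expect(TokenKind::LCurly);
    while !p.at(TokenKind::RCurly) && !p.eof() {
        match p.nth(0) {
            TokenKind::LetKeyword => stmt_let(p),
            TokenKind::ReturnKeyword => stmt_return(p),
            _ => {
                if p.at_any(EXPR_FIRST) {
                    stmt_expr(p);
                } else if p.at_any(STMT_RECOVERY) {
                    // Let the outer loop (file) handle recovery.
                    break;
                } else {
                    p.advance_with_error("expected a statement");
                }
            }
        }
    }
    p.expect(TokenKind::RCurly);

    p.close(m, TreeKind::Block);
}
```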
+
+
This fixes the syntax tree:
+
+
+
There’s only one issue left.
+Our expr parsing is still greedy, so, in a case like this
+
+
+
the let will be consumed as a right-hand-side operand of +.
+Now that the callers of expr contain a check for EXPR_FIRST, we no longer need this greediness and can return None if no expression can be parsed:
+
+
+
This gives the following syntax tree:
+
+
+
And this concludes the tutorial!
+You are now capable of implementing an IDE-grade parser for a real programming language from scratch.
+
Summarizing:
+
+
+
Resilient parsing means recovering as much syntactic structure from erroneous code as possible.
+
+
+
+Resilient parsing is important for IDEs and language servers, whose job mostly ends when the code does not have errors anymore.
+
+
+
Resilient parsing is related, but distinct from error recovery and repair.
+Rather than guessing what the user meant to write, the parser tries to make sense of what is actually written.
+
+
+
Academic literature tends to focus on error repair, and mostly ignores pure resilience.
+
+
+
The biggest challenge of resilient parsing is the design of a syntax tree data structure.
+It should provide convenient and type-safe access to well-formed syntax trees, while allowing arbitrary malformed trees.
+
+
+
One possible design here is to make the underlying tree a dynamically-typed data structure (like JSON), and layer typed accessors on top (not covered in this article).
+
+
+
LL style parsers are a good fit for resilient parsing.
+Because code is written left-to-right, it’s important that the parser recognizes well-formed prefixes of incomplete syntactic constructs, and LL does just that.
+
+
+
Ultimately, parsing works as a stack of nested for loops.
+Inside a single for loop, on each iteration, we need to decide between:
+
+
+trying to parse a sequence element,
+
+
+skipping over an unexpected token,
+
+
+breaking out of the nested loop and delegating recovery to the parent loop.
+
+
+
+
+
+first, follow, and recovery sets help make a specific decision.
+
+
+
In any case, if a loop tries to parse an item, item parsing must consume at least one token (if only to report an error).
One of the values of Zig which resonates with me deeply is a mindful approach to dependencies.
+Zig tries hard not to ask too much from the environment, such that, if you get zig version running, you can be reasonably sure that everything else works.
+That’s one of the main motivations for adding an HTTP client to the Zig distribution recently.
+Building software today involves downloading various components from the Internet, and, if Zig wants software built with Zig to be hermetic and self-sufficient, it needs to provide the ability to download files from HTTP servers.
+
There’s one hurdle for self-sufficiency: how do you get Zig in the first place?
+One answer to this question is “from your distribution’s package manager”.
+This is not a very satisfying answer, at least until the language is both post 1.0 and semi-frozen in development.
+And even then, what if your distribution is Windows?
+How many distributions should be covered by “Installing Zig” section of your CONTRIBUTING.md?
+
Another answer would be a version manager, a-la rustup, nvm, or asdf.
+These tools work well, but they are quite complex, and rely on various subtle properties of the environment, like PATH, shell activation scripts and busybox-style multipurpose executable.
+And, well, this also kicks the can down the road — you can use zvm to get Zig, but how do you get zvm?
+
I like how we do this in TigerBeetle.
+We don’t use zig from PATH.
+Instead, we just put the correct version of Zig into ./zig folder in the root of the repository, and run it like this:
+
+
+
Suddenly, whole swaths of complexity go away.
+Quiz time: if you need to add a directory to PATH, which script should be edited so that both the graphical environment and the terminal are affected?
+
Finally, another interesting case study is Gradle.
+Usually Gradle is a negative example, but they do have a good approach for installing Gradle itself.
+The standard pattern is to store two scripts, gradlew and gradlew.bat, which bootstrap the right version of Gradle by downloading a jar file (java itself is not bootstrapped this way though).
+
What all these approaches struggle to overcome is the problem of bootstrapping.
+Generally, if you need to automate anything, you can write a program to do that.
+But you need some pre-existing program runner!
+And there are just no good options out of the box — bash and powershell are passable, but barely, and they are different.
+And “bash” and the set of coreutils also differ depending on the Unix in question.
+But there’s just no good solution here — if you want to bootstrap automatically, you must start with universally available tools.
+
But is there perhaps some scripting language which is shared between Windows and Unix?
+@cspotcode suggests a horrible workaround.
+You can write a script which is both a bash script and a powershell script.
+And it even isn’t too too ugly!
+
+
+
So, here’s an idea for a hermetic Zig version management workflow.
+There’s a canonical, short getzig.ps1 PowerShell/sh script which is vendored verbatim by various projects.
+Running this script downloads an appropriate version of Zig, and puts it into ./zig/zig inside the repository (.gitignore contains /zig).
+Building, testing, and other workflows use ./zig/zig instead of relying on global system state ($PATH).
+
A proof-of-concept getzig.ps1 is at the start of this article.
+Note that I don’t know bash, powershell, and how to download files from the Internet securely, so the above PoC was mostly written by Chat GPT.
+But it seems to work on my machine.
+I clone https://github.com/matklad/hello-getzig and run
+
+
+
on both NixOS and Windows 10, and it prints hello.
+
If anyone wants to make an actual thing out of this idea, here’s possible desiderata:
+
+
+
A single polyglot getzig.sh.ps1 is cute, but using a couple of different scripts wouldn’t be a big problem.
+
+
+
Size of the scripts could be a problem, as they are supposed to be vendored into each repository.
+I’d say 512 lines for combined getzig.sh.ps1 would be a reasonable complexity limit.
+
+
+
The script must “just work” on all four major desktop operating systems: Linux, Mac, Windows, and WSL.
+
+
+
The script should be polymorphic in curl / wget and bash / sh.
+
+
+
It’s ok if it doesn’t work absolutely everywhere — downloading/building Zig manually for an odd platform is also an acceptable workflow.
+
+
+
The script should auto-detect appropriate host platform and architecture.
+
+
+
Zig version should be specified in a separate zig-version.txt file.
+
+
+
After downloading the file, its integrity should be verified.
+For this reason, zig-version.txt should include a hash alongside the version.
+As downloads are different depending on the platform, I think we’ll need some help from Zig upstream here.
+In particular, each published Zig version should include a cross-platform manifest file, which lists hashes and urls of per-platform binaries.
+The hash included into zig-version.txt should be the manifest’s hash.
TL;DR, https://bors.tech delivers a meaningfully better experience, although it suffers from being a third-party integration.
+
Specific grievances:
+
Complexity. This is a vague feeling, but merge queue feels like it is built by complexity merchants — there are a lot of unclear settings and voluminous and byzantine docs.
+Good for allocating extra budget towards build engineering, bad for actual build engineering.
+
GUI-only configuration. Bors is set up using bors.toml in the repository; merge queue is set up by clicking through a web GUI.
+To share config with other maintainers, I resorted to a zoomed-out screenshot of the page.
+
Unclear set of checks. The purpose of the merge queue is to enforce the not-rocket-science rule of software engineering — making sure that the code in the main branch satisfies certain quality invariants (all tests are passing).
+It is impossible to tell what merge queue actually enforces.
+Typically, when you enable merge queue, you subsequently find out that it actually merges anything, without any checks whatsoever.
+
Double latency. One of the biggest benefits of a merge queue for a high velocity project is its asynchrony.
+After submitting a PR, you can do a review and schedule PR to be merged without waiting for CI to finish.
+This is massive: it is a 2X reduction in the human attention required.
+Without queue, you need to look at a PR twice: once to do a review, and once to click merge after the green checkmark is in.
+With the queue, you only need a review, and the green checkmark comes in asynchronously.
+Except that with GitHub merge queue, you can’t actually add a PR to the queue until you get a green checkmark.
+In effect, that’s still 2X attention, and then a PR runs through the same CI checks twice (yes, you can have separate checks for merge queue and PR. No, this is not a good idea, this is complexity and busywork).
+
Lack of delegation. With bors, you can use bors delegate+ to delegate merging of a single, specific pull request to its author.
+This is helpful to drive contributor engagement, and to formalize “LGTM with the nits fixed” approval (which again reduces the number of human round trips).
+
You should still use GitHub merge queue, rather than bors-ng, as that’s now a first-party feature.
+Still, it’s important to understand how things should work, to be able to improve the state of the art some other time.
In this post, we’ll look at how Rust, Go, and Zig express the signature of the cut function — the power tool of string manipulation.
+Cut takes a string and a pattern, and splits the string around the first occurrence of the pattern:
+cut("life", "if") = ("l", "e").
+
At a glance, it seems like a non-orthogonal jumbling together of searching and slicing.
+However, in practice a lot of ad-hoc string processing can be elegantly expressed via cut.
+
A lot of things are key=value pairs, and cut fits perfectly there.
+What’s more, many more complex sequences, like
+--arg=key=value,
+can be viewed as nested pairs.
+You can cut around = once to get --arg and key=value, and then cut the second time to separate key from value.
+
In Rust, this function looks like this:
+
+
+
+Rust’s Option is a good fit for the result type; it clearly describes the behavior of the function when the pattern isn’t found in the string at all.
+Lifetime 'a expresses the relationship between the result and the input — both pieces of result are substrings of &'a self, so, as long as they are used, the original string must be kept alive as well.
+Finally, the separator isn’t another string, but a generic P: Pattern.
+This gives a somewhat crowded signature, but allows using strings, single characters, and even fn(c: char) -> bool functions as patterns.
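In today's standard library this is str::split_once; the commented signature below is approximate (the exact formulation of the Pattern trait has shifted over time), but the behavior check is real:

```rust
// pub fn split_once<'a, P: Pattern<'a>>(&'a self, delimiter: P)
//     -> Option<(&'a str, &'a str)>
fn main() {
    // The pattern can be a &str, a char, or a fn(char) -> bool.
    assert_eq!("life".split_once("if"), Some(("l", "e")));
    assert_eq!("key=value".split_once('='), Some(("key", "value")));
    assert_eq!("life".split_once('z'), None);
}
```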
+
When using the function, there is a multitude of ways to access the result:
+
+
+
Here’s a Go equivalent:
+
+
+
It has a better name!
+It’s important that frequently used building-block functions have short, memorable names, and “cut” is just perfect for what the function does.
+Go doesn’t have an Option, but it allows multiple return values, and any type in Go has a zero value, so a boolean flag can be used to signal None.
+Curiously if the sep is not found in s, after is set to "", but before is set to s (that is, the whole string).
+This is occasionally useful, and corresponds to the last Rust example.
+But it also isn’t something immediately obvious from the signature, it’s an extra detail to keep in mind.
+Which might be fine for a foundational function!
+Similarly to Rust, the resulting strings point to the same memory as s.
+There are no lifetimes, but a potential performance gotcha — if one of the resulting strings is alive, then the entire s can’t be garbage collected.
+
There isn’t much in the way of using the function in Go:
+
+
+
Zig doesn’t yet have an equivalent function in its standard library, but it probably will at some point, and the signature might look like this:
+
+
+
Similarly to Rust, Zig can express optional values.
+Unlike Rust, the option is a built-in, rather than a user-defined type (Zig can express a generic user-defined option, but chooses not to).
+All types in Zig are strictly prefix, so leading ? concisely signals optionality.
+Zig doesn’t have first-class tuple types, but uses very concise and flexible type declaration syntax, so we can return a named tuple.
+Curiously, this anonymous struct is still a nominal, rather than a structural, type!
+Similarly to Rust, prefix and suffix borrow the same memory that s does.
+Unlike Rust, this isn’t expressed in the signature — while in this case it is obvious that the lifetime would be bound to s, rather than sep, there are no type system guardrails here.
+
Because ? is a built-in type, we need some amount of special syntax to handle the result, but it curiously feels less special-case and more versatile than the Rust version.
+
+
+
Moral of the story?
+Work with the grain of the language — expressing the same concept in different languages usually requires a slightly different vocabulary.
I was going to write a long post about designing an IDE-friendly language. I wrote an intro and
+figured that it would make a better, shorter post on its own. Enjoy!
+
The big idea of language server construction is that language servers are not magic — capabilities
+and performance of tooling are constrained by the syntax and semantics of the underlying language.
+If a language is not designed with toolability in mind, some capabilities (e.g., fully automated
+refactors) are impossible to implement correctly. What’s more, an IDE-friendly language turns out to
+be a fast-to-compile language with easy-to-compose libraries!
+
More abstractly, there’s this cluster of properties, unrelated at first sight, but intimately intertwined and
+mutually supportive:
+
+
+parallel, separate compilation,
+
+
+incremental compilation,
+
+
+resilience to errors.
+
+
+
Separate compilation measures how fast we can compile a codebase from scratch if we have an unlimited
+number of CPU cores. For a language server, it solves the cold start problem — time to
+code-completion when the user opens the project for the first time or switches branches. Incremental
+compilation is the steady state of the language server — user types code and expects to see
+immediate effects throughout the project. Resilience to errors is important for two different
+sub-reasons. First, when the user edits the code it is by definition incomplete and erroneous, but a
+language server still must analyze the surrounding context correctly. But the killer feature of
+resilience is that, if you are absolutely immune to some errors, you don’t even have to look at the
+code. If a language server can ignore errors in function bodies, it doesn’t have to look at the
+bodies of functions from dependencies.
+
All three properties, parallelism, incrementality, and resilience, boil down to modularity —
+partitioning the code into disjoint components with well-defined interfaces, such that each
+particular component is aware only about the interfaces of other components.
Let’s do a short drill and observe how the three properties interact at a small scale. Let’s
+minimize the problem of separate compilation to just … lexical analysis. How can we build a
+language that is easier to tokenize for a language server?
+
An unclosed quote is a nasty little problem! Practically, it is rare enough that it doesn’t really
+matter how you handle it, but qualitatively it is illuminating. In a language like Rust, where
+strings can span multiple lines, inserting a " in the middle of a file changes the lexical structure
+of the following text completely (/*, start of a block comment, has the same effect). When tokens
+change, so does the syntax tree and the set of symbols defined by the file. A tiny edit, just one
+symbol, unhinges the semantic structure of the entire compilation unit.
+
Zig solves this problem. In Zig, no token can span several lines. That is, it would be correct to
+first split a Zig source file by \n, and then tokenize each line separately. This is achieved by
+finding better solutions for the problems which usually call for multi-line tokens. Specifically:
+
+
+
there’s a single syntax for comments, //,
+
+
+
double-quoted strings can’t contain a \n,
+
+
+
but there’s a really nice syntax for multiline strings:
+
+
+
+
+
Do you see modules here? Disjoint-partitioning into interface-connected components? From the
+perspective of lexical analysis, each line is a module. And a line always has a trivial, empty
+interface — different lines are completely independent. As a result:
+
First, we can do lexical analysis in parallel. If you have N CPU cores, you can split the file into N
+equal chunks, then in parallel locally adjust chunk boundaries such that they fall on newlines, and
+then tokenize each chunk separately.
+
Second, we have quick incremental tokenization — given a source edit, you determine the set of
+lines affected, and re-tokenize only those. The work is proportional to the size of the edit plus at
+most two boundary lines.
+
Third, any lexical error in a line is isolated just to this line. There’s no unclosed quote
+problem, mistakes are contained.
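A sketch of what this buys, in Rust-ish pseudocode (lex_line is a hypothetical stand-in for a real single-line tokenizer):

```rust
#[derive(Debug)]
struct Token; // kind, text, offsets, ...

// Hypothetical single-line tokenizer: because no token can span a
// newline, it never needs to look at any other line.
fn lex_line(line: &str) -> Vec<Token> {
    line.split_whitespace().map(|_| Token).collect() // stub
}

// Parallelism: lexing the whole file is an embarrassingly parallel map.
fn lex_file(text: &str) -> Vec<Vec<Token>> {
    text.lines().map(lex_line).collect()
}

// Incrementality: after an edit, re-lex only the affected lines.
// Error isolation: a bad token in one line never corrupts another line.
fn relex(tokens: &mut Vec<Vec<Token>>, changed: std::ops::Range<usize>, new_lines: &[&str]) {
    tokens.splice(changed, new_lines.iter().copied().map(lex_line));
}
```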
+
I am by no means saying that line-by-line lexing is a requirement for an IDE-friendly language
+(though it would be nice)! Rather, I want you to marvel how the same underlying structure of the
+problem can be exploited for quarantining errors, reacting to changes quickly, and parallelizing the
+processing.
+
The three properties are just three different faces of modularity in the end!
+
+
I do want to write that “IDE-friendly language” post at some point, but, as a hedge (after all, I
+still owe you “Why LSP Sucks?” one…), here are two comments where I explored the idea somewhat:
+1,
+2.
+
I also recommend these posts, which explore the same underlying phenomenon from the software
+architecture perspective:
People sometimes ask me: “Alex, how do I learn X?”. This article is a compilation of advice I
+usually give. This is “things that worked for me” rather than “the most awesome things on earth”. I
+do consider every item on the list to be fantastic though, and I am forever grateful to people
+putting these resources together.
I don’t think I have any useful advice on how to learn programming from zero. The rest of the post
+assumes that you at least can, given sufficient time, write simple programs. E.g., a program that
+reads a list of integers from an input textual file, sorts them using a quadratic algorithm, and
+writes the result to a different file.
https://projecteuler.net/archives is fantastic. The first 50 problems or so are a perfect “drill”
+to build programming muscle, to go from “I can write a program to sort a list of integers” to “I can
+easily write a program to sort a list of integers”.
+
Later problems are very heavily math based. If you are mathematically inclined, this is perfect —
+you got to solve fun puzzles while also practicing coding. If advanced math isn’t your cup of tea,
+feel free to stop doing problems as soon as it stops being fun.
https://en.wikipedia.org/wiki/Modern_Operating_Systems is fantastic. A version of the
+book was the first
+thick programming related tome I devoured. It gives a big picture of the inner workings of software
+stack, and was a turning point for me personally. After reading this book I realized that I want to
+be a programmer.
https://www.nand2tetris.org is fantastic. It plays a similar “big picture” role as MOS,
+but this time you are the painter. In this course you build a whole computing system yourself,
+starting almost from nothing. It doesn’t teach you how the real software/hardware stack works, but
+it thoroughly dispels any magic, and is extremely fun.
https://cses.fi/problemset/ is fantastic. This is a list of algorithmic problems, which is
+meticulously crafted to cover all the standard topics to a reasonable depth. This is by far the best
+source for practicing algorithms.
https://www.coursera.org/learn/programming-languages is fantastic. This course is a whirlwind tour
+across several paradigms of programming, and makes you really get what programming languages are
+about (and variance).
https://www.tedinski.com/archive/ is fantastic. Work through the whole archive in chronological
+order. This is by far the best resource on “programming in the large”.
Having a great mentor is fantastic, but mentors are not always available. Luckily, programming can
+be mastered without a mentor, if you got past the initial learning step. When you code, you get a
+lot of feedback, and, through trial and error, you can process the feedback to improve your skills.
+In fact, the hardest bit is actually finding the problems to solve (and this article suggests many).
+But if you have the problem, you can self-improve by noticing the following:
+
+
+How you verify that the solution works.
+
+
+Common bugs and techniques to avoid them in the future.
+
+
+Length of the solution: can you solve the problem using shorter, simpler code?
+
+
+Techniques — can you apply anything you’ve read about this week? How would the problem be solved
+in Haskell? Could you apply pattern from language X in language Y?
+
+
+
In this context it is important to solve the same problem repeatedly. E.g., you could try solving
+the same model problem in all languages you know, with a month or two break between attempts.
+Repeatedly doing the same thing and noticing differences and similarities between tries is the
+essence of self-learning.
Learning your first programming language is a nightmare, because you are learning your editing
+environment (PyScripter, IntelliJ IDEA, VS Code) first, simple algorithms second, and the language
+itself third. It gets much easier afterwards!
+
Learning different programming languages is one of the best ways to improve your programming skills.
+By seeing what’s similar, and what’s different, you learn more deeply how things work under the hood.
+Different languages put different idioms to the forefront, and learning several expands your
+vocabulary considerably. As a bonus, after learning N languages, learning N+1st becomes a question
+of skimming through the official docs.
+
In general, you want to cover big families of languages: Python, Java, Haskell, C, Rust, Clojure
+would be a good baseline. Erlang, Forth, and Prolog would be good additions afterwards.
+
+
Level 1
+
+
You are not actually learning algorithms, you are learning programming. At this stage, it doesn’t
+matter how long your code is, how pretty it is, or how efficient it is. The only thing that
+matters is that it solves the problem. Generally, this level ends when you are fairly comfortable
+with recursion. The first few problems from Project Euler are a great resource here.
+
+
Level 2
+
+
Here you learn algorithms proper. The goal here is mostly encyclopedic knowledge of common
+techniques. There are quite a few, but not too many of those. At this stage, the most useful thing
+is understanding the math behind the algorithms — being able to explain algorithm using
+pencil&paper, prove its correctness, and analyze Big-O runtime. Generally, you want to learn the
+name of algorithm or technique, read and grok the full explanation, and then implement it.
+
I recommend doing an abstract implementation first (i.e., not “HashMap to solve problem X”, but
+“just HashMap”). Include tests in your implementation. Use randomized testing (e.g., when testing
+sorting algorithms, don’t use a finite set of examples; generate a million random ones).
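+
To make the randomized-testing idea concrete, here is a sketch of mine in Rust (the rand crate and
+the insertion_sort under test are illustrative choices, not something from the original text):

```rust
use rand::Rng;

// The algorithm under test: a deliberately simple insertion sort.
fn insertion_sort(xs: &mut [i32]) {
    for i in 1..xs.len() {
        let mut j = i;
        while j > 0 && xs[j - 1] > xs[j] {
            xs.swap(j - 1, j);
            j -= 1;
        }
    }
}

// Instead of a handful of fixed examples, compare against std's sort
// on thousands of random inputs.
#[test]
fn randomized_against_std_sort() {
    let mut rng = rand::thread_rng();
    for _ in 0..10_000 {
        let len = rng.gen_range(0..64);
        let mut xs: Vec<i32> = (0..len).map(|_| rng.gen_range(-100..100)).collect();
        let mut expected = xs.clone();
        expected.sort();
        insertion_sort(&mut xs);
        assert_eq!(xs, expected);
    }
}
```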
+
It’s OK and even desirable to implement the same algorithm multiple times. When solving problems,
+like CSES, you could abstract your solutions and re-use them, but it’s better to code everything
+from scratch every time, until you’ve fully internalized the algorithm.
+
+
Level 3
+
+
One day, long after I’ve finished my university, I was a TA for an algorithms course. The lecturer
+for the course was the person who originally taught me to program, through a similar algorithms
+course. And, during one coffee break, he said something like
+
+
+
I was thunderstruck! I didn’t realize that’s the reason why I am learning (well, teaching at that
+point) algorithms! Before, I always muddled through my algorithms by randomly tweaking generally
+correct stuff until it works. E.g., with a binary search, just add +1 somewhere until it doesn’t
+loop on random arrays. After hearing this advice, I went home and wrote my millionth binary
+search, but this time I actually added comments with loop invariants, and it worked from the first
+try! I applied similar techniques for the rest of the course, and since then my subjective
+perception of bug rate (for normal work code) went down dramatically.
+
So this is the third level of algorithms — you hone your coding skills to program without bugs.
+If you are already fairly comfortable with algorithms, try doing CSES again. But this time, spend
+however much time you need double-checking the code before submission, but try to get everything
+correct on the first try.
Here’s the list of things you might want to be able to do, algorithmically. You don’t need to be
+able to code everything on the spot. I think it would help if you know what each word is about, and
+have implemented the thing at least once in the past.
A very powerful exercise is coding a medium-sized project from scratch. Something that takes more
+than a day, but less than a week, and has a meaningful architecture which can be just right, or
+messed up. Here are some great projects to do:
+
+
Ray Tracer
+
+
Given an analytical description of a 3D scene, convert it to a colored 2D image, by simulating a
+path of a ray of light as it bounces off objects.
+
+
Software Rasterizer
+
+
Given a description of a 3D scene as a set of triangles, convert it to a colored 2D image by
+projecting triangles onto the viewing plane and drawing the projections in the correct order.
+
+
Dynamically Typed Programming Language
+
+
An interpreter which reads source code as text, parses it into an AST, and directly executes the
+AST (or maybe converts the AST to byte code for some speedup).
+
+
Statically Typed Programming Language
+
+
A compiler which reads source code as text, and spits out a binary (WASM would be a terrific
+target).
+
+
Relational Database
+
+
Several components:
+
+
+Storage engine, which stores data durably on disk and implements on-disk ordered data structures
+(B-tree or LSM)
+
+
+Relational data model which is implemented on top of primitive ordered data structures.
+
+
+Relational language to express schema and queries.
+
+
+Either a TCP server to accept transactions as a database server, or an API for embedding for an
+in-process “embedded” database.
+
+
+
+
Chat Server
+
+
An exercise in networking and asynchronous programming. Multiple client programs connect to a
+server program. A client can send a message either to a specific different client, or to all other
+clients (broadcast). There are many variations on how to implement this: blocking read/write
+calls, epoll, io_uring, threads, callbacks, futures, manually-coded state machines.
+
+
+
Again, it’s more valuable to do the same exercise six times with variations, than to blast through
+everything once.
Zig has a nominal type system despite the fact that types lack names. A struct type is declared by
+struct { field: T }.
+It’s anonymous; an explicit assignment is required to name the type:
+
+
+
Still, the type system is nominal, not structural. The following does not compile:
+
+
+
The following does:
+
+
+
One place where Zig is structural is anonymous struct literals:
+
+
+
The types of x and y are different, but x can be coerced to y.
+
In other words, Zig structs are anonymous and nominal, but anonymous structs are structural!
Simple type inference for an expression works by first recursively inferring the types of
+subexpressions, and then deriving the result type from that. So, to infer types in
+foo().bar(), we first derive the type of foo(), then lookup method bar on that
+type, and use the return type of the method.
+
More complex type inference works through the so-called unification algorithm. It starts with a similar
+recursive walk over the expression tree, but this walk doesn’t infer types directly, but rather
+assigns a type variable to each subexpression, and generates equations relating type variables. So the
+result of this first phase looks like this:
+
+
+
Then, in the second phase the equations are solved, yielding, in this case, x = Int and y = Int.
+
Usually languages with powerful type systems have unification somewhere, though often unification
+is limited in scope (for example, Kotlin infers types statement-at-a-time).
+
It is curious that Zig doesn’t do unification: type inference is a simple single-pass recursion (or
+at least it should be, I haven’t looked at how it is actually implemented). So, anytime there’s a
+generic function like
+fn reverse(comptime T: type, xs: []T) void,
+the call site has to pass the type in explicitly:
+
+
+
Does it mean that you have to pass the types all the time? Not really! In fact, the only place which
+feels like a burden is the functions in the std.mem module which operate on slices, but that’s just
+because slices are builtin types (a kind of pointer really) without methods. The thing is, when you
+call a method on a “generic type”, its type parameters are implicitly in scope, and don’t have to be
+specified. Study this example:
+
+
+
There’s a runtime parallel here. At runtime, there’s a single dynamic dispatch, which prioritizes
+dynamic type of the first argument, and multiple dynamic dispatch, which can look at dynamic types
+of all arguments. Here, at compile time, the type of the first argument gets a preferential
+treatment. And, similarly to runtime, this covers 80% of use cases! Though, I’d love for things like
+std.mem.eql to be actual methods on slices…
One of the best tricks a language server can pull off for as-you-type analysis is skipping bodies of
+the functions in dependencies. This works as long as the language requires complete signatures. In
+functional languages, it’s customary to make signatures optional, which precludes this crucial
+optimization. As per Modularity Of Lexical
+Analysis, this has
+repercussions for all of:
+
+
+incremental compilation,
+
+
+parallel compilation,
+
+
+robustness to errors.
+
+
+
I always assumed that Zig with its crazy comptime requires autopsy.
+But that’s not actually the case! Zig doesn’t have decltype(auto), signatures are always explicit!
+
Let’s look at, e.g., std.mem.bytesAsSlice:
+
+
+
Note how the return type is not anytype, but the actual, real thing. You could write complex
+computations there, but you can’t look inside the body. Of course, it also is possible to write fn
+foo() @TypeOf(bar()) {, but that feels like fair game — bar() will be evaluated at
+compile time. In other words, only bodies of functions invoked at comptime need to be looked at by
+a language server. This potentially improves performance for this use-case quite a bit!
+
It’s useful to contrast this with Rust. There, you could write
+
+
+
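The original Rust snippet is elided here; as a stand-in of my own, consider a function that returns
+impl Trait without ever mentioning auto traits:

```rust
use std::cell::Cell;
use std::rc::Rc;

// The signature promises only `FnMut() -> u32` ...
fn counter() -> impl FnMut() -> u32 {
    let mut n = 0;
    move || { n += 1; n }
}

// ... and so does this one, but the captured `Rc` is not `Send`.
fn shared_counter() -> impl FnMut() -> u32 {
    let n = Rc::new(Cell::new(0));
    move || { n.set(n.get() + 1); n.get() }
}

fn assert_send<T: Send>(_: T) {}

fn main() {
    assert_send(counter()); // compiles: the hidden closure type happens to be `Send`
    // assert_send(shared_counter()); // error: `Rc<Cell<u32>>` is not `Send`
}
```
+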
Although it feels like you are stating the interface, it’s not really the case. Auto traits like
+Send and Sync leak, and that can be detected by downstream code and lead to, e.g., different
+methods being called via Deref-based specialization depending on : Send being implemented:
+
+
+
Zig is much more strict here, you have to fully name the return type (the name doesn’t have to be
+pretty, take a second look at bytesAsSlice). But it’s not perfect: a genuine leakage happens with
+inferred error types (!T syntax). A bad example would look like this:
+
+
+
Here, to check main, we actually do need to dissect f’s body, we can’t treat the error union
+abstractly. When the compiler analyzes main, it needs to stop to process f signature (which is
+very fast, as it is very short) and then f’s body (this part could be quite slow, there might be a
+lot of code behind that Mystery!). It’s interesting to ponder alternative semantics where, during
+type checking, inferred types are treated abstractly, and error exhaustiveness is a separate late
+pass in the compiler. That way, the compiler only needs f’s signature to check main. And that means
+that bodies of main and f could be checked in parallel.
+
That’s all for today! The type system surprises I’ve found so far are:
+
+
+
Nominal type system despite notable absence of names of types.
+
+
+
Unification-less generics which don’t incur unreasonable annotation burden due to methods “closing
+over” generic parameters.
+
+
+
+Explicit signatures with no Voldemort types, with the
+notable exception of error unions.
“Algorithms” are a useful skill not because you use them at work every day, but because they train you
+to be better at particular aspects of software engineering.
+
Specifically:
+
First, algorithms drill the skill of bug-free coding. Algorithms are hard and frustrating! A subtle
+off-by-one might not matter for simple tests, but breaks corner cases. But if you practice
+algorithms, you get better at this particular skill of writing correct small programs, and I think
+this probably generalizes.
+
To give an array of analogies:
+
+
+
People do cardio or strength exercises not because they need to lift heavy weights in real life.
+Quite the opposite — there’s too little physical exertion in our usual lives, so we need extra
+exercises for our bodies to gain generalized health (which is helpful in day-to-day life).
+
+
+
You don’t practice complex skill by mere repetition. You first break it down into atomic trainable
+sub skills, and drill each sub skill separately in unrealistic conditions. Writing correct
+algorithmy code is a sub skill of software engineering.
+
+
+
+When you optimize a system, you don’t just repeatedly run an end-to-end test until things go fast. You
+first identify the problematic area, then write a targeted micro benchmark to isolate this
+particular effect, and then you optimize that using a much shorter feedback loop.
+
+
+
I still remember two specific lessons I learned when I started doing algorithms many years ago:
+
+
Debugging complex code is hard, first simplify, then debug
+
+
Originally, when I was getting a failed test, I sort of tried to add more code to my program to
+make it pass. At some point I realized that this is going nowhere, and then I changed my workflow
+to first try to remove as much code as I can, and only then investigate the problematic test
+case (which with time morphed into a skill of not writing more code than necessary in the first
+place).
+
+
Single source of truth is good
+
+
+A lot of my early bugs were due to me duplicating the same piece of information in two places and
+then getting them out of sync. Internalizing the single-source-of-truth principle fixed the issues.
+
+
+
Meta note: if you already know this, my lessons are useless. If you don’t yet know them, they are
+still useless and most likely will bounce off you. This is tacit knowledge — it’s very hard to
+convey it verbally, it is much more efficient to learn these things yourself by doing.
+
Somewhat related, I noticed a surprising correlation between programming skills in the small, and
+programming skills in the large. You can solve a problem in five lines of code, or, if you try hard,
+in ten lines of code. If you consistently come up with concise solutions in the small, chances are
+large scale design will be simple as well.
+
I don’t know how true that is, as I never tried to look at a proper study, but it looks very
+plausible from what I’ve seen. If this is true, the next interesting question is: “if you train
+programming-in-the-small skills, do they transfer to programming in the large?”. Again, I don’t
+know, but I’d take this Pascal’s wager.
+
Second, algorithms teach about properties and invariants. Some lucky people get those skills from
+a hard math background, but algorithms are a much more accessible way to learn them, as everything
+is very visual, immediately testable, and has a very short and clear feedback loop.
+
And properties and invariants are what underlie most big and successful systems. Like 90% of the
+code is just fluff and glue, and if you have the skill to see the 10% that carries the architecturally
+salient properties, you can comprehend the system much faster.
+
Third, algorithms occasionally are useful at the job! Just last week on our design walk&talk we
+were brainstorming one particular problem, and I was like
+
+
+
We probably won’t go with that solution as that’s too complex algorithmically for what ultimately is
+a corner case, but it’s important that we understand the problem space in detail before we pick a
+solution.
+
Note also how algorithms vocabulary helps me to think about the problem. In math (including
+algorithms), there’s just like a handful of ideas which are applied again and again under different
+guises. You need some amount of insight of course, but, for most simple problems, what you actually
+need is just an ability to recognize the structure you’ve seen somewhere already.
+
Fourth, connecting to the previous ones, the ideas really do form an interconnected web which, on a
+deep level, underpins a whole lot of stuff. So, if you do have non-zero amount of pure curiosity
+when it comes to learning programming, algorithms cut pretty deep to the foundation. Let me repeat
+the list from the last post, but with explicit connections to other things:
+
+
linear search
+
+
assoc lists in most old functional languages work that way
+
+
binary search
+
+
It is literally everywhere. Also, binary search got a cute name, but actually it isn’t the
+primitive operation. The primitive operation is partition_point, a predicate version of binary
+search. This is what you should add to your language’s stdlib as a primitive, and base everything
+else in terms of it. Also, it is one of the few cases where we know the lower bound of complexity. If
+an algorithm does k binary comparisons, it can give at most 2^k distinct answers. So, to find the
+insertion point among n items, you need at least k questions such that 2^k > n.
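+
As a concrete illustration (my own, in Rust, where the primitive is available as
+slice::partition_point), plain binary search falls out of the predicate version:

```rust
// partition_point takes a slice sorted by the predicate and returns the
// index of the first element for which the predicate is false.
fn insertion_point(xs: &[i32], x: i32) -> usize {
    xs.partition_point(|&y| y < x)
}

// "Classic" binary search is then a two-liner on top of the primitive.
fn contains(xs: &[i32], x: i32) -> bool {
    let i = xs.partition_point(|&y| y < x);
    i < xs.len() && xs[i] == x
}
```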
+
+
quadratic sorting
+
+
We use it at work! Some collections are statically bound by a small constant, and quadratically
+sorting them just needs less machine code. We are also a bit paranoid that production sort
+algorithms are very complex and might have subtle bugs, esp in newer languages.
+
+
merge sort
+
+
This is how you sort things on disk. This is also how LSM-trees, the most practically important
+data structure you haven’t learned about in school, work! And k-way merge also is occasionally
+useful (this is from work from three weeks ago).
+
+
heap sort
+
+
Well, this one is only actually useful for the heap, but I think maybe the kernel uses it when
+it needs to sort something in place, without extra memory, and in guaranteed O(N log N)?
+
+
binary heap
+
+
Binary heaps are everywhere! Notably, simple timers are a binary heap of things in the order of
+expiration. This is also a part of Dijkstra and k-way-merge.
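+
A sketch of that timer idea in Rust (my own example, not from the original text):

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;
use std::time::Instant;

// A timer queue: std's BinaryHeap is a max-heap, so Reverse turns it into
// a min-heap ordered by deadline.
struct Timers {
    heap: BinaryHeap<Reverse<(Instant, u64)>>,
}

impl Timers {
    fn schedule(&mut self, deadline: Instant, timer_id: u64) {
        self.heap.push(Reverse((deadline, timer_id)));
    }

    // Pop one timer whose deadline has passed, if any.
    fn pop_expired(&mut self, now: Instant) -> Option<u64> {
        let Reverse((deadline, timer_id)) = *self.heap.peek()?;
        if deadline <= now {
            self.heap.pop();
            Some(timer_id)
        } else {
            None
        }
    }
}
```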
+
+
growable array
+
+
+That’s the most widely used collection of them all! Did you know that grow factor 2 has a
+problem that the size after n reallocations is larger than the sum total of all previous sizes,
+so the allocator can’t re-use the space? Anecdotally, growth factors less than two are preferable
+for this reason.
+
+
binary search tree
+
+
+Again, rust-analyzer green trees are binary search trees using offset as an implicit key.
+Monoid trees are also binary search trees.
+
+
AVL tree
+
+
Ok, this one I actually don’t know a direct application for! But I remember two
+programming-in-the-small lessons AVL could have taught me, but didn’t. I struggled a lot
+implementing all of “small left rotation”, “small right rotation”, “big left rotation”, “big right
+rotation”. Some years later, I’ve learned that you don’t do
+
+
+
as that forces code duplication. Rather, you do children: [Tree; 2] and then you could
+use child_index and child_index ^ 1 to abstract over left-right.
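+
Roughly, the trick looks like this (a sketch of mine, not a complete AVL implementation):

```rust
struct Node {
    children: [Option<Box<Node>>; 2], // 0 = left, 1 = right
    key: i32,
}

// One function covers "small left rotation" and "small right rotation":
// `d` selects which child gets promoted, `d ^ 1` is the opposite side.
fn rotate(mut node: Box<Node>, d: usize) -> Box<Node> {
    let mut child = node.children[d].take().expect("rotation needs a child");
    node.children[d] = child.children[d ^ 1].take();
    child.children[d ^ 1] = Some(node);
    child
}
```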
+
And then some years later still I read in wikipedia that big rotations are actually a composition
+of two small rotations.
+
Actually, I’ve lied that I don’t know connections here. You use the same rotations for the splay
+tree.
+
+
Red Black Tree
+
+
red-black tree is a 2-3 tree is a B-tree. Also, you probably use jemalloc, and it has a red-black
+tree implemented as a C
+macro.
+Left-leaning red-black trees are an interesting variation, which is claimed to be simpler, but is
+also claimed to not actually be simpler, because it is not symmetric and neuters the children
+trick.
+
+
B-tree
+
+
If you use Rust, you probably use B-tree. Also, if you use a database, it stores data either in
+LSM or in a B-tree. Both of these are because B-trees play nice with memory hierarchy.
+
+
hash table
+
+
+Literally everywhere; both chaining and open-addressing versions are widely used.
+
+
Depth First Search
+
+
+This is something I have to code, explicitly or implicitly, fairly often. Every time you have a
+DAG, where things depend on other things, you’ll have a DFS somewhere. In rust-analyzer,
+there are at least a couple — one in the borrow checker for something (I have no idea what that does,
+just grepped for fn dfs) and one in crate graph to detect cycles.
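+
The typical shape of such code (a sketch of mine, not rust-analyzer’s actual implementation) is a
+three-color DFS that looks for a back edge:

```rust
#[derive(Clone, Copy, PartialEq)]
enum Mark {
    White, // not visited yet
    Grey,  // on the current DFS path
    Black, // fully explored
}

// `deps[v]` lists the nodes that node `v` depends on.
fn has_cycle(deps: &[Vec<usize>]) -> bool {
    fn visit(v: usize, deps: &[Vec<usize>], marks: &mut [Mark]) -> bool {
        match marks[v] {
            Mark::Grey => return true, // back edge: a cycle
            Mark::Black => return false,
            Mark::White => {}
        }
        marks[v] = Mark::Grey;
        for &u in &deps[v] {
            if visit(u, deps, marks) {
                return true;
            }
        }
        marks[v] = Mark::Black;
        false
    }

    let mut marks = vec![Mark::White; deps.len()];
    (0..deps.len()).any(|v| visit(v, deps, &mut marks))
}
```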
+
+
Breadth First Search
+
+
+Ditto, any kind of exploration problem is usually solved with bfs. E.g., rust-analyzer uses bfs
+for directory traversal.
+
Which is better, bfs or dfs? Why not both?! Take a look at bdfs from rust-analyzer:
+
+
topological sort
+
+
+Again, comes up every time you deal with things which depend on each other. rust-analyzer has
+crates_in_topological_order
+
+
Strongly Connected Components
+
+
This is needed every time things depend on each other, but you also allow cyclic dependencies. I
+don’t think I’ve needed this one in real life. But, given that SCC is how you solve 2-SAT in
+polynomial time, seems important to know to understand the 3 in 3-SAT
+
+
Minimal Spanning Tree
+
+
Ok, really drawing a blank here! Connects to sorting, disjoint set union (which is needed for
+unification in type-checkers), and binary heap. Seems practically important algorithm though! Ah,
+MST also gives an approximation for planar traveling salesman I think, another border between hard
+& easy problems.
+
+
Dijkstra
+
+
Dijkstra is what I think about when I imagine a Platonic algorithm, though
+I don’t think I’ve used it in practice? Connects to heap.
+
Do you know why we use i, j, k for loop indices? Because D ijk stra!
+
+
Floyd-Warshall
+
+
+This one is cool! Everybody knows why any regular expression can be compiled to an equivalent
+finite state machine. Few people know the reverse, why each automaton has an equivalent regex
+(many people know this fact, but few understand why). Well, because Floyd-Warshall! To convert an
+automaton to regex use the same algorithm you use to find pairwise distances in a graph.
+
+Also, this is the final boss of dynamic programming. If you understand why this algorithm works, you
+understand dynamic programming. Despite being tricky to understand, it’s very easy to implement! I
+randomly stumbled into Floyd-Warshall, when I tried to implement a different, wrong approach, and
+made a bug which turned my broken algo into a correct Floyd-Warshall.
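+
For reference, the whole algorithm fits in a few lines (a sketch of mine; f64::INFINITY stands for
+“no edge”):

```rust
// All-pairs shortest paths over an adjacency matrix: dist[i][j] starts as the
// edge weight, f64::INFINITY if there is no edge, and 0.0 on the diagonal.
fn floyd_warshall(dist: &mut [Vec<f64>]) {
    let n = dist.len();
    for k in 0..n {
        for i in 0..n {
            for j in 0..n {
                let via_k = dist[i][k] + dist[k][j];
                if via_k < dist[i][j] {
                    dist[i][j] = via_k;
                }
            }
        }
    }
}
```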
+
+
Bellman-Ford
+
+
+Again, not many practical applications here, but the theory is well connected. All shortest path
+algorithms are actually fixed-point iterations! But with Bellman-Ford and its explicit edge
+relaxation operator that’s most obvious. Next time you open static analysis textbook and learn
+about fixed point iteration, map that onto the problem of finding shortest paths!
+
+
Quadratic Substring Search
+
+
+This is what your language standard library does.
+
+
Rabin-Karp
+
+
An excellent application of hashes. The same idea, hash(composite) =
+combine(hash(component)*), is used in rust-analyzer to intern syntax
+trees.
+
+
Boyer-Moore
+
+
+This is a beautiful and practical algorithm which probably handles the bulk of real-world searches
+(that is, it’s probably the hottest bit of ripgrep as used by an average person). Delightfully,
+this algorithm is faster than theoretically possible — it doesn’t even look at every byte of
+input data!
+
+
Knuth-Morris-Pratt
+
+
Another “this is how you do string search in the real world” algorithm. It also is the platonic
+ideal of a finite state machine, and almost everything is an FSM. It also is Aho-Corasick.
+
+
Aho-Corasick
+
+
This is the same as Knuth-Morris-Pratt, but also teaches you about tries. Again, super-useful for
+string searches. As it is an FSM, and a regex is an FSM, and there’s a general construct for
+building a product of two FSMs, you can use it to implement fuzzy search. “Workspace symbol”
+feature in rust-analyzer works like this. Here’s a part
+of implementation.
+
+
Edit Distance
+
+
Everywhere in Bioinformatics (not the actual edit distance, but this problem shape). The first
+post on this blog is about this problem:
There are two main historical trends when choosing an implementation language for something
+compiler-shaped.
+
For more language-centric tasks, like a formal specification, or a toy hobby language, OCaml makes
+most sense. See, for example, plzoo or WebAssembly reference
+interpreter.
+
For something implementation-centric and production ready, C++ is often chosen: LLVM, clang, v8,
+HotSpot are all C++.
+
These days, Rust is a great new addition to the landscape. It is influenced most directly by ML and
+C++, combines their strengths, and even brings something new of its own to the table, like seamless,
+safe multithreading. Still, Rust leans heavily towards production readiness side of the spectrum.
+While some aspects of it, like a “just works” build system, help with prototyping as well, there’s
+still extra complexity tax due to the necessity to model physical layout of data. The usual advice,
+when you start building a compiler in Rust, is to avoid pointers and use indexes. Indexes are great!
+In a large codebase, they allow greater decoupling (side tables can stay local to relevant modules),
+improved performance (an index is u32 and nudges you towards struct-of-arrays layouts), and more
+flexible computation strategies (indexes are easier to serialize or plug into incremental
+compilation framework). But they do make programming-in-the-small significantly more annoying, which
+is a deal-breaker for hobbyist tinkering.
+
But OCaml is crufty! Is there something better? Today, I realized that TypeScript might actually be
+OK? It is not really surprising, given how the language works, but it never occurred to me to think
+about TypeScript as an ML equivalent before.
+
So, let’s write a tiny-tiny typechecker in TS!
+
Of course, we start with deno. See A Love Letter to
+Deno for more details, but the
+TL;DR is that deno provides out-of-the-box experience for TypeScript. This is a pain point for
+OCaml, and something that Rust does better than either OCaml or C++. But deno does this better than
+Rust! It’s just a single binary, it comes with linting and formatting, there’s no compilation step,
+and there is a built-in task runner and a watch mode. A dream setup for quick PLT hacks!
+
And then there’s TypeScript itself, with its sufficiently flexible, yet light-ceremony type system.
+
Let’s start with defining an AST. As we are hacking, we won’t bother with making it an IDE-friendly
+concrete syntax tree, or incremental-friendly “only store relative offsets” tree, and will just tag
+AST nodes with locations in file:
+
+
+
Even here, we already see the high-level nature of TypeScript — string is just a string, there’s no
+thinking about usize vs u32 as numbers are just numbers.
+
Usually, an expression is defined as a sum-type. As we want to tag each expression with a location,
+that representation would be slightly inconvenient for us, so we split things up a bit:
+
+
+
One more thing — as we are going for something quick, we’ll be storing inferred types directly in
+the AST nodes. Still, we want to keep raw and type-checked AST separate, so what we are going to do
+here is to parametrize the Expr over associated data it stores. A freshly parsed expression would
+use void as data, and the type checker will set it to Type. Here’s what we get:
+
+
+
A definition of ExprBinary could look like this:
+
+
+
Note how I don’t introduce separate types for, e.g., AddExpr and SubExpr — all binary
+expressions have the same shape, so one type is enough!
+
But we need a tiny adjustment here. Our Expr kind is defined as a union type. To match a value of
+a union type a bit of runtime type information is needed. However, it’s one of the core properties
+of TypeScript that it doesn’t add any runtime behaviors. So, if we want to match on expression kinds
+(and we for sure want!), we need to give a helping hand to the compiler and include a bit of RTTI
+manually. That would be the tag field:
+
+
+
tag: "binary" means that the only possible runtime value for tag is the string "binary".
+
Similarly to various binary expressions, boolean literal and int literal expressions have almost
+identical shape. Almost, because the payload (boolean or number) is different. TypeScript
+allows us to neatly abstract over this:
+
+
+
Finally, for control-flow expressions we only add if for now:
+
+
+
This concludes the definition of the AST! Let’s move on to the type inference! Start with types:
+
+
+
Our types are really simple, we could have gone with type Type = "Int" | "Bool", but
+let’s do this a bit more enterprisy! We define separate types for integer and boolean types. As these
+types are singletons, we also provide canonical definitions. And here is another TypeScript-ism.
+Because TypeScript fully erases types, everything related to types lives in a separate namespace. So
+you can have a type and a value sharing the same name. Which is exactly what we use to define the
+singletons!
+
Finally, we can take advantage of our associated-data parametrized expression and write the
+signature of
+
+
+
As it says on the tin, infer_types fills in Type information into the void! Let’s fill in the
+details!
+
+
+
If at this point we hit Enter, the editor completes:
+
+
+
There’s one problem though. What we really want to write here is something like
+const inferred_type = switch(..),
+but in TypeScript switch is a statement, not an expression.
+So let’s define a generic visitor!
+
+
+
Armed with the visit, we can ergonomically match over the expression:
+
+
+
Before we go further, let’s generalize this visiting pattern a bit! Recall that our expressions are
+parametrized by the type of associated data, and a type-checker-shaped transformation is essentially an
+Expr<U> -> Expr<V>
+mapping.
+
Let’s make this generic!
+
+
+
Transform maps an expression carrying U into an expression carrying V by applying an f
+visitor. Importantly, it’s Visitor<V, V>, rather than a Visitor<U, V>. This is
+counter-intuitive, but correct — we run transformation bottom up, transforming the leaves first.
+So, when the time comes to visit an interior node, all subexpressions will have been transformed!
+
The body of transform is wordy, but regular, rectangular, and auto-completes itself:
+
+
+
+
+
Note how here expr.kind is both Expr<U> and Expr<V> — literals don’t depend on this type
+parameter, and TypeScript is smart enough to figure this out without us manually re-assembling
+the same value with a different type.
+
+
+
This is where that magic with Visitor<V, V> happens.
+
+
+
The code is pretty regular here though! So at this point we might actually recall that TypeScript is
+a dynamically-typed language, and write a generic traversal using Object.keys, while keeping the
+static function signature in-place. I don’t think we need to do it here, but there’s comfort in
+knowing that it’s possible!
+
Now implementing type inference should be a breeze! We need some way to emit type errors though.
+With TypeScript, it would be trivial to accumulate errors into an array as a side-effect, but let’s
+actually represent type errors as instances of a specific type, TypeError (pun intended):
+
+
+
To check ifs and binary expressions, we would also need a utility for comparing types:
+
+
+
We make the Error type equal to any other type to prevent cascading failures. With all that
+machinery in place, our type checker is finally:
+
+
+
An astute reader will notice that our visitor functions now take an extra ast.Location argument.
+TypeScript allows using this argument only in cases where it is needed, cutting down verbosity.
+
And that’s all for today! The end result is pretty neat and concise. It took some typing to get there,
+but TypeScript autocompletion really helps with that! What’s more important, there was very little
+fighting with the language, and the result feels quite natural and directly corresponds to the shape
+of the problem.
+
I am not entirely sure about the conclusion just yet, but I think I’ll be using TypeScript as my tool
+of choice for various small language hacks. It is surprisingly productive due to the confluence of
+three aspects:
+
+
+deno is a perfect scripting runtime! Small, hermetic, powerful, and optimized for effective
+development workflows.
+
+
+TypeScript tooling is great — the IDE is helpful and productive (and deno makes sure that it
+also requires zero configuration)
+
+
+The language is powerful both at runtime and at compile time. You can get pretty fancy with types,
+but you can also just escape to dynamic world if you need some very high-order code.
+
+
+
+
Just kidding, here’s one more cute thing. Let’s say that we want to have lots of syntactic sugar,
+and also want type-safe desugaring. We could tweak our setup a bit for that: instead of Expr and
+ExprKind being parametrized over associated data, we circularly parametrize Expr by the whole
+ExprKind and vice versa:
+
+
+
This allows expressing desugaring in a type-safe manner!
That’s too damn many of them! Some time ago I’ve noticed that my code involving comparisons is often
+hard to understand, and hides bugs. I’ve figured out some rules of thumb to reduce complexity, which I
+want to share.
+
The core idea is to canonicalize things. Both x < y and y > x mean the same, and, if you use
+them with roughly equal frequency, you need to spend extra mental capacity to fold the two versions
+into the single “x tiny, y HUGE” concept in your head.
+
The number line is a great intuition and visualization
+for comparisons. If you order things from small to big,
+A B C D,
+you get an intuitive concept of ordering without using comparison operators. You also plug into your
+existing intuition that the sort function arranges arrays in the ascending order.
+
So, as a first order rule-of-thumb:
+Strongly prefer < and <= over > and >=
+And, when using comparisons, use number line intuition.
+
Some snippets:
+
Checking if a point is inside the interval:
+
+
+
Checking if a point is outside of the interval:
+
+
+
Segment a is inside segment b:
+
+
+
Segments a and b are disjoint (either a is to the left of b or a is to the right of b):
+
+
+
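Following the number-line rule, the four checks above might look roughly like this (a sketch of
+mine, assuming half-open intervals and a hypothetical Segment struct):

```rust
struct Segment {
    lo: i32,
    hi: i32, // half-open: the segment covers lo..hi
}

// Point inside the interval: lo ... x ... hi, reading left to right.
fn point_inside(x: i32, s: &Segment) -> bool {
    s.lo <= x && x < s.hi
}

// Point outside: x is either to the left of lo or to the right of hi.
fn point_outside(x: i32, s: &Segment) -> bool {
    x < s.lo || s.hi <= x
}

// Segment a inside segment b: b.lo ... a.lo ... a.hi ... b.hi.
fn segment_inside(a: &Segment, b: &Segment) -> bool {
    b.lo <= a.lo && a.hi <= b.hi
}

// Disjoint: a entirely to the left of b, or b entirely to the left of a.
fn disjoint(a: &Segment, b: &Segment) -> bool {
    a.hi <= b.lo || b.hi <= a.lo
}
```
+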
A particular common case for ordered comparisons is checking that an index is in bounds for an
+array. Here, the rule about number line works together with another important rule: State
+invariants positively
+
The indexing invariant is spelled as index < xs.len(),
+
and you should prefer to see it exactly that way in the source code. Concretely,
+
+
+
+is hard to get right, because it spells the converse of the invariant, and involves an extra mental
+negation (this is subtle — although there isn’t a literal negation operator, you absolutely do
+think about this as a negation of the invariant). If possible, the code should be reshaped to state
+the invariant positively.
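+
Schematically (a hypothetical lookup of mine, not the original snippet):

```rust
// Hard to get right: it spells `index >= xs.len()`, the converse of the invariant.
fn get_negative(xs: &[i32], index: usize) -> Option<i32> {
    if index >= xs.len() {
        return None;
    }
    Some(xs[index])
}

// Easier: the happy path is guarded by `index < xs.len()`, the invariant itself.
fn get_positive(xs: &[i32], index: usize) -> Option<i32> {
    if index < xs.len() {
        Some(xs[index])
    } else {
        None
    }
}
```
+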
I extolled the benefits of programming with invariants in a couple of recent posts.
+Naturally, I didn’t explain what I think when I write “invariant”. This post fixes that.
+
There are at least three different concepts I label with “invariant”:
+
+
+a general “math” mode of thinking, where you distinguish between fuzzy, imprecise thoughts and
+precise statements with logical meaning.
+
+
+a specific technique for writing correct code when programming in the small.
+
+
+when programming in the large, compact, viral, descriptive properties of the systems.
+
+
+
I won’t discuss the first point here — I don’t know how to describe this better than “that
+thing that you do when you solve a non-trivial math puzzler”. The bulk of the post describes the
+second bullet point, for which I think I have a perfect litmus test to explain exactly what I am
+thinking here. I also touch a bit on the last point in the end.
+
So let’s start with a litmus test program to show invariants in
+the small in action:
+
+
+
You might want to write one yourself before proceeding. Here’s an exhaustive
+test for this functionality,
+using exhaustigen crate:
+
+
+
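A brute-force stand-in without the crate could look like this (assuming the
+fn insertion_point(xs: &[i32], x: i32) -> usize shape that the rest of the post implies):

```rust
// Reference implementation: index of the first element that is >= x.
fn naive_insertion_point(xs: &[i32], x: i32) -> usize {
    xs.iter().take_while(|&&y| y < x).count()
}

#[test]
fn exhaustive_small_inputs() {
    // Enumerate every sorted array of length <= 4 with values in 0..4,
    // and every probe value in 0..5.
    for len in 0..5u32 {
        for code in 0..4u32.pow(len) {
            let mut xs: Vec<i32> =
                (0..len).map(|i| ((code / 4u32.pow(i)) % 4) as i32).collect();
            xs.sort();
            for x in 0..5 {
                assert_eq!(insertion_point(&xs, x), naive_insertion_point(&xs, x));
            }
        }
    }
}
```
+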
Here’s how I would naively write this function. First, I start with defining the boundaries for the
+binary search:
+
+
+
Then, repeatedly cut the interval in half until it vanishes
+
+
+
and recur into the left or the right half accordingly:
+
+
+
Altogether:
+
+
+
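Reconstructed from the surrounding description, the naive version is presumably close to:

```rust
fn insertion_point(xs: &[i32], x: i32) -> usize {
    let mut lo = 0;
    let mut hi = xs.len();
    while lo < hi {
        let mid = lo + (hi - lo) / 2; // midpoint without overflow
        if x < xs[mid] {
            hi = mid;
        } else {
            lo = mid;
        }
    }
    lo
}
```
+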
I love this code! It has so many details right!
+
+
+The insertion_point interface compactly compresses the usually messy result of a binary search to
+just one index.
+
+
+xs / x pair of names for the sequence and its element crisply describes an abstract algorithm on
+sequences.
+
+
+Similarly, lo / hi name pair is symmetric, expressing the relation between the two indexes.
+
+
+Half-open intervals are used for indexing.
+
+
+There is no special casing anywhere, the natural lo < hi condition handles the empty slice.
+
+
+We even dodge Java’s binary search bug by computing midpoint without overflow.
+
+
+
There’s only one problem with this code — it doesn’t work. Just blindly following rules-of-thumb
+gives you working code surprisingly often, but this particular algorithm is an exception.
+
The question is, how do we fix this otherwise great code? And here’s where thinking in invariants helps.
+Before I internalized invariants, my approach would be to find a failing example, and to fumble with
+some plus or minus ones here and there and other special casing to make it work. That is, find a
+concrete problem, solve it. This works, but is slow, and doesn’t allow discovering the problem
+before running the code.
+
The alternative is to actually make an effort and spell out, explicitly, what the code is supposed
+to do. In this case, we want lo and hi to bound the result. That is,
+lo <= insertion_point <= hi
+should hold on every iteration. It clearly holds before we enter the loop. On each iteration, we
+would like to shorten this interval, cutting away the part that definitely does not contain
+the insertion point.
+
Elaborating the invariant, all elements to the left of lo should be less than the target.
+Conversely, all elements to the right of hi should be at least as large as the target.
+
+
+
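Spelled out as a comment (my paraphrase of the stated invariant):

```rust
// Invariant, with n = xs.len():
//   xs[i] <  x   for every i in 0..lo
//   xs[i] >= x   for every i in hi..n
// and therefore lo <= insertion_point <= hi.
```
+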
Let’s now take a second look at the branching condition:
+
+
+
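As the discussion further down confirms, that condition is x < xs[mid]:

```rust
if x < xs[mid] {
    hi = mid;
} else {
    lo = mid;
}
```
+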
It matches neither invariant prong exactly: x is on the left, but inequality is strict. We can
+rearrange the code to follow the invariant more closely:
+
+
+
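Reconstructed from the two changes described below:

```rust
if xs[mid] < x {
    lo = mid + 1;
} else {
    hi = mid;
}
```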
+
+we flip the condition and if-branches, so that xs[mid] < x matches xs[i] < x from the
+invariant for lo
+
+
+to make the invariant tight, we add mid + 1 (if xs[mid] is less than x, we know that the
+insertion point is at least mid + 1)
+
+
+
The code now works. So what went wrong with the original version with x < xs[mid]? In the else
+case, when x >= xs[mid] we set lo = mid, but that’s wrong! It might be the case that x ==
+xs[mid] and x == xs[mid - 1], which would break the invariant for lo.
+
The point isn’t in this particular invariant or this particular algorithm. It’s the general
+pattern that it’s easy to write the code which implements the right algorithm, and sort-of works,
+but is wrong in details. To get the details right for the right reason, you need to understand
+precisely what the result should be, and formulating this as a (loop or recursion) invariant
+helps.
+
+
Perhaps it’s time to answer the title question: an invariant is some property which holds at all times
+during dynamic evolution of the system. In the above example, the evolution is the program
+progressing through subsequent loop iterations. The invariant, the condition binding lo and hi,
+holds on every iteration. Invariants are powerful, because they are compressed descriptions of
+the system, they collapse away the time dimension, which is a huge simplification. Reasoning about
+each particular path the program could take is hard, because there are so many different paths.
+Reasoning about invariants is easy, because they capture properties shared by all execution paths.
+
The same idea applies when programming in the large. In the small, we looked at how the state of a
+running program evolves over time. In the large, we will look at how the source code of the program
+itself evolves, as it is being refactored and extended to support new features. Here are some
+systems invariants from the systems I’ve worked with:
+
Cargo:
+
File system paths entered by users are preserved exactly. If the user types
+cargo frob ../some/dir,
+Cargo doesn’t attempt to resolve ../some/dir to an absolute path and passes the path
+to the underlying OS as is. The reason for that is that file systems are very finicky. Although it
+might look as if two paths are equivalent, there are bound to be cases where they are not. If the
+user typed a particular form of a path, they believe that it’ll work, and any changes can mess
+things up easily.
+
This is a relatively compact invariant — basically, code is just forbidden from calling
+fs::canonicalize.
+
rust-analyzer:
+
Syntax trees are identity-less value types. That is, if you take an object representing an if
+expression, that object doesn’t have any knowledge of where in the larger program the if
+expression is. The thinking about this invariant was that it simplifies refactors — while in the
+static program it’s natural to talk about “if on the line X in file Y”, when you start modifying
+code, identity becomes much more fluid.
+
This is an invariant with far reaching consequences — that means that literally everything in
+rust-analyzer needs to track identities of things explicitly. You don’t just pass around syntax
+nodes, you pass nodes with extra breadcrumbs describing their origin. I think this might have been a
+mistake — while it does make refactoring APIs more principled, refactoring is not the common case!
+Most of the work of a language server consists of read-only analysis of existing code, and the
+actual refactor is just a cherry on top. So perhaps it’s better to try to bind identity more tightly
+into the core data structure, and just use fake identities for temporary trees that arise during
+refactors.
+
A more successful invariant from rust-analyzer is that the IDE has a full, frozen view of a snapshot
+of the world. There’s no API for inferring the types, rather, the API looks as if all the types are
+computed at all times. Similarly, there’s no explicit API for changing the code or talking about
+different historical versions of the code — the IDE sees a single “current” snapshot with all
+derived data computed. Underneath, there’s a smart system to secretly compute the information on
+demand and re-use previous results, but this is all hidden from the API.
+
This is a great, simple mental model, and it provides for a nice boundary between the compiler
+proper and IDE fluff like refactors and code completion. Long term, I’d love to see several
+implementations of the “compiler parts”.
+
TigerBeetle:
+
A lot of thoughtful invariants here! To touch only a few:
+
TigerBeetle doesn’t allocate memory after startup. This simple invariant affects every bit of code
+— whatever you do, you must manage with existing, pre-allocated data structures. You can’t just
+memcpy stuff around, there’s no ambient available space to memcpy to! As a consequence (and,
+historically, as a motivation for the design)
+everything
+has a specific numeric limit.
+
Another fun one is that transaction logic can’t read from disk. Every object which could be touched
+by a transaction needs to be explicitly prefetched into memory before transaction begins. Because
+disk IO happens separately from the execution, it is possible to parallelize IO for a whole batch of
+transactions. The actual transaction execution is then a very tight serial CPU loop without any
+locks.
+
Speaking of disk IO, in TigerBeetle “reading from disk” can’t fail. The central API for reading
+takes a data block address, a checksum, and invokes the callback with data with a matching checksum.
+Everything built on top doesn’t need to worry about error handling. The way this works internally is
+that reads that fail on a local disk are repaired through other replicas in the cluster. It’s just
+that the repair happens transparently to the caller. If the block of data of interest isn’t found on
+the set of reachable replicas, the cluster correctly gets stuck until it is found.
+
+
Summing up: invariants are helpful for describing systems that evolve over time. There’s a
+combinatorial explosion of trajectories that a system could take. Invariants compactly describe
+properties shared by an infinite amount of trajectories.
+
In the small, formulating invariants about program state helps to write correct code.
+
In the large, formulating invariants about the code itself helps to go from a small, simple system
+that works to a large system which is used in production.
+I am Alex Kladov, a programmer who loves simple code and programming languages.
+You can find me on GitHub.
+If you want to contact me, please write an e-mail (address is on the GitHub profile).
+
Code samples on this blog are dual licensed under MIT OR Apache-2.0.
I extolled the benefits of programming with invariants in a couple of recent posts.
+Naturally, I didn’t explain what I think when I write “invariant”. This post fixes that.
+
There are at least three different concepts I label with “invariant”:
+
+
+a general “math” mode of thinking, where you distinguish between fuzzy, imprecise thoughts and
+precise statements with logical meaning.
+
+
+a specific technique for writing correct code when programming in the small.
+
+
+when programming in the large, compact, viral, descriptive properties of the systems.
+
+
+
I wouldn’t discuss the first point here — I don’t know how to describe this better than “that
+thing that you do when you solve non-trivial math puzzler”. The bulk of the post describes the
+second bullet point, for which I think I have a perfect litmus test to explain exactly what I am
+thinking here. I also touch a bit on the last point in the end.
+
So let’s start with a litmus test program to show invariants in
+the small in action:
+
+
+
You might want to write one yourself before proceeding. Here’s an exhaustive
+test for this functionality,
+using exhaustigen crate:
+
+
+
Here’s how I would naively write this function. First, I start with defining the boundaries for the
+binary search:
+
+
+
Then, repeatedly cut the interval in half until it vanishes
+
+
+
and recur into the left or the right half accordingly:
+
+
+
Altogether:
+
+
+
I love this code! It has so many details right!
+
+
+The insertion_point interface compactly compresses usually messy result of a binary search to
+just one index.
+
+
+xs / x pair of names for the sequence and its element crisply describes abstract algorithm on
+sequencies.
+
+
+Similarly, lo / hi name pair is symmetric, expressing the relation between the two indexes.
+
+
+Half-open intervals are used for indexing.
+
+
+There are no special casing anywhere, the natural lo < hi condition handles empty slice.
+
+
+We even dodge Java’s binary search bug by computing midpoint without overflow.
+
+
+
There’s only one problem with this code — it doesn’t work. Just blindly following rules-of-thumb
+gives you working code surprisingly often, but this particular algorithm is an exception.
+
The question is, how do we fix this overwise great code? And here’s where thinking invariants helps.
+Before I internalized invariants, my approach would be to find a failing example, and to fumble with
+some plus or minus ones here and there and other special casing to make it work. That is, find a
+concrete problem, solve it. This works, but is slow, and doesn’t allow discovering the problem
+before running the code.
+
The alternative is to actually make an effort and spell out, explicitly, what the code is supposed
+to do. In this case, we want lo and hi to bound the result. That is,
+lo <= insertion_point <= hi
+should hold on every iteration. It clearly holds before we enter the loop. On each iteration, we
+would like to shorten this interval, cutting away the part that definitely does not contain
+insertion point.
+
Elaborating the invariant, all elements to the left of lo should be less than the target.
+Conversely, all elements to the right of hi should be at least as large as the target.
+
+
+
Let’s now take a second look at the branching condition:
+
+
+
It matches neither invariant prong exactly: x is on the left, but inequality is strict. We can
+rearrange the code to follow the invariant more closely:
+
+
+
+
+we flip the condition and if-branches, so that xs[mid] < x matches xs[i] < x from the
+invariant for lo
+
+
+to make the invariant tight, we add mid + 1 (if xs[mid] is less than x, we know that the
+insertion point is at least mid + 1)
+
+
+
The code now works. So what went wrong with the original version with x < xs[mid]? In the else
+case, when x >= xs[mid] we set lo = mid, but that’s wrong! It might be the case that x ==
+xs[mid] and x == xs[mid - 1], which would break the invariant for lo.
+
The point isn’t in this particular invariant or this particular algorithm. It’s the general
+pattern that it’s easy to write the code which implements the right algorithm, and sort-of works,
+but is wrong in details. To get the details right for the right reason, you need to understand
+precisely what the result should be, and formulating this as a (loop or recursion) invariant
+helps.
+
+
Perhaps it’s time to answer the title question: invariant is some property which holds at all times
+during dynamic evolution of the system. In the above example, the evolution is the program
+progressing through subsequent loop iterations. The invariant, the condition binding lo and hi,
+holds on every iteration. Invariants are powerful, because they are compressed descriptions of
+the system, they collapse away the time dimension, which is a huge simplification. Reasoning about
+each particular path the program could take is hard, because there are so many different paths.
+Reasoning about invariants is easy, because they capture properties shared by all execution paths.
+
The same idea applies when programming in the large. In the small, we looked at how the state of a
+running program evolves over time. In the large, we will look at how the source code of the program
+itself evolves, as it is being refactored and extended to support new features. Here are some
+systems invariants from the systems I’ve worked with:
+
Cargo:
+
File system paths entered by users are preserved exactly. If the user types
+cargo frob ../some/dir,
+Cargo doesn’t attempt to resolve ../some/dir to an absolute path and passes the path
+to the underlying OS as is. The reason for that is that file systems are very finicky. Although it
+might look as if two paths are equivalent, there are bound to be cases where they are not. If the
+user typed a particular form of a path, they believe that it’ll work, and any changes can mess
+things up easily.
+
This is a relatively compact invariant — basically, code is just forbidden from calling
+fs::canonicalize.
+
rust-analyzer:
+
Syntax trees are identity-less value types. That is, if you take an object representing an if
+expression, that object doesn’t have any knowledge of where in the larger program the if
+expression is. The thinking about this invariant was that it simplifies refactors — while in the
+static program it’s natural to talk about “if on the line X in file Y”, when you start modifying
+code, identity becomes much more fluid.
+
This is an invariant with far reaching consequences — that means that literally everything in
+rust-analyzer needs to track identities of things explicitly. You don’t just pass around syntax
+nodes, you pass nodes with extra breadcrumbs describing their origin. I think this might have been a
+mistake — while it does make refactoring APIs more principled, refactoring is not the common case!
+Most of the work of a language server consists of read-only analysis of existing code, and the
+actual refactor is just a cherry on top. So perhaps it’s better to try to bind identity mode tightly
+into the core data structure, and just use fake identities for temporary trees that arise during
+refactors.
+
A more successful invariant from rust-analyzer is that the IDE has a full, frozen view of a snapshot
+of the world. There’s no API for inferring the types, rather, the API looks as if all the types are
+computed at all times. Similarly, there’s no explicit API for changing the code or talking about
+different historical versions of the code — the IDE sees a single “current” snapshot with all
+derived data computed. Underneath, there’s a smart system to secretly compute the information on
+demand and re-use previous results, but this is all hidden from the API.
+
This is a great, simple mental model, and it provides for a nice boundary between the compiler
+proper and IDE fluff like refactors and code completion. Long term, I’d love to see several
+implementations of the “compiler parts”.
+
TigerBeetle:
+
A lot of thoughtful invariants here! To touch only a few:
+
TigerBeetle doesn’t allocate memory after startup. This simple invariant affects every bit of code
+— whatever you do, you must manage with existing, pre-allocated data structures. You can’t just
+memcpy stuff around, there’s no ambient available space to memcpy to! As a consequence (and,
+historically, as a motivation for the design)
+everything
+has a specific numeric limit.
+
Another fun one is that transaction logic can’t read from disk. Every object which could be touched
+by a transaction needs to be explicitly prefetched into memory before transaction begins. Because
+disk IO happens separately from the execution, it is possible to parallelize IO for a whole batch of
+transactions. The actual transaction execution is then a very tight serial CPU loop without any
+locks.
+
Speaking of disk IO, in TigerBeetle “reading from disk” can’t fail. The central API for reading
+takes a data block address, a checksum, and invokes the callback with data with a matching checksum.
+Everything built on top doesn’t need to worry about error handling. The way this works internally is
+that reads that fail on a local disk are repaired through other replicas in the cluster. It’s just
+that the repair happens transparently to the caller. If the block of data of interest isn’t found on
+the set of reachable replicas, the cluster correctly gets stuck until it is found.
+
+
Summing up: invariants are helpful for describing systems that evolve over time. There’s a
+combinatorial explosion of trajectories that a system could take. Invariants compactly describe
+properties shared by an infinite amount of trajectories.
+
In the small, formulating invariants about program state helps to wire correct code.
+
In the large, formulating invariants about the code itself helps to go from a small, simple system
+that works to a large system which is used in production.
That’s too damn many of them! Some time ago I’ve noticed that my code involving comparisons is often
+hard to understand, and hides bugs. I’ve figured some rules of thumb to reduce complexity, which I
+want to share.
+
The core idea is to canonicalize things. Both x < y and y > x mean the same, and, if you use
+them with roughly equal frequency, you need to spend extra mental capacity to fold the two versions
+into the single “x tiny, y HUGE” concept in your head.
+
The number line is a great intuition and visualization
+for comparisons. If you order things from small to big,
+A B C D,
+you get intuitive concept of ordering without using comparison operators. You also plug into your
+existing intuition that the sort function arranges arrays in the ascending order.
+
So, as a first order rule-of-thumb:
+Strongly prefer < and <= over > and >=
+And, when using comparisons, use number line intuition.
+
Some snippets:
+
Checking if a point is inside the interval:
+
+
+
Checking if a point is outside of the interval:
+
+
+
Segment a is inside segment b:
+
+
+
Segments a and b are disjoint (either a is to the left of b or a is to the right of b):
+
+
+
A particular common case for ordered comparisons is checking that an index is in bounds for an
+array. Here, the rule about number line works together with another important rule: State
+invariants positively
+
The indexing invariant is spelled as index < xs.len(),
+
and you should prefer to see it exactly that way in the source code. Concretely, a guard that
+spells the converse of the invariant is hard to get right, because it involves an extra mental
+negation (this is subtle — although there isn’t a literal negation operator, you absolutely do
+think about this as a negation of the invariant). If possible, the code should be reshaped to
+state the invariant positively, as in the sketch below.
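Both shapes, as a TypeScript sketch (xs and index are hypothetical):

```ts
function useElement(xs: number[], index: number): void {
  // Converse of the invariant: even written with <=, this guard reads as a
  // negation of `index < xs.length`, and negations are where bugs hide.
  if (xs.length <= index) {
    console.log("out of bounds");
    return;
  }
  console.log(xs[index]);
}

function useElementReshaped(xs: number[], index: number): void {
  // Invariant stated positively: the guard is literally `index < xs.length`.
  if (index < xs.length) {
    console.log(xs[index]);
  } else {
    console.log("out of bounds");
  }
}
```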
+
+
+TypeScript is Surprisingly OK for Compilers
+2023-08-17
+https://matklad.github.io/2023/08/17/typescript-is-surprisingly-ok-for-compilers
+
+
There are two main historical trends when choosing an implementation language for something
+compiler-shaped.
+
For more language-centric tasks, like a formal specification, or a toy hobby language, OCaml makes
+most sense. See, for example, plzoo or WebAssembly reference
+interpreter.
+
For something implementation-centric and production ready, C++ is often chosen: LLVM, clang, v8,
+HotSpot are all C++.
+
These days, Rust is a great new addition to the landscape. It is influenced most directly by ML and
+C++, combines their strengths, and even brings something new of its own to the table, like seamless,
+safe multithreading. Still, Rust leans heavily towards production readiness side of the spectrum.
+While some aspects of it, like a “just works” build system, help with prototyping as well, there’s
+still extra complexity tax due to the necessity to model physical layout of data. The usual advice,
+when you start building a compiler in Rust, is to avoid pointers and use indexes. Indexes are great!
+In large codebase, they allow greater decoupling (side tables can stay local to relevant modules),
+improved performance (an index is u32 and nudges you towards struct-of-arrays layouts), and more
+flexible computation strategies (indexes are easier to serialize or plug into incremental
+compilation framework). But they do make programming-in-the-small significantly more annoying, which
+is a deal-breaker for hobbyist tinkering.
+
But OCaml is crufty! Is there something better? Today, I realized that TypeScript might actually be
+OK? It is not really surprising, given how the language works, but it never occurred to me to think
+about TypeScript as an ML equivalent before.
+
So, let’s write a tiny-tiny typechecker in TS!
+
Of course, we start with deno. See A Love Letter to
+Deno for more details, but the
+TL;DR is that deno provides out-of-the-box experience for TypeScript. This is a pain point for
+OCaml, and something that Rust does better than either OCaml or C++. But deno does this better than
+Rust! It’s just a single binary, it comes with linting and formatting, there’s no compilation step,
+and there’s a built-in task runner and a watch mode. A dream setup for quick PLT hacks!
+
And then there’s TypeScript itself, with its sufficiently flexible, yet light-ceremony type system.
+
Let’s start with defining an AST. As we are hacking, we won’t bother with making it an IDE-friendly
+concrete syntax tree, or incremental-friendly “only store relative offsets” tree, and will just tag
+AST nodes with locations in file:
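Something like the following could serve as the location type (a sketch; the exact fields are an assumption):

```ts
// Every AST node remembers the source span it was parsed from.
interface Location {
  file: string;
  start: number; // offset of the first character
  end: number;   // offset just past the last character
}
```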
+
+
+
Even here, we already see the high-level nature of TypeScript — string is just a string, there’s no
+thinking about usize vs u32, as numbers are just numbers.
+
Usually, an expression is defined as a sum-type. As we want to tag each expression with a location,
+that representation would be slightly inconvenient for us, so we split things up a bit:
+
+
+
One more thing — as we are going for something quick, we’ll be storing inferred types directly in
+the AST nodes. Still, we want to keep raw and type-checked AST separate, so what we are going to do
+here is to parametrize the Expr over associated data it stores. A freshly parsed expression would
+use void as data, and the type checker will set it to Type. Here’s what we get:
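A possible shape (a sketch; the variant names match the kinds defined in the next few snippets):

```ts
// The location and the associated data live on a wrapper object, while the
// actual sum type lives in `kind`. A freshly parsed expression is an
// Expr<void>; the type checker produces an Expr<Type>.
interface Expr<T> {
  location: Location;
  data: T;
  kind: ExprKind<T>;
}

type ExprKind<T> = ExprBool | ExprInt | ExprBinary<T> | ExprIf<T>;
```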
+
+
+
A definition of ExprBinary could look like this:
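For example (a sketch; the operator set is an arbitrary choice):

```ts
interface ExprBinary<T> {
  op: "+" | "-" | "*" | "==" | "<";
  lhs: Expr<T>;
  rhs: Expr<T>;
}
```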
+
+
+
Note how I don’t introduce separate types for, e.g., AddExpr and SubExpr — all binary
+expressions have the same shape, so one type is enough!
+
But we need a tiny adjustment here. Our Expr kind is defined as a union type. To match a value of
+a union type, a bit of runtime type information is needed. However, it’s one of the core properties
+of TypeScript that it doesn’t add any runtime behaviors. So, if we want to match on expression kinds
+(and we for sure want!), we need to give a helping hand to the compiler and include a bit of RTTI
+manually. That would be the tag field:
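The adjusted sketch:

```ts
interface ExprBinary<T> {
  tag: "binary"; // the only runtime value this field can hold
  op: "+" | "-" | "*" | "==" | "<";
  lhs: Expr<T>;
  rhs: Expr<T>;
}
```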
+
+
+
tag: "binary" means that the only possible runtime value for tag is the string "binary".
+
Similarly to various binary expressions, boolean literal and int literal expressions have almost
+identical shape. Almost, because the payload (boolean or number) is different. TypeScript
+allows us to neatly abstract this over:
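One way to do it (a sketch):

```ts
// One generic shape covers both literal kinds; only the tag and the payload differ.
interface ExprLiteral<Tag, Payload> {
  tag: Tag;
  value: Payload;
}

type ExprBool = ExprLiteral<"bool", boolean>;
type ExprInt = ExprLiteral<"int", number>;
```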
+
+
+
Finally, for control-flow expressions we only add if for now:
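A sketch of the if node, following the same tagging convention (field names are assumptions):

```ts
interface ExprIf<T> {
  tag: "if";
  cond: Expr<T>;
  then_branch: Expr<T>;
  else_branch: Expr<T>;
}
```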
+
+
+
This concludes the definition of the ast! Let’s move on to the type inference! Start with types:
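A sketch of how the singletons could be set up (the tag field is an assumption; the union will later grow an error variant):

```ts
// TypeScript keeps types and values in separate namespaces, so the canonical
// singleton value can share the name of its type.
interface TypeInt  { tag: "Int"; }
interface TypeBool { tag: "Bool"; }
type Type = TypeInt | TypeBool;

const TypeInt: TypeInt = { tag: "Int" };
const TypeBool: TypeBool = { tag: "Bool" };
```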
+
+
+
Our types are really simple, we could have gone with type Type = "Int" | "Bool", but
+let’s do this a bit more enterprisey! We define separate types for integer and boolean types. As these
+types are singletons, we also provide canonical definitions. And here is another TypeScript-ism.
+Because TypeScript fully erases types, everything related to types lives in a separate namespace. So
+you can have a type and a value sharing the same name. Which is exactly what we use to define the
+singletons!
+
Finally, we can take advantage of our associated-data parametrized expression and write the
+signature of infer_types.
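As a sketch, using the definitions above:

```ts
declare function infer_types(expr: Expr<void>): Expr<Type>;
```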
+
+
+
As it says on the tin, infer_types fills in Type information into the void! Let’s fill in the
+details!
+
+
+
If at this point we hit Enter, the editor completes:
+
+
+
There’s one problem though. What we really want to write here is something like
+const inferred_type = switch(..),
+but in TypeScript switch is a statement, not an expression.
+So let’s define a generic visitor!
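A possible visitor (a sketch; the callback names mirror the tags, and the extra Location argument will come in handy for error reporting later):

```ts
// One callback per expression kind; T is the associated data of the visited
// expression, R is the result of the visit.
interface Visitor<T, R> {
  bool(kind: ExprBool, location: Location): R;
  int(kind: ExprInt, location: Location): R;
  binary(kind: ExprBinary<T>, location: Location): R;
  if(kind: ExprIf<T>, location: Location): R;
}

// visit is the expression-flavoured switch: it dispatches on the tag and
// returns whatever the matching callback returns.
function visit<T, R>(expr: Expr<T>, v: Visitor<T, R>): R {
  const kind = expr.kind;
  switch (kind.tag) {
    case "bool": return v.bool(kind, expr.location);
    case "int": return v.int(kind, expr.location);
    case "binary": return v.binary(kind, expr.location);
    case "if": return v.if(kind, expr.location);
  }
}
```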
+
+
+
Armed with the visit, we can ergonomically match over the expression:
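For example, here is a toy helper (hypothetical, purely for illustration) that uses visit to compute the depth of an expression:

```ts
function depth<T>(expr: Expr<T>): number {
  return visit<T, number>(expr, {
    bool: () => 1,
    int: () => 1,
    binary: (kind) => 1 + Math.max(depth(kind.lhs), depth(kind.rhs)),
    if: (kind) =>
      1 + Math.max(depth(kind.cond), depth(kind.then_branch), depth(kind.else_branch)),
  });
}
```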
+
+
+
Before we go further, let’s generalize this visiting pattern a bit! Recall that our expressions are
+parametrized by the type of associated data, and a type-checker-shaped transformation is essentially
+an Expr<U> -> Expr<V> transformation.
+
Let’s make this generic!
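The signature could look like this (a sketch):

```ts
declare function transform<U, V>(expr: Expr<U>, f: Visitor<V, V>): Expr<V>;
```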
+
+
+
Transform maps an expression carrying U into an expression carrying V by applying an f
+visitor. Importantly, it takes a Visitor<V, V>, rather than a Visitor<U, V>. This is
+counter-intuitive, but correct — we run the transformation bottom up, transforming the leaves first.
+So, when the time comes to visit an interior node, all subexpressions will have been transformed!
+
The body of transform is wordy, but regular, rectangular, and auto-completes itself:
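Here is one way the body could go (a sketch under the definitions above, not necessarily the original code):

```ts
function transform<U, V>(expr: Expr<U>, f: Visitor<V, V>): Expr<V> {
  const location = expr.location;
  const kind = expr.kind;
  switch (kind.tag) {
    case "bool":
      // Literals don't mention the data parameter: the very same kind value
      // is both an ExprKind<U> and an ExprKind<V>.
      return { location, kind, data: f.bool(kind, location) };
    case "int":
      return { location, kind, data: f.int(kind, location) };
    case "binary": {
      // Bottom up: subexpressions are transformed first, so the visitor
      // receives an ExprBinary<V>; this is the Visitor<V, V> magic.
      const new_kind: ExprBinary<V> = {
        tag: "binary",
        op: kind.op,
        lhs: transform(kind.lhs, f),
        rhs: transform(kind.rhs, f),
      };
      return { location, kind: new_kind, data: f.binary(new_kind, location) };
    }
    case "if": {
      const new_kind: ExprIf<V> = {
        tag: "if",
        cond: transform(kind.cond, f),
        then_branch: transform(kind.then_branch, f),
        else_branch: transform(kind.else_branch, f),
      };
      return { location, kind: new_kind, data: f.if(new_kind, location) };
    }
  }
}
```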
+
+
+
+
+
Note how here expr.kind is both an ExprKind<U> and an ExprKind<V> — literals don’t depend on this type
+parameter, and TypeScript is smart enough to figure this out without us manually re-assembling
+the same value with a different type.
+
+
+
This is where that magic with Visitor<V, V> happens.
+
+
+
The code is pretty regular here though! So at this point we might actually recall that TypeScript is
+a dynamically-typed language, and write a generic traversal using Object.keys, while keeping the
+static function signature in-place. I don’t think we need to do it here, but there’s comfort in
+knowing that it’s possible!
+
Now implementing type inference should be a breeze! We need some way to emit type errors though.
+With TypeScript, it would be trivial to accumulate errors into an array as a side-effect, but let’s
+actually represent type errors as instances of a specific type, TypeError (pun intended):
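One possible shape (a sketch): make the error just another Type variant, deliberately shadowing the global TypeError name, plus a small constructor helper:

```ts
interface TypeError {
  tag: "Error";
  location: Location;
  message: string;
}

// The Type union from before, extended with the error variant.
type Type = TypeInt | TypeBool | TypeError;

function type_error(location: Location, message: string): TypeError {
  return { tag: "Error", location, message };
}
```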
+
+
+
To check ifs and binary expressions, we would also need a utility for comparing types:
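Something along these lines (a sketch):

```ts
function type_equal(lhs: Type, rhs: Type): boolean {
  // An error compares equal to anything, so a single mistake doesn't cascade
  // into a pile of follow-up diagnostics.
  if (lhs.tag === "Error" || rhs.tag === "Error") return true;
  return lhs.tag === rhs.tag;
}
```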
+
+
+
We make the Error type equal to any other type to prevent cascading failures. With all that
+machinery in place, our type checker is finally:
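A sketch of how it could look with the pieces defined above (the operator typing rules and error messages are my assumptions):

```ts
function infer_types(expr: Expr<void>): Expr<Type> {
  return transform<void, Type>(expr, {
    bool: () => TypeBool,
    int: () => TypeInt,
    binary: (kind, location) => {
      // Subexpressions already carry their inferred types in .data.
      if (!type_equal(kind.lhs.data, TypeInt) || !type_equal(kind.rhs.data, TypeInt)) {
        return type_error(location, "binary operator expects Int operands");
      }
      return kind.op === "==" || kind.op === "<" ? TypeBool : TypeInt;
    },
    if: (kind, location) => {
      if (!type_equal(kind.cond.data, TypeBool)) {
        return type_error(location, "if condition must be Bool");
      }
      if (!type_equal(kind.then_branch.data, kind.else_branch.data)) {
        return type_error(location, "if branches must have the same type");
      }
      return kind.then_branch.data;
    },
  });
}
```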
+
+
+
Astute reader will notice that our visitor functions now take an extra ast.Location argument.
+TypeScript allows each callback to mention this argument only in the cases where it is actually needed, cutting down verbosity.
+
And that’s all for today! The end result is pretty neat and concise. It took some typing to get there,
+but TypeScript autocompletion really helps with that! What’s more important, there was very little
+fighting with the language, and the result feels quite natural and directly corresponds to the shape
+of the problem.
+
I am not entirely sure about the conclusion just yet, but I think I’ll be using TypeScript as my tool
+of choice for various small language hacks. It is surprisingly productive due to the confluence of
+three aspects:
+
+
+deno is a perfect scripting runtime! Small, hermetic, powerful, and optimized for effective
+development workflows.
+
+
+TypeScript tooling is great — the IDE is helpful and productive (and deno makes sure that it
+also requires zero configuration)
+
+
+The language is powerful both at runtime and at compile time. You can get pretty fancy with types,
+but you can also just escape to dynamic world if you need some very high-order code.
+
+
+
+
Just kidding, here’s one more cute thing. Let’s say that we want to have lots of syntactic sugar,
+and also want type-safe desugaring. We could tweak our setup a bit for that: instead of Expr and
+ExprKind being parametrized over associated data, we circularly parametrize Expr by the whole
+ExprKind and vice versa:
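A possible encoding (my names, very much a sketch):

```ts
// Expr is generic over the kind it stores, and each kind is generic over the
// expression type of its children; tying the knot differently yields related
// languages that share node definitions.
interface GExpr<K> {
  location: Location;
  kind: K;
}

interface KBool { tag: "bool"; value: boolean; }
interface KIf<E> { tag: "if"; cond: E; then_branch: E; else_branch: E; }
interface KUnless<E> { tag: "unless"; cond: E; body: E; }

// The core language knows only `if`...
type CoreKind = KBool | KIf<CoreExpr>;
interface CoreExpr extends GExpr<CoreKind> {}

// ...while the sugared language also has `unless`, whose desugaring into `if`
// can then be checked by the types: a desugar function is SugarExpr => CoreExpr.
type SugarKind = KBool | KIf<SugarExpr> | KUnless<SugarExpr>;
interface SugarExpr extends GExpr<SugarKind> {}
```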
+
+
+
This allows expressing desugaring in a type-safe manner!
+
+
+Role Of Algorithms
+2023-08-13
+https://matklad.github.io/2023/08/13/role-of-algorithms
+
+
This is a lobste.rs comment turned into an article, so expect even more abysmal editing than usual.
“Algorithms” are a useful skill not because you use them at work every day, but because they train you
+to be better at particular aspects of software engineering.
+
Specifically:
+
First, algorithms drill the skill of bug-free coding. Algorithms are hard and frustrating! A subtle
+off-by-one might not matter for simple tests, but it breaks corner cases. But if you practice
+algorithms, you get better at this particular skill of writing correct small programs, and I think
+this probably generalizes.
+
To give an array of analogies:
+
+
+
People do cardio or strength exercises not because they need to lift heavy weights in real life.
+Quite the opposite — there’s too little physical exertion in our usual lives, so we need extra
+exercises for our bodies to gain generalized health (which is helpful in day-to-day life).
+
+
+
You don’t practice a complex skill by mere repetition. You first break it down into atomic, trainable
+sub-skills, and drill each sub-skill separately in unrealistic conditions. Writing correct
+algorithm-y code is a sub-skill of software engineering.
+
+
+
When you optimize a system, you don’t just repeatedly run an end-to-end test until things go fast. You
+first identify the problematic area, then write a targeted micro-benchmark to isolate this
+particular effect, and then you optimize that using a much shorter feedback loop.
+
+
+
I still remember two specific lessons I learned when I started doing algorithms many years ago:
+
+
Debugging complex code is hard, first simplify, then debug
+
+
Originally, when I was getting a failed test, I sort of tried to add more code to my program to
+make it pass. At some point I realized that this is going nowhere, and then I changed my workflow
+to first try to remove as much code as I can, and only then investigate the problematic test
+case (which with time morphed into a skill of not writing more code than necessary in the first
+place).
+
+
Single source of truth is good
+
+
A lot of my early bugs were due to me duplicating the same piece of information in two places and
+then getting the copies out of sync. Internalizing the single-source-of-truth principle fixed those issues.
+
+
+
Meta note: if you already know this, my lessons are useless. If you don’t yet know them, they are
+still useless and most likely will bounce off you. This is tacit knowledge — it’s very hard to
+convey it verbally, it is much more efficient to learn these things yourself by doing.
+
Somewhat related, I noticed a surprising correlation between programming skills in the small, and
+programming skills in the large. You can solve a problem in five lines of code, or, if you try hard,
+in ten lines of code. If you consistently come up with concise solutions in the small, chances are
+large scale design will be simple as well.
+
I don’t know how true that is, as I never tried to look at a proper study, but it looks very
+plausible from what I’ve seen. If this is true, the next interesting question is: “if you train
+programming-in-the-small skills, do they transfer to programming in the large?”. Again, I don’t
+know, but I’d take this Pascal’s wager.
+
Second, algorithms teach about properties and invariants. Some lucky people get those skills from
+a hard math background, but algorithms are a much more accessible way to learn them, as everything
+is very visual, immediately testable, and has very short and clear feedback loop.
+
And properties and invariants are what underlie most big and successful systems. Like 90% of the
+code is just fluff and glue, and if you have the skill to see the 10% that is architecturally
+salient properties, you can comprehend the system much faster.
+
Third, algorithms occasionally are useful at the job! Just last week on our design walk&talk we
+were brainstorming one particular problem, and I was like
+
+
+
We probably won’t go with that solution as that’s too complex algorithmically for what ultimately is
+a corner case, but it’s important that we understand problem space in detail before we pick a
+solution.
+
Note also how algorithms vocabulary helps me to think about the problem. In math (including
+algorithms), there’s just like a handful of ideas which are applied again and again under different
+guises. You need some amount of insight of course, but, for most simple problems, what you actually
+need is just an ability to recognize the structure you’ve seen somewhere already.
+
Fourth, connecting to the previous ones, the ideas really do form an interconnected web which, on a
+deep level, underpins a whole lot of stuff. So, if you do have a non-zero amount of pure curiosity
+when it comes to learning programming, algorithms cut pretty deep to the foundation. Let me repeat
+the list from the last post, but with explicit connections to other things:
+
+
linear search
+
+
assoc lists in most old functional languages work that way
+
+
binary search
+
+
It is literally everywhere. Also, binary search got a cute name, but it actually isn’t the
+primitive operation. The primitive operation is partition_point, a predicate version of binary
+search. This is what you should add to your language’s stdlib as a primitive, and express everything
+else in terms of it. Also, it is one of the few cases where we know a lower bound of complexity. If
+an algorithm does k binary comparisons, it can give at most 2^k distinct answers. So, to find the
+insertion point among n items, you need at least k questions such that 2^k > n.
+
+
quadratic sorting
+
+
We use it at work! Some collections are statically bound by a small constant, and quadratically
+sorting them just needs less machine code. We are also a bit paranoid that production sort
+algorithms are very complex and might have subtle bugs, especially in newer languages.
+
+
merge sort
+
+
This is how you sort things on disk. This is also how LSM-trees, the most practically important
+data structure you haven’t learned about in school, work! And k-way merge is also occasionally
+useful (this came up at work three weeks ago).
+
+
heap sort
+
+
Well, this one is only actually useful for the heap, but I think maybe the kernel uses it when
+it needs to sort something in place, without extra memory, and in guaranteed O(N log N)?
+
+
binary heap
+
+
Binary heaps are everywhere! Notably, simple timers are a binary heap of things in the order of
+expiration. This is also a part of Dijkstra and k-way-merge.
+
+
growable array
+
+
That’s the most widely used collection of them all! Did you know that growth factor 2 has the
+problem that the size after n reallocations is larger than the sum total of all previous sizes,
+so the allocator can’t re-use the space? Anecdotally, growth factors less than two are preferable
+for this reason.
+
+
binary search tree
+
+
Again, rust-analyzer green trees are binary search trees using offset as an implicit key.
+Monoid trees are also binary search trees.
+
+
AVL tree
+
+
Ok, this one I actually don’t know a direct application of! But I remember two
+programming-in-the-small lessons AVL could have taught me, but didn’t. I struggled a lot
+implementing all of “small left rotation”, “small right rotation”, “big left rotation”, “big right
+rotation”. Some years later, I’ve learned that you don’t do
+
+
+
as that forces code duplication. Rather, you do children: [Tree; 2] and then you could
+use child_index and child_index ^ 1 to abstract over left-right.
+
And then some years later still I read in wikipedia that big rotations are actually a composition
+of two small rotations.
+
Actually, I’ve lied that I don’t know connections here. You use the same rotations for the splay
+tree.
+
+
Red Black Tree
+
+
red-black tree is a 2-3 tree is a B-tree. Also, you probably use jemalloc, and it has a red-black
+tree implemented as a C
+macro.
Left-leaning red-black trees are an interesting variation, which is claimed to be simpler, but is
+also claimed to not actually be simpler, because it is not symmetric and neuters the children
+trick.
+
+
B-tree
+
+
If you use Rust, you probably use a B-tree. Also, if you use a database, it stores data either in
+an LSM or in a B-tree. Both of these are because B-trees play nice with the memory hierarchy.
+
+
hash table
+
+
Literally everywhere; both chaining and open-addressing versions are widely used.
+
+
Depth First Search
+
+
This is something I have to code, explicitly or implicitly, fairly often. Every time you have a
+DAG, where things depend on other things, you’ll have a DFS somewhere. In rust-analyzer,
+there are at least a couple — one in the borrow checker for something (I have no idea what it does,
+I just grepped for fn dfs) and one in the crate graph to detect cycles.
+
+
Breadth First Search
+
+
Ditto, any kind of exploration problem is usually solved with bfs. E.g., rust-analyzer uses bfs
+for directory traversal.
+
Which is better, bfs or dfs? Why not both?! Take a look at bdfs from rust-analyzer.
+
+
Topological Sort
+
+
Again, this comes up every time you deal with things which depend on each other. rust-analyzer has
+crates_in_topological_order.
+
+
Strongly Connected Components
+
+
This is needed every time things depend on each other, but you also allow cyclic dependencies. I
+don’t think I’ve needed this one in real life. But, given that SCC is how you solve 2-SAT in
+polynomial time, it seems important to know in order to understand the 3 in 3-SAT.
+
+
Minimal Spanning Tree
+
+
Ok, really drawing a blank here! Connects to sorting, disjoint set union (which is needed for
+unification in type-checkers), and binary heap. Seems like a practically important algorithm though! Ah,
+MST also gives an approximation for the planar traveling salesman I think, another border between hard
+& easy problems.
+
+
Dijkstra
+
+
Dijkstra is what I think about when I imagine a Platonic algorithm, though
+I don’t think I’ve used it in practice? Connects to heap.
+
Do you know why we use i, j, k for loop indices? Because D ijk stra!
+
+
Floyd-Warshall
+
+
This one is cool! Everybody knows why any regular expression can be compiled to an equivalent
+finite state machine. Few people know the reverse, why each automaton has an equivalent regex
+(many people know this fact, but few understand why). Well, because Floyd-Warshall! To convert an
+automaton to regex use the same algorithm you use to find pairwise distances in a graph.
+
Also, this is a final boss of dynamic programming. If you understand why this algorithm works, you
+understand dynamic programming. Despite being tricky to understand, it’s very easy to implement! I
+randomly stumbled into Floyd-Warshall, when I tried to implement a different, wrong approach, and
+made a bug which turned my broken algo into a correct Floyd-Warshall.
+
+
Bellman-Ford
+
+
Again, not many practical applications here, but the theory is well connected. All shortest-path
+algorithms are actually fixed-point iterations! But with Bellman-Ford and its explicit edge
+relaxation operator that’s most obvious. Next time you open static analysis textbook and learn
+about fixed point iteration, map that onto the problem of finding shortest paths!
+
+
Quadratic Substring Search
+
+
This is what your language’s standard library does.
+
+
Rabin-Karp
+
+
An excellent application of hashes. The same idea, hash(composite) =
+combine(hash(component)*), is used in rust-analyzer to intern syntax
+trees.
+
+
Boyer-Moore
+
+
This is a beautiful and practical algorithm which probably handles the bulk of real-world searches
+(that is, it’s probably the hottest bit of ripgrep as used by an average person). Delightfully,
+this algorithm is faster than theoretically possible — it doesn’t even look at every byte of
+input data!
+
+
Knuth-Morris-Pratt
+
+
Another “this is how you do string search in the real world” algorithm. It also is the platonic
+ideal of a finite state machine, and almost everything is an FSM. It also is Aho-Corasick.
+
+
Aho-Corasick
+
+
This is the same as Knuth-Morris-Pratt, but also teaches you about tries. Again, super-useful for
+string searches. As it is an FSM, and a regex is an FSM, and there’s a general construct for
+building a product of two FSMs, you can use it to implement fuzzy search. “Workspace symbol”
+feature in rust-analyzer works like this. Here’s a part
+of implementation.
+
+
Edit Distance
+
+
Everywhere in Bioinformatics (not the actual edit distance, but this problem shape). The first
+post on this blog is about this problem:
+It’s not about algorithms though, it’s about CPU-level parallelism.
+
+
+Types and the Zig Programming Language
+2023-08-09
+https://matklad.github.io/2023/08/09/types-and-zig
+
+
Notes on less-than-obvious aspects of Zig’s type system and things that surprised me after diving
+deeper into the language.
Zig has a nominal type system despite the fact that types lack names. A struct type is declared by
+struct { field: T }.
+It’s anonymous; an explicit assignment is required to name the type:
+
+
+
Still, the type system is nominal, not structural. The following does not compile:
+
+
+
The following does:
+
+
+
One place where Zig is structural is anonymous struct literals:
+
+
+
Types of x and y are different, but x can be coerced to y.
+
In other words, Zig structs are anonymous and nominal, but anonymous structs are structural!
Simple type inference for an expression works by first recursively inferring the types of
+subexpressions, and then deriving the result type from that. So, to infer types in
+foo().bar(), we first derive the type of foo(), then lookup method bar on that
+type, and use the return type of the method.
+
More complex type inference works through the so-called unification algorithm. It starts with a similar
+recursive walk over the expression tree, but this walk doesn’t infer types directly; rather, it
+assigns a type variable to each subexpression and generates equations relating the type variables. The
+result of this first phase looks like this:
+
+
+
Then, in the second phase the equations are solved, yielding, in this case, x = Int and y = Int.
+
Usually languages with powerful type systems have unification somewhere, though often unification
+is limited in scope (for example, Kotlin infers types statement-at-a-time).
+
It is curious that Zig doesn’t do unification: type inference is a simple single-pass recursion (or
+at least it should be; I haven’t looked at how it is actually implemented). So, anytime there’s a
+generic function like
+fn reverse(comptime T: type, xs: []T) void,
+the call site has to pass the type in explicitly:
+
+
+
Does it mean that you have to pass the types all the time? Not really! In fact, the only place which
+feels like a burden is the functions in the std.mem module which operate on slices, but that’s just
+because slices are builtin types (a kind of pointer really) without methods. The thing is, when you
+call a method on a “generic type”, its type parameters are implicitly in scope, and don’t have to be
+specified. Study this example:
+
+
+
There’s a runtime parallel here. At runtime, there is single dynamic dispatch, which prioritizes the
+dynamic type of the first argument, and multiple dynamic dispatch, which can look at the dynamic types
+of all arguments. Here, at compile time, the type of the first argument gets a preferential
+treatment. And, similarly to runtime, this covers 80% of use cases! Though, I’d love for things like
+std.mem.eql to be actual methods on slices…
One of the best tricks a language server can pull off for as-you-type analysis is skipping bodies of
+the functions in dependencies. This works as long as the language requires complete signatures. In
+functional languages, it’s customary to make signatures optional, which precludes this crucial
+optimization. As per Modularity Of Lexical
+Analysis, this has
+repercussions for all of:
+
+
+incremental compilation,
+
+
+parallel compilation,
+
+
+robustness to errors.
+
+
+
I always assumed that Zig with its crazy comptime requires autopsy.
+But that’s not actually the case! Zig doesn’t have decltype(auto), signatures are always explicit!
+
Let’s look at, e.g., std.mem.bytesAsSlice:
+
+
+
Note how the return type is not anytype, but the actual, real thing. You could write complex
+computations there, but you can’t look inside the body. Of course, it also is possible to write fn
+foo() @TypeOf(bar()) {, but that feels like a fair game —bar() will be evaluated at
+compile time. In other words, only bodies of functions invoked at comptime needs to be looked at by
+a language server. This potentially improves performance for this use-case quite a bit!
+
It’s useful to contrast this with Rust. There, you could write
+
+
+
Although it feels like you are stating the interface, it’s not really the case. Auto traits like
+Send and Sync leak, and that can be detected by downstream code and lead to, e.g., different
+methods being called via Deref-based specialization depending on : Send being implemented:
+
+
+
Zig is much more strict here, you have to fully name the return type (the name doesn’t have to be
+pretty, take a second look at bytesAsSlice). But it’s not perfect: a genuine leakage happens with
+inferred error types (!T syntax). A bad example would look like this:
+
+
+
Here, to check main, we actually do need to dissect f’s body; we can’t treat the error union
+abstractly. When the compiler analyzes main, it needs to stop to process f’s signature (which is
+very fast, as it is very short) and then f’s body (this part could be quite slow, as there might be a
+lot of code behind that Mystery!). It’s interesting to ponder alternative semantics, where, during
+type checking, inferred types are treated abstractly, and error exhaustiveness is a separate late
+pass in the compiler. That way, the compiler only needs f’s signature to check main. And that means
+that the bodies of main and f could be checked in parallel.
+
That’s all for today! The type system surprises I’ve found so far are:
+
+
+
Nominal type system despite notable absence of names of types.
+
+
+
Unification-less generics which don’t incur unreasonable annotation burden due to methods “closing
+over” generic parameters.
+
+
+
Explicit signatures with no Voldemort types, with the
+notable exception of error unions.
People sometimes ask me: “Alex, how do I learn X?”. This article is a compilation of advice I
+usually give. This is “things that worked for me” rather than “the most awesome things on earth”. I
+do consider every item on the list to be fantastic though, and I am forever grateful to people
+putting these resources together.
I don’t think I have any useful advice on how to learn programming from zero. The rest of the post
+assumes that you at least can, given sufficient time, write simple programs. E.g., a program that
+reads a list of integers from an input textual file, sorts them using a quadratic algorithm, and
+writes the result to a different file.
https://projecteuler.net/archives is fantastic. The first 50 problems or so are a perfect “drill”
+to build programming muscle, to go from “I can write a program to sort a list of integers” to “I can
+easily write a program to sort a list of integers”.
+
Later problems are very heavily math based. If you are mathematically inclined, this is perfect —
+you got to solve fun puzzles while also practicing coding. If advanced math isn’t your cup of tea,
+feel free to stop doing problems as soon as it stops being fun.
https://en.wikipedia.org/wiki/Modern_Operating_Systems is fantastic. A version of the
+book was the first
+thick programming-related tome I devoured. It gives a big picture of the inner workings of the software
+stack, and was a turning point for me personally. After reading this book I realized that I want to
+be a programmer.
https://www.nand2tetris.org is fantastic. It plays a similar “big picture” role as MOS,
+but this time you are the painter. In this course you build a whole computing system yourself,
+starting almost from nothing. It doesn’t teach you how the real software/hardware stack works, but
+it thoroughly dispels any magic, and is extremely fun.
https://cses.fi/problemset/ is fantastic. This is a list of algorithmic problems, which is
+meticulously crafted to cover all the standard topics to a reasonable depth. This is by far the best
+source for practicing algorithms.
https://www.coursera.org/learn/programming-languages is fantastic. This course is a whirlwind tour
+across several paradigms of programming, and makes you really get what programming languages are
+about (and variance).
https://www.tedinski.com/archive/ is fantastic. Work through the whole archive in chronological
+order. This is by far the best resource on “programming in the large”.
Having a great mentor is fantastic, but mentors are not always available. Luckily, programming can
+be mastered without a mentor, if you got past the initial learning step. When you code, you get a
+lot of feedback, and, through trial and error, you can process the feedback to improve your skills.
+In fact, the hardest bit is actually finding the problems to solve (and this article suggests many).
+But if you have a problem, you can self-improve by noticing the following:
+
+
+How you verify that the solution works.
+
+
+Common bugs and techniques to avoid them in the future.
+
+
+Length of the solution: can you solve the problem using shorter, simpler code?
+
+
+Techniques — can you apply anything you’ve read about this week? How would the problem be solved
+in Haskell? Could you apply pattern from language X in language Y?
+
+
+
In this context it is important to solve the same problem repeatedly. E.g., you could try solving
+the same model problem in all languages you know, with a month or two break between attempts.
+Repeatedly doing the same thing and noticing differences and similarities between tries is the
+essence of self-learning.
Learning your first programming language is a nightmare, because you are learning your editing
+environment (PyScripter, IntelliJ IDEA, VS Code) first, simple algorithms second, and the language
+itself third. It gets much easier afterwards!
+
Learning different programming languages is one of the best ways to improve your programming skills.
+By seeing what’s similar and what’s different, you learn more deeply how things work under the hood.
+Different languages put different idioms to the forefront, and learning several expands your
+vocabulary considerably. As a bonus, after learning N languages, learning N+1st becomes a question
+of skimming through the official docs.
+
In general, you want to cover big families of languages: Python, Java, Haskell, C, Rust, Clojure
+would be a good baseline. Erlang, Forth, and Prolog would be good additions afterwards.
+
+
Level 1
+
+
You are not actually learning algorithms, you are learning programming. At this stage, it doesn’t
+matter how long your code is, how pretty it is, or how efficient it is. The only thing that
+matters is that it solves the problem. Generally, this level ends when you are fairly comfortable
+with recursion. The first few problems from Project Euler are a great resource here.
+
+
Level 2
+
+
Here you learn algorithms proper. The goal here is mostly encyclopedic knowledge of common
+techniques. There are quite a few, but not too many of those. At this stage, the most useful thing
+is understanding the math behind the algorithms — being able to explain algorithm using
+pencil&paper, prove its correctness, and analyze Big-O runtime. Generally, you want to learn the
+name of algorithm or technique, read and grok the full explanation, and then implement it.
+
I recommend doing an abstract implementation first (i.e., not “HashMap to solve problem X”, but
+“just HashMap”). Include tests in your implementation. Use randomized testing (e.g., when testing
+sorting algorithms, don’t use a finite set of examples; generate a million random ones).
+
It’s OK and even desirable to implement the same algorithm multiple times. When solving problems,
+like CSES, you could abstract your solutions and re-use them, but it’s better to code everything
+from scratch every time, until you’ve fully internalized the algorithm.
+
+
Level 3
+
+
One day, long after I’ve finished my university, I was a TA for an algorithms course. The lecturer
+for the course was the person who originally taught me to program, through a similar algorithms
+course. And, during one coffee break, he said something like
+
+
+
I was thunderstruck! I didn’t realize that’s the reason why I am learning (well, teaching at that
+point) algorithms! Before, I always muddled through my algorithms by randomly tweaking generally
+correct stuff until it works. E.g., with a binary search, just add +1 somewhere until it doesn’t
+loop on random arrays. After hearing this advice, I went home and wrote my millionth binary
+search, but this time I actually added comments with loop invariants, and it worked from the first
+try! I applied similar techniques for the rest of the course, and since then my subjective
+perception of bug rate (for normal work code) went down dramatically.
+
So this is the third level of algorithms — you hone your coding skills to program without bugs.
+If you are already fairly comfortable with algorithms, try doing CSES again. But this time, spend
+however much you need double-checking the code before submission, but try to get everything
+correct on the first try.
Here’s the list of things you might want to be able to do, algorithmically. You don’t need to be
+able to code everything on the spot. I think it would help if you know what each word is about, and
+have implemented the thing at least once in the past.
A very powerful exercise is coding a medium-sized project from scratch. Something that takes more
+than a day, but less than a week, and has a meaningful architecture which can be just right, or
+messed up. Here are some great projects to do:
+
+
Ray Tracer
+
+
Given an analytical description of a 3D scene, convert it to a colored 2D image, by simulating a
+path of a ray of light as it bounces off objects.
+
+
Software Rasterizer
+
+
Given a description of a 3D scene as a set of triangles, convert it to a colored 2D image by
+projecting triangles onto the viewing plane and drawing the projections in the correct order.
+
+
Dynamically Typed Programming Language
+
+
An interpreter which reads source code as text, parses it into an AST, and directly executes the
+AST (or maybe converts AST to the byte code for some speed up)
+
+
Statically Typed Programming Language
+
+
A compiler which reads source code as text, and spits out a binary (WASM would be a terrific
+target).
+
+
Relational Database
+
+
Several components:
+
+
+Storage engine, which stores data durably on disk and implements on-disk ordered data structures
+(B-tree or LSM)
+
+
+Relational data model which is implemented on top of primitive ordered data structures.
+
+
+Relational language to express schema and queries.
+
+
+Either a TCP server to accept transactions as a database server, or an API for embedding as an
+in-process “embedded” database.
+
+
+
+
Chat Server
+
+
An exercise in networking and asynchronous programming. Multiple client programs connect to a
+server program. A client can send a message either to a specific different client, or to all other
+clients (broadcast). There are many variations on how to implement this: blocking read/write
+calls, epoll, io_uring, threads, callbacks, futures, manually-coded state machines.
+
+
+
Again, it’s more valuable to do the same exercise six times with variations, than to blast through
+everything once.
+
+
+On Modularity of Lexical Analysis
+2023-08-01
+https://matklad.github.io/2023/08/01/on-modularity-of-lexical-analysis
+
+
I was going to write a long post about designing an IDE-friendly language. I wrote an intro and
+figured that it would make a better, shorter post on its own. Enjoy!
+
The big idea of language server construction is that language servers are not magic — capabilities
+and performance of tooling are constrained by the syntax and semantics of the underlying language.
+If a language is not designed with toolability in mind, some capabilities (e.g, fully automated
+refactors) are impossible to implement correctly. What’s more, an IDE-friendly language turns out to
+be a fast-to-compile language with easy-to-compose libraries!
+
More abstractly, there’s this cluster of properties, unrelated at first sight but intimately
+intertwined and mutually supportive:
+
+
+parallel, separate compilation,
+
+
+incremental compilation,
+
+
+resilience to errors.
+
+
+
Separate compilation measures how fast we can compile a codebase from scratch if we have an unlimited
+number of CPU cores. For a language server, it solves the cold start problem — time to
+code-completion when the user opens the project for the first time or switches branches. Incremental
+compilation is the steady state of the language server — user types code and expects to see
+immediate effects throughout the project. Resilience to errors is important for two different
+sub-reasons. First, when the user edits the code it is by definition incomplete and erroneous, but a
+language server still must analyze the surrounding context correctly. But the killer feature of
+resilience is that, if you are absolutely immune to some errors, you don’t even have to look at the
+code. If a language server can ignore errors in function bodies, it doesn’t have to look at the
+bodies of functions from dependencies.
+
All three properties, parallelism, incrementality, and resilience, boil down to modularity —
+partitioning the code into disjoint components with well-defined interfaces, such that each
+particular component is aware only about the interfaces of other components.
Let’s do a short drill and observe how the three properties interact at a small scale. Let’s
+minimize the problem of separate compilation to just … lexical analysis. How can we build a
+language that is easier to tokenize for a language server?
+
An unclosed quote is a nasty little problem! Practically, it is rare enough that it doesn’t really
+matter how you handle it, but qualitatively it is illuminating. In a language like Rust, where
+strings can span multiple lines, inserting a " in the middle of a file changes the lexical structure
+of the following text completely (/*, start of a block comment, has the same effect). When tokens
+change, so does the syntax tree and the set of symbols defined by the file. A tiny edit, just one
+symbol, unhinges semantic structure of the entire compilation unit.
+
Zig solves this problem. In Zig, no token can span several lines. That is, it would be correct to
+first split a Zig source file by \n, and then tokenize each line separately. This is achieved by
+solving, in a better way, the underlying problems that would otherwise require multi-line tokens. Specifically:
+
+
+
there’s a single syntax for comments, //,
+
+
+
double-quoted strings can’t contain a \n,
+
+
+
but there’s a really nice syntax for multiline strings:
+
+
+
+
+
Do you see modules here? Disjoint-partitioning into interface-connected components? From the
+perspective of lexical analysis, each line is a module. And a line always has a trivial, empty
+interface — different lines are completely independent. As a result:
+
First, we can do lexical analysis in parallel. If you have N CPU cores, you can split file into N
+equal chunks, then in parallel locally adjust chunk boundaries such that they fall on newlines, and
+then tokenize each chunk separately.
+
Second, we have quick incremental tokenization — given a source edit, you determine the set of
+lines affected, and re-tokenize only those. The work is proportional to the size of the edit plus at
+most two boundary lines.
+
Third, any lexical error in a line is isolated just to this line. There’s no unclosed quote
+problem, mistakes are contained.
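To make the line-as-module idea concrete, here is a toy sketch in TypeScript (not tied to any real lexer; the token kinds are arbitrary):

```ts
type Token = { kind: "word" | "number" | "punct"; text: string };

// Toy per-line lexer: identifiers, integers, and single punctuation characters.
// Because no token can span a newline, every line can be lexed independently.
function tokenizeLine(line: string): Token[] {
  const tokens: Token[] = [];
  for (const m of line.matchAll(/[A-Za-z_]\w*|\d+|\S/g)) {
    const text = m[0];
    const kind: Token["kind"] =
      /^\d/.test(text) ? "number" : /^\w/.test(text) ? "word" : "punct";
    tokens.push({ kind, text });
  }
  return tokens;
}

// Incremental re-tokenization: after an edit, only the changed lines are
// re-lexed; everything else is reused from the cache.
function retokenize(lines: string[], cache: Token[][], changed: Set<number>): Token[][] {
  return lines.map((line, i) => (changed.has(i) ? tokenizeLine(line) : cache[i]));
}
```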
+
I am by no means saying that line-by-line lexing is a requirement for an IDE-friendly language
+(though it would be nice)! Rather, I want you to marvel how the same underlying structure of the
+problem can be exploited for quarantining errors, reacting to changes quickly, and parallelizing the
+processing.
+
The three properties are just three different faces of modularity in the end!
+
+
I do want to write that “IDE-friendly language” post at some point, but, as a hedge (after all, I
+still owe you “Why LSP Sucks?” one…), here are two comments where I explored the idea somewhat:
+1,
+2.
+
I also recommend these posts, which explore the same underlying phenomenon from the software
+architecture perspective:
+
+Three Different Cuts
+2023-07-16
+https://matklad.github.io/2023/07/16/three-different-cuts
+
+
In this post, we’ll look at how Rust, Go, and Zig express the signature of function cut— the power tool of string manipulation.
+Cut takes a string and a pattern, and splits the string around the first occurrence of the pattern:
+cut("life", "if") = ("l", "e").
+
At a glance, it seems like a non-orthogonal jumbling together of searching and slicing.
+However, in practice a lot of ad-hoc string processing can be elegantly expressed via cut.
+
A lot of things are key=value pairs, and cut fits perfectly there.
+What’s more, many more complex sequences, like
+--arg=key=value,
+can be viewed as nested pairs.
+You can cut around = once to get --arg and key=value, and then cut the second time to separate key from value.
+
In Rust, this function looks like this:
+
+
+
Rust’s Option is a good fit for the result type: it clearly describes the behavior of the function when the pattern isn’t found in the string at all.
+Lifetime 'a expresses the relationship between the result and the input — both pieces of result are substrings of &'a self, so, as long as they are used, the original string must be kept alive as well.
+Finally, the separator isn’t another string, but a generic P: Pattern.
+This gives a somewhat crowded signature, but allows using strings, single characters, and even fn(c: char) -> bool functions as patterns.
+
When using the function, there is a multitude of ways to access the result:
+
+
+
Here’s a Go equivalent:
+
+
+
It has a better name!
+It’s important that frequently used building-block functions have short, memorable names, and “cut” is just perfect for what the function does.
+Go doesn’t have an Option, but it allows multiple return values, and any type in Go has a zero value, so a boolean flag can be used to signal None.
+Curiously, if sep is not found in s, after is set to "", but before is set to s (that is, the whole string).
+This is occasionally useful, and corresponds to the last Rust example.
+But it also isn’t something immediately obvious from the signature, it’s an extra detail to keep in mind.
+Which might be fine for a foundational function!
+Similarly to Rust, the resulting strings point to the same memory as s.
+There are no lifetimes, but a potential performance gotcha — if one of the resulting strings is alive, then the entire s can’t be garbage collected.
+
There isn’t much in the way of using the function in Go:
+
+
+
Zig doesn’t yet have an equivalent function in its standard library, but it probably will at some point, and the signature might look like this:
+
+
+
Similarly to Rust, Zig can express optional values.
+Unlike Rust, the option is a built-in, rather than a user-defined type (Zig can express a generic user-defined option, but chooses not to).
+All types in Zig are strictly prefix, so leading ? concisely signals optionality.
+Zig doesn’t have first-class tuple types, but uses very concise and flexible type declaration syntax, so we can return a named tuple.
+Curiously, this anonymous struct is still a nominal, rather than a structural, type!
+Similarly to Rust, prefix and suffix borrow the same memory that s does.
+Unlike Rust, this isn’t expressed in the signature — while in this case it is obvious that the lifetime would be bound to s, rather than sep, there are no type system guardrails here.
+
Because ? is a built-in type, we need some amount of special syntax to handle the result, but it curiously feels less special-case and more versatile than the Rust version.
+
+
+
Moral of the story?
+Work with the grain of the language — expressing the same concept in different languages usually requires a slightly different vocabulary.
TL;DR, https://bors.tech delivers a meaningfully better experience, although it suffers from being a third-party integration.
+
Specific grievances:
+
Complexity. This is a vague feeling, but merge queue feels like it is built by complexity merchants — there are a lot of unclear settings and voluminous and byzantine docs.
+Good for allocating extra budget towards build engineering, bad for actual build engineering.
+
GUI-only configuration. Bors is set up using bors.toml in the repository; the merge queue is set up by clicking through a web GUI.
+To share config with other maintainers, I resorted to a zoomed-out screenshot of the page.
+
Unclear set of checks. The purpose of the merge queue is to enforce the not-rocket-science rule of software engineering — making sure that the code in the main branch satisfies certain quality invariants (all tests are passing).
+It is impossible to tell what merge queue actually enforces.
+Typically, when you enable merge queue, you subsequently find out that it actually merges anything, without any checks whatsoever.
+
Double latency. One of the biggest benefits of a merge queue for a high velocity project is its asynchrony.
+After submitting a PR, you can do a review and schedule PR to be merged without waiting for CI to finish.
+This is massive: it is a 2X reduction in the human attention required.
+Without the queue, you need to look at a PR twice: once to do a review, and once to click merge after the green checkmark is in.
+With the queue, you only need a review, and the green checkmark comes in asynchronously.
+Except that with GitHub merge queue, you can’t actually add a PR to the queue until you get a green checkmark.
+In effect, that’s still 2X attention, and then a PR runs through the same CI checks twice (yes, you can have separate checks for merge queue and PR. No, this is not a good idea, this is complexity and busywork).
+
Lack of delegation. With bors, you can use bors delegate+ to delegate merging of a single, specific pull request to its author.
+This is helpful to drive contributor engagement, and to formalize “LGTM with the nits fixed” approval (which again reduces number of human round trips).
+
You should still use the GitHub merge queue, rather than bors-ng, as it is now a first-party feature.
+Still, it’s important to understand how things should work, to be able to improve the state of the art some other time.
+The Worst Zig Version Manager
+2023-06-02
+https://matklad.github.io/2023/06/02/the-worst-zig-version-manager
+
+
+
+
One of the values of Zig which resonates with me deeply is a mindful approach to dependencies.
+Zig tries hard not to ask too much from the environment, such that, if you get zig version running, you can be reasonably sure that everything else works.
+That’s one of the main motivations for adding an HTTP client to the Zig distribution recently.
+Building software today involves downloading various components from the Internet, and, if Zig wants software built with Zig to be hermetic and self-sufficient, it needs to provide the ability to download files from HTTP servers.
+
There’s one hurdle for self-sufficiency: how do you get Zig in the first place?
+One answer to this question is “from your distribution’s package manager”.
+This is not a very satisfying answer, at least until the language is both post 1.0 and semi-frozen in development.
+And even then, what if your distribution is Windows?
+How many distributions should be covered by “Installing Zig” section of your CONTRIBUTING.md?
+
Another answer would be a version manager, a-la rustup, nvm, or asdf.
+These tools work well, but they are quite complex, and rely on various subtle properties of the environment, like PATH, shell activation scripts and busybox-style multipurpose executable.
+And, well, this also kicks the can down the road — you can use zvm to get Zig, but how do you get zvm?
+
I like how we do this in TigerBeetle.
+We don’t use zig from PATH.
+Instead, we just put the correct version of Zig into ./zig folder in the root of the repository, and run it like this:
+
+
+
Suddenly, whole swaths of complexity go away.
+Quiz time: if you need to add a directory to PATH, which script should be edited so that both the graphical environment and the terminal are affected?
+
Finally, another interesting case study is Gradle.
+Usually Gradle is a negative example, but they do have a good approach for installing Gradle itself.
+The standard pattern is to store two scripts, gradlew.sh and gradlew.bat, which bootstrap the right version of Gradle by downloading a jar file (java itself is not bootstrapped this way though).
+
What all these approaches struggle to overcome is the problem of bootstrapping.
+Generally, if you need to automate anything, you can write a program to do that.
+But you need some pre-existing program runner!
+And there are just no good options out of the box — bash and powershell are passable, but barely, and they are different.
+And “bash” and the set of coreutils also differ depending on the Unix in question.
+But there’s just no good solution here — if you want to bootstrap automatically, you must start with universally available tools.
+
But is there perhaps some scripting language which is shared between Windows and Unix?
+@cspotcode suggests a horrible workaround.
+You can write a script which is both a bash script and a powershell script.
+And it even isn’t too too ugly!
+
+
+
So, here’s an idea for a hermetic Zig version management workflow.
+There’s a canonical, short getzig.ps1 PowerShell/sh script which is vendored verbatim by various projects.
+Running this script downloads an appropriate version of Zig, and puts it into ./zig/zig inside the repository (.gitignore contains /zig).
+Building, testing, and other workflows use ./zig/zig instead of relying on global system state ($PATH).
+
A proof-of-concept getzig.ps1 is at the start of this article.
+Note that I don’t know bash, powershell, and how to download files from the Internet securely, so the above PoC was mostly written by Chat GPT.
+But it seems to work on my machine.
+I clone https://github.com/matklad/hello-getzig and run
+
+
+
on both NixOS and Windows 10, and it prints hello.
+
If anyone wants to make an actual thing out of this idea, here’s possible desiderata:
+
+
+
A single polyglot getzig.sh.ps1 is cute, but using a couple of different scripts wouldn’t be a big problem.
+
+
+
Size of the scripts could be a problem, as they are supposed to be vendored into each repository.
+I’d say 512 lines for combined getzig.sh.ps1 would be a reasonable complexity limit.
+
+
+
The script must “just work” on all four major desktop operating systems: Linux, Mac, Windows, and WSL.
+
+
+
The script should be polymorphic in curl / wget and bash / sh.
+
+
+
It’s ok if it doesn’t work absolutely everywhere — downloading/building Zig manually for an odd platform is also an acceptable workflow.
+
+
+
The script should auto-detect appropriate host platform and architecture.
+
+
+
Zig version should be specified in a separate zig-version.txt file.
+
+
+
After downloading the file, its integrity should be verified.
+For this reason, zig-version.txt should include a hash alongside the version.
+As downloads are different depending on the platform, I think we’ll need some help from Zig upstream here.
+In particular, each published Zig version should include a cross-platform manifest file, which lists hashes and urls of per-platform binaries.
+The hash included into zig-version.txt should be the manifest’s hash.
Welcome to my resume!
+It consists of two parts.
+The first part is the free-form narrative of what I do work-wise.
+This is something I would be excited to read from a person I am going to work with.
+The second part is a more traditional bullet-list of companies, positions, and projects.
+The resume is available as .html and .pdf.
I used to do math.
+Although I no longer do mathematics daily, it is the basis I use to think about programming.
+I enjoy solving an occasional puzzle.
+See “Generate All the Things” and “Notes on Paxos” articles as examples of math I like.
+
I am a programmer.
+I like writing code just for the sake of it.
+I like deleting code even more.
+I like short, simple, robust and beautiful code, which not only gets the job done, but does it in an obviously correct way.
+See, e.g., ungrammar for an example of a relatively short and self-contained piece of programming.
+
I am a pragmatist.
+The above two points sound outright scary, but don’t worry :)
+While I do enjoy encoding lambda calculus in types, that’s not what I spend most of my time on.
+I see most code as something to be replaced and re-written later, and optimise for making changes over time, not for perfection right now.
+This section from rust-analyzer style guide is a good example of this.
+
I loathe accidental complexity.
+I think I spend most of my time trying to make things simpler, trying to remove parts, trying to make foundational APIs more crisp.
+I have a visceral reaction to the gaps between how things should be and how they are.
+cargo xtask pattern shows to what lengths I am willing to go just to get rid of the mess the unix shell is.
+
I build systems.
+Software engineering is programming integrated over time, and it’s that time dimension that really matters.
+The shape of the software today is determined by accidental, runaway, viral successes of yesterday.
+There’s a reason why VT100 interface is still programmed against today, and it is not its technical adequacy.
+This is not my article, but I like it so much that I’ll advertise it even in my resume.
+System’s thinking is why I am fascinated with Rust and not, eg, with Kotlin.
+Since Java, with its reasonably fast managed runtime, Rust is the first PL revolution which meaningfully changes how we write software, and doesn’t just repackage known-good idioms with a better syntax (which is also important, just not as exciting!).
+
I build open source communities.
+My biggest successes so far I think are IntelliJ Rust and rust-analyzer.
+I didn’t write the hardest, smartest bits of those.
+But I tried very hard to make sure that others can do that, by removing accidental complexity, by making contribution enjoyable, by trying to program the architecture which would be robust to time and systems effects.
+
More generally, I help build moderately large projects, which are combinations of all of the above: people, systematic forces, beautiful mathematical abstractions at the core, and hundreds of thousands of lines of code as a physical manifestation.
+See “One Hundred Thousand Lines of Rust” series for a bunch of concrete, pragmatic lessons I’ve learned so far.
I am a member of the dev-tools team of the Rust programming language. I was the
+original author of both IntelliJ
+Rust and rust-analyzer— the two
+tools which today power IDE support for Rust in virtually every editor. My work
+included both the technical task of writing advanced, incremental, resilient
+compilers and organizing a vibrant community of contributors and maintainers
+to ensure that my direct involvement is not a requirement.
With Ferrous Systems, we brought rust-analyzer project from an MVP to a de-facto
+standard for the ecosystem. I also helped with teaching people to use Rust
+efficiently.
At JetBrains, I have led the development of
+IntelliJ-Rust plugin for the Rust
+programming language. The plugin is a Rust “compiler” written in Kotlin, with
+full-blown parser, name resolution and type inference algorithms, and
+integrations with build tools and debuggers. Besides solving the technical
+problems, I’ve created an open source community around the plugin by mentoring
+issues, writing developer documentation and supporting contributors.
Stepik is an e-learning platform, written in Python, focused on a rich variety of
+practical exercises and ease of creating content. I was on the backend team of
+three from the start of the project. Among other things, I worked on the
+exercises subsystem and students’ code sandboxing and progress tracking, and
+designed and implemented the JSON API interface for the single-page frontend.
This is a super work-in-progress page which collects various rules-of-thumb I use.
+The primary goal so far is to collect the rules for myself, that’s why I don’t link to this page from anywhere yet.
Prefer full names except for extremely common cases (ctx for context), or equal-length pairs
+(next/prev). Use consistent names. Naming variables after types (let thing: Thing) is a way
+to achieve global consistency with little coordination.
+
Build a vocabulary of standard names and re-use it:
+
+
ctx
+
+
“context” of an operation. Typically holds something mutable. Read-only
+context is named params.
+
+
params
+
+
A bag of named arguments. Unlike config, might hold not only pod types.
+
+
config
+
+
Generally user-specified POD parameters.
+
+
sink
+
+
“output” of an internal iterator, typically sink: &mut FnMut(T) or sink: &mut Vec<T>.
Avoid opening file descriptors in favor of bulk operations. To write data to a file, you need to
+follow a lifecycle: open a file descriptor, issue write syscalls, close the file descriptor. Lifecycle
+handling requires complicated type-system machinery and is better avoided. Usually, the standard
+library provides something like std::fs::read_to_string which encapsulates lifecycle management.