- RustyBuzz-WASM
- this code under development
- this README still fragmentary
- initial goal was to make text shaping as implemented by `rustybuzz` accessible from NodeJS
- turns out `rustybuzz` depends on `ttf-parser`, which offers access to glyf outlines, which is when text rendering (in SVG format) was added
- then I found `textwrap`, which implements optimized distribution of sized 'black boxes' (i.e. rectangular areas of arbitrary content) over stretches of equal length (i.e. lines of text); this solves the problem of wrapping text provided one knows where line break opportunities are (e.g. around whitespace, after a hard or soft hyphen, between CJK ideographs)
- add text preparation and you have almost full typesetting (sans combination of styles so far, but we're getting there):
  - first do hyphenation for each paragraph of text, which takes some text and some language settings and returns the same text with soft (a.k.a. discretionary, optional) hyphens (`U+00AD` Soft Hyphen) inserted at appropriate positions;
  - next, apply Unicode UAX#14: Unicode Line Breaking Algorithm to the text; this will identify all the stretches of text that must be kept together in typesetting (here dubbed 'slabs', short for 'syllables')
- `rustybuzz-wasm` is no longer a fully appropriate moniker.
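The idea of distributing sized boxes over lines of equal length can be sketched as follows. This is only an illustration under simplifying assumptions: it uses greedy first-fit rather than the optimized distribution that `textwrap` offers, widths are plain integers, and the function name `wrap_greedy` is made up for this example.

```rust
/// Greedily distribute boxes (given only by their widths) over lines of
/// `line_width`. Returns, for each line, the indices of the boxes placed on it.
/// A box wider than the line still gets a line of its own.
fn wrap_greedy(widths: &[usize], line_width: usize) -> Vec<Vec<usize>> {
    let mut lines: Vec<Vec<usize>> = Vec::new();
    let mut current: Vec<usize> = Vec::new();
    let mut used = 0;
    for (idx, &width) in widths.iter().enumerate() {
        // Start a new line when the next box would overflow a non-empty line.
        if !current.is_empty() && used + width > line_width {
            lines.push(std::mem::take(&mut current));
            used = 0;
        }
        current.push(idx);
        used += width;
    }
    if !current.is_empty() {
        lines.push(current);
    }
    lines
}
```

With widths `[3, 4, 2, 5, 1]` and a line width of `7`, boxes 0 and 1 share the first line, boxes 2 and 3 the second, and box 4 ends up alone on the third.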
This module allows users to take a Unicode text and a path to a font file as inputs and obtain a list of
GlyfIDs and 2D positions back. This process is known as text shaping. It is an indispensable ingredient for
compositing text in so-called 'complex' writing systems like Arabic and Indic alphabets, but even when
applied to text written in the Latin alphabet, there are finer points of typesetting like kerning and the
choice of ligatures which make this process too difficult to be reasonably implemented on-the-fly for each
piece of software that uses text. Instead, what one wants is a specialized library that knows lots of details
about font file formats, OpenType font features, type metrics and so on and applies that knowledge to a given
text string to derive positioning data for the individual graphical pieces ('glyfs'); when the glyfs are
drawn out on a canvas (such as an HTML `<canvas>` or an `<svg>` element), that data instructs the rendering
software to render an aesthetically pleasing and orthographically correct (image of a) text. You can see all
this in action in the live HarfBuzz demo page. If you want to know more about text shaping, be sure to read
Ramsey Nasser's Unplain text: A primer on text shaping and rendering non-Latin text in the shadow of an
ASCII-dominated world; also, you might want to take a look at the HarfBuzz terminology glossary.
The leading free software to provide text shaping is HarfBuzz (repo here), which is written in C++.
`rustybuzz` is "a complete harfbuzz's shaping algorithm port to Rust", and since it's written in Rust, we can
compile it to WASM and write a nice API surface for it, which is what I did.
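To make "a list of GlyfIDs and 2D positions" more concrete, here is a minimal sketch of how a shaper's per-glyf advances and offsets are commonly turned into absolute drawing positions. The field names are modeled loosely on HarfBuzz's glyph position record; the types `ShapedGlyf` and `PlacedGlyf` and the function `place_glyfs` are hypothetical and not part of the `rustybuzz-wasm` API. The sketch assumes a single horizontal, left-to-right run and ignores vertical advances.

```rust
/// One shaped glyf as a shaper might report it (hypothetical layout).
#[derive(Debug, Clone, Copy, PartialEq)]
struct ShapedGlyf {
    gid: u32,       // glyf ID within the font
    x_advance: i32, // how far the pen moves after this glyf
    x_offset: i32,  // horizontal shift applied to this glyf only
    y_offset: i32,  // vertical shift applied to this glyf only
}

/// A glyf with its absolute position relative to the start of the run.
#[derive(Debug, Clone, Copy, PartialEq)]
struct PlacedGlyf {
    gid: u32,
    x: i32,
    y: i32,
}

/// Accumulate advances into a running pen position; offsets shift a single
/// glyf without moving the pen.
fn place_glyfs(run: &[ShapedGlyf]) -> Vec<PlacedGlyf> {
    let mut pen_x = 0;
    run.iter()
        .map(|g| {
            let placed = PlacedGlyf { gid: g.gid, x: pen_x + g.x_offset, y: g.y_offset };
            pen_x += g.x_advance;
            placed
        })
        .collect()
}
```

A mark with zero advance and a non-zero offset (as is typical for diacritics) is drawn relative to the pen without pushing the following glyf further along the line.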
Sample in Arabic, using the Amiri Typeface to typeset "الخط الأمیری".
Notice visible overlaps and tasteful placement of complex ligatures (which will for the most part not be
present in the browser rendering of the same text unless you happen to have configured a suitable font). Both
texts are generated from the exact same sequence of Unicode codepoints, ا, ل, خ, ط, ␣, ا, ل, أ, م, ی, ر, ی
(which starts with ا and ends with ی, notice RTL re-ordering by the browser). Also
note that while the bounding boxes of the glyfs differ in their vertical placements, in this case that only
reflects the different areas covered by the outlines; in the underlying SVG, the `y` attributes of all
paths are set to `0` (i.e. all glyfs are still nominally sitting on the baseline).
Sample in Tibetan, using the Tibetan Machine Uni Typeface,
to typeset ཨོཾ་མ་ཎི་པདྨེ་ཧཱུྃ (there is a certain chance even in 2021 that this piece of text will not be
rendered correctly across systems and browsers). Again, a complex composition is made from a linear string
of codepoints ཨ, ོ, ཾ, ་, མ, ་, ཎ, ི, ་, པ, ད, ྨ, ེ, ་, ཧ, ཱ, ུ. Notice that in this font, a choice has been
made to precompose the stacked clusters ད, ྨ and ཧ, ཱ, ུ; this is a design choice which, were it not for a
text shaper like `rustybuzz`, would
cause a considerable amount of work for anyone striving to display Tibetan script correctly with this font
and others whose choice of ligatures may be completely different.
To implement `rustybuzz-wasm` I started with the example shipped with `rustybuzz`, which compiles to an
executable that accepts a path to a font file and a text and then echoes the glyf IDs and
positioning data. This I turned into a minimalist version with WASM entry points. There's still a lot
missing, especially font feature selection, but since everything has gone so well so far, I guess I'll get to
that later.
- `rustybuzz-wasm` is not feature-complete with `rustybuzz` yet.
- `rustybuzz-wasm` would appear to be 1.5 times faster than `harfbuzzjs` (which is what drives the HarfBuzz demo page).
- `harfbuzzjs` does not allow arbitrarily long lines and does not support font features (which `rustybuzz` will probably soon have).
- `rustybuzz-wasm` is over 3 times faster than using `opentype.js`.
- HarfBuzz does have command line utilities, too (referred to as `harfbuzz_shaping` in the benchmark results below), but the fact that one has to open a sub-process for each piece of text and re-read font files damages performance a great deal. This means that `rustybuzz-wasm` (running as WASM attached to a NodeJS process) is over 12 times as performant as `harfbuzz` (using child processes over the command line). Note that this does not tell you how fast HarfBuzz itself is because secondary effects (overhead of one process per line of text, re-reading fonts) can be reasonably expected to dominate performance.
The benchmarks (source here) were done with 100 lines of text with 100 words on each line; counts represent Unicode code units (thus, approximately characters). "1,000 nspc" means "one thousand nanoseconds per cycle", a cycle being the unit of counting (roughly, one Unicode codepoint); here, lower figures are better. The reciprocal value expressed in Hertz (cycles per second) tells you how many items you can expect to get through your chosen process per second, so higher numbers are better. The bar charts express relative performance with the top performer being pegged to 100%. Several runs were performed with randomized order of execution to minimize noise. The hardware is a 2015 consumer-grade laptop (not fast, not new, not fancy), so many machines will be considerably faster for all contestants.
rustybuzz_wasm_rusty_shaping 0.300 s 65,732 items 218,840⏶Hz 4,570⏷nspc
rustybuzz_wasm_json_shaping 0.368 s 65,732 items 178,605⏶Hz 5,599⏷nspc
rustybuzz_wasm_short_shaping 0.331 s 65,732 items 198,465⏶Hz 5,039⏷nspc
harfbuzzjs_shaping 0.373 s 65,732 items 176,392⏶Hz 5,669⏷nspc
opentypejs_shaping 0.928 s 65,732 items 70,815⏶Hz 14,121⏷nspc
fontkit_shaping 2.203 s 65,732 items 29,840⏶Hz 33,512⏷nspc
harfbuzz_shaping 3.745 s 65,732 items 17,553⏶Hz 56,971⏷nspc
rustybuzz_wasm_rusty_shaping 220,399 Hz 100.0 % │████████████▌│
rustybuzz_wasm_short_shaping 194,886 Hz 88.4 % │███████████ │
rustybuzz_wasm_json_shaping 180,277 Hz 81.8 % │██████████▎ │
harfbuzzjs_shaping 143,434 Hz 65.1 % │████████▏ │
opentypejs_shaping 65,468 Hz 29.7 % │███▊ │
fontkit_shaping 29,605 Hz 13.4 % │█▋ │
harfbuzz_shaping 17,153 Hz 7.8 % │█ │
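The relationship between the columns above is plain arithmetic and can be sketched like this. The function names are made up for illustration and are not taken from the benchmark code. Note that the seconds column is rounded, so recomputing Hz from it will not reproduce the Hz column to the last digit.

```rust
/// Items processed per second.
fn hz(items: u64, seconds: f64) -> f64 {
    items as f64 / seconds
}

/// Nanoseconds per cycle: the reciprocal of the rate, scaled to nanoseconds.
fn nspc(hz: f64) -> f64 {
    1.0e9 / hz
}

/// Performance relative to the top performer, in percent.
fn relative_percent(hz: f64, best_hz: f64) -> f64 {
    100.0 * hz / best_hz
}
```

For example, `nspc(218_840.0)` rounds to 4,570, matching the first row of the table, and `relative_percent(194_886.0, 220_399.0)` comes out at roughly 88.4, matching the second bar.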
⚠️ Rust Newbie here so probably the code is not ideal in some respects.
⚠️ FTTB I have committed the WASM artefacts to the repo; since I'm still working on this you may happen to have ⛔️ downloaded some unoptimized code which is orders of magnitude slower than WASM resulting from optimized compilation ⛔️; therefore:
- Always re-build before trying out:
  - for faster compilation, do
    `wasm-pack build --debug --target nodejs && trash pkg/.gitignore && node demo-nodejs-using-wasm/lib/main.js > /tmp/foo.svg`
  - for faster execution, do
    `wasm-pack build --target nodejs && trash pkg/.gitignore && node demo-nodejs-using-wasm/lib/main.js > /tmp/foo.svg`
⚠️ Values are currently communicated as JSON and hex-encoded binary strings; this is probably not terribly efficient and may change in the future; see https://hacks.mozilla.org/2019/11/multi-value-all-the-wasm/ and https://docs.rs/serde-wasm-bindgen/0.1.3/serde_wasm_bindgen/.
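For readers unfamiliar with hex-encoded binary strings, the following sketch shows the general technique of turning bytes (such as the contents of a font file) into a hex string and back. The names `to_hex` and `from_hex` are made up for this example; this is not necessarily how `set_font_bytes()` decodes its argument.

```rust
/// Encode bytes as a lowercase hex string, two digits per byte.
fn to_hex(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{:02x}", b)).collect()
}

/// Decode a hex string back into bytes; returns `None` when the input has an
/// odd length or contains anything other than hex digits.
fn from_hex(hex: &str) -> Option<Vec<u8>> {
    if hex.len() % 2 != 0 || !hex.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    // All characters are ASCII at this point, so slicing by byte index is safe.
    (0..hex.len())
        .step_by(2)
        .map(|i| u8::from_str_radix(&hex[i..i + 2], 16).ok())
        .collect()
}
```

Hex encoding doubles the size of the payload, which is one reason the approach is "probably not terribly efficient" for large font files.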
- provided out-of-the-box by `textwrap`,
- includes hyphenation, character width calculation
- problem lies with Unicode UAX#11: East Asian Width (or its implementation in packages like `string-width` (JS) and `unicode-width` (Rust)) which report partially faulty lengths:
  - abc: 3 units 💚
  - 御門: 4 units 💚
  - اَلْعَرَبِيَّةُ: 15 units ❌
  - العربية: 7 units 💚
  - ﷺ: 2 units ❌
  - ﷻ: 2 units ❌
  - ﷼: 2 units ❓
  - ﷽: 1 units ❌❌❌
- the better approach would seem to be to either monkey-fix widths known to be wrong or to do text shaping using carefully selected fonts (and quantize widths where they are not already quantized); in either case, one cannot simply use the solution provided by `textwrap` without landing a pull request first.
- using this proposed method, monospaced typesetting does become more complicated, but on the other hand:
  - where better speed is needed, one can still check texts for problematic characters, and, where needed, cache results
  - monospaced typesetting becomes less of a special case and can be seamlessly integrated into the workflow of proportional typesetting, which is a huge advantage.
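The "monkey-fix" option mentioned above could look roughly like the sketch below: a table of overrides is consulted before falling back to a generic width function. Everything here is an assumption made for illustration. `naive_width` is a deliberately crude stand-in for a real width function, and treating the Arabic vowel marks U+064B to U+0652 as zero-width is one plausible correction, not a tested set of values.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a library width function: one cell per `char`.
fn naive_width(_c: char) -> usize {
    1
}

/// Width of `text` in cells; entries in `overrides` take precedence over
/// the fallback width function.
fn patched_width(text: &str, overrides: &HashMap<char, usize>) -> usize {
    text.chars()
        .map(|c| overrides.get(&c).copied().unwrap_or_else(|| naive_width(c)))
        .sum()
}

/// Example override table: Arabic vowel marks (U+064B..=U+0652) combine with
/// the preceding letter, so they are assumed to occupy no cell of their own.
fn arabic_mark_overrides() -> HashMap<char, usize> {
    ('\u{064B}'..='\u{0652}').map(|c| (c, 0)).collect()
}
```

With this table, a fully vocalized word is measured by its base letters only, while an empty table leaves the fallback behaviour untouched.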
To build and test in dev (much faster compilation, but also much slower execution):
wasm-pack build --debug --target nodejs && trash pkg/.gitignore && node demo-nodejs-using-wasm/lib/main.js
To build and test production:
wasm-pack build --target nodejs && trash pkg/.gitignore && node demo-nodejs-using-wasm/lib/main.js
- `pub fn set_font_bytes( font_bytes_hex: String )`
- `pub fn has_font_bytes() -> bool { unsafe { !FONT_BYTES.is_empty() } }`
- `pub fn shape_text( user_cfg: &JsValue ) -> String`
- `pub fn glyph_to_svg_pathdata( js_glyph_id: &JsValue ) -> String`
- `pub fn wrap_text( text: String, width: usize ) -> String`
- 'raw' text is encoded as a series of bytes (UTF-8)
- a sequence of codepoints (positive integer numbers) intended to represent graphemes
- hyphenate text (in languages that use hyphenation); this inserts soft hyphens (U+00AD)
- find line break opportunities (LBOs); these occur, for example, after each space, each hyphen (hard or soft), each full stop and so on
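The last step can be illustrated with a deliberately simplified splitter that cuts a (previously hyphenated) text into slabs after each space, hard hyphen, or soft hyphen. This is a stand-in for illustration only and not an implementation of UAX#14, which knows many more break classes; the function name `slabs` is made up.

```rust
/// Split `text` into slabs, breaking after spaces, hard hyphens (`-`) and
/// soft hyphens (U+00AD). The breaking character stays at the end of its slab.
fn slabs(text: &str) -> Vec<String> {
    let mut out = Vec::new();
    let mut current = String::new();
    for c in text.chars() {
        current.push(c);
        if c == ' ' || c == '-' || c == '\u{00AD}' {
            out.push(std::mem::take(&mut current));
        }
    }
    // Whatever follows the last break opportunity forms the final slab.
    if !current.is_empty() {
        out.push(current);
    }
    out
}
```

Keeping the soft hyphen at the end of its slab lets a later stage decide whether to render it (when a line ends there) or drop it (when the next slab follows on the same line).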
On my smallish, not new laptop, RustyBuzz-WASM's `shape_text()` method achieves speeds exceeding 280,000
glyfs (outlines) per second, more than four times the speed attained by OpenTypeJS and close to eighteen
times that of the HarfBuzz command line tools:
rustybuzz_wasm_rusty_shaping 286,779 Hz ≙ 1 ÷ 1.0 100.0 % │████████████▌│
rustybuzz_wasm_short_shaping 254,097 Hz ≙ 1 ÷ 1.1 88.6 % │███████████▏ │
rustybuzz_wasm_json_shaping 216,043 Hz ≙ 1 ÷ 1.3 75.3 % │█████████▍ │
opentypejs_shaping 61,997 Hz ≙ 1 ÷ 4.6 21.6 % │██▊ │
fontkit_shaping 27,953 Hz ≙ 1 ÷ 10.3 9.7 % │█▎ │
harfbuzz_shaping 16,221 Hz ≙ 1 ÷ 17.7 5.7 % │▊ │
Note that in order to obtain this kind of performance, you absolutely must build for
production as development builds will be much, much, much slower. Benchmarks created with
`textshaping.benchmarks`.
- find out what makes format `rusty` (which has quite a few options) so much faster than the minimalistic `short` format (which has no options); to do so, modify the (constant) format flags
- implement OpenType font features
- implement face selection
- implement language selection?
- implement script selection?
- implement clustering selection?
- write `INSTALL.md`
- `ab-glyph`: "When laying out glyphs into paragraph, ab_glyph is faster than rusttype using .ttf fonts & much faster for .otf fonts."
- `rusttype`: A pure Rust alternative to libraries like FreeType
- Fontdue: See below under Line Breaking / Text Wrapping.
- Allsorts: Allsorts is a font parser, shaping engine, and subsetter for OpenType, WOFF, and WOFF2 written entirely in Rust. It was extracted from Prince, a tool that typesets and lays out HTML and CSS documents into PDF. The Allsorts shaping engine was developed in conjunction with a specification for OpenType shaping, which aims to specify OpenType font shaping behaviour.
- newbreak: written in JS/TS, has tentative Rust implementation; 🛑 TS fails to compile to JS; last commits in Summer 2020 so maybe abandoned.
- fontdue: "Fontdue is a simple, `no_std` (does not use the standard library for portability), pure Rust, TrueType (`.ttf`/`.ttc`) & OpenType (`.otf`) font rasterizer and layout tool. It strives to make interacting with fonts as fast as possible, and currently has the lowest end to end latency for a font rasterizer". Written in Rust, aims to be a font rasterizer including text wrapping, but sadly 🛑 fails to compile, although I could hotfix that.
- kas-text looks enticing but is a huge thing geared towards building GUI apps. 🛑 It uses the original HarfBuzz C libraries so I'd rather not touch this thing as C dependencies will always be cans of worms.
- [+] update dependencies (3.3.0 -> 4.3.0):
- Updating libc v0.2.104 -> v0.2.107
- Updating proc-macro2 v1.0.30 -> v1.0.32
- Updating rustybuzz v0.3.0 -> v0.4.0
- Updating rustybuzz-wasm v3.3.0 (/home/flow/jzr/rustybuzz-wasm) -> v4.3.0
- Updating serde_json v1.0.68 -> v1.0.69
- Updating syn v1.0.80 -> v1.0.81
- Updating ttf-parser v0.9.0 -> v0.12.3
- Updating unicode-general-category v0.2.0 -> v0.4.0
- [+] implement `ad.nobr` attribute to signal where breaking glyfs is unsafe
- [+] set `ads.br: 'end'` to avoid spurious line break, loss of rest-of-line
- [+] fix endless loop, spurious repeated hyphens in distribution
- [+] recover mysteriously missing first glyf on line after break in sample `missing-t-b42`