Skip to content

Latest commit

 

History

History
93 lines (57 loc) · 10.8 KB

script_matching.md

File metadata and controls

93 lines (57 loc) · 10.8 KB

One central task for skribo is choosing a font most suitable for the requested script. The requested script is a combination of an explicit lang tag and locale system settings.

Part of what makes this requirement tricky is that there is a fuzzy line between the scope of skribo and the scope of font-kit. Conceptually, skribo should be a consumer of font metadata, and font-kit is a provider. At a finer grain, there are decisions about exactly what that interface should look like. The main purpose of this document is to describe the problems to be solved.

Background

Selection of font is not strictly based on Unicode coverage, but is also affected by locale. The biggest impact is on Han unification, but there are other examples. Suitability of a font for a script is not encoded in the font's tables, but is metadata about the system's font configuration. Mechanisms for accessing this metadata vary widely from platform to platform.

A significant amount of research for this topic was reading through the Skia code, as that's the platform abstraction used by Chromium, so can be used as a rough first draft of requirements for correct cross-platform text layout for the Web. I also used Gecko and Qt as sources.

The main focus of this document is font fallback, but a related platform-specific concern is resolution of generic aliases (system UI font, generic serif, etc). This may be out of scope for skribo itself, but likely within the scope of font-kit. More clarity on requirements for each crate is needed.

TODO: write a clear requirement, link to this section of requirements doc.

BCP-47

I strongly recommend the use of BCP-47 as the identifier for language, script, and other locale metadata. This is an easy decision for web use cases, as it is the standard for the lang tag. The main challenge is that mechanisms for system font metadata in general predate BCP-47, so there will be some impedance matching.

TODO: investigate Rust ecosystem for common BCP-47 tag representation.

Platform-specific considerations

Each platform handles access to metadata (including script) in a different way. Platforms generally treat their text stack as an opaque black box, and vary in how much low-level access they give.

Fontconfig

The Fontconfig configuration file format specifies "langset" as an an "RFC-3066-style" language. RFC 3066 is a predecessor to BCP-47 (dated 2001), and basically specifies language and country, with no provision for explicit script or variant. For the purpose of Han unification, the convention is to infer script from country. For example, "zh-CN" could be translated to "zh-Hans", "zh-TW" to "zh-Hant". However, after a little investigation, it's not clear to me how useful it is to do sophisticated processing here, as the default fontconfig on a clean Debian 9 install lists doesn't specify "langset" attributes, but just has a few informal descriptions in comments.

The fontconfig API includes lang tags in its query interface (see Qt source for a good example of the usage). It seems that using this interface could do font matching based on script metadata, but I wasn't able to find evidence that Linux distros actuall ship useful config data. I did find a blog post on how users can write config files themselves, and this blog has a snippet of a font config from Fedora that suggests it might be a bit more CJK aware.

I request more input from CJK Linux users about what's going on here and the best way to handle it going forward.

Also note that fontconfig almost provides accurate aliases for system fonts, with the exception of kde. In those cases, kde does not update fontconfig, but rather stores the preference in .config/kdeglobals. Refer to the Qt code theme code for a reference of how to resolve these.

Windows

Windows is a tricky case. The DWrite font enumeration functions don't provide detailed metadata on script coverage. The recommended platform-native way to resolve font matching is to use IDWriteFontFallback to map characters, with an IDWriteTextAnalysisSource to specify locale and other details. The Skia Windows font manager uses this interface.

It's not clear to me whether Blink actually goes through SkFontMgr in this case, or whether it uses its own logic; there seems to be considerable duplication of concerns in the code base. In any case, the Blink code follows generally the same strategy, creating a TextAnalysisSource object containing locale information and then using a FontFallback object to map the characters.

The current font-kit code uses the font enumeration and not the text analysis APIs. Thus, getting full functionality will likely require significant architectural rework.

As far as I can tell, there are no Rust wrappers for these fallback and text analysis winapi functions.

As an alternative approach, Qt seems to have hardcoded lists of fonts for Han unification. My evaluation of the Qt code is that it's fairly crufty and probably not a great reference.

(Side note: font-kit uses dwrote, while the driud/piet infra uses the directwrite crate. We should dedup this or at the very least make interop painless)

macOS

The current font-kit code uses the availableFontFamilies call in NSFontManager, which reports just the names without any additional metadata.

Skia's macOS implementation of font selection (the onMatchFamilyStyleCharacter) is based on CTFontCreateForString, architecturally a similar approach as on Windows. However, that doesn't seem to be sensitive to the bcp47 argument.

Based on some more investigation, I think it's likely that Blink uses CTFontCopyDefaultCascadeListForLanguages as the primary mechanism to determine fallback fonts on macOS (the call seems to be in ui/gfx/font_fallback_mac.mm). Based on this, it seems like the SkFontMgr interface is not consistently used to determine fallback fonts on all platforms, and specifically on macOS it's done in Blink instead. Another useful trail of breadcrumbs to follow might be chromium#586517.

The Gecko codebase currently uses a very different approach. It stores a list of known platform fonts along with lang+script metadata in the Firefox preferences system, with platform-specific [https://searchfox.org/mozilla-central/source/modules/libpref/init/all.js#4124] initial values, but also the flexibility for users to override. There are other places in the code (starting from gfxFontGroup::FindFontForChar) where there is font-specific additional logic. There's a bit more detail in the comments of the PR landing this document.

Gecko also explored mozilla bug 1212731 a CTFontCopyDefaultCascadeListForLanguages based approach, but reverted it because of performance and compatibility concerns. (See also mozilla bug 1418724 for more breadcrumbs)

Qt uses the CTFontCopyDefaultCascadeListForLanguages approach.

It thus seems an open question which strategy is best.

Android

Android has a complete and up-to-date listing of font metadata, including BCP-47 language tags, in its fonts.xml file. This format is quasi-standard; the only officially supported way to access fonts is through the Java APIs, but that's not adequate for low-level needs such as a Web browser, so the Android frameworks team has historically made some effort not to break this file format. Note that the format evolves a bit over time, in particular Lollipop was a significant difference. The file also has a threatening sounding notice suggesting that it will go away in an imminent release; I'm not sure what they have in mind to replace it.

The best reference for parsing fonts.xml is Skia, as that's used to configure fonts in Chrome. In particular, its parsing logic is in SkFontMgr_android_parser.

A major additional possible complication on Android is downloadable fonts, which manifest as additional fonts that are not listed in fonts.xml. Access to these is through Java-language interfaces only, and the resulting Typeface object does not give documented access to the underlying font files. I was not able to find evidence that Chromium supports this feature, so likely it is out of scope, at least for an initial implementation.