Support RTL/bidi text, some font and harfbuzz fixes #309

poire-z · 2019-09-14T12:50:10Z

See individual commit messages for details. Some technical notes in #307.

Some additional work on the block rendering code is still needed for full support of RTL documents (list item bullets on the right, table cells/columns ordered from right to left, H1 and H2 text-align being start instead of left in our epub.css...), but a few style tweaks to force right text alignment may help in the meantime.

For now, we get the following (showing these limitations):

More screenshots of arabic and hebrew texts in koreader/koreader#5359 (comment).

Note: highlights won't be drawn well when they span bidi segment boundaries, unlike in Firefox, which handles this well (I have no idea how to do the same):

The whole <bdi> and <bdo> handling in lvrend.cpp is there to make this kind of sample (found on the web) work as expected:

Some visual examples regarding the first commits related to fonts:

This is some text rendered with Harfbuzz with a font with its bold and italic variants present (note the different italic f, and the fi ligature in all variants):

If I remove the italic, bold and bold italic font files, it previously rendered as follows (with the former crengine embolden code losing the Harfbuzz fi ligature on the bold):

With this PR, it will render as (note the regular f in italic, and the preserved fi ligature):

This helps render Arabic correctly when it is drawn with the fallback font (here, our regular-only FreeSerif):

Previously, with our Harfbuzz code limited to western ligatures, it was this mess (even messier with a bigger font size):

Skip elements among siblings that are not list items. By @pkb from buggins/coolreader#105
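
To illustrate the idea behind this fix, here is a minimal sketch, not the actual crengine code: the `Display` enum and `listItemNumber()` are hypothetical stand-ins for the style check shown in the diff. The counter only advances on siblings that are list items, so a stray element between two items cannot cause a jump in numbering.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the CSS display values checked in the diff.
enum class Display { Block, ListItem, ListItemBlock };

// Returns the 1-based number of the list item at index `target` among
// `siblings`, counting only siblings that are themselves list items.
// Returns 0 when the target itself is not a list item.
int listItemNumber(const std::vector<Display>& siblings, std::size_t target) {
    assert(target < siblings.size());
    auto isListItem = [](Display d) {
        return d == Display::ListItem || d == Display::ListItemBlock;
    };
    if (!isListItem(siblings[target]))
        return 0;
    int counter = 0;
    for (std::size_t i = 0; i <= target; ++i) {
        if (!isListItem(siblings[i]))
            continue; // alien element: skip it so the numbering does not jump
        ++counter;
    }
    return counter;
}
```

With siblings `ListItem, Block, ListItem`, the third sibling is numbered 2 rather than 3.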

Frenzie · 2019-09-14T12:53:32Z

crengine/src/lvtinydom.cpp

@@ -13364,6 +13364,10 @@ bool ldomNode::getNodeListMarker( int & counterValue, lString16 & marker, int &
                css_style_ref_t cs = child->getStyle();
                if ( cs.isNull() )
                    continue;
+                if ( cs->display!=css_d_list_item_block && cs->display!=css_d_list_item) {
+                    // Alien element among list item nodes, skip it to not mess numbering


Alien, hehe. :-)

NiLuJe · 2019-09-14T13:01:13Z

crengine/src/hyphman.cpp

 {
    bool soft_hyphens_found = false;
    for ( int i = 0; i<len; i++ ) {
        if ( widths[i] + hyphCharWidth > maxWidth )
            break;
        if ( str[i] == UNICODE_SOFT_HYPHEN_CODE ) {
-            flags[i] |= LCHAR_ALLOW_HYPH_WRAP_AFTER;
+            switch ( flagSize ) {


Unless you ultimately intend to support more flagSize values, which seems unlikely, I'd swap that switch to a simple if (with the most common branch first) ;).
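
As a rough sketch of that suggestion (an illustration only, not the code from this PR): `setFlag()` and the flag value below are hypothetical, and the assumption that the 2-byte table is the common case is mine.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical flag value, for illustration only.
constexpr std::uint16_t ALLOW_HYPH_WRAP_AFTER = 0x0008;

// Sets `flag` on entry `index` of a flags table whose elements are either
// 1 byte wide (legacy callers) or 2 bytes wide (upgraded callers).
void setFlag(void* flags, std::size_t flagSize, std::size_t index, std::uint16_t flag) {
    assert(flagSize == 1 || flagSize == 2);
    if (flagSize == 2) {
        // Assumed to be the most common case, so it is tested first.
        static_cast<std::uint16_t*>(flags)[index] |= flag;
    } else {
        // Only the low byte of the flag fits in a 1-byte table.
        static_cast<std::uint8_t*>(flags)[index] |= static_cast<std::uint8_t>(flag);
    }
}
```

Wrapping the branch in one helper would also avoid repeating it at each of the three call sites in hyphman.cpp.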

NiLuJe · 2019-09-14T13:01:54Z

crengine/src/hyphman.cpp

@@ -894,18 +906,30 @@ bool TexHyph::hyphenate( const lChar16 * str, int len, lUInt16 * widths, lUInt8
        // p+2 because: +1 because word has a space prepended, and +1 because
        // mask[] holds the flag for char n on slot n+1
        if ( (mask[p+2-soft_hyphens_skipped]&1) && nw <= maxWidth ) {
-            flags[p] |= LCHAR_ALLOW_HYPH_WRAP_AFTER;
+            switch ( flagSize ) {


NiLuJe · 2019-09-14T13:02:04Z

crengine/src/hyphman.cpp

@@ -967,7 +991,19 @@ bool AlgoHyph::hyphenate( const lChar16 * str, int len, lUInt16 * widths, lUInt8
                                            break;
                                        }
                                    if (!disabled)
-                                        flags[i] |= LCHAR_ALLOW_HYPH_WRAP_AFTER;
+                                        switch ( flagSize ) {


NiLuJe · 2019-09-14T13:04:47Z

crengine/src/lvfntman.cpp

-            // correct gamma
-            if ( gammaIndex!=GAMMA_NO_CORRECTION_INDEX )
-                cr_correct_gamma_buf(item->bmp, w*h, gammaIndex);
+        } else


Haven't checked in context, so it's hard to tell from the diff view, is the removed { after the else expected here?

You got me worried :) but no problem: the #endif is the counterpart of an earlier #if 0, and that } else is still inside it. So, from space, it looks like:

    } else {
    #if 0
        if ( bitmap->pixel_mode==FT_PIXEL_MODE_MONO ) {
            memset( item->bmp, 0, w*h );
            lUInt8 * srcrow = bitmap->buffer;
            lUInt8 * dstrow = item->bmp;
            for ( int y=0; y<h; y++ ) {
                lUInt8 * src = srcrow;
                for ( int x=0; x<w; x++ ) {
                    dstrow[x] = ( (*src)&(0x80>>(x&7)) ) ? 255 : 0;
                    if ((x&7)==7)
                        src++;
                }
                srcrow += bitmap->pitch;
                dstrow += w;
            }
        } else
    #endif
        memcpy( item->bmp, bitmap->buffer, w*h );
        // correct gamma
        if ( gammaIndex!=GAMMA_NO_CORRECTION_INDEX )
            cr_correct_gamma_buf(item->bmp, w*h, gammaIndex);
    }

Yeah, but that commented out block would break if uncommented as-is, right?

(i.e., only the memcpy would be part of the "else" branch).

(Sidebar: this is partly why I'm really not a fan of the broken down } else { syntax and non-mandatory braces even for one-liners ;)).

Right. Better then to just add a // : } // else: ?
Otherwise, I'd need to add after these 4 active lines:

    #if 0
    }
    #endif

poire-z · 2019-09-14T13:33:59Z

Small bug noticed while looking at the first screenshot: footnote links are recorded in left-to-right order, so the in-page footnotes end up slightly out of order.

No logical changes: just tab and indentation cleanups, and comments.

We'll need more flags for char classification. HyphMan hyphenate() (used by some other parts of the code) then needs to be tweaked so it can work on a table of lUInt8, as well as on our upgraded table of lUInt16 flags.

Use the Freetype embolden API instead of LVFontBoldTransform to make fake bold (for fonts that do not provide a bold face). This allows Harfbuzz to work with them (even if they won't look as nice as if they came with a real bold font). (LVFontBoldTransform is a wrapper that enlarges glyphs and advances, thus defeating any Harfbuzz measurement and glyph replacement.) See comments for more details.

Even when the original text was italic and/or bold, requesting the fallback font always gave us its regular, non-bold variant, so part of a sentence or word would stand out oddly. With this and the previous commit (Freetype embolden), many things will look nicer when using our FreeSerif (which has neither italic nor bold variants) as the fallback font.

Also loosen font charset detection by no longer looking for "azAZ09" in a font (NotoSansMyanmar.ttf, for example, comes without them and so could not even be used as a fallback font).

Also switch symbol fonts to the FT_ENCODING_MS_SYMBOL charmap immediately on load, as otherwise Harfbuzz would not see/use the symbol glyphs.

measureText(): accepts an added "hints" parameter, through which we can pass text direction and additional hints for Harfbuzz (begin/end of paragraph for now).

For Freetype and Harfbuzz light, draw chars in reverse order when direction is RTL: this allows individual RTL words to be drawn correctly, even when not using Harfbuzz (and to witness the additional Harfbuzz magic when switching to it).

DrawTextString(): returns the advance, so we can draw subsegments of the text with the fallback font, instead of only individual chars as previously. This is needed to get correct shaping with Harfbuzz with the fallback font.

Harfbuzz, in both measureText() and DrawTextString():
- Don't use filterChar() with the main font, as the not-found chars might be found in the fallback font; no need to replace them that early.
- Rework glyph/cluster/text walking and drawing to be more generic: it should work with "one char > multiple glyphs" situations (we previously handled correctly only "multiple chars merged into a single glyph", like ligatures), and with RTL text.
- Also accumulate not-found glyphs so we can draw them as a single segment with the fallback font, with Harfbuzz (instead of with Freetype previously): this is needed when using a Latin font as the main font, to be able to see nice Arabic drawn with the fallback font.
- Drop letter spacing when the detected script is cursive.

Harfbuzz light: hbCalcCharWidth(): skip the triplet when any of the 3 chars is not found, as it would mess up the result (and cached values) when some char is combined with multiple rare diacritic marks (e.g. Hebrew). This could mess up the drawing of the whole text. Fall back to the more robust Freetype measurement in such cases.
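
The accumulation of not-found chars into a single fallback segment can be pictured with the following sketch. It is only an illustration under simplifying assumptions: it works on code points rather than on shaped glyphs and clusters, and `Run`, `splitByCoverage()` and the `mainFontHas` predicate are hypothetical names, not crengine or Harfbuzz API.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

struct Run {
    std::size_t start;   // index of the first char of the run
    std::size_t length;  // number of chars in the run
    bool useFallback;    // true when the main font lacks these chars
};

// Groups consecutive chars by whether the main font covers them, so that each
// uncovered stretch can be shaped and drawn as one segment with the fallback
// font instead of char by char.
std::vector<Run> splitByCoverage(const std::u32string& text,
                                 const std::function<bool(char32_t)>& mainFontHas) {
    assert(static_cast<bool>(mainFontHas));
    std::vector<Run> runs;
    for (std::size_t i = 0; i < text.size(); ++i) {
        bool fallback = !mainFontHas(text[i]);
        if (!runs.empty() && runs.back().useFallback == fallback)
            ++runs.back().length;            // extend the current run
        else
            runs.push_back({i, 1, fallback}); // coverage changed: start a new run
    }
    return runs;
}
```

Shaping a whole Arabic run at once matters because the joining forms of each letter depend on its neighbours, which per-char drawing cannot take into account.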

Shouldn't have any impact on pure LTR text. Additions to the usual LTR processing:

copyText(): detect if we have RTL chars (Arabic, Hebrew, Unicode bidi chars...) in a paragraph's text, and use fribidi only when we have some (fribidi processing is quite expensive, so avoid it when not needed). When we do, compute bidi levels, to be used for visual reordering in addLine().

measureText(): split measuring on bidi level change (and also on letter spacing change - upstream fix by @pkb). Provide fribidi segment direction and Harfbuzz hints to font->measureText() for correct Harfbuzz measurements. Allow for ignoring some chars when measuring and drawing (for now: the set of Unicode bidirectionality chars that can be used to tweak the Unicode bidi algorithm, as some fonts have glyphs for them).

When splitting the paragraph into lines, we continue to process chars in logical order. Only in addLine(), with bidi paragraphs, do we reorder the chars (and flags, widths...) from the logical line segment into their visual order, and we make words from the result the usual way, with some additional flags (to help later with drawing, createXPointer() and getRect()).

alignLine(): simple tweaks for RTL paragraphs:
- put text-indent on the right
- for justified text, align last (or single) line to the right

More work needs to be done on the block rendering code for proper RTL layout (list item bullets on the right, table columns ordered from right to left...).

lvrend.cpp: handle <bdi> and <bdo> elements, used to tweak the Unicode bidirectional algorithm. Also add simple support for <q> by using a hardcoded single set of quotes.
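
The two steps described for copyText() and addLine() can be sketched as follows. This is a simplified illustration, not the PR's code: `mayNeedBidi()` and `visualOrder()` are hypothetical names, the code point ranges are an approximation, and the levels are assumed to come from a bidi library such as fribidi. The reordering follows rule L2 of the Unicode Bidirectional Algorithm: from the highest level down to the lowest odd level, reverse every contiguous run of chars at that level or higher.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Cheap pre-check, so the expensive bidi processing can be skipped for pure
// LTR paragraphs. The ranges are approximate.
bool mayNeedBidi(const std::u32string& text) {
    for (char32_t c : text) {
        if ((c >= 0x0590 && c <= 0x08FF) ||  // Hebrew, Arabic, Syriac, Thaana...
            (c >= 0xFB1D && c <= 0xFDFF) ||  // Hebrew/Arabic presentation forms A
            (c >= 0xFE70 && c <= 0xFEFC) ||  // Arabic presentation forms B
            c == 0x200E || c == 0x200F ||    // LRM, RLM
            (c >= 0x202A && c <= 0x202E) ||  // embeddings and overrides
            (c >= 0x2066 && c <= 0x2069))    // isolates
            return true;
    }
    return false;
}

// Given the embedding level of each char of a line (in logical order), returns
// order[v] = logical index of the char shown at visual slot v (left to right).
std::vector<std::size_t> visualOrder(const std::vector<std::uint8_t>& levels) {
    std::vector<std::size_t> order(levels.size());
    for (std::size_t i = 0; i < order.size(); ++i)
        order[i] = i;
    int highest = -1, lowestOdd = 256;
    for (std::uint8_t l : levels) {
        highest = std::max<int>(highest, l);
        if ((l & 1) && l < lowestOdd)
            lowestOdd = l;
    }
    if (lowestOdd == 256)
        return order; // no RTL char on this line: logical order is visual order
    assert(lowestOdd <= highest);
    for (int level = highest; level >= lowestOdd; --level) {
        std::size_t i = 0;
        while (i < levels.size()) {
            if (levels[i] < level) { ++i; continue; }
            std::size_t j = i;
            while (j < levels.size() && levels[j] >= level)
                ++j;
            std::reverse(order.begin() + i, order.begin() + j);
            i = j;
        }
    }
    return order;
}
```

For levels `0 0 1 1 1 0` (an RTL word inside LTR text), the result is `0 1 4 3 2 5`: only the RTL word is reversed, while line breaking can keep working on the logical order.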

To allow links/text to be selected and highlighted in bidi/RTL text. Text selection should work fine in pure RTL text, but may get bogus in bidi text when the selection crosses bidi levels (where a single selection suddenly becomes two, as we step over a segment of the opposite direction and move back towards the previous segment...)
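
Why a single selection can become two is easiest to see on a small model. The sketch below is a hypothetical illustration (`visualRanges()` is not a crengine function): it takes a visual-to-logical map for one line and a logical selection, and returns the contiguous visual ranges that need highlighting.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// visualToLogical[v] is the logical index of the char shown at visual slot v.
// The selection covers logical indices [selStart, selEnd).
// Returns half-open visual ranges [first, second) to highlight.
std::vector<std::pair<std::size_t, std::size_t>>
visualRanges(const std::vector<std::size_t>& visualToLogical,
             std::size_t selStart, std::size_t selEnd) {
    assert(selStart <= selEnd);
    std::vector<std::pair<std::size_t, std::size_t>> ranges;
    bool open = false; // true while the previous visual slot was selected
    for (std::size_t v = 0; v < visualToLogical.size(); ++v) {
        std::size_t logical = visualToLogical[v];
        bool selected = logical >= selStart && logical < selEnd;
        if (selected && !open)
            ranges.push_back({v, v + 1});   // start a new visual range
        else if (selected)
            ranges.back().second = v + 1;   // extend the current one
        open = selected;
    }
    return ranges;
}
```

With the visual order `0 1 4 3 2 5` and a logical selection of chars 1 to 3, the highlighted slots are 1 and then 3 to 4: two separate rectangles for one logical selection.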

In EPUB, the <html> node of each embedded HTML file is not included in the generated single DOM. We now parse its attributes and forward them to be included as attribute of the followup <docFragment> element, so they are part of the DOM.

Frenzie · 2019-09-14T14:13:28Z

crengine/src/lvfntman.cpp

+        _embolden = true;
+        // A real bold font has weight 700, vs 400 for the regular.
+        // LVFontBoldTransform did +200, so we get 600 (demibold).
+        // Let's do the same (even if I don't see why not +300).


Errm maybe let's just do 300 then? :-)

Just because I don't see why doesn't mean there is no reason :)
It has no impact on the visible weight of the font; it's only used in the font match scoring. So maybe there is a reason for our fake bold font not to be 700: a real bold font will be 700 and might then be preferred. That's speculation, but I'm not taking any risk :)
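
To make that speculation concrete, here is a toy model of weight-based matching. It is purely an assumption for illustration: `Candidate` and `bestMatch()` are hypothetical, and crengine's real scoring takes more than the weight into account.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

struct Candidate {
    std::string name;
    int weight; // 400 = regular, 700 = bold
};

// Returns the index of the candidate whose weight is closest to the request.
// On a tie, the earliest candidate wins.
std::size_t bestMatch(const std::vector<Candidate>& candidates, int requested) {
    assert(!candidates.empty());
    std::size_t best = 0;
    for (std::size_t i = 1; i < candidates.size(); ++i) {
        if (std::abs(candidates[i].weight - requested) <
            std::abs(candidates[best].weight - requested))
            best = i;
    }
    return best;
}
```

In this model, a synthesized bold registered at 600 always loses to a real bold at 700 when 700 is requested, whereas registering it at 700 would create a tie that depends on ordering.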

In-page footnotes are now stacked vertically in the order in which they are encountered, following the reading direction, within a line.
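
A minimal sketch of that ordering, assuming each footnote link on a line is known by its horizontal position (`FootnoteLink` and `footnoteOrder()` are hypothetical names, not the PR's data structures):

```cpp
#include <algorithm>
#include <string>
#include <vector>

struct FootnoteLink {
    std::string id; // hypothetical footnote identifier
    int x;          // horizontal position of the link on the line
};

// Returns the footnote ids of one line in reading order: left to right for
// LTR lines, right to left for RTL lines.
std::vector<std::string> footnoteOrder(std::vector<FootnoteLink> links, bool rtl) {
    std::stable_sort(links.begin(), links.end(),
                     [rtl](const FootnoteLink& a, const FootnoteLink& b) {
                         return rtl ? a.x > b.x : a.x < b.x;
                     });
    std::vector<std::string> ids;
    for (const FootnoteLink& link : links)
        ids.push_back(link.id);
    return ids;
}
```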

Fix possible jumps in list items numbering

a93360d

Frenzie reviewed Sep 14, 2019

NiLuJe reviewed Sep 14, 2019

poire-z added 8 commits September 14, 2019 16:07

lvfntman.cpp: reformating for better readability

69e384a

lvtextfm.cpp: upgrade m_flags from lUInt8 to lUInt16

485101a

poire-z force-pushed the bidi_rtl_fonts branch from 12e542e to ce616f5 Compare September 14, 2019 14:14

Frenzie reviewed Sep 14, 2019

RTL: fix in-page footnotes order

04fc187

poire-z force-pushed the bidi_rtl_fonts branch 2 times, most recently from b93e0aa to 6ed9d00 Compare September 14, 2019 19:38

Fix a few clang-tidy warnings

c2727e7

poire-z force-pushed the bidi_rtl_fonts branch from 6ed9d00 to c2727e7 Compare September 14, 2019 22:35

poire-z merged commit 21aa3ef into koreader:master Sep 15, 2019

poire-z deleted the bidi_rtl_fonts branch September 15, 2019 06:35

This was referenced Sep 15, 2019

bump crengine: support RTL/bidi text, adds thirdparty/fribidi koreader/koreader-base#980

Merged

Enhanced text layout: links, thoughts and discussion #307

Open

poire-z mentioned this pull request Sep 28, 2019

Enhanced block rendering: handle RTL direction #312

Merged

poire-z mentioned this pull request Mar 11, 2021

Enhancement proposed: to display Open Type layout features. koreader/koreader#5821

Closed

poire-z mentioned this pull request Apr 5, 2021

Font weight buggins/coolreader#274

Merged
