Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support RTL/bidi text, some font and harfbuzz fixes #309

Merged
merged 11 commits into from
Sep 15, 2019

Conversation

poire-z
Copy link
Contributor

@poire-z poire-z commented Sep 14, 2019

See individual commit messages for details. Some technical notes in #307.

Some additional work on the block rendering code (list item bullets on the right, ordering of table cells/columns from right to left, H1,H2 text-align should be start instead of left in our epub.css...) is still needed to have full support for RTL documents - but a few style tweaks to force text right alignment may help in the meantime.

For now, we get the following (showing these limitations):
image
More screenshots of arabic and hebrew texts in koreader/koreader#5359 (comment).

Note: we won't draw highlights well when they span bidi segment boundaries, as Firefox would do well (I have no idea how to do that as well):
image

The whole <bdi> and <bdo> handling in lvrend.cpp was to have these kind of samples (found on the web) work as expected:
bdi_bdo


Some visual examples regarding the first commits related to fonts:

This is some text rendered with Harfbuzz with a font with its bold and italic variants present (note the different italic f, and the fi ligature in all variants):
with_bold_italic_fonts

If I remove the italic, bold and bold italic font files, it previously rendered (with former crengine embolden code loosing the harfbuzz fi ligature on the bold):
before

With this PR, it will render as (note the regular f in italic, and the preserved fi ligature):
after

This helps correctly rendering arabic drawn with the fallback font (here, our regular-only FreeSerif):
after_hb_fallback

Previously, with our limited to western ligatures harfbuzz code, it was this mess (even messier with a bigger font size):
image

Skip elements among siblings that are not list items.
By @pkb from buggins/coolreader#105
@@ -13364,6 +13364,10 @@ bool ldomNode::getNodeListMarker( int & counterValue, lString16 & marker, int &
css_style_ref_t cs = child->getStyle();
if ( cs.isNull() )
continue;
if ( cs->display!=css_d_list_item_block && cs->display!=css_d_list_item) {
// Alien element among list item nodes, skip it to not mess numbering
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alien, hehe. :-)

{
bool soft_hyphens_found = false;
for ( int i = 0; i<len; i++ ) {
if ( widths[i] + hyphCharWidth > maxWidth )
break;
if ( str[i] == UNICODE_SOFT_HYPHEN_CODE ) {
flags[i] |= LCHAR_ALLOW_HYPH_WRAP_AFTER;
switch ( flagSize ) {
Copy link
Member

@NiLuJe NiLuJe Sep 14, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless you ultimately intend to support more flagSize values, which seems unlikely, I'd swap that switch to a simple if (with the most common branch first) ;).

@@ -894,18 +906,30 @@ bool TexHyph::hyphenate( const lChar16 * str, int len, lUInt16 * widths, lUInt8
// p+2 because: +1 because word has a space prepended, and +1 because
// mask[] holds the flag for char n on slot n+1
if ( (mask[p+2-soft_hyphens_skipped]&1) && nw <= maxWidth ) {
flags[p] |= LCHAR_ALLOW_HYPH_WRAP_AFTER;
switch ( flagSize ) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto.

@@ -967,7 +991,19 @@ bool AlgoHyph::hyphenate( const lChar16 * str, int len, lUInt16 * widths, lUInt8
break;
}
if (!disabled)
flags[i] |= LCHAR_ALLOW_HYPH_WRAP_AFTER;
switch ( flagSize ) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto.

// correct gamma
if ( gammaIndex!=GAMMA_NO_CORRECTION_INDEX )
cr_correct_gamma_buf(item->bmp, w*h, gammaIndex);
} else
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't checked in context, so it's hard to tell from the diff view, is the removed { after the else expected here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You got me worried :) but no pb, the #endif is the counterpart of a ... #if 0, and that } else is still in it. So, from space, it looks like:

    }
    else {
        #if 0
        if ( bitmap->pixel_mode==FT_PIXEL_MODE_MONO ) {
            memset( item->bmp, 0, w*h );
            lUInt8 * srcrow = bitmap->buffer;
            lUInt8 * dstrow = item->bmp;
            for ( int y=0; y<h; y++ ) {
                lUInt8 * src = srcrow;
                for ( int x=0; x<w; x++ ) {
                    dstrow[x] =  ( (*src)&(0x80>>(x&7)) ) ? 255 : 0;
                    if ((x&7)==7)
                        src++;
                }
                srcrow += bitmap->pitch;
                dstrow += w;
            }
        } else
        #endif
        memcpy( item->bmp, bitmap->buffer, w*h );
        // correct gamma
        if ( gammaIndex!=GAMMA_NO_CORRECTION_INDEX )
            cr_correct_gamma_buf(item->bmp, w*h, gammaIndex);
    }

Copy link
Member

@NiLuJe NiLuJe Sep 14, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, but that commented out block would break if uncommented as-is, right?

(i.e., only the memcpy would be part of the "else" branch).

(Sidebar: this is partly why I'm really not a fan of the broken down } else { syntax and non-mandatory braces even for one-liners ;)).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. Better then to just add a // : } // else: ?
Otherwise, I'd need to add after these 4 active lines:

#if 0
}
#endif

@poire-z
Copy link
Contributor Author

poire-z commented Sep 14, 2019

Small bug noticed while looking at the first screenshot: footnote links are recorded in the left to right order, so a bit dis-ordered in the in-page footnotes.

No logical changes: just tab and indentation cleanups,
and comments.
We'll need more flags for char classification.
HyphMan hyphenate() (used by some other parts of the
code) then needs to be tweaked so it can work on
a table of lUInt8, as well as on our upgraded
table of lUInt16 flags.
Use Freetype embolden API instead of LVFontBoldTransform to
make fake bold (for fonts that do not provide a bold face).
This allows Harfbuzz to work with them (even if they won't
look as nice as if they came with a real bold font).
(LVFontBoldTransform is a wrapper that enlarge glyphs and
advances, so killing any Harfbuzz measurement and glyph
replacements.)
See comments for more details.
Even if the original text is italic and/or bold, when requesting
the fallback font, we always got the regular unbold variant - so
part of a sentence or word would jump out as strange.
With this and previous commit (Freetype embolden), we'll get many
things nicer when using our FreeSerif (which has no italic nor
bold variant) as the fallback font.

Also loosen fonts charset detection by no more looking for "azAZ09"
in a font (NotoSansMyanmar.ttf, for example, comes without them and
could then not even be used as a fallback font).

Also switch symbol fonts to the FT_ENCODING_MS_SYMBOL charmap
immediately on load - otherwise Harfbuzz would not see/use the
symbol glyphs.
measureText(): accepts an added "hints" parameter,
thru which we can pass text direction and additional
hints for Harfbuzz (begin/end of paragraph for now).

For freetype and harbuzz light, draw chars in the
reverse order when direction is RTL: this allows
individual RTL words to be drawn correctly, even
when not using Harfbuzz (and to witness the additional
Harfbuzz magic when switching to it).

DrawTextString(): returns advance, so we can draw
subsegments of the text with the fallback font, instead
of only individual chars are previously.
This is needed to get correct shaping with Harfbuzz with
the fallback font.

Harbuzz, in both measureText() and DrawTextString():
Don't use filterChar() with the main font, as the
not-found chars might be found in the fallback font;
no need to replace them thar early.
Rework glyph/cluster/text walking and drawing to be more
generic: should work with "one char > multiple glyphs"
situations (we previously handled correctly only "multiple
chars merged into a single glyph", like ligatures), and
with RTL text.
Also accumulate not-found glyphs so we can draw them as
a single segment with the fallback font, with harfbuzz
(instead of with Freetype previously): this is needed
when using a latin font as the main font, and be able
to see nice arabic drawn with the fallback font.
Drop letterspacing when the detected script is cursive.

Harfbuzz light:
hbCalcCharWidth(): skip triplet when any of the 3 chars is
not found, as it would mess up the result (and cached values)
when some char is combined with mutiple rare diacritic marks
(eg. Hebrew). This could mess the drawing of the whole text.
Fallback to more robust Freetype measurement in such cases.
Shouldn't have any impact on pure LTR text.

Additions to the usual LTR processing:

copyText(): detect if we have RTL chars (arabic, hebrew,
unicode bidi chars...) in a paragraph text, and use fribidi
only when we have some (fribidi processing is quite
expensive, so avoid it when not needed).
When we do, compute bidi levels, to be used for visual
reordering in addLine().

measureText():
Split measuring on bidi level change (and also on letter
spacing change - upstream fix by @pkb).
Provide fribidi segment direction and harfbuzz hints
to font->measureText() for correct Harfbuzz measurements.
Allow for ignoring some chars when measuring and drawing
(for now: the set of unicode bidirectionality chars that
can be used to tweak the unicode bidi algorithm, as some
font have glyphs for them).

When splitting the paragraph into lines, we continue
to process chars in the logical order.
Only in addLine(), with bidi paragraphs, we reorder
the chars (and flags, widths...) from the logical
line segment, into their visual order, and we make
words from the result the usual way, with some additional
flags (to help later with drawing, createXPointer()
and getRect()).

alignLine(): simple tweaks for RTL paragraphs:
- put text-indent on the right
- for justified text, align last (or single) line to
  the right
More work needs to be done on the block rendering code
for proper RTL layout (list items bullets on the right,
table columns ordered from right to left...)

lvrend.cpp: handle <bdi> and <bdo> elements, used to tweak
the unicode bidirectional algorithm.
Also add simple support for <q> by using a hardcoded single
set of quotes.
To allow links/text to be selected and highlighted in
bidi/RTL text.
Text selection should work fine in pure RTL text, but
may get bogus in bidi text when selection cross bidi
levels (where a single selection suddenly becomes 2 as
we are steping over a segment of the opposite direction,
and moving back towards the previous segment...)
In EPUB, the <html> node of each embedded HTML file
is not included in the generated single DOM.
We now parse its attributes and forward them to be
included as attribute of the followup <docFragment>
element, so they are part of the DOM.
_embolden = true;
// A real bold font has weight 700, vs 400 for the regular.
// LVFontBoldTransform did +200, so we get 600 (demibold).
// Let's do the same (even if I don't see why not +300).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Errm maybe let's just do 300 then? :-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not because "I don't see why" that there is not a why :)
It has no impact on the visible weight of the font, it's just used in the font match scoring, so may be there is a reason for our fake bold font to not be 700, as a real bold font will be 700 and then might be prefered - that's speculation, but not taking any risk :)

In-page footnotes are now vertically stacked in the order
they are met by the reading direction in a same line.
@poire-z poire-z force-pushed the bidi_rtl_fonts branch 2 times, most recently from b93e0aa to 6ed9d00 Compare September 14, 2019 19:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants