-
Notifications
You must be signed in to change notification settings - Fork 1.3k
[core] Use ICU for bidirectional text layout and Arabic text shaping #6984
Conversation
@ChrisLoer, thanks for your PR! By analyzing the history of the files in this pull request, we identified @jfirebaugh, @kkaefer and @brunoabinader to be potential reviewers. |
UErrorCode errorCode = U_ZERO_ERROR; | ||
|
||
std::u16string input16 = util::utf32_to_utf16::convert( input32 ); | ||
std::unique_ptr<UChar*> outputText = std::make_unique<UChar*>( new UChar[input16.size() * 2] ); // Maximum output of ubidi_transform is twice the size of input according to ubidi_transform.h |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is what valgrind is complaining about on the Qt5 Travis bot.
I suppose you mean std::make_unique<UChar[]>(input16.size() * 2);
instead?
@@ -15,7 +15,28 @@ class utf8_to_utf32 { | |||
boost::u8_to_u32_iterator<std::string::const_iterator> end(utf8.end()); | |||
return std::u32string(begin,end); | |||
} | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should rewrite this whole file in terms of std::codecvt
, dropping the boost dependency.
@@ -90,7 +92,7 @@ SymbolLayout::SymbolLayout(std::string bucketName_, | |||
u8string = platform::lowercase(u8string); | |||
} | |||
|
|||
ft.text = util::utf8_to_utf32::convert(u8string); | |||
ft.text = bidi.bidiTransform( util::utf8_to_utf32::convert( u8string ) ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since getGlyphRange
doesn't support anything beyond the Basic Multilingual Plane, are we better off using utf16 throughout this code, rather than going utf8 ⇢ utf32 ⇢ utf16 ⇢ utf32?
UErrorCode errorCode = U_ZERO_ERROR; | ||
|
||
std::u16string input16 = util::utf32_to_utf16::convert( input32 ); | ||
std::unique_ptr<UChar*> outputText = std::make_unique<UChar*>( new UChar[input16.size() * 2] ); // Maximum output of ubidi_transform is twice the size of input according to ubidi_transform.h |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
auto outputText = std::make_unique<UChar[]>(input16.size() * 2);
Awesome stuff @ChrisLoer! Btw, we usually add tags e.g. |
We'll want some rendering tests to go along with these changes. I suggest keeping them very simple, modeled on test cases such as this one which uses a single symbol layer plus a GeoJSON source to provide simple data. |
@@ -90,10 +92,10 @@ SymbolLayout::SymbolLayout(std::string bucketName_, | |||
u8string = platform::lowercase(u8string); | |||
} | |||
|
|||
ft.text = util::utf8_to_utf32::convert(u8string); | |||
ft.text = bidi.bidiTransform( util::utf8_to_utf16::convert( u8string ) ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I’m curious about how visual mirroring at this stage interacts with line wrapping. Does this PR address the concern raised in #6057 (comment)?
In #6057, I intentionally performed the mirroring in GlyphSet::lineWrap()
, which seemed to work fine for point-placed, wrapped, and line-placed labels: #6057 (comment).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No this approach has the same problem, I was punting on the line wrap issue, but we should probably do at least as much as you did and apply line wrapping before the transform. Having RTL also be bottom-to-top is obviously no good!
For fully correct line breaking support, we'd need to work with the ICU bidirectional algorithm at a lower level -- you start with ubidi_setParagraph
, then you find line breaks, then you use ubidi_setLine
on the previously set paragraph. It's necessary in order to correctly calculate the embedding levels. My intuition is that we'll be able to get away without doing that, at least at first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the other hand, the line wrapping algorithm doesn't really have correct shaping input if you don't do the transform ahead of time. One partial fix we could try is to detect the "paragraph level" direction and pass it into GlyphSet::lineWrap()
-- if the paragraph is RTL, then line break up instead of down. It wouldn't do the right thing if you had, say, 3 lines of LTR text embedded in an RTL paragraph, but that might be rare enough to ignore for now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bidi POI labels could get pretty long, for instance “مكتبة الإسكندرية Maktabat al-Iskandarīyah” for the Library of Alexandria. In this case, the LTR text is a third longer than the RTL text. I don’t know how many cases there are where the two are roughly even in length.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pushed my changes so you can take a look at how I did the line breaking, but the results for the Library of Alexandria are pretty disheartening.
6b83cec
to
5e25e6e
Compare
@@ -90,10 +92,12 @@ SymbolLayout::SymbolLayout(std::string bucketName_, | |||
u8string = platform::lowercase(u8string); | |||
} | |||
|
|||
ft.text = util::utf8_to_utf32::convert(u8string); | |||
std::u16string u16string = util::utf8_to_utf16::convert( u8string ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No spaces after (
or before )
in our house style. You can use clang-format to fix formatting up.
@kkaefer Is there a way to run clang-format on all modified files in a PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jfirebaugh uncomment this line and run make check
. This will run both clang-tidy
and clang-format
on modified files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ran clang-tidy and clang-format on all the files, and it looks like they hadn't been tidied in a while, so it introduced a lot of diffs unrelated to my changes. Should I still go ahead with the changes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you subset it further, to the code that's being modified for other reasons?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, just did a quick manual review and tried to only clang-format "new" code.
df14358
to
a2a6f2b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor nitpicking. This is close. Can you rebase on master, fix the merge conflicts, and see where CI is at?
U_SHAPE_LETTERS_SHAPE, // Add options here for handling numbers in bidirectional text | ||
&errorCode); | ||
|
||
if (U_FAILURE( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Put the comment above the if statement so this doesn't line-break so awkwardly.
// This just looks for the first character with a strong direction property, it does not perform | ||
// the BiDi algorithm | ||
return ubidi_getBaseDirection(input.c_str(), static_cast<int32_t>(input.size())) == UBIDI_RTL | ||
? true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to write <boolean expression> ? true : false
, just write <boolean expression>
.
a2a6f2b
to
8b7a6cb
Compare
@1ec5 you might want to double-check the changes I made to i18n.cpp while I was rebasing -- it was just moving from uint32_t characters to uint16_t characters, which I think should be fine since we're already limited to the BMP. I talked with @jfirebaugh about the line-breaking problem and we decided to start with the simplified rule of reversing the line breaks when the base direction of the label is RTL. Your heuristic for line breaking in #6057 gets better results, but it's not as easy to apply when we're doing the transformation ahead of time with the input string. Our plan is to start with this PR as a strict improvement on "totally broken", and then follow up with a PR that uses ICU's support for bidirectional line breaking. One other issue that we haven't discussed at all is handling of numbers in Arabic text. I just used the default behavior, and we end up with the same behavior OpenStreetMap uses (numbers are written LTR using Arabic-Indic digits: for example, see the "24" at https://www.openstreetmap.org/way/26177467). Google Maps uses European digits in the same case (see https://goo.gl/maps/E2yvjkVTp8R2). ICU shaping includes options that make it easy for us to choose either format regardless of the underlying data. |
#6057 was strictly about mirroring rather than complex text shaping. It mirrors after applying line breaking, taking advantage of the fact that the visual length of a line of text remains constant after mirroring. Similarly, I think it would be possible to individually text-shape each individual line after line breaking, since Arabic words are still delimited by spaces (unlike, say, Thai). The caveat is that some lines would shorten during the transformation, so it would appear as though we were breaking some lines prematurely. The result for purely RTL text would probably be severely width-constrained compared to the approach you’ve implemented; on the other hand, bidirectional text would finally read correctly.
This PR is certainly an improvement over master. I’m looking forward to the glorious ICU future. 👍 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
8b7a6cb
to
7435d4b
Compare
I ran a manual build for Android/arm-v7 and got a size change of +7%, from 4,903,788 bytes to 5,239,412 bytes |
// the BiDi algorithm | ||
return ubidi_getBaseDirection(input.c_str(), static_cast<int32_t>(input.size())) == UBIDI_RTL; | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please merge the code formatting changes into the previous commit? That makes the code changes a lot easier to follow.
float horizontalAlign, float verticalAlign, float justify, | ||
float spacing, const Point<float> &translate) const; | ||
void lineWrap(Shaping &shaping, float lineHeight, float maxWidth, float horizontalAlign, | ||
float verticalAlign, float justify, const Point<float> &translate, | ||
bool useBalancedIdeographicBreaking) const; | ||
bool useBalancedIdeographicBreaking, const bool rightToLeft) const; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We've been moving towards using enums like enum class WritingDirection : bool { LeftToRight, RightToLeft };
to avoid ambiguity in functions like this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would jive nicely with the WritingMode bitfield that I'm implementing for #1682.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, moved this into a WritingDirection
enum defined in bidi.hpp
. @1ec5, I didn't see a PR that already had a WritingMode
bitfield, so hopefully this still jives nicely.
#include <codecvt> | ||
#include <locale> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
null change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed. So clang-format
will put includes in alphabetical order... Is our style "add includes in alphabetical order, but don't reorder existing ones"?
BiDi::BiDi() | ||
{ | ||
UErrorCode errorCode = U_ZERO_ERROR; | ||
transform = ubiditransform_open(&errorCode); // Only error is failure to allocate memory, in that case ubidi_transform would fall back to creating transform object on the fly |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're recording the error code, but don't do anything with it. Can we either fail hard here, or add checks in BiDi::bidiTransform
that don't use the transform
object if it's not initialized?
std::u16string bidiTransform( const std::u16string& ); | ||
|
||
private: | ||
UBiDiTransform* transform; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Initialize to nullptr
?
@@ -11,6 +11,7 @@ class SymbolFeature { | |||
public: | |||
GeometryCollection geometry; | |||
optional<std::u16string> text; | |||
optional<bool> rightToLeft; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use an enum WritingDirection
here for clarity
7435d4b
to
18568fc
Compare
6e62c60
to
ae64c57
Compare
#7098 should fix the test-suite errors. |
… conversions and reduce in-memory size. Continue to use uint32 as glyph ID to maintain Glyph PBF, even though we're only using 16 bits of that uint32. Use std::codecvt instead of boost::unicode_iterator for UTF8->UTF16 conversions.
ae64c57
to
24c3595
Compare
… shaping. Apply bidi and shaping in symbol_layout. Add utility functions for converting to and from UTF-16.
24c3595
to
39a6d73
Compare
No description provided.