Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Linebreaking at dashes #364

Closed
Jellby opened this issue Aug 6, 2020 · 68 comments
Closed

Linebreaking at dashes #364

Jellby opened this issue Aug 6, 2020 · 68 comments

Comments

@Jellby
Copy link
Contributor

Jellby commented Aug 6, 2020

I'm hoping the linebreaking at dashes can be improved... In something like " on the sea—’ " I get a linebreak between the a and the em-dash (note it's a dash followed by a closing single quote). I would suggest a linebreak at a dash (before or after) is only allowed if there are letters on both sides of the dash (e.g. "that he—or she"), but not if there is a punctuation on either side.

That's for English, for Spanish we pretty much want to never allow a linebreak at a dash, but since dashes are always followed or preceded by a space or punctuation, it should be covered in the above suggestion too.

@NiLuJe
Copy link
Member

NiLuJe commented Aug 6, 2020

Is there anything funky in the markup that might not be obvious at first glance? (soft hyphen?).

Otherwise, it's the UAX #14 & #29 rules as applied by libunibreak, and I'm not quite sure what they say about this case ;).

@poire-z
Copy link
Contributor

poire-z commented Aug 6, 2020

We follow (thanks to libunibreak, we don't have to understand it :) the Unicode Line Breaking Algorithm (aka UAX#14), which says about EMDASH:
https://www.unicode.org/reports/tr14/tr14-22.html#B2

The EM DASH is used to set off parenthetical text. Normally, it is used without spaces. However, this is language dependent. For example, in Swedish, spaces are used around the EM DASH. Line breaks can occur before and after an EM DASH. Because EM DASHes are sometimes used in pairs instead of a single quotation dash, the default behavior is not to break the line between even though not all fonts use connecting glyphs for the EM DASH.

.

I would suggest a linebreak at a dash (before or after) is only allowed if there are letters on both sides of the dash (e.g. "that he—or she"), but not if there is a punctuation on either side.

So, go suggest that to these people :)

We can tweak it a bit by typography language, which we do for quotation marks:

#if USE_LIBUNIBREAK==1
// libunibreak per-language LineBreakProperties extensions
//
// Rules extracted from libunibreak/src/linebreakdef.c, so we can adapt
// them and build LineBreakProperties adequately for more languages.
// See https://en.wikipedia.org/wiki/Quotation_mark
// These are mostly need only for languages that may add a space between
// the quote and its content - otherwise, the quote will be part of the
// word it sticks to, and break will be allowed on the other side which
// probably is a space.
// When a language allows the use of unpaired quotes (same quote on both
// sides), it seems best to not specify anything.
bool has_left_single_quotation_mark_opening = false; // U+2018 ‘
bool has_left_single_quotation_mark_closing = false;
bool has_right_single_quotation_mark_opening = false; // U+2019 ’
bool has_right_single_quotation_mark_closing = false;
bool has_right_single_quotation_mark_glue = false;
bool has_left_double_quotation_mark_opening = false; // U+201C “
bool has_left_double_quotation_mark_closing = false;
bool has_right_double_quotation_mark_opening = false; // U+201D ”
bool has_right_double_quotation_mark_closing = false;
bool has_left_single_angle_quotation_mark_opening = false; // U+2039 ‹
bool has_left_single_angle_quotation_mark_closing = false;
bool has_right_single_angle_quotation_mark_opening = false; // U+203A ›
bool has_right_single_angle_quotation_mark_closing = false;
bool has_left_double_angle_quotation_mark_opening = false; // U+00AB «
bool has_left_double_angle_quotation_mark_closing = false;
bool has_right_double_angle_quotation_mark_opening = false; // U+00BB »
bool has_right_double_angle_quotation_mark_closing = false;
// Note: these macros use 'lang_tag'.
if ( LANG_STARTS_WITH(("en")) ) { // English
has_left_single_quotation_mark_opening = true; // no right..closing in linebreakdef.c
has_left_double_quotation_mark_opening = true;
has_right_double_quotation_mark_closing = true;
}
else if ( LANG_STARTS_WITH(("fr") ("es")) ) { // French, Spanish
has_left_single_quotation_mark_opening = true; // no right..closing in linebreakdef.c
has_left_double_quotation_mark_opening = true;
has_right_double_quotation_mark_closing = true;
has_left_single_angle_quotation_mark_opening = true;
has_right_single_angle_quotation_mark_closing = true;
has_left_double_angle_quotation_mark_opening = true;
has_right_double_angle_quotation_mark_closing = true;
}
else if ( LANG_STARTS_WITH(("de")) ) { // German
has_left_single_quotation_mark_closing = true;
has_right_single_quotation_mark_glue = true;
has_left_double_quotation_mark_closing = true;
has_left_single_angle_quotation_mark_closing = true;
has_right_single_angle_quotation_mark_opening = true;
has_left_double_angle_quotation_mark_closing = true;
has_right_double_angle_quotation_mark_opening = true;
}
else if ( LANG_STARTS_WITH(("ru")) ) { // Russian
has_left_double_quotation_mark_closing = true;
has_left_double_angle_quotation_mark_opening = true;
has_right_double_angle_quotation_mark_closing = true;
}
else if ( LANG_STARTS_WITH(("zh")) ) { // Chinese
has_left_single_quotation_mark_opening = true;
has_right_single_quotation_mark_closing = true;
has_left_double_quotation_mark_opening = true;
has_right_double_quotation_mark_closing = true;
}
// Add languages rules here, or reuse previous one with other languages if needed.
// Set up _lb_props.
// Important: the unicode indices must be in strict ascending order (or libunibreak
// might abort checking them all)
int n = 0;
if ( has_left_double_angle_quotation_mark_opening ) _lb_props[n++] = { 0x00AB, 0x00AB, LBP_OP };
if ( has_left_double_angle_quotation_mark_closing ) _lb_props[n++] = { 0x00AB, 0x00AB, LBP_CL };
// Soft-Hyphens are handled by Hyphman hyphenate(), have them handled as Zero-Width-Joiner by
// libunibreak so they don't allow any break and don't prevent hyphenate() to handle them correctly.
_lb_props[n++] = { 0x00AD, 0x00AD, LBP_ZWJ };
if ( has_right_double_angle_quotation_mark_opening ) _lb_props[n++] = { 0x00BB, 0x00BB, LBP_OP };
if ( has_right_double_angle_quotation_mark_closing ) _lb_props[n++] = { 0x00BB, 0x00BB, LBP_CL };
if ( has_left_single_quotation_mark_opening ) _lb_props[n++] = { 0x2018, 0x2018, LBP_OP };
if ( has_left_single_quotation_mark_closing ) _lb_props[n++] = { 0x2018, 0x2018, LBP_CL };
if ( has_right_single_quotation_mark_opening ) _lb_props[n++] = { 0x2019, 0x2019, LBP_OP };
if ( has_right_single_quotation_mark_closing ) _lb_props[n++] = { 0x2019, 0x2019, LBP_CL };
if ( has_right_single_quotation_mark_glue ) _lb_props[n++] = { 0x2019, 0x2019, LBP_GL };
if ( has_left_double_quotation_mark_opening ) _lb_props[n++] = { 0x201C, 0x201C, LBP_OP };
if ( has_left_double_quotation_mark_closing ) _lb_props[n++] = { 0x201C, 0x201C, LBP_CL };
if ( has_right_double_quotation_mark_opening ) _lb_props[n++] = { 0x201D, 0x201D, LBP_OP };
if ( has_right_double_quotation_mark_closing ) _lb_props[n++] = { 0x201D, 0x201D, LBP_CL };
if ( has_left_single_angle_quotation_mark_opening ) _lb_props[n++] = { 0x2039, 0x2039, LBP_OP };
if ( has_left_single_angle_quotation_mark_closing ) _lb_props[n++] = { 0x2039, 0x2039, LBP_CL };
if ( has_right_single_angle_quotation_mark_opening ) _lb_props[n++] = { 0x203A, 0x203A, LBP_OP };
if ( has_right_single_angle_quotation_mark_closing ) _lb_props[n++] = { 0x203A, 0x203A, LBP_CL };
// End of list
_lb_props[n++] = { 0, 0, LBP_Undefined };
// Done with libunibreak per-language LineBreakProperties extensions
// Other line breaking and text layout tweaks
_lb_char_sub_func = NULL;
if ( LANG_STARTS_WITH(("pl")) ) { // Polish
_lb_char_sub_func = &lb_char_sub_func_polish;
_duplicate_real_hyphen_on_next_line = true;
}
if ( LANG_STARTS_WITH(("cs") ("sk")) ) { // Czech, Slovak
_lb_char_sub_func = &lb_char_sub_func_czech_slovak;
}
if ( LANG_STARTS_WITH(("pt")) ) { // Portuguese
_duplicate_real_hyphen_on_next_line = true;
}
#endif

By "a bit", I mean we can set some alternative "line breaking class" to some unicode codepoint (or substitute it with another char) - so the same algorithm applies, but consider that char differently.
We can't (= I don't want to :) tweak the algorithm - it's a chance we can delegate this (we used to do it ourselves, and it was painful and wrong - and I think we used to get what I suggest below about emdash being wide with the old homemade algorithm).

So, you'd need to understand the algo and classes a bit to see if some other class would make emdash behave as you wish - and actually test it :)
(I won't dig into it , it's nighmarish, see #337 (comment)).

Also, as an emdash is quite wide, preventing a break on either side might pull some stuff on the next line, and cause large interword spacing on the previous line. So, I guess in the emdash case, the best might be the enemy of the good,

@Jellby
Copy link
Contributor Author

Jellby commented Aug 6, 2020

Is there anything funky in the markup that might not be obvious at first glance? (soft hyphen?).

No, nothing like that, no HTML code or invisible characters.

We follow (thanks to libunibreak, we don't have to understand it :) the Unicode Line Breaking Algorithm (aka UAX#14)

Yes, I have this problem with all ebook renderers I have seen... As I said, I was hoping things could be improved/tweaked in KOReader.

By "a bit", I mean we can set some alternative "line breaking class" to some unicode codepoint (or substitute it with another char) - so the same algorithm applies, but consider that char differently.

Does the substitution need to be global or could it be done based on context (i.e. surrounding characters)?

Also, as an emdash is quite wide, preventing a break on either side might pull some stuff on the next line, and cause large interword spacing on the previous line.

Well, I would prefer some wider spacing to this linebreak. To me, it's like breaking before a ? o after a (. I would welcome at least an option to prevent all linebreaks at an em-dash (personally, I use space-endash-space where a break would be allowed). Maybe that's "QU: Ambiguous Quotation"?

@poire-z
Copy link
Contributor

poire-z commented Aug 6, 2020

Maybe that's "QU: Ambiguous Quotation"?

I don't know. All classes interacts in various ways with the classes the previous or next char has...
We used to have this https://unicode.org/cldr/utility/breaks.jsp for testing this algorithm (and all others), but it's not available anymore :( (just saw that today, I hope it will be back...)
There is a Javascript implementation, that might make it easier to hack something to quickly test:
https://github.com/foliojs/linebreak

on the sea—’ it's a dash followed by a closing single quote

Dunno which should win, the emdash that allows a break after - or a closing single quote that prevents a break before. But its closing status might depend on the language, cf my code snippet above. Can you check if it behaves better whether you select Typography language to be German or Chinese?

Does the substitution need to be global or could it be done based on context (i.e. surrounding characters)?

We can have function substitution per language, and they got access to context, ex:

#if USE_LIBUNIBREAK==1
lChar16 lb_char_sub_func_polish(const lChar16 * text, int pos, int next_usable) {
// https://github.com/koreader/koreader/issues/5645#issuecomment-559193057
// Letters aiouwzAIOUWS are prepositions that should not be left at the
// end of a line.
// Make them behave (for libunibreak) just like a opening paren (which
// being LBC_OP, will prevent a line break after it, even if followed
// by a space).
if ( pos >= 1 && text[pos-1] == ' ' ) {
switch ( text[pos] ) {

This might add some small overhead if it were global to all languages, and I'm not sure your rule apply to all languages.
Also, without a language, there, you wouldn't know if the right single quote is closing or opening, and you don't have easy access to the _lb_props rules to know. But with a language, you could make it to detect them as the _lb_props rules would set them.

is only allowed if there are letters on both sides of the dash (e.g. "that he—or she"), but not if there is a punctuation on either side.

que lui—« ou elle (probably not common in french) I wouldn't want you to prevent the wrapping after the dash :) So, it depends on the punctuation, whether it's opening or closing, which depends on the language...

Yes, I have this problem with all ebook renderers

personally, I use space-endash-space where a break would be allowed

Well, if that's so general and no renderer renders them as you expect - may be it's the publisher's fault: they could/should use emdash + zero-width-word-joiner + right quote - like they should use letter + no-break-space + ? (even if that one might be covered by the algorithm).

(https://en.wikipedia.org/wiki/Dash used to have a lot more examples in many languages - that I don't see anymore - I remember it was very variable, and I guess the Unicode people did the best they could to do the less bad things in the more cases)
(I understand it's bothering - and I would feel the same if the algo would break in between blah ».)

@Jellby
Copy link
Contributor Author

Jellby commented Aug 7, 2020

on the sea—’ it's a dash followed by a closing single quote

Dunno which should win, the emdash that allows a break after - or a closing single quote that prevents a break before.

That's not where the problem appears, I haven't (so far) seen a break between the dash and the quote, the problem is before the dash.

que lui—« ou elle (probably not common in french) I wouldn't want you to prevent the wrapping after the dash :) So, it depends on the punctuation, whether it's opening or closing, which depends on the language...

Good point, you could of course write it in English: that he—“or she. Still, I believe fewer breaks (at the expense of wider spaces) are less evil than breaks at inappropriate places, but maybe that's just me.

Well, if that's so general and no renderer renders them as you expect - may be it's the publisher's fault: they could/should use emdash + zero-width-word-joiner + right quote - like they should use letter + no-break-space + ? (even if that one might be covered by the algorithm).

Yes, I thought about that... but it feels wrong (and inconvenient) to have to add these invisible codes. But maybe I'm spoiled by LaTeX (where, among other things, you don't need to add the no-break spaces before/after French punctuation, because that can be handled automatically).

About every renderer doing the "wrong" thing... I expect most will use pretty dumb algorithms, but when I saw the "typography rules" menu in KOReader I hoped I could finally get something smarter going on ;)

@poire-z
Copy link
Contributor

poire-z commented Aug 7, 2020

when I saw the "typography rules" menu in KOReader I hoped I could finally get something smarter going on ;)

It's there to allow us trying to be smarter :)
But we just can't go hack stuff because in this specific case - or with the badly published book you're reading - it would make things better. We need to make rules, and think about their effects with all languages and situations - and possibly tweak that to only some language if these cases might happen really often in one language - and rarely with others where the new rules might have bad effects.

So, it just requires some thinking :) And I can't think about an obvious solution right now. But more thinking might gives something.
You might have a look at LaTeX code how it handles that - might give some new ideas.

@poire-z
Copy link
Contributor

poire-z commented Aug 7, 2020

That's not where the problem appears, I haven't (so far) seen a break between the dash and the quote, the problem is before the dash.

Another thing about the UAX14 algorithm/libunibreak that I got: when fed a char, it gives the line break status between the previous char and this char. It can't go back. The only thing it can do it postpone the decision for a line break and keep giving "no break" until that decision (this is what happens with consecutive spaces, as it doesn't know what will come next: the first spaces have no break, and only between the last one and what comes next, we can have a "break allowed").
So, I guess the algorithm alone itself, no matter how we changes line breaking classes, can't solve your issue with on the sea—’ and whether to break after a when we meet 2 slots later a .

So, if that's gonna be fixed, I guess it can only by via a lb_char_sub_func_english() or lb_char_sub_func_generic().
But I don't know, I've never seen such constructs (on the sea—’ ) in French.
Is that really a standard construct in English ? worth fixing ? @Frenzie @NiLuJe ?

for Spanish we pretty much want to never allow a linebreak at a dash

If that's really the case, I guess the right line breaking class could solve it. But you'll have to be certain of that never :)

@Jellby
Copy link
Contributor Author

Jellby commented Aug 7, 2020

I agree completely. Except for the "badly published book" part... I actually read mostly ebooks I create myself, so I know the code ;) In my experience, "retail" ebooks are the same (in this particular aspect) or worse (they'd often have a mishmash of hyphens, en-dashes and em-dashes used with no consistency, "smart" quotes facing the wrong side, etc.).

Anyway, my intent when creating this issue was not to push my view and demand any action, but to call the attention on a possible area of improvement. And not to sound like a whiner, I also put forward a suggestion. I'll try to think of a more reasonable solution and test it somewhere.

As a poor-man's solution, would it be possible/easy to have a toggleable option to "never break at an em-dash" (possibly en-dash too)? At least in English the use of dashes varies with publishers and styles, and it may be beneficial to enable this for some books and not others... and it may also be helpful for other languages and people like me who are more bugged by wrong linebreaks than by large white spaces ;)

@Jellby
Copy link
Contributor Author

Jellby commented Aug 7, 2020

But I don't know, I've never seen such constructs (on the sea—’ ) in French.
Is that really a standard construct in English ? worth fixing ? @Frenzie @NiLuJe ?

It is very common (in my experience at least) to use an em-dash as a sort of ellipsis for interrupted speech. There may be a difference between American/British English, but I find it all the time.

for Spanish we pretty much want to never allow a linebreak at a dash

If that's really the case, I guess the right line breaking class could solve it. But you'll have to be certain of that never :)

Well, Unicode would call it a "quotation dash", but that's bullshit, there's only a single dash character in Spanish, and it's used both for starting a quotation and for parenthetical sentences —like this—, and it is used as I did, with space on one side, possibly with punctuation. You want breaks where the spaces are, the dashes don't allow any break that a normal letter wouldn't.

I might consider using different dash characters in Spanish the day I see widespread use of different characters for a right single quote mark and a (curly) apostrophe in English...

@NiLuJe
Copy link
Member

NiLuJe commented Aug 7, 2020

I do like the interrupted speech approach in English (as well as the parenthetical one, for that matter), but I'm absolutely not bothered by breaking around the dash in both of these cases ;).

@NiLuJe
Copy link
Member

NiLuJe commented Aug 7, 2020

Then again, I wholeheartedly fall on the "break more" than the "more rivers" side of the equation ;).

@Frenzie
Copy link
Member

Frenzie commented Aug 7, 2020

@poire-z

Is that really a standard construct in English ?

It's plenty standard. It indicates a sudden stop. "I went to the—"

Do you do something different in French? I don't recall encountering suddenly interrupted speech otoh.

It's described here: https://en.wikipedia.org/wiki/Dash#Interruption_of_a_speaker

@poire-z
Copy link
Contributor

poire-z commented Aug 7, 2020

I actually read mostly ebooks I create myself, so I know the code ;)

At least, you can fix that in the source then (zero width no break space/word joiner) :)
Another option would be to wrap your emdash into <SPAN class="emdash">-</SPAN> and use some CSS (via style tweaks, or in your embedded CSS):
span.emdash::before { content: "\A0"; } /* insert a nbsp before */

Anyway, my intent when creating this issue was not to push my view and demand any action, but to call the attention on a possible area of improvement. And not to sound like a whiner, I also put forward a suggestion.

And I took it like that, and tried/am trying to think with you about the best solution - suggesting some caveats I can think ok, so it's not a quick suggestion/decision that would cause other issues later.

As a poor-man's solution, would it be possible/easy to have a toggleable option to "never break at an em-dash" (possibly en-dash too)?

I'm not keen on having many toggables (mostly because it's a pain to propagate from the UI to the engine, and to explain them). The Typography language abstraction was to hide that (my initial idea with using flags as you'd like them at #307 (comment) - which evolved into per language flags #307 (comment) - to finally have it hardcoded in crengine).

But if ever we'd need to have minor tweaks to libunibreak/UAX14, that we can't associate to a language, or to text contexts that would be valid in all languages, and should be targeted/toggable by the user, I'd go with a style tweak:
body { -cr-hint: uax14-moded-by-jellby; } (or another anonymous name to protect the not so innocents) - and we could put various mods in there - but it would be all these, or standard UAX14 (tweaked by language).

Well, Unicode would call it a "quotation dash", but that's bullshit, there's only a single dash character in Spanish

Well Unicode just calls it a EM DASH :) and there's a single EM DASH codepoint in the whole world languages :)
https://unicode.org/charts/PDF/U2000.pdf
image
The thing is that Unicode does not know the context: it can only (rarely) suggest to use another codepoint (that may look exactly similar) to get another behaviour.

@poire-z
Copy link
Contributor

poire-z commented Aug 7, 2020

It's plenty standard. It indicates a sudden stop. "I went to the—"

Then, a sudden line break should enhance the effect and look even better ! :)

I went to the
—"             (oh, more surprise, I didn't expect that!)

Do you do something different in French? I don't recall encountering suddenly interrupted speech otoh.

Mhhh, neither do I. I don't know how we render that. (But we're very polite and don't interrupt other people speech :)

@Jellby
Copy link
Contributor Author

Jellby commented Aug 7, 2020

Well Unicode just calls it a EM DASH :)

Look again at 2015. That's what they want us to use in some places. The fact that it might be not supported by the font or that it may actually look different from the em-dash (which would be used in other places) is yet another reason not to use it.

Then, a sudden line break should enhance the effect and look even better ! :)

Ha. It doesn't work so good (even if I could admit that's any good :P) when even a single word can be interrupted:

‘Um,’ he said. ‘Actually, it’s not at all ba
—’

or the dash can be used to "edit" a name that one doesn't want to spell fully:

I met him with Mrs. P
—, but I didn't know her at the time

@Jellby
Copy link
Contributor Author

Jellby commented Aug 7, 2020

Another thing about the UAX14 algorithm/libunibreak that I got: when fed a char, it gives the line break status between the previous char and this char. It can't go back.

But could one say, for instance:

if this particular dash is (preceded by this or that) or (followed by whatever):
  treat it as class 1
otherwise:
  treat it as class 2

that, of course, would be applied to every dash in a given (or maybe all) language

@Frenzie
Copy link
Member

Frenzie commented Aug 7, 2020

Mhhh, neither do I. I don't know how we render that. (But we're very polite and don't interrupt other people speech :)

We Dutch do, but there are many things that could interrupt you. I think it's more typically used for those other kinds of interruptions.

@poire-z
Copy link
Contributor

poire-z commented Aug 7, 2020

But could one say, for instance:

if this particular dash is (preceded by this or that) or (followed by whatever):
  treat it as class 1
otherwise:
  treat it as class 2

That's not how the UAX14 algo works I think. A codepoint has a single LB class.
Its the flow of char classes (a class preceded by such class => decision, with possibly some decision postponed until another class is met) that decides about allow/forbid line break.
May be there are other algos that can look back/ahead more, make corrections, etc... dunno. But the fact is UAX14 is somehow linear and then can get pretty faster than alternatives.

But that's somehow what our lb_char_sub_func_xxx() can do:

if this particular dash is (preceded by this or that) or (followed by whatever):
  treat it as this char1 (so, the class1 associated to char1)
otherwise:
  treat it as this char2 (so, the class2 associated to char2)

@Jellby
Copy link
Contributor Author

Jellby commented Aug 7, 2020

May I ask how the RIGHT SINGLE QUOTATION MARK is interpreted in English? Is it assigned a "Closing Punctuation" class? I've been testing with the textseg python package and it seems that would allow a break after it, even if there is no space. Is that the case? I presume the linebreaking class does not affect the identification of allowed hyphenation points, right?

Also, it looks like the textlang.cpp tries to find nested quotes. Is there any provision for the fact that, in Spanish at least, a closing sign is used at the beginning of the second and later paragraphs of a multi-paragraph quote? I guess it doesn't really matter much in practice, since by definition it has to be the first character in the line, and therefore unlikely to get a linebreak anyway. But maybe that's a matter for a different issue.

@poire-z
Copy link
Contributor

poire-z commented Aug 7, 2020

Try reading this:

#if USE_LIBUNIBREAK==1
// libunibreak per-language LineBreakProperties extensions
//
// Rules extracted from libunibreak/src/linebreakdef.c, so we can adapt
// them and build LineBreakProperties adequately for more languages.
// See https://en.wikipedia.org/wiki/Quotation_mark
// These are mostly need only for languages that may add a space between
// the quote and its content - otherwise, the quote will be part of the
// word it sticks to, and break will be allowed on the other side which
// probably is a space.
// When a language allows the use of unpaired quotes (same quote on both
// sides), it seems best to not specify anything.
bool has_left_single_quotation_mark_opening = false; // U+2018 ‘
bool has_left_single_quotation_mark_closing = false;
bool has_right_single_quotation_mark_opening = false; // U+2019 ’
bool has_right_single_quotation_mark_closing = false;
bool has_right_single_quotation_mark_glue = false;
bool has_left_double_quotation_mark_opening = false; // U+201C “
bool has_left_double_quotation_mark_closing = false;
bool has_right_double_quotation_mark_opening = false; // U+201D ”
bool has_right_double_quotation_mark_closing = false;
bool has_left_single_angle_quotation_mark_opening = false; // U+2039 ‹
bool has_left_single_angle_quotation_mark_closing = false;
bool has_right_single_angle_quotation_mark_opening = false; // U+203A ›
bool has_right_single_angle_quotation_mark_closing = false;
bool has_left_double_angle_quotation_mark_opening = false; // U+00AB «
bool has_left_double_angle_quotation_mark_closing = false;
bool has_right_double_angle_quotation_mark_opening = false; // U+00BB »
bool has_right_double_angle_quotation_mark_closing = false;
// Note: these macros use 'lang_tag'.
if ( LANG_STARTS_WITH(("en")) ) { // English
has_left_single_quotation_mark_opening = true; // no right..closing in linebreakdef.c
has_left_double_quotation_mark_opening = true;
has_right_double_quotation_mark_closing = true;
}
else if ( LANG_STARTS_WITH(("fr") ("es")) ) { // French, Spanish
has_left_single_quotation_mark_opening = true; // no right..closing in linebreakdef.c
has_left_double_quotation_mark_opening = true;
has_right_double_quotation_mark_closing = true;
has_left_single_angle_quotation_mark_opening = true;
has_right_single_angle_quotation_mark_closing = true;
has_left_double_angle_quotation_mark_opening = true;
has_right_double_angle_quotation_mark_closing = true;
}
else if ( LANG_STARTS_WITH(("de")) ) { // German
has_left_single_quotation_mark_closing = true;
has_right_single_quotation_mark_glue = true;
has_left_double_quotation_mark_closing = true;
has_left_single_angle_quotation_mark_closing = true;
has_right_single_angle_quotation_mark_opening = true;
has_left_double_angle_quotation_mark_closing = true;
has_right_double_angle_quotation_mark_opening = true;
}
else if ( LANG_STARTS_WITH(("ru")) ) { // Russian
has_left_double_quotation_mark_closing = true;
has_left_double_angle_quotation_mark_opening = true;
has_right_double_angle_quotation_mark_closing = true;
}
else if ( LANG_STARTS_WITH(("zh")) ) { // Chinese
has_left_single_quotation_mark_opening = true;
has_right_single_quotation_mark_closing = true;
has_left_double_quotation_mark_opening = true;
has_right_double_quotation_mark_closing = true;
}
// Add languages rules here, or reuse previous one with other languages if needed.
// Set up _lb_props.
// Important: the unicode indices must be in strict ascending order (or libunibreak
// might abort checking them all)
int n = 0;
if ( has_left_double_angle_quotation_mark_opening ) _lb_props[n++] = { 0x00AB, 0x00AB, LBP_OP };
if ( has_left_double_angle_quotation_mark_closing ) _lb_props[n++] = { 0x00AB, 0x00AB, LBP_CL };
// Soft-Hyphens are handled by Hyphman hyphenate(), have them handled as Zero-Width-Joiner by
// libunibreak so they don't allow any break and don't prevent hyphenate() to handle them correctly.
_lb_props[n++] = { 0x00AD, 0x00AD, LBP_ZWJ };
if ( has_right_double_angle_quotation_mark_opening ) _lb_props[n++] = { 0x00BB, 0x00BB, LBP_OP };
if ( has_right_double_angle_quotation_mark_closing ) _lb_props[n++] = { 0x00BB, 0x00BB, LBP_CL };
if ( has_left_single_quotation_mark_opening ) _lb_props[n++] = { 0x2018, 0x2018, LBP_OP };
if ( has_left_single_quotation_mark_closing ) _lb_props[n++] = { 0x2018, 0x2018, LBP_CL };
if ( has_right_single_quotation_mark_opening ) _lb_props[n++] = { 0x2019, 0x2019, LBP_OP };
if ( has_right_single_quotation_mark_closing ) _lb_props[n++] = { 0x2019, 0x2019, LBP_CL };
if ( has_right_single_quotation_mark_glue ) _lb_props[n++] = { 0x2019, 0x2019, LBP_GL };
if ( has_left_double_quotation_mark_opening ) _lb_props[n++] = { 0x201C, 0x201C, LBP_OP };
if ( has_left_double_quotation_mark_closing ) _lb_props[n++] = { 0x201C, 0x201C, LBP_CL };
if ( has_right_double_quotation_mark_opening ) _lb_props[n++] = { 0x201D, 0x201D, LBP_OP };
if ( has_right_double_quotation_mark_closing ) _lb_props[n++] = { 0x201D, 0x201D, LBP_CL };
if ( has_left_single_angle_quotation_mark_opening ) _lb_props[n++] = { 0x2039, 0x2039, LBP_OP };
if ( has_left_single_angle_quotation_mark_closing ) _lb_props[n++] = { 0x2039, 0x2039, LBP_CL };
if ( has_right_single_angle_quotation_mark_opening ) _lb_props[n++] = { 0x203A, 0x203A, LBP_OP };
if ( has_right_single_angle_quotation_mark_closing ) _lb_props[n++] = { 0x203A, 0x203A, LBP_CL };
// End of list
_lb_props[n++] = { 0, 0, LBP_Undefined };
// Done with libunibreak per-language LineBreakProperties extensions
// Other line breaking and text layout tweaks
_lb_char_sub_func = NULL;
if ( LANG_STARTS_WITH(("pl")) ) { // Polish
_lb_char_sub_func = &lb_char_sub_func_polish;
_duplicate_real_hyphen_on_next_line = true;
}
if ( LANG_STARTS_WITH(("cs") ("sk")) ) { // Czech, Slovak
_lb_char_sub_func = &lb_char_sub_func_czech_slovak;
}
if ( LANG_STARTS_WITH(("pt")) ) { // Portuguese
_duplicate_real_hyphen_on_next_line = true;
}
#endif

(If i re-read my #337 (comment)):
For all the stuff that stays false, we stay with the default class of "quotation" http://www.unicode.org/reports/tr14/#QU - which usually prevents break between other such stuff (as we don't know if it's opening or closing).
For all the stuff that we set to true for some languages, they get the OP(ening), CL(osing) or GL(ue) - and this allows more break (between a closing one and an opening one).

May I ask how the RIGHT SINGLE QUOTATION MARK is interpreted in English? Is it assigned a "Closing Punctuation" class?

So, it seems it is not. It stays QU.

I presume the linebreaking class does not affect the identification of allowed hyphenation points

Well, hyphenation happen inside words, that should all fully be non-breakable.

Also, it looks like the textlang.cpp tries to find nested quotes.

No, that's left to libunibreak with our lb_props rules.
The code you're probably mentionning is just to get the correct char/glyph for HTML's <q>first <q> second level</q></q> (which is rare, publishers usually just use the quote char they want) #345 :
image

@poire-z
Copy link
Contributor

poire-z commented Aug 7, 2020

Not sure if that's what you're thinking about:

if ( LANG_STARTS_WITH(("en")) ) { // English
has_left_single_quotation_mark_opening = true; // no right..closing in linebreakdef.c
has_left_double_quotation_mark_opening = true;
has_right_double_quotation_mark_closing = true;
}

So, if you were to substitute your emdash with U+201D ”, it would behave as on the sea”’ - which I guess would not break - but you'd need to think about how this would behave in the various context a emdash can be used.

@Jellby
Copy link
Contributor Author

Jellby commented Aug 7, 2020

The code you're probably mentionning is just to get the correct char/glyph for HTML's <q>first <q> second level</q></q> (which is rare, publishers usually just use the quote char they want) #345

I see... Well in case it helps, in Spanish one could have a third level: «First “second ‘third level’ level” level» (the "single guillemet" has not been commonly used so far). Since the first dialog level is marked with a dash, and not with quotes, that gives 4 levels in dialogs, easy peasy :)

@poire-z
Copy link
Contributor

poire-z commented Aug 8, 2020

I had a quick test at having dashes handled as http://www.unicode.org/reports/tr14/#NS
NS | Nonstarter | “‼”, “‽”, “⁇”, “⁉”, etc. | Allow only indirect line breaks before

--- a/crengine/src/textlang.cpp
+++ b/crengine/src/textlang.cpp
@@ -584,2 +584,3 @@ TextLangCfg::TextLangCfg( lString16 lang_tag ) {
     bool has_right_double_angle_quotation_mark_closing = false;
+    bool has_dashes_nonstarter = false; // U+2013 U+2014 avoid break before in some cases

@@ -622,2 +623,6 @@ TextLangCfg::TextLangCfg( lString16 lang_tag ) {

+    if ( LANG_STARTS_WITH(("en") ("es")) ) {
+        has_dashes_nonstarter = true;
+    }
+
     // Set up _lb_props.
@@ -633,2 +638,4 @@ TextLangCfg::TextLangCfg( lString16 lang_tag ) {
     if ( has_right_double_angle_quotation_mark_closing ) _lb_props[n++] = { 0x00BB, 0x00BB, LBP_CL };
+    if ( has_dashes_nonstarter )                         _lb_props[n++] = { 0x2013, 0x2013, LBP_NS };
+    if ( has_dashes_nonstarter )                         _lb_props[n++] = { 0x2014, 0x2014, LBP_NS };
     if ( has_left_single_quotation_mark_opening )        _lb_props[n++] = { 0x2018, 0x2018, LBP_OP };

Some quick tests with having them preceded or followed by spaces, opening or closing quotes/parens do not exhibit strange things (at least, not strange to me) - and they are allowed to start a line when there's a space before.
Here are the libs built for Kobo, if you have a Kobo: libkoreader-cre-dashes-nonstarter-EN+ES-kobo.zip

(I have no idea where/when “‼”, “‽”, “⁇”, “⁉” are used, and why they would behave differently than a single "?" or "!" - anyone can enlighten me ?) (that red is not mine, but github highlighting ⁉ :)

@Jellby
Copy link
Contributor Author

Jellby commented Aug 8, 2020

(I have no idea where/when “bangbang”, “‽”, “⁇”, “interrobang” are used, and why they would behave differently than a single "?" or "!" - anyone can enlighten me ?) (that red is not mine, but github highlighting interrobang :)

I was wondering the same thing. Can't see the logic in that.

Some quick tests with having them preceded or followed by spaces, opening or closing quotes/parens do not exhibit strange things (at least, not strange to me) - and they are allowed to start a line when there's a space before.

Some quick test on my side (I have a Kobo, but I tested this with a minimal python script), indicates that that would result in (using the character ¦ to mark allowed linebreaks): that ¦he—¦or ¦she or —¦No ¦creo ¦—¦dijo, where we'd like that ¦he¦—¦or ¦she and —No ¦creo ¦—dijo.

I think the following could work for English (and possibly other languages) if it would be feasible to implement:

  • Scan characters on both sides of the dash until an alphanumeric character, a space or forced linebreak (e.g. end of paragraph) is found. (Some refinement may be needed to define what is "alphanumeric" and what is "space")
  • If an alphanumeric was found on both sides, this is a "normal" dash: B2
  • Otherwise (i.e. if a space or linebreak was found on any side), this is a "no break" dash: QU

For Spanish, a much simpler expedient would be to just assign the em-dash to the QU class, as it matches how it is used, or AL (which seems to be the class for the "horizontal bar").

By the way, I don't think this applies to the en-dash, the en-dash has a different class and usage, and I haven't seen a problem with it yet.

@poire-z
Copy link
Contributor

poire-z commented Aug 8, 2020

Some quick test on my side - I tested this with a minimal python script

Good, perfect, enemies... :)
I'd like you to test in real reading if it is better enough than the previous behaviour for the common cases you noticed - and the caveats/false decisions less worse than previously :)
And if it's as worse as before, or bring new ugly things, we can drop the idea.

if it would be feasible to implement:
until an alphanumeric character, a space or forced linebreak (e.g. end of paragraph) is found. (Some refinement may be needed to define what is "alphanumeric" and what is "space")

That would be costly, and detecting "alphanumeric" would mean detecting kinda line-breaking-classes outside of the algorithm, and decide upon these classes - and we don't have access to these I think.

Otherwise (i.e. if a space or linebreak was found on any side), this is a "no break" dash: QU

QU will prevent a break on both side, even if there is a space I think. creo ¦—dijo would not break.

or AL (which seems to be the class for the "horizontal bar").

I tried AL earlier, which is just the class of any letter. So, enemies—better would never break. I tried it on some english book with tons of these, and the word spacing was horrible.
(As a french, I find the absence of space surrounding these emdashes hideous and making it hard to read :)

@Jellby
Copy link
Contributor Author

Jellby commented Aug 8, 2020

I'd like you to test in real reading if it is better enough than the previous behaviour for the common cases you noticed - and the caveats/false decisions less worse than previously :)
And if it's as worse as before, or bring new ugly things, we can drop the idea.

I'd rather like to see all possible linebreak points in sample texts. Testing in real reading means being (un)lucky enough to find the problem situations. For instance, if I read British English books, I'd only rarely see double quotes, more rarely next to an em-dash, and will I be able to say if a linebreak was avoided or it just didn't happen to fall there? Of course, at the end a real reading text is necessary/good.

Still without testing, and judging from the description, I don't think NS is a good class. It allows a break after the dash in ‘—and so on or creo —dijo. But any class that would prevent any break unless there are spaces would look good in my "reading test", because the books I read don't have cases like that he—or she (I use space-endash-space for those).

That would be costly, and detecting "alphanumeric" would mean detecting kinda line-breaking-classes outside of the algorithm, and decide upon these classes - and we don't have access to these I think.

I suspected that. Well, in the vast majority of cases, the "scan" for characters would just stop at the first, sometimes second character. Instead of checking for line-breaking-classes, could one check for unicode categories? It could be something like: find the first character that's not punctuation, and decide based on whether or not it's a space/linebreak.

Otherwise (i.e. if a space or linebreak was found on any side), this is a "no break" dash: QU

QU will prevent a break on both side, even if there is a space I think. creo ¦—dijo would not break.

Really? If the curly apostrophe is QU, are linebreaks prevented at get ’em? or no "good" (straight quotes)?

According to the UAX#14 document, The AL QU combination should be "do not break before QU, unless one or more spaces follow AL", and the "unless" would apply in this case.

or AL (which seems to be the class for the "horizontal bar").

I tried AL earlier, which is just the class of any letter. So, enemies—better would never break. I tried it on some english book with tons of these, and the word spacing was horrible.
(As a french, I find the absence of space surrounding these emdashes hideous and making it hard to read :)

Right, which is why I said this would be specific for Spanish (and possibly French and other languages), where that usage without spaces does not occur (in self-respected texts).

@poire-z
Copy link
Contributor

poire-z commented Aug 8, 2020

I'd rather like to see all possible linebreak points in sample texts. Testing in real reading means being (un)lucky enough to find the problem situations.

I see... You won't stop at "better" :/ You want "perfect" :)

But any class that would prevent any break unless there are spaces would look good in my "reading test", because the books I read don't have cases like...

But that won't do generally, because of the case I mentionned previously getting too large word spacing.
At least, NS was good enough at solving this issue, and allowing a break on one side.
Aim at "perfect for all", not only at "perfect for you" :)

Really? If the curly apostrophe is QU, are linebreaks prevented at get ’em? or no "good" (straight quotes)? According to the UAX#14 document, The AL QU combination should be "do not break before QU, unless one or more spaces follow AL", and the "unless" would apply in this case.

May be you're right :) Good to see you're getting better at UAX14 than me :)

Instead of checking for line-breaking-classes, could one check for unicode categories? It could be something like: find the first character that's not punctuation, and decide based on whether or not it's a space/linebreak.

I still think line breaking classes are better - and us staying in that categoriztion plane rather than picking into another - and we get a chance to have libunibreak do the right thing with Arabic or Chinese.

Are you able to build KOReader's emulator ? I'd rather focus on trying to make the LB class accessible into a lb_char_sub_func_generic() so you can get the LB class of neighbours chars and play with that - than having to think UAX14 with you :)

and decide based on whether or not it's a space/linebreak.

In our context, we're processing already splitted paragraphs (by <BR> or \n or <P>...</P>) - so no need to bother with linebreak. Either pos== 0 and you're at start of a paragraph logical text - or next_usable==0 and you're at end of paragraph logical text.

@Jellby
Copy link
Contributor Author

Jellby commented Aug 8, 2020

I'd rather like to see all possible linebreak points in sample texts. Testing in real reading means being (un)lucky enough to find the problem situations.

I see... You won't stop at "better" :/ You want "perfect" :)

Yes, perfect within my reach :D

The problem with NS is that I think it just fixes one particular case and not the general problem, and it's asymmetric. Still... yes, it would be some improvement. As a user, I'd be semi-happy with that (in English at least), but as a developer I don't think the benefits justify the "hack".

But that won't do generally, because of the case I mentionned previously getting too large word spacing.
At least, NS was good enough at solving this issue, and allowing a break on one side.
Aim at "perfect for all", not only at "perfect for you" :)

That's my point, a (n exhaustive) reading test will only tell me when it's "perfect for me"

I still think line breaking classes are better - and us staying in that categoriztion plane rather than picking into another - and we get a chance to have libunibreak do the right thing with Arabic or Chinese.

... in case they use em-dashes :D

I get the wish to refer only to linebreak classes, but I don't see why using categories would break it for Arabic or Chinese... those characters surely have some category too, and it won't be punctuation or space...

Are you able to build KOReader's emulator ? I'd rather focus on trying to make the LB class accessible into a lb_char_sub_func_generic() so you can get the LB class of neighbours chars and play with that - than having to think UAX14 with you :)

I'll try. I'm not a noob with compiling.

In our context, we're processing already splitted paragraphs (by <BR> or \n or <P>...</P>) - so no need to bother with linebreak. Either pos== 0 and you're at start of a paragraph logical text - or next_usable==0 and you're at end of paragraph logical text.

Good, I just included that to make sure the beginning/end of the text to be broken is taken into account.

@Jellby
Copy link
Contributor Author

Jellby commented Aug 8, 2020

I could build the emulator, and run it. But when I add for instance your NS changes (and recompile), I still see breaks at dash—’, with the dash starting a new line. If I add has_right_single_quotation_mark_closing = true I see the different behaviour of the apostrophe (elsewhere), so I know changes are being applied.

@poire-z
Copy link
Contributor

poire-z commented Aug 8, 2020

You added this line has the right place (between the two in my patch - codepoints needs to be linearly increasing - if you added it at the end, a 2014 won't reach it and stop at 2018):
if ( has_dashes_nonstarter ) _lb_props[n++] = { 0x2014, 0x2014, LBP_NS };

(Note that it's possible I haven't tested it explicitely with dash—’, so dunno if it works in that case :) better try it with LBP_AL to have it as a letter sticking letters together.)

@Jellby
Copy link
Contributor Author

Jellby commented Aug 9, 2020

So, I guess that French, usually having some space (regular or else) around emdash, will be fine with QU as it won't change anything (break if regular space, don't break if specified nobreakspace). Except for may be other cases I can't think of. We'll see how it does.

Right, maybe for French no change was needed.

Well, OK, all this feels like it can somehow wholly make sense :)
Still testing it? Wanna make your first PR here? Or should I do it for you?

I guess I could do it, but you could do the one for koreader-base, if you don't mind.

And mhhh (whooly make sense... really?! :) please, again, what is and why the difference now between EN vs. FR+ES ? May be a few identical samples that are dealt with differently in these 2 contexts, so I can get it? :)

In current-style "good" texts, no difference, but I guess it's faster to do without the lb_char_sub_func when possible. Otherwise, blabla—blabla would break in EN, and not in FR, ES (but it should never appear in these languages).

@poire-z
Copy link
Contributor

poire-z commented Aug 9, 2020

Right, maybe for French no change was needed.

Well, let's keep it for now, so I can notice if there are some issues.

Otherwise, blabla—blabla would break in EN, and not in FR, ES (but it should never appear in these languages).

OK. May be write a sentence about that in the comments. (And add the url to this issue for reference? I usually don't, but for this one, I think it's worth it.)

you could do the one for koreader-base, if you don't mind.

Yep, no problem, I'll do the bumps to base and to frontend.

@poire-z
Copy link
Contributor

poire-z commented Aug 10, 2020

note that next_usable is what it says: it will tell you about the current text node - because the upper code is actually copying text nodes to "text", and has not yet parsed what's coming next - so there might be more text coming in that paragraph after next_usable - but it's not usable...)

Just mentionning that again ^.
That means when reaching the end of next_usable, it might not be the end of the paragraph. It's only the end of the current text node - and you don't know what's coming next.
So with word-<sup>[123]</sup> or open-<span color=blue>source</span>, you don't know what's coming after the dash. So, be sure to take the most appropriate decision when what's next is unknown (and don't assume it's the last char of a paragraph).

@Jellby
Copy link
Contributor Author

Jellby commented Aug 10, 2020

Oh, then we're left again with the problem: How to detect end of paragraph / forced line break?

@poire-z
Copy link
Contributor

poire-z commented Aug 10, 2020

May be with something like that (untested):

--- a/crengine/include/textlang.h
+++ b/crengine/include/textlang.h
@@ -94,5 +94,5 @@ public:
 #define MAX_NB_LB_PROPS_ITEMS 10 // for our statically sized array (increase if needed)

-typedef lChar16 (*lb_char_sub_func_t)(const lChar16 * text, int pos, int next_usable);
+typedef lChar16 (*lb_char_sub_func_t)(const lChar16 * text, int pos, int next_usable, bool is_last_fragment);

 class TextLangCfg
--- a/crengine/src/lvtextfm.cpp
+++ b/crengine/src/lvtextfm.cpp
@@ -1401,5 +1401,5 @@ public:
                         // Lang specific function may want to substitute char (for
                         // libunibreak only) to tweak line breaking around it
-                        ch = src->lang_cfg->getLBCharSubFunc()(m_text, pos, len-1 - k);
+                        ch = src->lang_cfg->getLBCharSubFunc()(m_text, pos, len-1 - k, i==end-1);
                     }
                     int brk = lb_process_next_char(&lbCtx, (utf32_t)ch);

@Jellby
Copy link
Contributor Author

Jellby commented Aug 10, 2020

And what about pos==0? Does it also occur and node boundary? E.g., what happens with <em>one</em>—<em>two</em>?

@poire-z
Copy link
Contributor

poire-z commented Aug 10, 2020

No issue on the left side: pos==0 means first char of first text node of the paragraph. pos=5 is first char of 2nd text node if 1st text node was 5 chars long.

@Jellby
Copy link
Contributor Author

Jellby commented Aug 10, 2020

So, if I understand it correctly, we can look at the previous characters even if they are from different nodes, but not at the following characters if they're in a different node?

@poire-z
Copy link
Contributor

poire-z commented Aug 10, 2020

Right, you understand correctly.
(That's an optimisation to do that libunibreak flagging stuff while copying text nodes content into the paragraph text formatting buffers - I'm not too keen on changing that just for this :)

@Jellby
Copy link
Contributor Author

Jellby commented Aug 11, 2020

Well, that made me change the approach, since I can no longer be sure there will be a space or forced linebreak after the dash (e.g. in one—<em>two</em>), so in order to guarantee that at least break on one side will be possible, my working code now does:

blah—blah                     ->  —  (break before or after)
blah “—blah , <p>—blah        ->  {  (do not break after)
blah—” Blah , blah—”</p>      ->  }  (do not break before)
blah — blah , blah —<em>blah  ->  "  (break only at spaces)

@poire-z
Copy link
Contributor

poire-z commented Aug 11, 2020

And you still manage to get "perfect"-enough results ?

@Jellby
Copy link
Contributor Author

Jellby commented Aug 11, 2020

As far as I have seen, yes. The only "problem" is that when there's a node boundary after the dash (one—<em>two</em>), a break will be forbidden before the dash; the break after is still allowed (unless further forbidden by punctuation). It's not an extremely rare situation, I guess, but it's a risk I'm willing to accept.

@poire-z
Copy link
Contributor

poire-z commented Aug 17, 2020

Implemented in #365.

@poire-z
Copy link
Contributor

poire-z commented Sep 25, 2020

We used to have this https://unicode.org/cldr/utility/breaks.jsp for testing this algorithm (and all others), but it's not available anymore :( (just saw that today, I hope it will be back...)

It's back again ! https://util.unicode.org/UnicodeJsps/breaks.jsp

@poire-z
Copy link
Contributor

poire-z commented May 9, 2021

Possibly unrelated to this issue or how it was solved, but still related to linebreaking and dashes.
I got this:
image.

The HTML reads : cultes\u20\u2013\uA0<i>Les\u20tontons</i> (\uA0 = &nbsp;)
(Haven't checked, but I think the <i> is not at play in the situation.)

Obviously, the non-stripped-out NBSP at start of line makes this a little ugly.

We never needed to care about handling NBSP at start/end of wrapped lines, because a wrap should normally not happen around a NO-BREAK-SPACE :) but it looks like it may happen if the NBSP is stuck to some other kind of chars that break the non-breakable aspect.
Should we try to strip such NBSP at start/end of line ? Or do their spacing is of some importance?
I believe it is of importance at start of a paragraph (as it could be used to add some indentation), not sure about when at the end of a paragraph.
But when inside a paragraph, and it happens to end up on the side of a line wrap (non-mandatory line break), what should or are we allowed to do?
Thoughts ?

@NiLuJe
Copy link
Member

NiLuJe commented May 9, 2021

Possibly only if it's stuck after a may_break character?

(Although, to be fair, preserving it here doesn't bother me all that much ;)).

@Frenzie
Copy link
Member

Frenzie commented May 9, 2021

Did you test what happens if you stick a hyphen before a non-breaking space in, say, Firefox?

@poire-z
Copy link
Contributor

poire-z commented May 9, 2021

Well, Firefox won't break between the U+2013 and NBSP :) (So, may be the fact that we do is related to this issue and its solution, but that's not my issue, I'm fine with that :)

image
(on the last one, NBSPs on both sides of the U+2013.)

On the 4th item (\u0020\u00AD\u00A0after), it keeps the space/indentation aspect of it at start of line.
It feels unwelcome, but I guess that means we should (unless the fast that it's a soft-hyphens, and the soft-hyphens itself is what is breaking on, so everything else should be shown).

@poire-z
Copy link
Contributor

poire-z commented May 9, 2021

(Haven't checked, but I think the <i> is not at play in the situation.)

Well, I had a check, and it looks different, and nice: no blank before any of the left and right margins if I remove the <i>. So, might be related to #364 (comment) and the fact @Jellby's tweaks can't know what's coming next that makes them break before or after the NBSP depending if he knows or not it's some normal word, and we might handle it correctly (not showing it) when we break after - or something around that (not investigating more :)

@poire-z
Copy link
Contributor

poire-z commented May 9, 2021

Or may be not related to these dash tweaks: I think they applied only to English or French - and I get the same behaviour with &nbsp;<i> if I select Catalan or Bulgarian typography... So, possible just to the normal code can do with text nodes and libunibreak... Still strange though that there is this difference...

@poire-z
Copy link
Contributor

poire-z commented May 10, 2021

OK, the difference is caused by this, so actually voluntary:

// Ignore space at start of line (this rarely happens, as line
// splitting discards the space on which a split is made - but it
// can happen in other rare wrap cases like lastDeprecatedWrap)
if ( (m_flags[start] & LCHAR_IS_SPACE) && !(lastSrc->flags & LTEXT_FLAG_PREFORMATTED) ) {
// But do it only if we're going to stay in same text node (if not
// the space may have some reason - there's sometimes a no-break-space
// before an image)
if (start < end-1 && m_srcs[start+1] == m_srcs[start]) {
start++;
lastSrc = m_srcs[start];
}
}

From #241 89b0650.

Also prevent skipping the above space when it's the last thing before another elements, as it may have some purpose (I don't see which in the HTML above... but I assume that if the publisher put one, it was for a reason).

May be I've been too generous in thinking all "may have some purpose" :) I actually don't remember why I felt the need to keep it - I was obviously thinking about spaces between images, and if consecutive images, to keep the space in between - but why at start of line ? :/
Or may be it was in this sample to keep it on the first line of a paragraph if there is one (see the "Selection HTML" in #241), but may be we don't need to do it for the next lines...

@poire-z
Copy link
Contributor

poire-z commented May 10, 2021

I think I'll fix this with:

@@ -2738,13 +2737,10 @@ public:
         // Ignore space at start of line (this rarely happens, as line
         // splitting discards the space on which a split is made - but it
-        // can happen in other rare wrap cases like lastDeprecatedWrap)
-        if ( (m_flags[start] & LCHAR_IS_SPACE) && !(lastSrc->flags & LTEXT_FLAG_PREFORMATTED) ) {
-            // But do it only if we're going to stay in same text node (if not
-            // the space may have some reason - there's sometimes a no-break-space
-            // before an image)
-            if (start < end-1 && m_srcs[start+1] == m_srcs[start]) {
-                start++;
-                lastSrc = m_srcs[start];
-            }
+        // can happen in other rare wrap cases like lastDeprecatedWrap).
+        // Do it only for the 2nd++ lines of a paragraph, as a leading
+        // no-break-space may be used to add some indentation.
+        if ( !first && (m_flags[start] & LCHAR_IS_SPACE) && !(lastSrc->flags & LTEXT_FLAG_PREFORMATTED) ) {
+            start++;
+            lastSrc = m_srcs[start];
         }

@poire-z
Copy link
Contributor

poire-z commented Oct 25, 2021

About to be off for a week, so can't investigate this unexpected/bad wrap:

image

Posting this so I remember, and in case @Jellby and his UAX#14 science tells me it's expected and there's nothing to investigate :)


Which might be the case, https://util.unicode.org/UnicodeJsps/breaks.jsp tells us we do it right:
image
Input: lU+2019U+00ABU+2009U+0061presU+2009U+00BB, moi

Still unsure why this is right :/

@Jellby
Copy link
Contributor Author

Jellby commented Oct 26, 2021

U+2009 is "thin space"
U+202F is "narrow no-break space"

From my point of view, the book is simply using the wrong character. U+2009 belongs to the BA (break after) class, not the SP (space) class. The relevant UAX#14 rule is, I believe:

LB14 Do not break after ‘[’, even after spaces. OP SP* ×

But there's no equivalent with BA. Maybe you could tweak the French rules to make it treat U+2009 as SP.

@poire-z
Copy link
Contributor

poire-z commented Oct 26, 2021

Thanks for the analysis, looks like you are right.

Maybe you could tweak the French rules to make it treat U+2009 as SP.

Well, I dunno, I then might be tempted to add a few more of them :) http://unicode.org/reports/tr14/#BA Why not hair space ?
I'll have to see how this construct is done in other books, see if it's a common pattern, if some word processors do use a thin space with French ... or if it's just a bug from this specific publisher/book.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants