Linebreaking at dashes #364
We follow (thanks to libunibreak, we don't have to understand it :) the Unicode Line Breaking Algorithm (aka UAX#14), which says about EMDASH:
.
So, go suggest that to these people :) We can tweak it a bit by typography language, which we do for quotation marks: crengine/crengine/src/textlang.cpp Lines 556 to 663 in 8a15844
By "a bit", I mean we can set some alternative "line breaking class" for some unicode codepoint (or substitute it with another char) - so the same algorithm applies, but considers that char differently. So, you'd need to understand the algo and classes a bit to see if some other class would make emdash behave as you wish - and actually test it :) Also, as an emdash is quite wide, preventing a break on either side might pull some stuff onto the next line, and cause large interword spacing on the previous line. So, I guess in the emdash case, the best might be the enemy of the good, |
No, nothing like that, no HTML code or invisible characters.
Yes, I have this problem with all ebook renderers I have seen... As I said, I was hoping things could be improved/tweaked in KOReader.
Does the substitution need to be global or could it be done based on context (i.e. surrounding characters)?
Well, I would prefer some wider spacing to this linebreak. To me, it's like breaking before a |
I don't know. All classes interact in various ways with the classes of the previous or next char...
Dunno which should win, the emdash that allows a break after - or a closing single quote that prevents a break before. But its closing status might depend on the language, cf my code snippet above. Can you check if it behaves better whether you select Typography language to be German or Chinese?
We can have function substitution per language, and they got access to context, ex: crengine/crengine/src/textlang.cpp Lines 438 to 447 in 8a15844
This might add some small overhead if it were global to all languages, and I'm not sure your rule applies to all languages.
Well, if that's so general and no renderer renders them as you expect - maybe it's the publisher's fault: they could/should use (https://en.wikipedia.org/wiki/Dash used to have a lot more examples in many languages - that I don't see anymore - I remember it was very variable, and I guess the Unicode people did the best they could to do the least bad thing in the most cases) |
That's not where the problem appears, I haven't (so far) seen a break between the dash and the quote, the problem is before the dash.
Good point, you could of course write it in English:
Yes, I thought about that... but it feels wrong (and inconvenient) to have to add these invisible codes. But maybe I'm spoiled by LaTeX (where, among other things, you don't need to add the no-break spaces before/after French punctuation, because that can be handled automatically). About every renderer doing the "wrong" thing... I expect most will use pretty dumb algorithms, but when I saw the "typography rules" menu in KOReader I hoped I could finally get something smarter going on ;) |
It's there to allow us to try to be smarter :) So, it just requires some thinking :) And I can't think of an obvious solution right now. But more thinking might give something. |
Another thing about the UAX14 algorithm/libunibreak that I got: when fed a char, it gives the line break status between the previous char and this char. It can't go back. The only thing it can do is postpone the decision for a line break and keep giving "no break" until that decision (this is what happens with consecutive spaces, as it doesn't know what will come next: the first spaces have no break, and only between the last one and what comes next can we have a "break allowed"). So, if that's gonna be fixed, I guess it can only be via a
If that's really the case, I guess the right line breaking class could solve it. But you'll have to be certain of that never :) |
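To make the "one decision per fed char, no going back" behaviour described above concrete, here is a toy model. It is not libunibreak's code and not UAX#14: it only knows about spaces, and the names `ToyBreaker`, `feed` and `toy_breaks` are made up for this sketch.

```cpp
#include <string>
#include <vector>

// Toy model of a streaming line breaker. It only knows about spaces,
// so it is NOT UAX#14; it just shows the "one decision per fed char" shape.
enum class Brk { NoBreak, Allowed };

struct ToyBreaker {
    bool have_prev = false;
    char32_t prev = 0;

    // Returns the status of the boundary between the previously fed
    // char and `ch`. Earlier answers can never be revised.
    Brk feed(char32_t ch) {
        Brk result = Brk::NoBreak;
        // A break is granted only once we see what follows a run of
        // spaces: the previous char is a space and the new one is not.
        if (have_prev && prev == U' ' && ch != U' ')
            result = Brk::Allowed;
        prev = ch;
        have_prev = true;
        return result;
    }
};

// Feed a whole string; out[i] is the status of the boundary before text[i].
std::vector<Brk> toy_breaks(const std::u32string& text) {
    ToyBreaker b;
    std::vector<Brk> out;
    for (char32_t ch : text) out.push_back(b.feed(ch));
    return out;
}
```

With two consecutive spaces, the boundary between them stays "no break", and only the boundary between the last space and the next char is reported as allowed.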
I agree completely. Except for the "badly published book" part... I actually read mostly ebooks I create myself, so I know the code ;) In my experience, "retail" ebooks are the same (in this particular aspect) or worse (they'd often have a mishmash of hyphens, en-dashes and em-dashes used with no consistency, "smart" quotes facing the wrong side, etc.). Anyway, my intent when creating this issue was not to push my view and demand any action, but to call the attention on a possible area of improvement. And not to sound like a whiner, I also put forward a suggestion. I'll try to think of a more reasonable solution and test it somewhere. As a poor-man's solution, would it be possible/easy to have a toggleable option to "never break at an em-dash" (possibly en-dash too)? At least in English the use of dashes varies with publishers and styles, and it may be beneficial to enable this for some books and not others... and it may also be helpful for other languages and people like me who are more bugged by wrong linebreaks than by large white spaces ;) |
It is very common (in my experience at least) to use an em-dash as a sort of ellipsis for interrupted speech. There may be a difference between American/British English, but I find it all the time.
Well, Unicode would call it a "quotation dash", but that's bullshit, there's only a single dash character in Spanish, and it's used both for starting a quotation and for parenthetical sentences —like this—, and it is used as I did, with space on one side, possibly with punctuation. You want breaks where the spaces are, the dashes don't allow any break that a normal letter wouldn't. I might consider using different dash characters in Spanish the day I see widespread use of different characters for a right single quote mark and a (curly) apostrophe in English... |
I do like the interrupted speech approach in English (as well as the parenthetical one, for that matter), but I'm absolutely not bothered by breaking around the dash in both of these cases ;). |
Then again, I wholeheartedly fall on the "break more" than the "more rivers" side of the equation ;). |
It's plenty standard. It indicates a sudden stop. "I went to the—" Do you do something different in French? I don't recall encountering suddenly interrupted speech otoh. It's described here: https://en.wikipedia.org/wiki/Dash#Interruption_of_a_speaker |
At least, you can fix that in the source then (zero width no break space/word joiner) :)
And I took it like that, and tried/am trying to think with you about the best solution - suggesting some caveats I can think of, so it's not a quick suggestion/decision that would cause other issues later.
I'm not keen on having many toggles (mostly because it's a pain to propagate them from the UI to the engine, and to explain them). The Typography language abstraction was meant to hide that (my initial idea was using flags as you'd like them at #307 (comment) - which evolved into per language flags #307 (comment) - to finally have it hardcoded in crengine). But if ever we need minor tweaks to libunibreak/UAX14 that we can't associate with a language, or with text contexts that would be valid in all languages, and that should be targeted/toggleable by the user, I'd go with a style tweak:
Well, Unicode just calls it an EM DASH :) and there's a single EM DASH codepoint for all the world's languages :) |
Then, a sudden line break should enhance the effect and look even better ! :)
Mhhh, neither do I. I don't know how we render that. (But we're very polite and don't interrupt other people's speech :) |
Look again at 2015. That's what they want us to use in some places. The fact that it might be not supported by the font or that it may actually look different from the em-dash (which would be used in other places) is yet another reason not to use it.
Ha. It doesn't work so good (even if I could admit that's any good :P) when even a single word can be interrupted:
or the dash can be used to "edit" a name that one doesn't want to spell fully:
|
But could one say, for instance:
that, of course, would be applied to every dash in a given (or maybe all) language |
We Dutch do, but there are many things that could interrupt you. I think it's more typically used for those other kinds of interruptions. |
That's not how the UAX14 algo works I think. A codepoint has a single LB class. But that's somehow what our lb_char_sub_func_xxx() can do:
|
May I ask how the RIGHT SINGLE QUOTATION MARK is interpreted in English? Is it assigned a "Closing Punctuation" class? I've been testing with the Also, it looks like the |
Try reading this: crengine/crengine/src/textlang.cpp Lines 556 to 663 in 8a15844
(If i re-read my #337 (comment)): For all the stuff that stays false, we stay with the default class of "quotation" http://www.unicode.org/reports/tr14/#QU - which usually prevents break between other such stuff (as we don't know if it's opening or closing). For all the stuff that we set to true for some languages, they get the OP(ening), CL(osing) or GL(ue) - and this allows more break (between a closing one and an opening one).
So, it seems it is not. It stays QU.
Well, hyphenation happens inside words, which should all be fully non-breakable.
No, that's left to libunibreak with our lb_props rules. |
Not sure if that's what you're thinking about: crengine/crengine/src/textlang.cpp Lines 587 to 591 in 8a15844
So, if you were to substitute your emdash with U+201D ” , it would behave as on the sea”’ - which I guess would not break - but you'd need to think about how this would behave in the various contexts an emdash can be used in.
|
I see... Well in case it helps, in Spanish one could have a third level: |
I did a quick test of having dashes handled as NS (http://www.unicode.org/reports/tr14/#NS):
--- a/crengine/src/textlang.cpp
+++ b/crengine/src/textlang.cpp
@@ -584,2 +584,3 @@ TextLangCfg::TextLangCfg( lString16 lang_tag ) {
bool has_right_double_angle_quotation_mark_closing = false;
+ bool has_dashes_nonstarter = false; // U+2013 U+2014 avoid break before in some cases
@@ -622,2 +623,6 @@ TextLangCfg::TextLangCfg( lString16 lang_tag ) {
+ if ( LANG_STARTS_WITH(("en") ("es")) ) {
+ has_dashes_nonstarter = true;
+ }
+
// Set up _lb_props.
@@ -633,2 +638,4 @@ TextLangCfg::TextLangCfg( lString16 lang_tag ) {
if ( has_right_double_angle_quotation_mark_closing ) _lb_props[n++] = { 0x00BB, 0x00BB, LBP_CL };
+ if ( has_dashes_nonstarter ) _lb_props[n++] = { 0x2013, 0x2013, LBP_NS };
+ if ( has_dashes_nonstarter ) _lb_props[n++] = { 0x2014, 0x2014, LBP_NS };
if ( has_left_single_quotation_mark_opening ) _lb_props[n++] = { 0x2018, 0x2018, LBP_OP };
Some quick tests with having them preceded or followed by spaces, opening or closing quotes/parens do not exhibit strange things (at least, not strange to me) - and they are allowed to start a line when there's a space before. (I have no idea where/when “‼”, “‽”, “⁇”, “⁉” are used, and why they would behave differently than a single "?" or "!" - can anyone enlighten me?) (that red is not mine, but github highlighting ⁉ :) |
I was wondering the same thing. Can't see the logic in that.
Some quick test on my side (I have a Kobo, but I tested this with a minimal python script), indicates that that would result in (using the character I think the following could work for English (and possibly other languages) if it would be feasible to implement:
For Spanish, a much simpler expedient would be to just assign the em-dash to the QU class, as it matches how it is used, or AL (which seems to be the class for the "horizontal bar"). By the way, I don't think this applies to the en-dash, the en-dash has a different class and usage, and I haven't seen a problem with it yet. |
Good, perfect, enemies... :)
That would be costly, and detecting "alphanumeric" would mean detecting kinda line-breaking-classes outside of the algorithm, and decide upon these classes - and we don't have access to these I think.
QU will prevent a break on both side, even if there is a space I think.
I tried AL earlier, which is just the class of any letter. So, |
I'd rather like to see all possible linebreak points in sample texts. Testing in real reading means being (un)lucky enough to find the problem situations. For instance, if I read British English books, I'd only rarely see double quotes, more rarely next to an em-dash, and will I be able to say if a linebreak was avoided or it just didn't happen to fall there? Of course, at the end a real reading test is necessary/good. Still without testing, and judging from the description, I don't think NS is a good class. It allows a break after the dash in
I suspected that. Well, in the vast majority of cases, the "scan" for characters would just stop at the first, sometimes second character. Instead of checking for line-breaking-classes, could one check for unicode categories? It could be something like: find the first character that's not punctuation, and decide based on whether or not it's a space/linebreak.
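A minimal sketch of the "skip punctuation, then look at what comes next" scan suggested above. Everything here is an assumption made for illustration: the helper names, the tiny hand-written punctuation list (a real version would consult Unicode general categories), and treating the text edge like a forced line break.

```cpp
#include <string>

// Hypothetical helper: a handful of punctuation marks, for illustration
// only. A real implementation would consult Unicode general categories.
static bool is_punct(char32_t c) {
    switch (c) {
        case U'.': case U',': case U';': case U':': case U'!': case U'?':
        case U'"': case U'\'': case U'(': case U')':
        case 0x2018: case 0x2019: case 0x201C: case 0x201D: // curly quotes
        case 0x00AB: case 0x00BB:                           // guillemets
            return true;
        default:
            return false;
    }
}

static bool is_space(char32_t c) { return c == U' ' || c == U'\n'; }

// Scan from `pos` in direction `step` (+1 or -1), skipping punctuation.
// Returns true if the first non-punctuation char is a space or newline,
// or if the text edge is reached (treated here like a forced line break).
bool side_reaches_space(const std::u32string& text, int pos, int step) {
    for (int i = pos + step; i >= 0 && i < (int)text.size(); i += step) {
        if (is_punct(text[i])) continue;
        return is_space(text[i]);
    }
    return true;
}
```

How the two results (one per side of the dash) get turned into a break decision is left open here, since that is exactly the part being discussed in the thread. As noted, the scan usually stops at the first or second character.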
Really? If the curly apostrophe is QU, are linebreaks prevented at According to the UAX#14 document, The AL QU combination should be "do not break before QU, unless one or more spaces follow AL", and the "unless" would apply in this case.
Right, which is why I said this would be specific for Spanish (and possibly French and other languages), where that usage without spaces does not occur (in self-respected texts). |
I see... You won't stop at "better" :/ You want "perfect" :)
But that won't do generally, because of the case I mentioned previously of getting too large word spacing.
May be you're right :) Good to see you're getting better at UAX14 than me :)
I still think line breaking classes are better - with us staying in that categorization plane rather than picking into another - and we get a chance to have libunibreak do the right thing with Arabic or Chinese. Are you able to build KOReader's emulator ? I'd rather focus on trying to make the LB class accessible into a
In our context, we're processing already split paragraphs (by |
Yes, perfect within my reach :D The problem with NS is that I think it just fixes one particular case and not the general problem, and it's asymmetric. Still... yes, it would be some improvement. As a user, I'd be semi-happy with that (in English at least), but as a developer I don't think the benefits justify the "hack".
That's my point, a (n exhaustive) reading test will only tell me when it's "perfect for me"
... in case they use em-dashes :D I get the wish to refer only to linebreak classes, but I don't see why using categories would break it for Arabic or Chinese... those characters surely have some category too, and it won't be punctuation or space...
I'll try. I'm not a noob with compiling.
Good, I just included that to make sure the beginning/end of the text to be broken is taken into account. |
I could build the emulator, and run it. But when I add for instance your NS changes (and recompile), I still see breaks at |
You added this line at the right place (between the two in my patch - codepoints need to be linearly increasing - if you added it at the end, a 2014 won't reach it and will stop at 2018): (Note that it's possible I haven't tested it explicitly with |
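Why the order of the entries matters can be shown with a toy lookup that behaves as described above: a linear scan over a table sorted by codepoint, which gives up as soon as an entry starts past the char being looked up. `ToyClass`, `ToyProp` and `toy_lookup` are stand-ins invented for this sketch, not libunibreak's actual types or functions.

```cpp
#include <cstdint>

// Hypothetical stand-ins for the line breaking class constants and
// the { start, end, class } entries used in the patch above.
enum ToyClass { TOY_DEFAULT, TOY_CL, TOY_NS, TOY_OP };
struct ToyProp { uint32_t start, end; ToyClass cls; };

// Linear scan over a table that is assumed sorted by `start`.
// It stops at the first entry that begins past `ch`, so an entry
// placed out of order after a higher codepoint is never reached.
ToyClass toy_lookup(const ToyProp* props, int n, uint32_t ch) {
    for (int i = 0; i < n; i++) {
        if (props[i].start > ch) break;       // sorted: nothing later can match
        if (ch <= props[i].end) return props[i].cls;
    }
    return TOY_DEFAULT;
}
```

With the entries in increasing order (0x00BB, 0x2013, 0x2014, 0x2018), U+2014 is found. If the 0x2014 entry is appended after 0x2018 instead, the scan stops at 0x2018 and U+2014 falls back to its default class.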
Right, maybe for French no change was needed.
I guess I could do it, but you could do the one for koreader-base, if you don't mind.
In current-style "good" texts, no difference, but I guess it's faster to do without the |
Well, let's keep it for now, so I can notice if there are some issues.
OK. May be write a sentence about that in the comments. (And add the url to this issue for reference? I usually don't, but for this one, I think it's worth it.)
Yep, no problem, I'll do the bumps to base and to frontend. |
Just mentioning that again ^. |
Oh, then we're left again with the problem: How to detect end of paragraph / forced line break? |
Maybe with something like this (untested):
--- a/crengine/include/textlang.h
+++ b/crengine/include/textlang.h
@@ -94,5 +94,5 @@ public:
#define MAX_NB_LB_PROPS_ITEMS 10 // for our statically sized array (increase if needed)
-typedef lChar16 (*lb_char_sub_func_t)(const lChar16 * text, int pos, int next_usable);
+typedef lChar16 (*lb_char_sub_func_t)(const lChar16 * text, int pos, int next_usable, bool is_last_fragment);
class TextLangCfg
--- a/crengine/src/lvtextfm.cpp
+++ b/crengine/src/lvtextfm.cpp
@@ -1401,5 +1401,5 @@ public:
// Lang specific function may want to substitute char (for
// libunibreak only) to tweak line breaking around it
- ch = src->lang_cfg->getLBCharSubFunc()(m_text, pos, len-1 - k);
+ ch = src->lang_cfg->getLBCharSubFunc()(m_text, pos, len-1 - k, i==end-1);
}
int brk = lb_process_next_char(&lbCtx, (utf32_t)ch); |
And what about |
No issue on the left side: pos==0 means first char of first text node of the paragraph. pos=5 is first char of 2nd text node if 1st text node was 5 chars long. |
So, if I understand it correctly, we can look at the previous characters even if they are from different nodes, but not at the following characters if they're in a different node? |
Right, you understand correctly. |
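The asymmetry confirmed above (look back freely across text nodes, look ahead only within the current node) could be wrapped in two small accessors. This is a sketch under assumptions: `toy_char` stands in for `lChar16`, the helper names are made up, and `next_usable` is read as "number of chars after `pos` that may safely be read", which is how the `len-1 - k` argument in the patch looks but has not been verified against crengine.

```cpp
// Stand-in for crengine's lChar16, so the sketch compiles on its own.
typedef char32_t toy_char;

// Previous char, or 0 at the very start of the paragraph.
// Looking back is fine across text nodes, since `pos` indexes the
// whole paragraph buffer (pos == 0 is the first char of the paragraph).
toy_char prev_char(const toy_char* text, int pos) {
    return pos > 0 ? text[pos - 1] : 0;
}

// Char `offset` positions ahead (offset >= 1), or 0 when that char is
// beyond what the caller declared usable (assumed: the current text node).
toy_char next_char(const toy_char* text, int pos, int next_usable, int offset = 1) {
    return offset <= next_usable ? text[pos + offset] : 0;
}
```

A substitution function built on these would have to treat a 0 from `next_char` as "unknown", not as "end of paragraph", which is the problem raised a few comments earlier.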
Well, that made me change the approach, since I can no longer be sure there will be a space or forced linebreak after the dash (e.g. in
|
And you still manage to get "perfect"-enough results ? |
As far as I have seen, yes. The only "problem" is that when there's a node boundary after the dash ( |
Implemented in #365. |
It's back again ! https://util.unicode.org/UnicodeJsps/breaks.jsp |
Possibly only if it's stuck after a may_break character? (Although, to be fair, preserving it here doesn't bother me all that much ;)). |
Did you test what happens if you stick a hyphen before a non-breaking space in, say, Firefox? |
Well, I had a check, and it looks different, and nice: no blank before any of the left and right margins if I remove the |
Or may be not related to these dash tweaks: I think they applied only to English or French - and I get the same behaviour with |
OK, the difference is caused by this, so actually voluntary: crengine/crengine/src/lvtextfm.cpp Lines 2738 to 2749 in dd7e9bb
From #241 89b0650.
May be I've been too generous in thinking all "may have some purpose" :) I actually don't remember why I felt the need to keep it - I was obviously thinking about spaces between images, and if consecutive images, to keep the space in between - but why at start of line ? :/ |
I think I'll fix this with:
@@ -2738,13 +2737,10 @@ public:
// Ignore space at start of line (this rarely happens, as line
// splitting discards the space on which a split is made - but it
- // can happen in other rare wrap cases like lastDeprecatedWrap)
- if ( (m_flags[start] & LCHAR_IS_SPACE) && !(lastSrc->flags & LTEXT_FLAG_PREFORMATTED) ) {
- // But do it only if we're going to stay in same text node (if not
- // the space may have some reason - there's sometimes a no-break-space
- // before an image)
- if (start < end-1 && m_srcs[start+1] == m_srcs[start]) {
- start++;
- lastSrc = m_srcs[start];
- }
+ // can happen in other rare wrap cases like lastDeprecatedWrap).
+ // Do it only for the 2nd++ lines of a paragraph, as a leading
+ // no-break-space may be used to add some indentation.
+ if ( !first && (m_flags[start] & LCHAR_IS_SPACE) && !(lastSrc->flags & LTEXT_FLAG_PREFORMATTED) ) {
+ start++;
+ lastSrc = m_srcs[start];
} |
About to be off for a week, so can't investigate this unexpected/bad wrap: Posting this so I remember, and in case @Jellby and his UAX#14 science tells me it's expected and there's nothing to investigate :) Which might be the case, https://util.unicode.org/UnicodeJsps/breaks.jsp tells us we do it right: Still unsure why this is right :/ |
U+2009 is "thin space". From my point of view, the book is simply using the wrong character. U+2009 belongs to the BA (break after) class, not the SP (space) class. The relevant UAX#14 rule is, I believe: LB14 Do not break after ‘[’, even after spaces. But there's no equivalent with BA. Maybe you could tweak the French rules to make it treat U+2009 as SP.
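One way the "treat U+2009 as SP" tweak might look, written in the shape of the `lb_char_sub_func_t` signature from the patch earlier in the thread. This is an untested sketch: `toy_char` stands in for `lChar16`, and `toy_sub_thin_space` is a hypothetical name, not an existing crengine function.

```cpp
// Stand-in for crengine's lChar16, so the sketch compiles on its own.
typedef char32_t toy_char;

// Hypothetical substitution function: the returned char is only what
// the line breaker would see; the rendered text is left untouched.
toy_char toy_sub_thin_space(const toy_char* text, int pos, int /*next_usable*/) {
    if (text[pos] == 0x2009)  // THIN SPACE, class BA
        return 0x0020;        // SPACE, class SP
    return text[pos];
}
```

An `_lb_props` entry in the style of the earlier patch might be another route. Neither option has been tried here.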
Thanks for the analysis, looks like you are right.
Well, I dunno, I then might be tempted to add a few more of them :) http://unicode.org/reports/tr14/#BA Why not hair space ? |
I'm hoping the linebreaking at dashes can be improved... In something like " on the sea—’ " I get a linebreak between the "a" and the em-dash (note it's a dash followed by a closing single quote). I would suggest a linebreak at a dash (before or after) is only allowed if there are letters on both sides of the dash (e.g. "that he—or she"), but not if there is punctuation on either side. That's for English; for Spanish we pretty much want to never allow a linebreak at a dash, but since dashes are always followed or preceded by a space or punctuation, it should be covered by the above suggestion too.
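The rule proposed in this post can be written down as a small predicate. It is only a sketch of the suggestion as worded, not of what was eventually implemented: the function names are made up, and the letter test is ASCII-only to keep the example self-contained.

```cpp
#include <string>

// ASCII-only letter test, an assumption to keep the sketch short;
// real text would need a Unicode-aware check.
static bool is_letter(char32_t c) {
    return (c >= U'a' && c <= U'z') || (c >= U'A' && c <= U'Z');
}

// True if a break next to the em-dash at `pos` would be allowed under
// the proposed rule: letters immediately on both sides of the dash.
bool dash_break_allowed(const std::u32string& text, std::size_t pos) {
    if (text[pos] != 0x2014) return false;                  // not an em-dash
    if (pos == 0 || pos + 1 >= text.size()) return false;   // at a text edge
    return is_letter(text[pos - 1]) && is_letter(text[pos + 1]);
}
```

On the two examples from the post, the dash in "that he—or she" allows a break, while the dash in "on the sea—’" does not, because it is followed by a closing quote.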