Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add special handling for Soft Hyphen (SHY) unicode symbol to DOCX emi… #1180

Merged
merged 4 commits into from
Jan 27, 2023

Conversation

hvbtup
Copy link
Contributor

@hvbtup hvbtup commented Jan 12, 2023

…tter and PDF emitter.

Added support for correctly handling the SHY symbol (Unicode 173) to the PDF emitter and the DOCX emitter.

@hvbtup
Copy link
Contributor Author

hvbtup commented Jan 13, 2023

The code is not yet perfect, but I think it is a big step in the right direction.

Some remarks:

  1. There is another unicode symbol (hyphen, "\u2010") which I did not test.

  2. MS Excel does not support the soft hyphen symbol "\u00ad". I wonder if it would be better to remove the SHY symbol for the Excel output.

  3. I did not test other emitters except PDF, DOCX and HTML and XLSX.

  4. The code for removing the soft hyphen symbol from text could probably use some performance tuning, but frankly I didn't see an obvious way to do so. As long as hyphenation is relatively rare, this is not an issue.

  5. Now that we found the places in the code, this opens the door for automatic hyphenation.

  6. This is unrelated, but it seems like great parts of BIRT's source code is unused. Eg. there are two classes called TextCompositor which look very similar. Also there's o.e.b.report.engine.layout and o.e.b.report.engine.nLayout.

More details regarding 5):

In modern HTML, there are some CSS properties related to line-breaking:

  • hyphens
  • overflow-wrap
  • word-break
  • hyphenate-character

These are described in more detail here: https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Text/Wrapping_Text

I did not examine the other properties, but regarding the hyphens property, the status quo of BIRT is:

BIRT behaves as if hyphens: manual was specified.

Without this PR, however, the behavior is incorrect, because the SHY symbol does not work as expected.
Minus symbols cause a correct line-break. Part of this logic already happens in ICU (see WordRecognizerWrapper and com.ibm.icu.text.BreakIterator.

With this PR, the behavior is correct regarding the SHY symbol.

I think it would be possible (but of course not part of this PR) to support hyphens in a better way in BIRT:

Firstly, document that BIRT's default value manual differs from HTML CSS default value none.

Next steps:

Make BIRT recognize and use the hyphens CSS property.

Add support for hyphens: none?
The HTML default value hyphens: none does not make much sense to me.
Since the WordRecognizerWrapper already splits words, it would be necessary to "collect and join" the Word instances which it delivers when hyphens: none is specified. This should be possible, but I think this is not an important feature.

Add support for hyphens: auto:
Automatic hyphenation can only work if the language is known, because hyphenation is language-specific. CSS requires that the lang attribute is present. I don't know if we already have something similar to the lang attribute in BIRT. If not, then it would make sense to add it anyway, because it is required for accessibility, too.
CSS specifies that a pre-hyphenated word always overrules automatic hyphenation, e.g. if a word contains SHY symbols or hyphen symbols (not minus-signs), then only the locations marked by these symbols must be taken into account for word-breaking. OK, that should be easy.
To support automatic hyphenation, there exist open source libraries. e.g. hunspell/hyphen (I'm pretty sure there are Java wrappers or pure Java libraries as well).
As there are different libraries and different languages, probably it makes no sense trying support all libraries and languages.
Instead, automatic hyphenation support for a specific should use the existing plugin concept.

Copy link
Contributor

@wimjongman wimjongman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have made some comments.

@hvbtup
Copy link
Contributor Author

hvbtup commented Jan 16, 2023

I renamed some attributes and method names and added a lot of comments in reaction to Wim's suggestions.

@hvbtup
Copy link
Contributor Author

hvbtup commented Jan 26, 2023

@wimjongman @claesrosell Can we merge this?

Copy link
Contributor

@wimjongman wimjongman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@wimjongman
Copy link
Contributor

Henning, please squash and merge but make sure you make a good commit comment. Something like this comment #1180 (comment) but maybe a bit more condensed.

@hvbtup hvbtup merged commit f5cb70e into eclipse-birt:master Jan 27, 2023
@hvbtup hvbtup deleted the SHY-support_#1179 branch January 27, 2023 08:24
@wimjongman wimjongman added this to the 4.13 milestone Jan 27, 2023
@wimjongman
Copy link
Contributor

Thank you for this contribution, Henning!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add support for soft hyphens (SHY, Unicode 173) to Word and PDF emitter
2 participants