Support non-breaking content in Python #251

Merged 1 commit on Aug 17, 2023

@kojiishi (Collaborator) commented Aug 15, 2023

This patch supports non-breaking content in Python.

In the Java and Python implementations, the "Skip" operation includes the skipped content in the text given to the BudouX parser, so no changes to that text are needed.

This patch makes the following changes:

  1. Changed to_skip to a stack of elements, rather than always resetting it to False at the end of an element.
  2. When there's a phrase boundary right before the "skip" element, insert a break before the "skip" element.

Note that <NOBR> was added to skip_nodes.json in #248.

@kojiishi kojiishi marked this pull request as ready for review August 15, 2023 12:35
@kojiishi kojiishi requested a review from tushuhei August 15, 2023 12:35
@tushuhei tushuhei merged commit a38f629 into google:main Aug 17, 2023
22 checks passed
@kojiishi kojiishi deleted the nobr-py branch August 17, 2023 07:27
kojiishi added a commit to kojiishi/budoux that referenced this pull request Nov 10, 2023
google#251 assumed that all tags are closed properly.

This assumption doesn't hold in cases such as:
1. Self-closing tags such as `<img>` don't have corresponding close tags.
2. Unpaired close tags are still valid HTML.

This patch supports these cases by assuming that all open tags that don't
nest correctly or that aren't closed are automatically closed.

This isn't the full HTML "adoption agency algorithm", but it should be
good enough for the needs of BudouX.

Fixes google#355
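
One way to read the auto-closing rule described in this commit message is sketched below. It is an illustration under assumptions, not the code from the commit: `LenientStack`, `trace_skip`, and `SKIP_TAGS` are hypothetical names, and the policy shown (never push void elements, ignore close tags with no matching open tag, and pop through to the matching tag otherwise) is just one plausible interpretation.

```python
from html.parser import HTMLParser

# HTML void elements never have a close tag, so they are never pushed.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
SKIP_TAGS = {"nobr"}  # hypothetical skip set for illustration only


class LenientStack(HTMLParser):
    """Tracks open elements while tolerating unclosed and unpaired tags."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.segments = []  # (text, inside_skip) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # unpaired close tag: ignore it
        # Pop through the matching tag; anything opened after it and never
        # closed is treated as automatically closed here.
        while self.open_tags.pop() != tag:
            pass

    def handle_data(self, data):
        inside = any(t in SKIP_TAGS for t in self.open_tags)
        self.segments.append((data, inside))


def trace_skip(html):
    """Return (text, inside_skip) pairs for each text run in html."""
    parser = LenientStack()
    parser.feed(html)
    parser.close()
    return parser.segments


print(trace_skip("a</b><nobr>b<img>c<i>d</nobr>e"))
```

Here the stray `</b>` is ignored, `<img>` never enters the stack, and the unclosed `<i>` is discarded when `</nobr>` arrives, so the text after `</nobr>` is no longer treated as skipped.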
kojiishi added a commit that referenced this pull request Nov 11, 2023
* Fix unpaired close tags and self-closing tags