Take advantage of non-breaking spaces #53

Open
hftf opened this issue Sep 26, 2018 · 0 comments

I have a corpus of text that often uses explicit non-breaking spaces (NBSP, U+00A0). They are mainly used to keep words in the same sentence together. They often appear after sentence-medial terminal punctuation (.!?) and before short sentence-final words (as in G. F. Handel composed Water Music for George I.). They were inserted to improve both document layout and parsing.

Consider the following four cases:

1.  'Peter Pan is a J. M. Barrie play.'   # no NBSP
2.  'Peter Pan is a J. M. Barrie play.'   # NBSP after J.
3.  'Peter Pan is a J. M. Barrie play.'   # NBSP after M.
4.  'Peter Pan is a J. M. Barrie play.'   # NBSP after J. and M.

I was surprised that only the first case was segmented correctly out of the box:

1.  ['Peter Pan is a J. M. Barrie play.']
2. *['Peter Pan is a J.', ' M.', 'Barrie play.']
3. *['Peter Pan is a J. M.', ' Barrie play.']
4. *['Peter Pan is a J.', ' M.', ' Barrie play.']

Of course, it would be easy to just translate all of them to normal spaces and call it a day.
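For reference, that workaround is a one-liner. A minimal sketch in Python (the helper name `strip_nbsp` is my own, not part of any library):

```python
NBSP = "\u00a0"  # U+00A0 NO-BREAK SPACE


def strip_nbsp(text: str) -> str:
    """Replace every non-breaking space with an ordinary space (U+0020)."""
    return text.replace(NBSP, " ")
```

This discards the author's keep-together hint before the segmenter ever sees it.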


But non-breaking spaces can serve as useful disambiguation to a pragmatic sentence segmenter.

Not every sentence can be accurately segmented using a limited number of rules, so taking advantage of non-breaking spaces can improve the results in some trickier cases without making the checks much more complicated.
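To illustrate the idea, here is a toy splitter in Python that only ever breaks at an ordinary space and never at an NBSP. It is a sketch of the principle, not the program's actual rules; `naive_split` and its regular expression are assumptions of mine. Note that Python's `\s` matches U+00A0 in `str` patterns, so the pattern uses a literal space instead.

```python
import re

NBSP = "\u00a0"

# Break after terminal punctuation only when it is followed by one or more
# ordinary spaces (U+0020) and then a capital letter. An NBSP never matches
# the literal space, so it can never be a break point.
_BOUNDARY = re.compile(r"(?<=[.!?]) +(?=[A-Z])")


def naive_split(text: str) -> list[str]:
    """Split text into sentences, treating NBSP as a keep-together hint."""
    return _BOUNDARY.split(text)
```

With NBSPs after both initials, `naive_split` keeps 'Peter Pan is a J. M. Barrie play.' in one piece. With ordinary spaces it splits after every initial, which is exactly the ambiguity the NBSP removes.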

(Disclaimer: I have not looked at the rules and do not claim to understand the internals of the program.)

I will provide a few examples to demonstrate how this may be useful. You are welcome to incorporate them as new test cases, even if you ultimately decide not to bother with non-breaking spaces.


Cases 5–7 are extremely similar, but 7 surprisingly produces a different result. In case 8, adding an NBSP seems to fix the problem, presumably without making This behave like He and They internally.

5.  'Sri Lanka was conquered by the Cholas and Raja Raja I. They moved the capital.'   # no NBSP
6.  'Sri Lanka was conquered by Raja Raja I. He moved the capital to Polonnaruwa.'     # no NBSP
7.  'Sri Lanka was conquered by Raja Raja I. This moved the capital to Polonnaruwa.'   # no NBSP
8.  'Sri Lanka was conquered by Raja Raja I. This moved the capital to Polonnaruwa.'   # NBSP before I.

5.  ['Sri Lanka was conquered by the Cholas and Raja Raja I.', 'They moved the capital.']
6.  ['Sri Lanka was conquered by Raja Raja I.', 'He moved the capital to Polonnaruwa.']
7. *['Sri Lanka was conquered by Raja Raja I. This moved the capital to Polonnaruwa.']
8.  ['Sri Lanka was conquered by Raja Raja I.', 'This moved the capital to Polonnaruwa.']

Cases 9–10 show the same sort of “fix”:

9.   'Lu Xun wrote The True Story of Ah Q. He was a Chinese author.'   # no NBSP
10.  'Lu Xun wrote The True Story of Ah Q. He was a Chinese author.'   # NBSP before Q.

9.  *['Lu Xun wrote The True Story of Ah Q. He was a Chinese author.']
10.  ['Lu Xun wrote The True Story of Ah Q.', 'He was a Chinese author.']
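The two NBSP conventions in my corpus (after sentence-medial punctuation, and before a short sentence-final word) can be told apart mechanically. A rough sketch in Python, where the function name, the role labels, and the length threshold of 3 are all my own assumptions:

```python
import re

NBSP = "\u00a0"
TERMINALS = ".!?"


def classify_nbsp(text: str, max_final_len: int = 3) -> list[tuple[int, str]]:
    """Return (index, role) for each NBSP in text.

    Roles (hypothetical labels):
      "keep-together"  - NBSP directly follows . ! or ?
      "sentence-final" - NBSP precedes a short token ending in . ! or ?
      "other"          - anything else
    """
    roles = []
    for match in re.finditer(NBSP, text):
        i = match.start()
        before = text[i - 1] if i > 0 else ""
        # Token after the NBSP, up to the next whitespace (NBSP included).
        following = re.match(r"\S+", text[i + 1:])
        token = following.group(0) if following else ""
        if before and before in TERMINALS:
            roles.append((i, "keep-together"))
        elif (len(token) > 1 and token[-1] in TERMINALS
              and len(token) - 1 <= max_final_len):
            roles.append((i, "sentence-final"))
        else:
            roles.append((i, "other"))
    return roles
```

Under this sketch, the NBSP in case 10 is "sentence-final", so the period after Q could be treated as a real boundary, while the NBSPs after J. and M. in case 4 are "keep-together".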

Cases 11–14 show that the problem is much harder than it looks, because Feng S. He is a plausible Chinese name:

11.  'Feng S. He was a Chinese diplomat who secretly saved 3,000 Austrian Jews.'  # no NBSP
12.  'He said the story of Feng S. He was a secret. He was a Chinese diplomat.'   # no NBSP
13.  'He said the story of Feng S. He was a secret. He was a Chinese diplomat.'   # NBSP after S.
14.  'I learned the son of Feng S. He was a Chinese-American microbiologist.'     # no NBSP

11.  ['Feng S. He was a Chinese diplomat who secretly saved 3,000 Austrian Jews.']
12.  ['He said the story of Feng S. He was a secret.', 'He was a Chinese diplomat.']
13. *['He said the story of Feng S.', ' He was a secret.', 'He was a Chinese diplomat.']
14.  ['I learned the son of Feng S. He was a Chinese-American microbiologist.']

The non-breaking space currently has no effect in cases 15 and 16. However, honoring the NBSP would let such sentences be segmented correctly without adding more rules or adding rare words like .NET to an explicit list.

15.  'I want to learn Microsoft’s .NET framework.'   # no NBSP
16.  'I want to learn Microsoft’s .NET framework.'   # NBSP before .NET

15. *['I want to learn Microsoft’s .', 'NET framework.']
16. *['I want to learn Microsoft’s .', 'NET framework.']

Resolving this issue could help with the following:

  1. Improve accuracy for corpora that (partially) use an existing convention for keeping words together using non-breaking spaces.
  2. Improve the existing rules by scrutinizing issues with the cases provided.
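Since NBSPs are invisible in most renderings of this issue, here are some of the NBSP cases written with explicit escapes, as a sketch in Python. The expected segmentations for the starred cases are my inference of the desired output, and `segment` stands for any callable that maps a string to a list of sentences; neither is taken from the program itself.

```python
NBSP = "\u00a0"

# case number -> (input text, expected list of sentences)
CASES = {
    2: (f"Peter Pan is a J.{NBSP}M. Barrie play.",
        [f"Peter Pan is a J.{NBSP}M. Barrie play."]),
    4: (f"Peter Pan is a J.{NBSP}M.{NBSP}Barrie play.",
        [f"Peter Pan is a J.{NBSP}M.{NBSP}Barrie play."]),
    8: (f"Sri Lanka was conquered by Raja Raja{NBSP}I. "
        "This moved the capital to Polonnaruwa.",
        [f"Sri Lanka was conquered by Raja Raja{NBSP}I.",
         "This moved the capital to Polonnaruwa."]),
    10: (f"Lu Xun wrote The True Story of Ah{NBSP}Q. He was a Chinese author.",
         [f"Lu Xun wrote The True Story of Ah{NBSP}Q.",
          "He was a Chinese author."]),
    13: (f"He said the story of Feng S.{NBSP}He was a secret. "
         "He was a Chinese diplomat.",
         [f"He said the story of Feng S.{NBSP}He was a secret.",
          "He was a Chinese diplomat."]),
    16: (f"I want to learn Microsoft’s{NBSP}.NET framework.",
         [f"I want to learn Microsoft’s{NBSP}.NET framework."]),
}


def failing_cases(segment):
    """Return the case numbers for which segment(text) != expected."""
    return [n for n, (text, expected) in sorted(CASES.items())
            if segment(text) != expected]
```

Passing the real segmenter as `segment` would list the cases that still fail; an empty list means all of them pass.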

I favor this program for my use case because of its stance on embedded quotations.
