Segmenting sentences at colons #9

fhamborg · 2020-01-22T16:12:00Z

For example, the following snippet is extracted as a single sentence (ending at the final full stop), but it should perhaps be split at the colons.

Here they “warn” anyone who opposes his radical ideology:
Four police officers were sent to hospital:
Violence against police officers is not only acceptable with Bernie Sanders and Black Lives Matter terrorists, its necessary to create chaos and panic:
What kind of violent protest would be complete without Barack Obama’s good friend, domestic terrorist Bill Ayers:
It’s probably just a coincidence that on a day that <u><b>Obama</b></u> was too busy to attend Nancy Reagan’s funeral, he was able to address a crowd about his hate for Trump only hours before this organized chaos in Chicago:
And finally, we’re wondering how much our Organizer In Chief had to do with this Alinsky style chaos in Chicago:
Illegal aliens, paid Soros protesters, angry Black Lives Matter terrorists inspired by Obama’s race war and Bernie Sanders supporters who have absolutely no idea why they showed up, sent four innocent police officers to the hospital; prevented thousands of innocent Americans from exercising their First Amendment right.

Is this intentional? Is there a way to force splitting at colons? Besides this extreme example, I think I have come across many cases where syntok did not split at colons.

fnl · 2020-01-22T16:26:29Z

Thank you, Felix, for bringing this up; a valid feature request. Colon (and semi-colon) handling is indeed a bit of a borderline affair, and technically they are sentence separators. It might make sense to support that, but I need to think about it a bit more. I'd also love to hear feedback and opinions from other users about this.

[Correcting the title and adding labels.]

fhamborg · 2020-01-30T11:40:36Z

Yeah, I agree: whether segmentation at colons and semicolons is sensible likely also depends on the text domain. Looking at the Wikipedia definition of each, one finds that both have cases where segmentation would be required and others where it would not.

E.g., for the semicolon (cf. Wikipedia): "The semicolon or semi-colon (;) is a punctuation mark that separates major sentence elements. A semicolon can be used between two closely related independent clauses, provided they are not already joined by a coordinating conjunction. Semicolons can also be used in place of commas to separate the items in a list, particularly when the elements of that list contain commas."

Yet, at least for the colon, I found that nltk and CoreNLP actually do perform segmentation more often than not (if not always?).

nmstoker · 2020-04-26T01:02:01Z

My two cents: those examples aren't really separate sentences because of the colons; they're separate sentences due to their content, and they just happen to have (very odd) colons at the end. It's not normal English usage to end a sentence with a colon; in fact, a colon actively implies some following content. Therefore, I would not expect syntok to split on a colon, and would prefer that this be left to people to deal with if their particular text source has special cases.

However, with semi-colons I would be more open to the idea that the parts can be treated as separate sentences. It's not uncommon for editors looking to simplify text to turn such cases into two (or more) distinct sentences, and splitting would be less surprising here than in the colon case.

fnl · 2020-04-27T08:08:07Z

In general, libraries such as nltk and CoreNLP tend to severely over-split, which was the major reason for me to come up with my own. Hence, I agree: adding semicolons as potential markers could be interesting, while it seems unwise to elevate colons to official markers, too.

fhamborg · 2020-04-27T08:57:35Z

> Hence, I agree, adding semicolons as potential markers could be interesting, while it seems unwise to elevate colons to official markers, too.

This seems feasible to me.

fnl · 2020-04-28T21:43:12Z

Release 1.3.1 now supports semi-colon segmentation.

I will leave this ticket open, however, as it was specifically about segmenting at colons.
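
For text sources like the one in the original report, the source-specific handling suggested earlier in this thread could be done as a post-processing step on already-segmented sentences. The following is only a minimal sketch using the Python standard library; it does not use or change syntok's API. The helper name `split_at_line_final_colons` is hypothetical, and the rule it applies (split only where a colon is directly followed by a line break) is an assumption that happens to fit the example above, not a general rule for colons.

```python
import re
from typing import List

# Assumption: only a colon that ends a line marks a boundary.
# Colons inside a line ("Note: ...", "10:30") are left untouched.
_COLON_AT_LINE_END = re.compile(r"(?<=:)[ \t]*\r?\n\s*")


def split_at_line_final_colons(sentence: str) -> List[str]:
    """Split one already-segmented sentence at colons followed by a line break."""
    parts = _COLON_AT_LINE_END.split(sentence)
    # Drop empty fragments and surrounding whitespace; the colon stays
    # attached to the part it closes.
    return [part.strip() for part in parts if part.strip()]


if __name__ == "__main__":
    text = (
        "Four police officers were sent to hospital:\n"
        "The rally was due to start at 10:30 and was cancelled:\n"
        "The last part ends with a full stop."
    )
    for piece in split_at_line_final_colons(text):
        print(piece)
```

Because the split is tied to line breaks, inline colons keep their sentence together, which avoids the over-splitting discussed above. Texts whose line breaks were removed before segmentation would need a different rule.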

fnl changed the title ~~incorrect handling of colons~~ Segmenting sentences at colons Jan 22, 2020

fnl added the enhancement label Jan 22, 2020
