Segmenting sentences at colons #9
Comments
Thank you, Felix, for bringing this up. A valid feature request: colon (and semi-colon) handling is indeed a bit of a borderline affair, and technically they are sentence separators. It might make sense to support that, but I need to think about it a bit more. I'd also love to hear feedback/opinions from other users about this. [Correcting the title and adding labels.]
Yeah, I agree: whether segmentation at colons and semicolons is sensible likely also depends on the text domain. Looking at the Wikipedia definition of each, one finds that both have cases where segmentation would be required and others where it would not. E.g., for the semicolon (cf. Wikipedia): "The semicolon or semi-colon (;) is a punctuation mark that separates major sentence elements. A semicolon can be used between two closely related independent clauses, provided they are not already joined by a coordinating conjunction. Semicolons can also be used in place of commas to separate the items in a list, particularly when the elements of that list contain commas." Yet, at least for the colon, I found that nltk and CoreNLP actually do perform segmentation more often than not (if not always?).
My two cents: those examples aren't really separate sentences because of the colons; they're separate sentences due to their content, and they just happen to have the (very odd) colons at the end. It's not normal English usage to end a sentence with a colon; in fact, it actively implies some following content. Therefore I would tend not to expect it to split on a colon, and would prefer that be left to people to deal with if there are special cases with their particular text source. However, with a semi-colon I would be more open to the idea that the parts can be treated as separate sentences. It's not uncommon for editors looking to simplify text to turn such cases into two (or more) distinct sentences, and it would be less surprising here than it would be in the colon case.
In general, libraries such as nltk and CoreNLP tend to severely over-split, which was the major reason for me to come up with my own. Hence, I agree: adding semicolons as potential markers could be interesting, while it seems unwise to elevate colons to official markers, too.
This seems feasible to me.
Release 1.3.1 now supports semi-colon segmentation. I will leave this ticket open, however, as it was specifically about segmenting at colons.
For example, the following snippet is extracted as one single sentence (ending at the last full stop), but it should perhaps be split at the colons.
Is this intentional? Is there a way to force splitting at colons? Besides this extreme example, I think I came across many cases where syntok did not split at colons.
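Until colon splitting is supported, one workaround is to post-process the sentences yourself. The following is only a rough sketch, not part of syntok: `split_at_colons` is a hypothetical helper that assumes the sentences have already been joined into plain strings, and it splits only where a colon is followed by whitespace, so that times and URLs stay intact.

```python
import re

# Matches the whitespace that directly follows a colon. Colons inside tokens
# such as "10:30" or "http://example.org" are not followed by whitespace and
# are therefore left alone.
_COLON_BREAK = re.compile(r"(?<=:)\s+")


def split_at_colons(sentences):
    """Hypothetical helper: split already-segmented sentences at colons."""
    result = []
    for sentence in sentences:
        parts = (part.strip() for part in _COLON_BREAK.split(sentence))
        # Drop empty fragments, e.g. from a sentence that ends in ": ".
        result.extend(part for part in parts if part)
    return result


if __name__ == "__main__":
    for piece in split_at_colons(["Ingredients: flour and sugar.",
                                  "Meet at 10:30 today."]):
        print(piece)
```

Whether such a split is desirable depends on the text source, as discussed in the comments, so this is better kept as an opt-in step in your own pipeline than applied blindly.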