Tokenizer — problems with punctuation? #801
Thanks for the report – this seems to be caused by the global infix rules being too specific. Currently, they cover all common hyphens, but no combinations of character + hyphen. I'll add a regression test for this issue and see if I can fix the rules to handle these cases without breaking anything else. (Now that we have a much better test suite in place for the tokenizers, it's definitely a lot easier to make sure changes to the regexes don't produce unintended results elsewhere.)

I definitely have the vision of one day being able to handle all Gutenberg texts perfectly and out of the box, though 😉 In some cases, the formatting markup is tricky and may conflict with other rules, so if you're working with a lot of texts like these, it might be worth creating a custom tokenizer subclass and overriding some of the punctuation rules.
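For anyone wanting to experiment, the infix mechanism can be approximated outside spaCy with a plain regex: every infix match is emitted as a token of its own, which is why an unguarded rule splits a string like `deserve,"--and` into separate single-character punctuation pieces. This is a minimal illustrative sketch, not spaCy's actual default rules; the pattern and helper name are made up for this example.

```python
import re

# Illustrative infix pattern (NOT spaCy's real defaults): split on "--"
# and on a handful of punctuation characters.
infix_re = re.compile(r'--|[,;:.?!"]')

def split_with_infixes(word, finditer):
    """Split `word` at every infix match, keeping the matches as tokens."""
    tokens, start = [], 0
    for match in finditer(word):
        if match.start() > start:
            tokens.append(word[start:match.start()])
        tokens.append(match.group())
        start = match.end()
    if start < len(word):
        tokens.append(word[start:])
    return tokens

print(split_with_infixes('deserve,"--and', infix_re.finditer))
# ['deserve', ',', '"', '--', 'and']
```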
Okay, so after playing around with it for a bit, here's a compromise for now: there's currently no easy way to define rules for splitting multiple infixes, but in any case, we definitely don't want to end up with punctuation attached to a token. I've modified the infix rules to not split off hyphens if they follow certain punctuation. Still not perfect, but at least the non-punctuation tokens are now correct and spaCy will be able to assign the correct POS tags.

```python
['"', 'deserve', ',"--', 'and']
['exception', ';--', 'exclusive']
['day', '.--', 'Is']
['refinement', ':--', 'just']
['memories', '?--', 'To']
['Useful', '.=--', 'Therefore']
['=', 'Hope', '.=--', 'Pandora']
```

So for now, we'll have to assume that this is the "correct" behaviour. I've edited the regression test to expect the above output, and I'm closing this issue since it's technically fixed. But handling multiple infixes would be a nice additional feature, so if you have a suggestion (or an idea for a PR along those lines), this would definitely be appreciated! 👍
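The compromise described above (keep a run of punctuation and hyphens together as one infix chunk rather than splitting it into single characters) can be mimicked with a grouping regex. The pattern and function below are a hypothetical simplification for illustration, not the actual rule change:

```python
import re

# Peel off the whole run of punctuation + dashes as a single infix chunk,
# so the alphabetic tokens on either side come out clean.
CHUNK_RE = re.compile(r'([A-Za-z]+)([^A-Za-z]+)([A-Za-z]+)')

def rough_split(text):
    """Split letters / punctuation-run / letters into three tokens."""
    match = CHUNK_RE.fullmatch(text)
    return list(match.groups()) if match else [text]

print(rough_split('deserve,"--and'))  # ['deserve', ',"--', 'and']
print(rough_split('day.--Is'))        # ['day', '.--', 'Is']
```

This reproduces the grouping seen in the lists above for the simple cases, while leaving strings without an interior punctuation run untouched.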
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
Consider the following snippet:
This outputs

```
deserve,"--and
```

Running the tokenizer on the full text outputs the following errors (I assume `X` marks an unidentifiable POS tag):

Most of those tokenization failures are due to punctuation. Admittedly, Gutenberg's texts are not the cleanest ones, but perhaps the tokenization rules could be improved?
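One workaround for the Gutenberg-specific failures (the `=emphasis=` markup that produced tokens like `.=--` and a stray `=`) is to strip that markup before tokenizing. The helper below is a hypothetical pre-processing step, not part of spaCy, and assumes equals signs wrap an emphasized word or phrase:

```python
import re

# Hypothetical pre-clean: drop Gutenberg-style '=emphasis=' markers,
# assuming equals signs wrap a word or phrase with no '=' inside.
def strip_gutenberg_markup(text):
    return re.sub(r'=([^=]+)=', r'\1', text)

print(strip_gutenberg_markup('=Hope.=--Pandora'))  # Hope.--Pandora
```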