Tokenizer — problems with punctuation? #801
Thanks for the report – this seems to be caused by the global infix rules being too specific. Currently, they cover all common hyphens, but no combinations of character + hyphen. I'll add a regression test for this issue and see if I can fix the rules to handle these cases without breaking anything else. (Now that we have a much better test suite in place for the tokenizers, it's definitely a lot easier to make sure changes to the regexes don't produce unintended results elsewhere.)

I definitely have the vision of one day being able to handle all Gutenberg texts perfectly and out of the box, though 😉 In some cases, the formatting markup is tricky and may conflict with other rules, so if you're working with a lot of texts like these, it might be worth creating a custom tokenizer subclass and overriding some of the punctuation rules.
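For anyone wanting to experiment, the infix mechanism can be approximated outside spaCy with a plain regex: every infix match is emitted as a token of its own, which is why an unguarded rule splits a string like `deserve,"--and` into separate single-character punctuation pieces. This is a minimal illustrative sketch, not spaCy's actual default rules; the pattern and helper name are made up for this example.

```python
import re

# Illustrative infix pattern (NOT spaCy's real defaults): split on "--"
# and on a handful of punctuation characters.
infix_re = re.compile(r'--|[,;:.?!"]')

def split_with_infixes(word, finditer):
    """Split `word` at every infix match, keeping the matches as tokens."""
    tokens, start = [], 0
    for match in finditer(word):
        if match.start() > start:
            tokens.append(word[start:match.start()])
        tokens.append(match.group())
        start = match.end()
    if start < len(word):
        tokens.append(word[start:])
    return tokens

print(split_with_infixes('deserve,"--and', infix_re.finditer))
# ['deserve', ',', '"', '--', 'and']
```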
Okay, so after playing around with it for a bit, here's a compromise for now: there's currently no easy way to define rules for splitting multiple infixes, but in any case, we definitely don't want to end up with punctuation attached to a token. I've modified the infix rules to not split off hyphens if they follow certain punctuation. Still not perfect, but at least the non-punctuation tokens are now correct and spaCy will be able to assign the correct POS tags.

```python
['"', 'deserve', ',"--', 'and']
['exception', ';--', 'exclusive']
['day', '.--', 'Is']
['refinement', ':--', 'just']
['memories', '?--', 'To']
['Useful', '.=--', 'Therefore']
['=', 'Hope', '.=--', 'Pandora']
```

So for now, we'll have to assume that this is the "correct" behaviour. I've edited the regression test to expect the above output, and I'm closing this issue since it's technically fixed. But handling multiple infixes would be a nice additional feature, so if you have a suggestion (or an idea for a PR along those lines), this would definitely be appreciated! 👍
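The compromise described above (keep a run of punctuation and hyphens together as one infix chunk rather than splitting it into single characters) can be mimicked with a grouping regex. The pattern and function below are a hypothetical simplification for illustration, not the actual rule change:

```python
import re

# Peel off the whole run of punctuation + dashes as a single infix chunk,
# so the alphabetic tokens on either side come out clean.
CHUNK_RE = re.compile(r'([A-Za-z]+)([^A-Za-z]+)([A-Za-z]+)')

def rough_split(text):
    """Split letters / punctuation-run / letters into three tokens."""
    match = CHUNK_RE.fullmatch(text)
    return list(match.groups()) if match else [text]

print(rough_split('deserve,"--and'))  # ['deserve', ',"--', 'and']
print(rough_split('day.--Is'))        # ['day', '.--', 'Is']
```

This reproduces the grouping seen in the lists above for the simple cases, while leaving strings without an interior punctuation run untouched.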
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
Consider the following snippet:
This outputs

```
deserve,"--and
```

Running the tokenizer on the full text outputs the following errors (I assume `X` marks an unidentifiable POS tag):

Most of those tokenization failures are due to punctuation. Admittedly, Gutenberg's texts are not the cleanest ones, but perhaps the tokenization rules could be improved?
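One workaround for the Gutenberg-specific failures (the `=emphasis=` markup that produced tokens like `.=--` and a stray `=`) is to strip that markup before tokenizing. The helper below is a hypothetical pre-processing step, not part of spaCy, and assumes equals signs wrap an emphasized word or phrase:

```python
import re

# Hypothetical pre-clean: drop Gutenberg-style '=emphasis=' markers,
# assuming equals signs wrap a word or phrase with no '=' inside.
def strip_gutenberg_markup(text):
    return re.sub(r'=([^=]+)=', r'\1', text)

print(strip_gutenberg_markup('=Hope.=--Pandora'))  # Hope.--Pandora
```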