
Hyphenated words in French #13663

Closed

lsmith77 opened this issue Oct 15, 2024 · 2 comments
How to reproduce the behaviour

j'imagine des grands-pères is tokenized to j', imagine, des, grands-, pères

https://demos.explosion.ai/displacy?text=j%27imagine%20des%20grands-p%C3%A8res&model=fr_core_news_sm&cpu=1&cph=1

I would expect it to tokenize to j', imagine, des, grands-pères, i.e. not split grands-pères.

In English, he is a top-performer does not split top-performer:

https://demos.explosion.ai/displacy?text=he%20is%20a%20top-performer&model=en_core_web_sm&cpu=1&cph=1

Is this intended or a bug?
If it is intended, how can I adjust the tokenization so that it does not split hyphenated words?
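For reference, a minimal reproduction sketch of the two demo links above (assuming the fr_core_news_sm and en_core_web_sm pipelines are installed locally):

import spacy

# French: the hyphenated compound gets split, as described in the report.
nlp_fr = spacy.load("fr_core_news_sm")
print([t.text for t in nlp_fr("j'imagine des grands-pères")])

# English: top-performer stays one token.
nlp_en = spacy.load("en_core_web_sm")
print([t.text for t in nlp_en("he is a top-performer")])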

lsmith77 (Author) commented Oct 15, 2024

Note that I have tried the following:

from spacy.lang.char_classes import (
    ALPHA, ALPHA_LOWER, ALPHA_UPPER, CONCAT_QUOTES, HYPHENS,
    LIST_ELLIPSES, LIST_ICONS,
)

infixes = (
    LIST_ELLIPSES
    + LIST_ICONS
    + [
        r"(?<=[0-9])[+\-\*^](?=[0-9-])",
        r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
            al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
        ),
        r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
        # ✅ Commented out the default rule that splits on hyphens between letters:
        # r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
        r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
    ]
)

see also https://spacy.io/usage/linguistic-features#native-tokenizer-additions

But this then leads to j'imagine not being split into two words. Oddly enough, in English with the above infixes it does still split it's into two words.
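A workaround sketch that avoids this: instead of constructing a whole new Tokenizer, only swap the infix matcher on the already loaded pipeline (the "modifying existing rule sets" pattern from the spaCy docs). Here nlp is assumed to be a loaded fr_core_news_sm pipeline; the prefix, suffix, and exception rules that split j' (and it's in English) stay untouched.

from spacy.lang.char_classes import HYPHENS
from spacy.util import compile_infix_regex

# Rebuild the infix list from the pipeline's own defaults, dropping only the
# rule whose pattern contains HYPHENS (the letter-hyphen-letter split).
infixes = [pattern for pattern in nlp.Defaults.infixes if HYPHENS not in pattern]

# Replace only the infix matcher; all other tokenizer rules are kept as-is.
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer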

lsmith77 (Author) commented:

Alright, I managed to fix things by making sure I really use all of the language-specific defaults (see below). What would really help is if all the lists of regexes were actually dictionaries. That would make it possible to manipulate the defaults in a very targeted way.

For example, it would then allow

infixes = English.Defaults.infixes
del infixes['hyphen-splitting']

Instead of

infixes = English.Defaults.infixes
for i in range(0, len(infixes)):
  # https://spacy.io/usage/linguistic-features#tokenization
  # r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS)
  if HYPHENS in infixes[i]:
    infixes.pop(i)
    break
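Purely as an illustration of that feature request, keyed defaults could look roughly like the sketch below. The key names are made up here; spaCy's defaults are plain sequences today, so this is not an existing API.

from spacy.lang.char_classes import ALPHA, HYPHENS

# Hypothetical keyed infix defaults -- not an existing spaCy API.
named_infixes = {
    "numeric-operators": r"(?<=[0-9])[+\-\*^](?=[0-9-])",
    "hyphen-splitting": r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
    # ... remaining default rules ...
}
del named_infixes["hyphen-splitting"]
infixes = list(named_infixes.values())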

Here is my code:

    # Requires (at module level):
    #   from spacy.lang.char_classes import ALPHA, HYPHENS
    #   from spacy.lang.de import German
    #   from spacy.lang.en import English
    #   from spacy.lang.fr import French
    #   from spacy.tokenizer import Tokenizer
    #   from spacy.util import compile_infix_regex, compile_prefix_regex, compile_suffix_regex
    def custom_tokenizer(self, lang, nlp):
        if lang == LangType.DE:
            infixes = German.Defaults.infixes
            for i in range(0, len(infixes)):
                if ":<>=" in infixes[i]:
                    # handle 'Kund:in' as one word
                    infixes[i] = r"(?<=[{a}])[<>=](?=[{a}])".format(a=ALPHA)
                    break

            rules = German.Defaults.tokenizer_exceptions
            suffixes = German.Defaults.suffixes
            prefixes = German.Defaults.prefixes
            token_match = German.Defaults.token_match
        elif lang == LangType.EN:
            infixes = English.Defaults.infixes
            for i in range(0, len(infixes)):
                if HYPHENS in infixes[i]:
                    # https://spacy.io/usage/linguistic-features#tokenization
                    # r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS)
                    infixes.pop(i)
                    break

            rules = English.Defaults.tokenizer_exceptions
            suffixes = English.Defaults.suffixes
            prefixes = English.Defaults.prefixes
            token_match = English.Defaults.token_match
        elif lang == LangType.FR:
            # return None
            infixes = French.Defaults.infixes
            for i in range(0, len(infixes)):
                if HYPHENS in infixes[i]:
                    # https://spacy.io/usage/linguistic-features#tokenization
                    # r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS)
                    infixes.pop(i)
                    break

            rules = French.Defaults.tokenizer_exceptions
            suffixes = French.Defaults.suffixes
            prefixes = French.Defaults.prefixes
            token_match = French.Defaults.token_match
        else:
            return None

        # https://github.com/explosion/spaCy/discussions/12930
        suffixes += [r"\."]

        return Tokenizer(
            vocab=nlp.vocab,
            rules=rules,
            prefix_search=compile_prefix_regex(prefixes).search,
            suffix_search=compile_suffix_regex(suffixes).search,
            infix_finditer=compile_infix_regex(infixes).finditer,
            token_match=token_match,
        )
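
For completeness, a sketch of how a tokenizer built this way would typically be attached (LangType is the author's own enum and builder stands in for whatever object the method lives on; neither is part of spaCy):

import spacy

nlp = spacy.load("fr_core_news_sm")
nlp.tokenizer = builder.custom_tokenizer(LangType.FR, nlp)   # replace the default tokenizer
print([t.text for t in nlp("j'imagine des grands-pères")])   # grands-pères should now stay whole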
