
Hyphenated words in French #13663

Closed

lsmith77 opened this issue Oct 15, 2024 · 2 comments
How to reproduce the behaviour

j'imagine des grands-pères is tokenized to j', imagine, des, grands-, pères

https://demos.explosion.ai/displacy?text=j%27imagine%20des%20grands-p%C3%A8res&model=fr_core_news_sm&cpu=1&cph=1

I would expect it to tokenize to j', imagine, des, grands-pères, i.e. not split grands-pères.

In English, he is a top-performer does not split top-performer:

https://demos.explosion.ai/displacy?text=he%20is%20a%20top-performer&model=en_core_web_sm&cpu=1&cph=1

Is this intended or a bug?
If it is intended, how can I adjust the tokenization so that it does not split hyphenated words?
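For reference, a minimal reproduction sketch of the two demo links above (assuming the fr_core_news_sm and en_core_web_sm pipelines are installed locally):

import spacy

# French: the hyphenated compound gets split, as described in the report.
nlp_fr = spacy.load("fr_core_news_sm")
print([t.text for t in nlp_fr("j'imagine des grands-pères")])

# English: top-performer stays one token.
nlp_en = spacy.load("en_core_web_sm")
print([t.text for t in nlp_en("he is a top-performer")])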

lsmith77 (Author) commented Oct 15, 2024

Note that I have tried the following:

from spacy.lang.char_classes import (
    ALPHA, ALPHA_LOWER, ALPHA_UPPER, CONCAT_QUOTES, HYPHENS,
    LIST_ELLIPSES, LIST_ICONS,
)

infixes = (
    LIST_ELLIPSES
    + LIST_ICONS
    + [
        r"(?<=[0-9])[+\-\*^](?=[0-9-])",
        r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
            al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
        ),
        r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
        # ✅ Commented out the default rule that splits on hyphens between letters:
        # r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
        r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
    ]
)

see also https://spacy.io/usage/linguistic-features#native-tokenizer-additions

But this then leads to j'imagine not being split into two words. Oddly enough, in English with the above infixes it does still split it's into two words.
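A workaround sketch that avoids this: instead of constructing a whole new Tokenizer, only swap the infix matcher on the already loaded pipeline (the "modifying existing rule sets" pattern from the spaCy docs). Here nlp is assumed to be a loaded fr_core_news_sm pipeline; the prefix, suffix, and exception rules that split j' (and it's in English) stay untouched.

from spacy.lang.char_classes import HYPHENS
from spacy.util import compile_infix_regex

# Rebuild the infix list from the pipeline's own defaults, dropping only the
# rule whose pattern contains HYPHENS (the letter-hyphen-letter split).
infixes = [pattern for pattern in nlp.Defaults.infixes if HYPHENS not in pattern]

# Replace only the infix matcher; all other tokenizer rules are kept as-is.
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer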

lsmith77 (Author) commented:

Alright, I managed to fix things by making sure I really use all of the language-specific defaults (see below). What would really help is if all the lists of regexes were actually dictionaries. That would make it possible to manipulate the defaults in a very targeted way.

For example, it would then allow

infixes = English.Defaults.infixes
del infixes['hyphen-splitting']

Instead of

infixes = English.Defaults.infixes
for i in range(0, len(infixes)):
  # https://spacy.io/usage/linguistic-features#tokenization
  # r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS)
  if HYPHENS in infixes[i]:
    infixes.pop(i)
    break
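Purely as an illustration of that feature request, keyed defaults could look roughly like the sketch below. The key names are made up here; spaCy's defaults are plain sequences today, so this is not an existing API.

from spacy.lang.char_classes import ALPHA, HYPHENS

# Hypothetical keyed infix defaults -- not an existing spaCy API.
named_infixes = {
    "numeric-operators": r"(?<=[0-9])[+\-\*^](?=[0-9-])",
    "hyphen-splitting": r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
    # ... remaining default rules ...
}
del named_infixes["hyphen-splitting"]
infixes = list(named_infixes.values())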

Here is my code:

    # Requires (at module level):
    #   from spacy.lang.char_classes import ALPHA, HYPHENS
    #   from spacy.lang.de import German
    #   from spacy.lang.en import English
    #   from spacy.lang.fr import French
    #   from spacy.tokenizer import Tokenizer
    #   from spacy.util import compile_infix_regex, compile_prefix_regex, compile_suffix_regex
    def custom_tokenizer(self, lang, nlp):
        if lang == LangType.DE:
            infixes = German.Defaults.infixes
            for i in range(0, len(infixes)):
                if ":<>=" in infixes[i]:
                    # handle 'Kund:in' as one word
                    infixes[i] = r"(?<=[{a}])[<>=](?=[{a}])".format(a=ALPHA)
                    break

            rules = German.Defaults.tokenizer_exceptions
            suffixes = German.Defaults.suffixes
            prefixes = German.Defaults.prefixes
            token_match = German.Defaults.token_match
        elif lang == LangType.EN:
            infixes = English.Defaults.infixes
            for i in range(0, len(infixes)):
                if HYPHENS in infixes[i]:
                    # https://spacy.io/usage/linguistic-features#tokenization
                    # r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS)
                    infixes.pop(i)
                    break

            rules = English.Defaults.tokenizer_exceptions
            suffixes = English.Defaults.suffixes
            prefixes = English.Defaults.prefixes
            token_match = English.Defaults.token_match
        elif lang == LangType.FR:
            # return None
            infixes = French.Defaults.infixes
            for i in range(0, len(infixes)):
                if HYPHENS in infixes[i]:
                    # https://spacy.io/usage/linguistic-features#tokenization
                    # r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS)
                    infixes.pop(i)
                    break

            rules = French.Defaults.tokenizer_exceptions
            suffixes = French.Defaults.suffixes
            prefixes = French.Defaults.prefixes
            token_match = French.Defaults.token_match
        else:
            return None

        # https://github.com/explosion/spaCy/discussions/12930
        suffixes += [r"\."]

        return Tokenizer(
            vocab=nlp.vocab,
            rules=rules,
            prefix_search=compile_prefix_regex(prefixes).search,
            suffix_search=compile_suffix_regex(suffixes).search,
            infix_finditer=compile_infix_regex(infixes).finditer,
            token_match=token_match,
        )
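
For completeness, a sketch of how a tokenizer built this way would typically be attached (LangType is the author's own enum and builder stands in for whatever object the method lives on; neither is part of spaCy):

import spacy

nlp = spacy.load("fr_core_news_sm")
nlp.tokenizer = builder.custom_tokenizer(LangType.FR, nlp)   # replace the default tokenizer
print([t.text for t in nlp("j'imagine des grands-pères")])   # grands-pères should now stay whole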
