Hyphenated words in French #13663
Comments
Note I have tried the following:
See also https://spacy.io/usage/linguistic-features#native-tokenizer-additions But this then leads to
Alright, I managed to fix things by making sure I use all of the language-specific defaults (see below). What would really help is if all the lists of regexes were dictionaries instead. That would make it possible to manipulate the defaults in a very targeted way. For example, it would then allow
Instead of
Here is my code:
How to reproduce the behaviour
"j'imagine des grands-pères" is tokenized to "j'", "imagine", "des", "grands-", "pères".
https://demos.explosion.ai/displacy?text=j%27imagine%20des%20grands-p%C3%A8res&model=fr_core_news_sm&cpu=1&cph=1
I would expect it to tokenize to "j'", "imagine", "des", "grands-pères", i.e. not split "grands-pères".
In English, "he is a top-performer" does not split "top-performer".
https://demos.explosion.ai/displacy?text=he%20is%20a%20top-performer&model=en_core_web_sm&cpu=1&cph=1
Is this intended or a bug?
If it is intended, how can I adjust the tokenization to not split hyphenated words?
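To illustrate the suggestion from the comments that the regex lists would be easier to customize as dictionaries, here is a minimal sketch in plain Python. It does not use spaCy and is not how spaCy's tokenizer works internally. The rule names, the two simplified patterns and the `tokenize` helper are all hypothetical, chosen only to reproduce the split described in this issue.

```python
import re

# Hypothetical named infix rules. Each pattern is a zero-width split point.
# (In spaCy the default infixes are an unnamed list; these names are made up.)
INFIX_RULES = {
    # split right after an apostrophe that sits between two letters: j' | imagine
    "elision": r"(?<=[^\W\d_]')(?=[^\W\d_])",
    # split right after a hyphen that sits between two letters: grands- | pères
    "hyphen_between_letters": r"(?<=[^\W\d_]-)(?=[^\W\d_])",
}


def tokenize(text, rules):
    """Split on whitespace, then cut each chunk at every infix split point."""
    pattern = None
    if rules:
        pattern = re.compile("|".join(f"(?:{p})" for p in rules.values()))
    tokens = []
    for chunk in text.split():
        start = 0
        if pattern is not None:
            for match in pattern.finditer(chunk):
                cut = match.start()  # zero-width match: start == end
                if cut > start:
                    tokens.append(chunk[start:cut])
                    start = cut
        tokens.append(chunk[start:])
    return tokens


text = "j'imagine des grands-pères"

# With all rules active, the hyphenated word is split.
default_tokens = tokenize(text, INFIX_RULES)
# ["j'", 'imagine', 'des', 'grands-', 'pères']

# Because rules are keyed by name, one rule can be dropped without
# touching (or even knowing the regex source of) the others.
custom_rules = {k: v for k, v in INFIX_RULES.items() if k != "hyphen_between_letters"}
custom_tokens = tokenize(text, custom_rules)
# ["j'", 'imagine', 'des', 'grands-pères']

print(default_tokens)
print(custom_tokens)
```

With a plain list, the same change means finding the hyphen pattern by position or by comparing regex source strings, which breaks as soon as the defaults are reordered or reworded.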