Better dictionaries #40
Comments
Enchant is an engine that supports 8 different backends (aspell, LibreOffice, Mozilla, ...), so it can be used as a direct replacement for aspell. And yes, we can combine both backends, use just one, etc.
Romanian really seems to be tricky. I haven't tried pyenchant yet, but it seems to be unmaintained at the moment. People are suggesting using hunspell (the LibreOffice dictionary) instead; however, that one is actually even worse than aspell. This is the number of unique spelling mistakes in all Romanian localization files, after filtering out the English pws, with 1) aspell 2) hunspell.
There doesn't seem to be a Romanian dictionary for ispell, at least not in the Debian repos. I'll see if I can do anything better with pyenchant. Any other ideas?
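For reference, the kind of count described above (unique words flagged across all localization files) could be sketched roughly as follows. This is a hypothetical illustration, not the script used in this thread: the function names, the word-splitting regex, and the assumption that `aspell` and its `ro` dictionary are installed are all mine. The checker is passed in as a callable so that aspell, hunspell, or an enchant-based checker can be swapped in and compared on the same input.

```python
import re
import subprocess
from typing import Callable, Iterable, Set

# Assumption: a "word" is a run of letters, optionally joined by hyphens.
WORD_RE = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")


def aspell_unknown(words: Iterable[str], lang: str = "ro") -> Set[str]:
    """Return the words that `aspell list` reports as misspelled.

    Requires the aspell binary and the dictionary for `lang`.
    """
    proc = subprocess.run(
        ["aspell", f"--lang={lang}", "--encoding=utf-8", "list"],
        input="\n".join(sorted(words)),
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return set(proc.stdout.split())


def unique_unknown_words(
    texts: Iterable[str],
    find_unknown: Callable[[Set[str]], Set[str]],
) -> Set[str]:
    """Collect every distinct word in `texts` and return those the checker flags."""
    words: Set[str] = set()
    for text in texts:
        words.update(WORD_RE.findall(text))
    return find_unknown(words)
```

Calling `len(unique_unknown_words(texts, aspell_unknown))` on the contents of the localization files would give a single number per checker, which is the figure being compared in these comments.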
pyenchant (the default version from pip) sadly isn't better: 357 false positives, compared to 343 for aspell and 347 for hunspell.
The script is:
There is one more backend I can try:
Sadly, AbiWord doesn't have a Romanian dictionary. I tried googling for another one, but couldn't find anything obvious. Do we know anyone in Romania who could point us to a better dictionary? Also, do you still want me to switch the code from aspell to pyenchant? Pyenchant just calls enchant, which in turn calls either aspell, myspell, ispell or uspell (or hspell, voikko, and zemberek for Hebrew, Finnish or Turkish). So, assuming that aspell's dictionaries are the best, it would just add overhead.
…ut none were better than the vanilla aspell one. Sorry :/. But it seems the Romanian ignore list is full of duplicates as well as words that don't need to be ignored. The current templates only contain ~350 or so words that are not in the official aspell Romanian dictionary, while there were well over 2000 of them in the Romanian ignore file. Some were duplicates, but not all (see diff). So I shortened the ignore file to only include words that are currently not in the dictionary. Not sure if this is what you want (the other ignored words might have been needed before and might be needed again). Ref: #40
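The pruning described in this commit message could look roughly like the sketch below. It is an assumption-laden illustration rather than the actual change: `prune_ignore_list` and `is_in_dictionary` are hypothetical names, and the dictionary lookup is left as a callable so that any spell checker can back it.

```python
from typing import Callable, Iterable, List


def prune_ignore_list(
    ignore_words: Iterable[str],
    template_words: Iterable[str],
    is_in_dictionary: Callable[[str], bool],
) -> List[str]:
    """Keep only ignore entries that are still needed.

    An entry is kept when it occurs in the current templates and the
    dictionary does not already know it. Duplicates and blank lines are
    dropped, and the original order is preserved.
    """
    used = set(template_words)
    seen = set()
    kept: List[str] = []
    for raw in ignore_words:
        word = raw.strip()
        if not word or word in seen:
            continue  # blank line or duplicate entry
        seen.add(word)
        if word in used and not is_in_dictionary(word):
            kept.append(word)
    return kept
```

Note the trade-off raised in the message: entries that no current template uses are discarded, so they would have to be added again if those words reappear.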
Anything else we can do here? I'm fine with "no, let's close it".
I reached the same conclusion @ikolar did: the best solution is already implemented, in my opinion. In the end it all boils down to three spell checkers in use today:
All the other solutions I found (pyenchant, nuspell) use the above under the hood. All of these spell checkers seem to use the same Romanian dictionary, rospell (maybe different versions of it), and no better dictionary seems to exist at this time. LibreOffice, Mozilla, Chrome and so on use hunspell because of its license, even though aspell seems to have superior correction capabilities and is faster.
This project already uses a custom dictionary alongside the official one, so not much can be done in this direction. Unless we want to write our own solution, or start using some AI/ML/NLTK approach, I see no real alternative at this time. My only suggestion is to keep an eye on nuspell; maybe some alternatives will emerge.
Redoing @ikolar's tests and adding nuspell to the list:
Yields the same results, of course. Nuspell, using hunspell, gives the same result. Links:
The Story
As a dev,
I want to use a better dictionary,
so that there are fewer false positives.
Problem
Currently, we use aspell to check for typos in localization files. Aspell's dictionaries for non-English languages are quite bad; for Romanian, we need to ignore ~2600 words.
Proposal
Find if there are better alternatives. I believe LibreOffice publishes their dictionaries, can we use those?
Pitfalls
Best practices (DoD)
e.g. help articles or technical docs.
Expectations (AC)