
Better dictionaries #40

Closed · 2 tasks
zupo opened this issue Sep 28, 2018 · 7 comments
Labels: good first issue (Good for newcomers)

zupo (Contributor) commented Sep 28, 2018

The Story

Handbook documents: User story & Work process

As a dev,
I want to use a better dictionary,
so that there are fewer false positives.

Problem

Currently, we use aspell to check for typos in localization files. Aspell's dictionaries for non-English languages are quite bad; for Romanian alone, we need to ignore ~2600 words.
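
For context, the current check boils down to roughly the following (a rough Python sketch, not the actual CI script; the paths and the en.pws personal word list are assumptions based on our Travis setup):

import glob
import subprocess

def aspell_unknown(text, lang, personal=None):
    """Return the words aspell does not recognise in the given text."""
    cmd = ["aspell", f"--lang={lang}", "--encoding=utf-8", "list"]
    if personal:
        cmd.insert(-1, f"--personal={personal}")
    out = subprocess.run(cmd, input=text, capture_output=True, text=True)
    return out.stdout.split()

# Concatenate the Romanian localization files.
text = ""
for path in glob.glob("Countries/Romania/*.html"):
    with open(path, encoding="utf-8") as f:
        text += f.read()

# The first pass drops anything covered by the shared English personal word
# list; the second pass checks what is left against the Romanian dictionary.
stage1 = aspell_unknown(text, "en", personal=".travis/dictionaries/en.pws")
stage2 = aspell_unknown("\n".join(sorted(set(stage1))), "ro")
print(len(set(stage2)), "unique unknown words")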

Proposal

Find out whether there are better alternatives. I believe LibreOffice publishes its dictionaries; can we use those?

Pitfalls

  • aspell is quick; we may considerably slow down build times by using a different spell checker.

Best practices (DoD)

  • Documentation is revised:
    e.g. help articles or technical docs.
  • Product users are informed, e.g. via a blog post describing a new feature or bugfix.
  • Test coverage is 100%.

Expectations (AC)

dz0ny (Contributor) commented Sep 28, 2018

Enchant is an engine that supports 8 different backends (aspell, LibreOffice, Mozilla, ...), so it can be used as a direct replacement for aspell. And yes, we can combine backends, use just one, etc.
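
If we go that route, the shape of it would be something like this (a minimal pyenchant sketch; the ro_RO tag and the aspell-first backend ordering are assumptions on my part, not something we ship today):

import enchant

broker = enchant.Broker()
# Prefer the aspell backend for Romanian, falling back to the
# myspell/hunspell provider if aspell has no Romanian dictionary.
broker.set_ordering("ro_RO", "aspell,myspell")

d = broker.request_dict("ro_RO")
for word in ("exemplu", "gresala"):
    if not d.check(word):
        print(word, "->", d.suggest(word))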

dz0ny added the "good first issue" label on Dec 4, 2018
ikolar (Contributor) commented Jan 5, 2019

Romanian really seems to be tricky.

Haven't tried pyenchant yet, but it seems to be unmaintained at the moment. People are suggesting using hunspell (the LibreOffice dictionary) instead; however, that one is actually even worse than aspell.

This is the number of unique spelling mistakes in all Romanian localization files, after filtering out words in the English personal word list (en.pws), with 1) aspell and 2) hunspell.

(localizations) ike@stvm:/devel/niteo/localizations$ cat Countries/Romania/*.html | aspell --encoding=utf-8 --personal=$PWD/.travis/dictionaries/en.pws list | sort | uniq | aspell --lang=ro --encoding=utf-8 list | wc -l
343
(localizations) ike@stvm:/devel/niteo/localizations$ cat Countries/Romania/*.html | aspell --encoding=utf-8 --personal=$PWD/.travis/dictionaries/en.pws list | sort | uniq | hunspell -d ro_RO -l | wc -l
347

There doesn't seem to be a Romanian dictionary for ispell, at least not in the Debian repos.

I'll see if I can do anything better with pyenchant.

Any other ideas?

ikolar (Contributor) commented Jan 5, 2019

pyenchant (the default version from pip) sadly isn't better: 357 false positives, compared to aspell's 343 and hunspell's 347.

cat Countries/Romania/* | aspell --lang=en --encoding=utf-8 list | sort | uniq > ro.all

python pyenchant.py | wc -l
357

The script is:

from enchant.checker import SpellChecker
import codecs

with codecs.open("ro.all", "r", "utf-8") as f:
    t = f.read()

# Run the Romanian checker over the word list and print every word it flags.
chkr = SpellChecker("ro_RO")
chkr.set_text(t)
for err in chkr:
    print(err.word)

There is one more backend I can try:

Uspell (primarily Yiddish, Hebrew, and Eastern European languages; hosted on the AbiWord GitHub)

ikolar (Contributor) commented Jan 5, 2019

Sadly, AbiWord doesn't have a Romanian dictionary. Tried googling for another one, but couldn't find anything obvious.

Do we know anyone in Romania that could possibly point us to a better dictionary?

Also, do you still want me to switch the code from aspell to pyenchant? Pyenchant just calls enchant, which in turn calls either aspell, myspell, ispell or uspell (or hspell, voikko, and zemberek for Hebrew, Finnish or Turkish). So assuming that aspell's dictionaries are the best, it would just add overhead.
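
If we ever want to see which backend Enchant would actually pick for Romanian, pyenchant exposes that (a small sketch, assuming pyenchant and a Romanian dictionary are installed):

import enchant

broker = enchant.Broker()
# List the backends this Enchant build knows about (aspell, myspell, ...).
for provider in broker.describe():
    print(provider.name, "-", provider.desc)

# Show which backend ends up serving the Romanian dictionary.
if broker.dict_exists("ro_RO"):
    d = broker.request_dict("ro_RO")
    print("ro_RO served by:", d.provider.name)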

ikolar added a commit that referenced this issue Jan 9, 2019
…ut none

were better than the vanilla aspell one. Sorry :/.

But it seems the Romanian ignore list is full of duplicates as well as words
that don't need to be ignored. The current templates only contain ~350 or
so words not in the official aspell Romanian dictionary, while there were
well over 2000 of them in the Romanian ignore file. Some were duplicates,
but not all (see diff). So I shortened the ignore file to only include
stuff that's currently not in the dictionary.

Not sure if this is what you want (the other ignore words might have been
needed before and might be needed again).

Ref: #40
zupo (Contributor, Author) commented Jan 21, 2019

Anything else we can do here? I'm fine with "no, let's close it".

mkcdq commented Mar 1, 2019

I reached the same conclusion @ikolar did: The best solution is already implemented, in my opinion.

In the end it all boils down to three spell checkers in use today:

All the other solutions I found (pyenchant, nuspell) use the above under the hood.

All the above spell checkers seem to use the same Romanian dictionary, rospell (maybe in different versions), and no better dictionary seems to exist at this time.

LibreOffice, Mozilla, Chrome, and so on use hunspell because of its license, even though aspell seems to have superior correction capabilities and is faster.

This project already uses a custom dictionary alongside the official one, so not much can be done in this direction.

Unless we want to write our own solution, or start using some AI/ML/NLTK approach, I see no real alternative at this time. My only suggestion is to keep an eye on nuspell; maybe some alternatives will emerge.

Redoing @ikolar's tests and adding nuspell to the list:

time (cat Countries/Romania/*.html | aspell --encoding=utf-8 --personal=$PWD/.travis/dictionaries/en.pws list | sort | uniq | aspell -d ro list | wc -l)
343

real  0m0,056s
user  0m0,036s
sys   0m0,020s

time (cat Countries/Romania/*.html | aspell --encoding=utf-8 --personal=$PWD/.travis/dictionaries/en.pws list | sort | uniq | hunspell -d ro_RO -l | wc -l)
347

real  0m0,173s
user  0m0,191s
sys   0m0,021s

time (cat Countries/Romania/*.html | aspell --encoding=utf-8 --personal=$PWD/.travis/dictionaries/en.pws list | sort | uniq | nuspell -d ro_RO -l | wc -l)
INFO: I/O  locale name=en_US.UTF-8, lang=en, country=US, enc=utf-8
INFO: Pointed dictionary /usr/share/hunspell/ro_RO.{dic,aff}
347

real  0m0,146s
user  0m0,145s
sys   0m0,032s

Same results as before, of course; nuspell, which points at the hunspell dictionary, gives the same count.

Links:

zupo (Contributor, Author) commented Mar 1, 2019

As per @ikolar and @mkcdq, there is not a lot of room for improvement here, so I'm closing.

ikolar closed this as completed on Mar 4, 2019
dz0ny pushed a commit that referenced this issue Mar 10, 2019