
Better dictionaries #40

Closed · 2 tasks
zupo opened this issue Sep 28, 2018 · 7 comments
Labels: good first issue (Good for newcomers)

zupo (Contributor) commented Sep 28, 2018

The Story

Handbook documents: User story & Work process

As a dev,
I want to use a better dictionary,
so that there are fewer false positives.

Problem

Currently, we use aspell to check for typos in localization files. Aspell's dictionaries for non-English languages are quite bad; for Romanian alone, we need to ignore ~2600 words.
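
For context, the current check boils down to roughly the following (a rough Python sketch, not the actual CI script; the paths and the en.pws personal word list are assumptions based on our Travis setup):

import glob
import subprocess

def aspell_unknown(text, lang, personal=None):
    """Return the words aspell does not recognise in the given text."""
    cmd = ["aspell", f"--lang={lang}", "--encoding=utf-8", "list"]
    if personal:
        cmd.insert(-1, f"--personal={personal}")
    out = subprocess.run(cmd, input=text, capture_output=True, text=True)
    return out.stdout.split()

# Concatenate the Romanian localization files.
text = ""
for path in glob.glob("Countries/Romania/*.html"):
    with open(path, encoding="utf-8") as f:
        text += f.read()

# The first pass drops anything covered by the shared English personal word
# list; the second pass checks what is left against the Romanian dictionary.
stage1 = aspell_unknown(text, "en", personal=".travis/dictionaries/en.pws")
stage2 = aspell_unknown("\n".join(sorted(set(stage1))), "ro")
print(len(set(stage2)), "unique unknown words")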

Proposal

Find out whether there are better alternatives. I believe LibreOffice publishes its dictionaries; can we use those?

Pitfalls

  • aspell is quick; we may considerably slow down build times by using a different spell checker.

Best practices (DoD)

  • Documentation is revised:
    e.g. help articles or technical docs.
  • Product users are informed, e.g. via a blog post describing a new feature or bugfix.
  • Test coverage is 100%.

Expectations (AC)

dz0ny (Contributor) commented Sep 28, 2018

Enchant is an engine that supports 8 different backends (aspell, LibreOffice, Mozilla, ...), so it can be used as a direct replacement for aspell. And yes, we can combine backends, use just one, etc.
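
If we go that route, the shape of it would be something like this (a minimal pyenchant sketch; the ro_RO tag and the aspell-first backend ordering are assumptions on my part, not something we ship today):

import enchant

broker = enchant.Broker()
# Prefer the aspell backend for Romanian, falling back to the
# myspell/hunspell provider if aspell has no Romanian dictionary.
broker.set_ordering("ro_RO", "aspell,myspell")

d = broker.request_dict("ro_RO")
for word in ("exemplu", "gresala"):
    if not d.check(word):
        print(word, "->", d.suggest(word))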

dz0ny added the "good first issue" label on Dec 4, 2018
ikolar (Contributor) commented Jan 5, 2019

Romanian really seems to be tricky.

Haven't tried pyenchant yet, but it seems to be unmaintained at the moment. People are suggesting using hunspell (the LibreOffice dictionary) instead; however, that one is actually even worse than aspell.

This is the number of unique spelling mistakes in all Romanian localization files, after filtering out words in the English personal word list (en.pws), with 1) aspell and 2) hunspell.

(localizations) ike@stvm:/devel/niteo/localizations$ cat Countries/Romania/*.html | aspell --encoding=utf-8 --personal=$PWD/.travis/dictionaries/en.pws list | sort | uniq | aspell --lang=ro --encoding=utf-8 list | wc -l
343
(localizations) ike@stvm:/devel/niteo/localizations$ cat Countries/Romania/*.html | aspell --encoding=utf-8 --personal=$PWD/.travis/dictionaries/en.pws list | sort | uniq | hunspell -d ro_RO -l | wc -l
347

There doesn't seem to be a Romanian dictionary for ispell, at least not in the Debian repos.

I'll see if I can do anything better with pyenchant.

Any other ideas?

ikolar (Contributor) commented Jan 5, 2019

pyenchant (the default version from pip) sadly isn't better: 357 false positives, compared to aspell's 343 and hunspell's 347.

cat Countries/Romania/* | aspell --lang=en --encoding=utf-8 list | sort | uniq > ro.all

python pyenchant.py | wc -l
357

The script is:

from enchant.checker import SpellChecker
import codecs

with codecs.open("ro.all", "r", "utf-8") as f:
    t = f.read()

# Run the Romanian checker over the word list and print every word it flags.
chkr = SpellChecker("ro_RO")
chkr.set_text(t)
for err in chkr:
    print(err.word)

There is one more backend I can try:

Uspell (primarily Yiddish, Hebrew, and Eastern European languages; hosted on the AbiWord GitHub)

ikolar (Contributor) commented Jan 5, 2019

Sadly, AbiWord doesn't have a Romanian dictionary. Tried googling for another one, but couldn't find anything obvious.

Do we know anyone in Romania that could possibly point us to a better dictionary?

Also, do you still want me to switch the code from aspell to pyenchant? Pyenchant just calls enchant, which in turn calls either aspell, myspell, ispell or uspell (or hspell, voikko, and zemberek for Hebrew, Finnish or Turkish). So assuming that aspell's dictionaries are the best, it would just add overhead.
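
If we ever want to see which backend Enchant would actually pick for Romanian, pyenchant exposes that (a small sketch, assuming pyenchant and a Romanian dictionary are installed):

import enchant

broker = enchant.Broker()
# List the backends this Enchant build knows about (aspell, myspell, ...).
for provider in broker.describe():
    print(provider.name, "-", provider.desc)

# Show which backend ends up serving the Romanian dictionary.
if broker.dict_exists("ro_RO"):
    d = broker.request_dict("ro_RO")
    print("ro_RO served by:", d.provider.name)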

ikolar added a commit that referenced this issue Jan 9, 2019
…ut none

were better than the vanilla aspell one. Sorry :/.

But it seems the Romanian ignore list is full of duplicates as well as words
that don't need to be ignored. The current templates only contain ~350 or
so words not in the official aspell Romanian dictionary, while there were
well over 2000 of them in the Romanian ignore file. Some were duplicates,
but not all (see diff). So I shortened the ignore file to only include
stuff that's currently not in the dictionary.

Not sure if this is what you want (the other ignore words might have been
needed before and might be needed again).

Ref: #40
zupo (Contributor, Author) commented Jan 21, 2019

Anything else we can do here? I'm fine with "no, let's close it".

mkcdq commented Mar 1, 2019

I reached the same conclusion @ikolar did: The best solution is already implemented, in my opinion.

In the end it all boils down to three spell checkers in use today:

All the other solutions I found (pyenchant, nuspell) use the above under the hood.

All the above spell checkers seem to use the same Romanian dictionary, rospell (maybe in different versions), and no better dictionary seems to exist at this time.

LibreOffice, Mozilla, Chrome, and so on use hunspell because of its license, even though aspell seems to have superior correction capabilities and is faster.

This project already uses a custom dictionary alongside the official one, so not much can be done in this direction.

Unless we want to write our own solution, or start using some AI/ML/NLTK approach, I see no real alternative at this time. My only suggestion is to keep an eye on nuspell; maybe some alternatives will emerge.

Redoing @ikolar's tests and adding nuspell to the list:

time (cat Countries/Romania/*.html | aspell --encoding=utf-8 --personal=$PWD/.travis/dictionaries/en.pws list | sort | uniq | aspell -d ro list | wc -l)
343

real  0m0,056s
user  0m0,036s
sys   0m0,020s

time (cat Countries/Romania/*.html | aspell --encoding=utf-8 --personal=$PWD/.travis/dictionaries/en.pws list | sort | uniq | hunspell -d ro_RO -l | wc -l)
347

real  0m0,173s
user  0m0,191s
sys   0m0,021s

time (cat Countries/Romania/*.html | aspell --encoding=utf-8 --personal=$PWD/.travis/dictionaries/en.pws list | sort | uniq | nuspell -d ro_RO -l | wc -l)
INFO: I/O  locale name=en_US.UTF-8, lang=en, country=US, enc=utf-8
INFO: Pointed dictionary /usr/share/hunspell/ro_RO.{dic,aff}
347

real  0m0,146s
user  0m0,145s
sys   0m0,032s

Same results as before, of course; nuspell, which points at the hunspell dictionary, gives the same count.

Links:

zupo (Contributor, Author) commented Mar 1, 2019

As per @ikolar and @mkcdq, there is not a lot of room for improvement here, so I'm closing.

ikolar closed this as completed on Mar 4, 2019
dz0ny pushed a commit that referenced this issue Mar 10, 2019