
Aspell limitations for English words #617

Open
johnbumgarner opened this issue Aug 30, 2021 · 4 comments

Comments

johnbumgarner commented Aug 30, 2021

I'm exploring using the Python package pyenchant in my open source project. Since I'm developing on a Mac, pyenchant's backend is aspell. During testing I noticed that some English words are not found, so I'm trying to understand the limitations of aspell.

The code below checks 6 English words. It seems that 3 of them don't exist in the aspell dictionaries.

import enchant

words = ["bad", "omen", "smile", "pneumonoultramicroscopicsilicovolcanoconiosis",
         "supercalifragilisticexpialidocious", "incomprehensibilities"]
d = enchant.Dict("en_US")  # create the dictionary once, outside the loop
for word in words:
    print(d.check(word))

The output:

True
True
True
False
False
False

aspell version info:

aspell --version
@(#) International Ispell Version 3.1.20 (but really Aspell 0.60.8)

Thanks in advance for any assistance.

@DimitriPapadopoulos
Contributor

Excellent question.

I think the mailing list thread Re: Updating dictionaries gives a few hints. I have started looking into https://github.com/GNUAspell/aspell-lang, which explains how to generate dictionaries that can eventually be uploaded to ftp.gnu.org:

**********************************************************************
         Requirements in order to be upload to ftp.gnu.org
**********************************************************************

The number one requirement is that the dictionary package MUST be made
using "make dist" with the "proc" script as previously described.
This will check for a large number of things.

However, this technical documentation does not explain who or which team is currently in charge of running these tools to maintain the dictionaries for each language. You need to search the aspell mailing lists to find these well-hidden teams or individuals.

For English, these might be the web sites you're after:

The first one claims that “This word list is considered both complete and accurate” and points to SCOWL (and friends). The git repository for SCOWL (and friends) is:

@DimitriPapadopoulos
Contributor

The strange thing is that all of these words can actually be found in SCOWL (and friends). Make sure you have the most recent dictionaries installed, just in case. I would be interested in your findings, as I have similar issues myself, for example with donut:

>>> import enchant
>>> 
>>> words = ["donut", "donuts"]
>>> dictionary = enchant.Dict("en_US")
>>> 
>>> for word in words:
...     dictionary.check(word)
... 
False
True
>>> 
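Since the SCOWL lists are plain text files (one word per line), you can check them directly, without going through aspell. A minimal sketch, assuming a directory of SCOWL-style list files; the file-name pattern and paths are illustrative:

```python
# Hedged sketch: scan raw SCOWL word-list files for a word. SCOWL ships
# plain-text files named like english-words.10 or american-words.80 (one
# word per line, latin-1 encoded); the directory layout here is assumed.
from pathlib import Path

def in_scowl(word, list_dir):
    """Return the names of the list files that contain `word`."""
    hits = []
    for path in sorted(Path(list_dir).glob("*-words.*")):
        if word in path.read_text(encoding="iso-8859-1").split():
            hits.append(path.name)
    return hits
```

This at least tells you whether a missing word is absent from SCOWL itself or was dropped later, when the aspell dictionary was built.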


@DimitriPapadopoulos
Contributor

You may be using the default dictionary size, which is 60 on a scale from 10 to 90. From the aspell man page:

size

(string) The preferred size of the word list. This consists of a two char digit code describing the size of the list, with typical values of: 10=tiny, 20=really small, 30=small, 40=med-small, 50=med, 60=med-large, 70=large, 80=huge, 90=insane.

Have you tried a larger size, 80 or even 90, for these kinds of uncommon words? Chances are you need to choose the proper aspell options rather than fix an actual bug.
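One way to experiment, assuming the size-90 word lists are installed for your language and that enchant's aspell backend honours ~/.aspell.conf (both worth verifying on your system):

```shell
# Hedged sketch: inspect and raise aspell's word-list size.
aspell config | grep size          # show the size currently in effect (default 60)
echo "size 90" >> ~/.aspell.conf   # persist a larger master list for future runs
echo "incomprehensibilities" | aspell --size=90 -a   # one-off check at size 90
```

If the larger list is not installed, aspell will complain that it cannot find the dictionary, which is itself a useful diagnostic.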

@DimitriPapadopoulos
Contributor

DimitriPapadopoulos commented Jul 10, 2023

It's not the size of the dictionary after all. The case of donut is interesting: issue en-wl/wordlist#310 gives a glimpse of how words are handled:

  • While some words are present in word lists, they are filtered out when building dictionaries, based on criteria such as frequency.
  • Adding words to dictionaries probably starts with making a sound argument in an issue such as en-wl/wordlist#310 ("donut" should be promoted in en-US).
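Until a word is promoted upstream, a practical workaround is a personal word list layered on top of the system dictionary; pyenchant exposes this as enchant.DictWithPWL. A dependency-free sketch of the idea, with a stand-in base vocabulary instead of a real aspell dictionary:

```python
# Hedged sketch of the personal-word-list idea behind enchant.DictWithPWL:
# accept a word if either the base dictionary or a user-maintained
# allow-list knows it. The vocabularies below are illustrative stand-ins.
def make_checker(base_vocab, pwl):
    def check(word):
        w = word.lower()
        return w in base_vocab or w in pwl
    return check

base_vocab = {"bad", "omen", "smile"}        # stand-in for the en_US dictionary
pwl = {"donut", "incomprehensibilities"}     # words filtered out upstream

check = make_checker(base_vocab, pwl)
print(check("donut"))    # True: rescued by the personal word list
print(check("qwxz"))     # False: unknown everywhere
```

The real API keeps the personal list in a text file, so additions survive between runs and can later be turned into an upstream issue.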
