Consider ways to automate en_US->en_GB dictionary corrections #1468

Open
peternewman opened this issue Apr 5, 2020 · 17 comments

Comments

@peternewman
Collaborator

peternewman commented Apr 5, 2020

> Do we force this to be only one correction and then provide a function to reverse this dictionary for converting the other way too?

Originally posted by @peternewman in #1142

> Maybe we can do this sort of thing someday, but for now I think it makes sense just to have the gb-to-us dict, it's not enabled by default, and if you enable it then you probably want the conversions to be done

Originally posted by @larsoner in #1142

@sebweb3r
Contributor

sebweb3r commented Aug 13, 2020

With

awk -F"->" '{print $2 "->" $1}' codespell_lib/data/dictionary_en-GB_to_en-US.txt

one can reverse the dictionary automatically.

I will write some tests that check dictionary_en-GB_to_en-US.txt against dictionary_en-US_to_en-GB.txt.

@larsoner
Member

> one can reverse the dictionary automatically.

It seems cleaner just to reverse it in Python at runtime to generate the opposite dict. Then there is less repetition in the repo
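A minimal sketch of what reversing at runtime could look like, assuming the `key->value` line format used in the dictionary file and exactly one correction per line. `reverse_entries` is a hypothetical helper for illustration, not an existing codespell function.

```python
def reverse_entries(lines):
    """Swap each 'key->value' line into a {value: key} mapping.

    Assumes one correction per line (hypothetical helper, not codespell API).
    """
    reversed_dict = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        key, _, value = line.partition("->")
        reversed_dict[value.strip()] = key.strip()
    return reversed_dict


gb_to_us = ["colour->color", "licence->license"]
us_to_gb = reverse_entries(gb_to_us)
print(us_to_gb)  # {'color': 'colour', 'license': 'licence'}
```

Only the GB->US file would need to live in the repo under this approach; the opposite mapping is derived when the dictionary is loaded.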

@lurch
Contributor

lurch commented Aug 31, 2020

I guess using awk would also create an invalid dictionary if dictionary_en-GB_to_en-US.txt ever had a word->word1, word2, line?
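To make the concern concrete, here is a small sketch that mimics the awk one-liner in Python (the function name `naive_swap` is made up for this example). With a multi-correction line, the swapped result has a comma-separated list on the left of the arrow, which is not a single word to look up.

```python
def naive_swap(line):
    """Mimic the awk one-liner: swap the text on either side of '->'."""
    left, right = line.split("->", 1)
    return f"{right}->{left}"


print(naive_swap("colour->color"))        # color->colour
print(naive_swap("word->word1, word2,"))  # word1, word2,->word
```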

@sebweb3r
Contributor

yes. it would. but do british words with multiple spellings in us-american english exist? Or vice-versa?

@lurch
Contributor

lurch commented Aug 31, 2020

Dunno, I'm not a linguist. But I guess 'gas' in en-US can be spelled as both 'gas' and 'petrol' in en_GB 😉 🤣

@peternewman
Collaborator Author

Here be dragons:
https://www.grammarly.com/blog/licence-license/

@lurch
Contributor

lurch commented Sep 2, 2020

> Here be dragons:
> https://www.grammarly.com/blog/licence-license/

Not really. We have licence->license in dictionary_en-GB_to_en-US.txt, just like we also have practise->practice in there. Of course if we ever wanted to convert from en-US to en-GB then we'd have problems ;-)

Kinda weird though in that in en-GB the -se is always the verb and the -ce is always the noun, but in en-US they use license for noun-and-verb and practice for noun-and-verb 🙃
https://www.grammarly.com/blog/practice-practise/

Hmm, and just to confuse things en-US also keeps the distinction between advise as the verb and advice as the noun! 🤣
https://www.grammarly.com/blog/advise-advice/
🇬🇧 🇺🇸 📚

@peternewman changed the title from "Consider ways to automate en_GB/en_US dictionary corrections" to "Consider ways to automate en_US->en_GB dictionary corrections" on Sep 2, 2020
@peternewman
Collaborator Author

> Of course if we ever wanted to convert from en-US to en-GB then we'd have problems ;-)

Which is exactly what this issue is about...

Sorry I've retitled it, as I realise that wasn't very clear.

@lurch
Contributor

lurch commented Sep 2, 2020

Maybe for the dictionary_en-GB_to_en-US.txt only we could break some of the rules that apply to other dictionaries, and allow something like

licence,license->license
practice,practise->practice

which would then become:

license->licence,license
practice->practice,practise

when "reversed" into the en_US -> en_GB dictionary; so that when codespell encounters license in en_US text (and the user is correcting to en_GB), the user gets prompted to leave it as 'license' or change it to 'licence'.
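The reversal of that extended format could be sketched as follows. This assumes the proposed `a,b->c` syntax (which is only a suggestion in this thread, not something codespell accepts today), and `reverse_multi_key` is a hypothetical name.

```python
def reverse_multi_key(lines):
    """Turn proposed 'a,b->c' lines into {c: [a, b]} for the reverse direction."""
    result = {}
    for line in lines:
        keys, _, target = line.strip().partition("->")
        # Sort the variants so the generated entry is deterministic.
        variants = sorted(k.strip() for k in keys.split(",") if k.strip())
        result.setdefault(target.strip(), []).extend(variants)
    return result


extended = ["licence,license->license", "practice,practise->practice"]
reversed_lines = [
    f"{word}->{','.join(choices)}"
    for word, choices in reverse_multi_key(extended).items()
]
print("\n".join(reversed_lines))
# license->licence,license
# practice->practice,practise
```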

Although that might be too confusing, so perhaps a better/simpler approach would be to somehow indicate that the licence->license and practise->practice rules should be ignored when "reversing" them to get the en_US -> en_GB dictionary?

There's also the problem that when converting from en_US to en_GB you'd want to correct "color" to "colour" when used in natural-text, but you'd probably need to leave it as "color" in code-text as many functions / classes / etc. use the US spelling of "color". (hmmm, does codespell have the ability to use different dictionaries based on the file-extension of the file it's currently checking?)

@sebweb3r
Contributor

sebweb3r commented Sep 2, 2020

I think, the fact, that BE uses licence and license, but AE only license, is the game breaker for automated reversing.

@larsoner
Member

larsoner commented Sep 2, 2020

> I think, the fact, that BE uses licence and license, but AE only license, is the game breaker for automated reversing.

The construction rule could be "check for the reverse dict and add entries to it (in Python) as long as there is only one correction". Then GB->US can be as it is, and a new US->GB file (for now) can have the single entry practice->practice,practise. When we load either dict, the US->GB dict gets all reversed entries from GB->US that only have a single correction, and the GB->US dict gets no update (because the one entry in US-GB is ruled out for having multiple corrections).
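A rough sketch of that construction rule, under a couple of assumptions that are mine rather than settled in this thread: dictionaries are parsed into `{key: [corrections]}`, and an entry written by hand in the target file wins over a generated one. `parse` and `add_reversed` are hypothetical helpers.

```python
def parse(lines):
    """Parse 'key->a,b' lines into {key: [a, b]}."""
    entries = {}
    for line in lines:
        key, _, values = line.strip().partition("->")
        entries[key.strip()] = [v.strip() for v in values.split(",") if v.strip()]
    return entries


def add_reversed(target, source):
    """Add reversed single-correction entries of `source` to `target`.

    Assumption: entries already present in `target` are kept untouched.
    """
    merged = dict(target)
    for key, corrections in source.items():
        if len(corrections) == 1:  # multi-correction entries are ruled out
            merged.setdefault(corrections[0], [key])
    return merged


gb_to_us = parse(["colour->color", "practise->practice"])
us_to_gb = parse(["practice->practice,practise"])

full_us_to_gb = add_reversed(us_to_gb, gb_to_us)
full_gb_to_us = add_reversed(gb_to_us, us_to_gb)
print(full_us_to_gb)
print(full_gb_to_us)
```

Here the US->GB result gains `color->colour` from the reversal while keeping its hand-written `practice` entry, and the GB->US result is unchanged because the only US->GB entry has two corrections.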

@lurch
Contributor

lurch commented Sep 2, 2020

> a new US->GB file (for now) can have the single entry practice->practice,practise

and also license->licence,license ? (or perhaps I misunderstood your comment)

@larsoner
Member

larsoner commented Sep 2, 2020

Yeah probably -- just starting off with one for the sake of discussion. I would not expect the dictionary to stay at a single entry :) I'm sure there are many examples...

@peternewman
Collaborator Author

> I think, the fact, that BE uses licence and license, but AE only license, is the game breaker for automated reversing.

> The construction rule could be "check for the reverse dict and add entries to it (in Python) as long as there is only one correction". Then GB->US can be as it is, and a new US->GB file (for now) can have the single entry practice->practice,practise. When we load either dict, the US->GB dict gets all reversed entries from GB->US that only have a single correction, and the GB->US dict gets no update (because the one entry in US-GB is ruled out for having multiple corrections).

It seems a shame to have to duplicate practice->practice,practise (it would also currently hit our corrects to itself test, so we'd need to make that optional for some places), can't we potentially leave it out of GB->US and use US->GB to populate that part of it?

We've got:

| US:GB | Example | Notes |
|---|---|---|
| 1:1 | chips<>crisps | Potentially reversible but order is important! |
| 1:1 | fries<>chips | Potentially reversible but order is important! |
| 1:1 | color<>colour | Not reversible due to the HTML issue |
| 1:Many | gas<>gas/petrol | Suggests itself |
| Many:1 | drugstore/pharmacy<>chemists | Not a perfect example, but to get the idea |

Do we not need potentially four files to cover all cases?

So we can use the 1:Many bit to drive most of it, but I think we still need some way of skipping some words which might seem to be reversible. And in the case of chips/crisps/fries potentially we'd need a non-sorted dictionary, so we don't do crisps->chips, chips->fries type things.
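The chips/crisps/fries ordering problem can be illustrated with a small sketch. It makes no claim about how codespell itself applies corrections; it only contrasts applying rules one after another over the whole text with looking each word up exactly once. The mapping and function names are invented for the example.

```python
import re

# GB->US direction: crisps become chips, chips become fries.
GB_TO_US = {"crisps": "chips", "chips": "fries"}


def sequential(text, mapping):
    """Apply each rule in turn to the whole text (order-sensitive)."""
    for old, new in mapping.items():
        text = re.sub(rf"\b{re.escape(old)}\b", new, text)
    return text


def single_pass(text, mapping):
    """Look each word up once, so a replacement is never re-translated."""
    return re.sub(r"\b\w+\b", lambda m: mapping.get(m.group(0), m.group(0)), text)


text = "fish and chips with crisps"
print(sequential(text, GB_TO_US))   # fish and fries with fries
print(single_pass(text, GB_TO_US))  # fish and fries with chips
```

With sequential application, the rule order (or a second run over already-converted text) decides whether crisps ends up as chips or fries.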

@lurch
Contributor

lurch commented Sep 2, 2020

And "fries" in US are "chips" in UK, but "french fries" are present in both UK and US? 🍟

EDIT: And I think "chocolate chip cookies" are the same in both? 🍪 And we definitely wouldn't want to auto-translate "silicon chips" to "silicon crisps" 😆 😋

Given how context-sensitive all of this is, maybe there's not actually much we can do? 😕

@vikivivi
Contributor

vikivivi commented May 11, 2021

In en-GB_to_en-US, you can have: aunty->auntie
But when en-US_to_en-GB: it is not auntie->aunty, it should still be auntie->auntie

Please find a script which auto-generates both the en-GB_to_en-US and en-US_to_en-GB dictionaries using SCOWL VarCon in #1917

@Alhadis

Alhadis commented Oct 12, 2021

Here's how I'd do it: extend the format to use <> or <-> for "reversible" spellings, and implement a --reverse/-R option that behaves similarly to -D/--dictionary, except the specified dictionaries use "reversed" spellings when listed.

For example:

colour<->color
favourite<->favorite
auntie->aunty

(Note the "one-way" conversion of auntie/aunty above; these would be ignored in dictionaries specified by -R).
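A minimal sketch of how such a file might be loaded, assuming only the `<->` marker (the `<>` variant is left out) and a `reverse` flag standing in for the proposed -R option. Neither the marker nor the option exists in codespell as far as this thread shows; `load` is a hypothetical function.

```python
def load(lines, reverse=False):
    """Parse lines using '<->' for reversible and '->' for one-way entries.

    With reverse=True, only '<->' entries are kept, with their sides swapped.
    """
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Check '<->' first, because it also contains '->'.
        if "<->" in line:
            left, right = (part.strip() for part in line.split("<->", 1))
            if reverse:
                left, right = right, left
            mapping[left] = right
        elif not reverse:
            left, right = (part.strip() for part in line.split("->", 1))
            mapping[left] = right
    return mapping


lines = ["colour<->color", "favourite<->favorite", "auntie->aunty"]
forward = load(lines)
backward = load(lines, reverse=True)
print(forward)
print(backward)
```

The one-way `auntie->aunty` entry appears in `forward` only, matching the note above about such entries being ignored when the dictionary is reversed.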

> but do british words with multiple spellings in us-american english exist? Or vice-versa?

Yes. In international/traditional/proper English, meter and metre are used for measuring devices and units of length, respectively. In Yanklish/US English, meter is used for both.
