BIP39 French Wordlist - My proposal #152
Is there a restriction requiring each word to differ from every other word by more than one letter? For instance, we had the case of someone writing down "fog" instead of "frog" (or the other way around, I'm not sure), and by chance both were valid words in the dictionary. |
I did not use a script for that; I eliminated the most glaring similarities. |
Good idea, I will also review it. For the similar words, I will check the Levenshtein distance for all combinations; it will be quick. Also, I think your word list is not in NFKD normalization (not a big deal, I'll fix that). |
Ah, one more question. I'm not sure about some words which are either very little known or often misspelled (like zircon and wapiti, which are the only ones I have seen after a quick scan, and maybe the only ones). Do you think we should change such words? |
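For readers who want to check the normalization point themselves, here is a minimal sketch in Python using the standard `unicodedata` module. The helper names `nfkd` and `not_normalized` are made up for this illustration and are not taken from any script used in this thread.

```python
import unicodedata


def nfkd(word):
    """Return the NFKD (compatibility decomposition) form of a word."""
    return unicodedata.normalize("NFKD", word)


def not_normalized(words):
    """List the words whose stored form differs from their NFKD form."""
    return [w for w in words if w != nfkd(w)]


if __name__ == "__main__":
    # "b\u00e9nigne" stores the accented e as one precomposed code point,
    # so it is reported; the plain ASCII word is already in NFKD form.
    print(not_normalized(["abandon", "b\u00e9nigne"]))
```

In NFKD an accented letter is stored as the base letter followed by a combining mark, so a list saved in precomposed form will show up here even though it looks identical on screen.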
@NicolasDorier Thanks for the help :) I used
and nfkd.pl is
Thanks to Aaron Voisine for this! Of course we can change these kinds of words if we find a replacement that is compatible with the restrictions. |
I reviewed the first 1024 words; here are my difficulties: acerbe: I don't understand its meaning out of context, much less how to spell it. All of that is surely subjective. We don't have to replace anything if you think I am the only one having those difficulties. I'll review the next 1024 later. Let me know what you think about these words. |
Thanks for the review :) |
I agree that bénigne and bluffer can be difficult (I have seen poker players write "bleuffer"...). For the others, I would think that any native French speaker must know them, and I don't think that anyone literate would ever write "embrillon" or "faquir". I understand some people can have trouble with spelling, but then no word would be safe. |
Don't, I can do that automatically. I will do it once we have agreed on the words. (I'll also code something up to verify that you respected your restrictions.)
I am a native speaker, but I admit I am not a very good one. ;) |
OK, thanks for taking the time to code :)
It's also cool to learn words ^^ |
Yeah, it is cool, but I'd just hope people will not have to spell words on the phone, which will happen with unknown words. But I'm fine with it if you think I am one of the only ones who do not know them. I expect most service providers using BIP39 will auto-correct words for the user. (I will surely include that in NBitcoin... even if only for me ;D) |
@NicolasDorier Have you had the time to review the second part? :) |
shit I forgot, working on that sorry |
Here it is: iridium: never heard of it. My remarks concern typical spelling mistakes that can be made. Once you have agreed on the words to change, let me know; I'll then run some word analysis on the list (dictionary check, that your rules are satisfied, and that no 2 words are too similar). |
Thanks :) |
Well, I had heard of suricate, but as far as I was concerned, it was a French comedy group on YouTube. :p |
Ok, I propose to change these words:
EDIT: "fakir" too
And wombat (https://i.imgur.com/scN9gIU.jpg) 🐻 What do you think? |
pyjama. pijama? (I would have bet it was spelled like that.) Except for those, I'm good. I like wombat, but I doubt many people know it. Tell me when you update the list so that I can run some code on it. |
Thanks :) |
rallonge and pixel; OK for yatch. |
@NicolasDorier Updated |
Thanks, I'll run some word analysis to check everything is fine. (Hopefully before Sunday.) |
poncer -> ponctuel. Same Dropbox link. |
Did you update on GitHub? I prefer using the GitHub version for my tests, so that I'm sure there is no mistake in the modifications. (Don't worry about encoding, I'll fix it.) PS: gerbille => never heard of it :D |
Yes, updated on GitHub. |
"gerboise", "graffiti", "glycémie", or another? :) |
ok let's take "graffiti" |
Updated. |
Here are the similar words (differing by 1 letter, accents removed):
I noted potential problems. Checking other stuff... |
I also noted the following collision with Spanish. (By the way, the Spanish list is not normalized on GitHub.)
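The check described here can be sketched roughly as follows. This is not the script that was actually run, only an illustration that assumes "separated by 1 letter" means at most one substitution, insertion, or deletion after accents are stripped. The names `strip_accents`, `within_one_edit`, and `similar_pairs` are hypothetical.

```python
import unicodedata
from itertools import combinations


def strip_accents(word):
    """Decompose with NFKD, then drop the combining marks (the accents)."""
    decomposed = unicodedata.normalize("NFKD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def within_one_edit(a, b):
    """True if a and b are identical or one substitution/insertion/deletion apart."""
    if len(a) > len(b):
        a, b = b, a  # make a the shorter (or equal-length) string
    if len(b) - len(a) > 1:
        return False
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1  # skip the common prefix
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # at most one substituted letter
    return a[i:] == b[i + 1:]  # one extra letter in the longer word


def similar_pairs(words):
    """Return every pair of words that is too close once accents are removed."""
    return [
        (w1, w2)
        for w1, w2 in combinations(words, 2)
        if within_one_edit(strip_accents(w1), strip_accents(w2))
    ]


if __name__ == "__main__":
    print(similar_pairs(["fog", "frog", "pixel"]))
```

Comparing all pairs is quadratic, but for a 2048-word list that is only about two million comparisons, so a brute-force pass like this is enough.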
|
@Kirvx @NicolasDorier @EricLarch Thanks guys, this will be going into the next breadwallet update. Vive la France ! |
I suppose I'd best also add the French list to BIP39.NET. Awesome :) |
It is now added in BIP39.NET. |
Yeah, thanks :) On Mon, 8 June 2015 at 07:51, Thå Shïz [email protected] wrote:
|
same in NBitcoin (in master branch, will be out for the next release) |
I had a big problem trying to detect the BIP39 language, as French shares ~5% of its words with English. With the test vector (entropy 000000000000000000000000000000000000; English mnemonic = "abandon abandon abandon abandon abandon abandon abandon abandon abandon abandon abandon about") it is incorrectly detected as French. I've changed my code to check all of the following (see below); however, I'd implore that the list be made completely different from English (or at the very least, don't make the first word the same).

FRENCH_BIP39_CLASHES = [
    (1, u'abandon'), (88, u'amateur'), (107, u'angle'), (110, u'animal'), (148, u'aspect'),
    (190, u'badge'), (230, u'bicycle'), (262, u'bonus'), (277, u'brave'), (323, u'canal'),
    (328, u'capable'), (347, u'caution'), (403, u'civil'), (409, u'client'), (436, u'concert'),
    (451, u'correct'), (461, u'coyote'), (478, u'crucial'), (479, u'cruel'), (493, u'cycle'),
    (498, u'danger'), (562, u'digital'), (573, u'distance'), (594, u'double'), (598, u'dragon'),
    (631, u'effort'), (725, u'essence'), (757, u'exact'), (763, u'excuse'), (795, u'fatal'),
    (796, u'fatigue'), (812, u'festival'), (820, u'figure'), (854, u'fortune'), (861, u'fragile'),
    (880, u'fruit'), (919, u'globe'), (953, u'guide'), (998, u'humble'), (1011, u'image'),
    (1014, u'immense'), (1017, u'impact'), (1043, u'innocent'), (1053, u'intact'), (1070, u'jaguar'),
    (1093, u'junior'), (1102, u'label'), (1123, u'lecture'), (1165, u'loyal'), (1178, u'machine'),
    (1248, u'million'), (1254, u'minute'), (1255, u'miracle'), (1259, u'mobile'), (1286, u'muscle'),
    (1301, u'nation'), (1302, u'nature'), (1322, u'noble'), (1331, u'notable'), (1381, u'opinion'),
    (1387, u'orange'), (1409, u'ozone'), (1411, u'palace'), (1416, u'panda'), (1476, u'phrase'),
    (1478, u'piano'), (1492, u'pizza'), (1524, u'position'), (1548, u'prison'), (1567, u'public'),
    (1576, u'puzzle'), (1580, u'question'), (1626, u'relief'), (1671, u'rival'), (1674, u'romance'),
    (1707, u'salon'), (1727, u'science'), (1748, u'sentence'), (1756, u'service'), (1769, u'simple'),
    (1777, u'social'), (1801, u'source'), (1805, u'spatial'), (1809, u'stable'), (1830, u'surface'),
    (1833, u'surprise'), (1836, u'suspect'), (1847, u'talent'), (1911, u'train'), (1933, u'tunnel'),
    (1948, u'unique'), (1954, u'usage'), (1963, u'vague'), (1970, u'valve'), (2008, u'village'),
    (2014, u'virus'), (2020, u'vital'), (2034, u'volume'), (2039, u'voyage'), (2041, u'wagon'),
] |
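To illustrate why checking only the first word misfires and checking every word does not, here is a small sketch. It is not the poster's code. The function names are invented, and the two tiny sets below are stand-ins for the real 2048-word lists.

```python
# Tiny stand-in vocabularies; a real implementation would load the full lists.
WORDLISTS = {
    "english": {"abandon", "about", "zoo"},
    "french": {"abaisser", "abandon", "zoologie"},
}


def languages_by_first_word(mnemonic, wordlists):
    """Naive detection: every list that contains the first word."""
    first = mnemonic.split()[0]
    return sorted(lang for lang, vocab in wordlists.items() if first in vocab)


def candidate_languages(mnemonic, wordlists):
    """Stricter detection: every list that contains all of the words."""
    words = mnemonic.split()
    return sorted(
        lang
        for lang, vocab in wordlists.items()
        if all(word in vocab for word in words)
    )


if __name__ == "__main__":
    phrase = "abandon " * 11 + "about"
    print(languages_by_first_word(phrase, WORDLISTS))  # both lists match
    print(candidate_languages(phrase, WORDLISTS))      # only one list matches
```

The stricter function can still return more than one language when every word is shared, so callers should treat the result as a list of candidates rather than a single answer.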
I'm not convinced by that. Automatic language detection is dangerous in itself (Chinese Traditional and Simplified). |
Hi @simcity4242, thanks for bringing this up; it is interesting. I don't really think there is any great reason for us to auto-detect the mnemonic language. Do you have a specific use case in mind? I'm actually thinking of removing this functionality from BIP39.NET because, at the end of the day, we don't really need to know the language of the mnemonic on input. Unless you have a specific task in mind, it may be wise to just avoid automatic language detection altogether. I implemented it before the French list existed. Nicolas is right that with only ~5% overlap a French mnemonic will almost always contain mostly French-only words, so it shouldn't be an issue, but you will need to account for the edge cases, I guess. |
If auto-detection is not possible, you'd need to add to the 12 words the information about which wordlist is used. So effectively it would be a 13th word. |
Why do you need to know what language is used tho? |
Surely you would detect the localization from the system for auto-detection, just as any other app/program does now? The correct spaces are whatever the user puts in; the conversion from ideographic to normal spaces happens during normalization anyway, so it doesn't matter what spaces are put in. |
Also, if you are inputting the words, you can't auto-detect the language as you type the words in! |
On mobile devices, you generally don't type spaces. Everything is auto-completed. This is especially true if there are well-defined dictionaries. You can't use the system locale reliably, as phrases should be exchangeable between devices. |
If all encodings use the same type of space then we're good. But I heard that's not the case? |
On mobile devices the OS handles the auto-complete based on a localized dictionary in most cases. Yes, the space is different for JP; however, the normalization process turns the ideographic space into an ASCII space regardless of what is input, so it doesn't matter what space is auto-added. |
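The claim about spaces can be checked with a short sketch. It relies only on the fact that NFKD maps the ideographic space (U+3000) to an ordinary ASCII space. The helper name `split_mnemonic` and the sample kana strings are illustrative and not taken from any wordlist.

```python
import unicodedata


def split_mnemonic(phrase):
    """Normalize to NFKD, then split into words on whitespace."""
    normalized = unicodedata.normalize("NFKD", phrase)
    return normalized.split()


if __name__ == "__main__":
    # The same two illustrative words, typed once with an ideographic space
    # and once with an ASCII space, yield the same word sequence.
    print(split_mnemonic("\u3042\u3044\u3000\u3046\u3048"))
    print(split_mnemonic("\u3042\u3044 \u3046\u3048"))
```

Note that NFKD also decomposes voiced kana into a base character plus a combining mark, so any wordlist lookup has to compare against words stored in the same normalized form.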
Japanese phones don't auto-insert spaces at all, in fact. |
Well, I will use a customized auto-complete. Otherwise it will insert words not contained in the word lists, or maybe it's even missing words from the lists. I assume I will be able to append the space myself. |
That is probably best. I like Mycelium's setup. Japanese list is unique with the first 3 characters so it should be easy to auto-complete |
FWIW, for the first word I plan to auto-complete against all the supported word lists at the same time, so essentially the dictionary is a union of the wordlists. For all subsequent words, I exclude the word lists that can't match anymore. If after the 12th word there would still be multiple word lists matching, I may ask the user which list to use (if that's needed, I'm not sure). |
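The approach described in this comment could look roughly like the sketch below. It is one reading of the description and not the commenter's actual code. The class name `MnemonicCompleter` and its methods are hypothetical, and the short lists in the usage example stand in for full wordlists.

```python
class MnemonicCompleter:
    """Auto-complete over the union of wordlists, narrowing as words are accepted."""

    def __init__(self, wordlists):
        # Map each language name to a set of its words; all lists start as candidates.
        self.remaining = {lang: set(words) for lang, words in wordlists.items()}

    def suggestions(self, prefix):
        """Words starting with prefix in any wordlist that is still a candidate."""
        union = set().union(*self.remaining.values())
        return sorted(word for word in union if word.startswith(prefix))

    def accept(self, word):
        """Record a chosen word and drop the wordlists that do not contain it."""
        matching = {
            lang: words for lang, words in self.remaining.items() if word in words
        }
        if not matching:
            raise ValueError("word is not in any remaining wordlist: " + word)
        self.remaining = matching
        return sorted(matching)


if __name__ == "__main__":
    completer = MnemonicCompleter({
        "english": ["abandon", "about", "zoo"],
        "french": ["abaisser", "abandon", "zoologie"],
    })
    print(completer.suggestions("aba"))
    print(completer.accept("abandon"))
    print(completer.accept("about"))
```

If more than one language is still in `remaining` after the last word, that is the case where the user would have to be asked which list to use.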
FWIW, I use auto-detection in seedrecover. It's just a UI nicety. The French word list isn't really that much of a problem; the likelihood of an entire (random) 12-word mnemonic being ambiguous between English and French is less than 1 in 5 × 10^15. As NicolasDorier already pointed out, it's the Chinese Simplified and Traditional wordlists which are problematic if you want to do auto-detection; they share 62% of their words. That's a 1 in 295 likelihood of ambiguity for a 12-word mnemonic, 1 in 4720 if you also require the checksum be valid. This problem (if it even is one) could have been solved by requiring that for each new word list, if it shares a word with an existing word list, that word must be placed in the same position as it is in the existing word list (or just use Electrum 2.x's method). |
@Thashiznets I initially flagged French because the first test vector contains "abandon" (11/12 words) and my code was just checking the first word (like Electrum), so the English test vectors were returning "French" as the language; I've used a workaround. Basically, I've been trying to differentiate mnemonic phrases without needing to know if it's BIP39, or Electrum 2.x (or Electrum 1.x, which is much harder). I just think it's prudent to have certainty in knowing what type of mnemonic it is by the words alone. |
Sorry for not answering this problem; I'm not a tech guy :/ |
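The figures in this comment can be reproduced approximately with the arithmetic below. The sketch assumes each word of a random mnemonic falls among the shared words independently, and that a 12-word mnemonic carries 4 checksum bits (one bit per 33 mnemonic bits). The value 100 is the number of entries in the FRENCH_BIP39_CLASHES list posted earlier in the thread. The value 0.62 is the rounded percentage quoted above, so the Chinese result lands near, but not exactly on, the quoted 1 in 295. The function name `ambiguity_odds` is made up for this illustration.

```python
WORDLIST_SIZE = 2048  # every BIP39 wordlist has 2048 entries


def ambiguity_odds(shared_fraction, words=12, require_checksum=False):
    """Return N such that a random mnemonic is ambiguous with probability 1 in N."""
    # Every word must independently fall among the shared words.
    probability = shared_fraction ** words
    if require_checksum:
        # Each word encodes 11 bits; one bit in every 33 is checksum.
        checksum_bits = words * 11 // 33
        probability /= 2 ** checksum_bits
    return 1 / probability


if __name__ == "__main__":
    print("English/French: 1 in %.3g" % ambiguity_odds(100 / WORDLIST_SIZE))
    print("Chinese lists:  1 in %.0f" % ambiguity_odds(0.62))
    print("with checksum:  1 in %.0f" % ambiguity_odds(0.62, require_checksum=True))
```

Requiring a valid checksum multiplies the odds by 16 for 12 words, which matches the ratio between the two Chinese figures quoted in the comment (295 × 16 = 4720).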
Agreed, leave as is, I think trying to guess the spec used i.e. BIP39, Electrum etc could end in tears. |
Hey, I was taking a look at BIP0039 to add Portuguese, and then I saw that the French wordlist has a lot of words matching the English list. I know it is not in the proposed rules, but I believe it is important not to have words already used in other languages' mnemonic sets. These are the ones identical to the English list: |
You're right, that was a concern during the creation of the wordlist (https://en.wikipedia.org/wiki/List_of_English_words_of_French_origin), but it wasn't a priority for me, and I think it wasn't easily possible to apply this additional restriction along with the other rules. |
@voisine
Here are my restrictions:
4 wordlists used:
Spelling verified with the Hunspell French Dictionary (1990 and Classique) in Notepad++, and meaning verified with https://fr.wiktionary.org and http://www.larousse.fr/ for hundreds of words.
People who can review:
@ecdsa @NicolasDorier @EricLarch @NicolasBigot @pollastri-pierre
Thanks to Thomas Voegtlin for his wordlist!
Please wait before merging.
--- The following message is partially outdated because of the evolution of the wordlist. ---
I defined as many "reasonable" restrictions as possible so that a person can guess one of their words as easily as possible if they forget it (or remember it easily).
As for "embarrassing" words, these are words that could be taken as a nasty insult, and certain words relating to serious illness, death, poverty, crime, violence, the medical field, attitudes, and many others.
I did my best to remove words that resembled another word, whether spoken or written.
Several hundred words that differed from another word by 1 letter (or had 1 different letter) were removed.
I consider the result fairly satisfactory: far from perfect, but entirely acceptable.
Restrictions no. 6 and 10 also help with this problem.
I estimate that 1% of the words are potentially unknown to the public (such as "quantum"), and that 5% have meanings the public may be unsure of (such as "fongible").
I consider these margins acceptable.
Note that some chemical elements from the periodic table are included, the best-known ones.
For a pleasant review, here is the printable version (5 A4 pages, PDF):
https://www.dropbox.com/sh/xlq3x2anb706uw1/AADUYAqcBvkvUPdhwC2uLWmEa?dl=0
If you want to check in a single read-through, focus on restrictions no. 2, 3, 5, 8 and 11.
Given the homogeneity of the list (and the common sense it should reflect), words that violate restrictions no. 1, 4 and 13 should jump out at you.
Allow 15 minutes of reading per page.
I still recommend a second read-through.
I hope you will like this wordlist; it represents more than 70 hours of work that I did not plan to do at the outset, given the scale and responsibility of the task.
If a word seems inappropriate to you, or if you have comments about the restrictions, feel free to let me know.
Also note that, if it suits you, it will be included in one of the upcoming versions of breadwallet along with the other foreign wordlists.