wordlist_de_dys2p_7776.txt added #91

Open: wants to merge 1 commit into base: master

b068931cc450442b63f5b3d276ea4297

Since the usability of Diceware depends on the quality of the word lists, a word list should consist of words that are as familiar and easy to remember as possible.

Our word list de-7776 is suitable as a Diceware word list for five dice. The words are unique from the fifth letter on. It also follows the rules below for the most part, though not strictly:

  • Words are three to twelve characters long.
  • No word contains the characters ä, ö, ü and ß.
  • If possible, only familiar nouns, verbs and adjectives should be included, and in their basic form (nouns in the singular, verbs in the infinitive, adjectives in their uninflected form).
  • No proper names, regions, religions, associations, or persons.
  • No words with particularly negative connotations.
  • The "masculine" grammatical gender is preferred. (This is standard for BIP39.)
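The mechanical rules above (list size, word length, excluded characters) can be checked automatically. The following is a minimal sketch, assuming the list has already been read into a Python list of words; the function name `check_wordlist` and its parameters are hypothetical and not part of this pull request. The rules about familiarity, proper names, and connotations still need manual review.

```python
# Characters the list rules exclude.
FORBIDDEN = set("äöüß")


def check_wordlist(words, expected_size=6 ** 5, min_len=3, max_len=12):
    """Return a list of human-readable rule violations (empty if none)."""
    problems = []
    # Five dice give 6**5 = 7776 possible rolls, one word per roll.
    if len(words) != expected_size:
        problems.append(f"expected {expected_size} words, found {len(words)}")
    if len(set(words)) != len(words):
        problems.append("list contains duplicate words")
    for word in words:
        if not min_len <= len(word) <= max_len:
            problems.append(
                f"{word!r}: length {len(word)} outside {min_len}-{max_len}"
            )
        if FORBIDDEN & set(word.lower()):
            problems.append(f"{word!r}: contains one of ä, ö, ü, ß")
    return problems
```

Called on the full list with the default `expected_size`, an empty result means none of these three rules is violated.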

Owner

@ulif ulif left a comment


Thanks for your contribution, @b068931cc450442b63f5b3d276ea4297!

Could you tell us a bit more about how you compiled the list? That could be useful for others.

Furthermore, there are some more problems with it:

  • more than 100 words contain soft hyphen chars (0xc2ad in utf-8). First is "agrarkultur", last "xylophon". They have to be removed before a merge can happen.
  • you say the list contains no words with negative connotations, but it also contains "arsch" and other not too friendly words. How did you check?
  • I do not prefer masculine forms, even if this conflicts with some Bitcoin standard (BIP39?), and I wouldn't even consider this a sign of quality of a wordlist. Au contraire. So, if your list makes it into the collection, there is no guarantee that the list won't be flooded with feminine replacements in the future. If you don't want that, please tell me.
  • please give a license for the list and a copyright contact if you are not the copyright holder yourself.
  • could you think of a shorter name for the list? People have to use the middle part as option value, when picking a list.

I am sorry, but at least some of these problems must be addressed (hyphens, license) before a merge can happen.

@b068931cc450442b63f5b3d276ea4297
Author

We created the lists manually: first one for Diceware with 4 dice (1296 words), then one for Monero (1626 words) and one for BIP39 (2048 words), and finally another one because we thought the most common German list for Diceware with 5 dice needed improvement. The lists and some more words are available here. They are all under the CC0-1.0 license.
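For context, the Diceware list sizes follow from the number of dice (6^4 = 1296 and 6^5 = 7776), and each word drawn uniformly at random from a list of n words contributes log2(n) bits of entropy. A small sketch of that arithmetic; the dictionary labels are only illustrative names for the lists mentioned above:

```python
import math

# Sizes of the lists mentioned in the comment above.
LIST_SIZES = {
    "diceware, 4 dice": 6 ** 4,  # 1296
    "monero": 1626,
    "bip39": 2 ** 11,            # 2048
    "diceware, 5 dice": 6 ** 5,  # 7776
}


def bits_per_word(list_size):
    """Entropy in bits of one word chosen uniformly from the list."""
    return math.log2(list_size)


for name, size in LIST_SIZES.items():
    print(f"{name}: {size} words, {bits_per_word(size):.2f} bits per word")
```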

more than 100 words contain soft hyphen chars (0xc2ad in utf-8). First is "agrarkultur", last "xylophon". They have to be removed before a merge can happen.

In my estimation, according to Duden, hyphenation is possible for some words but not necessary, and the spelling without hyphens is the more familiar variant, for example Agrarkultur.

you say the list contains no words with negative connotations, but it also contains "arsch" and other not too friendly words. How did you check?

We did that manually as well. We removed some words with a mostly or purely negative context but left others like "Arsch" in the list, because colloquially it means the buttocks rather than someone being an "ass".

I do not prefer masculine forms, even if this conflicts with some Bitcoin standard (BIP39?), and I wouldn't even consider this a sign of quality of a wordlist. Au contraire. So, if your list makes it into the collection, there is no guarantee that the list won't be flooded with feminine replacements in the future. If you don't want that, please tell me.

This is not a problem at all; we are happy for it to be handled that way.

please give a license for the list and a copyright contact if you are not the copyright holder yourself.

https://github.com/dys2p/wordlists-de CC0-1.0 license

could you think of a shorter name for the list? People have to use the middle part as option value, when picking a list.

That's right, I was unsure about that too. You are welcome to make other suggestions.

After reviewing the list again, my current view is that the one with 1296 words is finished for now, and the one with 7776 words still needs a few changes (e.g., a few nouns are plural instead of singular). However, I can't currently estimate when I will be able to revise it again. Sorry about that.

@ulif
Owner

ulif commented Sep 17, 2022

more than 100 words contain soft hyphen chars (0xc2ad in utf-8). First is "agrarkultur", last "xylophon". They have to be removed before a merge can happen.

In my estimation, according to Duden, hyphenation is possible for some words but not necessary, and the spelling without hyphens is the more familiar variant, for example Agrarkultur.

I am afraid this is not the point. It is not about grammar but about non-ASCII characters in the raw data of your wordlist. Some lines in your wordlist contain "invisible" hyphens. Take, for instance, line 828. At first sight it looks like "basteln",

"b" "a" "s" "t" "e" "l" "n" "\n"
or in hex:
0x62 0x61 0x73 0x74 0x65 0x6c 0x6e 0x0a

i.e. 7 chars plus newline. In fact the line looks like this:

"b" "a" "s" <SOFT-HYPHEN> "t" "e" "l" "n" "\n"
or in hex:
0x62 0x61 0x73 0xc2 0xad 0x74 0x65 0x6c 0x6e 0x0a

i.e. 9 bytes (8 characters, since the soft hyphen takes two bytes in UTF-8) plus newline. These (SOFT-HYPHEN) chars can be found in more than 100 words of your list (but not in the others).
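The difference can be reproduced in Python. This is only an illustrative sketch: the helper `describe` is a hypothetical name, and the soft hyphen is placed after "bas" as in the byte dump above.

```python
SOFT_HYPHEN = "\u00ad"  # U+00AD SOFT HYPHEN, encoded as 0xc2 0xad in UTF-8

clean = "basteln"
dirty = "bas" + SOFT_HYPHEN + "teln"  # usually renders the same as `clean`


def describe(word):
    """Return (character count, UTF-8 byte count, hex dump) for a word."""
    raw = word.encode("utf-8")
    return len(word), len(raw), raw.hex(" ")


print(describe(clean))
print(describe(dirty))
print(clean == dirty)  # the two strings are not equal
```

The two strings compare as unequal, which is exactly why a passphrase copied with the hidden character cannot be retyped from what is visible on screen.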

Of course, such invisible chars can be nasty. Imagine someone copy-pasting a Diceware passphrase with such hidden chars when setting a password. How should the person type this password later? Will the person be aware of the hidden chars at all?

I hope that helps to understand what my point is.

@ulif
Owner

ulif commented Sep 17, 2022

A quick check on the de-7776-wordlists on https://github.com/dys2p/wordlists-de reveals that they also suffer from the soft-hyphen problem. You might want to fix them as well.
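A check like this, and the corresponding fix, could be scripted along the following lines. This is a hedged sketch, not the check that was actually run: the function names are hypothetical, and the sample lines (including where the soft hyphens sit inside the words) are made up for illustration.

```python
SOFT_HYPHEN = "\u00ad"  # U+00AD, invisible in most editors and terminals


def find_soft_hyphens(lines):
    """Return (line_number, visible_word) pairs for lines containing U+00AD."""
    return [
        (number, line.strip().replace(SOFT_HYPHEN, ""))
        for number, line in enumerate(lines, start=1)
        if SOFT_HYPHEN in line
    ]


def strip_soft_hyphens(lines):
    """Return the lines with every soft hyphen removed."""
    return [line.replace(SOFT_HYPHEN, "") for line in lines]


# Made-up sample; the real list has one word per line.
sample = ["abend\n", "agrar\u00adkultur\n", "xylo\u00adphon\n"]
print(find_soft_hyphens(sample))
print(strip_soft_hyphens(sample))
```

For a real file, the lines could be read with `open(path, encoding="utf-8")` and the cleaned lines written back with the same encoding.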

@b068931cc450442b63f5b3d276ea4297
Author

I am sorry that I am only answering now. The soft hyphens have been removed in the meantime, but I will revise the list again.


2 participants