Support Diceware wordlists in multiple languages as part of i18n efforts #999

Open
toholdaquill opened this issue Apr 21, 2015 · 14 comments
Labels
help wanted Issues we would definitely appreciate volunteer help with i18n Anything related to translation or internationalization of SecureDrop

Comments

@toholdaquill
Contributor

In addition to translating the SecureDrop interface (see issue #753), it would also be ideal to support Diceware wordlists in multiple languages. Since sources should memorize their codenames for maximum security, this will make it easier for non-English speakers to use SecureDrop. Currently there are Diceware wordlists available in a dozen or so languages, see:

http://world.std.com/~reinhold/diceware.html

Since a journalist never sees a source's codename, it would be ideal to allow a source to select a different language than the journalist's. For instance, a Turkish source could use SecureDrop in Turkish, receive a Turkish codename, but the English-speaking (say) journalist would use an English-language interface.

@garrettr
Contributor

garrettr commented May 7, 2015

Potentially useful: @micahflee has started translating Diceware wordlists as part of his Passphrases project.

@toholdaquill
Contributor Author

Garrett Robinson:

Potentially useful: @micahflee has started translating Diceware wordlists as part of his Passphrases project.

nice. :)

Question...

I notice that these wordlists are all basically ASCII:

$ file *
catalan-diceware.wordlist: ASCII text
dutch-diceware.wordlist: ASCII text
english-diceware.wordlist: C++ source, ASCII text
french-diceware.wordlist: ISO-8859 text
german-diceware.wordlist: C source, Non-ISO extended-ASCII text
italian-diceware.wordlist: ASCII text
japanese-diceware.wordlist: ASCII text, with CRLF line terminators
polish-diceware.wordlist: ASCII text
securedrop.wordlist: C++ source, ASCII text
swedish-diceware.wordlist: C source, ASCII text
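
The same survey can be done without relying on the heuristics of `file` (which also guesses "C source" for plain word lists). This is a minimal sketch, assuming the lists sit in one directory with a `.wordlist` suffix as in the listing above; the function names `classify_encoding` and `report` are made up for illustration.

```python
from pathlib import Path


def classify_encoding(data: bytes) -> str:
    """Classify raw bytes as 'ascii', 'utf-8', or 'other'."""
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # e.g. ISO-8859-1 or another legacy 8-bit encoding
        return "other"


def report(directory: str) -> dict:
    """Map each *.wordlist file name in `directory` to its classification."""
    return {
        path.name: classify_encoding(path.read_bytes())
        for path in sorted(Path(directory).glob("*.wordlist"))
    }
```

A list reported as "other" (like the French and German ones above) would need to be re-encoded before it could be treated as UTF-8.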

If the goal is to help non-English speakers create strong passphrases
that are easy to memorize, it is important that the words be
orthographically correct. Accents and umlauts aren't decoration
(although they sometimes seem like it to us English speakers), but are
essential parts of the meaning. A Swede, for instance, might choose to
remember the word "smörgåsbord"; but the ASCII equivalent "smorgasbord"
simply isn't a word in Swedish.

Note also that most non-English users have keyboards with
locale-specific layouts. For instance, I'm typing this on a
Spanish-language keyboard with keys like ñ and so forth (which is
actually a PITA for me as an English-speaker, but I digress). Standard
locale-specific hotkeys (e.g. ' + e = é) make it easy to enter chars
like á, é, í, ó, ú, etc. This aids memorability, but also adds a little
bit of entropy--instead of just the 27 chars of the alphabet [[en_*] +
ñ], you actually have thirty-odd utf-8 chars once you include the
various accents and diereses.
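
To put rough numbers on the entropy remark, here is a back-of-the-envelope sketch. The alphabet sizes 27 and 33 are only illustrative readings of "27 chars" and "thirty-odd" above, and 7776 is the size of a standard five-dice Diceware list; all three are assumptions, not figures from SecureDrop.

```python
import math


def bits_per_choice(options: int) -> float:
    """Entropy in bits of one uniformly random pick among `options`."""
    return math.log2(options)


plain_bits = bits_per_choice(27)     # a-z plus ñ (assumed)
accented_bits = bits_per_choice(33)  # "thirty-odd" characters (assumed)
gain_per_char = accented_bits - plain_bits

# For a generated codename, each word is one pick from the whole list.
bits_per_word = bits_per_choice(7776)
```

Under these assumptions the accents add roughly 0.3 bits per character to a freely typed passphrase, while a codename drawn uniformly from a 7776-word list carries about 12.9 bits per word however the words are spelled. For generated codenames, then, correct orthography mainly helps memorability.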

So I think there's a choice to be made, how you'd like to proceed adding
multiple language support. You can definitely use these lists now,
knowing that suboptimal, at least in this case, is quite a bit better
than nothing. ("You vant me to memorize a passphrase en zee Engleesh?
Zut alors!")

Long term, though, I think the Western European lists should be
converted to orthographically-correct utf-8, with unicode on the horizon
for Asian language support.

I note that the Diceware Kit for other Languages includes this suggestion:

  1. If you wish to add letter combinations in your language that are
    not in the 26-character Roman alphabet, you of course may do so, but
    consider whether they will be available on all keyboards that your
    users will have.

I think this is well-intentioned but incorrect. Since the goal here is
to offer users a dropdown ("Select your language"), each language choice
should be optimized for users of that language.

Reviewing the Diceware Kit, I'm not seeing any programmatic way to
generate these lists. Suck in a whole dictionary, hacking and slicing
for string length and other regex requirements? Maybe. But that sounds
like more work than building the list by hand, especially since a local
speaker would need to review the list before use, anyway.
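
For what it's worth, the "hacking and slicing" step could look something like the sketch below. It is only a first-pass filter under assumed requirements (letters only, 3 to 8 characters, lowercased, NFC-normalized); the function name and thresholds are hypothetical, and the output would still need review by a native speaker.

```python
import re
import unicodedata

# Unicode-aware "letters only": word characters minus digits and underscore.
LETTERS_ONLY = re.compile(r"[^\W\d_]+")


def candidate_words(lines, min_len=3, max_len=8):
    """Filter raw dictionary lines down to sorted, de-duplicated candidates."""
    seen = set()
    for line in lines:
        # NFC keeps accented letters as single code points, so len() counts
        # what the user perceives as characters in most European languages.
        word = unicodedata.normalize("NFC", line.strip().lower())
        if not (min_len <= len(word) <= max_len):
            continue
        if not LETTERS_ONLY.fullmatch(word):
            continue
        seen.add(word)
    return sorted(seen)
```

A call such as `candidate_words(open("dictionary.txt", encoding="utf-8"))` would produce the candidate pool, where `dictionary.txt` is a placeholder for whatever dictionary dump is available.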

Let me know if I can be of further help with this.

@tildelowengrimm

How hard is it to type diacritical marks on Tails?

@toholdaquill
Contributor Author

On Wed, Nov 18, 2015 at 03:31:13PM -0800, Tom Lowenthal wrote:

How hard is it to type diacritical marks on Tails?

That would depend on the keyboard the user has. A Spanish-speaker would
likely have a Spanish-language keyboard, other languages would have
locale-specific layouts, etc.

@tildelowengrimm

Have you tested that, or are you supposing? I've never tried using a non en-us layout with Tails.

@toholdaquill
Contributor Author

On Fri, Nov 20, 2015 at 06:14:57PM -0800, Tom Lowenthal wrote:

Have you tested that, or are you supposing? I've never tried using a non en-us layout with Tails.

I own a laptop with a Spanish-language keyboard.

To replicate in Tails, go to Applications --> System Tools -->
Preferences --> System Settings --> Region and Language --> Layouts -->
click the '+' button --> select the new keyboard layout you'd like to
use.

Tails only supports five display languages, but the keyboard can be
configured to any layout you desire.

@tildelowengrimm

👍

@philou-felin

I agree with the original poster. I had a look at the “Radio-Canada” Secure Box (French Canada) just out of curiosity and noticed that the passphrase was all in English. I think I understand the rationale for SecureDrop creating the passphrase for the user, but it has to be in his/her native tongue.

@redshiftzero redshiftzero removed this from the 1.0 milestone May 11, 2017
@redshiftzero redshiftzero added i18n Anything related to translation or internationalization of SecureDrop help wanted Issues we would definitely appreciate volunteer help with labels Aug 19, 2017
@KwadroNaut
Contributor

KwadroNaut commented Sep 27, 2017

Some of the languages on that Diceware page contain too many problematic words, non-words, etc.
For Dutch there's been some nice effort by @remko: https://el-tramo.be/blog/diceware-nl/ and https://github.com/remko/dicewords/ It could/should be combined with the tests run by the University of Ghent (http://woordentest.ugent.be/ and datasets here: http://crr.ugent.be/programs-data/word-prevalence-values ). If it's better to split the localization of Diceware lists into one issue per language, please just move this comment to a separate one.

@ghost

ghost commented Nov 4, 2017

Note that there is now support for internationalized word lists (currently only French is supported). For Arabic, it would be enough to add an ar.txt file in https://github.com/freedomofpress/securedrop/tree/develop/securedrop/wordlists . However, the code must also be modified to support non-ASCII words, and that is a nontrivial change.
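
As an illustration of the one-file-per-language layout, the lookup could work roughly like this. This is a sketch and not the actual SecureDrop code: `choose_wordlist` is a hypothetical helper, and the presence of an `en.txt` fallback is an assumption.

```python
def choose_wordlist(available, locale, fallback="en"):
    """Pick a wordlist file name for `locale` from the `available` names.

    Tries the full locale (e.g. 'fr_FR'), then the bare language ('fr'),
    then the fallback language.
    """
    for candidate in (locale, locale.split("_")[0], fallback):
        name = f"{candidate}.txt"
        if name in available:
            return name
    raise LookupError(f"no wordlist for {locale!r} and no {fallback!r} fallback")
```

With this shape, adding a language is a matter of dropping in a new file, e.g. `choose_wordlist({"en.txt", "fr.txt", "ar.txt"}, "ar")` would select `ar.txt`.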

@eloquence
Member

Note that curating and expanding these word lists is still desirable. It may also be useful to allow admins to configure the preferred language for newly generated journalist designations (which are drawn from a different set of wordlists, currently monolingual).

@KwadroNaut
Contributor

Good reminder. @remko has updated his tools and wordlists, and they can be reused for other languages too if there's a need for it. To my understanding, what he produced (and keeps updating) is MIT-licensed (https://github.com/remko/dicewords/blob/master/LICENSE ), and the generation and collection of the list is based on the 'open taal' initiative. tl;dr: if you're fine with it, I'll create a pull request to either include https://el-tramo.be/diceware/diceware-wordlist-8k-composites-nl.txt or redo remko's work to generate another Dutch one.

@nabla-c0d3
Contributor

Currently the PassphraseGenerator explicitly rejects non-ASCII words (to maintain existing behavior):

https://github.com/freedomofpress/securedrop/blob/develop/securedrop/passphrases.py#L59

However, this check can probably be replaced with a check for encode("utf-8") without any problem.
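
A minimal sketch of what the relaxed check might look like. This is not the code at the link above; `is_acceptable_word` is a hypothetical name used only for illustration.

```python
def is_acceptable_word(word: str) -> bool:
    """Accept any word that can be encoded as UTF-8."""
    try:
        word.encode("utf-8")
    except UnicodeEncodeError:
        # In Python 3 this essentially only happens for lone surrogates,
        # which usually indicate a wordlist decoded with the wrong codec.
        return False
    return True
```

Since nearly every Python 3 `str` encodes to UTF-8, this check is far more permissive than the ASCII one. Any further constraints (word length, allowed scripts, Unicode normalization) would need to be enforced separately.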

@eloquence
Member

eloquence commented Apr 15, 2021

For the sprint starting 4/15, @rmol has committed to sharing a first set of wordlists generated using machine translation, so we can begin evaluating the quality of the results and potentially prepare integration.


8 participants