Human-oriented encoding #5
Human input is tricky due to human vision, human handwriting, and human psychology. Humans will mess up the case because they had caps lock on, or didn't think it would matter (since uppercase and lowercase are semantically the same character). Humans are also bad at inputting symbols, especially since they're in different locations on different international keyboards (and some don't appear at all!). Humans also consistently mess up the following (especially when writing down by hand):
That leaves us with a viable character set of:
where o can be substituted for 0, l and i can be substituted for 1, u can be substituted for v, and uppercase can be substituted for lowercase. This is how safe32 is implemented: https://github.com/kstenerud/safe-encoding/blob/master/safe32-specification.md#alphabet |
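A minimal sketch of how those substitutions could be applied when accepting input (this is not the safe32 reference implementation; the function name and structure are illustrative):

```c
#include <ctype.h>
#include <stdio.h>

/* Illustrative normalizer for a safe32-style alphabet: fold case and map the
 * commonly confused characters onto their canonical counterparts, per the
 * substitution rules quoted above (o -> 0, l/i -> 1, u -> v, upper -> lower). */
static char normalize_char(char c)
{
    c = (char)tolower((unsigned char)c);
    switch (c) {
    case 'o':           return '0';  /* letter o read as zero */
    case 'l': case 'i': return '1';  /* l and i read as one   */
    case 'u':           return 'v';  /* u read as v           */
    default:            return c;
    }
}

int main(void)
{
    /* 'O', 'l' and 'U' all collapse onto their canonical forms. */
    printf("%c %c %c\n", normalize_char('O'), normalize_char('l'), normalize_char('U'));
    return 0;
}
```

Running each input character through such a normalizer before the decode-table lookup makes "O", "o" and "0" decode identically.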
@kstenerud my idea is to consider special characters to fill in all 64 "holes" with minimal human legibility issues. I do agree that it is tricky, but special characters are known to be highly identifiable. |
There are 32 symbols in the lower 7 bit set:
Of these:
This leaves a set of:
Even if you could somehow achieve more characters, considering that most human input codes are at most 20 characters:
Small gains for the increased human input complexity (especially considering phone input, where some of the characters are hidden 2 modifiers deep). |
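For reference, the 32 symbols in question are presumably the punctuation characters of 7-bit ASCII; a throwaway sketch (not part of any proposed encoder) that enumerates and counts them:

```c
#include <ctype.h>
#include <stdio.h>

/* Enumerate the symbol (punctuation) characters of 7-bit ASCII: everything
 * printable that is neither a letter, a digit, nor a space. There are 32. */
int main(void)
{
    int count = 0;
    for (int c = 0x21; c < 0x7f; c++) {
        if (ispunct(c)) {
            printf("%c ", c);
            count++;
        }
    }
    printf("\ntotal: %d\n", count);
    return 0;
}
```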
I would consider that $ is so distinct from S that it is a counter-example: a clearly distinguishable character. And the brackets are mostly true, except curly brackets, which are more distinctive. Other than that you are spot on. |
Here are all of the special characters:
Here they are again, grouped together by confusability:
Non-confusable:
These are the disallowed characters:
Removing the disallowed characters and the set confused with "1" leaves us with:
This leaves 7 "clean" characters and 3 "dirty" ones, for a total of 42. Of these:
This will already be super confusing to grandpa trying to punch in his activation code. All for 1 extra byte in a 20 char code (13 instead of 12). Not worth the effort and the tech support pain. |
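The arithmetic behind "13 instead of 12": a 20-character code over an N-symbol alphabet carries floor(20 · log2 N / 8) whole bytes. A quick check (illustrative only, using the hypothetical 42-symbol alphabet counted above):

```c
#include <math.h>
#include <stdio.h>

/* Whole bytes representable by 20 characters drawn from an n-symbol alphabet.
 * Compile with -lm. */
static int bytes_in_20_chars(int n)
{
    return (int)floor(20.0 * log2((double)n) / 8.0);
}

int main(void)
{
    printf("base 32: %d bytes\n", bytes_in_20_chars(32)); /* 12 */
    printf("base 42: %d bytes\n", bytes_in_20_chars(42)); /* 13 */
    return 0;
}
```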
Actually I just noticed a problem with tilde as well. Also, ' ` . , will be confusable if written on paper. |
No, I would make a list of alphabetic characters that are not confusing. |
Did another study. |
In the early days of the web, many human input systems (such as captchas) were case sensitive. None of them are anymore, because analytics showed a significant dropoff in the funnel for case sensitive input systems vs case insensitive. Even if the characters are visually different, uppercase and lowercase letters are semantically the same. Many humans will input the entire sequence in one case (all upper or all lower), regardless of what the original looked like. The point of human input systems is to cater to their psychological and sensory biases so as to minimize erroneous or cumbersome input. Symbols are bad, mixed case is worse, and similar looking letters are right out. |
@kstenerud I would say that if someone is able to type out special characters, then they can handle upper case as well. We are now in the Web 3.0 era, and human perception of characters has sharpened for the technological age. What I hope to achieve is a more compact "writable" binary-to-text format (more compact than safe32) that is good for comparing and typing, without exceeding the human limit of character recognition. |
We'll just have to disagree on that point, then. I understand your desire for a more compact human-writable encoding, but the savings are minimal. In a 20 character sequence (arguably the most you'd want a human to input), the difference between base 64 and 32 is 15 encoded bytes vs 12 encoded bytes. 8 bytes is already 64 bits, enough to assign a unique ID to every grain of sand on every beach and more. Adding a checksum/CRC would be 1-2 bytes, with the rest not being very useful. So for almost everyone, 10 bytes of data is plenty, which is 14 chars in base64, and 16 chars (aaaa-bbbb-cccc-dddd) in base32. |
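(Working through the arithmetic behind these figures: 20 characters carry 20 × 6 = 120 bits, i.e. 15 bytes, in base64 and 20 × 5 = 100 bits, i.e. 12 whole bytes, in base32; conversely, 10 bytes is 80 bits, which needs ⌈80/6⌉ = 14 base64 characters or ⌈80/5⌉ = 16 base32 characters.)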
And yes, even though Unicode should in theory be safe, there are still too many legacy Shift-JIS, EUC, Big5, etc. systems around. |
@kstenerud agree to disagree then with regard to "human orientation". Also, theoretically speaking, how much can base55 (the maximum base) save when compared to base32? If my calculations are correct, for a 512-bit hash base55 would require 86 characters while base32 requires 102 characters. For 256-bit hashes it's 45 vs 52 characters. I could be wrong though.
Is 128-bit arithmetic common in C libraries? P.S. Shift-JIS and Big5 have already been deprecated by many Asian governments, so that might not be a good argument. The "wasted Unicode bytes" and "lowered performance due to inconsistent character byte lengths" are good arguments in their own right. |
I calculate 45 characters for 256 bits and 87 for 512 bits, using a 128-bit integer size (18 characters per 13-byte chunk). To keep it inside 64-bit ints, you'd need 7 characters per 5-byte chunk, which would again give 45 characters for 256 bits, and 90 for 512 bits. A 128-bit integer is not standard C (it's a GNU extension), and it's slow. Also, for Unicode solutions, look here: https://github.com/qntm/base2048 |
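A rough sketch of the 64-bit chunked scheme being described, assuming a hypothetical 55-character alphabet (the alphabet below is a placeholder, not a proposed one): every 5-byte group (40 bits) becomes 7 base-55 digits, since 55^7 > 2^40, so only uint64_t arithmetic is needed. Handling of a short final chunk is left out.

```c
#include <stdint.h>
#include <stdio.h>

/* Placeholder 55-symbol alphabet (digits + lowercase + A..S); the real
 * alphabet would be whatever 55-character set the discussion settles on. */
static const char alphabet[] = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRS";

/* Encode one 5-byte chunk (40 bits) as 7 base-55 digits. 55^7 > 2^40, so a
 * single uint64_t holds the whole chunk value. */
static void encode_chunk(const uint8_t in[5], char out[7])
{
    uint64_t v = 0;
    for (int i = 0; i < 5; i++)
        v = (v << 8) | in[i];
    for (int i = 6; i >= 0; i--) {   /* most significant digit first */
        out[i] = alphabet[v % 55];
        v /= 55;
    }
}

int main(void)
{
    const uint8_t data[5] = { 0xde, 0xad, 0xbe, 0xef, 0x42 };
    char text[8] = { 0 };            /* 7 digits + terminating NUL */
    encode_chunk(data, text);
    printf("%s\n", text);
    return 0;
}
```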
Firstly, if 45 characters are constant between the two cases, maybe there is a substantial gain to base55 and other bases below 64 (when compared to 52 characters in base32, a 15% gain). I do hope that it can be used as a basic encoding for SHA-256/512 hashes (from a human-oriented perspective), among other things. Secondly, are there any other 128-bit integer (or even 256/512-bit arithmetic) libraries that are fast? I don't think GMP is known to be fast; there have to be other libraries that can do that. Would coding in pure C make it faster? P.S. For Unicode solutions, I have seen "better" solutions: https://github.com/qntm/base32768 https://github.com/rinick/base2e15 https://github.com/grandchild/base32k (off-topic) |
I don't know of any 128 bit int libraries (haven't really looked). Mostly I'm just waiting for 128 bit to be incorporated into the C standard. |
https://github.com/ridiculousfish/libdivide might have a clue on possibly faster division speeds (can't guarantee that though) |
BTW we have major problems with any encodings containing $ as users routinely enter it unescaped on the command line |
Since there is a list of characters that cannot be used in Windows, URLs, SGML or JSON (with
" # & ' / : < ? \ > | * .
all being unusable)... maybe it is time to invent a human-oriented base64 encoding that is less confusing between uppercase and lowercase (e.g. C and c, K and k, O and o, P and p, S and s, V and v, W and w, X and x, Z and z). BinHex isn't perfect. Criteria: given the most common office fonts, characters should not cause confusion when rewritten by hand.
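One possible reading of those criteria as a quick experiment (my own sketch, not an agreed-upon alphabet): start from the printable 7-bit characters, drop the unsafe symbols listed above, and drop the lowercase half of each handwriting-confusable pair. The remaining pool comes out above 64 (72 symbols), so the least legible leftovers could then be trimmed to reach exactly 64.

```c
#include <stdio.h>
#include <string.h>

/* One reading of the criteria above: start from all printable 7-bit
 * characters, drop the symbols unsafe in Windows paths / URLs / SGML / JSON,
 * and drop the lowercase half of each pair whose upper- and lowercase forms
 * are easily confused in handwriting. Prints the surviving pool and its size. */
int main(void)
{
    const char *unsafe     = "\"#&'/:<?\\>|*.";
    const char *confusable = "ckopsvwxz";   /* keep only C K O P S V W X Z */
    char pool[128];
    int n = 0;

    for (int c = 0x21; c < 0x7f; c++) {
        if (strchr(unsafe, c) || strchr(confusable, c))
            continue;
        pool[n++] = (char)c;
    }
    pool[n] = '\0';
    printf("pool (%d symbols): %s\n", n, pool);
    return 0;
}
```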