GitHub - galenhuntington/g56: G56 encoding of binary into alphanumeric ASCII.

G56

This encoding has been superseded by the new G60, which gives better results at tiny additional cost. I am leaving this repo here, however, as it contains insight and history into the design of these encodings.

G43 and G56 are proposed encodings of binary data into ASCII. They are analogous to G86, with the differences that they use smaller character sets, particularly alphanumeric strings, and they are more experimental. See the description of G86 for the general idea.

These two encodings fill the gap between base 32, which is wasteful if case sensitivity can be used, and base 64, which requires non-alphanumeric ASCII characters.

G43

G43 is very similar to G86. The bytes are divided into chunks of 2, which are interpreted as base-258 digits. The integer value is then written in base 43. As 43³ ⩾ 256², this allows every 2 bytes to be encoded as 3 characters from a 43-size set, for a 50% length increase.

For the character set, I tentatively propose the digits, the uppercase BCDFGHJKLMPQVWXYZ, and the same lowercase excluding l.

G43 is a simpler encoding than G56, but it would be unusual for this gain in simplicity to be all that helpful, or that a character set of size 43, but less than 56, is needed. For this reason, from here on only G56 is considered.

I also developed a medium-simple encoding G54, but it is hard to imagine it being needed over G56.

Outline (v0.1)

G56 expands chunks of five bytes to seven characters, for a 40% size increase. The character set used is all ASCII digits, uppercase letters, and lowercase letters, in that order, with the exception of IOU of both cases.

Consider a chunk abcde of five bytes, thought of as numbers 0–255, and construct the integer

12·56⁵·a + 2·56⁴·b + 24·56²·c + 5·56·d + e.

Write this integer in base 56 (big-endian), using the above character set to represent “digits” 0–55, zero-padded on the left if needed to make exactly seven characters, to get the encoding into ASCII.

For a final chunk of less than five bytes, pad bytes of value zero on the end to bring it up to five bytes, and then for the final encoding remove 1, 2, 4, or 5 0 characters from the end, according as whether the number of bytes padded on was 1, 2, 3, or 4.

Although larger integers are used than in G43 and G86, this calculation fits comfortably within a 64-bit integer or a double-precision float.

A message of n bytes is encoded as n+⌈2n/5⌉ characters. As in G86, it preserves lexicographic order, and has the initial segment property. The encoding is reversible because each coefficient exceeds the next by a factor of at least 256. We also see the total cannot exceed 56⁷.

Examples

The input Hello, world! (as bytes) is length 13, so it will be encoded in three chunks, with the last padded with two zero bytes. Plugging into the above formula, we get the integers

477826981519; 291424773082; 715717764608.

We then write these in base 56. The last number, for example, becomes PBZE800. We assemble the three blocks, and remove the two 0s from the end, to yield the final encoding:

FTbpRez9R8v9x2PBZE8

As another example, here is the G56 encoding of the first 256 fractional bits of π:

7kEndAMbwHbRgRWad1S23Eer3x0b8sfMJZ8A0x9W5NPCt

A 128-bit binary string will be encoded as 23 characters, as compared with UUIDs at 36 characters, base 64 at 22 characters, and base 32 at 26 characters.

Future work and variations

The encoding of bytes into an integer can be done in a few ways while still having the intended properties. Specifically, the 24 on c could also be 23, and the 12 on a could be either 10 or 11, for six possible encodings. The choice is basically arbitrary, but I chose 24 and 12 because they are what one might call nice numbers, and also larger numbers “fill out” the target space more.

The choice of characters is similar to that in Crockford Base 32, except of course both cases are used, and L is included. This removes letters most likely to be confused with other characters, and also helps avoid spelling words, particularly undesirable ones.

So, while there are some degrees of freedom, the current choices are likely optimal.

Implementation

As in G86, a reference implementation in Haskell is included, which is also not streaming:

g56 < data.bin > data.txt
g56 -d < data.txt > data.bin

Either cabal install or stack install can be used to build.

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
Codec		Codec
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
g56.cabal		g56.cabal
main.hs		main.hs
package.yaml		package.yaml
stack.yaml		stack.yaml
test.hs		test.hs

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

G56

G43

Outline (v0.1)

Examples

Future work and variations

Implementation

About

Releases

Packages

Languages

License

galenhuntington/g56

Folders and files

Latest commit

History

Repository files navigation

G56

G43

Outline (v0.1)

Examples

Future work and variations

Implementation

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages