Please remove header #3
Comments
A simpler scheme would be to pad with
In this scheme, if the last two-byte character would need to encode 14 bits, it is not the last character. You must use at least one bit of padding. So it would be followed by a one-byte character
Here is another idea after reading this comment. Since there are only six illegal characters and three bits are used to encode which one it is, there are two values left to use. One of these values can be used to signify that the last two-byte character is shortened (<= 7 encoded bits), and the remainder of the two-byte character would hold the encoded data. The nice thing about this is that it removes the header, won't add another byte, and requires only a small change to the implementation (it actually shortens the decoder!)
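A minimal sketch of the header-free scheme described in the comment above, written in Python rather than the project's JavaScript. The list of six illegal values, the `110sss1x 10xxxxxx` bit layout, and the names `ILLEGALS`, `SHORTENED`, `encode`, and `decode` are assumptions for illustration and are not taken from this thread; only the idea of spending a spare three-bit value on a "shortened" marker comes from the comment.

```python
# Assumed set of the six illegal 7-bit values: NUL, LF, CR, '"', '&', '\'.
ILLEGALS = [0, 10, 13, 34, 38, 92]
# Hypothetical choice: the spare 3-bit value 0b111 marks a shortened last character.
SHORTENED = 0b111


def chunks7(data: bytes):
    """Yield 7-bit groups of the input; the last group is zero-padded on the right."""
    acc, nbits = 0, 0
    for byte in data:
        acc = (acc << 8) | byte
        nbits += 8
        while nbits >= 7:
            nbits -= 7
            yield (acc >> nbits) & 0x7F
            acc &= (1 << nbits) - 1
    if nbits:
        yield (acc << (7 - nbits)) & 0x7F


def encode(data: bytes) -> bytes:
    """Encode without a header byte; the result is valid UTF-8."""
    groups = list(chunks7(data))
    out = bytearray()
    i = 0
    while i < len(groups):
        group = groups[i]
        if group not in ILLEGALS:
            out.append(group)  # legal value: emit as a one-byte character
            i += 1
            continue
        if i + 1 < len(groups):
            # Illegal value: store its index plus the NEXT 7 bits in two bytes.
            marker, payload = ILLEGALS.index(group), groups[i + 1]
            i += 2
        else:
            # Illegal value with nothing after it: mark the character as shortened
            # and store the illegal value's own 7 bits as the payload.
            marker, payload = SHORTENED, group
            i += 1
        out.append(0b11000010 | (marker << 2) | (payload >> 6))
        out.append(0b10000000 | (payload & 0x3F))
    return bytes(out)


def decode(encoded: bytes) -> bytes:
    """Invert encode(); leftover bits at the end (fewer than 8) are padding."""
    out = bytearray()
    acc, nbits = 0, 0

    def push7(value: int) -> None:
        nonlocal acc, nbits
        acc = (acc << 7) | value
        nbits += 7
        if nbits >= 8:
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
            acc &= (1 << nbits) - 1

    i = 0
    while i < len(encoded):
        first = encoded[i]
        if first < 0x80:
            push7(first)
            i += 1
        else:
            marker = (first >> 2) & 0b111
            if marker != SHORTENED:
                push7(ILLEGALS[marker])
            push7(((first & 1) << 6) | (encoded[i + 1] & 0x3F))
            i += 2
    return bytes(out)
```

Under these assumptions the decoder never needs the input length up front: it consumes characters as they arrive and drops whatever partial byte remains at the end, which is what makes the format usable for streaming.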
Hi @kevinAlbs, great work and congrats on your HN appearance! +1 on getting rid of the header byte, that just ruins any kind of streaming use. There are many ways around this without a header. Probably the simplest is to use one of the two encodings you have left in the With more trickery it would be possible to encode 139 possibilities per encoded byte for an encoding density of 8 bytes per 9 characters (12.5% expansion) instead of 7 bytes per 8 (14.3% expansion). However, encoding will be so much more complex and slow that it's surely not worth it. I'm curious about how you can embed base-122 encoded data in HTML, as unlike JavaScript, that standard seems to forbid control characters.
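The density figures quoted above can be checked with a few lines of arithmetic. The sketch below uses hypothetical helper names (`expansion`, `smallest_base`). It also shows that 139 is the smallest base for which nine digits cover every 64-bit value; whether that is how the number was originally arrived at is an assumption, since the thread does not say.

```python
def expansion(data_bytes: int, encoded_bytes: int) -> float:
    """Relative size increase when data_bytes are stored in encoded_bytes."""
    return encoded_bytes / data_bytes - 1


def smallest_base(digits: int, bits: int) -> int:
    """Smallest base such that `digits` digits can represent any `bits`-bit value."""
    base = 2
    while base ** digits < 2 ** bits:
        base += 1
    return base


# 7 bytes per 8 characters (7 bits per character) versus 8 bytes per 9 characters.
print(f"{expansion(7, 8):.1%}")   # 7 data bytes in 8 encoded bytes
print(f"{expansion(8, 9):.1%}")   # 8 data bytes in 9 encoded bytes
print(smallest_base(8, 56))       # base needed for 7 bytes in 8 digits
print(smallest_base(9, 64))       # base needed for 8 bytes in 9 digits
```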
@GeertBosch Thank you! I just pushed this commit which implements just that. That sounds interesting, how would you achieve 139 possibilities per encoded byte? And that is a good point regarding control characters. I recall looking at this spec and thinking that the only exceptions were the null char and &, but I overlooked this part. The browsers I tested (Edge, IE11, FF, Chrome, Safari) all seem to parse the control characters as expected, but I'd rather not rely on that behavior if it is undefined, even if using this for HTML is experimental. However, I think there is a straightforward way to move these strings to JavaScript without degrading performance (which will also free up the &).
You can change the Besides the considerable amount of logic for excess values, there is also a significant computational overhead to convert 9 base-139 characters into 8 base-256 bytes. This was more as an exercise to look at what the boundaries are of the amount of data one can store in a UTF-8 string, when excluding a small set of byte values (6 in this case) from consideration. Eight bytes of binary data per nine UTF-8 bytes with 6 reserved values is surprisingly close to the theoretical maximum.
@GeertBosch Neat! IIUC this still only occurs when the first digit is > 122, so the additional savings will only occur then?
You'd split your binary input into base-139 digits, so the savings always apply. Even if you encounter only values < 122, you encode log2(139) = 7.12 bits of information per byte. Of course, splitting binary input into base-139 digits (8 bytes become 9 base-139 digits) is yet more work, so not really practical. Your encoding gets quite close (7 bits per byte) and is far more efficient.
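The digit-splitting step mentioned above can be sketched as plain radix conversion: treat each 8-byte block as one 64-bit integer and write it in base 139. The function names are hypothetical, and the sketch stops at the digits themselves. How each digit would then be mapped onto UTF-8 bytes is not described in the surviving text of this thread, so it is left out.

```python
import math


def to_base139(block: bytes) -> list:
    """Split one 8-byte block into 9 base-139 digits, most significant first."""
    if len(block) != 8:
        raise ValueError("expected exactly 8 bytes")
    value = int.from_bytes(block, "big")
    digits = []
    for _ in range(9):
        value, digit = divmod(value, 139)
        digits.append(digit)
    return digits[::-1]


def from_base139(digits) -> bytes:
    """Rebuild the 8-byte block from its 9 base-139 digits."""
    value = 0
    for digit in digits:
        value = value * 139 + digit
    return value.to_bytes(8, "big")


# Information carried by one base-139 digit, in bits.
bits_per_digit = math.log2(139)

block = bytes([0xDE, 0xAD, 0xBE, 0xEF, 0x00, 0x01, 0x02, 0x03])
digits = to_base139(block)
assert from_base139(digits) == block
```

Every digit falls in the range 0 to 138 regardless of the input, which is why the density gain applies to all data and not only to blocks containing large values. The cost is a 64-bit division per digit instead of the simple shifts that 7-bit grouping needs.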
Ah yes I see what you mean. I was thinking in terms of bits, not base-139 digits. So even if the value is < 122, you're encoding the entire base-139 digit in one byte. Very cool! I suppose I should close this issue since we've strayed a bit and it has been resolved.
@GeertBosch I've been trying to figure this out for hours and I just don't get it. Could you kindly explain how you got the number 139 in the first place, and where 112 comes from? Could you give a simple example showing how 8 bytes are encoded into 9 base-139 bytes? I would like to try to implement it if I can understand it.
http://blog.kevinalbs.com/base122#a_minor_note_on_the_last_character
The 1 byte header makes it impossible to encode when the input size is unknown. Please remove the header and put it at the end of the stream, similar to the padding in base64. Otherwise, keep up the good work!