Possible IETF (RFC) standardization document #8

Open
DonaldTsang opened this issue Jun 15, 2019 · 8 comments
Comments

@DonaldTsang

DonaldTsang commented Jun 15, 2019

https://github.com/json5/json5 is currently proposing to add base64 support to its superset of JSON.
They would like to develop an RFC, and I would like to discuss safe encoding with them.
Is it possible to have an RFC proposal to accompany this repo? Or is that unnecessary?
json5/json5#190

@kstenerud
Owner

Sure thing. Standardization is always good :)

@d3x0r

d3x0r commented Jun 16, 2019

What would base64 represent, and how?

@kstenerud
Owner

kstenerud commented Jun 16, 2019

It would represent any binary blob of data.

The differences between this and other base64 representations are:

  1. The alphabet uses standard alphanumeric characters (a-zA-Z0-9), and uses - and _ as the extra characters, which are safe in every known major text format, and for names in all modern filesystems.

  2. The alphabet uses the same ordering as the UTF-8/ASCII representations of the characters, so the encoded data sorts in the same order as the decoded data.

  3. Whitespace is supported at any point in the encoded data.

  4. There is no padding, because it isn't needed (resulting in a smaller encoded size).

  5. This spec includes a variant with a length header for use when there's no clear delimiter present (this won't be the case in JSON, so it doesn't matter to the JSON spec).
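
The first four points can be sketched in a few lines of Python. This is only an illustration under assumptions, not the safe64 reference implementation: the alphabet is taken to be the 64 characters named in point 1 in ASCII order, and the bit packing is assumed to be plain big-endian 6-bit groups with the final partial group zero-filled. The names `ALPHABET`, `encode`, and `decode` are made up for this sketch.

```python
# Assumed alphabet: '-', digits, uppercase, '_', lowercase, in ASCII order.
ALPHABET = "-0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz"
DECODE = {ch: i for i, ch in enumerate(ALPHABET)}


def encode(data: bytes) -> str:
    """Pack bytes into 6-bit groups, most significant bits first, no padding."""
    out = []
    acc = 0   # bit accumulator
    bits = 0  # number of valid bits currently in acc
    for byte in data:
        acc = (acc << 8) | byte
        bits += 8
        while bits >= 6:
            bits -= 6
            out.append(ALPHABET[(acc >> bits) & 0x3F])
        acc &= (1 << bits) - 1
    if bits:
        # Final partial group: left-align the remaining bits, zero-fill the rest.
        out.append(ALPHABET[(acc << (6 - bits)) & 0x3F])
    return "".join(out)


def decode(text: str) -> bytes:
    """Reverse of encode; whitespace is skipped wherever it appears."""
    out = bytearray()
    acc = 0
    bits = 0
    for ch in text:
        if ch.isspace():
            continue
        acc = (acc << 6) | DECODE[ch]
        bits += 6
        if bits >= 8:
            bits -= 8
            out.append((acc >> bits) & 0xFF)
            acc &= (1 << bits) - 1
    return bytes(out)  # leftover zero-fill bits are dropped
```

Because the alphabet is in ASCII order and the packing is big-endian, sorting a list of encoded strings should give the same order as sorting the raw byte strings, and `decode` accepts input such as `"z k\n"` with whitespace in the middle.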

BTW, you may also want to encourage adoption of the safe85 spec, since it's also safe for all modern text formats, and encodes data to a smaller size.

They may also find some inspiration from https://github.com/kstenerud/concise-encoding/blob/master/cte-specification.md

@d3x0r

d3x0r commented Jun 16, 2019

Many of the differences you list are not differences; they are notable points, I suppose.

So your order is '-', '0-9', 'A-Z', '_', 'a-z'... which no other base64 encoder resembles.

And ascii85 puts all symbols at the end, which defeats same-sort-order.

... I was going to mention my decoder supports all the combinations of these...

62   63   usage
+    /    Base64 encoding (first listed on Wikipedia)
$    _    what I use; JS identifier compatible (unlike '-')
.    ,    '.' for filenames; ',' from the Base64 encoding for IMAP mailbox names
-         part of the URL-safe name set (the '_' is listed above)

But then none of that really applies; since the whole map would have to change.

Safe85...
Looks like a lot of math for 6% savings (1.33:1.25). If the packet is long enough to benefit from that, it could also just be gzipped.
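
For reference, the 6% figure follows from the two expansion ratios: base64 emits 4 characters per 3 bytes and a base-85 encoding emits 5 characters per 4 bytes. A quick check of the arithmetic (variable names are only for illustration):

```python
from fractions import Fraction

base64_ratio = Fraction(4, 3)  # 4 output chars per 3 input bytes, about 1.33
base85_ratio = Fraction(5, 4)  # 5 output chars per 4 input bytes, exactly 1.25

# Relative size reduction when switching from base64 to base85 output.
saving = 1 - base85_ratio / base64_ratio
print(saving, float(saving))  # 1/16 0.0625
```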

(Weak argument, just something I leveraged.) Base64 character pairs can be used in a lookup table of 4096 entries, which is a nice round number... but can also be used directly as a wide hash index.
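
A rough sketch of the pair-table idea, assuming the standard `A-Za-z0-9+/` alphabet and input whose length is a multiple of four with no padding. `PAIR_TABLE` and `decode_pairs` are hypothetical names, not taken from any decoder mentioned in this thread:

```python
import string

ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

# 64 * 64 = 4096 entries: every two-character pair maps to its 12-bit value.
PAIR_TABLE = {
    a + b: (i << 6) | j
    for i, a in enumerate(ALPHABET)
    for j, b in enumerate(ALPHABET)
}


def decode_pairs(text: str) -> bytes:
    """Decode unpadded base64 (length a multiple of 4), two chars per lookup."""
    out = bytearray()
    for pos in range(0, len(text), 4):
        hi = PAIR_TABLE[text[pos:pos + 2]]
        lo = PAIR_TABLE[text[pos + 2:pos + 4]]
        out += ((hi << 12) | lo).to_bytes(3, "big")  # 24 bits -> 3 bytes
    return bytes(out)
```

Each group of four characters costs two table lookups instead of four, and the same 12-bit values could serve as bucket indexes for a hash.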

I don't see how you can fit 'whitespace anywhere' with 'no padding'... I suppose you're embedding it in a string? So, I wouldn't know if it was a string or binary?

@kstenerud
Owner

If you have delimiters already, you don't need padding. If there are no delimiters, you need the length prefix variant (which guarantees truncation detection, something padding can't do).
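
To illustrate the truncation point: the sketch below uses a made-up framing (decimal byte count, a colon, then unpadded URL-safe base64 from Python's standard `base64` module). This is not the length header defined by the spec; `frame` and `unframe` are hypothetical helpers that only show why a declared length catches truncation where padding alone may not.

```python
import base64


def frame(payload: bytes) -> str:
    """Hypothetical framing: '<byte count>:<unpadded url-safe base64>'."""
    body = base64.urlsafe_b64encode(payload).decode("ascii").rstrip("=")
    return f"{len(payload)}:{body}"


def unframe(text: str) -> bytes:
    """Decode a framed value and reject it if the byte count does not match."""
    length_text, _, body = text.partition(":")
    expected = int(length_text)
    padded = body + "=" * (-len(body) % 4)  # restore padding for the stdlib decoder
    payload = base64.urlsafe_b64decode(padded)
    if len(payload) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(payload)}")
    return payload


# Padding alone: a stream cut at a 4-character boundary still decodes cleanly.
padded_form = base64.b64encode(b"hello world").decode("ascii")  # 'aGVsbG8gd29ybGQ='
print(base64.b64decode(padded_form[:8]))  # b'hello '  (truncated, but no error)

# Length prefix: the same cut is detected.
framed = frame(b"hello world")
try:
    unframe(framed[:11])  # '11:aGVsbG8g'
except ValueError as err:
    print("truncation detected:", err)
```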

Anything higher than base64 will require more processing power due to the math. But then again, once you get to high enough throughput requirements, you'd be better off going for a binary format like https://github.com/kstenerud/concise-encoding/blob/master/cbe-specification.md which doesn't need any of this trickery.

It comes down to how much weight is given to human readability on the wire vs processing cost vs bandwidth cost. Everything is a compromise.

@d3x0r

d3x0r commented Jun 16, 2019

Re CBE: that puts a lot of magic numbers into the encoding, and you might as well use something like protobufs or BSON :)
Encoding bytes into UTF-8 code points is effectively 1.5, and at best (if you supported extended 42 bit encoding) it could approach 1.33, which is where base64 starts...
I hadn't really considered a base like 85 (which is 5 * 17) before, because it seemed the wasted space would be more than the saved space, leaving no net gain... but 1.25 is compelling, and there's certainly the ability to do long math mods and divs (which would probably cost less than an extra pass of gzip).

@kstenerud
Owner

Magic numbers are important; that's how you differentiate the data types efficiently. Protobufs solves a different problem than BSON/JSON/CBE. It doesn't include type data in the encoding, which means that you can only decode if you have an exact copy of the schema. And BSON is too bulky and wasteful, unfortunately.

The UTF-8 codepoint based encodings are designed to get around the Twitter character-length limitation. They're not actually smaller. They can't get more efficient in byte length than the byte-oriented encodings like base64, base85, and base90.
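
A small illustration of that trade-off, using a made-up packing rather than any published encoding: `pack15` is a hypothetical helper that stores 15 bits per code point starting at U+4E00, a range where every code point takes 3 bytes in UTF-8.

```python
import base64


def pack15(data: bytes) -> str:
    """Hypothetical packing: 15 bits per code point, offset from U+4E00."""
    bits = int.from_bytes(data, "big")
    nbits = len(data) * 8
    pad = -nbits % 15          # zero-fill up to a multiple of 15 bits
    bits <<= pad
    nbits += pad
    return "".join(
        chr(0x4E00 + ((bits >> shift) & 0x7FFF))
        for shift in range(nbits - 15, -1, -15)
    )


payload = bytes(range(15))            # 15 bytes = 120 bits
packed = pack15(payload)
b64 = base64.b64encode(payload)

print(len(packed), len(b64))                   # 8 20  (characters)
print(len(packed.encode("utf-8")), len(b64))   # 24 20 (bytes on the wire)
```

The code-point packing wins on character count (8 vs 20) but loses on byte count (24 vs 20), which is the distinction being drawn here.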

@d3x0r

d3x0r commented Jun 16, 2019

exact integer size is fairly irrelevant, subject to implementation by the receiving platform/interpreter/environment...

int, float, are about the only two categories. These are easily capturable in [0-9.-E+] and themselves; add [:TZ] and you have distinguishable dates (A format of data that is often usable as a type itself).
identifiable.
Strings are easy to denote - ""
objects and arrays of other values {} []
and well you get the idea...
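
As a rough sketch of that idea, a token's category can be guessed from its characters alone. The patterns below are assumptions made for illustration, not the JSOX grammar, and `classify` is a hypothetical helper:

```python
import re

# Assumed token shapes; a real grammar would be stricter and more complete.
NUMBER = re.compile(r"[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?")
DATE = re.compile(
    r"\d{4}-\d{2}-\d{2}"
    r"(T\d{2}:\d{2}(:\d{2}(\.\d+)?)?(Z|[+-]\d{2}:\d{2})?)?"
)


def classify(token: str) -> str:
    """Guess a value's category from its text alone."""
    if len(token) >= 2 and token.startswith('"') and token.endswith('"'):
        return "string"
    if token.startswith("{"):
        return "object"
    if token.startswith("["):
        return "array"
    if NUMBER.fullmatch(token):
        return "float" if any(c in token for c in ".eE") else "int"
    if DATE.fullmatch(token):
        return "date"
    return "unknown"


for token in ["42", "-1.5e3", "2019-06-16T12:00:00Z", '"hi"', "[1, 2]"]:
    print(token, "->", classify(token))
```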

and yes, [ and { are just 'magic numbers' but they don't require an accompanying document, but can instead be interpreted using common programming knowledge.

(All sorts of distinguishable data, without 'magic numbers'... err, I lie: 'ab' is a magic number for array buffer, 'u8', 'i8', ...'f32' etc... but that's part of a higher level than the syntax.)
JSOX Value BNF

though really this is all divergent from 'binary data transport across text transports'.

3 participants