Give a name to byte and code point's underlying number #305
This would be useful for whatwg/url#518 and also clarifies isomorphic decode/encode a bit.
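As a rough illustration of why naming the underlying number helps, isomorphic decode and encode can be read as "map each byte to the code point with the same value" and back again. The sketch below is not spec text; the function names are hypothetical, and Python's `bytes` and `str` stand in for byte sequences and strings.

```python
def isomorphic_decode(data: bytes) -> str:
    # Each byte becomes the code point whose value equals the byte's value.
    return "".join(chr(byte_value) for byte_value in data)


def isomorphic_encode(text: str) -> bytes:
    # Each code point becomes the byte whose value equals the code point's value.
    # This only works for code points up to U+00FF, so anything larger is rejected.
    values = []
    for ch in text:
        value = ord(ch)
        if value > 0xFF:
            raise ValueError(f"code point U+{value:04X} has no corresponding byte")
        values.append(value)
    return bytes(values)
```

For inputs restricted to U+0000 through U+00FF the two functions round-trip, which is the property the "same value" wording is meant to make obvious.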
Hmm. It's a bit weird not to treat bytes as numbers automatically... But, the prose for corresponding values between code points and bytes that this patch introduces is quite nice and clear now.
How would this impact algorithms like https://encoding.spec.whatwg.org/#gb18030-encoder ? E.g. would
Let byte1 be pointer / (10 × 126 × 10).
become
Let byte1 be the byte whose value is pointer / (10 × 126 × 10).
? That seems kind of annoying.
We'd only have to do that in the return value, right? And that already has
so it would only be a matter of linking bytes and values. I guess we could then also say that a code point is always prefixed with U+..., and not just typically.
I see, so the variables named byte1 etc. would not actually be bytes, they'd be byte values? I guess that works.
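To make the distinction concrete, here is a minimal sketch of the four-byte step under discussion, with the intermediate variables kept as plain integers (byte values) and bytes produced only in the return value. The function name is hypothetical, and the 0x81 and 0x30 offsets are written from memory of the Encoding Standard's gb18030 encoder, so treat them as an assumption rather than a quotation.

```python
def gb18030_four_bytes(pointer: int) -> bytes:
    # byte1..byte4 are plain numbers here, not bytes.
    byte1 = pointer // (10 * 126 * 10)
    pointer %= 10 * 126 * 10
    byte2 = pointer // (10 * 126)
    pointer %= 10 * 126
    byte3 = pointer // 10
    byte4 = pointer % 10
    # Only the return step turns values into bytes:
    # "four bytes whose values are byte1 + 0x81, byte2 + 0x30, ..."
    # (offsets assumed from the Encoding Standard, not from this thread).
    return bytes([byte1 + 0x81, byte2 + 0x30, byte3 + 0x81, byte4 + 0x30])
```

With this reading, the arithmetic steps stay as they are and only the final step needs the "bytes whose values are" wording.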
I've always found the idea that we allow 0x notation for code points weird. I haven't seen it in action but maybe it's done in MIME Sniff or Encoding.
I hadn't looked at Encoding in detail before writing this, but looking at it now, most of it seems to be written assuming this exists. 😊 I'll tidy up the language around code points as well though, to only allow them prefixed.
  <h3 id=code-points>Code points</h3>

  <p>A <dfn export lt="code point|character">code point</dfn> is a Unicode code point and is
- represented as a four-to-six digit hexadecimal number, typically prefixed with "U+".
+ represented as "U+" followed by four-to-six <a>ASCII upper hex digits</a>, in the range U+0000 to
Is this too circular? Not really sure how else to do this.
Maybe do it the other way around? Perhaps:
A code point is a Unicode code point, whose value is an integer between 0 and 0x10FFFF inclusive and represented as the string U+ followed by four to six ASCII upper hex digits.
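A small sketch of that suggested wording, assuming a hypothetical helper name: the value is an integer in the stated range, and the representation is derived from it by zero-padding to at least four upper hex digits.

```python
def format_code_point(value: int) -> str:
    # The value must be an integer between 0 and 0x10FFFF inclusive.
    if not 0 <= value <= 0x10FFFF:
        raise ValueError("not a code point value")
    # "U+" followed by four to six ASCII upper hex digits:
    # pad to four digits; the range check caps the length at six.
    return f"U+{value:04X}"
```

For example, `format_code_point(0x41)` gives `"U+0041"` and `format_code_point(0x10FFFF)` gives `"U+10FFFF"`.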
I think that still has the same issue, as ASCII upper hex digits are defined using the U+ convention.
Avoiding the circularity seems like it would lead to double-defining ASCII upper hex digits and reduce the overall clarity. I don't think people have a hard time with the U+ syntax so much as they have a difficult time with the difference between USVs/code points vs. code units (particularly the UTF-16 variety).
I don't think we can really avoid circularity in Infra without making it significantly more opaque. See previous discussions at #230 (comment)