Give a name to byte and code point's underlying number #305

Merged
merged 3 commits into from
May 18, 2020

Member

@annevk annevk commented May 15, 2020

This would be useful for whatwg/url#518 and also clarifies isomorphic decode/encode a bit.
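For readers unfamiliar with the term: isomorphic decode maps each byte to the code point with the same value, and isomorphic encode does the reverse for code points in the range U+0000 to U+00FF. A minimal Python sketch of that correspondence follows; the function names are illustrative and not taken from this patch.

```python
def isomorphic_decode(data: bytes) -> str:
    """Map each byte to the code point whose value equals the byte's value."""
    # Iterating over a bytes object yields integers in the range 0..255.
    return "".join(chr(value) for value in data)


def isomorphic_encode(text: str) -> bytes:
    """Map each code point to the byte whose value equals the code point's value."""
    values = [ord(ch) for ch in text]
    # Only code points up to U+00FF have a byte with the same value.
    if any(value > 0xFF for value in values):
        raise ValueError("code point outside the range U+0000 to U+00FF")
    return bytes(values)


if __name__ == "__main__":
    raw = bytes([0x48, 0x69, 0xE9])
    decoded = isomorphic_decode(raw)
    # Round-tripping returns the original bytes.
    print(isomorphic_encode(decoded) == raw)
```

In Python this matches the behavior of the built-in `latin-1` codec, which is a convenient way to cross-check the sketch.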


Member

@domenic domenic left a comment

Hmm. It's a bit weird not to treat bytes as numbers automatically... But, the prose for corresponding values between code points and bytes that this patch introduces is quite nice and clear now.

How would this impact algorithms like https://encoding.spec.whatwg.org/#gb18030-encoder ? E.g. would

Let byte1 be pointer / (10 × 126 × 10).

become

Let byte1 be the byte whose value is pointer / (10 × 126 × 10).

? That seems kind of annoying.

@annevk
Member Author

annevk commented May 15, 2020

We'd only have to do that in the return value, right? And that already has

Return four bytes whose values

so it would only be a matter of linking bytes and values.

I guess we could also say that a code point is always prefixed with U+... then and not typically.

@domenic
Member

domenic commented May 15, 2020

I see, so the variables named byte1 etc. would not actually be bytes, they'd be byte values? I guess that works.
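To make the distinction concrete, here is a rough Python sketch of the four-byte step being discussed. The intermediate variables are plain integers (byte values), and actual bytes are only produced in the return. Only the first step and the "Return four bytes whose values" wording are quoted in this thread; the remaining steps and offsets are written from memory of the Encoding Standard's gb18030 encoder and should be checked against it. The function name is hypothetical.

```python
def gb18030_four_bytes(pointer: int) -> bytes:
    """Turn a gb18030 ranges pointer into its four-byte sequence."""
    # byte1..byte4 are numbers here, not bytes.
    byte1, pointer = divmod(pointer, 10 * 126 * 10)
    byte2, pointer = divmod(pointer, 10 * 126)
    byte3, byte4 = divmod(pointer, 10)
    # Assumed offsets (recalled from the Encoding Standard): 0x81 and 0x30.
    # This is the only place where values are turned into bytes.
    return bytes([byte1 + 0x81, byte2 + 0x30, byte3 + 0x81, byte4 + 0x30])


if __name__ == "__main__":
    print(gb18030_four_bytes(0).hex(" "))
```

Under this reading, no step other than the return would need the "the byte whose value is" phrasing.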

I guess we could also say that a code point is always prefixed with U+... then and not typically.

I've always found the idea that we allow 0x notation for code points weird. I haven't seen it in action but maybe it's done in MIME Sniff or Encoding.

@annevk
Member Author

annevk commented May 16, 2020

I hadn't looked at Encoding in detail before writing this, but looking at it now, most of it seems to be written assuming this exists. 😊

I'll tidy up the language around code points as well though, to only allow them prefixed.



  <h3 id=code-points>Code points</h3>

  <p>A <dfn export lt="code point|character">code point</dfn> is a Unicode code point and is
- represented as a four-to-six digit hexadecimal number, typically prefixed with "U+".
+ represented as "U+" followed by four-to-six <a>ASCII upper hex digits</a>, in the range U+0000 to
Member Author

Is this too circular? Not really sure how else to do this.

Contributor

Maybe do it the other way around? Perhaps:

A code point is a Unicode code point, whose value is an integer between 0 and 0x10FFFF inclusive and represented as the string U+ followed by four to six ASCII upper hex digits.
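As an illustration of the value/representation split in this suggested wording, here is a small Python sketch that converts between a code point's value (an integer from 0 to 0x10FFFF) and its "U+" representation. The helper names are hypothetical, and the strict parser reflects the "only allow them prefixed" direction discussed above rather than any text in the patch.

```python
import re

MAX_CODE_POINT_VALUE = 0x10FFFF

# "U+" followed by four to six ASCII upper hex digits.
_CODE_POINT_PATTERN = re.compile(r"U\+([0-9A-F]{4,6})")


def format_code_point(value: int) -> str:
    """Return the "U+" representation of a code point's value."""
    if not 0 <= value <= MAX_CODE_POINT_VALUE:
        raise ValueError("value is not in the range 0 to 0x10FFFF")
    # Pad to at least four digits; larger values use five or six.
    return f"U+{value:04X}"


def parse_code_point(text: str) -> int:
    """Return the value of a code point written in "U+" notation."""
    match = _CODE_POINT_PATTERN.fullmatch(text)
    if match is None:
        raise ValueError("expected U+ followed by four to six upper hex digits")
    value = int(match.group(1), 16)
    if value > MAX_CODE_POINT_VALUE:
        raise ValueError("value is not in the range 0 to 0x10FFFF")
    return value


if __name__ == "__main__":
    print(format_code_point(0x41), parse_code_point("U+1F600"))
```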

Member Author

I think that still has the same issue, as ASCII upper hex digits are defined using the U+ convention.

Contributor

Avoiding the circularity seems like it would lead to double-defining ASCII upper hex digits and reduce the overall clarity. I don't think people have a hard time with the U+ syntax so much as they have a difficult time with the difference between USVs/code points vs. code units (particularly the UTF-16 variety).

Member

I don't think we can really avoid circularity in Infra without making it significantly more opaque. See previous discussions at #230 (comment)

@annevk annevk merged commit 88fa454 into master May 18, 2020
@annevk annevk deleted the annevk/byte-code-point-values branch May 18, 2020 05:16