Give a name to byte and code point's underlying number #305

annevk · 2020-05-15T06:41:15Z

This would be useful for whatwg/url#518 and also clarifies isomorphic decode/encode a bit.
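
As a rough illustration of why a named "value" helps here: isomorphic decode and encode can be described as mapping each byte to the code point with the same value, and back. The following is a minimal sketch of that idea, not the specification text; the function names are chosen for this example only.

```python
def isomorphic_decode(data: bytes) -> str:
    """Return a string whose code points have the same values as the input bytes."""
    # Iterating over a bytes object yields integers in the range 0 to 255.
    return "".join(chr(byte_value) for byte_value in data)


def isomorphic_encode(text: str) -> bytes:
    """Return bytes whose values equal the code point values of the input string."""
    values = [ord(ch) for ch in text]
    # Only code points up to U+00FF have a byte with the same value.
    if any(value > 0xFF for value in values):
        raise ValueError("input contains a code point greater than U+00FF")
    return bytes(values)
```

With these definitions, `isomorphic_encode(isomorphic_decode(data))` gives back `data` for any byte sequence.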

domenic

Hmm. It's a bit weird not to treat bytes as numbers automatically... But, the prose for corresponding values between code points and bytes that this patch introduces is quite nice and clear now.

How would this impact algorithms like https://encoding.spec.whatwg.org/#gb18030-encoder ? E.g. would

Let byte1 be pointer / (10 × 126 × 10).

become

Let byte1 be the byte whose value is pointer / (10 × 126 × 10).

? That seems kind of annoying.

annevk · 2020-05-15T17:48:32Z

We'd only have to do that in the return value, right? And that already has

Return four bytes whose values

so it would only be a matter of linking bytes and values.

I guess we could also say that a code point is always prefixed with U+... then and not typically.
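
To make the point about the return value concrete, here is a hedged sketch of the four-byte arithmetic being discussed. It assumes the step structure of the gb18030 encoder's four-byte branch (integer division, remainders, and the offsets 0x81 and 0x30); the function names are hypothetical. The intermediate variables are plain numbers, and bytes are only constructed at the very end.

```python
from typing import Tuple


def gb18030_four_byte_values(pointer: int) -> Tuple[int, int, int, int]:
    """Compute the four byte values for a gb18030 ranges pointer."""
    # byte1..byte4 are numbers here, not bytes.
    byte1 = pointer // (10 * 126 * 10)
    pointer = pointer % (10 * 126 * 10)
    byte2 = pointer // (10 * 126)
    pointer = pointer % (10 * 126)
    byte3 = pointer // 10
    byte4 = pointer % 10
    return (byte1 + 0x81, byte2 + 0x30, byte3 + 0x81, byte4 + 0x30)


def gb18030_four_bytes(pointer: int) -> bytes:
    """Return four bytes whose values are the numbers computed above."""
    return bytes(gb18030_four_byte_values(pointer))
```

For example, `gb18030_four_bytes(0)` returns the bytes 0x81 0x30 0x81 0x30.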

domenic · 2020-05-15T19:18:46Z

I see, so the variables named byte1 etc. would not actually be bytes, they'd be byte values? I guess that works.

I guess we could also say that a code point is always prefixed with U+... then and not typically.

I've always found the idea that we allow 0x notation for code points weird. I haven't seen it in action but maybe it's done in MIME Sniff or Encoding.

annevk · 2020-05-16T05:52:18Z

I hadn't looked at Encoding in detail before writing this, but looking at it now, most of it seems to be written assuming this exists. 😊

I'll tidy up the language around code points as well though, to only allow them prefixed.

annevk · 2020-05-16T06:31:59Z

infra.bs



 <h3 id=code-points>Code points</h3>

 <p>A <dfn export lt="code point|character">code point</dfn> is a Unicode code point and is
-represented as a four-to-six digit hexadecimal number, typically prefixed with "U+".
+represented as "U+" followed by four-to-six <a>ASCII upper hex digits</a>, in the range U+0000 to


annevk · 2020-05-16

Is this too circular? Not really sure how else to do this.

aphillips · 2020-05-16

Maybe do it the other way around? Perhaps:

A code point is a Unicode code point, whose value is an integer between 0 and 0x10FFFF inclusive and represented as the string U+ followed by four to six ASCII upper hex digits.

annevk · 2020-05-17

I think that still has the same issue, as ASCII upper hex digits are defined using the U+ convention.

aphillips · 2020-05-17

Avoiding the circularity seems like it would lead to double-defining ASCII upper hex digits and reduce the overall clarity. I don't think people have a hard time with the U+ syntax so much as they have a difficult time with the difference between USVs/code points vs. code units (particularly the UTF-16 variety).

domenic · 2020-05-17

I don't think we can really avoid circularity in Infra without making it significantly more opaque. See previous discussions at #230 (comment)
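
For readers following the thread, the relationship between a code point's value and its "U+" representation can be sketched as below. This is an illustrative example under the wording proposed above (a value between 0 and 0x10FFFF inclusive, written as "U+" followed by four to six ASCII upper hex digits); the helper names are hypothetical.

```python
import re

# "U+" followed by four to six ASCII upper hex digits.
_CODE_POINT_PATTERN = re.compile(r"U\+([0-9A-F]{4,6})")


def format_code_point(value: int) -> str:
    """Return the "U+" representation of a code point value."""
    if not 0 <= value <= 0x10FFFF:
        raise ValueError("code point value out of range")
    # Pad to at least four digits; larger values use five or six.
    return f"U+{value:04X}"


def parse_code_point(text: str) -> int:
    """Return the value of a code point written in "U+" notation."""
    match = _CODE_POINT_PATTERN.fullmatch(text)
    if match is None:
        raise ValueError("not in U+ notation with ASCII upper hex digits")
    value = int(match.group(1), 16)
    if value > 0x10FFFF:
        raise ValueError("code point value out of range")
    return value
```

Note that `parse_code_point("U+00e9")` is rejected in this sketch because the digits are not upper hex, which mirrors the stricter "always prefixed, upper hex" reading discussed in the review.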

Give a name to byte and code point's underlying number

459d58e

This would be useful for whatwg/url#518 and also clarifies isomorphic decode/encode a bit.

annevk mentioned this pull request May 15, 2020

Editorial: make everything use percent-encode sets whatwg/url#518

Merged

domenic reviewed May 15, 2020

domenic approved these changes May 15, 2020

require more precision

16091d4

annevk commented May 16, 2020

align byte a bit more with code point

5aa6822

annevk merged commit 88fa454 into master May 18, 2020

annevk deleted the annevk/byte-code-point-values branch May 18, 2020 05:16

annevk May 16, 2020

aphillips May 16, 2020

annevk May 17, 2020

aphillips May 17, 2020

domenic May 17, 2020
