Clarify Unicode and UTF-8 references #929

eksortso · 2022-10-27T18:20:13Z

This PR applies the changes to clarify Unicode and UTF-8 references, done originally for #924 while considering relaxations on control code in comments. These are standalone changes, separate from the original scope of #924. Many thanks to @abelbraaksma and @ChristianSi for their work on this.

eksortso · 2022-10-27T18:28:40Z

@pradyunsg Here are the isolated Unicode and UTF-8 language changes.

Updates the changelog
Emphasizes the UTF-8 encoding requirement
Consistently uses U+xxxx notation for characters throughout
Corrects misapplications of the language for UTF-8 and Unicode

CHANGELOG.md

abelbraaksma

Looks good! Two minor nits. Also, should we specify a minimum Unicode version? IIRC, UTF-8 was only introduced in version 2. Most public, free, or just plain built-in Unicode support is typically at version 5 (for instance, .NET Framework on Win7) or higher.

For most things, the version wouldn’t matter. Only the introduction of the higher planes (V2) and surrogates (also V2) is relevant, I guess.

Both have been around for ages, so I'm not sure how much it’s worth over-specifying here. Just thinking out loud.
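
To make the point about higher planes and surrogates concrete, here is a minimal Python sketch. The helper name `utf8_length` is hypothetical, made up for illustration and not taken from the TOML spec or any TOML library. It only shows that code points beyond U+FFFF take four bytes in UTF-8, while surrogate code points (U+D800 to U+DFFF) cannot be encoded as well-formed UTF-8 at all.

```python
from typing import Optional


def utf8_length(ch: str) -> Optional[int]:
    """Return the number of UTF-8 bytes for a single code point.

    Returns None when the code point has no well-formed UTF-8 encoding,
    which in Python's strict codec is the case for surrogates.
    (Hypothetical helper, for illustration only.)
    """
    try:
        return len(ch.encode("utf-8"))
    except UnicodeEncodeError:
        # Surrogates U+D800..U+DFFF are not Unicode scalar values.
        return None


print(utf8_length("A"))            # Basic Latin
print(utf8_length("\u00e9"))       # U+00E9, still in the BMP
print(utf8_length("\U0001F44F"))   # U+1F44F, a higher-plane code point
print(utf8_length("\ud800"))       # lone surrogate
```

Any decoder that follows the current Unicode definition of UTF-8 should behave the same way, which is why the exact minimum version rarely matters in practice.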

toml.md

abelbraaksma

The new preamble is very clear, and the rest of the improvements remove ambiguity. Language is hard, especially spec language; I really do appreciate you taking the time to get this right.

TLDR: LGTM!

eksortso · 2022-11-10T19:54:42Z

Couldn't have done it without you guys @abelbraaksma @ChristianSi !

@pradyunsg Here's the Unicode and UTF-8 language cleanup that was moved out of #924 and greatly refined. I think these changes will improve the specification a lot. What do you think?

eksortso · 2023-01-07T22:13:14Z

@pradyunsg Please review these changes and make a decision. Our work coalesced two months ago, and we sent a reminder to review this at that time. What do you think?

pradyunsg

LGTM! This is really nice!

One minor nit-pick again. 😅

toml.md

pradyunsg · 2023-01-10T21:14:12Z

Thanks @eksortso!

abelbraaksma · 2023-01-11T00:49:53Z

Very happy this got approved, considering the discussions that came with it (in the other threads), glad this is in! 👏

ChristianSi · 2023-01-12T12:55:42Z

Last-minute changes are always dangerous. While it's good this was merged, at the last minute (without anyone being able to re-review it) the sentence

Specifically this means that, should a file as a whole not form a well-formed code-unit sequence, the file must be rejected (preferably) or ill-formed byte sequences must be replaced with U+FFFD as per the Unicode specification.

was changed to:

Specifically this means that a file as a whole must form a well-formed code-unit sequence and will be rejected (preferably) or have ill-formed byte sequences replaced with U+FFFD as per the Unicode specification.

Hmm! So now it seems that every TOML file "will be rejected (preferably)" which was probably not the intent!

I'd suggest changing this to:

Specifically this means that a file as a whole must form a well-formed code-unit sequence, otherwise it will be rejected (preferably) or ill-formed byte sequences will be replaced with U+FFFD as per the Unicode specification.

Or just return to the original wording.

@eksortso: Will you open a new PR for that? Or should I?

eksortso · 2023-01-12T16:09:12Z

@ChristianSi I'll open the new PR. Even though @pradyunsg wrote the change, I accepted it thinking that all it was doing was restating the requirement in positive terms. I take responsibility for missing the wrong conjunction and the subsequent change in meaning.

The new wording will be as follows. I kept the first MUST clause in its own sentence. The consequences of violating this clause are specified in the second MUST ("otherwise") sentence. Together, these two sentences are semantically identical to the original wording.

Specifically this means that a file as a whole must form a well-formed code-unit sequence. Otherwise, it must be rejected (preferably), or have ill-formed byte sequences replaced with U+FFFD as per the Unicode specification.
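
As a rough illustration of the two permitted behaviors, the following Python sketch decodes raw file bytes either strictly (rejecting the file) or leniently (substituting U+FFFD). The function name `decode_toml_bytes` and its `lenient` flag are assumptions made up for this example; they are not part of the TOML specification or of any particular parser's API.

```python
def decode_toml_bytes(raw: bytes, *, lenient: bool = False) -> str:
    """Decode the bytes of a TOML document as UTF-8.

    Default (preferred) behavior: reject input that is not a
    well-formed UTF-8 code-unit sequence by raising ValueError.
    With lenient=True: replace ill-formed byte sequences with U+FFFD.
    (Hypothetical helper, for illustration only.)
    """
    try:
        return raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        if not lenient:
            raise ValueError(f"not well-formed UTF-8: {exc}") from exc
        # The "replace" error handler substitutes U+FFFD for bad bytes.
        return raw.decode("utf-8", errors="replace")


good = b'name = "caf\xc3\xa9"'   # well-formed: U+00E9 as two bytes
bad = b'name = "\xff"'           # 0xFF can never appear in UTF-8

print(decode_toml_bytes(good))
print(decode_toml_bytes(bad, lenient=True))
```

Calling `decode_toml_bytes(bad)` without `lenient=True` raises `ValueError`, which corresponds to the preferred "rejected" outcome in the wording above.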

eksortso added 2 commits October 27, 2022 14:15

Clarify Unicode and UTF-8 references

daf5d5a

Merge branch 'main' into unicode-clarifications

794ef78

eksortso marked this pull request as ready for review October 27, 2022 18:21

eksortso mentioned this pull request Oct 27, 2022

Permit more control characters in comments #924

Merged

ChristianSi approved these changes Oct 27, 2022

eksortso mentioned this pull request Oct 27, 2022

TOML 1.1.0 #928

Open

abelbraaksma approved these changes Oct 28, 2022

eksortso added 2 commits October 27, 2022 22:33

Remove text on handling binary-to-text encodings

6f2b362

toml.md: Replace "spec" with better alternatives

8622e00

ChristianSi suggested changes Oct 29, 2022

eksortso added a commit to eksortso/toml that referenced this pull request Nov 7, 2022

Revert Unicode language changes (now in PR toml-lang#929)

a338fb0

eksortso added a commit to eksortso/toml that referenced this pull request Nov 7, 2022

Revert Unicode language changes (now in PR toml-lang#929)

0dade12

eksortso added 4 commits November 7, 2022 17:40

Note that TOML has no low-level binary data syntax

41503a9

Note binary-to-text encoding is not part of TOML

e9bb4d1

Note binary-to-text encoding is not part of TOML

04c6ba4

Merge branch 'unicode-clarifications' of https://github.com/eksortso/…

2598960

…toml into unicode-clarifications

ChristianSi suggested changes Nov 9, 2022

Reformulate warning not to mix Unicode and bytes

42bab11

ChristianSi approved these changes Nov 9, 2022

abelbraaksma approved these changes Nov 10, 2022

eksortso mentioned this pull request Nov 10, 2022

clarify string descriptions #875

Open

pradyunsg approved these changes Jan 10, 2023

eksortso and others added 2 commits January 10, 2023 15:48

Update toml.md: Rephrase UTF-8 req in bullet list

998dd55

Merge branch 'main' into unicode-clarifications

ba10eb4

pradyunsg merged commit 97ae4f9 into toml-lang:main Jan 10, 2023

eksortso deleted the unicode-clarifications branch January 10, 2023 21:35

eksortso mentioned this pull request Jan 12, 2023

Correct requirement of well-formed code-unit sequence #951

Merged
