Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarify Unicode and UTF-8 references #929

Merged
merged 11 commits into from
Jan 10, 2023

Conversation

eksortso
Copy link
Contributor

This PR applies the changes to clarify Unicode and UTF-8 references, done originally for #924 while considering relaxations on control code in comments. These are standalone changes, separate from the original scope of #924. Many thanks to @abelbraaksma and @ChristianSi for their work on this.

@eksortso eksortso marked this pull request as ready for review October 27, 2022 18:21
@eksortso
Copy link
Contributor Author

@pradyunsg Here are the isolated Unicode and UTF-8 language changes.

  • Updates the changelog
  • Emphasizes the UTF-8 encoding requirement
  • Consistently uses U+xxxx notation for characters throughout
  • Corrects misapplications of the language for UTF-8 and Unicode

CHANGELOG.md Show resolved Hide resolved
@eksortso eksortso mentioned this pull request Oct 27, 2022
Copy link
Contributor

@abelbraaksma abelbraaksma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Two minor nits. Also, should we specify a minimum Unicode version? Iirc, UTF-8 was only introduced in version 2. Most public, free or just plain built in Unicode support is typically at 5 (for instance .NET Framework on Win7) or higher.

For most things, version wouldn’t matter. Just the introduction to the higher planes (V2) and surrogates (also V2) are relevant, I guess.

Both have been around for ages, not sure how much it’s worth over specifying here. Just thinking out loud.

toml.md Outdated Show resolved Hide resolved
toml.md Outdated Show resolved Hide resolved
toml.md Outdated Show resolved Hide resolved
eksortso added a commit to eksortso/toml that referenced this pull request Nov 7, 2022
eksortso added a commit to eksortso/toml that referenced this pull request Nov 7, 2022
toml.md Outdated Show resolved Hide resolved
Copy link
Contributor

@abelbraaksma abelbraaksma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new preamble is very clear, and the rest of the improvements remove ambiguity. Language is hard, esp spec language, I really do appreciate you taking the time to get this right.

TLDR: LGTM!

@eksortso
Copy link
Contributor Author

Couldn't have done it without you guys @abelbraaksma @ChristianSi !

@pradyunsg Here's the Unicode and UTF-8 language cleanup that was moved out of #924 and greatly refined. Think these changes will improve the specification a lot. What do you think?

@eksortso
Copy link
Contributor Author

eksortso commented Jan 7, 2023

@pradyunsg Please review these changes and make a decision. Our work coalesced two months ago, and we sent a reminder to review this at that time. What do you think?

Copy link
Member

@pradyunsg pradyunsg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! This is really nice!

One minor nit-pick again. 😅

toml.md Outdated Show resolved Hide resolved
toml.md Show resolved Hide resolved
@pradyunsg pradyunsg merged commit 97ae4f9 into toml-lang:main Jan 10, 2023
@pradyunsg
Copy link
Member

Thanks @eksortso!

@eksortso eksortso deleted the unicode-clarifications branch January 10, 2023 21:35
@abelbraaksma
Copy link
Contributor

abelbraaksma commented Jan 11, 2023

Very happy this got approved, considering the discussions that came with it (in the other threads), glad this is in! 👏

@ChristianSi
Copy link
Contributor

Last-minute changes are always dangerous. While it's good this was merged, in the last minute (without anyone being able to re-review it) the sentence

Specifically this means that, should a file as a whole not form a well-formed code-unit sequence, the file must be rejected (preferably) or ill-formed byte sequences must be replaced with U+FFFD as per the Unicode specification.

was changed to:

Specifically this means that a file as a whole must form a well-formed code-unit sequence and will be rejected (preferably) or have ill-formed byte sequences replaced with U+FFFD as per the Unicode specification.

Hmm! So now it seems that every TOML file "will be rejected (preferably)" which was probably not the intent!

I'd suggest changing this to:

Specifically this means that a file as a whole must form a well-formed code-unit sequence, otherwise it be rejected (preferably) or ill-formed byte sequences will be replaced with U+FFFD as per the Unicode specification.

Or just return to the original wording.

@eksortso: Will you open a new PR for that? Or should I?

@eksortso
Copy link
Contributor Author

eksortso commented Jan 12, 2023

@ChristianSi I'll open the new PR. Even though @pradyunsg wrote the change, I accepted it thinking that all it was doing was restating the requirement in positive terms. I take responsibility for missing the wrong conjunction and the subsequent change in meaning.

The new wording will be as follows. I kept the first MUST clause in its own sentence. The consequences of violating this clause are specified in the second MUST ("otherwise") sentence. Together, these two sentences are semantically identical to the original wording.

Specifically this means that a file as a whole must form a well-formed code-unit sequence. Otherwise, it must be rejected (preferably), or have ill-formed byte sequences replaced with U+FFFD as per the Unicode specification.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants