Permit more control characters in comments #924
Conversation
I think I could get on board with this. However, I don't understand why this seems to redefine the concept of a newline. The spec already clearly says: "Newline means LF (0x0A) or CRLF (0x0D 0x0A)." So a comment ends at a newline, whether it's LF or CRLF. That's very clear and non-negotiable. So I don't understand what this language regarding "exclude a final carriage return" is meant to say. Another issue is whether a lone CR (0x0D) should be allowed and ignored in comments. The PR currently does so, but I think it should rather be rejected. Some not fully compliant parsers might treat CR as a linebreak. That would allow smuggling data into a document that a compliant parser ignores as part of a comment. If such CR's are rejected, this risk is reduced, since the compliant parser would then reject the document altogether.
So I would suggest changing the wording to:
The ABNF would need to be adjusted as well.
Also, we should use U+ syntax instead of 0x for codepoints. The way it's phrased now is kinda confusing; U+000A is the newline; this is already allowed in comments, because every comment ends with a newline. In general, I don't really see the point of this though; no one raised any practical concerns over this ("I tried to do X but I can't and the specification limits me") and it just seems to introduce needless churn for implementers, who will need to update their implementations, tests, etc. I would not be in favour of merging this.
@ChristianSi What I was attempting to do was not redefining the concept of newlines. But let me come back to that. I expect comments to ignore everything to the end of the line. For practical purposes, that means starting from a free-standing hash character, scanning ahead until the first LF character (or the EOF) is found, and disposing of everything from the hash up to but not including that LF character. And by everything, I mean everything, including CR characters. Call it a "naive" approach to comment disposal. Your concern about "smuggling data into a document" would be mostly baseless, except for the existence of comment-preserving parsers. Such a parser cannot use the naive approach described previously, because if the TOML document uses CR LF for newlines, the final CR must be excluded from the comment registered by the parser. The last sentence of my change was meant to capture exactly that. Strike the entire paragraph, and replace the first paragraph in the Comment section with the following, which simplifies things further:
I explicitly say "code points" here because that includes all characters, all ASCII control codes, all U+everything, except for U+000A, which is a newline in TOML. It makes no sense to make an exception for U+000D, except to parsers that read all newlines first. And the authors of those parsers should be savvy enough to know what to keep if they're preserving a comment. @arp242 This PR addresses a number of points raised by @abelbraaksma in #567, specifically getting rid of (almost) all restrictions on characters permitted within comments, because we do not want a document rejected just because a control character shows up in a comment. And if you've ever copy/pasted Excel cells into a text widget before, you do come across stray CR characters far too often.
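A minimal Python sketch of that "naive" disposal approach (a hypothetical helper for illustration, not code from this PR, and assuming the caller has already established that the hash is free-standing, i.e. not inside a string):

```python
def strip_comment(line: str) -> str:
    """Drop everything from a free-standing '#' up to, but not including,
    the next LF (or the end of input). CRs and any other control
    characters inside the comment are discarded along with it."""
    hash_pos = line.find('#')
    if hash_pos == -1:
        return line
    lf_pos = line.find('\n', hash_pos)
    if lf_pos == -1:
        return line[:hash_pos]               # comment runs to EOF
    return line[:hash_pos] + line[lf_pos:]   # keep the terminating LF

# A CRLF-terminated line: the CR is swallowed as comment content, the LF survives.
print(repr(strip_comment('a = 1  # comment with a trailing CR\r\n')))  # 'a = 1  \n'
```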
Yes, I read that; but I don't see any practical problems. And both LF and CRLF line endings are already allowed; if the ABNF excludes that then that's a bug in the ABNF.
A CR LF line ending is still permitted, because if you follow the ABNF to the letter, the CR in the newline is considered part of the content of the comment; it's all in the comment rule.
OK, I think I can live with lonely CR's being ignored. And I agree with @arp242 that even the wording
is still a bit confusing – after all, if you add an LF to a comment, you don't produce an error, but simply close the comment. Maybe it would be better to express this as something like:
Another issue that needs to be addressed is Unicode validity. Wikipedia writes:
Explaining the calculation in a footnote:
What to do with these surrogate code points? They are "technically invalid" and have no place in a well-formed UTF-8 string. Hence they are not allowed in TOML strings, for good reasons. But should their inclusion in a comment be tolerated, or should it cause the document to be rejected as invalid? As I argued earlier, it might be reasonable to allow either behavior, stating something like:
There may be a few other code points to which the same reasoning applies, but I'll leave that to the Unicode experts.
@ChristianSi You said:
That makes sense. In fact, let's remove the parentheses. The definition of a newline immediately precedes this section, so it ought to be clear why we mention LF and CRLF. Here's my reworked version.
I weighed the value of keeping the word "valid" in there. I was presuming that the document was already verified as being valid UTF-8, meaning no illegal bytes and no byte streams representing surrogate code points (among others) would be present. If a TOML document is read into a string for processing, then validation is performed before the hash symbol is even processed as the start of a comment. But you are coming from a perspective in which the validity of the document's byte stream is yet to be verified. For languages that live close to the bare metal, that could be the case. But surrogates are always invalid in UTF-8, and if they are present in any text document, then that's a bigger problem than a TOML parser ought to handle. Certainly bigger than isolated CRs from Excel. Invalid byte streams, including surrogate code points, should be rejected outright. Parsers should not tolerate them in comments, as they should not tolerate invalid UTF-8 documents in general. But we can still say "valid" code points, so there's no ambiguity whatsoever. I will make one modification to the ABNF: renaming one of the rules.
As an implementation maintainer I'm slightly opposed to this change, simply because it seems this isn't something that users are asking for. I'm also concerned about allowing lone CR in comments. Assuming that text editors render a lone CR as a line ending, I think it creates an opportunity to craft valid TOML documents that look different to the human eye than to a compliant parser. For example:

```toml
# All line endings in this document are lone CRs
a = "hello"
b = "world"
```

Assuming that all line endings in the document are CRs, the document is still valid. It is a single-line TOML document consisting of one comment, i.e. the content, when read by a compliant parser, is an empty mapping.
The only places where I've seen single CRs as line endings are in text documents on pre-OSX MacOS systems and in text cells in Excel. Since we are not creating the TOML standard for twenty-year-old Macs, uncoupled CRs created by text editors ought to be a rare occurrence. When comments were first defined in TOML, there were no restrictions on control codes in comments. That was changed a few years ago when all control codes except horizontal tabs were excluded. Was this a burden for parser writers to handle when it was introduced? A little, though I'd have to check if they were doing it right. Was it a limitation that users were asking for? It was not, based on what I've read here. Will existing parsers have to relax these limitations if this PR is merged? Well yes, but they would also be making changes to implement other new features in the future TOML v1.1.0. Do I have a stake in this? Just in how users copy whole Excel cells into text widgets and the pain I've experienced in stripping loose CRs out of database columns when they are not properly removed or converted. Naive users won't be flummoxed by CRs in comments any longer. This PR doesn't address single CRs showing up anywhere else though.
I agree lone CRs are a rare occurrence. My concern is someone with malicious intent creating a TOML document that looks different to the human eye compared to how a compliant parser reads it. The status quo, where the compliant parser errors, may be safer than what this PR proposes.
I think @hukkin's concern is valid if editors tend to interpret lone CR's as newlines. What you see should be what you get, so a TOML file that has different linebreaks in a text editor than what a parser sees is not a good idea. So what do text editors do? I saved @hukkin's CR-only example from above as a text file:

```python
with open('test.txt', 'w') as f:
    f.write('# All line endings in this document are lone CRs\ra = "hello"\rb = "world"\r')
```

and opened it in a few editors, with the following results:
You are all invited to make the same experiment with the editors you commonly use or can think of, and report back. For me, I must say that two out of four is too close for comfort. Gedit's behavior is especially problematic, and since it's the default GUI text editor in Ubuntu (the one that opens when you double-click a text file), I assume it's pretty widely used. Therefore I return to my earlier position that lone CRs should remain forbidden in comments, even if all other control characters are allowed.
@ChristianSi I tried basically the same thing in several Windows 10 text editors. I altered your code slightly, writing the same file in binary mode:

```python
with open('testCR.txt', 'wb') as f:
    f.write(b'# All line endings in this document are lone CRs\ra = "hello"\rb = "world"\r')
```

When I opened them, these were the results:
I also ran the file through a few other tools; the mixed behaviors are discomforting. So despite my misgivings expressed previously, I'm changing my mind. Lone CRs ought to be forbidden in comments. But @hukkin's position is to retain the status quo: keep all control codes except tab out of comments. Let me run one more acid test. What happens when the DEL control code, U+007F, is present?

```python
with open('testDEL.txt', 'w') as f:
    f.write("""\
# All lines in this document end with a DEL character.\x7f
a = "hello"\x7f
b = "world"\x7f
""")
```

Notepad and Notepad2 show the DEL character as an unknown character, with a rectangular placeholder. VS Code and Notepad++ show the character as a "DEL" symbol, the former even displaying the symbol in red. micro shows them as whitespace characters. @ChristianSi @abelbraaksma You may want to test how these appear in *nix environment text editors, and how they appear on the command line. The similarities in DEL's appearances, though less glaring, still suggest that allowing other control characters in comments may still be sensible. But I am starting to think that, per @hukkin, keeping the status quo is the best route for us to take. I'm not about to close this PR though; in fact, I will disallow CR characters and replace the "0x"s with "U+00"s. @abelbraaksma I'm beginning to consider closing this PR, so if you want to continue making your case, now's the time to do it. As for anyone else with a stake in this, including @pradyunsg @BurntSushi @mojombo, what are your thoughts?
I haven't really been following this discourse, but FWIW my position is this:
@eksortso It's been a while since I wrote my original opinion. But from your comments above, I think we have basically two things to consider:
### Thoughts on line endings

I think the first point above belongs to editors. However, if you compare it to some W3C standards, they are liberal in what they accept, but require normalization. That is, they allow more line-ending forms than just LF and CRLF on input. However, the algorithm is ridiculously simple:
Line endings are then normalized, and what we call a newline is just LF from that point on. In practice, this would mean removing the following from the ABNF:
and making newlines implicit (i.e., part of the prose, not the ABNF). Or, conversely, since alternatives in parsing are ordered, this could work, I think (which, btw, already suggests that the current code is actually incorrect?):
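Setting the ABNF details aside, here is a small Python sketch of the normalization idea (my illustration, assuming the W3C-style rule of folding both CRLF and any lone CR into LF before further parsing; TOML itself only defines LF and CRLF):

```python
import re

def normalize_newlines(text: str) -> str:
    """Fold CRLF first, then any remaining lone CR, into LF."""
    return re.sub(r'\r\n?', '\n', text)

# Mixed line endings all collapse to LF before the grammar ever sees them.
assert normalize_newlines('a\r\nb\rc\n') == 'a\nb\nc\n'
```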
### Thoughts on control characters

Currently, it is clear to me that comments are overly restrictive (i.e., control characters are absent), but also too lenient (illegal Unicode characters are allowed). The range, as it currently stands, is:
If we expand this, we get:
Let's break this down:
### Conclusion?

Sorry for the long post (it didn't set out that way!). Basically, we have this situation:
My vote: let's take the union of these two and simplify parsing. A comment can just contain any Unicode character, with just two exceptions: line endings (including
Or alternatively (that is, allowing
Meanwhile, I do think we should correct the newline handling. I realize this may be a little too lenient for some, but I see no harm in doing so. Any existing correct TOML file out there still parses just fine, and once this is adopted, parsers won't have to treat comments any differently than "ignore from # to end of line".
Hmm, I don't think we should allow lone CR's as linebreaks – quite simply, since we had discussed that in the past and decided against it. I'm not sure where that was, but anyone willing to search should be able to find it. And just reversing one's position every few years, without really new arguments having come up, is not a good thing. More specifically, there are currently no TOML files that use CR's as linebreaks. And since this linebreak convention isn't used anywhere anymore, except maybe on some computers that have long ago ended up in museums, there is just no reason why they should ever come into existence. As for @abelbraaksma's suggestions on the ABNF:
@ChristianSi thanks for clarifying. I'm not aware of the old decisions, but I wholeheartedly agree with staying consistent. I have seen such files in the wild, usually as a result of bad switching between platforms or editors mistreating LFs (try opening such files in Notepad, then manually adding newlines on top of those). But bad editing is no reason for allowing such things, I agree. And the times that DNX had LFCR (as opposed to CRLF) are indeed far in the past.
I'll use @abelbraaksma's first suggestion. The matter of normalizing line endings (just LF and CRLF in TOML's case) is appealing to me, especially since we're going to someday form an RFC based on the TOML spec, per #870. But seeing as text editors can unintentionally mix line endings, I won't push it.
A few more notes. I do not like surrogate code points being encoded in UTF-8, because it is in fact illegal to have such byte streams in UTF-8. And in fact, we banned surrogates from TOML documents in #619 three years ago. But to simplify comment scanning while disallowing the historic line-breaking ASCII codes (please help: what's a better name for the collected set of codes CR, FF, VT, and LF?), I'll use a new ABNF rule. Expect a little rearrangement.
And now that I think about it, I ought to strip out the word "valid" in "All valid code points" since we're allowing invalid code points in comments now. But somebody give me a better description of what %x0A through %x0D should be called collectively. |
Fix newline description Co-authored-by: Taneli Hukkinen <[email protected]>
I wrote this about surrogate codepoints:
### What does this mean for us?

Not much, really. TOML currently requires encoding as UTF-8. What this means is that it is impossible to encounter surrogate characters. If you do encounter them, you'll encounter them prior to reading them as UTF-8. By the time your stream turns from a byte stream into a proper UTF-8 stream, those (illegal) surrogate pairs are no longer there; they must have been rejected or replaced (typically with U+FFFD) by the decoder.

### So, what does this really mean for us?

We should not treat surrogate characters as special. They don't exist. The only place where we should disallow them is in escape sequences. I.e., you should not be allowed to write an escape sequence that denotes a surrogate code point. If we still have remnants of ABNF dealing with surrogates, we should really (probably?) remove them. Or we should remove the requirement for TOML being valid UTF-8, in which case it makes sense to embed the Unicode rules in the spec. For reference, this FAQ summarizes the above as well: UTF-8 FAQ
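As a quick sanity check of that claim (plain Python, nothing TOML-specific): bytes that encode a surrogate code point do not survive strict UTF-8 decoding, so a parser working on decoded text never sees one.

```python
# ED A0 80 is the UTF-8-style byte pattern for the surrogate U+D800;
# a strict UTF-8 decoder rejects it outright.
surrogate_bytes = b"\xed\xa0\x80"

try:
    surrogate_bytes.decode("utf-8")
except UnicodeDecodeError as exc:
    print("rejected:", exc)

# A lenient decoder substitutes replacement characters (U+FFFD) instead.
print(surrogate_bytes.decode("utf-8", errors="replace"))
```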
@pradyunsg Could we split the changes into two separate PRs? The Unicode language changed a lot; perhaps we could isolate those changes and address them separately? UPDATE: Already isolated those changes. See #929. If that's merged, then this PR will be greatly simplified.
Force-pushed from a338fb0 to 0dade12.
I stripped this PR back down to its essentials, that is, the re-allowance of most control characters inside comments. Anything pertaining to Unicode language changes was already moved over to #929, which can be reviewed separately. @ChristianSi @abelbraaksma Want to take another look? @pradyunsg Can you review this PR again?
LGTM!
Looks good! Let's finish this! :)
@pradyunsg This is a lot simpler than it was the last time you looked at it. We're allowing all but five (UTF-8-valid) code points in comments. Is this good to merge?
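For reference, my reading of "all but five" (an inference from the discussion above, not a quote from the spec text) is that the excluded code points are U+0000 plus the four line-breaking controls U+000A through U+000D. A throwaway Python check of that reading:

```python
# Assumed exclusion set, inferred from the thread: NUL plus LF, VT, FF and CR.
# Surrogates are ruled out separately by the requirement that the document be valid UTF-8.
EXCLUDED = {0x00, 0x0A, 0x0B, 0x0C, 0x0D}

def allowed_in_comment(ch: str) -> bool:
    return ord(ch) not in EXCLUDED

assert allowed_in_comment("\t")       # tab stays allowed
assert allowed_in_comment("\x7f")     # DEL would now be allowed
assert not allowed_in_comment("\r")   # a lone CR is still rejected
```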
@pradyunsg Could you please review this and make a decision? It's been two months since the last reminder.
LGTM, with a minor nit-pick. Thanks for a very extensive discussion here folks. ^>^
Sure thing. I didn't know this was a style guideline, but I know now. Co-authored-by: Pradyun Gedam <[email protected]>
Thanks @eksortso! ^.^
Well done, great work!
This reverts commit ab74958.

I'm a simple guy. Someone reports a problem, I drink coffee and fix it. No one reports a problem? There is nothing to fix and I go drink beer. No one really reported this as a problem, but it *does* introduce needless churn for all TOML implementations and the test suite.

Do we need to forbid *anything* in comments? Probably not, and in strings we probably only need to forbid \x00. But at least before it was consistent with strings, and more importantly, with what everyone wrote code for, which is tested, and already works.

[None of the hypotheticals](toml-lang#567 (comment)) on why this is "needed" are practical issues people reported, and most aren't even fixed: a comment can still invalidate the file, you must still parse each character in a comment as some are still forbidden, the performance benefits are so close to zero they might as well be zero, and you still can't "dump whatever you like" in comments. So it doesn't *actually* change anything; it just changes "disallow this set of control characters" to ... "disallow this set of control characters" (but for a different set). That's not really a substantial or meaningful change.

The only (minor) real-world issue that was reported (from the person doing the Java implementation) was that "it's substantially more complicated to parse out control characters in comments and raise an error, and this kind of strictness provides no real advantage to users". And that's not addressed at all with this, so...

---

And while I'm at it, let me have a complaint about how this was merged:

1. Two people, both of whom actually maintain implementations, say they don't like this change.
2. This is basically ignored.
3. Three people continue writing a fairly large number of large comments, so anyone who wasn't already interested in this change unsubscribes and/or goes 🤷
4. "Consensus".

Sometimes I feel TOML attracts people who like to argue things from a mile-high ivory tower, with abstract arguments that have only passing familiarity with any actual pragmatic reality.

Fixes toml-lang#995
In reconsideration of the limitation of control codes within comments discussed in #567, a case was made to simplify comment parsing. This PR intends to replace those previously imposed restrictions with a simplified approach that permits anything except newlines.