Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Missing escape of char literal in interpolation #861

Closed
JohelEGP opened this issue Dec 1, 2023 · 27 comments
Closed

[BUG] Missing escape of char literal in interpolation #861

JohelEGP opened this issue Dec 1, 2023 · 27 comments
Labels
bug Something isn't working

Comments

@JohelEGP
Copy link
Contributor

JohelEGP commented Dec 1, 2023

Title: Missing escape of char literal in UFCS in interpolation.

Minimal reproducer (https://cpp2.godbolt.org/z/r1f73EP85):

#include <ranges>
main: () = {
  using namespace std::views;
  s := "".std::string();
  _ = "((_ = s.split('\n'), 0))$";
}
Commands:
cppfront main.cpp2
clang++18 -std=c++23 -stdlib=libc++ -lc++abi -pedantic-errors -Wall -Wextra -Wconversion -Werror=unused-result -Werror=unused-value -Werror=unused-parameter -I . main.cpp

Expected result: static_cast<void>(cpp2::to_string((static_cast<void>(CPP2_UFCS(split, std::move(s), '\n')), 0)));.

Actual result and error: static_cast<void>(cpp2::to_string((static_cast<void>(CPP2_UFCS(split, std::move(s), 'n')), 0)));.

Cpp2 lowered to Cpp1:
//=== Cpp2 type declarations ====================================================


#include "cpp2util.h"

#line 1 "/app/example.cpp2"


//=== Cpp2 type definitions and function declarations ===========================

#line 1 "/app/example.cpp2"
#include <ranges>
auto main() -> int;

//=== Cpp2 function definitions =================================================

#line 1 "/app/example.cpp2"

#line 2 "/app/example.cpp2"
auto main() -> int{
  using namespace std::views;
  auto s {CPP2_UFCS_0(std::string, "")}; 
  static_cast<void>(cpp2::to_string((static_cast<void>(CPP2_UFCS(split, std::move(s), 'n')), 0)));
}
Output:
Program returned: 0
@JohelEGP JohelEGP added the bug Something isn't working label Dec 1, 2023
@JohelEGP
Copy link
Contributor Author

JohelEGP commented Dec 1, 2023

Hmm... Maybe I should be using '\\\n' or a interpolated raw string literal.

@gregmarr
Copy link
Contributor

gregmarr commented Dec 1, 2023

Looks like it's not disabling string quoting inside the capture. You get the same result with just this:

main: () = {
  _ = "('\n')$";
}

@JohelEGP
Copy link
Contributor Author

JohelEGP commented Dec 1, 2023

Hmm... Maybe I should be using '\\\n' or a interpolated raw string literal.

Documentation might help to build an intuition on how this should work.
Looking at the grammar, I should even be able to use any expression in an interpolation.
That's because the string doesn't continue until the interpolation ends.
So I think just '\n' should work, without extra \s and without using a raw string literal.

@gregmarr
Copy link
Contributor

gregmarr commented Dec 1, 2023

Character literals such as \x10 and \u0010 don't parse properly either.

@JohelEGP JohelEGP changed the title [BUG] Missing escape of char literal in UFCS in interpolation [BUG] Missing escape of char literal in interpolation Dec 6, 2023
@hsutter
Copy link
Owner

hsutter commented Dec 15, 2023

It looks like this behavior was added in #464 and #481. The block of code that does it is here:

https://github.com/hsutter/cppfront/blame/main/source/lex.h#L430-L436

I'm not sure why we're removing those \ characters. @filipsajdak do you recall?

When I comment out that block, all regression tests pass unchanged, and the code reported in this issue is fixed.

@gregmarr
Copy link
Contributor

gregmarr commented Dec 15, 2023

It was to fix #461
Looks like something more involved is needed to fix that while not breaking character literals.

@hsutter
Copy link
Owner

hsutter commented Dec 18, 2023

Here's the code from #461, showing the state before #461 was fixed:

  std::cout << "\"\"" << '\n'; // Prints `""`.
  std::cout << $R"("")" << '\n'; // Prints `""`.
  std::cout << "\"(\"\")$\"" << '\n'; // Prints `""`.
  std::cout << "(\"\\\"\\\"\")$" << '\n'; // cppfront asserts

My question is, should those last two have to escape inside the captures? Because what's in the captures should be an expression, and \"\" and \"\\\"\\\"\" aren't valid expressions (you can't just write those in code).

In other words, the capture source expression is not part of the string itself (and therefore shouldn't be escaped) -- its evaluation is part of the string (after being passed through to_string).

So I would instead expect this behavior:

  std::cout << "\"\"" << '\n';            // prints "" (as above)
  std::cout << $R"("")" << '\n';          // prints "" (as above)
  std::cout << "\"(\"\")$\"" << '\n';     // error, \"\" isn't a valid expression
  std::cout << "\"("")$\"" << '\n';       // this should print ""
  std::cout << "(\"\\\"\\\"\")$" << '\n'; // error, \"\\\"\\\"\" isn't a valid expression
  std::cout << "("\\\"\\\"")$" << '\n';   // this should print \\\"\\\"
  std::cout << "("\"\"")$" << '\n';       // this should print \"\"

Before I make that change, does that make sense to this group?

@JohelEGP
Copy link
Contributor Author

  std::cout << "("\\\"\\\"")$" << '\n';   // this should print \\\"\\\"
  std::cout << "("\"\"")$" << '\n';       // this should print \"\"

The comments look wrong to me.
I think they should be

  std::cout << "("\\\"\\\"")$" << '\n';   // this should print \"\"
  std::cout << "("\"\"")$" << '\n';       // this should print ""

If this is correct, it's the behavior I expect.

@hsutter
Copy link
Owner

hsutter commented Dec 18, 2023

Ah, I think you're right about those last two, thanks.

In particular, these two should be the same and print \"\" as you say:

std::cout << "("\\\"\\\"")$" << '\n';
std::cout << "" << "\\\"\\\"" << "" << '\n';

And these two should be the same and print "" as you say:

std::cout << "("\"\"")$" << '\n';
std::cout << "" << "\"\"" << "" << '\n';

Right?

Let me try a few other cases...

a := 40;
b := 2;
s: std::string = "xyzzy";

std::cout << "(a+b)$"; // should print 42
std::cout << "(s+"plugh")$"; // should print xyzzyplugh
std::cout << "("plugh")$"; // should print plugh
std::cout << "(s+"\"\n")$"; // should print xyzzy" plus a newline
std::cout << "("\"\n")$"; // should print " plus a newline
std::cout << "("")$"; // should print nothing

Do all these look correct now?

@JohelEGP
Copy link
Contributor Author

Yes.

@gregmarr
Copy link
Contributor

gregmarr commented Dec 18, 2023

I'm a bit unsure about how unquoted double quotes should interact with interpolation captures.

std::cout << "(" + s + ")$";

Should this be interpreted as the string ( plus a string variable plus the string )$ and be printed as (xyzzy)$ or as a string containing a capture of a string iteral and be printed as + s +?

@hsutter
Copy link
Owner

hsutter commented Dec 18, 2023

OK, I'm just going to fix this immediate Issue.

However, I would still like to get to where " doesn't need to be escaped inside capture-expressions inside string literals. It currently does because in an example like "("")$" that currently lexes as a string literal "(" followed by a string literal of ")$" which gets an error because it doesn't have the matching ( (but even if it wasn't an error, it would be flagged as ungrammatical later since two string literals can't be adjacent).

Of course, you can see above that my wish is to lex "("")$" as a single string literal, but I now see that requires knowing that the second " is not intended to be the closing quote to terminate the string literal... which would require not only lookahead but (I fear possibly) speculative lexing which is a terrible idea. So for now I'm going to leave " inside string interpolations alone, they must still be escaped (but \ and such do not need to be escaped). But this may well be what tips me over the edge to making $ prefix, which would make lexing this simple, align with other languages, and address the steady trickle of requests to make prefix 😉 ... at the expense of making captures in functions require parentheses more often, but maybe it's fine to make captures stand out more (notably, capture-by-pointer such as var&$*++ would become $(var&)*++ which may be actually better and address that current slightly warty syntax).

@hsutter
Copy link
Owner

hsutter commented Dec 18, 2023

I'm a bit unsure about how unquoted double quotes should interact with interpolation captures.

Right you are. I think you just identified the same issue in a racing comment while I was writing the above. 👍

@gregmarr
Copy link
Contributor

gregmarr commented Dec 18, 2023

(but even if it wasn't an error, it would be flagged as ungrammatical later since two string literals can't be adjacent)

Is that an intentional change from Cpp1?

https://en.cppreference.com/w/cpp/language/string_literal

String literals placed side-by-side are concatenated at translation phase 6 (after the preprocessor). That is, "Hello," " world!" yields the (single) string "Hello, world!". If the two strings have the same encoding prefix (or neither has one), the resulting string will have the same encoding prefix (or no prefix).

https://www.godbolt.org/z/j1v6qrK81

    std::cout << "("")$";
Program returned: 0
()$

@hsutter
Copy link
Owner

hsutter commented Dec 18, 2023

Is that an intentional change from Cpp1?

Yes. I've considered allowing that, but haven't gone there yet. If Cpp2 did allow merging consecutive string literal tokens, though, it would be in the grammar and not in the preprocessor.

Updated to add: I see in the quote above it says "phase 6 (after the preprocessor)" but I don't think that parenthetical is quite right, because it's not until phase 7 that "each preprocessing token is converted into a token." So up to the end of phase 6 we still have only preprocessing tokens.

@gregmarr
Copy link
Contributor

If cppfront doesn't support that, then should there be a "how to fix it" message?

@gregmarr
Copy link
Contributor

gregmarr commented Dec 18, 2023

Updated to add: I see in the quote above it says "phase 6 (after the preprocessor)" but I don't think that parenthetical is quite right, because it's not until phase 7 that "each preprocessing token is converted into a token." So up to the end of phase 6 we still have only preprocessing tokens.

I read "preprocessing tokens" as the output of the preprocessor. Phase 4 is the preprocessor, phase 5 is determining the common encoding for adjacent string literals, phase 6 is concatenating those adjacent string literals, and phase 7 is lexing, so I think "after the preprocessor" is accurate. The handling is covered in [lex.string] as well so it may even generally be handled inside the lexer.

@JohelEGP
Copy link
Contributor Author

Is the problem with "("")$" that
"the next preprocessing token is the longest sequence of characters that could constitute a preprocessing token"?

But this may well be what tips me over the edge to making $ prefix, which would make lexing this simple, align with other languages, and address the steady trickle of requests to make prefix 😉 ... at the expense of making captures in functions require parentheses more often, but maybe it's fine to make captures stand out more (notably, capture-by-pointer such as var&$*++ would become $(var&)*++ which may be actually better and address that current slightly warty syntax).

Please, resist.
I very much prefer not having redundant parentheses.
I think postfix operators is justice.

I really liked where this way going,
until I noticed that you intended to make $ prefix everywhere,
and not just in string literals.
That's unfortunate, but it makes sense in the greater scheme of things.

How about alternatives?
I think @msadeqhe once presented a syntax similar to "("x$")".
It's an extension to adjacent string literals, with captures adjoined: x$"y", "x"y$, "x"y$"z".

@gregmarr
Copy link
Contributor

Is the problem with "("")$" that

No, I don't think so. The problem is that the first non-escaped " ends the string literal. To have that not be the case, you need a prefix to change the string parsing behavior, as in raw string literals, or you need some other indicator that you are entering a special parsing mode. If the ( is that special indicator, then you need to escape that ( in all cases where you don't want it to indicate that special behavior. Trading "you don't need to escape " in a capture" for "you need to escape ( when not capturing" doesn't sound like a fair trade, especially since we already have a syntax for "you don't need to escape " in a string".

@JohelEGP
Copy link
Contributor Author

Can we get that behavior by changing the grammar of string-literal and extending [lex.phases]?
This is the current grammar (missing the format-specifier #387 (comment)):

//G string-literal:
//G     encoding-prefix? '"' s-char-seq? '"'
//G     encoding-prefix? 'R"' d-char-seq? '(' s-char-seq? ')' d-char-seq? '"'
//G
//G s-char-seq:
//G     interpolation? s-char
//G     interpolation? s-char-seq s-char
//G
//G d-char-seq:
//G     d-char
//G
//G interpolation:
//G     '(' expression ')' '$'

The idea is for "(, with anything in-between, to be a potential-interpolation-prefix.
If that is followed by an expression, optional format-specifier, and )$ (call it interpolation-suffix),
it can be considered part of a string-literal prefix (that still needs the closing ").
If what follows "( isn't that, then it can be a string-literal-part.
Then, at some point in the lex phase, you concatenate potential-interpolation-prefix
with interpolation-suffix and string-literal-part to form a string-literal.
Or maybe that can just be part of the grammar.

@gregmarr
Copy link
Contributor

gregmarr commented Dec 18, 2023

Can we get that behavior by changing the grammar of string-literal

The idea is for "(, with anything in-between, to be a potential-interpolation-prefix. If that is followed by an expression, optional format-specifier, and )$ (call it interpolation-suffix), it can be considered part of a string-literal prefix (that still needs the closing "). If what follows "( isn't that, then it can be a string-literal-part. Then, at some point in the lex phase, you concatenate potential-interpolation-prefix with interpolation-suffix and string-literal-part to form a string-literal. Or maybe that can just be part of the grammar.

That would still change the meaning of these well-formed lines of Cpp1 code.

std::cout << "("")$";
std::cout << "(" + s + ")$";

"( potential-interpolation-prefix
"" expression (an empty string)
)$" interpolation-suffix

"( potential-interpolation-prefix
" + s + " expression (the string + s +)
)$" interpolation-suffix

@JohelEGP
Copy link
Contributor Author

I'm fine with that, for now.
My goal is to keep $ suffix.

@hsutter
Copy link
Owner

hsutter commented Dec 18, 2023

OK, let's not quickly switch from suffix $.

I'm not too concerned about having a string literal with interpolation change meaning from Cpp1... they already do that for every case (e.g., "value is (val)$\n" has a different meaning) and so far nobody has raised that part as a concern IIRC.

I still want to keep lexing simple (modulo the backtrack we need for interpolation right now).

I'll think about it some more...

@gregmarr
Copy link
Contributor

I'm not too concerned about a single string either, but eliminating the need to escape the quotes can change from multiple strings to a single fixed string, and thus are much more concerning.

std::cout << "(" + someFunctionThatStoresStuffInTheDatabaseAndReturnsString() + ")$";

@hsutter
Copy link
Owner

hsutter commented Dec 21, 2023

For now the status quo is:

  • suffix $ which is working well
  • the only character that needs to be escaped inside a string interpolation is ", which isn't bad especially considering Greg's counterexamples, and that it is inside a string literal after all

I think that part is fine then until we get more new information.

For consecutive string literals: I see their absence as a good simplification in general. But it's true that sometimes it's useful to break a string literal across source lines, and interpolation does make for longer (if fewer) string literals. My thought there was to recognize it grammatically but require + between: "a string literal" + "another one" + "still another". My experience in Cpp1 today has been that we (at least I) still accidently write what looks like string literal addition pretty regularly when concatenating a long string made up of various parts, and it works just fine when a string object precedes it:

// ok in Cpp1
f( my_string + "a string literal" + "another one" + "still another" );
// error in Cpp1 - can't add char[K] types
f ( "a string literal" + "another one" + "still another" );

and the usual fix is to write std::string{"a string literal"} + "another one" + "still another". So lowering of binary_expression could just inject std::string{ and } around the first literal when it sees a sequence of string-literal + string_literal + string_literal.

That's been my thought anyway. How does that strike you?

@gregmarr
Copy link
Contributor

gregmarr commented Dec 21, 2023

For consecutive string literals: I see their absence as a good simplification in general. But it's true that sometimes it's useful to break a string literal across source lines, and interpolation does make for longer (if fewer) string literals. My thought there was to recognize it grammatically but require + between: "a string literal" + "another one" + "still another".

So lowering of binary_expression could just inject std::string{ and } around the first literal when it sees a sequence of string-literal + string_literal + string_literal.

std::string{"a string literal"} + "another one" + "still another".

This is a runtime concat.

To preserve the current behavior, you would need to convert it to this compile-time concat.

std::string{"a string literal" "another one" "still another"};

Something to watch out for however is that cpp1 supports putting a literal operator on the end of the final string literal, since the strings are combined before tokenization:

#include <string>
#include <iostream>

using namespace std::literals::string_literals;

int main()
{
    auto str = "foo" "bar" "baz"s;
    std::cout << str;
}

@JohelEGP
Copy link
Contributor Author

I'm not too concerned about a single string either, but eliminating the need to escape the quotes can change from multiple strings to a single fixed string, and thus are much more concerning.

std::cout << "(" + someFunctionThatStoresStuffInTheDatabaseAndReturnsString() + ")$";

An interpolation with " could be required to be in a raw string literal.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants