You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Modify the lexical syntax of string literals to have string gaps, which are escape sequences of the form "\" newline whitespace* that have the interpretation of an empty string. For example,
"this is \
a string"
is equivalent to "this is a string".
Motivation
In projects that have a line length limit (such as std and mathlib), long strings need to be wrapped to respect this limit. Long strings appear in error messages in meta code.
A workaround one sees is using string interpolation with empty strings, since the expression between curly braces may itself contain whitespace that is ignored:
s!"this is {
""}a string"
However, (1) this feels like a hack to many who use Lean and (2) this term produces "this is " ++ "" ++ "a string", which is not an atomic string literal. Note that users do not write
"this is
a string"
because this includes extraneous whitespace in the string -- in particular, this example is equivalent to "this is\n a string".
String literals should have syntax to be able to continue them across multiple lines without incorporating the intervening whitespace. Using Haskell terminology, we want "gaps" in the lexical syntax of string literals.
Manual-level explanation
String literals can be broken across multiple lines without introducing additional whitespace by ending lines with the \ character. All the whitespace up until the first non-whitespace character on the next line is ignored. These \-escaped line breaks are known as string gaps.
For example, the string literal
"this is \
a string"
has a string gap starting at the end of the first line, and the string literal is equivalent to "this is a string"
You must have a line break immediately after the \, otherwise it is an error. For example, you are not allowed to have comments within strings.
"this is \ -- Having a comment here is not allowed.
a string"
If you need to start a line with whitespace, one trick is to use an escape code such as \x20 (space) to start the next line, since otherwise whitespace is consumed as part of the string gap. For example, in the following the string still has a space between is and a.
"this is\
\x20a string"
Reference-level explanation
In the lexical syntax of string literals, string gaps correspond to string_gap in the following grammar:
To account for systems using different line endings, newline matches both \n and \r\n.
The lexical syntax allows a string gap to consume one or more newlines. However, to prevent confusion, the parser should throw an error if the string gap contains more than one. When there is more than one newline, it becomes unclear what the intended interpretation is meant to be, and also it invites users to try to put comments in intervening blank lines. As such, this is disallowed:
"this is \
a string"
The technical reason for allowing multiple newlines in the lexical syntax but disallowing them during parsing is that otherwise the lexical syntax would indicate that a second newline terminates a string gap.
During elaboration time, the Lean.Syntax.isStrLit? function is responsible for decoding parsed string literals, including processing all escape codes. This is where string gaps are skipped. Note that this function shares code with character literal decoding, and so the shared code needs to take care that character literals do not process string gaps.
Drawbacks
String gaps add to the complexity of the lexical syntax. This implementation of string gaps makes it somewhat difficult to include whitespace at the beginning of the continuing line. This is the one place in Lean that needs to be aware of \r\n line endings.
Rationale and alternatives
String gaps already have precedent in other languages. This RFC is essentially the Rust implementation of the concept. However, in Rust they choose to make multiple newlines a warning issued later rather than catching it at parse time. For simplicity, we have this condition raise a parser error (rather than waiting for elaboration time to emit a warning). Note that the lexical syntax for gaps is still "\" newline whitespace*, but we choose to have the parser throw an error if whitespace* contains a newline. One way this makes the code simpler is that we can easily throw an error at the position of a second newline in the gap, if it exists.
Alternatives:
Haskell-style string gaps have the lexical syntax "\" whitespace+ "\". These neatly solve the problem of where the next line continues, even if it starts with whitespace.
"this is\
\ a string"
However, beginning lines with \ will play havoc on syntax highlighting, especially if the character after the second \ is the closing " for the string literal. Everything you can do with Haskell-style gaps can be represented in some way with Rust-style gaps, with small variations in line lengths.
Trying to adapt C's automatic string concatenation, where the juxtaposition "this is" " a string" is equivalent to the concatenated version. However, this creates an ambiguity when passing strings as arguments to functions.
Allowing interpolated strings to include expression-free {} notation
s!"this is {
}a string"
While this would be a smaller change to the lexical syntax, this does not produce a string literal. It also forces users to switch to using an interpolated string if they want gaps in strings.
Using unrestricted "\" whitespace+ gaps. These are easy to parse, but here is an example confusion this can lead to.
"this is \ -- that's the first half
a string" -- and that's the second
The first "comment" is actually part of the string. Requiring that the first whitespace character be a newline prevents this.
Using "\" whitespace+ gaps where the whitespace must contain at least one newline. This circumvents the need for special CRLF handling, but allowing spaces immediately after \ and before the first newline might be confusing.
Other considerations
Note that the whitespace in a string gap must be whitespace. Comments are not permitted. The parser does not interpret string literals (that is, it does not do escape sequence replacements), and they are instead interpreted during elaboration time. If the gaps were to admit comments, this interpretation process would need to be able to handle comment syntax as well. This limitation is also found in the Haskell 2010 language specification.
The implementation needs to be careful not to allow string gaps inside character literals. String literals and character literals share code for handling escapes, but the string gap feature only makes sense for strings.
Future possibilities
Rather than having special handling for \r\n, we can consider having a global preprocessing step that does \r\n to \n conversion for the entire input file. This approach was taken by Rust.
This enables features such as multiline strings where each line begins with some delimiter, which would be another way to control whitespace at the beginnings of lines. This would be useful for embedding blocks of code in string literals.
d!"this is \
line 1
| line 2, indented
|line 3"
We could include an escape such as \> that also stands for the empty string. This could serve as an explicit end to a string gap. For example,
#2821)
Implements "gaps" in string literals. These are escape sequences of the
form `"\" newline whitespace+` that have the interpretation of an empty
string. For example,
```
"this is \
a string"
```
is equivalent to `"this is a string"`. These are modeled after string
continuations in
[Rust](https://doc.rust-lang.org/beta/reference/tokens.html#string-literals).
Implements RFC #2838
Summary
Modify the lexical syntax of string literals to have string gaps, which are escape sequences of the form
"\" newline whitespace*
that have the interpretation of an empty string. For example,is equivalent to
"this is a string"
.Motivation
In projects that have a line length limit (such as std and mathlib), long strings need to be wrapped to respect this limit. Long strings appear in error messages in meta code.
A workaround one sees is using string interpolation with empty strings, since the expression between curly braces may itself contain whitespace that is ignored:
However, (1) this feels like a hack to many who use Lean and (2) this term produces
"this is " ++ "" ++ "a string"
, which is not an atomic string literal. Note that users do not writebecause this includes extraneous whitespace in the string -- in particular, this example is equivalent to
"this is\n a string"
.String literals should have syntax to be able to continue them across multiple lines without incorporating the intervening whitespace. Using Haskell terminology, we want "gaps" in the lexical syntax of string literals.
Manual-level explanation
String literals can be broken across multiple lines without introducing additional whitespace by ending lines with the
\
character. All the whitespace up until the first non-whitespace character on the next line is ignored. These\
-escaped line breaks are known as string gaps.For example, the string literal
has a string gap starting at the end of the first line, and the string literal is equivalent to
"this is a string"
You must have a line break immediately after the
\
, otherwise it is an error. For example, you are not allowed to have comments within strings.If you need to start a line with whitespace, one trick is to use an escape code such as
\x20
(space) to start the next line, since otherwise whitespace is consumed as part of the string gap. For example, in the following the string still has a space betweenis
anda
.Reference-level explanation
In the lexical syntax of string literals, string gaps correspond to
string_gap
in the following grammar:To account for systems using different line endings,
newline
matches both \n and \r\n.The lexical syntax allows a string gap to consume one or more newlines. However, to prevent confusion, the parser should throw an error if the string gap contains more than one. When there is more than one newline, it becomes unclear what the intended interpretation is meant to be, and also it invites users to try to put comments in intervening blank lines. As such, this is disallowed:
The technical reason for allowing multiple newlines in the lexical syntax but disallowing them during parsing is that otherwise the lexical syntax would indicate that a second newline terminates a string gap.
During elaboration time, the
Lean.Syntax.isStrLit?
function is responsible for decoding parsed string literals, including processing all escape codes. This is where string gaps are skipped. Note that this function shares code with character literal decoding, and so the shared code needs to take care that character literals do not process string gaps.Drawbacks
String gaps add to the complexity of the lexical syntax. This implementation of string gaps makes it somewhat difficult to include whitespace at the beginning of the continuing line. This is the one place in Lean that needs to be aware of
\r\n
line endings.Rationale and alternatives
String gaps already have precedent in other languages. This RFC is essentially the Rust implementation of the concept. However, in Rust they choose to make multiple newlines a warning issued later rather than catching it at parse time. For simplicity, we have this condition raise a parser error (rather than waiting for elaboration time to emit a warning). Note that the lexical syntax for gaps is still
"\" newline whitespace*
, but we choose to have the parser throw an error ifwhitespace*
contains a newline. One way this makes the code simpler is that we can easily throw an error at the position of a second newline in the gap, if it exists.Alternatives:
Haskell-style string gaps have the lexical syntax
"\" whitespace+ "\"
. These neatly solve the problem of where the next line continues, even if it starts with whitespace.However, beginning lines with
\
will play havoc on syntax highlighting, especially if the character after the second\
is the closing"
for the string literal. Everything you can do with Haskell-style gaps can be represented in some way with Rust-style gaps, with small variations in line lengths.Trying to adapt C's automatic string concatenation, where the juxtaposition
"this is" " a string"
is equivalent to the concatenated version. However, this creates an ambiguity when passing strings as arguments to functions.Allowing interpolated strings to include expression-free
{}
notationWhile this would be a smaller change to the lexical syntax, this does not produce a string literal. It also forces users to switch to using an interpolated string if they want gaps in strings.
Using unrestricted
"\" whitespace+
gaps. These are easy to parse, but here is an example confusion this can lead to.The first "comment" is actually part of the string. Requiring that the first whitespace character be a newline prevents this.
Using
"\" whitespace+
gaps where the whitespace must contain at least one newline. This circumvents the need for special CRLF handling, but allowing spaces immediately after\
and before the first newline might be confusing.Other considerations
Note that the whitespace in a string gap must be whitespace. Comments are not permitted. The parser does not interpret string literals (that is, it does not do escape sequence replacements), and they are instead interpreted during elaboration time. If the gaps were to admit comments, this interpretation process would need to be able to handle comment syntax as well. This limitation is also found in the Haskell 2010 language specification.
The implementation needs to be careful not to allow string gaps inside character literals. String literals and character literals share code for handling escapes, but the string gap feature only makes sense for strings.
Future possibilities
Rather than having special handling for
\r\n
, we can consider having a global preprocessing step that does\r\n
to\n
conversion for the entire input file. This approach was taken by Rust.This enables features such as multiline strings where each line begins with some delimiter, which would be another way to control whitespace at the beginnings of lines. This would be useful for embedding blocks of code in string literals.
We could include an escape such as
\>
that also stands for the empty string. This could serve as an explicit end to a string gap. For example,Community discussion
Zulip thread
Impact
Add 👍 to issues you consider important. If others benefit from the changes in this proposal being added, please ask them to add 👍 to it.
The text was updated successfully, but these errors were encountered: