Rethink string representation #68
I will probably add a switch to toggle between these modes:
Got some WIP code that implements an
A 'true UTF-8' mode (option 3 or 4) would be considerably more involved, and perhaps not worth it. Still considering it, though.
I changed my mind: I won't keep the current behaviour as the default, and maybe I won't even keep it as an option; nobody seems to expect or desire it anyway. The default will be no mangling and no literal interpretation. This lets users parse Unicode source code without hassle, while those interested in string literals can choose some other mode that ensures a coherent interpretation. I implemented UTF-8 modes too, but they're a little hacky. I also still need to document the option.
Finally committed as fstirlitz:2b04739...fstirlitz:10666c7. Leaving out UTF-8 modes for the moment; I may add them later. I’m leaving this issue open until I make a decision, but either way it goes, it’s not a release blocker.
I'd still be interested in UTF-8. I've tried reading up on x-user-defined but did not come away with an understanding of where it would break down. I am interested in literal strings because I want to use luaparse to translate Lua source into JS.
(cribbed from `README.md`)

Unlike strings in JavaScript, Lua strings are not Unicode strings, but bytestrings (sequences of 8-bit values); likewise, implementations of Lua parse the source code as a sequence of octets. However, the input to this parser is a JavaScript string, i.e. a sequence of 16-bit code units (not necessarily well-formed UTF-16). This poses a problem of how those code units should be interpreted, particularly if they are outside the Basic Latin block ('ASCII').
Currently, this parser handles Unicode input by encoding it in WTF-8, and reinterpreting the resulting code units as Unicode code points. This applies to string literals and (if `extendedIdentifiers` is enabled) to identifiers as well. Lua byte escapes inside string literals are interpreted directly as code points, while Lua 5.3 `\u{}` escapes are similarly decoded as UTF-8 code units reinterpreted as code points. It is as if the parser input was being interpreted as ISO-8859-1, while actually being encoded in UTF-8.

This ensures that no otherwise-valid input will be rejected due to encoding errors. Assuming the input was originally encoded in UTF-8 (which includes the case of only containing ASCII characters), it also preserves the following properties:
- String literals (and identifiers, if `extendedIdentifiers` is enabled) will have the same representation in the AST if and only if they represent the same string in the source code: e.g. the Lua literals `'💩'`, `'\u{1f4a9}'` and `'\240\159\146\169'` will all have `"\u00f0\u009f\u0092\u00a9"` in their `.value` property, and likewise `local 💩` will have the same string in its `.name` property;
- The `String.prototype.charCodeAt` method in JS can be directly used to emulate Lua's `string.byte` (with one argument, after shifting offsets by 1), and likewise `String.prototype.substr` can be used similarly to Lua's `string.sub`;
- The `.length` property of decoded string values in the AST is equal to the value that the `#` operator would return in Lua.

Maintaining those properties makes the logic of static analysers and code transformation tools simpler. However, it poses a problem when displaying strings to the user and serialising AST back into a string; to recover the original bytestrings, values transformed in this way will have to be encoded in ISO-8859-1.
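
The properties above can be illustrated with a short sketch. It is not luaparse code: the helper names `toByteString`, `toBytes` and `luaByte` are made up for illustration, and it relies on the standard `TextEncoder` and `TextDecoder` globals. Because `TextEncoder` replaces unpaired surrogates with U+FFFD, it only approximates the WTF-8 behaviour described above and assumes well-formed input.

```javascript
// Hypothetical helpers for illustration only; none of these are luaparse APIs.

// Encode a JS string as UTF-8, then reinterpret every byte as a code point
// in the range U+0000..U+00FF (a "byte string").
function toByteString(source) {
  const bytes = new TextEncoder().encode(source);
  let out = '';
  for (const b of bytes) out += String.fromCharCode(b);
  return out;
}

// Recover the original bytes, i.e. encode the byte string as ISO-8859-1.
function toBytes(byteString) {
  const bytes = new Uint8Array(byteString.length);
  for (let i = 0; i < byteString.length; i++) {
    const code = byteString.charCodeAt(i);
    if (code > 0xff) throw new RangeError('not a byte string');
    bytes[i] = code;
  }
  return bytes;
}

// Emulate Lua's one-argument string.byte(s, i), which is 1-based.
function luaByte(s, i) {
  return s.charCodeAt(i - 1);
}

const value = toByteString('\u{1f4a9}');            // the literal '💩'
const sameValue = '\u00f0\u009f\u0092\u00a9';       // what '\240\159\146\169' yields
const roundTrip = new TextDecoder().decode(toBytes(value));
```

Here `value` and `sameValue` compare equal, `value.length` matches what `#` would return in Lua for that literal, and `roundTrip` is the original character again, which is the ISO-8859-1 step mentioned above.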
Other solutions to this problem may be considered in the future. Some of them have been listed below, with their drawbacks:
- […] `x-user-defined` encoding) and rejects code points that cannot appear in that encoding; may be useful for source code in encodings other than UTF-8
  - `x-user-defined` cannot take advantage of compact representation of ISO-8859-1 strings in certain JavaScript engines
- […] `ArrayBuffer` or `Uint8Array` for source code and/or string literals
  - […] `Map` and `WeakMap` instead
- […] `Array` of numbers in the range [0, 256)
- […] `String` values, and requiring that escape sequences in literals constitute well-formed UTF-8; an exception is thrown if they do not
- […] `surrogateescape` encoding error handler
  - […] `("\xc4" .. "\x99") == "\xc4\x99"`

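
One plausible reading of the last item is that a `surrogateescape`-style decoding would no longer let the Lua equality `("\xc4" .. "\x99") == "\xc4\x99"` carry over to decoded values. The sketch below is an assumption-laden illustration, not luaparse code: `decodeSurrogateEscape` is a hypothetical helper that imitates Python's handler by mapping each undecodable byte 0xNN to the lone surrogate U+DCNN.

```javascript
// Hypothetical helper for illustration only; not part of luaparse.
// Decodes bytes as UTF-8, turning each undecodable byte 0xNN into U+DCNN.
function decodeSurrogateEscape(bytes) {
  let out = '';
  let i = 0;
  while (i < bytes.length) {
    const lead = bytes[i];
    // Sequence length announced by the lead byte (0 means invalid lead byte).
    const len = lead < 0x80 ? 1
      : lead >= 0xc2 && lead <= 0xdf ? 2
      : lead >= 0xe0 && lead <= 0xef ? 3
      : lead >= 0xf0 && lead <= 0xf4 ? 4
      : 0;
    let decoded = null;
    if (len > 0 && i + len <= bytes.length) {
      try {
        // A fatal decoder throws on any ill-formed sequence.
        decoded = new TextDecoder('utf-8', { fatal: true, ignoreBOM: true })
          .decode(bytes.subarray(i, i + len));
      } catch (e) {
        decoded = null;
      }
    }
    if (decoded !== null) {
      out += decoded;
      i += len;
    } else {
      out += String.fromCharCode(0xdc00 + lead); // escape a single byte
      i += 1;
    }
  }
  return out;
}

// "\xc4" .. "\x99": each literal is decoded on its own, then concatenated.
const concatenated =
  decodeSurrogateEscape(Uint8Array.of(0xc4)) +
  decodeSurrogateEscape(Uint8Array.of(0x99));
// "\xc4\x99": the two bytes form one well-formed UTF-8 sequence (U+0119).
const single = decodeSurrogateEscape(Uint8Array.of(0xc4, 0x99));
```

Under these assumptions `concatenated` is a pair of lone surrogates while `single` is the one character U+0119, so two expressions that are equal in Lua end up with different JavaScript values.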
Cf. discussion under c05822d.