Unexpected token ILLEGAL on regex \. #44

JarLob · 2018-09-27T16:00:49Z

This code (a regex containing \.) triggers "Unexpected token ILLEGAL": var isHtml = /\.html$/;

sebastienros · 2018-09-27T16:16:40Z

Note to my future self: it works on http://esprima.org/demo/parse.html, so the fix should be easy to find by stepping through a debug session and spotting where esprima and esprima-dotnet diverge.

JarLob · 2018-09-27T16:24:46Z

Forgot to mention that the error comes from the tokenizer:

 	Esprima.dll!Esprima.ErrorHandler.ThrowError(int index = 427, int line = 15, int column = 18, string message = "Unexpected token ILLEGAL") Line 37	C#	Symbols loaded.
 	Esprima.dll!Esprima.Scanner.ThrowUnexpectedToken(string message = "Unexpected token ILLEGAL") Line 173	C#	Symbols loaded.
 	Esprima.dll!Esprima.Scanner.GetComplexIdentifier() Line 568	C#	Symbols loaded.
 	Esprima.dll!Esprima.Scanner.ScanIdentifier() Line 659	C#	Symbols loaded.
 	Esprima.dll!Esprima.Scanner.Lex() Line 1679	C#	Symbols loaded.
>	Esprima.Sample.dll!Esprima.Sample.Program.Tokenize(Esprima.Scanner scanner = {Esprima.Scanner}) Line 29	C#	Symbols loaded.

KvanTTT · 2018-09-27T18:46:43Z

Yes, I'm also very interested in getting the lexer errors with regexes fixed. I described similar errors in #40 (comment).

sebastienros · 2018-09-29T17:21:19Z

I can't repro this issue. I added this test and it works fine both on master and dev. Can you provide a better unit test?

        [Fact]
        public void ShouldRegularExpressionGH44()
        {
            var parser = new JavaScriptParser(@"var isHtml = /\.html$/");
            var program = parser.ParseProgram();
        }

JarLob · 2018-09-29T17:31:44Z

I got it in Esprima.Sample as you can see from the call stack.

        [Fact]
        public void ShouldRegularExpressionGH44()
        {
            var scanner = new Scanner(@"var isHtml = /\.html$/");

            var   tokens = new List<Token>();
            Token token;

            do
            {
                scanner.ScanComments();
                token = scanner.Lex();
                tokens.Add(token);
            } while (token.Type != TokenType.EOF);
        }

KvanTTT · 2018-09-29T17:51:42Z

@sebastienros it fails only with the Scanner, not with the parser.

olliejm · 2020-06-25T11:00:35Z

I'm still having this issue with a regex that also contains \. sequences.

I cloned the dev branch and ran the unit test posted by JarLob above; it still failed with both his regex and mine.

In my case I'm hitting the problem when Jint executes a file containing the problematic regex, using Jint 3.0.0-beta-1715 with Esprima 1.0.1258.

olliejm · 2020-06-25T13:50:42Z

Here is my particular regex:

var urlRegex = /(https?)\:\/\/[A-Za-z0-9\.\-]+(\/[A-Za-z0-9\?\&\=;\+!'\(\)\*\-\._~%]*)*/gi;

The error occurs on this line: https://github.com/sebastienros/esprima-dotnet/blob/dev/src/Esprima/Scanner.cs#L602 when processing the first \ escape character in the line.

sebastienros · 2020-06-27T22:09:34Z

I can't repro these issues if I use the parser directly, or the ScanRegex method. I think the issue is in the Esprima.Sample source that you all seem to be following. The parser does more than just call Lex: it performs lookahead that helps it scan the regex. Without that context, the scanner tries to scan an identifier instead.
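The failure mode described above can be reproduced with a toy lexer. The following is a simplified JavaScript model, not Esprima's implementation: the function names `lexNext` and `tokenize` and the token shape are assumptions made for illustration. It always treats `/` as a punctuator and accepts a backslash only as the start of a `\uXXXX` identifier escape, which matches the Lex, ScanIdentifier, GetComplexIdentifier path in the stack trace earlier in the thread.

```javascript
// Toy model of a context-free lexer; NOT Esprima's actual code.
// All names and the token shape here are hypothetical.
function lexNext(source, index) {
  const ch = source[index];

  // Without parser context, "/" is emitted as a plain punctuator.
  if (ch === "/") {
    return { type: "Punctuator", value: "/", end: index + 1 };
  }

  // A backslash or an identifier character starts an identifier.
  if (ch === "\\" || /[A-Za-z_$]/.test(ch)) {
    let end = index;
    while (end < source.length) {
      if (source[end] === "\\") {
        // Inside an identifier, a backslash is only legal as \uXXXX.
        if (!/^\\u[0-9a-fA-F]{4}/.test(source.slice(end))) {
          throw new SyntaxError("Unexpected token ILLEGAL");
        }
        end += 6;
      } else if (/[A-Za-z0-9_$]/.test(source[end])) {
        end += 1;
      } else {
        break;
      }
    }
    return { type: "Identifier", value: source.slice(index, end), end };
  }

  // Everything else is treated as a one-character punctuator.
  return { type: "Punctuator", value: ch, end: index + 1 };
}

function tokenize(source) {
  const tokens = [];
  let index = 0;
  while (index < source.length) {
    if (/\s/.test(source[index])) {
      index += 1; // skip whitespace
      continue;
    }
    const token = lexNext(source, index);
    tokens.push(token);
    index = token.end;
  }
  return tokens;
}
```

In this model, `tokenize("a / b")` succeeds, while `tokenize("/\\./")` (the source text `/\./`) throws as soon as it reaches the backslash, because the opening `/` has already been consumed as a division operator.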

Is your intent to actually iterate over each Token of a script, as the sample is supposed to do?

IntranetFactory · 2020-06-28T06:39:27Z

We use

var urlRegex = /(https?)\:\/\/[A-Za-z0-9\.\-]+(\/[A-Za-z0-9\?\&\=;\+!'\(\)\*\-\._~%]*)*/gi;

in a Jint script and that triggers the error in the Esprima dependency.

lahma · 2020-06-29T13:10:07Z

Tested the above regex with the latest Jint 3 using the REPL and it worked just fine. Maybe someone should post a complete sample, including the library versions used.

KurtGokhan · 2021-01-02T21:37:32Z

Here is the simplest failing program: /\./

This is the code from the sample project with that regex, and it throws the error in the title.

        public static void Main(string[] args)
        {
            var scanner = new Scanner(@"/\./");
            Tokenize(scanner);
        }

        private static void Tokenize(Scanner scanner)
        {
            var tokens = new List<Token>();
            Token token;

            do
            {
                scanner.ScanComments();
                token = scanner.Lex();
                tokens.Add(token);
            } while (token.Type != TokenType.EOF);

            Console.WriteLine(JsonConvert.SerializeObject(tokens, Formatting.Indented));
        }

To be clear, this script /\./ does not fail in Jint. I have a 500 KB script that includes a lot of regexes like this, so I thought this might be causing my problem. My situation is a bit weirder, because that 500 KB script normally does not fail, but it fails when I create a Release build. Maybe Jint uses the Scanner when some optimizations are enabled?

Edit: never mind. It does not happen in Jint. It only happens in the Scanner.

adams85 · 2023-02-25T13:03:46Z

Tracked this down: the root cause of the issue is the JS syntax itself, more precisely, the / and /= tokens, which are ambiguous. When the scanner encounters one of them, it can't tell whether it's the beginning of a regexp or it's a division operator without knowing the context. But to know the context, you need to parse the code... For more details, see the comments of this SO answer (please note that the accepted answer is wrong):

Technically, there are a couple ambiguities that are unavoidable at the lexical level. For example, (a+b)/c vs. if (x) /foo/.exec('bar') (close-paren can precede either). Also, ++ /foo/.abc and a++ / b (plus-plus can precede either). Together with -- these are the only ones I know of.

There's also a problem with }: function f() {}(newline)/1/g versus var x = {}(newline)/1/g, since the latter doesn't enforce semicolon insertion.

To sum it up, you can't (reliably) use the scanner in standalone mode when the code contains regexps. That is kind of sad, but it looks like there's no solution to this problem. What we can maybe do to mitigate it is to allow the user to provide some best-effort algorithm, which would enable tokenization in some specific cases instead of failing with an error. What do you think? Would such an addition make sense?
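As a rough idea of what such a best-effort algorithm could look like, here is a minimal JavaScript sketch that guesses from the previous significant token alone. Everything in it is an assumption for illustration: the function name `slashStartsRegex`, the token shape `{ type, value }`, and the token type names are hypothetical and not part of Esprima's API.

```javascript
// Hypothetical best-effort heuristic, not part of Esprima.
// Token type names below are assumed for this sketch.
const OPERAND_TYPES = new Set([
  "Identifier", "Numeric", "String", "Boolean", "Null",
  "RegularExpression", "Template",
]);
const OPERAND_KEYWORDS = new Set(["this", "super"]);
const EXPRESSION_ENDERS = new Set([")", "]", "}", "++", "--"]);

// Returns true if a "/" following prevToken is guessed to open a regex
// literal, and false if it is guessed to be a division operator.
function slashStartsRegex(prevToken) {
  // At the start of input nothing can be divided.
  if (prevToken === null) return true;

  // An operand precedes the slash, so treat it as division.
  if (OPERAND_TYPES.has(prevToken.type)) return false;

  if (prevToken.type === "Keyword") {
    // `this / 2` is division; `return /x/` and `typeof /x/` take a regex.
    return !OPERAND_KEYWORDS.has(prevToken.value);
  }

  if (prevToken.type === "Punctuator") {
    // These tokens can precede either form; the guess here is division.
    return !EXPRESSION_ENDERS.has(prevToken.value);
  }

  return true;
}
```

With this guess, a `/` after `=` is treated as a regex (so `var isHtml = /\.html$/` would tokenize), and a `/` after `)` is treated as division. That is right for `(a+b)/c` but wrong for `if (x) /foo/.exec('bar')`, which is exactly the kind of case that cannot be resolved without parsing.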

adams85 mentioned this issue Feb 25, 2023

Comments are not handled at all? #40

Closed
