Unexpected token ILLEGAL on regex \. #44

Open
JarLob opened this issue Sep 27, 2018 · 13 comments

Comments

@JarLob
Contributor

JarLob commented Sep 27, 2018

This code (a regex containing \.) triggers "Unexpected token ILLEGAL": var isHtml = /\.html$/;

@sebastienros
Owner

Note to my future self: It works on http://esprima.org/demo/parse.html, so the fix should be easy to find by stepping through a debug session and finding the difference between esprima and esprima-dotnet.

@JarLob
Contributor Author

JarLob commented Sep 27, 2018

Forgot to mention that the error comes from the tokenizer:

 	Esprima.dll!Esprima.ErrorHandler.ThrowError(int index = 427, int line = 15, int column = 18, string message = "Unexpected token ILLEGAL") Line 37	C#	Symbols loaded.
 	Esprima.dll!Esprima.Scanner.ThrowUnexpectedToken(string message = "Unexpected token ILLEGAL") Line 173	C#	Symbols loaded.
 	Esprima.dll!Esprima.Scanner.GetComplexIdentifier() Line 568	C#	Symbols loaded.
 	Esprima.dll!Esprima.Scanner.ScanIdentifier() Line 659	C#	Symbols loaded.
 	Esprima.dll!Esprima.Scanner.Lex() Line 1679	C#	Symbols loaded.
>	Esprima.Sample.dll!Esprima.Sample.Program.Tokenize(Esprima.Scanner scanner = {Esprima.Scanner}) Line 29	C#	Symbols loaded.

@KvanTTT

KvanTTT commented Sep 27, 2018

Yes, I'm also very interested in getting lexer errors with regexes fixed. I described similar errors in #40 (comment).

@sebastienros
Owner

I can't repro this issue. I added this test and it works fine both on master and dev. Can you provide a better unit test?

        [Fact]
        public void ShouldRegularExpressionGH44()
        {
            var parser = new JavaScriptParser(@"var isHtml = /\.html$/");
            var program = parser.ParseProgram();
        }

@JarLob
Contributor Author

JarLob commented Sep 29, 2018

I got it in Esprima.Sample as you can see from the call stack.

        [Fact]
        public void ShouldRegularExpressionGH44()
        {
            var scanner = new Scanner(@"var isHtml = /\.html$/");

            var tokens = new List<Token>();
            Token token;

            do
            {
                scanner.ScanComments();
                token = scanner.Lex();
                tokens.Add(token);
            } while (token.Type != TokenType.EOF);
        }

@KvanTTT

KvanTTT commented Sep 29, 2018

@sebastienros it fails only with the Scanner, not with the parser.

@olliejm

olliejm commented Jun 25, 2020

I'm still having this issue with a regex that also contains the \. escape.

I cloned the dev branch and ran the unit test posted by JarLob above; it still failed with both his regex and mine.

In my case I'm hitting the problem when Jint executes a file containing the problematic regex, using Jint 3.0.0-beta-1715 with Esprima 1.0.1258.

@olliejm

olliejm commented Jun 25, 2020

Here is my particular regex:

var urlRegex = /(https?)\:\/\/[A-Za-z0-9\.\-]+(\/[A-Za-z0-9\?\&\=;\+!'\(\)\*\-\._~%]*)*/gi;

The error occurs on this line: https://github.com/sebastienros/esprima-dotnet/blob/dev/src/Esprima/Scanner.cs#L602 when processing the first \ escape character in the line.

@sebastienros
Owner

I can't repro these issues if I use the parser directly, or the ScanRegex method. I think that the issue is in the Esprima.Sample source that you all seem to be following. The parser does more than just call Lex and does some lookaheads that will help scan the regex. Otherwise it's trying to scan an identifier instead.

Is your intent to actually iterate over each Token of a script, like the sample is supposed to work?

@IntranetFactory

We use

var urlRegex = /(https?)\:\/\/[A-Za-z0-9\.\-]+(\/[A-Za-z0-9\?\&\=;\+!'\(\)\*\-\._~%]*)*/gi;

in a Jint script, and that triggers the error in the Esprima dependency.

@lahma
Collaborator

lahma commented Jun 29, 2020

I tested the above regex with the latest Jint 3 using the REPL, and it worked just fine. Maybe someone should post a complete sample together with the library versions used.

@KurtGokhan
Contributor

KurtGokhan commented Jan 2, 2021

Here is the simplest failing program: /\./

This is the code from the sample project with that regex, and it is throwing the error in the title.

        public static void Main(string[] args)
        {
            var scanner = new Scanner(@"/\./");
            Tokenize(scanner);
        }

        private static void Tokenize(Scanner scanner)
        {
            var tokens = new List<Token>();
            Token token;

            do
            {
                scanner.ScanComments();
                token = scanner.Lex();
                tokens.Add(token);
            } while (token.Type != TokenType.EOF);

            Console.WriteLine(JsonConvert.SerializeObject(tokens, Formatting.Indented));
        }

To be clear, this script /\./ does not fail in Jint. I have a 500 KB script that includes a lot of regexes like this, so I think this may be causing it. My situation is a bit stranger, because that 500 KB script normally does not fail but does fail when I create a Release build. Maybe Jint uses the Scanner when some optimizations are enabled?

Edit: never mind. It does not happen in Jint. It only happens in the Scanner.

@adams85
Collaborator

adams85 commented Feb 25, 2023

Tracked this down: the root cause of the issue is the JS syntax itself, more precisely, the / and /= tokens, which are ambiguous. When the scanner encounters one of them, it can't tell whether it's the beginning of a regexp or it's a division operator without knowing the context. But to know the context, you need to parse the code... For more details, see the comments of this SO answer (please note that the accepted answer is wrong):

Technically, there are a couple ambiguities that are unavoidable at the lexical level. For example, (a+b)/c vs. if (x) /foo/.exec('bar') (close-paren can precede either). Also, ++ /foo/.abc and a++ / b (plus-plus can precede either). Together with -- these are the only ones I know of.

There's also a problem with }: function f() {}(newline)/1/g versus var x = {}(newline)/1/g, since the latter doesn't enforce semicolon insertion.
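
To make the first ambiguity from the quote concrete, here is a small JavaScript sketch. The variable names and values are illustrative and do not come from the thread. In both snippets a `)` is directly followed by `/`, yet one `/` is a division operator and the other starts a regexp literal.

```javascript
// Illustrative values only; none of these names come from the thread.
const a = 6, b = 2, c = 4;

// Here the `/` after `)` is the division operator.
const quotient = (a + b) / c;

// Here the `/` after `)` starts a regexp literal: the `)` closes the
// `if` condition, so a statement (not an operator) is expected next.
// eval returns the completion value of the `if` statement.
const match = eval("if (true) /foo/.exec('foobar')");

console.log(quotient);  // 2
console.log(match[0]);  // foo
```

A scanner that sees only the `)` token cannot tell these two cases apart; the parser can, because it knows whether the parenthesis closed an expression or an `if` condition.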

To sum up, you can't reliably use the scanner in standalone mode when the code contains regexps. That's kind of sad, but it looks like there's no solution to this problem. What we could maybe do to mitigate it is allow the user to provide a best-effort algorithm, which would enable tokenization in some specific cases instead of failing with an error. What do you think? Would such an addition make sense?
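
As a rough illustration of what such a best-effort algorithm might look like, here is a minimal JavaScript sketch that guesses from the previous token alone. The function name `slashStartsRegex` and the keyword list are hypothetical and not part of Esprima or esprima-dotnet. It assumes the previous token is passed as the text of an identifier, keyword, number, or punctuator.

```javascript
// Hypothetical helper, not an Esprima API: guess whether a `/` that
// follows the token text `prev` starts a regexp literal.
// `prev` is null when `/` is the first token of the input.
const KEYWORDS_BEFORE_REGEX = new Set([
  'return', 'typeof', 'instanceof', 'in', 'of', 'new', 'delete',
  'void', 'throw', 'case', 'do', 'else', 'yield', 'await',
]);

function slashStartsRegex(prev) {
  // Nothing precedes the slash, so it cannot be a division.
  if (prev === null) return true;

  // These keywords are followed by an expression or a statement.
  if (KEYWORDS_BEFORE_REGEX.has(prev)) return true;

  // Identifiers and numbers end an operand, so `/` is a division.
  if (/^\.?[\w$]/.test(prev)) return false;

  // Closing brackets usually end an operand. This is only a guess for
  // `)` and `}`: `if (x) /re/` and `function f() {}\n/re/` defeat it.
  if (prev === ')' || prev === ']' || prev === '}') return false;

  // `++` and `--` are ambiguous as well; assume the postfix form.
  if (prev === '++' || prev === '--') return false;

  // Any other punctuator (`=`, `(`, `,`, `;`, `{`, ...) expects an operand.
  return true;
}

console.log(slashStartsRegex('='));   // true:  var isHtml = /\.html$/
console.log(slashStartsRegex(null));  // true:  /\./
console.log(slashStartsRegex('b'));   // false: a + b / c
console.log(slashStartsRegex(')'));   // false: right for (a+b)/c, wrong for if (x) /foo/
```

For the two failing inputs in this thread the guess is correct, since the regexp follows `=` in one case and starts the input in the other. The cases from the quoted comment remain unsolved, which is why this can only be an opt-in fallback and not a fix.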
