[RFC] Case Insensitivity Proof of Concept #1092

Closed
wants to merge 9 commits into from

KvanTTT
Member

@KvanTTT KvanTTT commented Jan 8, 2016

There are quite a lot of case-insensitive languages: Pascal, PHP, SQLite, TSQL, and others. Usually this feature is implemented via fragment rules (the code generation approach, see here for an example) or by overriding the LA method in the input stream (the runtime approach, as @jimidle mentions here).
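As an illustration of the runtime approach, here is a minimal Python sketch. It does not use the ANTLR runtime: `CharStream`, `CaseFoldingStream`, and `match_keyword` are hypothetical stand-ins invented for this example. The idea is that lookahead returns a lowercased character, so the matcher compares against lowercase literals, while the token text is still cut from the original input.

```python
class CharStream:
    """Tiny stand-in for a lexer input stream (hypothetical, not the ANTLR API)."""

    EOF = -1

    def __init__(self, text):
        self.text = text
        self.index = 0

    def LA(self, offset):
        """Return the code point `offset` characters ahead (1 = next), or EOF."""
        pos = self.index + offset - 1
        if offset < 1 or pos >= len(self.text):
            return self.EOF
        return ord(self.text[pos])

    def consume(self):
        self.index += 1

    def get_text(self, start, stop):
        # Text always comes from the original, unfolded input.
        return self.text[start:stop]


class CaseFoldingStream(CharStream):
    """Overrides LA so that lookahead sees lowercase characters only."""

    def LA(self, offset):
        c = super().LA(offset)
        if c == self.EOF:
            return c
        lowered = chr(c).lower()
        # Some characters lowercase to more than one code point; leave those as-is.
        return ord(lowered) if len(lowered) == 1 else c


def match_keyword(stream, keyword):
    """Match a lowercase keyword at the current position; return its original text or None."""
    start = stream.index
    for i, ch in enumerate(keyword, start=1):
        if stream.LA(i) != ord(ch):
            return None
    for _ in keyword:
        stream.consume()
    return stream.get_text(start, stream.index)
```

With this sketch, `match_keyword(CaseFoldingStream("BeGiN x"), "begin")` matches and preserves the original spelling of the keyword, while the same call on a plain `CharStream` does not match.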

I implemented a new option and syntax in ANTLR to support case insensitivity. With my approach it is possible to declare caseInsensitive and caseSensitive modes. This can be used in a grammar that parses both PHP code (caseInsensitive keywords) and JavaScript code (caseSensitive keywords):

mode JavaScript, caseSensitive;
Break: 'break';
Do: 'do';
Instanceof: 'instanceof';
...

mode PHP, caseInsensitive;
Abstract: 'abstract';
Array: 'array';
As: 'as';

It is also possible to declare the caseInsensitive option for an entire grammar (combined grammars are also supported):

grammar pascal;
options { caseInsensitive = true; }
program
   : programHeading 'interface'? block '.'
   ;
programHeading
   : 'program' identifier ('(' identifierList ')')? ';'
   | 'unit' identifier ';'
   ;
// Other parser rules
...

INTERFACE: 'interface';
PROGRAM: 'program';
UNIT: 'unit';
// Other tokens with string literals (without fragment rules).

Moreover, it is possible to declare caseSensitive and caseInsensitive tokens in the same mode, like this:

lexer grammar L;
options { caseInsensitive = true; }
Token_1: 'a';

mode DEFAULT_MODE, caseSensitive;
Token_3: 'd';

See the unit tests for details and other cases.

Case insensitivity is implemented via the code generation approach, but it can be replaced with the runtime approach if necessary. This feature improves grammar readability and makes grammars easier to write.
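To make the code generation idea concrete, here is a small Python sketch of the rewriting that such an approach amounts to: each letter of a string literal becomes a two-character set. This is only an assumption about the general technique, not the code in this pull request; `expand_literal` and `lexer_rule` are hypothetical helper names.

```python
def expand_literal(literal):
    """Rewrite a literal as a sequence of lexer elements that match it in any letter case.

    Cased letters become sets such as [aA]; everything else stays a quoted literal.
    """
    parts = []
    for ch in literal:
        lower, upper = ch.lower(), ch.upper()
        if lower != upper and len(lower) == 1 and len(upper) == 1:
            parts.append(f"[{lower}{upper}]")
        else:
            # Escape backslashes and single quotes inside a quoted literal.
            escaped = ch.replace("\\", "\\\\").replace("'", "\\'")
            parts.append(f"'{escaped}'")
    return " ".join(parts)


def lexer_rule(name, literal):
    """Format a complete lexer rule for the expanded literal."""
    return f"{name}: {expand_literal(literal)};"
```

For example, `lexer_rule("As", "as")` yields `As: [aA] [sS];`, which is the kind of rule a grammar author would otherwise write by hand or through fragment rules.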

If this pull request looks good overall, I'll finish the remaining issues (separate lexer and parser grammars, among others). Otherwise I'll try to separate out the fixes for the symbol issues and submit them as another pull request. Suggestions are welcome.

Fixed exception with invalid escape sequence: antlr#1077
Improved the STRING_LITERALS_AND_SETS_CANNOT_BE_EMPTY warning. It now works with empty sets ([]) too.
Added a new CHARACTERS_COLLISION_IN_SET warning ([a-f][d-n], [aa-z], 'F'..'A', etc.).
Added unit tests for the features mentioned above.
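The kind of check behind a warning like CHARACTERS_COLLISION_IN_SET can be sketched in Python as follows. This is a guess at the general idea, not the tool's actual logic: `find_set_problems` is a hypothetical function, a set is assumed to be already parsed into (low, high) character pairs, and treating a reversed range as a problem of the same family is also an assumption.

```python
def find_set_problems(ranges):
    """Report reversed and overlapping ranges in a character set.

    `ranges` is a list of (low, high) single-character pairs; a single
    character `c` is written as (c, c).
    """
    problems = []
    valid = []
    for lo, hi in ranges:
        if lo > hi:
            problems.append(f"reversed range '{lo}'..'{hi}'")
        else:
            valid.append((lo, hi))

    # Sort by start and remember the range that reaches furthest so far,
    # so a wide range is compared against every later range it covers.
    valid.sort()
    widest = None
    for lo, hi in valid:
        if widest is not None and lo <= widest[1]:
            problems.append(f"'{lo}'-'{hi}' overlaps '{widest[0]}'-'{widest[1]}'")
        if widest is None or hi > widest[1]:
            widest = (lo, hi)
    return problems
```

Under these assumptions, the ranges a-f and d-n are reported as overlapping, and 'F'..'A' is reported as reversed.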
@KvanTTT
Member Author

KvanTTT commented Jan 8, 2016

Sorry, I did not run the runtime tests. I'll fix them.

…set and it's tool code generation error).

Fixed runtime tests (removed invalid escape sequence '\u').
…start position in full path).

Added the -O=-inline option to fix the "Method too complex" exception.
@KvanTTT
Member Author

KvanTTT commented Jan 8, 2016

I have no idea why the "Method L:_serializeATN () is too complex" exception now occurs for the C# runtime. 😒

@ericvergnaud
Contributor

Probably because the recursive string concatenation ("abc" + "bcd" + … ) is now exceeding the compiler stack limit.


@KvanTTT
Member Author

KvanTTT commented Jan 10, 2016

@ericvergnaud, no, this issue is related to the lexer. Maybe the method is too large. Local tests pass (Windows 7). I tried to disable the "inline" optimization as described here (https://bugzilla.xamarin.com/show_bug.cgi?id=5092), but the test still failed.
I don't understand why it occurs after my pull request.

@ericvergnaud
Contributor

The lexer also has its own serializedATN, and the failing method is TestHugeLexer.
I haven't looked at your PR, but given the nature of the proposal, I guess it almost doubles the size of the ATN.


@KvanTTT
Member Author

KvanTTT commented Jan 11, 2016

@ericvergnaud, I understand, but in this test the serializedATN size should not change because the caseInsensitive option is not enabled. I'll look at it in more detail, though.

@KvanTTT KvanTTT changed the title from "Case Insensitivity Proof of Concept" to "[RFC] Case Insensitivity Proof of Concept" on Jan 11, 2016
@parrt
Member

parrt commented Jan 28, 2016

Any change to the meta-language requires deep thought on my part. Not sure when I can devote the time.

@parrt
Member

parrt commented Mar 30, 2016

Hi. I'm going to close this, not because it isn't an excellent job, but because it's a fairly significant change and I'm nervous about unintended consequences.
