[RFC] Case Insensitivity Proof of Concept #1092

KvanTTT · 2016-01-08T11:56:04Z

Quite a lot of languages are case-insensitive: Pascal, PHP, SQLite, TSQL, and others. Usually this is handled either with fragment rules (the code generation approach, see here for example) or by overriding the LA method of the input stream (the runtime approach, as @jimidle mentions here).
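For readers unfamiliar with the runtime approach, here is a minimal, self-contained sketch of the idea in Python. It does not use the ANTLR runtime: `CaseFoldingStream`, `match_keyword`, and their methods are hypothetical names that only mimic a character stream whose `LA` returns a case-folded symbol while the stored text stays untouched.

```python
class CaseFoldingStream:
    """Toy character stream: lookahead is lower-cased, stored text is not.

    Hypothetical stand-in for a lexer input stream; not an ANTLR class.
    """

    EOF = -1

    def __init__(self, text):
        self.text = text
        self.index = 0

    def LA(self, offset):
        """Return the lower-cased code point `offset` symbols ahead (1-based).

        Only forward lookahead is supported in this sketch.
        """
        pos = self.index + offset - 1
        if offset < 1 or pos >= len(self.text):
            return self.EOF
        folded = self.text[pos].lower()
        # A few characters lower-case to more than one code point; keep those as-is.
        return ord(folded) if len(folded) == 1 else ord(self.text[pos])

    def consume(self):
        self.index += 1

    def get_text(self, start, stop):
        """Return the original (un-folded) text for positions start..stop inclusive."""
        return self.text[start:stop + 1]


def match_keyword(stream, keyword):
    """Try to match a lower-case keyword at the current stream position.

    Returns the original spelling on success, or None with the position restored.
    """
    start = stream.index
    for ch in keyword:
        if stream.LA(1) != ord(ch):
            stream.index = start
            return None
        stream.consume()
    return stream.get_text(start, stream.index - 1)
```

The point of the design is that matching sees folded characters, but the token text handed to later stages keeps the spelling the user wrote.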

I implemented a new option and syntax in ANTLR to support case insensitivity. With my approach, modes can be declared caseInsensitive or caseSensitive. This can be used in a single grammar that parses both PHP code (case-insensitive keywords) and JavaScript code (case-sensitive keywords):

mode JavaScript, caseSensitive;
Break: 'break';
Do: 'do';
Instanceof: 'instanceof';
...

mode PHP, caseInsensitive;
Abstract: 'abstract';
Array: 'array';
As: 'as';

It is also possible to declare the caseInsensitive option for an entire grammar (combined grammars are supported as well):

grammar pascal;
options { caseInsensitive = true; }
program
   : programHeading 'interface'? block '.'
   ;
programHeading
   : 'program' identifier ('(' identifierList ')')? ';'
   | 'unit' identifier ';'
   ;
// Other parser rules
...

INTERFACE: 'interface';
PROGRAM: 'program';
UNIT: 'unit';
// Other tokens with string literals (without fragment rules).

Moreover, caseSensitive and caseInsensitive tokens can be declared in the same mode, like this:

lexer grammar L;
options { caseInsensitive = true; }
Token_1: 'a';

mode DEFAULT_MODE, caseSensitive;
Token_3: 'd';

See the unit tests for details and other cases.

Case insensitivity is implemented via the code generation approach, but it can be replaced with the runtime approach if necessary. This feature improves grammar readability and makes grammars easier to write.
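To make the code generation idea concrete: one way to picture it is that each cased character of a literal is rewritten into a two-character set, roughly what hand-written fragment rules achieve. The sketch below assumes that general technique and is not the code in this pull request; `expand_literal` is a hypothetical helper.

```python
def expand_literal(literal):
    """Rewrite the body of a lexer string literal into case-insensitive pieces.

    Cased characters become two-character sets, e.g. 'as' -> [aA] [sS].
    Characters without case are emitted as quoted literals.
    Illustrative only: escape sequences and characters that need escaping
    inside a set or a quoted literal are not handled.
    """
    parts = []
    for ch in literal:
        lower, upper = ch.lower(), ch.upper()
        if lower != upper and len(lower) == 1 and len(upper) == 1:
            parts.append("[{}{}]".format(lower, upper))
        else:
            parts.append("'{}'".format(ch))
    return " ".join(parts)
```

With this rewriting, every cased character in a keyword turns into a set of two alternatives instead of a single character, while the grammar author still writes the plain literal.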

If this pull request looks good overall, I'll finish the remaining issues with separate lexer and parser grammars, among others. Otherwise I'll try to split out the fixed symbol issues into another pull request. Suggestions are welcome.

Fixed an exception on invalid escape sequences: antlr#1077. Improved the STRING_LITERALS_AND_SETS_CANNOT_BE_EMPTY warning: it now also works with empty sets ([]). Added a new CHARACTERS_COLLISION_IN_SET warning ([a-f][d-n], [aa-z], 'F'..'A', etc.). Added unit tests for these features.
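The CHARACTERS_COLLISION_IN_SET check can be pictured as an overlap test between the ranges of one set. Below is a rough sketch of that idea, assuming the ranges are already parsed into (low, high) character pairs. `find_collisions` is a hypothetical name, this is not the tool's actual implementation, and skipping reversed ranges is an assumption made only to keep the sketch small.

```python
def find_collisions(ranges):
    """Return pairs of ranges from one char set that share at least one character.

    `ranges` is a list of (low, high) single-character pairs, e.g. ('a', 'f').
    A single character such as 'a' is written as ('a', 'a').
    Reversed ranges (low > high) are skipped in this sketch.
    """
    valid = [r for r in ranges if r[0] <= r[1]]
    ordered = sorted(valid, key=lambda r: (ord(r[0]), ord(r[1])))
    collisions = []
    for i, (low_a, high_a) in enumerate(ordered):
        for low_b, high_b in ordered[i + 1:]:
            if ord(low_b) > ord(high_a):
                break  # sorted by lower bound: no later range can overlap this one
            collisions.append(((low_a, high_a), (low_b, high_b)))
    return collisions
```

For a set made of the ranges a-f and d-n the function reports one colliding pair, and the same happens for a single 'a' next to the range a-z.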

KvanTTT · 2016-01-08T12:30:22Z

Sorry, I did not run runtime tests. I'll fix it.

KvanTTT · 2016-01-08T23:00:46Z

I have no idea why the "Method L:_serializeATN () is too complex" exception now occurs for the C# runtime. 😒

ericvergnaud · 2016-01-09T01:21:54Z

Probably because the recursive string concatenation ("abc" + "bcd" + … ) is now exceeding the compiler stack limit

KvanTTT · 2016-01-10T11:12:01Z

@ericvergnaud, no, this issue is related to the lexer. Maybe the method is too large. Local tests pass (Windows 7). I tried disabling the "inline" optimization as described here (https://bugzilla.xamarin.com/show_bug.cgi?id=5092), but the test still failed.
I don't understand why it occurred after my pull request.

ericvergnaud · 2016-01-10T11:57:14Z

The lexer also has its own serializedATN, and the failing method is TestHugeLexer.
I haven't looked at your PR, but given the nature of the proposal, I guess it almost doubles the size of the ATN.

KvanTTT · 2016-01-11T11:41:56Z

@ericvergnaud, I understand, but in this test the serializedATN size should not change, because the caseInsensitive option is not enabled. I'll look at it in more detail, though.

parrt · 2016-01-28T18:36:26Z

Any change to the meta-language requires deep thought on my part. Not sure when I can devote the time.

parrt · 2016-03-30T16:59:02Z

Hi. I'm going to close not because it's not an excellent job but because it's a fairly significant change and I'm nervous about unintended consequences.

KvanTTT added 7 commits January 8, 2016 00:29

Fixed exception with absent lexer rules.

118f365

Improved char collision checks.

1bb45ff

Added semantic checks for caseSensitive option.

7362e1b

Fixed invalid escape short sequences [\u24]. antlr#1077

aac8d34

Fixed '-' and ']' support in char set.

4cf3187

caseInsensitive option detection refactored.

6f9312f

KvanTTT added 2 commits January 9, 2016 00:15

Removed runtime tests with reversed range (now it's treated as empty set and it's tool code generation error). Fixed runtime tests (removed invalid escaped sequence '\u'.

2e4c21e

Fixed csharp runtime tests running on Windows ('\' is not allowed at start position in full path). Added -O=-inline option for fixing "Method too complex" exception.

dcd0e61

KvanTTT changed the title ~~Case Insensitivity Proof of Concept~~ [RFC] Case Insensitivity Proof of Concept Jan 11, 2016

KvanTTT mentioned this pull request Jan 11, 2016

Optionally report unused lexer/parser rules #1069

Closed

parrt added the status:not-fixing label Mar 30, 2016

parrt closed this Mar 30, 2016

KvanTTT mentioned this pull request Dec 15, 2016

Character issues #1517

Closed

KvanTTT mentioned this pull request Feb 21, 2017

Ignore case new syntax #1002

Closed

KvanTTT mentioned this pull request Dec 2, 2017

Add a new CharStream that converts the symbols to upper or lower case. #2046

Closed

KvanTTT mentioned this pull request Dec 9, 2021

Implement caseInsensitive option #3399

Merged

KvanTTT deleted the case_insensitive branch April 8, 2022 18:51
