Increase ATN states size limit, simplify ATN serialization #3546

KvanTTT · 2022-02-20T19:47:06Z

I'm returning to increasing the ATN states size limit, and I've also simplified serialization (related to Unicode encoding).

ATN states size can be > 65535, up to 2^31-1:

fixes Add a more understandable message than "Serialized ATN data element .... element ... out of range 0..65535" #1863
fixes UnsupportedOperationException while generating code for large grammars. #2732
fixes Serialized ATN data element 810567 element 11 out of range 0..65535 #3338

Take a look at C# code: it contains small changes because it already has Read and Write methods (as well as other runtimes except for Java).

If Java and C# are ok, I'll complete other runtimes.

The writing and reading methods have comprehensive tests: testATNDataWriterReaderCompact, testATNDataWriterReaderRaw.

Clean up ATN serializer/deserializer code

…er.MAX_VALUE) fix antlr#840, fix antlr#1863, fix antlr#2732, fix antlr#3338

parrt · 2022-02-20T22:45:26Z

Can you describe what needs to change to handle 32-bit ints using 16-bit unicode? I assume this is easy to fix for non-Java. Just use int not short. What is effect on Java ATN code size? Ah. I see:

runtime/Java/src/org/antlr/v4/runtime/atn/ATNDataWriter.java

Basically using high bits to encode right? Hmm...again I'm worried about a very rare cases causes a big code change. Not that this approach is "wrong" but always gotta worry about breaking stuff. Also wouldn't this hurt the common case to use upper bit(s)?

KvanTTT · 2022-02-21T10:08:25Z

Can you describe what needs to change to handle 32-bit ints using 16-bit Unicode?

We used a different encoding for 16-bit and 32-bit values:

antlr4/runtime/Java/src/org/antlr/v4/runtime/atn/ATNSerializer.java

Lines 166 to 201 in db8a483

    
    List<IntervalSet> bmpSets = new ArrayList<>();
    List<IntervalSet> smpSets = new ArrayList<>();
    for (IntervalSet set : sets.keySet()) {
        if (!set.isNil() && set.getMaxElement() <= Character.MAX_VALUE) {
            bmpSets.add(set);
        }
        else {
            smpSets.add(set);
        }
    }
    serializeSets(
        data,
        bmpSets,
        new CodePointSerializer() {
            @Override
            public void serializeCodePoint(IntegerList data, int cp) {
                data.add(cp);
            }
        });
    serializeSets(
        data,
        smpSets,
        new CodePointSerializer() {
            @Override
            public void serializeCodePoint(IntegerList data, int cp) {
                serializeInt(data, cp);
            }
        });
    Map<IntervalSet, Integer> setIndices = new HashMap<>();
    int setIndex = 0;
    for (IntervalSet bmpSet : bmpSets) {
        setIndices.put(bmpSet, setIndex++);
    }
    for (IntervalSet smpSet : smpSets) {
        setIndices.put(smpSet, setIndex++);
    }

Now the encoding is universal.

Basically using high bits to encode right?

Yes, two high bits are used for service purposes.

Also wouldn't this hurt the common case to use upper bit(s)?

All big Unicode values are encoded using not more than 2 integers in tests as it was before, I've checked: https://github.com/antlr/antlr4/pull/3546/files#diff-45a1efdccdd2d1ab783a98c0d1dc54c09914461d0c4bb0237d9dbe17bd4168dcR39

Three 16-bit elements are used only for very big values (>= 2^30) and for negative numbers except -1 (negative numbers aren't used for serialization at all):

encoding                                                     16-bit elements  type
00xx xxxx xxxx xxxx                                          1                int (14 bit)
01xx xxxx xxxx xxxx xxxx xxxx xxxx xxxx                      2                int (30 bit)
1000 0000 0000 0000 xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx  3                int (32 bit)
1111 1111 1111 1111                                          1                -1 (0xFFFF)
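
For readers who want to see the table in action, here is a minimal sketch of such a variable-length codec. It is not the ATNDataWriter/ATNDataReader code from this PR: the class name `CompactIntCodec` is hypothetical, and the order of the 16-bit pieces inside the 2- and 3-element forms (high bits first for two elements, low word first for three) is an assumption, since the table only fixes the prefix bits and the element counts.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the prefix-bit scheme from the table above.
final class CompactIntCodec {

    // Appends the encoded form of value to out and returns how many
    // 16-bit elements were written (1, 2 or 3).
    static int write(List<Integer> out, int value) {
        if (value == -1) {                        // reserved single word 0xFFFF
            out.add(0xFFFF);
            return 1;
        }
        if (value >= 0 && value < (1 << 14)) {    // prefix 00: 14-bit payload
            out.add(value);
            return 1;
        }
        if (value >= 0 && value < (1 << 30)) {    // prefix 01: 30-bit payload
            out.add(0x4000 | (value >>> 16));     // assumed order: high 14 bits first
            out.add(value & 0xFFFF);
            return 2;
        }
        out.add(0x8000);                          // prefix 10: full 32-bit payload follows
        out.add(value & 0xFFFF);                  // assumed order: low word first
        out.add(value >>> 16);
        return 3;
    }

    // Sequential reader over a list produced by write().
    static final class Reader {
        private final List<Integer> data;
        private int p;

        Reader(List<Integer> data) {
            this.data = data;
        }

        int read() {
            int first = data.get(p++);
            if (first == 0xFFFF) {
                return -1;
            }
            switch (first >>> 14) {
                case 0:
                    return first;
                case 1:
                    return ((first & 0x3FFF) << 16) | data.get(p++);
                default:
                    int low = data.get(p++);
                    int high = data.get(p++);
                    return low | (high << 16);
            }
        }
    }

    public static void main(String[] args) {
        List<Integer> words = new ArrayList<>();
        int[] samples = {0, -1, 42, 1 << 14, 0xFFFF, Integer.MAX_VALUE, Integer.MIN_VALUE};
        for (int v : samples) {
            System.out.println(v + " takes " + write(words, v) + " element(s)");
        }
        Reader reader = new Reader(words);
        for (int v : samples) {
            System.out.println(v + " round-trips to " + reader.read());
        }
    }
}
```

Under these assumptions the element counts match the ones asserted in testATNDataWriterReaderCompact (1, 1, 1, 2, 2, 3, 3, for 13 elements in total).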

Hmm...again I'm worried about a very rare cases causes a big code change. Not that this approach is "wrong" but always gotta worry about breaking stuff.

All values within the range Integer.MIN_VALUE..Integer.MAX_VALUE are covered by tests https://github.com/antlr/antlr4/pull/3546/files#diff-3ac04ac7612676174d4634b5bea3e2ef612af5a06f6887090e36450067be8d6fR163-R207:

@Test public void testATNDataWriterReaderCompact() {
	IntegerList integerList = new IntegerList();
	ATNDataWriter writer = new ATNDataWriter(integerList, "Java");
	assertEquals(1, writer.write(0));
	assertEquals(1, writer.write(-1));
	assertEquals(1, writer.write(42));
	assertEquals(2, writer.write(1 << 14));
	assertEquals(2, writer.write(0xFFFF));
	assertEquals(3, writer.write(Integer.MAX_VALUE));
	assertEquals(3, writer.write(Integer.MIN_VALUE));
	assertEquals(13, integerList.size());
	char[] charArray = Utils.toCharArray(integerList);
	ATNDataReader reader = new ATNDataReader(charArray);
	assertEquals(0, reader.read());
	assertEquals(-1, reader.read());
	assertEquals(42, reader.read());
	assertEquals(1 << 14, reader.read());
	assertEquals(0xFFFF, reader.read());
	assertEquals(Integer.MAX_VALUE, reader.read());
	assertEquals(Integer.MIN_VALUE, reader.read());
}
@Test public void testATNDataWriterReaderRaw() {
	IntegerList integerList = new IntegerList();
	ATNDataWriter writer = new ATNDataWriter(integerList, "Java");
	writer.writeInt32(0);
	writer.writeInt32(-1);
	writer.writeInt32(42);
	writer.writeInt32(1 << 14);
	writer.writeInt32(0xFFFF);
	writer.writeInt32(Integer.MAX_VALUE);
	writer.writeInt32(Integer.MIN_VALUE);
	assertEquals(7 * 2, integerList.size());
	char[] charArray = Utils.toCharArray(integerList);
	ATNDataReader reader = new ATNDataReader(charArray);
	assertEquals(0, reader.readInt32());
	assertEquals(-1, reader.readInt32());
	assertEquals(42, reader.readInt32());
	assertEquals(1 << 14, reader.readInt32());
	assertEquals(0xFFFF, reader.readInt32());
	assertEquals(Integer.MAX_VALUE, reader.readInt32());
	assertEquals(Integer.MIN_VALUE, reader.readInt32());
}

The ATN serializer format is already not backward compatible with previous versions (because the UUID was removed and the version number changed).

parrt · 2022-02-21T18:25:27Z

Hmm... I think @ericvergnaud had a concern about double encoding; first we encode to get larger integers and then we have to use modified UTF-8. in principle that's okay but it makes me nervous. I think Eric's concern was about all of the decoding on a phone.

Also, 14 bits only gives us 16384 max, which I think is hitting the size of the lexer for SQL grammars. I don't remember where we put the numbers I had but isn't it going to hurt grammar size for such large but 16-bit conforming grammars?

I see that we allow 32 bit values and sets but I don't see how our current code allows 32-bit single edge transitions in the ATN. do you have any idea if that is the case?

KvanTTT · 2022-02-21T18:50:57Z

Hmm... I think @ericvergnaud had a concern about double encoding; first we encode to get larger integers and then we have to use modified UTF-8. in principle that's okay but it makes me nervous. I think Eric's concern was about all of the decoding on a phone.

With my changes, there is no need to encode small numbers first and big numbers second; natural order is preserved. It should not break decoding on a phone.

Also, it looks like such Unicode encoding was introduced by @bhamiltoncx in fd4246c

Also, 14 bits only gives us 16384 max, which I think is hitting the size of the lexer for SQL grammars. I don't remember where we put the numbers I had but isn't it going to hurt grammar size for such large but 16-bit conforming grammars?

It does not hit the size of the lexer for SQL grammars because the max value is 13163 there: #3505 (comment). Anyway, SQL lexer is very big, other grammars are almost always smaller.

I see that we allow 32 bit values and sets but I don't see how our current code allows 32-bit single edge transitions in the ATN. do you have any idea if that is the case?

I've added the generated test getAtnStatesSizeMoreThan65535Descriptor that fails on the previous version. Actually, the max is 31-bit transitions because int is a signed type.
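
As a rough idea of how a grammar that crosses the 65535-state limit could be produced, here is a small generator sketch. It is not the actual getAtnStatesSizeMoreThan65535Descriptor; the class name, rule names and literals are hypothetical. Since every lexer rule contributes several ATN states, a few tens of thousands of trivial rules should be enough to exceed the old limit.

```java
// Hypothetical generator for a very large lexer grammar; illustrative only.
final class BigLexerGrammar {

    // Builds a lexer grammar with ruleCount single-literal token rules.
    static String generate(String grammarName, int ruleCount) {
        StringBuilder sb = new StringBuilder();
        sb.append("lexer grammar ").append(grammarName).append(";\n");
        for (int i = 0; i < ruleCount; i++) {
            // Each rule matches one distinct keyword, e.g. T7: 'k7';
            sb.append("T").append(i).append(": 'k").append(i).append("';\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String grammar = generate("Big", 40000);
        System.out.println("Generated " + grammar.length() + " characters of grammar text");
    }
}
```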

parrt · 2022-02-21T22:52:00Z

thanks for the link back to the empirical tests I did for the SQL grammar.

Anyway, SQL lexer is very big, other grammars are almost always smaller

If this is true, seems like we are doing some major surgery to support a tiny subset of the population. on the other hand, it sounds like there are a number of people that are finding this useful; you passed some links to issues.

we could minimize the size of this change by limiting it to only those targets that need to use strings to encode integer lists, right? certainly that is Java. can C# do static integer arrays properly? if so, then we don't need this for C# either. we should definitely not do this for any targets such as C++ that can directly encode static arrays. does JavaScript use strings? Dang, yep, looks like JavaScript uses strings as well:

const serializedATN = ["<model.serialized; wrap={",<\n>    "}>"].join("");

it looks like Java, C#, JS, Dart, Php, Python2/3, Swift. So, everybody except Go and C++? Dang. we should definitely get away from strings for any language that allows it. certainly this should be avoided for Python as it can simply do atn = [1,2,3,4...] right?
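
To illustrate the difference between the two representations being compared here, the following sketch stores the same four values once as a string whose chars each carry one 16-bit value and once as a plain int array. The class name and the sample values are made up for illustration; this is not generated ANTLR code.

```java
import java.util.Arrays;

// Hypothetical comparison of string-encoded and array-encoded integer lists.
final class StringVsArrayAtn {

    // String form: every char of the literal is one serialized value (0..0xFFFF).
    static final String AS_STRING = "\u0004\u0000\u0002\u0101";

    // Array form: the same values written directly as ints.
    static final int[] AS_INTS = {4, 0, 2, 0x101};

    // Widens each char of the string back into an int.
    static int[] decode(String serialized) {
        char[] chars = serialized.toCharArray();
        int[] result = new int[chars.length];
        for (int i = 0; i < chars.length; i++) {
            result[i] = chars[i];
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.equals(decode(AS_STRING), AS_INTS)); // prints true
    }
}
```

The string form cannot hold anything above 0xFFFF in a single char, which is exactly where the 16-bit limit discussed in this thread comes from; the array form has no such restriction.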

parrt · 2022-02-22T01:53:43Z

Ok, I've spent a lot of time thinking about this today and went back over the arguments from before. basically I'd like to stick with my original suggestion:

In other words, we begin the process by serializing an ATN into a list of integers. Then, we figure out the maximum value and see if everything fits in 16 bits. If so, we leave everything as is, otherwise we convert ALL int values to two \uXXXX chars rather than a single char.

It seems to me that we should simply get an IntegerList out of serialize() with the version number first and then the word size (16 or 32) with all words 16 or 32, no bit twiddling. The version number plus the word size indicates the encoding for words. (We can add a method to IntegerList to compute the maximum value.) In the future, it's possible I'd consider the optimization you have here, but I'd rather not do bit twiddling across all targets in one big move. The targets can do whatever they want with this information, generating strings or simply creating static short or int lists like C++ and Go. any target that is allowed to generate static integers should do so.

Sounds like we still need to convert the Java ATNDeserializer to use a "reader", as you have suggested, so that it can read either 16 or 32 depending on what kind of data we have stored. Others seem to use something similar but for example Python directly refers to readInt vs readInt32. We'll need a generic "read" that pulls 16 or 32 bits depending on the value following the version number.
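
To make the proposal concrete, here is a minimal sketch of what such a word-size header and generic read could look like. None of this is code from the PR: the class `WordSizedAtn`, the layout `[version, wordSize, payload...]`, and the low-word-first order in 32-bit mode are assumptions, and the sketch sidesteps the -1 question by accepting only non-negative values.

```java
// Hypothetical sketch of the "version + word size" layout; illustrative only.
final class WordSizedAtn {

    // Encodes payload as chars. The two header entries always take one char each;
    // payload values take one char in 16-bit mode and two chars in 32-bit mode.
    static char[] encode(int version, int[] payload) {
        int max = 0;
        for (int v : payload) {
            if (v < 0) {
                throw new IllegalArgumentException("sketch handles non-negative values only");
            }
            max = Math.max(max, v);
        }
        int wordSize = max <= 0xFFFF ? 16 : 32;
        char[] out = new char[2 + payload.length * (wordSize / 16)];
        int p = 0;
        out[p++] = (char) version;
        out[p++] = (char) wordSize;
        for (int v : payload) {
            out[p++] = (char) (v & 0xFFFF);       // low 16 bits
            if (wordSize == 32) {
                out[p++] = (char) (v >>> 16);     // high 16 bits
            }
        }
        return out;
    }

    // Reads the header once, then pulls 16 or 32 bits per call to read().
    static final class Reader {
        private final char[] data;
        private int p;
        final int version;
        final int wordSize;

        Reader(char[] data) {
            this.data = data;
            this.version = data[p++];
            this.wordSize = data[p++];
        }

        int read() {
            int low = data[p++];
            return wordSize == 16 ? low : low | (data[p++] << 16);
        }
    }

    public static void main(String[] args) {
        char[] encoded = encode(4, new int[] {1, 70000});
        Reader reader = new Reader(encoded);
        System.out.println("word size " + reader.wordSize + ": " + reader.read() + ", " + reader.read());
    }
}
```

A single value above 0xFFFF switches the whole payload to two chars per value, which is the size trade-off of this approach.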

KvanTTT · 2022-02-22T13:36:25Z

How should the -1 value be represented in 16-bit and 32-bit modes: as 0xFFFF and 0xFFFFFFFF? I am afraid the current serialization works incorrectly when the positive value 0xFFFF collides with the special value reserved for -1.
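
The collision can be shown in a few lines. This sketch assumes a 16-bit reader that maps the word 0xFFFF back to -1, which is how the reserved value is described in this thread; the class and method names are hypothetical.

```java
// Hypothetical demonstration of the -1 / 0xFFFF ambiguity in 16-bit words.
final class SentinelCollision {

    // Narrowing to char keeps only the low 16 bits, so -1 becomes 0xFFFF.
    static char toWord(int value) {
        return (char) value;
    }

    // A reader that treats the word 0xFFFF as the reserved value -1.
    static int fromWord(char word) {
        return word == 0xFFFF ? -1 : word;
    }

    public static void main(String[] args) {
        System.out.println(fromWord(toWord(-1)));     // prints -1
        System.out.println(fromWord(toWord(0xFFFF))); // also prints -1: 65535 is lost
    }
}
```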

We'll need a generic "read" that pulls 16 or 32 bits depending on the value following the version number.

It also requires "bit twiddling" if 32-bit integer mode is activated. It's simple but it exists:

public int read() {
	return data[p++] | (data[p++] << 16);
}

parrt · 2022-02-23T01:06:54Z

Yup, sounds like we need to limit to 0xFFFF - 1 for valid 16 bit values.

You're right shifting by 16 is twiddling :) but simple enough I think.

KvanTTT · 2022-02-23T14:39:18Z

Yup, sounds like we need to limit to 0xFFFF - 1 for valid 16 bit values.

It's quite a strange solution, especially considering ANTLR is also a binary parser and 0xFFFF is a completely correct token type. I've created an issue for this: #3555

It seems to me that we should simply get an IntegerList out of serialize() with the version number first and then the word size (16 or 32) with all words 16 or 32, no bit twiddling. The version number plus the word size indicates the encoding for words. (We can add a method to IntegerList to compute the maximum value.)

I've tried to implement this idea but encountered the following problems:

* The first two integers of the ATN should always take 1 element instead of 2 (to preserve backward compatibility for the version field). This complicates conversion to char arrays.
* It's unclear how to store the -1 value in 16-bit and 32-bit modes: as 0xFFFF and 0xFFFFFFFF? That is ugly and actually incorrect because of the above-mentioned issue with the 0xFFFF token type. Also, the Java runtime requires -2 shifting.
* It generates bigger ATNs: plain int arrays take twice as much memory.
* It complicates the serialization format because the second field with the int size (16 or 32) has to be filled in after the whole serialization step completes. Also, max-value detection has to be integrated into IntegerList.

Ok, let's postpone fixing the mentioned bugs because I don't think implementing such an encoding is a very good idea. Actually, I think using IntegerList for serialization is not very good at all (specifically, restricting integers to the 0..0xFFFF range); ByteList is preferable and is the standard approach in most binary serializers. I've described a binary encoding using a byte array in #3494.

But feel free to use my commit with the writer/reader and implement your solution if you think it's correct.

parrt · 2022-02-24T01:12:12Z

Ok, sounds good. Valid points you raise but we can discuss in future. Thanks for all your efforts.



* Get rid of reflection in CodeGenerator * Rename TargetType -> Language * Remove TargetType enum, use String instead as it was before Create CodeGenerator only one time during grammar processing, refactor code * Add default branch to appendEscapedCodePoint for unofficial targets (Kotlin) * Remove getVersion() overrides from Targets since they return the same value * Remove getLanguage() overrides from Targets since common implementation returns correct value * [again] don't use "quiet" option for mvn tests...hard to figure out what's wrong when failed. * normalize targets to 80 char strings for ATN serialization, except Java which needs big strings for efficiency. * Update actions.md fixed a small typo * Rename `CodeGenerator.createCodeGenerator` to `CodeGenerator.create` * Replace constants on string literals in `appendEscapedCodePoint` * Restore API of Target getLanguage(): protected -> public as it was before appendUnicodeEscapedCodePoint(int codePoint, StringBuilder sb, boolean escape): protected -> private (it's a new helper method, no need for API now) Added comment for appendUnicodeEscapedCodePoint * Introduce caseInsensitive lexer rule option, fixes #3436 * don't ahead of time compile for DART. See 8ca8804#commitcomment-62642779 * Simplify test rig related to timeouts (#3445) * remove all -q quiet mvn options to see output on CI servers. * run the various unit test classes in parallel rather than each individual test method, all except for Swift at the moment: `-Dparallel=classes -DthreadCount=4` * use bigger machine at circleci * No more test groups like parser1, parser2. 
* simplify Swift like the other tests * fix whitespace issues * use 4.10 not 4.9.4 * improve releasing antlr doc * Add Support For Swift Package Manager (#3132) * Add Swift Package Manager Support * Swift Package Dynamic * 【fix】【test】Fix run process path Co-authored-by: Terence Parr <[email protected]> * use src 11 for tool, but 8 for plugin/runtime (#3450) * use src 11 for tool, but 8 for plugin/runtime/runtime-tests. * use 11 in CI builds * cpp/cmake: Fix library install directories (#3447) This installs DLLs in bin directory instead of lib. * Python local import fixes (#3232) * Fixed pygrun relative import issue * Added name to contributors.txt Co-authored-by: Terence Parr <[email protected]> * Update javadoc to 8 and 11 (#3454) * no need for plugin in runtime, always gen svg from dot for javadoc, gen 1.8 not 1.7 doc for runtime. Gen 11 for tool. * tweak doc for 1.8 runtime. Test rig should gen 1.8 not 1.7 * [Go] Fix (*BitSet).equals (#3455) * set tool version for testing * oops reversion tool version as it's not sync'd with runtime and not time to release yet. 
* Remove unused variable from generated code (#3459) * [C++] Fix bugs in UnbufferedCharStream (#3420) * Escape bad words during grammar generation (#3451) * Escape reserved words during grammar generation, fixes #1070 (for -> for_ but RULE_for) Deprecate USE_OF_BAD_WORD * Make name and escapedName consistent across tool and codegen classes Fix other pull request notes * Rename NamedActionChunk to SymbolRefChunk * try out windows runners * rename workflow * Update windows.yml Fix cmd line issue * fix maven issue on windows * use jdk 11 * remove arch arg * display Github status for windows * try testing python3 on windows * try new run for python3 windows * try new run for python3 windows (again) * try new run for python3 windows (again2) * try new run for python3 windows (again3) * try new run for python3 windows (again4) * try new run for python3 windows (again5) * try new run for python3 windows * try new run for python3 windows * try new run for python3 windows * ugh i give up. python won't install on github actions. 
* Update windows.yml try python 3 * Update windows.yml * Update run-tests-python3.cmd * Update run-tests-python3.cmd * Create run-tests-python2.cmd * Update windows.yml * Update run-tests-python2.cmd * Update windows.yml * Update windows.yml * Update windows.yml * Create run-tests-javascript.cmd * Update run-tests-javascript.cmd * Update windows.yml * Update windows.yml * Update windows.yml * Update windows.yml * Update windows.yml * Update windows.yml * Create run-tests-csharp.cmd * Update windows.yml * fix warnings in C# CI * Update windows.yml * Update windows.yml * Create run-tests-dart.cmd * Update windows.yml * Update windows.yml * Update windows.yml * Update windows.yml * Update run-tests-dart.cmd * Update run-tests-dart.cmd * Update run-tests-dart.cmd * Update run-tests-dart.cmd * Update windows.yml * Update windows.yml * Update windows.yml * Create run-tests-go.cmd * Update windows.yml * Update windows.yml * Update windows.yml * GitHub action php (#3474) * Update windows.yml * Create run-tests-php.cmd * Update run-tests-php.cmd * Update run-tests-php.cmd * Update run-tests-php.cmd * Update run-tests-php.cmd * Update windows.yml * Update windows.yml * Update windows.yml * Update run-tests-php.cmd * Update windows.yml * Cleanup ci (#3476) * Delete .appveyor directory * Delete .travis directory * Improve CI concurrency (#3477) * Update windows.yml * Update windows.yml * Update windows.yml * Optimize toArray replace toArray(new T[size]) with toArray(new T[0]) for better performance https://shipilev.net/blog/2016/arrays-wisdom-ancients/#_conclusion * add contributor * resolve conflicts * fix-maven-concurrency (#3479) * fix-maven-concurrency * Update windows.yml * Update windows.yml * Update windows.yml * Update windows.yml * Update windows.yml * Update windows.yml * Update windows.yml * Update run-tests-python2.cmd * Update run-tests-python3.cmd * Update windows.yml * Update windows.yml * Update windows.yml * Update windows.yml * Update windows.yml * Update 
windows.yml * Update windows.yml * Update windows.yml * Update windows.yml * Update run-tests-php.cmd * Update windows.yml * Update run-tests-dart.cmd * Update run-tests-csharp.cmd * Update run-tests-go.cmd * Update run-tests-java.cmd * Update run-tests-javascript.cmd * Update run-tests-php.cmd * Update run-tests-python2.cmd * Update run-tests-python3.cmd * increase Windows CI concurrency for all targets except Dart * Preserve line separators for input runtime tests data (#3483) * Preserve line separators for input data in runtime tests, fix test data Refactor and improve performance of BaseRuntimeTest * Add LineSeparator (\n, \r\n) tests * Set up .gitattributes for LineSeparator_LF.txt (eol=lf) and LineSeparator_CRLF.txt (eol=crlf) * Restore `\n` for all input in runtime tests, add extra LexerExec tests (LineSeparatorLf, LineSeparatorCrLf) * Add generated LargeLexer test, remove LargeLexer.txt descriptor * tweak name to be GeneratedLexerDescriptors * [JavaScript] Migrate from jest to jasmine * [C++] Fix Windows min/max macro collision * [C++] Update cmake README.md to C++17 * remove unnecessary comparisons. * Add useful function writeSerializedATNIntegerHistogram for writing out information concerning how many of each integer value appear in a serialized ATN. * fix comment indicating what goes in the serialized ATN. * move writeSerializedATNIntegerHistogram out of runtime. * follow guidelines * Fix .interp file parsing test for the Java runtime. Also includes separating the generation of the .interp file from writing it out so that we can use both independently. * Delete files no longer needed. 
Should have been part of #3520 * [C++] Optimizations and cleanups and const correctness, oh my * [C++] Optimize LL1Analyzer * [C++] Fix missing virtual destructors * Remove not used PROTECTED, PUBLIC, PRIVATE tokens from ANTLRLexer.g * Remove ANTLR 3 stuff from ANTLR grammars, deprecate ANTLR 3 errors * Remove not used imaginary tokens from ANTLRParser.g * Fix misprints in grammars * ATN serialized data: remove shifting by 2, remove UUID; fix #3515 Regenerate XPathLexer files * Disable native runtime tests (see #3521) * Implement Java-specific ATN data optimization (+-2 shift) * [C++] Remove now unused antlrcpp::Guid * pull new branch diagram from master * use dev not master branch for CI github * update doc from master * add back missing author * [C++] Fix const correctness in ATN and DFA * keep getSerializedATNSegmentLimit at max int * Fixes #3259 make InErrorRecoveryMode public for go * Change code gen template to capitalize InErrorRecoveryMode * [C++] Improve multithreaded performance, fix TSAN error, and fix profiling ATN simulator setup bug * Get rid of unnecessary allocations and calculations in SerializedATN * Get rid of excess char escaping in generated files, decrease size of output files Fix creation of excess fragments for Dart, Cpp, PHP runtimes * Swift: fix binary serialization and use instead of JSON * Fix targetCharValueEscape, make them final and static * [C++] Cleanup ATNDeserializer and remove related deprecated methods from ATNSimulator * Fix for #3557 (getting "go test" to work again). * Convert Python2/3 to use int arrays not strings for ATN encodings (#3561) * Convert Python2/3 to use int arrays not strings for ATN encodings. Also make target indicate int vs string. * rename and reverse ATNSerializedAsInts * add override * remove unneeded method * [C++] Drastically improve multi-threaded performance (#3550) Thanks guys. A major advancement. 
* [C++] Remove duplicate includes and remove unused includes (#3563) * [C++] Lazily deserialize ATN in generated code (#3562) * [Docs] Update Swift Docs (#3458) * Add Swift Package Manager Support * Swift Package Dynamic * 【fix】【test】Fix run process path * [Docs] [Swift] update link, remove expired descriptions Co-authored-by: Terence Parr <[email protected]> * Ascii only ATN serialization (#3566) * go back to generating pure ascii ATN serializations to avoid issues where target compilers might assume ascii vs utf-8. * forgot I had to change php on previous ATN serialization tweak. * change how we escapeChar() per target. * oops; gotta use escapeChar method * rm unneeded case * add @OverRide * use ints not chars for C# (#3567) * use ints not chars for C# * oops. remove 'quotes' * regen from XPathLexer.g4 * simplify ATN with bypass alts mechanism in Java. * Change string to int[] for serialized ATN for C#; removed unneeded `use System` from XPathLexer.g4; regen that grammar. * [C++] Use camel case name in generated lexers and parsers (#3565) * Change string to int array for serialized ATN for JavaScript (#3568) * perf: Add default implementation for Visit in ParseTreeVisitor. (#3569) * perf: Add default implementation for Visit in ParseTreeVisitor. 
Reference: https://github.com/antlr/antlr4/blob/ad29539cd2e94b2599e0281515f6cbb420d29f38/runtime/Java/src/org/antlr/v4/runtime/tree/AbstractParseTreeVisitor.java#L18 * doc: add contributor * Don't use utf decoding...these are just ints (#3573) * [Go] Cleanup and fix ATN deserialization verification (#3574) * [C++] Force generated static data type name to titlecase (#3572) * Use int array not string for ATN in Swift (#3575) * [C++] Fix generated Lexer static data constructor (#3576) * Use int array not string for ATN in Dart (#3578) * Fix PHP codegen to support int ATN serialization (#3579) * Update listener documentation to satisfy the discussion about improving exception handling: #3162 * tweak * [C++] Remove unused LexerATNSimulator::match_calls (#3570) * [C++] Remove unused LexerATNSimulator::match_calls * Remove match_calls from other targets * [Java] Preserve serialized ATN version 3 compatibility (#3583) * add jcking to the contributors list * Update releasing-antlr.md * [C++] Avoid using dynamic_cast where possible by using hand rolled RTTI (#3584) * Revert "[Java] Preserve serialized ATN version 3 compatibility (#3583)" This reverts commit 01bc811. * [C++] Add ANTLR4CPP_PUBLIC attributes to various symbols (#3588) * Update editorconfig for c++ (#3586) * Make it easier to contribute: Add c++ configuration for .editorconfig. Using the observed style with 2 indentation spaces. Signed-off-by: Henner Zeller <[email protected]> * Add hzeller to contributors.txt Signed-off-by: Henner Zeller <[email protected]> * Fix code style and typing to support PHP 8 (#3582) * [Go] Port locking algorithm from C++ to Go (#3571) * Use linux DCO not our old contributors certificate of origin * [C++] Fix bugs in SemanticContext (#3595) * [Go] Do not export Array2DHashSet which is an implementation detail (#3597) * Revert "Use linux DCO not our old contributors certificate of origin" This reverts commit b0f8551. 
* Use signed ints for ATN serialization not uint16, except for java (#3591) * refactor serialize so we don't need comments * more cleanup during refactor * store language in serializer obj * A lexer rule token type should never be -1 (EOF). 0 is fragment but then must be > 0. * Go uses int not uint16 for ATN now. java/go/python3 pass * remove checks for 0xFFFF in Go. * C++ uint16_t to int for ATN. * add mac php dir; fix type on accept() for generated code to be mixed. * Add test from @KvanTTT. This PR fixes #3555 for non-Java targets. * cleanup and add big lexer from #3546 * increase mvn mem size to 2G * increase mvn mem size to 8G * turn off the big ATN lexer test as we have memory issues during testing. * Fixes #3592 * Revert "C++ uint16_t to int for ATN." This reverts commit 4d2ebbf. # Conflicts: # runtime/Cpp/runtime/src/atn/ATNSerializer.cpp # runtime/Cpp/runtime/src/tree/xpath/XPathLexer.cpp * C++ uint16_t to int32_t for ATN. * rm unnecessary include file, updating project file. get rid of the 0xFFFF does in the C++ deserialization * rm refs to 0xFFFF in swift * javascript tests were running as Node...added to ignore list. * don't distinguish between 16 and 32 bit char sets in serialization; Python2/3 updated to work with this change. * update C++ to deserialize only 32-bit sets * 0xFFFF -> -1 for C++ target. * get other targets to use 32-bit sets in serialization. tests pass locally. * refactor to reduce code size * add comment * oops. comment out call to writeSerializedATNIntegerHistogram(). I wonder if this is why it ran out of memory during testing? * all but Java, Node, PHP, Go work now for the huge lexer file; I have set them to ignore. note that the swift target takes over a minute to lex it. I've turned off Node but it does not seem to terminate but it could terminate eventually. * all but Java, Node, PHP, Go work now for the huge lexer file; I have set them to ignore. note that the swift target takes over a minute to lex it. 
I've turned off Node but it does not seem to terminate but it could terminate eventually. * Turn off this big lexer because we get memory errors during continuous integration * Intermediate commit where I have shuffled around all of the -1 flipping and bumping by two. work still needs to be done because the token stream rewriter stuff fails. and I assume the other decoding for human readability testing if doesn't work * convert decode to use int[]; remove dead code. don't use serializeAsChar stuff. more tests pass. * more tests passing. simplify. When copying atn, must run ATN through serializer to set some state flags. * 0xFFFD+ are not valid char * clean up. tests passing now * huge clean up. Got Java working with 32-bit ATNs!Still working on cleanup but I want to run the tests * Cleanup the hack I did earlier; everything still seems to work * Use linux DCO not our old contributors certificate of origin * remove bump-by-2 code * clean up per @KvanTTT. Can't test locally on this box. Will see what CI says. * tweak comment * Revert "Use linux DCO not our old contributors certificate of origin" This reverts commit b0f8551. * see if C++ works in CI for huge ATN * Use linux DCO not our old contributors certificate of origin (#3598) * Use linux DCO not our old contributors certificate of origin * Revert "Use linux DCO not our old contributors certificate of origin" This reverts commit b0f8551. 
* use linux DCO * use linux DCO * Use linux DCO not our old contributors certificate of origin * update release documentation Signed-off-by: Terence Parr <[email protected]> * Equivalent of #3537 * clean up setup * clean up doc version * [Swift] improvements to equality functions (#3302) * fix default equality * equality cases * optional unwrapping * [Swift] Use for in loops (#3303) * common for in loops * reversed loop * drop first loop * for in with default BitSet * [Go] Fix symbol collision in generated lexers and parsers (#3603) * [C++] Refactor and optimize SemanticContext (#3594) * [C++] Devirtualize hand rolled RTTI for performance (#3609) * [C++] Add T::is for type hierarchy checks and remove some dynamic_cast (#3612) * [C++] Avoid copying statically generated serialized ATNs (#3613) * [C++] Refactor PredictionContext and yet more performance improvements (#3608) * [C++] Cleanup DFA, DFAState, LexerAction, and yet more performance improvements (#3615) * fix dependabot issues * [Swift] use stdlib (single pass) (#3602) * this was added to the stdlib in Swift 5 * &>> is defined as lhs >> (rhs % lhs.bitwidth) * the stdlib has these * reduce loops * use indices * append(contentsOf:) * Array literal init works for sets too! * inline and remove bit query functions * more optional handling (#3605) * [C++] Minor improvements to PredictionContext (#3616) * use php runtime dev branch to test dev * update doc to be more explicit about the interaction between lexer actions and semantic predicates; Fixes #3611. Fixes #3606. 
Signed-off-by: Terence Parr <[email protected]> * Refactor js runtime in preparation of future improvements * refactor, 1 file per class, use import, use module semantics, use webpack 5, use eslint * all tests pass * simplifications and alignment with standard js idioms * simplifications and alignment with standard js idioms * support reading legacy ATN * support both module and non-module imports * fix failing tests * fix failing tests * No longer necessary too generate sets or single atom transit that are bigger than 16bits. (#3620) * Updated getting started with Cpp documentation. (#3628) Included specific examples of using ANTLR4_TAG and ANTLR4_ZIP_REPOSITORY in the sample CMakeLists file. * [C++] Free ATNConfig lookup set in readonly ATNConfigSet (#3630) * [C++] Implement configurable PredictionContextMergeCache (#3627) * Allow to choose to switch off building tests in C++ (#3624) The new option to cmake ANTLR_BUILD_CPP_TESTS is default on (so the behavior is as before), but it provides a way to switch off if not needed. The C++ tests pull in an external dependency (googletests), which might conflict if ANTLR is used as a subproject in another cmake project. Signed-off-by: Henner Zeller <[email protected]> * Fix NPE for undefined label, fix #2788 * An interval ought to be a value Interval was a pointer to 2 Ints it ought to be just 2 Ints, which is smaller and more semantically correct, with no need for a cache. However, this technically breaks metadata and AnyObject conformance but people shouldn't be relying on those for an Interval. * [C++] Remove more dynamic_cast usage * [C++] Introduce version macros * add license prefix * Prep 4.10 (#3599) * Tweak doc * Swift was referring to hardcoded version * Start version update script. * add files to update * clean up setup * clean up setup * clean up setup * don't need file * don't need file * Fixes #3600. add instructions and associated code necessary to build the xpath lexers. 
* clean up version nums * php8 * php8 * php8 * php8 * php8 * php8 * php8 * php8 * tweak doc * ok, i give up. php won't bump up too v8 * tweak doc * version number bumped to 4.10 in runtime. * Change the doc for releasing and update to use latest ST 4.3.2 * fix dart version to 4.10.0 * cmd files Cannot use export bash command. * try fixing php ci again * working on deploy Signed-off-by: Terence Parr <[email protected]> * php8 always install. * set js to 4.10.0 not 4.10 * turn off apt update for php circleci * try w/o cimg/php * try setting branch * ok i give up * tweak * update docs for release. * php8 circleci * use 3.5.3 antlr * use 3.5.3-SNAPSHOT antlr * use full 3.5.3 antlr * [Swift] reduce Optionals in APIs (#3621) * ParserRuleContext.children see comment in removeLastChild * TokenStream.getText * Parser._parseListeners this might require changes to the code templates? * ATN {various} * make computeReachSet return empty, not nil * overrides refine optionality * BufferedTokenStream getHiddenTokensTo{Left, Right} return empty not nil * Update Swift.stg * avoid breakage by adding overload of `getText` in extension * tweak to kick off build Signed-off-by: Terence Parr <[email protected]> * try parallelism: 4 circleci * Revert "[Swift] reduce Optionals in APIs (#3621)" This reverts commit b5ccba0. * tweaks to doc * Improve the deploy script and tweak the released doc. 
* use 4.10 not Snapshot for scripts Co-authored-by: Ivan Kochurkin <[email protected]> Co-authored-by: Alexandr <[email protected]> Co-authored-by: 100mango <[email protected]> Co-authored-by: Biswapriyo Nath <[email protected]> Co-authored-by: Benjamin Spiegel <[email protected]> Co-authored-by: Justin King <[email protected]> Co-authored-by: Eric Vergnaud <[email protected]> Co-authored-by: Harry Chan <[email protected]> Co-authored-by: Ken Domino <[email protected]> Co-authored-by: chenquan <[email protected]> Co-authored-by: Marcos Passos <[email protected]> Co-authored-by: Henner Zeller <[email protected]> Co-authored-by: Dante Broggi <[email protected]> Co-authored-by: chris-miner <[email protected]>


* Get rid of reflection in CodeGenerator * Rename TargetType -> Language * Remove TargetType enum, use String instead as it was before Create CodeGenerator only one time during grammar processing, refactor code * Add default branch to appendEscapedCodePoint for unofficial targets (Kotlin) * Remove getVersion() overrides from Targets since they return the same value * Remove getLanguage() overrides from Targets since common implementation returns correct value * [again] don't use "quiet" option for mvn tests...hard to figure out what's wrong when failed. * normalize targets to 80 char strings for ATN serialization, except Java which needs big strings for efficiency. * Update actions.md fixed a small typo * Rename `CodeGenerator.createCodeGenerator` to `CodeGenerator.create` * Replace constants on string literals in `appendEscapedCodePoint` * Restore API of Target getLanguage(): protected -> public as it was before appendUnicodeEscapedCodePoint(int codePoint, StringBuilder sb, boolean escape): protected -> private (it's a new helper method, no need for API now) Added comment for appendUnicodeEscapedCodePoint * Introduce caseInsensitive lexer rule option, fixes #3436 * don't ahead of time compile for DART. See antlr/antlr4@8ca8804#commitcomment-62642779 * Simplify test rig related to timeouts (#3445) * remove all -q quiet mvn options to see output on CI servers. * run the various unit test classes in parallel rather than each individual test method, all except for Swift at the moment: `-Dparallel=classes -DthreadCount=4` * use bigger machine at circleci * No more test groups like parser1, parser2. 
* simplify Swift like the other tests * fix whitespace issues * use 4.10 not 4.9.4 * improve releasing antlr doc * Add Support For Swift Package Manager (#3132) * Add Swift Package Manager Support * Swift Package Dynamic * 【fix】【test】Fix run process path Co-authored-by: Terence Parr <[email protected]> * use src 11 for tool, but 8 for plugin/runtime (#3450) * use src 11 for tool, but 8 for plugin/runtime/runtime-tests. * use 11 in CI builds * cpp/cmake: Fix library install directories (#3447) This installs DLLs in bin directory instead of lib. * Python local import fixes (#3232) * Fixed pygrun relative import issue * Added name to contributors.txt Co-authored-by: Terence Parr <[email protected]> * Update javadoc to 8 and 11 (#3454) * no need for plugin in runtime, always gen svg from dot for javadoc, gen 1.8 not 1.7 doc for runtime. Gen 11 for tool. * tweak doc for 1.8 runtime. Test rig should gen 1.8 not 1.7 * [Go] Fix (*BitSet).equals (#3455) * set tool version for testing * oops reversion tool version as it's not sync'd with runtime and not time to release yet. 
* Remove unused variable from generated code (#3459) * [C++] Fix bugs in UnbufferedCharStream (#3420) * Escape bad words during grammar generation (#3451) * Escape reserved words during grammar generation, fixes #1070 (for -> for_ but RULE_for) Deprecate USE_OF_BAD_WORD * Make name and escapedName consistent across tool and codegen classes Fix other pull request notes * Rename NamedActionChunk to SymbolRefChunk * try out windows runners * rename workflow * Update windows.yml Fix cmd line issue * fix maven issue on windows * use jdk 11 * remove arch arg * display Github status for windows * try testing python3 on windows * try new run for python3 windows * try new run for python3 windows (again) * try new run for python3 windows (again2) * try new run for python3 windows (again3) * try new run for python3 windows (again4) * try new run for python3 windows (again5) * try new run for python3 windows * try new run for python3 windows * try new run for python3 windows * ugh i give up. python won't install on github actions. 
* Update windows.yml try python 3 * Update windows.yml * Update run-tests-python3.cmd * Update run-tests-python3.cmd * Create run-tests-python2.cmd * Update windows.yml * Update run-tests-python2.cmd * Update windows.yml * Update windows.yml * Update windows.yml * Create run-tests-javascript.cmd * Update run-tests-javascript.cmd * Update windows.yml * Update windows.yml * Update windows.yml * Update windows.yml * Update windows.yml * Update windows.yml * Create run-tests-csharp.cmd * Update windows.yml * fix warnings in C# CI * Update windows.yml * Update windows.yml * Create run-tests-dart.cmd * Update windows.yml * Update windows.yml * Update windows.yml * Update windows.yml * Update run-tests-dart.cmd * Update run-tests-dart.cmd * Update run-tests-dart.cmd * Update run-tests-dart.cmd * Update windows.yml * Update windows.yml * Update windows.yml * Create run-tests-go.cmd * Update windows.yml * Update windows.yml * Update windows.yml * GitHub action php (#3474) * Update windows.yml * Create run-tests-php.cmd * Update run-tests-php.cmd * Update run-tests-php.cmd * Update run-tests-php.cmd * Update run-tests-php.cmd * Update windows.yml * Update windows.yml * Update windows.yml * Update run-tests-php.cmd * Update windows.yml * Cleanup ci (#3476) * Delete .appveyor directory * Delete .travis directory * Improve CI concurrency (#3477) * Update windows.yml * Update windows.yml * Update windows.yml * Optimize toArray replace toArray(new T[size]) with toArray(new T[0]) for better performance https://shipilev.net/blog/2016/arrays-wisdom-ancients/#_conclusion * add contributor * resolve conflicts * fix-maven-concurrency (#3479) * fix-maven-concurrency * Update windows.yml * Update windows.yml * Update windows.yml * Update windows.yml * Update windows.yml * Update windows.yml * Update windows.yml * Update run-tests-python2.cmd * Update run-tests-python3.cmd * Update windows.yml * Update windows.yml * Update windows.yml * Update windows.yml * Update windows.yml * Update 
windows.yml * Update windows.yml * Update windows.yml * Update windows.yml * Update run-tests-php.cmd * Update windows.yml * Update run-tests-dart.cmd * Update run-tests-csharp.cmd * Update run-tests-go.cmd * Update run-tests-java.cmd * Update run-tests-javascript.cmd * Update run-tests-php.cmd * Update run-tests-python2.cmd * Update run-tests-python3.cmd * increase Windows CI concurrency for all targets except Dart * Preserve line separators for input runtime tests data (#3483) * Preserve line separators for input data in runtime tests, fix test data Refactor and improve performance of BaseRuntimeTest * Add LineSeparator (\n, \r\n) tests * Set up .gitattributes for LineSeparator_LF.txt (eol=lf) and LineSeparator_CRLF.txt (eol=crlf) * Restore `\n` for all input in runtime tests, add extra LexerExec tests (LineSeparatorLf, LineSeparatorCrLf) * Add generated LargeLexer test, remove LargeLexer.txt descriptor * tweak name to be GeneratedLexerDescriptors * [JavaScript] Migrate from jest to jasmine * [C++] Fix Windows min/max macro collision * [C++] Update cmake README.md to C++17 * remove unnecessary comparisons. * Add useful function writeSerializedATNIntegerHistogram for writing out information concerning how many of each integer value appear in a serialized ATN. * fix comment indicating what goes in the serialized ATN. * move writeSerializedATNIntegerHistogram out of runtime. * follow guidelines * Fix .interp file parsing test for the Java runtime. Also includes separating the generation of the .interp file from writing it out so that we can use both independently. * Delete files no longer needed. 
Should have been part of antlr/antlr4#3520 * [C++] Optimizations and cleanups and const correctness, oh my * [C++] Optimize LL1Analyzer * [C++] Fix missing virtual destructors * Remove not used PROTECTED, PUBLIC, PRIVATE tokens from ANTLRLexer.g * Remove ANTLR 3 stuff from ANTLR grammars, deprecate ANTLR 3 errors * Remove not used imaginary tokens from ANTLRParser.g * Fix misprints in grammars * ATN serialized data: remove shifting by 2, remove UUID; fix #3515 Regenerate XPathLexer files * Disable native runtime tests (see #3521) * Implement Java-specific ATN data optimization (+-2 shift) * [C++] Remove now unused antlrcpp::Guid * pull new branch diagram from master * use dev not master branch for CI github * update doc from master * add back missing author * [C++] Fix const correctness in ATN and DFA * keep getSerializedATNSegmentLimit at max int * Fixes #3259 make InErrorRecoveryMode public for go * Change code gen template to capitalize InErrorRecoveryMode * [C++] Improve multithreaded performance, fix TSAN error, and fix profiling ATN simulator setup bug * Get rid of unnecessary allocations and calculations in SerializedATN * Get rid of excess char escaping in generated files, decrease size of output files Fix creation of excess fragments for Dart, Cpp, PHP runtimes * Swift: fix binary serialization and use instead of JSON * Fix targetCharValueEscape, make them final and static * [C++] Cleanup ATNDeserializer and remove related deprecated methods from ATNSimulator * Fix for #3557 (getting "go test" to work again). * Convert Python2/3 to use int arrays not strings for ATN encodings (#3561) * Convert Python2/3 to use int arrays not strings for ATN encodings. Also make target indicate int vs string. * rename and reverse ATNSerializedAsInts * add override * remove unneeded method * [C++] Drastically improve multi-threaded performance (#3550) Thanks guys. A major advancement. 
* [C++] Remove duplicate includes and remove unused includes (#3563) * [C++] Lazily deserialize ATN in generated code (#3562) * [Docs] Update Swift Docs (#3458) * Add Swift Package Manager Support * Swift Package Dynamic * 【fix】【test】Fix run process path * [Docs] [Swift] update link, remove expired descriptions Co-authored-by: Terence Parr <[email protected]> * Ascii only ATN serialization (#3566) * go back to generating pure ascii ATN serializations to avoid issues where target compilers might assume ascii vs utf-8. * forgot I had to change php on previous ATN serialization tweak. * change how we escapeChar() per target. * oops; gotta use escapeChar method * rm unneeded case * add @OverRide * use ints not chars for C# (#3567) * use ints not chars for C# * oops. remove 'quotes' * regen from XPathLexer.g4 * simplify ATN with bypass alts mechanism in Java. * Change string to int[] for serialized ATN for C#; removed unneeded `use System` from XPathLexer.g4; regen that grammar. * [C++] Use camel case name in generated lexers and parsers (#3565) * Change string to int array for serialized ATN for JavaScript (#3568) * perf: Add default implementation for Visit in ParseTreeVisitor. (#3569) * perf: Add default implementation for Visit in ParseTreeVisitor. 
Reference: https://github.com/antlr/antlr4/blob/ad29539cd2e94b2599e0281515f6cbb420d29f38/runtime/Java/src/org/antlr/v4/runtime/tree/AbstractParseTreeVisitor.java#L18 * doc: add contributor * Don't use utf decoding...these are just ints (#3573) * [Go] Cleanup and fix ATN deserialization verification (#3574) * [C++] Force generated static data type name to titlecase (#3572) * Use int array not string for ATN in Swift (#3575) * [C++] Fix generated Lexer static data constructor (#3576) * Use int array not string for ATN in Dart (#3578) * Fix PHP codegen to support int ATN serialization (#3579) * Update listener documentation to satisfy the discussion about improving exception handling: antlr/antlr4#3162 * tweak * [C++] Remove unused LexerATNSimulator::match_calls (#3570) * [C++] Remove unused LexerATNSimulator::match_calls * Remove match_calls from other targets * [Java] Preserve serialized ATN version 3 compatibility (#3583) * add jcking to the contributors list * Update releasing-antlr.md * [C++] Avoid using dynamic_cast where possible by using hand rolled RTTI (#3584) * Revert "[Java] Preserve serialized ATN version 3 compatibility (#3583)" This reverts commit 01bc811557adad0de63e8db85b78ca8885480378. * [C++] Add ANTLR4CPP_PUBLIC attributes to various symbols (#3588) * Update editorconfig for c++ (#3586) * Make it easier to contribute: Add c++ configuration for .editorconfig. Using the observed style with 2 indentation spaces. 
Signed-off-by: Henner Zeller <[email protected]> * Add hzeller to contributors.txt Signed-off-by: Henner Zeller <[email protected]> * Fix code style and typing to support PHP 8 (#3582) * [Go] Port locking algorithm from C++ to Go (#3571) * Use linux DCO not our old contributors certificate of origin * [C++] Fix bugs in SemanticContext (#3595) * [Go] Do not export Array2DHashSet which is an implementation detail (#3597) * Revert "Use linux DCO not our old contributors certificate of origin" This reverts commit b0f8551c9a674a0a1e045b9a710800df28e72c10.

KvanTTT added 3 commits February 20, 2022 22:36

Add ATNDataReader, ATNDataWriter

88f9bc7

Clean up ATN serializer/deserializer code

Support of full int range in serializer and deserializer (up to Integer.MAX_VALUE) fix antlr#840, fix antlr#1863, fix antlr#2732, fix antlr#3338

a8378d3

Fix C# runtime to support full int range

c29474e
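
For readers unfamiliar with the idea behind the "full int range" commit above, here is a minimal sketch of one way to store non-negative 32-bit values in 16-bit units by reserving the high bit as a continuation flag. It is an illustration under stated assumptions, not the layout used by the PR's `ATNDataWriter`/`ATNDataReader`: the class name `CompactIntCodec`, the `encode`/`decode` helpers, and the exact bit layout are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical codec (not ANTLR's API): packs non-negative ints into 16-bit units. */
public final class CompactIntCodec {
    private CompactIntCodec() {}

    /** Values below 0x8000 take one unit; larger values take two (flag bit + low 15 bits, then the rest). */
    static char[] encode(int... values) {
        StringBuilder out = new StringBuilder();
        for (int value : values) {
            if (value < 0) {
                throw new IllegalArgumentException("negative value: " + value);
            }
            if (value < 0x8000) {
                out.append((char) value);                       // common case: a single unit
            } else {
                out.append((char) (0x8000 | (value & 0x7FFF))); // flag bit plus low 15 bits
                out.append((char) (value >>> 15));              // remaining high 16 bits
            }
        }
        return out.toString().toCharArray();
    }

    /** Reverses encode(): a set high bit means the next unit carries the high 16 bits. */
    static int[] decode(char[] data) {
        List<Integer> values = new ArrayList<>();
        int i = 0;
        while (i < data.length) {
            int first = data[i++];
            if ((first & 0x8000) == 0) {
                values.add(first);
            } else {
                if (i == data.length) {
                    throw new IllegalArgumentException("truncated two-unit value");
                }
                int high = data[i++];
                values.add((high << 15) | (first & 0x7FFF));
            }
        }
        return values.stream().mapToInt(Integer::intValue).toArray();
    }

    public static void main(String[] args) {
        // 810567 is the out-of-range element quoted in the title of issue #3338.
        char[] encoded = encode(11, 810567, Integer.MAX_VALUE);
        System.out.println(encoded.length + " units -> " + java.util.Arrays.toString(decode(encoded)));
    }
}
```

In a scheme like this, small values (the common case raised in the review discussion) still cost a single unit, but the usable one-unit range shrinks from 16 bits to 15, which is the kind of trade-off that question is about.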

This was referenced Feb 20, 2022

ATN serialization improvements (Java only for demo) #3505

Closed

Set version to 4.10 #3460

Closed

KvanTTT changed the title ~~Increase ATN states size limit~~ Increase ATN states size limit, simplify ATN serialization Feb 21, 2022

KvanTTT closed this Feb 23, 2022

parrt added a commit to parrt/antlr4 that referenced this pull request Mar 19, 2022

cleanup and add big lexer from antlr#3546

4bc9e38
