LUCENE-10239: upgrade jflex (1.7.0 -> 1.8.2) #452

rmuir · 2021-11-18T00:44:53Z

Upgrade jflex.

This change doesn't alter the behavior of any of the analyzers (no Unicode version changes or grammar refactorings); it is just the minimum needed to get the new tooling working.

rmuir · 2021-11-18T00:52:03Z

I thought about doing a large refactoring here but quickly backed off. I think we should just upgrade to the latest jflex as a standalone change.

The trickiest parts (and ones needing close review):

- The skeleton files. I reviewed diff -u of the two current skeleton files, making sure I understood the changes needed to work without buffer expansion. Then I updated the default skeleton file from the jflex sources (there are many changes) and created the "without buffer expansion" version with the appropriate logic.
- The jflex change to support > 2GB files. Some variables become long and we have to change some code to work with them. In general Lucene won't work with such files (e.g. OffsetAttribute is based on int), so I tried to add only minimal casts. I'd rather not use Math.toIntExact as it may impact performance. If we want to improve safety, maybe it should be via a cheaper mechanism.
- Warnings suppression. jflex now tries to suppress its own warnings, but it does so in a nonstandard way, and you can't stack these annotations, so we have to add a little hack. I commented on the issues in the jflex bug tracker (linked in the gradle hack).
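
To illustrate the cast trade-off in the second point, here is a minimal sketch. The class and method names are hypothetical and are not part of the patch; it only contrasts a plain narrowing cast with Math.toIntExact on a value just past the int range.

```java
// Hypothetical illustration, not code from the patch.
class OffsetCastSketch {

  /** Plain narrowing cast: no range check, silently wraps for values outside int range. */
  static int narrow(long charCount) {
    return (int) charCount;
  }

  /** Checked conversion: throws ArithmeticException if the value does not fit in an int. */
  static int narrowChecked(long charCount) {
    return Math.toIntExact(charCount);
  }

  public static void main(String[] args) {
    long small = 1_000L;
    long huge = Integer.MAX_VALUE + 1L; // one past the largest int

    System.out.println(narrow(small)); // value fits, so it is unchanged
    System.out.println(narrow(huge));  // wraps to Integer.MIN_VALUE

    try {
      narrowChecked(huge);
    } catch (ArithmeticException e) {
      System.out.println("checked conversion rejected " + huge);
    }
  }
}
```

The plain cast costs nothing extra but hides overflow; the checked variant adds a comparison and a possible exception on every call, which is the performance concern mentioned above.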

rmuir · 2021-11-18T00:54:41Z

gradle/generation/jflex.gradle

+          token: 'SuppressWarnings("FallThrough")',
+          value: 'SuppressWarnings({"fallthrough","unused"})'
+    )
+


This is the hack we have to do for now. See jflex-de/jflex#762, where a way to customize the suppressed warnings without find-replace is being discussed.
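
For readers unfamiliar with the find-replace approach, the following is a rough Java sketch of the same text substitution. The token and value strings are copied from the snippet above; the class name, method name, and sample input are hypothetical, and the actual build performs the replacement from gradle rather than with this code.

```java
// Hypothetical illustration of the find-replace step, not the gradle implementation.
class SuppressWarningsRewriteSketch {

  // Strings taken from the jflex.gradle snippet in this review thread.
  static final String TOKEN = "SuppressWarnings(\"FallThrough\")";
  static final String VALUE = "SuppressWarnings({\"fallthrough\",\"unused\"})";

  /** Replaces every occurrence of the generated annotation text with the merged one. */
  static String rewrite(String generatedSource) {
    return generatedSource.replace(TOKEN, VALUE);
  }

  public static void main(String[] args) {
    // Assumed shape of a generated scanner header, for demonstration only.
    String generated =
        "@SuppressWarnings(\"FallThrough\")\n"
            + "public final class ExampleTokenizerImpl {}\n";
    System.out.print(rewrite(generated));
  }
}
```

Because @SuppressWarnings cannot be applied twice to the same declaration, merging both warning names into one annotation is the only way to add "unused" alongside what jflex emits.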

dweiss

lgtm

rmuir · 2021-11-18T22:25:44Z

For convenience of reviewing: here is the diff between default skeleton and "buffer-expansion-disabled" skeleton. It is kinda the only way to review it since we brought in all the upstream changes.

think:lucene[LUCENE-10239]$ diff -u gradle/generation/jflex/skeleton.default.txt gradle/generation/jflex/skeleton.disable.buffer.expansion.txt
--- gradle/generation/jflex/skeleton.default.txt	2021-11-17 19:04:46.844620167 -0500
+++ gradle/generation/jflex/skeleton.disable.buffer.expansion.txt	2021-11-17 19:05:01.124853267 -0500
@@ -5,7 +5,7 @@
   /** Initial size of the lookahead buffer. */
 --- private static final int ZZ_BUFFERSIZE = ...;

-  /** Lexical states. */
+  /** Lexical States. */
 ---  lexical states, charmap

   /** Error code for "Unknown internal scanner error". */
@@ -94,18 +94,11 @@
       zzStartRead = 0;
     }

-    /* is the buffer big enough? */
-    if (zzCurrentPos >= zzBuffer.length - zzFinalHighSurrogate) {
-      /* if not: blow it up */
-      char newBuffer[] = new char[zzBuffer.length * 2];
-      System.arraycopy(zzBuffer, 0, newBuffer, 0, zzBuffer.length);
-      zzBuffer = newBuffer;
-      zzEndRead += zzFinalHighSurrogate;
-      zzFinalHighSurrogate = 0;
-    }
-
     /* fill the buffer with new input */
-    int requested = zzBuffer.length - zzEndRead;
+    int requested = zzBuffer.length - zzEndRead - zzFinalHighSurrogate;
+    if (requested == 0) {
+      return true;
+    }
     int numRead = zzReader.read(zzBuffer, zzEndRead, requested);

     /* not supposed to occur according to specification of java.io.Reader */
@@ -119,6 +112,9 @@
         if (numRead == requested) { // We requested too few chars to encode a full Unicode character
           --zzEndRead;
           zzFinalHighSurrogate = 1;
+          if (numRead == 1) {
+            return true;
+          }
         } else {                    // There is room in the buffer for at least one more char
           int c = zzReader.read();  // Expecting to read a paired low surrogate char
           if (c == -1) {
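
A simplified model of the changed refill arithmetic may help when reading the diff. This is only a sketch under stated assumptions: the class, field, and method names are hypothetical, surrogate handling and buffer compaction are left out entirely, and a true return value is taken to mean "no more input can be buffered", as the added early returns suggest.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical, heavily simplified model; not the generated scanner code.
class FixedBufferRefillSketch {
  private final char[] buffer;
  private int endRead = 0;
  // Always 0 in this sketch; kept only so the arithmetic mirrors the diff.
  private final int finalHighSurrogate = 0;

  FixedBufferRefillSketch(int size) {
    buffer = new char[size];
  }

  /** Returns true when nothing more can be buffered (buffer full or end of input). */
  boolean refill(Reader reader) throws IOException {
    int requested = buffer.length - endRead - finalHighSurrogate;
    if (requested == 0) {
      return true; // buffer is full and is never grown
    }
    int numRead = reader.read(buffer, endRead, requested);
    if (numRead == 0) {
      throw new IOException("reader returned 0 characters");
    }
    if (numRead < 0) {
      return true; // end of input
    }
    endRead += numRead;
    return false;
  }

  int endRead() {
    return endRead;
  }

  public static void main(String[] args) throws IOException {
    FixedBufferRefillSketch sketch = new FixedBufferRefillSketch(4);
    Reader reader = new StringReader("abcdef");
    System.out.println(sketch.refill(reader)); // reads up to 4 chars
    System.out.println(sketch.refill(reader)); // buffer already full
  }
}
```

Without the requested == 0 guard, a full buffer would lead to a zero-length read, which is the case the skeleton comment describes as not supposed to occur.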

dweiss · 2021-11-19T11:34:08Z

I tried to look up why this no-buffer-expansion is needed. I see LUCENE-8527 and some corner cases there... but why is it used here and there and not all across the board (some tokenizers use the default and others use the no-buffer version)?

rmuir · 2021-11-19T11:45:06Z

@dweiss see https://issues.apache.org/jira/browse/LUCENE-5897 for more background on that

sarowe

LGTM, thanks Robert.

I like the strategy of upgrading the dependency first and then working on the Unicode upgrades later.

rmuir · 2021-11-19T14:23:28Z

thank you @dweiss and @sarowe for reviewing.

LUCENE-10239: upgrade jflex (1.7.0 -> 1.8.2)

7fbd5c0

rmuir requested review from dweiss and sarowe November 18, 2021 00:44

rmuir added 2 commits November 18, 2021 01:19

LUCENE-10239: oops, i didnt commit everything

1eb3961

Revert "LUCENE-10239: oops, i didnt commit everything"

f5037c2

This reverts commit 1eb3961.

dweiss approved these changes Nov 18, 2021

LUCENE-10239: restore commented version of debug info, we'll figure t…

1be1eeb

…his out...

sarowe approved these changes Nov 19, 2021

rmuir merged commit af831d2 into apache:main Nov 19, 2021

asfgit pushed a commit that referenced this pull request Nov 19, 2021

LUCENE-10239: upgrade jflex (1.7.0 -> 1.8.2) (#452)

ee56d31

asfimport mentioned this pull request Mar 22, 2022

upgrade jflex (1.7.0 -> 1.8.2) [LUCENE-10239] #11275

Closed
