Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LUCENE-10239: upgrade jflex (1.7.0 -> 1.8.2) #452

Merged
merged 4 commits into from
Nov 19, 2021
Merged

Conversation

rmuir
Copy link
Member

@rmuir rmuir commented Nov 18, 2021

Upgrade jflex.

Change doesn't alter the behavior of any of the analyzers (unicode version or grammar refactorings), just the minimal to get new tooling working.

Upgrade jflex.

Change doesn't alter the behavior of any of the analyzers (unicode
version or grammar refactorings), just the minimal to get new tooling
working.
@rmuir rmuir requested review from dweiss and sarowe November 18, 2021 00:44
@rmuir
Copy link
Member Author

rmuir commented Nov 18, 2021

I thought to try to do a large refactoring here and quickly backed off, I think let's just upgrade to the latest jflex as a standalone change.

The trickiest parts (and ones needing close review):

  1. the skeleton files. I reviewed diff -u of the two current skeleton files, making sure i understood the changes needed to work without buffer expansion. Then I updated the default skeleton file from jflex sources (there are many changes) and created the "without buffer expansion" version with appropriate logic.

  2. jflex change to support > 2GB files. Some variables become long and we have to change some code to work with it. In general Lucene won't work with such files (e.g. OffsetAttribute is based on int), so I tried to just add minimal casts. I'd rather not use Math.toIntExact as it may impact performance. If we want to improve safety, maybe it should be via a cheaper mechanism.

  3. warnings suppression. jflex starts to try to suppress its own warnings, but they do the warning in a nonstandard way, and you can't stack these annotations, so we have to add a little hack. I commented on the issues in the jflex bug tracker (linked in the gradle hack).

token: 'SuppressWarnings("FallThrough")',
value: 'SuppressWarnings({"fallthrough","unused"})'
)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the hack we have to do for now, see jflex-de/jflex#762 where a method is being discussed to customize the suppress warnings without find-replace

Copy link
Contributor

@dweiss dweiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@rmuir
Copy link
Member Author

rmuir commented Nov 18, 2021

For convenience of reviewing: here is the diff between default skeleton and "buffer-expansion-disabled" skeleton. It is kinda the only way to review it since we brought in all the upstream changes.

think:lucene[LUCENE-10239]$ diff -u gradle/generation/jflex/skeleton.default.txt gradle/generation/jflex/skeleton.disable.buffer.expansion.txt
--- gradle/generation/jflex/skeleton.default.txt	2021-11-17 19:04:46.844620167 -0500
+++ gradle/generation/jflex/skeleton.disable.buffer.expansion.txt	2021-11-17 19:05:01.124853267 -0500
@@ -5,7 +5,7 @@
   /** Initial size of the lookahead buffer. */
 --- private static final int ZZ_BUFFERSIZE = ...;

-  /** Lexical states. */
+  /** Lexical States. */
 ---  lexical states, charmap

   /** Error code for "Unknown internal scanner error". */
@@ -94,18 +94,11 @@
       zzStartRead = 0;
     }

-    /* is the buffer big enough? */
-    if (zzCurrentPos >= zzBuffer.length - zzFinalHighSurrogate) {
-      /* if not: blow it up */
-      char newBuffer[] = new char[zzBuffer.length * 2];
-      System.arraycopy(zzBuffer, 0, newBuffer, 0, zzBuffer.length);
-      zzBuffer = newBuffer;
-      zzEndRead += zzFinalHighSurrogate;
-      zzFinalHighSurrogate = 0;
-    }
-
     /* fill the buffer with new input */
-    int requested = zzBuffer.length - zzEndRead;
+    int requested = zzBuffer.length - zzEndRead - zzFinalHighSurrogate;
+    if (requested == 0) {
+      return true;
+    }
     int numRead = zzReader.read(zzBuffer, zzEndRead, requested);

     /* not supposed to occur according to specification of java.io.Reader */
@@ -119,6 +112,9 @@
         if (numRead == requested) { // We requested too few chars to encode a full Unicode character
           --zzEndRead;
           zzFinalHighSurrogate = 1;
+          if (numRead == 1) {
+            return true;
+          }
         } else {                    // There is room in the buffer for at least one more char
           int c = zzReader.read();  // Expecting to read a paired low surrogate char
           if (c == -1) {

@dweiss
Copy link
Contributor

dweiss commented Nov 19, 2021

I tried to look up why this no-buffer-expansion is needed. I see LUCENE-8527 and some corner cases there... but why is it used here and there and not all across the board (some tokenizers use the default and others use the no-buffer version).

@rmuir
Copy link
Member Author

rmuir commented Nov 19, 2021

@dweiss see https://issues.apache.org/jira/browse/LUCENE-5897 for more background on that

Copy link
Contributor

@sarowe sarowe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks Robert.

I like the strategy of upgrading the dependency first and then working on the Unicode upgrades later.

@rmuir
Copy link
Member Author

rmuir commented Nov 19, 2021

thank you @dweiss and @sarowe for reviewing.

@rmuir rmuir merged commit af831d2 into apache:main Nov 19, 2021
asfgit pushed a commit that referenced this pull request Nov 19, 2021
Upgrade jflex.

Change doesn't alter the behavior of any of the analyzers (unicode
version or grammar refactorings), just the minimal to get new tooling
working.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants