Lucene search #11542

LoayGhreeb · 2024-07-28T04:54:10Z

Lucene search backend

Follow-up to: #8963, #8206, #11326.

Indexing

All bib fields and linked files (PDFs) are indexed separately in two different indexes.
Indexing operations (startup, adding, removing, and updating) are performed in the background. Each index operates in a separate thread.
Startup:
- Bib Fields Index: Recalculated for the entire library on startup.
- Linked Files Index: Only the differences between the current library and previously indexed files are recalculated. Files that have been updated on disk will be reindexed.
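The startup difference for the Linked Files Index could be computed roughly as follows. This is only a sketch in plain Java: the class `LinkedFilesDiff`, its method names, and the use of last-modified timestamps to detect files updated on disk are assumptions for illustration, not the code in this PR.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper; not JabRef's actual implementation. */
final class LinkedFilesDiff {

    /**
     * Files that must be (re)indexed.
     *
     * @param indexed path -> last-modified timestamp recorded when the file was indexed
     * @param current path -> last-modified timestamp of the files linked in the library now
     */
    static Set<String> toIndex(Map<String, Long> indexed, Map<String, Long> current) {
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, Long> file : current.entrySet()) {
            Long indexedAt = indexed.get(file.getKey());
            // New file, or modified on disk since it was indexed (assumed criterion)
            if (indexedAt == null || indexedAt < file.getValue()) {
                result.add(file.getKey());
            }
        }
        return result;
    }

    /** Files that are in the index but no longer linked from the library. */
    static Set<String> toRemove(Map<String, Long> indexed, Map<String, Long> current) {
        Set<String> result = new TreeSet<>();
        for (String path : indexed.keySet()) {
            if (!current.containsKey(path)) {
                result.add(path);
            }
        }
        return result;
    }
}
```

Unchanged files appear in neither set, which is what keeps startup cheap for large libraries.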
Storage:
- Bib Fields Index: Stored in memory rather than on disk, because BibEntry#hashCode is not persistent across sessions.
- Linked Files Index: Stored in the directory provided by AppDirs.
Each bib entry is stored as a Lucene document. Each bib field is tokenized and added to the document. Additionally, all bib fields (except "Groups", due to #7996) are collected into one field, "any", which is used as the default field during searches.
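The way the "any" field is assembled can be pictured with a small plain-Java sketch. The class `AnyFieldSketch`, the lowercase field name "groups", and joining the values with a space are assumptions made for illustration; the real code adds Lucene fields to a Lucene document instead of filling a map.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical illustration; the PR builds a Lucene document, not a map. */
final class AnyFieldSketch {
    static final String ANY_FIELD = "any";
    static final String GROUPS_FIELD = "groups"; // assumed field name

    /** Returns every bib field unchanged plus the combined "any" field. */
    static Map<String, String> toDocumentFields(Map<String, String> bibFields) {
        Map<String, String> document = new LinkedHashMap<>(bibFields);
        StringJoiner any = new StringJoiner(" ");
        for (Map.Entry<String, String> field : bibFields.entrySet()) {
            // The groups field stays searchable on its own but is kept out of "any"
            if (!GROUPS_FIELD.equalsIgnoreCase(field.getKey())) {
                any.add(field.getValue());
            }
        }
        document.put(ANY_FIELD, any.toString());
        return document;
    }
}
```

A query without a field prefix would then be run against "any", while a query such as groups:Imported would still hit the dedicated field.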
For both the Bib Fields Index and Linked Files Index, the IndexWriter is opened only once at startup and remains open during the runtime.
During shutdown, all changes are committed to the index, and the index is optimized by merging all segments into a single segment.

Analyzing

Bib Fields Index: A custom analyzer is used to support "contains" searches, LaTeX, and Unicode characters. The analyzer includes:
- WhitespaceTokenizer: Suitable for bib fields as it does not escape special characters, preserving LaTeX formatting.
- LowerCaseFilter: Converts all text to lowercase.
- StopFilter: Removes English stop words.
- LatexToUnicodeFoldingFilter: Converts LaTeX-encoded characters to their Unicode equivalents.
- ASCIIFoldingFilter: Converts Unicode characters to their ASCII equivalents.
- EdgeNGramTokenFilter: Supports "prefix" or "starts with" searches. The NGramTokenizer could be used for "contains" searches, but it is slower during indexing.
The same analyzer used for indexing bib fields is also used for searching, but without the EdgeNGramTokenFilter.
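To see why the index-time and query-time chains differ, here is a plain-Java imitation of the filter chain. It does not use Lucene: the class `AnalyzerChainSketch`, the tiny stop-word list, and the `MAX_GRAM` limit are assumptions for illustration, ASCII folding is approximated with `java.text.Normalizer`, and the LaTeX-to-Unicode step is left out.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Plain-Java imitation of the analyzer chain; not the Lucene classes themselves. */
final class AnalyzerChainSketch {
    // Tiny illustrative list; Lucene's English stop-word set is longer
    private static final Set<String> STOP_WORDS = Set.of("a", "an", "and", "of", "the");
    private static final int MAX_GRAM = 10; // hypothetical upper bound

    /** Query-time chain: whitespace split, lowercase, stop words, ASCII folding. */
    static List<String> analyzeQuery(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            String token = raw.toLowerCase(Locale.ROOT);
            if (token.isEmpty() || STOP_WORDS.contains(token)) {
                continue;
            }
            tokens.add(foldToAscii(token));
        }
        return tokens;
    }

    /** Index-time chain: the query-time chain followed by edge n-grams. */
    static List<String> analyzeForIndex(String text) {
        List<String> grams = new ArrayList<>();
        for (String token : analyzeQuery(text)) {
            int longest = Math.min(MAX_GRAM, token.length());
            for (int length = 1; length <= longest; length++) {
                grams.add(token.substring(0, length)); // prefixes only
            }
        }
        return grams;
    }

    /** Decomposes accented characters and drops the combining marks. */
    private static String foldToAscii(String token) {
        String decomposed = Normalizer.normalize(token, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }
}
```

With this sketch, indexing "M\u00fcller" stores the prefixes "m", "mu", "mul", and so on up to "muller". A query for "mul" is analyzed without n-grams, stays a single token, and matches one of the stored prefixes, whereas "ller" matches nothing, which is the trade-off of edge n-grams against full n-grams.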
The Linked Files Index uses the EnglishAnalyzer for both indexing and searching. This analyzer converts all strings to lowercase, removes English stop words, and applies the PorterStemFilter, which reduces words to their base or root form, known as the "stem". For example, terms like "computer", "compute", "computations", and "computerized" are all reduced to the stem "comput", which yields more relevant search results.

Searching

Both bib fields and linked files are searched in the background. If the search flags include full-text search, a MultiReader is used to search across both indexes.
Searches use Lucene's Near Real-Time (NRT) search, which allows queries to see uncommitted changes. A SearcherManager manages the IndexSearchers and is refreshed before each search query, so matches reflect the latest changes without the need to commit them.
Lucene Search Syntax: https://lucene.apache.org/core/9_11_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#Overview
Support for regular expression searches: https://lucene.apache.org/core/9_11_0/core/org/apache/lucene/util/automaton/RegExp.html

Search Results

Added a new column displaying the search score.
The file icon in the table now displays a magnifying glass when search results are found within a linked file.
Fixed issues with highlighting search results in the Preview Viewer and the Source Tab.

Search Groups free-search expression

Caution

Before proceeding, create a backup of your library. This is an alpha release, and the search syntax has changed.

If the library contains Search Groups, users will be prompted to migrate them to the new search syntax.
Search Group matches are now cached, and switching between search groups has been improved.

Removed

Case-sensitive and exact match searches are no longer supported.
Removed the case-sensitive and regular expression toggles from the search bar and the search groups dialog.
Removed the description of search strings.
Removed all search rules.

Screenshots

Search groups migration.
Full-text search results.
Highlighting search results.

Closes: #8857
Closes: #11374
Closes: #11378
Closes: #8626
Closes: #11595
Closes: #11246
Closes: #7996
Closes: #8067
Closes: #1975

Mandatory checks

Change in CHANGELOG.md described in a way that is understandable for the average user (if applicable)
Tests created for changes (if applicable)
Manually tested changed features in running JabRef (always required)
Screenshots added in PR description (for UI changes)
Checked developer's documentation: Is the information available and up to date? If not, I outlined it in this pull request.
Checked documentation: Is the information available and up to date? If not, I created an issue at https://github.com/JabRef/user-documentation/issues or, even better, I submitted a pull request to the documentation repository.

# Conflicts: # CHANGELOG.md # src/jmh/java/org/jabref/benchmarks/Benchmarks.java # src/main/java/org/jabref/gui/JabRefFrame.java # src/main/java/org/jabref/gui/LibraryTab.java # src/main/java/org/jabref/gui/entryeditor/EntryEditor.java # src/main/java/org/jabref/gui/entryeditor/fileannotationtab/FulltextSearchResultsTab.java # src/main/java/org/jabref/gui/externalfiles/ExternalFilesEntryLinker.java # src/main/java/org/jabref/gui/externalfiles/ImportHandler.java # src/main/java/org/jabref/gui/groups/GroupDialogView.java # src/main/java/org/jabref/gui/groups/GroupsPreferences.java # src/main/java/org/jabref/gui/maintable/MainTable.java # src/main/java/org/jabref/gui/maintable/MainTableColumnFactory.java # src/main/java/org/jabref/gui/maintable/columns/FileColumn.java # src/main/java/org/jabref/gui/preview/PreviewPanel.java # src/main/java/org/jabref/gui/search/GlobalSearchBar.java # src/main/java/org/jabref/gui/search/RebuildFulltextSearchIndexAction.java # src/main/java/org/jabref/gui/search/SearchResultsTableDataModel.java # src/main/java/org/jabref/logic/pdf/search/indexing/IndexingTaskManager.java # src/main/java/org/jabref/model/database/BibDatabaseContext.java # src/main/java/org/jabref/model/pdf/search/SearchFieldConstants.java # src/main/java/org/jabref/model/search/rules/SearchRules.java # src/main/java/org/jabref/preferences/JabRefPreferences.java # src/main/java/org/jabref/preferences/SearchPreferences.java # src/test/java/org/jabref/gui/groups/GroupTreeViewModelTest.java

# Conflicts: # CHANGELOG.md # src/main/java/org/jabref/model/search/rules/ContainsBasedSearchRule.java # src/main/java/org/jabref/model/search/rules/GrammarBasedSearchRule.java # src/main/java/org/jabref/model/search/rules/RegexBasedSearchRule.java

github-actions · 2024-09-04T18:48:14Z

The build for this PR is no longer available. Please visit https://builds.jabref.org/main/ for the latest build.

calixtus · 2024-09-05T14:16:09Z

src/main/java/org/jabref/gui/StateManager.java

@@ -58,6 +59,7 @@ public class StateManager {
    private final OptionalObjectProperty<LibraryTab> activeTab = OptionalObjectProperty.empty();
    private final ObservableList<BibEntry> selectedEntries = FXCollections.observableArrayList();
    private final ObservableMap<String, ObservableList<GroupTreeNode>> selectedGroups = FXCollections.observableHashMap();
+    private final ObservableMap<String, LuceneManager> luceneManagers = FXCollections.observableHashMap();


For later discussion: This may be a hint that LuceneManager should be named differently. Maybe "SearchIndex" or something similar? Usually you have one manager for the entire app, not a manager for each file...

calixtus · 2024-09-05T18:33:36Z

src/main/java/org/jabref/gui/entryeditor/fileannotationtab/FulltextSearchResultsTab.java

-                    for (String resultTextHtml : searchResult.getAnnotationsResultStringsHtml()) {
-                        content.getChildren().addAll(TooltipTextUtil.createTextsFromHtml(resultTextHtml.replace("</b> <b>", " ")));
-                        content.getChildren().addAll(new Text(System.lineSeparator()), lineSeparator(0.8), createPageLink(linkedFile, searchResult.getPageNumber()));
+        stateManager.activeSearchQuery(SearchType.NORMAL_SEARCH).get().ifPresent(searchQuery -> {


Looks very weird to put that into a lambda expression. It also makes stack traces way longer. Maybe a simple if check is enough, or better, a fail-fast strategy (if (!activeSearchQuery.isPresent()) { return; } )
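A minimal sketch of the two styles side by side, for comparison. All names here (`GuardClauseSketch`, `renderWithLambda`, `renderWithGuard`) are hypothetical and the bodies are stand-ins for the real rendering code in FulltextSearchResultsTab:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

/** Hypothetical stand-in for the rendering code under review. */
final class GuardClauseSketch {

    /** Lambda style: the whole body lives inside ifPresent. */
    static List<String> renderWithLambda(Optional<String> activeQuery, List<String> hits) {
        List<String> lines = new ArrayList<>();
        activeQuery.ifPresent(query -> {
            for (String hit : hits) {
                lines.add(query + ": " + hit);
            }
        });
        return lines;
    }

    /** Fail-fast style: return early, keep the main path flat and out of a lambda frame. */
    static List<String> renderWithGuard(Optional<String> activeQuery, List<String> hits) {
        List<String> lines = new ArrayList<>();
        if (activeQuery.isEmpty()) {
            return lines;
        }
        String query = activeQuery.get();
        for (String hit : hits) {
            lines.add(query + ": " + hit);
        }
        return lines;
    }
}
```

Both methods return the same result; the second one just avoids the extra lambda frames in stack traces and one level of nesting.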

calixtus · 2024-09-05T18:43:14Z

src/main/java/org/jabref/gui/maintable/MainTableDataModel.java

 import com.tobiasdiez.easybind.EasyBind;
 import com.tobiasdiez.easybind.Subscription;
+import org.jspecify.annotations.Nullable;


I don't think that @Siedlerchr will like this...

Yeah... by default all is nullable in java

calixtus · 2024-09-05T18:56:16Z

src/main/java/org/jabref/model/groups/SearchGroup.java

    private static final Logger LOGGER = LoggerFactory.getLogger(SearchGroup.class);
-    private final GroupSearchQuery query;
+
+    @ADR(38)


In DuplicateSearch it uses the comment format:

In Java, annotations are limited. We tried to use e-adr wherever possible. Where not, we used Java comments. #research.

calixtus · 2024-09-05T20:08:12Z

🎉

subhramit · 2024-09-06T09:28:48Z

Congratulations, Loay!

btut and others added 30 commits November 6, 2022 19:28

Use pattern matching for cast

0f32e91

Co-authored-by: Christoph <[email protected]>

Fix pattern matching

231f200

Merge branch 'main' into version6

202f514

Merge branch 'version6' into luceneSearchBackend

f602207

Fix merge

675d75c

Speed up switches between sorting/filtering modes

49eeb1d

Merge remote-tracking branch 'upstream/main' into luceneSearchBackend

a25fa00

# Conflicts: # CHANGELOG.md # src/main/java/org/jabref/gui/LibraryTab.java # src/main/java/org/jabref/gui/StateManager.java # src/main/java/org/jabref/gui/openoffice/OpenOfficePanel.java

Fixed merge errors

08aca25

Fixed small issues

15b2152

Removed obsolete tests, fixed some tests

f8b643b

Merge branch 'main' into luceneSearchBackend

229fed7

Fixed merge error in CHANGELOG.md

536ecfa

Fixed checkstyle

1117e17

Fixed more tests

ff4ad1c

Removed obsolete tests

7bdccf0

Fixes "Fixed merge error in CHANGELOG.md" by removing duplicate entries

40de951

This reverts commit 536ecfa.

Merge remote-tracking branch 'upstream/luceneSearchBackend' into luce…

cbb5461

…neSearchBackend

Merge branch 'main' into luceneSearchBackend

9e8863d

WiP on tests

1f38a68

Checkstyle

00cecf8

Checkstyle

ad85c77

Merge remote-tracking branch 'upstream/luceneSearchBackend' into luce…

986bf3f

…neSearchBackend

Update Java version

ed5cebf

Refine logging

afc4d67

Fix compile error

2f3203f

Add LuceneTest

c771458

Update CHANGELOG.md

b795fa6

LoayGhreeb and others added 4 commits September 3, 2024 01:32

Merge branch 'main' into LuceneSearch

54fd783

Fix line break

e6f3b5d

Fix tests

95458d3

Use EnglishAnalyzer for indexing/searching linked files

e5665ce

https://github.com/apache/lucene/blob/68cc8734ca28a9db800e4192a636d3b490cfd41a/lucene/analysis/common/src/java/org/apache/lucene/analysis/en/EnglishAnalyzer.java#L101-L110

koppor reviewed Sep 3, 2024

src/main/java/org/jabref/model/groups/SearchGroup.java (outdated, resolved)

src/main/java/org/jabref/model/search/SearchFieldConstants.java (resolved)

LoayGhreeb and others added 2 commits September 4, 2024 09:24

Ask to wait for linked files indexing on shutdown

ce21f4a

When closing JabRef, only ask users to wait for the linked files indexer to finish. The bib fields indexer is recalculated on startup, so it doesn't need to be completed before shutdown.

Merge branch 'main' into LuceneSearch

cf92f0a

LoayGhreeb mentioned this pull request Sep 4, 2024

No more shown finished background tasks #11574

Merged

6 tasks

LoayGhreeb added 4 commits September 4, 2024 14:05

Use EdgeNGram instead of NGram

6dbc187

Return comment

c9d8230

Update CHANGELOG.md

fa8344b

Merge branch 'main' into LuceneSearch

5f05f4a

LoayGhreeb marked this pull request as ready for review September 4, 2024 18:23

Merge branch 'main' into LuceneSearch

ab87e24

calixtus reviewed Sep 5, 2024

calixtus approved these changes Sep 5, 2024

koppor enabled auto-merge September 5, 2024 19:55

koppor approved these changes Sep 5, 2024

koppor added this pull request to the merge queue Sep 5, 2024

Merged via the queue into main with commit 6af91b9 Sep 5, 2024
31 of 32 checks passed

koppor deleted the LuceneSearch branch September 5, 2024 20:07

This was referenced Sep 7, 2024

Update lucene version #11719

Merged

Fix search test NPE #11749

Merged
