Releases · dbmdz/solr-ocrhighlighting

13 Sep 11:43

jbaiter

0.9.1

2780ad2

0.9.1: Solr 9.7 compatibility, fixes (latest release)

Changed

During indexing, we now need only a single pass through the input files instead of
two. This is in preparation for the S3 storage backend, where we don't have the luxury
of relying on a page cache to paper over our inefficiencies.

Fixed

Fix bug that resulted in missed matches during highlighting (#442, thanks @schmika!)
Fix bug that resulted in incomplete reads from the input file under some circumstances (#441, thanks @schmika!)
Compatibility with Solr 9.7

Contributors

schmika

12 Jun 07:28

jbaiter

0.9.0

c701e35

0.9.0: Major Performance Improvements

This release brings major performance and stability improvements; upgrading is highly recommended.

Changed:

Add support for multithreaded highlighting. Uses all available logical CPU cores by default and can be tweaked with the numHighlightingThreads and maxQueuedPerThread attributes on the OcrHighlightComponent in solrconfig.xml.
Removed PageCacheWarmer, no longer needed due to multithreading support.
Completely refactored, simplified and optimized the I/O stack to reduce the number of file system reads and allocations/data copies during highlighting. This accounts for a significant performance improvement over previous versions (4-8 times faster in a synthetic benchmark that was not I/O-bound).
We no longer memory-map files for reading. Benchmarking revealed that memory mapping did not improve performance with the new I/O stack (probably due to the reduced number of actual reads); on the contrary, performance without it was better for many concurrent queries. A huge drawback of the memory-mapped approach was that in the presence of I/O errors like disappearing mounts, truncated files, etc., the JVM could simply crash (due to the kernel sending a SIGBUS signal when encountering an I/O error).
When locating breaks in the forward direction, we used to put the break point at the end of the limiting element's opening tag. With the new implementation, the break point is at the start of the limiting element's opening tag, i.e. no part of the limiting element is contained in the created section. This leads to a small change in the scores assigned to passages (since BM25 uses the length of the scored content in its calculations).
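
The break-point change can be illustrated with a minimal Python sketch. This is not the plugin's Java implementation: the function name forward_break, the legacy flag and the MiniOCR-style sample markup are assumptions made purely for illustration.

```python
import re


def forward_break(text: str, start: int, limit_tag: str, legacy: bool = False) -> int:
    """Return the offset of the next forward break at or after `start`.

    `limit_tag` is the name of the limiting element (e.g. "l" for a line).
    With legacy=True the break sits after the opening tag (old behavior),
    otherwise it sits before the opening tag (new behavior).
    """
    # Match "<tag" only when followed by whitespace, "/" or ">", so that
    # "<l" does not accidentally match a longer element name.
    pattern = re.compile("<" + re.escape(limit_tag) + r"(?=[\s/>])")
    match = pattern.search(text, start)
    if match is None:
        # No further limiting element: the section runs to the end of the text.
        return len(text)
    if not legacy:
        return match.start()
    tag_end = text.find(">", match.start())
    return len(text) if tag_end < 0 else tag_end + 1


sample = "foo</l><l>bar</l>"
new_section = sample[:forward_break(sample, 0, "l")]
old_section = sample[:forward_break(sample, 0, "l", legacy=True)]
```

In this sketch the new section is shorter than the old one by exactly the length of the opening tag, which is the kind of length difference that shifts BM25 scores slightly.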

Fixed:

When using source pointers with multiple files, the plugin no longer leaks file descriptors. We previously didn't close the currently opened file when opening the next one.
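
As a rough illustration of the fix, the following Python sketch reads several source files in sequence while holding at most one open handle at a time. The function name read_source_files is hypothetical and the code does not mirror the plugin's actual Java I/O stack.

```python
from typing import Iterable, Iterator, Tuple


def read_source_files(paths: Iterable[str]) -> Iterator[Tuple[str, bytes]]:
    """Yield (path, content) pairs, closing each file before opening the next."""
    for path in paths:
        # The with-block closes the handle as soon as the read is done,
        # even if reading raises an exception.
        with open(path, "rb") as handle:
            data = handle.read()
        # At this point the handle is already closed, so no descriptor
        # is kept open while the caller processes the data.
        yield path, data
```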

03 May 12:33

jbaiter

0.8.6

b4df5e7

0.8.6: Solr 9.6 Support

Changed:

Add support for Solr 9.6
Removed unused classes
Refactored timeout logic to match new approach used in Solr >= 9.5
Dependency Updates

25 Apr 14:33

jbaiter

0.8.5

5c5be52

0.8.5

Changed:

Missing files no longer fail the complete search request; instead, OCR highlighting for the affected document is skipped
Add support for Solr 9.5
Updated documentation with a warning for Solr 9 users to disable security sandboxing when using pointers to external files

Fixed:

Regular highlighting in case no hl field can be determined works again (#404)
Passage building across more than two concatenated files works now (#422)

29 Jan 10:57

jbaiter

0.8.4

d6935fa

0.8.4

Changed:

Add support for Solr 9.4
Improved sanitization of broken OCR XML during parsing

Fixed:

More robust bytecode patching for Solr 7/8
Frontend in example setup is working again

08 Mar 12:07

github-actions

wip

3e22d53

WIP build (use at own risk), pre-release

Update setup-python action

21 Oct 14:46

jbaiter

0.8.3

1f4fddb

0.8.3

Another bugfix release, fixing some edge cases with 'odd' OCR files.

Bugfixes:

hOCR: Fix truncated passages during highlighting due to incomplete forward passes while parsing candidate passages.
All Formats: Use an iterative solution for skipping empty words instead of a recursive strategy, which could lead to stack overflows when encountering OCR files with many empty words.
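
The difference between the two strategies for skipping empty words can be sketched in Python as follows. Both function names are hypothetical and the word list is a simplified stand-in for the plugin's parser state, so this is an illustration of the general technique rather than the actual fix.

```python
from typing import Optional, Sequence


def skip_empty_recursive(words: Sequence[str], idx: int = 0) -> Optional[int]:
    """Recursive strategy: uses one stack frame per skipped empty word."""
    if idx >= len(words):
        return None
    if words[idx].strip():
        return idx
    return skip_empty_recursive(words, idx + 1)


def skip_empty_iterative(words: Sequence[str], idx: int = 0) -> Optional[int]:
    """Iterative strategy: constant stack depth, however many words are empty."""
    while idx < len(words):
        if words[idx].strip():
            return idx
        idx += 1
    return None
```

With a long run of empty words, the recursive variant needs a call depth proportional to the length of the run, which is what can exhaust the stack; the loop does not have this problem.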

Other Changes:

We now have pre-releases in the Solr repository that can be used to experiment with the latest changes in the plugin before the official release. For users not using the repository, a pre-release build is also pushed to the GitHub Releases page on every update to the repository.

22 Sep 16:28

jbaiter

0.8.2

9f4201a

0.8.2

Bugfix release for an edge-case in hOCR parsing.

Bugfixes:

hOCR: Fix stack overflow when handling empty words in combination with a partially hyphenated word

Other Changes:

Improved the error message for errors during highlighting. The message now includes the source pointer of the failed document or, if OCR is stored in the index, the beginning of the broken content, as well as the internal Lucene document identifier. By adding the [docid] field to the returned fields of the failing query, the internal id is added to every document in the result set, which should allow quick identification of the documents that cause issues during highlighting.
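
A request that returns the internal id alongside each document could be built as in the following Python sketch. The helper name build_highlight_query, the default id field name and the URL and field names in the test data are assumptions; hl=true is the standard Solr highlighting switch and is assumed to be required here as well.

```python
from urllib.parse import urlencode


def build_highlight_query(select_url: str, query: str, ocr_field: str,
                          id_field: str = "id") -> str:
    """Build a select URL that also returns the internal Lucene document id."""
    params = {
        "q": query,
        "hl": "true",
        "hl.ocr.fl": ocr_field,
        # [docid] is Solr's document transformer for the internal Lucene id.
        "fl": f"{id_field},[docid]",
    }
    # urlencode percent-encodes the brackets and the comma in the fl value.
    return f"{select_url}?{urlencode(params)}"
```

The internal id from the error message can then be matched against the [docid] values in the response to find the offending document.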

10 Jun 08:56

jbaiter

0.8.1

a505a25

0.8.1

This is a bugfix release targeting mainly the MiniOCR and ALTO implementations.

Bugfixes:

ALTO: Fix handling of empty words. Previously, any words after a word element with no text would be skipped entirely during indexing 😱😱.
MiniOCR: Fix handling of empty words. Previously, a word element with no text would make the parser crash.
MiniOCR: Make the wh attribute on <p> page elements actually optional. The documentation said it was optional, but the parser would crash when attempting to handle elements without the attribute.

Other Changes:

A warning will now be logged if none of the fields requested with hl.ocr.fl exist or are defined as stored fields. Previously highlighting would just not work, with no indications to users as to why this was the case.

01 Jun 10:53

jbaiter

0.8.0

5a5a0b3

0.8.0

The major improvement in this version is compatibility with Solr 9.

Due to a number of API changes in Solr and Lucene, we now have to ship two separate releases,
one for Solr 7 and 8 and one for Solr 9, so please take extra care when downloading to pick
the correct release. In the Package Repository, the Solr 7/8 release will always have a version
with the suffix -solr78.

We also changed the package namespaces for all user-facing components so they are easier
to identify and write. What this means is that you will need to change the class="..."
attributes in your solrconfig.xml and schema.xml to match the new package namespaces.
Whenever you previously had de.digitalcollections.solrocr.<other stuff>.ClassName, you
now have to simply write solrocr.ClassName.

New Features:

For users running Solr in SolrCloud mode, the plugin can now be installed via Solr's
Package Manager:

$ bin/solr package add-repo dbmdz.github.io https://dbmdz.github.io/solr
$ bin/solr package install ocrhighlighting  # For Solr 9
$ bin/solr package install ocrhighlighting:0.8.0-solr78  # For Solr 7 and 8

Note that Solr 7/8 users need to manually specify the version.

API changes:

Changed deployment process to use two separate packages, one for Solr 9 and later and one for Solr 7/8, with a -solr78.jar suffix
Changed namespace of all user-facing components to simply solrocr and moved all
user-facing component classes to it:
- de.digitalcollections.solrocr.lucene.filters.OcrCharFilterFactory
  → solrocr.OcrCharFilterFactory
- de.digitalcollections.solrocr.lucene.filters.ExternalUtf8ContentFilterFactory
  → solrocr.ExternalUtf8ContentFilterFactory
- de.digitalcollections.solrocr.lucene.OcrAlternativesFilterFactory
  → solrocr.OcrAlternativesFilterFactory
- de.digitalcollections.solrocr.lucene.OcrHighlightComponent
  → solrocr.OcrHighlightComponent
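
For larger configurations, the renames listed above could be applied mechanically. The following Python sketch performs plain string replacement on the contents of solrconfig.xml or schema.xml; the helper name migrate_class_names is hypothetical, and any other classes you reference from the old namespace would need to be added to the mapping.

```python
# Old class name -> new class name, taken from the list of renamed components.
RENAMED_CLASSES = {
    "de.digitalcollections.solrocr.lucene.filters.OcrCharFilterFactory":
        "solrocr.OcrCharFilterFactory",
    "de.digitalcollections.solrocr.lucene.filters.ExternalUtf8ContentFilterFactory":
        "solrocr.ExternalUtf8ContentFilterFactory",
    "de.digitalcollections.solrocr.lucene.OcrAlternativesFilterFactory":
        "solrocr.OcrAlternativesFilterFactory",
    "de.digitalcollections.solrocr.lucene.OcrHighlightComponent":
        "solrocr.OcrHighlightComponent",
}


def migrate_class_names(config_text: str) -> str:
    """Replace old fully qualified class names with their solrocr equivalents."""
    for old_name, new_name in RENAMED_CLASSES.items():
        config_text = config_text.replace(old_name, new_name)
    return config_text
```

Reviewing the result with a diff before deploying is advisable, since the sketch does not parse the XML and simply rewrites every occurrence of the old names.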

Bugfixes:

Fix handling of quoted property values in hOCR title attributes. We deviate a bit from the spec
to be more compatible with existing real-world data: Values like x_source can now be quoted
in single or double quotes, or not quoted at all; the parser handles every case.
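
The lenient quoting rule can be sketched in Python as below. This is a simplified illustration and not the plugin's parser: the function name parse_title_properties is hypothetical, and the sketch assumes that property values contain no semicolons, even inside quotes.

```python
from typing import Dict


def parse_title_properties(title: str) -> Dict[str, str]:
    """Split an hOCR title value into properties, accepting optional quotes."""
    props: Dict[str, str] = {}
    for part in title.split(";"):
        part = part.strip()
        if not part:
            continue
        # The property name is everything up to the first space.
        key, _, value = part.partition(" ")
        value = value.strip()
        # Strip one pair of matching single or double quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
            value = value[1:-1]
        props[key] = value
    return props
```

All three spellings of a value (double-quoted, single-quoted, unquoted) yield the same result in this sketch, which is the behavior described above.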
