Releases · dbmdz/solr-ocrhighlighting

22 Mar 08:04

jbaiter

0.7.2

294720c

0.7.2

Yet another bugfix release.

Bugfixes:

Fixed handling of single quotes in MiniOCR input. Previously, files using them were not recognized as valid MiniOCR files (#247, thanks @mspalti for the fix!)
Fixed OutOfBoundsException when using alternatives with very long tokens (#230, thanks @fd17 for the report and review)

Contributors

mspalti and fd17

24 Sep 19:51

jbaiter

0.7.1

5224ae7

0.7.1

Another bugfix release, upgrading is recommended.

Bugfixes:

Fix text display and "number of snippets" slider in demo setup
Fix instances where we were using Java SDK methods that relied on a default locale, which led to
hard-to-debug issues in some locales
Fix an issue where a highlight rectangle would sometimes be oversized
Fix issue in XML input validation when encountering very long XML opening tags
Really fix handling of documents with no content (at all)
Fix issue with namespaced ALTO documents

12 Jul 12:45

bitzl

0.7.0

015efe4

0.7.0

Release version including various fixes:

Documents with no content (#163: "Solr returns error with empty ALTO, by design ?")
Indexing documents with elements that contain only an empty string (#171: "Fix indexing of OCR fields that only contain an empty string (fixes #167)")
Empty OCR highlighting snippets (#173: "Empty ocrHighlighting snippets (n:1)")

11 May 15:27

jbaiter

0.6.0

a0f9702

0.6.0

This is a major new release with significant improvements in stability, accuracy and most importantly performance.
Updating is highly recommended, especially for ALTO users, who can expect a speed-up in indexing of up to 6000% (i.e. 60x as fast). We also recommend updating your JVM to at least Java 11 (LTS), since Java 9 introduced a feature that speeds up highlighting significantly.

Performance:

Indexing performance drastically improved for ALTO, slightly worse for hOCR and MiniOCR. Under the hood we switched from Regular Expression and Automaton-based parsing to a proper XML parser to support more features and improve correctness. This drastically improved indexing performance for ALTO (6000% speedup, the previous implementation was pathologically slow), but caused a big hit for hOCR (~57% slower) and a slight hit for MiniOCR (~15% slower). These numbers are based on benchmarks done on a ramdisk, so the changes are very likely to be less pronounced in practice, depending on the choice of storage. Note that this makes the parser also more strict in regard to whitespace. If you were indexing OCR documents without any whitespace between word elements before, you will run into problems (see #147).
Highlighting performance significantly improved for all formats.
The time for highlighting a single snippet has gone down for all formats (ALTO 12x as fast, hOCR 10x as fast, MiniOCR 6x as fast). Again, these numbers are based on benchmarks performed on a ramdisk and might be less pronounced in practice, depending on the storage layer.

New Features:

Indexing alternative forms encoded in the source OCR files.
All supported formats offer a way to encode alternative readings for recognized words. The plugin can now parse these from the input files and index them at the same position as the default form. This is a form of index-time term expansion (much like the Synonym Graph Filter shipping with Solr). For example, if your OCR file has the alternatives christmas and christrias for the token clistrias in the span presents on clistrias eve, users would be able to search for "presents christmas" and "presents clistrias" and would get the correct match in both cases, both with full highlighting. Refer to the corresponding section in the documentation for instructions on setting it up.
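To illustrate what "indexed at the same position" means, here is a minimal sketch of a toy positional index. It is not the plugin's implementation; the function name and the data layout are made up for illustration, and only the tokens are taken from the example above.

```python
from collections import defaultdict


def build_positional_index(tokens):
    """Map each term to the set of token positions where it occurs.

    `tokens` is a list of (default_form, alternatives) pairs. Every
    alternative is indexed at the same position as its default form.
    """
    index = defaultdict(set)
    for position, (default_form, alternatives) in enumerate(tokens):
        for form in (default_form, *alternatives):
            index[form.lower()].add(position)
    return index


# Token stream for the span "presents on clistrias eve".
tokens = [
    ("presents", []),
    ("on", []),
    ("clistrias", ["christmas", "christrias"]),
    ("eve", []),
]
index = build_positional_index(tokens)

# Both the default form and the alternative resolve to the same position,
# so either query term can be highlighted at the same word.
print(sorted(index["clistrias"]))  # [2]
print(sorted(index["christmas"]))  # [2]
```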
On-the-fly repair of 'broken' markup.
OcrCharFilterFactory has a new option fixMarkup that enables on-the-fly repair of invalid XML in OCR input documents, namely problems that can arise when the markup contains unescaped instances of <, > and &. This option is disabled by default since it incurs a bit of a performance hit during indexing. We recommend enabling it when your OCR engine exhibits this problem and you are unable to fix the files on disk.
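If you can fix the files on disk instead, the kind of repair involved can be sketched roughly as follows. This is an assumption-laden illustration and not what fixMarkup does internally: it only handles stray ampersands (not < or >), the element names are invented, and any "&name;" sequence is treated as a valid entity.

```python
import re
import xml.etree.ElementTree as ET

# An ampersand that does NOT start a named, decimal or hexadecimal entity.
STRAY_AMPERSAND = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)"
)


def escape_stray_ampersands(markup):
    """Escape ampersands that are not part of an entity reference."""
    return STRAY_AMPERSAND.sub("&amp;", markup)


# Hypothetical markup: the first word contains an unescaped ampersand.
broken = "<line><word>Smith & Sons</word><word>A&amp;B</word></line>"
fixed = escape_stray_ampersands(broken)

# Parsing `broken` directly raises ET.ParseError; `fixed` parses cleanly.
root = ET.fromstring(fixed)
print([word.text for word in root])  # ['Smith & Sons', 'A&B']
```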
Return snippets in order of appearance.
By default, Solr scores each highlighted passage as a "mini-document" and returns the passages ordered by their decreasing score. While this is a good match for a lot of use cases, there are many others that are better served by a simple by-appearance order. This can now be controlled with the new hl.ocr.scorePassages parameter, which switches to the by-appearance sort order if set to off (it is set to on by default).
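The difference between the two orderings can be sketched as below. The Passage class and its fields are hypothetical and only model the idea; they are not part of the plugin's API.

```python
from dataclasses import dataclass


@dataclass
class Passage:
    start_offset: int  # where the passage starts in the document
    score: float       # relevance score assigned to the passage
    text: str


def order_passages(passages, score_passages=True):
    """Order passages by descending score, or by appearance if disabled."""
    if score_passages:
        return sorted(passages, key=lambda p: p.score, reverse=True)
    return sorted(passages, key=lambda p: p.start_offset)


passages = [
    Passage(start_offset=300, score=0.7, text="third"),
    Passage(start_offset=0, score=0.4, text="first"),
    Passage(start_offset=120, score=0.9, text="second"),
]

by_score = [p.text for p in order_passages(passages)]
by_appearance = [p.text for p in order_passages(passages, score_passages=False)]
print(by_score)       # ['second', 'third', 'first']
print(by_appearance)  # ['first', 'second', 'third']
```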

API changes:

No more need for an explicit hl.fl parameter for highlighting non-OCR fields. By default, if highlighting is enabled and no hl.fl parameter is passed by the user, Solr falls back to highlighting every stored field in the document. Previously this did not work with the plugin and users had to always explicitly specify which fields they wanted to have highlighted. This is no longer necessary, the default behavior now works as expected.
Add a new hl.ocr.trackPages parameter to disable page tracking during highlighting.
This is intended for users who index one page per document. In these cases, seeking backwards to determine the page identifier of a match is not needed, since the containing document carries enough information to identify the page. Disabling page tracking improves highlighting performance because less backwards-seeking in the input files is required.
Add new expandAlternatives attribute to OcrCharFilterFactory. This enables the parsing of alternative readings from input files (see above and the corresponding section in the documentation)
Add new hl.ocr.scorePassages parameter to disable sorting of passages by their score. See the above section under New Features for an explanation of this flag.
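As a rough sketch, a client could assemble the query string for these parameters like this. The field name ocr_text and the helper function are hypothetical, and passing on/off for hl.ocr.trackPages is an assumption modeled on the on/off values described for hl.ocr.scorePassages; check the documentation for the accepted values.

```python
from urllib.parse import parse_qs, urlencode


def build_highlight_query(query, ocr_field, track_pages=True, score_passages=True):
    """Build a URL-encoded query string for an OCR highlighting request."""
    params = {
        "q": f"{ocr_field}:({query})",
        "hl": "true",
        "hl.ocr.fl": ocr_field,
        # Assumption: both flags accept "on"/"off" as values.
        "hl.ocr.trackPages": "on" if track_pages else "off",
        "hl.ocr.scorePassages": "on" if score_passages else "off",
    }
    return urlencode(params)


# One page per document, snippets in order of appearance.
query_string = build_highlight_query(
    "christmas", "ocr_text", track_pages=False, score_passages=False
)
decoded = parse_qs(query_string)
print(decoded["hl.ocr.trackPages"], decoded["hl.ocr.scorePassages"])
```

Note that no hl.fl parameter is included, since it is no longer required for the default behavior to work.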

Bugfixes:

Improved tolerance for incomplete bounding boxes. Previously the occurrence of an incomplete bounding box in a snippet (i.e. with one or more missing coordinates) would crash the whole query. We now simply insert a 0 default value in these cases.
Improvements in the handling of hyphenated terms. This release fixes a few bugs in edge cases when handling hyphenated words during indexing, highlighting and snippet text generation.
Handle empty field values during indexing. This would previously lead to an exception since the OCR parsers would try to either load a file from the empty string or parse OCR markup from it.
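The tolerance for incomplete bounding boxes can be pictured with the following sketch. It uses the ALTO attribute names HPOS, VPOS, WIDTH and HEIGHT, but the function itself is hypothetical and not the plugin's parser.

```python
def parse_bounding_box(attributes):
    """Read an ALTO-style bounding box, using 0 for missing coordinates."""

    def coordinate(name):
        raw = attributes.get(name, "")
        try:
            return float(raw)
        except ValueError:
            # Missing or unparsable coordinate: fall back to the default.
            return 0.0

    return tuple(coordinate(name) for name in ("HPOS", "VPOS", "WIDTH", "HEIGHT"))


# HEIGHT is missing, so it falls back to 0 instead of failing the query.
box = parse_bounding_box({"HPOS": "12.5", "VPOS": "30", "WIDTH": "100"})
print(box)  # (12.5, 30.0, 100.0, 0.0)
```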

07 Oct 16:08

jbaiter

0.5.0

a69e586

0.5.0

No breaking changes this time around, but a few essential bugfixes, more stability and a new feature.

API changes:

Snippets are now sorted by their descending score/relevancy. Previously the order was non-deterministic, which
broke the use case for dynamically fetching more snippets.
Add a new boolean hl.ocr.alignSpans parameter to align text and image spans. This new option (disabled by
default) ensures that the spans in text and image match, i.e. it forces the <em>...</em> in the highlighted text
to correspond to actual OCR word boundaries.
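
The idea behind aligning spans to word boundaries can be sketched as follows. This is only an illustration under simplifying assumptions: words are derived from whitespace here, whereas the plugin works with the word elements of the OCR document, and the function name is made up.

```python
import re


def align_to_words(words, start, end):
    """Expand the character range [start, end) to cover whole words.

    `words` is a list of (word_start, word_end) character ranges in
    reading order. Returns None if the range overlaps no word.
    """
    overlapping = [(s, e) for s, e in words if s < end and e > start]
    if not overlapping:
        return None
    return overlapping[0][0], overlapping[-1][1]


text = "presents on christmas eve"
words = [match.span() for match in re.finditer(r"\S+", text)]

# A match covering only "christ" is widened to the full word.
aligned = align_to_words(words, 12, 18)
print(aligned, text[aligned[0]:aligned[1]])  # (12, 21) christmas
```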

Bugfixes:

Fix regular highlighting in distributed setup. Regular, non-OCR highlighting was broken in previous versions due
to a bad check in the shard response collection phase if users only requested regular highlighting, but not for OCR
fields
Highlight spans are now always consistent with the spans designated in text. Due to a bug, it would sometimes
happen that the number of spans was inconsistent between the two.
Fix de-hyphenation in ALTO region texts. Previously only the complete snippet text would be de-hyphenated, but not
the individual regions.
Fix post-match content detection in ALTO. A bug in this part of the code resulted in crashes when highlighting
certain ALTO documents.

02 Jun 15:21

jbaiter

0.4.1

90d0c99

0.4.1

This is a patch release with a fix for excessive memory usage during indexing, especially when indexing multiple large (>100MiB) documents in parallel.

11 May 11:01

jbaiter

0.4.0

96d00d3

0.4.0

This is a major release with a focus on compatibility and performance.

Fixes compatibility with Solr/Lucene 8.4 and 7.6. We now also have an integration test suite that checks for compatibility with all Solr versions >= 7.5 on every change, so compatibility breakage should be kept to a minimum in the future.

Breaking API changes:

Add new pages key to snippet response with page dimensions. This can be helpful if you need to calculate the snippet coordinates relative to the page image dimensions.
Replace page key on regions and highlights with pageIdx. That is, instead of a string with the corresponding page identifier, we have a numerical index into the pages array of the snippet. This reduces the redundancy introduced by the new pages parameter at the cost of having to do some pointer chasing in clients.
Add new parentRegionIdx key on highlights. This is a numerical index into the regions array and allows for multi-column/multi-page highlighting, where a single highlighting span can be composed of regions on multiple disjoint parts of the page or even multiple pages.
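The "pointer chasing" a client has to do can be sketched as below. The keys pages, regions, highlights, pageIdx and parentRegionIdx come from the notes above. Everything else is an assumption made for illustration: the coordinate keys (ulx, uly, lrx, lry), the page keys (id, width, height), the nesting of highlights as a list of spans that each hold a list of boxes, and the sample values. Consult the documentation for the actual response format.

```python
def region_as_fractions(snippet, region):
    """Express a region's coordinates relative to its page dimensions."""
    page = snippet["pages"][region["pageIdx"]]
    return {
        "page": page["id"],
        "left": region["ulx"] / page["width"],
        "top": region["uly"] / page["height"],
        "right": region["lrx"] / page["width"],
        "bottom": region["lry"] / page["height"],
    }


def pages_of_highlight(snippet, span):
    """Page ids touched by one highlight span, in order of first use."""
    page_ids = []
    for box in span:
        region = snippet["regions"][box["parentRegionIdx"]]
        page_id = snippet["pages"][region["pageIdx"]]["id"]
        if page_id not in page_ids:
            page_ids.append(page_id)
    return page_ids


# Hypothetical snippet with a single page, region and highlight span.
snippet = {
    "pages": [{"id": "page_7", "width": 1000, "height": 2000}],
    "regions": [{"ulx": 100, "uly": 200, "lrx": 500, "lry": 400, "pageIdx": 0}],
    "highlights": [[{"ulx": 10, "uly": 5, "lrx": 90, "lry": 40, "parentRegionIdx": 0}]],
}

relative = region_as_fractions(snippet, snippet["regions"][0])
touched = pages_of_highlight(snippet, snippet["highlights"][0])
print(relative)
print(touched)
```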

Format changes:

hocr: Add support for retrieving page identifier from x_source and ppageno properties
hocr: Strip out title tag during indexing and highlighting
ALTO: The plugin now supports ALTO files with coordinates expressed as floating point numbers (thanks to @mspalti!)

Performance:

Add concurrent preloading for highlighting target files. This can result in a nice performance boost, since by the time the plugin gets to actually highlighting the files, their contents are already in the OS' page cache. See the Performance Tuning section in the docs for more context.
This release changes the way we handle UTF-8 during context generation, resulting in an additional ~25% speed up compared to previous versions.
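
The general idea of warming the page cache by reading files concurrently can be sketched like this. It is a generic illustration, not the plugin's preloading code; the function names, chunk size and worker count are arbitrary assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def _read_through(path, chunk_size=1 << 20):
    """Read a file to the end and return the number of bytes seen."""
    total = 0
    with open(path, "rb") as handle:
        while chunk := handle.read(chunk_size):
            total += len(chunk)
    return total


def preload(paths, max_workers=4):
    """Read all files concurrently so later reads can hit the page cache."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(paths, pool.map(_read_through, paths)))
```

Calling preload with a list of file paths returns a mapping from each path to its size in bytes. The contents are discarded, since the only goal is to have the operating system cache them.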

Miscellaneous:

Log warnings during source pointer parsing
Filter out empty files during indexing
Add new documentation section on performance tuning
Empty regions or regions with only whitespace are no longer included in the output

26 Jul 15:34

stefan-it

0.3.1

a15c613

0.3.1

Fix compatibility with Solr 8.2.

25 Jul 16:03

jbaiter

0.3

f0514e6

0.3

This release brings some sweeping changes across the codebase, all aimed at making the plugin much simpler to use and less complicated to maintain. However, this also means a lot of breaking changes. It's best to go through the documentation (which has been simplified and was largely rewritten) again and see what changes you need to apply to your setup.

Specifying path resolving is no longer necessary. You now pass a pointer to one or more files (or regions thereof) directly in the index document. The pointer will be stored with the document and used to locate the input file(s) during highlighting. Refer to the documentation for more details. This should also increase indexing performance and decrease the memory requirements, since the complete OCR document does not need to be kept in memory.
hl.weightMatches now works with UTF8. You no longer need to ASCII-encode your OCR files to be able to use Solr's superior highlighting approach. Due to the first change, the plugin now takes care of mapping UTF8 byte-offsets to character offsets by itself. This also means all code related to storing byte offsets in payloads is gone.
Specifying the OCR format is no longer necessary. The plugin now offers a single OcrFormatCharFilter that will auto-detect the OCR format used for a given document and select the correct analysis chain. This means that using multiple OCR formats for the same field is now possible!
Performance improvements. Some optimizations were done to the way the plugin seeks through the OCR files. You should see a substantial performance improvement for documents with a low density of multi-byte codepoints, especially English. Also included is a new hl.ocr.maxPassages parameter to control how many passages are looked at for building the response, which can have an enormous impact on performance.
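
To see why byte offsets and character offsets diverge for UTF-8 input, consider this minimal sketch. It is an illustration only and not how the plugin performs the mapping; the function name is made up.

```python
def byte_to_char_offset(data, byte_offset):
    """Convert a byte offset into UTF-8 `data` to a character offset.

    Raises UnicodeDecodeError if the offset falls inside a multi-byte
    sequence.
    """
    return len(data[:byte_offset].decode("utf-8"))


text = "Bücher über"
data = text.encode("utf-8")

# "ü" takes two bytes, so the second word starts at byte 8 but character 7.
char_offset = byte_to_char_offset(data, 8)
print(len(data), len(text), char_offset)  # 13 11 7
```

For pure ASCII text both offsets are identical, which is why ASCII-encoding the OCR files used to work around the problem.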

Major Breaking Changes:

HighlightComponent is now called OcrHighlightComponent for more clarity
OCR fields to be highlighted now need to be passed with the hl.ocr.fl parameter
Auto-detection of highlightable fields is no longer possible with the standard highlighter, fields to be highlighted need to be passed explicitly with the hl.fl parameter
In the order of components, the OCR highlighting component needs to come before the standard highlighter to avoid conflicts.

16 Jul 12:32

jbaiter

0.2

f4ffc66

0.2

High-Level Changes

Breaking Change: ALTO and hOCR now have custom CharFilter implementations that should be used instead of HTMLStripCharFilterFactory. Refer to the documentation for more details.
Feature: Resolve Hyphenation at indexing time for all supported formats. If a word is broken across multiple lines, it will be indexed as the dehyphenated form. During highlighting, the parts on both lines will be highlighted appropriately.
Fix calculation of passages with matches spanning multiple lines. In previous versions, some passages would be too small.
Fix hl.fl parameter handling. A bug in 0.1 made this parameter have no effect.
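
The hyphenation handling described above can be pictured with this sketch. It is a simplification under stated assumptions: a trailing "-" on the last word of a line is taken as the hyphenation marker, whereas the OCR formats have their own ways to encode this, and the function name is hypothetical.

```python
def dehyphenate(lines):
    """Join words broken across line ends.

    `lines` is a list of lines, each a list of words. Returns a list of
    (term, parts) pairs, where `parts` holds the (line_index, fragment)
    tuples that make up the term and would each be highlighted.
    """
    fragments = [(i, word) for i, line in enumerate(lines) for word in line]
    terms = []
    k = 0
    while k < len(fragments):
        line_index, word = fragments[k]
        parts = [(line_index, word)]
        # A trailing hyphen only continues onto a following line.
        while (
            word.endswith("-")
            and k + 1 < len(fragments)
            and fragments[k + 1][0] > line_index
        ):
            k += 1
            line_index, word = fragments[k]
            parts.append((line_index, word))
        term = "".join(frag.rstrip("-") for _, frag in parts[:-1]) + parts[-1][1]
        terms.append((term, parts))
        k += 1
    return terms


lines = [["The", "high-"], ["lighting", "works"]]
for term, parts in dehyphenate(lines):
    print(term, parts)
```

The term highlighting is indexed as one word, while its parts list still records both fragments so that each line can be highlighted.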

Full List of Changes:

f4ffc66 Release 0.2
c734bd8 build(deps): bump maven-javadoc-plugin from 3.1.0 to 3.1.1
53f5634 Add more labels to query params image
cf7274f Update query_params image
daecb07 Fix bug in offsets-parser that led to a crash on stray ampersands in the input
16dfba2 Update docs
49bfcfc Merge branch 'dependabot/maven/org.apache.commons-commons-text-1.7'
83effb5 Merge branch 'master' into dependabot/maven/org.apache.commons-commons-text-1.7
ed3ccbe build(deps-dev): bump version.junit from 5.4.2 to 5.5.0
d22f28c build(deps): bump commons-text from 1.6 to 1.7

6377264 example: Fix bug in frontend code that broke images when switching between cores
7961505 example: bugfixes/compatiblity
88f8591 example: Add frontend for ALTO demo, replace UV with M3, make ingesting multi-threaded
0f5dd31 example: Fix hyphenation-triggered bugs in ALTO Content Search
4bdd79c docs: Add caution about ALTO coordinates in response format
86214a9 example: Implement IIIF Content Search for ALTO example
0278ec9 Add ALTO with non-contiguous documents to example (backend part)
46879a8 alto: Resolve hyphenation at query time
111b8f2 alto: Resolve hyphenation at indexing time
6647705 hocr: Resolve hyphenation at indexing time]
ce1ad57 Bump maven-source-plugin from 3.0.1 to 3.1.0
4b52a92 Fix hl.fl parameter handling
eede167 Fix calculation of passages with matches spanning multiple lines
4771099 Bump verison to 0.2-SNAPSHOT
03bc0dd Update docs for 0.1
ffde516 Update example IIIF Content Search implementation for 0.1
4991c07 GitLab CI: Use JDK8 to build release JARs
4948d46 Fix example for 0.1 release
4d703b1 Prep pom.xml for Maven Central release

This list of changes was auto generated.
