0.4.0
This is a major release with a focus on compatibility and performance.
- Fixes compatibility with Solr/Lucene 8.4 and 7.6. We now also have an integration test suite that checks for compatibility with all Solr versions >= 7.5 on every change, so compatibility breakage should be kept to a minimum in the future.
Breaking API changes:
- Add new `pages` key to snippet response with page dimensions. This can be helpful if you need to calculate the snippet coordinates relative to the page image dimensions.
- Replace `page` key on regions and highlights with `pageIdx`. That is, instead of a string with the corresponding page identifier, we have a numerical index into the `pages` array of the snippet. This reduces the redundancy introduced by the new `pages` key at the cost of having to do some pointer chasing in clients.
- Add new `parentRegionIdx` key on highlights. This is a numerical index into the `regions` array and allows for multi-column/multi-page highlighting, where a single highlighting span can be composed of regions on multiple disjoint parts of the page or even multiple pages.
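
The pointer chasing that clients now need can be sketched roughly as follows. This is an illustration only, not code from the plugin: the key names `pages`, `regions`, `highlights`, `pageIdx` and `parentRegionIdx` come from the notes above, while the `id` and `text` keys, the nesting of `highlights` as a list of spans, and the helper name `resolve_highlights` are assumptions made for the example.

```python
def resolve_highlights(snippet):
    """Resolve each highlight to its text and the identifier of its page.

    Assumes (not confirmed by the release notes) that `highlights` is a list
    of spans, each span being a list of boxes, and that pages carry an `id`.
    """
    pages = snippet["pages"]
    regions = snippet["regions"]
    resolved = []
    for span in snippet["highlights"]:
        parts = []
        for box in span:
            # highlight -> parent region -> page
            region = regions[box["parentRegionIdx"]]
            page = pages[region["pageIdx"]]
            parts.append((box["text"], page["id"]))
        resolved.append(parts)
    return resolved


# Hypothetical snippet with one highlighting span that crosses a page break.
snippet = {
    "pages": [
        {"id": "page_12", "width": 2000, "height": 3000},
        {"id": "page_13", "width": 2000, "height": 3000},
    ],
    "regions": [
        {"pageIdx": 0, "text": "end of the first page"},
        {"pageIdx": 1, "text": "start of the next page"},
    ],
    "highlights": [
        [
            {"parentRegionIdx": 0, "text": "first page"},
            {"parentRegionIdx": 1, "text": "start"},
        ]
    ],
}

for span in resolve_highlights(snippet):
    print(span)
```

Clients that previously read the page identifier directly from a region or highlight would go through a lookup like this instead.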
Format changes:
- hOCR: Add support for retrieving the page identifier from the `x_source` and `ppageno` properties
- hOCR: Strip out title tag during indexing and highlighting
- ALTO: The plugin now supports ALTO files with coordinates expressed as floating point numbers (thanks to @mspalti!)
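
For readers unfamiliar with hOCR, properties such as `x_source` and `ppageno` live in the semicolon-separated `title` attribute of a page element. The sketch below shows one way a page identifier could be pulled out of such a string. It is not the plugin's parser: the function name `page_id_from_title` is hypothetical, preferring `x_source` over `ppageno` is an assumption, and quoted values that themselves contain semicolons are not handled.

```python
def page_id_from_title(title):
    """Return a page identifier from an hOCR `title` attribute, or None.

    Assumption: `x_source` takes precedence over `ppageno` when both exist.
    """
    props = {}
    for part in title.split(";"):
        part = part.strip()
        if not part:
            continue
        # Each property is "<name> <value...>"; the value may be quoted.
        name, _, value = part.partition(" ")
        props[name] = value.strip().strip('"')
    return props.get("x_source") or props.get("ppageno")


print(page_id_from_title("bbox 0 0 2000 3000; ppageno 4"))
print(page_id_from_title('bbox 0 0 2000 3000; ppageno 4; x_source "/scans/0005.tif"'))
```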
Performance:
- Add concurrent preloading for highlighting target files. This can result in a nice performance boost, since by the time the plugin gets to actually highlighting the files, their contents are already in the OS' page cache. See the Performance Tuning section in the docs for more context.
- This release changes the way we handle UTF-8 during context generation, resulting in an additional ~25% speedup compared to previous versions.
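
The idea behind concurrent preloading can be illustrated with a small sketch: read the target files on a few worker threads and discard the bytes, so that the operating system has them in its page cache by the time the real work starts. This is not the plugin's implementation (which runs inside Solr), and the function names, chunk size and worker count below are all assumptions chosen for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def preload(path, chunk_size=1 << 20):
    """Read a file in chunks and discard the data to warm the OS page cache.

    Returns the number of bytes read.
    """
    total = 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            total += len(chunk)
    return total


def preload_all(paths, workers=4):
    """Preload several files concurrently; returns bytes read per file, in order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(preload, paths))
```

Threads suit this job because the work is I/O-bound: each worker spends most of its time waiting on the disk rather than computing.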
Miscellaneous:
- Log warnings during source pointer parsing
- Filter out empty files during indexing
- Add new documentation section on performance tuning
- Empty regions or regions with only whitespace are no longer included in the output