Empty ocrHighlighting field: parsing error for ocr field #288

Alfablos · 2022-10-11T13:41:53Z

Hello,
thank you for your work, it's really appreciated!

I'd like to point out an issue which seems to be related to a parsing error.

This issue seems to have shown up before, but according to issue #173 it was fixed.

Context

Solr is used to store hocr documents in the ocr_text field, so filesystem mode is NOT enabled.
It is running in a Docker container (official image) with no other custom plugin enabled.
The hocr files are generated by Tesseract.

The exception happens when querying Solr either via the web UI or via the select API (called by a slightly modified iiif-prezi).
The records for the troublesome document are empty, which causes prezi to send a 503 to the calling viewer (a separate issue).

  "ocrHighlighting":{
    "error-document":{},

This is a snippet of the main.py in iiif-prezi:

        for page_snips in ocr_hls.values():
            # Not properly handling errors
            if not page_snips:
                print('[ERROR] Empty snippet returned from Solr, this is most likely due to not being able to parse a '
                      'document!')
                print('[ERROR] RECORDS LOSS MAY HAVE OCCURRED!')
                continue
            snips = page_snips[solr_ocr_field]['snippets']
            out['snippets'].extend(snips)
            out['numTotal'] += page_snips[solr_ocr_field]['numTotal']
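
One possible way to make the loop above report which documents failed, rather than only printing, is sketched below. The function name `merge_snippets` and the returned `failed_ids` list are assumptions for illustration and are not part of iiif-prezi; the response shape is taken from the `ocrHighlighting` excerpt above.

```python
def merge_snippets(ocr_hls, solr_ocr_field):
    """Merge per-document OCR snippets and collect IDs with empty results.

    `ocr_hls` is the `ocrHighlighting` object of the Solr response,
    keyed by document ID. Hypothetical helper, not iiif-prezi code.
    """
    out = {"snippets": [], "numTotal": 0}
    failed_ids = []
    for doc_id, page_snips in ocr_hls.items():
        # An empty entry (e.g. "error-document": {}) has no field key at all.
        field_hl = (page_snips or {}).get(solr_ocr_field)
        if not field_hl:
            failed_ids.append(doc_id)
            continue
        out["snippets"].extend(field_hl.get("snippets", []))
        out["numTotal"] += field_hl.get("numTotal", 0)
    return out, failed_ids
```

The caller could then decide whether a non-empty `failed_ids` should be logged, surfaced to the viewer, or ignored, instead of failing the whole request.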

An example query:

hl.snippets=4096&hl.weightMatches=true&q=il&df=ocr_text&hl=true&indent=true&fl=id&q.op=OR&hl.ocr.fl=ocr_text
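
For reference, the same query string can be assembled programmatically. This is a minimal sketch: the function name `build_ocr_query` is hypothetical, and it only uses the parameters that appear in the query above.

```python
from urllib.parse import urlencode


def build_ocr_query(term, ocr_field="ocr_text", snippets=4096):
    """Return the URL-encoded parameters for an OCR highlighting query.

    Hypothetical helper; parameter names are copied from the example query.
    """
    params = {
        "q": term,
        "df": ocr_field,          # default search field
        "q.op": "OR",
        "fl": "id",
        "hl": "true",
        "hl.ocr.fl": ocr_field,   # field handled by the OCR highlighter
        "hl.snippets": snippets,
        "hl.weightMatches": "true",
        "indent": "true",
    }
    return urlencode(params)
```

The result can be appended to the select endpoint of whichever core holds the OCR field.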

docker-compose.yml (excerpt):

  iiif-prezi:
    command: pipenv run prod
    build:
      context: ./iiif-prezi
    container_name: iiif-search-prezi
    volumes:
      - $VOLUMES_ROOT_PATH/iiif-prezi/nginx/logs:/data
      - ./iiif-prezi/main.py:/usr/src/app/main.py
    restart: unless-stopped
    environment:
      - CFG_SOLR_BASE=${SOLR_BASE_URL}
      - CFG_SERVER_NAME=${IIIF_PRESENTATION_SERVER_NAME}
      - CFG_APP_PATH=/iiif/presentation
      - CFG_PROTOCOL=${IIIF_PRESENTATION_MANIFEST_PROTOCOL}
      - CFG_SOLR_CORE=${SOLR_CORE}
      - CFG_SOLR_OCR_FIELD=${SOLR_OCR_FIELD}
    networks:
      - iiif-search-service-network
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.iiif.rule=PathPrefix(`/iiif/presentation`)"
      - "traefik.http.middlewares.iiif-replace.replacepathregex.regex=^/iiif/presentation/(.*)"
      - "traefik.http.middlewares.iiif-replace.replacepathregex.replacement=/$$1"
      - "traefik.http.routers.iiif.entrypoints=web"
      - "traefik.http.routers.iiif.middlewares=iiif-replace"

  solr:
    image: solr:8
    environment:
      SOLR_HEAP: 4G
    volumes:
      - $VOLUMES_ROOT_PATH/solr/home:/var/solr
      - $VOLUMES_ROOT_PATH/solr/plugins/ocrhighlighting/solr-ocrhighlighting-0.8.1-solr78.jar:/opt/solr/contrib/ocrhighlighting/lib/solr-ocrhighlighting-0.8.1-solr78.jar
    networks:
      - iiif-search-service-network
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.solr.rule=PathPrefix(`/solr`)"
      - "traefik.http.routers.solr.entrypoints=web"
      - "traefik.http.routers.solr.priority=1"

Versions

Solr: 8.11.2

Ocr Highlighting plugin: 0.8.2-solr78 / 0.8.1-solr78

Tesseract: 4.1.1

Docker host: Ubuntu 20.04.5 LTS

Docker:

Client: Docker Engine - Community
 Version:           20.10.18
 API version:       1.41
 Go version:        go1.18.6
 Git commit:        b40c2f6
 Built:             Thu Sep  8 23:11:45 2022
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.18
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.18.6
  Git commit:       e42327a
  Built:            Thu Sep  8 23:09:37 2022
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.8
  GitCommit:        9cd3357b7fd7218e4aec3eae239db1f68a5a6ec6
 runc:
  Version:          1.1.4
  GitCommit:        v1.1.4-0-g5fd4c4d
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

docker-compose:

docker-compose version 1.29.2, build 5becea4c
docker-py version: 5.0.0
CPython version: 3.7.10
OpenSSL version: OpenSSL 1.1.0l  10 Sep 2019

What happens

We get an error while performing queries on a hocr file loaded in the ocr_text field in Solr.

The error seems to affect only some documents. Also, the query seems to be performed multiple times.

Here's the partial stacktrace (I'm attaching the full version as a file to this issue):

solr_1        | 2022-10-11 11:16:05.401 ERROR (qtp1350751778-16) [   x:hocr_test] s.OcrHighlighter Could not highlight OCR content for document => java.lang.IllegalArgumentException: Invalid range: [2346..-1)
solr_1        | 	at com.google.common.collect.Range.<init>(Range.java:352)
solr_1        | java.lang.IllegalArgumentException: Invalid range: [2346..-1)
solr_1        | 	at com.google.common.collect.Range.<init>(Range.java:352) ~[?:?]
solr_1        | 	at com.google.common.collect.Range.create(Range.java:155) ~[?:?]
solr_1        | 	at com.google.common.collect.Range.closedOpen(Range.java:189) ~[?:?]
solr_1        | 	at com.github.dbmdz.solrocr.model.OcrFormat.getContainingWordLimits(OcrFormat.java:112) ~[?:?]
solr_1        | 	at com.github.dbmdz.solrocr.lucene.OcrPassageFormatter.adjustPositionToCharacterEntities(OcrPassageFormatter.java:176) ~[?:?]
solr_1        | 	at com.github.dbmdz.solrocr.lucene.OcrPassageFormatter.getHighlightedFragment(OcrPassageFormatter.java:159) ~[?:?]
solr_1        | 	at com.github.dbmdz.solrocr.lucene.OcrPassageFormatter.format(OcrPassageFormatter.java:195) ~[?:?]
solr_1        | 	at com.github.dbmdz.solrocr.lucene.OcrPassageFormatter.format(OcrPassageFormatter.java:101) ~[?:?]
solr_1        | 	at com.github.dbmdz.solrocr.lucene.OcrFieldHighlighter.highlightFieldForDoc(OcrFieldHighlighter.java:104) ~[?:?]
solr_1        | 	at solrocr.OcrHighlighter.highlightOcrFields(OcrHighlighter.java:421) ~[?:?]
solr_1        | 	at com.github.dbmdz.solrocr.solr.SolrOcrHighlighter.doHighlighting(SolrOcrHighlighter.java:78) ~[?:?]
solr_1        | 	at solrocr.OcrHighlightComponent.process(OcrHighlightComponent.java:122) ~[?:?]
solr_1        | 	at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:369) ~[?:?]
solr_1        | 	at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:216) ~[?:?]
solr_1        | 	at org.apache.solr.core.SolrCore.execute(SolrCore.java:2637) ~[?:?]
solr_1        | 	at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:791) ~[?:?]
solr_1        | 	at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:564) ~[?:?]
solr_1        | 	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:427) ~[?:?]

...

Attachments

Solr logs
A Solr core containing one working and one non-working example

Thanks a lot again for your effort

issue_solr.txt
cursed-core.tar.gz


jbaiter · 2022-10-11T14:36:04Z

Thank you for the excellent bug report, I'll investigate!

-- Update: I could reproduce it. It seems to occur only when the OCR is stored in the index; if the file is referenced from disk, it works. Investigating further.

-- Update: It's a pretty fundamental bug in the way we generate OCR passages that's unrelated to the source of the OCR; with the stored approach we were just hitting the right circumstances by accident! Thanks so much for uncovering it :-)

When locating breaks based on hOCR classes, we read the input in blocks of 64k chars. Previously, due to an error in the looping logic, we would not actually read any blocks besides the first one and would set the limit to the end of the first block. This should be fixed now: more blocks are read as needed.
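
To illustrate the idea of reading further blocks as needed, here is a rough Python sketch. It is not the plugin's actual code (the plugin is written in Java), and `find_break`, its parameters, and the marker strings are hypothetical; only the 64k block size comes from the description above.

```python
import io

BLOCK_SIZE = 64 * 1024  # block size mentioned in the description above


def find_break(reader, marker, block_size=BLOCK_SIZE):
    """Return the offset of the first `marker` in `reader`, or -1 if absent.

    Keeps reading blocks until the marker is found or the input ends,
    instead of stopping after the first block. Illustrative only.
    """
    if not marker:
        raise ValueError("marker must be non-empty")
    tail = ""      # end of the previous window, kept to catch straddling markers
    consumed = 0   # number of characters read before the current block
    while True:
        block = reader.read(block_size)
        if not block:
            return -1  # input exhausted without finding the marker
        window = tail + block
        idx = window.find(marker)
        if idx >= 0:
            # Translate the window-relative index to an absolute offset.
            return consumed - len(tail) + idx
        consumed += len(block)
        keep = len(marker) - 1
        tail = window[-keep:] if keep else ""
```

Keeping the last `len(marker) - 1` characters of each window means a marker split across two blocks is still found. A caller would need to check for `-1` before building a range from the result.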

Alfablos · 2022-10-11T21:50:53Z

@jbaiter thank you so much! Also you were superfast!
I'll catch up with my colleagues about this and let you know if the issue is gone for good, at least in our case.
What I can tell you right now is that no exception is logged on the Solr side and the "error-document" entry is now populated with records.

Thank you, thank you, thank you!

Alfablos · 2022-10-27T12:54:15Z

Hi,
my team ran all the checks and found no issues whatsoever :)
Thanks!

jbaiter added the bug ("Something isn't working") label Oct 11, 2022

jbaiter self-assigned this Oct 11, 2022

jbaiter closed this as completed in 66f3055 Oct 11, 2022
