Hello,
thank you for your work, it's really appreciated!
I'd like to point out an issue which seems to be related to a parsing error.
It seems that this issue has shown up before, but it was apparently fixed (issue #173).
Context
Solr is used to store hocr documents in the ocr_text field, so filesystem mode is NOT enabled.
It is running in a Docker container (official image) with no other custom plugin enabled.
The hocr files are generated by Tesseract.
The exception happens when querying Solr either via the web UI or via the select API (called by a slightly modified iiif-prezi).
The records for the troublesome document are empty, which causes prezi to send a 503 to the calling viewer (separate issue).
"ocrHighlighting":{
  "error-document":{},
This is a snippet of the main.py in iiif-prezi:
for page_snips in ocr_hls.values():
    # Not properly handling errors
    if not page_snips:
        print('[ERROR] Empty snippet returned from Solr, this is most likely due to not being able to parse a '
              'document!')
        print('[ERROR] RECORDS LOSS MAY HAVE OCCURRED!')
        continue
    snips = page_snips[solr_ocr_field]['snippets']
    out['snippets'].extend(snips)
    out['numTotal'] += page_snips[solr_ocr_field]['numTotal']
Thank you for the excellent bug report, I'll investigate!
-- Update: Could reproduce it. It seems to occur only when the OCR is stored in the index; if the file is referenced from disk, it works. Investigating further.
-- Update: It's a pretty fundamental bug in the way we generate OCR passages that's unrelated to the source of the OCR, with the stored approach we were just hitting the right circumstances by accident! Thanks so much for uncovering it :-)
When locating breaks based on hOCR classes, we read the input in blocks
of 64k chars. Previously, due to an error in the looping logic, we would
not actually read any blocks besides the first one and set the limit to
the end of the first block. This should be fixed now, more blocks are
read as needed.
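To make the fix described above concrete, here is a minimal Python sketch of block-wise scanning that keeps reading until a match is found or the input ends. It is an illustration only, not the plugin's actual (Java) code: the function name `find_next_break`, the `marker` argument, and the overlap handling are assumptions made for this example. Only the 64k block size comes from the commit message.

```python
import io

BLOCK_SIZE = 64 * 1024  # characters per block, as stated in the commit message


def find_next_break(reader, marker, block_size=BLOCK_SIZE):
    """Return the character offset of the first `marker` in `reader`,
    or -1 if the input ends without a match (hypothetical helper)."""
    offset = 0                   # absolute offset of the start of `window`
    window = ""                  # text currently being searched
    overlap = len(marker) - 1    # tail kept so a marker split across blocks is found
    while True:
        block = reader.read(block_size)
        if not block:
            # End of input reached. The bug described above amounted to
            # giving up after the first block instead of looping like this.
            return -1
        window += block
        idx = window.find(marker)
        if idx != -1:
            return offset + idx
        # No match yet: drop everything except the tail and read the next block.
        keep = window[-overlap:] if overlap else ""
        offset += len(window) - len(keep)
        window = keep


if __name__ == "__main__":
    # A marker that lies beyond the first 64k block is still located.
    text = "x" * 100_000 + '<span class="ocr_line">' + "y" * 10
    print(find_next_break(io.StringIO(text), "ocr_line"))
```

Keeping a short tail of each block means a marker that straddles a block boundary is not missed, which is a design choice of this sketch rather than something the thread confirms about the plugin.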
@jbaiter thank you so much! Also you were superfast!
I'll catch up with my colleagues about this and let you know if the issue is gone for good, at least in our case.
What I can tell you right now is that no exception is logged on the Solr side and the "error-document" is now populated with records.
An example query could be
docker-compose.yml (excerpt):
Versions
Solr: 8.11.2
Ocr Highlighting plugin: 0.8.2-solr78 / 0.8.1-solr78
Tesseract: 4.1.1
Docker host: Ubuntu 20.04.5 LTS
Docker:
docker-compose:
What happens
We get an error while performing queries on a hocr file loaded in the ocr_text field in Solr.
The error seems to be only related to some documents. Also, it seems that the query is performed multiple times.
Here's the partial stacktrace (I'm attaching the full version as a file to this issue):
Attachments
Thanks a lot again for your effort
issue_solr.txt
cursed-core.tar.gz