OCR and spatial search #70
Comments
Doing this "properly" for arbitrary regions is out of scope for this specific plugin, I'm afraid, since it does not store any information about the actual coordinates in the index and thus can't query for it (e.g. like Solr's Spatial Search). One hacky way to go about this would be to add a There is however currently support for filtering by a specific page in a document, check out the |
Sorry I've been so long coming back to this. The ideal would be if we could search for "the first instance of a term after the previous" and/or "a term X & Y away from an anchor term" where the anchor would be something like a chapter title. Java isn't really my forte but I'll certainly look into it. |
If you want to implement search inside of chapters, you could just index your documents at the chapter level by creating source pointers that point to the markup for that chapter. This is described in the documentation here: https://dbmdz.github.io/solr-ocrhighlighting/indexing/#one-or-more-partial-files-per-solr-document. Otherwise, this is hard to implement with Lucene/Solr and the plugin in its current form; you could try sloppy phrase queries like I'm not sure if the approach proposed in my first response is going to work for you, since you'd need to know the specific region on a given page where a match is allowed to occur. This could be useful for a feature like "search only in headers/footers" (if those headers/footers appear in the same positions every time), but that is not your use case, if I understood you correctly? |
What I'm thinking of is something like an old census form where the scans are all slightly wonky. The idea would be that you could use something like "Name" as an anchor and search for the first instance of that. Then, knowing that the subject's name would be X & Y pixels from the anchor term, the query would return the actual name as the result. So like a position-aware query, but based on the actual OCR coordinates rather than the position of the term in the text. |
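To make the anchor idea concrete, here is a minimal Python sketch of the lookup done outside Solr, on word boxes that have already been read from the OCR. It is not something the plugin provides: the `Word` type, the `find_relative_to_anchor` function and all offsets are hypothetical names and values chosen for illustration. Because the target box is computed relative to the anchor, a scan that is shifted by a few pixels still yields the same words.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    """One OCR word with its pixel bounding box (hypothetical helper type)."""
    text: str
    x0: int
    y0: int
    x1: int
    y1: int


def find_relative_to_anchor(words: List[Word], anchor: str,
                            dx: int, dy: int,
                            width: int, height: int) -> List[Word]:
    """Return words whose centre lies in a box placed relative to the anchor.

    The box starts at the top-left corner of the first word equal to
    `anchor`, moved by (dx, dy), and spans `width` x `height` pixels.
    `words` is assumed to be in reading order.
    """
    anchor_word = next((w for w in words if w.text == anchor), None)
    if anchor_word is None:
        return []
    left, top = anchor_word.x0 + dx, anchor_word.y0 + dy
    right, bottom = left + width, top + height
    hits = []
    for w in words:
        if w is anchor_word:
            continue
        cx, cy = (w.x0 + w.x1) / 2, (w.y0 + w.y1) / 2
        if left <= cx <= right and top <= cy <= bottom:
            hits.append(w)
    return hits
```

Matching on the word centre rather than the full box is a design choice that tolerates boxes slightly overlapping the region border; an exact string match on the anchor is the simplest option and would need loosening for noisy OCR.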
I see! A hacky and probably inefficient way to do this without changes to the plugin could be:
|
Right, I see. But presumably that wouldn't work with a wildcard search (search for any name in the region) because you wouldn't get the highlighting for that? |
Yes, correct. If you're just interested in the general content of a region, you can replace steps 4 and 5 with just parsing the OCR for the page and extracting the text in the subject name regions yourself. |
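The "parse the OCR and extract the region text yourself" route could look roughly like the sketch below. It assumes the page is available as hOCR, where each word is a `span` with class `ocrx_word` and a `title` attribute containing `bbox x0 y0 x1 y1`. The names `HocrWordParser`, `text_in_region` and `SAMPLE`, as well as the region coordinates, are made up for illustration, and word spans with nested `span` children are not handled.

```python
import re
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWordParser(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None   # bbox of the word span currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(g) for g in match.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def text_in_region(hocr, region):
    """Join, in document order, the words whose centre lies inside region."""
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    rx0, ry0, rx1, ry1 = region
    picked = []
    for text, (x0, y0, x1, y1) in parser.words:
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
        if rx0 <= cx <= rx1 and ry0 <= cy <= ry1:
            picked.append(text)
    return " ".join(picked)


# Made-up hOCR fragment for a single form line.
SAMPLE = """
<div class='ocr_page' title='bbox 0 0 1000 1400'>
 <span class='ocr_line' title='bbox 100 200 450 235'>
  <span class='ocrx_word' title='bbox 100 200 180 230; x_wconf 95'>Name</span>
  <span class='ocrx_word' title='bbox 300 205 360 235; x_wconf 91'>John</span>
  <span class='ocrx_word' title='bbox 370 205 450 235; x_wconf 90'>Smith</span>
 </span>
</div>
"""

if __name__ == "__main__":
    print(text_in_region(SAMPLE, (250, 190, 600, 240)))
```

Since this reads every word in the region regardless of the query, it also covers the wildcard case raised above, where no highlighting would be available.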
Thanks. I'll come back with any progress I make. |
More of a feature request than an issue, but it would be incredibly useful if the hOCR data could be used for querying as well as highlighting. For example, searching for a word within a specific region of the document by its page and/or coordinates.