No blank between words with 0.6.0 compiled from main #147
Thank you for the bug report!
@jbaiter This is the content of the field in Solr after ingesting, is this what you need?
Thanks!!!
Thank you, that helps a lot :-)
@jbaiter Great, so quick! I'm available to check the fix in our production deployment. Thanks.
@jbaiter thanks so much. You are just awesome 🥇
So I just built a test case with the provided page, and for some reason I can't seem to reproduce the problem. For example, here's the snippet I get for the query: …
This tells me that the whitespace-handling from the OCR parser is correct for this file, since we find a match for the phrase.
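(For reference, a phrase-highlighting request along these lines is one way to run such a test; this is a hedged sketch in which the collection name "ocr" and field name "ocr_text" are placeholders, not values taken from this thread.)

import requests

resp = requests.get(
    "http://localhost:8983/solr/ocr/select",
    params={
        "q": 'ocr_text:"numero 3"',  # a phrase only matches if word boundaries survived indexing
        "hl": "true",
        "hl.ocr.fl": "ocr_text",     # the plugin parameter selecting the OCR field(s) to highlight
    },
)
print(resp.json())  # snippets appear in the OCR highlighting section of the response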
I just noticed that the same schema works with 0.5.0, so this is really something in the plugin. Can you please provide a sample page for which you are certain that the problem is happening? E.g. a page where one of the terms/term sequences from your screenshot occurs.

P.S.: If you're using MiniOCR to save on index space, you're leaving a few bytes on the table by not stripping the extraneous whitespace :-) The only whitespace that is needed is the whitespace between the individual words; everything else is ignored anyway and takes up precious space. In your case, "minifying" the file would save ~20% of the file size (7.1 KiB vs 5.8 KiB uncompressed). The practical impact is likely to be a lot smaller, though, since Lucene compresses segments with LZ4, but it may be something you want to benchmark if space is a consideration for you.
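(To illustrate the minification point, here is a rough sketch in Python, not the plugin's own tooling, that keeps only the word-separating whitespace, assuming words are wrapped in <w> elements as in the MiniOCR format.)

import re

def minify_miniocr(xml: str) -> str:
    """Rough sketch: drop whitespace that the OCR highlighter ignores anyway."""
    # Collapse any run of whitespace between adjacent tags to a single space.
    xml = re.sub(r">\s+<", "> <", xml)
    # Drop that space everywhere except right after a closing </w>,
    # where it is the separator between two words.
    xml = re.sub(r"(?<!</w)> <", "><", xml)
    return xml.strip()

# "page.xml" is a placeholder path for the MiniOCR file to minify.
with open("page.xml", encoding="utf-8") as f:
    print(minify_miniocr(f.read()))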
@jbaiter Thanks a lot, I attached the page exactly as indexed in the screenshot above. Is this what you need?
@jbaiter When the PDF is searchable:
Thanks again
Sorry, I had a slight misunderstanding, I just noticed that the …
Do you mean this?:

<fieldType name="text_ocr_stored" class="solr.TextField" storeOffsetsWithPositions="true" termVectors="true">
  <analyzer type="index">
    <charFilter class="de.digitalcollections.solrocr.lucene.filters.OcrCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_und.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
And are you using …
Yes, exactly, thank you! Any reason you're using the …
@jbaiter Double thanks! I'll check it in the next few hours. Really, I don't remember why we are using …
If you use something like the …
@jbaiter I checked, but also with StandardTokenizer I have the same issue (plugin 0.6.0):
Nope :-( |
Sorry, I was on the wrong trail this morning; it does not have to do with the external/stored state after all :-/ Could you do me a favor and paste the exact string value that you get back when you retrieve the "numero 3" document from the index? I.e. the one you get from …
@jbaiter I switched back to 0.5.0, does it matter?
I can extract both if needed
No, it shouldn't matter :-) Since you're storing the OCR in the index, the actual stored value is just whatever you posted to the collection when you indexed the document. The plugin version only plays a role afterwards, when the plugin indexes the OCR or highlights it. I want to make sure that the actual OCR that is stored in the index doesn't have any whitespace issues.
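(One way to pull that raw stored value out of the index is a plain select request; a minimal sketch, assuming a collection named "ocr", a document id "numero3", and an OCR field named "ocr_text", all placeholders for your own names.)

import requests

resp = requests.get(
    "http://localhost:8983/solr/ocr/select",
    params={"q": "id:numero3", "fl": "ocr_text"},  # placeholder id and field name
)
stored = resp.json()["response"]["docs"][0]["ocr_text"]
print(repr(stored[:500]))  # repr() makes missing spaces between words easy to spot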
Here:
There you go, the OCR that you feed to the index does not have any whitespace between the words!
@jbaiter One last question (I hope): why does that happen with 0.6.0 and not with 0.5.0? Anyway, thanks really a lot.
Good question! The 0.5.0 code wrapped Lucene's … For example, this is what your whitespace-less document looked like after being run through the …
The new parser only outputs whatever whitespace there is in the input document (and normalizes runs of consecutive spaces to a single space character to deal with indentation). If there is no whitespace in the input document, the parsed text will not have any whitespace either.
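(As an illustration of that rule: the parser itself is Java, so this Python snippet only mirrors the described behaviour, it is not the plugin's code.)

import re

def collapse_whitespace(parsed_text: str) -> str:
    # Runs of whitespace become a single space; no whitespace is ever
    # invented where the input has none.
    return re.sub(r"\s+", " ", parsed_text)

print(collapse_whitespace("numero   3"))  # -> "numero 3"
print(collapse_whitespace("numero3"))     # -> "numero3" (words stay glued together)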
Thanks a lot for your time on this, have a nice evening!!! Keep in mind, there is a really good bottle of wine waiting for you here for when you come to Italy!!
@jbaiter I compiled your plugin from main (resulting in a 0.6.0-SNAPSHOT) and installed it on Solr 8.8.1.
I found that words are indexed without spaces between them, like this:
I switched back to 0.5.0 without changing anything and the correct indexing happens:
Do you have any notes about this? Did I miss some new configuration parameter?
Thanks for your fantastic plugin
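(One way to check what actually ended up in the index, independently of the highlighter, is Solr's terms component; a hedged sketch, assuming the implicit /terms handler is available on a collection called "ocr" and the OCR field is called "ocr_text", both placeholder names.)

import requests

resp = requests.get(
    "http://localhost:8983/solr/ocr/terms",
    params={"terms.fl": "ocr_text", "terms.limit": 20},  # placeholder field name
)
# The response lists indexed tokens with their document frequencies; words that
# were glued together during indexing show up here as single long tokens.
print(resp.json()["terms"]["ocr_text"])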