Feat/chipper v2 #232
- refactor `tables.py` so that the structure element confidence threshold values are loaded from `inference_config`
- refactor the intersection-over-box-area threshold in `objects_to_structure` to come from config instead of using a hardwired value of 0.5 (the default is still 0.5)
- add a config yaml (copied from the `unstructured` repo), which helps with developer quality of life
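To illustrate the threshold refactor, here is a minimal sketch. The config class and field name (`InferenceConfig`, `TABLE_IOB_THRESHOLD`) are hypothetical stand-ins for the repo's actual `inference_config`:

```python
from dataclasses import dataclass

# Hypothetical sketch of moving a hardwired threshold into a config object;
# names are illustrative, not the actual unstructured-inference API.
@dataclass
class InferenceConfig:
    # Intersection-over-box-area threshold used when matching text regions
    # to table structure cells (previously a hardwired 0.5).
    TABLE_IOB_THRESHOLD: float = 0.5

inference_config = InferenceConfig()

def iob(boxa, boxb):
    """Intersection area of boxa and boxb, over the area of boxa."""
    ax1, ay1, ax2, ay2 = boxa
    bx1, by1, bx2, by2 = boxb
    ix = max(0, min(ax2, bx2) - max(ax1, bx1))
    iy = max(0, min(ay2, by2) - max(ay1, by1))
    area_a = (ax2 - ax1) * (ay2 - ay1)
    return (ix * iy) / area_a if area_a else 0.0

def belongs_to_cell(token_box, cell_box):
    # Config-driven threshold instead of a literal 0.5 in the code.
    return iob(token_box, cell_box) >= inference_config.TABLE_IOB_THRESHOLD
```

The behavior is unchanged by default, but callers can now tune the threshold in one place.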
Auto-scale table images so that the text height is optimal for `tesseract` OCR inference. This functionality scales images based on the estimated mean text height and the `inference_config` setup: table images with text height below `inference_config.TESSERACT_MIN_TEXT_HEIGHT` or above `inference_config.TESSERACT_MAX_TEXT_HEIGHT` are scaled so that the text height matches `inference_config.TESSERACT_OPTIMUM_TEXT_HEIGHT`.

This PR resolves [CORE-1863](https://unstructured-ai.atlassian.net/browse/CORE-1863).

## test

- this PR adds a unit test to confirm auto scale is triggered
- test the tokens computed without zoom and with zoom with the attached image: with zoom, the tokens should include the correct text "Japanese" in the table on the page. Without zoom (call `get_tokens` using main) you won't see this token and instead might find a token that looks like "Inpanere". For this specific document it is best to set `TESSERACT_MIN_TEXT_HEIGHT` to 12.

![layout-parser-paper-with-table](https://github.com/Unstructured-IO/unstructured-inference/assets/647930/7963bba0-67cb-48ee-b338-52b1c2620fc0)

[CORE-1863]: https://unstructured-ai.atlassian.net/browse/CORE-1863?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
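The scaling rule described above can be sketched in a few lines. Only the constant names come from the PR description; the numeric values and helper functions here are illustrative assumptions:

```python
# Sketch of the auto-scale decision for table images before tesseract OCR.
# Constant names follow the PR description; the values are illustrative.
TESSERACT_MIN_TEXT_HEIGHT = 12
TESSERACT_MAX_TEXT_HEIGHT = 100
TESSERACT_OPTIMUM_TEXT_HEIGHT = 20

def zoom_factor(estimated_text_height: float) -> float:
    """Return the factor by which to scale a table image before OCR."""
    if TESSERACT_MIN_TEXT_HEIGHT <= estimated_text_height <= TESSERACT_MAX_TEXT_HEIGHT:
        return 1.0  # text height already in the acceptable window; no rescale
    # Otherwise rescale so the text lands at the optimum height for tesseract.
    return TESSERACT_OPTIMUM_TEXT_HEIGHT / estimated_text_height

def scaled_size(width: int, height: int, estimated_text_height: float):
    """New (width, height) for the table image after applying the zoom."""
    zoom = zoom_factor(estimated_text_height)
    return round(width * zoom), round(height * zoom)
```

For example, an image whose estimated text height is 10 px (below the minimum) would be zoomed by 2x so the text reaches the 20 px optimum.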
This PR adds three possible values for the `source` field:

* `pdfminer` as the source for elements obtained directly from PDFs.
* `OCR-tesseract` and `OCR-paddle` for elements obtained with the respective OCR engines.

All these new values are stored in a new class `Source` in `unstructured_inference/constants.py`. This helps users filter elements depending on how they were obtained.
update `merge_inferred_layout_with_extracted_layout` to keep extracted image elements
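The intent of that change can be sketched as follows. This is illustrative only; the actual `merge_inferred_layout_with_extracted_layout` also resolves overlaps between the two layouts, and the `is_image` predicate is an assumption:

```python
# Illustrative sketch: when merging the inferred layout with the layout
# extracted by pdfminer, image elements from the extracted layout are kept
# rather than dropped in favor of overlapping inferred regions.
def merge_keep_extracted_images(inferred, extracted, is_image):
    merged = list(inferred)
    merged.extend(el for el in extracted if is_image(el))
    return merged
```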
### Summary

A user is flagging the assertion error for the paddle language code:

```
AssertionError: param lang must in dict_keys(['ch', 'en', 'korean', 'japan', 'chinese_cht', 'ta', 'te', 'ka', 'latin', 'arabic', 'cyrillic', 'devanagari']), but got eng
```

They also tried setting the `ocr_languages` param to `en` (the correct lang code for English in paddle), but that didn't work either. The reason is that `ocr_languages` uses the tesseract code mapping, which converts `en` to `eng` since that's the correct lang code for English in tesseract. The quick workaround here is to stop passing the lang code to paddle and let it use the default `en`; this will be addressed properly once we have a lang code mapping for paddle.

### Test

It looks like the user used this branch and got the lang parameter working, per the [linked comments](Unstructured-IO/unstructured-api#247 (comment)) :) On the api repo:

```
pip install paddlepaddle
pip install "unstructured.PaddleOCR"
export ENTIRE_PAGE_OCR=paddle
make run-web-app
```

* check the error before this change:

```
curl -X 'POST' 'http://localhost:8000/general/v0/general' -H 'accept: application/json' -F 'files=@sample-docs/english-and-korean.png' -F 'ocr_languages=en' | jq -C . | less -R
```

you will see the error:

```
{
  "detail": "param lang must in dict_keys(['ch', 'en', 'korean', 'japan', 'chinese_cht', 'ta', 'te', 'ka', 'latin', 'arabic', 'cyrillic', 'devanagari']), but got eng"
}
```

Also, in the logger you will see `INFO Loading paddle with CPU on language=eng...`, since the tesseract mapping converts `en` to `eng`.

* check after this change: check out this branch and install the inference repo into your env (the same env that's running the api) with `pip install -e .`, then rerun `make run-web-app` and run the curl command again. You won't get a result on an M1 chip since paddle doesn't work on it, but from the logger info you can see `2023-09-27 12:48:48,120 unstructured_inference INFO Loading paddle with CPU on language=en...`, which means the lang parameter is using the default `en` (the logger info comes from [this line](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/models/paddle_ocr.py#L22)).

---------

Co-authored-by: shreyanid <[email protected]>
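The mismatch described in the summary can be sketched as follows. The mapping table and supported-code set below are illustrative subsets, and `paddle_language` is a hypothetical helper, not the library's actual function:

```python
# Sketch of the bug: tesseract uses ISO 639-2 codes ("eng"), while paddle
# expects its own short codes ("en"). Routing user input through the
# tesseract mapping produced "eng", which paddle rejects. The interim fix
# is to fall back to paddle's default "en" when the mapped code is invalid.
TESSERACT_LANG_MAP = {"en": "eng", "ko": "kor"}  # illustrative subset
PADDLE_SUPPORTED = {"ch", "en", "korean", "japan", "latin", "arabic"}

def paddle_language(ocr_languages: str) -> str:
    mapped = TESSERACT_LANG_MAP.get(ocr_languages, ocr_languages)
    # "eng" is not a valid paddle code, so fall back to the default.
    return mapped if mapped in PADDLE_SUPPORTED else "en"
```

With this logic, a user passing `ocr_languages=en` no longer triggers the `but got eng` assertion, because the invalid mapped code is replaced by paddle's default.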
I haven't looked through the code, but I threw some 15 pdf samples (mostly part of sample-docs and ingestion) at it. From the 15 pdf samples, 94 elements were detected in total. I tested more on a sample russian.pdf and compared the outputs before and after the last changes introduced in this PR (output screenshots omitted here). I am trying to plot the returned element coords, but it is unclear to me what the reference image for them is. Are the coords returned relative to the resolution needed for chipper when doing pdf2image? @ajjimeno
So here is an example of using a settings constant to represent these parameters: 0871489 (credit to @qued)
PR to support schema changes introduced from [PR 232](Unstructured-IO/unstructured-inference#232) in `unstructured-inference`. Specifically what needs to be supported is:

* Change to the way `LayoutElement` from `unstructured-inference` is structured; specifically, this class is no longer a subclass of `Rectangle`, and instead `LayoutElement` has a `bbox` property that captures the location information and a `from_coords` method that allows construction of a `LayoutElement` directly from coordinates.
* Removal of `LocationlessLayoutElement`, since chipper now exports bounding boxes; if we need to support elements without bounding boxes, we can make the `bbox` property mentioned above optional.
* Getting hierarchy data directly from the inference elements rather than in post-processing.
* Don't try to reorder elements received from chipper v2, as they should already be ordered.

#### Testing:

The following demonstrates that the new version of chipper is inferring hierarchy.

```python
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf("example-docs/layout-parser-paper-fast.pdf", strategy="hi_res", model_name="chipper")
children = [el for el in elements if el.metadata.parent_id is not None]
print(children)
```

Also verify that running the traditional `hi_res` gives different results:

```python
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf("example-docs/layout-parser-paper-fast.pdf", strategy="hi_res")
```

---------

Co-authored-by: Sebastian Laverde Alfonso <[email protected]>
Co-authored-by: ryannikolaidis <[email protected]>
Co-authored-by: christinestraub <[email protected]>
This PR adds chipper v2 to the library, leveraging an inference speedup and support for hierarchy (assigning a parent element to elements).

The main library changes are:

* `chipperv2` is available. There are significant changes that necessitate code changes in pre- and post-processing, beyond just changes to the weights.
* `TextRegion` is now an object with a `bbox` property, rather than being a subclass of `Rectangle`.
* `LayoutElement` now has a `parent` property that can be populated with another `LayoutElement` to represent inferred hierarchy.
* `LocationlessLayoutElement` is removed. Elements produced by Chipper now have bounding boxes, and if we need to handle elements without bounding boxes in the future, we can make the `bbox` property optional.

Note that because of the schema changes, ingest tests aren't working properly due to the incompatibility with the current `unstructured` version.

In order to test this new model you can use:

Also, you can try with an image:
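The schema changes listed above can be summarized in a minimal sketch. Field and method names follow the PR description, but this is a simplified stand-in for the real classes in `unstructured-inference`:

```python
from dataclasses import dataclass
from typing import Optional

# Simplified sketch of the new schema: LayoutElement carries a bbox property
# instead of subclassing Rectangle, has an optional parent link for inferred
# hierarchy, and offers a from_coords constructor.
@dataclass
class Rectangle:
    x1: float
    y1: float
    x2: float
    y2: float

@dataclass
class LayoutElement:
    bbox: Rectangle
    text: Optional[str] = None
    parent: Optional["LayoutElement"] = None

    @classmethod
    def from_coords(cls, x1, y1, x2, y2, **kwargs):
        return cls(bbox=Rectangle(x1, y1, x2, y2), **kwargs)
```

Making `bbox` `Optional[Rectangle]` would cover the removed `LocationlessLayoutElement` case if elements without bounding boxes ever need support again.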