
Feat/chipper v2 #232

Merged
merged 79 commits into main from feat/chipper-v2 on Oct 11, 2023

Conversation

@benjats07 (Contributor) commented Sep 30, 2023

This PR adds chipper v2 to the library, bringing an inference speedup and support for hierarchy (a parent element can be assigned to each element).

The main library changes are:

  • A new model, chipperv2, is available. There are significant changes to pre- and post-processing, not just to the weights, that require code changes.
  • A schema change to the TextRegion object, which now has a bbox property rather than being a subclass of Rectangle.
  • LayoutElement now has a parent property that can be populated with another LayoutElement to represent inferred hierarchy (see the sketch after this list).
  • LocationlessLayoutElement is removed. Elements produced by Chipper now have bounding boxes, and if we need to handle elements without bounding boxes in the future, we can make the bbox property optional.
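
A minimal sketch of the new schema (hedged: `bbox`, `parent`, and `from_coords` are described in this PR, but the import path and argument names below are assumptions and may differ from the merged code):

```python
# Hedged sketch of the new LayoutElement schema; not copied from the library.
from unstructured_inference.inference.layoutelement import LayoutElement  # assumed module path

title = LayoutElement.from_coords(50, 40, 500, 80, text="1 Introduction", type="Title")
paragraph = LayoutElement.from_coords(50, 90, 500, 300, text="LayoutParser is ...", type="Text")

# Hierarchy is expressed by pointing an element at its parent element.
paragraph.parent = title

# The location now lives in a `bbox` property instead of the element subclassing Rectangle.
print(paragraph.bbox)
```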

Note that because of the schema changes, ingest tests aren't working properly due to the incompatibility with the current unstructured version.

To test this new model, you can use:

from unstructured_inference.constants import OCRMode
from unstructured_inference.inference import layout
from unstructured_inference.models.base import get_model

file = "sample-docs/layout-parser-paper-fast.pdf"
model = get_model("chipperv2")
doc = layout.DocumentLayout.from_file(
    file,
    model,
    ocr_mode=OCRMode.FULL_PAGE.value,
    supplement_with_ocr_elements=True,
    ocr_strategy="never",
)

print(doc)

Also, you can try with an image:


file = "sample-docs/example_table.jpg"
model = get_model("chipperv2")
doc = layout.DocumentLayout.from_image_file(
    file,
    model,
    ocr_mode=OCRMode.FULL_PAGE.value,
    supplement_with_ocr_elements=True,
    ocr_strategy="never",
)

print(doc)
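
Either way, here is a quick sketch for inspecting the inferred hierarchy on the resulting `doc` (hedged: it assumes elements expose the new `parent` attribute and a `text` attribute):

```python
# Sketch only: walk the DocumentLayout built above and print parent/child pairs.
# Attribute names are assumptions based on the schema changes in this PR.
for page in doc.pages:
    for element in page.elements:
        if element.parent is not None:
            print(f"{element.text!r} -> parent: {element.parent.text!r}")
```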

qued and others added 30 commits September 21, 2023 18:07
- refactor `tables.py` so that the structure element confidence threshold values are loaded from `inference_config`
- refactor the intersection-over-box-area threshold in `objects_to_structure` to come from config instead of using a hardwired value of 0.5 (default is still 0.5)
- add config yaml (copied from the `unstructured` repo)
- helps with dev quality of life

Auto scale table images so that the text height is optimal for `tesseract` OCR inference. This functionality scales images based on the estimated mean text height and the `inference_config` setup: table images with text height below `inference_config.TESSERACT_MIN_TEXT_HEIGHT` or above `inference_config.TESSERACT_MAX_TEXT_HEIGHT` are scaled so that the text height is at `inference_config.TESSERACT_OPTIMUM_TEXT_HEIGHT`.
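
Roughly, the scaling decision looks like the sketch below (a simplified illustration, not the actual code in `tables.py`; the helper name, the PIL usage, and the `inference_config` import path are assumptions):

```python
from PIL import Image

from unstructured_inference.config import inference_config  # assumed import path for the config object


def maybe_zoom_table_image(image: Image.Image, estimated_text_height: float) -> Image.Image:
    """Scale the table image when its mean text height falls outside the configured band."""
    too_small = estimated_text_height < inference_config.TESSERACT_MIN_TEXT_HEIGHT
    too_large = estimated_text_height > inference_config.TESSERACT_MAX_TEXT_HEIGHT
    if too_small or too_large:
        zoom = inference_config.TESSERACT_OPTIMUM_TEXT_HEIGHT / estimated_text_height
        return image.resize((int(image.width * zoom), int(image.height * zoom)))
    return image
```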

This PR resolves
[CORE-1863](https://unstructured-ai.atlassian.net/browse/CORE-1863)

## test

- this PR adds a unit test to confirm auto scale is triggered
- test the tokens computed without zoom and with zoom on the attached image: with zoom, the tokens should include the correct text "Japanese" in the table on the page. Without zoom (call get_tokens using main) we won't see this token, and instead you might find a token that looks like "Inpanere". For this specific document it is best to set `TESSERACT_MIN_TEXT_HEIGHT` to 12.

![layout-parser-paper-with-table](https://github.com/Unstructured-IO/unstructured-inference/assets/647930/7963bba0-67cb-48ee-b338-52b1c2620fc0)


[CORE-1863]:
https://unstructured-ai.atlassian.net/browse/CORE-1863?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
This PR adds three possible values for the `source` field:
* `pdfminer` as the source for elements obtained directly from PDFs.
* `OCR-tesseract` and `OCR-paddle` for elements obtained with the respective OCR engines.

All these new values are stored in a new class `Source` in `unstructured_inference/constants.py`.

This will help users filter elements depending on how they were obtained, as in the sketch below.
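
A hedged sketch of such filtering (it assumes each layout element exposes a `source` attribute holding a `Source` member; the attribute name is an assumption):

```python
from unstructured_inference.constants import Source
from unstructured_inference.inference import layout

doc = layout.DocumentLayout.from_file("sample-docs/layout-parser-paper-fast.pdf")

# Keep only elements that came straight from the PDF text layer (pdfminer),
# dropping OCR-derived ones.
pdfminer_elements = [
    el
    for page in doc.pages
    for el in page.elements
    if getattr(el, "source", None) == Source.PDFMINER
]
print(len(pdfminer_elements))
```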
update `merge_inferred_layout_with_extracted_layout` to keep extracted image elements

### Summary

A user is flagging the assertion error for paddle language code:
```
AssertionError: param lang must in dict_keys(['ch', 'en', 'korean', 'japan', 'chinese_cht', 'ta', 'te', 'ka', 'latin', 'arabic', 'cyrillic', 'devanagari']), but got eng
```
and tried setting the `ocr_languages` param to 'en' (the correct lang code for English in paddle), but that also didn't work.
The reason is that `ocr_languages` uses the tesseract code mapping, which converts `en` to `eng` since that's the correct lang code for English in tesseract.

The quick workaround here is to stop passing the lang code to paddle and let it use the default `en`; this will be addressed once we have the lang code mapping for paddle (see the illustrative sketch below).
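
For reference, such a mapping could look roughly like this (purely illustrative, not part of this change; the paddle codes come from the error message above, the tesseract codes are standard):

```python
# Illustrative tesseract -> paddle language code mapping; not merged code.
TESSERACT_TO_PADDLE_LANG = {
    "eng": "en",
    "kor": "korean",
    "jpn": "japan",
    "chi_sim": "ch",
    "chi_tra": "chinese_cht",
}


def tesseract_to_paddle_lang(code: str, default: str = "en") -> str:
    """Translate a tesseract language code to paddle's, falling back to English."""
    return TESSERACT_TO_PADDLE_LANG.get(code, default)
```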

### Test
looks like the user used this branch and got the lang parameter working, per the
[linked comments](Unstructured-IO/unstructured-api#247 (comment)) :)
on the api repo:
```
pip install paddlepaddle
pip install "unstructured.PaddleOCR"
export ENTIRE_PAGE_OCR=paddle
make run-web-app
```
* check error before this change:
```
curl  -X 'POST'  'http://localhost:8000/general/v0/general'   -H 'accept: application/json'  -F 'files=@sample-docs/english-and-korean.png'   -F 'ocr_languages=en'  | jq -C . | less -R
```
will see the error:
```
{
  "detail": "param lang must in dict_keys(['ch', 'en', 'korean', 'japan', 'chinese_cht', 'ta', 'te', 'ka', 'latin', 'arabic', 'cyrillic', 'devanagari']), but got eng"
}
```
also in the logger you will see `INFO Loading paddle with CPU on
language=eng...` since the tesseract mapping converts `en` to `eng`.
* check after this change:

Check out this branch and install the inference repo into your env (the same env that's running the api) with `pip install -e .`

Rerun `make run-web-app`

Run the curl command again. You won't get the result on an M1 chip since paddle doesn't work on it, but from the logger info you can see `2023-09-27 12:48:48,120 unstructured_inference INFO Loading paddle with CPU on language=en...`, which means the lang parameter is using the default `en` (the logger info comes from [this line](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/models/paddle_ocr.py#L22)).

---------

Co-authored-by: shreyanid <[email protected]>
@LaverdeS (Contributor) commented Oct 9, 2023

I haven't looked through the code, but I ran some 15 pdf samples (mostly part of sample-docs and ingestion) through chipperv2 to see if the hierarchical relationships were present and could not find any.

The import from unstructured_inference.constants import OCRMode suggested in the PR description throws an ImportError: cannot import name 'OCRMode' from 'unstructured_inference.constants', so the snippet I used is the one suggested by @ajjimeno in his last comment.

I acknowledge the presence of List elements and the parent attribute of the LayoutElements, but its value is always None, even when it is certain that a hierarchy is present.

From the 15 pdf samples, 94 elements were detected in total but no List-item was present, which is also odd, considering chipperv2 returns this category in the original XML output. I suspect this is related to the token2json transform from DonutProcessor @ajjimeno. On top of the parent attribute never being anything other than None, this makes most of the returned documents miss a considerable amount of the pdf content.

I tested more on a sample russian.pdf.

Some observations:

This is the output of chipperv2 in XML, using it directly from HF:
[image]
This shows chipperv2 successfully parses the document, returning List and List-item elements in a hierarchical structure.

Now, this is the output of chipperv2 being called through unstructured-inference=='0.6.1':
[image]
All the content is present, but there is no place for hierarchical info in the element class data and the List elements are ignored.

With the last changes introduced in this PR, the output is:
[image]
which is missing a lot of the content (everything that was List or List-item in the XML), and has a suspicious Image element with 'source': <Source.PDFMINER: 'pdfminer'>.

I am trying to plot the returned element coords, but it is not clear to me what the reference image for them is. Are the coords returned relative to the resolution needed for chipper when doing pdf2image? @ajjimeno.
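
A rough sketch of such a plot (hedged: it assumes the coords are in the pixel space of a pdf2image render, which is exactly the open question above, that `bbox` exposes x1/y1/x2/y2, and that the sample lives at the path shown):

```python
import matplotlib.patches as patches
import matplotlib.pyplot as plt
from pdf2image import convert_from_path

from unstructured_inference.inference import layout
from unstructured_inference.models.base import get_model

file = "sample-docs/russian.pdf"  # assumed path for the sample mentioned above
doc = layout.DocumentLayout.from_file(file, get_model("chipperv2"))

# Render the first page and draw each element's bbox over it; dpi=200 is a
# guess and may not match the resolution chipper uses internally.
page_image = convert_from_path(file, dpi=200)[0]
fig, ax = plt.subplots()
ax.imshow(page_image)
for element in doc.pages[0].elements:
    bbox = element.bbox
    ax.add_patch(
        patches.Rectangle(
            (bbox.x1, bbox.y1),
            bbox.x2 - bbox.x1,
            bbox.y2 - bbox.y1,
            fill=False,
            edgecolor="red",
        )
    )
plt.show()
```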

@badGarnet (Collaborator) commented:

> I see many requests for constants. Could you provide an example of what is required in those requests? Is it a nice-to-have or a compulsory requirement? Thanks!

So here is an example of using a settings constant to represent these parameters: 0871489 (credit to @qued).
There are a few reasons:

  • some values like "<s_" are used in multiple places, so it is easier to maintain and less error prone if we define them once and use the variable name instead of typing the actual value (see the sketch after this list)
  • some values may be a specific value that is really part of the model's configuration, like rank=128. Defining it explicitly makes the code easier to read later on (it is clearer which knobs are present in the model and what values they take; otherwise someone would have to read through the code to find all those places and values)
  • technically they are not required to make the model run, but they would make maintenance much easier down the road, so I guess you can call them nice to have. Hence I used a question mark as a suggestion instead of requiring a change on those.
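
A small before/after sketch of that suggestion (the constant names and the `ModelConfig` call are illustrative, not the actual identifiers in the repo):

```python
# Before: magic values scattered through the code, e.g.
#   if token.startswith("<s_"): ...
#   config = ModelConfig(rank=128, ...)

# After: defined once as named constants and imported wherever they are needed.
CHIPPER_STRUCTURE_TOKEN_PREFIX = "<s_"  # sentinel prefix used when parsing output tokens
CHIPPER_LORA_RANK = 128                 # model configuration knob surfaced as a named constant


def is_structure_token(token: str) -> bool:
    """Return True when the token opens a structural tag in chipper output."""
    return token.startswith(CHIPPER_STRUCTURE_TOKEN_PREFIX)
```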

@qued qued enabled auto-merge (squash) October 11, 2023 16:36
@qued qued merged commit f55671c into main Oct 11, 2023
@qued qued deleted the feat/chipper-v2 branch October 11, 2023 16:48
github-merge-queue bot pushed a commit to Unstructured-IO/unstructured that referenced this pull request Oct 13, 2023
PR to support schema changes introduced from [PR
232](Unstructured-IO/unstructured-inference#232)
in `unstructured-inference`.

Specifically what needs to be supported is:
* Change to the way `LayoutElement` from `unstructured-inference` is
structured, specifically that this class is no longer a subclass of
`Rectangle`, and instead `LayoutElement` has a `bbox` property that
captures the location information and a `from_coords` method that allows
construction of a `LayoutElement` directly from coordinates.
* Removal of `LocationlessLayoutElement` since chipper now exports
bounding boxes, and if we need to support elements without bounding
boxes, we can make the `bbox` property mentioned above optional.
* Getting hierarchy data directly from the inference elements rather
than in post-processing
* Don't try to reorder elements received from chipper v2, as they should
already be ordered.

#### Testing:

The following demonstrates that the new version of chipper is inferring
hierarchy.

```python
from unstructured.partition.pdf import partition_pdf
elements = partition_pdf("example-docs/layout-parser-paper-fast.pdf", strategy="hi_res", model_name="chipper")
children = [el for el in elements if el.metadata.parent_id is not None]
print(children)

```
Also verify that running the traditional `hi_res` gives different
results:
```python
from unstructured.partition.pdf import partition_pdf
elements = partition_pdf("example-docs/layout-parser-paper-fast.pdf", strategy="hi_res")

```

---------

Co-authored-by: Sebastian Laverde Alfonso <[email protected]>
Co-authored-by: ryannikolaidis <[email protected]>
Co-authored-by: christinestraub <[email protected]>

Successfully merging this pull request may close these issues.

bug: Chipper error. TypeError: argument 'ids': 'NoneType' object cannot be converted to 'Sequence'