
Feat/chipper v2 #232

Merged
merged 79 commits into main from feat/chipper-v2 on Oct 11, 2023

Conversation

@benjats07 (Contributor) commented Sep 30, 2023

This PR adds chipper v2 to the library, bringing an inference speedup and support for hierarchy (a parent element can be assigned to each element).

The main library changes are:

  • A new model, chipperv2, is available. There are significant changes to pre- and post-processing, not just to the weights, that require code changes.
  • A schema change to the TextRegion object, which now has a bbox property rather than being a subclass of Rectangle.
  • LayoutElement now has a parent property that can be populated with another LayoutElement to represent inferred hierarchy (see the sketch after this list).
  • LocationlessLayoutElement is removed. Elements produced by Chipper now have bounding boxes, and if we need to handle elements without bounding boxes in the future, we can make the bbox property optional.
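
A minimal sketch of the new schema (hedged: `bbox`, `parent`, and `from_coords` are described in this PR, but the import path and argument names below are assumptions and may differ from the merged code):

```python
# Hedged sketch of the new LayoutElement schema; not copied from the library.
from unstructured_inference.inference.layoutelement import LayoutElement  # assumed module path

title = LayoutElement.from_coords(50, 40, 500, 80, text="1 Introduction", type="Title")
paragraph = LayoutElement.from_coords(50, 90, 500, 300, text="LayoutParser is ...", type="Text")

# Hierarchy is expressed by pointing an element at its parent element.
paragraph.parent = title

# The location now lives in a `bbox` property instead of the element subclassing Rectangle.
print(paragraph.bbox)
```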

Note that because of the schema changes, ingest tests aren't working properly due to the incompatibility with the current unstructured version.

To test this new model, you can use:

from unstructured_inference.constants import OCRMode
from unstructured_inference.inference import layout
from unstructured_inference.models.base import get_model

file = "sample-docs/layout-parser-paper-fast.pdf"
model = get_model("chipperv2")
doc = layout.DocumentLayout.from_file(
    file,
    model,
    ocr_mode=OCRMode.FULL_PAGE.value,
    supplement_with_ocr_elements=True,
    ocr_strategy="never",
)

print(doc)

Also, you can try with an image:


file = "sample-docs/example_table.jpg"
model = get_model("chipperv2")
doc = layout.DocumentLayout.from_image_file(
    file,
    model,
    ocr_mode=OCRMode.FULL_PAGE.value,
    supplement_with_ocr_elements=True,
    ocr_strategy="never",
)

print(doc)
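
Either way, here is a quick sketch for inspecting the inferred hierarchy on the resulting `doc` (hedged: it assumes elements expose the new `parent` attribute and a `text` attribute):

```python
# Sketch only: walk the DocumentLayout built above and print parent/child pairs.
# Attribute names are assumptions based on the schema changes in this PR.
for page in doc.pages:
    for element in page.elements:
        if element.parent is not None:
            print(f"{element.text!r} -> parent: {element.parent.text!r}")
```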

qued and others added 30 commits September 21, 2023 18:07
- refactor `tables.py` so that the structure element confidence threshold values are loaded from `inference_config`
- refactor the intersection-over-box-area threshold in `objects_to_structure` to come from config instead of using a hardwired value of 0.5 (default is still 0.5)
- add config yaml (copied from the `unstructured` repo)
- helps with dev quality of life

Auto scale table images so that the text height is optimal for `tesseract` OCR inference. This functionality scales images based on the estimated mean text height and the `inference_config` setup: table images with text height below `inference_config.TESSERACT_MIN_TEXT_HEIGHT` or above `inference_config.TESSERACT_MAX_TEXT_HEIGHT` are scaled so that the text height is at `inference_config.TESSERACT_OPTIMUM_TEXT_HEIGHT`.
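
Roughly, the scaling decision looks like the sketch below (a simplified illustration, not the actual code in `tables.py`; the helper name, the PIL usage, and the `inference_config` import path are assumptions):

```python
from PIL import Image

from unstructured_inference.config import inference_config  # assumed import path for the config object


def maybe_zoom_table_image(image: Image.Image, estimated_text_height: float) -> Image.Image:
    """Scale the table image when its mean text height falls outside the configured band."""
    too_small = estimated_text_height < inference_config.TESSERACT_MIN_TEXT_HEIGHT
    too_large = estimated_text_height > inference_config.TESSERACT_MAX_TEXT_HEIGHT
    if too_small or too_large:
        zoom = inference_config.TESSERACT_OPTIMUM_TEXT_HEIGHT / estimated_text_height
        return image.resize((int(image.width * zoom), int(image.height * zoom)))
    return image
```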

This PR resolves
[CORE-1863](https://unstructured-ai.atlassian.net/browse/CORE-1863)

## test

- this PR adds a unit test to confirm auto scale is triggered
- test the tokens computed without zoom and with zoom on the attached image: with zoom, the tokens should include the correct text "Japanese" in the table on the page. Without zoom (call get_tokens using main) we won't see this token, and instead you might find a token that looks like "Inpanere". For this specific document it is best to set `TESSERACT_MIN_TEXT_HEIGHT` to 12.

![layout-parser-paper-with-table](https://github.com/Unstructured-IO/unstructured-inference/assets/647930/7963bba0-67cb-48ee-b338-52b1c2620fc0)


[CORE-1863]:
https://unstructured-ai.atlassian.net/browse/CORE-1863?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
This PR adds three possible values for the `source` field:
* `pdfminer` as the source for elements obtained directly from PDFs.
* `OCR-tesseract` and `OCR-paddle` for elements obtained with the respective OCR engines.

All these new values are stored in a new class `Source` in `unstructured_inference/constants.py`.

This will help users filter elements depending on how they were obtained, as in the sketch below.
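
A hedged sketch of such filtering (it assumes each layout element exposes a `source` attribute holding a `Source` member; the attribute name is an assumption):

```python
from unstructured_inference.constants import Source
from unstructured_inference.inference import layout

doc = layout.DocumentLayout.from_file("sample-docs/layout-parser-paper-fast.pdf")

# Keep only elements that came straight from the PDF text layer (pdfminer),
# dropping OCR-derived ones.
pdfminer_elements = [
    el
    for page in doc.pages
    for el in page.elements
    if getattr(el, "source", None) == Source.PDFMINER
]
print(len(pdfminer_elements))
```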
update `merge_inferred_layout_with_extracted_layout` to keep extracted image elements

### Summary

A user is flagging the assertion error for paddle language code:
```
AssertionError: param lang must in dict_keys(['ch', 'en', 'korean', 'japan', 'chinese_cht', 'ta', 'te', 'ka', 'latin', 'arabic', 'cyrillic', 'devanagari']), but got eng
```
and tried setting the `ocr_languages` param to 'en' (the correct lang code for English in paddle), but that also didn't work.
The reason is that `ocr_languages` uses the tesseract code mapping, which converts `en` to `eng` since that's the correct lang code for English in tesseract.

The quick workaround here is to stop passing the lang code to paddle and let it use the default `en`; this will be addressed once we have the lang code mapping for paddle (see the illustrative sketch below).
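
For reference, such a mapping could look roughly like this (purely illustrative, not part of this change; the paddle codes come from the error message above, the tesseract codes are standard):

```python
# Illustrative tesseract -> paddle language code mapping; not merged code.
TESSERACT_TO_PADDLE_LANG = {
    "eng": "en",
    "kor": "korean",
    "jpn": "japan",
    "chi_sim": "ch",
    "chi_tra": "chinese_cht",
}


def tesseract_to_paddle_lang(code: str, default: str = "en") -> str:
    """Translate a tesseract language code to paddle's, falling back to English."""
    return TESSERACT_TO_PADDLE_LANG.get(code, default)
```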

### Test
looks like the user used this branch and got the lang parameter working, per the
[linked comments](Unstructured-IO/unstructured-api#247 (comment)) :)
on the api repo:
```
pip install paddlepaddle
pip install "unstructured.PaddleOCR"
export ENTIRE_PAGE_OCR=paddle
make run-web-app
```
* check error before this change:
```
curl  -X 'POST'  'http://localhost:8000/general/v0/general'   -H 'accept: application/json'  -F 'files=@sample-docs/english-and-korean.png'   -F 'ocr_languages=en'  | jq -C . | less -R
```
will see the error:
```
{
  "detail": "param lang must in dict_keys(['ch', 'en', 'korean', 'japan', 'chinese_cht', 'ta', 'te', 'ka', 'latin', 'arabic', 'cyrillic', 'devanagari']), but got eng"
}
```
also in the logger you will see `INFO Loading paddle with CPU on
language=eng...` since the tesseract mapping converts `en` to `eng`.
* check after this change:

Check out this branch and install the inference repo into your env (the same env that's running the api) with `pip install -e .`

Rerun `make run-web-app`

Run the curl command again. You won't get the result on an M1 chip since paddle doesn't work on it, but from the logger info you can see `2023-09-27 12:48:48,120 unstructured_inference INFO Loading paddle with CPU on language=en...`, which means the lang parameter is using the default `en` (the logger info comes from [this line](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/models/paddle_ocr.py#L22)).

---------

Co-authored-by: shreyanid <[email protected]>
@LaverdeS (Contributor) commented Oct 9, 2023

I haven't looked through the code, but I ran some 15 pdf samples (mostly part of sample-docs and ingestion) through chipperv2 to see if the hierarchical relationships were present and could not find any.

The import from unstructured_inference.constants import OCRMode suggested in the PR description throws an ImportError: cannot import name 'OCRMode' from 'unstructured_inference.constants', so the snippet I used is the one suggested by @ajjimeno in his last comment.

I acknowledge the presence of List elements and the parent attribute of the LayoutElements, but its value is always None, even when it is certain that a hierarchy is present.

From the 15 pdf samples, 94 elements were detected in total but no List-item was present, which is also odd, considering chipperv2 returns this category in the original XML output. I suspect this is related to the token2json transform from DonutProcessor @ajjimeno. On top of the parent attribute never being anything other than None, this makes most of the returned documents miss a considerable amount of the pdf content.

I tested more on a sample russian.pdf.

Some observations:

This is the output of chipperv2 in XML, using it directly from HF:
[image]
This shows chipperv2 successfully parses the document, returning List and List-item elements in a hierarchical structure.

Now, this is the output of chipperv2 being called through unstructured-inference=='0.6.1':
[image]
All the content is present, but there is no place for hierarchical info in the element class data and the List elements are ignored.

With the last changes introduced in this PR, the output is:
[image]
which is missing a lot of the content (everything that was List or List-item in the XML), and has a suspicious Image element with 'source': <Source.PDFMINER: 'pdfminer'>.

I am trying to plot the returned element coords, but it is not clear to me what the reference image for them is. Are the coords returned relative to the resolution needed for chipper when doing pdf2image? @ajjimeno.
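
A rough sketch of such a plot (hedged: it assumes the coords are in the pixel space of a pdf2image render, which is exactly the open question above, that `bbox` exposes x1/y1/x2/y2, and that the sample lives at the path shown):

```python
import matplotlib.patches as patches
import matplotlib.pyplot as plt
from pdf2image import convert_from_path

from unstructured_inference.inference import layout
from unstructured_inference.models.base import get_model

file = "sample-docs/russian.pdf"  # assumed path for the sample mentioned above
doc = layout.DocumentLayout.from_file(file, get_model("chipperv2"))

# Render the first page and draw each element's bbox over it; dpi=200 is a
# guess and may not match the resolution chipper uses internally.
page_image = convert_from_path(file, dpi=200)[0]
fig, ax = plt.subplots()
ax.imshow(page_image)
for element in doc.pages[0].elements:
    bbox = element.bbox
    ax.add_patch(
        patches.Rectangle(
            (bbox.x1, bbox.y1),
            bbox.x2 - bbox.x1,
            bbox.y2 - bbox.y1,
            fill=False,
            edgecolor="red",
        )
    )
plt.show()
```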

@badGarnet (Collaborator) commented:

> I see many requests for constants. Could you provide an example of what is required in those requests? Is it a nice-to-have or a compulsory requirement? Thanks!

So here is an example of using a settings constant to represent these parameters: 0871489 (credit to @qued).
There are a few reasons:

  • some values like "<s_" are used in multiple places, so it is easier to maintain and less error prone if we define them once and use the variable name instead of typing the actual value (see the sketch after this list)
  • some values may be a specific value that is really part of the model's configuration, like rank=128. Defining it explicitly makes the code easier to read later on (it is clearer which knobs are present in the model and what values they take; otherwise someone would have to read through the code to find all those places and values)
  • technically they are not required to make the model run, but they would make maintenance much easier down the road, so I guess you can call them nice to have. Hence I used a question mark as a suggestion instead of requiring a change on those.
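
A small before/after sketch of that suggestion (the constant names and the `ModelConfig` call are illustrative, not the actual identifiers in the repo):

```python
# Before: magic values scattered through the code, e.g.
#   if token.startswith("<s_"): ...
#   config = ModelConfig(rank=128, ...)

# After: defined once as named constants and imported wherever they are needed.
CHIPPER_STRUCTURE_TOKEN_PREFIX = "<s_"  # sentinel prefix used when parsing output tokens
CHIPPER_LORA_RANK = 128                 # model configuration knob surfaced as a named constant


def is_structure_token(token: str) -> bool:
    """Return True when the token opens a structural tag in chipper output."""
    return token.startswith(CHIPPER_STRUCTURE_TOKEN_PREFIX)
```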

@qued qued enabled auto-merge (squash) October 11, 2023 16:36
@qued qued merged commit f55671c into main Oct 11, 2023
@qued qued deleted the feat/chipper-v2 branch October 11, 2023 16:48
github-merge-queue bot pushed a commit to Unstructured-IO/unstructured that referenced this pull request Oct 13, 2023
PR to support schema changes introduced from [PR
232](Unstructured-IO/unstructured-inference#232)
in `unstructured-inference`.

Specifically what needs to be supported is:
* Change to the way `LayoutElement` from `unstructured-inference` is
structured, specifically that this class is no longer a subclass of
`Rectangle`, and instead `LayoutElement` has a `bbox` property that
captures the location information and a `from_coords` method that allows
construction of a `LayoutElement` directly from coordinates.
* Removal of `LocationlessLayoutElement` since chipper now exports
bounding boxes, and if we need to support elements without bounding
boxes, we can make the `bbox` property mentioned above optional.
* Getting hierarchy data directly from the inference elements rather
than in post-processing
* Don't try to reorder elements received from chipper v2, as they should
already be ordered.

#### Testing:

The following demonstrates that the new version of chipper is inferring
hierarchy.

```python
from unstructured.partition.pdf import partition_pdf
elements = partition_pdf("example-docs/layout-parser-paper-fast.pdf", strategy="hi_res", model_name="chipper")
children = [el for el in elements if el.metadata.parent_id is not None]
print(children)

```
Also verify that running the traditional `hi_res` gives different
results:
```python
from unstructured.partition.pdf import partition_pdf
elements = partition_pdf("example-docs/layout-parser-paper-fast.pdf", strategy="hi_res")

```

---------

Co-authored-by: Sebastian Laverde Alfonso <[email protected]>
Co-authored-by: ryannikolaidis <[email protected]>
Co-authored-by: christinestraub <[email protected]>

Successfully merging this pull request may close these issues.

bug: Chipper error. TypeError: argument 'ids': 'NoneType' object cannot be converted to 'Sequence'