chore: process chipper hierarchy (#1634)
PR to support schema changes introduced in
[PR 232](Unstructured-IO/unstructured-inference#232) in
`unstructured-inference`.

Specifically, this PR needs to support the following:
* A change to the structure of `LayoutElement` in `unstructured-inference`:
the class is no longer a subclass of `Rectangle`. Instead, `LayoutElement`
has a `bbox` property that captures the location information and a
`from_coords` method that constructs a `LayoutElement` directly from
coordinates.
* Removal of `LocationlessLayoutElement`, since chipper now exports
bounding boxes. If we later need to support elements without bounding
boxes, we can make the `bbox` property optional.
* Getting hierarchy data directly from the inference elements rather
than deriving it in post-processing.
* No reordering of elements received from chipper v2, as they should
already be ordered.
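
The schema change in the first bullet can be illustrated with a minimal, self-contained sketch. The classes below are simplified stand-ins written for illustration only; they are not the `unstructured-inference` source, and the fields shown (`text`, `type`, `parent`) and the `from_coords` signature are assumptions about the general shape rather than the library's exact API.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class Rectangle:
    """Stand-in bounding box (hypothetical, not the library class)."""

    x1: float
    y1: float
    x2: float
    y2: float


@dataclass
class LayoutElement:
    """Stand-in element that *has* a bbox instead of *being* a Rectangle."""

    bbox: Rectangle
    text: Optional[str] = None
    type: Optional[str] = None
    parent: Optional[LayoutElement] = None  # hierarchy carried on the element

    @classmethod
    def from_coords(cls, x1, y1, x2, y2, **kwargs) -> LayoutElement:
        # Build the bbox from raw coordinates, then forward the other fields.
        return cls(bbox=Rectangle(x1, y1, x2, y2), **kwargs)


# Construct elements directly from coordinates, as the updated test does.
title = LayoutElement.from_coords(0, 0, 100, 20, text="Intro", type="Title")
body = LayoutElement.from_coords(
    0, 30, 100, 90, text="Some text.", type="NarrativeText", parent=title
)

# Location is now read through the bbox property.
print(body.bbox.y1, body.parent.text)
```

Under this composition-based layout, making `bbox` optional later (as the second bullet suggests) would only require changing one field's type, without touching an inheritance chain.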

#### Testing:

The following demonstrates that the new version of chipper is inferring
hierarchy.

```python
from unstructured.partition.pdf import partition_pdf

# Chipper elements should carry hierarchy information via parent_id.
elements = partition_pdf("example-docs/layout-parser-paper-fast.pdf", strategy="hi_res", model_name="chipper")
children = [el for el in elements if el.metadata.parent_id is not None]
print(children)
```
Also verify that running the traditional `hi_res` strategy (without
chipper) gives different results:
```python
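from unstructured.partition.pdf import partition_pdf
elements = partition_pdf("example-docs/layout-parser-paper-fast.pdf", strategy="hi_res")
# Print the child elements to compare against the chipper output above.
print([el for el in elements if el.metadata.parent_id is not None])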
```
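
To make the two outputs easier to compare, the child elements can be grouped under their parents. The following is a rough sketch using hypothetical stand-in classes (`Element`, `Metadata`) so that it runs without the model; it assumes each element exposes an `id` and a `metadata.parent_id`, which matches the attribute used in the snippets above but is otherwise an assumption.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Metadata:
    """Hypothetical stand-in for element metadata."""

    parent_id: Optional[str] = None


@dataclass
class Element:
    """Hypothetical stand-in for a partitioned element."""

    id: str
    text: str
    metadata: Metadata = field(default_factory=Metadata)


def children_by_parent(elements: List[Element]) -> Dict[str, List[str]]:
    """Map each parent id to the texts of its child elements, in order."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for el in elements:
        if el.metadata.parent_id is not None:
            groups[el.metadata.parent_id].append(el.text)
    return dict(groups)


# Illustrative data only; real elements would come from partition_pdf.
elements = [
    Element("t1", "Introduction"),
    Element("p1", "First paragraph.", Metadata(parent_id="t1")),
    Element("p2", "Second paragraph.", Metadata(parent_id="t1")),
]
tree = children_by_parent(elements)
print(tree)
```

Running the same helper on the chipper output and on the default `hi_res` output would show at a glance whether the two assign parents differently.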

---------

Co-authored-by: Sebastian Laverde Alfonso <[email protected]>
Co-authored-by: ryannikolaidis <[email protected]>
Co-authored-by: christinestraub <[email protected]>
4 people authored Oct 13, 2023
1 parent 94836cf commit 8100f1e
Showing 32 changed files with 853 additions and 842 deletions.
3 changes: 3 additions & 0 deletions .github/workflows/ci.yml
@@ -113,6 +113,7 @@ jobs:
runs-on: ubuntu-latest
env:
NLTK_DATA: ${{ github.workspace }}/nltk_data
UNSTRUCTURED_HF_TOKEN: ${{ secrets.HF_TOKEN }}
needs: [setup, lint]
steps:
- uses: actions/checkout@v3
@@ -216,6 +217,7 @@ jobs:
- name: Test
env:
UNS_API_KEY: ${{ secrets.UNS_API_KEY }}
UNSTRUCTURED_HF_TOKEN: ${{ secrets.HF_TOKEN }}
run: |
source .venv-${{ matrix.extra }}/bin/activate
# NOTE(newelh) - determine what needs to be installed here
@@ -425,5 +427,6 @@ jobs:
run: |
source .venv/bin/activate
echo "UNS_API_KEY=${{ secrets.UNS_API_KEY }}" > uns_test_env_file
echo "UNSTRUCTURED_HF_TOKEN=${{ secrets.HF_TOKEN }}" >> uns_test_env_file
make docker-build
make docker-test CI=true UNSTRUCTURED_INCLUDE_DEBUG_METADATA=true
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -2,6 +2,7 @@

### Enhancements

* **Bump `unstructured-inference` to `0.7.3`.** The updated version of `unstructured-inference` supports a new version of the Chipper model, as well as a cleaner schema for its output classes. Support is included for new inference features such as hierarchy and ordering.
* **Expose skip_infer_table_types in ingest CLI.** For each connector a new `--skip-infer-table-types` parameter was added to map to the `skip_infer_table_types` partition argument. This gives more granular control to unstructured-ingest users, allowing them to specify the file types for which we should attempt table extraction.
* **Add flag to ingest CLI to raise error if any single doc fails in pipeline** Currently if a single doc fails in the pipeline, the whole thing halts due to the error. This flag defaults to log an error but continue with the docs it can.

10 changes: 5 additions & 5 deletions requirements/dev.txt
@@ -144,7 +144,7 @@ jsonschema-specifications==2023.7.1
# via jsonschema
jupyter==1.0.0
# via -r requirements/dev.in
jupyter-client==8.3.1
jupyter-client==8.4.0
# via
# ipykernel
# jupyter-console
@@ -153,7 +153,7 @@ jupyter-client==8.3.1
# qtconsole
jupyter-console==6.6.3
# via jupyter
jupyter-core==5.3.2
jupyter-core==5.4.0
# via
# -c requirements/constraints.in
# ipykernel
@@ -178,7 +178,7 @@ jupyter-server==2.7.3
# notebook-shim
jupyter-server-terminals==0.4.4
# via jupyter-server
jupyterlab==4.0.6
jupyterlab==4.0.7
# via notebook
jupyterlab-pygments==0.2.2
# via nbconvert
@@ -213,7 +213,7 @@ nest-asyncio==1.5.8
# via ipykernel
nodeenv==1.8.0
# via pre-commit
notebook==7.0.4
notebook==7.0.5
# via jupyter
notebook-shim==0.2.3
# via
@@ -320,7 +320,7 @@ rfc3986-validator==0.1.1
# via
# jsonschema
# jupyter-events
rpds-py==0.10.4
rpds-py==0.10.6
# via
# jsonschema
# referencing
6 changes: 5 additions & 1 deletion requirements/extra-docx.txt
@@ -8,5 +8,9 @@ lxml==4.9.3
# via
# -c requirements/base.txt
# python-docx
python-docx==0.8.11
python-docx==1.0.0
# via -r requirements/extra-docx.in
typing-extensions==4.8.0
# via
# -c requirements/base.txt
# python-docx
6 changes: 5 additions & 1 deletion requirements/extra-odt.txt
@@ -10,5 +10,9 @@ lxml==4.9.3
# python-docx
pypandoc==1.11
# via -r requirements/extra-odt.in
python-docx==0.8.11
python-docx==1.0.0
# via -r requirements/extra-odt.in
typing-extensions==4.8.0
# via
# -c requirements/base.txt
# python-docx
2 changes: 1 addition & 1 deletion requirements/extra-paddleocr.txt
@@ -31,7 +31,7 @@ contourpy==1.1.1
# via matplotlib
cssselect==1.2.0
# via premailer
cssutils==2.7.1
cssutils==2.8.0
# via premailer
cycler==0.12.1
# via matplotlib
2 changes: 1 addition & 1 deletion requirements/extra-pdf-image.in
@@ -5,7 +5,7 @@ pdf2image
pdfminer.six
# Do not move to constraints.in, otherwise unstructured-inference will not be upgraded
# when unstructured library is.
unstructured-inference==0.7.2
unstructured-inference==0.7.3
# unstructured fork of pytesseract that provides an interface to allow for multiple output formats
# from one tesseract call
unstructured.pytesseract>=0.3.12
4 changes: 2 additions & 2 deletions requirements/extra-pdf-image.txt
@@ -149,7 +149,7 @@ pyparsing==3.0.9
# via
# -c requirements/constraints.in
# matplotlib
pypdfium2==4.20.0
pypdfium2==4.21.0
# via pdfplumber
pytesseract==0.3.10
# via layoutparser
@@ -231,7 +231,7 @@ typing-extensions==4.8.0
# torch
tzdata==2023.3
# via pandas
unstructured-inference==0.7.2
unstructured-inference==0.7.3
# via -r requirements/extra-pdf-image.in
unstructured-pytesseract==0.3.12
# via
2 changes: 1 addition & 1 deletion requirements/ingest-azure.txt
@@ -23,7 +23,7 @@ azure-datalake-store==0.0.53
# via adlfs
azure-identity==1.14.1
# via adlfs
azure-storage-blob==12.18.2
azure-storage-blob==12.18.3
# via adlfs
certifi==2023.7.22
# via
2 changes: 1 addition & 1 deletion requirements/ingest-gcs.txt
@@ -64,7 +64,7 @@ google-crc32c==1.5.0
# via google-resumable-media
google-resumable-media==2.6.0
# via google-cloud-storage
googleapis-common-protos==1.60.0
googleapis-common-protos==1.61.0
# via google-api-core
idna==3.4
# via
4 changes: 2 additions & 2 deletions requirements/ingest-google-drive.txt
@@ -17,7 +17,7 @@ charset-normalizer==3.3.0
# requests
google-api-core==2.12.0
# via google-api-python-client
google-api-python-client==2.102.0
google-api-python-client==2.103.0
# via -r requirements/ingest-google-drive.in
google-auth==2.23.3
# via
@@ -26,7 +26,7 @@ google-auth==2.23.3
# google-auth-httplib2
google-auth-httplib2==0.1.1
# via google-api-python-client
googleapis-common-protos==1.60.0
googleapis-common-protos==1.61.0
# via google-api-core
httplib2==0.22.0
# via
2 changes: 1 addition & 1 deletion requirements/ingest-openai.txt
@@ -50,7 +50,7 @@ jsonpatch==1.33
# via langchain
jsonpointer==2.4
# via jsonpatch
langchain==0.0.311
langchain==0.0.313
# via -r requirements/ingest-openai.in
langsmith==0.0.43
# via langchain
2 changes: 1 addition & 1 deletion requirements/test.txt
@@ -54,7 +54,7 @@ mccabe==0.7.0
# via flake8
multidict==6.0.4
# via yarl
mypy==1.5.1
mypy==1.6.0
# via -r requirements/test.in
mypy-extensions==1.0.0
# via
2 changes: 1 addition & 1 deletion test_unstructured/partition/pdf_image/test_image.py
@@ -62,7 +62,7 @@ def __init__(self, number: int, image: Image):
@property
def elements(self):
return [
layout.LayoutElement(
layout.LayoutElement.from_coords(
type="Title",
x1=0,
y1=0,