chore: process chipper hierarchy #1634

Merged
merged 32 commits into from
Oct 13, 2023
Changes from 30 commits
Commits
f07e171
Handle parents from unstructured-inference
qued Oct 3, 2023
ef08907
Add utility functions
qued Oct 3, 2023
b846737
Skip sorting for chipperv2
qued Oct 3, 2023
f78726f
Update tests
qued Oct 3, 2023
0a52d02
Merge branch 'main' into chore/process-chipper-hierarchy
qued Oct 3, 2023
5f0c88f
Make sorting logic more consistent
qued Oct 3, 2023
71071a3
Test sorting
qued Oct 3, 2023
b78aecc
Temp CI change to sync with unstructured-inference
qued Oct 4, 2023
5517e4d
Add tests for first and only
qued Oct 4, 2023
0184c50
Add docstrings
qued Oct 4, 2023
73849fb
Merge branch 'main' into chore/process-chipper-hierarchy
LaverdeS Oct 5, 2023
93029de
Merge branch 'main' into chore/process-chipper-hierarchy
qued Oct 9, 2023
e9f25dc
Updates to OCR to support new LayoutElement struct
qued Oct 9, 2023
16df60c
Updates to ocr for LayoutElement changes
qued Oct 9, 2023
9692236
pip-compile
qued Oct 12, 2023
ddec49c
missed two pip-compile updates
qued Oct 12, 2023
65eb518
Merge branch 'main' into chore/process-chipper-hierarchy
qued Oct 12, 2023
ad3a707
update changelog
qued Oct 12, 2023
ba33f8d
more changelog detail
qued Oct 12, 2023
51774b7
Remove alternate branch of inference
qued Oct 12, 2023
ee0adb0
skip ocr and filtering with chipper
qued Oct 12, 2023
836447c
Add tests for new chipper behavior
qued Oct 12, 2023
a353ceb
Merge branch 'main' into chore/process-chipper-hierarchy
qued Oct 12, 2023
67fd08e
pip-compile
qued Oct 12, 2023
f234944
Merge branch 'main' into chore/process-chipper-hierarchy
LaverdeS Oct 12, 2023
524d9c1
Add HF token to CI
qued Oct 12, 2023
73fe3d6
Merge branch 'chore/process-chipper-hierarchy' of github.com:Unstruct…
qued Oct 12, 2023
e01643a
chore: process chipper hierarchy <- Ingest test fixtures update (#1740)
ryannikolaidis Oct 12, 2023
3207e7f
Merge branch 'main' into chore/process-chipper-hierarchy
qued Oct 12, 2023
43d1595
Merge branch 'main' into chore/process-chipper-hierarchy
christinestraub Oct 12, 2023
73af31b
Skip normal hierarchy discovery for chipper
qued Oct 13, 2023
675394b
typing
qued Oct 13, 2023
3 changes: 3 additions & 0 deletions .github/workflows/ci.yml
@@ -113,6 +113,7 @@ jobs:
runs-on: ubuntu-latest
env:
NLTK_DATA: ${{ github.workspace }}/nltk_data
+UNSTRUCTURED_HF_TOKEN: ${{ secrets.HF_TOKEN }}
needs: [setup, lint]
steps:
- uses: actions/checkout@v3
@@ -216,6 +217,7 @@ jobs:
- name: Test
env:
UNS_API_KEY: ${{ secrets.UNS_API_KEY }}
+UNSTRUCTURED_HF_TOKEN: ${{ secrets.HF_TOKEN }}
run: |
source .venv-${{ matrix.extra }}/bin/activate
# NOTE(newelh) - determine what needs to be installed here
@@ -422,5 +424,6 @@ jobs:
run: |
source .venv/bin/activate
echo "UNS_API_KEY=${{ secrets.UNS_API_KEY }}" > uns_test_env_file
+echo "UNSTRUCTURED_HF_TOKEN=${{ secrets.HF_TOKEN }}" >> uns_test_env_file
make docker-build
make docker-test CI=true UNSTRUCTURED_INCLUDE_DEBUG_METADATA=true
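The workflow changes above expose the Hugging Face token to the test jobs as `UNSTRUCTURED_HF_TOKEN`. How the library or its tests actually read that variable is not shown in this diff. The following is only a minimal sketch of one way a test helper could look it up; `get_hf_token` is a hypothetical name, and only the variable name comes from the workflow.

```python
import os
from typing import Mapping, Optional

# The variable name is taken from the workflow; everything else is illustrative.
HF_TOKEN_ENV_VAR = "UNSTRUCTURED_HF_TOKEN"


def get_hf_token(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the token if it is set and non-empty, otherwise None.

    `env` defaults to the process environment; passing a plain dict
    makes the helper easy to exercise in tests.
    """
    source = os.environ if env is None else env
    token = source.get(HF_TOKEN_ENV_VAR, "").strip()
    return token or None
```

A test that needs the gated model could then skip itself when `get_hf_token()` returns `None`, which keeps runs from forks (where the secret is unavailable) from failing outright.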
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -2,6 +2,7 @@

### Enhancements

+* **Bump `unstructured-inference` to `0.7.3`.** The updated version of `unstructured-inference` supports a new version of the Chipper model, as well as a cleaner schema for its output classes. Support is included for new inference features such as hierarchy and ordering.
* **Expose skip_infer_table_types in ingest CLI.** For each connector a new `--skip-infer-table-types` parameter was added to map to the `skip_infer_table_types` partition argument. This gives more granular control to unstructured-ingest users, allowing them to specify the file types for which we should attempt table extraction.

### Features
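The changelog entry mentions hierarchy support without showing its shape. As a rough illustration only, assuming each element carries an `id` and an optional `parent_id`, nesting depth could be derived as below. The `Node` class and `depths` function are hypothetical and are not part of `unstructured-inference`.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Node:
    """Hypothetical stand-in for an element that knows its parent."""

    id: str
    text: str
    parent_id: Optional[str] = None


def depths(nodes: List[Node]) -> Dict[str, int]:
    """Map each node id to its depth; roots have depth 0.

    A parent_id that is None or not present in `nodes` is treated as
    "no parent", so such a node counts as a root.
    """
    by_id = {node.id: node for node in nodes}
    result: Dict[str, int] = {}
    for node in nodes:
        depth, current, seen = 0, node, {node.id}
        while current.parent_id in by_id:
            if current.parent_id in seen:
                raise ValueError(f"cycle detected at {current.parent_id!r}")
            seen.add(current.parent_id)
            current = by_id[current.parent_id]
            depth += 1
        result[node.id] = depth
    return result
```

Because the input list is never reordered, a model-provided reading order is preserved, which is in line with the commits above that skip sorting for Chipper output.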
10 changes: 5 additions & 5 deletions requirements/dev.txt
@@ -144,7 +144,7 @@ jsonschema-specifications==2023.7.1
# via jsonschema
jupyter==1.0.0
# via -r requirements/dev.in
-jupyter-client==8.3.1
+jupyter-client==8.4.0
# via
# ipykernel
# jupyter-console
@@ -153,7 +153,7 @@ jupyter-client==8.3.1
# qtconsole
jupyter-console==6.6.3
# via jupyter
-jupyter-core==5.3.2
+jupyter-core==5.4.0
# via
# -c requirements/constraints.in
# ipykernel
@@ -178,7 +178,7 @@ jupyter-server==2.7.3
# notebook-shim
jupyter-server-terminals==0.4.4
# via jupyter-server
-jupyterlab==4.0.6
+jupyterlab==4.0.7
# via notebook
jupyterlab-pygments==0.2.2
# via nbconvert
@@ -213,7 +213,7 @@ nest-asyncio==1.5.8
# via ipykernel
nodeenv==1.8.0
# via pre-commit
-notebook==7.0.4
+notebook==7.0.5
# via jupyter
notebook-shim==0.2.3
# via
@@ -320,7 +320,7 @@ rfc3986-validator==0.1.1
# via
# jsonschema
# jupyter-events
-rpds-py==0.10.4
+rpds-py==0.10.6
# via
# jsonschema
# referencing
6 changes: 5 additions & 1 deletion requirements/extra-docx.txt
@@ -8,5 +8,9 @@ lxml==4.9.3
# via
# -c requirements/base.txt
# python-docx
-python-docx==0.8.11
+python-docx==1.0.0
# via -r requirements/extra-docx.in
+typing-extensions==4.8.0
+# via
+# -c requirements/base.txt
+# python-docx
6 changes: 5 additions & 1 deletion requirements/extra-odt.txt
@@ -10,5 +10,9 @@ lxml==4.9.3
# python-docx
pypandoc==1.11
# via -r requirements/extra-odt.in
-python-docx==0.8.11
+python-docx==1.0.0
# via -r requirements/extra-odt.in
+typing-extensions==4.8.0
+# via
+# -c requirements/base.txt
+# python-docx
2 changes: 1 addition & 1 deletion requirements/extra-paddleocr.txt
@@ -31,7 +31,7 @@ contourpy==1.1.1
# via matplotlib
cssselect==1.2.0
# via premailer
-cssutils==2.7.1
+cssutils==2.8.0
# via premailer
cycler==0.12.1
# via matplotlib
2 changes: 1 addition & 1 deletion requirements/extra-pdf-image.in
@@ -5,7 +5,7 @@ pdf2image
pdfminer.six
# Do not move to constraints.in, otherwise unstructured-inference will not be upgraded
# when unstructured library is.
-unstructured-inference==0.7.2
+unstructured-inference==0.7.3
# unstructured fork of pytesseract that provides an interface to allow for multiple output formats
# from one tesseract call
unstructured.pytesseract>=0.3.12
4 changes: 2 additions & 2 deletions requirements/extra-pdf-image.txt
@@ -149,7 +149,7 @@ pyparsing==3.0.9
# via
# -c requirements/constraints.in
# matplotlib
-pypdfium2==4.20.0
+pypdfium2==4.21.0
# via pdfplumber
pytesseract==0.3.10
# via layoutparser
@@ -231,7 +231,7 @@ typing-extensions==4.8.0
# torch
tzdata==2023.3
# via pandas
-unstructured-inference==0.7.2
+unstructured-inference==0.7.3
# via -r requirements/extra-pdf-image.in
unstructured-pytesseract==0.3.12
# via
2 changes: 1 addition & 1 deletion requirements/ingest-azure.txt
@@ -23,7 +23,7 @@ azure-datalake-store==0.0.53
# via adlfs
azure-identity==1.14.1
# via adlfs
-azure-storage-blob==12.18.2
+azure-storage-blob==12.18.3
# via adlfs
certifi==2023.7.22
# via
2 changes: 1 addition & 1 deletion requirements/ingest-gcs.txt
@@ -64,7 +64,7 @@ google-crc32c==1.5.0
# via google-resumable-media
google-resumable-media==2.6.0
# via google-cloud-storage
-googleapis-common-protos==1.60.0
+googleapis-common-protos==1.61.0
# via google-api-core
idna==3.4
# via
4 changes: 2 additions & 2 deletions requirements/ingest-google-drive.txt
@@ -17,7 +17,7 @@ charset-normalizer==3.3.0
# requests
google-api-core==2.12.0
# via google-api-python-client
-google-api-python-client==2.102.0
+google-api-python-client==2.103.0
# via -r requirements/ingest-google-drive.in
google-auth==2.23.3
# via
@@ -26,7 +26,7 @@ google-auth==2.23.3
# google-auth-httplib2
google-auth-httplib2==0.1.1
# via google-api-python-client
-googleapis-common-protos==1.60.0
+googleapis-common-protos==1.61.0
# via google-api-core
httplib2==0.22.0
# via
2 changes: 1 addition & 1 deletion requirements/ingest-openai.txt
@@ -50,7 +50,7 @@ jsonpatch==1.33
# via langchain
jsonpointer==2.4
# via jsonpatch
-langchain==0.0.311
+langchain==0.0.313
# via -r requirements/ingest-openai.in
langsmith==0.0.43
# via langchain
2 changes: 1 addition & 1 deletion requirements/test.txt
@@ -54,7 +54,7 @@ mccabe==0.7.0
# via flake8
multidict==6.0.4
# via yarl
-mypy==1.5.1
+mypy==1.6.0
# via -r requirements/test.in
mypy-extensions==1.0.0
# via
2 changes: 1 addition & 1 deletion test_unstructured/partition/pdf_image/test_image.py
@@ -62,7 +62,7 @@ def __init__(self, number: int, image: Image):
@property
def elements(self):
return [
-layout.LayoutElement(
+layout.LayoutElement.from_coords(
type="Title",
x1=0,
y1=0,
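The test change above swaps a direct `LayoutElement(...)` call for the `LayoutElement.from_coords(...)` classmethod. The real signature is not visible in this diff beyond the `type`, `x1`, and `y1` keyword arguments, so the following is only a sketch of the general alternate-constructor pattern, assuming the element now stores a bounding-box object rather than raw coordinates. `Box` and `SketchElement` are hypothetical names used for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Box:
    """Hypothetical bounding box with two opposite corners."""

    x1: float
    y1: float
    x2: float
    y2: float


@dataclass
class SketchElement:
    """Hypothetical element that holds a Box instead of raw coordinates."""

    bbox: Box
    text: Optional[str] = None
    type: Optional[str] = None

    @classmethod
    def from_coords(
        cls,
        x1: float,
        y1: float,
        x2: float,
        y2: float,
        text: Optional[str] = None,
        type: Optional[str] = None,
    ) -> "SketchElement":
        # Build the box from raw coordinates so callers that still think
        # in x1/y1/x2/y2 terms do not need to construct a Box themselves.
        return cls(bbox=Box(x1, y1, x2, y2), text=text, type=type)
```

With this pattern, a call such as `SketchElement.from_coords(type="Title", x1=0, y1=0, x2=10, y2=5)` keeps test fixtures close to their previous keyword-argument form while the primary constructor takes the richer box type.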