feat: serialize ingest docs as json (#1178)
ryannikolaidis authored Aug 31, 2023
1 parent 2777313 commit 076b1e3
Showing 48 changed files with 486 additions and 463 deletions.
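
As a rough illustration of what the commit title describes, the sketch below round-trips a small dataclass through a JSON string before handing it to a worker process. The commit itself adds `dataclasses-json` to `requirements/base.in`; this sketch deliberately uses only the standard library, and `ExampleIngestDoc`, its fields, and `process_doc` are hypothetical names that do not come from the repository.

```python
import json
from dataclasses import asdict, dataclass
from multiprocessing import Pool


@dataclass
class ExampleIngestDoc:
    """Hypothetical stand-in for an ingest doc; not the library's class."""

    remote_path: str
    download_dir: str

    def to_json(self) -> str:
        # Serialize the dataclass fields to a plain JSON string.
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, payload: str) -> "ExampleIngestDoc":
        # Rebuild the object from the JSON string inside the worker.
        return cls(**json.loads(payload))


def process_doc(payload: str) -> str:
    # The worker receives only a string and reconstructs the doc itself.
    doc = ExampleIngestDoc.from_json(payload)
    return f"{doc.download_dir}/{doc.remote_path}"


if __name__ == "__main__":
    docs = [
        ExampleIngestDoc(remote_path="reports/a.txt", download_dir="downloads"),
        ExampleIngestDoc(remote_path="reports/b.txt", download_dir="downloads"),
    ]
    # Only JSON strings cross the process boundary, not the objects.
    payloads = [doc.to_json() for doc in docs]
    with Pool(processes=2) as pool:
        print(pool.map(process_doc, payloads))
```

Passing a string rather than the object means the worker does not depend on the object being picklable, at the cost of requiring every field to be JSON-serializable.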
3 changes: 2 additions & 1 deletion CHANGELOG.md
@@ -1,4 +1,4 @@
## 0.10.10-dev2
## 0.10.10-dev4

### Enhancements

@@ -7,6 +7,7 @@
on carriage returns in the XML. Since `partition_xml` no longer calls `partition_text`,
`min_partition` and `max_partition` are no longer supported in `partition_xml`.
* Bump `unstructured-inference==0.5.18`, change non-default detectron2 classification threshold
* Serialize IngestDocs to JSON when passing to subprocesses

### Features

4 changes: 3 additions & 1 deletion docs/requirements.txt
@@ -42,7 +42,9 @@ jinja2==3.1.2
markupsafe==2.1.3
# via jinja2
packaging==23.1
# via sphinx
# via
# -c requirements/base.txt
# sphinx
pygments==2.16.1
# via
# furo
1 change: 1 addition & 0 deletions requirements/base.in
@@ -8,3 +8,4 @@ tabulate
requests
beautifulsoup4
emoji
dataclasses-json
12 changes: 12 additions & 0 deletions requirements/base.txt
@@ -16,6 +16,8 @@ charset-normalizer==3.2.0
# via requests
click==8.1.7
# via nltk
dataclasses-json==0.5.14
# via -r requirements/base.in
emoji==2.8.0
# via -r requirements/base.in
filetype==1.2.0
@@ -26,8 +28,14 @@ joblib==1.3.2
# via nltk
lxml==4.9.3
# via -r requirements/base.in
marshmallow==3.20.1
# via dataclasses-json
mypy-extensions==1.0.0
# via typing-inspect
nltk==3.8.1
# via -r requirements/base.in
packaging==23.1
# via marshmallow
python-magic==0.4.27
# via -r requirements/base.in
regex==2023.8.8
@@ -40,6 +48,10 @@ tabulate==0.9.0
# via -r requirements/base.in
tqdm==4.66.1
# via nltk
typing-extensions==4.7.1
# via typing-inspect
typing-inspect==0.9.0
# via dataclasses-json
urllib3==1.26.16
# via
# -c requirements/constraints.in
4 changes: 3 additions & 1 deletion requirements/build.txt
@@ -42,7 +42,9 @@ jinja2==3.1.2
markupsafe==2.1.3
# via jinja2
packaging==23.1
# via sphinx
# via
# -c requirements/base.txt
# sphinx
pygments==2.16.1
# via
# furo
2 changes: 2 additions & 0 deletions requirements/dev.txt
@@ -221,6 +221,7 @@ overrides==7.4.0
# via jupyter-server
packaging==23.1
# via
# -c requirements/base.txt
# -c requirements/test.txt
# build
# ipykernel
@@ -377,6 +378,7 @@ traitlets==5.9.0
# qtconsole
typing-extensions==4.7.1
# via
# -c requirements/base.txt
# -c requirements/test.txt
# async-lru
# filelock
1 change: 1 addition & 0 deletions requirements/extra-pdf-image.txt
@@ -91,6 +91,7 @@ opencv-python==4.8.0.76
# unstructured-inference
packaging==23.1
# via
# -c requirements/base.txt
# huggingface-hub
# matplotlib
# onnxruntime
1 change: 1 addition & 0 deletions requirements/huggingface.txt
@@ -50,6 +50,7 @@ numpy==1.24.4
# transformers
packaging==23.1
# via
# -c requirements/base.txt
# huggingface-hub
# transformers
pyyaml==6.0.1
1 change: 1 addition & 0 deletions requirements/ingest-airtable.txt
@@ -31,6 +31,7 @@ requests==2.31.0
# pyairtable
typing-extensions==4.7.1
# via
# -c requirements/base.txt
# pyairtable
# pydantic
urllib3==1.26.16
1 change: 1 addition & 0 deletions requirements/ingest-azure.txt
@@ -89,6 +89,7 @@ six==1.16.0
# isodate
typing-extensions==4.7.1
# via
# -c requirements/base.txt
# azure-core
# azure-storage-blob
urllib3==1.26.16
4 changes: 3 additions & 1 deletion requirements/ingest-s3.txt
@@ -49,7 +49,9 @@ s3fs==2023.6.0
six==1.16.0
# via python-dateutil
typing-extensions==4.7.1
# via aioitertools
# via
# -c requirements/base.txt
# aioitertools
urllib3==1.26.16
# via
# -c requirements/base.txt
3 changes: 3 additions & 0 deletions requirements/test.txt
@@ -58,10 +58,12 @@ mypy==1.5.1
# via -r requirements/test.in
mypy-extensions==1.0.0
# via
# -c requirements/base.txt
# black
# mypy
packaging==23.1
# via
# -c requirements/base.txt
# black
# pytest
pathspec==0.11.2
@@ -117,6 +119,7 @@ types-urllib3==1.26.25.14
# via types-requests
typing-extensions==4.7.1
# via
# -c requirements/base.txt
# black
# mypy
# pydantic
@@ -3,15 +3,7 @@
"type": "NarrativeText",
"element_id": "1df8eeb8be847c3a1a7411e3be3e0396",
"metadata": {
"data_source": {
"record_locator": {
"site": "https://unstructuredio.sharepoint.com/",
"unique_id": "880f80ca-cebf-48d0-b639-aeb671b3c431",
"server_relative_url": "/Shared Documents/fake-text.txt"
},
"date_created": "2023-06-16T05:04:55Z",
"date_modified": "2023-06-16T05:04:55Z"
},
"data_source": {},
"filename": "fake-text.txt",
"filetype": "text/plain"
},
@@ -21,15 +13,7 @@
"type": "Address",
"element_id": "a9d4657034aa3fdb5177f1325e912362",
"metadata": {
"data_source": {
"record_locator": {
"site": "https://unstructuredio.sharepoint.com/",
"unique_id": "880f80ca-cebf-48d0-b639-aeb671b3c431",
"server_relative_url": "/Shared Documents/fake-text.txt"
},
"date_created": "2023-06-16T05:04:55Z",
"date_modified": "2023-06-16T05:04:55Z"
},
"data_source": {},
"filename": "fake-text.txt",
"filetype": "text/plain"
},
@@ -39,15 +23,7 @@
"type": "Title",
"element_id": "9c218520320f238595f1fde74bdd137d",
"metadata": {
"data_source": {
"record_locator": {
"site": "https://unstructuredio.sharepoint.com/",
"unique_id": "880f80ca-cebf-48d0-b639-aeb671b3c431",
"server_relative_url": "/Shared Documents/fake-text.txt"
},
"date_created": "2023-06-16T05:04:55Z",
"date_modified": "2023-06-16T05:04:55Z"
},
"data_source": {},
"filename": "fake-text.txt",
"filetype": "text/plain"
},
@@ -57,15 +33,7 @@
"type": "ListItem",
"element_id": "39a3ae572581d0f1fe7511fd7b3aa414",
"metadata": {
"data_source": {
"record_locator": {
"site": "https://unstructuredio.sharepoint.com/",
"unique_id": "880f80ca-cebf-48d0-b639-aeb671b3c431",
"server_relative_url": "/Shared Documents/fake-text.txt"
},
"date_created": "2023-06-16T05:04:55Z",
"date_modified": "2023-06-16T05:04:55Z"
},
"data_source": {},
"filename": "fake-text.txt",
"filetype": "text/plain"
},
@@ -75,15 +43,7 @@
"type": "ListItem",
"element_id": "fc1adcb8eaceac694e500a103f9f698f",
"metadata": {
"data_source": {
"record_locator": {
"site": "https://unstructuredio.sharepoint.com/",
"unique_id": "880f80ca-cebf-48d0-b639-aeb671b3c431",
"server_relative_url": "/Shared Documents/fake-text.txt"
},
"date_created": "2023-06-16T05:04:55Z",
"date_modified": "2023-06-16T05:04:55Z"
},
"data_source": {},
"filename": "fake-text.txt",
"filetype": "text/plain"
},
@@ -93,15 +53,7 @@
"type": "ListItem",
"element_id": "0b61e826b1c4ab05750184da72b89f83",
"metadata": {
"data_source": {
"record_locator": {
"site": "https://unstructuredio.sharepoint.com/",
"unique_id": "880f80ca-cebf-48d0-b639-aeb671b3c431",
"server_relative_url": "/Shared Documents/fake-text.txt"
},
"date_created": "2023-06-16T05:04:55Z",
"date_modified": "2023-06-16T05:04:55Z"
},
"data_source": {},
"filename": "fake-text.txt",
"filetype": "text/plain"
},
@@ -3,15 +3,7 @@
"type": "NarrativeText",
"element_id": "c08fcabe68ba13b7a7cc6592bd5513a8",
"metadata": {
"data_source": {
"record_locator": {
"site": "https://unstructuredio.sharepoint.com/",
"unique_id": "0dfe3d76-00c0-42db-ae1b-8cf22d4b3f10",
"server_relative_url": "/Shared Documents/ideas-page.html"
},
"date_created": "2023-06-16T05:04:47Z",
"date_modified": "2023-06-16T05:04:47Z"
},
"data_source": {},
"filename": "ideas-page.html",
"filetype": "text/html",
"page_number": 1,
@@ -3,15 +3,7 @@
"type": "Table",
"element_id": "3e65b02bec20bb1056bd23a3b4ecd0f6",
"metadata": {
"data_source": {
"record_locator": {
"site": "https://unstructuredio.sharepoint.com/",
"unique_id": "b9956a33-8079-4321-91ea-609def07394d",
"server_relative_url": "/Shared Documents/stanley-cups.xlsx"
},
"date_created": "2023-06-16T05:05:05Z",
"date_modified": "2023-06-16T05:05:05Z"
},
"data_source": {},
"filename": "stanley-cups.xlsx",
"filetype": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
"page_number": 1,
@@ -24,15 +16,7 @@
"type": "Table",
"element_id": "0699dddf33814117e04654068f5182f6",
"metadata": {
"data_source": {
"record_locator": {
"site": "https://unstructuredio.sharepoint.com/",
"unique_id": "b9956a33-8079-4321-91ea-609def07394d",
"server_relative_url": "/Shared Documents/stanley-cups.xlsx"
},
"date_created": "2023-06-16T05:05:05Z",
"date_modified": "2023-06-16T05:05:05Z"
},
"data_source": {},
"filename": "stanley-cups.xlsx",
"filetype": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
"page_number": 2,
@@ -3,16 +3,7 @@
"type": "Title",
"element_id": "b4e929d8bcfe04189801a8ed61496d17",
"metadata": {
"data_source": {
"version": "1.2",
"record_locator": {
"site": "https://unstructuredio.sharepoint.com/",
"unique_id": "2b564fff-e9bb-4b64-9822-64f96a20ea10",
"absolute_url": "https://unstructuredio.sharepoint.com/SitePages/Home.aspx"
},
"date_created": "0001-01-01T08:00:00Z",
"date_modified": "2023-06-16T05:12:51Z"
},
"data_source": {},
"filename": "Home.html",
"filetype": "text/html",
"page_number": 1
@@ -23,16 +14,7 @@
"type": "Title",
"element_id": "8d14f6e72de8f18ab1ee5c5330f00653",
"metadata": {
"data_source": {
"version": "1.2",
"record_locator": {
"site": "https://unstructuredio.sharepoint.com/",
"unique_id": "2b564fff-e9bb-4b64-9822-64f96a20ea10",
"absolute_url": "https://unstructuredio.sharepoint.com/SitePages/Home.aspx"
},
"date_created": "0001-01-01T08:00:00Z",
"date_modified": "2023-06-16T05:12:51Z"
},
"data_source": {},
"filename": "Home.html",
"filetype": "text/html",
"page_number": 1