build(release): bump unstructured #183

Merged
merged 7 commits into from
Aug 14, 2023
Changes from 3 commits
3 changes: 2 additions & 1 deletion CHANGELOG.md
@@ -1,5 +1,6 @@
## 0.0.35-dev0
## 0.0.35

* Bump unstructured library to 0.9.2
* Fix a misleading error in make docker-test

## 0.0.34
30 changes: 24 additions & 6 deletions pipeline-notebooks/pipeline-general.ipynb
@@ -921,6 +921,7 @@
"[{'type': 'UncategorizedText',\n",
" 'element_id': 'db1ca22813f01feda8759ff04a844e56',\n",
" 'metadata': {'filename': 'family-day.eml',\n",
" 'file_directory': '/Users/shreyanid/Documents/all-unstructured/unstructured-api/sample-docs',\n",
Collaborator

Looks like this field has been in the metadata for a while. Did something just change for it to show up here? This will probably get us into trouble with `make check-notebooks`.

Contributor Author

That's exactly what's happening :') Not sure what just changed; this commit was thrown out as a possibility. But it isn't happening on the current main branch of api, so some package bump (likely unstructured) probably caused the change. I would appreciate some help debugging where the change came from, and why the file_directory difference shows up in CI but not locally.

Contributor Author

In this release comparison, it doesn't seem like anything related to the file_directory field changed recently.

Collaborator

Hmm, that's a fun one! I have some cycles now to take a look as well.

Collaborator

Ah, I think it is that commit. Here it looks like file_directory will be set whenever a filename is present, and that commit will have us sending metadata_filename all the time. All the different filename params confuse me, but the tl;dr is that we should remove that field, like the ones over here.
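
The suggestion above, removing `file_directory` from the returned metadata, could look roughly like the sketch below. It assumes elements are plain dicts shaped like the notebook output in this diff. `drop_metadata_fields` is a hypothetical helper name used only for illustration; it is not a function from unstructured or from this repository, and the real fix may live somewhere else entirely.

```python
# Hypothetical helper: the name and the default field list are assumptions
# made for illustration, not code taken from this PR.
def drop_metadata_fields(elements, fields=("file_directory",)):
    """Return copies of the element dicts with the given metadata keys removed."""
    cleaned = []
    for element in elements:
        # Keep every metadata entry except the ones we were asked to drop.
        metadata = {
            key: value
            for key, value in element.get("metadata", {}).items()
            if key not in fields
        }
        # Build a new dict so the caller's input is left untouched.
        cleaned.append({**element, "metadata": metadata})
    return cleaned


# Sample element copied from the notebook output shown in this diff.
elements = [
    {
        "type": "NarrativeText",
        "element_id": "1df8eeb8be847c3a1a7411e3be3e0396",
        "metadata": {
            "filename": "fake-text.txt",
            "file_directory": "/Users/shreyanid/Documents/all-unstructured/unstructured-api/sample-docs",
            "filetype": "text/plain",
        },
        "text": "This is a test document to use for unit tests.",
    }
]

print(drop_metadata_fields(elements)[0]["metadata"])
```

Dropping the key on the way out keeps the committed notebook output independent of where the sample documents happen to live on a given machine.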

" 'filetype': 'message/rfc822',\n",
" 'sent_from': ['Mallori Harrell <[email protected]>'],\n",
" 'sent_to': ['Mallori Harrell <[email protected]>'],\n",
@@ -929,6 +930,7 @@
" {'type': 'NarrativeText',\n",
" 'element_id': 'a663c393a5e143c01ef2bb5c98efa2c1',\n",
" 'metadata': {'filename': 'family-day.eml',\n",
" 'file_directory': '/Users/shreyanid/Documents/all-unstructured/unstructured-api/sample-docs',\n",
" 'filetype': 'message/rfc822',\n",
" 'sent_from': ['Mallori Harrell <[email protected]>'],\n",
" 'sent_to': ['Mallori Harrell <[email protected]>'],\n",
@@ -937,6 +939,7 @@
" {'type': 'NarrativeText',\n",
" 'element_id': 'ce65ca3bef59957d3f1c2bab5725c82f',\n",
" 'metadata': {'filename': 'family-day.eml',\n",
" 'file_directory': '/Users/shreyanid/Documents/all-unstructured/unstructured-api/sample-docs',\n",
" 'filetype': 'message/rfc822',\n",
" 'sent_from': ['Mallori Harrell <[email protected]>'],\n",
" 'sent_to': ['Mallori Harrell <[email protected]>'],\n",
@@ -945,6 +948,7 @@
" {'type': 'NarrativeText',\n",
" 'element_id': 'd7bcf988af9f06042d83e25c531e5744',\n",
" 'metadata': {'filename': 'family-day.eml',\n",
" 'file_directory': '/Users/shreyanid/Documents/all-unstructured/unstructured-api/sample-docs',\n",
" 'filetype': 'message/rfc822',\n",
" 'sent_from': ['Mallori Harrell <[email protected]>'],\n",
" 'sent_to': ['Mallori Harrell <[email protected]>'],\n",
@@ -953,6 +957,7 @@
" {'type': 'Title',\n",
" 'element_id': '5550577db69c2c8aabcd90979698120a',\n",
" 'metadata': {'filename': 'family-day.eml',\n",
" 'file_directory': '/Users/shreyanid/Documents/all-unstructured/unstructured-api/sample-docs',\n",
" 'filetype': 'message/rfc822',\n",
" 'sent_from': ['Mallori Harrell <[email protected]>'],\n",
" 'sent_to': ['Mallori Harrell <[email protected]>'],\n",
@@ -961,6 +966,7 @@
" {'type': 'Title',\n",
" 'element_id': 'ca1c571d993b6c1ed8ef56a06c16ba22',\n",
" 'metadata': {'filename': 'family-day.eml',\n",
" 'file_directory': '/Users/shreyanid/Documents/all-unstructured/unstructured-api/sample-docs',\n",
" 'filetype': 'message/rfc822',\n",
" 'sent_from': ['Mallori Harrell <[email protected]>'],\n",
" 'sent_to': ['Mallori Harrell <[email protected]>'],\n",
@@ -969,6 +975,7 @@
" {'type': 'Title',\n",
" 'element_id': 'd5b612de8cd918addd9569b0255b65b2',\n",
" 'metadata': {'filename': 'family-day.eml',\n",
" 'file_directory': '/Users/shreyanid/Documents/all-unstructured/unstructured-api/sample-docs',\n",
" 'filetype': 'message/rfc822',\n",
" 'sent_from': ['Mallori Harrell <[email protected]>'],\n",
" 'sent_to': ['Mallori Harrell <[email protected]>'],\n",
@@ -977,6 +984,7 @@
" {'type': 'Title',\n",
" 'element_id': '2e0b9e8ee04b9594a9c26d8535b818ff',\n",
" 'metadata': {'filename': 'family-day.eml',\n",
" 'file_directory': '/Users/shreyanid/Documents/all-unstructured/unstructured-api/sample-docs',\n",
" 'filetype': 'message/rfc822',\n",
" 'sent_from': ['Mallori Harrell <[email protected]>'],\n",
" 'sent_to': ['Mallori Harrell <[email protected]>'],\n",
@@ -1015,7 +1023,7 @@
{
"data": {
"text/plain": [
"'type,text,element_id,filename,filetype,sent_from,sent_to,subject,sender\\nUncategorizedText,\"Hi All,\",db1ca22813f01feda8759ff04a844e56,family-day.eml,message/rfc822,[\\'Mallori Harrell <[email protected]>\\'],[\\'Mallori Harrell <[email protected]>\\'],Family Day,Mallori Harrell <[email protected]>\\nNarrativeText,Get excited for our first annual family day!\\xa0,a663c393a5e143c01ef2bb5c98efa2c1,family-day.eml,message/rfc822,[\\'Mallori Harrell <[email protected]>\\'],[\\'Mallori Harrell <[email protected]>\\'],Family Day,Mallori Harrell <[email protected]>\\nNarrativeText,\"There will be face painting, a petting zoo, funnel cake and more.\",ce65ca3bef59957d3f1c2bab5725c82f,family-day.eml,message/rfc822,[\\'Mallori Harrell <[email protected]>\\'],[\\'Mallori Harrell <[email protected]>\\'],Family Day,Mallori Harrell <[email protected]>\\nNarrativeText,Make sure to RSVP!,d7bcf988af9f06042d83e25c531e5744,family-day.eml,message/rfc822,[\\'Mallori Harrell <[email protected]>\\'],[\\'Mallori Harrell <[email protected]>\\'],Family Day,Mallori Harrell <[email protected]>\\nTitle,Best.,5550577db69c2c8aabcd90979698120a,family-day.eml,message/rfc822,[\\'Mallori Harrell <[email protected]>\\'],[\\'Mallori Harrell <[email protected]>\\'],Family Day,Mallori Harrell <[email protected]>\\nTitle,Mallori Harrell,ca1c571d993b6c1ed8ef56a06c16ba22,family-day.eml,message/rfc822,[\\'Mallori Harrell <[email protected]>\\'],[\\'Mallori Harrell <[email protected]>\\'],Family Day,Mallori Harrell <[email protected]>\\nTitle,Unstructured Technologies,d5b612de8cd918addd9569b0255b65b2,family-day.eml,message/rfc822,[\\'Mallori Harrell <[email protected]>\\'],[\\'Mallori Harrell <[email protected]>\\'],Family Day,Mallori Harrell <[email protected]>\\nTitle,Data Scientist,2e0b9e8ee04b9594a9c26d8535b818ff,family-day.eml,message/rfc822,[\\'Mallori Harrell <[email protected]>\\'],[\\'Mallori Harrell <[email protected]>\\'],Family Day,Mallori Harrell <[email protected]>\\n'"
"'type,text,element_id,filename,file_directory,filetype,sent_from,sent_to,subject,sender\\nUncategorizedText,\"Hi All,\",db1ca22813f01feda8759ff04a844e56,family-day.eml,/Users/shreyanid/Documents/all-unstructured/unstructured-api/sample-docs,message/rfc822,[\\'Mallori Harrell <[email protected]>\\'],[\\'Mallori Harrell <[email protected]>\\'],Family Day,Mallori Harrell <[email protected]>\\nNarrativeText,Get excited for our first annual family day!\\xa0,a663c393a5e143c01ef2bb5c98efa2c1,family-day.eml,/Users/shreyanid/Documents/all-unstructured/unstructured-api/sample-docs,message/rfc822,[\\'Mallori Harrell <[email protected]>\\'],[\\'Mallori Harrell <[email protected]>\\'],Family Day,Mallori Harrell <[email protected]>\\nNarrativeText,\"There will be face painting, a petting zoo, funnel cake and more.\",ce65ca3bef59957d3f1c2bab5725c82f,family-day.eml,/Users/shreyanid/Documents/all-unstructured/unstructured-api/sample-docs,message/rfc822,[\\'Mallori Harrell <[email protected]>\\'],[\\'Mallori Harrell <[email protected]>\\'],Family Day,Mallori Harrell <[email protected]>\\nNarrativeText,Make sure to RSVP!,d7bcf988af9f06042d83e25c531e5744,family-day.eml,/Users/shreyanid/Documents/all-unstructured/unstructured-api/sample-docs,message/rfc822,[\\'Mallori Harrell <[email protected]>\\'],[\\'Mallori Harrell <[email protected]>\\'],Family Day,Mallori Harrell <[email protected]>\\nTitle,Best.,5550577db69c2c8aabcd90979698120a,family-day.eml,/Users/shreyanid/Documents/all-unstructured/unstructured-api/sample-docs,message/rfc822,[\\'Mallori Harrell <[email protected]>\\'],[\\'Mallori Harrell <[email protected]>\\'],Family Day,Mallori Harrell <[email protected]>\\nTitle,Mallori Harrell,ca1c571d993b6c1ed8ef56a06c16ba22,family-day.eml,/Users/shreyanid/Documents/all-unstructured/unstructured-api/sample-docs,message/rfc822,[\\'Mallori Harrell <[email protected]>\\'],[\\'Mallori Harrell <[email protected]>\\'],Family Day,Mallori Harrell <[email protected]>\\nTitle,Unstructured Technologies,d5b612de8cd918addd9569b0255b65b2,family-day.eml,/Users/shreyanid/Documents/all-unstructured/unstructured-api/sample-docs,message/rfc822,[\\'Mallori Harrell <[email protected]>\\'],[\\'Mallori Harrell <[email protected]>\\'],Family Day,Mallori Harrell <[email protected]>\\nTitle,Data Scientist,2e0b9e8ee04b9594a9c26d8535b818ff,family-day.eml,/Users/shreyanid/Documents/all-unstructured/unstructured-api/sample-docs,message/rfc822,[\\'Mallori Harrell <[email protected]>\\'],[\\'Mallori Harrell <[email protected]>\\'],Family Day,Mallori Harrell <[email protected]>\\n'"
]
},
"execution_count": null,
@@ -1068,23 +1076,33 @@
"text/plain": [
"[{'type': 'NarrativeText',\n",
" 'element_id': '1df8eeb8be847c3a1a7411e3be3e0396',\n",
" 'metadata': {'filename': 'fake-text.txt', 'filetype': 'text/plain'},\n",
" 'metadata': {'filename': 'fake-text.txt',\n",
" 'file_directory': '/Users/shreyanid/Documents/all-unstructured/unstructured-api/sample-docs',\n",
" 'filetype': 'text/plain'},\n",
" 'text': 'This is a test document to use for unit tests.'},\n",
" {'type': 'Title',\n",
" 'element_id': '9c218520320f238595f1fde74bdd137d',\n",
" 'metadata': {'filename': 'fake-text.txt', 'filetype': 'text/plain'},\n",
" 'metadata': {'filename': 'fake-text.txt',\n",
" 'file_directory': '/Users/shreyanid/Documents/all-unstructured/unstructured-api/sample-docs',\n",
" 'filetype': 'text/plain'},\n",
" 'text': 'Important points:'},\n",
" {'type': 'ListItem',\n",
" 'element_id': '39a3ae572581d0f1fe7511fd7b3aa414',\n",
" 'metadata': {'filename': 'fake-text.txt', 'filetype': 'text/plain'},\n",
" 'metadata': {'filename': 'fake-text.txt',\n",
" 'file_directory': '/Users/shreyanid/Documents/all-unstructured/unstructured-api/sample-docs',\n",
" 'filetype': 'text/plain'},\n",
" 'text': 'Hamburgers are delicious'},\n",
" {'type': 'ListItem',\n",
" 'element_id': 'fc1adcb8eaceac694e500a103f9f698f',\n",
" 'metadata': {'filename': 'fake-text.txt', 'filetype': 'text/plain'},\n",
" 'metadata': {'filename': 'fake-text.txt',\n",
" 'file_directory': '/Users/shreyanid/Documents/all-unstructured/unstructured-api/sample-docs',\n",
" 'filetype': 'text/plain'},\n",
" 'text': 'Dogs are the best'},\n",
" {'type': 'ListItem',\n",
" 'element_id': '0b61e826b1c4ab05750184da72b89f83',\n",
" 'metadata': {'filename': 'fake-text.txt', 'filetype': 'text/plain'},\n",
" 'metadata': {'filename': 'fake-text.txt',\n",
" 'file_directory': '/Users/shreyanid/Documents/all-unstructured/unstructured-api/sample-docs',\n",
" 'filetype': 'text/plain'},\n",
" 'text': 'I love fuzzy blankets'}]"
]
},
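
The `file_directory` values in the notebook outputs above embed an absolute path from one developer's machine, which is the kind of difference that would make committed outputs disagree between a local run and CI. A minimal sketch of how such paths could be spotted in a notebook is shown here. It is not the implementation of `make check-notebooks`; the function name, the regular expression, and the assumption that home directories live under `/Users/` or `/home/` are all illustrative.

```python
import re

# Assumption: paths under /Users/<name>/ or /home/<name>/ are machine-specific.
MACHINE_PATH = re.compile(r"/(?:Users|home)/[^/\s]+/")


def find_machine_paths(notebook):
    """Return (cell_index, line) pairs for text/plain output lines with a home path."""
    hits = []
    for index, cell in enumerate(notebook.get("cells", [])):
        for output in cell.get("outputs", []):
            lines = output.get("data", {}).get("text/plain", [])
            # The notebook format allows either a single string or a list of strings.
            if isinstance(lines, str):
                lines = [lines]
            for line in lines:
                if MACHINE_PATH.search(line):
                    hits.append((index, line.strip()))
    return hits


# A tiny in-memory notebook shaped like the .ipynb JSON in this diff.
notebook = {
    "cells": [
        {
            "cell_type": "code",
            "outputs": [
                {
                    "data": {
                        "text/plain": [
                            " 'metadata': {'filename': 'fake-text.txt',\n",
                            " 'file_directory': '/Users/shreyanid/Documents/all-unstructured/unstructured-api/sample-docs',\n",
                            " 'filetype': 'text/plain'},\n",
                        ]
                    }
                }
            ],
        }
    ]
}

hits = find_machine_paths(notebook)
for cell_index, line in hits:
    print(f"cell {cell_index}: {line}")
```

To run it against a real file, the notebook dict can be loaded with `json.load` from the `.ipynb` path.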
34 changes: 18 additions & 16 deletions requirements/base.txt
@@ -17,7 +17,9 @@ attrs==23.1.0
autoflake==2.2.0
# via unstructured-api-tools
beautifulsoup4==4.12.2
# via nbconvert
# via
# nbconvert
# unstructured
bleach==6.0.0
# via nbconvert
certifi==2023.7.22
@@ -40,8 +42,10 @@ coloredlogs==15.0.1
# via onnxruntime
contourpy==1.1.0
# via matplotlib
cryptography==41.0.3
# via pdfminer-six
cryptography==41.0.2
# via
# pdfminer-six
# unstructured
cycler==0.11.0
# via matplotlib
defusedxml==0.7.1
@@ -101,7 +105,7 @@ jinja2==3.1.2
# nbconvert
# torch
# unstructured-api-tools
joblib==1.3.1
joblib==1.3.2
# via nltk
jsonschema==4.19.0
# via nbformat
@@ -140,7 +144,7 @@ mpmath==1.3.0
# via sympy
msg-parser==1.2.0
# via unstructured
mypy==1.4.1
mypy==1.5.0
# via unstructured-api-tools
mypy-extensions==1.0.0
# via mypy
@@ -174,7 +178,7 @@ omegaconf==2.3.0
# via effdet
onnxruntime==1.15.1
# via unstructured-inference
opencv-python==4.8.0.74
opencv-python==4.8.0.76
# via
# layoutparser
# unstructured-inference
@@ -221,7 +225,7 @@ platformdirs==3.10.0
# via jupyter-core
portalocker==2.7.0
# via iopath
protobuf==4.23.4
protobuf==4.24.0
# via onnxruntime
pycocotools==2.0.6
# via effdet
@@ -274,15 +278,15 @@ pyyaml==6.0.1
# timm
# transformers
# uvicorn
pyzmq==25.1.0
pyzmq==25.1.1
# via jupyter-client
ratelimit==2.2.1
# via -r requirements/base.in
referencing==0.30.2
# via
# jsonschema
# jsonschema-specifications
regex==2023.6.3
regex==2023.8.8
# via
# nltk
# transformers
@@ -297,7 +301,7 @@ rpds-py==0.9.2
# via
# jsonschema
# referencing
safetensors==0.3.1
safetensors==0.3.2
# via
# timm
# transformers
@@ -340,9 +344,9 @@ torchvision==0.15.2
# effdet
# layoutparser
# timm
tornado==6.3.2
tornado==6.3.3
# via jupyter-client
tqdm==4.65.0
tqdm==4.66.1
# via
# huggingface-hub
# iopath
@@ -365,24 +369,22 @@ types-urllib3==1.26.25.14
# via types-requests
typing-extensions==4.7.1
# via
# annotated-types
# fastapi
# huggingface-hub
# iopath
# mypy
# pydantic
# pydantic-core
# pypdf
# starlette
# torch
# uvicorn
tzdata==2023.3
# via pandas
unstructured[local-inference]==0.9.0
unstructured[local-inference]==0.9.2
# via -r requirements/base.in
unstructured-api-tools==0.10.10
# via -r requirements/base.in
unstructured-inference==0.5.7
unstructured-inference==0.5.9
# via unstructured
urllib3==2.0.4
# via requests
29 changes: 15 additions & 14 deletions requirements/test.txt
@@ -50,6 +50,7 @@ beautifulsoup4==4.12.2
# via
# -r requirements/base.txt
# nbconvert
# unstructured
black==23.7.0
# via -r requirements/test.in
bleach==6.0.0
@@ -98,15 +99,16 @@ contourpy==1.1.0
# matplotlib
coverage[toml]==7.2.7
# via pytest-cov
cryptography==41.0.3
cryptography==41.0.2
# via
# -r requirements/base.txt
# pdfminer-six
# unstructured
cycler==0.11.0
# via
# -r requirements/base.txt
# matplotlib
debugpy==1.6.7
debugpy==1.6.7.post1
# via ipykernel
decorator==5.1.1
# via ipython
@@ -254,7 +256,7 @@ jinja2==3.1.2
# nbconvert
# torch
# unstructured-api-tools
joblib==1.3.1
joblib==1.3.2
# via
# -r requirements/base.txt
# nltk
@@ -366,7 +368,7 @@ msg-parser==1.2.0
# via
# -r requirements/base.txt
# unstructured
mypy==1.4.1
mypy==1.5.0
# via
# -r requirements/base.txt
# -r requirements/test.in
@@ -435,7 +437,7 @@ onnxruntime==1.15.1
# via
# -r requirements/base.txt
# unstructured-inference
opencv-python==4.8.0.74
opencv-python==4.8.0.76
# via
# -r requirements/base.txt
# layoutparser
@@ -529,7 +531,7 @@ prompt-toolkit==3.0.39
# via
# ipython
# jupyter-console
protobuf==4.23.4
protobuf==4.24.0
# via
# -r requirements/base.txt
# onnxruntime
@@ -598,7 +600,6 @@ pytest-mock==3.11.1
python-dateutil==2.8.2
# via
# -r requirements/base.txt
# arrow
# jupyter-client
# matplotlib
# pandas
@@ -641,7 +642,7 @@ pyyaml==6.0.1
# timm
# transformers
# uvicorn
pyzmq==25.1.0
pyzmq==25.1.1
# via
# -r requirements/base.txt
# ipykernel
@@ -661,7 +662,7 @@ referencing==0.30.2
# jsonschema
# jsonschema-specifications
# jupyter-events
regex==2023.6.3
regex==2023.8.8
# via
# -r requirements/base.txt
# nltk
@@ -687,7 +688,7 @@ rpds-py==0.9.2
# -r requirements/base.txt
# jsonschema
# referencing
safetensors==0.3.1
safetensors==0.3.2
# via
# -r requirements/base.txt
# timm
@@ -769,7 +770,7 @@ torchvision==0.15.2
# effdet
# layoutparser
# timm
tornado==6.3.2
tornado==6.3.3
# via
# -r requirements/base.txt
# ipykernel
@@ -778,7 +779,7 @@ tornado==6.3.2
# jupyterlab
# notebook
# terminado
tqdm==4.65.0
tqdm==4.66.1
# via
# -r requirements/base.txt
# huggingface-hub
@@ -838,11 +839,11 @@ tzdata==2023.3
# via
# -r requirements/base.txt
# pandas
unstructured[local-inference]==0.9.0
unstructured[local-inference]==0.9.2
# via -r requirements/base.txt
unstructured-api-tools==0.10.10
# via -r requirements/base.txt
unstructured-inference==0.5.7
unstructured-inference==0.5.9
# via
# -r requirements/base.txt
# unstructured
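
Because the requirements diffs above interleave old and new pins without `+`/`-` markers, it can help to list the version changes explicitly when hunting for the package bump behind a behavior change. The sketch below is one way to do that, assuming simple `name==version` pin lines as produced by pip-compile. `parse_pins` and `changed_pins` are hypothetical helper names, and the sample pins are copied from the `requirements/base.txt` hunks in this diff.

```python
def parse_pins(text):
    """Map package name -> pinned version for every 'name==version' line."""
    pins = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blanks, '# via ...' annotations, and anything that is not a pin.
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def changed_pins(old_text, new_text):
    """Return {name: (old_version, new_version)} for pins whose version differs."""
    old, new = parse_pins(old_text), parse_pins(new_text)
    return {
        name: (old[name], new[name])
        for name in sorted(old.keys() & new.keys())
        if old[name] != new[name]
    }


old_requirements = """\
attrs==23.1.0
cryptography==41.0.3
    # via pdfminer-six
joblib==1.3.1
unstructured[local-inference]==0.9.0
"""

new_requirements = """\
attrs==23.1.0
cryptography==41.0.2
    # via
    #   pdfminer-six
    #   unstructured
joblib==1.3.2
unstructured[local-inference]==0.9.2
"""

for name, (old_version, new_version) in changed_pins(old_requirements, new_requirements).items():
    print(f"{name}: {old_version} -> {new_version}")
```

Read this way, the hunks show that most pins move forward, while `cryptography` appears to move from 41.0.3 back to 41.0.2 once `unstructured` is listed among its dependents.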