Klaijan/xlsx sub tables #1585

Merged
merged 42 commits into from
Oct 4, 2023
Changes from 29 commits
Commits
a948837
find subtables
Klaijan Sep 28, 2023
2cdbbce
fix bug
Klaijan Sep 29, 2023
c9d65c5
add EXPECTED_TITLE to test cases
Klaijan Sep 29, 2023
1c6f199
big fixed
Klaijan Sep 29, 2023
91c2ac7
Merge branch 'main' into klaijan/xlsx-sub-tables
Klaijan Sep 29, 2023
c54a9f5
update changelog
Klaijan Sep 29, 2023
907b689
update changelog
Klaijan Sep 29, 2023
e14c9ac
add language detection
Klaijan Sep 29, 2023
109f196
add language test
Klaijan Sep 29, 2023
de5d710
make pip-compile
Klaijan Oct 1, 2023
0fb799c
attempt to fix elastic search error in test fixture
Klaijan Oct 2, 2023
9419e9e
request timeout 60s
Klaijan Oct 2, 2023
3ee486f
request timeout 30s
Klaijan Oct 2, 2023
e3979ce
Merge branch 'main' into klaijan/xlsx-sub-tables
Klaijan Oct 2, 2023
d8a51fb
reorder es test
Klaijan Oct 2, 2023
0efd7fe
Merge branch 'main' into klaijan/xlsx-sub-tables
Klaijan Oct 2, 2023
3bb3c29
Merge branch 'main' into klaijan/xlsx-sub-tables
Klaijan Oct 2, 2023
68d6db2
Merge branch 'main' into klaijan/xlsx-sub-tables
Klaijan Oct 3, 2023
27e499e
Merge branch 'main' into klaijan/xlsx-sub-tables
Klaijan Oct 3, 2023
5c742fa
Update CHANGELOG.md
Klaijan Oct 3, 2023
c7bf892
Klaijan/xlsx sub tables <- Ingest test fixtures update (#1626)
ryannikolaidis Oct 3, 2023
f4aa793
Update create_and_fill_es.py
Klaijan Oct 3, 2023
476b3cf
Update xlsx.py
Klaijan Oct 3, 2023
f9e3bb0
Merge branch 'main' into klaijan/xlsx-sub-tables
Klaijan Oct 3, 2023
224ed98
Update test-ingest.sh
Klaijan Oct 3, 2023
17c2ee2
add languages params description in CHANGELOG
Klaijan Oct 3, 2023
a5f5a0b
Merge branch 'main' into klaijan/xlsx-sub-tables
Klaijan Oct 3, 2023
833c0fc
linting
Klaijan Oct 3, 2023
a1b9be6
Merge branch 'main' into klaijan/xlsx-sub-tables
Klaijan Oct 3, 2023
4bf6805
add test. edit filter function. add _get_metadata for cleanliness
Klaijan Oct 4, 2023
e4ae06c
linting
Klaijan Oct 4, 2023
e84ab88
Merge branch 'klaijan/xlsx-sub-tables' of https://github.com/Unstruct…
Klaijan Oct 4, 2023
69a73e9
add example doc
Klaijan Oct 4, 2023
efcfa51
Merge branch 'main' into klaijan/xlsx-sub-tables
Klaijan Oct 4, 2023
6b9d97f
update test_auto
Klaijan Oct 4, 2023
c027f20
Merge branch 'klaijan/xlsx-sub-tables' of https://github.com/Unstruct…
Klaijan Oct 4, 2023
3a0a210
add test and example docs
Klaijan Oct 4, 2023
3f75489
Klaijan/xlsx sub tables <- Ingest test fixtures update (#1638)
ryannikolaidis Oct 4, 2023
4187edd
Merge branch 'main' into klaijan/xlsx-sub-tables
Klaijan Oct 4, 2023
ae52401
edit test
Klaijan Oct 4, 2023
4a942ca
Klaijan/xlsx sub tables <- Ingest test fixtures update (#1639)
ryannikolaidis Oct 4, 2023
451cfab
Update test-ingest.sh
Klaijan Oct 4, 2023
3 changes: 2 additions & 1 deletion CHANGELOG.md
@@ -2,6 +2,7 @@

### Enhancements

+* **Adds XLSX document level language detection** Building on the language detection functionality in the previous release, we now support language detection within the `.xlsx` file type at the Element level.
* **bump `unstructured-inference` to `0.6.6`** The updated version of `unstructured-inference` makes table extraction in `hi_res` mode configurable to fine tune table extraction performance; it also improves element detection by adding a deduplication post processing step in the `hi_res` partitioning of pdfs and images.
* **Detect text in HTML Heading Tags as Titles** This will increase the accuracy of hierarchies in HTML documents and provide more accurate element categorization. If text is in an HTML heading tag and is not a list item, address, or narrative text, categorize it as a title.
* **Update python-based docs** Refactor docs to use the actual unstructured code rather than using the subprocess library to run the cli command itself.
@@ -11,7 +12,7 @@

### Features

-### Features
+* **XLSX can now read subtables within one sheet** Problem: Many .xlsx files are not created to be read as one full table per sheet. There are subtables, text, and headers, along with more information to extract from each sheet. Feature: `partition_xlsx` can now read subtable(s) within one .xlsx sheet, along with extracting other title and narrative texts. Importance: This enhances the power of .xlsx reading beyond one table per sheet, allowing users to capture more data tables from the file, if they exist.

### Fixes

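To make the subtable idea in the changelog entry above concrete, here is a minimal sketch of one way to split a sheet into separate blocks: treat each group of adjacent non-empty cells as one block, then label single cells as text and larger blocks as tables. This is an illustration under stated assumptions, not the code in `xlsx.py`. The names `find_blocks` and `classify`, the sample `sheet`, and the single-cell rule are all hypothetical.

```python
from collections import deque
from typing import List, Optional, Tuple

Cell = Optional[str]
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), inclusive


def find_blocks(grid: List[List[Cell]]) -> List[Box]:
    """Return bounding boxes of 4-connected groups of non-empty cells."""
    seen = set()
    boxes = []
    for r, row in enumerate(grid):
        for c, value in enumerate(row):
            if value in (None, "") or (r, c) in seen:
                continue
            # Breadth-first flood fill starting from this unseen, non-empty cell.
            queue = deque([(r, c)])
            seen.add((r, c))
            top, left, bottom, right = r, c, r, c
            while queue:
                cr, cc = queue.popleft()
                top, bottom = min(top, cr), max(bottom, cr)
                left, right = min(left, cc), max(right, cc)
                for nr, nc in ((cr - 1, cc), (cr + 1, cc), (cr, cc - 1), (cr, cc + 1)):
                    if (
                        0 <= nr < len(grid)
                        and 0 <= nc < len(grid[nr])
                        and grid[nr][nc] not in (None, "")
                        and (nr, nc) not in seen
                    ):
                        seen.add((nr, nc))
                        queue.append((nr, nc))
            boxes.append((top, left, bottom, right))
    return boxes


def classify(box: Box) -> str:
    """Hypothetical rule: a lone cell is text, anything larger is a table."""
    top, left, bottom, right = box
    return "text" if (top == bottom and left == right) else "table"


# Hypothetical sheet: a title cell, a 3x2 table, and a separate 2x1 table.
sheet = [
    ["Quarterly report", None, None, None],
    [None, None, None, None],
    ["Region", "Sales", None, None],
    ["North", "10", None, "Item"],
    ["South", "12", None, "Widget"],
]

blocks = find_blocks(sheet)
labels = [classify(b) for b in blocks]
```

The empty row and empty column act as separators, so the title cell and the two tables come out as three distinct blocks. A real implementation would likely need extra rules, for example for header rows that sit directly on top of a table.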
4 changes: 2 additions & 2 deletions docs/requirements.txt
@@ -18,7 +18,7 @@ certifi==2023.7.22
# -c requirements/constraints.in
# -r requirements/build.in
# requests
-charset-normalizer==3.2.0
+charset-normalizer==3.3.0
# via
# -c requirements/base.txt
# requests
@@ -54,7 +54,7 @@ mdurl==0.1.2
# via markdown-it-py
myst-parser==2.0.0
# via -r requirements/build.in
-packaging==23.1
+packaging==23.2
# via
# -c requirements/base.txt
# sphinx
4 changes: 2 additions & 2 deletions requirements/base.txt
@@ -12,7 +12,7 @@ certifi==2023.7.22
# requests
chardet==5.2.0
# via -r requirements/base.in
-charset-normalizer==3.2.0
+charset-normalizer==3.3.0
# via requests
click==8.1.7
# via nltk
@@ -40,7 +40,7 @@ numpy==1.24.4
# via
# -c requirements/constraints.in
# -r requirements/base.in
-packaging==23.1
+packaging==23.2
# via marshmallow
python-iso639==2023.6.15
# via -r requirements/base.in
4 changes: 2 additions & 2 deletions requirements/build.txt
@@ -18,7 +18,7 @@ certifi==2023.7.22
# -c requirements/constraints.in
# -r requirements/build.in
# requests
-charset-normalizer==3.2.0
+charset-normalizer==3.3.0
# via
# -c requirements/base.txt
# requests
@@ -54,7 +54,7 @@ mdurl==0.1.2
# via markdown-it-py
myst-parser==2.0.0
# via -r requirements/build.in
-packaging==23.1
+packaging==23.2
# via
# -c requirements/base.txt
# sphinx