chore: Table chunking #1540

Merged
merged 20 commits into from
Oct 3, 2023

@amanda103 amanda103 commented Sep 27, 2023

This change extends our add_chunking_strategy logic so that we can chunk a Table element's text and text_as_html params. To keep this functionality under the same by_title chunking strategy, we have renamed combine_under_n_chars to max_characters. It works the same way as before for combining elements under a Title, and it also specifies the chunk size (in chars) for TableChunk elements.

*Renaming the variable to max_characters also anticipates the 'hard max' we will implement for large elements in follow-up PRs.
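
As a rough illustration of what "a chunk size (in chars)" means for a table, here is a minimal sketch that cuts a string into consecutive windows of at most max_characters. The helper name split_by_max_characters and the sample table strings are assumptions made up for this example; this is not the implementation of chunk_table_element, which may split differently.

```python
from typing import List


def split_by_max_characters(text: str, max_characters: int) -> List[str]:
    """Hypothetical helper: cut text into consecutive windows of at most max_characters."""
    if max_characters <= 0:
        raise ValueError("max_characters must be positive")
    return [text[i : i + max_characters] for i in range(0, len(text), max_characters)]


# Made-up content standing in for a Table element's text and text_as_html.
table_text = "Year Revenue 2022 10 2023 12"
table_html = "<table><tr><td>Year</td><td>Revenue</td></tr></table>"

# Both representations are cut with the same character budget.
text_chunks = split_by_max_characters(table_text, 4)
html_chunks = split_by_max_characters(table_html, 4)

for chunk in text_chunks:
    print(repr(chunk))
```

Note that a plain character window like this can cut through HTML tags, so each text_as_html piece is not guaranteed to be valid HTML on its own.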

Additionally, some lint changes snuck in when I ran make tidy, hence the minor changes in unrelated files :)

TODO:
✅ add unit tests
--> note: added to the unit tests where I could! In some unit tests I just clarified that the chunking strategy is now 'by_title', because we don't have an example file with Table elements to test the 'by_num_characters' chunking strategy
✅ update changelog

To manually test:

In [1]: filename="example-docs/example-10k.html"

In [2]: from unstructured.chunking.title import chunk_table_element

In [3]: from unstructured.partition.auto import partition

In [4]: elements = partition(filename)

# element at -2 happens to be a Table, and we'll get chunks of char size 4 here
In [5]: chunks = chunk_table_element(elements[-2], 4)

# examine text and text_as_html params
In [6]: for c in chunks:
   ...:     print(c.text)
   ...:     print(c.metadata.text_as_html)

@amanda103 amanda103 force-pushed the amanda/chunk-by-chars-table branch from bd3a13b to 7e76610 Compare September 27, 2023 20:38
@amanda103 amanda103 marked this pull request as ready for review September 27, 2023 20:39
@amanda103 amanda103 force-pushed the amanda/chunk-by-chars-table branch 2 times, most recently from c70ba59 to cb4cb42 Compare September 29, 2023 00:46
@cragwolfe cragwolfe requested a review from newelh September 29, 2023 07:35
@amanda103 amanda103 force-pushed the amanda/chunk-by-chars-table branch from 6544681 to 229199e Compare September 29, 2023 21:49
combine_text_under_n_chars=5,
)
elements = partition_docx(filename)
chunks = chunk_by_title(elements, max_characters=9, combine_text_under_n_chars=5)
Collaborator

Can we also add asserts to check, for this example, that:

  • without combine_text_under_n_chars set to 5, there are tiny chunks
  • with the setting, every chunk's text has at least 5 chars
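
For illustration, the two checks requested above could look roughly like the sketch below. It uses a pure-Python stand-in for the combining step rather than chunk_by_title itself: the function combine_short_chunks and the sample strings are hypothetical, and the sketch ignores max_characters entirely. In the real test the same two asserts would presumably be written against the text of the returned chunks.

```python
from typing import List


def combine_short_chunks(chunks: List[str], min_chars: int) -> List[str]:
    """Hypothetical model: a chunk shorter than min_chars absorbs the chunks that follow it."""
    combined: List[str] = []
    for chunk in chunks:
        if combined and len(combined[-1]) < min_chars:
            # The previous chunk is still too short, so keep growing it.
            combined[-1] = f"{combined[-1]} {chunk}"
        else:
            combined.append(chunk)
    # A short trailing chunk has no successor, so fold it into the one before.
    if len(combined) > 1 and len(combined[-1]) < min_chars:
        last = combined.pop()
        combined[-1] = f"{combined[-1]} {last}"
    return combined


# Made-up section texts; several are shorter than 5 characters.
raw_chunks = ["Hi", "A longer chunk", "ok", "no", "End section"]

# Without combining, tiny chunks are present.
assert any(len(c) < 5 for c in raw_chunks)

# With combining, every chunk has at least 5 characters.
combined_chunks = combine_short_chunks(raw_chunks, min_chars=5)
assert all(len(c) >= 5 for c in combined_chunks)
```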

@amanda103 amanda103 force-pushed the amanda/chunk-by-chars-table branch from f93de18 to ac2e78a Compare October 3, 2023 00:37
@amanda103 amanda103 merged commit 1fb4642 into main Oct 3, 2023
39 checks passed
@amanda103 amanda103 deleted the amanda/chunk-by-chars-table branch October 3, 2023 16:40
4 participants