rfctr(chunking): extract general-purpose objects to base #2281

Merged: 1 commit into main from scanny/extract-base-objects, Dec 16, 2023

@scanny scanny commented Dec 15, 2023

Many of the classes defined in `unstructured.chunking.title` are applicable to any chunking strategy and will shortly be used for the "by-character" chunking strategy as well.

Move these and their tests to `unstructured.chunking.base`.

Along the way, rename `TextPreChunkBuilder` to `PreChunkBuilder` because it will be generalized in a subsequent PR to also take `Table` elements such that inter-pre-chunk overlap can be implemented.

Otherwise, no logic changes, just moves.
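
For code that imported these classes directly, the practical effect of the move is an import-path change, plus the one rename. The following is a minimal sketch of how a caller might check which location its installed version provides. `locate` is a hypothetical helper, not part of `unstructured`, and the two module/class pairs are taken from the description above rather than verified against any particular release.

```python
import importlib
from typing import Iterable, Optional, Tuple


def locate(candidates: Iterable[Tuple[str, str]]) -> Optional[Tuple[str, str]]:
    """Return the first (module_path, attr) pair that can be resolved, else None.

    Hypothetical helper for illustration only; not part of `unstructured`.
    """
    for module_path, attr in candidates:
        try:
            module = importlib.import_module(module_path)
        except ImportError:
            # The module (or the whole package) is unavailable; try the next pair.
            continue
        if hasattr(module, attr):
            return module_path, attr
    return None


# Assumed locations, based only on the PR description:
found = locate(
    [
        ("unstructured.chunking.base", "PreChunkBuilder"),       # after this PR
        ("unstructured.chunking.title", "TextPreChunkBuilder"),  # before this PR
    ]
)
print(found)  # None when `unstructured` is not installed
```

Listing the new location first matters because a module may still expose a name it only imports from elsewhere, so the order of the pairs decides which location is reported.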

@scanny scanny force-pushed the scanny/extract-base-objects branch from 9ed8682 to 0298b1d Compare December 15, 2023 19:07
@scanny scanny changed the base branch from main to scanny/extract-ChunkingOptions December 15, 2023 19:51
Base automatically changed from scanny/extract-ChunkingOptions to main December 15, 2023 20:38
@scanny scanny force-pushed the scanny/extract-base-objects branch from 0298b1d to deaff54 Compare December 15, 2023 22:31
@scanny scanny marked this pull request as ready for review December 15, 2023 23:24
Extract objects from `unstructured.chunking.title` that are not particular to
that chunking strategy and place them in `unstructured.chunking.base` so they
can be reused by other chunking strategies.

Also move the tests to `test_base.py`.
@christinestraub christinestraub left a comment

LGTM!

@scanny scanny added this pull request to the merge queue Dec 16, 2023
scanny commented Dec 16, 2023

Thanks @christinestraub! :)

Merged via the queue into main with commit 36e81c3 Dec 16, 2023
51 checks passed
@scanny scanny deleted the scanny/extract-base-objects branch December 16, 2023 18:14
Coniferish pushed a commit that referenced this pull request Dec 18, 2023