Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: sectioner dissociated titles from their chunk #1861

Merged
merged 2 commits into from
Oct 30, 2023

Conversation

scanny
Copy link
Collaborator

@scanny scanny commented Oct 24, 2023

disassociated-titles

Executive Summary. Section titles are often combined with the prior section and then missing from the section they belong to.

Chunk combination is a behavior in which two succesive small chunks are combined into a single chunk that better fills the chunk window. Chunking can be and by default is configured to combine sequential small chunks that will together fit within the full chunk window (default 500 chars).

Combination is only valid for "whole" chunks. The current implementation attempts to combine at the element level (in the sectioner), meaning a small initial element (such as a Title) is combined with the prior section without considering the remaining length of the section that title belongs to. This frequently causes a title element to be removed from the chunk it belongs to and added to the prior, otherwise unrelated, chunk.

Example:

elements: List[Element] = [
    Title("Lorem Ipsum"),  # 11
    Text("Lorem ipsum dolor sit amet consectetur adipiscing elit."),  # 55
    Title("Rhoncus"),  # 7
    Text("In rhoncus ipsum sed lectus porta volutpat. Ut fermentum."),  # 57
]

chunks = chunk_by_title(elements, max_characters=80, combine_text_under_n_chars=80)

# -- want --------------------
CompositeElement('Lorem Ipsum\n\nLorem ipsum dolor sit amet consectetur adipiscing elit.')
CompositeElement('Rhoncus\n\nIn rhoncus ipsum sed lectus porta volutpat. Ut fermentum.')

# -- got ---------------------
CompositeElement('Lorem Ipsum\n\nLorem ipsum dolor sit amet consectetur adipiscing elit.\n\nRhoncus')
CompositeElement('In rhoncus ipsum sed lectus porta volutpat. Ut fermentum.')

Technical Summary. Combination cannot be effectively performed at the element level, at least not without complicating things with arbitrary look-ahead into future elements. Much more straightforward is to combine sections once they have been formed from the element stream.

Fix. Introduce an intermediate stream processor that accepts a stream of sections and emits a stream of sometimes-combined sections.

The solution implemented in this PR builds upon introducing _Section objects to replace the List[Element] primitive used previously:

  • _TextSection gets the .combine() method and .text_length property which allows a combining client to produce a combined section (only text-sections are ever combined).
  • _SectionCombiner is introduced to encapsulate the logic of combination, acting as a "filter", accepting a stream of sections and emitting the same type, just with some resulting from two or more combined input sections: (Iterable[_Section]) -> Iterator[_Section].
  • _TextSectionAccumulator is a helper to _SectionCombiner that takes responsibility for repeatedly accumulating sections, characterizing their length and doing the actual combining (calling _Section.combine(other_section)) when instructed. Very similar in concept to _TextSectionBuilder, just at the section level instead of element level.
  • Remove attempts to combine sections at the element level from _split_elements_by_title_and_table() and install _SectionCombiner as filter between sectioner and chunker.

@scanny scanny changed the base branch from main to scanny/sectioner-considers-separator October 24, 2023 23:01
@scanny scanny force-pushed the scanny/sectioner-considers-separator branch from 08bd213 to fa158e8 Compare October 25, 2023 00:40
@scanny scanny force-pushed the scanny/fix-disassociated-titles branch from 79cf5a3 to 21c2b69 Compare October 25, 2023 00:40
@scanny scanny marked this pull request as ready for review October 25, 2023 00:41
@scanny scanny force-pushed the scanny/fix-disassociated-titles branch 2 times, most recently from bfad58b to a88eede Compare October 25, 2023 05:20
@scanny scanny force-pushed the scanny/sectioner-considers-separator branch 2 times, most recently from 640314b to 72cfe88 Compare October 26, 2023 18:44
@scanny scanny force-pushed the scanny/fix-disassociated-titles branch from a88eede to c4e3532 Compare October 26, 2023 18:50
@scanny scanny force-pushed the scanny/sectioner-considers-separator branch from 72cfe88 to 20bb5a0 Compare October 26, 2023 19:37
@scanny scanny force-pushed the scanny/fix-disassociated-titles branch from c4e3532 to 0bf4100 Compare October 26, 2023 19:38
@scanny scanny force-pushed the scanny/sectioner-considers-separator branch 2 times, most recently from 400cac4 to 405a2cd Compare October 26, 2023 20:33
@scanny scanny force-pushed the scanny/fix-disassociated-titles branch from 0bf4100 to dc3d34d Compare October 26, 2023 20:34
Base automatically changed from scanny/sectioner-considers-separator to main October 26, 2023 22:20
@scanny scanny force-pushed the scanny/fix-disassociated-titles branch 3 times, most recently from 561e15c to e1b0340 Compare October 27, 2023 19:03
Copy link
Contributor

@qued qued left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! One non-blocking question.

@@ -20,6 +20,9 @@
Text,
Title,
)
from unstructured.utils import lazyproperty

_Section: TypeAlias = Union["_NonTextSection", "_TableSection", "_TextSection"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to start using the 3.10 syntax with |?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I'm glad you mentioned this, turns out it actually can as long as the whole expression is surrounded in quotes.
I was having trouble finding a clear answer on search and this was the best I could come up with, but having another search now I turned up by accident that this will work:

_Section: TypeAlias = "_NonTextSection | _TableSection | _TextSection"

I'll change this to that. The quotes are required because this actually gets parsed as an assignment statement and prior to 3.10 the from __future__ import annotations can't change that particular behavior.

Still, the stringified version is much more pleasing and also consistent of course. Thanks for mentioning this! :)

@scanny scanny force-pushed the scanny/fix-disassociated-titles branch 4 times, most recently from 0f0f993 to de9a0e7 Compare October 28, 2023 04:21
@scanny scanny force-pushed the scanny/fix-disassociated-titles branch from de9a0e7 to 7e191d8 Compare October 29, 2023 18:15
@scanny scanny added this pull request to the merge queue Oct 30, 2023
Merged via the queue into main with commit 7373391 Oct 30, 2023
44 checks passed
@scanny scanny deleted the scanny/fix-disassociated-titles branch October 30, 2023 05:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants