fix: sectioner dissociated titles from their chunk #1861
LGTM! One non-blocking question.
unstructured/chunking/title.py
@@ -20,6 +20,9 @@
    Text,
    Title,
)
from unstructured.utils import lazyproperty


_Section: TypeAlias = Union["_NonTextSection", "_TableSection", "_TextSection"]
Do we want to start using the 3.10 syntax with `|`?
Hmm, I'm glad you mentioned this. It turns out we actually can, as long as the whole expression is surrounded in quotes.
I had trouble finding a clear answer when I first searched and the `Union` form was the best I could come up with, but on searching again I turned up by accident that this will work:
_Section: TypeAlias = "_NonTextSection | _TableSection | _TextSection"
I'll change it to that. The quotes are required because this is parsed as an assignment statement rather than an annotation, and prior to 3.10 `from __future__ import annotations` can't change that particular behavior.
Still, the stringified version is much more pleasing and also consistent of course. Thanks for mentioning this! :)
disassociated-titles
Executive Summary. Section titles are often combined with the prior section and then missing from the section they belong to.
Chunk combination is a behavior in which two successive small chunks are combined into a single chunk that better fills the chunk window. Chunking can be, and by default is, configured to combine sequential small chunks that together fit within the full chunk window (default 500 chars).
Combination is only valid for "whole" chunks. The current implementation attempts to combine at the element level (in the sectioner), meaning a small initial element (such as a `Title`) is combined with the prior section without considering the remaining length of the section that title belongs to. This frequently causes a title element to be removed from the chunk it belongs to and added to the prior, otherwise unrelated, chunk.
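A toy reproduction may make the failure mode concrete. This is not the code in `unstructured/chunking/title.py`; `section_elements()` and the `(kind, text)` tuples are hypothetical stand-ins that only mimic the decision described above, where a title is judged on its own length at the moment it arrives.

```python
from typing import List, Tuple

Element = Tuple[str, str]  # (kind, text): toy stand-in for real element objects


def section_elements(
    elements: List[Element], max_chars: int = 500, combine: bool = True
) -> List[List[Element]]:
    """Group elements into sections, optionally combining at the element level."""
    sections: List[List[Element]] = []
    current: List[Element] = []
    current_len = 0
    for kind, text in elements:
        fits = current_len + len(text) <= max_chars
        # A Title normally opens a new section. With element-level combining it
        # is kept in the current section whenever the title alone still fits,
        # with no look-ahead at the body text that follows it.
        title_break = kind == "Title" and not (combine and fits)
        if current and (title_break or not fits):
            sections.append(current)
            current, current_len = [], 0
        current.append((kind, text))
        current_len += len(text)
    if current:
        sections.append(current)
    return sections


elements = [
    ("Title", "Intro"),
    ("Text", "a" * 480),
    ("Title", "Results"),
    ("Text", "b" * 400),
]
combined = section_elements(elements, combine=True)
separate = section_elements(elements, combine=False)
```

With `combine=True` the "Results" title lands at the end of the first section and the second section starts with untitled body text; with `combine=False` each title stays with its own text.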
Technical Summary. Combination cannot be effectively performed at the element level, at least not without complicating things with arbitrary look-ahead into future elements. Much more straightforward is to combine sections once they have been formed from the element stream.
Fix. Introduce an intermediate stream processor that accepts a stream of sections and emits a stream of sometimes-combined sections.
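The shape of such a stream processor can be sketched as a generator. This is only an illustration under simplifying assumptions: a section is a plain list of strings, length is the sum of the text lengths with no separators, and `combine_sections()` is a hypothetical name rather than anything in the PR.

```python
from typing import Iterable, Iterator, List

Section = List[str]  # toy section: the texts of its elements, title first


def text_length(section: Section) -> int:
    """Length of a section, ignoring any separator between element texts."""
    return sum(len(text) for text in section)


def combine_sections(
    sections: Iterable[Section], max_chars: int = 500
) -> Iterator[Section]:
    """Accept whole sections and emit them, merging neighbors that fit together."""
    pending: Section = []
    for section in sections:
        # Only whole sections are ever merged, so a title cannot be separated
        # from the text that follows it.
        if pending and text_length(pending) + text_length(section) > max_chars:
            yield pending
            pending = []
        pending = pending + section
    if pending:
        yield pending


sections = [
    ["Intro", "a" * 480],
    ["Results", "b" * 400],
    ["Notes", "c" * 50],
]
result = list(combine_sections(sections))
```

Here "Intro" is emitted on its own because adding "Results" would overflow the window, while "Results" and "Notes" fit together and come out as one section with both titles still ahead of their own text.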
The solution implemented in this PR builds upon introducing `_Section` objects to replace the `List[Element]` primitive used previously:
- `_TextSection` gets the `.combine()` method and `.text_length` property, which allow a combining client to produce a combined section (only text-sections are ever combined).
- `_SectionCombiner` is introduced to encapsulate the logic of combination, acting as a "filter": it accepts a stream of sections and emits the same type, just with some resulting from two or more combined input sections: `(Iterable[_Section]) -> Iterator[_Section]`.
- `_TextSectionAccumulator` is a helper to `_SectionCombiner` that takes responsibility for repeatedly accumulating sections, characterizing their length, and doing the actual combining (calling `_Section.combine(other_section)`) when instructed. Very similar in concept to `_TextSectionBuilder`, just at the section level instead of the element level.
- Remove element-level combination from `_split_elements_by_title_and_table()` and install `_SectionCombiner` as a filter between sectioner and chunker.
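How these pieces might fit together is sketched below. Every name ending in `Sketch`, along with `will_fit()`, `add()`, `flush()`, and `combine_stream()`, is an assumption made for illustration; only `.combine()`, `.text_length`, and the `(Iterable[_Section]) -> Iterator[_Section]` shape come from the description above, and the real classes almost certainly differ in detail.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Union


@dataclass
class TextSectionSketch:
    """Toy text section holding the texts of its elements."""

    texts: List[str]

    @property
    def text_length(self) -> int:
        return sum(len(text) for text in self.texts)

    def combine(self, other: "TextSectionSketch") -> "TextSectionSketch":
        return TextSectionSketch(self.texts + other.texts)


@dataclass
class TableSectionSketch:
    """Toy table section; never combined with anything."""

    html: str


SectionSketch = Union[TextSectionSketch, TableSectionSketch]


class AccumulatorSketch:
    """Collects text sections until the caller asks for the combined result."""

    def __init__(self, max_chars: int) -> None:
        self._max_chars = max_chars
        self._sections: List[TextSectionSketch] = []

    def will_fit(self, section: TextSectionSketch) -> bool:
        held = sum(s.text_length for s in self._sections)
        return held + section.text_length <= self._max_chars

    def add(self, section: TextSectionSketch) -> None:
        self._sections.append(section)

    def flush(self) -> Iterator[TextSectionSketch]:
        """Yield zero or one combined section and reset."""
        if not self._sections:
            return
        combined, *rest = self._sections
        for section in rest:
            combined = combined.combine(section)
        self._sections = []
        yield combined


def combine_stream(
    sections: Iterable[SectionSketch], max_chars: int = 500
) -> Iterator[SectionSketch]:
    """Filter between sectioner and chunker: same type in, same type out."""
    accum = AccumulatorSketch(max_chars)
    for section in sections:
        if not isinstance(section, TextSectionSketch):
            # Non-text sections pass through and end any run being accumulated.
            yield from accum.flush()
            yield section
            continue
        if not accum.will_fit(section):
            yield from accum.flush()
        accum.add(section)
    yield from accum.flush()


stream = [
    TextSectionSketch(["A", "x" * 100]),
    TextSectionSketch(["B", "y" * 100]),
    TableSectionSketch("<table/>"),
    TextSectionSketch(["C", "z" * 450]),
    TextSectionSketch(["D", "w" * 100]),
]
out = list(combine_stream(stream))
```

In this run the first two text sections are merged, the table passes through untouched, and "C" and "D" stay separate because together they would exceed the 500-character window.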