fix: sectioner dissociated titles from their chunk (#1861)
### disassociated-titles

**Executive Summary.** Section titles are often combined with the prior section and are then missing from the section they belong to.

_Chunk combination_ is a behavior in which two successive small chunks are combined into a single chunk that better fills the chunk window. Chunking can be, and by default is, configured to combine sequential small chunks that will together fit within the full chunk window (default 500 chars).

Combination is only valid for "whole" chunks. The current implementation attempts to combine at the element level (in the sectioner), meaning a small initial element (such as a `Title`) is combined with the prior section without considering the remaining length of the section that title belongs to. This frequently causes a title element to be removed from the chunk it belongs to and added to the prior, otherwise unrelated, chunk.

Example:

```python
elements: List[Element] = [
    Title("Lorem Ipsum"),  # 11
    Text("Lorem ipsum dolor sit amet consectetur adipiscing elit."),  # 55
    Title("Rhoncus"),  # 7
    Text("In rhoncus ipsum sed lectus porta volutpat. Ut fermentum."),  # 57
]

chunks = chunk_by_title(elements, max_characters=80, combine_text_under_n_chars=80)

# -- want --------------------
CompositeElement('Lorem Ipsum\n\nLorem ipsum dolor sit amet consectetur adipiscing elit.')
CompositeElement('Rhoncus\n\nIn rhoncus ipsum sed lectus porta volutpat. Ut fermentum.')

# -- got ---------------------
CompositeElement('Lorem Ipsum\n\nLorem ipsum dolor sit amet consectetur adipiscing elit.\n\nRhoncus')
CompositeElement('In rhoncus ipsum sed lectus porta volutpat. Ut fermentum.')
```

**Technical Summary.** Combination cannot be effectively performed at the element level, at least not without complicating things with arbitrary look-ahead into future elements. It is much more straightforward to combine sections once they have been formed from the element stream.

**Fix.** Introduce an intermediate stream processor that accepts a stream of sections and emits a stream of sometimes-combined sections.

The solution implemented in this PR builds upon introducing `_Section` objects to replace the `List[Element]` primitive used previously:

- `_TextSection` gets the `.combine()` method and `.text_length` property, which allow a combining client to produce a combined section (only text-sections are ever combined).
- `_SectionCombiner` is introduced to encapsulate the logic of combination, acting as a "filter": it accepts a stream of sections and emits the same type, just with some resulting from two or more combined input sections: `(Iterable[_Section]) -> Iterator[_Section]`.
- `_TextSectionAccumulator` is a helper to `_SectionCombiner` that takes responsibility for repeatedly accumulating sections, characterizing their length, and doing the actual combining (calling `_Section.combine(other_section)`) when instructed. It is very similar in concept to `_TextSectionBuilder`, just at the section level instead of the element level.
- Remove attempts to combine sections at the element level from `_split_elements_by_title_and_table()` and install `_SectionCombiner` as a filter between sectioner and chunker.
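
The section-level filter described above can be illustrated with a minimal, self-contained sketch. It is not the code from this PR: `TextSection` and `combine_sections` are hypothetical stand-ins for `_TextSection` and `_SectionCombiner`, sections are reduced to lists of strings, and the combination rule (combine only while the pending section is shorter than `combine_under_n_chars` and the result still fits in `max_characters`) is an assumption based on the description in this message.

```python
from typing import Iterable, Iterator, List, Optional


class TextSection:
    """Hypothetical stand-in for `_TextSection`: the texts of one section's elements."""

    def __init__(self, texts: List[str]) -> None:
        self.texts = list(texts)

    @property
    def text_length(self) -> int:
        # Assumption: element texts are joined with a blank line, as in the
        # CompositeElement examples in this message.
        return len("\n\n".join(self.texts))

    def combine(self, other: "TextSection") -> "TextSection":
        # Return a new section; neither input is mutated.
        return TextSection(self.texts + other.texts)


def combine_sections(
    sections: Iterable[TextSection],
    max_characters: int,
    combine_under_n_chars: int,
) -> Iterator[TextSection]:
    """Filter a stream of whole sections, merging small neighbours that fit together."""
    pending: Optional[TextSection] = None
    for section in sections:
        if pending is None:
            pending = section
            continue
        combined = pending.combine(section)
        if (
            pending.text_length < combine_under_n_chars
            and combined.text_length <= max_characters
        ):
            # Both whole sections fit in one chunk window, so merge them.
            pending = combined
        else:
            # Emit the pending section intact and start over with this one.
            yield pending
            pending = section
    if pending is not None:
        yield pending


first = TextSection(
    ["Lorem Ipsum", "Lorem ipsum dolor sit amet consectetur adipiscing elit."]
)
second = TextSection(
    ["Rhoncus", "In rhoncus ipsum sed lectus porta volutpat. Ut fermentum."]
)
for section in combine_sections([first, second], 80, 80):
    print(repr("\n\n".join(section.texts)))
```

Because the decision is made on whole sections, the `Rhoncus` title can only move together with the text it heads, so no look-ahead into future elements is needed.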

Showing 3 changed files with 488 additions and 20 deletions.