fix: split-chunks appear out-of-order (#1824)

**Executive Summary.** Code inspection in preparation for adding the chunk-overlap feature revealed a bug causing split-chunks to be inserted out-of-order. For example, elements like this:

```
Text("One" + 400 chars)
Text("Two" + 400 chars)
Text("Three" + 600 chars)
Text("Four" + 400 chars)
Text("Five" + 600 chars)
```

should produce chunks:

```
CompositeElement("One ...")            # (400 chars)
CompositeElement("Two ...")            # (400 chars)
CompositeElement("Three ...")          # (500 chars)
CompositeElement("rest of Three ...")  # (100 chars)
CompositeElement("Four")               # (400 chars)
CompositeElement("Five ...")           # (500 chars)
CompositeElement("rest of Five ...")   # (100 chars)
```

but produced this instead:

```
CompositeElement("Five ...")           # (500 chars)
CompositeElement("rest of Five ...")   # (100 chars)
CompositeElement("Three ...")          # (500 chars)
CompositeElement("rest of Three ...")  # (100 chars)
CompositeElement("One ...")            # (400 chars)
CompositeElement("Two ...")            # (400 chars)
CompositeElement("Four")               # (400 chars)
```

This PR fixes that behavior, which was introduced on Oct 9, 2023 in commit f98d5e6 when chunk splitting was added.

**Technical Summary**

The essential transformation of chunking is:

```
elements          sections              chunks
List[Element] -> List[List[Element]] -> List[CompositeElement]
```

1. The _sectioner_ (`_split_elements_by_title_and_table()`) _groups_ semantically-related elements into _sections_ (`List[Element]`). In the best case, a section is a title (heading) and the text that follows it (until the next title). A heading and its text is often referred to as a _section_ in publishing parlance, hence the name.

2. The _chunker_ (`chunk_by_title()` currently) does two things:
   1. First it _consolidates_ the elements of each section into a single `CompositeElement` object (a "chunk"). This includes both joining the element text into a single string and consolidating the metadata of the section elements.
   2. Then, if necessary, it _splits_ the chunk into two or more `CompositeElement` objects when the consolidated text is too long to fit in the specified window (`max_characters`).

Chunk splitting is only required when a single element (like a big paragraph) has text longer than the specified window. Otherwise a section and the chunk that derives from it reflect an even element boundary.

`chunk_by_title()` was elaborated in commit f98d5e6 to add this "chunk-splitting" behavior.

At the time there was some notion of wanting to "split from the end backward" such that any small remainder chunk would appear first and could possibly be combined with a small prior chunk. To accomplish this, split chunks were _inserted_ at the beginning of the list instead of _appended_ to the end.

The `chunked_elements` variable (`List[CompositeElement]`) holds the sequence of chunks that result from the chunking operation and is the returned value for `chunk_by_title()`. The "split-from-the-end" chunks were inserted at the beginning of this list. That produces the out-of-order behavior, because this is the "all-chunks-in-document" list, not a sublist holding only the pieces of the chunk being split.

Further, the "split-from-the-end" behavior can produce no benefit because chunks are never combined. Only _elements_ are combined (across semantic boundaries into a single section when a section is small), and sectioning occurs _prior_ to chunking.

The fix is to rework the chunk-splitting passage into a straightforward iterative algorithm that works both when a chunk must be split and when it need not be. This algorithm is also very easily extended to implement split-chunk-overlap, which is coming up in an immediately following PR.
```python
# -- split chunk into CompositeElements objects maxlen or smaller --
text_len = len(text)
start = 0
remaining = text_len

while remaining > 0:
    end = min(start + max_characters, text_len)
    chunked_elements.append(CompositeElement(text=text[start:end], metadata=chunk_meta))
    start = end - overlap
    remaining = text_len - end
```

*Forensic analysis*

The out-of-order-chunks behavior was introduced in commit 4ea7168 on 10/09/2023 in the same PR in which chunk-splitting was introduced.

---------

Co-authored-by: Shreya Nidadavolu <[email protected]>
Co-authored-by: shreyanid <[email protected]>
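The ordering problem described in the commit message can be reproduced without the library. The following is a minimal sketch, assuming plain strings stand in for consolidated section text and `CompositeElement` objects; the function names `split_prepending` and `split_appending` are hypothetical and do not exist in `unstructured`.

```python
from typing import List


def split_prepending(sections: List[str], max_characters: int) -> List[str]:
    """Mimic the old logic: pieces of an oversized section go to index 0."""
    chunks: List[str] = []
    for text in sections:
        if len(text) > max_characters:
            # Split from the end backward, inserting into the document-wide list.
            while len(text) > 0:
                if len(text) <= max_characters:
                    chunk_text, text = text, ""
                else:
                    chunk_text, text = text[-max_characters:], text[:-max_characters]
                chunks.insert(0, chunk_text)
        else:
            chunks.append(text)
    return chunks


def split_appending(sections: List[str], max_characters: int) -> List[str]:
    """Mimic the fixed logic: walk forward through the text and append each piece."""
    chunks: List[str] = []
    for text in sections:
        start = 0
        while start < len(text):
            end = min(start + max_characters, len(text))
            chunks.append(text[start:end])
            start = end
    return chunks


# Sections 3 and 5 are longer than the 5-character window.
sections = ["11111", "22222", "3333333", "44444", "5555555"]
old_chunks = split_prepending(sections, max_characters=5)
new_chunks = split_appending(sections, max_characters=5)
print(old_chunks)  # ['55', '55555', '33', '33333', '11111', '22222', '44444']
print(new_chunks)  # ['11111', '22222', '33333', '33', '44444', '55555', '55']
```

In this sketch the prepending version also puts the short remainder before the full-size piece, which is the "small remainder first" intent the commit message describes, while the appending version keeps document order.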
Unstructured-IO · Oct 21, 2023 · 82c8adb · 82c8adb
1 parent ce40cdc
commit 82c8adb
Showing 4 changed files with 32 additions and 23 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,4 +1,4 @@
-## 0.10.25-dev9
+## 0.10.25
 
 ### Enhancements
 
@@ -19,10 +19,10 @@ ocr agent tesseract/paddle in environment variable `OCR_AGENT` for OCRing the en
 * **Fix chunks breaking on regex-metadata matches.** Fixes "over-chunking" when `regex_metadata` was used, where every element that contained a regex-match would start a new chunk.
 * **Fix regex-metadata match offsets not adjusted within chunk.** Fixes incorrect regex-metadata match start/stop offset in chunks where multiple elements are combined.
 * **Map source cli command configs when destination set** Due to how the source connector is dynamically called when the destination connector is set via the CLI, the configs were being set incorrectoy, causing the source connector to break. The configs were fixed and updated to take into account Fsspec-specific connectors.
-* **Fix metrics folder not discoverable** Fixes issue where unstructured/metrics folder is not discoverable on PyPI by adding
-an `__init__.py` file under the folder.
+* **Fix metrics folder not discoverable** Fixes issue where unstructured/metrics folder is not discoverable on PyPI by adding an `__init__.py` file under the folder.
 * **Fix a bug when `parition_pdf` get `model_name=None`** In API usage the `model_name` value is `None` and the `cast` function in `partition_pdf` would return `None` and lead to attribution error. Now we use `str` function to explicit convert the content to string so it is garanteed to have `starts_with` and other string functions as attributes
 * **Fix html partition fail on tables without `tbody` tag** HTML tables may sometimes just contain headers without body (`tbody` tag)
+* **Fix out-of-order sequencing of split chunks.** Fixes behavior where "split" chunks were inserted at the beginning of the chunk sequence. This would produce a chunk sequence like [5a, 5b, 3a, 3b, 1, 2, 4] when sections 3 and 5 exceeded `max_characters`.
 
 ## 0.10.24
 

diff --git a/test_unstructured/chunking/test_title.py b/test_unstructured/chunking/test_title.py
@@ -23,6 +23,24 @@
 from unstructured.partition.html import partition_html
 
 
+def test_it_splits_a_large_section_into_multiple_chunks():
+    elements: List[Element] = [
+        Title("Introduction"),
+        Text(
+            "Lorem ipsum dolor sit amet consectetur adipiscing elit. In rhoncus ipsum sed lectus"
+            " porta volutpat."
+        ),
+    ]
+
+    chunks = chunk_by_title(elements, combine_text_under_n_chars=50, max_characters=50)
+
+    assert chunks == [
+        CompositeElement("Introduction"),
+        CompositeElement("Lorem ipsum dolor sit amet consectetur adipiscing "),
+        CompositeElement("elit. In rhoncus ipsum sed lectus porta volutpat."),
+    ]
+
+
 def test_split_elements_by_title_and_table():
     elements: List[Element] = [
         Title("A Great Day"),

diff --git a/unstructured/__version__.py b/unstructured/__version__.py
@@ -1 +1 @@
-__version__ = "0.10.25-dev9"  # pragma: no cover
+__version__ = "0.10.25"  # pragma: no cover
diff --git a/unstructured/chunking/title.py b/unstructured/chunking/title.py
@@ -152,25 +152,16 @@ def chunk_by_title(
                     chunk_matches.extend(matches)
                     chunk_regex_metadata[regex_name] = chunk_matches
 
-        # Check if text exceeds max_characters
-        if len(text) > max_characters:
-            # Chunk the text from the end to the beginning
-            while len(text) > 0:
-                if len(text) <= max_characters:
-                    # If the remaining text is shorter than max_characters
-                    # create a chunk from the beginning
-                    chunk_text = text
-                    text = ""
-                else:
-                    # Otherwise, create a chunk from the end
-                    chunk_text = text[-max_characters:]
-                    text = text[:-max_characters]
-
-                # Prepend the chunk to the beginning of the list
-                chunked_elements.insert(0, CompositeElement(text=chunk_text, metadata=metadata))
-        else:
-            # If it doesn't exceed, create a single CompositeElement
-            chunked_elements.append(CompositeElement(text=text, metadata=metadata))
+        # -- split chunk into CompositeElements objects maxlen or smaller --
+        text_len = len(text)
+        start = 0
+        remaining = text_len
+
+        while remaining > 0:
+            end = min(start + max_characters, text_len)
+            chunked_elements.append(CompositeElement(text=text[start:end], metadata=metadata))
+            start = end
+            remaining = text_len - end
 
     return chunked_elements
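The commit message notes that this loop extends easily to split-chunk overlap, planned for a following PR. The sketch below is one way that extension might look, not the library's implementation: `split_with_overlap` and its `overlap` parameter are hypothetical, it operates on a plain string rather than building `CompositeElement` objects, and it assumes `overlap` is smaller than `max_characters` so the window always advances.

```python
from typing import List


def split_with_overlap(text: str, max_characters: int, overlap: int = 0) -> List[str]:
    """Split text into windows of at most max_characters, each repeating the
    last `overlap` characters of the previous window (hypothetical helper)."""
    if not 0 <= overlap < max_characters:
        raise ValueError("overlap must be non-negative and smaller than max_characters")

    pieces: List[str] = []
    text_len = len(text)
    start = 0
    remaining = text_len

    while remaining > 0:
        end = min(start + max_characters, text_len)
        pieces.append(text[start:end])
        # Step back by `overlap` so the next window re-reads the tail of this one.
        start = end - overlap
        # `remaining` counts text not yet covered by any window.
        remaining = text_len - end

    return pieces


print(split_with_overlap("abcdefghij", max_characters=4))             # ['abcd', 'efgh', 'ij']
print(split_with_overlap("abcdefghij", max_characters=4, overlap=1))  # ['abcd', 'defg', 'ghij']
```

With `overlap=0` this reduces to the loop in the diff above. Tracking `remaining` from `end` rather than `start` is what lets the loop terminate once the final window reaches the end of the text.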