Unstructured-IO · LaverdeS · Oct 5, 2023 · Sep 25, 2023
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -15,9 +15,9 @@
 
 ### Fixes
 
+* **Fixes `category_depth` None value for Title elements** Problem: When `Headline` and `Subheadline` element types from `chipper` are mapped to `Title`, they are assigned metadata `category_depth` 1 and 2 respectively. During the same mapping, `Title` elements from `chipper` are left unchanged, with metadata `category_depth` = None. Fix: Whenever `Headline` or `Subheadline` layout elements are present in a page, all `Title` elements on that page with `category_depth` = None are set to a depth of 0 instead.
 * **Fixes a metadata source serialization bug** Problem: In unstructured elements, when loading an elements json file from the disk, the data_source attribute is assumed to be an instance of DataSourceMetadata and the code acts based on that. However the loader did not satisfy the assumption, and loaded it as a dict instead, causing an error. Fix: Added necessary code block to initialize a DataSourceMetadata object, also refactored DataSourceMetadata.from_dict() method to remove redundant code. Importance: Crucial to be able to load elements (which have data_source fields) from json files.
 * **Fixes issue where unstructured-inference was not getting updated** Problem: unstructured-inference was not getting upgraded to the version to match unstructured release when doing a pip install.  Solution: using `pip install unstructured[all-docs]` it will now upgrade both unstructured and unstructured-inference. Importance: This will ensure that the inference library is always in sync with the unstructured library, otherwise users will be using outdated libraries which will likely lead to unintended behavior.
-* **Fixes SharePoint connector failures if any document has an unsupported filetype** Problem: Currently the entire connector ingest run fails if a single IngestDoc has an unsupported filetype. This is because a ValueError is raised in the IngestDoc's `__post_init__`. Fix: Adds a try/catch when the IngestConnector runs get_ingest_docs such that the error is logged but all processable documents->IngestDocs are still instantiated and returned. Importance: Allows users to ingest SharePoint content even when some files with unsupported filetypes exist there.
 * **Fixes Sharepoint connector server_path issue** Problem: Server path for the Sharepoint Ingest Doc was incorrectly formatted, causing issues while fetching pages from the remote source. Fix: changes formatting of remote file path before instantiating SharepointIngestDocs and appends a '/' while fetching pages from the remote source. Importance: Allows users to fetch pages from Sharepoint Sites.
 * **Fixes badly initialized Formula** Problem: YoloX contain new types of elements, when loading a document that contain formulas a new element of that class
 should be generated, however the Formula class inherits from Element instead of Text. After this change the element is correctly created with the correct class 
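
The `category_depth` rule from the first Fixes entry above can be sketched in isolation. This is a minimal illustration, not the code in `unstructured/partition/common.py`: `LayoutItem`, `MappedElement`, `CHIPPER_TITLE_DEPTHS`, and `map_page` are hypothetical names introduced here, and only the depth values (1, 2, 0, None) come from the changelog entry and the accompanying test.

```python
from dataclasses import dataclass
from typing import List, Optional


# Hypothetical stand-in for a chipper layout element (not the unstructured API).
@dataclass
class LayoutItem:
    type: str
    text: str


# Hypothetical stand-in for a partitioned element with its depth metadata.
@dataclass
class MappedElement:
    category: str
    text: str
    category_depth: Optional[int] = None


# Depths given to chipper types when they are mapped to Title (per the changelog).
CHIPPER_TITLE_DEPTHS = {"Headline": 1, "Subheadline": 2}


def map_page(items: List[LayoutItem]) -> List[MappedElement]:
    """Map one page of layout items, applying the depth-0 rule for plain Titles."""
    # The rule only applies when the page contains a Headline or Subheadline.
    has_hierarchy = any(item.type in CHIPPER_TITLE_DEPTHS for item in items)
    mapped = []
    for item in items:
        if item.type in CHIPPER_TITLE_DEPTHS:
            depth = CHIPPER_TITLE_DEPTHS[item.type]
            mapped.append(MappedElement("Title", item.text, depth))
        elif item.type == "Title":
            # A plain Title becomes the top level (0) only on hierarchical pages.
            depth = 0 if has_hierarchy else None
            mapped.append(MappedElement("Title", item.text, depth))
        else:
            # Non-title elements keep category_depth = None.
            mapped.append(MappedElement(item.type, item.text))
    return mapped


page = [
    LayoutItem("Headline", "Charlie Brown and the Great Pumpkin"),
    LayoutItem("Subheadline", "The Beginning"),
    LayoutItem("Text", "This time Charlie Brown had it really tricky..."),
    LayoutItem("Title", "Another book title in the same page"),
]
depths = [el.category_depth for el in map_page(page)]
```

On a page without any `Headline` or `Subheadline`, the sketch leaves a plain `Title` at `category_depth` = None, matching the "left unchanged" behavior the entry describes.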

diff --git a/test_unstructured/partition/test_common.py b/test_unstructured/partition/test_common.py
@@ -35,6 +35,18 @@ def elements(self):
                 type="Headline",
                 text="Charlie Brown and the Great Pumpkin",
             ),
+            LocationlessLayoutElement(
+                type="Subheadline",
+                text="The Beginning",
+            ),
+            LocationlessLayoutElement(
+                type="Text",
+                text="This time Charlie Brown had it really tricky...",
+            ),
+            LocationlessLayoutElement(
+                type="Title",
+                text="Another book title in the same page",
+            ),
         ]
 
 
@@ -405,3 +417,12 @@ def test_set_element_hierarchy_custom_rule_set():
     assert (
         elements[5].metadata.parent_id == elements[4].id
     ), "FigureCaption should be child of Title 2"
+
+
+def test_document_to_element_list_sets_category_depth_titles():
+    layout_with_hierarchies = MockDocumentLayout()
+    elements = document_to_element_list(layout_with_hierarchies)
+    assert elements[0].metadata.category_depth == 1
+    assert elements[1].metadata.category_depth == 2
+    assert elements[2].metadata.category_depth is None
+    assert elements[3].metadata.category_depth == 0
diff --git a/unstructured/partition/common.py b/unstructured/partition/common.py
@@ -31,6 +31,7 @@
     ListItem,
     PageBreak,
     Text,
+    Title,
 )
 from unstructured.logger import logger
 from unstructured.nlp.patterns import ENUMERATED_BULLETS_RE, UNICODE_BULLETS_RE
@@ -560,7 +561,6 @@ def document_to_element_list(
                 infer_list_items=infer_list_items,
                 source_format=source_format if source_format else "html",
             )
-
             if isinstance(element, List):
                 for el in element:
                     if last_modification_date:
@@ -574,6 +574,14 @@ def document_to_element_list(
                 element.metadata.text_as_html = (
                     layout_element.text_as_html if hasattr(layout_element, "text_as_html") else None
                 )
+                try:
+                    if (isinstance(element, Title) and element.metadata.category_depth is None) and any(
+                        el.type in ["Headline", "Subheadline"] for el in page.elements
+                    ):
+                        element.metadata.category_depth = 0
+                except AttributeError:
+                    logger.info("HTML element instance has no attribute type")
+
                 page_elements.append(element)
             coordinates = (
                 element.metadata.coordinates.points if element.metadata.coordinates else None