fix: ignore connector extract if no connector folder available (#2198)

When the folder structure contains no connector subfolder, the code takes the filename as the connector instead. Fix the code so that, if the folder structure has no connector subfolder, the connector field is left blank (None).
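
The behavior described above can be sketched in isolation. This is a minimal illustration, not the code in `evaluate.py`: the helper name `connector_from_path` and the example paths are hypothetical, and it assumes `doc` is a relative path using `/` as the separator, as in the diff below.

```python
from typing import Optional


def connector_from_path(doc: str) -> Optional[str]:
    """Return the leading folder of a relative path, or None if there is none.

    Hypothetical helper for illustration; the commit inlines this logic.
    """
    parts = doc.split("/")
    # A bare filename splits into a single part, so there is no connector folder.
    return parts[0] if len(parts) > 1 else None


# Hypothetical example paths: one with a connector subfolder, one without.
print(connector_from_path("s3/example.pdf.json"))  # s3
print(connector_from_path("example.pdf.json"))     # None
```

Before the fix, the second call would have returned `"example.pdf.json"`, because `doc.split("/")[0]` on a path without a separator is the whole string.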
Unstructured-IO · Dec 7, 2023 · 46cb306
1 parent cde11d1
Showing 2 changed files with 4 additions and 3 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -4,7 +4,7 @@
 
 * **Refactor image extraction code.** The image extraction code is moved from `unstructured-inference` to `unstructured`. 
 * **Refactor pdfminer code.** The pdfminer code is moved from `unstructured-inference` to `unstructured`.
-* **Improve handling of auth data for fsspec connectors** Leverage an extension of the dataclass paradigm to support a `sensitive` annotation for fields related to auth (i.e. passwords, tokens). Refactor all fsspec connectors to use explicit access configs rather than a generic dictionary.
+* **Improve handling of auth data for fsspec connectors.** Leverage an extension of the dataclass paradigm to support a `sensitive` annotation for fields related to auth (i.e. passwords, tokens). Refactor all fsspec connectors to use explicit access configs rather than a generic dictionary.
 
 ### Features
 

diff --git a/unstructured/metrics/evaluate.py b/unstructured/metrics/evaluate.py
@@ -68,7 +68,7 @@ def measure_text_extraction_accuracy(
         filename = (doc.split("/")[-1]).split(".json")[0]
         doctype = filename.rsplit(".", 1)[-1]
         fn_txt = filename + ".txt"
-        connector = doc.split("/")[0]
+        connector = doc.split("/")[0] if len(doc.split("/")) > 1 else None
 
         # not all odetta cct files follow the same naming convention;
         # some exclude the original filetype from the name
@@ -143,7 +143,8 @@ def measure_element_type_accuracy(
         filename = (doc.split("/")[-1]).split(".json")[0]
         doctype = filename.rsplit(".", 1)[-1]
         fn_json = filename + ".json"
-        connector = doc.split("/")[0]
+        connector = doc.split("/")[0] if len(doc.split("/")) > 1 else None
+
         if fn_json in source_list:  # type: ignore
             output = get_element_type_frequency(_read_text(os.path.join(output_dir, doc)))
             source = get_element_type_frequency(_read_text(os.path.join(source_dir, fn_json)))