Commit

fix: ignore connector extract if no connector folder available (#2198)
When the folder structure contains no connector subfolder, the code
mistakenly uses the filename as the connector. Fix the code so that, if
there is no connector subfolder, the connector field is left as None.
Klaijan authored Dec 7, 2023
1 parent cde11d1 commit 46cb306
Showing 2 changed files with 4 additions and 3 deletions.
2 changes: 1 addition & 1 deletion CHANGELOG.md
@@ -4,7 +4,7 @@

* **Refactor image extraction code.** The image extraction code is moved from `unstructured-inference` to `unstructured`.
* **Refactor pdfminer code.** The pdfminer code is moved from `unstructured-inference` to `unstructured`.
-* **Improve handling of auth data for fsspec connectors** Leverage an extension of the dataclass paradigm to support a `sensitive` annotation for fields related to auth (i.e. passwords, tokens). Refactor all fsspec connectors to use explicit access configs rather than a generic dictionary.
+* **Improve handling of auth data for fsspec connectors.** Leverage an extension of the dataclass paradigm to support a `sensitive` annotation for fields related to auth (i.e. passwords, tokens). Refactor all fsspec connectors to use explicit access configs rather than a generic dictionary.

### Features

5 changes: 3 additions & 2 deletions unstructured/metrics/evaluate.py
@@ -68,7 +68,7 @@ def measure_text_extraction_accuracy(
filename = (doc.split("/")[-1]).split(".json")[0]
doctype = filename.rsplit(".", 1)[-1]
fn_txt = filename + ".txt"
-connector = doc.split("/")[0]
+connector = doc.split("/")[0] if len(doc.split("/")) > 1 else None

# not all odetta cct files follow the same naming convention;
# some exclude the original filetype from the name
@@ -143,7 +143,8 @@ def measure_element_type_accuracy(
filename = (doc.split("/")[-1]).split(".json")[0]
doctype = filename.rsplit(".", 1)[-1]
fn_json = filename + ".json"
-connector = doc.split("/")[0]
+connector = doc.split("/")[0] if len(doc.split("/")) > 1 else None

if fn_json in source_list: # type: ignore
output = get_element_type_frequency(_read_text(os.path.join(output_dir, doc)))
source = get_element_type_frequency(_read_text(os.path.join(source_dir, fn_json)))
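
To illustrate the behavior change in both hunks, here is a minimal standalone sketch of the path handling. The helper name `split_doc_path` and the sample paths are hypothetical and not part of the commit; the three expressions inside it mirror the lines shown in the diff.

```python
from typing import Optional, Tuple


def split_doc_path(doc: str) -> Tuple[Optional[str], str, str]:
    """Hypothetical helper: return (connector, filename, doctype) for a relative path."""
    parts = doc.split("/")
    # Treat the first component as the connector only when a subfolder is present;
    # before the fix, a bare filename would have been returned here instead.
    connector = parts[0] if len(parts) > 1 else None
    # Same derivations as in evaluate.py: strip the ".json" suffix, then take
    # the remaining extension as the document type.
    filename = parts[-1].split(".json")[0]
    doctype = filename.rsplit(".", 1)[-1]
    return connector, filename, doctype


print(split_doc_path("s3/report.pdf.json"))  # path with a connector subfolder
print(split_doc_path("report.pdf.json"))     # path without a connector subfolder
```

Note that, as in the committed expression, a path with several levels of nesting still yields only its first component as the connector.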