Merge branch 'main' into fix/docx-without-sections
badGarnet authored Oct 24, 2023
2 parents 0fa0b7e + 37e8413 commit 24d6259
Showing 133 changed files with 16,150 additions and 3,465 deletions.
2 changes: 1 addition & 1 deletion .coveragerc
@@ -1,5 +1,5 @@
[run]
omit =
unstructured/ingest/*
# TODO(yuming): please remove this line after adding tests for paddle (CORE-1886)
# TODO(yuming): please remove this line after adding tests for paddle
unstructured/partition/utils/ocr_models/paddle_ocr.py
2 changes: 1 addition & 1 deletion .github/workflows/ci.yml
@@ -305,7 +305,7 @@ jobs:
AZURE_SEARCH_API_KEY: ${{ secrets.AZURE_SEARCH_API_KEY }}
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
TABLE_OCR: "tesseract"
ENTIRE_PAGE_OCR: "tesseract"
OCR_AGENT: "tesseract"
CI: "true"
run: |
source .venv/bin/activate
2 changes: 1 addition & 1 deletion .github/workflows/ingest-test-fixtures-update-pr.yml
@@ -97,7 +97,7 @@ jobs:
AZURE_SEARCH_API_KEY: ${{ secrets.AZURE_SEARCH_API_KEY }}
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
TABLE_OCR: "tesseract"
ENTIRE_PAGE_OCR: "tesseract"
OCR_AGENT: "tesseract"
OVERWRITE_FIXTURES: "true"
CI: "true"
run: |
29 changes: 27 additions & 2 deletions CHANGELOG.md
@@ -1,11 +1,30 @@
## 0.10.25-dev7
## 0.10.26-dev3

### Enhancements

* **Add CI evaluation workflow** Adds evaluation metrics to the current ingest workflow to measure the performance of each extracted file as well as aggregate-level performance.

### Features

* **Add Local connector source metadata** Python's `os` module is used to pull stats from the local file when processing via the local connector, populating fields such as last modified time and created time (a sketch follows).
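Roughly, the stat lookup could look like this (a hedged sketch, not the connector's actual code; the path is hypothetical):

```python
import os
from datetime import datetime

# Pull filesystem stats for a local file and derive source-metadata fields.
stats = os.stat("example-docs/fake-text.txt")  # hypothetical path
date_modified = datetime.fromtimestamp(stats.st_mtime).isoformat()
# Note: on Linux, st_ctime is the inode-change time, used here as an approximation.
date_created = datetime.fromtimestamp(stats.st_ctime).isoformat()
```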

### Fixes

* **Fix a bug on Table partitioning** Previously the `skip_infer_table_types` parameter passed to `partition` was not forwarded to the file-specific partitioners. Now you can use the `skip_infer_table_types` list parameter in `partition` to name the filetypes for which the `text_as_html` metadata field should be excluded, or pass the `infer_table_structure` boolean directly to the file-specific partitioning function (see the sketch after this list).
* **Fix partition docx without sections** Some docx files, such as those produced by Teams, do not contain sections, and partitioning returned no results because the code assumed all components live inside sections. Now, if no sections are detected in a document, we iterate through its paragraphs and return the content found there.
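A minimal usage sketch of the two options named in the table-partitioning fix above (a sketch under assumptions, not the documented example; file names are hypothetical):

```python
from unstructured.partition.auto import partition
from unstructured.partition.docx import partition_docx

# Skip the `text_as_html` metadata field for .docx files only.
elements = partition("meeting-notes.docx", skip_infer_table_types=["docx"])

# Or call the file-specific partitioner directly and control table inference
# with its boolean flag (assuming `partition_docx` exposes `infer_table_structure`).
elements = partition_docx("meeting-notes.docx", infer_table_structure=False)
```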

## 0.10.25

### Enhancements

* **Duplicate CLI param check** Since many of the options associated with the `Click`-based CLI ingest commands are added dynamically from a number of configs, a check was added to ensure there are no duplicate entries, preventing new configs from overwriting already-added options.
* **Ingest CLI refactor for better code reuse** Much of the ingest CLI code was copy-pasted across files, adding risk. It was refactored to use a base class that holds the shared, templated code.

### Features

* **Table OCR refactor** Supports table OCR with pre-computed OCR data to ensure we only run OCR once for the entire document. Users can specify the OCR agent (tesseract or paddle) via the `OCR_AGENT` environment variable for OCRing the entire document (a sketch follows this list).
* **Adds accuracy function** The accuracy scoring was originally an option under `calculate_edit_distance`. For easier calling, it is now a wrapper around the original function that computes the edit distance and returns it as a "score".
* **Adds HuggingFaceEmbeddingEncoder** The HuggingFace Embedding Encoder uses a local embedding model as opposed to using an API.
* **Add AWS bedrock embedding connector** `unstructured.embed.bedrock` now provides a connector to use AWS Bedrock's `titan-embed-text` model to generate embeddings for elements. This feature requires a valid AWS Bedrock setup and an internet connection to run.
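The `OCR_AGENT` selection described in the Table OCR refactor above can be exercised roughly like this (a hedged sketch; the file name is hypothetical and the accepted values are taken from the entry above):

```python
import os

# Choose the OCR engine before partitioning; "tesseract" or "paddle" per the entry above.
os.environ["OCR_AGENT"] = "tesseract"

from unstructured.partition.pdf import partition_pdf

# hi_res triggers layout detection plus OCR; table extraction reuses the
# pre-computed OCR data rather than running OCR a second time.
elements = partition_pdf(
    "scanned-report.pdf",  # hypothetical path
    strategy="hi_res",
    infer_table_structure=True,
)
```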

@@ -16,7 +35,13 @@
* **Fix chunks breaking on regex-metadata matches.** Fixes "over-chunking" when `regex_metadata` was used, where every element that contained a regex-match would start a new chunk.
* **Fix regex-metadata match offsets not adjusted within chunk.** Fixes incorrect regex-metadata match start/stop offset in chunks where multiple elements are combined.
* **Map source cli command configs when destination set** Due to how the source connector is dynamically called when the destination connector is set via the CLI, the configs were being set incorrectly, causing the source connector to break. The configs were fixed and updated to take Fsspec-specific connectors into account.
* **Fix partition docx without sections** Some docx files, such as those produced by Teams, do not contain sections, and partitioning returned no results because the code assumed all components live inside sections. Now, if no sections are detected in a document, we iterate through its paragraphs and return the content found there (see the sketch after this list).
* **Fix metrics folder not discoverable** Fixes issue where the unstructured/metrics folder is not discoverable on PyPI by adding an `__init__.py` file under the folder.
* **Fix a bug when `partition_pdf` gets `model_name=None`** In API usage the `model_name` value is `None`, and the `cast` function in `partition_pdf` would return `None`, leading to an attribute error. Now we use the `str` function to explicitly convert the content to a string so it is guaranteed to have `startswith` and other string methods as attributes.
* **Fix html partition fail on tables without `tbody` tag** HTML tables may sometimes contain only headers without a body (`tbody` tag); partitioning no longer fails on such tables.
* **Fix out-of-order sequencing of split chunks.** Fixes behavior where "split" chunks were inserted at the beginning of the chunk sequence. This would produce a chunk sequence like [5a, 5b, 3a, 3b, 1, 2, 4] when sections 3 and 5 exceeded `max_characters`.
* **Deserialization of ingest docs fixed** When ingest docs are being deserialized as part of the ingest pipeline process (CLI), certain fields weren't getting persisted (metadata and date processed). The `from_dict` method was updated to take these into account, and a unit test was added to verify the behavior.
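The docx-without-sections fallback mentioned in this list can be pictured with a minimal `python-docx` sketch (not the library's actual implementation):

```python
import docx  # python-docx

def fallback_docx_texts(path: str) -> list[str]:
    """Sketch of the fallback: when a .docx file (e.g. a Teams export) exposes
    no sections, iterate its paragraphs directly so content is still returned."""
    document = docx.Document(path)
    if document.sections:
        raise NotImplementedError("sections present: the regular section-based path applies")
    return [p.text for p in document.paragraphs if p.text.strip()]
```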

## 0.10.24

2 changes: 1 addition & 1 deletion docs/source/api.rst
@@ -460,7 +460,7 @@ To extract the table structure from PDF files using the ``hi_res`` strategy, ens
Table Extraction for other filetypes
------------------------------------

We also provide support for enabling and disabling table extraction for file types other than PDF files. Set parameter ``skip_infer_table_types`` to specify the document types that you want to skip table extraction with. By default, we skip table extraction for PDFs Images, and Excel files which are ``pdf``, ``jpg``, ``png``, ``xlsx``, and ``xls``. Note that table extraction only works with ``hi_res`` strategy. For example, if you don't want to skip table extraction for images, you can pass an empty value to ``skip_infer_table_types`` with:
We also provide support for enabling and disabling table extraction for file types other than PDF files. Set the parameter ``skip_infer_table_types`` to specify the document types for which you want to skip table extraction. By default, we skip table extraction for PDFs, images, and Excel files, which are ``pdf``, ``jpg``, ``png``, ``xlsx``, and ``xls``. Note that table extraction for images and PDFs only works with the ``hi_res`` strategy. For example, if you don't want to skip table extraction for images, you can pass an empty value to ``skip_infer_table_types`` with:

.. tabs::

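A minimal sketch of the call the paragraph above describes (the file path is hypothetical; the rendered docs show equivalent snippets in the tabs):

```python
from unstructured.partition.auto import partition

# An empty list means no filetype is skipped, so tables in image files are
# also extracted; this requires the hi_res strategy.
elements = partition(
    filename="example-docs/table-in-image.jpg",  # hypothetical path
    strategy="hi_res",
    skip_infer_table_types=[],
)
```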
Binary file added example-docs/korean-text-with-tables.pdf
42 changes: 25 additions & 17 deletions examples/argilla-summarization/isw-summarization.ipynb
@@ -43,7 +43,7 @@
"source": [
"from IPython.display import Image\n",
"\n",
"Image(filename=\"img/isw.png\", width=800) "
"Image(filename=\"img/isw.png\", width=800)"
]
},
{
@@ -94,6 +94,7 @@
"source": [
"ISW_BASE_URL = \"https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment\"\n",
"\n",
"\n",
"def datetime_to_url(dt):\n",
" month = dt.strftime(\"%B\").lower()\n",
" return f\"{ISW_BASE_URL}-{month}-{dt.day}\""
@@ -134,8 +135,8 @@
" r = requests.get(url)\n",
" if r.status_code != 200:\n",
" return None\n",
" \n",
" elements = partition_html(text=r.text) \n",
"\n",
" elements = partition_html(text=r.text)\n",
" return elements"
]
},
@@ -170,7 +171,7 @@
}
],
"source": [
"Image(filename=\"img/isw-key-takeaways.png\", width=500) "
"Image(filename=\"img/isw-key-takeaways.png\", width=500)"
]
},
{
@@ -185,13 +186,14 @@
" if element.text == \"Key Takeaways\":\n",
" return idx\n",
"\n",
"\n",
"def get_key_takeaways(elements):\n",
" key_takeaways_idx = _find_key_takeaways_idx(elements)\n",
" if not key_takeaways_idx:\n",
" return None\n",
" \n",
"\n",
" takeaways = []\n",
" for element in elements[key_takeaways_idx + 1:]:\n",
" for element in elements[key_takeaways_idx + 1 :]:\n",
" if not isinstance(element, ListItem):\n",
" break\n",
" takeaways.append(element)\n",
@@ -245,12 +247,12 @@
"source": [
"def get_narrative(elements):\n",
" narrative_text = \"\"\n",
" for element in elements: \n",
" for element in elements:\n",
" if isinstance(element, NarrativeText) and len(element.text) > 500:\n",
" # NOTE: Removes citations like [3] from the text\n",
" element_text = re.sub(\"\\[\\d{1,3}\\]\", \"\", element.text)\n",
" narrative_text += f\"\\n\\n{element_text}\"\n",
" \n",
"\n",
" return NarrativeText(text=narrative_text.strip())"
]
},
@@ -337,10 +339,10 @@
" elements = url_to_elements(url)\n",
" if url is None or not elements:\n",
" continue\n",
" \n",
"\n",
" text = get_narrative(elements)\n",
" annotation = get_key_takeaways(elements)\n",
" \n",
"\n",
" if text and annotation:\n",
" inputs.append(text)\n",
" annotations.append(annotation.text)\n",
@@ -600,7 +602,7 @@
}
],
"source": [
"Image(filename=\"img/argilla-dataset.png\", width=800) "
"Image(filename=\"img/argilla-dataset.png\", width=800)"
]
},
{
@@ -634,7 +636,7 @@
}
],
"source": [
"Image(filename=\"img/argilla-annotation.png\", width=800) "
"Image(filename=\"img/argilla-annotation.png\", width=800)"
]
},
{
@@ -688,7 +690,7 @@
],
"source": [
"from transformers import AutoTokenizer\n",
" \n",
"\n",
"tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)"
]
},
@@ -702,6 +704,7 @@
"max_input_length = 1024\n",
"max_target_length = 128\n",
"\n",
"\n",
"def preprocess_function(examples):\n",
" inputs = [doc for doc in examples[\"text\"]]\n",
" model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)\n",
@@ -754,7 +757,12 @@
"metadata": {},
"outputs": [],
"source": [
"from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer\n",
"from transformers import (\n",
" AutoModelForSeq2SeqLM,\n",
" DataCollatorForSeq2Seq,\n",
" Seq2SeqTrainingArguments,\n",
" Seq2SeqTrainer,\n",
")\n",
"\n",
"model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)"
]
@@ -770,7 +778,7 @@
"model_name = model_checkpoint.split(\"/\")[-1]\n",
"args = Seq2SeqTrainingArguments(\n",
" \"t5-small-isw-summaries\",\n",
" evaluation_strategy = \"epoch\",\n",
" evaluation_strategy=\"epoch\",\n",
" learning_rate=2e-5,\n",
" per_device_train_batch_size=batch_size,\n",
" per_device_eval_batch_size=batch_size,\n",
@@ -1068,8 +1076,8 @@
],
"source": [
"summarization_model = pipeline(\n",
"task=\"summarization\",\n",
"model=\"./t5-small-isw-summaries\",\n",
" task=\"summarization\",\n",
" model=\"./t5-small-isw-summaries\",\n",
")"
]
},