roman/cli infer table arg (#1685)

### Description

Add new parameter to map to `skip_infer_table_types` partition arg. Applies to partition config which is set on all connectors.

---------

Co-authored-by: ryannikolaidis <[email protected]>
Co-authored-by: rbiseck3 <[email protected]>
Unstructured-IO · Oct 12, 2023 · 9b5d5e0
1 parent 35852bb
commit 9b5d5e0

Showing 8 changed files with 184 additions and 5 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -2,6 +2,8 @@
 
 ### Enhancements
 
+* **Expose skip_infer_table_types in ingest CLI** For each connector, a new `--skip-infer-table-types` parameter was added that maps to the `skip_infer_table_types` partition argument. This gives unstructured-ingest users more granular control, allowing them to specify the file types for which table extraction should be skipped.
+
 ### Features
 
 ### Fixes

diff --git a/test_unstructured_ingest/example-docs/layout-parser-paper-with-table.jpg b/test_unstructured_ingest/example-docs/layout-parser-paper-with-table.jpg
diff --git a/test_unstructured_ingest/example-docs/layout-parser-paper.pdf b/test_unstructured_ingest/example-docs/layout-parser-paper.pdf
diff --git a/.../local-single-file-with-pdf-infer-table-structure/layout-parser-paper-with-table.jpg.json b/.../local-single-file-with-pdf-infer-table-structure/layout-parser-paper-with-table.jpg.json
@@ -0,0 +1,162 @@
+[
+  {
+    "type": "Title",
+    "element_id": "5fc3b3d02c954fce8bdb8742665da14d",
+    "metadata": {
+      "data_source": {},
+      "filetype": "image/jpeg",
+      "page_number": 1
+    },
+    "text": "LayoutParser: A Unified Toolkit for DL-Based DIA 5,"
+  },
+  {
+    "type": "FigureCaption",
+    "element_id": "53522497f48d7f32acd862a28dee0253",
+    "metadata": {
+      "data_source": {},
+      "filetype": "image/jpeg",
+      "page_number": 1
+    },
+    "text": "‘Table 1: Current layout detection models in the LayoutParser model z00"
+  },
+  {
+    "type": "Table",
+    "element_id": "0f0bad1db94e5aaa06c7ff033f7a27cf",
+    "metadata": {
+      "data_source": {},
+      "filetype": "image/jpeg",
+      "page_number": 1
+    },
+    "text": "Dataset | Bare Model! Large Mode! | Noter PablagNee [5] P/M M_|Tayous of moder scientie documents Rima [) « = | tagout of scanned modern magains and cee reports Newspaper [IT]| FP {Layout of ranned US nemepapers fom the 2h entry ‘Thbtebenk (i) | ‘able epon cn modern aciente and business document apace 1) |_P/M =| tagout of itary Inpanere documents"
+  },
+  {
+    "type": "FigureCaption",
+    "element_id": "37a781e7327333b86b022ed7fb12d620",
+    "metadata": {
+      "data_source": {},
+      "filetype": "image/jpeg",
+      "page_number": 1
+    },
+    "text": "Sek ca pe ogee ile Se riot (ade ines ores Tsbooe 09, pect, One ca nin moose cet acct ie Ptr LOND Bi (7) and Mask ‘hay tn Hee tens Te platen etd sete it bem ee me"
+  },
+  {
+    "type": "NarrativeText",
+    "element_id": "62253f3b9ad80b81d9fe3656d597ba21",
+    "metadata": {
+      "data_source": {},
+      "filetype": "image/jpeg",
+      "page_number": 1
+    },
+    "text": "layout data structures, which are optimized for efficiency and versatility. 3) When necessary, users can employ existing or customized OCR models via the unified API provided in the OCR module. 4) LayoutParser comes with a set of utility fanctions for the visualization and storage of the layout data. 5) LayoutParser is also highly customizable, via its integration with functions for layout data annotation and model training We now provide detailed descriptions for each component."
+  },
+  {
+    "type": "Title",
+    "element_id": "958174bfb8153f0b2c1d247196bcf8b1",
+    "metadata": {
+      "data_source": {},
+      "filetype": "image/jpeg",
+      "page_number": 1
+    },
+    "text": "3.1 Layout Detection Models"
+  },
+  {
+    "type": "NarrativeText",
+    "element_id": "f72c039d55d8062b540b8f075bf697fb",
+    "metadata": {
+      "data_source": {},
+      "filetype": "image/jpeg",
+      "page_number": 1
+    },
+    "text": "In LayoutParser, a layout model takes a document image as an input and generates a list of rectangular boxes for the target content regions. Different from traditional methods, it relies on deep convolutional neural networks rather than manually curated rules to identify content regions. It is formulated as an object detection problem and state-of-the-art’ models like Faster R-CNNY [28] and Mask R-CNN [12] are used. ‘This yields prediction results of high accuracy and makes it possible to build a concise, generalized interface for layout detection. LayoutParser, built upon Detectron? [38], provides a minimal API that can perform layout detection with only four lines of code in Python:"
+  },
+  {
+    "type": "ListItem",
+    "element_id": "742f93af10c235d2612a2b85c7ce9294",
+    "metadata": {
+      "data_source": {},
+      "filetype": "image/jpeg",
+      "page_number": 1
+    },
+    "text": "\\ import layoutparser as 1p"
+  },
+  {
+    "type": "ListItem",
+    "element_id": "44723386662ffb524ec7b20b0ddf2382",
+    "metadata": {
+      "data_source": {},
+      "filetype": "image/jpeg",
+      "page_number": 1
+    },
+    "text": "2 image = cv2.imread(\"image_file\") # load images"
+  },
+  {
+    "type": "UncategorizedText",
+    "element_id": "32ebb1abcc1c601ceb9c4e3c4faba0ca",
+    "metadata": {
+      "data_source": {},
+      "filetype": "image/jpeg",
+      "page_number": 1
+    },
+    "text": "("
+  },
+  {
+    "type": "ListItem",
+    "element_id": "cd84964c612152f5362ee38fab9cad62",
+    "metadata": {
+      "data_source": {},
+      "filetype": "image/jpeg",
+      "page_number": 1
+    },
+    "text": "» model = 1p. Detectron2LayoutModel"
+  },
+  {
+    "type": "ListItem",
+    "element_id": "868ea4c7705456b218188afb4e2a04ab",
+    "metadata": {
+      "data_source": {},
+      "filetype": "image/jpeg",
+      "page_number": 1
+    },
+    "text": "\"1p: //PubLayllet /faster_renn_t |-50_FPI_3x/config\")"
+  },
+  {
+    "type": "UncategorizedText",
+    "element_id": "4b227777d4dd1fc61c6f884f48641d02",
+    "metadata": {
+      "data_source": {},
+      "filetype": "image/jpeg",
+      "page_number": 1
+    },
+    "text": "4"
+  },
+  {
+    "type": "Title",
+    "element_id": "cfacfd3ec33b9608b59a343d05da204c",
+    "metadata": {
+      "data_source": {},
+      "filetype": "image/jpeg",
+      "page_number": 1
+    },
+    "text": "detect"
+  },
+  {
+    "type": "ListItem",
+    "element_id": "ec23428744214fb4e7dd4d5d25939ae9",
+    "metadata": {
+      "data_source": {},
+      "filetype": "image/jpeg",
+      "page_number": 1
+    },
+    "text": "layout = model. (image)"
+  },
+  {
+    "type": "NarrativeText",
+    "element_id": "9e7beafe373dc2fbff761d7997defec9",
+    "metadata": {
+      "data_source": {},
+      "filetype": "image/jpeg",
+      "page_number": 1
+    },
+    "text": "LayoutParser provides a wealth of pre-trained model weights using various datasets covering different languages, time periods, and document types. Due to domain shift [7], the prediction performance can notably drop when models are ap- plied to target samples that are significantly different from the training dataset. As document structures and layouts vary greatly in different domains, itis important to select models trained on adataset similar to the test samples. A semantic syntax is used for initializing the model weights in LayoutParser, using both the dataset name and model name 1p://<dataset-nane>/<nodel-archi tecture-nane>."
+  }
+]
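
The expected output above contains a `Table` element for the `.jpg` input, and its metadata carries only `data_source`, `filetype`, and `page_number`. A quick way to check a fixture like this is sketched below. It assumes that inferred table structure would show up as a `text_as_html` key in an element's metadata; that key is an assumption about the partition output and is not shown in this diff. The helper name `tables_with_html` and the trimmed `SAMPLE` (two elements from the fixture, `text` omitted) are illustrative only.

```python
import json
from typing import Any, Dict, List


def tables_with_html(elements: List[Dict[str, Any]]) -> List[str]:
    """Return element_ids of Table elements whose metadata has `text_as_html`.

    `text_as_html` is an assumed metadata key, not something this diff shows.
    """
    return [
        el["element_id"]
        for el in elements
        if el.get("type") == "Table" and "text_as_html" in el.get("metadata", {})
    ]


# Two elements copied from the fixture above, with the "text" field omitted.
SAMPLE = json.loads(
    """
    [
      {"type": "Title", "element_id": "5fc3b3d02c954fce8bdb8742665da14d",
       "metadata": {"data_source": {}, "filetype": "image/jpeg", "page_number": 1}},
      {"type": "Table", "element_id": "0f0bad1db94e5aaa06c7ff033f7a27cf",
       "metadata": {"data_source": {}, "filetype": "image/jpeg", "page_number": 1}}
    ]
    """
)

if __name__ == "__main__":
    # An empty list means no Table element carries the assumed key.
    print(tables_with_html(SAMPLE))
```
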
diff --git a/test_unstructured_ingest/test-ingest-local-single-file-with-pdf-infer-table-structure.sh b/test_unstructured_ingest/test-ingest-local-single-file-with-pdf-infer-table-structure.sh
@@ -22,11 +22,12 @@ PYTHONPATH=. ./unstructured/ingest/main.py \
     --num-processes "$max_processes" \
     --metadata-exclude coordinates,filename,file_directory,metadata.data_source.date_processed,metadata.last_modified,metadata.detection_class_prob,metadata.parent_id,metadata.category_depth \
     --output-dir "$OUTPUT_DIR" \
+    --skip-infer-table-types "jpg" \
     --pdf-infer-table-structure true \
     --strategy hi_res \
     --verbose \
     --reprocess \
-    --input-path example-docs/layout-parser-paper.pdf \
+    --input-path "$SCRIPT_DIR"/example-docs/ \
     --work-dir "$WORK_DIR"
 
 set +e

diff --git a/unstructured/ingest/cli/interfaces.py b/unstructured/ingest/cli/interfaces.py
@@ -139,6 +139,12 @@ class CliPartitionConfig(PartitionConfig, CliMixin):
     @staticmethod
     def add_cli_options(cmd: click.Command) -> None:
         options = [
+            click.Option(
+                ["--skip-infer-table-types"],
+                type=DelimitedString(),
+                default=None,
+                help="Optional list of document types to skip table extraction on",
+            ),
             click.Option(
                 ["--pdf-infer-table-structure"],
                 default=False,
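
The new option takes a delimited string and defaults to `None`, so the config field stays unset unless the user passes a value. The sketch below illustrates that behavior with `argparse` from the standard library rather than `click`, so it is a stand-in and not the code in this commit. The `delimited_string` helper is hypothetical, and splitting on commas is an assumption about how `DelimitedString` behaves.

```python
import argparse
from typing import List, Optional


def delimited_string(value: str) -> List[str]:
    """Split a comma-delimited value into a list, dropping empty items.

    Hypothetical stand-in for DelimitedString; the comma is an assumption.
    """
    return [item.strip() for item in value.split(",") if item.strip()]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--skip-infer-table-types",
        type=delimited_string,
        default=None,  # stays None when the flag is not passed
        help="Optional list of document types to skip table extraction on",
    )
    return parser


def parse_skip_types(argv: List[str]) -> Optional[List[str]]:
    """Parse argv and return the parsed list, or None if the flag is absent."""
    return build_parser().parse_args(argv).skip_infer_table_types


if __name__ == "__main__":
    print(parse_skip_types(["--skip-infer-table-types", "jpg,png"]))
    print(parse_skip_types([]))
```
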

diff --git a/unstructured/ingest/interfaces.py b/unstructured/ingest/interfaces.py
@@ -38,6 +38,7 @@ class BaseConfig(DataClassJsonMixin, ABC):
 class PartitionConfig(BaseConfig):
     # where to write structured data outputs
     pdf_infer_table_structure: bool = False
+    skip_infer_table_types: t.Optional[t.List[str]] = None
     strategy: str = "auto"
     ocr_languages: str = "eng"
     encoding: t.Optional[str] = None

diff --git a/unstructured/ingest/pipeline/partition.py b/unstructured/ingest/pipeline/partition.py
@@ -30,12 +30,19 @@ def run(self, ingest_doc_json) -> str:
             if self.partition_config.ocr_languages
             else []
         )
+        partition_kwargs = {
+            "strategy": self.partition_config.strategy,
+            "languages": languages,
+            "encoding": self.partition_config.encoding,
+            "pdf_infer_table_structure": self.partition_config.pdf_infer_table_structure,
+        }
+        if self.partition_config.skip_infer_table_types:
+            partition_kwargs[
+                "skip_infer_table_types"
+            ] = self.partition_config.skip_infer_table_types
         elements = doc.process_file(
             partition_config=self.partition_config,
-            strategy=self.partition_config.strategy,
-            languages=languages,
-            encoding=self.partition_config.encoding,
-            pdf_infer_table_structure=self.partition_config.pdf_infer_table_structure,
+            **partition_kwargs,
         )
         with open(json_path, "w", encoding="utf8") as output_f:
             logger.info(f"writing partitioned content to {json_path}")
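
The hunk above builds the keyword arguments as a dict and adds `skip_infer_table_types` only when the config value is truthy, so an unset value leaves the callee's own default in place. A minimal, self-contained sketch of that pattern follows. `PartitionConfigSketch`, `build_partition_kwargs`, and `fake_partition` are hypothetical names, and the callee's default value is invented for illustration.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class PartitionConfigSketch:
    """Hypothetical stand-in for the PartitionConfig fields used here."""

    pdf_infer_table_structure: bool = False
    skip_infer_table_types: Optional[List[str]] = None
    strategy: str = "auto"
    encoding: Optional[str] = None


def build_partition_kwargs(
    config: PartitionConfigSketch, languages: List[str]
) -> Dict[str, Any]:
    kwargs: Dict[str, Any] = {
        "strategy": config.strategy,
        "languages": languages,
        "encoding": config.encoding,
        "pdf_infer_table_structure": config.pdf_infer_table_structure,
    }
    # Forward the key only for a non-empty list; None and [] are both omitted.
    if config.skip_infer_table_types:
        kwargs["skip_infer_table_types"] = config.skip_infer_table_types
    return kwargs


def fake_partition(
    skip_infer_table_types: Optional[List[str]] = None, **_: Any
) -> List[str]:
    """Hypothetical callee; its default below is invented for this sketch."""
    if skip_infer_table_types is None:
        skip_infer_table_types = ["placeholder-default"]
    return skip_infer_table_types


if __name__ == "__main__":
    unset = build_partition_kwargs(PartitionConfigSketch(), ["eng"])
    print(fake_partition(**unset))
```

One consequence of the truthiness check is that an empty list is treated the same as `None`, so with this pattern a caller cannot pass an empty list through to override the callee's default.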