Add new parameter to map to skip_infer_table_types partition arg
rbiseck3 committed Oct 9, 2023
1 parent 8b93217 commit 6a0fbfb
Showing 5 changed files with 22 additions and 6 deletions.
4 changes: 3 additions & 1 deletion CHANGELOG.md
@@ -1,4 +1,4 @@
-## 0.10.20-dev6
+## 0.10.20-dev7

 ### Enhancements

@@ -9,6 +9,8 @@
 * **Improve title detection in pptx documents** The default title textboxes on a pptx slide are now categorized as titles.
 * **Improve hierarchy detection in pptx documents** List items, and other slide text are properly nested under the slide title. This will enable better chunking of pptx documents.
 * **Refactor of the ingest cli workflow** The refactored approach uses a dynamically set pipeline with a snapshot along each step to save progress and accommodate continuation from a snapshot if an error occurs. This also allows the pipeline to dynamically assign any number of steps to modify the partitioned content before it gets written to a destination.
+* **Expose skip_infer_table_types in ingest CLI** For each connector a new `--skip-infer-table-types` parameter was added to map to the `skip_infer_table_types` partition argument.
+
 ### Features

 * **Adds `edit_distance` calculation metrics** In order to benchmark the cleaned, extracted text with unstructured, `edit_distance` (`Levenshtein distance`) is included.
2 changes: 1 addition & 1 deletion unstructured/__version__.py
@@ -1 +1 @@
-__version__ = "0.10.20-dev6"  # pragma: no cover
+__version__ = "0.10.20-dev7"  # pragma: no cover
6 changes: 6 additions & 0 deletions unstructured/ingest/cli/interfaces.py
@@ -139,6 +139,12 @@ class CliPartitionConfig(PartitionConfig, CliMixin):
     @staticmethod
     def add_cli_options(cmd: click.Command) -> None:
         options = [
+            click.Option(
+                ["--skip-infer-table-types"],
+                type=DelimitedString(),
+                default=None,
+                help="Option list of document types to skip table extraction on",
+            ),
             click.Option(
                 ["--pdf-infer-table-structure"],
                 default=False,
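Reviewer note: the new option takes one delimited string and turns it into a list before it reaches `PartitionConfig`. The sketch below illustrates that behaviour with the standard library's `argparse` instead of `click`, so it is an analogue and not the code from this commit. `split_delimited`, `build_parser`, and `parse_skip_types` are hypothetical names, and the comma delimiter is an assumption about how `DelimitedString()` is configured.

```python
import argparse
from typing import List, Optional


def split_delimited(value: str, delimiter: str = ",") -> List[str]:
    """Split a delimited CLI value into a list, dropping empty items.

    Hypothetical stand-in for the DelimitedString click type in the diff;
    the comma delimiter is an assumption.
    """
    return [item.strip() for item in value.split(delimiter) if item.strip()]


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing only the option added in this commit."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--skip-infer-table-types",
        type=split_delimited,
        default=None,  # None means "the flag was not given"
        help="Optional list of document types to skip table extraction on",
    )
    return parser


def parse_skip_types(argv: List[str]) -> Optional[List[str]]:
    """Return the parsed list, or None when the flag is absent."""
    return build_parser().parse_args(argv).skip_infer_table_types
```

With this sketch, `parse_skip_types(["--skip-infer-table-types", "pdf,docx"])` yields a two-item list, and `parse_skip_types([])` yields `None`, which matches the `default=None` in the diff.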
1 change: 1 addition & 0 deletions unstructured/ingest/interfaces.py
@@ -38,6 +38,7 @@ class BaseConfig(DataClassJsonMixin, ABC):
 class PartitionConfig(BaseConfig):
     # where to write structured data outputs
     pdf_infer_table_structure: bool = False
+    skip_infer_table_types: t.Optional[t.List[str]] = None
     strategy: str = "auto"
     ocr_languages: str = "eng"
     encoding: t.Optional[str] = None
15 changes: 11 additions & 4 deletions unstructured/ingest/pipeline/partition.py
@@ -30,12 +30,19 @@ def run(self, ingest_doc_json) -> str:
             if self.partition_config.ocr_languages
             else []
         )
+        partition_kwargs = {
+            "strategy": self.partition_config.strategy,
+            "languages": languages,
+            "encoding": self.partition_config.encoding,
+            "pdf_infer_table_structure": self.partition_config.pdf_infer_table_structure,
+        }
+        if self.partition_config.skip_infer_table_types:
+            partition_kwargs[
+                "skip_infer_table_types"
+            ] = self.partition_config.skip_infer_table_types
         elements = doc.process_file(
             partition_config=self.partition_config,
-            strategy=self.partition_config.strategy,
-            languages=languages,
-            encoding=self.partition_config.encoding,
-            pdf_infer_table_structure=self.partition_config.pdf_infer_table_structure,
+            **partition_kwargs,
         )
         with open(json_path, "w", encoding="utf8") as output_f:
             logger.info(f"writing partitioned content to {json_path}")
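Reviewer note: the `partition.py` change forwards `skip_infer_table_types` only when it is truthy, so an unset option leaves the callee's own default in place. A minimal, self-contained sketch of that pattern follows. The `PartitionConfig` here mirrors only the fields visible in the diff, while `build_partition_kwargs`, `fake_partition`, and its `("pdf",)` default are hypothetical stand-ins and not the library's API.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Sequence


@dataclass
class PartitionConfig:
    # Only the fields that appear in the diff; the real class has more.
    strategy: str = "auto"
    encoding: Optional[str] = None
    pdf_infer_table_structure: bool = False
    skip_infer_table_types: Optional[List[str]] = None


def build_partition_kwargs(config: PartitionConfig, languages: List[str]) -> Dict[str, Any]:
    """Collect keyword arguments, adding the optional key only when set."""
    kwargs: Dict[str, Any] = {
        "strategy": config.strategy,
        "languages": languages,
        "encoding": config.encoding,
        "pdf_infer_table_structure": config.pdf_infer_table_structure,
    }
    # None and [] are both falsy, so neither overrides the callee's default.
    if config.skip_infer_table_types:
        kwargs["skip_infer_table_types"] = config.skip_infer_table_types
    return kwargs


def fake_partition(skip_infer_table_types: Sequence[str] = ("pdf",), **_ignored: Any) -> List[str]:
    """Hypothetical callee with an assumed default; returns what it received."""
    return list(skip_infer_table_types)
```

One consequence of the truthiness check is that an empty list behaves the same as an omitted option: the key is left out, and the callee falls back to its default instead of receiving "skip nothing".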