feat: add --include-orig-elements option to Ingest CLI
scanny committed Mar 25, 2024
1 parent 40db2e3 commit 942ce48
Showing 8 changed files with 226 additions and 3 deletions.
3 changes: 2 additions & 1 deletion CHANGELOG.md
@@ -1,10 +1,11 @@
## 0.13.0-dev11
## 0.13.0-dev12

### Enhancements

* **Add `.metadata.is_continuation` to text-split chunks.** `.metadata.is_continuation=True` is added to second-and-later chunks formed by text-splitting an oversized `Table` element, but not to their counterpart `Text` element splits. Add this indicator to `CompositeElement` chunks as well, so that downstream processes can identify text-split continuation chunks and skip the intentionally redundant metadata values they carry.
* **Add `compound_structure_acc` metric to table eval.** Add a new property to `unstructured.metrics.table_eval.TableEvaluation`: `composite_structure_acc`, which is computed from the element-level row and column index and content accuracy scores.
* **Add `.metadata.orig_elements` to chunks.** `.metadata.orig_elements: list[Element]` is added to chunks during the chunking process (when requested) to allow access to information from the elements each chunk was formed from. This is useful, for example, to recover metadata fields that cannot be consolidated to a single value for a chunk, such as `page_number`, `coordinates`, and `image_base64`.
* **Add `--include-orig-elements` option to Ingest CLI.** By default, when chunking, the original elements used to form each chunk are added to `chunk.metadata.orig_elements`. The `include_orig_elements` parameter lets the user turn this behavior off to produce a smaller payload when that metadata is not needed. A brief usage sketch follows this list.

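A minimal usage sketch of the two entries above (not part of this commit); it relies on the library-level `chunk_by_title()` API touched elsewhere in this diff, and the input path is illustrative:

```python
# Sketch: chunk a document and read back the elements each chunk was formed from.
# `include_orig_elements` defaults to True; the example file path is illustrative.
from unstructured.chunking.title import chunk_by_title
from unstructured.partition.auto import partition

elements = partition(filename="example-docs/multi-column-2p.pdf")
chunks = chunk_by_title(elements, include_orig_elements=True, max_characters=2000)

for chunk in chunks:
    # Per-element metadata such as `page_number`, which cannot be consolidated to a
    # single chunk-level value, remains recoverable from the original elements.
    orig_elements = chunk.metadata.orig_elements or []
    pages = sorted({e.metadata.page_number for e in orig_elements if e.metadata.page_number})
    print(f"{chunk.text[:40]!r} ... pages={pages}")
```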
### Features

3 changes: 3 additions & 0 deletions docs/source/ingest/configs/chunking_config.rst
@@ -28,6 +28,9 @@ Configs
series of ``Title`` elements) until a section reaches a length of n characters. Only operative for
the ``"by_title"`` chunking strategy. Defaults to `max_characters` which combines chunks whenever
space allows. Specifying 0 for this argument suppresses combining of small chunks.
* ``include_orig_elements (Default: True)``: Adds the document elements consolidated to form each
chunk to the ``chunk.metadata.orig_elements: list[Element]`` metadata field. Setting this to
``False`` produces somewhat smaller payloads when you don't need that metadata (see the sketch
after this excerpt).
* ``max_characters (Default: 500)``: Combine elements into chunks no larger than n characters (hard
max). No chunk with text longer than this value will appear in the output stream.
* ``multipage_sections (Default: True)``: When False, in addition to section boundaries, page
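A hedged sketch of the ``include_orig_elements: False`` case described above, written against the library-level chunker that the ingest pipeline ultimately delegates to; the input path is illustrative:

```python
# Sketch: turn off orig_elements for a smaller serialized payload. Assumes the
# library-level chunker that the ingest ChunkingConfig delegates to.
from unstructured.chunking.title import chunk_by_title
from unstructured.partition.auto import partition

elements = partition(filename="example-docs/multi-column-2p.pdf")
chunks = chunk_by_title(elements, include_orig_elements=False)

# With the option off, chunks carry no `.metadata.orig_elements`, which keeps
# serialized output (like the expected-output fixture below) smaller.
assert all(not getattr(chunk.metadata, "orig_elements", None) for chunk in chunks)
```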
@@ -0,0 +1,142 @@
[
{
"type": "CompositeElement",
"element_id": "eb8897ac2f1ceb5e7bc1fb849e834768",
"text": "0 2 0 2\n\np e S 0 3\n\n] L C . s c [\n\n3 v 6 0 9 4 0 . 4 0 0 2 : v i X r a\n\nDense Passage Retrieval for Open-Domain Question Answering\n\nVladimir Karpukhin∗, Barlas O˘guz∗, Sewon Min†, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen‡, Wen-tau Yih\n\nFacebook AI\n\n†University of Washington\n\n‡Princeton University\n\n{vladk, barlaso, plewis, ledell, edunov, scottyih}@fb.com [email protected] [email protected]\n\nAbstract\n\nOpen-domain question answering relies on ef- ficient passage retrieval to select candidate contexts, where traditional sparse vector space models, such as TF-IDF or BM25, are the de facto method. In this work, we show that retrieval can be practically implemented us- ing dense representations alone, where em- beddings are learned from a small number of questions and passages by a simple dual- encoder framework. When evaluated on a wide range of open-domain QA datasets, our dense retriever outperforms a strong Lucene- BM25 system greatly by 9%-19% absolute in terms of top-20 passage retrieval accuracy, and helps our end-to-end QA system establish new state-of-the-art on multiple open-domain QA benchmarks.1\n\n1",
"metadata": {
"data_source": {
"url": "/Users/scanny/Library/CloudStorage/Dropbox/src/unstructured/test_unstructured_ingest/../example-docs/multi-column-2p.pdf",
"permissions_data": [
{
"mode": 33188
}
]
},
"filetype": "application/pdf",
"languages": [
"eng"
],
"page_number": 1
}
},
{
"type": "CompositeElement",
"element_id": "1d0d9836600df1239dd9c22a3bb17a6e",
"text": "Introduction\n\nOpen-domain question answering (QA) (Voorhees, 1999) is a task that answers factoid questions us- ing a large collection of documents. While early QA systems are often complicated and consist of multiple components (Ferrucci (2012); Moldovan et al. (2003), inter alia), the advances of reading comprehension models suggest a much simplified two-stage framework: (1) a context retriever first selects a small subset of passages where some of them contain the answer to the question, and then (2) a machine reader can thoroughly exam- ine the retrieved contexts and identify the correct answer (Chen et al., 2017). Although reducing open-domain QA to machine reading is a very rea- sonable strategy, a huge performance degradation is often observed in practice2, indicating the needs of improving retrieval.\n\n∗Equal contribution 1The code and trained models have been released at\n\nhttps://github.com/facebookresearch/DPR.",
"metadata": {
"data_source": {
"url": "/Users/scanny/Library/CloudStorage/Dropbox/src/unstructured/test_unstructured_ingest/../example-docs/multi-column-2p.pdf",
"permissions_data": [
{
"mode": 33188
}
]
},
"filetype": "application/pdf",
"languages": [
"eng"
],
"page_number": 1
}
},
{
"type": "CompositeElement",
"element_id": "aab91cf73e32570b68f089f49032ad9e",
"text": "2For instance, the exact match score on SQuAD v1.1 drops\n\nRetrieval in open-domain QA is usually imple- mented using TF-IDF or BM25 (Robertson and Zaragoza, 2009), which matches keywords effi- ciently with an inverted index and can be seen as representing the question and context in high- dimensional, sparse vectors (with weighting). Con- versely, the dense, latent semantic encoding is com- plementary to sparse representations by design. For example, synonyms or paraphrases that consist of completely different tokens may still be mapped to vectors close to each other. Consider the question “Who is the bad guy in lord of the rings?”, which can be answered from the context “Sala Baker is best known for portraying the villain Sauron in the Lord of the Rings trilogy.” A term-based system would have difficulty retrieving such a context, while a dense retrieval system would be able to better match “bad guy” with “villain” and fetch the cor- rect context. Dense encodings are also learnable by adjusting the embedding functions, which pro- vides additional flexibility to have a task-specific representation. With special in-memory data struc- tures and indexing schemes, retrieval can be done efficiently using maximum inner product search (MIPS) algorithms (e.g., Shrivastava and Li (2014); Guo et al. (2016)).\n\nHowever, it is generally believed that learn- ing a good dense vector representation needs a large number of labeled pairs of question and con- texts. Dense retrieval methods have thus never be shown to outperform TF-IDF/BM25 for open- domain QA before ORQA (Lee et al., 2019), which proposes a sophisticated inverse cloze task (ICT) objective, predicting the blocks that contain the masked sentence, for additional pretraining. The question encoder and the reader model are then fine- tuned using pairs of questions and answers jointly. Although ORQA successfully demonstrates that dense retrieval can outperform BM25, setting new state-of-the-art results on multiple open-domain",
"metadata": {
"data_source": {
"url": "/Users/scanny/Library/CloudStorage/Dropbox/src/unstructured/test_unstructured_ingest/../example-docs/multi-column-2p.pdf",
"permissions_data": [
{
"mode": 33188
}
]
},
"filetype": "application/pdf",
"languages": [
"eng"
],
"page_number": 1
}
},
{
"type": "CompositeElement",
"element_id": "e40f54b7c59bc98f8c1cd13fceae6443",
"text": "from above 80% to less than 40% (Yang et al., 2019a).\n\nQA datasets, it also suffers from two weaknesses. First, ICT pretraining is computationally intensive and it is not completely clear that regular sentences are good surrogates of questions in the objective function. Second, because the context encoder is not fine-tuned using pairs of questions and answers, the corresponding representations could be subop- timal.",
"metadata": {
"data_source": {
"url": "/Users/scanny/Library/CloudStorage/Dropbox/src/unstructured/test_unstructured_ingest/../example-docs/multi-column-2p.pdf",
"permissions_data": [
{
"mode": 33188
}
]
},
"filetype": "application/pdf",
"languages": [
"eng"
],
"page_number": 1
}
},
{
"type": "CompositeElement",
"element_id": "3bed499e55636392d074ef00589538d0",
"text": "In this paper, we address the question: can we train a better dense embedding model using only pairs of questions and passages (or answers), with- out additional pretraining? By leveraging the now standard BERT pretrained model (Devlin et al., 2019) and a dual-encoder architecture (Bromley et al., 1994), we focus on developing the right training scheme using a relatively small number of question and passage pairs. Through a series of careful ablation studies, our final solution is surprisingly simple: the embedding is optimized for maximizing inner products of the question and relevant passage vectors, with an objective compar- ing all pairs of questions and passages in a batch. Our Dense Passage Retriever (DPR) is exception- ally strong. It not only outperforms BM25 by a large margin (65.2% vs. 42.9% in Top-5 accuracy), but also results in a substantial improvement on the end-to-end QA accuracy compared to ORQA (41.5% vs. 33.3%) in the open Natural Questions setting (Lee et al., 2019; Kwiatkowski et al., 2019). Our contributions are twofold. First, we demon- strate that with the proper training setup, sim- ply fine-tuning the question and passage encoders on existing question-passage pairs is sufficient to greatly outperform BM25. Our empirical results also suggest that additional pretraining may not be needed. Second, we verify that, in the context of open-domain question answering, a higher retrieval precision indeed translates to a higher end-to-end QA accuracy. By applying a modern reader model to the top retrieved passages, we achieve compara- ble or better results on multiple QA datasets in the open-retrieval setting, compared to several, much complicated systems.",
"metadata": {
"data_source": {
"url": "/Users/scanny/Library/CloudStorage/Dropbox/src/unstructured/test_unstructured_ingest/../example-docs/multi-column-2p.pdf",
"permissions_data": [
{
"mode": 33188
}
]
},
"filetype": "application/pdf",
"languages": [
"eng"
],
"page_number": 2
}
},
{
"type": "CompositeElement",
"element_id": "9cb0f8e709154db205e6a4a64118078a",
"text": "2 Background\n\nThe problem of open-domain QA studied in this paper can be described as follows. Given a factoid question, such as “Who first voiced Meg on Family Guy?” or “Where was the 8th Dalai Lama born?”, a system is required to answer it using a large corpus of diversified topics. More specifically, we assume\n\nthe extractive QA setting, in which the answer is restricted to a span appearing in one or more pas- sages in the corpus. Assume that our collection contains D documents, d1, d2, · · · , dD. We first split each of the documents into text passages of equal lengths as the basic retrieval units3 and get M total passages in our corpus C = {p1, p2, . . . , pM }, where each passage pi can be viewed as a sequence 2 , · · · , w(i) 1 , w(i) of tokens w(i) |pi|. Given a question q, the task is to find a span w(i) s+1, · · · , w(i) s , w(i) from one of the passages pi that can answer the question. Notice that to cover a wide variety of domains, the corpus size can easily range from millions of docu- ments (e.g., Wikipedia) to billions (e.g., the Web). As a result, any open-domain QA system needs to include an efficient retriever component that can se- lect a small set of relevant texts, before applying the reader to extract the answer (Chen et al., 2017).4 Formally speaking, a retriever R : (q, C) → CF is a function that takes as input a question q and a corpus C and returns a much smaller filter set of texts CF ⊂ C, where |CF | = k (cid:28) |C|. For a fixed k, a retriever can be evaluated in isolation on top-k retrieval accuracy, which is the fraction of ques- tions for which CF contains a span that answers the question.\n\ne",
"metadata": {
"data_source": {
"url": "/Users/scanny/Library/CloudStorage/Dropbox/src/unstructured/test_unstructured_ingest/../example-docs/multi-column-2p.pdf",
"permissions_data": [
{
"mode": 33188
}
]
},
"filetype": "application/pdf",
"languages": [
"eng"
],
"page_number": 2
}
},
{
"type": "CompositeElement",
"element_id": "2a95cabf5db13882f694842710829026",
"text": "3 Dense Passage Retriever (DPR)\n\nWe focus our research in this work on improv- ing the retrieval component in open-domain QA. Given a collection of M text passages, the goal of our dense passage retriever (DPR) is to index all the passages in a low-dimensional and continuous space, such that it can retrieve efficiently the top k passages relevant to the input question for the reader at run-time. Note that M can be very large (e.g., 21 million passages in our experiments, de- scribed in Section 4.1) and k is usually small, such as 20–100.\n\n3.1 Overview\n\nOur dense passage retriever (DPR) uses a dense encoder EP (·) which maps any text passage to a d- dimensional real-valued vectors and builds an index for all the M passages that we will use for retrieval.\n\n3The ideal size and boundary of a text passage are func- tions of both the retriever and reader. We also experimented with natural paragraphs in our preliminary trials and found that using fixed-length passages performs better in both retrieval and final QA accuracy, as observed by Wang et al. (2019).\n\n4Exceptions include (Seo et al., 2019) and (Roberts et al., 2020), which retrieves and generates the answers, respectively.",
"metadata": {
"data_source": {
"url": "/Users/scanny/Library/CloudStorage/Dropbox/src/unstructured/test_unstructured_ingest/../example-docs/multi-column-2p.pdf",
"permissions_data": [
{
"mode": 33188
}
]
},
"filetype": "application/pdf",
"languages": [
"eng"
],
"page_number": 2
}
}
]
@@ -0,0 +1,64 @@
#!/usr/bin/env bash

# ------------------------------------------------------------------------------------------------
# This test exercises the `--chunk-no-include-orig-elements` option which turns off inclusion of
# `.metadata.orig_elements` in chunks. It also exercises the `--chunk-no-multipage-sections`
# option, which otherwise has no coverage.
# ------------------------------------------------------------------------------------------------

set -e

# -- Test Parameters: These vary by test file, others are common computed values --
TEST_ROOT_NAME=local-single-file-chunk-no-orig-elements
EXAMPLE_DOC=multi-column-2p.pdf

# -- computed parameters, common across similar tests --
SRC_PATH=$(dirname "$(realpath "$0")")
SCRIPT_DIR=$(dirname "$SRC_PATH")
cd "$SCRIPT_DIR"/.. || exit 1
OUTPUT_FOLDER_NAME=$TEST_ROOT_NAME
OUTPUT_ROOT=${OUTPUT_ROOT:-$SCRIPT_DIR}
OUTPUT_DIR=$OUTPUT_ROOT/structured-output/$OUTPUT_FOLDER_NAME
WORK_DIR=$OUTPUT_ROOT/workdir/$OUTPUT_FOLDER_NAME
# -- use absolute path of input file to verify passing an absolute path --
ABS_INPUT_PATH="$SCRIPT_DIR/../example-docs/$EXAMPLE_DOC"
max_processes=${MAX_PROCESSES:=$(python3 -c "import os; print(os.cpu_count())")}

# shellcheck disable=SC1091
source "$SCRIPT_DIR"/cleanup.sh
# shellcheck disable=SC2317
function cleanup() {
cleanup_dir "$OUTPUT_DIR"
cleanup_dir "$WORK_DIR"
}
trap cleanup EXIT

RUN_SCRIPT=${RUN_SCRIPT:-./unstructured/ingest/main.py}

PYTHONPATH=${PYTHONPATH:-.} "$RUN_SCRIPT" \
local \
--chunking-strategy by_title \
--chunk-no-include-orig-elements \
--chunk-max-characters 2000 \
--chunk-no-multipage-sections \
--input-path "$ABS_INPUT_PATH" \
--metadata-exclude coordinates,filename,file_directory,metadata.data_source.date_created,metadata.data_source.date_modified,metadata.data_source.date_processed,metadata.last_modified,metadata.detection_class_prob,metadata.parent_id,metadata.category_depth \
--num-processes "$max_processes" \
--output-dir "$OUTPUT_DIR" \
--reprocess \
--verbose \
--work-dir "$WORK_DIR"

set +e
"$SCRIPT_DIR"/check-diff-expected-output.sh $OUTPUT_FOLDER_NAME
EXIT_CODE=$?
set -e

if [ "$EXIT_CODE" -ne 0 ]; then
echo "The last script run exited with a non-zero exit code: $EXIT_CODE."
# Handle the error or exit
fi

"$SCRIPT_DIR"/evaluation-ingest-cp.sh "$OUTPUT_DIR" "$OUTPUT_FOLDER_NAME"

exit $EXIT_CODE
1 change: 1 addition & 0 deletions test_unstructured_ingest/test-ingest-src.sh
@@ -46,6 +46,7 @@ all_tests=(
# 'airtable-large.sh'
'local-single-file.sh'
'local-single-file-basic-chunking.sh'
'local-single-file-chunk-no-orig-elements.sh'
'local-single-file-with-encoding.sh'
'local-single-file-with-pdf-infer-table-structure.sh'
'notion.sh'
2 changes: 1 addition & 1 deletion unstructured/__version__.py
@@ -1 +1 @@
__version__ = "0.13.0-dev11" # pragma: no cover
__version__ = "0.13.0-dev12" # pragma: no cover
11 changes: 10 additions & 1 deletion unstructured/ingest/cli/interfaces.py
@@ -500,6 +500,15 @@ def get_cli_options() -> t.List[click.Option]:
" operative for 'by_title' chunking-strategy."
),
),
click.Option(
["--chunk-include-orig-elements/--chunk-no-include-orig-elements"],
is_flag=True,
default=True,
help=(
"When chunking, add the original elements consolidated to form each chunk to"
" `.metadata.orig_elements` on that chunk."
),
),
click.Option(
["--chunk-max-characters"],
type=int,
@@ -511,7 +520,7 @@
),
),
click.Option(
["--chunk-multipage-sections"],
["--chunk-multipage-sections/--chunk-no-multipage-sections"],
is_flag=True,
default=CHUNK_MULTI_PAGE_DEFAULT,
help=(
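For reference, a standalone sketch (not part of the ingest CLI) of how a paired `--flag/--no-flag` click option like the one added above resolves to a single boolean parameter:

```python
# Toy command illustrating the click on/off flag pattern used above; it is not
# the unstructured ingest CLI.
import click


@click.command()
@click.option(
    "--chunk-include-orig-elements/--chunk-no-include-orig-elements",
    default=True,
    help="When chunking, add the original elements to `.metadata.orig_elements`.",
)
def demo(chunk_include_orig_elements: bool) -> None:
    # Passing --chunk-no-include-orig-elements flips the value to False.
    click.echo(f"include_orig_elements={chunk_include_orig_elements}")


if __name__ == "__main__":
    demo()
```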
3 changes: 3 additions & 0 deletions unstructured/ingest/interfaces.py
@@ -231,6 +231,7 @@ class ChunkingConfig(BaseConfig):
chunk_elements: bool = False
chunking_strategy: t.Optional[str] = None
combine_text_under_n_chars: t.Optional[int] = None
include_orig_elements: t.Optional[bool] = None
max_characters: t.Optional[int] = None
multipage_sections: t.Optional[bool] = None
new_after_n_chars: t.Optional[int] = None
@@ -248,6 +249,7 @@ def chunk(self, elements: t.List[Element]) -> t.List[Element]:
return chunk_by_title(
elements=elements,
combine_text_under_n_chars=self.combine_text_under_n_chars,
include_orig_elements=self.include_orig_elements,
max_characters=self.max_characters,
multipage_sections=self.multipage_sections,
new_after_n_chars=self.new_after_n_chars,
@@ -258,6 +260,7 @@ def chunk(self, elements: t.List[Element]) -> t.List[Element]:
if chunking_strategy == "basic":
return chunk_elements(
elements=elements,
include_orig_elements=self.include_orig_elements,
max_characters=self.max_characters,
new_after_n_chars=self.new_after_n_chars,
overlap=self.overlap,
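A hedged sketch of driving the same behavior through `ChunkingConfig` directly; the field names come from the diff above, but defaults and strategy-selection logic not shown in these hunks are assumed:

```python
# Illustrative only: reproduce the effect of `--chunk-no-include-orig-elements`
# via the ingest ChunkingConfig. Field names match the diff above; any gating
# logic not shown in these hunks is assumed.
from unstructured.ingest.interfaces import ChunkingConfig
from unstructured.partition.auto import partition

elements = partition(filename="example-docs/multi-column-2p.pdf")

config = ChunkingConfig(
    chunking_strategy="by_title",
    include_orig_elements=False,  # mirrors --chunk-no-include-orig-elements
    max_characters=2000,
    multipage_sections=False,
)
chunks = config.chunk(elements)  # chunks should carry no orig_elements metadata
```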
