feat(chunking): add metadata.orig_elements serde (#2680)
**Summary**
This final PR in the "orig_elements" series adds what is needed so that
`.metadata.orig_elements`, when present on a chunk (element), is
serialized to JSON when the chunk is serialized, for instance for use
in an HTTP response payload.

It also provides for deserializing such a JSON payload into chunks that
contain the `.orig_elements` metadata.

**Additional Context**
Note that `.metadata.orig_elements` is always `Optional[list[Element]]`
when in memory. However, those original elements are serialized as
Base64-encoded gzipped JSON and are in that form (str) when present as
JSON or as "element-dicts" which is an intermediate
serialization/deserialization format. That is, serialization is `Element
-> dict -> JSON` and deserialization is `JSON -> dict -> Element` and
`.orig_elements` are Base64-encoded in both the `dict` and `JSON` forms.
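
As an illustration of the `Element -> dict -> JSON` path described above, here is a minimal sketch of a Base64-encoded, compressed-JSON round trip using only the Python standard library. The helper names, the simplified element-dict shape, and the choice of `zlib` for compression are assumptions made for this example. It is not the code added by this PR.

```python
import base64
import json
import zlib


def dicts_to_base64_compressed_json(element_dicts: list[dict]) -> str:
    """Compress a list of element-dicts into a Base64 text string (hypothetical helper)."""
    json_bytes = json.dumps(element_dicts).encode("utf-8")
    return base64.b64encode(zlib.compress(json_bytes)).decode("utf-8")


def dicts_from_base64_compressed_json(b64_str: str) -> list[dict]:
    """Inverse of `dicts_to_base64_compressed_json()` (hypothetical helper)."""
    json_bytes = zlib.decompress(base64.b64decode(b64_str))
    return json.loads(json_bytes.decode("utf-8"))


# Simplified stand-ins for the element-dicts of the original elements.
orig_element_dicts = [
    {"type": "Title", "text": "Lorem"},
    {"type": "Text", "text": "Lorem Ipsum"},
]

# In the dict form of the chunk, `orig_elements` is already a plain str ...
chunk_dict = {
    "type": "CompositeElement",
    "text": "Lorem\n\nLorem Ipsum",
    "metadata": {
        "orig_elements": dicts_to_base64_compressed_json(orig_element_dicts)
    },
}

# ... so the chunk passes through JSON unchanged and can be decoded afterwards.
payload = json.dumps(chunk_dict)
received = json.loads(payload)
restored = dicts_from_base64_compressed_json(received["metadata"]["orig_elements"])
```

Because the nested elements travel as an opaque string, the outer `dict -> JSON` step needs no special handling for them. Only the final `dict -> Element` step has to decode the field.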

---------

Co-authored-by: scanny <[email protected]>
scanny and scanny authored Mar 22, 2024
1 parent fd8b682 commit 56fbaae
Showing 9 changed files with 265 additions and 69 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -4,6 +4,7 @@

* **Add `.metadata.is_continuation` to text-split chunks.** `.metadata.is_continuation=True` is added to second-and-later chunks formed by text-splitting an oversized `Table` element but not to their counterpart `Text` element splits. Add this indicator for `CompositeElement` to allow text-split continuation chunks to be identified for downstream processes that may wish to skip intentionally redundant metadata values in continuation chunks.
* **Add `compound_structure_acc` metric to table eval.** Add a new property to `unstructured.metrics.table_eval.TableEvaluation`: `composite_structure_acc`, which is computed from the element level row and column index and content accuracy scores
* **Add `.metadata.orig_elements` to chunks.** `.metadata.orig_elements: list[Element]` is added to chunks during the chunking process (when requested) to allow access to information from the elements each chunk was formed from. This is useful for example to recover metadata fields that cannot be consolidated to a single value for a chunk, like `page_number`, `coordinates`, and `image_base64`.

### Features

78 changes: 57 additions & 21 deletions docs/source/apis/api_parameters.rst
@@ -39,6 +45,12 @@ encoding
- **Description**: The encoding method used to decode the text input. Default: utf-8.
- **Example**: utf-8

extract_image_block_types
-------------------------
- **Type**: array
- **Description**: The types of image blocks to extract from the document. Supports various Element types.
- **Example**: ['Image', 'Table']

hi_res_model_name
-----------------
- **Type**: string
@@ -48,7 +54,8 @@ hi_res_model_name
include_page_breaks
-------------------
- **Type**: boolean
- **Description**: If True, the output will include page breaks if the filetype supports it. Default: false.
- **Description**: When true, the output will include page break elements when the filetype supports
it. Default: false.

languages
---------
@@ -72,37 +79,66 @@ xml_keep_tags
- **Type**: boolean
- **Description**: If True, will retain the XML tags in the output. Otherwise it will simply extract the text from within the tags. Only applies to partition_xml.


Chunking Parameters
-------------------

The following parameters control chunking behavior. Chunking is automatically performed after
partitioning when a value is provided for the ``chunking_strategy`` argument. The remaining chunking
parameters are only operative when a chunking strategy is specified. Note that not all chunking
parameters apply to all chunking strategies. Any chunking arguments not supported by the selected
chunker are ignored.

chunking_strategy
-----------------
- **Type**: string
- **Description**: Use one of the supported strategies to chunk the returned elements. Currently supports: by_title.
- **Example**: by_title

multipage_sections
------------------
- **Type**: boolean
- **Description**: If chunking strategy is set, determines if sections can span multiple sections. Default: true.
- **Description**: Use one of the supported strategies to chunk the returned elements. When omitted,
no chunking is performed and any other chunking parameters provided are ignored.
- **Valid values**: ``"basic"``, ``"by_title"``

combine_under_n_chars
---------------------
- **Type**: integer
- **Description**: If chunking strategy is set, combine elements until a section reaches a length of n chars. Default: 500.
- **Applicable Chunkers**: "by_title" only
- **Description**: When chunking strategy is set to "by_title", combine small chunks until the
combined chunk reaches a length of n chars. This can mitigate the appearance of small chunks
created by short paragraphs, not intended as section headings, being identified as ``Title``
elements in certain documents.
- **Default**: the same value as ``max_characters``
- **Example**: 500

new_after_n_chars
-----------------
- **Type**: integer
- **Description**: If chunking strategy is set, cut off new sections after reaching a length of n chars (soft max). Default: 1500.
- **Example**: 1500
include_orig_elements
---------------------
- **Type**: boolean
- **Applicable Chunkers**: All
- **Description**: Add the elements used to form each chunk to ``.metadata.orig_elements`` for that
chunk. These can be used to recover the original text and metadata for individual elements when
that is required, for example to identify the page-numbers or coordinates spanned by a chunk.
When an element larger than ``max_characters`` is divided into two or more chunks via
text-splitting, each of those chunks will contain the entire original element as the only item in
its ``.metadata.orig_elements`` list.
- **Default**: true

max_characters
--------------
- **Type**: integer
- **Description**: If chunking strategy is set, cut off new sections after reaching a length of n chars (hard max). Default: 1500.
- **Example**: 1500
- **Applicable Chunkers**: All
- **Description**: When chunking strategy is set, cut off new chunks after reaching a length of n
chars (hard max).
- **Default**: 500

extract_image_block_types
-------------------------
- **Type**: array
- **Description**: The types of image blocks to extract from the document. Supports various Element types.
- **Example**: ['Image', 'Table']
multipage_sections
------------------
- **Type**: boolean
- **Applicable Chunkers**: "by_title" only
- **Description**: When true and chunking strategy is set to "by_title", allows a chunk to include
elements from more than one page. Otherwise chunks are broken on page boundaries.
- **Default**: true

new_after_n_chars
-----------------
- **Type**: integer
- **Applicable Chunkers**: "basic", "by_title"
- **Description**: When chunking strategy is set, cut off new chunk after reaching a length of n
chars (soft max).
- **Default**: 1500
46 changes: 44 additions & 2 deletions docs/source/core/chunking.rst
@@ -65,8 +65,8 @@ be specified when a non-default setting is required. Specific chunking strategie
need to decide based on your use-case whether this option is right for you.


Chunking elements
-----------------
Chunking
--------

Chunking can be performed as part of partitioning or as a separate step after
partitioning:
@@ -170,3 +170,45 @@ following behaviors:
``combine_text_under_n_chars`` argument. This defaults to the same value as ``max_characters``
such that sequential small sections are combined to maximally fill the chunking window. Setting
this to ``0`` will disable section combining.


Recovering Chunk Elements
-------------------------

In general, a chunk consolidates multiple document elements to maximally fill a chunk of the desired
size. Information is naturally lost in this consolidation, for example which element a portion of
the text came from and certain metadata like page-number and coordinates which cannot always be
resolved to a single value.

The original elements combined to make a chunk can be accessed using the ``.metadata.orig_elements``
field on the chunk:

.. code:: python

    >>> from unstructured.chunking.basic import chunk_elements
    >>> from unstructured.documents.elements import NarrativeText, Title
    >>> elements = [
    ...     Title("Lorem Ipsum"),
    ...     NarrativeText("Lorem ipsum dolor sit."),
    ... ]
    >>> chunk = chunk_elements(elements)[0]
    >>> chunk.text
    'Lorem Ipsum\n\nLorem ipsum dolor sit.'
    >>> print(chunk.metadata.orig_elements)
    [Title("Lorem Ipsum"), NarrativeText("Lorem ipsum dolor sit.")]

These elements retain all their original metadata, so they can be used to access metadata that
cannot reliably be consolidated, for example:

.. code:: python

    >>> {e.metadata.page_number for e in chunk.metadata.orig_elements}
    {2, 3}

    >>> [e.metadata.coordinates for e in chunk.metadata.orig_elements]
    [<CoordinatesMetadata ...>, <CoordinatesMetadata ...>, ...]

    >>> [
    ...     e.metadata.image_path
    ...     for e in chunk.metadata.orig_elements
    ...     if e.metadata.image_path is not None
    ... ]
    ['/tmp/lorem.jpg', '/tmp/ipsum.png']
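
The same pattern can be wrapped in a small helper, for instance to report the range of pages a chunk spans. The sketch below is self-contained and uses hypothetical stand-in classes (``Metadata``, ``Element``) and a hypothetical helper (``page_span``) in place of the library's own types, so it only illustrates the access pattern and is not part of ``unstructured``:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Metadata:
    """Hypothetical stand-in for the library's element metadata."""

    page_number: Optional[int] = None
    orig_elements: Optional[list["Element"]] = None


@dataclass
class Element:
    """Hypothetical stand-in for a document element or chunk."""

    text: str
    metadata: Metadata = field(default_factory=Metadata)


def page_span(chunk: Element) -> Optional[tuple[int, int]]:
    """Return (first_page, last_page) covered by `chunk`, or None when unknown."""
    pages = [
        e.metadata.page_number
        for e in (chunk.metadata.orig_elements or [])
        if e.metadata.page_number is not None
    ]
    return (min(pages), max(pages)) if pages else None


chunk = Element(
    text="Lorem Ipsum\n\nLorem ipsum dolor sit.",
    metadata=Metadata(
        orig_elements=[
            Element("Lorem Ipsum", Metadata(page_number=2)),
            Element("Lorem ipsum dolor sit.", Metadata(page_number=3)),
        ]
    ),
)
print(page_span(chunk))  # (2, 3)
```

Treating a missing ``orig_elements`` list as empty keeps the helper usable on chunks produced without original elements attached.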
62 changes: 31 additions & 31 deletions docs/source/metadata.rst
@@ -7,9 +7,9 @@ Metadata
========

The ``unstructured`` package tracks a variety of metadata about Elements extracted from documents.
Tracking metadata enables users to filter document elements downstream based on element metadata of interest.
For example, a user may be interested in selected document elements from a given page number
or an e-mail with a given subject line.
Element metadata has a variety of uses including:

* filtering document elements based on an element metadata value, for example, elements from a given page number or an e-mail with a subject matching a regular expression.
* mapping an element to the document page where it occurred so that the original page can be retrieved when that element matches search criteria.

Metadata is tracked at the element level. You can extract the metadata for a given document element
with ``element.metadata``. For a dictionary representation, use ``element.metadata.to_dict()``.
@@ -136,34 +136,34 @@ returned. If the ``in_place`` flag is ``False``, only the altered coordinates ar
Additional Metadata Fields by Document Type
###########################################

+-------------------------+---------------------+--------------------------------------------------------+
| Field Name              | Applicable Doc Types| Short Description                                      |
+=========================+=====================+========================================================+
| page_number             | DOCX,PDF, PPT,XLSX  | Page Number                                            |
+-------------------------+---------------------+--------------------------------------------------------+
| page_name               | XLSX                | Sheet Name in Excel document                           |
+-------------------------+---------------------+--------------------------------------------------------+
| sent_from               | EML                 | Email Sender                                           |
+-------------------------+---------------------+--------------------------------------------------------+
| sent_to                 | EML                 | Email Recipient                                        |
+-------------------------+---------------------+--------------------------------------------------------+
| subject                 | EML                 | Email Subject                                          |
+-------------------------+---------------------+--------------------------------------------------------+
| attached_to_filename    | MSG                 | filename that attachment file is attached to           |
+-------------------------+---------------------+--------------------------------------------------------+
| header_footer_type      | Word Doc            | Pages a header or footer applies to: "primary",        |
|                         |                     | "even_only", and "first_page"                          |
+-------------------------+---------------------+--------------------------------------------------------+
| link_urls               | HTML                | The url associated with a link in a document.          |
+-------------------------+---------------------+--------------------------------------------------------+
| link_texts              | HTML                | The text associated with a link in a document.         |
+-------------------------+---------------------+--------------------------------------------------------+
| links                   | HTML                | List of {”text”: “<the text>, “url”: <the url>} items. |
|                         |                     | Note: this element will be removed in the near future  |
|                         |                     | in favor of the above link_urls and link_texts.        |
+-------------------------+---------------------+--------------------------------------------------------+
| section                 | EPUB                | Book section title corresponding to table of contents  |
+-------------------------+---------------------+--------------------------------------------------------+
+-------------------------+-----------------------+--------------------------------------------------------+
| Field Name              | Applicable Doc Types  | Short Description                                      |
+=========================+=======================+========================================================+
| page_number             | DOCX, PDF, PPT, XLSX  | Page Number                                            |
+-------------------------+-----------------------+--------------------------------------------------------+
| page_name               | XLSX                  | Sheet Name in Excel document                           |
+-------------------------+-----------------------+--------------------------------------------------------+
| sent_from               | EML                   | Email Sender                                           |
+-------------------------+-----------------------+--------------------------------------------------------+
| sent_to                 | EML                   | Email Recipient                                        |
+-------------------------+-----------------------+--------------------------------------------------------+
| subject                 | EML                   | Email Subject                                          |
+-------------------------+-----------------------+--------------------------------------------------------+
| attached_to_filename    | MSG                   | filename that attachment file is attached to           |
+-------------------------+-----------------------+--------------------------------------------------------+
| header_footer_type      | Word Doc              | Pages a header or footer applies to: "primary",        |
|                         |                       | "even_only", and "first_page"                          |
+-------------------------+-----------------------+--------------------------------------------------------+
| link_urls               | HTML                  | The url associated with a link in a document.          |
+-------------------------+-----------------------+--------------------------------------------------------+
| link_texts              | HTML                  | The text associated with a link in a document.         |
+-------------------------+-----------------------+--------------------------------------------------------+
| links                   | HTML                  | List of {”text”: “<the text>, “url”: <the url>} items. |
|                         |                       | Note: this element will be removed in the near future  |
|                         |                       | in favor of the above link_urls and link_texts.        |
+-------------------------+-----------------------+--------------------------------------------------------+
| section                 | EPUB                  | Book section title corresponding to table of contents  |
+-------------------------+-----------------------+--------------------------------------------------------+

:raw-html:`<br />`
Notes on additional metadata by document type:
17 changes: 17 additions & 0 deletions test_unstructured/documents/test_elements.py
@@ -27,6 +27,7 @@
    Points,
    RegexMetadata,
    Text,
    Title,
)


@@ -381,6 +382,22 @@ def and_it_serializes_a_data_source_sub_object_to_a_dict_when_it_is_present(self
            "page_number": 2,
        }

    def and_it_serializes_an_orig_elements_sub_object_to_base64_when_it_is_present(self):
        meta = ElementMetadata(
            category_depth=1,
            orig_elements=[Title("Lorem"), Text("Lorem Ipsum")],
            page_number=2,
        )
        assert meta.to_dict() == {
            "category_depth": 1,
            "orig_elements": (
                "eJyFzcsKwjAQheFXKVm7yDS3xjcQXNaViKTJjBR6o46glr67zVI3Lmf4Dv95EdhhjwNf2yT2hYDGUaWt"
                "JVm5WDoqNUL0UoJrqtLHJHaF6JFDChw2v6zbzfjkvD2OM/YZ8GvC/Khb7lBs5LcilUwRyCsblQYTiBQp"
                "ZRxYZcCA/1spDtP98dU6DTEw3sa5fWOqs10vH0cLQn0="
            ),
            "page_number": 2,
        }

    def but_unlike_in_ElementMetadata_unknown_fields_in_sub_objects_are_ignored(self):
        """Metadata sub-objects ignore fields they do not explicitly define.
22 changes: 22 additions & 0 deletions test_unstructured/staging/test_base.py
@@ -31,6 +31,28 @@
from unstructured.staging import base


def test_base64_gzipped_json_to_elements_can_deserialize_compressed_elements_from_a_JSON_string():
    base64_elements_str = (
        "eJyFzcsKwjAQheFXKVm7yDS3xjcQXNaViKTJjBR6o46glr67zVI3Lmf4Dv95EdhhjwNf2yT2hYDGUaWtJVm5WDoq"
        "NUL0UoJrqtLHJHaF6JFDChw2v6zbzfjkvD2OM/YZ8GvC/Khb7lBs5LcilUwRyCsblQYTiBQpZRxYZcCA/1spDtP9"
        "8dU6DTEw3sa5fWOqs10vH0cLQn0="
    )

    elements = base.elements_from_base64_gzipped_json(base64_elements_str)

    assert elements == [Title("Lorem"), Text("Lorem Ipsum")]


def test_elements_to_base64_gzipped_json_can_serialize_elements_to_a_base64_str():
    elements = [Title("Lorem"), Text("Lorem Ipsum")]

    assert base.elements_to_base64_gzipped_json(elements) == (
        "eJyFzcsKwjAQheFXKVm7yDS3xjcQXNaViKTJjBR6o46glr67zVI3Lmf4Dv95EdhhjwNf2yT2hYDGUaWtJVm5WDoq"
        "NUL0UoJrqtLHJHaF6JFDChw2v6zbzfjkvD2OM/YZ8GvC/Khb7lBs5LcilUwRyCsblQYTiBQpZRxYZcCA/1spDtP9"
        "8dU6DTEw3sa5fWOqs10vH0cLQn0="
    )


def test_elements_to_dicts():
    elements = [Title(text="Title 1"), NarrativeText(text="Narrative 1")]
    isd = base.elements_to_dicts(elements)