From 97d1ec5d04bd4749ce449d29ab267ef5e009e4d0 Mon Sep 17 00:00:00 2001
From: ron-unstructured <138828701+ron-unstructured@users.noreply.github.com>
Date: Wed, 4 Oct 2023 12:02:42 -0700
Subject: [PATCH 01/19] Update installation guide for specific file type and
data connector
---
.../source/installation/full_installation.rst | 44 +++++++++++++++----
1 file changed, 36 insertions(+), 8 deletions(-)
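Note: the document-type extras and the data-connector extras introduced in this patch can be requested together, since pip accepts a comma-separated list of extras in one requirement string. The sketch below only builds that string; ``build_requirement`` is a hypothetical helper for illustration and is not part of ``unstructured``.

```python
def build_requirement(extras, package="unstructured"):
    """Return a pip requirement string such as 'unstructured[docx,s3]'.

    Hypothetical helper: it only formats a string and installs nothing.
    """
    # Drop duplicates while keeping the caller's order.
    unique = list(dict.fromkeys(extras))
    if not unique:
        return package
    return f"{package}[{','.join(unique)}]"


# Two document-type extras plus one connector extra from the lists in this patch.
print(build_requirement(["docx", "pptx", "s3"]))
```

The resulting string is what would be quoted after ``pip install``.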
diff --git a/docs/source/installation/full_installation.rst b/docs/source/installation/full_installation.rst
index 431e750598..b2ef934f46 100644
--- a/docs/source/installation/full_installation.rst
+++ b/docs/source/installation/full_installation.rst
@@ -1,28 +1,45 @@
+.. role:: raw-html(raw)
+ :format: html
+
Full Installation
=================
-1. **Installing Extras for Specific Document Types**:
- If you're processing document types beyond the basics, you can install the necessary extras:
+**Basic Usage**
+
+For a complete set of extras catering to every document type, use:
+
+.. code-block:: bash
+
+ pip install "unstructured[all-docs]"
+
+**Installation for Specific Document Types**
+
+If you're processing document types beyond the basics, you can install the necessary extras:
.. code-block:: bash
pip install "unstructured[docx,pptx]"
- For a complete set of extras catering to every document type, use:
+*Available document types:*
.. code-block:: bash
- pip install "unstructured[all-docs]"
+ "csv", "doc", "docx", "epub", "image", "md", "msg", "odt", "org", "pdf", "ppt", "pptx", "rtf", "rst", "tsv", "xlsx"
-2. **Note on Older Versions**:
- For versions earlier than `unstructured<0.9.0`, the following installation pattern was recommended:
+:raw-html:`
`
+**Installation for Specific Data Connectors**
+
+To use any of the data connectors, you must install the specific dependency:
.. code-block:: bash
- pip install "unstructured[local-inference]"
+ pip install "unstructured[s3]"
- While "local-inference" remains supported in newer versions for backward compatibility, it might be deprecated in future releases. It's advisable to transition to the "all-docs" extra for comprehensive support.
+*Available data connectors:*
+ .. code-block:: bash
+
+ "airtable", "azure", "azure-cognitive-search", "biomed", "box", "confluence", "delta-table", "discord", "dropbox", "elasticsearch", "gcs", "github", "gitlab", "google-drive", "jira", "notion", "onedrive", "outlook", "reddit", "s3", "sharepoint", "salesforce", "slack", "wikipedia"
Installation with ``conda`` on Windows
--------------------------------------
@@ -155,3 +172,14 @@ library. This is not included as an ``unstructured`` dependency because it only
to some tokenizers. See the
`sentencepiece install instructions `_ for
information on how to install ``sentencepiece`` if your tokenizer requires it.
+
+Note on Older Versions
+----------------------
+ For ``unstructured`` versions earlier than 0.9.0, the following installation pattern was recommended:
+
+ .. code-block:: bash
+
+ pip install "unstructured[local-inference]"
+
+ While "local-inference" remains supported in newer versions for backward compatibility, it might be deprecated in future releases. It's advisable to transition to the "all-docs" extra for comprehensive support.
+
From c39be925273a033c78a15dc0632c541264bccc93 Mon Sep 17 00:00:00 2001
From: ron-unstructured <138828701+ron-unstructured@users.noreply.github.com>
Date: Wed, 4 Oct 2023 14:33:49 -0700
Subject: [PATCH 02/19] Update Metadata fields
Additions:
- common and additional metadata fields by doc type
- common and additional metadata fields by connector type
---
docs/source/metadata.rst | 182 ++++++++++++++++++++++++++++++++++++---
1 file changed, 168 insertions(+), 14 deletions(-)
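Note: the ``parent_id`` row added below describes the hierarchy ruleset in prose. One possible reading of it is sketched here, assuming that "follows" for the same-category rule means "comes immediately after", and that the most recent header or title is the candidate parent otherwise. ``Node`` and ``assign_parents`` are hypothetical names used only for illustration; this is not the library's implementation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical stand-in for a partitioned element."""
    id: str
    category: str
    category_depth: int = 0
    parent_id: Optional[str] = None


def assign_parents(nodes: List[Node]) -> List[Node]:
    """Set parent_id on each node following a simplified version of the rules."""
    last_header = None
    last_title = None
    prev = None
    for node in nodes:
        if (prev is not None and node.category == prev.category
                and node.category_depth > prev.category_depth):
            # Same category as the previous element, but nested deeper.
            node.parent_id = prev.id
        elif node.category == "Title":
            # A title that follows a header becomes the header's child.
            if last_header is not None:
                node.parent_id = last_header.id
        elif node.category != "Header" and last_title is not None:
            # Any other element that follows a title becomes the title's child.
            node.parent_id = last_title.id
        if node.category == "Header":
            last_header = node
        elif node.category == "Title":
            last_title = node
        prev = node
    return nodes


doc = [
    Node("h", "Header"),
    Node("t", "Title"),
    Node("n", "NarrativeText"),
    Node("l1", "ListItem", category_depth=0),
    Node("l2", "ListItem", category_depth=1),
]
for node in assign_parents(doc):
    print(node.id, node.parent_id)
```

In this sketch the nested list item ``l2`` attaches to ``l1`` rather than to the title, which is the case where ``category_depth`` matters.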
diff --git a/docs/source/metadata.rst b/docs/source/metadata.rst
index 3f406d4028..3e7ba1cc17 100644
--- a/docs/source/metadata.rst
+++ b/docs/source/metadata.rst
@@ -1,3 +1,7 @@
+.. role:: raw-html(raw)
+ :format: html
+
+
Metadata
========
@@ -8,29 +12,67 @@ or an e-mail with a given subject line.
Metadata is tracked at the element level. You can extract the metadata for a given document element
with ``element.metadata``. For a dictionary representation, use ``element.metadata.to_dict()``.
-All document types return the following metadata fields when the information is available from
-the source file:
-* ``filename``
-* ``file_directory``
-* ``date``
-* ``filetype``
-* ``page_number``
+######################
+Common Metadata Fields
+######################
+
+All document types return the following metadata fields when the information is available from
+the source file:
-####################
-Element coordinates
-####################
++-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
+| Metadata Field Name | Short Description | Details |
++=============================+==========================================================+=============================================================================================================================================================================================================================================================================================+
+| ``filename`` | Filename | |
++-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
+| ``file_directory`` | File Directory | |
++-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
+| ``last_modified`` | Last Modified Date | |
++-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
+| ``filetype`` | File Type | |
++-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
+| ``type`` | Element Type | Categorizes elements into types such as Title, NarrativeText. Not a metadata field |
++-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
+| ``coordinates`` | XY Bounding Box Coordinates | |
++-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
+| ``parent_id`` | Element Hierarchy (Parent ID) | Hierarchies are determined by a combination of a ruleset and element category depth. The current ruleset sets a parent ID if a title element follows a header element or any other element follows a title element. |
+| | | The ID is also set if the element follows an element of the same category and the category_depth is greater than the category depth of the element it follows. Hierarchies enable more robust chunking configurations. |
++-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
+| ``category_depth`` | Element Depth relative to | Category depth is the depth of an element relative to other elements of the same category. It's set by a document partitioner and enables the hierarchy post-processor to compute more accurate hierarchies. |
+| | other elements of the same category | Category depth is set using native document hierarchies (e.g., h1, h2, h3 or the indentation level of a bulleted list in a word document). |
++-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
+| ``text_as_html`` | HTML representation of extracted tables | Elements with type Table |
++-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
+| ``languages`` | Document Languages | At document level or element level |
++-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
+| ``emphasized_text_contents``| Emphasized text (bold or italic) in the original document| |
++-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
+| ``emphasized_text_tags`` | Tags on text that is emphasized in the original document | |
++-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
+| ``num_characters`` | The number of characters used | Used for chunking |
+| | for max_characters in add_chunking_strategy | |
++-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
+| ``is_continuation`` | True if element is a continuation of table being chunked | Used for chunking |
++-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
+| ``detection_class_prob`` | Detection Model Class Probabilities | From unstructured-inference, hi-res strategy |
++-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
+
+:raw-html:`
`
+Notes on common metadata fields:
+
+Coordinates
+-----------
Some document types support location data for the elements, usually in the form of bounding boxes.
If it exists, an element's location data is available with ``element.metadata.coordinates``.
The ``coordinates`` property of an ``ElementMetadata`` stores:
-* points: These specify the corners of the bounding box starting from the top left corner and
+* **points**: These specify the corners of the bounding box starting from the top left corner and
proceeding counter-clockwise. The points represent pixels, the origin is in the top left and
the ``y`` coordinate increases in the downward direction.
-* system: The points have an associated coordinate system. A typical example of a coordinate system is
+* **system**: The points have an associated coordinate system. A typical example of a coordinate system is
``PixelSpace``, which is used for representing the coordinates of images. The coordinate system has a
name, orientation, layout width, and layout height.
@@ -49,11 +91,13 @@ returned. If the ``in_place`` flag is ``False``, only the altered coordinates ar
coordinates = ((10, 10), (10, 100), (200, 100), (200, 10))
coordinate_system = PixelSpace(width=850, height=1100)
+
element = Element(coordinates=coordinates, coordinate_system=coordinate_system)
print(element.metadata.coordinates.to_dict())
print(element.metadata.coordinates.system.orientation)
print(element.metadata.coordinates.system.width)
print(element.metadata.coordinates.system.height)
+
element.convert_coordinates_to_new_system(RelativeCoordinateSystem(), in_place=True)
# Should now be in terms of new coordinate system
print(element.metadata.coordinates.to_dict())
@@ -61,6 +105,42 @@ returned. If the ``in_place`` flag is ``False``, only the altered coordinates ar
print(element.metadata.coordinates.system.width)
print(element.metadata.coordinates.system.height)
+###########################################
+Additional Metadata Fields by Document Type
+###########################################
+
++-------------------------+---------------------+--------------------------------------------------------+
+| ``Field Name`` | Applicable Doc Types| Short Description |
++=========================+=====================+========================================================+
+| ``page_number`` | PDF, HTML, PPT | Page Number |
++-------------------------+---------------------+--------------------------------------------------------+
+| ``page_name`` | XLSX | Sheet Name in Excel document |
++-------------------------+---------------------+--------------------------------------------------------+
+| ``sent_from`` | EML | Email Sender |
++-------------------------+---------------------+--------------------------------------------------------+
+| ``sent_to`` | EML | Email Recipient |
++-------------------------+---------------------+--------------------------------------------------------+
+| ``subject`` | EML | Email Subject |
++-------------------------+---------------------+--------------------------------------------------------+
+| ``attached_to_filename``| MSG                 | Filename that attachment file is attached to           |
++-------------------------+---------------------+--------------------------------------------------------+
+| ``header_footer_type`` | Word Doc | Pages a header or footer applies to: "primary", |
+| | | "even_only", and "first_page" |
++-------------------------+---------------------+--------------------------------------------------------+
+| ``url`` | HTML | Webpage URL |
++-------------------------+---------------------+--------------------------------------------------------+
+| ``link_urls``           | HTML                | The URL associated with a link in a document.          |
++-------------------------+---------------------+--------------------------------------------------------+
+| ``link_texts`` | HTML | The text associated with a link in a document. |
++-------------------------+---------------------+--------------------------------------------------------+
+| ``links`` | HTML | List of {”text”: “, “url”: } items. |
++-------------------------+---------------------+--------------------------------------------------------+
+| ``section`` | EPUB | Book section title corresponding to table of contents |
++-------------------------+---------------------+--------------------------------------------------------+
+
+:raw-html:`
`
+Notes on additional metadata by document type:
+
Email
-----
@@ -89,13 +169,87 @@ Webpages
Elements from webpages will include a ``url`` metadata field, corresponding to the URL for the webpage.
+##############################
+Data Connector Metadata Fields
+##############################
+
+Common Data Connector Metadata Fields
+-------------------------------------
+
+- Source Metadata
+ - Source metadata includes (fields on the ``BaseIngestDoc`` class):
+ - date created
+ - date modified
+ - version
+ - source url
+ - exists
+- Data Source metadata (on json output):
+ - url
+ - version
+ - date created
+ - date modified
+ - date processed
+ - record locator
+- Record locator is specific to each connector
+
+Additional Metadata Fields by Connector Type (via record locator)
+-----------------------------------------------------------------
+
+- airtable
+ - base id
+ - table id
+ - view id
+- azure (from fsspec)
+ - protocol
+ - remote file path
+- box (from fsspec)
+ - protocol
+ - remote file path
+- confluence
+ - url
+ - page id
+- discord
+ - channel
+- dropbox (from fsspec)
+ - protocol
+ - remote file path
+- elasticsearch
+ - url
+ - index name
+ - document id
+- fsspec
+ - protocol
+ - remote file path
+- google drive
+ - drive id
+ - file id
+- gcs (from fsspec)
+ - protocol
+ - remote file path
+- jira
+ - base url
+ - issue key
+- onedrive
+ - user pname
+ - server relative path
+- outlook
+ - message id
+ - user email
+- s3 (from fsspec)
+ - protocol
+ - remote file path
+- sharepoint
+ - server path
+ - site url
+- wikipedia
+ - page title
+ - page url
+
##########################
Advanced Metadata Options
##########################
-
-
Extract Metadata with Regexes
------------------------------
From 12eac85bc87142d5f9b0f33bfa0f19c08498861d Mon Sep 17 00:00:00 2001
From: ron-unstructured <138828701+ron-unstructured@users.noreply.github.com>
Date: Wed, 4 Oct 2023 14:38:32 -0700
Subject: [PATCH 03/19] add embedding brick in TOC & fix sphinx warnings
---
docs/source/bricks.rst | 1 +
docs/source/bricks/embedding.rst | 12 ++++++------
.../azure_cognitive_search.rst | 6 ++++--
3 files changed, 11 insertions(+), 8 deletions(-)
diff --git a/docs/source/bricks.rst b/docs/source/bricks.rst
index 71dbf337db..f09718f322 100644
--- a/docs/source/bricks.rst
+++ b/docs/source/bricks.rst
@@ -19,3 +19,4 @@ After reading this section, you should understand the following:
bricks/extracting
bricks/staging
bricks/chunking
+ bricks/embedding
diff --git a/docs/source/bricks/embedding.rst b/docs/source/bricks/embedding.rst
index ef51c7c364..29481a4215 100644
--- a/docs/source/bricks/embedding.rst
+++ b/docs/source/bricks/embedding.rst
@@ -1,21 +1,21 @@
-########
+#########
Embedding
-########
+#########
EmbeddingEncoder classes in ``unstructured`` use document elements detected
with ``partition`` or document elements grouped with ``chunking`` to obtain
embeddings for each element, for use cases such as Retrieval Augmented Generation (RAG).
-``BaseEmbeddingEncoder``
-------------------
+BaseEmbeddingEncoder
+--------------------
The ``BaseEmbeddingEncoder`` is an abstract base class that defines the methods to be implemented
for each ``EmbeddingEncoder`` subclass.
-``OpenAIEmbeddingEncoder``
-------------------
+OpenAIEmbeddingEncoder
+----------------------
The ``OpenAIEmbeddingEncoder`` class uses langchain OpenAI integration under the hood
to connect to the OpenAI Text&Embedding API to obtain embeddings for pieces of text.
diff --git a/docs/source/destination_connectors/azure_cognitive_search.rst b/docs/source/destination_connectors/azure_cognitive_search.rst
index 9cd0eb09db..e08a1d6b8b 100644
--- a/docs/source/destination_connectors/azure_cognitive_search.rst
+++ b/docs/source/destination_connectors/azure_cognitive_search.rst
@@ -1,5 +1,6 @@
Azure Cognitive Search
-==========
+======================
+
Batch process all your records using ``unstructured-ingest`` to store structured outputs locally on your filesystem and upload those local files to an Azure Cognitive Search index.
First you'll need to install the azure cognitive search dependencies as shown here.
@@ -72,7 +73,8 @@ For a full list of the options the CLI accepts check ``unstructured-ingest `_.
Sample Index Schema
------------
+-------------------
+
To make sure the schema of the index matches the data being written to it, a sample schema json can be used:
.. literalinclude:: azure_cognitive_sample_index_schema.json
From 32b3a41d0928b7718fb63eb838dcae185448c35b Mon Sep 17 00:00:00 2001
From: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
Date: Wed, 4 Oct 2023 15:52:10 -0700
Subject: [PATCH 04/19] Update docs/source/metadata.rst
Update `text_as_html` description
Co-authored-by: cragwolfe
---
docs/source/metadata.rst | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/docs/source/metadata.rst b/docs/source/metadata.rst
index 3e7ba1cc17..ec34245f18 100644
--- a/docs/source/metadata.rst
+++ b/docs/source/metadata.rst
@@ -42,7 +42,7 @@ the source file:
| ``category_depth`` | Element Depth relative to | Category depth is the depth of an element relative to other elements of the same category. It's set by a document partitioner and enables the hierarchy post-processor to compute more accurate hierarchies. |
| | other elements of the same category | Category depth is set using native document hierarchies (e.g., h1, h2, h3 or the indentation level of a bulleted list in a word document). |
+-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
-| ``text_as_html`` | HTML representation of extracted tables | Elements with type Table |
+| ``text_as_html`` | HTML representation of extracted tables | Only applicable to ``Table`` Elements |
+-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ``languages`` | Document Languages | At document level or element level |
+-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
From 51f79be4db54f0da7266b5a925b0e18d97199144 Mon Sep 17 00:00:00 2001
From: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
Date: Wed, 4 Oct 2023 15:52:57 -0700
Subject: [PATCH 05/19] Update `is_continuation` description
Co-authored-by: cragwolfe
---
docs/source/metadata.rst | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
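Note: the reworded ``is_continuation`` description can be illustrated with a small sketch. It assumes a naive fixed-width split and plain dictionaries instead of element objects; ``split_text`` is a hypothetical helper, and the real chunking logic may choose split points differently.

```python
def split_text(text, max_characters):
    """Split text into pieces of at most max_characters characters.

    Hypothetical helper: every piece after the first is flagged as a
    continuation of the element it was cut from.
    """
    if max_characters <= 0:
        raise ValueError("max_characters must be positive")
    pieces = [
        text[start:start + max_characters]
        for start in range(0, len(text), max_characters)
    ]
    return [
        {"text": piece, "metadata": {"is_continuation": index > 0}}
        for index, piece in enumerate(pieces)
    ]


for chunk in split_text("abcdefghij", max_characters=4):
    print(chunk["text"], chunk["metadata"]["is_continuation"])
```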
diff --git a/docs/source/metadata.rst b/docs/source/metadata.rst
index ec34245f18..8d67cd53ab 100644
--- a/docs/source/metadata.rst
+++ b/docs/source/metadata.rst
@@ -53,7 +53,7 @@ the source file:
| ``num_characters`` | The number of characters used | Used for chunking |
| | for max_characters in add_chunking_strategy | |
+-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
-| ``is_continuation`` | True if element is a continuation of table being chunked | Used for chunking |
+| ``is_continuation`` | True if element is a continuation of a previous element | Only relevant for chunking, if an element was divided into two due to ``max_characters`` |
+-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ``detection_class_prob`` | Detection Model Class Probabilities | From unstructured-inference, hi-res strategy |
+-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
From 146735dc28ec6b0bfe391dbd50fd30a9226d4c83 Mon Sep 17 00:00:00 2001
From: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
Date: Wed, 4 Oct 2023 15:53:44 -0700
Subject: [PATCH 06/19] Update `page_number` applicable doc types
Co-authored-by: cragwolfe
---
docs/source/metadata.rst | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/docs/source/metadata.rst b/docs/source/metadata.rst
index 8d67cd53ab..168634ff20 100644
--- a/docs/source/metadata.rst
+++ b/docs/source/metadata.rst
@@ -112,7 +112,7 @@ Additional Metadata Fields by Document Type
+-------------------------+---------------------+--------------------------------------------------------+
| ``Field Name`` | Applicable Doc Types| Short Description |
+=========================+=====================+========================================================+
-| ``page_number`` | PDF, HTML, PPT | Page Number |
+| ``page_number``         | DOCX, PDF, PPT, XLSX| Page Number                                            |
+-------------------------+---------------------+--------------------------------------------------------+
| ``page_name`` | XLSX | Sheet Name in Excel document |
+-------------------------+---------------------+--------------------------------------------------------+
From 90e983f8044affb399efb5868993d723c05cf260 Mon Sep 17 00:00:00 2001
From: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
Date: Wed, 4 Oct 2023 16:25:50 -0700
Subject: [PATCH 07/19] Update `links` description
Co-authored-by: cragwolfe
---
docs/source/metadata.rst | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
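Note: since ``links`` is slated for removal in favor of ``link_texts`` and ``link_urls``, the relationship between the three fields can be sketched as below. It assumes each ``links`` item is a dictionary with ``text`` and ``url`` keys, as the table row suggests; ``split_links`` and the example URLs are hypothetical.

```python
def split_links(links):
    """Turn a list of {"text": ..., "url": ...} items into two parallel lists.

    Hypothetical helper mirroring the link_texts / link_urls fields.
    """
    link_texts = [item["text"] for item in links]
    link_urls = [item["url"] for item in links]
    return link_texts, link_urls


# Made-up sample data for illustration only.
links = [
    {"text": "Docs", "url": "https://example.com/docs"},
    {"text": "Blog", "url": "https://example.com/blog"},
]
texts, urls = split_links(links)
print(texts)
print(urls)
```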
diff --git a/docs/source/metadata.rst b/docs/source/metadata.rst
index 168634ff20..2a45e1dd67 100644
--- a/docs/source/metadata.rst
+++ b/docs/source/metadata.rst
@@ -133,7 +133,7 @@ Additional Metadata Fields by Document Type
+-------------------------+---------------------+--------------------------------------------------------+
| ``link_texts`` | HTML | The text associated with a link in a document. |
+-------------------------+---------------------+--------------------------------------------------------+
-| ``links`` | HTML | List of {”text”: “, “url”: } items. |
+| ``links`` | HTML | List of {”text”: “, “url”: } items. This element will be removed in the near future in favor of the above two rows |
+-------------------------+---------------------+--------------------------------------------------------+
| ``section`` | EPUB | Book section title corresponding to table of contents |
+-------------------------+---------------------+--------------------------------------------------------+
From 8fd185b7f115edcd616c5c6103f8738717e9a0a1 Mon Sep 17 00:00:00 2001
From: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
Date: Wed, 4 Oct 2023 16:27:26 -0700
Subject: [PATCH 08/19] change to ``points``
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
---
docs/source/metadata.rst | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
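Note: the ``points`` bullet describes corners listed from the top left and proceeding counter-clockwise, with ``y`` increasing downward. A minimal sketch of that ordering follows; ``bbox_to_points`` is a hypothetical helper, not a library function. With the values from the example in ``metadata.rst`` it reproduces the same tuple.

```python
def bbox_to_points(left, top, right, bottom):
    """Return the four corners of a box in the order the docs describe.

    Hypothetical helper. In a pixel space where y grows downward, starting
    at the top left and going counter-clockwise visits: top left,
    bottom left, bottom right, top right.
    """
    return (
        (left, top),
        (left, bottom),
        (right, bottom),
        (right, top),
    )


# Same box as the coordinates example in metadata.rst.
print(bbox_to_points(10, 10, 200, 100))
```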
diff --git a/docs/source/metadata.rst b/docs/source/metadata.rst
index 2a45e1dd67..a446a95d23 100644
--- a/docs/source/metadata.rst
+++ b/docs/source/metadata.rst
@@ -69,7 +69,7 @@ If it exists, an element's location data is available with ``element.metadata.co
The ``coordinates`` property of an ``ElementMetadata`` stores:
-* **points**: These specify the corners of the bounding box starting from the top left corner and
+* ``points`` : These specify the corners of the bounding box starting from the top left corner and
proceeding counter-clockwise. The points represent pixels, the origin is in the top left and
the ``y`` coordinate increases in the downward direction.
* **system**: The points have an associated coordinate system. A typical example of a coordinate system is
From ccef1058348dce61a81706f1fbfce752c07893e1 Mon Sep 17 00:00:00 2001
From: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
Date: Wed, 4 Oct 2023 16:30:22 -0700
Subject: [PATCH 09/19] change to ``system``
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
---
docs/source/metadata.rst | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/docs/source/metadata.rst b/docs/source/metadata.rst
index a446a95d23..3a008b5b91 100644
--- a/docs/source/metadata.rst
+++ b/docs/source/metadata.rst
@@ -72,7 +72,7 @@ The ``coordinates`` property of an ``ElementMetadata`` stores:
* ``points`` : These specify the corners of the bounding box starting from the top left corner and
proceeding counter-clockwise. The points represent pixels, the origin is in the top left and
the ``y`` coordinate increases in the downward direction.
-* **system**: The points have an associated coordinate system. A typical example of a coordinate system is
+* ``system``: The points have an associated coordinate system. A typical example of a coordinate system is
``PixelSpace``, which is used for representing the coordinates of images. The coordinate system has a
name, orientation, layout width, and layout height.
From 441c7be1402067872764ade5a8030b5fab786418 Mon Sep 17 00:00:00 2001
From: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
Date: Wed, 4 Oct 2023 16:40:31 -0700
Subject: [PATCH 10/19] Apply suggestions from code review
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>
---
docs/source/metadata.rst | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/docs/source/metadata.rst b/docs/source/metadata.rst
index 3a008b5b91..eddb32400c 100644
--- a/docs/source/metadata.rst
+++ b/docs/source/metadata.rst
@@ -32,7 +32,7 @@ the source file:
+-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ``filetype`` | File Type | |
+-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
-| ``type`` | Element Type | Categorizes elements into types such as Title, NarrativeText. Not a metadata field |
+| ``type`` | Element Type | Categorizes elements into types such as Title, NarrativeText. Not a metadata field. |
+-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ``coordinates`` | XY Bounding Box Coordinates | |
+-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
@@ -44,18 +44,18 @@ the source file:
+-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ``text_as_html`` | HTML representation of extracted tables | Only applicable to ``Table`` Elements |
+-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
-| ``languages`` | Document Languages | At document level or element level |
+| ``languages`` | Document Languages | At document level or element level. List is ordered by probability of being the primary language of the text. |
+-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ``emphasized_text_contents``| Emphasized text (bold or italic) in the original document| |
+-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ``emphasized_text_tags`` | Tags on text that is emphasized in the original document | |
+-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
-| ``num_characters`` | The number of characters used | Used for chunking |
+| ``num_characters`` | The number of characters used | Used for chunking. |
| | for max_characters in add_chunking_strategy | |
+-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ``is_continuation`` | True if element is a continuation of a previous element | Only relevant for chunking, if an element was divided into two due to ``max_characters`` |
+-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
-| ``detection_class_prob`` | Detection Model Class Probabilities | From unstructured-inference, hi-res strategy |
+| ``detection_class_prob`` | Detection model class probabilities | From unstructured-inference, hi-res strategy. |
+-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
:raw-html:`<br/>`
@@ -110,7 +110,7 @@ Additional Metadata Fields by Document Type
###########################################
+-------------------------+---------------------+--------------------------------------------------------+
-| ``Field Name`` | Applicable Doc Types| Short Description |
+| Field Name | Applicable Doc Types| Short Description |
+=========================+=====================+========================================================+
| ``page_number`` | DOCX,PDF, PPT,XLSX | Page Number |
+-------------------------+---------------------+--------------------------------------------------------+
From 44b886c8b6c679c95e4bfde78dc8d05e2511b43c Mon Sep 17 00:00:00 2001
From: ron-unstructured <138828701+ron-unstructured@users.noreply.github.com>
Date: Wed, 4 Oct 2023 17:05:58 -0700
Subject: [PATCH 11/19] various fixes from code reviews
---
docs/source/bricks/embedding.rst | 10 +++---
docs/source/metadata.rst | 59 +++++++++++++++-----------------
2 files changed, 33 insertions(+), 36 deletions(-)
diff --git a/docs/source/bricks/embedding.rst b/docs/source/bricks/embedding.rst
index 29481a4215..42a569ca58 100644
--- a/docs/source/bricks/embedding.rst
+++ b/docs/source/bricks/embedding.rst
@@ -2,20 +2,20 @@
Embedding
#########
-EmbeddingEncoder classes in ``unstructured`` use document elements detected
+Embedding encoder classes in ``unstructured`` use document elements detected
with ``partition`` or document elements grouped with ``chunking`` to obtain
embeddings for each element, for use cases such as Retrieval Augmented Generation (RAG).
-BaseEmbeddingEncoder
---------------------
+``BaseEmbeddingEncoder``
+------------------------
The ``BaseEmbeddingEncoder`` is an abstract base class that defines the methods to be implemented
for each ``EmbeddingEncoder`` subclass.
-OpenAIEmbeddingEncoder
-----------------------
+``OpenAIEmbeddingEncoder``
+--------------------------
The ``OpenAIEmbeddingEncoder`` class uses langchain OpenAI integration under the hood
to connect to the OpenAI Text&Embedding API to obtain embeddings for pieces of text.
diff --git a/docs/source/metadata.rst b/docs/source/metadata.rst
index eddb32400c..5c3063f207 100644
--- a/docs/source/metadata.rst
+++ b/docs/source/metadata.rst
@@ -24,38 +24,36 @@ the source file:
+-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Metadata Field Name | Short Description | Details |
+=============================+==========================================================+=============================================================================================================================================================================================================================================================================================+
-| ``filename`` | Filename | |
+| filename | Filename | |
+-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
-| ``file_directory`` | File Directory | |
+| file_directory | File Directory | |
+-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
-| ``last_modified`` | Last Modified Date | |
+| last_modified | Last Modified Date | |
+-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
-| ``filetype`` | File Type | |
+| filetype | File Type | |
+-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
-| ``type`` | Element Type | Categorizes elements into types such as Title, NarrativeText. Not a metadata field. |
+| coordinates | XY Bounding Box Coordinates | See notes below for further details about the bounding box. |
+-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
-| ``coordinates`` | XY Bounding Box Coordinates | |
-+-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
-| ``parent_id`` | Element Hierarchy (Parent ID) | Hierarchies are determined by a combination of a ruleset and element category depth. The current ruleset sets a parent ID if a title element follows a header element or any other element follows a title element. |
+| parent_id | Element Hierarchy (Parent ID) | Hierarchies are determined by a combination of a ruleset and element category depth. The current ruleset sets a parent ID if a title element follows a header element or any other element follows a title element. |
| | | The ID is also set if the element follows an element of the same category and the category_depth is greater than the category depth of the element it follows. Hierarchies enable more robust chunking configurations. |
+-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
-| ``category_depth`` | Element Depth relative to | Category depth is the depth of an element relative to other elements of the same category. It's set by a document partitioner and enables the hierarchy post-processor to compute more accurate hierarchies. |
+| category_depth | Element Depth relative to | Category depth is the depth of an element relative to other elements of the same category. It's set by a document partitioner and enables the hierarchy post-processor to compute more accurate hierarchies. |
| | other elements of the same category | Category depth is set using native document hierarchies (e.g., h1, h2, h3 or the indentation level of a bulleted list in a word document). |
+-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
-| ``text_as_html`` | HTML representation of extracted tables | Only applicable to ``Table`` Elements |
+| text_as_html | HTML representation of extracted tables | Only applicable to table elements. |
+-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
-| ``languages`` | Document Languages | At document level or element level. List is ordered by probability of being the primary language of the text. |
+| languages | Document Languages | At document level or element level. List is ordered by probability of being the primary language of the text. |
+-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
-| ``emphasized_text_contents``| Emphasized text (bold or italic) in the original document| |
+| emphasized_text_contents | Emphasized text (bold or italic) in the original document| |
+-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
-| ``emphasized_text_tags`` | Tags on text that is emphasized in the original document | |
+| emphasized_text_tags | Tags on text that is emphasized in the original document | |
+-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
-| ``num_characters`` | The number of characters used | Used for chunking. |
+| num_characters | The number of characters used | Used for chunking. |
| | for max_characters in add_chunking_strategy | |
+-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
-| ``is_continuation`` | True if element is a continuation of a previous element | Only relevant for chunking, if an element was divided into two due to ``max_characters`` |
+| is_continuation | True if element is a continuation of a previous element | Only relevant for chunking, if an element was divided into two due to ``max_characters``. |
+-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
-| ``detection_class_prob`` | Detection model class probabilities | From unstructured-inference, hi-res strategy. |
+| detection_class_prob | Detection model class probabilities | From unstructured-inference, hi-res strategy. |
+-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
:raw-html:`<br/>`
@@ -110,32 +108,32 @@ Additional Metadata Fields by Document Type
###########################################
+-------------------------+---------------------+--------------------------------------------------------+
-| Field Name | Applicable Doc Types| Short Description |
+| Field Name | Applicable Doc Types| Short Description |
+=========================+=====================+========================================================+
-| ``page_number`` | DOCX,PDF, PPT,XLSX | Page Number |
+| page_number | DOCX,PDF, PPT,XLSX | Page Number |
+-------------------------+---------------------+--------------------------------------------------------+
-| ``page_name`` | XLSX | Sheet Name in Excel document |
+| page_name | XLSX | Sheet Name in Excel document |
+-------------------------+---------------------+--------------------------------------------------------+
-| ``sent_from`` | EML | Email Sender |
+| sent_from | EML | Email Sender |
+-------------------------+---------------------+--------------------------------------------------------+
-| ``sent_to`` | EML | Email Recipient |
+| sent_to | EML | Email Recipient |
+-------------------------+---------------------+--------------------------------------------------------+
-| ``subject`` | EML | Email Subject |
+| subject | EML | Email Subject |
+-------------------------+---------------------+--------------------------------------------------------+
-| ``attached_to_filename``| MSG | filename that attachment file is attached to |
+| attached_to_filename | MSG | filename that attachment file is attached to |
+-------------------------+---------------------+--------------------------------------------------------+
-| ``header_footer_type`` | Word Doc | Pages a header or footer applies to: "primary", |
+| header_footer_type | Word Doc | Pages a header or footer applies to: "primary", |
| | | "even_only", and "first_page" |
+-------------------------+---------------------+--------------------------------------------------------+
-| ``url`` | HTML | Webpage URL |
-+-------------------------+---------------------+--------------------------------------------------------+
-| ``link_urls`` | HTML | The url associated with a link in a document. |
+| link_urls | HTML | The url associated with a link in a document. |
+-------------------------+---------------------+--------------------------------------------------------+
-| ``link_texts`` | HTML | The text associated with a link in a document. |
+| link_texts | HTML | The text associated with a link in a document. |
+-------------------------+---------------------+--------------------------------------------------------+
-| ``links`` | HTML | List of {”text”: “, “url”: } items. This element will be removed in the near future in favor of the above two rows |
+| links | HTML | List of {”text”: “, “url”: } items. |
+| | | Note: this element will be removed in the near future |
+| | | in favor of the above link_urls and link_texts. |
+-------------------------+---------------------+--------------------------------------------------------+
-| ``section`` | EPUB | Book section title corresponding to table of contents |
+| section | EPUB | Book section title corresponding to table of contents |
+-------------------------+---------------------+--------------------------------------------------------+
:raw-html:`<br/>`
@@ -177,7 +175,6 @@ Common Data Connector Metadata Fields
-------------------------------------
- Source Metadata
- - Source metadata includes (field on the `BaseIngestDoc` class:
- date created
- date modified
- version
From 6881e3e9cd6258194d8773d8baec376e3e730f55 Mon Sep 17 00:00:00 2001
From: ron-unstructured <138828701+ron-unstructured@users.noreply.github.com>
Date: Wed, 4 Oct 2023 17:36:53 -0700
Subject: [PATCH 12/19] Update list of document element
---
docs/source/introduction/getting_started.rst | 49 ++++++++++++++------
docs/source/metadata.rst | 1 +
2 files changed, 36 insertions(+), 14 deletions(-)
diff --git a/docs/source/introduction/getting_started.rst b/docs/source/introduction/getting_started.rst
index dd093f88c9..e5cb710a17 100644
--- a/docs/source/introduction/getting_started.rst
+++ b/docs/source/introduction/getting_started.rst
@@ -101,20 +101,41 @@ Document elements
When we partition a document, the output is a list of document ``Element`` objects.
These element objects represent different components of the source document. Currently, the ``unstructured`` library supports the following element types:
-* ``Element``
- * ``Text``
- * ``FigureCaption``
- * ``NarrativeText``
- * ``ListItem``
- * ``Title``
- * ``Address``
- * ``Table``
- * ``PageBreak``
- * ``Header``
- * ``Footer``
- * ``EmailAddress``
- * ``CheckBox``
- * ``Image``
+**Elements**
+^^^^^^^^^^^^
+
+* ``type``
+
+ * ``FigureCaption``
+
+ * ``NarrativeText``
+
+ * ``ListItem``
+
+ * ``Title``
+
+ * ``Address``
+
+ * ``Table``
+
+ * ``PageBreak``
+
+ * ``Header``
+
+ * ``Footer``
+
+ * ``UncategorizedText``
+
+ * ``Image``
+
+ * ``Formula``
+
+* ``element_id``
+
+* ``metadata`` - see: :ref:`Metadata page <metadata-label>`
+
+* ``text``
+
Other element types that we will add in the future include tables and figures.
Different partitioning functions use different methods for determining the element type and extracting the associated content.
diff --git a/docs/source/metadata.rst b/docs/source/metadata.rst
index 5c3063f207..e63bd06518 100644
--- a/docs/source/metadata.rst
+++ b/docs/source/metadata.rst
@@ -1,6 +1,7 @@
.. role:: raw-html(raw)
:format: html
+.. _metadata-label:
Metadata
========
From c72385508ca445d19212d01088cfbea84222e161 Mon Sep 17 00:00:00 2001
From: ron-unstructured <138828701+ron-unstructured@users.noreply.github.com>
Date: Wed, 4 Oct 2023 17:45:07 -0700
Subject: [PATCH 13/19] Remove webpages and source metadata sub-section on
Metadata page
---
docs/source/metadata.rst | 12 ------------
1 file changed, 12 deletions(-)
diff --git a/docs/source/metadata.rst b/docs/source/metadata.rst
index e63bd06518..1ac40fff60 100644
--- a/docs/source/metadata.rst
+++ b/docs/source/metadata.rst
@@ -162,12 +162,6 @@ Headers and footers in Word documents include a ``header_footer_type`` indicatin
a header or footer applies to. Valid values are ``"primary"``, ``"even_only"``, and ``"first_page"``.
-Webpages
----------
-
-Elements from webpages will include a ``url`` metadata field, corresponding to the URL for the webpage.
-
-
##############################
Data Connector Metadata Fields
##############################
@@ -175,12 +169,6 @@ Data Connector Metadata Fields
Common Data Connector Metadata Fields
-------------------------------------
-- Source Metadata
- - date created
- - date modified
- - version
- - source url
- - exists
- Data Source metadata (on json output):
- url
- version
From 66c7ab57602e4e564d446aca6aaa4298ecda0946 Mon Sep 17 00:00:00 2001
From: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
Date: Wed, 4 Oct 2023 17:47:31 -0700
Subject: [PATCH 14/19] update hierarchy description
Co-authored-by: cragwolfe
---
docs/source/metadata.rst | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/docs/source/metadata.rst b/docs/source/metadata.rst
index 1ac40fff60..0b4b901b09 100644
--- a/docs/source/metadata.rst
+++ b/docs/source/metadata.rst
@@ -35,7 +35,7 @@ the source file:
+-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| coordinates | XY Bounding Box Coordinates | See notes below for further details about the bounding box. |
+-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
-| parent_id | Element Hierarchy (Parent ID) | Hierarchies are determined by a combination of a ruleset and element category depth. The current ruleset sets a parent ID if a title element follows a header element or any other element follows a title element. |
+| parent_id | Element Hierarchy (Parent ID) | ``parent_id`` may be used to infer where an element resides within the overall hierarchy of a document. For instance, a ``NarrativeText`` element may have a ``Title`` element as a parent (a "sub-title"), which in turn may have another ``Title`` element as its parent (a "title"). |
| | | The ID is also set if the element follows an element of the same category and the category_depth is greater than the category depth of the element it follows. Hierarchies enable more robust chunking configurations. |
+-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| category_depth | Element Depth relative to | Category depth is the depth of an element relative to other elements of the same category. It's set by a document partitioner and enables the hierarchy post-processor to compute more accurate hierarchies. |
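Reviewer note: a minimal sketch of how ``parent_id`` could be followed to recover the title chain described in the new ``parent_id`` row. It assumes elements have been serialized to plain dictionaries with ``element_id``, ``type``, ``text``, and ``metadata`` keys; the sample IDs and texts, and the helper name ``ancestors``, are hypothetical and are not part of ``unstructured``.

```python
from typing import Dict, List


def ancestors(element: dict, by_id: Dict[str, dict]) -> List[dict]:
    """Follow parent_id links upward and return the parents, nearest first."""
    chain: List[dict] = []
    seen = set()  # guards against accidental cycles in malformed data
    parent_id = element["metadata"].get("parent_id")
    while parent_id is not None and parent_id in by_id and parent_id not in seen:
        seen.add(parent_id)
        parent = by_id[parent_id]
        chain.append(parent)
        parent_id = parent["metadata"].get("parent_id")
    return chain


# Hypothetical serialized elements: a title, a sub-title, and a paragraph.
elements = [
    {"element_id": "a1", "type": "Title", "text": "Annual Report", "metadata": {}},
    {"element_id": "b2", "type": "Title", "text": "Revenue",
     "metadata": {"parent_id": "a1"}},
    {"element_id": "c3", "type": "NarrativeText", "text": "Revenue grew.",
     "metadata": {"parent_id": "b2"}},
]

by_id = {e["element_id"]: e for e in elements}
path = [p["text"] for p in ancestors(elements[2], by_id)]
```

Here ``path`` lists the sub-title first and the top-level title last, matching the ``NarrativeText`` → ``Title`` → ``Title`` example in the table row.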
From 264edad8099f01c7d4bf30b8905809ed689aabbb Mon Sep 17 00:00:00 2001
From: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
Date: Wed, 4 Oct 2023 17:48:03 -0700
Subject: [PATCH 15/19] remove some description from parent_id
Co-authored-by: cragwolfe
---
docs/source/metadata.rst | 1 -
1 file changed, 1 deletion(-)
diff --git a/docs/source/metadata.rst b/docs/source/metadata.rst
index 0b4b901b09..8a8dbec3b9 100644
--- a/docs/source/metadata.rst
+++ b/docs/source/metadata.rst
@@ -36,7 +36,6 @@ the source file:
| coordinates | XY Bounding Box Coordinates | See notes below for further details about the bounding box. |
+-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| parent_id | Element Hierarchy (Parent ID) | ``parent_id`` may be used to infer where an element resides within the overall hierarchy of a document. For instance, a ``NarrativeText`` element may have a ``Title`` element as a parent (a "sub-title"), which in turn may have another ``Title`` element as its parent (a "title"). |
-| | | The ID is also set if the element follows an element of the same category and the category_depth is greater than the category depth of the element it follows. Hierarchies enable more robust chunking configurations. |
+-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| category_depth | Element Depth relative to | Category depth is the depth of an element relative to other elements of the same category. It's set by a document partitioner and enables the hierarchy post-processor to compute more accurate hierarchies. |
| | other elements of the same category | Category depth is set using native document hierarchies (e.g., h1, h2, h3 or the indentation level of a bulleted list in a word document). |
From 3ee24d1cf9397b4838ccf81621fa4bb1b6e87a4a Mon Sep 17 00:00:00 2001
From: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
Date: Wed, 4 Oct 2023 17:48:42 -0700
Subject: [PATCH 16/19] Update description for hierarchy_depth
Co-authored-by: cragwolfe
---
docs/source/metadata.rst | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/docs/source/metadata.rst b/docs/source/metadata.rst
index 8a8dbec3b9..dd37ce4800 100644
--- a/docs/source/metadata.rst
+++ b/docs/source/metadata.rst
@@ -38,7 +38,7 @@ the source file:
| parent_id | Element Hierarchy (Parent ID) | ``parent_id`` may be used to infer where an element resides within the overall hierarchy of a document. For instance, a ``NarrativeText`` element may have a ``Title`` element as a parent (a "sub-title"), which in turn may have another ``Title`` element as its parent (a "title"). |
+-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| category_depth | Element Depth relative to | Category depth is the depth of an element relative to other elements of the same category. It's set by a document partitioner and enables the hierarchy post-processor to compute more accurate hierarchies. |
-| | other elements of the same category | Category depth is set using native document hierarchies (e.g., h1, h2, h3 or the indentation level of a bulleted list in a word document). |
+| | other elements of the same category | Category depth may be set using native document hierarchies, e.g. reflecting \<h1\>, \<h2\>, or \<h3\> tags within an HTML document or the indentation level of a bulleted list item in a Word document. |
+-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| text_as_html | HTML representation of extracted tables | Only applicable to table elements. |
+-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
From e7d075936481e1d37da64cf65ae1811647467378 Mon Sep 17 00:00:00 2001
From: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
Date: Wed, 4 Oct 2023 17:49:24 -0700
Subject: [PATCH 17/19] Add paragraph for Data Connector Metadata Fields
Co-authored-by: cragwolfe
---
docs/source/metadata.rst | 1 +
1 file changed, 1 insertion(+)
diff --git a/docs/source/metadata.rst b/docs/source/metadata.rst
index dd37ce4800..8fb39e6eb5 100644
--- a/docs/source/metadata.rst
+++ b/docs/source/metadata.rst
@@ -165,6 +165,7 @@ a header or footer applies to. Valid values are ``"primary"``, ``"even_only"``,
Data Connector Metadata Fields
##############################
+Documents processed through unstructured-ingest connectors include additional document metadata. These additional fields only ever appear if the source document was processed by a connector.
Common Data Connector Metadata Fields
-------------------------------------
From 247bbba0aec7797243e24aff2274767de741ff9d Mon Sep 17 00:00:00 2001
From: ron-unstructured <138828701+ron-unstructured@users.noreply.github.com>
Date: Wed, 4 Oct 2023 17:52:00 -0700
Subject: [PATCH 18/19] reformatting rst table
---
docs/source/metadata.rst | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/docs/source/metadata.rst b/docs/source/metadata.rst
index 8fb39e6eb5..9a616a97af 100644
--- a/docs/source/metadata.rst
+++ b/docs/source/metadata.rst
@@ -35,10 +35,10 @@ the source file:
+-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| coordinates | XY Bounding Box Coordinates | See notes below for further details about the bounding box. |
+-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
-| parent_id | Element Hierarchy (Parent ID) | `parent_id` may be used to infer where an element resides within the overall hierarchy of a document. For instance, a `NarrativeText` element may have a `Title` element as a parent (a "sub-title"), which in turn may have another `Title` element as its parent (a "title). |
+| parent_id | Element Hierarchy (Parent ID) | `parent_id` may be used to infer where an element resides within the overall hierarchy of a document. For instance, a `NarrativeText` element may have a `Title` element as a parent (a "sub-title"), which in turn may have another `Title` element as its parent (a "title"). |
+-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| category_depth | Element Depth relative to | Category depth is the depth of an element relative to other elements of the same category. It's set by a document partitioner and enables the hierarchy post-processor to compute more accurate hierarchies. |
-| | other elements of the same category | Category depth may be set using native document hierarchies, e.g. reflecting \<h1\>, \<h2\>, or \<h3\> tags within an HTML document or the indentation level of a bulleted list item in a Word document. |
+| | other elements of the same category | Category depth may be set using native document hierarchies, e.g. reflecting \<h1\>, \<h2\>, or \<h3\> tags within an HTML document or the indentation level of a bulleted list item in a Word document. |
+-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| text_as_html | HTML representation of extracted tables | Only applicable to table elements. |
+-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
@@ -166,6 +166,7 @@ Data Connector Metadata Fields
##############################
Documents processed through unstructured-ingest connectors include additional document metadata. These additional fields only ever appear if the source document was processed by a connector.
+
Common Data Connector Metadata Fields
-------------------------------------
From 6f2cd2cb0c6ba66b40547e282c6ccfd201bbd007 Mon Sep 17 00:00:00 2001
From: ron-unstructured <138828701+ron-unstructured@users.noreply.github.com>
Date: Wed, 4 Oct 2023 17:56:24 -0700
Subject: [PATCH 19/19] Update changelog and version to v0.10.19
---
CHANGELOG.md | 4 ++--
unstructured/__version__.py | 2 +-
2 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/CHANGELOG.md b/CHANGELOG.md
index e3b73ba615..da4ec2f223 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,4 +1,4 @@
-## 0.10.19-dev11
+## 0.10.19
### Enhancements
@@ -10,8 +10,8 @@
* **Adds Table support for the `add_chunking_strategy` decorator to partition functions.** In addition to combining elements under Title elements, user's can now specify the `max_characters=` argument to chunk Table elements into TableChunk elements with `text` and `text_as_html` of length characters. This means partitioned Table results are ready for use in downstream applications without any post processing.
* **Expose endpoint url for s3 connectors** By allowing for the endpoint url to be explicitly overwritten, this allows for any non-AWS data providers supporting the s3 protocol to be supported (i.e. minio).
* **change default `hi_res` model for pdf/image partition to `yolox`** Now partitioning pdf/image using `hi_res` strategy utilizes `yolox_quantized` model isntead of `detectron2_onnx` model. This new default model has better recall for tables and produces more detailed categories for elements.
-
* **XLSX can now reads subtables within one sheet** Problem: Many .xlsx files are not created to be read as one full table per sheet. There are subtables, text and header along with more informations to extract from each sheet. Feature: This `partition_xlsx` now can reads subtable(s) within one .xlsx sheet, along with extracting other title and narrative texts. Importance: This enhance the power of .xlsx reading to not only one table per sheet, allowing user to capture more data tables from the file, if exists.
+* **Update Documentation on Element Types and Metadata** The documentation now reflects the latest element types and metadata. It covers the common metadata fields as well as the additional fields provided by the partition functions and connectors.
### Fixes
diff --git a/unstructured/__version__.py b/unstructured/__version__.py
index 5af4c987f0..d69ab44df8 100644
--- a/unstructured/__version__.py
+++ b/unstructured/__version__.py
@@ -1 +1 @@
-__version__ = "0.10.19-dev11" # pragma: no cover
+__version__ = "0.10.19" # pragma: no cover