Commit
feat: add ability to extract extra metadata with regex (#763)
* first pass on regex metadata
* fix typing for regex metadata
* add dataclass back in
* add decorators
* fix tests
* update docs
* add tests for regex metadata
* add process metadata to tsv
* changelog and version
* docs typos
* consolidate to using a single kwarg
* fix test
1 parent ec403e2, commit 4ea7168
Showing 27 changed files with 281 additions and 41 deletions.
@@ -0,0 +1,84 @@
Metadata
========

The ``unstructured`` package tracks a variety of metadata about Elements extracted from documents.
Tracking metadata enables users to filter document elements downstream based on element metadata of interest.
For example, a user may be interested in selected document elements from a given page number
or an e-mail with a given subject line.

Metadata is tracked at the element level. You can extract the metadata for a given document element
with ``element.metadata``. For a dictionary representation, use ``element.metadata.to_dict()``.
All document types return the following metadata fields when the information is available from
the source file:

* ``filename``
* ``file_directory``
* ``date``
* ``filetype``
* ``page_number``
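
As a rough illustration of the downstream filtering described above, the sketch below works on plain dictionaries shaped like the output of ``element.metadata.to_dict()``. The records, file names, and the ``on_page`` helper are hypothetical and exist only for this example; they are not part of ``unstructured``.

```python
# Hypothetical records: each pairs an element's text with a dict shaped like
# the result of element.metadata.to_dict(). Values are made up for the example.
records = [
    {"text": "Quarterly summary", "metadata": {"filename": "report.pdf", "page_number": 1}},
    {"text": "Revenue grew this quarter.", "metadata": {"filename": "report.pdf", "page_number": 2}},
    {"text": "Costs were flat.", "metadata": {"filename": "report.pdf", "page_number": 2}},
    {"text": "Hello from an email", "metadata": {"filename": "note.eml"}},
]


def on_page(records, page_number):
    """Keep only the records whose metadata reports the given page number."""
    # .get() is used because a field is only present when the source file
    # provides it (the email record above has no page_number at all).
    return [r for r in records if r["metadata"].get("page_number") == page_number]


page_two = on_page(records, 2)
for record in page_two:
    print(record["text"])
```

The same pattern applies to any other field, for example matching on ``subject`` for emails.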


Email
-----

Emails will include ``sent_from``, ``sent_to``, and ``subject`` metadata.
``sent_from`` is a list of strings because the `RFC 822 <https://www.rfc-editor.org/rfc/rfc822>`_
spec for emails allows for multiple sent from email addresses.
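
To see why a list is the natural shape for ``sent_from``, the sketch below parses a ``From`` header with two mailboxes using only Python's standard ``email`` package. It illustrates the header format only and is not how ``unstructured`` itself reads emails; the message and addresses are made up for the example.

```python
from email import message_from_string
from email.utils import formataddr, getaddresses

# A made-up message whose From header lists two mailboxes.
raw = (
    "From: Alice <alice@example.com>, Bob <bob@example.com>\n"
    "To: Carol <carol@example.com>\n"
    "Subject: Test Email\n"
    "\n"
    "Body text.\n"
)

msg = message_from_string(raw)

# getaddresses() splits the comma-separated header into (name, address) pairs;
# formataddr() turns each pair back into a "Name <address>" string.
sent_from = [formataddr(pair) for pair in getaddresses(msg.get_all("From", []))]
print(sent_from)
```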


Microsoft Excel Documents
--------------------------

For Excel documents, ``ElementMetadata`` will contain a ``page_name`` element, which corresponds
to the sheet name in the Excel document.


Microsoft Word Documents
-------------------------

Headers and footers in Word documents include a ``header_footer_type`` indicating which page
a header or footer applies to. Valid values are ``"primary"``, ``"even_only"``, and ``"first_page"``.


Webpages
---------

Elements from webpages will include a ``url`` metadata field, corresponding to the URL for the webpage.



##########################
Advanced Metadata Options
##########################



Extract Metadata with Regexes
------------------------------

``unstructured`` allows users to extract additional metadata with regexes using the ``regex_metadata`` kwarg.
Here is an example of how to extract regex metadata:


.. code:: python

  from unstructured.partition.text import partition_text
  text = "SPEAKER 1: It is my turn to speak now!"
  elements = partition_text(text=text, regex_metadata={"speaker": r"SPEAKER \d{1,3}:"})
  elements[0].metadata.regex_metadata

The result will look like:


.. code:: python

  {'speaker':
    [
      {
        'text': 'SPEAKER 1:',
        'start': 0,
        'end': 10,
      }
    ]
  }
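
For readers who want to see where the ``text``, ``start``, and ``end`` values come from, here is a minimal standard-library sketch that produces the same structure with ``re.finditer``. The ``extract_regex_metadata`` helper is hypothetical and written only for this illustration; it should not be read as the library's actual implementation.

```python
import re


def extract_regex_metadata(text, regex_metadata):
    """Hypothetical helper: map each key to the matches of its pattern in text."""
    results = {}
    for key, pattern in regex_metadata.items():
        # Each match records the matched text plus its character offsets,
        # where "end" is exclusive (text[start:end] gives the match back).
        results[key] = [
            {"text": match.group(), "start": match.start(), "end": match.end()}
            for match in re.finditer(pattern, text)
        ]
    return results


text = "SPEAKER 1: It is my turn to speak now!"
matches = extract_regex_metadata(text, {"speaker": r"SPEAKER \d{1,3}:"})
print(matches)
```

Because ``end`` is exclusive, slicing the original string with ``text[start:end]`` recovers the matched text.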
@@ -206,15 +206,18 @@ def test_partition_email_has_metadata():
     filename = os.path.join(DIRECTORY, "..", "..", "example-docs", "fake-email-header.eml")
     elements = partition_email(filename=filename)
     assert len(elements) > 0
-    assert elements[0].metadata == ElementMetadata(
-        filename=filename,
-        date="2022-12-16T17:04:16-05:00",
-        page_number=None,
-        url=None,
-        sent_from=["Matthew Robinson <[email protected]>"],
-        sent_to=["Matthew Robinson <[email protected]>"],
-        subject="Test Email",
-        filetype="message/rfc822",
+    assert (
+        elements[0].metadata.to_dict()
+        == ElementMetadata(
+            filename=filename,
+            date="2022-12-16T17:04:16-05:00",
+            page_number=None,
+            url=None,
+            sent_from=["Matthew Robinson <[email protected]>"],
+            sent_to=["Matthew Robinson <[email protected]>"],
+            subject="Test Email",
+            filetype="message/rfc822",
+        ).to_dict()
     )

     expected_dt = datetime.datetime.fromisoformat("2022-12-16T17:04:16-05:00")
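
The commit message does not say why the assertion now compares ``to_dict()`` output instead of the metadata objects themselves. One plausible reason, offered here only as an assumption, is that a newly added field can differ between two objects (unset on one, empty on the other) without changing their dictionary form. The sketch below shows that effect with a hypothetical stand-in dataclass; ``Meta`` and its ``to_dict`` are illustrative and are not the library's ``ElementMetadata``.

```python
from dataclasses import asdict, dataclass
from typing import Dict, List, Optional


@dataclass
class Meta:
    """Hypothetical stand-in for a metadata dataclass (not ElementMetadata)."""

    filename: Optional[str] = None
    page_number: Optional[int] = None
    regex_metadata: Optional[Dict[str, List[dict]]] = None

    def to_dict(self):
        # Assumption for this sketch: unset (None) and empty-dict fields are
        # left out of the dictionary representation.
        return {k: v for k, v in asdict(self).items() if v is not None and v != {}}


extracted = Meta(filename="fake-email.eml", regex_metadata={})
expected = Meta(filename="fake-email.eml")

# Field-by-field dataclass equality sees {} and None as different values...
print(extracted == expected)
# ...while the dictionary forms agree because both omit the empty field.
print(extracted.to_dict() == expected.to_dict())
```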
@@ -36,15 +36,18 @@ def test_partition_msg_from_filename():
     filename = os.path.join(EXAMPLE_DOCS_DIRECTORY, "fake-email.msg")
     elements = partition_msg(filename=filename)
     assert elements == EXPECTED_MSG_OUTPUT
-    assert elements[0].metadata == ElementMetadata(
-        filename=filename,
-        date="2022-12-16T17:04:16-05:00",
-        page_number=None,
-        url=None,
-        sent_from=["Matthew Robinson <[email protected]>"],
-        sent_to=["Matthew Robinson (None)"],
-        subject="Test Email",
-        filetype="application/vnd.ms-outlook",
+    assert (
+        elements[0].metadata.to_dict()
+        == ElementMetadata(
+            filename=filename,
+            date="2022-12-16T17:04:16-05:00",
+            page_number=None,
+            url=None,
+            sent_from=["Matthew Robinson <[email protected]>"],
+            sent_to=["Matthew Robinson (None)"],
+            subject="Test Email",
+            filetype="application/vnd.ms-outlook",
+        ).to_dict()
     )