update ingest python doc (#1446)
### Description
Updating the Python version of the example docs to show how to run the
same code that the CLI runs, but from Python. Rather than copying the
command that would be run in the terminal and invoking it through the
subprocess library, the examples now use the supported code exposed in
the ingest directory.
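
As an illustration of why a list-based subprocess command can be fragile, here is a minimal sketch using only the standard library. The `echo_args` helper is hypothetical and exists only for this demonstration. It relies on two standard Python behaviours: adjacent string literals are concatenated when a comma is missing, and `$VAR` references are not expanded when a command is passed as a list without a shell.

```python
import subprocess
import sys


def echo_args(args):
    """Run a child Python process and return the arguments it received, one per line."""
    child = [sys.executable, "-c", "import sys; print('\\n'.join(sys.argv[1:]))", *args]
    result = subprocess.run(child, capture_output=True, text=True, check=True)
    return result.stdout.splitlines()


# A missing comma silently merges two adjacent string literals into one argument.
command = [
    "--output-dir", "airtable-ingest-output"  # <- no trailing comma here
    "--num-processes", "2",
]

# Without shell=True the "$..." text reaches the child process unchanged.
received = echo_args(command + ["$AIRTABLE_PERSONAL_ACCESS_TOKEN"])
```

Calling the library function directly avoids both problems, since arguments are ordinary Python keyword arguments and environment variables are read explicitly.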

For now only the Wikipedia one has been updated, to get some opinions on
this before updating all other connector docs.

Would close out #1445
rbiseck3 authored Oct 3, 2023
1 parent 89bd2fa commit 9d81971
Showing 52 changed files with 938 additions and 1,274 deletions.
9 changes: 8 additions & 1 deletion CHANGELOG.md
@@ -1,8 +1,15 @@
-## 0.10.19-dev4
+## 0.10.19-dev5

 ### Enhancements

 * **bump `unstructured-inference` to `0.6.6`** The updated version of `unstructured-inference` makes table extraction in `hi_res` mode configurable to fine tune table extraction performance; it also improves element detection by adding a deduplication post processing step in the `hi_res` partitioning of pdfs and images.
+* **Update python-based docs** Refactor docs to use the actual unstructured code rather than using the subprocess library to run the cli command itself.
+
+## 0.10.17-dev3
+
+### Enhancements
+
+* **Adds data source properties to SharePoint, Outlook, Onedrive, Reddit, and Slack connectors** These properties (date_created, date_modified, version, source_url, record_locator) are written to element metadata during ingest, mapping elements to information about the document source from which they derive. This functionality enables downstream applications to reveal source document applications, e.g. a link to a GDrive doc, Salesforce record, etc.

 ### Features

80 changes: 32 additions & 48 deletions docs/source/source_connectors/airtable.rst
@@ -29,29 +29,21 @@ Run Locally

       .. code:: python

-        import subprocess
-
-        command = [
-          "unstructured-ingest",
-          "airtable",
-          "--metadata-exclude", "filename,file_directory,metadata.data_source.date_processed",
-          "--personal-access-token", "$AIRTABLE_PERSONAL_ACCESS_TOKEN",
-          "--output-dir", "airtable-ingest-output"
-          "--num-processes", "2",
-          "--reprocess",
-        ]
-
-        # Run the command
-        process = subprocess.Popen(command, stdout=subprocess.PIPE)
-        output, error = process.communicate()
-
-        # Print output
-        if process.returncode == 0:
-            print('Command executed successfully. Output:')
-            print(output.decode())
-        else:
-            print('Command failed. Error:')
-            print(error.decode())
+        import os
+
+        from unstructured.ingest.interfaces import PartitionConfig, ReadConfig
+        from unstructured.ingest.runner.airtable import airtable
+
+        if __name__ == "__main__":
+            airtable(
+                verbose=True,
+                read_config=ReadConfig(),
+                partition_config=PartitionConfig(
+                    output_dir="airtable-ingest-output",
+                    num_processes=2,
+                ),
+                personal_access_token=os.getenv("AIRTABLE_PERSONAL_ACCESS_TOKEN"),
+            )

 Run via the API
 ---------------
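
Note that `os.getenv` returns `None` when the variable is unset, so the new example would pass `None` as the token in that case. Below is a minimal sketch of a fail-fast lookup, assuming you would rather stop before ingest starts. `require_env` is a hypothetical helper and not part of `unstructured`.

```python
import os


def require_env(name, env=None):
    """Return the value of an environment variable, or raise if it is missing or empty."""
    env = os.environ if env is None else env
    value = env.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} must be set before running ingest")
    return value


# Hypothetical usage inside the runner call:
#   personal_access_token=require_env("AIRTABLE_PERSONAL_ACCESS_TOKEN")
```

The optional `env` mapping is only there so the helper can be exercised without touching the real process environment.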
@@ -78,31 +70,23 @@ You can also use upstream connectors with the ``unstructured`` API. For this you

       .. code:: python

-        import subprocess
-
-        command = [
-          "unstructured-ingest",
-          "airtable",
-          "--metadata-exclude", "filename,file_directory,metadata.data_source.date_processed",
-          "--personal-access-token", "$AIRTABLE_PERSONAL_ACCESS_TOKEN",
-          "--output-dir", "airtable-ingest-output"
-          "--num-processes", "2",
-          "--reprocess",
-          "--partition-by-api",
-          "--api-key", "<UNSTRUCTURED-API-KEY>",
-        ]
-
-        # Run the command
-        process = subprocess.Popen(command, stdout=subprocess.PIPE)
-        output, error = process.communicate()
-
-        # Print output
-        if process.returncode == 0:
-            print('Command executed successfully. Output:')
-            print(output.decode())
-        else:
-            print('Command failed. Error:')
-            print(error.decode())
+        import os
+
+        from unstructured.ingest.interfaces import PartitionConfig, ReadConfig
+        from unstructured.ingest.runner.airtable import airtable
+
+        if __name__ == "__main__":
+            airtable(
+                verbose=True,
+                read_config=ReadConfig(),
+                partition_config=PartitionConfig(
+                    output_dir="airtable-ingest-output",
+                    num_processes=2,
+                    partition_by_api=True,
+                    api_key=os.getenv("UNSTRUCTURED_API_KEY"),
+                ),
+                personal_access_token=os.getenv("AIRTABLE_PERSONAL_ACCESS_TOKEN"),
+            )

 Additionally, you will need to pass the ``--partition-endpoint`` if you're running the API locally. You can find more information about the ``unstructured`` API `here <https://github.com/Unstructured-IO/unstructured-api>`_.
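
For the Python form, the counterpart of ``--partition-endpoint`` is presumably a keyword on ``PartitionConfig``. The sketch below only assembles keyword arguments and does not import ``unstructured``. The ``partition_endpoint`` field name is an assumption that mirrors the CLI flag, and ``UNSTRUCTURED_API_URL`` is a hypothetical environment variable, so check both against the installed version.

```python
import os


def api_partition_kwargs(env=None):
    """Build the API-related keyword arguments for a partition config from the environment."""
    env = os.environ if env is None else env
    kwargs = {
        "partition_by_api": True,
        "api_key": env.get("UNSTRUCTURED_API_KEY"),
    }
    # Hypothetical variable name; only set the endpoint when one is provided,
    # so the library default is used otherwise.
    endpoint = env.get("UNSTRUCTURED_API_URL")
    if endpoint:
        kwargs["partition_endpoint"] = endpoint  # assumed field name
    return kwargs
```

A call such as ``PartitionConfig(output_dir="airtable-ingest-output", num_processes=2, **api_partition_kwargs())`` would then cover both the hosted and the locally run API, provided the assumed field name holds.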

91 changes: 32 additions & 59 deletions docs/source/source_connectors/azure.rst
@@ -28,28 +28,20 @@ Run Locally

       .. code:: python

-        import subprocess
-
-        command = [
-          "unstructured-ingest",
-          "azure",
-          "--remote-url", "abfs://container1/",
-          "--account-name", "azureunstructured1"
-          "--output-dir", "/Output/Path/To/Files",
-          "--num-processes", "2",
-        ]
-
-        # Run the command
-        process = subprocess.Popen(command, stdout=subprocess.PIPE)
-        output, error = process.communicate()
-
-        # Print output
-        if process.returncode == 0:
-            print('Command executed successfully. Output:')
-            print(output.decode())
-        else:
-            print('Command failed. Error:')
-            print(error.decode())
+        from unstructured.ingest.interfaces import PartitionConfig, ReadConfig
+        from unstructured.ingest.runner.azure import azure
+
+        if __name__ == "__main__":
+            azure(
+                verbose=True,
+                read_config=ReadConfig(),
+                partition_config=PartitionConfig(
+                    output_dir="azure-ingest-output",
+                    num_processes=2,
+                ),
+                remote_url="abfs://container1/",
+                account_name="azureunstructured1",
+            )

 Run via the API
 ---------------
@@ -62,43 +54,24 @@ You can also use upstream connectors with the ``unstructured`` API. For this you

       .. code:: shell

-        unstructured-ingest \
-          azure \
-          --remote-url abfs://container1/ \
-          --account-name azureunstructured1 \
-          --output-dir azure-ingest-output \
-          --num-processes 2 \
-          --partition-by-api \
-          --api-key "<UNSTRUCTURED-API-KEY>"
-
-   .. tab:: Python
-
-      .. code:: python
-
-        import subprocess
-
-        command = [
-          "unstructured-ingest",
-          "azure",
-          "--remote-url", "abfs://container1/",
-          "--account-name", "azureunstructured1"
-          "--output-dir", "/Output/Path/To/Files",
-          "--num-processes", "2",
-          "--partition-by-api",
-          "--api-key", "<UNSTRUCTURED-API-KEY>",
-        ]
-
-        # Run the command
-        process = subprocess.Popen(command, stdout=subprocess.PIPE)
-        output, error = process.communicate()
-
-        # Print output
-        if process.returncode == 0:
-            print('Command executed successfully. Output:')
-            print(output.decode())
-        else:
-            print('Command failed. Error:')
-            print(error.decode())
+        import os
+
+        from unstructured.ingest.interfaces import PartitionConfig, ReadConfig
+        from unstructured.ingest.runner.azure import azure
+
+        if __name__ == "__main__":
+            azure(
+                verbose=True,
+                read_config=ReadConfig(),
+                partition_config=PartitionConfig(
+                    output_dir="azure-ingest-output",
+                    num_processes=2,
+                    partition_by_api=True,
+                    api_key=os.getenv("UNSTRUCTURED_API_KEY"),
+                ),
+                remote_url="abfs://container1/",
+                account_name="azureunstructured1",
+            )

 Additionally, you will need to pass the ``--partition-endpoint`` if you're running the API locally. You can find more information about the ``unstructured`` API `here <https://github.com/Unstructured-IO/unstructured-api>`_.
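
Across the examples in this diff, each CLI flag appears to map to a keyword argument by dropping the leading dashes and replacing hyphens with underscores (``--num-processes`` becomes ``num_processes``, ``--remote-url`` becomes ``remote_url``). A small sketch of that convention as observed here follows. ``flag_to_kwarg`` is a hypothetical helper, and the mapping is not guaranteed for every option.

```python
def flag_to_kwarg(flag):
    """Convert a CLI flag such as '--num-processes' to the keyword-argument spelling."""
    # Strip the leading dashes, then swap the remaining hyphens for underscores.
    return flag.lstrip("-").replace("-", "_")


# Flags taken from the removed subprocess examples in this diff.
converted = [flag_to_kwarg(f) for f in ("--remote-url", "--account-name", "--partition-by-api")]
```

This only describes the naming pattern. Which config object a given keyword belongs to (for example ``PartitionConfig`` versus the connector function itself) still has to be read from the examples.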
82 changes: 34 additions & 48 deletions docs/source/source_connectors/biomed.rst
@@ -29,29 +29,21 @@ Run Locally

       .. code:: python

-        import subprocess
-
-        command = [
-          "unstructured-ingest",
-          "biomed",
-          "--path", "oa_pdf/07/07/sbaa031.073.PMC7234218.pdf",
-          "--output-dir", "/Output/Path/To/Files",
-          "--num-processes", "2",
-          "--verbose",
-          "--preserve-downloads",
-        ]
-
-        # Run the command
-        process = subprocess.Popen(command, stdout=subprocess.PIPE)
-        output, error = process.communicate()
-
-        # Print output
-        if process.returncode == 0:
-            print('Command executed successfully. Output:')
-            print(output.decode())
-        else:
-            print('Command failed. Error:')
-            print(error.decode())
+        from unstructured.ingest.interfaces import PartitionConfig, ReadConfig
+        from unstructured.ingest.runner.biomed import biomed
+
+        if __name__ == "__main__":
+            biomed(
+                verbose=True,
+                read_config=ReadConfig(
+                    preserve_downloads=True,
+                ),
+                partition_config=PartitionConfig(
+                    output_dir="biomed-ingest-output-path",
+                    num_processes=2,
+                ),
+                path="oa_pdf/07/07/sbaa031.073.PMC7234218.pdf",
+            )

 Run via the API
 ---------------
@@ -78,31 +70,25 @@ You can also use upstream connectors with the ``unstructured`` API. For this you

       .. code:: python

-        import subprocess
-
-        command = [
-          "unstructured-ingest",
-          "biomed",
-          "--path", "oa_pdf/07/07/sbaa031.073.PMC7234218.pdf",
-          "--output-dir", "/Output/Path/To/Files",
-          "--num-processes", "2",
-          "--verbose",
-          "--preserve-downloads",
-          "--partition-by-api",
-          "--api-key", "<UNSTRUCTURED-API-KEY>",
-        ]
-
-        # Run the command
-        process = subprocess.Popen(command, stdout=subprocess.PIPE)
-        output, error = process.communicate()
-
-        # Print output
-        if process.returncode == 0:
-            print('Command executed successfully. Output:')
-            print(output.decode())
-        else:
-            print('Command failed. Error:')
-            print(error.decode())
+        import os
+
+        from unstructured.ingest.interfaces import PartitionConfig, ReadConfig
+        from unstructured.ingest.runner.biomed import biomed
+
+        if __name__ == "__main__":
+            biomed(
+                verbose=True,
+                read_config=ReadConfig(
+                    preserve_downloads=True,
+                ),
+                partition_config=PartitionConfig(
+                    output_dir="biomed-ingest-output-path",
+                    num_processes=2,
+                    partition_by_api=True,
+                    api_key=os.getenv("UNSTRUCTURED_API_KEY"),
+                ),
+                path="oa_pdf/07/07/sbaa031.073.PMC7234218.pdf",
+            )

 Additionally, you will need to pass the ``--partition-endpoint`` if you're running the API locally. You can find more information about the ``unstructured`` API `here <https://github.com/Unstructured-IO/unstructured-api>`_.
