feat: refactor ingest (#3009)
### Description
This refactors the current ingest CLI process to support better
granularity in how the steps are run:
* Both multiprocessing and async are now supported. Given that a lot of the
steps are IO-bound, such as downloading and uploading content, we can
achieve better parallelization by using async here.
* The destination step is broken up into a stager step and an upload step.
This allows steps that require manipulation of the data between
formats, such as converting the elements json into a csv format to
upload for tabular destinations, to be pulled out of the step that does
the actual upload.
* The process of writing the content to a local destination has been
pulled out as its own dedicated destination connector, meaning you no
longer need to persist the content locally once the process is done if
the content was uploaded elsewhere.
* Quick update to the chunker/partition step to use the python client.
* Move the uncompress support into a pipeline step since this can
arbitrarily apply to any concrete files that have been downloaded,
regardless of where they came from.
* Leverage last modified date to mark files to be reprocessed, even if
the file already exists locally.
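
To make the async and stager/upload points above concrete, here is a minimal sketch, not the code in this PR. Every name in it (`download`, `stage_as_csv`, `upload`, `run_pipeline`, `max_concurrency`) is hypothetical, and the network calls are simulated with `asyncio.sleep`. It only illustrates the shape: IO-bound steps run concurrently under a limit, and the format conversion lives in a stager that knows nothing about the upload.

```python
import asyncio
import csv
import io


async def download(doc_id: str) -> list[dict]:
    # Stand-in for an IO-bound fetch; a real connector would call a remote API here.
    await asyncio.sleep(0.01)
    return [{"element_id": f"{doc_id}-0", "type": "Title", "text": f"Title of {doc_id}"}]


def stage_as_csv(elements: list[dict]) -> str:
    # Stager: reshape the elements into CSV, with no knowledge of the destination.
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["element_id", "type", "text"])
    writer.writeheader()
    writer.writerows(elements)
    return buffer.getvalue()


async def upload(payload: str, sink: list[str]) -> None:
    # Uploader: only moves the already-staged payload (simulated here).
    await asyncio.sleep(0.01)
    sink.append(payload)


async def run_pipeline(doc_ids: list[str], max_concurrency: int = 2) -> list[str]:
    # The semaphore caps how many IO-bound calls are in flight at once.
    semaphore = asyncio.Semaphore(max_concurrency)
    sink: list[str] = []

    async def process(doc_id: str) -> None:
        async with semaphore:
            elements = await download(doc_id)
        staged = stage_as_csv(elements)
        async with semaphore:
            await upload(staged, sink)

    await asyncio.gather(*(process(doc_id) for doc_id in doc_ids))
    return sink


uploaded = asyncio.run(run_pipeline(["a", "b", "c"]))
print(len(uploaded))
```

Because the stager is a plain function between the two IO-bound steps, a tabular destination could swap in a different stager without touching the upload logic.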

### Callouts
Retry configs haven't been moved over yet. This is an open question
because the intent was for it to wrap potential connection errors but
now any of the other steps that leverage an API might run into network
connection issues. Should those be isolated in each of the steps and
wrapped with the same retry configs? Or do we need to expose a unique
retry config for each step? This would bloat the input params even more.
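
As a sketch of the first option raised above (one shared retry config applied inside each step), and not a decision made in this PR: `RetryConfig`, its fields, and `with_retries` are all hypothetical names, and the failing call is simulated.

```python
import time
from dataclasses import dataclass
from functools import wraps


@dataclass(frozen=True)
class RetryConfig:
    # Hypothetical shared config; the field names are illustrative only.
    max_attempts: int = 3
    initial_delay: float = 0.01
    backoff_factor: float = 2.0


def with_retries(config: RetryConfig, retry_on=(ConnectionError, TimeoutError)):
    """Wrap a step so that only network-style errors are retried."""

    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            delay = config.initial_delay
            for attempt in range(1, config.max_attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except retry_on:
                    if attempt == config.max_attempts:
                        raise  # Out of attempts: surface the original error.
                    time.sleep(delay)
                    delay *= config.backoff_factor

        return wrapper

    return decorator


shared = RetryConfig()
calls = {"count": 0}


@with_retries(shared)
def flaky_partition_call() -> str:
    # Simulated API-backed step that fails twice before succeeding.
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated network failure")
    return "ok"


result = flaky_partition_call()
print(result, calls["count"])
```

Reusing one config object across steps keeps the input params flat; the cost is that every step shares the same attempt count and backoff.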

### Testing
* If you want to run the new code as an SDK, there's an example file
that was added to highlight how to do that:
[example.py](https://github.com/Unstructured-IO/unstructured/blob/roman/refactor-ingest/unstructured/ingest/v2/example.py)
* If you want to run the new code as an isolated CLI:
```shell
PYTHONPATH=. python unstructured/ingest/v2/main.py --help
```
* If you want to see which commands have been migrated to the new
version, there's now a `v2` short help text next to those commands when
running the current cli:
```shell
PYTHONPATH=. python unstructured/ingest/main.py --help
Usage: main.py [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  airtable
  azure
  biomed
  box
  confluence
  delta-table
  discord
  dropbox
  elasticsearch
  fsspec
  gcs
  github
  gitlab
  google-drive
  hubspot
  jira
  local          v2
  mongodb
  notion
  onedrive
  opensearch
  outlook
  reddit
  s3             v2
  salesforce
  sftp
  sharepoint
  slack
  wikipedia
```

You can run any of the local or s3 specific ingest tests and these
should now work.

---------

Co-authored-by: ryannikolaidis <[email protected]>
Co-authored-by: rbiseck3 <[email protected]>
3 people authored May 21, 2024
1 parent 73739b3 commit 3eaf65a
Showing 120 changed files with 43,791 additions and 34,966 deletions.
9 changes: 8 additions & 1 deletion CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -1,8 +1,15 @@
## 0.14.1-dev0
## 0.14.1-dev1

* **Add support for Python 3.12**. `unstructured` now works with Python 3.12!

### Features
* **Large improvements to the ingest process:**
* Support for multiprocessing and async, with limits for both.
* Streamlined to process when mapping CLI invocations to the underlying code
* More granular steps introduced to give better control over process (i.e. dedicated step to uncompress files already in the local filesystem, new optional staging step before upload)
* Use the python client when calling the unstructured api for partitioning or chunking
* Saving the final content is now a dedicated destination connector (local) set as the default if none are provided. Avoids adding new files locally if uploading elsewhere.
* Leverage last modified date when deciding if new files should be downloaded and reprocessed.

### Fixes

2 changes: 1 addition & 1 deletion examples/ingest/chroma/ingest.sh
@@ -16,7 +16,7 @@ PYTHONPATH=. ./unstructured/ingest/main.py \
--input-path example-docs/book-war-and-peace-1p.txt \
--output-dir local-to-chroma \
--strategy fast \
--chunk-elements \
--chunking-strategy by_title \
--embedding-provider "<an unstructured embedding provider, ie. langchain-huggingface>" \
--num-processes 2 \
--verbose \
2 changes: 1 addition & 1 deletion examples/ingest/clarifai/ingest.sh
@@ -10,7 +10,7 @@ PYTHONPATH=. ./unstructured/ingest/main.py \
--input-path example-docs/book-war-and-peace-1225p.txt \
--output-dir local-output-to-clarifai \
--strategy fast \
--chunk-elements \
--chunking-strategy by_title \
--num-processes 2 \
--verbose \
clarifai \
2 changes: 1 addition & 1 deletion examples/ingest/elasticsearch/destination.sh
@@ -15,7 +15,7 @@ PYTHONPATH=. ./unstructured/ingest/main.py \
--input-path example-docs/book-war-and-peace-1225p.txt \
--output-dir local-to-elasticsearch \
--strategy fast \
--chunk-elements \
--chunking-strategy by_title \
--embedding-provider "<an unstructured embedding provider, ie. langchain-huggingface>" \
--num-processes 2 \
--verbose \
2 changes: 1 addition & 1 deletion examples/ingest/mongodb/destination.sh
@@ -15,7 +15,7 @@ PYTHONPATH=. ./unstructured/ingest/main.py \
--input-path example-docs/book-war-and-peace-1225p.txt \
--output-dir local-to-mongodb \
--strategy fast \
--chunk-elements \
--chunking-strategy by_title \
--embedding-provider "<an unstructured embedding provider, ie. langchain-huggingface>" \
--num-processes 2 \
--verbose \
2 changes: 1 addition & 1 deletion examples/ingest/opensearch/destination.sh
@@ -15,7 +15,7 @@ PYTHONPATH=. ./unstructured/ingest/main.py \
--input-path example-docs/book-war-and-peace-1225p.txt \
--output-dir local-to-opensearch \
--strategy fast \
--chunk-elements \
--chunking-strategy by_title \
--embedding-provider "<an unstructured embedding provider, ie. langchain-huggingface>" \
--num-processes 2 \
--verbose \
2 changes: 1 addition & 1 deletion examples/ingest/pinecone/ingest.sh
@@ -16,7 +16,7 @@ PYTHONPATH=. ./unstructured/ingest/main.py \
--input-path example-docs/book-war-and-peace-1225p.txt \
--output-dir local-to-pinecone \
--strategy fast \
--chunk-elements \
--chunking-strategy by_title \
--embedding-provider "<an unstructured embedding provider, ie. langchain-huggingface>" \
--num-processes 2 \
--verbose \
2 changes: 1 addition & 1 deletion examples/ingest/qdrant/ingest.sh
@@ -12,7 +12,7 @@ unstructured-ingest \
--input-path example-docs/book-war-and-peace-1225p.txt \
--output-dir local-output-to-qdrant \
--strategy fast \
--chunk-elements \
--chunking-strategy by_title \
--embedding-provider "$EMBEDDING_PROVIDER" \
--num-processes 2 \
--verbose \
2 changes: 1 addition & 1 deletion examples/ingest/sql/ingest.sh
@@ -12,7 +12,7 @@ PYTHONPATH=. ./unstructured/ingest/main.py \
--input-path example-docs/book-war-and-peace-1225p.txt \
--output-dir local-to-pinecone \
--strategy fast \
--chunk-elements \
--chunking-strategy by_title \
--embedding-provider "<an unstructured embedding provider, ie. langchain-huggingface>" \
--num-processes 2 \
--verbose \
2 changes: 1 addition & 1 deletion examples/ingest/weaviate/ingest.sh
@@ -16,7 +16,7 @@ PYTHONPATH=. ./unstructured/ingest/main.py \
--reprocess \
--input-path example-docs/book-war-and-peace-1225p.txt \
--work-dir weaviate-work-dir \
--chunk-elements \
--chunking-strategy by_title \
--chunk-new-after-n-chars 2500 --chunk-multipage-sections \
--embedding-provider "langchain-huggingface" \
weaviate \
@@ -0,0 +1,145 @@
{
"properties": {
"element_id": {
"type": "keyword"
},
"text": {
"type": "text",
"analyzer": "english"
},
"type": {
"type": "text"
},
"embeddings": {
"type": "dense_vector",
"dims": 384
},
"metadata": {
"type": "object",
"properties": {
"category_depth": {
"type": "integer"
},
"parent_id": {
"type": "keyword"
},
"attached_to_filename": {
"type": "keyword"
},
"filetype": {
"type": "keyword"
},
"last_modified": {
"type": "date"
},
"file_directory": {
"type": "keyword"
},
"filename": {
"type": "keyword"
},
"data_source": {
"type": "object",
"properties": {
"url": {
"type": "text",
"analyzer": "standard"
},
"version": {
"type": "keyword"
},
"date_created": {
"type": "date"
},
"date_modified": {
"type": "date"
},
"date_processed": {
"type": "date"
},
"record_locator": {
"type": "keyword"
},
"permissions_data": {
"type": "object"
}
}
},
"coordinates": {
"type": "object",
"properties": {
"system": {
"type": "keyword"
},
"layout_width": {
"type": "float"
},
"layout_height": {
"type": "float"
},
"points": {
"type": "float"
}
}
},
"languages": {
"type": "keyword"
},
"page_number": {
"type": "integer"
},
"page_name": {
"type": "keyword"
},
"url": {
"type": "text",
"analyzer": "standard"
},
"links": {
"type": "object"
},
"link_urls": {
"type": "text"
},
"link_texts": {
"type": "text"
},
"sent_from": {
"type": "text",
"analyzer": "standard"
},
"sent_to": {
"type": "text",
"analyzer": "standard"
},
"subject": {
"type": "text",
"analyzer": "standard"
},
"section": {
"type": "text",
"analyzer": "standard"
},
"header_footer_type": {
"type": "keyword"
},
"emphasized_text_contents": {
"type": "text"
},
"emphasized_text_tags": {
"type": "keyword"
},
"text_as_html": {
"type": "text",
"analyzer": "standard"
},
"regex_metadata": {
"type": "object"
},
"detection_class_prob": {
"type": "float"
}
}
}
}
}
@@ -5,7 +5,9 @@
INDEX_NAME = "ingest-test-destination"
USER = os.environ["ELASTIC_USER"]
PASSWORD = os.environ["ELASTIC_PASSWORD"]
MAPPING_PATH = "docs/source/ingest/destination_connectors/data/elasticsearch_elements_mappings.json"
MAPPING_PATH = (
"scripts/elasticsearch-test-helpers/destination_connector/elasticsearch_elements_mappings.json"
)

with open(MAPPING_PATH) as f:
mappings = json.load(f)
@@ -4,7 +4,9 @@
INDEX_NAME = "ingest-test-destination"
USER = "admin"
PASSWORD = "admin"
MAPPING_PATH = "docs/source/ingest/destination_connectors/data/opensearch_elements_mappings.json"
MAPPING_PATH = (
"scripts/opensearch-test-helpers/destination_connector/opensearch_elements_mappings.json"
)

with open(MAPPING_PATH) as f:
mappings = json.load(f)