feat: refactor ingest #3009

Merged: 53 commits, May 21, 2024
Changes from 51 commits

Commits
f48e608
Create new interfaces to support more versatility in how ingest proce…
rbiseck3 May 10, 2024
4945d09
Begin flushing out pipeline
rbiseck3 May 10, 2024
20fd7d1
Add partitioner pipelien step
rbiseck3 May 13, 2024
f8c18f3
Add chunker pipeline step
rbiseck3 May 13, 2024
7a6b8e4
Add upload pipeline step
rbiseck3 May 13, 2024
7bc79df
Support file level reprocess flag
rbiseck3 May 13, 2024
4d3a5c6
Add local destination as default
rbiseck3 May 13, 2024
3898b22
Add support for uncompress via new pipeline step
rbiseck3 May 13, 2024
0f2822a
Move files around
rbiseck3 May 13, 2024
c3e7113
Add s3 connector
rbiseck3 May 14, 2024
db3fe7e
Add cli commands
rbiseck3 May 14, 2024
97c14b7
bring over more logic from original implementation
rbiseck3 May 15, 2024
59cce07
dynamically add new commands into existing list, annotated with v2 as…
rbiseck3 May 15, 2024
0b1e72a
fix fsspec inputs
rbiseck3 May 15, 2024
934acb8
print all errors at the end of pipeline
rbiseck3 May 15, 2024
88e7587
Add optional limit on connections when using asyncio
rbiseck3 May 15, 2024
b0201ab
Add entry to changelog
rbiseck3 May 15, 2024
33bf040
support python3.9
rbiseck3 May 16, 2024
b6a4434
improve type checking in fsspec connectors
rbiseck3 May 16, 2024
e7739a9
Add __future__ to top level __init__ for v2 code
rbiseck3 May 16, 2024
e7203a2
Add better type checking in cli command code
rbiseck3 May 16, 2024
7bb91d9
update fsspec metadata to include record locator info
rbiseck3 May 16, 2024
d823510
Fix endpoint param in s3 fsspec connector
rbiseck3 May 16, 2024
0dd164d
Small optimization in getting acccess configs from s3 connector config
rbiseck3 May 16, 2024
5f128de
Add recursive flag to local cli inputs
rbiseck3 May 16, 2024
dd2706d
Add checks when getting values from os.stat
rbiseck3 May 17, 2024
0ad80ca
Add a classmethod to generate pipeline from configs
rbiseck3 May 17, 2024
6951a3a
Add dependency check wrapper for s3 connector
rbiseck3 May 17, 2024
95ea1cd
Add new README in v2
rbiseck3 May 17, 2024
044c758
Fix local connector
rbiseck3 May 17, 2024
2489290
Fix await in s3 connector
rbiseck3 May 17, 2024
7985601
feat: refactor ingest <- Ingest test fixtures update (#3048)
ryannikolaidis May 17, 2024
84a6ee0
Improve typing
rbiseck3 May 17, 2024
c47a8a1
expose max connections in CLI
rbiseck3 May 17, 2024
91039e1
Add sequence diagram
rbiseck3 May 17, 2024
acd3220
remove print statement
rbiseck3 May 20, 2024
2ab7994
Don't pass unset partition kwargs
rbiseck3 May 20, 2024
bd1315e
skip confluence
rbiseck3 May 20, 2024
41361f4
feat: refactor ingest <- Ingest test fixtures update (#3059)
ryannikolaidis May 20, 2024
5c7cfbb
Add back in confluence tests
rbiseck3 May 20, 2024
467a887
fix s3 uploader
rbiseck3 May 20, 2024
11312ba
fix s3 uploader
rbiseck3 May 20, 2024
0d8b5b9
Skip date created for minio as this will never be consistent
rbiseck3 May 20, 2024
811f3bf
tidy shell
rbiseck3 May 20, 2024
fef3134
skip confluence
rbiseck3 May 20, 2024
ff9ef03
feat: refactor ingest <- Ingest test fixtures update (#3060)
ryannikolaidis May 20, 2024
d1ce694
Add back in confluence tests
rbiseck3 May 20, 2024
57ff33d
fix minio test
rbiseck3 May 21, 2024
ff52ece
Update use of chunking strategy in CLI inputs
rbiseck3 May 21, 2024
eb09210
fix chunk strategy cli param
rbiseck3 May 21, 2024
aa53711
feat: refactor ingest <- Ingest test fixtures update (#3064)
ryannikolaidis May 21, 2024
57b6f2b
Add back in elasticsearch_elements_mappings.json into the es scripts dir
rbiseck3 May 21, 2024
58e1f8f
Add back in elasticsearch_elements_mappings.json into the opensearch …
rbiseck3 May 21, 2024
9 changes: 8 additions & 1 deletion CHANGELOG.md
@@ -1,8 +1,15 @@
## 0.14.1-dev0
## 0.14.1-dev1

* **Add support for Python 3.12**. `unstructured` now works with Python 3.12!

### Features
* **Large improvements to the ingest process:**
* Support for multiprocessing and async, with limits for both.
* Streamlined to process when mapping CLI invocations to the underlying code
* More granular steps introduced to give better control over process (i.e. dedicated step to uncompress files already in the local filesystem, new optional staging step before upload)
* Use the python client when calling the unstructured api for partitioning or chunking
* Saving the final content is now a dedicated destination connector (local) set as the default if none are provided. Avoids adding new files locally if uploading elsewhere.
* Leverage last modified date when deciding if new files should be downloaded and reprocessed.

### Fixes

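The changelog entry above mentions async support with limits, and commit 88e7587 adds an optional limit on connections when using asyncio. One common way to build such a cap is an `asyncio.Semaphore`, sketched below. This is an illustration under assumptions, not the code from this PR: `ConnectionTracker`, `download`, `run_all`, and `max_connections` are hypothetical names, and the "download" is a short sleep rather than real I/O.

```python
import asyncio
from typing import List, Tuple


class ConnectionTracker:
    """Records how many simulated connections are open at the same time."""

    def __init__(self) -> None:
        self.active = 0
        self.peak = 0


async def download(name: str, semaphore: asyncio.Semaphore, tracker: ConnectionTracker) -> str:
    # Hypothetical stand-in for a connector download; no network access happens.
    async with semaphore:  # waits here once max_connections tasks are inside
        tracker.active += 1
        tracker.peak = max(tracker.peak, tracker.active)
        await asyncio.sleep(0.01)  # simulated I/O
        tracker.active -= 1
    return f"downloaded:{name}"


async def run_all(names: List[str], max_connections: int) -> Tuple[List[str], int]:
    # The semaphore is created inside the running event loop.
    semaphore = asyncio.Semaphore(max_connections)
    tracker = ConnectionTracker()
    results = await asyncio.gather(*(download(n, semaphore, tracker) for n in names))
    return list(results), tracker.peak


if __name__ == "__main__":
    files = [f"doc-{i}.txt" for i in range(10)]
    results, peak = asyncio.run(run_all(files, max_connections=3))
    print(len(results), "files, peak concurrency", peak)
```

All ten tasks are scheduled at once, but the semaphore keeps the number of simultaneously open simulated connections at or below `max_connections`. The limit on multiprocessing mentioned in the same bullet is a separate setting (`--num-processes` in the scripts below).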
2 changes: 1 addition & 1 deletion examples/ingest/chroma/ingest.sh
@@ -16,7 +16,7 @@ PYTHONPATH=. ./unstructured/ingest/main.py \
--input-path example-docs/book-war-and-peace-1p.txt \
--output-dir local-to-chroma \
--strategy fast \
--chunk-elements \
--chunking-strategy by_title \
--embedding-provider "<an unstructured embedding provider, ie. langchain-huggingface>" \
--num-processes 2 \
--verbose \
2 changes: 1 addition & 1 deletion examples/ingest/clarifai/ingest.sh
@@ -10,7 +10,7 @@ PYTHONPATH=. ./unstructured/ingest/main.py \
--input-path example-docs/book-war-and-peace-1225p.txt \
--output-dir local-output-to-clarifai \
--strategy fast \
--chunk-elements \
--chunking-strategy by_title \
--num-processes 2 \
--verbose \
clarifai \
2 changes: 1 addition & 1 deletion examples/ingest/elasticsearch/destination.sh
@@ -15,7 +15,7 @@ PYTHONPATH=. ./unstructured/ingest/main.py \
--input-path example-docs/book-war-and-peace-1225p.txt \
--output-dir local-to-elasticsearch \
--strategy fast \
--chunk-elements \
--chunking-strategy by_title \
--embedding-provider "<an unstructured embedding provider, ie. langchain-huggingface>" \
--num-processes 2 \
--verbose \
2 changes: 1 addition & 1 deletion examples/ingest/mongodb/destination.sh
@@ -15,7 +15,7 @@ PYTHONPATH=. ./unstructured/ingest/main.py \
--input-path example-docs/book-war-and-peace-1225p.txt \
--output-dir local-to-mongodb \
--strategy fast \
--chunk-elements \
--chunking-strategy by_title \
--embedding-provider "<an unstructured embedding provider, ie. langchain-huggingface>" \
--num-processes 2 \
--verbose \
2 changes: 1 addition & 1 deletion examples/ingest/opensearch/destination.sh
@@ -15,7 +15,7 @@ PYTHONPATH=. ./unstructured/ingest/main.py \
--input-path example-docs/book-war-and-peace-1225p.txt \
--output-dir local-to-opensearch \
--strategy fast \
--chunk-elements \
--chunking-strategy by_title \
--embedding-provider "<an unstructured embedding provider, ie. langchain-huggingface>" \
--num-processes 2 \
--verbose \
2 changes: 1 addition & 1 deletion examples/ingest/pinecone/ingest.sh
@@ -16,7 +16,7 @@ PYTHONPATH=. ./unstructured/ingest/main.py \
--input-path example-docs/book-war-and-peace-1225p.txt \
--output-dir local-to-pinecone \
--strategy fast \
--chunk-elements \
--chunking-strategy by_title \
--embedding-provider "<an unstructured embedding provider, ie. langchain-huggingface>" \
--num-processes 2 \
--verbose \
2 changes: 1 addition & 1 deletion examples/ingest/qdrant/ingest.sh
@@ -12,7 +12,7 @@ unstructured-ingest \
--input-path example-docs/book-war-and-peace-1225p.txt \
--output-dir local-output-to-qdrant \
--strategy fast \
--chunk-elements \
--chunking-strategy by_title \
--embedding-provider "$EMBEDDING_PROVIDER" \
--num-processes 2 \
--verbose \
2 changes: 1 addition & 1 deletion examples/ingest/sql/ingest.sh
@@ -12,7 +12,7 @@ PYTHONPATH=. ./unstructured/ingest/main.py \
--input-path example-docs/book-war-and-peace-1225p.txt \
--output-dir local-to-pinecone \
--strategy fast \
--chunk-elements \
--chunking-strategy by_title \
--embedding-provider "<an unstructured embedding provider, ie. langchain-huggingface>" \
--num-processes 2 \
--verbose \
2 changes: 1 addition & 1 deletion examples/ingest/weaviate/ingest.sh
@@ -16,7 +16,7 @@ PYTHONPATH=. ./unstructured/ingest/main.py \
--reprocess \
--input-path example-docs/book-war-and-peace-1225p.txt \
--work-dir weaviate-work-dir \
--chunk-elements \
--chunking-strategy by_title \
--chunk-new-after-n-chars 2500 --chunk-multipage-sections \
--embedding-provider "langchain-huggingface" \
weaviate \
2 changes: 1 addition & 1 deletion test_unstructured_ingest/dest/astra.sh
@@ -47,7 +47,7 @@ PYTHONPATH=. ./unstructured/ingest/main.py \
--verbose \
--input-path example-docs/book-war-and-peace-1p.txt \
--work-dir "$WORK_DIR" \
--chunk-elements \
--chunking-strategy by_title \
--chunk-max-characters 1500 \
--chunk-multipage-sections \
--embedding-provider "langchain-huggingface" \
2 changes: 1 addition & 1 deletion test_unstructured_ingest/dest/chroma.sh
@@ -45,7 +45,7 @@ PYTHONPATH=. ./unstructured/ingest/main.py \
--verbose \
--input-path example-docs/book-war-and-peace-1p.txt \
--work-dir "$WORK_DIR" \
--chunk-elements \
--chunking-strategy by_title \
--chunk-max-characters 1500 \
--chunk-multipage-sections \
--embedding-provider "langchain-huggingface" \
2 changes: 1 addition & 1 deletion test_unstructured_ingest/dest/clarifai.sh
@@ -70,7 +70,7 @@ PYTHONPATH=. ./unstructured/ingest/main.py \
--input-path example-docs/book-war-and-peace-1p.txt \
--output-dir "$OUTPUT_DIR" \
--strategy fast \
--chunk-elements \
--chunking-strategy by_title \
--num-processes "$max_processes" \
--work-dir "$WORK_DIR" \
--verbose \
2 changes: 1 addition & 1 deletion test_unstructured_ingest/dest/elasticsearch.sh
@@ -45,7 +45,7 @@ PYTHONPATH=. ./unstructured/ingest/main.py \
--reprocess \
--input-path example-docs/book-war-and-peace-1225p.txt \
--work-dir "$WORK_DIR" \
--chunk-elements \
--chunking-strategy by_title \
--chunk-combine-text-under-n-chars 200 \
--chunk-new-after-n-chars 2500 \
--chunk-max-characters 38000 \
2 changes: 1 addition & 1 deletion test_unstructured_ingest/dest/pinecone.sh
@@ -91,7 +91,7 @@ PYTHONPATH=. ./unstructured/ingest/main.py \
--reprocess \
--input-path example-docs/book-war-and-peace-1225p.txt \
--work-dir "$WORK_DIR" \
--chunk-elements \
--chunking-strategy by_title \
--chunk-combine-text-under-n-chars 200 --chunk-new-after-n-chars 2500 --chunk-max-characters 38000 --chunk-multipage-sections \
--embedding-provider "langchain-huggingface" \
pinecone \
2 changes: 1 addition & 1 deletion test_unstructured_ingest/dest/qdrant.sh
@@ -60,7 +60,7 @@ PYTHONPATH=. ./unstructured/ingest/main.py \
--reprocess \
--input-path example-docs/book-war-and-peace-1225p.txt \
--work-dir "$WORK_DIR" \
--chunk-elements \
--chunking-strategy by_title \
--chunk-combine-text-under-n-chars 200 --chunk-new-after-n-chars 2500 --chunk-max-characters 38000 --chunk-multipage-sections \
--embedding-provider "langchain-huggingface" \
qdrant \
3 changes: 0 additions & 3 deletions test_unstructured_ingest/dest/s3.sh
@@ -7,7 +7,6 @@ SCRIPT_DIR=$(dirname "$DEST_PATH")
cd "$SCRIPT_DIR"/.. || exit 1
OUTPUT_FOLDER_NAME=s3-dest
OUTPUT_ROOT=${OUTPUT_ROOT:-$SCRIPT_DIR}
OUTPUT_DIR=$OUTPUT_ROOT/structured-output/$OUTPUT_FOLDER_NAME
WORK_DIR=$OUTPUT_ROOT/workdir/$OUTPUT_FOLDER_NAME
max_processes=${MAX_PROCESSES:=$(python3 -c "import os; print(os.cpu_count())")}
DESTINATION_S3="s3://utic-dev-tech-fixtures/utic-ingest-test-fixtures-output/$(uuidgen)/"
@@ -16,7 +15,6 @@ CI=${CI:-"false"}
# shellcheck disable=SC1091
source "$SCRIPT_DIR"/cleanup.sh
function cleanup() {
cleanup_dir "$OUTPUT_DIR"
cleanup_dir "$WORK_DIR"

if aws s3 ls "$DESTINATION_S3" --region us-east-2; then
@@ -31,7 +29,6 @@ RUN_SCRIPT=${RUN_SCRIPT:-./unstructured/ingest/main.py}
PYTHONPATH=${PYTHONPATH:-.} "$RUN_SCRIPT" \
local \
--num-processes "$max_processes" \
--output-dir "$OUTPUT_DIR" \
--strategy fast \
--verbose \
--reprocess \
Original file line number Diff line number Diff line change
@@ -98,7 +98,7 @@ PYTHONPATH=${PYTHONPATH:-.} "$RUN_SCRIPT" \
--path "Shared Documents" \
--recursive \
--embedding-provider "langchain-huggingface" \
--chunk-elements \
--chunking-strategy by_title \
--chunk-multipage-sections \
--work-dir "$WORK_DIR" \
azure-cognitive-search \