expand fsspec downstream connectors #1777

Merged · 38 commits · Oct 30, 2023

Commits (38)
c48a670
refactor writers into their own directory
rbiseck3 Oct 17, 2023
054081c
Add all other fsspec writers
rbiseck3 Oct 17, 2023
eb199bd
Finish azure dest with e2e test
rbiseck3 Oct 17, 2023
9b89d10
Add s3 e2e test
rbiseck3 Oct 17, 2023
8d21ab3
Add box dest connector
rbiseck3 Oct 17, 2023
8d0ec9e
Add dropbox dest connector
rbiseck3 Oct 18, 2023
a01edf8
WIP: adding gcs dest connector
rbiseck3 Oct 18, 2023
bbbe539
finish setting up e2e test for gcs
rbiseck3 Oct 18, 2023
85c5cb8
update changelog
rbiseck3 Oct 18, 2023
50bc263
Add cloud login for az and gcloud in CI
rbiseck3 Oct 24, 2023
20a207c
Add dest tests to ingest script
rbiseck3 Oct 24, 2023
88a6d38
Add permissions to CI
rbiseck3 Oct 24, 2023
a1e743a
Debugging CI
rbiseck3 Oct 24, 2023
e17cc43
Add generic kwargs input for all writers
rbiseck3 Oct 24, 2023
c36a734
debugging CI
rbiseck3 Oct 24, 2023
84dfbdb
debugging CI
rbiseck3 Oct 24, 2023
279c4bd
debugging CI
rbiseck3 Oct 24, 2023
00d4aa4
Add cloud auth to upgest ingest job
rbiseck3 Oct 25, 2023
a0e4c8c
Add permissions to upgest ingest job
rbiseck3 Oct 25, 2023
275062f
debugging CI
rbiseck3 Oct 25, 2023
0cf5fb2
debugging CI
rbiseck3 Oct 25, 2023
e54c1b9
debugging CI
rbiseck3 Oct 25, 2023
ef659fd
debugging CI
rbiseck3 Oct 25, 2023
e5bf533
move permissions to top level
tabossert Oct 25, 2023
de47529
move permissions to top level
tabossert Oct 25, 2023
718b789
bump version
tabossert Oct 25, 2023
d4742e6
activate gcp credentials
tabossert Oct 25, 2023
80aae66
test logins
tabossert Oct 25, 2023
71ccbc1
add login command
tabossert Oct 25, 2023
8c366a1
update path to credentials file
tabossert Oct 25, 2023
d17a588
add test commands
tabossert Oct 25, 2023
e3c2df7
remove extra steps for gcloud
tabossert Oct 25, 2023
f773223
add back setup cloud sdk
tabossert Oct 25, 2023
37edcd3
debugging CI
rbiseck3 Oct 26, 2023
9deb94a
set environment for azure federated login
tabossert Oct 26, 2023
482ccc0
Fix s3 dest test
rbiseck3 Oct 26, 2023
3a350c8
fix shellcheck
rbiseck3 Oct 26, 2023
463f4c3
expand fsspec downstream connectors <- Ingest test fixtures update (#…
ryannikolaidis Oct 30, 2023
47 changes: 46 additions & 1 deletion .github/workflows/ci.yml
@@ -11,7 +11,35 @@ on:
env:
GHA_CACHE_KEY_VERSION: "v1"

permissions:
id-token: write
contents: read

jobs:
test_logins:
runs-on: ubuntu-latest
steps:
- uses: 'actions/checkout@v4'
- name: 'Google Cloud Auth'
uses: 'google-github-actions/auth@v1'
id: gauth
with:
workload_identity_provider: ${{ secrets.GCP_WORKLOAD_IDENTITY_PROVIDER }}
service_account: ${{ secrets.GCP_SERVICE_ACCOUNT }}
- name: 'Set up Cloud SDK'
uses: 'google-github-actions/setup-gcloud@v1'
- name: 'run gcloud command'
run: |-
gcloud projects list
- name: 'Az CLI login'
uses: azure/login@v1
with:
client-id: ${{ secrets.AZURE_CLIENT_ID }}
tenant-id: ${{ secrets.AZURE_TENANT_ID }}
subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
- name: 'azure test command'
run: |-
az account show
setup:
strategy:
matrix:
@@ -268,6 +296,7 @@ jobs:


test_ingest:
environment: ci
strategy:
matrix:
python-version: ["3.8","3.9","3.10","3.11"]
@@ -276,7 +305,23 @@ jobs:
NLTK_DATA: ${{ github.workspace }}/nltk_data
needs: [setup_ingest, lint]
steps:
- uses: actions/checkout@v3
# actions/checkout MUST come before auth
- uses: 'actions/checkout@v4'
- name: 'Google Cloud Auth'
uses: 'google-github-actions/auth@v1'
with:
workload_identity_provider: ${{ secrets.GCP_WORKLOAD_IDENTITY_PROVIDER }}
service_account: ${{ secrets.GCP_SERVICE_ACCOUNT }}
create_credentials_file: true
activate_credentials_file: true
- name: 'Set up Cloud SDK'
uses: 'google-github-actions/setup-gcloud@v1'
- name: 'Az CLI login'
uses: azure/login@v1
with:
client-id: ${{ secrets.AZURE_CLIENT_ID }}
tenant-id: ${{ secrets.AZURE_TENANT_ID }}
subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v4
with:
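The new test_logins job and the expanded test_ingest steps both rely on GitHub's OIDC token (hence the top-level id-token: write permission) to federate into GCP and Azure without long-lived keys. A rough local sanity check for the same identities, assuming the gcloud and az CLIs are installed and already authenticated, is simply:

gcloud auth list      # confirm which Google credential is active
gcloud projects list  # same check the workflow's "run gcloud command" step performs
az account show       # same check the workflow's "azure test command" step performs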
55 changes: 54 additions & 1 deletion .github/workflows/ingest-test-fixtures-update-pr.yml
@@ -8,7 +8,36 @@ env:
GHA_CACHE_KEY_VERSION: "v1"
PYTHON_VERSION: "3.10"

permissions:
id-token: write
contents: read

jobs:
test_logins:
runs-on: ubuntu-latest
environment: ci
steps:
- uses: 'actions/checkout@v4'
- name: 'Google Cloud Auth'
uses: 'google-github-actions/auth@v1'
id: gauth
with:
workload_identity_provider: ${{ secrets.GCP_WORKLOAD_IDENTITY_PROVIDER }}
service_account: ${{ secrets.GCP_SERVICE_ACCOUNT }}
- name: 'Set up Cloud SDK'
uses: 'google-github-actions/setup-gcloud@v1'
- name: 'run gcloud command'
run: |-
gcloud projects list
- name: 'Az CLI login'
uses: azure/login@v1
with:
client-id: ${{ secrets.AZURE_CLIENT_ID }}
tenant-id: ${{ secrets.AZURE_TENANT_ID }}
subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
- name: 'azure test command'
run: |-
az account show
setup:
runs-on: ubuntu-latest
if: |
@@ -75,12 +104,36 @@ jobs:
make install-all-ingest

update-fixtures-and-pr:
environment: ci
runs-on: ubuntu-latest-m
env:
NLTK_DATA: ${{ github.workspace }}/nltk_data
needs: [setup_ingest]
steps:
- uses: actions/checkout@v3
# actions/checkout MUST come before auth
- uses: 'actions/checkout@v4'
- name: 'Google Cloud Auth'
uses: 'google-github-actions/auth@v1'
with:
workload_identity_provider: ${{ secrets.GCP_WORKLOAD_IDENTITY_PROVIDER }}
service_account: ${{ secrets.GCP_SERVICE_ACCOUNT }}
create_credentials_file: true
activate_credentials_file: true
- name: 'Set up Cloud SDK'
uses: 'google-github-actions/setup-gcloud@v1'
- name: 'Az CLI login'
uses: azure/login@v1
with:
client-id: ${{ secrets.AZURE_CLIENT_ID }}
tenant-id: ${{ secrets.AZURE_TENANT_ID }}
subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v4
with:
python-version: ${{ env.PYTHON_VERSION }}
- name: Get full Python version
id: full-python-version
run: echo version=$(python -c "import sys; print('-'.join(str(v) for v in sys.version_info))") >> $GITHUB_OUTPUT
- uses: actions/cache/restore@v3
id: virtualenv-cache
with:
9 changes: 7 additions & 2 deletions CHANGELOG.md
@@ -1,11 +1,16 @@
-## 0.10.28-dev4
+## 0.10.28-dev5

### Enhancements

* **Add element type CI evaluation workflow** Adds element type frequency evaluation metrics to the current ingest workflow to measure the performance of each file extracted as well as aggregated-level performance.
* **Add table structure evaluation helpers** Adds functions to evaluate the similarity between predicted table structure and actual table structure.
* **Use `yolox` by default for table extraction when partitioning pdf/image** The `yolox` model provides higher recall of table regions than the quantized version, and it is now the default element detection model when `infer_table_structure=True` for partitioning pdf/image files.
* **Remove pdfminer elements from inside tables** Previously, when using `hi_res` some elements were extracted using pdfminer too, so we removed pdfminer from the tables pipeline to avoid duplicated elements.
* **Fsspec downstream connectors** New destination connectors added to the ingest CLI; users may now use `unstructured-ingest` to write to any of the following (a worked CLI example follows this CHANGELOG excerpt):
* Azure
* Box
* Dropbox
* Google Cloud Storage

### Features

@@ -1609,4 +1614,4 @@ of an email.

## 0.2.0

-* Initial release of unstructured
+* Initial release of unstructured
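As a quick illustration of the destination flow described in the changelog entry above — the command shape is taken from this PR's e2e test scripts, and the container, directory, and output-dir names here are placeholders — a local partition run can be written straight to Azure blob storage like so:

PYTHONPATH=. ./unstructured/ingest/main.py \
  local \
  --input-path example-docs/fake-memo.pdf \
  --output-dir azure-dest-output \
  --strategy fast \
  --num-processes 2 \
  azure \
  --remote-url "abfs://<container>/<directory>/" \
  --connection-string "$AZURE_DEST_CONNECTION_STR"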
4 changes: 2 additions & 2 deletions test_unstructured_ingest/metrics/aggregate-scores-cct.tsv
@@ -1,3 +1,3 @@
strategy average sample_sd population_sd count
-cct-accuracy 0.777 0.088 0.072 3
-cct-%missing 0.087 0.045 0.037 3
+cct-accuracy 0.798 0.083 0.072 4
+cct-%missing 0.087 0.037 0.032 4
2 changes: 1 addition & 1 deletion test_unstructured_ingest/metrics/all-docs-cct.tsv
@@ -1,4 +1,4 @@
filename connector cct-accuracy cct-%missing
science-exploration-1p.pptx box 0.861 0.09
example-10k.html local 0.686 0.04
-IRS-form-1987.pdf azure 0.783 0.13
+IRS-form-1987.pdf azure 0.783 0.13
57 changes: 57 additions & 0 deletions test_unstructured_ingest/test-ingest-azure-dest.sh
@@ -0,0 +1,57 @@
#!/usr/bin/env bash

set -e

SCRIPT_DIR=$(dirname "$(realpath "$0")")
cd "$SCRIPT_DIR"/.. || exit 1
OUTPUT_FOLDER_NAME=azure-dest
OUTPUT_DIR=$SCRIPT_DIR/structured-output/$OUTPUT_FOLDER_NAME
WORK_DIR=$SCRIPT_DIR/workdir/$OUTPUT_FOLDER_NAME
max_processes=${MAX_PROCESSES:=$(python3 -c "import os; print(os.cpu_count())")}

if [ -z "$AZURE_DEST_CONNECTION_STR" ]; then
echo "Skipping Azure destination ingest test because the AZURE_DEST_CONNECTION_STR env var is not set."
exit 0
fi

CONTAINER=utic-ingest-test-fixtures-output
DIRECTORY=$(date +%s)
REMOTE_URL="abfs://$CONTAINER/$DIRECTORY/"

# shellcheck disable=SC1091
source "$SCRIPT_DIR"/cleanup.sh
function cleanup() {
cleanup_dir "$OUTPUT_DIR"
cleanup_dir "$WORK_DIR"

echo "deleting azure storage blob directory $CONTAINER/$DIRECTORY"
az storage fs directory delete -f "$CONTAINER" -n "$DIRECTORY" --connection-string "$AZURE_DEST_CONNECTION_STR" --yes

}
trap cleanup EXIT

# Create directory to use for testing
az storage fs directory create -f "$CONTAINER" --n "$DIRECTORY" --connection-string "$AZURE_DEST_CONNECTION_STR"

PYTHONPATH=. ./unstructured/ingest/main.py \
local \
--num-processes "$max_processes" \
--metadata-exclude coordinates,filename,file_directory,metadata.data_source.date_processed,metadata.last_modified,metadata.detection_class_prob,metadata.parent_id,metadata.category_depth \
--output-dir "$OUTPUT_DIR" \
--strategy fast \
--verbose \
--reprocess \
--input-path example-docs/fake-memo.pdf \
--work-dir "$WORK_DIR" \
azure \
--overwrite \
--remote-url "$REMOTE_URL" \
--connection-string "$AZURE_DEST_CONNECTION_STR"

# Simply check the number of files uploaded
expected_num_files=1
num_files_in_azure=$(az storage blob list -c "$CONTAINER" --prefix "$DIRECTORY"/example-docs/ --connection-string "$AZURE_DEST_CONNECTION_STR" | jq 'length')
if [ "$num_files_in_azure" -ne "$expected_num_files" ]; then
echo "Expected $expected_num_files files to be uploaded to azure, but found $num_files_in_azure files."
exit 1
fi
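Since the script above skips itself when AZURE_DEST_CONNECTION_STR is unset, a local run — assuming the Azure CLI and jq are installed, and with a placeholder connection string for a real test storage account — looks roughly like:

export AZURE_DEST_CONNECTION_STR="<connection string for the test storage account>"
./test_unstructured_ingest/test-ingest-azure-dest.sh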
54 changes: 54 additions & 0 deletions test_unstructured_ingest/test-ingest-box-dest.sh
@@ -0,0 +1,54 @@
#!/usr/bin/env bash
#TODO currently box api/sdk does not work to create folders and check for content similar to other fsspec ingest tests

#
#set -e
#
#SCRIPT_DIR=$(dirname "$(realpath "$0")")
#cd "$SCRIPT_DIR"/.. || exit 1
#OUTPUT_FOLDER_NAME=box-dest
#OUTPUT_DIR=$SCRIPT_DIR/structured-output/$OUTPUT_FOLDER_NAME
#WORK_DIR=$SCRIPT_DIR/workdir/$OUTPUT_FOLDER_NAME
#max_processes=${MAX_PROCESSES:=$(python3 -c "import os; print(os.cpu_count())")}
#DESTINATION_BOX="box://utic-dev-tech-fixtures/utic-ingest-test-fixtures-output/$(date +%s)/"
#
#CI=${CI:-"false"}
#
#if [ -z "$BOX_APP_CONFIG" ] && [ -z "$BOX_APP_CONFIG_PATH" ]; then
# echo "Skipping Box ingest test because neither BOX_APP_CONFIG nor BOX_APP_CONFIG_PATH env vars are set."
# exit 0
#fi
#
#if [ -z "$BOX_APP_CONFIG_PATH" ]; then
# # Create temporary service key file
# BOX_APP_CONFIG_PATH=$(mktemp)
# echo "$BOX_APP_CONFIG" >"$BOX_APP_CONFIG_PATH"
#fi
#
## shellcheck disable=SC1091
#source "$SCRIPT_DIR"/cleanup.sh
#function cleanup() {
# cleanup_dir "$OUTPUT_DIR"
# cleanup_dir "$WORK_DIR"
# if [ "$CI" == "true" ]; then
# cleanup_dir "$DOWNLOAD_DIR"
# fi
#}
#trap cleanup EXIT
#
#PYTHONPATH=. ./unstructured/ingest/main.py \
# local \
# --num-processes "$max_processes" \
# --metadata-exclude coordinates,filename,file_directory,metadata.data_source.date_processed,metadata.last_modified,metadata.detection_class_prob,metadata.parent_id,metadata.category_depth \
# --output-dir "$OUTPUT_DIR" \
# --strategy fast \
# --verbose \
# --reprocess \
# --input-path example-docs/fake-memo.pdf \
# --work-dir "$WORK_DIR" \
# box \
# --box-app-config "$BOX_APP_CONFIG_PATH" \
# --remote-url "$DESTINATION_BOX" \
#
## Simply check the number of files uploaded
#expected_num_files=1
81 changes: 81 additions & 0 deletions test_unstructured_ingest/test-ingest-dropbox-dest.sh
@@ -0,0 +1,81 @@
#!/usr/bin/env bash

set -e

SCRIPT_DIR=$(dirname "$(realpath "$0")")
cd "$SCRIPT_DIR"/.. || exit 1
OUTPUT_FOLDER_NAME=dropbox-dest
OUTPUT_DIR=$SCRIPT_DIR/structured-output/$OUTPUT_FOLDER_NAME
WORK_DIR=$SCRIPT_DIR/workdir/$OUTPUT_FOLDER_NAME
max_processes=${MAX_PROCESSES:=$(python3 -c "import os; print(os.cpu_count())")}
DESTINATION_DROPBOX="/test-output/$(date +%s)"
CI=${CI:-"false"}

if [ -z "$DROPBOX_APP_KEY" ] || [ -z "$DROPBOX_APP_SECRET" ] || [ -z "$DROPBOX_REFRESH_TOKEN" ]; then
echo "Skipping Dropbox ingest test because one or more of these env vars is not set:"
echo "DROPBOX_APP_KEY, DROPBOX_APP_SECRET, DROPBOX_REFRESH_TOKEN"
exit 0
fi

# Get a new access token from Dropbox
DROPBOX_RESPONSE=$(curl -s https://api.dropbox.com/oauth2/token -d refresh_token="$DROPBOX_REFRESH_TOKEN" -d grant_type=refresh_token -d client_id="$DROPBOX_APP_KEY" -d client_secret="$DROPBOX_APP_SECRET")
DROPBOX_ACCESS_TOKEN=$(jq -r '.access_token' <<< "$DROPBOX_RESPONSE")

# shellcheck disable=SC1091
source "$SCRIPT_DIR"/cleanup.sh
function cleanup() {
cleanup_dir "$OUTPUT_DIR"
cleanup_dir "$WORK_DIR"
if [ "$CI" == "true" ]; then
cleanup_dir "$DOWNLOAD_DIR"
fi

echo "deleting test folder $DESTINATION_DROPBOX"
curl -X POST https://api.dropboxapi.com/2/files/delete_v2 \
--header "Content-Type: application/json" \
--header "Authorization: Bearer $DROPBOX_ACCESS_TOKEN" \
--data "{\"path\":\"$DESTINATION_DROPBOX\"}" | jq
}
trap cleanup EXIT

# Create new folder for test
echo "creating temp directory in dropbox for testing: $DESTINATION_DROPBOX"
response=$(curl -X POST -s -w "\n%{http_code}" https://api.dropboxapi.com/2/files/create_folder_v2 \
--header "Content-Type: application/json" \
--header "Authorization: Bearer $DROPBOX_ACCESS_TOKEN" \
--data "{\"autorename\":false,\"path\":\"$DESTINATION_DROPBOX\"}");
http_code=$(tail -n1 <<< "$response") # get the last line
content=$(sed '$ d' <<< "$response") # get all but the last line which contains the status code

if [ "$http_code" -ge 300 ]; then
echo "Failed to create temp dir in dropbox: [$http_code] $content"
exit 1
else
echo "$http_code:"
jq <<< "$content"
fi

PYTHONPATH=. ./unstructured/ingest/main.py \
local \
--num-processes "$max_processes" \
--metadata-exclude coordinates,filename,file_directory,metadata.data_source.date_processed,metadata.last_modified,metadata.detection_class_prob,metadata.parent_id,metadata.category_depth \
--output-dir "$OUTPUT_DIR" \
--strategy fast \
--verbose \
--reprocess \
--input-path example-docs/fake-memo.pdf \
--work-dir "$WORK_DIR" \
dropbox \
--token "$DROPBOX_ACCESS_TOKEN" \
--remote-url "dropbox://$DESTINATION_DROPBOX" \

# Simply check the number of files uploaded
expected_num_files=1
num_files_in_dropbox=$(curl -X POST https://api.dropboxapi.com/2/files/list_folder \
--header "Content-Type: application/json" \
--header "Authorization: Bearer $DROPBOX_ACCESS_TOKEN" \
--data "{\"path\":\"$DESTINATION_DROPBOX/example-docs/\"}" | jq '.entries | length')
if [ "$num_files_in_dropbox" -ne "$expected_num_files" ]; then
echo "Expected $expected_num_files files to be uploaded to dropbox, but found $num_files_in_dropbox files."
exit 1
fi
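The short-lived bearer token used throughout this script comes from Dropbox's refresh-token grant (the curl call near the top); run on its own — assuming DROPBOX_APP_KEY, DROPBOX_APP_SECRET, and DROPBOX_REFRESH_TOKEN are exported — the exchange is just:

curl -s https://api.dropbox.com/oauth2/token \
  -d grant_type=refresh_token \
  -d refresh_token="$DROPBOX_REFRESH_TOKEN" \
  -d client_id="$DROPBOX_APP_KEY" \
  -d client_secret="$DROPBOX_APP_SECRET" | jq -r '.access_token'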
2 changes: 1 addition & 1 deletion test_unstructured_ingest/test-ingest-dropbox.sh
@@ -43,7 +43,7 @@ PYTHONPATH=. ./unstructured/ingest/main.py \
--verbose \
--token "$DROPBOX_ACCESS_TOKEN" \
--recursive \
--remote-url "dropbox:// /" \
--remote-url "dropbox://test-input/" \
--work-dir "$WORK_DIR"

