expand fsspec downstream connectors (#1777)
### Description
Replaces PR #1383.

---------

Co-authored-by: Trevor Bossert <[email protected]>
Co-authored-by: ryannikolaidis <[email protected]>
Co-authored-by: rbiseck3 <[email protected]>
4 people authored Oct 30, 2023
1 parent 645a0fb commit 680cfba
Showing 44 changed files with 760 additions and 129 deletions.
47 changes: 46 additions & 1 deletion .github/workflows/ci.yml
@@ -11,7 +11,35 @@ on:
env:
GHA_CACHE_KEY_VERSION: "v1"

permissions:
id-token: write
contents: read

jobs:
test_logins:
runs-on: ubuntu-latest
steps:
- uses: 'actions/checkout@v4'
- name: 'Google Cloud Auth'
uses: 'google-github-actions/auth@v1'
id: gauth
with:
workload_identity_provider: ${{ secrets.GCP_WORKLOAD_IDENTITY_PROVIDER }}
service_account: ${{ secrets.GCP_SERVICE_ACCOUNT }}
- name: 'Set up Cloud SDK'
uses: 'google-github-actions/setup-gcloud@v1'
- name: 'run gcloud command'
run: |-
gcloud projects list
- name: 'Az CLI login'
uses: azure/login@v1
with:
client-id: ${{ secrets.AZURE_CLIENT_ID }}
tenant-id: ${{ secrets.AZURE_TENANT_ID }}
subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
- name: 'azure test command'
run: |-
az account show
setup:
strategy:
matrix:
@@ -268,6 +296,7 @@ jobs:
test_ingest:
environment: ci
strategy:
matrix:
python-version: ["3.8","3.9","3.10","3.11"]
@@ -276,7 +305,23 @@ jobs:
NLTK_DATA: ${{ github.workspace }}/nltk_data
needs: [setup_ingest, lint]
steps:
- uses: actions/checkout@v3
# actions/checkout MUST come before auth
- uses: 'actions/checkout@v4'
- name: 'Google Cloud Auth'
uses: 'google-github-actions/auth@v1'
with:
workload_identity_provider: ${{ secrets.GCP_WORKLOAD_IDENTITY_PROVIDER }}
service_account: ${{ secrets.GCP_SERVICE_ACCOUNT }}
create_credentials_file: true
activate_credentials_file: true
- name: 'Set up Cloud SDK'
uses: 'google-github-actions/setup-gcloud@v1'
- name: 'Az CLI login'
uses: azure/login@v1
with:
client-id: ${{ secrets.AZURE_CLIENT_ID }}
tenant-id: ${{ secrets.AZURE_TENANT_ID }}
subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v4
with:
55 changes: 54 additions & 1 deletion .github/workflows/ingest-test-fixtures-update-pr.yml
@@ -8,7 +8,36 @@ env:
GHA_CACHE_KEY_VERSION: "v1"
PYTHON_VERSION: "3.10"

permissions:
id-token: write
contents: read

jobs:
test_logins:
runs-on: ubuntu-latest
environment: ci
steps:
- uses: 'actions/checkout@v4'
- name: 'Google Cloud Auth'
uses: 'google-github-actions/auth@v1'
id: gauth
with:
workload_identity_provider: ${{ secrets.GCP_WORKLOAD_IDENTITY_PROVIDER }}
service_account: ${{ secrets.GCP_SERVICE_ACCOUNT }}
- name: 'Set up Cloud SDK'
uses: 'google-github-actions/setup-gcloud@v1'
- name: 'run gcloud command'
run: |-
gcloud projects list
- name: 'Az CLI login'
uses: azure/login@v1
with:
client-id: ${{ secrets.AZURE_CLIENT_ID }}
tenant-id: ${{ secrets.AZURE_TENANT_ID }}
subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
- name: 'azure test command'
run: |-
az account show
setup:
runs-on: ubuntu-latest
if: |
@@ -75,12 +104,36 @@ jobs:
make install-all-ingest
update-fixtures-and-pr:
environment: ci
runs-on: ubuntu-latest-m
env:
NLTK_DATA: ${{ github.workspace }}/nltk_data
needs: [setup_ingest]
steps:
- uses: actions/checkout@v3
# actions/checkout MUST come before auth
- uses: 'actions/checkout@v4'
- name: 'Google Cloud Auth'
uses: 'google-github-actions/auth@v1'
with:
workload_identity_provider: ${{ secrets.GCP_WORKLOAD_IDENTITY_PROVIDER }}
service_account: ${{ secrets.GCP_SERVICE_ACCOUNT }}
create_credentials_file: true
activate_credentials_file: true
- name: 'Set up Cloud SDK'
uses: 'google-github-actions/setup-gcloud@v1'
- name: 'Az CLI login'
uses: azure/login@v1
with:
client-id: ${{ secrets.AZURE_CLIENT_ID }}
tenant-id: ${{ secrets.AZURE_TENANT_ID }}
subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
- name: Set up Python ${{ env.PYTHON_VERSION }}
uses: actions/setup-python@v4
with:
python-version: ${{ env.PYTHON_VERSION }}
- name: Get full Python version
id: full-python-version
run: echo version=$(python -c "import sys; print('-'.join(str(v) for v in sys.version_info))") >> $GITHUB_OUTPUT
- uses: actions/cache/restore@v3
id: virtualenv-cache
with:
9 changes: 7 additions & 2 deletions CHANGELOG.md
@@ -1,11 +1,16 @@
## 0.10.28-dev4
## 0.10.28-dev5

### Enhancements

* **Add element type CI evaluation workflow** Adds element type frequency evaluation metrics to the current ingest workflow to measure the performance of each file extracted as well as aggregated-level performance.
* **Add table structure evaluation helpers** Adds functions to evaluate the similarity between predicted table structure and actual table structure.
* **Use `yolox` by default for table extraction when partitioning pdf/image** The `yolox` model provides higher recall of table regions than the quantized version, and it is now the default element detection model when `infer_table_structure=True` for partitioning pdf/image files.
* **Remove pdfminer elements from inside tables** Previously, when using `hi_res`, some elements were also extracted by pdfminer, so pdfminer was removed from the tables pipeline to avoid duplicated elements.
* **Fsspec downstream connectors** New destination connectors added to the ingest CLI; users may now use `unstructured-ingest` to write to any of the following:
  * Azure
  * Box
  * Dropbox
  * Google Cloud Storage (GCS)
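As a rough sketch of how these destinations are addressed (the container name and run id below are made-up illustration values, not from this diff; the real invocations appear in the new `test-ingest-*-dest.sh` scripts), each connector takes an fsspec-style URL whose scheme selects the backend:

```shell
# Sketch: composing an fsspec-style destination URL. The source subcommand and
# its options come first on the CLI, then the destination subcommand and its
# options. CONTAINER and RUN_ID here are hypothetical.
CONTAINER=my-container
RUN_ID=12345
REMOTE_URL="abfs://$CONTAINER/$RUN_ID/"   # abfs:// selects the Azure backend

# The real call (see test-ingest-azure-dest.sh) then looks roughly like:
#   PYTHONPATH=. ./unstructured/ingest/main.py \
#     local --input-path example-docs/fake-memo.pdf ... \
#     azure --remote-url "$REMOTE_URL" --connection-string "$AZURE_DEST_CONNECTION_STR"
echo "$REMOTE_URL"   # prints abfs://my-container/12345/
```

The same shape holds for the other backends by swapping the scheme (e.g. `dropbox://`, `box://`).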

### Features

@@ -1609,4 +1614,4 @@ of an email.

## 0.2.0

* Initial release of unstructured
* Initial release of unstructured
4 changes: 2 additions & 2 deletions test_unstructured_ingest/metrics/aggregate-scores-cct.tsv
@@ -1,3 +1,3 @@
strategy average sample_sd population_sd count
cct-accuracy 0.777 0.088 0.072 3
cct-%missing 0.087 0.045 0.037 3
cct-accuracy 0.798 0.083 0.072 4
cct-%missing 0.087 0.037 0.032 4
2 changes: 1 addition & 1 deletion test_unstructured_ingest/metrics/all-docs-cct.tsv
@@ -1,4 +1,4 @@
filename connector cct-accuracy cct-%missing
science-exploration-1p.pptx box 0.861 0.09
example-10k.html local 0.686 0.04
IRS-form-1987.pdf azure 0.783 0.13
IRS-form-1987.pdf azure 0.783 0.13
57 changes: 57 additions & 0 deletions test_unstructured_ingest/test-ingest-azure-dest.sh
@@ -0,0 +1,57 @@
#!/usr/bin/env bash

set -e

SCRIPT_DIR=$(dirname "$(realpath "$0")")
cd "$SCRIPT_DIR"/.. || exit 1
OUTPUT_FOLDER_NAME=azure-dest
OUTPUT_DIR=$SCRIPT_DIR/structured-output/$OUTPUT_FOLDER_NAME
WORK_DIR=$SCRIPT_DIR/workdir/$OUTPUT_FOLDER_NAME
max_processes=${MAX_PROCESSES:=$(python3 -c "import os; print(os.cpu_count())")}

if [ -z "$AZURE_DEST_CONNECTION_STR" ]; then
echo "Skipping Azure destination ingest test because the AZURE_DEST_CONNECTION_STR env var is not set."
exit 0
fi

CONTAINER=utic-ingest-test-fixtures-output
DIRECTORY=$(date +%s)
REMOTE_URL="abfs://$CONTAINER/$DIRECTORY/"

# shellcheck disable=SC1091
source "$SCRIPT_DIR"/cleanup.sh
function cleanup() {
cleanup_dir "$OUTPUT_DIR"
cleanup_dir "$WORK_DIR"

echo "deleting azure storage blob directory $CONTAINER/$DIRECTORY"
az storage fs directory delete -f "$CONTAINER" -n "$DIRECTORY" --connection-string "$AZURE_DEST_CONNECTION_STR" --yes

}
trap cleanup EXIT

# Create directory to use for testing
az storage fs directory create -f "$CONTAINER" -n "$DIRECTORY" --connection-string "$AZURE_DEST_CONNECTION_STR"

PYTHONPATH=. ./unstructured/ingest/main.py \
local \
--num-processes "$max_processes" \
--metadata-exclude coordinates,filename,file_directory,metadata.data_source.date_processed,metadata.last_modified,metadata.detection_class_prob,metadata.parent_id,metadata.category_depth \
--output-dir "$OUTPUT_DIR" \
--strategy fast \
--verbose \
--reprocess \
--input-path example-docs/fake-memo.pdf \
--work-dir "$WORK_DIR" \
azure \
--overwrite \
--remote-url "$REMOTE_URL" \
--connection-string "$AZURE_DEST_CONNECTION_STR"

# Simply check the number of files uploaded
expected_num_files=1
num_files_in_azure=$(az storage blob list -c "$CONTAINER" --prefix "$DIRECTORY"/example-docs/ --connection-string "$AZURE_DEST_CONNECTION_STR" | jq 'length')
if [ "$num_files_in_azure" -ne "$expected_num_files" ]; then
echo "Expected $expected_num_files files to be uploaded to azure, but found $num_files_in_azure files."
exit 1
fi
54 changes: 54 additions & 0 deletions test_unstructured_ingest/test-ingest-box-dest.sh
@@ -0,0 +1,54 @@
#!/usr/bin/env bash
#TODO currently box api/sdk does not work to create folders and check for content similar to other fsspec ingest tests

#
#set -e
#
#SCRIPT_DIR=$(dirname "$(realpath "$0")")
#cd "$SCRIPT_DIR"/.. || exit 1
#OUTPUT_FOLDER_NAME=box-dest
#OUTPUT_DIR=$SCRIPT_DIR/structured-output/$OUTPUT_FOLDER_NAME
#WORK_DIR=$SCRIPT_DIR/workdir/$OUTPUT_FOLDER_NAME
#max_processes=${MAX_PROCESSES:=$(python3 -c "import os; print(os.cpu_count())")}
#DESTINATION_BOX="box://utic-dev-tech-fixtures/utic-ingest-test-fixtures-output/$(date +%s)/"
#
#CI=${CI:-"false"}
#
#if [ -z "$BOX_APP_CONFIG" ] && [ -z "$BOX_APP_CONFIG_PATH" ]; then
# echo "Skipping Box ingest test because neither BOX_APP_CONFIG nor BOX_APP_CONFIG_PATH env vars are set."
# exit 0
#fi
#
#if [ -z "$BOX_APP_CONFIG_PATH" ]; then
# # Create temporary service key file
# BOX_APP_CONFIG_PATH=$(mktemp)
# echo "$BOX_APP_CONFIG" >"$BOX_APP_CONFIG_PATH"
#fi
#
## shellcheck disable=SC1091
#source "$SCRIPT_DIR"/cleanup.sh
#function cleanup() {
# cleanup_dir "$OUTPUT_DIR"
# cleanup_dir "$WORK_DIR"
# if [ "$CI" == "true" ]; then
# cleanup_dir "$DOWNLOAD_DIR"
# fi
#}
#trap cleanup EXIT
#
#PYTHONPATH=. ./unstructured/ingest/main.py \
# local \
# --num-processes "$max_processes" \
# --metadata-exclude coordinates,filename,file_directory,metadata.data_source.date_processed,metadata.last_modified,metadata.detection_class_prob,metadata.parent_id,metadata.category_depth \
# --output-dir "$OUTPUT_DIR" \
# --strategy fast \
# --verbose \
# --reprocess \
# --input-path example-docs/fake-memo.pdf \
# --work-dir "$WORK_DIR" \
# box \
# --box-app-config "$BOX_APP_CONFIG_PATH" \
# --remote-url "$DESTINATION_BOX" \
#
## Simply check the number of files uploaded
#expected_num_files=1
81 changes: 81 additions & 0 deletions test_unstructured_ingest/test-ingest-dropbox-dest.sh
@@ -0,0 +1,81 @@
#!/usr/bin/env bash

set -e

SCRIPT_DIR=$(dirname "$(realpath "$0")")
cd "$SCRIPT_DIR"/.. || exit 1
OUTPUT_FOLDER_NAME=dropbox-dest
OUTPUT_DIR=$SCRIPT_DIR/structured-output/$OUTPUT_FOLDER_NAME
WORK_DIR=$SCRIPT_DIR/workdir/$OUTPUT_FOLDER_NAME
max_processes=${MAX_PROCESSES:=$(python3 -c "import os; print(os.cpu_count())")}
DESTINATION_DROPBOX="/test-output/$(date +%s)"
CI=${CI:-"false"}

if [ -z "$DROPBOX_APP_KEY" ] || [ -z "$DROPBOX_APP_SECRET" ] || [ -z "$DROPBOX_REFRESH_TOKEN" ]; then
echo "Skipping Dropbox ingest test because one or more of these env vars is not set:"
echo "DROPBOX_APP_KEY, DROPBOX_APP_SECRET, DROPBOX_REFRESH_TOKEN"
exit 0
fi

# Get a new access token from Dropbox
DROPBOX_RESPONSE=$(curl -s https://api.dropbox.com/oauth2/token -d refresh_token="$DROPBOX_REFRESH_TOKEN" -d grant_type=refresh_token -d client_id="$DROPBOX_APP_KEY" -d client_secret="$DROPBOX_APP_SECRET")
DROPBOX_ACCESS_TOKEN=$(jq -r '.access_token' <<< "$DROPBOX_RESPONSE")

# shellcheck disable=SC1091
source "$SCRIPT_DIR"/cleanup.sh
function cleanup() {
cleanup_dir "$OUTPUT_DIR"
cleanup_dir "$WORK_DIR"
if [ "$CI" == "true" ]; then
cleanup_dir "$DOWNLOAD_DIR"
fi

echo "deleting test folder $DESTINATION_DROPBOX"
curl -X POST https://api.dropboxapi.com/2/files/delete_v2 \
--header "Content-Type: application/json" \
--header "Authorization: Bearer $DROPBOX_ACCESS_TOKEN" \
--data "{\"path\":\"$DESTINATION_DROPBOX\"}" | jq
}
trap cleanup EXIT

# Create new folder for test
echo "creating temp directory in dropbox for testing: $DESTINATION_DROPBOX"
response=$(curl -X POST -s -w "\n%{http_code}" https://api.dropboxapi.com/2/files/create_folder_v2 \
--header "Content-Type: application/json" \
--header "Authorization: Bearer $DROPBOX_ACCESS_TOKEN" \
--data "{\"autorename\":false,\"path\":\"$DESTINATION_DROPBOX\"}");
http_code=$(tail -n1 <<< "$response") # get the last line
content=$(sed '$ d' <<< "$response") # get all but the last line which contains the status code

if [ "$http_code" -ge 300 ]; then
echo "Failed to create temp dir in dropbox: [$http_code] $content"
exit 1
else
echo "$http_code:"
jq <<< "$content"
fi

PYTHONPATH=. ./unstructured/ingest/main.py \
local \
--num-processes "$max_processes" \
--metadata-exclude coordinates,filename,file_directory,metadata.data_source.date_processed,metadata.last_modified,metadata.detection_class_prob,metadata.parent_id,metadata.category_depth \
--output-dir "$OUTPUT_DIR" \
--strategy fast \
--verbose \
--reprocess \
--input-path example-docs/fake-memo.pdf \
--work-dir "$WORK_DIR" \
dropbox \
--token "$DROPBOX_ACCESS_TOKEN" \
--remote-url "dropbox://$DESTINATION_DROPBOX"

# Simply check the number of files uploaded
expected_num_files=1
num_files_in_dropbox=$(curl -X POST https://api.dropboxapi.com/2/files/list_folder \
--header "Content-Type: application/json" \
--header "Authorization: Bearer $DROPBOX_ACCESS_TOKEN" \
--data "{\"path\":\"$DESTINATION_DROPBOX/example-docs/\"}" | jq '.entries | length')
if [ "$num_files_in_dropbox" -ne "$expected_num_files" ]; then
echo "Expected $expected_num_files files to be uploaded to dropbox, but found $num_files_in_dropbox files."
exit 1
fi
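The status-code handling in the script above uses a small curl pattern worth isolating: `-w "\n%{http_code}"` appends the HTTP status on its own final line, and `tail`/`sed` split it back apart. A self-contained demo of just that splitting step (no network call; the response string is simulated):

```shell
# Demo of the response-splitting pattern: the last line holds the status
# code, everything before it is the response body.
response=$'{"ok":true}\n201'          # simulated curl output with -w "\n%{http_code}"
http_code=$(tail -n1 <<<"$response")  # last line -> status code
content=$(sed '$ d' <<<"$response")   # drop last line -> body
echo "$http_code"   # prints 201
echo "$content"     # prints {"ok":true}
```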
2 changes: 1 addition & 1 deletion test_unstructured_ingest/test-ingest-dropbox.sh
@@ -43,7 +43,7 @@ PYTHONPATH=. ./unstructured/ingest/main.py \
--verbose \
--token "$DROPBOX_ACCESS_TOKEN" \
--recursive \
--remote-url "dropbox:// /" \
--remote-url "dropbox://test-input/" \
--work-dir "$WORK_DIR"

