AWS workflow for digitizing transactional documents with Amazon Textract and a human-review loop with Amazon Augmented Artificial Intelligence (A2I). Deployed using CDK.
- Put a scanned tabular transactional PDF document in S3 at s3://tdd-<git-branch>-store-resource-<account-number>-<region>/acquire/
- This action triggers the tdd-<git-branch>-state-pipeline step function, which will:
  - Analyze the PDF document with Textract with table extraction
  - Convert the Textract table extractions to a simplified JSON format
  - Apply business rules to the simplified JSON format (currently a pass-through operation)
  - Send the PDF document and simplified JSON to A2I for human review of low-confidence values
  - Once human review is complete, convert the output to a spreadsheet that has one sheet per table found in the original PDF document
- Monitor document status in DynamoDB in the tdd-<git-branch>-table-pipeline table
- Once your document has reached the Status Augment#Waiting, it's time to perform human review using the worker portal (described below)
- After human review, find the final Excel spreadsheet in s3://tdd-<git-branch>-store-resource-<account-number>-<region>/acquire/catalog
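The upload and monitoring steps above could be scripted roughly as follows. This is a minimal sketch, not part of the project: the helper names are hypothetical, it assumes boto3 is installed and AWS credentials are configured, and it assumes the pipeline table stores the document state in an attribute named "Status" (the table's actual key schema is not described here, so the sketch scans rather than querying by key).

```python
import os


def store_bucket(branch, account, region):
    """Bucket name following the naming pattern in the steps above."""
    return f"tdd-{branch}-store-resource-{account}-{region}"


def pipeline_table(branch):
    """DynamoDB table name following the naming pattern in the steps above."""
    return f"tdd-{branch}-table-pipeline"


def upload_document(pdf_path, bucket):
    """Upload a local PDF under the acquire/ prefix and return its S3 key."""
    import boto3  # imported lazily so the pure helpers work without boto3

    key = "acquire/" + os.path.basename(pdf_path)
    boto3.client("s3").upload_file(pdf_path, bucket, key)
    return key


def scan_pipeline_table(table_name):
    """Return items from a single scan call (first page only)."""
    import boto3

    return boto3.resource("dynamodb").Table(table_name).scan().get("Items", [])


def waiting_for_review(items, status_attribute="Status"):
    """Keep the items whose status is Augment#Waiting.

    Assumption: the status lives in an attribute named "Status".
    """
    return [i for i in items if i.get(status_attribute) == "Augment#Waiting"]
```

With placeholder values, `upload_document("invoice.pdf", store_bucket("main", "123456789012", "us-east-1"))` would start the pipeline, and `waiting_for_review(scan_pipeline_table(pipeline_table("main")))` would list documents that are ready for human review.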
This interface allows annotators to review all tables found in a scanned PDF document and perform the following actions:
- Classify each table with a type
- Classify each table column header with a type
- Make cell-level text corrections to any table cell value.
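The cell values that reviewers correct originate from Textract's table output, where each WORD block carries a confidence score. The sketch below shows one way cells and low-confidence flags could be derived from a Textract AnalyzeDocument response. It is illustrative only: the output shape, the function name, and the threshold of 90 are assumptions, and it is not the project's simplified JSON format.

```python
def tables_from_blocks(blocks, threshold=90.0):
    """Group Textract CELL blocks by TABLE and flag low-confidence cells.

    `blocks` is the "Blocks" list of a Textract AnalyzeDocument response.
    The returned structure and the threshold are hypothetical.
    """
    by_id = {block["Id"]: block for block in blocks}

    def children(block):
        # Follow CHILD relationships, skipping ids missing from this page.
        ids = []
        for rel in block.get("Relationships", []):
            if rel["Type"] == "CHILD":
                ids.extend(rel["Ids"])
        return [by_id[i] for i in ids if i in by_id]

    tables = []
    for table in (b for b in blocks if b["BlockType"] == "TABLE"):
        cells = []
        for cell in children(table):
            if cell["BlockType"] != "CELL":
                continue
            words = [w for w in children(cell) if w["BlockType"] == "WORD"]
            # A cell is only as trustworthy as its least confident word.
            confidence = min(
                (w["Confidence"] for w in words),
                default=cell.get("Confidence", 100.0),
            )
            cells.append({
                "row": cell["RowIndex"],
                "column": cell["ColumnIndex"],
                "text": " ".join(w["Text"] for w in words),
                "confidence": confidence,
                "needs_review": confidence < threshold,
            })
        tables.append(sorted(cells, key=lambda c: (c["row"], c["column"])))
    return tables
```

Taking the minimum word confidence per cell is a conservative choice: a single doubtful word is enough to send the cell to a reviewer.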
This application uses SageMaker labeling workforces to manage workers and distribute tasks. Create a private workforce and two worker teams called primary and quality, then assign yourself to both teams using these instructions: https://docs.aws.amazon.com/sagemaker/latest/dg/sms-workforce-create-private-console.html#create-workforce-sm-console
Once you’ve added yourself to the private workforce teams and confirmed your email, take note of the worker portal URL in the AWS Console:
- Navigate to SageMaker
- Navigate to Ground Truth → Labeling workforces
- Click the Private tab
- Note the Labeling portal sign-in URL; you will log in here to perform A2I human reviews.
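The portal URL can also be looked up programmatically. The sketch below assumes boto3 is installed, that the work team (for example primary) already exists, and that the `SubDomain` field returned by the SageMaker `describe_workteam` call is the portal address; the function name is hypothetical.

```python
def labeling_portal_url(sagemaker_client, workteam_name):
    """Return the worker portal URL for a private work team.

    `sagemaker_client` is expected to be boto3.client("sagemaker").
    """
    response = sagemaker_client.describe_workteam(WorkteamName=workteam_name)
    subdomain = response["Workteam"]["SubDomain"]
    # The field may come back without a scheme, so add one if needed.
    if not subdomain.startswith("http"):
        subdomain = "https://" + subdomain
    return subdomain
```

A call would look like `labeling_portal_url(boto3.client("sagemaker"), "primary")`, run in the same region and account as the work team.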
Deploying this application to your AWS account will create various S3 buckets, Lambda functions, IAM policies, an SQS queue, and a step function.
Pre-Requisites
- Install the CDK Toolkit
  - npm install -g aws-cdk
- Install Docker and make sure it is running
  - For Mac: https://docs.docker.com/docker-for-mac/install
  - For Win: https://docs.docker.com/docker-for-windows/install
Instructions to Deploy the Application to the AWS Cloud
- cd TabularDocumentDigitization
- python3 -m venv .venv --prompt tdd - Create virtual environment
- source .venv/bin/activate - Enter virtual environment
- pip install . - Install dependencies in virtual environment
- cdk bootstrap - Only run this once per account setup
- Edit cdk.json and set your work team name - Pre-create the work team in the AWS Console; the name in cdk.json must match a work team in the same region and account
- cdk deploy --all - Deploy application
To deploy the same application multiple times to the same account, either change your current git branch and redeploy or set the DEPLOY environment variable.
Ex:
DEPLOY="development" cdk deploy --all
DEPLOY="production" cdk deploy --all
This will deploy two copies of the application under different prefixes.
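The prefix selection described above can be pictured as follows. This is a sketch of the described behavior, not the project's actual code: it assumes DEPLOY takes precedence over the git branch, and the function names are hypothetical.

```python
import os
import subprocess


def deployment_name(env=None):
    """Return the DEPLOY value if set, otherwise the current git branch."""
    env = os.environ if env is None else env
    name = env.get("DEPLOY")
    if name:
        return name
    # Fall back to the checked-out branch name.
    return subprocess.check_output(
        ["git", "rev-parse", "--abbrev-ref", "HEAD"], text=True
    ).strip()


def resource_prefix(env=None):
    """Prefix shared by resources such as tdd-<git-branch>-state-pipeline."""
    return f"tdd-{deployment_name(env)}"
```

Under these assumptions, the two commands in the example would produce resources prefixed with tdd-development and tdd-production.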
Install pytest dependencies
pip install pytest coverage pytest-cov pytest-xdist pytest-env
Run Lambda unit tests from the base project folder. Omit --looponfail to run the tests once instead of continuously.
PYTHONPATH=./source/lambdas/ pytest \
tests/lambdas/ \
--cov=source/lambdas/ \
--cov-report term-missing \
--looponfail