Doc AI incubator tools #624

Closed
30 commits
0870977
Create README.md
bharadwajreddyvangimalla Sep 12, 2023
ec9a4bd
Update README.md
bharadwajreddyvangimalla Sep 12, 2023
3c573b3
Update README.md
bharadwajreddyvangimalla Sep 12, 2023
a8fa8e2
Update README.md
bharadwajreddyvangimalla Sep 15, 2023
2ab7dab
Update README.md
bharadwajreddyvangimalla Sep 15, 2023
6983b1c
Update README.md
bharadwajreddyvangimalla Sep 15, 2023
7c0f557
Merge branch 'GoogleCloudPlatform:main' into main
bharadwajreddyvangimalla Sep 15, 2023
012cebc
Update README.md
bharadwajreddyvangimalla Sep 15, 2023
6f75cf0
Add files via upload
bharadwajreddyvangimalla Sep 15, 2023
398a0b8
uploaded ipynb files
bharadwajreddyvangimalla Sep 21, 2023
c7df35e
Update line_item_improver.ipynb
bharadwajreddyvangimalla Sep 21, 2023
a8bfa0c
Delete DocAI Incubator Tools/line_item_improver.ipynb
bharadwajreddyvangimalla Sep 21, 2023
b221cba
Create Line item improver tool
bharadwajreddyvangimalla Sep 21, 2023
0e575fe
Delete DocAI Incubator Tools/Line item improver tool
bharadwajreddyvangimalla Sep 21, 2023
2cd3c70
Create *
bharadwajreddyvangimalla Sep 21, 2023
9f44005
Add files via upload
bharadwajreddyvangimalla Sep 21, 2023
8cbcb23
Delete DocAI Incubator Tool/Line item improver tool directory
bharadwajreddyvangimalla Sep 21, 2023
79329a2
Add files via upload
bharadwajreddyvangimalla Sep 21, 2023
b7622bf
Update line_item_improver.ipynb
bharadwajreddyvangimalla Sep 21, 2023
66862aa
Merge branch 'main' into main
holtskinner Sep 21, 2023
9deb5bb
Delete DocAI Incubator Tools/line_item_improver.ipynb
bharadwajreddyvangimalla Sep 21, 2023
35434b8
Add files via upload
bharadwajreddyvangimalla Sep 21, 2023
03bcb6b
Delete DocAI Incubator Tools/line_item_improver.ipynb
bharadwajreddyvangimalla Sep 21, 2023
9b6d788
Add files via upload
bharadwajreddyvangimalla Sep 21, 2023
717bcd6
Update line_item_improver.ipynb
bharadwajreddyvangimalla Sep 21, 2023
2d9eecc
Update line_item_improver.ipynb
bharadwajreddyvangimalla Sep 21, 2023
5f2916f
🦉 Updates from OwlBot post-processor
gcf-owl-bot[bot] Sep 21, 2023
428f7e2
Merge branch 'main' into main
holtskinner Sep 25, 2023
7e77308
Merge branch 'main' into main
holtskinner Oct 3, 2023
9c851ca
Merge branch 'main' into main
holtskinner Oct 20, 2023
Binary file not shown.
107 changes: 107 additions & 0 deletions DocAI Incubator Tools/README.md
@@ -0,0 +1,107 @@
**Doc AI Incubator team**

The Incubator team supports Doc AI clients by assisting with bugs and by providing technical guidance and solutions based on their business needs.
Experienced GCP Doc AI team members also suggest best practices for getting the most out of the product for a given business case.

**Tools**
This folder contains documents and code snippets made for the benefit of Doc AI users.


1. **File name:** DocAI PAI Best Practices Guide v1.0 - External.pdf

This document describes best practices for using Doc AI to get better performance for your business needs.
Drawing on the team's experience, it covers best practices for various processors, guidance on improving performance, and a few sample code snippets that help troubleshoot issues.

2. **File name:** Doc AI Key_Value Entity Conversion (External).pdf

This document explains how to convert the key-value pairs in the Form Parser output into entities.

Input : Form Parser output in a GCS folder
Entity names and the corresponding synonyms to be treated as keys (to look up in the Form Parser output)
GCS output folder

Expected output:
The updated JSON files, with the new entities added, are uploaded to the GCS output folder provided.
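
The conversion described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PDF: it works on a Document JSON already loaded as a Python dict (reading from and writing to GCS is left out), the function names `normalize` and `form_fields_to_entities` are hypothetical, and it assumes each form field carries its text in `textAnchor.content`.

```python
def normalize(key: str) -> str:
    """Lowercase a form-field key and strip whitespace and a trailing colon."""
    return key.strip().rstrip(":").strip().lower()


def form_fields_to_entities(document: dict, synonyms: dict) -> list:
    """Return new entities for form fields whose key matches a synonym.

    `synonyms` maps an entity name to the key spellings that should map to it,
    e.g. {"invoice_id": ["Invoice No", "Invoice Number"]} (hypothetical names).
    """
    # Invert the mapping: normalized key spelling -> entity name.
    lookup = {normalize(s): name for name, syns in synonyms.items() for s in syns}
    entities = []
    for page_index, page in enumerate(document.get("pages", [])):
        for field in page.get("formFields", []):
            # Assumption: the key text is available in textAnchor.content.
            key = field.get("fieldName", {}).get("textAnchor", {}).get("content", "")
            entity_type = lookup.get(normalize(key))
            if entity_type is None:
                continue  # this key is not one of the requested synonyms
            value = field.get("fieldValue", {})
            anchor = value.get("textAnchor", {})
            entities.append({
                "type": entity_type,
                "mentionText": anchor.get("content", "").strip(),
                "confidence": value.get("confidence", 0.0),
                "textAnchor": anchor,
                "pageAnchor": {"pageRefs": [{
                    "page": str(page_index),
                    "boundingPoly": value.get("boundingPoly", {}),
                }]},
            })
    return entities
```

The returned list would then be appended to the document's `entities` before the JSON is written back to the output folder.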

3. **File name:** DocAI - Script for Removing Empty Bounding Boxes (External).pdf

This document helps remove entities that have empty bounding boxes (entities whose mentionText contains no text).

Input : Parsed JSON files in a GCS folder
GCS output folder to upload the updated JSON files

Expected output:
The updated JSON files, with the empty-bounding-box entities removed, are uploaded to the GCS output folder.
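
As a rough sketch of the filtering step, assuming the parsed Document JSON is already loaded as a dict and that "empty" means a missing or whitespace-only `mentionText` (the helper names are hypothetical, and the script in the PDF may apply additional checks):

```python
def has_text(entity: dict) -> bool:
    """True when the entity's mentionText contains non-whitespace text."""
    return bool((entity.get("mentionText") or "").strip())


def drop_empty_entities(document: dict) -> dict:
    """Return a copy of the document without entities that have no mentionText."""
    cleaned = dict(document)  # shallow copy; the input dict is left untouched
    kept = []
    for entity in document.get("entities", []):
        if not has_text(entity):
            continue
        entity = dict(entity)
        if "properties" in entity:
            # Apply the same rule to child entities (e.g. line item fields).
            entity["properties"] = [p for p in entity["properties"] if has_text(p)]
        kept.append(entity)
    cleaned["entities"] = kept
    return cleaned
```

Note that this sketch also drops a parent entity whose own `mentionText` is empty even if its children have text; whether that is desirable depends on the processor output.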


4. **File name:** Child Entity Tag Using Header Keyword (External).pdf

This document explains how to find line item entities based on the header keywords provided.

Input : Parsed JSON files in a GCS folder
Header keywords with their matching entity names
GCS output folder to upload the updated JSON files

Expected output:
The updated JSON files have line item entities added wherever the header keywords given as input are found.
The updated JSON files are saved in the GCS output folder.

5. **File name:** HITL Visualization Tool (External).pdf

This tool takes HITL-updated JSON files and visualizes each document, drawing the entities updated in HITL (bounding boxes) and the other entities in different colours.

Input : HITL-parsed JSON files in a GCS folder

Expected output:
An Excel file containing the document image overlaid with bounding boxes, with entities updated in HITL and other entities shown in different colours, plus a dot diagram that visualizes the parent-to-child entity relations.

6. **File name:** HITL REJECTED DOCUMENTS TRACKING [External].pdf

This tool lists the documents rejected in HITL, with the reason for each rejection, in a CSV file, and also saves the rejected documents to the given GCS folder.

Input : List of LRO numbers
GCS folder to save the rejected documents

Expected output:
A CSV file with each document name and its reason for rejection, and the rejected files saved in the GCS path.

7. **File name:** PRE - POST HITL PARSER AND OCR ISSUE IDENTIFIER (External).pdf

This document explains how to assess parser performance and find OCR issues by comparing pre-HITL and post-HITL JSON files.

Input : Pre-HITL JSON files in a GCS folder
Post-HITL JSON files in a GCS folder

Expected output:
A detailed comparison between the pre-HITL and post-HITL JSON files with OCR and parser issues highlighted. Entity-wise issues are also provided in a separate analysis folder.
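
The core of such a comparison might look like the sketch below. It only diffs the `mentionText` values per entity type between one pre-HITL and one post-HITL document (both already loaded as dicts); it does not attempt to label a difference as an OCR or a parser issue, since the rule the tool uses for that is not described here. The function names and the status labels are hypothetical.

```python
from collections import defaultdict


def texts_by_type(document: dict) -> dict:
    """Group the stripped mentionText values of a document by entity type."""
    grouped = defaultdict(list)
    for entity in document.get("entities", []):
        text = (entity.get("mentionText") or "").strip()
        grouped[entity.get("type", "")].append(text)
    return grouped


def compare_entities(pre: dict, post: dict) -> list:
    """Return one row per entity type describing how HITL changed it."""
    before, after = texts_by_type(pre), texts_by_type(post)
    rows = []
    for entity_type in sorted(set(before) | set(after)):
        old = sorted(before.get(entity_type, []))
        new = sorted(after.get(entity_type, []))
        if old == new:
            status = "unchanged"
        elif not old:
            status = "added_in_hitl"
        elif not new:
            status = "removed_in_hitl"
        else:
            status = "changed_in_hitl"
        rows.append({"type": entity_type, "pre": old, "post": new, "status": status})
    return rows
```

Each row could be written to a CSV or spreadsheet to produce an entity-wise report.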

8. **File name:** Document AI Parser Result Merger(External).pdf

This tool combines several parser outputs into a single JSON file. The parser results can come from different parsers run on the same document, or from different parsers run on different documents.

Input : GCS folder containing the parser output files to combine into a single JSON file
GCS folder to save the merged JSON files

Expected output:
The merged parser output is saved in the GCS folder.
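
For the different-documents case, one plausible way to merge is to concatenate the text and pages and shift every text and page reference by the size of what came before. The sketch below does this on Document JSON dicts already loaded in memory; it is an illustration under those assumptions, not the tool's actual code, and `merge_documents` and `_shift` are hypothetical names. Index values are written back as strings because 64-bit integers are serialized as strings in the Document JSON.

```python
import copy


def _shift(node, text_offset: int, page_offset: int) -> None:
    """Recursively shift all textSegments indices and pageRefs pages in place."""
    if isinstance(node, dict):
        for segment in node.get("textSegments", []):
            # A missing startIndex means 0 in the JSON output.
            segment["startIndex"] = str(int(segment.get("startIndex", 0)) + text_offset)
            segment["endIndex"] = str(int(segment.get("endIndex", 0)) + text_offset)
        for ref in node.get("pageRefs", []):
            ref["page"] = str(int(ref.get("page", 0)) + page_offset)
        for value in node.values():
            _shift(value, text_offset, page_offset)
    elif isinstance(node, list):
        for item in node:
            _shift(item, text_offset, page_offset)


def merge_documents(documents: list) -> dict:
    """Merge several Document dicts into one, keeping anchors consistent."""
    merged = {"text": "", "pages": [], "entities": []}
    for document in documents:
        text_offset = len(merged["text"])
        page_offset = len(merged["pages"])
        pages = copy.deepcopy(document.get("pages", []))
        entities = copy.deepcopy(document.get("entities", []))
        _shift(pages, text_offset, page_offset)
        _shift(entities, text_offset, page_offset)
        for i, page in enumerate(pages):
            page["pageNumber"] = page_offset + i + 1  # page numbers are 1-based
        merged["text"] += document.get("text", "")
        merged["pages"].extend(pages)
        merged["entities"].extend(entities)
    return merged
```

When the outputs come from different parsers run on the same document, the text and pages are presumably identical, so it would be enough to keep one copy of them and append only the entities.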


1,346 changes: 1,346 additions & 0 deletions DocAI Incubator Tools/line_item_improver.ipynb

Large diffs are not rendered by default.
