Commit

Add utilities and modularize code
Deepak Moonat committed Oct 19, 2023
1 parent 5d3cef5 commit 80c130b
Showing 23 changed files with 3,696 additions and 4,068 deletions.

This file was deleted.

This file was deleted.

@@ -60,7 +60,9 @@
"cell_type": "code",
"execution_count": null,
"id": "a66c8067-7657-4873-99c0-f215d2b1fcea",
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"!pip install google-cloud-documentai"
@@ -0,0 +1,19 @@
## Overview

This script reviews documents that the Human-in-the-Loop (HITL) system has rejected. Given a set of Long Running Operation (LRO) IDs from batch processing, it reports which documents HITL rejected for those operations. It also adds a 'HITL_Status' entity to the processed JSON and relocates it to a predetermined GCS directory.

## Input Guidelines
**LRO List:** A list of operation IDs generated when batch processing completes.

**GCS HITL Rejected Path:** The GCS directory where the updated JSON data will be stored. Ensure that this path ends with a forward slash ('/').

**Project ID:** The unique identifier of your Google Cloud project.

**Location:** The region where processing takes place, such as 'us' or 'eu'.

## Output Details
After execution, the script produces output in two formats:

**CSV Format:** A file named HITL_Status_Update.csv is generated, containing the file names, their HITL status, and, where applicable, the reasons HITL rejected them.

**JSON Format:** The processed JSON gains an added 'HITL_Status' entity and is moved to the specified GCS directory.
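
A minimal sketch of the relocation step, assuming the HITL decision for a document has already been determined from its LRO metadata; the `tag_and_relocate` helper and its bucket/path arguments are illustrative assumptions, not the script's actual interface:

```python
# Illustrative sketch (assumed helper name and arguments), not the full script.
from google.cloud import documentai_v1beta3 as documentai
from google.cloud import storage


def tag_and_relocate(bucket_name: str, source_blob: str,
                     hitl_rejected_path: str, hitl_status: str) -> None:
    """Add a 'HITL_Status' entity to a processed Document JSON and move it."""
    client = storage.Client()
    bucket = client.bucket(bucket_name)

    # Download the processed JSON and parse it into a Document proto.
    doc_json = bucket.blob(source_blob).download_as_text()
    document = documentai.Document.from_json(doc_json, ignore_unknown_fields=True)

    # Append an entity recording the HITL decision (e.g. "REJECTED").
    document.entities.append(
        documentai.Document.Entity(type_="HITL_Status", mention_text=hitl_status)
    )

    # Write the updated JSON under the HITL-rejected prefix (must end with '/').
    target_blob = hitl_rejected_path + source_blob.split("/")[-1]
    bucket.blob(target_blob).upload_from_string(
        documentai.Document.to_json(document), content_type="application/json"
    )
```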

Large diffs are not rendered by default.

@@ -0,0 +1,23 @@
## Introduction
This script automates the identification of underperforming documents so that they can be prioritized for uptraining. A document's performance is measured by how many critical fields it misses. The script operates according to the following specifications:

## Input Specifications
**Labeled Document Bucket:** The source containing labeled documents.

**Destination Bucket for Underperformers:** Where the poorly performing documents will be placed.

**Project and Processor Details:** Needed to invoke the desired processor. This includes the project ID, processor ID, and its version.

**Critical Fields List:** The script first confirms that these field names match the processor schema. Any discrepancy raises an error, and the critical fields input must be updated to match the schema.

**Performance Threshold:** Determines when a document is deemed underperforming and should be transferred to the designated bucket.

## Numerical Substring Matching Criterion
**Processor-driven Document Evaluation:** The script processes documents using a designated processor. It identifies underperformers by assessing each document's critical fields against the Ground Truth (GT).

**Optional Numerical Substring Matching:** Can be enabled per entity. When enabled, a prediction is not counted as a miss as long as its numerical content matches the ground truth. For instance, if the GT is “ID 5123” and the prediction is “5123”, it is not treated as an error: the prediction is accepted as correct whenever the substring with the correct numerical digits is detected (see the sketch below).
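
A minimal sketch of how the numerical substring check could work, assuming both values are plain strings; the `matches_numerically` name and the digit-only comparison are illustrative assumptions:

```python
# Illustrative sketch of numerical substring matching (assumed helper name).
import re


def matches_numerically(ground_truth: str, prediction: str) -> bool:
    """Treat the prediction as correct if its digits match the digits in GT."""
    gt_digits = re.sub(r"\D", "", ground_truth)    # "ID 5123" -> "5123"
    pred_digits = re.sub(r"\D", "", prediction)    # "5123"    -> "5123"
    return bool(pred_digits) and pred_digits == gt_digits


# Example from the description: not counted as a miss.
assert matches_numerically("ID 5123", "5123")
```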

## Logic for Relocating Underperforming Documents Based on Thresholds
**Output of Most Underperforming Documents:** The script outputs the worst-performing documents based on a user-defined threshold. The threshold can be a fraction (for example, documents that miss more than 50% of critical fields are included in the output) or an integer (for example, any document missing more than 5 critical fields is sent to the output bucket); see the sketch at the end of this overview.
## Summary & Statistics of Output
**Missed Field List:** The script outputs a detailed list, in Sheets or CSV format, of the missed critical fields for each document transferred to the output bucket.
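
A minimal sketch of the threshold check and relocation, assuming per-document miss counts have already been computed against the GT; `is_underperforming`, `relocate`, and the bucket names are illustrative assumptions, not the script's actual helpers:

```python
# Illustrative sketch: threshold check plus GCS copy (assumed helper names).
from google.cloud import storage


def is_underperforming(missed: int, total_critical: int, threshold) -> bool:
    """Fractional thresholds compare the miss ratio; integers compare counts."""
    if isinstance(threshold, float) and threshold <= 1.0:
        return (missed / total_critical) > threshold
    return missed > int(threshold)


def relocate(source_bucket_name: str, blob_name: str, dest_bucket_name: str) -> None:
    """Copy an underperforming document's JSON into the destination bucket."""
    client = storage.Client()
    src_bucket = client.bucket(source_bucket_name)
    dest_bucket = client.bucket(dest_bucket_name)
    src_bucket.copy_blob(src_bucket.blob(blob_name), dest_bucket, blob_name)


# Example: a document missing 6 of 10 critical fields exceeds a 50% threshold.
assert is_underperforming(6, 10, 0.5)
```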
@@ -0,0 +1,335 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "33e81a43-1c3e-4ba6-8ddf-cf1371e031ca",
"metadata": {},
"source": [
"# Key Value Pair Entity Conversion"
]
},
{
"cell_type": "markdown",
"id": "357b28b9-4bc9-4c05-b144-b9bd7f08ea69",
"metadata": {},
"source": [
"* Author: [email protected]"
]
},
{
"cell_type": "markdown",
"id": "43d9299e-97fd-4bb9-813e-e3a26c29daa6",
"metadata": {},
"source": [
"## Disclaimer\n",
"\n",
"This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the DocAI Incubator Team. No guarantees of performance are implied.\t"
]
},
{
"cell_type": "markdown",
"id": "0f0dab58-25b4-4710-ad18-abdc98e0fae5",
"metadata": {},
"source": [
"## Purpose and Description"
]
},
{
"cell_type": "markdown",
"id": "dba9362f-09bd-4481-b354-ebcde6133fe9",
"metadata": {},
"source": [
"This tool uses Form parser JSON files (Parsed from a processor) from the GCS bucket as input, converts the key/value pair to the entities and stores it to the GCS bucket as JSON files."
]
},
{
"cell_type": "markdown",
"id": "f8fe7a56-d471-427f-a0ff-a7c6c44d1b3c",
"metadata": {},
"source": [
"## Prerequisites\n",
"\n",
"1. Vertex AI Notebook\n",
"2. Labeled json files in GCS Folder"
]
},
{
"cell_type": "markdown",
"id": "9955cea1-8f9f-413b-ad0b-d082f7f554e3",
"metadata": {},
"source": [
"## Step by Step procedure "
]
},
{
"cell_type": "markdown",
"id": "255d46a4-f49c-4627-9ea0-c252b50554ed",
"metadata": {},
"source": [
"### 1. Setup Input Variables"
]
},
{
"cell_type": "markdown",
"id": "c916f6ec-0329-4d1a-a0df-fd7e5dd431be",
"metadata": {},
"source": [
"\n",
" * **PROJECT_ID:** provide your GCP project ID (Optional)\n",
" * **bucket_name:** provide the bucket name \n",
" * **formparser_path:** provide the folder name of the jsons gor parsed with form parser.\n",
" * **output_path:** provide for the folder name where jsons will be saved.\n",
" * **entity_synonyms_list:** Here add the entity names in place of \"Entity_1\", \"Entity_2\" and add the synonyms related to the entity in place of \"Entity_1_synonyms_1\" and so on. Add multiple entities with their synonyms in the list.\n",
"\n",
" [{\"Entity_1\":[\"Entity_1_synonyms_1\",\"Entity_1_synonyms_2\",\"Entity_1_synonyms_3\"]},{\"Entity_2\":[\"Entity_2_synonyms_1\",\"Entity_2_synonyms_2\",\"Entity_2_synonyms_3\"]}]"
]
},
{
"cell_type": "markdown",
"id": "f9bc87bb-44f7-44d2-b69d-6fd59071279b",
"metadata": {},
"source": [
"### 2. Output"
]
},
{
"cell_type": "markdown",
"id": "707cc815-35e2-44bb-a767-48baeb9422be",
"metadata": {},
"source": [
"We get the converted Json in the GCS path which is provided in the script with the variable name **output_path**. "
]
},
{
"cell_type": "markdown",
"id": "bf831c2e-45b5-474f-b3cd-044d37ec7964",
"metadata": {},
"source": [
"![](https://screenshot.googleplex.com/43tgnEWB3HXSRpt.png)"
]
},
{
"cell_type": "markdown",
"id": "95b79b16-c50d-4e73-95f4-223c38a5188e",
"metadata": {},
"source": [
"### 3. Sample Code"
]
},
{
"cell_type": "markdown",
"id": "7f68a4db-78f5-421b-8bc9-84c4c06839bd",
"metadata": {
"tags": []
},
"source": [
"#### importing necessary modules"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "76f25301-742f-4dbc-83f4-06bf57219eaf",
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"import re\n",
"from io import BytesIO\n",
"from pathlib import Path\n",
"from utilities import *\n",
"# import gcsfs\n",
"import google.auth\n",
"from google.cloud import documentai_v1beta3 as documentai\n",
"from google.cloud import storage\n",
"from tqdm import tqdm"
]
},
{
"cell_type": "markdown",
"id": "e4e6b074-57ce-45f8-8a86-044f3e938bf0",
"metadata": {},
"source": [
"#### Setup the required inputs"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "732eec2a-3da1-4759-bb34-ecdddb5a3a50",
"metadata": {},
"outputs": [],
"source": [
"PROJECT_ID = \"xxx-xxx-xxx\" # your project id\n",
"bucket_name = \"xxxxxx\" # bucket name\n",
"\n",
"formparser_path = \"xxx/xxxxxxx/xxxxxx\" # path of the form parser output without bucket name\n",
"output_path = \"xxxx/xxxxxxxx/xxxxx\" # output path for this script without bucket name\n",
"\n",
"entity_synonyms_list = [{\"Bill_to\":[\"Bill To:\",\"Bill To\",\"BillTo\"]},\n",
" {\"Due_date\":[\"Due Date:\",\"Due Date\",\"DueDate\"]}] #example"
]
},
{
"cell_type": "markdown",
"id": "a4cef02c-b4e6-4233-b04f-cf0b98a5045d",
"metadata": {},
"source": [
"#### Execute the code"
]
},
{
"cell_type": "code",
"execution_count": 171,
"id": "b65904ff-89d1-4229-b375-47aadd51781f",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Status : 100%|██████████| 2/2 [00:01<00:00, 1.14it/s]\n"
]
}
],
"source": [
"def list_blobs(bucket_name):\n",
" \"\"\"This function will give the list of files in a bucket \n",
" args: gcs bucket name\n",
" output: list of files\"\"\"\n",
" from google.cloud import storage\n",
" blob_list = []\n",
" storage_client = storage.Client()\n",
" blobs = storage_client.list_blobs(bucket_name)\n",
" for blob in blobs:\n",
" blob_list.append(blob.name)\n",
" return blob_list\n",
"\n",
"\n",
"def store_blob(document, file_name: str):\n",
" \"\"\"\n",
" Store files in cloud storage.\n",
" \"\"\"\n",
"\n",
" storage_client = storage.Client()\n",
" process_result_bucket = storage_client.get_bucket(bucket_name)\n",
" document_blob = storage.Blob(name=str(Path(output_path, file_name)),\n",
" bucket=process_result_bucket)\n",
" document_blob.upload_from_string(document,\n",
" content_type=\"application/json\")\n",
"\n",
"\n",
"def entity_synonyms(old_entity: str):\n",
" \"\"\"\n",
" To check for any synonyms for the entites and replace.\n",
" \"\"\"\n",
" for item in entity_synonyms_list:\n",
" synonym_list = list(map(str.lower,[*item.values()][0]))\n",
" if old_entity.lower() in synonym_list:\n",
" return [*item][0]\n",
"\n",
" # if entity does not match with any synonyms, will return none.\n",
" return \"\"\n",
"\n",
"\n",
"def entity_data(formField_data, page_number: int):\n",
" \"\"\"\n",
" Function to create entity objects with some cleaning.\n",
" \"\"\"\n",
" # Cleaning the entity name\n",
" key_name = (re.sub(r\"[^\\w\\s]\",\"\", formField_data.field_name.text_anchor.content.replace(\" \", \"\").strip()))\n",
" # checking for entity synonyms\n",
" key_name = entity_synonyms(key_name)\n",
" #initializing new entity \n",
" entity = documentai.Document.Entity()\n",
" \n",
" if key_name:\n",
" entity.confidence = formField_data.field_value.confidence\n",
" entity.mention_text = formField_data.field_value.text_anchor.content\n",
" page_ref=entity.page_anchor.PageRef()\n",
" page_ref.bounding_poly.normalized_vertices.extend(formField_data.field_value.bounding_poly.normalized_vertices)\n",
" page_ref.page = page_number\n",
" entity.page_anchor.page_refs.append(page_ref)\n",
" entity.text_anchor = formField_data.field_value.text_anchor\n",
" entity.type = key_name\n",
" return entity\n",
" else:\n",
" return {}\n",
"\n",
"\n",
"def convert_kv_entities(file: str):\n",
" \"\"\"\n",
" Function to convert form parser key value to entities.\n",
" \"\"\"\n",
" # initializing entities list\n",
" file.entities = []\n",
" \n",
" for page_number, page_data in enumerate(file.pages):\n",
" for formField_number, formField_data in enumerate(\n",
" getattr(page_data,\"form_fields\", [])):\n",
" \n",
" # get the element and push it to the entities array\n",
" entity_obj = entity_data(formField_data, page_number)\n",
" if entity_obj:\n",
" file.entities.append(entity_obj)\n",
" # removing the form parser data\n",
" for page in file.pages:\n",
" del page.form_fields\n",
" del page.tables\n",
" \n",
" \n",
" \n",
"\n",
" return file\n",
"\n",
"\n",
"def main():\n",
" \"\"\"\n",
" Main function to call helper functions\n",
" \"\"\"\n",
" # fetching all the files\n",
" files = list(file_names(f\"gs://{bucket_name}/{formparser_path}\")[1].values())\n",
" for file in tqdm(files, desc=\"Status : \"):\n",
" # converting key value to entites\n",
" \n",
" entity_json = convert_kv_entities(documentai_json_proto_downloader(bucket_name,file))\n",
" \n",
" # storing the json\n",
" store_blob(documentai.Document.to_json(entity_json), file.split(\"/\")[-1])\n",
"\n",
"\n",
"# calling main function\n",
"main()"
]
}
],
"metadata": {
"environment": {
"kernel": "python3",
"name": "common-cpu.m104",
"type": "gcloud",
"uri": "gcr.io/deeplearning-platform-release/base-cpu:m104"
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
