From 649d811418cdd4fbd33fadd26d2670103daaa4cd Mon Sep 17 00:00:00 2001
From: Sunile Manjee
Date: Tue, 15 Oct 2024 09:51:08 -0700
Subject: [PATCH] file names updated file names
---
 ...ve-approach-for-parsing-pdfs-in-rag.ipynb} | 100 +++++++++---------
 1 file changed, 50 insertions(+), 50 deletions(-)
 rename supporting-blog-content/alternative-approach-for-parsing-pdfs-in-rag/{PDF_Parsing_Table_Extraction.ipynb => alternative-approach-for-parsing-pdfs-in-rag.ipynb} (96%)

diff --git a/supporting-blog-content/alternative-approach-for-parsing-pdfs-in-rag/PDF_Parsing_Table_Extraction.ipynb b/supporting-blog-content/alternative-approach-for-parsing-pdfs-in-rag/alternative-approach-for-parsing-pdfs-in-rag.ipynb
similarity index 96%
rename from supporting-blog-content/alternative-approach-for-parsing-pdfs-in-rag/PDF_Parsing_Table_Extraction.ipynb
rename to supporting-blog-content/alternative-approach-for-parsing-pdfs-in-rag/alternative-approach-for-parsing-pdfs-in-rag.ipynb
index 961db7e6..71a29be2 100644
--- a/supporting-blog-content/alternative-approach-for-parsing-pdfs-in-rag/PDF_Parsing_Table_Extraction.ipynb
+++ b/supporting-blog-content/alternative-approach-for-parsing-pdfs-in-rag/alternative-approach-for-parsing-pdfs-in-rag.ipynb
@@ -1,58 +1,44 @@
 {
-  "nbformat": 4,
-  "nbformat_minor": 0,
-  "metadata": {
-    "colab": {
-      "provenance": []
-    },
-    "kernelspec": {
-      "name": "python3",
-      "display_name": "Python 3"
-    },
-    "language_info": {
-      "name": "python"
-    }
-  },
   "cells": [
     {
       "cell_type": "markdown",
+      "metadata": {
+        "id": "e9-GuDRKCz_1"
+      },
       "source": [
         "# PDF Parsing - Table Extraction\n",
         "\"Open\n"
-      ],
-      "metadata": {
-        "id": "e9-GuDRKCz_1"
-      }
+      ]
     },
     {
       "cell_type": "markdown",
+      "metadata": {
+        "id": "MBdflc9G0ICc"
+      },
       "source": [
         "##Objective\n",
         "This Python script extracts text and tables from a PDF file, converts the tables into a human-readable text format using Azure OpenAI, and writes the processed content to a text file. The script uses pdfplumber to extract text and table data from each page of the PDF. For tables, it sends a cleaned version (handling any missing or None values) to Azure OpenAI, which generates a natural language summary of the table. The extracted non-table text and the summarized table text are then saved to a text file for easy search and readability."
-      ],
-      "metadata": {
-        "id": "MBdflc9G0ICc"
-      }
+      ]
     },
     {
       "cell_type": "code",
-      "source": [
-        "!pip install pdfplumber"
-      ],
+      "execution_count": null,
       "metadata": {
         "id": "QBwz0_VNL1p6"
       },
-      "execution_count": null,
-      "outputs": []
+      "outputs": [],
+      "source": [
+        "!pip install pdfplumber"
+      ]
     },
     {
       "cell_type": "markdown",
-      "source": [
-        "This code imports necessary libraries for PDF extraction, data processing, and interacting with Azure OpenAI via API calls. It retrieves the Azure OpenAI API key and endpoint from Google Colab's userdata storage, sets up the required headers, and prepares for sending requests to the Azure OpenAI service."
-      ],
       "metadata": {
         "id": "QC37eVM90few"
-      }
+      },
+      "source": [
+        "This code imports necessary libraries for PDF extraction, data processing, and interacting with Azure OpenAI via API calls. It retrieves the Azure OpenAI API key and endpoint from Google Colab's userdata storage, sets up the required headers, and prepares for sending requests to the Azure OpenAI service."
+      ]
     },
     {
       "cell_type": "code",
@@ -85,15 +71,20 @@
     },
     {
       "cell_type": "markdown",
-      "source": [
-        "This code defines two functions: extract_table_text_from_openai and parse_pdf. The extract_table_text_from_openai function sends a table's plain text to Azure OpenAI for conversion into a human-readable description by building a request payload and handling the response. The parse_pdf function processes a PDF file page by page, extracting both text and tables, and sends the extracted tables to Azure OpenAI for summarization, saving all the content (including summarized tables) to a text file."
-      ],
       "metadata": {
         "id": "79VOKKam0leA"
-      }
+      },
+      "source": [
+        "This code defines two functions: extract_table_text_from_openai and parse_pdf. The extract_table_text_from_openai function sends a table's plain text to Azure OpenAI for conversion into a human-readable description by building a request payload and handling the response. The parse_pdf function processes a PDF file page by page, extracting both text and tables, and sends the extracted tables to Azure OpenAI for summarization, saving all the content (including summarized tables) to a text file."
+      ]
     },
     {
       "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "CdMm1AKJLKbA"
+      },
+      "outputs": [],
       "source": [
         "def extract_table_text_from_openai(table_text):\n",
         " # Payload for the Azure OpenAI request\n",
@@ -162,28 +153,37 @@
         " output_file.write(f\"Table {idx + 1} (Page {page_num}) Text Representation:\\n\")\n",
         " output_file.write(table_description + \"\\n\\n\")\n",
         " print(\"Text representation of the table:\", table_description)\n"
-      ],
-      "metadata": {
-        "id": "CdMm1AKJLKbA"
-      },
-      "execution_count": null,
-      "outputs": []
+      ]
     },
     {
       "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "7ig9NSSnLMGt"
+      },
+      "outputs": [],
       "source": [
         "# URL of the PDF file\n",
         "file_url = \"https://sunmanapp.blob.core.windows.net/publicstuff/pdfs/quarterly_report.pdf\"\n",
         "##output stored here: /content/parsed_output/parsed_output.txt\n",
         "\n",
         "# Call the function to parse the PDF from the URL\n",
-        "parse_pdf_from_url(file_url)"
-      ],
-      "metadata": {
-        "id": "7ig9NSSnLMGt"
-      },
-      "execution_count": null,
-      "outputs": []
+        "parse_pdf_from_url(file_url)\n"
+      ]
+    }
+  ],
+  "metadata": {
+    "colab": {
+      "provenance": []
+    },
+    "kernelspec": {
+      "display_name": "Python 3",
+      "name": "python3"
+    },
+    "language_info": {
+      "name": "python"
     }
-  ]
-}
\ No newline at end of file
+  },
+  "nbformat": 4,
+  "nbformat_minor": 0
+}