Add a notebook demonstrating the use of DPK connector for RAG #740

Open · wants to merge 1 commit into base: dev
270 changes: 270 additions & 0 deletions examples/notebooks/rag/rag_1A_dpk_connector_python.ipynb
@@ -0,0 +1,270 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Before Running the notebook\n",
"\n",
"Please complete [setting up python dev environment](./setup-python-dev-env.md)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Overview\n",
"\n",
"This notebook will process PDF documents as part of RAG pipeline\n",
"\n",
"![](media/rag-overview-2.png)\n",
"\n",
"This notebook will perform steps 1, 2 and 3 in RAG pipeline.\n",
"\n",
"Here are the processing steps:\n",
"\n",
"- **pdf2parquet** : Extract text from PDF and convert them into parquet files\n",
"- **Chunk documents**: Split the PDFs into 'meaningful sections' (paragraphs, sentences ..etc)\n",
"- **Doc_ID generation**: Each chunk is assigned a uniq id, based on content and hash\n",
"- **Exact Dedup**: Chunks with exact same content are filtered out\n",
"- **Text encoder**: Convert chunks into vectors using embedding models"
]
},
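{
"cell_type": "markdown",
"metadata": {},
"source": [
"The next cell is a minimal, illustrative sketch of the **Doc_ID generation** and **Exact Dedup** steps. It is not the DPK implementation: the real pipeline uses the corresponding DPK transforms, and the sample `chunks` list and the choice of SHA-256 here are assumptions for demonstration only."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import hashlib\n",
"\n",
"# Toy stand-in for chunked documents (the real pipeline produces these from PDFs)\n",
"chunks = [\"Attention is all you need.\", \"Dropout reduces overfitting.\", \"Attention is all you need.\"]\n",
"\n",
"# Doc_ID generation: a content hash gives each chunk a stable, reproducible ID\n",
"doc_ids = [hashlib.sha256(c.encode(\"utf-8\")).hexdigest() for c in chunks]\n",
"\n",
"# Exact dedup: keep only the first occurrence of each content hash\n",
"seen = set()\n",
"deduped = [c for c, d in zip(chunks, doc_ids) if not (d in seen or seen.add(d))]\n",
"\n",
"print(f\"{len(chunks)} chunks -> {len(deduped)} after exact dedup\")"
]
},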
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Step-1: Configuration"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"from my_config import MY_CONFIG"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Step-2: Data Acqusition using the data-prep-connector\n",
"\n",
"Data-Prep-Connector\n",
"\n",
"Let's say we want to run RAG on the content published in the conference proceedings of Advances in Neural Information Processing Systems (NeurIPS) for the year 2017. The Data-Prep-Connector is a scalable and compliant web crawler that can be used to acquire targeted content for use cases such as RAG or LLM development. In this notebook example, we will run it to selectively crawl pages under a specific path (https://proceedings.neurips.cc/paper_files/paper/2017) and only save the PDFs. The crawler will automatically follow robots.txt and auto-throttle based on the server response time.\n",
"\n",
"You can of course substite your own data below"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/hash/0060ef47b12160b9198302ebdb144dcf-Abstract.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/hash/12a1d073d5ed3fa12169c67c4e2ce415-Abstract.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/hash/1264a061d82a2edae1574b07249800d6-Abstract.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/hash/136f951362dab62e64eb8e841183c2a9-Abstract.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/hash/1113d7a76ffceca1bb350bfe145467c6-Abstract.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/hash/1271a7029c9df08643b631b02cf9e116-Abstract.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/hash/10ce03a1ed01077e3e289f3e53c72813-Abstract.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/hash/10c272d06794d3e5785d5e7c5356e9ff-Abstract.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/hash/11b921ef080f7736089c757404650e40-Abstract.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/hash/1006ff12c465532f8c574aeaa4461b16-Abstract.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/hash/0f3d014eead934bbdbacb62a01dc4831-Abstract.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/hash/1177967c7957072da3dc1db4ceb30e7a-Abstract.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/hash/0e7c7d6c41c76b9ee6445ae01cc0181d-Abstract.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/hash/0e55666a4ad822e0e34299df3591d979-Abstract.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/hash/10c66082c124f8afe3df4886f5e516e0-Abstract.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/hash/0d9095b0d6bbe98ea0c9c02b11b59ee3-Abstract.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/hash/0c74b7f78409a4022a2c4c5a5ca3ee19-Abstract.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/hash/0bed45bd5774ffddc95ffe500024f628-Abstract.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/hash/ffeabd223de0d4eacb9a3e6e53e5448d-Abstract.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/hash/0efbe98067c6c73dba1250d2beaa81f9-Abstract.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/hash/0ebcc77dc72360d0eb8e9504c78d38bd-Abstract.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/12a1d073d5ed3fa12169c67c4e2ce415-Reviews.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/136f951362dab62e64eb8e841183c2a9-Reviews.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/10ce03a1ed01077e3e289f3e53c72813-Reviews.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/136f951362dab62e64eb8e841183c2a9-Paper.pdf\n",
"Saved contents of url: https://proceedings.neurips.cc/paper_files/paper/2017/file/136f951362dab62e64eb8e841183c2a9-Paper.pdf\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/1177967c7957072da3dc1db4ceb30e7a-Reviews.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/0e55666a4ad822e0e34299df3591d979-Reviews.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/0d9095b0d6bbe98ea0c9c02b11b59ee3-Reviews.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/10ce03a1ed01077e3e289f3e53c72813-Paper.pdf\n",
"Saved contents of url: https://proceedings.neurips.cc/paper_files/paper/2017/file/10ce03a1ed01077e3e289f3e53c72813-Paper.pdf\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/1177967c7957072da3dc1db4ceb30e7a-Paper.pdf\n",
"Saved contents of url: https://proceedings.neurips.cc/paper_files/paper/2017/file/1177967c7957072da3dc1db4ceb30e7a-Paper.pdf\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/ffeabd223de0d4eacb9a3e6e53e5448d-Reviews.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/0ebcc77dc72360d0eb8e9504c78d38bd-Reviews.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/0d9095b0d6bbe98ea0c9c02b11b59ee3-Paper.pdf\n",
"Saved contents of url: https://proceedings.neurips.cc/paper_files/paper/2017/file/0d9095b0d6bbe98ea0c9c02b11b59ee3-Paper.pdf\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/0efbe98067c6c73dba1250d2beaa81f9-Reviews.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/0efbe98067c6c73dba1250d2beaa81f9-Paper.pdf\n",
"Saved contents of url: https://proceedings.neurips.cc/paper_files/paper/2017/file/0efbe98067c6c73dba1250d2beaa81f9-Paper.pdf\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/ffeabd223de0d4eacb9a3e6e53e5448d-Paper.pdf\n",
"Saved contents of url: https://proceedings.neurips.cc/paper_files/paper/2017/file/ffeabd223de0d4eacb9a3e6e53e5448d-Paper.pdf\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/0bed45bd5774ffddc95ffe500024f628-Reviews.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/0bed45bd5774ffddc95ffe500024f628-Paper.pdf\n",
"Saved contents of url: https://proceedings.neurips.cc/paper_files/paper/2017/file/0bed45bd5774ffddc95ffe500024f628-Paper.pdf\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/0c74b7f78409a4022a2c4c5a5ca3ee19-Reviews.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/0c74b7f78409a4022a2c4c5a5ca3ee19-Paper.pdf\n",
"Saved contents of url: https://proceedings.neurips.cc/paper_files/paper/2017/file/0c74b7f78409a4022a2c4c5a5ca3ee19-Paper.pdf\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/10c66082c124f8afe3df4886f5e516e0-Reviews.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/10c66082c124f8afe3df4886f5e516e0-Paper.pdf\n",
"Saved contents of url: https://proceedings.neurips.cc/paper_files/paper/2017/file/10c66082c124f8afe3df4886f5e516e0-Paper.pdf\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/0e55666a4ad822e0e34299df3591d979-Paper.pdf\n",
"Saved contents of url: https://proceedings.neurips.cc/paper_files/paper/2017/file/0e55666a4ad822e0e34299df3591d979-Paper.pdf\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/0e7c7d6c41c76b9ee6445ae01cc0181d-Reviews.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/0e7c7d6c41c76b9ee6445ae01cc0181d-Paper.pdf\n",
"Saved contents of url: https://proceedings.neurips.cc/paper_files/paper/2017/file/0e7c7d6c41c76b9ee6445ae01cc0181d-Paper.pdf\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/0f3d014eead934bbdbacb62a01dc4831-Reviews.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/1006ff12c465532f8c574aeaa4461b16-Reviews.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/0f3d014eead934bbdbacb62a01dc4831-Paper.pdf\n",
"Saved contents of url: https://proceedings.neurips.cc/paper_files/paper/2017/file/0f3d014eead934bbdbacb62a01dc4831-Paper.pdf\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/11b921ef080f7736089c757404650e40-Reviews.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/1006ff12c465532f8c574aeaa4461b16-Paper.pdf\n",
"Saved contents of url: https://proceedings.neurips.cc/paper_files/paper/2017/file/1006ff12c465532f8c574aeaa4461b16-Paper.pdf\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/10c272d06794d3e5785d5e7c5356e9ff-Reviews.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/11b921ef080f7736089c757404650e40-Paper.pdf\n",
"Saved contents of url: https://proceedings.neurips.cc/paper_files/paper/2017/file/11b921ef080f7736089c757404650e40-Paper.pdf\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/10c272d06794d3e5785d5e7c5356e9ff-Paper.pdf\n",
"Saved contents of url: https://proceedings.neurips.cc/paper_files/paper/2017/file/10c272d06794d3e5785d5e7c5356e9ff-Paper.pdf\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/1271a7029c9df08643b631b02cf9e116-Reviews.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/0ebcc77dc72360d0eb8e9504c78d38bd-Paper.pdf\n",
"Saved contents of url: https://proceedings.neurips.cc/paper_files/paper/2017/file/0ebcc77dc72360d0eb8e9504c78d38bd-Paper.pdf\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/1271a7029c9df08643b631b02cf9e116-Paper.pdf\n",
"Saved contents of url: https://proceedings.neurips.cc/paper_files/paper/2017/file/1271a7029c9df08643b631b02cf9e116-Paper.pdf\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/1113d7a76ffceca1bb350bfe145467c6-Reviews.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/1264a061d82a2edae1574b07249800d6-Reviews.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/1264a061d82a2edae1574b07249800d6-Paper.pdf\n",
"Saved contents of url: https://proceedings.neurips.cc/paper_files/paper/2017/file/1264a061d82a2edae1574b07249800d6-Paper.pdf\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/12a1d073d5ed3fa12169c67c4e2ce415-Paper.pdf\n",
"Saved contents of url: https://proceedings.neurips.cc/paper_files/paper/2017/file/12a1d073d5ed3fa12169c67c4e2ce415-Paper.pdf\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/1113d7a76ffceca1bb350bfe145467c6-Paper.pdf\n",
"Saved contents of url: https://proceedings.neurips.cc/paper_files/paper/2017/file/1113d7a76ffceca1bb350bfe145467c6-Paper.pdf\n"
]
},
{
"data": {
"text/plain": [
"'Crawl is done'"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from dpk_connector import crawl, shutdown\n",
"import nest_asyncio\n",
"import os\n",
"from utils import get_mime_type, get_filename_from_url\n",
"from dpk_connector.core.utils import validate_url\n",
"\n",
"# Use nest_asyncio to enable a nested event loop run for the crawler inside the Jupyter notebook\n",
"nest_asyncio.apply()\n",
"\n",
"# Initialize counter\n",
"retrieved_pages = 0\n",
"saved_pages = 0\n",
"\n",
"# Define a callback function to be executed at the retrieval of each page during a crawl\n",
"def on_downloaded(url: str, body: bytes, headers: dict) -> None:\n",
" \"\"\"\n",
" Callback function called when a page has been downloaded.\n",
" You have access to the request URL, response body and headers.\n",
" \"\"\"\n",
" global retrieved_pages, saved_pages\n",
" retrieved_pages+=1\n",
" \n",
" if saved_pages<20:\n",
" print(f\"Visited url: {url}\")\n",
"\n",
" # Get mime_type of retrieved page\n",
" mime_type = get_mime_type(body)\n",
" \n",
" # Save the page if it is a PDF to only download research papers\n",
" if 'pdf' in mime_type.lower():\n",
" filename = get_filename_from_url(url)\n",
" local_file_path = os.path.join(MY_CONFIG.INPUT_DATA_DIR, filename)\n",
" \n",
" with open(local_file_path, 'wb') as f:\n",
" f.write(body)\n",
" \n",
" if saved_pages<20:\n",
" print(f\"Saved contents of url: {url}\")\n",
" saved_pages+=1\n",
" \n",
"# Define a user agent to provide information about the client making the request\n",
"user_agent = \"dpk-connector\"\n",
"\n",
"async def run_my_crawl():\n",
" crawl([\"https://proceedings.neurips.cc/paper_files/paper/2017\"], on_downloaded, user_agent=user_agent, depth_limit = 2, path_focus = True, download_limit = 50)\n",
" return \"Crawl is done\"\n",
"\n",
"# Now run the configured crawl\n",
"await run_my_crawl()\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Pages retrieved during the crawl: 62\n",
"Pages downloaded locally during the crawl: 20\n"
]
}
],
"source": [
"# Note that the number of retrieved pages can be slightly different from download limit set which is a soft limit\n",
"print(f'Pages retrieved during the crawl: {retrieved_pages}')\n",
"print(f'Pages downloaded locally during the crawl: {saved_pages}')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2.2 - Set input/output path variables for the pipeline\n",
"\n",
"The next steps following the content acquisition from the data-prep-connector can be followed in the same fashion as described in [rag_1A_dpk_process_python notebook](https://github.com/IBM/data-prep-kit/blob/dev/examples/notebooks/rag/rag_1A_dpk_process_python.ipynb)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "dpk",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.10"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
4 changes: 3 additions & 1 deletion examples/notebooks/rag/requirements.txt
@@ -2,7 +2,7 @@

data-prep-toolkit-transforms==0.2.1
data-prep-toolkit-transforms-ray==0.2.1

data-prep-connector==0.2.2.dev1
Collaborator: We may want to use the official release version.

Suggested change:
- data-prep-connector==0.2.2.dev1
+ data-prep-connector==0.2.2

deepsearch-toolkit


@@ -42,6 +42,7 @@ llama-index-vector-stores-milvus
# --- Utils
python-dotenv==1.0.0
humanfriendly
mimetypes-magic

## --- Jupyter Utils
jupyterlab
@@ -51,3 +52,4 @@ ipywidgets
IProgress
chardet==5.2.0
charset-normalizer==3.3.2
nest-asyncio
48 changes: 48 additions & 0 deletions examples/notebooks/rag/utils.py
@@ -2,7 +2,12 @@
import requests
from humanfriendly import format_size
import pandas as pd
import magic
import glob
from dpk_connector.core.utils import urlparse_cached
from urllib.parse import unquote

rootdir = os.path.abspath(os.path.join(__file__, "../../../../"))

@@ -70,3 +75,46 @@ def download_file(url, local_file, chunk_size=1024*1024):
print(f"{local_file} ({file_size}) downloaded successfully.")
## --- end: download_file ------


def get_mime_type(byte_data: bytes) -> str:
    """
    Obtain the MIME type for the provided byte data using the magic library.

    Args:
        byte_data (bytes): Byte data to identify the MIME type for.

    Returns:
        str: MIME type of the given byte data.

    Example:
        >>> byte_data = b'<!DOCTYPE html>\n...'
        >>> get_mime_type(byte_data)
        'text/html'
    """

    # Validate the input type
    if not isinstance(byte_data, bytes):
        raise TypeError("Input must be of type 'bytes'")

    # Initialize a magic object for MIME type detection
    mime = magic.Magic(mime=True)

    # Return the MIME type detected from the byte data
    return mime.from_buffer(byte_data)


def get_extension(url: str) -> str:
    """Extract the file extension from a URL path, truncated to at most 16 characters."""
    parsed = urlparse_cached(url)
    path = parsed.path
    filename = unquote(os.path.basename(path))
    ext = os.path.splitext(filename)[1]
    # Guard against pathological URLs with very long "extensions"
    if len(ext) > 16:
        ext = ext[:16]
    return ext


def get_filename_from_url(url: str) -> str:
    """Derive a local filename (basename plus extension) from a URL."""
    parsed = urlparse_cached(url)
    basename = unquote(os.path.splitext(os.path.basename(parsed.path))[0])
    ext = get_extension(url)
    return basename + ext
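

# Illustrative usage (hypothetical example; assumes urlparse_cached accepts a plain
# URL string, as it is used above):
#
#   url = "https://proceedings.neurips.cc/paper_files/paper/2017/file/136f951362dab62e64eb8e841183c2a9-Paper.pdf"
#   get_filename_from_url(url)   # -> '136f951362dab62e64eb8e841183c2a9-Paper.pdf'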