Add a notebook demonstrating the use of DPK connector for RAG #740

Open · wants to merge 1 commit into base: dev
270 changes: 270 additions & 0 deletions examples/notebooks/rag/rag_1A_dpk_connector_python.ipynb
@@ -0,0 +1,270 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Before Running the notebook\n",
"\n",
"Please complete [setting up python dev environment](./setup-python-dev-env.md)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Overview\n",
"\n",
"This notebook will process PDF documents as part of RAG pipeline\n",
"\n",
"![](media/rag-overview-2.png)\n",
"\n",
"This notebook will perform steps 1, 2 and 3 in RAG pipeline.\n",
"\n",
"Here are the processing steps:\n",
"\n",
"- **pdf2parquet** : Extract text from PDF and convert them into parquet files\n",
"- **Chunk documents**: Split the PDFs into 'meaningful sections' (paragraphs, sentences ..etc)\n",
"- **Doc_ID generation**: Each chunk is assigned a uniq id, based on content and hash\n",
"- **Exact Dedup**: Chunks with exact same content are filtered out\n",
"- **Text encoder**: Convert chunks into vectors using embedding models"
]
},
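{
"cell_type": "markdown",
"metadata": {},
"source": [
"The next cell is a minimal, illustrative sketch of the **Doc_ID generation** and **Exact Dedup** steps. It is not the DPK implementation: the real pipeline uses the corresponding DPK transforms, and the sample `chunks` list and the choice of SHA-256 here are assumptions for demonstration only."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import hashlib\n",
"\n",
"# Toy stand-in for chunked documents (the real pipeline produces these from PDFs)\n",
"chunks = [\"Attention is all you need.\", \"Dropout reduces overfitting.\", \"Attention is all you need.\"]\n",
"\n",
"# Doc_ID generation: a content hash gives each chunk a stable, reproducible ID\n",
"doc_ids = [hashlib.sha256(c.encode(\"utf-8\")).hexdigest() for c in chunks]\n",
"\n",
"# Exact dedup: keep only the first occurrence of each content hash\n",
"seen = set()\n",
"deduped = [c for c, d in zip(chunks, doc_ids) if not (d in seen or seen.add(d))]\n",
"\n",
"print(f\"{len(chunks)} chunks -> {len(deduped)} after exact dedup\")"
]
},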
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Step-1: Configuration"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"from my_config import MY_CONFIG"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Step-2: Data Acqusition using the data-prep-connector\n",
"\n",
"Data-Prep-Connector\n",
"\n",
"Let's say we want to run RAG on the content published in the conference proceedings of Advances in Neural Information Processing Systems (NeurIPS) for the year 2017. The Data-Prep-Connector is a scalable and compliant web crawler that can be used to acquire targeted content for use cases such as RAG or LLM development. In this notebook example, we will run it to selectively crawl pages under a specific path (https://proceedings.neurips.cc/paper_files/paper/2017) and only save the PDFs. The crawler will automatically follow robots.txt and auto-throttle based on the server response time.\n",
"\n",
"You can of course substite your own data below"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/hash/0060ef47b12160b9198302ebdb144dcf-Abstract.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/hash/12a1d073d5ed3fa12169c67c4e2ce415-Abstract.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/hash/1264a061d82a2edae1574b07249800d6-Abstract.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/hash/136f951362dab62e64eb8e841183c2a9-Abstract.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/hash/1113d7a76ffceca1bb350bfe145467c6-Abstract.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/hash/1271a7029c9df08643b631b02cf9e116-Abstract.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/hash/10ce03a1ed01077e3e289f3e53c72813-Abstract.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/hash/10c272d06794d3e5785d5e7c5356e9ff-Abstract.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/hash/11b921ef080f7736089c757404650e40-Abstract.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/hash/1006ff12c465532f8c574aeaa4461b16-Abstract.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/hash/0f3d014eead934bbdbacb62a01dc4831-Abstract.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/hash/1177967c7957072da3dc1db4ceb30e7a-Abstract.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/hash/0e7c7d6c41c76b9ee6445ae01cc0181d-Abstract.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/hash/0e55666a4ad822e0e34299df3591d979-Abstract.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/hash/10c66082c124f8afe3df4886f5e516e0-Abstract.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/hash/0d9095b0d6bbe98ea0c9c02b11b59ee3-Abstract.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/hash/0c74b7f78409a4022a2c4c5a5ca3ee19-Abstract.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/hash/0bed45bd5774ffddc95ffe500024f628-Abstract.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/hash/ffeabd223de0d4eacb9a3e6e53e5448d-Abstract.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/hash/0efbe98067c6c73dba1250d2beaa81f9-Abstract.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/hash/0ebcc77dc72360d0eb8e9504c78d38bd-Abstract.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/12a1d073d5ed3fa12169c67c4e2ce415-Reviews.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/136f951362dab62e64eb8e841183c2a9-Reviews.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/10ce03a1ed01077e3e289f3e53c72813-Reviews.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/136f951362dab62e64eb8e841183c2a9-Paper.pdf\n",
"Saved contents of url: https://proceedings.neurips.cc/paper_files/paper/2017/file/136f951362dab62e64eb8e841183c2a9-Paper.pdf\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/1177967c7957072da3dc1db4ceb30e7a-Reviews.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/0e55666a4ad822e0e34299df3591d979-Reviews.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/0d9095b0d6bbe98ea0c9c02b11b59ee3-Reviews.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/10ce03a1ed01077e3e289f3e53c72813-Paper.pdf\n",
"Saved contents of url: https://proceedings.neurips.cc/paper_files/paper/2017/file/10ce03a1ed01077e3e289f3e53c72813-Paper.pdf\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/1177967c7957072da3dc1db4ceb30e7a-Paper.pdf\n",
"Saved contents of url: https://proceedings.neurips.cc/paper_files/paper/2017/file/1177967c7957072da3dc1db4ceb30e7a-Paper.pdf\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/ffeabd223de0d4eacb9a3e6e53e5448d-Reviews.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/0ebcc77dc72360d0eb8e9504c78d38bd-Reviews.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/0d9095b0d6bbe98ea0c9c02b11b59ee3-Paper.pdf\n",
"Saved contents of url: https://proceedings.neurips.cc/paper_files/paper/2017/file/0d9095b0d6bbe98ea0c9c02b11b59ee3-Paper.pdf\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/0efbe98067c6c73dba1250d2beaa81f9-Reviews.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/0efbe98067c6c73dba1250d2beaa81f9-Paper.pdf\n",
"Saved contents of url: https://proceedings.neurips.cc/paper_files/paper/2017/file/0efbe98067c6c73dba1250d2beaa81f9-Paper.pdf\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/ffeabd223de0d4eacb9a3e6e53e5448d-Paper.pdf\n",
"Saved contents of url: https://proceedings.neurips.cc/paper_files/paper/2017/file/ffeabd223de0d4eacb9a3e6e53e5448d-Paper.pdf\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/0bed45bd5774ffddc95ffe500024f628-Reviews.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/0bed45bd5774ffddc95ffe500024f628-Paper.pdf\n",
"Saved contents of url: https://proceedings.neurips.cc/paper_files/paper/2017/file/0bed45bd5774ffddc95ffe500024f628-Paper.pdf\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/0c74b7f78409a4022a2c4c5a5ca3ee19-Reviews.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/0c74b7f78409a4022a2c4c5a5ca3ee19-Paper.pdf\n",
"Saved contents of url: https://proceedings.neurips.cc/paper_files/paper/2017/file/0c74b7f78409a4022a2c4c5a5ca3ee19-Paper.pdf\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/10c66082c124f8afe3df4886f5e516e0-Reviews.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/10c66082c124f8afe3df4886f5e516e0-Paper.pdf\n",
"Saved contents of url: https://proceedings.neurips.cc/paper_files/paper/2017/file/10c66082c124f8afe3df4886f5e516e0-Paper.pdf\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/0e55666a4ad822e0e34299df3591d979-Paper.pdf\n",
"Saved contents of url: https://proceedings.neurips.cc/paper_files/paper/2017/file/0e55666a4ad822e0e34299df3591d979-Paper.pdf\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/0e7c7d6c41c76b9ee6445ae01cc0181d-Reviews.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/0e7c7d6c41c76b9ee6445ae01cc0181d-Paper.pdf\n",
"Saved contents of url: https://proceedings.neurips.cc/paper_files/paper/2017/file/0e7c7d6c41c76b9ee6445ae01cc0181d-Paper.pdf\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/0f3d014eead934bbdbacb62a01dc4831-Reviews.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/1006ff12c465532f8c574aeaa4461b16-Reviews.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/0f3d014eead934bbdbacb62a01dc4831-Paper.pdf\n",
"Saved contents of url: https://proceedings.neurips.cc/paper_files/paper/2017/file/0f3d014eead934bbdbacb62a01dc4831-Paper.pdf\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/11b921ef080f7736089c757404650e40-Reviews.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/1006ff12c465532f8c574aeaa4461b16-Paper.pdf\n",
"Saved contents of url: https://proceedings.neurips.cc/paper_files/paper/2017/file/1006ff12c465532f8c574aeaa4461b16-Paper.pdf\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/10c272d06794d3e5785d5e7c5356e9ff-Reviews.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/11b921ef080f7736089c757404650e40-Paper.pdf\n",
"Saved contents of url: https://proceedings.neurips.cc/paper_files/paper/2017/file/11b921ef080f7736089c757404650e40-Paper.pdf\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/10c272d06794d3e5785d5e7c5356e9ff-Paper.pdf\n",
"Saved contents of url: https://proceedings.neurips.cc/paper_files/paper/2017/file/10c272d06794d3e5785d5e7c5356e9ff-Paper.pdf\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/1271a7029c9df08643b631b02cf9e116-Reviews.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/0ebcc77dc72360d0eb8e9504c78d38bd-Paper.pdf\n",
"Saved contents of url: https://proceedings.neurips.cc/paper_files/paper/2017/file/0ebcc77dc72360d0eb8e9504c78d38bd-Paper.pdf\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/1271a7029c9df08643b631b02cf9e116-Paper.pdf\n",
"Saved contents of url: https://proceedings.neurips.cc/paper_files/paper/2017/file/1271a7029c9df08643b631b02cf9e116-Paper.pdf\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/1113d7a76ffceca1bb350bfe145467c6-Reviews.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/1264a061d82a2edae1574b07249800d6-Reviews.html\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/1264a061d82a2edae1574b07249800d6-Paper.pdf\n",
"Saved contents of url: https://proceedings.neurips.cc/paper_files/paper/2017/file/1264a061d82a2edae1574b07249800d6-Paper.pdf\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/12a1d073d5ed3fa12169c67c4e2ce415-Paper.pdf\n",
"Saved contents of url: https://proceedings.neurips.cc/paper_files/paper/2017/file/12a1d073d5ed3fa12169c67c4e2ce415-Paper.pdf\n",
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/1113d7a76ffceca1bb350bfe145467c6-Paper.pdf\n",
"Saved contents of url: https://proceedings.neurips.cc/paper_files/paper/2017/file/1113d7a76ffceca1bb350bfe145467c6-Paper.pdf\n"
]
},
{
"data": {
"text/plain": [
"'Crawl is done'"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from dpk_connector import crawl, shutdown\n",
"import nest_asyncio\n",
"import os\n",
"from utils import get_mime_type, get_filename_from_url\n",
"from dpk_connector.core.utils import validate_url\n",
"\n",
"# Use nest_asyncio to enable a nested event loop run for the crawler inside the Jupyter notebook\n",
"nest_asyncio.apply()\n",
"\n",
"# Initialize counter\n",
"retrieved_pages = 0\n",
"saved_pages = 0\n",
"\n",
"# Define a callback function to be executed at the retrieval of each page during a crawl\n",
"def on_downloaded(url: str, body: bytes, headers: dict) -> None:\n",
" \"\"\"\n",
" Callback function called when a page has been downloaded.\n",
" You have access to the request URL, response body and headers.\n",
" \"\"\"\n",
" global retrieved_pages, saved_pages\n",
" retrieved_pages+=1\n",
" \n",
" if saved_pages<20:\n",
" print(f\"Visited url: {url}\")\n",
"\n",
" # Get mime_type of retrieved page\n",
" mime_type = get_mime_type(body)\n",
" \n",
" # Save the page if it is a PDF to only download research papers\n",
" if 'pdf' in mime_type.lower():\n",
" filename = get_filename_from_url(url)\n",
" local_file_path = os.path.join(MY_CONFIG.INPUT_DATA_DIR, filename)\n",
" \n",
" with open(local_file_path, 'wb') as f:\n",
" f.write(body)\n",
" \n",
" if saved_pages<20:\n",
" print(f\"Saved contents of url: {url}\")\n",
" saved_pages+=1\n",
" \n",
"# Define a user agent to provide information about the client making the request\n",
"user_agent = \"dpk-connector\"\n",
"\n",
"async def run_my_crawl():\n",
" crawl([\"https://proceedings.neurips.cc/paper_files/paper/2017\"], on_downloaded, user_agent=user_agent, depth_limit = 2, path_focus = True, download_limit = 50)\n",
" return \"Crawl is done\"\n",
"\n",
"# Now run the configured crawl\n",
"await run_my_crawl()\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Pages retrieved during the crawl: 62\n",
"Pages downloaded locally during the crawl: 20\n"
]
}
],
"source": [
"# Note that the number of retrieved pages can be slightly different from download limit set which is a soft limit\n",
"print(f'Pages retrieved during the crawl: {retrieved_pages}')\n",
"print(f'Pages downloaded locally during the crawl: {saved_pages}')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2.2 - Set input/output path variables for the pipeline\n",
"\n",
"The next steps following the content acquisition from the data-prep-connector can be followed in the same fashion as described in [rag_1A_dpk_process_python notebook](https://github.com/IBM/data-prep-kit/blob/dev/examples/notebooks/rag/rag_1A_dpk_process_python.ipynb)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "dpk",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.10"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
4 changes: 3 additions & 1 deletion examples/notebooks/rag/requirements.txt
@@ -2,7 +2,7 @@

data-prep-toolkit-transforms==0.2.1
data-prep-toolkit-transforms-ray==0.2.1

data-prep-connector==0.2.2.dev1
Collaborator: We may want to use the official release version.

Suggested change:
- data-prep-connector==0.2.2.dev1
+ data-prep-connector==0.2.2

deepsearch-toolkit


@@ -42,6 +42,7 @@ llama-index-vector-stores-milvus
# --- Utils
python-dotenv==1.0.0
humanfriendly
mimetypes-magic

## --- Jupyter Utils
jupyterlab
@@ -51,3 +52,4 @@ ipywidgets
IProgress
chardet==5.2.0
charset-normalizer==3.3.2
nest-asyncio
48 changes: 48 additions & 0 deletions examples/notebooks/rag/utils.py
@@ -2,7 +2,12 @@
import requests
from humanfriendly import format_size
import pandas as pd
import magic
import glob
from dpk_connector.core.utils import urlparse_cached
from urllib.parse import unquote

rootdir = os.path.abspath(os.path.join(__file__, "../../../../"))

@@ -70,3 +75,46 @@ def download_file(url, local_file, chunk_size=1024*1024):
print(f"{local_file} ({file_size}) downloaded successfully.")
## --- end: download_file ------


def get_mime_type(byte_data: bytes) -> str:
    """
    Obtain the MIME type for the provided byte data using the magic library.

    Args:
        byte_data (bytes): Byte data to identify the MIME type for.

    Returns:
        str: MIME type of the given byte data.

    Example:
        >>> byte_data = b'<!DOCTYPE html>\n...'
        >>> get_mime_type(byte_data)
        'text/html'
    """

    # Validate the input type
    if not isinstance(byte_data, bytes):
        raise TypeError("Input must be of type 'bytes'")

    # Initialize a magic object for MIME type detection
    mime = magic.Magic(mime=True)

    # Return the MIME type detected from the byte data
    return mime.from_buffer(byte_data)


def get_extension(url: str) -> str:
    """Extract the file extension from a URL path, truncated to at most 16 characters."""
    parsed = urlparse_cached(url)
    path = parsed.path
    filename = unquote(os.path.basename(path))
    ext = os.path.splitext(filename)[1]
    # Guard against pathological URLs with very long "extensions"
    if len(ext) > 16:
        ext = ext[:16]
    return ext


def get_filename_from_url(url: str) -> str:
    """Derive a local filename (basename plus extension) from a URL."""
    parsed = urlparse_cached(url)
    basename = unquote(os.path.splitext(os.path.basename(parsed.path))[0])
    ext = get_extension(url)
    return basename + ext
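

# Illustrative usage (hypothetical example; assumes urlparse_cached accepts a plain
# URL string, as it is used above):
#
#   url = "https://proceedings.neurips.cc/paper_files/paper/2017/file/136f951362dab62e64eb8e841183c2a9-Paper.pdf"
#   get_filename_from_url(url)   # -> '136f951362dab62e64eb8e841183c2a9-Paper.pdf'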