-
Notifications
You must be signed in to change notification settings - Fork 134
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a notebook demonstrating the use of DPK connector for RAG #740
Open
Qiragg
wants to merge
1
commit into
IBM:dev
Choose a base branch
from
Qiragg:dev
base: dev
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
+321
−1
Open
Changes from all commits
Commits
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
270 changes: 270 additions & 0 deletions
270
examples/notebooks/rag/rag_1A_dpk_connector_python.ipynb
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,270 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### Before Running the notebook\n", | ||
"\n", | ||
"Please complete [setting up python dev environment](./setup-python-dev-env.md)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### Overview\n", | ||
"\n", | ||
"This notebook will process PDF documents as part of RAG pipeline\n", | ||
"\n", | ||
"![](media/rag-overview-2.png)\n", | ||
"\n", | ||
"This notebook will perform steps 1, 2 and 3 in RAG pipeline.\n", | ||
"\n", | ||
"Here are the processing steps:\n", | ||
"\n", | ||
"- **pdf2parquet** : Extract text from PDF and convert them into parquet files\n", | ||
"- **Chunk documents**: Split the PDFs into 'meaningful sections' (paragraphs, sentences ..etc)\n", | ||
"- **Doc_ID generation**: Each chunk is assigned a uniq id, based on content and hash\n", | ||
"- **Exact Dedup**: Chunks with exact same content are filtered out\n", | ||
"- **Text encoder**: Convert chunks into vectors using embedding models" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### Step-1: Configuration" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 1, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"from my_config import MY_CONFIG" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### Step-2: Data Acqusition using the data-prep-connector\n", | ||
"\n", | ||
"Data-Prep-Connector\n", | ||
"\n", | ||
"Let's say we want to run RAG on the content published in the conference proceedings of Advances in Neural Information Processing Systems (NeurIPS) for the year 2017. The Data-Prep-Connector is a scalable and compliant web crawler that can be used to acquire targeted content for use cases such as RAG or LLM development. In this notebook example, we will run it to selectively crawl pages under a specific path (https://proceedings.neurips.cc/paper_files/paper/2017) and only save the PDFs. The crawler will automatically follow robots.txt and auto-throttle based on the server response time.\n", | ||
"\n", | ||
"You can of course substite your own data below" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 2, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017\n", | ||
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/hash/0060ef47b12160b9198302ebdb144dcf-Abstract.html\n", | ||
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/hash/12a1d073d5ed3fa12169c67c4e2ce415-Abstract.html\n", | ||
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/hash/1264a061d82a2edae1574b07249800d6-Abstract.html\n", | ||
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/hash/136f951362dab62e64eb8e841183c2a9-Abstract.html\n", | ||
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/hash/1113d7a76ffceca1bb350bfe145467c6-Abstract.html\n", | ||
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/hash/1271a7029c9df08643b631b02cf9e116-Abstract.html\n", | ||
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/hash/10ce03a1ed01077e3e289f3e53c72813-Abstract.html\n", | ||
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/hash/10c272d06794d3e5785d5e7c5356e9ff-Abstract.html\n", | ||
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/hash/11b921ef080f7736089c757404650e40-Abstract.html\n", | ||
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/hash/1006ff12c465532f8c574aeaa4461b16-Abstract.html\n", | ||
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/hash/0f3d014eead934bbdbacb62a01dc4831-Abstract.html\n", | ||
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/hash/1177967c7957072da3dc1db4ceb30e7a-Abstract.html\n", | ||
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/hash/0e7c7d6c41c76b9ee6445ae01cc0181d-Abstract.html\n", | ||
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/hash/0e55666a4ad822e0e34299df3591d979-Abstract.html\n", | ||
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/hash/10c66082c124f8afe3df4886f5e516e0-Abstract.html\n", | ||
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/hash/0d9095b0d6bbe98ea0c9c02b11b59ee3-Abstract.html\n", | ||
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/hash/0c74b7f78409a4022a2c4c5a5ca3ee19-Abstract.html\n", | ||
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/hash/0bed45bd5774ffddc95ffe500024f628-Abstract.html\n", | ||
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/hash/ffeabd223de0d4eacb9a3e6e53e5448d-Abstract.html\n", | ||
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/hash/0efbe98067c6c73dba1250d2beaa81f9-Abstract.html\n", | ||
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/hash/0ebcc77dc72360d0eb8e9504c78d38bd-Abstract.html\n", | ||
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/12a1d073d5ed3fa12169c67c4e2ce415-Reviews.html\n", | ||
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/136f951362dab62e64eb8e841183c2a9-Reviews.html\n", | ||
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/10ce03a1ed01077e3e289f3e53c72813-Reviews.html\n", | ||
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/136f951362dab62e64eb8e841183c2a9-Paper.pdf\n", | ||
"Saved contents of url: https://proceedings.neurips.cc/paper_files/paper/2017/file/136f951362dab62e64eb8e841183c2a9-Paper.pdf\n", | ||
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/1177967c7957072da3dc1db4ceb30e7a-Reviews.html\n", | ||
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/0e55666a4ad822e0e34299df3591d979-Reviews.html\n", | ||
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/0d9095b0d6bbe98ea0c9c02b11b59ee3-Reviews.html\n", | ||
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/10ce03a1ed01077e3e289f3e53c72813-Paper.pdf\n", | ||
"Saved contents of url: https://proceedings.neurips.cc/paper_files/paper/2017/file/10ce03a1ed01077e3e289f3e53c72813-Paper.pdf\n", | ||
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/1177967c7957072da3dc1db4ceb30e7a-Paper.pdf\n", | ||
"Saved contents of url: https://proceedings.neurips.cc/paper_files/paper/2017/file/1177967c7957072da3dc1db4ceb30e7a-Paper.pdf\n", | ||
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/ffeabd223de0d4eacb9a3e6e53e5448d-Reviews.html\n", | ||
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/0ebcc77dc72360d0eb8e9504c78d38bd-Reviews.html\n", | ||
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/0d9095b0d6bbe98ea0c9c02b11b59ee3-Paper.pdf\n", | ||
"Saved contents of url: https://proceedings.neurips.cc/paper_files/paper/2017/file/0d9095b0d6bbe98ea0c9c02b11b59ee3-Paper.pdf\n", | ||
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/0efbe98067c6c73dba1250d2beaa81f9-Reviews.html\n", | ||
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/0efbe98067c6c73dba1250d2beaa81f9-Paper.pdf\n", | ||
"Saved contents of url: https://proceedings.neurips.cc/paper_files/paper/2017/file/0efbe98067c6c73dba1250d2beaa81f9-Paper.pdf\n", | ||
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/ffeabd223de0d4eacb9a3e6e53e5448d-Paper.pdf\n", | ||
"Saved contents of url: https://proceedings.neurips.cc/paper_files/paper/2017/file/ffeabd223de0d4eacb9a3e6e53e5448d-Paper.pdf\n", | ||
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/0bed45bd5774ffddc95ffe500024f628-Reviews.html\n", | ||
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/0bed45bd5774ffddc95ffe500024f628-Paper.pdf\n", | ||
"Saved contents of url: https://proceedings.neurips.cc/paper_files/paper/2017/file/0bed45bd5774ffddc95ffe500024f628-Paper.pdf\n", | ||
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/0c74b7f78409a4022a2c4c5a5ca3ee19-Reviews.html\n", | ||
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/0c74b7f78409a4022a2c4c5a5ca3ee19-Paper.pdf\n", | ||
"Saved contents of url: https://proceedings.neurips.cc/paper_files/paper/2017/file/0c74b7f78409a4022a2c4c5a5ca3ee19-Paper.pdf\n", | ||
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/10c66082c124f8afe3df4886f5e516e0-Reviews.html\n", | ||
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/10c66082c124f8afe3df4886f5e516e0-Paper.pdf\n", | ||
"Saved contents of url: https://proceedings.neurips.cc/paper_files/paper/2017/file/10c66082c124f8afe3df4886f5e516e0-Paper.pdf\n", | ||
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/0e55666a4ad822e0e34299df3591d979-Paper.pdf\n", | ||
"Saved contents of url: https://proceedings.neurips.cc/paper_files/paper/2017/file/0e55666a4ad822e0e34299df3591d979-Paper.pdf\n", | ||
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/0e7c7d6c41c76b9ee6445ae01cc0181d-Reviews.html\n", | ||
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/0e7c7d6c41c76b9ee6445ae01cc0181d-Paper.pdf\n", | ||
"Saved contents of url: https://proceedings.neurips.cc/paper_files/paper/2017/file/0e7c7d6c41c76b9ee6445ae01cc0181d-Paper.pdf\n", | ||
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/0f3d014eead934bbdbacb62a01dc4831-Reviews.html\n", | ||
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/1006ff12c465532f8c574aeaa4461b16-Reviews.html\n", | ||
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/0f3d014eead934bbdbacb62a01dc4831-Paper.pdf\n", | ||
"Saved contents of url: https://proceedings.neurips.cc/paper_files/paper/2017/file/0f3d014eead934bbdbacb62a01dc4831-Paper.pdf\n", | ||
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/11b921ef080f7736089c757404650e40-Reviews.html\n", | ||
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/1006ff12c465532f8c574aeaa4461b16-Paper.pdf\n", | ||
"Saved contents of url: https://proceedings.neurips.cc/paper_files/paper/2017/file/1006ff12c465532f8c574aeaa4461b16-Paper.pdf\n", | ||
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/10c272d06794d3e5785d5e7c5356e9ff-Reviews.html\n", | ||
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/11b921ef080f7736089c757404650e40-Paper.pdf\n", | ||
"Saved contents of url: https://proceedings.neurips.cc/paper_files/paper/2017/file/11b921ef080f7736089c757404650e40-Paper.pdf\n", | ||
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/10c272d06794d3e5785d5e7c5356e9ff-Paper.pdf\n", | ||
"Saved contents of url: https://proceedings.neurips.cc/paper_files/paper/2017/file/10c272d06794d3e5785d5e7c5356e9ff-Paper.pdf\n", | ||
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/1271a7029c9df08643b631b02cf9e116-Reviews.html\n", | ||
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/0ebcc77dc72360d0eb8e9504c78d38bd-Paper.pdf\n", | ||
"Saved contents of url: https://proceedings.neurips.cc/paper_files/paper/2017/file/0ebcc77dc72360d0eb8e9504c78d38bd-Paper.pdf\n", | ||
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/1271a7029c9df08643b631b02cf9e116-Paper.pdf\n", | ||
"Saved contents of url: https://proceedings.neurips.cc/paper_files/paper/2017/file/1271a7029c9df08643b631b02cf9e116-Paper.pdf\n", | ||
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/1113d7a76ffceca1bb350bfe145467c6-Reviews.html\n", | ||
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/1264a061d82a2edae1574b07249800d6-Reviews.html\n", | ||
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/1264a061d82a2edae1574b07249800d6-Paper.pdf\n", | ||
"Saved contents of url: https://proceedings.neurips.cc/paper_files/paper/2017/file/1264a061d82a2edae1574b07249800d6-Paper.pdf\n", | ||
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/12a1d073d5ed3fa12169c67c4e2ce415-Paper.pdf\n", | ||
"Saved contents of url: https://proceedings.neurips.cc/paper_files/paper/2017/file/12a1d073d5ed3fa12169c67c4e2ce415-Paper.pdf\n", | ||
"Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/file/1113d7a76ffceca1bb350bfe145467c6-Paper.pdf\n", | ||
"Saved contents of url: https://proceedings.neurips.cc/paper_files/paper/2017/file/1113d7a76ffceca1bb350bfe145467c6-Paper.pdf\n" | ||
] | ||
}, | ||
{ | ||
"data": { | ||
"text/plain": [ | ||
"'Crawl is done'" | ||
] | ||
}, | ||
"execution_count": 2, | ||
"metadata": {}, | ||
"output_type": "execute_result" | ||
} | ||
], | ||
"source": [ | ||
"from dpk_connector import crawl, shutdown\n", | ||
"import nest_asyncio\n", | ||
"import os\n", | ||
"from utils import get_mime_type, get_filename_from_url\n", | ||
"from dpk_connector.core.utils import validate_url\n", | ||
"\n", | ||
"# Use nest_asyncio to enable a nested event loop run for the crawler inside the Jupyter notebook\n", | ||
"nest_asyncio.apply()\n", | ||
"\n", | ||
"# Initialize counter\n", | ||
"retrieved_pages = 0\n", | ||
"saved_pages = 0\n", | ||
"\n", | ||
"# Define a callback function to be executed at the retrieval of each page during a crawl\n", | ||
"def on_downloaded(url: str, body: bytes, headers: dict) -> None:\n", | ||
" \"\"\"\n", | ||
" Callback function called when a page has been downloaded.\n", | ||
" You have access to the request URL, response body and headers.\n", | ||
" \"\"\"\n", | ||
" global retrieved_pages, saved_pages\n", | ||
" retrieved_pages+=1\n", | ||
" \n", | ||
" if saved_pages<20:\n", | ||
" print(f\"Visited url: {url}\")\n", | ||
"\n", | ||
" # Get mime_type of retrieved page\n", | ||
" mime_type = get_mime_type(body)\n", | ||
" \n", | ||
" # Save the page if it is a PDF to only download research papers\n", | ||
" if 'pdf' in mime_type.lower():\n", | ||
" filename = get_filename_from_url(url)\n", | ||
" local_file_path = os.path.join(MY_CONFIG.INPUT_DATA_DIR, filename)\n", | ||
" \n", | ||
" with open(local_file_path, 'wb') as f:\n", | ||
" f.write(body)\n", | ||
" \n", | ||
" if saved_pages<20:\n", | ||
" print(f\"Saved contents of url: {url}\")\n", | ||
" saved_pages+=1\n", | ||
" \n", | ||
"# Define a user agent to provide information about the client making the request\n", | ||
"user_agent = \"dpk-connector\"\n", | ||
"\n", | ||
"async def run_my_crawl():\n", | ||
" crawl([\"https://proceedings.neurips.cc/paper_files/paper/2017\"], on_downloaded, user_agent=user_agent, depth_limit = 2, path_focus = True, download_limit = 50)\n", | ||
" return \"Crawl is done\"\n", | ||
"\n", | ||
"# Now run the configured crawl\n", | ||
"await run_my_crawl()\n", | ||
"\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 3, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"Pages retrieved during the crawl: 62\n", | ||
"Pages downloaded locally during the crawl: 20\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"# Note that the number of retrieved pages can be slightly different from download limit set which is a soft limit\n", | ||
"print(f'Pages retrieved during the crawl: {retrieved_pages}')\n", | ||
"print(f'Pages downloaded locally during the crawl: {saved_pages}')" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## 2.2 - Set input/output path variables for the pipeline\n", | ||
"\n", | ||
"The next steps following the content acquisition from the data-prep-connector can be followed in the same fashion as described in [rag_1A_dpk_process_python notebook](https://github.com/IBM/data-prep-kit/blob/dev/examples/notebooks/rag/rag_1A_dpk_process_python.ipynb)" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "dpk", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.11.10" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 2 | ||
} |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may want to use the official release version.