Add a notebook demonstrating the use of DPK connector for RAG #740
base: dev
Conversation
Added notebook for DPK connector, requirements and utils
I noticed a minor bug in the PR: this code is executed even when the visited URL count is greater than 20. I will fix that in the next commit, after I get feedback on the notebook.
Thanks, @Qiragg. Nice job! I tested your branch by running the notebook and got the expected output: 58 pages were retrieved, and 18 were downloaded to the input subdirectory. Together with the "attention" and "granite" papers, that matches the 20-limit bug you noted above. This input directory is now ready to be used by @sujee's process notebook. Of course, the top part of the notebook (the overview section, including the picture) needs to be changed.
Good job. One minor comment.
@@ -2,7 +2,7 @@
data-prep-toolkit-transforms==0.2.1
data-prep-toolkit-transforms-ray==0.2.1
data-prep-connector==0.2.2.dev1
We may want to use the official release version.
Suggested change:
- data-prep-connector==0.2.2.dev1
+ data-prep-connector==0.2.2
My feedback is
Why are these changes needed?
These changes allow us to demonstrate an end-to-end pipeline that starts from the targeted acquisition of crawled content.
This PR also updates requirements.txt and the utils to support certain functions used during acquisition.
The uploaded notebook demonstrates downloading only research papers (PDF files) published at NeurIPS 2017, using path_focus and mime_type extraction. The crawled PDFs can then be fed into our RAG pipeline, and the remaining steps are unchanged starting from Step 2.2 of https://github.com/IBM/data-prep-kit/blob/dev/examples/notebooks/rag/rag_1A_dpk_process_python.ipynb
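The path_focus and mime_type filtering described above can be sketched roughly as follows. This is an illustrative stdlib-only sketch, not the DPK connector's actual API; the constant and helper names are assumptions:

```python
# Hypothetical sketch of path-focused, MIME-typed filtering during a crawl.
# PATH_FOCUS, should_follow, and should_download are illustrative names,
# not the data-prep-connector API.
from urllib.parse import urlparse

PATH_FOCUS = "/paper"               # only follow links whose path starts here
ALLOWED_MIME = {"application/pdf"}  # only keep PDF responses

def should_follow(url: str) -> bool:
    """Follow a link only if its path falls under the focus prefix."""
    return urlparse(url).path.startswith(PATH_FOCUS)

def should_download(url: str, content_type: str) -> bool:
    """Keep a response only if it is a PDF from a focused path."""
    mime = content_type.split(";")[0].strip().lower()
    return should_follow(url) and mime in ALLOWED_MIME
```

Combining a path prefix with a MIME-type check is what keeps the crawl restricted to the conference's paper PDFs rather than the whole site.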
The only issue I see: in the notebook https://github.com/IBM/data-prep-kit/blob/dev/examples/notebooks/rag/rag_2B_llamaindex_query.ipynb we run queries on the attention mechanism and the Granite model. The papers pertaining to those are not downloaded by this notebook, but I can make changes to have them downloaded during the crawl. If I make that change and don't limit the number of downloads, around ~700 PDFs will be downloaded. I don't think that will cause hiccups during the rest of the RAG steps, but I haven't tested it yet.
Let's use this PR to discuss how this notebook can fit with the rest of the RAG pipeline and whether any further changes are needed.
Related issue number (if any).
#739