PDF Support #20
Comments
Do you have any libraries in mind that we could use for PDF parsing?
I've used PyMuPDF with good success in the past. I don't know if it's the best choice nowadays, so I'm open to suggestions.
Got it. So from what I understand, we basically need to let users upload a PDF and extract the same information from it that we currently extract from repos, right? Please correct me if I'm wrong here.
So, what happens now is that the user selects a GitHub repo, and then we go through all the files in the repo and select only the plain-text ones (gh_repo_download/downloader/file_utils.py, line 234 in 0f4d531).
Right here we check whether the file is plain text (gh_repo_download/downloader/file_utils.py, line 303 in 0f4d531).
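For anyone following along, here is a rough sketch of what a plain-text check of this kind can look like. It is not the code at the lines referenced above; the function name `looks_like_plain_text` and the `sample_size` parameter are made up for illustration, and the heuristic (no NUL bytes, decodes as UTF-8) is only one common approach.

```python
import codecs


def looks_like_plain_text(data: bytes, sample_size: int = 4096) -> bool:
    """Heuristic check: treat the data as text if a leading sample has no
    NUL bytes and decodes as UTF-8. Hypothetical helper, not the repo's code."""
    sample = data[:sample_size]
    # Binary formats almost always contain NUL bytes early on.
    if b"\x00" in sample:
        return False
    # An incremental decoder with final=False tolerates a multi-byte
    # character that was cut off at the end of the sample.
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        decoder.decode(sample, final=False)
    except UnicodeDecodeError:
        return False
    return True
```

A PDF would normally fail a check like this, which is why PDFs need their own branch rather than falling through the plain-text path.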
The updates this issue needs are going to revolve around saying something like
It would be neat to be able to parse the text from PDFs and include them.
Most of the work would happen in downloader.file_utils.extract_text_files and supporting functions. You'd have to determine whether a file is a PDF and then implement a function for extracting text from it. Not too big of a deal.
PDFs are kind of hairy to get nice text from, but luckily LLMs are pretty good at dealing with poorly formatted text, so we don't have to get crazy with making sure everything is perfect.
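The two steps described above could be sketched roughly as follows, assuming PyMuPDF (suggested in the comments) is the library used. The function names `is_pdf` and `extract_pdf_text` are hypothetical and are not part of downloader.file_utils; how they would hook into extract_text_files is left open.

```python
def is_pdf(data: bytes) -> bool:
    """Return True if the data looks like a PDF.

    PDF files start with a "%PDF-" header. Many readers tolerate some
    leading junk, so this looks in the first 1024 bytes (an assumption,
    not a strict reading of the spec).
    """
    return b"%PDF-" in data[:1024]


def extract_pdf_text(data: bytes) -> str:
    """Extract plain text from PDF bytes, one page after another.

    Hypothetical helper. Requires the third-party PyMuPDF package,
    imported lazily so the rest of the module works without it.
    """
    import fitz  # PyMuPDF

    with fitz.open(stream=data, filetype="pdf") as doc:
        # get_text() returns the page's text in plain-text mode.
        return "\n".join(page.get_text() for page in doc)
```

Checking the header bytes rather than the file extension means a misnamed file is still handled. The extracted text will often have odd line breaks and spacing, which matches the expectation above that the output does not need to be perfect.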
I think this shouldn't be too bad, so I'm going to mark this as a good first issue. Hopefully I'm not under-thinking it.