PDF Support #20

Open
dmwyatt opened this issue Mar 26, 2024 · 4 comments
Labels
enhancement (New feature or request), good first issue (Good for newcomers)

Comments

@dmwyatt
Owner

dmwyatt commented Mar 26, 2024

It would be neat to be able to parse the text from PDFs and include it.

Most of the work would happen in downloader.file_utils.extract_text_files and supporting functions. You'd have to determine if a file was a PDF and then implement a function for extracting text from it. Not too big of a deal.
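As a rough illustration of the "determine if a file was a PDF" step: PDF files begin with the marker `%PDF-`, so a check on the first bytes of a file is usually enough. The helper name `looks_like_pdf` and its parameters are hypothetical and not part of this project; this is a minimal sketch, assuming the first chunk of the file is already available as bytes.

```python
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(first_chunk: bytes, allow_leading_bytes: bool = False) -> bool:
    """Guess whether a file is a PDF from its first bytes.

    Hypothetical helper, not taken from the project.

    By default the marker must be at the very start of the file. With
    allow_leading_bytes=True the marker may appear anywhere in the first
    1024 bytes, which accepts PDFs with junk before the header but can
    also misfire on a text file that merely mentions "%PDF-".
    """
    if allow_leading_bytes:
        return PDF_MAGIC in first_chunk[:1024]
    return first_chunk.startswith(PDF_MAGIC)
```

The strict default is probably the safer choice here, since repos can contain plain text files that talk about PDF headers.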

PDFs are kind of hairy to get nice text from, but luckily LLMs are pretty good at dealing with poorly-formatted text so we don't have to get crazy with making sure everything is perfect.

I think this shouldn't be too bad, so I'm going to mark this as a good first issue. Hopefully I'm not under-thinking it.

@dmwyatt added the enhancement and good first issue labels Mar 26, 2024
@SwarajBaral

Do you have any libs in mind that we could use for PDF parsing?

@dmwyatt
Owner Author

dmwyatt commented Mar 30, 2024

I've used PyMuPDF with good success in the past. I don't know if it's the best choice nowadays, so I'm open to suggestions.
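For reference, a minimal sketch of what text extraction with PyMuPDF could look like. The function names and the `open_pdf` parameter are hypothetical and exist only so the sketch can be exercised without the library installed. It assumes PyMuPDF is importable as `fitz` (newer releases also offer `import pymupdf`) and that the PDF is available as bytes in memory.

```python
def _open_with_pymupdf(data: bytes):
    # Imported lazily so this module still loads when PyMuPDF is missing.
    import fitz  # PyMuPDF

    # Open the PDF from an in-memory byte string rather than a path.
    return fitz.open(stream=data, filetype="pdf")


def extract_pdf_text(data: bytes, open_pdf=_open_with_pymupdf) -> str:
    """Return the text of every page, joined with newlines.

    Hypothetical helper, not the project's implementation. `open_pdf`
    must return an object that iterates over pages exposing get_text()
    and that has a close() method, as a PyMuPDF document does.
    """
    doc = open_pdf(data)
    try:
        return "\n".join(page.get_text() for page in doc)
    finally:
        doc.close()
```

No cleanup of the extracted text is attempted, in line with the point above that poorly formatted text is acceptable for this use.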

@SwarajBaral

Got it. So from what I understand, we basically need to allow users to upload a PDF and extract the same information that we currently do from repos, right? Please correct me if I am wrong here.

@dmwyatt
Owner Author

dmwyatt commented Mar 31, 2024

So, what happens now is that the user selects a GitHub repo, and then we go through all the files in the repo and select only the plain text files.

async def extract_text_files(

Right here we check if the file is plain text...

is_plain_text, first_chunk = is_plain_text_file(file)

The updates this issue needs are going to revolve around saying something like "if is plain text or if is PDF", and in the "if is PDF" branch, extracting the PDF contents and saving them with text_files[member.filename] = content
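The branching described above could be sketched roughly as follows. This is a simplified, synchronous stand-in and not the real `extract_text_files`: the function `collect_text_files`, the `is_plain_text` helper, the `(filename, data)` input shape, and the `pdf_to_text` callable are all assumptions made for illustration. Only the `text_files[...] = content` pattern comes from the discussion.

```python
from typing import Callable, Dict, Iterable, Tuple


def is_plain_text(data: bytes) -> bool:
    """Hypothetical stand-in for the project's is_plain_text_file check."""
    if b"\x00" in data:
        return False
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def collect_text_files(
    files: Iterable[Tuple[str, bytes]],
    pdf_to_text: Callable[[bytes], str],
) -> Dict[str, str]:
    """Keep plain text files as-is and PDFs as their extracted text."""
    text_files: Dict[str, str] = {}
    for filename, data in files:
        # Check for PDF first: a small PDF can be valid UTF-8 and would
        # otherwise be stored as raw PDF syntax by the plain text branch.
        if data.startswith(b"%PDF-"):
            text_files[filename] = pdf_to_text(data)
        elif is_plain_text(data):
            text_files[filename] = data.decode("utf-8")
        # Anything else (images, archives, ...) is skipped.
    return text_files
```

Passing the extractor in as `pdf_to_text` keeps the branching logic independent of whichever PDF library ends up being chosen.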
