PDF Support #20

dmwyatt · 2024-03-26T15:24:49Z

It would be neat to be able to parse the text from PDFs and include them.

Most of the work would happen in `downloader.file_utils.extract_text_files` and its supporting functions. You'd have to determine whether a file is a PDF and then implement a function for extracting text from it. Not too big of a deal.
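As a rough sketch of the "determine if a file is a PDF" step: PDF files start with the header `%PDF-` followed by a version number, so peeking at the first few bytes is usually enough. The name `is_pdf_file` and the file-like argument are assumptions for illustration, not existing code in this repo. Some PDFs in the wild have junk bytes before the header, and this strict check would miss those.

```python
import io
from typing import BinaryIO

# Every well-formed PDF begins with this marker, e.g. b"%PDF-1.7".
PDF_MAGIC = b"%PDF-"


def is_pdf_file(file: BinaryIO) -> bool:
    """Return True if the stream starts with the PDF header.

    Hypothetical helper: reads only the first few bytes and then
    restores the stream position so the caller can keep reading.
    """
    start = file.tell()
    header = file.read(len(PDF_MAGIC))
    file.seek(start)
    return header == PDF_MAGIC


if __name__ == "__main__":
    print(is_pdf_file(io.BytesIO(b"%PDF-1.7\n...")))
    print(is_pdf_file(io.BytesIO(b"just some text")))
```

Checking the content rather than the `.pdf` extension also catches PDFs that were committed under a misleading filename.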

PDFs are kind of hairy to get nice text from, but luckily LLMs are pretty good at dealing with poorly-formatted text so we don't have to get crazy with making sure everything is perfect.

I think this shouldn't be too bad, so I'm going to mark this as a good first issue. Hopefully I'm not under-thinking it.


SwarajBaral · 2024-03-30T19:57:20Z

Do you have any libraries in mind that we could use for PDF parsing?

dmwyatt · 2024-03-30T22:06:33Z

I've used PyMuPDF with good success in the past. I don't know if it's the best choice nowadays, so I'm open to suggestions.

SwarajBaral · 2024-03-31T16:25:03Z

Got it. So from what I understand, we basically need to allow users to upload a PDF and extract the same information that we currently extract from repos, right? Please correct me if I'm wrong here.

dmwyatt · 2024-03-31T20:45:26Z

So, what happens now is that the user selects a GitHub repo, and then we go through all the files in the repo and select only the plain text files.

gh_repo_download/downloader/file_utils.py

Line 234 in 0f4d531

async def extract_text_files(

Right here we check if the file is plain text...

gh_repo_download/downloader/file_utils.py

Line 303 in 0f4d531

is_plain_text, first_chunk = is_plain_text_file(file)

The updates this issue needs are going to revolve around changing that check to something like "if is plain text **or** if is PDF". In the "is PDF" branch, extract the PDF contents and save them with `text_files[member.filename] = content`.
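A minimal sketch of that branch might look like the following. Everything here is an assumption for illustration: `collect_text`, `looks_like_text`, and `extract_pdf_with_pymupdf` are hypothetical names, the real `extract_text_files` is async and has its own signature, and `looks_like_text` is only a stand-in for the project's `is_plain_text_file`. The PyMuPDF calls (`fitz.open(stream=..., filetype="pdf")` and `page.get_text()`) are the library's usual way to read a PDF from bytes, and the import is kept inside the function so the dependency is only needed when a PDF actually shows up.

```python
import io
import zipfile
from typing import Callable, Dict

PDF_MAGIC = b"%PDF-"  # header that starts a well-formed PDF


def looks_like_text(data: bytes) -> bool:
    """Hypothetical stand-in for the project's is_plain_text_file."""
    if b"\x00" in data:
        return False
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def extract_pdf_with_pymupdf(data: bytes) -> str:
    """Extract text page by page using PyMuPDF (assumed to be installed)."""
    import fitz  # PyMuPDF; imported lazily so it is optional

    with fitz.open(stream=data, filetype="pdf") as doc:
        return "\n".join(page.get_text() for page in doc)


def collect_text(
    zip_bytes: bytes,
    extract_pdf: Callable[[bytes], str] = extract_pdf_with_pymupdf,
) -> Dict[str, str]:
    """Map each archive member's filename to its text content."""
    text_files: Dict[str, str] = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for member in archive.infolist():
            if member.is_dir():
                continue
            data = archive.read(member)
            if data.startswith(PDF_MAGIC):
                # PDF branch: convert to text before storing.
                text_files[member.filename] = extract_pdf(data)
            elif looks_like_text(data):
                # Existing behaviour: keep plain text files as-is.
                text_files[member.filename] = data.decode("utf-8")
            # Anything else (images, binaries) is skipped.
    return text_files
```

Passing the extractor in as a parameter keeps the choice of PDF library open, which fits the "open to suggestions" note above, and makes the branch easy to test without real PDFs.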

dmwyatt added the "enhancement" (New feature or request) and "good first issue" (Good for newcomers) labels on Mar 26, 2024.
