When crawling the net, parse pdf documents as well #514

Closed
1 task done
sanesanyo opened this issue Apr 8, 2023 · 4 comments
Labels
enhancement New feature or request Stale

Comments

@sanesanyo

Duplicates

  • I have searched the existing issues

Summary 💡

When crawling the web for market research, many of the links turn out to be PDF documents. It would be great if Auto GPT had a built-in ability to parse those PDFs and feed the text to GPT-4 for analysis.

Examples 🌈

  • Research on investing in Emerging Markets in 2023 --> The first few hits on Google Search are pdf documents. Auto GPT fails to parse them.
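
Before any parsing can happen, the crawler has to notice that a link points at a PDF rather than an HTML page. A minimal sketch of that check, using only the Python standard library, might look like the following. The function name `looks_like_pdf` and its parameters are hypothetical and not part of Auto GPT; the sketch assumes the caller has already fetched the response headers and, optionally, the first few bytes of the body.

```python
from urllib.parse import urlparse

# PDF files start with this signature (e.g. b"%PDF-1.7").
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(url: str, content_type: str = "", head: bytes = b"") -> bool:
    """Heuristically decide whether a fetched resource is a PDF.

    Hypothetical helper: `content_type` is the raw Content-Type header and
    `head` is the first few bytes of the response body, if available.
    """
    # Strongest signal: if we have body bytes, trust the file signature.
    if head:
        return head.startswith(PDF_MAGIC)

    # Next: the media type from the Content-Type header, ignoring parameters.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type == "application/pdf":
        return True

    # Weakest: the URL path ends in ".pdf" (query strings are ignored).
    return urlparse(url).path.lower().endswith(".pdf")
```

A crawler could call this on each search hit and route PDF results to a text-extraction step instead of the HTML scraper. Checking the body bytes first avoids being fooled by a `.pdf` URL that actually serves an HTML error page.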

Motivation 🔦

This way Auto GPT can do the market research task far better than it currently can.

@Qoyyuum Qoyyuum added the enhancement New feature or request label Apr 16, 2023
@Boostrix
Contributor

Boostrix commented May 4, 2023

This would probably be a plugin that uses a Python PDF parsing library analogous to pdf2text (not sure how to mark/label the issue, or whether I lack the permissions to do so).

@anonhostpi

Agreed with @Boostrix on this one. PDF parsing is an extraneous task, and isn't as straightforward as it ought to be. It would be better to assign that to developers who are skilled in PDF parsing.

@Boostrix
Contributor

Boostrix commented May 4, 2023

There already is PR #3031 which supports plain text based PDF processing.

That would also provide the option to support arguments, such as searching a PDF file by author, date, pages, etc. (which would return a list of pages or matches).
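
As a rough illustration of what such a page search could return, here is a minimal sketch. It assumes the PDF text has already been extracted into one string per page by some PDF library. `PageMatch` and `search_pages` are hypothetical names, and this is not the interface of PR #3031.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PageMatch:
    page: int     # 1-based page number
    snippet: str  # text surrounding the first hit on that page


def search_pages(
    pages: List[str],
    query: str,
    first_page: int = 1,
    last_page: Optional[int] = None,
    context: int = 40,
) -> List[PageMatch]:
    """Case-insensitive search over already-extracted page texts.

    Returns one PageMatch per page containing `query`, restricted to the
    inclusive page range [first_page, last_page].
    """
    needle = query.lower()
    if not needle:
        return []  # an empty query would otherwise match every page

    matches = []
    for number, text in enumerate(pages, start=1):
        if number < first_page or (last_page is not None and number > last_page):
            continue
        pos = text.lower().find(needle)
        if pos == -1:
            continue
        # Keep a little context on both sides of the first hit.
        start = max(0, pos - context)
        end = pos + len(query) + context
        matches.append(PageMatch(number, text[start:end].strip()))
    return matches
```

Returning page numbers with short snippets, rather than the full text, would let the agent decide which pages are worth reading in full. Filtering by author or date would need document metadata, which this sketch does not cover.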

A higher-level command would probably be an adaptation of browse_website, or a search specifically for PDF files using different search engines/APIs (think research servers as per #826), as per: #503 (comment)

Probably covered by #2730

Plugin candidate, once the dust settles with #3652

@github-actions
Contributor

This issue was closed automatically because it has been stale for 10 days with no activity.

@github-actions github-actions bot closed this as not planned Sep 17, 2023
4 participants