When crawling the net, parse pdf documents as well #514

sanesanyo · 2023-04-08T19:36:30Z

Duplicates

I have searched the existing issues

Summary 💡

When crawling the web for market research, many of the links turn out to be PDF documents. It would be great if Auto GPT had a built-in ability to parse those PDFs and feed the text to GPT-4 for analysis.

Examples 🌈

Research on investing in Emerging Markets in 2023 --> The first few hits on Google Search are PDF documents. Auto GPT fails to parse them.

Motivation 🔦

With PDF parsing, Auto GPT could do the market research task far better than it currently can.

Boostrix · 2023-05-04T06:53:12Z

This would probably be a plugin that uses a Python PDF parsing library analogous to pdf2text (not sure how to mark/label the issue, or whether I lack the permissions to do so).

anonhostpi · 2023-05-04T07:25:10Z

Agreed with @Boostrix on this one. PDF parsing is an extraneous task, and isn't as straightforward as it ought to be. It would be better to assign that to developers who are skilled in PDF parsing.

Boostrix · 2023-05-04T07:33:25Z

There already is PR #3031 which supports plain text based PDF processing.

That would also provide the option to support arguments, such as searching a PDF file by author, date, pages, etc. (which would return a list of pages/matches).
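To illustrate the kind of argument-based search meant here, a minimal sketch follows. It is not the code from PR #3031: the names `search_pages` and `PageMatch` are hypothetical, and it assumes the per-page text has already been extracted by whatever PDF library the plugin ends up using.

```python
import re
from dataclasses import dataclass


@dataclass
class PageMatch:
    page: int      # 1-based page number
    snippet: str   # text surrounding the first match on that page


def search_pages(pages, query, context=40):
    """Return one PageMatch per page whose text contains `query`.

    `pages` is a list of strings, one per PDF page (assumed to be
    extracted beforehand). Matching is case-insensitive and literal.
    """
    pattern = re.compile(re.escape(query), re.IGNORECASE)
    matches = []
    for number, text in enumerate(pages, start=1):
        found = pattern.search(text)
        if found is None:
            continue
        # Keep a little surrounding text so the agent sees the context.
        start = max(found.start() - context, 0)
        end = min(found.end() + context, len(text))
        matches.append(PageMatch(page=number, snippet=text[start:end].strip()))
    return matches
```

Returning page numbers plus short snippets, rather than the full text, would keep the result small enough to pass back to the model; filters on metadata such as author or date could be layered on in the same way.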

A higher-level command would probably be an adaptation of browse_website, or a search specifically for PDF files using different search engines/APIs (think research servers as per #826), as per: #503 (comment)
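A rough sketch of how a browse-style command might route PDF responses to a separate extractor is shown below. The function names and the two extractor callables are hypothetical and do not reflect Auto-GPT's actual browse_website implementation; the only facts relied on are the `application/pdf` media type and the `%PDF-` header that PDF files start with.

```python
def looks_like_pdf(content_type, body):
    """Guess whether an HTTP response is a PDF.

    Checks the Content-Type header first, then falls back to the
    "%PDF-" marker near the start of the body, since servers sometimes
    send PDFs with a generic or missing content type.
    """
    media_type = (content_type or "").split(";")[0].strip().lower()
    if media_type == "application/pdf":
        return True
    return body[:1024].lstrip().startswith(b"%PDF-")


def extract_text(content_type, body, pdf_to_text, html_to_text):
    """Route a crawled response to the matching extractor.

    `pdf_to_text` and `html_to_text` are hypothetical callables that
    take the raw bytes and return plain text.
    """
    if looks_like_pdf(content_type, body):
        return pdf_to_text(body)
    return html_to_text(body)
```

Passing the extractors in as arguments keeps the routing independent of any particular PDF library, which fits the plugin approach suggested above.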

Probably covered by #2730

Plugin candidate, once the dust settles with #3652

github-actions · 2023-09-17T01:53:45Z

This issue was closed automatically because it has been stale for 10 days with no activity.

Qoyyuum added the "enhancement" (New feature or request) label on Apr 16, 2023

github-actions bot added the "Stale" label on Sep 6, 2023

github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) on Sep 17, 2023
