Add PDF reader to read_file #1353

Closed
wants to merge 2 commits into from

@jacobtohahn commented Apr 14, 2023

Background

Python makes it easy to extract text from PDF files, which allows us to improve the read_file command to support PDF documents. Thanks to andai on Discord for the idea.

Changes

The read_file command was improved to detect if a PDF file is being read and, if so, extract and return the text. Otherwise, the read_file command works as before and treats files as text.

This functionality makes use of the pdfminer.six library, and thus it was added to the requirements.txt file.
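
A minimal sketch of the dispatch described above, not the PR's actual diff. It assumes pdfminer.six's high-level `extract_text` function and an extension-based check; returning an "Error: ..." string on failure is a choice made for this sketch. The import is deferred so that plain-text reads do not require the library.

```python
def read_file(filename: str) -> str:
    """Return a file's text, extracting text from PDFs (illustrative sketch)."""
    try:
        # Extension check only; the file's contents are not inspected here.
        if filename.lower().endswith(".pdf"):
            # Assumption: pdfminer.six is installed.
            from pdfminer.high_level import extract_text
            return extract_text(filename)
        # Anything else is treated as plain text, as before.
        with open(filename, "r", encoding="utf-8") as f:
            return f.read()
    except Exception as e:
        # Sketch choice: report failures as a string instead of raising.
        return "Error: " + str(e)
```

With this shape, a missing dependency or an unreadable PDF surfaces as an error string rather than an unhandled exception.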

Documentation

Test Plan

Tested by using human feedback to instruct AutoGPT to read both a text file and a PDF file. Functionality worked as expected in both cases.

PR Quality Checklist

  • My pull request is atomic and focuses on a single change.
  • I have thoroughly tested my changes with multiple different prompts.
  • I have considered potential risks and mitigations for my changes.
  • I have documented my changes clearly and comprehensively.
  • I have not snuck in any "extra" small tweaks or changes.

@jacobtohahn (Contributor Author)

Potential improvement: check the length of the text extracted from the PDF and truncate it if it's too long.
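
One way the truncation idea could look, as a rough sketch. The helper name `truncate_text`, the `max_chars` default, and the marker string are all hypothetical, and a character count is used here only as a stand-in for whatever limit (such as tokens) would actually apply.

```python
def truncate_text(text: str, max_chars: int = 8000) -> str:
    """Cap text at max_chars, appending a marker if anything was cut."""
    if len(text) <= max_chars:
        return text
    # Hypothetical marker so the reader of the output knows it is incomplete.
    marker = "\n[... truncated ...]"
    return text[:max_chars] + marker
```

The extracted PDF text would pass through this helper before being returned from read_file.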

content = f.read()
return content
# Check if the file is a PDF and extract text if so
if filename.lower().endswith('.pdf'):

Secure way to check if it's a PDF?

def is_pdf(file_path):
    with open(file_path, 'rb') as file:
        file_header = file.read(5)
    return file_header == b'%PDF-'
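
A slightly more lenient variant of the same idea, offered as a sketch. It assumes that some PDFs carry a few stray bytes before the header and that scanning the first 1024 bytes is an acceptable tolerance; the name `looks_like_pdf` and the `scan_bytes` parameter are hypothetical. Either check only identifies the file type and does not validate that the file is safe to parse.

```python
def looks_like_pdf(file_path: str, scan_bytes: int = 1024) -> bool:
    """Return True if the PDF header marker appears near the start of the file."""
    with open(file_path, "rb") as f:
        head = f.read(scan_bytes)
    # Unlike an exact prefix match, this tolerates leading junk bytes.
    return b"%PDF-" in head
```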

@jacobtohahn closed this by deleting the head repository on Apr 15, 2023