assert xrefstream["/Type"] == "/XRef" #357

Closed
phoccavalcante opened this issue Jul 3, 2017 · 6 comments
Labels
is-bug: From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF
workflow-text-extraction: From a users perspective, text extraction is the affected feature/workflow

Comments

@phoccavalcante

I can't read one PDF file because /Type == /Page instead of /XRef, and I can't understand the reason.

<class 'PyPDF2.generic.DictionaryObject'> {'/Parent': IndirectObject(1, 0), '/Contents': IndirectObject(4, 0), '/Type': '/Page', '/Resources': IndirectObject(2, 0)}

Here is the file: http://communy.com.br/static/cobranca.pdf

Please, can anybody help me?

@guysoft

guysoft commented Jul 14, 2019

I am getting the same issue with another PDF. What can be causing this?

Also getting this with the PDF provided.

@guysoft

guysoft commented Jul 14, 2019

Workaround:
Repair the file with Ghostscript:

gs \
  -o repaired.pdf \
  -sDEVICE=pdfwrite \
  -dPDFSETTINGS=/prepress \
  corrupted.pdf
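
For anyone who wants to run this workaround from Python before handing the file to PyPDF2, here is a minimal sketch. It only wraps the same Ghostscript command with `subprocess`; the function names `build_gs_repair_command` and `repair_pdf` are made up for illustration, and it assumes a `gs` executable is on the PATH (the binary name may differ on your platform, for example on Windows).

```python
import shutil
import subprocess


def build_gs_repair_command(src, dst, gs_binary="gs"):
    """Return the argument list for the Ghostscript call shown above.

    Hypothetical helper: the flags are the ones from the workaround,
    nothing else is added.
    """
    return [
        gs_binary,
        "-o", dst,                   # output file
        "-sDEVICE=pdfwrite",         # re-write the document as PDF
        "-dPDFSETTINGS=/prepress",   # preset used in the workaround
        src,                         # input file that fails to parse
    ]


def repair_pdf(src, dst, gs_binary="gs"):
    """Run Ghostscript on src and write the rewritten PDF to dst."""
    # Assumption: Ghostscript is installed and reachable as `gs_binary`.
    if shutil.which(gs_binary) is None:
        raise FileNotFoundError(f"{gs_binary!r} not found on PATH")
    # check=True raises CalledProcessError if Ghostscript exits non-zero.
    subprocess.run(build_gs_repair_command(src, dst, gs_binary), check=True)
    return dst
```

Calling `repair_pdf("corrupted.pdf", "repaired.pdf")` should then produce a file that can be opened in place of the original, provided Ghostscript itself can read the input.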

@AzizieAbuduaini

AzizieAbuduaini commented Oct 18, 2019

I got this error when I tried to read a PDF from S3. Later I found that there was an unexpected Unicode apostrophe character in the PDF content. What I did was replace the apostrophe with "’" and then read it again, and it works fine.
So before you pass the content on:
content = text.replace("'", "’")

Then pass content to the file reader.
I'm not sure the apostrophe causes this issue, but it works for me. Please try this approach and let me know whether it works.

@MartinThoma MartinThoma added the is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF label Apr 7, 2022
@MartinThoma
Member

Can somebody create a minimal Python script that shows the issue with the shared PDF?

@MartinThoma
Member

The comment by @AzizieAbuduaini indicates that it might be related to #384

@MartinThoma MartinThoma added the workflow-text-extraction From a users perspective, text extraction is the affected feature/workflow label Apr 16, 2022
@MartinThoma
Member

As there is no PDF to check, I assume that #924 has fixed this issue. I'll release the new PyPDF2==2.1.0 today.

Please ping me if you still encounter this issue with 2.1.0 or later.
