Cannot extract_text from weasyprint generated PDF #242

Closed
mattjmorrison-imt opened this issue Jan 12, 2016 · 10 comments
Labels
Has MCVE A minimal, complete and verifiable example helps a lot to debug / understand feature requests workflow-text-extraction From a users perspective, text extraction is the affected feature/workflow

Comments

@mattjmorrison-imt

mattjmorrison-imt commented Jan 12, 2016

Generating a PDF with the following code and then calling extract_text on its first page returns no text.

"""
PyPDF2==2.1.0
WeasyPrint==55.0
"""

from io import BytesIO
from PyPDF2 import PdfReader

# Create example
from weasyprint import HTML
stream = BytesIO()
HTML(string="""
<html>
<body>
<div>Hello World</div>
</body>
</html>
""").write_pdf(stream)
stream.seek(0)

# Try to read "Hello World"
reader = PdfReader(stream)
print(reader.pages[0].extract_text())

In Kozea/WeasyPrint/issues/290, @liZe points out that other tools are able to extract the text.

@mstamy2 mstamy2 added the workflow-text-extraction From a users perspective, text extraction is the affected feature/workflow label May 19, 2016
@afedosenko

pypdf2==1.26.0
weasyprint==53.2

In these versions, the text is also not extracted for the example above.

@liZe

liZe commented Aug 30, 2021

In these versions, the text is also not extracted for the example above.

It doesn’t work because PyPDF2 doesn’t use the /Encoding and/or the /ToUnicode information included in embedded fonts. As far as I can tell, there’s no easy fix :/, it will probably require a certain amount of work.
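For readers unfamiliar with /ToUnicode: it is a CMap stream attached to the font that maps the character codes used in the page content to Unicode text. The sketch below parses the `bfchar` entries of such a stream. The CMap fragment is hand-written for illustration (it is not taken from a real WeasyPrint file), `parse_bfchar` is a hypothetical helper rather than anything in PyPDF2, and `bfrange` entries are not handled.

```python
import re

# Hypothetical, hand-written ToUnicode CMap fragment (illustration only).
SAMPLE_CMAP = """
begincmap
1 begincodespacerange
<0000> <ffff>
endcodespacerange
3 beginbfchar
<0001> <0048>
<0002> <0065>
<0003> <006c>
endbfchar
endcmap
"""


def parse_bfchar(cmap_text):
    """Return {character code: Unicode string} for every bfchar entry.

    Assumption: only bfchar sections are read; bfrange is ignored.
    """
    mapping = {}
    # Each bfchar section lists pairs: <source code> <UTF-16BE destination>.
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", section):
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping


TO_UNICODE = parse_bfchar(SAMPLE_CMAP)
print(TO_UNICODE)
```

An extractor that ignores this table has no way to turn the codes in the content stream back into readable characters, which matches the symptom reported above.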

@MartinThoma MartinThoma added the Has MCVE A minimal, complete and verifiable example helps a lot to debug / understand feature requests label Apr 18, 2022
@MartinThoma
Member

PyPDF2 now uses /Encoding and /ToUnicode. Sadly, this issue is still open.

@liZe

liZe commented Jun 6, 2022

PyPDF2 now uses /Encoding and /ToUnicode. Sadly, this issue is still open.

I can make the script work by adding this code in _extract_text:

if isinstance(op, bytes):
    process_operation(b"Tj", [op.decode('utf-16be')])

Maybe this case is not supported because WeasyPrint uses 2-byte codes for its strings. This code is probably a dirty workaround, but you’ll find the correct fix faster than me 😁.

(We get the right text, so /Encoding and /ToUnicode seem to work correctly as we always use custom encodings in WeasyPrint. 👏🎉)
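A minimal sketch of why decoding as UTF-16BE helps with 2-byte codes: it turns each pair of bytes into one character whose code point equals the 2-byte code, and that code can then be looked up in the font's mapping. The names `decode_two_byte_codes` and `HYPOTHETICAL_TO_UNICODE` are made up for this example and are not PyPDF2 internals; the fix that was eventually merged may work differently.

```python
# Hypothetical code-to-text table, standing in for a font's /ToUnicode data.
HYPOTHETICAL_TO_UNICODE = {0x0001: "H", 0x0002: "i"}


def decode_two_byte_codes(raw, to_unicode):
    """Map a string operand made of 2-byte codes to text.

    Assumption: every code is exactly 2 bytes and outside the surrogate
    range, otherwise the UTF-16BE decode raises UnicodeDecodeError.
    """
    # Each byte pair becomes one character whose code point is the code.
    codes = raw.decode("utf-16-be")
    # Unknown codes become U+FFFD so that gaps stay visible.
    return "".join(to_unicode.get(ord(ch), "\ufffd") for ch in codes)


operand = b"\x00\x01\x00\x02"  # two codes: 0x0001 and 0x0002
text = decode_two_byte_codes(operand, HYPOTHETICAL_TO_UNICODE)
print(text)
```

Reading the same operand one byte at a time would produce four codes (0, 1, 0, 2) instead of two, so the lookup would miss.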

@MartinThoma
Member

@liZe Very nice! Thanks for sharing!

@pubpub-zz You're the expert here. Do you think adding it like this is ok?

@pubpub-zz
Collaborator

pubpub-zz commented Jun 6, 2022

The problem is a little trickier than that.
Under analysis.

@MartinThoma MartinThoma changed the title Cannot extractText from weasyprint generated PDF Cannot extract_text from weasyprint generated PDF Jun 10, 2022
@pubpub-zz
Collaborator

@liZe, can you retest with the latest version and give feedback?

@liZe

liZe commented Jun 19, 2022

@liZe, can you retest with the latest version and give feedback?

👏 it works 👏

@pubpub-zz
Collaborator

@MartinThoma
Can you close it then?

@MartinThoma
Member

Very nice! Amazing work @pubpub-zz and thank you for confirming @liZe 🤗
